Instituto Politécnico de Bragança (PORTUGAL)
About this paper:
Appears in: INTED2019 Proceedings
Publication year: 2019
Pages: 4611-4620
ISBN: 978-84-09-08619-1
ISSN: 2340-1079
doi: 10.21125/inted.2019.1140
Conference name: 13th International Technology, Education and Development Conference
Dates: 11-13 March, 2019
Location: Valencia, Spain
In the school context, one of the main metrics of institutional performance is the student dropout rate. A decrease in the number of students at a university implies a reduction in the main resources necessary for its operation, leading for example to stagnating investment in infrastructure, teaching staff, modern technologies and equipment, and other means that improve the quality of education.

The difficulty of this problem is that students at risk of dropping out must be identified as early as possible, so that measures can be adopted before they give up. It is also important to understand what reasons might lead them to drop out and to assist them in the right way. Since this is not a trivial problem, it is fundamental that institutions analyze as many parameters as possible, seeking to cover all cases that could help to identify dropout risk. Universities produce a large amount of data, but it is usually distributed across several databases and not always organized in a way that is easy to analyze. For instance, the records of online learning systems, from which we can extract data such as logins, messages and resources accessed, can be essential in identifying dropouts, and they enrich the study from the perspective of big data analytics.

This work proposes a model for the early identification of students at risk of dropping out. Each week, the academic data generated by the university is extracted and machine learning techniques are applied to classify each student as dropout or not dropout, using training data from previous years limited to the same week. Over the school year, we count how many times each student was classified as dropout and use this count to derive the student's critical rate. With this rate we produce a ranking of need, allowing institutions to target their resources in order of criticality, minimizing both their expenses and the impact of the model's own errors.
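The critical rate and the resulting ranking can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this example assumes the rate is the fraction of weeks in which a student was classified as dropout; the function names and the tuple layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def critical_rates(weekly_predictions):
    """Return {student_id: critical rate}.

    weekly_predictions: iterable of (student_id, week, is_dropout) tuples,
    one per student per week. ASSUMPTION: the critical rate is the share of
    weeks in which the student was classified as dropout; the paper's exact
    definition may differ.
    """
    flagged = defaultdict(int)  # weeks classified as dropout
    total = defaultdict(int)    # weeks with any classification
    for student_id, _week, is_dropout in weekly_predictions:
        total[student_id] += 1
        flagged[student_id] += int(is_dropout)
    return {s: flagged[s] / total[s] for s in total}


def rank_by_need(rates):
    """Order students from highest to lowest critical rate (ties by id)."""
    return sorted(rates, key=lambda s: (-rates[s], s))


# Hypothetical weekly classifier outputs for three students over two weeks.
predictions = [
    ("a", 1, True), ("a", 2, True),
    ("b", 1, False), ("b", 2, True),
    ("c", 1, False), ("c", 2, False),
]
rates = critical_rates(predictions)
ranking = rank_by_need(rates)
```

Ranking by a rate accumulated over many weeks, rather than by a single weekly label, means that one isolated misclassification moves a student only slightly in the ordering.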

As a case study, we use a higher education institution in Portugal, which provided three datasets. The first contains basic student information, such as grades, the number of subjects passed and failed, and the number of school years, among others. The second contains classroom attendance records, and the third contains the records generated by the Sakai virtual learning environment used by the institution. The data covers the years 2009 to 2017, totalling 200 million records and approximately 50 gigabytes.

Since the data spans a considerable range of years and can differ markedly from one year to the next, we used this volume to identify the best training period for the algorithm, and found that a period of 4 years proved the most efficient for the proposed model. We then applied our technique to the years 2013, 2014, 2015 and 2016, each with the previous 4 years as training data, and calculated the critical rate for every week. This yielded a new training parameter, the critical rate itself, which was applied to obtain the results for 2017, achieving a better result than before. The model can still be extended with more parameters and tends to produce better results over the years as its own critical rate improves.
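The 4-year training scheme can be sketched as a rolling window over the available years. This is only an illustration: the record layout (dicts with `year` and `week` keys) and the function names are hypothetical, and the week filter reflects one reading of "limited to the same week", namely that only data up to and including the current week is used.

```python
def training_years(target_year, window=4, first_available=2009):
    """Return the `window` years immediately preceding target_year."""
    years = list(range(target_year - window, target_year))
    if years[0] < first_available:
        raise ValueError(f"not enough history before {target_year}")
    return years


def split_for_week(records, target_year, week, window=4):
    """Split records into training and evaluation sets for one week.

    ASSUMPTION: both sets only contain data observed up to and including
    `week`, so the classifier never sees information from later in the year.
    """
    train_set = set(training_years(target_year, window))
    train = [r for r in records
             if r["year"] in train_set and r["week"] <= week]
    evaluate = [r for r in records
                if r["year"] == target_year and r["week"] <= week]
    return train, evaluate


# Training windows for the four evaluation years named in the abstract.
windows = {year: training_years(year) for year in (2013, 2014, 2015, 2016)}
```

With data starting in 2009, 2013 is the first year for which a full 4-year window exists, which is consistent with the evaluation years listed above.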
Keywords: Machine learning, educational data mining, big data, student dropout, predictive model.