INTELLECTUAL DATA ANALYSIS IN PREDICTING STUDENT DROPOUT DURING IN-HOUSE VOCATIONAL TRAINING OF TECHNICAL PERSONNEL
1 Technical School of St. Petersburg Metro (RUSSIAN FEDERATION)
2 ITMO University (RUSSIAN FEDERATION)
3 The Herzen State Pedagogical University of Russia (RUSSIAN FEDERATION)
About this paper:
Conference name: 13th annual International Conference of Education, Research and Innovation
Dates: 9-10 November, 2020
Location: Online Conference
Abstract:
Introduction:
The growth of large cities stimulates a growing need for public transport operators, especially subway operators. Their training is financially supported by study loans, scholarships, etc. However, a significant share of students do not finish their studies. As a result, the company organizing the training suffers material losses, and the need for personnel is not fully satisfied. It is therefore necessary to predict, already at the initial stages of education, the prospects of graduation for a particular student.
Background and problem statement:
As world practice shows, vocational training of transit operators is relatively short (up to 6 months) and is organized as on-the-job training that includes theoretical instruction. The entry requirements are relatively low: only some secondary school education is mandatory, although a high school diploma is generally preferred, and there are no age restrictions for applicants.
To date, a number of studies have used educational data mining to predict student dropout [Devasia, 2016], including probabilistic methods, regression analysis, and various machine learning algorithms [Jaiswal, 2019]. However, they focus mainly on students in academic programs at universities and colleges [Alyahyan, 2020], or on the social support of students from socially disadvantaged strata [Fix, 2019]; that is, they rely on large and statistically uniform cohorts of students.
The groups formed for training transit operators are different. Each group is small, and the groups are heterogeneous in most of the parameters traditionally used as predictors of dropout [Munoz, 2019], including age, level of education, and other pedagogically significant parameters. Direct application of statistical or machine learning methods in this case is therefore either difficult or may lead to unreliable results.
The objective of the article is to build a model for identifying and evaluating the parameters that are most significant for predicting the dropout of students through the intellectual analysis of data on vocational training of transit operators. The work is carried out on the example of the specialty "Metro drivers", St. Petersburg, Russia.
Research methods:
Intellectual data analysis was carried out in several stages. At the first stage, the available data about students for the period 2015-2019 were collected and preprocessed (more than 1200 entries, 22 parameters). At the second stage, exploratory data analysis [Tukey, 1961] was applied to the resulting dataset, using histograms and correlation analysis visualized as a heat map. As a result, the number of parameters was reduced to 5, and one of the most significant turned out to be a complex attribute: the rating of the educational institution from which the student graduated. At the third stage, a logistic regression was constructed on the identified significant parameters. Because the dataset is highly imbalanced, recall, precision, and F1-score were used to assess the quality of dropout prediction.
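The third stage and its evaluation can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the two features and the 15% dropout share are hypothetical, and the plain gradient-descent fit stands in for whatever logistic regression routine was actually used. The sketch only shows why precision, recall, and F1 are computed for the minority "dropout" class rather than relying on accuracy, which would look high on an imbalanced dataset even for a model that never predicts dropout.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    # Batch gradient descent on the mean log-loss (illustrative, unregularized).
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    # Label 1 means "predicted dropout".
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
        for xi in X
    ]


def precision_recall_f1(y_true, y_pred):
    # Metrics for the positive (dropout) class; 0.0 when a ratio is undefined.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Synthetic, imbalanced data: roughly 15% dropouts (an assumed share),
# with two hypothetical standardized features shifted for the dropout class.
random.seed(0)
X, y = [], []
for _ in range(300):
    dropout = 1 if random.random() < 0.15 else 0
    x1 = random.gauss(1.5 if dropout else 0.0, 1.0)
    x2 = random.gauss(-1.0 if dropout else 0.0, 1.0)
    X.append([x1, x2])
    y.append(dropout)

# Simple hold-out split: first 200 rows for fitting, last 100 for evaluation.
w, b = fit_logistic(X[:200], y[:200])
y_pred = predict(X[200:], w, b)
precision, recall, f1 = precision_recall_f1(y[200:], y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

The numbers printed by this sketch depend entirely on the synthetic data and say nothing about the reported result. With groups as small and heterogeneous as those described above, a single hold-out split is also fragile, so a stratified or repeated split would be a reasonable refinement.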
Results and discussion:
In preliminary results, the constructed model achieved F1 = 0.82, which indicates fairly good prediction quality. To improve the model, we intend to include other data mining methods in it and to obtain feedback by taking current curriculum updates into account.
Keywords:
Vocational training, predicting student dropout, exploratory data analysis, logistic regression.