SECONDARY SCHOOL STUDENT PERFORMANCE PREDICTION USING ENTITY EMBEDDING OF CATEGORICAL VARIABLES
1 Eötvös Loránd University (HUNGARY)
2 Trnava University in Trnava Faculty of Education (SLOVAKIA)
About this paper:
Appears in: INTED2021 Proceedings
Publication year: 2021
Pages: 10016-10020
ISBN: 978-84-09-27666-0
ISSN: 2340-1079
doi: 10.21125/inted.2021.2094
Conference name: 15th International Technology, Education and Development Conference
Dates: 8-9 March, 2021
Location: Online Conference
Abstract:
Educational data mining (EDM) is an emerging discipline that is concerned with developing methods for investigating data that come from educational settings. Those methods are used to better understand students and the settings in which they learn. EDM transforms raw data collected by educational systems into useful information that can be utilized to make informed decisions and answer research questions.

Data mining, as a main branch of EDM, involves the use of data analysis tools to discover previously unknown patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning (ML) methods. These techniques can discover information within the data that queries and reports cannot effectively reveal.

Data come in different forms, such as structured and unstructured. Structured data resemble relational database tables and consist of columns of mixed data types. EDM comes generally fall into one of the four broad categories:
Almost all ML models require numerical input, but data often arrive in other forms, such as nominal categorical values. For instance, a student's "gender" attribute can take two string values: "male" and "female". Categorical data must therefore be preprocessed into numerical data. Integer encoding, which assigns an integer to each string value, is appropriate only when the values have a natural order. For example, for a "size" column with the values "small", "medium", and "large", encoding "small" as 1, "medium" as 2, and "large" as 3 is fine because the values form a sequence. By contrast, for a "colour" variable with the string values "red", "green", and "blue", integer encoding would assign 1 to "red", 2 to "green", and 3 to "blue". The colour variable has no ordinal relationship, so this encoding imposes an artificial order that misleads the model.
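
The contrast between the "size" and "colour" examples can be illustrated with a minimal Python sketch. The helper `integer_encode` and the category lists are assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical helper: maps each string to its 1-based position in `categories`.
def integer_encode(values, categories):
    index = {cat: i + 1 for i, cat in enumerate(categories)}
    return [index[v] for v in values]

# Ordinal attribute: the list order reflects a real ordering of the values.
sizes = integer_encode(["small", "large", "medium"],
                       ["small", "medium", "large"])
print(sizes)  # [1, 3, 2]

# Nominal attribute: the list order is arbitrary, yet the codes still
# imply red < green < blue, which a model may treat as meaningful.
colours = integer_encode(["red", "blue", "green"],
                         ["red", "green", "blue"])
print(colours)  # [1, 3, 2]
```

Both columns receive the same codes, but only for "size" does the numeric order correspond to anything in the data.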

To avoid imposing an order on nominal values, one-hot encoding is usually used. This technique adds a new binary variable for each unique value, so a single column becomes a sparse matrix. Hence, one-hot encoding increases the dimensionality of the dataset.
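
As a rough illustration of one-hot encoding, the following Python sketch expands one nominal column into one binary column per unique value. The function name `one_hot_encode` and the alphabetical column order are illustrative assumptions, not part of the paper.

```python
# Hypothetical helper: returns the column order and one binary row per value.
def one_hot_encode(values):
    categories = sorted(set(values))          # one new column per unique value
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)           # mostly zeros, hence sparse
        row[position[v]] = 1                  # a single 1 marks the category
        rows.append(row)
    return categories, rows

categories, rows = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

One column with three distinct values becomes three columns; an attribute with many distinct values would widen the dataset accordingly.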

A recently introduced method, entity embedding of categorical variables, maps similar values close to each other in the embedding space.
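
The idea can be sketched in Python as a lookup table that assigns each category a short dense vector. The attribute values and the numbers in `W` below are made up for illustration; in practice the vectors are learned during network training. The sketch also shows that the lookup gives the same result as multiplying a one-hot vector by the table, which is why the one-hot matrix never needs to be built.

```python
import math

# Hypothetical categories of a nominal attribute, and a made-up 2-dimensional
# embedding table (one row per category). Real values would be learned.
CATEGORIES = ["teacher", "lecturer", "farmer"]
W = [[0.9, 0.1],
     [0.8, 0.2],
     [-0.7, 0.6]]

def embed(value):
    """Embedding lookup: select the row of W that belongs to `value`."""
    return W[CATEGORIES.index(value)]

def one_hot_times_matrix(value):
    """Same result, computed as one-hot vector times W."""
    one_hot = [1 if c == value else 0 for c in CATEGORIES]
    return [sum(one_hot[i] * W[i][j] for i in range(len(W)))
            for j in range(len(W[0]))]

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# With these made-up vectors, "teacher" lies closer to "lecturer" than to "farmer".
near = distance(embed("teacher"), embed("lecturer"))
far = distance(embed("teacher"), embed("farmer"))
```

Each category costs two numbers here regardless of how many categories exist, whereas a one-hot row grows with the number of unique values.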

This study uses a dataset from Portuguese schools that covers two core subjects (Mathematics and Portuguese). The paper addresses the prediction of student performance using past school grades (first and second periods) together with demographic, social and other school-related data. We first encode the nominal attributes using the embedding technique. We then combine the embedded features with the numerical ones and feed them to a deep neural network to predict the students' performance. Prediction accuracy is expected to improve; in addition, entity embedding reduces memory usage and speeds up neural networks compared with one-hot encoding.
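
The pipeline described above (embed the nominal attributes, concatenate them with the numerical features, pass the result through a network) might look roughly like the following Python sketch. Everything specific in it is an assumption: the attribute names, the category values, the embedding sizes, the single hidden layer, and the random untrained weights. The abstract does not state the architecture, so the output is not a meaningful prediction.

```python
import random

random.seed(0)  # reproducible made-up weights

# Hypothetical nominal attributes: name -> (categories, embedding size).
EMBEDDING_SPECS = {
    "school": (["A", "B"], 1),
    "guardian": (["mother", "father", "other"], 2),
}

def init_table(categories, dim):
    """Random embedding table; training would adjust these values."""
    return {c: [random.uniform(-0.1, 0.1) for _ in range(dim)]
            for c in categories}

tables = {name: init_table(cats, dim)
          for name, (cats, dim) in EMBEDDING_SPECS.items()}

def build_input(nominal, numeric):
    """Concatenate the embedded nominal features with the numeric ones."""
    vector = []
    for name in EMBEDDING_SPECS:
        vector.extend(tables[name][nominal[name]])
    vector.extend(numeric)
    return vector

def dense(inputs, weights, bias):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

student = {"school": "A", "guardian": "mother"}
grades = [12.0, 13.0]             # made-up first- and second-period grades
x = build_input(student, grades)  # 1 + 2 embedded values + 2 grades = 5 inputs

hidden_w = [[random.uniform(-0.5, 0.5) for _ in x] for _ in range(4)]
out_w = [[random.uniform(-0.5, 0.5) for _ in range(4)]]
hidden = relu(dense(x, hidden_w, [0.0] * 4))
prediction = dense(hidden, out_w, [0.0])[0]  # untrained, illustrative only
```

In a real implementation the embedding tables and layer weights would be trained jointly, typically with a deep learning framework rather than hand-written layers.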
Keywords:
Secondary school student, performance prediction, data analysis, educational data mining.