SHAPLEY VALUES TO EXPLAIN MACHINE LEARNING MODELS OF SCHOOL STUDENT’S

In this work, we analyze the influence of the distance learning format, introduced because of the COVID-19 pandemic, on school students’ academic performance. The study is based on a large dataset of school students’ grades for the 2020 academic year taken from the “Electronic education in Tatarstan Republic” system. The analysis uses machine learning methods and a feature importance technique implemented in the Python programming language. One of the priorities of this work is to identify the academic factors that have the strongest impact on school students’ performance. To solve this task, we used the Shapley values method, which is widely used for feature importance estimation and can evaluate the impact of every studied feature on the output of machine learning models. The study-related conditional factors include characteristics of teachers, types and kinds of educational organizations, the area of their location, and the subjects for which marks were obtained.


INTRODUCTION
Failure to achieve educational goals is a serious problem that negatively affects society as a whole. This problem can manifest itself most significantly during periods of drastic change, one of which was the introduction of distance learning during the COVID-19 pandemic. To quantify the influence of this event on the educational system, a variety of quantitative models based on modern statistical methods in combination with Big Data approaches can be used, as shown by Li et al. [2021].
Machine learning (ML) is a new and actively developing family of analysis methods that can "learn" from the data they receive, which allows them to perform a wide range of tasks. ML can be used to solve problems of detection, recognition, prediction, diagnostics, and optimization.
A large number of huge datasets have recently been accumulated in the educational system; they can be used to analyze and then improve the educational process, as demonstrated by Park [2020]. For example, Livieris et al. [2019] analyze a dataset describing the performance of 3716 students in Mathematics courses over the first 5 years of secondary school. They develop two semi-supervised machine learning algorithms to predict students' performance in the final examinations and then evaluate the accuracy of these methods. The authors compare the two methods with a supervised machine learning method; the semi-supervised approaches outperform it, and the final accuracy exceeds 80%.
[2021] used the well-known machine learning algorithms Logistic Regression and Support Vector Machine to predict whether a student is eligible to acquire a degree. The authors analyzed a dataset of 1460 students' final-year results and obtained models with accuracies of 99.27% and 99.72%. Nuanmeeseri et al. [2022] analyzed a dataset of the academic performance of 1650 university students. After adjusting the model's parameters, the authors achieved an accuracy of 96.98%; their model outperformed the other machine learning methods considered and can be used effectively to evaluate significant academic performance factors in a period of drastic change.
In our work, we study changes in the academic performance of entire school grades using a variety of machine learning methods, followed by feature importance analysis to identify the parameters that most affected academic performance after the introduction of the distance learning format due to the COVID-19 pandemic. Hastie et al. [2009] introduce machine learning as a set of mathematical techniques that give computer algorithms the ability to learn. This methodology is based on the input and required output of the algorithms and can automate the way humans carry out a task, as stated by Mnih et al. [2015].

MACHINE LEARNING TECHNIQUES
Ensemble methods are groups of algorithms that use several machine learning methods at once and correct each other's errors. Bostanabad et al. [2016] define supervised learning as a type of algorithm in which the method is supplied with example inputs along with the required outputs, which allows it to learn a rule that maps inputs to outputs. Bengio et al. [2013] state that in unsupervised learning, on the contrary, only the inputs are supplied, and the learning algorithm is required to determine the structure of the input and act according to unknown characteristics [10].
In this work we use the following supervised machine learning methods: Decision Tree, Gradient Boosting, K-nearest neighbors (KNN) Regressor, Lasso Regression, Linear Regression, MultiLayer Perceptron neural networks, and Support Vector Regressor, as well as the ensemble method Random Forest.
In our study, we solved the regression task of predicting Cohen's effect size, defined by Cohen [1988], based on subsets of school grades' marks in February and March and in April and May. Cohen's effect size measures the standardized difference between the mean values of two variables (Cohen [1988]).
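To make the target variable concrete, the following minimal sketch computes Cohen's effect size for two samples of marks. It assumes the common pooled-standard-deviation form of the statistic; the exact variant used in this study is not stated in the text, and the function name and sample marks are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohen_d(before, after):
    """Cohen's d for two samples, using the pooled standard deviation.

    Assumption: the pooled-variance form is used; the paper does not
    specify which variant of the effect size it applies.
    """
    n1, n2 = len(before), len(after)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    # Positive d means the second period has the higher mean.
    return (mean(after) - mean(before)) / sqrt(pooled_var)


# Hypothetical marks of one school grade in the two periods.
marks_feb_mar = [3, 4, 4, 5, 3, 4]
marks_apr_may = [4, 5, 4, 5, 4, 5]
d = cohen_d(marks_feb_mar, marks_apr_may)
print(f"Cohen's d = {d:.3f}")
```

With this sign convention, a positive value corresponds to higher marks in April and May than in February and March.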

SHAP FEATURE IMPORTANCE IMPLEMENTATION
Machine learning models are usually difficult to interpret, and it is hard to identify which features affect their output the most. The SHAP method (Shapley additive explanations) is one of the techniques used to solve this problem. It is based on cooperative game theory, as introduced by Shapley [1953], and is used to increase the transparency and interpretability of machine learning models. The absolute SHAP value shows how much a single feature affected the prediction. SHAP values can represent the local importance of a feature and how it changes with lower and higher feature values, as shown by Sahakyan et al. [2021].
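The idea behind Shapley values can be illustrated with a brute-force sketch that enumerates every coalition of features. This is not the implementation used in the study: the toy model, its coefficients, and the feature values are hypothetical, and replacing absent features with a fixed baseline is only one possible convention. Practical SHAP implementations approximate these values rather than enumerating all coalitions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Assumption: a feature that is absent from a coalition is replaced
    by its baseline value.
    """
    n = len(x)

    def value(coalition):
        # Evaluate the model with only the coalition's features taken from x.
        point = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(point)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    """Hypothetical effect-size model, for illustration only."""
    mean_mark, teacher_age, in_town = features
    return 0.5 - 0.3 * mean_mark - 0.01 * teacher_age + 0.05 * in_town * mean_mark


x = [4.2, 35.0, 1.0]         # explained instance (hypothetical)
baseline = [3.9, 45.0, 0.0]  # reference instance (hypothetical)
phi = shapley_values(toy_model, x, baseline)
for name, contribution in zip(["mean_mark", "teacher_age", "in_town"], phi):
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the prediction for the explained instance and the prediction for the baseline, which is the additivity property that SHAP relies on.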

EXPERIMENTAL DATA DESCRIPTION
In this work, we study the influence of the COVID-19 pandemic on school students' academic performance by analyzing a large dataset covering all schools in Tatarstan Republic, introduced by Ustin et al. [2022]. The dataset includes the marks of entire school grades in the main subjects for grades 2 to 11.
During preprocessing for the subsequent machine learning analysis, the initial dataset was transformed into a new dataset of features describing different parameters. These parameters included teachers' characteristics (age, sex, and educational category), the mean mark of the grade for February and March of 2020, and school characteristics (location in or out of town, region of location, organization kind and type, subject). The data was filtered to keep only school grades with at least 60 marks in the considered time periods (February and March, April and May 2020). For every row in the dataset, Cohen's effect size was calculated. Figure 1 shows histograms for certain grades that are representative of the whole dataset. It should be noted that most values of the effect size are positive, i.e., after the introduction of the distance learning format, marks generally increased.
The MultiLayer Perceptron consisted of an input layer, two hidden layers with 64 neurons, and an output layer with 1 neuron. We used ReLU as the activation function, Adam as the optimizer with a learning rate of 0.00005, and Mean Squared Error (MSE) as the loss function. Figure 2 shows the learning curves of the one-layer linear regression and the MultiLayer Perceptron.
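The filtering step from the data preparation above could look roughly like the sketch below. It assumes that the threshold of 60 refers to the number of marks a school grade has in each of the two periods; the record layout, period labels, and function name are hypothetical and are not taken from the original dataset.

```python
from collections import defaultdict

MIN_MARKS = 60  # threshold named in the text
PERIODS = ("feb_mar", "apr_may")  # hypothetical period labels


def filter_classes(records, min_marks=MIN_MARKS):
    """Keep school grades that have at least min_marks marks in both periods.

    records: iterable of (class_id, period, mark) tuples, where period
    is one of PERIODS (an assumed layout).
    """
    marks = defaultdict(lambda: {period: [] for period in PERIODS})
    for class_id, period, mark in records:
        marks[class_id][period].append(mark)
    # A school grade is kept only if every period reaches the threshold.
    return {
        class_id: by_period
        for class_id, by_period in marks.items()
        if all(len(by_period[period]) >= min_marks for period in PERIODS)
    }


# Hypothetical records: class "A" passes the filter, class "B" does not.
records = [("A", "feb_mar", 4)] * 60 + [("A", "apr_may", 5)] * 60
records += [("B", "feb_mar", 4)] * 59 + [("B", "apr_may", 4)] * 80
kept = filter_classes(records)
print(sorted(kept))
```

The effect size can then be computed for each remaining school grade from its two lists of marks.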

EVALUATION OF THE IMPORTANCE OF EXPLANATORY VARIABLES
At the second stage of our analysis, we evaluated the importance of the explanatory features for predicting the value of Cohen's effect size. As Figure 4 shows, the main influence on the prediction of Cohen's effect size is exerted by the mean value of the school grade's marks in February and March, which follows directly from the formula for the parameter d. A significant contribution to the prediction is also made by the age of teachers: its effect is usually either undefined or negative (as age increases, the value of the parameter decreases), which means that young teachers were more likely to give higher marks after the introduction of the distance learning format.
There is also a significant improvement in school marks for history and biology lessons, while for such important subjects as physics, mathematics, and Russian language, marks decreased after the introduction of distance learning. In addition, location in certain regions (Naberezhnye Chelny and Kazan's Vakhitovsky, Novo-Savinovsky, and Privolzhsky districts) also made a significant positive contribution to the value of the effect size d. Conversely, for schools located in Nizhnekamsk and the Sovetsky district of Kazan, mean marks decreased significantly. Location of a school in a town also made a positive contribution to the value of the parameter d, while location outside a town had a negative impact.
In addition, the kind of school also played a special role among the features used by the models. The most significant influence came from whether the educational organization was a secondary school, lyceum, gymnasium, or boarding school. For lyceums, gymnasiums, and boarding schools, the influence was strictly positive and increased the value of Cohen's effect size d, which means that the marks of school grades increased after distance learning was introduced there. The situation in secondary schools was different: on average, the impact of the distance learning format was mixed and did not affect academic performance in a definite direction.
The influence of all the above factors may be explained by the fact that, depending on the characteristics of teachers, subjects taught and geographical location, the approach and time of transition to a new, previously practically unused format of education varied in different schools.

CONCLUSIONS
In this paper, we analyzed the variation of academic performance in a large set of schools in Tatarstan Republic in the periods before and during distance learning caused by the COVID-19 pandemic.
We used eight different machine learning methods to solve the regression task of forecasting the value of Cohen's effect size d. We determined the values of the error function for all applied algorithms and established the school classes for which prediction is easier and those for which it is more difficult. We discovered the impact of teachers' age on the forecast of the parameter, the lessons whose marks were more significant in the studied task, and the areas of Tatarstan Republic in which a school's location increased or decreased Cohen's effect size. Moreover, we discovered that the kind of educational organization also plays a special role in the forecasting task and identified the kinds that had a significant impact on the value of Cohen's effect size. The impact of these study-related factors may indicate that different schools, school types, and teachers had different periods of adaptation to a rapidly changing learning format, and that these changes can be evaluated using a feature importance method in combination with machine learning algorithms. After appropriate verification, the results of this research may be used to evaluate the influence of the introduction of distance learning on school students' academic performance.