ANALYZING STUDENTS’ ACADEMIC PERFORMANCE THROUGH EDUCATIONAL DATA MINING

Predicting students' performance is an important task in any educational system. Data mining techniques such as clustering, classification, and regression are therefore widely used to predict learners' behavior towards their studies. In this paper, a new student performance prediction model is introduced together with new features that have a great influence on students' academic achievement, namely student absence days in class and parents' involvement in the learning process. Considerable attention is given to the punctuality of students and to the effect of parents' participation in the learning process. This category of features is concerned with the learner's interaction with the e-learning management system. Three classifiers, Naive Bayes, Decision Tree, and Artificial Neural Network, are used to examine the effect of these features on students' educational performance. The accuracy of the proposed model improves by 10% to 15% compared to the results obtained when such features are removed.


INTRODUCTION
In the discipline of data mining and its well-known application Knowledge Discovery in Databases (KDD), one of the newly evolving fields is Educational Data Mining (EDM), which focuses on discovering useful knowledge and mining useful patterns from educational information systems such as course management systems (Moodle, Blackboard, etc.), online learning management systems, registration systems, and admissions systems, which support students at each stage of their studies from primary to higher education. Romero and Ventura (2007) proposed that the data can be obtained through traditional manual surveys. A further investigation of educational data mining (Romero, Ventura & Garcia, 2008) concluded that data can be gathered from many sources, such as the databases of academic institutes and online learning management systems. A major concern in this field is to analyze and discover meaningful rules and patterns, either to encourage students to manage their education and deliverables better and so enhance their performance, or to give educational institutes direction in maintaining policies for the betterment of students. Abu Tair and El-Halees (2012) analyzed student data by creating decision trees, deriving association and sequential mining rules, and classifying students in order to enhance their performance and support sound decisions in this research area. Romero and Ventura (2010) concluded that many data mining techniques are used to generate specific patterns, rules, classifications, and predictions to help students in the future. In this paper, a student performance model is introduced that focuses on two important features, namely parents' participation in the learning process and student absence days. The dataset is obtained from the Kalboard 360 e-learning system.
The performance model applies three classifiers, decision tree, naive Bayes, and artificial neural network, to examine the effect of these features on students' academic performance. The data for building the student performance model is obtained from http://www.kaggle.com; it is an educational dataset from an e-learning website that contains 500 records with 17 different features. We then applied three data mining algorithms. Finally, the results are evaluated using different measures.

LITERATURE REVIEW
Educational data mining is used to find potential knowledge that supports active learning through technology. E-learning is becoming one of the most important areas of research in developing countries. Many developed countries have switched their educational systems to fully or partially automated ones, which helps not only students but also teachers by making learning easier. One survey applied many data mining techniques to a course management system; it was a tutorial and case study based on the Moodle system, aimed at improving students' learning experience and their courses. Quadri and Kalyankar (2010) used the C4.5 decision tree algorithm to arrange a set of attributes in hierarchical form; this technique is used by many researchers because of its simplicity, as a set of classification rules can be derived from the tree. Some of the well-known decision tree algorithms are J48, C4.5, and CART. Murugananthan and Shiva (2016) proposed a new approach to deriving association rules for the optimal learning sequence of tutors and students using a K-means clustering algorithm. The artificial neural network is one of the most used techniques in mining educational data. It is built from neurons that are connected to each other and work together to produce the output. Arsad and Buniyamin (2013) used an artificial neural network to predict the academic progress of bachelor's degree students. Hien and Haddawy (2007) used the Naïve Bayes algorithm to predict students' final Cumulative Grade Point Average (CGPA) at the time of admission, based on their academic background. A study of students' educational behavior (Amrieh, Hamtini & Aljarah, 2015) proposed a framework that introduces a category of features called "behavioral features" and focuses on the relationship between these features and students' academic success.
The same authors (Amrieh, Hamtini & Aljarah, 2016) used this framework to examine students' progress with ensemble techniques, which enhance the overall accuracy of the results. Numerous studies have thus been conducted to predict students' performance using data mining, but few of them highlight the important features that affect students' educational performance. In this research, we use the most important category of features affecting a student's grades and overall performance.

DATASET AND DATA PREPROCESSING
The dataset used to build the proposed student performance model is acquired from https://www.kaggle.com/aljarah/xAPI-EDu-Data. It is an educational dataset collected from an e-learning system called Kalboard 360. The dataset consists of 500 student records and has 17 different features.

E-LEARNING MANAGEMENT SYSTEM
Kalboard 360 is an e-learning system that engages learners, tracks progress, and delivers targeted outcomes. Learning is significant, innovative, and interactive. Student engagement was defined by ("Kalboard 360 e-learning system", 2000) as "People ENGAGE and INTERACT with it for better understanding and effective learning. That's why the only focus of this system is on custom-made solutions. Core competency lies in their decade of experience, expertise, and creativity of the solutions". The emphasis is on delivering an inspiring and engrossing experience for students. The aim of this system is to build a world where e-learning and development matter. Its main objective is to use recent technologies to develop online learning methods for students and educational institutes, offering several customized course options that match students' demands. Compared with conventional materials such as books, PDFs, PowerPoint slides, and training manuals, the system has shifted to fully interactive activities based on e-learning procedures. A course designer prepares a fully interactive course layout, in which audio can be included, so that students can get the desired content in any format.

Feature: Description of feature (Category of features)
Gender: The student's gender, male or female. (Demographical features)
Country: The country the student belongs to. (Demographical features)
Birthplace: The student's place of birth. (Demographical features)
Parent Responsible: The parent responsible for the student, father or mother. (Demographical features)
Levels of Education: The educational stage of the student, such as high, medium, or low level.
ID of Section: The class section (A, B, or C) the student belongs to.
Course: The offered course (IT, Math, English, Arabic, Science, Quran).
Punctuality of student in the class: Number of days the student is available in class (Below-07 or Above-07).
Parent involvement: Whether the survey forms provided by tutors are answered by parents. (Participation of parents in the whole learning process)
Satisfaction of Parent: The degree of satisfaction of the parent, Positive or Negative. (Participation of parents in the whole learning process)
Group Discussions (Features concerned with student behavior while interacting with the Kalboard 360 e-learning website)
Assignments viewed by a student (Features concerned with student behavior while interacting with the Kalboard 360 e-learning website)
Once a dataset is collected, the most important task is to pre-process the data. Real data is often incomplete (inadequate attributes, missing values of interest, or only summarized data). Pre-processing is therefore applied to eliminate noise and outliers; it includes cleaning the data, transforming the data, and selecting and analyzing appropriate features.

PRE-PROCESSING DATA
Pre-processing techniques are applied to convert unstructured data into a conventional format so that it can be readily accepted and used by a data mining algorithm.

DATA CLEANING
Data cleaning is one of the major tasks in pre-processing. It is used to remove noisy and inconsistent data and to deal with incomplete values. In this work, we used a dataset of 500 records, of which 20 contain missing values in different categories; after cleaning, the final dataset contains 480 records.
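The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration only: the column names, the four sample rows, and the set of missing-value markers are assumptions made for the example and are not taken from the Kalboard 360 dataset, and the paper itself does not state which tool performed the cleaning.

```python
import csv
import io

# Hypothetical excerpt: column names and values are illustrative only,
# not the actual schema of the dataset.
RAW = """gender,absence_days,parent_survey,total_mark
M,Below-07,Yes,85
F,,No,62
M,Above-07,?,40
F,Below-07,Yes,93
"""

MISSING = {"", "?", "NA"}  # assumed markers for a missing value


def drop_incomplete(rows):
    """Keep only the records in which every field has a value."""
    return [r for r in rows if all(v.strip() not in MISSING for v in r.values())]


records = list(csv.DictReader(io.StringIO(RAW)))
clean = drop_incomplete(records)
print(len(records), "records before cleaning,", len(clean), "after")
```

Applied to the full dataset, the same filter would correspond to the reduction from 500 to 480 records reported above, provided that all 20 incomplete records are dropped rather than imputed.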

DATA TRANSFORMATION
Data transformation is applied to convert numerical values into nominal values that represent the class labels for classification. As shown in Table 2, we divide the dataset into three class intervals (lowest level, medium level, and highest level) based on the student's grade or marks.
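The transformation from numerical marks to nominal class labels might look like the following sketch. The cut-off values are placeholders chosen for illustration; the actual class intervals are those of Table 2, which are not reproduced in this text, and the function name `to_level` is hypothetical.

```python
# Hypothetical cut-offs: placeholders only, not the intervals of Table 2.
LOW_MAX = 69      # assumed upper bound of the lowest level
MEDIUM_MAX = 89   # assumed upper bound of the medium level


def to_level(mark):
    """Map a numerical mark to one of the three nominal class labels."""
    if mark <= LOW_MAX:
        return "Low"
    if mark <= MEDIUM_MAX:
        return "Medium"
    return "High"


marks = [45, 69, 70, 88, 95]
levels = [to_level(m) for m in marks]
print(levels)
```

Replacing the two constants with the real interval boundaries is the only change needed to apply the mapping to the cleaned records.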

FEATURES SELECTION AND ANALYSIS
Karegowda, Manjunath and Jayaram (2010) identified feature selection as the most important task in data pre-processing. The objective of this step is to choose an important and appropriate subset of features from the dataset, reducing the number of attributes passed to the algorithm and hence the size of the feature space, so that redundant and irrelevant data is removed. In this way, feature selection enhances the performance of the learning algorithm by improving data quality. Feature selection methods fall into two main categories: (1) wrapper-based methods and (2) filter-based methods. A filter method identifies a relevant subset of features and discards the rest. These methods rank the features using variable ranking techniques so that highly ranked features can be selected and passed to the learning algorithm. Acharya and Sinha (2014) investigated several feature ranking techniques used for feature evaluation, such as information gain and gain ratio. In our work, we applied a selection algorithm based on the gain ratio, a filter-based approach, to examine the feature scores and identify the most important features for building the students' performance model. Figure 1 shows the highly ranked features after filter-based evaluation.
As shown in Figure 1, student absence days received the highest rank, followed by the features related to parents' involvement, such as answering the survey and satisfaction with the school. Figure 1 also shows that an important subset of features is selected while the others are eliminated. The features considered in this research thus received the highest ranks, which indicates that student punctuality and parents' participation throughout the education process have a great effect on academic performance.
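To make the ranking criterion concrete, the sketch below computes the gain ratio of a nominal feature, using the standard definition: information gain divided by the split information (the entropy of the feature's own value distribution). The toy records and the feature names `absence_days` and `section_id` are invented for illustration and do not reproduce the scores of Figure 1; the paper's own ranking was obtained with its chosen tool, not with this code.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` w.r.t. `labels`, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data, illustrative only.
labels = ["Low", "Low", "Low", "Medium", "Medium", "High", "High", "Medium"]
features = {
    "absence_days": ["Above-07"] * 3 + ["Below-07"] * 5,
    "section_id": ["A", "B", "A", "B", "A", "B", "A", "B"],
}

ranked = sorted(
    ((name, gain_ratio(col, labels)) for name, col in features.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A filter-based selection then keeps the top-ranked features and drops the rest before any classifier is trained.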

METHODOLOGY
In this paper, we present a students' performance framework that uses three different classifiers to assess the subset of features affecting students' academic achievement. Figure 2 shows the primary steps of the framework. The framework begins by gathering data from the Kalboard 360 online learning management system referenced in section 3.

Figure 2. Steps of students' performance prediction model.
The data-gathering step is followed by pre-processing, which converts the gathered data into a convenient format. First, we applied data cleaning to remove irrelevant and redundant data from the dataset. Next, the numerical values are transformed into nominal values that represent the class labels for classification. To do so, we divide the dataset into three class labels (highest level, medium level, and lowest level) based on the student's total grade. At this step, the dataset contains 199 students at the lowest level, 248 at the middle level, and 33 at the highest level. Feature selection and analysis are then used to pick the list of features with the highest scores. As shown in Figure 1, we applied a selection algorithm based on the gain ratio, a filter-based approach, to examine the feature scores. Finally, the framework applies three classifiers. The classification algorithms are used to identify the features that may affect students' academic achievement. The three classifiers applied to assess students' performance are Decision Tree (DT), Naïve Bayes (NB), and Artificial Neural Network (ANN).

NAÏVE BAYES CLASSIFIER
This classifier estimates, from the training data set, the probabilities of the different attribute values for each class, and then uses these probabilities to classify new instances. Each level has an associated probability: 0.3 for the middle level, 0.27 for the low level, and 0.44 for the high level.
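The idea can be illustrated with a small Naïve Bayes classifier for nominal attributes. This is a sketch under stated assumptions, not the implementation used in the experiments: the six training rows are invented, the function names `train` and `predict` are hypothetical, and add-one (Laplace) smoothing is an assumption of the example that the paper does not mention.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values = defaultdict(set)       # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, label)][value] += 1
            values[i].add(value)
    return priors, counts, values


def predict(model, row):
    """Return the most probable class and the unnormalised score of each class."""
    priors, counts, values = model
    total = sum(priors.values())
    scores = {}
    for label, n_label in priors.items():
        score = n_label / total  # class prior
        for i, value in enumerate(row):
            # Add-one smoothing: an assumption of this sketch.
            score *= (counts[(i, label)][value] + 1) / (n_label + len(values[i]))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Toy training data, illustrative only: (absence days, parent answered survey).
rows = [
    ("Above-07", "No"), ("Above-07", "No"),
    ("Below-07", "Yes"), ("Below-07", "Yes"),
    ("Below-07", "No"), ("Above-07", "Yes"),
]
labels = ["Low", "Low", "High", "High", "Medium", "Medium"]

model = train(rows, labels)
label, scores = predict(model, ("Above-07", "No"))
print(label, scores)
```

The class with the largest product of prior and conditional probabilities is returned; the scores are not normalised, which does not affect the choice of class.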

DECISION TREE CLASSIFIER
DT is used to discover rules that characterize the data through a set of branches and thereby support decisions. It gives the best results for nominal attributes. Figure 3 shows a J48 pruned tree with 31 leaves; the size of the tree is 48.

ARTIFICIAL NEURAL NETWORK CLASSIFIER
Naser, Zaqout, Ghosh, Atallah and Alajrami (2015) used an ANN, a neural network approach, to achieve good accuracy. The ANN framework is used to generate patterns and to solve complex prediction problems. It comprises an input layer, an output layer, and a hidden layer. The input layer takes the input from the user, and the output layer sends the output to the user. The middle (hidden) layer lies between the input layer and the output layer. The neurons of the middle layer are connected only to other neurons and do not interface directly with the main user application. Patterns and results are assessed for knowledge representation.

SETTING ENVIRONMENT
The experiment is performed on a PC with 8 GB of RAM and a 5 Intel Core (2.50 GHz) processor. Arora (2012) found that the Weka tool gives good accuracy and prediction results for classification algorithms. We used Weka in our work to evaluate the proposed models and to compare results. Weka offers several test options, including training set, cross-validation, supplied test set, and percentage split. The dataset is divided into training and test sets using 10-fold cross-validation, because this option is widely used, especially when the amount of data is limited. The dataset is randomly divided into ten subsets. In the first round, Weka uses subset 1 for testing and the remaining nine subsets for training; in the second round, it uses subset 2 for testing and the other nine for training; and so on, for ten rounds in total, with a different subset held out each time. In the end, the average success rate is calculated.
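The fold rotation described above can be sketched as follows. The function `k_fold_indices` is a hypothetical helper written for this illustration, not part of Weka; it performs a plain random split and makes no attempt to balance class labels across folds, which is a simplification.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train, test) index lists for k rounds of cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)          # random division into subsets
    folds = [indices[i::k] for i in range(k)]     # k subsets of near-equal size
    for i in range(k):
        test = folds[i]                           # one subset held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(480, k=10))
for round_no, (train, test) in enumerate(splits, start=1):
    print(f"round {round_no}: {len(train)} training records, {len(test)} test records")
```

With the 480 cleaned records, each round trains on 432 records and tests on 48; the success rates of the ten rounds are then averaged.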

EVALUATION MEASURES
To evaluate the quality of the classification techniques applied in the students' academic performance model, we use four measures: accuracy, precision, recall, and F-measure. Table 3 shows the confusion matrix from which the measures in equations (1) to (4) are calculated. Yes denotes positive values and No denotes negative values; TP stands for true positives, FP for false positives, FN for false negatives, and TN for true negatives. Accuracy is the number of correct classifications divided by the total number of classifications. Recall is the proportion of correctly classified cases of a class to all cases that actually belong to that class (correctly classified plus missed). Precision is the proportion of correctly classified cases of a class to all cases assigned to that class (correctly classified plus wrongly assigned). F-measure combines precision and recall and is considered the best indicator of the relationship between them.

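$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{F-measure} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

In our case, there are three classes. Table 4 shows the classification confusion matrix based on classes A, B, and C. Recall for the considered class can be calculated as:

$$\mathrm{Recall}_A = \frac{TP_a}{TP_a + Q_{ab} + Q_{ac}}, \qquad \mathrm{Recall}_B = \frac{TP_b}{TP_b + Q_{ba} + Q_{bc}}, \qquad \mathrm{Recall}_C = \frac{TP_c}{TP_c + Q_{ca} + Q_{cb}}$$

Precision for the considered class can be calculated as:

$$\mathrm{Precision}_A = \frac{TP_a}{TP_a + Q_{ba} + Q_{ca}}, \qquad \mathrm{Precision}_B = \frac{TP_b}{TP_b + Q_{ab} + Q_{cb}}, \qquad \mathrm{Precision}_C = \frac{TP_c}{TP_c + Q_{ac} + Q_{bc}}$$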
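The per-class measures can be computed directly from a three-class confusion matrix, as in the sketch below. It assumes that rows hold the actual class and columns the predicted class, so that a row sum gives the recall denominator and a column sum gives the precision denominator. The matrix values are invented for illustration and are not those of Tables 5 to 7; the function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, classes):
    """Return overall accuracy and (precision, recall, F-measure) per class.

    matrix[i][j] is the number of records of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(classes)))
    report = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        actual = sum(matrix[i])                    # TP plus records missed (row sum)
        predicted = sum(row[i] for row in matrix)  # TP plus wrongly assigned (column sum)
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        report[name] = (precision, recall, f_measure)
    return correct / total, report


# Hypothetical confusion matrix, illustrative only.
matrix = [
    [8, 2, 0],   # actual A
    [1, 7, 2],   # actual B
    [0, 1, 9],   # actual C
]
accuracy, report = per_class_metrics(matrix, ["A", "B", "C"])
print(f"accuracy = {accuracy:.3f}")
for name, (p, r, f) in report.items():
    print(f"class {name}: precision = {p:.3f}, recall = {r:.3f}, F-measure = {f:.3f}")
```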

RESULTS
Different results are examined based on the three classification techniques applied to the student dataset to predict students' academic performance. Tables 5, 6, and 7 show the confusion matrices for the three classifiers (DT, NB, and ANN), from which the above measures are calculated for classes A, B, and C, while accuracy is calculated for the algorithm as a whole. The worked calculation gives F-measure C = 74.0. Following the same procedure, the accuracy, recall, precision, and F-measure are calculated for naïve Bayes and the artificial neural network using the data in their respective tables. The F-measure for class A is 75.0, for class B 83.5, and for class C 77.9.
Table 8 shows the results for the three data mining algorithms (ANN, NB, DT). Two sets of classification results are obtained for each algorithm: (1) results with the highly ranked features (RF), i.e. student absence days and parents' participation, and (2) results without those highly ranked features (WRF). Details of the results with the highly ranked features are given above; the results without those features can be obtained in a similar way. Table 8 shows better classification results with the highly ranked features than without them, which indicates that student punctuality in class and parents' involvement in the learning process have a great impact on students' academic success and achievements.
Observing the results in Table 8, we notice that ANN outperforms the other classification algorithms. The artificial neural network achieves 78.1% accuracy with the highly ranked features and 59.1% without them. An accuracy of 78.1% means that 375 out of 480 students are assigned to the correct class label (High, Medium, or Low) and 105 students are incorrectly classified.
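As a quick check, the reported ANN accuracy follows directly from the counts stated above:

```latex
\mathrm{Accuracy}_{\mathrm{ANN}} = \frac{375}{480} = 0.78125 \approx 78.1\%,
\qquad 480 - 375 = 105 \ \text{misclassified students}
```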

CONCLUSION
Students' academic performance is a pillar of their successful future and is becoming a major area of interest for academic institutions all over the world. The use of e-learning management systems is increasing rapidly, and many developed countries have shifted their educational systems to fully or partially automated ones, because such systems generate a huge amount of data containing hidden knowledge and patterns that can be used to help students improve their academic grades and achievements. In this research, we introduce a students' performance model with new categories of features related to students' punctuality in class and their parents' participation in the learning process. The overall performance of the prediction framework is examined with three classification algorithms: decision tree, naïve Bayes, and artificial neural network. The results show that these features have a strong impact on a student's academic success. The model provides very good accuracy when these categories of features are used, an increase of 10 to 15% compared with the results obtained when such features are removed.