Improving Classification in Imbalanced Educational Datasets using Over-sampling
Abstract
Learning Analytics (LA) involves a growing range of methods for understanding and optimizing learning and the environments in which it occurs. Different Machine Learning (ML) algorithms, or learning classifiers, can be used to implement LA, with the goal of predicting learning outcomes and classifying the data into predetermined categories. Many educational datasets are imbalanced: the number of samples in one category is significantly larger than in the other categories. Ordinarily, it is ML's performance on the minority categories that matters most. Because most ML classification algorithms ignore the minority categories and in turn perform poorly on them, learning from imbalanced datasets is challenging. To address this challenge and to improve the performance of different classifiers, the Synthetic Minority Over-sampling Technique (SMOTE) is used to oversample the minority categories. In this paper, seven well-known classifiers are compared on accuracy under 5-fold and 10-fold cross-validation and on F1-score. The imbalanced dataset, collected from self-regulated learning activities, contains the learning behaviour of 6,423 medical students who used a web-based study platform, Hypocampus, with different educational topics for one year. In addition, two diagnostic tools, Area Under the Receiver Operating Characteristic (AUC-ROC) curves and Precision-Recall (PR) curves, are applied to evaluate the predicted probabilities of an observation belonging to each category in a classification problem. These diagnostic tools may help LA researchers decide how to deal with imbalanced educational datasets. Our experimental results show that, with SMOTE applied, the Neural Network achieves the highest accuracy and performance of the classifiers compared: 92.77% in 5-fold cross-validation, 93.20% in 10-fold cross-validation, and an F1-score of 0.95. The probability of detection of the different classifiers also improves significantly with SMOTE.
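To make the over-sampling step concrete, the following is a minimal sketch of the general SMOTE idea: each synthetic sample is interpolated between a minority sample and one of its nearest minority neighbours. It is not the paper's implementation. The function name `smote`, its parameters, and the toy data are assumptions chosen for illustration, and the Hypocampus dataset is not reproduced here.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Illustrative sketch only; `minority` is a list of equal-length numeric tuples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical values): 3 minority samples against 8 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
majority_count = 8
new_points = smote(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_points
```

After the call, the minority category has as many samples as the majority category, and every synthetic point lies on a line segment between two original minority samples. In an evaluation like the one described above, over-sampling would normally be applied only to the training folds of each cross-validation split so that synthetic points do not leak into the test data.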
Published
2020-11-23
Conference Proceedings Volume
Section
Articles
How to Cite
Improving Classification in Imbalanced Educational Datasets using Over-sampling. (2020). International Conference on Computers in Education, 278-283. https://library.apsce.net/index.php/ICCE/article/view/3931