Comparison of Oversampling Techniques on Minority Data Using Imbalance Software Defect Prediction Dataset
Abstract
The dataset is a vital component of any Software Defect Prediction model. However, the NASA Software Defect Prediction datasets are imbalanced: the minority class is underrepresented. This study compares the performance of oversampling techniques in addressing this imbalance. A total of 90 oversampling techniques, consisting of SMOTE and its variants, were used. The results indicate that no oversampling technique fully overcomes the imbalance problem. The original dataset without oversampling performs well on accuracy and F1-score but poorly on AUC score and G-score. Several oversampling techniques improve AUC score and G-score, but at the same time reduce accuracy and F1-score.
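The trade-off described in the abstract can be illustrated with a small sketch. The abstract does not define G-score, so the code below assumes the common definition: the geometric mean of sensitivity and specificity. The labels and predictions are hypothetical toy data, not results from this study, and the function names are illustrative only.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary task (positive = defective)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p != positive:
            tn += 1
        elif t != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def g_mean(tp, tn, fp, fn):
    """Assumed G-score: sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy labels: 90 non-defective (0) and 10 defective (1) modules.
y_true = [0] * 90 + [1] * 10

# A majority-biased model: no false alarms, but only 1 of 10 defects found.
y_biased = [0] * 90 + [1] * 1 + [0] * 9

# A model that finds 8 of 10 defects at the cost of 20 false alarms.
y_balanced = [1] * 20 + [0] * 70 + [1] * 8 + [0] * 2

for name, y_pred in [("biased", y_biased), ("balanced", y_balanced)]:
    counts = confusion_counts(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(*counts):.3f}, g_mean={g_mean(*counts):.3f}")
```

In this toy setting the majority-biased predictions score higher on accuracy but much lower on the assumed G-score, while the second set of predictions shows the reverse. This mirrors the pattern reported in the abstract, though only as an illustration under the stated assumptions.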
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who have published an article in JAIC agree that:
1. The article has not been published previously in any other scientific journal, proceedings, or electronic journal.
2. Once submitted, full rights to the article belong to the management of JAIC, Politeknik Negeri Batam.
3. The article may be shared with the public to increase references to and citations of the published manuscript.