Comparison of Oversampling Techniques on Minority Data Using Imbalance Software Defect Prediction Dataset

  • Deni Hidayat Universitas Nusa Mandiri
  • Lindung Parningotan Manik Badan Riset dan Informatika Nasional

Abstract

The dataset is a vital component of any software defect prediction model. However, the NASA software defect prediction datasets suffer from class imbalance, with the minority class under-represented. This study compares the performance of oversampling techniques in addressing this imbalance. A total of 90 oversampling techniques, namely SMOTE and its variants, were evaluated. The results indicate that no single oversampling technique fully overcomes the problem. The original dataset without oversampling performs well on accuracy and F1-score but poorly on AUC and G-score. Several oversampling techniques improve AUC and G-score, but at the same time they reduce accuracy and F1-score.
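The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `smote` function is a simplified version of the basic SMOTE interpolation idea, the function and parameter names are hypothetical, and the G-score is assumed here to be the geometric mean of recall and specificity, which may differ from the definition used in the paper.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: create n_new synthetic points, each lying on the
    segment between a minority sample and one of its k nearest minority
    neighbours. Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def scores(tp, fn, fp, tn):
    """Return (accuracy, G-score) from confusion-matrix counts.
    G-score is assumed to be sqrt(recall * specificity)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, math.sqrt(recall * specificity)


# Four made-up minority samples with two features each
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6)

# A classifier that labels every module as non-defective on a 90/10 split
accuracy, g_score = scores(tp=0, fn=10, fp=0, tn=90)
```

In this toy case the always-majority classifier reaches an accuracy of 0.9 while its G-score is 0, which is the kind of gap between accuracy-style and balance-aware metrics that the abstract reports for the original dataset.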

Published
2024-11-13
How to Cite
HIDAYAT, Deni; MANIK, Lindung Parningotan. Comparison of Oversampling Techniques on Minority Data Using Imbalance Software Defect Prediction Dataset. JOURNAL OF APPLIED INFORMATICS AND COMPUTING, [S.l.], v. 8, n. 2, p. 472-477, nov. 2024. ISSN 2548-6861. Available at: <http://704209.wb34atkl.asia/index.php/JAIC/article/view/8605>. Date accessed: 28 nov. 2024. doi: https://doi.org/10.30871/jaic.v8i2.8605.
Section
Articles