Comparison of Oversampling Techniques on Minority Data Using Imbalance Software Defect Prediction Dataset

Deni Hidayat; Lindung Parningotan Manik

doi:10.30871/jaic.v8i2.8605

Authors

Deni Hidayat, Universitas Nusa Mandiri
Lindung Parningotan Manik, Badan Riset dan Informatika Nasional

Keywords:

Software Defect Prediction, Oversampling, SMOTE

Abstract

The dataset is a vital component of any software defect prediction model. However, the NASA software defect prediction dataset is imbalanced: the minority class is heavily underrepresented. This study compares how well oversampling techniques address this imbalance. A total of 90 oversampling techniques, consisting of SMOTE and its variants, were evaluated. The results indicate that no oversampling technique fully overcomes the imbalance problem. The original dataset without oversampling performs well on accuracy and F1-score but poorly on AUC and G-score. Several oversampling techniques improve AUC and G-score, but at the same time they reduce accuracy and F1-score.
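The trade-off described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define the G-score, so the code assumes it is the geometric mean of sensitivity and specificity, and the function names `smote`, `smote_point`, and `scores` are hypothetical. The oversampler shows only the basic SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not any of the 90 variants compared in the study.

```python
import math
import random


def smote_point(x, neighbor, rng):
    """Create one synthetic sample on the segment between x and neighbor."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def smote(minority, n_new, k=2, seed=0):
    """Basic SMOTE-style oversampling of a small list of minority vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: math.dist(x, p))
        synthetic.append(smote_point(x, rng.choice(others[:k]), rng))
    return synthetic


def scores(tp, fn, fp, tn):
    """Accuracy and G-mean (assumed meaning of G-score) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, math.sqrt(sensitivity * specificity)


# Hypothetical data: 90 majority and 10 minority samples, with a model
# that assigns every sample to the majority class.
accuracy, g_mean = scores(tp=0, fn=10, fp=0, tn=90)
print(accuracy, g_mean)  # 0.9 0.0

# Three hypothetical minority points, oversampled with five synthetic ones.
new_points = smote([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]], n_new=5)
print(len(new_points))  # 5
```

In this made-up example, a model that ignores the minority class still reaches 0.9 accuracy while its G-mean is 0. This shows how accuracy alone can look good on imbalanced data even when a class-sensitive metric such as the G-mean is low.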
