Comparison of Oversampling Techniques on Minority Data Using Imbalance Software Defect Prediction Dataset
Abstract
The dataset plays a vital role as a component of any software defect prediction model. However, the NASA software defect prediction dataset suffers from class imbalance: the minority class is underrepresented. This study compares how well oversampling techniques address that imbalance, using a total of 90 techniques consisting of SMOTE and its variants. The results indicate that no oversampling technique fully overcomes the problem. The original dataset without oversampling performs well on accuracy and F1-score but poorly on AUC and G-score. Several oversampling techniques improve AUC and G-score, but at the same time they reduce accuracy and F1-score.
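To illustrate why accuracy and G-score can diverge on imbalanced data, here is a minimal Python sketch. It assumes that G-score means the geometric mean of sensitivity and specificity, which the abstract does not define. The labels and the "always clean" classifier are hypothetical and are not the study's data or models.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were labeled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def g_score(tp, fp, tn, fn):
    """Assumed definition: sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


# Hypothetical imbalanced labels: 90 majority samples (0), 10 minority samples (1).
y_true = [0] * 90 + [1] * 10
# Hypothetical classifier that always predicts the majority class.
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
g = g_score(*counts)
print(f"accuracy={acc:.2f}, g-score={g:.2f}")  # accuracy=0.90, g-score=0.00
```

Under these assumptions, a classifier that ignores the minority class entirely still reaches 90% accuracy, while the G-score drops to zero because no minority sample is detected. This is one way a model can look strong on accuracy and weak on G-score at the same time.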
Copyright (c) 2024 Deni Hidayat, Lindung Parningotan Manik
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.