Comparison of Data Normalization Techniques on KNN Classification Performance for Pima Indians Diabetes Dataset
DOI: https://doi.org/10.30871/jaic.v9i3.9353

Keywords: Data Normalization, K-Nearest Neighbors, Diabetes Classification, Min-Max Scaling, Z-Score Scaling

Abstract
This study compares data normalization techniques for the K-Nearest Neighbors (KNN) model in diabetes classification using the Pima Indians Diabetes dataset. Three normalization techniques were evaluated: Min-Max Scaling, Z-Score Scaling, and Decimal Scaling. After preprocessing (handling missing values and removing duplicates) and feature selection with the Random Forest method, the SkinThickness, Insulin, Pregnancies, and BloodPressure features were removed. Performance was evaluated using accuracy, precision, recall, F1-Score, specificity, and ROC AUC. The results show that Min-Max Scaling yields a significant improvement on all metrics, achieving the highest accuracy of 0.8117 and ROC AUC of 0.8050. Z-Score Scaling performs well but falls short of Min-Max Scaling, while Decimal Scaling shows the lowest performance. Paired t-tests show significant differences between Min-Max Scaling and no normalization on all metrics (p-value < 0.05), whereas Z-Score Scaling and Decimal Scaling are significant on only some metrics, with p-values of 0.08363 and 0.43839, respectively, for accuracy and ROC AUC. Overall, Min-Max Scaling proved to be the best normalization method for improving KNN performance in diabetes classification.
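The three normalization schemes compared above are simple per-feature transforms. A minimal NumPy sketch (the function names and the toy feature values are illustrative, not taken from the paper):

```python
import numpy as np

def min_max_scale(x):
    """Min-Max Scaling: x' = (x - min) / (max - min), maps each feature to [0, 1]."""
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn)

def z_score_scale(x):
    """Z-Score Scaling: x' = (x - mean) / std, gives each feature mean 0 and std 1."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def decimal_scale(x):
    """Decimal Scaling: x' = x / 10^j, smallest j such that max|x'| < 1 per feature."""
    j = np.floor(np.log10(np.abs(x).max(axis=0))) + 1
    return x / 10.0 ** j

# Toy rows with Glucose- and BMI-like magnitudes (illustrative only)
X = np.array([[148.0, 33.6],
              [85.0, 26.6],
              [183.0, 23.3]])

print(min_max_scale(X).round(3))  # each column now spans [0, 1]
```

Because KNN relies on distance computations, features with larger raw ranges would otherwise dominate the neighbor search, which is why such rescaling matters. In practice the scaler parameters (min/max, mean/std) should be fit on the training split only and then applied to the test split, e.g. via scikit-learn's `MinMaxScaler` or `StandardScaler` inside a `Pipeline`, to avoid data leakage.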
License
Copyright (c) 2025 Yohanes Dimas Pratama, Abu Salam

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).