Comparison of Logistic Regression, Random Forest, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) Algorithms in Diabetes Prediction

M. Fadli Kurniawan; Dyah Ayu Megawaty

doi:10.30871/jaic.v9i5.9815

Authors

M. Fadli Kurniawan, Universitas Teknokrat Indonesia
Dyah Ayu Megawaty, Universitas Teknokrat Indonesia

Keywords:

Diabetes Prediction, Logistic Regression, Random Forest, Support Vector Machine, K-Nearest Neighbors

Abstract

Diabetes mellitus is a prevalent chronic illness whose incidence continues to grow worldwide, placing significant strain on healthcare systems. Timely prediction of diabetes is crucial for early intervention and management. This study compares the effectiveness of four machine learning algorithms, Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), in identifying diabetes cases using a large public dataset of 100,000 patient records obtained from Kaggle. The dataset includes nine clinical variables, among them age, gender, body mass index (BMI), blood glucose level, and HbA1c level. The data were class-imbalanced, with fewer than 10% positive (diabetic) cases; to address this, the Synthetic Minority Oversampling Technique (SMOTE) was applied exclusively to the training data after an 80:20 stratified split. All models were evaluated using 5-fold stratified cross-validation and compared on accuracy, precision, recall, F1-score, area under the ROC curve (AUC), and training time. Random Forest achieved the highest classification accuracy (96.88%) and AUC (99.70%), indicating the best overall performance. McNemar's tests showed that the performance differences between Random Forest and the other models were statistically significant. Feature importance analysis identified HbA1c, glucose level, and BMI as the most influential predictors. These results indicate that Random Forest offers the most balanced combination of accuracy, interpretability, and robustness among the four models, making it well suited to clinical screening scenarios where early detection of diabetes is critical.
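
The abstract reports McNemar's tests between Random Forest and each other model but does not say which variant was used. As a minimal sketch, assuming two classifiers were scored on the same held-out records, the test can be computed from the discordant pairs alone. The function name `mcnemar_test` and the toy labels below are hypothetical and are not taken from the study.

```python
from math import comb, erfc, sqrt


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test on paired predictions from two classifiers.

    Returns (a_only, b_only, exact_p, chi2_stat, chi2_p), where
    a_only counts records only model A classified correctly and
    b_only counts records only model B classified correctly.
    """
    a_only = b_only = 0
    for truth, pa, pb in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = pa == truth, pb == truth
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1

    n = a_only + b_only  # only discordant pairs carry information
    if n == 0:
        return a_only, b_only, 1.0, 0.0, 1.0

    # Exact two-sided p-value: under H0 each discordant pair is a fair coin.
    tail = sum(comb(n, k) for k in range(min(a_only, b_only) + 1)) / 2 ** n
    exact_p = min(1.0, 2.0 * tail)

    # Chi-square approximation with continuity correction (1 degree of freedom).
    chi2_stat = (abs(a_only - b_only) - 1) ** 2 / n
    chi2_p = erfc(sqrt(chi2_stat / 2.0))  # survival function of chi2 with 1 df
    return a_only, b_only, exact_p, chi2_stat, chi2_p


# Toy example (made-up labels, not the study's data):
# model A is right on all ten records, model B on only the last two.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2
result = mcnemar_test(y_true, pred_a, pred_b)
print(result)
```

The exact binomial form is usually preferred when the number of discordant pairs is small; with a 20,000-record test set, as an 80:20 split of 100,000 records would give, the chi-square approximation is normally adequate.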
