Comparison of LightGBM and CatBoost Algorithms for Diabetes Prediction Based on Clinical Data

Authors

  • Muhammad Sidik Latuconsina, Universitas Amikom Yogyakarta
  • Majid Rahardi, Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.30871/jaic.v10i1.12179

Keywords:

Diabetes Prediction, SHAP, SMOTE, LightGBM, CatBoost

Abstract

Diabetes Mellitus presents a global health challenge that requires accurate early detection to prevent fatal complications. However, clinical data often exhibit imbalanced class distributions, which hinder standard prediction models from effectively detecting positive patients. This study compares the performance of two modern gradient boosting algorithms, LightGBM and CatBoost, in predicting diabetes risk; Random Forest and Logistic Regression were included as baseline models for benchmarking. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data during preprocessing. The dataset was sourced from the Kaggle public repository (Diabetes Prediction Dataset) and comprises 100,000 patient medical records with clinical attributes such as age, body mass index (BMI), and HbA1c level. Performance was evaluated using Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC). Experimental results showed closely matched performance: LightGBM achieved the highest Accuracy of 97.16%, while CatBoost achieved superior sensitivity (Recall) of 69.71% and the highest F1-Score of 80.48%, making it the most reliable model for minimizing False Negatives compared to LightGBM and Random Forest; Logistic Regression showed the lowest performance. Furthermore, interpretability analysis using SHAP (SHapley Additive exPlanations) revealed that HbA1c and blood glucose levels were the most dominant features in detection, confirming the model's alignment with clinical diagnosis. This study concludes that CatBoost combined with SMOTE offers a more sensitive, transparent, and efficient approach to diabetes prediction for medical screening.
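
For illustration only, the following is a minimal sketch of the pipeline the abstract describes (SMOTE applied to the training split, LightGBM and CatBoost classifiers, the five evaluation metrics, and SHAP attribution). It assumes Python with pandas, scikit-learn, imbalanced-learn, lightgbm, catboost, and shap installed, a CSV export of the Kaggle Diabetes Prediction Dataset, and a binary target column named "diabetes"; the file name, column handling, and default hyperparameters are placeholders, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): SMOTE-balanced training and
# evaluation of LightGBM vs. CatBoost, followed by SHAP feature attribution.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import shap

# Hypothetical file and column names based on the Kaggle dataset description.
df = pd.read_csv("diabetes_prediction_dataset.csv")
X = pd.get_dummies(df.drop(columns=["diabetes"]))  # one-hot encode categoricals for this sketch
y = df["diabetes"]

# Hold out a test set first; SMOTE is applied to the training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

models = {
    "LightGBM": LGBMClassifier(random_state=42),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=42),
}
for name, model in models.items():
    model.fit(X_res, y_res)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          "Acc=%.4f" % accuracy_score(y_test, pred),
          "Prec=%.4f" % precision_score(y_test, pred),
          "Rec=%.4f" % recall_score(y_test, pred),
          "F1=%.4f" % f1_score(y_test, pred),
          "AUC=%.4f" % roc_auc_score(y_test, proba))

# Global SHAP attribution for one of the tree-based models, highlighting
# the contribution of features such as HbA1c and blood glucose level.
explainer = shap.TreeExplainer(models["CatBoost"])
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```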

Published

2026-02-11

How to Cite

[1] M. S. Latuconsina and M. Rahardi, “Comparison of LightGBM and CatBoost Algorithms for Diabetes Prediction Based on Clinical Data”, JAIC, vol. 10, no. 1, pp. 1058–1065, Feb. 2026.
