Comparison of LightGBM and CatBoost Algorithms for Diabetes Prediction Based on Clinical Data
DOI:
https://doi.org/10.30871/jaic.v10i1.12179
Keywords:
Diabetes Prediction, SHAP, SMOTE, LightGBM, CatBoost
Abstract
Diabetes mellitus is a global health challenge that requires accurate early detection to prevent fatal complications. However, clinical data often have imbalanced class distributions, which hinders standard prediction models from detecting positive patients. This study compares the performance of two modern gradient boosting algorithms, LightGBM and CatBoost, in predicting diabetes risk. Random Forest and Logistic Regression were included as baseline models for benchmarking. To address the class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data during preprocessing. The dataset was sourced from the Kaggle public repository (Diabetes Prediction Dataset) and comprises 100,000 patient medical records with clinical attributes such as age, body mass index (BMI), and HbA1c level. Performance was evaluated using Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC). The results showed close competition: LightGBM achieved the highest Accuracy, 97.16%, whereas CatBoost showed superior sensitivity (Recall) of 69.71% and the highest F1-Score, 80.48%. This makes CatBoost more reliable at minimizing False Negatives than LightGBM and Random Forest, while Logistic Regression showed the lowest performance. Furthermore, interpretability analysis using SHAP (SHapley Additive exPlanations) showed that HbA1c and blood glucose levels were the most dominant features in detection, which is consistent with clinical diagnostic practice. The study concludes that CatBoost combined with SMOTE offers more sensitive, transparent, and efficient diabetes prediction for medical screening.
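The gap between a very high Accuracy and a much lower Recall is typical of imbalanced data, and a small worked example may make it concrete. The sketch below is illustrative only: the confusion-matrix counts are made up and are not results from the study, and the function names are hypothetical. The last step inverts the F1 formula to show which Precision the reported CatBoost Recall and F1-Score would imply, assuming both figures refer to the positive (diabetic) class at the same decision threshold.

```python
# Illustrative sketch only: the counts below are invented for the example
# and are NOT taken from the study.

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def precision_from_f1_recall(f1: float, recall: float) -> float:
    """Invert F1 = 2PR / (P + R) to recover the precision P."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical test split: 1,000 patients, of whom 85 are diabetic.
# Most patients are negative, so true negatives dominate the accuracy.
hypothetical = classification_metrics(tp=60, fp=5, fn=25, tn=910)
for name, value in hypothetical.items():
    print(f"{name}: {value:.4f}")

# Precision implied by the Recall (69.71%) and F1-Score (80.48%) quoted
# in the abstract, under the assumption stated in the lead-in.
implied_precision = precision_from_f1_recall(f1=0.8048, recall=0.6971)
print(f"implied precision: {implied_precision:.4f}")
```

In the hypothetical split, 25 of 85 diabetic patients are missed, yet Accuracy is still 97% because the 910 correctly classified healthy patients dominate the total. This is why Recall and F1-Score are the more informative figures for screening.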
License
Copyright (c) 2026 Muhammad Sidik Latuconsina, Majid Rahardi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








