Comparison of Light Gradient Boosting Machine, eXtreme Gradient Boosting, and CatBoost with Balancing and Hyperparameter Tuning for Hypertension Risk Prediction on Clinical Dataset
DOI:
https://doi.org/10.30871/jaic.v9i5.10400
Keywords:
Hypertension, Prediction, LGBM, SMOTE, Feature selection
Abstract
Hypertension is a chronic, highly prevalent condition and a major contributor to cardiovascular disease, which makes early identification a crucial preventive measure. This research evaluates the performance of three boosting algorithms, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LGBM), and CatBoost, in predicting hypertension risk. A publicly accessible dataset of 4,363 samples was used. The data were preprocessed, features were selected through a voting method that combines Boruta, Recursive Feature Elimination (RFE), and SelectKBest, and class imbalance was addressed with the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN). The models were then tuned through hyperparameter optimization using GridSearchCV with Repeated Stratified K-Fold cross-validation. All three algorithms showed strong predictive performance. CatBoost performed best, with an accuracy of 0.992, precision of 0.992, recall of 0.992, F1-score of 0.992, and ROC-AUC of 0.9987. The confusion matrices confirmed that CatBoost produced fewer misclassifications than XGBoost and LGBM. Model interpretation with SHapley Additive exPlanations (SHAP) showed that the most influential factors in predicting hypertension risk are blood pressure, body mass index (BMI), overall physical activity, waist circumference, triglyceride levels, age, and LDL cholesterol levels, which is consistent with established medical knowledge. To support real-world use, the best-performing model was deployed in a user-friendly web interface that lets users predict their hypertension risk interactively.
These findings illustrate that boosting algorithms, especially CatBoost, offer an accurate, dependable, and interpretable machine learning method for creating hypertension risk prediction systems.
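The oversampling step named in the abstract can be illustrated with a minimal sketch of SMOTE-style interpolation. The abstract does not say which implementation or parameters the study used, so the function name `smote`, the neighbour count `k`, the seed, and the toy data below are all assumptions made for illustration. The idea is that each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. ADASYN follows the same interpolation idea but generates more samples around minority points that have many majority-class neighbours, which this sketch does not attempt.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    n_synthetic: number of synthetic samples to create.
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data with two features per record; not taken from the study's dataset.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training folds only, so that synthetic records do not leak into the data used for evaluation.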
License
Copyright (c) 2025 Dewi Ayu Murtiningsih, Bety Wulan Sari, Ika Nur Fajri

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) ) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








