Stroke Risk Classification Using the Ensemble Learning Method of XGBoost and Random Forest
DOI:
https://doi.org/10.30871/jaic.v9i3.9528
Keywords:
Ensemble Learning, Random Forest, SMOTE-ENN, Stroke Risk Classification, XGBoost
Abstract
Stroke is a leading cause of death and disability worldwide. This study proposes a stroke risk classification model based on ensemble learning that combines the Random Forest and XGBoost algorithms. The Kaggle dataset used contains 5110 samples (249 stroke, 4861 non-stroke) and is therefore heavily imbalanced. A comprehensive preprocessing pipeline was implemented, including feature encoding, feature scaling, feature selection using the ANOVA F-test, outlier handling with the Z-score and IQR methods, and missing value imputation using MICE. The SMOTE-ENN approach was applied to handle the class imbalance, resulting in a more balanced sample distribution. The dataset was split into 80% training and 20% testing data (hold-out test) to ensure objective evaluation. Hyperparameters were tuned using Bayesian optimization, and models were evaluated with stratified K-fold cross-validation to guard against overfitting. Validation on the hold-out test set showed strong ensemble performance, with an AUC of 0.99, 98% accuracy, 98% precision, and 98% recall. Feature importance analysis identified average glucose level and age as the strongest predictors of stroke risk. The proposed approach improved predictive accuracy compared with previous research, demonstrating the effectiveness of ensemble learning and preprocessing methods in developing reliable, high-performing machine learning models for early stroke risk assessment.
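The class counts and the 80/20 hold-out split described in the abstract can be made concrete with a small worked example. The sketch below is illustrative only and is not the authors' code: it uses just the counts stated above (249 stroke, 4861 non-stroke), and the helper names `stratified_holdout` and `accuracy_and_recall`, the seed, and the per-class rounding rule are assumptions introduced here. It shows why accuracy alone is a weak metric on this dataset: a trivial model that always predicts "non-stroke" is already about 95% accurate on a stratified test set while detecting no stroke cases.

```python
import random

# Class counts as stated in the abstract.
N_STROKE, N_NON_STROKE = 249, 4861


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test, sampling test_fraction from each class."""
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # assumed rounding rule
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall for the positive class labelled 1)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall


# Synthetic label vector with the stated class counts (1 = stroke).
labels = [1] * N_STROKE + [0] * N_NON_STROKE
train_idx, test_idx = stratified_holdout(labels)
y_test = [labels[i] for i in test_idx]

# Majority-class baseline: always predict "non-stroke".
baseline_pred = [0] * len(y_test)
acc, rec = accuracy_and_recall(y_test, baseline_pred)

print(f"test size: {len(y_test)}, stroke cases in test: {sum(y_test)}")
print(f"baseline accuracy: {acc:.3f}, baseline recall: {rec:.3f}")
```

Under these assumptions the test set holds 1022 samples, 50 of them stroke cases, and the baseline reaches an accuracy of roughly 0.951 with a recall of 0. This is the gap that resampling methods such as SMOTE-ENN and metrics such as recall and AUC are meant to expose and close.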
License
Copyright (c) 2025 Gullam Almuzadid, Egia Rosi Subhiyakto

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).