Stroke Risk Classification Using the Ensemble Learning Method of XGBoost and Random Forest

Gullam Almuzadid; Egia Rosi Subhiyakto

doi:10.30871/jaic.v9i3.9528

Authors

Gullam Almuzadid, Universitas Dian Nuswantoro
Egia Rosi Subhiyakto, Universitas Dian Nuswantoro

Keywords:

Ensemble Learning, Random Forest, SMOTE-ENN, Stroke Risk Classification, XGBoost

Abstract

Stroke is a leading cause of death and disability worldwide. This study proposes a stroke risk classification model based on ensemble learning that combines the Random Forest and XGBoost algorithms. The Kaggle dataset used contains 5110 samples (249 stroke, 4861 non-stroke), so stroke cases make up only about 4.9% of the data, a severe class imbalance. To address this, a preprocessing pipeline was implemented that includes feature encoding, feature scaling, feature selection using the ANOVA F-test, outlier handling with the Z-Score and IQR methods, and missing value imputation using MICE. SMOTE-ENN was applied to handle the class imbalance, producing a more balanced sample distribution. The dataset was split into 80% training and 20% testing data (a hold-out test set) to ensure objective evaluation. Hyperparameters were tuned with Bayesian optimization, and the models were evaluated with stratified K-fold cross-validation to guard against overfitting. On the hold-out test set, the ensemble model achieved an AUC of 0.99, 98% accuracy, 98% precision, and 98% recall. Feature importance analysis identified average glucose level and age as the strongest predictors of stroke risk. The proposed approach improved predictive accuracy compared to previous research, demonstrating the effectiveness of combining ensemble learning with careful preprocessing when building reliable machine learning models for early stroke risk assessment.
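
To make the SMOTE-ENN step mentioned in the abstract more concrete, the following is a minimal, standard-library-only Python sketch of the general technique on toy two-dimensional data. It is not the authors' implementation, which the abstract does not describe. The function names (k_nearest, smote, enn), the toy data, the choice of k = 3, the random seed, and the majority-vote cleaning rule are all assumptions made for illustration. In practice a library such as imbalanced-learn (its SMOTEENN class) would normally be used instead.

```python
import math
import random
from collections import Counter


def k_nearest(idx, X, k):
    """Indices of the k rows of X closest to X[idx] (Euclidean), excluding idx."""
    others = [j for j in range(len(X)) if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]


def smote(X, y, minority, k=3, rng=None):
    """SMOTE for a binary problem: add interpolated minority points until balanced."""
    rng = rng or random.Random(0)
    X_min = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(X_min)) - len(X_min)  # majority count minus minority count
    new_X, new_y = list(X), list(y)
    for _ in range(n_needed):
        i = rng.randrange(len(X_min))             # pick a minority sample
        j = rng.choice(k_nearest(i, X_min, k))    # pick one of its minority neighbours
        gap = rng.random()                        # position along the segment i -> j
        synthetic = tuple(a + gap * (b - a) for a, b in zip(X_min[i], X_min[j]))
        new_X.append(synthetic)
        new_y.append(minority)
    return new_X, new_y


def enn(X, y, k=3):
    """Edited Nearest Neighbours: drop samples whose label disagrees with the
    majority vote of their k nearest neighbours (assumed cleaning rule)."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in k_nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced data (hypothetical): 40 majority points near (0, 0),
# 6 minority points near (3, 3).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(6)]
y = [0] * 40 + [1] * 6

X_over, y_over = smote(X, y, minority=1, k=3, rng=rng)  # oversample first
X_res, y_res = enn(X_over, y_over, k=3)                 # then clean noisy points

print("original:   ", dict(Counter(y)))
print("after SMOTE:", dict(Counter(y_over)))
print("after ENN:  ", dict(Counter(y_res)))
```

The sketch oversamples first and cleans second, which is the usual order for SMOTE-ENN: SMOTE balances the classes with synthetic minority points, and ENN then removes samples of either class that sit in neighbourhoods dominated by the other class. The final class counts therefore need not be exactly equal, which matches the abstract's wording of a "more balanced" rather than perfectly balanced distribution.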
