Interpretable Machine Learning with SHAP and XGBoost for Lung Cancer Prediction Insights
Abstract
Lung cancer remains one of the leading causes of death worldwide, and early detection through accurate and reliable methods is essential to improving patient prognosis. This study proposes a lung cancer classification model that integrates XGBoost with SHapley Additive exPlanations (SHAP) and uses Random Over Sampling (ROS) to address class imbalance. With hyperparameters optimized via Optuna, the resulting model demonstrated strong performance in a 10-fold cross-validation evaluation: an average accuracy of 96.84%, precision of 99.23%, recall of 94.51%, F1-score of 96.74%, specificity of 99.17%, and AUC of 96.84%. SHAP analysis provided interpretability, identifying key features such as gender, smoking habits, and the physical sign of yellow fingers as the factors that most influence the model's predictions. These results indicate that the proposed model is not only accurate but also interpretable, supporting better clinical decision-making in lung cancer diagnosis.
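The ROS step named above duplicates minority-class samples at random until the classes are balanced, before the classifier is trained. As a minimal sketch of that component (not the authors' code; the function name and toy data are illustrative, using only NumPy):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Balance classes by resampling minority-class rows with replacement."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # keep every original row, then draw extra copies to close the gap
        extra = rng.choice(c_idx, size=n_max - c_idx.size, replace=True)
        idx.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# toy imbalanced dataset: 8 negatives, 2 positives
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
```

After resampling, both classes contribute eight rows, so the booster no longer sees a skewed label distribution. Note that oversampling must be applied only to the training split within each cross-validation fold; balancing before the split would leak duplicated test rows into training.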
Copyright (c) 2024 Taufik Kurniawan, Laily Hermawanti, Achmad Nuruddin Safriandono
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.