Hyperparameter Optimization and Feature Selection Analysis on the XGBoost Model for Hepatitis C Infection Prediction
DOI:
https://doi.org/10.30871/jaic.v9i6.10876
Keywords:
Classification, Feature Selection, Hepatitis C, Hyperparameter Optimization, XGBoost
Abstract
Hepatitis C is a liver disease that can progress to chronic conditions such as cirrhosis and liver cancer. Early detection is essential and can be supported by machine learning. This study analyzes the effect of feature selection and hyperparameter tuning on the performance of an XGBoost model for classifying hepatitis C infection. The dataset, obtained from Kaggle, contains laboratory test attributes. Preprocessing involved handling missing values, encoding categorical variables, removing outlier classes, and standardizing the data with StandardScaler. After a stratified split, the training set was balanced using SMOTE. Features were selected with the ANOVA F-score method, and hyperparameters were tuned with GridSearchCV. Three model scenarios were compared: a baseline, a model with feature selection, and a model with both feature selection and hyperparameter tuning. The third model performed best, with 96% accuracy, 79% precision, 81% recall, and a 78% F1-score, despite a slight decrease in ROC AUC. The results indicate that the combined approach improves model performance and is relevant for supporting more accurate hepatitis C diagnosis systems.
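The ANOVA F-score used for feature selection can be illustrated with a minimal sketch. It is not the authors' code: the function names `anova_f_score` and `rank_features`, the feature names, and the toy values are all assumptions made for illustration. It uses only the Python standard library and computes the one-way ANOVA F statistic, the ratio of between-class to within-class variance, for each feature.

```python
from collections import defaultdict
from statistics import fmean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature across class labels.

    Assumes at least two classes and more samples than classes.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n, k = len(values), len(groups)
    grand_mean = fmean(values)

    # Variation of the class means around the overall mean.
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        (v - fmean(g)) ** 2 for g in groups.values() for v in g
    )

    if ss_within == 0:
        # Every class is constant: perfectly separated, or no signal at all.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(columns, labels):
    """Return feature names sorted from highest to lowest F-score."""
    scores = {name: anova_f_score(vals, labels) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (invented): two classes, three samples each.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "feature_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # class means differ
    "feature_b": [1.0, 5.0, 9.0, 2.0, 5.0, 8.0],  # class means are equal
}
ranking = rank_features(columns, labels)
```

In this toy case `feature_a` separates the two classes and gets a high score, while `feature_b` has identical class means and scores zero. A selection step would then keep the top-ranked features; in a scikit-learn workflow the same statistic is available as `f_classif`, typically combined with `SelectKBest`.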
Copyright (c) 2025 Nadia Martha Lefi, Majid Rahardi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License (Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








