Hyperparameter Optimization and Feature Selection Analysis on the XGBoost Model for Hepatitis C Infection Prediction

Authors

  • Nadia Martha Lefi, Universitas Amikom Yogyakarta
  • Majid Rahardi, Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.30871/jaic.v9i6.10876

Keywords:

Classification, Feature Selection, Hepatitis C, Hyperparameter Optimization, XGBoost

Abstract

Hepatitis C is a liver disease that can progress to chronic conditions such as cirrhosis and liver cancer. Early detection is therefore essential and can be supported by machine learning. This study analyzes how feature selection and hyperparameter tuning affect the performance of an XGBoost model for classifying hepatitis C infection. The dataset, obtained from Kaggle, contains laboratory test attributes. Preprocessing involved handling missing values, encoding categorical variables, removing outlier classes, and standardizing features with StandardScaler. After a stratified split, the training set was balanced using SMOTE. Features were selected with the ANOVA F-score method, and hyperparameters were tuned with GridSearchCV. Three model scenarios were compared: a baseline, feature selection only, and feature selection combined with hyperparameter tuning. The third model performed best, with 96% accuracy, 79% precision, 81% recall, and a 78% F1-score, despite a slight decrease in ROC AUC. These results indicate that combining feature selection with hyperparameter tuning improves model performance and is relevant for supporting more accurate hepatitis C diagnosis systems.
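
The ANOVA F-score used for feature selection can be illustrated with a minimal sketch. It is not the authors' code: the function names (anova_f_score, select_k_best), the toy values, and the feature names feat_a and feat_b are hypothetical and chosen only for illustration. For each feature, the score is the ratio of between-class variance to within-class variance, and the highest-scoring features are kept. In scikit-learn, f_classif computes the same statistic and is typically paired with SelectKBest.

```python
from statistics import fmean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = fmean(values)
    # Between-class sum of squares: distance of each class mean from the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    # Within-class sum of squares: spread of samples around their own class mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(columns, labels, k):
    """Return the names of the k highest-scoring features and all scores."""
    scores = {name: anova_f_score(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data, not taken from the study's dataset.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "feat_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # classes well separated
    "feat_b": [5.0, 1.0, 3.0, 4.0, 2.0, 6.0],  # classes barely separated
}
selected, scores = select_k_best(columns, labels, k=1)
print(selected, scores)
```

In a pipeline like the one the abstract describes, the scores would be computed on the training split only, so that the test split does not influence which features are kept.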

Published

2025-12-06

How to Cite

[1]
N. M. Lefi and M. Rahardi, “Hyperparameter Optimization and Feature Selection Analysis on the XGBoost Model for Hepatitis C Infection Prediction”, JAIC, vol. 9, no. 6, pp. 3338–3345, Dec. 2025.
