Analysis of Gradient Boosting Algorithms with Optuna Optimization and SHAP Interpretation for Phishing Website Detection
DOI:
https://doi.org/10.30871/jaic.v10i1.11857
Keywords:
Gradient Boosting, Machine learning, Optuna, Phishing Detection, SHAP
Abstract
Phishing remains a persistent cybersecurity threat, evolving rapidly to bypass traditional blacklist-based detection systems. Machine Learning (ML) approaches offer a promising solution, yet balancing detection accuracy against model interpretability remains a challenge. This study evaluates and optimizes the performance of three state-of-the-art Gradient Boosting algorithms (XGBoost, LightGBM, and CatBoost) for phishing website detection. The research uses the UCI Phishing Websites dataset, which consists of 11,055 instances. The novelty of this study lies in using the Optuna framework with the Tree-structured Parzen Estimator (TPE) for automated hyperparameter optimization and in applying SHAP (SHapley Additive exPlanations) interaction values to interpret the "black-box" models. The experimental results show that the LightGBM model, optimized via Optuna, achieved the highest performance with an F1-Score of 0.9798, outperforming the baseline model (0.9713) by 0.87%. Furthermore, SHAP analysis identified 'SSLfinal_State' as the most critical determinant for distinguishing phishing sites. This study confirms that optimizing modern boosting algorithms significantly enhances phishing detection capabilities while providing the explainability that cybersecurity analysts need.
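The reported improvement can be read either as an absolute difference in F1-Score or as a gain relative to the baseline. The short sketch below recomputes both from the two rounded scores given in the abstract; the 0.87% figure appears to correspond to the relative reading, although the abstract does not say so explicitly. The `f1_score` helper and the confusion counts passed to it are hypothetical and only illustrate how an F1-Score is formed; they are not taken from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# F1-Scores as reported (rounded) in the abstract.
BASELINE_F1 = 0.9713
TUNED_F1 = 0.9798

# Absolute gain: plain difference between the two scores.
absolute_gain = TUNED_F1 - BASELINE_F1

# Relative gain: the same difference as a fraction of the baseline.
relative_gain = absolute_gain / BASELINE_F1

print(f"absolute gain in F1: {absolute_gain:.4f}")
print(f"relative gain over baseline: {relative_gain:.3%}")

# Hypothetical confusion counts (assumption, not from the paper),
# shown only to illustrate the F1 formula itself.
example_f1 = f1_score(tp=90, fp=5, fn=10)
print(f"example F1 from hypothetical counts: {example_f1:.4f}")
```

Recomputing from rounded scores gives a relative gain slightly above 0.87%, so the published figure was presumably truncated or derived from unrounded scores.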
License
Copyright (c) 2026 Rahmat Fauzi Abu Bakar, Majid Rahardi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








