Analysis of Gradient Boosting Algorithms with Optuna Optimization and SHAP Interpretation for Phishing Website Detection

Authors

  • Rahmat Fauzi Abu Bakar, Universitas Amikom Yogyakarta
  • Majid Rahardi, Universitas Amikom Yogyakarta

DOI:

https://doi.org/10.30871/jaic.v10i1.11857

Keywords:

Gradient Boosting, Machine Learning, Optuna, Phishing Detection, SHAP

Abstract

Phishing remains a persistent cybersecurity threat that evolves rapidly to bypass traditional blacklist-based detection systems. Machine Learning (ML) approaches offer a promising solution, yet balancing detection accuracy against model interpretability remains a challenge. This study evaluates and optimizes three state-of-the-art Gradient Boosting algorithms (XGBoost, LightGBM, and CatBoost) for phishing website detection, using the UCI Phishing Websites dataset of 11,055 instances. The novelty of the study lies in using the Optuna framework with the Tree-structured Parzen Estimator (TPE) for automated hyperparameter optimization and in applying SHAP (SHapley Additive exPlanations) interaction values to interpret the "black-box" models. In the experiments, the LightGBM model optimized via Optuna achieved the highest performance, with an F1-Score of 0.9798 against 0.9713 for the baseline model, a gain of 0.0085 in absolute terms (about 0.87% relative). Furthermore, SHAP analysis identified 'SSLfinal_State' as the most critical determinant for distinguishing phishing sites. This study confirms that optimizing modern boosting algorithms significantly enhances phishing detection capabilities while providing the explainability that cybersecurity analysts need.
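
The SHAP attributions mentioned in the abstract build on Shapley values from cooperative game theory. As a minimal sketch of that underlying idea, the Python below computes exact Shapley values for a toy scoring function by enumerating every feature coalition. The scoring function, its weights, and the names `feature_b` and `feature_c` are hypothetical and chosen only for illustration; only 'SSLfinal_State' comes from the abstract. This is not the study's model or pipeline, and the SHAP library itself relies on much faster tree-specific algorithms rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when it joins coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_score(present):
    """Hypothetical scoring function (an assumption, not the paper's model)."""
    score = 0.0
    if "SSLfinal_State" in present:
        score += 0.5
    if "feature_b" in present:
        score += 0.2
    # Interaction term: only contributes when both features are present.
    if "SSLfinal_State" in present and "feature_c" in present:
        score += 0.1
    return score


features = ["SSLfinal_State", "feature_b", "feature_c"]
phi = shapley_values(toy_score, features)

for name, value in phi.items():
    print(f"{name}: {value:.4f}")
```

In this toy setting the interaction term is shared equally between the two features that take part in it, and the attributions add up to the difference between the score with all features present and the score with none. Interaction values of the kind the abstract refers to are meant to separate such shared effects from each feature's main effect.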

Published

2026-02-05

How to Cite

[1]
R. F. Abu Bakar and M. Rahardi, “Analysis of Gradient Boosting Algorithms with Optuna Optimization and SHAP Interpretation for Phishing Website Detection”, JAIC, vol. 10, no. 1, pp. 664–672, Feb. 2026.
