Optimizing Email Spam Detection through Handling Class Imbalance with Class Weights and Hyperparameter Using GridSearchCV

Muhammad Ridho Nursyam; Muhammad Koprawi; Dony Ariyus

doi:10.30871/jaic.v10i1.12060

Authors

Muhammad Ridho Nursyam, Universitas Amikom Yogyakarta
Muhammad Koprawi, Universitas Amikom Yogyakarta
Dony Ariyus, Universitas Amikom Yogyakarta

Keywords:

Spam Detection, Machine Learning, Class Imbalance, GridSearchCV, Email Spam

Abstract

Email spam is a major problem in digital communication: it disrupts productivity, burdens network resources, and poses a security threat. This study optimizes spam email detection with machine learning by addressing class imbalance through class weighting and by tuning hyperparameters with GridSearchCV. To improve model accuracy and sensitivity, several datasets are merged to give the models a broader range of training data. The models evaluated are Support Vector Machine (SVM), Random Forest, Multinomial Naive Bayes (MNB), and XGBoost. Each model is evaluated on accuracy, precision, recall, and F1-score, both before and after hyperparameter tuning. SVM achieves the highest accuracy, reaching 97.10% after tuning compared to 96.73% before. Random Forest, MNB, and XGBoost also perform better after tuning. Overall, the results indicate that dataset merging and class weight adjustment improve the models' ability to detect spam and provide a basis for building a more effective email spam detection system.
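The combination of class weighting and grid search described in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' pipeline: the toy corpus, the TF-IDF features, the choice of `LinearSVC`, the `C` grid, the F1 scoring, and the two-fold split are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy corpus: 12 legitimate (label 0) and 4 spam (label 1) messages.
ham = [
    "meeting moved to three pm tomorrow",
    "please review the attached report",
    "lunch on friday with the team",
    "your invoice for march is attached",
    "can you send me the slides",
    "project deadline is next week",
    "thanks for the update on the report",
    "see you at the team meeting",
    "notes from the project call attached",
    "reminder to submit your timesheet",
    "the report looks good to me",
    "schedule for next week is attached",
]
spam = [
    "win a free prize click now",
    "free money claim your prize now",
    "click now for cheap loans free",
    "claim free cash prize click here",
]
texts = ham + spam
labels = np.array([0] * len(ham) + [1] * len(spam))

# "balanced" weights: n_samples / (n_classes * class_count).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
class_weights = {0: float(weights[0]), 1: float(weights[1])}

# TF-IDF features feeding a linear SVM that applies the same balanced weighting.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(class_weight="balanced")),
])

# Assumed search space: only the regularization strength C is tuned here.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# F1 on the spam class is used for model selection instead of plain accuracy.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(texts, labels)

print("class weights:", class_weights)
print("best parameters:", search.best_params_)
```

With 12 legitimate and 4 spam messages, the balanced heuristic assigns the minority class a weight three times that of the majority class, so misclassifying a spam message costs more during training. Stratified folds keep the class ratio the same in each split, which matters when the minority class is small.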
