Optimizing Email Spam Detection through Handling Class Imbalance with Class Weights and Hyperparameter Using GridSearchCV
DOI: https://doi.org/10.30871/jaic.v10i1.12060
Keywords: Spam Detection, Machine Learning, Class Imbalance, GridSearchCV, Email Spam
Abstract
Email spam is a major problem in digital communication: it disrupts productivity, burdens network resources, and poses a security threat. This research optimizes spam email detection with a machine learning approach that addresses class imbalance through class weighting and tunes hyperparameters with GridSearchCV. Several datasets are merged to broaden the training data and improve model accuracy and sensitivity. The models studied are Support Vector Machine (SVM), Random Forest, Multinomial Naive Bayes (MNB), and XGBoost. Each model is evaluated on accuracy, precision, recall, and F1-score, both before and after hyperparameter tuning. In the experiments, SVM achieves the highest accuracy after tuning, 97.10%, compared to 96.73% before tuning. Random Forest, MNB, and XGBoost also perform better after tuning. Overall, the study shows that dataset merging and class weight adjustment can improve a model's ability to detect spam, and it provides a basis for deploying the model in a more effective email spam detection system.
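As an illustration of the class weighting and evaluation metrics mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes the common "balanced" heuristic, where each class weight is n_samples / (n_classes * class_count), which is also what scikit-learn uses for class_weight="balanced". The function names, the toy labels, and the predictions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}.

    Rare classes receive larger weights, so their errors cost more
    during training. This is an assumed heuristic, not necessarily
    the exact weighting used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced toy data: 8 ham messages, 2 spam messages.
y_true = ["ham"] * 8 + ["spam"] * 2
# Hypothetical predictions: one ham flagged as spam, one spam missed.
y_pred = ["ham"] * 7 + ["spam"] + ["spam", "ham"]

weights = balanced_class_weights(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(weights)
print(precision, recall, f1)
```

On this toy data the minority class (spam) gets four times the weight of the majority class. That is why class weighting tends to raise recall on spam, and why precision, recall, and F1-score are reported alongside accuracy.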
License
Copyright (c) 2026 Muhammad Ridho Nursyam, Muhammad Koprawi, Dony Ariyus

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.