Hybrid Lexical-Semantic Approach for Clickbait Classification in Indonesian News
DOI:
https://doi.org/10.30871/jaic.v10i2.12295
Keywords:
clickbait, text classification, TF-IDF (Term Frequency-Inverse Document Frequency), IndoBERT, hybrid pre-processing
Abstract
The rise of digital media in Indonesia has led to a proliferation of hyperbolic clickbait headlines that undermine media credibility and spread misinformation. This study proposes a hybrid lexical-semantic workflow for clickbait classification in Indonesian news that combines lexical representations, namely Term Frequency-Inverse Document Frequency (TF-IDF) and token overlap ratios, with deep contextual representations from IndoBERT embeddings. The dataset consists of 3,000 news articles from Detik.com, labeled through a weak supervision strategy based on headline-content discrepancy. The extracted features were processed using a Single Layer Perceptron (SLP) and compared against Logistic Regression (LR) and Random Forest (RF) models. Evaluation used stratified 5-fold cross-validation and focused on the F1-score to address class imbalance; the results indicate that hybrid features make the models more robust to subtle headline-content misalignments. LR reached a peak F1-score of 0.90 in lexical settings, while the hybrid SLP configuration yielded a stable F1-score of 0.87, and feature importance analysis identified IndoBERT semantic similarity as the most critical predictor. The hybrid approach balances semantic depth and computational efficiency, offering an effective framework for safeguarding information integrity on Indonesian digital news platforms.
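To make the lexical side of the workflow concrete, the following is a minimal sketch of two headline-content features of the kind the abstract names: a token overlap ratio and a TF-IDF cosine similarity. It is not the authors' implementation. The tokenizer, the smoothed IDF variant, the function names, and the two toy headline-body pairs are all assumptions made for illustration, and the IndoBERT semantic similarity feature is deliberately left out because it requires a pretrained model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap_ratio(headline, body):
    """Fraction of unique headline tokens that also occur in the body."""
    head = set(tokenize(headline))
    if not head:
        return 0.0
    return len(head & set(tokenize(body))) / len(head)


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lexical_features(pairs):
    """Compute (overlap ratio, TF-IDF cosine) for each (headline, body) pair."""
    # IDF is fitted on all headlines and bodies together.
    docs = [text for pair in pairs for text in pair]
    vecs = tfidf_vectors(docs)
    return [
        (token_overlap_ratio(h, b), cosine(vecs[2 * i], vecs[2 * i + 1]))
        for i, (h, b) in enumerate(pairs)
    ]


# Hypothetical toy data, not taken from the Detik.com dataset.
pairs = [
    ("Harga beras naik di Jakarta",
     "Harga beras di Jakarta naik pekan ini menurut pedagang pasar"),
    ("Anda tidak akan percaya ini",
     "Pemerintah mengumumkan jadwal libur nasional tahun depan"),
]
features = lexical_features(pairs)
for (headline, _), (overlap, similarity) in zip(pairs, features):
    print(f"{headline!r}: overlap={overlap:.2f}, tfidf_cosine={similarity:.2f}")
```

In a hybrid setup like the one described, such lexical scores would be concatenated with a semantic similarity score from contextual embeddings before being passed to a classifier. A low overlap ratio combined with low similarity is the kind of headline-content discrepancy that a weak supervision rule could flag, though the exact rule used in the study is not stated here.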
License
Copyright (c) 2026 Mohammad Nizar Farizi, Lukmanul Hakim, Sultan Ahmad Haidir, Ema Utami

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.