Hybrid Lexical-Semantic Approach for Clickbait Classification in Indonesian News

Authors

  • Mohammad Nizar Farizi, Master of Informatics, Universitas AMIKOM Yogyakarta
  • Lukmanul Hakim, Master of Informatics, Universitas AMIKOM Yogyakarta
  • Sultan Ahmad Haidir, Master of Informatics, Universitas AMIKOM Yogyakarta
  • Ema Utami, Master of Informatics, Universitas AMIKOM Yogyakarta

DOI:

https://doi.org/10.30871/jaic.v10i2.12295

Keywords:

clickbait, Text Classification, TF-IDF (Term Frequency-Inverse Document Frequency), IndoBERT, hybrid pre-processing

Abstract

The rise of digital media in Indonesia has led to the proliferation of hyperbolic clickbait headlines that undermine media credibility and spread misinformation. This study proposes a hybrid lexical-semantic workflow for clickbait classification in Indonesian news that integrates lexical representations (Term Frequency-Inverse Document Frequency (TF-IDF) and token overlap ratios) with deep contextual representations from IndoBERT embeddings. The dataset comprises 3,000 news articles from Detik.com, labeled through a weak supervision strategy based on headline-content discrepancy. The extracted features were fed to a Single Layer Perceptron (SLP) and compared against Logistic Regression (LR) and Random Forest (RF) models. Evaluation used stratified 5-fold cross-validation and focused on the F1-score to address class imbalance. The results indicate that hybrid features make the models more robust to subtle headline-content misalignments. While LR reached a peak F1-score of 0.90 in lexical settings, the hybrid SLP configuration yielded a stable F1-score of 0.87, and feature importance analysis identified IndoBERT semantic similarity as the most critical predictor. Overall, the hybrid approach balances semantic depth and computational efficiency, offering an effective framework for safeguarding information integrity on Indonesian digital news platforms.
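The abstract does not spell out how the lexical features or the F1-score are computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a simple regex tokenizer, defines the token overlap ratio as the share of distinct headline tokens that also appear in the article body, uses a smoothed TF-IDF weighting, and takes the cosine similarity between the headline and content vectors. All function names are hypothetical. The IndoBERT semantic similarity feature is left out because it requires a pretrained model; in a hybrid setup it would presumably be appended to the same feature vector.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap_ratio(headline, content):
    """Share of distinct headline tokens that also occur in the content."""
    head = set(tokenize(headline))
    if not head:
        return 0.0
    return len(head & set(tokenize(content))) / len(head)


def tfidf_vectors(docs):
    """Smoothed TF-IDF vectors (term -> weight) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexical_features(pairs):
    """Return [overlap_ratio, tfidf_cosine] for each (headline, content) pair."""
    docs = []
    for headline, content in pairs:
        docs.append(tokenize(headline))
        docs.append(tokenize(content))
    vecs = tfidf_vectors(docs)  # IDF is fitted on headlines and contents together
    return [
        [token_overlap_ratio(h, c), cosine(vecs[2 * i], vecs[2 * i + 1])]
        for i, (h, c) in enumerate(pairs)
    ]


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative pairs only; they are not taken from the study's dataset.
    examples = [
        ("Harga beras naik di Jakarta",
         "Harga beras di Jakarta naik sepuluh persen pekan ini"),
        ("Anda tidak akan percaya ini",
         "Pemerintah mengumumkan anggaran baru untuk pendidikan"),
    ]
    for (headline, _), feats in zip(examples, lexical_features(examples)):
        print(headline, feats)
```

Low overlap and low TF-IDF similarity between a headline and its article are the kind of headline-content discrepancy that the weak labeling strategy relies on. The F1 helper reflects why the metric suits imbalanced classes: it ignores true negatives, so a model cannot score well merely by predicting the majority class.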

Published

2026-04-16

How to Cite

[1] M. N. Farizi, L. Hakim, S. A. Haidir, and E. Utami, “Hybrid Lexical-Semantic Approach for Clickbait Classification in Indonesian News”, JAIC, vol. 10, no. 2, pp. 1200–1209, Apr. 2026.
