Implementation of Text Mining for Evaluating the Relevance Between News Headlines and Content on a Web-Based Platform

Authors

  • Desak Gede Inten Purnawati, Teknologi Informasi, Universitas Udayana
  • Desy Purnami Singgih Putri, Teknologi Informasi, Universitas Udayana
  • I Nyoman Piarsa, Teknologi Informasi, Universitas Udayana

DOI:

https://doi.org/10.30871/jaic.v9i4.9732

Keywords:

News, Text Mining, LSTM, IndoBERT, Similarity

Abstract

Technological advancements in the era of the Industrial Revolution 4.0 have significantly transformed how society accesses and consumes information, particularly through online news portals. This study analyzes the relevance between news headlines and article content on Indonesian online news platforms using text mining techniques and similarity checking. To improve the accuracy of relevance assessment, it applies two deep learning models: Long Short-Term Memory (LSTM) and IndoBERT. The data were collected from three leading Indonesian news portals (detik.com, kompas.com, and suara.com) and comprise 52,242 articles from the entertainment and national news categories, gathered between July 1 and September 30, 2024. The dataset includes attributes such as headline, category, publication date, author, article URL, and news content. The research process consists of several stages: data collection through web scraping, data pre-processing (cleaning the category, author, and content columns), content summarization, text similarity calculation, and data labeling into three classes: relevan (relevant), berlebihan (exaggerated), and nonrelevan (not relevant). Evaluation results show that IndoBERT outperforms LSTM, achieving a training accuracy of 0.9048 and a training loss of 0.2514, with a validation accuracy of 0.8604 and a validation loss of 0.4039. These findings indicate that IndoBERT is effective in assessing the coherence between news headlines and content.
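
The similarity calculation and labeling stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which similarity measure, text representation, or cut-off values were used. The sketch assumes cosine similarity over raw term-frequency vectors, and the function names, the thresholds `high` and `low`, the mapping of score ranges to the three classes, and the sample headline are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    tf_a = Counter(tokenize(text_a))
    tf_b = Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def label_relevance(score, high=0.5, low=0.2):
    """Map a similarity score to one of the three classes.

    The thresholds and the score-to-class mapping are hypothetical
    placeholders, not values reported in the paper.
    """
    if score >= high:
        return "relevan"
    if score >= low:
        return "berlebihan"
    return "nonrelevan"


# Hypothetical headline and summarized content, for illustration only.
headline = "Presiden resmikan jalan tol baru"
summary = "Presiden resmikan jalan tol baru di Jawa Barat"

score = cosine_similarity(headline, summary)
label = label_relevance(score)
print(f"similarity={score:.4f} label={label}")
```

In a pipeline like the one described, labels produced this way would serve as training targets for the LSTM and IndoBERT classifiers; a TF-IDF or embedding-based representation could replace the raw term counts without changing the overall structure.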

Published

2025-08-05

How to Cite

[1] D. G. I. Purnawati, D. P. Singgih Putri, and I. N. Piarsa, “Implementation of Text Mining for Evaluating the Relevance Between News Headlines and Content on a Web-Based Platform,” JAIC, vol. 9, no. 4, pp. 1463–1476, Aug. 2025.

Section

Articles