Sentiment Classification of Health Education YouTube Comments Using IndoBERT Embeddings with Logistic Regression and Naïve Bayes
DOI:
https://doi.org/10.30871/jaic.v10i3.13016
Keywords:
Sentiment analysis, IndoBERT, SMOTE, Logistic Regression, Text Classification, Naïve Bayes
Abstract
Class imbalance is a common issue in sentiment classification of social media data, particularly in mental health–related discussions where certain sentiment classes are underrepresented. This study classifies the sentiment of mental health–related YouTube comments, using IndoBERT as a pre-trained language model to generate contextual text embeddings and two conventional machine learning algorithms, Logistic Regression and Naïve Bayes, as classifiers. The research framework includes data collection through the YouTube Data API, text preprocessing, semi-manual sentiment labeling into positive, neutral, and negative classes, and dataset partitioning using an 80:20 train–test split. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied exclusively to the training data to prevent data leakage. Each comment is represented by a 768-dimensional IndoBERT embedding. Model performance is evaluated using accuracy, precision, recall, and F1-score. Experimental results show that Logistic Regression outperforms Naïve Bayes, with an accuracy of 78% compared to 56%. This indicates that Logistic Regression is more effective in handling dense contextual embeddings generated by transformer-based models. Overall, the findings demonstrate that combining contextual embeddings with data balancing techniques can improve sentiment classification performance in mental health–related social media analysis, particularly in low-resource language settings.
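The ordering described in the abstract (split first, then oversample only the training portion) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the 4-dimensional toy vectors stand in for the 768-dimensional IndoBERT embeddings, the class sizes are invented for illustration, the split is stratified per class, and `k=5` neighbours is an assumed setting; none of these details are stated in the abstract. The helper names `stratified_split` and `smote` are hypothetical.

```python
import math
import random
from collections import Counter


def stratified_split(X, y, test_ratio=0.2, seed=0):
    """Hold out `test_ratio` of each class (80:20 by default).
    Stratification is an assumption, not stated in the abstract."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])


def smote(X, y, k=5, seed=0):
    """Basic SMOTE: grow each smaller class to the majority size by
    interpolating between a sample and one of its k nearest
    same-class neighbours (k=5 is an assumed setting)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(pool) < 2:
            continue  # majority class, or too few points to interpolate
        for _ in range(target - n):
            base = rng.choice(pool)
            neighbours = sorted((p for p in pool if p is not base),
                                key=lambda p: math.dist(base, p))[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data: invented class sizes, 4-dim vectors as stand-ins
# for real embeddings.
data_rng = random.Random(42)
sizes = {"positive": 60, "neutral": 30, "negative": 10}
X, y = [], []
for offset, (label, n) in enumerate(sizes.items()):
    for _ in range(n):
        X.append([offset + data_rng.gauss(0, 0.3) for _ in range(4)])
        y.append(label)

# Split first, then balance the training portion only, so no synthetic
# point is derived from (or leaks into) the held-out test set.
X_train, y_train, X_test, y_test = stratified_split(X, y)
X_bal, y_bal = smote(X_train, y_train)

print("train before SMOTE:", dict(Counter(y_train)))
print("train after SMOTE: ", dict(Counter(y_bal)))
print("test (untouched):  ", dict(Counter(y_test)))
```

A classifier would then be fitted on `X_bal`, `y_bal` and scored on `X_test`, `y_test`, which keeps the original class distribution. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred over a hand-written version.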
License
Copyright (c) 2026 Andre Septa Wijaya, Amiq Fahmi, Yuventius Tyas Catur Pramudi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) ) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








