Evaluating the Impact of Random Over Sampling on IndoBERT Performance for Indonesian Sentiment Analysis
DOI:
https://doi.org/10.30871/jaic.v9i6.11488
Keywords:
Sentiment Analysis, IndoBERT, Random Over Sampler, Imbalanced Data, Model Evaluation
Abstract
Sentiment analysis is a prominent research area in natural language processing (NLP). For the Indonesian language, IndoBERT has emerged as a leading model due to its competitive performance. However, its effectiveness depends strongly on the class distribution of the data, and user reviews on digital platforms such as the Google Play Store are often imbalanced. This study investigates whether the Random Over Sampler (ROS) technique improves IndoBERT’s performance under imbalanced data conditions. The dataset consists of 13,821 user reviews of the IDN App collected from the Google Play Store between 2015 and July 2025. Before modeling, the text was preprocessed with punctuation removal, case folding, stopword removal, tokenization, normalization, and stemming to ensure textual consistency. Reviews were categorized into two sentiment classes: positive (3–5 stars) and negative (1–2 stars). Two experimental scenarios were compared: (1) IndoBERT without ROS and (2) IndoBERT with a dataset balanced using ROS. The data were split into 70% training, 20% validation, and 10% testing, and model performance was evaluated using accuracy, precision, recall, and F1-score. Results showed a significant improvement after applying ROS, with the model reaching 94.55% accuracy, 94.61% precision, 94.53% recall, and 94.54% F1-score. Confusion matrix analysis indicated improved classification of the minority class, reducing the error rate by 49%. However, learning curve analysis revealed potential overfitting due to ROS. Further research is needed to optimize ROS strategies for better performance and generalization.
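The data preparation described in the abstract can be illustrated with a minimal Python sketch: star ratings are mapped to the two sentiment classes, the data are split 70/20/10, and minority-class samples are duplicated at random until the classes are balanced. This is an illustration under stated assumptions, not the authors' implementation. The function names, the toy reviews, and the seed are hypothetical, and applying ROS to the training split only is an assumption, since the abstract does not say at which stage the oversampling was applied. The imbalanced-learn library provides the same technique as `RandomOverSampler`; the version below uses only the standard library so that the mechanism is visible.

```python
import random
from collections import Counter


def label_from_stars(stars):
    """Map a star rating to the binary labels described in the abstract."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating out of range: {stars!r}")
    return "positive" if stars >= 3 else "negative"


def split_70_20_10(items, seed=0):
    """Shuffle, then split into train/validation/test parts (70/20/10)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10  # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sampling with replacement; k is 0 for the majority class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 80 five-star and 20 one-star reviews.
reviews = [(f"review {i}", 5 if i < 80 else 1) for i in range(100)]
labeled = [(text, label_from_stars(stars)) for text, stars in reviews]

train, val, test = split_70_20_10(labeled)
# Assumption: only the training split is oversampled, so the validation
# and test sets contain no duplicated training reviews.
balanced_train = random_oversample(train)

print("train before:", Counter(label for _, label in train))
print("train after: ", Counter(label for _, label in balanced_train))
```

Because ROS only repeats existing reviews, the balanced training set contains no new text. A model can memorize the repeated minority examples, which is consistent with the overfitting risk noted in the learning curve analysis.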
License
Copyright (c) 2025 Dimas Ramadhan Alfinsyah, Bambang Pilu Hartato

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








