Hybrid Machine Learning for Knowledge Discovery in E-Commerce Reviews
DOI:
https://doi.org/10.30871/jaic.v10i3.12635
Keywords:
K-Means Clustering, Knowledge Discovery in Database, Random Forest, Tokopedia, Uninformative Review
Abstract
The rapid growth of e-commerce platforms such as Tokopedia has led to the accumulation of more than 65,000 customer reviews, yet this volume is often accompanied by information pollution in the form of non-informative reviews that hinder consumer decision-making. This research aims to extract new knowledge about the characteristics of these reviews by applying the Knowledge Discovery in Database (KDD) framework with a hybrid of K-Means Clustering and Random Forest. Unlike conventional classification approaches, the study uses K-Means as an exploratory instrument to map six latent topic patterns in the reviews based on their textual structure. Experiments were conducted on 35,000 samples using TF-IDF features enriched with cluster labels as structural predictors. The hybrid model achieves 94.41% accuracy and an F1-score of 0.90 for the non-informative class, and 5-fold cross-validation indicates high stability (94.56% ± 0.19%). The most important finding comes from the SHAP analysis, in which the cluster feature ranks 7th out of 1,001 predictor features, indicating that semantic grouping provides richer structural context than purely lexical features. Error analysis further reveals specific linguistic challenges, such as sarcasm and semantic ambiguity, that constrain automated review detection. The research offers a managerial contribution to e-commerce platforms by helping them improve information quality and mitigate information overload.
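The abstract does not include code, so the following is only a minimal sketch of how TF-IDF features enriched with a K-Means cluster label could feed a Random Forest, evaluated with 5-fold cross-validation. It assumes scikit-learn and SciPy. The class name `ClusterLabelAppender`, the toy English reviews, the cap of 1,000 TF-IDF terms (a guess consistent with the 1,001 predictor features reported) and the encoding of the cluster id as a single numeric column are all illustrative assumptions, not the authors' implementation.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class ClusterLabelAppender(BaseEstimator, TransformerMixin):
    """Hypothetical helper: append each row's K-Means cluster id as one extra column."""

    def __init__(self, n_clusters=6, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # K-Means is fitted on the training rows only, without using the labels y.
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X).reshape(-1, 1)
        # The cluster id is stored as a plain numeric column (an assumption).
        return hstack([X, csr_matrix(labels)]).tocsr()


def build_pipeline(max_features=1000, n_clusters=6):
    """TF-IDF -> cluster-label enrichment -> Random Forest."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=max_features)),
        ("cluster", ClusterLabelAppender(n_clusters=n_clusters)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ])


# Toy data for illustration only: 1 = informative, 0 = non-informative.
informative = [
    "battery lasts two days and charges in one hour",
    "fabric is thin but the stitching is neat and size fits",
    "screen is bright, colors accurate, speaker is weak",
    "arrived in three days, box dented but item intact",
    "blender is loud yet crushes ice without trouble",
    "shoes run small, order one size larger",
    "keyboard keys feel soft and the backlight is even",
    "bag zipper broke after one week of daily use",
    "rice cooker heats evenly and the pot is easy to clean",
    "cable is one meter long and supports fast charging",
]
non_informative = [
    "ok", "good", "thanks seller", "nice", "mantap",
    "great", "thank you", "okay good", "nice thanks", "good seller",
]
texts = informative + non_informative
labels = [1] * len(informative) + [0] * len(non_informative)

pipeline = build_pipeline()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Refit on all rows to inspect the enriched feature matrix.
pipeline.fit(texts, labels)
enriched = pipeline[:-1].transform(texts)
n_tfidf = len(pipeline.named_steps["tfidf"].vocabulary_)
```

Placing the clustering step inside the pipeline means K-Means is refitted on each training fold, so the cluster labels of the held-out fold do not leak into training. Whether the original experiments were arranged this way is not stated in the abstract.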
License
Copyright (c) 2026 Jeremiah Alwin Siahaan, Lailla Syal Syabilla, M. Thoriqul Fadli, Mei Intan Natasyah, Allsela Meiriza, Ken Ditha Tania, Ahmad Rifai

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges as well as earlier and greater citation of published work (see The Effect of Open Access).








