Hybrid Machine Learning for Knowledge Discovery in E-Commerce Reviews

Authors

  • Jeremiah Alwin Siahaan, Universitas Sriwijaya
  • Lailla Syal Syabilla, Universitas Sriwijaya
  • M. Thoriqul Fadli, Universitas Sriwijaya
  • Mei Intan Natasyah, Universitas Sriwijaya
  • Allsela Meiriza, Universitas Sriwijaya
  • Ken Ditha Tania, Universitas Sriwijaya
  • Ahmad Rifai, Universitas Sriwijaya

DOI:

https://doi.org/10.30871/jaic.v10i3.12635

Keywords:

K-Means Clustering, Knowledge Discovery in Database, Random Forest, Tokopedia, Uninformative Review

Abstract

The rapid growth of e-commerce platforms such as Tokopedia has led to the accumulation of more than 65,000 customer reviews, but this volume is often accompanied by information pollution in the form of non-informative reviews that hinder consumer decision-making. This research aims to extract new knowledge about the characteristics of these reviews by applying the Knowledge Discovery in Database (KDD) framework with a hybrid of K-Means Clustering and Random Forest. Unlike conventional classification approaches, this study uses K-Means as an exploratory tool to map six latent topic patterns in the reviews based on their textual structure. Experiments were conducted on 35,000 data samples using TF-IDF features enriched with cluster labels as structural predictors. The results indicate that the hybrid model achieves 94.41% accuracy with an F1-score of 0.90 for the non-informative class, and that it remains stable under 5-Fold Cross-Validation (94.56% ± 0.19%). The most important knowledge discovery comes from SHAP analysis, in which the cluster feature ranks 7th out of 1,001 predictor features, indicating that semantic grouping provides a richer structural context than purely lexical features. Furthermore, error analysis reveals specific linguistic challenges, such as sarcasm and semantic ambiguity, that constrain automated review detection. This research offers a managerial contribution to e-commerce platforms by helping them improve information quality and mitigate information overload.
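The hybrid pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, a small made-up corpus, and a feature layout of up to 1,000 TF-IDF terms plus one cluster-ID column (a guess that happens to match the reported 1,001 predictors). The helper name `build_hybrid_features` and all parameter values other than the six clusters and five folds are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score


def build_hybrid_features(texts, n_clusters=6, max_features=1000, seed=42):
    """Return TF-IDF features with one extra column holding the K-Means cluster ID."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    tfidf = vectorizer.fit_transform(texts)
    # Unsupervised step: group reviews by their TF-IDF representation.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(tfidf)
    # Assumption: the cluster label is appended as a single numeric column.
    cluster_col = csr_matrix(cluster_ids.reshape(-1, 1).astype(np.float64))
    features = hstack([tfidf, cluster_col]).tocsr()
    return features, vectorizer, kmeans


# Hypothetical toy reviews; label 1 = non-informative, 0 = informative.
informative = [
    "battery lasts two days and the screen is very bright",
    "package arrived in three days with bubble wrap intact",
    "size runs small so order one number larger than usual",
    "the zipper broke after one week of daily use",
    "colour matches the photo and the fabric feels thick",
    "charger heats up quickly but still charges the phone fast",
    "sound quality is clear although the bass is a bit weak",
    "seller replied within minutes and replaced the damaged item",
    "the blender is loud but crushes ice without any problem",
    "keyboard keys feel soft and the backlight has three levels",
]
non_informative = [
    "ok", "good", "nice", "thanks", "ok good",
    "nice thanks", "good seller", "thanks seller", "ok nice", "good good thanks",
]
texts = informative + non_informative
y = np.array([0] * len(informative) + [1] * len(non_informative))

X, vectorizer, kmeans = build_hybrid_features(texts)

# Supervised step: Random Forest on the enriched features, 5-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

For brevity, the sketch fits the vectorizer and the clustering on all reviews before cross-validation; a stricter evaluation would fit both inside each training fold to avoid leakage. Encoding the cluster as one integer column is also only one option, and a one-hot encoding would be an equally plausible reading of "cluster labels as structural predictors".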

Author Biographies

Jeremiah Alwin Siahaan, Universitas Sriwijaya

Undergraduate Student at the Department of Information Systems, Universitas Sriwijaya.

Lailla Syal Syabilla, Universitas Sriwijaya

Undergraduate Student at the Department of Information Systems, Universitas Sriwijaya.

M. Thoriqul Fadli, Universitas Sriwijaya

Undergraduate Student at the Department of Information Systems, Universitas Sriwijaya.

Mei Intan Natasyah, Universitas Sriwijaya

Undergraduate Student at the Department of Information Systems, Universitas Sriwijaya.

Allsela Meiriza, Universitas Sriwijaya

Lecturer at the Department of Information Systems, Universitas Sriwijaya.

Ken Ditha Tania, Universitas Sriwijaya

Lecturer at the Department of Information Systems, Universitas Sriwijaya.

Ahmad Rifai, Universitas Sriwijaya

Lecturer at the Department of Information Systems, Universitas Sriwijaya.

Published

2026-06-17

How to Cite

[1] J. A. Siahaan, “Hybrid Machine Learning for Knowledge Discovery in E-Commerce Reviews”, JAIC, vol. 10, no. 3, pp. 2790–2798, Jun. 2026.
