Clustering Balinese Language Documents using the Balinese Stemmer Method and Mini Batch K-Means with K-Means++

  • Made Agus Putra Subali, Institut Teknologi dan Bisnis STIKOM Bali
  • I Gusti Rai Agung Sugiartha, Institut Teknologi dan Bisnis STIKOM Bali
  • Komang Budiarta, Institut Teknologi dan Bisnis STIKOM Bali
  • I Made Budi Adnyana, Institut Teknologi dan Bisnis STIKOM Bali
Keywords: Clustering, Balinese Language Documents, Balinese Stemmer, Mini Batch k-Means, k-Means

Abstract

Clustering partitions data into n groups such that similarity within each group is maximized and similarity between groups is minimized. Among the various clustering methods, k-means is widely used because of its simplicity and its ability to yield good clustering results. However, k-means is slow on high-dimensional datasets, and its results are sensitive to the initial choice of cluster centers. To address these limitations, this study uses the mini batch k-means method to speed up processing of high-dimensional data and the k-means++ method to select the initial cluster centers. The dataset comprises 300 news articles in Balinese sourced from the https://balitv.tv/ website. Before clustering, the documents are stemmed with the Balinese stemmer method to improve recall. The clustering results show that most of the 300 documents are highly similar to one another. When the number of clusters (n) exceeds two, the documents are no longer distinctly separated because of their high structural similarity. This can be attributed to the relatively small number of words, or attributes, produced. In future research, feature reduction will be implemented, and a clustering method capable of handling overlapping data will be explored.
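The combination of k-means++ seeding and mini batch updates described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the batch size, the iteration count, and the toy 2-D points (standing in for term-weight vectors of stemmed documents) are all assumptions made for illustration, and the update rule follows the commonly published per-center learning rate form of mini batch k-means.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers

def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def mini_batch_kmeans(points, k, batch_size=4, n_iter=50, seed=0):
    """Mini batch k-means: update centers from small random batches
    instead of the full dataset on every iteration."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Cache assignments for the whole batch before moving any center.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (100, 100), (100, 101), (101, 100), (101, 101)]
centers, labels = mini_batch_kmeans(points, k=2)
print(labels)
```

Because each center moves by a step of 1/count toward every point assigned to it, a center ends up as the running mean of the points it has absorbed. That lets the method work from small random batches rather than rescanning all documents on each iteration, which is the source of the speed-up on large, high-dimensional data.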

Published
2023-12-05
How to Cite
[1]
M. Subali, I. G. R. A. Sugiartha, K. Budiarta, and I. M. B. Adnyana, “Clustering Balinese Language Documents using the Balinese Stemmer Method and Mini Batch K-Means with K-Means++”, JAIC, vol. 7, no. 2, pp. 258-262, Dec. 2023.