Clustering Balinese Language Documents using the Balinese Stemmer Method and Mini Batch K-Means with K-Means++
Abstract
Clustering partitions data into n groups so that similarity within each group is maximized and similarity between groups is minimized. Among clustering methods, k-means is widely used because of its simplicity and its ability to yield good clustering results. However, k-means is slow on high-dimensional datasets, and its results are sensitive to the initial choice of cluster centers. To address these limitations, this study uses mini-batch k-means to speed up processing of high-dimensional data and k-means++ to improve the selection of initial cluster centers. The dataset comprises 300 Balinese-language news articles sourced from the https://balitv.tv/ website. Before clustering, the documents are stemmed with the Balinese stemmer method to improve recall. The clustering results indicate that most of the 300 documents are highly similar to one another. When the number of clusters (n) exceeds two, the data cannot be distinctly separated because of the high structural similarity among the documents. This can be attributed to the relatively small number of words (attributes) produced. Future research will apply feature reduction and explore a clustering method capable of handling overlapping data.
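The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the documents have already been stemmed and turned into numeric vectors, and it stands in for those vectors with a handful of toy 2-D points. The function names (`kmeans_pp_init`, `mini_batch_kmeans`) and the parameters `batch_size`, `n_iter`, and `seed` are hypothetical choices made for illustration. The seeding step draws each new center with probability proportional to its squared distance from the nearest center already chosen (k-means++), and the update step moves a center toward each sampled point with a per-center learning rate of 1/count (mini-batch k-means).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: draw each new center with probability proportional
    to its squared distance from the nearest center chosen so far."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def mini_batch_kmeans(points, k, batch_size=4, n_iter=50, seed=0):
    """Mini-batch k-means with k-means++ seeding (illustrative parameters)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy vectors standing in for document vectors (an assumption, not real data).
toy_points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (10.0, 10.0), (10.1, 9.9), (9.9, 10.2),
]
toy_centers, toy_labels = mini_batch_kmeans(toy_points, k=2)
print(toy_labels)
```

On well-separated toy data such as this, the two groups of points should end up with different labels. The abstract reports the opposite situation for the Balinese news corpus, where the documents are too similar to separate cleanly once n exceeds two.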
Copyright (c) 2023 Made Agus Putra Subali, I Gusti Rai Agung Sugiartha, Komang Budiarta, I Made Budi Adnyana
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.