Comparison of Clustering Methods for Mixed-Type Data: One-Hot Encoding, Gower Distance, and K-Prototype Based on Accuracy (Case Study: Chronic Kidney Disease Dataset)
Abstract
This study compares methods for clustering mixed-type data on the basis of accuracy: the one-hot-encoding method, Gower distance combined with the k-means, DBSCAN, and OPTICS algorithms, and k-prototype. The dataset used is the chronic kidney disease (CKD) dataset from the UCI Machine Learning Repository. Evaluation with the silhouette index shows that k-prototype with k = 2 clusters performs best, yielding the highest silhouette index of the methods compared (0.3796, against the other four methods). Cluster 1 contains 175 observations and cluster 2 contains 225 observations. When the clusters are matched against the class labels in the dataset, the clustering achieves an accuracy of 81.25 percent.
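To make two of the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) a Gower distance between two mixed-type records, where numeric features contribute a range-normalized absolute difference and categorical features contribute 0 or 1, and (b) one common way to score a clustering against class labels, by trying every cluster-to-class mapping and keeping the best one. The function names, the toy records, and the feature ranges are hypothetical and chosen only for illustration; the sketch also assumes there are no missing values and no more clusters than classes.

```python
from itertools import permutations


def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records with mixed feature types.

    numeric_ranges maps a feature index to that feature's range
    (max - min over the dataset). Indices not in the mapping are
    treated as categorical. Assumes no missing values.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            r = numeric_ranges[i]
            # Range-normalized absolute difference, in [0, 1].
            total += abs(x - y) / r if r > 0 else 0.0
        else:
            # Simple matching for categorical features.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def clustering_accuracy(cluster_ids, labels):
    """Best-case agreement between cluster ids and class labels.

    Tries every one-to-one assignment of clusters to classes and
    returns the highest fraction of matching observations.
    Assumes the number of clusters does not exceed the number of classes.
    """
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(labels))
    best_hits = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == y for c, y in zip(cluster_ids, labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels)


# Hypothetical records: (age, blood pressure, hypertension flag).
record_a = (48, 80, "yes")
record_b = (60, 90, "no")
ranges = {0: 40, 1: 50}  # assumed ranges for the two numeric features
distance = gower_distance(record_a, record_b, ranges)

# Hypothetical cluster assignments and class labels for 8 observations.
toy_clusters = [0, 0, 0, 1, 1, 1, 1, 0]
toy_labels = ["ckd", "ckd", "ckd", "notckd", "notckd", "notckd", "ckd", "notckd"]
accuracy = clustering_accuracy(toy_clusters, toy_labels)
```

With the cluster sizes reported above (175 + 225 = 400 observations), an accuracy of 81.25 percent corresponds to 325 observations whose cluster agrees with their class label, if accuracy is computed over all 400 observations as assumed here.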
Copyright (c) 2023 Zahra Rizky Fadilah, Arie Wahyu Wijayanto
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.