Comparison of Clustering Methods for Mixed-Type Data: One-Hot-Encoding, Gower Distance, and K-Prototype Based on Accuracy (Case Study: Chronic Kidney Disease Dataset)

  • Zahra Rizky Fadilah, Politeknik Statistika STIS
  • Arie Wahyu Wijayanto, Politeknik Statistika STIS
Keywords: Clustering, Gower Distance, K-Prototype, Mixed-Data Type, One-Hot-Encoding

Abstract

This study compares the one-hot-encoding method, Gower distance combined with the k-means, DBSCAN, and OPTICS algorithms, and k-prototype for clustering mixed-type data, based on accuracy. The dataset is the chronic kidney disease (CKD) dataset from the UCI Machine Learning Repository. Evaluated with the silhouette index, k-prototype with k=2 clusters performs best: its silhouette index of 0.3796 is higher than that of the other four methods. Cluster 1 contains 175 observations and cluster 2 contains 225. When the clusters are matched against the labels in the dataset, the clustering achieves an accuracy of 81.25 percent.
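The abstract does not spell out how cluster assignments were matched to the dataset labels to obtain an accuracy figure. One common way to do this, shown here only as a minimal sketch under that assumption, is to try every one-to-one mapping from cluster ids to class labels and keep the mapping with the most agreements. The function name `clustering_accuracy` and the toy data are hypothetical and are not taken from the study.

```python
from itertools import permutations


def clustering_accuracy(cluster_ids, true_labels):
    """Best accuracy over all one-to-one cluster-to-label mappings.

    Assumption: this is one common convention for scoring a clustering
    against known labels; the study may have used a different procedure.
    """
    if len(cluster_ids) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    clusters = sorted(set(cluster_ids))
    labels = sorted(set(true_labels))
    if len(clusters) > len(labels):
        raise ValueError("more clusters than label classes")

    best_hits = 0
    # Try each assignment of a distinct label to every cluster.
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy data for illustration only (not the CKD dataset).
toy_clusters = [1, 1, 1, 2, 2, 2, 2, 1]
toy_labels = ["ckd", "ckd", "notckd", "notckd", "notckd", "notckd", "ckd", "ckd"]
toy_accuracy = clustering_accuracy(toy_clusters, toy_labels)
```

With only two clusters, as in the reported k=2 result, the search covers just two candidate mappings, so exhaustive enumeration is cheap.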

Published
2023-07-31
How to Cite
[1]
Z. Fadilah and A. Wijayanto, “Perbandingan Metode Klasterisasi Data Bertipe Campuran: One-Hot-Encoding, Gower Distance, dan K-Prototype Berdasarkan Akurasi (Studi Kasus: Chronic Kidney Disease Dataset)”, JAIC, vol. 7, no. 1, pp. 63-73, Jul. 2023.
Section
Articles