Hyperparameter Optimization of CNN Based Open Set Speaker Verification Using MFCC and Speaker Embedding for Voice Biometric Security
DOI:
https://doi.org/10.30871/jaic.v10i3.13027
Keywords:
Audio Augmentation, CNN, Hyperparameter Optimization, MFCC, Speaker Embedding, Voice Biometrics
Abstract
The development of voice-based biometric security systems has increased the demand for authentication methods that operate accurately and securely in open-set speaker verification scenarios. In such scenarios, the system must not only recognize registered users but also reject unknown users who are not in the system database. This study focuses on hyperparameter optimization in a Convolutional Neural Network (CNN) embedding-based speaker verification system that uses Mel Frequency Cepstral Coefficient (MFCC) features and speaker embeddings. The optimization was conducted in several experimental stages: MFCC parameter tuning, CNN architecture tuning, embedding dimension tuning, and audio augmentation analysis. The dataset consisted of Indonesian speech recordings from 8 registered speakers and 1 unknown speaker, sampled at 16 kHz under controlled recording conditions. It was divided into training, enrollment, and testing subsets to support open-set evaluation and reduce data leakage. System performance was evaluated using accuracy, validation loss, False Acceptance Rate (FAR), False Rejection Rate (FRR), best threshold, and inference time. The best configuration combined the MFCC-C parameters (N_MFCC = 40, N_FFT = 1024, HOP_LENGTH = 256, N_MELS = 40), the CNN-E architecture with three convolution blocks (32-64-128), an embedding dimension of 64, and lightweight augmentation consisting of noise injection, pitch shifting, and time stretching. This configuration achieved stable performance, with a test accuracy of 96.43% and a FAR of 8.7%, while maintaining low computational complexity and real-time inference capability. The results also indicate that excessive augmentation may increase embedding overlap between speakers, thereby reducing system security performance.
However, the study was conducted on a limited-scale dataset and has not yet evaluated robustness against spoofing attacks, replay attacks, or adversarial synthesized-voice attacks. Overall, the study indicates that hyperparameter optimization influences the balance between accuracy, computational efficiency, and biometric security performance in lightweight CNN-based voice biometric authentication systems under limited-scale evaluation conditions.
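To make the evaluation metrics in the abstract concrete, the following minimal Python sketch shows one way FAR, FRR, and a "best threshold" can be computed from similarity scores in an open-set setting. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how embeddings are compared or how the best threshold was chosen, so cosine similarity, the threshold sweep that minimizes |FAR - FRR|, and all function names and score values below are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring rule)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def far_frr(genuine, impostor, threshold):
    """FAR: share of impostor trials accepted. FRR: share of genuine trials rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def best_threshold(genuine, impostor, steps=101):
    """Sweep thresholds in [0, 1] and keep the one where FAR and FRR are closest.

    Assumption: the selection criterion is |FAR - FRR| (an equal-error-style
    operating point), with FAR + FRR as a tie-breaker.
    """
    best = None
    for i in range(steps):
        t = i / (steps - 1)
        far, frr = far_frr(genuine, impostor, t)
        key = (abs(far - frr), far + frr)
        if best is None or key < best[0]:
            best = (key, t, far, frr)
    return best[1], best[2], best[3]


def verify(embedding, enrolled, threshold):
    """Open-set decision: return the closest enrolled speaker, or None if rejected."""
    name, score = max(
        ((n, cosine(embedding, ref)) for n, ref in enrolled.items()),
        key=lambda pair: pair[1],
    )
    return (name if score >= threshold else None), score


if __name__ == "__main__":
    # Made-up scores for illustration only; they are not results from the study.
    genuine_scores = [0.91, 0.88, 0.79, 0.95, 0.62]
    impostor_scores = [0.30, 0.45, 0.58, 0.71, 0.22]
    t, far, frr = best_threshold(genuine_scores, impostor_scores)
    print(f"threshold={t:.2f} FAR={far:.2%} FRR={frr:.2%}")

    # Toy 2-dimensional reference embeddings for two hypothetical enrolled speakers.
    enrolled = {"spk1": [1.0, 0.0], "spk2": [0.0, 1.0]}
    print(verify([0.9, 0.1], enrolled, t))  # close to spk1
    print(verify([0.7, 0.7], enrolled, 0.8))  # far from both, so rejected
```

Raising the threshold lowers FAR at the cost of a higher FRR, which is the accuracy-versus-security trade-off the abstract refers to. A deployment that prioritizes security might instead pick the threshold that keeps FAR below a fixed target.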
Copyright (c) 2026 Mirza Ardiana, Mat Syai’in, Alief Nur Aisyi Maulidhia, Aulia Rahma Annisa, Yudi Andika, Sholahuddin Muhammad Irsyad, Fauzan Izzul Haq

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.








