Hyperparameter Optimization of CNN Based Open Set Speaker Verification Using MFCC and Speaker Embedding for Voice Biometric Security

Mirza Ardiana; Mat Syai’in; Alief Nur Aisyi Maulidhia; Aulia Rahma Annisa; Yudi Andika; Sholahuddin Muhammad Irsyad; Fauzan Izzul Haq

doi:10.30871/jaic.v10i3.13027

Authors

Mirza Ardiana, Politeknik Perkapalan Negeri Surabaya
Mat Syai’in, Politeknik Perkapalan Negeri Surabaya
Alief Nur Aisyi Maulidhia, Politeknik Perkapalan Negeri Surabaya
Aulia Rahma Annisa, Politeknik Perkapalan Negeri Surabaya
Yudi Andika, Politeknik Perkapalan Negeri Surabaya
Sholahuddin Muhammad Irsyad, Politeknik Perkapalan Negeri Surabaya
Fauzan Izzul Haq, Politeknik Perkapalan Negeri Surabaya

Keywords:

Audio Augmentation, CNN, Hyperparameter Optimization, MFCC, Speaker Embedding, Voice Biometrics

Abstract

The development of voice-based biometric security systems has increased the demand for authentication methods that operate accurately and securely in open-set speaker verification scenarios. In such scenarios, the system must not only recognize registered users but also reject unknown users who are not in the system database. This study focuses on hyperparameter optimization of a Convolutional Neural Network (CNN) embedding-based speaker verification system that uses Mel Frequency Cepstral Coefficient (MFCC) features and speaker embeddings. The optimization was carried out in several experimental stages: MFCC parameter tuning, CNN architecture tuning, embedding dimension tuning, and audio augmentation analysis. The dataset consisted of Indonesian speech recordings from 8 registered speakers and 1 unknown speaker, sampled at 16 kHz under controlled recording conditions. It was divided into training, enrollment, and testing subsets to support open-set speaker verification evaluation and reduce data leakage. System performance was evaluated using accuracy, validation loss, False Acceptance Rate (FAR), False Rejection Rate (FRR), best threshold, and inference time. The experimental results show that the best configuration used the MFCC-C parameters (N_MFCC = 40, N_FFT = 1024, HOP_LENGTH = 256, N_MELS = 40), the CNN-E architecture with three convolution blocks (32-64-128), an embedding dimension of 64, and lightweight augmentation consisting of noise injection, pitch shifting, and time stretching. This configuration achieved stable system performance, with a test accuracy of 96.43% and a FAR of 8.7%, while maintaining lightweight computational complexity and real-time inference capability. The results also indicate that excessive augmentation may increase embedding overlap between speakers, thereby reducing system security performance.
However, the study was conducted on a limited-scale dataset and has not yet evaluated robustness against spoofing attacks, replay attacks, or adversarial synthesized-voice attacks. Overall, the study indicates that hyperparameter optimization influences the balance between accuracy, computational efficiency, and biometric security performance in lightweight CNN-based voice biometric authentication systems under limited-scale evaluation conditions.
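The abstract reports FAR, FRR, and a best threshold without stating how they are computed. The following is a minimal sketch of one common approach, not the authors' implementation. It assumes cosine similarity between a probe embedding and an enrolled template, and it assumes the "best" threshold is the candidate where FAR and FRR are closest. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) at a given threshold.

    FAR = impostor trials accepted / all impostor trials.
    FRR = genuine trials rejected / all genuine trials.
    """
    false_accepts = sum(s >= threshold for s in impostor_scores)
    false_rejects = sum(s < threshold for s in genuine_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


def best_threshold(genuine_scores, impostor_scores, candidates):
    """Pick the candidate with the smallest |FAR - FRR| (an assumed criterion)."""
    def gap(threshold):
        far, frr = far_frr(genuine_scores, impostor_scores, threshold)
        return abs(far - frr)
    return min(candidates, key=gap)


def verify(probe, template, threshold):
    """Open-set decision: accept only if the similarity reaches the threshold."""
    return cosine_similarity(probe, template) >= threshold


# Hypothetical similarity scores, for illustration only.
genuine = [0.91, 0.85, 0.78, 0.66]    # registered speaker vs. own template
impostor = [0.72, 0.55, 0.40, 0.31]   # unknown speaker vs. enrolled templates
threshold = best_threshold(genuine, impostor, [0.5, 0.6, 0.7, 0.8])
far, frr = far_frr(genuine, impostor, threshold)
print(f"threshold={threshold}, FAR={far:.2%}, FRR={frr:.2%}")
```

Raising the threshold lowers FAR and raises FRR, so a security-focused deployment might choose a stricter operating point than the balanced one selected here.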
