From Speech to Summary: A Pipeline-Based Evaluation of Whisper and Transformer Models for Indonesian Dialogue Summarization
DOI: https://doi.org/10.30871/jaic.v10i1.11826

Keywords: Speech summarization, Automatic Speech Recognition, Indonesian Language, Whisper, Zero-Shot Summarization, Benchmark

Abstract
The rapid increase in online meetings has produced massive amounts of undocumented spoken content, creating a practical need for automatic summarization. For Indonesian, this task is hindered by a dual-faceted resource scarcity and a lack of foundational benchmarks for pipeline components. This paper addresses this gap by creating a new synthetic conversational dataset for Indonesian and conducting two systematic, discrete benchmarks to identify the optimal components for an end-to-end pipeline. First, we evaluated six Whisper ASR model variants (from tiny to turbo) and found a clear, non-obvious winner: the turbo (distil-large-v2) model was not only the most accurate (7.97% WER) but also one of the fastest (1.25s inference), breaking the expected cost-accuracy trade-off. Second, we benchmarked 13 zero-shot summarization models on gold-standard transcripts, which revealed a critical divergence between lexical and semantic performance. Indonesian-specific models excelled at lexical overlap (ROUGE-1: 17.09 for cahya/t5-base...), while the multilingual google/long-t5-tglobal-base model was the clear semantic winner (BERTScore F1: 67.09).
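The two headline numbers above (7.97% WER for ASR, ROUGE-1 of 17.09 for summarization) follow standard definitions: WER is the word-level edit distance between the reference transcript and the ASR output divided by the reference length, and ROUGE-1 is the unigram-overlap F-measure between a reference and a candidate summary (papers usually report it scaled by 100). As a minimal sketch of both metrics — not the paper's actual evaluation code, which likely uses tooling with stemming and tokenization options — the calculations look like this:

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table for edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions turn ref[:i] into empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions turn empty reference into hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F-measure: clipped unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(cand_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical Indonesian example: one substitution + one deletion over 5 words.
print(wer("rapat dimulai pukul sembilan pagi", "rapat mulai pukul sembilan"))  # 0.4
print(rouge1_f1("rapat dimulai pukul sembilan", "rapat pukul sembilan"))
```

Note that a reported WER of 7.97% corresponds to a returned value of 0.0797 here, and that real ROUGE implementations add options (stemming, ROUGE-L, bootstrap confidence intervals) omitted from this sketch.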
License
Copyright (c) 2026 Martin Clinton Tosima Manullang, Winda Yulita, Fathan Andi Kartagama, A. Edwin Krisandika Putra

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.