From Speech to Summary: A Pipeline-Based Evaluation of Whisper and Transformer Models for Indonesian Dialogue Summarization

Authors

  • Martin Clinton Tosima Manullang, Informatics Engineering, Faculty of Industrial Technology, Institut Teknologi Sumatera
  • Winda Yulita, Informatics Engineering, Faculty of Industrial Technology, Institut Teknologi Sumatera
  • Fathan Andi Kartagama, Informatics Engineering, Faculty of Industrial Technology, Institut Teknologi Sumatera
  • A. Edwin Krisandika Putra, Informatics Engineering, Faculty of Industrial Technology, Institut Teknologi Sumatera

DOI:

https://doi.org/10.30871/jaic.v10i1.11826

Keywords:

Speech Summarization, Automatic Speech Recognition, Indonesian Language, Whisper, Zero-Shot Summarization, Benchmark

Abstract

The rapid increase in online meetings has produced massive amounts of undocumented spoken content, creating a practical need for automatic summarization. For Indonesian, this task is hindered by a dual-faceted resource scarcity and a lack of foundational benchmarks for pipeline components. This paper addresses this gap by creating a new synthetic conversational dataset for Indonesian and conducting two systematic, discrete benchmarks to identify the optimal components for an end-to-end pipeline. First, we evaluated six Whisper ASR model variants (from tiny to turbo) and found a clear but unexpected winner: the turbo (distil-large-v2) model was not only the most accurate (7.97% WER) but also one of the fastest (1.25 s inference time), breaking the expected cost-accuracy trade-off. Second, we benchmarked 13 zero-shot summarization models on gold-standard transcripts, which revealed a critical divergence between lexical and semantic performance. Indonesian-specific models excelled at lexical overlap (ROUGE-1: 17.09 for cahya/t5-base...), while the multilingual google/long-t5-tglobal-base model was the clear semantic winner (BERTScore F1: 67.09).
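
As a rough illustration of the two-stage pipeline and the metrics named above, the sketch below wires a Whisper ASR checkpoint to a zero-shot summarizer via the Hugging Face transformers pipeline API and scores the outputs with WER (jiwer), ROUGE-1 (rouge-score), and BERTScore (bert-score). It is a minimal sketch, not the authors' experimental code: the audio and reference file names are hypothetical, and the checkpoints shown are placeholders for whichever of the benchmarked variants is under test.

```python
# Minimal sketch of the speech-to-summary evaluation pipeline (not the authors' code).
# Assumes: pip install transformers torch jiwer rouge-score bert-score
# File names and model checkpoints below are illustrative placeholders.
from transformers import pipeline
from jiwer import wer
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Stage 1: Whisper ASR. Swap the checkpoint for each of the benchmarked variants.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",  # placeholder variant
    generate_kwargs={"language": "indonesian", "task": "transcribe"},
)
hypothesis = asr("meeting_audio.wav")["text"]  # hypothetical audio file

# ASR accuracy: word error rate against the gold transcript (lower is better).
gold_transcript = open("gold_transcript.txt", encoding="utf-8").read()
print("WER:", wer(gold_transcript, hypothesis))

# Stage 2: zero-shot summarization, run on the gold transcript as in the paper.
summarizer = pipeline("summarization", model="google/long-t5-tglobal-base")
summary = summarizer(gold_transcript, max_length=128, min_length=32,
                     truncation=True)[0]["summary_text"]

# Summary quality: lexical overlap (ROUGE-1 F1) and semantic similarity (BERTScore F1).
reference_summary = open("reference_summary.txt", encoding="utf-8").read()
rouge1 = rouge_scorer.RougeScorer(["rouge1"]).score(reference_summary, summary)["rouge1"].fmeasure
_, _, f1 = bert_score([summary], [reference_summary], lang="id")
print("ROUGE-1:", rouge1, "BERTScore F1:", f1.mean().item())
```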

Published

2026-02-04

How to Cite

[1] M. C. T. Manullang, W. Yulita, F. A. Kartagama, and A. E. K. Putra, “From Speech to Summary: A Pipeline-Based Evaluation of Whisper and Transformer Models for Indonesian Dialogue Summarization,” JAIC, vol. 10, no. 1, pp. 522–534, Feb. 2026.
