A Comparative Analysis of Character and Word-Based Tokenization for Kawi-Indonesian Neural Machine Translation
DOI:
https://doi.org/10.30871/jaic.v9i6.11283
Keywords:
Transformer, Low-Resource Language, Computational Linguistics, FLAN-T5
Abstract
Preserving regional languages is a strategic step toward safeguarding cultural heritage while expanding access to knowledge across generations. One approach that can support this effort is the application of automatic translation technology to digitize and study local-language texts. This study compares two tokenization strategies, word-based and character-based, for a Kawi–Indonesian translation model built on the FLAN-T5-Small Transformer architecture. The dataset consists of 4,987 preprocessed sentence pairs, and the models were trained for 10 epochs with a batch size of 8. Statistical analysis shows that Kawi sentences average 39.6 characters (5.4 words), while Indonesian sentences average 54.9 characters (7.5 words). These figures suggest that Kawi sentences tend to be lexically dense, with low word repetition and high morphological variation, which can increase the learning complexity of the model. Evaluation with BLEU and METEOR shows that the word-based model achieved a BLEU score of 0.45 and a METEOR score of 0.05, while the character-based model achieved a BLEU score of 0.24 and a METEOR score of 0.04. Although the dataset is larger than in previous studies, these results indicate that the additional data is not yet sufficient to overcome the limitations in the semantic representation of Kawi. This study therefore serves as an initial baseline that can be extended through subword tokenization, dataset expansion, and training-strategy optimization to improve the quality of local-language translation in the future.
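As a concrete illustration of the comparison described in the abstract, the short Python sketch below tokenizes a hypothetical Indonesian reference/hypothesis pair at the word level and at the character level and scores each with BLEU and METEOR via NLTK (assuming NLTK and its WordNet data are installed). The sentence pair and helper functions are illustrative assumptions, not the study's code or data; the study itself fine-tunes FLAN-T5-Small on the Kawi–Indonesian corpus.

```python
# Minimal sketch (not the authors' code): contrast the two tokenization
# granularities compared in the study and score a toy prediction with BLEU
# and METEOR using NLTK's implementations. The sentence pair and tokenizer
# helpers are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs NLTK's WordNet data


def word_tokens(sentence: str) -> list[str]:
    """Word-based tokenization: split on whitespace."""
    return sentence.split()


def char_tokens(sentence: str) -> list[str]:
    """Character-based tokenization: every character (including spaces) is a token."""
    return list(sentence)


# Hypothetical Indonesian reference translation and model output.
reference = "dia pergi ke hutan itu"
hypothesis = "dia pergi ke hutan"

for name, tokenize in (("word", word_tokens), ("char", char_tokens)):
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # Smoothing avoids a zero BLEU score when a higher-order n-gram has no match.
    bleu = sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref], hyp)
    print(f"{name}-based: {len(hyp)} tokens, BLEU={bleu:.2f}, METEOR={meteor:.2f}")
```

Note that in practice both systems are usually scored on a common (typically word-level) tokenization of their detokenized outputs so that the metric values remain comparable across tokenization schemes; the abstract does not specify the scoring tokenization used in the study.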
License
Copyright (c) 2025 I Gede Bintang Arya Budaya, I Gede Putra Mas Yusadara

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).