A Comparative Analysis of Character and Word-Based Tokenization for Kawi-Indonesian Neural Machine Translation
DOI:
https://doi.org/10.30871/jaic.v9i6.11283
Keywords:
Transformer, Low-Resource Language, Computational Linguistics, FLAN-T5
Abstract
Preserving regional languages is a strategic step toward safeguarding cultural heritage while expanding access to knowledge across generations. One approach that can support this effort is the application of automatic translation technology to digitize and study local-language texts. This study compares two tokenization strategies, word-based and character-based, for a Kawi–Indonesian translation model built on the FLAN-T5-Small Transformer architecture. The dataset consists of 4,987 preprocessed sentence pairs, and the models were trained for 10 epochs with a batch size of 8. Statistical analysis shows that Kawi sentences average 39.6 characters (5.4 words), while Indonesian sentences average 54.9 characters (7.5 words). These figures suggest that Kawi sentences tend to be lexically dense, with low word repetition and high morphological variation, which can increase the learning complexity of the model. Evaluation with BLEU and METEOR shows that the word-based model achieved a BLEU score of 0.45 and a METEOR score of 0.05, while the character-based model achieved a BLEU score of 0.24 and a METEOR score of 0.04. Although the dataset is larger than in previous studies, these results indicate that the additional data is not yet sufficient to overcome the limitations in the semantic representation of Kawi. This study therefore serves as an initial baseline that can be extended through subword tokenization, dataset expansion, and training-strategy optimization to improve the quality of local-language translation in the future.
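As a concrete illustration of the comparison described in the abstract, the short Python sketch below tokenizes a hypothetical Indonesian reference/hypothesis pair at the word level and at the character level and scores each with BLEU and METEOR via NLTK (assuming NLTK and its WordNet data are installed). The sentence pair and helper functions are illustrative assumptions, not the study's code or data; the study itself fine-tunes FLAN-T5-Small on the Kawi–Indonesian corpus.

```python
# Minimal sketch (not the authors' code): contrast the two tokenization
# granularities compared in the study and score a toy prediction with BLEU
# and METEOR using NLTK's implementations. The sentence pair and tokenizer
# helpers are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs NLTK's WordNet data


def word_tokens(sentence: str) -> list[str]:
    """Word-based tokenization: split on whitespace."""
    return sentence.split()


def char_tokens(sentence: str) -> list[str]:
    """Character-based tokenization: every character (including spaces) is a token."""
    return list(sentence)


# Hypothetical Indonesian reference translation and model output.
reference = "dia pergi ke hutan itu"
hypothesis = "dia pergi ke hutan"

for name, tokenize in (("word", word_tokens), ("char", char_tokens)):
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # Smoothing avoids a zero BLEU score when a higher-order n-gram has no match.
    bleu = sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref], hyp)
    print(f"{name}-based: {len(hyp)} tokens, BLEU={bleu:.2f}, METEOR={meteor:.2f}")
```

Note that in practice both systems are usually scored on a common (typically word-level) tokenization of their detokenized outputs so that the metric values remain comparable across tokenization schemes; the abstract does not specify the scoring tokenization used in the study.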
License
Copyright (c) 2025 I Gede Bintang Arya Budaya, I Gede Putra Mas Yusadara

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).