Performance Comparison of Embeddings and Keyword Selection Methods in Enterprise Document

Authors

  • Putri Cristin, Institut Teknologi Sepuluh Nopember Surabaya
  • Brenda Natalia, Institut Teknologi Sepuluh Nopember
  • Joseph Clio Limantara, Institut Teknologi Sepuluh Nopember
  • Sarwosri, Institut Teknologi Sepuluh Nopember

DOI:

https://doi.org/10.30871/jaic.v9i4.9971

Keywords:

KeyBERT, Embedding Models, Enterprise Documents, Keyword Extraction, Keyword Selection Method

Abstract

Keyword extraction is widely used in domains such as social media and e-commerce, but its application to enterprise document retrieval remains limited. Most organizations still rely on structured systems or rule-based approaches for indexing, which often lack semantic understanding and scale poorly. Although techniques such as TextRank and RAKE have been explored, few studies assess their effectiveness for operational document retrieval in institutional settings, which leaves a research gap. This study investigates the use of KeyBERT to extract keywords from university documents, including standard operating procedures (SOPs), manuals, and guidelines. KeyBERT uses transformer-based embeddings to generate semantically relevant keywords and was chosen for its ease of use, its flexibility in the choice of embedding model, and its independence from labeled data. It also supports diversification strategies, namely Maximal Marginal Relevance (MMR) and MaxSum, that reduce redundancy and increase keyword variety. We evaluate six embedding models, each combined with three keyword selection methods: cosine similarity, MMR, and MaxSum. The best F1 score, 0.78, is achieved by cosine similarity with the paraphrase-MiniLM-L3-v2 model, at an average extraction time of 184.02 seconds. These findings highlight the effectiveness of combining lightweight embeddings with strategic keyword selection for enterprise-scale document indexing.
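
To make the difference between plain cosine ranking and MMR concrete, the following is a minimal sketch in pure Python. It is not the authors' pipeline and not KeyBERT's code: the function names (`top_k_cosine`, `mmr_select`, `f1_score`), the two-dimensional toy vectors, and the `diversity` weight of 0.5 are all assumptions made for illustration. The F1 function assumes exact-match comparison against a gold keyword set, which may differ from the evaluation protocol used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_cosine(doc_vec, cand_vecs, k):
    """Rank candidates by similarity to the document and keep the top k."""
    scores = {w: cosine(doc_vec, v) for w, v in cand_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def mmr_select(doc_vec, cand_vecs, k, diversity=0.5):
    """Greedy MMR: trade relevance to the document against redundancy
    with keywords that were already selected."""
    if k <= 0 or not cand_vecs:
        return []
    relevance = {w: cosine(doc_vec, v) for w, v in cand_vecs.items()}
    # Start from the single most relevant candidate.
    selected = [max(relevance, key=relevance.get)]
    remaining = [w for w in cand_vecs if w != selected[0]]
    while remaining and len(selected) < k:
        def mmr_score(w):
            redundancy = max(cosine(cand_vecs[w], cand_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[w] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


def f1_score(predicted, gold):
    """Exact-match F1 between predicted and gold keyword sets (assumed protocol)."""
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy embeddings: two near-duplicate phrases and one distinct phrase.
doc_vec = [1.0, 0.2]
cand_vecs = {
    "leave request": [1.0, 0.1],
    "leave requests": [1.0, 0.12],
    "approval form": [0.6, 0.8],
}

print(top_k_cosine(doc_vec, cand_vecs, 2))
print(mmr_select(doc_vec, cand_vecs, 2, diversity=0.5))
```

On this toy input, cosine ranking keeps both near-duplicate phrases, while MMR replaces the second one with the less similar but non-redundant candidate. That is the redundancy-versus-relevance trade-off the abstract refers to; which side of it yields the better F1 depends on how the gold keywords were chosen.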

Author Biographies

Brenda Natalia, Institut Teknologi Sepuluh Nopember

Teknik Informatika, Institut Teknologi Sepuluh Nopember

Joseph Clio Limantara, Institut Teknologi Sepuluh Nopember

Teknik Informatika, Institut Teknologi Sepuluh Nopember

Sarwosri, Institut Teknologi Sepuluh Nopember

Teknik Informatika, Institut Teknologi Sepuluh Nopember

Published

2025-08-03

How to Cite

[1] P. Cristin, B. Natalia, J. C. Limantara, and Sarwosri, “Performance Comparison of Embeddings and Keyword Selection Methods in Enterprise Document,” JAIC, vol. 9, no. 4, pp. 1254–1265, Aug. 2025.

Section

Articles
