Scientific Paper Recommendation System: Application of Sentence Transformers and Cosine Similarity Using arXiv Data
DOI:
https://doi.org/10.30871/jaic.v9i4.9766
Keywords:
Recommendation System, Sentence Transformer, Cosine Similarity, arXiv, Semantic Similarity
Abstract
The rapid growth of academic publications makes searching for relevant scientific literature increasingly difficult. This research develops a scientific paper recommendation system based on semantic similarity, using a Sentence Transformer (the all-MiniLM-L6-v2 model) and cosine similarity on an arXiv dataset of 15,504 Computer Science papers. The system is implemented as an interactive Streamlit web application that accepts user queries and recommends related papers ranked by semantic similarity. Performance evaluation using Precision, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) showed that embedding text from the Introduction section without pre-processing yielded the best performance (NDCG: 0.7590; MAP: 0.6960; MRR: 0.7254), outperforming the Abstract-based and combined-text approaches. A user test with 45 respondents supported the effectiveness of the system: 95.5% expressed satisfaction with the relevance of the recommendations, and 93.3% reported a significant reduction in manual search time. The findings indicate that, among the configurations tested, retaining the raw text structure of the Introduction gives the best semantic representation. Suggestions for further development include expanding the dataset to multiple domains and optimizing the transformer model to improve accuracy.
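The ranking and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors below are toy stand-ins for sentence embeddings (in the described system these would come from a Sentence Transformer such as all-MiniLM-L6-v2), and the paper ids, the relevance set, and all function names are hypothetical. Average precision is normalized here by the number of relevant papers, which is one common convention and an assumption about the metric definition.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_papers(query_vec, paper_vecs, top_k=3):
    """Return the top_k paper ids by descending cosine similarity."""
    scored = [(pid, cosine_similarity(query_vec, vec))
              for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored[:top_k]]

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant paper, or 0 if none is found."""
    for position, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / position
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@k over the positions k that hold a relevant paper."""
    hits, total = 0, 0.0
    for position, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked, relevant):
    """Binary-relevance NDCG with a log2 position discount."""
    dcg = sum(1.0 / math.log2(position + 1)
              for position, pid in enumerate(ranked, start=1)
              if pid in relevant)
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(position + 1)
               for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy embeddings (assumption): real ones would be 384-dimensional model outputs.
paper_vecs = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.8, 0.6, 0.0],
    "p3": [0.0, 1.0, 0.0],
    "p4": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]
relevant = {"p2", "p3"}  # hypothetical ground-truth judgments for this query

ranked = rank_papers(query_vec, paper_vecs, top_k=3)
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
score = ndcg(ranked, relevant)
print(ranked, rr, ap, score)
```

Averaging the reciprocal rank and the average precision over a set of test queries would give MRR and MAP respectively; the sketch evaluates a single query only.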
License
Copyright (c) 2025 Ananda Pannadhika Putra, Desy Purnami Singgih Putri, AA.Kt.Agung Cahyawan Wiranatha

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access).








