Scientific Paper Recommendation System: Application of Sentence Transformers and Cosine Similarity Using arXiv Data

Authors

  • Ananda Pannadhika Putra, Teknologi Informasi, Universitas Udayana
  • Desy Purnami Singgih Putri, Teknologi Informasi, Universitas Udayana
  • AA.Kt.Agung Cahyawan Wiranatha, Teknologi Informasi, Universitas Udayana

DOI:

https://doi.org/10.30871/jaic.v9i4.9766

Keywords:

Recommendation System, Sentence Transformer, Cosine Similarity, arXiv, Semantic Similarity

Abstract

Finding relevant scientific literature is increasingly difficult as the volume of academic publications grows. This research develops a scientific paper recommendation system based on semantic similarity, using a Sentence Transformer (the all-MiniLM-L6-v2 model) and cosine similarity on an arXiv dataset of 15,504 Computer Science papers. The system is implemented as an interactive Streamlit web application that accepts user queries and recommends related papers based on semantic similarity. Performance evaluation using Precision, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) showed that embedding text from the Introduction section without pre-processing yielded the best performance (NDCG: 0.7590; MAP: 0.6960; MRR: 0.7254), outperforming the Abstract-based and combined-text approaches. A user test with 45 respondents supported the system's effectiveness: 95.5% expressed satisfaction with the relevance of the recommendations, and 93.3% reported a significant reduction in manual search time. The findings indicate that retaining the raw text of the Introduction gave the best semantic representation among the configurations tested. Suggestions for further development include expanding the dataset to multiple domains and optimizing the transformer model to improve accuracy.
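
The retrieval and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy three-dimensional vectors, and the binary relevance labels are all assumptions made for illustration. In the actual system the vectors would be all-MiniLM-L6-v2 sentence embeddings of the query and of each paper's text, and the exact metric definitions (for example, rank cutoffs) used in the paper are not stated here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_papers(query_vec, paper_vecs, top_k=3):
    """Return the top_k paper ids by descending cosine similarity to the query."""
    scored = [(pid, cosine_similarity(query_vec, vec))
              for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored[:top_k]]

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for i, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i where a relevant item appears."""
    hits, total = 0, 0.0
    for i, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked, relevant):
    """NDCG with binary gains (assumption: a paper is relevant or not)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, pid in enumerate(ranked, start=1) if pid in relevant)
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Toy stand-ins for embeddings (hypothetical values, not real model output).
query = [1.0, 0.0, 0.0]
papers = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 1.0, 0.0],
    "p3": [0.7, 0.7, 0.0],
    "p4": [0.0, 0.0, 1.0],
}
relevant = {"p1", "p2"}  # hypothetical ground-truth labels for this query

ranked = rank_papers(query, papers, top_k=3)
print("ranking:", ranked)
print("RR:", reciprocal_rank(ranked, relevant))
print("AP:", average_precision(ranked, relevant))
print("NDCG:", ndcg(ranked, relevant))
```

MRR and MAP are then the means of the per-query reciprocal rank and average precision over all evaluation queries. Since every embedding is compared against the query independently, the paper embeddings can be computed once and reused across queries.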

Published

2025-08-05

How to Cite

A. P. Putra, D. P. Singgih Putri, and A. C. Wiranatha, “Scientific Paper Recommendation System: Application of Sentence Transformers and Cosine Similarity Using arXiv Data,” JAIC, vol. 9, no. 4, pp. 1374–1382, Aug. 2025.
