Identification of Source Code Plagiarism Using a Natural Language Processing (NLP) Approach Based on Code Writing Style Analysis
DOI: https://doi.org/10.30871/jaic.v9i6.11206
Keywords: Code Plagiarism, CodeBERT, Siamese Network, Deep Learning, Source Code Plagiarism, Identification
Abstract
Source code plagiarism identification requires a system that detects semantic similarity rather than mere textual resemblance. This study used a dataset of 1,000 source code files collected from GitHub repositories; after cleaning, 996 individual code samples remained. The dataset covered several programming languages (Python, Java, JavaScript, TypeScript, C++) and was divided into 697 training, 149 validation, and 149 testing samples. The model was CodeBERT, configured with a hidden size of 768, 12 layers, and 12 attention heads. CodeBERT generated a vector embedding for each code sample; a Siamese Network then projected these embeddings, and cosine similarity was computed between code pairs. In testing, a threshold of 0.80 was used to classify plagiarism. The system achieved an accuracy of 96.4%, precision of 95.2%, recall of 97.8%, F1-score of 96.4%, and an error rate of 4.6%. It produced similarity scores and status labels of “plagiarism detected” or “not detected,” demonstrating the effectiveness of the CodeBERT-based approach for adaptive and intelligent code similarity identification.
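The decision step described in the abstract (cosine similarity between two projected embeddings, compared against a threshold of 0.80) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the short toy vectors standing in for 768-dimensional CodeBERT embeddings, the choice to treat a score exactly equal to the threshold as plagiarism, and the definition of error rate as 1 minus accuracy are all assumptions made here.

```python
import math

# Threshold reported in the abstract.
PLAGIARISM_THRESHOLD = 0.80


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def classify_pair(emb_a, emb_b, threshold=PLAGIARISM_THRESHOLD):
    """Return (similarity score, status label) for a pair of embeddings."""
    score = cosine_similarity(emb_a, emb_b)
    # Assumption: a score equal to the threshold counts as plagiarism.
    label = "plagiarism detected" if score >= threshold else "not detected"
    return score, label


def evaluation_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumption: error rate is defined here as 1 - accuracy.
        "error_rate": 1.0 - accuracy,
    }


if __name__ == "__main__":
    # Toy 2-dimensional vectors, purely illustrative.
    print(classify_pair([3.0, 4.0], [4.0, 3.0]))
    print(classify_pair([1.0, 0.0], [0.0, 1.0]))
    # Hypothetical confusion counts, not taken from the study.
    print(evaluation_metrics(tp=8, fp=2, tn=9, fn=1))
```

In a full pipeline the two input vectors would come from the Siamese projection of CodeBERT embeddings; plain lists are used here so the sketch runs without any model download.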
Copyright (c) 2025 Muhammad Ilham Akbar, Novita Kurnia Ningrum

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License (Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).








