Identification of Source Code Plagiarism Using a Natural Language Processing (NLP) Approach Based on Code Writing Style Analysis

Authors

  • Muhammad Ilham Akbar Universitas Dian Nuswantoro
  • Novita Kurnia Ningrum Universitas Dain Nuswantoro

DOI:

https://doi.org/10.30871/jaic.v9i6.11206

Keywords:

Code Plagiarism, CodeBERT, Siamese Network, Deep Learning, Source Code Plagiarism, Identification

Abstract

Source code plagiarism identification requires a system capable of identifying semantic similarity rather than mere textual resemblance. This study used a dataset of 1,000 source code files collected from GitHub repositories, which yielded 996 individual code samples after cleaning. The dataset covered several programming languages (Python, Java, JavaScript, TypeScript, C++) and was divided into 697 training samples, 149 validation samples, and 149 testing samples. The model employed was CodeBERT, configured with a hidden size of 768, 12 layers, and 12 attention heads. CodeBERT generated a vector embedding for each code sample; these embeddings were then projected by a Siamese Network, and cosine similarity was calculated between code pairs. Testing used a similarity threshold of 0.80 to classify plagiarism. The identification results achieved an accuracy of 96.4%, precision of 95.2%, recall of 97.8%, F1-score of 96.4%, and an error rate of 4.6%. The system produced similarity scores and status labels of “plagiarism detected” or “not detected,” demonstrating the effectiveness of the CodeBERT-based approach for adaptive and intelligent code similarity identification.
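
The decision step described in the abstract (cosine similarity between two embeddings, a 0.80 threshold, and the reported evaluation metrics) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the short vectors stand in for 768-dimensional CodeBERT embeddings after the Siamese projection, treating a score equal to the threshold as plagiarism is an assumption, and defining the error rate as 1 − accuracy is an assumption about how the paper computes it.

```python
import math
from typing import Dict, Sequence, Tuple

THRESHOLD = 0.80  # decision threshold reported in the abstract


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pair(a: Sequence[float], b: Sequence[float],
                  threshold: float = THRESHOLD) -> Tuple[float, str]:
    """Return the similarity score and a status label for one code pair."""
    score = cosine_similarity(a, b)
    # Assumption: scores at or above the threshold count as plagiarism.
    status = "plagiarism detected" if score >= threshold else "not detected"
    return score, status


def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumption: error rate taken as the complement of accuracy.
        "error_rate": 1.0 - accuracy,
    }


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for projected embeddings.
    original = [1.0, 0.0, 1.0]
    near_copy = [1.0, 0.1, 0.9]
    unrelated = [0.0, 1.0, 0.0]
    print(classify_pair(original, near_copy))
    print(classify_pair(original, unrelated))
```

In a full pipeline the toy vectors would be replaced by the projected embeddings of two code samples, and the confusion-matrix counts would come from comparing the status labels against ground-truth pair labels on the test split.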

Published

2025-12-05

How to Cite

[1]
M. I. Akbar and N. K. Ningrum, “Identification of Source Code Plagiarism Using a Natural Language Processing (NLP) Approach Based on Code Writing Style Analysis”, JAIC, vol. 9, no. 6, pp. 3079–3086, Dec. 2025.
