Implementation of the Tomek-Link Approach in Machine Learning Models for ABO3 Perovskite Bandgap Classification
DOI: https://doi.org/10.30871/jaic.v10i2.12318
Keywords: Band Gap, Class Imbalance, Machine Learning, Perovskite Oxide, Tomek Links
Abstract
ABO₃ perovskite oxides vary widely in their electronic properties, particularly in band gap type (direct vs. indirect), which is crucial for optoelectronic applications. Determining the band gap type experimentally or with density functional theory (DFT) calculations is costly and computationally demanding, making machine learning (ML) a more efficient alternative. However, the imbalanced class distribution of the perovskite oxide dataset (84% direct and 16% indirect after data cleaning) can bias models toward the majority class. This study evaluates the effect of Tomek Links, an undersampling technique that cleans the decision boundary, on band gap type classification with four models: Multi-Layer Perceptron (MLP), Gradient Boosting, CatBoost, and Extra Trees. The dataset consists of 3,469 samples with six numerical predictor features and a binary classification target. Tomek Links is applied exclusively to the training data within a controlled ML pipeline that includes feature standardization and 5-fold stratified cross-validation. Applying Tomek Links improves minority-class (indirect) recall by up to 9% and macro F1-score by up to 0.019 over the baseline, with minimal change in overall accuracy. Feature importance analysis identifies average ionic character as the primary determinant of classification, consistent with band structure theory. These findings indicate that Tomek Links is effective as a decision boundary cleaning mechanism that reduces bias toward the majority class and improves model sensitivity in data-driven materials exploration.
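To make the resampling step concrete, the following is a minimal pure-Python sketch of Tomek-link undersampling. It is not the authors' implementation: the function names (`nearest_neighbor`, `tomek_links`, `remove_majority_in_links`), the two-feature toy data, and the label coding (0 = direct, majority; 1 = indirect, minority) are assumptions made for illustration only. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor; the sketch drops the majority-class member of each such pair.

```python
import math


def nearest_neighbor(i, X):
    """Index of the sample closest to X[i] (Euclidean distance), excluding i."""
    best_j, best_d = -1, math.inf
    for j, x in enumerate(X):
        if j == i:
            continue
        d = math.dist(X[i], x)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(X, y):
    """Pairs (i, j) with i < j that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(X, y):
        for k in (i, j):
            if y[k] == majority_label:
                drop.add(k)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy training data (hypothetical): 0 = direct (majority), 1 = indirect (minority).
X = [(0.0, 0.0), (0.2, 0.1), (0.0, 0.4), (1.0, 1.0),
     (1.1, 1.0), (2.0, 2.0), (2.2, 2.1)]
y = [0, 0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
X_res, y_res = remove_majority_in_links(X, y, majority_label=0)
print(links)   # [(3, 4)]
print(y_res)   # [0, 0, 0, 1, 1, 1]
```

In a cross-validated setting like the one the abstract describes, a function of this kind would be called on each training fold only, after which the model is fitted on the cleaned fold and scored on the untouched validation fold. The brute-force neighbor search is quadratic in the number of samples, which is acceptable for a few thousand samples but would normally be replaced by a library implementation.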
Copyright (c) 2026 Johana Oktavia Ramadhani, Aliyah Zahratu Rizqi, Desvita Maharani, Muhamad Akrom

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.