Implementation of the Tomek-Link Approach in Machine Learning Models for ABO3 Perovskite Bandgap Classification

Authors

  • Johana Oktavia Ramadhani, Universitas Dian Nuswantoro
  • Aliyah Zahratu Rizqi, Universitas Dian Nuswantoro
  • Desvita Maharani, Universitas Dian Nuswantoro
  • Muhamad Akrom, Universitas Dian Nuswantoro

DOI:

https://doi.org/10.30871/jaic.v10i2.12318

Keywords:

Band Gap, Class Imbalance, Machine Learning, Perovskite Oxide, Tomek Links

Abstract

ABO₃ perovskite oxides vary widely in their electronic properties, particularly in band gap type (direct vs. indirect), which is crucial for optoelectronic applications. Determining the band gap type experimentally or through density functional theory (DFT) calculations is costly and computationally demanding, making machine learning (ML) a more efficient alternative. However, the class distribution in the perovskite oxide dataset is imbalanced (84% direct and 16% indirect after data cleaning), which can bias models towards the majority class. This study evaluates the effect of Tomek Links, an undersampling technique that cleans the decision boundary, on band gap type classification with Multi-Layer Perceptron (MLP), Gradient Boosting, CatBoost, and Extra Trees models. The dataset consists of 3,469 samples with six numerical predictor features and a binary classification target. Tomek Links is applied exclusively to the training data within a controlled ML pipeline that includes feature standardization and 5-fold stratified cross-validation. Applying Tomek Links improves minority-class (indirect) recall by up to 9% and macro F1-score by up to 0.019 relative to the baseline, with minimal change in overall accuracy. Feature importance analysis identifies average ionic character as the primary determinant of classification, consistent with band structure theory. These findings confirm that Tomek Links is effective as a decision-boundary cleaning mechanism that reduces bias towards the majority class and improves model sensitivity in data-driven materials exploration.
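
To make the undersampling step concrete, the following is a minimal sketch of Tomek-link detection and majority-class removal, written with the Python standard library only. It is not the authors' implementation, which the abstract does not describe in detail. The function names (`nearest_neighbor`, `tomek_links`, `undersample_majority`) and the two-feature toy data are hypothetical and chosen purely for illustration. A Tomek link is taken here to be a pair of samples from different classes that are each other's nearest neighbor under Euclidean distance, and only the majority-class member of each link is dropped.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def nearest_neighbor(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) with i < j that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def undersample_majority(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(X, y):
        drop.add(i if y[i] == majority_label else j)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Hypothetical toy training fold (two features, illustration only).
X_train = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0),
           (1.1, 1.0), (3.0, 3.0), (3.2, 3.1)]
y_train = ["direct", "direct", "direct",
           "indirect", "indirect", "indirect"]

links = tomek_links(X_train, y_train)
X_res, y_res = undersample_majority(X_train, y_train, "direct")
print(links)
print(y_res)
```

In a cross-validated setting, a function like `undersample_majority` would be called on the training portion of each fold only, leaving the validation portion untouched, which matches the abstract's statement that Tomek Links is applied exclusively to the training data. In practice a maintained implementation such as the `TomekLinks` sampler in imbalanced-learn would normally be preferred over this brute-force sketch.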


Published

2026-04-22

How to Cite

[1]
J. O. Ramadhani, A. Z. Rizqi, D. Maharani, and M. Akrom, “Implementation of the Tomek-Link Approach in Machine Learning Models for ABO3 Perovskite Bandgap Classification”, JAIC, vol. 10, no. 2, pp. 1788–1798, Apr. 2026.

Section

Articles
