Gaussian Mixture-Based Data Augmentation Improves QSAR Prediction of Corrosion Inhibition Efficiency
DOI: https://doi.org/10.30871/jaic.v9i5.10895
Keywords: Corrosion Inhibition, Data Augmentation, Gaussian Mixture Model, Machine Learning, Small Data
Abstract
Predicting corrosion inhibition efficiency (IE, %) is often hindered by small, heterogeneous datasets. This study proposes a Gaussian mixture-based data augmentation pipeline to strengthen QSAR generalization under data scarcity. A curated set of 70 drug-like compounds with 14 physicochemical and quantum descriptors was cleaned, split 90/10 (train/test), and transformed using a Quantile Transformer followed by a Robust Scaler. A Gaussian mixture model (GMM) with 2–5 components, selected by the variational lower bound, was fitted to the transformed training features and used to generate up to 2,500 synthetic samples. Eight regressors (Gaussian Process, Decision Tree, Random Forest, Bagging, Gradient Boosting, Extra Trees, SVR, and KNN) were evaluated on the held-out test set using R² and RMSE. Augmentation improved performance across several model families. For example, Gaussian Process R² improved from −1.54 to 0.54 (RMSE 11.71 to 5.01) and Decision Tree R² from −0.33 to 0.63 (RMSE 8.48 to 4.44), while Bagging and Random Forest showed R² increases of 0.67 and 0.40, respectively. The optimal synthetic sample size varied by model.
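The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' code, and it rests on several assumptions that the abstract does not settle: the data are random placeholders (not the study's 70 compounds), the mixture is scikit-learn's BayesianGaussianMixture compared across 2–5 components through its lower_bound_ attribute, and the mixture is fitted on the transformed features joined with the target so that every synthetic row carries an IE value. The synthetic size of 500 and the single Random Forest regressor are likewise arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.mixture import BayesianGaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, RobustScaler

rng = np.random.default_rng(0)

# Placeholder data (assumption): 70 rows, 14 descriptors, IE in percent.
X = rng.normal(size=(70, 14))
y = np.clip(80 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2, size=70), 0, 100)

# 90/10 split: 7 of the 70 rows are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=7, random_state=0
)

# Quantile Transformer followed by Robust Scaler, fitted on training rows only.
scaler = make_pipeline(
    QuantileTransformer(n_quantiles=len(X_train), random_state=0),
    RobustScaler(),
)
Z_train = scaler.fit_transform(X_train)
Z_test = scaler.transform(X_test)

# Assumption: model features and target jointly so samples come with labels.
joint = np.column_stack([Z_train, y_train])


def fit_best_mixture(data, k_values=range(2, 6)):
    """Fit one mixture per component count and keep the highest lower bound."""
    best = None
    for k in k_values:
        candidate = BayesianGaussianMixture(
            n_components=k, covariance_type="full", max_iter=500, random_state=0
        ).fit(data)
        if best is None or candidate.lower_bound_ > best.lower_bound_:
            best = candidate
    return best


gmm = fit_best_mixture(joint)

# Draw synthetic rows and split them back into features and target.
synthetic, _ = gmm.sample(500)
Z_syn = synthetic[:, :-1]
y_syn = np.clip(synthetic[:, -1], 0, 100)  # keep IE inside 0-100 %

Z_aug = np.vstack([Z_train, Z_syn])
y_aug = np.concatenate([y_train, y_syn])


def evaluate(Z_fit, y_fit):
    """Train on the given rows, then score on the untouched test set."""
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z_fit, y_fit)
    pred = model.predict(Z_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    return float(r2_score(y_test, pred)), rmse


baseline = evaluate(Z_train, y_train)
augmented = evaluate(Z_aug, y_aug)
print(f"baseline  R2={baseline[0]:.2f} RMSE={baseline[1]:.2f}")
print(f"augmented R2={augmented[0]:.2f} RMSE={augmented[1]:.2f}")
```

Because the test set holds only 7 compounds, R² computed this way can swing widely and drop below zero whenever a model predicts worse than the test-set mean, which is consistent with the negative baseline values reported above. Repeating the run over several synthetic sizes and regressors would be one way to explore why the best size differs by model.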
License
Copyright (c) 2025 Darnell Ignasius, Muhamad Akrom, Setyo Budi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.