Performance of Multivariate Missing Data Imputation Methods on Climate Data

Authors

  • Amalia Safira Widyawati, Institut Pertanian Bogor
  • Anwar Fitrianto, Institut Pertanian Bogor
  • Pika Silvianti, Institut Pertanian Bogor

DOI:

https://doi.org/10.30871/jaic.v9i6.11316

Abstract

Climate data play an important role in many aspects of life. However, missing values are common; they can interfere with data processing and reduce the quality of analysis, so appropriate handling methods are needed to keep analysis results valid. This study compares the performance of several imputation methods for multivariate missing data, based on actual missing data patterns, and determines the appropriate imputation method given the missing data mechanism. It also applies the best method to data with actual missing data patterns to assess its effect on the descriptive statistics required for further climatological analysis. The methods compared are monthly averages, missRanger, k-Nearest Neighbor (k-NN), and Iterative Robust Model-Based Imputation (IRMI). The missing data patterns were taken from Global Surface Summary of the Day (GSOD) data covering daily temperature, precipitation, humidity, pressure, and wind speed over 11 years, with 11.4% of values missing. These patterns were then applied to relatively complete NASA POWER data so that the imputation results could be evaluated against known values. The results show that IRMI is less capable of handling extreme missing data conditions, namely 17 completely missing rows. In contrast, k-NN, missRanger, and monthly averages performed better under both extreme and non-extreme conditions. Of the four methods, monthly averages were chosen because they handled the missing data while preserving the multivariate structure, with an sMAPE of 58% and a relative difference of 2.64%.
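
The evaluation idea described in the abstract (mask a complete series with an observed missing data pattern, impute, then score against the known values) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy values, and the particular sMAPE formula (mean of 2|a − f| / (|a| + |f|), in percent) are all assumptions, since the abstract does not give the exact definitions used in the study.

```python
from collections import defaultdict


def apply_missing_pattern(values, is_missing):
    """Copy `values`, replacing entries flagged in `is_missing` with None."""
    return [None if miss else v for v, miss in zip(values, is_missing)]


def monthly_mean_impute(months, values):
    """Fill each None with the mean of the observed values from the same month."""
    observed = defaultdict(list)
    for m, v in zip(months, values):
        if v is not None:
            observed[m].append(v)
    means = {m: sum(vs) / len(vs) for m, vs in observed.items()}
    # A month with no observed values at all cannot be filled and stays None.
    return [means.get(m) if v is None else v for m, v in zip(months, values)]


def smape(actual, imputed):
    """Symmetric MAPE in percent (assumed definition; ranges from 0 to 200)."""
    terms = []
    for a, f in zip(actual, imputed):
        denom = abs(a) + abs(f)
        terms.append(0.0 if denom == 0 else 2 * abs(a - f) / denom)
    return 100 * sum(terms) / len(terms)


# Toy daily temperatures; the values are invented for illustration only.
months = [1, 1, 1, 2, 2, 2]
complete = [26.0, 27.0, 28.0, 29.0, 30.0, 31.0]
pattern = [False, True, False, False, False, True]  # hypothetical missing pattern

masked = apply_missing_pattern(complete, pattern)
filled = monthly_mean_impute(months, masked)

# Score only the cells that were masked, since the others are unchanged.
truth = [c for c, miss in zip(complete, pattern) if miss]
guess = [f for f, miss in zip(filled, pattern) if miss]
score = smape(truth, guess)
```

In this sketch the score is computed only on the masked cells, which is one common choice; scoring all cells would dilute the error with values that were never imputed. The sketch handles a single variable, whereas the study works with five variables at once.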

Published

2025-12-20

How to Cite

[1]
A. S. Widyawati, A. Fitrianto, and P. Silvianti, “Performance of Multivariate Missing Data Imputation Methods on Climate Data”, JAIC, vol. 9, no. 6, pp. 3953–3963, Dec. 2025.