Data-Driven Modeling of Human Development Index in Eastern Indonesia's Region Using Gaussian Techniques Empowered by Machine Learning

The Human Development Index (HDI) is a statistical measure used to evaluate the progress and quality of human life in a country. For the Government of Indonesia, HDI is important because it informs the design of effective policies and programs. HDI is also one of the allocators used in determining the General Allocation Fund. The 2022 HDI data released by BPS show that HDI has increased in every district/city over the last 12 years, including in the regions of Eastern Indonesia. HDI values are influenced by several factors, and there are indications of spatial diversity, in that neighboring areas tend to have similar HDI levels. The Geographically Weighted Regression (GWR) method is used in this study because it takes spatial aspects into account. However, a GWR model must be rebuilt whenever there is regional expansion. Therefore, a GWR model that applies machine learning methods is needed, in which the model is built and tested using different datasets, namely training data and test data, so that it can predict new data better. The results show that the GWR model with test data has a better R-Square value, 0.9946702, than the GWR model with training data. The linear regression model shows that the most influential factor on HDI in Eastern Indonesia is expected years of schooling ($X_2$).


Introduction
For the Government of Indonesia, human development is seen as more important than anything else (Jean Sanny Mongan, 2019). This is because human-centered development can be an indicator of a country's progress. Human development can be defined as an effort to improve human welfare in a comprehensive, equitable, and sustainable manner. The aspects to be improved include education, health, security, and access to resources (BPS, 2022). A country will be left behind if human development and welfare are not considered (Tarigan, 2021).
Improving human development requires a way to measure it (Tarigan, 2021). The Human Development Index (HDI) is used to measure and evaluate statistically the progress and quality of human life in a country (Dwiyanto Pamungkas & Dewi, 2022). HDI can also identify the specific problems and challenges a country faces in continuing its human development. For this reason, the Indonesian government pays close attention to HDI in order to create and develop effective policies and programs. HDI is strategic data for Indonesia because it is used to measure government performance and serves as one of the allocators for determining the General Allocation Fund (Dana Alokasi Umum) (BPS, 2022).
HDI in each province of Indonesia has increased significantly over the past 12 years. This can be seen by comparing the human development status of each Indonesian province in 2010 with its status in 2022, where some regions show a very good increase. The same applies at the district/city level, where the percentage of districts/cities with low HDI status gradually decreased from 20.66 percent in 2010 to only 3.89 percent in 2022. This was accompanied by an increase in the number of districts/cities with high HDI status (BPS, 2022).
The main goal of human development is to create an environment in which people can enjoy long, healthy, and productive lives. A productive society in turn plays a role in advancing the country's economic growth and development. Human development is therefore also needed to spur economic growth, and this applies equally to the regional economy. A region will fall behind, including in its economic performance, if its development policies do not encourage improvement in human quality (Dwiyanto Pamungkas & Dewi, 2022).
Indonesia is divided into two regions, namely the Western Region of Indonesia and the Eastern Region of Indonesia. Regions in Western Indonesia tend to experience faster economic development than regions in Eastern Indonesia. In the Western Region, for example, the availability of infrastructure, education facilities, health facilities, and transportation facilities is much better than in Eastern Indonesia (Tubaka, 2019). According to the United Nations Development Program (UNDP), the Human Development Index is formed by three fundamental dimensions: knowledge, a decent standard of living, and a long and healthy life. In other words, it is closely tied to the availability of infrastructure, for example in education and health (Sukmawati, 2022).
The location of a region also affects its Human Development Index (HDI), in that neighboring regions are more strongly related than regions that are far apart (Tarigan, 2021). Therefore, this study uses the Geographically Weighted Regression (GWR) method to conduct a spatial (location-based) analysis of the factors that affect HDI in Eastern Indonesia. The model is built by applying machine learning concepts in order to obtain the best model. This research also aims to make prediction easier in the event of regional expansion, without building a new model.

Related Work
In an era of globalization and a growing understanding of the variety of regional development, research into the factors that influence regional development is becoming more in-depth and diverse. In this context, this study concentrates on Indonesia's eastern area, with attention to the HDI and to how specific factors influence it in this region. A prior study on the island of Sumatra by Permai et al. (2021) used GWR to examine the impact of fiscal decentralization indicators on economic performance. The results suggest that, compared with traditional linear regression, the GWR method can reveal spatial relationships in these effects and provide more information. These findings highlight the importance of advanced spatial techniques in the investigation of various geographies.
Furthermore, a study conducted on the Indonesian island of Java by Kurniawati (2019) demonstrates that Geographically Weighted Quantile Regression (GWQR) is highly effective in evaluating the Human Development Index. GWQR allows for quantile regression analysis while accounting for spatial variability, which aids in understanding the relationship between numerous factors and HDI. The findings of that study illustrate the GWQR model's superiority in explaining disparities throughout Java. Another study, by Pravitasari et al. (2021), focuses on regional development on the island of Java. They uncover factors affecting regional development using spatiotemporal pattern analysis, including GWR. The key finding is that the influence of local causes on development levels varies significantly, demonstrating the necessity of spatial techniques in understanding regional change.
Furthermore, a study conducted in the United States by Bozorgi (2021) investigated the use of machine learning approaches in drug overdose analysis. This study emphasizes the significance of knowing the environmental elements that influence health concerns such as drug overdose, as well as how machine learning can aid in the prediction of peak risks in spatial analysis. These findings show that machine learning algorithms may uncover robust predictors in high-risk situations.
Finally, Behrens (2018) developed a new way of combining spatially autocorrelated Euclidean distance fields (EDF) with machine learning, resulting in the Euclidean distance fields in machine learning (EDM) method. This study demonstrates EDM's capacity to account for spatial nonstationarity and presents an alternative approach to spatial modeling. All of this research emphasizes the need for detailed spatial analysis in understanding regional variability and how specific factors influence growth.
This study extends the previously used GWR modeling approach by incorporating features of machine learning schemes. Its purpose is to better understand the spatial impact of the Human Development Index (HDI) in Indonesia's eastern area. By building the GWR model with a machine learning technique, we hope to find more complex patterns and factors that traditional geographical analysis may miss. The study thus contributes to understanding spatial differences in human development levels in the region, bridging the gap between traditional spatial analysis and machine learning methods in the context of HDI in eastern Indonesia.

Materials and Methods
The training data is then used to build Multiple Linear Regression (MLR) and Geographically Weighted Regression (GWR) models. In this study, the Human Development Index is the dependent variable and is influenced by four factors (independent variables). The independent variables are life expectancy rate at birth ($X_1$), expected years of schooling ($X_2$), average years of schooling ($X_3$), and adjusted per capita expenditure ($X_4$). The Multiple Linear Regression model is used to conduct the Breusch-Pagan test, which shows whether there is spatial diversity in the model for the regions of Eastern Indonesia (Cholid et al., 2019).
The results of the Breusch-Pagan test indicate spatial diversity, so it is necessary to build a Geographically Weighted Regression model, a development of the Multiple Linear Regression model that takes the coordinates of each area into account (Cholid et al., 2019). The MLR model takes the form of formula 1 (Kartika & Kholijah, 2020), while the GWR model is written as formula 2 (Lutfiani & Scolastika Mariani, 2019):

$$y_i = \beta_0 + \sum_{k=1}^{p} \beta_k x_{ik} + \varepsilon_i \qquad (1)$$

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \qquad (2)$$

Here $(u_i, v_i)$ are the longitude and latitude of the $i$-th district/city and $p$ is the number of independent variables.
An optimum bandwidth must also be determined for the GWR model. The bandwidth can be thought of as a radius: a point located within the radius is considered to still have an influence. Bandwidth plays an important role in the GWR model because it affects how accurately the model fits the data, governing the balance between bias and variance in the model (Cholid et al., 2019). The research variables used in this study are summarized in Table 1.
After the MLR and GWR models are built using the training data, the test data is used to test them. To obtain the best model, the next stage is model evaluation, which shows whether the model that has been formed fits the data (Wahyudi et al., 2023). Three criteria are used to evaluate the models in this study, namely AIC (Akaike Information Criterion), AICC (Corrected Akaike Information Criterion), and R-Square. Finally, the most influential variables in each district/city, referred to as variable importance, are mapped so that further observations can be made. In this study, R Studio was used for the stages from data splitting through obtaining the variable importance, while ArcGIS was used to visualize (map) the variable importance.
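The radius analogy for the bandwidth can be made concrete with a small sketch. The Python snippet below is illustrative only: it assumes a Gaussian distance-decay weight of the form exp(-0.5 (d/b)^2), which is a common choice in GWR software but is not confirmed by the text as the exact form used here, and the distances and bandwidths are made-up numbers rather than values from this study.

```python
import math

def gaussian_weight(distance, bandwidth):
    """Gaussian distance-decay weight: 1 at distance 0, shrinking smoothly with distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

# Hypothetical distances (in km) from one regression point to four neighbours.
distances_km = [0.0, 50.0, 100.0, 300.0]

# Two hypothetical bandwidths: a larger bandwidth lets distant points keep more weight.
for bandwidth_km in (50.0, 150.0):
    weights = [gaussian_weight(d, bandwidth_km) for d in distances_km]
    print(bandwidth_km, [round(w, 3) for w in weights])
```

A small bandwidth makes each local model depend almost entirely on its nearest neighbours, while a large bandwidth pulls the local models toward a single global regression.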

Multiple Linear Regression
The multiple linear regression model represents the regions of Eastern Indonesia globally. It applies to the area as a whole and does not consider spatial effects. At a later stage, the model is checked with the Breusch-Pagan test to determine whether there is a spatial factor or heteroscedasticity. At this stage, the data used is the training data, and the resulting parameter estimates of the multiple linear regression model are given in Table 2 and formula 3. From the model described in Table 2, it can be seen that among the four independent variables, $X_2$ is the most influential, with a coefficient of 1.300258.

Geographically Weighted Regression (GWR)

a. Variance Homogeneity Test
A variance homogeneity test is carried out on the multiple linear regression model built from the training data to determine whether there is spatial heterogeneity (Cholid et al., 2019). The heteroscedasticity test in this study is the Breusch-Pagan test. If the p-value is less than $\alpha = 0.05$, that is, if there is heteroscedasticity, the GWR model can be built. The test results are given in Table 3. The test produces a p-value of 0.0002211, which is less than $\alpha = 0.05$, so there is heteroscedasticity in the model. The null hypothesis $H_0$ is therefore rejected, which indicates that the residual variance of the model is spatially heterogeneous rather than homogeneous. The GWR method can then be applied to overcome the problem of spatial heterogeneity in the linear regression model (Cholid et al., 2019).
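The logic of the test can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure run in the study: it uses a single predictor instead of four, the studentized (Koenker) form of the statistic, LM = n R^2 from regressing the squared residuals on the predictor, and synthetic data. The function names are hypothetical.

```python
import math

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y, fitted):
    """Coefficient of determination, clamped at 0 to absorb rounding error."""
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return max(0.0, 1.0 - ss_res / ss_tot)

def breusch_pagan_1d(x, y):
    """Studentized Breusch-Pagan test for one predictor; returns (LM, p-value)."""
    a, b = ols_simple(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: do the squared residuals depend on x?
    a2, b2 = ols_simple(x, sq_resid)
    aux_fit = [a2 + b2 * xi for xi in x]
    lm = len(x) * r_squared(sq_resid, aux_fit)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Synthetic data: the noise grows with x in y_het and stays constant in y_hom.
x = list(range(1, 21))
signs = [1 if xi % 2 == 0 else -1 for xi in x]
y_het = [2 * xi + 0.5 * xi * s for xi, s in zip(x, signs)]
y_hom = [2 * xi + 1.0 * s for xi, s in zip(x, signs)]

lm_het, p_het = breusch_pagan_1d(x, y_het)
lm_hom, p_hom = breusch_pagan_1d(x, y_hom)
print(p_het < 0.05, p_hom < 0.05)
```

With four predictors, the auxiliary regression would include all of them and the statistic would be compared against a chi-square distribution with four degrees of freedom.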

b. Optimum Bandwidth Selection
Building the GWR model first requires selecting the optimum bandwidth. In this study, two types of kernel were compared, namely fixed Gaussian and fixed bisquare. To determine the best kernel for the districts/cities in Eastern Indonesia, a model is fitted with each weighting function to obtain its Cross Validation (CV) value (Cholid et al., 2019). The data used at this stage is the training data. The results of both experiments are given in Table 4. The experiments show that the optimum kernel in this case is fixed Gaussian, because it has a smaller Cross Validation (CV) value than the fixed bisquare weight.
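The kernel comparison can be sketched as a leave-one-out cross-validation loop. The Python example below is a minimal illustration under several assumptions that are not taken from the study: one predictor instead of four, planar coordinates with Euclidean distance, commonly used forms of the Gaussian and bisquare kernels, a hand-picked list of candidate bandwidths, and synthetic data on a 4 x 4 grid.

```python
import math

def gaussian_weight(distance, bandwidth):
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def bisquare_weight(distance, bandwidth):
    if distance >= bandwidth:
        return 0.0  # points outside the bandwidth have no influence
    return (1.0 - (distance / bandwidth) ** 2) ** 2

def local_prediction(target, points, kernel, bandwidth):
    """Predict y at `target` from a weighted least-squares line fitted to `points`.

    Every point is a tuple (u, v, x, y); `target` is excluded by the caller.
    """
    u0, v0, x0, _ = target
    w = [kernel(math.hypot(u - u0, v - v0), bandwidth) for u, v, _, _ in points]
    sw = sum(w)
    mx = sum(wi * p[2] for wi, p in zip(w, points)) / sw
    my = sum(wi * p[3] for wi, p in zip(w, points)) / sw
    sxx = sum(wi * (p[2] - mx) ** 2 for wi, p in zip(w, points))
    sxy = sum(wi * (p[2] - mx) * (p[3] - my) for wi, p in zip(w, points))
    return my + (sxy / sxx) * (x0 - mx)

def cv_score(points, kernel, bandwidth):
    """Sum of squared leave-one-out prediction errors for one kernel and bandwidth."""
    total = 0.0
    for i, target in enumerate(points):
        others = points[:i] + points[i + 1:]
        total += (target[3] - local_prediction(target, others, kernel, bandwidth)) ** 2
    return total

# Synthetic data: the slope of y on x drifts with the u coordinate.
points = []
for u in range(4):
    for v in range(4):
        x = 1 + (2 * u + 3 * v) % 5
        y = (1.0 + 0.5 * u) * x + 0.1 * ((u + 2 * v) % 3 - 1)
        points.append((float(u), float(v), float(x), y))

kernels = {"gaussian": gaussian_weight, "bisquare": bisquare_weight}
candidate_bandwidths = (2.5, 4.0, 8.0)  # hypothetical candidates

scores = {
    (name, b): cv_score(points, kernel, b)
    for name, kernel in kernels.items()
    for b in candidate_bandwidths
}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The combination with the smallest CV score is kept, which mirrors the criterion used above to prefer the fixed Gaussian kernel.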

c. Parameter Estimation of the GWR Model
The GWR model is built using the optimum weighting function, fixed Gaussian, and the data used are the training data and the test data. Table 5 shows the parameter estimates of the GWR model, which describe globally the relationship between the Human Development Index ($Y$) and its four independent variables, namely life expectancy rate at birth ($X_1$), expected years of schooling ($X_2$), average years of schooling ($X_3$), and adjusted per capita expenditure ($X_4$), in Eastern Indonesia.

Performance Comparison of Multiple Linear Regression and GWR Models
At the model evaluation stage, the AIC, AICC, and R-Square values of the models are compared for the training data and the test data. Table 6 compares the results of the three evaluation criteria used in this study. The evaluation results in Table 6 show that the GWR model performs better than multiple linear regression, with a higher R-Square value on both the training and the test data. The AIC and AICC values of the GWR model are also smaller than those of the MLR model, which likewise indicates that the GWR model is better. In addition, Table 6 shows the expected result that the R-Square value on the test data is higher than on the training data. The GWR model has an R-Square value of 0.9946702 with the test data, compared with 0.9920565 with the training data.
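For reference, the three criteria can be computed from observed and predicted values as in the Python sketch below. It assumes the usual least-squares forms, AIC = n ln(RSS/n) + 2k up to an additive constant and AICc = AIC + 2k(k+1)/(n-k-1); GWR software typically substitutes an effective number of parameters for k, so the values in Table 6 need not match this simplified form. The numbers are hypothetical.

```python
import math

def r_squared(observed, predicted):
    """Share of the variance in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def aic_gaussian(observed, predicted, n_params):
    """AIC for a least-squares fit, up to an additive constant."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + 2 * n_params

def aicc_gaussian(observed, predicted, n_params):
    """Small-sample corrected AIC; requires n > n_params + 1."""
    n = len(observed)
    correction = (2 * n_params * (n_params + 1)) / (n - n_params - 1)
    return aic_gaussian(observed, predicted, n_params) + correction

# Hypothetical HDI values and model predictions for four districts.
observed = [62.0, 66.0, 70.0, 74.0]
predicted = [62.4, 65.6, 70.8, 73.2]

r2 = r_squared(observed, predicted)
aic = aic_gaussian(observed, predicted, n_params=2)
aicc = aicc_gaussian(observed, predicted, n_params=2)
print(r2, aic, aicc)
```

A higher R-Square and lower AIC and AICc indicate the better model, which is the reading applied to Table 6.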

Variable Importance (Mapping)
The GWR model built so far is still global, in that it represents all districts/cities in Eastern Indonesia together. GWR models are next developed for each district/city in Eastern Indonesia. The GWR modeling results are mapped for each independent variable, namely life expectancy rate at birth ($X_1$), expected years of schooling ($X_2$), average years of schooling ($X_3$), and adjusted per capita expenditure ($X_4$). Figure 2 shows that the significance of life expectancy rate at birth ($X_1$) is high in districts/cities on the islands of Kalimantan and Sulawesi and in NTB and NTT. In Figure 3, the regions most influenced by expected years of schooling ($X_2$) are on Sulawesi, Papua, NTT, NTB, and the Maluku Islands. Average years of schooling ($X_3$) has the highest significance in the regions of Papua, as shown in Figure 4, while adjusted per capita expenditure ($X_4$) has a relatively similar influence in each district/city in Eastern Indonesia, as shown in Figure 5.
Because the GWR model takes longitude and latitude into account, building it automatically produces a local GWR model for each district/city in Eastern Indonesia. The following are three samples of GWR models from the 232 districts/cities in Eastern Indonesia. The districts/cities sampled are Sumbawa, Banjarmasin City, and Jayapura City, as written in formulas 4, 5, and 6. In the formulas, each location is represented by its coordinate pair $(u_i, v_i)$: Sumbawa lies at longitude 118.1171082 and latitude -8.738072, Banjarmasin City at longitude 114.5943784 and latitude -3.3186067, and Jayapura City at longitude 140.6689995 and latitude -2.5916025.

$$y_i = -1.021553(u_i, v_i) + 0.48993(u_i, v_i) \cdot X_1 + 1.230049(u_i, v_i) \cdot X_2 + 1.243145(u_i, v_i) \cdot X_3 + 0.001001(u_i, v_i) \cdot X_4 \qquad (4)$$

$$y_i = -0.790558(u_i, v_i) + 0.490046(u_i, v_i) \cdot X_1 + 1.220203(u_i, v_i) \cdot X_2 + 1.236045(u_i, v_i) \cdot X_3 + 0.000997(u_i, v_i) \cdot X_4 \qquad (5)$$

The sample GWR models show that each district/city in Eastern Indonesia has a different GWR model, and that the most influential variables differ between regions. This is also one of the things that distinguishes the GWR model from the MLR model: the MLR model produces only one model that represents the regions of Eastern Indonesia globally, while the GWR model produces many models at once in a single run.
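To illustrate how such local models are applied, the Python sketch below evaluates the coefficients reported in formulas 4 and 5 for one set of indicator values. The input values are hypothetical and are assumed to be in the same units as the study's data; the function and variable names are illustrative only.

```python
# Local GWR coefficients (intercept, X1, X2, X3, X4) as reported in formulas 4 and 5.
LOCAL_MODELS = {
    "Sumbawa": (-1.021553, 0.48993, 1.230049, 1.243145, 0.001001),
    "Banjarmasin City": (-0.790558, 0.490046, 1.220203, 1.236045, 0.000997),
}

def predict_hdi(district, x1, x2, x3, x4):
    """Evaluate the local linear model of `district` for the four indicators."""
    b0, b1, b2, b3, b4 = LOCAL_MODELS[district]
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4

# Hypothetical indicator values, not taken from the BPS data:
# life expectancy, expected years of schooling, average years of schooling,
# adjusted per capita expenditure.
hypothetical_inputs = dict(x1=70.0, x2=13.0, x3=8.0, x4=10000.0)

predictions = {
    name: predict_hdi(name, **hypothetical_inputs) for name in LOCAL_MODELS
}
for name, value in predictions.items():
    print(name, round(value, 3))
```

The same inputs give slightly different predictions under the two local models, which reflects the location-specific coefficients that a single global MLR model cannot provide.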

Conclusions
Based on the stages carried out in this study, the multiple regression model built from the 2022 Human Development Index data obtained from BPS shows that there is spatial variance in the Eastern Indonesia region. The GWR method is therefore better suited and more relevant to this study. The multiple linear regression results show that expected years of schooling ($X_2$) has a significant effect on HDI in Eastern Indonesia. The GWR model was built using a fixed Gaussian weighting function because it has a smaller CV value. Training the GWR model on the training data and testing it on the test data showed that the model predicts new data well. This can be seen from the R-Square value of the GWR model with test data, 0.9946702, compared with 0.9920565 for the GWR model with training data. In addition, the influential factors differ between districts/cities in the Eastern Indonesia Region, although, based on the mapping, adjusted per capita expenditure ($X_4$) has a relatively similar influence in each district/city.

Figure 1
Figure 1 explains the framework of this research. Conceptually, the stages of this research are Data Collecting, Model Development, and Model Evaluation. At the Data Collecting stage, the data used is secondary data obtained from Statistics Indonesia (BPS) in 2022. The data contains the values of the independent variables that affect the Human Development Index and the dependent variable, namely the Human Development Index, for each district/city in Eastern Indonesia. A total of 232 districts/cities are involved.

Figure 1. Research Framework
This research applies one of the machine learning methods, namely data splitting, which divides the dataset into two parts: training data and test data. Training data is used to train the model, while test data is used to test the model's performance on data that it has never seen before. The Human Development Index dataset for each district/city is split with a ratio of 80:20, that is, 80 percent for training data and 20 percent for test data. The 80:20 ratio is based on research by Patgiri et al. (2019), which used the SVM and Random Forest algorithms to classify and detect malicious URLs and found an 80:20 split to be the most accurate ratio. This stage has an important effect on the machine learning model (Nguyen et al., 2021). Splitting the data is expected to improve model accuracy and to avoid overfitting and underfitting on data that has never been seen before.
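An 80:20 split of the 232 districts/cities can be sketched as follows. This Python example is only an illustration: the study performed the split in R Studio, and the random seed, the rounding rule, and the use of integer identifiers in place of real district records are assumptions made here.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Stand-in identifiers for the 232 districts/cities in the dataset.
district_ids = list(range(232))
train_ids, test_ids = train_test_split(district_ids)
print(len(train_ids), len(test_ids))
```

Because the split is random, no district/city appears in both parts, so the test data remains unseen while the model is being fitted.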

Figure 2. Figure 3. Figure 4. Figure 5.
Figure 2. Visualization of areas affected by the factor of life expectancy rate at birth ($X_1$)

Table 1. Research Variables

Table 2. Parameter Estimation of MLR Model

Table 3. Heteroscedasticity Test Results

Table 4. Comparison of Fixed Gaussian and Fixed Bisquare

Table 5. Parameter Estimation of GWR Model