Predicting Missing Value Data on IEC TC10 Datasets for Dissolved Gas Analysis using Tertius Algorithm

ABSTRACT


I. INTRODUCTION
The transformer is one of the key components of a power plant. In electricity systems, the power transformer converts high-voltage electricity to low-voltage electricity and vice versa. Although transformers operate for very long periods, they rarely receive intensive attention. Yet, given their workload and function, transformers are vulnerable to damage that can disrupt electricity distribution to consumers. One vulnerable component is the liquid insulation material, a hydrocarbon oil, whose quality degrades after long use.
A fault in a power transformer becomes more likely the longer the equipment is in service. Gases such as ethane (C2H6), methane (CH4), hydrogen (H2), acetylene (C2H2), ethene (C2H4), carbon dioxide (CO2), and carbon monoxide (CO) form in the oil of a power transformer as the oil decomposes under thermal faults, such as overheating in more extreme cases, or under electrical faults [1]. This process is accompanied by erosion of the insulation paper on the walls of the power transformer.
Previous studies have found that some of the gaseous compounds produced can be used to diagnose the condition of power transformers. A. Siddique [2] studied a transformer fault analysis approach that combines the MultiLayer Perceptron Neural Network (MLPNN) with the conventional Rogers and Doernenburg ratio approaches; the combination was shown to produce better accuracy. H. Malik investigated DGA diagnosis based on the most influential parameters using the Extreme Learning Machine (ELM) [2]. Furthermore, according to the research conducted by Setiawan [7], Dissolved Gas Analysis (DGA) is a method used to measure the condition of a power transformer based on the type and amount of gas dissolved in the transformer oil as a result of the decomposition of the insulating oil. The gas contaminants formed are ethane (C2H6), methane (CH4), hydrogen (H2), acetylene (C2H2), ethene (C2H4), carbon dioxide (CO2), and carbon monoxide (CO) [3].
Many DGA evaluation approaches have been developed, including both conventional techniques and Artificial Intelligence Techniques (AITs) based on ANN, SVM, Fuzzy Logic, and Type-2 Fuzzy Logic. However, these methods have not achieved significant results [8].
According to the IEEE Guide for the Interpretation of Gases Generated in Oil-Immersed Transformers (ANSI/IEEE Std. C57.104-1991, rev. 2008), many approaches to DGA analysis have been developed, both conventional approaches and approaches based on AITs. Conventional approaches include the Doernenburg Ratio Method, Rogers Method, Key Gas Method, and Duval Triangle [1].
Transformers, as high-voltage equipment, are always exposed to the possibility of abnormal conditions triggered by internal or external factors. These abnormal conditions generally take the form of overheating, corona, and arcing, which can disturb transformer performance. One way to find out whether a transformer has an abnormality is to observe the effects of that abnormality on the transformer itself, and the Dissolved Gas Analysis (DGA) method is used for this purpose. DGA is a method used to measure the condition of a power transformer based on the type and amount of gas dissolved in the transformer oil as a result of the decomposition of the oil and the insulation [9].
When an abnormality occurs in the transformer, the insulating oil, which consists of hydrocarbon chains, decomposes because of the large amount of energy released. This forms hydrocarbon gases that dissolve in the insulating oil. DGA is the process of measuring the levels of the hydrocarbon gases formed by such disturbances.
There are two causes of changes in gas composition in the transformer: thermal disturbances and electrical disturbances. Under heat, gases are produced by the decomposition of the oil and of the solid insulation materials in the transformer. Under electrical disturbances, gas formation generally occurs through ionic bombardment; little heat is generated, as is typical of low-energy and partial discharges.
Another approach is to assume that all the hydrocarbons in the oil decompose into the same products and that each product is in equilibrium with the others. Thermodynamic models then make it possible to calculate the partial pressure of each gaseous product as a function of temperature, using the equilibrium constant equations for the relevant decomposition reactions.
IEC TC10 is a dataset widely used as a reference for DGA problems. However, this dataset is not perfect. There are several problems with the data it contains, one of which is the number of missing or empty data values, also known as missing value data. In general, missing values are not a significant problem when the quantity of available data is large. However, they significantly affect performance when the available data is small, as is the case for the IEC TC10 dataset [10]. Therefore, this research focuses on handling the problem of missing value data. The Tertius algorithm is used to deal with the missing values, and the J48 and Random Forest algorithms are used to measure the classification accuracy.

A. Experimental Design
This research is carried out in two main stages: the data preprocessing stage, which includes fixing the missing data values, and the data testing stage [11] [12]. The complete workflow can be seen in the figure below.

III. RESULT AND DISCUSSION
A. Data collection
This study uses 167 DGA data objects from the dataset published with IEC 60599 by M. Duval [10] [13]. The IEC TC 10 database is one of the DGA references presented in the new IEC Publication 60599, authored by Michel Duval. The data contained in the IEC TC 10 database are derived from testing approximately 10,000 transformers [1].
Within this dataset, seven types of gas molecules are used to identify faults: hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), acetylene (C2H2), carbon monoxide (CO), and carbon dioxide (CO2). The types of faults occurring in the transformers vary with the concentrations of these gases in the insulating oil [3]. The 167 data objects from the IEC TC10 database are divided into 9 for the PD condition, 26 for the D1 condition, 48 for the D2 condition, 16 for the T1 and T2 conditions, 18 for the T3 condition, and 50 for the normal condition.
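As a quick check of the class distribution reported above, the following minimal Python sketch tallies the per-class counts and prints each class's share of the 167 data objects. The counts are taken from the text; the dictionary keys and the function name class_shares are illustrative only.

```python
# Per-class counts of the 167 IEC TC10 data objects, as reported in the text.
class_counts = {
    "PD": 9,
    "D1": 26,
    "D2": 48,
    "T1 and T2": 16,
    "T3": 18,
    "Normal": 50,
}


def class_shares(counts):
    """Return each class's share of the dataset as a percentage."""
    n = sum(counts.values())
    return {label: 100.0 * c / n for label, c in counts.items()}


total = sum(class_counts.values())
shares = class_shares(class_counts)

for label, pct in shares.items():
    print(f"{label:10s} {class_counts[label]:3d} objects  {pct:5.1f}%")
print("total:", total)
```

The tally makes the imbalance visible: the PD class has far fewer objects than the D2 and normal classes, so each missing value there carries more weight.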

B. Fixing the missing value
Missing data is a crucial problem in DGA. This research uses the Tertius algorithm to address it. The Tertius algorithm is an association algorithm that can be used to explore relationships among the data in a dataset. It was first introduced in the work of Peter A. Flach and Nicolas Lachiche in 2001 [14] [15].
Association rule mining aims to find interesting associations or correlations among items in a dataset. These associations are commonly expressed in the form of rules, often referred to as "if-then" rules. For example, "If item A and item B are present, then item C is likely to be present as well." The Tertius algorithm operates on a dataset, searching for frequent itemsets and generating association rules based on those itemsets. The process can be summarized as follows [16]:
1. Frequent Itemset Generation: The algorithm starts by identifying all frequent itemsets in the dataset. A frequent itemset is a set of items that occur together in the data with a frequency above a specified minimum support threshold. The support threshold is a user-defined parameter that determines the minimum frequency required for an itemset to be considered "frequent."
2. Rule Generation: From the frequent itemsets, the algorithm generates association rules. An association rule consists of an antecedent (the "if" part) and a consequent (the "then" part). The antecedent and consequent are subsets of items from a frequent itemset. The algorithm generates rules with various combinations of antecedent and consequent to capture different associations.
3. Rule Evaluation: The generated rules are evaluated based on a measure such as "confidence" or "confirmation." Confidence measures the likelihood of the consequent being present given the antecedent. Higher confidence values indicate stronger associations. On the other hand, confirmation is used in the Tertius algorithm and measures the goodness of the rule based on statistical significance.
4. Rule Selection: After evaluating the rules, the algorithm selects the most interesting and significant rules based on a predefined threshold or user-defined criteria. These selected rules are considered to be meaningful associations in the data.
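The four steps above can be sketched in Python as follows. This is a minimal illustration of generic support/confidence rule mining, not the Tertius implementation: it scores rules by confidence rather than by the confirmation measure, and it enumerates itemsets by brute force. The function names, thresholds, and the discretized item labels (such as "H2=high") are hypothetical and are not taken from the IEC TC10 data.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: find all itemsets whose support meets the threshold."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                frequent[frozenset(combo)] = s
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size.
            break
    return frequent


def generate_rules(frequent, min_confidence):
    """Steps 2-4: split itemsets into rules, score them, keep the strong ones."""
    rules = []
    for itemset, s in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is itself frequent,
                # so the antecedent's support is already in the table.
                confidence = s / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, s, confidence))
    return sorted(rules, key=lambda r: r[3], reverse=True)


# Hypothetical discretized records, for illustration only.
transactions = [
    frozenset({"H2=high", "CH4=low", "class=PD"}),
    frozenset({"H2=high", "CH4=low", "class=PD"}),
    frozenset({"H2=high", "CH4=high", "class=D1"}),
    frozenset({"H2=low", "CH4=low", "class=Normal"}),
]

frequent = frequent_itemsets(transactions, min_support=0.5)
for antecedent, consequent, s, c in generate_rules(frequent, min_confidence=0.9):
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={s:.2f} confidence={c:.2f}")
```

A rule whose consequent is a single attribute value can then serve as a candidate prediction for a record in which that attribute is missing.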
The distinguishing feature of the Tertius algorithm is its use of "confirmation" to evaluate and prioritize the generated rules. Smaller confirmation values indicate better-quality rules, meaning the algorithm tends to favor rules with less redundancy and higher significance [17]. The Tertius algorithm was selected for data preprocessing in this research because of its superior performance compared to other association algorithms. In essence, it operates like other association algorithms by searching for relationships among the data in the dataset, but it additionally uses the confirmation values to select among the generated rules [18].
The large amount of missing data can lower the accuracy of data classification. There are 56 missing data in the IEC TC10 dataset, divided into 9 PD, 3 D1, 3 D2, 6 T12, 2 T3, and 33 Normal. The missing values are handled separately for each diagnostic class.
Based on IEC TC10 data, 9 attribute values are missing in the PD data. The missing data can be seen in the following PD data snippet (columns in gray represent missing data). Based on the TDCG conversion, the missing values are categorized according to the lowest class of each attribute. Therefore, all missing values are assigned a temporary value, 'W'. To make the missing data easier to locate, each data item is assigned a serial number as its code, as shown in Table 3 below (columns in gray represent missing data). Table 2 is then processed to search for rules that determine the relationships of the missing data. The following are the rule search results for the PD data based on the Tertius and Tertius classification algorithms. Based on these rules, the predictions of the missing PD data are obtained, as shown in Table 4 below. In Table 4, the missing data code is the code number of the missing data from Table 3, and the value column holds the temporary value based on the TDCG conversion. The Tertius Rule and Tertius Classification Rule columns refer to the algorithms used to predict the missing value: a number in either column indicates that the value was successfully predicted by the rule with that number from the Tertius algorithm, while 0 indicates that the value could not be predicted. The predicted value column gives the predicted value based on TDCG.
Based on Table 4, 6 data can be predicted, namely the data with codes 3, 13, 15, 18, 33, and 43. The remaining 3 data, with codes 4, 14, and 19, cannot be predicted because no related rule explains them: the missing values lie in the same column, so no rule can be derived for them. To deal with this problem, the values that the Tertius algorithm cannot predict are filled in using the mean value approach. The final results of the prediction of missing PD data can be seen in Table 5.
The next step is fixing the missing attribute values in the Discharge of low energy (D1) and Discharge of high energy (D2) data. Discharge of low energy (D1) is a sparking-type disturbance that causes larger perforations in the insulating paper or carbon particles in the oil. Based on IEC TC10 data, there are 3 data attributes missing in the D1 data.
Discharge of high energy (D2) is a discharge with power follow-through that causes extensive carbonization of the insulating material, metal fusion, and possibly tripping of the equipment. Based on IEC TC10 data, there are 3 data attributes missing in the D2 data.
The search for missing data attributes in the D1 and D2 data is similar to the search in the PD data. The search results are as follows: In the D1 data, none of the missing data can be predicted with the Tertius algorithm. These are the data with codes 40, 87, and 90, and they cannot be predicted because no related rule explains them. In the D2 data, 2 of the 3 missing data can be predicted using the Tertius algorithm, namely the data with codes 20 and 50.
Meanwhile, 1 data cannot be predicted, namely the data with code 40. Data that cannot be predicted using the Tertius algorithm are filled in using the mean value approach. The final result of the prediction of missing data D1 can be seen in Table 6 above.
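The two-level procedure described for the PD, D1, and D2 data, in which a rule-based prediction is used where one exists and the mean value is used otherwise, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the column is assumed to hold numeric values, the 'W' placeholder follows the text, and the function name impute_column, the rule_predictions mapping, and the sample values are hypothetical rather than taken from the paper's tables.

```python
from statistics import mean

MISSING = "W"  # temporary placeholder for a missing value, as in the text


def impute_column(values, rule_predictions):
    """Fill the gaps in one attribute column.

    values           -- list of numbers, with MISSING marking the gaps
    rule_predictions -- dict {row_index: predicted_value}; hypothetical
                        output of the rule search, may not cover every gap
    Returns the filled column and, per row, where its value came from.
    """
    observed = [v for v in values if v != MISSING]
    if not observed:
        raise ValueError("cannot compute a mean: no observed values")
    fallback = mean(observed)  # mean of the observed values only

    filled, source = [], []
    for i, v in enumerate(values):
        if v != MISSING:
            filled.append(v)
            source.append("observed")
        elif i in rule_predictions:
            filled.append(rule_predictions[i])
            source.append("rule")
        else:
            filled.append(fallback)
            source.append("mean")
    return filled, source


# Hypothetical column: rows 1 and 3 are missing, a rule covers row 1 only.
column = [10.0, MISSING, 30.0, MISSING, 20.0]
filled, source = impute_column(column, rule_predictions={1: 12.0})
print(filled)
print(source)
```

Recording the source of each filled value keeps the rule-based predictions distinguishable from the mean-value fallbacks when the results are tabulated.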
A thermal fault below 300°C (T1) is a disturbance that causes the color of the paper to turn brown. A thermal fault between 300°C and 700°C (T2) is a disturbance that can cause carbonized paper. Based on IEC TC10 data, there are 6 data attributes missing in the T1 and T2 data.
A thermal fault above 700°C (T3) is a disturbance that can cause carbonized oil and can cause metal to be discolored or even melted. Based on IEC TC10 data, there are 2 data attributes missing in the T3 data.
The search for missing data attributes in the T1 and T2 data is similar to the search in the D1 and D2 data. The search results are as follows: In the T1 and T2 data, all missing data can be predicted with the Tertius algorithm. These are the data with codes 28, 33, 38, 46, 68, and 73. In the T3 data, the 2 missing data can be predicted by the Tertius algorithm. These are the data with codes 23 and 30. The values of the missing data attributes can be seen in Table 7 above.
Based on IEC TC10 data, there are 33 missing data attributes in the Normal data. Information regarding the missing data and the data transformation results can be seen in the appendix. The following are the rule search results for the Normal data based on the Tertius and Tertius classification algorithms.

C. Data Normalization
The gas comparison data still span too wide a range of values. Therefore, the data need to be normalized. In this study, normalization is done by rescaling the data with the mapminmax function in MATLAB.
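For readers without MATLAB, the rescaling performed by mapminmax can be sketched in plain Python. By default, mapminmax maps the minimum and maximum of each row to -1 and 1 using y = (ymax - ymin) * (x - xmin) / (xmax - xmin) + ymin. The sketch below assumes that default range and that each row holds one attribute; leaving a constant row unchanged is a choice made here to avoid division by zero, and the sample values are hypothetical.

```python
def mapminmax(rows, ymin=-1.0, ymax=1.0):
    """Rescale each row linearly so its minimum maps to ymin and its
    maximum maps to ymax (a sketch of MATLAB's mapminmax defaults)."""
    scaled = []
    for row in rows:
        xmin, xmax = min(row), max(row)
        if xmax == xmin:
            # Constant row: leave it unchanged to avoid dividing by zero.
            scaled.append(list(row))
            continue
        scale = (ymax - ymin) / (xmax - xmin)
        scaled.append([(x - xmin) * scale + ymin for x in row])
    return scaled


# Hypothetical gas values, one attribute per row.
gas_rows = [
    [0.0, 50.0, 100.0],
    [5.0, 5.0, 5.0],
]
for row in mapminmax(gas_rows):
    print(row)
```

Passing ymin=0.0 and ymax=1.0 gives the common 0-to-1 scaling instead of the -1-to-1 default.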

D. Data Testing
Data testing is divided into 2 parts: data testing using the comparison method and data testing using AIL. Testing is done by randomly dividing the dataset into 80% training data and 20% testing data. The results of testing the data with the comparison method are as follows:
1. J48 Method
Data testing using the J48 method was carried out in 30 trials, which produced the following results graph. The graph in Figure 9 shows an average measurement result of 62.73%, with an average processing time of 0.0133 seconds.
2. Random Forest Method
Data testing using the Random Forest method was carried out in 30 trials, which produced the following results graph. The graph in Figure 9 shows an average measurement result of 70.71%, with an average processing time of 0.0117 seconds.
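The evaluation protocol described above, a random 80%/20% split repeated for 30 trials with the accuracy averaged, can be sketched in Python as follows. This is only a sketch of the protocol: the classifier is a placeholder that always predicts the majority class, not J48 or Random Forest, and the function names, the seed, and the dummy samples (labels only, built from the class counts reported in the text) are assumptions for illustration.

```python
import random
from collections import Counter


def holdout_split(samples, test_fraction=0.2, rng=random):
    """Shuffle a copy of the samples and split it into train and test."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def majority_classifier(train):
    """Placeholder model: always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def repeated_holdout(samples, fit, trials=30, test_fraction=0.2, seed=0):
    """Average the test accuracy over several random train/test splits."""
    rng = random.Random(seed)  # fixed seed so the trials are reproducible
    accuracies = []
    for _ in range(trials):
        train, test = holdout_split(samples, test_fraction, rng)
        predict = fit(train)
        correct = sum(1 for x, y in test if predict(x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies), accuracies


# Dummy samples: features are omitted (None), labels follow the class counts.
counts = {"PD": 9, "D1": 26, "D2": 48, "T12": 16, "T3": 18, "Normal": 50}
samples = [(None, label) for label, n in counts.items() for _ in range(n)]

mean_accuracy, per_trial = repeated_holdout(samples, majority_classifier)
print(f"mean accuracy over {len(per_trial)} trials: {mean_accuracy:.4f}")
```

Any function that takes the training samples and returns a predictor can be passed as fit, so a decision tree or random forest from a machine learning library could replace the placeholder without changing the protocol.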

IV. CONCLUSION
This research was conducted to predict the missing attribute values in the IEC TC10 data, which are very important in the DGA analysis process. Association rules, in the form of the Tertius algorithm, were used to predict the values, and the classification accuracy was then tested using the J48 and Random Forest methods. The search for missing data using the Tertius algorithm on the IEC TC10 data was carried out successfully. Of the total 56 missing data, 36 could be predicted well. The classification accuracy obtained is 70.71% using Random Forest and 62.73% using J48.
This study has several limitations, which suggest directions for further research. Other methods of handling missing data need to be explored, because the method used in this study cannot resolve all the missing data in the IEC TC10 dataset. Furthermore, a stage for finding the most influential attributes could be added in an effort to obtain better accuracy.