
Missing value imputation using a fuzzy clustering-

based EM approach

Submitted by:
Angad Singh Waraich (104663426)
Damanpreet Kaur (104678365)
Gaurav Goyal (104669736)
Prabjot Singh (104682416)
ABSTRACT
Data editing and cleaning play an essential part in data mining by guaranteeing good data quality. A
Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework
for Data Pre-processing (FEMI) imputes a missing value by making an educated guess based on
records that are similar to the record containing the missing value. While applying the EM approach,
the method identifies similar records using fuzzy c-means clustering. We use 72 different records to
test the efficiency of the algorithm. The two evaluation criteria used here are NRMS and AE, which
are explained below.

CONTENTS
Abstract
Introduction
Work and Development Procedure
    Block Diagram
    Settings
    Numerical Data
    Categorical Data
    Categorical and Numerical Data
    Fuzzy C-Means Clustering
Statistical Analysis
    RMSE
    NRMSE
    Absolute Error
Simulation
    Numerical Data
        Input Data
        Output Data
    Chart Representation
    Categorical Data
Conclusion

Table of Figures
Figure 1: ZOO - NRMS comparison
Figure 2: HOE - AE comparison

INTRODUCTION
Data and information have become major tools for corporations all around the world, and major
decisions are based on them. Unless proper care is taken, however, a portion of the data may be lost
or corrupted. Data pre-processing is a vital step in data mining and is used for the imputation of
missing data. In this project, we introduce and implement a Fuzzy Expectation Maximization and
Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI).
Various algorithms, namely EMI, GKNN, FKMI, SVR and IBLLS, have been used for imputation;
EMI is the most frequently used because it can be applied to both numerical and categorical values.
EM estimates the mean and standard deviation of the data so as to maximize the likelihood of the
observed data, and then re-estimates the missing values as if the new parameter estimates were
correct. With FEMI we build on the idea of EMI: instead of using all records to estimate a missing
value, we create clusters so that similar records are placed in the same cluster. FEMI also makes use
of two levels of fuzziness.

The Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation
Framework for Data Pre-processing (FEMI) makes use of a fuzzy clustering technique with a
user-defined number of clusters k and a fuzzy expectation maximization algorithm for imputation.

Fuzzy clustering is a procedure for clustering data groups that can be correlated with each other.
Membership degrees are assigned to the input data and are therefore used to measure the relation
between the input records and the record with the missing value. If the records are more similar to
each other, the chance of an accurate imputation becomes high, because the differences in their
characteristics are negligible. Reducing the number of possible values also helps to achieve high
imputation accuracy, so it is argued that, by knowing the value of one attribute, it is possible to
impute the value of another attribute more accurately. Keeping the aforementioned facts in mind, it
was concluded that similar records should be organized into clusters. This is where a distinct
difference between EMI and FEMI is found: EMI uses all records for the imputation of missing
attributes, whereas FEMI makes use of clusters, based on the argument that records grouped together
in a cluster typically have a higher correlation between two attributes of the data set.

WORK AND DEVELOPMENT PROCEDURE
Clustering algorithms can be categorized into two groups: hard clustering and soft clustering. In hard
clustering, a record belongs to exactly one cluster. In fuzzy clustering, also known as soft clustering,
a record has a likelihood of belonging to every cluster. This likelihood, or probability, is called the
membership degree. The membership degree ranges from 0 to 1 and can take any value in that
interval. A value of 0 means that the record has no membership in the cluster, whereas a value of 1
means that the record belongs to that cluster only. The total membership of a record over all the
clusters is 1. These membership degrees are later used to estimate the missing value within a cluster.
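
As a purely illustrative example (hypothetical numbers, not taken from the project data), a
membership matrix for three records and two fuzzy clusters could look as follows; each row sums to 1.

% Hypothetical membership degrees for 3 records and 2 fuzzy clusters (MATLAB)
U = [0.90 0.10;    % record 1 belongs almost entirely to cluster 1
     0.55 0.45;    % record 2 lies between the two clusters
     0.05 0.95];   % record 3 belongs almost entirely to cluster 2
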
A fuzzy clustering approach allows a record to be assigned to all the clusters with different
membership degrees based on its attributes. A record therefore need not exhibit a single
characteristic; it may belong to one cluster or another to some degree. It follows that a record
containing a missing value still has a membership degree with each cluster. Keeping that in mind, the
fuzzy technique determines the membership degrees of the incomplete record with respect to each
cluster and then imputes the record; these membership degrees directly influence the imputed value.

The second level of fuzziness is applied when estimating the missing value from a given cluster: it
takes into consideration that every record has a fuzzy association with the cluster in question. The
records that are similar to the incomplete record can therefore be identified and used to estimate the
value of the missing entry.

BLOCK DIAGRAM

SETTINGS

The datasets provided to us are of three types: purely categorical, purely numerical, and a hybrid of
both. The approach to each of the three types is different and is detailed below.
The first step in the process is common to all categories: the options are set and the files are
imported.
The options are set as follows:
OPTIONS.regress    = 'mridge';   % multiple ridge regression
OPTIONS.stagtol    = 5e-3;       % stagnation tolerance (stopping criterion)
OPTIONS.maxit      = 30;         % maximum number of EM iterations
OPTIONS.inflation  = 1;          % inflation factor for the residual covariance
OPTIONS.disp       = 1;          % display diagnostic output
OPTIONS.relvar_res = 5e-2;       % minimum relative variance of residuals
OPTIONS.minvarfrac = 0;          % minimum fraction of total variation retained
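
A minimal sketch of this common first step is given below. The file name, the use of readmatrix, and
the calling form of the RegEM routine are assumptions for illustration, not details taken from the
project code.

D = readmatrix('zoo_simple1.csv');             % hypothetical input file; missing entries are read as NaN
[X_imputed, M, C, Xerr] = regem(D, OPTIONS);   % assumed RegEM calling form: fills the NaN entries,
                                               % returning mean M, covariance C and error estimate Xerr
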

NUMERICAL DATA

The data from this set is copied into another dataset (let us call it DN). The new dataset DN is
then normalized.

The normalized dataset DN is further broken down into two sub-datasets: the first containing the
records with no missing values and the second containing only those records that have missing
values.
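
A minimal sketch of this step, assuming the raw numerical data is held in a matrix D with one row
per record and NaN marking a missing value (variable names are illustrative):

DN   = (D - min(D)) ./ (max(D) - min(D));   % min-max normalisation of every column to [0, 1]
miss = any(isnan(DN), 2);                   % records containing at least one missing value
DC   = DN(~miss, :);                        % sub-dataset with no missing values
DI   = DN(miss, :);                         % sub-dataset with missing values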

Membership degrees are then assigned to the records of both sub-datasets with respect to all the
clusters. For this we apply MATLAB's built-in fuzzy c-means clustering function, fcm. Using the
membership degrees found through this procedure, we obtain the cluster centroids of the given
dataset and then evaluate the membership function for the incomplete records.
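
A sketch of the clustering call, assuming the complete sub-dataset DC from the sketch above and an
assumed number of clusters k (fcm is part of MATLAB's Fuzzy Logic Toolbox):

k = 3;                        % assumed number of clusters
[centres, U] = fcm(DC, k);    % centres: k cluster centroids; U: k-by-n fuzzy partition matrix,
                              % where U(j, i) is the membership degree of record i in cluster j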

To calculate the imputation values, the union of the membership function values is used to form a
new matrix. This matrix is then used to impute the data with the fuzzy EM approach. The imputation
routine takes the following parameters: k, the index of the record, the sub-dataset containing the
records with missing values, the complete dataset (given at the start), and the membership degrees.

The cluster-wise estimates are then combined to calculate the missing entries and complete the
record, as sketched below.
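
A sketch of the combination for a single missing attribute of one record, where est(j) denotes the
value suggested by the EM step for cluster j and u(j) the record's membership degree in that cluster
(illustrative variable names, not the project's own):

x_imputed = sum(u(:) .* est(:));   % membership-weighted average of the k cluster-wise estimates
                                   % (the membership degrees sum to 1)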

The aforementioned formulas and techniques are then used to form a complete matrix, DF, which has
no missing values; this matrix is finally de-normalized to recover the original scale of the data.

CATEGORICAL DATA

Purely categorical datasets are handled as follows:

The categorical data is first converted to ordinal data. Where more than one column of categorical
data is present, a loop is used; each iteration converts one column to ordinal form, which is stored in
a new matrix.
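
A sketch of the conversion loop, assuming the categorical columns are collected in a cell array of
strings C with one row per record (grp2idx is a Statistics and Machine Learning Toolbox function):

ordinalData = zeros(size(C));               % ordinal codes, one column per categorical attribute
for j = 1:size(C, 2)
    ordinalData(:, j) = grp2idx(C(:, j));   % map each distinct category in column j to an integer code
end
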
Expectation maximization is then carried out to impute the missing values. The expectation
maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of
parameters in statistical models that depend on unobserved latent variables.
The EM iteration alternates between an expectation (E) step, which builds a function for the
expectation of the log-likelihood evaluated using the current parameter estimates, and a
maximization (M) step, which computes the parameters maximizing the expected log-likelihood
found in the E step.
These parameter estimates are then used to determine the distribution of the latent variables in the
next E step.
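
In standard notation, with observed data X, missing (latent) data Z and parameters θ, the two steps
can be written as:

$$ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\!\left[ \log L(\theta; X, Z) \right] \quad \text{(E step)} $$

$$ \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}) \quad \text{(M step)} $$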

CATEGORICAL AND NUMERICAL DATA
The first step is to run the dataset through code that detects whether one or more columns contain
categorical attributes.
The dataset is then separated into two new matrices, one containing the numerical data and the
other containing the categorical data.
The numerical and categorical parts are then handled separately, as explained in their respective
subsections; a sketch of the split is given below.
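
A sketch of the detection and split, assuming the hybrid dataset has been imported as a MATLAB
table T (variable names are illustrative):

isCat = varfun(@(c) iscategorical(c) || iscellstr(c), T, 'OutputFormat', 'uniform');  % flag categorical columns
Tcat  = T(:, isCat);     % categorical part, handled as in the previous subsection
Tnum  = T(:, ~isCat);    % numerical part, handled as in the numerical subsection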

FUZZY C-MEANS CLUSTERING

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two
or more clusters. It is based on minimization of the following objective function:

$$ J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m}\, \lVert x_i - c_j \rVert^{2}, \qquad 1 \le m < \infty $$

where m is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster j,
$x_i$ is the i-th item of the d-dimensional measured data, $c_j$ is the d-dimensional center of cluster
j, and $\lVert \cdot \rVert$ is any norm expressing the similarity between measured data and a center.

Fuzzy partitioning is carried out through an iterative optimization of the objective function shown
above, with the membership degrees $u_{ij}$ and the cluster centers $c_j$ updated by:

$$ u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{m}} $$

The iteration stops when $\max_{ij}\, \lvert u_{ij}^{(k+1)} - u_{ij}^{(k)} \rvert < \varepsilon$, where
$\varepsilon$ is a termination criterion between 0 and 1 and k is the iteration step. This procedure
converges to a local minimum or a saddle point of $J_m$.

The algorithm is composed of the following steps:
1. Initialize the membership matrix $U^{(0)} = [u_{ij}]$.
2. At step k, calculate the center vectors $C^{(k)} = [c_j]$ using $U^{(k)}$.
3. Update $U^{(k)}$ to $U^{(k+1)}$.
4. If $\lVert U^{(k+1)} - U^{(k)} \rVert < \varepsilon$, stop; otherwise return to step 2.
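
In practice MATLAB's built-in fcm function performs these steps; the following minimal sketch
(assumed function and variable names, illustration only; pdist2 is from the Statistics and Machine
Learning Toolbox) shows how the loop above can be implemented directly:

function [C, U] = fcm_sketch(X, c, m, tol, maxIter)
% Minimal fuzzy c-means sketch following the four steps above.
% X: n-by-d data, c: number of clusters, m: fuzzifier (> 1), tol: termination criterion.
% U is returned as an n-by-c membership matrix (the toolbox fcm returns its transpose).
n = size(X, 1);
U = rand(n, c);
U = U ./ sum(U, 2);                           % step 1: random initial memberships, rows sum to 1
for k = 1:maxIter
    Um = U .^ m;                              % step 2: centres from the current memberships
    C  = (Um' * X) ./ sum(Um, 1)';
    D  = max(pdist2(X, C), 1e-10);            % distances of every record to every centre
    Unew = 1 ./ ((D .^ (2/(m-1))) .* sum(D .^ (-2/(m-1)), 2));  % step 3: membership update
    if max(abs(Unew(:) - U(:))) < tol         % step 4: stop when the memberships stagnate
        U = Unew;
        break
    end
    U = Unew;
end
end

For example, [C, U] = fcm_sketch(DC, 3, 2, 1e-5, 100) would partition the complete sub-dataset DC
into three fuzzy clusters.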

STATISTICAL ANALYSIS

RMSE
The root mean square error, often called RMSE, is used as an evaluation criterion for measuring the
difference between the imputed and the original values. Its value ranges from 0 to $\infty$; the lower
the RMSE, the better the imputation. If N is the number of artificially created missing values, $O_i$
is the actual value of the i-th missing entry and $P_i$ is the imputed value for the i-th missing entry,
then

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( P_i - O_i \right)^{2}} $$

NRMSE
NRMSE is the normalized RMSE, in which the RMSE value is divided by the range of the original
values (the difference between their maximum and minimum):

$$ NRMSE = \frac{RMSE}{O_{\max} - O_{\min}} $$

ABSOLUTE ERROR

The imputed values are compared to the original ones. A counter variable is used to compare the
values of the imputed data set and the original data set: every time a value in the imputed data
matches the corresponding value in the original data, the counter is incremented by one. This count is
key to calculating the AE. The number of comparison iterations equals the number of rows in the
dataset.

AE = value of the counter / total number of rows
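
A sketch of how these three measures can be computed, assuming O holds the original values and P
the corresponding imputed values at the artificially created missing positions (illustrative variable
names):

N     = numel(O);
rmse  = sqrt(sum((P - O) .^ 2) / N);    % root mean square error
nrmse = rmse / (max(O) - min(O));       % RMSE normalised by the range of the original values
ae    = sum(P == O) / N;                % absolute error as defined above: fraction of exact matches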

SIMULATION

NUMERICAL DATA

INPUT DATA

The sample input data used here is taken from the Zoo dataset, Simple1 case.

OUTPUT DATA

REGEM:
Percentage of values missing: 0.06
Stagnation tolerance: 5.00e-03
Maximum number of iterations: 30
Initialization of missing values by mean substitution.
One multiple ridge regression per record:
one regularization parameter per record.
Xerr = (33,13) 1.3922
NRMS = 0.0076

CHART REPRESENTATION

The NRMS values shown in Figure 1 for the Zoo dataset are listed below; the four values per test
case correspond to the four bars (compared imputation methods) in the chart.

Test case     Value 1     Value 2     Value 3     Value 4
Simple UD     0.0600      0.0178      0.0056      0.00031535
Simple        0.0136      0.0117      0.0023      0.0076
Medium UD     0.0959      0.0717      0.0276      0.0178
Medium        0.0513      0.0834      0.0311      0.0069
Complex UD    0.1774      0.0979      0.0298      0.0363
Complex       0.1757      0.0869      0.0843      0.0047
Blended UD    0.0857      0.0766      0.0268      0.0084
Blended       0.1054      0.0796      0.0382      0.0053

Figure 1: ZOO - NRMS comparison

CATEGORICAL DATA

The AE values shown in Figure 2 are listed below; the four values per test case correspond to the
four bars (compared imputation methods) in the chart.

Test case     Value 1    Value 2    Value 3    Value 4
Simple UD     0.953      0.9915     0.9915     1
Simple        0.9658     0.9872     0.9915     1
Medium UD     0.8462     0.9359     0.953      0.9915
Medium        0          0.9444     0.9744     0.9915
Complex UD    0.8077     0.9103     0.9573     0.9915
Complex       0.8162     0.9017     0.953      0.9915
Blended UD    0.8718     0.9402     0.9744     0.9915
Blended       0          0.9487     0.9658     0.9957

Figure 2: HOE - AE comparison

CONCLUSION

To conclude, the basic idea of this technique is to make an educated guess for a missing value using
the most similar records. It takes the fuzzy nature of clustering into consideration while identifying
the group of most similar records; therefore, it treats all groups of records as similar to some degree.
Moreover, while imputing a missing value based on a group, it also considers the fuzzy nature of
every record's membership in that group. To do so, it uses a novel fuzzy expectation maximization
algorithm to impute the missing values.
This technique uses a randomly chosen value of k in the experiments. For the same value of k, FEMI
performs better than the other imputation techniques, and the value of k could also be computed
automatically. The method is used for the imputation of data, not for clustering the records: even if a
data set naturally contains four clusters while FEMI forms three groups of records for each cluster
(obtaining twelve groups in total), the records within each group remain similar to one another. The
imputation accuracy therefore remains higher than that of the other methods, including EM.

