
Journal of Applied Computer Science & Mathematics, no. 10 (5) / 2011, Suceava

A New Imputation Method for Missing Attribute Values in Data Mining


1Diwakar SHUKLA, 2Rahul SINGHAI, 3Narendra Singh THAKUR

1Dept. of Mathematics and Statistics, Dr. H.S. Gour University, Sagar (M.P.), India
2IIPS, Devi Ahilya University, Indore (M.P.), India
3B.T. Institute of Research and Technology, Sironja, Sagar (M.P.), India
1diwakarshukla@rediffmail.com, 2singhai_rahul@hotmail.com, 3nst_stats@yahoo.co.in
Abstract - One recurring problem in the data cleaning and data reduction step of the KDD process is the presence of missing values in attributes. Many analysis tasks must deal with missing values, and several treatments have been developed to estimate them. One of the most common methods of replacing missing values is mean imputation. In this paper we suggest a new imputation method that uses a modified ratio estimator under a two-phase sampling scheme, and we apply it to impute the missing values of a target attribute in a data warehouse. Our simulation study shows that the resulting estimator of the mean is more efficient than those of other imputation methods.

Keywords: KDD (Knowledge Discovery in Databases), data mining, missing attribute values, imputation methods, sampling.

I. INTRODUCTION

With ever-increasing database sizes, randomization and randomized algorithms have become vital data management tools. In particular, random sampling is one of the most important sources of randomness for such algorithms (Motwani and Raghavan (1995)). Recently, important applications have called for incremental mining to discover up-to-date patterns hidden in continuous input data (Charikar et al. (1997), Chen et al. (2002), Fisher et al. (1987), Lee et al. (2001), Zhang et al. (1996)). It is believed that the demand for online sampling techniques is increasing, since they can prominently reduce the computational cost of incremental mining applications (Lee et al. (1998)). Traditionally, random sampling is the most widely utilized sampling strategy for data mining applications. According to the Chernoff bounds, the consistency between the population proportion and the sample proportion of a measured pattern can be probabilistically guaranteed when the sample size is large (Domingo et al. (2002) and Zaki et al. (1997)).

Sampling means reducing the data by creating one or more sample data tables. The samples should be big enough to contain the significant information, yet small enough to process quickly. Perhaps the most persistent problems concerning the processing of large databases are speed and cost. Analytical routines required for exploration and modeling run faster on samples than on the entire database. Even the fastest hardware and software combinations have difficulty performing complex analyses, such as fitting a stepwise logistic regression with millions of records and hundreds of input variables. Exploring a representative sample is easier, more efficient, and can be as accurate as exploring the entire database. After the initial sample is explored, some preliminary models can be fitted and assessed. If the preliminary models perform well, the data mining project can perhaps continue to the next phase. However, it is likely that the initial modeling generates additional, more specific questions, so that more data exploration is required. A major benefit of sampling is the speed and efficiency of working with a smaller data table that still contains the essence of the entire database. Ideally, one uses enough data to reveal the important findings, and no more.

Statistics is a branch of applied mathematics that enables researchers to identify relationships and to search for ways to understand them. Modern statistical methods rely on sampling: analysts routinely use sampling techniques to run initial models, enable exploration of data, and determine whether more analysis is needed. Data mining techniques that use statistical sampling methods can reveal valuable information and complex relationships in large amounts of data. In data mining, the sample data is used to construct predictive models, which in turn are used to make predictions about an entire database. Therefore the validity of a sample, that is, whether the sample is representative of the entire database, is critically important. It is determined by two characteristics: the size of the sample and the quality of the sample. The importance of sample size is relatively easy to understand; the importance of quality is more involved, because a quality sample for one business problem may not be a quality sample for another.

A) Imputation in data mining

In traditional statistical analysis, the application of databases is fairly straightforward. In order to obtain a good-quality and representative database, one approach is to build a value-added database for data mining applications. The concept of the value-added database can be divided into three phases from the statistical viewpoint: (1) sampling survey, (2) functional, and (3) application. Any sampling survey can be subdivided into three parts as follows:


(1) The imputation of missing data: missing data indicate lost information. If imputation of the missing data is implemented, the database will give better results.
(2) Index and criteria: the structure of the sample and the population is discussed using similarity and correlation. The prediction capability of the value-added data is measured against the index and criteria, and the result is evaluated for improved prediction accuracy.
(3) Sampling methods: the efficiency of different sampling methods is discussed for different datasets.

Missing attribute values are a common problem where sampling techniques are used to obtain a reduced representation of a data warehouse, so as to ease the mining process, and imputation is frequently used for this purpose. The ratio method of imputation can be used when the target attribute Y is positively correlated with an auxiliary attribute X whose dataset mean is known in advance. When the domain mean of the auxiliary attribute X is not known, we use two-phase sampling (Shukla et al., 2002). In the present work, we propose an imputation method that makes use of the modified ratio-type estimator of Singh and Upadhyay (2001) on a two-phase sample.

Let A = {X, Y, Z} be a finite three-attribute set of any database/data warehouse. The target attribute domain Y consists of the values Y_i (i = 1, 2, ..., N) of main interest, and the attribute domain X consists of the auxiliary values X_i (i = 1, 2, ..., N). The mean value of the target attribute is

\bar{Y} = (1/N) Σ_{i=1}^{N} Y_i

and the mean value of the reference (auxiliary) attribute is

\bar{X} = (1/N) Σ_{i=1}^{N} X_i.

With \bar{X} known, we estimate the unknown target attribute mean \bar{Y} by the ratio estimator

\bar{y}_r = (\bar{y}/\bar{x}) \bar{X}, where \bar{y} = (1/n) Σ_{i=1}^{n} y_i and \bar{x} = (1/n) Σ_{i=1}^{n} x_i

are the means of a sample of size n taken from the attribute set A by the simple random sampling without replacement (SRSWOR) scheme. For the situation when \bar{X} is not known, Singh and Upadhyay (2001) have proposed a modified two-phase ratio estimator of the form

T_{rd} = \bar{y} (\bar{x}^*/\bar{x}),

where \bar{x}' = (1/n') Σ_{i=1}^{n'} x_i is the sample mean of x based on the preliminary sample of size n' (> n) drawn in the first phase, and \bar{x}^* estimates \bar{X} by the technique of Searls (1964), with \bar{x}^* = k \bar{x}'. The minimum mean squared error of \bar{x}^* is obtained by minimising MSE(\bar{x}^*) = E(\bar{x}^* − \bar{X})^2, which results in

k = 1 / (1 + f_2 C_X^2); f_2 = 1/n' − 1/N.

This two-phase sampling scheme can be expressed in either of the following two manners [for details see Cochran (2005)]:
(i) by taking a sub-sample from the large sample (fig. 1(a));
(ii) by taking a sample independent of the large sample (fig. 1(b)).

[Fig. 1(a): Sub-sample from large sample]
[Fig. 1(b): Sample (S) having n tuples, drawn independently of the large sample]
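To make these definitions concrete, here is a minimal sketch (an illustration added here, not part of the original paper; the function names and the toy data are assumptions of ours):

import numpy as np

def searls_k(n_prime, N, c_x):
    """Searls (1964) constant k = 1 / (1 + f2 * C_X^2), with f2 = 1/n' - 1/N."""
    f2 = 1.0 / n_prime - 1.0 / N
    return 1.0 / (1.0 + f2 * c_x ** 2)

def two_phase_ratio_estimate(y, x, x_prime, N, c_x):
    """T_rd = ybar * (xbar*/xbar), where xbar* = k * xbar' shrinks the first-phase mean."""
    k = searls_k(len(x_prime), N, c_x)
    return y.mean() * (k * x_prime.mean()) / x.mean()

# Toy data: first-phase sample of n' = 8 x-values; the first n = 4 units
# form the second-phase sub-sample, where y is also observed.
x_prime = np.array([15., 29., 15., 13., 27., 25., 15., 25.])
x, y = x_prime[:4], np.array([45., 40., 36., 35.])
print(two_phase_ratio_estimate(y, x, x_prime, N=200, c_x=0.3763))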


It may also happen that the sample has some observations unavailable; how to cope with these in a data warehouse and still estimate the mean value is a problem. In both cases assume that the sample S of n units contains r available values (r < n), forming a sub-space R, and (n − r) missing values, with sub-space R^C, i.e. S = R ∪ R^C. For every i ∈ R the Y_i are available values of the attribute Y, and for i ∈ R^C the Y_i values are missing; imputed values have to be derived to replace them.

II. PROPOSED IMPUTATION STRATEGIES

Under the proposed imputation procedures the attribute domain takes the following forms in the two cases:

Case I:
(y_{MRI_1})_i = y_i, if i ∈ R;
(y_{MRI_1})_i = [n \bar{y}_r / (n − r)] [\bar{x}^*/\bar{x} − r/n], if i ∈ R^C.   (2.1)

Case II:
(y_{MRI_2})_i = y_i, if i ∈ R;
(y_{MRI_2})_i = [n \bar{y}_r / (n − r)] [\bar{x}^*/\bar{x}_r − r/n], if i ∈ R^C.   (2.2)

Theorem 2.1: Under these estimation strategies the point estimators of \bar{Y} are:

(1) (T'_{rd})_1 = \bar{y}_S = (1/n) Σ_{i∈S} (y_{MRI_1})_i
= (1/n) [Σ_{i∈R} (y_{MRI_1})_i + Σ_{i∈R^C} (y_{MRI_1})_i]
= (1/n) [r \bar{y}_r + n \bar{y}_r (\bar{x}^*/\bar{x}) − r \bar{y}_r]
= \bar{y}_r (\bar{x}^*/\bar{x})   (2.3)

(2) (T'_{rd})_2 = \bar{y}_S = (1/n) Σ_{i∈S} (y_{MRI_2})_i = \bar{y}_r (\bar{x}^*/\bar{x}_r)   (2.4)

III. PROPERTIES OF THE PROPOSED IMPUTATION STRATEGIES IN A DATA WAREHOUSE

Let B(.) and M(.) denote the bias and the mean squared error (M.S.E.) of the proposed methods under a given sampling design. The large-sample approximations are

\bar{y}_r = \bar{Y}(1 + e_1); \bar{x}_r = \bar{X}(1 + e_2); \bar{x} = \bar{X}(1 + e_3); \bar{x}' = \bar{X}(1 + e_3').   (3.1)

Using the concept of two-phase sampling, following Rao and Sitter (1995), and the MCAR mechanism, for given r, n and n' we write:

Under design case I:
E(e_1) = E(e_2) = E(e_3) = E(e_3') = 0;
E(e_1^2) = θ_1 C_Y^2; E(e_2^2) = θ_1 C_X^2; E(e_3^2) = θ_2 C_X^2; E(e_3'^2) = θ_3 C_X^2;
E(e_1 e_2) = θ_1 ρ C_Y C_X; E(e_1 e_3) = θ_2 ρ C_Y C_X; E(e_1 e_3') = θ_3 ρ C_Y C_X;
E(e_2 e_3) = θ_2 C_X^2; E(e_2 e_3') = θ_3 C_X^2; E(e_3 e_3') = θ_3 C_X^2.   (3.2)

Under design case II:
E(e_1) = E(e_2) = E(e_3) = E(e_3') = 0;
E(e_1^2) = θ_4 C_Y^2; E(e_2^2) = θ_4 C_X^2; E(e_3^2) = θ_5 C_X^2; E(e_3'^2) = θ_3 C_X^2;
E(e_1 e_2) = θ_4 ρ C_Y C_X; E(e_1 e_3) = θ_5 ρ C_Y C_X; E(e_1 e_3') = 0;
E(e_2 e_3) = θ_5 C_X^2; E(e_2 e_3') = 0; E(e_3 e_3') = 0,   (3.3)

where
θ_1 = 1/r − 1/n'; θ_2 = 1/n − 1/n'; θ_3 = 1/n' − 1/N; θ_4 = 1/r − 1/N; θ_5 = 1/n − 1/N.   (3.4)

Theorem 3.1: The estimators (T'_{rd})_1 and (T'_{rd})_2 can be expressed, up to second-order terms in the e's, as:

(i) (T'_{rd})_1 = k\bar{Y}[1 + e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2]
(ii) (T'_{rd})_2 = k\bar{Y}[1 + e_1 − e_2 + e_3' − e_1 e_2 − e_3' e_2 + e_1 e_3' + e_2^2]   (3.5)

Proof:
(i) (T'_{rd})_1 = \bar{y}_r (\bar{x}^*/\bar{x}) = \bar{y}_r (k \bar{x}'/\bar{x})   [since \bar{x}^* = k \bar{x}']
= k \bar{Y}(1 + e_1) \bar{X}(1 + e_3') / [\bar{X}(1 + e_3)]   [from equation (3.1)]
= k \bar{Y}(1 + e_1)(1 + e_3')(1 + e_3)^{−1}
= k \bar{Y}(1 + e_1)(1 + e_3')(1 − e_3 + e_3^2 − ...).
Using the binomial theorem and ignoring terms of order higher than two, we have
(T'_{rd})_1 = k\bar{Y}[1 + e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2].

(ii) (T'_{rd})_2 = \bar{y}_r (\bar{x}^*/\bar{x}_r) = k \bar{Y}(1 + e_1)(1 + e_3')(1 + e_2)^{−1}
= k \bar{Y}(1 + e_1)(1 + e_3')(1 − e_2 + e_2^2 − ...)
= k\bar{Y}[1 + e_1 − e_2 + e_3' − e_1 e_2 − e_3' e_2 + e_1 e_3' + e_2^2].


Theorem 3.2: The biases of (T'_{rd})_1 and (T'_{rd})_2 under designs I and II respectively, up to the first order of approximation, are:

(i) B[(T'_{rd})_1]_I = \bar{Y}[(k − 1) + k(θ_2 − θ_3)(C_X^2 − ρ C_Y C_X)]   (3.6)
(ii) B[(T'_{rd})_2]_I = \bar{Y}[(k − 1) + k(θ_1 − θ_3)(C_X^2 − ρ C_Y C_X)]   (3.7)
(iii) B[(T'_{rd})_1]_II = \bar{Y}[(k − 1) + k θ_5 (C_X^2 − ρ C_Y C_X)]   (3.8)
(iv) B[(T'_{rd})_2]_II = \bar{Y}[(k − 1) + k θ_4 (C_X^2 − ρ C_Y C_X)]   (3.9)

Proof:
(i) The bias of (T'_{rd})_1 under design I is
B[(T'_{rd})_1]_I = E[(T'_{rd})_1 − \bar{Y}]
= E[k\bar{Y}(1 + e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2)] − \bar{Y}
= \bar{Y} E[(k − 1) + k(e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2)]
= \bar{Y}[(k − 1) + k(θ_2 − θ_3)(C_X^2 − ρ C_Y C_X)]   [from equation (3.2)].

(ii) The bias of (T'_{rd})_2 under design I is
B[(T'_{rd})_2]_I = E[(T'_{rd})_2 − \bar{Y}]
= \bar{Y}[(k − 1) + k{E(e_2^2) − E(e_1 e_2) − E(e_3' e_2) + E(e_1 e_3')}]
= \bar{Y}[(k − 1) + k(θ_1 − θ_3)(C_X^2 − ρ C_Y C_X)].

(iii) The bias of (T'_{rd})_1 under design II is
B[(T'_{rd})_1]_II = E[k\bar{Y}(1 + e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2)] − \bar{Y}
= \bar{Y}[(k − 1) + k(θ_5 C_X^2 − θ_5 ρ C_Y C_X)]   [from equation (3.3)]
= \bar{Y}[(k − 1) + k θ_5 (C_X^2 − ρ C_Y C_X)].

(iv) The bias of (T'_{rd})_2 under design II is
B[(T'_{rd})_2]_II = E[(T'_{rd})_2 − \bar{Y}] = \bar{Y}[(k − 1) + k θ_4 (C_X^2 − ρ C_Y C_X)].

Theorem 3.3: The M.S.E. of (T'_{rd})_1 and (T'_{rd})_2 under designs I and II respectively, up to the first order of approximation, are:

(i) M[(T'_{rd})_1]_I = \bar{Y}^2[(k − 1)^2 + k^2{θ_1 C_Y^2 + (θ_2 − θ_3)(C_X^2 − 2ρ C_Y C_X)} + 2k(k − 1)(θ_2 − θ_3)(C_X^2 − ρ C_Y C_X)]   (3.10)

(ii) M[(T'_{rd})_2]_I = \bar{Y}^2[(k − 1)^2 + k^2{θ_1 C_Y^2 + (θ_1 − θ_3)(C_X^2 − 2ρ C_Y C_X)} + 2k(k − 1)(θ_1 − θ_3)(C_X^2 − ρ C_Y C_X)]   (3.11)

(iii) M[(T'_{rd})_1]_II = \bar{Y}^2[(k − 1)^2 + k^2{θ_4 C_Y^2 + (θ_5 + θ_3)C_X^2 − 2θ_5 ρ C_Y C_X} + 2k(k − 1) θ_5 (C_X^2 − ρ C_Y C_X)]   (3.12)

(iv) M[(T'_{rd})_2]_II = \bar{Y}^2[(k − 1)^2 + k^2{θ_4 C_Y^2 + (θ_4 + θ_3)C_X^2 − 2θ_4 ρ C_Y C_X} + 2k(k − 1) θ_4 (C_X^2 − ρ C_Y C_X)]   (3.13)

Proof:
(i) M[(T'_{rd})_1]_I = E[(T'_{rd})_1 − \bar{Y}]^2
= E[k\bar{Y}(1 + e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2) − \bar{Y}]^2
= \bar{Y}^2 E[(k − 1) + k(e_1 − e_3 + e_3' − e_1 e_3 + e_1 e_3' − e_3 e_3' + e_3^2)]^2,
which, on using the expectations (3.2) and retaining terms up to the first order of approximation, gives (3.10).
(ii) Similarly, M[(T'_{rd})_2]_I = \bar{Y}^2 E[(k − 1) + k(e_1 − e_2 + e_3' − e_1 e_2 − e_3' e_2 + e_1 e_3' + e_2^2)]^2, which with (3.2) gives (3.11).
(iii) and (iv): Under design II the same expansions are evaluated with the expectations (3.3), giving (3.12) and (3.13).
IV. COMPARISON OF THE ESTIMATORS IN DATA MINING ENVIRONMENT


(i) D_1 = M[(T'_{rd})_1]_I − M[(T'_{rd})_2]_I
= \bar{Y}^2 (θ_2 − θ_1)[k^2(C_X^2 − 2ρ C_Y C_X) + 2k(k − 1)(C_X^2 − ρ C_Y C_X)].

M[(T'_{rd})_2]_I is better than M[(T'_{rd})_1]_I if D_1 > 0, which requires (θ_2 − θ_1) > 0, i.e. n < r, which is not possible; hence M[(T'_{rd})_1]_I is better.

(ii) D_2 = M[(T'_{rd})_1]_II − M[(T'_{rd})_2]_II
= \bar{Y}^2 (θ_5 − θ_4)[k^2(C_X^2 − 2ρ C_Y C_X) + 2k(k − 1)(C_X^2 − ρ C_Y C_X)].

M[(T'_{rd})_2]_II is better than M[(T'_{rd})_1]_II if D_2 > 0, which requires (θ_5 − θ_4) > 0, i.e. n < r, which again is not possible; hence M[(T'_{rd})_1]_II is better.

(iii) D_3 = M[(T'_{rd})_1]_I − M[(T'_{rd})_1]_II
= \bar{Y}^2[k^2{(θ_1 − θ_4)C_Y^2 + (θ_2 − 2θ_3 − θ_5)C_X^2 − 2(θ_2 − θ_3 − θ_5)ρ C_Y C_X} + 2k(k − 1)(θ_2 − θ_3 − θ_5)(C_X^2 − ρ C_Y C_X)].

M[(T'_{rd})_1]_II is better than M[(T'_{rd})_1]_I if D_3 > 0, which holds when
k > 2(θ_2 − θ_3 − θ_5)(C_X^2 − ρ C_Y C_X) / [(θ_1 − θ_4)C_Y^2 + (3θ_2 − 4θ_3 − 3θ_5)C_X^2 − 4(θ_2 − θ_3 − θ_5)ρ C_Y C_X],
provided the denominator is positive.

(iv) D_4 = M[(T'_{rd})_2]_I − M[(T'_{rd})_2]_II
= \bar{Y}^2[k^2{(θ_1 − θ_4)C_Y^2 + (θ_1 − 2θ_3 − θ_4)C_X^2 − 2(θ_1 − θ_3 − θ_4)ρ C_Y C_X} + 2k(k − 1)(θ_1 − θ_3 − θ_4)(C_X^2 − ρ C_Y C_X)].

M[(T'_{rd})_2]_II is better than M[(T'_{rd})_2]_I if D_4 > 0, which holds when
k > 2(θ_1 − θ_3 − θ_4)(C_X^2 − ρ C_Y C_X) / [(θ_1 − θ_4)C_Y^2 + (3θ_1 − 4θ_3 − 3θ_4)C_X^2 − 4(θ_1 − θ_3 − θ_4)ρ C_Y C_X],
provided the denominator is positive.
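Since D_1 and D_2 share the same bracketed factor, their signs can be inspected directly; a self-contained check under the Section VI/VII parameter values (a sketch added here, not the authors' code) is:

# Closed forms of D1 and D2 from (i) and (ii); parameters as in Sections VI-VII.
Ybar, cy, cx, rho = 42.485, 0.3321, 0.3763, 0.8652
r, n, n_prime, N = 45, 50, 110, 200
th1, th2 = 1/r - 1/n_prime, 1/n - 1/n_prime
th4, th5 = 1/r - 1/N, 1/n - 1/N
k = 1.0 / (1.0 + (1/n_prime - 1/N) * cx**2)

bracket = k**2 * (cx**2 - 2*rho*cy*cx) + 2*k*(k - 1) * (cx**2 - rho*cy*cx)
D1 = Ybar**2 * (th2 - th1) * bracket   # M[(T'_rd)_1]_I  - M[(T'_rd)_2]_I
D2 = Ybar**2 * (th5 - th4) * bracket   # M[(T'_rd)_1]_II - M[(T'_rd)_2]_II
print(D1, D2)  # a positive difference favours the second estimator of the pair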

VI. EMPIRICAL STUDY

The attached Appendix A contains an artificially generated dataset of size N = 200 with values of the main attribute Y and the auxiliary attribute X. The parameters of these attribute domains are given below:

\bar{Y} = 42.485; \bar{X} = 18.515; S_Y^2 = 199.0598; S_X^2 = 48.5375; ρ = 0.8652; C_X = 0.3763; C_Y = 0.3321.

VII. SIMULATION

The bias and optimum M.S.E. of the proposed estimators under both designs are computed through 50,000 repeated samples (n, n') as per design; the computations are given in Table 7.1. For designs I and II the simulation procedure has the following steps:

Step 1: Draw a random sample S' of size n' = 110 from the dataset of N = 200 by SRSWOR.
Step 2: Draw a random sub-sample of size n = 50 from S'.
Step 3: Drop 5 units at random from each second-phase sample, corresponding to Y.
Step 4: Impute the dropped units of Y by the proposed methods and the available methods, and compute the relevant statistic.
Step 5: Repeat the above steps 50,000 times, which provides the multiple sample-based estimates \bar{y}_1, \bar{y}_2, ..., \bar{y}_{50000}.
Step 6: The bias of \bar{y} is B(\bar{y}) = (1/50000) Σ_{i=1}^{50000} (\bar{y}_i − \bar{Y}).
Step 7: The M.S.E. of \bar{y} is M(\bar{y}) = (1/50000) Σ_{i=1}^{50000} (\bar{y}_i − \bar{Y})^2.
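A compact rendering of Steps 1-7 for design I and Case I imputation (a sketch added here, not the authors' original program; it assumes the Appendix A values are loaded into NumPy arrays Y and X) might look like:

import numpy as np

def simulate(Y, X, n_prime=110, n=50, n_miss=5, reps=50_000, seed=1):
    """Steps 1-7: two-phase SRSWOR, drop n_miss y-values, impute by (2.1), return bias and M.S.E."""
    rng = np.random.default_rng(seed)
    N, Ybar = len(Y), Y.mean()
    cx = X.std(ddof=1) / X.mean()
    k = 1.0 / (1.0 + (1/n_prime - 1/N) * cx**2)               # Searls constant
    est = np.empty(reps)
    for t in range(reps):
        s1 = rng.choice(N, size=n_prime, replace=False)       # Step 1: first-phase sample S'
        s2 = rng.choice(s1, size=n, replace=False)            # Step 2: sub-sample (design I)
        resp = rng.choice(s2, size=n - n_miss, replace=False) # Step 3: responding units R
        ybar_r = Y[resp].mean()
        x_bar, x_star = X[s2].mean(), k * X[s1].mean()
        est[t] = ybar_r * x_star / x_bar                      # Steps 4-5: (T'_rd)_1, eq. (2.3)
    bias = (est - Ybar).mean()                                # Step 6
    mse = ((est - Ybar) ** 2).mean()                          # Step 7
    return bias, mse

# Usage (assuming Y, X hold the N = 200 appendix values):
# print(simulate(Y, X))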

TABLE 7.1: BIAS AND M.S.E. OF (T'_{rd})_1 AND (T'_{rd})_2

                  Design F1                          Design F2
          [(T'_rd)_1]_I   [(T'_rd)_2]_I     [(T'_rd)_1]_II   [(T'_rd)_2]_II
Bias        -0.1899         -0.3912            0.3419           0.3327
M.S.E.       3.4012          2.8883            4.0611           4.8269

VIII. CONCLUSIONS

This paper takes a comparative approach to the two estimators examined under two different strategies of two-phase sampling. The estimator [(T'_{rd})_2]_I is better in terms of mean squared error than the [(T'_{rd})_1]_I estimator under design I. Moreover, in design II, the estimator [(T'_{rd})_1]_II is found better than the [(T'_{rd})_2]_II estimator. Both suggested methods of imputation are capable of recovering the values of missing observations in a data warehouse. Therefore, the suggested strategies are well suited to application in data mining.

REFERENCES
[1]. Charikar, M., Chekuri, C., Feder, T., and Motwani, R. (1997): Incremental Clustering and Dynamic Information Retrieval, Proc. ACM Symp. Theory of Computing.
[2]. Chen, C. Y., Hwang, S. C. and Oyang, Y. J. (2002): An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory, Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining.
[3]. Chuang, K. T., Lin, K. P., and Chen, M. S. (2007): Quality-Aware Sampling and Its Applications in Incremental Data Mining, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 4.
[4]. Cochran, W. G. (2005): Sampling Techniques, John Wiley and Sons, New York.
[5]. Domingo, C., Gavalda, R. and Watanabe, O. (2002): Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms, Data Mining and Knowledge Discovery.
[6]. Fisher, D. (1987): Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning.

[7]. Heitjan, D. F. and Basu, S. (1996): Distinguishing "Missing at Random" and "Missing Completely at Random", The American Statistician, 50, 207-213.
[8]. Joshi, S. and Jermaine, C. (2008): Materialized Sample Views for Database Approximation, IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 3.
[9]. Lee, C. H., Lin, C. R. and Chen, M. S. (2001): Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining, Proc. Conf. Information and Knowledge Management.
[10]. Lee, S. D., Cheung, D. W. L. and Kao, B. (1998): Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules, Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 233-262.
[11]. Liu, L., Tu, Y., Li, Y. and Zou, G. (2005, 2006): Imputation for missing data and variance estimation when auxiliary information is incomplete, Model Assisted Statistics and Applications, 83-94.
[12]. Motwani, R. and Raghavan, P. (1995): Randomized Algorithms, Cambridge Univ. Press.
[13]. Reddy, V. N. (1978): A study on the use of prior knowledge on certain population parameters in estimation, Sankhya, C, 40, 29-37.
[14]. Shukla, D. (2010): F-T estimator under two-phase sampling, Metron, 59, 1-2, 253-263.
[15]. Shukla, D., Singhai, R., and Dembla, N. (2002): Some imputation methods to treat missing values in knowledge discovery in data warehouse, IJDE, 1, 2, 1-13.
[16]. Singh, S. (2009): A new method of imputation in survey sampling, Statistics, Vol. 43, 5, 499-511.
[17]. Singh, S. and Horn, S. (2000): Compromised imputation in survey sampling, Metrika, 51, 266-276.
[18]. Yeh, R. L., Ching, L., Shia, B. C., Cheng, Y. T., Huwang, Y. F. (2008): Imputing manufacturing material in data mining, J. Intell. Manuf., 109-118.
[19]. Zhang, T., Ramakrishnan, R. and Livny, M. (1996): BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD.

Appendix A: Artificial Dataset (N = 200)


Listed as successive (Yi, Xi) pairs, read two values at a time:

45 15 40 29 36 15 35 13 65 27 67 25 50 15 61 25 73 28 23 07 57 31 33 13 30 11 37 15 23 09 37 17 45 21 32 15 35 19 52 25 50 20 55 35 43 20 41 15 62 25 70 30 69 23 72 31 63 23 25 09 54 23 33 11 35 15 37 16 20 08 27 13 40 23 27 13 35 19 43 19 39 23 45 20 68 38 45 18 68 30 60 27 53 29 65 30 67 23 35 15 60 25 20 07 20 08 37 17 26 11 21 11 31 15 30 14 46 23 39 18 60 35 36 14 70 42 65 25 85 45 40 21 55 30 39 19 47 17 30 11 51 17 25 09 18 07 34 13 26 12 23 11 20 11 33 17 39 15 37 17 42 18 40 18 50 23 30 09 40 15 35 15 71 33 43 21 53 19 38 13 26 09 28 13 20 09 41 20 40 15 24 09 40 20 31 15 35 17 20 11 38 12 58 25 56 25 28 08 32 12 30 17 74 31 57 23 51 17 60 25 32 11 40 15 27 13 35 15 56 25 21 08 50 25 47 25 30 13 23 09 28 8 56 28 45 18 32 11 60 22 25 09 55 17 37 15 54 18 60 27 30 13 33 13 23 12 39 21 41 15 39 15 45 23 43 23 31 19 35 15 42 15 62 21 32 11 38 13 57 19 38 15 39 14 71 30 57 21 40 15 45 19 38 17 42 25 45 25 47 25 33 17 35 17 35 17 53 25 39 17 38 17 58 19 30 09 61 23 47 17 23 11 43 17 71 32 59 23 47 17 55 25 41 15 37 21 24 11 43 21 25 11 30 16 30 16 63 35 45 19 35 13 46 18 38 17 58 21 55 21 55 21 45 19 70 29 39 20 30 11 54 27 33 13 45 22 27 13 33 15 35 19 35 18 40 19 41 21 37 19
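As a consistency check (a sketch added here; the string `raw` stands for the full whitespace-separated stream above and is truncated in this listing), the Section VI parameters can be recomputed from these pairs:

import numpy as np

# `raw` should hold all 400 numbers of Appendix A; only the first pairs are shown here.
raw = "45 15 40 29 36 15 35 13 65 27"  # ...paste the full stream of 200 pairs
vals = np.array(raw.split(), dtype=float)
Y, X = vals[0::2], vals[1::2]           # alternating Yi, Xi values

print("Ybar =", Y.mean(), " Xbar =", X.mean())
print("S_Y^2 =", Y.var(ddof=1), " S_X^2 =", X.var(ddof=1))
print("C_Y =", Y.std(ddof=1) / Y.mean(), " C_X =", X.std(ddof=1) / X.mean())
print("rho =", np.corrcoef(Y, X)[0, 1])  # should reproduce rho = 0.8652 on the full data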


Prof. D. Shukla is an Associate Professor in the Department of Mathematics and Statistics, Sagar University, Sagar (M.P.), India, with over 24 years of experience teaching U.G. and P.G. classes. His areas of research are sampling theory, graph theory, stochastic modelling, computer networks and operating systems.
Mr. Rahul Singhai is an Assistant Professor at I.I.P.S., Devi Ahilya University, Indore (M.P.), India, with over 8 years of experience teaching U.G. and P.G. classes. His areas of research are data mining, DBMS, computer networks and operating systems.
Dr. Narendra Singh Thakur is a Lecturer at B.T.I.R.T., Sironja, Sagar (M.P.), India. His areas of research are data mining, sampling theory, graph theory, and stochastic modelling.

