Data mining focuses on: (1) the detection and correction of data quality problems, and (2) the use of algorithms that can tolerate poor data quality. Data quality issues involve both measurement and data collection problems: outliers, missing and inconsistent values, and duplicate data.
Data Mining : It is the process of analysing data from different perspectives and summarizing it into useful information. Data mining is an analytical tool for analysing data. It is usually applied to data that were collected for another purpose or for future use, so data mining cannot take advantage of controlling data quality at its source. Data mining therefore focuses on:
1. The detection and correction of data quality problems.
2. The use of algorithms that can tolerate poor data quality.
Aspects of data quality :-
1. Measurement and data collection issues
It is unrealistic to expect that data will be perfect. Problems arise from human error, limitations of measuring devices, flaws in the data collection process, and spurious or duplicate objects, i.e. multiple data objects that all correspond to a single real object.
Problems that involve measurement error include noise, artifacts, bias, precision, and accuracy.
In summary, data quality issues involve both measurement and data collection problems: outliers, missing and inconsistent values, and duplicate data.
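Bias and precision, listed above among the measurement-error issues, can be estimated when the same known quantity is measured several times. The sketch below is only illustrative: the readings and the reference value are hypothetical, and it assumes the common convention (not stated in these notes) that bias is the mean of the measurements minus the true value and precision is the standard deviation of the repeated measurements.

```python
import statistics

# Hypothetical repeated readings of a reference mass whose true value is 1 g.
TRUE_VALUE = 1.0
readings = [1.015, 0.990, 1.013, 1.001, 0.986]


def bias(measurements, true_value):
    """Systematic error: mean of the measurements minus the known true value."""
    return statistics.mean(measurements) - true_value


def precision(measurements):
    """Spread of repeated measurements, taken here as the sample standard deviation."""
    return statistics.stdev(measurements)


if __name__ == "__main__":
    print(f"bias      = {bias(readings, TRUE_VALUE):.3f}")
    print(f"precision = {precision(readings):.3f}")
```

With these readings the bias is about 0.001 g and the precision about 0.013 g. A small bias together with a large spread would indicate a device that is correct on average but noisy.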
Measurement and data collection errors :-
Measurement error refers to any problem resulting from the measurement process. A common problem is that the recorded value differs from the true value to some extent.
Data collection error refers to omitting data objects or attribute values, or inappropriately including a data object. E.g. a study of animals of a certain species might include animals of a related species that are similar in appearance.

Noise and artifacts
Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. The term is often used in connection with data that has a spatial or temporal component; in such cases, techniques from signal and image processing can frequently be used to reduce noise. Because noise is difficult to eliminate completely, data mining relies on robust algorithms that produce acceptable results even when noise is present. Artifacts, by contrast, are deterministic distortions of the data, e.g. a streak in the same place on a set of photographs.

OUTLIERS
Outliers are either (1) data objects that, in some sense, have characteristics different from most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect to the typical values of that attribute. It is important to distinguish between noise and outliers: unlike noise, outliers can be legitimate data objects or values.

Missing values
It is not unusual for an object to be missing one or more attribute values. In some cases the information could not be collected, e.g. some people decline to give their age or weight. In other cases an attribute does not apply to every object, e.g. a form has a conditional part that is filled in only when the person answers a previous question in a certain way. Missing values should be taken into account during data analysis.

Strategies to deal with missing values
Eliminate data objects or attributes :- If a data set has only a few objects with missing values, those objects can be omitted. Similarly, attributes that have missing values can be eliminated.
Estimate missing values :- Sometimes missing data can be reliably estimated. E.g. if a time series changes in a reasonably smooth fashion but has a few widely scattered missing values, the missing values can be estimated from the remaining values.
Ignore the missing values during analysis :- Many data mining approaches can be modified to ignore missing values. E.g. suppose objects are being clustered and the similarity between pairs of objects needs to be calculated. If one or both objects of a pair have missing values for some attributes, the similarity can be calculated using only the attributes that do not have missing values.

Inconsistent values :-
Data can contain inconsistent values. E.g. consider an address field where both a zip code and a city are listed, but the specified zip code area is not contained in that city.

Issues related to applications
Data is of high quality if it is suitable for its intended use. This approach has proven quite useful in business and industry. Apart from measurement and data collection issues, some general issues are:
1. Timeliness: some data starts to age as soon as it is collected.
2. Relevance: the available data must contain the information necessary for the application.
3. Knowledge about the data: ideally, data sets are accompanied by documentation that describes different aspects of the data.
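The three strategies for missing values described earlier (eliminate, estimate, ignore) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a recommended implementation: the records and function names are hypothetical, `None` is assumed to mark a missing value, linear interpolation stands in for "estimating from the remaining values", and Euclidean distance stands in for the similarity measure used in clustering.

```python
import math

# Hypothetical records; None marks a missing attribute value.
records = [
    {"age": 34, "weight": 70.5},
    {"age": None, "weight": 82.0},
    {"age": 51, "weight": None},
    {"age": 29, "weight": 64.0},
]


def drop_incomplete(rows):
    """Strategy 1: eliminate data objects that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def interpolate(series):
    """Strategy 2: estimate interior gaps by linear interpolation
    between the nearest known neighbours. Leading and trailing gaps
    are left as None because they have only one known neighbour."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


def distance_ignoring_missing(a, b):
    """Strategy 3: Euclidean distance over the attributes that are
    present in both objects; returns None if they share none."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


if __name__ == "__main__":
    print(drop_incomplete(records))
    print(interpolate([10.0, None, 14.0, None, None, None, 22.0]))
    print(distance_ignoring_missing((1.0, None, 3.0), (4.0, 5.0, 7.0)))
```

One design caveat: a distance computed over fewer attributes is not directly comparable to one computed over all attributes, so a real analysis would usually rescale it by the number of attributes actually used.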