
Data Quality

Data mining: the process of analysing data from different perspectives and
summarizing it into useful information. Data mining is an analytical tool for
analysing data.
Data mining is often applied to data that were collected for another purpose
or for future use. As a result, data mining usually cannot take advantage of
fixing data quality at its source. Data mining therefore focuses on:
1. The detection and correction of data quality problems.
2. The use of algorithms that can tolerate poor data quality.


Aspects of data quality :-
1. Measurement and data collection issues
It is unrealistic to expect that data will be perfect. Problems can arise from:
Human error
Limitations of measuring devices
Flaws in the data collection process
Spurious or duplicate objects, i.e. multiple data objects that all correspond
to a single real object.
Problems that involve measurement error include noise, artifacts, bias,
accuracy and precision.
Data quality issues that can involve both measurement and data collection
problems include outliers, missing and inconsistent values, and duplicate data.
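
As a minimal sketch of detecting duplicate objects, the Python below treats two records as the same real object when a normalized key matches. The record fields (`name`, `email`) and the matching rule are assumptions made for illustration; real data usually needs a more careful, domain-specific rule.

```python
def normalize(record):
    # Hypothetical matching key: name and email compared without regard
    # to letter case or extra whitespace.
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def remove_duplicates(records):
    """Keep the first record seen for each normalized key."""
    seen = set()
    unique = []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


people = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee ", "email": "ANN@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
deduplicated = remove_duplicates(people)
```

Here the first two records collapse into one object, while the third is kept.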

Measurement and data collection errors :-
Measurement error refers to any problem resulting from the measurement
process. A common problem is that the recorded value differs from the true
value to some extent.
Data collection error refers to omitting data objects or attribute values, or
inappropriately including a data object. E.g. a study of animals of a certain
species might include animals of a related species that are similar in
appearance.
Noise and artifacts
Noise is the random component of a measurement error. It may involve the
distortion of a value or the addition of spurious objects.
The term noise is often used in connection with data that has a spatial or
temporal component. In such cases, techniques from signal and image processing
can frequently be used to reduce noise.
Because noise is hard to remove completely, robust algorithms are developed
that produce acceptable results even when noise is present.
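
One simple noise-reduction technique for data with a temporal component is a moving average. The sketch below is only an illustration; the function name and the window size are assumptions, and other filters (e.g. a median filter) may suit a given data set better.

```python
def moving_average(values, window=3):
    """Smooth a sequence by averaging each value with its neighbours.

    Near the ends of the sequence, only the neighbours that exist are used.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbours = values[lo:hi]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


readings = [1.0, 2.0, 3.0]
smoothed_readings = moving_average(readings, window=3)  # [1.5, 2.0, 2.5]
```

A larger window removes more noise but also blurs genuine short-term changes in the signal.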
Outliers
Outliers are either (1) data objects that, in some sense, have characteristics
that are different from most of the other data objects in the data set, or
(2) values of an attribute that are unusual with respect to the typical values
of that attribute.
It is important to distinguish between the notions of noise and outliers:
unlike noise, an outlier can be a legitimate data object or value.
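
For unusual attribute values, one common rule of thumb is the interquartile range (IQR) rule: flag values that lie more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below assumes a single numeric attribute; the function name and the factor `k` are illustrative choices, not something prescribed by these notes.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


ages = [10, 12, 11, 13, 12, 11, 95]
unusual_ages = find_outliers(ages)
```

Whether a flagged value is an error or a legitimate but rare observation still has to be decided from knowledge of the data.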
Missing values
It is not unusual for an object to be missing one or more attribute values.
In some cases the information could not be collected, e.g. some people decline
to give their age or weight.
In other cases an attribute does not apply to every object, e.g. a form may
have a conditional part that is filled in only when the person gives a
particular answer to a previous question.
Missing values should be taken into account during the data analysis.
Strategies for dealing with missing values
Eliminate data objects or attributes :-
If a data set has only a few objects with missing values, those objects can be
omitted. Alternatively, attributes that have missing values can be eliminated.
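
Both elimination strategies can be sketched in a few lines of Python. The example assumes that objects are stored as dictionaries and that a missing value is represented by `None`; both are assumptions made for illustration.

```python
def drop_incomplete_objects(records):
    """Remove every object that has at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def drop_incomplete_attributes(records):
    """Remove every attribute that is missing in at least one object."""
    if not records:
        return []
    keep = [k for k in records[0] if all(r.get(k) is not None for r in records)]
    return [{k: r[k] for k in keep} for r in records]


survey = [
    {"id": 1, "age": 34, "weight": 70.5},
    {"id": 2, "age": None, "weight": 82.0},
    {"id": 3, "age": 29, "weight": 64.2},
]
complete_objects = drop_incomplete_objects(survey)
complete_attributes = drop_incomplete_attributes(survey)
```

Dropping objects keeps all attributes but loses one respondent here; dropping attributes keeps every respondent but loses the `age` column entirely.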
Estimate missing values :-
Sometimes missing values can be reliably estimated. For example, consider a
time series that changes in a reasonably smooth fashion but has a few widely
scattered missing values: in this case the missing values can be estimated
from the remaining values.
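
Estimating scattered missing values from the remaining values can be sketched with linear interpolation between the nearest known neighbours. This is one possible estimator, chosen here for simplicity; the function name and the use of `None` for a missing value are assumptions.

```python
def interpolate_missing(series):
    """Fill None gaps by linear interpolation between known neighbours.

    Gaps at the very start or end have only one neighbour and are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


temperatures = [1.0, None, 3.0, None, None, 6.0]
estimated = interpolate_missing(temperatures)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The estimate is only reasonable when the series really does change smoothly between the known points.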
Ignore the missing values during analysis :-
Many data mining approaches can be modified to ignore missing values. For
example, suppose objects are being clustered and the similarity between pairs
of objects needs to be calculated. If one or both objects of a pair have
missing values for some attributes, the similarity can be calculated using
only the attributes that do not have missing values.
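
A similarity computation that skips missing attributes might look like the sketch below. The similarity measure itself (one over one plus the mean absolute difference) is an assumption picked for illustration, as is the use of `None` for a missing value; any measure defined per attribute can be adapted in the same way.

```python
def similarity_ignoring_missing(a, b):
    """Similarity of two objects using only attributes present in both.

    Returns a value in (0, 1], or None when no attribute is present in both.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # nothing to compare
    mean_abs_diff = sum(abs(x - y) for x, y in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)


object_a = [0.0, 2.0, None]
object_b = [1.0, 3.0, 9.0]
similarity = similarity_ignoring_missing(object_a, object_b)  # 0.5
```

Only the first two attributes are compared here, because the third is missing in `object_a`.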
Inconsistent values :-
Data can contain inconsistent values. E.g. consider an address field where both
a zip code and a city are listed, but the specified zip code area is not
contained in that city.
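
A consistency check between zip code and city can be sketched as a lookup against a reference table. The table below is entirely hypothetical (made-up codes and city names); in practice it would come from an authoritative postal data source.

```python
# Hypothetical reference table mapping zip codes to cities.
ZIP_TO_CITY = {
    "00001": "Springfield",
    "00002": "Shelbyville",
}


def find_inconsistent_addresses(addresses, zip_to_city=ZIP_TO_CITY):
    """Return addresses whose city disagrees with the city of their zip code.

    Addresses with an unknown zip code are not flagged, because the
    reference table cannot confirm or refute them.
    """
    inconsistent = []
    for address in addresses:
        expected = zip_to_city.get(address["zip"])
        if expected is None:
            continue
        if expected.lower() != address["city"].strip().lower():
            inconsistent.append(address)
    return inconsistent


addresses = [
    {"zip": "00001", "city": "Springfield"},
    {"zip": "00002", "city": "Springfield"},
    {"zip": "99999", "city": "Elsewhere"},
]
flagged = find_inconsistent_addresses(addresses)
```

Detecting the inconsistency is the easy part; deciding whether the zip code or the city is the wrong value usually needs additional information.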
Issues related to applications
Data is of high quality if it is suitable for its intended use. This approach
has proven quite useful in business and industry. Apart from measurement and
data collection issues, some general issues are:
1. Timeliness: some data starts to age as soon as it is collected.
2. Relevance: the available data must contain the information necessary for
the application.
3. Knowledge about the data: ideally, data sets are accompanied by
documentation that describes the data.
