ABSTRACT
The term “Data Mining” is used primarily by statisticians, database researchers, and the business community. Data mining
extends traditional data analysis and statistical approaches by incorporating analytical techniques drawn from various
disciplines such as Artificial Intelligence (AI) and Machine Learning. This paper focuses on the potential contributions that
Data Mining (DM) could make for humanity in organizations and bureaus. It first provides a basic introduction to Data
Mining processes and tasks. Next, it explains how statistics and Data Mining are closely linked although they are
distinct disciplines. Finally, the paper outlines the future trends of Data Mining and the role it could play in human life.
Keywords: Knowledge Discovery, Descriptive Mining, Predictive Mining, Statistics & Analysis.
1. INTRODUCTION
1.1 What is Data Mining? Data mining can be described as “the practice of investigating large pre-existing databases in
order to generate new information”. More simply, it extracts useful information from large datasets. Data mining also
offers the ability to view data in a new light, discovering associations and patterns in the underlying data
that were never recognized before. The technique provides answers and prompts further questions from new discoveries.
This type of data discovery is also called Knowledge Discovery in Databases (KDD).
1.2 Why Data Mining? “Necessity is the mother of invention” (Plato) [1]. As the world becomes data rich but
information poor, the need to maintain, manipulate, and use data is becoming imperative. The fast-growing, tremendous
amount of data collected and stored in large and numerous data repositories has far exceeded the human ability to
comprehend it without powerful tools. Data Mining, which performs the required data analysis using
various KDD processes and tasks [1-2], can provide answers and make the decision-making process simpler. It has
therefore been adopted worldwide for all kinds of data management, data handling, and data discovery efforts.
1.3 KDD Process
Step 1: Data Collection - Data collection is the process of gathering and measuring data, information, or any variables
of interest in a standardized and established manner that enables the data collector to answer questions, test hypotheses,
and evaluate outcomes of the particular collection. This is an integral, usually initial, component of any research done in
any field of study, such as the physical and social sciences, business, and the humanities. The data may come
from different sources such as flat files, web data, databases, and document files.
Step 2: Data Cleaning and Integration - Cleaning is the process that improves the quality of the database. It handles
missing values and corrects noisy data. A cleaner database gives better and more reliable results. Integration combines data
from the different data sources gathered during data collection. Data analysis and mining typically require the data to be
integrated, for example into a data warehouse, even though they were collected by various means.
Step 3: Data Selection and Transformation - Data selection is the process of selecting the data relevant to the analysis of
interest. Transformation is the step in which data are transformed or consolidated into forms appropriate for mining, for
example by performing summary or aggregation operations.
Step 4: Data Mining - This is the essential process in which intelligent methods or tasks are applied in order to extract data
patterns. These patterns are then evaluated to extract knowledge [3].
Step 5: Pattern Evaluation - This step identifies the truly interesting (useful) patterns representing knowledge, based on some
interestingness measures [4]. These measures are intended for selecting and ranking patterns according to their
potential interest to the user or application.
Step 6: Knowledge Presentation - This is the process of presenting the knowledge acquired during Data
Mining in a pictorial or graphical format. Knowledge representation and data visualization methods such as graphs,
maps, charts, and matrices are used to present the mined knowledge to the user [3].
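The steps above can be illustrated with a small sketch. It is not the method of any particular tool: the transaction data, the function names, and the choice of support and confidence as the interestingness measures are all assumptions made for illustration. It cleans and integrates two hypothetical sources, selects the relevant field, counts co-occurring item pairs, and keeps only the pairs that pass a minimum support threshold.

```python
from collections import Counter
from itertools import combinations

# Step 1 (collection): hypothetical transactions from two sources.
source_a = [
    {"id": 1, "items": ["bread", "milk"]},
    {"id": 2, "items": ["bread", "butter", "milk"]},
    {"id": 3, "items": None},  # record with a missing value
]
source_b = [
    {"id": 4, "items": ["butter", "milk"]},
    {"id": 5, "items": ["bread", "milk"]},
]

def clean(records):
    """Step 2 (cleaning): drop records whose item list is missing or empty."""
    return [r for r in records if r.get("items")]

def integrate(*sources):
    """Step 2 (integration): combine the cleaned sources into one collection."""
    return [r for src in sources for r in clean(src)]

def select_and_transform(records):
    """Step 3: keep only the item field, stored as one set per transaction."""
    return [frozenset(r["items"]) for r in records]

def mine_pairs(transactions):
    """Step 4: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts

def evaluate(pair_counts, transactions, min_support=0.5):
    """Step 5: keep pairs with enough support; also report confidence of a -> b."""
    n = len(transactions)
    kept = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            count_a = sum(1 for t in transactions if a in t)
            kept[(a, b)] = (support, count / count_a)
    return kept

transactions = select_and_transform(integrate(source_a, source_b))
patterns = evaluate(mine_pairs(transactions), transactions)

# Step 6 (presentation): a plain-text listing stands in for a chart here.
for pair, (support, confidence) in sorted(patterns.items()):
    print(pair, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

With this toy data, the record with the missing item list is discarded during cleaning, and only the pairs that appear in at least half of the remaining transactions are reported.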
goal is to see if a change in the independent variable will result in a change in the dependent variable. This information
helps in understanding an independent variable's predictive abilities.
Discriminant Analysis is used to predict membership in two or more mutually exclusive groups from a set of
predictors, when there is no natural ordering on the groups. Discriminant analysis can be seen as the inverse of a one-
way multivariate analysis of variance (MANOVA) in that the levels of the independent variable (or factor) for
MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of
the MANOVA become the predictors for discriminant analysis [8].
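As a rough illustration of predicting group membership from predictors, the sketch below implements Fisher's linear discriminant for the special case of two groups and two predictors. The data points, the function names, and the midpoint decision threshold (which assumes equal group sizes and priors) are assumptions for illustration, not taken from [8].

```python
def mean(rows):
    """Mean vector of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, m):
    """Within-group scatter matrix (2x2) around the group mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fit_discriminant(group_a, group_b):
    """Return weights w and threshold c of Fisher's two-group discriminant."""
    ma, mb = mean(group_a), mean(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    # Pooled within-group scatter and its inverse (2x2 closed form).
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold: projection of the midpoint between the two group means.
    c = 0.5 * sum(w[i] * (ma[i] + mb[i]) for i in range(2))
    return w, c

def predict(x, w, c):
    """Assign x to group 'A' if its projection exceeds the threshold, else 'B'."""
    return "A" if w[0] * x[0] + w[1] * x[1] > c else "B"

# Hypothetical training data: two predictors per observation.
group_a = [(1, 2), (2, 3), (3, 3)]
group_b = [(6, 5), (7, 8), (8, 7)]
w, c = fit_discriminant(group_a, group_b)
print(predict((2, 2), w, c), predict((7, 7), w, c))
```

The groups here have no natural ordering; the labels "A" and "B" are arbitrary categories, which is the setting the paragraph above describes.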
Regression Analysis is a statistical tool that uses the relation between two or more quantitative variables so that one
variable (the dependent variable) can be predicted from the others (the independent variables). However, no matter how
strong the statistical relations between the variables are, no cause-and-effect pattern is necessarily implied by the
regression model. Regression analysis comes in many flavors, including simple linear, multiple linear, curvilinear, and
multiple curvilinear regression models, as well as logistic regression [9].
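The simplest of these flavors, simple linear regression, can be sketched in a few lines using ordinary least squares. The data values and function names below are hypothetical and chosen only so that the fitted line is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_y(x, slope, intercept):
    """Predict the dependent variable for a new value of the independent one."""
    return intercept + slope * x

# Hypothetical observations: xs is the independent variable, ys the dependent.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict_y(6, slope, intercept))
```

A good fit of this kind only describes how the two variables move together in the observed data; as noted above, it does not by itself establish that one causes the other.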
4. CONCLUSION
This paper has discussed what Data Mining is and why it has come into use for humanity, and has described the
processes and tasks involved in mining data efficiently. The disciplines of Statistics and Data Mining have also been
discussed to show that these areas are highly interrelated and share a symbiotic relationship, although they are not
synonymous. The paper has also described the categories of Data Mining, descriptive and predictive, which are
often a source of confusion for Data Mining beginners. Finally, a discussion has been given on the trends of Data Mining
and the role it will play in the future. It is clear that as long as humans survive, Data Mining will too.
REFERENCES
1. J. Han, M. Kamber, "Data Mining", Morgan Kaufmann Publishers, San Francisco, CA, 2001.
2. R. J. Hilderman, H. J. Hamilton, "Knowledge Discovery and Interest Measures", Kluwer Academic, Boston, 2002.
3. Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", Second
Edition, Morgan Kaufmann Series.
4. Rakesh Agrawal, Ramakrishnan Srikant, "Fast algorithms for mining association rules", Proc. of the
20th VLDB Conf. (VLDB 94), 1994, pp. 487-499.
5. David Olson, "Descriptive Data Mining", Springer Nature Publishers, Singapore.
6. Daniel T. L., Chantal D. L., "Data Mining and Predictive Analytics", Second Edition, Wiley Publisher.
7. J. H. Friedman, "Data mining and statistics: What is the connection?", in: Keynote Speech of the 29th
Symposium on the Interface: Computing Science and Statistics, Houston, TX, 1997.
8. William R. K., "Discriminant Analysis", Series: Quantitative Applications in the Social Sciences, SAGE
University.
9. John O. Rawlings, Sastry G. Pantula, David A. Dickey, "Applied Regression Analysis: A Research Tool",
Second Edition, Springer Texts in Statistics.