
IPASJ International Journal of Information Technology (IIJIT)

Web Site: http://www.ipasj.org/IIJIT/IIJIT.htm


A Publisher for Research Motivation ........ Email:editoriijit@ipasj.org
Volume 6, Issue 10, October 2018 ISSN 2321-5976

DATA MINING FOR HUMANITY: AN OVERVIEW

Anirudh A¹, Kayal Padmanandam²
¹ Student, Bachelor Degree, St. Mary’s College, Yousufguda, Hyderabad
² Lecturer, St. Mary’s College, Yousufguda, Hyderabad

ABSTRACT
The term “Data Mining” is primarily used by statisticians, database researchers, and the business community. Data mining is
an extension of traditional data analysis and statistical approaches, as it incorporates analytical techniques drawn from various
disciplines such as Artificial Intelligence (AI) and Machine Learning. This paper focuses on the potential contributions that
Data Mining (DM) could make for humanity in organizations and bureaus. It first provides a basic introduction to Data
Mining processes and tasks. Next, it gives a basic understanding of how statistics and Data Mining are closely related though
they are distinct disciplines. Finally, the paper details the future trends of Data Mining and the role it could play in human life.
Keywords: Knowledge Discovery, Descriptive Mining, Predictive Mining, Statistics & Analysis.

1. INTRODUCTION
1.1 What is Data Mining? It can be described as “the practice of investigating large pre-existing databases in order
to generate new information”. More simply, it extracts useful information from large datasets. Further, Data
Mining offers the ability to view data in a new light, discovering associations and patterns within the underlying data
that were not recognized before. The technique provides answers and prompts further questions from new discoveries.
This type of data discovery is also called Knowledge Discovery in Databases (KDD).
1.2 Why Data Mining? “Necessity is the mother of invention” (Plato) [1]. As the world becomes data rich but
information poor, the need to maintain, manipulate, and use data is becoming imperative. The fast-growing, tremendous
amount of data collected and stored in large and numerous data repositories has far exceeded the human ability to
comprehend it without powerful tools. Data Mining performs the required data analysis using various KDD processes
and tasks [1-2], provides answers, and makes the decision-making process simpler. It has therefore been adopted
worldwide for all kinds of data management, data handling, and data discovery efforts.
1.3 KDD Process
Step 1: Data Collection - Data collection is the process of gathering and measuring data, information, or any variables
of interest in a standardized and established manner that enables the data collector to answer questions or test hypotheses
and evaluate the outcomes of the particular collection. This is an integral, usually initial, component of any research done
in any field of study, such as the physical and social sciences, business, the humanities, and others. The data may come
from different sources such as flat files, web data, databases, document files, etc.
Step 2: Data Cleaning and Integration - Cleaning is a process that improves the quality of the database. It eliminates
missing values and corrects noisy data. A clean database gives better and more reliable results. Integration combines data
from the different data sources gathered during data collection. Data analysis and mining require the data to be
integrated, typically as a data warehouse, even though it was collected by various means.
Step 3: Data Selection and Transformation - Data selection is the process of selecting the data relevant to the analysis of
interest. Transformation is a step where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations.
Step 4: Data Mining - This is the essential process where intelligent methods or tasks are applied in order to extract data
patterns. These patterns are further evaluated to extract knowledge [3].
Step 5: Pattern Evaluation - This step identifies the truly interesting (useful) patterns representing knowledge, based on
some interestingness measures [4]. These measures are intended for selecting and ranking patterns according to their
potential interest to the user or application.


Step 6: Knowledge Presentation - This is the process of visualizing the knowledge acquired during Data Mining in a
pictorial or graphical format. Many knowledge representation and data visualization methods, such as graphs,
maps, charts, and matrices, are used to present the mined knowledge to the user [3].

Figure 1. KDD Process
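
The steps above can be sketched as a very small pipeline. The following is only an illustrative sketch, not the method of any cited work: the two data sources, the field names (`basket`, `item`), the helper names, and the support threshold of 0.4 are all assumptions made for the example. It uses the support of an item pair (the fraction of baskets containing both items) as one simple interestingness measure.

```python
from collections import Counter
from itertools import combinations

# Step 1 (collection): hypothetical raw rows gathered from two sources.
store_a = [
    {"basket": 1, "item": "bread"},
    {"basket": 1, "item": "milk"},
    {"basket": 2, "item": "bread"},
    {"basket": 2, "item": None},      # missing value
    {"basket": 3, "item": " Milk "},  # noisy value
]
store_b = [
    {"basket": 4, "item": "bread"},
    {"basket": 4, "item": "milk"},
    {"basket": 4, "item": "eggs"},
    {"basket": 5, "item": "eggs"},
]


def clean(rows):
    """Step 2a: drop rows with missing items and normalise noisy text."""
    return [
        {"basket": r["basket"], "item": r["item"].strip().lower()}
        for r in rows
        if r["item"] is not None and r["item"].strip()
    ]


def integrate(*sources):
    """Step 2b: combine the cleaned rows of all sources into one collection."""
    return [row for source in sources for row in source]


def transform(rows):
    """Step 3: consolidate the selected fields into one item set per basket."""
    baskets = {}
    for r in rows:
        baskets.setdefault(r["basket"], set()).add(r["item"])
    return baskets


def mine_pairs(baskets):
    """Step 4: count how often each pair of items occurs in the same basket."""
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Step 5: keep only the pairs whose support reaches the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


rows = integrate(clean(store_a), clean(store_b))
baskets = transform(rows)
patterns = evaluate(mine_pairs(baskets), len(baskets), min_support=0.4)

# Step 6 (presentation): list the surviving patterns, strongest first.
for pair, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(pair, round(support, 2))
```

With this toy data, only the bread and milk pair survives the evaluation step, with a support of 0.4. A real system would replace the printed list with charts or reports.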

1.4 Data Mining Tasks


Data mining tasks can be categorized as Descriptive and Predictive Data Mining Tasks.
Descriptive data mining [5] uses the knowledge already available in the data for its analysis. It says what
happened in the past and can also describe how the current scenario of the business could be enhanced.
For example, it can express the profit of sales store-wise, region-wise, city-wise, etc. It can also give ideas about
how to enhance future sales. In short, it describes and diagnoses the data.
Predictive data mining [6] uses the available knowledge to make predictions about the future. It says what
may happen in the future by analyzing the trend patterns of the available data. In addition, it can also
caution about the actions required. It involves utilizing a variety of statistical, modeling, data mining, and
machine learning techniques to dig into historical data and allows analysts to make predictions. Because
predictive analytics is probabilistic, it can only forecast what might happen in the future. It requires greater
data expertise and a larger tool set, and it is considered one of the advanced, intelligent, and complex tasks in
analytics.
Together, the two types of mining can answer questions such as ‘what’, ‘why’, ‘future’, and ‘action required’.
Some of the tasks of Descriptive and Predictive Mining are given in Table 1.
Table 1. Descriptive vs. Predictive Data Mining Tasks

DESCRIPTIVE                  PREDICTIVE
Concept/Class Description    Time Series Analysis
Association Analysis         Regression Analysis
Classification               Prediction
Clustering                   Deviation Analysis
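
The difference between the two categories can be illustrated with a small sketch. The sales records, region names, and function names below are hypothetical. The descriptive part summarizes profit region-wise, as in the example given in the text. The predictive part uses a deliberately naive forecast (the last value plus the average quarter-to-quarter change), chosen only for brevity and not taken from any cited source.

```python
from collections import defaultdict

# Hypothetical sales records: (region, quarter, profit).
sales = [
    ("North", 1, 100.0), ("North", 2, 110.0), ("North", 3, 120.0),
    ("South", 1, 80.0), ("South", 2, 70.0), ("South", 3, 60.0),
]


def profit_by_region(records):
    """Descriptive: summarise what has already happened, region-wise."""
    totals = defaultdict(float)
    for region, _quarter, profit in records:
        totals[region] += profit
    return dict(totals)


def naive_forecast(records, region):
    """Predictive: extend the average quarter-to-quarter change by one step."""
    ordered = sorted(records, key=lambda rec: rec[1])
    series = [profit for reg, _quarter, profit in ordered if reg == region]
    changes = [later - earlier for earlier, later in zip(series, series[1:])]
    return series[-1] + sum(changes) / len(changes)


totals = profit_by_region(sales)
north_next = naive_forecast(sales, "North")
south_next = naive_forecast(sales, "South")
print(totals, north_next, south_next)
```

The descriptive summary only reports totals per region, while the forecast is a guess about a quarter that has not yet been observed and therefore carries uncertainty.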

2. DATA MINING AND STATISTICS


The disciplines of statistics and data mining both aim to discover structure in data. Their aims overlap so much that
some people regard data mining as a subset of statistics. But that is not a realistic assessment, as data mining also makes
use of ideas, tools, and methods from other areas, particularly business, database technology, and machine
learning, and is not heavily concerned with some areas in which statisticians are interested [7]. Statistical procedures
do, however, play a major role in data mining, particularly in the processes of developing and assessing models. Most
learning algorithms use statistical tests, including for correcting models that are overfitted. Statistical tests are also
used to validate machine learning models and to evaluate machine learning algorithms. Some of the commonly used
statistical analysis techniques are discussed below.
2.1 Different Types of Statistical /Mining Analysis
Cluster Analysis seeks to organize information about variables so that relatively homogeneous groups, or "clusters,"
can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members are
similar to one another) and highly externally heterogeneous (members are not like members of other clusters).
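
As one concrete member of this family, a minimal k-means sketch is shown here. The choice of k-means, the sample points, and the starting centroids are assumptions made for illustration. The text does not prescribe a particular clustering algorithm.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Each resulting cluster contains points that are close to one another and far from the members of the other cluster, which is the homogeneity property described above.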
Correlation Analysis measures the relationship between two variables. The resulting correlation coefficient shows
how strongly changes in one variable are associated with changes in the other. When comparing the correlation between
two variables, the goal is to see whether a change in the independent variable is accompanied by a change in the
dependent variable. This information helps in understanding an independent variable's predictive abilities.
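
A common choice of correlation coefficient is Pearson's r, sketched below. The text does not name a specific coefficient, so the use of Pearson's r is an assumption, and the sample data (hours studied, test scores, errors made) are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]   # rises with hours
errors = [10, 8, 6, 4, 2]   # falls with hours

r_up = pearson_r(hours, scores)
r_down = pearson_r(hours, errors)
print(r_up, r_down)
```

Values near +1 or -1 indicate a strong linear association in the same or the opposite direction, and values near 0 indicate little linear association.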
Discriminant Analysis is used to predict membership in two or more mutually exclusive groups from a set of
predictors, when there is no natural ordering on the groups. Discriminant analysis can be seen as the inverse of a one-
way multivariate analysis of variance (MANOVA) in that the levels of the independent variable (or factor) for
MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of
the MANOVA become the predictors for discriminant analysis [8].
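
For the two-group case, one classical form of discriminant analysis is Fisher's linear discriminant, sketched here for two predictors. This is a simplified illustration, not the procedure of reference [8]: the data are hypothetical, and placing the decision threshold midway between the projected group means assumes that both groups are equally likely.

```python
def mean_vector(rows):
    """Mean of each of the two predictors for one group."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def scatter(rows, mean):
    """2x2 within-group scatter matrix for one group."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_discriminant(group_a, group_b):
    """Return the projection weights w and a threshold between the groups."""
    m_a, m_b = mean_vector(group_a), mean_vector(group_b)
    s_a, s_b = scatter(group_a, m_a), scatter(group_b, m_b)
    # Pooled within-group scatter and its inverse (2x2 closed form).
    sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m_b[0] - m_a[0], m_b[1] - m_a[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Assumption: equal priors, so the cut-off is the projected midpoint.
    mid = [(m_a[0] + m_b[0]) / 2, (m_a[1] + m_b[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def predict(w, threshold, x):
    """Assign a new observation to group 'A' or 'B'."""
    score = w[0] * x[0] + w[1] * x[1]
    return "B" if score > threshold else "A"


group_a = [(1, 2), (2, 1), (2, 3), (3, 2)]
group_b = [(6, 7), (7, 6), (7, 8), (8, 7)]
w, threshold = fit_discriminant(group_a, group_b)
print(predict(w, threshold, (2, 2)), predict(w, threshold, (7, 6)))
```

The predictors play the role of the MANOVA dependent variables, and the group label being predicted plays the role of the MANOVA factor, as described above.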
Regression Analysis is a statistical tool that uses the relation between two or more quantitative variables so that one
variable (the dependent variable) can be predicted from the others (the independent variables). But no matter how strong
the statistical relations are between the variables, no cause-and-effect pattern is necessarily implied by the regression
model. Regression analysis comes in many flavors, including simple linear, multiple linear, curvilinear, and multiple
curvilinear regression models, as well as logistic regression [9].
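
The simplest of these flavors, simple linear regression, can be sketched with ordinary least squares as follows. The data and function names are hypothetical, and the fitted line describes an association only. As noted above, it does not establish cause and effect.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_y(slope, intercept, x):
    """Predict the dependent variable for a new value of x."""
    return slope * x + intercept


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict_y(slope, intercept, 5))
```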

3. TRENDS IN DATA MINING


Although various commercial systems for data mining are available, many challenges arise when they are actually
implemented. With the rapid evolution of the field of data mining, companies are expected to stay abreast of all the new
developments.
Business and domain experts need to keep track of the newest data mining trends and stay up to date in the
industry to overcome these challenges.
3.1 Important Future Trends in Data Mining
Businesses which have been slow in adopting the process of data mining are now catching up with the others.
Extracting important information through the process of data mining is widely used to make critical business decisions.
In the coming decade, we can expect data mining to become as ubiquitous as some of the more prevalent technologies
used today. Some of the key data mining trends for the future include:
1. Multimedia Data Mining
This is one of the latest methods and is gaining ground because of the growing ability to capture useful data accurately. It
involves the extraction of data from different kinds of multimedia sources such as audio, text, hypertext, video, images,
etc., and the data is converted into a numerical representation in different formats. This method can be used in
clustering and classification, in performing similarity checks, and in identifying associations.
2. Ubiquitous Data Mining
This method involves the mining of data from mobile devices to get information about individuals. Despite
several challenges such as complexity, privacy, and cost, this method has great potential
in various industries, especially in studying human-computer interactions.
3. Distributed Data Mining
This type of data mining is gaining popularity as it involves the mining of huge amounts of information stored in
different company locations or at different organizations. Highly sophisticated algorithms are used to extract data from
the different locations and provide proper insights and reports based upon them.
4. Spatial and Geographic Data Mining
This is a newly trending type of data mining which involves extracting information from environmental, astronomical, and
geographical data, including images taken from outer space. This type of data mining can reveal various
aspects such as distance and topology, and is mainly used in geographic information systems and other navigation
applications.
5. Time Series and Sequence Data Mining
The primary application of this type of data mining is the study of cyclical and seasonal trends. This practice is also helpful
in analyzing random events which occur outside the normal series of events. This method is mainly being used by
retail companies to assess customers' buying patterns and their behaviors.
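
The study of seasonal trends and of events outside the normal series, described under Time Series and Sequence Data Mining, can be sketched as follows. The quarterly sales figures, the cycle length of four, and the tolerance of 15 are hypothetical values chosen for illustration, and comparing each value with its seasonal average is only one simple approach.

```python
def seasonal_means(series, period):
    """Average value at each position within the repeating cycle."""
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(series):
        buckets[i % period].append(value)
    return [sum(b) / len(b) for b in buckets]


def unusual_points(series, period, tolerance):
    """Values that differ from their seasonal average by more than tolerance."""
    means = seasonal_means(series, period)
    return [
        (i, value)
        for i, value in enumerate(series)
        if abs(value - means[i % period]) > tolerance
    ]


# Hypothetical quarterly sales over three years (cycle length 4).
sales = [10, 20, 30, 20, 10, 20, 30, 20, 10, 50, 30, 20]
pattern = seasonal_means(sales, period=4)
outliers = unusual_points(sales, period=4, tolerance=15)
print(pattern)
print(outliers)
```

Here the seasonal pattern captures the regular quarterly cycle, and the single value of 50 in the second quarter of the third year is flagged as lying outside the normal series.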

4. CONCLUSION
This paper has discussed what Data Mining is and why it has become important to humanity, and has described the
process and tasks involved in mining data efficiently. The disciplines of Statistics and Data Mining have also been
discussed to show that these areas are highly interrelated and share a symbiotic relationship, though they are not
synonymous. The paper has also clearly described the categorization of Data Mining as descriptive and predictive, which
has always been a source of confusion for Data Mining beginners. Finally, the trends of Data Mining and the part it is
likely to play in the future have been discussed. It is clear that as long as humans survive, Data Mining will too.
REFERENCES
1. J. Han, M. Kamber, "Data Mining", Morgan Kaufmann Publishers, San Francisco, CA, 2001.
2. R. J. Hilderman, H. J. Hamilton, "Knowledge Discovery and Interest Measures", Kluwer Academic, Boston, 2002.
3. Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", Second
Edition, Morgan Kaufmann Series.
4. Rakesh Agrawal, Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules", Proc. of the
20th VLDB Conf. (VLDB 94), 1994, pp. 487–499.
5. David Olson, "Descriptive Data Mining", Springer Nature Publishers, Singapore.
6. Daniel T. L., Chantal D. L., "Data Mining and Predictive Analytics", Second Edition, Wiley Publisher.
7. J. H. Friedman, "Data Mining and Statistics: What is the Connection?", Keynote Speech of the 29th
Symposium on the Interface: Computing Science and Statistics, Houston, TX, 1997.
8. William R. K., "Discriminant Analysis", Series: Quantitative Applications in the Social Sciences, SAGE
University.
9. John O. Rawlings, Sastry G. Pantula, David A. Dickey, "Applied Regression Analysis: A Research Tool",
Second Edition, Springer Texts in Statistics.
