p144 Data Mining

A Paper Presentation on
- Information repository with knowledge discovery
Presented By:
PRAVEEN GANGULA SRIVASTHAV NANDANAVANAM
III/IV B.Tech (C.S.E), III/IV B.Tech (C.S.E),
G.M.R.I.T., G.M.R.I.T.,
RAJAM. RAJAM.
E-mail:sairamanababu@yahoo.com nssagar@gmail.com
ABSTRACT
Organisations today suffer from a malaise of data overflow.
Developments in transaction processing technology have given rise to a situation
where the amount and rate of data capture is very high, but the processing of this data
into information that can be utilised for decision making is not developing at the same
pace. Data Mining is the process of extracting valid, previously unknown,
comprehensible, and actionable information from large databases and using it to make
crucial business decisions. Data Mining uses two kinds of models: the predictive model and
the descriptive model. Its tasks include Classification, Regression, Time
series analysis, Prediction, Clustering, and Summarization.
The main aim of Data Mining is Knowledge Discovery in Databases
(KDD). KDD is used to derive useful patterns from data. The KDD
process consists of Selection, Preprocessing, Transformation, Data Mining, and
Interpretation. Data Mining depends chiefly on Classification and Clustering, which
are its core techniques. Data mining metrics such as ROI (Return on Investment) are
applied to measure the effectiveness of data mining functions.
Data Mining draws on related concepts such as OLTP systems,
Fuzzy sets, and Web search engines. Data Mining can also be extended to Web
Mining, Spatial Mining, and Temporal Mining.
Our paper focuses on the need for information repositories and the
discovery of knowledge, and then gives an overview of the much-hyped field of Data Mining.
INTRODUCTION:
One of the reasons behind maintaining any database is to enable the user
to find interesting trends in the data. Data mining has been defined as "The nontrivial
extraction of implicit, previously unknown, and potentially useful information from
data". It uses machine learning, statistical, and visualization techniques to discover and
present knowledge in a form which is easily comprehensible to humans.
DEFINITION: -
Data mining is the process of extracting valid, previously unknown, comprehensible,
and actionable information from large databases and using it to make crucial business
decisions.
Why data mining?
Data mining got its start in what is now known as “customer relationship
management” (CRM). It is widely recognized that companies of all sizes need to learn to
emulate what small, service-oriented businesses have always done well: creating
one-to-one relationships with their customers. In every industry, forward-looking companies are
trying to move towards the one-to-one ideal of understanding each customer individually
and to use that understanding to make it easier for the customer to do business with them
rather than with a competitor. These same companies are learning to look at the lifetime
value of each customer so they know which ones are worth investing money and effort to
hold on to and which ones to let drop.
As noted, a small business builds one-to-one relationships with its
customers by noticing their needs, remembering their preferences, and learning from past
interactions how to serve them better in the future. In large commercial enterprises, the
first step - noticing what the customer does - has already largely been automated. On-line
transaction processing (OLTP) systems are everywhere, collecting data on seemingly
everything. The customer-focused enterprise regards every record of an interaction with a
client or prospect as a learning opportunity. But, learning requires more than simply
gathering data. In fact, many companies gather hundreds of gigabytes or terabytes of data
from and about their customers without learning anything. Data is gathered because it is
needed for some operational purpose, e.g. inventory control or billing.
DATA MINING MODELS: -
Data mining is mainly concerned with the use of software techniques for
finding hidden and unexpected patterns and relationships in sets of data. There are two
data mining models:
1. Predictive Model: This makes a prediction about values of data using known results
found from different data. The prediction may also be based on other historical data.
Predictive model data mining tasks are Classification, Regression, Time series analysis,
and Prediction. Example: - Credit Card Usage
2. Descriptive Model: This identifies patterns or relationships in data. It is used to
explore the properties of the data examined. Descriptive model data mining tasks are
Clustering, Summarization, Association rules, and Sequence discovery.
Example: - Manual Evaluation
                 DATA MINING
Predictive                 Descriptive
Classification             Clustering
Regression                 Summarization
Time Series Analysis       Association rules
Prediction                 Sequence discovery
Fig: Data Mining Models and Tasks
DATA MINING TASKS: -

These describe the functions performed by the data mining techniques.
1. Classification: - Classification maps data into predefined groups or classes. These
classes are always determined before examining the data.
2. Regression: - Regression maps a data item to a real-valued prediction variable. It
assumes the data fit some known type of function and determines which function of
that type is “best”.
3. Time series analysis: - With time series analysis, the value of an attribute is examined
as it varies over time.
4. Prediction: - Prediction estimates future data states based on past and current data.
5. Clustering: - Clustering maps data into groups that are defined by the data alone. That
is, similar data are grouped into clusters. Domain experts are usually needed to
interpret these clusters.
6. Summarization: - Summarization is also called characterization or generalization. It
maps data into subsets with associated simple descriptions. That is, it characterizes
the contents of the database.
7. Association rules: - An association rule is a model that identifies relationships among
data items, such as items that tend to occur together.
8. Sequence discovery: - Sequence discovery is used to determine sequential patterns in
data, based on a time sequence of actions.
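As an illustration of the association rule task listed above, the following minimal sketch computes the two measures conventionally used to rate a rule, support and confidence. These measures are not defined in this paper. The function names and the basket data are hypothetical and invented for this example only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical market-basket data, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

# Rate the candidate rule "bread => butter".
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Here two of the four baskets contain both items, and two of the three baskets with bread also contain butter, so the rule holds in a majority of the cases where it applies.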
ROLE OF ‘KDD’ IN DATA MINING: -
Definition of KDD process: -
Knowledge Discovery in Databases (KDD) is the process of finding useful
information and patterns in data.
Initial Data -> [Selection] -> Target Data -> [Preprocessing] -> Preprocessed Data
-> [Transformation] -> Transformed Data -> [Data Mining] -> Model
-> [Interpretation] -> Knowledge
Fig: KDD Process
Importance of KDD process in data mining: -
The KDD process and data mining are closely tied: data mining is the step
of the KDD process in which algorithms are applied to derive patterns from the data. The
KDD process takes data as input and produces the useful information desired by the users
as output. It is itself an interactive process, and it may require considerable elapsed
time to produce accurate results.
Steps involved in KDD process: -
• Selection: - The required data is obtained from various databases, files, and
non-electronic sources.
• Preprocessing: - Erroneous and missing data are identified and, where necessary,
corrected using data mining tools.
• Transformation: - Data may be encoded or transformed into more generalized,
usable, and reduced formats.
• Data mining: - Suitable algorithms are applied to the transformed data
to generate the desired results.
• Interpretation (Evaluation): - The results are presented to the users in an
effective way (using GUI strategies).
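The five steps above can be sketched as a chain of small functions. This is only a toy outline under stated assumptions: the record fields (`age`, `income`), the mean-fill rule for missing values, the age bands, and the "model" (mean income per band) are all hypothetical choices made for illustration, not part of the KDD definition.

```python
from statistics import mean


def select(sources):
    """Selection: gather the target records from several sources."""
    return [row for source in sources for row in source]


def preprocess(rows):
    """Preprocessing: drop records with no age, fill missing income with the mean."""
    rows = [dict(r) for r in rows if r.get("age") is not None]
    known = [r["income"] for r in rows if r.get("income") is not None]
    fill = mean(known)
    for r in rows:
        if r.get("income") is None:
            r["income"] = fill
    return rows


def transform(rows):
    """Transformation: generalize the exact age into a coarser age band."""
    return [
        {"age_band": "under_40" if r["age"] < 40 else "40_plus", "income": r["income"]}
        for r in rows
    ]


def mine(rows):
    """Data mining: a deliberately trivial model, the mean income per age band."""
    groups = {}
    for r in rows:
        groups.setdefault(r["age_band"], []).append(r["income"])
    return {band: mean(values) for band, values in groups.items()}


def interpret(model):
    """Interpretation: present the model in a readable form."""
    return [f"{band}: mean income {value:.0f}" for band, value in sorted(model.items())]


# Hypothetical records from two sources, invented for illustration.
db_a = [{"age": 25, "income": 30000}, {"age": 52, "income": None}]
db_b = [
    {"age": 47, "income": 60000},
    {"age": None, "income": 45000},
    {"age": 31, "income": 50000},
]

report = interpret(mine(transform(preprocess(select([db_a, db_b])))))
```

Each function consumes the output of the previous one, mirroring the flow from initial data to knowledge in the figure. A real system would revisit earlier steps when the interpretation is unsatisfactory.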
DEVELOPMENT PROCESS OF DATA MINING:
[Figure: Information Retrieval, Databases, Statistics, Algorithms, and Machine Learning
all contributing to Data Mining]
Fig: Development Process of Data Mining
The following functions are used in the development process of data mining.
1. Induction: - Induction is used to proceed from very specific knowledge to more general
information.
2. Compression: - A compressed description of the data characteristics is found in the
model.
3. Querying: - Different types of data mining queries may be developed.
4. Approximation: - Approximation helps uncover hidden information about data in large
databases.
5. Search problem: - Data mining is treated as a search problem, in which the size and
efficiency of developing an abstract model are considered.
DATA MINING ISSUES: -
1. Human interaction: - Interaction with both domain and technical experts may be
needed to assist in interpreting the results.
2. Outliers: - Outliers are data entries that do not fit into the derived model. These
must be identified and handled, often by eliminating them.
3. Large data sets: - Massive data sets create a number of problems. Sampling and
parallelization are tools that are used to reduce those problems.
4. Multimedia data: - Multimedia data complicates both the database and the algorithms.
5. Missing data: - The estimates used for missing data in the preprocessing phase can lead to
invalid results.
6. Irrelevant data: - Some attributes in the database might not be used for the current
task.
7. Noisy data: - Incorrect attribute values must be corrected before running data mining
applications.
8. Changing data: - Many algorithms assume a static database, so if the database changes
the algorithm must be completely rerun.
9. Integration: - Integration of data mining functions into traditional DBMS systems is
certainly a desirable goal.
DATA MINING METRICS: -

The effectiveness of data mining functions can be measured using ROI
(Return On Investment), which takes an overall usefulness perspective. ROI examines the
difference between what the data mining technique costs and what the savings or benefits
from its use are. The metrics used also include the traditional metrics of space and time,
based on complexity analysis.
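A small sketch of the ROI comparison described above. The paper speaks only of the difference between benefit and cost. The ratio form used here (net gain divided by cost) is the conventional way of expressing ROI and is an assumption of this example, as are the figures.

```python
def roi(benefit, cost):
    """Net gain (benefit minus cost) expressed as a fraction of the cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


# Hypothetical figures: a project costing 100,000 that saves 150,000.
project_roi = roi(benefit=150_000, cost=100_000)
```

A positive value means the savings exceed the cost of applying the technique, and a negative value means they do not.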
RELATED CONCEPTS: -
1. Database/OLTP Systems: - A database is a collection of data usually associated with
some organization. A DBMS is the software used to access the database.
A data mining query against a database or OLTP (On-Line Transaction Processing)
system produces a KDD object as output, whereas a normal query produces a
subset of the database. A KDD object is a rule, a classification, or a cluster.
2. Fuzzy sets and Fuzzy logic: - A fuzzy set is a set, F, in which the set membership
function, f, is a real-valued function with output in the range [0,1]: an element x is
said to belong to F with degree f(x) and simultaneously to be in ~F with
degree 1-f(x). Strictly, f(x) is a degree of membership rather than a true probability.
Fuzzy sets may be used to describe a number of data mining
functions in various database areas.
3. Decision support systems: - Decision support systems (DSS) are comprehensive computer
systems and related tools that assist managers in making decisions and solving
problems. A DSS could be enterprise-wide, thus giving upper-level managers the
data needed to make intelligent business decisions that impact the entire organization.
Data mining can be used to extract the data that a DSS needs.
4. Web search engines: - Web search engines are used to access data and can be
viewed as query systems. Search engine queries can be stated as keyword, Boolean,
weighted, etc. The data being searched consist of pages with heterogeneous
data and extensive hyperlinks.
5. Pattern Matching: - Pattern matching or pattern recognition finds occurrences of a
predefined pattern in data. Pattern matching is used in text editors, information
retrieval, and time series analysis.
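To make the fuzzy set membership function from item 2 concrete, here is a minimal sketch for a fuzzy set "tall". The linear ramp between 160 cm and 190 cm is a hypothetical choice made for illustration. Any function with output in [0,1] would satisfy the definition.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Membership function f for the fuzzy set 'tall', with output in [0, 1]."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    # Linear ramp between the two thresholds (an assumed shape).
    return (height_cm - low) / (high - low)


def complement(degree):
    """Membership in ~F is 1 - f(x)."""
    return 1.0 - degree


degree_tall = tall_membership(175.0)
degree_not_tall = complement(degree_tall)
```

Unlike an ordinary set, where a person either is or is not tall, a height of 175 cm belongs partly to "tall" and partly to its complement.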
CORE TOPICS: -
1. Classification: - Classification is the most familiar and most popular data mining
technique. The classification problem is defined as follows.
Given a database D = {t1, t2, …, tn} of tuples (items, records) and a set of classes C =
{C1, C2, …, Cm}, the classification problem is to define a mapping f: D→C where each ti
is assigned to one class. A class, Cj, contains precisely those tuples mapped to it; that
is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D}.
Algorithms used for solving classification problems in data mining: -
• Statistical-based algorithms
• Distance-based algorithms
• Decision tree-based algorithms
• Neural network-based algorithms
• Rule-based algorithms
• Combining techniques
Example: - Pattern recognition, as used in air force security checking, is one type of
classification.
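The mapping f: D→C defined above can be illustrated with a very small distance-based classifier, one of the algorithm families listed. In this sketch each class is represented by a single point, and a tuple is assigned to the class whose point is nearest. The class labels, representative points, and tuples are hypothetical, and this is only one of many ways to define f.

```python
import math


def classify(t, representatives):
    """The mapping f: return the label whose representative point is closest to tuple t."""
    return min(representatives, key=lambda label: math.dist(t, representatives[label]))


def partition(database, representatives):
    """Build each class Cj as the list of tuples ti with f(ti) = Cj."""
    classes = {label: [] for label in representatives}
    for t in database:
        classes[classify(t, representatives)].append(t)
    return classes


# Hypothetical classes, each described by one representative (age, score) point.
representatives = {"low_risk": (25.0, 80.0), "high_risk": (60.0, 20.0)}

# Hypothetical database D of tuples.
D = [(30.0, 75.0), (55.0, 25.0), (22.0, 90.0)]

classes = partition(D, representatives)
```

Because `classify` returns exactly one label per tuple, every tuple in D lands in one and only one class, as the definition requires.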
2. Clustering: - Clustering is defined as grouping similar data according to
characteristics found in the actual data. Example: - An international online catalog
company wishes to group its customers based on common features that include income,
age, number of children, marital status, and education. Depending on the type of
advertising, only the required attributes are considered for clustering.
Characteristics: -
• A cluster is a set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance
between a point in the cluster and any point outside it.
Problems that occur when clustering is applied to real-world databases: -
• Outlier handling and interpreting the semantics of each cluster may be difficult
• Cluster membership may change over time
• There is no prior knowledge of what data should be used or how many clusters are needed.
Algorithms used to solve clustering problems: -
• Hierarchical algorithms: Agglomerative algorithms and Divisive algorithms
• Categorical algorithms
• Large database: Sampling algorithms and Compression
Applications of clustering: -
• Plant and animal classification and disease classification
• Image processing, pattern recognition, and document retrieval.
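As a rough sketch of the agglomerative hierarchical approach named above, the code below starts with one cluster per point and repeatedly merges the two closest clusters. The single-link distance, the stopping rule (a fixed number of clusters k), and the customer data are assumptions made for this illustration. Practical implementations use more efficient data structures.

```python
import math


def single_link(a, b):
    """Distance between two clusters: the smallest distance between any pair of members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters (i < j) with the smallest single-link distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical customers as (age, income in thousands), invented for illustration.
customers = [(25, 30), (27, 32), (60, 90), (62, 95), (26, 31)]

groups = agglomerative(customers, k=2)
```

No class labels are supplied: the two groups emerge from the distances in the data alone, which is the key difference from classification.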
ADVANCED TOPICS: -
1. Web Mining: -Web mining is mining of data related to the World Wide Web.
Web data can be classified into the following classes.
• Content of actual web pages
• Intra page structure includes the HTML or XML code for the page
• Inter page structure is the actual linkage structure between web pages
• User profiles include demographic and registration information obtained about
users. This could also include information found in cookies.
The above classes of Web data may be manipulated using different web mining tasks
such as: Web Content Mining, Web Structure Mining and Web Usage Mining.
2. Spatial Mining: -Spatial mining is data mining as applied to spatial databases or
spatial data.
Specialized operations and data structures are used to access spatial data. Some of
those are:
• Spatial queries
• Thematic maps
• Spatial data structures
• Image databases
Spatial data mining primitives: -
Primitive operations between spatial objects include:
• Disjoint
• Equals
• Overlaps or intersects
• Covered by or inside or contained in
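The primitive operations listed above can be sketched for the simplest kind of spatial object, an axis-aligned rectangle. The `Rect` type, the treatment of touching edges as overlapping, and the sample objects are assumptions of this example. Real spatial databases support far more general shapes.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by its lower-left and upper-right corners."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def disjoint(a, b):
    """True when the rectangles share no point."""
    return a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin


def equals(a, b):
    """True when both rectangles have identical corners."""
    return a == b


def overlaps(a, b):
    """True when the rectangles share at least one point (the 'intersects' sense)."""
    return not disjoint(a, b)


def inside(a, b):
    """True when a is covered by (contained in) b."""
    return b.xmin <= a.xmin and b.ymin <= a.ymin and a.xmax <= b.xmax and a.ymax <= b.ymax


# Hypothetical spatial objects, invented for illustration.
park = Rect(0, 0, 10, 10)
pond = Rect(2, 2, 4, 4)
road = Rect(8, -5, 12, 15)
mall = Rect(20, 20, 30, 30)
```

For example, `inside(pond, park)` holds, `overlaps(road, park)` holds although the road is not inside the park, and `disjoint(mall, park)` holds.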
Generalization and Specialization: -The use of a concept hierarchy shows levels of
relationships among data. Spatial data mining techniques have involved both
generalization and specialization type approaches.
Spatial rules: - There are different types of rules found in spatial data mining.
• Spatial characteristic rules
• Spatial discriminant rules
Areas of applications of spatial data mining: -
• GIS systems, Geology, Environmental Science
• Resource management, Agriculture, Medicine, Robotics
3. Temporal Mining: - Temporal data mining is the mining of temporal data from
temporal databases. Temporal mining involves concepts such as modeling
temporal events, time series, and pattern detection.
There are other types of mining, such as distributed mining, ubiquitous mining,
constraint-based data mining, and phenomenal data mining.
APPLICATIONS OF DATA MINING: -

Retail marketing: Identifying buying patterns of customers, market basket analysis, etc.
Banking: Detecting patterns of fraudulent credit card use, identifying loyal customers, etc.
Insurance: Claims analysis, predicting which customers will buy new policies, etc.
Medicine: Characterizing patient behavior to predict surgery visits, etc.
Other Benefits: Lower cost of data mining systems and data processing, reduction in
computer-generated hardcopy reports.
CONCLUSION:
The field of data mining, like statistics, concerns itself with “learning from
data” or “turning data into information”. In this sense, data mining can be viewed as
closely related to statistics.
It is important to note that data mining can learn from statistics. There is the opportunity
for an immensely rewarding synergy between data miners and statisticians. However,
most data miners tend to be ignorant of statistics and the client’s domain; statisticians tend to
be ignorant of data mining and the client’s domain; and clients tend to be ignorant of data
mining and statistics. Unfortunately, they also tend to be inhibited by myopic points of
view: computer scientists focus upon database manipulations and processing algorithms;
statisticians focus upon identifying and handling uncertainties; and clients focus upon
integrating knowledge into the knowledge domain. Moreover, most data miners and
statisticians continue to criticize each other. This is detrimental to both
disciplines. Data mining and statistics will inevitably grow toward each other in the near
future, because data mining will not become knowledge discovery without statistical
thinking, and statistics will not be able to succeed on massive and complex datasets without
data mining approaches. A maturity challenge is for data miners, statisticians and clients
to recognize their dependence on each other and for all of them to widen their focus until
true collaboration becomes reality. The critical challenge for us all is to view the
challenges as opportunities for our joint success.
