WHAT IS DATA
Data is defined as a set of values of quantitative or qualitative variables. For example, qualitative data can be an anthropologist's handwritten notes about his or her interviews with native people. An individual piece of data is referred to as a datum. The concept of data is commonly associated with scientific research, and data is collected by a large number of institutions and organisations, including businesses, non-governmental and governmental organisations, for different purposes.
Data is collected, measured, reported and analysed so as to serve the purpose for which it was gathered. Data is used in many different ways: sometimes to predict sales in a business, sometimes by households to make a budget, and so on. Data can be raw or processed.
Raw data is a collection of characters or numbers before it has been cleaned or corrected by users or researchers. Raw data is corrected by removing outliers and data-entry errors. Processed data, in contrast, is data derived from raw data that can then be visualised in forms such as graphs and statistics.
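As a toy illustration of turning raw data into processed data (a minimal sketch in R; the ages and the plausible-range rule are invented for the example):

```r
# Raw data with one obvious data-entry error (an age of 460)
raw_ages <- c(23, 31, 27, 460, 29, 35)

# A simple cleaning rule: drop values outside a plausible range for a human age
processed_ages <- raw_ages[raw_ages >= 0 & raw_ages <= 120]

mean(raw_ages)        # distorted by the outlier
mean(processed_ages)  # a sensible summary after cleaning (29)
```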
Data, information and knowledge are closely related, and each has its own role in relation to the others. Data becomes information once it is arranged in some fashion, after which it can be analysed and used to make decisions.
Computers deal with data in a very different way. In computing, information is data that has been translated into an efficient form for processing. Data in computers is converted into binary (digital) form, since the computer understands only binary. Claude Shannon, a mathematician known as the father of information theory, developed this concept of data in the context of computing.
A computer uses binary data, in the form of 1s and 0s, to represent all data, including videos, images, text and sound. The smallest unit of data, represented by a single binary value, is called a bit. A byte is eight binary digits long. Storage and memory are measured in megabytes and gigabytes.
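The binary representation of data can be inspected directly in R. For example, the character "A" is stored as a single byte of eight bits:

```r
# Every piece of data in a computer reduces to bits; R can expose them.
charToRaw("A")             # the byte for "A" (hexadecimal 41 = decimal 65)
rawToBits(charToRaw("A"))  # the 8 bits of that byte (least-significant bit first)
length(rawToBits(charToRaw("A")))  # confirms a byte is 8 bits long
```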
Data can be stored in file formats, as in mainframe systems using ISAM and VSAM. Other file formats for data storage, conversion and processing include comma-separated values (CSV). These formats have continued to find uses across a variety of machine types, even as more structured approaches to data have gained footing in corporate computing.
DATA MINING
In recent years, data mining has attracted great attention in the information industry because of the wide availability of huge amounts of data and the need to turn that data into useful information and knowledge. The knowledge and information gained can be used for purposes such as market analysis, fraud detection and customer retention.
Since the 1960s, database and information technology has evolved systematically from primitive file-processing systems to sophisticated and powerful database systems. Research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to relational database systems (where data is stored in relational table structures). Database technology since the mid-1980s has been characterised by the widespread adoption of relational technology and an upsurge of research and development on new and powerful database systems, promoting advanced data models such as extended-relational, object-oriented, object-relational and deductive models.
Data Warehouse
Data Warehousing
Data warehousing is the process by which a data warehouse is constructed and used to fulfil such demands. A data warehouse is a collection of data from a variety of sources. Two approaches exist for integrating those sources:
Query-Driven
Update-Driven
Query-Driven Approach
In this approach, wrappers and integrators (also known as mediators) are built on top of multiple heterogeneous databases. When a query is issued, it is mapped and directed to the local query processors.
Disadvantages: the query-driven approach requires complex integration and filtering processes, and it is inefficient and very expensive for frequent queries.
Update-Driven Approach
In this approach, the information from various sources is combined in
advance and warehoused. This information is accessible for direct querying and
analysis.
Advantages
This approach offers high performance, because the data is integrated, summarised and restructured in advance rather than at query time.
Importance of OLAM
OLAM (Online Analytical Mining) is important for the following reasons:
High quality of data: data mining tools need to work on integrated, consistent and cleaned data. These preprocessing steps are very expensive. Data warehouses constructed by such preprocessing therefore serve as valuable sources of high-quality data for OLAP and data mining alike.
Some people do not differentiate data mining from knowledge discovery, while others view data mining as one essential step in the knowledge-discovery process. The steps involved in knowledge discovery include data cleaning, in which noise and inconsistent data are removed, and data transformation, in which data is consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
k-means
What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It is a popular cluster-analysis technique for exploring a dataset.
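A minimal k-means sketch in R, using the built-in kmeans() function on invented two-dimensional points:

```r
# Toy dataset: two clearly separated groups of 2-D points
set.seed(42)
pts <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),  # 10 points around (0, 0)
  matrix(rnorm(20, mean = 5), ncol = 2)   # 10 points around (5, 5)
)

km <- kmeans(pts, centers = 2)  # ask k-means for k = 2 groups

km$cluster  # group label (1 or 2) assigned to each point
km$centers  # the two group centroids found
```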
Naive Bayes
What does it do? Naive Bayes is not a single algorithm, but a family of
classification algorithms that share one common assumption:
Every feature of the data being classified is independent of all other features
given the class.
Two features are independent when the value of one feature has no effect on the value of the other.
For example:
Let's say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if none of their values had any effect on each other. For this dataset, it is reasonable to assume that a patient's height and zip code are independent, since a patient's height has little to do with their zip code.
But let's not stop there: are the other features independent? Sadly, the answer is no; features such as weight, height and cholesterol level are typically related to one another, so the independence assumption rarely holds exactly. To see how Naive Bayes classifies despite this, consider a training set of 1,000 fruits:
Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
If we are given the length, sweetness and color of a fruit (without knowing
its class), we can now calculate the probability of it being a banana, orange
or other fruit.
Suppose we are told the unknown fruit is long, sweet and yellow.
Here is how the probabilities are calculated.
Step 1: To calculate the probability that the fruit is a banana, first recognise that this looks familiar: it is the probability of the class Banana given the features Long, Sweet and Yellow, i.e. P(Banana | Long, Sweet, Yellow). Under the Naive Bayes assumption, this is proportional to P(Long | Banana) × P(Sweet | Banana) × P(Yellow | Banana) × P(Banana). The same calculation is repeated for Orange and Other, and the class with the highest score wins.
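The calculation can be carried out directly in R; the counts come from the fruit table above, and the score names are just labels for this example:

```r
# Class priors from the fruit counts (500 bananas, 300 oranges, 200 other)
p_banana <- 500 / 1000
p_orange <- 300 / 1000
p_other  <- 200 / 1000

# Score for each class: P(Long|class) * P(Sweet|class) * P(Yellow|class) * P(class)
score_banana <- (400/500) * (350/500) * (450/500) * p_banana  # = 0.252
score_orange <- (  0/300) * (150/300) * (300/300) * p_orange  # = 0 (no orange is long)
score_other  <- (100/200) * (150/200) * ( 50/200) * p_other   # = 0.01875

# The largest score wins: the long, sweet, yellow fruit is classified as a banana
which.max(c(banana = score_banana, orange = score_orange, other = score_other))
```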
Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it has been found to be effective for spam filtering.
Where is it used? Implementations of Naive Bayes can be found in Orange,
scikit-learn, Weka and R.
DECISION TREE
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute; each leaf node represents a class.
The benefits of having a decision tree are as follows:
It is easy to comprehend.
A well-known decision-tree algorithm is ID3, and C4.5 was its successor. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and trees are constructed in a top-down, recursive, divide-and-conquer manner. Pruned trees are smaller and less complex. The tree-pruning approaches are:
Pre-pruning: the tree is pruned by halting its construction early.
Post-pruning: subtrees are removed from a fully grown tree.
R LANGUAGE
R-STUDIO
CLASSIFICATION
REGRESSION
R LANGUAGE
One of the main strengths of the R language is static graphics: it can produce publication-quality graphs, including different mathematical symbols.
R-STUDIO
Input Parameter:
Principle:
The attributes of a tuple are tested against the decision tree, and a path is traced from the root to a leaf node, which holds the prediction for that tuple.
ALGORITHM:
At the start, all the training tuples are at the root. Tuples are then partitioned recursively based on selected attributes. The test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain). Partitioning stops when all samples at a given node belong to the same class, or when there are no remaining attributes for further partitioning, in which case majority voting is employed.
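The information-gain heuristic mentioned above can be sketched in R; the entropy and info_gain helper functions and the toy buys/age vectors below are invented for illustration:

```r
# Shannon entropy (log base 2) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the groups of `attribute`:
# entropy before the split minus the weighted entropy of the partitions
info_gain <- function(labels, attribute) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) length(g) / length(labels) * entropy(g)))
  entropy(labels) - weighted
}

# Toy example: does "age group" help predict "buys a computer"?
buys <- c("yes", "yes", "no", "no", "yes", "no")
age  <- c("young", "young", "young", "old", "old", "old")
info_gain(buys, age)  # a small positive gain: age is weakly informative here
```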
LIBRARIES USED:
LIBRARY caTools
Contains several basic utility functions, including: moving (rolling, running) window statistic functions, read/write for GIF and ENVI binary files, fast calculation of AUC, a LogitBoost classifier, a base64 encoder/decoder, round-off-error-free sum and cumsum, etc.
Index:
LogitBoost LogitBoost Classification Algorithm
predict.LogitBoost Prediction Based on LogitBoost Algorithm
base64encode Convert R vectors to/from the Base64 format
colAUC Column-wise Area Under ROC Curve (AUC)
combs All Combinations of k Elements from Vector v
read.ENVI Read and Write Binary Data in ENVI Format
read.gif Read and Write Images in GIF format
runmean Mean of a Moving Window
runmin Minimum and Maximum of Moving Windows
runquantile Quantile of Moving Window
runmad Median Absolute Deviation of Moving Windows
runsd Standard Deviation of Moving Windows
sample.split Split Data into Test and Train Set
sumexact Basic Sum Operations without Round-off Errors
trapz Trapezoid Rule Numerical Integration
LIBRARY rpart
LIBRARY ElemStatLearn
bone Bone Mineral Density Data
countries Country Dissimilarities
galaxy Galaxy Data
marketing Market Basket Analysis
mixture.example Mixture Example
nci NCI microarray data
orange10.test Simulated Orange Data
orange10.train Simulated Orange Data
orange4.test Simulated Orange Data
orange4.train Simulated Orange Data
ozone Ozone Data
phoneme Data From an Acoustic-Phonetic Continuous Speech Corpus
prostate Prostate Cancer Data
SAheart South African Heart Disease Data
simple.ridge Simple Ridge Regression
spam Email Spam Data
vowel.test Vowel Recognition (Deterding data)
vowel.train Vowel Recognition (Deterding data)
waveform Function to simulate waveform data
waveform.test Simulated Waveform Data
waveform.train Simulated Waveform Data
zip.test Handwritten Digit Recognition Data
zip.train Handwritten Digit Recognition Data
zip2image Function to convert a row of the zip file to the format used by image()
# Decision Tree Regression

# Importing the dataset (keep only the Level and Salary columns)
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]

# Fitting Decision Tree Regression to the dataset
library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))

# Predicting a new result (e.g. Level = 6.5)
y_pred = predict(regressor, data.frame(Level = 6.5))

# Visualising the Decision Tree Regression results (high resolution)
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')

# Plotting the fitted tree itself
plot(regressor)
text(regressor)
# Decision Tree Classification

# Importing the dataset (keep Age, EstimatedSalary and Purchased columns)
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature scaling (all columns except the class label)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

# Fitting the Decision Tree classifier to the Training set
library(rpart)
classifier = rpart(formula = Purchased ~ .,
                   data = training_set)
# Visualising the Training set results over a fine grid of (Age, Salary) values
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Decision Tree Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results with the same grid-based decision boundary
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

# Plotting the fitted tree itself
plot(classifier)
text(classifier)
CONCLUSION
The main conclusion of this project is that data mining is a technique through which we can examine a database quickly and get an idea of what is present inside it. In our project we have implemented decision trees for both classification and regression.
In the decision trees for classification, we used a database consisting of people of different age groups with varying salaries, together with whether or not each person bought an item. After the data mining, we can easily see which combinations of age group and salary are likely to lead to a purchase.
In the decision trees for regression, we used a database with continuous data, since regression works only on such data: a database of people in different posts, with their salaries in increasing order. The resulting tree gives an overview of how salary varies with position level.
We can close our project by observing that the need for data mining is increasing day by day, and its value will be even greater in the future than it is today.