WHAT IS DATA
Data is defined as a set of values of quantitative or qualitative variables. For example, qualitative data can be an anthropologist's handwritten notes about his or her interviews with native people. An individual piece of data is referred to as a datum. The concept of data is commonly associated with scientific research, and data is collected by a large number of institutions and organisations, including businesses, non-governmental and governmental organisations, for different purposes.
Data is collected, measured, reported and analysed so as to serve the purpose for which it was gathered. Data is used in many different ways: sometimes to predict sales in a business, sometimes by households to make a budget, and so on. Data can be raw or processed.
Raw data is a collection of characters or numbers before it has been cleaned or corrected by users or researchers. Raw data is corrected by removing outliers and data-entry errors. Processed data, in contrast, is data derived from raw data that can then be visualised in forms such as graphs and statistics.
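As a toy illustration of turning raw data into processed data (a minimal sketch in R; the ages and the plausible-range rule are invented for the example):

```r
# Raw data with one obvious data-entry error (an age of 460)
raw_ages <- c(23, 31, 27, 460, 29, 35)

# A simple cleaning rule: drop values outside a plausible range for a human age
processed_ages <- raw_ages[raw_ages >= 0 & raw_ages <= 120]

mean(raw_ages)        # distorted by the outlier
mean(processed_ages)  # a sensible summary after cleaning (29)
```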
Data, information and knowledge are closely related, and each has its own role in relation to the others. Data becomes information once it is arranged in some fashion, after which it can be analysed and used to make decisions.
Computers deal with data in a very different way. In computing, information is data that has been translated into an efficient form for processing. Data in computers is converted into binary (digital) form, since the computer understands only binary. Claude Shannon, a mathematician known as the father of information theory, developed this concept of data in the context of computing.
A computer uses binary data, in the form of 1s and 0s, to represent all data, including videos, images, text and sound. The smallest unit of data, represented by a single binary value, is called a bit. A byte is eight binary digits long. Storage and memory are measured in megabytes and gigabytes.
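The binary representation of data can be inspected directly in R. For example, the character "A" is stored as a single byte of eight bits:

```r
# Every piece of data in a computer reduces to bits; R can expose them.
charToRaw("A")             # the byte for "A" (hexadecimal 41 = decimal 65)
rawToBits(charToRaw("A"))  # the 8 bits of that byte (least-significant bit first)
length(rawToBits(charToRaw("A")))  # confirms a byte is 8 bits long
```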
Data can be stored in file formats, as in mainframe systems using ISAM and VSAM. Other file formats for data storage, conversion and processing include comma-separated values (CSV). These formats have continued to find uses across a variety of machine types, even as more structured approaches to data have gained footing in corporate computing.
DATA MINING
In recent years, data mining has attracted great attention in the information industry because of the wide availability of huge amounts of data and the need to turn that data into useful information and knowledge. The knowledge and information gained can be used for purposes such as market analysis, fraud detection and customer retention.
Since the 1960s, database and information technology has evolved systematically from primitive file-processing systems to sophisticated and powerful database systems. Research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to relational database systems (where data is stored in relational table structures). Database technology since the mid-1980s has been characterised by the widespread adoption of relational technology and an upsurge of research and development on new and powerful database systems, promoting advanced data models such as extended-relational, object-oriented, object-relational and deductive models.
Data Warehouse
Data Warehousing
Data warehousing is the process by which a data warehouse is constructed and used to fulfil such demands. A data warehouse is a collection of data from a variety of sources. Two approaches exist for integrating those sources:
Query-Driven
Update-Driven
Query-Driven Approach
In this approach, wrappers and integrators (also known as mediators) are built on top of multiple heterogeneous databases. When a query is issued, it is mapped and directed to the local query processors.
Disadvantages: the query-driven approach requires complex integration and filtering processes, and it is inefficient and very expensive for frequent queries.
Update-Driven Approach
In this approach, the information from various sources is combined in
advance and warehoused. This information is accessible for direct querying and
analysis.
Advantages
This approach offers high performance, because the data is integrated, summarised and restructured in advance rather than at query time.
Importance of OLAM
OLAM (Online Analytical Mining) is important for the following reasons:
High quality of data: data mining tools need to work on integrated, consistent and cleaned data. These preprocessing steps are very expensive. Data warehouses constructed by such preprocessing therefore serve as valuable sources of high-quality data for OLAP and data mining alike.
Some people do not differentiate data mining from knowledge discovery, while others view data mining as one essential step in the knowledge-discovery process. The steps involved in knowledge discovery include data cleaning, in which noise and inconsistent data are removed, and data transformation, in which data is consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
k-means
What does it do? k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to members of other groups. It is a popular cluster-analysis technique for exploring a dataset.
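A minimal k-means sketch in R, using the built-in kmeans() function on invented two-dimensional points:

```r
# Toy dataset: two clearly separated groups of 2-D points
set.seed(42)
pts <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),  # 10 points around (0, 0)
  matrix(rnorm(20, mean = 5), ncol = 2)   # 10 points around (5, 5)
)

km <- kmeans(pts, centers = 2)  # ask k-means for k = 2 groups

km$cluster  # group label (1 or 2) assigned to each point
km$centers  # the two group centroids found
```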
Naive Bayes
What does it do? Naive Bayes is not a single algorithm, but a family of
classification algorithms that share one common assumption:
Every feature of the data being classified is independent of all other features
given the class.
Two features are independent when the value of one feature has no effect on the value of the other.
For example:
Let's say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if none of their values had any effect on each other. For this dataset, it is reasonable to assume that a patient's height and zip code are independent, since a patient's height has little to do with their zip code.
But let's not stop there: are the other features independent? Sadly, the answer is no; features such as weight, height and cholesterol level are typically related to one another, so the independence assumption rarely holds exactly. To see how Naive Bayes classifies despite this, consider a training set of 1,000 fruits:
Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.
If we are given the length, sweetness and color of a fruit (without knowing
its class), we can now calculate the probability of it being a banana, orange
or other fruit.
Suppose we are told the unknown fruit is long, sweet and yellow.
Here is how the probabilities are calculated.
Step 1: To calculate the probability that the fruit is a banana, first recognise that this looks familiar: it is the probability of the class Banana given the features Long, Sweet and Yellow, i.e. P(Banana | Long, Sweet, Yellow). Under the Naive Bayes assumption, this is proportional to P(Long | Banana) × P(Sweet | Banana) × P(Yellow | Banana) × P(Banana). The same calculation is repeated for Orange and Other, and the class with the highest score wins.
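The calculation can be carried out directly in R; the counts come from the fruit table above, and the score names are just labels for this example:

```r
# Class priors from the fruit counts (500 bananas, 300 oranges, 200 other)
p_banana <- 500 / 1000
p_orange <- 300 / 1000
p_other  <- 200 / 1000

# Score for each class: P(Long|class) * P(Sweet|class) * P(Yellow|class) * P(class)
score_banana <- (400/500) * (350/500) * (450/500) * p_banana  # = 0.252
score_orange <- (  0/300) * (150/300) * (300/300) * p_orange  # = 0 (no orange is long)
score_other  <- (100/200) * (150/200) * ( 50/200) * p_other   # = 0.01875

# The largest score wins: the long, sweet, yellow fruit is classified as a banana
which.max(c(banana = score_banana, orange = score_orange, other = score_other))
```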
Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it has been found to be effective for spam filtering.
Where is it used? Implementations of Naive Bayes can be found in Orange,
scikit-learn, Weka and R.
DECISION TREE
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute; each leaf node represents a class.
The benefits of having a decision tree are as follows:
It is easy to comprehend.
A well-known decision-tree algorithm is ID3, and C4.5 was its successor. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and trees are constructed in a top-down, recursive, divide-and-conquer manner. Pruned trees are smaller and less complex. The tree-pruning approaches are:
Pre-pruning: the tree is pruned by halting its construction early.
Post-pruning: subtrees are removed from a fully grown tree.
R LANGUAGE
R-STUDIO
CLASSIFICATION
REGRESSION
R LANGUAGE
One of the main strengths of the R language is static graphics: it can produce publication-quality graphs, including different mathematical symbols.
R-STUDIO
Input Parameter:
Principle:
The attributes of a tuple are tested against the decision tree, and a path is traced from the root to a leaf node, which holds the prediction for that tuple.
ALGORITHM:
At the start, all the training tuples are at the root. Tuples are then partitioned recursively based on selected attributes. The test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain). Partitioning stops when all samples at a given node belong to the same class, or when there are no remaining attributes for further partitioning, in which case majority voting is employed.
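The information-gain heuristic mentioned above can be sketched in R; the entropy and info_gain helper functions and the toy buys/age vectors below are invented for illustration:

```r
# Shannon entropy (log base 2) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the groups of `attribute`:
# entropy before the split minus the weighted entropy of the partitions
info_gain <- function(labels, attribute) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) length(g) / length(labels) * entropy(g)))
  entropy(labels) - weighted
}

# Toy example: does "age group" help predict "buys a computer"?
buys <- c("yes", "yes", "no", "no", "yes", "no")
age  <- c("young", "young", "young", "old", "old", "old")
info_gain(buys, age)  # a small positive gain: age is weakly informative here
```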
LIBRARIES USED:
LIBRARY caTools
Contains several basic utility functions, including: moving (rolling, running) window statistic functions, read/write for GIF and ENVI binary files, fast calculation of AUC, a LogitBoost classifier, a base64 encoder/decoder, round-off-error-free sum and cumsum, etc.
Index:
LogitBoost LogitBoost Classification Algorithm
predict.LogitBoost Prediction Based on LogitBoost Algorithm
base64encode Convert R vectors to/from the Base64 format
colAUC Column-wise Area Under ROC Curve (AUC)
combs All Combinations of k Elements from Vector v
read.ENVI Read and Write Binary Data in ENVI Format
read.gif Read and Write Images in GIF format
runmean Mean of a Moving Window
runmin Minimum and Maximum of Moving Windows
runquantile Quantile of Moving Window
runmad Median Absolute Deviation of Moving Windows
runsd Standard Deviation of Moving Windows
sample.split Split Data into Test and Train Set
sumexact Basic Sum Operations without Round-off Errors
trapz Trapezoid Rule Numerical Integration
LIBRARY rpart
LIBRARY ElemStatLearn
bone Bone Mineral Density Data
countries Country Dissimilarities
galaxy Galaxy Data
marketing Market Basket Analysis
mixture.example Mixture Example
nci NCI microarray data
orange10.test Simulated Orange Data
orange10.train Simulated Orange Data
orange4.test Simulated Orange Data
orange4.train Simulated Orange Data
ozone Ozone Data
phoneme Data From an Acoustic-Phonetic Continuous Speech Corpus
prostate Prostate Cancer Data
SAheart South African Heart Disease Data
simple.ridge Simple Ridge Regression
spam Email Spam Data
vowel.test Vowel Recognition (Deterding data)
vowel.train Vowel Recognition (Deterding data)
waveform Function to simulate waveform data
waveform.test Simulated Waveform Data
waveform.train Simulated Waveform Data
zip.test Handwritten Digit Recognition Data
zip.train Handwritten Digit Recognition Data
zip2image Function to convert a row of the zip file to the format used by image()
# Decision Tree Regression

# Importing the dataset (keep only the Level and Salary columns)
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[2:3]

# Fitting Decision Tree Regression to the dataset
library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))

# Predicting a new result (e.g. Level = 6.5)
y_pred = predict(regressor, data.frame(Level = 6.5))

# Visualising the Decision Tree Regression results (high resolution)
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')

# Plotting the fitted tree itself
plot(regressor)
text(regressor)
# Decision Tree Classification

# Importing the dataset (keep Age, EstimatedSalary and Purchased columns)
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature scaling (all columns except the class label)
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

# Fitting the Decision Tree classifier to the Training set
library(rpart)
classifier = rpart(formula = Purchased ~ .,
                   data = training_set)
# Visualising the Training set results over a fine grid of (Age, Salary) values
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Decision Tree Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# Visualising the Test set results with the same grid-based decision boundary
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

# Plotting the fitted tree itself
plot(classifier)
text(classifier)
CONCLUSION
The main conclusion of this project is that data mining is a technique through which we can examine a database quickly and get an idea of what is present inside it. In our project we have implemented decision trees for both classification and regression.
In the decision trees for classification, we used a database consisting of people of different age groups with varying salaries, together with whether or not each person bought an item. After the data mining, we can easily see which combinations of age group and salary are likely to lead to a purchase.
In the decision trees for regression, we used a database with continuous data, since regression works only on such data: a database of people in different posts, with their salaries in increasing order. The resulting tree gives an overview of how salary varies with position level.
We can close our project by observing that the need for data mining is increasing day by day, and its value will be even greater in the future than it is today.