Many machine learning methods expect, or perform better when, the data attributes
share the same scale. Two popular data scaling methods are normalization and
standardization.
Data Normalization
Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.
It is useful to scale the input attributes for a model that relies on the magnitude of
values, such as distance measures used in k-nearest neighbors and in the
preparation of coefficients in regression.
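As a minimal sketch, normalization can be done with scikit-learn's MinMaxScaler (the small sample array below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: each column is an attribute on a different scale
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()            # rescales each column into [0, 1]
X_norm = scaler.fit_transform(X)
print(X_norm)                      # each column becomes [0, 0.5, 1]
```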
Data Standardization
Standardization refers to shifting the distribution of each attribute to have a mean of
zero and a standard deviation of one (unit variance).
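Standardization follows the same pattern with StandardScaler (again, the sample data is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()          # shifts each column to mean 0, std 1
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0))          # ~[0, 0]
print(X_std.std(axis=0))           # ~[1, 1]
```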
Decision Tree
Decision trees are commonly learned by recursively splitting the set of training
instances into subsets based on the instances' values for the explanatory
variables.
Memorizing the training set is called over-fitting. A program that memorizes
its observations may not perform its task well, as it has memorized relations and
structure that are noise or coincidence.
Balancing memorization and generalization, or over-fitting and under-fitting, is
a problem common to many machine learning algorithms.
Decision Tree
Graphical representation of all the possible solutions to a decision
Decisions are based on some conditions
Decisions made can be easily explained
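The points above can be sketched with scikit-learn's DecisionTreeClassifier (the iris dataset is chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree by recursively splitting on feature conditions
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# The learned conditions can be inspected and explained
print(clf.predict(X[:1]))   # class of the first training sample
```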
Random Forest
Builds multiple decision trees and merges them together
More accurate and stable prediction
Random decision forests correct for decision trees' habit of over-fitting to their
training set.
Trained with the "bagging" method
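A minimal sketch with scikit-learn's RandomForestClassifier (dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each tree is trained on a bootstrap sample of the data ("bagging"),
# and the trees' predictions are merged by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```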
Naive Bayes
Classification technique based on Bayes' Theorem
Assumes that the presence of a particular feature in a class is unrelated to the
presence of any other feature
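A sketch with scikit-learn's GaussianNB, which applies Bayes' Theorem under this feature-independence assumption (dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# GaussianNB treats each feature as conditionally independent given the class
model = GaussianNB()
model.fit(X, y)
print(model.score(X, y))
```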
K-Nearest Neighbors
Stores all the available cases and classifies new cases based on a similarity measure
The "K" in the KNN algorithm is the number of nearest neighbors we wish to take a vote from.
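A minimal sketch with scikit-learn's KNeighborsClassifier (k=5 and the dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# k=5: a new case is assigned the majority class of its 5 nearest neighbors,
# measured by a similarity (distance) metric
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:1]))
```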
What is a Decision Tree?
"A decision tree is a graphical representation of all the possible solutions to a
decision based on certain conditions"
Dataset
This is what our dataset looks like!
Decision Tree Terminology
CART Algorithm
Which attribute among them should you pick first from the following data set?
Answer: Determine the attribute that best classifies the training data.
But how do we choose the best attribute?
or
How does a tree decide where to split?
Entropy
How will you decide what is the best attribute?
The attribute with the highest information gain is considered the best.
Next question: what is information?
What is entropy?
Defines the randomness in the data
Entropy is just a metric which measures the impurity of the data
It is the first step in solving the problem of a decision tree
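For a collection S with class proportions p_i, Entropy(S) = -sum(p_i * log2(p_i)). A minimal sketch (the toy labels below are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# A 50/50 split is maximally impure
print(entropy(["yes", "yes", "no", "no"]))   # 1.0
# A pure set (all one class) has entropy 0
```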
What is information gain?
Measures the reduction in entropy
Decides which attribute should be selected as the decision node
If S is our total collection:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
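The formula above can be computed directly. A sketch with a hypothetical helper (the toy play/windy data is made up):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, feature_values):
    """Gain = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)) over feature values v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(subset) / total * entropy(subset)
                   for subset in subsets.values())
    return entropy(labels) - weighted

# Toy data: splitting on "windy" for a play/no-play target
play  = ["no", "no", "yes", "yes", "yes", "no"]
windy = [True, True, False, False, False, True]
print(information_gain(play, windy))   # 1.0: this split is perfectly informative
```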
Why should we prune?
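Pruning removes branches that fit noise, trading a little training accuracy for better generalization. A sketch using scikit-learn's cost-complexity pruning (the `ccp_alpha` value below is an arbitrary illustration, not a recommended setting):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full   = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree has fewer nodes, which reduces over-fitting
print(full.tree_.node_count, pruned.tree_.node_count)
```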