

A Radial Basis Function Network (RBFN) is a particular type of neural network. Generally,
the term Artificial Neural Network refers to the Multilayer Perceptron (MLP). Each neuron
in an MLP takes the weighted sum of its input values. That is, each input value is multiplied
by a coefficient, and the results are all summed together. A single MLP neuron is a simple
linear classifier, but complex non-linear classifiers can be built by combining these neurons
into a network.
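
As a minimal sketch of the weighted sum described above: the bias term and the decision threshold at zero are assumptions added for illustration, since the text only mentions the weighted sum itself.

```python
def neuron_output(inputs, weights, bias=0.0):
    """Multiply each input by its coefficient and sum the results.

    The bias term is an assumption; the text describes only the weighted sum.
    """
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def linear_classify(inputs, weights, bias=0.0):
    """Use a single neuron as a linear classifier (assumed threshold: 0)."""
    return 1 if neuron_output(inputs, weights, bias) > 0 else 0


score = neuron_output([2.0, 1.0], [0.5, -0.25])    # 2.0*0.5 + 1.0*(-0.25)
label = linear_classify([2.0, 1.0], [0.5, -0.25])
```

Changing the weights moves the straight-line boundary between the two classes, which is why a single neuron can only separate classes linearly.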
The RBFN approach is more intuitive than the MLP. An RBFN performs classification by
measuring the input's similarity to examples from the training set. Each RBFN neuron
stores a prototype, which is just one of the examples from the training set. When we want
to classify a new input, each neuron computes the Euclidean distance between the input and
its prototype. Roughly speaking, if the input more closely resembles the class A prototypes
than the class B prototypes, it is classified as class A.

The above illustration shows the typical architecture of an RBF Network. It consists of an
input vector, a layer of RBF neurons, and an output layer with one node per category or
class of data.
2.1 The Input Vector:
The input vector is the n-dimensional vector that you are trying to classify. The entire input
vector is shown to each of the RBF neurons.

2.2 The RBF Neurons:

Each RBF neuron stores a prototype vector which is just one of the vectors from the
training set. Each RBF neuron compares the input vector to its prototype, and outputs a
value between 0 and 1 which is a measure of similarity. If the input is equal to the
prototype, then the output of that RBF neuron will be 1. As the distance between the input
and prototype grows, the response falls off exponentially towards 0. The shape of the RBF
neuron's response is a bell curve, as illustrated in the network architecture diagram.
The neuron's response value is also called its activation value.
2.3 The Output Nodes:
The output of the network consists of a set of nodes, one per category that we are trying to
classify. Each output node computes a sort of score for the associated category. Typically, a
classification decision is made by assigning the input to the category with the highest score.
The score is computed by taking a weighted sum of the activation values from every RBF neuron.
Each RBF neuron computes a measure of the similarity between the input and its prototype
vector (taken from the training set). Input vectors which are more similar to the prototype
return a result closer to 1.
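The RBF neuron activation function is typically written as:

$$\varphi(x) = e^{-\beta \lVert x - \mu \rVert^2}$$

where x is the input vector, \mu is the neuron's prototype vector, and \beta controls how quickly the response falls off with distance.

[Figure: RBF neuron activation for different values of beta]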
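
The following is a rough sketch of an RBF neuron and of the output scores, assuming the common Gaussian form exp(-beta * ||x - prototype||^2) for the activation. The prototypes, output weights, and class ordering are hypothetical values chosen only for illustration.

```python
import math


def rbf_activation(x, prototype, beta):
    """Gaussian RBF (assumed form): exp(-beta * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return math.exp(-beta * sq_dist)


def rbfn_scores(x, prototypes, beta, output_weights):
    """One score per class: a weighted sum of every RBF neuron's activation."""
    activations = [rbf_activation(x, p, beta) for p in prototypes]
    return [sum(w * a for w, a in zip(row, activations)) for row in output_weights]


# Hypothetical network: one prototype per class, identity output weights.
prototypes = [(0.0, 0.0), (1.0, 1.0)]
output_weights = [[1.0, 0.0],   # class 0 listens to prototype 0
                  [0.0, 1.0]]   # class 1 listens to prototype 1

x = (0.1, 0.2)
scores = rbfn_scores(x, prototypes, beta=1.0, output_weights=output_weights)
predicted = max(range(len(scores)), key=scores.__getitem__)  # highest score wins
```

A larger beta narrows the bell curve, so the same input produces a smaller activation unless it sits exactly on the prototype.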


KNN, decision trees, and neural nets are all supervised learning algorithms. Their general goal is
to make accurate predictions about unknown data after being trained on known data.
Data comes in the form of examples with the general form (x1, ..., xn, y), where x1, ..., xn are
known as features, inputs, or dimensions, and y is the output or class label.
Both the xi and y can be discrete (taking on specific values, e.g. {0, 1}) or continuous (taking
on a range of values, e.g. [0, 1]).
In training we are given (x1, ..., xn, y) tuples. In testing (classification), we are given only
(x1, ..., xn) and the goal is to predict y with high accuracy.
Training error is the classification error measured when the training data itself is used for testing.
Testing error is the classification error on data not seen in the training phase.
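
To make the two error measures concrete, here is a small sketch. The data sets and the threshold classifier are hypothetical and stand in for any trained model.

```python
def error_rate(predict, examples):
    """Fraction of (features, label) examples that predict() labels wrongly."""
    wrong = sum(1 for features, label in examples if predict(features) != label)
    return wrong / len(examples)


def predict(features):
    """Hypothetical 'trained' classifier: threshold the single feature at 0.5."""
    return 1 if features[0] > 0.5 else 0


# Hypothetical data, for illustration only.
train = [((0.1,), 0), ((0.4,), 0), ((0.6,), 1), ((0.9,), 1)]
test = [((0.2,), 0), ((0.45,), 1), ((0.8,), 1)]

training_error = error_rate(predict, train)  # measured on data seen in training
testing_error = error_rate(predict, test)    # measured on unseen data
```

In this toy case the classifier fits the training data perfectly but still misclassifies one of the three unseen examples.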
K nearest neighbors is a simple algorithm that stores all available cases and classifies new
cases based on a similarity measure (e.g., distance functions). KNN has been used in
statistical estimation and pattern recognition since the beginning of the 1970s as a
nonparametric technique.
Given an unknown point, pick the closest 1 neighbor by some distance measure.
The class of the unknown is the 1-nearest neighbor's label.
Given an unknown, pick the k closest neighbors by some distance function.
The class of the unknown is the mode of the k nearest neighbors' labels.
k is usually an odd number to facilitate tie breaking.
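
The procedure above can be sketched in a few lines. Euclidean distance is used here as one possible distance function, and the example points are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, examples, k=3, distance=euclidean):
    """Return the mode of the labels of the k examples closest to query."""
    neighbors = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points: (features, class label).
examples = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"),
]

one_nn = knn_classify((1.1, 1.0), examples, k=1)
three_nn = knn_classify((5.1, 5.0), examples, k=3)
```

With k=3 the second query has two "B" neighbors and one "A" neighbor, so the majority vote still returns "B".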
A case is classified by a majority vote of its neighbors, with the case being assigned to the
class most common amongst its K nearest neighbors measured by a distance function.
If K = 1, then the case is simply assigned to the class of its nearest neighbor.

It should also be noted that all three distance measures are only valid for continuous
variables. In the instance of categorical variables the Hamming distance must be used.
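
A minimal sketch of the Hamming distance for categorical records, which counts the attributes on which two records disagree. The attribute values shown are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical categorical records: (color, size, owns_home).
d = hamming_distance(("red", "small", "yes"), ("red", "large", "no"))
```

Here the records differ in two of three attributes, so the distance is 2.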
Choosing the optimal value for K is best done by first inspecting the data. In general, a larger
K value is more precise as it reduces the overall noise, but there is no guarantee.
Cross-validation is another way to retrospectively determine a good K value by using an
independent dataset to validate the K value.
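
One way to sketch this is leave-one-out cross-validation, where each example in turn plays the role of the independent data. This particular scheme, the helper names, and the toy data (which contains one deliberately noisy point) are assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(query, examples, k):
    """Majority label among the k examples nearest to query (Euclidean)."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_error(examples, k):
    """Leave-one-out error: classify each example using all the others."""
    wrong = 0
    for i, (features, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if knn_classify(features, rest, k) != label:
            wrong += 1
    return wrong / len(examples)


def choose_k(examples, candidates):
    """Candidate k with the lowest leave-one-out error (first wins ties)."""
    return min(candidates, key=lambda k: loo_error(examples, k))


# Hypothetical data: two clusters plus one noisy "B" inside the "A" cluster.
examples = [
    ((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
    ((0.5, 0.5), "B"),
    ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B"),
]

best_k = choose_k(examples, [1, 3, 5])
```

On this data K=1 is misled by the noisy point, while K=3 outvotes it, which illustrates how a larger K can reduce the effect of noise.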
Consider the following data concerning credit default. Age and Loan are two numerical
variables (predictors) and Default is the target.
We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000)
using Euclidean distance. If K=1 then the nearest neighbor is the last case in the training set
with Default=Y.
D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The
prediction for the unknown case is again Default=Y.
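
The K=1 distance in the example can be checked directly. Only the two cases quoted in the text are used here, since the full training table is not reproduced.

```python
import math

unknown = (48, 142_000)    # (Age, Loan) of the case to classify
neighbor = (33, 150_000)   # (Age, Loan) of the training case quoted in the text

distance = math.dist(unknown, neighbor)
print(round(distance, 2))  # 8000.01
```

Note that the Loan difference (8,000) dwarfs the Age difference (15), so with unscaled variables the distance is determined almost entirely by Loan.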


Support Vector Machine (SVM) is a classification and regression prediction tool that uses
machine learning theory to maximize predictive accuracy while automatically avoiding
over-fitting to the data. Support Vector Machines can be defined as systems which use the
hypothesis space of linear functions in a high-dimensional feature space, trained with a
learning algorithm from optimization theory that implements a learning bias derived from
statistical learning theory.
SVM was first introduced in 1992. It became popular because of its success in handwritten
digit recognition, where it achieved a 1.1% test error rate.
It is a discriminative classifier formally defined by a separating hyperplane. In other words,
given labeled training data (supervised learning), the algorithm outputs an optimal
hyperplane which categorizes new examples.
For a linearly separable set of 2D points which belong to one of two classes, find a
separating straight line.

In the above picture you can see that there exist multiple lines that offer a solution to the
problem. Is any of them better than the others? We can intuitively define a criterion to
estimate the worth of the lines:
A line is bad if it passes too close to the points because it will be noise sensitive and it will
not generalize correctly. Therefore, our goal should be to find the line passing as far as
possible from all points.
The SVM algorithm is therefore based on finding the hyperplane that gives the
largest minimum distance to the training examples. Twice this distance is called the
margin in SVM theory. Therefore, the optimal separating
hyperplane maximizes the margin of the training data.
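
As an illustration of the "largest minimum distance" criterion, the sketch below compares two candidate separating lines by their minimum distance to the points. The points and both lines are hypothetical, and this only scores given lines; it does not search for the optimal one.

```python
import math


def distance_to_line(point, w, b):
    """Distance from a 2D point to the line w[0]*x + w[1]*y + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])


def min_distance(points, w, b):
    """Smallest distance from any point to the line."""
    return min(distance_to_line(p, w, b) for p in points)


# Hypothetical linearly separable points (two near the origin, two far away).
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]

line_a = ((1.0, 1.0), -5.5)  # x + y = 5.5, midway between the two groups
line_b = ((1.0, 1.0), -3.5)  # x + y = 3.5, passes close to (2, 1)

margin_a = 2 * min_distance(points, *line_a)  # margin = twice the min distance
margin_b = 2 * min_distance(points, *line_b)
```

Both lines separate the two groups, but line_a has the larger margin and would be preferred under this criterion.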

Examples Of Bad Decision Boundaries:


The decision boundary should be as far away from the data of both classes as possible.

We should maximize the margin, m.

The distance between the origin and the line w^T x = k is |k|/||w||.

2.1 Finding Decision Boundaries:

Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi. The decision
boundary should classify all points correctly.

The decision boundary can be found by solving a constrained optimization problem.
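
The formulas that accompanied this passage are not reproduced in the text. The standard hard-margin formulation, given here as an assumption, uses an offset b (a symbol not defined in the text) and scales w so that the closest points satisfy the constraint with equality:

```latex
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{T} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```

Under this scaling the margin is 2/||w||, so minimizing ||w|| is equivalent to maximizing the margin, and the constraints express that every point is classified correctly.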

A decision tree is a simple representation for classifying examples. Decision tree learning is
one of the most successful techniques for supervised classification learning. Each element of
the domain of the classification is called a class.
A decision tree or a classification tree is a tree in which each internal (non-leaf) node is
labeled with an input feature. The arcs coming from a node labeled with a feature are
labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a
class or a probability distribution over the classes.
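To classify an example, filter it down the tree: at each internal node, follow the arc labeled with the example's value for that node's feature, and when a leaf is reached, return the class at that leaf.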

Two decision trees

The figure above shows two possible decision trees. Each decision tree can be used to classify
examples according to the user's action. To classify a new example using the tree on the left, first
determine the length. If it is long, predict skips. Otherwise, check the thread. If the thread is
new, predict reads. Otherwise, check the author and predict reads only if the author is [...]
The tree on the right makes probabilistic predictions when the length is short. In this case, it
predicts reads with probability 0.82 and skips with probability 0.18.
A deterministic decision tree, in which all of the leaves are classes, can be mapped into a set
of rules, with each leaf of the tree corresponding to a rule. The example has the
classification at the leaf if all of the conditions on the path from the root to the leaf are true.
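
Filtering an example down a tree and mapping a deterministic tree to rules can be sketched as follows. The tree loosely follows the left-hand tree described above, but the value name "old" and the outcomes of the author test ("known" predicts reads, "unknown" predicts skips) are assumptions, since the text does not give them.

```python
# Internal node: (feature, {value: subtree}); leaf: a class label string.
# Hypothetical tree; the "old" thread value and the author branch are assumed.
tree = ("length", {
    "long": "skips",
    "short": ("thread", {
        "new": "reads",
        "old": ("author", {"known": "reads", "unknown": "skips"}),
    }),
})


def classify(node, example):
    """Follow the arc matching the example's value until a leaf is reached."""
    while not isinstance(node, str):
        feature, branches = node
        node = branches[example[feature]]
    return node


def tree_to_rules(node, conditions=()):
    """Return one (conditions, class) rule per leaf of a deterministic tree."""
    if isinstance(node, str):
        return [(conditions, node)]
    feature, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + ((feature, value),)))
    return rules


prediction = classify(tree, {"length": "short", "thread": "new", "author": "unknown"})
rules = tree_to_rules(tree)
```

Each rule pairs the conditions on the path from the root with the class at the leaf, for example (("length", "long"),) with "skips".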

If data for some attribute is missing and is hard to obtain, it might be possible to
extrapolate or to use "unknown" as a value.

If some attributes have continuous values, groupings might be used.

If the data set is too large, one might use bagging to select a sample from the training set.
Or, one can use boosting to assign a weight showing importance to each instance. Or,
one can divide the sample set into subsets and train on one, and test on others.
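
Two of these options can be sketched briefly: drawing a bootstrap sample (sampling with replacement, the sampling step that bagging relies on) and dividing the data into disjoint subsets. The function names, the round-robin split, and the stand-in data are assumptions for illustration.

```python
import random


def bootstrap_sample(examples, size, rng):
    """Draw `size` examples uniformly at random, with replacement."""
    return [rng.choice(examples) for _ in range(size)]


def split(examples, n_parts):
    """Divide the examples into n_parts disjoint subsets (round-robin)."""
    return [examples[i::n_parts] for i in range(n_parts)]


examples = list(range(10))      # stand-in for a large training set
rng = random.Random(0)          # seeded only to make runs repeatable

sample = bootstrap_sample(examples, size=5, rng=rng)
train_part, *test_parts = split(examples, n_parts=3)
```

A model can then be trained on `sample` or on `train_part` and tested on the remaining subsets.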