
Commonly Used Classification Techniques and Recent Developments


Presented by Ke-Shiuan Lynn
Terminology
A classifier can be viewed as a function
block. A classifier assigns one class to each
point of the input space. The input space is
thus partitioned into disjoint subsets, called
decision regions, each associated with a
class.
[Figure: Input vector (feature) -> Classifier -> Output (class)]
Terminology (cont.)
The way a classifier classifies inputs is
defined by its decision regions. The
borderlines between decision regions are
called decision-region boundaries or
simply decision boundaries.
[Figure: Input #1 vs. Input #2 plane showing decision regions, decision boundaries, inputs of class A, and inputs of class B]
Terminology (cont.)
In practice, input vectors of different classes
are rarely so neatly distinguishable. Samples
of different classes may have the same input
vectors. Due to such uncertainty, areas of
input space can be clouded by a mixture of
samples of different classes.
[Figure: axes Input #1 and Input #2]
Terminology (cont.)
The optimal classifier is the one expected to
produce the fewest misclassifications.
Such misclassifications are due to uncertainty in the
problem rather than a deficiency in the decision
regions.

[Figure: axes Input #1 and Input #2]
Terminology (cont.)
A designed classifier is said to generalize
well if it achieves similar classification
accuracy on both training samples and
real-world samples.
[Figure: axes Input #1 and Input #2]
Types of Models
Decision-Region Boundaries

Probability Density Functions

Posterior Probabilities
Decision-Region Boundaries
This type of model defines decision regions
by explicitly constructing boundaries in the
input space.
These models attempt to minimize the
number of expected misclassifications by
placing boundaries appropriately in the
input space.
Probability Density Functions (PDFs)
Models of this type attempt to construct,
for each class C, a probability density
function p(x|C) that gives the density of a
point x in the input space given class C.
The prior probabilities, p(C), are estimated
from the given database.
This model assigns the most probable class
to an input vector x by selecting the class
maximizing p(C)p(x|C).
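The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each p(x|C) is taken to be a one-dimensional Gaussian (the slides do not fix a form at this point), the priors are estimated as class frequencies, and the names `fit`, `classify`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def gaussian_pdf(x, mean, std):
    """1-D Gaussian density, an assumed form for p(x|C)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit(samples):
    """Estimate p(C) and the parameters of p(x|C) from (x, label) pairs."""
    counts = Counter(label for _, label in samples)
    model = {}
    for label, n in counts.items():
        xs = [x for x, lab in samples if lab == label]
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / n
        model[label] = {
            "prior": n / len(samples),  # p(C) as a class frequency
            "mean": mean,
            "std": math.sqrt(var),
        }
    return model


def classify(model, x):
    """Return the class C that maximizes p(C) * p(x|C)."""
    def score(c):
        p = model[c]
        return p["prior"] * gaussian_pdf(x, p["mean"], p["std"])
    return max(model, key=score)


# Toy data, invented for illustration only.
data = [(1.0, "A"), (1.2, "A"), (0.8, "A"), (3.0, "B"), (3.4, "B"), (2.6, "B")]
model = fit(data)
print(classify(model, 1.1), classify(model, 3.1))
```

The sketch assumes every class has at least two distinct samples, so that the estimated standard deviation is nonzero.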
Posterior Probabilities
Let there be m possible classes denoted
$C_1, C_2, \ldots, C_m$. This type of model attempts to
generate m posterior probabilities $p(C_i|x)$,
$i = 1, 2, \ldots, m$, for any input vector x.
The input vector is then assigned to the class
with the maximal $p(C_i|x)$.

Approaches to Modeling
Fixed models

Parametric models

Nonparametric models
Fixed models
A fixed model is used when the exact input-output
relationship is known.
Decision-region boundary: a known threshold
value (e.g. a particular BMI value for defining
obesity)
PDF: when each class's PDF can be obtained
Posterior probability: when the probability that
any observation belongs to each class is known
Parametric Models
A parametric model is used when its parametric
mathematical form can be obtained.

The development process of such models
consists of two stages: (1) derive an
appropriate parametric form, and (2) tune the
parameters to fit data.
Parametric Models (cont.)
Decision-region boundary: Linear
discriminant function, e.g.
$y = a x_1 + b x_2 + c x_3 + d$

PDF: Multivariate Gaussian function

Posterior probability: Logistic regression
Nonparametric Models
A nonparametric model is used when the
relationships between input vectors and
their associated classes are not well
understood.
Models of varying smoothness and
complexity are generated and the one with
best generalization is chosen.
Nonparametric Models (cont.)
Decision-region boundary: Learning
Vector Quantization (LVQ), K nearest
neighbor classifier, decision tree.
PDF: Gaussian mixture methods, Parzen's
window.
Posterior probability: Artificial neural
network (ANN), radial basis function
(RBF), group method of data handling
(GMDH)
Commonly Used Algorithms
Parametric:
  Linear regression
  Logistic regression
  Unimodal Gaussian
Nonparametric:
  Backpropagation
  Radial basis function
  K nearest neighbor
  Gaussian mixture
  Nearest clustering
  Binary/Linear decision tree
  Projection pursuit
  Estimate-Maximize clustering
  Multivariate Adaptive Regression Spline (MARS)
  Group Method of Data Handling (GMDH)
  Parzen's window
  Learning Vector Quantization (LVQ)
Practical Constraints
Memory usage

Training time

Classification time
Memory Usage
Algorithm                       Memory usage
Linear / Logistic regression    Very low
Unimodal Gaussian               Very low
Backpropagation                 Low
Radial basis function           Medium
K nearest neighbor              High
Gaussian mixture                Medium
Nearest clustering              Medium
Binary / Linear decision tree   Low
Projection pursuit              Low
Estimate-Maximize clustering    Medium
MARS                            Low
GMDH                            Low
Parzen's window                 High
LVQ                             Medium
Training Time
Algorithm                       Training time
Linear / Logistic regression    Fast-Medium
Unimodal Gaussian               Fast-Medium
Backpropagation                 Slow
Radial basis function           Medium
K nearest neighbor              No training required
Gaussian mixture                Medium-Slow
Nearest clustering              Medium
Binary / Linear decision tree   Fast
Projection pursuit              Medium
Estimate-Maximize clustering    Medium
MARS                            Medium
GMDH                            Fast-Medium
Parzen's window                 Fast
LVQ                             Slow
Classification time
Algorithm                       Classification time
Linear / Logistic regression    Very fast
Unimodal Gaussian               Fast
Backpropagation                 Very fast
Radial basis function           Medium
K nearest neighbor              Slow
Gaussian mixture                Medium
Nearest clustering              Fast-Medium
Binary / Linear decision tree   Very fast
Projection pursuit              Fast
Estimate-Maximize clustering    Medium
MARS                            Fast
GMDH                            Fast
Parzen's window                 Slow
LVQ                             Medium
Comparison of Algorithms
Linear regression: $y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_N x_N$

Logistic regression: $y = \dfrac{1}{1 + e^{-\left(w_0 + \sum_{i=1}^{N} w_i x_i\right)}}$

Linear and logistic regression both tend to
explicitly construct the decision-region
boundaries.
Advantages: Easy implementation, easy
explanation of input-output relationship
Disadvantages: Limited complexity of the
constructed boundary
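As a rough illustration of the two regression forms on this slide, the sketch below computes the linear score and passes it through the logistic function. The weights, the 0.5 threshold, and the class names "A" and "B" are made-up values for illustration; fitting the weights to data is not shown.

```python
import math


def linear_score(weights, bias, x):
    """w_0 + sum_i w_i * x_i for an input vector x (bias plays the role of w_0)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def logistic(weights, bias, x):
    """Squash the linear score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_score(weights, bias, x)))


def classify(weights, bias, x, threshold=0.5):
    """Assign class "A" when the logistic output reaches the threshold."""
    return "A" if logistic(weights, bias, x) >= threshold else "B"


# Hypothetical weights; in practice they are tuned to fit the data.
weights, bias = [2.0, -1.0], 0.5
print(logistic(weights, bias, [1.0, 2.5]))  # linear score is 0, so this prints 0.5
print(classify(weights, bias, [2.0, 0.0]))
```

With a 0.5 threshold the decision boundary is the hyperplane where the linear score equals zero, which is why the boundary these models can construct is limited to a linear one.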
Comparison of Algorithms (cont)
Binary decision tree:

Binary and Linear decision trees also tend to
explicitly construct the decision-region
boundaries.
Advantages: Easy implementation, easy
explanation of input-output relationship
Disadvantages: Limited complexity of the
constructed boundary; the tree structure may not
be globally optimal.
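[Figure: binary decision tree. The root splits on $x_i \ge c_1$ versus $x_i < c_1$; lower nodes split on $x_j \ge c_2$ versus $x_j < c_2$ and on $x_k \ge c_3$ versus $x_k < c_3$.]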
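A binary decision tree of this kind can be sketched as nested threshold tests. The `Node` class, the feature indices, the thresholds, and the leaf labels below are all hypothetical; the sketch only shows classification with a tree that is already built, not how the tree is grown.

```python
class Node:
    """Either an internal test x[feature] >= threshold, or a leaf with a label."""

    def __init__(self, feature=None, threshold=None, ge=None, lt=None, label=None):
        self.feature = feature      # index of the input component to test
        self.threshold = threshold  # the constant c in x_i >= c
        self.ge = ge                # subtree taken when x[feature] >= threshold
        self.lt = lt                # subtree taken when x[feature] < threshold
        self.label = label          # class label, set on leaves only


def classify(node, x):
    """Walk from the root to a leaf, following one threshold test per level."""
    while node.label is None:
        node = node.ge if x[node.feature] >= node.threshold else node.lt
    return node.label


# Hypothetical tree: split on x[0] at 5.0, then on x[1] at 2.0.
tree = Node(
    feature=0, threshold=5.0,
    ge=Node(feature=1, threshold=2.0, ge=Node(label="A"), lt=Node(label="B")),
    lt=Node(label="B"),
)
print(classify(tree, [6.0, 3.0]))
```

Each test compares a single input component with a constant, so the resulting decision boundaries are axis-parallel segments, which illustrates the limited boundary complexity noted above.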
Comparison of Algorithms (cont)
Neural Network:

Feedforward neural network and radial-basis
function network both tend to implicitly construct
the decision-region boundaries.
Advantages: Both can approximate arbitrarily
complex decision boundaries, provided that enough
nodes are used.
Disadvantages: Long training time
Comparison of Algorithms (cont)
Support vector machine

A support vector machine also tends to implicitly
construct the decision-region boundaries.
Advantages: This type of classifier has been shown to
have good generalization capability.
Comparison of Algorithms (cont)
Bayes' rule: $P(C_j|X) = P(X|C_j)\,P(C_j) / P(X)$

Unimodal Gaussian: $P(X|C_j) = \dfrac{1}{(2\pi)^{N/2}\,|V_j|^{1/2}}\, e^{-\frac{1}{2}(X - M_j)^T V_j^{-1} (X - M_j)}$

A unimodal Gaussian classifier explicitly constructs the PDF,
then computes the prior probability $P(C_j)$ and the posterior
probability $P(C_j|X)$.
Advantages: Easy implementation; a confidence
level can be obtained from the posterior
probabilities.
Disadvantages: Sample distributions may not be
Gaussian.
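The two formulas on this slide can be combined in a short sketch. It is restricted to N = 2 so that the covariance inverse and determinant can be written out by hand; the class names, means, covariances, and priors are invented for illustration, whereas in practice they would be estimated from training samples.

```python
import math


def gaussian_pdf_2d(x, mean, cov):
    """P(X|C_j) for N = 2, with mean vector M_j and 2x2 covariance matrix V_j."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # |V_j|
    inv = [[d / det, -b / det], [-c / det, a / det]]  # V_j^{-1}
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Quadratic form (X - M_j)^T V_j^{-1} (X - M_j)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    norm = (2.0 * math.pi) * math.sqrt(det)  # (2*pi)^(N/2) * |V_j|^(1/2) with N = 2
    return math.exp(-0.5 * q) / norm


def posteriors(x, classes):
    """Bayes' rule: P(C_j|X) = P(X|C_j) P(C_j) / P(X)."""
    joint = {
        name: p["prior"] * gaussian_pdf_2d(x, p["mean"], p["cov"])
        for name, p in classes.items()
    }
    evidence = sum(joint.values())  # P(X), the sum over all classes
    return {name: value / evidence for name, value in joint.items()}


# Hypothetical class parameters.
classes = {
    "C1": {"prior": 0.5, "mean": [0.0, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]]},
    "C2": {"prior": 0.5, "mean": [2.0, 2.0], "cov": [[1.0, 0.0], [0.0, 1.0]]},
}
post = posteriors([1.0, 1.0], classes)
print(post)
```

The point [1.0, 1.0] is equidistant from both means, so with equal priors and covariances the two posteriors are equal; the posterior value itself is what serves as the confidence level mentioned above.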


Comparison of Algorithms (cont)
A Gaussian mixture modifies the unimodal Gaussian
so that the PDF is estimated by a weighted
average of multiple Gaussians.
Similar to a Gaussian mixture, Parzen's window
approximates the PDF using a weighted average of
radial Gaussians.
Advantage: Given enough Gaussian components,
the above architectures can approximate arbitrarily
complex distributions.
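Both density estimates can be sketched in one dimension. The function names, the window width `h`, and the sample values are assumptions made for illustration; the sketch evaluates densities only and does not fit the mixture parameters.

```python
import math


def gaussian_kernel(u):
    """Standard 1-D Gaussian density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(x, samples, h):
    """Parzen window estimate: one Gaussian of width h centred on every sample."""
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (len(samples) * h)


def mixture_density(x, components):
    """Gaussian mixture: components are (weight, mean, std), weights summing to 1."""
    return sum(w * gaussian_kernel((x - m) / s) / s for w, m, s in components)


# Hypothetical samples forming two clumps.
samples = [0.9, 1.0, 1.1, 3.0, 3.2]
print(parzen_density(1.0, samples, h=0.3))
print(mixture_density(1.0, [(0.6, 1.0, 0.2), (0.4, 3.1, 0.2)]))
```

The Parzen estimate keeps every training sample, which matches its high memory usage and slow classification time in the earlier tables, while the mixture stores only a few components.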
Comparison of Algorithms (cont)
K nearest neighbor classifier

K nearest neighbor tends to construct posterior
probabilities $P(C_j|X)$.
Advantage: No training is required; a confidence
level can be obtained.
Disadvantage: Classification accuracy is low if a
complex decision-region boundary exists; large
storage is required.
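A minimal sketch of the K nearest neighbor idea follows, assuming Euclidean distance and estimating the posterior as the fraction of the k nearest samples that carry each label. The function names and training points are hypothetical, and k is assumed not to exceed the number of training samples.

```python
import math
from collections import Counter


def knn_posteriors(x, training, k):
    """Estimate P(C_j|X) as the share of the k nearest samples labelled C_j."""
    nearest = sorted(training, key=lambda s: math.dist(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: n / k for label, n in votes.items()}


def knn_classify(x, training, k):
    """Assign x to the class with the largest estimated posterior."""
    post = knn_posteriors(x, training, k)
    return max(post, key=post.get)


# Hypothetical training set of (point, label) pairs.
training = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.2), "B"),
]
print(knn_posteriors((0.3, 0.3), training, 5))
print(knn_classify((1.0, 1.1), training, 3))
```

No training step appears here, but every query sorts the whole stored training set, which is the storage and classification-time cost noted above.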
Other Useful Classifiers
Projection pursuit aims to decompose
the task of high-dimensional modeling into
a sequence of low-dimensional modeling tasks.
This algorithm consists of two stages: the
first stage projects the input data onto a
one-dimensional space, while the second
stage constructs the mapping from the projected
space to the output space.
Other Useful Classifiers (cont)
Multivariate adaptive regression splines (MARS)
tends to approximate the decision-region
boundaries in two stages.
In the first stage, the algorithm partitions the state
space into small portions.
In the second stage, the algorithm constructs a
low-order polynomial to approximate the
decision-region boundary within each partition.
Disadvantage: This algorithm is intractable for
problems with high-dimensional (> 10) inputs.
Other Useful Classifiers (cont)
Group method of data handling (GMDH)
also aims to approximate the decision-
region boundaries using high-order
polynomial functions.
The modeling process begins with a low
order polynomial, and then iteratively
combines terms to produce a higher order
polynomial until the modeling accuracy
saturates.
Keep The Following In Mind
Use multiple algorithms without bias and
let your specific data help determine which
model is best suited for your problem.
Occam's Razor: Entities should not be
multiplied unnecessarily -- "when you have
two competing models which make exactly
the same predictions to the data, the one
that is simpler is the better."
A New Member In Our Group