
This course has been developed in the framework of the ITEMS master program (Techniques for Analysis, Modeling and Simulation for Imaging, Bioinformatics and Complex Systems) financed through project POSDRU/86/1.2/S/61756.
Pattern Recognition
An Introductory Workbook for Engineers and Scientists
C. Rasche
Abstract
The purpose of this workbook is to provide practical access to the topic of pattern recognition. The
emphasis lies on applying statistical classification methods and learning their advantages and disadvantages.
We start with the very simple and easily implementable k-Nearest-Neighbor classifier, followed by
the most popular and robust classifier, namely Linear Discriminant Analysis (LDA). We learn how to
apply principal component analysis (PCA) and how to properly fold the data. We further introduce
decision trees, ensemble classifiers, clustering methods and string matching methods. Eventually, we also
mention Support Vector Machines and the Naive Bayes classifier. The latter helps us to understand some
of the theoretical aspects, e.g. the Bayesian formulation for classification. Matlab code is provided to
facilitate the understanding and implementation of the classifiers.
Prerequisites: basic programming skills
Recommended: basic linear algebra, basic signal processing
Contents
1 Introduction
1.1 Varia
2 k-Nearest Neighbor (kNN)
2.1 Implementation
2.2 Normalization ThKo p263, s5.2.2, pdf 276
2.3 Evaluation
2.4 Division by Zero, Infinity (Inf), Not a Number (NaN), Large Dataset
2.5 Recapitulation
3 Linear Discriminant Analysis (Linear Classifier I)
3.1 Implementation
3.1.1 Excerpt from the LDA implementation in Matlab (classify)
3.2 Recapitulation
4 Dimensionality Reduction
4.1 Feature Extraction - PCA DHS p115, 568 Alp p113 ThKo p326
4.2 Feature Selection Alp p110 ThKo p261, ch5, pdf 274
5 Evaluating Classifiers & Performance
5.1 Types of Error Estimation DHS p465
5.2 Performance Measures for Binary Classifiers Alp p489
5.3 Other Issues
5.3.1 Class Imbalance Problem ThKo p237
5.3.2 Estimating Classifier Complexity - Big O
6 Clustering - Unsupervised Learning
6.1 k-Means DHS p526 ThKo p741 Alp p145
6.2 Hierarchical Clustering DHS p550 ThKo p653
7 Decision Tree ThKo p215, s 4.20, pdf 228 Alp p185, ch 9 DHS p395
7.1 Implementation
8 Combining Classifiers [Ensemble Classifiers] Alp p419, ch 17
8.1 Voting
8.2 Bagging
8.3 Component Classifiers without Discriminant Functions DHS p498, s. 9.7.2, pdf 576
8.4 Learning the Combination
8.5 One-vs-All Classifier
9 Non-Metric Classification DHS pxxx, ch 8, pdf 461
9.1 Recognition with Strings DHS p413, s 8.5, pdf 481 ThKo p487, s 8.2.2
9.1.1 String Matching Distance
9.1.2 Edit Distance
10 Density Estimation
10.1 Non-Parametric Methods Alp p165
10.1.1 Histogramming Alp p165
10.1.2 Kernel Estimator (Parzen Windows) Alp p167 ThKo p51
10.2 Parametric Methods Alp p61
10.2.1 Gaussian Mixture Models (GMM)
11 Naive Bayes Classifier
11.1 Implementation
11.2 Recapitulation
12 Support Vector Machines ThKo p119
12.1 Recapitulation
13 Rounding the Picture DHS p84
13.1 Bayesian Formulation
13.1.1 Rephrasing Classifier Methods
13.2 Parametric (Generative) vs. Non-Parametric (Discriminative)
13.3 Other (Supervised) Statistical Classifiers
13.4 Algorithm-Independent Issues
A Appendix - LDA Beginners Example
B Appendix - 2D Toy Data Sets
C Appendix - Varia
C.1 Metrics
C.2 Whitening Transform
C.3 Programming Hints
C.4 Mathematical Notation
C.5 Some Software Packages
C.6 Parallel Computing Toolbox in Matlab
C.7 Reading
D Appendix - Example Questions
D.1 Questions
D.2 Answers (as hints)
1 Introduction
There are many excellent textbooks on the subject of pattern recognition, but they often lack an exemplary,
learning-by-doing approach (though I have not read all textbooks on this topic).
Textbooks often provide the theoretical background first, followed by examples. But the theoretical
background is easier to understand if one has worked through some specific examples. We therefore
provide here an example-first approach, thereby encountering the classifiers' advantages and disadvantages
in practice. After having worked through these examples, any textbook should read fairly easily.
Figure 1: Illustrating the classification problem in 2D.
We are given two sets of points representing two classes (squares and triangles, respectively) - they are
our training samples (example data). To which group would we assign a new sample (testing point) such as
the one marked as a circle?
The two classes may overlap due to measurement noise or because some samples are indeed a mixture
of both classes - nevertheless, we would like to predict a new sample as well as possible.
In the simplest case we compare all training samples with the testing sample (section 2). Or we may
attempt to model the point clouds with functions (like a Gaussian function; section 11). We could also find
a straight-line equation which best separates the two point clouds (section 3). Of course, each method has
its advantages and disadvantages - there is no best classifier.
Data Format   Collected/measured data often come in a format with two characteristics:
a) uniform dimensionality: all the data samples have equal dimensionality, which allows one to regard a
(single) sample as a d-dimensional vector (also called a feature vector).
b) numeric values: the data values are often countable or measurable (as opposed to nominal), e.g. they
are of type integer, real, binary, etc.
Given these two characteristics, we can employ a large body of statistical classification methods, which
exploit statistical information about the data, or one can simply perform (metric) distance measurements
between the samples in order to classify or organize the data. In programming terms, we deal with an
n x d matrix, corresponding to [number of samples x number of dimensions]: each row is a data sample
(or observation), a feature vector x, of which each component (dimension) x(i) is the measurement of an
attribute (or feature or characteristic or variable) of the data (i = 1..d). Examples:
- Computer vision: face detection is frequently done with 60x60-pixel patches, that is, one deals with a
10800-dimensional vector (3 color variables). Searching an image means testing many, many 60x60-pixel
patches.
- Food inspection: distinguish salmon from sea bass by measuring the degree of lightness and the spatial
width, thus only two dimensions. (This is the recurring example in Duda/Hart/Stork's textbook.)
- Bioinformatics ThKo p632: DNA microarray analysis. This is a scientific field of enormous interest and
significance that has already attracted a lot of research effort and investment. In such applications, data
sets of dimensionality as high as 4000 can be encountered.
Note: the term feature (in textbooks) can mean an individual component or variable, as for example in the
term feature selection. But it is sometimes also used to describe a feature vector, that is, a data sample!
Types of Training Procedures   If we have knowledge about the class (or group) information in the data,
meaning for each sample (feature vector) we know which class (category or group) it belongs to, then we
apply supervised learning algorithms. Training then takes place with the help of a teacher, one could say. For
instance, if we train a classifier to recognize characters, it is useful to provide labeled examples from which
the appropriate parameter values for optimal classification will be learned.
If we lack such class knowledge we apply unsupervised learning algorithms, called clustering algorithms
(section 6). Training then occurs without a teacher. For instance, if we are given the entire set of Chinese
characters without any labeling (translation), then we may try to organize them by attempting to find basic
characters expressing frequent words such as house or man.
The first three classifiers we will introduce (next 3 sections) employ supervised learning algorithms.
1.1 Varia
Testing Data Sets   It is instructive to start with a toy data set with only two dimensions, see appendix
B, and then to approach higher-dimensional sets. A convenient way to practice is to use the data set of
handwritten digits,
http://yann.lecun.com/exdb/mnist/
as this allows one to easily verify the classifier implementations (as the categories are evident). Bishop also
provides other collections, Bis p677:
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/datasets.htm
Source for this Workbook   A few text passages are copied/pasted/modified from various textbooks, as
well as some of the figures - I have tried to compile the best pieces from each book and provide exact
citations including page numbers. See appendix C.7 for the titles. Our workbook distinguishes itself from the
textbooks by specifying the algorithmic formulation more explicitly and by providing (vectorized) code.
Code   The code fragments I provide are vectorized, meaning the slow for-loops are avoided: this type of
vector/matrix thinking is unusual at the beginning, but highly recommended for 3 reasons: 1) the computation
time is vastly shorter; 2) the code is more compact; 3) the code is less error-prone. However, the code fragments
may contain unintended mistakes, as I copied/pasted them from my own Matlab scripts and occasionally
made some unverified modifications for instruction purposes.
It can also be useful to check Matlab's File Exchange website for demos of various kinds:
http://www.mathworks.com/matlabcentral/fileexchange
Advice   We recommend implementing the simple classifier types oneself, for instance in a high-level
programming language such as Matlab. This can be done with a few lines. For more complex classifiers it
is more convenient to employ existing routines (such as for Linear Discriminant Analysis and Support
Vector Machines). Why then would one want to implement the simple classifiers at all? There are several
reasons. One is that the existing routines sometimes do not account for special data entries, e.g. NaN
(not a number), or are not optimized for large datasets. Another reason is that one may intend to build
individual classifiers, e.g. ensemble classifiers, for which it may be more convenient to write one's own
code. Furthermore, by writing our own code we know exactly what parameters/conditions etc. we have
used. Finally, it is part of the learning process, and it builds confidence for when we later use classification packages.
2 k-Nearest Neighbor (kNN)
The Idea: The k-nearest-neighbor algorithm is amongst the simplest of all machine learning algorithms.
Given a training set, we simply store all its samples as an exhaustive reference; to classify a testing
sample, we compare all the training samples to it to arrive at a classification decision. That is, we do not
really relate the training samples to each other in any way (except if one uses the covariance matrix for normalization).
Figurative Example. Determining the country by looking at license plates: you drive across Europe and determine
which country you are currently driving through by looking at the cars' license plates. If there is a majority of
license plates of one country, then it is likely you are currently in that country. In regions near country
borders and near tourist resorts, this probability decreases.
The Procedure: Given is a training set, a matrix TRN with corresponding group (class) labels in vector
GrpTrn, and a testing set, a matrix TST with GrpTst. To classify a sample from the testing set (one row
vector of TST), we measure the distance to all samples in TRN, resulting in a vector Dist of the same length as GrpTrn.
We order the distances in Dist, choose the closest training sample and take its category label as the
label of the testing sample - that would be the nearest neighbor, meaning k = 1. We can also look at more
(nearest) neighbors, e.g. 3, 5, ..., and determine which category label occurs most often amongst those k
neighbors (for even k we may face a tie).
In other words, a testing sample is classified by assigning it to the most frequent class label amongst its
neighborhood of size k in TRN (figure 2). One can try different distance metrics, e.g. Euclidean, Manhattan,
... (see also the Appendix). There is essentially no initialization required, with the exception of the possible need
to normalize the data.
Figure 2: k-Nearest-Neighbor (kNN). Given are 11 training samples from 2 classes (marked as squares and
triangles); 1 instance (testing sample, marked as a filled circle) is to be classified. Solid (thin) circle: 3-NN;
stippled circle: 5-NN.
Algorithm 1 kNN classification. D_L = TRN (training samples), D_T = TST (testing samples). G: vector with group labels (length = n_TrainingSamples).
Initialization: normalize the data
Training: training samples D_L with class (group) labels G. (In fact, no actual training takes place here.)
Testing: for a testing sample (in D_T): compute the distances D to all training samples, rank (order) them: D -> D_r
Decision: observe the first k (ranked) distances in D_r (the k nearest neighbors):
e.g. a majority vote over the class labels of the kNN determines the category label
2.1 Implementation
Matlab offers the knnclassify command (as part of the Bioinformatics Toolbox), but coding a kNN classifier
is fairly easy. Here are some fragments to understand how little it actually requires (see also ThKo p82):
nDim = 10; nCat = 5; nTrn = 15; nTst = 15;    % toy sizes (3 training samples per class)
Grp = reshape(repmat(1:nCat, 3, 1), [], 1);   % generating class/group labels [nTrn, 1]
TRN = randn(nTrn, nDim);                      % some training data
TST = randn(nTst, nDim);                      % some testing data
NNk = zeros(nTst, 11);                        % we will check out 11 nearest neighbors
for i = 1 : nTst
    iTst = repmat(TST(i,:), nTrn, 1);         % replicate to same size [nTrn, nDim]
    Diff = TRN - iTst;                        % difference [nTrn, nDim]
    Dist = sqrt(sum(Diff.^2, 2));             % Euclidean (try also Manhattan,...) [nTrn, 1]
    [dst, ix]  = min(Dist);                   % min distance for 1-NN
    [sds, ixs] = sort(Dist, 'ascend');        % increasing distances for k-NN
    NNk(i,:) = Grp(ixs(1:11))';               % class labels of the closest 11
end
HNN = histc(NNk(:,1:5), 1:nCat, 2);           % class histogram for the 5 NN [nTst, nCat]
[Fq, LbTst] = max(HNN, [], 2);                % LbTst contains the class assignment
See also the programming hints in subsection C.3 for why we chose a for-loop in this case.
2.2 Normalization ThKo p263, s5.2.2, pdf 276
The range of values for different features may vary significantly. It can therefore be beneficial to normalize
your data. There are different possibilities to perform the normalization, for instance:
1. by subtracting the mean and dividing by the standard deviation (of that feature). The resulting
normalized features will then have zero mean and unit variance. Matlab: zscore.
2. by limiting the feature values to the range [0, 1] or [-1, 1] by proper scaling.
3. by scaling the feature values with an exponential or tangent function (e.g. tanh).
4. by performing a whitening transformation (DHS pp 34, pdf 54). This is a decorrelation method in which each
sample is transformed using the eigenvectors of the dataset's covariance matrix, scaled by the inverse square
roots of the corresponding eigenvalues. The method is called whitening because it transforms the input matrix
to the form of white noise, which by definition is uncorrelated and has uniform variance (see subsection C.2 for details).
Warning: Normalization may distort the relations between dimensions and hence the distances between
samples. Therefore, normalization does not always improve classification (or clustering). It may be useful
to look at the distribution of individual features (e.g. using a plotting command such as hist) to see what
type of normalization may be appropriate.
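As a minimal sketch of options 1 and 2 (assuming DAT is an n x d data matrix; zscore requires the Statistics Toolbox):

% option 1: z-score normalization - zero mean, unit variance per feature
DATz = zscore(DAT);
% option 2: min-max scaling of each feature to the range [0, 1]
n  = size(DAT, 1);
mn = min(DAT, [], 1);  mx = max(DAT, [], 1);
DATs = (DAT - repmat(mn, n, 1)) ./ repmat(mx - mn + eps, n, 1);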
2.3 Evaluation
Estimating Generalization Performance   It is beneficial to know how well our classifier will perform on
new data; in other words, we would like to know its generalization performance on untested data. For that
purpose we partition the data - which we are given - into a training set that is used exclusively for training,
and a testing set that is used for estimating the generalization performance (we implied this already above).
To begin with, we carry out the following simple partitioning: we halve the dataset, with one half being
the training set and the other half being the testing set. Generate a performance estimate with the two
halves and then swap the two halves to generate another performance estimate. Take the mean of the two
estimates. This is also called hold-out estimation or two-fold cross-validation. Later we will encounter more
refined estimation methods (section 5).
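A minimal sketch of this two-fold estimate, using the kNN classifier via knnclassify (Bioinformatics Toolbox); DAT (n x d) and the label vector Grp are assumed to be given, with n even for simplicity:

n  = size(DAT, 1);
ix = randperm(n);                            % random permutation of the sample indices
iA = ix(1 : n/2);   iB = ix(n/2+1 : end);    % indices of the two halves
Lb1  = knnclassify(DAT(iB,:), DAT(iA,:), Grp(iA), 3);   % fold 1: train on A, test on B
acc1 = mean(Lb1 == Grp(iB));
Lb2  = knnclassify(DAT(iA,:), DAT(iB,:), Grp(iB), 3);   % fold 2: swap the two halves
acc2 = mean(Lb2 == Grp(iA));
accEstimate = (acc1 + acc2) / 2;             % two-fold (hold-out) estimate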
Confusion Matrix   Now we analyze which classes were mistaken for which other classes by creating a
(square) confusion matrix of size c x c, where c is the number of classes. The given (actual) category is
typically given in the row, the predicted (classified) category in the column. This helps us understand
where the potential classification difficulties for the given data lie. In Matlab:
CM = accumarray([Grp LbTst],1,[nCat nCat]);
Or you may use confusionmat if the Statistics Toolbox is available.
Learning Curve   It is common to test the classifier for different amounts of training samples (e.g. 5,
10, 15, 20 training samples) and to plot the classification accuracy (and/or error) as a function of the number
of training samples, a graph called the learning curve. An increase in sample number should typically lead to
an increase in performance - at least initially (if performance only decreases then something is wrong). The
classification accuracy may start to decrease for very large amounts of training data due to a phenomenon
called overtraining (overfitting).
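A sketch of how such a curve could be generated; evaluate_knn is a hypothetical wrapper around the classification and evaluation code above that returns the accuracy for a given number of training samples per class:

nTrnPerClass = [5 10 15 20];                     % training set sizes to test
acc = zeros(size(nTrnPerClass));
for j = 1 : numel(nTrnPerClass)
    acc(j) = evaluate_knn(DAT, Grp, nTrnPerClass(j));   % hypothetical helper
end
plot(nTrnPerClass, acc, 'o-');
xlabel('# training samples per class'); ylabel('classification accuracy');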
Optimal k   Only systematic testing allows us to find the optimal number of k nearest neighbors. In practice,
often k = 1 or k = 3 is sufficient, but one may also want to check larger neighborhoods.
2.4 Division by Zero, Infinity (Inf), Not a Number (NaN), Large Dataset
Often, some of the data contain useless or missing values. For instance, some dimensions may contain
only zero values; or the feature extraction program may have returned a NaN entry (not a number) or an
Inf entry (infinity). Here is how Matlab deals with that:
- Division by zero: returns a division-by-0 warning and creates an
  Inf entry, if the divisor (denominator) is 0;
  NaN entry, if both divisor and dividend (numerator) are 0.
- Any operation with a NaN or Inf entry remains or produces a NaN or Inf entry.
Because most classifiers use multiplication operations, entries with NaN or Inf values can render results
useless. Matlab classification functions typically take care of this. As a programmer you may want to
eliminate dimensions with zero entries immediately and/or use the nan-commands, nanmean, nanstd,
nancov, ..., to deal with NaN entries. To avoid the creation of Inf entries, one can add the smallest value
possible (eps in Matlab) to a divisor, e.g. try 1/(0+eps), which yields a very large but finite value, thus
permitting further operations with the variable (as opposed to an Inf entry).
If the dataset creates out-of-memory error notifications, try using the datatype single, which needs only half
the storage of the default datatype double (at the price of reduced precision). Initialize for instance with
zeros(nDsc,nDim,'single') and do assignments by converting with single (DAT = single(DAT)).
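A small sketch of these precautions (a minimal illustration, not a complete cleaning routine; DAT is an n x d data matrix):

DAT = DAT(:, any(DAT ~= 0, 1));    % remove dimensions (columns) that contain only zeros
mu  = nanmean(DAT);                % statistics that ignore NaN entries
sd  = nanstd(DAT);
x   = 1 / (0 + eps);               % very large, but finite - instead of Inf
DAT = single(DAT);                 % halve the memory consumption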
2.5 Recapitulation
Advantages
- Decent results with an easily implementable model. In fact, we have implemented a decision rule only
and nothing more.
- Works even when only few training samples are available (n < 5 per class). E.g. most classifiers do not
work well with fewer than 5 samples, whereas the kNN allows one to perform classification with even a
single training sample per class.
Disadvantages   Classification can be slow if the dimensionality d and/or the training set size n is large. The
classifier therefore has O(dn) complexity (per testing sample). See also the Big-O notation in section 5. To
alleviate that problem a number of improvements have been suggested (see course II).
Notes
- Even though the kNN may not provide the best performance, it can serve as a comparison for other
classifiers' performances. If we do not obtain a better performance with more complex classifiers, we
should consider the possibility that we may not have applied the complex classifiers properly. Thus,
in any case, the kNN performance can serve as a check.
- The kNN classifier does not have an actual learning process, that is, no effort is made to abstract
or manipulate the data to derive a simple decision model.
3 Linear Discriminant Analysis (Linear Classifier I)
A linear classifier tries to separate the classes by finding a suitable border (or boundary) between the
classes. A sample point is then classified by determining on which side of the boundary it lies.
Taking the data set in figure 1, a linear classifier essentially tries to place a straight line through the two
point clouds such that it separates the two classes optimally in a statistical sense. For 3 dimensions it
tries to find a plane; for 4 or more dimensions we talk of hyperplanes. The lines/planes represent the
so-called decision boundary. To decide the category of a sample point, we determine on which side of
the decision boundary it lies.
Figurative Example. In our country-guessing example, a linear classifier would attempt to estimate the country
borders and take those as decision boundaries for making our best country guess.
Binary classification: For a binary classification task (two classes only), the model looks as in figure 3.
Given an input vector x, each component x(i) (also denoted as x_i) is multiplied by a corresponding weight
value w(i) (or w_i); together these weights form a weight vector w (whose components represent the hyperplane
parameters):

    g(x) = sum_i x(i) w(i) = x . w = x^t w.

g is also called the discriminant function, and in this case g is simply the inner (or dot) product. This
operation already represents the classification procedure.
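As a minimal sketch, assuming a weight vector w and a bias w0 have already been learned (names are illustrative):

% x: 1 x d testing sample, w: d x 1 weight vector, w0: scalar bias
g = x * w + w0;              % discriminant value (inner product plus bias)
if g > 0
    label = +1;              % class 1
else
    label = -1;              % class 2
end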
Learning: The difficulty is to find the appropriate weight values, which return the best possible
separation between the classes. There exists a large number of methods, which belong to gradient-descent
procedures or matrix-decomposition methods. This is beyond the scope of our introduction, but we look at
gradient-descent procedures in more detail in the Perceptron section of workbook II.
Figure 3: A simple linear classifier having d input units, each corresponding to the values of the components of an input
vector. Each input feature value x_i is multiplied by its corresponding weight w_i; the effective input at the output unit is the
sum of all these products, sum_i w_i x_i. We show in each unit its effective input-output function. Thus each of the d input units
is linear, emitting exactly the value of its corresponding feature value. The single bias unit always emits the constant
value 1.0. The single output unit emits a +1 if w^t x + w_0 > 0 or a -1 otherwise. [Source: Duda, Hart, Stork 2001, Fig 5.1]
Multiple-Class Classification: For a classification task with multiple classes, there is a weight vector w_k (of
length d) for each individual class k; together these can be expressed as a weight matrix W (of size k x d). The
classification procedure then consists of two steps: the first step is the computation of posterior (confidence)
values for each class,

    g_k(x) = x^t w_k,

which results in an array g_k (of length n_classes); in a 2nd step we then determine the category:

    argmax_k g_k.
Building the classifier can be summarized as follows:

Algorithm 2 Linear Discriminant Analysis: the conceptual steps. k = 1, .., n_classes. W (k x d) = {w_k}, G: vector with class labels.
Training: find the optimal weight matrix W for g_k(x) = x^t w_k   (x in D_L)
Testing: for a testing sample x determine g_k(x) = x^t w_k   (x in D_T)
Decision: choose the maximum of g_k: argmax_k g_k
3.1 Implementation
Matlab finds simple discriminant functions with the command classify. The input arguments are
as follows:
sample (1st arg): n_test x d matrix containing the testing samples
training (2nd arg): n_train x d matrix containing the learning samples
group (3rd arg): n_train x 1 vector with class/group labels, where each element corresponds to the class
of the corresponding row in the training matrix (both have the same number of rows, of course)
type (4th arg - optional): allows one to choose different types of classification; 'linear' is the default, whereby
the groups (classes) are fitted with a multivariate Gaussian (eq. 8)
prior (5th arg - optional): if not specified, it is assumed that classes occur with equal probability. In
case of doubt simply use 'empirical': Matlab will calculate the probabilities from group.
The output arguments are as follows:
outclass (1st arg): an n_test x 1 vector, which contains the class assignments for the testing samples; it has
one entry per row of sample (and is of the same type as the grouping variable group, the 3rd input argument)
err (2nd arg): the (apparent) classification error on the training data (training)
Post (3rd arg): n_test x c matrix containing the posterior values in [0, 1]
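Putting the arguments together, a typical call could look as follows (TRN, TST, GrpTrn, GrpTst as in section 2; a sketch assuming the Statistics Toolbox is available):

[LbTst, errTrn, Post] = classify(TST, TRN, GrpTrn, 'linear');
accTst = mean(LbTst == GrpTst);    % accuracy on the testing set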
Optimization of W   This is solved in Matlab by a matrix decomposition, specifically by the command qr,
which performs a so-called orthogonal-triangular decomposition. The line reads:
[Q,R] = qr(training - gmeans(gindex,:), 0);
where gmeans and gindex are the group means and group indices.
Small Sample Size Problem   If the number of available training samples is small, Matlab may complain that
it cannot compute a meaningful covariance matrix. In that case, it will return the following error:
The pooled covariance matrix of TRAINING must be positive definite.
To work around this barrier, it is easiest to apply a dimensionality reduction using the PCA and then retry
with lower dimensionality, which will be the topic of the upcoming section 4.
In the following we point out the essential lines for the linear classifier type:
3.1.1 Excerpt from the LDA implementation in Matlab (classify)
1  [Q,R] = qr(training - gmeans(gindex,:), 0);
2  R = R / sqrt(n - ngroups); % SigmaHat = R'*R
3  s = svd(R);
4  if any(s <= max(n,d) * eps(max(s)))
5      error('stats:classify:BadVariance', ...
6            'The pooled covariance matrix of TRAINING must be positive definite.');
7  end
8  logDetSigma = 2*sum(log(s)); % avoid over/underflow
9  % MVN relative log posterior density, by group, for each sample
10 for k = nonemptygroups
11     A = (sample - repmat(gmeans(k,:), mm, 1)) / R;
12     D(:,k) = log(prior(k)) - .5*(sum(A .* A, 2) + logDetSigma);
13 end
   ...
14 % find nearest group to each observation in sample data
15 [maxD,outclass] = max(D, [], 2);
Line 1: matrix decomposition for W, as mentioned above.
Line 2: scaling of R (so that R'*R corresponds to the pooled covariance estimate).
Line 3: singular value decomposition (another matrix decomposition).
Line 4: essentially checks whether any singular value in s is too small; if so, we receive the error.
3.2 Recapitulation
Advantages   simple and robust; space and time complexity only O(d)
Disadvantages   difficult to obtain reliable results for a small training set (n < 5 per class)
Exercise
Study the beginner's example in appendix A to understand the exact use of the commands. Then start
manipulating the dimensionality; add another point class; etc. Finally, apply it to a bigger data set.
As with the kNN classifier, proper evaluation is best done with repeated estimates (e.g. the 2-fold cross-validation
as introduced above).
4 Dimensionality Reduction
Sometimes it is useful, if not even necessary, to reduce the dimensionality of the data by eliminating
potentially irrelevant dimensions. There can be different reasons to seek this dimensionality reduction. For
instance, the inverse of the covariance matrix cannot be computed, which is needed for the LDA classifier
(previous section) or the Naive Bayes classifier (section 11); or we have a very large dimensionality and hence
slow classification, in which case one may seek to eliminate insignificant dimensions in order to increase the
classification speed; or we need to find patterns and tendencies in a high-dimensional space, which one
tries to uncover by projecting the data onto a 2D or 3D space.
Dimensionality reduction can occur in two principally different ways, whereby here the term feature
stands for dimension:
- Feature Selection is the selection of the best subset of the (original) input feature set.
- Feature Extraction is the transformation or combination of the original feature set to create a new
(reduced) feature set.
4.1 Feature Extraction - PCA DHS p115, 568 Alp p113 ThKo p326
The most popular method for feature extraction is the principal component analysis (PCA), also called the
Karhunen-Loeve transform. It works by aligning the coordinate axes with the directions of greatest variance
and placing the origin of the coordinate axes at the data's center.
Example: assume a 2D data set whose point cloud is elliptical and whose larger diameter is rotated by 45
degrees, see figure 4, left side; the PCA then places one axis along the large diameter (z_1) and another axis
orthogonal to it (z_2), which together form a new, rotated coordinate system, see the right side of the figure.
Figure 4: Principal component analysis. Left: the ellipse represents the outline of an elliptical point cloud
with axes z_1 and z_2. Right: the PCA centers the samples and then rotates the axes to line up with the
directions of highest variance. If the variance on z_2 is too small, it can be ignored and we have a dimensionality
reduction from two to one. [Source: Alpaydin 2010, Fig 6.1]
Algorithm 3 details the individual operations, algorithm 4 the implementation. Steps 1 and 2
(of alg. 3) are performed by the Matlab command princomp. The command princomp returns a d x d
matrix, from which we choose a submatrix PCO of dimensionality d x d_r, with d_r the reduced number of
dimensions (= nPco in the code). The fraction of 0.7 is a suggestion, but should return reasonable results.
We then multiply each sample (x = DAT(i,:)) by this submatrix and obtain the data DATRed with lower
dimensionality (size n x d_r). Now try the classifier again with DATRed instead of the original data DAT
(TRN, respectively).
Automatic search for the ideal # of Principal Components   One can automatically search for the maximal
useful number of components, if one understands Matlab's classify code (useful = allowing the computation of
the LDA without errors). A meaningful number of principal components has to be less than or equal to the minimum
of the number of dimensions and the number of samples, hence the operation min(size(DAT)) in algorithm 4.
Algorithm 3 PCA steps. Performed on D_L.
Parameters: k: number of principal components - or determined algorithmically
Initialization: none in particular
Input: x_i: list of data vectors (D_L), i = 1, .., n
1) Compute: mu: d-dim mean vector; Sigma: d x d covariance matrix
2) Compute the eigenvectors e_i and eigenvalues lambda_i of Sigma
3) Select the k largest eigenvalues and their corresponding eigenvectors
4) Build the d x k matrix A with columns consisting of the k eigenvectors
5) Project the data x onto the k-dim subspace: x' = F_1(x) = A^t (x - mu)
Output: x': list of transformed data vectors
Algorithm 4 Implementation of the PCA. DAT is of size n x d. nObs = n (number of samples).
coeff  = princomp(DAT);                 % principal component coefficients [nDim, nDim]
nPco   = round(min(size(DAT))*0.7);     % reduced dimensionality
PCO    = coeff(:, 1:nPco);              % select the 1st nPco eigenvectors
nObs   = size(DAT, 1);
DATRed = zeros(nObs, nPco);
for i = 1 : nObs
    DATRed(i,:) = DAT(i,:) * PCO;       % transform (project) each sample
end
To find the optimal d_r you simply have to search systematically, e.g. write a loop with increasing d_r
until the inverse can no longer be computed. We would take lines 1-4 from the LDA Matlab excerpt (see
subsection 3.1.1), place them into a function and systematically search for the maximally allowable
dimensionality. Note that this does not necessarily need to be the optimal number of principal components.
To estimate the classification accuracy properly, you have to determine the optimum d_r on the training set
only.
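A sketch of such a systematic search; pooled_cov_ok is a hypothetical helper that wraps lines 1-4 of the excerpt in subsection 3.1.1 and returns false when the error would be triggered:

coeff = princomp(TRN);
dOpt  = 0;
for dr = 1 : min(size(TRN)) - 1
    TRNr = TRN * coeff(:, 1:dr);          % project the training data onto dr components
    if pooled_cov_ok(TRNr, GrpTrn)        % hypothetical check (see subsection 3.1.1)
        dOpt = dr;                        % largest dr for which the LDA is computable
    else
        break;
    end
end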
One may also look at the distribution of the coefficient values coeff and determine a criterion based on how
the decay of values looks for the dataset.
Advisory: Reducing the dimensionality with the PCA does not necessarily mean that we have eliminated
useless dimensions. For the task of discrimination, we may in fact have eliminated useful dimensions: they
may have shown low variance in the PCA analysis, but could still have been useful for discrimination!
4.2 Feature Selection Alp p110 ThKo p261, ch5, pdf 274
Should the combination of LDA & PCA have failed (unlikely though), we can try to select features manually
in one of the following ways.
The simplest way to select the best performing set of features is to test all (binomial) combinations
individually (aka exhaustive search), but that is infeasible for high dimensionality. Instead, suboptimal
but more time-efficient methods are employed. The two most popular ones work by gradually increasing
or decreasing the number of dimensions. In either case, checking the error should be done on a validation
set, which is distinct from the training set because we want to test the generalization accuracy. With more
features we generally have a lower training error, but not necessarily a lower validation error.
Let us denote by F a feature set of input dimensions x_i, i = 1, ..., d. E(F) denotes the error incurred on
the validation sample when only the inputs in F are used. Depending on the application, the error is either
the mean square error or the misclassification error.
Sequential Forward Selection   Here we start with no features: F = {}. At each step, for all possible x_i, we
train our model on the training set and calculate E(F + x_i) on the validation set. Then, we choose the input
x_j that causes the least error,

    j = argmin_i E(F + x_i),

and we add x_j to F if E(F + x_j) < E(F).

We stop if adding any feature does not decrease E. This process may be costly because to decrease the
dimensions from d to k, we need to train and test the system d + (d-1) + (d-2) + ... + (d-k) times, which is
O(d^2). This is a local search procedure and does not guarantee finding the optimal subset, namely the
minimal subset causing the smallest error. For example, x_i and x_j by themselves may not be good but
together may decrease the error a lot; because this algorithm is greedy and adds attributes one by one, it
may not be able to detect this.
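A compact sketch of forward selection; valError is a hypothetical function that trains on the training set using only the feature indices in its argument and returns the error on the validation set:

d = size(TRN, 2);
F = [];  Ecur = inf;                        % current feature subset (indices) and its error
improved = true;
while improved && numel(F) < d
    cand = setdiff(1:d, F);                % features not yet selected
    Etry = zeros(1, numel(cand));
    for c = 1 : numel(cand)
        Etry(c) = valError([F cand(c)]);   % hypothetical: validation error of F plus one feature
    end
    [Emin, ibest] = min(Etry);
    if Emin < Ecur                         % add the feature only if the error decreases
        F = [F cand(ibest)];  Ecur = Emin;
    else
        improved = false;
    end
end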
Sequential Backward Selection   As the name already reveals, this is the backward version of what we
just discussed. We start with F containing all features and follow a similar process, except that we remove one
attribute from F instead of adding one, and we remove the one whose removal causes the least error,

    j = argmin_i E(F - x_i),

and we remove x_j from F if E(F - x_j) < E(F).
5 Evaluating Classifiers & Performance
We now look at different ways of characterizing classifier performance in more detail. One way is to repeat
the classification process for different permutations of the training and testing set - if sufficient training
samples are available (subsection 5.1). For binary classifiers, one can employ signal-detection methods
(subsection 5.2).
5.1 Types of Error Estimation DHS p465
The Ugly Duckling theorem states that classification is impossible without some sort of bias. Bias and
variance are defined as follows (figure 5):
Bias measures the accuracy or quality of the match, that is, the difference between the estimated and the
actual accuracy (the latter we typically do not know). High bias implies a poor match.
Variance measures the precision or specificity of the match. High variance implies a weak match.
Figure 5: theta is the parameter to be estimated. d_i are several estimates (denoted by x) over different samples.
Bias is the difference between the expected value of d and theta. Variance is how much the d_i are scattered around
the expected value. We would like both to be small. [Source: Alpaydin 2010, Fig 4.1]
Bias and variance are affected by the type of resampling; see table 1 for a summary of methods. So far
we have used the holdout method, but a better method is a 5-fold cross-validation, in which the total data
set is partitioned into 5 equally sized sets, of which 4 partitions serve as training, whereas the remaining
partition is used for testing. The partitions are then rotated (shifted) to obtain 5 different performance
estimates, which are then averaged to obtain the overall estimate. In Matlab: crossvalind (Bioinformatics Toolbox).
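A sketch of such a 5-fold loop, combining crossvalind (Bioinformatics Toolbox) with the LDA classifier of section 3 (variable names are illustrative):

nFold = 5;
idx   = crossvalind('Kfold', Grp, nFold);   % fold index (1..5) for every sample
acc   = zeros(nFold, 1);
for f = 1 : nFold
    bTst = (idx == f);                      % fold f is the testing set
    bTrn = ~bTst;                           % the remaining folds form the training set
    Lb   = classify(DAT(bTst,:), DAT(bTrn,:), Grp(bTrn), 'linear');
    acc(f) = mean(Lb == Grp(bTst));
end
accEstimate = mean(acc);                    % averaged performance estimate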
Table 1: Error estimation methods (from Jain et al. 2000). n: sample size, d: dimensionality. (See also Resampling on Wikipedia.)

Resubstitution. Property: all the available data is used for training as well as testing; training set = test set. Comments: optimistically biased estimate, especially when n/d is small.

Holdout. Property: half the data is used for training and the remaining data is used for testing; training and test sets are independent. Comments: pessimistically biased estimate; different partitionings will give different estimates.

Leave-one-out (Jackknife). Property: a classifier is designed using (n-1) samples and evaluated on the one remaining sample; this is repeated n times with different training sets of size (n-1). Comments: the estimate is unbiased but has a large variance; large computational requirement because n different classifiers have to be designed.

Rotation, n-fold cross-validation. Property: a compromise between the holdout and leave-one-out methods; divide the available samples into P disjoint subsets, 1 <= P <= n; use (P-1) subsets for training and the remaining subset for testing. Comments: the estimate has lower bias than the holdout method and is cheaper to implement than the leave-one-out method.

Bootstrap. Property: generate many bootstrap sample sets of size n by sampling with replacement; several estimators of the error rate can be defined. Comments: bootstrap estimates can have lower variance than the leave-one-out method; computationally more demanding; useful for small n.
5.2 Performance Measures for Binary Classifiers Alp p489
The two-category measures can be understood by looking at the graph in figure 7a. The graph depicts
two overlapping density distributions: the distribution on the left represents the signal; the one on the
right represents noise. A decision threshold theta is set, which generates 4 possible types of responses:

TP   true positive    hit                 left of theta
TN   true negative    correct rejection   right of theta
FP   false positive   false alarm         left of theta
FN   false negative   miss                right of theta

Those responses are arranged in a so-called confusion matrix (aka contingency table or cross tabulation):

Figure 6: The confusion matrix (for some artificial example): predicted versus actual outcomes. TPR:
true positive rate; FPR: false positive rate; PPV: positive predictive value; ACC: accuracy. [Source: Szeliski, Tab 4.1]

The columns sum up to the actual number of positives (P) and negatives (N), while the rows sum up to
the predicted number of positives (P') and negatives (N'). The corresponding probabilities sum up to 1:

P  = # positive actual instances     = TP + FN
N  = # negative actual instances     = FP + TN
P' = # positive classified instances = TP + FP
N' = # negative classified instances = FN + TN

Various measures can be defined (see also binary classifier, ROC curve, and precision and recall on Wikipedia):
Name           Formula
error          (FP + FN) / (P + N)
accuracy       (TP + TN) / (P + N) = 1 - error
TP-rate        TP / P
FP-rate        FP / N
precision      TP / P' = TP / (TP + FP)
recall         TP / P = TP-rate
sensitivity    TP / P = TP-rate
specificity    TN / N = 1 - FP-rate
F1 score       2 * (precision * recall) / (precision + recall)
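A small sketch computing these measures from the four counts, here taken from a 2 x 2 confusion matrix CM with the actual class in the rows and the predicted class in the columns (positives first):

TP = CM(1,1);  FN = CM(1,2);
FP = CM(2,1);  TN = CM(2,2);
P  = TP + FN;  N = FP + TN;           % actual positives / negatives
precision   = TP / (TP + FP);
recall      = TP / P;                 % = TP-rate = sensitivity
specificity = TN / N;
accuracy    = (TP + TN) / (P + N);
F1          = 2 * (precision * recall) / (precision + recall);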
ROC curve (receiver operating characteristic): used for 2 categories; plots the true positive rate (on the
y-axis) against the false positive rate (on the x-axis) (figure 7). We can generate this curve only if we have
a way of varying the classifier performance by some parameter, e.g. a threshold. For tasks with more than
2 categories it is not possible to apply these measures directly, but one can simply employ the 2-category
measures in an evaluation of one category versus all other categories (one versus/against all), thereby
creating c two-category measures for a c-category classifier.
Precision-Recall Curve: used for search and rankings in information retrieval, where the ordering of the data
is important.
Figure 7: Discrimination of 2 overlapping categories, or of a signal in noise. a: given a threshold theta we
obtain 4 response types (TP, FP, FN, TN). As the threshold is increased, the number of true positives
(TP) and false positives (FP) increases. b: the ROC curve plots the true positive rate against the false
positive rate. Ideally, the true positive rate should be close to 1, while the false positive rate is close to 0.
The area under the ROC curve (AUC) is often used as a single (scalar) measure of algorithm performance.
Alternatively, the equal error rate is sometimes used. [Source: Szeliski, 2010, Fig 4.23]
5.3 Other Issues
5.3.1 Class Imbalance Problem ThKo p237
In practice there are cases in which one class is represented by many more samples than another class, or
some classes just have very few samples. This is usually referred to as the class imbalance problem. Such
situations occur in a number of applications, such as text classification, the diagnosis of rare medical conditions,
and the detection of oil spills in satellite imaging. Class imbalance may not be a problem if the task is easy
to learn, that is, if the classes are well separable, or if a large training data set is available. If not, one may
consider avoiding possible harmful effects by rebalancing the classes, either by oversampling the small
class and/or by undersampling the large class.
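A minimal sketch of both rebalancing options for a two-class label vector Grp (class 1 small, class 2 large); randsample is part of the Statistics Toolbox:

iS = find(Grp == 1);   iL = find(Grp == 2);             % indices of the small / large class
iSover  = iS(randsample(numel(iS), numel(iL), true));   % oversample the small class (with replacement)
iLunder = iL(randsample(numel(iL), numel(iS), false));  % undersample the large class (without replacement)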
5.3.2 Estimating Classier Complexity - Big O
We already discussed some of the advantages and disadvantages of the different classifier types in terms of
their complexity. This is typically expressed with the so-called Big O notation, see also the Wikipedia page Big O notation.
In short, the notation classifies algorithms by how they respond to changes in input size, e.g. how the input
size affects the processing time or the working space requirements. In our case, we investigate changes in n or d
(of our n x d data matrix). The issue is too complex to elaborate here and we merely summarize what
we mentioned so far:
- kNN: O(dn)
- Bayes: O(d) (xxx verify)
- LDA: space and time complexity only O(d)
6 Clustering - Unsupervised Learning
Sometimes we are given data which we need to organize into meaningful groups (or partitions), which
are now called clusters. We attempt to find clusters in order to uncover trends in the data. This is useful
in many fields such as economics, bioinformatics, artificial intelligence, etc. In comparison to the previous
classification algorithms (which required labeled data), we now deal with unlabeled data, that is, we do not
have knowledge of any class labels. Because we do not have any guidance by labeled data, clustering is
sometimes called unsupervised learning. Given a set of data points (see figure 8), how do we find its dense
regions (clusters), which likely correspond to classes or trends?
Clustering is used for data reduction, hypothesis generation, hypothesis testing, and prediction based on
groups; in image analysis it is often used for scene segmentation. The two following examples are from
ThKo p598:
- Business example for hypothesis testing: cluster analysis is used for the verification of the validity of
a specific hypothesis. Consider, for example, the following hypothesis: big companies invest abroad. One
way to verify whether this is true is to apply cluster analysis to a large and representative set of companies.
Suppose that each company is represented by its size, its activities abroad, and its ability to successfully
complete projects on applied research. If, after applying cluster analysis, a cluster is formed that corresponds
to companies that are large and have investments abroad (regardless of their ability to successfully complete
projects on applied research), then the hypothesis is supported by the cluster analysis.
- Medical example for prediction based on groups: cluster analysis is applied to a data set concerning
patients infected by the same disease. This results in a number of clusters of patients, according to their
reaction to specific drugs. Then, for a new patient, we identify the most appropriate cluster for the patient
and, based on it, we decide on his or her medication.
Figure 8: Illustrating the clustering problem in 2D. We are given a set of points and we are attempting to
find dense regions, which likely correspond to potential classes. Are there two, three or more classes?
Intuitively, one would like to smooth the distribution (section 10) or to measure all point-to-point distances
to obtain a detailed description of the point distribution (subsection 6.2), which however is computationally
very intensive; for large dimensionality or large datasets we therefore use simpler procedures such as the
k-Means algorithm (subsection 6.1).
6.1 k-Means DHS p526 ThKo p741 Alp p145
k-Means clustering is an iterative procedure in which the cluster centers (aka centroids) and sizes are
gradually evolved by sequentially comparing the individual data points (vectors). The procedure starts by
randomly selecting k data points (from the n total data points), which are taken as the initial centroids. Then,
the remaining data points are assigned to the nearest centroids. The resulting partitions (clusters) are
used to compute new centroids, which will have moved slightly from their previous location. With the new
centroids a new nearest-centroid clustering is carried out, which will result in new partitions closer to the
actual clusters. By repeating these two steps, centroid computation and nearest-centroid clustering, the
algorithm gradually moves towards its final clusters. To terminate this cycle requires the definition of a
stopping criterion, e.g. we quit when the new centroids hardly move anymore, which means that the cluster
development has settled.
Algorithm 5 k-Means clustering algorithm. Centroid = cluster center.
Parameters: k: number of clusters.
Initialization: randomly select k samples as the initial centroids.
Input: x: list of vectors
Repeat:
1. Generate a new partition by assigning each pattern to its closest centroid
2. Compute new centroids from the labels obtained in the previous step
(3. Optional): adjust the number of clusters by merging and splitting existing clusters, or by removing small or outlier clusters
Until the stopping criterion is fulfilled (e.g. the new centroids hardly move anymore)
Output: L: list of labels (a cluster label for each x_i)
Implementation   In Matlab this exists as the command kmeans, for which one has to specify only the number of
clusters k as the minimal parameter input: IxCls = kmeans(DAT, k);
where IxCls is an index vector with numbers corresponding to the cluster assignment. Unfortunately, the
Matlab version of this command does not take care of NaN entries, that is, it throws out any rows (observations)
where a NaN occurs! To deal with NaN entries you have to modify the script, as for instance:
http://alpha.imag.pub.ro/~rasche/course/patrec/xxxx.
Application (Alp p145): k-Means is very popular in data compression; it is specifically used for vector quantization
in image compression. Let us say we have an image that is stored with 24 bits/pixel and can have up
to 16 million colors. Assume we have a color screen with 8 bits/pixel that can display only 256 colors. We
want to find the best 256 colors among all 16 million colors such that the image using only the 256 colors in
the palette looks as close as possible to the original image. This is color quantization, where we map from
high to lower resolution. In the general case, the aim is to map from a continuous space to a discrete space;
this process is called vector quantization. Of course we can always quantize uniformly, but this wastes the
colormap by assigning entries to colors not existing in the image, or would not assign extra entries to colors
frequently used in the image. For example, if the image is a seascape, we expect to see many shades of
blue and maybe no red. So the distribution of the colormap entries should reflect the original density as
closely as possible, placing many entries in high-density regions and discarding regions where there is no data.
Complexity: O(ndkT), with T the number of iterations (DHS p527)
Advantages   Works relatively fast due to its relatively low complexity; suitable for very large datasets with
tens or hundreds of thousands of samples (for which hierarchical clustering becomes unfeasible).
Disadvantages
1) The number of clusters k must be specified. There exist of course many attempts to find procedures that
determine k automatically, yet none has proven to be effective for all patterns.
2) Does not guarantee optimal results, due to the random initial selection and non-exhaustive comparison.
6.2 Hierarchical Clustering DHS p550 ThKo p653
Hierarchical clustering consists of three steps. In the first step, the pairwise distances between all points
are determined, resulting in an n x n distance matrix (aka similarity matrix if a similarity metric is used). In a
2nd step, a nested hierarchy of all n data points is generated by gradually linking the pairs, starting with the
closest pair. A hierarchy can be represented by a tree, in which case it is called a dendrogram. In a 3rd step,
we cut the tree and the resulting branches form the clusters.
The linking procedure is the important step. There exist two principal types of linking, agglomerative
and divisive. Agglomerative linking starts with the smallest-distance pairs and gradually links to more distal
pairs; divisive linking works the opposite way, by considering all pairs and trying to gradually break down the links.
Agglomerative linking is the more popular one and we therefore consider only that one. (Divisive linking is
computationally so demanding that it is rare in practice.)
Agglomerative (aka Bottom-Up or Clumping) Hierarchical Clustering   This procedure starts with n
(singleton) clusters and forms the hierarchy by successively merging clusters. The general algorithm is:

Algorithm 6 Generalized Agglomerative Scheme (GAS). From ThKo p654.
Parameters: cut threshold theta
Initialization: t = 0; choose R_t = {C_i = {x_i}, i = 1, .., N} as the initial clustering.
Repeat:
  t = t + 1
  Among all possible pairs of clusters (C_r, C_s) in R_{t-1} find the one, say (C_i, C_j), such that

      g(C_i, C_j) = min_{r,s} g(C_r, C_s),   where g is a distance (dissimilarity) function   (1)

  Define C_q = C_i union C_j and produce the new clustering R_t = (R_{t-1} - {C_i, C_j}) union {C_q}
Until all vectors lie in a single cluster.
Cut the hierarchy at level theta.
The distance function in equation 1 can be implemented in different ways and its choice influences the
clustering outcome:
single-linkage (aka nearest neighbor): uses the minimum distance to compute the distances between
samples and clusters. The method tends to generate chain-like clusters.
complete-linkage (aka furthest neighbor): uses the maximum distance to compute the distances between
samples and clusters. The method tends to generate compact clusters.
The difference between the two is explained in the following figure. The example data set is shown in figure
9a and contains 11 points (x_1, .., x_11), whereby the points are already connected by straight line segments
such that 2 clusters were formed - representing the outcome of the clustering procedure.
For the (near complete) trees in figure 9b and c, imagine a y-axis that represents the distance between
clusters. In both graphs the trees lack the top level of the hierarchy, that is, the trees were already cut at a
level where two partitions are generated. In this example the two linkage methods result in the same
partitioning, but for more complex data sets the outcome will likely differ.
Implementation
Single Command: in Matlab the command clusterdata performs this type of clustering. The minimal
parameter that needs to be provided is a cutoff, which is either given as:
- a real value between 0 and 2, where it represents an inconsistency-coefficient threshold,
- or as an integer value (>= 2) specifying the desired number of clusters (like k in kmeans):
Lbl = clusterdata(DAT, 1.25). Lbl contains the assigned cluster labels.
Individual Steps: we program the steps individually as follows:
Dis = pdist(DAT); % pairwise distances
Lnk = linkage(Dis, 'single'); % single-linkage method
Lbl = cluster(Lnk, 'cutoff', 1.25);
Dis contains the N(N-1)/2 pairwise distances between all observations as a row vector (N is the
number of data points/samples/observations).
Lnk is an (N-1) x 3 array containing the connections of the tree, which can be displayed using the command
dendrogram. More specifically, the first two columns contain the tree connections, the third column contains
the distances between clusters.
To visualize the distance matrix, use squareform to rearrange the vector obtained from pdist.
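Putting these pieces together (a small sketch; the cutoff value 1.25 is just the example value from above):

Dis    = pdist(DAT);                    % pairwise distances [1, N*(N-1)/2]
DisMat = squareform(Dis);               % N x N distance matrix, e.g. for imagesc(DisMat)
Lnk    = linkage(Dis, 'single');        % agglomerative single-linkage tree [(N-1), 3]
figure; dendrogram(Lnk);                % visualize the hierarchy
Lbl    = cluster(Lnk, 'cutoff', 1.25);  % cut the tree to obtain cluster labels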
Figure 9: a. The data set: 11 points (x_1, .., x_11). b. The dissimilarity dendrogram as generated by the
single-link condition. c. The dendrogram as generated by the complete-link condition (the top level of the
hierarchy is not shown). [Source: Theodoridis, Koutroumbas, 2008, Fig 13.3]
Application: e.g. biological taxonomy
Advantages   Detailed analysis due to the exhaustive comparison (N^2 pairwise distances).
Disadvantages
1) The optimal cutoff cannot be determined automatically.
2) Suitable only for databases of limited size, due to the O(N^2) complexity.
Notes
Finding the optimal cluster size (via k or by use of an appropriate cutoff) is a general problem in
clustering. Only systematic exploration can help us out, that is, running the clustering for different parameter
values and analyzing the output carefully (see also Alp p158).
Exercise
Take two example sets: a random pattern consisting of a few points (e.g. PtsRnd = rand(50,2);) and the
pattern consisting of an arc above a square grid, see appendix B. Cluster both patterns by distance,
ClsPat = cluster(LnkPat, 'cutoff', 0.11, 'criterion', 'distance');
and compare their dendrograms by plotting everything into one figure (with subplot). Apply different
thresholds to understand exactly the parameters and the clustering operation.
7 Decision Tree ThKo p215, s 4.20, pdf 228 Alp p185, ch 9 DHS p395
A decision tree is a multistage decision system, in which classes are sequentially rejected until we reach a finally accepted class. Figure 10 shows an example for a 2D data set. On the left is shown an (artificial) data set, which consists of 6 regions belonging to 4 different classes (classes 1 and 3 have two instances each). On the right is shown the corresponding tree, which consists of 5 decision nodes (circles) and 6 leaf nodes (squares; aka terminal nodes), which are connected by links or branches. Given a (testing) data point, e.g. x_1 = 0.15, x_2 = 0.5, the decision node t_0 (aka root node) tests the first component, x_1, by applying a threshold value of 1/4: if the value is below, the data point is assigned to class ω_1; if not, the process continues with binary decisions of the general form x_i > α (α = threshold value) until we have found a likely class label. The example in figure 10 (right) is a binary decision tree and splits the space into rectangles with sides parallel to the axes (hyperrectangles for dimensionality > 2). Other types of trees are also possible, which split the space into convex polyhedral cells or into pieces of spheres. Note that it is possible to reach a decision without having tested all available feature components.
Figure 10: Left: a pattern divided into rectangular subspaces by a decision tree. Right: corresponding tree. Circles:
decision nodes. Squares: leaf/terminal nodes. [Source: Theodoridis, Koutroumbas, 2008, Fig 4.27,4.28]
In practice, we often have data of higher dimensionality and we therefore need to develop binary decision trees automatically, that is, we need to find out when which component x_i is tested against what threshold value α_i. We elaborate on this now, discussing 3 issues: impurity, stop splitting and the class assignment rule.
Impurity Every binary split of a node, t, generates two descendant nodes, denoted as t_Y and t_N according to the Yes or No decision; node t is also referred to as the ancestor node (when viewing such a split). The descendant nodes are associated with two new subsets, that is, X_tY and X_tN, respectively (the root node is associated with the training set X).
Now the crucial point: every split must generate subsets that are more class homogeneous compared to the ancestor's subset X_t. This means that the training feature vectors in each one of the new subsets show a higher preference for specific class(es), whereas the data in X_t are more equally distributed among the classes.
Example: for a 4-class task, assume that the vectors in subset X_t are distributed among the classes with equal probability (percentage). If one splits the node so that the points that belong to classes ω_1 and ω_2 form subset X_tY, and the points from ω_3 and ω_4 form subset X_tN, then the new subsets are more homogeneous compared to X_t, or purer in the decision tree terminology.
The goal, therefore, is to define a measure that quantifies node impurity and to split the node so that the overall impurity of the descendant nodes is optimally decreased with respect to the ancestor node's impurity. Let P(ω_i|t) denote the probability that a vector in the subset X_t, associated with a node t, belongs to class ω_i, i = 1, 2, ..., M. A commonly used definition of node impurity, denoted as I(t), is the entropy of subset X_t:

    I(t) = − Σ_{i=1}^{M} P(ω_i|t) log_2 P(ω_i|t)

where log_2 is the logarithm with base 2 (see Shannon's Information Theory for more details). We have:
- Maximum impurity I(t) if all probabilities are equal to 1/M (highest impurity).
- Least impurity I(t) = 0 if all data belong to a single class, that is, if only one of the P(ω_i|t) = 1 and all the others are zero (recall that 0 log 0 = 0).
When determining the threshold at node t, we attempt to choose a value such that the impurity decrease ΔI(t) is large.
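As a small illustration, the entropy impurity of a label subset can be computed with a few lines of Matlab; this is a minimal sketch whose function and variable names are not part of the workbook's code:
function I = nodeImpurity(Lbl, M)
% Lbl: class labels (values 1..M) of the samples in subset X_t; M: number of classes
P = accumarray(Lbl(:), 1, [M 1]) / numel(Lbl);  % class probabilities P(w_i|t)
P = P(P > 0);                                   % recall that 0*log2(0) = 0
I = -sum(P .* log2(P));                         % entropy impurity
The impurity decrease of a candidate split is then ΔI(t) = I(t) − (N_tY/N_t)·I(t_Y) − (N_tN/N_t)·I(t_N), which is exactly what the worked example below computes.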
Example: given is a 3-class discrimination task and a set X_t associated with node t containing N_t = 10 vectors: 4 of these belong to class ω_1, 4 to class ω_2, and 2 to class ω_3. The node splitting results in: subset X_tY with 3 vectors from ω_1 and 1 from ω_2; and subset X_tN with 1 vector from ω_1, 3 from ω_2, and 2 from ω_3. The goal is to compute the decrease in node impurity after splitting. We have that:

    I(t)   = −(4/10) log_2(4/10) − (4/10) log_2(4/10) − (2/10) log_2(2/10) = 1.521
    I(t_Y) = −(3/4) log_2(3/4) − (1/4) log_2(1/4) = 0.815
    I(t_N) = −(1/6) log_2(1/6) − (3/6) log_2(3/6) − (2/6) log_2(2/6) = 1.472

Hence, the impurity decrease after splitting is

    ΔI(t) = 1.521 − (4/10)(0.815) − (6/10)(1.472) = 0.315.
Stop Splitting The natural question that now arises is when one decides to stop splitting a node and declares it as a leaf of the tree. A possibility is to adopt a threshold T and stop splitting if the maximum value of ΔI(t), over all possible splits, is less than T. Other alternatives are to stop splitting either if the cardinality of the subset X_t is small enough or if X_t is pure, in the sense that all points in it belong to a single class.
Class Assignment Rule Once a node is declared to be a leaf, it has to be given a class label. A commonly used rule is the majority rule, that is, the leaf is labeled as ω_j where

    j = argmax_i P(ω_i|t)

In words, we assign a leaf t to that class to which the majority of the vectors in X_t belong.
A critical factor in designing a decision tree is its size. As was the case with multilayer perceptrons, the size of a tree must be large enough but not too large; otherwise it tends to learn the particular details of the training set and exhibits poor generalization performance. Experience has shown that using a threshold value for the impurity decrease as the stop-splitting rule does not lead to trees of the right size. Many times it stops tree growing either too early or too late. The most commonly used approach is to grow a tree up to a large size first and then prune nodes according to a pruning criterion. A number of pruning criteria have been suggested in the literature. A commonly used criterion is to combine an estimate of the error probability with a complexity measuring term (e.g., number of terminal nodes) [Brei 84, Ripl 94].
Algorithm 7 Growing a binary decision tree. From ThKo p219.
Parameters Stop-splitting threshold T
Initialization Begin with the root node, X_t = X.
For each new node t
  For every feature x_k (k = 1, ..., l)
    For every value α_kn (n = 1, ..., N_tk)
      - Generate X_tY and X_tN according to the test: x_k(i) > α_kn, i = 1, ..., N_t
      - Compute ΔI(t|α_kn)
    End
    α_kn0 = argmax_n ΔI(t|α_kn)
  End
  [α_k0n0, x_k0] = argmax_k ΔI(t|α_kn0)
  If the stop-splitting rule is met
    declare node t as a leaf and designate it with a class label
  Else
    Generate nodes t_Y, t_N with corresponding subsets X_tY, X_tN according to the test: x_k0 > α_k0n0
  End
End
Disadvantages It is not uncommon for a small change in the training data set to result in a very different tree, meaning there is a high variance associated with tree induction. The reason for this lies in the hierarchical nature of the tree classifiers. An error that occurs in a higher node propagates through the entire subtree, that is, all the way down to the leaves below it. The variance can be improved by using random forests (see course II).
Advantages
- DT classifiers are particularly useful when the input is non-metric, that is, when we have categorical variables. They also treat mixtures of numeric and categorical variables well.
- Due to their structural simplicity, DTs are easily interpretable.
7.1 Implementation
Matlab: use classregtree for training, eval for testing.
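A minimal usage sketch, assuming training data TRN with labels Grp and test data TST as in the earlier sections (not taken from the workbook; depending on the label type, eval may return a cell array of class names):
Tree   = classregtree(TRN, Grp, 'method', 'classification');  % grow the classification tree
LblTst = eval(Tree, TST);                                      % predict the test samples
view(Tree);                                                    % inspect the tree structure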
8 Combining Classifiers [Ensemble Classifiers] Alp p419, ch 17
The previously introduced classifiers (kNN, Naive Bayes, linear discriminant) attempt to obtain an optimal performance with a single classifier, e.g. with a perfect (single) discrimination function. In contrast, the principle of combining classifiers is to use multiple less-than-perfect classifiers, each one with a mediocre discrimination function for instance; these base classifiers (or base learners) are then combined to form a single (total) decision. The classifier that combines the base learners is called an ensemble classifier or simply a combiner. There are two principal motivations for combining classifiers:
1. We have measurements from separate sources, e.g. a visual and an audio signal, each with its own set of dimensions. Then it is obvious to test whether a combination of separate classifiers, each one geared toward one of those sources, performs better than a single classifier (it is not as obvious for the following motivation) - in this case also known as data fusion. Subsection 8.1 introduces the basics of combining classifiers.
2. We may try to solve the classification problem with a set of classifiers, whereby an individual classifier performs merely above chance level. By combining these opinions we may obtain an expert advice, which is hopefully better than the expert advice of a single classifier. An example is given in subsection 8.2.
Figure 11: Simplest combination (ensemble) classifier. Input x feeds into L different base-learners, whose outputs d_j are combined using f() to generate the final decision. In this example graph, all learners observe the same input; it may be the case that different learners observe different representations of the same input, as in bagging for instance. [Source: Alpaydin 2010, Fig 17.1]
General formulation: We have L base learners h_j (j = 1, ..., L) and an input vector x. Each base learner makes a prediction d_j(x), which in turn is combined with the other predictions to arrive at a final decision:

    y = f(d_1, d_2, ..., d_L | Φ),    (2)

where f() is the combining function with Φ denoting its parameters. For a multi-class discrimination task each base learner generates K outputs and we then deal with a K × L matrix d_ji(x) (number of classes × number of learners).
8.1 Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners:

    y_i = Σ_j w_j d_ji    with  w_j ≥ 0,  Σ_j w_j = 1.    (3)
This is also known as ensembles and linear opinion pools. In the simplest case, all learners are given equal weight (w_j = 1/L), which is also called simple voting: it corresponds to taking an average. Other combination rules are:

    Median    y_i = median_j d_ji    robust to outliers
    Minimum   y_i = min_j d_ji       pessimistic
    Maximum   y_i = max_j d_ji       optimistic
    Product   y_i = Π_j d_ji         veto power

If the outputs d_ji are not posterior probabilities, these rules require that the outputs be normalized to the same scale. Note that after applying the combination rules, the y_i do not necessarily sum up to 1.
If the data set consists of features obtained from different sources, then one should definitely try an ensemble classifier with a voting scheme, as it does not involve any particular tuning, that is, it comes at little effort to test this variant. For instance, we have data with audio and visual features: we train solely the visual features with one LDA and obtain the corresponding posterior values (3rd argument, see subsection 3.1), and we train solely the audio features with another LDA and obtain the corresponding posterior values. We then combine the two sets of posteriors with any rule that gives us the maximum performance.
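A minimal sketch of this idea; the variable names TRNvis, TRNaud, TSTvis, TSTaud and Grp are illustrative assumptions, not from the workbook:
[LbVis, errVis, Pvis] = classify(TSTvis, TRNvis, Grp);  % posteriors from the visual features
[LbAud, errAud, Paud] = classify(TSTaud, TRNaud, Grp);  % posteriors from the audio features
Pens = (Pvis + Paud) / 2;                               % simple voting (average rule)
[mx, LbEns] = max(Pens, [], 2);                         % final class decision per test sample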
8.2 Bagging
Bagging is a voting method whereby the base learners h_j are made different by training them on different subsets of the training set. Bagging can reduce the variance and thus reduce the generalization error.
The subsets are generated by bootstrap, that is, by randomly drawing a subset of samples from the training set with replacement (hence the name bagging = bootstrap aggregation). Given a training set X, we create B variants, X_1, X_2, ..., X_B, by uniformly sampling from X with replacement. (Because sampling is done with replacement, it is possible that some instances are drawn more than once and that certain instances are not drawn at all.) One can use randsample to create different subsets of X, e.g.
for i = 1:nSub
    Ixr = randsample(nTrnSamp, nSubSize, true);  % random sampling with replacement (bootstrap)
    Xsub = X(Ixr,:);                             % the bootstrap subset of X
    % ...train a classifier on Xsub...
end
For each of the training set variants X_i, a classifier h_i is constructed. The final decision is in favor of the class predicted by the majority of the subclassifiers h_i, i = 1, 2, ..., B (a sketch of such a majority vote is given below).
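A minimal majority-vote sketch, assuming Pred is an [nTstSamp × B] matrix holding the class labels predicted by the B bagged classifiers (an illustrative variable, not from the workbook):
LbBag = mode(Pred, 2);   % most frequent label per test sample = final bagging decision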
By randomly selecting a subset, the individual base learners will be slightly different (remember motivation no. 2 above). To increase diversity, bagging works better if the base learner is trained with an unstable algorithm, such as a decision tree, a single- or multilayer perceptron, or a condensed NN. Unstable means that small changes in the training set cause a large difference in the generated learner, namely a high performance variance.
Bagging as such is a method worth trying as it involves few complications. Bagging is successfully used in some applications (e.g. Microsoft's Kinect motion recognition system), specifically together with decision trees, as so-called random forests.
8.3 Component Classiers without Discriminant Functions DHS p498, s. 9.7.2, pdf 576
If we create an ensemble classifier whose base learners consist of different classifier types, e.g. one is an LDA and the other is a kNN classifier, then we need to adjust their outputs, in particular if they do not compute discriminant functions. In order to integrate the information from the different (component) classifiers we must convert their outputs into discriminant values. It is convenient to convert the classifier output g_i to a range between 0 and 1, now denoted g'_i, in order to match them to the posterior values of a (regular) discriminant classifier. The simplest heuristics to this end are the following:
Analog (e.g. NN): softmax transformation:

    g'_i = e^{g_i} / Σ_{j=1}^{c} e^{g_j}.    (4)
Rank order (e.g. kNN): If the output is a rank order list, we assume the discriminant function is linearly proportional to the rank order of the item on the list. The values for g'_i should thus sum to 1, that is, normalization is required.
One-of-c (e.g. decision tree): If the output is a one-of-c representation, in which a single category is identified, we let g'_j = 1 for the j corresponding to the chosen category, and 0 otherwise.
The table gives a simple illustration of these heuristics.
Other normalization schemes are certainly possible too. The Matlab command classify returns the discriminant values as the 3rd output argument, called posteriors, which are already normalized to a range between 0 and 1. Before combining those posteriors with the discriminant values from other component classifiers, it is useful to plot the posterior matrix to see what range of values we deal with.
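As a small illustration of the softmax heuristic of equation 4 (the raw output values below are made up):
g  = [2.1 0.3 -1.0];        % raw discriminant values of one component classifier
gt = exp(g - max(g));       % subtracting the maximum improves numerical stability
gt = gt / sum(gt);          % normalized values in [0,1] that sum to 1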
8.4 Learning the Combination
Instead of choosing a combination rule (see table in subsection 8.1), we may try to optimize the combination stage by training a classifier on the discriminant values being combined. For instance, we train an optimization classifier to combine the discriminant values of an LDA and a kNN classifier, for which the optimization classifier takes a 2 × K matrix as input (2 because we have the LDA and the kNN classifier; K = number of classes) and returns a vector of length K as the final posterior. There are also other ways to combine component classifiers.
To obtain a correct generalization performance, we need to train the base classifiers and the combination stage separately. That means we need to split the training set into a subset for training the base classifiers only, and a subset for the combination stage. Ultimately, this is more complex and requires more training data, but we may gain another few percent by cleverly combining the component classifiers and may thus beat any other classifier.
8.5 One-vs-All Classier
One may also try to learn K classifiers, with each one discriminating one class versus all other classes (one-vs-all); a sketch is given below. When using such an ensemble classifier, one should pay attention to the class imbalance problem (subsection 5.3.1).
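A minimal one-vs-all sketch using K binary LDAs and their posteriors, assuming TRN, TST and labels Grp with values 1..K (illustrative only, not from the workbook):
K = max(Grp);
Score = zeros(size(TST,1), K);
for k = 1:K
    GrpBin = double(Grp == k) + 1;                 % 1 = "rest", 2 = class k
    [Lb, err, Post] = classify(TST, TRN, GrpBin);  % binary LDA
    Score(:,k) = Post(:,2);                        % posterior for class k
end
[mx, LbOva] = max(Score, [], 2);                   % final multi-class decision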
9 Non-Metric Classification DHS pxxx, ch 8, pdf 461
If data are nominal, meaning they are discrete and without any natural notion of similarity or even ordering, then one uses lists of attributes.
A common approach is to specify the values of a fixed number of properties by a property d-tuple. For example, consider describing a piece of fruit by the four properties of color, texture, taste and size. Then a particular piece of fruit might be described by the 4-tuple {red, shiny, sweet, small}, which is a shorthand for color = red, texture = shiny, taste = sweet and size = small. Such data can be classified with decision trees (section 7).
Another common approach is to describe the pattern by a variable-length string of nominal attributes, such as a sequence of base pairs in a segment of DNA, e.g., AGCTTCAGATTCCA, or the letters in a word/text. In that case we use methods dealing with sequences, which we elaborate on next.
9.1 Recognition with Strings DHS p413, s 8.5, pdf 481 ThKo p487, s 8.2.2
A particularly long string is denoted text. Any contiguous string that is part of x is called a substring, segment, or more frequently a factor of x. For example, GCT is a factor of AGCTTC. There is a large number of problems involving computations on strings. The ones that are of greatest importance in pattern recognition are:
- String matching: Given x and text, test whether x is a factor of text, and if so, determine its position.
- Edit distance: Given two strings x and y, compute the minimum number of basic operations - character insertions, deletions and exchanges - needed to transform x into y.
- String matching with errors: Given x and text, find the locations in text where the cost or distance of x to any factor of text is minimal.
- String matching with the don't-care symbol: This is the same as basic string matching, but with a special "don't care" symbol, which can match any other symbol.
We introduce only the first two.
9.1.1 String Matching Distance
Figure 12: The general string-matching problem is to find all shifts s for which the pattern x appears in text. Any such shift is called valid. In this case x = bdac is indeed a factor of text, and s = 5 is the only valid shift. [Source: Duda, Hart, Stork 2001, Fig 8.7]
The simplest detection method is to test each possible shift, which is also called naive string matching. A more sophisticated method, the Boyer-Moore algorithm, uses the matched result at one position to predict better possible matches, thus not testing every position and accelerating the search.
9.1.2 Edit Distance
The edit distance between x and y describes how many fundamental operations are required to transform
x into y. The fundamental operations are:
- substitutions: A character in x is replaced by the corresponding character in y.
- insertions: A character in y is inserted into x, thereby increasing the length of x by one character.
- deletions: A character in x is deleted, thereby decreasing the length of x by one character.
Let C be an m × n matrix of integers associated with a cost or distance, and let δ(·, ·) denote a generalization of the Kronecker delta function, having value 1 if the two arguments (characters) match and 0 otherwise.
The basic edit-distance algorithm (algorithm 8) starts by setting C[0, 0] = 0 and initializing the left column and top row of C with the integer number of steps away from i = 0, j = 0. The core of this algorithm finds
Algorithm 8 Edit distance. From DHS p486.
Initialization x, y, m ← length[x], n ← length[y]
Initialization C[0, 0] = 0
Initialization For i = 1..m, C[i, 0] = i, End
Initialization For j = 1..n, C[0, j] = j, End
For i = 1..m
  For j = 1..n
    Ins = C[i−1, j] + 1                      % insertion cost
    Del = C[i, j−1] + 1                      % deletion cost
    Exc = C[i−1, j−1] + 1 − δ(x[i], y[j])    % (ex)change cost: no cost if the characters match
    C[i, j] = min(Ins, Del, Exc)             % the minimum of the 3 costs
  End
End
Return C[m, n]
the minimum cost for each entry of C, column by column (figure 13). Algorithm 8 is thus greedy in that each column of the distance or cost matrix is filled using merely the costs in the previous column.
As shown in figure 13, x = 'excused' can be transformed to y = 'exhausted' through one substitution and two insertions. The table shows the steps of this transformation, along with the computed entries of the cost matrix C. For the case shown, where each fundamental operation has a cost of 1, the edit distance is given by the value of the cost matrix at the sink, i.e., C[7, 9] = 3.
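A minimal Matlab sketch of algorithm 8, using 1-based indexing so the cost matrix has size (m+1)×(n+1) (not part of the original workbook code):
function d = editDistance(x, y)
m = length(x); n = length(y);
C = zeros(m+1, n+1);
C(:,1) = (0:m)'; C(1,:) = 0:n;                 % initialize first column and row
for i = 1:m
    for j = 1:n
        Ins = C(i,   j+1) + 1;                 % insertion
        Del = C(i+1, j  ) + 1;                 % deletion
        Exc = C(i,   j  ) + (x(i) ~= y(j));    % exchange: cost 0 if the characters match
        C(i+1, j+1) = min([Ins Del Exc]);
    end
end
d = C(m+1, n+1);
Calling editDistance('excused', 'exhausted') returns 3, matching the example above.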
Figure 13: The edit distance calculation for strings x and y can be illustrated in a table. Algorithm 8 begins at the source, i = 0, j = 0, and fills in the cost matrix C, column by column (shown in red), until the full edit distance is placed at the sink, C[i = m, j = n]. The edit distance between 'excused' and 'exhausted' is thus 3. [Source: Duda, Hart, Stork 2001, Fig 8.9]
The algorithm has complexity O(mn) and is rather crude; optimized algorithms have only O(m + n) complexity. Linear programming techniques can also be used to find a global minimum, though this nearly always requires greater computational effort.
Note: as mentioned in the introduction, the pattern can consist of any (limited) set of ordered elements, and not just letters. Example: the edit distance is sometimes applied in computer vision, specifically in shape recognition, for which a shape is expressed as a sequence of classified segments.
10 Density Estimation
Density estimation is the characterization of a data distribution. Density estimation is in principle similar to clustering (section 6), where we had attempted to find classes in the entire dataset by identifying clusters. In density estimation, in contrast, we rather describe the distribution of individual dimensions (features) by identifying their modes (maxima). One can distinguish between parametric and non-parametric methods (sections 10.2 and 10.1).
10.1 Non-Parametric Methods Alp p165
In non-parametric methods, the distribution is piece- or pointwise estimated by either counting the number
of datapoints (subsection 10.1.1) or by smoothing them (subsection 10.1.2).
10.1.1 Histogramming Alp p165
In constructing the histogram, we have to choose both an origin and a bin width. The choice of origin affects
the datapoint count near boundaries of bins, but it is mainly the bin width that has an effect on the estimate.
The estimate is 0 if no instance falls in a bin; there are discontinuities at bin boundaries.
In Matlab: histc
Exercise Take a random, sparse 1D data distribution and understand the binning behavior by using a range of bins that covers exactly the data range; then apply a range of bins exceeding the data range, etc. (see the sketch below). Then look at the individual dimensions of your data. Observe how many modes the distributions contain.
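A minimal sketch for this exercise (random 1D data; variable names are illustrative):
x      = randn(30,1);                       % sparse 1D sample
edges1 = linspace(min(x), max(x), 11);      % bins covering exactly the data range
edges2 = linspace(min(x)-2, max(x)+2, 11);  % bins exceeding the data range
n1 = histc(x, edges1);
n2 = histc(x, edges2);
subplot(2,1,1); bar(edges1, n1, 'histc');
subplot(2,1,2); bar(edges2, n2, 'histc');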
10.1.2 Kernel Estimator (Parzen Windows) Alp p167 ThKo p51
This method smoothens the data, as opposed to just counting them as in histogramming (wiki: kernel smoother). The estimate f(x) consists of the sum of a kernel function K (aka Parzen window) placed at each data point x^t (t = 1, .., N):

    f(x) = (1 / (N h)) Σ_{t=1}^{N} K( (x − x^t) / h ),    (5)

where h is the kernel width. The most common kernel K is the Gaussian function,

    g(x) = (1 / (√(2π) σ)) exp( −(1/2) ((x − μ) / σ)^2 )    (6)

in which case h corresponds to σ and the data points x^t take the role of μ. But one can also use a uniform (box) function, a triangular function or any other radial-basis function for K (wiki: Kernel (statistics)).
In Matlab: ksdensity
If no h is specied, the Matlab script will estimate a value based on simple statistics of the distribution.
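A minimal usage sketch; the 'width' option name is as in older Matlab releases and should be treated as an assumption:
x = [randn(100,1); randn(60,1)+4];         % bimodal toy data
for h = [1.0 0.5 0.25]
    [f, xi] = ksdensity(x, 'width', h);    % Gaussian kernel estimate with width h
    plot(xi, f); hold on;
end
legend('h = 1.0', 'h = 0.5', 'h = 0.25');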
10.2 Parametric Methods Alp p61
Parametric means we express the distribution by parameters, that is, by an equation, which in the context of density estimation is also called the probability density function (PDF). In the non-parametric methods, in contrast (previous subsection), we merely transformed the distribution without expressing it by any parameters.
The simplest parametric description is to take the mean μ and the standard deviation σ, that is, the first-order statistics of the distribution, and to use them in a radial-basis function. The most common radial-basis function is the Gaussian (normal) function (equation 6).
Parameterizing a distribution with first-order statistics would be ideal if the distribution contained only a single mode (a uni-modal distribution). In practice, this is hardly ever true, as discovered above by histogramming the individual dimensions (see previous subsection). Assuming a uni-modal distribution is simply done for computational convenience. There are situations, however, where we wish to parameterize distributions with multiple modes; we would then use a GMM, to be introduced in the following subsection.
Figure 14: Density approximation. Effect of different kernel widths h (1.0, 0.5 and 0.25). The × marks denote datapoints. [Source: Alpaydin 2010, Fig 8.3]
10.2.1 Gaussian Mixture Models (GMM)
When we use Gaussian mixture models, we assume that the distribution is multi-modal and we also specify
the number of modes we expect, very much like in a k-Means clustering algorithm (algorithm 5). The
GMM simply adds the output of k Gaussian functions, whose means and standard deviations correspond
to the location of the modes and the width of the assumed underlying distributions. To find the appropriate mean and standard deviation value for each mode, one uses a so-called Expectation-Maximization (EM) algorithm. The algorithm gradually approaches the optimal values by a search very akin to the k-Means algorithm, hence the relation of density estimation to clustering.
We do not treat this in further detail here and merely point out that GMMs can be modeled in Matlab
with the command gmdistribution (available in statistics toolbox).
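A minimal sketch of fitting a two-component GMM; gmdistribution.fit was the constructor in older statistics toolbox releases, so treat the exact call as an assumption:
X   = [randn(100,2); randn(100,2)+3];   % toy data with two modes
gmm = gmdistribution.fit(X, 2);         % k = 2 Gaussian components, fitted via EM
gmm.mu                                  % estimated means (mode locations)
gmm.Sigma                               % estimated covariances (mode widths)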
11 Naive Bayes Classifier
The Naive Bayes classifier is a method which models the data more explicitly than the other classifiers (kNN and LDA). In fact, in the kNN classifier no modeling takes place at all; in the LDA the data are analyzed for their mean and standard deviation, but the discrimination is based on a hyperplane only. In the Naive Bayes classifier, one goes a step further and makes the decision based on density estimation (as introduced in the previous section); this classifier is thus theoretically the most elegant model, as everything is based on parameterization. Practically, the classifier has only limited success, as too much elegance sometimes lacks the robustness to deal with messy data.
The Naive Bayes classifier performs density estimation assuming uni-modal Gaussian distributions as introduced in subsection 10.2, namely by taking the mean and the standard deviation of the individual feature dimensions for each class (group).
Figurative Example. In our country-guessing example, we would approximate the distribution of cars for each country by a separate density function and then determine our location by using the density functions only. For a given (spatial) location we compute the values for the different countries (from their individual functions), and the one that returns the highest value determines our choice of country.
In figure 1, this Gaussian would be elliptically shaped for both classes, with an approximate shape value of 1 (assuming equal scales on each axis). We then run a classification that is based on these two Gaussian functions. To determine the category for a given sample (vector), we compute the values of the Gaussian functions for each class i (with its class-specific parameters μ_i and σ_i); the maximum function value then determines the selected category.
In 2D the Gaussian function becomes:

    g(x, y) = 1 / (2π σ_x σ_y √(1 − ρ²)) · exp( − 1 / (2(1 − ρ²)) · [ ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ(x − μ_x)(y − μ_y)/(σ_x σ_y) ] )    (7)

where ρ is the correlation between X and Y and where σ_x > 0 and σ_y > 0. In the 2D case, we can express this more compactly using

    μ = [μ_x; μ_y]    and    Σ = [σ_x²  ρσ_xσ_y;  ρσ_xσ_y  σ_y²]

and then formulate as follows, which is also the formula for a multivariate Gaussian (2 or more dimensions):

    g(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ)^t Σ^{−1} (x − μ) )  ≡  N(μ, Σ)    (8)

where
μ is the mean vector, E[ [x_1, x_2, .., x_d]^t ] = [μ_1, μ_2, .., μ_d]^t
Σ is the d × d covariance matrix, Σ = E[ (x − μ)(x − μ)^t ]
|Σ| is the determinant of the covariance matrix
Σ^{−1} is its inverse
(x − μ)^t Σ^{−1} (x − μ) is also called the Mahalanobis distance
Thus we build our classifier as follows:

Algorithm 9 Naive Bayes Classifier. k = 1, .., c (c = number of classes, also denoted K)
Training for each of the c classes (from D_L):
  mean μ_k, covariance Σ_k, determinant |Σ_k|, inverse Σ_k^{−1}, prior P(k)
  g_k as in equation 8
Testing 1) for a testing sample x ∈ D_T determine g_k(x) for all c classes.
  2) multiply each g_k with the class prior P(k): f_k = g_k · P(k)
Decision choose the maximum of f_k: argmax_k f_k
If the classes occur with uneven frequencies, we need to determine the frequency for each class, also called
prior, and include this as pointed out in the training step and in step no. 2 of testing.
This type of classifier is also called the Naive Bayes classifier, because it assumes that the feature value distributions can be approximated by a Gaussian function, which is generally a huge oversimplification (or naive): for most data the feature distribution is non-Gaussian. However, the classifier also bears potential complications, because finding the appropriate density functions can be difficult for complex data or if there are only few training samples (small sample size problem). The latter may prevent the inverse of the covariance matrix from being determined. Although there exist methods to estimate the inverse (e.g. the command pinv in Matlab), it may be easier to try the following two alternatives: one, use a dimensionality reduction, e.g. the PCA (subsection 4.1); or two, try a different classifier.
11.1 Implementation
With the commands cov, det and inv (or pinv), one can conveniently build a Bayes classifier. Here are some code fragments for orientation (see also ThKo p81):
% ----- build class information for the TRAINING set:
AVG = zeros(nCat, nDim);
[COV COVInv] = deal(zeros(nCat, nDim, nDim));
CovDet = zeros(nCat,1);
for k = 1 : nCat
    TrnCat = TRN(Group==k, :);          % [nCatSamp, nDim]
    AVG(k, :) = mean(TrnCat);           % [nCat, nDim]
    CovCat = cov(TrnCat);
    COV(k, :, :) = CovCat;              % [nCat, nDim, nDim]
    CovDet(k) = det(CovCat);            % determinant
    COVInv(k, :, :) = pinv(CovCat);     % pseudo-inverse
end
% ----- testing a (single) sample with index ix (from the TESTING set):
Prob = zeros(nCat, 1);                  % initialize probabilities
for k = 1 : nCat
    DF = AVG(k,:) - TST(ix,:);          % difference between class mean and sample (row vector)
    detCat = abs(CovDet(k));            % retrieve class determinant
    CovInv = squeeze(COVInv(k, :, :));  % retrieve class inverse
    fct = 1 / ( ( (2*pi)^(nDim/2) )*sqrt(detCat) +eps);
    etm = (DF * CovInv * DF')/2;        % half the squared Mahalanobis distance
    Prob(k) = fct * exp(-etm);          % probability (density) for this class
end
[mxc ixc] = max(Prob);                  % final decision (class winner)
Prior: We did not include the prior in this code fragment. Given an index array IxCat with values corresponding to the class assignment (1, .., k, with k = number of classes), we can generate a histogram as follows:
Nocc = accumarray(IxCat, 1, [nCat, 1]) (or use histc)
where nCat is the number of classes (= k); then turn it into a prior (frequency) by dividing by sum(Nocc(:)):
Prior = Nocc./sum(Nocc(:)).
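To include the prior in the decision (step 2 of algorithm 9), the testing fragment above can be extended as follows (a sketch using the variables defined above):
ProbPost = Prob .* Prior;        % f_k = g_k * P(k)
[mxc ixc] = max(ProbPost);       % final decision including the prior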
11.2 Recapitulation
Advantages Decent results with a simple, compact model. Results likely better than with kNN.
Disadvantages It can be difficult to determine the inverse of the covariance matrix. This can happen when certain dimensions have values that are (close to) zero for some classes, or when the number of training samples is small (small sample size problem).
12 Support Vector Machines ThKo p119
Support Vector Machines (SVM) are sometimes assigned to the class of linear classifiers (e.g. together with the Linear Discriminant Analysis). They typically perform better than other linear classifiers but also require more tuning. They are designed as binary (two-category) classifiers, but for multiple categories one can simply create c binary tasks (one versus all others) and then combine their outputs (subsection 8.5). The learning duration of SVMs is typically long and they may only work well if the classes are reasonably separable. The following characteristics make SVMs distinct from ordinary linear classifiers:
1. Kernel function: The SVM uses such functions to project the data into a higher-dimensional space in which the data are hopefully better separable than in their original lower-dimensional space. Kernel functions can be radial-basis functions, quadratic functions, etc.
2. Support vectors: The SVM uses only a few sample vectors for generating the decision boundaries, and those are called support vectors. For a regular linear classifier, there exist multiple reasonable decision boundaries that separate the classes of the training set. For instance, the optimal hyperplane in figure 15 could actually take slightly different orientations. The SVM finds the one that also gives a good generalization performance, whereby the support vectors are exploited for what is called maximizing the margin (the two bidirectional arrows delineate the margin).
Figure 15: Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, at a distance b from the hyperplane. The three support vectors are shown as solid dots. [Source: Duda, Hart, Stork 2001, Fig 5.19]
SVMs are too complex to code quickly. We simply apply them in Matlab using the Bioinformatics toolbox. There are two separate commands for training and testing, svmtrain and svmclassify:
Svm = svmtrain(TRN, Grp);          % returns a structure...
GrpTst = svmclassify(Svm, TST);    % ...which is fed together with the testing data
12.1 Recapitulation
Advantages Better classification accuracy for binary tasks.
Disadvantages Relatively long learning duration; requires more tuning, that is, adjustment until it works; may not work well if the classes are not reasonably separable.
Recommendation: Use the SVM to obtain best results, e.g. for maximizing the classification performance on a data set.
13 Rounding the Picture DHS p84
13.1 Bayesian Formulation
A typical textbook on pattern classification (with mathematical pretense [ambition]) starts by introducing the Bayesian formalism and its application to the decision and classification problem. The Bayesian formulation bears notational, analytical and theoretical elegance, but it is of limited applicability as many real-world data are of large dimensionality. Often, we are occupied with obtaining any reasonable results in the first place. We now introduce this formalism, so that we better understand the language used in some textbooks. The Naive Bayes classifier of section 11 is a simple version of this formalism. The Bayes formalism expresses a decision problem in a probabilistic framework:
Bayes rule:    P(ω_j|x) = p(x|ω_j) P(ω_j) / p(x)        i.e.    posterior = (likelihood × prior) / evidence    (9)
In natural language (see the right side of the equation): DHS p22,23 Alp p50
- Posterior: is the probability for the presence of a specific category ω_j in the sample x.
- Likelihood: is the value computed using the density function. In the example of the Naive Bayes classifier (section 11), it is the value of equation 8.
- Prior: is the probability of the category being present in general, that is, it is the frequency of its occurrence. We called this prior already (see algorithm 9 and subsection 11.1).
- Evidence: is the marginal probability that an observation x is seen (regardless of whether it is a positive or negative example) and ensures normalization. (This was not explicitly calculated.)
More formally: given a sample x, the probability P(ω_j|x) that it belongs to class ω_j is the fraction of the class-conditional probability density function, p(x|ω_j), multiplied by the probability with which the class appears, P(ω_j), divided by the evidence p(x). We can formalize the evidence as follows:

    p(x) = Σ_{j=1}^{c} p(x|ω_j) P(ω_j) = Σ (likelihood × prior) = normalizer ensuring Σ_j P(ω_j|x) = 1    (10)
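A tiny numerical illustration of equations 9 and 10 for two classes (the likelihood and prior values below are made up):
lik   = [0.6 0.1];            % p(x|w_1), p(x|w_2) at some observation x
prior = [0.3 0.7];            % P(w_1), P(w_2)
evid  = sum(lik .* prior);    % p(x), equation 10
post  = lik .* prior / evid   % posteriors P(w_j|x); they sum to 1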
13.1.1 Rephrasing Classifier Methods
Given the above Bayesian formulation, we can now rephrase the working principle of the three classifier types (sections 2, 3, 11) as follows:
k-Nearest-Neighbor (section 2): estimates the posterior values P(ω_j|x) directly, without attempting to compute any density functions (likelihoods); in short, it is a non-parametric method, because no effort is made to find functions that approximate the density p(x|ω_j).
kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
Naive Bayes classifier (section 11): is essentially the simplest version of the Bayesian formulation, and this classifier makes the following two assumptions in particular:
1. It assumes that the features are statistically independent (often loosely referred to as i.i.d., independent and identically distributed). This is also called the Naive Bayes rule. But often we do not know beforehand whether the dimensions are uncorrelated.
2. It assumes that the features are Gaussian distributed (μ = E[x], Σ = E[(x − μ)(x − μ)^t]).
For most data, these are two strong assumptions because most data distributions are more complex. Despite those two strong assumptions, the Naive Bayes classifier often returns acceptable performance.
Linear discriminant functions (section 3): they are similar to the kNN approach in the sense that they do not require knowledge of the form of the underlying probability distributions. (Some researchers argue that attempting to find the density function is a more complex problem than trying to directly develop discriminant functions.)
13.2 Parametric (Generative) vs. Non-Parametric (Discriminative)
Along with the Bayesian framework comes also the distinction between parametric and non-parametric methods (as already implied above and made in section 10). The parametric methods pursue the approximation of the density distributions p(x|ω_j) by functions with a few essential parameters. Non-parametric methods, in contrast, find approximations without any explicit models (and hence parameters), such as the kNN and the Parzen window. Chapters in textbooks are often organized according to this distinction. Here we summarize the typical characterization of methods:
Parametric        Multi-variate methods, (MLE, EM)
Semi-parametric   Clustering, k-means, (EM)
Non-parametric    Parzen, kNN, LDA, SVM, Decision Trees
Note 1: the semi-parametric categorization is found in Alpaydin's textbook.
Note 2: EM: expectation-maximization algorithm (subsection 10.2.1); MLE: maximum-likelihood estimation algorithm. Both are density estimation methods. To be introduced in course II.
Note 3: the EM algorithm can obviously be classified differently, depending on the exact viewpoint.
Note 4: Bishop uses the terms Generative vs. Discriminative.
13.3 Other (Supervised) Statistical Classifiers
- Perceptron: is essentially a linear classifier with a different learning method (course II).
- Neural Networks (NN): are elaborations of the perceptron. The simplest versions are 3-layer networks, which can be regarded as consisting of 2 layers of perceptrons (course II).
- Hidden Markov Models (HMM): are especially suited for classifying dynamic patterns (course II).
13.4 Algorithm-Independent Issues
Curse of Dimensionality Intuitively, one would think that the more dimensions (attributes) we have at our disposal (through measurements), the easier it is to separate the classes (with any classifier). However, one often finds that with an increasing number of dimensions it becomes more challenging to find the appropriate separability, which is also referred to as the curse of dimensionality. On the one hand, if there are irrelevant and possibly obstructive dimensions, it may indeed be better to reduce the dimensionality (as introduced with the PCA for instance). On the other hand, the clever use of kernel functions, as in Support Vector Machines, shows that more parameters can also be useful.
No Free Lunch theorem DHS p454 The theorem essentially states that no classifier technique is superior to any other one. Virtually any powerful algorithm, whether it be kNN, artificial NN, unpruned decision trees, etc., can solve a problem decently if sufficient parameters are created for the problem at hand.
The machine learning community tended to regard the most recently developed classifier methodology as a breakthrough in the quest for a (supposedly) superior classification method. However, after decades of research, it has become clear (to most researchers) that no classifier model is absolutely better than any other one: each classifier has its advantages and disadvantages, and their underlying, individual theoretical motivations are all justified in principle. In order to find the best performing classifier for a given problem, a practitioner essentially has to test them all.
A Appendix - LDA Beginner's Example
The example should work by copy/paste and does not require PCA.
clear;
S1 = [2 1.5; 1.5 3];                  % covariance for the multi-variate normal distribution
PC1 = mvnrnd([0.3 0.5], S1, 50);      % training class 1
PTEST = mvnrnd([0.3 0.5], S1, 30);    % testing (class 1)
PC2 = mvnrnd([3.2 0.5], S1, 50);      % training class 2
PTREN = [PC1; PC2];
Grp = [ones(size(PC1,1),1); ones(size(PC2,1),1)*2];
Lb = classify(PTEST, PTREN, Grp);
H = histc(Lb, [1 2]);
pcCorrect = H(1)/size(PTEST,1);
fprintf('pc correct %1.4f\n', pcCorrect);
%% -------- Plotting
figure(2); clf; hold on;
scatter(PC1(:,1), PC1(:,2), 'Marker', 's', 'MarkerFaceColor', 'b');
scatter(PC2(:,1), PC2(:,2), 'Marker', '^', 'MarkerFaceColor', 'r');
scatter(PTEST(:,1), PTEST(:,2), 'Marker', 'o', 'MarkerFaceColor', 'g');
B Appendix - 2D Toy Data Sets
Create a set of random points from a uniform distribution. Use randn for normal distribution.
PtsRnd = rand(500,2); % 500 random (2D) points
Two densities, both elliptical, but one with gradient and rotated by 45 deg:
nP = 3000;                            % number of points
% --- ellipse with gradient
Mu1 = [5 6];
Si1 = [3 0.2];                        % diagonal covariance (variances along x and y)
P1 = mvnrnd(Mu1, Si1, nP);
P1 = f_RotCo(P1, pi/4);               % rotation by 45 deg (helper function, not listed here)
[v O] = sort(P1(:,1));
Ixr = logspace(0, log10(nP), 1000);   % logarithmically spaced indices -> density gradient
Ixu = unique(round(Ixr));
P1 = P1(O(Ixu),:);
% --- ellipse
Mu2 = [7 5];
Si2 = [.1 1];
P2 = mvnrnd(Mu2, Si2, nP/3);
P = [P1; P2]; % the set of points
An arc above a square grid:
degirad = pi/180;
wd = 45*degirad;
nap = 10;
yyarc = cos(linspace(-wd,wd,nap))*(0.5)+0.4;
xxarc = linspace(.15,.85,nap);
nsp = 5;
yysqu = repmat(linspace(0.1,0.3,nsp),nsp,1); yysqu = yysqu(:);
xxsqu = repmat(linspace(0.3,0.7,nsp),1,nsp);
PtsPat = [xxarc' yyarc'];             % arc points as a [nap x 2] matrix
PtsPat = [PtsPat; [xxsqu' yysqu]];    % append the grid points ([nsp*nsp x 2])
C Appendix - Varia
C.1 Metrics
Minkowski:    L_k(a, b) = ( Σ_{i=1}^{d} |a_i − b_i|^k )^{1/k}
...also referred to as the L_k norm:
- L_1: Manhattan or city-block norm.
- L_2: Euclidean distance norm.
DHS p187
Mahalanobis (distance): Given are a multivariate vector x, a mean vector μ (e.g. obtained from averaging over a class for instance) and a covariance matrix S:

    D_M(x) = sqrt( (x − μ)^T S^{−1} (x − μ) )

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance.
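A small sketch of both metrics in Matlab (the vectors and class statistics are made-up values):
a = [1 2 3]; b = [2 0 3]; k = 2;
Lk = sum(abs(a - b).^k)^(1/k);             % Minkowski L_k; k = 2 gives the Euclidean distance
mu = [0 0 0]; S = eye(3);                  % identity covariance
DM = sqrt((a - mu) * inv(S) * (a - mu)');  % Mahalanobis; equals the Euclidean distance here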
C.2 Whitening Transform
Input: DAT, an n × d matrix; output: DWit, the whitened data.
CovMx = cov(DAT);                  % covariance -> [nDim,nDim] matrix
[EPhi ELam] = eig(CovMx);          % eigenvectors & -values [nDim,nDim]
Ddco = DAT * EPhi;                 % DECORRELATION
LamS = diag(diag(ELam).^(-0.5));   % Lambda^(-1/2) as a diagonal matrix
DWit = Ddco * LamS;                % EQUAL VARIANCE
% verify
COVdco = cov(Ddco);                % covariance of decorrelated data (should be diagonal)
Df = diag(ELam) - diag(COVdco);    % difference of diagonal elements
if sum(abs(Df)) > 0.1, error('odd: differences of diagonal elements very large!?'); end
See also http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf
C.3 Programming Hints
Speed To write fast-running code in Matlab, one should exploit Matlab's matrix-manipulating commands in order to avoid the costly for loops (see for instance repmat or accumarray). Writing a kNN classifier can be conveniently done using the repmat command. However, when dealing with high dimensionality and a large number of samples, exploiting this command can in fact slow down computation, because the machine will spend a significant amount of time allocating the required memory for the large matrices. In that case, it may in fact be faster to maintain one for loop and to use repmat only in a limited way.
Vector Multiplication In mathematical notation a vector is assumed to be a column vector. In Matlab, however, if you define a vector as a=[1 2 3], it is a row vector - in fact, exactly as you write it. To conform with mathematical notation, either transpose the vector immediately by using the transpose sign (e.g., a=[1 2 3]') or by using semi-colons (e.g., a=[1; 2; 3];); otherwise you are forced to change the place of the transpose sign later when applying the dot product (a*b' instead of a'*b), in which case it appears reversed with respect to the mathematical notation! Or simply use the command dot, for which the column/row orientation is irrelevant.
C.4 Mathematical Notation
The mathematical notation in this workbook is admittedly a bit messy, because I took equations from different textbooks. I did not make an effort to create a consistent notation, so that the reader can easily compare the equations to the original text. In the majority of textbooks a vector is denoted with a lower-case letter in bold face, e.g. x; a matrix is denoted as an upper-case letter in bold face. But there are deviations from this norm.
C.5 Some Software Packages
- MatLab: Unfortunately expensive and mostly available either in academia or industry.
- Weka: http://en.wikipedia.org/wiki/Weka_(machine_learning)
- R: supposed to be a replacement for MatLab.
- Python: I have no experience with it.
C.6 Parallel Computing Toolbox in Matlab
Should you be a lucky owner of the Parallel Computing Toolbox in Matlab, then you can even use it on your home PC or laptop, as nowadays home PCs have multiple cores, which permits parallel computing in principle. It is relatively simple to exploit the parallel computing features in for-loops that are suitable for parallel processing: simply open a pool of cores, carry out the loop using the parfor command and then close the pool again.
matlabpool local 2; % opening two cores (workers)
parfor i = 1:1000
A(i) = SomeFunction(Dat, i); % the data are manipulated in some function by counter i
end
matlabpool close;
The parfor loop can not be used if the computations in the loop depend on previous results, for example in an iterative process where A(i) depends on A(i-1). It also only makes sense if the process that is to be repeated in parallel is computationally intensive; otherwise the assignment of the individual steps to the corresponding cores (workers) may slow down the computation.
C.7 Reading
See references for publication details.
(Alpaydin, 2010): An introductory book. Reviews some topics from a different perspective than Duda/Hart/Stork for example. It can be regarded as complementary to this workbook, but also complementary to other textbooks.
(Theodoridis and Koutroumbas, 2008): Contains the most practical tips of those books that also intend to provide the theoretical background. Treats clustering very thoroughly - in more depth than any other textbook. Contains code examples.
(Witten et al., 2011): Probably the most practical machine learning book, but rather short on the motivation of the individual classifier types. It accompanies the Weka machine learning suite (see link above).
(Duda et al., 2001): The professional book. A must-have if one intends to further deepen one's knowledge about pattern classification. The book excels at relating the different classifier philosophies and emphasizes the similarities between classifiers and neural networks. Due to its age (already 12 years for the 2nd edition), it lacks in-depth treatment of recent advances such as combining classifiers and graph methods for instance.
(Bishop, 2007): Another professional book. Contains beautiful illustrations and some historic comments, but aims rather at an advanced readership (upper-level undergraduate and graduate students).
(Jain et al., 2000): A review with some useful summaries. Should be available on the internet. Use Google Scholar.
Wikipedia: Always good for looking up definitions, formulations and different viewpoints. But Wikipedia's variety - originating from the contributions of different authors - is also its shortcoming: it is hard to comprehend a topic as a whole from the individual articles (websites). Hence, textbooks are still irreplaceable.
References
Alpaydin, E. (2010). Introduction to Machine Learning. MIT Press, Cambridge, MA, 2nd edition.
Bishop, C. (2007). Pattern Recognition and Machine Learning. Springer, New York.
Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. John Wiley and Sons Inc, 2nd edition.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
Jain, A., Duin, R., and Jianchang, M. (2000). Statistical pattern recognition: a review. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(1):4-37.
Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition. Academic Press, 4th edition.
Witten, I., Frank, E., and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition.
D Appendix - Example Questions
D.1 Questions
1. You are given a completely new data set of medium size (a few hundred samples in total, up to dimensionality 50; with class labels). What was suggested (in the course) on how to proceed with the analysis?
2. Advantages/disadvantages of kNN, Bayes, LDA, SVM, ... (other methods)?
3. What can we learn from a learning curve? Why would one bother to train with smaller amounts of data and not just use the entire training set?
4. What normalization schemes do you know?
5. You have only few data, but still want to build a classifier to obtain an idea about the classification performance. Let's say you have 3 classes with 3, 5 and 7 samples respectively. Which classifier is preferred?
6. You trained c binary (one-versus-all) classifiers and observe that for increasing training data the performance decreases. How can this be explained?
7. How does the kNN, Bayes, LDA (or another) classifier work?
8. What is characteristic for the SVM?
9. How is the performance of a binary classifier analyzed?
10. You have data whose features (dimensions, variables) come from different sources, e.g. audio and visual. Do you train a single classifier for all features?
11. You have satisfactory results, let's say with the LDA-PCA combination. But now you want to optimize and improve, if necessary by another 1-2 percent. What could you try?
12. What does the PCA do? How do you apply it in Matlab?
13. You are given a set of patterns whose features are drawn from a (limited) set of elements. What classifier do you recommend? Some of the patterns have different (vector) lengths - which classifier could you try now?
14. Your data contain components that have only zero values, or some values may be missing and expressed with NaN. How do you proceed?
15. Does normalization improve performance?
16. You intend to datamine (explore) a huge data set and are given no labels (class information). How do you begin?
17. Compare hierarchical clustering with k-means clustering.
18. What types of error estimation do you know?
D.2 Answers (as hints)
1. Start with kNN to obtain a performance reference, which can serve as a lower bound; if time permits, use the Bayes classifier; then use linear discriminant analysis combined with principal component analysis.
2. As in the script.
3. a) Overfitting: there is the possibility that we obtain better performance for a smaller training set: the learning curve should be increasing, but may also decrease for excessive training data. b) Verification: we gain certainty that we have done everything correctly.
4. As in the script.
5. kNN is the first choice, as it can essentially work with single samples only. You may also try a Naive Bayes classifier. LDA is unlikely to return reliable results.
6. a) Overfitting (see learning curve). b) Class imbalance problem.
7. kNN: storage-based classifier. Each testing sample is compared to all other ...
Bayes: we use Gaussians to approximate the distributions ...
LDA: a weight matrix W is generated that separates the classes ...
8. a) Focuses on samples that were difficult to classify. b) Uses a kernel function to project the data into a higher-dimensional space.
9. With a 4-response table (hit, miss, ...). Ideally the system has parameters with which we can influence the performance and so create an ROC curve (see script for details).
10. One can. But we can also try ensemble classifiers (such as bagging) - it sometimes gives better results.
11. a) SVM. b) Search for the optimal number of principal components combined with LDA. c) Feature selection. d) Ensemble classifier.
12. Finds the axes of variation of the data and rotates the data such that they are aligned with those axes. Applied in Matlab with the command princomp, which returns a d × d matrix from which we select components and then transform the data.
13. Decision tree. If patterns are of unequal length: string matching.
14. It can be ignored if we use, for instance, the PCA and the LDA of Matlab. However, we need to take care of it when clustering, for instance, or when building our own classifier. See script for details.
15. Often, but not always, because normalization can also lead to a distortion of the samples' relations (distances).
16. Clustering, k-means. See script for details.
17. See script for details.
18. Hold-out estimation, cross-fold validation, ... see script.