and Simulation
for Imaging, Bioinformatics and Complex Systems) financed through project POSDRU/86/1.2/S/61756.
Pattern Recognition
An Introductory Workbook for Engineers and Scientists
C. Rasche
Abstract
The purpose of this workbook is to provide practical access to the topic of pattern recognition. The
emphasis lies on applying statistical classification methods and learning their advantages and
disadvantages. We start with the very simple and easily implementable k-Nearest-Neighbor classifier,
followed by the most popular and robust classifier, namely the Linear Discriminant Analysis (LDA). We
learn how to apply the principal component analysis (PCA) and how to properly fold the data. We further
introduce decision trees, ensemble classifiers, clustering methods and string matching methods. Finally,
we also mention Support Vector Machines and the Naive Bayes classifier. The latter helps us to understand
some of the theoretical aspects, e.g. the Bayesian formulation for classification. Matlab code is provided
to facilitate the understanding and implementation of the classifiers.
Prerequisites: basic programming skills
Recommended: basic linear algebra, basic signal processing
Contents
1 Introduction
  1.1 Varia
2 k-Nearest Neighbor (kNN)
  2.1 Implementation
  2.2 Normalization ThKo p263, s5.2.2, pdf 276
  2.3 Evaluation
  2.4 Division by Zero, Infinity (Inf), Not a Number (NaN), Large Dataset
  2.5 Recapitulation
3 Linear Discriminant Analysis (Linear Classifier I)
  3.1 Implementation
    3.1.1 Excerpt from the LDA implementation in Matlab (classify)
  3.2 Recapitulation
4 Dimensionality Reduction
  4.1 Feature Extraction - PCA DHS p115, 568 Alp p113 ThKo p326
  4.2 Feature Selection Alp p110 ThKo p261, ch5, pdf 274
5 Evaluating Classifiers & Performance
  5.1 Types of Error Estimation DHS p465
  5.2 Performance Measures for Binary Classifiers Alp p489
  5.3 Other Issues
    5.3.1 Class Imbalance Problem ThKo p237
    5.3.2 Estimating Classifier Complexity - Big O
6 Clustering - Unsupervised Learning
  6.1 k-Means DHS p526 ThKo p741 Alp p145
  6.2 Hierarchical Clustering DHS p550 ThKo p653
7 Decision Tree ThKo p215, s 4.20, pdf 228 Alp p185, ch 9 DHS p395
  7.1 Implementation
8 Combining Classifiers [Ensemble Classifiers] Alp p419, ch 17
  8.1 Voting
  8.2 Bagging
  8.3 Component Classifiers without Discriminant Functions DHS p498, s. 9.7.2, pdf 576
  8.4 Learning the Combination
  8.5 One-vs-All Classifier
9 Non-Metric Classification DHS pxxx, ch 8, pdf 461
  9.1 Recognition with Strings DHS p413, s 8.5, pdf 481 ThKo p487, s 8.2.2
    9.1.1 String Matching Distance
    9.1.2 Edit Distance
10 Density Estimation
  10.1 Non-Parametric Methods Alp p165
    10.1.1 Histogramming Alp p165
    10.1.2 Kernel Estimator (Parzen Windows) Alp p167 ThKo p51
  10.2 Parametric Methods Alp p61
    10.2.1 Gaussian Mixture Models (GMM)
11 Naive Bayes Classifier
  11.1 Implementation
  11.2 Recapitulation
12 Support Vector Machines ThKo p119
  12.1 Recapitulation
13 Rounding the Picture DHS p84
  13.1 Bayesian Formulation
    13.1.1 Rephrasing Classifier Methods
  13.2 Parametric (Generative) vs. Non-Parametric (Discriminative)
  13.3 Other (Supervised) Statistical Classifiers
  13.4 Algorithm-Independent Issues
A Appendix - LDA Beginner's Example
B Appendix - 2D Toy Data Sets
C Appendix - Varia
  C.1 Metrics
  C.2 Whitening Transform
  C.3 Programming Hints
  C.4 Mathematical Notation
  C.5 Some Software Packages
  C.6 Parallel Computing Toolbox in Matlab
  C.7 Reading
D Appendix - Example Questions
  D.1 Questions
  D.2 Answers (as hints)
1 Introduction
There are many excellent textbooks on the subject of pattern recognition, but they often lack an
example-driven, learning-by-doing approach (though I have not read all textbooks on this topic).
Textbooks often provide the theoretical background first, followed by examples. But the theoretical
background is easier to understand if one has worked through some specific examples. We therefore
provide here an example-first approach, thereby encountering the classifiers' advantages and
disadvantages in practice. After having worked through these examples, any textbook should read fairly easily.
Figure 1: Illustrating the classification problem in 2D.
We are given two sets of points representing two classes (squares and triangles, respectively) - they are
our training samples (example data). To which group would we assign a new sample (testing point) such as
the one marked as a circle?
The two classes may overlap due to measurement noise or because some samples are indeed a mixture
of both classes - nevertheless, we would like to predict a new sample as well as possible.
In the simplest case we compare all training samples with the testing sample (section 2). Or we may
attempt to model the point clouds with functions (like a Gaussian function; section 11). We could also find
a straight line equation which best separates the two point clouds (section 3). Of course, each method has
its advantages and disadvantages - there is no best classifier.
Data Format Collected/measured data often come in a format with two characteristics:
a) uniform dimensionality: all the data samples have equal dimensionality, which allows one to regard a
(single) sample as a d-dimensional vector (also called feature vector).
b) numeric values: the data values are often countable or measurable (as opposed to nominal), e.g. they
are of type integer, real, binary, etc.
Given these two characteristics, we can employ a large body of statistical classification methods, which
exploit statistical information about the data, or one can simply perform (metric) distance measurements
between the samples in order to classify or organize the data. In programming terms, we deal with an
n × d matrix, corresponding to [number of samples × number of dimensions]: each row is a data sample
(or observation), a feature vector x, of which each component (dimension) x(i) is the measurement of an
attribute (or feature or characteristic or variable) of the data (i = 1..d). Examples:
- Computer vision: face detection is frequently done with 60x60 pixel patches, that is, one deals with a
10800-dimensional vector (60 x 60 pixels x 3 color variables). Searching an image means testing a very
large number of 60x60 pixel patches.
- Food inspection: distinguish salmon from sea bass by measuring the degree of lightness and spatial
width, thus two dimensions only. (This is the recurring example in the textbook by Duda/Hart/Stork).
- Bioinformatics ThKo p632: DNA microarray analysis. This is a scientific field of enormous interest and
significance that has already attracted a lot of research effort and investment. In such applications, data
sets of dimensionality as high as 4000 can be encountered.
Note: the term feature (in textbooks) can mean an individual component or variable, as for example in the
term feature selection. But it is sometimes also used to describe a feature vector, that is, a data sample!
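In Matlab terms, the n × d layout described above can be sketched as follows. The numbers are invented toy values (loosely following the two-feature fish example), and the variable names DAT, Grp, nSmp and nDim are chosen here for illustration only:

```matlab
% Hypothetical toy data: 4 samples (rows) with 2 features (columns),
% e.g. lightness and width; the values are made up for illustration.
DAT = [0.8 4.1;
       0.7 3.9;
       0.2 6.0;
       0.3 6.4];
Grp = [1; 1; 2; 2];          % one class label per row (available in supervised learning)

[nSmp, nDim] = size(DAT);    % n = number of samples, d = number of dimensions
x  = DAT(3,:);               % one sample = one feature vector [1, d]
f2 = DAT(:,2);               % one feature (dimension) across all samples [n, 1]
```

Keeping samples in rows and features in columns matches the convention of most Matlab statistics functions, so the same matrix can be passed on unchanged.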
Types of Training Procedures If we have knowledge about class (or group) information in the data,
meaning for each sample (feature vector) we know what class (category or group) it belongs to, then we
apply supervised learning algorithms. Training then takes place with the help of a teacher, one could say.
For instance, if we train a classifier to recognize characters, it is useful to provide labeled examples from
which we learn the appropriate parameter values to perform optimal classification.
If we lack such class knowledge we apply unsupervised learning algorithms, called clustering algorithms
(section 6). Training then occurs without a teacher. For instance, if we are given the entire set of Chinese
characters without any labeling (translation), then we may try to organize them by attempting to find basic
characters expressing frequent words such as house or man.
The first three classifiers we will introduce (next 3 sections) employ supervised learning algorithms.
1.1 Varia
Testing Data Sets It is instructive to start with a toy data set with two dimensions only, see appendix
B, and then to approach higher-dimensional sets. A convenient way to practice is to use the data set of
handwritten digits,
http://yann.lecun.com/exdb/mnist/
as this allows one to easily verify the classifier implementations (as the categories are evident). Bishop
also provides other collections, Bis p677:
http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/datasets.htm
Source for this Workbook A few text passages are copied/pasted/modified from various textbooks, as
well as some of the figures - I have tried to compile the best pieces from each book and provide exact
citations including page number. See appendix C.7 for titles. Our workbook distinguishes itself from the
textbooks by specifying the algorithmic formulation more explicitly and by providing (vectorized) code.
Code The code fragments I provide are vectorized, meaning the slow for loops are avoided: this type of
vector/matrix thinking is unusual at the beginning, but highly recommended for 3 reasons: 1) computation
time is usually much shorter; 2) code is more compact; 3) code is less error-prone. However, the code
fragments may contain unintended mistakes, as I copied/pasted them from my own Matlab scripts and
occasionally made some unverified modifications for instruction purposes.
It can also be useful to check Matlab's file exchange website for demos of various kinds:
http://www.mathworks.com/matlabcentral/fileexchange
Advice We recommend implementing the simple classifier types by oneself, for instance in a high-level
programming language such as Matlab. This can be done with a few lines. For more complex classifiers it
is more convenient to employ existing routines (such as the Linear Discriminant Analysis and the Support
Vector Machines). Why then would one want to implement the simple classifiers at all? There are several
reasons. One is that the existing routines sometimes do not account for special data entries, e.g. NaN
(not a number), or are not optimized for large datasets. Another reason is that one may intend to build
individual classifiers, e.g. ensemble classifiers, for which it may be more convenient to write one's own
code. Furthermore, by writing our own code we know exactly what parameters, conditions, etc. we have
used. Finally, it is part of the learning process, and one gains confidence for later using classification
packages.
2 k-Nearest Neighbor (kNN)
The Idea: The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms.
Given a training set, we simply store all its samples as an exhaustive reference; to classify a testing
sample, we compare all the training samples to it to arrive at a classification decision. That is, we do not
really relate the training samples to each other in any way (except if one uses the covariance matrix for
normalization).
Figurative Example. Determining the country by looking at license plates: You drive across Europe and
determine which country you are currently driving through by looking at the cars' license plates. If the
majority of license plates come from one country, then it is likely that you are currently in that country.
In regions near country borders and near tourist resorts, this probability decreases.
The Procedure: Given is a training set, a matrix TRN with corresponding group (class) labels in vector
GrpTrn, and a testing set, a matrix TST with GrpTst. To classify a sample from the testing set (one row
vector of TST), we measure the distance to all samples in TRN, resulting in a vector Dist of the same
length as GrpTrn. We order the distances in Dist, choose the closest training sample and take its category
label as the label of the testing sample - that would be the nearest neighbor, meaning k = 1. We can also
look at more (nearest) neighbors, e.g. 3, 5, ... and determine which category label occurs the most amongst
those k neighbors (for even k we may face ties).
In other words, a testing sample is classified by assigning it to the most frequent class label amongst its
neighborhood of size k in TRN (figure 2). One can try different distance metrics, e.g. Euclidean,
Manhattan, ... (see also Appendix). There is essentially no initialization required, with the exception of
the possible need to normalize the data.
Figure 2: k-Nearest-Neighbor (kNN). Given are 11 training samples from 2 classes (marked as
squares and triangles); 1 instance (testing sample marked as filled circle) is to be classified. Solid
(thin) circle: 3NN; stippled circle: 5NN.
Algorithm 1 kNN classification. D_L = TRN (training samples), D_T = TST (testing samples). G: vector with
group labels (length = n_TrainingSamples).
Initialization  normalize data
Training        training samples D_L with class (group) labels G.
                (In fact, no actual training takes place here)
Testing         for a testing sample (∈ D_T): compute distances to all training samples → D,
                rank (order) D → D_r
Decision        observe the first k (ranked) distances in D_r (the k nearest neighbors):
                e.g. majority vote: the most frequent class label amongst the kNN determines the category label
2.1 Implementation
Matlab offers the knnclassify command (as part of the bioinformatics toolbox), but coding a kNN classifier
is fairly easy. Here are some fragments to illustrate how little it actually requires (see also ThKo p82):
nTrn = 15; nTst = 4; nDim = 2; nCat = 5;    % sizes (toy values; nTrn = 3 samples x 5 classes)
Grp = reshape(repmat(1:nCat, 3, 1), [], 1); % generating class/group labels of TRN [nTrn, 1]
TRN = randn(nTrn, nDim);                    % some training data
TST = randn(nTst, nDim);                    % some testing data
NNk = zeros(nTst, 11);                      % we will check out 11 nearest neighbors
for i = 1 : nTst
    iTst = repmat(TST(i,:), nTrn, 1);       % replicate to same size [nTrn, nDim]
    Diff = TRN - iTst;                      % difference [nTrn, nDim]
    Dist = sqrt(sum(Diff.^2, 2));           % Euclidean here; Manhattan: sum(abs(Diff),2) [nTrn, 1]
    [dst, ix]  = min(Dist);                 % min distance for 1-NN
    [sds, ixs] = sort(Dist, 'ascend');      % increasing dist for k-NN
    NNk(i,:)   = Grp(ixs(1:11));            % class labels of the 11 closest samples
end
HNN = histc(NNk(:,1:5), 1:nCat, 2);         % histogram of labels for 5 NN [nTst, nCat]
[Fq, LbTst] = max(HNN, [], 2);              % LbTst contains class assignment
See also the programming hints in subsection C.3 for why we chose a for-loop in this case.
2.2 Normalization ThKo p263, s5.2.2, pdf 276
The range of values for different features may vary significantly. It could therefore be beneficial to
normalize your data. There are different possibilities to perform the normalization, for instance:
1. by subtracting the mean from the feature values and dividing by the standard deviation (for that
feature). The resulting normalized features will then have zero mean and unit variance. Matlab: zscore.
2. by limiting the feature values to the range [0, 1] or [-1, 1] by proper scaling.
3. by scaling the feature values by an exponential or tangent function (e.g. tanh).
4. by performing a whitening transformation (DHS pp 34, pdf 54). This is a decorrelation method in which we
transform each (centered) sample with the inverse square root of the covariance matrix of the dataset.
The method is called whitening because it transforms the input matrix to the form of white noise, which
by definition is uncorrelated and has uniform variance (see subsection C.2 for details).
Warning: Normalization may distort the relations between dimensions and hence the distances between
samples. Therefore normalization does not always improve classification (or clustering). It may be useful
to look at the distribution of individual features (e.g. using a plotting command such as hist) to see what
type of normalization may be appropriate.
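Options 1, 2 and 4 can be sketched in a few lines without any toolbox. This is a minimal sketch on invented random data; the variable names are assumptions, and the whitening step uses the eigen-decomposition of the covariance matrix as one common way to whiten, which may differ in detail from the formulation in subsection C.2:

```matlab
rng(1);                                        % reproducible toy data (invented values)
DAT = randn(200,3) * [2 0 0; 1 1 0; 0 0.5 3];  % correlated features [n, d]
n   = size(DAT,1);

% 1) zero mean and unit variance per feature (what zscore computes)
mu = mean(DAT,1);
sd = std(DAT,0,1);
Z  = (DAT - repmat(mu,n,1)) ./ repmat(sd,n,1);

% 2) scaling each feature to the range [0,1]
mn = min(DAT,[],1);
mx = max(DAT,[],1);
S  = (DAT - repmat(mn,n,1)) ./ repmat(mx-mn,n,1);

% 4) whitening: project the centered data onto the eigenvectors of the
%    covariance matrix and scale each axis by 1/sqrt(eigenvalue)
C      = cov(DAT);
[E, L] = eig(C);                               % C = E*L*E'
W      = (DAT - repmat(mu,n,1)) * E * diag(1 ./ sqrt(diag(L)));
```

After whitening, cov(W) is (numerically) the identity matrix, which is a quick way to check the transform. Note that options 1 and 2 fail for a feature with zero variance, since the divisor becomes 0 (see subsection 2.4).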
2.3 Evaluation
Estimating Generalization Performance It is beneficial to know how well our classifier will perform on
new data, in other words, we would like to know its generalization performance on untested data. For that
purpose we partition the data - which we are given - into a training set, that is used exclusively for training,
and a testing set that is used for estimating the generalization performance (we implied this already above).
To begin with, we carry out the following simple partitioning: we halve the dataset, with one half being
the training set, and the other half being the testing set. Generate a performance estimate with the two
halves and then swap the two halves to generate another performance estimate. Take the mean of the two
estimates. This is also called hold-out estimation or two-fold cross-validation. Later we will encounter more
refined estimation methods (section 5).
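The halving-and-swapping procedure can be sketched as follows. This is only an illustration under stated assumptions: the data are two invented Gaussian point clouds, the classifier is a plain 1-NN with Euclidean distance, and all variable names are chosen here:

```matlab
rng(2);
% Hypothetical toy data: two Gaussian classes in 2D, 20 samples each
DAT = [randn(20,2); randn(20,2) + 3];
Grp = [ones(20,1); 2*ones(20,1)];
n   = size(DAT,1);

perm = randperm(n);                       % shuffle before halving
half = {perm(1:n/2), perm(n/2+1:end)};    % sample indices of the two halves
acc  = zeros(1,2);
for f = 1:2
    ixTrn = half{f};  ixTst = half{3-f};  % second pass swaps the roles
    TRN = DAT(ixTrn,:);  GrpTrn = Grp(ixTrn);
    TST = DAT(ixTst,:);  GrpTst = Grp(ixTst);
    LbTst = zeros(numel(ixTst),1);
    for i = 1:numel(ixTst)                % 1-NN with (squared) Euclidean distance
        Diff = TRN - repmat(TST(i,:), size(TRN,1), 1);
        [~, ix] = min(sum(Diff.^2, 2));
        LbTst(i) = GrpTrn(ix);
    end
    acc(f) = mean(LbTst == GrpTst);       % fraction of correctly classified samples
end
accMean = mean(acc);                      % two-fold estimate of the accuracy
```

Shuffling before the split matters whenever the samples are stored class by class, as here; otherwise each half would contain only one class.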
Confusion Matrix Now we analyze which classes were mistaken for which other classes by creating a
(square) confusion matrix of size c × c, where c is the number of classes. The given (actual) category is
typically given in the row, the predicted (classified) category is given in the column. This helps us
understand where the potential classification difficulties for the given data lie. In Matlab, with GrpTst
holding the actual labels of the testing samples:
CM = accumarray([GrpTst LbTst],1,[nCat nCat]);
Or you may use confusionmat if the stats toolbox is available.
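A small worked example may help to read the confusion matrix. The labels below are invented, and acc and accCls are names introduced here for the overall and the per-class accuracy:

```matlab
nCat   = 3;
GrpTst = [1 1 1 2 2 2 3 3 3]';            % actual labels (hypothetical)
LbTst  = [1 1 2 2 2 2 3 1 3]';            % predicted labels (hypothetical)
CM     = accumarray([GrpTst LbTst], 1, [nCat nCat]);  % rows: actual, columns: predicted
acc    = sum(diag(CM)) / sum(CM(:));      % overall accuracy: correct / all
accCls = diag(CM) ./ sum(CM,2);           % accuracy per actual class
% CM = [2 1 0
%       0 3 0
%       1 0 2]
```

The off-diagonal entries show the confusions: one sample of class 1 was taken for class 2, and one sample of class 3 was taken for class 1.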
Learning Curve It is common to test the classifier for different amounts of learning samples (e.g. 5,
10, 15, 20 training samples), and to plot classification accuracy (and/or error) as a function of the number
of training samples, a graph called learning curve. An increase in sample number should typically lead to an
increase in performance - at least initially (if performance only decreases then something is wrong). The
classification accuracy may start to decrease for very large amounts of training due to a phenomenon called
overtraining (overfitting).
Optimal k Only systematic testing allows us to find the optimal number k of nearest neighbors. In practice,
often k = 1 or k = 3 is sufficient, but one may also want to check larger neighborhoods.
2.4 Division by Zero, Infinity (Inf), Not a Number (NaN), Large Dataset
Often, some of the data contain useless or missing values. For instance, some dimensions may contain
only zero values; or the feature extraction program may have returned a NaN entry (not a number) or an
Inf entry (infinity). Here is how Matlab deals with that:
- Division by zero: returns a division-by-0 warning and creates an
Inf entry, if only the divisor (denominator) is 0;
NaN entry, if both divisor and dividend (numerator) are 0.
- Any operation on a NaN entry produces a NaN entry; operations on an Inf entry typically produce an Inf
or NaN entry.
Because most classifiers will use multiplication operations, entries with NaN or Inf values can render results
useless. Matlab classification functions typically take care of this. As a programmer you may want to
eliminate dimensions containing only zero entries immediately and/or use the nan-commands, nanmean,
nanstd, nancov, ... to deal with NaN entries. To avoid the creation of Inf entries, one can add the smallest
value possible (eps in Matlab) to a divisor, e.g. try 1/(0+eps), which yields a very large but finite value
(about 4.5e15), thus permitting further operations with the variable (as opposed to an Inf entry).
If the dataset creates out-of-memory error notifications, try using datatype single, which needs only half
the storage (4 instead of 8 bytes per value) at the cost of reduced precision (about 7 instead of 16
significant decimal digits) compared to the default datatype double. Initialize for instance with
zeros(nDsc,nDim,'single') and do assignments by converting with single (DAT = single(DAT)).
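The behavior described above is easy to try out. The following lines are a minimal sketch with invented values; the names keep, DATk and DATs are chosen here for illustration:

```matlab
a  = 1/0;                      % Inf: only the divisor is zero
b  = 0/0;                      % NaN: divisor and dividend are zero
c  = 1/(0+eps);                % large but finite value
v  = [1 2 NaN 4];
m1 = mean(v);                  % NaN: a single NaN entry spoils the mean
m2 = mean(v(~isnan(v)));       % mean over the valid entries only (as nanmean does)

DAT  = [1 0 3; 2 0 NaN; 4 0 5];
keep = any(DAT ~= 0, 1);       % dimensions that are not all-zero
DATk = DAT(:, keep);           % drop the all-zero dimension (column 2)
DATs = single(DATk);           % convert to single to halve the memory footprint
```

Masking with isnan is the toolbox-free counterpart of the nan-commands and works the same way for std or cov.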
2.5 Recapitulation
Advantages
- Decent results with an easily implementable model. In fact, we have implemented a decision rule only
and nothing more.
- Works even when only few training samples are available (n < 5 per class). E.g. most classifiers do not
work well with fewer than 5 samples, whereas the kNN allows one to perform classification with even only
one training sample per class.
Disadvantages Classification can be slow if dimensionality d and/or training set size n is large: the
classifier has O(dn) complexity per testing sample. See also Big-O notation in section 5. To alleviate that
problem a number of improvements have been suggested (see course II).
Notes
- Even though the kNN may not provide the best performance, it can serve as a comparison for other
classifier performances. If we do not obtain a better performance with more complex classifiers, we
should consider the possibility that we may not have applied the complex classifiers properly. Thus,
in any case, the kNN performance can serve as a check.
- The kNN classifier does not have an actual learning process, that is, no effort was made in abstracting
or manipulating the data to derive a simple decision model.
3 Linear Discriminant Analysis (Linear Classifier I)
A linear classifier tries to separate the classes by finding a suitable border (or boundary) between the
classes. A sample point would then be classified by determining on which side of the boundary it lies.
Taking the data set in figure 1, a linear classifier essentially tries to place a straight line between the two
point clouds such that it separates the two classes optimally in a statistical sense. For 3 dimensions, it
tries to find a plane; for 4 or more dimensions we talk of hyperplanes. The line/planes represent the
so-called decision boundary. To decide the category type of a sample point, we determine on which side of
the decision boundary it lies.
Figurative Example. In our country-guessing example, a linear classifier would attempt to estimate the country
borders and take those as decision boundaries for making our best country guess.
Binary classification: For a binary classification task (two classes only), the model looks as in figure 3.
Given an input vector x, each component x(i) (also denoted as x_i) is multiplied by a corresponding weight
value w(i) (or w_i); the weights together form a weight vector w (whose components represent the
hyperplane parameters):
$$g(\mathbf{x}) = \sum_i x(i)\, w(i) \equiv \mathbf{x} \cdot \mathbf{w} \equiv \mathbf{x}^t \mathbf{w}.$$
g is also called the discriminant function and in this case g is simply the inner (or dot) product. This
operation already represents the classification procedure.
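For several samples at once, the dot product becomes a matrix-vector product. A minimal sketch, in which the weight vector and the samples are invented and the sign of g is used as the decision (as in the caption of figure 3, but without a bias term):

```matlab
w  = [1; -1];                 % hypothetical weight vector (hyperplane parameters) [d, 1]
X  = [2.0 0.5;                % two samples, one per row [n, d]
      0.2 1.5];
g  = X * w;                   % discriminant value x'*w for each sample [n, 1]
Lb = ones(size(g));           % class 1 where g > 0 ...
Lb(g <= 0) = 2;               % ... class 2 otherwise
```

Here g is 1.5 for the first sample and -1.3 for the second, so the two samples fall on opposite sides of the line x(1) = x(2).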
Learning: The difficulty is to find the appropriate weight values, which would return the best possible
separation between classes. There exist a large number of methods for this, most of which belong to gradient
descent procedures or matrix decomposition methods. This is beyond the scope of our introduction, but we
look at gradient descent procedures in more detail in the Perceptron section of workbook II.
Figure 3: A simple linear classifier having d input units, each corresponding to the values of the components of an input
vector. Each input feature value x_i is multiplied by its corresponding weight w_i; the effective input at the output unit is the
sum of all these products, Σ w_i x_i. We show in each unit its effective input-output function. Thus each of the d input units
is linear, emitting exactly the value of its corresponding feature value. The single bias unit always emits the constant
value 1.0. The single output unit emits a +1 if w^t x + w_0 > 0 or a -1 otherwise. [Source: Duda, Hart, Stork 2001, Fig 5.1]
Multiple-Class Classification: For a classification task with multiple classes, there is a weight vector w_k
(of length d) for each individual class k, which can be expressed as a weight matrix W (k × d). The
classification procedure then consists of two steps: the first step is the computation of posterior
(confidence) values for each class,
$$g_k(\mathbf{x}) = \mathbf{x}^t W$$
which results in an array g_k (of length n_classes); in a second step we then determine the category:
$$\arg\max_k g_k.$$
Building the classifier can be summarized as follows:
Algorithm 2 Linear Discriminant Analysis: the conceptual steps. k = 1, .., n_classes. W (k × d) = {w_k},
G: vector with class labels.
Training   find the optimal weight matrix W for g_k(x) = x^t W (x ∈ D_L)
Testing    for a testing sample x determine g_k(x) = x^t W (x ∈ D_T)
Decision   choose the maximum of g_k: argmax_k g_k
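The testing and decision steps can be sketched as follows; the training step (finding W) is left out. The weight matrix and the samples are invented. Note one assumption about the layout: with W stored as a k × d matrix (one row w_k per class) and samples stored as rows, the product is written X*W' in code:

```matlab
W = [ 1  0;                   % hypothetical weight matrix, one row w_k per class [k, d]
      0  1;
     -1 -1];
X = [ 2.0  0.5;               % three testing samples, one per row [n, d]
      0.1  3.0;
     -2.0 -1.0];
G = X * W';                   % g_k for each sample (rows) and class (columns) [n, k]
[gMax, Lb] = max(G, [], 2);   % decision: index of the largest g_k per sample
```

For these values the three samples are assigned to classes 1, 2 and 3, respectively.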
3.1 Implementation
Matlab offers the command classify to find simple discriminant functions. The input arguments are
as follows:
sample (1st arg): n_test × d matrix containing the testing samples.
training (2nd arg): n_train × d matrix containing the learning samples.
group (3rd arg): n_train × 1 vector with class/group labels, where each element corresponds to the class
of the same row in the training matrix (both have the same number of rows of course).
type (4th arg - optional): allows one to choose different types of classification. 'linear' is the default,
whereby groups (classes) are fitted with a multivariate Gaussian (eq. 8).
prior (5th arg - optional): if not specified, it is assumed that classes occur with equal probability. In
case of doubt simply use 'empirical': Matlab will calculate the probabilities from group.
The output arguments are as follows:
outclass (1st arg): an n_test × 1 vector, which contains the class assignments for the testing samples; it is
of the same type as the grouping variable group (3rd input argument).
err (2nd arg): classification error for the training data (training).
Post (3rd arg): n_test × c matrix containing the posterior values ∈ [0, 1].
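A call with all five input and three output arguments might look as follows. This is a sketch on invented toy data and requires the Statistics Toolbox; the variable names are chosen here:

```matlab
rng(3);
% Hypothetical toy data: two Gaussian classes in 2D, 20 training samples each
TRN    = [randn(20,2); randn(20,2) + 3];
GrpTrn = [ones(20,1); 2*ones(20,1)];
TST    = [0 0; 3 3];                      % two testing samples

% linear discriminant with priors estimated from the label frequencies
[LbTst, err, Post] = classify(TST, TRN, GrpTrn, 'linear', 'empirical');
```

LbTst then holds one label per row of TST, and each row of Post sums to 1 across the classes.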
Optimization of W This is solved in Matlab by a matrix decomposition, specifically by the command qr,
which performs a so-called orthogonal-triangular decomposition. The line reads:
[Q,R] = qr(training - gmeans(gindex,:), 0);
where gmeans and gindex are the group means and indices.
Small Sample Size Problem If the number of available training samples is small, Matlab may complain that
it cannot compute a meaningful covariance matrix. In that case, it will return the following error:
The pooled covariance matrix of TRAINING must be positive definite.
To work around this barrier, it is easiest to apply a dimensionality reduction using the PCA and then retry
with lower dimensionality, which will be the topic of the upcoming section 4.
In the following we point out the essential lines for the linear classifier type:
3.1.1 Excerpt from the LDA implementation in Matlab (classify)
1 [Q,R] = qr(training - gmeans(gindex,:), 0);
2 R = R / sqrt(n - ngroups); % SigmaHat = R'*R
3 s = svd(R);
4 if any(s <= max(n,d) * eps(max(s)))
5 error('stats:classify:BadVariance',...
6 'The pooled covariance matrix of TRAINING must be positive definite.');
7 end
8 logDetSigma = 2*sum(log(s)); % avoid over/underflow
9 % MVN relative log posterior density, by group, for each sample
10 for k = nonemptygroups
11 A = (sample - repmat(gmeans(k,:), mm, 1)) / R;
12 D(:,k) = log(prior(k)) - .5*(sum(A .* A, 2) + logDetSigma);
13 end
...
14 % find nearest group to each observation in sample data
15 [maxD,outclass] = max(D, [], 2);
Line 1: the matrix decomposition used to optimize W, as mentioned above.
Line 2: scaling of R such that R'*R is the pooled covariance estimate (R thus plays the role of a standard deviation).
Line 3: singular value decomposition (another matrix decomposition).
Line 4: essentially checking whether any value in s is too small: if so, we receive the error.
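Line 12 of the excerpt can be read as the log posterior of a Gaussian class model with a covariance shared by all classes, up to a constant that is equal for all classes. The following is a reading of the code, not a formula taken from the Matlab documentation; the symbols are notation assumed here: μ_k is row k of gmeans, Σ̂ = R^t R is the pooled covariance estimate, P(k) is prior(k), and s_i are the singular values of R:

```latex
D_k(\mathbf{x}) = \log P(k) - \tfrac{1}{2}\left[ (\mathbf{x}-\boldsymbol{\mu}_k)^t\, \hat{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}_k) + \log\det\hat{\Sigma} \right],
\qquad \hat{\Sigma} = R^t R,
\qquad \log\det\hat{\Sigma} = 2\sum_i \log s_i
```

The first term in the brackets is the squared Mahalanobis distance of the sample to the class mean, computed in line 11 by the division by R. Since the log-determinant is the same for all classes, only the prior and this distance decide which class attains the maximum.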
3.2 Recapitulation
Advantages simple and robust: space and time complexity only O(d)
Disadvantages difficult to obtain reliable results for a small training set (n < 5 for a class)
Exercise
Study the beginner's example in appendix A to understand the exact use of the commands. Then start
manipulating dimensionality; add another point class; etc. Finally, apply to a bigger data set.
As with the kNN classifier, proper evaluation is best done with repeated estimates (e.g. the 2-fold
cross-validation as introduced above).
4 Dimensionality Reduction
Sometimes it is useful, if not even necessary, to reduce the dimensionality of the data by eliminating
potentially irrelevant dimensions. There can be different reasons to seek this dimensionality reduction. For
instance, the inverse of the covariance matrix cannot be computed, which is needed for the LDA classifier
(previous section) or the Naive Bayes classifier (section 11); or we have very large dimensionality and hence
slow classification, in which case one may seek to eliminate insignificant dimensions in order to increase
classification speed; or we need to find patterns and tendencies in a high-dimensional space, which one
tries to uncover by projecting the data onto a 2D or 3D space.
Dimensionality reduction can occur in two fundamentally different ways, whereby here the term feature
stands for dimension:
- Feature Selection is the selection of the best subset of the (original) input feature set.
- Feature Extraction is the transformation or combination of the original feature set to create a new
(reduced) feature set.
4.1 Feature Extraction - PCA DHS p115, 568 Alp p113 ThKo p326
The most popular method for feature extraction is the principal component analysis (PCA), also called the
Karhunen-Loève transform. It works by aligning the coordinate axes to the directions of greatest variance
and placing the origin of the coordinate axes onto the data's center.
Example: assume a 2D data set, whose point cloud is elliptical and whose larger diameter is rotated by 45
degrees, see figure 4 left side; then the PCA places one axis along the large diameter ($z_1$) and another axis
orthogonal to it ($z_2$), which together form a new, rotated coordinate system, see right side of the figure.
Figure 4: Principal components analysis. Left: the ellipse represents the outline of an elliptical point cloud
with axes $z_1$ and $z_2$. Right: the PCA centers the samples and then rotates the axes to line up with the
directions of highest variance. If the variance on $z_2$ is too small, it can be ignored and we have dimensionality
reduction from two to one. [Source: Alpaydin 2010, Fig 6.1].
Algorithm 3 details the individual operations, algorithm 4 details the implementation. Steps 1 and 2
(of alg. 3) are performed by the Matlab command princomp. The command princomp returns a $d \times d$
matrix, from which we choose a submatrix PCO of dimensionality $d \times d_r$, with $d_r$ the reduced number of
dimensions (=nPco in the code). The fraction of 0.7 is a suggestion, but should return reasonable results.
We then multiply each sample (x = DAT(i,:)) by this submatrix and obtain the data DATRed with lower
dimensionality (size $n \times d_r$). Now try the classifier again with just DATRed instead of the original data DAT
or TRN, respectively.
Automatic search for ideal # of Principal Components One can automatically search for the maximal
useful number of components, if one understands the Matlab code (useful = allowing the computation of
LDA without errors). A meaningful number of principal components has to be less than or equal to the minimum
of the number of dimensions and samples, hence the operation min(size(DAT)) in algorithm 4.
Algorithm 3 PCA steps: Performed on $D_L$.
Parameters k: number of principal components - or determined algorithmically
Initialization none particular
Input x: list of data vectors ($\in D_L$), i = 1, .., n
Dimensions
1) Compute: $\mu$: d-dim mean vector
            $\Sigma$: $d \times d$ covariance matrix
2) Compute eigenvectors $e_i$ and eigenvalues $\lambda_i$
3) Selection of k largest eigenvalues and corresponding eigenvectors
4) Build $d \times k$ matrix A with columns consisting of the k eigenvectors
5) Projection of data x onto k-dim subspace $x'$: $x' = F_1(x) = A^t (x - \mu)$
Output $x'$
, i = 1, 2, ..., M. A commonly used denition of node impurity, denoted as I(t), is the entropy for subset
X
t
:
I(t) =
M
i=1
P(
i
|t) log
2
P(
i
|t)
where log
2
is the logarithm with base 2 (see Shannons Information Theory for more details). We have:
- Maximum impurity $I(t)$ if all probabilities are equal to $1/M$
- Least impurity $I(t) = 0$ if all data belong to a single class, that is, if only one of the $P(\omega_i|t) = 1$ and all the
others are zero (recall that $0 \log 0 = 0$).
When determining the threshold at node t, we attempt to choose a value such that the impurity decrease $\Delta I(t)$ is large.
Example: given is a 3-class discrimination task and a set $X_t$ associated with node t containing $N_t = 10$
vectors: 4 of these belong to class $\omega_1$, 4 to class $\omega_2$, and 2 to class $\omega_3$. Node splitting results in: subset
$X_{tY}$, with 3 vectors from $\omega_1$ and 1 from $\omega_2$; and subset $X_{tN}$ with 1 vector from $\omega_1$, 3 from $\omega_2$, and 2 from
$\omega_3$. The goal is to compute the decrease in node impurity after splitting. We have that:
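$$I(t) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{4}{10}\log_2\frac{4}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.522$$
$$I(t_Y) = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 0.811$$
$$I(t_N) = -\frac{1}{6}\log_2\frac{1}{6} - \frac{3}{6}\log_2\frac{3}{6} - \frac{2}{6}\log_2\frac{2}{6} = 1.459$$
Hence, the impurity decrease after splitting is
$$\Delta I(t) = 1.522 - \frac{4}{10}(0.811) - \frac{6}{10}(1.459) = 0.322.$$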
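Such hand calculations are easy to check numerically. The following sketch evaluates the entropy impurity directly from the class counts of the example; the anonymous function H and the variable names nT, nY, nN are assumptions made for this illustration.

```matlab
% Entropy impurity from a vector of class counts; empty classes are skipped (0 log 0 = 0)
H  = @(n) -sum((n(n>0)/sum(n)) .* log2(n(n>0)/sum(n)));

nT = [4 4 2];        % class counts at node t
nY = [3 1 0];        % class counts in the "yes" subset
nN = [1 3 2];        % class counts in the "no" subset

It  = H(nT);
ItY = H(nY);
ItN = H(nN);
dI  = It - sum(nY)/sum(nT)*ItY - sum(nN)/sum(nT)*ItN;   % impurity decrease of the split

fprintf('I(t)=%.3f  I(tY)=%.3f  I(tN)=%.3f  decrease=%.3f\n', It, ItY, ItN, dI);
```

The weights sum(nY)/sum(nT) and sum(nN)/sum(nT) are the fractions of vectors that go to each descendant node.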
Stop Splitting The natural question that now arises is when one decides to stop splitting a node and
declares it as a leaf of the tree. A possibility is to adopt a threshold T and stop splitting if the maximum
value of $\Delta I(t)$, over all possible splits, is less than T. Other alternatives are to stop splitting either if the
cardinality of the subset $X_t$ is small enough or if $X_t$ is pure, in the sense that all points in it belong to a
single class.
Class Assignment Rule Once a node is declared to be a leaf, then it has to be given a class label. A
commonly used rule is the majority rule, that is, the leaf is labeled as $\omega_j$ where
$$j = \arg\max_i P(\omega_i|t)$$
In words, we assign a leaf, t, to that class to which the majority of the vectors in $X_t$ belong.
A critical factor in designing a decision tree is its size. As was the case with the multilayer perceptrons, the
size of a tree must be large enough but not too large; otherwise it tends to learn the particular details of the
training set and exhibits poor generalization performance. Experience has shown that using a threshold
value for the impurity decrease as the stop-splitting rule does not lead to trees of the right size: many
times it stops tree growing either too early or too late. The most commonly used approach is to grow a tree
up to a large size first and then prune nodes according to a pruning criterion. A number of pruning criteria
have been suggested in the literature. A commonly used criterion is to combine an estimate of the error
probability with a complexity measuring term (e.g., number of terminal nodes) [Brei 84, Ripl 94].
Algorithm 7 Growing a binary decision tree. From ThKo p219.
Parameters Stop-splitting threshold T
Initialization Begin with the root node $X_t = X$.
For each new node t
    For every feature $x_k$ (k = 1, ..., l)
        For every value $\alpha_{kn}$ (n = 1, ..., $N_{tk}$)
            - Generate $X_{tY}$ and $X_{tN}$ for: $x_k(i) \leq \alpha_{kn}$, i = 1, ..., $N_t$
            - Compute $\Delta I(t|\alpha_{kn})$
        End
        $\alpha_{kn_0} = \arg\max_{\alpha_{kn}} \Delta I(t|\alpha_{kn})$
    End
    $[\alpha_{k_0 n_0}, x_{k_0}] = \arg\max \Delta I(t|\alpha_{kn_0})$
    If the stop-splitting rule is met
        declare node t as a leaf and designate it with a class label
    Else
        Generate nodes $t_Y$, $t_N$ with corresponding $X_{tY}$, $X_{tN}$ for: $x_{k_0} \leq \alpha_{k_0 n_0}$
    End
End
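The innermost loop of algorithm 7 (testing all thresholds for one feature) can be sketched as follows. This is a simplified illustration under stated assumptions: the toy data x and grp are made up, only a single feature is searched, and the candidate thresholds are taken as midpoints between neighboring feature values rather than the values themselves.

```matlab
% Toy 1D feature with two classes (assumed example data)
x    = [1.0 1.5 2.0 2.5 6.0 6.5 7.0 7.5]';
grp  = [1   1   1   1   2   2   2   2  ]';
nCls = max(grp);

cnt = @(g) accumarray(g, 1, [nCls 1]);                     % class counts of a subset
H   = @(n) -sum((n(n>0)/sum(n)) .* log2(n(n>0)/sum(n)));   % entropy impurity

xs   = unique(x);                            % sorted distinct feature values
cand = (xs(1:end-1) + xs(2:end)) / 2;        % candidate thresholds (midpoints, an assumption)
best = -Inf;  thr = NaN;
for n = 1:numel(cand)
    isY = x <= cand(n);                      % "yes" branch: x <= threshold
    nY  = cnt(grp(isY));
    nN  = cnt(grp(~isY));
    dI  = H(cnt(grp)) - sum(nY)/numel(x)*H(nY) - sum(nN)/numel(x)*H(nN);
    if dI > best, best = dI; thr = cand(n); end   % keep the largest impurity decrease
end
```

For a full tree one would repeat this search over all features at each node and recurse on the two resulting subsets until the stop-splitting rule is met.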
Disadvantages It is not uncommon for a small change in the training data set to result in a very different tree,
meaning there is a high variance associated with tree induction. The reason for this lies in the hierarchical
nature of the tree classifiers. An error that occurs in a higher node propagates through the entire subtree,
that is all the way down to the leaves below it. The variance can be reduced by using random forests (see
course II).
Advantages
- DT classifiers are particularly useful when the input is non-metric, that is when we have categorical
variables. They also treat mixtures of numeric and categorical variables well.
- Due to their structural simplicity, DTs are easily interpretable.
7.1 Implementation
Matlab: use classregtree for training, eval for testing.
8 Combining Classifiers [Ensemble Classifiers] Alp p419, ch 17
The previously introduced classifiers (kNN, Naive Bayes, linear discriminant) attempt to obtain an optimal
performance with a single classifier, e.g. with a perfect (single) discrimination function. In contrast, the
principle of combining classifiers is to use multiple less-than-perfect classifiers, each one with a mediocre
discrimination function for instance; these base classifiers (or base learners) are then combined to form a
single (total) decision. The classifier that combines the base learners is called ensemble classifier or simply
combiner. There are two principal motivations for combining classifiers:
1. We have measurements from separate sources, e.g. a visual and an audio signal, each with its own
set of dimensions. Then, it is natural to test whether a combination of separate classifiers, with each
one geared toward one of those sources, performs better than a single classifier (it is not as obvious for the
following motivation); in this case it is also known as data fusion. Subsection 8.1 introduces the basics of
combining classifiers.
2. We may try to solve the classification problem with a set of classifiers, whereby an individual classifier
performs merely above chance level. By combining these opinions we may obtain an expert
advice, which is hopefully better than the advice of a single classifier. An example is given in
subsection 8.2.
Figure 11: Simplest combination (ensemble) classifier. Input x feeds into L different base-learners, whose output $d_j$
is combined using f() to generate the final decision. In this example graph, all learners observe the same input; it
may be the case that different learners observe different representations of the same input, as in bagging for instance.
[Source: Alpaydin 2010, Fig 17.1]
General formulation: We have L base learners $h_j$ (j = 1, ..., L) and input vector x. Each base learner
makes a prediction $d_j(x)$, which in turn is combined with the other predictions to arrive at a final decision:
$$y = f(d_1, d_2, ..., d_L | \Phi), \qquad (2)$$
where f() is the combining function with $\Phi$ denoting its parameters. For a multi-class discrimination task
each base learner generates K outputs and we then deal with a $K \times L$ matrix $d_{ji}(x)$ (number of classes ×
number of learners).
8.1 Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear
combination of the learners
$$y_i = \sum_j w_j d_{ji} \quad \text{where } w_j \geq 0, \; \sum_j w_j = 1. \qquad (3)$$
This is also known as ensembles and linear opinion pools. In the simplest case, all learners are given
equal weight ($w_j = 1/L$), which is also called simple voting: it corresponds to taking an average. Other
combination rules are

Rule      Fusion function                        Characteristic
Median    $y_i = \mathrm{median}_j \, d_{ji}$    robust to outliers
Minimum   $y_i = \min_j d_{ji}$                  pessimistic
Maximum   $y_i = \max_j d_{ji}$                  optimistic
Product   $y_i = \prod_j d_{ji}$                 veto power
If the outputs $d_{ji}$ are not posterior probabilities, these rules require that outputs be normalized to the same
scale. Note that after applying the combination rules, the $y_i$ do not necessarily sum up to 1.
If the data set consists of features obtained from different sources, then one should definitely try an
ensemble classifier with a voting scheme as it does not involve any particular tuning, that is, it takes
little effort to test this variant. For instance, we have data with audio and visual features: we train one LDA
solely on the visual features and obtain the corresponding posterior values (3rd output argument of classify,
see subsection 3.1), and we train another LDA solely on the audio features and obtain the corresponding
posterior values. We then combine the two sets of posteriors with whichever rule gives us the maximum performance.
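The combination rules can be tried in a few lines once the posteriors of the base learners are available. The sketch below uses two made-up posterior matrices (PostVis and PostAud are hypothetical names and values standing in for the 3rd output of two LDA runs) and stacks them along a third dimension.

```matlab
% Posteriors of two base learners for 3 samples and K = 2 classes (assumed values)
PostVis = [0.9 0.1; 0.4 0.6; 0.2 0.8];    % e.g. from an LDA trained on visual features
PostAud = [0.6 0.4; 0.7 0.3; 0.1 0.9];    % e.g. from an LDA trained on audio features
D = cat(3, PostVis, PostAud);             % [nSamp, K, L] stack of learner outputs

Yavg = mean(D, 3);                        % simple voting (w_j = 1/L)
Ymed = median(D, 3);                      % median rule
Ymin = min(D, [], 3);                     % minimum rule
Ymax = max(D, [], 3);                     % maximum rule
Yprd = prod(D, 3);                        % product rule

[~, clsAvg] = max(Yavg, [], 2);           % decision: class with the largest combined value
[~, clsPrd] = max(Yprd, [], 2);
```

Comparing the class decisions of the different rules on a validation set is then a matter of computing the percentage of correct labels for each of them.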
8.2 Bagging
Bagging is a voting method whereby base learners $h_j$ are made different by training them on different
subsets of the training set. Bagging can reduce variance and thus reduce the generalization error.
The subsets are generated by bootstrap, that is by randomly drawing a subset of samples from the
training set with replacement (hence the name bagging = bootstrap aggregation). Given a training set X,
we create B variants, $X_1, X_2, ..., X_B$, by uniformly sampling from X with replacement. (Because sampling
is done with replacement, it is possible that some instances are drawn more than once and that certain
instances are not drawn at all.) One can use randsample to create different subsets of X, e.g.
for i = 1:nSub
    Ixr = randsample(nTrnSamp, nSubSize, true); % random sampling with replacement
    Xsub = X(Ixr,:); % select the nSubSize sampled rows of X
    % ...train a classifier on Xsub...
end
For each of the training set variants, $X_i$, a classifier $h_i$ is constructed. The final decision is in favor of
the class predicted by the majority of the subclassifiers $h_i$, i = 1, 2, ..., B.
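A complete bagging loop with majority vote might look as follows. This is only a sketch under several assumptions: the toy data are random and well separated, the base learner is a simple nearest-class-mean classifier (chosen because it needs no toolbox, not because it is unstable), and randi is used instead of randsample to draw the bootstrap indices.

```matlab
% Toy two-class training set (assumed example data), well separated
TRN  = [randn(20,2); randn(20,2) + 6];
Grp  = [ones(20,1); 2*ones(20,1)];
TST  = [0 0; 6 6; 0.5 -0.5; 5.5 6.5];
nCat = 2;
B    = 11;                                  % number of bootstrap variants (odd avoids ties)
nTrn = size(TRN, 1);
nTst = size(TST, 1);

Vote = zeros(nTst, B);
for b = 1:B
    Ixr = randi(nTrn, nTrn, 1);             % bootstrap: indices drawn with replacement
    Xb  = TRN(Ixr, :);
    Gb  = Grp(Ixr);
    Dist = zeros(nTst, nCat);
    for k = 1:nCat
        mk = mean(Xb(Gb==k, :), 1);         % class mean within this bootstrap sample
        Dist(:,k) = sum((TST - repmat(mk, nTst, 1)).^2, 2);  % squared distance to class mean
    end
    [~, Vote(:,b)] = min(Dist, [], 2);      % prediction of base learner b
end
GrpTst = mode(Vote, 2);                     % majority vote across the B learners
```

In practice the nearest-mean step would be replaced by the classifier of interest, e.g. a decision tree.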
By randomly selecting a subset, the individual base learners will be slightly different (remember
motivation no. 2 above). To increase diversity, bagging works better if the base learner is trained with an
unstable algorithm, such as a decision tree, a single or multilayer perceptron, or a condensed NN. Unstable
means that small changes in the training set cause a large difference in the generated learner, namely a
high performance variance.
Bagging as such is a method worth trying as it involves few complications. Bagging is successfully
used in some applications (e.g. the Microsoft Kinect motion recognition system), specifically together with
decision trees, so-called random forests.
8.3 Component Classifiers without Discriminant Functions DHS p498, s. 9.7.2, pdf 576
If we create an ensemble classifier, whose base learners consist of different classifier types, e.g. one is an
LDA and the other is a kNN classifier, then we need to adjust their outputs, in particular if they do not compute
discriminant functions. In order to integrate the information from the different (component) classifiers we
must convert their outputs into discriminant values. It is convenient to convert the classifier output $g_i$
to a range between 0 and 1, now $\tilde{g}_i$, in order to match them to the posterior values of (regular) discriminant
classifiers. The simplest heuristics to this end are the following:
Analog (e.g. NN): softmax transformation:
$$\tilde{g}_i = \frac{e^{g_i}}{\sum_{j=1}^{c} e^{g_j}}. \qquad (4)$$
Rank order (e.g. kNN): If the output is a rank order list, we assume the discriminant function is linearly
proportional to the rank order of the item on the list. The values for $g_i$ should thus sum to 1, that is,
normalization is required.
One-of-c (e.g. decision tree): If the output is a one-of-c representation, in which a single category is
identified, we let $g_j = 1$ for the j corresponding to the chosen category, and 0 otherwise.
The table gives a simple illustration of these heuristics.
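The three heuristics can be written down in a few lines. The sketch below uses made-up outputs for c = 3 categories; the variable names and the particular linear scoring of the rank list (best rank receives c points, the last rank 1 point) are assumptions made for this illustration.

```matlab
c = 3;                                    % number of categories

% Analog output: softmax transformation (equation 4)
g     = [2.0 1.0 0.1];                    % analog outputs (assumed values)
eg    = exp(g - max(g));                  % subtracting max(g) avoids overflow, result unchanged
gSoft = eg / sum(eg);

% Rank order output: score linearly proportional to rank, then normalized
rankList = [3 1 2];                       % category 3 ranked first, then 1, then 2
score    = zeros(1, c);
score(rankList) = c:-1:1;                 % best-ranked category gets the largest score
gRank    = score / sum(score);            % values sum to 1

% One-of-c output: 1 for the chosen category, 0 otherwise
chosen = 2;
gOne   = zeros(1, c);
gOne(chosen) = 1;
```

All three vectors now lie between 0 and 1 and can be combined with the rules of subsection 8.1.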
Other normalization schemes are certainly possible too. The Matlab command classify returns the
discriminant values as the 3rd output argument, called posteriors, which are already normalized to a range between
0 and 1. Before combining those posteriors with the discriminant values from other component classifiers,
it is useful to plot the posterior matrix to see what range of values we deal with.
8.4 Learning the Combination
Instead of choosing a combination rule (see table in subsection 8.1), we may try to optimize the
combination stage by training a classifier on the discriminant values being combined. For instance, we train an
optimization classifier to combine the discriminant values for an LDA and a kNN classifier, for which the
optimization classifier takes a $2 \times K$ matrix as input (2 because we have the LDA and the kNN classifier;
K = number of classes) and returns a vector of length K as the final posterior. There are also other ways to
combine component classifiers.
To provide a correct generalization performance, we need to train the base classifiers and the
combination stage separately. That means we need to split the training set into a subset for training the base
classifiers only, and a subset for the combination stage. Ultimately, it is more complex and requires more
training data, but we may gain another few percent by cleverly combining the component classifiers and
may thus beat any other classifier.
8.5 One-vs-All Classifier
One may also try to learn K classifiers, with each one discriminating one class versus all other classes
(one-vs-all). When using such an ensemble classifier, one should pay attention to the class imbalance
problem (subsection 5.3.1).
9 Non-Metric Classification DHS pxxx, ch 8, pdf 461
If data are nominal, meaning if they are discrete and without any natural notion of similarity or even ordering,
then one uses lists of attributes.
A common approach is to specify the values of a fixed number of properties by a property d-tuple. For
example, consider describing a piece of fruit by the four properties of color, texture, taste and size. Then a
particular piece of fruit might be described by the 4-tuple {red, shiny, sweet, small}, which is a shorthand for
color = red, texture = shiny, taste = sweet and size = small. Such data can be classified with decision trees
(section 7).
Another common approach is to describe the pattern by a variable length string of nominal attributes,
such as a sequence of base pairs in a segment of DNA, e.g., AGCTTCAGATTCCA; or the letters in a
word or text. In that case we use methods dealing with sequences, which we elaborate next.
9.1 Recognition with Strings DHS p413, s 8.5, pdf 481 ThKo p487, s 8.2.2
A particularly long string is denoted text. Any contiguous string that is part of x is called a substring,
segment, or more frequently a factor of x. For example, GCT is a factor of AGCTTC. There is a large
number of problems in computations on strings. The ones that are of greatest importance in pattern recognition
are:
- String matching: Given x and text, test whether x is a factor of text, and if so, determine its position.
- Edit distance: Given two strings x and y, compute the minimum number of basic operations - character
insertions, deletions and exchanges - needed to transform x into y.
- String matching with errors: Given x and text, find the locations in text where the cost or distance of
x to any factor of text is minimal.
- String matching with the don't care symbol: This is the same as basic string matching, but with a special
symbol, the don't care symbol, which can match any other symbol.
We introduce only the first two.
9.1.1 String Matching Distance
Figure 12: The general string-matching problem is to find all shifts s for which the pattern x appears
in text. Any such shift is called valid. In this case x = bdac is indeed a factor of text, and s = 5 is the
only valid shift. [Source: Duda, Hart, Stork 2001, Fig 8.7]
The simplest detection method is to test each possible shift, which is also called naive string matching. A
more sophisticated method, the Boyer-Moore algorithm, uses the matched result at one position to predict
better possible matches, thus not testing every position and accelerating the search.
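Naive string matching is short enough to write out. The following sketch tests every shift s of a pattern x against a text; the pattern TTC and the variable names txt, shifts are chosen for this illustration, and the text is the DNA example string used above.

```matlab
txt = 'AGCTTCAGATTCCA';          % text to be searched
x   = 'TTC';                     % pattern (assumed example)
m   = numel(x);
n   = numel(txt);

shifts = [];                     % list of valid shifts
for s = 0:(n - m)
    if strcmp(txt(s+1:s+m), x)   % compare the pattern with the factor starting at shift s
        shifts(end+1) = s;       %#ok<SAGROW> list grows with each valid shift
    end
end
```

The built-in command strfind(txt, x) returns the corresponding start positions (shift + 1) and is preferable in practice.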
9.1.2 Edit Distance
The edit distance between x and y describes how many fundamental operations are required to transform
x into y. The fundamental operations are:
- substitutions: A character in x is replaced by the corresponding character in y.
- insertions: A character in y is inserted into x, thereby increasing the length of x by one character.
- deletions: A character in x is deleted, thereby decreasing the length of x by one character.
Let C be an $m \times n$ matrix of integers associated with a cost or distance and let $\delta(\cdot, \cdot)$ denote a generalization
of the Kronecker delta function, having value 1 if the two arguments (characters) match and 0 otherwise.
The basic edit-distance algorithm (algorithm 8) starts by setting C[0, 0] = 0 and initializing the left column
and top row of C with the integer number of steps away from i = 0, j = 0. The core of this algorithm finds
the minimum cost in each entry of C, column by column (figure 13). Algorithm 8 is thus greedy in that each
column of the distance or cost matrix is filled using merely the costs in the previous column.

Algorithm 8 Edit distance. From DHS p486.
Initialization x, y, m ← length[x], n ← length[y]
Initialization C[0, 0] = 0
Initialization For i = 1..m, C[i, 0] = i, End
Initialization For j = 1..n, C[0, j] = j, End
For i = 1..m
    For j = 1..n
        Ins = C[i-1, j] + 1; % insertion cost
        Del = C[i, j-1] + 1; % deletion cost
        Exc = C[i-1, j-1] + 1 - δ(x[i], y[j]) % no-change or exchange cost
        C[i, j] = min(Ins, Del, Exc) % the minimum of the 3 costs
    End
End
Return C[m, n]
As shown in figure 13, x = excused can be transformed to y = exhausted through one substitution
and two insertions. The table shows the steps of this transformation, along with the computed entries of the
cost matrix C. For the case shown, where each fundamental operation has a cost of 1, the edit distance is
given by the value of the cost matrix at the sink, i.e., C[7, 9] = 3.
Figure 13: The edit distance calculation for strings x and y can be illustrated in a table. Algorithm 8 begins
at source, i = 0, j = 0, and fills in the cost matrix C, column by column (shown in red), until the full edit
distance is placed at the sink, C[i = m, j = n]. The edit distance between excused and exhausted is thus
3. [Source: Duda, Hart, Stork 2001, Fig 8.9]
The algorithm has complexity O(mn) and is rather crude; optimized algorithms have O(m+n) complexity
only. Linear programming techniques can also be used to find a global minimum, though this nearly always
requires greater computational effort.
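Algorithm 8 translates almost line by line into Matlab. The only adjustment in this sketch is the index offset: Matlab indices start at 1, so the entry C[i, j] of the algorithm is stored in C(i+1, j+1). The variable names follow the algorithm; the example strings are those of figure 13.

```matlab
x = 'excused';
y = 'exhausted';
m = numel(x);
n = numel(y);

C = zeros(m+1, n+1);               % C(i+1, j+1) holds C[i, j] of algorithm 8
C(:, 1) = (0:m)';                  % left column: C[i, 0] = i
C(1, :) = 0:n;                     % top row:     C[0, j] = j
for i = 1:m
    for j = 1:n
        Ins = C(i,   j+1) + 1;                     % C[i-1, j] + 1
        Del = C(i+1, j  ) + 1;                     % C[i, j-1] + 1
        Exc = C(i,   j  ) + 1 - (x(i) == y(j));    % C[i-1, j-1] + 1 - delta(x[i], y[j])
        C(i+1, j+1) = min([Ins, Del, Exc]);        % the minimum of the 3 costs
    end
end
d = C(m+1, n+1);                   % edit distance, here C[7, 9]
```

For the two example strings the result is d = 3 (one substitution and two insertions).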
Note: as mentioned in the introduction, the pattern can consist of any (limited) set of ordered elements,
and not just letters. Example: The edit distance is sometimes applied in computer vision, specifically shape
recognition, for which a shape is expressed as a sequence of classified segments.
10 Density Estimation
Density estimation is the characterization of a data distribution. Density estimation is in principle similar to
clustering (section 6), where we had attempted to find classes in the entire dataset by identifying clusters.
In density estimation in contrast, we rather describe the distribution of individual dimensions (features) by
identifying their modes (maxima). One can distinguish between parametric and non-parametric methods
(sections 10.2 and 10.1).
10.1 Non-Parametric Methods Alp p165
In non-parametric methods, the distribution is piece- or pointwise estimated by either counting the number
of datapoints (subsection 10.1.1) or by smoothing them (subsection 10.1.2).
10.1.1 Histogramming Alp p165
In constructing the histogram, we have to choose both an origin and a bin width. The choice of origin affects
the datapoint count near boundaries of bins, but it is mainly the bin width that has an effect on the estimate.
The estimate is 0 if no instance falls in a bin; there are discontinuities at bin boundaries.
In Matlab: histc
Exercise Take a random, sparse 1D data distribution and understand the binning behavior by using a range
of bins, that cover exactly the data range; then apply a range of bins exceeding the data range, etc. Then
look at the individual dimensions of your data. Observe how many modes the distributions contain.
10.1.2 Kernel Estimator (Parzen Windows) Alp p167 ThKo p51
This method smooths the data as opposed to just counting them as in histogramming (wiki: kernel
smoother). The estimate f(x) consists of the sum of a kernel function K (aka Parzen window) placed
at each data point $x_t$ (t = 1, .., N):
$$f(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\left(\frac{x - x_t}{h}\right), \qquad (5)$$
where h is the kernel width. The most common kernel K is the Gaussian function,
$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \qquad (6)$$
in which case h corresponds to $\sigma$ and $\mu$ to $x_t$. But one can also use a uniform (box) function, a triangular
function or any other radial-basis function for K (wiki: Kernel (statistics)).
In Matlab: ksdensity
If no h is specified, the Matlab function will estimate a value based on simple statistics of the distribution.
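To see what the kernel estimator of equation 5 does, it can also be written out by hand. The following sketch uses a Gaussian kernel with zero mean and unit width; the data points xt, the kernel width h = 0.5 and the evaluation grid xg are assumptions made for this example.

```matlab
xt = [1.0 1.2 1.9 3.5 3.7 4.0];            % data points x_t (assumed values)
h  = 0.5;                                  % kernel width
N  = numel(xt);
xg = linspace(-2, 7, 901);                 % grid on which the estimate is evaluated

K  = @(u) exp(-0.5 * u.^2) / sqrt(2*pi);   % Gaussian kernel (equation 6 with mu = 0, sigma = 1)
f  = zeros(size(xg));
for t = 1:N
    f = f + K((xg - xt(t)) / h);           % place one kernel at each data point
end
f = f / (N * h);                           % normalization of equation 5

area = trapz(xg, f);                       % numerical integral, should be close to 1
```

Plotting f against xg for different values of h (e.g. 1.0, 0.5 and 0.25) shows how the kernel width controls the smoothness of the estimate.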
10.2 Parametric Methods Alp p61
Parametric means we express the distribution by parameters, that is, by an equation, which is also called
the probability density function (PDF) in the context of density estimation. In the non-parametric methods
in contrast (previous subsection), we merely transformed the distribution without expressing it by any
parameters.
The simplest parametric description is to take the mean $\mu$ and standard deviation $\sigma$, that is to take
the first-order statistics of the distribution, and to use them in a radial-basis function. The most common
radial-basis function is the Gaussian (normal) function (equation 6).
Parameterizing a distribution with first-order statistics would be ideal if the distribution contained only a
single mode (a uni-modal distribution). In practice, this is hardly true, as discovered above by histogramming
the individual dimensions (see previous subsection). But assuming a uni-modal distribution is simply done
for computational convenience. There are situations however, where we wish to parameterize distributions
with multiple modes; we then would use a GMM, to be introduced in the following subsection.
Figure 14: Density approximation. Effect of different kernel widths h (1.0, 0.5 and 0.25). The x marks
denote data points. [Source: Alpaydin 2010, Fig 8.3].
10.2.1 Gaussian Mixture Models (GMM)
When we use Gaussian mixture models, we assume that the distribution is multi-modal and we also specify
the number of modes we expect, very much like in a k-Means clustering algorithm (algorithm 5). The
GMM simply adds the output of k Gaussian functions, whose means and standard deviations correspond
to the location of the modes and the width of the assumed underlying distributions. To find the appropriate
mean and standard deviation value for each mode, one uses a so-called Expectation-Maximization (EM)
algorithm. The algorithm gradually approaches the optimal values by a search very akin to the k-Means
algorithm, hence the relation of density estimation to clustering.
We do not treat this in further detail here and merely point out that GMMs can be modeled in Matlab
with the command gmdistribution (available in statistics toolbox).
11 Naive Bayes Classifier
The Naive Bayes classifier is a method, which models the data more explicitly than the other classifiers (kNN
and LDA). In fact, in the kNN classifier, no modeling takes place at all; in the LDA the data are analyzed
for their mean and standard deviation, but the discrimination is based on a hyperplane only. In the Naive
Bayes classifier, one goes a step further and even makes the decision based on density estimation (as
introduced in the previous section), and this classifier is thus theoretically the most elegant model, as
everything is based on parameterization. Practically, the classifier has only limited success, as too much
elegance sometimes lacks the robustness to deal with messy data.
The Naive Bayes classifier performs density estimation assuming uni-modal Gaussian distributions as
introduced in subsection 10.2, namely by taking the mean and the standard deviations of the individual
feature dimensions for each class (group).
Figurative Example. In our country-guessing example, we would approximate the distribution of cars for each
country by a separate density function and then determine our location by using the density functions only. For
a given (spatial) location we compute the values for the different countries (from their individual functions), and
the one that returns the highest value determines our choice of country.
In figure 1, this Gaussian would be elliptically shaped for both classes with an approximate shape value of 1
(assuming equal scales on each axis). We then run a classification that is based on these two Gaussian
functions. To determine the category for a given sample (vector), we compute the values of the Gaussian
functions for each class i (with its class-specific parameters $\mu_i$ and $\Sigma_i$); the maximum function value then
determines the selected category.
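In 2D the Gaussian function becomes:
$$g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right) \qquad (7)$$
where $\rho$ is the correlation between X and Y and where $\sigma_x > 0$ and $\sigma_y > 0$. In the 2D case, we can express
this more compactly using
$$\mu = \begin{pmatrix}\mu_x \\ \mu_y\end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix}\sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2\end{pmatrix}$$
and then formulate as follows, which is also the formula for a multivariate Gaussian (2 or more dimensions):
$$g(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^t\Sigma^{-1}(x-\mu)\right] \sim N(\mu, \Sigma) \qquad (8)$$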
where
$\mu$ is the mean vector, $\mu = E[[x_1, x_2, .., x_d]^t] = [\mu_1, \mu_2, .., \mu_d]^t$
$\Sigma$ is the $d \times d$ covariance matrix, $\Sigma = E[(x-\mu)(x-\mu)^t]$
$|\Sigma|$ is the determinant of the covariance matrix
$\Sigma^{-1}$ is its inverse
$(x-\mu)^t\Sigma^{-1}(x-\mu)$ is also called the (squared) Mahalanobis distance
Thus we build our classifier as follows:
Algorithm 9 Naive Bayes Classifier. k = 1, .., c ($n_{classes}$, K)
Training   For each of the c classes ($\in D_L$):
           mean $\mu_k$, covariance $\Sigma_k$, determinant $|\Sigma_k|$, inverse $\Sigma_k^{-1}$, prior P(k)
           → $g_k$ as in equation 8
Testing    1) for a testing sample $x \in D_T$ determine g(x) for all c classes → $g_k$.
           2) multiply each $g_k$ with the class prior P(k): $f_k = g_k P(k)$
Decision   choose maximum of $f_k$: $\arg\max_k f_k$
If the classes occur with uneven frequencies, we need to determine the frequency for each class, also called
prior, and include this as pointed out in the training step and in step no. 2 of testing.
This type of classifier is also called Naive Bayes classifier, because it assumes that the feature value
distributions can be approximated by a Gaussian function, which is generally a huge oversimplification (or
naive): for most data the feature distribution is non-Gaussian. However, the classifier also bears potential
complications because finding the appropriate density functions can be difficult for complex data or if there
are only few training samples (small sample size problem). The latter may prevent the inverse of the
covariance matrix from being determined. Although there exist methods to estimate the inverse (e.g. command
pinv in Matlab), it may be easier to try the following two alternatives: one, use a dimensionality reduction,
e.g. the PCA (subsection 4.1); or two, try a different classifier.
11.1 Implementation
With the commands cov, det and inv (or pinv), one can conveniently build a Bayes classifier. Here are
some code fragments for orientation (see also ThKo p81):
% ----- build class information for TRAINING set:
AVG = zeros(nCat, nDim);
[COV COVInv] = deal(zeros(nCat, nDim, nDim));
CovDet = zeros(nCat,1);
for k = 1 : nCat
    TrnCat = TRN(Group==k, :); % [nCatSamp, nDim]
    AVG(k, :) = mean(TrnCat); % [nCat, nDim]
    CovCat = cov(TrnCat);
    COV(k, :, :) = CovCat; % [nCat, nDim, nDim]
    CovDet(k) = det(CovCat); % determinant
    COVInv(k, :, :) = pinv(CovCat); % pseudo-inverse
end
% ----- testing a (single) sample with index ix (from TESTING set):
Prob = zeros(nCat, 1); % initialize probabilities
for k = 1 : nCat
    DF = AVG(k,:) - TST(ix,:); % diff between avg and sample (row vector)
    detCat = abs(CovDet(k)); % retrieve class determinant
    CovInv = squeeze(COVInv(k, :, :)); % retrieve class inverse
    fct = 1 / ( ( (2*pi)^(nDim/2) )*sqrt(detCat) +eps);
    etm = (DF * CovInv * DF')/2; % half the squared Mahalanobis distance
    Prob(k) = fct * exp(-etm); % probability for this class
end
[mxc ixc] = max(Prob); % final decision (class winner)
Prior: We did not include the prior in this code fragment. Given an index array IxCat (a column vector) with
values corresponding to the class assignment (1, .., k; k = number of classes), we can generate a histogram as follows:
Nocc = accumarray(IxCat, 1, [nCat, 1]); % or use histc
where nCat is the number of classes (= k); then turn it into a prior (frequency) by dividing by sum(Nocc(:)):
Prior = Nocc./sum(Nocc(:));
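How the prior enters the decision (step 2 of the testing phase in algorithm 9) can be illustrated with a few made-up numbers. In this sketch the class index array IxCat and the density values in Prob are assumptions; in a real application Prob would come from the testing loop above.

```matlab
IxCat = [1 1 1 2 2 3 3 3 3 3]';            % class index per training sample (assumed values)
nCat  = 3;
Nocc  = accumarray(IxCat, 1, [nCat, 1]);   % number of occurrences per class
Prior = Nocc ./ sum(Nocc(:));              % class frequencies, sum to 1

Prob  = [0.10; 0.30; 0.15];                % density values g_k for one test sample (assumed)
[mxNo, ixNo] = max(Prob);                  % decision without prior
[mxc,  ixc ] = max(Prob .* Prior);         % decision with prior: argmax_k g_k P(k)
```

With these numbers the decision changes from class 2 (highest density) to class 3 (most frequent class), which shows why uneven class frequencies should not be ignored.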
11.2 Recapitulation
Advantages Decent results with a simple, compact model. Results likely better than with kNN.
Disadvantages It can be difficult to determine the inverse of the covariance matrix. This can happen when
certain dimensions have values that are (close-to) zero for some classes; or when the number of training
samples is small (small sample size problem).
12 Support Vector Machines ThKo p119
Support Vector Machines (SVM) are sometimes assigned to the class of linear classifiers (e.g. together with
Linear Discriminant Analysis). They typically perform better than other linear classifiers but also require more
tuning. They are designed as binary (two-category) classifiers, but for multiple categories one can simply
create c binary tasks (one versus all other) and then combine their outputs (subsection 8.5). The learning
duration of SVMs is typically long and they may only work if the classes are reasonably well separable. The
following characteristics make SVMs distinct from ordinary linear classifiers:
1. Kernel function: The SVM uses such functions to project the data into a higher-dimensional space in
which the data are hopefully better separable than in their original lower-dimensional space. Kernel
functions can be Radial-Basis functions, quadratic,...
2. Support Vectors: The SVM uses only a few sample vectors for generating the decision boundaries
and those are called support vectors. For a regular linear classifier, there exist multiple reasonable
decision boundaries that separate the classes of the training set. For instance, the optimal hyperplane
in figure 15 could actually show slightly different orientations. The SVM finds the one that also
gives a good generalization performance, whereby the support vectors are exploited for what is called
maximizing the margin (the two bidirectional arrows delineate the margin).
Figure 15: Training a support vector machine consists of finding the optimal hyperplane, that is, the one
with the maximum distance from the nearest training patterns. The support vectors are those (nearest)
patterns, a distance b from the hyperplane. The three support vectors are shown as solid dots.
[Source: Duda, Hart, Stork 2001, Fig 5.19]
SVMs are too complex to code quickly. We simply apply them in Matlab using the Bioinformatics
toolbox. There are two separate commands for training and testing, svmtrain and svmclassify:
Svm = svmtrain(TRN, Grp); % returns a structure...
GrpTst = svmclassify(Svm, TST); % ...which is fed together with the testing data
12.1 Recapitulation
Advantages better classification accuracy for binary tasks.
Disadvantages relatively long learning duration; requires more tuning, that is, adjustment until it works; may
not work well if the classes are not reasonably separable.
Recommendation: Use SVMs to obtain the best results, e.g. when maximizing the classification performance for a
data set.
13 Rounding the Picture DHS p84
13.1 Bayesian Formulation
A typical textbook on pattern classification (with mathematical ambition) starts by introducing the
Bayesian formalism and its application to the decision and classification problem. The Bayesian formulation
bears notational, analytical and theoretical elegance, but is of limited applicability as many real-world data
are large in dimensionality. Often, we are occupied with obtaining any reasonable results in the first place.
We now introduce this formalism, so that we better understand the language used in some textbooks. The
Naive Bayes classifier of section 11 is a simple version of this formalism. The Bayes formalism expresses a
decision problem in a probabilistic framework:
Bayes rule:
\[
P(\omega_j|x) = \frac{p(x|\omega_j)\,P(\omega_j)}{p(x)}
\qquad\qquad
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}
\tag{9}
\]
In natural language, see right side of equation: DHS p22,23; Alp p50
- Posterior: is the probability for the presence of a specific category $\omega_j$ in the sample x.
- Likelihood: is the computed value using the density function. In the example of the Naive Bayes classifier
(section 11), it is the value of equation 8.
- Prior: is the probability for the category being present in general, that is, it is the frequency of its
occurrence. We called this prior already (see algorithm 9 and subsection 11.1).
- Evidence: is the marginal probability that an observation x is seen (regardless of whether it is a positive
or negative example) and ensures normalization. (This was not explicitly calculated.)
More formally: Given a sample, x, the probability $P(\omega_j|x)$, that it belongs to class $\omega_j$, is the fraction of
the class-conditional probability density function, $p(x|\omega_j)$, multiplied by the probability with which the class
appears, $P(\omega_j)$, divided by the evidence $p(x)$. We can formalize the evidence as follows:
\[
p(x) = \sum_{j=1}^{c} p(x|\omega_j)\,P(\omega_j),
\qquad \text{which ensures} \qquad
\sum_{j=1}^{c} P(\omega_j|x) = 1
\tag{10}
\]
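A small numerical sketch may help to see how the four terms interact. It assumes two classes with one-dimensional Gaussian class-conditional densities; the means, standard deviations, priors and the sample value are made up for illustration only.

```matlab
% two classes with 1D Gaussian class-conditional densities (values chosen for illustration)
Mu    = [0 2];            % class means
Sd    = [1 1];            % class standard deviations
Prior = [0.7 0.3];        % priors P(w_j), they sum to 1
x     = 1.5;              % the sample to classify

% likelihood p(x|w_j): Gaussian density evaluated at x
Lik   = exp(-0.5*((x-Mu)./Sd).^2) ./ (Sd*sqrt(2*pi));
Evid  = sum(Lik .* Prior);        % evidence p(x), equation 10
Post  = Lik .* Prior / Evid;      % posterior P(w_j|x), equation 9
[mx, cls] = max(Post);            % decision: class with the largest posterior
fprintf('posteriors: %1.4f %1.4f -> class %d\n', Post, cls);
```

Although class 1 has the larger prior, the sample lies closer to the mean of class 2, and the likelihood outweighs the prior in this case. Dividing by the evidence only rescales the two products so that the posteriors sum to one; it does not change the decision.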
13.1.1 Rephrasing Classier Methods
Given the above Bayesian formulation, we can now rephrase the working principle of the three classifier
types (sections 2, 3, 11) as follows:
k-Nearest-Neighbor (section 2): estimates the posterior values $P(\omega_j|x)$ directly, without attempting to
compute any density functions (likelihoods); in short, it is a non-parametric method, because no effort
is made to find functions that approximate the density $p(x|\omega_j)$.
kNN is a type of instance-based learning, or lazy learning, where the function is only approximated
locally and all computation is deferred until classification.
Naive Bayes Classifier (section 11): is essentially the simplest version of the Bayesian formulation.
The classifier makes the following two assumptions in particular:
1. It assumes that the features are statistically independent of each other within each class. This is
also called the Naive Bayes rule. But often we do not know beforehand whether the dimensions are
uncorrelated.
2. It assumes that the features are Gaussian distributed ($\mu = E[x]$, $\Sigma = E[(x-\mu)(x-\mu)^t]$).
For most data, these are two strong assumptions because most data distributions are more complex.
Despite those two strong assumptions, the Naive Bayes classifier often returns acceptable performance.
Linear Discriminant Function (section 3): is similar to the kNN approach in the sense that it
does not require knowledge of the form of the underlying probability distributions. (Some researchers
argue that attempting to find the density function is a more complex problem than trying to directly
develop discriminant functions.)
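The statement that kNN estimates the posterior directly can be made concrete with a minimal sketch: the posterior of a class is taken as the fraction of the k nearest neighbors carrying that label. The one-dimensional toy data and the choice k=3 are assumptions for illustration.

```matlab
% toy training set: two 1D classes (values chosen for illustration)
TRN = [0.1; 0.4; 0.5; 0.9; 1.6; 1.8; 2.1; 2.5];   % training samples
Grp = [1;   1;   1;   1;   2;   2;   2;   2];     % class labels
x   = 1.2;                                         % testing sample
k   = 3;                                           % number of neighbors
nCls = 2;                                          % number of classes

Dis = abs(TRN - x);               % distances to all training samples
[v, O] = sort(Dis);               % ascending order
GrpNN = Grp(O(1:k));              % labels of the k nearest neighbors

% posterior estimate: fraction of the k neighbors that belong to each class
Post = zeros(1, nCls);
for j = 1:nCls
    Post(j) = sum(GrpNN == j) / k;
end
[mx, cls] = max(Post);            % decision by majority vote
```

No density function is fitted at any point; the training samples themselves serve as the model.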
13.2 Parametric (Generative) vs. Non-Parametric (Discriminative)
Along with the Bayesian framework comes also the distinction between parametric and non-parametric
methods (as already implied above and made in section 10). The parametric methods pursue the
approximation of density distributions $p(x|\omega_j)$ by functions with a few essential parameters. Non-parametric
methods in contrast find approximations without any explicit models (and hence parameters), such as the
kNN and the Parzen window. Chapters in textbooks are often organized according to this distinction. Here
we summarize the typical characterization of methods:
Parametric:       Multi-Variate Methods, (MLE, EM)
Semi-parametric:  Clustering, k-means, (EM)
Non-parametric:   Parzen, kNN, LDA, SVM, Decision Trees
Note 1: I found the semi-parametric category in Alpaydin's textbook.
Note 2: EM: expectation-maximization algorithm (subsection 10.2.1); MLE: maximum-likelihood estimation
algorithm. Both are density estimation methods. To be introduced in course II.
Note 3: the EM algorithm can obviously be classified differently, depending on the exact viewpoint.
Note 4: Bishop uses the terms Generative vs. Discriminative.
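The difference between the two families can be sketched for a single class in one dimension: the parametric estimate fits a Gaussian (two parameters), the non-parametric estimate places a Parzen window on every sample. The sample values and the window width h are assumptions for illustration.

```matlab
% samples of one class (values chosen for illustration)
X  = [1.0 1.2 1.9 2.1 2.4 3.0 3.3];
xq = 2.0;                                   % point at which the density is evaluated

% --- parametric: fit a Gaussian, i.e. estimate only two parameters
mu = mean(X);
sd = std(X);
pPar = exp(-0.5*((xq-mu)/sd)^2) / (sd*sqrt(2*pi));

% --- non-parametric: Parzen window with a Gaussian kernel of width h (h is a free choice)
h = 0.5;
Krn  = exp(-0.5*((xq-X)/h).^2) / (h*sqrt(2*pi));   % one kernel per training sample
pPrz = mean(Krn);                                   % average of all kernels

fprintf('parametric %1.4f, Parzen %1.4f\n', pPar, pPrz);
```

The parametric estimate only needs mu and sd once they are computed, whereas the Parzen estimate must keep all samples, similar to the kNN classifier.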
13.3 Other (Supervised) Statistical Classiers
- Perceptron: is essentially a linear classifier with a different learning method (course II).
- Neural Networks (NN): are elaborations of the perceptron. The simplest versions are 3-layer networks,
which can be regarded as consisting of 2 layers of perceptrons (course II).
- Hidden Markov Models (HMM): are especially suited for classifying dynamic patterns (course II).
13.4 Algorithm-Independent Issues
Curse of Dimensionality Intuitively, one would think that the more dimensions (attributes) we have at our
disposal (through measurements), the easier it is to separate the classes (with any classifier). However,
one often finds that with an increasing number of dimensions, it is more challenging to find the appropriate
separability, which is also referred to as the curse of dimensionality. On the one hand, if there are irrelevant
and possibly obstructive dimensions, it may indeed be better to reduce the dimensionality (as introduced
with the PCA for instance). On the other hand, the clever use of kernel functions, as in Support Vector
Machines, shows that more parameters can also be useful.
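One commonly cited aspect of the curse of dimensionality is that distances lose contrast: for random points, the nearest and the farthest neighbor become almost equally distant as the number of dimensions grows. The following sketch illustrates this with uniformly distributed points; the number of points and the list of dimensionalities are arbitrary choices.

```matlab
nP   = 500;                       % number of random points
Dims = [2 10 100 1000];           % dimensionalities to compare
Contrast = zeros(size(Dims));
for i = 1:numel(Dims)
    P   = rand(nP, Dims(i));      % uniform points in the unit hypercube
    q   = rand(1, Dims(i));       % a query point
    Df  = P - repmat(q, nP, 1);   % difference vectors
    Dis = sqrt(sum(Df.^2, 2));    % Euclidean distances to the query
    Contrast(i) = (max(Dis) - min(Dis)) / min(Dis);   % relative contrast
    fprintf('d=%4d: relative contrast %1.3f\n', Dims(i), Contrast(i));
end
```

A shrinking contrast means that a distance-based classifier such as kNN has less and less information to distinguish near from far samples, which is one reason why reducing the dimensionality can help.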
No Free Lunch theorem DHS p454 The theorem essentially states that no classifier technique is superior to
any other one. Virtually any powerful algorithm, whether it be kNN, artificial NN, unpruned decision trees,
etc. can solve a problem decently if sufficient parameters are created for the problem at hand.
The machine learning community tended to regard the most recently developed classifier methodology
as a breakthrough in the quest for a (supposed) superior classification method. However, after decades of
research, it has become clear (to most researchers) that no classifier model is absolutely better than any
other one: each classifier has its advantages and disadvantages and their underlying, individual theoretical
motivations are all justified in principle. In order to find the best performing classifier for a given problem, a
practitioner essentially has to test them all.
A Appendix - LDA Beginner's Example
The example should work by copy/paste and does not require PCA.
clear;
S1 = [2 1.5; 1.5 3]; % covariance for multi-variate normal distribution
PC1 = mvnrnd([0.3 0.5], S1, 50); % training class 1
PTEST = mvnrnd([0.3 0.5], S1, 30); % testing (class 1)
PC2 = mvnrnd([3.2 0.5], S1, 50); % training class 2
PTREN = [PC1; PC2]; % training set
Grp = [ones(size(PC1,1),1); ones(size(PC2,1),1)*2]; % group labels (1 or 2)
Lb = classify(PTEST, PTREN, Grp); % predicted labels for the testing samples
H = histc(Lb,[1 2]); % number of predictions per class
pcCorrect = H(1)/size(PTEST,1); % all testing samples are of class 1
fprintf('pc correct %1.4f\n', pcCorrect);
%% -------- Plotting
figure(2); clf; hold on;
scatter(PC1(:,1), PC1(:,2), 'sb', 'markerfacecolor', 'b');
scatter(PC2(:,1), PC2(:,2), 'r^', 'markerfacecolor', 'r');
scatter(PTEST(:,1), PTEST(:,2), 'go', 'markerfacecolor', 'g');
B Appendix - 2D Toy Data Sets
Create a set of random points from a uniform distribution. Use randn for normal distribution.
PtsRnd = rand(500,2); % 500 random (2D) points
Two densities, both elliptical, but one with gradient and rotated by 45 deg:
nP = 3000; % number of points
% --- ellipse with gradient
Mu1 = [5 6];
Si1 = [3 0.2]; % variances per dimension (diagonal of the covariance matrix)
P1 = mvnrnd(Mu1, Si1, nP);
P1 = f_RotCo(P1,pi/4); % rotation by 45 deg (f_RotCo is a helper that is not listed in this workbook)
[v O] = sort(P1(:,1));
Ixr = logspace(0, log10(nP), 1000); % gradient
Ixu = unique(round(Ixr));
P1 = P1(O(Ixu),:);
% --- ellipse
Mu2 = [7 5];
Si2 = [.1 1];
P2 = mvnrnd(Mu2, Si2, nP/3);
P = [P1; P2]; % the set of points
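The routine f_RotCo used above is not listed in the workbook. A possible stand-in is sketched here, under the assumption that it rotates the points counter-clockwise about the origin, with one point per row; the original routine may use a different center or direction.

```matlab
% hypothetical stand-in for f_RotCo: rotates the rows of P (n x 2)
% counter-clockwise about the origin by the angle ang (in radians)
f_RotCo = @(P, ang) P * [cos(ang) sin(ang); -sin(ang) cos(ang)];

P  = [1 0; 0 1; 2 2];             % three test points as [x y] rows
Pr = f_RotCo(P, pi/2);            % rotation by 90 deg
```

Because the points are stored as rows, the rotation matrix is applied from the right in its transposed form.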
An arc above a square grid:
degirad = pi/180;
wd = 45*degirad;
nap = 10;
yyarc = cos(linspace(-wd,wd,nap))*(0.5)+0.4;
xxarc = linspace(.15,.85,nap);
nsp = 5;
yysqu = repmat(linspace(0.1,0.3,nsp),nsp,1); yysqu = yysqu(:);
xxsqu = repmat(linspace(0.3,0.7,nsp),1,nsp)'; % column vector of x-coordinates
PtsPat = [xxarc' yyarc']; % arc points as [x y] rows
PtsPat = [PtsPat; [xxsqu yysqu]]; % append the grid points
C Appendix - Varia
C.1 Metrics
Minkowski:
\[
L_k(a, b) = \left( \sum_{i=1}^{d} |a_i - b_i|^k \right)^{1/k}
\]
...also referred to as the $L_k$ norm:
- $L_1$: Manhattan or city-block norm.
- $L_2$: Euclidean distance norm.
DHS p187
Mahalanobis (distance): Given are a multivariate vector x, a mean vector $\mu$ (e.g. obtained from averaging
over a class) and a covariance matrix S:
\[
D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}
\]
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean
distance.
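Both metrics take only a few lines in Matlab. The sketch below uses made-up vectors; the helper name f_Minkowski is hypothetical.

```matlab
a = [1 2 3];  b = [4 6 3];                 % two sample vectors (row vectors)

% Minkowski distance for a given exponent k (anonymous helper, name is hypothetical)
f_Minkowski = @(a, b, k) sum(abs(a-b).^k)^(1/k);
dL1 = f_Minkowski(a, b, 1);                % Manhattan: 3+4+0 = 7
dL2 = f_Minkowski(a, b, 2);                % Euclidean: sqrt(9+16+0) = 5

% Mahalanobis distance of x to a class with mean mu and covariance S
x  = [2 1];
mu = [0 0];
S  = [4 0; 0 1];                           % diagonal covariance (values for illustration)
df = (x - mu)';                            % column vector, as in the formula
dM = sqrt(df' * (S \ df));                 % S\df avoids forming inv(S) explicitly
dE = sqrt(df' * (eye(2) \ df));            % identity covariance: Euclidean distance
```

With the diagonal S of this example, each squared difference is divided by the variance of its dimension, which corresponds to the normalized Euclidean distance mentioned above.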
C.2 Whitening Transform
Input: DAT, an n × d matrix; output: DWit, the whitened data.
CovMx = cov(DAT); % covariance -> [nDim,nDim] matrix
[EPhi ELam] = eig(CovMx); % eigenvectors & -values [nDim,nDim]
Ddco = DAT * EPhi; % DECORRELATION
LamS = ELam.^(-0.5); % element-wise, off-diagonal zeros turn into Inf
LamS = diag(diag(LamS)); % ensure it's a diagonal matrix
DWit = Ddco * LamS; % EQUAL VARIANCE
% verify
COVdco = cov(Ddco); % covariance of decorrelated data (should be a diagonal matrix)
Df = diag(ELam)-diag(COVdco); % difference of diagonal elements
if sum(abs(Df))>0.1, error('odd: differences of diagonal elements very large!?'); end
See also http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf
C.3 Programming Hints
Speed To write fast-running code in Matlab, one should exploit Matlab's matrix-manipulating commands
in order to avoid the costly for loops (see for instance repmat or accumarray). Writing a kNN classifier can
be conveniently done using the repmat command. However, when dealing with high dimensionality and
a large number of samples, exploiting this command can in fact slow down computation because the machine
will spend a significant amount of time allocating the required memory for the large matrices. In that case,
it may in fact be faster to maintain one for loop, and to use repmat only sparingly.
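The two strategies can be compared on the distance computation of a kNN classifier. The sketch below computes the same distance matrix once with a single for loop and once without any loop; the random data and their sizes are placeholders, and which variant runs faster depends on the machine and on the size of the data.

```matlab
TRN = rand(200, 5);                 % training samples [nTrn, nDim]
TST = rand(50, 5);                  % testing samples  [nTst, nDim]
nTrn = size(TRN,1);  nTst = size(TST,1);

% --- variant 1: one for loop over the testing samples, repmat inside
DisA = zeros(nTst, nTrn);
for i = 1:nTst
    Df = TRN - repmat(TST(i,:), nTrn, 1);     % differences to all training samples
    DisA(i,:) = sqrt(sum(Df.^2, 2))';
end

% --- variant 2: no loop, but large temporary matrices of size [nTst, nTrn]
SqTst = repmat(sum(TST.^2,2), 1, nTrn);       % squared norms of the testing samples
SqTrn = repmat(sum(TRN.^2,2)', nTst, 1);      % squared norms of the training samples
DisB  = sqrt(max(SqTst + SqTrn - 2*TST*TRN', 0));  % max(...,0) guards against round-off

[v, IxNN] = min(DisA, [], 2);       % index of the nearest training sample per testing sample
```

Wrapping each variant in tic and toc for increasing nTrn and nDim shows at which size the memory allocation of variant 2 starts to dominate.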
Vector Multiplication In mathematical notation a vector is assumed to be a column vector. In Matlab however,
if you define a vector as a=[1 2 3], it is a row vector, exactly as you write it. To conform with mathematical
notation, either transpose the vector immediately by using the transpose sign (e.g., a=[1 2 3]') or define it
using semi-colons (e.g., a=[1; 2; 3];); otherwise you are forced to change the place of the transpose sign later
when applying the dot product (a*b' instead of a'*b), in which case it appears reversed with respect to the
mathematical notation! Or simply use the command dot, for which the column/row orientation is irrelevant.
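The following lines illustrate the three cases with small made-up vectors:

```matlab
a = [1; 2; 3];          % column vectors, as in mathematical notation
b = [4; 5; 6];
s1 = a' * b;            % inner (dot) product: 1*4 + 2*5 + 3*6 = 32
M  = a * b';            % outer product: a 3x3 matrix

ar = [1 2 3];           % the same values as row vectors
br = [4 5 6];
s2 = ar * br';          % for row vectors the transpose sign moves to the second factor
s3 = dot(ar, b);        % dot accepts a mix of row and column vectors
```

Mixing up the two orders does not necessarily raise an error, because both a'*b and a*b' are valid products; the size of the result (scalar versus matrix) is the quickest check.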
C.4 Mathematical Notation
The mathematical notation in this workbook is admittedly a bit messy, because I took equations from differ-
ent textbooks. I did not make an effort to create a consistent notation, so that the reader can easily compare
the equations to the original text. In the majority of textbooks a vector is denoted with a lower-case letter in
bold face, e.g. $\mathbf{x}$; a matrix is denoted by an upper-case letter in bold face, e.g. $\mathbf{\Sigma}$. But there are deviations
from this norm.
C.5 Some Software Packages
- MatLab: Unfortunately expensive and mostly available either in academia or industry.
- Weka: http://en.wikipedia.org/wiki/Weka_(machine_learning)
- R: supposed to be a replacement for MatLab.
- Python: I have no experience with it.
C.6 Parallel Computing Toolbox in Matlab
Should you be the lucky owner of the parallel computing toolbox in Matlab, then you can even use it on your
home PC or laptop, as nowadays home PCs have multiple cores and that permits parallel computing in
principle. It is relatively simple to exploit the parallel computing features in for loops that are suitable for
parallel processing: simply open a pool of cores, carry out the loop using the parfor command and then
close the pool again.
matlabpool local 2; % opening two cores (workers)
parfor i = 1:1000
A(i) = SomeFunction(Dat, i); % the data are manipulated in some function by counter i
end
matlabpool close;
The parfor loop cannot be used if your computations in the loop depend on previous results, for example
in an iterative process where A(i) depends on A(i-1). It also only makes sense if the process that is
supposed to be repeated in parallel is computationally intensive, otherwise the assignment of the individual
steps to the corresponding cores (workers) may slow down the computation.
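The distinction between suitable and unsuitable loops is sketched below with two toy computations (both are placeholders, far too cheap to benefit from parallel execution). Without an open pool, the parfor loop simply runs like an ordinary loop.

```matlab
n = 8;

% independent iterations: each A(i) depends only on i, so parfor is allowed
A = zeros(1, n);
parfor i = 1:n
    A(i) = i^2;
end

% dependent iterations: B(i) needs B(i-1), so a plain for loop is required
B = zeros(1, n);
B(1) = 1;
for i = 2:n
    B(i) = B(i-1) + i;
end
```

In the first loop the iterations can be carried out in any order and on any worker; in the second loop the order matters, which is exactly what parfor cannot guarantee.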
C.7 Reading
See references for publication details.
(Alpaydin, 2010): An introductory book. Reviews some topics from a different perspective than Duda/Hart/Stork
for example. It can be regarded as complementary to this workbook, but also complementary to other
textbooks.
(Theodoridis and Koutroumbas, 2008): Contains the most practical tips of those books, that also intend
to provide the theoretical background. Treats clustering very thoroughly - in more depth than any other
textbook. Contains code examples.
(Witten et al., 2011): Probably the most practical machine learning book, but rather short on the motivation
of the individual classifier types. It accompanies the Weka machine learning suite (see link above).
(Duda et al., 2001): The professional book. A must-have if one intends to further deepen one's knowledge
about pattern classification. The book excels at relating the different classifier philosophies and emphasizes
the similarities between classifiers and neural networks. Due to its age (already 12 years for the 2nd
version), it lacks in-depth treatment of recent advances such as combining classifiers and graph methods
for instance.
(Bishop, 2007): Another professional book. Contains beautiful illustrations and some historic comments,
but aims rather at an advanced readership (upper-level undergraduate and graduate students).
(Jain et al., 2000): A review with some useful summaries. Should be available on the internet. Use Google
Scholar.
Wikipedia: Always good for looking up definitions, formulations and different viewpoints. But Wikipedia's
variety - originating from the contribution of different authors - is also its shortcoming: it is hard to
comprehend the topic as a whole from the individual articles (websites). Hence, textbooks are still irreplaceable.
References
Alpaydin, E. (2010). Introduction to Machine Learning. MIT Press, Cambridge, MA, 2nd edition.
Bishop, C. (2007). Pattern Recognition and Machine Learning. Springer, New York.
Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. John Wiley and Sons Inc, 2nd edition.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and
Prediction. Springer, New York.
Jain, A., Duin, R., and Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(1):4-37.
Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition. Academic Press, 4th edition.
Witten, I., Frank, E., and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan
Kaufmann, 3rd edition.
D Appendix - Example Questions
D.1 Questions
1. You are given a completely new data set of medium size (a few hundred samples in total, up to dimensionality
50; with class labels). What was suggested (in the course) on how you proceed with the analysis?
2. Advantages/disadvantages of kNN, Bayes, LDA, SVM,...(other methods)?
3. What can we learn from a learning curve? Why would one bother to train with smaller amounts of data and not
use the entire training set only?
4. What normalization schemes do you know?
5. You have only few data, but still want to model a classifier to obtain an idea about the classification performance.
Let's say you have 3 classes with 3, 5 and 7 samples resp. Which classifier is preferred?
6. You trained c binary (one-versus-all) classifiers and observe that for increasing training data, the performance
decreases. What could be the reason?
7. How does the kNN, Bayes, LDA (or other) classifier work?
8. What is characteristic for the SVM?
9. How is the performance of a binary classifier analyzed?
10. You have data whose features (dimensions, variables) come from different sources, e.g. audio and visual. Do
you train a single classifier for all features?
11. You have satisfactory results, let's say with the LDA-PCA combination. But now you want to optimize and improve
if necessary by another 1-2 percent. What could you try?
12. What does the PCA do? How do you apply it in Matlab?
13. You are given a set of patterns whose features are drawn from a (limited) set of elements. What classifier do you
recommend? Some of the patterns have different (vector) length - which classifier could you try now?
14. Your data contain components that have only zero values, or some values may be missing and expressed with
NaN. How do you proceed?
15. Does normalization improve performance?
16. You intend to datamine (explore) a huge set and are given no labels (class information). How do you begin?
17. Compare hierarchical clustering with k-means clustering.
18. What types of error estimation do you know?
D.2 Answers (as hints)
1. Start with kNN to obtain a performance reference, which can be a lower bound; if time permits, use Bayesian;
then use linear discriminant analysis combined with principal component analysis.
2. As in script.
3. a) Overfitting: there is the possibility that we obtain better performance for a smaller training set: the learning
curve should be increasing, but may also decrease for excessive training data. b) Verification: we gain certainty
that we've done everything correctly.
4. As in script.
5. kNN is the first choice, as it can essentially work with single samples only. You may also try a Naive Bayes
classifier. LDA is unlikely to return reliable results.
6. a) Overfitting (see learning curve). b) Class imbalance problem.
7. kNN: storage-based classifier. Each testing sample is compared to all other ...
Bayes: we use Gaussians to approximate the distributions...
LDA: a weight matrix W is generated that separates the classes...
8. a) Focuses on samples that were difficult to classify. b) Uses a kernel function to project data into a
higher-dimensional space.
9. With a 4-response table (hit, miss, ...). Ideally the system has parameters with which we can influence the
performance and so create an ROC curve (see script for details).
10. One can. But we can also try ensemble classifiers (such as bagging) - it sometimes gives better results.
11. a) SVM. b) Search for optimal number of princ. comp. combined with LDA. c) Feature selection. d) Ensemble
classifier.
12. Finds axes of variation (one for each dimension) and rotates the data such that they are aligned with those axes.
Applied in Matlab with the command princomp, which returns a d × d matrix from which we select components and then
transform the data.
13. Decision tree. If patterns of unequal length: string matching.
14. It can be ignored if we use for instance the PCA and the LDA of Matlab. However, we need to take care of it
when clustering for instance or when building our own classifier. See script for details.
15. Often, but not always, because normalization can also lead to a distortion of the samples' relations (distances).
16. Clustering. k-means. See script for details.
17. See script for details.
18. Hold-out estimation, cross-fold validation, ... see script.