
B1a Machine Learning – Feature Selection

Athanasios Tsanas (‘Thanasis’)

Summary
In this lab you will experiment with some Feature Selection (FS) algorithms. You will work
independently on your PCs. The aim is to observe the effect and develop an understanding
of important and recurring concepts in machine learning such as the curse of dimensionality,
model complexity, relevance, redundancy, information content, and learner accuracy.

1. Introduction
We start by assuming that the original, raw data has been processed and some
(informative) features have been extracted. These features will be used to develop an
automatic algorithm to predict the response which is also provided. In practice, you would
have to extract these features depending on your raw data.
For example, imagine you were given some ECGs and you were asked to identify some
patterns (=features) in the data so that you could discriminate whether someone might
have a heart-related condition. Or you might have been given speech signals and the aim
would be to extract features to differentiate healthy controls from people with Parkinson’s
disease. The list is really endless.
We will just consider the setting where those features have already been extracted, i.e.
we have the data in a convenient format with the design matrix X, and the response vector
y. Remember the following schematic: we extract features, select or transform the original
feature space, and map the (new) feature space onto the response.

Feature extraction → Feature selection or feature transformation → Statistical mapping

2. Data
You will start with the Iris dataset: it is a 3-class classification problem, and it contains
150 samples with 4 features. This is a very simple dataset and will help you clearly
observe some critical aspects of machine learning in general. Once you have completed the
methodology described later in this document, you will be asked to experiment with
additional datasets. You can download and experiment with datasets from the UCI ML
repository.
My suggestion would be to try classification or regression problems that appear
interesting to you (but keep in mind that some FS algorithms are specifically restricted to
classification!). Ideally, the datasets of your choice should consist of multiple features. One
suggestion is the Parkinson’s voice rehabilitation (binary classification), and the
Cardiotocography dataset (10-class classification, with 2129 samples and 21 features) but
feel free to use another dataset if you prefer. For standardization of your findings, I would
strongly suggest starting with the Physionet2012 dataset (binary classification).

3. Methods
This section reviews some indicative FS approaches; the field is too extensive to be
covered here in considerable depth, but the following concepts will be a good starting point. You
will use the selected feature subset to feed a learner (classifier or regressor, depending on
your problem), and report the out-of-sample learner accuracy (e.g. use standard 10-fold
cross-validation with 100 iterations for statistical confidence). Matlab code is provided for
the feature mapping.
In all cases, you will need to determine the top K features and obtain the performance of
your model as a function of the number of top features you feed your learner.

A. Maximum relevance
The intuitively simplest approach is to include in our final selected feature subset the top
K features which are most predictive of the response (univariate association). This is
reasonable, but you will notice it has certain flaws and leads to sub-optimal outputs.
Determine the relevance score of each feature. Relevance refers to the univariate
association of each feature with the response. We need to quantify the strength of this

association between each feature and the response, which entails the use of a FS criterion.
It is up to you to experiment with different criteria (hint: try the Pearson and Spearman
correlation coefficients --- “corr”). You may want to explore additional criteria.
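As a concrete illustration, here is a minimal Python/NumPy sketch of maximum-relevance ranking (the lab itself uses Matlab's "corr"; the helper name max_relevance_rank is purely illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def max_relevance_rank(X, y):
    """Rank features by absolute Spearman correlation with the response."""
    scores = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores  # feature indices, best first

# Toy check: feature 0 is a noisy copy of y, feature 1 is pure noise
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200), rng.normal(size=200)])
order, scores = max_relevance_rank(X, y)
print(order[0])  # -> 0: the noisy copy of y ranks first
```

You would then feed the top K features from this ranking to your learner.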

B. Minimum redundancy
A related concept, but this time we plan to penalize features carrying largely overlapping
information. For example, you can reduce the feature space by ‘killing’ one of two features
that have a very large association with each other (determined using the chosen criterion
and an arbitrarily set threshold, for example Spearman correlation with threshold = 0.95).
Continue until the final set of features contains no large pairwise associations.
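A minimal Python sketch of this pruning step, assuming Spearman correlation with threshold 0.95 (the function name drop_redundant is illustrative; the lab itself uses Matlab):

```python
import numpy as np
from scipy.stats import spearmanr

def drop_redundant(X, threshold=0.95):
    """Keep a feature only if its |Spearman rho| with every already-kept
    feature stays below the threshold; returns the kept feature indices."""
    kept = []
    for j in range(X.shape[1]):
        if all(abs(spearmanr(X[:, j], X[:, k])[0]) < threshold for k in kept):
            kept.append(j)
    return kept

# Toy check: feature 2 is a near-duplicate of feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 1e-3 * rng.normal(size=100)
print(drop_redundant(X))  # -> [0, 1]: the near-duplicate is 'killed'
```

Note that this criterion ignores the response entirely, which is exactly the flaw you should observe when comparing against maximum relevance.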
What do you observe compared to the maximum relevance approach?

C. Maximum relevance & minimum redundancy (mRMR)


How about combining the above two approaches? This has led to the very successful
algorithm called mRMR, which is widely used in practice. It is often one of the first off-the-
shelf FS algorithms many researchers use.
What do you observe compared to maximum relevance and minimum redundancy?
Remember to observe the output of the metrics in the FS criterion you used! Matlab code is
provided for mRMR by the developer of this algorithm, and also for an mRMR variant called
mRMRcorr, which is a computationally simpler approach. The difference between the
standard mRMR and mRMRcorr lies in the criterion used to quantify relevance and redundancy.
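For intuition only, here is a greedy Python sketch in the spirit of mRMR. Be aware this is an assumption-laden simplification: the original mRMR quantifies relevance and redundancy with mutual information, whereas this sketch substitutes Spearman correlation (closer in spirit to the mRMRcorr variant) and is not the provided Matlab code.

```python
import numpy as np
from scipy.stats import spearmanr

def mrmr_corr_sketch(X, y, K):
    """Greedy mRMR-style selection: at each step pick the unselected feature
    maximising (relevance - mean redundancy to the already-selected set),
    both measured by absolute Spearman correlation."""
    n_feat = X.shape[1]
    relevance = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]  # start from the most relevant feature
    while len(selected) < K:
        best_j, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([abs(spearmanr(X[:, j], X[:, k])[0])
                                  for k in selected])
            if relevance[j] - redundancy > best_score:
                best_j, best_score = j, relevance[j] - redundancy
        selected.append(best_j)
    return selected

# Toy check: features 0 and 1 are near-duplicates; feature 2 carries
# complementary information about y
rng = np.random.default_rng(2)
s1, s2 = rng.normal(size=300), rng.normal(size=300)
y = s1 + s2
X = np.column_stack([s1, s1 + 0.01 * rng.normal(size=300), s2])
selected_demo = mrmr_corr_sketch(X, y, 2)
print(selected_demo)  # contains feature 2 and only one of the duplicates 0/1
```

Unlike plain maximum relevance, the greedy step refuses to pick both near-duplicates, which is the whole point of the redundancy penalty.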

D. RELIEF
This is a feature weighting algorithm. You will use Matlab’s function relieff.
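To see what relieff is doing under the hood, here is a bare-bones Python sketch of the classic Relief weighting scheme for a two-class problem. This is a simplification of Matlab's relieff (which averages over k nearest neighbours); features are assumed scaled to [0, 1].

```python
import numpy as np

def relief_weights(X, y):
    """Classic Relief for two classes: for each sample, reward features that
    differ on the nearest miss (other class) and penalise features that
    differ on the nearest hit (same class)."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)  # L1 distance to every sample
        dist[i] = np.inf                     # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))   # nearest same-class
        miss = np.argmin(np.where(same, np.inf, dist))  # nearest other-class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n
    return w

# Toy check: feature 0 separates the classes, feature 1 is noise
rng = np.random.default_rng(4)
y = np.repeat([0, 1], 50)
x0 = np.where(y == 0, rng.uniform(0.0, 0.4, 100), rng.uniform(0.6, 1.0, 100))
X = np.column_stack([x0, rng.uniform(0.0, 1.0, 100)])
w = relief_weights(X, y)
print(w)  # w[0] should be clearly larger than w[1]
```

Features with large positive weights are the ones that separate the classes locally.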

E. Gram-Schmidt orthogonalization (GSO)


A standard approach that borrows concepts from linear algebra and applies them to FS.
Matlab code is provided.
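The idea can be sketched as forward selection with deflation: pick the feature most aligned with the response, orthogonalise everything against it, and repeat. Here is a Python sketch under that interpretation (the provided Matlab code may differ in details; gso_select is an illustrative name):

```python
import numpy as np

def gso_select(X, y, K):
    """Gram-Schmidt forward selection: repeatedly pick the feature with the
    largest squared cosine to the residual response, then project (deflate)
    that feature's direction out of the remaining features and the response."""
    Xr = X - X.mean(axis=0)  # centre, so cosine matches correlation
    yr = y - y.mean()
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(K):
        cos2 = [(Xr[:, j] @ yr) ** 2 / ((Xr[:, j] @ Xr[:, j]) * (yr @ yr))
                for j in remaining]
        j = remaining.pop(int(np.argmax(cos2)))
        selected.append(j)
        q = Xr[:, j] / np.linalg.norm(Xr[:, j])  # unit vector of chosen feature
        yr = yr - (q @ yr) * q                   # deflate the response
        for k in remaining:                      # deflate the other features
            Xr[:, k] -= (q @ Xr[:, k]) * q
    return selected

# Toy check: y depends on features 0 and 2 only
rng = np.random.default_rng(3)
F = rng.normal(size=(300, 3))
y = F[:, 0] + 0.5 * F[:, 2]
print(gso_select(F, y, 2))  # -> [0, 2]
```

The deflation step is what distinguishes GSO from plain maximum relevance: once a feature is selected, anything correlated with it contributes nothing further.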

4. Results
Rank the features selected using each of the feature selection algorithms. Then, feed the
learner with 1, 2, 3, …, K features and record the out-of-sample performance (use 10-fold cross-
validation with 100 iterations). Use at least two different learners (to assess feature
exportability across learners).

Pool of data: X, y. This is the design matrix and the response provided
originally.
Split the X, y data into training and testing subsets. For example, use the
top 90% of the data for training and the remaining 10% for testing. Repeat
100 times, each time randomly permuting the original X, y data (see
“randperm”) prior to splitting into training and testing. Equivalently, you
can use Matlab’s command “crossvalind”.
Alternatively, use the samples belonging to K-1 individuals for training and
the samples of the Kth individual for testing. Repeat for all K individuals,
each time leaving the data from one of the individuals out for testing
and training with the rest of the data.
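The random-split protocol above can be sketched in Python as follows (NumPy's permutation plays the role of Matlab's randperm; the helper name repeated_splits is illustrative):

```python
import numpy as np

def repeated_splits(X, y, n_iter=100, train_frac=0.9, seed=0):
    """Yield (Xtrain, ytrain, Xtest, ytest) for n_iter random splits,
    permuting the data each time before splitting (cf. Matlab's randperm)."""
    rng = np.random.default_rng(seed)
    n_train = int(train_frac * len(y))
    for _ in range(n_iter):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
        yield X[tr], y[tr], X[te], y[te]

# Toy check: 20 samples, 5 iterations of a 90%/10% split
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20)
splits = list(repeated_splits(X, y, n_iter=5))
print(len(splits), splits[0][0].shape, splits[0][2].shape)  # -> 5 (18, 2) (2, 2)
```

Averaging performance over the iterations gives the statistical confidence the protocol asks for.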

Xtrain, ytrain: select features using the FS algorithms. Play fair! Feed the
same dataset to all FS algorithms. Decide on the final feature subset and
experiment with the classifier’s hyperparameters. DECIDE on the final feature
set and classifier hyper-parameters using the training data only.

Xtest, ytest: use the selected features in the Xtest data. No further
optimisation allowed!!!

You should be reporting something like the following (depending on the FS algorithms
and the learners you used).

FS algorithm 1    FS algorithm 2    FS algorithm 3    FS algorithm 4
Feature1,1        Feature2,1        Feature3,1        Feature4,1
Feature1,2        Feature2,2        Feature3,2        Feature4,2
…                 …                 …                 …
Feature1,K        Feature2,K        Feature3,K        Feature4,K
91%               87%               82%               97%
Table with selected features and out-of-sample learner performance.

[Figure: indicative results on the Hepatitis dataset. Two panels plot out-of-sample
misclassification (left panel: SVM; right panel: RF) against the number of features
(0 to 18) for the FS algorithms LASSO, mRMR, mRMRSpearman, GSO, RELIEF, LLBFS, and RRCT.]

5. Discussion
Which features did you select? Remember that in practice you will have to suggest why
particular features have been selected (e.g. is there some physiological understanding
behind this selection? This will depend on your application).
Which FS algorithm worked best for your problem? Can you tentatively suggest why?

CAREFUL!! Do NOT light-heartedly generalize the findings you have obtained from 2-3
datasets to other problems: the no free lunch theorem suggests that there is no single
universally best algorithm in machine learning for all applications! This applies to
feature selection as well as practically all other aspects of machine learning.

Please send your reports by 17.00 to:


(David): davidc@robots.ox.ac.uk, (Thanasis): tsanas@maths.ox.ac.uk
