
Chapter 1: Introduction

Notes on MLAPP

Wu Ziqing

School of Computer Science and Engineering


Nanyang Technological University

13/07/2018



Outline

1 What is Machine Learning


Definition

2 Types of Machine Learning


Supervised Machine Learning
Classification
Regression
Unsupervised Machine Learning
Clustering
Dimensionality Reduction
Graph Structure
Matrix Completion
Reinforcement Learning




3 Basic Concepts in Machine Learning


Parametric and Non-Parametric Models
KNN classifier
The Curse of Dimensionality
Parametric Models for Classification and Regression
Linear Regression
Logistic Regression
Overfitting
Model Selection
No Free Lunch Theorem



Definition of Machine Learning
What is Machine Learning

Definition of Machine Learning:

A set of methods that can automatically detect patterns in data, and then
use the uncovered patterns to predict future data, or to perform other
kinds of decision making under uncertainty.



Types of Machine Learning

Supervised/Predictive Machine Learning


Unsupervised/Descriptive Machine Learning
Reinforcement Learning



Supervised Machine Learning
Goal: to learn a mapping from inputs x to outputs y, given a labelled set
of input-output pairs:

D = {(x_i, y_i)}_{i=1}^N

D: also called the Training Set.

N: number of training samples.
x_i: usually a D-dimensional vector, but can also be a complex structured
object: an image, a piece of text, a time series, and so on.
Also called the Attributes, Features, or Covariates.
y_i: if it is categorical, the problem is called Classification or Pattern
Recognition.
If it is real-valued, the problem is called Regression.
If it has some natural ordering, the problem is called Ordinal
Regression.
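As a concrete illustration, here is a minimal sketch (the array shapes and values are my own toy example, not from the text) of how such a training set D is commonly represented:

```python
import numpy as np

# A toy training set D = {(x_i, y_i)} with N = 4 samples and D = 3 features.
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.3, 3.3, 6.0],
              [5.8, 2.7, 5.1]])   # inputs: one row per feature vector x_i
y = np.array([0, 0, 1, 1])        # categorical outputs -> a classification problem

N, D = X.shape                    # N = 4 training samples, D = 3 features
```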
Supervised Machine Learning
Classification

Types of classification according to the output y ∈ {1, ..., C}:

Binary Classification: C = 2, there are only two possible labels of y: 0 or 1.
Multiclass Classification: C > 2, there are more than two possible labels of y.
Single Output Classification: the labels of y are mutually exclusive,
thus each input x can be mapped to only one label of y.
Multiple Output (Multi-label) Classification: the labels of y are not
mutually exclusive, thus each input x can be mapped to several labels
(e.g., a person can be mapped to both tall and strong).
This is better viewed as predicting multiple related binary class labels
(e.g., a person is tall/not tall, strong/not strong).



Supervised Machine Learning
Classification

Function Approximation: we can formalise classification as follows:

We assume that y = f(x) for some unknown function f; our goal is to
estimate f given the training set D.
We can then predict ŷ = f̂(x) = argmax_c p(y = c | x, D), which means
finding the most probable class label given a novel input x and the
training set D. Making predictions on novel inputs is called
Generalisation.
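As a hedged illustration (the posterior values below are made up), the prediction ŷ = argmax_c p(y = c | x, D) is a single argmax over the per-class probabilities:

```python
import numpy as np

# Hypothetical posterior p(y = c | x, D) over C = 3 classes for one novel input x.
posterior = np.array([0.2, 0.7, 0.1])

y_hat = np.argmax(posterior)  # y_hat = f_hat(x) = argmax_c p(y = c | x, D)
print(y_hat)                  # -> 1, the most probable class label
```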



Supervised Machine Learning
Classification

Some real-world applications of classification:

Document classification (e.g., email spam filtering)


Image classification (e.g., handwriting recognition)
Object detection (e.g., face detection)



Supervised Machine Learning
Regression

Estimate f(x) to predict a continuous output ŷ given an input x and a
training set D.
f̂(x) can be a model such as linear regression, polynomial regression, and so on.



Unsupervised Machine Learning

Goal: Find interesting patterns given only the input data (also known as
knowledge discovery):

D = {x_i}_{i=1}^N

The task can be formalised as (unconditional) density estimation, i.e.,
the model should be of the form:

p(x_i | θ)



Unsupervised Machine Learning
Clustering

To cluster a set of data into groups (sketched below):

1 Find the number of clusters K, where K̂ = argmax_K p(K | D).
2 Infer the cluster assignment z_i of each data point x_i by
z_i = argmax_k p(z_i = k | x_i, D).
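One practical way to approximate these two steps (my own substitution; the text does not prescribe an algorithm) is to fit a Gaussian mixture model for several candidate K and keep the one with the lowest BIC, which acts as a proxy for argmax_K p(K | D):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Step 1: approximate argmax_K p(K | D) by minimising BIC over candidate K.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 6)}
best_k = min(models, key=lambda k: models[k].bic(X))

# Step 2: z_i = argmax_k p(z_i = k | x_i, D), the most probable cluster per point.
z = models[best_k].predict(X)
print(best_k)  # -> 2 for this toy data
```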



Unsupervised Machine Learning
Dimensionality Reduction

For high dimensional data, the variability may be driven by only a small
number of latent factors. We can perform dimensionality reduction by
projecting the high dimensional data onto a lower dimensional subspace that
captures the essence of the data.

Advantages:
better prediction accuracy
faster nearest neighbour searches
easier visualisation of high dimensional data

The most common approach to dimensionality reduction is called
Principal Component Analysis (PCA).
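A minimal PCA sketch via the SVD of the centred data matrix (a standard construction; the function name and toy data are mine):

```python
import numpy as np

def pca(X, n_components):
    """Project X (N x D) onto its top n_components principal directions."""
    Xc = X - X.mean(axis=0)                        # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                        # D x n_components projection
    return Xc @ W                                  # low-dimensional representation

# Toy data: 10-dimensional points whose variability comes from 2 latent factors.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                      # latent factors
X = Z @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
X_low = pca(X, 2)                                  # captures the essence in 2-d
```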



Unsupervised Machine Learning
Graph Structure

We want to learn the correlation between the variables in a data set,
expressed as a graph structure, by computing:

Ĝ = argmax_G p(G | D)

In the graph, each variable is represented by a node and each edge
represents a direct dependence between two variables.

This Sparse Graphical Model can help us to:

Gain knowledge by interpreting the graph structure.
Get better joint probability estimates, to model correlations and make
predictions.
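One common concrete estimator for such a sparse graph over Gaussian variables is the graphical lasso (my own choice of method; the text does not name one). Zeros in the estimated precision matrix correspond to absent edges:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] += X[:, 0]            # make variables 0 and 1 directly dependent

model = GraphicalLasso(alpha=0.1).fit(X)
# Nonzero off-diagonal entries of the precision (inverse covariance) matrix
# indicate direct dependence, i.e., an edge between the two variables.
print(np.round(model.precision_, 2))
```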



Unsupervised Machine Learning
Matrix Completion

Sometimes the design matrix has missing values, such as NaN (Not a
Number). Matrix Completion (Imputation) infers plausible values for the
missing entries. Matrix Completion can be applied to:
Image inpainting: to denoise or complete an image.
Collaborative filtering: to complete the missing entries of a rating matrix.
Market Basket Analysis: in a binary matrix, predict which cells will be
turned on given that a few have been turned on (i.e., predict which items
will be purchased together).
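A hedged sketch of one simple imputation scheme, iterative low-rank approximation (my illustrative choice, not an algorithm from the text):

```python
import numpy as np

def complete_matrix(X, rank=1, n_iters=50):
    """Fill the NaN entries of X using a rank-`rank` approximation, iteratively."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X), X)   # initialise with the global mean
    for _ in range(n_iters):
        U, S, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]
        filled[missing] = low_rank[missing]        # only overwrite the missing cells
    return filled

# Toy rank-1 "ratings" matrix with one missing entry.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, np.nan],
              [3.0, 6.0, 9.0]])
print(complete_matrix(X, rank=1))                  # the NaN is imputed close to 6
```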



Types of Machine Learning
Reinforcement Learning

Goal: Learn how to behave given occasional reward/punishment signals.



Parametric and Non-Parametric Models

Parametric Model: the model has a fixed number of parameters.

Advantage: faster to use
Disadvantage: makes stronger assumptions about the nature of the data
distribution

Non-Parametric Model: the number of parameters grows with the
amount of training data.

Advantage: more flexible
Disadvantage: computationally intractable for large datasets



Parametric and Non-Parametric Models
KNN classifier

The K Nearest Neighbour (KNN) classifier is an example of a
Non-Parametric classifier. It works as follows:
1 Look at the K nearest neighbours of a test point x.
2 Compute the fraction of each class among these neighbours:
p(y = c | x, D, K) = (1/K) \sum_{i ∈ N_K(x, D)} I(y_i = c)

where N_K(x, D) denotes the (indices of the) K nearest points to x in D,
and I(e) is the indicator function, which equals 1 iff e is true.
3 Assign to x the most probable class c accordingly.
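A minimal numpy sketch of this procedure (Euclidean distance; all names and the toy data are mine):

```python
import numpy as np

def knn_predict(x, X_train, y_train, K, n_classes):
    """Return argmax_c p(y = c | x, D, K) for a single test point x."""
    dists = np.linalg.norm(X_train - x, axis=1)    # distance to every training point
    neighbours = np.argsort(dists)[:K]             # indices N_K(x, D)
    # p(y = c | x, D, K) = (1/K) * sum over neighbours of I(y_i = c)
    probs = np.bincount(y_train[neighbours], minlength=n_classes) / K
    return np.argmax(probs)                        # assign the most probable class

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([4.5, 5.0]), X_train, y_train, K=3, n_classes=2))  # -> 1
```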



Parametric and Non-Parametric Models
The Curse of Dimensionality

Curse of Dimensionality: with high dimensional data, some algorithms
perform poorly.

For example, consider KNN in a D-dimensional space: for a hypercube¹ to
contain a fraction f of the total data as neighbours, the expected edge
length is e_D(f) = f^{1/D}. For a 10-dimensional dataset, to contain even
1% of the data, the cube needs to extend to 63% of the range along each
dimension (e_10(0.01) = 0.01^{1/10} ≈ 0.63).

¹ An n-dimensional shape whose edges are each aligned with one dimension,
mutually perpendicular, and of equal length.
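The edge-length formula is easy to check numerically (a quick sketch):

```python
# Expected edge length of a hypercube containing a fraction f of the data
# uniformly distributed in the D-dimensional unit cube.
def edge_length(f, D):
    return f ** (1 / D)

print(edge_length(0.01, 1))   # 0.01  -> a 1% neighbourhood is local in 1-d
print(edge_length(0.01, 10))  # ~0.63 -> in 10-d the "neighbourhood" must span
                              #          63% of each dimension: no longer local
```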
Parametric Models for Classification and Regression
Linear Regression

Linear regression asserts that the response is a linear function of the input:
y(x) = w^T x + ε = \sum_{j=1}^D w_j x_j + ε, where ε ∼ N(µ, σ²)

Or it can be written as:

p(y | x, θ) = N(y | µ(x), σ²(x)),
where µ(x) = w^T x, σ²(x) = σ², and θ = (w, σ²)

We can also replace x with a non-linear function of x, Φ(x), to obtain
Polynomial Regression:

p(y | x, θ) = N(y | w^T Φ(x), σ²(x))

Replacing the input with a function of the input is known as basis
function expansion.
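A minimal sketch of fitting w by maximum likelihood, which under Gaussian noise is ordinary least squares, with the basis expansion Φ(x) = [1, x, x²] (the degree and data are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, size=100)  # noisy quadratic

# Basis function expansion: Phi(x) = [1, x, x^2].
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Maximum-likelihood w under Gaussian noise = least-squares solution.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 2))   # approximately [1.0, -2.0, 0.5]
```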
Parametric Models for Classification and Regression
Logistic Regression

To adapt linear regression to classification, we perform the following two
steps:
1 Replace the Gaussian distribution with a Bernoulli distribution, which is
more suitable for a binary response:
p(y | x, w) = Ber(y | µ(x)), where µ(x) = p(y = 1 | x)
2 Pass the linear combination of the inputs through a sigmoid function, to
ensure 0 ≤ µ(x) ≤ 1:
µ(x) = sigm(w^T x), where sigm(η) = 1 / (1 + e^{-η})

If we set a threshold of 0.5:

ŷ(x) = 1 ⟺ p(y = 1 | x) > 0.5
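A minimal sketch of this prediction rule, assuming the weights w have already been fitted (the values are made up):

```python
import numpy as np

def sigm(eta):
    """Sigmoid: squashes w^T x into (0, 1) so it can serve as mu(x) = p(y=1|x)."""
    return 1.0 / (1.0 + np.exp(-eta))

w = np.array([1.5, -2.0])   # hypothetical fitted weights
x = np.array([2.0, 1.0])    # a novel input

mu = sigm(w @ x)            # p(y = 1 | x) = sigm(w^T x)
y_hat = int(mu > 0.5)       # y_hat(x) = 1  <=>  p(y = 1 | x) > 0.5
print(mu, y_hat)            # -> ~0.73, 1
```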



Parametric Models for Classification and Regression
Overfitting

If we try to model every minor variation in the input, we are more likely
to be modelling the noise than the true signal. This is called Overfitting.

Although overfitting may yield a perfect fit on the training set, it may
cause inaccurate predictions on novel data.



Parametric Models for Classification and Regression
Model Selection

To determine whether the model is so complex that it overfits, we can
compute the Misclassification Rate, which is the proportion of incorrect
predictions:

err(f, D) = (1/N) \sum_{i=1}^N I(f(x_i) ≠ y_i)

What we are really interested in is the Generalisation Error, the
expected value of the misclassification rate over future data. Since we do
not have future data, we can split the available data into a training set
and a validation set.
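A minimal sketch of both quantities, estimating the generalisation error by the misclassification rate on a held-out validation set (the classifier, data, and split fraction are placeholders of mine):

```python
import numpy as np

def misclassification_rate(f, X, y):
    """err(f, D) = (1/N) * sum of I(f(x_i) != y_i)."""
    return np.mean(f(X) != y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out the last 20 points as a validation set.
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

f = lambda X: (X[:, 0] > 0).astype(int)          # a hypothetical fitted classifier
print(misclassification_rate(f, X_val, y_val))   # estimate of generalisation error
```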



Parametric Models for Classification and Regression
Model Selection (Cont.)

If the validation set contains too few data points for a reliable estimate,
we can use the technique of Cross Validation (CV), sketched below:
1 Divide the data into K folds.
2 For each fold k ∈ {1, 2, ..., K}, train the model on the remaining folds
and test it on the k-th fold, in a round-robin fashion.
3 Average the test errors over the K folds to estimate the generalisation
error.
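A minimal from-scratch sketch of this procedure (the learner and data are placeholders of mine):

```python
import numpy as np

def cross_validation_error(fit, X, y, K=5):
    """Average held-out misclassification rate over K folds."""
    folds = np.array_split(np.arange(len(X)), K)
    errors = []
    for k in range(K):                            # round-robin over the folds
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        f = fit(X[train_idx], y[train_idx])       # train on the other K - 1 folds
        errors.append(np.mean(f(X[test_idx]) != y[test_idx]))
    return np.mean(errors)                        # estimate of generalisation error

# Placeholder "learner": ignores its training data, thresholds the first feature.
fit = lambda X_tr, y_tr: (lambda X: (X[:, 0] > 0).astype(int))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
print(cross_validation_error(fit, X, y, K=5))    # -> 0.0 for this toy problem
```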



Parametric Models for Classification and Regression
No Free Lunch Theorem

No Free Lunch Theorem: There is no universally best model.

“All models are wrong, but some are useful.” — George Box

