
Chapter 1: Introduction

Notes on MLAPP

Wu Ziqing

School of Computer Science and Engineering


Nanyang Technological University

13/07/2018



Outline

1 What is Machine Learning


Definition

2 Types of Machine Learning


Supervised Machine Learning
Classification
Regression
Unsupervised Machine Learning
Clustering
Dimensionality Reduction
Graph Structure
Matrix Completion
Reinforcement Learning




3 Basic Concepts in Machine Learning


Parametric and Non-Parametric Models
KNN classifier
The Curse of Dimensionality
Parametric Models for Classification and Regression
Linear Regression
Logistic Regression
Overfitting
Model Selection
No Free Lunch Theorem



Definition of Machine Learning
What is Machine Learning

Definition of Machine Learning:

A set of methods that can automatically detect patterns in data, and then
use the uncovered patterns to predict future data, or to perform other
kinds of decision making under uncertainty.



Types of Machine Learning

Supervised/Predictive Machine Learning


Unsupervised/Descriptive Machine Learning
Reinforcement Learning



Supervised Machine Learning
Goal: to learn a mapping from inputs x to outputs y, given a labelled set
of input-output pairs:

D = {(x_i, y_i)}_{i=1}^N

D: also called the Training Set.

N: number of training samples.
x_i: usually a D-dimensional vector, but can also be a complex structured
object: an image, a piece of text, a time series, and so on.
Also called the Attributes, Features, or Covariates.
y_i: if it is categorical, the problem is called Classification or Pattern
Recognition.
If it is real-valued, the problem is called Regression.
If it has some natural ordering, the problem is called Ordinal
Regression.
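As a concrete illustration, here is a minimal sketch (the array shapes and values are my own toy example, not from the text) of how such a training set D is commonly represented:

```python
import numpy as np

# A toy training set D = {(x_i, y_i)} with N = 4 samples and D = 3 features.
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.3, 3.3, 6.0],
              [5.8, 2.7, 5.1]])   # inputs: one row per feature vector x_i
y = np.array([0, 0, 1, 1])        # categorical outputs -> a classification problem

N, D = X.shape                    # N = 4 training samples, D = 3 features
```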
Supervised Machine Learning
Classification

Types of classification according to the output y ∈ {1, ..., C}:

Binary Classification: C = 2, there are only two possible labels of y: 0 or 1.
Multiclass Classification: C > 2, there are more than two possible labels of y.
Single Output Classification: the labels of y are mutually exclusive,
thus each input x can be mapped to only one label of y.
Multiple Output (Multi-label) Classification: the labels of y are not
mutually exclusive, thus each input x can be mapped to several labels
(e.g., a person can be mapped to both tall and strong).
This is better viewed as predicting multiple related binary class labels
(e.g., a person is tall/not tall, strong/not strong).



Supervised Machine Learning
Classification

Function Approximation: we can formalise classification as follows:

We assume that y = f(x) for some unknown function f; our goal is to
estimate f given the training set D.
We can then predict ŷ = f̂(x) = argmax_c p(y = c | x, D), which means
finding the most probable class label given a novel input x and the
training set D. Making predictions on novel inputs is called
Generalisation.
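As a hedged illustration (the posterior values below are made up), the prediction ŷ = argmax_c p(y = c | x, D) is a single argmax over the per-class probabilities:

```python
import numpy as np

# Hypothetical posterior p(y = c | x, D) over C = 3 classes for one novel input x.
posterior = np.array([0.2, 0.7, 0.1])

y_hat = np.argmax(posterior)  # y_hat = f_hat(x) = argmax_c p(y = c | x, D)
print(y_hat)                  # -> 1, the most probable class label
```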



Supervised Machine Learning
Classification

Some real-world applications of classification:

Document classification (e.g., email spam filtering)


Image classification (e.g., handwriting recognition)
Object detection (e.g., face detection)



Supervised Machine Learning
Regression

Estimate f(x) to predict a continuous output ŷ given an input x and a
training set D.
f̂(x) can be a model such as linear regression, polynomial regression, and so on.



Unsupervised Machine Learning

Goal: Find interesting patterns given only the input data (also known as
knowledge discovery):

D = {x_i}_{i=1}^N

The task can be formalised as (unconditional) density estimation, i.e.,
the model should be of the form:

p(x_i | θ)



Unsupervised Machine Learning
Clustering

To cluster a set of data into groups (sketched below):

1 Find the number of clusters K, where K̂ = argmax_K p(K | D).
2 Infer the cluster assignment z_i of each data point x_i by
z_i = argmax_k p(z_i = k | x_i, D).
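One practical way to approximate these two steps (my own substitution; the text does not prescribe an algorithm) is to fit a Gaussian mixture model for several candidate K and keep the one with the lowest BIC, which acts as a proxy for argmax_K p(K | D):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Step 1: approximate argmax_K p(K | D) by minimising BIC over candidate K.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 6)}
best_k = min(models, key=lambda k: models[k].bic(X))

# Step 2: z_i = argmax_k p(z_i = k | x_i, D), the most probable cluster per point.
z = models[best_k].predict(X)
print(best_k)  # -> 2 for this toy data
```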



Unsupervised Machine Learning
Dimensionality Reduction

For high dimensional data, the variability may be driven by only a small
number of latent factors. We can perform dimensionality reduction by
projecting the high dimensional data onto a lower dimensional subspace that
captures the essence of the data.

Advantages:
better prediction accuracy
faster nearest neighbour searches
easier visualisation of high dimensional data

The most common approach to dimensionality reduction is called
Principal Component Analysis (PCA).
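A minimal PCA sketch via the SVD of the centred data matrix (a standard construction; the function name and toy data are mine):

```python
import numpy as np

def pca(X, n_components):
    """Project X (N x D) onto its top n_components principal directions."""
    Xc = X - X.mean(axis=0)                        # centre the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                        # D x n_components projection
    return Xc @ W                                  # low-dimensional representation

# Toy data: 10-dimensional points whose variability comes from 2 latent factors.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                      # latent factors
X = Z @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
X_low = pca(X, 2)                                  # captures the essence in 2-d
```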



Unsupervised Machine Learning
Graph Structure

We want to learn the correlation between the variables in a data set,
expressed as a graph structure, by computing:

Ĝ = argmax_G p(G | D)

In the graph, each variable is represented by a node and each edge
represents a direct dependence between two variables.

This Sparse Graphical Model can help us to:

Gain knowledge by interpreting the graph structure.
Get better joint probability estimates, to model correlations and make
predictions.
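One common concrete estimator for such a sparse graph over Gaussian variables is the graphical lasso (my own choice of method; the text does not name one). Zeros in the estimated precision matrix correspond to absent edges:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] += X[:, 0]            # make variables 0 and 1 directly dependent

model = GraphicalLasso(alpha=0.1).fit(X)
# Nonzero off-diagonal entries of the precision (inverse covariance) matrix
# indicate direct dependence, i.e., an edge between the two variables.
print(np.round(model.precision_, 2))
```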



Unsupervised Machine Learning
Matrix Completion

Sometimes the design matrix has missing values, such as NaN (Not a
Number). Matrix Completion (Imputation) infers plausible values for the
missing entries. Matrix Completion can be applied to:
Image inpainting: to denoise or complete an image.
Collaborative filtering: to complete the missing entries of a rating matrix.
Market Basket Analysis: in a binary matrix, predict which cells will be
turned on given that a few have been turned on (i.e., predict which items
will be purchased together).
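A hedged sketch of one simple imputation scheme, iterative low-rank approximation (my illustrative choice, not an algorithm from the text):

```python
import numpy as np

def complete_matrix(X, rank=1, n_iters=50):
    """Fill the NaN entries of X using a rank-`rank` approximation, iteratively."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X), X)   # initialise with the global mean
    for _ in range(n_iters):
        U, S, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]
        filled[missing] = low_rank[missing]        # only overwrite the missing cells
    return filled

# Toy rank-1 "ratings" matrix with one missing entry.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, np.nan],
              [3.0, 6.0, 9.0]])
print(complete_matrix(X, rank=1))                  # the NaN is imputed close to 6
```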



Types of Machine Learning
Reinforcement Learning

Goal: Learn how to behave given occasional reward/punishment signals.



Parametric and Non-Parametric Models

Parametric Model: the model has a fixed number of parameters.

Advantage: faster to use
Disadvantage: makes stronger assumptions about the nature of the data
distribution

Non-Parametric Model: the number of parameters grows with the
amount of training data.

Advantage: more flexible
Disadvantage: computationally intractable for large datasets



Parametric and Non-Parametric Models
KNN classifier

The K Nearest Neighbour (KNN) classifier is an example of a
Non-Parametric classifier. It works as follows:
1 Look at the K nearest neighbours of a test point x.
2 Compute the fraction of each class among these neighbours:
p(y = c | x, D, K) = (1/K) \sum_{i ∈ N_K(x, D)} I(y_i = c)

where N_K(x, D) denotes the (indices of the) K nearest points to x in D,
and I(e) is the indicator function, which equals 1 iff e is true.
3 Assign to x the most probable class c accordingly.
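A minimal numpy sketch of this procedure (Euclidean distance; all names and the toy data are mine):

```python
import numpy as np

def knn_predict(x, X_train, y_train, K, n_classes):
    """Return argmax_c p(y = c | x, D, K) for a single test point x."""
    dists = np.linalg.norm(X_train - x, axis=1)    # distance to every training point
    neighbours = np.argsort(dists)[:K]             # indices N_K(x, D)
    # p(y = c | x, D, K) = (1/K) * sum over neighbours of I(y_i = c)
    probs = np.bincount(y_train[neighbours], minlength=n_classes) / K
    return np.argmax(probs)                        # assign the most probable class

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([4.5, 5.0]), X_train, y_train, K=3, n_classes=2))  # -> 1
```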



Parametric and Non-Parametric Models
The Curse of Dimensionality

Curse of Dimensionality: with high dimensional data, some algorithms
perform poorly.

For example, consider KNN in a D-dimensional space: for a hypercube¹ to
contain a fraction f of the total data as neighbours, the expected edge
length is e_D(f) = f^{1/D}. For a 10-dimensional dataset, to contain even
1% of the data, the cube needs to extend to 63% of the range along each
dimension (e_10(0.01) = 0.01^{1/10} ≈ 0.63).

¹ An n-dimensional shape whose edges are each aligned with one dimension,
mutually perpendicular, and of equal length.
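The edge-length formula is easy to check numerically (a quick sketch):

```python
# Expected edge length of a hypercube containing a fraction f of the data
# uniformly distributed in the D-dimensional unit cube.
def edge_length(f, D):
    return f ** (1 / D)

print(edge_length(0.01, 1))   # 0.01  -> a 1% neighbourhood is local in 1-d
print(edge_length(0.01, 10))  # ~0.63 -> in 10-d the "neighbourhood" must span
                              #          63% of each dimension: no longer local
```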
Parametric Models for Classification and Regression
Linear Regression

Linear regression asserts that the response is a linear function of the input:
y(x) = w^T x + ε = \sum_{j=1}^D w_j x_j + ε, where ε ∼ N(µ, σ²)

Or it can be written as:

p(y | x, θ) = N(y | µ(x), σ²(x)),
where µ(x) = w^T x, σ²(x) = σ², and θ = (w, σ²)

We can also replace x with a non-linear function of x, Φ(x), to obtain
Polynomial Regression:

p(y | x, θ) = N(y | w^T Φ(x), σ²(x))

Replacing the input with a function of the input is known as basis
function expansion.
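A minimal sketch of fitting w by maximum likelihood, which under Gaussian noise is ordinary least squares, with the basis expansion Φ(x) = [1, x, x²] (the degree and data are my illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, size=100)  # noisy quadratic

# Basis function expansion: Phi(x) = [1, x, x^2].
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Maximum-likelihood w under Gaussian noise = least-squares solution.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 2))   # approximately [1.0, -2.0, 0.5]
```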
Parametric Models for Classification and Regression
Logistic Regression

To adapt linear regression to classification, we perform the following two
steps:
1 Replace the Gaussian distribution with a Bernoulli distribution, which is
more suitable for a binary response:
p(y | x, w) = Ber(y | µ(x)), where µ(x) = p(y = 1 | x)
2 Pass the linear combination of the inputs through a sigmoid function, to
ensure 0 ≤ µ(x) ≤ 1:
µ(x) = sigm(w^T x), where sigm(η) = 1 / (1 + e^{-η})

If we set a threshold of 0.5:

ŷ(x) = 1 ⟺ p(y = 1 | x) > 0.5
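A minimal sketch of this prediction rule, assuming the weights w have already been fitted (the values are made up):

```python
import numpy as np

def sigm(eta):
    """Sigmoid: squashes w^T x into (0, 1) so it can serve as mu(x) = p(y=1|x)."""
    return 1.0 / (1.0 + np.exp(-eta))

w = np.array([1.5, -2.0])   # hypothetical fitted weights
x = np.array([2.0, 1.0])    # a novel input

mu = sigm(w @ x)            # p(y = 1 | x) = sigm(w^T x)
y_hat = int(mu > 0.5)       # y_hat(x) = 1  <=>  p(y = 1 | x) > 0.5
print(mu, y_hat)            # -> ~0.73, 1
```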



Parametric Models for Classification and Regression
Overfitting

If we try to model every minor variation in the input, we are more likely
to be modelling the noise than the true signal. This is called Overfitting.

Although overfitting may yield a perfect fit on the training set, it may
cause inaccurate predictions on novel data.



Parametric Models for Classification and Regression
Model Selection

To determine whether the model is so complex that it overfits, we can
compute the Misclassification Rate, which is the proportion of incorrect
predictions:

err(f, D) = (1/N) \sum_{i=1}^N I(f(x_i) ≠ y_i)

What we are really interested in is the Generalisation Error, the
expected value of the misclassification rate over future data. Since we do
not have future data, we can split the available data into a training set
and a validation set.
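A minimal sketch of both quantities, estimating the generalisation error by the misclassification rate on a held-out validation set (the classifier, data, and split fraction are placeholders of mine):

```python
import numpy as np

def misclassification_rate(f, X, y):
    """err(f, D) = (1/N) * sum of I(f(x_i) != y_i)."""
    return np.mean(f(X) != y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out the last 20 points as a validation set.
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

f = lambda X: (X[:, 0] > 0).astype(int)          # a hypothetical fitted classifier
print(misclassification_rate(f, X_val, y_val))   # estimate of generalisation error
```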



Parametric Models for Classification and Regression
Model Selection (Cont.)

If the validation set contains too few data points for a reliable estimate,
we can use the technique of Cross Validation (CV), sketched below:
1 Divide the data into K folds.
2 For each fold k ∈ {1, 2, ..., K}, train the model on the remaining folds
and test it on the k-th fold, in a round-robin fashion.
3 Average the test errors over the K folds to estimate the generalisation
error.
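A minimal from-scratch sketch of this procedure (the learner and data are placeholders of mine):

```python
import numpy as np

def cross_validation_error(fit, X, y, K=5):
    """Average held-out misclassification rate over K folds."""
    folds = np.array_split(np.arange(len(X)), K)
    errors = []
    for k in range(K):                            # round-robin over the folds
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        f = fit(X[train_idx], y[train_idx])       # train on the other K - 1 folds
        errors.append(np.mean(f(X[test_idx]) != y[test_idx]))
    return np.mean(errors)                        # estimate of generalisation error

# Placeholder "learner": ignores its training data, thresholds the first feature.
fit = lambda X_tr, y_tr: (lambda X: (X[:, 0] > 0).astype(int))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
print(cross_validation_error(fit, X, y, K=5))    # -> 0.0 for this toy problem
```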



Parametric Models for Classification and Regression
No Free Lunch Theorem

No Free Lunch Theorem: There is no universally best model.

“All models are wrong, but some are useful.” — George Box

