ML Review

Machine Learning Review
M. Soleymani
Sharif University of Technology
Fall 2017
Some slides have been adapted from Fei Fei Li lectures, cs231n, Stanford 2017
Types of ML problems
• Supervised learning (regression, classification)
– predicting a target variable for which we get to see examples.
• Unsupervised learning
– revealing structure in the observed data
• Reinforcement learning
– partial (indirect) feedback, no explicit guidance
– learn a policy and utility functions from the rewards received for a sequence of moves
Components of (Supervised) Learning
• Unknown target function: 𝑓: 𝒳 → 𝒴
– Input space: 𝒳
– Output space: 𝒴
• Training data: $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \dots, (\boldsymbol{x}_N, y_N)$
• We use the training set to find a function that can also predict the output on the test set
Training data: Example
Training data (shown in the slide as a scatter plot with axes x1 and x2)
x1     x2     y
0.9    2.3     1
3.5    2.6     1
2.6    3.3     1
2.7    4.1     1
1.8    3.9     1
6.5    6.8    -1
7.2    7.5    -1
7.9    8.3    -1
6.9    8.3    -1
8.8    7.9    -1
9.1    6.2    -1
Supervised Learning: Regression vs. Classification
• Supervised Learning
– Regression: predict a continuous target variable
• E.g., 𝑦 ∈ [0,1]
– Classification: predict a discrete target variable
• E.g., 𝑦 ∈ {1, 2, … , 𝐶}
Regression Example
• Housing price prediction
[Figure: price ($, in 1000's) versus size in feet². Figure adapted from slides of Andrew Ng, Machine Learning course, Stanford.]
Supervised Learning vs. Unsupervised Learning
• Supervised learning
– Given: Training set
• labeled set of $N$ input-output pairs $D = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
– Goal: learn a mapping from $\boldsymbol{x}$ to $y$
• Unsupervised learning
– Given: Training set
• $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$
– Goal: find groups or structures in the data
• Discover the intrinsic structure in the data
Supervised Learning: Samples
[Figure: classification of labeled samples in the (x1, x2) plane]
Unsupervised Learning: Samples
[Figure: clustering of samples in the (x1, x2) plane into Type I, Type II, Type III, and Type IV]
Reinforcement Learning
• Provides only an indication as to whether an action is correct or not
Data in supervised learning:
(input, correct output)
Data in reinforcement learning:
(input, some output, a grade of reward for this output)
Reinforcement Learning
• Typically, we need to make a sequence of decisions
– it is usually assumed that reward signals refer to the entire sequence
Generalization
• We do not intend to memorize the data; we need to figure out the underlying pattern.
• A core objective of learning is to generalize from experience.
– Generalization: the ability of a learning algorithm to perform accurately on new, unseen examples after having experienced a training set.
(Typical) Steps of solving supervised learning problem
• Select the hypothesis space
– A class of parametric models that map each input vector $\boldsymbol{x}$ to a predicted output $y$.
• Define a loss function that quantifies how undesirable each parameter vector is across the training data.
• Come up with a way of efficiently finding the parameters that minimize the loss function (optimization).
• Evaluate the obtained model.

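Linear regression: squared error loss function
$f(x; \boldsymbol{w}) = w_0 + w_1 x$
$\boldsymbol{w} = [w_0, w_1]$: parameters that must be found
[Figure: data points and the fitted line, with the residual $y^{(i)} - f(x^{(i)}; \boldsymbol{w})$ marked]
Cost function:
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - (w_0 + w_1 x^{(i)}) \right)^2$$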
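As a minimal sketch of the cost function above, the following Python snippet evaluates $J(\boldsymbol{w})$ for a univariate linear model. The function name `squared_error_cost` and the toy data are assumptions made for illustration; they do not come from the slides.

```python
def squared_error_cost(w0, w1, xs, ys):
    """J(w) = sum_i (y_i - (w0 + w1 * x_i))^2 for a univariate linear model."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data lying exactly on the line y = 1 + 2x (not from the slides).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

cost_exact = squared_error_cost(1.0, 2.0, xs, ys)  # parameters that fit perfectly
cost_off = squared_error_cost(0.0, 2.0, xs, ys)    # intercept off by 1 on all 4 points
```

With the perfect parameters every residual is zero, while an intercept that is off by 1 contributes a squared residual of 1 for each of the four points.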
Cost function: example
[Figure: left, housing data, price ($, in 1000's) versus size in feet² (x); right, $J(\boldsymbol{w})$ as a function of the parameters $w_0, w_1$]
This example has been adapted from Prof. Andrew Ng's slides.
Cost function: example
[Figure: left, $f(x; w_0, w_1) = w_0 + w_1 x$ (for fixed $w_0, w_1$, this is a function of $x$); right, $J(w_0, w_1)$ (function of the parameters $w_0, w_1$)]
Review: Iterative optimization of cost function
• Cost function: 𝐽(𝒘)
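• Optimization problem: $\hat{\boldsymbol{w}} = \operatorname{argmin}_{\boldsymbol{w}} J(\boldsymbol{w})$
• Steps:
– Start from $\boldsymbol{w}^0$
– Repeat
• Update $\boldsymbol{w}^t$ to $\boldsymbol{w}^{t+1}$ in order to reduce $J$
• $t \leftarrow t + 1$
– until we hopefully end up at a minimum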
How to optimize parameters?
A person is stuck in the mountains and is trying to get down (i.e., trying to find the minimum).
Follow the slope
The steepness of the hill represents the slope of the surface at that point.
How to compute the slope?
• In one dimension, the slope is the derivative of the function:
– the slope of the error surface can be calculated by taking the derivative of the error function at that point
• In multiple dimensions, the gradient is the vector of partial derivatives along each dimension
• The direction of steepest descent is the negative gradient
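One way to make the notion of "slope along each dimension" concrete is a finite-difference approximation of the gradient. The sketch below is only an illustration: the helper `numerical_gradient`, the step `h`, and the test function `f` are assumptions, not part of the slides.

```python
def numerical_gradient(f, w, h=1e-5):
    """Approximate each partial derivative of f at w with a central difference."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


def f(w):
    # Hypothetical function with known gradient [2 * w0, 3].
    return w[0] ** 2 + 3 * w[1]


grad = numerical_gradient(f, [2.0, 5.0])
steepest_descent_direction = [-g for g in grad]  # the negative gradient
```

In practice an analytic gradient is preferred for speed and accuracy; a numerical gradient like this is mostly useful for checking an analytic one.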

Gradient descent (or steepest descent)
• Each step is proportional to the negative of the gradient vector of the function at the current point $\boldsymbol{w}^t$:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \gamma_t \nabla J(\boldsymbol{w}^t)$$
– $J(\boldsymbol{w})$ decreases fastest if one goes from $\boldsymbol{w}^t$ in the direction of $-\nabla J(\boldsymbol{w}^t)$
– Assumption: $J(\boldsymbol{w})$ is defined and differentiable in a neighborhood of the point $\boldsymbol{w}^t$
Learning rate: in the mountain analogy, the amount of time the person travels before taking another measurement of the slope is the learning rate of the algorithm.
Gradient descent
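• Minimize $J(\boldsymbol{w})$
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \nabla_{\boldsymbol{w}} J(\boldsymbol{w}^t)$$
where $\eta$ is the step size (learning rate parameter) and
$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \left[ \frac{\partial J(\boldsymbol{w})}{\partial w_0}, \frac{\partial J(\boldsymbol{w})}{\partial w_1}, \dots, \frac{\partial J(\boldsymbol{w})}{\partial w_d} \right]$$
• If $\eta$ is small enough, then $J(\boldsymbol{w}^{t+1}) \le J(\boldsymbol{w}^t)$.
• $\eta$ can be allowed to change at every iteration as $\eta_t$.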
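[Figure: left, $f(x; w_0, w_1) = w_0 + w_1 x$; right, $J(w_0, w_1)$ (function of the parameters $w_0, w_1$)]
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \sum_{i=1}^{N} \left( {\boldsymbol{w}^t}^T \boldsymbol{x}^{(i)} - y^{(i)} \right) \boldsymbol{x}^{(i)}$$
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).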
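The update rule for linear regression can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `gradient_descent`, the learning rate `eta=0.05`, the number of steps, and the noise-free toy data are all chosen for the example and do not come from the slides.

```python
def gradient_descent(xs, ys, eta=0.05, steps=2000):
    """Batch gradient descent for f(x; w) = w0 + w1 * x.

    Each step applies w <- w - eta * sum_i (w^T x_i - y_i) x_i with x_i = [1, x].
    """
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad_w0 = sum(errors)                                # component for the constant input 1
        grad_w1 = sum(e * x for e, x in zip(errors, xs))     # component for the input x
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = gradient_descent(xs, ys)
```

The learning rate is deliberately small for this data; a value that is too large would make the iterates diverge rather than approach the minimum.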
[Figure sequence over eight slides: $f(x; w_0, w_1) = w_0 + w_1 x$ shown next to $J(w_0, w_1)$ over the axes $w_0$ and $w_1$]
Gradient descent disadvantages
• Local minima problem
• However, when 𝐽 is convex, all local minima are also global minima ⇒ gradient
descent can converge to the global solution.
Stochastic gradient descent
• Batch techniques process the entire training set in one go
– thus they can be computationally costly for large data sets.
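• Stochastic gradient descent: applies when the cost function is a sum over data points:
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} J^{(i)}(\boldsymbol{w})$$
• Update after presentation of a mini-batch $S$ of data:
$$\boldsymbol{w}^{t+1} = \boldsymbol{w}^t - \eta \sum_{j \in S} \nabla_{\boldsymbol{w}} J^{(j)}(\boldsymbol{w}^t)$$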
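A minimal sketch of the mini-batch update for the univariate linear model is given below. The function name `minibatch_sgd`, the batch size, the learning rate, the random seed, and the toy data are assumptions for illustration only.

```python
import random


def minibatch_sgd(xs, ys, eta=0.05, batch_size=2, steps=6000, seed=0):
    """Mini-batch SGD for f(x; w) = w0 + w1 * x with per-example squared error."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)  # mini-batch S of example indices
        grad_w0 = 0.0
        grad_w1 = 0.0
        for j in batch:
            error = (w0 + w1 * xs[j]) - ys[j]
            grad_w0 += error
            grad_w1 += error * xs[j]
        # Only the examples in S contribute to this update.
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = minibatch_sgd(xs, ys)
```

Because this toy data is noise-free, every per-example gradient vanishes at the optimum, so a constant step size is enough here; with noisy data a decaying $\eta_t$ is commonly used.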
Linear model: multi-dimensional inputs
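$$f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + \dots + w_d x_d = \boldsymbol{w}^T \boldsymbol{x}$$
$$\boldsymbol{w} = [w_0, w_1, \dots, w_d]^T, \qquad \boldsymbol{x} = [1, x_1, \dots, x_d]^T$$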
Generalized linear regression
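• Linear combination of fixed non-linear functions of the input vector
$$f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$$
$\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: set of basis functions (or features)
$\phi_i(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$
• Polynomial (univariate)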
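As an illustration of basis functions, the sketch below assumes the univariate polynomial basis $\phi_j(x) = x^j$ for $j = 1, \dots, m$. That specific choice and the names `polynomial_features` and `predict` are assumptions for the example, since the slide text does not spell out the basis.

```python
def polynomial_features(x, m):
    """Assumed basis: phi_j(x) = x**j for j = 1..m (univariate polynomial)."""
    return [x ** j for j in range(1, m + 1)]


def predict(x, w):
    """f(x; w) = w0 + sum_j w_j * phi_j(x), with w = [w0, w1, ..., wm]."""
    phi = polynomial_features(x, len(w) - 1)
    return w[0] + sum(wj * pj for wj, pj in zip(w[1:], phi))


# Hypothetical weights for f(x) = 1 + 0*x + 3*x^2, evaluated at x = 2.
value = predict(2.0, [1.0, 0.0, 3.0])
```

The model stays linear in the parameters $\boldsymbol{w}$ even though it is non-linear in the input $x$.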
Model complexity and overfitting
• With limited training data, models may achieve zero training error but
a large test error.
• Over-fitting: when the training loss no longer bears any relation to the
test (generalization) loss.
– Fails to generalize to unseen examples.
Over-fitting causes
• Model complexity
– E.g., Model with a large number of parameters (degrees of freedom)
• Small number of training examples
– Small data size compared to the complexity of the model
Model complexity
• Example:
– Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
[Figure: polynomial fits with $m = 0$, $m = 1$, $m = 3$, and $m = 9$, one panel each, with $y$ on the vertical axis] [Bishop]
Number of training data & overfitting
• The over-fitting problem becomes less severe as the size of the training data increases.
[Figure: fits with $m = 9$ for $n = 15$ and for $n = 100$] [Bishop]
Avoiding over-fitting
• Determine a suitable value for model complexity
– Simple hold-out method
– Cross-validation
• Regularization (Occam’s Razor)

– Explicit preference towards simpler models
– Penalize model complexity in the objective function
Simple hold out: training, validation, and test sets
• Simple hold-out chooses the model (hyperparameters) that minimizes the error on the validation set.
[Figure: error versus degree of polynomial $m$, with curves $J_v$ and $J_{train}$; the data is split into training, validation, and test sets]
• Run on the test set only once, at the very end!
Cross-Validation (CV): Evaluation
• 𝑘-fold cross-validation steps:
– Shuffle the dataset and randomly partition training data into 𝑘 groups of approximately equal size
– for 𝑖 = 1 to 𝑘
• Choose the 𝑖-th group as the held-out validation group
• Train the model on all but the 𝑖-th group of data
• Evaluate the model on the held-out group
– Performance scores of the model from 𝑘 runs are averaged.
[Figure: the $k$ runs (first run, second run, …, $(k-1)$-th run, $k$-th run), each with a different held-out group]
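The $k$-fold steps can be sketched as follows. The helpers `k_fold_indices` and `cross_validate`, and the `fit` and `score` callables they take, are hypothetical names introduced for this example rather than any particular library's API.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, split them into k groups, and yield (train, validation) pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k groups of approximately equal size
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def cross_validate(n, k, fit, score, seed=0):
    """Average the validation score over the k runs.

    fit(train_indices) returns a model; score(model, validation_indices) returns a number.
    """
    scores = [score(fit(train), validation)
              for train, validation in k_fold_indices(n, k, seed)]
    return sum(scores) / k


splits = list(k_fold_indices(10, 3))
```

Here `fit` and `score` are left to the caller so that the same splitting logic can be reused for any model and any performance measure.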
Regularization
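• Adding a penalty term to the cost function to discourage the coefficients from reaching large values.
• Ridge regression (weight decay):
$$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2 + \lambda R(\boldsymbol{w})$$
– First term: how much the model predictions match the training data
– Regularization term $\lambda R(\boldsymbol{w})$, e.g. $R(\boldsymbol{w}) = \|\boldsymbol{w}\|^2 = \boldsymbol{w}^T \boldsymbol{w}$
• Generalization: prefer simple models
• Approximation: control the variance of the models
– $\lambda$: regularization strength (hyperparameter)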
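A minimal sketch of the ridge cost with $R(\boldsymbol{w}) = \boldsymbol{w}^T \boldsymbol{w}$ is shown below. The function name `ridge_cost`, the feature vectors, and the value of `lam` are assumptions chosen for illustration.

```python
def ridge_cost(w, features, ys, lam):
    """Squared error data term plus lam * w^T w.

    features[i] is the feature vector phi(x_i); it has the same length as w.
    """
    data_term = sum(
        (y - sum(wj * pj for wj, pj in zip(w, phi))) ** 2
        for phi, y in zip(features, ys)
    )
    penalty = sum(wj ** 2 for wj in w)
    return data_term + lam * penalty


# Hypothetical feature vectors [1, x] and targets; w fits them exactly,
# so only the penalty term contributes to the cost.
features = [[1.0, 0.0], [1.0, 1.0]]
ys = [1.0, 3.0]
cost = ridge_cost([1.0, 2.0], features, ys, lam=0.1)
```

With `lam=0` the function reduces to the plain squared error cost; increasing `lam` makes large weights more expensive.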
Regularization
[Figure: fits with $\lambda = 0$ and with $\lambda > 0$ ($e^{-18}$)] [Bishop]
Choosing the regularization parameter
[Figure: error curves $J_v$ and $J_{train}$]
Classification problem
• Given: Training set
– labeled set of $N$ input-output pairs $D = \{(\boldsymbol{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
– $y \in \{1, \dots, K\}$
• Goal: Given an input $\boldsymbol{x}$, assign it to one of $K$ classes
• Examples:
– Image classification
– Speech recognition
– …
Linear Classifier example
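• Two class example:
$$-\frac{3}{4} x_1 - x_2 + 3 = 0$$
if $\boldsymbol{w}^T \boldsymbol{x} + w_0 \ge 0$ then $\mathcal{C}_1$ else $\mathcal{C}_2$
$$\boldsymbol{w} = \left[ -\frac{3}{4}, -1 \right]^T, \qquad w_0 = 3$$
[Figure: the decision line in the $(x_1, x_2)$ plane, separating the regions $\mathcal{C}_1$ and $\mathcal{C}_2$]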
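The decision rule of this two-class example can be checked with a few lines of Python. The function name `classify` and the test points are assumptions for illustration; the weights $\boldsymbol{w} = [-3/4, -1]^T$ and $w_0 = 3$ are those of the example.

```python
def classify(x1, x2, w=(-0.75, -1.0), w0=3.0):
    """Return 'C1' if w^T x + w0 >= 0, else 'C2'."""
    score = w[0] * x1 + w[1] * x2 + w0
    return "C1" if score >= 0 else "C2"


label_a = classify(1.0, 1.0)  # score = -0.75 - 1 + 3 = 1.25
label_b = classify(4.0, 3.0)  # score = -3 - 3 + 3 = -3
label_c = classify(4.0, 0.0)  # score = 0, exactly on the decision boundary
```

Points on the boundary itself are assigned to $\mathcal{C}_1$ because the rule uses $\ge$.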
Square error loss function for classification!
$K = 2$
Squared error loss is not suitable for classification:
– Least squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
– Least squares loss also lacks robustness to noise
$$J(\boldsymbol{w}) = \sum_{i=1}^{N} \left( w x^{(i)} + w_0 - y^{(i)} \right)^2$$
Parametric classifier: Multiclass
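• $f(\boldsymbol{x}; \boldsymbol{W}) = [f_1(\boldsymbol{x}, \boldsymbol{W}), \dots, f_K(\boldsymbol{x}, \boldsymbol{W})]^T$
• $\boldsymbol{W} = [\boldsymbol{w}_1 \cdots \boldsymbol{w}_K]$ contains one vector of parameters for each class

Parametric classifier: Linear
• $f(\boldsymbol{x}; \boldsymbol{W}) = [f_1(\boldsymbol{x}, \boldsymbol{W}), \dots, f_K(\boldsymbol{x}, \boldsymbol{W})]^T$
• $\boldsymbol{W} = [\boldsymbol{w}_1 \cdots \boldsymbol{w}_K]$ contains one vector of parameters for each class
– In linear classifiers, $\boldsymbol{W}$ is $d \times K$, where $d$ is the number of features
– $\boldsymbol{W}^T \boldsymbol{x}$ gives a vector of $K$ values
• $f(\boldsymbol{x}; \boldsymbol{W})$ contains $K$ numbers giving the class scores for the input $\boldsymbol{x}$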

Linear classifier
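• Output obtained from $\boldsymbol{W}^T \boldsymbol{x} + \boldsymbol{b}$
$$\boldsymbol{x} = [x_1, \dots, x_{784}]^T \quad (28 \times 28 \text{ input})$$
$$\boldsymbol{W}^T = \begin{bmatrix} \boldsymbol{w}_1^T \\ \vdots \\ \boldsymbol{w}_{10}^T \end{bmatrix}_{10 \times 784}, \qquad \boldsymbol{b} = [b_1, \dots, b_{10}]^T$$
Example
[Figure: $\boldsymbol{W}^T$ applied to an example input]
How can we tell whether this $\boldsymbol{W}$ and $\boldsymbol{b}$ are good or bad?
Bias can also be included in the $\boldsymbol{W}$ matrix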
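Multi-class SVM
$$J(\boldsymbol{W}) = \frac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(\boldsymbol{W})$$
Hinge loss:
$$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left(0, 1 + s_j - s_{y^{(i)}}\right) = \sum_{j \ne y^{(i)}} \max\left(0, 1 + \boldsymbol{w}_j^T \boldsymbol{x}^{(i)} - \boldsymbol{w}_{y^{(i)}}^T \boldsymbol{x}^{(i)}\right)$$
where $s_j \equiv f_j(\boldsymbol{x}^{(i)}; \boldsymbol{W}) = \boldsymbol{w}_j^T \boldsymbol{x}^{(i)}$
L2 regularization:
$$R(\boldsymbol{W}) = \sum_{k=1}^{K} \sum_{l=1}^{d} w_{lk}^2$$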
Multi-class SVM loss: Example
3 training examples, 3 classes.
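With some $\boldsymbol{W}$ the scores are $\boldsymbol{W}^T \boldsymbol{x}$
$$s_j = \boldsymbol{w}_j^T \boldsymbol{x}^{(i)}, \qquad L^{(i)} = \sum_{j \ne y^{(i)}} \max\left(0, 1 + s_j - s_{y^{(i)}}\right)$$
$$L^{(1)} = \max(0, 1 + 5.1 - 3.2) + \max(0, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$$
$$L^{(2)} = \max(0, 1 + 1.3 - 4.9) + \max(0, 1 + 2 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$$
$$L^{(3)} = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$$
$$\frac{1}{N} \sum_{i=1}^{N} L^{(i)} = \frac{1}{3} (2.9 + 0 + 12.9) \approx 5.27$$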
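The worked example can be reproduced with a short sketch of the hinge loss. The scores below are read off the terms of the worked computation; the ordering of the classes within each score list and the function name `multiclass_svm_loss` are assumptions for illustration.

```python
def multiclass_svm_loss(scores, correct, margin=1.0):
    """L = sum over j != correct of max(0, margin + s_j - s_correct)."""
    return sum(
        max(0.0, margin + s - scores[correct])
        for j, s in enumerate(scores)
        if j != correct
    )


# (scores, index of the correct class) for the three training examples.
# The class ordering inside each list is an assumption.
examples = [
    ([3.2, 5.1, -1.7], 0),
    ([1.3, 4.9, 2.0], 1),
    ([2.2, 2.5, -3.1], 2),
]
losses = [multiclass_svm_loss(scores, correct) for scores, correct in examples]
mean_loss = sum(losses) / len(losses)
```

The per-example losses come out to about 2.9, 0, and 12.9, so the average data loss is 15.8 / 3, roughly 5.27.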
Some questions?
$$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left(0, 1 + s_j - s_{y^{(i)}}\right)$$
• Q1: What if the sum was over all classes (including $j = y^{(i)}$)?
• Q2: What if we used the mean instead of the sum?
• Q3: What if we used $L^{(i)} = \sum_{j \ne y^{(i)}} \max\left(0, 1 + s_j - s_{y^{(i)}}\right)^2$?
• Q4: What are the minimum and maximum possible values of the loss?
• Q5: Why do we use a regularization term?
Other regularization terms
• L2 regularization: $\sum_{k=1}^{K} \sum_{l=1}^{d} w_{lk}^2$
• L1 regularization: $\sum_{k=1}^{K} \sum_{l=1}^{d} |w_{lk}|$
• Elastic net (L1 + L2): $\beta \sum_{k=1}^{K} \sum_{l=1}^{d} w_{lk}^2 + \sum_{k=1}^{K} \sum_{l=1}^{d} |w_{lk}|$
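For a quick comparison of these penalties, the sketch below computes each of them for a small weight matrix stored as a list of rows. The function names, the example matrix, and the value of `beta` are assumptions made for illustration.

```python
def l2_penalty(W):
    """Sum of squared entries of W."""
    return sum(w * w for row in W for w in row)


def l1_penalty(W):
    """Sum of absolute values of the entries of W."""
    return sum(abs(w) for row in W for w in row)


def elastic_net_penalty(W, beta):
    """beta * (L2 penalty) + (L1 penalty)."""
    return beta * l2_penalty(W) + l1_penalty(W)


# Hypothetical 2 x 2 weight matrix.
W = [[1.0, -2.0], [0.0, 3.0]]
penalties = (l2_penalty(W), l1_penalty(W), elastic_net_penalty(W, beta=0.5))
```

The L2 penalty grows quadratically with the size of a weight, so it punishes a few large weights more heavily than the L1 penalty does.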
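Softmax Classifier (Multinomial Logistic Regression)
softmax function:
$$P(Y = k \mid X = \boldsymbol{x}^{(i)}) = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}, \qquad s_k = f_k(\boldsymbol{x}^{(i)}; \boldsymbol{W}) = \boldsymbol{w}_k^T \boldsymbol{x}^{(i)}$$
• Maximizing the log likelihood is equivalent to minimizing the negative log likelihood of the correct class:
$$L^{(i)} = -\log P(Y = y^{(i)} \mid X = \boldsymbol{x}^{(i)}) = -s_{y^{(i)}} + \log \sum_{j=1}^{K} e^{s_j}$$
This is the cross-entropy loss.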
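Softmax classifier loss: example
$$L^{(i)} = -\log \frac{e^{s_{y^{(i)}}}}{\sum_{j=1}^{K} e^{s_j}}$$
$$L^{(1)} = -\log 0.13 = 2.04$$
(natural logarithm, as in the definition of the loss; with a base-10 logarithm the value is 0.89)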
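A minimal sketch of the softmax probabilities and the cross-entropy loss follows. It assumes that this example uses the scores 3.2, 5.1, and -1.7 with the first class correct, which is consistent with the probability 0.13 quoted above but is not stated on this slide. The function names are also assumptions.

```python
import math


def softmax(scores):
    """Class probabilities e^{s_k} / sum_j e^{s_j}, shifted by max(scores) for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(scores, correct):
    """L = -s_correct + log sum_j e^{s_j}, using the natural logarithm."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return -scores[correct] + log_sum_exp


# Assumed scores for the first example; class 0 is taken to be the correct class.
scores = [3.2, 5.1, -1.7]
probs = softmax(scores)
loss = cross_entropy_loss(scores, 0)
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.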
Cross entropy
$$H(q, p) = -\sum_{x} q(x) \log p(x)$$
• For the loss of the softmax classifier:
– $p$: estimated class probabilities $\frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}$
– $q$: the true distribution
• all probability mass is on the correct class: $q(Y = y^{(i)}) = 1$ ($q(Y \ne y^{(i)}) = 0$).
Relation to KL divergence
$$H(q, p) = H(q) + D_{KL}(q \| p)$$
• Since $H(q)$ for the loss of the softmax classifier is zero:
– Minimizing the cross entropy is equivalent to minimizing the KL divergence between the two distributions (a measure of how far apart they are).
– The cross-entropy loss wants the predicted distribution to have all of its mass on the correct answer.
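The identity $H(q, p) = H(q) + D_{KL}(q \| p)$ can be checked numerically. In the sketch below the function names and the example distributions are assumptions; terms with $q(x) = 0$ are skipped, following the usual convention $0 \log 0 = 0$.

```python
import math


def entropy(q):
    """H(q) = -sum_x q(x) log q(x)."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


def cross_entropy(q, p):
    """H(q, p) = -sum_x q(x) log p(x)."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)


def kl_divergence(q, p):
    """D_KL(q || p) = sum_x q(x) log(q(x) / p(x))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# One-hot true distribution, as for the softmax classifier loss.
q_onehot = [1.0, 0.0, 0.0]
p_pred = [0.13, 0.87, 0.0]  # hypothetical predicted probabilities

# A non-degenerate pair, to show the identity also holds when H(q) is not zero.
q_soft = [0.5, 0.5]
p_soft = [0.25, 0.75]
```

For the one-hot `q_onehot` the entropy is zero, so the cross entropy and the KL divergence coincide; for `q_soft` they differ by exactly $H(q)$.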
Recap
We need 𝛻𝑊 𝐿 to update weights

Resources
• Deep Learning Book, Chapter 5.
• Please see the following notes:
– http://cs231n.github.io/linear-classify/
– http://cs231n.github.io/optimization-1/
