M. Soleymani
Sharif University of Technology
Fall 2017
Some slides have been adapted from Fei Fei Li lectures, cs231n, Stanford 2017
Types of ML problems
• Supervised learning (regression, classification)
– predicting a target variable for which we get to see examples.
• Unsupervised learning
– revealing structure in the observed data
• Reinforcement learning
– partial (indirect) feedback, no explicit guidance
– Given rewards for a sequence of moves, learn a policy and utility functions
Components of (Supervised) Learning
• Unknown target function: 𝑓: 𝒳 → 𝒴
– Input space: 𝒳
– Output space: 𝒴
• We use the training set to find a function that can also predict outputs on the test set
Training data: Example

  𝑥1   𝑥2    𝑦
  0.9  2.3   1
  3.5  2.6   1
  2.6  3.3   1
  2.7  4.1   1
  1.8  3.9   1
  6.5  6.8  -1
  7.2  7.5  -1
  7.9  8.3  -1
  6.9  8.3  -1
  8.8  7.9  -1
  9.1  6.2  -1

[Scatter plot of the training points in the (𝑥1, 𝑥2) plane]
Supervised Learning: Regression vs. Classification
• Supervised Learning
– Regression: predict a continuous target variable
• E.g., 𝑦 ∈ [0,1]
– Classification: predict a discrete target variable
• E.g., 𝑦 ∈ {1, … , 𝐾}
Regression Example
• Housing price prediction
[Plot: housing prices, Price ($) in 1000's (0–400) vs. Size in feet² (0–2500), for the training examples]
• Supervised learning
– Given: Training set
• labeled set of 𝑁 input-output pairs 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^𝑁
– Goal: learning a mapping from 𝒙 to 𝑦
• Unsupervised learning
– Given: Training set
• {𝒙^(𝑖)}_{𝑖=1}^𝑁
– Goal: find groups or structures in the data
• Discover the intrinsic structure in the data
Supervised Learning: Samples
Classification: [scatter plot of labeled samples from two classes in the (𝑥1, 𝑥2) plane]
Unsupervised Learning: Samples
Clustering: [scatter plot in the (𝑥1, 𝑥2) plane with four discovered groups: Type I, Type II, Type III, Type IV]
Reinforcement Learning
• Provides only an indication as to whether an action is correct or not
Reinforcement Learning
• Typically, we need to make a sequence of decisions
– it is usually assumed that the reward signal refers to the entire sequence
Components of (Supervised) Learning
• Unknown target function: 𝑓: 𝒳 → 𝒴
– Input space: 𝒳
– Output space: 𝒴
• We use the training set to find a function that can also predict outputs on the test set
Generalization
• We don’t intend to memorize the training data; we need to discover the underlying pattern.
(Typical) Steps of solving supervised learning problem
• Select the hypothesis space
– A class of parametric models that map each input vector 𝒙 to a predicted output 𝑦.
• Come up with a way of efficiently finding the parameters that minimize the
loss function. (optimization)
[Plot: the training data with a candidate line 𝑓(𝑥) = 𝑤0 + 𝑤1𝑥, price vs. size in feet² (𝑥)]
Cost function:
𝐽(𝒘) = Σ_{𝑖=1}^{𝑛} (𝑦^(𝑖) − (𝑤0 + 𝑤1𝑥^(𝑖)))²
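As a concrete sketch of how this sum-of-squares cost is evaluated (NumPy; the data and the name `cost` are made up for illustration):

```python
import numpy as np

def cost(w0, w1, x, y):
    """Sum-of-squared-errors cost J(w) for the line f(x) = w0 + w1*x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# Toy data (hypothetical): y is exactly 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(cost(1.0, 2.0, x, y))  # perfect fit -> 0.0
print(cost(0.0, 0.0, x, y))  # 1 + 9 + 25 + 49 = 84.0
```

A perfect fit drives every residual, and hence the cost, to zero; learning searches for the (𝑤0, 𝑤1) that minimize this value.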
Cost function: example
𝐽(𝒘)
(function of the parameters 𝑤0 , 𝑤1)
[Left: the data with a candidate line, Price ($) in 1000's (0–500) vs. Size in feet² (𝑥). Right: surface plot of 𝐽 over the parameters 𝑤0, 𝑤1]
This example has been adapted from: Prof. Andrew Ng’s slides
Cost function: example
𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥 (for fixed 𝑤0, 𝑤1, this is a function of 𝑥)
𝐽(𝑤0, 𝑤1) (function of the parameters 𝑤0, 𝑤1)
[Left: the line on the data. Right: contour plot of 𝐽 over (𝑤0, 𝑤1)]
This example has been adapted from: Prof. Andrew Ng’s slides
Review: Iterative optimization of cost function
• Cost function: 𝐽(𝒘)
• Optimization problem: 𝒘 = argmin_𝒘 𝐽(𝒘)
• Steps:
– Start from 𝒘0
– Repeat
• Update 𝒘𝑡 to 𝒘𝑡+1 in order to reduce 𝐽
• 𝑡 ←𝑡+1
– until we hopefully end up at a minimum
How to optimize parameters?
– the slope of the error surface can be calculated by taking the derivative of the error
function at that point
Gradient descent
𝛻_𝒘 𝐽(𝒘) = [∂𝐽(𝒘)/∂𝑤0, ∂𝐽(𝒘)/∂𝑤1, …, ∂𝐽(𝒘)/∂𝑤𝑑]
𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥    𝐽(𝑤0, 𝑤1) (function of the parameters 𝑤0, 𝑤1)
[Left: the current line on the data. Right: contour plot of 𝐽 with the gradient-descent path]
Gradient-descent update for the sum-of-squares cost:
𝒘^(𝑡+1) = 𝒘^𝑡 − 𝜂 Σ_{𝑖=1}^{𝑁} ((𝒘^𝑡)^𝑇𝒙^(𝑖) − 𝑦^(𝑖)) 𝒙^(𝑖)
This example has been adapted from: Prof. Ng’s slides (ML Online Course, Stanford)
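The batch gradient-descent update for linear regression can be sketched in NumPy as follows; the toy data, learning rate, and step count are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, steps=5000):
    """Batch gradient descent for linear regression.

    X carries a leading column of ones so w[0] plays the role of w0.
    Update: w <- w - eta * sum_i (w^T x_i - y_i) x_i
    (the gradient of the sum-of-squares cost, up to a factor of 2).
    """
    w = np.zeros(X.shape[1])          # start from w^0 = 0
    for _ in range(steps):
        grad = X.T @ (X @ w - y)      # summed over all training points
        w = w - eta * grad
    return w

# Toy data (hypothetical): y = 3 + 2*x with no noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x

w = gradient_descent(X, y)
print(w)  # close to [3, 2]
```

With a small enough learning rate 𝜂, each update reduces 𝐽 and the iterates settle at the minimizer; too large an 𝜂 makes the iteration diverge.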
𝑓(𝑥; 𝑤0, 𝑤1) = 𝑤0 + 𝑤1𝑥    𝐽(𝑤0, 𝑤1) (function of the parameters 𝑤0, 𝑤1)
[A sequence of snapshots of gradient descent: at each step the point (𝑤0, 𝑤1) moves further down the contour plot of 𝐽, and the corresponding line fits the data better, until it settles near the minimum]
This example has been adapted from: Prof. Ng’s slides (ML Online Course, Stanford)
Gradient descent disadvantages
• It may converge to a (possibly poor) local minimum of a non-convex cost function, depending on the starting point.
• However, when 𝐽 is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
Stochastic gradient descent
• Batch techniques process the entire training set in one go
– thus they can be computationally costly for large data sets.
• Stochastic gradient descent: applicable when the cost function comprises a sum over data points:
𝐽(𝒘) = Σ_{𝑖=1}^{𝑛} 𝐽^(𝑖)(𝒘)
– updates 𝒘 using the gradient of one term 𝐽^(𝑖) at a time
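A sketch of the stochastic variant, updating 𝒘 with the gradient of a single 𝐽^(𝑖) at a time (NumPy; the toy data, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=1000, seed=0):
    """Stochastic gradient descent for linear regression:
    each update uses the gradient of one term J_i(w) of the summed cost."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):       # visit points in random order
            grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of J_i(w)
            w = w - eta * grad_i
    return w

# Toy data (hypothetical): y = 3 + 2*x, noiseless
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x

w = sgd(X, y)
print(w)  # approaches [3, 2]
```

Each cheap per-point update is a noisy step, but the trajectory still drifts toward the minimizer; this is what makes the method attractive for large data sets, where a full batch pass per step is too costly.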
Linear model: multi-dimensional inputs
𝑓(𝒙; 𝒘) = 𝑤0 + 𝑤1𝑥1 + ⋯ + 𝑤𝑑𝑥𝑑 = 𝒘^𝑇𝒙
𝒘 = [𝑤0, 𝑤1, …, 𝑤𝑑]^𝑇,  𝒙 = [1, 𝑥1, …, 𝑥𝑑]^𝑇
Generalized linear regression
• Linear combination of fixed non-linear function of the input vector
𝑓(𝒙; 𝒘) = 𝑤0 + 𝑤1𝜙1(𝒙) + ⋯ + 𝑤𝑚𝜙𝑚(𝒙)
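For example, with polynomial basis functions 𝜙𝑗(𝑥) = 𝑥^𝑗 the model is non-linear in 𝑥 but still linear in 𝒘, so it can be fit by ordinary least squares. A NumPy sketch with made-up toy data:

```python
import numpy as np

def poly_features(x, m):
    """Fixed non-linear basis: phi_j(x) = x**j for j = 0..m."""
    return np.column_stack([x ** j for j in range(m + 1)])

def fit_least_squares(Phi, y):
    """Least-squares weights for f(x; w) = w^T phi(x), via lstsq for stability."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy data (hypothetical): exactly quadratic, y = 1 - x + 2 x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 - x + 2.0 * x ** 2

w = fit_least_squares(poly_features(x, 2), y)
print(np.round(w, 6))  # recovers [1, -1, 2]
```

The same fitting code works for any fixed basis (polynomials, Gaussians, sinusoids); only `poly_features` would change.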
Model complexity and overfitting
• With limited training data, models may achieve zero training error but
a large test error.
• Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
– The model fails to generalize to unseen examples.
Over-fitting causes
• Model complexity
– E.g., a model with a large number of parameters (degrees of freedom)
Model complexity
• Example:
– Polynomials with larger 𝑚 are becoming increasingly tuned to the random
noise on the target values.
[Polynomial fits of degree 𝑚 = 0, 1, 3, and 9 to the same noisy data 𝑦(𝑥): the low-degree fits are too rigid, while the 𝑚 = 9 fit passes through every point, tracking the noise]
[Bishop]
Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training set increases.
[The 𝑚 = 9 polynomial fit with 𝑛 = 15 vs. 𝑛 = 100 training points]
[Bishop]
Avoiding over-fitting
• Determine a suitable value for model complexity
– Simple hold-out method
– Cross-validation
Simple hold out: training, validation, and test sets
[Plot: training error 𝐽𝑡𝑟𝑎𝑖𝑛 and validation error 𝐽𝑣 vs. degree of polynomial 𝑚; the data is split into Training, Validation, and Test sets]
Cross-Validation (CV): Evaluation
• 𝑘-fold cross-validation steps:
– Shuffle the dataset and randomly partition training data into 𝑘 groups of approximately equal size
– for 𝑖 = 1 to 𝑘
• Choose the 𝑖-th group as the held-out validation group
• Train the model on all but the 𝑖-th group of data
• Evaluate the model on the held-out group
– Performance scores of the model from 𝑘 runs are averaged.
[Diagram: in the 𝑖-th run (𝑖 = 1, …, 𝑘), the 𝑖-th fold is held out for validation and the remaining 𝑘 − 1 folds are used for training]
Regularization
• Adding a penalty term in the cost function to discourage the
coefficients from reaching large values.
𝐽(𝒘) = Σ_{𝑖=1}^{𝑛} (𝑦^(𝑖) − 𝒘^𝑇𝝓(𝒙^(𝑖)))² + 𝜆𝑅(𝒘)
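A sketch for the common choice 𝑅(𝒘) = ‖𝒘‖², for which the minimizer has a closed form (NumPy; the basis, data, and λ values below are illustrative assumptions):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Regularized least squares with R(w) = ||w||^2:
    minimizing sum_i (y_i - w^T phi_i)^2 + lam * ||w||^2
    gives w = (Phi^T Phi + lam I)^(-1) Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Toy design matrix (hypothetical): polynomial basis on sinusoidal targets
x = np.linspace(0.0, 1.0, 10)
Phi = np.column_stack([x ** j for j in range(6)])
y = np.sin(2.0 * np.pi * x)

w_small = ridge_fit(Phi, y, lam=1e-6)
w_large = ridge_fit(Phi, y, lam=10.0)
# Larger lambda shrinks the coefficient vector toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

The penalty term visibly discourages large coefficients: the norm of 𝒘 decreases monotonically as 𝜆 grows.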
Regularization
[Bishop]
Choosing the regularization parameter
[Plot: training error 𝐽𝑡𝑟𝑎𝑖𝑛 and validation error 𝐽𝑣 vs. the regularization parameter 𝜆]
Classification problem
• Given: Training set
– labeled set of 𝑁 input-output pairs 𝐷 = {(𝒙^(𝑖), 𝑦^(𝑖))}_{𝑖=1}^𝑁
– 𝑦 ∈ {1, … , 𝐾}
• Examples:
– Image classification
– Speech recognition
–…
Linear Classifier example
• Two class example:
– Decision boundary: −(3/4)𝑥1 − 𝑥2 + 3 = 0
– Decision rule: if 𝒘^𝑇𝒙 + 𝑤0 ≥ 0 then 𝒞1, else 𝒞2
– Here 𝒘 = [−3/4, −1]^𝑇 and 𝑤0 = 3
[Plot: classes 𝒞1 and 𝒞2 in the (𝑥1, 𝑥2) plane, separated by this line]
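The decision rule can be checked numerically with the slide's parameters; the class labels returned as strings and the sample points are illustrative choices:

```python
import numpy as np

# Two-class linear decision rule: predict C1 when w^T x + w0 >= 0, else C2,
# with w = [-3/4, -1] and w0 = 3 (boundary: -(3/4)x1 - x2 + 3 = 0).
w = np.array([-3.0 / 4.0, -1.0])
w0 = 3.0

def classify(x):
    return "C1" if w @ x + w0 >= 0 else "C2"

print(classify(np.array([1.0, 1.0])))  # below the boundary line -> C1
print(classify(np.array([3.0, 3.0])))  # above the boundary line -> C2
```

Points exactly on the line −(3/4)𝑥1 − 𝑥2 + 3 = 0 make 𝒘^𝑇𝒙 + 𝑤0 = 0 and, by the ≥ convention, fall in 𝒞1.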
Square error loss function for classification!
𝐾=2
Square error loss is not suitable for classification:
– Least squares loss penalizes ‘too correct’ predictions (points that lie a long way on the correct side of the decision boundary)
– Least squares loss also lacks robustness to noise (outliers)
𝐽(𝒘) = Σ_{𝑖=1}^{𝑁} (𝒘^𝑇𝒙^(𝑖) + 𝑤0 − 𝑦^(𝑖))²
Parametric classifier: Multiclass
• 𝑓(𝒙; 𝑾) = [𝑓1(𝒙, 𝑾), …, 𝑓𝐾(𝒙, 𝑾)]^𝑇
• Example: 28 × 28 input images, 10 classes
– 𝒙 = [𝑥1, …, 𝑥784]^𝑇
– 𝑾^𝑇 = [𝒘1; …; 𝒘10], a 10 × 784 matrix whose rows are 𝒘1, …, 𝒘10
– 𝒃 = [𝑏1, …, 𝑏10]^𝑇
• Example: for a given image, the vector of class scores is 𝑾^𝑇𝒙 + 𝒃
• How can we tell whether this 𝑾 and 𝒃 are good or bad?
• The bias can also be included in the 𝑾 matrix (by appending a constant 1 to 𝒙)
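A small NumPy sketch of the linear score computation, with the sizes shrunk from the 28 × 28 digit example to keep it readable (all values here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4                          # 3 classes, 4-dimensional inputs
W_T = rng.normal(size=(K, d))        # rows are w_1, ..., w_K
b = rng.normal(size=K)

x = rng.normal(size=d)
scores = W_T @ x + b                 # s_j = w_j^T x + b_j, one score per class
print(scores.shape)                  # (3,)
pred = int(np.argmax(scores))        # predict the class with the highest score

# Folding the bias into W: append 1 to x and b as an extra column of W
W_aug = np.column_stack([W_T, b])
x_aug = np.append(x, 1.0)
assert np.allclose(W_aug @ x_aug, scores)
```

The final assertion demonstrates the bias trick from the slide: augmenting 𝒙 with a constant 1 makes the separate 𝒃 unnecessary.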
Multi-class SVM
𝐽(𝑾) = (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝐿^(𝑖) + 𝜆𝑅(𝑾)
Scores: 𝑠𝑗 = 𝒘𝑗^𝑇𝒙^(𝑖)
Hinge loss: 𝐿^(𝑖) = Σ_{𝑗≠𝑦^(𝑖)} max(0, 1 + 𝑠𝑗 − 𝑠_{𝑦^(𝑖)})
Worked example (three training examples):
𝐿^(1) = max(0, 1 + 5.1 − 3.2) + max(0, 1 − 1.7 − 3.2) = max(0, 2.9) + max(0, −3.9) = 2.9 + 0 = 2.9
𝐿^(2) = max(0, 1 + 1.3 − 4.9) + max(0, 1 + 2 − 4.9) = max(0, −2.6) + max(0, −1.9) = 0 + 0 = 0
𝐿^(3) = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9
Average data loss: (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝐿^(𝑖) = (1/3)(2.9 + 0 + 12.9) ≈ 5.27
Some questions?
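The hinge-loss computation can be verified in NumPy; the score matrix below holds the three worked-example columns, and the class indexing (0, 1, 2) is an illustrative convention:

```python
import numpy as np

def multiclass_hinge(scores, y):
    """L^(i) = sum over j != y of max(0, 1 + s_j - s_y)."""
    margins = np.maximum(0.0, 1.0 + scores - scores[y])
    margins[y] = 0.0                       # skip the j = y term
    return float(np.sum(margins))

# Score columns from the worked example (one column per training example)
S = np.array([[3.2, 1.3, 2.2],
              [5.1, 4.9, 2.5],
              [-1.7, 2.0, -3.1]])
y = [0, 1, 2]                              # correct class of each example

losses = [multiclass_hinge(S[:, i], y[i]) for i in range(3)]
print(losses)                              # approximately [2.9, 0.0, 12.9]
print(sum(losses) / 3)                     # approximately 5.27
```

Reassuringly, the per-example losses match the hand computation, and averaging them gives the data-loss term of 𝐽(𝑾).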
Cross-entropy loss:
𝐿^(𝑖) = − log 𝑃(𝑌 = 𝑦^(𝑖) | 𝑋 = 𝒙^(𝑖)) = −𝑠_{𝑦^(𝑖)} + log Σ_{𝑗=1}^{𝐾} 𝑒^{𝑠𝑗}
Softmax classifier loss: example
𝐿^(𝑖) = − log( 𝑒^{𝑠_{𝑦^(𝑖)}} / Σ_{𝑗=1}^{𝐾} 𝑒^{𝑠𝑗} )
Cross entropy: 𝐻(𝑞, 𝑝) = − Σ_𝑥 𝑞(𝑥) log 𝑝(𝑥)
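A numerically stable sketch of this loss (NumPy; the example score vector and the choice of class 0 as correct are illustrative assumptions):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """L^(i) = -log( e^{s_y} / sum_j e^{s_j} ) = -s_y + log sum_j e^{s_j}."""
    s = scores - np.max(scores)                 # shift for numerical stability
    log_probs = s - np.log(np.sum(np.exp(s)))   # log softmax
    return float(-log_probs[y])

scores = np.array([3.2, 5.1, -1.7])             # hypothetical class scores
loss = softmax_cross_entropy(scores, y=0)
print(round(loss, 3))                           # about 2.04

# Sanity check: K equal scores give a uniform softmax, so the loss is log(K)
assert np.isclose(softmax_cross_entropy(np.zeros(3), 0), np.log(3))
```

Subtracting the maximum score before exponentiating leaves the softmax probabilities unchanged but prevents overflow for large scores, a standard trick when implementing this loss.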