
Linear Regression

CS434

Regression analysis
“In statistical modeling, regression analysis is a set of
statistical processes for estimating the relationships
among variables … focus is on the relationship
between a dependent variable and one or
more independent variables (or 'predictors'). ”
--- Wikipedia

In simple words, we want to predict y (the dependent variable, or target) based on a set of x's (independent variables, or features), where y is continuous.
An example regression problem
• We want to predict a person’s height based on
his/her knee height and/or arm span
– Target: y, the height
– Features: $x_1$, the knee height; $x_2$, the arm span
• This is useful for patients who are bed bound and
cannot stand to take an accurate measurement of
their height
• Training data:
– A set of measurements from a subsample of the
population

Our Training Data
[Figure: scatter plots of Height vs. Knee height (left) and Height vs. Arm span (right) for the training sample, with a few suspected outliers marked]

Ignoring these outlier points, there seems to be a reasonable linear relationship between the features and our target variable.

Linear prediction function
• We will only consider linear functions (thus the name linear regression):

$$y = w_1 x_1 + w_2 x_2 + b$$

• Let's start with just one feature. To keep the notation simple, we will for now call it x, and our goal is thus to learn a function

$$y = wx + b$$

One-dimensional Regression
Example: y = x + 3

[Figure: Height vs. Arm span scatter plot with the example line drawn through the data]

• Goal: fit a line through the points
• Problem: the data points do not lie exactly on any single line

One-dimensional regression
[Figure: Height vs. Arm span scatter plot with two candidate lines drawn through the data, one of them blue]
• Which line is better?
• The blue line seems better, but in what way?
• How can we define this goodness precisely?

Let’s formalize it a bit more
• Given a set of training examples $\{(x_i, y_i) : i = 1, \ldots, n\}$
• Goal: learn w and b from the training data, so that y = wx + b predicts $y_i$ from $x_i$ accurately
• In mathematical terms, we would like to find the w and b that minimize the following objective:

$$E(w, b) = \sum_{i=1}^{n} \big( y_i - (w x_i + b) \big)^2$$

This is the Sum of Squared Error (SSE).

Training data:

ID   arm span (x)   Height (y)
 1   166            170
 2   196            191
 3   191            189
 4   180.34         180.34
 5   174            171
 6   176.53         176.53
 7   177            187
 8   208.28         185.42
 9   199            190
10   181            181
11   178            180
12   172            175
13   185            188
14   188            185
15   165            170
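As a concrete illustration, here is a minimal Python sketch (assuming NumPy is available; the variable and function names are my own, not from the slides) that evaluates this SSE objective for a candidate (w, b) on the table above:

```python
import numpy as np

# Arm span (x) and height (y) from the training table above
x = np.array([166, 196, 191, 180.34, 174, 176.53, 177, 208.28,
              199, 181, 178, 172, 185, 188, 165])
y = np.array([170, 191, 189, 180.34, 171, 176.53, 187, 185.42,
              190, 181, 180, 175, 188, 185, 170])

def sse(w, b):
    """Sum of Squared Error E(w, b) = sum_i (y_i - (w*x_i + b))^2."""
    residuals = y - (w * x + b)
    return np.sum(residuals ** 2)

# Compare two candidate lines
print(sse(0.493, 91.30))   # the line fit later in the slides
print(sse(1.0, 3.0))       # the y = x + 3 example line
```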
Optimization 101
• Given a function f(t), finding the t* that minimizes f(t) can be challenging or impossible in many situations
  – f(t) could be unbounded, without a minimizer
  – f(t) could have a lot of local minimizers
• For the latter case, more advanced methods will be needed
• But sometimes f(t) is well behaved and it is easy to find the optimizer

[Figure: a convex function f(t) with its single global minimum at t*]

• In some cases the function is convex (e.g., a simple quadratic objective) and only has one global minimum
• Then it's simple to find the global minimum:
  1. Take the derivative of f(t), which we call f′(t)
  2. Set it to zero: f′(t) = 0
  3. Solve for t
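As a small worked example (my own illustration, not from the slides), take the convex quadratic $f(t) = (t - 3)^2 + 1$:

$$f'(t) = 2(t - 3) = 0 \;\Rightarrow\; t^* = 3, \qquad f(t^*) = 1$$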
Optimizing  $E(w, b) = \sum_{i=1}^{n} (y_i - w x_i - b)^2$

1. Take partial derivatives w.r.t. w and b respectively:

$$\frac{\partial E}{\partial w} = \sum_{i=1}^{n} -2\,(y_i - w x_i - b)\, x_i \,; \qquad \frac{\partial E}{\partial b} = \sum_{i=1}^{n} -2\,(y_i - w x_i - b)$$

2. Set them to zero and solve for w and b:

$$\frac{\partial E}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - w x_i - b) = 0$$

$$\frac{\partial E}{\partial w} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i x_i - w x_i^2 - b x_i) = 0$$

Solving this pair of equations gives

$$w^* = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}\,, \qquad b^* = \frac{1}{n}\sum_{i=1}^{n} (y_i - w^* x_i) = \bar{y} - w^* \bar{x}$$

where $\bar{x}$, $\bar{y}$, $\overline{xy}$, and $\overline{x^2}$ denote the sample means of $x_i$, $y_i$, $x_i y_i$, and $x_i^2$.
Our problem

[Figure: Height vs. Arm span scatter plot of the 15 training examples with the fitted line]

Using the same 15 training examples (arm span x, height y) as above:

$$\bar{x} = \frac{1}{n}\sum_i x_i = 182.477\,, \qquad \bar{y} = \frac{1}{n}\sum_i y_i = 181.286$$

$$\overline{x^2} = 33436.53\,, \qquad \overline{xy} = 33148.91\,, \qquad \bar{x}\,\bar{y} = 33080.46$$

$$w^* = \frac{33148.91 - 33080.46}{33436.53 - 182.477^2} = 0.493\,, \qquad b^* = 181.3 - 0.493 \times 182.5 \approx 91.30$$

Height = 91.30 + 0.493 × arm span
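A minimal Python sketch (assuming NumPy, with x and y as in the earlier table; the variable names are mine) that reproduces this closed-form one-dimensional fit:

```python
import numpy as np

x = np.array([166, 196, 191, 180.34, 174, 176.53, 177, 208.28,
              199, 181, 178, 172, 185, 188, 165])  # arm span
y = np.array([170, 191, 189, 180.34, 171, 176.53, 187, 185.42,
              190, 181, 180, 175, 188, 185, 170])  # height

# Closed-form solution of the one-dimensional least-squares problem:
#   w* = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2)
#   b* = mean(y) - w* * mean(x)
w = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
b = np.mean(y) - w * np.mean(x)

print(w, b)  # approximately 0.493 and 91.3, matching the slide up to rounding
```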
Extending to more features
• Having more features will mean our objective has more
variables to optimize over
• One can solve this similarly by
– Taking partial derivative of each variable
– Setting them to zero
– Solving the system of equations simultaneously

• Using vector calculus, this can be expressed in a succinct way
• Before we do that, we will briefly review some linear
algebra and vector calculus notations

Definition: vector
• A vector is a one-dimensional array.
• We usually denote vectors with a boldface lowercase letter x, and use (non-bold) x to denote a single variable.
• If we don't specify otherwise, assume x is a column vector.

Adopted from notes by Andrew Rosenberg of CUNY


Definition: matrix

Adopted from notes by Andrew Rosenberg of CUNY


Transposition

Adopted from notes by Andrew Rosenberg of CUNY


• To multiply two vectors, they must also have their dimensions
aligned
• For example:

$$\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = x_1 y_1 + x_2 y_2 + x_3 y_3$$

  – This is often called the inner (or dot) product of two vectors, written as $\langle \mathbf{x}, \mathbf{y} \rangle$ or $(\mathbf{x} \cdot \mathbf{y}) = \mathbf{x}^T \mathbf{y}$

• Or, alternatively:

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}$$

  – This is often called the outer product of two vectors, written as $\mathbf{x} \otimes \mathbf{y} = \mathbf{x}\mathbf{y}^T$
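A short NumPy illustration of these two products (the example values are my own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = np.dot(x, y)      # x^T y, a scalar: 1*4 + 2*5 + 3*6 = 32
outer = np.outer(x, y)    # x y^T, a 3x3 matrix with entries x_i * y_j

print(inner)
print(outer)
```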
Useful operations: vector norm
• Given a d-dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$, the (L-2, or Euclidean) norm of x is

$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_d^2} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$

• There are other norms as well. The $L_p$ norm is defined as:

$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}$$
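For example, with NumPy (a small sketch of my own):

```python
import numpy as np

x = np.array([3.0, 4.0])

l2 = np.linalg.norm(x)         # Euclidean (L-2) norm: sqrt(9 + 16) = 5.0
l1 = np.linalg.norm(x, ord=1)  # L-1 norm: |3| + |4| = 7.0

print(l2, l1)
```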
Useful operations: Matrix Inversion
• The inverse of a square matrix A is a matrix A⁻¹ such that AA⁻¹ = I, where I is the identity matrix. For example:

$$A = \begin{bmatrix} 1 & 0 \\ 1 & \tfrac{1}{2} \end{bmatrix}, \qquad A^{-1} = \begin{bmatrix} 1 & 0 \\ -2 & 2 \end{bmatrix}$$

• A square matrix A is invertible iff |A| ≠ 0, i.e., the determinant of A is nonzero
• One way to test whether the determinant of A is nonzero is to check whether some nontrivial linear combination of the columns of A equals the zero vector
• If no such combination exists, we say the columns of A are linearly independent, and |A| ≠ 0
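A quick NumPy check of this example (a sketch of my own):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 0.5]])

print(np.linalg.det(A))       # 0.5, nonzero, so A is invertible
print(np.linalg.inv(A))       # [[ 1.  0.], [-2.  2.]]
print(A @ np.linalg.inv(A))   # identity matrix (up to floating-point error)
```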
Some useful Matrix Inversion Properties
Multi-dimensional Regression
• Now we will consider the more general case with multiple features
• E.g., each example is described by $[x_1 \;\; x_2]^T$, where $x_1$ denotes the knee height and $x_2$ denotes the arm span (as before)
• We want to learn

$$y = w_1 x_1 + w_2 x_2 + b$$

• It is inconvenient to have to represent b separately, so we will use a small trick to represent all the coefficients jointly (see the sketch below):
  – Include a constant 1 as a dummy input feature: $\mathbf{x} = [1 \;\; x_1 \;\; x_2]^T$
  – Let $\mathbf{w}$ denote the vector of all coefficients: $\mathbf{w} = [b \;\; w_1 \;\; w_2]^T$
  – The function can then be compactly represented as $y = \mathbf{w}^T \mathbf{x}$
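A small NumPy sketch of this dummy-feature trick (array names are mine; the values are the first three training examples from the two-feature data used later in these slides):

```python
import numpy as np

# Raw features: one row per example, columns = [knee height, arm span]
X_raw = np.array([[50.0, 166.0],
                  [57.0, 196.0],
                  [50.0, 191.0]])

# Prepend a column of ones so the bias b becomes just another coefficient
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([70.19, 0.656, 0.413])  # [b, w1, w2], the fit found later

y_pred = X @ w   # each prediction is w^T x_i
print(y_pred)
```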

Objective

Previous objective:
$$E(w, b) = \sum_{i=1}^{n} (y_i - w x_i - b)^2$$

Updated form:
$$E(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$

• Let $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$
• Let $X$ be the matrix whose i-th row is $\mathbf{x}_i^T$ (example i):

$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$$

Then

$$E(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w})$$
Optimizing E(w)

$$E(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w})$$

Take the gradient and set it to zero:

$$\nabla E(\mathbf{w}) = -2 X^T (\mathbf{y} - X\mathbf{w}) = 0$$
$$\Rightarrow\; X^T X \mathbf{w} = X^T \mathbf{y}$$
$$\Rightarrow\; \mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$$

Here $\mathbf{w}$ is a (d+1)-dimensional vector, where d is the number of input features.

The Matrix Cookbook is a good resource to help you with this type of manipulation.
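In code, this closed-form solution is essentially one line (a sketch assuming NumPy; solving the linear system is generally preferred over forming an explicit inverse):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y for w.

    X: (n, d+1) design matrix whose first column is all ones.
    y: (n,) vector of targets.
    """
    # Equivalent to w = inv(X^T X) @ X^T y, but numerically more stable
    return np.linalg.solve(X.T @ X, X.T @ y)
```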

Training data with both features:

ID   knee height   arm span   Height
 1   50            166        170
 2   57            196        191
 3   50            191        189
 4   53.34         180.34     180.34
 5   54            174        171
 6   55.88         176.53     176.53
 7   57            177        187
 8   55.88         208.28     185.42
 9   57            199        190
10   54            181        181
11   55            178        180
12   53            172        175
13   57            185        188
14   57            188        185
15   49.5          165        170

Each row of X is [1, knee height, arm span], and y is the vector of heights, so

$$X^T X = \begin{bmatrix} 15 & 815.6 & 2737.2 \\ 815.6 & 44451.6 & 149081 \\ 2737.2 & 149081 & 501547.9 \end{bmatrix}, \qquad \mathbf{w} = (X^T X)^{-1} X^T \mathbf{y} = \begin{bmatrix} 70.19 \\ 0.656 \\ 0.413 \end{bmatrix}$$

Height = 70.19 + 0.656 × knee height + 0.413 × arm span
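Putting it together, a sketch (assuming NumPy; array names are mine) that reproduces this two-feature fit from the table above:

```python
import numpy as np

knee = np.array([50, 57, 50, 53.34, 54, 55.88, 57, 55.88,
                 57, 54, 55, 53, 57, 57, 49.5])
arm = np.array([166, 196, 191, 180.34, 174, 176.53, 177, 208.28,
                199, 181, 178, 172, 185, 188, 165])
height = np.array([170, 191, 189, 180.34, 171, 176.53, 187, 185.42,
                   190, 181, 180, 175, 188, 185, 170])

# Design matrix: a column of ones, then the two features
X = np.column_stack([np.ones_like(knee), knee, arm])

# Normal-equation solution
w = np.linalg.solve(X.T @ X, X.T @ height)
print(w)  # approximately [70.19, 0.656, 0.413], as on the slide
```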


Geometric interpretation
• y is a vector in $R^n$
• Each column (feature) of X is also a vector in $R^n$
• Xw is a linear combination of the columns of X, and lies in range(X)
• We want Xw to match y as closely as possible

[Figure: y as a vector above the plane spanned by $\mathbf{x}_1$ and $\mathbf{x}_2$; its projection onto the plane is $\mathbf{x}_1 w_1^* + \mathbf{x}_2 w_2^* = X\mathbf{w}^*$, and the residual $\mathbf{y} - X\mathbf{w}^*$ points from the plane up to y]
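To make the picture precise, the normal equations above say that the residual is orthogonal to every column of X, so $X\mathbf{w}^*$ is the orthogonal projection of $\mathbf{y}$ onto range(X):

$$X^T(\mathbf{y} - X\mathbf{w}^*) = 0$$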
What is the effect of adding one feature?
• By using both armspan and knee height, can we
do better than using just armspan?
• How do we compare?
– Training SSE with only arm span: 257.445
– Training SSE with armspan and knee height: 225.680
• Does this mean the model with two features is necessarily better than the one-feature model?
• More generally, is it always better to have more
features?
– Effect on training?
– Effect on testing?
Summary
• We introduced linear regression, which assumes that the function that maps from x to y is linear
• Sum of Squared Error (SSE) objective:

$$E(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w})$$

• The solution is given by:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$$

• There are other objectives, leading to different solutions
• Although we make the linear assumption, it is easy to use this to learn nonlinear functions as well
  – Introduce nonlinear features, e.g., $x_1^2$, $x_2^2$, $x_1 x_2$ (see the sketch below)
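For instance, a sketch (my own illustration, assuming NumPy) of fitting a nonlinear function by expanding the features and then running the same linear least-squares machinery:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 5.0, size=30)
x2 = rng.uniform(0.0, 5.0, size=30)

# Toy targets generated from a nonlinear (quadratic) function of the inputs
y = 1.5 * x1**2 - 0.5 * x1 * x2 + 2.0

# Nonlinear feature expansion: [1, x1, x2, x1^2, x2^2, x1*x2]
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Same linear regression solution as before, now on the expanded features
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # should recover roughly [2.0, 0, 0, 1.5, 0, -0.5]
```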
