
Simple Linear Regression

Simple Regression
 Simple regression analysis is a statistical tool that
gives us the ability to estimate the mathematical
relationship between a dependent variable (usually
called y) and an independent variable (usually
called x).
 The dependent variable is the variable for which
we want to make a prediction.
 While various non-linear forms may be used,
simple linear regression models are the most
common.
Introduction
 The primary goal of quantitative analysis is to use
current information about a phenomenon to predict its
future behavior.
 Current information is usually in the form of a set of
data.
 In a simple case, when the data form a set of pairs of
numbers, we may interpret them as representing the
observed values of an independent (or predictor)
variable X and a dependent (or response) variable Y.

Lot size  Man-hours
30  73
20  50
60  128
80  170
40  87
50  108
60  135
30  69
70  148
60  132
Introduction
 The goal of the analyst who studies the data is to
find a functional relation
\[ y = f(x) \]
between the response variable y and the predictor
variable x.

[Figure: Statistical relation between Lot size and Man-Hour. Scatter plot; x-axis: Lot size (0 to 90), y-axis: Man-Hour (0 to 180).]
Regression Function
 The statement that the relation
between X and Y is statistical
should be interpreted as providing
the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(x) to be the
expected value (i.e., mean value) of
Y.
3. Given that E(Y) denotes the
expected value of Y, call the
equation
\[ E(Y) = f(x) \]
the regression function.
Pictorial Presentation of Linear Regression
Model
Historical Origin of Regression
 Regression Analysis was
first developed by Sir
Francis Galton, who
studied the relation
between heights of sons
and fathers.
 Heights of sons of both tall
and short fathers appeared
to “revert” or “regress” to
the mean of the group.
Construction of Regression Models
 Selection of independent variables
 Since reality must be reduced to manageable proportions
whenever we construct models, only a limited number of
independent or predictor variables can or should be included in a
regression model. Therefore a central problem is that of
choosing the most important predictor variables.
 Functional form of regression relation
 Sometimes, relevant theory may indicate the appropriate
functional form. More frequently, however, the functional form
is not known in advance and must be decided once the data have
been collected and analyzed.
 Scope of model
 In formulating a regression model, we usually need to restrict
the coverage of the model to some interval or region of values of the
independent variables.
Uses of Regression Analysis
 Regression analysis serves three major purposes.
1. Description
2. Control
3. Prediction
 The several purposes of regression analysis frequently
overlap in practice.
Formal Statement of the Model
 General regression model
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
1. $\beta_0$ and $\beta_1$ are parameters
2. X is a known constant
3. Deviations $\varepsilon$ are independent $N(0, \sigma^2)$
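
As a rough illustration of what the model statement says, the sketch below draws responses as a fixed line plus independent normal deviations. The function name `simulate_responses` and the parameter values (intercept 10, slope 2, standard deviation 3) are assumptions made up for this example; they do not come from the slides.

```python
import random


def simulate_responses(xs, beta0, beta1, sigma, seed=0):
    """Draw Y = beta0 + beta1 * x + eps, with eps ~ N(0, sigma^2) independently for each x."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical parameter values, chosen only for illustration.
xs = [20, 30, 40, 50, 60, 70, 80]
ys = simulate_responses(xs, beta0=10.0, beta1=2.0, sigma=3.0)

for x, y in zip(xs, ys):
    # Each simulated y scatters around the mean response 10 + 2x.
    print(x, round(y, 1))
```

Setting `sigma=0.0` removes the deviations, so every simulated point falls exactly on the line.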
Meaning of Regression Coefficients
 The values of the regression parameters $\beta_0$ and $\beta_1$ are
not known. We estimate them from data.
 $\beta_1$ indicates the change in the mean response per unit
increase in X.
Regression Line
 If the scatter plot of our sample data suggests a linear
relationship between two variables, i.e.
\[ y = \beta_0 + \beta_1 x \]
we can summarize the relationship by drawing a
straight line on the plot.
 The least squares method gives us the “best” estimated line
for our set of sample data.


Regression Line
 We will write an estimated regression line based on
sample data as
\[ \hat{y} = b_0 + b_1 x \]
 The method of least squares chooses the values for $b_0$
and $b_1$ to minimize the sum of squared errors
\[ \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \]
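
To make the least squares criterion concrete, the sketch below evaluates the sum of squared errors for a few candidate lines, using the lot size and man-hours data from the Introduction slide. The helper name `sse` and the three candidate lines are assumptions picked for illustration.

```python
# Lot size (x) and man-hours (y) from the Introduction slide.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]


def sse(b0, b1, xs, ys):
    """Sum of squared errors of the line y-hat = b0 + b1 * x on the data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))


# Hypothetical candidate lines; the one with the smallest SSE fits best.
candidates = [(0.0, 2.0), (10.0, 2.0), (10.0, 2.2)]
for b0, b1 in candidates:
    print(f"b0={b0}, b1={b1}, SSE={sse(b0, b1, x, y):.2f}")
```

For this data set the candidate with intercept 10 and slope 2 gives SSE = 60, which is the smallest of the three. It is also the least squares line for these ten pairs.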
Regression Line
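 Using calculus, we obtain estimating formulas:
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \]
or
\[ b_1 = r \frac{S_y}{S_x} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
where r is the sample correlation between x and y, and $S_x$ and $S_y$ are the sample standard deviations of x and y.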
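
The calculus step can be sketched as follows. This is the standard derivation from the normal equations, written in the notation of the slides; it is offered as a worked outline rather than as the author's own presentation.

```latex
% Set both partial derivatives of SSE = \sum (y_i - b_0 - b_1 x_i)^2 to zero.
\begin{align*}
\frac{\partial\,\mathrm{SSE}}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0, \\
\frac{\partial\,\mathrm{SSE}}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0.
\end{align*}
% Rearranging gives the two normal equations.
\begin{align*}
\sum_{i=1}^{n} y_i &= n b_0 + b_1 \sum_{i=1}^{n} x_i, \\
\sum_{i=1}^{n} x_i y_i &= b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2.
\end{align*}
% The first equation gives b_0; substituting it into the second and
% multiplying through by n gives b_1.
\begin{align*}
b_0 &= \bar{y} - b_1 \bar{x}, \\
b_1 &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}.
\end{align*}
```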
Estimation of Mean Response
 The fitted regression line can be used to estimate the
mean value of y for a given value of x.
 Example
 The weekly advertising expenditure (x) and weekly sales
(y) are presented in the following table.

y (sales, $) x (expenditure, $)
1250 41
1380 54
1425 63
1425 54
1450 48
1300 46
1400 62
1510 61
1575 64
1650 71
Point Estimation of Mean Response
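 From the previous table we have:
\[ n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755 \]
 The least squares estimates of the regression coefficients
are:
\[ b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left( \sum x \right)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = \frac{85690}{7944} \approx 10.8 \]
\[ b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - 10.787(56.4) \approx 828 \]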
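
The same arithmetic can be checked with a short script. This is a minimal sketch using the advertising table above; the function name `least_squares` and the variable names are assumptions for this example.

```python
# Weekly sales (y) and weekly advertising expenditure (x) from the table.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
spend = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]


def least_squares(xs, ys):
    """Return (b0, b1) from the computational formulas for simple linear regression."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_xx = sum(xi * xi for xi in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n  # b0 = y-bar - b1 * x-bar
    return b0, b1


b0, b1 = least_squares(spend, sales)
print("sums:", sum(spend), sum(sales))
print("b1 =", round(b1, 3), " b0 =", round(b0, 1))
```

The unrounded slope is slightly below 10.8. Carrying it through unrounded is what gives an intercept that rounds to 828.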


Point Estimation of Mean Response
 The estimated regression function is:
\[ \hat{y} = 828 + 10.8x \]
\[ \text{Sales} = 828 + 10.8 \, \text{Expenditure} \]
 This means that if the weekly advertising expenditure is
increased by $1 we would expect the weekly sales to
increase by $10.8.
Point Estimation of Mean Response
 Fitted values for the sample data are obtained by
substituting the x value into the estimated regression
function.
 For example, if the advertising expenditure is $50, then
the estimated sales is:
\[ \text{Sales} = 828 + 10.8(50) = 1368 \]
 This is called the point estimate (forecast) of the mean
response (sales).
Example: Retail sales and floor space
 It is customary in retail operations to assess the
performance of stores partly in terms of their
annual sales relative to their floor area (square
feet). We might expect sales to increase linearly as
stores get larger, with of course individual
variation among stores of the same size. The
regression model for a population of stores says
that
\[ \text{SALES} = \beta_0 + \beta_1 \, \text{AREA} + \varepsilon \]
Example: Retail sales and floor space
 The slope $\beta_1$ is as usual a rate of change: it is the
expected increase in annual sales associated with
each additional square foot of floor space.
 The intercept $\beta_0$ is needed to describe the line but
has no statistical importance because no stores
have area close to zero.
 Floor space does not completely determine sales.
The term $\varepsilon$ in the model accounts for differences
among individual stores with the same floor space.
A store’s location, for example, is important.
Residual
 The residual is the difference between the observed value $y_i$ and the
corresponding fitted value $\hat{y}_i$:
\[ e_i = y_i - \hat{y}_i \]
 Residuals are highly useful for studying whether a
given regression model is appropriate for the data at
hand.
Example: weekly advertising expenditure

y x y-hat Residual (e)


1250 41 1270.8 -20.8
1380 54 1411.2 -31.2
1425 63 1508.4 -83.4
1425 54 1411.2 13.8
1450 48 1346.4 103.6
1300 46 1324.8 -24.8
1400 62 1497.6 -97.6
1510 61 1486.8 23.2
1575 64 1519.2 55.8
1650 71 1594.8 55.2
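
The fitted values and residuals in this table can be reproduced with the rounded line y-hat = 828 + 10.8x. The sketch below assumes that rounded line; the names `fitted`, `fits` and `residuals` are chosen for this example.

```python
# Weekly sales (y) and weekly advertising expenditure (x) from the example.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
spend = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]


def fitted(x, b0=828.0, b1=10.8):
    """Fitted value from the rounded estimated regression function."""
    return b0 + b1 * x


fits = [fitted(x) for x in spend]
residuals = [y - f for y, f in zip(sales, fits)]  # e_i = y_i - y-hat_i

print("y     x   y-hat    e")
for y, x, f, e in zip(sales, spend, fits, residuals):
    print(f"{y:<5} {x:<3} {f:<8.1f} {e:.1f}")
```

Because the coefficients are rounded, these residuals sum to about -6.2 rather than exactly zero, as they would with the unrounded least squares estimates.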
Estimation of the variance of the error terms, $\sigma^2$
 The variance $\sigma^2$ of the error terms $\varepsilon_i$ in the regression
model needs to be estimated for a variety of purposes.
 It gives an indication of the variability of the probability
distributions of y.
 It is needed for making inferences concerning the regression
function and the prediction of y.
Regression Standard Error
 To estimate $\sigma$ we work with the variance and take
the square root to obtain the standard deviation.
 For simple linear regression the estimate of $\sigma^2$ is
the average squared residual, with divisor n - 2:
\[ s_{y.x}^2 = \frac{1}{n-2} \sum e_i^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2 \]
 To estimate $\sigma$, use
\[ s_{y.x} = \sqrt{s_{y.x}^2} \]
 $s_{y.x}$ estimates the standard deviation $\sigma$ of the error
term $\varepsilon$ in the statistical model for simple linear
regression.
Regression Standard Error

y x y-hat Residual (e) square(e)


1250 41 1270.8 -20.8 432.64
1380 54 1411.2 -31.2 973.44
1425 63 1508.4 -83.4 6955.56
1425 54 1411.2 13.8 190.44
1450 48 1346.4 103.6 10732.96
1300 46 1324.8 -24.8 615.04
1400 62 1497.6 -97.6 9525.76
1510 61 1486.8 23.2 538.24
1575 64 1519.2 55.8 3113.64
1650 71 1594.8 55.2 3047.04

y-hat = 828 + 10.8x
Total (sum of squared residuals): 36124.76

\[ s_{y.x} = \sqrt{36124.76 / (10 - 2)} = 67.19818 \]
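
The regression standard error above can be recomputed from the data with a few lines. This sketch again assumes the rounded line y-hat = 828 + 10.8x used in the table; the variable names are chosen for this example.

```python
import math

# Weekly sales (y) and weekly advertising expenditure (x) from the example.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
spend = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

# Residuals from the rounded estimated regression function.
residuals = [y - (828.0 + 10.8 * x) for x, y in zip(spend, sales)]

n = len(sales)
sum_sq = sum(e ** 2 for e in residuals)  # sum of squared residuals
variance_estimate = sum_sq / (n - 2)     # divisor is n - 2, not n
s_yx = math.sqrt(variance_estimate)      # regression standard error

print("sum of squared residuals:", round(sum_sq, 2))
print("s_y.x:", round(s_yx, 5))
```

The divisor is n - 2 because two parameters, the intercept and the slope, were estimated from the same data.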
Basic Assumptions of a Regression Model
 A regression model is based on the following
assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that µy is the mean value of Y, the standard form of
the model is
\[ y = f(x) + \varepsilon \]
where $\varepsilon$ is a random variable with a normal distribution with
mean 0 and standard deviation $\sigma$.
Conditions for Regression Inference
 You can fit a least-squares line to any set of
explanatory-response data when both variables are
quantitative.
 If the scatter plot doesn’t show an approximately
linear pattern, the fitted line may be almost useless.
Conditions for Regression Inference
 The simple linear regression model, which is the basis
for inference, imposes several conditions.
 We should verify these conditions before proceeding
with inference.
 The conditions concern the population, but we can
observe only our sample.
Conditions for Regression Inference
 In doing inference, we assume:
1. The sample is from the population.
2. There is a linear relationship in the population.
 We cannot observe the population, so we check the scatter
plot of the sample data.
3. The standard deviation of the responses about the
population line is the same for all values of the
explanatory variable.
 The spread of observations above and below the least-squares
line should be roughly uniform as x varies.
Conditions for Regression Inference
 Plotting the residuals against the explanatory variable
is helpful in checking these conditions because a
residual plot magnifies patterns.
Analysis of Residual
 To examine whether the regression model is
appropriate for the data being analyzed, we can
check the residual plots.
 Residual plots are:
 Plot a histogram of the residuals
 Plot residuals against the fitted values.
 Plot residuals against the independent variable.
 Plot residuals over time if the data are chronological.
Analysis of Residual
 A histogram of the residuals provides a check on
the normality assumption. A Normal quantile plot
of the residuals can also be used to check the
Normality assumptions.
 Regression inference is robust against moderate
lack of Normality. On the other hand, outliers and
influential observations can invalidate the results
of inference for regression.
 A plot of residuals against fitted values or the
independent variable can be used to check the
assumption of constant variance and the aptness of
the model.
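
As a text-only stand-in for these plots, the sketch below bins the residuals from the advertising example into a crude histogram and lists them in order of x. It assumes the rounded line y-hat = 828 + 10.8x from that example. The helper `bin_counts` and the bin width of 50 are assumptions for illustration, and with only ten observations the output is a rough visual aid, not a formal test.

```python
import math

# Weekly sales (y) and weekly advertising expenditure (x) from the example.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
spend = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
residuals = [y - (828.0 + 10.8 * x) for x, y in zip(spend, sales)]


def bin_counts(values, width):
    """Count values per bin [k * width, (k + 1) * width), keyed by the bin start."""
    counts = {}
    for v in values:
        start = math.floor(v / width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Histogram of residuals: a rough look at the Normality assumption.
hist = bin_counts(residuals, 50)
for start, count in hist.items():
    print(f"{start:>5} to {start + 50:>5}: {'*' * count}")

# Residuals in order of x: look for curvature or changing spread.
for x, e in sorted(zip(spend, residuals)):
    print(x, round(e, 1))
```

With a plotting library available, the same two lists could be passed to a histogram and a scatter plot instead of being printed.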
Analysis of Residual
 Plot of residuals against time provides a check on the
independence of the error terms assumption.
 Assumption of independence is the most critical one.
Residual plots
 The residuals should have no systematic pattern.
 The Degree Days residual plot shows a scatter of the
points with no unusual individual observations or
systematic change as x increases.

[Figure: Degree Days Residual Plot. x-axis: Degree Days (0 to 60), y-axis: Residuals (-1 to 1).]
Residual plots
 The points in this
residual plot have a
curved pattern, so a
straight line fits poorly.
Residual plots
 The points in this plot
show more spread for
larger values of the
explanatory variable x,
so prediction will be less
accurate when x is large.
