Lecture 01 Regression Analysis1 PDF

Objectives
 At the end of this course participants will be able to
 understand the concept of statistical modelling
 differentiate between different types of regression models
 interpret the regression analysis results generated by the statistical software (SPSS)
Outlines
 Introduction
 Basic regression concepts (key words)
 Scatterplot
 The correlation coefficient
 Statistical Modelling
 Linear regression
 Multiple linear regression
 Regression with categorical data
 Other types of regression: Logistic regression and Poisson regression
Areas of Statistics
 There are two main areas of Statistics:
Descriptive statistics:
provides tabular and graphical techniques and numerical measures for describing data.
Inferential statistics:
provides procedures for analyzing data and making decisions, using the sample (statistic) to infer about the population (parameter).
Inferential statistics: Key Definitions
 Population is the collection of all items or things under consideration (people or objects)
 Sample is a portion of the population selected for analysis (selected at random)
 Parameter is any numerical measure calculated based on
the population elements
 Statistic is any numerical measure calculated based on the
sample elements
Inferential Statistics
 Making statements about a population by examining sample results
 Sample statistics (known, variable) → inference → population parameters (unknown constants, but can be estimated from the sample)
[Diagram: a sample drawn from the population]
Types of data
 Data are the facts, figures, or records that are collected from the sample elements.
 Data can be classified:
Categorical data are labels or names used to identify attributes of the sample elements. The labels can be numbers with no real numerical meaning.
Numerical data are numbers (with real meaning), representing measurements, obtained from the sample elements.
Types of data,... cont.
 Categorical:
Nominal: Order is not important (no ranking)
◼Examples: gender, marital status, race, ...
Ordinal: Order is important
◼Examples: Education level, disease severity, level of
satisfaction,…
 Numerical:
Discrete: Countable (how many)
◼Number of children, length of stay at the hospital,…
Continuous: Uncountable (how much)
◼Age, weight, height, blood sugar,…
Three Skills for data analysis
There are three skills you need to master in order to get a proper and reliable data analysis:
 When to use? (next slide)
 When to use a specific statistical procedure
 Depends on types of data (quantitative vs qualitative)
 How to get?
 Apply the procedure using the formulae or a software
 What to look for?
 Interpretation (comment on the output from the software)
Four Questions for (When to use)
There are four questions you need to answer to know “when to use?”, i.e. to select the appropriate statistical procedure:
 How many variables are in the research question?
 The same as the number of questions we ask each individual
 What are the types of these variables?
 Quantitative or qualitative
 Level of measurement (nominal, ordinal, scale)
 How many groups are in the research question?
 The same as the number of categories in the qualitative variable
 Which variables are the dependent and independent? (next slide)
 Dependent variable (outcome or response)
 Independent variable (cause or predictor)
Independent vs dependent variables
Hints that may help you to distinguish between independent and dependent variables:
 Whichever occurs first is always the independent variable
 Ex. If you study, you will pass the exam
 Studying = independent and passing = dependent
 There are two types of characteristics (original and gained); the original is always the independent variable
 Ex. Is there a relation between gender and obesity?
 Gender = independent and obesity = dependent
 Use the logic
 Ex. Is there a relation between height and weight?
 Height = independent and weight = dependent
Relationships between numerical data
 In addition to hypothesis testing and confidence intervals, inferential (analytical) statistics includes determining whether a relationship between two or more quantitative (numerical) variables exists.
Relationships between numerical data
 There are three ways to assess the relationship between numerical variables
 Graphically
▪ Scatter plot
 Numerically
▪ Correlation coefficient
 Modelling (equation)
▪ Regression
GRAPHICAL PRESENTATION
Scatterplot
Scatter plot
◼ It is a graphic presentation of the relation between two quantitative (numerical) variables, with the independent variable on the x-axis and the dependent variable on the y-axis.
◼ Four interpretations to look for in a scatter plot
➢ Pattern (linear or non-linear)
➢ Trend (positive or negative)
➢ Magnitude (strength) (weak, moderate or strong)
➢ Outliers
Scatter plot - Example
[Figure: six example scatter plots: strong positive, strong negative, no relation, weak positive, weak negative, and not linear]
Example 1
 Open the SPSS data file “Health_funding” in the SPSS folder:
C:\Program Files\IBM\SPSS\Statistics\25\Samples\English
 This is a hypothetical data file that contains data on health care funding (amount per 100 population), disease rates (rate per 10,000 population), and visits to health care providers (rate per 10,000 population). Each case represents a different city.
1. Draw the scatterplot between “funding” and “disease” and comment on the graph.
2. Draw the scatterplot between “funding” and “visits” and comment on the graph.
1. Draw the scatterplot between “funding” and “disease” and comment on the graph.
1. Linear
2. Positive
3. Strong
4. There is an outlier
2. Draw the scatterplot between “funding” and “visits” and comment on the graph.
1. Linear
2. Positive
3. Strong
4. No outliers
NUMERICAL MEASURE
Correlation Coefficient
Correlation Coefficient
 The coefficient of correlation, denoted ρ (rho) in the population and r in the sample, is used to measure the strength of the linear association between two quantitative variables.
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} \]
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
x̄ = the mean of x
ȳ = the mean of y
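The formula for r can be checked by hand on a small data set. The following is a minimal Python sketch, not part of the SPSS workflow used in these slides; the function name `pearson_r` and the five data points are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / sqrt(ss_xx * ss_yy)

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))
```

Here SS_xy = 6, SS_xx = 10 and SS_yy = 6, so r = 6 / √60, a positive value below 1.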
Properties of correlation coefficient r
 r is independent of units.
 -1 ≤ r ≤ 1
 r < 0 → negative linear correlation
 r > 0 → positive linear correlation
 Empirical rule to interpret r:
 |r| close to 1 → strong correlation
 |r| between 0.5 and 0.7 → moderate correlation
 |r| between 0 and 0.5 → weak or no correlation

Correlation: Examples
[Figure: example scatter plots with r = 0.98, r = -0.99, r = -0.13, r = 0.51 and r = -0.44]
Testing the significance of the correlation (𝜌)
 We can test to see if the correlation is significant using the hypotheses
 H0: 𝜌 = 0 (no linear correlation)
 H1: 𝜌 ≠ 0 (there is a linear correlation)
 P–value > 0.05 → no linear correlation
 P–value ≤ 0.05 → there is a linear correlation
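The slides read the P-value directly from SPSS. Behind that P-value there is usually a t statistic with n − 2 degrees of freedom, t = r√(n − 2) / √(1 − r²). The sketch below assumes this standard formula, which the slides do not state. The function name, the inputs (r = 0.737, n = 50) and the approximate critical value 2.01 for 48 degrees of freedom are illustrative assumptions.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Illustrative inputs: r = 0.737 from a sample of n = 50 cases.
t_stat = correlation_t(0.737, 50)

# Approximate two-sided 5% critical value of t with 48 df (from t tables).
T_CRITICAL = 2.01
significant = abs(t_stat) > T_CRITICAL
print(round(t_stat, 2), significant)
```

If |t| exceeds the critical value, the P-value is below 0.05 and H0 is rejected.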

Correlation coefficient in SPSS
 To get the correlation coefficient in SPSS

Example 2
 Using the data of Example 1, get the value of the correlation coefficient and comment on its value. Test the significance of this correlation.
Correlations
Row variable: Health care funding (amount per 100)
 | Health care funding (amount per 100) | Reported diseases (rate per 10,000) | Visits to health care providers (rate per 10,000)
Pearson Correlation | 1 | .737** | .964**
Sig. (2-tailed) | | 0.000 | 0.000
N | 50 | 50 | 50
**. Correlation is significant at the 0.01 level (2-tailed).
Example 2
 There is a strong positive correlation between health care funding (amount per 100) and each of reported diseases (rate per 10,000) and visits to health care providers (rate per 10,000) (r = 0.74 and r = 0.96, respectively). Both variables are linearly related to funding (P-value < 0.001).
STATISTICAL MODELLING
Simple linear
Regression
What is statistical modelling?
▪ Statistical modelling is a form of mathematical modelling which relates a dependent variable to one or more independent variables through an equation used for prediction.
▪ For example, one can build a statistical model which relates the height of a child to his/her age.
▪ Linear regression is a special case of statistical
modelling where the dependent variable is
continuous and the independent variables are
continuous or categorical (converted to dummy
variables)
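Converting a categorical variable to dummy variables means creating one 0/1 indicator per category, leaving out one reference category. A minimal Python sketch of that idea follows; the function name `dummy_code`, the `is_` prefix and the marital-status values are hypothetical, chosen only for illustration.

```python
def dummy_code(values, reference):
    """Return one dict of 0/1 indicators per observation.

    The reference category gets no indicator: it is the case
    where all indicators are 0.
    """
    levels = sorted(set(values) - {reference})
    return [{f"is_{level}": int(v == level) for level in levels}
            for v in values]

# Hypothetical nominal variable with three categories.
marital = ["single", "married", "divorced", "married"]
coded = dummy_code(marital, reference="single")
for original, row in zip(marital, coded):
    print(original, row)
```

A variable with three categories therefore enters the regression as two dummy variables.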
Model of the simple linear regression
 The simple linear regression model
y = β0 + β1x + ε
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line
ε = error variable
 Simple since it has only one independent variable (predictor)
 β0 and β1 are called regression parameters
 b0 is the estimate of β0 and b1 is the estimate of β1

The Simple Linear Regression Model
Example 3
Using the data of Example 1 and assuming that “funding” is the dependent variable and “disease” is the independent variable, answer the following questions:
1. Find the estimated regression line.
2. Interpret the regression coefficients.
3. Determine the coefficient of determination and interpret its value.
4. Test whether “funding” and “disease” are linearly related.
5. Is the model significant? Valid?
6. Predict “funding” of a city with 200 reported diseases (rate per 10,000).
7. Find a 95% prediction interval for “funding” of a city with 200 reported diseases (rate per 10,000).
8. Find a 95% confidence interval for the average “funding” of all cities with 200 reported diseases (rate per 10,000).
Estimating the coefficients
 The regression equation that estimates the equation of the first-order linear model is:
ŷ = b0 + b1x
 The estimates of the coefficients are:
\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]
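The two estimates can be computed directly from the sums of squares. The following Python sketch applies b1 = SS_xy / SS_xx and b0 = ȳ − b1·x̄ to a small made-up data set; the function name `fit_line` and the data are hypothetical and are not the Health_funding file.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical data: x-bar = 3, y-bar = 4, SS_xy = 6, SS_xx = 10.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 3), round(b1, 3))
```

For these data the slope is 6 / 10 = 0.6 and the intercept is 4 − 0.6 × 3 = 2.2.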
Linear regression in SPSS
 To get these estimates in SPSS
Coefficients (a)
Model | Unstandardized B | Std. Error | Standardized Beta | t | Sig.
1 (Constant) | 91.625 | 11.191 | | 8.187 | 0.000
Reported diseases (rate per 10,000) | 0.479 | 0.063 | 0.737 | 7.556 | 0.000
a. Dependent Variable: Health care funding (amount per 100)
With b0 = 91.625 and b1 = 0.479: Ŷ = 91.625 + 0.479 X
estimated funding = 91.625 + 0.479 × diseases
Rounded: Ŷ = 91.6 + 0.48 X
Interpretation of Regression Coefficients
 The intercept, b0, is the estimated average value of y (dependent) when the value of x (independent) is zero.
 b0 = 91.6: This is the estimated health care funding (amount per 100) for a city with zero reported diseases (rate per 10,000) (doesn’t make sense, why?)
 The slope, b1, is the estimated change in the average value of y as a result of a one-unit change in x.
 b1 = 0.48: the health care funding increases on average by 0.48 for every one-unit increase in reported diseases (rate per 10,000).
Coefficient of Determination (R-square)
(Model assessment)
 The simple coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
It is the square of the coefficient of correlation (r).
0 ≤ r² ≤ 1.
◼ r² = 1: Perfect match between the line and the data.
◼ r² = 0: There is no linear relationship between x and y.
It does not give any information on the direction of the relationship between the variables.
The larger the value of r², the better the fit is.
3. Determine the coefficient of determination and
interpret its value.
R-square = 0.543: 54% of the variation in health care funding (amount per 100) is explained by the variation in the number of reported diseases (rate per 10,000), so the model is acceptable. The remaining 46% (100 − 54) is due to other factors.
Model Summary (b)
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .737 (a) | 0.543 | 0.534 | 9.92069
a. Predictors: (Constant), Reported diseases (rate per 10,000)
b. Dependent Variable: Health care funding (amount per 100)

4. Test whether “funding” and “disease” are linearly
related.
 A regression model is not likely to be useful unless there is a significant relationship between x and y.
 To test significance, we use the null hypothesis:
H0: β1 = 0 (no linear relation, why?)
H1: β1 ≠ 0 (linear relation)
 P-value < 0.001, reject the null hypothesis and conclude that there is a linear relation between “funding” and “disease”


5. Is the model significant? Valid?
 This is also called the F test. It tests the significance of the overall regression relationship between the x(s) and y and is given in the ANOVA table in the regression output.
H0: The model is NOT valid
H1: The model is valid
 P-value < 0.001, reject the null hypothesis and conclude that the model is valid.
ANOVA
Model | Sum of Squares | df | Mean Square | F | Sig.
1 Regression | 5619.028 | 1 | 5619.028 | 57.092 | .000 (b)
Residual | 4724.160 | 48 | 98.420 | |
Total | 10343.188 | 49 | | |
b. Predictors: (Constant), Reported diseases (rate per 10,000)
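The entries of the ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and R-square is the regression sum of squares divided by the total. The Python sketch below recomputes these from the sums of squares reported for the simple model (n = 50 cities, k = 1 predictor); the variable names are illustrative.

```python
# Sums of squares taken from the SPSS ANOVA table for the simple model.
ss_regression = 5619.028
ss_residual = 4724.160
n = 50   # number of cities (cases)
k = 1    # number of independent variables

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / k          # df = k
ms_residual = ss_residual / (n - k - 1)    # df = n - k - 1
f_stat = ms_regression / ms_residual
r_square = ss_regression / ss_total

print(round(ms_residual, 3), round(f_stat, 3), round(r_square, 3))
```

The results agree with the Mean Square, F and R Square values that SPSS reports.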
6. Predict “funding” of a city with 200 reported diseases (rate per 10,000).
 If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
estimated funding = 91.625 + 0.479 × diseases
estimated funding = 91.625 + 0.479 × 200
estimated funding = 187.425
 On average, the expected “funding” of a city with 200 reported diseases (rate per 10,000) is about $187 (amount per 100).
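The same substitution can be written as a small function. This is a minimal Python sketch using the rounded coefficients from the SPSS output (b0 = 91.625, b1 = 0.479); the function name `predict_funding` is hypothetical.

```python
def predict_funding(diseases, b0=91.625, b1=0.479):
    """Point prediction from the fitted line: b0 + b1 * diseases."""
    return b0 + b1 * diseases

# City with 200 reported diseases (rate per 10,000).
predicted = predict_funding(200)
print(round(predicted, 3))
```

This gives only the point prediction; the prediction and confidence intervals in questions 7 and 8 come from the SPSS output.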
Prediction in SPSS
 To get the prediction in SPSS:
 First add the new value of x, which is 200, at the end of the column of x
Prediction in SPSS
Click Save, then tick Unstandardized predicted values and the Mean and Individual prediction intervals
Prediction in SPSS
6. Predict “funding” of a city with 200 reported diseases (rate per 10,000).
7. Find a 95% prediction interval for “funding” of a city with 200 reported diseases (rate per 10,000).
8. Find a 95% confidence interval for the average “funding” of all cities with 200 reported diseases (rate per 10,000).
STATISTICAL MODELLING
Multiple linear
regression
Multiple linear regression model
 Simple linear regression uses one independent variable to explain the dependent variable, while multiple regression uses two or more independent variables to describe the dependent variable
 Some relationships are too complex to be described using a single independent variable. Multiple regression models can handle these situations
 There is no limit to the number of independent variables a model can have.
 Multiple regression has only one dependent variable (y)
Multiple linear regression model
 The multiple linear regression model relating the dependent variable y to the independent variables x1, x2,…, xk is given by
y = β0 + β1x1 + β2x2 +…+ βkxk + ε
 β0, β1, β2,…, βk are unknown parameters
 ε is an error term
 The estimated regression equation is given by
ŷ = b0 + b1x1 + b2x2 + … + bkxk
 b0, b1, b2,…, bk are the estimates of the parameters β0, β1, β2,…, βk
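The estimates b0, b1,…, bk are the values that minimise the sum of squared errors. One way to obtain them, sketched below in plain Python, is to solve the normal equations (X'X)b = X'y by Gauss-Jordan elimination. This is an illustration under that assumption, not a description of how SPSS computes them; the function name `fit_multiple` and the small data set are made up.

```python
def fit_multiple(X, y):
    """Least-squares estimates [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + [float(v) for v in obs] for obs in X]  # intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for idx in range(p):
            if idx != col:
                factor = A[idx][col]
                A[idx] = [a - factor * c for a, c in zip(A[idx], A[col])]
    return [A[i][p] for i in range(p)]

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coefficients = fit_multiple(X, y)
print([round(b, 6) for b in coefficients])
```

Because the made-up data lie exactly on a plane, the fit recovers the generating coefficients; with real data the residuals are not zero.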
Example 4
Using the data of Example 1 and assuming that “funding” is the dependent variable and both “disease” and “visits” are the independent variables, answer the following questions:
1. Find the estimated regression equation.
2. Interpret the regression coefficients.
3. Determine the coefficient of determination and interpret its value.
4. Which one of the independent variables (“disease” and “visits”) has a linear relation with “funding”?
5. Is the model significant? Valid?
6. Predict “funding” of a city with 200 reported diseases (rate per 10,000) and 180 visits to health care providers (rate per 10,000).
Multiple regression in SPSS
Analyze → Regression → Linear
Coefficients (a)
Model | Unstandardized B | Std. Error | Standardized Beta | t | Sig.
1 (Constant) | 24.982 | 6.061 | | 4.122 | 0.000
Reported diseases (rate per 10,000) | 0.004 | 0.039 | 0.005 | 0.091 | 0.928
Visits to health care providers (rate per 10,000) | 0.858 | 0.053 | 0.960 | 16.109 | 0.000
With b0 = 24.982, b1 = 0.004 and b2 = 0.858:
Ŷ = 24.982 + 0.004 X1 + 0.858 X2
estimated funding = 24.982 + 0.004 × diseases + 0.858 × visits
 b0 = 25: This is the estimated health care funding (amount per 100) for a city with zero reported diseases (rate per 10,000) and zero visits to health care providers (rate per 10,000) (doesn’t make sense, why?)
 b1 = 0.004: the health care funding increases on average by 0.004 for every one-unit increase in reported diseases while the number of visits remains constant.
 b2 = 0.858: the health care funding increases on average by 0.858 for every one-unit increase in visits while the number of reported diseases remains constant.
3. Determine the coefficient of determination and
interpret its value.
R-square = 0.930: 93% of the variation in health care funding (amount per 100) is explained by the variation in the number of reported diseases (rate per 10,000) and visits to health care providers (rate per 10,000), so the model is excellent. The remaining 7% (100 − 93) is due to other factors.
Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .964 (a) | 0.930 | 0.927 | 3.92604
a. Predictors: (Constant), Visits to health care providers (rate per 10,000), Reported diseases (rate per 10,000)
4. Which one of the independent variables (“disease”
and “visits”) has a linear relation with “funding”
 To test the significance of the relation between “disease” and “funding”:
H0: β1 = 0 (no linear relation between “disease” and “funding”)
H1: β1 ≠ 0 (there is a linear relation between “disease” and “funding”)
 P-value = 0.928, so do not reject the null hypothesis and conclude that there is NO linear relation between “funding” and “disease”
 Why did the significance of “disease” change between the two models (simple and multiple)?
4. Which one of the independent variables (“disease”
and “visits”) has a linear relation with “funding”
 To test the significance of the relation between “visits” and “funding”:
H0: β2 = 0 (no linear relation between “visits” and “funding”)
H1: β2 ≠ 0 (there is a linear relation between “visits” and “funding”)
 P-value < 0.001, reject the null hypothesis and conclude that there is a linear relation between “funding” and “visits”
5. Is the model significant? Valid?
H0: The model is NOT valid
H1: The model is valid
 From the ANOVA table in the regression output: P-value < 0.001, reject the null hypothesis and conclude that the model is valid.
ANOVA
Model | Sum of Squares | df | Mean Square | F | Sig.
1 Regression | 9618.740 | 2 | 4809.370 | 312.018 | .000 (b)
Residual | 724.448 | 47 | 15.414 | |
Total | 10343.188 | 49 | | |
b. Predictors: (Constant), Visits to health care providers (rate per 10,000), Reported diseases (rate per 10,000)
6. Predict “funding” of a city with 200 reported diseases (rate per 10,000) and 180 visits to health care providers (rate per 10,000)
estimated funding = 24.982 + 0.004 × diseases + 0.858 × visits
estimated funding = 24.982 + 0.004 × 200 + 0.858 × 180
estimated funding = 180.22
 On average, the expected “funding” of a city with 200 reported diseases (rate per 10,000) and 180 visits to health care providers (rate per 10,000) is about $180.22 (amount per 100).
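With several predictors the prediction is the intercept plus the sum of each coefficient times its variable. The following is a minimal Python sketch using the rounded coefficients from the SPSS output; the function name `predict` and its argument layout are hypothetical.

```python
def predict(coefficients, values):
    """y-hat = b0 + b1*x1 + ... + bk*xk, with coefficients = [b0, b1, ..., bk]."""
    b0, slopes = coefficients[0], coefficients[1:]
    if len(slopes) != len(values):
        raise ValueError("need exactly one value per independent variable")
    return b0 + sum(b * x for b, x in zip(slopes, values))

# Rounded estimates from the multiple regression output: b0, b1 (diseases), b2 (visits).
coefficients = [24.982, 0.004, 0.858]

# City with 200 reported diseases and 180 visits (both rates per 10,000).
predicted = predict(coefficients, [200, 180])
print(round(predicted, 2))
```

The value is computed from the rounded coefficients shown in the table, so it may differ slightly from the prediction SPSS saves using the unrounded estimates.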
