
Objectives

▪ At the end of this course participants will be able to
➢ understand the concept of statistical modelling
➢ differentiate between different types of regression models
➢ interpret the regression analysis results generated by the statistical software (SPSS)
Outlines
2

▪ Introduction
▪ Basic regression concepts (key words)
▪ Scatterplot
▪ The correlation coefficient
▪ Statistical Modelling
▪ Linear regression
▪ Multiple linear regression
▪ Regression with categorical data
▪ Other types of regression: Logistic regression and Poisson regression
Areas of Statistics
3

▪ There are two main areas of Statistics:
➢ Descriptive statistics: provides tabular and graphical techniques and numerical measures for describing data.
➢ Inferential statistics: provides procedures for analyzing data and making decisions, using the sample (statistic) to infer about the population (parameter).
Inferential statistics: Key Definitions
4

▪ Population is the collection of all items or things under consideration (people or objects)
▪ Sample is a portion of the population selected for analysis (selected at random)
▪ Parameter is any numerical measure calculated based on the population elements
▪ Statistic is any numerical measure calculated based on the sample elements
Inferential Statistics
5

▪ Making statements about a population by examining sample results
▪ Sample statistics (known variable) → Inference → Population parameters (unknown constant, but can be estimated from the sample)
[Figure: a sample drawn from a population]
Types of data
6

▪ Data are the facts, figures, or records that are collected from the sample elements.
▪ Data can be classified as:
➢ Categorical data are labels or names used to identify attributes of the sample elements. The labels can be numbers with no real numerical meaning.
➢ Numerical data are numbers (with real meaning), representing measurements, obtained from the sample elements.
Types of data,... cont.
7

▪ Categorical:
➢ Nominal: Order is not important (no ranking)
◼ Examples: gender, marital status, race, ...
➢ Ordinal: Order is important
◼ Examples: education level, disease severity, level of satisfaction, ...
▪ Numerical:
➢ Discrete: Countable (how many)
◼ Examples: number of children, length of stay at the hospital (counted in days), ...
➢ Continuous: Measured rather than counted (how much)
◼ Examples: age, weight, height, blood sugar, ...
Three Skills for data analysis
8

There are three skills you need to master in order to get a proper and reliable data analysis:
▪ When to use? (next slide)
➢ When to use a specific statistical procedure
➢ Depends on the types of data (quantitative vs qualitative)
▪ How to get?
➢ Apply the procedure using the formulae or software
▪ What to look for?
➢ Interpretation (comment on the output from the software)
Four Questions for (When to use)
9

There are four questions you need to answer to know “when to use?”, i.e. to select the appropriate statistical procedure:
▪ How many variables are in the research question?
➢ The same as the number of questions we ask each individual
▪ What are the types of these variables?
➢ Quantitative or qualitative
➢ Level of measurement (nominal, ordinal, scale)
▪ How many groups are in the research question?
➢ The same as the number of categories in the qualitative variable
▪ Which variables are the dependent and independent? (next slide)
➢ Dependent variable (outcome or response)
➢ Independent variable (cause or predictor)
Independent vs dependent variables
10

Hints that may help you to distinguish between independent and dependent variables:
▪ The variable that occurs first is always the independent one
➢ Ex. If you study, you will pass the exam
➢ Studying = independent and passing = dependent
▪ There are two types of characteristics (original and gained); the original one is always independent
➢ Ex. Is there a relation between gender and obesity?
➢ Gender = independent and obesity = dependent
▪ Use logic
➢ Ex. Is there a relation between height and weight?
➢ Height = independent and weight = dependent
Relationships between numerical data
11

▪ In addition to hypothesis testing and confidence intervals, inferential (analytical) statistics includes determining whether a relationship between two or more quantitative (numerical) variables exists.
Relationships between numerical data
12

▪ There are three ways to assess the relationship between numerical variables:
➢ Graphically
◼ Scatter plot
➢ Numerically
◼ Correlation coefficient
➢ Modelling (equation)
◼ Regression
GRAPHICAL PRESENTATION
Scatterplot
Scatter plot
14

◼ It is a graphic presentation of the relation between two quantitative (numerical) variables, with the independent variable on the x-axis and the dependent variable on the y-axis.
◼ Four things to look for in a scatter plot:
➢ Pattern (linear or non-linear)
➢ Trend (positive or negative)
➢ Magnitude (strength) (weak, moderate or strong)
➢ Outliers
Scatter plot -Example
15

[Figure: six example scatter plots: strong positive, strong negative, no relation, weak positive, weak negative, and not linear]
Example1
16

▪ Open the SPSS data file “Health_funding” in the SPSS folder:
C:\Program Files\IBM\SPSS\Statistics\25\Samples\English
▪ This is a hypothetical data file that contains data on health care funding (amount per 100 population), disease rates (rate per 10,000 population), and visits to health care providers (rate per 10,000 population). Each case represents a different city.
1. Draw the scatterplot between “funding” and “disease” and comment on the graph.
2. Draw the scatterplot between “funding” and “visits” and comment on the graph.
1. Draw the scatterplot between “funding” and “disease”
and comment on the graph.
19

1. Linear
2. Positive
3. Strong
4. There is an outlier
2. Draw the scatterplot between “funding” and “visits” and
comment on the graph.
20

1. Linear
2. Positive
3. Strong
4. No outliers
NUMERICAL MEASURE
Correlation Coefficient
Correlation Coefficient
22

▪ The coefficient of correlation, denoted ρ (rho) in the population and r in the sample, is used to measure the strength of the linear association between two quantitative variables.
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_{xx} \, SS_{yy}}} $$
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
x̄ = the mean of x
ȳ = the mean of y
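As a rough illustration of the formula above, the following Python sketch computes r from the three sums of squares. The function name `pearson_r` and the five data points are hypothetical; they are not taken from the Health_funding file.

```python
import math


def pearson_r(x, y):
    """Sample correlation r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 3))  # SS_xy = 6, SS_xx = 10, SS_yy = 6, so r = 6 / sqrt(60)
```

Because the numerator and denominator carry the same units, r is unit-free, which is the first property listed on the next slide.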
Properties of correlation coefficient r
23

▪ r is independent of units.
▪ -1 ≤ r ≤ 1
▪ r < 0 → negative linear correlation
▪ r > 0 → positive linear correlation
▪ Empirical rule to interpret r:
➢ |r| close to 1 → strong correlation
➢ |r| between 0.5 and 0.7 → moderate correlation
➢ |r| between 0 and 0.5 → weak or no correlation
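The empirical rule can be written as a small helper. This is only a sketch: the name `describe_correlation` is hypothetical, and treating every |r| above 0.7 as "strong" is an assumption made here to close the gap between "0.7" and "close to 1" in the rule.

```python
def describe_correlation(r):
    """Return (strength, direction) for a sample correlation r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size > 0.7:          # assumption: above 0.7 counts as strong
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak or no"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


print(describe_correlation(0.98))   # strong, positive
print(describe_correlation(-0.44))  # weak or no, negative
```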


Correlation: Examples
24

[Figure: five example scatter plots with r = 0.98, r = -0.99, r = -0.13, r = 0.51, and r = -0.44]
Testing the significance of the correlation (𝜌)
25

▪ We can test whether the correlation is significant using the hypotheses
➢ H0: ρ = 0 (no linear correlation)
➢ H1: ρ ≠ 0 (there is a linear correlation)
▪ P-value > 0.05 → do not reject H0 (no evidence of a linear correlation)
▪ P-value ≤ 0.05 → reject H0 (there is a linear correlation)
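SPSS reports the P-value directly. For readers who want to see where it comes from, the standard test statistic for H0: ρ = 0 is t = r·sqrt(n − 2) / sqrt(1 − r²) with n − 2 degrees of freedom. The sketch below applies it to r = 0.737 and n = 50, the values reported for funding and diseases in Example 2. The function name is hypothetical, and the critical value 2.0106 (two-sided, 5% level, 48 degrees of freedom) is taken from standard t tables rather than from the slides.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = 0.737, 50               # values from the Example 2 output
t = correlation_t_statistic(r, n)

T_CRITICAL = 2.0106            # assumption: two-sided 5% value for 48 df, from t tables
significant = abs(t) > T_CRITICAL

print(round(t, 2), significant)
```

The result is close to the t value of 7.556 that SPSS prints for the slope in the simple regression of Example 3, since in simple linear regression the two tests are equivalent.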


Correlation coefficient in SPSS
26

▪ To get the correlation coefficient in SPSS: Analyze → Correlate → Bivariate


Example 2
27

▪ Using the data of Example 1, get the value of the correlation coefficient and comment on its value. Test the significance of this correlation.

Correlations
                                            Health care funding   Reported diseases    Visits to health care
                                            (amount per 100)      (rate per 10,000)    providers (rate per 10,000)
Health care funding   Pearson Correlation   1                     .737**               .964**
(amount per 100)      Sig. (2-tailed)                             0.000                0.000
                      N                     50                    50                   50
**. Correlation is significant at the 0.01 level (2-tailed).
Example 2
28

▪ There is a strong positive correlation between health care funding (amount per 100) and each of reported diseases (rate per 10,000) and visits to health care providers (rate per 10,000) (r = 0.74 and r = 0.96, respectively). Both variables are significantly linearly related to funding (P-value < 0.001).
STATISTICAL MODELLING
Simple linear
Regression
What is statistical modelling?
30

▪ Statistical modelling is a form of mathematical modelling which relates a dependent variable to one or more independent variables through an equation used for prediction.
▪ For example, one can build a statistical model which relates the height of a child to his/her age.
▪ Linear regression is a special case of statistical modelling where the dependent variable is continuous and the independent variables are continuous or categorical (converted to dummy variables).
Model of the simple linear regression
31

▪ The simple linear regression model
y = β0 + β1x + ε
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line
ε = error variable
▪ Simple since it has only one independent variable (predictor)
▪ β0 and β1 are called regression parameters
▪ b0 is the estimate of β0 and b1 is the estimate of β1


The Simple Linear Regression Model
32
Example 3
33

Using the data of Example 1, and assuming that “funding” is the dependent variable and “disease” is the independent variable, answer the following questions:
1. Find the estimated regression line.
2. Interpret the regression coefficients.
3. Determine the coefficient of determination and interpret its value.
4. Test whether “funding” and “disease” are linearly related.
5. Is the model significant? Valid?
6. Predict the “funding” of a city with 200 reported diseases (rate per 10,000).
7. Find a 95% prediction interval for the “funding” of a city with 200 reported diseases (rate per 10,000).
8. Find a 95% confidence interval for the average “funding” of all cities with 200 reported diseases (rate per 10,000).
Estimating the coefficients
34

▪ The regression equation that estimates the first-order linear model is:
ŷ = b0 + b1x
▪ The estimates of the coefficients are:
$$ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} $$
$$ b_0 = \bar{y} - b_1 \bar{x} $$
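A minimal sketch of these two estimates in Python follows. The function name `fit_simple_linear_regression` and the toy data are hypothetical and are not the Health_funding data; SPSS output for the real data appears on the following slides.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates: b1 = SS_xy / SS_xx, b0 = y_bar - b1 * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_linear_regression(x, y)
print(round(b0, 2), round(b1, 2))  # SS_xy = 6 and SS_xx = 10, so b1 = 0.6 and b0 = 4 - 0.6 * 3
```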
Linear regression in SPSS
35

▪ To get these estimates in SPSS: Analyze → Regression → Linear


1. Find the estimated regression line.
36

Coefficients (a)
Model                                   Unstandardized B   Std. Error   Standardized Beta   t       Sig.
1  (Constant)                           91.625             11.191                           8.187   0.000
   Reported diseases (rate per 10,000)  0.479              0.063        0.737               7.556   0.000
a. Dependent Variable: Health care funding (amount per 100)

b0 = 91.625, b1 = 0.479
ŷ = 91.625 + 0.479 x
estimated funding = 91.625 + 0.479 × diseases
1. Find the estimated regression line.
37

ŷ = 91.6 + 0.48 x
Interpretation of Regression Coefficients
38
2. Interpret the regression coefficients.
39

▪ The intercept, b0, is the estimated average value of y (dependent) when the value of x (independent) is zero.
➢ b0 = 91.6: This is the estimated health care funding (amount per 100) for a city with zero reported diseases (rate per 10,000) (doesn’t make sense, why?)
▪ The slope, b1, is the estimated change in the average value of y as a result of a one-unit change in x.
➢ b1 = 0.48: the health care funding (amount per 100) increases on average by 0.48 for every one-unit increase in reported diseases (rate per 10,000).
Coefficient of Determination (R-square)
(Model assessment)
40

▪ The simple coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
➢ It is the square of the coefficient of correlation (r).
➢ 0 ≤ r² ≤ 1.
◼ r² = 1: Perfect match between the line and the data.
◼ r² = 0: There is no linear relationship between x and y.
➢ It does not give any information on the direction of the relationship between the variables.
➢ The larger the value of r², the better the fit is.
3. Determine the coefficient of determination and
interpret its value.
41

R-square = 0.543: 54% of the variation in health care funding (amount per 100) is explained by the variation in the number of reported diseases (rate per 10,000), so the model is acceptable. The remaining 46% (= 100 - 54) is due to other factors.

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .737 (a)   0.543      0.534               9.92069
a. Predictors: (Constant), Reported diseases (rate per 10,000)
b. Dependent Variable: Health care funding (amount per 100)


4. Test whether “funding” and “disease” are linearly
related.
42

▪ A regression model is not likely to be useful unless there is a significant relationship between x and y.
▪ To test significance, we use the hypotheses:
H0: β1 = 0 (no linear relation, why?)
H1: β1 ≠ 0 (linear relation)
▪ P-value < 0.001: reject the null hypothesis and conclude that there is a linear relation between “funding” and “disease”




5. Is the model significant? Valid?
43

▪ This is also called the F test. It tests the significance of the overall regression relationship between the x variable(s) and y, and is given in the ANOVA table in the regression output.
▪ To test significance, we use the hypotheses:
H0: The model is NOT valid (all slope coefficients are zero)
H1: The model is valid (at least one slope coefficient is not zero)
▪ P-value < 0.001: reject the null hypothesis and conclude that the model is valid.
ANOVA (a)
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   5619.028         1    5619.028      57.092   .000 (b)
   Residual     4724.160         48   98.420
   Total        10343.188        49
a. Dependent Variable: Health care funding (amount per 100)
b. Predictors: (Constant), Reported diseases (rate per 10,000)
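The ANOVA sums of squares are enough to reproduce the F statistic and the Model Summary figures by hand. The sketch below uses the standard definitions (F = MS regression / MS residual, R² = SS regression / SS total, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), standard error = sqrt(MS residual)); the function name `summarize_anova` is hypothetical, and the inputs are the values printed in the table above.

```python
import math


def summarize_anova(ss_regression, ss_residual, n, k):
    """Recompute F, R-square, adjusted R-square and the standard error.

    n is the sample size and k the number of independent variables.
    """
    ss_total = ss_regression + ss_residual
    df_residual = n - k - 1
    ms_regression = ss_regression / k
    ms_residual = ss_residual / df_residual
    r_square = ss_regression / ss_total
    adjusted = 1 - (1 - r_square) * (n - 1) / df_residual
    return {
        "F": ms_regression / ms_residual,
        "R2": r_square,
        "adj_R2": adjusted,
        "std_error": math.sqrt(ms_residual),
    }


# Values from the ANOVA table of the simple regression (n = 50 cities, k = 1)
summary = summarize_anova(5619.028, 4724.160, n=50, k=1)
for name, value in summary.items():
    print(name, round(value, 3))
```

The results agree, up to rounding, with the SPSS output: F = 57.092, R Square = 0.543, Adjusted R Square = 0.534, and Std. Error of the Estimate = 9.92069.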
6. Predict “funding” of a city with 200 of reported diseases
(rate per 10,000).
44

▪ If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
estimated funding = 91.625 + 0.479 × diseases
estimated funding = 91.625 + 0.479 × 200
estimated funding = 187.425
▪ On average, the “funding” of a city with 200 reported diseases (rate per 10,000) is expected to be $187 (amount per 100)
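The same substitution can be wrapped in a small function so that other disease rates can be tried. This is a sketch only: `predict_funding` is a hypothetical name, and the default coefficients are the b0 and b1 read from the SPSS Coefficients table.

```python
def predict_funding(diseases, b0=91.625, b1=0.479):
    """Point prediction from the fitted line: funding = b0 + b1 * diseases."""
    return b0 + b1 * diseases


predicted = predict_funding(200)
print(round(predicted, 3))
```

A point prediction like this says nothing about uncertainty; the 95% prediction and confidence intervals asked for in questions 7 and 8 come from the SPSS Save options described on the next slides.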
Prediction in SPSS
45

▪ To get the prediction in SPSS:
➢ First add the new value of x, which is 200, at the end of the column of x
Prediction in SPSS
46

Click Save, then tick Unstandardized under Predicted Values, and Mean and Individual under Prediction Intervals


Prediction in SPSS
47

6. Predict the “funding” of a city with 200 reported diseases (rate per 10,000).
7. Find a 95% prediction interval for the “funding” of a city with 200 reported diseases (rate per 10,000).
8. Find a 95% confidence interval for the average “funding” of all cities with 200 reported diseases (rate per 10,000).
STATISTICAL MODELLING
Multiple linear
regression
Multiple linear regression model
49

▪ Simple linear regression uses one independent variable to explain the dependent variable, while multiple regression uses two or more independent variables to describe the dependent variable
▪ Some relationships are too complex to be described using a single independent variable. Multiple regression models can handle these situations
▪ There is no limit to the number of independent variables a model can have.
▪ Multiple regression has only one dependent variable (y)
Multiple linear regression model
50

▪ The multiple linear regression model relating the dependent variable y to the independent variables x1, x2, …, xk is given by
y = β0 + β1x1 + β2x2 + … + βkxk + ε
➢ β0, β1, β2, …, βk are unknown parameters
➢ ε is an error term
▪ The estimated regression equation is given by
ŷ = b0 + b1x1 + b2x2 + … + bkxk
➢ b0, b1, b2, …, bk are the estimates of the parameters β0, β1, β2, …, βk
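The slides rely on SPSS to compute b0, b1, …, bk. As a rough sketch of what happens behind the scenes, the estimates can be obtained by least squares, here by solving the normal equations (XᵀX)b = Xᵀy with plain Gaussian elimination. Everything in this block is illustrative: the function names are hypothetical, the toy data were constructed so that y = 1 + 2·x1 + 3·x2 exactly, and the solver assumes the independent variables are not perfectly collinear.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; rows are copied so the inputs stay unchanged
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    # Back substitution
    coef = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][j] * coef[j] for j in range(row + 1, n))
        coef[row] = (m[row][n] - known) / m[row][row]
    return coef


def fit_multiple_regression(predictors, y):
    """predictors is a list of columns [x1, x2, ...]; returns [b0, b1, ..., bk]."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data built from y = 1 + 2*x1 + 3*x2 (no error term)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

b = fit_multiple_regression([x1, x2], y)
print([round(value, 6) for value in b])
```

Because the toy data contain no error term, the fitted coefficients recover the intercept 1 and the slopes 2 and 3 used to build them.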
Example 4
51

Using the data of Example 1, and assuming that “funding” is the dependent variable and both “disease” and “visits” are the independent variables, answer the following questions:
1. Find the estimated regression line.
2. Interpret the regression coefficients.
3. Determine the coefficient of determination and interpret its value.
4. Which of the independent variables (“disease” and “visits”) has a linear relation with “funding”?
5. Is the model significant? Valid?
6. Predict the “funding” of a city with 200 reported diseases (rate per 10,000) and 180 visits to health care providers (rate per 10,000).
Multiple regression in SPSS
52

Analyze→ Regression→ Linear


1. Find the estimated regression line.
53

Coefficients (a)
Model                                                  Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1  (Constant)                                          24.982             6.061                            4.122    0.000
   Reported diseases (rate per 10,000)                 0.004              0.039        0.005               0.091    0.928
   Visits to health care providers (rate per 10,000)   0.858              0.053        0.960               16.109   0.000
a. Dependent Variable: Health care funding (amount per 100)

b0 = 24.982, b1 = 0.004, b2 = 0.858
ŷ = 24.982 + 0.004 x1 + 0.858 x2
estimated funding = 24.982 + 0.004 × diseases + 0.858 × visits
2. Interpret the regression coefficients.
54

▪ b0 = 25: This is the estimated health care funding (amount per 100) for a city with zero reported diseases (rate per 10,000) and zero visits to health care providers (rate per 10,000) (doesn’t make sense, why?)
▪ b1 = 0.004: the health care funding increases on average by 0.004 for every one-unit increase in reported diseases while the number of visits remains constant.
▪ b2 = 0.858: the health care funding increases on average by 0.858 for every one-unit increase in visits while the number of reported diseases remains constant.
3. Determine the coefficient of determination and
interpret its value.
55

R-square = 0.930: 93% of the variation in health care funding (amount per 100) is explained by the variation in the number of reported diseases (rate per 10,000) and visits to health care providers (rate per 10,000), so the model is excellent. The remaining 7% (= 100 - 93) is due to other factors.

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .964 (a)   0.930      0.927               3.92604
a. Predictors: (Constant), Visits to health care providers (rate per 10,000), Reported diseases (rate per 10,000)
4. Which one of the independent variables (“disease”
and “visits”) has a linear relation with “funding”
56

▪ To test the significance of the relation between “disease” and “funding”:
H0: β1 = 0 (no linear relation between “disease” and “funding”)
H1: β1 ≠ 0 (there is a linear relation between “disease” and “funding”)
▪ P-value = 0.928: do not reject the null hypothesis and conclude that there is NO linear relation between “funding” and “disease” once “visits” is in the model
▪ Why did the significance of “disease” change between the two models (simple and multiple)?
4. Which one of the independent variables (“disease”
and “visits”) has a linear relation with “funding”
57

▪ To test the significance of the relation between “visits” and “funding”:
H0: β2 = 0 (no linear relation between “visits” and “funding”)
H1: β2 ≠ 0 (there is a linear relation between “visits” and “funding”)
▪ P-value < 0.001: reject the null hypothesis and conclude that there is a linear relation between “funding” and “visits”

5. Is the model significant? Valid?
58

▪ To test significance, we use the hypotheses:
H0: The model is NOT valid (all slope coefficients are zero)
H1: The model is valid (at least one slope coefficient is not zero)
▪ From the ANOVA table in the regression output: P-value < 0.001, so reject the null hypothesis and conclude that the model is valid.
ANOVA (a)
Model           Sum of Squares   df   Mean Square   F         Sig.
1  Regression   9618.740         2    4809.370      312.018   .000 (b)
   Residual     724.448          47   15.414
   Total        10343.188        49
a. Dependent Variable: Health care funding (amount per 100)
b. Predictors: (Constant), Visits to health care providers (rate per 10,000), Reported diseases (rate per 10,000)
6. Predict “funding” of a city with 200 of reported diseases
(rate per 10,000) and 180 Visits to health care providers
(rate per 10,000)
59

estimated funding = 24.982 + 0.004 × diseases + 0.858 × visits
estimated funding = 24.982 + 0.004 × 200 + 0.858 × 180
estimated funding = 180.22

▪ On average, the “funding” of a city with 200 reported diseases (rate per 10,000) and 180 visits to health care providers (rate per 10,000) is expected to be $180.22 (amount per 100)
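To see how much each independent variable contributes to this prediction, the calculation can be broken down term by term. The sketch below is illustrative only: `predict_with_breakdown` is a hypothetical helper, and the coefficients are the rounded values from the SPSS Coefficients table.

```python
def predict_with_breakdown(intercept, coefficients, values):
    """Return the prediction and the contribution b_j * x_j of each variable."""
    contributions = {
        name: coefficients[name] * values[name] for name in coefficients
    }
    prediction = intercept + sum(contributions.values())
    return prediction, contributions


coefficients = {"diseases": 0.004, "visits": 0.858}   # b1 and b2 from the output
city = {"diseases": 200, "visits": 180}

prediction, parts = predict_with_breakdown(24.982, coefficients, city)
print(round(prediction, 2))
for name, amount in parts.items():
    print(name, round(amount, 2))
```

The diseases term adds only about 0.8 to the prediction while the visits term adds about 154.4, which is consistent with “disease” not being significant once “visits” is in the model.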
