
Correlation and Regression

1
Correlation and Regression Analysis

When to use:
• Both X and Y are variable (continuous) data.
• The data on X and Y are paired (for each value of X,
there is a corresponding value of Y).

2
Three steps

• Scatter Plot.
• Correlation Analysis.
• Regression Analysis.

3
Scatter Plot
Construction of a scatter diagram:
 Collect paired samples of data from two variables that you
think might be related and make a dataset.
 Place the supposed independent variable, the one that
potentially “affects” a change in the other variable, on the
X axis.
 Place the potential response or dependent variable on the
Y axis.

Refer to file Gasoline.mtw
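
The construction steps above can be sketched in Python. This is a minimal sketch, not the Minitab procedure used with Gasoline.mtw: the helper names split_pairs and draw_scatter are hypothetical, the sample values are made up for illustration, and the drawing step assumes the third-party matplotlib package is installed.

```python
def split_pairs(pairs):
    """Separate (x, y) pairs into X and Y lists, skipping incomplete pairs."""
    complete = [(x, y) for x, y in pairs if x is not None and y is not None]
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    return xs, ys


def draw_scatter(xs, ys, x_label, y_label, path="scatter.png"):
    """Draw the scatter diagram: independent variable on X, response on Y."""
    # matplotlib is a third-party package, imported only when a plot is drawn
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(x_label)  # supposed cause
    ax.set_ylabel(y_label)  # potential response
    fig.savefig(path)
    plt.close(fig)


# Hypothetical paired observations (NOT the contents of Gasoline.mtw)
observations = [(99.0, 88.5), (99.2, 89.1), (None, 90.0), (99.6, 90.4)]
xs, ys = split_pairs(observations)
```

Calling draw_scatter(xs, ys, "Catalyst purity (X)", "Octane number (Y)") would then write the diagram to scatter.png.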

4
Scatter Plot

5
Scatter Plot

There seems to be a positive correlation between the purity of
the catalyst (X) and the octane number of the gasoline (Y).

6
Scatter Plot
There are many types of scattering patterns.
(Figure: six example scatter plots, each with n = 30; x-axis: cause, y-axis: effect.)
• r = 0.9: Positive Correlation
• r = -0.9: Negative Correlation
• r = 0.6: Positive Correlation May Be Present
• r = -0.6: Negative Correlation May Be Present
• r = 0.0: No Correlation
• r = 0.0: No Linear Correlation

7
Correlation Analysis
• Scatter diagrams or plots provide a graphical
representation of the relationship.

• Correlation Coefficient: A metric that is commonly
used for representing a linear relationship between
two continuous variables: X and Y.
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
\]
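
The formula can be checked by hand on a small dataset. The sketch below computes Sxy, Sxx and Syy directly and combines them into r. The function name pearson_r and the five data pairs are illustrative assumptions, not values from the course files.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must be paired: same number of values")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squares of the deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when X or Y is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustration data: Sxy = 6, Sxx = 10, Syy = 6
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
print(round(r, 3))
```

For this data r = 6 / sqrt(60), about 0.775: a positive but far from perfect linear relationship.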

8
r = Linear Correlation Coefficient

The correlation coefficient has two components:

1. Direction (given by the sign, + or -)
2. Strength (given by the absolute value)

The hypotheses being tested are:

H0: X and Y are not correlated
Ha: X and Y are correlated

If p < 0.05, we reject H0 and conclude that X and Y are correlated.
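
The slides do not say how the p-value is obtained; statistical software reports it directly. As a sketch of one standard approach, the test statistic t = r * sqrt((n - 2) / (1 - r^2)) follows Student's t distribution with n - 2 degrees of freedom when H0 is true. The code below integrates the t density numerically with Simpson's rule to get a two-sided p-value. The function names are illustrative assumptions.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def correlation_p_value(r, n, steps=2000):
    """Two-sided p-value for H0: X and Y are not correlated."""
    if n < 3:
        raise ValueError("need at least three pairs")
    if abs(r) >= 1:
        return 0.0
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the area between 0 and t
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # The two tails beyond -t and +t hold whatever is left of the total area
    return max(0.0, 1 - 2 * area)


p = correlation_p_value(0.9, 30)
print("reject H0" if p < 0.05 else "fail to reject H0")
```

With r = 0.9 and n = 30 the p-value is far below 0.05, so H0 is rejected.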

9
r = Linear Correlation Coefficient
Rule of thumb:
p < 0.05 for correlation to exist.
Once correlation exists, its strength can be classified as
follows:

Absolute value of correlation coefficient (|r|)    Type of correlation
> 0.9                                              Strong
Between 0.7 and 0.9                                Moderate
< 0.7                                              Weak
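
The rule of thumb can be written as a small function. This is a sketch with two assumptions the table does not settle: the strength is judged on the absolute value of r, and the boundary values 0.7 and 0.9 themselves are counted as Moderate. The function name classify_correlation is hypothetical.

```python
def classify_correlation(r, p_value, alpha=0.05):
    """Apply the rule of thumb: significance first, then strength of |r|."""
    if p_value >= alpha:
        return "Not significant"
    strength = abs(r)  # the sign only carries the direction
    if strength > 0.9:
        return "Strong"
    if strength >= 0.7:  # assumption: 0.7 and 0.9 exactly count as Moderate
        return "Moderate"
    return "Weak"


print(classify_correlation(-0.95, 0.001))
print(classify_correlation(0.8, 0.01))
print(classify_correlation(0.5, 0.03))
print(classify_correlation(0.95, 0.2))
```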

10
r = Linear Correlation Coefficient

Since p < 0.05, reject H0. As such, X and Y are correlated.


11
Tips on dealing with accidental
low and high correlations

Low Accidental Correlation

For each process variable, determine the proportion of its normal range that the data
covers. The data may not cover the complete range of the process variable. Collect more
data until the sampled range covers close to 100% of the normal range.
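
A quick way to check range coverage is sketched below. The function name range_coverage, the temperature readings and the normal range of 150 to 200 are all hypothetical; the normal range has to come from process knowledge.

```python
def range_coverage(values, normal_low, normal_high):
    """Fraction of the normal operating range spanned by the sampled values."""
    if normal_high <= normal_low:
        raise ValueError("normal_high must be greater than normal_low")
    # Clip the sampled span to the normal range before measuring it
    low = max(min(values), normal_low)
    high = min(max(values), normal_high)
    return max(0.0, high - low) / (normal_high - normal_low)


# Hypothetical readings of a process variable whose normal range is 150 to 200
readings = [170, 172, 175, 180]
coverage = range_coverage(readings, 150, 200)
print(f"{coverage:.0%} of the normal range covered")
```

Here the readings span only a fifth of the normal range, so a low r could simply reflect the narrow window that was sampled.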

High Accidental Correlation

If a high r value occurs with a high p-value, a possible reason is a small sample
size.
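
To see why a small sample can give a high r with a high p-value, the sketch below computes the smallest |r| that reaches p < 0.05 for a few sample sizes. It assumes the usual t-test for a correlation coefficient and uses two-sided 5% critical values of Student's t from standard tables. The function name min_significant_r is hypothetical, and only the listed sample sizes are supported.

```python
import math

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom
# (standard table values; only a few are listed for this illustration)
T_CRIT_05 = {2: 4.303, 3: 3.182, 8: 2.306, 28: 2.048}


def min_significant_r(n):
    """Smallest |r| that gives p < 0.05 with n pairs (n - 2 degrees of freedom)."""
    df = n - 2
    t = T_CRIT_05[df]
    # Invert t = r * sqrt(df / (1 - r^2)) to express r in terms of t
    return t / math.sqrt(t * t + df)


for n in (4, 5, 10, 30):
    print(n, round(min_significant_r(n), 3))
```

With only 4 pairs, |r| has to reach about 0.95 to be significant, so r = 0.9 still comes with p > 0.05; with 30 pairs, about 0.36 is enough.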

12
Correlation vs. Causation
• It is important to keep in mind that a strong
mathematical (or graphical) relationship between two
variables does not confirm that one causes the other.
Two variables can be highly related to one another, but
neither is caused by the other.
• Validation of root cause is made only when two
requirements are met:
– There is a statistically significant relationship between the
suspected root cause and the effect.
– Knowledge of the process corroborates this causal relationship.

• One of these alone is not adequate for validation.

13
Finding Relationships in Data
• One of the most important aspects of statistical analysis in Six
Sigma is the identification of a mathematical model (equation) that
explains relationships present in a dataset.
• If X & Y variables are continuous, the method used is Regression,
also sometimes referred to as “curve fitting”.
Regression provides:
• A hypothesis test of whether each input variable (X) is significantly
correlated with the response (Y) under study.
• A quantitative estimate of the relationship of each input variable
with the response.
– A coefficient in a mathematical equation.
• An estimate of how much of the total variation in the response is
explained by each factor.

14
Regression Analysis
• Two types of regression:
– If there is one X and one Y, it is called simple linear
regression
• Model will be Y = b0 + b1 X
• Where b0 is the y-intercept
• b1 is the slope (Δy/Δx)
(Figure: fitted line on X-Y axes, showing the intercept b0 and the slope b1 = Δy/Δx.)
– If there are multiple X’s and one Y, then we have
multiple linear regression
• Model will be Y = b0 + b1 X1 + b2 X2 + b3 X3 + ...
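
For the simple linear model, the least-squares estimates are b1 = Sxy / Sxx and b0 = (mean of Y) - b1 * (mean of X). The sketch below applies them to a small made-up dataset; the function name fit_line and the data are illustrative assumptions, and the multiple-regression case is not covered.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model Y = b0 + b1 * X."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must be paired: same number of values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("slope is undefined when X is constant")
    b1 = s_xy / s_xx          # slope: change in Y per unit change in X
    b0 = y_bar - b1 * x_bar   # intercept: fitted Y at X = 0
    return b0, b1


# Hypothetical illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_line(x, y)
print(f"Y = {b0:.1f} + {b1:.1f} X")
```

For this data the fitted line is Y = 2.2 + 0.6 X.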

15
Simple Linear Regression

Refer to file Gasoline.mtw

16
Analysis & Interpretation –
Coefficient Estimates

17
Analysis & Interpretation –
Coefficient of determination

Visualize and interpret the following scenarios:

1.1 r = 0.9; r^2 = 0.81; b1 = 2.0
1.2 r = 0.9; r^2 = 0.81; b1 = 8.0

2.1 r = 0.8; r^2 = 0.64; b1 = 3.0
2.2 r = 0.4; r^2 = 0.16; b1 = 3.0

Try drawing the scatter diagram for each of these situations by hand, then discuss.
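
The scenarios turn on the fact that the slope b1 and the correlation r measure different things: b1 says how steep the fitted line is, while r (and r^2) say how tightly the points cluster around it. The sketch below illustrates both contrasts on made-up data; the numbers differ from those in the scenarios, and the function name summarize is hypothetical.

```python
import math


def summarize(xs, ys):
    """Return (r, r_squared, b1) for a simple linear fit of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Coefficient of determination: share of the variation in Y explained by the line
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / s_yy
    r = s_xy / math.sqrt(s_xx * s_yy)
    return r, r_squared, b1


# Hypothetical illustration data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
base = summarize(x, y)

# Same r, steeper slope: multiplying Y by 4 rescales the slope only
steeper = summarize(x, [4 * v for v in y])

# Same slope, weaker r: tripling each residual adds scatter around the same line
b1 = base[2]
b0 = sum(y) / len(y) - b1 * sum(x) / len(x)
noisier_y = [b0 + b1 * xi + 3 * (yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)]
noisier = summarize(x, noisier_y)

for label, (r, r2, slope) in (("base", base), ("steeper", steeper), ("noisier", noisier)):
    print(f"{label}: r = {r:.2f}, r^2 = {r2:.2f}, b1 = {slope:.2f}")
```

The first contrast mirrors scenarios 1.1 and 1.2 (same r, different slope); the second mirrors 2.1 and 2.2 (same slope, different r).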

18
