
(Not) Relationships among Variables
Descriptive stats (e.g., mean, median, mode, standard deviation)
describe a sample of data
z-test &/or t-test for a single population parameter (e.g., mean)
infer the true value of a single variable
ex: mean # of random digits that people can memorize
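As a concrete sketch of this single-variable case, the snippet below computes descriptive stats for a sample and runs a one-sample t-test against a hypothesized mean of 7; the digit-span scores and the hypothesized value are made up for illustration, and it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

# Hypothetical digit-span scores for a sample of people (made-up data)
digits_memorized = np.array([6, 8, 7, 5, 9, 7, 6, 8, 7, 7, 5, 8])

# Descriptive statistics for the sample
print("mean:", digits_memorized.mean())
print("sample std:", digits_memorized.std(ddof=1))

# One-sample t-test: is the population mean different from 7?
t_stat, p_value = stats.ttest_1samp(digits_memorized, popmean=7)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```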
Relationships among Variables
But relationships between two or more variables are the crucial feature of almost all scientific research.
Examples:
How does the perception of a stimulus vary with the physical intensity of that stimulus?
How does the attitude towards the President vary with the socio-economic properties of the survey respondent?
How does performance on a mental task vary with age?
Relationships among Variables
More Examples:
How does depression vary with the number of traumatic experiences?
How does undergraduate drinking vary with performance in quantitative courses?
How does memory performance vary with attention span?
etc...
We've already learned a few ways to analyze relationships between two variables.


Relationships among Two Variables: Chi-Square
Chi-Square test of independence (2-way contingency table)
compare observed cell frequencies to the cell frequencies you'd expect if the two variables were independent.
ex:
X = geographical region: West Coast, Midwest, East Coast
Y = favorite color: red, blue, green
Note: both variables are categorical
Relationships among Two Variables: Chi-Square
Observed frequencies:

             West Coast   Midwest   East Coast   total
  Red               49        30           18      97
  Blue              52        32           20     104
  Green            130        62          107     299
  total            231       124          145     500

Expected frequencies:  (row total)(column total) / grand total

             West Coast   Midwest   East Coast   total
  Red            44.81     24.06        28.13      97
  Blue           48.05     25.79        30.16     104
  Green         138.14     74.15        86.71     299
  total            231       124          145     500
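A quick sketch of the expected-frequency formula applied to the observed table above; NumPy is used here only as a convenient way to do the arithmetic.

```python
import numpy as np

# Observed frequencies: rows = Red, Blue, Green; columns = West Coast, Midwest, East Coast
observed = np.array([
    [ 49, 30,  18],
    [ 52, 32,  20],
    [130, 62, 107],
])

row_totals = observed.sum(axis=1)    # [97, 104, 299]
col_totals = observed.sum(axis=0)    # [231, 124, 145]
grand_total = observed.sum()         # 500

# expected cell frequency = (row total)(column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(np.round(expected, 2))
```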

Relationships among Two Variables: Chi-Square

$\chi^2 = \sum_{\text{all cells}} \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}} \approx 17.97$

If this exceeds the critical value, reject H0 that the two variables are independent (unrelated).
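As a sketch, the same test can be run end-to-end with SciPy's chi-square test of independence; it reproduces the expected-frequency table and a statistic near 17.97 for the observed counts above.

```python
import numpy as np
from scipy import stats

observed = np.array([
    [ 49, 30,  18],   # Red
    [ 52, 32,  20],   # Blue
    [130, 62, 107],   # Green
])

# Chi-square test of independence on the 2-way contingency table
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# chi2 comes out near 17.97, with dof = (3-1)(3-1) = 4
```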
Relationships among Two Variables: z, t tests
z-test &/or t-test for a difference of population means
compare values of one variable (Y) for 2 different levels/groups of another variable (X)
ex:
X = age: young people vs. old people
Y = # of random digits a person can memorize
Q: Is the mean # of digits the same for the 2 age groups?
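A minimal sketch of this two-group comparison, with made-up digit-span scores for the two age groups (SciPy assumed available):

```python
import numpy as np
from scipy import stats

# Hypothetical digit-span scores (made-up data)
young = np.array([8, 7, 9, 6, 8, 7, 9, 8])
old   = np.array([6, 5, 7, 6, 5, 7, 6, 6])

# Two-sample t-test: is the mean # of digits the same for the two groups?
t_stat, p_value = stats.ttest_ind(young, old)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```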
Relationships among Two Variables: ANOVA
ANOVA
compare values of one variable (Y) for 3+ different levels/groups of another variable (X)
ex:
X = age: young people, middle-aged, old people
Y = # of random digits a person can memorize
Q: Is the mean # of digits the same for all 3 age groups?
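A sketch of the 3-group version using a one-way ANOVA, again with made-up scores:

```python
import numpy as np
from scipy import stats

# Hypothetical digit-span scores for three age groups (made-up data)
young  = np.array([8, 7, 9, 6, 8, 7, 9, 8])
middle = np.array([7, 6, 8, 7, 6, 8, 7, 7])
old    = np.array([6, 5, 7, 6, 5, 7, 6, 6])

# One-way ANOVA: is the mean # of digits the same for all 3 groups?
f_stat, p_value = stats.f_oneway(young, middle, old)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```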
[Histogram: frequency of # digits memorized, young vs. old groups]
Relationships among Two Variables: z, t & ANOVA
NOTE: for z/t tests for differences, and for ANOVA, there are a small number of possible values for one of the variables (X)
z, t: [Histogram: frequency of # digits memorized, young vs. old groups]
ANOVA: [Histogram: frequency of # digits memorized, young, middle-aged, and old groups]
Relationships among Two Variables: many values of X?
What about when there are many possible values of BOTH variables? Maybe they're even continuous (rather than discrete)?
Correlation and Simple Linear Regression will be used to analyze the relationship between two such variables.
[Scatter plot]
Correlation: Scatter Plots
[Scatter plot]
Does it look like there is a relationship?
Correlation
measures the direction and strength of a linear relationship between two variables
that is, it answers in a general way the question: as the values of one variable change, how do the corresponding values of the other variable change?
Linear Relationship
linear relationship: y = a + bx (straight line)
[Scatter plots: Linear vs. Not (strictly) Linear]
Correlation Coefficient: r
sign: direction of relationship
magnitude (number): strength of relationship
−1 ≤ r ≤ 1
r = 0: no linear relationship
r = −1: perfect negative correlation
r = +1: perfect positive correlation
Notes:
Symmetric measure (you can exchange X and Y and get the same value)
Measures linear relationship only
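A tiny sketch of those two notes, using NumPy's built-in correlation function on made-up data: exchanging X and Y leaves r unchanged, and a strong but non-linear (quadratic) relationship can still give r near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)       # roughly linear relationship

# Symmetric: r(x, y) equals r(y, x)
print(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])

# Linear only: z depends strongly on x, but not linearly, so r is near 0
z = x ** 2
print(np.corrcoef(x, z)[0, 1])
```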
Correlation Coefficient: r
Formula:

$r = \dfrac{1}{n-1}\sum_{i=1}^{n}\left(\dfrac{x_i-\bar{x}}{s_x}\right)\left(\dfrac{y_i-\bar{y}}{s_y}\right)$

(the terms in parentheses are standardized values)

alt. formula (ALEKS):

$r = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{(n-1)\,s_x\,s_y}$
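A short sketch computing r directly from the standardized-values formula and checking it against NumPy's built-in; the paired data are made up for illustration.

```python
import numpy as np

# Made-up paired data (e.g., hours studied vs. exam score)
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([55.0, 60.0, 70.0, 72.0, 85.0, 90.0])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)    # standardized x values
zy = (y - y.mean()) / y.std(ddof=1)    # standardized y values

r = np.sum(zx * zy) / (n - 1)          # r = (1/(n-1)) * sum of products of z-scores
print(r, np.corrcoef(x, y)[0, 1])      # the two values agree
```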
Correlation: Examples
Population: undergraduates
[Scatter plot examples]
Others?
Correlation: Interpretation
Correlation ≠ Causation!
When 2 variables are correlated, the causality may be:
X --> Y
X <-- Y
Z --> X & Y (lurking third variable)
or a combination of the above
Examples: ice cream & murder, violence & video games, SAT verbal & math, booze & GPA
Inferring causation requires consideration of: how the data were gathered (e.g., experiment vs. observation), other relevant knowledge, logic...
Simple Linear Regression
PREDICTING one variable (Y) from another (X)
No longer symmetric like correlation
One variable is used to explain another variable
X variable: Independent Variable, Explanatory Variable, Exogenous Variable, Predictor Variable
Y variable: Dependent Variable, Response Variable, Endogenous Variable, Criterion Variable
Simple Linear Regression
idea: find a line (linear function) that best fits the scattered data points
this will let us characterize the relationship between X & Y, and predict new values of Y for a given X value.
Reminder: (Simple) Linear Function Y = a + bX
We are interested in this to model the relationship between an independent variable X and a dependent variable Y.
[Diagram: the line Y = a + bX, with intercept a at (0, a) and slope b]
Simple Linear Regression
If we had errorless predictions, Y = a + bX:
all data points would fall right on the line
[Scatter plot: data points lying exactly on the line]
[Sequence of scatter plots of Y vs. X with candidate regression lines:]
A guess at the location of the regression line
Another guess at the location of the regression line (same slope, different intercept)
Another guess at the location of the regression line (same intercept, different slope)
Another guess at the location of the regression line (different intercept and slope, same center)
We will end up being reasonably confident that the true regression line is somewhere in the indicated region.
Estimated Regression Line
errors/residuals: $e_i = y_i - \hat{y}_i$
Error terms have to be drawn vertically.
Equation of the regression line: $\hat{Y} = a + bX$
$\hat{y}_i$ = the predicted value of Y for $x_i$
[Scatter plot: estimated regression line with vertical residuals $e_i$ at each $x_i$]
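A small sketch of the residuals for a fitted line; the data and the line's coefficients are assumed values chosen only for illustration.

```python
import numpy as np

# Made-up data and an assumed fitted line y_hat = a + b*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a, b = 0.1, 2.0                  # assumed intercept and slope

y_hat = a + b * x                # predicted values on the line
residuals = y - y_hat            # e_i = y_i - y_hat_i (vertical distances)
print(residuals)
```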

Estimating the Regression Line
Idea: find the formula for the line that minimizes the squared errors
error: distance between the actual data point and the predicted value
$\hat{Y} = b_0 + b_1 X$   (same form as Y = a + bX)
$b_1$ = slope of the regression line
$b_0$ = Y intercept of the regression line

$b_1$ (slope):

$b_1 = \dfrac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{N}(X_i-\bar{X})^2}$

ALEKS: $b_1 = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{(n-1)\,s_x^2}$

using the correlation coefficient: $b_1 = r\,\dfrac{s_y}{s_x}$

$b_0$ (Y intercept):

$b_0 = \bar{Y} - b_1\bar{X}$
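A minimal sketch of these formulas in code, checked against NumPy's least-squares line fit (np.polyfit); the paired data are made up.

```python
import numpy as np

# Made-up paired data
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([55.0, 60.0, 70.0, 72.0, 85.0, 90.0])

# Slope and intercept from the least-squares formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Equivalent slope via the correlation coefficient: b1 = r * sy / sx
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * y.std(ddof=1) / x.std(ddof=1)

# Check against NumPy's built-in least-squares line fit
slope, intercept = np.polyfit(x, y, deg=1)
print(b1, b1_alt, slope)       # all three slopes agree
print(b0, intercept)           # both intercepts agree

# Predict Y for a new X value using the estimated line
x_new = 6.0
print("predicted y:", b0 + b1 * x_new)
```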
