Sunteți pe pagina 1din 39

1.

Are Women Paid Less than Men?


Intro and revision of basic statistics
Learning Objectives
1. Review of basic statistics
2. Some basic stata commands
3. Nature of randomness
4. Distributions
5. The use and abuse of the Normal Distribution
6. Basic Hypothesis testing
7. The role of prediction & causation


Wages and Gender
Some Stata commands
We can use data in wages.dta to examine this
issue
.dta is a stata format data file
Start stata via ucd connect
Review the dataset using basic stata commands
Editor
summ
Describe
sort

Wages and gender
Basic answer to the question is to compare
average wages.
Stata:
summ wage if gender==1
Summ wage if gender==2
So we can say wages different in this sample
Can we say this difference would be true
generally?
Note the implicit out of sample prediction in this
question


Statistical Inference
Key point is that would be no problem if we
observed wages of all men and women.
Sampling is the source of the problem
Result could it be dumb luck
Choice of sample could be dumb luck
Isolated example?
What is chance that different sample lead to
radically different result?
Statistical Inference
Key Point: Statistical inference is process of
deciding when we can use a sample result to
make statements about the world
Also referred to as Hypothesis testing
Using this data can we reject the hypothesis that both
groups are paid the same?
Rough ans: we can if difference between the
average wage is large (How large?)
To answer precisely we need to review the nature
of randomness the role of dumb luck

Random Variables
Outcome of an experiment value unknown until
it is observed
Dice, coin, roulette wheel
Getting a job, the wage
In our example the observed differences between
wages may be random and have nothing to do with
gender i.e. dumb luck
Continuous vs Discrete Random Variables
Discrete: Finite no of possible outcomes
Dice, coin, lottery
Useful for intuition
Continuous: Outcome can be any value over a range
Most real world data is continuous

Distribution of Random Variables
Discrete random variable
Establish the frequency of outcomes in repeated
experiments
Continuous Random Variable
Outcome can be any value over a range
Can establish probability of r.v. being with a
particular interval,
not of taking a particular value (zero by definition)
e.g. Household Income, interest rates, price of
bread

Probability Density Function f(x)
Mathematical representation of the
distribution of a random variable
f(x) = Pr(X=x)
prob of dice to get 3 i.e. f(3) = Pr(X=3) = 1/6.







E f(x) = 1; Sum of probabilities = 1, 0 s f(x)
s 1
For a continuous rv
can take on any value within an interval: infinite
no. of values.
NB: the probability of one exact value occurring =
0:
Pr(a s X s b) = ?
Probability is not measured by height now
measured by area


An example of a continuous RV






An example of the bell curve
Pr(Y s1200) = Shaded Area =
1200
0
( ) f y
}

Integral defines the probability:
f(y)




1200
Empirical vs Theoretical Distribution
The examples so far are theoretical
distributions
Real world data wont necessarily match these
exactly or even approximately
We can show the empirical distribution of
data on a histogram
File dice.dta contains 10 dice each rolled 3449
times
hist dice1
Empirical Dist of Dice
0
5
1
0
1
5
2
0
P
e
r
c
e
n
t
1 2 3 4 5 6
dice1
Example using wage data
Histogram will show the distribution
Stata: hist wages, bin(50)
Can also do it separately for the two genders
Hist wages if gender==1, bin(50) norm
Hist wages if gender==2, bin(50) norm
This shows that the distribution of wages
looks a little different for both groups
Not just the average
Note that bell curve is bad approximation

0
.
0
5
.
1
.
1
5
.
2
0 5 10 15 20 0 5 10 15 20
Male Female
Density
normal hwage
D
e
n
s
i
t
y
hwage
Graphs by ==1 for a man, 2 for a woman
Characteristics of a Distribution
We can characterise the difference between
distributions in many ways
The two main are the Expected value and the
Variance
Expected Value is the average
Weighted by probability




Sample and Theoretical Mean
Sample and theoretical mean will be different
dice:
E(X)=1.(1/6)+2.(1/6)+3.(1/6)+4.(1/6)+5.(1/6)+6.(1/
6)=3.5
For dice1: summ dice1
Continuous:
We already did for gender using summ

Rules of Expectations
1. E(X+Y) = E(X)+E(Y)
2. E(X-Y) = E(X)-E(Y)
3. E(aX) =a E(X)
4. E(X+a) = E(X)+a
5. E(aX+bY+cZ) = aE(X)+bE(Y)+cE(X)

See dice.dta for examples of this



Variance of a random variable
Distribution has an average but it also varies
around that average
Need a concept to measure that dispersion




In stata part of output of summ
Var(X) =
( )
2
E X E X
| |
|
\ .

( ) ( )
2
2
2 . E X X E X E X
| |
| |
|
|
|
\ . |
\ .

( ) ( ) ( )
2
2
2 E X E X E X E X
| | | |
| |
\ . \ .

( )
2
2
E X E X
| | | |
| |
\ . \ .

Dice Example
X (X-E(X))
2

1 1 (1-3.5)
2
= 6.25
2 4 (2-3.5)
2
=2.25
3 9 (3-3.5)
2
=0.25
4 15 .0.25
5 25 .2.25
6 36 .6.25
E=21 E=91 E=17.5
2
X
Dice Example cont.
E (X) =X = 21/6 = 3.5
E(X
2
)= 91/6 = 15.166
Var (X) = E(X
2
) (E(X))
2
= 15.166 (3.5)
2
= 2.916
Var(X) =
( )
2
1 17.5
2.916
6
X E X
n
| |
|
\ .
= =



Note That the theoretical variance may differ from the variance in the sample
Stata: summ dice1
Rules of Variance
1. Var (aX) = a
2
Var(X):
2. Var(X+a) = Var(X):
3. Var(a+bX) = b
2
Var(X):
4. Var (aX+bY) = a
2
Var(X) + b
2
Var(Y) if X and Y
are independent. (deal with dependence
later).
Standard Deviation = Square root of
Variance: o = \o
2

Normal Distribution
A special continuous distribution that can be
very useful
AKA Gaussian Distribution, Bell Curve,
Law of Errors
Mean and variance completely define it
X~N(,o)
Bell Curve
Pr(Y s1200) = Shaded Area =
1200
0
( ) f y
}

Integral defines the probability:
f(y)











1200
1000
Calculating Probabilities from Bell
Curve
Integral solved for you by computer
in stata the normal function gives area under
the curve of standard normal rv
Mean=0, variance=1
display normal(0.5)
For other normal make use of trick
If y~N(,o) then z=(y-)/o is N(0,1)
Mean 1000, stn dev 100
Prob(x<1200)=Prob(z<2)
di normal(2)

Properties of the Normal
Symmetric around the mean
Positive and negative deviations are equally likely
The probability of a deviation declines with
the size of a deviation
approx. 68% of the area under the curve lies
between [o,+o|
and 95% is in the interval [2o,+2o|



Using the Bell Curve
We often assume that data has normal
distribution
Easy to use
Often intuitive
But remember nothing is actually normal
We choose to model things as normal
it is only a convenient approximation
Could be very wrong
Always have to ask yourself if it is reasonable to
treat the data as normal
Check histogram
Using the Bell Curve
Nassim Taleb made career out of complaining
that we assume data is normal when it is not
See Fooled by Randomness
Already seen that bad approx to wage data
Bad approximation to stock market data as
grossly underside tails
Low prob of large changes
See sandp.dta
Empirical dist of % Stock Returns
0
.
2
.
4
.
6
D
e
n
s
i
t
y
-4 -2 0 2 4
% change in daily closing price
Answer the question!
We now have some tools which we can use to
answer the question of gender bias in wages
Recall that women are paid less on average in
the sample
Recall that the issue is whether we can use
this fact about the sample to make statements
about the world (population)
An Answer
We will develop a precise method of answering these
kind of questions as we go thorough the course
For the moment we adopt a very simple heuristic
approach
Most people would suggest that the difference in the
means is the key
i.e. 6.701875 -5.451302= 1.250573
So men are paid more than women
We know the problem with this is that it doesnt take
account of possibility that sample may be unusual
The statistical inference problem

An Answer
Suppose we think that by dumb luck all our
women have come from the low end of the
distribution of wages and all our men have
come from the upper end
We would find discrimination but it is just to our
weird sample
So we need to account for the dispersion of
the data
Recall we measure dispersion by the variance

An Answer
Think of how this works
If the variance is large then by dumb luck we could
have chosen a sample that where men came from one
end of the distribution and women from the other
If the variance of wages is small then the distance
between the two ends of the distribution isn't that big
So any difference in the wage is less likely due to
chance
So what matters is the difference in the (mean)
wage relative to the dispersion or variance of
wages
Sample size will also matter
Answer
So we calculate

This is known as the test statistic
If this is large then we will say there is a
statistically significant difference between the
two
How large? Typically around 2
We will see why later
n
t
/
2 1
o

=
Here it is!
Stata:
di (6.701875 -5.451302)/(3.157414/345^0.5)
Gives 7.35
This is more than 2 so we reject the idea of equal
wages
Note we are not saying discrimination is certain
statistically significant difference in wages
The observed difference is large enough that it is
unlikely to have ocurred by chance
We can reject the null hypothesis of equal wages
Note the phrasing can reject

A Comment on Prediction
We have taken a fact about the sample and
inferred a result about the world
Statistical inference
May be wrong
Implicit prediction: women (outside the
sample) will be paid less than men
Over the next few weeks will learn what is
missing from this technique
A Comment on Causation
Implicit causation:
gender causes the pay difference
Only really shown correlation
Need theory to justify causation
Think of effect of schooling vs effect of
intelligence
This is problem even if solve the sampling
issue
What Have We Learned?
1. Most fundamental issue: sampling causes
randomness
2. Theoretical distributions are only approx to
the true distribution
3. Distributions can be characterised by means
and variances
4. Sampling error
makes statistical inference non-trivial
Prediction is implicit in everything
5. Causation is not the same as correlation

What is missing?
Only one of the variables was continuous
We were very lax and vague in the discussion
of statistical inference
Both of these to be corrected in the next
section

S-ar putea să vă placă și