
Gator Tutoring
Fall 2009
Exam 1 Notes

1 Confidence Interval
Confidence intervals provide an estimate for some target parameter (mean, variance, etc.) and are associated with a level of uncertainty, called the level of confidence. They have the form
θ̂ ± ME
where θ̂ is the point estimate of the parameter and ME is the margin of error.

2 One-Sample Testing


2.1 Point Estimates
The point estimate is your "best guess" for the true value of the parameter you're trying to estimate (mean, variance, etc.). Often, you will estimate the population mean (μ), and your point estimate will be the sample mean (X̄). You may also estimate the proportion (p) of the population having a certain characteristic, and your point estimate would be
\hat{p} = \frac{X}{n}
where X is the number of observations that have this characteristic in your sample, and n is the number of observations in your sample. The point estimate for the proportion is denoted p̂ to distinguish it from the "true" proportion p (i.e., the one you are trying to estimate).
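As a quick illustration (not part of the original notes), here is a minimal Python sketch of both point estimates; the sample data and the "greater than 4.0" characteristic are made up.

# Hypothetical sample data (made up for illustration).
sample = [4.1, 3.8, 5.0, 4.6, 4.3, 3.9]

# Point estimate of the population mean: the sample mean.
x_bar = sum(sample) / len(sample)

# Point estimate of a proportion: X observations with the characteristic out of n.
X = sum(1 for value in sample if value > 4.0)
n = len(sample)
p_hat = X / n

print(x_bar, p_hat)  # approximately 4.283 and 0.667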

2.2 Margin of Error


The margin of error of a confidence interval has the form
ME = k(SE)
where k is a number from a statistical table (which represents how many standard deviations away from the mean the data falls) and SE is the standard error.

2.2.1 Statistical Number

The number k is based on the distribution of the data and the level of confidence. For a normal distribution, the number is called a z-number and is looked up by the area that lies above it (in the upper tail); for a two-sided interval with confidence level 1 − α, this upper-tail area is α/2 (the z_{α/2} used in the formulas below). The z-table given on the test lists these upper-tail areas and their corresponding z-values.
If the data has a t-distribution, the number is called a t-number and is found similarly; however, every t-number has a certain degrees of freedom (df) associated with it. For a one-sample test with n observations, the df is n − 1.
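If a statistical table is not at hand, the same z- and t-numbers can be looked up in software. A minimal Python sketch using scipy (the 95% confidence level and n = 20 are just example choices):

from scipy import stats

confidence = 0.95
alpha = 1 - confidence
n = 20  # hypothetical sample size

# z-number with alpha/2 in the upper tail (normal distribution)
z = stats.norm.ppf(1 - alpha / 2)         # about 1.96

# t-number with alpha/2 in the upper tail and n - 1 degrees of freedom
t = stats.t.ppf(1 - alpha / 2, df=n - 1)  # about 2.093

print(z, t)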

2.2.2 Standard Error

If the data has a z-distribution or a t-distribution, the standard error SE is
SE = \frac{s}{\sqrt{n}}
where s is the standard deviation of the sample and n is the number of observations in the sample.
If you're estimating the proportion of the population having a certain characteristic, the standard error SE is
SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
where p̂ is the point estimate for the data set. The statistical number is still based on how the data is distributed.
If the population size (N) is finite and known, we can use the Finite Population Correction (FPC) to get a better measure of the standard error. The standard error under the FPC is
\hat{SE} = \frac{s}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}
where s is the standard deviation of the sample, n is the number of observations in the sample, and N is the size of the (finite) population.
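Putting the pieces together, here is a hedged sketch of a one-sample confidence interval for the mean, using the t-number and the standard error, with and without the FPC (the data and the population size N = 100 are invented):

import math
from scipy import stats

sample = [12.1, 11.4, 13.0, 12.6, 11.9, 12.8, 12.2, 11.7]  # made-up data
n = len(sample)
x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))  # sample standard deviation

alpha = 0.05
t = stats.t.ppf(1 - alpha / 2, df=n - 1)  # statistical number k

se = s / math.sqrt(n)                     # standard error
print("CI:", x_bar - t * se, x_bar + t * se)

# With the finite population correction (hypothetical N = 100).
N = 100
se_fpc = se * math.sqrt((N - n) / (N - 1))
print("CI with FPC:", x_bar - t * se_fpc, x_bar + t * se_fpc)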

3 Two-Sample Testing


3.1 Point Estimates
We are also often interested in estimating the difference in means of two samples; this is known as a two-sample test. The point estimate for the difference is
D = \bar{X} - \bar{Y}
where X̄ and Ȳ are the sample means of the first and second samples, respectively.
We can also estimate the difference in the proportions of two samples. The point estimate for the difference is
\hat{p}_x - \hat{p}_y
where p̂_x and p̂_y are the sample proportions of the first and second samples, respectively.

3.2 Margin of Error

The margin of error for a two-sample test is still of the form ME = k(SE).

3.2.1 Difference in Means - Population Variances Known

If the population variance (σ²) is known for each data set, the statistical number comes from a z-table. The standard error is
SE = \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}
so the margin of error is
ME = z_{\alpha/2} \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}
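A minimal Python sketch of this margin of error, with invented summary statistics (known variances, sample sizes, and a 95% confidence level):

import math
from scipy import stats

var_x, var_y = 4.0, 6.5    # known population variances (hypothetical)
n_x, n_y = 40, 35          # sample sizes (hypothetical)

z = stats.norm.ppf(0.975)  # z_{alpha/2} for a 95% confidence level
se = math.sqrt(var_x / n_x + var_y / n_y)
me = z * se
print(me)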

3.2.2 Difference in Means - Population Variances Unknown

Typically, we don't know the population variances. If the population variances are not known, the statistical number comes from a t-table, which has a certain degrees of freedom. For a two-sample test with n_x observations in the first sample and n_y observations in the second sample, the df is n_x + n_y − 2. If the variances (although unknown) are assumed to be the same for both samples, we use something called the pooled variance. It has the following formula:
S_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}

Once we have the pooled variance, we use the following formula to find the standard error:
SE = \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}
so the margin of error is
ME = t_{\alpha/2} \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}
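A short Python sketch of the pooled-variance margin of error, again with invented summary statistics:

import math
from scipy import stats

s2_x, s2_y = 4.2, 5.1   # sample variances (hypothetical)
n_x, n_y = 15, 12       # sample sizes (hypothetical)

df = n_x + n_y - 2
s2_pooled = ((n_x - 1) * s2_x + (n_y - 1) * s2_y) / df

t = stats.t.ppf(0.975, df=df)  # t_{alpha/2} for a 95% confidence level
se = math.sqrt(s2_pooled / n_x + s2_pooled / n_y)
me = t * se
print(me)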

3.2.3 Difference in Proportions

The statistical number used when estimating the difference in proportions comes from a z-table. The standard error is
SE = \sqrt{\frac{\hat{p}_x(1-\hat{p}_x)}{n_x} + \frac{\hat{p}_y(1-\hat{p}_y)}{n_y}}

so the margin of error is
ME = z_{\alpha/2} \sqrt{\frac{\hat{p}_x(1-\hat{p}_x)}{n_x} + \frac{\hat{p}_y(1-\hat{p}_y)}{n_y}}
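A brief Python sketch of the interval for a difference in proportions, with hypothetical sample proportions and sizes:

import math
from scipy import stats

p_x, p_y = 0.62, 0.54      # sample proportions (hypothetical)
n_x, n_y = 200, 180        # sample sizes (hypothetical)

z = stats.norm.ppf(0.975)  # z_{alpha/2} for a 95% confidence level
se = math.sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)
me = z * se
estimate = p_x - p_y
print(estimate - me, estimate + me)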

4 Stratified Sampling
Sometimes it makes sense to split up our overall population into k different categories, called strata. Each stratum has its own size, mean, and standard deviation. If we want to estimate the overall mean for the entire population, we can use the stratified sample estimate as our point estimate. It has the following formula:
\bar{X}_{strata} = \frac{1}{N} \sum_{j=1}^{k} N_j \bar{X}_j
where N is the overall population size, N_j is the size of the jth stratum, and X̄_j is the mean of the jth stratum.
If we needed to compute the standard error for the mean of a stratum, we could use the finite population correction (as the size of the population is known).
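A minimal Python sketch of the stratified point estimate; the stratum sizes and stratum means below are made up:

# Hypothetical strata: (N_j, mean of stratum j)
strata = [(500, 23.1), (300, 19.4), (200, 27.8)]

N = sum(N_j for N_j, _ in strata)  # overall population size
x_strata = sum(N_j * mean_j for N_j, mean_j in strata) / N
print(x_strata)  # weighted average of the stratum means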
 

4.1 Allocation in Stratified Sampling

For a given total sample size, there are a couple of ways to allocate your observations among the strata.
The first method is the proportional allocation method. To determine the number of observations to sample from the jth stratum, we use the following formula:
n_j = n \frac{N_j}{N}
where N_j is the size of the jth stratum, N is the overall population size (across all strata), and n is the total sample size (across all strata). This gives us the size n_j of the sample we should take from the jth stratum.
The second method is the optimal allocation method. The formula for the size of the jth stratum's sample is
n_j = n \frac{N_j \sigma_j}{\sum_{i=1}^{k} N_i \sigma_i}
where N_j is the size of the jth stratum, σ_j is the standard deviation of the jth stratum, and n is the total sample size (across all strata). Strata that are larger or more variable receive more of the sample.
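The two allocation rules are easy to compare side by side. A small Python sketch with made-up stratum sizes and standard deviations:

# Hypothetical strata: (N_j, sigma_j)
strata = [(500, 4.0), (300, 9.0), (200, 2.5)]
n = 100  # total sample size to allocate

N = sum(N_j for N_j, _ in strata)

# Proportional allocation: n_j = n * N_j / N
proportional = [n * N_j / N for N_j, _ in strata]

# Optimal allocation: n_j = n * N_j * sigma_j / sum_i(N_i * sigma_i)
total_weight = sum(N_j * sigma_j for N_j, sigma_j in strata)
optimal = [n * N_j * sigma_j / total_weight for N_j, sigma_j in strata]

print(proportional)  # [50.0, 30.0, 20.0]
print(optimal)       # the more variable second stratum gets a larger share

In practice the allocations would be rounded to whole numbers of observations.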

5 Median Estimation (The ".4n – 2" rule)

Sometimes we want to find a confidence interval for the median value of our data. The ".4n – 2" rule provides a quick way to do this.
1. First, calculate the following number:
.4n − 2
where n is the number of observations in the sample.
2. Round the number you obtained in Step 1 to the nearest integer, and call this number r.
Then, use the rth-smallest observation in your data set as the lower bound, and the rth-largest observation in your data set as the upper bound, and you will have a confidence interval of approximately 95%.
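A quick Python sketch of the rule, using an invented data set:

data = [3, 7, 8, 12, 13, 14, 18, 21, 22, 25, 27, 30, 31, 35, 40]  # made-up sample
n = len(data)

r = round(0.4 * n - 2)   # the ".4n - 2" rule
ordered = sorted(data)

lower = ordered[r - 1]   # r-th smallest observation
upper = ordered[n - r]   # r-th largest observation
print(lower, upper)      # approximate 95% confidence interval for the median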

6 Outliers
The easiest way to check for outliers is to see if they fall below the lower fence or above the upper fence.
1. Calculate the interquartile range (IQR), which is the difference between the third quartile (Q3) of the data and the first quartile (Q1) of the data. 50% of your data lies in this range.
2. The lower fence is Q1 − 1.5(IQR).
3. The upper fence is Q3 + 1.5(IQR).
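A short Python sketch of the fence check, with a made-up data set; note that different quartile conventions give slightly different fences:

import statistics

data = [5, 7, 8, 9, 10, 11, 12, 13, 14, 40]  # made-up sample with one suspect value

q1, _, q3 = statistics.quantiles(data, n=4)  # first and third quartiles
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)  # 40 falls above the upper fence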

7 Hypothesis Testing
Be familiar with the terminology:
• H0: the null hypothesis. This is what we are assuming to be true.
• H1: the alternate hypothesis. This is the other possibility, if we find enough evidence to reject the null hypothesis.
• Rejection Region: If our sample estimate falls in this region, we have enough evidence to reject the null hypothesis.
• Critical Value: Determines the rejection region; it is based on a certain level of confidence.
• P-value: For a sample, the smallest error level α at which we could still reject the null hypothesis (which corresponds to the largest level of confidence we could have), given the data.

7.1 Which test?

• If the null hypothesis is of the form μ > c, use a lower-tail test (reject H0 if X̄ is too small).
• If the null hypothesis is of the form μ < c, use an upper-tail test (reject H0 if X̄ is too large).
• If the null hypothesis is of the form μ = c, use a two-sided test (reject H0 if X̄ is too small or too large).

7.2 Finding the Critical Value/P-value

To find the critical value that we will use to define our rejection region, simply find the upper-tail area α associated with the given level of confidence in a statistical table, and reference the statistical number (z-number, t-number, etc.) that corresponds with that α.
We can then rearrange our confidence interval formula to solve for k, the statistical number:
k = \frac{\hat{\theta} - \theta}{s / \sqrt{n}}
where θ̂ is the estimated parameter, θ is the "true" parameter (which we are assuming has the value given by the null hypothesis), s is the sample standard deviation, and n is the number of observations in the sample. If this number is in the rejection region (given by the critical value), we will reject the null hypothesis.
So, if we were estimating the mean, and if our sample had a z-distribution, we could find the z-number of our sample by
z = \frac{\bar{X} - \mu}{s / \sqrt{n}}
We could then compare this with the critical number to see if it falls in the rejection region. This process would be similar for estimating the mean with a t-distribution.
The p-value is the smallest α that we can have while still rejecting our null hypothesis (i.e., still making a statistically significant conclusion). To solve for it, we simply need to find the upper-tail α that corresponds with the statistical number
from our sample. We find this the same way we did when we were comparing our sample to the critical number.
For example, if we used the formula
z = \frac{\bar{X} - \mu}{s / \sqrt{n}}
and found our z-number to be 1.28, we could just look this up in our tables, and the percentage of data in the upper tail (α) would be our p-value. The process is similar for a t-distribution.
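As a minimal Python sketch of this calculation (the numbers are invented), for an upper-tail test of H0: μ = 50 against H1: μ > 50:

import math
from scipy import stats

x_bar, mu_0, s, n = 51.2, 50.0, 4.0, 36   # hypothetical sample summary

z = (x_bar - mu_0) / (s / math.sqrt(n))   # test statistic
p_value = stats.norm.sf(z)                # area in the upper tail beyond z

print(z, p_value)  # z is about 1.8, p-value about 0.036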

8 Bins
Here's the table from the book, which tells us the recommended number of bins based on different sample sizes:

Sample Size        Number of Bins
Fewer than 50      5-7
50 to 100          7-8
101 to 500         8-10
501 to 1000        10-11
1001 to 5000       11-14
More than 5000     14-20
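If it is useful to have this lookup in code form, here is a small helper that mirrors the table (the function name is just illustrative):

def recommended_bins(sample_size):
    # Recommended range of histogram bins for a given sample size, per the table above.
    if sample_size < 50:
        return (5, 7)
    if sample_size <= 100:
        return (7, 8)
    if sample_size <= 500:
        return (8, 10)
    if sample_size <= 1000:
        return (10, 11)
    if sample_size <= 5000:
        return (11, 14)
    return (14, 20)

print(recommended_bins(250))  # (8, 10)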
