
Data Analysis and Surveying 101:

Basic research methods and biostatistics as they apply to the

Theresa Jackson Hughes, MPH


American College Health Association
December 2006

What we will cover today
Research Methods

Sampling Frame and Sampling


Generalizability
Bias
Reliability and Validity
Levels of measurement

Biostatistics

Statistical significance
Other key terms
Appropriate statistical tests
Fun examples from the Spring 2005 dataset!

Get excited! It's data time!!!

Research Methods

To do successful research, you don't need to know everything, you just need to know of one thing that isn't known.
Arthur Schawlow

That's the nature of research - you don't know what in hell you're doing.
Harold "Doc" Edgerton

If we knew what it was we were doing, it would not be called research, would it?
Albert Einstein

What exactly is research?

Scientific research is systematic, controlled, empirical, and critical investigation of natural phenomena guided by theory and hypotheses about the presumed relations among such phenomena.
Kerlinger, 1986

Research is an organized and systematic way of finding answers to questions

Important Components of Empirical Research
Problem statement, research questions, purposes, benefits
Theory, assumptions, background literature
Variables and hypotheses
Operational definitions and measurement
Research design and methodology
Instrumentation, sampling
Data analysis
Conclusions, interpretations, recommendations

Sampling
What is your population of interest?
To whom do you want to generalize your results?
All students (18 and over)
Undergraduates only
Greeks
Athletes
Other

Can you sample the entire population?

Sampling
A sample is a smaller (but hopefully representative) collection of units from a population used to determine truths about that population (Field, 2005)
Why sample?
Resources (time, money) and workload
Gives results with known accuracy that can be calculated mathematically

The sampling frame is the list from which the potential respondents are drawn
Registrar's office
Class rosters
Must assess sampling frame errors

Types of Samples
Probability (Random) Samples
Simple random sample
Systematic random sample
Stratified random sample
Proportionate
Disproportionate

Cluster sample

Non-Probability Samples
Convenience sample
Purposive sample
Quota
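
As a rough illustration of the probability samples listed above, the sketch below draws a simple random, a systematic, and a proportionate stratified sample from a sampling frame. The function names, the seed argument, and the toy frame of student IDs are all assumptions made for this example; they do not come from the presentation.

```python
import random


def simple_random_sample(frame, n, seed=0):
    """Every unit in the frame has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, n, seed=0):
    """Pick a random start, then take every k-th unit from the frame."""
    rng = random.Random(seed)
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random start within the first interval
    return [frame[start + i * k] for i in range(n)]


def proportionate_stratified_sample(strata, n, seed=0):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(n * len(members) / total)
        sample[name] = rng.sample(members, share)
    return sample


# Hypothetical sampling frame: 1,000 student IDs.
frame = list(range(1000))
srs = simple_random_sample(frame, 50)
systematic = systematic_sample(frame, 100)
stratified = proportionate_stratified_sample(
    {"undergraduate": frame[:600], "graduate": frame[600:]}, 100
)
```

A disproportionate stratified sample would use the same idea but set each stratum's share by hand, for example to oversample a small group.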

Sample Size
Depends on expected response rate
Average 85% for paper: FINAL SAMPLE DESIRED / .85 = SAMPLE
Average 25% for web: FINAL SAMPLE DESIRED / .25 = SAMPLE

Size of Campus     Final Desired N
<600               All students
600-2,999          600
3,000-9,999        700
10,000-19,999      800
20,000-29,999      900
≥30,000            1,000
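
The response-rate adjustment above is simple division, sketched here for a campus of 3,000-9,999 students (final desired N of 700). The function name is hypothetical, and rounding up to a whole person is an assumption of this sketch.

```python
import math


def invitations_needed(final_n_desired, expected_response_rate):
    """Number of students to sample so that roughly final_n_desired respond."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("response rate must be in (0, 1]")
    # Round up: you cannot invite a fraction of a student.
    return math.ceil(final_n_desired / expected_response_rate)


paper_sample = invitations_needed(700, 0.85)  # paper survey, 85% response
web_sample = invitations_needed(700, 0.25)    # web survey, 25% response
print(paper_sample, web_sample)               # 824 2800
```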

Bias and Error

Systematic Error or Bias: unknown or unacknowledged error created during the design, measurement, sampling, procedure, or choice of problem studied
Error tends to go in one direction
Examples: Selection, Recall, Social desirability

Random Error
Unrelated to true measures
Example: Momentary fatigue

Reliability and Validity

Reliability
The extent to which a test is repeatable and yields consistent scores
Affected by random error/bias

Validity
The extent to which a test measures what it is supposed to measure
A subjective judgment made on the basis of experience and empirical indicators
Asks "Is the test measuring what you think it's measuring?"
Affected by systematic error/bias

Reliability vs. Validity

In order to be valid, a test must be reliable; but reliability does not guarantee validity.

Levels of Measurement

Nominal
Gender: Male, Female
Vaccinations: Yes, No, Unsure

Ordinal
Personal health status: Excellent, Very good, Good, Fair, Poor
Last 30 days: Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days

Interval
Body Mass Index (BMI)

Ratio
Number of drinks
Number of sexual partners
Perception percentages
Blood alcohol concentration (BAC)

Biostatistics

It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.
R. Hooke

Torture numbers, and they'll confess to anything.
Gregg Easterbrook

98% of all statistics are made up.
Author Unknown

Types of Statistics
Descriptive statistics
Describe the basic features of data in a
study
Provide summaries about the sample and
measures

Inferential statistics
Investigate questions, models, and
hypotheses
Infer population characteristics based on
sample
Make judgments about what we observe

Descriptive Statistics

Frequency
Central Tendency: Mode, Median, Mean
Variation: Range, Variance, Standard Deviation
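
A minimal sketch of these descriptive measures using Python's standard `statistics` module. The six "number of drinks" responses are made up for illustration and are not taken from the Spring 2005 dataset.

```python
import statistics

# Hypothetical responses to "how many drinks" from six students.
drinks = [0, 2, 4, 4, 5, 9]

# Central tendency
mode = statistics.mode(drinks)        # most frequent value
median = statistics.median(drinks)    # middle value
mean = statistics.mean(drinks)        # arithmetic average

# Variation
value_range = max(drinks) - min(drinks)
variance = statistics.variance(drinks)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(drinks)       # square root of the sample variance

print(mode, median, mean, value_range, variance, round(std_dev, 3))
```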

Descriptive Statistics Examples
Categorical Variables (Nominal/Ordinal)

Q1 Gen health
                          Frequency    Percent    Valid Percent    Cumulative Percent
Valid      1 excellent         9145       16.9             17.0                  17.0
           2 very good        23767       43.9             44.2                  61.2
           3 good             16442       30.4             30.6                  91.8
           4 fair              3737        6.9              6.9                  98.7
           5 poor               565        1.0              1.1                  99.8
           6 don't know         132         .2               .2                 100.0
           Total              53788       99.4            100.0
Missing    System               323         .6
Total                         54111      100.0
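
The three percentage columns in the frequency table differ only in their denominator, which the following sketch makes explicit using the counts from the table. Variable names are illustrative.

```python
# Frequencies from the Q1 Gen health table.
counts = {
    "excellent": 9145,
    "very good": 23767,
    "good": 16442,
    "fair": 3737,
    "poor": 565,
    "don't know": 132,
}
missing = 323

valid_total = sum(counts.values())     # respondents who answered
grand_total = valid_total + missing    # everyone, including missing

rows = []
running = 0
for label, count in counts.items():
    running += count
    rows.append((
        label,
        round(100 * count / grand_total, 1),    # Percent: of all cases
        round(100 * count / valid_total, 1),    # Valid Percent: of answered cases
        round(100 * running / valid_total, 1),  # Cumulative Percent
    ))

for row in rows:
    print(row)
```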

Descriptive Statistics Examples
Categorical Variables (Nominal/Ordinal)

Q49 Year in school * Q46 Sex Crosstabulation
Cells show Count (% of Total)

Q49 Year in school           1 female           2 male             Total
1 1st year undergrad         7366 (14.5%)       4154 (8.2%)        11520 (22.7%)
2 2nd year under             6755 (13.3%)       3678 (7.2%)        10433 (20.6%)
3 3rd year under             6195 (12.2%)       3333 (6.6%)         9528 (18.8%)
4 4th year under             5192 (10.2%)       2676 (5.3%)         7868 (15.5%)
5 5th year or more under     1380 (2.7%)         985 (1.9%)         2365 (4.7%)
6 graduate                   5088 (10.0%)       3246 (6.4%)         8334 (16.4%)
7 adult special               203 (.4%)          105 (.2%)           308 (.6%)
8 other                       266 (.5%)          145 (.3%)           411 (.8%)
Total                       32445 (63.9%)      18322 (36.1%)       50767 (100.0%)

Descriptive Statistics Examples
Continuous Variables (Interval/Ratio)

Descriptive Statistics
                              N        Range     Minimum    Maximum    Mean       Std. Deviation    Variance
Q48 Weight in pounds          51935    534       52         586        153.16     35.791            1281.031
HT_INCH Height in Inches      52017    56.00     48.00      104.00     67.2035    4.01241           16.099
Q13 How many drinks           53374    88        0          88         4.42       4.401             19.370
Q12 Hours alcohol             53326    65        0          65         2.99       2.726             7.430
BAC Blood Alcohol Content     50604    2.47      .00        2.47       .0731      .08357            .007
Valid N (listwise)            50218

Hypotheses
Null hypotheses
Presumed true until statistical evidence
in the form of a hypothesis test indicates
otherwise
There is no effect/relationship
There is no difference in means

Alternative hypotheses
Tested using inferential statistics
There is an effect/relationship
There is a difference in means

Alpha, Beta, Power, Effect Size

Alpha: probability of making a Type I error
Reject null when null is true
Level of significance (the cutoff against which the p value is compared)

Beta: probability of making a Type II error
Fail to reject null when null is false

Power: probability of correctly rejecting null
1 - Beta

Effect Size
Measure of the strength of the relationship between two variables

                       Null is true                       Null is false
Reject null            Alpha (Type I error)               1 - Beta (Power), CORRECT REJECTION
Fail to reject null    1 - Alpha, CORRECT NONREJECTION    Beta (Type II error)
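
To show how alpha, beta, power, and effect size fit together, here is a sketch of a power calculation for a two-sided one-sample z-test. The choice of a z-test, the function name, and the example effect size of 0.2 with n = 200 are assumptions made for illustration; the slides do not specify a power formula.

```python
from statistics import NormalDist


def two_sided_z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size is the true mean difference in standard deviation units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # rejection cutoff
    shift = effect_size * n ** 0.5               # where the test statistic is centered
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


power = two_sided_z_test_power(effect_size=0.2, n=200)
beta = 1 - power                                 # Type II error probability
null_true_rejection_rate = two_sided_z_test_power(effect_size=0.0, n=200)
print(round(power, 2), round(beta, 2), round(null_true_rejection_rate, 2))
```

When the effect size is zero the null is true, so the rejection probability reduces to alpha. Power grows as the effect size or the sample size increases.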

Let's test some hypotheses!!!

Test of the mean of one continuous variable

College students report drinking an average of 5 drinks the last time they partied/socialized
Hypotheses
Ho: μ = 5
HA: μ ≠ 5

Test: Two-tailed t-test
Result: Reject null

One-Sample Statistics
                   N        Mean    Std. Deviation    Std. Error Mean
How many drinks    53374    4.42    4.401             .019

One-Sample Test (Test Value = 5)
                                                                             95% Confidence Interval of the Difference
                   t          df       Sig. (2-tailed)    Mean Difference    Lower      Upper
How many drinks    -30.352    53373    .000               -.578              -.62       -.54
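
The one-sample output can be approximately reproduced from the summary statistics alone. This sketch assumes a normal approximation to the t distribution, which is reasonable with df = 53373; small discrepancies from the SPSS values come from rounding in the printed table.

```python
import math
from statistics import NormalDist

n = 53374
std_deviation = 4.401
mean_difference = -0.578            # sample mean minus the test value of 5

standard_error = std_deviation / math.sqrt(n)
t_stat = mean_difference / standard_error

# Normal approximation to the t distribution (assumption; df is very large).
z_crit = NormalDist().inv_cdf(0.975)
ci_lower = mean_difference - z_crit * standard_error
ci_upper = mean_difference + z_crit * standard_error
p_two_tailed = 2 * NormalDist().cdf(-abs(t_stat))

print(round(standard_error, 3), round(t_stat, 2))
print(round(ci_lower, 2), round(ci_upper, 2), p_two_tailed < 0.001)
```

The interval excludes zero and the p value is far below .05, which agrees with the "Reject null" result on the slide.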

Test of a single proportion of one categorical variable

20% of college students report their health is excellent
Hypotheses
Ho: p = .20
HA: p < .20 (one-tailed)

Test: Z-test for a single proportion
Result: Reject null

Binomial Test
                         Category    N        Observed Prop.    Test Prop.    Asymp. Sig. (1-tailed)
Gen health    Group 1    <= 1        9145     .170              .2            .000(a,b)
              Group 2    > 1         44643    .830
              Total                  53788    1.000

a. Alternative hypothesis states that the proportion of cases in the first group < .2.
b. Based on Z Approximation.
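
A minimal sketch of the Z approximation for a single proportion, using the counts from the Binomial Test table. The function name is hypothetical, and no continuity correction is applied, so the z value may differ slightly from what SPSS computes internally.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Lower-tailed z-test of Ho: p = p0 against HA: p < p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)   # uses the null proportion
    z = (p_hat - p0) / standard_error
    p_value = NormalDist().cdf(z)                   # area below z
    return p_hat, z, p_value


p_hat, z, p_value = one_proportion_z_test(successes=9145, n=53788, p0=0.20)
print(round(p_hat, 3), round(z, 1), p_value < 0.001)
```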

Test of a relationship between two continuous variables

There is a relationship between the number of drinks students report drinking the last time they drank and the number of sex partners they have had within the last school year
Hypotheses
Ho: ρ = 0
HA: ρ ≠ 0

Test: Pearson Product Moment Correlation
Result: Reject null

Correlations
                                           How many drinks    Partners you had
How many drinks     Pearson Correlation    1                  .238**
                    Sig. (2-tailed)                           .000
                    N                      53374              52576
Partners you had    Pearson Correlation    .238**             1
                    Sig. (2-tailed)        .000
                    N                      52576              52896

**. Correlation is significant at the 0.01 level (2-tailed).
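
The significance of a Pearson correlation is usually assessed by converting r to a t statistic with n - 2 degrees of freedom. The sketch below applies that standard conversion to the reported r and N; it also squares r as a simple effect-size measure, which is an addition for illustration rather than something shown on the slide.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing Ho: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.238
n = 52576

t_stat = correlation_t_statistic(r, n)
shared_variance = r ** 2      # proportion of variance the two variables share

print(round(t_stat, 1), round(shared_variance, 3))
```

With more than 50,000 respondents the t statistic is very large, yet the two variables share only about 6% of their variance, a reminder that statistical and practical significance are different questions.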

Test of the difference between two means

Men and women report significantly different numbers of sexual partners over the past 12 months
Hypotheses
Ho: μ1 = μ2
HA: μ1 ≠ μ2

Test: Independent Samples t-test OR One-way ANOVA
Result: Reject null

Group Statistics
                    Sex       N        Mean    Std. Deviation    Std. Error Mean
Partners you had    female    32687    1.34    2.017             .011
                    male      18474    1.82    3.627             .027

Independent Samples Test (Partners you had)

Levene's Test for Equality of Variances: F = 867.978, Sig. = .000

t-test for Equality of Means
                                                                                                      95% Confidence Interval of the Difference
                               t          df           Sig. (2-tailed)    Mean Difference    Std. Error Difference    Lower     Upper
Equal variances assumed        -19.360    51159        .000               -.483              .025                     -.532     -.434
Equal variances not assumed    -16.704    25065.988    .000               -.483              .029                     -.540     -.426
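
Because Levene's test is significant, the "Equal variances not assumed" row is the relevant one; that row corresponds to Welch's t-test. The sketch below recomputes it from the Group Statistics table. The function name is hypothetical, and results differ slightly from SPSS because the printed standard deviations are rounded.

```python
import math


def welch_t_from_summary(mean_diff, sd1, n1, sd2, n2):
    """Welch's t-test (unequal variances) from summary statistics."""
    v1 = sd1 ** 2 / n1                   # squared standard error, group 1
    v2 = sd2 ** 2 / n2                   # squared standard error, group 2
    std_error = math.sqrt(v1 + v2)
    t = mean_diff / std_error
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, std_error, df


# female minus male, using the reported mean difference of -.483
t_stat, se_diff, df = welch_t_from_summary(
    mean_diff=-0.483, sd1=2.017, n1=32687, sd2=3.627, n2=18474
)
print(round(t_stat, 2), round(se_diff, 3), round(df))
```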

Test of the difference between two or more means

Mean BAC reported differs across student residences
Hypotheses
Ho: μ1 = μ2 = μ3 = μ4 = μ5 = μ6
HA: μi ≠ μj for at least one pair i, j

Test: One-way ANOVA
Result: Reject null

Descriptives
Blood Alcohol Content
                                                                            95% Confidence Interval for Mean
                            N        Mean     Std. Deviation    Std. Error    Lower Bound    Upper Bound    Minimum    Maximum
residence hall              21285    .0741    .08215            .00056        .0730          .0752          .00        1.27
frat/sorority house         781      .1127    .09278            .00332        .1062          .1193          .00        .75
other university housing    3620     .0622    .07357            .00122        .0598          .0646          .00        1.41
off campus                  18151    .0773    .08539            .00063        .0760          .0785          .00        2.47
with parents                4279     .0606    .08490            .00130        .0581          .0631          .00        1.17
other                       2266     .0579    .08296            .00174        .0545          .0613          .00        1.26
Total                       50382    .0731    .08357            .00037        .0724          .0738          .00        2.47

ANOVA
Blood Alcohol Content
                  Sum of Squares    df       Mean Square    F         Sig.
Between Groups    3.188             5        .638           92.123    .000
Within Groups     348.695           50376    .007
Total             351.884           50381
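
The F statistic in the ANOVA table is the ratio of the two mean squares, each of which is a sum of squares divided by its degrees of freedom. The sketch below recomputes it from the table. It also reports eta squared, a common effect-size measure that is not on the slide and is included here only as an illustration.

```python
def anova_from_sums_of_squares(ss_between, df_between, ss_within, df_within):
    """Rebuild the one-way ANOVA table entries from sums of squares."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    f_stat = ms_between / ms_within
    # Eta squared: share of total variation explained by group membership.
    eta_squared = ss_between / (ss_between + ss_within)
    return ms_between, ms_within, f_stat, eta_squared


ms_between, ms_within, f_stat, eta_squared = anova_from_sums_of_squares(
    ss_between=3.188, df_between=5, ss_within=348.695, df_within=50376
)
print(round(ms_between, 3), round(ms_within, 3))
print(round(f_stat, 1), round(eta_squared, 3))
```

Residence explains under 1% of the variation in BAC even though F is highly significant, another case where a large sample makes a small effect statistically detectable.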

Test of the difference between two or more means

Multiple Comparisons
Dependent Variable: Blood Alcohol Content
Games-Howell
                                                                                                        95% Confidence Interval
(I) Currently live          (J) Currently live          Mean Difference (I-J)    Std. Error    Sig.    Lower Bound    Upper Bound
residence hall              frat/sorority house         -.03865*                 .00337        .000    -.0483         -.0290
                            other university housing    .01190*                  .00135        .000    .0081          .0157
                            off campus                  -.00316*                 .00085        .003    -.0056         -.0007
                            with parents                .01350*                  .00141        .000    .0095          .0175
                            other                       .01623*                  .00183        .000    .0110          .0215
frat/sorority house         residence hall              .03865*                  .00337        .000    .0290          .0483
                            other university housing    .05055*                  .00354        .000    .0404          .0606
                            off campus                  .03548*                  .00338        .000    .0258          .0451
                            with parents                .05215*                  .00356        .000    .0420          .0623
                            other                       .05488*                  .00375        .000    .0442          .0656
other university housing    residence hall              -.01190*                 .00135        .000    -.0157         -.0081
                            frat/sorority house         -.05055*                 .00354        .000    -.0606         -.0404
                            off campus                  -.01506*                 .00138        .000    -.0190         -.0111
                            with parents                .00160                   .00178        .947    -.0035         .0067
                            other                       .00433                   .00213        .323    -.0017         .0104
off campus                  residence hall              .00316*                  .00085        .003    .0007          .0056
                            frat/sorority house         -.03548*                 .00338        .000    -.0451         -.0258
                            other university housing    .01506*                  .00138        .000    .0111          .0190
                            with parents                .01667*                  .00144        .000    .0125          .0208
                            other                       .01940*                  .00185        .000    .0141          .0247
with parents                residence hall              -.01350*                 .00141        .000    -.0175         -.0095
                            frat/sorority house         -.05215*                 .00356        .000    -.0623         -.0420
                            other university housing    -.00160                  .00178        .947    -.0067         .0035
                            off campus                  -.01667*                 .00144        .000    -.0208         -.0125
                            other                       .00273                   .00217        .809    -.0035         .0089
other                       residence hall              -.01623*                 .00183        .000    -.0215         -.0110
                            frat/sorority house         -.05488*                 .00375        .000    -.0656         -.0442
                            other university housing    -.00433                  .00213        .323    -.0104         .0017
                            off campus                  -.01940*                 .00185        .000    -.0247         -.0141
                            with parents                -.00273                  .00217        .809    -.0089         .0035

*. The mean difference is significant at the .05 level.

Test for a relationship between two categorical variables

Is there an association between being a member of a fraternity/sorority and ever being diagnosed with depression?
Hypotheses
Ho: There is no association between being a member of a fraternity/sorority and ever being diagnosed with depression.
HA: There is an association between being a member of a fraternity/sorority and ever being diagnosed with depression.

Test: Chi-square test for independence
Result: Fail to reject null

Test for relationship between two categorical variables

Ever - Depression * Frat or sorority? Crosstabulation
                                            Frat or sorority?
                                            yes       no         Total
Ever - Depression    yes    Count           681       7692       8373
                            Expected Count  715.6     7657.4     8373.0
                     no     Count           3744      39657      43401
                            Expected Count  3709.4    39691.6    43401.0
Total                       Count           4425      47349      51774
                            Expected Count  4425.0    47349.0    51774.0

Chi-Square Tests
                                Value       df    Asymp. Sig. (2-sided)    Exact Sig. (2-sided)    Exact Sig. (1-sided)
Pearson Chi-Square              2.185(b)    1     .139
Continuity Correction(a)        2.122       1     .145
Likelihood Ratio                2.211       1     .137
Fisher's Exact Test                                                        .141                    .073
Linear-by-Linear Association    2.185       1     .139
N of Valid Cases                51774

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
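
The Pearson chi-square value can be recomputed directly from the observed counts in the crosstabulation: each expected count is row total times column total divided by N. The function name is hypothetical, and the p value shortcut used here is valid only for 1 degree of freedom (a 2x2 table).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under Ho
            expected_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)

    # With 1 df, chi-square is a squared standard normal, so the upper-tail
    # probability can be written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected


# Rows: ever depressed (yes, no). Columns: frat or sorority (yes, no).
observed = [[681, 7692], [3744, 39657]]
chi2, p_value, expected = chi_square_2x2(observed)
print(round(chi2, 3), round(p_value, 3), round(expected[0][0], 1))
```

The p value is above .05, which matches the "Fail to reject null" result on the slide.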

Important Points to Remember

A significant association does not indicate causation
Statistical significance is not always the same as practical significance
Multiple factors contribute to whether your results are significant
It gets easier and easier as you practice!

Questions???
