Data Analysis and Surveying 101:

Basic research methods and biostatistics as they apply to the

Theresa Jackson Hughes, MPH
American College Health Association
December 2006

What we will cover today

Research Methods
  Sampling Frame and Sampling
  Generalizability
  Bias
  Reliability and Validity
  Levels of measurement

Biostatistics
  Statistical significance
  Other key terms
  Appropriate statistical tests
  Fun examples from the Spring 2005 dataset!

Get excited! It's data time!

Research Methods

To do successful research, you don't need to know everything, you just need to know of one thing that isn't known.
Arthur Schawlow

That's the nature of research - you don't know what in hell you're doing.
Harold "Doc" Edgerton

If we knew what it was we were doing, it would not be called research, would it?
Albert Einstein

What exactly is research?


Scientific research is systematic, controlled, empirical, and critical investigation of natural phenomena guided by theory and hypotheses about the presumed relations among such phenomena.
Kerlinger, 1986

Research is an organized and systematic way of finding answers to questions

Important Components of Empirical Research


  Problem statement, research questions, purposes, benefits
  Theory, assumptions, background literature
  Variables and hypotheses
  Operational definitions and measurement
  Research design and methodology
  Instrumentation, sampling
  Data analysis
  Conclusions, interpretations, recommendations

Sampling
What is your population of interest?
To whom do you want to generalize your results?
  All students (18 and over)
  Undergraduates only
  Greeks
  Athletes
  Other

Can you sample the entire population?

Sampling
A sample is a smaller (but hopefully representative) collection of units from a population used to determine truths about that population (Field, 2005)

Why sample?
  Resources (time, money) and workload
  Gives results with known accuracy that can be calculated mathematically

The sampling frame is the list from which the potential respondents are drawn
  Registrar's office
  Class rosters
  Must assess sampling frame errors

Types of Samples
Probability (Random) Samples
  Simple random sample
  Systematic random sample
  Stratified random sample
    Proportionate
    Disproportionate
  Cluster sample

Non-Probability Samples
  Convenience sample
  Purposive sample
  Quota

Sample Size
Depends on expected response rate
  Average 85% for paper
    FINAL SAMPLE DESIRED / .85 = SAMPLE
  Average 25% for web
    FINAL SAMPLE DESIRED / .25 = SAMPLE

Size of Campus     Final Desired N
<600               All students
600-2,999          600
3,000-9,999        700
10,000-19,999      800
20,000-29,000      900
30,000+            1,000
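The response-rate adjustment above can be sketched in a few lines of Python. The function name `invitations_needed` is made up for this illustration, and rounding up to a whole student is an assumption the slide does not state.

```python
import math


def invitations_needed(final_n: int, response_rate: float) -> int:
    """Students to invite so that roughly `final_n` of them respond.

    Implements FINAL SAMPLE DESIRED / response rate = SAMPLE,
    rounded up to a whole person (rounding is an assumption).
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return math.ceil(final_n / response_rate)


# Campus of 3,000-9,999 students: final desired N is 700.
paper_sample = invitations_needed(700, 0.85)  # paper survey, 85% response
web_sample = invitations_needed(700, 0.25)    # web survey, 25% response
print(paper_sample, web_sample)
```

With these inputs a paper survey needs 824 students and a web survey needs 2,800, which shows how strongly the expected response rate drives the size of the initial sample.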

Bias and Error

Systematic Error or Bias: unknown or unacknowledged error created during the design, measurement, sampling, procedure, or choice of problem studied
  Error tends to go in one direction
  Examples: Selection, Recall, Social desirability

Random Error
  Unrelated to true measures
  Example: Momentary fatigue

Reliability and Validity


Reliability
  The extent to which a test is repeatable and yields consistent scores
  Affected by random error/bias

Validity
  The extent to which a test measures what it is supposed to measure
  A subjective judgment made on the basis of experience and empirical indicators
  Asks "Is the test measuring what you think it's measuring?"
  Affected by systematic error/bias

Reliability vs. Validity

In order to be valid, a test must be reliable; but reliability does not guarantee validity.

Levels of Measurement

Nominal
  Gender
    Male, Female
  Vaccinations
    Yes, No, Unsure

Ordinal
  Personal health status
    Excellent, Very good, Good, Fair, Poor
  Last 30 days
    Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days

Interval
  Body Mass Index (BMI)

Ratio
  Number of drinks
  Number of sexual partners
  Perception percentages
  Blood alcohol concentration (BAC)

Biostatistics

It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.
R. Hooke

Torture numbers, and they'll confess to anything.


Gregg Easterbrook

98% of all statistics are made up.


Author Unknown

Types of Statistics
Descriptive statistics
  Describe the basic features of data in a study
  Provide summaries about the sample and measures

Inferential statistics
  Investigate questions, models, and hypotheses
  Infer population characteristics based on sample
  Make judgments about what we observe

Descriptive Statistics
  Central Tendency
    Mode
    Median
    Mean
  Variation
    Range
    Variance
    Standard Deviation
  Frequency
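As a small illustration of the measures listed above, the sketch below computes each one with Python's standard `statistics` module. The ten `drinks` values are invented for the example and are not taken from the Spring 2005 dataset.

```python
import statistics

# Hypothetical answers to "how many drinks?" from ten students.
drinks = [0, 2, 2, 3, 4, 4, 4, 6, 7, 8]

# Central tendency
mode = statistics.mode(drinks)      # most frequent value
median = statistics.median(drinks)  # middle value of the sorted data
mean = statistics.mean(drinks)      # arithmetic average

# Variation
value_range = max(drinks) - min(drinks)
variance = statistics.variance(drinks)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(drinks)      # square root of the sample variance

print(mode, median, mean, value_range, variance, round(std_dev, 3))
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas, which is also what SPSS reports in its Descriptives output.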

Descriptive Statistics Examples


Categorical Variables (Nominal/Ordinal)

Q1 Gen health
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 excellent        9145       16.9        17.0              17.0
          2 very good       23767       43.9        44.2              61.2
          3 good            16442       30.4        30.6              91.8
          4 fair             3737        6.9         6.9              98.7
          5 poor              565        1.0         1.1              99.8
          6 don't know        132         .2          .2             100.0
          Total             53788       99.4       100.0
Missing   System              323         .6
Total                       54111      100.0
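The Percent, Valid Percent, and Cumulative Percent columns can be recomputed from the frequencies alone. This is a minimal sketch that uses the counts reported for Q1; the variable names are chosen for the example.

```python
# Frequencies reported for Q1 (general health).
counts = {
    "excellent": 9145,
    "very good": 23767,
    "good": 16442,
    "fair": 3737,
    "poor": 565,
    "don't know": 132,
}
missing = 323

valid_n = sum(counts.values())  # respondents who answered the question
total_n = valid_n + missing     # everyone in the dataset

rows = []
cumulative = 0.0
for label, n in counts.items():
    percent = 100 * n / total_n        # share of all cases
    valid_percent = 100 * n / valid_n  # share of non-missing cases
    cumulative += valid_percent        # running total of valid percent
    rows.append((label, n, round(percent, 1),
                 round(valid_percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

Percent divides by all 54,111 cases, while Valid Percent divides by the 53,788 who answered, which is why the two columns differ slightly.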

Descriptive Statistics Examples


Categorical Variables (Nominal/Ordinal)

Q49 Year in school * Q46 Sex Crosstabulation
                                              Q46 Sex
Q49 Year in school                        1 female   2 male    Total
1 1st year undergrad        Count           7366      4154     11520
                            % of Total     14.5%      8.2%     22.7%
2 2nd year under            Count           6755      3678     10433
                            % of Total     13.3%      7.2%     20.6%
3 3rd year under            Count           6195      3333      9528
                            % of Total     12.2%      6.6%     18.8%
4 4th year under            Count           5192      2676      7868
                            % of Total     10.2%      5.3%     15.5%
5 5th year or more under    Count           1380       985      2365
                            % of Total      2.7%      1.9%      4.7%
6 graduate                  Count           5088      3246      8334
                            % of Total     10.0%      6.4%     16.4%
7 adult special             Count            203       105       308
                            % of Total       .4%       .2%       .6%
8 other                     Count            266       145       411
                            % of Total       .5%       .3%       .8%
Total                       Count          32445     18322     50767
                            % of Total     63.9%     36.1%    100.0%

Descriptive Statistics Examples


Continuous Variables (Interval/Ratio)

Descriptive Statistics
                              N       Range   Minimum   Maximum   Mean      Std. Deviation   Variance
Q48 Weight in pounds          51935   534     52        586       153.16    35.791           1281.031
HT_INCH Height in Inches      52017   56.00   48.00     104.00    67.2035   4.01241          16.099
Q13 How many drinks           53374   88      0         88        4.42      4.401            19.370
Q12 Hours alcohol             53326   65      0         65        2.99      2.726            7.430
BAC Blood Alcohol Content     50604   2.47    .00       2.47      .0731     .08357           .007
Valid N (listwise)            50218

Hypotheses
Null hypotheses
  Presumed true until statistical evidence in the form of a hypothesis test indicates otherwise
    There is no effect/relationship
    There is no difference in means

Alternative hypotheses
  Tested using inferential statistics
    There is an effect/relationship
    There is a difference in means

Alpha, Beta, Power, Effect Size

Alpha: probability of making a Type I error
  Reject null when null is true
  Level of significance (the cutoff the p value is compared against)

Beta: probability of making a Type II error
  Fail to reject null when null is false

Power: probability of correctly rejecting null
  1 - Beta

Effect Size
  Measure of the strength of the relationship between two variables

                        Null is true             Null is false
Reject null             Alpha                    1 - Beta
                        Type I error             Power
                                                 CORRECT REJECTION
Fail to reject null     1 - Alpha                Beta
                        CORRECT NONREJECTION     Type II error
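To make the link between alpha, beta, power, and effect size concrete, here is a rough sketch for one simple case: a two-sided, one-sample z-test with a known standard deviation. That choice of test, the function name `z_test_power`, and the example inputs are all assumptions made for illustration; the slides do not specify a power calculation.

```python
from statistics import NormalDist


def z_test_power(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test (known sd assumed).

    `effect` is the true difference between the population mean and the
    null value. Power = 1 - Beta = P(reject null | null is false).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection cutoff for |z|
    shift = effect * n ** 0.5 / sd              # how far the true mean moves z
    upper_tail = std_normal.cdf(-z_crit + shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


power_small_n = z_test_power(effect=0.2, sd=1.0, n=50)
power_large_n = z_test_power(effect=0.2, sd=1.0, n=200)
beta_large_n = 1 - power_large_n  # probability of a Type II error
print(round(power_small_n, 3), round(power_large_n, 3))
```

When the effect is zero the function returns alpha, since every rejection is then a Type I error. For a fixed effect size, power rises as the sample size grows.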

Let's test some hypotheses!

Test of the mean of one continuous variable


College students report drinking an average of 5 drinks the last time they partied/socialized
  Hypotheses
    Ho: μ = 5
    HA: μ ≠ 5
  Test: Two-tailed t-test
  Result: Reject null

One-Sample Statistics
                    N       Mean   Std. Deviation   Std. Error Mean
How many drinks     53374   4.42   4.401            .019

One-Sample Test (Test Value = 5)
                                                                          95% Confidence Interval of the Difference
                    t         df      Sig. (2-tailed)   Mean Difference   Lower     Upper
How many drinks     -30.352   53373   .000              -.578             -.62      -.54
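The one-sample t statistic can be approximately reproduced from the summary numbers in the output above. This sketch assumes a normal approximation to the t distribution, which is reasonable with 53,373 degrees of freedom; small differences from the SPSS values come from rounding in the reported inputs.

```python
import math
from statistics import NormalDist

n = 53374                 # respondents
std_dev = 4.401           # reported standard deviation
mean_difference = -0.578  # reported sample mean minus the test value of 5

std_error = std_dev / math.sqrt(n)    # standard error of the mean
t_stat = mean_difference / std_error  # one-sample t statistic
df = n - 1

# With df this large, t is essentially standard normal (an approximation).
z_crit = NormalDist().inv_cdf(0.975)
ci_lower = mean_difference - z_crit * std_error
ci_upper = mean_difference + z_crit * std_error
p_two_sided = 2 * NormalDist().cdf(-abs(t_stat))

print(round(std_error, 3), round(t_stat, 2), round(ci_lower, 2), round(ci_upper, 2))
```

The confidence interval excludes zero and the p value is far below .05, which is why the result is to reject the null.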

Test of a single proportion of one categorical variable


20% of college students report their health is excellent
  Hypotheses
    Ho: p = .20
    HA: p < .20 (one-tailed)
  Test: Z-test for a single proportion
  Result: Reject null

Binomial Test
                        Category   N       Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
Gen health   Group 1    <= 1       9145    .170             .2           .000 a,b
             Group 2    > 1        44643   .830
             Total                 53788   1.000

a. Alternative hypothesis states that the proportion of cases in the first group < .2.
b. Based on Z Approximation.
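A z-test for a single proportion can be sketched directly from the counts in the Binomial Test output. The formula below is the standard large-sample approximation; SPSS may apply a continuity correction, so its exact z value could differ slightly.

```python
import math
from statistics import NormalDist

excellent = 9145  # students who rated their health "excellent"
n = 53788         # students who answered the question
p_null = 0.20     # proportion claimed under the null hypothesis

p_hat = excellent / n                              # observed proportion
std_error = math.sqrt(p_null * (1 - p_null) / n)   # SE under the null
z_stat = (p_hat - p_null) / std_error

# One-tailed test: alternative is that the true proportion is below .20.
p_one_tailed = NormalDist().cdf(z_stat)

print(round(p_hat, 3), round(z_stat, 2))
```

The observed proportion of .170 sits more than 17 standard errors below .20, so the one-tailed p value is effectively zero.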

Test of a relationship between two continuous variables


There is a relationship between the number of drinks students report drinking the last time they drank and the number of sex partners they have had within the last school year
  Hypotheses
    Ho: ρ = 0
    HA: ρ ≠ 0
  Test: Pearson Product Moment Correlation
  Result: Reject null

Correlations
                                           How many drinks   Partners you had
How many drinks     Pearson Correlation    1                 .238**
                    Sig. (2-tailed)                          .000
                    N                      53374             52576
Partners you had    Pearson Correlation    .238**            1
                    Sig. (2-tailed)        .000
                    N                      52576             52896

**. Correlation is significant at the 0.01 level (2-tailed).
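The sketch below shows how a Pearson correlation is computed and how the reported r = .238 translates into a test statistic. The helper `pearson_r` and the tiny `x`/`y` lists are illustrative only; the raw survey data are not reproduced here.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical, perfectly linear data: r should be 1.
demo_r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])

# Reported values from the Correlations table.
r = 0.238
n = 52576
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test of rho = 0
r_squared = r ** 2  # share of variance the two variables have in common

print(round(demo_r, 3), round(t_stat, 1), round(r_squared, 3))
```

The t statistic is very large because n is very large, yet r squared is only about .06. This is a useful reminder that a highly significant correlation can still be a modest one.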

Test of the difference between two means

Men and women report significantly different numbers of sexual partners over the past 12 months
  Hypotheses
    Ho: μ1 = μ2
    HA: μ1 ≠ μ2
  Test: Independent Samples t-test OR One-way ANOVA
  Result: Reject null

Group Statistics
                    Sex      N       Mean   Std. Deviation   Std. Error Mean
Partners you had    female   32687   1.34   2.017            .011
                    male     18474   1.82   3.627            .027

Independent Samples Test: Partners you had

Levene's Test for Equality of Variances
  F = 867.978, Sig. = .000

t-test for Equality of Means
                                                                                                         95% Confidence Interval of the Difference
                                t         df          Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower     Upper
Equal variances assumed         -19.360   51159       .000              -.483             .025                    -.532     -.434
Equal variances not assumed     -16.704   25065.988   .000              -.483             .029                    -.540     -.426
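Both rows of the independent samples test can be approximately rebuilt from the group statistics. The sketch below uses the standard pooled-variance and Welch (unequal variance) formulas with the reported, rounded summary values, so the results match the SPSS output only to within rounding.

```python
import math

# Reported group statistics (female = group 1, male = group 2).
n1, sd1 = 32687, 2.017
n2, sd2 = 18474, 3.627
mean_difference = -0.483  # reported female mean minus male mean

# Equal variances assumed: pooled standard deviation.
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = mean_difference / se_pooled
df_pooled = n1 + n2 - 2

# Equal variances not assumed: Welch's t with Satterthwaite df.
v1 = sd1 ** 2 / n1
v2 = sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = mean_difference / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(round(t_pooled, 2), df_pooled, round(t_welch, 2), round(df_welch))
```

Because Levene's test is significant, the "equal variances not assumed" row is the one to read, and the Welch calculation is the one that corresponds to it.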

Test of the difference between two or more means


Mean BAC reported differs across student residences
  Hypotheses
    Ho: μ1 = μ2 = μ3 = μ4 = μ5 = μ6
    HA: μi ≠ μj for at least one pair i, j
  Test: One-way ANOVA
  Result: Reject null

Descriptives: Blood Alcohol Content
                                                                         95% Confidence Interval for Mean
                             N       Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
residence hall               21285   .0741   .08215           .00056       .0730         .0752         .00       1.27
frat/sorority house          781     .1127   .09278           .00332       .1062         .1193         .00       .75
other university housing     3620    .0622   .07357           .00122       .0598         .0646         .00       1.41
off campus                   18151   .0773   .08539           .00063       .0760         .0785         .00       2.47
with parents                 4279    .0606   .08490           .00130       .0581         .0631         .00       1.17
other                        2266    .0579   .08296           .00174       .0545         .0613         .00       1.26
Total                        50382   .0731   .08357           .00037       .0724         .0738         .00       2.47

ANOVA: Blood Alcohol Content
                   Sum of Squares   df      Mean Square   F        Sig.
Between Groups     3.188            5       .638          92.123   .000
Within Groups      348.695          50376   .007
Total              351.884          50381
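The F ratio in the ANOVA table follows from the sums of squares and degrees of freedom. As a hedged addition, the sketch also computes eta squared, one common effect size for ANOVA; the slides do not report it, so treat that line as an illustration only.

```python
# Values reported in the ANOVA table for Blood Alcohol Content.
ss_between, df_between = 3.188, 5
ss_within, df_within = 348.695, 50376

ms_between = ss_between / df_between  # mean square between groups
ms_within = ss_within / df_within     # mean square within groups
f_stat = ms_between / ms_within       # F ratio

# Eta squared: share of total variation explained by residence type.
# Not part of the original output; added here as an assumption.
eta_squared = ss_between / (ss_between + ss_within)

print(round(ms_between, 3), round(f_stat, 1), round(eta_squared, 4))
```

The F ratio is large, but residence type accounts for under 1% of the variation in BAC, so statistical and practical significance point in different directions here.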

Test of the difference between two or more means


Multiple Comparisons
Dependent Variable: Blood Alcohol Content
Games-Howell
                                                                                                    95% Confidence Interval
(I) Currently live          (J) Currently live          Mean Difference (I-J)   Std. Error   Sig.   Lower Bound   Upper Bound
residence hall              frat/sorority house         -.03865*                .00337       .000   -.0483        -.0290
                            other university housing     .01190*                .00135       .000    .0081         .0157
                            off campus                  -.00316*                .00085       .003   -.0056        -.0007
                            with parents                 .01350*                .00141       .000    .0095         .0175
                            other                        .01623*                .00183       .000    .0110         .0215
frat/sorority house         residence hall               .03865*                .00337       .000    .0290         .0483
                            other university housing     .05055*                .00354       .000    .0404         .0606
                            off campus                   .03548*                .00338       .000    .0258         .0451
                            with parents                 .05215*                .00356       .000    .0420         .0623
                            other                        .05488*                .00375       .000    .0442         .0656
other university housing    residence hall              -.01190*                .00135       .000   -.0157        -.0081
                            frat/sorority house         -.05055*                .00354       .000   -.0606        -.0404
                            off campus                  -.01506*                .00138       .000   -.0190        -.0111
                            with parents                 .00160                 .00178       .947   -.0035         .0067
                            other                        .00433                 .00213       .323   -.0017         .0104
off campus                  residence hall               .00316*                .00085       .003    .0007         .0056
                            frat/sorority house         -.03548*                .00338       .000   -.0451        -.0258
                            other university housing     .01506*                .00138       .000    .0111         .0190
                            with parents                 .01667*                .00144       .000    .0125         .0208
                            other                        .01940*                .00185       .000    .0141         .0247
with parents                residence hall              -.01350*                .00141       .000   -.0175        -.0095
                            frat/sorority house         -.05215*                .00356       .000   -.0623        -.0420
                            other university housing    -.00160                 .00178       .947   -.0067         .0035
                            off campus                  -.01667*                .00144       .000   -.0208        -.0125
                            other                        .00273                 .00217       .809   -.0035         .0089
other                       residence hall              -.01623*                .00183       .000   -.0215        -.0110
                            frat/sorority house         -.05488*                .00375       .000   -.0656        -.0442
                            other university housing    -.00433                 .00213       .323   -.0104         .0017
                            off campus                  -.01940*                .00185       .000   -.0247        -.0141
                            with parents                -.00273                 .00217       .809   -.0089         .0035

*. The mean difference is significant at the .05 level.

Test for a relationship between two categorical variables


Is there an association between being a member of a fraternity/sorority and ever being diagnosed with depression?
  Hypotheses
    Ho: There is no association between being a member of a fraternity/sorority and ever being diagnosed with depression.
    HA: There is an association between being a member of a fraternity/sorority and ever being diagnosed with depression.
  Test: Chi-square test for independence
  Result: Fail to reject null

Test for relationship between two categorical variables


Ever - Depression * Frat or sorority? Crosstabulation
                                                Frat or sorority?
                                                yes       no         Total
Ever - Depression    yes    Count               681       7692       8373
                            Expected Count      715.6     7657.4     8373.0
                     no     Count               3744      39657      43401
                            Expected Count      3709.4    39691.6    43401.0
Total                       Count               4425      47349      51774
                            Expected Count      4425.0    47349.0    51774.0

Chi-Square Tests
                                 Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square               2.185 b   1    .139
Continuity Correction a          2.122     1    .145
Likelihood Ratio                 2.211     1    .137
Fisher's Exact Test                                                     .141                   .073
Linear-by-Linear Association     2.185     1    .139
N of Valid Cases                 51774

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
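The expected counts and the Pearson chi-square value can be recomputed from the observed 2x2 table. This is a minimal sketch; the closed-form p value on the last lines is valid only for 1 degree of freedom, which is the case here.

```python
import math

# Observed counts: rows = ever diagnosed with depression (yes, no),
# columns = fraternity/sorority member (yes, no).
observed = [[681, 7692],
            [3744, 39657]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = []
chi_square = 0.0
for i, row in enumerate(observed):
    expected_row = []
    for j, count in enumerate(row):
        # Expected count if the two variables were independent.
        expected_count = row_totals[i] * col_totals[j] / n
        expected_row.append(expected_count)
        chi_square += (count - expected_count) ** 2 / expected_count
    expected.append(expected_row)

# For df = 1 only: chi-square is the square of a standard normal variable.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(expected[0][0], 2), round(chi_square, 3), round(p_value, 3))
```

The p value is above .05, so the observed counts are close enough to the expected counts that the null hypothesis of no association is not rejected.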

Important Points to Remember

  A significant association does not indicate causation
  Statistical significance is not always the same as practical significance
  Multiple factors contribute to whether your results are significant
  It gets easier and easier as you practice!

Questions???