
# Data Analysis and Surveying 101:

## Basic research methods and biostatistics as they apply to the

Theresa Jackson Hughes, MPH American College Health Association December 2006

## What we will cover today

Research Methods
- Sampling Frame and Sampling
- Generalizability
- Bias
- Reliability and Validity
- Levels of measurement

Biostatistics
- Statistical significance
- Other key terms
- Appropriate statistical tests
- Fun examples from the Spring 2005 dataset!

## Get excited! It's data time!!!

Research Methods

To do successful research, you don't need to know everything, you just need to know of one thing that isn't known.
Arthur Schawlow

That's the nature of research - you don't know what in hell you're doing.
Harold "Doc" Edgerton

If we knew what it was we were doing, it would not be called research, would it?
Albert Einstein

## What exactly is research?

Scientific research is systematic, controlled, empirical, and critical investigation of natural phenomena guided by theory and hypotheses about the presumed relations among such phenomena.
Kerlinger, 1986

## Important Components of Empirical Research

- Problem statement, research questions, purposes, benefits
- Theory, assumptions, background literature
- Variables and hypotheses
- Operational definitions and measurement
- Research design and methodology
- Instrumentation, sampling
- Data analysis
- Conclusions, interpretations, recommendations

Sampling
What is your population of interest?
To whom do you want to generalize your results?
- All students (18 and over)
- Undergraduates only
- Greeks
- Athletes
- Other

## Can you sample the entire population?

Sampling
A sample is a smaller (but hopefully representative) collection of units from a population used to determine truths about that population (Field, 2005)

Why sample?
- Resources (time, money) and workload
- Gives results with known accuracy that can be calculated mathematically

The sampling frame is the list from which the potential respondents are drawn
- Registrar's office
- Class rosters
- Must assess sampling frame errors

Types of Samples

Probability (Random) Samples
- Simple random sample
- Systematic random sample
- Stratified random sample
  - Proportionate
  - Disproportionate
- Cluster sample

Non-Probability Samples
- Convenience sample
- Purposive sample
- Quota
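The probability designs listed above can be sketched in a few lines of Python. This is only an illustration: the sampling frame, the two strata, and the sample size of 100 are made-up placeholders, not ACHA data.

```python
import random

def simple_random_sample(frame, n, seed=0):
    """Draw n units so that every unit has the same chance of selection."""
    return random.Random(seed).sample(frame, n)

def systematic_sample(frame, n, seed=0):
    """Pick a random start, then take every k-th unit from the frame."""
    k = len(frame) // n                       # sampling interval
    start = random.Random(seed).randrange(k)  # random start within the first interval
    return [frame[start + i * k] for i in range(n)]

def proportionate_stratified_sample(strata, n, seed=0):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        share = round(n * len(units) / total)  # this stratum's slice of n
        sample.extend(rng.sample(units, share))
    return sample

# Hypothetical sampling frame of 1,000 students (assumed for illustration)
frame = [f"student_{i}" for i in range(1000)]
strata = {"undergrad": frame[:800], "graduate": frame[800:]}

print(len(simple_random_sample(frame, 100)))
print(len(systematic_sample(frame, 100)))
print(len(proportionate_stratified_sample(strata, 100)))
```

A disproportionate stratified sample would replace the proportional `share` with a fixed number per stratum, which then requires weighting at analysis time.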

Sample Size

Depends on expected response rate
- Average 85% for paper: FINAL SAMPLE DESIRED / .85 = SAMPLE
- Average 25% for web: FINAL SAMPLE DESIRED / .25 = SAMPLE

| Size of Campus | Final Desired N |
|---|---|
| <600 | All students |
| 600-2,999 | 600 |
| 3,000-9,999 | 700 |
| 10,000-19,999 | 800 |
| 20,000-29,000 | 900 |
| 30,000 or more | 1,000 |
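As a quick worked example of the formula above, the following sketch divides the desired final sample by the expected response rate and rounds up. The function name `invitations_needed` is made up for illustration; the 85% and 25% rates are the averages quoted on the slide.

```python
import math

def invitations_needed(final_n, response_rate):
    """Number of students to sample so that about final_n respond."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    # Round up: a fraction of a student cannot be invited
    return math.ceil(final_n / response_rate)

# Campus of 600-2,999 students, final desired N = 600
print(invitations_needed(600, 0.85))  # paper survey
print(invitations_needed(600, 0.25))  # web survey
```

With a desired final N of 600, this gives 706 invitations for a paper survey and 2,400 for a web survey.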

## Bias and Error

Systematic Error or Bias: unknown or unacknowledged error created during the design, measurement, sampling, procedure, or choice of problem studied
- Error tends to go in one direction
- Examples: Selection, Recall, Social desirability

Random Error
- Unrelated to true measures
- Example: Momentary fatigue

## Reliability and Validity

Reliability
- The extent to which a test is repeatable and yields consistent scores
- Affected by random error/bias

Validity
- The extent to which a test measures what it is supposed to measure
- A subjective judgment made on the basis of experience and empirical indicators
- Asks "Is the test measuring what you think it's measuring?"
- Affected by systematic error/bias

## Reliability vs. Validity

In order to be valid, a test must be reliable; but reliability does not guarantee validity.

Levels of Measurement

Nominal
- Gender: Male, Female
- Vaccinations: Yes, No, Unsure

Ordinal
- Personal health status: Excellent, Very good, Good, Fair, Poor
- Last 30 days: Never used, Not in last 30 days, 1-2 days, 3-5 days, 6-9 days, 10-19 days, 20-29 days, All 30 days

Interval
- Body Mass Index (BMI)

Ratio
- Number of drinks
- Number of sexual partners
- Perception percentages
- Blood alcohol concentration (BAC)

Biostatistics

It is commonly believed that anyone who tabulates numbers is a statistician. This is like believing that anyone who owns a scalpel is a surgeon.
R. Hooke

Torture numbers, and they'll confess to anything.
Gregg Easterbrook

98% of all statistics are made up.
Author Unknown

Types of Statistics

Descriptive statistics
- Describe the basic features of data in a study
- Provide summaries about the sample and measures

Inferential statistics
- Investigate questions, models, and hypotheses
- Infer population characteristics based on sample
- Make judgments about what we observe

Descriptive Statistics
- Central Tendency: Mode, Median, Mean
- Variation: Range, Variance, Standard Deviation
- Frequency
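The measures named above can be computed with Python's standard `statistics` module. The six values below are invented for illustration (imagine "number of drinks" answers from six respondents); they are not taken from the survey dataset.

```python
import statistics
from collections import Counter

# Hypothetical responses, assumed for illustration only
drinks = [0, 2, 4, 4, 5, 9]

# Central tendency
mode = statistics.mode(drinks)
median = statistics.median(drinks)
mean = statistics.mean(drinks)

# Variation (sample variance and standard deviation, n - 1 in the denominator)
value_range = max(drinks) - min(drinks)
variance = statistics.variance(drinks)
std_dev = statistics.stdev(drinks)

# Frequency of each distinct value
frequency = Counter(drinks)

print(mode, median, mean)
print(value_range, round(variance, 2), round(std_dev, 3))
print(dict(frequency))
```

For these values the mode, median, and mean are all 4, the range is 9, and the sample variance is 9.2.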

## Descriptive Statistics Examples

Categorical Variables (Nominal/Ordinal)
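Q1 Gen health

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 excellent | 9145 | 16.9 | 17.0 | 17.0 |
| | 2 very good | 23767 | 43.9 | 44.2 | 61.2 |
| | 3 good | 16442 | 30.4 | 30.6 | 91.8 |
| | 4 fair | 3737 | 6.9 | 6.9 | 98.7 |
| | 5 poor | 565 | 1.0 | 1.1 | 99.8 |
| | 6 don't know | 132 | .2 | .2 | 100.0 |
| | Total | 53788 | 99.4 | 100.0 | |
| Missing | System | 323 | .6 | | |
| Total | | 54111 | 100.0 | | |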

## Descriptive Statistics Examples

Categorical Variables (Nominal/Ordinal)
Q49 Year in school * Q46 Sex Crosstabulation

| Q49 Year in school | 1 female: Count | 1 female: % of Total | 2 male: Count | 2 male: % of Total | Total: Count | Total: % of Total |
|---|---|---|---|---|---|---|
| 1 1st year undergrad | 7366 | 14.5% | 4154 | 8.2% | 11520 | 22.7% |
| 2 2nd year under | 6755 | 13.3% | 3678 | 7.2% | 10433 | 20.6% |
| 3 3rd year under | 6195 | 12.2% | 3333 | 6.6% | 9528 | 18.8% |
| 4 4th year under | 5192 | 10.2% | 2676 | 5.3% | 7868 | 15.5% |
| 5 5th year or more under | 1380 | 2.7% | 985 | 1.9% | 2365 | 4.7% |
| 6 graduate | 5088 | 10.0% | 3246 | 6.4% | 8334 | 16.4% |
| 7 adult special | 203 | .4% | 105 | .2% | 308 | .6% |
| 8 other | 266 | .5% | 145 | .3% | 411 | .8% |
| Total | 32445 | 63.9% | 18322 | 36.1% | 50767 | 100.0% |

## Descriptive Statistics Examples

Continuous Variables (Interval/Ratio)

Descriptive Statistics

| | N | Range | Minimum | Maximum | Mean | Std. Deviation | Variance |
|---|---|---|---|---|---|---|---|
| Q48 Weight in pounds | 51935 | 534 | 52 | 586 | 153.16 | 35.791 | 1281.031 |
| HT_INCH Height in Inches | 52017 | 56.00 | 48.00 | 104.00 | 67.2035 | 4.01241 | 16.099 |
| Q13 How many drinks | 53374 | 88 | 0 | 88 | 4.42 | 4.401 | 19.370 |
| Q12 Hours alcohol | 53326 | 65 | 0 | 65 | 2.99 | 2.726 | 7.430 |
| BAC Blood Alcohol Content | 50604 | 2.47 | .00 | 2.47 | .0731 | .08357 | .007 |
| Valid N (listwise) | 50218 | | | | | | |

Hypotheses

Null hypotheses
- Presumed true until statistical evidence in the form of a hypothesis test indicates otherwise
  - There is no effect/relationship
  - There is no difference in means

Alternative hypotheses
- Tested using inferential statistics
  - There is an effect/relationship
  - There is a difference in means

## Alpha, Beta, Power, Effect Size

Alpha: probability of making a Type I error
- Reject null when null is true
- Level of significance, the cutoff the p value is compared against

Beta: probability of making a Type II error
- Fail to reject null when null is false

Power: 1 - Beta, the probability of correctly rejecting a false null

| | Null is true | Null is false |
|---|---|---|
| Reject null | Alpha (Type I error) | 1 - Beta (Power): CORRECT REJECTION |
| Fail to reject null | 1 - Alpha: CORRECT NONREJECTION | Beta (Type II error) |

Effect Size
Measure of the strength of the relationship between two variables

## Test of the mean of one continuous variable

College students report drinking an average of 5 drinks the last time they partied/socialized

Hypotheses
- $H_0: \mu = 5$
- $H_A: \mu \neq 5$

Test: Two-tailed t-test
Result: Reject null

One-Sample Statistics

| | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|
| How many drinks | 53374 | 4.42 | 4.401 | .019 |

One-Sample Test (Test Value = 5)

| | t | df | 95% Confidence Interval of the Difference: Lower | Upper |
|---|---|---|---|---|
| How many drinks | -30.352 | 53373 | -.62 | -.54 |
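To see where the t statistic and confidence interval in the output above come from, here is a minimal sketch that recomputes them from the printed summary statistics. The function name is made up for illustration, and because the printed mean (4.42) is rounded, the result only approximates the reported t of -30.352. The 1.96 multiplier assumes the normal approximation, which is reasonable with df this large.

```python
import math

def one_sample_t(mean, sd, n, test_value):
    """t statistic, degrees of freedom, and standard error for a one-sample t-test."""
    se = sd / math.sqrt(n)         # standard error of the mean
    t = (mean - test_value) / se   # distance from the test value, in standard errors
    return t, n - 1, se

# Summary statistics from the One-Sample Statistics table
t, df, se = one_sample_t(mean=4.42, sd=4.401, n=53374, test_value=5)

# 95% confidence interval of the difference (normal approximation)
diff = 4.42 - 5
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(round(t, 2), df, round(se, 3))
print(round(ci[0], 2), round(ci[1], 2))
```

The interval excludes zero, which is consistent with rejecting the null hypothesis that the mean is 5.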

## Test of a single proportion of one categorical variable

20% of college students report their health is excellent

Hypotheses
- $H_0: p = .20$
- $H_A: p < .20$ (one-tailed)

Test: Z-test for a single proportion
Result: Reject null

Binomial Test

| | | Category | N | Observed Prop. | Test Prop. | Asymp. Sig. (1-tailed) |
|---|---|---|---|---|---|---|
| Gen health | Group 1 | <= 1 | 9145 | .170 | .2 | .000 a,b |
| | Group 2 | >1 | 44643 | .830 | | |
| | Total | | 53788 | 1.000 | | |

a. Alternative hypothesis states that the proportion of cases in the first group < .2.
b. Based on Z Approximation.
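The Z approximation mentioned in footnote b can be sketched as follows, using the counts from the output above. This is an illustrative calculation with made-up function names, not a reproduction of the SPSS procedure; the normal CDF is built from `math.erfc` so that no external library is needed.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def one_proportion_z(successes, n, p0):
    """Observed proportion, z statistic, and lower-tailed p value."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    return p_hat, z, normal_cdf(z)     # lower tail matches HA: p < p0

# 9,145 of 53,788 students rated their health as excellent
p_hat, z, p_value = one_proportion_z(successes=9145, n=53788, p0=0.20)

print(round(p_hat, 3), round(z, 2), p_value < 0.001)
```

The observed proportion of .170 sits more than 17 standard errors below .20, which is why the output shows a significance of .000.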

## Test of a relationship between two continuous variables

There is a relationship between the number of drinks students report drinking the last time they drank and the number of sex partners they have had within the last school year

Hypotheses
- $H_0: \rho = 0$
- $H_A: \rho \neq 0$

Test: Pearson Product Moment Correlation
Result: Reject null

Correlations

| | | How many drinks | Partners you had |
|---|---|---|---|
| How many drinks | Pearson Correlation | 1 | .238** |
| | Sig. (2-tailed) | | .000 |
| | N | 53374 | 52576 |
| Partners you had | Pearson Correlation | .238** | 1 |
| | Sig. (2-tailed) | .000 | |
| | N | 52576 | 52896 |

\*\*. Correlation is significant at the 0.01 level (2-tailed).
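A correlation of .238 is modest in size, yet it is highly significant here because the sample is so large. The sketch below uses the standard conversion of r to a t statistic with n - 2 degrees of freedom; the function name is made up for illustration, and the p value uses a normal approximation that is only appropriate for large samples like this one.

```python
import math

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing H0: rho = 0."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r ** 2))
    return t, df

# r and pairwise N from the Correlations table
t, df = correlation_t(r=0.238, n=52576)

# Two-sided p value via the normal approximation (assumes large df)
p_value = math.erfc(abs(t) / math.sqrt(2))

# Share of variance in one variable accounted for by the other
r_squared = 0.238 ** 2

print(round(t, 1), df, p_value < 0.01, round(r_squared, 3))
```

The r squared of about .057 is a reminder that statistical significance and strength of relationship are separate questions.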

## Test of the difference between two means

Men and women report significantly different numbers of sexual partners over the past 12 months

Hypotheses
- $H_0: \mu_1 = \mu_2$
- $H_A: \mu_1 \neq \mu_2$

Test: Independent Samples t-test OR One-way ANOVA
Result: Reject null

Group Statistics

| | Sex | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Partners you had | female | 32687 | 1.34 | 2.017 | .011 |
| | male | 18474 | 1.82 | 3.627 | .027 |

Independent Samples Test

| Partners you had | Levene's Test: F | Levene's Test: Sig. | t | df | Std. Error Difference | 95% CI of the Difference: Lower | Upper |
|---|---|---|---|---|---|---|---|
| Equal variances assumed | 867.978 | .000 | -19.360 | 51159 | .025 | -.532 | -.434 |
| Equal variances not assumed | | | -16.704 | 25065.988 | .029 | -.540 | -.426 |
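Because Levene's test is significant, the "equal variances not assumed" row is the one to read. A minimal sketch of that calculation (Welch's t-test with the Welch-Satterthwaite degrees of freedom) from the printed group statistics follows. The function name is made up for illustration, and since the printed means are rounded to two decimals the results only approximate the reported t of -16.704 and df of 25065.988.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic, degrees of freedom, and standard error of the difference."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2  # squared standard error of each group mean
    se = math.sqrt(v1 + v2)
    t = (m1 - m2) / se
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, se

# Group statistics for "Partners you had": female first, male second
t, df, se = welch_t(1.34, 2.017, 32687, 1.82, 3.627, 18474)

print(round(t, 2), round(df), round(se, 3))
```

The standard error of the difference comes out at .029, matching the second row of the output.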

## Test of the difference between two or more means

Mean BAC reported differs across student residences

Hypotheses
- $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6$
- $H_A: \mu_i \neq \mu_j$ for at least one pair $i, j$

Test: One-way ANOVA
Result: Reject null

Descriptives: Blood Alcohol Content

| Residence | 95% Confidence Interval for Mean: Lower Bound | Upper Bound | Maximum |
|---|---|---|---|
| residence hall | .0730 | .0752 | 1.27 |
| frat/sorority house | .1062 | .1193 | .75 |
| other university housing | .0598 | .0646 | 1.41 |
| off campus | .0760 | .0785 | 2.47 |
| with parents | .0581 | .0631 | 1.17 |
| other | .0545 | .0613 | 1.26 |
| Total | .0724 | .0738 | 2.47 |

ANOVA: Blood Alcohol Content

| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Between Groups | 3.188 | 5 | .638 | 92.123 | .000 |
| Within Groups | 348.695 | 50376 | .007 | | |
| Total | 351.884 | 50381 | | | |
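The F statistic in the ANOVA output above is the ratio of the between-groups mean square to the within-groups mean square. The short sketch below recomputes it from the printed sums of squares and degrees of freedom; the function name is made up for illustration, and small differences from the reported 92.123 come from rounding in the printed sums of squares.

```python
def anova_f(ss_between, df_between, ss_within, df_within):
    """Mean squares and F ratio for a one-way ANOVA."""
    ms_between = ss_between / df_between  # variation explained by group membership
    ms_within = ss_within / df_within     # variation left over inside the groups
    return ms_between, ms_within, ms_between / ms_within

# Values from the ANOVA table: 6 residence groups give 5 between-groups df
ms_between, ms_within, f_stat = anova_f(3.188, 5, 348.695, 50376)

print(round(ms_between, 3), round(ms_within, 3), round(f_stat, 2))
```

A significant F only says that at least one pair of group means differs; a post hoc test is needed to find which pairs.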

## Test of the difference between two or more means

Multiple Comparisons

Dependent Variable: Blood Alcohol Content
Games-Howell

| (I) Currently live | (J) Currently live | Mean Difference (I-J) | Std. Error | Sig. | 95% Confidence Interval: Lower Bound | Upper Bound |
|---|---|---|---|---|---|---|
| residence hall | frat/sorority house | -.03865* | .00337 | .000 | -.0483 | -.0290 |
| | other university housing | .01190* | .00135 | .000 | .0081 | .0157 |
| | off campus | -.00316* | .00085 | .003 | -.0056 | -.0007 |
| | with parents | .01350* | .00141 | .000 | .0095 | .0175 |
| | other | .01623* | .00183 | .000 | .0110 | .0215 |
| frat/sorority house | residence hall | .03865* | .00337 | .000 | .0290 | .0483 |
| | other university housing | .05055* | .00354 | .000 | .0404 | .0606 |
| | off campus | .03548* | .00338 | .000 | .0258 | .0451 |
| | with parents | .05215* | .00356 | .000 | .0420 | .0623 |
| | other | .05488* | .00375 | .000 | .0442 | .0656 |
| other university housing | residence hall | -.01190* | .00135 | .000 | -.0157 | -.0081 |
| | frat/sorority house | -.05055* | .00354 | .000 | -.0606 | -.0404 |
| | off campus | -.01506* | .00138 | .000 | -.0190 | -.0111 |
| | with parents | .00160 | .00178 | .947 | -.0035 | .0067 |
| | other | .00433 | .00213 | .323 | -.0017 | .0104 |
| off campus | residence hall | .00316* | .00085 | .003 | .0007 | .0056 |
| | frat/sorority house | -.03548* | .00338 | .000 | -.0451 | -.0258 |
| | other university housing | .01506* | .00138 | .000 | .0111 | .0190 |
| | with parents | .01667* | .00144 | .000 | .0125 | .0208 |
| | other | .01940* | .00185 | .000 | .0141 | .0247 |
| with parents | residence hall | -.01350* | .00141 | .000 | -.0175 | -.0095 |
| | frat/sorority house | -.05215* | .00356 | .000 | -.0623 | -.0420 |
| | other university housing | -.00160 | .00178 | .947 | -.0067 | .0035 |
| | off campus | -.01667* | .00144 | .000 | -.0208 | -.0125 |
| | other | .00273 | .00217 | .809 | -.0035 | .0089 |
| other | residence hall | -.01623* | .00183 | .000 | -.0215 | -.0110 |
| | frat/sorority house | -.05488* | .00375 | .000 | -.0656 | -.0442 |
| | other university housing | -.00433 | .00213 | .323 | -.0104 | .0017 |
| | off campus | -.01940* | .00185 | .000 | -.0247 | -.0141 |
| | with parents | -.00273 | .00217 | .809 | -.0089 | .0035 |

## Test for a relationship between two categorical variables

Is there an association between being a member of a fraternity/sorority and ever being diagnosed with depression?
Hypotheses
- $H_0$: There is no association between being a member of a fraternity/sorority and ever being diagnosed with depression.
- $H_A$: There is an association between being a member of a fraternity/sorority and ever being diagnosed with depression.

## Test for relationship between two categorical variables

Ever - Depression * Frat or sorority? Crosstabulation

| Ever - Depression | | Frat or sorority? yes | Frat or sorority? no | Total |
|---|---|---|---|---|
| yes | Count | 681 | 7692 | 8373 |
| | Expected Count | 715.6 | 7657.4 | 8373.0 |
| no | Count | 3744 | 39657 | 43401 |
| | Expected Count | 3709.4 | 39691.6 | 43401.0 |
| Total | Count | 4425 | 47349 | 51774 |
| | Expected Count | 4425.0 | 47349.0 | 51774.0 |

Chi-Square Tests

| | Value | df | Asymp. Sig. (2-sided) | Exact Sig. (2-sided) | Exact Sig. (1-sided) |
|---|---|---|---|---|---|
| Pearson Chi-Square | 2.185 b | 1 | .139 | | |
| Continuity Correction a | 2.122 | 1 | .145 | | |
| Likelihood Ratio | 2.211 | 1 | .137 | | |
| Fisher's Exact Test | | | | .141 | .073 |
| Linear-by-Linear Association | 2.185 | 1 | .139 | | |
| N of Valid Cases | 51774 | | | | |

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 715.62.
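The expected counts and the Pearson chi-square value in the output above can be recomputed from the four observed cell counts. The sketch below is an illustration with a made-up function name; the p value shortcut via `math.erfc` is valid only for 1 degree of freedom, which is the case for a 2x2 table.

```python
import math

def chi_square_2x2(table):
    """Expected counts, Pearson chi-square, and p value for a 2x2 table (df = 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with df = 1
    return expected, chi2, p_value

# Rows: ever diagnosed with depression (yes, no); columns: frat/sorority (yes, no)
observed = [[681, 7692], [3744, 39657]]
expected, chi2, p_value = chi_square_2x2(observed)

print(round(expected[0][0], 2), round(chi2, 3), round(p_value, 3))
```

The p value of about .139 is above the conventional .05 cutoff, so these data do not provide evidence of an association.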

- A significant association does not indicate causation
- Statistical significance is not always the same as practical significance
- Multiple factors contribute to whether your results are significant
- It gets easier and easier as you practice!

Questions???