
Summarizing Data:

Categorical data:
Count: number of individuals in a category
Proportion: count in category divided by total number of individuals considered
Percentage: proportion as decimal x 100%
Quantitative data: mean is sum of values divided by total number of values
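The summaries above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method; the function names and the small list of colors are hypothetical, made up for the example.

```python
from collections import Counter

def summarize_categories(values):
    """Map each category to (count, proportion, percentage)."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, n / total, 100 * n / total) for cat, n in counts.items()}

def mean(values):
    """Sum of values divided by the number of values."""
    return sum(values) / len(values)

# Hypothetical categorical data: 8 individuals, 3 categories.
colors = ["red", "blue", "red", "green", "red", "blue", "red", "blue"]
summary = summarize_categories(colors)
print(summary["red"])       # (4, 0.5, 50.0)
print(mean([2, 4, 6, 8]))   # 5.0
```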
Probability sampling plan incorporates randomness in the selection process so rules of probability apply.
Simple random sample is taken at random and without replacement.
Stratified random sample takes separate random samples from groups of similar individuals (strata) within the population.
Cluster sample selects small groups (clusters) at random from within the population (all units in each cluster included).
Multistage sample stratifies in stages, randomly sampling from groups that are successively more specific.
Systematic sampling plan uses methodical but non-random approach (select individuals at regularly spaced intervals on a list).
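The sampling plans listed above might be sketched as follows, using Python's standard `random` module. The function names, the stratum and cluster labels, and the toy populations are assumptions for illustration only.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n individuals at random, without replacement."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a separate random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random; include every unit in each chosen cluster."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def systematic_sample(listing, step, start=0):
    """Non-random: take every step-th individual from an ordered list."""
    return listing[start::step]

rng = random.Random(0)  # seeded only so the sketch is repeatable
population = list(range(100))
srs = simple_random_sample(population, 10, rng)

# Hypothetical strata and clusters.
strata = {"first_year": list(range(0, 50)), "second_year": list(range(50, 100))}
by_stratum = stratified_sample(strata, 5, rng)

clusters = {"c1": [1, 2], "c2": [3, 4], "c3": [5, 6]}
from_clusters = cluster_sample(clusters, 2, rng)

print(systematic_sample(list(range(20)), 5))  # [0, 5, 10, 15]
```

Only the first three functions are probability sampling plans; the systematic version involves no randomness unless the starting point is itself chosen at random.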
Bias: tendency of an estimate to deviate in one direction from a true value
Some sources of bias:
Selection bias: due to an unrepresentative sample rather than to flawed study design. Examples: sampling frame doesn't match population,
self-selected (volunteer) sample, haphazard sample, convenience sample, non-response
Observational study: researchers record variables' values as they naturally occur (can
be retrospective or prospective). (Watch what happens.)
Retrospective observational study: researchers record variables' values backward in time, about the past.
Prospective observational study: researchers record variables' values forward in time from the present.
Sample survey: observational study with self-reported values, often opinions (questionnaire)
Experiment: researchers manipulate explanatory variable, observe response
Anecdotal evidence: personal accounts by one or a few individuals selected haphazardly
or by convenience. (To be avoided.)
Sample Survey Design: Issues to Consider
Open vs. closed questions, Unbalanced response options, Leading questions or planting ideas with questions, Complicated
questions, Sensitive questions, Hard-to-define concepts
Blind study design: Subjects blind to avoid placebo effect; Researchers blind to avoid experimenter effect
Other pitfalls of experiments: lack of realism, Hawthorne effect (people's performance is improved due to awareness of being
observed), non-compliance, unethical or impractical treatment
Statistic: number summarizing sample
Parameter: number summarizing population
p-hat (p̂): sample proportion (a statistic)
p: population proportion (a parameter)
Symmetric distribution: balanced on either side of center
Skewed distribution: unbalanced (lopsided)
Skewed left: has a few relatively low values
Skewed right: has a few relatively high values
Outliers: values noticeably far from the rest
Unimodal: single-peaked
Bimodal: two-peaked
Uniform: all values equally common (flat shape)
Normal: a particular symmetric bell-shape

The 1.5-Times-IQR Rule identifies outliers:


below Q1-1.5(IQR) considered low outlier
above Q3+1.5(IQR) considered high outlier
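A minimal sketch of the 1.5-Times-IQR Rule in Python follows. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" method of `statistics.quantiles`, so Q1 and Q3 may differ slightly from hand calculations. The function name and data are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Return (low_outliers, high_outliers) under the 1.5-Times-IQR Rule."""
    # Assumption: quartiles computed with the "inclusive" convention.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    low = [v for v in values if v < low_fence]
    high = [v for v in values if v > high_fence]
    return low, high

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(iqr_outliers(data))  # ([], [50])
```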
A boxplot displays median, quartiles, and extreme values, with special treatment for
outliers:
1. Bottom whisker to minimum non-outlier
2. Bottom of box at Q1
3. Line through box at median
4. Top of box at Q3
5. Top whisker to maximum non-outlier
Outliers denoted *.

Mean: the arithmetic average of values. For n sampled values, the mean is called x-bar. The mean of a population, to be discussed
later, is denoted μ and called mu.
Symmetric: mean approximately equals median
Skewed left / low outliers: mean less than median
Skewed right / high outliers: mean greater than median
Pronounced skewness / outliers -> Report median.
Otherwise, in general -> Report mean (contains more information).
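The relationship between mean and median can be checked on small data sets. The three lists below are hypothetical, chosen so that one is symmetric, one has a high outlier, and one has a low outlier.

```python
import statistics

# Hypothetical data sets chosen to illustrate each shape.
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 3, 4, 50]    # one high outlier
skewed_left = [-44, 2, 3, 4, 5]    # one low outlier

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    # The median stays at 3 in all three; only the mean moves toward the outlier.
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```

In each skewed case the outlier pulls the mean toward it while the median stays put, which is why the median is the safer report under pronounced skewness.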
Notation: we denote the mean x-bar and standard deviation s if the values constitute only a sample; we denote the mean μ (mu)
and standard deviation σ (sigma) if the values constitute a population.
z-score=(value-mean)/(sd)

Unstandardizing: x=mean+z(sd)
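Standardizing and unstandardizing are inverses of each other, which a short sketch makes concrete. The function names and the example mean of 100 with sd of 15 are assumptions for illustration.

```python
def z_score(value, mean, sd):
    """Standardize: how many standard deviations the value lies from the mean."""
    return (value - mean) / sd

def unstandardize(z, mean, sd):
    """Recover the original value from a z-score."""
    return mean + z * sd

# Hypothetical distribution with mean 100 and sd 15.
z = z_score(130, 100, 15)
print(z)                           # 2.0
print(unstandardize(z, 100, 15))   # 130.0
```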
Two-sample or Several-Sample Design: extend notation for means and standard deviations with subscript numbers 1, 2, etc.
Paired Design: indicate notation for differences with subscript d
Two- or Several-Sample Design
Format: one column for each group or one column for each of two variables; Display: side-by-side boxplots; Compare: means and
sds or five-number summaries
Paired Design: Display: Histogram of differences; Summarize: Mean and sd of differences
Use rows for explanatory, columns for response
Compare proportions or percentages in response of interest (conditional proportions or percentages) for various explanatory groups.
A conditional percentage or proportion tells the percentage or proportion in the response
of interest, given that an individual falls in a particular explanatory group.
If the nature of a relationship changes, depending on whether groups are combined or kept
separate, we call this phenomenon Simpson's Paradox.
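Conditional proportions and Simpson's Paradox can be illustrated with a small two-way table. The counts below are entirely hypothetical, constructed so that treatment A has the higher success proportion within each group while treatment B has the higher proportion once the groups are combined.

```python
# Hypothetical counts: (treatment, case type) -> (successes, total).
counts = {
    ("A", "easy"): (90, 100),
    ("B", "easy"): (800, 1000),
    ("A", "hard"): (300, 1000),
    ("B", "hard"): (20, 100),
}

def conditional_proportion(treatment, case_type):
    """Proportion of successes, given treatment and case type."""
    successes, total = counts[(treatment, case_type)]
    return successes / total

def combined_proportion(treatment):
    """Proportion of successes, given treatment, with case types combined."""
    successes = sum(s for (t, _), (s, _n) in counts.items() if t == treatment)
    total = sum(n for (t, _), (_s, n) in counts.items() if t == treatment)
    return successes / total

for case_type in ("easy", "hard"):
    print(case_type,
          conditional_proportion("A", case_type),
          conditional_proportion("B", case_type))
# Kept separate: A is higher in both groups (0.9 vs 0.8, 0.3 vs 0.2).

print(round(combined_proportion("A"), 3), round(combined_proportion("B"), 3))
# Combined: B is higher (about 0.355 vs 0.745).
```

The reversal arises because treatment A was applied mostly to hard cases and treatment B mostly to easy ones, so case type acts as a confounding variable.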
Scatterplot displays relationship between 2 quantitative variables:
Explanatory variable (x) on horizontal axis
Response variable (y) on vertical axis

Non-Overlapping Or Rule: For any two non-overlapping events A and B, P(A or B)=P(A)+P(B).
Independent And Rule: For any two independent events A and B, P(A and B)=P(A)xP(B).
General Or Rule: For any two events A and B, P(A or B)=P(A)+P(B)-P(A and B).
General And Rule: we need an And Rule that applies even if events are dependent.
For any two events A and B, P(A and B)=P(A)xP(B given A)
Rearrange to form Rule of Conditional Probability: P(B given A) = P(A and B)/P(A)
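These rules can be verified on a simple sample space. The sketch below assumes one roll of a fair six-sided die, with events represented as sets of outcomes; the event choices and helper names are made up for the example. Exact fractions avoid rounding issues.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # assumed: one roll of a fair die

def prob(event):
    """Probability of an event (a set of equally likely outcomes)."""
    return Fraction(len(event & outcomes), len(outcomes))

def prob_given(b, a):
    """Rule of Conditional Probability: P(B given A) = P(A and B) / P(A)."""
    return prob(a & b) / prob(a)

A = {2, 4, 6}   # roll is even
B = {1, 2}      # roll is at most 2
C = {1}         # C and D do not overlap
D = {2}

# Non-Overlapping Or Rule
print(prob(C | D) == prob(C) + prob(D))                 # True
# General Or Rule
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))   # True
# General And Rule
print(prob(A & B) == prob(A) * prob_given(B, A))        # True
# A and B happen to be independent here, so the Independent And Rule also holds.
print(prob(A & B) == prob(A) * prob(B))                 # True
```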
Residual=True value-Predicted value
