A. Variables:
** Good for large data sets, but does not tell you what each individual data value is.
Types of Distributions:
Symmetric: A distribution is symmetric if the right and left sides of the histogram are
approximately the same shape.
Unimodal: The distribution has a single peak that shows the most common value in the data.
(Only one peak).
Bimodal: Distribution has two peaks; represents the two modes of the data.
Skewed to the Right: If the right side extends much farther out than the left side.
Skewed to the Left: If the left side extends much farther out than the right side.
Mean: x̄ = ∑x / n
Median:
- If the total number of observations (n) is odd: the value in position (n + 1)/2.
- If n is even: the average of the values in positions n/2 and (n/2 + 1).
Density Curves:
Symmetric Curve: Mean, median, and mode are all in the same spot…in the centre.
Right Skewed Curve: Mean is pulled to the right, towards the tail.
C. Measures of Spread:
Quartiles:
- Q1: Larger than 25% of the observations.
- Q2: Median.
- Q3: Larger than 75% of the observations.
IQR: Q3-Q1
To Find Outliers (1.5 × IQR rule): an observation is a suspected outlier if it falls
- above Q3 + (1.5)IQR, or
- below Q1 – (1.5)IQR.
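The quartile and outlier rules above can be sketched in Python with the standard-library statistics module (Python 3.8+); the data set here is invented, and note that different textbooks and software compute quartiles with slightly different conventions:

```python
import statistics

data = [5, 7, 8, 9, 10, 11, 12, 13, 14, 40]  # 40 is a suspected outlier

# statistics.quantiles with n=4 returns the three cut points [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # only 40 falls outside the fences
```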
D. Normal Distribution:
Properties:
- Have a bell-shape and are symmetrical.
- The mean (MU) is in the centre of the distribution.
- Standard deviation tells us how spread out the data is on both sides.
Empirical Rule (68–95–99.7): about 68% of observations fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Z = (X – MU)/SIGMA
X = MU + Z(Sigma)
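The two standardization formulas above are inverses of each other; a minimal sketch, with an invented mean and standard deviation:

```python
mu, sigma = 100.0, 15.0   # hypothetical population mean and standard deviation

def z_score(x, mu, sigma):
    """Z = (X - mu) / sigma: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """X = mu + Z * sigma: convert a z-score back to the original scale."""
    return mu + z * sigma

print(z_score(130, mu, sigma))   # (130 - 100) / 15 = 2.0
print(from_z(2.0, mu, sigma))    # 100 + 2 * 15 = 130.0
```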
E. Scatterplots:
Show the relationship between two quantitative variables measured on the same individuals.
F. Correlation:
- Measures the direction and strength of the linear relationship between two quantitative
variables.
Properties of Correlation:
G. Regression:
Regression Line: A straight line that describes how a response variable Y CHANGES as an
explanatory variable X CHANGES. We use this to predict the value of Y for a given value of X.
Residuals: Difference between an observed value of the response variable and the predicted
value.
OBSERVED Y – PREDICTED Y
Lurking Variable: A variable that is NOT among the explanatory or response variables and
may influence the interpretation of relationships.
- Association does NOT imply causation.
H. Two-Way Tables:
Simpson’s Paradox: An association that holds for each of several groups can reverse
direction when the data are combined into one larger group.
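A numeric sketch of the paradox, using the often-quoted kidney-stone treatment figures (Charig et al.): treatment A has the higher success rate within both stone-size groups, yet B has the higher rate once the groups are combined:

```python
# (successes, patients) for each treatment, split by stone size
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def rate(successes, total):
    return successes / total

for group in (small, large):
    assert rate(*group["A"]) > rate(*group["B"])   # A wins within each group

# Combine the two groups: the association reverses.
combined = {t: (small[t][0] + large[t][0], small[t][1] + large[t][1]) for t in "AB"}
assert rate(*combined["B"]) > rate(*combined["A"])  # ...but B wins overall
print(round(rate(*combined["A"]), 2), round(rate(*combined["B"]), 2))
```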
J. Methods of Sampling:
Sample: The part of the population you actually obtain information from (does not include people who don’t respond).
Sampling:
Observational Study: Variables are measured on individuals without influencing the responses.
Block Design: Group similar individuals together and then randomize within each of these
BLOCKS.
Matched Pairs Design: Two treatments under study. Subjects are matched in pairs based on
attributes. Randomly assigned within each pair.
K. Probability:
Mutually Exclusive (or disjoint): Events that can’t both happen at the same time.
Ex: Draw one card; drawing an ace and drawing a king are disjoint, since a single card cannot
be both an ace and a king.
Properties:
- 0 ≤ P(A) ≤ 1 for any event A.
- P(S) = 1 (the whole sample space has probability 1).
- If A and B are disjoint: P(A or B) = P(A) + P(B).
L. Random Variables:
Discrete Random Variables: Lists each possible value the random variable can assume,
together with its probability.
Properties:
- f(x) ≥ 0.
- The total area under the curve is equal to 1.
- P(a< X < b) = area under the curve between A and B.
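A discrete random variable can be sketched as a value-to-probability table; the distribution below is invented, and the checks mirror the properties above (probabilities non-negative, total probability 1), plus the expected value MU = ∑ x·P(x):

```python
dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}   # hypothetical probability distribution

assert all(p >= 0 for p in dist.values())          # each probability >= 0
assert abs(sum(dist.values()) - 1.0) < 1e-9        # probabilities sum to 1

mean = sum(x * p for x, p in dist.items())
print(round(mean, 2))   # 0(0.1) + 1(0.2) + 2(0.4) + 3(0.3) = 1.9
```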
M. Sampling Distribution:
Statistic: Number that can be calculated from a sample without using any unknown parameters.
Law of Large Numbers: x̄ is rarely exactly right. If we keep taking larger and larger samples,
the statistic x̄ gets closer and closer to the parameter MU.
Mean and Standard Deviation of a Sample Mean: If a SRS of size n is drawn from a large
population with mean MU and standard deviation SIGMA, the sample mean x̄ has mean MU and
standard deviation SIGMA/SQR(N).
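The SIGMA/SQR(N) result can be checked by simulation; the die-rolling population, sample size, and repetition count below are arbitrary choices:

```python
import math
import random
import statistics

random.seed(0)
population_sigma = statistics.pstdev(range(1, 7))   # fair-die sd, about 1.708
n = 25

# Draw many samples of size n and record each sample mean.
means = [statistics.mean(random.choices(range(1, 7), k=n)) for _ in range(20_000)]

observed = statistics.pstdev(means)          # spread of the sampling distribution
theoretical = population_sigma / math.sqrt(n)
print(round(observed, 2), round(theoretical, 2))
```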
Central Limit Theorem: Forms foundation for the inferential branch of statistics.
1. If the sample size n is large (n > 40), the sampling distribution of the sample mean approximates
a Normal Distribution. The greater the sample size, the better the approximation.
2. If the population itself is Normally distributed, the sampling distribution of the sample mean is
Normally Distributed for ANY sample size n.
** Allows us to use Normal probability calculations to answer questions about sample
means from many observations even when the population distribution is NOT Normal.
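A simulation sketch of this: sample means drawn from a strongly right-skewed (exponential) population look roughly Normal; the sample size and repetition count are arbitrary choices:

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # Mean of n draws from a right-skewed exponential population (mean 1)
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(50) for _ in range(5_000)]

# In a Normal-looking distribution the mean and median nearly coincide,
# unlike in the skewed exponential population itself.
print(round(statistics.mean(means), 2), round(statistics.median(means), 2))
```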
N. Confidence Intervals:
1. Based on the data, we are 95% confident that the population parameter is contained in the interval.
2. If we computed many separate confidence intervals, about 95% of them would contain the population parameter.
Sample Size for a Desired Margin of Error:
N = (Z* × SIGMA / MARGIN OF ERROR)²
**ALWAYS ROUND UP N
Margin of Error:
M = Z* × (SIGMA/SQR(N))
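The margin-of-error and sample-size formulas can be sketched together; all numbers here are invented, and 1.96 is the standard critical value for 95% confidence:

```python
import math

x_bar, sigma, n = 50.0, 8.0, 64
z_star = 1.96                       # critical value for 95% confidence

m = z_star * sigma / math.sqrt(n)   # margin of error: 1.96 * 8 / 8 = 1.96
ci = (x_bar - m, x_bar + m)         # x-bar +/- m

# Sample size needed for a target margin of error of 1.5 (always round UP):
n_needed = math.ceil((z_star * sigma / 1.5) ** 2)
print(round(m, 2), round(ci[0], 2), round(ci[1], 2), n_needed)
```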
P. Hypothesis Testing:
- Is a process using sample statistics to test a claim about the value of a population
parameter.
Steps:
1. State the NULL HYPOTHESIS. It contains a statement of equality: ≤, =, or ≥. This is what we
actually test with the evidence.
P-Values:
- The smaller the p-value, the greater the evidence against the null hypothesis. Use the
STANDARD NORMAL TABLE to find the p-value.
- You always make one of TWO decisions: 1) Reject the null hypothesis, or 2) Fail to reject the
null hypothesis.
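A one-sample z test (SIGMA assumed known) can be sketched end to end; all numbers are invented, and math.erf stands in for the standard Normal table:

```python
import math

def normal_cdf(z):
    """Standard Normal CDF via the error function (replaces the printed table)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# H0: mu = 100  vs  Ha: mu > 100 (one-sided), with sigma known
mu0, sigma, n, x_bar = 100.0, 15.0, 36, 105.0

z = (x_bar - mu0) / (sigma / math.sqrt(n))   # test statistic: 5 / 2.5 = 2.0
p_value = 1 - normal_cdf(z)                  # right-tail p-value

print(round(z, 2), round(p_value, 4))
# Small p-value -> evidence against H0 (reject at alpha = 0.05).
```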
- t-distributions are more spread out than the standard normal: they have more probability in the
tails and less in the centre.
(n-1) degrees of freedom
- Draw a SRS of size n from a large population having an unknown mean MU. A level C
confidence interval for MU is: x̄ ± t* × (s/SQR(N)), where t* has (n – 1) degrees of freedom.
Robustness:
- T procedure is correct when the population is
normally distributed.
- ROBUST to small deviations from normality. Results will not be
greatly affected.
Factors:
1. MUST be a SRS from the population.
2. Outliers and skewness strongly influence the mean; their effect gets smaller as sample size increases.
3. Both populations are normally distributed. MEANS and STANDARD DEVIATIONS are
UNKNOWN.
Confidence Interval:
N = (Z*/M)² × P*(1 – P*)
Where…
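The proportion sample-size formula above can be sketched with conventional choices: p* = 0.5 as the conservative guess (it maximizes p*(1 – p*)), and an invented margin of error:

```python
import math

z_star = 1.96     # critical value for 95% confidence
m = 0.03          # desired margin of error (3 percentage points)
p_star = 0.5      # conservative guess maximizing p*(1 - p*)

# N = (Z*/M)^2 * P*(1 - P*), always rounded UP
n = math.ceil((z_star / m) ** 2 * p_star * (1 - p_star))
print(n)
```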