
Unit 1: Exploratory Data Analysis

In order to convert raw data into useful information, we need to summarize and then
examine the distribution of the variable. By distribution, we mean: what values the
variable takes and how often the variable takes those values.

There are two graphical displays for visualizing the distribution of categorical data:
1. The Pie Chart emphasizes how the different categories relate to the whole
2. The Bar Chart emphasizes how the different categories compare with each other
A variation is a pictogram but pictograms can be misleading.

To display data from one quantitative variable graphically:
1. Histogram: each observation is counted in only one interval. Some information
is lost on a histogram. Ex: We cannot answer what was the lowest score. We have
to choose how to break the data into intervals, which affects the look of the
histogram.
2. Stemplot (the stem-and-leaf plot): the leaf is the right-most digit and the stem is
everything except the right-most digit. It preserves the original data and sorts the data.

Shape, Center & Spread is the overall pattern. Outliers are deviations from the pattern.
When describing the shape of a distribution, we should consider:
1. Symmetry/skewness of the distribution
2. Peakedness (modality): the number of peaks (modes) the distribution has.




A unimodal distribution has one mode where observations are concentrated. A bimodal
distribution has two modes where the observations are concentrated. A flat (uniform)
distribution has no modes.

A distribution is called skewed right if the right tail (larger values) is much longer than
the left tail (smaller values). Ex: salaries.

A distribution is skewed left if the left tail (smaller values) is much longer than the right
tail (larger values). Ex: age of death from natural causes such as heart disease or cancer

Skewed distributions can also be bimodal. If a distribution has more than two modes, we
say that the distribution is multimodal.

The center of the distribution is its midpoint which is the value that divides the
distribution so that approximately half the observations take smaller values and
approximately half the observations take larger values.

The spread (variability) can be described by the approximate range covered by the data.
The smallest observation is min and the largest is max.

Outliers are observations that fall outside the overall pattern. They need further research
before continuing the analysis.

The dotplot is like a stemplot except that each observation is displayed with a dot rather
than with its actual value. The stemplot is easy and quick for small, simple datasets, retains
the actual data, and sorts the data.

The mode is the value where the distribution has a peak. It is also the most commonly
occurring value in a distribution. The mean is the average of a set of observations. The
median is the midpoint of the distribution.

The mean is very sensitive to outliers (because it factors in their magnitude) while
the median is resistant to outliers.
-For symmetric distribution with no outliers: Mean is approximately equal to Median
-For skewed right distributions or datasets with high outliers: Mean is greater than
Median
-For skewed left distributions with low outliers: Mean is less than Median
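A minimal Python sketch of this point, on hypothetical exam scores, showing how a single high outlier pulls the mean but barely moves the median:

# Sketch: how an outlier pulls the mean but not the median (hypothetical data).
import statistics

scores = [68, 70, 71, 73, 75, 76, 78]        # roughly symmetric, no outliers
scores_with_outlier = scores + [150]          # add one high outlier

print(statistics.mean(scores), statistics.median(scores))
# -> 73 and 73: mean and median are nearly equal for a symmetric dataset

print(statistics.mean(scores_with_outlier), statistics.median(scores_with_outlier))
# -> mean jumps to 82.625 while the median only moves to 74: the mean is
#    sensitive to the outlier's magnitude, the median is resistant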

The range covered by the data is the most intuitive measure of variability. The range is
exactly the distance between the smallest data point and the largest one. Range = Max −
Min.

Inter-Quartile Range (IQR) measures the variability of a distribution by giving us the
range covered by the Middle 50% of the data.

You split the data into halves. Then you split the halves into quarters. Quartile 1 (Q1) is the
first quartile since one quarter of the data points fall below it. Repeat for the top
half. Quartile 3 (Q3) is the third quartile since three quarters of the data points fall below it.
IQR = Q3 − Q1.

When n is odd, the median is not included in either the bottom or top half of the data.
When n is even, the data are naturally divided into two halves.

We can use IQR to detect outliers.
An observation is considered a suspected outlier if it is:
- below Q1 − 1.5(IQR)
- above Q3 + 1.5(IQR)
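A short Python sketch of the 1.5(IQR) rule on hypothetical data (note: NumPy's default quartile method can differ slightly from the "split the halves" method described above):

# Sketch: quartiles, IQR, and the 1.5*IQR outlier fences (hypothetical data).
import numpy as np

data = np.array([3, 5, 7, 8, 9, 11, 12, 13, 14, 15, 40])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr)               # quartiles and IQR
print(lower_fence, upper_fence)  # values outside these fences are suspected outliers
print(outliers)                  # 40 is flagged as a suspected outlier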
There can be different reasons for an outlier. Even though it is an extreme value, if an outlier can be
understood to have been produced by essentially the same sort of physical or biological
process as the rest of the data, and if such extreme values are expected to eventually
occur again, then such an outlier indicates something important and interesting about the
process youre investigating and should be kept in the data.

If an outlier can be explained to be produced under fundamentally different conditions
from the rest of the data then such an outlier can be removed. An outlier might indicate a
mistake which should be corrected if possible or removed.

The combination of all five numbers (min, Q1, M, Q3, Max) is called the five number
summary and provides a quick numerical description of both the center and spread of a
distribution.

The boxplot graphically represents the distribution of a quantitative variable by visually
displaying the five number summary and any observation that was classified as a
suspected outlier.

Boxplots are most useful when presented side by side for comparing and contrasting
distribution from two or more groups.

The standard deviation quantifies the spread of a distribution in a completely different
way. It quantifies the spread by measuring how far the observations are from their mean.
SD gives the average or typical distance between a data point and the mean.

In order to find the SD, we calculate the mean of our data, then we find the deviations from
the mean, which are the differences between each observation and the mean. Then we
square each deviation, sum the squared deviations, and divide by n − 1 (one less than the
sample size). This gives us the variance, which is (approximately) the average of the squared
deviations. The SD is the square root of the variance.
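A step-by-step Python sketch of that calculation, using a small hypothetical dataset:

# Sketch: computing the standard deviation step by step (hypothetical data).
import math

data = [4, 7, 8, 9, 12]

mean = sum(data) / len(data)                  # 1. mean
deviations = [x - mean for x in data]         # 2. deviations from the mean
squared = [d ** 2 for d in deviations]        # 3. squared deviations
variance = sum(squared) / (len(data) - 1)     # 4. divide by n - 1 -> variance
sd = math.sqrt(variance)                      # 5. square root -> SD

print(mean, variance, sd)                     # 8.0, 8.5, about 2.92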

When the SD is 0, all the observations have the same value; there is no variability or spread
in the data, and the range and IQR will also be 0. The SD is strongly influenced by outliers
in the data.

Use the mean and the SD as measures of center and spread only for reasonably
symmetric distributions with no outliers. Use the five-number summary for all other
cases.

For distributions that have a normal shape, the following rule applies:
The Standard Deviation Rule:
Approximately 68% of the observations fall within 1 standard deviation of the mean.
Approximately 95% of the observations fall within 2 standard deviations of the
mean.
Approximately 99.7% (or virtually all) of the observations fall within 3 standard
deviations of the mean.



SD can be used as a measure of spread if the mean is the measure of center.

In most studies involving two variables, each of the variables has a role. We distinguish
between:
- The explanatory variable or independent variable which is the variable that
claims to explain, predict or affect the response. Denoted as X
- The response variable or the dependent variable which is the outcome of the
study. Denoted as Y

There are four possibilities.

1. Categorical explanatory and quantitative response (C → Q): we use side-by-side
boxplots to compare the distribution of the response (quantitative) variable within
each category of the explanatory variable. The data display is side-by-side boxplots
and the numerical summaries are descriptive statistics of the response within each category.
2. Categorical explanatory and categorical response (C → C): we examine the
relationship between two categorical variables by using a two-way table. The data
display is the two-way table and the numerical summaries are the conditional
percentages. They are found by converting the counts to percents within (or
restricted to) each value of the explanatory variable separately. If the explanatory
variable is in the rows, these are row percents; if the explanatory variable is in the
columns, they are column percents (a small code sketch follows this list). A double
bar chart can show all the conditional percents instead of a table.
3. Quantitative explanatory and quantitative response (Q → Q): we use scatterplots.

The direction of the relationship can be positive, negative or neither. A positive
relationship means that an increase in one of the variables is associated with an
increase in the other. Vice versa with a negative relationship.
The form of the relationship is its general shape. Relationships are linear when the
points are scattered about a line. Relationships are curvilinear when the points follow
a curve (for example, an exponential pattern). Another form is clusters, where the points form distinct groups.
The strength of the relationship is determined by how closely the data follow the
form of the relationship. In a stronger relationship the points are more tightly packed
around the form; in a weaker relationship they are more loosely scattered.
Data points that deviate from the pattern are called outliers.
A labeled scatterplot labels each subgroup differently.
4. Quantitative explanatory and categorical response
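The sketch promised in case 2 above: computing row (conditional) percents from a two-way table with pandas. The counts and category labels are hypothetical, used only to show the mechanics.

# Sketch: conditional (row) percents from a two-way table (hypothetical counts).
# Explanatory variable in the rows, response variable in the columns.
import pandas as pd

table = pd.DataFrame(
    {"About right": [560, 295], "Overweight": [163, 72], "Underweight": [37, 73]},
    index=["Group A", "Group B"],
)

# Convert counts to percents within each row (each value of the explanatory variable)
row_percents = table.div(table.sum(axis=1), axis=0) * 100
print(row_percents.round(1))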

Not every relationship between two quantitative variables has a linear form. However
there are many ways to interpret linear relationships.

The correlation coefficient (r) is a numerical measure that measures the strength and
direction of a linear relationship between two quantitative variables.
It is calculated as r = (1/(n − 1)) Σ [(xᵢ − x̄)/sₓ]·[(yᵢ − ȳ)/s_y], the average product of the standardized (z-score) values of X and Y.
The value of r ranges from −1 to 1. Negative values of r indicate a negative direction for a
linear relationship and positive values of r indicate a positive direction for a linear
relationship. Values of r that are close to 0 (either negative or positive) indicate a weak
linear relationship. Values that are close to −1 or close to 1 indicate a strong linear
relationship, either negative or positive.
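A small Python sketch of that definition of r, on hypothetical (x, y) data, checked against NumPy's built-in Pearson correlation:

# Sketch: computing the correlation coefficient r (hypothetical data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])                  # explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])     # response variable

# r as the average product of z-scores (using n - 1, to match the sample SD)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (len(x) - 1)

print(round(r, 4))                        # close to +1: strong positive linear relationship
print(round(np.corrcoef(x, y)[0, 1], 4))  # same value from NumPy's built-in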

Correlation does not change when the units of measurement of either one of the variables
change. Correlation is unitless. Correlation only measures the strength of a linear
relationship and ignores any other type of relationship, no matter how strong it is.
Correlation by itself is not enough to determine whether or not a relationship is linear.
Always look at the data! We can have a strong curvilinear relationship and mistake it for a
moderately strong linear relationship.
The technique that specifies the dependence of the response variable on the explanatory
variable is called regression. When that dependence is linear, the technique is called
linear regression. It finds the line that best fits the pattern of the linear relationship. We
use a criterion called the least squares criterion which says Among all the lines that
look good on your data, choose the one that has the smallest sum of squared vertical
deviations.


Graphically, we are looking for the line with the smallest total area of the squared vertical
deviations (the shaded squares in the course figure). This line is called the least-squares
regression line.

The slope is b = r·(s_y/sₓ) and the intercept is a = ȳ − b·x̄; the intercept a depends on the
value of the slope b, so you need to find b first. The slope of the least-squares regression
line can be interpreted as the average change in the response variable when the
explanatory variable increases by 1 unit.
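A Python sketch using those standard formulas (b = r·s_y/sₓ, a = ȳ − b·x̄) on the same hypothetical data as the correlation example:

# Sketch: least-squares regression line from the slope/intercept formulas (hypothetical data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r = np.corrcoef(x, y)[0, 1]
b = r * (y.std(ddof=1) / x.std(ddof=1))   # slope: average change in y per 1-unit change in x
a = y.mean() - b * x.mean()               # intercept: found after the slope

print(round(b, 3), round(a, 3))
print(round(a + b * 4.5, 3))              # prediction for x = 4.5 (within the data's range)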

Prediction for ranges of the explanatory variable that are not in the data is called
extrapolation. Extrapolations can lead to very poor or illogical predictions. Since there is
no way of knowing whether a relationship holds beyond the range of the explanatory
variable in the data, extrapolation is not reliable, and should be avoided.

Association does not imply causation!

A lurking variable is a variable that is not among the explanatory or response variables
in a study but could substantially affect your interpretation of the relationship among
those variables. A lurking variable might have an effect on both the explanatory and the
response variables.

An observed association between two variables is not enough evidence that there is a
causal relationship between them.

Simpson's paradox: whenever the introduction of a lurking variable causes us to
rethink the direction of an association.

Including a lurking variable in our exploration may: help us to gain a deeper
understanding of the relationship between variables or lead us to rethink the direction of
an association.

Page 57 for a summary of EDA


Unit 2: Producing Data
A volunteer sample is a sample in which individuals have selected themselves to be
included. Such samples are practically guaranteed to be biased because sampled individuals only
provide information about themselves, so we cannot generalize to any larger group at all.
(In some situations, such as experiments on human subjects, using volunteers may be the only ethical option.)

A convenience sample is when individuals happen to be at the right time and place to
suit the schedule of the researcher. Sample results can be biased because some
convenience samples are designed in such a way that certain individuals have no chance
at all of being selected.

A sampling frame is a list of potential individuals to be sampled. There may be bias
because of a discrepancy between the sampling frame and the population. It is always best
to have the sampling frame match the population as closely as possible.

Systematic sampling is when individuals are sampled completely according to a system.
Ex: every 50th student. If individuals are sampled completely at random, and without
replacement, then each group of a given size is just as likely to be selected as all the other
groups of that size. This is called a simple random sample (SRS).

Volunteer response has many potential problems. It is not as problematic as a volunteer
sample, but there is a danger that those who do respond are different from those who don't
with respect to the variable of interest. Nonresponse is also an issue.

Any plan that relies on random selection is called a probability sampling plan (or
technique).
- Simple Random Sampling is the simplest probability sampling plan. Each
individual has the same chance of being selected.
- Cluster Sampling is used when the population is naturally divided into groups. Ex:
all the nurses in a certain city are divided into hospitals; all the students in a
university are divided into majors.
- Stratified Sampling is used when our population is naturally divided into
subpopulations called strata. Ex: a random sample of 25 seniors from each of
the high schools in the city, combined.
Multistage sampling is essentially a complex form of cluster sampling. You have
another stage of sampling in which you choose a sample from each randomly selected
cluster.

Researchers can learn more about the variables of interest for the population by taking
larger samples.

An observational study is a study in which values of the variable or variables of interest
are recorded as they naturally occur. There is no interference by the researchers who
conduct the study. A sample survey is a particular type of observational study in which
individuals report variables' values themselves, frequently by giving their opinions. An
experiment is where researchers interfere and they assign the values of the explanatory
variable to individuals. The researchers take control of the values of the explanatory
variable because they want to see how changes in the value of the explanatory variable
affect the response variable. The type of design used, and the details of the design, are
crucial because they will determine what kind of conclusions we may draw from the
results.

Studies in which the values of the variables of interest are recorded forward in time are called
prospective observational studies. Those recorded backward in time are called retrospective
observational studies.




A lurking variable can be tied in with the explanatory variable's values and may itself
cause the response to be a success or a failure.
The key to establishing causation is to rule out the possibility of any lurking variable, or
in other words, to ensure that individuals differ only with respect to the values of the
explanatory variable.

The factor is the explanatory variable. The different imposed values of the explanatory
variable are the treatments. The groups receiving the different treatments are called
treatment groups. The group in which individuals have no specific treatment imposed is
the control group. The subjects in each treatment group differ from those in the other
treatment groups only with respect to the treatment.

Random assignment to treatments is the best way to prevent the treatment groups from
differing from each other in ways other than the treatment assigned. A
randomized controlled experiment is a design where researchers control the values of the
explanatory variable with a randomization procedure. The groups should not differ
significantly with respect to any potential lurking variable. If we then see a relationship
between the explanatory and response variable, we have evidence that it is a causal
one.

The control group consists of subjects who do not receive a treatment but who are
otherwise handled identically to those who do receive a treatment. Subjects in an
experiment should be blind to which treatment they receive. A placebo pill can be
administered to the control group so that there are no psychological differences between
those who receive the drug and those who do not. When patients improve because they
are told they are receiving treatment, even though they are not actually receiving
treatment, this is known as the placebo effect.

In some cases, it is important for researchers who evaluate the response to be blind to
which treatment the subject received in order to prevent the experimenter effect from
influencing their assessments. If neither the subjects nor researchers know who assigned
what treatment, then the experiment is called double-blind. The most reliable way to
determine whether the explanatory variable is actually causing changes in the response
variable is to carry out a randomized controlled double-blind experiment.

The phenomenon where people in an experiment behave differently from how they would
normally behave is called the Hawthorne effect. A lack of realism or lack of ecological
validity is a possible drawback to the use of an experiment rather than an observational
study to explore a relationship. Noncompliance is failure to submit to the assigned
treatment.

In experiments with two explanatory variables, there has to be one treatment group for
every combination of categories of the two explanatory variables.

The experiment's design may be enhanced by relaxing the requirement of total
randomization and blocking the subjects first. A study design called matched pairs
may enable us to pinpoint the effects of the explanatory variable by comparing responses
for the same individual under two explanatory values, or for two individuals who are as
similar as possible except that the first gets one treatment and the second gets another (or
serves as the control).


Before-and-after studies are another common type of matched pairs design. For each
individual, the response variable is measured twice: first before the treatment, then again
after the treatment.

The following issues in the design of sample surveys will be discussed:
- Open vs. closed questions: An open question allows for unlimited response. A
closed question limits it to only certain responses. Closed questions may be biased
by the options provided.
- Unbalanced response options
- Leading questions
- Planting ideas with questions
- Complicated questions
- Sensitive questions
Questions should be worded neutrally. Earlier questions should not deliberately influence
responses to later questions. Questions shouldn't be confusing or complicated. Survey
method and questions should be carefully designed to elicit honest responses if there are
sensitive issues involved.
Page 79 for Summary on Producing Data
Unit 3: Probability
Probability is not always intuitive. Probability is a mathematical description of
randomness and uncertainty. It is a way to measure or quantify uncertainty. It is the
likelihood that something will occur.

We use the notation P(event) to represent a probability. Ex: P(win lottery) = the
probability that a person who has a lottery ticket will win that lottery. The probability of
an event tells us how likely it is that the event will occur. The probability that an event
will occur is between 0 and 1 or 0<= P(A) <=1.

People like to express probability in percentages. Each can be changed into an equivalent
percentage. The chance that an event will occur is between 0% and 100%. There are two
fundamental ways in which we can determine probability:
Theoretical, also known as Classical, which includes games of chance such as
flipping coins, rolling dice, spinning spinners, etc. They are classical because
their probabilities are determined by the game itself.
Empirical methods use a series of trials that produce outcomes that cannot be
predicted in advance (hence the uncertainty).
A random experiment is a series of trials that produces outcomes that cannot be
predicted in advance. To estimate the probability P(A), we may repeat the random
experiment many times and count the number of times event A occurs. P(A) is then
estimated by the ratio of the number of times A occurs to the number of repetitions;
this ratio is called the relative frequency of event A.

Law of Large Numbers: the actual or true probability of an event is estimated by the
relative frequency with which the event occurs in a long series of trials. The more
repetitions that are performed, the closer the relative frequency gets to the true probability
of the event.
We will look at the probability distribution for quantitative variables. When the
outcomes are quantitative, we call the variable a random variable. Discrete random
variables have values that can be listed and often can be counted. Continuous random
variables can take any value in an interval and are often measurements.
The features of a probability distribution (also called a probability model):
Outcomes are random. Individual outcomes are uncertain but there is a regular
predictable distribution of outcomes in a large number of repetitions.
Provides a way of assigning probabilities to all possible outcomes
The probability of each possible outcome can be viewed as the relative frequency
of the outcome in a large number of repetitions so it can be any value between 0
and 1.
The sum of the probabilities of all outcomes must be 1.

A) What is the probability that a college student will change majors at most once?
The phrase indicates that the student never changed majors or the student changed
once. Therefore, to find this probability, we need to add up the probabilities that
are highlighted in the table. P(a college student changes majors at most once) =
P(X=0) + P(X=1) = 0.135 + 0.271 = 0.406
B) What is the probability that a randomly selected college student will change his
major as often as or more often than John?
The phrase means changing majors two or more times. P(change majors 2 or
more times) = P(X=2) + P(X=3) + … + P(X=8) = 0.594,
or 1 − [P(X=0) + P(X=1)] = 1 − [0.135 + 0.271] = 0.594
C) How often does one have to change a major to be considered unusual?
To make a judgment, we have to consider what might be unusual based on the
table. For example, we might notice that the probability that a student will change
majors 5 or more times is only about 5%. That is unusual.
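A tiny Python sketch of parts (A) and (B), using only the two probabilities quoted in the text (P(X=0) = 0.135 and P(X=1) = 0.271); the full probability table is not reproduced here.

# Sketch: "at most once" and its complement, from the quoted probabilities.
p_0, p_1 = 0.135, 0.271

p_at_most_once = p_0 + p_1           # P(X <= 1)
p_two_or_more = 1 - p_at_most_once   # complement rule: P(X >= 2)

print(p_at_most_once)                # 0.406
print(p_two_or_more)                 # 0.594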
If the distribution is skewed right, then the probability decreases on the right side. If the
probability distribution is skewed, the mean and SD are not the best way to describe the
center and spread. However, we can still use the mean and standard deviation to identify
unusual values for the random variable.
Example:

The example shows that changing majors five or more times is indeed unusual. Another way
to think about "unusual" is to look at the outcomes relative to the mean: we might consider
outcomes more than two SDs from the mean as unusual.

The distribution in which it is more likely to find values that are further from the mean
will have a larger SD. The distribution in which it is less likely to find values that are
further from the mean will have a smaller SD.

The distribution that has the largest SD will have the largest spread relative to the mean
with higher relative frequencies further from the mean than other distributions.

In a probability histogram (with bars of width 1), the area of each bar has the same numerical
value as its height, so the areas of the bars can be used to find probabilities.

The Probability Distribution of a Continuous Random Variable
The areas of all rectangles in the histogram must sum to 1. Smooth curves are used to
represent continuous random variables; these are called probability density curves. The
total area under any probability density curve must be 1. Density curves can have any
shape as long as the total area underneath the curve is 1.

The probability that X gets a value in any interval of interest is the area above this
interval and below the density curve.
P(a <= X <= b) = (the area between a and b under the density curve) = the integral from
a to b of f(x), where f(x) is the density curve.

There are many normal distributions. Even though all of them have the bell shape, they
vary in their center and spread. A particular normal distribution is determined by its mean
and its standard deviation.

In general, if X is a normal random variable, then the probability is:
68% that X falls within 1 σ of μ, that is, in the interval μ ± σ
95% that X falls within 2 σ of μ, that is, in the interval μ ± 2σ
99.7% that X falls within 3 σ of μ, that is, in the interval μ ± 3σ
In probability notation:
0.68 = P(μ − σ < X < μ + σ)
0.95 = P(μ − 2σ < X < μ + 2σ)
0.997 = P(μ − 3σ < X < μ + 3σ)



The first step to assessing a probability associated with a normal value is to determine the
relative value with respect to all the other values taken by that normal variable.

We have to find the z-score or standardize the value. The standardized value z tells how
many standard deviations below or above the mean the original value is and is calculated
as: z-score = (value − mean)/standard deviation

z = (x − μ) / σ

The normal table tells the probability of a normal variable taking a value less than any
standardized score z.
By construction, the probability P(Z < z*) equals the area under the z curve to the left of
that particular value z*.

What is the probability of a normal random variable taking a value more than .75
standard deviations above its mean?
Method 1: By the symmetry of the z curve centered on 0, P(Z > +0.75) = P(Z < −0.75) = 0.2266

Method 2: Because the total area under the normal curve is 1, P(Z > +0.75) = 1 − P(Z < +0.75) =
1 − 0.7734 = 0.2266

What is the probability of a normal random variable taking a value between 1 standard
deviation below and 1 standard deviation above its mean?
P(−1 < Z < +1) = P(Z < +1) − P(Z < −1) = 0.8413 − 0.1587 = 0.6826

The probability is .15 that a standardized normal variable takes a value above what
particular value of z? (remember that the table only provide probabilities of being below
a certain value, not above)
Method 1: According to the table, .15 (actually .1492) is the probability of being below −1.04.
By symmetry, .15 must also be the probability of being above +1.04.

Method 2: If .15 is the probability of being above the value we seek, then 1 - .15 = .85
must be the probability of being below the value we seek. According to the table, .85 is
actually .8505 and is the probability of being below +1.04.

The probability is .95 that a normal variable takes a value within how many SDs of its
mean? A symmetric area of .95 centered at 0 extends to values −z* and +z* such that the
remaining (1 − .95)/2 = .025 is below −z* and .025 is above +z*. Therefore, the probability is
.95 that a normal variable takes a value within 1.96 SDs of its mean.


Ex: Male foot lengths have a normal distribution, with μ = 11, σ = 1.5 inches. What is the
probability of a foot length of more than 13 inches?
1. We standardize: z = (x − μ)/σ = (13 − 11)/1.5 = +1.33.
2. P(Z > +1.33) = P(Z < −1.33) = .0918.
A male foot length of more than 13 inches is on the long side but not too unusual, and its
probability is about 9%.

Ex: What is the probability of a male foot length between 10 and 12 inches?
1. The standardized values of 10 and 12 are (10 − 11)/1.5 = −0.67 and (12 − 11)/1.5 = +0.67
2. P(−.67 < Z < +.67) = P(Z < +.67) − P(Z < −.67) = .7486 − .2514 = .4972
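A Python sketch of the same two foot-length calculations using SciPy's normal distribution (μ = 11, σ = 1.5 as given above):

# Sketch: normal probabilities for the foot-length example.
from scipy.stats import norm

mu, sigma = 11, 1.5

# P(X > 13): about 0.0912 (the table answer 0.0918 uses z rounded to 1.33)
print(round(norm.sf(13, loc=mu, scale=sigma), 4))

# P(10 < X < 12): about 0.495 (the table answer 0.4972 uses z rounded to +/-0.67)
print(round(norm.cdf(12, mu, sigma) - norm.cdf(10, mu, sigma), 4))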

Steps in Solving Two Types of Normal Problems
1. Given a normal value x, solve for probability:
a. Standardize: calculate z = (x − μ)/σ
b. Locate z in the margins of the normal table (ones and tenths for the row,
hundredths for the column). Find the corresponding probability (given to
four decimal places) of a normal random variable taking a value below z
inside the table. (Adjust if the problem involves something other than a
less-than probability by invoking either symmetry or the fact that the
total area under the normal curve is 1.)
2. Given a probability, solve for normal value x:
a. (Adjust if the problem involves something other than a less-than
probability by invoking either symmetry or the fact that the total area
under the normal curve is 1.) Locate the probability (given to four decimal
places) inside the normal table. Find the corresponding z value in the
margins (row for ones and tenths, column for hundredths).
b. Unstandardize: calculate x = μ + z·σ
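A Python sketch of problem type 2 (given a probability, find the normal value x), reusing the foot-length numbers μ = 11, σ = 1.5 as an assumed example:

# Sketch: inverse normal problem (probability -> value).
from scipy.stats import norm

mu, sigma = 11, 1.5

# What foot length is exceeded by only 15% of men?  P(X > x) = 0.15
z_star = norm.ppf(1 - 0.15)            # z for which P(Z < z) = 0.85 -> about 1.04
x = mu + z_star * sigma                # unstandardize
print(round(z_star, 2), round(x, 2))   # 1.04, about 12.55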

The idea that sample results change from sample to sample is called sampling
variability. A parameter is a number that describes the population. A statistic is a
number that is computed from the sample.
Examples of corresponding parameters and statistics:
Population proportion and sample proportion
Population mean and sample mean
Population standard deviation and sample standard deviation

Parameters are usually unknown because it is impractical or impossible to know exactly
what values a variable takes for every member of the population. Statistics are computed
from the sample and vary from sample to sample due to sampling variability.

Behavior of Sample Proportion
The mean of the distribution of p-hat should be p. The sample size plays a role in the
spread of the distribution of the sample proportion. There is less spread for larger samples
and more spread for smaller samples. The shape should be somewhat normal.

The distribution of the values of the sample proportions (p-hat) in repeated samples is
called the sampling distribution of p-hat.

Larger random samples will better approximate the population proportion. When the
sample size is large, sample proportions will be closer to p.

If repeated random samples of a given size n are taken from a population of values for a
categorical variable, where the proportion in the category of interest is p, then the mean
of all sample proportions (p-hat) is the population proportion (p).
The standard deviation of all sample proportions (p-hat) is exactly √(p(1 − p)/n).
Since the sample size n appears in the denominator of the square root, the
standard deviation does decrease as the sample size increases. The shape of the distribution
of p-hat will be approximately normal as long as the sample size n is large enough. The
convention is to require both np and n(1 − p) to be at least 10.
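A simulation sketch of these facts in Python; the values p = 0.6 and n = 1000 are assumptions chosen only for illustration:

# Sketch: simulating the sampling distribution of p-hat.
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.6, 1000, 10_000

# Each repetition: draw a random sample of size n and record the sample proportion
p_hats = rng.binomial(n, p, size=reps) / n

print(p_hats.mean())              # close to p = 0.6 (the center is p)
print(p_hats.std())               # close to the formula value below
print(np.sqrt(p * (1 - p) / n))   # sqrt(p*(1-p)/n), about 0.0155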


If the sampling distribution is normally shaped, then we can apply the SD rule and use z-scores
to determine probabilities.

As long as the sample is truly random, the distribution of p-hat is centered at p, no matter
what size sample has been taken. Larger samples have less spread.

Behavior of Sample Mean
Explore the behavior of the statistic x-bar, the sample mean, relative to the parameter μ,
the population mean (when the variable of interest is quantitative).

The center: the mean of the sample means will be μ, just as the mean of sample
proportions was p. There is greater variability in sample means for smaller samples; sample
size plays a role in the spread of the distribution of sample means. The shape should be
somewhat normal.

The distribution of the values of the sample means (x-bar) in repeated samples is called
the sampling distribution of x-bar.
If repeated random samples of a given size n are taken from a population of values for a
quantitative variable, where the population mean is μ and the population standard
deviation is σ, then the mean of all sample means (x-bar) is the population mean μ. There
is less spread for larger samples.
The standard deviation of all sample means (x-bar) is exactly σ/√n;
since the square root of the sample size n appears in the denominator, the SD does
decrease as the sample size increases.

Central Limit Theorem: the distribution of sample means will be approximately
normal as long as the sample size is large enough. We can use the Central Limit Theorem to
do normal probability calculations, even if the population distribution is
not normal.

The general rule is that samples of size 30 or greater will have a fairly normal distribution
regardless of the shape of the distribution of the variable in the population.

(For categorical variables, our claim that sample proportions are approximately normal
for large enough n is actually a special case of the Central Limit Theorem).
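A Python sketch of the Central Limit Theorem in action for a clearly non-normal (right-skewed, exponential) population; the population mean, sample size, and number of repetitions are assumptions for illustration:

# Sketch: sample means from a skewed population still behave normally.
import numpy as np

rng = np.random.default_rng(1)
mu, n, reps = 5.0, 40, 10_000               # population mean 5, sample size 40

# Draw many samples from a skewed population and record each sample mean
samples = rng.exponential(scale=mu, size=(reps, n))
x_bars = samples.mean(axis=1)

print(x_bars.mean())     # close to mu = 5
print(x_bars.std())      # close to sigma / sqrt(n) = 5 / sqrt(40), about 0.79
# A histogram of x_bars would look roughly normal even though the population is skewed.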


Pg 116 for Summary of Probability

Unit 4: Inference
Statistical inference: the process of inferring something about the population based on
what is measured in the sample.

Three forms to use information obtained in the sample to draw conclusions about the
population:
Point estimation: estimate an unknown parameter using a single number that is
calculated from the sample data.
Interval estimation: estimate an unknown parameter using an interval of values
that is likely to contain the true value of that parameter (and state how confident
we are that this interval indeed captures the true value of that parameter).
Hypothesis testing: we have a claim about the population and check whether or not
the data obtained from the sample provide evidence against this claim.

When the variable of interest is categorical, the population parameter that we will infer
about is the population proportion (p) associated with that variable.
When the variable of interest is quantitative, the population parameter that we infer
about is the population mean (μ) associated with that variable.

Point estimation is the form of statistical inference in which, based on the sample data, we
estimate the unknown parameter of interest using a single value (hence the name point
estimation).

Ex: A random sample of 100 SU students was chosen and their (sample) mean IQ level
was found to be x-bar = 115
We always use x-bar as the point estimator for μ. (If we talk about a specific value, we
use the term estimate, and when we talk about the statistic x-bar in general, we use the term estimator.)

Ex: Suppose a poll of 1,000 US adults find that 560 of them believe marijuana should be
legalized. If we wanted to estimate p, the population proportion, using a single number
based on the sample, it would make intuitive sense to use the corresponding quantity in
the sample, the sample proportion p-hat = 560/1000 = .56.


The best estimator for μ is x-bar and the best estimator for p is p-hat. Probability
theory gives us an explanation of why x-bar and p-hat are good choices as point estimators
for μ and p. As long as a sample is taken at random, the distribution of sample means is
exactly centered at the value of the population mean.

X-bar is said to be an unbiased estimator for μ. In the long run, such sample means are
"on target" in that they will not underestimate μ any more or less often than they overestimate it.
P-hat is also an unbiased estimator for p.

Point estimates are truly unbiased estimates for the population parameter only if the
sample is random and the study design is not flawed.

As the sample size n increases, the sampling distribution of x-bar gets less spread out. Values
of x-bar that are based on a larger sample are more likely to be closer to μ. Similarly, the
sampling distribution of p-hat is centered at p, and p-hat is more likely to be closer to p when the
sample size is larger.

Point estimation is simple and intuitive but also a bit problematic: we are almost
guaranteed to make some kind of error. Even though the values of x-bar will fall around
μ, it is very unlikely that the value of x-bar will fall exactly at μ. Point estimates are of
limited usefulness unless we are able to quantify the extent of the estimation error.
Interval estimation enhances simple point estimation by supplying information about
the size of the error attached.
Ex: We use the point estimate x-bar = 115, but we have no idea what the estimation
error involved in such an estimate might be.
We say: "I am 95% confident that by using the point estimate x-bar = 115 to estimate μ, I
am off by no more than 3 IQ points." In other words, I am 95% confident that μ is between
112 (115 − 3) and 118 (115 + 3).

Suppose we are interested in estimating the unknown population mean μ based on a
random sample of size n. We assume that the population standard deviation σ is known.

The values of x-bar follow a normal distribution with (unknown) mean μ and standard
deviation σ/√n (known, since both σ and n are known). By the (second part of the)
Standard Deviation Rule, this means that:
There is a 95% chance that our sample mean (x-bar) will fall within 2·σ/√n of μ, which
means that: we are 95% confident that μ falls within 2·σ/√n of our sample mean (x-bar).

In other words, a 95% confidence interval for the population mean μ is: x-bar ± 2·σ/√n
The way the confidence interval is used is that we hope the interval contains the
population mean μ. This is why we call it an interval for the population mean.

There is a 95% chance that x-bar takes a value within 2 standard deviations of its mean. A
more accurate formula for the 95% confidence interval for μ is x-bar ± 1.96·σ/√n; however,
2 will be used.

95% is the most commonly used level of confidence, but 99% and 90% confidence levels
are also used.
A 99% confidence interval for μ is x-bar ± 2.576·σ/√n
A 90% confidence interval for μ is x-bar ± 1.645·σ/√n

A 99% confidence interval includes more values and is wider, while a 90% confidence
interval includes fewer values and is narrower. The wider 99% confidence interval gives us
a less precise estimate of μ than the narrower 90% confidence interval, because the
narrower interval zeroes in more tightly on the plausible values of μ.

There is a trade-off between the level of confidence and the precision with which the
parameter is estimated. The price we have to pay for a higher level of confidence is that
the unknown population mean μ will be estimated with less precision (a wider confidence
interval). If we would like to estimate μ with more precision (a narrower confidence
interval), we will need to sacrifice and report an interval with a lower level of confidence.

For a 90% level of confidence, z* = 1.645
For a 95% level of confidence, z* = 2 (or 1.96)
For a 99% level of confidence, z* = 2.576
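A Python sketch tying these multipliers to the IQ example above (x-bar = 115, n = 100); the value σ = 15 is an assumption, chosen so that the 95% margin of error comes out to the 3 IQ points quoted earlier:

# Sketch: confidence intervals for mu at three confidence levels (sigma assumed known).
import math

x_bar, sigma, n = 115, 15, 100
sd_of_xbar = sigma / math.sqrt(n)          # standard deviation of x-bar = 1.5

for conf, z_star in [(0.90, 1.645), (0.95, 1.96), (0.99, 2.576)]:
    m = z_star * sd_of_xbar                # margin of error
    print(conf, round(x_bar - m, 2), round(x_bar + m, 2))
# Higher confidence -> larger multiplier -> wider (less precise) interval.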

The confidence interval has the form: x-bar ± m, where
x-bar is the sample mean, the point estimator for the
unknown population mean μ, and m is called the margin of error because it represents the
maximum estimation error for a given level of confidence.

m is the product of two components:
z*, the confidence multiplier, and σ/√n, the standard deviation of x-bar, the point
estimator of μ.

The structure is: estimate ± margin of error, where the margin of error is itself
the product of a confidence multiplier and a standard deviation. Although
each confidence interval has the same structure, what these components actually are
differs from one interval to another, depending on the level of confidence and on what
unknown parameter the confidence interval aims to estimate.

Since the structure of the confidence interval is such that it has a margin of error on
either side of the estimate, it is centered at the estimate (in our case, x-bar) and its width
(or length) is exactly twice the margin of error: width = 2m.

The margin of error, m is therefore in charge of the width or precision of the confidence
interval and the estimate is in charge of its location (and has no effect on the width).

Is there a way to increase the precision of the interval without compromising on the level
of confidence? Yes, we can do that by increasing the sample size n because it appears
in the denominator.

If our estimate, x-bar, is based on a larger sample, we have more faith in it. It is more
reliable, and therefore we need to account for less error around it.

We can calculate what sample size we need in order to report a confidence interval with a
certain level of confidence and a certain margin of error.

The formula is: n = (z*·σ / m)²

The sample size must be an integer so we should always round up to the next highest
integer. This is called the conservative approach which will be used to achieve an
interval at least as narrow as the one desired.

Sometimes we need to know the population standard deviation σ. Usually σ
is not known because it is a parameter. In order to compute the required sample size, we
use an estimate of σ.

Range rule of thumb: σ is roughly no bigger than range/4.
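A short Python sketch of the sample-size calculation; the desired margin of error, confidence multiplier, and believed range are assumptions for illustration:

# Sketch: required sample size n = (z* * sigma / m)^2, with sigma from the range rule of thumb.
import math

z_star = 2                    # 95% confidence
data_range = 40               # believed range of the variable (assumption)
sigma_est = data_range / 4    # range rule of thumb: sigma ~ range / 4 = 10
m = 3                         # desired margin of error

n = (z_star * sigma_est / m) ** 2
print(round(n, 2), math.ceil(n))   # 44.44 -> round up (conservative approach) to n = 45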

When is it safe to use the confidence interval we developed?
The sample must be random, and the Central Limit Theorem must apply: either the sample
size is large (a common rule of thumb for "large" is n > 30), or, for smaller sample sizes, we
need to know that the quantitative variable of interest is normally distributed in the population.

If σ is unknown, then we can replace it with the sample standard deviation s.
The bad news is that once σ is replaced, the z* confidence multipliers are not accurate
anymore; the new multipliers come from the t distribution and are denoted by t*.

t* multipliers have an added complexity that they depend on both the level of confidence
and on the sample size.

Since it is rare that σ is known, the interval x-bar ± t*·s/√n is more commonly used as a
confidence interval for estimating μ. The quantity s/√n is called the standard error of x-bar.
The sample must also be random, and the interval cannot be used when the sample size is
small and the variable is not known to vary normally.

For large values of n, the t* multipliers are not that different from the z* multipliers, and
therefore we can use x-bar ± z*·s/√n when σ is unknown.
Pg 133 for Summary
Confidence Intervals for Population Proportion
If the population is categorical, the parameter we are trying to infer about is the
population proportion (p) associated with that variable.

The confidence interval for p has the form: p-hat ± z*·(standard deviation of p-hat).
The standard deviation of our estimator p-hat is √(p(1 − p)/n).

When we are trying to estimate the unknown population proportion p, we replace p with
its sample counterpart p-hat and work with the standard error of p-hat. The confidence
interval for the population proportion p is: p-hat ± z*·√(p-hat(1 − p-hat)/n)
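A Python sketch of this interval, using the marijuana-poll numbers from the point-estimation example above (560 out of 1,000, so p-hat = 0.56) and z* = 1.96:

# Sketch: 95% confidence interval for a population proportion.
import math

count, n = 560, 1000
p_hat = count / n
z_star = 1.96

se = math.sqrt(p_hat * (1 - p_hat) / n)            # standard error of p-hat
m = z_star * se                                    # margin of error, about 0.031
print(round(p_hat - m, 3), round(p_hat + m, 3))    # roughly (0.529, 0.591)
# Safe to use here: n*p_hat = 560 and n*(1 - p_hat) = 440 are both at least 10.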

There is a trade-off between level of confidence and the width or precision of the
confidence interval. The more precision you would like the confidence interval for p to
have, the more you have to pay by having a lower level of confidence.

Since n appears in the denominator of the margin of error of the confidence interval for p,
for a fixed level of confidence, the larger the sample, the narrower (more precise) the interval.

In order to determine the sample size needed for a given margin of error when estimating a
proportion, we use the (conservative) formula: n = (z* / (2m))²

The margin of error of a poll is determined (conservatively, at 95% confidence) by 1/√n.

We usually determine the sample size, then choose a random sample of that size and then
use the collected data to find p-hat.


It is safe to use these methods when p-hat is roughly normal: under the conditions that
n·p-hat ≥ 10 and n·(1 − p-hat) ≥ 10, it is safe to use the methods.

Hypothesis testing is the other form of inference: unlike the methods for estimating an
unknown parameter, the idea, logic, and goal of hypothesis testing are quite different.

Statistical hypothesis testing is defined as:
Assessing evidence provided by the data in favor of or against some claim about the
population.
1. Two claims. Claim 1 and claim 2.
2. Choose a sample, collect relevant data and summarize them.
3. Figure out how likely it is to observe data like the data we got, assuming claim 1 is true.
4. Make a decision.
We do not state "I accept claim 1" but only "I don't have enough evidence to reject
claim 1."


Hypothesis testing step 1: Stating the claims.
Claim 1 is called the null hypothesis (denoted H0) and Claim 2 is the alternative
hypothesis (denoted Ha). The null hypothesis suggests nothing is going on and there is
no change from the status quo. The alternative hypothesis disagrees with this and states
that there is a change from the status quo. It usually represents what we want to check or
what we suspect is really going on.

Hypothesis testing step 2: Choosing a sample and collecting data.
We collect data and summarize it. We use the sample statistics to summarize the data
with something called a test statistic.

Hypothesis testing step 3: Assessing the evidence
There is a crucial probability known as the p-value.

Hypothesis testing step 4: Making conclusions
The significance level of the test is usually denoted by the Greek letter α. The most
commonly used significance level is α = 0.05 (5%). This means that:
If the p-value < α (usually 0.05), then the data we got are considered to be rare (or
surprising) enough when H0 is true, and we say that the data provide significant
evidence against H0, so we reject H0 and accept Ha.
If the p-value > α (usually 0.05), then our data are not considered to be surprising
enough when H0 is true, and we say that our data do not provide enough
evidence to reject H0 (or, equivalently, that the data do not provide enough
evidence to accept Ha).
The results are statistically significant when the p-value < α.
The results are not statistically significant when the p-value > α.

It should be noted that scientific journals do consider .05 to be the cutoff point: any p-value
below the cutoff indicates enough evidence against H0, and any p-value above it, or even
equal to it, indicates there is not enough evidence against H0.

It is important to draw your conclusions in context. It is never enough to say "the p-value
is less than α, so I reject H0 at the α significance level." You should always add: "and
conclude that ... (context)."

Two conclusions: either "I reject H0 and accept Ha" (when the p-value is smaller than the
significance level) or "I cannot reject H0" (when the p-value is larger than the significance
level).

4 basic steps in the process of hypothesis testing:
1. State the null and alternative hypotheses.
2. Collect relevant data from a random sample and summarize them (using a test
statistic).
3. Find the p-value, the probability of observing data like those observed assuming
that H0 is true.
4. Based on the p-value, decide whether we have enough evidence to reject H0 (and
accept Ha) and draw our conclusions in context.

Note that the null hypothesis always takes the form:
H0: p = some value
and the alternative hypothesis takes one of the following three forms:
Ha: p < that value, or
Ha: p > that value, or
Ha: p ≠ that value

The value that is specified in the null hypothesis is called the null value and is generally
denoted by p0. The null hypothesis about the population proportion (p) takes the form:
H0: p = p0

The alternative hypothesis takes one of the following three forms:
Ha: p < p0 (one-sided)
Ha: p > p0 (one-sided)
Ha: p ≠ p0 (two-sided)
The first two are one-sided alternatives and the third one is a two-sided alternative. Two-sided
alternatives can go in either direction, either much larger or much smaller than the
null value.
It is important that our sample is representative of the population about which we want to
draw conclusions. This is ensured when the sample chosen is random.
The test statistic captures the essence of the test: it is a measure of how far the sample
proportion, p-hat, is from the null value p0, the value that the null hypothesis claims is the
value of p. Since p-hat is what the data estimate p to be, the test statistic can be viewed as
a measure of the distance between what the data tell us about p and what the null
hypothesis claims p to be.

The test statistic for this test measures the difference between the sample proportion p-hat
and the null value p0 by the z-score of the sample proportion p-hat, assuming that the null
hypothesis is true.

The test statistic is: z = (p-hat − p0) / √(p0(1 − p0)/n)



A representation of the sampling distribution of p-hat, assuming p = p0.

The test statistic can be used as a measure of the evidence in the data against H0: the larger
the test statistic, the further the data are from H0, and therefore the more evidence the
data provide against H0.

The test is known as the z-test for the population proportion. The name comes from the
fact that it is based on a test statistic that is a z-score.

If we take a random sample of size n from a population with population proportion p, the
possible values of the sample proportion p-hat have approximately a normal distribution with
a mean of p and a standard deviation of √(p(1 − p)/n).

The conditions the data need to satisfy in order to use this test are:
1. The sample has to be random.
2. The conditions under which the sampling distribution of p-hat is normal are met;
checked using the null value, this means n·p0 ≥ 10 and n·(1 − p0) ≥ 10.

In the case of the test statistic, the larger it is in magnitude, the further p-hat is from p0,
and the more evidence we have against H0.

In the case of the p-value, it is the opposite: the smaller it is, the more unlikely it is to get
data like those observed when H0 is true, and the more evidence there is against H0.

How is the p-value calculated?
The p-value is the probability of observing data like those observed assuming H0 is true.
- The calculation will involve the test statistic.
- By "like," we mean "as extreme or even more extreme."
The p-value is therefore the probability of observing a test statistic as extreme as that
observed (or even more extreme) assuming the null hypothesis is true.

By "extreme" we mean extreme in the direction of the alternative hypothesis.
Specifically, for the z-test for the population proportion:
1. If the alternative hypothesis is Ha: p < p0 (less than), then "extreme" means small,
and the p-value is: the probability of observing a test statistic as small as that
observed or smaller if the null hypothesis is true.
2. If the alternative hypothesis is Ha: p > p0 (greater than), then "extreme" means
large, and the p-value is: the probability of observing a test statistic as large as
that observed or larger if the null hypothesis is true.
3. If the alternative is Ha: p ≠ p0 (different from), then "extreme" means extreme in
either direction, either small or large (i.e., large in magnitude), and the p-value
therefore is: the probability of observing a test statistic as large in magnitude as
that observed or larger if the null hypothesis is true.

Less than
The probability of observing a test statistic as small as that observed or smaller assuming
that the values of the test statistic follow a standard normal distribution.

Looking at the shaded region, this is often referred to as a left-tailed test.




Greater than
The probability of observing a test statistic as large as that observed or larger, assuming
that the values of the test statistic follow a standard normal distribution.
This is known as right-tailed test.




Not Equal to
Two-tailed test




In any test, p-values are found using the sampling distribution of the test statistic when
the null hypothesis is true. It is relatively easy to argue that the null distribution of our
test statistic is N (0,1).



Step 1: State the null and alternative hypotheses:
H0: p = p0
Ha: p < p0, or Ha: p > p0, or Ha: p ≠ p0

Step 2: Obtain data from a sample and:
1. Check whether the data satisfy the conditions that allow you to use this test: a random
sample (or at least a sample that can be considered random in context), and
n·p0 ≥ 10 and n·(1 − p0) ≥ 10.

2. Calculate the sample proportion p-hat and summarize the data using the test statistic
z = (p-hat − p0) / √(p0(1 − p0)/n).


Step 3: Find the p-value of the test either by using software or by using the test statistic as
follows:

Step 4: Reach a conclusion first regarding the significance of the results, and then
determine what it means in the context of the problem. Recall that:
If the p-value is small, the results are significant and we reject H0. If the p-value is not
small, we do not have enough statistical evidence to reject H0, and so we continue to
believe that H0 may be true.
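A Python sketch of these four steps on assumed numbers: H0: p = 0.5 versus Ha: p > 0.5, with 560 successes in a sample of 1,000 (both the hypotheses and the counts are illustrative assumptions, not from the text):

# Sketch: the four steps of the z-test for a population proportion.
import math
from scipy.stats import norm

p0 = 0.5                      # Step 1: null value
count, n = 560, 1000          # Step 2: sample data (check n*p0 = 500 and n*(1-p0) = 500, both >= 10)
p_hat = count / n

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # test statistic, about 3.79
p_value = norm.sf(z)          # Step 3: right-tailed p-value, about 0.00007

alpha = 0.05                  # Step 4: compare with the significance level
print(round(z, 2), p_value, p_value < alpha)      # significant: reject H0, accept Ha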

The effect of sample size on Hypothesis Testing. Larger sample sizes give us more info
to pin down the true nature of the population. We can therefore expect the sample mean
and the sample proportion obtained from a larger sample to be closer to the population
mean and proportion respectively. It might be that a sample size was simply too small to
detect a statistically significant difference or that a sample size was large enough to detect
a statistically significant difference.

Statistical significance vs practical importance

One-sided vs two-sided
In general, it is harder to reject H0 against a two-sided Ha because the p-value is twice as
large. Intuitively, a one-sided alternative gives us a head start, and on top of that we have
the evidence provided by the data. When our alternative is two-sided, we get no
head start and all we have are the data, and therefore it is harder to cross the finish line
and reject H0.



pg 157 for summary of the hypothesis testing
