Decision-Making in Business
1st Edition
Milos Podmanik
Foreword: What is This Book Good For?
You're probably thinking to yourself, “Who does this guy think he is by trying to write his own
book?”
The answer is both satisfying and deceiving to those who expect the traditional math course with
the traditional instructor. I write this course manual to most closely match my personal teaching
philosophy. What might that be? Well, I firmly believe that math education focuses too much on
processes, templates, and repetitive, mundane computational skills. Is this of any importance? To
some extent, yes, they are important. For the most part, however, students fail to make
connections from math to the real world and vice versa. We tend to teach students how to “do”
and not how to “think.” As a result, I believe it is far more important to promote a deep level of
understanding, engagement, and connections to the planet we live on. After all, do you really
want to become a calculator? If your answer is “yes,” then this will come as a major
disappointment: a computer could calculate faster and more accurately than you decades ago!
Not to mention, computers will only continue to get faster and better than you at computing.
Here's the good news: computers don't understand why they're doing what they're doing! They
are simply computing machines. It takes (and most likely will always take) a rational, deep-
thinking human being to provide a contextual and meaningful analysis of the inputs and outputs
of a numerical process. And that, my friends, is what this book is all about.
What's my point?
2: Visual Representations of Data
2.1 Visualizing Categorical Data
2.2 Visualizing Quantitative Data
2.3 Descriptive Statistics – Center and Position
2.4 Descriptive Statistics – Variability
3: Probability and Decision-Making
3.1 The Idea of Probability
3.2 Joint Probability
3.3 Probability of Unions
3.4 Conditional Probability
3.5 Combinations and Permutations
3.6 Expected Value
4: Discrete Probability Distributions
4.1 The Binomial Distribution
5: Continuous Probability Distributions
5.1 The Ideas Behind the Continuous Distribution
5.2 The Normal Distribution
6: Sampling Distributions and Estimation
7: Hypothesis Testing
7.1 The Concept Behind Hypothesis Testing
Appendices
APPENDIX A: Answers to Select Problems
Chapter 1
Fundamentals of Statistics
Our lives are filled with information. While at one point we didn't have enough data in the
world, now we have so much of it that computers need to be revamped continually in order to
keep up with it. Facebook records rich information about hundreds of millions of users. Studies
are revealing new conclusions that allow us to make decisions about choosing the right type of
treatment for medical conditions. Scientific data is establishing the strong correlation between
humans' interaction with the planet and changes in climate. The power of data is limitless.
However, because media coverage of research is often unreliable, the results of studies are
frequently miscommunicated or misunderstood. To extract the full meaning of data, we must
understand how to analyze them. We must be accurate and precise in what we
measure and how we measure it.
1. To be informed
What does it mean to be informed? To be informed we should be able to understand and interpret
tables, charts, and graphs. We should be able to make sense of the conclusions of others' research.
Examples:
- Does it matter how long children are bottle-fed? An experiment was run to determine
differences in iron deficiency and the length of time that a child is bottle-fed.
- In 2005, Medicare candidates faced a decision of which prescription medication plan to choose.
A program called PlanFinder was made available online to compare available options. But, are
senior citizens online?
- A study in 2005 attempted to answer the question, are students ruder today than in the past? A
survey was conducted.
- Is domestic violence common? A study in 2005 interviewed about 24,000 women to attempt to
answer this question.
- What factors are involved in student achievement in school? Is study-time the most important
factor in answering this question? A study concluded that things such as prioritizing student
achievement and encouraging teacher collaboration may have some impact.
- Do the accounts receivable reported by a business accurately reflect the true accounts
receivable? The IRS randomly audits businesses to try and answer this question.
- A stock’s share value change has fluctuated between -1.2% and 8.9% over the last year. What
predictions should an investor make about the stock over the coming year in order to decide
whether to purchase?
- CVS Pharmacy sells 5 lb. bags of 100% Pure Cane Granulated Sugar. As a quality control
measure, the company would like to know the amount of variability in the true weight of sugar
placed into each of the bags.
In order to be able to reach the goals mentioned above, we need to have some sort of information
about which to make our decisions – we call this information data.
Quantitative variables, as the title implies, deal with numerical quantities. For example, the
average revenue of a Whole Foods market store is considered a quantitative variable, since the
measurement is a number.
Qualitative variables, on the other hand, deal with qualities. For example, the type of television
that a customer is likely to purchase is considered a qualitative variable, since its value will be,
for instance, plasma, LED, LCD, etc.
Just because a variable is stated as a numerical value doesn't mean that it can be treated as a
numerical value. A variable must be classified according to its scale of measurement.
For instance, suppose you are to test three marketing tactics on customers. You call these tactics,
Tactics 1, 2, and 3, respectively. These tactics have numerical values, but the numbers do not
have any ordering significance. That is, tactic 1 is not necessarily better than tactic 3. These
numbers serve simply as names for the values of the variables and cannot be numerically
compared. We call this a variable of nominal scale.
Suppose that a business magazine reports the top three new businesses in the city each month.
That is, we have businesses 1, 2, and 3, where 1 is considered the best of the three, 2 the second
best, and 3 the third best. In this case, we can talk about 1 being better than 2 and 3, and 3 being
worse than 1 and 2. This type of variable has the properties of a nominal scaled variable, but also
has the property of order. We call this a variable of ordinal scale.
In another example, consider the variable IQ. Suppose two people have IQs of 100 and 120.
Based on this information, we can say that the person with 120 has a higher IQ. However, we
can also say that the second person has an IQ that is 20 points higher than the first person. We
couldn't really say this for the ordinal example above. In addition to being nominal (a person can be
identified by their value) and ordinal (can rank the scores), we can also talk about the differences
in scores. This type of variable is of interval scale.
Existing Data
In some instances, this data already exists and is available to the researcher. For instance, one
can easily go online and find existing data on the U.S. public. We can view things like the
average credit card debt per person by state, pounds of grains produced in the United States since
1950, etc. This data is usually available through a number of websites, such as:
There are literally thousands of other repositories for existing data. Sometimes a little bit of
research unveils a plethora of results.
If a company is doing a study of its clients, it may already have a myriad of existing internal
data.
Many times, observational studies are conducted. There is no experimenter manipulation in this
type of study. For example, a zoologist might study elephant eating patterns in various climates
to determine whether climate has an effect on caloric intake (response variable – what is
measured). He probably cannot manipulate the climate (predictor variable – serves to predict
responses) in which the elephant lives (for many reasons, not the least of which is the difficulty
An experiment, on the other hand, is a type of study in which the experimenter is able to control
and manipulate most, if not all, environmental factors. If the experimenter is studying the effects
of caffeine on math test scores, for instance, he would have a control group of, perhaps, students
to whom he gives no coffee and another, experimental group, to whom he gives coffee with 60
mg of caffeine. He then measures each group on test score performance (% of total correct):
Suppose the experimental group does poorly compared to the control group. Can we be sure that
it was due to the caffeine? As long as test conditions were the same in each group, yes. If,
however, there was something different between the two groups in addition to the
presence/absence of caffeine, then the results are not so clear. What if, for instance, they played
music with the control group and none with the experimental group? How do we know better
performance in the control group wasn't an effect of soothing music calming the nerves? It could
even have been a combination of no caffeine and music.
Punchline: In an experiment, we manipulate one factor and hold all other conditions constant.
Most of the time it is desirable to run an experiment. The number one reason for this is that we
can usually collect evidence that leads to a cause-and-effect relationship, assuming the
experiment is conducted properly. In an observational study it is impossible to do this as there
are many confounding variables, or variables that might be related to the explanatory and
response variable. Consider this classic example: a researcher counts the number of crimes
committed in a city and then the number of churches in that city. She does this for quite a few
cities. It is found that there is a positive relationship between the number of crimes committed
and the number of churches. That is, as crime increases, so does the number of churches. What
gives? Do these people just repent more often for their guilty consciences?
Example 1: An educational researcher finds that there is a strong relationship between the
number of hours a student studies and his/her grade point average (GPA). List a few possible
confounding variables.
SOLUTION: There is no guarantee that studying more causes a higher GPA. There are many
factors that might influence a higher GPA:
More sleep
Less stress (maybe due to lack of job)
Less television viewing
Better study environment
More support from family/friends
There are many. Let's consider the following scenario to help illustrate a few.
Scenario: Suppose we want to test whether or not a newly designed Freud circular saw blade
runs at a lower temperature, and hence causes fewer burn marks in the wood, than the old blade at
7200 revolutions per minute (RPM).
Can we just run the cuts, take the temperatures, and compare? I think you know the answer to
this.
First off, we face many extraneous factors, or variables that are not of interest in the current
study but that are thought to affect the response variables. Examples? The person doing the
cutting with each blade (same or not?). The type of wood being cut (is one pine and
the other oak?). The type of saw (low-power Craftsman, or professional Jet?).
In order to avoid having these types of factors affect our measurement, we must control them.
We can do this by having the same person do the cutting, using boards that are exactly
the same, and using the same saw for both tests.
Secondly, is it sufficient to cut just one board using each blade? Definitely not. We must expect
that there will be some variation or variability in the temperatures we measure. That is, if I run
the cut with the old blade four times, I may read temperatures of 205°, 202°, 209°, and 219°. This
difference among the measurements is called variability. Thus, to take into account the
variability, we must take several replications, or repeated measurements. Then, we would likely
use the mean, or average of the replications.
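As a quick check of the arithmetic, the four readings quoted above can be averaged in a few lines of Python. This is only a sketch: Python is not otherwise used in this text, and the variable names are made up for illustration.

```python
from statistics import mean

# The four temperature readings quoted in the text for the old blade.
old_blade_temps = [205, 202, 209, 219]

# Mean of the replications: add the readings and divide by how many there are.
average_temp = mean(old_blade_temps)

# Variability among the readings: largest reading minus smallest reading.
spread = max(old_blade_temps) - min(old_blade_temps)

print(average_temp)  # 208.75
print(spread)        # 17
```

The mean of 208.75 is the single number we would carry forward for the old blade, while the spread of 17 is a reminder of why one cut alone would not be enough.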
If you have a total of 8 sheets of wood to be cut, is it okay to cut the first 4 with the old blade and
the last 4 with the new blade? Surprisingly, the answer is "no." Why not? Suppose the sheets
were delivered freshly cut, and still moist. Well, moisture is subject to gravity, and so the last
four boards might be more moist than the top four. Thus, we must randomize each board to one
of the two types of saw blades. In other words, we randomly assign each board to a blade. We
will not consider this any further at this point.
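One way the random assignment could be carried out is sketched below in Python. The board numbering, the seed, and the even 4/4 split are assumptions made for this illustration, not a procedure prescribed by the text.

```python
import random

# The 8 sheets, numbered in the order they were stacked (hypothetical labels).
boards = list(range(1, 9))

# A fixed seed (an arbitrary choice) makes the example repeatable.
rng = random.Random(2024)

# Shuffle a copy of the list so that stacking order no longer matters.
shuffled = boards[:]
rng.shuffle(shuffled)

# The first four shuffled boards go to the old blade, the rest to the new blade.
assignment = {
    "old blade": sorted(shuffled[:4]),
    "new blade": sorted(shuffled[4:]),
}
print(assignment)
```

Because the shuffle ignores position in the stack, a moist board is as likely to land with one blade as with the other.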
1. Classify each of the following variables as nominal, ordinal, interval, or ratio scale.
Justify your answer.
a. Favorite flavor of ice cream
b. Temperature (°F)
c. Accounts Receivable Balance
d. Ranking of Presidential Candidates According to Preference
2. Based on a study of 2121 children between the ages of one and four, researchers at the
Medical College of Wisconsin concluded that there was an association between iron
deficiency and the length of time that a child is bottle-fed (Milwaukee Journal Sentinel,
November 26, 2005).
a. How many elements does this dataset contain?
b. Is the variable categorical or quantitative? Explain.
3. The student senate at a university with 15,000 students is interested in the proportion of
students who favor a change in the grading system to allow for plus and minus grades
(e.g., B+, B, B-, rather than just B). Two hundred students are interviewed to determine
their attitude toward this proposed change.
a. How many elements does this dataset contain?
b. Is the variable categorical or quantitative? Explain.
4. An article titled “Guard Your Kids Against Allergies: Get Them a Pet” (San Luis Obispo
Tribune, August 28, 2002) described a study that led researchers to conclude that “babies
raised with two or more animals were about half as likely to have allergies by the time
they turned six.”
6. “More than half of California's doctors say they are so frustrated with managed care they
will quit, retire early, or leave the state within three years.” This conclusion from an
article titled “Doctors Feeling Pessimistic, Study Finds” (San Luis Obispo Tribune, July
15, 2001) was based on a mail survey conducted by the California Medical Association.
Surveys were mailed to 19,000 California doctors, and 2000 completed surveys were
returned.
a. Is this study an observational study or an experiment? Explain.
b. Describe any concerns you have regarding the conclusion drawn.
Statistics is a branch of mathematics that deals with the analysis of data. This is often confusing
to some people, since the lower-case version of this word, statistic, actually means: a piece of
data. So, we have statistics, which are the data themselves, and we have Statistics, which deals
with the analysis of statistics. Confusing, huh? We generally use the word statistics loosely to
mean “data.”
A statistician is a special type of mathematician who deals with the analysis of data. Many
people confuse the profession of the statistician with a person who simply has many statistics
memorized. While some certainly may, most do not.
Needless to say, our purpose in the field of Statistics is to understand data. Depending on one's
goal, statistics may be used simply to describe an obtained set of data or to extrapolate from the data to
describe something much larger. These two goals are called descriptive and inferential
statistics, respectively.
Suppose you work in the accounting department and have collected the following data on
revenues earned from new and existing customers over the past day:
Your goal is to summarize the data in some meaningful way(s). Descriptive statistics is the
method of describing or summarizing data. How could this be done?
Revenue – Quantitative
- Range from $2,230 to $9,590
We can provide the relative frequency of these values. A relative frequency is a ratio of the
number of observations of a given value to the total number of observations. Here, we could
summarize by saying:
New: 3 of 11 observations, or about 27%
Old: 8 of 11 observations, or about 73%
This allows us to conclude that 27% of the sales came from new clients while 73% came from
existing clients. This is very valuable information! This information demonstrates that the
company has grown over the course of this one day.
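The relative frequencies can be reproduced with a short Python sketch. The counts used here (3 new and 8 old accounts) are an assumption: the original table is not shown, and these counts are inferred from the 27% and 73% figures and from the later remark that there are only three new customers.

```python
from collections import Counter

# Assumed account types: 3 "New" and 8 "Old" (inferred, not copied from a table).
account_types = ["New"] * 3 + ["Old"] * 8

# Frequency: how many times each value occurs.
counts = Counter(account_types)

# Relative frequency: each count divided by the total number of observations.
total = sum(counts.values())
relative_frequency = {kind: count / total for kind, count in counts.items()}

for kind, share in relative_frequency.items():
    print(f"{kind}: {share:.0%}")
```

The relative frequencies always add up to 1 (that is, 100%), which is a handy check on the counting.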
We could present these two descriptive statistics to management by either providing the raw
percentages, or by some visual display, such as a pie chart or a bar graph. A pie chart shows
the ratios (or all parts of one whole) of the categorical variable and thus the entire circle
represents 100% of all account types (100% of the categorical variable values):
[Pie chart: New 27%, Old 73%]
This literally shows the “ingredients” of the pie. A corresponding bar graph might be:
[Bar graph: Account Type. Horizontal axis: Type (New, Old). Vertical axis: Frequency.]
In a similar way, we could describe Revenue, the quantitative variable. Typically, quantitative
variables are described by:
Variability – measure of how spread-out the data values are. A number of possible
measures exist including (but not limited to): range, interquartile range, and standard
deviation.
Since we're most used to finding a simple average, or the mean, we will do that here. Recall that
the mean can be found by summing the observations and dividing by the number of
observations:
mean = (sum of the observations) / (number of observations)
Recall that when we find an average, we are placing all values into a common “pot.” We then
divide the pot into equal parts. That is to say, if each company had spent the same amount of
money on each purchase, they would each spend $5,750. We like to think of this as a measure of
the center value. Spending less than this amount puts a company below the average and spending
more puts the company above the average.
This value represents the amount allocated to each observation, if each observation were to
receive an equal share of the total. We think of this as the “center” value.
In conjunction with measures that summarize the center, it is critical to focus also on how spread
out the data is. One such measure is the range. The range is simply the difference between the
minimum and maximum values in the dataset. In this instance, we have:
Minimum: $2,230
Maximum: $9,590
Thus, the range of the dataset is $7,360. This tells us that the amount spent varied by as much as
$7,360 from company-to-company.
Range
Range, a measure of the variability (or spread) of a dataset, is measured by taking the difference
between the largest and smallest observed value. That is,
range = maximum value - minimum value
SOLUTION: We are being asked to look at values specific to the account type. Thus, we will
have two means and two ranges.
We see that both types of customer tend to have about the same average purchase amount. However, it
appears that the amount spent by old customers is prone to more fluctuation than that of new
customers. This might be due simply to the fact that there are only three new customers.
Technology Note: All of the information above was generated using Microsoft Excel.
Descriptive statistics is a great way to describe what you have, but how can we describe data that
we do not have?
Let's consider an example. You are the manager of the production branch at Healthy Heart
Organic Foods. Due to recent workload increases, you are concerned that your employees' team
morale has decreased. You have 864 employees working in your department. You would like to
conduct a survey, but you do not have the means to investigate the data in each of the surveys
provided. Certainly, you could pay your assistant overtime to analyze them for you, but that
would be costly in both time and payroll. Instead, you decide to randomly survey 50 of the
employees in your department in order to get an idea of the overall morale. This process of
collecting data on a smaller portion of the whole in order to generalize to the whole is known as
statistical inference. This branch of statistics is called inferential statistics.
It is of utmost importance to draw appropriate conclusions when reporting the findings of any study,
whether a survey or an experiment. For example, if we find that rats die after ingestion of 20 mg of
caffeine, does that mean caffeine will kill a human, as well? This brings up the worthwhile
discussion of a population versus the sample. Let‟s consider the figure below:
It is often quite time-consuming and costly to conduct a study based on whole populations. Even
presidential polls rarely involve more than a couple thousand participants. Through one of a
variety of processes, only a select number of elements of the target population will be selected.
This select number is referred to as the sample. The process of selecting a sample from the
population that we will consider is simple random sampling (SRS). This process helps to
ensure that any differences that we notice among sample elements are entirely due to chance and,
importantly, that every element in the target population has an equally likely chance of being in
the sample.
Simple random sampling can be done by many means. You’ve probably heard of the random
process of drawing a name from a box to declare the winner of a raffle. More sophisticated
means of this are done by a random number generator on a computer, wherein every element of
the population is assigned a whole number. Then, a series of random numbers is drawn by a
computer and those elements are selected to be in the sample.
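The random-number approach just described can be sketched in Python using the morale survey from earlier (864 employees, 50 to be surveyed). The employee numbering and the seed are assumptions made for this illustration.

```python
import random

population_size = 864   # employees in the department (from the morale example)
sample_size = 50        # number of employees to be surveyed

# Assign every element of the population a whole number: 1, 2, ..., 864.
employee_ids = range(1, population_size + 1)

# A fixed seed (an arbitrary choice) makes the draw repeatable.
rng = random.Random(7)

# Draw 50 distinct numbers; every employee is equally likely to be chosen.
sample = sorted(rng.sample(employee_ids, sample_size))
print(sample)
```

Sampling without replacement guarantees that no employee is surveyed twice, just as a name drawn from a raffle box is not put back.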
We can see in the illustration above that our goal is to then make inferences about the population
based on our observations of the sample. Just as you might hear from Gallup: “55% of voters
plan on voting for Candidate X,” we try to make generalizations about the target population.
As another example, consider a lighting company that is hoping to manufacture a light bulb with
a new type of filament. As with any light bulb, a consumer would want to know how long the
light bulb is expected to last. Unfortunately, not every light bulb will last equally long as every
other light bulb. This means that an average will have to be taken. To add to this, it is not
possible to test every single light bulb to determine how long it will last. So, the company
decides to randomly test 200 bulbs that come through the assembly line. They hope to use this
sample, since it is random and is assumed to be representative of all light bulbs, to estimate the
true average lifespan of a light bulb with this new filament. Here is an overview of their
inferential statistics process:
Though it might seem simple enough to conclude that the average light bulb survives for 76
hours, we have to take into account the variability in the lifetimes. That is to say, we need some
way to produce a reasonable interval for the true average, since it is the entire population we are
looking to describe. A discussion of this inference process is left for future sections.
1. Over its first week in the Box Office (12/14/2012 to 12/20/2012), the movie The Hobbit:
An Unexpected Journey grossed the following amounts, in millions of dollars (no
particular order):
(SOURCE: www.the-numbers.com)
2. A marketing firm conducts a focus group with eighteen randomly selected college
students to determine their preference for a variety of clothing lines.
a. Describe the sample.
b. Describe the population.
c. What variables might the marketing firm want to measure?
3. In a quality control process, 250 packages of cheese are randomly selected from an
assembly line. Each package of cheese will be described as either “pass” or “fail,”
depending on whether or not it passes the inspection.
a. Describe the sample.
b. Describe the population.
c. Quality control will fail if more than 1% of the packages fail. How many
packages must pass?
4. Two datasets have a range of 30. Describe how it is possible that one dataset is
considered to be more spread out than the other dataset.
5. One hundred randomly selected CGCC students are surveyed and asked, “Do you believe
that racism is an issue in the college setting?” The survey makers would like to generalize
to college students. What is wrong with their study?
When conducting an analysis of realistic amounts of data, it is tiresome, mundane, and even
infeasible to carry out computations by hand. Microsoft Excel is a powerful and
accessible piece of software that does all of this for us. As such, we seek to better understand how it
works in this section. All images below come from the most recent version of Microsoft Excel.
Excel is spreadsheet-based software. This means that each entry, or cell, represents one piece
of information that is all a part of a larger grid of cells. A cell may contain numerical or textual
information.
Eventually, you will learn to make beautiful spreadsheets, but we are now only concerned with
some basic features. Let's begin by entering the following accounting data from Section 1.2:
We can choose any cell we want to begin entering data. Let's choose cell A1 to type in the
header. This cell reference means that we are looking at column A and row 1. We will enter our
second column's label into cell B1. We will list the data vertically, as shown in the table above.
After clicking on a cell and typing in each entry, simply press ENTER or TAB to move to the
next cell. Do not press ESC, or the data you are typing will be cancelled.
In order to see the entire labels in cells A1 and B1, we can expand the column by placing the
cursor on the border between the grey-shaded labels for columns A and B, then clicking, holding, and dragging the
border to an appropriate width.
Excel is extremely useful due to the fact that it allows us to create formulas based on the values
of existing cells or cell ranges (a collection of one or more cells).
A formula can either act on a provided value or on a provided set of cells. For example, suppose
we want to add up the total revenue. We want the result to appear in cell D3. To initiate a
formula, we must begin with = in the desired formula cell. Thus, we could click cell D3 and
type:
This, however, would defeat the purpose of having entered all the data in already! So, we will
use the built-in sum function. To use this, we type:
= sum(B2:B12)
This tells Excel to sum up the range of values from B2 to B12. The colon indicates that we want
the full range and not just the two cells B2 and B12. If we were only to have wanted to sum cells
B2 and B12 (no in between), then we would have replaced the colon with a comma.
NOTE: Excel is not case-sensitive when it comes to function names. You can type SUM or Sum or
even sUm and Excel will recognize what you are asking it to do. Text matching in functions such as
countif() also ignores case, so “New” and “new” are counted as the same value.
We get:
(NOTE: It is highly recommended that you label your spreadsheet values. Before or after
inserting the sum into D3, it is a good idea to label that cell's content, perhaps in cell C3 as
shown above. This will be very helpful when your spreadsheet is loaded with information.)
To get the proper formatting, highlight cell D3 and select “Currency” from the Number column
in the Home Tab. This formatting only applies to the selected cell(s).
To find the average revenue, we would simply type the following into the desired cell (we'll use
D4):
= average(B2:B12)
To find the range, we subtract the minimum from the maximum:
= max(B2:B12) - min(B2:B12)
This will find the maximum value from B2 to B12 and subtract away the minimum from B2 to
B12, giving us precisely the range. If it is desirable to see the max or the min, you can choose a
cell and simply type in the max portion or the min portion without doing the subtraction, as
shown below:
= 30*D3
NOTE: To indicate multiplication in Excel formulas, you must use the asterisk (*).
Using parentheses alone to indicate multiplication will produce an error.
There are literally hundreds of functions available through Excel. A very useful tool for learning
how to do new things in Excel is to Google what you are trying to accomplish. For example, if I
wanted to find the standard deviation of revenues, I might search Google for “standard deviation
in Excel.” Thousands of results are bound to pop up. Why stop there… try YouTube for many
useful videos.
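For readers who want to cross-check a spreadsheet outside Excel, the sum, average, and range formulas from this section have close counterparts in Python. This is only a sketch. The eleven revenue values below are invented stand-ins, since the Section 1.2 table is not reproduced here; they were chosen so that the minimum, maximum, and mean agree with the $2,230, $9,590, and $5,750 quoted earlier.

```python
from statistics import mean

# Invented stand-in values for cells B2 to B12 (not the book's actual data).
revenues = [4000, 2230, 5000, 6000, 9590, 7000, 5500, 6500, 4500, 6430, 6500]

# Counterpart of = sum(B2:B12)
total_revenue = sum(revenues)

# Counterpart of = average(B2:B12)
average_revenue = mean(revenues)

# Counterpart of = max(B2:B12) - min(B2:B12)
revenue_range = max(revenues) - min(revenues)

print("Total:", total_revenue)      # 63250
print("Average:", average_revenue)  # equals 5750
print("Range:", revenue_range)      # 7360
```

If the spreadsheet and the script disagree, the usual culprit is a cell range that is one row too long or too short.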
1.3.2 Countif()
It is nice to know that Excel has formulas to operate on quantities, but it could still be
devastating to have to count categorical values by hand.
The countif() function is useful for such an act. This function works as follows: you provide a
range of cells for the function to evaluate. You then provide a condition that it should search for
and it counts the number of such instances. Suppose we want to count the number of new
accounts in cells B2 to B12. We would enter:
= countif(B2:B12, "New")
NOTE: we separate the cell range from the search criterion with a comma. After the comma, we type in quotation marks the
word it is to search for. Note that countif() ignores letter case when matching text, so “New” and “new” are counted
together; spelling and spacing, however, must match exactly.
Statistics for Decision-Making in Business © Milos Podmanik Page 26
We get:
A neat little trick is to modify our formula. Let's say that we want to minimize the number of
areas in our spreadsheet that we would need to change if, say, we began calling “New” accounts
“NB” for “New Business.” We would need to change all the account type names, as well as the
search criteria in the formula. To make this easier, we can tell our formula to search for
something that is already typed into an existing cell. Since C10 contains the actual word we want
to search for, we will simply put C10 after the comma instead of the word “New.”
= countif(B2:B12, C10)
This tells Excel what cells to count, and it tells it what cell to find the search criteria in. We still
get the same result. A word of caution: if you modify the entry in C10, your result in D10 will
change accordingly (or it might produce an error).
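The idea of pulling the search criterion from a cell, rather than typing it into the formula, can be mimicked in Python. This is a rough sketch: the countif helper below is a hypothetical stand-in written for this illustration, the column of account types is invented, and the exact matching it uses is a simplification rather than a description of Excel's own matching rules.

```python
def countif(cells, criterion):
    """Count how many entries in `cells` are equal to `criterion`."""
    return sum(1 for cell in cells if cell == criterion)

# Invented column of account types (stand-in for the spreadsheet range).
account_column = ["Old", "New", "Old", "Old", "New", "Old",
                  "Old", "Old", "New", "Old", "Old"]

# Plays the role of cell C10, which holds the word to search for.
criterion_cell = "New"
print(countif(account_column, criterion_cell))  # 3

# Rename the category: only the data and the one criterion cell change.
account_column = ["NB" if kind == "New" else kind for kind in account_column]
criterion_cell = "NB"
print(countif(account_column, criterion_cell))  # 3
```

The call to countif never has to be edited, which is the same benefit we get in Excel from writing C10 in place of the typed word.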
a. Determine the mean number of years this sample has been with the company.
b. Determine the minimum and maximum number of years a person from this
sample has been with the company.
c. Determine the combined overall number of years this sample has been with the
company.
d. Determine the frequency with which people within this sample agreed and
disagreed with the policy change.
e. Calculate the mean, the minimum and maximum, and the range for each of the
two groups (agree and disagree).
f. Describe any patterns that emerged when considering the two groups separately.
When summarizing data, it goes without saying that there are appropriate and inappropriate ways to
display the data. For example, if you collected a person's age and income, you might be
interested in studying income as a function of age. In this case, you probably would not want to
build a pie chart, since you're studying quantitative variables (two of them, at that).
In the previous chapter, the main types of categorical data visualizations were mentioned – bar
graphs and pie charts. Our aim here is simply to summarize and to show how to use them in
conjunction with Excel. We'll create three types of representations:
- Pie Chart
- Frequency Bar Graph – Vertical axis keeps track of the number of instances of each observation
- Relative Frequency Bar Graph – Vertical axis keeps track of the ratio of instances of each observation (decimal or percentage, typically)
Suppose a hotel owner asks 20 randomly selected recent guests to respond to the following
statement regarding their experiences at the new hotel lounge:
- SD - Strongly Disagree
- D - Disagree
- A - Agree
- SA - Strongly Agree
Since the participant number is not important, it is okay to ignore that line of the dataset. Our
focus is on the Opinion row. This is a categorical variable, so we'll begin by counting the
number of SD, D, A, and SA responses by using Excel's countif() function. Further, we'll calculate
the relative frequency of each response by dividing the number of responses for each category by
the total number of observations, which we tally below all the individual frequencies:
One new trick worth mentioning is Excel’s ability to recognize patterns in our formulas. Let’s
say that we typed in our countif() formula for SD in G7 as follows.
= countif(D6:D25, F7)
This does work! Note that, since we shifted the formula down one level, F7 turned into F8. That
is, the search criterion is now being “pulled” from F8, the cell corresponding to an opinion of ‘D’.
However, we have one problem: the counting region also shifted from D6:D25 to D7:D26. We
don’t want that! To tell Excel that we still want the counting region to be D6:D25 and to not
change when we copy our formula, we “lock” the rows and columns by putting a dollar-sign ($)
before the column letter and before the row number, as shown below:
= countif($D$6:$D$25, F7)
(HINT: If you place your cursor over each of the cell names in the formula and press command
F4 on your keyboard, you will notice the dollar-sign toggle for you)
Notice that F7 contains no dollar-signs, so as to indicate to Excel that we wish for the criteria cell
to adjust down one row (still in column F) as we move down one row. We can now copy-paste
the formula down the remaining cells:
= sum(G7:G10)
To get the relative frequencies, we want to divide each frequency by the constant 20. For
instance, the relative frequency of ‘A’ would be 2/20 = 0.1. Instead of telling Excel to divide 2
by 20, we will type the following formula into H7:
= G7/$G$11
Note that we lock cell G11 so that, when we copy this formula to the remaining cells, we
continue to divide by 20, the value in G11.
It is neat to note that we can copy the formula all the way down to H11, since it will simply take
20 and divide it by 20, indicating that the total is 1 or 100% of the data.
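For readers who want to check the spreadsheet logic outside of Excel, here is a minimal Python sketch of the same frequency and relative frequency tally. The list of responses is hypothetical (the original 20 responses are not reproduced here); it was chosen only so that ‘A’ appears twice, matching the 2/20 = 0.1 example above.

```python
from collections import Counter

# Hypothetical responses from 20 guests (an assumption, not the book's dataset).
responses = ["SA"] * 14 + ["A"] * 2 + ["D"] * 2 + ["SD"] * 2

categories = ["SD", "D", "A", "SA"]

# Counter plays the role of countif(): one count per category.
counts = Counter(responses)

# The total plays the role of sum(G7:G10).
total = sum(counts[c] for c in categories)

# Each relative frequency divides by the same fixed total, like G7/$G$11.
relative_freq = {c: counts[c] / total for c in categories}

for c in categories:
    print(f"{c:>2}  frequency = {counts[c]:>2}  relative frequency = {relative_freq[c]:.2f}")
print(f"Total = {total}, relative frequencies sum to {sum(relative_freq.values()):.2f}")
```

Dividing every count by one shared total is the same idea as locking cell G11 with dollar-signs: the denominator never changes from row to row.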
To build a pie chart, we can simply highlight the four opinions and the corresponding
frequencies (click and drag from cell F7 to G10), select the Insert tab, click on Pie in the
Charts column, and select the desired pie chart. We’ll select the first one.
Now we would like to label the chart. It would be nice to see a title and the percentages for each
of the slices. To do this, select the chart and click on Design in the Chart Tools tab that appears.
In the Chart Layouts column, we can select the style of chart most appropriate to our needs. For
demonstration purposes, the first option will be shown below:
There are many options when it comes to formatting graphs and charts. This will be left for
exploration. Note also that many online sources, such as YouTube, offer tutorials on professional
formatting within Excel.
Depending on what one would like to emphasize, a bar graph may be suitable to meet that need.
We can create either a frequency bar graph or a relative frequency bar graph, depending on whether we
want to display the number of times an observation appears or the percentage of observations
resulting in each of the possible variable values.
Using our example from above, since the frequencies are in the column adjacent to the opinion
value, we can simply highlight all observations and frequencies and select the Insert tab, the
Charts column, and select the first 2-D Column graph from Column. Be careful not to select the
Total row.
[Figure: frequency bar graph of the opinions SD, D, A, and SA, with a “Series1” legend]
Since there is only one variable here, we can click on the “Series1” in the legend and press DELETE.
This will free up some space.
[Figure: the same frequency bar graph of SD, D, A, and SA with the legend removed]
With the graph selected, choose the Layout tab that appears in the Chart Tools area.
[Figure: frequency bar graph of SD, D, A, and SA with the horizontal axis labeled “Opinion”]
In the relative frequency bar graph, we wish only to change the measurement on the vertical axis.
We want to draw the proportions from the third column of our data.
We can update our current bar graph to reflect this. If you do not want to lose the information in
your frequency bar graph, you can copy the graph and paste it beside the existing graph. This
will allow us to modify the data that is being drawn in.
Beside the “Series values” box, click the icon. This will now allow you to select the values
of the dependent variable. Click and drag to select all the relative frequencies, except the total
frequency. Then press the icon to close the dialogue box. After relabeling the vertical axis,
you should now see:
We notice that both graphs look nearly identical. This is due to the fact that the relative
frequencies are proportional to the frequencies (they are the frequencies multiplied by 1/20!).
2.1.3 Conclusions
The owner of the hotel can reasonably conclude that 80% of his recent guests enjoyed the lounge
(enough to consider revisiting!). He can conclude that 20% of his guests either did not care for it
or absolutely hated it! If he is interested in additional repeat visitors, perhaps he might like to
determine how to make the experience better for those who seem to be highly dissatisfied. Are
these descriptive measures demonstrative of the entire population of visitors? To a greater or
lesser extent – perhaps.
1. The following dataset represents the meat selection made by individuals at a dinner
banquet. Attendees selected from beef (B), chicken (C), veal (V), or pork (P).
B C B C V B C
C C P P B B C
Meat               Pounds per Person
Beef               58.1
Veal               0.3
Lamb and mutton    0.7
Pork               46.6
Chicken            56.0
Turkey             13.3
3. On opening day, the owners of Green Heart Restaurant invited 29 food critics to be a part
of the culinary experience. Each critic gave a grade of A (Best), B, C, D, or F (Worst) to
reflect the quality of the overall dining experience. The scores are shown below:
A B B A C B C B B
D C B B A A C C C
C B A D C C B B B
A B
To make an assessment of how efficient the technical support department is in helping customers
solve software issues, management keeps track of the length of each phone call taking place over
the day. They find the following:
Since this data is quantitative, the discussed visual displays are not appropriate. However,
management still would like to visualize the 35 observations.
One quick, by-hand technique to visualize how the times appear would be a dot plot, or a simple
number line, with any repeats stacked above others. Given the presence of great technology, we
will use Excel to create a histogram, which is a graph similar to a bar graph (can be either
frequency or relative frequency). The difference is that, instead of having nominal categories on
the horizontal axis, we will create numerical categories. For example, we could simply create
tick marks for each observation value present in the table and then display the number of times
it appears. Often, with small amounts of data, the graph may appear spread out. In this case, we
might decide to create a bar representing, say, all calls that fall between 0 and 3 minutes. Let’s
demonstrate both:
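[Figure: “Call Times” frequency histogram with one bar per observed length from 0 to 15; vertical axis: Frequency; horizontal axis: Length (min.)]
[Figure: “Call Times” frequency histogram with bins 0-2, 3-5, 6-8, 9-11, and 12-15; vertical axis: Frequency; horizontal axis: Length (min.)]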
Beautiful! Now it is clearer how call times are distributed. This visualization is a bit simpler
than the one above, as it groups times into more manageable categories. Note that the bars are
touching. This is what distinguishes a histogram from a bar graph – we want to emphasize that
times are continuous and that every time length between 0 and 15 is accounted for (even
fractions of a minute, potentially).
We can make these categories as wide or narrow as we’d like. We call these categories bins.
Think about this as you would about sorting recycling materials into one of several bins.
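To make the idea of bins concrete, the following Python sketch sorts a list of call lengths into bins of equal width. The call lengths are hypothetical (they are not the 35 observations from the example), and the function name bin_counts is made up for illustration. As in the grouped histogram above, the last bin is allowed to absorb the largest value.

```python
# Hypothetical call lengths in minutes (an assumption, not the book's data).
call_times = [0, 1, 1, 2, 3, 3, 4, 6, 7, 7, 8, 9, 10, 12, 13, 15]


def bin_counts(values, start, width, n_bins):
    """Count how many values land in each of n_bins consecutive bins.

    Bin k starts at start + k * width. The last bin also absorbs anything
    larger, which mirrors the 12-15 bin in the text. Values are assumed
    to be at least as large as start.
    """
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - start) // width), n_bins - 1)
        counts[k] += 1
    return counts


labels = ["0-2", "3-5", "6-8", "9-11", "12-15"]
counts = bin_counts(call_times, start=0, width=3, n_bins=5)

for label, count in zip(labels, counts):
    print(f"{label:>5}: {'#' * count} ({count})")
```

Changing the width argument has the same effect as choosing wider or narrower bins in the histogram.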
The most time-consuming part of building a histogram by hand is organizing the data and
counting the number of observations. Excel does this quite easily via the use of a pivot table. A
pivot table is a “live” table whose values can be formatted in many different ways.
We must first begin with the dataset in Excel as a raw column or row of data:
To insert a pivot table, highlight the entire set of data, including the data label. Click on the
Insert tab and choose the PivotTable option from the Tables column. A data prompt should
appear with the table range already appearing in the box:
This generic template will now allow us to construct a table. From the PivotTable Field List
window, we will drag the Times variable into the Row Labels box. This will create a series of
rows with each of the observations appearing only once. Thus, we will not have to see repeats!
The values of time are, by default, the sums of the times for each of the row labels. This is not
what we want. We want “Count of Times.” To change the type of value, click the arrow on the
“Sum of Times” button. Choose “Value Field Settings.” Change the “Summarize value field by”
option to “Count” and close the dialogue box:
Statistics for Decision-Making in Business © Milos Podmanik Page 47
We can double-check that these values are correct by noting that the Grand Total is 35, the same
as the number of observations. We would like a histogram to show the “Row Labels” along the
horizontal axis and the “Count of Times” along the vertical axis. To do this, select the pivot table
and choose the Options tab from the PivotTable Tools menu.
[Figure: pivot chart of Count of Times against Times (min.), with gaps between the bars]
To make the gaps between bars disappear, select the graph and choose the eighth graph option
from the Design tab in the PivotChart Tools menu shown below (NOTE: this option will
automatically put in axis labels):
[Figure: the same chart of Count of Times against Times (min.), with no gaps between the bars]
We would now like to adjust the bin widths. Doing this is simple!
Select the pivot table. From the Options tab under the PivotTable Tools menu, choose “Group
Selection” from the Group column. In the dialogue box that appears, the “Starting at” and
“Ending at” boxes should reflect the smallest and largest values of the variable. You can adjust
these to be wider or narrower, if you choose to show less than the full dataset. In the “By:” box,
put the width of the classes. In this case, we chose 3. Press “OK” and you should then see the
updated pivot table and graph!
[Figure: relative frequency histogram of call times with bins 0-2, 3-5, 6-8, 9-11, and 12-15; vertical axis: percentage (0% to 30%); horizontal axis: Times (min.)]
[Figure: frequency histogram of test scores; vertical axis: Frequency; horizontal axis: Percentage Earned, with bins labeled 60-64, 65-69, 70-74, 75-79, and 85-90]
a. What can the instructor conclude about the fairness of the test?
b. What appears to be the mean score, based on the histogram?
c. What is the approximate range of scores, and why is it only possible to
approximate this from the given information?
2. A cashier at a mall retail clothing outlet asked customers their age for an anonymous
survey. The ages he collected can be found below:
31 34 30 30 31 27 33 36
33 30 29 28 20 32 24 30
32 30 30 22 31 38 28 31
25 24 25 31 25 24 36 32
24 31 31 32 31 31 28 31
33 20 32 32 52 31 27 30
3. The total number of people (in millions) working in all of the various industries in the
United States in 2010 is given in the table below:
4. A resort chain that wishes to expand is constantly searching for new sites to add
properties that will be profitable. A good place to start is by considering climates.
Suppose Starwood Hotels and Resorts Worldwide obtains the following data from the
U.S. Census Bureau on highest temperatures ever recorded in various cities in the United
States:
To make peace with some regularly occurring notation in statistics, we will use $\Sigma$ (“sigma”) to
mean “the sum of.” For instance, $\sum x$ means “the sum of all $x$-values.”
Let’s say that we have a set of $n$ variable values. To distinguish each of these “$x$’s” we’ll use
subscripts, denoting them:
$$x_1, x_2, x_3, \ldots, x_n$$
Then, to indicate that we want to sum these values across all subscripts, we would write:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$
Using this new notation, we already know how to calculate the mean:
The mean value, or average, of a dataset containing $n$ values can be written as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
A common point of confusion for students is the difference between the subscript $i$ and the
denominator $n$. Many people think that the subscript should be $n$ to match the number of
elements in the dataset. However, $x_n$ specifically refers to the very last value in the dataset. We
treat the $i$ as an index that goes across all subscripts from 1 all the way up to and including $n$.
As you can see, the sigma notation can quickly become convoluted, and so we typically just
write $\sum x$ to indicate the sum of all $x$-values.
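The role of the index in sigma notation may be easier to see as a loop. The short Python sketch below uses three hypothetical values; the index i runs from 1 up to and including n, exactly as in the summation, and the mean is the total divided by n.

```python
x = [4, 7, 9]   # hypothetical values x_1, x_2, x_3
n = len(x)      # n is the number of values, not an index

total = 0
for i in range(1, n + 1):   # i takes the values 1, 2, ..., n
    total += x[i - 1]       # Python lists start at 0, so x_i is stored at x[i - 1]

mean = total / n

print("sum of x:", total)
print("mean:", mean)
```

Note that n appears twice: once as the last value of the index and once as the denominator of the mean.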
Median
The median value of a dataset is the value that represents the physical center of the data set. To
locate the median:
If there is an odd number of values in the data set, the center value can be located by counting in
$\frac{n+1}{2}$ positions from the smallest value, including the smallest value. Alternatively, one can count in an
equal number of values from the left and right endpoints to locate the center value.
If there is an even number of values in the data set, average the two middle-most values together.
The locations of the two middle-most values are $\frac{n}{2}$ and $\frac{n}{2} + 1$
positions from the smallest value, including the smallest value. Once again, these values can be
found by counting from the left and the right endpoints of the dataset.
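The two cases above (odd and even numbers of values) translate directly into a few lines of Python. This is only a sketch of the counting rule, with made-up example values; in practice, Python’s built-in statistics.median() gives the same results.

```python
def median(values):
    """Locate the median by position after sorting the data."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # Odd n: the center value sits at position (n + 1) / 2.
        # Subtract 1 because Python counts positions from 0.
        return data[(n + 1) // 2 - 1]
    # Even n: average the values at positions n/2 and n/2 + 1.
    lower = data[n // 2 - 1]
    upper = data[n // 2]
    return (lower + upper) / 2


print(median([3, 1, 2]))      # odd number of values
print(median([4, 1, 3, 2]))   # even number of values
```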
Example 1: Find the mean and median salaries for the company represented by the following
dataset (in thousands). Explain which measure better reflects the overall company
demographic.
The two middle values are 48 and 50 (these values are four values in from either side). These
represent the 10/2 = 5th and 10/2 + 1 = 6th values in the dataset. To find the median, we average them
together to get $\frac{48 + 50}{2} = 49$.
The median is clearly a more viable measure. The mean takes into account all values, including
the outlier, or “extreme” salary of $1.1 million per year. The median is not influenced by
extreme outliers.
To find the mean and median salaries in Excel we use the functions average() and median().
The parameter for both functions is the cell range corresponding to the dataset.
Another useful tool for describing the location of data points is a percentile.
Percentile
The $p$th percentile is a value such that $p$ percent of the values in a dataset (of $n$ values) are less
than or equal to this value.
To find the location of this value, that is, the index, $i$, first arrange the data in ascending order.
The index can be calculated by:
$$i = \left(\frac{p}{100}\right) n$$
That is, find $p$ percent of the number of observations. Round up if the index is a decimal
and take the average of the values in positions $i$ and $i + 1$ if the calculated value of $i$ is an
integer. One of these two actions will be taken.
Example 2: Find the 50th percentile for the salaries in Example 1. Interpret the real-world
meaning of this value.
We take $i = \left(\frac{50}{100}\right)(10) = 5$. Since this is an integer, we average together the values in positions 5 and
6, giving us a value of 49. This means that 50% of employees represented in this dataset make
$49,000 or less.
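The percentile rule can be written out as a small Python function. This is a sketch of the by-hand rule described above, not of Excel’s method (Excel’s Percentile() function may interpolate between values and so can give slightly different answers). The salary list is hypothetical: only the 5th and 6th values (48 and 50) and the $1.1 million outlier are taken from Example 1.

```python
def percentile(values, p):
    """Return the p-th percentile (0 < p < 100) using the index rule i = (p/100) * n."""
    data = sorted(values)
    n = len(data)
    # divmod gives the whole part of p*n/100 and tells us whether anything is left over.
    whole, remainder = divmod(p * n, 100)
    whole = int(whole)
    if remainder != 0:
        # The index is a decimal: round up to the next position.
        position = whole + 1
        return data[position - 1]
    # The index is an integer: average the values in positions i and i + 1.
    return (data[whole - 1] + data[whole]) / 2


# Hypothetical salaries in thousands (an assumption, apart from 48, 50, and 1100).
salaries = [30, 35, 40, 44, 48, 50, 55, 60, 70, 1100]

print(percentile(salaries, 50))   # averages positions 5 and 6
print(percentile(salaries, 25))   # index 2.5 rounds up to position 3
```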
In Excel, we can use the Percentile() function. The set-up of this function’s parameters is:
2.3.3 Quartiles
Oftentimes, data analysts like to think about data in terms of quartiles, or quarters. There are 4
quartiles, which can be represented as follows:
What if, on the other hand, an employee wants to know what the rank of his salary is (he knows
his percentile value)? This requires reverse-engineering of the idea of a percentile. Without the
use of any mathematical formulas, we would need to count the number of values that are less than
or equal to the salary in question. To make this easier, we can use Excel’s Rank() function. The
parameters we will use are as follows:
This will return the number of values that are less than or equal to the value in question. If we
changed the parameter of 1 to a 0, Excel would return the ranking of that value, treating rankings
as being similar to the ranks of, say, runners in a race.
We will then need to divide this output by the number of observations in the dataset. To make
the counting process more automated, we can take this output and divide it by the output of the
count() function. This function will simply count the number of entries in the specified range,
and has the following parameter:
= count(cell range)
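The counting idea described above (count the values at or below the salary, then divide by the number of observations) can be sketched in Python as follows. The salary list is hypothetical and the function name percentile_rank is made up for illustration; the sketch follows the description in the text rather than the exact behavior of Excel’s Rank() function.

```python
def percentile_rank(values, target):
    """Fraction of values that are less than or equal to target."""
    at_or_below = sum(1 for v in values if v <= target)   # the "rank" count
    return at_or_below / len(values)                      # divide by count()


# Hypothetical salaries in thousands (an assumption, not the book's dataset).
salaries = [24, 30, 35, 40, 44, 48, 50, 55, 60, 1100]

print(percentile_rank(salaries, 24))   # share of salaries at or below 24
print(percentile_rank(salaries, 50))   # share of salaries at or below 50
```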
Let’s say the employee making $24,000 would like to know his salary’s rank. To calculate, we
would type the following:
Giving us:
Another approach would be to use the “Rank and Percentiles” tool in an Excel add-in called
Analysis ToolPak. This method will show the ranks and percentiles of all values in the dataset
and is only useful for relatively small, manageable datasets. The Analysis ToolPak will be
important later on, so we’ll describe its installation here.
To install the Analysis ToolPak, select the File tab within Excel. Then select Options from the
ribbon that appears. Select the Add-Ins option. Click Analysis ToolPak and press Go.
You can either specify an output range, or have Excel create a new worksheet with the results.
This is up to your preferences. Check “Labels in First Row” and be sure that the data label has
been selected.
1. Suppose your instructor releases scores on a recent project. The scores are as follows:
83 89 76 41 92 85 76 71
95 92 80 84 77 78 81 75
64 30 80 79 78 70 75 81
99 85 80 82 70 69 71 70
2. In order to make way for new products, a grocery store chain would like to determine
whether the Lunch Pack or Family Pack of Flaxem Crackers generates more revenue. The
following two datasets show the revenue generated by each over a 10-month period:
3. Suppose that Budget Car Rentals assesses a variety of new 2012 and 2013 sedans for its
new line of rental cars. It finds the following information on city and highway fuel
efficiencies (mpg) for eight vehicles in consideration:
a. Find the mean and median fuel efficiency for city and highway mileages of the
vehicles being considered. Comment on any differences between the two values.
b. What is the rank percentage of a vehicle that has 43 city mpg?
c. If the company makes its choice based on the top 15% of city and highway for the
vehicles being considered, what will be the minimum city and highway mileages
they should consider?
d. Make a recommendation for which vehicle(s) should be purchased, if any.
The measure of center is always a good start. But what does a sample mean not tell us? It fails to
describe how far apart the data are from one another. In other words, we need to assess the
variability, or variance, of the numbers we have collected.
The simplest way we might go about describing the variability is by simply looking at the range
of the data, such that:
$$\text{Range} = \text{maximum value} - \text{minimum value}$$
However, this still does not help us identify how spread out the data are. For example, suppose we
find our range to be 110 units (see dataset below). This might seem rather daunting at first, but
what if all values were clumped between 0 and 10, and there existed an outlier of 110?
Clearly, the range is often determined by outliers alone.
0 1 3 10 8 7 4 110
To create a better measure of variability that takes all data points into account, just like the mean
does, statisticians established the standard deviation. As the name implies, this is a standard tool
that measures the average deviation (that is, by how much each value deviates) from the mean. This
requires us to find all the deviations for points in our dataset, $x - \bar{x}$.
We would find all of these. Let’s demonstrate with the above dataset:
Value    $x - \bar{x}$
0        -17.875
1        -16.875
3        -14.875
10       -7.875
8        -9.875
7        -10.875
4        -13.875
110      92.125
Mean: 17.875
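The deviations that we observe to be below the mean produce a negative deviation and the one
above the mean has a positive deviation. To find an average deviation, we would ideally add
the deviations together and divide by the number of values. However, the negative and positive
deviations cancel one another, so their sum is always 0. To get around this, we square each deviation: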
Value    $x - \bar{x}$    $(x - \bar{x})^2$
0        -17.875          319.5156
1        -16.875          284.7656
3        -14.875          221.2656
10       -7.875           62.01563
8        -9.875           97.51563
7        -10.875          118.2656
4        -13.875          192.5156
110      92.125           8487.016
Great, now they can be summed up to give 9782.88! Thus, we have found the following:
$$\sum (x - \bar{x})^2 = 9782.88$$
One would think that dividing by 8 would now be appropriate to find the average. Due to
mathematical properties that are beyond the scope of this course, the division will be by 7, which
is $n - 1$. Thus:
$$\frac{\sum (x - \bar{x})^2}{n - 1} = \frac{9782.88}{7} \approx 1397.55$$
NOTE: The division by $n - 1$ has to do with the fact that we are often dealing with a sample in
inferential statistics and hope to make conclusions about a population.
Sample Variance
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
To make all of these calculations more meaningful (to have a true average), we should probably
“unsquare” the value that we have. When we do this, we get the sample standard deviation:
$$s = \sqrt{\frac{9782.88}{7}} \approx 37.38$$
This is what we can think of as the average deviation of each point from the mean. It is clearly
high for this dataset. What is causing it? The outlier of 110!
Conclusion: On average, values in the dataset deviate from the mean by about 37 units.
Sample Standard Deviation
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
In Excel, the standard deviation can be calculated simply by using the function below:
= stdev(cell range)
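To double-check the hand calculation, here is a short Python sketch that repeats each step for the dataset used in this section: the deviations, the squared deviations, their sum, the division by n - 1, and the square root. The comparison with statistics.stdev() at the end is included only as a sanity check; that function also computes the sample standard deviation.

```python
import math
import statistics

data = [0, 1, 3, 10, 8, 7, 4, 110]   # the dataset from this section
n = len(data)

mean = sum(data) / n
deviations = [x - mean for x in data]                # x minus the mean
squared_deviations = [d ** 2 for d in deviations]    # square each deviation
sum_of_squares = sum(squared_deviations)

variance = sum_of_squares / (n - 1)   # divide by n - 1, not n
std_dev = math.sqrt(variance)         # "unsquare" to get the standard deviation

print("mean:", mean)
print("sum of squared deviations:", round(sum_of_squares, 2))
print("sample variance:", round(variance, 2))
print("sample standard deviation:", round(std_dev, 2))
print("statistics.stdev agrees:", math.isclose(std_dev, statistics.stdev(data)))
```

Notice that the deviations themselves add up to zero, which is why they are squared before being summed.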
Example 1: A river with mild current is known to have an average depth of 3 feet with a
standard deviation of 3 feet. The bottom is not visible. Is the river safe to cross by foot? Also,
what is the variance?
SOLUTION: Since there is a standard deviation of 3 feet, we can conclude, that, on average, the
river depth deviates by 3 feet from the mean. It would not be unusual to encounter a part of the
river with a depth of 6 or more feet. Therefore, the river should not be crossed by foot.
Since the standard deviation is the square root of the variance, the variance is the square of the
standard deviation. That is, $s^2 = 3^2 = 9$.
Thus, the variance is 9. The variance does not have a valuable interpretation.
Think about this: $n$ is a fixed value for our sample (8 in the dataset above). The only thing that could make
$s^2$ large or small is the numerator. Thus, if the deviations are large (a bad thing!), then the
squared deviations will be large, and so the sum of squares will be large. This implies a large
standard deviation.
So, a large standard deviation means that there is a lot of variability, or that the values are vastly
different from one another. A small standard deviation means the values in the data set are quite
alike. In the near future, you'll see why it is important to have a small standard deviation. In
general, as the variance and standard deviation get larger, our ability to make precise statements
about the population quickly evaporates.
We will be using variance and standard deviation consistently for the rest of the semester. It is
important to get comfortable with them.
Indeed they do. Do you think that we can find them? Definitely not! The population variance
requires the use of the population mean, $\mu$. How do we get $\mu$? We take the average of all the
values in the entire population. Since we typically don't know this value, we also typically don't
know the population variance, so certainly we don't know the population standard deviation
(since it's the square root of the population variance).
             Variance      Standard Deviation
Sample       $s^2$         $s$
Population   $\sigma^2$    $\sigma$
The population parameter, $\sigma$, is the lowercase Greek letter “sigma.” (This is as opposed to the
sample statistic, $s$.)
The standard deviation, much like the mean, is easily skewed by excessively small or large
values. We noticed this in the first example in this section. Using the idea of medians and
percentiles is a safe bet for outlier-proofing our spread estimates. An interquartile range is the
difference between the 3rd quartile and the 1st quartile. Remember, these are simply the 75th and
25th percentiles, respectively. The difference measures the spread of the middle 50% of the dataset.
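To see how the interquartile range resists outliers, the Python sketch below applies the by-hand percentile rule from the previous section to the dataset with the outlier of 110. The helper function percentile is a sketch of that rule (not of Excel’s built-in calculation, which may interpolate and give slightly different quartiles).

```python
def percentile(values, p):
    """p-th percentile (0 < p < 100) using the index rule i = (p/100) * n."""
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(p * n, 100)
    whole = int(whole)
    if remainder != 0:
        return data[whole]                       # round up: position whole + 1
    return (data[whole - 1] + data[whole]) / 2   # average positions i and i + 1


data = [0, 1, 3, 10, 8, 7, 4, 110]   # dataset from earlier in this section

q1 = percentile(data, 25)   # 1st quartile (25th percentile)
q3 = percentile(data, 75)   # 3rd quartile (75th percentile)
iqr = q3 - q1

print("Q1:", q1)
print("Q3:", q3)
print("Interquartile range:", iqr)
```

Compare the interquartile range here with the standard deviation of about 37 found earlier for the same data: the single outlier of 110 barely affects the quartiles.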
Example 2: Consider the following home prices and find both the standard deviation and the
interquartile range. Describe what conclusions can be drawn from these values.
The standard deviation indicates that home prices, on average, vary by $277,100 from the mean
value. However, we see from the interquartile range that the middle 50% of homes only vary by
$6,500. The standard deviation is being skewed by the home that is priced at $875,000. The
interquartile range tells us that the majority of home values stay pretty close to the median value.
Additionally, we see that most home values are between $88,000 and $96,000.
To generate most of the features we have discussed up until now, we turn to Excel’s Analysis
ToolPak for a more automated approach.
Access the Data Analysis tool from the Data tab in Excel. Select “Descriptive Statistics” from
the menu and select the data from the spreadsheet containing the data.
Now that we have a basis for measuring data in terms of its center and spread, we turn back to
making connections with the visual shape of the distribution.
There are many different shapes that we encounter for distributions. Let's discuss a few. First,
note that the following do not look like the rectangular histograms from earlier on. These are
smoothed out forms of what we experienced earlier. They are often used to describe the general
shape of a distribution. And, of course, they are much easier to sketch.
A histogram is said to be (a) unimodal if it has a single peak, (b) bimodal if it has two peaks,
and (c) multimodal if it has more than two peaks.
If we follow the curves from left to right, we begin at the lower tail, move over the peak(s), and
arrive back down to what is called the upper tail.
A unimodal histogram is said to be symmetric, if we are able to draw a line down the center
such that the left side of the line is a mirror image of the right side. Consider the following
unimodal symmetric histograms:
A unimodal histogram that is not symmetric is said to be skewed. If the upper tail of the
histogram stretches out much farther than the lower tail, then the distribution of values is
positively (right) skewed. On the other hand, if the lower tail is much longer than the upper tail,
the histogram is negatively (left) skewed. Can you identify the following unimodal histograms
as positively or negatively skewed?
2.4.7 Skewness
Excel also produces a nice measure that allows us to make conclusions about the general shape
of the distribution. This measure is called skewness.
The farther from 0 that the skewness measure is, the more skewed in the respective direction the
distribution will be.
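For the curious, the sign of the skewness measure can be reproduced with a short calculation. The Python sketch below uses the adjusted sample skewness formula, which, to the best of our knowledge, is the one behind Excel’s skewness output; treat that correspondence as an assumption. The function name sample_skewness is made up for illustration.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # sample std. dev.
    if s == 0:
        raise ValueError("all values are identical")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]
long_upper_tail = [0, 1, 3, 10, 8, 7, 4, 110]   # dataset with the outlier of 110

print(sample_skewness(symmetric))        # close to 0 for symmetric data
print(sample_skewness(long_upper_tail))  # positive: the upper tail is longer
```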
Consider the following data showing the number of televisions owned by randomly sampled
individuals in a big city:
We notice that the Skewness measure is positive: 0.51. This means the dataset is slightly skewed
to the right:
[Figure: frequency histogram of the number of TV's owned (0 through 6); vertical axis: Frequency; horizontal axis: Number of TV's]
After analyzing a dataset, how do we assess likely values for data and deem other values as
outliers?
One approach is to determine how many standard deviations above (positive value) or below the
mean (negative value) a given data value is.
For instance, suppose we have a dataset with mean 20 and standard deviation 3. We have an
observation of 14. In terms of units, this value is 6 units below the mean. Thus, it has a deviation
of -6. This deviation tells us that the data value in question is 2 standard deviations below the
mean, since:
$$\frac{14 - 20}{3} = \frac{-6}{3} = -2$$
$z$-Score
A $z$-score tells us the number of standard deviations a data point, $x$, is from its mean, $\bar{x}$.
Mathematically,
$$z = \frac{x - \bar{x}}{s}$$
Chebyshev’s Theorem
For any dataset, at least $1 - \frac{1}{k^2}$ of the data values must be within $k$
standard deviations of the mean (to the left and the right), for any $k > 1$.
Example 3: A data value is 3 standard deviations above the mean. Is this an extreme value?
By Chebyshev’s Theorem with $k = 3$, at least $1 - \frac{1}{3^2} \approx 89\%$ of all data points in this
distribution will lie between -3 and +3 standard deviations from the mean. Thus, there is, at most,
an 11% chance of observing something higher than +3 standard deviations. This data value is
fairly unlikely and might be considered a mild outlier.
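Both ideas in this section come down to one-line calculations, sketched in Python below. The first function computes how many standard deviations a value lies from the mean, using the numbers from the example above (mean 20, standard deviation 3, observation 14). The second computes the Chebyshev lower bound for a chosen number of standard deviations. The function names are made up for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2


print(z_score(14, mean=20, std_dev=3))         # 2 standard deviations below the mean
print(round(chebyshev_lower_bound(2), 4))      # at least 75% within 2 standard deviations
print(round(chebyshev_lower_bound(3), 4))      # at least about 89% within 3
```

A useful habit is to compute the z-score first and then plug its size in as k, as long as it is greater than 1.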
1. The Connecticut Agricultural Experiment Station conducted a study of the calorie content
of different types of beer. The calorie content (calories per 100 mL) for 26 brands of
light beer are:
29 28 33 31 30 33 30 28 27 41 39 31 29
23 32 31 32 19 40 22 34 31 42 35 29 43
a. Find the standard deviation. Explain the real-world meaning of this value.
b. Find the interquartile range. Explain the real-world meaning of this value.
c. Find the skewness. What type of shape does this distribution have?
2. The UNICEF report “Progress for Children” (April, 2005) included the accompanying
data on the percentage of primary-school-age children who were enrolled in school for 23
countries in Central Africa.
a. Find the range, standard deviation, and interquartile range. Explain what these
three values tell us about the shape of the distribution.
b. Explain the real-world meaning of the standard deviation and the interquartile
range.
c. Produce descriptive statistics for this dataset with the Analysis ToolPak in Excel.
d. Is the distribution skewed? If so, in which direction?
e. Create a relative frequency histogram. Describe any trends in the data.
f. Is an observation of 79.6 an outlier? Use Chebyshev’s Theorem to justify your
answer.
a. Find the range, standard deviation, and interquartile range. Explain what these
three values tell us about the shape of the distribution.
b. Explain the real-world meaning of the standard deviation and the interquartile
range.
c. Produce descriptive statistics for this dataset with the Analysis ToolPak in Excel.
d. Is the distribution skewed? If so, in which direction?
e. Find the z-score for the observation 79.6. Explain what your answer means in
real-world terms.
f. Create a relative frequency histogram. Is an observation of 79.6 an outlier? Use
Chebyshev’s Theorem to justify your answer.
4. Using the five class intervals 100 to 120, 120 to 140, . . ., 180 to 200, devise a frequency
distribution based on 70 observations whose histogram could be described as follows:
a. Using Analysis ToolPak in Excel, generate all descriptive statistics. Discuss the
best measure of center and the best measure of spread based on what you see.
Justify why these measures were selected.
b. Find the z-score for the observation 4194. Explain what your answer means in
real-world terms.
c. Is $4,194 considered an extreme outlier? Also use Chebyshev’s Theorem to
numerically reinforce your answer.
6. Cost-to-charge ratios were reported for the 10 hospitals in California with the lowest
ratios (San Luis Obispo Tribune, December 15, 2002). The 10 cost-to-charge values
were
8.81 10.26 10.2 12.66 12.86 12.96 13.04 13.14 14.7 14.84
Discuss relevant descriptive statistics and a relative frequency distribution. Use your
information to make a conclusion about the state of hospitals in California.
7. The technical report “Ozone Season Emissions by State” (U.S. Environmental Protection
Agency, 2002) gave the following nitrous oxide emissions (in thousands of tons) for 16
states in the continental United States:
76 22 40 7 30 5 6 136 72 33
0 89 136 39 92 40 13 27 1 63
Generate a brief report about the distribution of nitrous oxide emissions in the sampled
states. Use descriptive measures and visuals to justify your answer.
In this chapter, we’ll explore the nature of probabilistic thinking. You’ll also notice the phrase
“Decision Theory” in the title. Instead of focusing on the trite probability questions involving
situations that we don’t ever encounter, we’ll concern ourselves with real-world situations where
probabilistic reasoning will help us make a decision.
Example 1: A weather report by the National Weather Service (NWS) stated on July 31, 2011
that, overnight, there was a 50% chance of precipitation in the 85225 zip code in which
Chandler-Gilbert Community College is located. What does this mean?
(SOURCE: www.crh.noaa.gov/)
SOLUTION: This is actually quite a loaded statement. One might want to say that, out of 100
times, it will rain 50 times. This is a very misleading approach for a couple of different reasons.
First off, what is meant by “times”? We are only concerned with one time: overnight on July 31,
2011.
A probability is actually a measure of how likely something is to occur in the long-run. That is,
if something were to be repeated in trials over and over again then, theoretically, the specified
outcome would occur a certain percentage of time. Importantly, it must be noted that the
conditions under which we are measuring a probability must be in place in order for the
probability to be a valid measure.
In our case, NWS states that, under the exact same environmental conditions taking place
throughout the night of July 31, 2011, it would be expected to rain 50% of the time.
The graph below shows a hypothetical scenario in which there is a 50% chance of precipitation
under the set of conditions that occurred on the above night. Notice that it rained on the initial
day and so immediately the proportion (or probability) of rainy days is 100%. As the same
conditions occur on different days, sometimes it rains and sometimes it does not. Having noted
that, any given day has a 50% chance of precipitation. We notice that the proportion is quite
unstable at first, jumping from 100% down to nearly 40%; however, as many days with this
same set of conditions pass (in the long run), we notice that the proportion becomes more stable
and approaches the theoretical probability of 50%.
[Graph: running proportion of rainy days plotted against the number of days with the same conditions.]
Graph: Based on a random simulation involving the true probability of a 50% chance of precipitation
and what occurs in the long-run.
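The long-run behavior described above can be imitated with a short simulation. This is only a sketch: the function name `running_proportion`, the seed, and the choice of 140 simulated days are assumptions made for illustration, not the author's original simulation.

```python
import random

def running_proportion(p_rain, n_days, seed=0):
    """Return the proportion of rainy days observed after each simulated day."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    rainy = 0
    proportions = []
    for day in range(1, n_days + 1):
        if rng.random() < p_rain:      # it rains with probability p_rain
            rainy += 1
        proportions.append(rainy / day)
    return proportions

# 140 days that all share the same 50% chance of precipitation
props = running_proportion(p_rain=0.5, n_days=140)
for day in (1, 10, 50, 140):
    print(f"after {day:3d} days: proportion of rainy days = {props[day - 1]:.3f}")
```

The first value is always 0 or 1, since only one day has been observed. The later values wander less and less as more days accumulate, which is the stabilizing pattern the graph illustrates.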
As an interesting note, NWS has sophisticated helium “balloons” that they send up into the air to
measure properties such as wind speed and direction, humidity, and barometric pressure. Then
physics is used based on theories of fluid mechanics to make the prediction.
Among many others that we could begin to state, there is one other major misconception about
probability: that if the probability that it rains is said to be very small and yet it rains, then the
probability must be wrong. This is incorrect. Probability is a measure of uncertainty. As in the
case of meteorology, the predictions are scientific and are based upon prior data. Just because it
has only rained, say, 10% of the time on days like today, this is not to say that it won‟t rain. In
fact, it very well might! The moral of the story is that probability talks about likelihood. Only in
the instance of 0% and 100% probabilities is anything guaranteed. If there are situations in which
something either never happens or always happens, then we‟re probably not concerned about
understanding probabilities.
Probability
Measuring Probabilities
While probability is considerably more complicated than we‟ll let on, the basic idea is that a
probability can be calculated by considering the number of times some event occurs relative to
the total number of “trials,” or observable situations. In simpler terms, it is the number of
“successes” out of the total number of trials.
Calculating Probability
The probability that event $E$ occurs, denoted $P(E)$, is the ratio (or fraction) of successes divided
by the number of trials. Mathematically, we write the number of times $E$ occurs as $n(E)$ and the
total number of trials as $n(S)$. That is,
$$P(E) = \frac{n(E)}{n(S)}$$
This formula works when all elements in the sample space are equiprobable, that is, each
individual outcome in the sample space has the same probability of occurring as any other
outcome.
As a note, the $n(\cdot)$ notation stands for “the number of ways” the event in parentheses can occur.
The $S$ in the denominator stands for sample space, or the total number of things/situations/trials
being considered in the experiment.
Example 2: In a 2009 study of high-fructose corn syrup (HFCS), a corn-based sweetener used in
a wide variety of foods, beverages, and condiments, 20 samples of HFCS were analyzed. Of
those, nine of them were found to contain mercury by researchers. Based on the results of this
study, find the probability that a random sample of HFCS contains mercury and explain what this
result means.
SOURCE: http://www.washingtonpost.com/wp-
dyn/content/article/2009/01/26/AR2009012601831.html
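SOLUTION: The event in this scenario is that mercury is found. Out of the total 20 trials, nine of
them contained mercury. Therefore,
$$P(\text{mercury}) = \frac{9}{20} = 0.45$$
In the long run, we would expect about 45% of randomly chosen HFCS samples to contain mercury.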
Example 3: In July 2011, temperatures in Gilbert, Arizona were above 100°F every day
(SOURCE: www.weather.com). Based on this data, a researcher concludes that the probability of
above-100°F temperatures in Arizona is 100%. Comment on his findings.
SOLUTION: Since temperatures were above 100°F on all 31 days of July 2011, it is fair to make
the experimental observation that approximately 100% of all days in July have temperatures
exceeding 100°F, in the long run (there have been days in the past when temperatures were
below 100°F); however, because we know that temperatures are periodic, or that they go from
low to high and back to low over the course of a year, 100% is not a good estimate for
temperatures in Arizona, in general (temperatures are practically never above 100°F in January!).
This example truly stresses the importance of critical thinking when using probabilities.
Probabilities are often used and abused in the media, education, and politics, just to name
a few. We want to make sure that we are as specific as possible.
It will often be considerably helpful to display probabilities in a tabular form, that is, through the
use of tables. This type of table is called a contingency table. This not only helps to organize
data, but to simultaneously see the big picture. Let‟s consider an example.
Example 4: A 1950 study considered 1,418 hospital patients in London, half with lung cancer and
half without, and whether or not they smoked over the course of their lives. The
following was found:
Assuming this data can be used as a representation of the entire population of London residents,
analyze the data by discussing the following:
a. What is the probability that a randomly selected participant within this study develops
lung cancer?
b. Provided that a person was a smoker, what is the probability that he has lung cancer?
c. Provided that a person was not a smoker, what is the probability that he has lung cancer?
d. Given that a person has lung cancer, what is the probability that he smokes?
SOLUTION:
When answering these questions, it is fairly useful to fully organize the data by providing all
totals:

                 Lung cancer   No lung cancer   Total
Smoker               688            650         1,338
Non-smoker            21             59            80
Total                709            709         1,418

a. Since there is a total of 1,418 individuals being considered and, of those, 709 developed
lung cancer,
$$P(\text{lung cancer}) = \frac{709}{1418} = 0.50$$
We must be careful in using this probability as it doesn't really reveal anything about the
link between lung cancer and smoking, since 709 patients with lung cancer and 709
without lung cancer were chosen to participate in the study to begin with. This is a
probability that was fixed by the researchers.
b. There is a total of 1,338 individuals in the study that smoke (we are limited to the
smokers only, per the way the question is stated). Of those individuals, 688 have lung
cancer.
$$P(\text{lung cancer, given smoker}) = \frac{688}{1338} \approx 0.514$$
Slightly over half of the patients who are smokers developed lung cancer. This number is
frighteningly large. Before we jump the gun in assuming that smoking is the culprit here,
we should probably consider what happens with nonsmokers.
c. There is a total of 80 individuals in the study that do not smoke (1,418 minus 1,338). Of
those individuals, 21 have lung cancer (709 minus 688).
$$P(\text{lung cancer, given non-smoker}) = \frac{21}{80} \approx 0.263$$
Slightly more than one-fourth of non-smokers developed lung cancer. This number
appears to be considerably less severe than for the smokers. We speculate (but did not
prove) that smoking increases the likelihood that one will develop lung cancer.
d. There are 709 patients with lung cancer. Of these, 688 smoke.
$$P(\text{smoker, given lung cancer}) = \frac{688}{709} \approx 0.970$$
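The four probabilities in Example 4 all come from dividing one count by another. The sketch below redoes that arithmetic in Python; the cell counts are derived from the totals quoted in the solution (1,418 patients, 709 with lung cancer, 1,338 smokers, 688 smokers with lung cancer), and the dictionary layout and the helper `n` are assumptions made for illustration.

```python
# Cell counts implied by the totals quoted in the solution text.
counts = {
    ("smoker", "cancer"): 688,
    ("smoker", "no cancer"): 1338 - 688,
    ("non-smoker", "cancer"): 709 - 688,
    ("non-smoker", "no cancer"): (1418 - 1338) - (709 - 688),
}

def n(smoking=None, cancer=None):
    """Number of patients matching the given labels (None means 'any')."""
    return sum(
        k for (s, c), k in counts.items()
        if smoking in (None, s) and cancer in (None, c)
    )

p_cancer = n(cancer="cancer") / n()
p_cancer_given_smoker = n("smoker", "cancer") / n(smoking="smoker")
p_cancer_given_nonsmoker = n("non-smoker", "cancer") / n(smoking="non-smoker")
p_smoker_given_cancer = n("smoker", "cancer") / n(cancer="cancer")

print(f"a. P(lung cancer)                   = {p_cancer:.4f}")
print(f"b. P(lung cancer, given smoker)     = {p_cancer_given_smoker:.4f}")
print(f"c. P(lung cancer, given non-smoker) = {p_cancer_given_nonsmoker:.4f}")
print(f"d. P(smoker, given lung cancer)     = {p_smoker_given_cancer:.4f}")
```

Notice that the denominator changes with the question: the whole study for part a, only the smokers for part b, only the non-smokers for part c, and only the lung cancer patients for part d.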
The moral of the story is: analyze the situation from a variety of lenses. What appears to be true
might be an illusion of what we see immediately! Sometimes, however, it is about what the
naked eye does not detect. This is what makes good analysts.
1. A classmate of yours was absent when this section was discussed. Explain to her what a
probability is in your own words.
2. In a study performed by Cambridge University in the United Kingdom, it was found that,
“One out of three people is overwhelmed by the latest breakthroughs in technology.”
(SOURCE: http://www.gev.com/2011/07/study-one-out-of-three-people-overwhelmed-
by-technology/). Primarily, individuals are overwhelmed by how much information is
available through the use of social networks and smartphones, to name just two. Explain
what is meant by this and explain in terms of probabilistic reasoning.
Find the probability that a respondent believes that consistency in branding is:
4. The probability that a visit to a primary care physician‟s (PCP) office results in neither
lab work nor referral to a specialist is 35%. Of those coming to a PCP‟s office, 30% are
referred to specialists and 40% require lab work.
Determine the probability that a visit to a PCP‟s office results in both lab work and
referral to a specialist. (Video Solution)
5. A public health researcher examines the medical records of a group of 937 men who died
in 1999 and discovers that 210 of the men died from causes related to heart disease.
Moreover, 312 of the 937 men had at least one parent who suffered from heart disease,
and, of these 312 men, 102 died from causes related to heart disease.
In the previous section, we began computing probability using some fairly basic ideas. In
calculating probabilities, we made a huge assumption: that the found number represents what
will occur in the long-run. For instance, if we conduct a study and find that out of 100 people, 94
respond positively to a new energy drink, can we conclude the drink is effective in providing
added energy?
Okay, so you have a data sample collected from a specific population and your goal is to now
talk about probabilities.
Example 1: Imagine that you work for a marketing agency and your goal is to determine the
effectiveness of two different branding approaches to a new line of clothing. The first approach
involves establishing a group of Facebook followers by offering discounts on clothing to those
who become a friend of the company. The company hypothesizes that, by seeing the company
logo on their Facebook account each week, followers will gain a strong familiarity and comfort
level with the company's product. The second approach involves hiring Hollywood actors to
endorse the product at film festivals and celebrity appearances. The company then tracks the
degree of success of each branding tactic by measuring the number of retail outlets that agree to
stock the product based on the branding used. They find that, of the 6 companies exposed to
Tactic 1 (T1), 5 agreed to stock the product. Of the 7 companies exposed to Tactic 2 (T2), 5
agreed to stock the product.
SOLUTION: Let's start with a simpler question, and first consider T1. We find that the
probability of a successful sale is:
$$P(\text{success with T1}) = \frac{5}{6} \approx 0.83$$
Rounding to 80% to keep the arithmetic simple, this means that we should expect about 80% of
all companies to sell the clothing line, in the long run.
Suppose that a marketing analyst is to offer T1 to two different companies. He would like to
know, what is the probability that both companies agree to sell the product? Is the answer 80%?
Unfortunately, no. There is an 80% chance that each company agrees to sell the clothing line.
We should expect that the probability that both sign on is less.
We know that about 8 out of 10 times, Company 1 (C1) will sign on and that 8 out of 10 times
Company 2 (C2) will sign on. Let's compare the possibilities by using a tabular approach:
                      Company 2 Choices
                 Y  Y  Y  Y  Y  Y  Y  Y  N  N
Company 1    Y
Choices      Y
             Y
             Y
             Y
             Y
             Y
             Y
             N
             N
Each cell in the table represents a particular combination of the C1‟s choice and C2‟s choice. So,
the 1-1 entry (remember, this means first row, first column) of the table is the situation in which
it does indeed turn out that C1 and C2 agree to sell the clothing line. The question was, what is
the probability that both sign-on? Since the definition of probability is the ratio of the number of
ways the event can occur divided by the total number of possible outcomes, let‟s do a bit of
counting by highlighting important features of the table:
[Table: the same 10 x 10 grid, with the 8 x 8 block in which both companies choose Y shaded.]
The shaded region represents the number of ways in which we can get both companies to sign
on. This region is 8 x 8, which creates 64 possibilities. The total number of possibilities is simply
the total number of cells in the table. Since the table is 10 x 10, we have 100 possibilities.
So,
$$P(C1 \text{ and } C2) = \frac{64}{100} = 0.64$$
This is, as speculated, less than the probability that only one company signs on. Let's consider
what we really did here:
$$P(C1 \text{ and } C2) = \frac{64}{100} = \frac{8 \cdot 8}{10 \cdot 10}$$
Notice that
$$\frac{8 \cdot 8}{10 \cdot 10} = \frac{8}{10} \cdot \frac{8}{10} = P(C1) \cdot P(C2)$$
Or, in short,
$$P(C1 \text{ and } C2) = P(C1) \cdot P(C2)$$
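The counting argument can be checked by listing every cell of the 10 x 10 table. This is a small sketch; the variable names are chosen here for illustration and are not part of the text.

```python
import math
from itertools import product

# Each company says Y in 8 of its 10 equally likely positions and N in the other 2.
c1_choices = ["Y"] * 8 + ["N"] * 2
c2_choices = ["Y"] * 8 + ["N"] * 2

# Every (C1 choice, C2 choice) pair corresponds to one cell of the table.
cells = list(product(c1_choices, c2_choices))
both_yes = sum(1 for c1, c2 in cells if c1 == "Y" and c2 == "Y")

p_both = both_yes / len(cells)
p_c1 = c1_choices.count("Y") / len(c1_choices)
p_c2 = c2_choices.count("Y") / len(c2_choices)

print(both_yes, "of", len(cells), "cells have both companies signing on")
print("product rule agrees with counting:", math.isclose(p_both, p_c1 * p_c2))
```

Counting cells works only because every cell is equally likely, which is the same as assuming that one company's choice does not influence the other's.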
SOLUTION: We can fairly assume that the first driver being caught
and the second driver being caught (calling these events $A$ and $B$,
respectively) constitute events that do not affect one another. Thus,
$$P(A \text{ and } B) = P(A) \cdot P(B) = 0.49$$
There is a 49% chance that both drivers are caught. This is about the likelihood of getting heads
on the toss of a coin.
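$$P(\text{1st and 2nd contaminated}) = P(\text{1st contaminated}) \cdot P(\text{2nd contaminated})$$
We know that $P(\text{1st contaminated}) = \frac{2}{20}$. Now, since the first probability “removes” one of the two
contaminated bushels and one bushel out of the 20 available, the probability of shipping a second
contaminated bushel is slightly changed to:
$$P(\text{2nd contaminated}) = \frac{1}{19}$$
Thus, the events are indeed dependent, and so the probability becomes:
$$P(\text{1st and 2nd contaminated}) = \frac{2}{20} \cdot \frac{1}{19} = \frac{2}{380} \approx 0.0053$$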
Does this outcome satisfy the farm producing these bushels of corn? Thinking in more detail, the
main concern is actually in regards to one or more (at least one) contaminated bushel going out!
In order to address how to find this, it is useful to think about the following, perhaps obvious,
characteristics.
1) A particular event is: guaranteed to not occur, is guaranteed to occur, or lies somewhere
between these extremes.
2) In a given situation, or sample space, the likelihood of something occurring (however
small or insignificant), is guaranteed.
3) The summed probabilities of all the possible events in a situation constitute the entire, or
the whole of all possibilities.
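1) For any arbitrary event between events 1 and n, let's call this event $E_i$, then:
$$0 \le P(E_i) \le 1$$
2) For the sample space $S$:
$$P(S) = 1$$
3) For all of the possible events $E_1, E_2, \ldots, E_n$ in a situation:
$$P(E_1) + P(E_2) + \cdots + P(E_n) = P(S)$$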
These basic properties are often referred to as the Kolmogorov axioms, named after the
mathematician Andrey Kolmogorov. An axiom can be thought of as a necessary assumption. For
instance, when physicists develop new concepts in physics, they assume that gravity follows
certain properties. Thus, they have gravity axioms.
The Kolmogorov axioms are extremely important in probability and the development of new
ideas.
Are there any others? Not unless there is a possibility we have not considered. Since two bushels
are guaranteed to go out, the outcome must fall into one of the three categories listed.
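$P(0) = P(\text{neither bushel contaminated}) = \frac{18}{20} \cdot \frac{17}{19} \approx 0.805$
$P(2) = P(\text{both bushels contaminated}) = \frac{2}{20} \cdot \frac{1}{19} \approx 0.005$
$P(1)$: there are two possibilities; either the first is contaminated and the second is not, or
vice versa. We must consider both outcomes below:
o $P(\text{1st contaminated, 2nd not}) = \frac{2}{20} \cdot \frac{18}{19} \approx 0.095$
o $P(\text{1st not, 2nd contaminated}) = \frac{18}{20} \cdot \frac{2}{19} \approx 0.095$
These two possibilities give 9.5% + 9.5% = 19% of the sample space.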
(NOTE: Importantly, summing these three probabilities gives 1, as stated in the axioms!)
Needless to say, this was a lot of work; however, we can use the axioms to simplify the amount
of work we commit to ourselves.
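According to axioms 2 and 3, writing $P(k)$ for the probability that exactly $k$ contaminated
bushels are shipped:
$$P(0) + P(1) + P(2) = 1$$
Our earlier statement involved wanting to know the likelihood that at least one contaminated
bushel went out. That only involves $P(1)$ and $P(2)$! Solving for the sum of these two probabilities:
$$P(1) + P(2) = 1 - P(0)$$
That is,
$$P(1) + P(2) = 1 - 0.805$$
$$P(\text{at least } 1) = 0.195$$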
This is the same number we achieved taking the long route! We only had to find the probability
of shipping 0 bushels, which is a little bit of work as compared to a lot of work!
Given any number of events involving quantities, the probability of at least one in quantity is 1
minus the probability of 0 in quantity. That is:
$$P(\text{at least } 1) = 1 - P(0)$$
$$P(1) + P(2) + \cdots + P(n) = 1 - P(0)$$
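The "at least one" shortcut can be compared against the long route by listing every ordered way of shipping two of the 20 bushels. The sketch below assumes, as the example does, that exactly 2 of the 20 bushels are contaminated and that every pair is equally likely to be shipped; the labels "C" and "N" are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations

# 20 bushels: 2 contaminated ("C") and 18 not contaminated ("N")
bushels = ["C"] * 2 + ["N"] * 18

# Every ordered way to ship a first and then a second (different) bushel
pairs = list(permutations(range(len(bushels)), 2))

# Tally how many ordered pairs contain 0, 1, or 2 contaminated bushels
tally = {0: 0, 1: 0, 2: 0}
for first, second in pairs:
    k = (bushels[first] == "C") + (bushels[second] == "C")
    tally[k] += 1

probs = {k: Fraction(v, len(pairs)) for k, v in tally.items()}

long_route = probs[1] + probs[2]      # add up P(1) and P(2)
short_route = 1 - probs[0]            # complement of P(0)

for k in (0, 1, 2):
    print(f"P({k}) = {probs[k]} = {float(probs[k]):.4f}")
print("at least one:", float(short_route), "| routes agree:", long_route == short_route)
```

Using exact fractions avoids rounding, so the two routes can be compared with a plain equality check.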
1. In 2009 the H1N1 virus, commonly referred to as the “Swine Flu,” reportedly infected an
estimated 10% of New Yorkers (SOURCE:
http://www.reuters.com/article/2009/08/30/us-flu-newyork-idUSTRE57T26Y20090830).
2. Many fire stations handle emergency calls for medical assistance as well as calls
requesting firefighting equipment. A particular station says that the probability that an
incoming call is for medical assistance is .85. This can be expressed as P(call is for
medical assistance) = .85.
a. Give a relative frequency interpretation of the given probability. That is, interpret
what the number .85 means based on the definition of probability.
b. What is the probability that a call is not for medical assistance?
c. Assuming that successive calls are independent of one another (i.e., knowing that
one call is for medical assistance doesn't influence our assessment of the
probability that the next call will be for medical assistance), calculate the
probability that both of the two successive calls will be for medical assistance.
d. Still assuming independence, calculate the probability that for two successive
calls, the first is for medical assistance and the second is not for medical
assistance.
e. Still assuming independence, calculate the probability that exactly one of the next
two calls will be for medical assistance. (Hint: There are two different
possibilities that you should consider. The one call for medical assistance might
be the first call, or it might be the second call.)
f. Do you think it is reasonable to assume that the requests made in successive calls
are independent? Explain.
3. "N.Y. Lottery Numbers Come Up 9-1-1 on 9/11" was the headline of an article that
appeared in the San Francisco Chronicle (September 13, 2002). More than 5600 people
had selected the sequence 9-1-1 on that date, many more than is typical for that sequence.
A professor at the University of Buffalo is quoted as saying, "I'm a bit surprised, but I
wouldn't characterize it as bizarre. It's randomness. Every number has the same chance of
coming up. People tend to read into these things. I'm sure that whatever numbers come up
tonight, they will have some special meaning to someone, somewhere." The New York
state lottery uses balls numbered 0-9 circulating in 3 separate bins. To select the winning
4. On August 8, 2011, the Dow Jones Industrial fell 635 points (5.5%) to 10,810 points,
representing the 6th worst point loss ever experienced. On that day, President Obama‟s
approval ratings also suffered tremendously; only 22% of the nation‟s voters “Strongly
Approve” of how he is performing in the presidential role (SOURCE:
http://www.rasmussenreports.com/public_content/politics/obama_administration/daily_pr
esidential_tracking_poll).
5. The following case study is reported in the article "Parking Tickets and Missing
Women," which appears in an early edition of the book Statistics: A Guide to the
Unknown. In a Swedish trial on a charge of overtime parking, a police officer testified
that he had noted the position of the two air valves on the tires of a parked car: To the
closest hour, one valve was at the 1 o' clock position and the other was at the 6 o' clock
position. After the allowable time for parking in that zone had passed, the policeman
returned, noted the valves were in the same position, and ticketed the car. The owner of
the car claimed that he had left the parking place in time and had returned later. The
valves just happened by chance to be in the same positions. An "expert" witness
computed the probability of this occurring as (1/12)(1/12) = 1/144.
a. What reasoning did the expert use to arrive at the probability of 1/144?
b. Can you spot the error(s) in the reasoning that leads to the stated probability of
1/144? What effect does this error(s) have on the probability of occurrence? Do
you think that 1/144 is larger or smaller than the correct probability of occurrence?
6. Jeanie is a bit forgetful, and if she doesn't make a "to do" list, the probability that she
forgets something she is supposed to do is .1. Tomorrow she intends to run three errands,
and she fails to write them on her list.
a. What is the probability that Jeanie forgets all three errands? What assumptions did
you make to calculate this probability?
b. What is the probability that Jeanie remembers at least one of the three errands?
c. What is the probability that Jeanie remembers the first errand but not the second
or third?
7. One of the myths most commonly believed by students on multiple choice exams is that,
as long as they always use letter „C‟ as their guess, they increase their chances of
Suppose that a multiple-choice quiz has two problems on it and that the student has no
idea how to answer them, so he guesses. Each problem has letters A-E corresponding to
the answers to choose from. Using counting techniques discussed in class, find and
explain how you found the following: (Video Solution)
Imagine that you toss a fair, two-sided quarter. You let it land and take a look at the side facing
up. What is the probability that you see heads or tails (assume the toss will be ignored if it
happens to land on its side)?
You can probably see fairly quickly that the outcome desired is guaranteed; when a coin is
tossed, it will result in one of two outcomes: heads or tails. If someone in a bet were to tell you
that he will win if the toss of a coin results in heads or tails, then you could probably tell him,
“Congratulations!”
Adding to our intuition (no pun intended), we will write the situation in the form of a
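mathematical probability. The sample space will have two outcomes: heads and tails.
Then,
$$P(\text{heads or tails}) = 1$$
$$P(\text{heads}) = \frac{1}{2}, \quad P(\text{tails}) = \frac{1}{2}$$
$$P(\text{heads or tails}) = P(\text{heads}) + P(\text{tails})$$
Letting $H$ = heads and $T$ = tails,
$$P(H \text{ or } T) = P(H) + P(T) = \frac{1}{2} + \frac{1}{2} = 1$$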
He quickly realizes that this probability is invalid because a probability cannot be greater than 1,
or 100%. What happened?
SOLUTION:
We first organize his data into a table to help us better see what is happening:
The probabilities outside of the boxes represent totals for mental health claims and for high risk
claims. The probability in the 1-1 entry of the table represents the probability of being low risk
and filing a physical health claim. Since we know that this data represents all of those who have
filed claims, we know that 100% have filed one type or the other. Additionally, each employee
considered falls into one of the two risk categories. So we fill in more details:
We can also proceed to fill in the boxes in the table, since each person falls into exactly one of
the four positions (low physical, low mental, high physical, high mental):
Now, the analyst added the second row total to the second column total, as highlighted in the
table below:
The problem seems to be that the .40 and the .70 both include the probability of High Risk and
Mental Claim! In other words, it is being counted twice, hence the end probability that is greater
than 1.
Instead, let's add up the individual box probabilities as illustrated in the table below:
While this does not seem like a huge amount of work, suppose that we instead had three types of
claims and 3 different statuses. It would probably be convenient to have some sort of
mathematical approach to the solution.
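We are free to add the two probabilities, $P(\text{High Risk})$ and $P(\text{Mental})$, but we must be sure to take out the .30
one time, so that it is single-counted and not double-counted:
$$P(\text{High Risk or Mental}) = P(\text{High Risk}) + P(\text{Mental}) - P(\text{High Risk and Mental}) = 1.10 - .30 = .80$$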
Regardless of the context/application of the probability, this issue can be resolved as shown.
$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$
Typically, $\cup$ is used (called a union) to replace the word “or”, making the above equation
$$P(A \cup B) = P(A) + P(B) - P(A \text{ and } B)$$
At the beginning of this section, we addressed a coin-tossing problem that involved the
summation of the probability of heads and the probability of tails. Let's see why we could get
away with not subtracting away the double-count. We use the “Or” probability set-up, with
$H$ = heads and $T$ = tails:
$$P(H \cup T) = P(H) + P(T) - P(H \text{ and } T)$$
We already know the first two probabilities on the right-hand side, but what is the third
probability value? Let's analyze its meaning:
$$P(H \text{ and } T) = \text{the probability of getting both heads and tails}$$
Of course, it is impossible to get both heads and tails in one toss of a coin! Any impossible
outcome has a probability of 0%. That is:
$$P(H \text{ and } T) = 0$$
So,
$$P(H \cup T) = P(H) + P(T) - P(H \text{ and } T) = \frac{1}{2} + \frac{1}{2} - 0 = 1$$
We simply “lucked-out” when this problem worked-out according to our intuition. In general,
you need only to remember the “Or” probability formula for the reasons given to solve any
problem involving the occurrence of one outcome or another.
We know that the number of husbands voting democrat is . This means that the
number of husbands voting Republican is . Additionally, we conclude that the
number of couples where the husband votes Republican and the wife votes Democrat is
. We fill this information in:
We convert the totals into percentages by dividing each cell entry by the total number of couples,
160:
Let
So,
( ) ( ) ( ) ( )
At this point you might be wondering why we don‟t simply draw out the table and ignore the
mathematical formulas. When possible, tables are extremely useful, but they might not always be
available. Consider the following example.
SOLUTION: This is the probability that one or both missiles hit the target.
We only have one probability, so filling out a table would not be possible.
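Let $M_1$ = the first missile hits the target and $M_2$ = the second missile hits the target.
We want to know
$$P(M_1 \cup M_2) = P(M_1) + P(M_2) - P(M_1 \text{ and } M_2)$$
We already know the first two probabilities on the right-hand side (.80 each), but we are not given
information on $P(M_1 \text{ and } M_2)$. We can fairly assume that the outcome of one missile has no (or
very minimal) impact on the outcome of another missile, and so we assume the events are
independent. This allows us to write:
$$P(M_1 \text{ and } M_2) = P(M_1) \cdot P(M_2) = (.80)(.80) = .64$$
And so,
$$P(M_1 \cup M_2) = .80 + .80 - .64 = .96$$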
We conclude that there is a 96% chance that the enemy jet is eliminated.
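The “Or” rule takes only a few lines to express as a function. The sketch below applies it to the missile example and to the coin toss from the start of the section; the function name `p_or` is an assumption made for illustration, and the missile calculation relies on the same independence assumption made in the solution.

```python
import math

def p_or(p_a, p_b, p_a_and_b):
    """P(A or B): add the two probabilities, then remove the double-counted overlap."""
    return p_a + p_b - p_a_and_b

# Missile example: each missile hits with probability .80, assumed independent.
p_hit = 0.80
p_both_hit = p_hit * p_hit
p_at_least_one_hit = p_or(p_hit, p_hit, p_both_hit)
print(f"P(one or both missiles hit) = {p_at_least_one_hit:.2f}")

# Cross-check with the complement: 1 minus the probability that both miss.
print("matches 1 - P(both miss):",
      math.isclose(p_at_least_one_hit, 1 - (1 - p_hit) ** 2))

# Coin toss: heads and tails cannot both occur, so the overlap is 0.
print(f"P(heads or tails) = {p_or(0.5, 0.5, 0.0):.2f}")
```

The cross-check works because “one or both hit” is the opposite of “both miss,” so two different routes lead to the same 96%.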
2. A researcher conducts a study on a total of 600 cats to determine whether or not they tend
to be adaptive to danger and whether or not their time to respond to those dangers is fast
enough to avoid harm. The animals were exposed to non-harmful stimuli to assist in
answering the researcher‟s questions. In his report he details that, “207 non-adaptive cats
were studied and, of them, 180 were found to have response times that were simply not
fast enough. By comparison, a total of 300 cats were both adaptive and had response
times that were fast enough.” How likely is it that a cat is adaptive to environmental
physical dangers or has a response time that is fast enough? (Video Solution)
3. In the March 3, 2011 episode of the Dr. Oz Show entitled “Dangerous Doctors: Is Your
MD Hazardous to Your Health?” Dr. Oz mentioned that 20% of the time doctors order
scans to protect themselves from a lawsuit. Dr. Oz also said, “Up to 1/3 of all tests and
treatments are entirely unnecessary.” (Video Solution)
a. Two patients are given orders for scans from a particular doctor. What is the
probability that one patient or the other was given scans to protect the doctor
against a lawsuit?
b. One patient is given orders for two different tests/treatments. What is the
probability that one or both of them was/were unnecessary?
c. A patient is prescribed a scan and a blood test. What is the probability that an
unnecessary prescription was made, through the patient‟s eyes?
4. In all of his Fall 2010 classes, Milos discovered that 44% of his students earned a „B‟ or
better on their homework average. He also discovered that 50% of his students had a „B‟
or better homework average or a „B‟ or better overall grade in the class (SOURCE:
Milos‟ Fall 2010 Grade Spreadsheet). If 30% of all his students received a „B‟ or better
homework average and a „B‟ or better class grade, what percentage of his students earned
a „B‟ or better in the class? (Video Solution)
5. In all of his Fall 2010 classes, Milos discovered that the percentage of all students that
earned a „C‟ or better homework average, 87% of these students earned a „C‟ or better
final class grade. 70% of all students in his classes earned a „C‟ or better homework
average or earned a „C‟ or better final class grade (SOURCE: Milos‟ Fall 2010 Grade
Spreadsheet), while only 49% earned a „C‟ or better on homework and as a final class
In many cases, a probability depends on what we already know. For instance, would we believe
that the likelihood of a car accident changes, provided that the roads are slick from snow? We
would probably agree that the likelihood increases if we already know the road conditions.
Suppose a fair, two-sided coin is tossed. You are told that the outcome is not a head. What is the
likelihood that the outcome is tails?
The answer is probably obvious… if you know the outcome was not heads, and the only two
possibilities are heads and tails, then there is a 100% chance the outcome is tails.
Further, to indicate that the outcome is not one of the above, we often put a bar on top of the
event name:
$\bar{H}$ = the outcome is not heads
$\bar{T}$ = the outcome is not tails
Then,
$$P(H) = \frac{1}{2}$$
However, given that we know the outcome was not tails, the probability of heads jumped to 1.
We might write:
$$P(H \text{ given } \bar{T}) = 1$$
Instead of using the word “given” we often use a vertical line (called a “pipe”), |. That is,
Conditional Probability
$$P(A \mid B)$$
is read “the probability of $A$ given $B$” and implies that the likelihood of $A$ may be different,
knowing that $B$ already took place.
Example 1: Due to wars at sea, shipwrecks, and other such disasters, there are roughly
3,000,000 sunken vessels in all of the seas in the world! Suppose an area of the ocean
is mapped out due to the historic ships that have wrecked in that area. There is speculation that,
of the estimated 20 ships in that region, 11 are original pirate ships. Given that a pirate ship is the
first of the 20 recovered, what is the probability that the next one found will also be a pirate ship?
SOLUTION:
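We would like to find the probability that a pirate ship is found, given that one pirate ship has
already been removed. If one ship is removed, there are 19 ships left. Since the ship removed
was a pirate ship, there are only 10 pirate ships remaining. That is,
$$P(\text{pirate ship} \mid \text{pirate ship already removed}) = \frac{10}{19} \approx 0.526$$
This is not the same as $P(\text{pirate ship})$.
Why?
This probability has no condition placed on it. It assumes the very basic information: 20 ships, 11
pirate ships. So,
$$P(\text{pirate ship}) = \frac{11}{20} = 0.55$$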
The conditional probability, in this case, is different than the unconditional probability.
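The difference between the two probabilities can be seen by listing every order in which two of the 20 ships might be recovered and then keeping only the cases that satisfy the condition. This is a sketch under the example's assumption that each ship is equally likely to be found next; the labels and variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations

# 20 ships in the region: 11 pirate ships and 9 others
ships = ["pirate"] * 11 + ["other"] * 9

# Every ordered way to recover a first ship and then a second, different ship
ordered_pairs = list(permutations(range(len(ships)), 2))

# Condition: the first ship recovered is a pirate ship
first_is_pirate = [(i, j) for i, j in ordered_pairs if ships[i] == "pirate"]
both_are_pirate = [(i, j) for i, j in first_is_pirate if ships[j] == "pirate"]

p_second_given_first = Fraction(len(both_are_pirate), len(first_is_pirate))
p_unconditional = Fraction(ships.count("pirate"), len(ships))

print("P(pirate | pirate already removed) =", p_second_given_first)
print("P(pirate)                          =", p_unconditional)
```

Restricting the list to the cases where the condition holds is exactly what “given” means: the denominator shrinks to the outcomes that are still possible.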
SOLUTION:
a) Dependent; rain likely increases the likelihood of accidents
b) Independent; these events probably don't have any impact on one another
c) Dependent; Microsoft is part of the Dow Jones Industrial and so there is a strong
relationship between the two
d) Independent; we see that the likelihood of $A$ does not change given that $B$ has occurred;
it is still .75
e) Dependent; the likelihood of $A$ does change given that $B$ has occurred; it drops to .3
f) If the product of the two given events does equal the probability of and , then the
events are independent, as this would mean that ( ) is .75, which is the same as
( ). We see that , and so we conclude that the events are independent.
Example 3: An aircraft radar system detects 30 aircraft in a 100-mile radius. Of these, 18 are
ally planes, 6 are cargo planes, and 6 are enemy planes. Given that a plane approaching the radar
is ruled out as being an enemy plane, what is the probability that it is a cargo plane?
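We want to know,
$$P(\text{cargo} \mid \overline{\text{enemy}})$$
Since it is not an enemy plane, it must be one of the remaining 24 aircraft. Of those, 6 are cargo
planes, so
$$P(\text{cargo} \mid \overline{\text{enemy}}) = \frac{6}{24} = 0.25$$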
SOLUTION: In this situation, the decision of C2 is dependent (conditional) upon the decision
of C1. Consider a table in which C2‟s choices will reflect the decision of C1.
                 Company 2 Choices When C1 Agrees
                 Y  Y  Y  Y  Y  Y  Y  Y  Y  Y
Company 1    Y
Choices      Y
             Y
             Y
             Y
             Y
             Y
             Y
             N
             N
$$P(C1 \text{ and } C2) = \frac{80}{100} = 0.80$$
The difference is that C2‟s decisions are all to agree, provided that C1 has agreed. If C1 does not
agree, then we‟re not really sure how C2 will act, but we don‟t really care, since the probability
we are in search of is when both companies agree!
Here we have:
$$P(C1 \text{ and } C2) = P(C1) \cdot P(C2 \mid C1) = 0.80 \cdot 1 = 0.80$$
If you look back at the reasoning here, you‟ll notice that we have bolded the word “dependent.”
In previous sections, we didn‟t have to worry about dependency, since we assumed that the
choices of C1 and C2 were independent, that is, one outcome did not affect the other, and vice
versa.
How do we know whether events are dependent or independent? Often times this is based upon
some knowledge of the situation or, perhaps, our intuition. Let‟s set up the important ideas here
and then we‟ll look at a few examples of dependence versus independence.
$$P(A \text{ and } B) = P(A) \cdot P(B \mid A)$$
$$P(A \cap B) = P(A) \cdot P(B \mid A)$$
where $\cap$ is a symbol to represent the word “and”. We use this in mathematics often.
$$P(B \cap A) = P(B) \cdot P(A \mid B)$$
NOTE: $A$ and $B$ are generic names and thus can be attached to an event in an arbitrary order.
Given two events, $A$ and $B$, if $P(B \mid A) = P(B)$, then $B$ does not depend on $A$, and so the
dependence formula reduces to:
$$P(A \text{ and } B) = P(A) \cdot P(B)$$
$$P(A \cap B) = P(A) \cdot P(B)$$
This result is important, because it allows you to only have to remember the “and” rule for
dependent events. If the next event does not depend on the prior event, then the end probability is
just a product of the two individual probabilities.
Though the ideas presented above might at first seem confusing, you'll notice that the idea of joint probabilities has not changed. The only new caution is to take care to acknowledge whether the events are independent or not. We'll consider a few more examples below.
Example 5: The probability that a resistor and capacitor both fail in a portable electronic
device in the fifth year of use is 0.95%. The probability that the resistor fails is 1.22% and the
probability that the capacitor fails is 1%. Are the events independent? If they are not
independent, what is the probability that the capacitor fails given that the resistor fails?
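SOLUTION:
Let $R$ = the resistor fails and $C$ = the capacitor fails.
If the two events are independent, then the product of unconditional probabilities should give us the provided joint probability.
We have that,
$P(R) = 0.0122$
$P(C) = 0.01$
$P(R \cap C) = 0.0095$
Since $P(R) \cdot P(C) = (0.0122)(0.01) = 0.000122 \neq 0.0095 = P(R \cap C)$, the events are not independent.
Thus,
$P(R \cap C) = P(R) \cdot P(C \mid R)$
$0.0095 = 0.0122 \cdot P(C \mid R)$
Dividing gives,
$P(C \mid R) = \frac{0.0095}{0.0122} \approx 0.779$
Thus, there is a 77.9% chance the capacitor fails if the resistor fails. The resistor is an integral part of this device: the likelihood of the capacitor failing increases if the resistor fails first.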
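The independence check and the conditional probability in Example 5 can be sketched in a few lines of Python. This is only an illustration: the function names `is_independent` and `conditional` are made up for this sketch and are not part of the text.

```python
import math

def is_independent(p_a, p_b, p_a_and_b):
    """Return True when P(A and B) equals P(A) * P(B) (up to rounding)."""
    return math.isclose(p_a * p_b, p_a_and_b, rel_tol=1e-9, abs_tol=1e-12)

def conditional(p_a_and_b, p_a):
    """P(B | A) = P(A and B) / P(A)."""
    return p_a_and_b / p_a

# Values from Example 5 (percentages written as decimals).
p_resistor = 0.0122
p_capacitor = 0.01
p_both = 0.0095

print(is_independent(p_resistor, p_capacitor, p_both))   # False
print(round(conditional(p_both, p_resistor), 3))         # 0.779
```

The same two functions apply to Example 6 by passing 0.05 as the joint probability and 0.08 as the probability of the condition.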
Example 6: In a demographic study of a small town, it is found that 5% of the adult residents are
unemployed and living at or below poverty level. A total of 8% are unemployed. What is the
probability that a person in this town is living at or below the poverty level, given that they are
unemployed? Interpret the meaning of your answer.
SOLUTION:
Letting $B$ = a person lives at or below the poverty level and $U$ = a person is unemployed, we would like to know $P(B \mid U)$:
$P(B \mid U) = \frac{P(B \cap U)}{P(U)} = \frac{0.05}{0.08} = 0.625$
This says that, if a person is unemployed, there is a 62.5% chance they are living at or below the
poverty level. We would probably expect this figure to be quite high.
Example 7: As part of a narcotics checkpoint, officers randomly search freight trucks for
shipments of illegal drugs. The officers search a small number of crates in the trucks that are
chosen for random inspection. Suppose that, unbeknownst to the officers, there are two trucks
ahead, one of which contains one crate with illegal drugs. This truck has a total of 8 crates, while
the truck without drugs has a total of 5 crates. One of the two trucks will be randomly chosen.
What is the probability that the officers find the drugs?
SOLUTION: At first, it is tempting to say that the probability is $\frac{1}{13}$; however, this is not accurate. The probability that the officers find the crate with drugs is dependent on them choosing the correct truck first!
Let $T$ = the correct truck is chosen and $C$ = the correct crate is chosen.
Two things must happen: they must choose the correct truck and they must choose the correct crate. Randomly choosing one of the two trucks is equiprobable, so $P(T) = \frac{1}{2}$. If the correct truck is chosen, then the probability of choosing the correct crate is $\frac{1}{8}$, that is, $P(C \mid T) = \frac{1}{8}$.
$P(T \cap C) = P(T) \cdot P(C \mid T) = \frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16} = 0.0625$
Why is it not valid to say 1/13? It might appear that probability is simply pulling a “fast one” on our intuition.
A simple way to think about it is as follows: there is not just one random process here. If all the crates were in the same truck, there would indeed be a 1/13 chance that we'd get the right crate. However, there are two random processes here. If you don't choose the correct truck, then choosing the correct crate is impossible. The likelihood of the second random process leading to the correct crate is indeed deeply affected by the outcome of the first random process!
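To see the two random processes at work, here is a minimal Python sketch of Example 7. It assumes, as the example does, that one truck is picked at random and then one crate in that truck is searched. The function name `simulate` and the labels "drugs" and "clean" are invented for this illustration.

```python
from fractions import Fraction
import random

# Exact answer from the multiplication rule for dependent events.
p_truck = Fraction(1, 2)               # P(correct truck)
p_crate_given_truck = Fraction(1, 8)   # P(correct crate | correct truck)
p_find = p_truck * p_crate_given_truck
print(p_find)                          # 1/16

def simulate(trials=100_000, seed=1):
    """Estimate the same probability by repeating the two-stage choice."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truck = rng.choice(["drugs", "clean"])   # first random process
        # Second random process only matters if the right truck was picked:
        # crate 0 of the 8 crates stands for the one crate with drugs.
        if truck == "drugs" and rng.randrange(8) == 0:
            hits += 1
    return hits / trials

print(simulate())   # should land near 0.0625, not near 1/13 (about 0.077)
```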
Example 8: Reconsider Example 7. Let's say that the second truck had two crates with shipments of drugs. As before, one of the two trucks will be randomly chosen. What is the probability that the officers find the drugs?
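SOLUTION: The officers find drugs if either
the truck with 8 crates ($T_1$) is selected and the one correct crate is chosen OR
the truck with 5 crates ($T_2$) is selected and one of the two correct crates is chosen.
We will first create a small tree diagram showing the possible outcomes.
[Tree diagram: the first branch chooses the truck, the second branch chooses the crate.]
The beauty of this diagram is that it displays the conditional probabilities on the right “stems” of the tree for each initial choosing of the truck.
Truck 1: $\frac{1}{2} \cdot \frac{1}{8} = \frac{1}{16} = 0.0625$
Truck 2: $\frac{1}{2} \cdot \frac{2}{5} = \frac{1}{5} = 0.2$
Since these are distinct outcomes and cannot both occur (there is no overlap in the events), it is okay to add them: $0.0625 + 0.2 = 0.2625$.
Thus, there is about a 26% chance that drugs are found between the two trucks. Again, note that the probability is not simply $\frac{3}{13}$, as our intuition might falsely lead us to believe.
Let $D_1$ = drugs are found in T1 and $D_2$ = drugs are found in T2.
Since only one truck will be chosen, the probability of finding drugs in T1 and T2 is 0.
$P(D_1 \cup D_2) = P(D_1) + P(D_2) - P(D_1 \cap D_2)$
$P(D_1 \cup D_2) = 0.0625 + 0.2 - 0 = 0.2625$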
1. A deck of standard playing cards has 52 cards. There are four suits: clubs, diamonds, hearts, and spades. There are two colors of cards: red and black. Diamonds and hearts are red, and clubs and spades are black. The cards in each suit are labeled A (Ace), 2-10, J (Jack), Q (Queen), and K (King). To better visualize, consider the illustration below:
Suppose you are given various conditions and that you must determine the probability of the specified draw on the next card. Use the card descriptions above to find the probability that: (Video Solution)
2. An auto insurance company finds that there is an 18% chance that a teenager gets into a
car accident between ages 16 and 19. There is a 34% chance that a teenager gets a traffic
ticket during this same age range. They find that the chance of getting into a car accident
and getting a traffic ticket (not necessarily because of the accident) is 10%. (Video
Solution)
a. Based on the probabilities provided, are the two events independent? Perform a
calculation to justify your answer.
b. Given that a teenager gets into an accident, what is the probability that he gets a
traffic ticket?
c. Why did the probability change in this way, as compared to the unconditional
probability of getting a traffic ticket?
d. Given that a teenager gets a traffic ticket, what is the probability that he gets into
an accident?
e. Explain, in practical terms, what your answer in d) means.
( )
( )
( )
( )
( )
( )
a. ( )
b. ( )
c. ( )
4. Gregor Mendel was a monk who, in 1865, suggested a theory of inheritance based on the
science of genetics. He identified heterozygous individuals for flower color that had two
alleles (one r = recessive white color allele and one R = dominant red color allele). When
these individuals were mated, ¾ of the offspring were observed to have red flowers and
¼ had white flowers. The table summarizes this mating; each parent gives one of its
alleles to form the gene of the offspring.
                Parent 2
                r     R
Parent 1   r    rr    rR
           R    Rr    RR
5. There are 5 candidates for 2 town council positions. Three of them are for the removal of
a landfill just outside of the city limits. The same candidate cannot fill both seats. (Video
Solution)
a. What is the probability that one randomly chosen candidate in the group is for the
removal of the landfill?
b. Given that one of the positions is filled with a candidate in favor of the landfill
removal, what is the probability that the second candidate chosen is also in favor?
c. What is the probability that two candidates in favor of the landfill removal are
chosen?
d. What is the probability that only one seat is filled by a candidate in favor of the
landfill removal?
e. What is the probability that at least one seat is filled by a candidate in favor of the
landfill removal?
Recall from Section 3.2 the problem faced by a corn growing business: the FDA determines that
two of the 20 bushels are potentially contaminated with E. coli. Two bushels had been shipped
out and the question was: what is the probability that both bushels that were shipped to the local
grocer were uncontaminated?
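$P(\text{both uncontaminated}) = P(\text{1st uncontaminated}) \cdot P(\text{2nd uncontaminated} \mid \text{1st uncontaminated}) = \frac{18}{20} \cdot \frac{17}{19}$
Due to the fact that one of the uncontaminated bushels was removed from the “pool”, there was now only a 17/19 chance that the second uncontaminated bushel would be pulled. In short, we wrote:
$\frac{18 \cdot 17}{20 \cdot 19}$
We notice that the numerator and denominator both have a product of two sequential numbers. Had they shipped, say, four bushels, the probability that all four were uncontaminated would be:
$\frac{18 \cdot 17 \cdot 16 \cdot 15}{20 \cdot 19 \cdot 18 \cdot 17}$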
How painful, though, would it be to have to multiply eight or nine probabilities of this nature
together? You could certainly do it, but you might think, “It sure would be nice to take advantage
of this pattern!” Well, we're in luck!
The factorial of a natural number $n$ is the product $n! = n(n-1)(n-2) \cdots (2)(1)$.
Example 1: Find .
Here‟s a little trick: write the factorial out, then divide out the factors that are not needed. For us,
this means:
Before we push this too far and get ourselves into a trap, let's consider a different example with a smaller sample space.
Suppose that there are only 3 bushels of corn and that only one is contaminated with E. coli. Again, let's say that two are shipped out. Then,
$P(\text{both uncontaminated}) = \frac{2}{3} \cdot \frac{1}{2} = \frac{2 \cdot 1}{3 \cdot 2} = \frac{1}{3}$
If you recall the tabular approach to thinking about this, we might show the possibilities for
uncontaminated bushels, U1 and U2, and the way in which they can appear:
                   2nd Bushel
                   U1     U2
1st Bushel   U1
             U2
We know that the pairs (U1, U1) and (U2, U2) for the 1st and 2nd bushels cannot be possible,
since that particular bushel is removed from the population. So, we denote that in the table by
blacking-out those cells:
Perfect! So we see the remaining two possibilities, right? Well, actually, is there a difference between (U2, U1) and (U1, U2)? Not unless those two bushels are actually different from one another! So, blacking out either one of these pairs leaves (X marks a blacked-out cell):
                   2nd Bushel
                   U1          U2
1st Bushel   U1    X           X
             U2    (U2, U1)    X
One possibility!
You might be wondering why we're bothering with this if we've already found the probability. This is a good thing to wonder.
Recall that a probability is the number of ways an event can happen divided by the total number
of outcomes. To be consistent with this definition, we really should be putting 1 in the
numerator. Does that mean we miscomputed the probability? Not in this particular example, but
it can happen.
To make our denominator consistent, let's look at the total number of possibilities for selecting bushels, adding in the contaminated bushel, C:
                   2nd Bushel
                   U1     U2     C
1st Bushel   U1
             U2
             C
Again, it is not possible to select the same bushel twice, so we black out the diagonal (X marks a blacked-out cell):
                   2nd Bushel
                   U1     U2     C
1st Bushel   U1    X
             U2           X
             C                   X
Are we done? Not unless we feel that (U2, U1) is different from (U1, U2). We notice that the three cells to the right of our blacked-out diagonal are duplicates of those to the left. Thus we can cross them out, as well:
                   2nd Bushel
                   U1          U2         C
1st Bushel   U1    X           X          X
             U2    (U2, U1)    X          X
             C     (C, U1)     (C, U2)    X
This leaves us with three possibilities. So, our probability should be:
$P(\text{both uncontaminated}) = \frac{\text{ways to select 2 uncontaminated bushels}}{\text{ways to select any 2 bushels}} = \frac{1}{3}$
Since we get the same answer, one might think that it must not matter which approach we take. Many times, it doesn't; however, “many” is not satisfying enough, since this leaves us prone to mistakes under different circumstances.
Let's analyze the full situation two different ways. We found that if we don't eliminate order differences, then we can write the probability as:
$\frac{2 \cdot 1}{3 \cdot 2}$
If we did (correctly) eliminate order differences, notice that we cut the number of possibilities in half, that is, divided by 2. You'll notice that 2 is the same thing as $2 \cdot 1 = 2!$. So, let's divide out the number of duplicates from top and bottom:
$\frac{(2 \cdot 1)/2!}{(3 \cdot 2)/2!}$
And, writing the products as factorials,
$\frac{\frac{2!}{0! \, 2!}}{\frac{3!}{1! \, 2!}} = \frac{1}{3}$
This does look rather complicated, but remember that it follows from some fairly simple things that we have built up on. Also notice that both the top fraction and the bottom fraction have $2!$ in the denominator. Ah, yes! So that's why the order-not-eliminated and order-eliminated answers are the same:
$\underbrace{\frac{2 \cdot 1}{3 \cdot 2}}_{\text{order not eliminated}} = \underbrace{\frac{(2 \cdot 1)/2!}{(3 \cdot 2)/2!}}_{\text{order eliminated}}$
While this works out beautifully in this example, it is not always true, and so we must take care to observe whether order difference is important. We will see examples later where this difference will come into play, but those situations are a bit more advanced.
Let's simplify this horrid notation a bit. Suppose that there are a total of $n$ items and $r$ of those are to be drawn.
If order is not to be eliminated (in cases where order is important), then the number of ways to select $r$ things from the given $n$ is called a permutation and is denoted:
${}_nP_r = \frac{n!}{(n-r)!}$
NOTE: $(n-r)! \neq n! - r!$, that is, factorial is not distributable!! Subtract first, then use factorial.
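For our numerator, we had selected 2 uncontaminated bushels from a total of 18 uncontaminated bushels. According to our new notation, this can be written as:
${}_{18}P_2 = \frac{18!}{(18-2)!} = \frac{18!}{16!} = 18 \cdot 17 = 306$
For our denominator, we had selected 2 (general) bushels from a total of 20 (general) bushels, since we want to know the total number of ways 2 objects can come out of 20:
${}_{20}P_2 = \frac{20!}{(20-2)!} = \frac{20!}{18!} = 20 \cdot 19 = 380$
In simplified notation,
$P(\text{both uncontaminated}) = \frac{{}_{18}P_2}{{}_{20}P_2} = \frac{306}{380} \approx 0.805$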
To evaluate a permutation,
1. first enter in your home screen
TIP: Sometimes the value of the numerator or denominator is so large that the computer
throws an overflow error. It is advisable to enter the entire probability in, numerator and
denominator to avoid this potential problem.
If order is to be eliminated (in cases where order is not important), then the number of ways to select $r$ things from the given $n$ is called a combination and is denoted:
${}_nC_r = \frac{n!}{r! \, (n-r)!}$
NOTE: $(n-r)! \neq n! - r!$, that is, factorial is not distributable!! Subtract first, then use factorial. Additionally, the factorial of a product is not the product of factorials, that is, $(a \cdot b)! \neq a! \cdot b!$.
For our numerator, we had selected 2 uncontaminated bushels from a total of 18 uncontaminated bushels, eliminating the number of repeats, which was 2, or $2!$. According to our new notation, this can be written as:
${}_{18}C_2 = \frac{18!}{2! \, (18-2)!} = \frac{18 \cdot 17}{2!} = 153$
And this is precisely what we have written for the numerator!
For our denominator, we had selected 2 (general) bushels from a total of 20 (general) bushels, since we want to know the total number of ways 2 objects can come out of 20, order aside.
${}_{20}C_2 = \frac{20!}{2! \, (20-2)!} = \frac{20 \cdot 19}{2!} = 190$
In simplified notation,
$P(\text{both uncontaminated}) = \frac{{}_{18}C_2}{{}_{20}C_2} = \frac{153}{190} \approx 0.805$
Follow the steps for finding permutations, but in Step 3, use 3: nCr instead.
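For readers working on a computer rather than a calculator, the same counts can be checked with a short Python sketch. It uses the standard-library functions `math.perm` and `math.comb` (available in Python 3.8 and later); the variable names are chosen only for this illustration.

```python
import math
from fractions import Fraction

# Corn example: 18 uncontaminated bushels out of 20, two shipped.
ordered = Fraction(math.perm(18, 2), math.perm(20, 2))    # order kept
unordered = Fraction(math.comb(18, 2), math.comb(20, 2))  # order removed

print(math.perm(18, 2), math.perm(20, 2))   # 306 380
print(math.comb(18, 2), math.comb(20, 2))   # 153 190
print(ordered == unordered)                 # True: the 2! cancels
print(round(float(ordered), 3))             # 0.805

# Four bushels shipped: the long product collapses to one ratio.
four = Fraction(math.perm(18, 4), math.perm(20, 4))
print(four)                                 # 12/19
```

Working with exact fractions also sidesteps the overflow issue mentioned in the TIP, since the whole ratio is simplified rather than each large count being evaluated as a decimal.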
This event occurs when the 2 cards drawn both come out of the 7 he has put in thus far. Since the order in which his two cards are drawn doesn't matter (as the prize is the same), we would like to know the value of
${}_7C_2 = 21$
The sample space is simply the total number of outcomes. Two cards will be drawn from the stack of 40, and since order doesn't matter, there are ${}_{40}C_2 = 780$ possible outcomes.
$P(\text{both cards are Cori's}) = \frac{{}_7C_2}{{}_{40}C_2} = \frac{21}{780} \approx 0.027$
There is about a 3% chance that both of the cards drawn are Cori's.
SOLUTION: The event is that the three criminals come from a group of five particular gang members. There are ${}_5C_3 = 10$ ways for this to happen.
The total number of three-criminal groups that can be formed out of the suspects is
This means,
There is only a 0.9% chance that the three criminals all come from the presumed gang. The
detective should consider more evidence to narrow down the search results before making
assumptions.
Example 4: A business creates a new system to keep track of client relations, such that information about the client and any particular orders placed can be accessed by a nonrepeating, four-character code of letters or digits. For instance, KA23 and AK23 are possible codes. Any code containing only letters will be reserved for large clients. How many such codes of non-repeating letters can they make available, and, assuming all such codes will eventually be used up, what percentage of the company's clients will be considered large clients?
SOLUTION: Since order matters and no letter may repeat, the number of all-letter codes is ${}_{26}P_4 = 26 \cdot 25 \cdot 24 \cdot 23 = 358{,}800$.
In order to know what percentage (or probability) of the total number of possible codes this represents, we need to compute the total number of codes that can be formed, where no letter or number is repeated, but where order does matter. This is precisely what permutations are for.
Since there are 26 letters and 10 digits, a total of 36 different “symbols” can be selected from. The number of permutations is
${}_{36}P_4 = 36 \cdot 35 \cdot 34 \cdot 33 = 1{,}413{,}720$
total different codes [1] without the same letters or numbers being repeated, but where order does matter.
$P(\text{all letters}) = \frac{358{,}800}{1{,}413{,}720} \approx 0.254$
We conclude that about 25% of all clients (the large clients) will have completely alphabetical codes.
[1] Notice that the increase in the number of possibilities after increasing the size of the sample space is not proportional to the increase amount. The growth is far faster than linear.
In this situation, we are allowing repeats. For the number of ways to form a 4-letter code, we have 26 possibilities for each position. That is 26 for the first, the second, the third, and the fourth. Crossing all of these possibilities gives:
$26 \cdot 26 \cdot 26 \cdot 26 = 26^4 = 456{,}976$
which we expect to be larger than in the previous example, since we are allowing repeats.
Similarly, the number of letter/number codes that are possible can be calculated by noting that, in general, each piece of the code has 36 possibilities. So,
$36^4 = 1{,}679{,}616$
The percentage/probability is
$P(\text{all letters}) = \frac{456{,}976}{1{,}679{,}616} \approx 0.272$
The percentage changes: about 27% of all codes will contain only letters.
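The two code-counting situations (no repeats versus repeats allowed) differ only in whether each position draws from a shrinking or a constant pool. A small Python sketch, with variable names invented for this illustration, makes the comparison concrete:

```python
import math

letters, symbols, length = 26, 36, 4   # 26 letters, 26 + 10 symbols, 4 positions

# No repeats: each position has one fewer choice than the last (permutation).
letters_no_repeat = math.perm(letters, length)
symbols_no_repeat = math.perm(symbols, length)

# Repeats allowed: every position has the full set of choices (a power).
letters_repeat = letters ** length
symbols_repeat = symbols ** length

print(letters_no_repeat, symbols_no_repeat)              # 358800 1413720
print(round(letters_no_repeat / symbols_no_repeat, 3))   # 0.254
print(letters_repeat, symbols_repeat)                    # 456976 1679616
print(round(letters_repeat / symbols_repeat, 3))         # 0.272
```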
determining some key pieces of information:
You might be wondering why we must divide by $r!$ to remove all repeats. This was probably somewhat obvious when working with two objects. Say there are 5 objects to select from. One is now gone, so for the second selection there are only 4. We proceed to cross out everything along and to the right of the diagonal, since those entries are either not possible or are repeats.
We have essentially multiplied the first five possibilities by the next number of possibilities, which is only 4 (this is accounted for by crossing out the diagonal, since this subtracts out five possibilities to give $5 \cdot 5 - 5 = 20$), and then divided that result by 2, since half of the table is a repeat. That is,
$\frac{5 \cdot 4}{2} = 10$
What happens when we select a third object? We extend the above table as a multiple of 3,
since there are three objects left. Each table represents a pairing with one of the three
remaining objects, as shown in the upper-left corner:
In the first table, we can cross out the first column (and first row, if it were there), since it is
not possible to select object 1 for a third time. In the second table, we can cross out the
second column/row and in the third table we can cross out the third column/row for the
same reason as table 1.
Also notice that the second column of table 1 and the last three rows of table are the same
(1, 2, 3), (1, 2, 4), and (1, 2, 5). For a similar reason, the third column of table 1 can be
crossed out, since it is a repeat of what we have in column 1 of table 3.
Nothing else in table 1 can be eliminated, since (1, 4, 5) cannot be found in either of the two
remaining tables (this is a unique characteristic of the bottom, right-most entry).
In table 2, we will try to eliminate any entries that can be found in table 3. These
eliminations will involve any entries that contain Object 3. We can do so with the (2, 1, 3)
entry and the third column:
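Now, notice that we have 10 white spots left. This happens to be exactly one-third of what we had after we tripled the table. That is,
$\underbrace{\frac{5 \cdot 4}{2}}_{\text{first two objects}} \cdot \underbrace{\frac{3}{3}}_{\text{third object}} = \frac{5 \cdot 4 \cdot 3}{3 \cdot 2 \cdot 1} = \frac{5!}{3! \, (5-3)!} = 10$
Selecting $r$ items allows this process to repeat, ad nauseam, any number of times.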
Mathematicians discovered that this tabular process could be reduced into the formula we
“ ” general case (where we
allow $r$ to be any value between 0 and the number of items we have to choose from), which
tends to be discussed in more theoretical mathematics courses such as Discrete
Mathematical Structures (our MAT227).
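The crossing-out argument can also be checked by brute force. The sketch below uses Python's standard `itertools` module to list every ordered selection of 3 objects from 5 and then groups the selections that contain the same objects. The names `objects` and `groups` are chosen only for this illustration.

```python
from itertools import combinations, permutations
from math import factorial

objects = [1, 2, 3, 4, 5]
r = 3

ordered = list(permutations(objects, r))     # (1, 2, 3) and (2, 1, 3) both appear
unordered = list(combinations(objects, r))   # only one of each group appears

# Group the ordered selections by the set of objects they contain.
groups = {}
for selection in ordered:
    groups.setdefault(frozenset(selection), []).append(selection)

print(len(ordered))                          # 60
print(len(unordered))                        # 10
print({len(g) for g in groups.values()})     # {6}: every group has 3! repeats
print(len(ordered) // factorial(r))          # 10
```

Every group has exactly $3! = 6$ members, which is why dividing the 60 ordered selections by $r!$ leaves the 10 white spots counted in the tables.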
1. If possible, give an imaginary (but realistic) scenario for each of the following. If not
possible, state why.
a.
b.
c.
d.
2. Your classmate was absent when permutations and combinations were covered. Explain when he should and when he should not use permutations and combinations. (Video Solution)
3. A police officer has been brought before the court on accusations of racial profiling.
This occurs when a person of a particular race has been pulled over or detained by
the police due to his race. The officer stopped 2 vehicles out of 10 that passed by
through a freeway tollbooth. Both of the suspects were Asian and there were a total
of 3 Asian drivers in the 10. (Video Solution)
a. In how many ways could 2 drivers have been selected from the 10?
b. In how many ways could 2 Asian drivers have been selected from the 3?
c. How likely is it that the 2 selected drivers would both have been Asian if the
stops were truly random?
4. In the United States, 20 out of the 50 states spend more than 50% of their state park
and recreation areas revenue on keeping the state park operable (SOURCE: 2012
U.S. Statistical Abstract). Suppose a survey of 10 states is to be conducted next year
to see if anything has changed. (Video Solution)
a. In how many ways can 10 states be selected for the survey?
b. In how many ways can 10 states be drawn so that all 10 are operating on
more than 50% of their state park and recreation areas revenue?
c. What is the probability that all 10 of the states drawn are operating on more
than 50% of their state park and recreation areas revenue?
5. Ten pieces of furniture are to be arranged in a long row in a furniture store. In how
many ways can all 10 be arranged? (Video Solution)
7. A frequent concern of cautious consumers is the idea of the last four digits of a credit
card number being displayed on receipts. Suppose a consumer has a Visa card, which has a
total of 16 digits, each of which can be between 0 and 9. For the sake of simplicity,
suppose any combination is possible. A customer left the following receipt lying around
and is now concerned about his identity: (Video Solution)
Let's look at the possibilities in a tabular form. Since there's a 15% chance the driver will get into an accident, there is an 85% chance he won't (since it either does happen or it doesn't). If there is no accident, then the insurance company receives $1200 for the entire year. If an accident does occur, the insurer pays out $3200 (hence a negative effect), but still receives the year's premiums. Thus, the net difference is a loss of $2000, which the insurer is responsible for.
If we now consider 100 years, it is expected that 15 of those years there would be an accident and 85 of them there would be no accident, assuming the constant probability. That means the insurer would pay $2000 a total of 15 times and receive $1200 a total of 85 times. Let's consider the net difference: over 100 years the insurer nets 85(1200) + 15(-2000) = 72,000 dollars, or 720 dollars per year on average.
Notice what we really did here. We took the sum of the amounts and divided by 100:
$\frac{85(1200) + 15(-2000)}{100}$
$= \frac{85(1200)}{100} + \frac{15(-2000)}{100}$
$= \frac{85}{100}(1200) + \frac{15}{100}(-2000)$
$= 0.85(1200) + 0.15(-2000) = 720$
In reality, we multiplied each monetary value by its respective probability. This idea is known as expected value, since it is what we expect to happen in the long run.
Expected value is the expected, or average, quantity that should occur in the long run, provided that each quantity occurs with a certain probability.
$E[X] = x_1 P(x_1) + x_2 P(x_2) + \cdots + x_n P(x_n)$
A capital letter, $X$, is used to denote what is called a discrete random variable, a variable that takes on one of $n$ (a natural number of) values with a certain probability. This value is defined by what it measures in the given situation.
Observe that we can use properties of fractions to separate the sum as follows:
( ) ( ) ( )
While one-third in this situation is not a probability (since the scores have already been
) “ ” -third of the overall
class grade.
SOLUTION: We should determine what will happen, on average. We first see that the warranty is a 2-year warranty and the defect rate is for one year. If 3% malfunction each year, then about 6% of all televisions are expected to malfunction within the first two years.
This means that the company will make $175 with a 94% probability and will lose $1200 - $175 = $1025 with a 6% probability, since it will still receive the payment, but will have to either replace the product or offer a credit to the consumer.
$E[X] = 0.94(175) + 0.06(-1025) = 164.50 - 61.50 = 103$
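The expected value calculation is the same every time: multiply each value by its probability and add. A minimal Python sketch is shown below. The function name `expected_value` and the list-of-pairs layout are assumptions made for this illustration, and the check that the probabilities sum to 1 is an added safeguard rather than part of the formula.

```python
def expected_value(distribution):
    """Sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in distribution)
    if abs(total_probability - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution)

# Insurance example: +1200 with no accident, -2000 with an accident.
insurance = [(1200, 0.85), (-2000, 0.15)]

# Warranty example: +175 if the TV works, -1025 if it malfunctions.
warranty = [(175, 0.94), (-1025, 0.06)]

print(round(expected_value(insurance), 2))   # 720.0
print(round(expected_value(warranty), 2))    # 103.0
```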
Example 2: The Arizona Lottery has a number of different lottery games that a person
can play. One in particular is Fantasy 5. The rules of the game are simple: pay $1 per
ticket and select five numbers between 1 and 41. Five numbers are then selected at
random. If you correctly selected two or more of these numbers, then you are
considered a winner. The following table describes the likelihood of winning:
(SOURCE: www.arizonalottery.com)
The estimated jackpot for the Wednesday, August 17, 2011 lottery was $54,000. Is the
game in your favor? Why or why not?
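SOLUTION:
We must first consider the fact that these prizes do not take into account that $1 was lost to purchase the ticket; we should subtract $1 from each of the prizes. Additionally, we note that the probabilities of winning do not add to 1:
$\frac{1}{749{,}398} + \frac{1}{4163} + \frac{1}{119} + \frac{1}{11} \approx 0.0996$
The remaining probability, $1 - 0.0996 = 0.9004$, is the chance of winning nothing and losing the ticket price.
x (net winnings in dollars)    53,999       499       4        0       -1
P(x)                           1/749,398    1/4163    1/119    1/11    9,004/10,000
$E[X] = 53{,}999\left(\frac{1}{749{,}398}\right) + 499\left(\frac{1}{4163}\right) + 4\left(\frac{1}{119}\right) + 0\left(\frac{1}{11}\right) + (-1)\left(\frac{9{,}004}{10{,}000}\right) \approx -0.67$
This means that if one were to play time after time, taking into consideration the small likelihood of winning occasionally, one would be expected to lose, on average, $0.67 per ticket.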
Notice that we represented the outcomes by using a table, in which we listed the outcomes, or the individual values $x$, along with the probability that each occurs, $P(x)$. This is one way in which to display a probability distribution, or how all probabilities are distributed among the various outcomes.
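The lottery distribution can be rebuilt directly from the table. In the sketch below the four winning rows are entered as exact fractions and the losing probability is computed as whatever is left over, which is an assumption that mirrors the 9,004/10,000 entry in the table. The names `win_rows`, `p_lose` and `ev` are invented for this illustration.

```python
from fractions import Fraction

# (net winnings in dollars, probability) for each winning outcome.
win_rows = [
    (53_999, Fraction(1, 749_398)),
    (499, Fraction(1, 4163)),
    (4, Fraction(1, 119)),
    (0, Fraction(1, 11)),
]

p_win = sum(p for _, p in win_rows)
p_lose = 1 - p_win                 # leftover probability: lose the $1 ticket

ev = sum(x * p for x, p in win_rows) + (-1) * p_lose

print(round(float(p_win), 4))      # 0.0996
print(round(float(p_lose), 4))     # 0.9004
print(round(float(ev), 2))         # -0.67
```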
Example 3: A fair, six-sided die is tossed repeatedly. The number of dots that are facing
up after each throw is recorded. Define the random variable, find its probability
distribution, and find and interpret the expected value of the random variable.
SOLUTION: Let $X$ = the number of dots facing up after a throw.
The different values that $X$ can take on are $\{1, 2, 3, 4, 5, 6\}$, since we know there are six sides. Since this is a fair die, each of these six outcomes has an equally likely chance of appearing, so $P(x) = \frac{1}{6}$ for all values of $x$. Our probability distribution is thus,
x       1      2      3      4      5      6
P(x)    1/6    1/6    1/6    1/6    1/6    1/6
The expected value is the sum of the products of each outcome value and its associated probability.
$E[X] = 1\left(\frac{1}{6}\right) + 2\left(\frac{1}{6}\right) + 3\left(\frac{1}{6}\right) + 4\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 6\left(\frac{1}{6}\right) = \frac{21}{6} = 3.5$
The average value of a die that is repeatedly tossed will be 3.5. If we were to conduct a simulation we would probably see something similar to the one in the introductory section of this chapter:
[Plot of the running average roll; horizontal axis ticks run from 0 to 140.]
As time passes, we see that the average roll becomes more stable and seems to be approaching 3.5, as we have shown mathematically.
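A simulation of this kind is easy to run. The sketch below is one possible version, not the one used to produce the plot: the function name `running_averages`, the seed, and the number of rolls are all choices made for this illustration.

```python
import random
from fractions import Fraction

# Exact expected value: each face times its probability of 1/6.
exact = sum(face * Fraction(1, 6) for face in range(1, 7))
print(exact)   # 7/2

def running_averages(rolls, seed=0):
    """Roll a fair die `rolls` times; return the average after each roll."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for n in range(1, rolls + 1):
        total += rng.randint(1, 6)   # one fair roll: 1 through 6 inclusive
        averages.append(total / n)
    return averages

averages = running_averages(10_000)
print(averages[9], averages[99], averages[-1])   # early values wander; the last is near 3.5
```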
Example 4: In hopes of understanding the directions in which married couples are naturally
inclined to walk at an outdoor mall in Arizona, a marketing group conducts a study. It is the
experience of the mall that men and women tend to walk in different directions once they
park (and catch up later). The first question is how many individuals within a couple can they expect to start their walk through a street that has one or more clothing stores?
The random variable $X$ can take on values $\{0, 1, 2\}$, since it is possible that neither of them takes a clothing store route, only one does, or both do.
For $x = 0$, zero individuals taking a route with a clothing store would occur when, from the three clothing store routes, none are selected, and both routes without clothing stores are selected. We then must compare this to the number of ways two routes can be chosen from five. That is,
$P(0) = \frac{({}_3C_0)({}_2C_2)}{{}_5C_2} = \frac{(1)(1)}{10} = \frac{1}{10}$
Similarly, for $x = 1$, we want to know how many ways one clothing-store route and one non-clothing-store route can be selected. That is,
$P(1) = \frac{({}_3C_1)({}_2C_1)}{{}_5C_2} = \frac{(3)(2)}{10} = \frac{6}{10}$
For $x = 2$,
$P(2) = \frac{({}_3C_2)({}_2C_0)}{{}_5C_2} = \frac{(3)(1)}{10} = \frac{3}{10}$
x       0       1       2
P(x)    1/10    6/10    3/10
We can see that the probabilities sum to 1, which helps confirm that we have accounted for all possibilities.
The number of individuals expected to take a clothing store route is an expected value of this distribution,
$E[X] = 0\left(\frac{1}{10}\right) + 1\left(\frac{6}{10}\right) + 2\left(\frac{3}{10}\right) = 1.2$
Thus, it can be expected that, on average, at least one person from the couple will walk along a route that contains a clothing store.
[Histogram of the distribution; horizontal axis: Number of Individuals.]
This is a convenient visual way to view the distribution of probabilities. It is clear to us that it is quite unlikely that neither of the individuals in the couple will walk a route with a clothing store.
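The three probabilities in Example 4 all follow one pattern: choose some of the clothing-store routes, choose the rest from the other routes, and divide by the number of ways to choose any two routes. A short Python sketch of that pattern follows. The function name `route_probability` and its parameter names are invented for this illustration, and the counts (three clothing-store routes, two others, two routes chosen) are taken from the example.

```python
from fractions import Fraction
from math import comb

def route_probability(x, clothing=3, other=2, chosen=2):
    """P(exactly x of the chosen routes have a clothing store)."""
    ways = comb(clothing, x) * comb(other, chosen - x)
    return Fraction(ways, comb(clothing + other, chosen))

distribution = {x: route_probability(x) for x in range(3)}
expected = sum(x * p for x, p in distribution.items())

print(distribution[0], distribution[1], distribution[2])   # 1/10 3/5 3/10
print(sum(distribution.values()))                          # 1
print(expected)                                            # 6/5
```

Note that `Fraction` prints 6/10 in lowest terms as 3/5, and the expected value 6/5 is the 1.2 found above.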
1. While working in downtown Phoenix, the author tracked minutes that the Blue Line
bus going through downtown Phoenix, AZ was late in arriving at a specific bus stop. He
discovered the following: (Video Solution)
Minutes late    On time    1       2       3       4
P(x)            0.53       0.25    0.18    0.03    0.01
2. A Geico auto insurance policy for a 21-year-old Chandler male driver of a 2012 BMW
M5 with no previous tickets has a semi-annual premium of $312.41. In the instance of an
accident, there is a $1,000 deductible that the policyholder must pay before insurance will
cover the damages (SOURCE: www.geico.com). The vehicle costs about $115,000 to
replace. From past experience, suppose Geico knows there is a 2.5% chance (annually)
3. An insurance policy pays $100 per day for up to 3 days of hospitalization and $50 per
day for each day of hospitalization thereafter. (Video Solution)
( ) {
4. You work on a dairy farm and are in charge of quality control for eggs. Your primary
concern is that broken eggs do not go out. You know from past experience that about
25% of the outgoing boxes contain one or more broken eggs (based on complaints). If a
local restaurant purchases 4 boxes of eggs from you, what is the expected number of
boxes with broken eggs that this vendor should receive? (Video Solution)
It might seem paradoxical to say that uncertainty occurs in certain ways, but the truth is that it does, provided certain assumptions are satisfied. As we build a probability distribution, whether in the form of a table or histogram, we can often save ourselves a lot of labor by focusing on the type of experiment that lies before us. The purpose of this chapter is to (hopefully) simplify some of our efforts.
Suppose a friend of yours, let's call him Kyle, tells you that his brother is 6-feet, 9-inches tall.
You are most likely wide-eyed and surprised by what he just told you.
Why is this?
You likely have some idea of how tall people generally are. You would probably consider a height of 6-feet, 9-inches to be uncommon in the environment you're used to. In fact, you might even go as far as to call this height an outlier, or a value that falls outside the usual data range.
How can you be absolutely sure that this height is uncommon? What if you live in a region that
tends to have shorter people?
The statistician would say that it would be nice to see a probability distribution associated with
heights of all people living in the region, state, country, or continent on which you live. She
would argue that, if you are trying to describe the people in the U.S. based on people living in
Arizona, you are drawing from a biased sample.
While we will not discuss continuous random variables here (variables that can take on any
number in a specified range), we will show a theoretical distribution for heights in the U.S.
below:
You might be wondering how we know that the shapes of the distributions should look like bells.
This is based on the data collection process. It is not unlikely in nature for distributions to have a
heavily loaded center with lower frequencies out towards the left and right tails. While the
histogram of all heights might not have a perfect bell shape as we indicate, having this shape
allows us to use mathematics to model the curve.
Although many variables do take on a continuous set of values, we will begin with discrete
random variables, as these are slightly simpler to describe.
When we talk about any variable that can take on a finite (as opposed to infinite) number of
possibilities, we are dealing with a discrete random variable.
Specifically, a binomial random variable is one that takes on one of two possible values, as
indicated by the prefix “bi.” We will simply refer to the outcome as either a “success” or a
“failure.”
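Consider this example: let's say that you and a friend are tossing a coin (since this is one of the most exciting things to do). Your friend tosses 9 heads out of 10 tosses. Curious about this, you begin to analyze the results: how likely is it that this type of event could take place?
By letting $H$ and $T$ represent the events that a head/tail is facing up on a coin toss, respectively, we know that one possible way in which this can happen is:
$(H, H, H, H, H, H, H, H, H, T)$
Each toss is independent: the sequence tossed so far does not affect the probability of getting a head or tail on the next toss, so we may multiply the individual probabilities. So,
$P(H, H, H, H, H, H, H, H, H, T) = \left(\frac{1}{2}\right)^9 \left(\frac{1}{2}\right)^1 = \left(\frac{1}{2}\right)^{10} \approx 0.000977$
Not surprisingly, there are 9 more places for the tail to have appeared, and every such sequence has the same probability. We'll summarize in the table below:
Arrangement of 9 H, 1 T    Probability
HHHHHHHHHT                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHHHHHHHTH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHHHHHHTHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHHHHHTHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHHHHTHHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHHHTHHHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHHTHHHHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HHTHHHHHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
HTHHHHHHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
THHHHHHHHH                 $(1/2)^9 (1/2)^1 \approx 0.000977$
Since these are 10 distinct ways of getting this outcome, each with probability 0.000977 (that is, each takes up 0.0977% of the entire sample space), the probability of getting 9 heads and 1 tail is:
$P(\text{9 heads, 1 tail}) = 10(0.000977) \approx 0.00977$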
What if we complicated the problem a little more and asked, what would be the probability of
having two tails mixed up in 10 total tosses?
This gets more complicated, since the two tosses could occur one after another, two tosses apart,
three tosses apart, etc. To simplify our lives, it can be shown that the total number of ways in
which a binary “success” can occur is by finding the following combination:
. /
. /
Then, we simply need to find the probability of just one of those arrangements and multiply it by
the number of different arrangements.
Since we defined a head resulting as a success, what we calculated for 9 heads in 10 tosses was:
\(\binom{10}{9} (0.5)^9 (0.5)^{10-9}\)
At first glance, it might seem a little confusing that the second exponent is the number of trials
less the number of successes.
Why is this?
Suppose there are 10 trials and you want 6 successes. This necessarily means that the other 4
trials would result in failures. This is precisely \(10 - 6 = 4\), or the number of trials less the
number of successes.
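Let's make this formula easier to consider. First off, let's define some variables:
Let \(n\) = the number of trials, \(k\) = the number of successes, and \(p\) = the probability of success on one trial.
Now, in any event, success and failure make up the whole sample space. That is:
\(P(\text{success}) + P(\text{failure}) = 1\), so \(P(\text{failure}) = 1 - p\)
So,
\(\binom{n}{k} p^k (1 - p)^{n-k}\)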
To make this more clear, we first define a random variable, \(X\). In the case of a binomial
experiment (one in which there are two possible outcomes for each trial), \(X\) takes its values from the set listing all
possible numbers of successes that can be achieved (between 0 and the number of trials).
For example, if \(X\) is the number of heads
in 10 coin tosses, then \(X \in \{0, 1, 2, \ldots, 10\}\). That is, between 0 and 10 heads can possibly
be achieved in 10 tosses of the coin (though not all have the same probability). To indicate a
binomial pdf calculation, we often write:
\(P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}\)
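We summarize a binomial pdf below, along with the necessary assumptions to use this.
If (1) there is a fixed number of trials, \(n\); (2) each trial results in either a success or a failure;
(3) the probability of success, \(p\), is the same on every trial; and (4) the trials are independent,
then the experiment is a binomial experiment and the probability of \(k\) successes can be
calculated by
\(P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}\)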
SOLUTION:
a) Since there are 10 trials and 8 successes desired, there are:
\(\binom{10}{8} = 45\) possible arrangements.
b)
1) There are \(n = 10\) trials
2) Each outcome is either a head (success) or a tail (failure)
3) The probability of success on any trial is \(p = 0.5\)
4) One toss does not influence the outcome of any other toss
c)
\(P(X = 8) = \binom{10}{8} (0.5)^8 (0.5)^2 \approx 0.0439\)
The fact that the probability of getting 8 heads in 10 tosses is higher than getting 9 heads in 10
tosses should not surprise us. Getting 9 heads is a rather extreme request. Getting 8 heads, while
still extreme, is a bit more likely.
Let's now build the probability distribution histogram for \(X\). We first display the probabilities in
a table below by applying the binomial pdf:
Successes Probability
0 0.001
1 0.010
2 0.044
3 0.117
4 0.205
5 0.246
6 0.205
7 0.117
8 0.044
9 0.010
10 0.001
Does this match our expectations? The table indicates that getting 5 heads has the highest
likelihood of all 11 possible events. Even more importantly, the probability of getting between 4
and 6 heads in 10 tosses is \(0.205 + 0.246 + 0.205 = 0.656\). Getting very few
or very many successes is very unlikely. This data is displayed in the histogram below:
[Figure: probability histogram for the number of heads in 10 tosses; horizontal axis: Successes; vertical axis: Probability]
Additionally, note that the event probabilities sum to 1. This is necessary and
important in describing the distribution.
With \(n\) trials in a binomial experiment, the probabilities of 0 up to \(n\) successes must
together constitute the sample space and hence sum to 1.
That is,
\(\sum_{k=0}^{n} P(X = k) = 1\)
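A short sketch can confirm both the table-style probabilities and the sum-to-1 property for the 10-toss coin experiment. It assumes only the binomial pdf described in this section; the names `n`, `p`, and `dist` are ours.

```python
from math import comb

n, p = 10, 0.5  # 10 tosses of a fair coin

# Binomial pdf evaluated at every possible number of heads, 0 through 10.
dist = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

for k, prob in dist.items():
    print(f"{k:2d}  {prob:.3f}")

total = sum(dist.values())             # all 11 events together
middle = dist[4] + dist[5] + dist[6]   # between 4 and 6 heads
print(round(total, 6), round(middle, 3))
```

The same dictionary could be passed to any plotting tool to draw the probability histogram.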
Example 2: A fair, 6-sided die is rolled 8 times. The goal is to roll a 1 or a 2 four times during
the experiment.
SOLUTION:
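\(P(X = 4) = \binom{8}{4} \left(\tfrac{2}{6}\right)^4 \left(\tfrac{4}{6}\right)^4\)
\(= \binom{8}{4} \left(\tfrac{1}{3}\right)^4 \left(\tfrac{2}{3}\right)^4 \approx 0.171\)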
A question that follows from Example 2 is: what does the distribution look like? Let's develop
the distribution in tabular form first. To do this, we calculate binomial probabilities for each of
the 9 possible outcomes (anywhere between 0 and 8 successes possible).
Successes Probability
0 0.039
1 0.156
2 0.273
3 0.273
4 0.171
5 0.068
6 0.017
7 0.002
8 0.000
[Figure: probability histogram for the number of successes in 8 die rolls; horizontal axis: Successes; vertical axis: Probability]
Notice that this distribution is not symmetric. It is said to be skewed to the right, since
the distribution has its probabilities heavily concentrated towards the left and so has a tail to the
right (hence the name).
Distribution Types
It can be shown that the expected value of \(X\), or the average number of successes we expect to
see, given that \(X\) is a binomial random variable, is:
\(E(X) = np\)
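As a sanity check on this shortcut, the sketch below computes the expected value for the die experiment of Example 2 (8 rolls, success probability 1/3) in two ways: from the definition, by weighting each number of successes by its probability, and from the product of the number of trials and the success probability. The variable names are ours.

```python
from math import comb

n, p = 8, 1 / 3  # 8 rolls; a "success" is rolling a 1 or a 2

# Binomial probabilities for 0 through 8 successes.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Definition of expected value: sum of (value x probability).
expected_from_definition = sum(k * prob for k, prob in enumerate(pmf))

# Shortcut for a binomial random variable: trials x success probability.
expected_from_formula = n * p

print(round(expected_from_definition, 4))  # 2.6667
print(round(expected_from_formula, 4))     # 2.6667
```

Both routes give about 2.67 successes, which sits between the two most likely outcomes (2 and 3) in the table above.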
Example 3: Pristine Air Conditioning uses a digital phonebook to call homeowners in a large
city regarding a $55.99 A/C maintenance special. In an hour, a telemarketer can make about
10 calls. If the probability that a randomly called homeowner signs up for the maintenance
special is 0.40,
a. What is the probability that a telemarketer gets at least 80% of his hourly customers
to sign up?
b. Represent this probability in a histogram.
c. Find and explain the expected value of the random variable.
SOLUTION:
a) We first need to determine whether or not this is a binomial experiment. The
probability of success is 0.40 on every one of 10 trials, and we assume that the population
is large enough that removing one called homeowner from the pool of potential customers
does not noticeably change that probability. We therefore conclude
that this is a binomial experiment. Thus, \(X\) = the number of called homeowners that
accept the offer.
We want to know the probability of getting business from 8, 9, or all 10 of the called
individuals. We want:
\(P(X \ge 8) = P(X = 8) + P(X = 9) + P(X = 10)\)
because each of these accounts for disjoint pieces of the sample space.
\(P(X \ge 8) = \binom{10}{8}(0.4)^8(0.6)^2 + \binom{10}{9}(0.4)^9(0.6)^1 + \binom{10}{10}(0.4)^{10}(0.6)^0 \approx 0.0123\)
Thus, there is only about a 1.23% chance that the A/C company gets the business of 80%
or more of the homeowners called.
b) The histogram is below. The probability we are looking at is the sum of probabilities after
7 successes:
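The sum in part a) can be reproduced with a few lines of code. This is a minimal sketch under the same assumptions as the solution (10 independent calls, each with a 0.40 chance of a sign-up); the helper name `pmf` is ours. It also evaluates the expected value asked for in part c) using the product of trials and success probability.

```python
from math import comb

n, p = 10, 0.40  # 10 calls per hour, 40% sign-up chance per call

def pmf(k):
    """Probability that exactly k of the n calls end in a sign-up."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 8, 9, and 10 sign-ups are disjoint events, so their probabilities add.
at_least_8 = sum(pmf(k) for k in (8, 9, 10))

expected_signups = n * p  # expected value of a binomial random variable

print(round(at_least_8, 4))  # 0.0123
print(expected_signups)      # 4.0
```

An expected value of 4 means that, over many hours of calling, the telemarketer would average about 4 sign-ups per 10 calls.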
2. Suppose the outcome of random variable is conducted with trials each with
independent probability of success, . (Video Solution)
a. Is this a binomial experiment?
b. What is the probability that
c. What is the probability that
d. What is the probability that
e. What is the probability that
3. In preparing for a New Year's Eve celebration, police look at past records for arrests due
to driving under the influence (DUI). In the U.S., 10.5% of arrests made are for DUI
(SOURCE: U.S. Statistical Abstract, Table 324). If it is expected that each police officer
makes 10 arrests, what is the probability that all arrests are for DUI? (Video Solution)
4. Pancreatic cancer is a vicious killer. The 5-year survival rate between 2001 and 2007 was
only 5.9%, meaning that the majority of people with pancreatic cancer die within 5-years
of contracting the cancer. In a group of 25 patients, 5 survive beyond. How likely is such
an event? Assume that the survival of one person is independent of another person.
(SOURCE: U.S. Statistical Abstract, Table 182). (Video Solution)
5. A new herbal drink blend is being compared to an older blend via a blind taste-test
comparison. Four judges will taste each of the two drinks and will state their preference.
It is anticipated that both blends are equally impressive. (Video Solution)
a. Find the probability distribution for the number of judges that vote in favor of the
new blend.
b. Construct a probability histogram.
c. What is the probability that at least two of the judges prefer the new blend?
d. What is the expected value of this distribution and what is its real-world meaning?
6. Goranson and Hall (1980) explain that the probability of detecting a crack in an airplane
wing is the product of , the probability of inspecting a plane with a wing crack; , the
probability of inspecting the detail in which the crack is located; and , the probability
of detecting the damage. (Problem Source: Mathematical Statistics with Applications, 6th
Ed., Wackerly, et. al.) (Video Solution)
a. What assumptions justify the multiplication of these probabilities?
b. Suppose and for a certain fleet of planes. If three planes
are inspected from this fleet, find the probability that a wing crack will be
detected on at least one of them.
c. Find the probability distribution for the number of planes in this fleet with
detected wing cracks.
d. Construct a probability histogram.
e. What is the expected value of this distribution and what is its real-world meaning?
Up until this point, we have only considered distributions that have discrete values: non-negative
integers. There are many variables, however, that are continuous in nature. In fact, almost every
variable you studied in algebra and calculus was continuous!
Take, for example, heights of NBA basketball players, hourly wage, response time of a database
server, temperature, depth of a lake, the value of a share of Intel stock, and the lifespan of a car
engine, to name just a very few. These are all variables that can take on infinitely many values,
even within a limited range. For example, the response time of a database could be anywhere between
0 seconds and 1 second. It could be 0.01 seconds, 0.00001 seconds, or 0.98727495 seconds.
Think back to a discrete distribution. The probability of a particular value was found by
observing the height of the relative frequency bar. While relative frequency represents the
percentage of observations found to have the value specified, it can also be thought of as a
probability, if we feel that it accurately models predictions that we might use it for. Consider the
example below showing the number of children in a classroom of 30 that are likely to
have the flu.
[Figure: relative frequency histogram; horizontal axis: Number of Children w/Flu (0 to 5); vertical axis: relative frequency]
Let's call this random variable \(X\) = # of children in a classroom of 30 that have the flu.
Then, we will write the probability that any 2 children have the flu as:
\(P(X = 2)\)
This reads, “the probability that the number of children that have the flu is 2.”
\(P(X \le 2)\)
This is asking us to find the probability that 2 or fewer children have the flu. In other words,
what is the probability that 0, 1, or 2 children have the flu? To answer this, we simply add the bar
heights corresponding to \(X = 0, 1, 2\).
\(P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.74\)
Thus, there is a 74% chance that 2 or fewer children in a class of 30 children have the flu.
With continuous distributions, we cannot simply read the “height of the bar!” For instance,
consider the following continuous probability distribution that shows the likelihood of various
wait times in line at a fast-food restaurant:
[Figure: wait-time distribution, constant at 0.2 from 0 to 5 minutes; horizontal axis: Minutes; vertical axis: Probability]
In this example, suppose we wish to find \(P(X = 2.5)\), that is, the probability that the wait time is
2-and-a-half minutes. At first glance, we might simply decide to locate 2.5 minutes and assess
the probability output. We would find:
\(P(X = 2.5) = 0.2\)
If this were the case, wouldn't it be the case that all wait times have a probability of 0.2? Based
on the graph, of course. This, however, would be a logical pitfall: if there are infinitely many
different wait times between 0 and 5 minutes, then the sum of all probabilities would be a sum of
infinitely many 0.2's, which is far greater than 1. In other words, it is only possible for the wait times
to have individual probabilities of 0.2 if the times were discrete. When we deal with continuous random variables,
we should actually consider the vertical axis to be density instead of probability. In and of itself,
density is not a meaningful value; however, in conjunction with what we will mention next, it will
prove to be useful.
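Without going into too much detail, a density function is designed in such a way that the
area under the function is 1, or 100%. Let's reconsider the above graph:
[Figure: wait-time density, constant at 0.2 from 0 to 5 minutes; horizontal axis: Minutes; vertical axis: Density]
The region under the density is a rectangle of width 5 and height 0.2. And so we are able to confirm
that the area, \(5 \times 0.2 = 1\), represents all possible wait times this particular
store has experienced.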
As you might guess, if we wish to find the probability of a range of values, we would simply find
the area under the density function between those two values of time.
What, then, is the probability of a single value, such as a wait time of exactly 2.5 minutes?
The answer might not come as too much of a surprise: the probability is 0!
The probability of a single value in a continuous distribution is 0, since there are infinitely many
possible values. Thus, 2.5 represents 1 of infinitely many values. Informally, take \(1/\infty\) and you get 0!
We can only find the probability of a non-zero range of values for a continuous random variable!
A continuous random variable is a random variable that has infinitely many possible values
within a range of real numbers.
As a result, the probability that a continuous random variable takes on any one specific value is
0.
The PDF of a continuous random variable is a continuous function such that the total area
between the function and the horizontal axis is 1. The function‟s input values are the values of
the random variable, while the output values are densities. Densities are individually meaningless
values designed so that the total area equals 1.
[Figure: wait-time density function; horizontal axis: Minutes; vertical axis: Density]
Thus,
\(P(2.5 \le X \le 3.5) = (3.5 - 2.5)(0.2) = 0.2\)
We can expect to wait between 2.5 and 3.5 minutes with a 20% chance. Thus, approximately one
in five visits, our wait-time will be somewhere within this interval.
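\(P(0.3 \le X \le 4.4)\)
This is the probability that the wait-time is between 0.3 and 4.4 minutes. The region is a rectangle
of width \(4.4 - 0.3 = 4.1\) and height 0.2, so its area is \(4.1 \times 0.2 = 0.82\).
Thus, there is an 82% chance that the wait-time is between 0.3 and 4.4 minutes.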
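The rectangle-area reasoning used for the wait-time density can be written as a small function. This is only a sketch: the function name `uniform_prob` and its default endpoints (0 and 5 minutes, matching the graph) are our own choices.

```python
def uniform_prob(lo, hi, a=0.0, b=5.0):
    """P(lo <= X <= hi) when the density is constant on [a, b]."""
    density = 1 / (b - a)             # height that makes the total area 1
    lo, hi = max(lo, a), min(hi, b)   # ignore anything outside [a, b]
    return max(hi - lo, 0.0) * density  # area = width x height

print(round(uniform_prob(2.5, 3.5), 2))  # 0.2
print(round(uniform_prob(0.3, 4.4), 2))  # 0.82
print(uniform_prob(2.5, 2.5))            # 0.0, a single value has no width
```

The last line illustrates why a single value has probability 0: a rectangle with zero width has zero area, whatever its height.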
When the PDF of a random variable is a constant, we call this a uniform distribution. That is,
values of the random variable are uniformly distributed.
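The PDF of a random variable, \(X\), whose values are uniformly distributed in the interval \([a, b]\) is:
\(f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}\)
\(P(c \le X \le d) = \dfrac{d - c}{b - a}\), for \(a \le c \le d \le b\)
\(E(X) = \dfrac{a + b}{2}\)
\(\sigma^2 = \dfrac{(b - a)^2}{12}\)
\(\sigma = \dfrac{b - a}{\sqrt{12}}\)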
Example 1: The amount of revenue that a farmers market generates on a given Saturday is
uniformly distributed between $5,000 and $22,000.
SOLUTION:
\(f(x) = \dfrac{1}{22000 - 5000} = \dfrac{1}{17000} \approx 0.0000588\)
This constant function is only valid for values between 5000 and 22000. It is valued as
0 everywhere else.
[Figure: Revenue PDF; horizontal axis: Revenue ($), from 5000 to 22000; vertical axis: Density]
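\(P(6000 \le X \le 8000) = \dfrac{8000 - 6000}{17000} \approx 0.118\)
There is about a 12% chance that revenue earned will fall between $6,000 and $8,000.
\(E(X) = \dfrac{5000 + 22000}{2} = 13500\)
This is a simple average. Thus, on average, the farmers market will make $13,500 on a
given Saturday.
\(\sigma = \dfrac{22000 - 5000}{\sqrt{12}} \approx 4907\)
On average, revenue will vary by about $4,907 less or more than the mean.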
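The farmers market calculations can be reproduced in a few lines. The sketch below assumes the standard results for a uniform distribution on an interval (density equal to one over the interval's width, mean at the midpoint, standard deviation equal to the width divided by the square root of 12); the variable names are ours.

```python
from math import sqrt

a, b = 5_000, 22_000  # lowest and highest possible Saturday revenue ($)

density = 1 / (b - a)                    # constant height of the PDF
prob_6k_to_8k = (8_000 - 6_000) * density
mean_revenue = (a + b) / 2
std_dev = (b - a) / sqrt(12)

print(f"{density:.7f}")         # 0.0000588
print(round(prob_6k_to_8k, 3))  # 0.118
print(mean_revenue)             # 13500.0
print(round(std_dev))           # 4907
```

Changing `a` and `b` is enough to rerun the same analysis for any other uniformly distributed quantity.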
[Figure: triangular density on the interval from 0 to 2, peaking at 1; horizontal axis: Random Variable Values; vertical axis: Density]
Practically speaking, it appears to be most probable that the random variable will take on a value
around 1. It is less likely that the random variable will take on values close to 0 or close to 2.
This might be handy in situations where such behavior is desired.
Notice that the area is also 1. If you divide the triangle into 2 right triangles, each with base 1 and
height 1, and use the area of a triangle formula \(\left(\tfrac{1}{2}bh\right)\): \(2 \cdot \tfrac{1}{2}(1)(1) = 1\)
In this next section, we will focus our attention on the most commonly used continuous random
variable: the normally distributed random variable.
The first two questions below involve discrete random variables. The aim of these questions is to
get you thinking in terms of the probabilities of ranges of values.
1. A pizza shop sells pizzas in four different sizes. The 1000 most recent orders for a single
pizza gave the following proportions for the various sizes:
With denoting the size of a pizza in a single-pizza order, the given table is an
approximation to the population distribution of .
2. Airlines sometimes overbook flights. Suppose that for a plane with 100 seats, an airline
takes 110 reservations. Define the variable as the number of people who actually show
up for a sold-out flight. From past experience, the population distribution of is given in
the following table:
3. A particular professor never dismisses class early. Let denote the amount of time past
the hour (in minutes) that elapses before the professor dismisses class. Suppose that the
density curve shown in the following figure is an appropriate model for the probability
distribution of :
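[Figure: density curve for the time past the hour at dismissal; horizontal axis: minutes (ticks at 2, 4, 6, 8, 10); vertical axis: density (ticks at 0.05, 0.10, 0.15, 0.20)]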
a. Find the probability density function (PDF) for this random variable.
b. What is the probability that at most 5 minutes elapse before dismissal?
c. Find ( ). Explain what your answer means.
d. Find the expected value of this distribution and explain its real-world meaning.
e. Find the standard deviation of this distribution and explain its real-world meaning.
f. What is the probability that the instructor lets class out within one standard deviation
of the average overtime?
4. A delivery service charges a special rate for any package that weighs less than 1 lb. Let
denote the weight of a randomly selected parcel that qualifies for this special rate. The
probability distribution of is specified by the following density curve:
[Figure: density curve labeled Density = 0.5 + x; vertical axis ticks at 0.5, 1.0, 1.5]
a. What is the probability that a randomly selected package of this type weighs at
most 0.5 lb.?
b. What is the probability that a randomly selected package of this type weighs
between 0.25 lb. and 0.5 lb.?
c. What is the probability that a randomly selected package of this type weighs at
least 0.75 lb.?
d. The probability is defined on the interval . Verify that the area under
the curve in this region is 1.
The normal distribution (pictured above), much like the uniform distribution, is a continuous
distribution. In fact, this distribution is defined for all real numbers. The curve runs from \(-\infty\) to
\(\infty\). However, as you might observe, the most likely values occur close to where the density
function peaks. Values that occur in either one of the “tails” are highly unlikely and, as it
appears, the density function is very close to the horizontal axis as it extends farther to the left
and to the right.
Why do we use this distribution? Much like the infamous appears in many natural places,
many random variables tend to be normally distributed. That is to say, the bulk of values tend to
occur near the mean and median (both of which are located directly in the center of the
distribution, since it is perfectly symmetric). For instance, heights of individuals in the United
States (roughly) follow a normal distribution – there are many people whose heights are near
average. There are fewer extremely short and extremely tall people in the United States. Thus,
we would say that the bulk of people are “normal” with respect to their heights.
While certainly not all random variables are normally distributed, many are. Weights, IQ, new-
vehicle gas mileages (to name just a very few) are variables that have been known to follow a
normal distribution. As we will later see, any distribution can “become” a normal distribution.
This is a beautiful phenomenon that allows us to make some important conclusions (more on this
idea in a later section).
As before, the overall area under the normal curve is 1 (50% on either side of the mean/median,
as in the image). To find the area, we would need to use some rather unusual shapes in order to
apply the same methodology as before. The idea of an integral in calculus would allow
us to find the area; however, the normal curve is modeled by the following pdf, whose integral
has no simple closed form:
\(f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\)
or, equivalently,
\(f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\)
where
\(\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}\)
IMPORTANT NOTE: \(\mu\) and \(\sigma^2\) represent the population mean and variance. \(N\) represents the
population size. Recall that the sample variance has a divisor of \(n - 1\), so that it is an unbiased
estimator of the population variance.
Below is an example of what a typical table would look like. We call this a standard normal
table, since it requires that values between which we would like to know areas are
“standardized.” This means they are converted to \(z\)-scores prior to using the table:
1. In an Arizona town, suppose the heights of adult males is such that inches and
(so the standard deviation is the square root of this value, ). What is the
probability that a male is shorter than 72 inches (6 feet tall)?
We round to two decimal places, since the standard normal table can handle up to two decimal
places. Any additional decimal places would not make a substantial difference.
We locate \(z = 1.14\) by first locating 1.1 along the rows and 0.04 along the columns (since 1.1 +
0.04 = 1.14).
What if we wanted to know an area to the right, such as \(P(X > 72)\)? The table does not provide
these values. However, if we know that \(P(X < 72) = 0.8729\), then the probability of a height
greater than 72 must be the remaining area, \(1 - 0.8729 = 0.1271\).
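Software can stand in for the standard normal table. The sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later) to look up the area to the left of the z-score 1.14 from this example, and then takes the complement for the area to the right. The area between -1 and 1 at the end is our own extra illustration of subtracting two left-hand areas, not a value from the text.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

z = 1.14
area_left = standard_normal.cdf(z)  # what the table reports
area_right = 1 - area_left          # remaining area under the curve

# Area between two z-scores: subtract the smaller left-hand area.
area_between = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(round(area_left, 4))     # 0.8729
print(round(area_right, 4))    # 0.1271
print(round(area_between, 4))  # 0.6827
```

Passing a mean and standard deviation to `NormalDist` directly would skip the standardizing step, since the conversion to a z-score is then done internally.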
Similarly, if we wish to find the area between two points, we must get creative.
and
http://www.rossmanchance.com/applets/NormalCalcs/NormalCalculations.html
As you can see, we enter the mean and standard deviation in the first section. If we would like to
plot two functions over one another, we could check the box and enter a second mean and
standard deviation.
The probability of such an event is displayed in the “prob” box. If we have two values entered
and both boxes checked, then the “probability between” these two values is displayed. Isn‟t this
much more intuitive and convenient than using tables?
NOTE: One limitation of the above applet is that values rounded to two decimal places require
a bit of finagling.
Use the applet mentioned in this section to complete these exercises. You are not required to use
the standard normal table.
2. In the UK, birth weights are approximately normally distributed with lbs. and
lbs. (SOURCE: http://www.healthknowledge.org.uk).
a. Find and explain the real-world meaning of ( ).
b. Find and explain the real-world meaning of ( ).
c. Find and explain the real-world meaning of ( ).
d. Find and explain the real-world meaning of ( ).
e. What weight is such that 20% of infants weigh less than this amount? (HINT:
You can still use the calculator applet.)
3. In recent years, Scholastic Aptitude Test (SAT) scores for all college-bound seniors in
the United States were such that points and points (SOURCE:
http://www.collegeboard.com).
a. 50% of students scored less than how many points?
b. 50% of students scored more than how many points?
c. In order to be in the top 10% of SAT-takers, what score would one have to
achieve?
d. Below what score do the lowest 10% of students fall?
e. The middle 50% of students scored between what two values?
When it is only our dataset that is of interest, we use descriptive statistics. This is precisely the
trouble we have been up to so far! Often times, however, we cannot collect all elements in the
population. Take, for example, a poll to gauge Americans‟ opinion of a candidate in office.
Certainly, you cannot sample all voting-age adults. This is easily resolved with a manageable
random sample, but is further complicated by the following idea: sampling variability!
How do we estimate true population parameters using a random sample, all the while taking into
account the fact that our sample statistic is variable from sample-to-sample?
This is the purpose of inferential statistics and is a very important aspect of understanding the
structure of an underlying population. With many advances in statistics, it is possible to make
precise claims about our population.
The cold, hard truth is that, when working with statistical inference, we likely have no idea what
the underlying probability distribution for the population looks like. If we did, then we wouldn‟t
have to draw a random sample and would be nearly done with this course. Since we don‟t, we
can‟t in good conscience assume that the distribution is normal. So, why spend time studying
such a distribution? We will soon experience why.
Suppose we roll a die. Without too much effort, we can produce the probability distribution for
the population of all possible outcomes. Here it is:
[Figure: probability distribution of a single die roll; horizontal axis: Die Value (1 to 6); vertical axis: probability]
In words, the probability of getting any one face value on a die roll is about 0.17 or 1/6. The
distribution is uniform.
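\(E[X] = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = 3.5\)
The variance of this population requires us to use the population variance formula
(remember, division by \(n - 1\) occurs if we are dealing with a sample, so that we have an
unbiased estimate for the population variance). That is:
\(\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}\)
\(\mathrm{Var}[X] = \dfrac{(1 - 3.5)^2 + (2 - 3.5)^2 + \cdots + (6 - 3.5)^2}{6} = \dfrac{17.5}{6} \approx 2.917\)
In a spreadsheet, entering the die values 1 through 6 in cells A2:A7 and computing
=VAR.P(A2:A7) gives: 2.916666667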
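The same population figures can be checked outside a spreadsheet. In this sketch, `pvariance` and `pstdev` from Python's standard `statistics` module divide by the population size, which matches what `VAR.P` does; the list name `faces` is ours.

```python
from statistics import mean, pstdev, pvariance

faces = [1, 2, 3, 4, 5, 6]  # the whole population of die values

mu = mean(faces)               # population mean
sigma_sq = pvariance(faces)    # divides by N, like VAR.P
sigma = pstdev(faces)          # square root of the population variance

print(mu)                   # 3.5
print(round(sigma_sq, 4))   # 2.9167
print(round(sigma, 4))      # 1.7078
```

Using `variance` instead of `pvariance` would divide by one less than the number of values, which is the sample formula and gives a larger result (3.5 here).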
In reality, keep in mind that we would often not know much about our population. We get the
luxury of studying something we can fully explain. This is all in an effort to better understand
sampling distributions.
Suppose we conducted an experiment of rolling the die 10 times. For one random sequence, we
might obtain the following result:
4 6
3 4
4 1
3 4
1 2
Not surprisingly, we get a fairly even spread of values 1 – 6. If we compute the average,
we obtain 3.2. That is, if every roll had come up as the same number while keeping the same total, each roll would be 3.2.
Suppose we asked 19 other people to roll a die 10 times and to then report back to us the mean.
Here is what we might find (based on a computer simulation of rolls):
First off, we notice there is sampling variability. Not every person obtained the same average
outcome from 10 tosses each. This is expected, since the process is a random one.
Sampling Distribution
The distribution of sample statistics (such as \(\bar{x}\)) computed from repeated sampling is called a
sampling distribution.
We do notice that the means tend to gravitate towards 3.5. Some, as expected, deviate from this
value.
Let us now consider a histogram for this sampling distribution of sample means:
[Figure: histogram of the 20 sample means; horizontal axis: sample-mean bins of width 0.25, from 2.4 to 5.15]
This is quite interesting… we have obtained a distribution (of means) that appears somewhat
bell-shaped.
Suppose now that we had a total of 1000 people roll a die 10 times each, and to then compute the
sample mean. Here is what a simulation of this process would look like:
[Figure: histogram of 1000 sample means (10 rolls each); horizontal axis: sample-mean bins of width 0.1, from 1.7 to 5.2]
Wow! Our distribution of means for 1000 individuals for experiments of 10 rolls each produces
something remarkably like a normal distribution. Additionally, it appears that the mean of this
distribution is around 3.5!
Let‟s try this again, but now, let‟s say that 1000 individuals each roll a die 20 times, and each
individual computes a sample mean. This simulated event would produce the following
distribution of die-roll average:
[Figure: histogram of 1000 sample means (20 rolls each); horizontal axis: sample-mean bins of width 0.1, from 2.2 to 4.7; vertical axis: frequency (0 to 100)]
The distribution looks a bit more normal. Upon closer inspection, we also see that the variability
of these averages is smaller. That is:
We notice that increasing the sample size (\(n\)) has decreased the sampling distribution's
variability.
In fact, the standard deviation for the distribution of means computed from 10 and 20 tosses is
about 0.52 and 0.38, respectively.
Let‟s do one more experiment. Let‟s say that 1000 individuals each roll a die 30 times, and each
individual computes the mean of his/her rolls. The sampling distribution of means would look
like this (based on simulation):
[Figure: histogram of 1000 sample means (30 rolls each); horizontal axis: sample-mean bins of width 0.1, from 2.4 to 4.9; vertical axis: frequency (0 to 120)]
Again, we notice the bell-curved shape and the decreased range of means (about 2.6 to 4.4)!
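The three experiments can be imitated with a short simulation. This is a sketch, not the simulation used to produce the histograms in this section: the seed is arbitrary (chosen only so that reruns give the same numbers), the function name is ours, and the exact output will differ from the figures quoted in the text.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # arbitrary seed so that reruns are reproducible

def simulate_sample_means(rolls_per_person, people=1000):
    """Each simulated person rolls a fair die and reports the mean roll."""
    return [
        mean(random.randint(1, 6) for _ in range(rolls_per_person))
        for _ in range(people)
    ]

results = {}
for n in (10, 20, 30):
    sample_means = simulate_sample_means(n)
    # Center and spread of the simulated sampling distribution.
    results[n] = (mean(sample_means), pstdev(sample_means))
    print(n, round(results[n][0], 2), round(results[n][1], 2))
```

Each run should show averages close to 3.5 and a spread that shrinks as the number of rolls per person grows, which is the pattern visible in the histograms.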
Let‟s summarize:
We can very easily see that the expected value of the sampling distribution is the same as \(\mu\), the
expected value of the population distribution. That is:
\(E[\bar{x}] = \mu = 3.5\)
But what is the relationship between the standard deviation of the sample means and the standard
deviation of the population of die roll values?
This is not so clear. Statisticians, after much research, found that the standard deviation of each
sampling distribution is related to the sample size in the following way:
\(\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\)
For example, with \(\sigma = \sqrt{2.9167} \approx 1.708\) and \(n = 10\), the formula gives
\(1.708/\sqrt{10} \approx 0.54\), while our simulation produced about 0.52.
The reason for this difference is simply due to randomness, and estimates can be improved more
(if desired) by increasing the number of “individuals rolling the die.”
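To see how closely the simulated spreads track the theory, the sketch below divides the population standard deviation of a fair die (the square root of the variance 2.9167 found earlier, written exactly as 35/12) by the square root of each sample size. The function name `standard_error` is ours.

```python
from math import sqrt

sigma = sqrt(35 / 12)  # population SD of a fair die, about 1.708

def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)

for n in (10, 20, 30):
    print(n, round(standard_error(sigma, n), 3))
# 10 0.54
# 20 0.382
# 30 0.312
```

The values for 10 and 20 rolls sit close to the simulated 0.52 and 0.38 reported above, and quadrupling the sample size would cut the spread in half.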
What we have observed here is formally known as the Central Limit Theorem.
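Regardless of the distribution of a random variable, \(X\), if we take repeated random samples from
this distribution of size \(n \ge 30\) and compute the mean, \(\bar{x}\), for each sample, then the following will
hold:
1) The sampling distribution of \(\bar{x}\) is approximately normal.
2) \(E[\bar{x}] = \mu\)
3) \(\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}\)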
(NOTE: A sample size of at least 30 is a rule-of-thumb and can vary slightly depending on the
severity of skews and abnormalities in the distribution. For even severely skewed distributions,
the approximate shape is typically normal.)
First of all, we do not need to understand the shape of the underlying distribution from which we
are sampling. This is an amazing result in and of itself, since we usually have little to no
information about the population itself (again, if we did, we wouldn't be wasting our time with
any of this!).
Secondly, since the resulting sampling distribution is approximately normally distributed, we can
proceed to calculate probabilities using the normal distribution. This is also great, since we
already have the background in that process!
Example 1: After experimentation, researchers believe that the mean lifespan of a strain of
bacteria is days with days. Due to the complexity of the bacteria, the shape
of the distribution of bacteria lifespans is unknown. A sample of 60 bacteria strains is
collected.
a. Does the CLT apply here?
b. Calculate the probability that the sample mean lifespan, \(\bar{x}\), is less than 3 days.
SOLUTION:
a. Since the sample size is 60, we should be safe in assuming that the sampling distribution
of all means is normally distributed with mean and standard deviation
√
.
b. We want ( ). Using our probability calculator
Given the very small level of variability in the sampling distribution of lifespan means,
we would consider the probability of observing an average smaller than 3 to be effectively 0.
1. In your own words, what does the Central Limit Theorem tell us?
2. In your own words, why is the Central Limit Theorem a very powerful practical result?
3. A sample of size 36 is taken from a population distribution of unknown shape, though the
mean is believed to be 100 with a standard deviation of 18. What is the probability that
the sample mean is:
a. Greater than 102?
b. Less than 98?
c. Between 95 and 105?
d. Between what two values will the middle 90% of means be?
4. A stained glass company produces panes of glass with a mean thickness of 0.42 inches
and a standard deviation of 0.04 inches, if produced properly. Suppose a random sample
of windows reveals a sample mean of 0.43.
a. What is the probability of this average, or a larger average?
b. Given the probability you have computed, what can be said about recent
production standards?
5. Promote Marketing has a research team to research new marketing tactics to propose to
potential clients. A group of 40 clients have been invited for a conference to be put on by
the marketing firm. The research team usually generates in revenues for
each member of the team with .
a. What will be the shape of the distribution of ̅ ? How do you know?
b. What is the probability that average sales will exceed $420,000 for this particular
event?
c. How would your answer change if 100 clients were to show up?
d. If the team (300 people) have an average revenue that is in the 90th percentile of
revenues, they will earn 4-days of paid vacation. What average sales would be
required for this?
7. Use the Excel Sampling Distribution Applet to address this problem. In a population, it is
found that 30% of homes have 5 rooms, 40% have 4 rooms, and 30% have 3 rooms. You
Statistics for Decision-Making in Business © Milos Podmanik Page 190
can set this up in our applet by having a “die” with 10 values: three 5‟s, four 4‟s, and
three 3‟s.
a. What is the average number of rooms a home has in this population? What is the
standard deviation in the number of rooms in this population?
b. Now, suppose you take a sample of size 30 from this population. What shape will
the distribution have and how do you know?
c. Take 1,000 random samples each of size and compute the 1,000 sample
means. According to the applet, what is the average of the average rooms in the
sample? What is the standard deviation in the average number of rooms in a
house? Compare these two results to what the Central Limit Theorem says we
should come up with. That is, find , ̅ - and , ̅ -.
d. Take 1,000 random samples each of size and compute the 1,000 sample
means. According to the applet, what is the average of the average rooms in the
sample? What is the standard deviation in the average number of rooms in a
house? Compare these two results to what the Central Limit Theorem says we
should come up with. That is, find , ̅ - and , ̅ -.
e. Take 1,000 random samples each of size and compute the 1,000 sample
means. According to the applet, what is the average of the average rooms in the
sample? What is the standard deviation in the average number of rooms in a
house? Compare these two results to what the Central Limit Theorem says we
should come up with. That is, find , ̅ - and , ̅ -.
f. Why do the values in the population have the highest standard deviation when
compared with the distribution of means in the last three parts?
g. What is the probability that, in a sample of 100 homes, the average number of
rooms is greater than 5?
h. Explain in practical terms why the standard deviation of any $\bar{x}$ distribution
decreases as the sample size increases.
As discussed previously, our ultimate goal is to make inferences about the population parameter
$\mu$. Again, keep in mind that this is the only reason why we are spending time on this! Otherwise,
we would have completed our semester early!
When we generate our sampling distribution for $\bar{x}$, we see very vividly that our sample means
are subject to sampling variability, depending on which “die values” are “rolled” for each individual
sample of size $n$. Thus, we should be very skeptical of concluding that any single $\bar{x}$ is
representative of the true population mean. However, if we have many, many “individuals roll the die,”
we should get a fairly reasonable understanding of a range of values for the true value of $\mu$. Let's
consider an example.
$\mu = 23.46$
$\sigma = 9.250319$
But, wait! Let‟s pretend that we actually don‟t have access to the entire population of values
(yes, we clearly see them in the table above, but we normally do not have that luxury). Due to
limited time and money, you are only able to sample 30 of these values. After taking a random
sample, here is what you have chosen:
32 31 31 35 19 20 22 21 20 20
20 25 29 32 33 19 19 19 18 22
25 27 30 18 21 33 30 32 31 33
$\bar{x} = 25.56667$
$s = 5.870342$
Again, at this point, we would have no way of telling how close we are to the actual mean of
23.46.
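The two sample summaries above can be checked directly from the 30 listed values. The following is a minimal sketch using Python's standard `statistics` module rather than the Excel applet; the variable names are illustrative only.

```python
import statistics

# The 30 sampled values listed above
sample = [
    32, 31, 31, 35, 19, 20, 22, 21, 20, 20,
    20, 25, 29, 32, 33, 19, 19, 19, 18, 22,
    25, 27, 30, 18, 21, 33, 30, 32, 31, 33,
]

x_bar = statistics.mean(sample)  # sample mean
s = statistics.stdev(sample)     # sample standard deviation (divides by n - 1)

print(f"x-bar = {x_bar:.5f}")
print(f"s     = {s:.6f}")
```

Note that `statistics.stdev` divides by n - 1, which is the sample (not population) standard deviation.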
1) From the population, take a random sample, preferably of size 30 or greater. The larger
the random sample, the more power we have in making inferences about the population.
2) If this is a truly representative sample, then we can think of it as a “mini” population that
acts and behaves according to the population as a whole. This is a key ingredient!
[Figure: a random sample is taken from the population; Sample 1, Sample 2, Sample 3, Sample 4, ...,
Sample 10,000 are then drawn from that random sample.]
Some of the assumptions we make are indeed dangerous. For example, do we really have a mini
population? If the answer is “no,” then theoretical results are equally worthless since they, too,
assume that the sample is representative.
If we have truly collected a random sample, then we should be able to think about the sample as
a small population. If this is a small population, then we should be able to sample from it. We
will draw random samples of size $n = 30$ from the small “population,” which is also of size
$n = 30$. This sounds strange, but we will sample with replacement, so it is possible to resample
the same value multiple times.
We will draw 1,000 samples of size $n = 30$ from this “population” and, as you might have
figured, we will calculate the mean of each and build the sampling distribution for $\bar{x}$.
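The resampling procedure just described can be sketched in a few lines of Python. This is only an illustration of the idea, not the applet's actual implementation; the function name `bootstrap_means` and the seed are assumptions made for the example.

```python
import random
import statistics

# The 30 sampled values from the example, treated as a "mini" population
sample = [
    32, 31, 31, 35, 19, 20, 22, 21, 20, 20,
    20, 25, 29, 32, 33, 19, 19, 19, 18, 22,
    25, 27, 30, 18, 21, 33, 30, 32, 31, 33,
]

def bootstrap_means(data, n_resamples=1000, seed=None):
    """Resample `data` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # sampling WITH replacement
        means.append(statistics.mean(resample))
    return means

means = bootstrap_means(sample, n_resamples=1000, seed=1)
print(f"{len(means)} resample means, from {min(means):.2f} to {max(means):.2f}")
```

Because the resamples are random, a different seed gives a slightly different collection of means, just as rerunning the applet would.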
[Figure: histogram of the 1,000 bootstrap sample means, in bins of width 0.5 running from about
22.27 to 29.77.]
As we should expect based on CLT, the distribution of these 1,000 means is approximately
normal.
Let's suppose that we want to have an interval within which there is a 95% probability that the
true population mean, $\mu$, lies. This is the same as looking for the middle 95% of means!
Thus, we can say that we are 95% confident that the true population mean is between 23.6 years
and 27.5 years. In other words, there is a 95% probability that we have “trapped” the population
mean between our lower and upper limit. Said one other way, 95% of all sample means, when
the variability from sample to sample is taken into account, are between these lower and upper
limits. If this is representative of the population, then we should believe that 95% of the time, we
will have means between these two values.
What if we wanted to be 99% certain? We would need to find lower and upper limits so that
there is only 1% in the tails:
Thus, we would like 0.01/2 = 0.005 (or 0.5%) in each of the two tails. To find the lower and upper
limits, we would need to find the 0.5th percentile and the 100 - 0.5 = 99.5th percentile. We get:
Thus, we are 99% confident that the true population mean age, $\mu$, is between 22.83 years and
28.17 years. In other words, there is a 99% probability that the true mean age is between 22.83
and 28.17 years.
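The percentile approach to the 95% and 99% intervals can be sketched as follows. This is a hedged illustration: the helper names and the simple "count in from each end of the sorted list" percentile rule are assumptions, and the applet may compute percentiles slightly differently. Since resampling is random, the endpoints will be close to, but not exactly, the values quoted in the text.

```python
import random
import statistics

sample = [
    32, 31, 31, 35, 19, 20, 22, 21, 20, 20,
    20, 25, 29, 32, 33, 19, 19, 19, 18, 22,
    25, 27, 30, 18, 21, 33, 30, 32, 31, 33,
]

def bootstrap_means(data, n_resamples=1000, seed=None):
    """Means of resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, confidence):
    """Return the endpoints of the middle `confidence` share of `values`."""
    ordered = sorted(values)
    tail = (1 - confidence) / 2            # share left in each tail
    cut = round(tail * len(ordered))       # number of values to skip at each end
    return ordered[cut], ordered[len(ordered) - 1 - cut]

means = bootstrap_means(sample, seed=1)
interval_95 = percentile_interval(means, 0.95)
interval_99 = percentile_interval(means, 0.99)

print("95% interval:", interval_95)
print("99% interval:", interval_99)
```

The 99% interval always comes out at least as wide as the 95% interval, which is the "more confidence means wider possibilities" trade-off.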
Note that in only one of our confidence intervals (99%), we have captured the true mean within
our range. This is very likely, since our confidence percentage is very high. BUT, keep in mind
that we never know what the true mean is! Thus, we cannot say that it would have been better to
stick with the wider 99% interval. After all, there is a 1% chance we might have made an error.
The level of confidence that we desire depends on the situation and the allowable mean width we
are willing to tolerate. More confidence means wider possibilities. In general, we never know
As a final note, it is interesting that we actually missed the true mean in our 95% confidence
interval, since there is only a 5% chance of error. Keep in mind, however, that this interval was
based on simulation. It is based on 1,000 samples, and it may have been better to increase the
number of samples.
6.2.2 Confidence Interval for $\bar{x}$ Using Theoretical Results – When $\mu$ and $\sigma$ are Unknown
$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
This reads, “$x$-bar is normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.”
This, however, assumes that we know something that we probably don't – the population mean
and standard deviation!
As you might guess, we will use $\bar{x}$ and $s/\sqrt{n}$ to approximate these. This poses a
problem: we are introducing more error. In order to account for this, the normal distribution is
not appropriate. When using these approximations, we must use the theoretical Student's
$t$-Distribution. This distribution looks much like the normal distribution, but its shape is
determined by the sample size, not by the mean and standard deviation. Below is a comparison of
the $t$-distribution with the standard normal distribution for a given sample size $n$.
As we mentioned, this distribution's shape relies on the sample size. The relationship is called
the degrees of freedom and can be calculated as $df = n - 1$; that is, the degrees of freedom
equal one less than the sample size.
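Thus, we would expect 95% of sample means to be within 2.045 standard deviations of the
mean. In other words:
$\bar{x} \pm 2.045 \cdot \frac{s}{\sqrt{n}} = 25.57 \pm 2.045 \cdot \frac{5.87}{\sqrt{30}}$
Or:
$25.57 \pm 2.19$
Thus, we are 95% confident that the true average age in this town is between 23.4 and 27.8.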
Notice that this is not very much different than our simulated confidence interval of 23.6 to 27.5.
So, which is more precise? This is arguable, but it is difficult to argue with empirical data.
Personally, I prefer the bootstrap confidence interval we ran earlier. My reasoning is that a
distribution of means is asymptotically normal, meaning that, under infinitely many sampled
units, the distribution would be exactly normal. This is very theoretical and not always valid.
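Lower limit: $25.57 - 2.756 \cdot \frac{5.87}{\sqrt{30}} \approx 22.6$
Upper limit: $25.57 + 2.756 \cdot \frac{5.87}{\sqrt{30}} \approx 28.5$
Similarly, there is a 99% chance that the population mean age is between 22.6 and 28.5.
Compare this to our empirical result above of 22.8 to 28.2. We are, again, very close.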
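The theoretical intervals can be reproduced with a short calculation. This sketch assumes the critical values are read from a $t$-table for 29 degrees of freedom (2.045 for 95%, as used above, and 2.756 for 99%); the function name `t_interval` is made up for the example.

```python
import math
import statistics

sample = [
    32, 31, 31, 35, 19, 20, 22, 21, 20, 20,
    20, 25, 29, 32, 33, 19, 19, 19, 18, 22,
    25, 27, 30, 18, 21, 33, 30, 32, 31, 33,
]

def t_interval(data, t_critical):
    """Confidence interval: x-bar plus or minus t_critical * s / sqrt(n)."""
    n = len(data)
    x_bar = statistics.mean(data)
    standard_error = statistics.stdev(data) / math.sqrt(n)
    margin = t_critical * standard_error
    return x_bar - margin, x_bar + margin

# Critical values for df = 30 - 1 = 29, taken from a t-table
T_95 = 2.045
T_99 = 2.756

low_95, high_95 = t_interval(sample, T_95)
low_99, high_99 = t_interval(sample, T_99)
print(f"95%: {low_95:.1f} to {high_95:.1f}")
print(f"99%: {low_99:.1f} to {high_99:.1f}")
```

Hard-coding the table values keeps the sketch dependency-free; with a statistics package available, the critical values could be computed instead of looked up.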
1. Describe, in your own words, what a bootstrap distribution is and why we would want to
use one. Be sure to mention the logical process behind building one, as well as the
assumptions we are making when we do so.
3. The following is a random sample of 10 labor costs associated with farming for civilian
consumers (in billions of dollars) since 1970.
(SOURCE: Data randomly sampled from U.S. Statistical Abstract, Table 847)
a. Does the Central Limit Theorem apply for this data? Why or why not?
b. Using a bootstrap distribution, calculate a 95% confidence interval for $\mu$, the true
population average labor cost.
c. In a complete sentence, interpret the real-world meaning of this value.
d. Using the bootstrap distribution and percentiles, how likely is it that a sample of
labor costs has a mean greater than $190,000,000,000?
4. In Arizona, primarily the Phoenix Metropolitan area, the issue of red-light cameras used
to catch red-light runners and speeders was a prominent one for much of the early 2000‟s.
Many studies were carried out over this period of debate to determine whether or not they
were effective, and whether or not they used taxpayer money appropriately. Suppose the
883 522 590 779 887 615 690 771 843 509
872 840 536 892 880 588 547 770 687 842
832 840 676 555 884 617 517 586 505 552
a. Can the state be 95% confident that the desired average is possible?
b. Generate a 99% confidence interval for $\mu$, the population average daily revenue
per camera. Explain in a complete sentence what this means.
c. Is the CLT valid in this problem? Explain.
d. Using the assumption that the distribution of $\bar{x}$ is normally distributed, calculate a
theoretical 95% confidence interval for $\mu$ (you will need to use $s/\sqrt{n}$ to estimate the
standard deviation of the $\bar{x}$'s and $\bar{x}$ to estimate $\mu$).
e. In reality, anytime we estimate parameters, like you did above in part d), we
actually shouldn't assume a normal distribution. Instead, we should assume what
is known as a $t$-distribution, which is symmetrical, though has more variability to
account for the uncertainty in our estimates.
http://www.youtube.com/watch?v=yV-0ReCXW64
For example, you will find that a 99% confidence interval for a sample of size 100
has endpoints that are 2.626 standard deviations from the mean (left and right).
Let's say your sample mean is $\bar{x} = 40$ and standard deviation $s = 5$. Then, the
confidence interval will be an interval around the sample mean. That is, one
standard deviation is $s/\sqrt{n} = 5/\sqrt{100} = 0.5$ (remember, the standard deviation of
means requires that we take the standard deviation among individual $x$'s and divide by
the square root of the sample size). So, 2.626 standard deviations would be
2.626(0.5) = 1.313 units away from the mean. The endpoints would be 40 – 1.313
and 40 + 1.313, or 38.687 to 41.313.
Formulaically, we found:
$\bar{x} \pm 2.626 \cdot \frac{s}{\sqrt{n}} = 40 \pm 2.626 \cdot \frac{5}{\sqrt{100}}$
Suppose that it is of interest to estimate the proportion of recent customers that say they would
come back and shop at your store. You take a sample and determine that, of 30 people, 20 said
they would and 10 said they wouldn't. You would like to make an inference about the population
of all of your customers. In your sample, you know that:
$\hat{\pi} = 20/30 \approx 0.667$
is the proportion of sampled customers that will come back and purchase from you again. You are
looking to find a confidence interval based on $\hat{\pi}$. How do we do that with the simulator if
we have no data?
In reality, we do. We just have to make it numerical. In reality, 20/30 is an average. It is the
average of 30 responses. If we let 1 stand for “would come back” and 0 stand for “would not come
back,” then we have a set of twenty 1's and ten 0's. We enter these into our simulator.
We run the bootstrap sample on these 1‟s and 0‟s 1,000 times. We will get a variety of sample
proportions:
[Figure: histogram of the 1,000 bootstrap sample proportions, in bins of width 0.05 running from
about 0.433 to 0.983.]
We calculate the 2.5- and 97.5-percentiles to get the middle 95% of sample proportions
generated in the bootstrap sample:
(As %) Results
Percentile 1: 97.5 0.833
Percentile 2: 2.5 0.500
Thus, we are 95% confident that the proportion of the population of customers that will shop at
your store again is between 0.50 and 0.83. This is quite a wide interval! At least you know what
to expect with 95% confidence!
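The same bootstrap idea applied to the 1's and 0's can be sketched as below. The helper names, the seed, and the simple percentile rule are assumptions made for the illustration; the endpoints will vary a little from run to run and from the applet's output.

```python
import random

# Twenty customers who would come back (1) and ten who would not (0)
responses = [1] * 20 + [0] * 10

def bootstrap_proportions(data, n_resamples=1000, seed=None):
    """Proportion of 1's in each resample drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]

def percentile_interval(values, confidence):
    """Return the endpoints of the middle `confidence` share of `values`."""
    ordered = sorted(values)
    cut = round((1 - confidence) / 2 * len(ordered))  # values skipped per tail
    return ordered[cut], ordered[len(ordered) - 1 - cut]

proportions = bootstrap_proportions(responses, seed=1)
lower, upper = percentile_interval(proportions, 0.95)
print(f"95% bootstrap interval: {lower:.3f} to {upper:.3f}")
```

Since each resample has 30 responses, every bootstrap proportion is a multiple of 1/30, which is why the endpoints land on values such as 0.500 and 0.833.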
Without providing the intuition for this method, we will simply state the results for the CLT
pertaining to the sampling distribution of $\hat{\pi}$:
The sampling distribution of $\hat{\pi}$ (which is really just an average of 0's and 1's) is
approximately normal just as long as $n\hat{\pi}$ and $n(1 - \hat{\pi})$ are both large enough
(similar idea as for the standard CLT).
With
$E[\hat{\pi}] = \pi$
$SD[\hat{\pi}] = \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}$
NOTE: a multiple of this standard deviation (about two standard deviations for 95% confidence) is
what polls usually report as the margin of error.
1. The average proportion of the sampling distribution is the true population proportion.
2. The standard deviation of proportions of the sampling distribution is the above, complex,
calculation.
Here, we get to use the standard normal distribution to calculate the number of standard
deviations corresponding to the desired interval. So, we know that:
The number of standard deviations corresponding to the middle 95% of a standard normal
distribution is calculated below:
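Thus, these endpoints are approximately 1.96 standard deviations away from the mean. So, our
confidence interval would be:
$\hat{\pi} \pm 1.96\sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}$
In our case:
Lower limit: $\frac{20}{30} - 1.96\sqrt{\frac{(20/30)(10/30)}{30}} \approx 0.498$
Upper limit: $\frac{20}{30} + 1.96\sqrt{\frac{(20/30)(10/30)}{30}} \approx 0.835$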
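The theoretical interval for the customer example (20 of 30) can be checked with a short calculation. This is a sketch under the normal approximation stated above; the function name `proportion_interval` is hypothetical, and `statistics.NormalDist` is used only to look up the 1.96 multiplier.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation interval: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n
    # Number of standard deviations that captures the middle `confidence` share
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error

lower, upper = proportion_interval(20, 30)
print(f"{lower:.3f} to {upper:.3f}")
```

The result is close to the bootstrap interval of 0.50 to 0.83 found earlier.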
1. In a sample of 55 students from Arizona State University taking a political science class,
30 say they would be interested in taking another political science class. The university is
interested in determining the proportion of all its students that are interested in taking
another political science class.
a. What is the population of interest in this study?
b. Construct a 90% bootstrap confidence interval for $\pi$, the true proportion.
c. Interpret the real-world meaning of your confidence interval.
2. A software company takes a random sample of recent orders and finds that, of the 250
sampled, 42 resulted in the return of a piece of purchased software.
a. What is the population of interest in this study?
b. Construct a 99% bootstrap confidence interval for $\pi$, the true proportion.
c. Interpret the real-world meaning of your confidence interval.
3. A batch of apples was inspected prior to shipment for any defects. Each apple was
marked as either pass (P), re-inspect (R) or fail (F). The following results were reported.
F P P P P P P P R R
P P R P R R P R P P
P R P R P F R R P P
P P P P P P P R P P
P P P F P R P P P R
Chapter 7
Hypothesis Testing
We are often faced with uncertainty. Specifically, we often want to know whether one product is
better than the other, whether one group outperforms another in some type of task, or how one
manufacturing process compares to another, among many other things. How can we ever know?
The first step would be to conduct a study and collect data. The data must then be compared.
So, you have a research question… what now? The question might at first seem obvious: let‟s
run a study. This question, however, needs some special treatment before anything else happens,
especially if the study comes at a significant cost.
For instance, suppose we're interested in determining whether pesticides damage the soil in
which we grow the majority of our food. This is a loaded curiosity. We first need to fully define
how it is that we would conduct such a study. For instance, will we be comparing two regions, one
that has been sprayed with pesticides and one that hasn't been sprayed? What is it, exactly, that
we will measure in order to determine the level of soil damage?
First and foremost, we need to formulate a hypothesis, or a belief about what it is that we expect
to see. For example,
Great, so we know what we believe. Did we just state what we wanted to happen? Probably not.
We‟ll usually formulate a hypothesis based on some existing observations. Perhaps we‟re seeing
that plants aren‟t producing as many edibles as previously thought. Or, maybe we‟re finding
rising levels of cancers. (By the way, all of the above are becoming eminent public concerns in
the U.S. and beyond.) So, based on these observations, we‟re forming an educated belief on the
effect of pesticides.
This can be a controversial question and may lack a consensus of an answer. Will it be measured
by the quantities of beneficial microbes present in the soil? By the soil‟s pH level? By the
amount of nitrogen it contains?
However we choose to measure “soil damage,” we want to be sure that we are being accurate.
That is, we need to be sure that we are actually measuring what we say we‟re measuring. This
sounds infantile, but it happens all the time that researchers say they‟re measuring something that
they‟re not actually measuring.
So, suppose we do some research and conclude that we test for soil damage by determining the
weight of vegetables harvested from these plants and comparing the average weight per plant for
the experimental group (some determined quantity of pesticides sprayed). We find that healthy
plants produce about 30 lbs. of some vegetable across their seasonal life span. Will the average
plant yield for plants sprayed with pesticides be lower?
Since we are dealing with an average in this scenario, the statistical symbol often used to
represent the average plant yield for the entire population of this particular vegetable is the
Greek letter mu, $\mu$.
Now, our experimental hypothesis is that pesticides damage the soil, measured by the pounds of
vegetables yielded from these plants. If that is the case, we would expect to see a yield of less
than 30 lbs. of vegetables per plant. That is, our hypothesis is that
$\mu < 30$
Since this is the experimental hypothesis, we have no evidence yet to conclude that this is true.
Thus, we should probably assume that there is no difference between the yields of pesticide-sprayed
and non-sprayed plants. Thus, begin by assuming that:
$\mu = 30$
This second hypothesis is called the null hypothesis, that is, the hypothesis that is assumed until
there is sufficient evidence otherwise. Symbolically, this hypothesis is written $H_0$ and is
typically read as “null hypothesis,” or “h-naught.”
The hypothesis that we believe is called the alternative hypothesis, and is written $H_a$, or “h-ay.”
When evidence is sufficient to conclude that the average is really below 30, we say that we reject
$H_0$ in favor of $H_a$.
We are cautious to make these conclusions based on sample data. Certainly, we may have
obtained an oddball sample that doesn‟t represent the population.
Let's practice writing some hypotheses. First off, let's make note of the variety of population
characteristics, called population parameters, that we can seek to describe in a study.
In a study, we seek to gain information about the target population. There are a number of things
we can test about the population parameters, the actual values. Two common ones are:
1) $\mu$, the population mean
2) $\pi$, the population proportion
Unfortunately, we do not know the true values for $\mu$ and $\pi$ and realistically cannot, unless we
sample the entire population. We can only estimate them based on the sample we collect. The
values we collect from the sample are sample statistics and are estimators for the respective
population parameters. These estimators for the values above, respectively, are notated:
1) $\hat{\mu}$ (“mew-hat”)
2) $\hat{\pi}$ (“pie-hat”)
SOLUTION:
Under the original assumption, . The researcher wants to test whether . So:
SOLUTION:
SOLUTION:
The fuse is designed and assumed to be 40 amps. That is, on average, $\mu = 40$. He wants to make
sure it is not the case that $\mu \neq 40$. So,
$H_0: \mu = 40$
$H_a: \mu \neq 40$
In our pesticide experiment, our target population is all plants of this particular variety. Thus, we
will take a random sample of plants from the pesticide group. Once we have that, we will find
the sample mean, which is called a sample statistic. That is, we can‟t possibly keep track of all
the plants in the population, so we will use the mean of the sample to help us describe the entire
population. Usually, this sample statistic is written as $\hat{\mu}$ (“mew-hat”). Suppose that you find,
from the pesticide group, that
̂
We must remember that this is just one random sample from all plants. Certainly, this sample
average is lower, but can it not just be due to random variation that we‟re seeing a difference?
After all, not all no-pesticide plants will produce exactly 30 lbs. of the vegetable.
When making conclusions about the population based on sample data, we must first ask the
question,
That is, if the probability of observing what we have just seen, or what is more extreme, is small
“enough,” then we will reject $H_0$ and conclude that $H_a$ might be a more valid conclusion.
Punchline: We shouldn't reject the null hypothesis unless the probability of seeing something as
extreme or more extreme is very small.
Imagine a medical test to determine whether or not you have some disease. Let‟s call this
disease, Disease X.
As for having the condition, you have one of two possibilities: you have it or you don‟t.
As for the test, it will either say that you have it or you don‟t.
Now, realistically, we know that there is no way to be omniscient and really know whether or not
you have the condition. However, let‟s imagine that we are all-knowing and can judge the
validity of the test. There are four possibilities:
It is evident that possibilities 2) and 3) represent scenarios where there is an inaccurate result.
That is, it would be invalid for the test to tell you that you have the condition when, in fact, you
don‟t. It would also be invalid for the test to tell you that you don‟t have the condition when, in
fact, you do.
Contrarily, we do want the test to tell us positive when we do have the condition and negative
when we don‟t.
                          Truth
Test Says     Have                              Don't Have
Positive      True Positive                     False Positive (Type II Error)
Negative      False Negative (Type I Error)     True Negative
As can be seen, the green cells represent accurate results (true results) and the red cells represent
inaccurate results (false results).
As a patient, you would probably be quite upset (devastated, even) if you received false results
for a terrible condition, such as X!
In a hypothesis test, we are up against the same dilemma: our test result can be either positive or
negative. The truth may or may not be accurately represented. Let‟s modify our table slightly to
represent the hypothesis test scenario:
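                               Truth
Hypothesis Test       $H_0$ True           $H_0$ False
Don't Reject $H_0$    Correct decision     Type II Error
Reject $H_0$          Type I Error         Correct decision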
In reality, we shouldn't reject $H_0$ (make it appear false) when it is true. If we do, we have a
false negative on our hands. Similarly, we shouldn't fail to reject $H_0$ (make it appear true) when
it is false. These are labeled Type I and Type II errors, respectively.
Unfortunately, we are not omniscient. Thus, we can never be sure that our conclusions are
accurate. If we knew, there would be no testing necessary!
On the flipside, we can determine how large of an error rate we are willing to tolerate. Earlier,
we mentioned that we will reject $H_0$ when the probability of observing something as extreme as,
or more extreme than, what we have observed is “small.” This value of small fully determines our
probability of a Type I error. As researchers, it is our duty to set this value. This probability
of a Type I error is called the criterion, or alpha-level, and is denoted with the Greek letter
alpha, $\alpha$.
Criterion/Alpha-Level
That is, rarely will we choose a very small or considerably large alpha-level.
Suppose that we reject $H_0$ when the probability of observing something as extreme as, or more
extreme than, what we have observed is 5% (or smaller). We have that $\alpha = 0.05$.
This means that, when the null hypothesis is true, there is still a 5% chance that we observe a
value (sample mean, sample proportion, etc.) extreme enough to reject it. That is, there is a 5%
chance that we falsely reject the null hypothesis. Probabilistically,
$P(\text{Type I Error}) = P(\text{Reject } H_0 \mid H_0 \text{ is true}) = 0.05$
To visualize this, consider the diagram below. Recall that a conditional probability statement
limits us to the event after the “pipe,” |, and then asks the question, “what percentage of the time
can we expect the event to occur, out of the times the specified condition occurs?” The modified
table below shows that.
                               Truth
Hypothesis Test       $H_0$ True
Don't Reject $H_0$    95%
Reject $H_0$          5%
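A small simulation can make the meaning of the alpha-level concrete: if the null hypothesis really is true and we reject whenever the probability is 5% or smaller, we should falsely reject in about 5% of studies. The sketch below is illustrative only. It assumes yields are normally distributed with the null mean of 30 lbs. from the text, and it invents a population standard deviation of 5 lbs. and a sample size of 40 purely for the example.

```python
import math
import random
from statistics import NormalDist, mean

MU_0 = 30.0    # mean yield under the null hypothesis (from the text)
SIGMA = 5.0    # hypothetical population standard deviation
N = 40         # hypothetical number of plants per study
ALPHA = 0.05

def rejects_null(sample, alpha=ALPHA):
    """One-sided z-test of H0: mu = 30 against Ha: mu < 30."""
    z = (mean(sample) - MU_0) / (SIGMA / math.sqrt(len(sample)))
    p_value = NormalDist().cdf(z)  # chance of a sample mean this low or lower
    return p_value <= alpha

def type_one_error_rate(n_studies=5000, seed=1):
    """Share of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(MU_0, SIGMA) for _ in range(N)]  # H0 is true here
        rejections += rejects_null(sample)
    return rejections / n_studies

rate = type_one_error_rate()
print(f"Share of true null hypotheses rejected: {rate:.3f}")
```

The printed share should land near 0.05, with some run-to-run variation because the studies are simulated.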
At this point we might wonder: why shouldn't we set $\alpha$ extremely small so that we minimize the
Type I error risk?
Good question. Imagine that your alpha is 0.0001. This means you will only reject $H_0$ 0.01% of
the time (or 1 out of 10,000 times) when it is true. Certainly, your risk of a Type I error is
extremely small.
Okay, so if you very rarely reject the null hypothesis, then you are also potentially committing
another act of error: not rejecting the null hypothesis, even though it may be false. That is, you
increase the likelihood of a Type II error. Recall that,
$P(\text{Type II Error}) = P(\text{Don't Reject } H_0 \mid H_0 \text{ is false})$
We can see here that rarely rejecting $H_0$ results in potentially failing to reject it even when it
should be rejected! Unfortunately, there is no free lunch in hypothesis testing.
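The trade-off can be illustrated numerically under some invented conditions. The sketch below assumes a one-sided test of a mean of 30 lbs., and it makes up a true mean of 28 lbs., a population standard deviation of 5 lbs., and a sample size of 40 so that a Type II error probability can actually be computed. None of these three values comes from the text.

```python
import math
from statistics import NormalDist

MU_0 = 30.0     # mean under the null hypothesis (from the text)
MU_TRUE = 28.0  # hypothetical true mean, so the null hypothesis is false
SIGMA = 5.0     # hypothetical population standard deviation
N = 40          # hypothetical sample size

def type_two_error(alpha):
    """P(don't reject H0 | H0 is false) for a one-sided test at level alpha."""
    standard_error = SIGMA / math.sqrt(N)
    # We reject H0 only when the sample mean falls below this cutoff
    cutoff = MU_0 + NormalDist().inv_cdf(alpha) * standard_error
    # Chance the sample mean lands above the cutoff when the true mean is MU_TRUE
    return 1 - NormalDist(MU_TRUE, standard_error).cdf(cutoff)

for alpha in (0.05, 0.01, 0.0001):
    print(f"alpha = {alpha}: P(Type II Error) = {type_two_error(alpha):.3f}")
```

As alpha shrinks, the cutoff moves further from 30, so a truly lower mean is missed more often.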
Truth
True
Hypothesis Test
Though we cannot yet easily provide numerical support for this claim (which certainly makes
sense), we will make the following preliminary conclusion:
Type II Error -
Important Caution
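Students are often confused that the probability of rejecting $H_0$ when $H_0$ is true and the
probability of failing to reject $H_0$ when $H_0$ is true sum to 1. After all, these two
possibilities are only two of the four possible results in a test decision.
However, keep in mind that these are the percentages of time we reject and fail to reject out of all
the times that $H_0$ is true! This is out of only one column total, not the entire sample space.
If
$P(\text{Type I Error}) = P(\text{Reject } H_0 \mid H_0 \text{ is true}) = \alpha$
, then,
$P(\text{Don't Reject } H_0 \mid H_0 \text{ is true}) = 1 - \alpha$
Similarly,
If
$P(\text{Type II Error}) = P(\text{Don't Reject } H_0 \mid H_0 \text{ is false}) = \beta$
, then,
$P(\text{Reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$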
The probability that we reject the null hypothesis when it is false is referred to as the power of
the test. We summarize these in the table below:
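                               Truth
Hypothesis Test       $H_0$ True                   $H_0$ False
Conclusion
Don't Reject $H_0$    Correct ($1 - \alpha$)       Type II Error ($\beta$)
Reject $H_0$          Type I Error ($\alpha$)      Correct; power ($1 - \beta$)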
Example 4: The college dropout rate for a particular county is known to be 30%. The
educational board of a city within the county believes its dropout rate is significantly lower.
The board follows 60 students and, of them, 15 drop out. The board wants to run a statistical
hypothesis test with $\alpha = 0.05$ to determine whether their belief is true. Describe the
hypothesis test by:
a. Writing competing hypotheses
b. A decision rule for rejecting $H_0$
c. A decision criterion rule
SOLUTION:
a.) $H_0: \pi = 0.30$ and $H_a: \pi < 0.30$
b.) We will reject $H_0$ if the probability of observing something as extreme as, or more extreme
than, 15 out of 60 dropouts ($\hat{\pi} = 15/60 = 0.25$) under the assumption of the null hypothesis
is less than or equal to 0.05. That is:
$P(\hat{\pi} \le 0.25 \mid \pi = 0.30) \le 0.05$
c.) We will reject $H_0$ if the observed value of $\hat{\pi}$ is smaller than some cutoff value of
$\hat{\pi}$. That is, it might be the case that the number of dropouts would have to be smaller than,
say, 13 in order for us to reject the null hypothesis.
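Both rules in Example 4 can be made concrete with a direct calculation. The sketch below is one possible way to do it, not necessarily the method the text goes on to use: it assumes the number of dropouts among 60 students follows a binomial distribution with a 30% dropout rate when the null hypothesis is true, and the helper name `binomial_cdf` is made up for the example.

```python
from math import comb

N = 60          # students followed
P_NULL = 0.30   # dropout rate assumed under the null hypothesis
ALPHA = 0.05
OBSERVED = 15   # dropouts actually observed

def binomial_cdf(k, n, p):
    """P(X <= k) when X counts successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Rule b: probability of 15 or fewer dropouts if the null hypothesis is true
p_value = binomial_cdf(OBSERVED, N, P_NULL)
print(f"P(15 or fewer dropouts | rate is 30%) = {p_value:.4f}")
print("Reject H0" if p_value <= ALPHA else "Do not reject H0")

# Rule c: the largest dropout count that would still lead to rejection
cutoff = max(k for k in range(N + 1) if binomial_cdf(k, N, P_NULL) <= ALPHA)
print(f"Reject H0 only if the number of dropouts is {cutoff} or fewer")
```

The two rules always agree: the observed count leads to rejection under rule b exactly when it is at or below the cutoff from rule c.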
As we see from the above example, our hypothesis test needs to have a structured layout. We
need to know ahead of time what we‟ll do.
It is tempting, but we cannot determine our rejection criterion based on what the sample data
tells us! In practice, you can carry this type of philosophy, but you increase the error rate.
Consider, for example, the scenario wherein you take an exam for a biology class. You get the
results back and look at what you missed. You say, “oh, of course I should have put that! I knew
that!” If you told that to the instructor, she may say, “sorry, you didn‟t demonstrate that on the
exam.” Without surprise, we expect this response. Why? Because, it is the test that helps to
determine our level of understanding! It is not the other way around. If the instructor allowed
you to change your answer, then the test wouldn‟t really be demonstrating what you knew at that
time of the test. A hypothesis test is quite analogous. We carry one out because we have a hunch.
If you dig long enough in your data, you will find something!
This, however, looks upon the digging process as a negative thing since it does not justify the
decision questions. In fact, it creates a high likelihood that we are observing a coincidence and
not a solid finding at all! Thus, we increase the probability of error exponentially!
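To see how quickly digging through data inflates the error rate, consider running several tests when every null hypothesis is actually true. The sketch below assumes the tests are independent and each uses the same alpha-level, which is a simplification; the function name is illustrative.

```python
def chance_of_false_finding(n_tests, alpha=0.05):
    """P(at least one Type I error) across n_tests independent true nulls."""
    # Each test avoids a false rejection with probability 1 - alpha
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 50):
    print(n_tests, round(chance_of_false_finding(n_tests), 3))
```

With 20 such tests, the chance of at least one coincidental "finding" is already about 64%.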
As an important note: we never say, “accept $H_0$ as true.” Instead, we remain accurate and say
that there is simply not enough evidence to reject it. Think about this as “innocent,” vs. “not
guilty.” Just because a court cannot prove that someone is guilty, they don‟t say that he is
innocent. Instead, they give the verdict of “not guilty.”
1. In your own words, explain the difference between the null and alternative hypotheses.
Also, explain how to identify each in a research study.
2. Explain why we assume that the null hypothesis is true before testing a hypothesis.
5. A snack dispenser has a failure rate of over a 5-year span. After changes to the
machine, the manufacturer would like to know whether or not this has changed. Write
competing hypotheses.
7. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test
for the scenario in question 3, assuming and that he finds that only 52 out of
1000 bushels of his crop are lost to insect infestations.
8. Based on the “Structure of a Hypothesis Test” blue box, fully describe the hypothesis test
for the scenario in question 4, assuming and that he finds his students have
been averaging ̅ on the test.
10. In real-world terms, describe what Type I and II errors would mean for each of questions
3, 4, and 5.
1.
a. Nominal; ice cream names cannot be ordered, in general.
b. Interval; temperatures have order and the differences in temperature can be
reasonably discussed. For example, to talk about a difference is meaningful.
c. Ratio: Absolute 0 exists since there can be no balance at all. Additionally, it
makes sense to talk about ratios. For instance, accounts receivable balances can
be, say, 20% higher this month as compared to last.
d. Ordinal; there is an ordering, though we can‟t talk about the number 1 candidate
as being 2 better than the number 3 candidate. This is because the difference of 1
might not necessarily be the same from 1 to 2 as it would be from 2 to 3. Maybe
candidate 3 is a far third.
2.
a. 2,121 elements in the sample
b. Length of time is a quantitative variable, since it is a numerical measure.
3.
a. 15,000 elements in the sample
b. A proportion is a quantitative variable, since it is a ratio.
4.
a. Observational; the number of animals a family has is not being assigned.
Instead, families are simply being asked about how many animals they have.
b. The study might have considered families with horses. People with horses likely
live on the outskirts of a big city, perhaps being exposed to less pollen. Also,
maybe more families have pets because their children do not seem to have
allergies to them.
5.
a. Observational; the researchers are looking at preexisting habits. They are not
attempting to alter the habits to determine what effect doing so might have on
measures of reading ability and short-term memory.
b. No; perhaps those who watch more television also have other habits that lead
them to scoring poorly on such assessments.
6.
a. Observational; the opinions of the doctors are not being altered in any way.
b. There is a nonresponse bias since not all participants responded. Thus, it might be
the case that those with the strongest opinions decided to come forward, whereas
the other 17,000 who didn‟t respond might have influenced the poll in a different
way.
1.
a. $4 million/day
b. If all days had the same gross revenue, $4 million would be earned.
c. $7.6 million
d. The gross revenue earned on one day can differ from that of another day by as
much as $7.6 million.
e. The film has generated an average of $4 million/day. There is much instability in
this average in that the actual gross revenue has varied from $1.6 million to $9.2
million, a range of $7.6 million. It is dangerous to place too many bets on what
might happen next, due to the extreme variability in revenues.
2.
a. 18 randomly selected college students
b. All college students
c. Answers vary; spending on clothing, style preference, etc.
d. Inferential; they wish to make conclusions about the population of all college
students
3.
a. 250 packages of cheese selected
b. All packages of cheese produced by the company
c. 248 or more must pass
0, 1, 2, 2, 3, 2, 28, 29, 30
0, 1, 2, 3, 4, 3, 4, 2, 1, 30
While both have a range of 30, the first dataset has most of its data towards the outer ends
of the dataset. In the second dataset, there appears to be tightly spaced data, followed by one
outlier of 30. The second dataset is, overall, less spread out.
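The comparison above can be checked numerically. The following is a minimal sketch using only the Python standard library. The variable and function names are illustrative and do not come from the text, and the second list assumes its final value is 30, as the stated range of 30 requires.

```python
import statistics

# The two datasets from the answer above.
first = [0, 1, 2, 2, 3, 2, 28, 29, 30]
second = [0, 1, 2, 3, 4, 3, 4, 2, 1, 30]  # assumes the last value is 30


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


range_first = data_range(first)
range_second = data_range(second)

# Sample standard deviation (divides by n - 1).
sd_first = statistics.stdev(first)
sd_second = statistics.stdev(second)

print("range:", range_first, range_second)
print("standard deviation:", round(sd_first, 1), round(sd_second, 1))
```

The two ranges agree, while the standard deviation of the second dataset comes out smaller. This is why the range alone is a poor description of spread.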
5. The researchers are trying to use CGCC students as a representative population of all
college students. This presents a bias, in that CGCC probably does not accurately
represent all college students.
1.
a. Standard deviation = 5.9; on average, beers in this sample are within 5.9 calories
of the average calorie content.
Mean 60.9
Standard Error 3.9
Median 61.9
Mode 61.9
Standard Deviation 18.7
Sample Variance 351.2
Kurtosis -0.4
Skewness 0.4
Range 64.3
Minimum 34.6
Maximum 98.9
Sum 1401.2
Count 23.0
d. Yes, it is skewed to the right, since the skewness value is 0.4, a positive value.
e. [Histogram; axis label: Percentage]
The majority of people in Central Africa are not enrolled in school, since it is
predominantly the case that fewer than 50% of people in each nation attend school.
( )
of all enrollment percentages would be within one standard deviation of the mean.
This is considered to be a very normal percentage (it is still within the “average”
spread).
3.
a. The range is 5750, which tells us that there is a difference of 5,750 feet from the
shortest street to the longest street. The interquartile range is 2170, telling us that
the middle 50% of all street lengths range from 980 feet to 3,150 feet. The
standard deviation is 1634, telling us that, on average, a street varies by 1,634 feet
from the mean street length.
b. The interquartile range is 2170, telling us that the middle 50% of all street lengths
range from 980 feet to 3,150 feet. The standard deviation is 1634, telling us that,
on average, a street varies by 1,634 feet from the mean street length.
c.
Mean 2231.4
Standard Error 238.4
Median 2100.0
Mode 960.0
Standard Deviation 1634.1
Sample Variance 2670328.9
Kurtosis -0.2
Skewness 0.8
Range 5750.0
Minimum 100.0
Maximum 5850.0
Sum 104874.0
Count 47.0
This means that a street length of 79.6 feet would be about 1.3 standard deviations
below the mean.
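The "1.3 standard deviations below the mean" figure is a z-score. As a minimal sketch, it can be reproduced from the mean and standard deviation in the summary table above. The function name `z_score` is illustrative and not from the text.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from the mean.

    Negative results are below the mean, positive results are above it.
    """
    return (value - mean) / std_dev


# Summary statistics reported in the table above (street lengths, in feet).
mean_length = 2231.4
sd_length = 1634.1

z = z_score(79.6, mean_length, sd_length)
print(round(z, 1))  # -1.3
```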
f. [Histogram: Street Length. Vertical axis: Relative Frequency, 0% to 35%. Horizontal axis: Feet, with bins 100-1099, 1100-2099, 2100-3099, 3100-4099, 4100-5099, and 5100-6099.]
4. Answers vary;
[Four example histograms, each with bins 100 to 120, 120 to 140, 140 to 160, 160 to 180, and 180 to 200, including one labeled Right Skewed and one labeled Left Skewed.]
5.
a.
Mean 971
Standard Error 382
Median 738
Mode 0
Standard Deviation 1,207
Sample Variance 1,455,875
Kurtosis 7
Skewness 2
Range 4,194
Minimum 0
Maximum 4,194
Sum 9,707
Count 10
Due to the great variability in repair costs, it would be most appropriate to use the
median as the measure of center. It also reflects the fact that most repair costs, if there
are any, tend to be between $600 and $1000. Since the standard deviation
describes movement about the mean, it is not appropriate to be used in
combination with a median. Thus, we should probably use the interquartile range
to describe the middle 50% of repair costs.
b.
The repair cost of $4,194 is nearly 3 standard deviations above the mean. This
means that it is an outlier cost.
6.
Mean 12.35
Standard Error 0.62
Median 12.91
Mode #N/A
Standard Deviation 1.97
Sample Variance 3.90
Kurtosis -0.50
Skewness -0.60
Range 6.03
Minimum 8.81
Maximum 14.84
Sum 123.47
Count 10.00
There do not appear to be extreme outliers, since the mean and median are close. However,
based on the mean being smaller than the median, and the skewness value being negative, there
is a slight left-skew to the distribution. The standard deviation tells us that, on average, CC ratios are
within 1.97, or 197 percentage points, of the mean. We verify these notions by considering the histogram.
[Histogram: CC Ratio Distribution. Vertical axis: relative frequency, 0% to 45%. Horizontal axis: CC Ratio.]
We should also be careful to note that there is not very much data available, which is why we
don't distinctly see a skew.
7.
Mean 46.35
Standard Error 9.395205
Median 36
Mode 40
Standard Deviation 42.01663
Sample Variance 1765.397
Kurtosis 0.09474
Skewness 0.949789
Range 136
Minimum 0
Maximum 136
Sum 927
Count 20
The distribution of nitrous oxide emissions is skewed to the right indicating that most states have
relatively low emissions, whereas fewer states have relatively high emissions. We note that the
median is a good measure, indicating that 36 thousand tons is the 50th percentile. There are two
outliers of 136 thousand tons. For this value, $z = (136 - 46.35)/42.02 \approx 2.1$, indicating that at least around
75% of all values in the data set are within 2.1 standard deviations of the mean. Thus, 136 can be
considered a mild outlier.
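The "at least around 75%" statement is consistent with Chebyshev's rule, which says that at least 1 - 1/k^2 of any dataset lies within k standard deviations of the mean. The sketch below assumes that this is the rule the answer relies on. The function names are illustrative, and the mean and standard deviation are taken from the summary table above.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from the mean."""
    return (value - mean) / std_dev


def chebyshev_lower_bound(k):
    """Minimum proportion of any dataset within k standard deviations
    of the mean, by Chebyshev's rule. Only informative for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's rule needs k > 1 to say anything useful")
    return 1 - 1 / k**2


# Summary statistics from the table above (thousand tons).
mean_emissions = 46.35
sd_emissions = 42.01663

z = z_score(136, mean_emissions, sd_emissions)
bound = chebyshev_lower_bound(round(z, 1))

print(round(z, 1))      # 2.1
print(round(bound, 2))  # 0.77
```

At 2.1 standard deviations the guaranteed proportion is a little above 75%, which matches the hedged wording "at least around 75%".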
1.
a. [Bar chart: Probability, 0 to 0.5, versus Size (inches) for sizes 12, 14, 16, and 18.]
b. ( )
c. ( )
d. $E[X] = 12\,P(12) + 14\,P(14) + 16\,P(16) + 18\,P(18)$ inches per pizza, on
average.
e. ( ) (doesn't include the 12-inch pizza!)
2.
a. ( )
b. ( )
3.
a. , so ( ) for
b. ( )
c. ( )
d. $\mu = 5$; on average, the professor dismisses class 5 minutes after the hour.
e. $\sigma \approx 2.9$; on average, the time after the hour at which the professor dismisses
class varies by 2.9 minutes about the mean.
f. ( ) ( )
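The mean of 5 and standard deviation of 2.9 in parts d and e are consistent with a continuous uniform distribution on the interval from 0 to 10 minutes. That interval is an assumption inferred from those two values. It is not stated in the answer. Under that assumption, a minimal sketch of the calculations looks like this (function names are illustrative):

```python
import math


def uniform_mean(a, b):
    """Mean of a continuous uniform distribution on [a, b]."""
    return (a + b) / 2


def uniform_std_dev(a, b):
    """Standard deviation of a continuous uniform distribution on [a, b]."""
    return (b - a) / math.sqrt(12)


def uniform_probability(a, b, low, high):
    """P(low <= X <= high) for X uniform on [a, b].

    The density has constant height 1 / (b - a), so the probability is the
    width of the overlap between [low, high] and [a, b] times that height.
    """
    low, high = max(low, a), min(high, b)
    if high <= low:
        return 0.0
    return (high - low) / (b - a)


a, b = 0, 10  # ASSUMED interval, in minutes after the hour

mu = uniform_mean(a, b)
sigma = uniform_std_dev(a, b)
# Hypothetical query, not one of the problem's parts: dismissal within 2 minutes.
p_first_two = uniform_probability(a, b, 0, 2)

print(mu, round(sigma, 1), p_first_two)
```

Because the density is flat, every probability is just a rectangle area: width times height.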
4.
a. ( )
c. ( ) ( ) ( )
d. ( )
5.
a. , so ( ) for
b. ( )
c. ( )
d. Both probabilities are equal, $P(X \le x) = P(X < x)$, because, in a continuous distribution, the
probability that $X = x$ is 0.
1.
a.
b.
d.
b. The long-run proportion of all children born in the U.K. expected to weigh at
most 10 lbs. is 0.9814.
d. The long-run proportion of all children born in the U.K. expected to weigh
between 1 and 2 lbs. is 0.0000.
6. In a recent year, Scholastic Aptitude Test (SAT) scores for all college-bound seniors in
the United States were such that points and points (SOURCE:
http://www.collegeboard.com).
a. 50% of students scored less than how many points?
b. 50% of students scored more than how many points?
c. In order to be in the top 10% of SAT-takers, what score would one have to
achieve?
d. What score do the lowest 10% score between?
e. The middle 50% of students scored between what two values?
d. About 1123.
4.
a.
5.
a. The distribution would maintain its exact shape, though would be shifted 10 units
to the right.
b. The distribution would become wider and have a lower peak. This must happen to
make sure the area is still 1 when the distribution becomes wider.
c. The distribution would become narrower and have a higher peak. If a distribution
becomes narrower, its height must increase to maintain an area of 1.
d. The mean, $\mu$, determines where the distribution is centered without altering its
shape. The standard deviation, $\sigma$, will make a distribution wide and low-peaked if
it is large, and will make a distribution narrow and high-peaked if it is small.
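The effect of the mean and standard deviation on the curve can be seen by evaluating the normal density at its center. This is a minimal sketch. The parameter values are hypothetical, chosen only for illustration, and `normal_pdf` is an illustrative name.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coefficient * math.exp(exponent)


# Hypothetical parameter values, for illustration only.
base_peak = normal_pdf(0, mu=0, sigma=1)       # peak of the original curve
shifted_peak = normal_pdf(10, mu=10, sigma=1)  # mean increased by 10
wide_peak = normal_pdf(0, mu=0, sigma=2)       # standard deviation doubled
narrow_peak = normal_pdf(0, mu=0, sigma=0.5)   # standard deviation halved

print(base_peak, shifted_peak, wide_peak, narrow_peak)
```

Shifting the mean leaves the peak height unchanged. Doubling the standard deviation halves the peak height, and halving it doubles the peak height, which keeps the total area at 1.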
1. Answers vary
2. Answers vary – emphasis on the ability to have a population distribution with any
unknown shape.
3.
a. 0.2525
b. 0.2514
c. 0.9044
d. 95.1 and 104.9
4.
a. 0.0272
1. Answers vary
Our 95% confidence interval would be 652.1 to 755.1, which is close to our
bootstrap confidence interval. It is a bit wider than we would like.
Where
( )
( )
Or
( )
This is a bit wider, accounting for the extra variability in estimating and .
1. The null hypothesis is assumed to be true and is usually based on what has been observed
before. The alternative hypothesis is what we would like to test, which is something that
would challenge past observations or assumptions about a population.
3.
4.
5.
7.
1) Hypotheses: $H_0: p = 0.07$ versus $H_a: p < 0.07$
2) Decision Rule: We will reject the null hypothesis when the likelihood of
observing something as small or smaller than 52 out of 1000 bushels is no
larger than a 1% probability, under the assumption of the null hypothesis. That is,
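$P(X \le 52 \mid p = 0.07) \le 0.01$, where $X$ is the number of destroyed bushels out of 1000.
3) We will reject $H_0$ if the observed value of $X$ is smaller than some cutoff
value $c$.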
4) Based on the sample evidence, we will either:
a. Reject $H_0$ in favor of less than 7% insect-related crop destruction for the
farmer's new method.
b. Fail to reject $H_0$. We do not have sufficient evidence to conclude that the
farmer's new method is better than his old method.
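The decision rule in problem 7 can be sketched with an exact binomial calculation. The code below assumes the null hypothesis is a 7% destruction rate (p = 0.07) and that bushels are independent, so the number of destroyed bushels out of 1000 is binomial. Both are assumptions made for illustration, and the function names are not from the text. The code computes the lower-tail probability of 52 or fewer and the largest cutoff whose tail probability stays at or below 1%.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


n = 1000        # bushels sampled
p_null = 0.07   # ASSUMED destruction rate under the null hypothesis
alpha = 0.01    # significance level from the decision rule
observed = 52   # destroyed bushels observed

# Probability of a result as small or smaller than the one observed.
p_value = binomial_cdf(observed, n, p_null)

# Largest cutoff c with P(X <= c) <= alpha: reject when observed <= c.
cutoff = -1
cumulative = 0.0
for c in range(n + 1):
    cumulative += binomial_pmf(c, n, p_null)
    if cumulative > alpha:
        break
    cutoff = c

reject = p_value <= alpha
print("p-value:", p_value)
print("cutoff:", cutoff)
print("reject the null hypothesis:", reject)
```

Comparing the p-value with 1% and comparing the observed count with the cutoff are two views of the same rule, so they always lead to the same decision.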
8.
1) Hypotheses:
2) Decision Rule: We will reject the null hypothesis when the likelihood of
observing something as large or larger than the observed $\bar{x}$ is no larger than a 5%
probability, under the assumption of the null hypothesis. That is,
$P(\bar{X} \ge \bar{x} \mid H_0 \text{ true}) \le 0.05$
9.
1) Hypotheses:
( )
10.
1) Type I: We conclude the farmer's method reduces crop destruction, when there is
no difference; Type II: We conclude the farmer's method is no different than the
old method, when in fact there is less than 7% crop destruction with his new
method.
2) Type I: We conclude the instructor's students perform better than his former
students, when in fact there is no difference; Type II: We conclude that his new
students perform just as well as his former students, when in fact they do better.
3) Type I: We conclude that the new machines fail more or less than the former
machines, when in fact there is no difference; Type II: We conclude that there is
no difference between the failure rates of the new and old machines, when in fact
there is a significant difference.
11. Increasing means we will reject less often, as we set more stringent conditions upon
the rejection process. If we reject less often, then there is an elevated likelihood that we
may fail to reject, when in fact we should. This is precisely what a Type II error is.
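The trade-off described above can be illustrated numerically. The sketch below reuses the crop setting with an assumed null rate of 7% and a hypothetical true rate of 5%. Both values, and all function names, are assumptions for illustration only. It compares the probability of a Type II error under a looser rejection rule (5% significance) and a more stringent one (1% significance).

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k is negative."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


def lower_tail_cutoff(n, p_null, alpha):
    """Largest c with P(X <= c | p_null) <= alpha. We reject when X <= c."""
    cumulative = 0.0
    cutoff = -1
    for c in range(n + 1):
        cumulative += binomial_pmf(c, n, p_null)
        if cumulative > alpha:
            break
        cutoff = c
    return cutoff


def type_ii_probability(n, p_null, p_true, alpha):
    """P(fail to reject the null) when the true proportion is p_true."""
    cutoff = lower_tail_cutoff(n, p_null, alpha)
    return 1 - binomial_cdf(cutoff, n, p_true)


n = 1000
p_null = 0.07  # ASSUMED rate under the null hypothesis
p_true = 0.05  # HYPOTHETICAL true rate, chosen only for illustration

beta_loose = type_ii_probability(n, p_null, p_true, alpha=0.05)
beta_strict = type_ii_probability(n, p_null, p_true, alpha=0.01)

print("Type II error probability at 5% significance:", beta_loose)
print("Type II error probability at 1% significance:", beta_strict)
```

The more stringent rule has the smaller cutoff, so it rejects less often and leaves a larger chance of failing to reject when the new method really is better.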