
Experimental Data Processing

Introduction
Why is it necessary to study this discipline?
As an argument we use a saying made of two parts. The first part refers to the observation
that American banknotes carry the words IN GOD WE TRUST. To this we add the second part of
the saying: EVERYTHING ELSE NEEDS DATA. The data underlying any argument must be valid,
free of errors, and coherent. This course is needed in order to read and interpret
such data.

Why Study Statistics?


Consider the following reasons:
To evaluate printed numerical facts.
To interpret the results of sampling or to perform statistical analysis in your work.
To make inferences about the population using information collected from the sample.

Statistics is a branch of science concerned with:

collecting data
organizing and structuring data
analyzing data
making predictions, decisions, or inferences.
The last item is in fact the goal of statistics: to draw inferences about a population
based on the information contained in a representative group taken from that population.

Learning objectives
After completing the seminar you will be able to:
formulate a response to the question "What is statistics?"
obtain good data.
distinguish between scientific studies and observational studies.
organize data into tables and use appropriate graphical techniques to describe various
data sets.
identify variables as qualitative (binary, ordinal, nominal) or quantitative
(discrete, continuous).

Definitions
Several concepts need to be defined. The first is the population: a group of objects, called
elements, that share a common property characteristic of the whole group under study. For
example, if we are talking about students, the population should be all the students we have
in mind.
Example 1. The students enrolled in the PDE course at an institution form a population,
because no other students have the same property.
Because it is sometimes nearly impossible to analyze the entire population, a small,
representative part of it is considered instead. This part is called a sample.
Level of measurement. Data can be classified into the following four levels of measurement:
nominal data: consist of names, labels, or categories (gender, particular groups of people
grouped by interests, aspirations, privileges, etc.). These data cannot be ordered (for
example from largest to smallest) and cannot be processed arithmetically.
ordinal data: can be arranged in a meaningful order, but the differences between values
are not meaningful.
interval data: similar to ordinal data, with the addition that differences between values
are meaningful; there is no natural zero.
ratio data: interval data with a natural zero, so that ratios are meaningful (for example,
the number of jobs held by men relative to women is 8/1).
Example 2. Identify the level of measurement for each of the following data:
the years in the historical period 1821-1877
the annual income of master's students: 0 lei - 10000 lei
the gender of students: male, female
Answer
The years between 1821 and 1877 are interval data. There is no natural zero year, and
dividing the values is meaningless (what would 1821/1877 represent?), so no ratio can be
formed. Subtraction, however, is meaningful: the period covered is 1877 - 1821 = 56 years.
The annual income of students is ratio data. Division is meaningful here: if one person
has an income of 2000 and another of 40000, the incomes can be compared as a ratio. Also,
if a student has an income of 0, the value 0 is meaningful.
The gender of students is nominal data: the values cannot be ordered, and they cannot be
added, subtracted, or divided.
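The reasoning in Example 2 can be sketched in Python. This is only an illustration: the table of meaningful operations and the helper name `allows` are assumptions introduced here, not part of the course material.

```python
# Which comparisons are meaningful at each level of measurement
# (illustrative table; the names are assumptions made for this sketch).
MEANINGFUL = {
    "nominal": {"equality"},
    "ordinal": {"equality", "order"},
    "interval": {"equality", "order", "difference"},
    "ratio": {"equality", "order", "difference", "ratio"},
}


def allows(level, operation):
    """Return True if `operation` is meaningful for data at `level`."""
    return operation in MEANINGFUL[level]


# Years are interval data: a difference is meaningful, a quotient is not.
period_length = 1877 - 1821
assert allows("interval", "difference") and not allows("interval", "ratio")

# Income is ratio data: zero means "no income", so quotients are meaningful.
income_ratio = 40000 / 2000
assert allows("ratio", "ratio")

# Gender is nominal data: categories can be told apart but not ordered.
assert allows("nominal", "equality") and not allows("nominal", "order")
```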
The characteristics recorded for each individual in a population are called variables.
Variables can take different kinds of values: some are numbers, others are categories. For
example, "the number of doors in a dwelling is 10" and "the area of the laboratory is
68.5 m²" are numeric data. On the other hand, for a statement such as "the house is a
single-family home", a numerical value of the variable is meaningless. This leads to the
classification of variables into quantitative and qualitative.
A qualitative (categorical) variable places each individual into a category; an example is
the classification by gender in the previous example.
The first two measurements in the previous example are quantitative data.
variables
    qualitative
    quantitative
        discrete
        continuous
Quantitative variables can be of two types: discrete and continuous.
A discrete variable is a quantitative variable that takes a finite or countable set of values.
Example:
the number of children in a family and the number of friends a student has at the university
are discrete variables.
Continuous variables are quantitative variables that can take any value within an interval,
and therefore infinitely many values. They are also called scale variables (interval or
ratio); examples are the height, weight, and age of students.
To process such data and reach the best decision, statistical data processing has two
branches: descriptive statistics and inferential statistics.

1.2 - Collecting Data

How Data Are Collected
The following are a few frequently used methods of collecting data:

Personal Interview: People usually respond when asked by a person, but their
answers may be influenced by the interviewer.
Telephone Interview: Cost-effective, but needs to be kept short since
respondents tend to be impatient.
Self-Administered Questionnaires: Cost-effective, but the response rate is
lower and the respondents may be a biased sample.
Direct Observation: For certain quantities of interest, one may be able to
measure them directly from the sample.
Web-Based Survey: Can only target the population that uses the web.

Strategies for Collecting Data


How can we get data? For instance, how do we select data for a
study? There are two types of methods for collecting data:
non-probability methods and probability methods.
Non-probability Methods
These might include:
Convenience sampling (haphazard): for instance, surveying
students as they pass by in the university's student union
building.
Gathering volunteers: for instance, using an advertisement in
a magazine or on a website inviting people to complete a form
or participate in a study.
Probability Methods
Simple random sample: making selections from a population
where each subject in the population has an equal chance of
being selected.
Stratified random sample: having first identified the
population of interest, you divide this population into
strata or groups based on some characteristic (e.g. sex,
geographic region), then take a simple random sample from
each stratum.
Cluster sample: a random cluster of subjects is taken
from the population of interest. An example might be grabbing
a handful of M&M's from a large jar of M&M's.
The primary benefit for using probability sampling methods is the
ability to make inference. The results can be extended or
generalized to the population from which the sample came. This
comes as the result of probability methods providing a
representative sample from a population of interest. This is not the
case with nonprobability samples. When nonprobability methods
are used you can make generalizations about the sample, not the
population. We can assume that by using random sampling we
attain a representative sample of the population.
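The three probability methods described above can be sketched with Python's standard `random` module. The population of twelve passengers, their travel classes, and the flight labels are made up for illustration, and the function names are assumptions of this sketch rather than anything prescribed by the course.

```python
import random

# Hypothetical population: (passenger_id, travel_class, flight) tuples.
population = [
    (f"P{flight}{seat}", cls, flight)
    for flight in ("A", "B", "C")
    for seat, cls in enumerate(("first", "business", "economy", "economy"))
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simple_random_sample(pop, k):
    """Every subject has the same chance of being selected."""
    return rng.sample(pop, k)


def stratified_sample(pop, key, k_per_stratum):
    """Split the population into strata, then draw a simple random sample from each."""
    strata = {}
    for subject in pop:
        strata.setdefault(key(subject), []).append(subject)
    return {name: rng.sample(members, k_per_stratum) for name, members in strata.items()}


def cluster_sample(pop, key, n_clusters):
    """Pick whole clusters at random and keep every subject in the chosen clusters."""
    clusters = sorted({key(subject) for subject in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [subject for subject in pop if key(subject) in chosen]


srs = simple_random_sample(population, 4)
stratified = stratified_sample(population, key=lambda s: s[1], k_per_stratum=1)
clustered = cluster_sample(population, key=lambda s: s[2], n_clusters=1)
```

Here the strata are the travel classes and the clusters are the flights. The stratified sample contains one passenger from every class, while the cluster sample contains every passenger of one randomly chosen flight.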
Example: Airline Company Survey of Passengers
Here is an example of different sampling techniques and how they
might be implemented. Let's say that you are the owner of a large
airline company and you live in Los Angeles. You want to survey your
L.A. passengers on what they like and dislike about traveling on
your airline.
Convenience Sampling: Since you live in L.A., you go to the airport
and just interview passengers as they approach your ticket counter.
Volunteer Sampling: You have your ticket counter personnel
distribute a questionnaire to each passenger requesting they
complete the survey and return it at the end of the flight.
Simple Random Sampling: You randomly select a set of passengers
flying on your airline and question those that you have selected.
Stratified Sampling: You stratify your passengers by the class they
fly (first, business, economy), and then take a random sample from
each of these strata.
Cluster Sampling: You group your passengers by the class they fly
(first, business, economy), randomly select such classes from
various flights, and survey each passenger in the selected class
and flight.

Example: Phone Survey Snafu


In predicting the 2008 Iowa Caucus results, a phone survey said
that Hillary Clinton would win, but instead Obama won. Where did
they go wrong? The survey was based on landline phones, which skewed
the sample toward older people, who tended to support Clinton.
However, many younger people got involved in this election and
voted for Obama, and they could only be reached by cellphone.
Margin of Error
Are surveys random or nonrandom? How are the samples
drawn? If samples are taken randomly, we can include some
measure of error, which helps us understand better where the truth
lies. We call this the margin of error, e.g. 3.1%. The margin of
error for a random survey can be approximated by:

1/sqrt(n), where n = sample size.


A common margin of error of interest in the survey industry is 3%.
During U.S. presidential elections several random polls (e.g. Gallup
and N.Y. Times) will be conducted to estimate who will win an
upcoming election. With a desired margin of error equal to 3% and
approximately 110 million registered voters, how many registered
voters do they need to attain this margin of error? What might be
surprising is the answer. Working backward to solve for n, where
we have 0.03 = 1/sqrt(n), we come to n = 1/0.03^2 = 1111.11,
which rounds up to 1112 voters.
Not many! Keep in mind that this is the number they need to
participate in the poll and not simply the size of the sample they
would take from the population of all voters. If they expect about a
50% response rate (i.e. half the people they contact do not
participate; think of yourself receiving a call but declining) then
they would need to sample at least twice this number in order to
finish with the desired sample size. Of course their actual sampling
designs are not necessarily this simple as they would want to
consider the effect party affiliation, age, sex, geographic location,
race, etc. may have in how a voter selects their candidate.
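The arithmetic above can be reproduced with a short Python sketch. It uses only the approximation 1/sqrt(n) from the text; the function names are assumptions introduced for illustration.

```python
import math


def margin_of_error(n):
    """Approximate margin of error for a random survey of n respondents."""
    return 1 / math.sqrt(n)


def required_sample_size(margin):
    """Smallest n whose approximate margin of error does not exceed `margin`."""
    return math.ceil(1 / margin ** 2)


def people_to_contact(n_needed, response_rate):
    """Number of people to contact so that about n_needed actually respond."""
    return math.ceil(n_needed / response_rate)


n = required_sample_size(0.03)        # respondents needed for a 3% margin
contacts = people_to_contact(n, 0.5)  # contacts needed at a 50% response rate

print(n, contacts, round(margin_of_error(n), 4))
```

Note that the number of registered voters does not enter the calculation: under this approximation the margin of error depends only on the number of respondents.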
Types of Studies
There are predominantly two different types of studies:
1. Observational: these studies show that there is a
relationship.
2. Experimental: this involves some random assignment of a
treatment; researchers can draw cause and effect (or causal)
conclusions.
* Random selection (a probability method of sampling) is not
random assignment (as in an experiment). In an ideal world you
would have a completely randomized experiment; one that
incorporates random sampling and random assignment.
Example
Let's say that there is an option to take quizzes throughout this
class. In an observational study we may find that better students
tend to take the quizzes and do better on exams. Consequently, we
might conclude that there may be a relationship between quizzes
and exam scores. In an experimental study we would randomly
assign quizzes to specific students to look for improvements. In
other words, we would look to see whether taking quizzes causes
higher exam scores.
Ethics is an important aspect of experimental design to keep in
mind. For example, the original relationship between smoking and
lung cancer was based on an observational study and not an
assignment of smoking behavior.
Variables
Variables can be explanatory (predictor or independent) or response
(outcome or dependent). A variable can serve as an explanatory
variable in one study but as a response in another. For instance,
consider the variables Sex (Female, Male) and Height (in inches).
Which variable do you believe explains the other? Would it make
more sense to say a person's sex more likely explains that person's
height, or to say a person's height explains that person's sex? In
this case Sex (explanatory) would explain Height (response). On
the other hand, consider the variables Height and Weight. In this
case, a person's height would more likely explain their weight than
the other way around. Now Height serves as the explanatory
variable.
Let's think about the example of height and weight. Typically height
(the explanatory variable) explains weight (the outcome variable).
In tabular data the predictor variable is usually displayed in the
rows of the table, and the response variable in columns. Here is an
example of the tally or results of a survey asking males and females
if they smoke.
          Yes   No
Male       20   55
Female     15   70
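As a small illustration of reading such a table, the Python sketch below computes, for each row (the explanatory variable), the share of respondents who answered Yes (the response variable). The counts are the ones in the table; the variable names are assumptions of this sketch.

```python
# Counts from the survey table: rows are the explanatory variable (sex),
# columns are the response variable (smokes: Yes / No).
table = {
    "Male": {"Yes": 20, "No": 55},
    "Female": {"Yes": 15, "No": 70},
}

row_totals = {sex: sum(counts.values()) for sex, counts in table.items()}
row_shares = {sex: table[sex]["Yes"] / row_totals[sex] for sex in table}

for sex in table:
    print(f"{sex:<7} n={row_totals[sex]:3d}  smokers: {row_shares[sex]:.1%}")
```

Comparing shares within each row, rather than raw counts, matters because the two groups are of different sizes (75 males and 85 females).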

Random selection is equivalent to random sampling. This allows one
to extend results to the population from which the sample came.
Random assignment is NOT the same as random sampling. Random
assignment is how treatments get assigned. This distinguishes a
study as an experiment.
Don't forget! Observational studies can show an association or
relationship. Experiments can show cause.
Observational Studies Versus Scientific Studies
It is very important to distinguish between observational and
scientific studies since one has to be very skeptical about drawing
cause and effect conclusions using observational studies. The use
of random assignment of treatments (i.e. what distinguishes a
scientific study (experiment) from an observational study) allows
one to draw cause and effect conclusions.
Observational Studies: The researcher observes the data and has no
control over which subject takes which treatment.
Scientific Studies: The researcher randomly assigns treatments to each
subject; there is random assignment of treatment.

Question: We want to decide whether Advil or Tylenol is more
effective in reducing fever. Think about the following two methods:
Method 1: Ask the subjects which one they use and ask them
to rate the effectiveness. Is this an observational study or a
scientific study?
Method 2: Randomly assign half of the subjects to take Tylenol
and the other half to take Advil. Ask the subjects to rate the
effectiveness. Is this an observational study or a scientific
study?
LOOKING AHEAD: Students interested in pursuing topics related to
sampling might explore STAT 506: Sampling theory. STAT 506
covers sampling design and analysis methods that are useful for
research and management in many fields. A well-designed sampling
procedure ensures that we can summarize and analyze data with a
minimum of assumptions and complications.


Principles of Experimental Design
The following principles of experimental design have to be followed
to enable a researcher to conclude that differences in the results of
an experiment, not reasonably attributable to chance, are likely
caused by the treatments.
Control: Need to control for effects due to factors other than the
ones of primary interest.
Randomization: Subjects should be randomly divided into groups to
avoid unintentional selection bias in the groups.
Replication: A sufficient number of subjects should be used to ensure
that randomization creates groups that resemble each other closely and
to increase the chances of detecting differences among the
treatments when such differences actually exist.

The benefits of randomization are:
1. If random assignment of treatment is done, then significant
results can be interpreted as causal (cause and effect) conclusions;
that is, the treatment caused the result. The treatment can be
referred to as the explanatory variable and the result as the
response variable.
2. If random selection is done, where the subjects are randomly
selected from some population, then the results can be extended to
that population.
Random assignment is required for an experiment. When both random
assignment and random selection are part of the study, we have a
completely randomized experiment. Without random assignment (i.e.
in an observational study), the treatment can only be referred to as
being related to the outcome.
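Random assignment of treatments can be sketched in a few lines of Python. The subject labels are hypothetical and the function name `randomly_assign` is an assumption of this sketch; the two treatments echo the fever question posed earlier in these notes.

```python
import random


def randomly_assign(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for position, subject in enumerate(shuffled):
        groups[treatments[position % len(treatments)]].append(subject)
    return groups


# Hypothetical subject labels, used only for illustration.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
groups = randomly_assign(subjects, ["Advil", "Tylenol"], seed=42)

for treatment, members in groups.items():
    print(treatment, len(members), members[:3])
```

Because chance alone decides who receives which treatment, other characteristics of the subjects tend to be spread evenly across the groups. Note that this is assignment only: it says nothing about how the twenty subjects were selected from the population.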
Lurking versus Confounding Variables
The difference between the two is that a lurking variable is a
variable not considered in the study but could influence the
relationship between the variables included in the
study. A confounding variable is one that is in the study and is
related to the other study variables, thus having an effect on the
relationship between these variables. Therefore a lurking variable,
if included in the study, could have a confounding effect and then
be classified as a confounding variable. For example: Say you teach
a class where students must submit a weekly homework and then
take a weekly quiz. You want to see if there is a relationship
between the scores on the two assignments (i.e. higher homework

scores are aligned with higher quiz scores). As you look at the data
you begin to consider whether submission date of the homework
has an effect on the quiz grades; that is, do students who submit
the homework several days before taking the quiz perform better
overall on the quiz than students who do not leave much of a time
gap between completing the assignments (e.g. they do both on the
same day). The rationale is that students who allow time between
the homework and quiz to study may perform better compared to
the other group. In this example, days between submission of
homework and quiz would be a lurking variable as it was not
included in the study. Now once you got that information and
re-examined the relationship between the two assignments taking into
consideration the time gap, if you saw a change in the relationship
between the two assignments (i.e. the relationship changed
somewhat from the analysis without the time gap compared to
when the time gap was included), then the days between
submission would be considered a confounding variable. In an
experiment where treatments are randomly assigned, one assumes
these variables get evenly shared across the groups with the
intention that any influence they may have on the outcome is
negated or reduced.
Types of Bias
1. Non-response: a large percentage of those sampled do not
respond or participate.
2. Response: study participants either do not respond
truthfully or give answers they feel the researcher wants to
hear. For example, when students are asked if they have ever
cheated on an exam, even those who have may respond with
"no".
3. Selection: this bias occurs when the sample selected does
not reflect the population of interest. For instance, you are
interested in the attitude of female students regarding campus
safety, but when sampling you also include males. In this case
your population of interest was female students; however, your
sample included subjects not in that population (i.e. males).
LOOKING AHEAD: Students interested in pursuing topics related to
design of experiments might explore STAT 503: Design of
Experiments. STAT 503 includes extensive coverage of the
implementation and analysis of a wide range of experimental
designs.

1.3 - Graphing Data

Unit Summary
Variable and Its Type
Graphs for a Categorical Variable
Pie Chart
Bar Chart
Graphs for a Single Quantitative Variable
Dot Plot
Frequency Histogram and Relative Frequency
Histogram
Stem-and-Leaf Diagram
Time Plot
Boxplot or Box-and-Whisker Plot

An Introduction to Lesson 1.3, by course author Mosuk Chow (video, length 6:23)

Distinguishing between categorical (qualitative) variables and
quantitative variables is a basic and integral part of applied
statistics, as the methods to analyze these data are very different.
Sometimes, when one codes surveys, one codes male as 1
and female as 2. Beware, gender is qualitative: there are two
different classes. 1 and 2 are just two different symbols for
gender, and there is no ordering between these two symbols when
used to denote male and female. Another example is team
assignments. For your team project, I will call the teams Team 1,
Team 2, etc. The team a student belongs to is again qualitative. In
statistics, as in most languages, we sometimes call the same thing
by different names. So qualitative is also called nominal, or
categorical.
How can one graph qualitative variables? Two common choices are
the pie chart and the bar chart. Note that even though
histograms also have bars sticking up, they are used to describe the
frequency of quantitative variables; the term bar chart is reserved
for graphs that show the frequency of categorical variables.
You will practice drawing graphs for these two different types of
variables. Again, you will be asked in this lesson to work these
examples out by hand. After a good understanding of these
concepts has been established, the course will review all of these
using the Minitab statistical software.
Reading Assignment

An Introduction to Statistical Methods and Data Analysis (see
course schedule).
Techniques of describing data in ways that capture the essence
of the information in the data are called descriptive statistics.
Drawing conclusions about the population from data is
called inferential statistics.

Identifying Categorical and Quantitative Variables

One survey of 500 Penn State University students about their
favorite sport to watch shows that 238 said Football, 126 said
Basketball, 45 said Hockey, 46 said Others.
Think about the following question:
What is the variable of interest?
It is important that each observation for the variable falls into
one and only one value. For the above example, the values
are: Football, Basketball, Hockey, Others.
It is important to distinguish between the following two
types of variables since the methods to describe them and to
do inferences about them are very different.
1. Qualitative (Categorical) : Data that serves the function of a
name only. For example, for coding purposes, you may assign Male
as 0, Female as 1. The numbers 0 and 1 stand only for the two
categories and there is no order between them. Categorical values
may be:

Binary where there are two choices, e.g. Male and Female;
Ordinal where the names imply levels with hierarchy or order
of preference, e.g. level of education
Nominal where no hierarchy is implied, e.g. political party
affiliation.

Exercise: provide one or more examples of a qualitative variable.

2. Quantitative: Data that take on numerical values with a meaningful
measure of distance between them. Quantitative values can be
discrete (counted, as in the number of people in attendance) or
continuous (measured, as in the weight or height of a person).
Exercise: provide one or more examples of a quantitative variable.
Additional examples of both include:
Number of females in this class (Quantitative, Discrete)
Nationality (Categorical, nominal)
Amount of milk in a 1 gallon container (Quantitative,
Continuous)
Sex of students (even if coded as M = 0, F = 1) (Categorical,
Binary)

Graphs for a Categorical Variable


1. Pie Chart: the area of each slice represents the percentage of
that category.

Example
A hand-drawn pie chart to represent the Penn State
University students' favorite sport to watch. (We will use
Minitab to draw graphs and charts in Lesson 3.)

Remarks:
a) Pie charts may not be suitable for too many categories. Thus, if
there are too many categories, you can either combine some
categories or use a bar chart to represent the data. What is meant
by "too many"? There is no clear cutoff; it is a judgment call
based on the appearance of the chart.
b) Readers may find the pie chart more useful if the percentages
are arranged in a descending or ascending order.
2. Bar Chart: The height of the bar for each category is equal to the
frequency (number of observations) in the category. Leave space in
between the bars to emphasize that there is no ordering in the
classes.
Example
A hand-drawn bar chart to represent the Penn State
University students' favorite sport to watch.
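The numbers behind both charts are a frequency table. The Python sketch below builds it for the favorite-sport survey and prints a crude text bar chart. One caveat: the four counts as listed in these notes add up to 455 rather than 500, so this sketch assumes that percentages are taken relative to the listed total.

```python
# Counts as listed in the notes for the favorite-sport survey.
counts = {"Football": 238, "Basketball": 126, "Hockey": 45, "Others": 46}

total = sum(counts.values())
relative = {sport: n / total for sport, n in counts.items()}

# Text bar chart: one '#' per 10 responses, one separate bar per category.
for sport, n in counts.items():
    print(f"{sport:<11} {n:4d}  {relative[sport]:6.1%}  {'#' * (n // 10)}")
```

The relative frequencies are what a pie chart displays as slice areas, while the raw counts are the bar heights of the bar chart.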

Graphs for a Single Quantitative Variable


1. Dotplot: Useful to show the relative positions of the data.
Example
Each of the ten children in the second grade was given a
reading aptitude test. The scores were as follows:
95 78 69 91 82
76 76 86 88 79

Here is a dot plot for the data.
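A rough text version of the dot plot can be produced with the Python sketch below. The layout (one character column per integer score, repeated values stacked vertically) is an assumption of this sketch and is not meant to reproduce Minitab's output.

```python
from collections import Counter

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 79]
stacks = Counter(scores)  # how many dots to stack above each score


def dot_plot_rows(counter):
    """Return text rows, top row first, with one column per integer value."""
    low, high = min(counter), max(counter)
    rows = []
    for level in range(max(counter.values()), 0, -1):
        rows.append(
            "".join("." if counter[v] >= level else " " for v in range(low, high + 1))
        )
    return rows


rows = dot_plot_rows(stacks)
for row in rows:
    print(row)
# Axis line showing only the smallest and largest score.
print(f"{min(scores)}{' ' * (max(scores) - min(scores) - 3)}{max(scores)}")
```

Only the score 76 occurs twice, so the top row holds a single dot above 76 and the bottom row holds one dot for each distinct score.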

2. Frequency Histogram and Relative Frequency Histogram: If
there are many data points and we would like to see the
distribution of the data, we can represent the data by a
frequency histogram or a relative frequency histogram.
Group the data into about 5-20 class intervals and show the
frequency or relative frequency of data in each interval.
Example
Jessica has weighed herself every Saturday for the past 30 weeks.
Her weights were:


135 137 136 137 138 139
140 139 137 140 142 146
148 145 139 140 142 143
144 143 141 139 137 138
139 136 133 134 132 132

For histograms, we usually want to have from 5 to 20
intervals. Since the data range is from 132 to 148, it is
convenient to have classes of width 2, since that will give us 9
intervals:
131.5 - 133.5 133.5 - 135.5 135.5 - 137.5
137.5 - 139.5 139.5 - 141.5 141.5 - 143.5
143.5 - 145.5 145.5 - 147.5 147.5 - 149.5

The reason that we choose end points ending in .5 is to avoid
confusion about whether an end point belongs to the interval to its
left or the interval to its right. An alternative is to specify the
end point convention. For example, Minitab includes the left
end point and excludes the right end point. Having the
intervals, one can construct the frequency table and then
draw the frequency histogram, or compute the relative
frequencies to construct the relative frequency histogram. The
following histogram is produced by Minitab when we specify
the midpoints for the definition of intervals according to the
intervals chosen above.

If we do not specify the midpoints for the definition of intervals,
Minitab will default to another set of class intervals,
resulting in the following histogram. According to the include-left
and exclude-right end point convention, the observation
133 is included in the class 133-135.

Note that different choices of class intervals will result in
different histograms. Relative frequency histograms are
constructed in much the same way as frequency histograms,
except that the vertical axis represents the relative frequency
instead of the frequency. For the purpose of visually
comparing the distributions of two data sets, it is better to use
relative frequency histograms rather than frequency histograms,
since the same vertical scale, from 0 to 1, is used for all
relative frequencies.
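The frequency table behind the histogram of Jessica's weights can be computed with a short Python sketch. It uses the nine classes of width 2 starting at 131.5 chosen in the text; the variable names and the text rendering with asterisks are assumptions of this sketch.

```python
weights = [
    135, 137, 136, 137, 138, 139,
    140, 139, 137, 140, 142, 146,
    148, 145, 139, 140, 142, 143,
    144, 143, 141, 139, 137, 138,
    139, 136, 133, 134, 132, 132,
]

start, width, n_classes = 131.5, 2, 9  # class intervals chosen in the text

# Count how many weights fall into each class interval.
frequency = [0] * n_classes
for w in weights:
    frequency[int((w - start) // width)] += 1

relative_frequency = [f / len(weights) for f in frequency]

for i, (f, rf) in enumerate(zip(frequency, relative_frequency)):
    low = start + i * width
    print(f"{low:5.1f} - {low + width:5.1f}  {f:2d}  {rf:.3f}  {'*' * f}")
```

Because every end point ends in .5 and every weight is a whole number, no observation can fall exactly on a class boundary.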
3. Stem-and-Leaf Diagram: Groups the data while keeping the
individual values. One can recover the original data (except the
order in which the data were taken) from the diagram.
The stem represents the major groupings of the data. The
leaves represent the last digit. For example, the first value
(also smallest value) is 132, with 13 as the stem and 2 as
the leaf.
Stem-and-Leaf of weight of Jessica
N = 30
Leaf Unit = 1.0
  3   13  223
  5   13  45
 11   13  667777
 (7)  13  8899999
 12   14  0001
  8   14  2233
  4   14  45
  2   14  6
  1   14  8

The above stem-and-leaf diagram can also be drawn by
Minitab. The first column, called depths, is used to display
cumulative frequencies. Starting from the top, the depths
indicate the number of observations that lie in a given row
or before. For example, the 11 in the third row indicates
that there are 11 observations in the first three rows. The
row that contains the middle observation is marked by
showing the number of observations in that row in brackets: (7)
in our example. We thus know that the middle value lies in
the fourth row. The depths following that row indicate the
number of observations that lie in a given row or after. For
example, the 4 in the seventh row indicates that there are
four observations in the last three rows.
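The depths described above can be reproduced with the following Python sketch. It is an illustration of the idea rather than Minitab's own algorithm: it assumes non-negative whole-number data, puts two leaf digits in each row as in Jessica's diagram, and lists only rows that contain data.

```python
from collections import defaultdict

weights = [
    135, 137, 136, 137, 138, 139,
    140, 139, 137, 140, 142, 146,
    148, 145, 139, 140, 142, 143,
    144, 143, 141, 139, 137, 138,
    139, 136, 133, 134, 132, 132,
]


def stem_and_leaf(data, leaves_per_row=2):
    """Return (depth, stem, leaves) rows; only rows that contain data are listed."""
    rows = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        rows[(stem, leaf // leaves_per_row)].append(leaf)

    n = len(data)
    middle = (n + 1) / 2  # position of the median in the sorted data
    result, seen = [], 0
    for (stem, _), leaves in sorted(rows.items()):
        from_top = seen + len(leaves)  # observations in this row or before
        from_bottom = n - seen         # observations in this row or after
        if from_top >= middle and from_bottom >= middle:
            depth = f"({len(leaves)})"  # row holding the middle observation
        else:
            depth = str(min(from_top, from_bottom))
        result.append((depth, stem, "".join(map(str, leaves))))
        seen = from_top
    return result


rows = stem_and_leaf(weights)
for depth, stem, leaves in rows:
    print(f"{depth:>4} {stem:>3} {leaves}")
```

Above the median row the depth is the count from the top, below it the count from the bottom, which is why the depths rise to 11, switch to the bracketed row count, and then fall from 12 to 1.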
4. Boxplot: The boxplot will be discussed in greater detail
when we discuss "Summarizing Data" because the design of
the boxplot is dependent upon various summary measures
we will learn in that lesson.
5. Time Plot: Note that for the weight of Jessica, one important
aspect of the data is lost if one just shows the distribution. Jessica
may be really interested in how her weight changes over time. For
that purpose, a plot of weight versus the order it is taken (time) is
warranted.

1.4 - Practice Problems


1. Draw - by hand - two appropriate graphs for the following data:
University officials periodically review the distribution of undergraduate majors within the colleges of the
university to help determine a fair allocation of resources to departments within the colleges. At one such
review, the following data were obtained:

College                    Number of majors
Agriculture                  1,500
Arts and Sciences           11,000
Business Administration      7,000
Education                    2,000
Engineering                  5,000

2. Draw - by hand - a frequency histogram and relative frequency histogram
Draw a frequency histogram and a relative frequency histogram for the 2014 annual per capita city tax
given below:

2470, 520, 561, 488, 986, 359, 1305, 512, 467, 270, 360, 451, 4904, 572, 498, 382, 271, 634,
1682, 784, 298, 643, 947, 686
3. Draw - by hand - a Stem-and-Leaf plot
Draw the stem-and-leaf diagram for the following data set. Use the stem and leaf diagram to find the
median of the data set:

11 11 12 13 14 14 14 14 15 15 15 16 16 17 17 19 19 20 22 25
4. Populations and Samples
Selecting the proper diet of brook trout or other fresh water fish is an important aspect of fish farming. A
researcher wants to estimate the mean weight of brook trout maintained on a specific diet for a period of 6
months. One hundred brook trout are selected from a fishery's tank and each is weighed.
a. What is the population of interest to the researcher?
b. What is the sample?
c. What characteristics of the population are of interest to the researcher?
d. If the sample measurements are used to make inferences about certain characteristics of the population,
why is a measure of the reliability of the information important?

Experimental Data Processing in MINITAB

Minitab is a software application used for the statistical analysis of data and for interpreting tabular data and graphs in an easy manner. It can be used for:

Basic statistics
Regression
Analysis of variance
DOE (factorial, response surface, Taguchi design, mixture)
Control charts
Quality tools (including planning tools, process capability, acceptance
sampling, and gage study)
Reliability/survival (including distribution analysis, regression with life
data, accelerated life testing, probit analysis, warranty prediction, test
plans, and growth curves)
Multivariate analysis
Time series
Tables
Nonparametrics
Power and sample size
You can also access guidance for the following graphs in the Graph menu:
Scatterplot
Interval plot
Matrix plot
Individual value plot
Marginal plot
Line plot
Histogram
Bar chart
Dotplot
Pie chart
Stem-and-leaf plot
Time series plot
Probability plot
Area graph
Empirical CDF
Contour plot
Probability distribution plot
3D scatterplot
Boxplot
3D surface plot

Use this information to assess the basic properties of your data distributions:
number of observations
central tendency: the location of the center, or most typical value, of the data set
dispersion: the amount of variation or spread in the data set
Display Descriptive Statistics can provide summary information for whole columns of data or for
subsets of data within columns.

Descriptive statistics
Once we have identified the population, the data types, and the variables, and have collected the
sample data, our goal is to describe the characteristics of the sample precisely and unambiguously,
so that the data can be communicated easily to others. The collected data can be described
or summarized in two ways: graphically or numerically.
The graphical description depends on the type of data. As presented above, data are either quantitative
or qualitative.
Graphs that describe qualitative data include:
- bar charts, pie charts, and Pareto charts
Graphs that describe quantitative data include:
- dot plots, histograms, and stem-and-leaf plots
In descriptive statistics, the methods for presenting data include:
the frequency distribution

Describing the data
For a meteorology study, a student collected the following data over one year for the city where
they live. The values are the number of days in each month on which significant precipitation
was recorded. The project was started after the end of January, so there are no records for that
month:
Weather experiment: days with precipitation in each month (* = no record)
Month: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
Days with precipitation: *, 2, …, 10

Data: Precipitation.MTW (available in the Sample Data folder).


Open a new project:

Three types of windows

When Minitab opens, it looks like this:

The upper window, called the "Session" window, is where Minitab displays the results of the requested
statistical analysis. The lower window, called the "Worksheet" window, is where we copy and
enter the data. The third window, called the "Graphics" window, appears only when Minitab is
asked to plot something. Here is an example of a graph window:

The active window is the one whose title bar appears dark blue. To make a window active,
simply click anywhere in it.
In the Worksheet window, enter the data in column C1.

Copying and Pasting Data into a Minitab Worksheet


All of the data that you analyze in this course will be posted on the course web site. You will just have to
copy and paste the data into a worksheet. Let's try it out on the idealwt.txt data set. Once the data are in
your browser's window, the easiest way of copying the data is to Select all and then right-click and Copy.
To paste the data into the Minitab worksheet, put your cursor in the first (unnumbered) row of the first
column, and then click on Edit >> Paste cells (or click on the standard clipboard icon used to denote
pasting).
Your worksheet should look like this:

Note that the first (unnumbered) row is reserved for variable names. This is one thing about which you
will have to be careful. If you accidentally place your cursor instead in the row numbered 1, Minitab will
then treat the data as if they are text:

Note that Minitab has added a "-T" to the column names C1 and C2 to denote that the content of the two
columns is text. Another indication that the content of the columns is treated as text is that the textual
content is left-justified whereas numbers are always right-justified. Minitab cannot summarize data, such
as calculating means and standard deviations, when they are treated as text. If you accidentally make this
mistake, just open a new worksheet (Select File >> New... >> Minitab Worksheet >> OK) and paste the
data properly.
https://onlinecourses.science.psu.edu/statprogram/print/book/export/html/51

Analyzing Data
Once you've pasted or uploaded data into a Minitab worksheet, you no doubt will want to analyze it. The
commands to do so all appear in one of the Minitab menus:

More often than not, we will use the "Stat" and "Graph" menus in this course. The menus are generally
self-explanatory, but we will provide you with Minitab help throughout the course.
To create a histogram of the data, just select Graph >> Histogram... >> Simple. A standard Minitab dialog
box will appear. In general, the dialog box means that you have to provide Minitab with more information
before it can complete your request. For the histogram dialog box:

you need to tell Minitab that "actual" is the variable of interest. To do this you can either 1) click on the
variable name ("actual") once and then click on "Select" or just 2) double-click on the variable name. The
name should appear in the box labeled "Graph variables." Then, once you select "OK," a new graph
window containing the histogram should appear:

To have Minitab display basic descriptive statistics, select Stat >> Basic Statistics >> Display
Descriptive Statistics.... The following Minitab dialog box will appear:

You need to tell Minitab that in this case "actual" is the variable of interest. You can either 1) click on the
variable name ("actual") once and then click on "Select" or just 2) double-click on the variable name. The
name should appear in the box labeled "Variables." Do the same thing to tell Minitab which variable you
would like to group the statistics by. In this case we will click in the box labeled "By variables" and then
either 1) click on the variable name ("sex") once and then click on "Select" or just 2) double-click on the
variable name. Once you select "OK," the standard descriptive statistics output should appear in the
"Session" window:

All Minitab graph and analysis commands function similarly to the examples illustrated above.
Throughout the course, help will be provided for the various Minitab commands we will encounter.
In the next lesson page we will use a Viewlet to walk you through another example of how Minitab works.

Copying Minitab Output and Graphs into Word


To copy output appearing in the Session window, select the desired output using your mouse. To copy a
graph window, make the graph window active by clicking anywhere in it, and then select Edit >> Copy
Graph.
To paste either output or a graph, select Edit >> Paste (or use the standard clipboard icon used to denote
pasting).
RemoteApps and WebApps users should select the Send Graph to Microsoft
Word option and a Word Document with this graph in it will be created and can
be saved in your PASS space.

Saving Your Work


While you can save your work in bits and pieces (the graphs separately from the worksheet), more
often than not it is best to save your entire Minitab "project." A Minitab project includes all of the work
created in one session, including multiple worksheets, the Session window, and multiple graph windows.
Basically, if you save your work as a Minitab project, you can resume your work right where you left off.
To save your work as a Minitab project, select File >> Save Project As..., and provide an appropriate
filename in the dialog box. Minitab projects are given a ".MPJ" extension. For the purpose of this course,
you may consider creating one project for each lesson, and thereby naming the projects lesson1.MPJ,
lesson2.MPJ, and so on.
The work of RemoteApps and WebApps users will be saved in the user's PASS space.
It can be downloaded to the user's local computer from there.

Printing Your Work


Of course, you can print your Minitab work as well. To do so, activate the window that you want to print
by clicking your mouse anywhere on the window. Then, select File >> Print Worksheet or File >> Print
Session Window or File >> Print Graph, depending on what it is that you want to print.

Minitab Help
There are various ways that you can get Minitab help.
1. If you would like more tutorial help, you can try the Tutorials option provided under the Help pull-down
menu in the Minitab Application shown below:

2. You can look for help in the Minitab Help on-line manual also listed in the pull-down menu pictured
above.
3. You can use the various sets of Minitab instructions provided to you throughout the course. You will
probably find links to these from the Homework Problems and Lab Activities in each lesson.
4. Finally, you can post a question to the Minitab discussion board located at the course level under the
CONTENT tab in ANGEL. This is a separate discussion board just for questions related to how to use
Minitab.

Support From Minitab


Minitab offers several resources that are helpful for you. Minitab 17 Support - Getting Started is a concise
guide designed to quickly get you familiar with using Minitab Statistical Software.
In addition to Minitab 17 - Getting Started, the following tools are available:
Help: A complete Help file is incorporated in Minitab, which provides you with instructions, examples
with interpretations, overviews and detailed explanations, troubleshooting tips, formulas, references, and a
glossary. Open Help by choosing Help >> Help or by clicking on the Help button on every dialog box in
the software.
StatGuide: The StatGuide provides statistical guidance after you run a procedure in Minitab. Open the
StatGuide by right-clicking on your output in the Session window, then choosing StatGuide.
Tutorials: Step-by-step tutorials help new users learn how to use Minitab. You can open these by choosing
Help > Tutorials while using Minitab.
Minitab News: Every month Minitab also publishes a newsletter delivered to your email box that provides
you with customer stories, highlights of new product capabilities and tips on how to get the most out of
using our products. To sign up for Minitab News, simply create an account under My Account. Or feel free
to look at some of the past issues.
The Minitab website is full of powerful articles and information including why Minitab is used by over
4,000 colleges and universities worldwide. You will find:
Minitab Tips Tricks and Tutorials
Jobs that are looking for Minitab experience
YouTube videos
Minitab also offers free web events both live and recorded.
Last, but not least, remember that Minitab provides free access to their support team, staffed by
professionals with expertise in the software, statistics, quality improvement, and computer systems.

Visit the support web site or call +1-814-231-2682 to speak with Minitab's technical support specialists.

https://onlinecourses.science.psu.edu/stat500/node/11
Lesson 2 - Summarizing Data: Measures of Central Tendency and Measures of Variability, Box Plot

We will first talk about the important concepts of statistical inference. Then a few descriptive measures of
the most important characteristic of a data set, central tendency, will be given. After that, a few descriptive
measures of the other important characteristic of a data set, measure of variability, will be discussed. This
lesson will be concluded by a discussion of box plots, which are simple graphs that show the central
location, variability, symmetry, and outliers very clearly.
Again, this lesson will focus on simple examples that can be calculated or drawn by hand. This lesson will
be followed by another lesson that will work through many of these procedures using Minitab.

Lesson 2 Objectives
Upon successful completion of this lesson, you will be able to:
conceptualize statistical inference.
use appropriate summary measures to describe different data sets.
construct and use box plots.
2.1 - Measures of Central Tendency and Skewness

Unit Summary

Measures of Central Tendency


Mean
Median
Mode
Trimmed Mean
Skewness
Adding and Multiplying Constants

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, (See your course schedule.)
Measures of Central Tendency
Three of the many ways to measure central tendency are:

1. Mean: the average of the data
2. Median: the middle value of the ordered data
3. Mode: the value that occurs most often in the data

In most research experimental situations, examination of all members of a population is not typically
conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a
representative subset of the population.


Descriptive measures of a population are parameters. Descriptive measures of a sample are statistics. For
example, a sample mean is a statistic and a population mean is a parameter. The sample mean is usually
denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$
where $n$ is the sample size and the $y_i$ are the measurements. One may need to use the sample mean to estimate
the population mean since usually only a random sample is drawn and we don't know the population
mean.

A Note on Notation!
What if we say we used $x_i$ for our measurements instead of $y_i$? Is this a problem? No. The
formula would simply look like this:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The formulas are exactly the same. The letters that you select to denote the measurements are up
to you. For instance, many textbooks use x instead of y to denote the measurements.
The point is to understand how the calculation that is expressed in the formula works. In this
case, the formula is calculating the mean by summing all of the observations and dividing by the

number of observations.
There is some notation that you will come to see as standard; for example, n will always denote the sample
size. We will make a point of letting you know what these are. However, when it comes to the
variables, these labels can (and do) vary.
For example, in one study x may be used to denote weight and y may be used to denote height (or
the reverse may be used!), but n will always be used to denote sample size in each case.
Note that for the data set:
1, 1, 2, 3, 13
mean = 4, median = 2, mode = 1
Steps to finding the median for a set of data:
1. Arrange the data in increasing order
2. Find the location of median in the ordered data by (n + 1) / 2
3. The value that represents the location found in Step 2 is the median. NOTE: if the sample size is an odd
number then the location point will produce a median that is an observed value as in the example above.
If sample size is an even number, then the location will require one to take the mean of two numbers to
calculate the median. The result may or may not be an observed value as the example below illustrates.
Mean, median and mode are usually not equal. When the data is symmetric, the mean is equal to the
median.
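The three measures and the (n + 1) / 2 location rule can be checked with a short Python sketch. It uses the standard `statistics` module; the helper name `median_by_location` is only illustrative and is not part of the lesson.

```python
import statistics

data = [1, 1, 2, 3, 13]

mean_value = statistics.mean(data)      # sum of the values divided by n
median_value = statistics.median(data)  # middle value of the ordered data
mode_value = statistics.mode(data)      # value that occurs most often


def median_by_location(values):
    """Median found with the (n + 1) / 2 location rule from the steps above."""
    ordered = sorted(values)               # Step 1: arrange in increasing order
    location = (len(ordered) + 1) / 2      # Step 2: 1-based location of the median
    lower = ordered[int(location) - 1]
    if location.is_integer():              # odd n: the median is an observed value
        return lower
    upper = ordered[int(location)]         # even n: average the two neighbours
    return (lower + upper) / 2


print(mean_value, median_value, mode_value)
print(median_by_location(data))
```

For an even sample size the helper averages the two values on either side of the location, so the result need not be an observed value.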
4. Trimmed Mean
One shortcoming of the mean is that: Means are easily affected by extreme values.

Consider the aptitude test scores of ten children below:


95, 78, 69, 91, 82, 76, 76, 86, 88, 80
Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1
If the entry 91 is mistakenly recorded as 9, the mean would be 73.9,
which is very different from 82.1.
On the other hand, let us see the effect of the mistake on the median
value:
The original data set in increasing order are:
69, 76, 76, 78, 80, 82, 86, 88, 91, 95
With n = 10, the median position is found by (10 + 1) / 2
= 5.5. Thus, the median is the average of the fifth (80)
and sixth (82) ordered value and the median = 81
The data set (with 91 coded as 9) in increasing order is:
9, 69, 76, 76, 78, 80, 82, 86, 88, 95
where the median = 79
The medians of the two sets are not that different. Therefore the median is not
that affected by the extreme value 9.
Measures that are not that affected by extreme values are called resistant.
A variation of the mean is the trimmed mean. A 10% trimmed mean drops the
highest 10%, the lowest 10%, and averages the remaining. Let's calculate the
trimmed mean for the data we were looking at above:
(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)
The 10% trimmed mean = 82.13
(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)
The 10% trimmed mean = 79.38
The 10% trimmed mean of the two sets is not that different. The trimmed mean is
not as affected by the extreme value 9 as the mean.
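As a rough sketch of the comparison above, the following Python function trims a fixed proportion from each end of the ordered data and averages the rest. The function name `trimmed_mean` and its rounding-down of the trim count are assumptions made for illustration; statistical packages may handle fractional trim counts differently.

```python
def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the ordered data, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # number of values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
scores_with_error = [9 if s == 91 else s for s in scores]  # 91 mistakenly recorded as 9

for label, values in (("original", scores), ("with error", scores_with_error)):
    plain = sum(values) / len(values)
    print(label, round(plain, 2), round(trimmed_mean(values, 0.10), 2))
```

The plain mean moves by more than eight points when the error is introduced, while the 10% trimmed mean moves by less than three.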
After reading this lesson you should know that there are quite a few options when one wants to describe
central tendency, for example, mean, median, mode and trimmed mean. In future lessons, we will talk
mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily
affected by extreme values. One remedy is to use the trimmed mean to estimate the central tendency.
Remember, however, that this is very different from saying that one can trim data. Unless data points are
known mistakes, one should not remove them from the data set! One should keep the extreme points and
use more resistant measures. For example, use the sample median to estimate the population median. Or,
use the sample trimmed mean to estimate the population trimmed mean. Again, this is very different from
saying that it is OK to trim data from a data set.
Skewness
Skewness is a measure of degree of asymmetry of the distribution.
1. Symmetric
Mean, median, and mode are all the same here; the distribution is mound shaped, and no skewness is
apparent. The distribution is described as symmetric.

The above distribution is symmetric.


2. Skewed Left
Mean to the left of the median, long tail on the left.

The above distribution is skewed to the left.


3. Skewed Right
Mean to the right of the median, long tail on the right.

The above distribution is skewed to the right.


When one has very skewed data, it is better to use the median as measure of central tendency since the
median is not much affected by extreme values.
Example: The Skewed Nature of Salary Data
Salary distributions are almost always positively skewed, with

a few people that make the most money. To illustrate this, consider your favorite sports team or
even the company for which you work. There will be one or two players or personnel that earn
the big bucks, followed by others who earn less. This will produce a shape that is skewed to
the right. Knowing this can be a useful aid in negotiating a higher salary.
When one interviews for a position and the discussion gets around to compensation, it is
common that the interviewer states an offer that is typical for someone in your position. That
is, they are offering you the average salary for someone with your particular skill set (e.g. little
experience). But is this average the mode, median, or mean? The company (for whom business
is business!) will want to pay you the least they can while you prefer to earn the most you can.
Since salaries tend to be skewed to the right, the offer will most likely reflect the mode or
median. You simply need to ask which average the offer refers to and what the mean is, since
the mean would be the highest of the three values. Once you have these averages,
you can begin to negotiate toward the highest number.
Adding and Multiplying Constants
What happens to the mean and median if we add a constant to each observation in a data set, or multiply each observation by a constant?
Consider for example an instructor who curves an exam by adding five points to each student's score. What
effect does this have on the mean and the median? Adding a constant to each value shifts
the mean and the median by that constant. For example, for the 10 aptitude scores above,
if 5 were added to each score, the mean of the new data set would be
87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).
Similarly, if each observed data value were multiplied by a constant, the new mean and median would
change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were
doubled, then the new mean and new median would be double the original mean and median. As we
will learn shortly, the effect is not the same on the variance!
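A small Python sketch, using the standard `statistics` module and the ten aptitude scores from earlier in the lesson, shows the effect of each operation on the mean and median. The variable names are illustrative only.

```python
import statistics

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]

shifted = [s + 5 for s in scores]   # add a constant to every score
doubled = [s * 2 for s in scores]   # multiply every score by a constant

summary = {
    "original": (statistics.mean(scores), statistics.median(scores)),
    "plus 5": (statistics.mean(shifted), statistics.median(shifted)),
    "times 2": (statistics.mean(doubled), statistics.median(doubled)),
}

for label, (mean_value, median_value) in summary.items():
    print(label, mean_value, median_value)
```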
Why would you want to know this? One reason, especially for those moving onward to more applied
statistics (e.g. Regression, ANOVA), is transforming data. For many applied statistical methods a
required assumption is that the data are normal, or very nearly bell-shaped. When the data are not normal,
statisticians will transform the data using one of numerous techniques, e.g. a logarithmic transformation. But
the log cannot be taken of all values; for instance, the log of 0 is undefined. However, if we add a constant to
all the data values, making them all greater than zero, then a log can be taken without risk. We just need to
remember that the original data were transformed!

2.2 - Measures of Variability


Unit Summary
Measures of Variability
Range
Interquartile Range (IQR)
Variance and Standard Deviation
Adding and Multiplying Constants
Empirical Rule
How to Roughly Approximate Standard Deviation
Coefficient of Variation
Z-score, Z-value, or Z

Reading Assignment
An Introduction to Statistical Methods and Data Analysis (see course schedule).

Measures of Variability
Think about the following:
If you can use two numbers to summarize
Jessica's weight data, which two characteristics
will you use as measures?

Why do we want to know variability?

There are many ways to describe variability including:


Range
Interquartile range (IQR)
Variance and Standard deviation
Let's look at each of these in turn.
A. Range: R = maximum - minimum
1. Easy to calculate
2. Very much affected by extreme values (range is not a
resistant measure of variability)
B. Interquartile range (IQR)
In order to talk about interquartile range, we need to first
talk about percentiles.
The pth percentile of the data set is a measurement such
that after the data are ordered from smallest to largest, at
most p% of the data are below this value and at most
(100 - p)% are above it.

Thus, the median is the 50th percentile. Fifty percent of the
data values fall at or below the median.

Also, Q1 = lower quartile = the 25th percentile and Q3 =
upper quartile = the 75th percentile.

The interquartile range is the difference between the upper and
lower quartiles and is denoted as IQR.
IQR = Q3 - Q1 = upper quartile - lower quartile = 75th
percentile - 25th percentile.
Details about how to compute IQR will be given in Lesson
2.3.
Note: IQR is not affected by extreme values. It is thus a
resistant measure of variability.
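To see the difference in resistance, the sketch below computes the range and the IQR for the ten aptitude scores from Lesson 2.1, with and without the recording error (91 entered as 9). It assumes Python's `statistics.quantiles` with the "exclusive" method; quartile conventions differ between textbooks and software, so the IQR values may not match the method given in Lesson 2.3.

```python
import statistics


def data_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range Q3 - Q1 (quartile method is an assumption, see above)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1


scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
scores_with_error = [9 if s == 91 else s for s in scores]

for label, values in (("original", scores), ("with error", scores_with_error)):
    print(label, "range =", data_range(values), "IQR =", iqr(values))
```

With this quartile method the range more than triples when the error is introduced, while the IQR changes only slightly.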

C. Variance and Standard Deviation


Two vending machines A and B drop candies when a quarter
is inserted. The number of pieces of candy one gets is
random. The following data are recorded for six trials at
each vending machine:
Pieces of candy from vending machine A:
1, 2, 3, 3, 5, 4
mean = 3, median = 3, mode = 3
Pieces of candy from vending machine B:
2, 3, 3, 3, 3, 4
mean = 3, median = 3, mode = 3
Dotplots for the pieces of candy from vending
machine A and vending machine B:

They have the same center, but what about their spreads?
One way to compare their spreads is to compute their
standard deviations. In the following section, we are going
to talk about how to compute the sample variance and the
sample standard deviation for a data set.
Variance is the average squared distance from the mean.
Population variance is defined as:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$$
In this formula $\mu$ is the population mean and the summation
is over all possible values of the population. $N$ is the
population size.
The sample variance that is computed from the sample and
used to estimate $\sigma^2$ is:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

Why do we divide by n - 1 instead of by n? Since $\mu$ is
unknown and estimated by $\bar{y}$, the $y_i$'s tend to be closer
to $\bar{y}$ than to $\mu$. To compensate, we divide by a smaller
number, n - 1. The sample variance (and therefore the sample
standard deviation) is the common default calculation
used by software. When asked to calculate the variance or
standard deviation of a set of data, assume, unless
otherwise instructed, that this is sample data and therefore
calculate the sample variance and sample standard
deviation.
For example, let's find $s^2$ for the data set from vending
machine A: 1, 2, 3, 3, 4, 5

$$\bar{y} = \frac{1 + 2 + 3 + 3 + 4 + 5}{6} = 3$$
$$s^2 = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1} = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{6 - 1} = 2$$

Calculate $s^2$ for the data set from vending
machine B yourself and check that it is smaller
than the $s^2$ for data set A.
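A hand calculation can be checked with a few lines of Python. The sketch below implements the sample and population variance formulas directly; the function names are illustrative, and `statistics.variance` from the standard library is used only as a cross-check.

```python
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def population_variance(values):
    """Same sum of squared deviations, divided by N (values are the whole population)."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / n


machine_a = [1, 2, 3, 3, 5, 4]
machine_b = [2, 3, 3, 3, 3, 4]

print("A:", sample_variance(machine_a), statistics.variance(machine_a))
print("B:", sample_variance(machine_b), statistics.variance(machine_b))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, which is the compensation described above.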
Standard Deviation
The population standard deviation is denoted by $\sigma$ and found
by $\sigma = \sqrt{\sigma^2}$. $\sigma$ has the same unit as the $y_i$'s. This is a desirable
property since one may think about the spread in terms of the
original unit.
$\sigma$ is estimated by the sample standard deviation $s$:

$$s = \sqrt{s^2}$$
For the data set A,

$s = \sqrt{2} = 1.414$ pieces of candy.

Calculate the standard deviation for the data set
from vending machine B.
The standard deviation is approximately the average distance
the values of a data set are from the mean, and is a very
useful measure. One reason is that it has the same unit of
measurement as the data itself (e.g. if a sample of student
heights were in inches then so, too, would be the standard
deviation). The variance would be in squared units, for
example square inches. Also, the empirical rule, which will be
explained in the following section, makes the standard
deviation an important yardstick to find out approximately
what percentage of the measurements fall within certain
intervals.
Adding and Multiplying Constants
What happens to measures of variability if we add or multiply each
observation in a data set by a constant? We learned previously
about the effect such actions have on the mean and the median,
but do variation measures behave similarly? Not really.
When we add a constant to all values we are basically shifting the
data upward (or downward if we subtract a constant). This has the
result of moving the middle but leaving the variability measures
(e.g. range, IQR, variance, standard deviation) unchanged.
On the other hand, if one multiplies each value by a constant this
does affect measures of variation. The result on the variance is that
the new variance is multiplied by the square of the constant, while
the standard deviation, range, and IQR are multiplied by the
constant. For example, if the observed values of Machine A in the
example above were multiplied by three, the new variance would be
18 (the original variance of 2 multiplied by 9). The new standard
deviation would be 4.242 (the original standard deviation of 1.414
multiplied by 3). The range and IQR would also change by a factor of 3.
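These two behaviours can be verified numerically. The following sketch uses the Machine A data and the standard `statistics` module; the added constant of 10 is an arbitrary choice for illustration.

```python
import statistics

machine_a = [1, 2, 3, 3, 5, 4]
shifted = [y + 10 for y in machine_a]   # add a constant (10 is arbitrary)
tripled = [y * 3 for y in machine_a]    # multiply by a constant


def spread(values):
    """Range, sample variance, and sample standard deviation of a data set."""
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
        "sd": statistics.stdev(values),
    }


for label, values in (("original", machine_a), ("plus 10", shifted), ("times 3", tripled)):
    print(label, spread(values))
```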
Empirical Rule
The Empirical Rule is sometimes referred to as the 68-95-99.7% Rule. If
the set of measurements follows a bell-shaped distribution, then

$\bar{y} \pm s$ contains about 68% of the data
$\bar{y} \pm 2s$ contains about 95% of the data
$\bar{y} \pm 3s$ contains about all (99.7%) of the data

The following five examples (a-e) show that the empirical rule
is not that far off even when the underlying distribution is not bell
shaped.
a. For the following graph, $\bar{y} = 5.5$, $s = 1.49$

60% within $\bar{y} \pm s$
(5.5 - 1.49, 5.5 + 1.49) = (4.01, 6.99)
94% within $\bar{y} \pm 2s$
(5.5 - 2.98, 5.5 + 2.98) = (2.52, 8.48)
100% within $\bar{y} \pm 3s$
(5.5 - 4.47, 5.5 + 4.47) = (1.03, 9.97)
b. For the following graph, $\bar{y} = 5.5$, $s = 2.07$

64% within $\bar{y} \pm s$
96% within $\bar{y} \pm 2s$
100% within $\bar{y} \pm 3s$

c. For the following graph, $\bar{y} = 5.5$, $s = 2.89$

60% within $\bar{y} \pm s$
100% within $\bar{y} \pm 2s$
100% within $\bar{y} \pm 3s$
d. For the following graph, $\bar{y} = 3.49$, $s = 1.87$

75% within $\bar{y} \pm s$
96% within $\bar{y} \pm 2s$
98.5% within $\bar{y} \pm 3s$
e. For the following graph, $\bar{y} = 2.57$, $s = 1.87$

87% within $\bar{y} \pm s$
95% within $\bar{y} \pm 2s$
97.6% within $\bar{y} \pm 3s$
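Percentages like those in examples a-e can be computed for any data set. The sketch below is one possible way to do it in Python; the function name `fraction_within` is hypothetical, and the vending machine A data from earlier in the lesson stand in for the graphed data sets, which are not reproduced here.

```python
import statistics


def fraction_within(values, k):
    """Fraction of observations inside the interval mean ± k * s (sample SD)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [y for y in values if mean - k * s <= y <= mean + k * s]
    return len(inside) / len(values)


machine_a = [1, 2, 3, 3, 5, 4]

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(machine_a, k):.1%}")
```

With only six observations the percentages can only move in steps of about 17%, so close agreement with 68-95-99.7 should not be expected for such a small sample.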
Approximating the Standard Deviation
How can one find an approximate value
of s without going through the detailed
computation? It follows from the empirical rule that
approximately 95% of measurements (almost all) lie
in $\bar{y} \pm 2s$, an interval about $4s$ wide. Hence:
Range $\approx 4s$
Approximate value of $s \approx \frac{\text{range}}{4}$

Why don't we say $\bar{y} \pm 3s$ contains all and divide by 6 to
obtain the approximate value of s?
It is important to remember that one has to use the formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$
to compute the sample standard deviation. The formula
$s \approx \frac{\text{range}}{4}$ only gives a rough
estimate of s.

For example, the actual ages (in years) of 36 millionaires
sampled, arranged in increasing order, are:
31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
71, 71, 74, 75, 77, 79, 79, 79
The data range is from 31 to 79. Thus, the 'shortcut' approximation
of s is (79 - 31) / 4 = 12 years.
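To judge how rough the estimate is, the range-based value can be set beside the sample standard deviation computed from the full formula. This is a minimal sketch using the 36 ages listed above and the standard `statistics` module.

```python
import statistics

ages = [
    31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
    54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
    71, 71, 74, 75, 77, 79, 79, 79,
]

approx_s = (max(ages) - min(ages)) / 4   # rough estimate: range / 4
actual_s = statistics.stdev(ages)        # sample standard deviation (divides by n - 1)

print("range / 4 estimate:", approx_s)
print("sample standard deviation:", round(actual_s, 2))
```

For these data the rough estimate is somewhat smaller than the computed standard deviation, which is consistent with treating range / 4 as an approximation only.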
Shortcut Method for Calculating the Standard Deviation
Instead of using the formula for calculating the variance and
standard deviation that involves comparing each observation to the
mean, there is a shortcut method to calculating the variance and
standard deviation. This shortcut method is as follows:
1. Sum all the values in the data set.
2. Square this sum.
3. Divide this squared sum by the total number of observations,
n (call this the average sum squared).
4. Square each value in the data set.
5. Sum these squared values (called the sum of squares).
6. Subtract the average sum squared from the sum of squares.
7. Divide this difference by n - 1; this is the variance.
8. Take the square root to get the standard deviation.
For example, recall the data results for Vending Machine A at the
beginning of this lesson: 1, 2, 3, 3, 4, and 5. We calculated the
variance to be 2 and the standard deviation to be 1.414. Using the
shortcut method:
1. 1 + 2 + 3 + 3 + 4 + 5 = 18
2. 18 * 18 = 324
3. 324 / 6 = 54
4. 1, 4, 9, 9, 16, and 25
5. 1 + 4 + 9 + 9 + 16 + 25 = 64
6. 64 - 54 = 10
7. 10 / 5 = 2
8. Square root of 2 equals 1.414
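The eight steps translate almost line for line into code. The sketch below is an illustrative Python version; the function name `shortcut_variance` is an assumption, not a standard library call.

```python
import math


def shortcut_variance(values):
    """Sample variance via the shortcut (sum of squares) method."""
    n = len(values)
    total = sum(values)                            # step 1: sum the values
    average_sum_squared = total ** 2 / n           # steps 2-3: square the sum, divide by n
    sum_of_squares = sum(y ** 2 for y in values)   # steps 4-5: square each value, then sum
    return (sum_of_squares - average_sum_squared) / (n - 1)   # steps 6-7


machine_a = [1, 2, 3, 3, 4, 5]
variance = shortcut_variance(machine_a)
standard_deviation = math.sqrt(variance)           # step 8

print(variance, round(standard_deviation, 3))
```

The shortcut is convenient by hand, but in floating-point arithmetic it can lose precision when the values are large relative to their spread, so software usually works from deviations about the mean instead.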
Coefficient of Variation
Above we considered three measures of variation: Range,
Interquartile Range (IQR), and Variance (and its square root
counterpart, Standard Deviation). These are all measures we can
calculate from one quantitative variable, e.g. height or weight. But
how can we compare the dispersion (i.e. variability) of data from two or
more distinct populations that have vastly different means? A
popular statistic to use in such situations is the Coefficient of
Variation or CV. This is a unit-free statistic and one where the
higher the value the greater the dispersion. The calculation of CV is:
CV = Standard Deviation / Mean
Example: Comparing Prices

You are shopping for toilet tissue. As you
compare prices of various brands, some
offer price per roll while others offer price
per sheet. You are interested in
determining which pricing method has
less variability so you sample several of each and calculate
the mean and standard deviation for the sampled items that
are priced per roll, and the mean and standard deviation for
the sampled items that are priced per sheet. The table below
summarizes your results.

Item             Mean      Standard Deviation
Price per Roll   0.9196    0.4233
Price per Sheet  0.01134   0.00553

Comparing the standard deviations, the Per Sheet prices appear to
have much less variability. However, the mean is
also much smaller. The coefficient of variation allows us to
make a relative comparison of the variability of these two
pricing schemes:

$$CV_{Roll} = 0.4233 / 0.9196 = 0.46 \quad \text{and} \quad CV_{Sheet} = 0.00553 / 0.01134 = 0.49$$

Relatively speaking, the variation for Price per Sheet is
greater than the variability for Price per Roll.
Another example to consider is hotel pricing. Think of prices
for luxury and budget hotels. Which do you think would have

the higher average cost per night? Which would have the
greater standard deviation? The CV would allow you to
compare this dispersion in costs in relative terms by
accounting for the fact that the luxury hotels would have a
greater mean and standard deviation.
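The toilet tissue comparison can be reproduced in a few lines of Python. The function name `coefficient_of_variation` is chosen here for illustration; the means and standard deviations are the summary values from the table in the example.

```python
def coefficient_of_variation(standard_deviation, mean):
    """CV = standard deviation / mean (unit-free; meaningful for a positive mean)."""
    return standard_deviation / mean


cv_roll = coefficient_of_variation(0.4233, 0.9196)
cv_sheet = coefficient_of_variation(0.00553, 0.01134)

print("CV per roll: ", round(cv_roll, 2))
print("CV per sheet:", round(cv_sheet, 2))
```

Because both quantities are ratios, the two pricing schemes can be compared directly even though their means differ by a factor of about 80.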
Z-value, Z-score, or Z
The Z-value, sometimes referred to as the Z-score or simply Z, represents
the number of standard deviations an observation is from the mean
for a set of data. To find the Z-score for a particular observation we
apply the following formula:
Z = (observed value - mean) / SD
Example: Exam Scores

For a recent final exam the mean was 68.55 with a standard
deviation of 15.45.

If you scored an 80%: Z = (80 - 68.55) / 15.45 = 0.74, which
means your score of 80 was 0.74 SD above the mean.

If you scored a 60%: Z = (60 - 68.55) / 15.45 = -0.55, which
means your score of 60 was 0.55 SD below the mean.

Is it always good to have a positive Z-score? It depends on
the question.

For exams you would want a positive Z-score (it indicates you
scored higher than the mean). However, if one were analyzing
days of missed work then a negative Z-score would be more
appealing, as it would indicate the person missed fewer than
the mean number of days.

Characteristics of Z-scores
1. The scores can be positive or negative.
2. For data that is symmetric (i.e. bell shaped) or nearly
symmetric, a common application of Z-scores is identifying
potential outliers: any observation with a Z-score beyond ±3.
3. The maximum possible Z-score for a set of data is (n - 1) / sqrt(n).
4. The sum of all squared Z-scores for a set of data is n - 1.
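The exam calculations and the last two characteristics can be checked numerically. The sketch below reuses the exam mean and standard deviation from the example. The five-value data set is made up purely for illustration, and the sample standard deviation (dividing by n - 1) is assumed, since that is the convention under which the "n - 1" properties hold.

```python
import math
import statistics


def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


# Exam example: mean 68.55, standard deviation 15.45.
z_80 = z_score(80, 68.55, 15.45)
z_60 = z_score(60, 68.55, 15.45)
print(f"Z for a score of 80: {z_80:.2f}")
print(f"Z for a score of 60: {z_60:.2f}")

# Hypothetical data (not from the lesson) to illustrate characteristics 3 and 4.
data = [1, 2, 3, 4, 5]
n = len(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation, divides by n - 1
zs = [z_score(x, mean, sd) for x in data]

sum_sq = sum(z * z for z in zs)      # characteristic 4: equals n - 1
max_bound = (n - 1) / math.sqrt(n)   # characteristic 3: upper bound on any Z-score
print(f"Sum of squared Z-scores: {sum_sq:.4f} (n - 1 = {n - 1})")
print(f"Largest Z-score: {max(zs):.4f}, bound: {max_bound:.4f}")
```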
2.4 - Practice Problems

1. Our statistics department surveyed a random sample of 5 staff personnel and 5 faculty on how often
during a week they used public transportation in traveling for work. The table below reflects the
responses.

Staff

Faculty

4 2 2 2 5 0 1 5 1 3
a. What sampling method was used to gather this data? What population of interest is best represented by
the samples?
b. Calculate by hand the mean and standard deviation for number of times a week public transportation
was used by staff and faculty.
c. Based on means and standard deviations, do you think there is a statistically significant difference
between these two means? Explain.
2. The College of Dentistry at the University of Florida has made a commitment to develop its entire
curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on.
It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional
staff will have more free time for personal consultation in student-faculty interaction. One such instructional
module was developed and tested on the first 50 students proceeding through the curriculum. The following
measurements represent the number of hours it took these students to complete the required modular
material:
16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 6 5 23 21 10 17 15
Here is a link to the data (hours.txt) for the time it took students to complete the required material.
a. Calculate by hand the five number summary for these recorded completion times. Helpful hint: you can
use software such as Excel to sort the data.
b. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.
c. Calculate the standard deviation, s, by using the approximation formula and compare that answer to the
actual standard deviation of 8.45.
d. The mean for this data set is 16. Using the actual s of 8.45, construct the intervals and check whether the
Empirical Rule applies to this data set.

I.2 - Displaying Descriptive Statistics in Minitab


Let's perform some basic operations in Minitab. Some of the
examples below are repeats of what we did by hand in earlier
lessons while others are new. We saw previously how you can
enter data into the Minitab worksheet by hand; we will now walk
through how to load a dataset into Minitab from an Excel file.
Loading Data into Minitab from an Excel File
Right click and save this Excel spreadsheet
file, MinitabIntroData.xlsx. Save the file locally (if using Minitab
installed on the computer you are using) or save the file in your
PASS space if using WebApps.
Open Minitab, then using the Minitab menus at the top of the
application, select the option:
'File' > 'Open worksheet'.
In the Files of Type field click the drop down arrow and select
'Excel'.
In the Look In field use the drop down arrow to locate the
saved Excel data file.
Double click the file and the data should open in the Minitab
worksheet (the window that looks similar to an Excel
spreadsheet).
With the data in the Minitab worksheet you can then perform any
number of procedures. First we obtain some basic descriptive
statistics.
Descriptive Statistics
With the data from the Excel spreadsheet file loaded into your Minitab
worksheet window, you should notice that all columns are labeled
Cx where the x is a number. Some of these are followed by a -T.
Those columns with the -T indicate that the data in this column are
considered text or categorical data. Otherwise, Minitab recognizes
the data as quantitative. If the operation you conduct in Minitab
only functions on a certain variable type (e.g. calculating the mean
can only be done on quantitative data) then only columns of that
data type will be available to use for those operations.
Example: Using the Hours Data from Previous Practice
Problems
Let's use Minitab to calculate the five number summary,
mean and standard deviation for the Hours data
(contained in MinitabIntroData.xlsx). As you will see, Minitab
by default will provide some added information.
1. At top of the Minitab window select the menu option 'Stat' >
'Basic Statistics' > 'Display Descriptive Statistics'
2. Once this dialog box opens your cursor should be blinking in the
'Variables' window. If not, simply click inside this part of the dialog
box. The only variables you should see in the left side window are
columns of quantitative data (the two price columns, age, and
hours). To enter a variable from the left hand window into the
Variables window you can either double-click that variable or click
the variable to highlight it and then click the 'Select' button. Do so
with the variable 'Hours'.

3. With the variable 'Hours' in the 'Variables' window click the 'OK'
button.
You should now find the following output in the Session window
above the worksheet.

The mean, standard deviation (StDev), etc. should be the same
values as those calculated in the practice problems. Minitab also
gives the size of the sample used to create these statistics (N), and
the number of observations from this data that were missing (N*).
These statistics are the default statistics. Additional basic
descriptive statistics are also available, such as the trimmed mean and
the coefficient of variation (CV).
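For readers who want to cross-check the Minitab output without the software, the same default statistics can be sketched with Python's standard statistics module. The values are the 50 completion times listed in the earlier practice problem. The quartiles use the module's default "exclusive" method, which places Q1 and Q3 by the (n + 1) position rule. Whether this matches Minitab's quartile rule in every case is an assumption here, not something the lesson states.

```python
import statistics

# Hours data from the earlier practice problem (50 students).
hours = [
    16, 8, 33, 21, 34, 17, 12, 14, 27, 6,
    33, 25, 16, 7, 15, 18, 25, 29, 19, 27,
    5, 12, 29, 22, 14, 25, 21, 17, 9, 4,
    12, 15, 13, 11, 6, 9, 26, 5, 16, 5,
    9, 11, 5, 6, 5, 23, 21, 10, 17, 15,
]

n_obs = len(hours)
mean = statistics.mean(hours)
st_dev = statistics.stdev(hours)  # sample standard deviation
q1, median, q3 = statistics.quantiles(hours, n=4)  # default method="exclusive"
minimum, maximum = min(hours), max(hours)

print(f"N = {n_obs}, Mean = {mean}, StDev = {st_dev:.2f}")
print(f"Min = {minimum}, Q1 = {q1}, Median = {median}, Q3 = {q3}, Max = {maximum}")
```

The mean and standard deviation should agree with the values quoted in the practice problem (16 and 8.45).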

Example: Obtaining Coefficient of Variation (CV)


Let's get the CV values for the Price per Sheet and Price per
Roll example found in an earlier lesson (data contained
in MinitabIntroData.xlsx).
1.Open Minitab and return to 'Stat' > 'Basic Statistics' > 'Display
Descriptive Statistics'.
2.Enter both variables into the Variables window. That is, both
'Price_Roll' and 'Price_Sheet' should be in the Variables window.
3.Click the 'Statistics' tab and then check the box for 'Coefficient
of Variation' (notice the other statistics available!) and click
'OK' .
4.Click 'OK' again.
The output will include the same statistics as the example above
plus the CV values (titled 'CoefVar').

One can also get statistics of a quantitative variable broken down
by levels of a categorical variable.
Example: Basic Statistics of a Quantitative Variable by the
Levels of a Categorical Variable
To see this we will use the data for 'Age' and 'Sex' from
the MinitabIntroData.xlsx file.
1.Open Minitab and return to 'Stat' > 'Basic Statistics' > 'Display
Descriptive Statistics'.
2.Enter the variable Age into the Variable window.
3.Click inside the By variables (optional) window. Any
categorical variables in the worksheet (e.g. 'Sex') should now
display in the box on the left.
4.Select the variable 'Sex' for the By variables window.

5.Click 'OK'.
Minitab will now display the 'Age' statistics (including CV since we
had that statistic selected from our last operation) for each
category of 'Sex'.

I.3 - Drawing Graphs in Minitab


Creating graphs in Minitab is very straightforward. Graphing options
are located under the 'Graph' menu in the list of menu choices at
the top of the Minitab application. Click the 'Graph' menu option and a long
list of graphs will appear. On this page we will simply create a
boxplot for the Hours data and side-by-side boxplots for the 'Age'
by 'Sex' data found in MinitabIntroData.xlsx.
Example: Boxplot for the Hours Data
With the MinitabIntroData data file open in the Minitab
worksheet:
Select 'Graph' > 'Boxplot' > 'One Y-Simple' and click 'OK'.
Select the 'Hours' variable and move into the Graph Variables
window.
Click 'OK'

Minitab will now display the boxplot for this data.

If you place your computer's mouse over the box portion of the
plot some statistics will pop up (Q1, median, Q3, IQR, the values to
which the whiskers extend, and the sample size N). If there are any
outliers according to the methods outlined in the previous lesson, these
will be marked with an * in the plot.
Example: Side-by-Side Boxplot for the Age by Sex Data
Again, with the MinitabIntroData data file open in the
Minitab worksheet:
Select 'Graph' > 'Boxplot' > 'One Y-With Groups' and click 'OK'.
Select the 'Age' variable and move into the 'Graph Variables'
dialog box.
Click inside the Categorical variables for grouping window.
Any categorical variables in the worksheet (e.g. Sex) should
now display in the left side box.
Select the variable 'Sex' for the Categorical variables for
grouping window.

Click 'OK'.
Minitab will now display inside one frame two boxplots: one for
Females and another for Males.

Create other graphs using the data found
in MinitabIntroData.xlsx. Use the graphs to help you explore what
the data distributions of the different variables look like.

I.4 - Using the Calc Function in Minitab


Finally, let us consider how we can use the Calc function in the
menu options to easily create a new set of data.
Example: Using the Calc Function to Create New Data
First, in any empty column of the worksheet starting in row
1 for that column, type in the five data values:
1, 2, 3, 4, 5.
Just to be clear, do not include the commas or the period! In
essence you are creating a new column of data with five
observations.
To create another column where we just add a constant of 1 to
each of these values:
1. From the menus at the top of the window, select 'Calc' > 'Calculator'.
2. In the text box for Store result in variable you can type in any
word, which will serve as the variable name for our new set of
data. For this example, type the word
'Plus1' in this text box.
3. Double-click, or highlight and click the 'Select' button, to put the
variable of interest into the Expression window. In this case
column C1 contains our five new observations.
4. With your cursor active in the Expression window click the +
and then 1 on the keypad. This should create an expression
such as C1 + 1.
5. Click 'OK'.
You should now find a new column in the spreadsheet window with
the values 2, 3, 4, 5, 6.

Just for kicks, use the steps described previously to have Minitab
calculate the descriptive statistics for these two variables, the
column of entered data and the column of calculated 'Plus1' data.
You should find that the mean is 3 for the entered data
and 4 for the 'Plus1' data; however, the standard deviations for
both sets of data remain the same.
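The same experiment is easy to reproduce outside Minitab. The short Python sketch below mirrors the 'Plus1' example. The variable names are illustrative, and the sample standard deviation is assumed.

```python
import math
import statistics

entered = [1, 2, 3, 4, 5]            # the five values typed into the worksheet
plus1 = [x + 1 for x in entered]     # equivalent of the expression C1 + 1

mean_entered = statistics.mean(entered)
mean_plus1 = statistics.mean(plus1)
sd_entered = statistics.stdev(entered)
sd_plus1 = statistics.stdev(plus1)

print(f"Entered: mean = {mean_entered}, StDev = {sd_entered:.4f}")
print(f"Plus1:   mean = {mean_plus1}, StDev = {sd_plus1:.4f}")
```

Adding a constant shifts every observation, and therefore the mean, by the same amount, so the deviations from the mean (and hence the standard deviation) do not change.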

I.5 - Practice Problems


1. Draw two appropriate graphs for the following data:
University officials periodically review the distribution of
undergraduate majors within the colleges of the university to help
determine a fair allocation of resources to departments within the
colleges. At one such review, the following data were obtained
(majors.txt):

College                   Number of majors
Agriculture               1,500
Arts and Sciences         11,000
Business Administration   7,000
Education                 2,000
Engineering               5,000

HINT!: How to Create a Simple Pie Chart and Simple Bar Chart

2. Draw a frequency histogram and a relative frequency histogram

Draw a frequency histogram and a relative frequency histogram for
the 1994 annual per capita city tax (city_tax.txt) given below:
2470, 520, 561, 488, 986, 359, 1305, 512, 467, 270, 360, 451, 4904, 572, 498, 382, 271, 634, 1682,
784, 298, 643, 947, 686

HINT!: How to Create a Simple Histogram

3. Draw a stem-and-leaf plot

Draw the stem-and-leaf diagram for the following data set
(stem_leaf.txt) using Minitab and use the stem-and-leaf diagram
to find the median of the data set:
11 11 12 13 14 14 14 14 15 15 15 16 16 17 17 19 19 20 22 25

4. Draw a stem-and-leaf plot


Draw the stem-and-leaf diagram for city tax data given in practice
problem 2 above and use the diagram to find the median of the
data.
5. Selecting the proper diet for brook trout or other fresh water fish
is an important aspect of fish farming. A researcher wants to
estimate the mean weight of brook trout maintained on a specific
diet for a period of 6 months. One hundred brook trout are selected
from a fisheries tank and each is weighed.
a. What is the population of measurements that is of interest to the
researcher?
b. What is the sample?
c. What characteristics of the population are of interest to the
researcher?
d. If the sample measurements are used to make inferences about
certain characteristics of the population, why is a measure of the
reliability of the information important?
6. In a packing plant, a machine packs cartons with jars. The times it
takes each machine to pack 10 cartons are recorded. The results
(machine.txt), in seconds, are shown in the following table:

New machine                     Old machine

42.1 41.3 42.4 43.2 41.8 42.7 43.8 42.5 43.1 44.0
41.0 41.8 42.8 42.3 42.7 43.6 43.3 43.5 41.7 44.1

a. Compute the mean and standard deviation for the time to pack a
carton for each machine.
b. Plot the data for each machine.
c. Describe the data for the two machines.
7. The College of Dentistry at the University of Florida has made a
commitment to develop its entire curriculum around the use of
self-paced instructional materials such as videotapes, slide tapes, and
so on. It is hoped that each student will proceed at a pace
commensurate with his or her ability and that the instructional staff
will have more free time for personal consultation in student-faculty
interaction. One such instructional module was developed and tested
on the first 50 students proceeding through the curriculum. The
following measurements represent the number of hours it took these
students to complete the required modular material:
16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 4 5 23 21 10 17 15
Here is a link to the data (hours.txt) for the time it took students
to complete the required material.

a. Calculate the mode, the median, and the mean for these
recorded completion times.
b. Guess the value of s.
c. Compute s by using the shortcut formula and compare your
answers to that of part (b) above.
d. Do we expect the Empirical Rule to describe adequately the
variability of these data? Explain.
e. Construct the intervals and check whether the Empirical Rule
applies to this data set.

Probability Distributions and Minitab


We cannot use Minitab for general discrete distributions but we can use it for binomial distributions. We
can also use Minitab for normal distributions.
Either distribution can be found in Minitab under the Calc > Probability Distributions list. The list contains
many different distributions and we will begin with binomial and normal. If you select either one, you will
be presented with a pop-up window that contains three radio buttons labeled:
Probability
Cumulative probability
Inverse cumulative probability
We use Probability to find exact probabilities: P(X = x). This
is applicable for binomial but not for normal. The reason comes
from calculus. Because the normal distribution is continuous, the area under the curve at an
exact point (i.e. X = x) is equal to zero. This result follows from using integration to find the
area under a curve and then evaluating the integral from point a to point b. At X = x
the two points of evaluation are the same, so there is zero area under an exact point.
We use Cumulative probability to find "less than or equal to" probabilities: P(X ≤ x). Remember
for discrete distributions (e.g. binomial) the use of the equality is important as there is a difference
between, for instance, saying "less than 3 students" and "less than or equal to 3 students". On the other
hand, for continuous distributions (e.g. normal) there is no distinction; the use of the equal sign is not
relevant. The reason relates to the explanation above for Probability: the equality corresponds to the exact
observation of a value, for which the area under the curve is zero.
We use the Inverse cumulative probability to find the value of X that produces some cumulative
probability. For instance, in the normal distribution we would use this option when we wanted to find the
observed score for some specified percentile.
In general, we will use the first two radio buttons for binomial and the latter two, although primarily
Cumulative probability, for normal.
II.1 - Finding Binomial Distribution Probabilities with Minitab

Binomial Examples
Referring back to the FBI Crime Survey Example in the binomial lesson, we
had the probability of 0.2 that a randomly selected property crime is solved
and we had three such crimes committed. Let's use Minitab this time to find
the probability that:
1. Exactly one of the three crimes is solved.
2. At least one of the three crimes is solved.
In Minitab, go to Calc > Probability distributions > Binomial.
To solve question 1 we are asked to find P(X = 1), so we will select
the Probability radio button and enter the following:
Number of trials: 3
Event probability: 0.2
Click radio button for 'Input constant': 1
Then click "OK".
The result is as follows:

Reading the output we can see that the number of trials was 3, the probability of success was 0.2, and we
wanted to find P(X = 1) = 0.384. [NOTE: this is what we found by hand earlier.]
To solve question 2 we are asked to find P(X ≥ 1). Since the software does not find "greater than"
probabilities we need to use the complement rule: P(X ≥ 1) = 1 - P(X < 1) = 1 - P(X = 0).
This time we will again select the Probability radio button and enter the following:
Number of trials: 3
Event probability: 0.2
Click radio button for 'Input constant': 0
Then click "OK".
The result is:

Reading the output we can see that the number of trials was 3, the probability of success was 0.2, and we
wanted to find P(X = 0) = 0.512. Subtracting this probability from 1 we have our answer
to P(X ≥ 1), which is 0.488.
NOTE: The usual mistakes students make are to not set the problem up correctly (e.g. use Probability
when it should be Cumulative and vice versa), to incorrectly include the equality when using the complement,
or to simply forget to subtract from one when necessary.
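The two crime-survey probabilities can also be reproduced from the binomial formula directly. The following is a minimal Python sketch, not Minitab's implementation. The helper name binomial_pmf is an illustrative choice.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Three property crimes, each solved with probability 0.2.
p_exactly_one = binomial_pmf(1, 3, 0.2)   # 'Probability' with input constant 1
p_none = binomial_pmf(0, 3, 0.2)          # 'Probability' with input constant 0
p_at_least_one = 1 - p_none               # complement rule

print(f"P(X = 1)  = {p_exactly_one:.3f}")
print(f"P(X = 0)  = {p_none:.3f}")
print(f"P(X >= 1) = {p_at_least_one:.3f}")
```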
Practice Problems
1. You are given a 25 question exam for which you are not prepared (i.e. you will be guessing).
Each question involves True/False answer choices where only one choice is correct.
First, is this a binomial situation?
Does it have a fixed number of trials? YES - 25.
Two outcomes, success and failure? YES - right and wrong.
Equal chances of success? YES - 0.5
Is each trial independent? YES, we'll assume that how you answer one question does not affect your
answer to another question.
OK, now that those assumptions have been met, let's use Minitab to answer the following
questions:
A) What is the probability that you get exactly 20 correct?
So, this is solving for P(X = 20). We will select the Probability radio button and enter
25 for number of trials,
0.5 for event probability, and
20 as the input constant.
The output looks like:

The answer is 0.0015834, meaning that you have roughly a 0.0016 (or about 1.6 in 1,000) chance of
getting exactly 20 right in such a situation.
B) What is the probability that you get at least 20 correct?
As we saw above, we need to solve this differently. We will solve for
P(X ≥ 20) = 1 - P(X < 20) = 1 - P(X ≤ 19). We must select the Cumulative
probability radio button and enter:
25 for number of trials,
0.5 for event probability, and
19 as the input constant.
The output looks like:

The answer is 1 - 0.997961 = 0.002039, which means that you have a
slightly better chance of answering at least 20 questions correctly in such a situation
than you did of answering exactly 20: roughly a 2 in 1,000 chance.
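As a cross-check on the exam problem, the 'Probability' and 'Cumulative probability' options can be imitated with a short Python sketch. The helper names binomial_pmf and binomial_cdf are illustrative. The cumulative value is simply the sum of the exact probabilities from 0 up to the input constant.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): the 'Probability' option."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k): the 'Cumulative probability' option."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# 25 True/False questions answered by guessing.
p_exactly_20 = binomial_pmf(20, 25, 0.5)
p_at_most_19 = binomial_cdf(19, 25, 0.5)
p_at_least_20 = 1 - p_at_most_19   # complement: note 19, not 20

print(f"P(X = 20)  = {p_exactly_20:.7f}")
print(f"P(X <= 19) = {p_at_most_19:.6f}")
print(f"P(X >= 20) = {p_at_least_20:.6f}")
```

Using 19 rather than 20 as the cumulative input is the step most often missed: for a discrete distribution, "at least 20" excludes only the outcomes 0 through 19.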
II.2 - Finding Normal Distribution Probabilities with Minitab

Normal Examples
Here we will use Minitab to find the probabilities for two of the problems from the practice examples we
saw earlier.
To find the probabilities associated with normal distributions in Minitab, go to Calc > Probability
distributions > Normal. The default is set to a standard normal (i.e. we have a z-score), where the
mean is 0 and the standard deviation is 1.
Example: Intelligence Scores for Children
1. The scores of a reference population on the Wechsler Intelligence Scale for Children (WISC)
are normally distributed with μ = 100 and σ = 15. Our question is, "What score will
separate a child in the top 5% of the population from the bottom 95%? What do we call this
value?"
To solve this question we are asked to find x such that P(X ≤ x) = 0.95. That is, we want to find
the score that would result in a child falling at the 95th percentile. (NOTE: this would be
equivalent to finding x such that P(X ≥ x) = 0.05, or being in the top 5%).
In Minitab go to Calc > Probability distributions > Normal and select the radio
button for Inverse cumulative probability since we want to find the observed
score associated with a given cumulative probability: 0.95.
For mean enter 100 and for standard deviation enter 15. Click the radio button for
'Input Constant' and enter the cumulative probability of 0.95 and then click 'OK'. The result is as
follows:

As we see, this result is very near that which we found by hand earlier: 124.6.
The interpretation is that an IQ score of about 125 is needed for that child to fall in the top 5%.
2. A class has 16 children and they are from the reference population in problem 1 above. One
child is randomly picked from the class. What is the probability that the IQ of the child is more
than 110?
To solve this question we are asked to find P(X > 110). This means we will have to
solve using the complement, or 1 - P(X ≤ 110). But remember, the equality is not
relevant in regards to finding the probability since we are talking about a continuous
distribution!
In Minitab go to Calc > Probability distributions > Normal and select the radio
button for Cumulative probability and enter the following:
For mean enter 100 and for standard deviation enter 15. Click the radio button for
Input Constant and enter 110 and click OK. The result is as follows:

We then take this P(X ≤ 110) = 0.747507 and subtract from one to get our
final probability of 0.252493, which resembles our final result using the standard normal table:
0.2514.
The difference we are finding here is from the rounding we did to get the z-score to use the
table. The interpretation is that we have about a 25% chance of randomly selecting a child with
an IQ greater than 110.
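Both WISC calculations can be verified with Python's standard library. In this sketch, statistics.NormalDist stands in for Minitab's Normal dialog: inv_cdf plays the role of 'Inverse cumulative probability' and cdf the role of 'Cumulative probability'. The variable names are illustrative.

```python
from statistics import NormalDist

# Reference population: mean 100, standard deviation 15.
wisc = NormalDist(mu=100, sigma=15)

# Problem 1: score at the 95th percentile (inverse cumulative probability).
score_95th = wisc.inv_cdf(0.95)

# Problem 2: P(X > 110) = 1 - P(X <= 110) (cumulative probability, then complement).
p_at_most_110 = wisc.cdf(110)
p_above_110 = 1 - p_at_most_110

print(f"95th percentile score: {score_95th:.2f}")
print(f"P(X <= 110) = {p_at_most_110:.6f}")
print(f"P(X > 110)  = {p_above_110:.6f}")
```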
II.3 - Finding Sampling Distributions in Minitab

Sampling Distributions
For sampling distributions we again will focus on using the normal distribution. The key distinction will
be that instead of inputting the standard deviation we will use the standard error. Again, to illustrate we
will use two problems from the sampling distribution practice problems.
Example: JCrew
The company JCrew advertises that 95% of its online orders ship within two working days. You
select a random sample of 200 of the 10,000 orders received over the past month to audit. The
audit reveals that 180 of these orders shipped on time. If JCrew really ships 95% of its orders on
time, what is the probability that the proportion in a random sample of 200 orders is as small as or
smaller than the proportion in the audit?
We already verified that the sample proportion meets the conditions needed in order to apply
normal approximation methods. Once this is verified, the question asks us to find the probability
of getting a sample proportion of 0.9 or less if the true 'ship on time' proportion is 0.95. Recall,
we already had calculated a 0.015 standard error.
Using Minitab, we will again go to Calc > Probability distributions > Normal.

We will select the radio button for Cumulative probability and enter the following:
For mean enter 0.95 and for standard deviation enter 0.015.
Click the radio button for Input Constant and enter 0.9
Then click OK. The result is as follows:

Remember, when we used the standard normal table the best we could do was a probability less
than 0.001, which Minitab has verified with the smaller value 0.0004291. When we did this by
hand we came up with a z-score of -3.33, which was not on the table, so we used -3.09.
Alternatively, we could have used this z-score to find our answer by using the Minitab default
values, which as we mentioned at the beginning are in standard normal format. If one has the
z-score, one simply needs to plug in the z-score as the input constant (but remember to change the
mean to 0 and the standard deviation to 1).
Example: Tire Lifetime
Penn State Fleet, which operates and manages car rentals for Penn
State employees, found that the tire lifetime for their vehicles has a
mean of 50,000 miles and standard deviation of 3500 miles. What
is the probability that the sample mean lifetime for a sample of 50
vehicles exceeds 52,000 miles?
We already verified that the sample mean meets the conditions
needed in order to apply normal approximation methods. Once this
is verified, the question asks us to find the probability of getting a
sample mean greater than 52,000 miles if the true mean tire lifetime is 50,000 miles. Recall we
had already calculated a standard error of 495.
Using Minitab, again go to Calc > Probability distributions > Normal. Select the
radio button for Cumulative probability and enter the following:
For mean 50000 and for standard deviation enter 495.
Click the radio button for 'Input Constant' and enter 52000
Then click OK. The result is as follows:

Not done yet! The problem asks for greater than 52000 and what we have is less than 52000.
Therefore we need one final step of subtracting this probability from one. The probability of
getting a sample mean greater than 52000 from a sample of 50 vehicles would be 0.000027, which is
consistent with our result when using the table.
Remember, when we used the standard normal table the best we could do was a probability less
than 0.001, which Minitab has verified as smaller. When we did this by hand we came up with a
z-score of 4.04, which was not on the table, so we used 3.09. Alternatively, we could have used this
z-score to find our answer by using the Minitab default values, which as we mentioned at the
beginning are in standard normal format. If one has the z-score, one simply needs to plug in the
z-score as the input constant (but remember to change the mean to 0 and the standard deviation
to 1).
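Both sampling-distribution examples follow the same pattern: compute the standard error, then use it in place of the standard deviation. The Python sketch below illustrates this under the same normal-approximation assumption. The variable names are illustrative, and the rounded standard errors quoted in the lesson (0.015 and 495) are used so that the results line up with the Minitab values.

```python
from math import sqrt
from statistics import NormalDist

# JCrew: sample proportion with true p = 0.95 and n = 200.
se_prop = sqrt(0.95 * (1 - 0.95) / 200)      # about 0.0154, rounded to 0.015 in the lesson
p_jcrew = NormalDist(0.95, 0.015).cdf(0.90)  # P(sample proportion <= 0.90)

# Tire lifetime: sample mean with mu = 50000, sigma = 3500, n = 50.
se_mean = 3500 / sqrt(50)                    # about 495
p_tire = 1 - NormalDist(50000, 495).cdf(52000)  # P(sample mean > 52000)

print(f"SE (proportion) = {se_prop:.4f}, P(p-hat <= 0.90) = {p_jcrew:.7f}")
print(f"SE (mean) = {se_mean:.1f}, P(x-bar > 52000) = {p_tire:.6f}")
```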

Lesson 6 - Confidence Intervals for Population Proportions and Population Means

This lesson starts with the basic concept of using confidence intervals to understand and perform
inference. We then talk about how to find confidence intervals for one population proportion. The
important issue of determining the required sample size to estimate a population proportion will also be
discussed in detail in this lesson.
Estimating the population mean is one of the most common and important questions one comes across in
practice. In this lesson, we will also talk about the confidence interval for a population mean when the
population standard deviation is unknown. We will also explain how to determine the number of
observations to be included in the sample.

Lesson 6 Objectives
Upon successful completion of this lesson, you will be able to:
understand the reason for estimating with a confidence interval.
calculate confidence intervals for population proportions.
interpret a confidence interval.
know the meaning of margin of error and its use.
compute sample sizes for different experimental settings.
know when and how to use a t-interval to estimate the population
mean.
compute sample sizes for estimating the population mean.

6.6 - Using Software for Confidence Intervals

Minitab Commands to Find the Confidence Interval for a Population Proportion

1. Stat > Basic Statistics > 1 proportion.
2. From the drop down box select the Summarized data option. (If you have
the raw data you would use the default drop down of One or more samples, each in
a column.)
3. Enter the number of successes in the Number of Events text box, and the sample size in the
Number of Trials text box.
4. Click the Options button. The default confidence level is 95. If you desire another confidence
level, edit appropriately.
5. To use the z-interval method choose Normal Approximation from the Method text box. The
exact interval is always appropriate and is the default. Under the conditions
that np̂ ≥ 5 and n(1 - p̂) ≥ 5, one can also use the z-interval to approximate the
answers. The exact interval and the z-interval should be very similar when the conditions are
satisfied.
6. Click OK and OK again.
Example: Presidential Approval Rating
Referring to our presidential approval rating example at the beginning
of this lesson, we will use Minitab to verify our by-hand results.
Recall in that example a random sample of 1500 was taken from the
population of U.S. adults, with 660 responding with a positive approval. In Minitab,
following the steps above, we would enter 660 for the Number of Events and 1500 for the
Number of Trials. The confidence level was 95% and we satisfied the necessary conditions to
use the Normal Approximation (or z-interval) method. The results are:

Test and CI for One Proportion
Sample   X     N      Sample p   95% CI
1        660   1500   0.440000   (0.414880, 0.465120)
Using the normal approximation.

These results closely match our by-hand interval of 0.415 to 0.465.
What if we had calculated the exact confidence interval (i.e. did not choose Normal
Approximation as the method)? With the exact method the interval is (0.414685, 0.465550),
which agrees with the normal approximation to within 0.001 in this case. You will notice that in the
output Minitab does provide a notification that the normal approximation was used.

How different can the normal approximation and exact intervals be when conditions are not
satisfied? Consider this example: In estimating the proportion of premature babies born at the local
hospital, a random sample of 10 newborn babies was taken in which 3 were born prematurely. Find a 90%
confidence interval for the true proportion of premature babies born at the hospital.
As we can see, the conditions to use the normal approximation method are not satisfied; we only have 3
successes and we need at least 5. If we used normal approximation methods (note that we are constructing
90% intervals now), Minitab produces an interval of (0.061638, 0.538362). If the exact method were
used, the interval is (0.087264, 0.606624). The intervals are not nearly as close as in the first example.
Also note how wide the intervals are in this second example. This is a direct result of the small sample
size. The smaller n produces a much larger standard error, which increases the width of an interval.

Minitab Commands to Find the Confidence Interval for a Population Mean (sigma
unknown)
1. Stat > Basic Statistics > 1-Sample t.
2. From the drop down box select the Summarized data option. (If you have
the raw data you would use the default drop down of One or more samples, each in a column.)
3. Enter the sample size, sample mean, and sample standard deviation in their respective text
boxes.
4. Click the Options button. The default confidence level is 95. If you desire another confidence
level, edit appropriately.
5. Click OK and OK again.
Example: Emergency Room Wait Time
Referring to our prior example of average emergency room wait
time from our discussion on confidence intervals for a population
mean, our by-hand calculations produced a 95% confidence
interval of 24.28 to 35.72 minutes. Recall the following for that
example: sample size 50, sample mean 30, and sample standard
deviation 20. In Minitab following the above steps, we get a 95%
confidence interval:
One-Sample T
N    Mean  StDev  SE Mean          95% CI
50  30.00  20.00     2.83  (24.32, 35.68)
The slight discrepancy between the estimates is due to our by-hand calculation using the t-value
associated with 40 degrees of freedom, since the table did not include a d.f. of 49. Minitab used
the t-value for the actual 49 degrees of freedom. With the larger degrees of freedom comes a
smaller t-value. This results in a smaller margin of error and a narrower interval, which is
precisely what we have here.
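The effect of the degrees of freedom can be checked numerically. The sketch below is an
illustration only: Python's standard library has no t distribution, so the helpers t_pdf, t_cdf
and t_ppf are our own (Simpson's rule plus bisection) and are not how Minitab computes the value.
A statistics library would normally supply the quantile directly.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_ppf(p: float, df: int) -> float:
    """Inverse CDF by bisection on t_cdf."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_interval(n: int, mean: float, sd: float, df: int, confidence: float = 0.95):
    """t-interval from summary statistics, using the given degrees of freedom."""
    t_star = t_ppf(1 - (1 - confidence) / 2, df)
    margin = t_star * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Emergency room example: n = 50, sample mean 30, sample standard deviation 20.
by_table = t_interval(50, 30, 20, df=40)  # table row used in the by-hand work
by_exact = t_interval(50, 30, 20, df=49)  # the actual n - 1 degrees of freedom
print("df = 40:", by_table)
print("df = 49:", by_exact)
```

The df = 49 interval is slightly narrower than the df = 40 interval, in line with the explanation
above.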
Using Minitab to Check Normality
This Minitab process was presented in the lesson on finding confidence intervals
for a population mean. It is repeated here for convenience. For a small sample,
if the distribution is not normal or if there are outliers, then one needs to use other
procedures such as nonparametric methods. Thus, if the sample size is less than 30, one needs to
use a normal probability plot to check whether the sample may come from a normal distribution
and then follow the above guideline to determine whether one can use the t-interval.
1. Graph > Probability Plot > Simple (note: if we have summarized data only we cannot plot the
data!)
2. Select the column that contains the data you want to graph.
3. Click OK.
Example: Rattlesnake Lengths
It is very time consuming to find rattlesnakes and nerve racking to
measure them. A scientist randomly finds 12 snakes from the
Central Pennsylvania area and measures their length (snakes.txt).
The following twelve measurements in inches are obtained:
40.2 43.1 45.5 44.5 39.5 38.5
40.2 41.0 41.6 43.1 44.9 42.8
The sample size is only 12. Let's do a normal probability plot to check whether the data may
come from a normal distribution. What do you conclude about whether they may come from a
normal distribution?
[Figure: normal probability plot of the rattlesnake lengths]
Since the points all fall within the confidence limits, there is no evidence to suggest that the data
do not come from a normal distribution.
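The coordinates behind such a plot can also be computed directly. The sketch below pairs the
sorted lengths with theoretical normal quantiles and reports how close the points are to a
straight line. It is an illustration under stated assumptions: the plotting positions
(i - 0.375) / (n + 0.25) are one common choice and may differ from Minitab's, and the correlation
is a simple numeric summary, not the confidence limits shown on the Minitab graph.

```python
from math import sqrt
from statistics import NormalDist, mean

lengths = [40.2, 43.1, 45.5, 44.5, 39.5, 38.5,
           40.2, 41.0, 41.6, 43.1, 44.9, 42.8]


def normal_scores(n: int) -> list:
    """Standard normal quantiles at positions (i - 0.375) / (n + 0.25)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]


def pearson_r(xs, ys) -> float:
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


ordered = sorted(lengths)
scores = normal_scores(len(ordered))
r = pearson_r(scores, ordered)

# Each pair is one point on the probability plot.
for z, x in zip(scores, ordered):
    print(f"{z:6.2f}  {x:5.1f}")
print(f"correlation of the plotted points: {r:.3f}")
```

A correlation close to 1 means the points lie near a straight line, which is what we look for
when judging normality by eye.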
6.7 - Practice Problems

1. Many individuals over the age of 40 develop intolerance for milk and milk-based products. A dairy has
developed a line of lactose-free products that are more tolerable to such individuals. To assess the potential
market for these products, the dairy commissioned a market research study of individuals over 40 years
old in its sales area. A random sample of 250 individuals showed that 86 of them suffer from milk
intolerance. Based on the sample results, calculate a 90% confidence interval for the population proportion
that suffers from milk intolerance. Interpret this confidence interval.
a) First, show that it is okay to use the 1-proportion z-interval.
b) Calculate by hand a 90% confidence interval.
c) Provide an interpretation of your confidence interval.
d) If the level of confidence was 95% instead of 90%, would the resulting interval be narrower or wider?
Explain.
e) If the researchers were interested in a 90% interval with a 3% margin of error, what sample size would
they require, assuming sample costs are high and the response rate is 80%?
f) Verify your 90% confidence interval in Minitab.
2. Consumer Reports tested 15 brands of vanilla yogurt and found the following numbers of calories per
serving: 160, 200, 220, 230, 120, 180, 140, 130, 170, 180, 80, 120, 100, 170, 190 (yogurt.txt). The
sample statistics were 159.3 for the sample mean and 43.5 for the sample standard deviation.
a) By hand, place a 99% confidence interval on the average number of calories per serving for vanilla
yogurt.
b) Provide an interpretation of your interval.
c) Use Minitab to find the interval and to check the assumption of normality. Is the assumption satisfied?
Explain.

7.6 - Using Software for Hypothesis Testing

Unit Summary
Conducting a One-Proportion Z-test in Minitab
Conducting a One-Mean t-test in Minitab
Finding Exact Critical Value for a One-Mean t-test in Minitab
Note about Software
In general, as we will learn, software usually performs tests using the p-value method. That is, the output
from software will provide the test statistic and the p-value, along with some other general information,
e.g. a confidence interval. To perform rejection region tests you would need to find the critical values
from the tables. However, at the end of this lesson we demonstrate how to find the exact critical
value from the t distribution, i.e. the t-value that corresponds to the degrees of freedom when they are
not on the table.
Conducting a One-Proportion Z-test
Note: these steps are very similar to those for one-proportion confidence interval.
The differences occur in steps 4 and 5b.
1. Click Stat > Basic Statistics > 1 Proportion.
2. In the drop-down box use "One or more samples, each in a column" if you have the raw data,
otherwise select "Summarized data" if you only have the sample statistics.
3. If using the raw data enter the column of interest into the blank variable window below the drop
down selection. If using summarized data enter the number of successes for "Events" and the
sample size for "Trials".
4. Click the check box for "Perform hypothesis test" and enter the null hypothesis value.
5. Click Options.
   a. Enter the confidence level associated with alpha (e.g. 95% for alpha of 5%).
   b. From the drop-down list for "Alternative hypothesis" select the correct alternative.
   c. If conditions are satisfied to perform a z-test for one proportion, select "Normal
   approximation" from the "Method" field.
6. Click OK and OK.
Example: Penn State Students from Pennsylvania
Recall our one-proportion example at the beginning of this lesson
on whether a majority of Penn State students are from
Pennsylvania. In that example, we took a random sample of 500
Penn State students and found that 278 are from Pennsylvania. Can
we conclude that the proportion is larger than 0.5 at a 5% level of
significance? Also recall that in that example we found by hand a test
statistic of Z* = 2.504 and a p-value of 0.0062.
Our hypotheses were: H0: p = 0.5 and Ha: p > 0.5

Using Minitab, we would select Stat > Basic Statistics > 1 Proportion. Choose the summarized data
option and enter 278 for "Events" and 500 for "Trials". Check the box for Perform
Hypothesis Test and enter the null value of 0.5. Click Options. With our stated alpha value of
5% we keep the default confidence level of 95. Select "Proportion > hypothesized proportion"
from the Alternative Hypothesis list. Since we verified that the conditions were satisfied, select
Normal Approximation under Method. Click OK and OK again. The output is:
Test and CI for One Proportion
Test of p = 0.5 vs p > 0.5
Sample    X    N  Sample p  95% Lower Bound  Z-Value  P-Value
1       278  500  0.556000         0.519451     2.50    0.006
Using the normal approximation.
As the output indicates, our by-hand calculations were very accurate!
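The same test statistic and p-value can be recomputed outside Minitab. The following is a minimal
sketch using only Python's standard library; the function name one_proportion_z_test is our own,
and it covers only the right-tailed alternative used in this example.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(events: int, trials: int, p0: float):
    """Return (z, p_value) for H0: p = p0 against Ha: p > p0."""
    p_hat = events / trials
    # The standard error uses the null value p0, not the sample proportion.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / trials)
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value


# Penn State example: 278 of 500 sampled students are from Pennsylvania.
z_stat, p_value = one_proportion_z_test(278, 500, 0.5)
print(f"Z = {z_stat:.2f}, p-value = {p_value:.3f}")
```

The by-hand p-value of 0.0062 comes from looking up Z rounded to 2.50 in the table, so it can
differ slightly from the value computed with the unrounded statistic.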
Conducting a One-Mean t-test
Note that these steps are very similar to those for one-mean confidence interval.
The differences occur in steps 4 and 5b.
1. Click Stat > Basic Statistics > 1 Sample t.
2. In the drop-down box use "One or more samples, each in a column" if you have the raw data,
otherwise select "Summarized data" if you only have the sample statistics.
3. If using the raw data enter the column of interest into the blank variable window below the drop
down selection. If using summarized data enter the sample size, sample mean, and sample
standard deviation in their respective fields.
4. Click the check box for "Perform hypothesis test" and enter the null hypothesis value.
5. Click Options.
   a. Enter the confidence level associated with alpha (e.g. 95% for alpha of 5%).
   b. From the drop-down list for "Alternative hypothesis" select the correct alternative.
6. Click OK and OK.
Example: Emergency Room Wait Time
Recall our emergency room wait time example where an
administrator at your local hospital states that on weekends the
average wait time for emergency room visits is 10 minutes. From
our random sample of 40 patients, the average wait time for these
40 patients was 11 minutes with a standard deviation of 3
minutes. We conducted the test at a 5% level of significance and
wanted to demonstrate that the average time exceeded 10 minutes. Also recall that in that example we
found by hand a test statistic of t* = 2.11 and a p-value between 0.01 and 0.025.
Our hypotheses were: H0: μ = 10 and Ha: μ > 10


Using Minitab, we would select Stat > Basic Statistics > 1 Sample t. Choose the summarized data
option and enter 40 for "Sample size", 11 for the "Sample mean", and 3 for the "Standard
deviation". Check the box for Perform Hypothesis Test and enter the null value of 10. Click
Options. With our stated alpha value of 5% we keep the default confidence level of 95. Select
"Mean > hypothesized mean" from the Alternative Hypothesis list. Click OK and OK again.
The output is:
One-Sample T
Test of μ = 10 vs > 10
N     Mean  StDev  SE Mean  95% Lower Bound     T      P
40  11.000  3.000    0.474           10.201  2.11  0.021
Again, as the output indicates, our hand calculations were quite good. Notice that Minitab
provides a more exact p-value of 0.021 which corresponds to our results as it falls within our
calculated range of 0.01 to 0.025.
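For readers who want to see where the more exact p-value comes from, the sketch below recomputes
the test statistic and the right-tail area. It is an illustration only: the helpers t_pdf and
t_cdf are our own, integrating the t density with Simpson's rule because Python's standard
library has no t distribution. It is not Minitab's algorithm.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def one_mean_t_test(n: int, mean: float, sd: float, mu0: float):
    """Return (t, p_value) for H0: mu = mu0 against Ha: mu > mu0."""
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    p_value = 1 - t_cdf(t_stat, n - 1)  # area to the right of t
    return t_stat, p_value


# Emergency room example: n = 40, sample mean 11, standard deviation 3, null value 10.
t_stat, p_value = one_mean_t_test(40, 11, 3, 10)
print(f"T = {t_stat:.2f}, P = {p_value:.3f}")
```

The computed p-value falls inside the by-hand range of 0.01 to 0.025, as the table lookup
suggested.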
Finding Exact Critical Value for a One-Mean t-test
Since the t-table is not as detailed as the z-table, we can only estimate the critical
value when the degrees of freedom are not found on the table. To obtain
the exact critical value for the rejection region approach, we
can use a statistical package such as Minitab.
Minitab commands to obtain critical value:
1. Calc > Probability Distributions > t-distribution
2. Choose the radio button for Inverse Cumulative Distribution (this finds the t-value
that produces the entered probability to the left of it).
3. Enter the correct degrees of freedom
4. Choose the radio button for "Input constant" and enter the alpha value (if one-sided
alternative) or alpha/2 (if two-sided alternative).
5. Click OK.
Example: Emergency Room Wait Time
Find the exact critical value for our emergency room example. Recall that by hand we had to
use the row with 35 degrees of freedom instead of the correct df of 39. In that example our
critical value for alpha of 5% was 1.69.
Go to Calc > Probability Distributions > t-distribution. Choose the radio button for Inverse
Cumulative Distribution. Enter 39 for degrees of freedom. Click the radio button for Input
Constant and enter 0.05. The output is as follows:
Inverse Cumulative Distribution Function
Student's t distribution with 39 DF
P( X <= x )         x
       0.05  -1.68488
This is where you need to be a little careful. Remember that our alternative was "greater than",
i.e. a right-tailed test. The output is the critical value for a left-tailed test. However, since the
t-distribution is symmetrical, the area to the left of -1.68488 is the same as the area to the
right of 1.68488. Therefore, the critical value for our test with 39 degrees of freedom is
1.68488, which doesn't differ much from the 1.69 we estimated using 35 degrees of freedom.
This is why the table skips going one by one after 30; there is little difference between the values
when increasing by only one degree of freedom.
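The left-tail versus right-tail point can be illustrated numerically. The sketch below inverts the
t cumulative distribution for 39 degrees of freedom at 0.05 and at 0.95. The helpers t_pdf, t_cdf
and t_ppf are our own (Simpson's rule plus bisection) and only stand in for the inverse cumulative
distribution function that a statistical package provides.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_ppf(p: float, df: int) -> float:
    """Inverse cumulative distribution: the x with P(T <= x) = p."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha, df = 0.05, 39
crit_left = t_ppf(alpha, df)       # what entering 0.05 returns (left tail)
crit_right = t_ppf(1 - alpha, df)  # critical value for a right-tailed test
print(f"left-tail value : {crit_left:.4f}")
print(f"right-tail value: {crit_right:.4f}")
```

Entering 1 - alpha instead of alpha returns the right-tailed critical value directly, which is the
same number as the left-tailed value with the sign flipped.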
7.7 - Practice Problems


Practice Problems:
1. The benign mucosal cyst is the most common lesion of a pair of sinuses in the upper jawbone. In a
random sample of 800 males, 35 persons were observed to have a benign mucosal cyst.
a. Would it be appropriate to use a normal approximation in conducting a statistical test of the null
hypothesis H0: p ≥ 0.096 (the highest incidence in previous studies among males)? Explain.
b. Conduct a statistical test of the research hypothesis Ha: p < 0.096 by computing the p-value
manually and drawing a conclusion using the p-value approach at a 1% Type I error rate.
c. What is the rejection region for this test?
d. Use Minitab to verify your results.
2. Some mushrooms were found in a forest. You do not know much about whether those are poisonous.
There are two hypotheses:
The mushrooms are poisonous and cannot be eaten
The mushrooms are not poisonous and can be eaten
How will you set up the hypotheses? Give a brief explanation.
3. A dealer in recycled paper places empty trailers at various sites. The trailers are gradually filled by
individuals who bring in old newspapers and magazines, and are picked up on several schedules. One such
schedule involves pickup every second week. This schedule is desirable if the average amount of recycled
paper is more than 1,600 cubic feet per 2-week period. The dealer's records for eighteen 2-week periods
show the following volumes (in cubic feet) at a particular site (recycled_paper.txt), where x̄ = 1,718.3
and s = 137.8.
a. By hand, compute a 95% confidence interval for μ and provide an interpretation of your interval.
b. Is there strong evidence that μ is greater than 1,600? Conduct the test by hand using the p-value
approach with a 5% level of significance.
c. Use Minitab to verify your results in parts a and b and to check for normality of data.
4. The undergraduate GPAs of 18 students from a large MBA class of 800 students are selected. The data
are given in (mba_student_gpa.txt).
Use the data in the file above and Minitab to test the research hypothesis that the average undergraduate
GPA of the MBA class differs from 3.5. Use the p-value approach to perform the test at a default level
of significance. Remember to check normality of the data.

ASK!