Introduction
Why study this discipline?
Our argument is a saying in two parts. The first part is the observation that American banknotes
carry the words IN GOD WE TRUST. The second part completes it: EVERYTHING ELSE NEEDS DATA,
that is, any other claim must rest on data. The data underlying any argument must be valid, free of
errors, and coherent. This discipline teaches how to read and interpret such data.
Learning objectives
After completing the seminar you will be able to:
formulate an answer to the question "What is statistics?"
obtain data intelligently.
distinguish between scientific studies and observational studies.
organize data in tables and use appropriate graphical techniques to describe various
data sets.
identify variables as qualitative (binary, ordinal, nominal) or quantitative
(discrete, continuous).
Definitions
Several concepts need to be defined. The first is the population: a group of objects, called
elements, that share a common property characteristic of the whole group under study. For example,
if we are talking about students, the population should be all the students we have in mind.
Example 1. The students enrolled in the PDE course at an institution form a population, because no
other students have the same property.
Because it is often nearly impossible to analyze an entire population, a small, representative part of
it is considered instead. This part is called a sample.
Level of measurement: data can be classified into the following four levels of measurement:
Nominal data: consist of names, labels, or categories, such as gender or particular groups of people
grouped by interests, aspirations, privileges, etc. Nominal data cannot be ordered (for example
from largest to smallest), nor can they be processed arithmetically.
Ordinal data: can be arranged in some particular order.
Interval data: are similar to ordinal data, with the addition that differences between values are meaningful.
Ratio data: imply a quantitative relationship between values (for example, the number of
jobs for men relative to women is 8/1).
Example 2. Identify the level of measurement of the following data:
the years in the historical period 1821-1877
the annual income of master's students: 0 lei - 10,000 lei
the gender of the students: male, female
Answer
The years between 1821 and 1877 are interval data. There is no year zero here, and
dividing the data is meaningless (what would 1821/1877 represent?), so no ratio exists.
Subtraction, however, is meaningful: the period covered is 1877 - 1821 = 56 years.
The students' annual income is ratio data. Here division is meaningful: if one person has
an income of 2,000 and another of 40,000, the incomes can be compared. Moreover, if
a student has an income of 0, the value 0 is meaningful.
The students' gender is nominal data: the values cannot be ordered, and they cannot be
added, subtracted, or divided.
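The four levels of measurement can be summarized by the operations that are meaningful at each level. The following is a minimal Python sketch of that idea; the names Level, REQUIRED_LEVEL, and allowed_operations are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels of measurement, ordered from least to most informative."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Lowest level at which each operation becomes meaningful (illustrative labels).
REQUIRED_LEVEL = {
    "count categories": Level.NOMINAL,
    "order values": Level.ORDINAL,
    "subtract values": Level.INTERVAL,
    "divide values": Level.RATIO,
}


def allowed_operations(level):
    """Return the operations that are meaningful at the given level."""
    return [op for op, needed in REQUIRED_LEVEL.items() if level >= needed]


# Years such as 1821-1877 are interval data: subtraction is fine, division is not.
print(allowed_operations(Level.INTERVAL))
# Annual income is ratio data: every operation is meaningful.
print(allowed_operations(Level.RATIO))
```

Each level inherits the operations of the levels below it, which is why the sketch uses an ordered enumeration.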
The data collected from each individual in a population are called variables. These variables are
characteristics of each individual in the population. Variables can take different kinds of values: some
are numbers and others are categories. For example, "the number of doors in a dwelling is 10" and
"the floor area of the laboratory is 68.5 m²" are numeric data. On the other hand, whether a house
belongs to a single family is not something a numeric value can express meaningfully. This leads to
the classification of variables into quantitative and qualitative variables.
A qualitative, or categorical, variable lets individual characteristics be listed by category, for
example the division by gender in the earlier example.
The first two measurements in the previous example are quantitative data.
Variables
  qualitative
  quantitative
    discrete
    continuous
Quantitative variables are also called scale variables (interval or ratio), for example students' height,
weight, age, ...
To process these data and reach the best decision, statistical data processing follows
two directions: general (descriptive) statistical processing and inferential statistical processing.
People usually respond when asked by a person but their answers may
be influenced by the interviewer.
Telephone Interview
Self-Administered Questionnaires
Cost-effective but the response rate is lower and the respondents may
be a biased sample.
Direct Observation
Web-Based Survey
relationship
2. Experimental: this involves some random assignment of a
treatment; researchers can draw cause and effect (or causal)
conclusions.
* Random selection (a probability method of sampling) is not
random assignment (as in an experiment). In an ideal world you
would have a completely randomized experiment; one that
incorporates random sampling and random assignment.
Example
Let's say that there is an option to take quizzes throughout this
class. In an observational study we may find that better students
tend to take the quizzes and do better on exams. Consequently, we
might conclude that there may be a relationship between quizzes
and exam scores. In an experimental study we would randomly
assign quizzes to specific students to look for improvements. In
other words, we would look to see whether taking quizzes causes
higher exam scores.
Ethics is an important aspect of experimental design to keep in
mind. For example, the original relationship between smoking and
lung cancer was based on an observational study and not an
assignment of smoking behavior.
Variables
Explanatory (predictor or independent) and response (outcome or
dependent) variables. A variable can serve as an explanatory
variable in one study but response in another. For instance,
consider the variables Sex (Female, Male) and Height (in inches).
Which variable do you believe explains the other? Would it make
more sense to say a person's sex more likely explains that person's
height, or to say a person's height explains that person's sex? In
this case Sex (explanatory) would explain Height (response). On
the other hand, consider the variables Height and Weight. In this
case, a person's height would more likely explain their weight than
the other way around. Now Height serves as the explanatory
variable.
Let's think about the example of height and weight. Typically height
(the explanatory variable) explains weight (the outcome variable).
In tabular data the predictor variable is usually displayed in the
rows of the table, and the response variable in columns. Here is an
example of the tally or results of a survey asking males and females
if they smoke.
         Yes   No
Male      20   55
Female    15   70
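With the explanatory variable in the rows, the natural comparison is the percentage of each row that falls in each response column. The following is a minimal Python sketch using the counts from the smoking survey table; the function name row_percentages is hypothetical.

```python
# Counts from the survey table: rows are the explanatory variable (sex),
# columns are the response variable (smokes: Yes / No).
counts = {
    "Male": {"Yes": 20, "No": 55},
    "Female": {"Yes": 15, "No": 70},
}


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {answer: 100 * n / total for answer, n in row.items()}
    return result


percentages = row_percentages(counts)
for group, row in percentages.items():
    print(group, {answer: round(p, 1) for answer, p in row.items()})
```

Comparing the "Yes" percentages across rows shows how the response differs between the two groups, which raw counts alone do not show because the row totals differ (75 males, 85 females).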
Researcher observes the data and has no control over which subject takes
which treatment,
Scientific Studies
Need to control for effects due to factors other than the ones of
primary interest.
Randomization
Replication
scores are aligned with higher quiz scores). As you look at the data
you begin to consider whether submission date of the homework
has an effect on the quiz grades; that is, do students who submit
the homework several days before taking the quiz perform better
overall on the quiz than students who do not leave much of a time
gap between completing the assignments (e.g. they do both on the
same day). The rationale is that students who allow time between
the homework and quiz to study may perform better compared to
the other group. In this example, days between submission of
homework and quiz would be a lurking variable as it was not
included in the study. Now once you got that information and
reexamined the relationship between the two assignments taking into
consideration the time gap, if you saw a change in the relationship
between the two assignments (i.e. the relationship changed
somewhat from the analysis without the time gap compared to
when the time gap was included) then days between
submission would be considered a confounding variable. In an
experiment where treatments are randomly assigned, one assumes
these variables get evenly shared across the groups with the
intention that any influence they may have on the outcome is
negated or reduced.
Types of Bias
1. Non-response: a large percentage of those sampled do not
respond or participate.
2. Response: when study participants either do not respond
truthfully or give answers they feel the researcher wants to
hear. For example, when students are asked if they ever
cheated on an exam even those who have would respond with
"no".
3. Selection: this bias occurs when the sample selected does
not reflect the population of interest. For instance, you are
interested in the attitude of female students regarding campus
safety but when sampling you also include males. In this case
your population of interest was female students; however, your
sample included subjects not in that population (i.e. males).
LOOKING AHEAD: Students interested in pursuing topics related to
design of experiments might explore STAT 503: Design of
Experiments. STAT 503 includes extensive coverage of the
implementation and analysis of a wide range of experimental
designs.
Unit Summary
Variable and Its Type
Graphs for a Categorical Variable
Pie Chart
Bar Chart
Graphs for a Single Quantitative Variable
Dot Plot
Frequency Histogram and Relative Frequency Histogram
Stem-and-Leaf Diagram
Time Plot
Boxplot or Box-and-Whisker Plot
Binary: where there are two choices, e.g. Male and Female;
Ordinal: where the names imply levels with hierarchy or order
of preference, e.g. level of education;
Nominal: where no hierarchy is implied, e.g. political party
affiliation.
Example
Remarks:
a) Pie charts may not be suitable for too many categories. Thus, if
there are too many categories, you can either combine some
categories or use a bar chart to represent the data. What is meant
by "too many"? There is no clear cut off, more of just a judgment
on the appearance.
b) Readers may find the pie chart more useful if the percentages
are arranged in a descending or ascending order.
2. Bar Chart: The height of the bar for each category is equal to the
frequency (number of observations) in the category. Leave space in
between the bars to emphasize that there is no ordering in the
classes.
Example
  3   13  223
  5   13  45
 11   13  667777
 (7)  13  8899999
 12   14  0001
  8   14  2233
  4   14  45
  2   14  6
  1   14  8
College                     Number of majors
Agriculture                 1,500
[name missing in source]    11,000
Business Administration     7,000
Education                   2,000
Engineering                 5,000
2470, 520, 561, 488, 986, 359, 1305, 512, 467, 270, 360, 451, 4904, 572, 498, 382, 271, 634,
1682, 784, 298, 643, 947, 686
3. Draw - by hand - a Stem-and-Leaf plot
Draw the stem-and-leaf diagram for the following data set. Use the stem and leaf diagram to find the
median of the data set:
11 11 12 13 14 14 14 14 15 15 15 16 16 17 17 19 19 20 22 25
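A hand-drawn diagram for this data set can be checked with a short script. The following is a minimal Python sketch, assuming the tens digit is used as the stem and the units digit as the leaf; the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict
from statistics import median

data = [11, 11, 12, 13, 14, 14, 14, 14, 15, 15,
        15, 16, 16, 17, 17, 19, 19, 20, 22, 25]


def stem_and_leaf(values):
    """Group sorted values by tens digit (stem); units digits are the leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}


plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")

# With 20 ordered values the median is the mean of the 10th and 11th leaves.
print("median =", median(data))
```

Because most values share the stem 1, one might prefer to split each stem into two rows (leaves 0-4 and 5-9); the sketch keeps a single row per stem for simplicity.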
4. Populations and Samples
Selecting the proper diet of brook trout or other fresh water fish is an important aspect of fish farming. A
researcher wants to estimate the mean weight of brook trout maintained on a specific diet for a period of 6
months. One hundred brook trout are selected from a fishery's tank and each is weighed.
a. What is the population of interest to the researcher?
b. What is the sample?
c. What characteristics of the population are of interest to the researcher?
d. If the sample measurements are used to make inferences about certain characteristics of the population,
why is a measure of the reliability of the information important?
Basic Statistics
Regression
Multivariate analysis
Analysis of variance
Time series
Tables
Control charts
Nonparametrics
Interval plot
Matrix plot
Marginal plot
Line plot
Histogram
Bar chart
Dotplot
Pie chart
Stem-and-leaf plot
Probability plot
Area graph
Empirical CDF
Contour plot
Probability distribution plot
3D scatterplot
Boxplot
3D surface plot
Use this information to assess the basic properties of your data distributions:
number of observations
central tendency: the location of the center, or most typical value, of the data set
dispersion: the amount of variation or spread in the data set
Display Descriptive Statistics can provide summary information for whole columns of data or for
subsets of data within columns.
Descriptive statistics
Once we have identified the population, the types of data, and the variables, and have collected the
data for the sample, our goal is to describe the characteristics of the sample unambiguously and
precisely, so that the data can easily be communicated to others. The collected data can be
described or summarized in two ways: graphically or numerically.
The graphical description depends on the type of data. As presented above, data are either quantitative
or qualitative.
Graphs that describe qualitative data include:
- bar charts, pie charts, and Pareto charts
Describing the data
For a meteorology study, a student collected the following data over one year for the city where he
lives. The values are the number of days in each month with significant precipitation. The project
began after the end of January, so there are no records for that month:
Table: Weather experiment
Months: Jan., Feb., Mar., Apr., May, Jun., Jul., Aug., Sept., Oct., Nov., Dec.
Days with precipitation: * (Jan.), 2 (Feb.), ..., 10
The top window, called the "Session" window, is where Minitab displays the results of the requested
statistical analysis. The bottom window, called the "Worksheet" window, is where we copy and
enter the data. A third window, called the "Graphics" window, appears only when Minitab is asked
to plot something. Below is an example of a graphics window:
The active window is the one whose title bar is dark blue. To make a window active,
simply click anywhere in it.
In the WORKSHEET, enter the data in column C1.
Note that the first (unnumbered) row is reserved for variable names. This is one thing about which you
will have to be careful. If you accidentally place your cursor instead in the row numbered 1, Minitab will
then treat the data as if they are text:
Note that Minitab has added a "-T" to the column names C1 and C2 to denote that the content of the two
columns is text. Another indication that the content of the columns is treated as text is that the textual
content is left-justified whereas numbers are always right-justified. Minitab cannot summarize data, such
as calculating means and standard deviations, when they are treated as text. If you accidentally make this
mistake, just open a new worksheet (Select File >> New... >> Minitab Worksheet >> OK) and paste the
data properly.
https://onlinecourses.science.psu.edu/statprogram/print/book/export/html/51
Analyzing Data
Once you've pasted or uploaded data into a Minitab worksheet, you no doubt will want to analyze it. The
commands to do so all appear in one of the Minitab menus:
More often than not, we will use the "Stat" and "Graph" menus in this course. The menus are generally
self-explanatory, but we will provide you with Minitab help throughout the course.
To create a histogram of the data, just select Graph >> Histogram... >> Simple. A standard Minitab dialog
box will appear. In general, the dialog box means that you have to provide Minitab with more information
before it can complete your request. For the histogram dialog box:
you need to tell Minitab that "actual" is variable of interest. To do this you can either 1) click on the
variable name ("actual") once and then click on "Select" or just 2) double-click on the variable name. The
name should appear in the box labeled "Graph variables." Then, once you select "OK," a new graph
To have Minitab display basic descriptive statistics, select Stat >> Basic Statistics >> Display
Descriptive Statistics .... The following Minitab dialog box will appear:
You need to tell Minitab that in this case "actual" is the variable of interest. You can either 1) click on the
variable name ("actual") once and then click on "Select" or just 2) double-click on the variable name. The
name should appear in the box labeled "Variables." Do the same thing to tell Minitab which variable you
would like to group the statistics by. In this case we will click in the box labeled "By variables" and then
either 1) click on the variable name ("sex") once and then click on "Select" or just 2) double-click on the
variable name. Once you select "OK," the standard descriptive statistics output should appear in the
"Session" window:
All Minitab graph and analysis commands function similarly to the examples illustrated above.
Throughout the course, help will be provided for the various Minitab commands we will encounter.
In the next lesson page we will use a Viewlet to walk you through another example of how Minitab works.
Minitab Help
There are various ways that you can get Minitab help.
1.If you would like more tutorial help, you can try the Tutorials option provided under the Help pull-down
menu in the Minitab Application shown below:
2.You can look for help in the Minitab Help on-line manual also listed in the pull-down menu pictured
above.
3.You can use the various sets of Minitab instructions provided to you throughout the course. You will
probably find links to these from the Homework Problems and Lab Activities in each lesson.
4.Finally, you can post a question to the Minitab discussion board located at the course level under the
CONTENT tab in ANGEL. This is a separate discussion board just for questions related to how to use
Minitab.
Visit the support web site or call +1-814-231-2682 to speak with Minitab's technical support specialists.
https://onlinecourses.science.psu.edu/stat500/node/11
Lesson 2 - Summarizing Data: Measures of Central Tendency and Measures of Variability, Box Plot
We will first talk about the important concepts of statistical inference. Then a few descriptive measures of
the most important characteristic of a data set, central tendency, will be given. After that, a few descriptive
measures of the other important characteristic of a data set, measure of variability, will be discussed. This
lesson will be concluded by a discussion of box plots, which are simple graphs that show the central
location, variability, symmetry, and outliers very clearly.
Again, this lesson will focus on simple examples that can be calculated or drawn by hand. This lesson will
be followed by another lesson that will work through many of these procedures using Minitab.
Lesson 2 Objectives
Upon successful completion of this lesson, you will be able to:
conceptualize statistical inference.
use appropriate summary measures to describe different data sets.
construct and use box plots.
2.1 - Measures of Central Tendency and Skewness
Unit Summary
Reading Assignment
An Introduction to Statistical Methods and Data Analysis, (See your course schedule.)
Measures of Central Tendency
Three of the many ways to measure central tendency are:
1. Mean
2. Median
3. Mode
In most research experimental situations, examination of all members of a population is not typically
conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a
representative subset of the population.
Descriptive measures of population are parameters. Descriptive measures of a sample are statistics. For
example, a sample mean is a statistic and a population mean is a parameter. The sample mean is usually
denoted by $\bar{y}$:
\[ \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n} \]
where n is the sample size and $y_i$ are the measurements. One may need to use the sample mean to estimate
the population mean since usually only a random sample is drawn and we don't know the population
mean.
A Note on Notation!
What if we say we used $x_i$ for our measurements instead of $y_i$? Is this a problem? No. The
formula would simply look like this:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
The formulas are exactly the same. The letters that you select to denote the measurements are up
to you. For instance, many textbooks use x instead of y to denote the measurements.
The point is to understand how the calculation that is expressed in the formula works. In this
case, the formula is calculating the mean by summing all of the observations and dividing by the
number of observations.
There is some notation that you will come to see as standards, i.e, n will always equal sample
size. We will make a point of letting you know what these are. However, when it comes to the
variables, these labels can (and do) vary.
For example, in one study x may be used to denote weight and y may be used to denote height, (or
the reverse may be used!), but n will always be used to denote sample size in each case.
Note that for the data set:
1, 1, 2, 3, 13
mean = 4, median = 2, mode = 1
Steps to finding the median for a set of data:
1. Arrange the data in increasing order
2. Find the location of median in the ordered data by (n + 1) / 2
3. The value that represents the location found in Step 2 is the median. NOTE: if the sample size is an odd
number then the location point will produce a median that is an observed value as in the example above.
If sample size is an even number, then the location will require one to take the mean of two numbers to
calculate the median. The result may or may not be an observed value as the example below illustrates.
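The three steps above can be sketched in Python. This is a minimal illustration, not a library routine: the function name median_by_location is hypothetical, and the even-sized data set is invented only to show the averaging case.

```python
from statistics import mean, mode


def median_by_location(values):
    """Median found by the three steps listed above."""
    ordered = sorted(values)              # Step 1: arrange in increasing order
    n = len(ordered)
    location = (n + 1) / 2                # Step 2: 1-based location of the median
    lower = ordered[int(location) - 1]    # observation at or just below the location
    if location.is_integer():             # odd n: the median is an observed value
        return lower
    upper = ordered[int(location)]        # even n: average the two neighbours
    return (lower + upper) / 2


data = [1, 1, 2, 3, 13]                   # data set from the text
print(mean(data), median_by_location(data), mode(data))

even_data = [1, 2, 3, 4]                  # hypothetical even-sized data set
print(median_by_location(even_data))
```

For the five-value data set the location is (5 + 1) / 2 = 3, so the median is the third ordered value. For the four-value set the location is 2.5, so the median is the mean of the second and third ordered values and is not itself an observed value.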
Mean, median and mode are usually not equal. When the data is symmetric, the mean is equal to the
median.
4. Trimmed Mean
One shortcoming of the mean is that: Means are easily affected by extreme values.
central tendency, for example, mean, median, mode and trimmed mean. In future lessons, we talk
mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily
affected by extreme values. One remedy is to use trimmed mean to estimate the central tendency.
Remember, however, that this is very different from saying that one can trim data. Unless data points are
known mistakes, one should not remove them from the data set! One should keep the extreme points and
use more resistant measures. For example, use the sample median to estimate the population median. Or,
use the sample trimmed mean to estimate the population trimmed mean. Again, this is very different from
saying that it is OK to trim data from a data set.
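To make the distinction concrete, here is a minimal Python sketch of a trimmed mean. It assumes that the trimming proportion applies to each tail and that the number of values dropped per tail is rounded down; conventions differ between textbooks and software, and the function name trimmed_mean is hypothetical.

```python
def trimmed_mean(values, proportion):
    """Mean after ignoring the given proportion of values in EACH tail.

    Assumption: the number of values ignored per tail is rounded down.
    The input list itself is left untouched; no data are deleted.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)    # values ignored in each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 1, 2, 3, 13]                   # data set used earlier in the text
print(sum(data) / len(data))              # ordinary mean, pulled up by 13
print(trimmed_mean(data, 0.2))            # ignores one value in each tail
print(len(data))                          # the data set still has all 5 values
```

The extreme value 13 stays in the data set; only the summary measure becomes more resistant to it.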
Skewness
Skewness is a measure of degree of asymmetry of the distribution.
1. Symmetric
Mean, median, and mode are all the same here; the distribution is mound shaped, and no skewness is
apparent. The distribution is described as symmetric.
a few people that make the most money. To illustrate this, consider your favorite sports team or
even the company for which you work. There will be one or two players or personnel that earn
the big bucks, followed by others who earn less. This will produce a shape that is skewed to
the right. Knowing this can be a useful aid in negotiating a higher salary.
When one interviews for a position and the discussion gets around to compensation, it is
common that the interviewer states an offer that is typical for someone in your position. That
is, they are offering you the average salary for someone with your particular skill set (e.g. little
experience). But is this average the mode, median, or mean? The company (for whom business
is business!) will want to pay you the least they can while you prefer to earn the most you can.
Since salaries tend to be skewed to the right, the offer will most likely reflect the mode or
median. You simply need to ask to which average the offer refers and what is the mean of this
average since the mean would be the highest of the three values. Once you have these averages,
you can begin to negotiate toward the highest number.
Adding and Multiplying Constants
What happens to the mean and median if we add or multiply each observation in a data set by a constant?
Consider for example if an instructor curves an exam by adding five points to each student's score. What
effect does this have on the mean and the median? The result of adding a constant to each value has the
intended effect of altering the mean and median by the constant. For example, if in the above example
where we have 10 aptitude scores, if 5 was added to each score the mean of this new data set would be
87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).
Similarly, if each observed data value was multiplied by a constant, the new mean and median would
change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were
doubled, then the new mean and new median would be double the original mean and median. As we
will learn shortly, the effect is not the same on the variance!
Why would you want to know this? One reason, especially for those moving onward to more applied
statistics (e.g. Regression, ANOVA), is transforming data. For many applied statistical methods a
required assumption is that the data is normal, or very near bell-shaped. When the data is not normal,
statisticians will transform the data using numerous techniques, e.g. logarithmic transformation. But the
log cannot be taken of all values; for instance, the log of 0 is undefined. However, if we add a constant to
all the data values, making them all greater than zero, then a log can be taken without risk. We just need to
remember that the original data was transformed!
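The effect of adding or multiplying by a constant, and the shift-then-log transformation, can be checked numerically. The following is a minimal Python sketch; it reuses the small data set 1, 1, 2, 3, 13 from earlier in the lesson rather than the 10 aptitude scores, and the data set containing a zero and the shift of 1 are hypothetical choices.

```python
import math
from statistics import mean, median

scores = [1, 1, 2, 3, 13]

shifted = [y + 5 for y in scores]         # add a constant to every value
doubled = [2 * y for y in scores]         # multiply every value by a constant

print(mean(scores), median(scores))       # original mean and median
print(mean(shifted), median(shifted))     # both increase by 5
print(mean(doubled), median(doubled))     # both are multiplied by 2

# Hypothetical data containing a zero: log(0) is undefined, so shift first.
values_with_zero = [0, 1, 2, 3]
shift = 1                                 # assumed constant; must make all values > 0
logged = [math.log(y + shift) for y in values_with_zero]

# Undo the transformation to recover the original scale.
recovered = [math.exp(v) - shift for v in logged]
print(logged)
print(recovered)
```

Keeping the shift in a named variable is one simple way to remember that the data were transformed and to reverse the transformation later.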
Reading Assignment
An Introduction to Statistical Methods and Data Analysis, (see
course schedule).
Measures of Variability
Think about the following question.
If you can use two numbers to summarize
Jessica's weight data, which two characteristics
will you use as measures?
They have the same center, but what about their spreads?
One way to compare their spreads is to compute their
standard deviations. In the following section, we are going
to talk about how to compute the sample variance and the
sample standard deviation for a data set.
Variance is the average squared distance from the mean.
Population variance is defined as:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N} \]
In this formula $\mu$ is the population mean and the summation
is over all possible values of the population. N is the
population size.
The sample variance that is computed from the sample and
used to estimate $\sigma^2$ is:
\[ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]
\[ \bar{y} = \frac{1 + 2 + 3 + 3 + 4 + 5}{6} = 3 \]
\[ s^2 = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1} = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{6 - 1} = 2 \]
\[ s = \sqrt{s^2} \]
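The worked computation for the data 1, 2, 3, 3, 4, 5 can be reproduced in Python. This is a minimal sketch that writes the sample variance formula out explicitly (the function name sample_variance is hypothetical) and compares it with the standard library's statistics.variance, which also divides by n - 1.

```python
import math
from statistics import mean, pvariance, variance

data = [1, 2, 3, 3, 4, 5]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    y_bar = mean(values)
    return sum((y - y_bar) ** 2 for y in values) / (len(values) - 1)


s2 = sample_variance(data)
s = math.sqrt(s2)

print(mean(data))        # sample mean
print(s2, s)             # sample variance and sample standard deviation
print(variance(data))    # library version of the sample variance
print(pvariance(data))   # divides by N instead: the population formula
```

The population formula divides the same sum of squared deviations by N rather than n - 1, so it gives a smaller value for the same six numbers.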
For the data set A,
$\bar{y} \pm s$
The following five examples (a-e) show that the empirical rule
is not that far off even when the underlying distribution is not bell
shaped.
a. For the following graph, $\bar{y} = 5.5$, $s = 1.49$
4s
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} \]
to compute the sample standard deviation. The formula
(approximate value of $s$) $\approx \text{range}/4$ only gives a rough
estimate of s.
$CV_{Roll} = 0.4233/0.9196 = 0.46$ and $CV_{Sheet} = 0.00553/0.01134 = 0.49$
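The two coefficients of variation above are each a standard deviation divided by the corresponding mean. The following minimal Python sketch reproduces them from the figures given in the text; the function name coefficient_of_variation is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed relative to the mean (unit-free)."""
    return sd / mean


# Standard deviations and means as given in the text.
cv_roll = coefficient_of_variation(0.4233, 0.9196)
cv_sheet = coefficient_of_variation(0.00553, 0.01134)

print(round(cv_roll, 2), round(cv_sheet, 2))
```

Although the two standard deviations differ by almost two orders of magnitude, the relative dispersion of the two sets of measurements is similar, which is what the CV is designed to reveal.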
the higher average cost per night? Which would have the
greater standard deviation? The CV would allow you to
compare this dispersion in costs in relative terms by
accounting for the fact that the luxury hotels would have a
greater mean and standard deviation.
Z-value, Z-score, or Z
Z-value, or sometimes referred to as Z-score or simply Z, represents
the number of standard deviations an observation is from the mean
for a set of data. To find the z-score for a particular observation we
apply the following formula:
Z = (observed value - mean) / SD
Example: Exam Scores
For a recent final exam the mean was 68.55 with a standard
deviation of 15.45
Characteristics of Z-scores
1. The scores can be positive or negative.
2. For data that is symmetric (i.e. bell shaped) or nearly
symmetric, a common application of Z-scores for identifying
potential outliers is to flag any Z-scores that are beyond ±3.
3. The maximum possible Z-score for a set of data is $(n-1)/\sqrt{n}$.
4. The sum of all squared Z-scores for a set of data is n - 1.
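These characteristics can be checked numerically. The following is a minimal Python sketch, assuming the SD in the formula is the sample standard deviation (divisor n - 1). The data set is the small example 1, 1, 2, 3, 13 from earlier in the lesson, and the exam score of 80 is a hypothetical observation chosen only for illustration.

```python
import math
from statistics import mean, stdev


def z_scores(values):
    """Z = (observed value - mean) / SD for every observation."""
    m = mean(values)
    sd = stdev(values)                    # sample SD, divisor n - 1
    return [(y - m) / sd for y in values]


data = [1, 1, 2, 3, 13]
n = len(data)
z = z_scores(data)

print([round(v, 3) for v in z])           # positive and negative scores
print(sum(v * v for v in z))              # should equal n - 1
print(max(z), (n - 1) / math.sqrt(n))     # largest score and its upper bound

# Exam example from the text: mean 68.55, SD 15.45; the score 80 is hypothetical.
z_exam = (80 - 68.55) / 15.45
print(round(z_exam, 2))
```

Note that with only five observations no Z-score can exceed about 1.79, so the "beyond ±3" rule for potential outliers is only informative for larger samples.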
2.4 - Practice Problems
1. Our statistics department surveyed a random sample of 5 staff personnel and 5 faculty on how often
during a week they used public transportation in traveling for work. The table below reflects the
responses.
Staff
Faculty
4 2 2 2 5 0 1 5 1 3
a. What sampling method was used to gather this data? What population of interest is best represented by
the samples?
b. Calculate by hand the mean and standard deviation for number of times a week public transportation
was used by staff and faculty.
c. Based on means and standard deviations, do you think there is a statistically significant difference
between these two means? Explain.
2. The College of Dentistry at the University of Florida has made a commitment to develop its entire
curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on.
It is hoped that each student will proceed at a pace commensurate with his or her ability and that the
instructional staff will have more free time for personal consultation in student-faculty interaction. One
such instructional module was developed and tested on the first 50 students proceeding through the
curriculum. The following measurements represent the number of hours it took these students to complete
the required modular material:
16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 6 5 23 21 10 17 15
Here is a link to the data (hours.txt) for the time it took students to complete the required material.
a. Calculate by hand the five number summary for these recorded completion times. Helpful hint: you can
use software such as Excel to sort the data.
b. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.
c. Calculate the standard deviation, s, by using the approximation formula and compare that answer to the
actual standard deviation of 8.45.
d. The mean for this data set is 16. Using the actual s of 8.45, construct the intervals and check whether the
Empirical Rule applies to this data set.
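Hand calculations for parts c and d can be checked with a short script. The following is a minimal Python sketch using the 50 completion times listed above; the function name share_within is hypothetical, and the reference shares 68%, 95%, and 99.7% are the usual Empirical Rule values.

```python
from statistics import mean, stdev

hours = [16, 8, 33, 21, 34, 17, 12, 14, 27, 6,
         33, 25, 16, 7, 15, 18, 25, 29, 19, 27,
         5, 12, 29, 22, 14, 25, 21, 17, 9, 4,
         12, 15, 13, 11, 6, 9, 26, 5, 16, 5,
         9, 11, 5, 6, 5, 23, 21, 10, 17, 15]

y_bar = mean(hours)
s = stdev(hours)                          # sample standard deviation


def share_within(values, center, spread, k):
    """Fraction of values inside center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)


# Approximation formula: s is roughly range / 4.
rough_s = (max(hours) - min(hours)) / 4
print(y_bar, round(s, 2), rough_s)

# Compare observed shares with the Empirical Rule reference values.
for k, reference in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    observed = share_within(hours, y_bar, s, k)
    print(f"within {k} SD: observed {observed:.2f}, Empirical Rule {reference}")
```

Comparing the observed shares with the reference values gives a concrete basis for the explanation asked for in part b.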
3. With the variable 'Hours' in the 'Variable' window click the 'OK'
button.
You should now find the following output in the Session window
above the worksheet.
5.Click 'OK'.
Minitab will now display the 'Age' statistics (including CV since we
had that statistic selected from our last operation) for each
category of 'Sex'.
If you place your computer's mouse over the box portion of the
plot some statistics will pop-up (Q1, median, Q3, IQR, the value to
which the whiskers extend, and the sample size N). If there are any
outliers using the methods outlined in the previous lesson, these
will be marked with an * in the plot.
Example: Side-by-Side Boxplot for the Age by Sex Data
Again, with the MinitabIntroData data file open in the
Minitab worksheet:
Select 'Graph' > 'Boxplot' > 'One Y-With Groups' and click 'OK'.
Select the 'Age' variable and move into the 'Graph Variables'
dialog box.
Click inside the Categorical variables for grouping window.
Any categorical variables in the worksheet (e.g. Sex) should
now display in the left side box.
Select the variable 'Sex' for the Categorical variables for
grouping window.
Click 'OK'.
Minitab will now display inside one frame two boxplots: one for
Females and another for Males.
Just for kicks, use the steps described previously to have Minitab
calculate the descriptive statistics for these two variables, the
column of entered data and the column of calculated 'Plus1' data.
You should find that the original mean was 3 for the entered data
and is 4 for the 'Plus1' data; however, the standard deviations for
both sets of data remain the same.
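The same check can be sketched outside Minitab. The worksheet values are not shown in this text, so the list below is hypothetical data chosen only so that its mean is 3; the variable names are also illustrative.

```python
import statistics

# Hypothetical entered data (an assumption): chosen so that the mean is 3.
entered = [1, 2, 3, 4, 5]

# The calculated 'Plus1' column: every observation shifted up by 1.
plus1 = [x + 1 for x in entered]

mean_entered = statistics.mean(entered)
mean_plus1 = statistics.mean(plus1)

# Sample standard deviations (n - 1 in the denominator).
sd_entered = statistics.stdev(entered)
sd_plus1 = statistics.stdev(plus1)

print("means:", mean_entered, mean_plus1)
print("standard deviations:", sd_entered, sd_plus1)
```

Adding a constant moves every observation and the mean by the same amount, so the deviations from the mean, and therefore the standard deviation, do not change.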
College
Number of majors
Agriculture
1,500
11,000
2,000
Engineering
5,000
HINT!: How to Create a Simple Pie Chart and Simple Bar Chart
(machine.txt
New machine
Old machine
42.1 41.3 42.4 43.2 41.8 42.7 43.8 42.5 43.1 44.0
41.0 41.8 42.8 42.3 42.7 43.6 43.3 43.5 41.7 44.1
a. Compute the mean and standard deviation for the time to pack a
carton for each machine.
b. Plot the data for each machine.
c. Describe the data for the two machines.
7. The College of Dentistry at the University of Florida has made a
commitment to develop its entire curriculum around the use of self-paced
instructional materials such as videotapes, slide tapes, and
so on. It is hoped that each student will proceed at a pace
commensurate with his or her ability and that the instructional staff
will have more free time for personal consultation in student-faculty
interaction. One such instructional module was developed and tested
on the first 50 students proceeding through the curriculum. The
following measurements represent the number of hours it took the
students to complete the required modular material:
16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 4 5 23 21 10 17 15
Here is a link to the data (hours.txt) for the hours needed
to complete the required material.
a. Calculate the mode, the median, and the mean for these
recorded completion times.
b. Guess the value of s.
c. Compute s by using the shortcut formula and compare your
answers to that of part (b) above.
d. Do we expect the Empirical Rule to describe adequately the
variability of these data? Explain.
e. Construct the intervals and check whether the Empirical Rule
applies to this data set.
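For readers who want to check their by-hand work on this problem, here is a minimal Python sketch using the 50 completion times listed above. Two assumptions are made: the "guess" in part (b) is taken to be the range divided by 4, and the "shortcut formula" in part (c) is taken to be the computational form s = sqrt[(Σx² − (Σx)²/n) / (n − 1)].

```python
import math
import statistics

# Completion times (hours) for the 50 students, as listed in the problem.
hours = [
    16, 8, 33, 21, 34, 17, 12, 14, 27, 6,
    33, 25, 16, 7, 15, 18, 25, 29, 19, 27,
    5, 12, 29, 22, 14, 25, 21, 17, 9, 4,
    12, 15, 13, 11, 6, 9, 26, 5, 16, 5,
    9, 11, 5, 4, 5, 23, 21, 10, 17, 15,
]
n = len(hours)

# Part (a): mode, median, and mean.
mode = statistics.mode(hours)
median = statistics.median(hours)
mean = statistics.mean(hours)

# Part (b): rough guess of s as range / 4 (an assumed approximation).
s_guess = (max(hours) - min(hours)) / 4

# Part (c): shortcut (computational) formula for the sample standard deviation.
sum_x = sum(hours)
sum_x2 = sum(x * x for x in hours)
s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print("mode:", mode, "median:", median, "mean:", mean)
print("guess of s:", s_guess, "shortcut s:", s)

# Part (e): share of observations within 1, 2, and 3 standard deviations
# of the mean, to compare with the Empirical Rule's 68%, 95%, and 99.7%.
for k in (1, 2, 3):
    low, high = mean - k * s, mean + k * s
    share = sum(low <= x <= high for x in hours) / n
    print(f"within {k} s: [{low:.2f}, {high:.2f}] holds {share:.0%}")
```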
Binomial Examples
Referring back to the FBI Crime Survey Example in the binomial lesson, we
had the probability of 0.2 that a randomly selected property crime is solved
and we had three such crimes committed. Let's use Minitab this time to find
the probability that:
1. Exactly one of the three crimes is solved
2. At least one of the three crimes is solved.
In Minitab, go to Calc > Probability distributions > Binomial
To solve question 1 we are asked to find P(X = 1), so we will select
the Probability radio button and enter the following:
Number of trials: 3
Event probability: 0.2
Click radio button for 'Input constant': 1
Then click "OK".
The result is as follows:
Reading the output we can see that the number of trials was 3, the probability of success was 0.2, and we
wanted to find P(X = 1) = 0.384. [NOTE: this is what we found by hand earlier.]
To solve question 2 we are asked to find P(X ≥ 1). Since the software does not find "greater than"
probabilities, we need to use the complement rule. This leads us to find P(X ≥ 1) by
1 − P(X < 1) = 1 − P(X = 0).
This time we will select again the probability radio button and enter the following:
Number of trials: 3
Event probability: 0.2
Click radio button for 'Input constant': 0
Then click "OK".
The result is:
Reading the output we can see that the number of trials was 3, the probability of success was 0.2, and we
wanted to find P(X = 0) = 0.512. Subtracting this probability from 1 we have our answer
to P(X ≥ 1), which is 0.488.
NOTE: The usual mistakes students make are to not set the problem up correctly (e.g. using Probability
when it should be Cumulative and vice versa), to incorrectly include the equality when using the complement,
or simply to forget to subtract from one when necessary.
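The two Minitab results above can be reproduced with a few lines of Python. This is only a sketch: the helper name binom_pmf is illustrative, and it applies the standard binomial probability formula directly rather than calling Minitab.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 3, 0.2  # three crimes, each solved with probability 0.2

# Exactly one of the three crimes is solved.
p_exactly_one = binom_pmf(1, n, p)

# At least one is solved, via the complement rule: 1 - P(X = 0).
p_none = binom_pmf(0, n, p)
p_at_least_one = 1 - p_none

print(round(p_exactly_one, 3))   # 0.384
print(round(p_none, 3))          # 0.512
print(round(p_at_least_one, 3))  # 0.488
```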
Practice Problems
1. You are given a 25 question exam for which you are not prepared (i.e. you will be guessing).
Each question involves True/False answer choices where only one choice is correct.
First, is this a binomial situation?
Does it have a fixed number of trials? YES - 25.
Two outcomes, success and failure? YES - right and wrong.
Equal chances of success? YES - 0.5
Is each trial independent? YES, we will assume that how you answer one question does not affect your
answer to another question.
OK, now that those assumptions have been met, let's use Minitab to answer the following
questions:
A) What is the probability that you get exactly 20 correct?
So, this is solving for P(X = 20). We will select the Probability radio button and enter
25 for number of trials,
0.5 for event probability, and
20 as the input constant.
The output looks like:
The answer is 0.0015834, meaning that you have roughly a 0.0016 (or about 1.6 in 1,000) chance of
getting exactly 20 right in such a situation.
B) What is the probability that you get at least 20 correct?
As we saw above, we need to solve this differently. We will solve for P(X ≥ 20) = 1 −
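The text breaks off here. As a hedged sketch, and assuming the expression continues with the complement rule used in the crime example, so that P(X ≥ 20) = 1 − P(X ≤ 19), both parts of the exam problem can be checked in Python. The helper names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Cumulative probability P(X <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 25, 0.5  # 25 True/False questions, pure guessing

# Part A: exactly 20 correct.
p_exactly_20 = binom_pmf(20, n, p)

# Part B: at least 20 correct, assuming the complement 1 - P(X <= 19).
p_at_least_20 = 1 - binom_cdf(19, n, p)

print(f"P(X = 20)  = {p_exactly_20:.7f}")
print(f"P(X >= 20) = {p_at_least_20:.7f}")
```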
Normal Examples
Here we will use Minitab to find the probabilities for two of the problems from the practice examples we
saw earlier.
To find the probabilities associated with normal distributions in Minitab, go to Calc > Probability
distributions > Normal. The default is set up to that of a standard normal (i.e. we have a z-score) where the
mean is 0 and the standard deviation is 1.
Example: Intelligence Scores for Children
1. The scores of a reference population on the Wechsler Intelligence Scale for Children (WISC)
are normally distributed with μ = 100 and σ = 15. Our question is, "What score will
separate the top 5% of the population from the bottom 95%? What do we call this
value?"
To solve this question we are asked to find the score x for which P(X ≤ x) = 0.95. That is, we want to find
the score that would result in a child falling at the 95th percentile. (NOTE: this would be
equivalent to finding x such that P(X ≥ x) = 0.05, or being in the top 5%).
In Minitab go to Calc > Probability distributions > Normal and select the radio
button for Inverse cumulative probability since we want to find the observed
score associated with a given cumulative probability: 0.95.
For mean enter 100 and for standard deviation enter 15. Click the radio button for
'Input Constant' and enter the cumulative probability of 0.95 and then click 'OK'. The result is as
follows:
As we see, this result is very near that which we found by hand earlier: 124.6.
The interpretation is that an IQ score of about 125 is needed for that child to fall in the top 5%.
2. A class has 16 children and they are from the reference population in problem 1 above. One
child is randomly picked from the class. What is the probability that the IQ of the child is more
than 110?
To solve this question we are asked to find P(X > 110). This means we will have to
solve using the complement, or 1 − P(X ≤ 110). But remember, this equality is not
relevant in regards to finding the probability since we are talking about a continuous
distribution!
In Minitab go to Calc > Probability distributions > Normal and select the radio
button for Cumulative probability and enter the following:
For mean enter 100 and for standard deviation enter 15. Click the radio button for
Input Constant and enter 110 and click OK. The result is as follows:
We then take this P(X ≤ 110) = 0.747507 and subtract from one to get our
final probability of 0.252493, which resembles our final result using the standard normal table:
0.2514.
The difference we are finding here is from the rounding we did to get the z-score to use the
table. The interpretation is that we have about a 25% chance of randomly selecting a child with
an IQ of at least 110.
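Both WISC calculations can also be sketched with Python's standard library, where NormalDist.inv_cdf plays the role of Minitab's Inverse cumulative probability option and NormalDist.cdf plays the role of Cumulative probability. The variable names are illustrative.

```python
from statistics import NormalDist

# Reference population: WISC scores with mean 100 and standard deviation 15.
wisc = NormalDist(mu=100, sigma=15)

# Problem 1: the score with cumulative probability 0.95 (the 95th percentile).
cutoff_95 = wisc.inv_cdf(0.95)

# Problem 2: P(X > 110) through the complement 1 - P(X <= 110).
p_at_most_110 = wisc.cdf(110)
p_above_110 = 1 - p_at_most_110

print(f"95th percentile: {cutoff_95:.2f}")
print(f"P(X <= 110) = {p_at_most_110:.6f}")
print(f"P(X > 110)  = {p_above_110:.6f}")
```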
II.3 - Finding Sampling Distributions in Minitab
Sampling Distributions
For sampling distributions we again will focus on using the normal distribution. The key distinction will
be that instead of inputting the standard deviation we will use the standard error. Again, to illustrate we
will use two problems from the sampling distribution practice problems.
Example: JCrew
The company JCrew advertises that 95% of its online orders ship within two working days. You
select a random sample of 200 of the 10,000 orders received over the past month to audit. The
audit reveals that 180 of these orders shipped on time. If JCrew really ships 95% of its orders on
time, what is the probability that the proportion in a random sample of 200 orders is as small as or
smaller than the proportion in the audit?
We already verified that the sample proportion meets the conditions needed in order to apply
normal approximation methods. Once this is verified, the question asks us to find the probability
of getting a sample proportion of 0.9 or less if the true 'ship on time' proportion is 0.95. Recall,
we already had calculated a 0.015 standard error.
Using Minitab, we will again go to Calc > Probability distributions > Normal.
We will select the radio button for Cumulative probability and enter the following:
For mean enter 0.95 and for standard deviation enter 0.015.
Click the radio button for Input Constant and enter 0.9
Then click OK. The result is as follows:
Remember, when we used the standard normal table the best we could do was a probability less
than 0.001, which Minitab has verified with 0.0004291. When we did this by
hand we came up with a z-score of −3.33, which was not on the table, so we used −3.09.
Alternatively, we could have used this z-score to find our answer by using the Minitab default
values, which as we mentioned at the beginning are in standard normal format. If one has the
z-score, one simply needs to plug in the z-score as the input constant (but remember to change the
mean to 0 and the standard deviation to 1).
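As a cross-check on the JCrew example, the sketch below recomputes the standard error from the formula sqrt(p(1 − p)/n), rounds it to the 0.015 used in the text, and then finds the probability both ways: with the sampling distribution directly and with the z-score on the standard normal. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, n = 0.95, 200   # claimed on-time proportion and audit sample size
p_hat = 180 / n     # observed sample proportion, 0.9

# Standard error of the sample proportion under the claim.
se = sqrt(p0 * (1 - p0) / n)
se_rounded = round(se, 3)  # the text works with 0.015

# Route 1: cumulative probability with mean 0.95 and standard deviation 0.015.
prob = NormalDist(p0, se_rounded).cdf(p_hat)

# Route 2: convert to a z-score and use the standard normal (mean 0, sd 1).
z = (p_hat - p0) / se_rounded
prob_via_z = NormalDist().cdf(z)

print(f"standard error = {se:.5f}")
print(f"z = {z:.2f}")
print(f"P(p-hat <= 0.9) = {prob:.7f}")
```

Both routes give the same probability because standardizing only rescales the same normal curve.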
Example: Tire Lifetime
Penn State Fleet, which operates and manages car rentals for Penn
State employees, found that the tire lifetime for their vehicles has a
mean of 50,000 miles and a standard deviation of 3,500 miles. What
is the probability that the sample mean lifetime for a sample of 50
vehicles exceeds 52,000 miles?
We already verified that the sample mean meets the conditions
needed in order to apply normal approximation methods. Once this
is verified, the question asks us to find the probability of getting a
sample mean greater than 52,000 miles if the true tire lifetime is 50,000 miles . Recall we
already had calculated a 495 for the standard error.
Using Minitab, again go to Calc > Probability distributions > Normal. Select the
radio button for Cumulative probability and enter the following:
For mean 50000 and for standard deviation enter 495.
Click the radio button for 'Input Constant' and enter 52000
Then click OK. The result is as follows:
Not done yet! The problem asks for greater than 52,000 and what we have is less than 52,000.
Therefore we need one final step of subtracting this probability from one. The probability of
getting a sample mean greater than 52,000 from a sample of 50 vehicles would be 0.000027, which agrees
with our result when using the table.
Remember, when we used the standard normal table the best we could do was a probability less
than 0.001, which Minitab has verified as less. When we did this by hand we came up with a
z-score of 4.04, which was not on the table, so we used 3.09. Alternatively, we could have used this
z-score to find our answer by using the Minitab default values, which as we mentioned at the
beginning are in standard normal format. If one has the z-score, one simply needs to plug in the
z-score as the input constant (but remember to change the mean to 0 and the standard deviation
to 1).
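The tire example can be sketched the same way. The standard error here is computed as σ/√n without rounding, so the z-score and tail probability may differ from the text's values in the last digit. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50_000, 3_500, 50   # population mean, standard deviation, sample size

# Standard error of the sample mean.
se = sigma / sqrt(n)

# Minitab's cumulative probability gives P(sample mean <= 52,000) ...
p_below = NormalDist(mu, se).cdf(52_000)

# ... so the upper-tail probability needs the final subtraction from one.
p_above = 1 - p_below

z = (52_000 - mu) / se
print(f"standard error = {se:.1f}")
print(f"z = {z:.2f}")
print(f"P(sample mean > 52,000) = {p_above:.6f}")
```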
This lesson starts with the basic concept of using confidence intervals to understand and perform
inference. We then talk about how to find confidence intervals for one population proportion. The
important issue of determining the required sample size to estimate a population proportion will also be
discussed in detail in this lesson.
Estimating the population mean is one of the most common and important questions one comes across in
practice. In this lesson, we will also talk about the confidence interval for a population mean when the
population standard deviation is unknown. We will also explain how to determine the number of
observations to be included in the sample.
Lesson 6 Objectives
Upon successful completion of this lesson, you will be able to:
understand the reason for estimating with a confidence interval.
calculate confidence intervals for population proportions.
interpret a confidence interval.
know the meaning of margin of error and its use.
compute sample sizes for different experimental settings.
know when and how to use a t-interval to estimate the population
mean.
compute sample sizes for estimating the population mean.
answers. The exact interval and the z-interval should be very similar when the conditions are
satisfied.
6. Click OK and OK again.
Example: Presidential Approval Rating
Referring to our presidential approval rating example at the beginning
of this lesson, we will use Minitab to verify our by-hand results.
Recall in that example a random sample of 1500 was taken from the
population of U.S. adults, with 660 responding with a positive approval. In Minitab and
following the steps above, we would enter 660 for the Number of Events and 1500 for the
Number of Trials. The confidence level was 95% and we satisfied the necessary conditions to
use the Normal Approximation (or z-interval) method. The results are:
Test and CI for One Proportion
Sample    X     N  Sample p                95% CI
1       660  1500  0.440000  (0.414880, 0.465120)
Using the normal approximation.
These results closely match our by-hand interval of 0.415 to 0.465
What if we had calculated the exact confidence interval (i.e. did not choose Normal
Approximation as the method)? With the exact method the interval is (0.414685, 0.465550).
The two intervals differ by less than 0.001 at each endpoint in this case. You will notice that in the output Minitab does
provide a notification that the normal approximation was used.
How different can the normal approximation and exact intervals be when conditions are not
satisfied? Consider this example: In estimating the proportion of premature babies born at the local
hospital, a random sample of 10 newborn babies was taken in which 3 were born prematurely. Find a 90%
confidence interval for the true proportion of premature babies born at the hospital.
As we can see, the conditions to use the normal approximation method are not satisfied; we only have 3
successes and we need at least 5. If we used normal approximation methods (note that we are constructing
90% intervals now), Minitab produces an interval of (0.061638, 0.538362). If the exact method were
used, the interval is (0.087264, 0.606624). The intervals are not nearly as close as in the first example.
Also note how wide the intervals are in this second example. This is a direct result of the small sample
size. The smaller n produces a much larger standard error, which increases the width of an interval.
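The normal-approximation intervals quoted for both examples can be reproduced with a short sketch of the z-interval, p̂ ± z* · sqrt(p̂(1 − p̂)/n). The function name z_interval is illustrative, and the exact intervals are not recomputed here.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(successes, n, confidence):
    """Normal-approximation (z) confidence interval for one proportion."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Presidential approval: 660 of 1500, 95% confidence.
approval = z_interval(660, 1500, 0.95)

# Premature babies: 3 of 10, 90% confidence (conditions NOT satisfied).
premature = z_interval(3, 10, 0.90)

print("approval:  (%.6f, %.6f)" % approval)
print("premature: (%.6f, %.6f)" % premature)
```

The second interval is far wider because n = 10 makes the standard error roughly eleven times larger than with n = 1500.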
Minitab Commands to Find the Confidence Interval for a Population Mean (sigma
unknown)
1. Stat > Basic Statistics > 1-Sample t.
2. From the drop-down box select the Summarized data option. (If you have
the raw data you would use the default drop-down option of One or more samples, each in a column.)
3. Enter the sample size, sample mean, and sample standard deviation in their respective text
boxes.
4. Click the Options button. The default confidence level is 95. If you desire another confidence
level, edit appropriately.
5. Click OK and OK again.
Example: Emergency Room Wait Time
Referring to our prior example of average emergency room wait
time from our discussion on confidence intervals for a population
mean, our by-hand calculations produced a 95% confidence
interval of 24.28 to 35.72 minutes. Recall the following for that
example: sample size 50, sample mean 30, and sample standard
deviation 20. In Minitab following the above steps, we get a 95%
confidence interval:
One-Sample T
 N   Mean  StDev  SE Mean          95% CI
50  30.00  20.00     2.83  (24.32, 35.68)
The slight discrepancy between the estimates is due to our by-hand calculation using the t-value
associated with 40 degrees of freedom since the table did not include a d.f. of 49. Minitab used
a t-value for the actual 49 degrees of freedom. With the larger degrees of freedom comes a
smaller t-value. This would result in a smaller margin of error and a narrower interval, precisely what we have here.
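The discrepancy can be traced numerically. The sketch below hard-codes two t critical values as assumptions taken from standard t-tables (2.021 for 40 degrees of freedom and about 2.0096 for 49 degrees of freedom, both for 95% confidence), since Python's standard library has no t distribution.

```python
from math import sqrt

n, mean, sd = 50, 30.0, 20.0   # summary statistics from the example
se = sd / sqrt(n)              # standard error of the mean, about 2.83

# Assumed t critical values for 95% confidence (standard t-table entries).
T_DF40 = 2.021    # used by hand, because the table had no row for 49 d.f.
T_DF49 = 2.0096   # the value software would use for the actual 49 d.f.


def t_interval(t_star):
    """Confidence interval mean +/- t* x SE for the given critical value."""
    margin = t_star * se
    return mean - margin, mean + margin


by_hand = t_interval(T_DF40)
software = t_interval(T_DF49)

print("SE mean: %.2f" % se)
print("by hand (40 d.f.):  (%.2f, %.2f)" % by_hand)
print("software (49 d.f.): (%.2f, %.2f)" % software)
```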
Using Minitab to Check Normality
This Minitab process was presented in the lesson for finding confidence intervals
for a population mean. It is repeated here for convenience. For a small sample size,
if the distribution is not normal or if there are outliers, then one needs to use other
procedures such as nonparametric methods. Thus, if the sample size is less than 30, one needs to
use a normal probability plot to check whether the sample may come from a normal distribution
and then follow the above guideline to determine whether one can use the t-interval.
1. Graph > Probability Plot > Simple (note: if we have summarized data only we cannot plot the
data!)
2. Select the column that contains the data you want to graph.
3. Click OK.
Example: Rattlesnake Lengths
It is very time consuming to find rattlesnakes and nerve racking to
measure them. A scientist randomly finds 12 snakes from the
Central Pennsylvania area and measures their length (snakes.txt).
The following twelve measurements in inches are obtained:
Since the points all fall within the confidence limits, there is no evidence to suggest that the data
do not come from a normal distribution.
6.7 - Practice Problems
1. Many individuals over the age of 40 develop intolerance for milk and milk-based products. A dairy has
developed a line of lactose-free products that are more tolerable to such individuals. To assess the potential
market for these products, the dairy commissioned a market research study of individuals over 40 years
old in its sales area. A random sample of 250 individuals showed that 86 of them suffer from milk
intolerance. Based on the sample results, calculate a 90% confidence interval for the population proportion
that suffers milk intolerance. Interpret this confidence interval.
a) First, show that it is okay to use the 1-proportion z-interval.
b) Calculate by hand a 90% confidence interval.
c) Provide an interpretation of your confidence interval.
d) If the level of confidence was 95% instead of 90%, would the resulting interval be narrower or wider?
Explain.
e) If the researchers were interested in a 90% interval with a 3% margin of error, what size sample would
they require, assuming sample costs are high and the response rate is 80%?
f) Verify your 90% confidence interval in Minitab.
2. Consumer Reports tested 15 brands of vanilla yogurt and found the following numbers of calories per
serving: 160, 200, 220, 230, 120, 180, 140, 130, 170, 180, 80, 120, 100, 170, 190 (yogurt.txt). The
sample statistics were 159.3 for the sample mean and 43.5 for the standard deviation.
a) By hand, place a 99% confidence interval on the average number of calories per serving for vanilla
yogurt.
b) Provide an interpretation of your interval.
c) Use Minitab to find the interval and to check the assumption of normality. Is the assumption satisfied?
Explain.
Unit Summary
Conducting a One-Proportion Z-test in Minitab
Conducting a One-Mean t-test in Minitab
Finding Exact Critical Value for a One-Mean t-test in
Minitab
Note about Software
In general, as we will learn, software usually performs tests using the p-value method. That is, the output
from software will provide the test statistic and the p-value, along with some other general information
e.g. a confidence interval. To perform rejection region tests you would need to find the critical values
from the tables. However, at the end of this lesson we do demonstrate how to find the correct critical
value from the t distribution, i.e. the t-value that corresponds to the degrees of freedom when not on the
table.
Conducting a One-Proportion Z-test
Note: these steps are very similar to those for one-proportion confidence interval.
The differences occur in steps 4 and 5b.
1. Click Stat > Basic Stat > 1 Proportion.
2. In the drop-down box use "One or more samples, each in a column" if you have the raw data,
otherwise select "Summarized data" if you only have the sample statistics.
3. If using the raw data enter the column of interest into the blank variable window below the drop
down selection. If using summarized data enter the number of successes for "Events" and the
sample size for "Trials".
4. Click the check box for "Perform hypothesis test" and enter the null hypothesis value.
5. Click Options.
   a. Enter the confidence level associated with alpha (e.g. 95% for alpha of 5%).
   b. From the drop-down list for "Alternative hypothesis" select the correct alternative.
   c. If conditions are satisfied to perform a z-test for one proportion, select
      "normal approximation" from the "Method" field.
6. Click OK and OK.
Example: Penn State Students from Pennsylvania
Recall our one-proportion example at the beginning of this lesson
on whether a majority of Penn State students are from
Pennsylvania. In that example, we took a random sample of 500
Penn State students and found that 278 are from Pennsylvania. Can
we conclude that the proportion is larger than 0.5 at a 5% level of
significance? Also recall in that example we found by hand a test
statistic of Z* = 2.504 and p-value of 0.0062.
Our hypotheses were: H0: p = 0.5 and Ha: p > 0.5
Using Minitab, we would select Stat > Basic Stat > 1 Proportion. Choose the summarized data
option and enter 278 for "Events" and 500 as the "Trials". Check the box for Perform
Hypothesis Test and enter the null value of 0.5. Click Options. With our stated alpha value of
5% we keep the default confidence level of 95. Select "Proportion > hypothesized proportion"
from the Alternative Hypothesis list. Since we verified that the conditions were satisfied, select
Normal Approximation under Method. Click OK and OK again. The output is:
Test and CI for One Proportion
Test of p = 0.5 vs p > 0.5
Sample    X    N  Sample p  95% Lower Bound  Z-Value  P-Value
1       278  500  0.556000         0.519451     2.50    0.006
Using the normal approximation.
As the output indicates, our by-hand calculations were very accurate!
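A minimal sketch of the same one-proportion z-test in Python follows. The test statistic uses the null value p0 in the standard error. The one-sided lower bound is computed with the sample proportion in the standard error, which is an assumption about how the bound is formed, although it does reproduce the value in the output above.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 278, 500, 0.5   # successes, sample size, null value
p_hat = x / n

# Test statistic: standard error uses the null value p0.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Right-tailed test (Ha: p > 0.5), so the p-value is the upper-tail area.
p_value = 1 - NormalDist().cdf(z)

# One-sided 95% lower bound (assumed form: p-hat minus z(0.95) x estimated SE).
lower_bound = p_hat - NormalDist().inv_cdf(0.95) * sqrt(p_hat * (1 - p_hat) / n)

print(f"sample p = {p_hat:.6f}")
print(f"Z = {z:.3f}, p-value = {p_value:.4f}")
print(f"95% lower bound = {lower_bound:.6f}")
```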
Conducting a One-Mean t-test
Note that these steps are very similar to those for one-mean confidence interval.
The differences occur in steps 4 and 5b.
1. Click Stat > Basic Stat > 1 Sample t.
2. In the drop-down box use "One or more samples, each in a column" if you have the raw data,
otherwise select "Summarized data" if you only have the sample statistics.
3. If using the raw data enter the column of interest into the blank variable window below the drop
down selection. If using summarized data enter the sample size, sample mean, and sample
standard deviation in their respective fields.
4. Click the check box for "Perform hypothesis test" and enter the null hypothesis value.
5. Click Options.
   a. Enter the confidence level associated with alpha (e.g. 95% for alpha of 5%).
   b. From the drop-down list for "Alternative hypothesis" select the correct alternative.
6. Click OK and OK.
Example: Emergency Room Wait Time
Recall our emergency room wait time example where an
administrator at your local hospital states that on weekends the
average wait time for emergency room visits is 10 minutes. From
our random sample of 40 patients, the average wait time for these
40 patients was 11 minutes with a standard deviation of 3
minutes. We conducted the test at a 5% level of significance and
wanted to demonstrate that the average time exceeded 10 minutes. Also recall in that example we
found by hand a test statistic of t* = 2.11 and a p-value between 0.01 and 0.025.
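The by-hand numbers can be retraced with a short sketch. The two critical values are assumptions: they are standard t-table entries for 40 degrees of freedom (upper-tail areas 0.025 and 0.01), used as a stand-in for the actual 39 degrees of freedom, which most printed tables omit.

```python
from math import sqrt

n, x_bar, s, mu0 = 40, 11.0, 3.0, 10.0   # sample size, mean, sd, null value

# Test statistic for a one-mean t-test.
t_star = (x_bar - mu0) / (s / sqrt(n))
df = n - 1

# Assumed table values for 40 d.f. (the closest row to 39 in a typical table).
T_AREA_025 = 2.021   # upper-tail area 0.025
T_AREA_010 = 2.423   # upper-tail area 0.01

# If t* falls between the two critical values, the right-tailed p-value
# falls between the corresponding tail areas, 0.01 and 0.025.
p_between_01_and_025 = T_AREA_025 < t_star < T_AREA_010

print(f"t* = {t_star:.2f} with {df} degrees of freedom")
print("0.01 < p-value < 0.025:", p_between_01_and_025)
```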
Practice Problems:
1. The benign mucosal cyst is the most common lesion of a pair of sinuses in the upper jawbone. In a
random sample of 800 males, 35 persons were observed to have a benign mucosal cyst.
a. Would it be appropriate to use a normal approximation in conducting a statistical test of the null
hypothesis H0: p ≥ 0.096 (the highest incidence in previous studies among males)? Explain.
b. Conduct a statistical test of the research hypothesis Ha: p < 0.096 by computing the p-value
manually and drawing a conclusion using the p-value approach at a 1% Type I error rate.
c. What is the rejection region for this test?
d. Use Minitab to verify your results.
2. Some mushrooms were found in a forest. You do not know much about whether those are poisonous.
There are two hypotheses:
The mushrooms are poisonous and cannot be eaten
The mushrooms are not poisonous and can be eaten
How will you set up the hypotheses? Give a brief explanation.
3. A dealer in recycled paper places empty trailers at various sites. The trailers are gradually filled by
individuals who bring in old newspapers and magazines, and are picked up on several schedules. One such
schedule involves pickup every second week. This schedule is desirable if the average amount of recycled
paper is more than 1,600 cubic feet per 2-week period. The dealer's records for eighteen 2-week periods
show the following volumes (in cubic feet) at a particular site (recycled_paper.txt) where x̄ = 1,718.3
and s = 137.8.
a. By hand, compute a 95% confidence interval for μ and provide an interpretation of your interval.
b. Is there strong evidence that μ is greater than 1,600? Conduct the test by hand using the p-value
approach with a 5% level of significance.
c. Use Minitab to verify your results in parts a and b and to check for normality of data.
4. The undergraduate GPAs of 18 students selected from a large MBA class of 800 students are recorded. The data
are given in (mba_student_gpa.txt).
Use the data in the file above and Minitab to test the research hypothesis that the average undergraduate
GPA of the MBA class differs from 3.5. Use the p-value approach to perform the test at a default level
of significance. Remember to check normality of the data.