
pde


Introduction

Why is it necessary to study this discipline?

As an argument we use a saying made of two parts. The first part refers to the observation that American banknotes carry the words IN GOD WE TRUST. To this we add the second part, EVERYTHING ELSE NEEDS DATA: anything else must be backed by data. The data that support any argument must be valid, free of errors, and coherent. Being able to read and interpret such data requires studying this discipline.

Consider the following reasons for studying statistics:

To evaluate printed numerical facts...

To interpret the results of sampling or to perform statistical analysis in your

work...

To make inference about the population using information collected from the

sample...

Statistics covers:
data collection
organizing and structuring data
data analysis
making predictions, decisions, or inferences.

The last item is in fact the objective of statistics: to make inferences about a population based on the information contained in a representative group drawn from that population.

Learning Objectives

After completing the seminar you will be able to:

formulate an answer to the question "What is statistics?"
collect data intelligently
distinguish between scientific and observational studies
organize data in tables and use appropriate graphical techniques to describe various data sets
identify variables as qualitative (binary, ordinal, nominal) or quantitative (discrete, continuous).

Definitions

Several concepts need to be defined. The first is the population: a group of objects, called elements, that share a common property characteristic of the whole group under study. For example, if we are talking about students, the population should be all the students we have in mind.

Example 1. The students enrolled in the PDE course at an institution form a population, because no other students have that same property.

Because it is sometimes almost impossible to analyze the entire population, a small, representative part of it is considered instead. This part is called a sample.

Level of measurement: data can be classified into the following four levels of measurement:

Nominal data: consist of names, labels, or categories, such as gender or groups of people formed by interests, aspirations, privileges, etc. Nominal data cannot be ordered (for example from largest to smallest) and cannot be processed arithmetically.
Ordinal data: can be arranged in a meaningful order.
Interval data: are similar to ordinal data, with the addition that differences between values are meaningful; there is no natural zero.
Ratio data: have a natural zero, so ratios between values are meaningful (for example, the number of jobs for men relative to women is 8/1).

Example 2. Identify the level of measurement of each of the following:

the years in the historical period 1821-1877
the annual income of master's students: 0 lei - 10000 lei
the gender of students: male, female

Answer

The years between 1821 and 1877 are interval data. There is no natural zero here, and dividing the values makes no sense (what would 1821/1877 represent?), so ratios are not meaningful. Subtraction, however, does make sense: the period covered is 1877 - 1821 = 56 years.
The annual income of students is ratio data. Division makes sense here: if one person has an income of 2000 and another of 40000, the incomes can be compared as a ratio. Also, if a student has an income of 0, the value 0 is meaningful.
The gender of students is nominal data: the values cannot be ordered, and they cannot be added, subtracted, or divided.

The data collected from each individual in a population are called variables. These variables are characteristic of each individual in the population. Variables can take different kinds of values: some are numbers, others are categories. For example, "the number of doors in a dwelling is 10" and "the floor area of the laboratory is 68.5 m²" are numerical data. On the other hand, whether a house is a single-family home cannot be meaningfully expressed as a numerical value. This leads to the classification of variables as quantitative or qualitative.

A qualitative (categorical) variable places individuals into categories according to a characteristic. An example is the classification by gender in the earlier example.

The first two measurements in the example above are quantitative data.

Variables
    qualitative
    quantitative
        discrete
        continuous

A discrete variable is a quantitative variable that takes a finite or countable set of values.

Example: the number of children in a family, or the number of friends a student has at the university, are discrete variables.

Continuous variables are quantitative variables that can take any value within an interval, so they have infinitely many possible values. They are also called scale variables (interval or ratio). Examples: the height, weight, and age of students.

To process these data and reach the best decision, statistical data processing follows two directions: general statistical processing and inferential statistical processing.

How Data Are Collected

The following are a few frequently used methods for collecting data:

Personal Interview: People usually respond when asked by a person, but their answers may be influenced by the interviewer.
Telephone Interview: ... impatient.
Self-Administered Questionnaires: Cost-effective, but the response rate is lower and the respondents may be a biased sample.
Direct Observation: ... the sample.
Web-Based Survey

How do we get data, that is, how do we select data for a study? There are two types of methods for collecting data: non-probability methods and probability methods.

Non-probability Methods

These might include:

Convenience sampling (haphazard): for instance, surveying

students as they pass by in the university's student union

building, or

Gathering volunteers: for instance, using an advertisement in

a magazine or on a website inviting people to complete a form

or participate in a study.

Probability methods

Simple random sample: making selections from a population where each subject in the population has an equal chance of being selected.
Stratified random sample: where you have a population of interest, you then divide this population into strata or groups based on some characteristic (e.g. sex, geographic region), then perform a simple random sample from each stratum.
Cluster sample: where a random cluster of subjects is taken from the population of interest. An example might be grabbing a handful of M&M's from a large jar of M&M's.

The primary benefit for using probability sampling methods is the

ability to make inference. The results can be extended or

generalized to the population from which the sample came. This

comes as the result of probability methods providing a

representative sample from a population of interest. This is not the

case with nonprobability samples. When nonprobability methods

are used you can make generalizations about the sample, not the

population. We can assume that by using random sampling we

attain a representative sample of the population.

Example: Airline Company Survey of Passengers

Here is an example of different sampling

techniques and how they might be

implemented. Let's say that you are the

owner of a large airline company and you

live in Los Angeles. You want to survey your

L.A. passengers on what they like and

dislike about traveling on your airline.

Convenience Sampling: Since you live in

L.A. you go to airport and just interview passengers as they

approach your ticket counter.

Volunteer Sampling: You have your ticket counter personnel

distribute a questionnaire to each passenger requesting they

complete the survey and return it at end of flight.

Simple Random Sampling: You randomly select a set of passengers

flying on your airline and question those that you have selected.

Stratified Sampling: You stratify your passengers by the class they

fly (first, business, economy), and then take a random sample from

each of these strata.

Cluster Sampling: You group your passengers by flight and by the class they fly (first, business, economy), randomly select some of these class-and-flight groups, and survey each passenger in the groups selected.
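The three probability designs in the airline example can be sketched in Python with the standard library alone. The passenger list, the flight labels, the sample sizes, and the fixed seed below are all invented for illustration; this is a minimal sketch under those assumptions, not a survey design.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible (assumption)

# Hypothetical passenger list: (passenger id, flight label, travel class).
classes = ["first", "business", "economy"]
passengers = [(pid, f"FL{pid % 4}", classes[pid % 3]) for pid in range(120)]

# Simple random sample: every passenger has the same chance of selection.
srs = random.sample(passengers, 12)

# Stratified sample: split by travel class, then sample within each stratum.
strata = {c: [p for p in passengers if p[2] == c] for c in classes}
stratified = [p for c in classes for p in random.sample(strata[c], 4)]

# Cluster sample: pick whole (flight, class) groups at random and
# survey every passenger in the chosen groups.
clusters = {}
for p in passengers:
    clusters.setdefault((p[1], p[2]), []).append(p)
chosen = random.sample(sorted(clusters), 2)
cluster_sample = [p for key in chosen for p in clusters[key]]

print(len(srs), len(stratified), len(cluster_sample))
```

Note the difference between the last two designs: stratified sampling takes a few passengers from every class, while cluster sampling takes every passenger from a few randomly chosen groups.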

In predicting the 2008 Iowa Caucus results a phone

survey said that Hillary Clinton would win, but instead

Obama won. Where did they go wrong? The survey

was based on landline phones, which was skewed to

older people who tended to support Hillary. However,

lots of younger people got involved in this election

and voted for Obama. The younger people could only be reached by

cellphone.

Margin of Error

Are surveys random or nonrandom? How are the samples drawn? If samples are taken at random, we can include some measure of error, which helps us understand better where the truth lies. We call this the margin of error (e.g. 3.1%). For a random survey with n respondents, the margin of error can be approximated by:

margin of error ≈ 1/sqrt(n)

A common margin of error of interest in the survey industry is 3%.

During U.S. presidential elections several random polls (e.g. Gallup

and N.Y. Times) will be conducted to estimate who will win an

upcoming election. With a desired margin of error equal to 3% and

approximately 110 million registered voters, how many registered

voters do they need to attain this margin of error? What might be

surprising is the answer. Working backward to solve for 'n' where

we have 0.03 = 1/sqrt(n) we come to n = 1111.11 or 1112 voters.

Not many! Keep in mind that this is the number they need to

participate in the poll and not simply the size of the sample they

would take from the population of all voters. If they expect about a

50% response rate (i.e. half the people they contact do not

participate - think of yourself receiving a call but declining) then

they would need to sample at least twice this number in order to

finish with the desired sample size. Of course their actual sampling

designs are not necessarily this simple as they would want to

consider the effect party affiliation, age, sex, geographic location,

race, etc. may have in how a voter selects their candidate.
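The calculation above can be checked with a few lines of Python. The function names are hypothetical and the code assumes only the approximation 1/sqrt(n) used in the text; it is a sketch of the arithmetic, not of how polling organizations actually size their samples.

```python
import math

def required_respondents(margin_of_error):
    """Smallest n for which 1/sqrt(n) does not exceed the margin of error."""
    return math.ceil(1 / margin_of_error ** 2)

def people_to_contact(respondents, response_rate):
    """Number of people to sample when only a fraction of them respond."""
    return math.ceil(respondents / response_rate)

n = required_respondents(0.03)
print(n)                          # respondents needed for a 3% margin
print(people_to_contact(n, 0.5))  # people to contact at a 50% response rate
```

The size of the population (110 million voters) does not enter the calculation at all, which is why the answer is so small.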

Types of Studies

There are predominantly two different types of studies:

1. Observational: these studies can show that there is a relationship.
2. Experimental: this involves some random assignment of a treatment; researchers can draw cause and effect (or causal) conclusions.

* Random selection (a probability method of sampling) is not

random assignment (as in an experiment). In an ideal world you

would have a completely randomized experiment; one that

incorporates random sampling and random assignment.

Example

Let's say that there is an option to take quizzes throughout this

class. In an observational study we may find that better students

tend to take the quizzes and do better on exams. Consequently, we

might conclude that there may be a relationship between quizzes

and exam scores. In an experimental study we would randomly

assign quizzes to specific students to look for improvements. In

other words, we would look to see whether taking quizzes causes

higher exam scores.

Ethics is an important aspect of experimental design to keep in

mind. For example, the original relationship between smoking and

lung cancer was based on an observational study and not an

assignment of smoking behavior.

Variables

Explanatory (predictor or independent) and response (outcome or

dependent) variables. A variable can serve as an explanatory

variable in one study but response in another. For instance,

consider the variables Sex (Female, Male) and Height (in inches).

Which variable do you believe explains the other? Would it make

more sense to say a person's sex more likely explains that person's

height, or to say a person's height explains that person's sex? In

this case Sex (explanatory) would explain Height (response). On

the other hand, consider the variables Height and Weight. In this

case, a person's height would more likely explain their weight than

the other way around. Now Height serves as the explanatory

variable.

Let's think about the example of height and weight. Typically height

(the explanatory variable) explains weight (the outcome variable).

In tabular data the predictor variable is usually displayed in the

rows of the table, and the response variable in columns. Here is an

example of the tally or results of a survey asking males and females

if they smoke.

          Yes   No
Male       20   55
Female     15   70
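With the explanatory variable in the rows, a natural summary of the smoking tally is the percentage of each answer within each row. The sketch below stores the table as a nested dictionary, which is just one convenient representation chosen for illustration.

```python
# Rows: explanatory variable (sex). Columns: response variable (smokes?).
table = {
    "Male": {"Yes": 20, "No": 55},
    "Female": {"Yes": 15, "No": 70},
}

row_percent = {}
for sex, answers in table.items():
    row_total = sum(answers.values())
    row_percent[sex] = {a: 100 * count / row_total for a, count in answers.items()}

for sex, answers in row_percent.items():
    print(sex, {a: round(pct, 1) for a, pct in answers.items()})
```

Comparing the "Yes" percentages across rows shows how the response differs between the two groups.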

Random sampling allows us to extend results to the population from which the sample came.

Random assignment is NOT the same as random sampling. Random

assignment is how treatments get assigned. This distinguishes a

study as an experiment.

Don't forget! Observational studies can show an association or

relationship. Experiments can show cause.

Observational Studies Versus Scientific Studies

It is very important to distinguish between observational and

scientific studies since one has to be very skeptical about drawing

cause and effect conclusions using observational studies. The use

of random assignment of treatments (i.e. what distinguishes a

scientific study (experiment) from an observational study) allows

one to employ cause and effect conclusions.

Observational Studies: The researcher observes the data and has no control over which subject takes which treatment.
Scientific Studies: The researcher controls the assignment of treatment.

Example: we want to find out whether Tylenol or Advil is more effective in reducing fever.

Think about the following:

Method 1: Ask the subjects which one they use and ask them

to rate the effectiveness. Is this an observational study or

scientific study?

Method 2: Randomly assign half of the subjects to take Tylenol

and the other half to take Advil. Ask the subjects to rate the

effectiveness. Is this an observational study or scientific

study?
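Random assignment as described in Method 2 can be sketched as follows. The subject labels, the group size, and the seed are made up, and the helper name `randomly_assign` is hypothetical; the point is only that chance, not the subject or the researcher, decides who gets which treatment.

```python
import random

def randomly_assign(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

subjects = [f"subject_{i}" for i in range(1, 21)]  # hypothetical subjects
groups = randomly_assign(subjects, ["Tylenol", "Advil"], seed=42)

for treatment, members in groups.items():
    print(treatment, len(members))
```

In Method 1, by contrast, the subjects choose their own treatment, so any lurking variable related to that choice can distort the comparison.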

LOOKING AHEAD: Students interested in pursuing topics related to

sampling might explore STAT 506: Sampling theory. STAT 506

covers sampling design and analysis methods that are useful for

research and management in many fields. A well designed sampling

procedure ensures that we can summarize and analyze data with a

Principles of Experimental Design

The following principles of experimental design have to be followed

to enable a researcher to conclude that differences in the results of

an experiment, not reasonably attributable to chance, are likely

caused by the treatments.

Control: Need to control for effects due to factors other than the ones of primary interest.
Randomization: ... unintentional selection bias in the groups.
Replication: ... randomization creates groups that resemble each other closely and to increase the chances of detecting differences among the treatments when such differences actually exist.

1. If random assignment of treatment is done then significant results can be concluded

as causal or cause and effect conclusions. That is, that the

treatment caused the result. This treatment can be referred to as

the explanatory variable and the result as the response variable.

2. If random selection is done where the subjects are randomly

selected from some population, then the results can be extended to

that population. The random assignment is required for an

experiment. When both random assignment and selection are part

of the study then we have a completely randomized experiment.

Without random assignment (i.e. an observational study) the treatment can only be referred to as being related to the outcome.

Lurking versus Confounding Variables

The difference between the two is that a lurking variable is a

variable not considered in the study but could influence the

relationship between the variables included in the

study. A confounding variable is one that is in the study and is

related to the other study variables, thus having an effect on the

relationship between these variables. Therefore a lurking variable,

if included in the study, could have a confounding effect and then

be classified as a confounding variable. For example: Say you teach

a class where students must submit a weekly homework and then

take a weekly quiz. You want to see if there is a relationship

between the scores on the two assignments (i.e. higher homework

scores are aligned with higher quiz scores). As you look at the data

you begin to consider whether submission date of the homework

has an effect on the quiz grades; that is, do students who submit

the homework several days before taking the quiz perform better

overall on the quiz than students who do not leave much of a time

gap between completing the assignments (e.g. they do both on the

same day). The rationale is that students who allow time between

the homework and quiz to study may perform better compared to

the other group. In this example, days between submission of

homework and quiz would be a lurking variable as it was not

included in the study. Now once you got that information and reexamined the relationship between the two assignments taking into

consideration the time gap, if you saw a change in the relationship

between the two assignments (i.e. the relationship changed

somewhat from the analysis without the time gap compared to

when the time gap was included), then the number of days between submissions would be considered a confounding variable. In an

experiment where treatments are randomly assigned, one assumes

these variables get evenly shared across the groups with the

intention that any influence they may have on the outcome is

negated or reduced.

Types of Bias

1. Non-response: a large percentage of those sampled do not respond or participate.
2. Response: when study participants either do not respond truthfully or give answers they feel the researcher wants to hear. For example, when students are asked if they ever cheated on an exam, even those who have might respond "no".
3. Selection: this bias occurs when the sample selected does not reflect the population of interest. For instance, you are interested in the attitude of female students regarding campus safety, but when sampling you also include males. In this case your population of interest was female students; however, your sample included subjects not in that population (i.e. males).

LOOKING AHEAD: Students interested in pursuing topics related to

design of experiments might explore STAT 503: Design of

Experiments. STAT 503 includes extensive coverage

implementation and analysis of a wide range of experimental

designs.

Unit Summary

Variable and Its Type

Graphs for a Categorical Variable

Pie Chart

Bar Chart

Graphs for a Single Quantitative Variable

Dot Plot

Frequency Histogram and Relative Frequency

Histogram

Stem-and-Leaf Diagram

Time Plot

Boxplot or Box-and-Whisker Plot

by Course Author Mosuk Chow - (length 6:23)


Distinguishing between qualitative and quantitative variables is a basic and integral part of applied

statistics as the methods to analyze these data are very different.

Sometimes, when one codes surveys, you would code male as 1

and female as 2. Beware, gender is qualitative: there are two

different classes. 1 and 2 just denote two different symbols for

gender and there is no ordering between these two symbols when

used to denote male and female. Another example is team

assignments. For your team project, I will call the teams: Team 1 ,

Team 2 etc. The team a student belongs to is again qualitative. In

statistics, as in most languages, we sometimes call the same thing

by different names. So qualitative is also called nominal, or

categorical.

How can one graph qualitative variables? Two common choices are

pie charts and bar charts. Note that even though histograms also have bars sticking up, they are used to describe the frequency of quantitative variables; the term bar chart is reserved for graphs that show the frequency of categorical variables.

You will practice drawing graphs for these two different types of

variables. Again, you will be asked in this lesson to work these

examples out by hand. After a good understanding of these

concepts has been established, the course will review all of these

using the Minitab statistical software.


Techniques of describing data in ways to capture the essence

of the information in the data are called descriptive statistics.

To draw conclusions from data about the population is

called inferential statistics.

A survey of university students' favorite sport to watch shows that 238 said Football, 126 said Basketball, 45 said Hockey, 46 said Others.

... one and only one value. For the above example, the values are: Football, Basketball, Hockey, Others.

It is important to distinguish between the types of variables, since the methods to describe them and to do inferences about them are very different.

1. Qualitative (Categorical) : Data that serves the function of a

name only. For example, for coding purposes, you may assign Male

as 0, Female as 1. The numbers 0 and 1 stand only for the two

categories and there is no order between them. Categorical values

may be:

Binary: where there are two choices, e.g. Male and Female;
Ordinal: where the names imply levels with hierarchy or order of preference, e.g. level of education;
Nominal: where no hierarchy is implied, e.g. political party affiliation.

2. Quantitative: Data that take on numerical values with a measure of distance between them. Quantitative values can be discrete (counted), as in the number of people in attendance, or continuous (measured), as in the weight or height of a person.

Additional examples of both include:

Number of females in this class (Quantitative, Discrete)

Nationality (Categorical, nominal)

Amount of milk in a 1 gallon container (Quantitative,

Continuous)

Sex of students (even if coded as M = 0, F = 1) (Categorical,

Binary)

1. Pie Chart: area of the pie represents the percentage of that

category.

Example

University student's favorite sport to watch. (We will use

Minitab to draw graphs and charts in Lesson 3).

Remarks:

a) Pie charts may not be suitable for too many categories. Thus, if

there are too many categories, you can either combine some

categories or use a bar chart to represent the data. What is meant

by "too many"? There is no clear cut off, more of just a judgment

on the appearance.

b) Readers may find the pie chart more useful if the percentages

are arranged in a descending or ascending order.

2. Bar Chart: The height of the bar for each category is equal to the

frequency (number of observations) in the category. Leave space in

between the bars to emphasize that there is no ordering in the

classes.

Example

University student's favorite sport to watch.
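The numbers behind both charts can be computed directly from the survey counts quoted earlier (238, 126, 45, 46). The sketch below prints each category's frequency (the bar height), its percentage (the pie slice), and a rough text bar; the text bar is only a stand-in for a real chart.

```python
counts = {"Football": 238, "Basketball": 126, "Hockey": 45, "Others": 46}

total = sum(counts.values())
percent = {sport: 100 * c / total for sport, c in counts.items()}

# List the categories in descending order of frequency.
for sport, c in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    bar = "#" * (c // 10)  # one '#' per 10 responses
    print(f"{sport:<11}{c:>4}  {percent[sport]:5.1f}%  {bar}")
```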

1. Dotplot: Useful to show the relative positions of the data.

Example

reading aptitude test. The scores were as follows:

95 78 69 91 82

76 76 86 88 79

2. Histogram: When there are many data points and we would like to see the distribution of the data, we can represent the data by a frequency histogram or a relative frequency histogram.

Group the data into about 5-20 class intervals and show the

frequency or relative frequency of data in each interval.

Example: the weight of Jessica (30 observations)

135 137 136 137 138 139

140 139 137 140 142 146

148 145 139 140 142 143

144 143 141 139 137 138

139 136 133 134 132 132

Since the data range is from 132 to 148, it is

convenient to have a class of width 2 since that will give us 9

intervals :

131.5 - 133.5 133.5 - 135.5 135.5 - 137.5

137.5 - 139.5 139.5 - 141.5 141.5 - 143.5

143.5 - 145.5 145.5 - 147.5 147.5 - 149.5

End points ending in .5 are chosen to avoid confusion about whether an end point belongs to the interval to its left or the interval to its right. An alternative is to specify the end point convention. For example, Minitab includes the left end point and excludes the right end point. Having the intervals, one can draw the frequency histogram, or compute the relative frequencies to construct the relative frequency histogram. The following histogram is produced by Minitab when we specify the midpoints for the definition of intervals according to the intervals chosen above.

If the intervals are not specified, Minitab chooses another set of class intervals by default, resulting in the following histogram. According to the include-left, exclude-right end point convention, the observation 133 is included in the class 133-135.

Different choices of class intervals result in different histograms. Relative frequency histograms are

constructed in much the same way as a frequency histogram

except that the vertical axis represents the relative frequency

instead of the frequency. For the purpose of visually

comparing the distribution of two data sets, it is better to use

relative frequency rather than a frequency histogram since

the same vertical scale is used for all relative frequency--from

0 to 1.
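The frequency and relative frequency for each of the nine class intervals chosen above can be tallied with a short script. This is a plain-Python sketch of the counting step only; it does not reproduce Minitab's output or draw the histogram.

```python
weights = [
    135, 137, 136, 137, 138, 139,
    140, 139, 137, 140, 142, 146,
    148, 145, 139, 140, 142, 143,
    144, 143, 141, 139, 137, 138,
    139, 136, 133, 134, 132, 132,
]

start, width, n_classes = 131.5, 2, 9  # intervals 131.5-133.5, ..., 147.5-149.5

counts = [0] * n_classes
for w in weights:
    counts[int((w - start) // width)] += 1  # index of the interval containing w

relative = [c / len(weights) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    low = start + i * width
    print(f"{low:5.1f} - {low + width:5.1f}  {c:2d}  {r:.3f}")
```

Because the end points end in .5 and the data are whole numbers, no observation can fall exactly on a boundary, so the end point convention does not matter here.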

3. Stem-and-Leaf Diagram: Group the data and still keep the

number. One can recover the original data (except the order

the data is taken) from the diagram.

The stem represents the major groupings of the data. The

leaves represent the last digit. For example, the first value

(also smallest value) is 132, with 13 as the stem and 2 as

the leaf.

Stem-and-Leaf of weight of Jessica
N = 30
Leaf Unit = 1.0

  3   13  223
  5   13  45
 11   13  667777
 (7)  13  8899999
 12   14  0001
  8   14  2233
  4   14  45
  2   14  6
  1   14  8

The display above is produced by Minitab. The first column, called depths, displays cumulative counts. Starting from the top, the depths indicate the number of observations that lie in a given row

or before. For example, the 11 in the third row indicates

that there are 11 observations in the first three rows. The

row that contains the middle observation is denoted by

having a bracketed number of observation in that row; (7)

for our example. We thus know that the middle value lies in

the fourth row. The depths following that row indicate the

number of observations that lie in a given row or after. For

example, the 4 in the seventh row indicates that there are

four observations in the last three rows.
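The stems and leaves of such a display can be rebuilt from the raw data. The sketch below splits each stem into rows that hold two leaf digits each (0-1, 2-3, and so on), which matches the row layout of the display for this data set; the function name is hypothetical, and the depths column is deliberately left out to keep the example short.

```python
from collections import defaultdict

weights = [
    135, 137, 136, 137, 138, 139,
    140, 139, 137, 140, 142, 146,
    148, 145, 139, 140, 142, 143,
    144, 143, 141, 139, 137, 138,
    139, 136, 133, 134, 132, 132,
]

def stem_and_leaf(values, leaves_per_row=2):
    """Return (stem, leaves) rows; each row covers `leaves_per_row` leaf digits."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 132 -> stem 13, leaf 2
        rows[(stem, leaf // leaves_per_row)].append(leaf)
    return [
        (stem, "".join(str(leaf) for leaf in leaves))
        for (stem, _), leaves in sorted(rows.items())
    ]

display = stem_and_leaf(weights)
for stem, leaves in display:
    print(f"{stem:3d}  {leaves}")
```

Rows with no observations are simply omitted by this sketch.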

4. Boxplot: The boxplot will be discussed in greater detail

when we discuss "Summarizing Data" because the design of

the boxplot is dependent upon various summary measures

we will learn in that lesson.

5. Time Plot: Note that for the weight of Jessica, one important

aspect of the data is lost if one just shows the distribution. Jessica

may be really interested in how her weight changes over time. For

that purpose, a plot of weight versus the order it is taken (time) is

warranted.


1. Draw - by hand - two appropriate graphs for the following data:

University officials periodically review the distribution of undergraduate majors within the colleges of the

university to help determine a fair allocation of resources to departments within the colleges. At one such

review, the following data were obtained:

College                    Number of majors
Agriculture                 1,500
...                        11,000
Business Administration     7,000
Education                   2,000
Engineering                 5,000

2. Draw a frequency histogram and a relative frequency histogram for the 2014 annual per capita city tax

given below:

2470, 520, 561, 488, 986, 359, 1305, 512, 467, 270, 360, 451, 4904, 572, 498, 382, 271, 634,

1682, 784, 298, 643, 947, 686

3. Draw - by hand - a Stem-and-Leaf plot

Draw the stem-and-leaf diagram for the following data set. Use the stem and leaf diagram to find the

median of the data set:

11 11 12 13 14 14 14 14 15 15 15 16 16 17 17 19 19 20 22 25

4. Populations and Samples

Selecting the proper diet of brook trout or other fresh water fish is an important aspect of fish farming. A

researcher wants to estimate the mean weight of brook trout maintained on a specific diet for a period of 6

months. One hundred brook trout are selected from a fishery's tank and each is weighed.

a. What is the population of interest to the researcher?

b. What is the sample?

c. What characteristics of the population are of interest to the researcher?

d. If the sample measurements are used to make inferences about certain characteristics of the population,

why is a measure of the reliability of the information important?

Minitab is a software application for the statistical analysis of data; it makes tabular data and graphs easy to interpret. It can be used for:

Basic statistics
... analysis, regression with life data, accelerated life testing, probit analysis, warranty prediction, test plans, and growth curves)
Regression
Multivariate analysis
Analysis of variance
Time series
... design, mixture)
Tables
Control charts
Nonparametrics
Power and sample size
... planning, process capability, acceptance sampling, and gage study)

You can also access guidance for the following graphs in the Graph menu:

Scatterplot

Interval plot

Matrix plot

Marginal plot

Line plot

Histogram

Bar chart

Dotplot

Pie chart

Stem-and-leaf plot

Probability plot

Area graph

Empirical CDF

Contour plot

Probability distribution plot

3D scatterplot

Boxplot

3D surface plot

Use this information to assess the basic properties of your data distributions:

number of observations

central tendency: the location of the center, or most typical value, of the data set

dispersion: the amount of variation or spread in the data set

Display Descriptive Statistics can provide summary information for whole columns of data or for

subsets of data within columns.
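The same three properties can be computed outside Minitab. The sketch below uses Python's standard `statistics` module on the ten reading aptitude scores from the dotplot example; it is an illustrative stand-in and does not reproduce the layout of Minitab's Display Descriptive Statistics output.

```python
import statistics

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 79]

summary = {
    "n": len(scores),                      # number of observations
    "mean": statistics.mean(scores),       # central tendency
    "median": statistics.median(scores),   # central tendency
    "stdev": statistics.stdev(scores),     # dispersion (sample standard deviation)
    "min": min(scores),
    "max": max(scores),
}

for name, value in summary.items():
    print(f"{name:<7}{value}")
```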

Descriptive Statistics

Once we have identified the population, the data types, and the variables, and have collected the data for the sample, our objective is to describe the characteristics of the sample unambiguously and precisely, so that the data can be communicated easily to others. The collected data can be described or summarized in two ways: graphically or numerically.

The graphical description depends on the type of data. As presented above, there are quantitative and qualitative data.

Graphs that describe qualitative data include: bar chart, pie chart, and Pareto chart.
Graphs that describe quantitative data include: dot plot, histogram, and stem-and-leaf diagram.

In descriptive statistics, the methods of presenting data include:
the frequency distribution

Describing Data

For a meteorology study, a student collected the following data for the city where he lives over one year. The values are the number of days in each month on which significant precipitation was recorded. The project started after the end of January, so there is no record for that month:

Weather experiment
Months: Jan., Feb., Mar., Apr., May, Jun., Jul., Aug., Sept., Oct., Nov., Dec.
Days with precipitation: * (Jan.), 2 (Feb.), ..., 10

Open a new project:

When Minitab opens, it looks like this:

The upper window, called the "Session" window, is where Minitab displays the results of the requested statistical analysis. The lower window, called the "Worksheet" window, is where we copy and enter the data. A third window, called the "Graphics" window, appears only when Minitab is asked to plot something. Below is an example of a graphics window:

The active window is the one whose title bar appears dark blue. To make an inactive window active, simply click anywhere in that window.

In the WORKSHEET, the data are entered in column C1.

All of the data that you analyze in this course will be posted on the course web site. You will just have to

copy and paste the data into a worksheet. Let's try it out on the idealwt.txt data set. Once the data are in

your browser's window, the easiest way of copying the data is to Select all and then right-click and Copy.

To paste the data into the Minitab worksheet, put your cursor in the first (unnumbered) row of the first

column, and then click on Edit >> Paste cells (or click on the standard clipboard icon used to denote

pasting).

Your worksheet should look like this:

Note that the first (unnumbered) row is reserved for variable names. This is one thing about which you

will have to be careful. If you accidentally place your cursor instead in the row numbered 1, Minitab will

then treat the data as if they are text:

Note that Minitab has added a "-T" to the column names C1 and C2 to denote that the content of the two

columns is text. Another indication that the content of the columns is treated as text is that the textual

content is left-justified whereas numbers are always right-justified. Minitab cannot summarize data, such

as calculating means and standard deviations, when they are treated as text. If you accidentally make this

mistake, just open a new worksheet (Select File >> New... >> Minitab Worksheet >> OK) and paste the

data properly.

https://onlinecourses.science.psu.edu/statprogram/print/book/export/html/51

Analyzing Data

Once you've pasted or uploaded data into a Minitab worksheet, you no doubt will want to analyze it. The

commands to do so all appear in one of the Minitab menus:

More often than not, we will use the "Stat" and "Graph" menus in this course. The menus are generally

self-explanatory, but we will provide you with Minitab help throughout the course.

To create a histogram of the data, just select Graph >> Histogram... >> Simple. A standard Minitab dialog box will appear. In general, the dialog box means that you have to provide Minitab with more information before it can complete your request. For the histogram dialog box:

you need to tell Minitab that "actual" is the variable of interest. To do this you can either 1) click on the variable name ("actual") once and then click on "Select" or just 2) double-click on the variable name. The name should appear in the box labeled "Graph variables." Then, once you select "OK," a new graph window containing the histogram should appear.

To have Minitab display basic descriptive statistics, select Stat >> Basic Statistics >> Display

Descriptive Statistics .... The following Minitab dialog box will appear:

You need to tell Minitab that in this case "actual" is the variable of interest. You can either 1) click on the

variable name ("actual") once and then click on "Select" or just 2) double-click on the variable name. The

name should appear in the box labeled "Variables." Do the same thing to tell Minitab which variable you

would like to group the statistics by. In this case we will click in the box labeled "By variables" and then

either 1) click on the variable name ("sex") once and then click on "Select" or just 2) double-click on the

variable name. Once you select "OK," the standard descriptive statistics output should appear in the

"Session" window:

All Minitab graph and analysis commands function similarly to the examples illustrated above.

Throughout the course, help will be provided for the various Minitab commands we will encounter.

In the next lesson page we will use a Viewlet to walk you through another example of how Minitab works.

To copy output appearing in the Session window, select the desired output using your mouse. To copy a

graph window, make the graph window active by clicking anywhere in it, and then select Edit >> Copy

Graph.

To paste either output or a graph, select Edit >> Paste (or use the standard clipboard icon used to denote

pasting).

RemoteApps and WebApps users should select the Send Graph to Microsoft

Word option and a Word Document with this graph in it will be created and can

be saved in your PASS space.

While you can save your work in bits and pieces (the graphs separately from the worksheet), more often than not it is best to save your entire Minitab "project." A Minitab project includes all of the work

created in one session, including multiple worksheets, the Session window, and multiple graph windows.

Basically, if you save your work as a Minitab project, you can resume your work right where you left off.

To save your work as a Minitab project, select File >> Save Project As..., and provide an appropriate

filename in the dialog box. Minitab projects are given a ".MPJ" extension. For the purpose of this course,

you may consider creating one project for each lesson, and thereby naming the projects lesson1.MPJ,

lesson2.MPJ, and so on.

RemoteApps and WebApps users' work will be saved in the user's PASS space.

It can be downloaded to the user's local computer from there.

Of course, you can print your Minitab work as well. To do so, activate the window that you want to print

by clicking your mouse anywhere on the window. Then, select File >> Print Worksheet or File >> Print

Session Window or File >> Print Graph, depending on what it is that you want to print.

Minitab Help

There are various ways that you can get Minitab help.

1. If you would like more tutorial help, you can try the Tutorials option provided under the Help pull-down menu in the Minitab application shown below:

2. You can look for help in the Minitab Help on-line manual, also listed in the pull-down menu pictured above.

3. You can use the various sets of Minitab instructions provided to you throughout the course. You will probably find links to these from the Homework Problems and Lab Activities in each lesson.

4. Finally, you can post a question to the Minitab discussion board located at the course level under the CONTENT tab in ANGEL. This is a separate discussion board just for questions related to how to use Minitab.

Minitab offers several resources that are helpful for you. Minitab 17 Support - Getting Started is a concise

guide designed to quickly get you familiar with using Minitab Statistical Software.

In addition to Minitab 17 - Getting Started, the following tools are available:

Help: A complete Help file is incorporated in Minitab, which provides you with instructions, examples

with interpretations, overviews and detailed explanations, troubleshooting tips, formulas, references, and a

glossary. Open Help by choosing Help >> Help or by clicking on the Help button on every dialog box in

the software.

StatGuide: The StatGuide provides statistical guidance after you run a procedure in Minitab. Open the

StatGuide by right-clicking on your output in the Session window, then choosing StatGuide.

Tutorials: Step-by-step tutorials help new users learn how to use Minitab. You can open these by choosing

Help > Tutorials while using Minitab.

Minitab News: Every month Minitab also publishes a newsletter delivered to your email box that provides

you with customer stories, highlights of new product capabilities and tips on how to get the most out of

using our products. To sign up for Minitab News, simply create an account under My Account. Or feel free

to look at some of the past issues.

The Minitab website is full of powerful articles and information including why Minitab is used by over

4,000 colleges and universities worldwide. You will find:

Minitab Tips Tricks and Tutorials

Jobs that are looking for Minitab experience

YouTube videos

Minitab also offers free web events both live and recorded.

Last, but not least, remember that Minitab provides free access to their support team, staffed by

professionals with expertise in the software, statistics, quality improvement, and computer systems.

Visit the support web site or call +1-814-231-2682 to speak with Minitab's technical support specialists.

https://onlinecourses.science.psu.edu/stat500/node/11

Lesson 2 - Summarizing Data: Measures of Central Tendency and Measures of Variability, Box Plot


We will first talk about the important concepts of statistical inference. Then a few descriptive measures of

the most important characteristic of a data set, central tendency, will be given. After that, a few descriptive

measures of the other important characteristic of a data set, measure of variability, will be discussed. This

lesson will be concluded by a discussion of box plots, which are simple graphs that show the central

location, variability, symmetry, and outliers very clearly.

Again, this lesson will focus on simple examples that can be calculated or drawn by hand. This lesson will

be followed by another lesson that will work through many of these procedures using Minitab.

Lesson 2 Objectives

Upon successful completion of this lesson, you will be able to:

conceptualize statistical inference.

use appropriate summary measures to describe different data sets.

construct and use box plots.

2.1 - Measures of Central Tendency and Skewness


Unit Summary

Mean

Median

Mode

Trimmed Mean

Skewness

Adding and Multiplying Constants

Reading Assignment

An Introduction to Statistical Methods and Data Analysis, (See your course schedule.)

Measures of Central Tendency

Three of the many ways to measure central tendency are:

1. Mean

2. Median

3. Mode


In most research experimental situations, examination of all members of a population is not typically

conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a

representative subset of the population.

Let's take a closer look at what this diagram implies with Dr. Wiesner.

Descriptive measures of population are parameters. Descriptive measures of a sample are statistics. For

example, a sample mean is a statistic and a population mean is a parameter. The sample mean is usually

denoted by $\bar{y}$:

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$

where $n$ is the sample size and the $y_i$ are the measurements. One may need to use the sample mean to estimate

the population mean since usually only a random sample is drawn and we don't know the population

mean.

A Note on Notation!

What if we say we used $x_i$ for our measurements instead of $y_i$? Is this a problem? No. The

formula would simply look like this:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The formulas are exactly the same. The letters that you select to denote the measurements are up

to you. For instance, many textbooks use x instead of y to denote the measurements.

The point is to understand how the calculation that is expressed in the formula works. In this

case, the formula is calculating the mean by summing all of the observations and dividing by the

number of observations.

There is some notation that you will come to see as standard; for example, n will always denote the sample size. We will make a point of letting you know what these are. However, when it comes to the variables, these labels can (and do) vary.

For example, in one study x may be used to denote weight and y may be used to denote height (or the reverse may be used!), but n will always be used to denote sample size in each case.

Note that for the data set:

1, 1, 2, 3, 13

mean = 4, median = 2, mode = 1
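As a quick check of the three values above, here is a minimal sketch using only Python's standard `statistics` module. The variable names are illustrative and not part of the course material.

```python
import statistics

data = [1, 1, 2, 3, 13]

mean = statistics.mean(data)      # sum of the observations divided by their count
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequently occurring value

print("mean:", mean, "median:", median, "mode:", mode)
```

The single large value 13 pulls the mean well above the median, which is the behaviour discussed later under extreme values.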

Steps to finding the median for a set of data:

1. Arrange the data in increasing order

2. Find the location of median in the ordered data by (n + 1) / 2

3. The value that represents the location found in Step 2 is the median. NOTE: if the sample size is an odd

number then the location point will produce a median that is an observed value as in the example above.

If sample size is an even number, then the location will require one to take the mean of two numbers to

calculate the median. The result may or may not be an observed value as the example below illustrates.
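The three steps above can be sketched in Python as follows. This is only an illustration: the function name `median_by_location` is made up, the input is assumed to be non-empty, and the even-sized data set in the usage lines is invented for the example.

```python
def median_by_location(values):
    """Median via the (n + 1) / 2 location rule (assumes a non-empty list)."""
    ordered = sorted(values)                 # Step 1: arrange in increasing order
    n = len(ordered)
    location = (n + 1) / 2                   # Step 2: 1-based location of the median
    if location.is_integer():                # odd n: the location is an observed value
        return ordered[int(location) - 1]
    lower = ordered[int(location) - 1]       # even n: the location falls between
    upper = ordered[int(location)]           # two neighbouring ordered values
    return (lower + upper) / 2               # Step 3: average the two neighbours

print(median_by_location([1, 1, 2, 3, 13]))  # odd sample size
print(median_by_location([1, 2, 3, 4]))      # even sample size (made-up data)
```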

Mean, median and mode are usually not equal. When the data is symmetric, the mean is equal to the

median.

4. Trimmed Mean

One shortcoming of the mean is that: Means are easily affected by extreme values.

95, 78, 69, 91, 82, 76, 76, 86, 88, 80

Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1

If the entry 91 is mistakenly recorded as 9, the mean would be 73.9,

which is very different from 82.1.

On the other hand, let us see the effect of the mistake on the median

value:

The original data set in increasing order are:

69, 76, 76, 78, 80, 82, 86, 88, 91, 95

With n = 10, the median position is found by (10 + 1) / 2

= 5.5. Thus, the median is the average of the fifth (80)

and sixth (82) ordered value and the median = 81

The data set (with 91 coded as 9) in increasing order is:

9, 69, 76, 76, 78, 80, 82, 86, 88, 95

where the median = 79

The medians of the two sets are not that different. Therefore the median is not

that affected by the extreme value 9.

Measures that are not that affected by extreme values are called resistant.

A variation of the mean is the trimmed mean. A 10% trimmed mean drops the

highest 10%, the lowest 10%, and averages the remaining. Let's calculate the

trimmed mean for the data we were looking at above:

(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)

The 10% trimmed mean = 82.13

(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)

The 10% trimmed mean = 79.38

The 10% trimmed mean of the two sets is not that different. The trimmed mean is

not as affected by the extreme value 9 as the mean.
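To see the comparison in one place, the following sketch recomputes the mean, median, and 10% trimmed mean for both versions of the data. The helper `trimmed_mean` is a hypothetical name written for this illustration; it assumes the trimming proportion times the sample size is a whole number, as it is here (10% of 10 values is 1 value at each end).

```python
import statistics

def trimmed_mean(values, proportion=0.10):
    """Drop the lowest and highest `proportion` of the ordered values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)       # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
miscoded = [9 if x == 91 else x for x in scores]   # 91 mistakenly recorded as 9

for label, data in (("original", scores), ("miscoded", miscoded)):
    print(label,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "10% trimmed mean:", trimmed_mean(data))
```

The mean moves by more than 8 points after the recording error, while the median and the trimmed mean move by only 2 to 3 points.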

After reading this lesson you should know that there are quite a few options when one wants to describe central tendency, for example, the mean, median, mode, and trimmed mean. In future lessons, we talk mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily affected by extreme values. One remedy is to use a trimmed mean to estimate the central tendency.

Remember, however, that this is very different from saying that one can trim data. Unless data points are

known mistakes, one should not remove them from the data set! One should keep the extreme points and

use more resistant measures. For example, use the sample median to estimate the population median. Or,

use the sample trimmed mean to estimate the population trimmed mean. Again, this is very different from

saying that it is OK to trim data from a data set.

Skewness

Skewness is a measure of degree of asymmetry of the distribution.

1. Symmetric

Mean, median, and mode are all the same here; the distribution is mound shaped, and no skewness is

apparent. The distribution is described as symmetric.

2. Skewed Left

Mean to the left of the median, long tail on the left.

3. Skewed Right

Mean to the right of the median, long tail on the right.

When one has very skewed data, it is better to use the median as measure of central tendency since the

median is not much affected by extreme values.
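The mean-versus-median comparison described above can be turned into a rough check. The sketch below is a heuristic only: the function name `skew_direction` is invented for this example, and equality of mean and median does not by itself prove that a distribution is symmetric.

```python
import statistics

def skew_direction(values):
    """Rough skew indicator based only on where the mean sits relative to the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "skewed right"
    if mean < median:
        return "skewed left"
    return "no skew indicated"

print(skew_direction([1, 1, 2, 3, 13]))   # data set from earlier in this lesson
print(skew_direction([1, 2, 3, 3, 4, 5])) # made-up symmetric data
```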

Example: The Skewed Nature of Salary Data

Salary distributions are almost always positively skewed, with

a few people that make the most money. To illustrate this, consider your favorite sports team or

even the company for which you work. There will be one or two players or personnel that earn

the big bucks, followed by others who earn less. This will produce a shape that is skewed to

the right. Knowing this can be a useful aid in negotiating a higher salary.

When one interviews for a position and the discussion gets around to compensation, it is

common that the interviewer states an offer that is typical for someone in your position. That

is, they are offering you the average salary for someone with your particular skill set (e.g. little

experience). But is this average the mode, median, or mean? The company (for whom business is business!) will want to pay you the least they can, while you prefer to earn the most you can. Since salaries tend to be skewed to the right, the offer will most likely reflect the mode or median. You simply need to ask which average the offer refers to and what the mean salary is, since the mean would be the highest of the three values. Once you have these averages, you can begin to negotiate toward the highest number.

Adding and Multiplying Constants

What happens to the mean and median if we add a constant to each observation in a data set, or multiply each observation by a constant? Consider, for example, an instructor who curves an exam by adding five points to each student's score. What effect does this have on the mean and the median? Adding a constant to each value changes the mean and the median by that same constant. For example, for the 10 aptitude scores above, if 5 were added to each score the mean of the new data set would be 87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).

Similarly, if each observed data value were multiplied by a constant, the new mean and median would change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were doubled, then the new mean and new median would be double the original mean and median. As we will learn shortly, the effect is not the same on the variance!
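A short sketch confirms these effects on the 10 aptitude scores. It uses only the standard `statistics` module, and the variable names are illustrative.

```python
import statistics

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]

shifted = [x + 5 for x in scores]   # add a constant to every score
doubled = [x * 2 for x in scores]   # multiply every score by a constant

for label, data in (("original", scores), ("plus 5", shifted), ("times 2", doubled)):
    print(label, "mean:", statistics.mean(data), "median:", statistics.median(data))
```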

Why would you want to know this? One reason, especially for those moving onward to more applied statistics (e.g. Regression, ANOVA), is transforming data. For many applied statistical methods a required assumption is that the data are normal, or very nearly bell-shaped. When the data are not normal, statisticians will transform the data using one of numerous techniques, e.g. a logarithmic transformation. But the log cannot be taken of all values; for instance, the log of 0 is undefined. However, if we add a constant to all the data values, making them all greater than zero, then a log can be taken without risk. We just need to remember that the original data were transformed!

Unit Summary

Measures of Variability

Range

Interquartile Range (IQR)

Variance and Standard Deviation

Adding and Multiplying Constants

Empirical Rule

How to Roughly Approximate Standard Deviation

Coefficient of Variation

Z-score, Z-value, or Z

Reading Assignment

An Introduction to Statistical Methods and Data Analysis, (see

course schedule).

Measures of Variability


If you can use two numbers to summarize

Jessica's weight data, which two characteristics

will you use as measures?

Range

Interquartile range (IQR)

Variance and Standard deviation

Let's look at each of these in turn.

A. Range: R = maximum - minimum

1.Easy to calculate

2.Very much affected by extreme values (range is not a

resistant measure of variability)

B. Interquartile range (IQR)

In order to talk about interquartile range, we need to first

talk about percentiles.

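The pth percentile of the data set is a measurement such that, after the data are ordered from smallest to largest, at most p% of the data are below this value and at most (100 - p)% are above it.

For example, the median is the 50th percentile: about half of the data values fall at or below the median. Also, Q1 = lower quartile = the 25th percentile and Q3 = upper quartile = the 75th percentile.

The interquartile range is the difference between the upper and lower quartiles and is denoted as IQR.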

IQR = Q3 - Q1 = upper quartile - lower quartile = 75th

percentile - 25th percentile.

Details about how to compute IQR will be given in Lesson

2.3.

Note: IQR is not affected by extreme values. It is thus a

resistant measure of variability.

Two vending machines A and B drop candies when a quarter

is inserted. The number of pieces of candy one gets is

random. The following data are recorded for six trials at

each vending machine:

Pieces of candy from vending machine A:

1, 2, 3, 3, 5, 4

mean = 3, median = 3, mode = 3

Pieces of candy from vending machine B:

2, 3, 3, 3, 3, 4

mean = 3, median = 3, mode = 3

Dotplots for the pieces of candy from vending

machine A and vending machine B:

They have the same center, but what about their spreads?

One way to compare their spreads is to compute their

standard deviations. In the following section, we are going

to talk about how to compute the sample variance and the

sample standard deviation for a data set.

Variance is the average squared distance from the mean.

Population variance is defined as:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$$

In this formula $\mu$ is the population mean and the summation is over all possible values of the population. $N$ is the population size.

The sample variance that is computed from the sample and used to estimate $\sigma^2$ is:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

Why do we divide by $n - 1$ instead of by $n$? Since $\mu$ is unknown and estimated by $\bar{y}$, the $y_i$'s tend to be closer to $\bar{y}$ than to $\mu$. To compensate, we divide by a smaller number, $n - 1$. The sample variance (and therefore the sample standard deviation) is the common default calculation used by software. When asked to calculate the variance or standard deviation of a set of data, assume, unless otherwise instructed, that this is sample data and therefore calculate the sample variance and sample standard deviation.

For example, let's find $s^2$ for the data set from vending machine A: 1, 2, 3, 3, 4, 5

$$\bar{y} = \frac{1 + 2 + 3 + 3 + 4 + 5}{6} = 3$$

$$s^2 = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1} = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{6 - 1} = 2$$

Calculate $s^2$ for the data set from vending machine B yourself and check that it is smaller than the $s^2$ for data set A.
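The difference between dividing by N and dividing by n - 1 can be made concrete with a small sketch for the machine A data. The two function names are illustrative; Python's standard `statistics.variance` (which uses the n - 1 divisor) is included as a cross-check.

```python
import statistics

def population_variance(values):
    """Average squared distance from the mean, dividing by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / n

def sample_variance(values):
    """Same sum of squared deviations, but divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

machine_a = [1, 2, 3, 3, 4, 5]

print("divide by N:    ", population_variance(machine_a))
print("divide by n - 1:", sample_variance(machine_a))
print("statistics.variance:", statistics.variance(machine_a))
```

Passing the machine B data to `sample_variance` is one way to check a hand calculation after working it out.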

Standard Deviation

The population standard deviation is denoted by $\sigma$ and found by $\sigma = \sqrt{\sigma^2}$. It has the same unit as the $y_i$'s. This is a desirable property since one may think about the spread in terms of the original unit.

$\sigma$ is estimated by the sample standard deviation $s$:

$$s = \sqrt{s^2}$$

For the data set A, $s = \sqrt{2} \approx 1.414$ pieces of candy.

Calculate the standard deviation for the data set from vending machine B.

The standard deviation is approximately the average distance

the values of a data set are from the mean, and is a very

useful measure. One reason is that it has the same unit of measurement as the data itself (e.g. if a sample of student heights were in inches then so, too, would be the standard deviation; the variance would be in squared units, for example inches²). Also, the empirical rule, which will be

explained in the following section, makes the standard

deviation an important yardstick to find out approximately

what percentage of the measurements fall within certain

intervals.

Adding and Multiplying Constants

What happens to measures of variability if we add or multiply each

observation in a data set by a constant? We learned previously

about the effect such actions have on the mean and the median,

but do variation measures behave similarly? Not really.

When we add a constant to all values we are basically shifting the

data upward (or downward if we subtract a constant). This has the

result of moving the middle but leaving the variability measures

(e.g. range, IQR, variance, standard deviation) unchanged.

On the other hand, if one multiplies each value by a constant, this does affect measures of variation. The variance is multiplied by the square of the constant, while the standard deviation, range, and IQR are multiplied by the constant (by its absolute value, if the constant is negative). For example, if the observed values of Machine A in the example above were multiplied by three, the new variance would be 18 (the original variance of 2 multiplied by 9). The new standard deviation would be about 4.243 (the original standard deviation of 1.414 multiplied by 3). The range and IQR would also change by a factor of 3.
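These effects can be checked numerically for the machine A data. The sketch below uses the standard `statistics` module; the helper `spread` is a hypothetical name, and the IQR is left out because its value depends on the quartile method used.

```python
import statistics

def spread(values):
    """Return (range, sample variance, sample standard deviation)."""
    return (max(values) - min(values),
            statistics.variance(values),
            statistics.stdev(values))

machine_a = [1, 2, 3, 3, 4, 5]
shifted = [y + 5 for y in machine_a]   # add a constant: spread is unchanged
tripled = [y * 3 for y in machine_a]   # multiply by 3: variance x 9, sd and range x 3

for label, data in (("original", machine_a), ("plus 5", shifted), ("times 3", tripled)):
    print(label, spread(data))
```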

Empirical Rule

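The Empirical Rule is sometimes referred to as the 68-95-99.7% Rule. If the set of measurements follows a bell-shaped distribution, then approximately 68% of the measurements lie within $\bar{y} \pm s$, approximately 95% within $\bar{y} \pm 2s$, and approximately 99.7% within $\bar{y} \pm 3s$.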

The following five examples (a-e) show that the empirical rule

is not that far off even when the underlying distribution is not bell

shaped.

a. For the following graph, $\bar{y} = 5.5$, $s = 1.49$

Interval $\bar{y} \pm s$: (5.5 - 1.49, 5.5 + 1.49) = (4.01, 6.99)

94% within $\bar{y} \pm 2s$: (5.5 - 2.98, 5.5 + 2.98) = (2.52, 8.48)

100% within $\bar{y} \pm 3s$: (5.5 - 4.47, 5.5 + 4.47) = (1.03, 9.97)

b. For the following graph, $\bar{y} = 5.5$, $s = 2.07$

96% within $\bar{y} \pm 2s$

100% within $\bar{y} \pm 3s$

c. For the following graph:

100% within $\bar{y} \pm 2s$

100% within $\bar{y} \pm 3s$

d. For the following graph, $\bar{y} = 3.49$, $s = 1.87$

96% within $\bar{y} \pm 2s$

98.5% within $\bar{y} \pm 3s$

e. For the following graph, $\bar{y} = 2.57$, $s = 1.87$

95% within $\bar{y} \pm 2s$

97.6% within $\bar{y} \pm 3s$

Approximating the Standard Deviation


How can one find an approximate value of $s$ without going through the detailed computation? It follows from the empirical rule that approximately 95% of measurements (almost all) lie in $\bar{y} \pm 2s$. Thus:

$$\text{Range} \approx 4s, \quad \text{so} \quad s \approx \frac{\text{range}}{4}$$

Why don't we say $\bar{y} \pm 3s$ contains all and divide by 6 to obtain the approximate value of $s$?

It is important to remember that one has to use the formula:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$

to compute the sample standard deviation. The formula $s \approx \text{range}/4$ only gives a rough estimate of $s$.

For example, a sample of 36 measurements (in years), arranged in increasing order, is:

31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,

54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,

71, 71, 74, 75, 77, 79, 79, 79

The data range is from 31 to 79. Thus, using the 'shortcut' formula

to approximate the value of s is as follows: (79-31) / 4 = 12 years.
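To get a feel for how rough the range / 4 estimate is, the sketch below compares it with the sample standard deviation computed by Python's standard `statistics.stdev` for the same 36 values. The variable names are illustrative.

```python
import statistics

sample = [31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
          54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
          71, 71, 74, 75, 77, 79, 79, 79]

approx_s = (max(sample) - min(sample)) / 4   # shortcut: range divided by 4
actual_s = statistics.stdev(sample)          # full formula with the n - 1 divisor

print("range / 4:", approx_s)
print("sample standard deviation:", round(actual_s, 2))
```

The two numbers are of the same order of magnitude but not equal, which is all the shortcut promises.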

Shortcut Method for Calculating the Standard Deviation

Instead of using the formula for calculating the variance and

standard deviation that involves comparing each observation to the

mean, there is a shortcut method to calculating the variance and

standard deviation. This shortcut method is as follows:

1. Sum all the values in the data set.

2. Square this sum.

3. Divide this squared sum by the total number of observations, n (call this the average sum squared).

4. Square each value in the data set.

5. Sum these squared values (called the sum of squares).

6. Subtract the average sum squared from the sum of squares.

7. Divide this difference by n - 1; this is the variance.

8. Take the square root to get the standard deviation.

For example, recall the data results for Vending Machine A at the

beginning of this lesson: 1, 2, 3, 3, 4, and 5. We calculated the

variance to be 2 and the standard deviation to be 1.414. Using the

shortcut method:

1. 1 + 2 + 3 + 3 + 4 + 5 = 18

2. 18 * 18 = 324

3. 324 / 6 = 54

4. 1, 4, 9, 9, 16, and 25

5. 1 + 4 + 9 + 9 + 16 + 25 = 64

6. 64 - 54 = 10

7. 10 / 5 = 2

8. Square root of 2 equals 1.414
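The eight steps of the shortcut method translate directly into code. The following is a minimal sketch; the function name `shortcut_variance` is made up for this example.

```python
import math

def shortcut_variance(values):
    """Sample variance via the shortcut (sum of squares) method."""
    n = len(values)
    total = sum(values)                            # step 1: sum the values
    avg_sum_squared = total ** 2 / n               # steps 2-3: square the sum, divide by n
    sum_of_squares = sum(y * y for y in values)    # steps 4-5: square each value, then sum
    difference = sum_of_squares - avg_sum_squared  # step 6
    return difference / (n - 1)                    # step 7: the variance

machine_a = [1, 2, 3, 3, 4, 5]
variance = shortcut_variance(machine_a)
std_dev = math.sqrt(variance)                      # step 8: the standard deviation

print(variance, round(std_dev, 3))
```

The shortcut is convenient by hand, but in floating-point arithmetic subtracting two large, nearly equal numbers can lose precision, so software usually works from deviations about the mean instead.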

Coefficient of Variation

Above we considered three measures of variation: Range, Interquartile Range (IQR), and Variance (and its square root counterpart, Standard Deviation). These are all measures we can calculate for a single set of data. But how can we compare the dispersion (i.e. variability) of data from two or more distinct populations that have vastly different means? A popular statistic to use in such situations is the Coefficient of Variation, or CV. This is a unit-free statistic, and the higher its value, the greater the dispersion. The calculation of CV is:

CV = Standard Deviation / Mean

Example: Comparing Prices

When you compare prices of various brands, some offer price per roll while others offer price per sheet. You are interested in determining which pricing method has less variability, so you sample several of each and calculate the mean and standard deviation for the sampled items that are priced per roll, and the mean and standard deviation for the sampled items that are priced per sheet. The table below summarizes your results.

Item               Mean       Standard Deviation

Price per Roll     0.9196     0.4233

Price per Sheet    0.01134    0.00553

Comparing the standard deviations, Price per Sheet appears to have much less variability in pricing. However, the mean is also much smaller. The coefficient of variation allows us to make a relative comparison of the variability of these two pricing schemes:

$$CV_{Roll} = 0.4233 / 0.9196 = 0.46 \quad \text{and} \quad CV_{Sheet} = 0.00553 / 0.01134 = 0.49$$

Comparing the CVs, the relative variability for Price per Sheet is greater than the variability for Price per Roll.
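The two coefficients of variation can be reproduced from the summary table with a few lines of Python. The function name is illustrative, and the numbers are taken directly from the table above.

```python
def coefficient_of_variation(std_dev, mean):
    """Unit-free measure of relative dispersion: standard deviation divided by mean."""
    return std_dev / mean

cv_roll = coefficient_of_variation(0.4233, 0.9196)
cv_sheet = coefficient_of_variation(0.00553, 0.01134)

print("CV per roll: ", round(cv_roll, 2))
print("CV per sheet:", round(cv_sheet, 2))
```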

Another example to consider is hotel pricing. Think of prices

for luxury and budget hotels. Which do you think would have

the higher average cost per night? Which would have the

greater standard deviation? The CV would allow you to

compare this dispersion in costs in relative terms by

accounting for the fact that the luxury hotels would have a

greater mean and standard deviation.

Z-value, Z-score, or Z

Z-value, or sometimes referred to as Z-score or simply Z, represents

the number of standard deviations an observation is from the mean

for a set of data. To find the z-score for a particular observation we

apply the following formula:

Z = (observed value - mean) / SD

Example: Exam Scores

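For a recent final exam the mean was 68.55 with a standard deviation of 15.45.

If you scored 80: Z = (80 - 68.55) / 15.45 = 0.74, which means your score of 80 was 0.74 SD above the mean.

If you scored 60: Z = (60 - 68.55) / 15.45 = -0.55, which means your score of 60 was 0.55 SD below the mean.

Whether a positive Z-score is desirable depends on the question. For exams you would want a positive Z-score (it indicates you scored higher than the mean). However, if one was analyzing days of missed work then a negative Z-score would be more appealing, as it would indicate the person missed less than the mean number of days.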
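The exam example can be reproduced with a one-line function. This is a minimal sketch; the name `z_score` is chosen for illustration.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd

exam_mean, exam_sd = 68.55, 15.45

print(round(z_score(80, exam_mean, exam_sd), 2))  # score above the mean: positive Z
print(round(z_score(60, exam_mean, exam_sd), 2))  # score below the mean: negative Z
```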

Characteristics of Z-scores

1. The scores can be positive or negative.

2. For data that is symmetric (i.e. bell shaped) or nearly symmetric, a common application of Z-scores is identifying potential outliers: any observation with a Z-score beyond ±3.

3. The maximum possible Z-score for a set of data is $(n - 1)/\sqrt{n}$.

4. The sum of all squared Z-scores for a set of data is n - 1.
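The last two characteristics, a largest possible Z-score of (n - 1) / sqrt(n) and squared Z-scores that sum to n - 1, hold when Z-scores are computed with the sample standard deviation. The sketch below checks them on the vending machine A data and on a made-up data set in which one value differs from all the others, which is the case where the bound is reached.

```python
import math
import statistics

def z_scores(values):
    """Z-scores based on the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(y - mean) / sd for y in values]

machine_a = [1, 2, 3, 3, 4, 5]
extreme = [0, 0, 0, 0, 0, 1]          # made-up data: a single value stands apart

for data in (machine_a, extreme):
    n = len(data)
    z = z_scores(data)
    print("sum of squared Z:", round(sum(v * v for v in z), 6), "n - 1:", n - 1)
    print("largest |Z|:", round(max(abs(v) for v in z), 4),
          "bound:", round((n - 1) / math.sqrt(n), 4))
```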

2.4 - Practice Problems


1. Our statistics department surveyed a random sample of 5 staff personnel and 5 faculty on how often

during a week they used public transportation in traveling for work. The table below reflects the

responses.

Staff: 4, 2, 2, 2, 5

Faculty: 0, 1, 5, 1, 3

a. What sampling method was used to gather this data? What population of interest is best represented by

the samples?

b. Calculate by hand the mean and standard deviation for number of times a week public transportation

was used by staff and faculty.

c. Based on means and standard deviations, do you think there is a statistically significant difference

between these two means? Explain.

2. The College of Dentistry at the University of Florida has made a commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on. It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional staff will have more free time for personal consultation in student-faculty interaction. One such instructional module was developed and tested on the first 50 students proceeding through the curriculum. The following measurements represent the number of hours it took the students to complete the required modular material:

16 8 33 21 34 17 12 14 27 6

33 25 16 7 15 18 25 29 19 27

5 12 29 22 14 25 21 17 9 4

12 15 13 11 6 9 26 5 16 5

9 11 5 6 5 23 21 10 17 15

Here is a link to the data (hours.txt) for the time it took students to complete the required material.

a. Calculate by hand the five number summary for these recorded completion times. Helpful hint: you can

use software such as Excel to sort the data.

b. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.

c. Calculate the standard deviation, s, by using the approximation formula and compare that answer to the

actual standard deviation of 8.45.

d. The mean for this data set is 16. Using the actual s of 8.45, construct the intervals and check whether the

Empirical Rule applies to this data set.

Let's perform some basic operations in Minitab. Some of the examples below are repeats of what we did by hand in earlier lessons while others are new. We saw previously how you can enter data into the Minitab worksheet by hand; we will now walk through how to load a dataset into Minitab from an Excel file.

Loading Data into Minitab from an Excel File

Right click and save this Excel spreadsheet file, MinitabIntroData.xlsx. Save the file locally (if using Minitab installed on the computer you are using) or save the file in your PASS space if using WebApps.

Open Minitab, then using the Minitab menus at the top of the

application, select the option:

'File' > 'Open worksheet'.

In the Files of Type field click the drop down arrow and select

'Excel'.

In the Look In field use the drop down arrow to locate the

saved Excel data file.

Double click the file and the data should open in the Minitab

worksheet (the window that looks similar to an Excel

spreadsheet).

With the data in the Minitab worksheet you can then perform any

number of procedures. First we obtain some basic descriptive

statistics.

Descriptive Statistics

With the data from the Excel spreadsheet file loaded into your Minitab

worksheet window, you should notice that all columns are labeled

Cx where the x is a number. Some of these are followed by a -T.

Those columns with the -T indicate that the data in this column are

considered text or categorical data. Otherwise, Minitab recognizes

the data as quantitative. If the operation you conduct in Minitab

only functions on a certain variable type (e.g. calculating the mean

can only be done on quantitative data) then only columns of that

data type will be available to use for those operations.

Example: Using the Hours Data from Previous Practice

Problems

Let's use Minitab to calculate the five number summary,

mean and standard deviation for the Hours data. Minitab

by default will provide some added information.

1. At top of the Minitab window select the menu option 'Stat' >

'Basic Statistics' > 'Display Descriptive Statistics'

2. Once this dialog box opens your cursor should be blinking in the

'Variables' window. If not, simply click inside this part of the dialog

box. The only variables you should see in the left side window are

columns of quantitative data (the two price columns, age, and

hours). To enter a variable from the left hand window into the

Variables window you can either double-click that variable or click

the variable to highlight it and then click the 'Select' button. Do so

with the variable 'Hours'.

3. With the variable 'Hours' in the 'Variable' window click the 'OK'

button.

You should now find the following output in the Session window

above the worksheet.

The output should show the same values as those calculated in the practice problems. Minitab also

gives the size of the sample used to create these statistics (N), and

the number of observations from this data that were missing (N*).

These statistics are the default statistics. Additional basic

descriptive statistics are also available such as trimmed mean and

coefficient of variation (CV).
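
For readers who want to cross-check the Minitab output without the software, the default summary can be sketched in Python. This is a minimal sketch under stated assumptions, not Minitab's own implementation: the Hours values are typed in from the practice problem, only the standard library is used, and the quartiles come from `statistics.quantiles` with its default 'exclusive' method, which locates quartiles at position p(n + 1) in the sorted data. The function name `describe` is hypothetical.

```python
import statistics

# Hours data typed in from the practice problem (50 completion times).
hours = [
    16, 8, 33, 21, 34, 17, 12, 14, 27, 6,
    33, 25, 16, 7, 15, 18, 25, 29, 19, 27,
    5, 12, 29, 22, 14, 25, 21, 17, 9, 4,
    12, 15, 13, 11, 6, 9, 26, 5, 16, 5,
    9, 11, 5, 6, 5, 23, 21, 10, 17, 15,
]


def describe(values):
    """Return N, mean, sample standard deviation and the five-number summary."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "N": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
        "min": min(values),
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "max": max(values),
    }


summary = describe(hours)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Other software may place the quartiles slightly differently, so small differences in Q1 and Q3 between packages are not necessarily errors.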

To get the CV values for the Price per Sheet and Price per

Roll data, an example found in an earlier lesson (data contained

in MinitabIntroData.xlsx [1]):

1. Open Minitab and return to 'Stat' > 'Basic Statistics' > 'Display

Descriptive Statistics'.

2. Enter both variables into the Variables window. That is, both

'Price_Roll' and 'Price_Sheet' should be in the Variables window.

3. Click the 'Statistics' tab and then check the box for 'Coefficient

of Variation' (notice the other statistics available!) and click

'OK'.

4. Click 'OK' again.

The output will include the same statistics as the example above

plus the CV values, (it will be titled 'CoefVar').
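
As a rough check on what 'CoefVar' reports, the coefficient of variation can be sketched as 100 times the sample standard deviation divided by the mean. The price values below are hypothetical stand-ins, not the contents of MinitabIntroData.xlsx, and the helper name `coef_var` is an assumption made for illustration.

```python
import statistics


def coef_var(values):
    """Coefficient of variation as a percentage: 100 * s / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical prices, NOT the values stored in MinitabIntroData.xlsx.
price_roll = [0.89, 1.05, 1.29, 0.99, 1.49]
price_sheet = [0.010, 0.012, 0.017, 0.011, 0.021]

print(f"CoefVar Price_Roll:  {coef_var(price_roll):.2f}")
print(f"CoefVar Price_Sheet: {coef_var(price_sheet):.2f}")
```

Because the CV is unit-free, it lets the two price variables be compared even though one is measured per roll and the other per sheet.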

Descriptive statistics for a quantitative variable can also be calculated by levels of a categorical variable.

Example: Basic Statistics of a Quantitative Variable by the

Levels of a Categorical Variable

To see this we will use the data for 'Age' and 'Sex' from

the MinitabIntroData.xlsx [1] file.

1. Open Minitab and return to 'Stat' > 'Basic Statistics' > 'Display

Descriptive Statistics'.

2. Enter the variable 'Age' into the Variables window.

3. Click inside the By variables (optional) window. Any

categorical variables in the worksheet (e.g. 'Sex') should now

display in the box on the left.

4. Select the variable 'Sex' for the By variables window.

5. Click 'OK'.

Minitab will now display the 'Age' statistics (including CV since we

had that statistic selected from our last operation) for each

category of 'Sex'.
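
The idea behind the 'By variables' option, splitting a quantitative column by the levels of a categorical one and summarizing each piece, can be sketched as follows. The records are hypothetical and do not come from MinitabIntroData.xlsx; the names `records`, `ages_by_sex` and `group_stats` are illustrative only.

```python
import statistics
from collections import defaultdict

# Hypothetical (sex, age) records, NOT the values in MinitabIntroData.xlsx.
records = [
    ("Female", 21), ("Male", 24), ("Female", 19), ("Male", 30),
    ("Female", 23), ("Male", 22), ("Female", 25), ("Male", 20),
]

# Collect the quantitative values under each level of the categorical variable.
ages_by_sex = defaultdict(list)
for sex, age in records:
    ages_by_sex[sex].append(age)

# Summarize each group separately.
group_stats = {
    sex: {
        "N": len(ages),
        "mean": statistics.mean(ages),
        "stdev": statistics.stdev(ages),
    }
    for sex, ages in ages_by_sex.items()
}

for sex, stats in sorted(group_stats.items()):
    print(f"{sex:<6} N={stats['N']} mean={stats['mean']:.2f} stdev={stats['stdev']:.2f}")
```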

Creating graphs in Minitab is very straightforward. Graphing options

are located under the 'Graph' tab across the list of menu choices at

top of Minitab application. Click the 'Graph' menu option and a long

list of graphs will appear. On this page we will simply create a

boxplot for the Hours data and also side-by-side boxplots for 'Age'

by 'Sex' data found in MinitabIntroData.xlsx [1].

Example: Boxplot for the Hours Data

With the MinitabIntroData data file open in the Minitab

worksheet:

Select 'Graph' > 'Boxplot' > 'One Y-Simple' and click 'OK'.

Select the 'Hours' variable and move into the Graph Variables

window.

Click 'OK'

If you place your computer's mouse over the box portion of the

plot some statistics will pop up (Q1, median, Q3, IQR, the values to

which the whiskers extend, and the sample size N). If there are any

outliers, as identified by the methods outlined in the previous lesson, these

will be marked with an * in the plot.
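
To see which points a boxplot would flag, the check can be sketched in Python. This sketch assumes the common 1.5 x IQR fence rule (the earlier lesson's method is not reproduced in this section, so the rule is an assumption) and the 'exclusive' quartile method of `statistics.quantiles`. The helper name `fence_outliers` is hypothetical.

```python
import statistics

# Hours data typed in from the practice problem (50 completion times).
hours = [
    16, 8, 33, 21, 34, 17, 12, 14, 27, 6,
    33, 25, 16, 7, 15, 18, 25, 29, 19, 27,
    5, 12, 29, 22, 14, 25, 21, 17, 9, 4,
    12, 15, 13, 11, 6, 9, 26, 5, 16, 5,
    9, 11, 5, 6, 5, 23, 21, 10, 17, 15,
]


def fence_outliers(values, k=1.5):
    """Return the lower fence, upper fence and any values outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


lower, upper, outliers = fence_outliers(hours)
print(f"fences: ({lower}, {upper}); flagged values: {outliers}")
```

Under these assumptions no Hours value falls outside the fences, so no * markers would be expected on that boxplot.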

Example: Side-by-Side Boxplot for the Age by Sex Data

Again, with the MinitabIntroData data file open in the

Minitab worksheet:

Select 'Graph' > 'Boxplot' > 'One Y-With Groups' and click 'OK'.

Select the 'Age' variable and move into the 'Graph Variables'

dialog box.

Click inside the Categorical variables for grouping window.

Any categorical variables in the worksheet (e.g. Sex) should

now display in the left side box.

Select the variable 'Sex' for the Categorical variables for

grouping window.

Click 'OK'.

Minitab will now display inside one frame two boxplots: one for

Females and another for Males.

Try creating graphs for the other variables in MinitabIntroData.xlsx [1]. Use the graphs to help you explore what

the distributions of the different variables look like.

Finally let us consider how we can use the Calc function along the

menu options to easily create a new set of data.

Example: Using the Calc Function to Create New Data

First, in any empty column of the worksheet starting in row

1 for that column, type in the five data values:

1, 2, 3, 4, 5.

Just to be clear, do not include the commas or the period symbol! Each value goes in its own row, giving five

observations.

To create another column where we just add a constant of 1 to

each of these values:

1. From the menus at the top of the page, select 'Calc' > 'Calculator'.

2. In the text box for Store result in variable you can type in any

word which will serve as the variable name for our new set of

data. For this example, you can type in this text box the word

'Plus1'.

3. Double-click or highlight and click the 'Select' button to put the

variable of interest into the Expression window. In this case

column C1 contains our five new observations.

4. With your cursor active in the Expression window click the +

and then 1 on the keypad. This should create an expression

such as C1 + 1.

5. Click 'OK'.

You should now find a new column in the spreadsheet window with

the values 2, 3, 4, 5, 6.

Just for kicks, use the steps described previously to have Minitab

calculate the descriptive statistics for these two variables, the

column of entered data and the column of calculated 'Plus1' data.

You should find that the original mean was 3 for the entered data

and is 4 for the 'Plus1' data; however, the standard deviations for

both sets of data remain the same.
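
The same experiment can be sketched outside Minitab. The list names `entered` and `plus1` are illustrative; the list comprehension plays the role of the Calculator expression C1 + 1.

```python
import statistics

entered = [1, 2, 3, 4, 5]
plus1 = [x + 1 for x in entered]  # same role as the Calculator expression C1 + 1

for label, column in (("entered", entered), ("Plus1", plus1)):
    print(label, statistics.mean(column), round(statistics.stdev(column), 4))
```

Adding a constant shifts every observation by the same amount, so the mean moves by that constant while the deviations from the mean, and therefore the standard deviation, are unchanged.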

1. Draw two appropriate graphs for the following data:

University officials periodically review the distribution of

undergraduate majors within the colleges of the university to help

determine a fair allocation of resources to departments within the

colleges. At one such review, the following data were obtained

(majors.txt [1]):

College              Number of majors
Agriculture          1,500
[name missing]       11,000
Education            2,000
Engineering          5,000

HINT!: How to Create a Simple Pie Chart and Simple Bar Chart [2]

2. Draw a frequency histogram and a relative frequency histogram for

the 1994 annual per capita city tax (city_tax.txt [3]) given below:

2470, 520, 561, 488, 986, 359, 1305, 512, 467, 270, 360, 451, 4904, 572, 498, 382, 271, 634, 1682,

784, 298, 643, 947, 686

3. Draw the stem-and-leaf diagram for the following data set

(stem_leaf.txt [5]) using Minitab and use the stem and leaf diagram

to find the median of the data set:

11 11 12 13 14 14 14 14 15 15 15 16 16 17 17 19 19 20 22 25

4. Draw the stem-and-leaf diagram for the city tax data given in practice

problem 2 above and use the diagram to find the median of the

data.

5. Selecting the proper diet of brook trout or other fresh water fish

is an important aspect of fish farming. A researcher wants to

estimate the mean weight of brook trout maintained on a specific

diet for a period of 6 months. One hundred brook trout are selected

from a fisheries tank and each is weighed.

a. What is the population of measurements that is of interest to the

researcher?

b. What is the sample?

c. What characteristics of the population are of interest to the

researcher?

d. If the sample measurements are used to make inferences about

certain characteristics of the population, why is a measure of the

reliability of the information important?

6. In a packing plant, a machine packs cartons with jars. The times it

takes each machine to pack 10 cartons are recorded. The results

(machine.txt [6]) are:

New machine: 42.1 41.3 42.4 43.2 41.8 41.0 41.8 42.8 42.3 42.7

Old machine: 42.7 43.8 42.5 43.1 44.0 43.6 43.3 43.5 41.7 44.1

a. Compute the mean and standard deviation for the time to pack a

carton for each machine.

b. Plot the data for each machine.

c. Describe the data for the two machines.

7. The College of Dentistry at the University of Florida has made a

commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and

so on. It is hoped that each student will proceed at a pace

commensurate with his or her ability and that the instructional staff

will have more free time for personal consultation in student-faculty

interaction. One such instructional module was developed and tested

on the first 50 students proceeding through the curriculum. The

following measurements represent the number of hours it took the

students to complete the required modular material:

16 8 33 21 34 17 12 14 27 6

33 25 16 7 15 18 25 29 19 27

5 12 29 22 14 25 21 17 9 4

12 15 13 11 6 9 26 5 16 5

9 11 5 6 5 23 21 10 17 15

Here is a link to the data (hours.txt [7]) for the time it took students

to complete the required material.

a. Calculate the mode, the median, and the mean for these

recorded completion times.

b. Guess the value of s.

c. Compute s by using the shortcut formula and compare your

answers to that of part (b) above.

d. Do we expect the Empirical Rule to describe adequately the

variability of these data? Explain.

e. Construct the intervals and check whether the Empirical Rule

applies to this data set.

We cannot use Minitab for general discrete distributions but we can use it for binomial distributions. We

can also use Minitab for normal distributions.

Either distribution can be found in Minitab under the Calc > Probability Distributions list. The list contains

many different distributions and we will begin with binomial and normal. If you select either one, you will

be presented with a pop-up window that contains three radio buttons labeled:

Probability

Cumulative probability

Inverse cumulative probability

We use Probability to find exact probabilities: \(P(X = x)\). This

is applicable for binomial but not for normal. The reason comes

from calculus. Because the normal distribution is continuous, the area under the curve at an

exact point (i.e. \(X = x\)) is equal to zero. This result comes from using integration to find the

area under a curve and then evaluating this integral from point a to point b. At \(X = x\),

the two points of evaluation are the same, so there is zero area under an exact point.

We use Cumulative probability to find less than or equal to probabilities: \(P(X \le x)\). Remember

for discrete distributions (e.g. binomial) the use of the equality is important as there is a difference

between, for instance, saying less than 3 students and less than or equal to 3 students. On the other

hand, for continuous distributions (e.g. normal) there is no distinction; the use of the equal sign is not

relevant. The reason relates to the explanation above for probability. The equality concerns the exact

observation of a single value, for which the area under the curve is zero.

We use the Inverse cumulative probability to find the value of \(X\) that produces some cumulative

probability. For instance, in the normal distribution we would use this option when we wanted to find the

observed score for some specified percentile.

In general, we will use the first two radio buttons for binomial and the latter two, although primarily

Cumulative Probability, for normal.
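
The three options can be mimicked with a short Python sketch. It is only an illustration of the ideas above, not a description of how Minitab computes its results; the helper names `binom_pmf` and `binom_cdf` are hypothetical, and n = 10 with p = 0.5 are arbitrary illustration values.

```python
from math import comb
from statistics import NormalDist


def binom_pmf(x, n, p):
    """'Probability' option: P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """'Cumulative probability' option: P(X <= x)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


# Discrete case: "less than 3" and "less than or equal to 3" differ by P(X = 3).
n, p = 10, 0.5
less_than_3 = binom_cdf(2, n, p)
at_most_3 = binom_cdf(3, n, p)
print(less_than_3, at_most_3)

# Continuous case: cumulative and inverse cumulative undo each other.
z = NormalDist()         # standard normal: mean 0, standard deviation 1
prob = z.cdf(1.0)        # 'Cumulative probability': P(Z <= 1.0)
value = z.inv_cdf(prob)  # 'Inverse cumulative probability': recovers 1.0
print(prob, value)
```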

II.1 - Finding Binomial Distribution Probabilities with Minitab

Binomial Examples

Referring back to the FBI Crime Survey Example in the binomial lesson, we

had the probability of 0.2 that a randomly selected property crime is solved

and we had three such crimes committed. Let's use Minitab this time to find

the probability that:

1. Exactly one of the three crimes is solved

2. At least one of the three crimes is solved.

In Minitab, go to Calc > Probability distributions > Binomial

To solve part 1 we are asked to find \(P(X = 1)\), so we will select

the Probability radio button and enter the following:

Number of trials: 3

Event probability: 0.2

Click radio button for 'Input constant': 1

Then click "OK".

The result is as follows:

Reading the output we can see that the number of trials was 3, the probability of success was 0.2, and we

wanted to find \(P(X = 1) = 0.384\). [NOTE: this is what we found by hand earlier.]

To solve part 2 we are asked to find \(P(X \ge 1)\). Since the software does not find greater than

probabilities we need to use the complement rule. This leads us to find \(P(X \ge 1)\) by

\(1 - P(X < 1) = 1 - P(X = 0)\).

This time we will select again the probability radio button and enter the following:

Number of trials: 3

Event probability: 0.2

Click radio button for 'Input constant': 0

Then click "OK".

The result is:

Reading the output we can see that the number of trials was 3, the probability of success was 0.2, and we

wanted to find \(P(X = 0) = 0.512\). Subtracting this probability from 1 we have our answer

to \(P(X \ge 1)\), which is 0.488.

NOTE: The usual mistakes students make are to not set the problem up correctly (e.g. use Probability

when should be Cumulative and vice-versa), incorrectly include the equality when using the complement,

or simply forget to subtract from one when necessary.
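
The two calculations in this example, \(P(X = 1)\) and the complement \(1 - P(X = 0)\), can be double-checked with a few lines of Python. This is a sketch using the binomial formula directly; `binom_pmf` is a hypothetical helper name.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 3, 0.2  # three crimes, each solved with probability 0.2

p_exactly_one = binom_pmf(1, n, p)
p_none = binom_pmf(0, n, p)
p_at_least_one = 1 - p_none  # complement rule

print(round(p_exactly_one, 3), round(p_none, 3), round(p_at_least_one, 3))
```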

Practice Problems

1. You are given a 25 question exam for which you are not prepared (i.e. you will be guessing).

Each question involves True/False answer choices where only one choice is correct.

First, is this a binomial situation?

Does it have a fixed number of trials? YES - 25.

Two outcomes, success and failure? YES - right and wrong.

Equal chances of success? YES - 0.5

Is each trial independent? YES, we will assume that how you answer one question does not affect your

answer to another question.

OK, now that those assumptions have been met, let's use Minitab to answer the following

questions:

A) What is the probability that you get exactly 20 correct?

So, this is solving for \(P(X = 20)\). We will select the Probability radio button and enter

25 for number of trials,

0.5 for event probability, and

20 as the input constant.

The output looks like:

The answer is 0.0015834, meaning that you have roughly a 0.0016 (or about 1.6 in 1,000) chance of

getting exactly 20 right in such a situation.

B) What is the probability that you get at least 20 correct?

As we saw above, we need to solve this differently. We will solve for \(P(X \ge 20) = 1 - P(X \le 19)\). We will select the Cumulative

probability radio button and enter:

25 for number of trials,

0.5 for event probability, and

19 as the input constant.

The output looks like:

Subtracting the resulting cumulative probability from 1, you have a

slightly better chance of answering at least 20 questions correctly in such a situation

than you did of answering exactly 20: roughly a 2 in 1,000 chance.
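
Parts A and B can be reproduced with a short sketch that applies the binomial formula for the exact probability and the complement of a cumulative probability for "at least 20". The helper names are hypothetical.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


n, p = 25, 0.5  # 25 True/False questions answered by guessing

p_exactly_20 = binom_pmf(20, n, p)
p_at_least_20 = 1 - binom_cdf(19, n, p)  # complement of P(X <= 19)

print(f"P(X = 20)  = {p_exactly_20:.7f}")
print(f"P(X >= 20) = {p_at_least_20:.7f}")
```

Note that the complement uses 19, not 20, as the cutoff: subtracting \(P(X \le 20)\) would wrongly drop the outcome of exactly 20 correct.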

II.2 - Finding Normal Distribution Probabilities with Minitab

Normal Examples

Here we will use Minitab to find the probabilities for two of the problems from the practice examples we

saw earlier.

To find the probabilities associated with normal distributions in Minitab, go to Calc > Probability

distributions > Normal. The default is set up to that of a standard normal (i.e. we have a z-score) where the

mean is 0 and the standard deviation is 1.

Example: Intelligence Scores for Children

1. The scores of a reference population on the Wechsler Intelligence Scale for Children (WISC)

are normally distributed with \(\mu = 100\) and \(\sigma = 15\). Our question is, "What score will

separate a child in the top 5% of the population from the bottom 95%? What do we call this

value?"

To solve this question we are asked to find the \(x\) for which \(P(X \le x) = 0.95\). That is, we want to find

the score that would result in a child falling at the 95th percentile. (NOTE: this would be

equivalent to finding \(P(X \ge x) = 0.05\), or being in the top 5%).

In Minitab go to Calc > Probability distributions > Normal and select the radio

button for Inverse cumulative probability since we want to find the observed

score associated with a given cumulative probability: 0.95.

For mean enter 100 and for standard deviation enter 15. Click the radio button for

'Input Constant' and enter the cumulative probability of 0.95 and then click 'OK'. The result is as

follows:

As we see, this result is very near that which we found by hand earlier: 124.6.

The interpretation is that an IQ score of about 125 is needed for that child to fall in the top 5%.

2. A class has 16 children and they are from the reference population in problem 1 above. One

child is randomly picked from the class. What is the probability that the IQ of the child is more

than 110?

To solve this question we are asked to find \(P(X > 110)\). This means we will have to

solve using the complement, or \(1 - P(X \le 110)\). But remember, this equality is not

relevant in regards to finding the probability since we are talking about a continuous

distribution!

In Minitab go to Calc > Probability distributions > Normal and select the radio

button for Cumulative probability and enter the following:

For mean enter 100 and for standard deviation enter 15. Click the radio button for

Input Constant and enter 110 and click OK. The result is as follows:

We then take this \(P(X \le 110) = 0.747507\) and subtract from one to get our

final probability of 0.252493, which resembles our final result using the standard normal table:

0.2514.

The difference we are finding here is from the rounding we did to get the z-score to use the

table. The interpretation is that we have about a 25% chance of randomly selecting a child with

an IQ of at least 110.
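
Both WISC questions can be checked with Python's `statistics.NormalDist`, which offers a cumulative probability (`cdf`) and an inverse cumulative probability (`inv_cdf`). This is a sketch for verification only; the variable names are illustrative.

```python
from statistics import NormalDist

wisc = NormalDist(mu=100, sigma=15)  # reference population for WISC scores

# Problem 1: the score with 95% of the population at or below it.
cutoff_95 = wisc.inv_cdf(0.95)

# Problem 2: P(X > 110) via the complement of the cumulative probability.
p_above_110 = 1 - wisc.cdf(110)

print(f"95th percentile: {cutoff_95:.2f}")
print(f"P(X > 110):      {p_above_110:.6f}")
```

Because the software works with the unrounded z-score, its answers differ slightly from those read off a printed table.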

II.3 - Finding Sampling Distributions in Minitab

Sampling Distributions

For sampling distributions we again will focus on using the normal distribution. The key distinction will

be that instead of inputting the standard deviation we will use the standard error. Again, to illustrate we

will use two problems from the sampling distribution practice problems.

Example: JCrew

The company JCrew advertises that 95% of its online orders ship within two working days. You

select a random sample of 200 of the 10,000 orders received over the past month to audit. The

audit reveals that 180 of these orders shipped on time. If JCrew really ships 95% of its orders on

time, what is the probability that the proportion in a random sample of 200 orders is as small as or

smaller than the proportion in the audit?

We already verified that the sample proportion meets the conditions needed in order to apply

normal approximation methods. Once this is verified, the question asks us to find the probability

of getting a sample proportion of 0.9 or less if the true 'ship on time' proportion is 0.95. Recall,

we already had calculated a 0.015 standard error.

Using Minitab, we will again go to Calc > Probability distributions > Normal.

We will select the radio button for Cumulative probability and enter the following:

For mean enter 0.95 and for standard deviation enter 0.015.

Click the radio button for Input Constant and enter 0.9

Then click OK. The result is as follows:

Remember, when we used the standard normal table the best we could do was a probability less

than 0.001, which Minitab has verified with the smaller value 0.0004291. When we did this by

hand we came up with a z-score of -3.33, which was not on the table, so we used -3.09.

Alternatively, we could have used this z-score to find our answer by using the Minitab default

values, which as we mentioned at the beginning are in standard normal format. If one has the z-score, one simply needs to plug in the z-score as the input constant (but remember to change the

mean to 0 and the standard deviation to 1).
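
The JCrew calculation can be sketched two equivalent ways: with the sampling distribution's own mean and standard error, or with the z-score on the standard normal. The sketch uses the rounded standard error of 0.015 quoted in the text; the unrounded value is computed only for comparison. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, n = 0.95, 200    # claimed proportion and sample size
p_hat = 0.9          # 180 of 200 orders shipped on time

se_exact = sqrt(p0 * (1 - p0) / n)  # about 0.0154 before rounding
se = 0.015                          # rounded standard error used in the text

# Route 1: cumulative probability with mean p0 and standard deviation se.
prob = NormalDist(mu=p0, sigma=se).cdf(p_hat)

# Route 2: convert to a z-score and use the standard normal defaults.
z = (p_hat - p0) / se
prob_from_z = NormalDist().cdf(z)

print(f"z = {z:.2f}, P = {prob:.7f}, via z-score = {prob_from_z:.7f}")
```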

Example: Tire Lifetime

Penn State Fleet which operates and manages car rentals for Penn

State employees found that the tire lifetime for their vehicles has a

mean of 50,000 miles and standard deviation of 3500 miles. What

is the probability that the sample mean lifetime for a sample of 50

vehicles exceeds 52,000 miles?

We already verified that the sample mean meets the conditions

needed in order to apply normal approximation methods. Once this

is verified, the question asks us to find the probability of getting a

sample mean greater than 52,000 miles if the true mean tire lifetime is 50,000 miles. Recall we

already had calculated 495 for the standard error.

Using Minitab, again go to Calc > Probability distributions > Normal. Select the

radio button for Cumulative probability and enter the following:

For mean 50000 and for standard deviation enter 495.

Click the radio button for 'Input Constant' and enter 52000

Then click OK. The result is as follows:

Not done yet! The problem asks for greater than 52000 and what we have is less than 52000.

Therefore we need one final step of subtracting this probability from one. The probability of

getting a sample mean greater than 52000 from a sample of 50 vehicles would be 0.000027, which agrees with

our result when using the table.

Remember, when we used the standard normal table the best we could do was a probability less

than 0.001, which Minitab has verified as less. When we did this by hand we came up with a z-score of 4.04, which was not on the table, so we used 3.09. Alternatively, we could have used this

z-score to find our answer by using the Minitab default values, which as we mentioned at the

beginning are in standard normal format. If one has the z-score, one simply needs to plug in the

z-score as the input constant (but remember to change the mean to 0 and the standard deviation

to 1).
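
The tire lifetime calculation, including the final subtraction from one, can be sketched as follows. The standard error is computed from the stated standard deviation and sample size rather than typed in as 495; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50_000, 3_500, 50   # population mean, population sd, sample size

se = sigma / sqrt(n)               # standard error of the sample mean, about 495
sampling_dist = NormalDist(mu=mu, sigma=se)

p_below = sampling_dist.cdf(52_000)  # what the Cumulative probability option returns
p_exceeds = 1 - p_below              # the "greater than" probability the problem asks for

print(f"SE = {se:.1f}, P(sample mean > 52000) = {p_exceeds:.6f}")
```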

This lesson starts with the basic concept of using confidence intervals to understand and perform

inference. We then talk about how to find confidence intervals for one population proportion. The

important issue of determining the required sample size to estimate a population proportion will also be

discussed in detail in this lesson.

Estimating the population mean is one of the most common and important questions one comes across in

practice. In this lesson, we will also talk about the confidence interval for a population mean when the

population standard deviation is unknown. We will also explain how to determine the number of

observations to be included in the sample.

Lesson 6 Objectives

Upon successful completion of this lesson, you will be able to:

understand the reason for estimating with confidence intervals.

calculate confidence intervals for population proportions.

interpret a confidence interval.

know the meaning of margin of error and its use.

compute sample sizes for different experimental settings.

know when and how to use t-interval to estimate the population

mean.

compute sample sizes for estimating the population mean.

1. Stat > Basic Statistics > 1 proportion.

2. From the drop down box select the Summarized data option. (If you have

the raw data you would use the default drop down of One or more samples, each in

a column.)

3. Enter the number of successes in the Number of Events text box, and the sample size in the

Number of Trials text box.

4. Click the Options button. The default confidence level is 95. If you desire another confidence

level, edit appropriately.

5. To use the z-interval method choose Normal Approximation from the Method text box. The

exact interval is always appropriate and is the default. Under the conditions

that \(n\hat{p} \ge 5\) and \(n(1 - \hat{p}) \ge 5\), one can also use the z-interval to approximate the

answers. The exact interval and the z-interval should be very similar when the conditions are

satisfied.

6. Click OK and OK again.

Example: Presidential Approval Rating

Referring to our presidential approval rating example at the beginning

of this lesson, we will use Minitab to verify our by-hand results.

Recall in that example a random sample of 1500 was taken from the

population of U.S. adults, with 660 responding with a positive approval. In Minitab and

following the steps above, we would enter 660 for the Number of Events and 1500 for the

Number of Trials. The confidence level was 95% and we satisfied the necessary conditions to

use the Normal Approximation (or z-interval) method. The results are:

Test and CI for One Proportion

Sample    X     N  Sample p  95% CI
1       660  1500  0.440000  (0.414880, 0.465120)

Using the normal approximation.

These results closely match our by-hand interval of 0.415 to 0.465

What if we had calculated the exact confidence interval (i.e. did not choose Normal

Approximation as the method)? With the exact method the interval is (0.414685, 0.465550).

Consistent to three decimal places in this case. You will notice that in the output Minitab does

provide a notification that the normal approximation was used.
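
The normal-approximation interval for the approval rating example can be reproduced with a short sketch. It assumes the usual z-interval formula, sample proportion plus or minus z* times the estimated standard error; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(successes, n, confidence=0.95):
    """Normal-approximation (z) interval for one population proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. about 1.96 for 95%
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = z_interval(660, 1500)
print(f"({low:.6f}, {high:.6f})")
```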

How different can the normal approximation and exact intervals be when conditions are not

satisfied? Consider this example: In estimating the proportion of premature babies born at the local

hospital, a random sample of 10 newborn babies was taken in which 3 were born prematurely. Find a 90%

confidence interval for the true proportion of premature babies born at the hospital.

As we can see, the conditions to use the normal approximation method are not satisfied; we only have 3

successes and we need at least 5. If we used normal approximation methods (note that we are constructing

90% intervals now), Minitab produces an interval of (0.061638, 0.538362). If the exact method were

used, the interval is (0.087264, 0.606624). The intervals are not nearly as close as in the first example.

Also note how wide the intervals are in this second example. This is a direct result of the small sample

size. The smaller n produces a much larger standard error, which increases the width of an interval.
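
To see where the two premature-birth intervals come from, here is a sketch that computes both. It assumes the 'exact' method corresponds to the Clopper-Pearson interval, found here by bisection on the binomial cumulative probability; that correspondence, and all function names, are assumptions made for illustration rather than a description of Minitab's internals.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def bisect_root(f, target, increasing, steps=60):
    """Find p in [0, 1] with f(p) = target, for f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies to the left of mid
    return (lo + hi) / 2


def exact_interval(x, n, confidence):
    """Clopper-Pearson interval (assumed to match the 'exact' method)."""
    alpha = 1 - confidence
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper


def z_interval(x, n, confidence):
    """Normal-approximation interval for one proportion."""
    p_hat = x / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


approx = z_interval(3, 10, 0.90)
exact = exact_interval(3, 10, 0.90)
print("normal approximation:", tuple(round(v, 6) for v in approx))
print("exact:               ", tuple(round(v, 6) for v in exact))
```

The normal-approximation interval is symmetric about the sample proportion of 0.3, while the exact interval is not, which is one reason the two disagree when n is small.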

Minitab Commands to Find the Confidence Interval for a Population Mean (sigma

unknown)

1. Stat > Basic Statistics > 1-Sample t.

2. From the drop down box select the Summarized data option. (If you have

the raw data you would use the default drop down of One or more samples, each in a column.)

3. Enter the sample size, sample mean, and sample standard deviation in their respective text

boxes.

4. Click the Options button. The default confidence level is 95. If you desire another confidence

level, edit appropriately.

5. Click OK and OK again.

Example: Emergency Room Wait Time

Referring to our prior example of average emergency room wait

time from our discussion on confidence intervals for a population

mean, our by-hand calculations produced a 95% confidence

interval of 24.28 to 35.72 minutes. Recall the following for that

example: sample size 50, sample mean 30, and sample standard

deviation 20. In Minitab following the above steps, we get a 95%

confidence interval:

One-Sample T

 N   Mean  StDev  SE Mean  95% CI
50  30.00  20.00     2.83  (24.32, 35.68)

The slight discrepancy between the estimates is due to our by-hand calculation using the t-value

associated with 40 degrees of freedom since the table did not include a d.f. of 49. Minitab used

a t-value for the actual 49 degrees of freedom. With the larger degrees of freedom comes a

smaller t-value. This would result in a smaller margin of error and a narrower interval, which is precisely what we have here.
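
The effect of the degrees of freedom on the interval can be sketched with the summary statistics from this example. Python's standard library has no t distribution, so the critical values are supplied by hand: 2.021 is the usual table value for 40 d.f., and 2.0096 is an assumed approximate value for 49 d.f. The function name `t_interval` is hypothetical.

```python
from math import sqrt


def t_interval(n, mean, stdev, t_star):
    """t-interval from summary statistics; t_star is supplied by the caller."""
    se = stdev / sqrt(n)     # standard error of the mean
    margin = t_star * se     # margin of error
    return mean - margin, mean + margin


# By-hand version: table value for 40 degrees of freedom.
by_hand = t_interval(50, 30, 20, t_star=2.021)

# Software-style version: approximate critical value for 49 d.f. (assumed).
software = t_interval(50, 30, 20, t_star=2.0096)

print("40 d.f.:", tuple(round(v, 2) for v in by_hand))
print("49 d.f.:", tuple(round(v, 2) for v in software))
```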

Using Minitab to Check Normality

This Minitab process was presented in the lesson for finding confidence intervals

for a population mean. It is repeated here for convenience. For small sample size,

if the distribution is not normal or if there are outliers, then one needs to use other

procedures such as nonparametric methods. Thus, if sample size is less than 30, one needs to

use a normal probability plot to check whether the sample may come from a normal distribution

and then follow the above guideline to determine whether one can use the t-interval.

1. Graph > Probability Plot > Simple (note: if we have summarized data only we cannot plot the

data!)

2. Select the column that contains the data you want to graph.

3. Click OK.

Example: Rattlesnake Lengths

It is very time consuming to find rattlesnakes and nerve racking to

measure them. A scientist randomly finds 12 snakes from the

Central Pennsylvania area and measures their length (snakes.txt).

The following twelve measurements in inches are obtained:

40.2 41.0 41.6 43.1 44.9 42.8

The sample size is only 12. Let's do a normal probability plot to check whether the data may

come from a normal distribution. What do you conclude about whether they may come from a

normal distribution?

Since the points all fall within the confidence limits, there is no evidence to suggest that the data

do not come from a normal distribution.

6.7 - Practice Problems

1. Many individuals over the age of 40 develop intolerance for milk and milk-based products. A dairy has

developed a line of lactose-free products that are more tolerable to such individuals. To assess the potential

market for these products, the dairy commissioned a market research study of individuals over 40 years

old in its sales area. A random sample of 250 individuals showed that 86 of them suffer from milk

intolerance. Based on the sample results, calculate a 90% confidence interval for the population proportion

that suffers milk intolerance. Interpret this confidence interval.

a) First, show that it is okay to use the 1-proportion z-interval.

b) Calculate by hand a 90% confidence interval.

c) Provide an interpretation of your confidence interval.

d) If the level of confidence was 95% instead of 90%, would the resulting interval be narrower or wider?

Explain.

e) If the researchers were interested in a 90% interval with a 3% margin of error, what size sample would

they require assuming sample costs are high and the response rate is 80%.

f) Verify your 90% confidence interval in Minitab.

2. Consumer Reports tested 15 brands of vanilla yogurt and found the following numbers of calories per

serving: 160, 200, 220, 230, 120, 180, 140, 130, 170, 180, 80, 120, 100, 170, 190, (yogurt.txt). The

sample statistics were 159.3 for the sample mean and 43.5 for the standard deviation.

a) By hand, place a 99% confidence interval on the average number of calories per serving for vanilla

yogurt.

b) Provide an interpretation of your interval.

c) Use Minitab to find the interval and to check the assumption of normality. Is the assumption satisfied?

Explain.

Unit Summary

Conducting a One-Proportion Z-test in Minitab

Conducting a One-Mean t-test in Minitab

Finding Exact Critical Value for a One-Mean t-test in Minitab

Note about Software

In general, as we will learn, software usually performs tests using the p-value method. That is, the output

from software will provide the test statistic and the p-value, along with some other general information

e.g. a confidence interval. To perform rejection region tests you would need to find the critical values

from the tables. However, at the end of this lesson we demonstrate how to find the exact critical

value from the t distribution when the degrees of freedom do not appear in the table.

Conducting a One-Proportion Z-test

Note: these steps are very similar to those for one-proportion confidence interval.

The differences occur in steps 4 and 5b.

1. Click Stat > Basic Stat > 1 Proportion.

2. In the drop-down box use "One or more samples, each in a column" if you have the raw data,

otherwise select "Summarized data" if you only have the sample statistics.

3. If using the raw data enter the column of interest into the blank variable window below the drop

down selection. If using summarized data enter the number of successes for "Events" and the

sample size for "Trials".

4. Click the check box for "Perform hypothesis test" and enter the null hypothesis value.

5. Click Options.

a. Enter the confidence level associated with alpha (e.g. 95% for alpha of 5%).

b. From the drop-down list for "Alternative hypothesis" select the correct alternative.

c. If conditions are satisfied to perform a z-test for one proportion, select "Normal approximation"

from the "Method" field.

6. Click OK and OK.

Example: Penn State Students from Pennsylvania

Recall our one-proportion example at the beginning of this lesson

on whether a majority of Penn State students are from

Pennsylvania. In that example, we took a random sample of 500

Penn State students and found that 278 are from Pennsylvania. Can

we conclude that the proportion is larger than 0.5 at a 5% level of

significance? Also recall that in that example we found by hand a test

statistic of Z* = 2.504 and a p-value of 0.0062.

Our hypotheses were: H0: p = 0.5 and Ha: p > 0.5

Using Minitab, we would select Stat > Basic Stat > 1 Proportion. Choose the summarized data

option and enter 278 for "Events" and 500 for "Trials". Check the box for Perform

Hypothesis Test and enter the null value of 0.5. Click Options. With our stated alpha value of

5% we keep the default confidence level of 95. Select "Proportion > hypothesized proportion"

from the Alternative Hypothesis list. Since we verified that the conditions were satisfied, select

Normal Approximation under Method. Click OK and OK again. The output is:

Test and CI for One Proportion

Test of p = 0.5 vs p > 0.5

Sample    X    N  Sample p  95% Lower Bound  Z-Value  P-Value
1       278  500  0.556000         0.519451     2.50    0.006

Using the normal approximation.

As the output indicates, our by-hand calculations were very accurate!
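
The same numbers can be reproduced outside Minitab. What follows is a minimal Python sketch, not part of the original lesson, that uses only the standard library. The function name one_proportion_z_test is a hypothetical helper introduced for illustration. It assumes, as the by-hand calculation does, that the test statistic uses the null proportion in its standard error, while the one-sided lower bound uses the sample proportion.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(events, trials, p0, confidence=0.95):
    """Right-tailed one-proportion z-test using the normal approximation.

    Hypothetical helper for illustration; returns the sample proportion,
    the z statistic, the p-value, and the one-sided lower confidence bound.
    """
    p_hat = events / trials
    # Standard error under H0 uses the hypothesized proportion p0.
    se_null = sqrt(p0 * (1 - p0) / trials)
    z = (p_hat - p0) / se_null
    # Right-tailed p-value: area to the right of z under the standard normal.
    p_value = 1 - NormalDist().cdf(z)
    # The lower bound uses the sample proportion in the standard error.
    se_hat = sqrt(p_hat * (1 - p_hat) / trials)
    lower = p_hat - NormalDist().inv_cdf(confidence) * se_hat
    return p_hat, z, p_value, lower

# Penn State example: 278 of 500 sampled students are from Pennsylvania.
p_hat, z, p_value, lower = one_proportion_z_test(278, 500, 0.5)
print(f"sample p = {p_hat:.3f}, z = {z:.2f}, "
      f"p-value = {p_value:.3f}, lower bound = {lower:.4f}")
```

The printed z statistic, p-value, and lower bound should agree with the Z-Value, P-Value, and 95% Lower Bound columns of the Minitab output, up to rounding.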

Conducting a One-Mean t-test

Note that these steps are very similar to those for one-mean confidence interval.

The differences occur in steps 4 and 5b.

1. Click Stat > Basic Stat > 1 Sample t.

2. In the drop-down box use "One or more samples, each in a column" if you have the raw data,

otherwise select "Summarized data" if you only have the sample statistics.

3. If using the raw data enter the column of interest into the blank variable window below the drop

down selection. If using summarized data enter the sample size, sample mean, and sample

standard deviation in their respective fields.

4. Click the check box for "Perform hypothesis test" and enter the null hypothesis value.

5. Click Options.

a. Enter the confidence level associated with alpha (e.g. 95% for alpha of 5%).

b. From the drop-down list for "Alternative hypothesis" select the correct alternative.

6. Click OK and OK.

Example: Emergency Room Wait Time

Recall our emergency room wait time example where an

administrator at your local hospital states that on weekends the

average wait time for emergency room visits is 10 minutes. In

our random sample of 40 patients, the average wait time was

11 minutes with a standard deviation of 3 minutes. We

conducted the test at a 5% level of significance and

wanted to demonstrate that the average time exceeded 10 minutes. Also recall that in that example we

found by hand a test statistic of t* = 2.11 and a p-value between 0.01 and 0.025.

Using Minitab, we would select Stat > Basic Stat > 1 Sample t. Choose the summarized data

option and enter 40 for "Sample size", 11 for the "Sample mean", and 3 for the "Standard

deviation". Check the box for Perform Hypothesis Test and enter the null value of 10. Click

Options. With our stated alpha value of 5% we keep the default confidence level of 95. Select

"Mean > hypothesized mean" from the Alternative Hypothesis list. Click OK and OK again.

The output is:

One-Sample T

Test of μ = 10 vs > 10

 N    Mean  StDev  SE Mean  95% Lower Bound     T      P
40  11.000  3.000    0.474           10.201  2.11  0.021

Again, as the output indicates, our hand calculations were quite good. Notice that Minitab

provides a more exact p-value of 0.021 which corresponds to our results as it falls within our

calculated range of 0.01 to 0.025.
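
As a cross-check on the Minitab output, the test statistic and p-value can be computed from the summary statistics. The sketch below is an illustration only and is not how Minitab computes its results: it uses the Python standard library and approximates the upper-tail area of the t distribution with Simpson's rule. The helper names t_pdf and t_upper_tail are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 with Simpson's rule on [0, t].

    steps must be even. The area to the right of 0 is 0.5 by symmetry,
    so the tail area is 0.5 minus the area between 0 and t.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Emergency room example: n = 40, sample mean 11, sample SD 3, null mean 10.
n, mean, sd, mu0 = 40, 11.0, 3.0, 10.0
se = sd / sqrt(n)                      # standard error of the mean
t_stat = (mean - mu0) / se             # one-sample t statistic
p_value = t_upper_tail(t_stat, n - 1)  # right-tailed test, df = n - 1
print(f"SE mean = {se:.3f}, t = {t_stat:.2f}, p-value = {p_value:.3f}")
```

The results should match the SE Mean, T, and P columns above, and the p-value falls inside the 0.01 to 0.025 range found from the table.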

Finding Exact Critical Value for a One-Mean t-test

Since the t-table is not as detailed as the z-table, we can only estimate the critical

value when the degrees of freedom are not found on the table. To obtain

the exact critical value needed for the rejection region approach, we

can use a statistical package such as Minitab.

Minitab commands to obtain critical value:

1. Calc > Probability Distributions > t-distribution

2. Choose the radio button for Inverse Cumulative Distribution (this finds the t-value

that produces the entered probability to the left of it).

3. Enter the correct degrees of freedom

4. Choose the radio button for "Input constant" and enter the alpha value (if one-sided

alternative) or alpha/2 (if two-sided alternative).

5. Click OK.

Example: Emergency Room Wait Time

Find the exact critical value for our emergency room example. Recall that, working by hand, we had to

use the row with 35 degrees of freedom instead of the correct df of 39. In that example our

critical value for alpha of 5% was 1.69.

Go to Calc > Probability Distributions > t-distribution. Choose the radio button for Inverse

Cumulative Distribution. Enter 39 for degrees of freedom. Click the radio button for Input

Constant and enter 0.05. The output is as follows:

Inverse Cumulative Distribution Function

Student's t distribution with 39 DF

P( X ≤ x )         x
      0.05  -1.68488

This is where you need to be a little careful. Remember that our alternative was "greater than"

or a right-tailed test. The output is the critical value for a left-tailed test. However, since the

t-distribution is symmetrical, the area to the left of -1.68488 would be the same as the area to the

right of 1.68488. Therefore, the critical value for our test with 39 degrees of freedom would be

1.68488, which doesn't differ much from the 1.69 we estimated using 35 degrees of freedom.

This is why the table skips going one by one after 30; there is little difference between the values

when increasing by only one degree of freedom.
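
The inverse cumulative lookup and the sign flip for a right-tailed test can also be sketched in code. The following is a minimal standard-library Python illustration, not Minitab's algorithm: it integrates the t density with Simpson's rule and finds the quantile by bisection. The helper names t_pdf, t_cdf, and t_inverse_cdf are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """Approximate P(T <= x) with Simpson's rule on [0, |x|] (steps even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3  # area between 0 and |x|
    return 0.5 + half if x >= 0 else 0.5 - half

def t_inverse_cdf(prob, df, lo=-50.0, hi=50.0, tol=1e-7):
    """Find x with P(T <= x) = prob by bisection on the increasing CDF."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Emergency room example: alpha = 0.05, df = 39.
left = t_inverse_cdf(0.05, 39)  # value with 5% of the area to its left
right = -left                   # by symmetry, 5% of the area to its right
print(f"left-tail value = {left:.3f}, right-tail critical value = {right:.3f}")
```

For a two-sided alternative, the same call with alpha/2 in place of alpha gives the lower critical value, and its negative gives the upper one.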

7.7 - Practice Problems

Practice Problems:

1. The benign mucosal cyst is the most common lesion of a pair of sinuses in the upper jawbone. In a

random sample of 800 males, 35 persons were observed to have a benign mucosal cyst.

a. Would it be appropriate to use a normal approximation in conducting a statistical test of the null

hypothesis H0: p ≥ 0.096 (the highest incidence in previous studies among males)? Explain.

b. Conduct a statistical test of the research hypothesis Ha: p < 0.096 by computing the p-value

manually and drawing a conclusion using the p-value approach at a 1% Type I error rate.

c. What is the rejection region for this test?

d. Use Minitab to verify your results.

2. Some mushrooms were found in a forest. You do not know much about whether those are poisonous.

There are two hypotheses:

The mushrooms are poisonous and cannot be eaten

The mushrooms are not poisonous and can be eaten

How will you set up the hypotheses? Give a brief explanation.

3. A dealer in recycled paper places empty trailers at various sites. The trailers are gradually filled by

individuals who bring in old newspapers and magazines, and are picked up on several schedules. One such

schedule involves pickup every second week. This schedule is desirable if the average amount of recycled

paper is more than 1,600 cubic feet per 2-week period. The dealer's records for eighteen 2-week periods

show the following volumes (in cubic feet) at a particular site (recycled_paper.txt) where x̄ = 1,718.3

and s = 137.8.

a. By hand, compute a 95% confidence interval for μ and provide an interpretation of your interval.

b. Is there strong evidence that μ is greater than 1,600? Conduct the test by hand using the p-value

approach with a 5% level of significance.

c. Use Minitab to verify your results in parts a and b and to check for normality of data.

4. The undergraduate GPAs of 18 students selected from a large MBA class of 800 students are recorded. The data

are given as (mba_student_gpa.txt).

Use the data in the file above and Minitab to test the research hypothesis that the average undergraduate

GPA of the MBA class differs from 3.5. Use the p-value approach to perform the test at the default level

of significance. Remember to check normality of the data.
