
MODULE I

DESCRIPTIVE STATISTICS

Module I Plan

Module I Descriptive Statistics

Population and Sample


Type of Data: Discrete and Continuous
Location and Dispersion statistics
Histogram
Basic Distribution and Central Limit Theorem
Boxplot
Scatterplot and Correlation

Module I Objectives

At the end of this module, participants should:

Know the consequences of using a sample when analysing data
Differentiate types of data
Understand the features of various location and dispersion statistics
Use various graphs in order to explore a data set and unveil important questions
Have basic knowledge of Minitab and know how to use it to conduct descriptive statistics analyses

Population and Sample

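Statistics

Statistics is the science which studies and analyzes populations.
Such a study often uses samples.

[Diagram: a POPULATION of N = 15 individuals. Sampling draws a SAMPLE of n = 4 individuals from it; inference carries conclusions from the sample back to the population.]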

Exercise

Let's start with a few observations:

Measure your head size (in cm).
Sort your pennies by year.

These data will help us take a closer look at the distribution of Noranda employees' head sizes and of the years shown on Canadian pennies.

Population

A population is all of the elements being studied. The total number of elements in a population is represented by N.
The populations in the previous examples are:
All of Noranda's employees;
All of the Canadian pennies in this room.

N = 11

Individual

An individual is a member of the population.


Individuals used in the previous examples are:
A Noranda employee;
A Canadian penny.

Variable

A variable is a feature of each individual of the population that is being measured.
The variables in the previous examples are:
The head size;
The year written on the coin.

In general...
We want to get information on an entire population for a given variable,
but
the number of individuals in the population is too high,
and thus
it becomes tedious, costly, or impossible to measure the entire population.
Therefore
we measure a subset of individuals, called a SAMPLE.

Sample

A sample is a subset of individuals taken from the population. The total number of members in the sample is represented by n (sample size).
The samples in the previous examples are:
Noranda employees taking this course.
n = ???
Canadian pennies brought by participants.

n=3

Sampling and Inference


Sampling: method used to draw a sample at random from the population.

Inference: method that uses the information obtained from studying the sample to draw conclusions about the population.

We must point out here that if the sampling is not performed correctly, generalizing the results to the whole population does not make sense. For statistical inference to be valid, certain sampling rules must be followed.
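As an illustration of drawing a sample at random, here is a minimal Python sketch. The population of 15 numbered individuals and the sample size of 4 are assumptions chosen to match the N = 15 and n = 4 shown earlier; the seed is arbitrary and only makes the draw reproducible.

```python
import random

# Hypothetical population of N = 15 individuals, labelled 1 to 15.
population = list(range(1, 16))

rng = random.Random(42)  # arbitrary fixed seed so the draw is reproducible

# Simple random sampling without replacement: every individual has the
# same chance of being selected, and nobody is selected twice.
sample = rng.sample(population, k=4)

print("N =", len(population))
print("n =", len(sample))
print("sample:", sorted(sample))
```

Simple random sampling is only one possible sampling rule; it is shown here because it is the easiest to sketch.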

Sampling and Inference

Sampling theory comes into play before and after the data are collected, in order to minimize the risk of error in our estimates.
Targeted population versus real population
Example: homeless people and the telephone.

Sample chosen at random, but following certain rules
Example: the Charlottetown referendum.

We want to get the sample that is most representative of the population about which we want to draw inferences.

Sampling and Inference


Considerations

Let's say an engineer would like to test the effect of a new product on the performance of a machine. He could collect 10 samples during one day (during which he uses the new product). In light of these results, he would then conclude that the product is effective.

Please comment on this engineer's approach.

Population and sample


Practical tips

Even if we follow the rules of sampling theory, we must be aware that:
Some difference between the population and the sample is inevitable.
There is a risk of error in conclusions drawn about the population.

If there is wide variability in the population, then statistics calculated from samples (e.g. the average) are not very accurate.
Statistics calculated from a sample have a certain degree of accuracy.

Population and sample


Summary

The goal of statistics is to collect information on a population, or to draw conclusions about this population based on the study and analysis of a sample.

The sample we obtain will allow us to get ESTIMATES of the true values of the population.

We must use a good sampling strategy if we wish to draw valid conclusions for the entire population.

Discrete or Continuous Data


Distinguishing between qualitative and
quantitative variables

Qualitative variable

A variable is said to be qualitative when its values are represented by categories or classes.

The main features of these variables are:
Restricted choice of answers;
Represented by letters or numbers corresponding to a characteristic.

Qualitative variables are divided into 2 groups:
Nominal
Ordinal

Quantitative variable

A variable is said to be quantitative if its values can be measured and therefore represented by numbers.

The main features of these variables are:
Unlimited choice of answers;
Possibility of making calculations;
Always represented by numbers.

Quantitative variables are divided into 2 groups:
Discrete
Continuous

Quantitative Variables
Examples

When people say they are 18 years old, they are between 18 and 19 years of age.

[Number line: 17, 18, 19]

"18 years" represents the interval between 18 and 19. The "age" variable is therefore continuous.

When someone says he or she has 3 children, it means exactly 3 children.

3 is an isolated value. The "number of children" is a discrete variable.

Continuous

A quantitative variable is said to be continuous if the possible values for this variable form intervals of numbers.

Variables which identify:
Size,
Weight,
Speed,

are examples of continuous variables.

Discrete

A quantitative variable is said to be discrete if the possible values of this variable are isolated values (often numbers resulting from counting).

Variables identifying:
DPMO,
The number of accidents per month,
The number of participants in a training session,

are examples of discrete variables.

Discrete and continuous


Examples
Are these data discrete or continuous?

Head size (in cm).


CONTINUOUS

Year written on Canadian pennies.


DISCRETE

Discrete and continuous


Practical tips
In general, we suggest that you collect continuous data:
They can be converted into discrete data if necessary.
For the same sample size, continuous data give more information than other types of data.
Example: the resistance of an alloy.

[Figure: three alloy samples shown against a lower and an upper specification limit. A pass/fail resistance test reports "Passes", "Fails", "Fails", but does not give the resistance value itself (Resistance = ?).]
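The point that continuous data can always be reduced to discrete data, but not the other way around, can be sketched in Python. The measurements and specification limits below are invented for illustration only; they are not taken from the course material.

```python
# Hypothetical specification limits and resistance measurements
# (illustrative values and units, not from the course).
LOWER_SPEC = 40.0
UPPER_SPEC = 60.0
measurements = [52.3, 38.9, 61.4]


def classify(value, lower=LOWER_SPEC, upper=UPPER_SPEC):
    """Reduce a continuous measurement to a discrete pass/fail result."""
    return "Passes" if lower <= value <= upper else "Fails"


results = [classify(m) for m in measurements]
print(results)  # ['Passes', 'Fails', 'Fails']
```

The pass/fail list no longer shows that one failure is below the lower limit and the other above the upper limit, which is the information lost when only discrete data are collected.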

Location and Dispersion


Statistics

Location Statistics

We will see 3 location statistics


Mean or Average
Median
Mode

Utility of location statistics:

To place emphasis on the distribution's point of concentration
AND
to have an idea of where the data mainly stand

Location Statistics
Description

Mean or Average
The mean or average is the center of gravity of a distribution. It corresponds to the equilibrium point on a scale. In general, it is represented by the symbol $\bar{X}$.

Location Statistics
Description
The mean or average is the most popular statistic!

Average temperature for the month of July...
Batting average of Vladimir Guerrero...

Location Statistics
Description
Median
The median of a distribution is the value found in the middle of the observed data when they are placed in ascending order. Half of the observed data will be lower than this value and the other half will be higher.
I'm the median!

Location Statistics
Description

Mode
The mode is the most likely value of a distribution, i.e. it's the most frequently observed value.

THE MOST FREQUENT


NOT THE HIGHEST!

Location Statistics
Formulas
Let's say we have these 5 observed values:

5, 2, 2, 4, 7

Average: $\bar{X} = \frac{\sum X_i}{n} = \frac{5 + 2 + 2 + 4 + 7}{5} = 4$, where n is the number of observed values.

Median: the sorted values are 2, 2, 4, 5, 7, so $md = 4$.
There are 2 observed values on each side of 4. (If the number of observed values is even, we take the average of the two middle observed values.)

Mode: Mode = 2 (it's the most recurring observed value). It is possible for a distribution to have more than one mode.
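The three location statistics for these 5 values can be checked with a short Python sketch. It uses Python's standard `statistics` module rather than Minitab or Excel, which is a choice made here for illustration only.

```python
import statistics

values = [5, 2, 2, 4, 7]

mean = statistics.mean(values)        # (5 + 2 + 2 + 4 + 7) / 5
median = statistics.median(values)    # middle of the sorted values 2, 2, 4, 5, 7
modes = statistics.multimode(values)  # a list, since there can be several modes

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```

`multimode` is used instead of `mode` because it returns every value tied for most frequent, which matches the remark that a distribution can have more than one mode.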

Location Statistics
Interpretation

The average and the median are two central location statistics.

The major difference is that the MEDIAN is much less likely to be influenced by outliers than the AVERAGE, which is pulled toward them. The median is said to be robust to outliers.

0 0 1 1 2 2 2 2 3 4

Changing 4 to 34:

0 0 1 1 2 2 2 2 3 34

The median stays at 2, while the average goes from 1.7 to 4.7.
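The effect of the single outlier can be reproduced with a small Python sketch using the standard `statistics` module (an illustrative choice, not the tool used in the course).

```python
import statistics

original = [0, 0, 1, 1, 2, 2, 2, 2, 3, 4]
with_outlier = [0, 0, 1, 1, 2, 2, 2, 2, 3, 34]  # the 4 replaced by 34

summary = {}
for label, data in (("original", original), ("with outlier", with_outlier)):
    # With 10 values, the median is the average of the 5th and 6th sorted values.
    summary[label] = (statistics.median(data), statistics.mean(data))
    print(label, "-> median:", summary[label][0], "mean:", summary[label][1])
```

Only one value out of ten changed, yet the average almost triples while the median does not move.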

The choice between these two statistics often depends on the context in which the data are examined.

Dispersion Statistics
If a statistician had ice on his head and fire under his feet, he would say that, on average, he's feeling good!!!

What is ignored in this example?

VARIATION

Dispersion Statistics

Here are 3 dispersion statistics


Range
Standard Deviation
Variance

Utility of dispersion statistics:


Putting emphasis on variations
observed between data

Dispersion Statistics
Description

Range
The range of a distribution is the difference between the
maximum and minimum data. It represents the width of the
distribution. In general, it is represented by the R symbol.

Dispersion Statistics
Description

Standard Deviation
The standard deviation is a quantity based on the distance between each observed value and the average. It measures the variation around the average. In general, it is represented by the symbol $s$.

Variance
The variance is the squared standard deviation. In general, it is represented by the symbol $s^2$.

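Dispersion Statistics
Formulas
Let's say we have these 5 observed values:

5, 2, 2, 4, 7

Range: $R = \text{Max} - \text{Min} = 7 - 2 = 5$

Standard Deviation:
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{(5-4)^2 + (2-4)^2 + (2-4)^2 + (4-4)^2 + (7-4)^2}{4}} = \sqrt{\frac{18}{4}} \approx 2.1$
where $\bar{X} = 4$ is the average and $n - 1 = 4$ is the number of observations minus 1.

Always use (number of observed values - 1) as the divisor in the standard deviation formula (STANDARD DEVIATION function in Excel).

Variance: $s^2 = \frac{18}{4} = 4.5$ (the square of the unrounded standard deviation, 2.12)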
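The same dispersion statistics can be verified with a short Python sketch. It relies on the standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1; the choice of Python is for illustration only.

```python
import statistics

values = [5, 2, 2, 4, 7]

value_range = max(values) - min(values)  # 7 - 2
variance = statistics.variance(values)   # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(values)       # square root of the variance

print("range:", value_range)
print("variance:", variance)
print("standard deviation:", round(std_dev, 1))
```

The module also offers `pvariance` and `pstdev`, which divide by n and apply when the data are the whole population rather than a sample.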

Dispersion Statistics
Interpretation

The range and the standard deviation are the two most often used dispersion statistics. Both of these statistics are strongly affected by outliers.

0 0 1 1 2 2 2 2 3 4

Changing 4 to 34:

0 0 1 1 2 2 2 2 3 34

The range goes from 4 to 34 and the standard deviation goes from 1.3 to 10.3.

The range is popular because of its simple form. However, the standard deviation is a more accurate estimate of the "real" dispersion of a distribution.

The majority of data (between 60% and 75%) can be found within 1 standard
deviation from the average.

Almost all of the data (between 90% and 98%) can be found within 2
standard deviations from the average.
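These rules of thumb can be checked on simulated data. The sketch below assumes bell-shaped (normally distributed) data, for which roughly 68% and 95% of values are expected within 1 and 2 standard deviations; the mean of 50, standard deviation of 5 and the seed are arbitrary illustrative choices.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary fixed seed for reproducibility

# Hypothetical bell-shaped data: 10,000 simulated measurements.
data = [rng.gauss(50, 5) for _ in range(10_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)


def share_within(k):
    """Fraction of values lying within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)


within_1 = share_within(1)
within_2 = share_within(2)
print(f"within 1 standard deviation:  {within_1:.1%}")
print(f"within 2 standard deviations: {within_2:.1%}")
```

For strongly skewed data or data with outliers, the observed shares can fall outside the ranges quoted above.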

Dispersion Statistics
Practical tips

In general, the average and the standard deviation are calculated from a sample, and therefore they are NEVER exact: they are approximations of the population values. The larger the sample, the more precise the approximation.
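The link between sample size and precision can be illustrated by simulation. In the sketch below, the population (100,000 simulated head sizes with mean 57 cm and standard deviation 2 cm), the sample sizes and the number of repetitions are all assumptions made up for the illustration.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary fixed seed for reproducibility

# Hypothetical population: 100,000 simulated head sizes (cm).
population = [rng.gauss(57, 2) for _ in range(100_000)]


def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the averages of `repeats` random samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)


# The smaller this spread, the closer a single sample average tends
# to be to the true population average.
spreads = {n: spread_of_sample_means(n) for n in (5, 50, 500)}
for n, spread in spreads.items():
    print(f"n = {n:3d}: spread of sample averages = {spread:.3f}")
```

The spread shrinks as n grows, but more slowly than n itself: it is roughly proportional to one over the square root of n.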

When comparing averages and standard deviations obtained from two samples, it is important to verify whether the observed difference is statistically significant.

A statistically significant difference between two averages (or standard deviations) allows us to establish, with some degree of certainty, that the observed difference is not the result of chance (and that it is real).

The average and the standard deviation are affected by outliers. This phenomenon must be taken into account when making calculations and interpretations.

Introduction to Minitab

Introduction to Minitab

In this section, we will see:


Minitab's Main Windows
The Worksheet

Columns Function
Rows Function
Worksheet Function
Formulas

Minitab Files
Importing Files in Minitab

Minitab's Main Windows

Session Window
Contains numerical results of analyses

Data Window
Contains data columns. There is one data window per
worksheet.

Graphics Window
Contains graphs and analyses

Minitab's Main Windows

Project Manager Window


Allows management of project contents by:
Session File: allows management of contents of the session
window.
Graphics Window: allows management and naming of graphic
windows.
Worksheet File: displays a summary of columns contained in
the worksheet in progress.

Worksheet

Data entry in Minitab is done in a worksheet, just as in the Excel spreadsheet program.

[Screenshot labels: file name, column number, variable name, type of data (numerical, text), row number, missing data.]

Columns

To add or modify the title of a column, double-click on the cell under the column number.
[Screenshot label: double-click here.]

Columns

To change the type of data in a column, click on Manip > Change Data Type. It is then possible, among other things, to convert a numerical column to text format and vice versa.

Here, the contents of the Operator column, currently in numerical format, will be converted to text format in a new column called Operator txt.

Columns

To erase a column, click on Manip > Erase Variables, then select the columns to be erased.
[Screenshot label: columns to erase.]

Rows

To delete rows, click on Manip > Delete Rows. Then, you can enter the numbers of the rows to be deleted.

To insert rows, click on the row number which corresponds to the place where the row must be inserted, then click on Editor > Insert Rows.

Worksheet

To create a new worksheet from a data subset, click on Manip > Subset Worksheet.

It is possible to create a subset by:
Specifying rows to be included
Specifying rows to be excluded

Rows to be included or excluded are then chosen according to three possible methods:
Rows meeting a certain criterion
Rows being brushed
Specific row numbers

Worksheet

To split a worksheet into several worksheets, click on Manip > Split Worksheet.

You must specify which variable will determine how the worksheet is split. Here we will create one worksheet for each Plant value.

In the same way, it is possible to merge several worksheets into one by clicking on Manip > Merge Worksheet.

Formulas

To use a formula to generate data in a column, click on Calc > Calculator. Then, specify the calculation to be done and the name of the column where the results will appear.
[Screenshot labels: name of the column where results will appear; formula.]

Minitab Files

There are three types of files in Minitab:


Worksheet Files
Include all datasets. These files have the .MTW extension.

Project Files
Include all of the analyses done on a set of data : worksheet,
numerical summaries, graphs, etc. Very useful to keep analyses
done on a dataset. These files have the .MPJ extension.

Graphic Files
Include graphs saved during previous analyses. These files
have the .MGF extension.

Importing Files in Minitab

To import a file in Minitab you must:

1. Click on File > Open Worksheet.
2. Select the type of file to import (text, Excel, etc.) in the Files of type menu.
3. Make sure the file to be imported has one column per variable. The title of each variable can appear on the first row.

Location and Dispersion


Statistics in Minitab

To obtain the average, the median and the standard deviation in Minitab, click on Stat > Basic Statistics > Display Descriptive Statistics.
Results appear in the Session Window.
The standard deviation is found under StDev.
The range can be calculated by subtracting the value found under Minimum from the value found under Maximum.

Exercise !

M&M TRIVIA
1. Try to guess the total number of M&M's:
2. Try to guess the number of red ones:
3. Open your bag and count the total number of M&M's:
4. Also count the number of red ones in your bag:

Results

Using the results obtained by participants, enter the data in a file called exercise1.mtw. Add a column for the total number of M&M's and one for the number of red ones, and calculate:

The mean
The median
The mode
The range
The standard deviation

Total

Red Ones

Use of Minitab
Exercise

Import the Excel file Furntemp plant 1.xls into Minitab.
Save the file in Minitab format as Furntemp plant 1.mtw.
Close the file and reopen it.
Find the average, the standard deviation, the median and the range of the temperature.

Do you have any questions?
