2017 Univariate Data - Topic 1 Notes

TOPIC ONE UNIVARIATE DATA (CORE)
1.1 Types of data

CLASSIFYING DATA
Data
• Numerical: Discrete or Continuous
• Categorical: Nominal or Ordinal
DATA CAN BE EITHER NUMERICAL OR CATEGORICAL
Numerical data (e.g. 2, 6999…)
Data which provide information about QUANTITIES.
Examples:

Categorical data
Data which provide information about QUALITIES.
Examples:
NUMERICAL DATA CAN BE EITHER DISCRETE OR CONTINUOUS
Discrete data (2, 5, 10, 114, …)
Data which are countable and can only take certain values.
(Ask yourself ‘How many?’)
Examples:

Continuous data
Data which are measurable rather than countable, and they can take any value in a certain range including fractional and decimal values.
(Ask yourself ‘How much?’)
Examples:
CATEGORICAL DATA CAN BE EITHER ORDINAL OR NOMINAL
Ordinal data
Data which have some order implicit in the categories used.
Example:

Nominal data
Data which have no order implicit in the categories used.
Examples:
Warning!
It is not the variable name itself that determines whether the data are numerical or categorical, it is the way
the data for the variable are recorded.
For example:
 ‘weight’ recorded in kilograms, is a numerical variable
 ‘weight’ recorded as 1 = underweight, 2 = normal weight, 3 = overweight, is a categorical variable
CATEGORICAL DATA
Categorical data are displayed in either bar charts or segmented bar charts.
FREQUENCY TABLE
Climate Types In Various Countries

Climate type    Frequency (count)    Frequency (percentage)
Cold              5
Moderate         11
Hot              19
Total            35                  100.0

Working area for finding the percentages (to 1 decimal place):
percentage frequency = (count ÷ total count) × 100%
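The percentage column can be checked with a short script. This is a minimal sketch in Python (not part of the original notes) that applies the percentage frequency rule to the three counts in the table; the variable names are illustrative only.

```python
# Counts taken from the frequency table above.
counts = {"Cold": 5, "Moderate": 11, "Hot": 19}
total = sum(counts.values())  # total count (35)

# percentage frequency = (count / total count) x 100%, to 1 decimal place
percentages = {name: round(count / total * 100, 1) for name, count in counts.items()}

for name, pct in percentages.items():
    print(f"{name:<10}{counts[name]:>3}{pct:>8.1f}")

# The modal category is the one with the largest count.
mode = max(counts, key=counts.get)
print("Modal category:", mode)
```

Because each percentage is rounded separately, the rounded values will not always add to exactly 100.0, although they do for this table.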
BAR CHART
[Bar chart: Climate Types]
• The bars are parallel.
• The lengths are drawn to scale and they represent the frequencies (counts or percentages).
• The width of the bars should be the same and the space between them should be the same.
• The lengths of the bars show us the pattern of the data. For example, the most common climate type has the longest bar; this category is called the mode or modal category.
REPORT – DESCRIBING A BAR CHART

In describing a bar chart, the focus is on two important features:
 the presence of a dominant category (or group of categories) in the distribution. This is given by the
mode. If there is no dominant category, then this should be stated.
 the order of occurrence of each category and its relative importance.
Report
The climate types of ______ countries were classified as being ‘cold’, ‘moderate’ or ‘hot’.
The majority of the countries, ______ %, were found to have a _______________ climate.
Of the remaining countries, ______ % were found to have a _______________ climate
while ______ % were found to have a _______________ climate.
1.3 Stem plots
Numerical data are displayed in histograms, stem plots or dot plots. In this section we will consider the
stem-and-leaf plot which is a way of ordering and displaying a set of data.
 It is useful for displaying small to medium sized sets of data (up to about 50 data values).
 It retains all the original data values enabling the centre and the range of the distribution to be located
more precisely.
 It also enables the clear identification of outliers.
Example
The ordered stem plot shows the distribution of test marks of 23 students.
Note: e.g. if n = 20, $\frac{20+1}{2} = 10.5$, so the 10th and 11th terms are in the middle.
a. Name its shape and note outliers (if any).
b. Locate the centre of the distribution.
c. Estimate the spread of the distribution.

d. Write down the values of any outliers.
Key: 1|5 = 15
REPORT – DESCRIBING A STEMPLOT

Report
From the 23 students, the distribution of marks is _______________________________ with
an outlier. The centre of the distribution is at _______________ and the distribution has a
spread of _______________ . The outlier is a mark of ______ .
BACK-TO-BACK STEM PLOTS
Back-to-back stemplots are used for comparing the distribution of two sets of data values for the same
variable.
Example
Use the back-to-back stem plot to write a report
comparing the distribution of the two sets of test marks in
terms of shape, centre, spread and outliers.
Report
The distribution of the Test 1 marks is _______________________________ while the
distribution of the Test 2 marks is _______________________________ . The two distributions
have similar centres; ________________________ . The spread of the Test 1 marks is
__________________ the Test 2 marks; ________________________. There are no outliers.
1.4 Dot plots, frequency tables and histograms, and bar charts
STACKED or SEGMENTED BAR CHARTS
Percentage segmented bar charts are particularly useful when analysing the relationship between two
categorical variables.
A segmented bar chart with actual values
[Chart: Climate Types]
• The lengths of the segments represent the frequencies of the different values of the variable.
• The height of the bar gives the total frequency.

A segmented bar chart using percentage frequency
[Chart: Climate Types]
• The lengths of the segments represent the percentages of the different values of the variable.
• The height of the bar gives the total percentage (100%).

For both types of chart:
• A legend needs to be included to identify the segments.
• Segmented bar charts should only be used when there are a relatively small number of components, usually no more than four or five.
DOT PLOTS
 Dot plots are used for small (discrete) data sets.
 They are best used when the data values are relatively close together.
Example
The ages (in years) of the 13 members of a sporting team are:
22 19 18 19 23 25 22 29 18 22 23 24 22
Construct a dot plot.
Note: The horizontal axis must be scaled and have equally spaced values.
REPORT – DESCRIBING A DOT PLOT
Usually there is little we can say about the shape of the distribution from the dot plot because there are not
sufficient data points for any pattern to be revealed.
From the dot plot above, we see that the distribution of ages is centred at 22 years (the middle value) with a
spread of 11 years (29 − 18 = 11).
Report
From the dot plot, we see that the distribution of ages is centred at ______ (the middle value)
with a spread of ______________.
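As a check on the centre and spread quoted for the sporting team, here is a minimal Python sketch (an illustration, not part of the original notes). It prints a rough text version of the dot plot and then finds the middle value and the range.

```python
from collections import Counter
import statistics

# Ages of the 13 team members, as listed in the example.
ages = [22, 19, 18, 19, 23, 25, 22, 29, 18, 22, 23, 24, 22]

# Text dot plot: one '*' per team member, on an equally spaced scale.
tally = Counter(ages)
for age in range(min(ages), max(ages) + 1):
    print(f"{age:>3} " + "*" * tally[age])

centre = statistics.median(ages)   # middle value of the ordered ages
spread = max(ages) - min(ages)     # range = largest value - smallest value
print("centre:", centre, "spread:", spread)
```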
DISCRETE DATA
FREQUENCY TABLE HISTOGRAM

Family Size

Family size    Frequency (count)
2               1
3               5
4               3
5               2
Total          11

[Histogram: Family Size]
• Each data value is in the centre of each rectangle.
• Because of the continuous nature of the variable, the ‘bars’ in a histogram are joined together.
Grouping Data
When the variable takes a large range of values, we group the data into a small number of intervals.
Generally between five and fifteen intervals are used; the smaller the number of data values, the smaller the
number of intervals.
Example
The ages of a sample of 200 people aged from 16 to 72 years are to be recorded. Group the ages into six
equal-sized categories that will cover all of these ages.
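One way to arrive at the six intervals is sketched below in Python. This is only an illustration: the starting value of 15 is an assumption chosen to give tidy boundaries, and other answers (for example starting at 16) are equally valid.

```python
import math

lowest, highest, n_intervals = 16, 72, 6

# There are 72 - 16 + 1 = 57 whole-year ages to cover, so each of the
# six intervals must span at least 57 / 6 = 9.5 years; round up to 10.
width = math.ceil((highest - lowest + 1) / n_intervals)

start = 15  # assumed starting point, chosen for convenient boundaries
intervals = [(start + i * width, start + (i + 1) * width - 1)
             for i in range(n_intervals)]

for low, high in intervals:
    print(f"{low} - {high}")
```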
CONTINUOUS DATA
Average Hours Worked In Various Countries

FREQUENCY TABLE HISTOGRAM
Average Hours Worked

Average hours worked    Frequency (count)
30.0 – 34.9              1
35.0 – 39.9              6
40.0 – 44.9              8
45.0 – 49.9              5
50.0 – 54.9              3
Total                   23

[Histogram: Average Hours Worked]
• Each column starts at the beginning of each interval and finishes at the beginning of the next.
REPORT – DESCRIBING A HISTOGRAM

In describing a histogram, the focus is on three important features:
 shape and outliers (values in the data set that appear to stand out
from the rest)
 centre
 spread
Example
The histogram shows the distribution of the number of phones per
1000 people in 85 countries.
Report
For the 85 countries, the distribution of the number of phones per 1000 people is
______________________________ . The centre of the distribution lies somewhere in the
interval ______________________________. The spread of the distribution is
______________________________. There are no outliers. (List the outliers here if there are any.)
Using a CAS Calculator to display UNIVARIATE DATA in a HISTOGRAM
UNGROUPED DATA
Example
The following data lists the number of hours of part-time work undertaken by 16 students in a week.
4, 0, 3, 1, 3, 2, 3, 4, 1, 3, 2, 5, 3, 2, 1, 0
Using a CAS calculator, make a histogram for the data set.
Step 1
In the Statistics app, change list1 to ‘hours’ by tapping list1 and typing
‘hours’.
(Note: You do not have to change the column headings.)
Enter the data in the ‘hours’ column.
Step 2
To draw the histogram, tap:
• SetGraph
• Setting
Draw: ON
Type: Histogram
XList: main\hours
Freq: 1
• Set
Set: HStart: 0
HStep: 1
• OK
Note: Interval marks should be in the middle of the bar.

GROUPED DATA
Example
Using a CAS calculator, make a histogram for the following data set.

Family size    Frequency (count)
2               1
3               5
4               3
5               2
Total          11

Step 1
In the Statistics app, change list1 to ‘family’ and list2 to ‘freq’.
(Note: You do not have to change the column headings.)
Enter the data into the two columns.
Step 2
To draw the histogram, tap:
• SetGraph
• Setting
Draw: ON
Type: Histogram
XList: main\family
Freq: main\freq
• Set
Set: HStart: 0
HStep:
Note: Interval marks should be in the middle of the bar.
Using Log (base 10) scales
Many numerical variables that we deal with in statistics have values that range over several orders of
magnitude. For example, the populations of countries range from a few thousand to hundreds of thousands,
to millions, to hundreds of millions, to just over 1 billion. Constructing a histogram that effectively locates
every country on the plot is impossible.
One way to solve this problem is to use a scale that spreads out the countries with small populations and
‘pulls in’ the countries with huge populations.
A scale that will do this is called a logarithmic scale (or, more commonly, a log scale).
An introduction to Logs
To understand logs you need to understand powers of 10
a) $10^1 = 10$
b) $10^2 = 100$
c) $10^3 = 1000$
These can all be written in log form
a) $\log_{10} 10 = 1$
b) $\log_{10} 100 = 2$
c) $\log_{10} 1000 = 3$
Using your calculator find the following correct to three decimal places
a) $10^{3.201}$
b) $10^{1.476}$
c) $10^{0.587}$
Rewrite each of the following in Log notation

Now use the CAS to find the following
a) $\log_{10} 4531$
b) $\log_{10} 23.9754$
c) $\log_{10} 2390000000$
The log button we will use in Further Mathematics this year is
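The same calculations can be tried away from the CAS. The following Python sketch (an illustration only, using the standard `math` module) evaluates the powers of 10 and the base-10 logs from the exercises above; the value 0.5 is an extra example added here to show what happens for a number between zero and one.

```python
import math

# Powers of 10 from the exercise, rounded to three decimal places.
powers = {p: round(10 ** p, 3) for p in (3.201, 1.476, 0.587)}

# Base-10 logs of the exercise values, plus 0.5 as an added illustration.
logs = {x: math.log10(x) for x in (10, 100, 1000, 4531, 23.9754, 2390000000, 0.5)}

for p, value in powers.items():
    print(f"10^{p} = {value}")
for x, value in logs.items():
    print(f"log10({x}) = {round(value, 3)}")

# The log of zero does not exist: Python raises an error instead.
try:
    math.log10(0)
except ValueError:
    print("log10(0) is undefined")
```

Compare the sign of the log for 0.5 with the sign for the numbers greater than one.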
Properties of logs to the base 10

• If a number is greater than one, its log to the base 10 is greater than zero.
• If a number is greater than zero but less than one, its log to the base 10 is negative.
• If the number is zero, then its log is undefined.
Why use Logs?
The histogram below displays the body weights (in kg) of a number of animal
species. Because the animals represented in this dataset have weights ranging from around
1 kg to 90 tonnes (a dinosaur), most of the data are bunched up at one end of the scale and
much detail is missing. The distribution of weights is highly positively skewed, with an
outlier.
However, when a log scale is used, their weights are much more evenly spread along the scale. The
distribution is now approximately symmetric, with no outliers, and the histogram is considerably more
informative.
We can now see that the percentage of animals with weights between 10 and 100 kg is similar to the
percentage of animals with weights between 100 and 1000 kg.
Example
The weights of 27 animal species (in kg) are recorded below.
1.4 470 36 28 1.0 12000 2600 190 520 10 3.3 530 210 62 6700 9400 6.8 35 0.12 0.023 2.5 56
100 52 87000 0.12 190
Construct a histogram to display the distribution:
a) of the body weights of these 27 animals and describe its shape
b) of the log body weights of these animals and describe its shape.
For part b:
Enter the data into list1 in the Statistics app.
Under list2, tap in the Cal section, then tap in the Cal box.
Describe the shape of the distribution.
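To see why the log scale spreads the animal weights out, the sketch below (Python, for illustration only) takes the log of each of the 27 weights and counts how many fall in each whole-number interval of the log scale, i.e. each power of 10. The grouping by `math.floor` is an assumption about how the histogram columns are chosen.

```python
import math
from collections import Counter

# Body weights (kg) of the 27 animal species from the example.
weights = [1.4, 470, 36, 28, 1.0, 12000, 2600, 190, 520, 10, 3.3, 530,
           210, 62, 6700, 9400, 6.8, 35, 0.12, 0.023, 2.5, 56,
           100, 52, 87000, 0.12, 190]

log_weights = [math.log10(w) for w in weights]

# Column k holds weights with k <= log10(weight) < k + 1,
# i.e. weights from 10^k kg up to (but not including) 10^(k+1) kg.
columns = Counter(math.floor(lw) for lw in log_weights)

for k in sorted(columns):
    print(f"{k:>3} to {k + 1:<3} " + "*" * columns[k])
```

On the raw scale almost every weight would fall in the first column of a histogram running from 0 to 90 000 kg; on the log scale the columns are far more even.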
1.5 Describing the shape of stem plots and histograms
WHAT TO LOOK FOR IN A HISTOGRAM
SHAPE
Symmetric distributions
• Single-peak symmetric distribution.
• Bi-modal distribution: has two distinct peaks.
Skewed distributions
• Positively skewed: the graph tails off in a positive direction.
• Negatively skewed: the graph tails off in a negative direction.
OUTLIERS
Outliers are any data values that stand out from the main body of
data.
CENTRE
Consider the centre to be the middle of the distribution.
SPREAD
For a histogram, the spread of the distribution is given by the range.
Range = largest value – smallest value
Stem plot shapes
WHICH GRAPH?
1.6 The median, the interquartile range, the range and the mode
Summary statistics are classified as
• Measures of centre – locating the middle of the distribution (mean, median and mode)
• Measures of spread – how the values are spread out (interquartile range, range and standard deviation)
The median, range and interquartile range (IQR) are all summary statistics based on ordering the data.
THE MEDIAN (Q2)
 The median is the midpoint of a distribution.
 50% of values in the data set are less than or equal to the median.
 The data must be arranged in numerical order before finding the position of the median.
Rule
For n values, the median lies in the $\left(\frac{n+1}{2}\right)$th position, i.e. the $\frac{\text{the number of data values} + 1}{2}$th position.
When n is an even number: if n = 20, $\frac{20+1}{2} = 10.5$, so the 10th and 11th terms are in the middle and you will need to
calculate the average of these values.
THE RANGE
 The range gives the spread of the distribution.
Rule: range = largest data value − smallest data value
THE INTERQUARTILE RANGE (IQR)

 Quartiles are the points that divide a distribution into quarters.
 The interquartile range (IQR) is defined to be the spread of the middle 50% of data values.
Rule: IQR = Q3 − Q1
ODD NUMBER of data
7 data values: 4 9 11 13 17 23 30
Median =
Interquartile range =
Range =

9 data values: 2 7 13 14 17 19 21 25 29
Median =
Interquartile range =
Range =

EVEN NUMBER of data
6 data values: 3 6 10 12 15 21
Median =
Interquartile range =
Range =

8 data values: 1 3 9 10 15 17 21 26
Median =
Interquartile range =
Range =
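Answers to the four exercises above can be checked with the following Python sketch. It is an illustration under one stated assumption: the quartiles are taken as the medians of the lower and upper halves of the ordered data, with the median itself left out of both halves when n is odd. Other quartile conventions exist and can give slightly different values.

```python
def median(sorted_values):
    """Middle value of an already ordered list (average of two if n is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def summary(values):
    """Median, IQR and range; the median is excluded from the halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    q1, q3 = median(lower), median(upper)
    return {"median": median(data), "IQR": q3 - q1, "range": data[-1] - data[0]}

for data in ([4, 9, 11, 13, 17, 23, 30],
             [2, 7, 13, 14, 17, 19, 21, 25, 29],
             [3, 6, 10, 12, 15, 21],
             [1, 3, 9, 10, 15, 17, 21, 26]):
    print(len(data), "values:", summary(data))
```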
STEM PLOT
Distribution of Test Scores

a. Determine the median for Test 1 and for Test 2.
b. Determine the IQR for Test 1 and for Test 2.
c. Determine the range for Test 1 and for Test 2.
Limitations of the range

As the range depends only on the two extreme values in a set of data it is not always an informative measure of
spread. For example, the largest and smallest values in a data set might be outliers and not at all typical of the rest of
the values. Furthermore, any two sets of data with the same highest and lowest values will have the same range,
irrespective of the way in which the data values are spread out in between. However, the range is useful to know
because it gives us an indication of the absolute spread of the distribution.
Why is the IQR a more useful measure of spread than the range?
The IQR is a measure of spread of a distribution that includes the middle 50% of observations. Since the
upper 25% and lower 25% of observations are discarded, the interquartile range is generally not affected by
the presence of outliers/extreme values. This makes it a more useful measure of spread than the range.
• Use one of the two terms, outlier or extreme value, according to the data presented.
• If you can, state the outlier or extreme value in your data set.
• Note: To call a data value an outlier you must have calculated that it is more than 1.5 × IQR below Q1 or
more than 1.5 × IQR above Q3. Do not assume this.
FIVE-NUMBER SUMMARY
List the data values in the order: Minimum (Q0), Q1, M (Q2), Q3, Maximum (Q4)
1.7 Boxplots
 A box-and-whisker plot is a graphical version of a five-number summary.
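[Diagram: box plot labelled, from left to right, with the lowest value (minimum, Q0), the lower quartile (QL or Q1), the median (QM or Q2), the upper quartile (QU or Q3) and the highest value (maximum, Q4). A whisker extends from each end of the box; the box spans the interquartile range and the whole plot spans the range.]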
 The number line is drawn so that it covers all the values in the set of data.
Number line (with a suitable scale)
 The rectangular ‘box’ represents the interquartile range (the middle half of the data).
 The ‘box’ is drawn from the lower quartile (QL or Q1 ) to the upper quartile (QU or Q3 ).
 The vertical line drawn in the ‘box’ represents the median (middle value).
 A whisker extends from each end of the ‘box’ to the lowest and highest values.
OUTLIERS
 Outliers are data values which are found outside the main group of the data.
 If the whiskers in a box plot are more than 1.5 times the width of the box, then there are outliers in the
sample.
 The whiskers in a box plot are extended to the last actual data values before the outliers.
Outliers are data values that are:

• less than Q1 − 1.5 × IQR (lower fence)
• greater than Q3 + 1.5 × IQR (upper fence)

[Diagram: box plot on a number line. The lower fence lies 1.5 × IQR below Q1, at Q1 − 1.5 × IQR, and the upper fence lies 1.5 × IQR above Q3, at Q3 + 1.5 × IQR. The whiskers end at the lowest and highest data values inside the fences; outliers beyond the fences are marked with *.]
TESTING FOR OUTLIERS

Step 1: Establish the values for Q1, Q3, and IQR.
Step 2: Calculate Q1 − 1.5 × IQR. This value is called the lower fence.
Step 3: Check to see whether there are any data values less than Q1 − 1.5 × IQR.
If so, then each value is an outlier.
(The value of the outlier must be less than the lower fence value.)
Step 4: Calculate Q3 + 1.5 × IQR. This value is called the upper fence.
Step 5: Check to see whether there are any data values greater than Q3 + 1.5 × IQR.
If so, then each value is an outlier.
(The value of the outlier must be more than the upper fence value.)
Example
Lower fence =
Upper fence =
Five number summary:
Minimum, Q1, M, Q3, Maximum
Interquartile range = Outlier(s):
Example
Stem | Leaf
0 | 7
1 |
2 |
3 | 2 2 4 6 8 9
4 | 0 2 3 6
5 | 4 6
6 | 5
7 |
8 | 3 9
Key: 1|5 = 15

Lower fence =
Upper fence =
Five number summary:
Minimum, Q1, M, Q3, Maximum
Interquartile range =
Outlier(s):
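The five steps for testing for outliers can be sketched in Python as follows, using the 16 values read from the stem plot in this example. This is an illustration, not the notes' own working; it assumes the quartiles are the medians of the lower and upper halves of the ordered data.

```python
def median(sorted_values):
    """Middle value of an already ordered list (average of two if n is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

# Values read from the stem plot (key: 1|5 = 15).
marks = [7, 32, 32, 34, 36, 38, 39, 40, 42, 43, 46, 54, 56, 65, 83, 89]

# Step 1: Q1, Q3 and the IQR.
data = sorted(marks)
half = len(data) // 2
lower = data[:half]
upper = data[half + 1:] if len(data) % 2 else data[half:]
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

# Steps 2 and 4: the fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Steps 3 and 5: any value strictly beyond a fence is an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("fences:", lower_fence, upper_fence, "outliers:", outliers)
```

Note that a value can look isolated on the stem plot and still sit inside the fences, which is why the calculation is required.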
Using a CAS Calculator to determine the FIVE-NUMBER SUMMARY and display a BOX PLOT
UNGROUPED DATA
Example
a. Using a CAS calculator, display the five-number summary for the
following data set.
3 6 4 8 17 12 9 7 13 13 5 9 7 2 1 7 5 4 2
Step 1
In the Statistics app, enter the data in list1.
Step 2
To obtain the five-number summary, tap:
 Calc
 One-Variable
XList: list1
Freq: 1
 OK
Five-number Summary
Minimum value Q0
Q1
Median Q2
Q3
Maximum value Q4
b. Using a CAS calculator, make a box plot for the data set.
Step 1
In the Statistics app, enter the data in list1.
Step 2
To draw the box plot, tap:
 SetGraph
 Setting
Draw: ON
Type: MedBox
XList: list1
Freq: 1
Show Outliers – tap the box to add a tick
 Set
Step 3

To move around the boxplot tap:

 Analysis
 Trace
Use the cursor to move to the left and right across the boxplot to locate
the minimum, maximum, quartile values and any outliers.
GROUPED DATA
Example
The following frequency table displays the goals scored in 100 games.
Goals in a game 0 1 2 3 4 5 6 Total

Frequency 5 15 30 20 15 10 5 100
a. Using a CAS calculator, display the five-number summary for the following data set.
Step 1
In the Statistics app, enter the ‘goals’ in list1 and the ‘frequency’ in list2.
Step 2
To obtain the five-number summary, tap:
 Calc
 One-Variable
XList: list1
Freq: list2
 OK
Five-number Summary
Minimum value Q0
Q1
Median Q2
Q3
Maximum value Q4
b. Using a CAS calculator, make a box plot for the data set.
Step 1
In the Statistics app, enter the ‘goals’ in list1 and the ‘frequency’ in list2.
Step 2
To draw the box plot, tap:
 SetGraph
 Setting
Draw: ON
Type: MedBox
XList: list1
Freq: list2
Show Outliers – tap the box to add a tick
 Set
Step 3

To move around the boxplot tap:

 Analysis
 Trace
Use the cursor to move to the left and right across the boxplot to locate
the minimum, maximum, quartile values and any outliers.
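Entering a frequency list is equivalent to writing each value out as many times as it occurs. The Python sketch below (an illustration only) expands the goals table in that way and then finds the five-number summary. It assumes quartiles are the medians of the lower and upper halves; a calculator set to a different quartile convention may report slightly different values.

```python
def median(sorted_values):
    """Middle value of an already ordered list (average of two if n is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

goals = [0, 1, 2, 3, 4, 5, 6]
freqs = [5, 15, 30, 20, 15, 10, 5]

# Expand the table: each goal count is repeated 'frequency' times.
data = [g for g, f in zip(goals, freqs) for _ in range(f)]

half = len(data) // 2
lower = data[:half]
upper = data[half + 1:] if len(data) % 2 else data[half:]

five_number = (data[0], median(lower), median(data), median(upper), data[-1])
print("n =", len(data))
print("Min, Q1, Median, Q3, Max =", five_number)
```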
RELATING A BOXPLOT TO DISTRIBUTION SHAPE
SYMMETRIC DISTRIBUTIONS
 The box plot is symmetric.
 The median is generally in the middle of the box.
 The whiskers are approximately equal in length.
POSITIVELY SKEWED DISTRIBUTIONS

 The box plot has the median off-centre and generally to the left.
 The left-hand whisker will be short, while the right-hand whisker will be
long, reflecting the gradual tailing off of data values to the right.
ie. The upper 25% of data are sparse and spread out whereas the lower 25%
of data are bunched up.
NEGATIVELY SKEWED DISTRIBUTIONS

 The box plot has the median off-centre and generally to the right.
 The right-hand whisker will be short, while the left-hand whisker will
be long, reflecting the gradual tailing off of data values to the left.
ie. The lower 25% of data are sparse and spread out whereas the top 25%
of data are bunched up.
DISTRIBUTIONS WITH OUTLIERS

Distributions with outliers are characterised by gaps between the main
body and data values in the tails.
INTERPRETING BOX PLOTS: DESCRIBING AND COMPARING

DISTRIBUTIONS
REPORT – DESCRIBING A BOX PLOT
In describing a box plot, the focus is on four important features:
 shape – symmetric, positively skewed or negatively skewed
 outliers – do they exist
 centre – median
 spread – range and interquartile range
Example
Describe the distributions represented by the following box plots in terms of shape, centre and spread. Give
appropriate values.
a. Using a box plot to describe a distribution without outliers
Report
The distribution is _______________________________ with no outliers. The distribution is
centred at ________ , the median value. The spread of the distribution, as measured by
the IQR, is ________ , and, as measured by the range, ________ .
b. Using a box plot to describe a distribution with outliers
Report
The distribution is ______________________________ but with outliers. The distribution is
centred at ________ , the median value. The spread of the distribution, as measured by
the IQR, is ________ , and, as measured by the range, ________. There are four outliers:
__________________________ .
c. Using a box plot to compare distributions

i. The parallel box plots show the distribution of age
at marriage of 45 married men and 38 married women.
Report
The distributions of age at marriage are ______________________________ for both men
and women. There are no outliers. The median age for marriage is higher for men
(______________) than women (______________). The IQR is also greater for men
(______________) than women (______________). The range of age at marriage is also
greater for men (______________) than women (______________).
ii. Comment on how the age at marriage of men compares to women for the data.
For this group of men and women, the men, on average, married at an older age, and the age at
which they married was more variable.
1.8 The mean of a sample


The mean is given by $\bar{x} = \frac{\sum x}{n}$ where $\sum x$ represents the sum of all the observations in the data set and n
represents the number of observations in the data set.
The mean is calculated by using the values of the observations and because of this it becomes a less reliable
measure of the centre of the distribution when the distribution is skewed or contains an outlier.
To find the mean for grouped data, $\bar{x} = \frac{\sum (f \times m)}{\sum f}$ where f represents the frequency of the data and m
represents the midpoint of the class interval of the grouped data.
The more symmetrical the distribution, the closer the value of the mean is to the value of the median.
Example
Find the mean of the following data set: 3, 6, 10, 12, 16, 21
Example
Athletes were asked to take their heart rate after completing a marathon, and their results are tabulated
below. Find their mean heart rate.
Heart Rate     Midpoint    Frequency
100 -< 110                 10
110 -< 120                 15
120 -< 130                 56
130 -< 140                 48
140 -< 150                 37
Total
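Both mean calculations in this section can be checked with the short Python sketch below (an illustration only). For the grouped data it takes the midpoint of each class interval as the average of its two end points, following the grouped-data rule given earlier.

```python
# Ungrouped data: mean = (sum of observations) / (number of observations)
values = [3, 6, 10, 12, 16, 21]
mean = sum(values) / len(values)
print("mean of the six values:", round(mean, 2))

# Grouped data: heart rate intervals and their frequencies from the table.
intervals = [(100, 110), (110, 120), (120, 130), (130, 140), (140, 150)]
freqs = [10, 15, 56, 48, 37]

# m = midpoint of each class interval
midpoints = [(low + high) / 2 for low, high in intervals]

# grouped mean = sum(f x m) / sum(f)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
print("midpoints:", midpoints)
print("total frequency:", sum(freqs))
print("mean heart rate:", round(grouped_mean, 1))
```

The grouped mean is only an estimate, because every athlete in an interval is treated as if their heart rate were at the midpoint.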
1.9 Standard deviation of a sample

Definitions:
Variance ($s^2$) – measures the average of the squared deviations of each data value from the mean.
Standard Deviation (s) – the square root of the variance; a measure of the spread of the data about the mean.
The following example demonstrates the steps used to calculate the standard deviation.
Example
The hand span (to the nearest cm) of a group of 10 students was recorded:
15, 14, 17, 19, 14, 14, 13, 16, 21, 16
Find the standard deviation.
Steps to find the standard deviation
1. Calculate the mean.
2. Find the difference between each value and the mean.
3. Square each of these.
4. Add the squared differences.
5. Divide this sum by the number of results less one.
6. Take the square root.
The rule to find the standard deviation follows; however, we will be using the CAS calculator.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Mean
$\bar{x} = \frac{15 + 14 + 17 + 19 + 14 + 14 + 13 + 16 + 21 + 16}{10} = 15.9$
$x$ | $x - \bar{x}$ | $(x - \bar{x})^2$
15
14
17
19
14
14
13
16
21
16
Total =
$s^2 = 6.32$
$s = 2.51$
The standard deviation is a rough guide to how far the results lie from the mean.
About two thirds of the results lie within one standard deviation of the mean.
About 95% of the results lie within two standard deviations of the mean.
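The six steps for the hand span example can be followed line by line in the Python sketch below (an illustration, not the CAS method used in class). The final line compares the result with `statistics.stdev`, which also divides by the number of results less one.

```python
import math
import statistics

spans = [15, 14, 17, 19, 14, 14, 13, 16, 21, 16]  # hand spans in cm

mean = sum(spans) / len(spans)                    # step 1
deviations = [x - mean for x in spans]            # step 2
squared = [d ** 2 for d in deviations]            # step 3
total = sum(squared)                              # step 4
variance = total / (len(spans) - 1)               # step 5
s = math.sqrt(variance)                           # step 6

print("mean =", round(mean, 1))
print("sum of squared differences =", round(total, 2))
print("variance =", round(variance, 2), "standard deviation =", round(s, 2))
print("statistics.stdev gives", round(statistics.stdev(spans), 2))
```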
Outliers
Any value in a data set that is significantly larger, or smaller, than the other values is called an outlier.
These must be taken into account when describing data, because they can have a misleading effect on the
____________ and ______________ .
Any value more than 2 standard deviations away from the mean is an outlier.
Formula (using the standard deviation and mean): values below $\bar{x} - 2s$ or above $\bar{x} + 2s$
The histograms above represent the data from three different components of the IQ test. All sets have a
mean of 100. However, the spread differs. The Verbal component has a smaller spread with a standard
deviation of 7.9 whereas the ‘widest’ spread is seen in the Spatial component with a standard deviation of
31.6.
1.10 Populations and simple random samples
To end up with a sample of data, we need to select the sample from a population. This is best done
randomly to eliminate any possible bias, and to hopefully get a set of data that is representative of the
population.
A random sample has been chosen when every value in the population has an equal chance of being
selected, and when a value is selected it has no influence on whether or not other values would be chosen.
The easiest way to do this is using the ClassPad or Graphics Calculator. Any set of data can have numbers
assigned to each value. Then the calculator chooses a random number, which then tells us which value to
select.
Example:
Select a random sample of 10 “Victorian population” values from all the recorded statistics (1851-2004,
found on page 22).
Calculate the mean and standard deviation of your sample. Compare the population statistics to the sample
statistics using two boxplots.
Year
Population
Year
Population
Population Sample:
μ = 2,111,719 𝑥̅ = ____________
σ = 1,366,170.1 s = _____________
Xmin = 97,489 Xmin = _____________
Q1 = 1,112,136 Q1 = _______________
Med = 1,751,789 Med = _______________
Q3 = 3,236,347 Q3 = _______________
Xmax = 4,992,667 Xmax = ______________
Generating Random Numbers

Use the following instructions to generate random
numbers.
Press KEYBOARD
Tap abc
Type randList(10,1851,2004)
Press EXE.
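For readers without a ClassPad, a comparable list of random years can be produced in Python. This is a sketch only: `rand_list` is a hypothetical helper written for this illustration, and it is assumed to mimic the calculator command by drawing whole numbers between the two limits, so a year may repeat. The second approach draws 10 different years.

```python
import random

def rand_list(n, low, high, seed=None):
    """Hypothetical stand-in for the calculator command: n random whole
    numbers from low to high inclusive (repeats are possible)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]

years = rand_list(10, 1851, 2004)
print("years (repeats possible):", sorted(years))

# Alternative: 10 different years, each equally likely to be chosen.
unique_years = random.sample(range(1851, 2005), 10)
print("years (no repeats):", sorted(unique_years))
```

Each selected year is then looked up in the population table to obtain the sample values.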
Population of Victoria (1851 – 2004)
Population
Sample
Comments:
POPULATION OF VICTORIA (1851-2004)
1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863
97,489 168,321 222,436 283,942 347,305 390,384 456,522 496,146 521,072 538,234 539,764 551,388 567,906
1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876
598,003 617,791 633,602 648,302 671,324 696,762 723,925 746,450 759,428 773,808 786,108 794,934 805,424
1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889
818,935 829,918 841,757 858,605 873,965 892,765 912,453 935,777 959,838 993,717 1,025,476 1,079,077 1,104,938
1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902
1,133,728 1,158,372 1,168,747 1,176,170 1,182,155 1,185,676 1,179,850 1,182,106 1,182,281 1,188,541 1,196,213 1,209,900 1,208,231
1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915
1,204,742 1,205,608 1,210,421 1,219,832 1,232,807 1,250,449 1,277,022 1,301,408 1,339,893 1,382,553 1,415,416 1,435,188 1,424,445
1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928
1,404,663 1,417,060 1,437,245 1,503,035 1,527,909 1,550,727 1,590,273 1,625,455 1,657,151 1,684,051 1,711,987 1,741,832 1,761,746
1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
1,778,269 1,792,605 1,803,570 1,813,387 1,824,217 1,836,660 1,841,595 1,849,607 1,856,991 1,871,099 1,883,133 1,914,918 1,946,425
1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954
1,962,558 1,981,616 1,997,954 2,015,107 2,039,769 2,062,709 2,108,125 2,168,884 2,237,182 2,299,538 2,366,719 2,416,035 2,477,986
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
2,546,332 2,618,112 2,680,555 2,745,165 2,811,429 2,888,290 2,955,299 3,011,043 3,071,046 3,137,921 3,195,860 3,249,843 3,303,606
1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
3,356,827 3,421,178 3,482,031 3,633,843 3,686,136 3,730,824 3,779,587 3,800,656 3,823,941 3,852,589 3,874,501 3,899,993 3,930,655
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
3,968,398 4,012,687 4,054,498 4,097,640 4,140,421 4,183,842 4,234,945 4,295,300 4,348,225 4,400,707 4,437,479 4,465,415 4,478,835
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
4,500,354 4,539,796 4,579,429 4,615,526 4,661,741 4,713,190 4,770,042 4,830,508 4,883,538 4,936,785 4,992,667
1.11 The 68-95-99.7% rule and z-scores
Using the Mean and Standard Deviation
The mean and standard deviation allow us to estimate the percentage of values within a range of values.
If the data set has a symmetrical distribution which is bell-shaped, then the data has a normal distribution.
This is particularly useful since
For data sets with a normal distribution we also expect:

68% of the values to lie within one standard deviation of the mean ($\bar{x} - s \le x \le \bar{x} + s$)
95% of the values to lie within two standard deviations of the mean ($\bar{x} - 2s \le x \le \bar{x} + 2s$)
99.7% of the values to lie within three standard deviations of the mean ($\bar{x} - 3s \le x \le \bar{x} + 3s$)
Diagram of a normal data set:
Example 1
Over several months the diameters of apples picked at Aumann’s Orchard were recorded. The mean
diameter was found to be 10 cm, and the standard deviation was 2 cm. Aumann’s will use these data to
estimate the size of their apples in the future.
a) What range of diameters would Aumann’s expect 68% of their apples to be within?
b) What range of diameters would Aumann’s expect 95% of their apples to be within?
c) What range of diameters would Aumann’s expect 99.7% of their apples to be within?
d) Find the percentage of apples with a diameter greater than 14cm.
e) Find the percentage of apples with a diameter less than 4cm.
f) Find the percentage of apples with a diameter between 8cm and 16cm.
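The working for Example 1 can be sketched in Python as below. This is an illustration that assumes the apple diameters are normally distributed, so that the 68-95-99.7% rule applies and the percentage outside each range is shared equally between the two tails.

```python
mean, sd = 10, 2  # cm, from the example

# a) to c): ranges within 1, 2 and 3 standard deviations of the mean
ranges = {pct: (mean - k * sd, mean + k * sd)
          for k, pct in ((1, 68), (2, 95), (3, 99.7))}
for pct, (low, high) in ranges.items():
    print(f"{pct}% of apples: {low} cm to {high} cm")

# d) 14 cm is 2 sd above the mean: half of the 5% outside the 95% range
above_14 = (100 - 95) / 2

# e) 4 cm is 3 sd below the mean: half of the 0.3% outside the 99.7% range
below_4 = (100 - 99.7) / 2

# f) 8 cm is 1 sd below and 16 cm is 3 sd above the mean:
#    half of the 68% band plus half of the 99.7% band
between_8_and_16 = 68 / 2 + 99.7 / 2

print("above 14 cm:", round(above_14, 2), "%")
print("below 4 cm:", round(below_4, 2), "%")
print("between 8 cm and 16 cm:", round(between_8_and_16, 2), "%")
```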
Standard z-scores
When two or more normal data sets need to be compared, we translate the data values into a standard
normal distribution.
The raw data values are applied to a formula and become z-values. These z-values are used to compare the
values from one data set to another (see the example below).
Formula: $z\text{-value} = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}} = \frac{x - \bar{x}}{s}$
Example 2
The Schools Assessment Board needs to compare the SAC result of Tom, from Alfred Deakin High School,
and Julia, from Blackburn High School.
Alfred Deakin’s results for the SAC had an average of 25 and a standard deviation of 4, whilst Blackburn’s
average was 33 with a standard deviation of 2.
Visually:
Tom (ADHS) scored 30, whilst Julia (BHS) scored 34.
Who actually scored better? (considering that teachers may mark differently from school to school)
1. Work out the z-value for Tom:
2. Work out the z-value for Julia:
3. State who actually performed better:__________________________
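The comparison in Example 2 can be checked with the z-value formula in a few lines of Python. This is a sketch for illustration; the function name `z_value` is made up for this example.

```python
def z_value(raw_score, mean, standard_deviation):
    """z = (raw score - mean) / standard deviation"""
    return (raw_score - mean) / standard_deviation

z_tom = z_value(30, mean=25, standard_deviation=4)     # Alfred Deakin HS
z_julia = z_value(34, mean=33, standard_deviation=2)   # Blackburn HS

print("Tom:", z_tom, "Julia:", z_julia)

# The larger z-value is further above its own school's mean,
# measured in that school's standard deviations.
better = "Tom" if z_tom > z_julia else "Julia"
print("Better relative performance:", better)
```

Although Julia's raw score is higher, the z-values compare each student with their own school's distribution of marks.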
