Numerical Categorical
Warning!
It is not the variable name itself that determines whether the data are numerical or categorical; it is the way
the data for the variable are recorded.
For example:
‘weight’ recorded in kilograms, is a numerical variable
‘weight’ recorded as 1 = underweight, 2 = normal weight, 3 = overweight, is a categorical variable
CATEGORICAL DATA
Categorical data are displayed in either bar charts or segmented bar charts.
FREQUENCY TABLE
Climate Types In Various Countries
                Frequency
Climate type    Count    Percentage
Cold            5
Moderate        11
Hot             19
Total           35       100.0
Working area for finding the percentages:
percentage frequency = (count ÷ total count) × 100% (to 1 decimal place)
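The working-area rule can be checked away from the worksheet. The short Python sketch below applies percentage = count ÷ total count × 100 to the three counts in the table; the variable names are illustrative and are not part of the original notes.

```python
# Counts taken from the climate-type frequency table.
counts = {"Cold": 5, "Moderate": 11, "Hot": 19}
total = sum(counts.values())  # total count

# percentage frequency = count / total count * 100, rounded to 1 decimal place
percentages = {climate: round(100 * count / total, 1)
               for climate, count in counts.items()}

for climate, pct in percentages.items():
    print(f"{climate:<10}{counts[climate]:>5}{pct:>8.1f}")
```

Because each percentage is rounded separately, it is worth checking that the rounded values still add to 100.0, as the Total row requires.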
BAR CHART
[Bar chart: Climate Types]
The bars are parallel.
The lengths are drawn to scale and they represent the frequencies (counts or percentages).
The width of the bars should be the same and the space between them should be the same.
The lengths of the bars show us the pattern of the data. For example, the most common climate is called the mode or modal category.
Report
The climate types of ____ countries were classified as being ‘cold’, ‘moderate’ or ‘hot’.
while
% were found to have a _______________ climate.
1.3 Stem plots
Numerical data are displayed in histograms, stem plots or dot plots. In this section we will consider the
stem-and-leaf plot, which is a way of ordering and displaying a set of data.
It is useful for displaying small to medium sized sets of data (up to about 50 data values).
It retains all the original data values enabling the centre and the range of the distribution to be located
more precisely.
It also enables the clear identification of outliers.
Note: e.g. if n = 20, (20 + 1)/2 = 10.5, so the 10th and 11th terms are in the middle.
Example
The ordered stem plot shows the distribution of test marks of 23 students.
a. Name its shape and note outliers (if any).
Report
The distribution of the Test 1 marks is _______________________________ while the
distribution of the Test 2 marks is _______________________________ . The two distributions
have similar centres; ________________________ . The spread of the Test 1 marks is
__________________ the Test 2 marks; ________________________. There are no outliers.
1.4 Dot plots, frequency tables and histograms, and bar charts
STACKED or SEGMENTED BAR CHARTS
Percentage segmented bar charts are particularly useful when analysing the relationship between two
categorical variables.
A segmented bar chart with actual values
[Segmented bar chart: Climate Types]
The lengths of the segments represent the frequencies of the different values of the variable.
The height of the bar gives the total frequency.
A segmented bar chart using percentage frequency
[Segmented bar chart: Climate Types]
The lengths of the segments represent the percentages of the different values of the variable.
The height of the bar gives the total percentage (100%).
DISCRETE DATA
Family size    Frequency (count)
2              1
3              5
4              3
5              2
Total          11
CONTINUOUS DATA
Family size    Frequency (count)
30.0 – 34.9    1
35.0 – 39.9    6
40.0 – 44.9    8
45.0 – 49.9    5
50.0 – 54.9    3
Total          23
Example
The histogram shows the distribution of the number of phones per
1000 people in 85 countries.
Report
For the 85 countries, the distribution of the number of phones per 1000 people is
______________________________ . The centre of the distribution lies somewhere in the
interval ______________________________. The spread of the distribution is
______________________________. There are no outliers.
(List the outliers here if there are any.)
Using a CAS Calculator to display UNIVARIATE DATA in a HISTOGRAM
UNGROUPED DATA
Example
The following data lists the number of hours of part-time work undertaken by 16 students in a week.
4, 0, 3, 1, 3, 2, 3, 4, 1, 3, 2, 5, 3, 2, 1, 0
Using a CAS calculator, make a histogram for the data set.
Step 1
In the Statistics app, change list1 to ‘hours’ by tapping list1 and typing
‘hours’.
(Note: You do not have to change the column headings.)
Enter the data in the ‘hours’ column.
Step 2
To draw the histogram, tap:
SetGraph
Setting
Draw: ON
Type: Histogram
XList: main\hours
Freq: 1
Set
Set: H Start: 0
H Step: 1
OK
Note: Interval marks should be in the middle of the bar.
GROUPED DATA
Example
Using a CAS calculator, make a histogram for the following data set.
Family size    Frequency (count)
2              1
3              5
4              3
5              2
Total          11
Step 1
In the Statistics app, change list1 to ‘family’ and list2 to ‘freq’.
(Note: You do not have to change the column headings.)
Enter the data into the two columns.
Step 2
To draw the histogram, tap:
SetGraph
Setting
Draw: ON
Type: Histogram
XList: main\family
Freq: main\freq (in place of the default 1)
Set
Set: H Start: 0
H Step:
Note: Interval marks should be in the middle of the bar.
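As a cross-check on the two calculator examples, the frequencies behind each histogram can be tallied in Python. This is only a sketch: the names hours, hours_freq and family_freq are illustrative, and the rows of # characters stand in for the calculator's bars.

```python
from collections import Counter

# Ungrouped data: hours of part-time work for 16 students.
hours = [4, 0, 3, 1, 3, 2, 3, 4, 1, 3, 2, 5, 3, 2, 1, 0]
hours_freq = Counter(hours)  # value -> number of times it occurs

# Grouped data: family size -> count, as given in the frequency table.
family_freq = {2: 1, 3: 5, 4: 3, 5: 2}


def text_histogram(freq):
    """Return one line per value, with a bar of '#' as long as its count."""
    return [f"{value:>3} | {'#' * freq[value]}" for value in sorted(freq)]


for line in text_histogram(hours_freq):
    print(line)
for line in text_histogram(family_freq):
    print(line)
```

The ungrouped list needs to be counted first, whereas the grouped table already supplies the counts; this mirrors the difference between Freq: 1 and Freq: main\freq on the calculator.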
Using Log (base 10) scales
Many numerical variables that we deal with in statistics have values that range over several orders of
magnitude. For example, the populations of countries range from a few thousand to hundreds of thousands,
to millions, to hundreds of millions, to just over 1 billion. Constructing a histogram on an ordinary (linear)
scale that effectively locates every country on the plot is practically impossible.
One way to solve this problem is to use a scale that spreads out the countries with small populations and
‘pulls in’ the countries with huge populations.
A scale that will do this is called a logarithmic scale (or, more commonly, a log scale).
An introduction to Logs
To understand logs you need to understand powers of 10:
a) 10^1 = 10
b) 10^2 = 100
c) 10^3 = 1000
These can all be written in log form:
a) log10(10) = 1
b) log10(100) = 2
c) log10(1000) = 3
Using your calculator, find the following correct to three decimal places:
a) 10^3.201
b) 10^1.476
c) 10^0.587
However, when a log scale is used, their weights are much more evenly spread along the scale. The
distribution is now approximately symmetric, with no outliers, and the histogram is considerably more
informative.
We can now see that the percentage of animals with weights between 10 and 100 kg is similar to the
percentage of animals with weights between 100 and 1000 kg.
Example
The weights of 27 animal species (in kg) are recorded below.
1.4 470 36 28 1.0 12000 2600 190 520 10 3.3 530 210 62 6700 9400 6.8 35 0.12 0.023 2.5 56
100 52 87000 0.12 190
Construct a histogram to display the distribution:
a) of the body weights of these 27 animals and describe its shape
b) of the log body weights of these animals and describe its shape.
For part b
Enter the data into the Statistics section in List1
Under List 2 tap in the Cal section Then Tap in the Cal box
Describe the shape of the distribution.
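Part b can also be explored without the calculator. The sketch below takes log (base 10) of each of the 27 weights and counts how many fall in each whole-number interval of the log scale. The bin width of 1 and the variable names are assumptions made for illustration; the notes do not say which intervals the calculator plot uses.

```python
import math
from collections import Counter

# Body weights (kg) of the 27 animal species from the example.
weights_kg = [1.4, 470, 36, 28, 1.0, 12000, 2600, 190, 520, 10, 3.3, 530,
              210, 62, 6700, 9400, 6.8, 35, 0.12, 0.023, 2.5, 56, 100, 52,
              87000, 0.12, 190]

log_weights = [math.log10(w) for w in weights_kg]

# Bin k holds weights with k <= log10(weight) < k + 1,
# i.e. weights from 10^k kg up to (but not including) 10^(k+1) kg.
log_bins = Counter(math.floor(v) for v in log_weights)

for k in sorted(log_bins):
    print(f"{k:>2} to {k + 1:>2} | {'#' * log_bins[k]}")
```

With these bins the 10 to 100 kg and 100 to 1000 kg intervals each hold 7 of the 27 animals, which agrees with the earlier remark that the two percentages are similar.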
1.5 Describing the shape of stem plots and histograms
WHAT TO LOOK FOR IN A HISTOGRAM
SHAPE
Symmetric distributions
CENTRE
Consider the centre to be the middle of the distribution.
SPREAD
For a histogram, the spread of the distribution is given by the range.
Range = largest value – smallest value
Stem plot shapes
WHICH GRAPH?
1.6 The median, the interquartile range, the range and the mode
Summary statistics are classified as
Measures of centre – locating the middle of the distribution (mean, median and mode)
Measures of spread – how the values are spread out (interquartile range, range and standard deviation)
The median, range and interquartile range (IQR) are all summary statistics based on ordering the data.
THE MEDIAN (Q2)
The median is the midpoint of a distribution.
50% of values in the data set are less than or equal to the median.
The data must be arranged in numerical order before finding the position of the median.
Rule
For n values, the median lies in the ((n + 1)/2)th position, i.e. the ((the number of data values + 1)/2)th position.
When n is an even number: if n = 20, (20 + 1)/2 = 10.5, so the 10th and 11th terms are in the middle and you will need to
calculate the average of these values.
THE RANGE
The range gives the spread of the distribution.
Rule: range = largest data value − smallest data value
4 9 11 13 17 23 30
Median =
Range =
2 7 13 14 17 19 21 25 29
Median =
Range =
3 6 10 12 15 21
Median =
Range =
1 3 9 10 15 17 21 26
Median =
Range =
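The two rules can be written as a short Python sketch and tried on one odd-sized and one even-sized data set from the exercise. The function names median and data_range are illustrative only.

```python
def median(values):
    """Middle value of the ordered data; the average of the two middle values when n is even."""
    ordered = sorted(values)          # the data must be in numerical order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: position (n + 1)/2 is a whole number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def data_range(values):
    """range = largest data value - smallest data value"""
    return max(values) - min(values)


odd_set = [4, 9, 11, 13, 17, 23, 30]   # n = 7, median in position (7 + 1)/2 = 4
even_set = [3, 6, 10, 12, 15, 21]      # n = 6, median in position (6 + 1)/2 = 3.5

print(median(odd_set), data_range(odd_set))
print(median(even_set), data_range(even_set))
```

The remaining two data sets can be checked the same way after working them by hand.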
STEM PLOT
Why is the IQR a more useful measure of spread than the range?
The IQR is a measure of spread of a distribution that includes the middle 50% of observations. Since the
upper 25% and lower 25% of observations are discarded, the interquartile range is generally not affected by
the presence of outliers/extreme values. This makes it a more useful measure of spread than the range.
Use one of the two terms, outlier or extreme value, according to the data presented.
If you can, state the outlier or extreme value in your data set.
Note: To be an outlier, you must have calculated that the data value is more than 1.5 × IQR below Q1 or above Q3.
Do not assume this.
FIVE-NUMBER SUMMARY
List the data values in the order: Minimum (Q0), Q1, M (Q2), Q3, Maximum (Q4)
1.7 Boxplots
A box-and-whisker plot is a graphical version of a five-number summary.
[Figure: box plot labelled with the two whiskers, the interquartile range and a number line with a suitable scale]
The number line is drawn so that it covers all the values in the set of data.
The rectangular ‘box’ represents the interquartile range (the middle half of the data).
The ‘box’ is drawn from the lower quartile (QL or Q1 ) to the upper quartile (QU or Q3 ).
The vertical line drawn in the ‘box’ represents the median (middle value).
A whisker extends from each end of the ‘box’ to the lowest and highest values.
OUTLIERS
Outliers are data values which are found outside the main group of the data.
A data value is an outlier if it lies more than 1.5 × IQR below Q1 or above Q3, that is, if a whisker drawn out to it
would be more than 1.5 times the width of the box.
The whiskers in a box plot are extended to the last actual data values before the outliers.
[Figure: box plot with outliers marked *]
Lower fence =
Upper fence =
Five number summary: Minimum, Q1, M, Q3, Maximum
Example
Stem | Leaf
0 | 7
1 |
2 |
3 | 2 2 4 6 8 9
4 | 0 2 3 6
5 | 4 6
6 | 5
7 |
8 | 3 9
Key: 1|5 = 15
Interquartile range =
Lower fence =
Upper fence =
Outlier(s):
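The fence calculation for this stem plot can be sketched in Python. One assumption is labelled in the code: the quartiles are taken as the medians of the lower and upper halves of the ordered data, which is a common classroom convention but is not stated in these notes. The variable names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


# Values read from the stem plot (key: 1|5 = 15).
values = [7, 32, 32, 34, 36, 38, 39, 40, 42, 43, 46, 54, 56, 65, 83, 89]
ordered = sorted(values)
n = len(ordered)

# ASSUMED convention: Q1 and Q3 are the medians of the lower and upper halves,
# leaving out the middle value when n is odd.
q1 = median(ordered[: n // 2])
q3 = median(ordered[n // 2 + n % 2:])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Under this convention a value counts as an outlier only if it falls strictly outside the fences, so a value that looks isolated on the plot still has to be tested against them.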
Using a CAS Calculator to determine the FIVE-NUMBER SUMMARY and display a BOX PLOT
UNGROUPED DATA
Example
a. Using a CAS calculator, display the five-number summary for the
following data set.
3 6 4 8 17 12 9 7 13 13 5 9 7 2 1 7 5 4 2
Step 1
In the Statistics app, enter the data in list1.
Step 2
To obtain the five-number summary, tap:
Calc
One-Variable
XList: list1
Freq: 1
OK
Five-number Summary
Minimum value Q0
Q1
Median Q2
Q3
Maximum value Q4
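The calculator output can be compared with Python's standard library. The sketch below uses statistics.quantiles with method="exclusive"; that choice is an assumption, since the calculator's own quartile rule is not stated in these notes and different tools can give slightly different quartiles for the same data.

```python
import statistics

data = [3, 6, 4, 8, 17, 12, 9, 7, 13, 13, 5, 9, 7, 2, 1, 7, 5, 4, 2]

# ASSUMPTION: the "exclusive" method is used for the quartile positions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

# Minimum (Q0), Q1, Median (Q2), Q3, Maximum (Q4)
five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)
```

For this data set of 19 values the quartile positions fall on whole numbers (the 5th, 10th and 15th ordered values), so no interpolation is involved.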
b. Using a CAS calculator, make a box plot for the data set.
Step 1
In the Statistics app, enter the data in list1.
Step 2
To draw the box plot, tap:
SetGraph
Setting
Draw: ON
Type: MedBox
XList: list1
Freq: 1
Show Outliers – tap the box to add a tick
Set
Step 3
Report
The distribution is _______________________________ with no outliers. The distribution is
centred at ________ , the median value. The spread of the distribution, as measured by
the IQR, is ________ , and, as measured by the range, ________ .
b. Using a box plot to describe a distribution with outliers
Report
The distribution is ______________________________ but with outliers. The distribution is
centred at ________ , the median value. The spread of the distribution, as measured by
the IQR, is ________ , and, as measured by the range, ________. There are four outliers:
__________________________ .
Report
The distributions of age at marriage are ______________________________ for both men
and women. There are no outliers. The median age for marriage is higher for men
(______________) than women (______________). The IQR is also greater for men
(______________) than women (______________). The range of age at marriage is also
greater for men (______________) than women (______________).
ii. Comment on how the age at marriage of men compares to women for the data.
For this group of men and women, the men, on average, married at an older age, and the age at
which they married was more variable.
Example
Find the mean of the following data set: 3, 6, 10, 12, 16, 21
Example
Athletes were asked to take their heart rate after completing a marathon, and their results are tabulated
below. Find their mean heart rate.
Heart rate     Midpoint    Frequency
100 -< 110                 10
110 -< 120                 15
Total
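Both mean calculations can be sketched in Python. For the grouped data, each class interval is represented by its midpoint, and only the two intervals shown in the table are used. The variable names are illustrative.

```python
# Mean of raw data: sum of the values divided by the number of values.
values = [3, 6, 10, 12, 16, 21]
mean_raw = sum(values) / len(values)

# Grouped data: (lower bound, upper bound, frequency) for each class interval.
heart_rate_classes = [(100, 110, 10), (110, 120, 15)]

total_frequency = sum(freq for _, _, freq in heart_rate_classes)

# Each interval contributes midpoint * frequency to the total.
weighted_sum = sum((low + high) / 2 * freq
                   for low, high, freq in heart_rate_classes)
mean_grouped = weighted_sum / total_frequency

print(round(mean_raw, 2), mean_grouped)
```

The grouped mean is an estimate, because the midpoint stands in for every value in its interval.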
The following example demonstrates the steps used to calculate the standard deviation.
Example
The hand span (to the nearest cm) of a group of 10 students was recorded:
15, 14, 17, 19, 14, 14, 13, 16, 21, 16
Find the standard deviation.
Steps to find the standard deviation
1. Calculate the mean.
2. Find the difference between each value and the mean.
3. Square each of these.
4. Add the squared differences.
5. Divide this sum by the number of results less one.
6. Take the square root.
The rule to find the standard deviation follows; however, we will be using the CAS calculator.
s = √( Σ(x − x̄)² / (n − 1) )
Mean
x̄ = (15 + 14 + 17 + 19 + 14 + 14 + 13 + 16 + 21 + 16) / 10 = 15.9
x | x − x̄ | (x − x̄)²
15
14
17
19
14
14
13
16
21
16
Total =
s² = 6.32
s = 2.51
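The six steps can be followed line by line in Python as a check on the table. This is a sketch with illustrative variable names; the last line compares the result with the sample standard deviation from the standard library.

```python
import math
import statistics

hand_spans = [15, 14, 17, 19, 14, 14, 13, 16, 21, 16]

mean = sum(hand_spans) / len(hand_spans)             # step 1: the mean
deviations = [x - mean for x in hand_spans]          # step 2: value minus mean
squared = [d ** 2 for d in deviations]               # step 3: square each difference
sum_of_squares = sum(squared)                        # step 4: add the squares
variance = sum_of_squares / (len(hand_spans) - 1)    # step 5: divide by n - 1
s = math.sqrt(variance)                              # step 6: square root

print(round(mean, 1), round(variance, 2), round(s, 2))
print(round(statistics.stdev(hand_spans), 2))        # library cross-check
```

Dividing by n − 1 rather than n gives the sample standard deviation, which is what step 5 describes.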
The standard deviation is a rough guide to how far the results lie from the mean.
About two thirds of the results lie within one standard deviation of the mean.
About 95% of the results lie within two standard deviations of the mean.
Outliers
Any value in a data set that is significantly larger, or smaller, than the other values is called an outlier.
These must be taken into account when describing data, because they can have a misleading effect on the
____________ and ______________ .
Any value more than 2 standard deviations away from the mean is an outlier.
Formula (using standard deviation and mean): values below x̄ − 2s and above x̄ + 2s
The histograms above represent the data from three different components of the IQ test. All sets have a
mean of 100. However, the spread differs. The Verbal component has a smaller spread with a standard
deviation of 7.9 whereas the ‘widest’ spread is seen in the Spatial component with a standard deviation of
31.6.
1.10 Populations and simple random samples
To end up with a sample of data, we need to select the sample from a population. This is best done
randomly to eliminate any possible bias, and to hopefully get a set of data that is representative of the
population.
A random sample has been chosen when every value in the population has an equal chance of being
selected, and when a value is selected it has no influence on whether or not other values would be chosen.
The easiest way to do this is using the ClassPad or Graphics Calculator. Any set of data can have numbers
assigned to each value. Then the calculator chooses a random number, which then tells us which value to
select.
Example:
Select a random sample of 10 “Victorian population” values from all the recorded statistics (1851-2004,
found on page 22).
Calculate the mean and standard deviation of your sample. Compare the population statistics to the sample
statistics using two boxplots.
Year | Population | Year | Population
Population:              Sample:
μ = 2,111,719            x̄ = ____________
σ = 1,366,170.1          s = _____________
Q1 = 1,112,136           Q1 = _______________
Q3 = 3,236,347           Q3 = _______________
Press KEYBOARD
Tap abc
Type randList(10,1851,2004)
Press EXE.
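The same selection can be sketched in Python. Here random.sample draws 10 different years, so no year is repeated; the calculator's randList may behave differently, so treat this as an illustration of the idea rather than a copy of the calculator command. The function name sample_years and the optional seed are hypothetical.

```python
import random

FIRST_YEAR, LAST_YEAR = 1851, 2004


def sample_years(k=10, seed=None):
    """Pick k different years at random; every year has an equal chance of selection."""
    rng = random.Random(seed)  # pass a seed only if a repeatable sample is wanted
    return sorted(rng.sample(range(FIRST_YEAR, LAST_YEAR + 1), k))


years = sample_years(10)
print(years)
```

Each selected year is then looked up in the population table to obtain the sample value.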
Population of Victoria (1851 – 2004)
Population
Sample
Comments:
POPULATION OF VICTORIA (1851-2004)
1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863
97,489 168,321 222,436 283,942 347,305 390,384 456,522 496,146 521,072 538,234 539,764 551,388 567,906
1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876
598,003 617,791 633,602 648,302 671,324 696,762 723,925 746,450 759,428 773,808 786,108 794,934 805,424
1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889
818,935 829,918 841,757 858,605 873,965 892,765 912,453 935,777 959,838 993,717 1,025,476 1,079,077 1,104,938
1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902
1,133,728 1,158,372 1,168,747 1,176,170 1,182,155 1,185,676 1,179,850 1,182,106 1,182,281 1,188,541 1,196,213 1,209,900 1,208,231
1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915
1,204,742 1,205,608 1,210,421 1,219,832 1,232,807 1,250,449 1,277,022 1,301,408 1,339,893 1,382,553 1,415,416 1,435,188 1,424,445
1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928
1,404,663 1,417,060 1,437,245 1,503,035 1,527,909 1,550,727 1,590,273 1,625,455 1,657,151 1,684,051 1,711,987 1,741,832 1,761,746
1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
1,778,269 1,792,605 1,803,570 1,813,387 1,824,217 1,836,660 1,841,595 1,849,607 1,856,991 1,871,099 1,883,133 1,914,918 1,946,425
1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954
1,962,558 1,981,616 1,997,954 2,015,107 2,039,769 2,062,709 2,108,125 2,168,884 2,237,182 2,299,538 2,366,719 2,416,035 2,477,986
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
2,546,332 2,618,112 2,680,555 2,745,165 2,811,429 2,888,290 2,955,299 3,011,043 3,071,046 3,137,921 3,195,860 3,249,843 3,303,606
1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
3,356,827 3,421,178 3,482,031 3,633,843 3,686,136 3,730,824 3,779,587 3,800,656 3,823,941 3,852,589 3,874,501 3,899,993 3,930,655
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
3,968,398 4,012,687 4,054,498 4,097,640 4,140,421 4,183,842 4,234,945 4,295,300 4,348,225 4,400,707 4,437,479 4,465,415 4,478,835
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
4,500,354 4,539,796 4,579,429 4,615,526 4,661,741 4,713,190 4,770,042 4,830,508 4,883,538 4,936,785 4,992,667
1.11 The 68-95-99.7% rule and z-scores
Using the Mean and Standard Deviation
The mean and standard deviation allow us to estimate the percentage of values within a range of values.
If the data set has a symmetrical distribution which is bell-shaped, then the data has an approximately normal distribution.
This is particularly useful since we can then expect approximately:
68% of the values to lie within one standard deviation of the mean ( x̄ − s ≤ x ≤ x̄ + s )
95% of the values to lie within two standard deviations of the mean ( x̄ − 2s ≤ x ≤ x̄ + 2s )
99.7% of the values to lie within three standard deviations of the mean ( x̄ − 3s ≤ x ≤ x̄ + 3s )
Example 1
Over several months the diameters of apples picked at Aumann’s Orchard were recorded. The mean
diameter was found to be 10 cm, and the standard deviation was 2 cm. Aumann’s will use this data to
estimate the size of their apples in the future.
a) What range of diameters would Aumann’s expect 68% of their apples to be within?
b) What range of diameters would Aumann’s expect 95% of their apples to be within?
c) What range of diameters would Aumann’s expect 99.7% of their apples to be within?
f) Find the percentage of apples with a diameter between 8cm and 16cm.
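The interval parts of this example follow directly from the rule and can be sketched in Python. The sketch assumes the diameters are approximately normal, as the rule requires; the function name within is illustrative.

```python
mean_cm, sd_cm = 10, 2


def within(k):
    """Interval from k standard deviations below the mean to k above it."""
    return (mean_cm - k * sd_cm, mean_cm + k * sd_cm)


range_68 = within(1)    # part a
range_95 = within(2)    # part b
range_997 = within(3)   # part c

# Part f: 8 cm is 1 standard deviation below the mean and 16 cm is 3 above it.
# By symmetry, half of the 68% lies between 8 and 10 cm,
# and half of the 99.7% lies between 10 and 16 cm.
pct_8_to_16 = 68 / 2 + 99.7 / 2

print(range_68, range_95, range_997, round(pct_8_to_16, 2))
```

Splitting each percentage in half works only because a normal distribution is symmetric about its mean.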
Standard z-scores
When two or more normal data sets need to be compared, we translate the data values into a standard
normal distribution.
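The raw data values are converted to z-values (standard scores) using the formula z = (x − x̄)/s, where x̄ is the mean
and s is the standard deviation of the data set. These z-values are used to compare the values from one data set to
another (see the example below).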
Example 2
The Schools Assessment Board needs to compare the SAC result of Tom, from Alfred Deakin High School,
and Julia, from Blackburn High School.
Alfred Deakin’s results for the SAC had an average of 25 and a standard deviation of 4, whilst Blackburn’s
average was 33 with a standard deviation of 2.
Visually:
Who actually scored better? (considering that teachers may mark differently from school to school)
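A z-score comparison of this kind can be sketched in Python using z = (x − mean) ÷ standard deviation. The school means and standard deviations come from the example, but the two raw scores below are HYPOTHETICAL values chosen only to illustrate the calculation; they are not the students' actual results.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# School statistics from Example 2.
deakin_mean, deakin_sd = 25, 4
blackburn_mean, blackburn_sd = 33, 2

# HYPOTHETICAL raw scores, for illustration only (not from the notes).
tom_score = 30
julia_score = 35

z_tom = z_score(tom_score, deakin_mean, deakin_sd)            # (30 - 25) / 4
z_julia = z_score(julia_score, blackburn_mean, blackburn_sd)  # (35 - 33) / 2

print(z_tom, z_julia)
```

With these made-up scores the student with the lower raw mark has the higher z-score, which shows why raw marks from different schools should not be compared directly.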