
# Further mathematics: Summary sheets (Core only)

Copyright itute.com 2006. Do not reproduce by other means.

Data analysis

Univariate data (i.e. data of a single variable)

Categorical data: e.g. nationality, language, transportation, profession, sport, blood type.
Discrete numerical data: e.g. petrol price/litre, number of students in a class, road toll/year.
Continuous numerical data: e.g. monthly rainfall, daily maximum temperature, height of a tree.

Segmented bar charts: to display the relationship between two categorical variables, e.g. voting intention and gender.

[Figure: segmented bar chart of gender (F, M) against voting intention (Lib, Lab, Gre)]

Scatterplots: to display the association between two numerical variables, e.g. height and weight.

[Figure: scatterplot of weight w against height h]

Summary statistics: mean $\bar{x}$ and standard deviation $s$ for describing the centre and spread of numerical data, e.g.

| $x$ | $f$ | $fx$ | $fx^2$ |
|---|---|---|---|
| 2 | 3 | 6 | 12 |
| 5 | 5 | 25 | 125 |
| 7 | 4 | 28 | 196 |
| 8 | 2 | 16 | 128 |
| | $n = \sum f = 14$ | $\sum fx = 75$ | $\sum fx^2 = 461$ |

$$\bar{x} = \frac{\sum fx}{n} = \frac{75}{14} = 5.36, \qquad s = \sqrt{\frac{1}{n-1}\left(\sum fx^2 - n\bar{x}^2\right)} = \sqrt{\frac{1}{13}\left(461 - 14 \times 5.36^2\right)} = 2.13$$

$\bar{x}$ and $s$ are good measures of the centre and the spread provided the data set has no extreme outliers.
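The mean and standard deviation formulas above can be checked with a short Python sketch. The variable names are illustrative, and the x = 2, f = 3 row is an inference: it is the only row consistent with the stated totals n = 14, Σfx = 75 and Σfx² = 461.

```python
import math

# Frequency table as {value: frequency}.
# Assumption: the x = 2, f = 3 row is inferred from the stated totals
# (n = 14, sum of fx = 75, sum of fx^2 = 461).
freq = {2: 3, 5: 5, 7: 4, 8: 2}

n = sum(freq.values())                              # number of data points
sum_fx = sum(f * x for x, f in freq.items())        # sum of f*x
sum_fx2 = sum(f * x * x for x, f in freq.items())   # sum of f*x^2

mean = sum_fx / n
# Sample standard deviation: sqrt((sum fx^2 - n * mean^2) / (n - 1))
s = math.sqrt((sum_fx2 - n * mean ** 2) / (n - 1))

print(n, sum_fx, sum_fx2)
print(round(mean, 2), round(s, 2))
```

Keeping the mean unrounded inside the formula avoids carrying rounding error into s; to two decimal places the result agrees with the worked values.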
Bar-charts, segmented bar-charts, pie-charts are suitable for displaying categorical data.

Stemplots are suitable for discrete numerical data.

| Stem | Leaves |
|---|---|
| 2 | 0 0 3 7 8 |
| 3 | 2 3 3 6 9 9 9 |
| 4 | 1 1 1 7 7 8 |

The same data with split stems:

| Stem | Leaves |
|---|---|
| 2 | 0 0 3 |
| 2 | 7 8 |
| 3 | 2 3 3 |
| 3 | 6 9 9 9 |
| 4 | 1 1 1 |
| 4 | 7 7 8 |

2 | 0 represents 2.0

Describe a scatterplot in terms of direction (positive or negative association), form (linear or non-linear association) and strength (strong, moderate or weak association).
Correlation is a term used in statistics for association or connection.
Estimation of Pearson's product-moment correlation coefficient r from a scatterplot: for the scatterplot of weight against height, $r \approx +0.5$.

Median and interquartile range are more appropriate measures of the centre and spread if the data set contains extreme outliers.
Arrange the data in ascending order. Median is the middle number (or the average of the middle two) of the arranged data. The middle of the half before the median is called the lower quartile $Q_L$, the middle of the other half after the median is the upper quartile $Q_U$. Interquartile range $IQR = Q_U - Q_L$.
These statistics can be presented in the form of a boxplot.

[Figure: boxplot marking outliers, min., $Q_L$, med., $Q_U$ and max.]
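The median and quartile procedure described above can be sketched in Python as follows, using the 18 values read off this sheet's stemplot (2 | 0 represents 2.0). The function names are illustrative. Leaving the median out of both halves when the number of values is odd is an assumed convention; calculators and software may use a different quartile rule.

```python
def median(values):
    """Middle number, or the average of the middle two, of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q_L, median, Q_U, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # half before the median
    upper = ordered[n - half:]    # half after the median
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


# Values from the stemplot on this sheet.
data = [2.0, 2.0, 2.3, 2.7, 2.8,
        3.2, 3.3, 3.3, 3.6, 3.9, 3.9, 3.9,
        4.1, 4.1, 4.1, 4.7, 4.7, 4.8]

minimum, q_l, med, q_u, maximum = five_number_summary(data)
iqr = q_u - q_l
print(minimum, q_l, round(med, 2), q_u, maximum, round(iqr, 2))
```

The five values returned are exactly those marked on a boxplot, so the same function could feed a plotting routine.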
Frequency histograms are suitable for both discrete and continuous numerical data when they are grouped into class intervals.

[Figure: frequency histogram of frequency f against x]

Terms used to describe a set of data: symmetrical, outliers, positively skewed, negatively skewed.

[Figure: histograms illustrating symmetrical, outliers, positively skewed and negatively skewed data]

[Figure: scatterplots with $r = +1$, $r = 0$ and $r \approx -0.7$]

r is always a number between $-1$ and $+1$. It measures the strength of the linear association between two variables. $r = 0$ shows that there is no linear association and it suggests that the two variables are independent of each other. A positive (negative) value of r shows an increasing (decreasing) trend. $r = \pm 1$ shows a definite linear association.

For bell shaped data sets:
68% of the data lies within $\bar{x} \pm 1s$,
95% within $\bar{x} \pm 2s$,
99.7% within $\bar{x} \pm 3s$.

A data point is an outlier if it is less than $Q_L - 1.5 \times IQR$ or greater than $Q_U + 1.5 \times IQR$.

Bell shaped data sets are modelled by the normal distribution. Values (x) from different data sets are converted to standard z scores for comparison.

$$z = \frac{x - \bar{x}}{s}$$
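A small Python sketch of the z score conversion and the 1.5 × IQR outlier rule stated above. The function names and all numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def z_score(x, mean, s):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / s


def outlier_fences(q_l, q_u):
    """Lower and upper limits from the 1.5 x IQR rule."""
    iqr = q_u - q_l
    return q_l - 1.5 * iqr, q_u + 1.5 * iqr


def is_outlier(x, q_l, q_u):
    """True if x is below the lower limit or above the upper limit."""
    low, high = outlier_fences(q_l, q_u)
    return x < low or x > high


# Hypothetical marks: 70 in a test with mean 60 and s = 8,
# and 75 in a test with mean 70 and s = 10.
z_first = z_score(70, 60, 8)
z_second = z_score(75, 70, 10)
print(z_first, z_second)

# Hypothetical quartiles Q_L = 20 and Q_U = 30.
print(outlier_fences(20, 30), is_outlier(50, 20, 30))
```

Comparing the two z scores rather than the raw marks shows which result is further above its own mean in units of standard deviation.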
Modal class is the interval with the highest frequency. Two or more modal classes are possible in the same data set. Mode is a value with the highest frequency in a data set.

Cumulative frequency distribution and curve (or ogive):

| Speed v (km h$^{-1}$) | Cumul. freq. |
|---|---|
| <40 | 0 |
| <50 | 6 |
| <60 | 19 |
| <70 | 40 |
| <80 | 75 |
| <90 | 100 |
| <100 | 110 |

[Figure: ogive of cumulative frequency (0 to 110) against speed v (40 to 100)]

Bivariate data (i.e. data for two variables)
Dependent (response) variable on the y-axis and independent (explanatory) variable on the x-axis for graphing.

Back-to-back stemplot: to display relationship between a numerical variable (e.g. age) and a two-valued categorical variable (e.g. gender), e.g.

| F | Stem | M |
|---|---|---|
| 3 0 | 3 | 1 1 4 5 |
| 5 5 2 | 4 | 3 6 6 8 9 |
| 7 4 2 1 | 5 | 2 2 7 |

0 | 3 | 1 represents 30 yrs old F, 31 yrs old M.

Parallel boxplots: to display relationship between a numerical variable (e.g. exam scores) and a two or more level categorical variable (e.g. subjects), e.g.

[Figure: parallel boxplots of scores for English, Science and Maths]

Extreme outliers in the data set can result in a misleading value for r.

A strong correlation between two variables does not imply that a change in one variable causes a change in the other because there may be a third variable that influences both in the same way. A strong correlation only allows prediction of one variable from the other.

Coefficient of determination $r^2$: e.g. $r = 0.95$, $r^2 = 0.90$. The interpretation of $r^2 = 0.90$ is that 90% of the variation of one variable with the other can be explained by the relationship that exists between them.

Example. Refer to the scatterplot for weight and height. $r = 0.5$, $r^2 = 0.25$, hence 25% of the variation of weight w with height h can be explained by the linear relationship between w and h.

Regression encompasses analysis of data in order to develop a mathematical relationship (equation) between variables and exploration of the relationship to make predictions.
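The sheet estimates r by eye from a scatterplot. As a complement, here is a hedged Python sketch that computes r from data using the standard product-moment definition (not spelled out on the sheet) and then squares it to get the coefficient of determination. The height and weight values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient (standard definition)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 165, 170, 180]
weights = [52, 58, 66, 63, 75]

r = pearson_r(heights, weights)
r_squared = r ** 2   # coefficient of determination
print(round(r, 2), round(r_squared, 2))

# The two r values quoted on the sheet, squared.
for quoted_r in (0.95, 0.5):
    print(quoted_r, round(quoted_r ** 2, 2))
```

Multiplying r² by 100 gives the percentage of variation explained, as in the weight and height example.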
Methods to find regression line $y = a + bx$.

(1) Fitting line by eye: Draw line of best fit. Mark two convenient points on the line as far apart as possible. Use the coordinates of the two marked points to calculate the slope of the line, $b = \frac{y_2 - y_1}{x_2 - x_1}$. a is the y-intercept and can be found from the scatterplot, or by substituting the coordinates of one of the marked points in the equation.

(2) The three-median line: Divide the data into three groups (left L, middle M, right R) of equal number of points if possible, otherwise a single extra point goes to the middle or put one in each outer region for two extra points. For each group find the medians for the x and y values; they form the summary points $(x_L, y_L)$, $(x_M, y_M)$ and $(x_R, y_R)$. Then calculate slope

$$b = \frac{y_R - y_L}{x_R - x_L}$$

and y-intercept

$$a = \frac{1}{3}\left[(y_L + y_M + y_R) - b(x_L + x_M + x_R)\right]$$

Draw the three-median line by placing a ruler in alignment with $(x_L, y_L)$ and $(x_R, y_R)$ and then sliding the ruler vertically one-third of the way towards $(x_M, y_M)$.

Transformation of some forms of non-linear data to linearity:

Square transformation: $y = a + b(x^2)$ (plot y against $x^2$) or $(y^2) = a + bx$ (plot $y^2$ against x).

Log transformation: $y = a + b(\log x)$ (plot y against $\log x$) or $(\log y) = a + bx$ (plot $\log y$ against x).

Seasonal pattern exists in many time series. To study long term trends, remove the seasonal effects from the data. The process of doing so is called deseasonalising the data and the resulting figures are called seasonally adjusted figures. The estimates of seasonal effects are called seasonal indices.

Example

| Year | Q1 | Q2 | Q3 | Q4 | Quart. av.* |
|---|---|---|---|---|---|
| 1996 | 68 | 70 | 64 | 55 | 64.25 |
| 1997 | 65 | 67 | 64 | 55 | 62.75 |
| 1998 | 64 | 66 | 64 | 55 | 62.25 |
| 1999 | 61 | 65 | 59 | 53 | 59.50 |
| 2000 | 60 | 64 | 59 | 52 | 58.75 |

*Quarterly average = (Q1 + Q2 + Q3 + Q4) ÷ 4

Divide each quarterly entry by the quarterly average for that year to obtain the following table.

| Year | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| 1996 | 1.0584 | 1.0895 | 0.9961 | 0.8560 |
| 1997 | 1.0359 | 1.0677 | 1.0199 | 0.8765 |
| 1998 | 1.0281 | 1.0602 | 1.0281 | 0.8835 |
| 1999 | 1.0252 | 1.0924 | 0.9916 | 0.8908 |
| 2000 | 1.0213 | 1.0894 | 1.0043 | 0.8851 |
| Seasonal index | 1.0338 | 1.0798 | 1.0080 | 0.8784 |

The entries in the last row of the above table are the seasonal indices for the quarters. The Q1 seasonal index is obtained by taking the average of the 5 entries for the 5 years under Q1, etc. NB. Sum of seasonal indices = 4.
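The seasonal index calculation in this example can be reproduced with a short Python sketch. The function and variable names are illustrative. The sketch averages unrounded ratios, so a result may differ from the sheet's figures in the fourth decimal place.

```python
# Quarterly figures from the example table: year -> [Q1, Q2, Q3, Q4].
quarterly = {
    1996: [68, 70, 64, 55],
    1997: [65, 67, 64, 55],
    1998: [64, 66, 64, 55],
    1999: [61, 65, 59, 53],
    2000: [60, 64, 59, 52],
}


def seasonal_indices(table):
    """Average, over the years, of each quarter divided by that year's average."""
    ratios = []
    for quarters in table.values():
        yearly_average = sum(quarters) / len(quarters)
        ratios.append([q / yearly_average for q in quarters])
    n_years = len(ratios)
    # zip(*ratios) groups the Q1 ratios together, then Q2, and so on.
    return [sum(column) / n_years for column in zip(*ratios)]


indices = seasonal_indices(quarterly)
print([round(i, 4) for i in indices])
print(round(sum(indices), 4))   # the indices for four quarters sum to 4
```

Dividing an actual figure by the index for its quarter then gives the seasonally adjusted figure.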
Deseasonalised figure = actual figure ÷ seasonal index

The following table shows the deseasonalised figures, which can be analysed for long term trends without the effects of seasonal factors.

| Year | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| 1996 | 65.78 | 64.83 | 63.49 | 62.61 |
| 1997 | 62.87 | 62.05 | 63.49 | 62.61 |
| 1998 | 61.91 | 61.12 | 63.49 | 62.61 |
| 1999 | 59.01 | 60.20 | 58.53 | 60.34 |
| 2000 | 58.04 | 59.27 | 58.53 | 59.20 |

Random variations or seasonal effects in time series can also be removed by smoothing procedures, e.g. 3-moving median smoothing.

Reciprocal transformation: $y = a + b\left(\frac{1}{x}\right)$ (plot y against $\frac{1}{x}$) or $\frac{1}{y} = a + bx$ (plot $\frac{1}{y}$ against x).

Residual plots: For displaying quality of fit.

[Figure: residual plot of residual against x]

(3) Estimation of line of best fit $y = a + bx$ from scatterplot by formulas $b = r\frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$.

(4) The least squares line: is the line for which the sum of the squares of the vertical deviations is a minimum. Use a graphics calculator to find a and b in $y = a + bx$.

Slope b represents the rate of change in y as x increases, i.e. how much y changes when x increases by 1. Intercept a is the y value when $x = 0$.

Extrapolation: Using the regression line to make a prediction beyond the data set.
Interpolation: Using the regression line to insert a value within the data set.
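The formulas $b = r\frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$ can be sketched in Python as below. This slope is the same as the least squares slope. The four data points are the (x, y) values from this sheet's residual example. The function names, the Pearson formula used for r, and the x values chosen for prediction are illustrative assumptions.

```python
import math


def fit_line(xs, ys):
    """Return (a, b) for y = a + b*x using b = r * s_y / s_x, a = mean_y - b * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_x = math.sqrt(sxx / (n - 1))    # sample standard deviation of x
    s_y = math.sqrt(syy / (n - 1))    # sample standard deviation of y
    r = sxy / math.sqrt(sxx * syy)    # Pearson's r (standard definition)
    b = r * s_y / s_x
    a = mean_y - b * mean_x
    return a, b


xs = [1, 2, 3, 4]
ys = [3.7, 5.2, 6.8, 7.2]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))

inside = a + b * 2.5    # interpolation: x lies within the data (1 to 4)
beyond = a + b * 6      # extrapolation: x lies beyond the data
print(round(inside, 3), round(beyond, 2))
```

Predictions made by extrapolation rely on the linear trend continuing outside the observed range, so they deserve more caution than interpolated values.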
Another smoothing procedure is centred 4-moving average smoothing, suitable for removing seasonal effects in patterns that repeat every 4 terms, e.g. quarterly data.

Residual analysis: For checking quality of fit.
Residual is defined as $y - \hat{y}$ where $y$ is the observed (actual) value and $\hat{y}$ is the value predicted by a regression line. A regression line for which the sum of the squares of the residuals is smaller is a better fit.
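The comparison by sum of squared residuals can be sketched in Python. The data points and the two candidate lines, $y_1 = 2.6 + 1.3x$ and $y_2 = 2.5 + 1.2x$, are taken from this sheet's worked example; the function names are illustrative.

```python
def residuals(xs, ys, a, b):
    """Observed y minus the value predicted by y = a + b*x, for each point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, a, b):
    """Sum of the squares of the residuals for the line y = a + b*x."""
    return sum(res ** 2 for res in residuals(xs, ys, a, b))


xs = [1, 2, 3, 4]
ys = [3.7, 5.2, 6.8, 7.2]

ss_line1 = sum_squared_residuals(xs, ys, 2.6, 1.3)   # line 1: y = 2.6 + 1.3x
ss_line2 = sum_squared_residuals(xs, ys, 2.5, 1.2)   # line 2: y = 2.5 + 1.2x
print(round(ss_line1, 2), round(ss_line2, 2))

better = "line 1" if ss_line1 < ss_line2 else "line 2"
print(better, "has the smaller sum of squared residuals")
```

The list returned by `residuals` is also what a residual plot displays against x.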
Example. Which of the following regression lines is a better fit for the data set (x, y)?
Line 1: $y_1 = 2.6 + 1.3x$, line 2: $y_2 = 2.5 + 1.2x$

| x | y | $y_1$ | resid | (resid)$^2$ |
|---|---|---|---|---|
| 1 | 3.7 | 3.9 | $-0.2$ | 0.04 |
| 2 | 5.2 | 5.2 | 0 | 0 |
| 3 | 6.8 | 6.5 | $+0.3$ | 0.09 |
| 4 | 7.2 | 7.8 | $-0.6$ | 0.36 |
| | | | | $\sum = 0.49$ |

| x | y | $y_2$ | resid | (resid)$^2$ |
|---|---|---|---|---|
| 1 | 3.7 | 3.7 | 0 | 0 |
| 2 | 5.2 | 4.9 | $+0.3$ | 0.09 |
| 3 | 6.8 | 6.1 | $+0.7$ | 0.49 |
| 4 | 7.2 | 7.3 | $-0.1$ | 0.01 |
| | | | | $\sum = 0.59$ |

$y_1 = 2.6 + 1.3x$ is a better fit.

Check for accuracy before use.
Please inform mathlinE (mathline@itute.com) re typing or mathematical errors.

A random residual plot suggests the regression line (representing a certain relationship) is a good fit. If the residual plot shows a pattern, then another relationship (depending on the pattern) may result in a better fit.

Qualitative analysis of time series:
(1) Trend pattern: increase or decrease in the data over time.
(2) Seasonal pattern: variation of the data due to seasonal factors, e.g. the month of the year, the day of the week or the hour of the day.
(3) Cyclic pattern: longer term (years) fluctuations in the data.
(4) Random pattern: unpredictable short term variations of the data about a constant value.

How to distinguish seasonal from cyclic pattern: Seasonal pattern has a constant length in the cycle and constant magnitude in the variation. Cyclic pattern has a changing cycle length and magnitude from cycle to cycle.

Centred 4-moving average of the quarterly data:

| Year | Qtr | x | 4-m. av. | Centred |
|---|---|---|---|---|
| 1996 | 1 | 68 | | |
| | 2 | 70 | | |
| | | | 64.25 | |
| | 3 | 64 | | 63.88 |
| | | | 63.50 | |
| | 4 | 55 | | 63.13 |
| | | | 62.75 | |
| 1997 | 1 | 65 | | 62.75 |
| | | | 62.75 | |
| | 2 | 67 | | 62.75 |
| | | | 62.75 | |
| | 3 | 64 | | |
| | 4 | 55 | | |

For daily seasonal data, use 7-moving average; for monthly data, use 12-moving average.
A regression line can be fitted for the deseasonalised or smoothed data and used to make predictions.
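Centred 4-moving average smoothing can be sketched in Python using the 1996 and 1997 quarterly figures from this sheet. The function names are illustrative, and the values are left unrounded, whereas the sheet quotes them to two decimal places.

```python
def moving_average(values, k):
    """Average of each run of k consecutive values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def centred_moving_average(values, k=4):
    """Average each pair of neighbouring k-moving averages.

    For even k a k-moving average falls between two terms, so averaging
    neighbouring pairs centres the result on an actual term.
    """
    averages = moving_average(values, k)
    return [(first + second) / 2 for first, second in zip(averages, averages[1:])]


# Quarterly figures for 1996 Q1 to 1997 Q4.
x = [68, 70, 64, 55, 65, 67, 64, 55]

four_point = moving_average(x, 4)
centred = centred_moving_average(x, 4)
print(four_point)
print(centred)   # the first entry lines up with the third term (1996 Q3)
```

Calling `moving_average(values, 7)` gives the 7-moving average suggested for daily data. Because 7 is odd, each average already lines up with a term and no centring step is needed. A 12-moving average for monthly data has an even number of terms and would use `centred_moving_average(values, 12)`.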