
ETC1000/ETW1000/ETX9000 Business and Economic Statistics LECTURE NOTES Topic 2: Understanding What is Happening

1. Relating Variables Together

In the diabetes example in Topic 1 we had a contingency table that suggested not doing enough exercise was related to diabetes. With bivariate data we can do a lot to see how the two characteristics (variables) relate to each other. This can be particularly helpful in policy. For example, if we can get some idea of how exercise can affect the incidence of disease, then the government can come up with some strategies to improve public health and social wellbeing (and its budget).

1.1 Scatter Plots

A SCATTER PLOT is a graph that allows us to visually see how two characteristics (variables) relate to each other. It is generally most appropriate for numerical data or data with some natural ordering. Suppose we are interested in another major government responsibility: education. In particular, suppose we are interested in whether investing in education improves income. To draw a scatter plot, we begin with some data. The data we need comes in pairs: a data point for years of education and a data point for income. Here's what the data looks like in our spreadsheet:

We then plot income against education:

Notice that we have put education on the X-axis and income on the Y-axis. In general, the variable or characteristic that we have some control over goes on the X-axis, and the variable we want to influence goes on the Y-axis. You will see later why this is important. We can see from the scatter plot above that someone with 15 years of education earns a lot more than someone with only 9 years of education. As we move along the X-axis, education increases, and so does income. We would say there does seem to be a relationship between education and income, and this relationship is positive. This would suggest that by finishing your degree, you could earn a much higher income. The scatter plot in general is used for exploring questions like: Is there a relationship between X and Y?
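A scatter plot like the one described can be produced in code as well as in Excel. Here is a minimal sketch in Python using matplotlib, with synthetic education/income data standing in for the course spreadsheet (the real dataset is not reproduced here, so the numbers below are made up):

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 50 people with 8-20 years of education.
rng = np.random.default_rng(0)
education = rng.integers(8, 21, size=50)
income = 22000 + 2150 * education + rng.normal(0, 8000, size=50)

fig, ax = plt.subplots()
ax.scatter(education, income)
ax.set_xlabel("Years of Education")  # the variable we can control goes on X
ax.set_ylabel("Income ($)")          # the variable we want to influence goes on Y
fig.savefig("income_vs_education.png")
```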

[Scatter plot: "X and Y related"]

[Scatter plot: "No apparent relationship"]

What direction is the relationship?

[Scatter plot: "Icecream Demand - Positive Relationship", demand plotted against temperature]

[Scatter plot: "Icecream Demand - Negative Relationship", demand plotted against price]

Does the relationship appear to be linear or non-linear?

[Scatter plot: Sunscreen Consumption against Temperature]

[Scatter plot: Electricity Consumption against Temperature]

1.2 Covariance

We can quantify how (i.e. in what direction) two variables move together by a summary measure called the COVARIANCE. The covariance of two data sets, X and Y, is given by the formula:

Cov(X,Y) = Σ (Xi − X̄)(Yi − Ȳ) / n

where the sum runs over i = 1, …, n, and X̄ and Ȳ are the means of X and Y.

Technically speaking, the covariance is the average of the products of paired deviations from the mean in two sets of data. To see what that means, notice that inside the summation operator, the formula involves subtracting the mean of X from each X data point, subtracting the mean of Y from each Y data point, and multiplying each de-meaned pair together. To see how this works visually, first consider where the mean of X and the mean of Y lie in a scatter plot of the data:

[Scatter plot of Y against X showing a positive relationship, with a vertical line drawn at X̄ and a horizontal line at Ȳ]

Drawing lines for the mean of X and the mean of Y, we split the data into 4 quadrants. The de-meaned pairs we sum will be positive or negative depending on which quadrant the data lies in, and the sign of the covariance will be positive or negative accordingly:

    Upper left:  (Xi − X̄) negative, (Yi − Ȳ) positive, so the product is negative
    Upper right: (Xi − X̄) positive, (Yi − Ȳ) positive, so the product is positive
    Lower left:  (Xi − X̄) negative, (Yi − Ȳ) negative, so the product is positive
    Lower right: (Xi − X̄) positive, (Yi − Ȳ) negative, so the product is negative

In the scatter plot above, there is clearly a positive relationship between X and Y. For the data points in the lower left quadrant, (Xi − X̄) is negative and (Yi − Ȳ) is negative, so (Xi − X̄)(Yi − Ȳ) will be positive in each case. For the data points in the upper right quadrant, (Xi − X̄) is positive and (Yi − Ȳ) is positive, so (Xi − X̄)(Yi − Ȳ) will again be positive in each case. Summing mostly positive values together gives a covariance that is positive.

Now consider an example with a negative relationship. In this case, the data will mostly be in the upper left and lower right quadrants.

[Scatter plot of Y against X showing a negative relationship, with mean lines drawn: most points fall in the upper left quadrant, where (Xi − X̄) is negative and (Yi − Ȳ) is positive, or in the lower right quadrant, where (Xi − X̄) is positive and (Yi − Ȳ) is negative]

When (Xi − X̄) is negative, (Yi − Ȳ) is mostly positive, so (Xi − X̄)(Yi − Ȳ) will be negative. Likewise, when (Xi − X̄) is positive, (Yi − Ȳ) will mostly be negative, so (Xi − X̄)(Yi − Ȳ) will again be negative in each case. This gives us a negative covariance overall.

We can calculate the covariance automatically in Excel using:

    =COVAR(range of X data, range of Y data)

e.g. =COVAR(B2:B55, C2:C55)

Or we can bring up the dialog box by going to Data Analysis under the Data tab and choosing Covariance:

The formula for the covariance we have seen so far is technically only appropriate if the data comprises the whole population. In practice, the data we have is a sample rather than the population. In this case, we have to estimate the means as well as the covariance, so the formula is slightly different to take into account the extra uncertainty.

The sample covariance of two random variables, X and Y, is given by the formula:

Cov(X,Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)

where the sum again runs over i = 1, …, n.

Excel does not have an in-built function to calculate the sample covariance, but when n is large, there is practically no difference between the two calculations.

What does the covariance mean? The covariance indicates the direction of the linear association between two variables. Its magnitude, however, has no meaningful interpretation: it is measured in the units of X multiplied by the units of Y, much as the variance of a variable is measured in its squared units. The interesting thing we learn from the covariance is the direction of a relationship: do the two variables tend to move in the same direction or in opposite directions? A positive covariance indicates that when X is big, there is a higher probability of a big Y; similarly, when X is small, there is a higher probability of Y being small as well, i.e. X and Y tend to move together.

e.g. Income and education: when someone has completed a lot of education (e.g. a degree as opposed to completion of VCE), chances are they will earn a higher income. So we'd say education and income have a positive covariance.

e.g. Interest rates and prices of bonds: when interest rates increase, the prices of bonds decrease; conversely, the prices of bonds increase when interest rates decrease. So we'd say that interest rates and bond prices have a negative covariance.
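The two denominators (n for the population formula, n − 1 for the sample formula) can be checked directly in code. A small sketch in Python with made-up numbers (NumPy assumed available); note that NumPy's np.cov uses the n − 1 denominator by default:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 9.0])
n = len(x)

dx, dy = x - x.mean(), y - y.mean()
pop_cov = np.sum(dx * dy) / n         # population formula: divide by n
samp_cov = np.sum(dx * dy) / (n - 1)  # sample formula: divide by n - 1

# NumPy's default matches the sample formula.
assert np.isclose(samp_cov, np.cov(x, y)[0, 1])
# Both values are positive here: x and y tend to move together.
```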

1.3 Correlation

The problem with using the covariance to measure the relationship between two variables is that we can only interpret the direction of the relationship, not its strength. An alternative measure of the relationship is the COEFFICIENT OF CORRELATION. The coefficient of correlation is a standardised measure, in that its values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). If two variables are perfectly correlated, then all the points in a scatter plot would lie on a straight line. Perfect negative correlation:
[Scatter plot: points lying exactly on a downward-sloping straight line]

Perfect negative correlation means that when X is big, Y is small, and it is small in a (perfectly) predictable manner.

No correlation:
[Scatter plot: points scattered with no pattern]

When two variables are uncorrelated, a change in one variable carries no tendency for the other variable to change. Perfect positive correlation:
Y

Perfect positive correlation means that when X is big, Y is also predictably big. In the intervening cases, e.g. the case where there is a coefficient of correlation of -0.8:
[Scatter plot: points loosely following a downward-sloping line]

In this case, we can say that large values of X tend to be paired with small values of Y. The data do not form a complete straight line, though, so it cannot be described as perfect. In the case of income and education, our coefficient of correlation is 0.693, which means that large values of X tend to be paired with large values of Y.

How do we calculate the coefficient of correlation? The formula is:

Corr(X,Y) = Σ (Xi − X̄)(Yi − Ȳ) / √[ Σ (Xi − X̄)² × Σ (Yi − Ȳ)² ]

with each sum running over i = 1, …, n.

The numerator is the same as the numerator of the covariance, but the denominator is different: notice that the denominator is effectively standardising the numerator by removing the units of X and Y. In Excel, we can automatically calculate the correlation using the function: =CORREL(range of X data, range of Y data)
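The standardisation can be verified numerically. A sketch in Python (NumPy assumed available), checking the formula against NumPy's built-in correlation on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 9.0])

dx, dy = x - x.mean(), y - y.mean()
# Numerator as in the covariance; denominator strips the units away.
corr = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

assert np.isclose(corr, np.corrcoef(x, y)[0, 1])
print(round(corr, 3))  # prints 0.946: a strong positive correlation
```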

2. Making Causal Connections

While tools such as scatter plots, covariance and correlation can tell us a great deal about the strength and direction of a relationship between two variables, they do not quantify the effect in a way that would allow us to make informed decisions. For example, we may be interested in quantities:

- Predicting how much more your income could be if you complete your degree rather than finish education at the end of secondary school.
- Establishing the effectiveness of advertising expenditure in increasing market share.
- Estimating how production or service delivery costs vary with certain key factors.
- Predicting how sales / turnover will increase if certain inputs are varied.

Consider again the scatter plot of income on education.


We can call income Y, as that is the factor we want to influence, and education X, as that is the variable we can potentially change. Now let's draw a line roughly through the middle of the dots.

The line represents the average slope of Y with respect to X. The equation for a straight line can be written:

Y = a + bX (you may have seen it written as Y = mX + c)

where b = ΔY / ΔX = rise / run is the slope of the line (the average slope of Y with respect to X). The slope of the line tells us how much Y would change (and in what direction) if X were 1 unit larger. That is, a slope of around 4.047 tells us that if someone were to do 1 more year of education, their income would be about $4,047 higher (income in this plot is measured in thousands of dollars, so 4.047 × 1000). We can develop a mathematical model which exploits this nice interpretation of the slope. The model defines the relationship between X and Y, and allows us to check whether some relationship exists, and then quantify it. The model is a mathematical function which describes how one variable (Y) changes in response to another (X). The model we will use is called the SIMPLE LINEAR REGRESSION MODEL.

2.1 The Simple Linear Regression Model

We write the Simple Linear Regression Model as:

Yi = β0 + β1 Xi + εi

The subscript i denotes the ith observation, such as the ith person, the ith country, the ith firm, etc.

Y is the Dependent Variable (the thing you want to influence).
X is the Independent or Explanatory Variable (the thing you can control).
ε (pronounced epsilon) is the error: how far the observed value of Yi is from the regression line. We need to include ε in our model because not all data points lie on the line, e.g. education does not entirely determine income.

[Scatter plot: Income ($) against Years of Education, with the regression line drawn and the vertical distances from two data points to the line labelled as errors]

Now, the way we have set up the model so far is really only valid when we have data on the whole population of interest (e.g. all Australians, all countries, all firms, etc.). But in reality we never have information on the whole population; rather, we have data on a sample. That's okay. As long as our sample is representative of the population (a non-representative sample would be, for example, one where we select a few firms in a particular industry rather than a representative sample of all industries), we can use the sample data to estimate the true population model. We just make some slight but important notational differences. The true population model is:

Yi = β0 + β1 Xi + εi

An estimate of the true model using our sample data is:

    Yi = b0 + b1 Xi + ei

Now let's go back to our sample scatter plot.


[Scatter plot: Income ($) against Years of Education, with the fitted line drawn; b0 marks the intercept, b1 the slope, and e15 and e23 mark the residuals of two individual observations]

2.2 The Concept of Least Squares

We actually could have drawn a number of possible lines through the plot, each with a different intercept and/or slope. That is, each possible line has a different b0 and/or b1. Which line is the best one? That is, how do we choose b0 and b1?

We need a criterion for deciding the best line. The best line would be the one where the errors are closest to 0. The criterion we use is to minimise the SUM OF SQUARED ERRORS. The aim is to make the errors as small as possible:

    error = ei = Yi − b0 − b1 Xi

But we will have some errors which are positive, and some negative. So we square all the errors, to make them all positive, and then add them up. We then choose the line which minimises the sum of squared errors. This criterion translates into formulas for calculating b0 and b1 with a given set of data. We won't go through the derivation, but it is worth noting the formulas for b0 and b1 that result. First, we have

b0 = Ȳ − b1 X̄.

This tells us that the intercept is a function of the mean of Y and the mean of X. From the scatter plot, this makes sense: the intercept will be somewhere in the middle of the range of Y, adjusted by the range that X takes.

Second, we have

b1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²


If we were to divide both the numerator and the denominator by n − 1 (roughly the number of observations we have), we would not change the value of b1. We'd then have:

    b1 = [Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)] / [Σ (Xi − X̄)² / (n − 1)] = sample Cov(X, Y) / sample Var(X)

So the slope is a function of how much, and in what direction, X and Y vary together, standardised by the overall variation in X. The equation for the regression line, based on sample data, can then be written as:

Ŷi = b0 + b1 Xi

Notice that the estimated equation for the line has a hat on the Y and no error term. We call Ŷ the predicted value of Y, given X. This prediction is essentially an average value of Y for any given X. There is no error term in the prediction because we assume the errors are unpredictable, and any error is averaged out.
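The two least-squares formulas above are easy to verify numerically. A sketch in Python on made-up data (NumPy assumed available), checked against NumPy's own least-squares line fit:

```python
import numpy as np

x = np.array([9.0, 11.0, 12.0, 15.0, 18.0, 20.0])
y = np.array([35.0, 42.0, 50.0, 55.0, 60.0, 70.0])

# Least-squares formulas: the slope first, then the intercept from the means.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

slope, intercept = np.polyfit(x, y, 1)  # NumPy's least-squares line
assert np.isclose(b1, slope) and np.isclose(b0, intercept)
```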


We can get Excel to compute b0 and b1 for us using Data Analysis under the Data tab and choosing Regression. The Excel output for our income/education model looks like this:


From this we end up with a line as follows:

Ŷi = b0 + b1 Xi
Ŷi = 22596 + 2154.3 Xi

or

Incomei = 22596 + 2154.3 × Years of Educationi
This is a model which we can use to help with policy decision-making. e.g. The government decides that the minimum number of years of education an individual should have is 10. What is the salary that an individual with 10 years of education can expect?

We get:

    Ŷi = 22596 + 2154.3 × 10 = $44,139

So, the government can expect an individual educated for 10 years to earn $44,139 annually.

e.g. Stephen has had 15 years of education and earns an annual income of $50,000. Is Stephen earning more or less than an average individual with the same amount of education?

The model gives us an indication of the average income for an individual with X years of education, so we can look to the model to estimate the average income of an individual with 15 years of education. The model predicts that an individual with 15 years of education should earn an annual income of $54,910.50 on average:

    Ŷi = 22596 + 2154.3 × 15 = $54,910.50

Since Stephen's income is lower than this, Stephen is earning less than an average individual with the same amount of education, according to this model.
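The two worked predictions can be wrapped in a small helper function, a sketch using the coefficients from the fitted line in the notes:

```python
def predict_income(years_of_education):
    """Average annual income predicted by the fitted line 22596 + 2154.3 * X."""
    return 22596 + 2154.3 * years_of_education

print(round(predict_income(15), 2))  # prints 54910.5, as in the notes
```

Stephen's $50,000 sits below this predicted average, which is exactly the comparison made above.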

Before going too far on the uses of these models, we need to look more closely at what it all means, and how we can evaluate the Excel output we have obtained.

2.3 Interpreting the Model

We want to understand what the model we have estimated tells us about the nature of the relationship between X and Y. More specifically, the actual estimates b0 and b1 are informative. b0 is known as the intercept: the estimated value of Y when X = 0. b1 is the slope of Y with respect to X: the estimated change in Y for a 1 unit change in X.

e.g. In the above example, we found:

    Incomei = 22596 + 2154.3 × Years of Educationi

We have b0 = 22596. This means that an individual who has zero years of education can expect an annual income of $22,596. This is an example of a case where b0 doesn't have a meaningful interpretation: it does not really make sense to speak of an individual with zero years of education. This often happens with the intercept. Whenever interpreting b0, we need to consider whether the interpretation is sensible.

b1 = 2154.3 is an estimate of the effect of education on income: it tells us how much income would change if education were 1 year higher. In particular, we can say: take 2 people, one of whom has 1 more year of education than the other. The person with 1 more year of education can expect to earn, on average, $2,154.30 more per annum than the person with less education. Being able to specify quantities like this can be very important in policy and planning. In this case, it suggests that if a number of people could be encouraged to stay at school one year longer, then they could earn more income (and the government could receive more tax revenue!).

2.4 Is There Really a Relationship between X and Y?

Usually a regression model is used to consider a factor that affects Y. If you are working as an analyst, it will probably be up to you to decide what factor is important in explaining or causing Y to change. You may have a good idea about what influences Y, but how do you know that X really does cause Y? e.g. We think money spent on training schemes for the unemployed (X) should reduce the unemployment rate (Y). But does it?

Recall β1 is the slope of the line relating Y and X. If there is no relationship between X and Y, then they would not co-vary (i.e. the covariance would be 0), and the slope of the line would be 0:


[Scatter plot: Unemployment Rate (%) against Training Expenditure ($m)]

From this scatter plot, there doesn't seem to be an obvious relationship. Here's Excel's regression output:

We see the estimated coefficient on Training Expenditure is very small (-0.002). Now, β1 is a continuous numerical variable: it could take any value, depending on how many decimal places you went to. Chances are that because we have a sample, we will not get an estimate of β1 = 0 exactly. So how big does b1 need to be before we can say the true slope is nonzero? We can answer this with a HYPOTHESIS TEST for whether the true slope, β1, is actually zero. How do we do this test? There are 4 steps:

1. Formulate Null and Alternative Hypotheses

    H0: β1 = 0  i.e. the null is that there is NO relationship between X and Y.
    H1: β1 ≠ 0  i.e. the alternative is that there IS a relationship between X and Y: the true slope is not 0.


2. Decide a Significance Level

This is a small value we choose ourselves, denoted α. Usually we choose α = 0.05, or 5%.

3. Calculate the p-value

The p-value is a probability; we will go through the meaning and theory underlying p-values in Topic 3. For now, it is a value we calculate to determine whether b1 is close enough to 0. Excel calculates the p-value for us in its regression output. From the output above, the p-value for our estimate of b1 is 0.007272.

4. Make a Decision

To make a decision, we compare the p-value for b1 to our chosen significance level α. The decision rule is to reject H0: β1 = 0 if the p-value < α. That is, if the p-value is smaller than α, we would conclude that there IS a relationship between X and Y. Conversely, if the p-value is bigger than α, we would conclude that there IS NOT a relationship between X and Y. In the above example, since 0.007272 < 0.05, we reject H0 and conclude that the amount of money spent on training schemes DOES affect the unemployment rate. In fact, with b1 negative, we could say that higher training expenditure reduces the unemployment rate. Who would have thought this by looking at the scatter plot!

So, what next? If you conclude the slope is nonzero, then you can go ahead with associated policy and planning. But if you conclude the slope is zero, you may need to think about trying another factor which influences Y. Of course, often finding no relationship between X and Y can be useful for policy and planning too!

We could also do a similar test on β0; then we'd be testing whether the intercept is 0 or not. However, the implication is that if we conclude β0 = 0, then we should drop the intercept. This is generally not a good idea, as dropping the intercept effectively forces the regression line through the origin, which may change the slope of the line inappropriately.

Let's do another example.
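Outside Excel, the same 4-step test can be sketched with scipy, whose linregress function reports the two-sided p-value for H0: slope = 0. The data below is synthetic, since the training-expenditure dataset is not reproduced here:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = rng.uniform(0, 700, size=40)                 # e.g. training expenditure ($m)
y = 6 - 0.002 * x + rng.normal(0, 0.1, size=40)  # e.g. unemployment rate (%)

res = linregress(x, y)  # res.slope is b1; res.pvalue tests H0: slope = 0
alpha = 0.05
if res.pvalue < alpha:
    print("reject H0: evidence of a relationship between X and Y")
else:
    print("do not reject H0: no evidence of a relationship")
```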
A well-known model in finance, called the market model, assumes that the monthly rate of return on a share (R) is linearly related to the monthly rate of return on the overall market (M). The mathematical description of the model is:

    Ri = β0 + β1 Mi + ei

For practical purposes, M is taken to be the rate of return on some major stock market index, such as the Australian All Ordinaries Index. The coefficient β1, well known as the share's beta coefficient, measures how sensitive the share's rate of return is to changes in the level of the overall market. For example, if β1 > 1, the share's rate of return is more sensitive to changes in the level of the overall market than is the average share. Conversely, β1 < 1 suggests it is less sensitive.


The scatter plot below shows the return of a particular share ANZ against the market (All Ordinaries) return. From the scatter plot alone, a positive linear relationship seems to exist.
[Scatter plot: ANZ Share Returns against Market Returns, showing an upward-sloping cloud of points]

The estimation output is given below:

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.500805
  R Square            0.250806
  Adjusted R Square   0.241557
  Standard Error      0.043341
  Observations        83

ANOVA
               df   SS         MS         F          Significance F
  Regression    1   0.050937   0.050937   27.11618   1.42333E-06
  Residual     81   0.152156   0.001878
  Total        82   0.203093

                     Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
  Intercept          0.005039       0.005053         0.997183   0.321645   -0.005015   0.015094
  M (Market Return)  0.843433       0.161971         5.207320   1.42E-06    0.521162   1.165704

Estimated equation:

    R̂i = 0.0050 + 0.843 Mi

Interpretation of coefficients:

b0 = 0.0050. This means that when the market return is zero, one would expect the ANZ share return to be 0.0050, or 0.5%.

b1 = 0.843. Consider a share in ANZ on two particular trading days, where on the second day the All Ordinaries return was 1% higher than it was on the first day. On the second day, the ANZ share would be expected to earn a return 0.843% higher than it would on the first day.

Hypothesis test

1. Formulate Null and Alternative Hypotheses

    H0: β1 = 0  Market returns have no impact on ANZ share returns.
    H1: β1 ≠ 0  Market returns have an impact on ANZ share returns.

2. Decide a Significance Level

Test at the 5% level of significance, i.e. α = 0.05.

3. Calculate the p-value

    p-value = 1.42 × 10⁻⁶

4. Make a Decision

The decision rule is to reject H0: β1 = 0 if the p-value < α. Since 1.42 × 10⁻⁶ < 0.05, we reject H0 and conclude that the return on the market does affect the return on ANZ shares.

N.B. Wouldn't it be nice if we could test specific financial and economic theories, like whether the ANZ share is exactly as sensitive as the average share (i.e. H0: β1 = 1)?

2.5 Evaluating the Model

We have come up with a simple model to explain the behaviour of Y. It assumes that X is the key factor in explaining Y, and that the relationship is linear. We have shown how the model can be used to aid understanding, policy and decision-making. An important next step, though, is to evaluate how good the model is. There is no point coming up with a model and making predictions from it if it is lousy at explaining the relationship between X and Y: any conclusions we draw from the model are likely to be misleading and unhelpful. So how do we evaluate our model? There are three things we can use.

(1) R²

R² is closely related to the correlation of X and Y. In fact, R² equals the square of the sample correlation coefficient:

    R² = [Corr(X, Y)]²

In the Excel output you will see a quantity called R Square.


This will be a value between 0 and 1. It measures the proportion of variation in Y that the model has been able to explain. So, a value of R² close to 1 indicates that the model has been able to explain a large proportion of the variation in Y, and hence is a very good model. A value of R² close to zero indicates a poor model: not much of the variation in Y has been explained. Let's look a little more closely at what R² measures.

[Diagram: a scatter plot with both the fitted line and a horizontal line at Ȳ. For a data point (Xi, Yi), the total deviation Yi − Ȳ splits into an explained part Ŷi − Ȳ and an error Yi − Ŷi]

    SST = Σ (Yi − Ȳ)²    (Total Sum of Squares)
    SSE = Σ (Yi − Ŷi)²   (Sum of Squared Errors)
    SSR = Σ (Ŷi − Ȳ)²    (Regression Sum of Squares)

The distance of each observed data point Yi from the fitted line is the error, so if we sum the squares of all these errors we get the SSE, or Sum of Squared Errors:

    ei = Yi − b0 − b1 Xi = Yi − Ŷi

    SSE = Σ ei² = Σ (Yi − Ŷi)²

The distance of Yi from the mean line contributes to SST, the Total Sum of Squares: the total variation in Y from its mean. The other distance, from the value Ŷi on the line to the mean, is the part of Y's behaviour that the model has been able to explain. So if we square and add these for all data points, we get SSR, the Regression Sum of Squares.

R² is defined as

    R² = SSR / SST = Σ (Ŷi − Ȳ)² / Σ (Yi − Ȳ)²

Hence the interpretation as the proportion of variation in Y explained by the model.

e.g. In the training expenditure output above, we have an R² of 0.0358. So we can say that 3.58% of the variation in unemployment rates can be explained by variation in training expenditure. This is a very poor result, but not unexpected, as there are a lot of other factors that affect the unemployment rate, such as education, economic conditions, etc.
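The decomposition SST = SSR + SSE, and the link back to the correlation coefficient, can be verified numerically. A sketch in Python on made-up data (NumPy assumed available):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Fit the least-squares line.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean())**2)      # total variation in Y
SSE = np.sum((y - y_hat)**2)         # unexplained variation
SSR = np.sum((y_hat - y.mean())**2)  # explained variation

assert np.isclose(SST, SSR + SSE)                         # the decomposition
assert np.isclose(SSR / SST, np.corrcoef(x, y)[0, 1]**2)  # R^2 = corr^2
```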

(2) Standard Error

The STANDARD ERROR provides another measure of how good our model is. It is simply the standard deviation of the error term in the model. Positive errors (where the model predicts a Y value smaller than what actually happened) cancel out with the negative errors, so the average error will be zero. This means the error standard deviation is essentially the square root of the average of the squared errors. Or more loosely, the standard error gives us an estimate of the magnitude of the average error that the model produces. This is the same intuitive interpretation we gave to the standard deviation in Topic 1.

e.g. In the unemployment rate / training expenditure case, we have a standard error of 1.685. So we say that, on average, the model's predictions of Y (the unemployment rate) are in error by 1.685 percentage points, either above or below the actual value of Y. This can be very useful. For example, if we were using the model to predict how the government's pledge to increase training for the unemployed might improve the overall unemployment rate, we can gauge from the standard error that our prediction of the unemployment rate will be out, on average, by 1.685 percentage points either above or below.

Another way to get a handle on the standard error is to compare it with the actual mean of Y and/or with the kinds of values Y takes. In Australia, the unemployment rate fluctuates around 6%, not usually going below 4% or above 8%. So a model that predicts the unemployment rate with an average error of 1.685 percentage points is not particularly accurate.
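For a fitted line, the standard error reported by Excel is computed as the square root of SSE / (n − 2); the n − 2 reflects that two coefficients (b0 and b1) were estimated from the data. A sketch in Python on made-up data (NumPy assumed available):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 3.5, 3.9, 5.2, 5.8, 7.4])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Standard error of the regression: the typical size of a prediction error.
std_error = np.sqrt(np.sum(residuals**2) / (len(x) - 2))
print(round(std_error, 3))
```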

(3) Error / Residual Plots

The aim of a model is to explain patterns in Y. Sometimes variables just fluctuate randomly, in a totally unpredictable way; no model can expect to explain this. But what we do hope is that a good model will have at least explained the main patterns in Y. A tool for evaluating whether the model has succeeded in this is the ERROR PLOT, or RESIDUAL PLOT (residual = error). We would hope that the errors do not contain an obvious pattern. If they do, then there is something wrong with, or missing from, the model: there is some pattern in the data that should have been picked up by the model. What we should ideally see in the errors is only the random, unpredictable part left over after the model has taken care of all the patterns. Excel can produce residual plots as part of the regression function. At this stage, we look to the residual plots for two things:

- Evidence that the use of a linear model may not have been appropriate.
- Evidence that there may be an important variable left out of our model.

Let's take a look at some examples. In the graph below, the errors seem to be randomly distributed, with no particular pattern. Looks good.
[Residual plot: residuals scattered randomly around zero across the range of X]

The second graph, below, suggests either a variable has been left out, or more likely we have fitted a linear model when the relationship is not linear.
[Residual plot: residuals negative in the middle of the X range and positive at both ends, a clear U-shaped pattern]
In fact, the above residual plot came from the following scatter plot and fitted line:

[Scatter plot: observed Y and predicted Y against X; the observed data follow a curve while the fitted line is straight]


Clearly Y and X are not related in a linear way: some kind of curve would fit much better. The linear model gives errors which are all negative for X between 25 and 70, and all positive for X outside this range. A linear model is clearly not appropriate for this data. The third graph, below, has a clear cyclical pattern to it. This is clear evidence of some other variable being important in causing Y.
[Residual plot: residuals rising and falling in a repeating cycle across the X range]

What do we do if we find evidence in our residual plots of a problem with the model? Nothing much at this stage; we need more skills (see the next section). But it's important to be aware of the problems right from the beginning, as it means we should draw conclusions from the model with a little more caution.
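The non-linearity symptom described above is easy to reproduce: fit a straight line to data that actually follows a curve, and the residuals show a systematic sign pattern instead of random scatter. A sketch in Python with synthetic data (NumPy assumed available):

```python
import numpy as np

x = np.linspace(0, 80, 40)
y = 0.02 * (x - 40)**2 + 5  # a U-shaped (quadratic) relationship, no noise

# Fit a straight line by least squares anyway.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Residuals come out positive at the ends of the X range and negative in
# the middle: a clear pattern, so a linear model is not appropriate here.
print(np.sign(residuals[0]), np.sign(residuals[20]), np.sign(residuals[-1]))
```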

3. The Multiple Regression Model

We want to generalise the simple regression model because we believe that there are likely to be several factors causing variation in Y. It is unrealistic to restrict ourselves to a model which allows for just one factor. In fact, sometimes considering just one factor can give us a misleading impression about what is relevant. e.g. Suppose we are seeking to explain differences in infant mortality rates across countries (number of infants who died before the age of 1, per 1000 live births). There are three possible causal factors: average income levels (Real per capita GDP), average education levels (secondary school enrolment ratio), and number of TV sets per capita. Here are the results we obtain.


Model 1: Y = Infant Mortality Rate. X = Number of TV sets per capita.
[Excel regression output for Model 1]

Model 2: Y = Infant Mortality Rate. X = Real GDP per capita.
[Excel regression output for Model 2]

Model 3: Y = Infant Mortality Rate. X = Secondary School Enrolment Rate.
[Excel regression output for Model 3]

All three models look plausible, and suggest a relationship between the variables: the p-values on the X variables are all small, the R² values are reasonably big, and the β1 estimates have the right sign (infant mortality rates fall as income rises, as education rises, and as the number of TV sets rises). There is a hint that the most important of the three variables could be the education variable: it has a much bigger R². But even better, it is conceivable that all three of these variables are important in explaining the infant mortality rate. Here's what happens when we estimate a MULTIPLE LINEAR REGRESSION MODEL: a model with more than one X variable. It's easy to do in Excel: put all three X variables together in 3 adjacent columns (getting rid of rows where there are blanks / missing data). Then choose Data Analysis under the Data tab, Regression, and select these columns in the Input X Range part of the dialog box.


The story now is very different. We now have four coefficients to interpret and test: the constant / intercept, and the coefficients of each of the three variables. Note that all the coefficients have the same sign as before (all negative, as expected), but the actual coefficients are all smaller, especially for TV sets and GDP. Increasing the number of TV sets and the income level has a much smaller effect on infant mortality rates than what is suggested by the simple regression results. In fact, looking at the p-values for these two variables, they are 0.0816 and 0.6634, bigger than α = 0.05, so we would not reject the null hypothesis that these coefficients are zero. This is not surprising. When we just do simple regression, a factor like GDP can show up as significant. But really, it is not important: it is the education level that really matters in reducing infant mortality rates. GDP shows up as significant in the simple regression because countries with higher GDP usually have higher education levels, so GDP is capturing the effect of higher education, and thus appears to be an important factor. When the model includes both GDP and Education, the reality shows up: variations in GDP cannot explain variations in infant mortality rates once differences in education level are taken into account. Let's go into the ideas of multiple regression a little more systematically.

3.1 The Model and Interpretation

Here's the mathematical representation of the multiple regression model:

Yi = β0 + β1X1i + β2X2i + ⋯ + βkXki + ei
Assume we have k possible X variables explaining Y; in the infant mortality example, k = 3. The βs represent the population parameters; they have a similar interpretation as for the simple regression model, but there is a small but important difference.

β0 = intercept = the average value of Y when ALL the X variables equal zero.
N.B. Usually we don't worry about interpreting β0, as it often doesn't make a lot of sense for all the Xs to be zero (e.g. what sense is there in considering what the mortality rate would be for a country with no TV sets, zero income, and no people in secondary school?).

β1 = the change in Y for a 1 unit change in X1, holding all other X variables constant.
In a simple regression of Y on X1, β1 tells us about the total contribution of X1 to explaining Y. In multiple regression, β1 tells us about the contribution of X1 to explaining Y after taking account of the possible impact of the other variables on Y. The same interpretation applies for β2, β3, etc.


Return to the infant mortality rate example. Using our sample, we came up with estimates of these βs: the bj's. These could be interpreted as follows:

b1 = -45.2359. If there were two countries which had the same GDP per capita and the same secondary school enrolment rate, but one country had 1 more TV set per capita than the other, then the country with more TVs per capita would have an infant mortality rate 45.2 deaths per 1000 live births lower than the country with fewer TVs per capita.

b2 = -0.00034. If there were two countries which had the same number of TV sets per capita and the same secondary school enrolment rate, but one country's GDP was US$1 more than the other, then the country with the higher GDP would have an infant mortality rate 0.00034 deaths per 1000 live births lower than the country with the lower GDP.

b3 = -0.62023. If there were two countries which had the same GDP per capita and the same number of TV sets per capita, but one country's secondary school enrolment rate was 1% higher than the other, then the country with the higher rate would have an infant mortality rate 0.62 deaths per 1000 live births lower than the country with the lower enrolment rate.

N.B. Sometimes we want to vary our interpretations to take into account the sorts of values the different X variables take. For example, the data on TV sets per capita range in value from 0.0 to 0.71 (the USA). No country has more than 1 TV set per person! So to talk of differences in the number of TV sets per capita of ONE is totally unrealistic. It's probably better to talk, say, of differences of 0.1; this might represent the difference between Uruguay (0.2) and Taiwan (0.3), for example. If b1 = -45.2, then a 1 unit difference leads to a 45.2 unit difference in Y.
So our interpretation would be more relevant if we said: if there were two countries which had the same GDP per capita and the same secondary school enrolment rate, but one country had 0.1 more TV sets per capita than the other, then the country with more TVs per capita would have an infant mortality rate 4.52 deaths per 1000 live births lower than the country with fewer TVs per capita. From a policy angle, we can use these differences in mortality rates to predict what might happen if we were able to change one of the X variables. For example, if a particular country could introduce a policy aimed at increasing the secondary school enrolment rate by 1%, then our model suggests that that country could reduce its infant mortality rate by 0.62 deaths per 1000 live births.
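Rescaling a coefficient to a realistic change in X is simple arithmetic on the estimate; a quick sketch using the coefficients quoted above:

```python
# Rescaling coefficients from the text to realistic differences in X:
# a 1-unit change in TV sets per capita is implausible, so quote the
# effect of a 0.1-unit change instead.
b1 = -45.2359   # deaths per 1000 live births, per 1 TV set per capita
b3 = -0.62023   # deaths per 1000 live births, per 1% enrolment

print(round(b1 * 0.1, 2))  # effect of 0.1 more TV sets: -4.52
print(round(b3 * 1.0, 2))  # effect of 1% higher enrolment: -0.62
```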

3.2 Hypothesis Testing

Using more than one X variable in our model does not change the way we do hypothesis tests. What we did in testing β1 in simple regression can be extended to testing β2 and β3, etc. Each coefficient has its own p-value which can be compared to the chosen significance level. Let's go through how we'd do the test for the infant mortality example. Here's the relevant part of the output.


Let's look first at whether the TV sets variable affects Y. Our null and alternative hypotheses would be:

H0: β1 = 0 (TV sets per capita has no impact on infant mortality rate)
H1: β1 ≠ 0 (TV sets per capita does have an impact on infant mortality rate)

From the output, the p-value is 0.082. If we choose a significance level of 5%, or 0.05, then since the p-value exceeds 0.05, we do not reject H0. That is, we conclude that there is not sufficient evidence to support the view that the number of TV sets per capita can help explain infant mortality rates, once the other variables have been taken into account. Note in this case that the decision is pretty close: had we chosen a significance level of 10%, or 0.1, then we would have rejected H0 and concluded that the number of TV sets is important in explaining infant mortality. This is where we need to be careful not to be too black and white. The choice of a 5% significance level was subjective, and could have made a difference to our conclusion. Given all this uncertainty and subjectiveness, a reasonable conclusion might be: TV sets per capita might be an important explanatory variable, but the evidence for this is not conclusive.

In the case of Real GDP and School enrolments, we have:

H0: β2 = 0 (Real GDP per capita has no impact on infant mortality rate)
H1: β2 ≠ 0 (Real GDP per capita has an impact on infant mortality rate)

From the output, the p-value is 0.664. If we choose a significance level of 5%, or 0.05, then since the p-value exceeds 0.05, we do not reject H0. That is, we conclude that there is not sufficient evidence to support the view that real GDP per capita can help explain infant mortality rates. In this case, the decision is clear-cut: the p-value is huge compared to the significance level. There is virtually no evidence suggesting GDP is an important variable, once the other variables have been taken into account.

For secondary school enrolment rates:

H0: β3 = 0 (Secondary school enrolment rates have no impact on infant mortality rate)
H1: β3 ≠ 0 (Secondary school enrolment rates have an impact on infant mortality rate, i.e. higher schooling, lower mortality)

From the output, the p-value is 0.0000006. If we choose a significance level of 5%, or 0.05, then since the p-value is less than 0.05, we reject H0. That is, we conclude that there is sufficient evidence to support the view that secondary school enrolment rates help explain infant mortality rates. In this case, the decision is also clear-cut: the p-value is tiny compared to the significance level. There is quite convincing evidence suggesting schooling is an important variable, once the other variables have been taken into account.
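The decision rule in all three tests is the same comparison of p-value against significance level; sketched in Python (the p-values are the ones quoted above):

```python
def decide(p_value, alpha=0.05):
    # Reject H0 only when the p-value is below the significance level.
    return "reject H0" if p_value < alpha else "do not reject H0"

print(decide(0.082))              # TV sets:   do not reject H0 at 5%
print(decide(0.664))              # GDP:       do not reject H0
print(decide(0.0000006))          # schooling: reject H0
print(decide(0.082, alpha=0.10))  # TV sets at 10%: the decision flips
```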


Take note of the slight but important difference in how we interpret the results of the hypothesis test: once the other variables have been taken into account. If we conclude that X1 is not important, but X2 is important, we would say that X1 is not important, once X2 is taken into account. That's exactly the case we had in this infant mortality example: at first, when we did regressions one variable at a time, it seemed that all three variables were important in explaining the infant mortality rate. But, for example, countries which have more TV sets per capita and higher GDP tend to have higher school enrolment rates. When we put these variables in separately, they seemed significant, but only because they were acting as a proxy for the effect of higher school enrolment rates. So when we put all three in together, it was clear that the only strong determinant of infant mortality was education.
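This proxy story can be reproduced by simulation. The sketch below uses entirely made-up data (not the lecture's country sample): gdp is built to be correlated with school, but only school drives y, so gdp's large simple-regression slope shrinks towards zero once school is included.

```python
import numpy as np

# Hypothetical simulation of the proxy effect: "gdp" matters in a simple
# regression only because it moves with "school", the true driver of y.
rng = np.random.default_rng(3)
n = 500
school = rng.normal(size=n)
gdp = 0.8 * school + 0.3 * rng.normal(size=n)  # correlated with schooling
y = -2.0 * school + rng.normal(size=n)         # y depends on schooling only

def coeffs(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# Simple regression of y on gdp alone: its slope looks large...
b_simple = coeffs(np.column_stack([np.ones(n), gdp]), y)[1]
# ...but once school is included, gdp's coefficient collapses toward zero.
b_mult = coeffs(np.column_stack([np.ones(n), gdp, school]), y)[1]
print(abs(b_simple) > abs(b_mult))  # True
```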

3.3 Assessing the Overall Model

In simple regression we used three things to help us decide if we have a decent model: R2, the standard error, and residual analysis. All of these tools are available to us in multiple regression. The relevant output for our infant mortality example is:

(1) R2. As with simple regression, R2 measures the proportion of variation in Y explained by the fitted equation; that is, the proportion explained by X1, X2, …, Xk. In this case, we have R2 = 68.2%, which is reasonable. Note that in the simple regression model with just the School Enrolments variable, we got an R2 of 66.3%, so the addition of these two extra variables has done little to improve the model. This fits with the fact that the tests discussed above suggested neither of the other variables was convincingly significant. We'd say that 68.2% of the variation in infant mortality rates is explained by the variation in TVs, GDP and the secondary school enrolment rate.

N.B. One important point about R2: we can show mathematically that R2 will always increase if we add more explanatory variables to a model. Thus it is not a good idea to use R2 as a guide to whether one model is better than another, especially if one has more explanatory variables than the other. If you are considering adding another variable, then using a hypothesis test to decide if its coefficient is significantly different from zero is a much better guide.
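The N.B. above can be checked numerically. In this sketch (made-up data, nothing from the lecture's sample), adding a pure-noise column never lowers R2:

```python
import numpy as np

# Made-up data: y depends on x1 only; "noise" is a useless extra regressor.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
y = 2 + 3 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)

def r_squared(X, y):
    # Fit by least squares and compute 1 - SSR/SST.
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_small = r_squared(np.column_stack([np.ones(n), x1]), y)
r2_big = r_squared(np.column_stack([np.ones(n), x1, noise]), y)
print(r2_big >= r2_small)  # True: R2 never falls when a column is added
```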

(2) Standard Error. The understanding of the standard error is the same as for simple regression. The standard error is the standard deviation of the error term in the model: if our additional X variables are helpful in explaining Y, then the error term should be smaller on average, because more of Y is able to be explained. In the case of our example, we get a standard error of 21.34 deaths per 1000 live births. That is, loosely speaking, our model will provide predictions of the infant mortality rate which are in error by an average of 21.34 deaths. This is reasonable but not brilliant. Infant mortality rates range from 3.7 for Japan to 135 for Mozambique, so to have a model that on average will make an error of around 20 is something: the model can put us in the general ball park of the kind of infant mortality rate a country should have, given its level of TV ownership and income and education levels, but it is not able to predict outcomes in a very precise manner. There are clearly still several other important factors explaining differences in infant mortality rates.

(3) Residual Analysis. When we plotted the residuals in simple regression, we plotted them against the X variable. With multiple regression, there is more than one X variable. So we need to do a number of different plots, with each different X variable. If you have a large number of X variables, this would be very tedious! So we tend not to use this tool in most cases.

4. Regression with Different Types of X Variables

4.1 When the Relationship between X and Y is Not Linear

So far we have limited ourselves to modelling the relationship between the X variables and Y as a linear one. This is somewhat necessary for practical reasons, but it would be nice to be able to look at some alternatives to the linear relationship. At this stage we can consider only one specific type of non-linear model, namely those where the data can be transformed so that the relationship becomes linear. This then allows us to estimate linear relationships in the transformed variables. More complex options are explored in further econometrics subjects! Let's look at our infant mortality example. If we do a scatter plot of each X variable against the dependent variable, here's how they look:


[Scatter plots: Infant Mortality rate vs TV sets per capita; Infant Mortality rate vs Real GDP per capita; Infant Mortality rate vs Sec. School Enrolments]
It is obvious here that the variables do not have a linear relationship with infant mortality rates. We want a way of capturing this non-linear relationship. One possibility is to add a quadratic term for each Xj variable to the model. It is conceivable that a quadratic curve may provide a reasonably accurate fit to the data as shown in these scatter plots. The model we want to estimate would now be:

Yi = β0 + β1X1i + β2X2i + β3X3i + β4X1i² + β5X2i² + β6X3i² + ei

This allows each of the X variables to influence Y in a non-linear (quadratic) way. To estimate this model in Excel, we just need to add more columns to our data sheet, where all the values of each of the three X variables are squared. Here's how a part of the data sheet would look, showing the formulas:

Of course, we can enter these formulas by typing in just one, and getting the rest by clicking and dragging.
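Adding the squared columns is the same operation whatever the tool; a numpy sketch of building the quadratic design matrix from hypothetical data:

```python
import numpy as np

# Three X columns plus their squares, mirroring the squared columns
# added to the spreadsheet before running the Regression tool.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(40, 3))              # hypothetical X data
X_quad = np.column_stack([np.ones(40), x, x**2])  # intercept, linear, squared

print(X_quad.shape)  # (40, 7): intercept + 3 linear + 3 squared terms
```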


Now, we simply do the regression analysis (Data Analysis under the Data tab, Regression) with the X data range extended to include these extra columns (i.e. the X range is C1:H99). Here's how the output looks:

Examining the p-values for each of the X variables, we see that the venture into quadratic models has been successful. School Enrolment Squared is clearly significant, and GDP Squared is also significant. More interestingly, the GDP variable is also now very significant (p-value = 0.006). In the purely linear model we had concluded that GDP was not relevant. But now when we include it in quadratic form, we find that it is an important variable. On the other hand, TV sets is possibly not an appropriate variable to include: the p-value for the linear term is 0.053 (almost significant), and for the quadratic term it is 0.108. Both of these values are in that indecisive range: not small enough to be absolutely clear that this is a relevant variable, but not big enough to lead us to discard the variable outright. In general, the p-values on the quadratic terms in the model tell us whether we were justified in using a quadratic model instead of a linear model. If none of the quadratic terms had small p-values (i.e. none were significantly different from zero), then this suggests that the linear model was adequate and there is no need to consider a quadratic model. Notice the improvement in R2 for this quadratic model: we had an R2 of 68.2% with the purely linear model. This has now risen to 83.6%, quite a substantial improvement. The final model we might consider is the quadratic model without either TV sets or TV sets squared. After all, we don't really believe more TV sets make people healthier. And the evidence is not entirely convincing that it should be there.


Here's how the model output looks now:

All the variables are highly significant, so we might settle on this as our preferred model. How do we interpret the coefficients in this quadratic model? Do we even know if they have the right sign? This is nowhere near as easy to answer as in the straight linear model. The easiest way to do it is to come up with predictions of Y for a range of values of each of the X variables, then observe how varying the Xs affects our predictions. For example, the table below shows how varying Real GDP per capita can affect the model's predictions for Y, given a School Enrolment of 60%.

Real GDP p.c.   School enrolment   Predicted Y
 2000           60                 43.0454
 4000           60                 29.57807
 6000           60                 18.91118
 8000           60                 11.04474
10000           60                  5.978734
12000           60                  3.713169
14000           60                  4.248044

[Line chart: Predicted Y against Real GDP p.c.]

Increases in GDP of $2000 reduce infant mortality by around 10-14 deaths per 1000 live births at the low-income end (up to $6000), but then improvements in infant mortality rates are not as substantial if income continues to increase. This is not surprising for a very poor country moving to a middle-income country, we would expect rapid improvements in many social welfare outcomes. But then as that country goes to being a high-income country, there is less room for improvement, and the kinds of improvements that are needed will cost substantially more, so improvements are much slower in coming.
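The diminishing effect can be read straight off the predicted values tabulated earlier; a quick check of the successive drops:

```python
# Predicted infant mortality at School Enrolment = 60%, at GDP per capita
# of $2000, $4000, ..., $14000 (values from the lecture's table).
preds = [43.0454, 29.57807, 18.91118, 11.04474, 5.978734, 3.713169, 4.248044]

# Reduction in predicted mortality for each extra $2000 of GDP per capita:
drops = [round(a - b, 1) for a, b in zip(preds, preds[1:])]
print(drops)  # [13.5, 10.7, 7.9, 5.1, 2.3, -0.5] -- each step buys less
```

The last step is slightly negative: a quadratic must eventually turn around, so predictions at the very top of the GDP range should be treated with caution.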

4.2 Categorical Explanatory Variables

4.2.1 Variables with Two Categories

So far we have concentrated on X variables that are numerical. However, there are occasions when an explanatory factor is a categorical variable. For example, you may be interested in whether there is gender discrimination in income. You have data like the following:

Gender is qualitative: individuals are male or female. So how can we include characteristics like this in our model? We invent an extra variable, known as a DUMMY variable, which takes only the values 0 and 1, depending on which category each observation (person) belongs to. Your spreadsheet would look like this:


The column MALE is what we call a dummy variable. A value of 1 indicates that the person is male, and a 0 indicates the person is not male (i.e. female). We use the MALE column as X in our regression:

The main difference with using dummy variables is in the interpretation of the slope coefficient. What does 12628 mean for the MALE dummy variable? It means that males are estimated to earn $12,628 more than females on average. How did we make this interpretation? Recall that the β1 coefficient tells us how much Y would change if X were 1 unit higher. With a dummy variable, we only have two values: 0 for female and 1 for male, so increasing X by 1 unit means changing X = 0 to X = 1. The β1 coefficient thus tells us what the difference in Y would be if the person was male rather than female. In particular, we can say: take 2 people who are identical in every way, except one is female and the other is male. The person who is male can expect to earn, on average, $12,628 more per annum than the person who is female. So b1 can give us an idea of whether there is income discrimination purely on the basis of gender. What about the intercept? Recall β0 tells us what Y would be, on average, if X = 0. When X = 0, the person is female, so β0 tells us what Y would be, on average, for females. In this case, we could say the model predicts that females earn, on average, $56,672 per annum. It follows, then, that males are estimated to earn, on average, $56,672 + $12,628 = $69,300 per annum. That is, b0 = 56672 is the estimated average income of females, and b1 = 12628 is the estimated male-female difference.


[Chart: average income for the two groups: β0 for females, β0 + β1 for males]

Essentially we have 2 different intercepts: one for males and one for females. That is, we have a model with just an intercept for females:

Ŷi = 56,672
And a model with a different intercept for males:

Ŷi = 56,672 + 12,628 × 1 = 69,300

i.e. this model is the same as the first, except the intercept is bigger, by 12,628.

It also follows that if there is no difference in income between males and females, β1 = 0.

N.B. If we were to regress Y on just a constant, there would be no X and b0 would just be equal to Ȳ. In tutorials you will test this out using data in Excel, and in Topic 3 we will build on this concept.

To help us think more about this, consider a simple model with only an intercept and one X variable, a 0-1 dummy variable. Recall the formulae for the estimated coefficients for a simple regression:

b0 = Ȳ − b1X̄

b1 = Σi(Xi − X̄)(Yi − Ȳ) / Σi(Xi − X̄)²   (sums running over i = 1, …, n)

If X takes only a value of 0 or 1, then X̄ is the proportion of observations that take a value of 1, e.g. the proportion of males in the sample.

With a fair bit of manipulation, it can be shown that:

b0 = ȲX=0
b1 = ȲX=1 − ȲX=0

where ȲX=0 is the average value of Y for observations when X = 0 and ȲX=1 is the average value of Y for observations when X = 1. So, the intercept is the mean of Y for females, and b1 is the difference between the mean of Y for males and the mean of Y for females. So we can interpret b1 as the average difference in income between males and females.
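These two identities are easy to verify numerically; a sketch with made-up income figures (thousands of dollars, not the lecture's sample):

```python
import numpy as np

# Hypothetical incomes ($000): three women (male = 0), then three men.
y = np.array([50.0, 55.0, 60.0, 70.0, 72.0, 80.0])
male = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Regress y on a constant and the dummy.
X = np.column_stack([np.ones(6), male])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# b0 equals the female mean; b1 equals the male-female gap in means.
print(np.allclose(b[0], y[male == 0].mean()))                        # True
print(np.allclose(b[1], y[male == 1].mean() - y[male == 0].mean()))  # True
```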

4.2.2 Including Other X Variables

The same concept holds when we include other X variables in the model: we just add the phrase "holding all other variables constant" to our interpretations. Here is an example. We now include an additional X variable in our data: Years of Education. The spreadsheet would look like this:

Therefore, our Y-variable is income and our X-variables are MALE and Years of Education.


We now have:

b0 = 23396: The model predicts that females with no education earn, on average, $23,396 per annum. As before, it is not really plausible to speak about someone having no years of education.

b1 = 4598: On average, males earn $4,598 more per annum than females, holding years of education constant. That is, take 2 people with the same number of years of education, but one is female and the other male; the male would be expected to earn around $4,598 more per annum than the female.

b2 = 2013: Take 2 people of the same gender, but one has 1 more year of education than the other. The person with the higher education would be expected to earn, on average, $2,013 per annum more than the person with less education.

With a continuous X variable in our model, we are effectively estimating a model of income on education with two intercepts, depending on whether the person is male or female. That is, for females the model becomes:

Ŷi = 23396 + 2013 X2i
While for males:

Ŷi = 23396 + 4598 × 1 + 2013 X2i = 27994 + 2013 X2i


[Chart: income vs education; two parallel lines with slope β2, one with intercept β0 (females) and one with intercept β0 + β1 (males)]
By including another factor in the model, we again see the importance of multiple regression over simple regression. In the simple regression case, where we just included gender, there was strong evidence of gender discrimination: the p-value for MALE was 0.001. When we take into account education, however, the p-value increases to 0.13. Clearly, once we take education into account, there is no longer any evidence of gender discrimination. What was driving the apparent gender bias was the fact that, in this particular sample, on average, females have lower education than males, and with education strongly related to earnings, the gender dummy picked up a lot of the effect of education on income. This kind of distinction is important: if we had gone with the first model, the appropriate policy would be to tackle gender discrimination in the workforce. But with the second model, we can see that policy action would be more effective in the education sector: females are earning less because they have lower levels of education in general, not because of gender discrimination by employers. The more pertinent policy issue is why females seem to have lower education levels than men.

4.2.3 Multiple Categories

Now suppose we have a categorical variable with more than 2 categories. e.g. we have 3 broad, mutually exclusive and exhaustive occupation types: managerial, clerical and labour. What would we do in this case?

We would create a set of dummy variables as follows:


We then include 2 of the occupational category columns along with the other X variables in our regression. Why only 2 categories? We need to leave one category out to refer to as our base category (note that in the gender example, we only included the MALE category, not both MALE and FEMALE). If you try to put all 3 in, Excel will come up with an error; you'll find out the technical reason why in second year!
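Creating the dummy columns is mechanical: one 0/1 column per category, with one category (here Labour) deliberately left without a column as the base. A sketch with a hypothetical occupation list:

```python
# Hypothetical occupation data for five people.
occupations = ["Managerial", "Clerical", "Labour", "Clerical", "Managerial"]

# One 0/1 dummy column per non-base category; "Labour" is the base,
# so it gets no column and is identified by zeros in both dummies.
clerical = [1 if occ == "Clerical" else 0 for occ in occupations]
managerial = [1 if occ == "Managerial" else 0 for occ in occupations]

print(clerical)    # [0, 1, 0, 1, 0]
print(managerial)  # [1, 0, 0, 0, 1]
```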

Here, our base person is female and works as a labourer. Our interpretations then become:

b0 = 8872: This is the case where MALE = 0, Education = 0, Clerical = 0 and Managerial = 0, i.e. the individual is female, with no education, and works as a labourer. The model predicts that a person with these characteristics would earn $8,872 per annum on average.


b1 = 5593: The coefficient on the MALE dummy tells us the difference in income between males and females, holding the other X variables constant. That is, if we were to take 2 people with the same level of education and who work in the same occupation, except that one was male and the other female, the male would be expected to earn, on average, $5,593 more than the female.

b2 = 2618: The coefficient on Years of Education tells us how education affects income. If there were 2 people of the same gender and same occupational type, but one had 1 more year of education than the other, the person with higher education is estimated to earn $2,618 more than the person with lower education.

b3 = 2085: The coefficient on the Clerical dummy tells us the difference in income between Clerical occupations and Labour occupations, holding the other X variables constant. That is, if there were 2 people of the same gender and same level of education, but one worked as a clerk and the other worked as a labourer, the clerk is estimated to earn $2,085 more than the labourer, on average.

b4 = 20625: The coefficient on the Managerial dummy tells us the difference in income between Managerial occupations and Labour occupations, holding the other X variables constant. That is, if there were 2 people of the same gender and same level of education, but one worked as a manager and the other worked as a labourer, the manager is estimated to earn $20,625 more than the labourer, on average.

[Chart: predicted income ($) vs education (years) for six groups: Male/Female × Labour/Clerical/Managerial]

5. Putting It All Together and Exploring What Might Happen Next

So far we have developed a mathematical model which allowed us to identify the factors that contribute to variations in Y. We can use that model to make predictions about what Y would be if we were to change a key factor. This is a powerful tool for all kinds of problems in business and economics. Let's go through an example to see how a regression model can be used to shape government policy. The state and commonwealth governments have begun to include spending on mental health care in their health priorities. Critics argue, however, that, given its high prevalence and heavy disease burden, the amount allocated to mental health care falls far below what is needed in Australia. It is claimed that, amongst a number of other things, those who are mentally ill tend to earn less than those without mental illness. Using our knowledge of Topic 2, we could investigate these claims. We could model income ($000) as a function of a bunch of characteristics like education (categories: primary (base), secondary and tertiary), age, age squared and gender (male=1/female=0), as well as a dummy variable indicating whether or not the person has been diagnosed with a mental illness. Using data from a recent health survey, we obtain the following regression output:

What do these results tell us? There are a number of factors that do not appear to affect income:

Age does not appear to affect income in either a linear or non-linear (squared) way, once other factors are controlled for. Secondary education is no different from primary education in terms of income, once gender, age and mental health status are taken into account.

There are also a number of factors that do seem to affect income:

Males are estimated to earn $9,423 more per annum than females of the same age, education and mental health status. This would suggest there is gender discrimination in income. Individuals who received tertiary education tend to earn $15,237 more per annum than individuals who received only primary education. This seems to be a sensible result, as we would expect more highly educated people to earn higher incomes, and the magnitude looks plausible. An individual with mental illness earns, on average, $17,555 less per annum than an otherwise identical individual without mental illness. This result goes in favour of those who argue that current government spending is inadequate.

The important point to note from the output is that those with mental illness do seem to earn significantly less than those without mental illness, even after we take into account education, age, and gender. In fact, if we imagine 2 people who are identical in terms of gender, age and education, except that one has a mental illness, then the person with mental illness would be expected to earn, on average, $17,555 per annum less than the person without mental illness. So far this is nothing new: the results do suggest that people with mental illness experience poorer income outcomes than those without mental illness, and the government could do well to allocate more funds towards mental health services. But we can also use this information to do more than this. We can use this information to quantify how much mental illness is costing Australia, in terms of individual lost earnings (and government tax revenue), and therefore how much the government should pledge towards mental health care services. Health professionals estimate that around 12% of the Australian working-age population suffers from mental illness. With a working-age population of 12 million people, that would mean 1,440,000 people suffer from mental illness in Australia.

$17,555 per person per annum × 1,440,000 people = $25,279,200,000 per annum

We estimate that the annual cost of mental illness in Australia, in terms of lost income, is around $25.3 billion. Current spending on mental health care is less than one tenth of this amount, which suggests there is a need for the government to consider sizeable increases to its spending on mental health.
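The back-of-envelope calculation above, written as code with the figures from the text:

```python
# Reproducing the cost-of-mental-illness arithmetic from the text.
loss_per_person = 17_555            # estimated lost income, $ per annum
working_age_population = 12_000_000
prevalence_percent = 12             # ~12% of working-age Australians

affected = working_age_population * prevalence_percent // 100
total_cost = loss_per_person * affected
print(affected)    # 1440000
print(total_cost)  # 25279200000, i.e. about $25.3 billion per annum
```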

6. Regression When the Data is Recorded Over Time

So far the examples that we have looked at involved observations at the level of the individual, country, firm, household, etc. That is, we've used a subscript i for our variables. Regression can also be used when the data is observed over time. For example, we could look at factors that affect the Australian inflation rate over time. In this case, we tend to use a t subscript:

Yt = β0 + β1Xt + εt

or:

Yt = β0 + β1X1t + β2X2t + ⋯ + βkXkt + εt

Data over time is called a TIME SERIES: it is a series of observations on some variable of interest over a sequence of time periods. Most areas of business and economics would record or maintain time series data e.g. production, unemployment, sales, inflation, interest rates, prices, inventory etc. The data would look something like this:

The spreadsheet tells us that in 1981, total revenue for this company was $1,622.8 million. A helpful way of representing this data is with a simple LINE GRAPH; the data has a natural ordering, so a line graph is most appropriate.
[Line graph: Revenue by year, 1982–1998]
Why do we want to model a time series? Probably the most common reason is that we need to be able to forecast what is going to happen to this series in the future. For example, this series might represent revenue of some company, and knowing what is likely to happen to revenue in the future is important to being able to anticipate share price movements and hence to plan investment strategies.


There are many approaches to forecasting the future. One is to use one's intuition to make an educated guess about what might happen. Sometimes this works well, but often it works badly: it all depends on the experiences and biases of the person making the forecast. Instead, a much more reliable approach is to study the past values of the series and look for patterns. If we can safely assume that the patterns will persist into the future, then we can make use of the patterns we have identified to provide predictions for the future. For example, you may find the following pattern in the past data: sales seem to grow from one year to the next; typically they grow by about 5% per year. This provides some guidance for future forecasts. You will forecast ongoing growth at around 5% per year. Modelling time series is about observing and defining / measuring the patterns that are present in the data. It is the building block for successful forecasting.
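That "grow by about 5% a year" pattern translates directly into a forecasting rule; a sketch with a hypothetical last observation:

```python
# Naive pattern-based forecasting: project the historical growth rate
# forward from the last observed value (a hypothetical sales figure).
last_observed = 1000.0
growth_rate = 0.05  # the ~5% per year pattern found in past data

# Forecast 1, 2 and 3 years ahead by compounding the growth rate.
forecasts = [last_observed * (1 + growth_rate) ** h for h in (1, 2, 3)]
print(forecasts)  # roughly 1050, 1102.5 and 1157.6
```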

6.1 Components of a Time Series

Data that is recorded over time has some special features or components. These features can be seen in the graph of mobile phone sales from Topic 1.
[Line graph of monthly mobile phone sales, Jan-96 to Sep-06, ranging from 0 to 350]

The features to note in this graph are:

Trend: A persistent, long-term upward or downward pattern of movement. The duration of a trend is usually several years. The source of such a trend might be gradual and ongoing changes in technology, population, wealth, etc.

Cycle: A pattern of up-and-down swings that tends to repeat every 2-10 years, with periods of expansion leading to peaks, then contractions leading to troughs, after which the cycle repeats.

Seasonal: A regular pattern of fluctuations that occurs within each year and tends to repeat year after year.

Irregular: Whatever is left over after identifying the other three systematic components: the random, unpredictable fluctuations in the data. There is no pattern to the irregular component.

We will now look briefly at how to model the Trend and Seasonal components of a time series. We will ignore Cycle: that's for next year! And of course, the Irregular component can't be modelled: by definition there is no pattern to it.

6.2 Modelling the Trend

The aim in trend analysis is to fit a simple model that captures the long-term movement in a series. It is essential for medium- or long-term forecasting. Consider the following example, which shows annual revenue for the Coca-Cola company over a 25-year period.
[Line graph: Annual Revenues at Coca-Cola (US$ billion), 1975-1999, ranging from 0 to 25]

From the graph there is a general upward trend, with revenue growing steadily over the sample period. There is a range of possible trend models that can be applied to a set of data, but in this course we will focus on only one: the linear trend model. This model is a linear relationship between the actual series (Yt) and a time sequence variable (t). For the Coca-Cola example, the data and the time variable are given as follows:


The linear trend model is given by the equation:

Yt = β0 + β1t + et
In this model, we are assuming that there is reasonably steady growth in Y each time period. Since t represents years in this case, the model implies that Y grows, on average, by the amount β1 per year. Here's the model fitted to the Coca-Cola data above:
[Line graph: Annual Revenues at Coca-Cola (US$ billion), 1975-1999, with fitted linear trendline]

y = 0.7382x + 0.316, R² = 0.9349

To get the equation and R² printed with the chart, choose Options in the Add Trendline dialog box, tick the Display Equation on chart and Display R-squared value on chart options, then return and select the linear trend model. To estimate the coefficients, choose Data Analysis under the Data tab, then Regression, and select the column of Y data for the Y range and the column of observations on time (t) for the X range. If the data were laid out as per the adjacent sheet, the regression dialog box would look like this:


Here's how the Excel output would look in this case:

Notice that the coefficients and the R² value are the same as those from the trendline option in the chart output. How do we interpret the estimates of β0 and β1?

b0 = 0.316: This is the fitted value of Y when X = 0. Since X is time in this case, and t = 1 in 1975, t = 0 corresponds to 1974. So we say that the model predicts a trend value for revenue of $316 million in 1974.

b1 = 0.738: This is the change in Y for a 1-unit change in X. In this case, X changes by one unit each year, so we say that the model estimates the average growth in revenue to be $738 million per year.
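The least-squares calculation that Excel performs can also be sketched directly. The revenue series below is artificial: it is constructed to follow the estimated Coca-Cola trend exactly (the real data is not reproduced here), so the standard OLS formulas recover the same coefficients as the chart trendline:

```python
# Sketch of a least-squares linear trend fit outside Excel.
t = list(range(1, 26))                    # t = 1, ..., 25 (1975-1999)
y = [0.316 + 0.7382 * ti for ti in t]     # artificial revenue (US$ billion)

t_bar = sum(t) / len(t)
y_bar = sum(y) / len(y)

# b1 = sum((t - t_bar)(y - y_bar)) / sum((t - t_bar)^2);  b0 = y_bar - b1 * t_bar
slope = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) \
        / sum((ti - t_bar) ** 2 for ti in t)
intercept = y_bar - slope * t_bar
print(intercept, slope)
```

Because the artificial series lies exactly on the trend line, the formulas return slope 0.7382 and intercept 0.316 (up to rounding); with real data the fit would also leave residuals.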

6.3 Modelling Seasonal Patterns

Let's consider another time series example. This time we will look at quarterly sales of Wal-Mart, a huge chain of department stores which started in the USA but has since spread world-wide. Here's what a selection of the data looks like (several rows are hidden so that it doesn't take too much space), together with a graph of the time series.


[Line graph: Global Sales of Wal-Mart (US$ million), quarterly 1992-2000, ranging from 0 to 60,000]
From this graph, two features of this series strike us immediately: a general upward trend, and a recurring pattern of ups and downs through the four quarters of each year. The peak occurs in the 4th quarter of each year (October-December), and the 1st quarter (January-March) is usually the lowest. We have already learned how to model the trend. Here's how the linear trend fits, as modelled using the Add Trendline function:
[Line graph: Global Sales of Wal-Mart (US$ million), quarterly 1992-2000, with fitted linear trendline: y = 908.47x + 7304, R² = 0.9162]

Now, how do we augment our model to take account of the obvious seasonal pattern in the data? To deal with seasonality, we need to add dummy variables to the trend equation, to allow the model to have a different intercept for each quarter of the year.


The linear model we estimate is:

Yt = β0 + β1t + β2Q1t + β3Q2t + β4Q3t + et


Q1t is a dummy variable that takes the value 1 in the first quarter of each year and zero in the other quarters. Likewise, Q2t = 1 in the 2nd quarter of each year and zero otherwise, and Q3t is the 3rd-quarter dummy variable. Recall that when we discussed dummy variables earlier in this topic, we interpreted them as allowing for a different intercept in periods where the dummy variable equals one. The same interpretation applies here. Consider the first quarter of each year. In this case Q1 = 1, but Q2 = Q3 = 0, so the model becomes:

Yt = (β0 + β2) + β1t + et
So the intercept in the first quarter of each year is β0 + β2. By similar logic, the intercept in the 2nd quarter of each year is β0 + β3, and it is β0 + β4 for the 3rd quarter. In the 4th quarter, all the dummies equal zero, so the intercept is just β0. Here's how the data sheet would look for this example, with the three dummy variables added.
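The three dummy columns can be sketched in a few lines of Python; the function name is made up for illustration:

```python
# Sketch of building the three seasonal dummy columns from a quarter
# number (1-4).  The 4th quarter is the benchmark: all dummies are zero.
def quarter_dummies(quarter):
    return (int(quarter == 1), int(quarter == 2), int(quarter == 3))

# First two years of quarterly observations: quarters cycle 1, 2, 3, 4, ...
rows = [(t, quarter_dummies((t - 1) % 4 + 1)) for t in range(1, 9)]
for t, (q1, q2, q3) in rows:
    print(t, q1, q2, q3)
```

Each row mirrors a row of the data sheet: the time index t, followed by the Q1, Q2, and Q3 columns, with every 4th-quarter row containing all zeros.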

Now we choose Data Analysis under the Data tab, then Regression, with column B as the Y range and columns C, D, E, and F as the X range, and obtain the following results:


This is the complete model, incorporating a linear trend and seasonal components:

Ŷt = 11283.24 + 888.849t - 5760.68Q1t - 4180.75Q2t - 4523.6Q3t


We can interpret the coefficients as follows:

b0 = 11283.24: the estimated trend value of sales in period t = 0 (the 4th quarter of 1991) is $11,283.24 million.

b1 = 888.849: the estimated average growth in sales is $888.849 million per quarter.

b2 = -5760.68: we estimate that sales in the 1st quarter each year are typically $5,761 million below what they would be in the 4th quarter, after adjusting for trend.

b3 = -4180.75: we estimate that sales in the 2nd quarter each year are typically $4,181 million below what they would be in the 4th quarter, after adjusting for trend.

b4 = -4523.6: we estimate that sales in the 3rd quarter each year are typically $4,524 million below what they would be in the 4th quarter, after adjusting for trend.

The 4th quarter of each year is the quarter for which we don't have a dummy variable, so it acts as the benchmark quarter. The information in these seasonal estimates can be quite useful to business planners. For example, knowing just how much higher sales are in the 4th quarter of each year can help them in planning staff levels and ensuring adequate supply of stock. Similarly, they would not want to be overstaffed in the 1st quarter of each year, as this is always much quieter than the rest of the year. More importantly, the model can now be used to generate forecasts into the future; see the next section.

6.4 Using the Time Series Model to Forecast

Now that we have our complete time series model, incorporating both trend and seasonal components, we can easily use it to generate forecasts for Yt into the future. All we need to do is plug the appropriate values for t (time) and the quarterly dummy variables into the right-hand side of the model. The model we estimated was:

Ŷt = 11283.24 + 888.849t - 5760.68Q1t - 4180.75Q2t - 4523.6Q3t


The data series ended in the 4th quarter of 2000, with a t value of 36. So to forecast sales for the 1st quarter of 2001, we plug in t = 37 and Q1t = 1, Q2t = Q3t = 0, giving:

1st quarter of 2001: Ŷ37 = 11283.24 + 888.849 × 37 - 5760.68 = 38409.98

Predictions for the quarters that follow are calculated in the same way:

2nd quarter of 2001: Ŷ38 = 11283.24 + 888.849 × 38 - 4180.75 = 40878.75

3rd quarter of 2001: Ŷ39 = 11283.24 + 888.849 × 39 - 4523.6 = 41424.75

4th quarter of 2001: Ŷ40 = 11283.24 + 888.849 × 40 = 46837.20
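These hand calculations are easy to automate. A minimal Python sketch using the estimated coefficients reported above (the function and variable names are made up for illustration):

```python
# Sketch of generating forecasts from the fitted trend-plus-seasonal model.
B0, B1 = 11283.24, 888.849
SEASONAL = {1: -5760.68, 2: -4180.75, 3: -4523.6, 4: 0.0}  # Q4 is the benchmark

def forecast(t, quarter):
    """Forecast sales (US$ million) at time index t in the given quarter."""
    return B0 + B1 * t + SEASONAL[quarter]

# The four quarters of 2001 correspond to t = 37, 38, 39, 40
for t, q in zip(range(37, 41), (1, 2, 3, 4)):
    print(t, q, round(forecast(t, q), 2))
```

Note how the dummy coefficients simply shift the trend line down in quarters 1-3 and leave it unchanged in the benchmark 4th quarter.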

[Line graph: Global Sales of Wal-Mart (US$ million), 1992-2001, showing actual Sales and Predicted values]

6.5 Other Aspects of Modelling Time Series

There are several other important issues we have not dealt with in relation to modelling time series and forecasting with the linear trend model. We will briefly highlight them here. Even if you don't know a lot about these things, it's good to be aware of the issues and to make use of what you know in practical modelling situations.

6.5.1 Outliers

When you graph the time series at the beginning of your analysis, look for any unusual data points: values which are especially large or small relative to the rest of the values. Such a value might be a data error (check that the data point is actually correct!), or an outlier: a value that occurs because of a one-off event, e.g. a September 11th attack, a big strike, or a political revolution. If we don't recognise this value, it can have undue influence on the rest of the modelling and produce unhelpful models. Once we identify the outlier, sometimes it's best simply to omit that data point from the analysis. Here's an example:

[Line graph: Average Cost of One Night Accommodation in Hotels / Motels in NSW, quarterly Mar-97 to Sep-01, ranging from $100 to $170]
Notice the significant extra cost of a night's accommodation in the September quarter of 2000. This is when the Sydney 2000 Olympics were taking place, and hotel / motel rates increased astronomically. This once-off event can be thought of as an outlier: the jump in average cost from around $110 per night to near $160 per night was clearly not part of an ongoing trend, nor a seasonal fluctuation. It is the result of a one-off event.


6.5.2 Cycle

We have learned how to model trend and seasonal patterns, but not cycle. This will come in later years! Meanwhile, we should recognise that our models will produce forecasts that ignore this component, and perhaps make some allowance for it. For example, if your model produces a set of forecasts for sales, and it's generally believed that we are about to enter a bad economic recession, then your forecasts are likely to be too high, and you might want to adjust them downwards.

6.5.3 Standardising

In Topic 1 we talked about the importance of standardising data before drawing interpretations or conclusions from it. For example, to say that sales by Wal-Mart have risen from $9 billion to $51 billion over the past 10 years sounds impressive. But remember there has been inflation in that period, so some of this growth can be attributed to rising prices and doesn't represent real growth. A better picture would be obtained by looking at Real Sales: sales adjusted for inflation by dividing the sales figure by the CPI.

When modelling time series, it sometimes helps to analyse a series in terms of its component parts rather than the final series. For example, movements in total sales of Wal-Mart come about through growth in the number of stores, growth in real sales per store, and growth in the overall price level. That is:

Total Nominal Sales = Number of Stores × Average Real Sales per Store × Price Level

We could build trend and seasonal models of each of these three components and then combine them. Sometimes this gives better insight into what drives the total sales variable (e.g. is the growth mostly from opening new stores, from growth in sales in each store, or from inflation?), and possibly more accurate forecasts.
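Deflating by the CPI is a one-line calculation. In the sketch below, the CPI values are hypothetical (base year = 100); only the $9 billion and $51 billion sales figures come from the example above:

```python
# Sketch of adjusting nominal sales for inflation (standardising).
nominal_sales = [9000.0, 51000.0]   # US$ million, ten years apart
cpi = [100.0, 125.0]                # hypothetical price index levels, base = 100

# Real sales = nominal sales deflated by the price index
real_sales = [s / (p / 100.0) for s, p in zip(nominal_sales, cpi)]
print(real_sales)
```

With 25% cumulative inflation assumed here, the $51,000 million figure is worth only $40,800 million in base-year dollars, so part of the apparent growth is purely price rises.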

6.5.4 Other Functional Forms

In this course we have only modelled a linear trend with additive seasonal dummies. For the trend, this means that we expect Y to grow by a constant amount per year. Often this is not appropriate, and other functional forms would fit the data better and make more accurate forecasts. For example, if we believe Y grows at a constant percentage rate per year, an exponential model might be more appropriate:


[Line graph: Global Sales of Wal-Mart (US$ million), quarterly 1992-2000, with fitted exponential trendline: y = 10566e^(0.0399x), R² = 0.9476]
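One common way to fit an exponential trend y = a·e^(bt) is ordinary least squares on the log scale, since ln(y) = ln(a) + bt is linear in t. A minimal Python sketch, using an artificial series that grows at exactly 4% per period (the Wal-Mart data itself is not reproduced here):

```python
import math

# Sketch of fitting an exponential trend by OLS on the log scale.
t = list(range(1, 37))
log_y = [math.log(100.0 * math.exp(0.04 * ti)) for ti in t]  # artificial series

t_bar = sum(t) / len(t)
l_bar = sum(log_y) / len(log_y)

# Slope of ln(y) on t is the growth rate b; exp(intercept) recovers a.
b = sum((ti - t_bar) * (li - l_bar) for ti, li in zip(t, log_y)) \
    / sum((ti - t_bar) ** 2 for ti in t)
a = math.exp(l_bar - b * t_bar)
print(a, b)
```

Because the artificial series is exactly exponential, the fit recovers a = 100 and b = 0.04; this mirrors how the 0.0399 growth rate in the chart equation should be read, as roughly 4% growth per quarter.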

Or, if the rate of growth of Y is not constant, but levels off over time, a logarithmic model may be better:
[Line graph: Global Sales of Wal-Mart (US$ million), quarterly 1992-2000, with fitted logarithmic trendline: y = 9783ln(x) - 1901.1, R² = 0.7168]

6.5.5 Growth Rates vs. Levels

So far we have only talked about modelling the level of a series, e.g. sales. Sometimes it makes more sense to model the growth rate of a series. The growth rate is more difficult to model (it fluctuates more than the underlying level of the series), but it can sometimes give more accurate forecasts of the actual level. You'll learn more about this in later years!
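Converting a series of levels into period-over-period growth rates is straightforward; the level figures below are made up for illustration:

```python
# Sketch of computing period-over-period growth rates from levels:
# growth = (current level - previous level) / previous level.
levels = [100.0, 105.0, 113.4, 110.0]

growth = [(curr - prev) / prev for prev, curr in zip(levels, levels[1:])]
print([round(g, 4) for g in growth])
```

Note the growth-rate series has one fewer observation than the level series, and it swings between positive and negative values even when the levels look fairly smooth, which is why growth rates are harder to model.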

