Weeks 1 and 2
1. (a) What is meant by a variable in a statistical sense? Distinguish between qualitative and quantitative
statistical variables, and between continuous and discrete variables. Give examples.
A variable in a statistical sense is just some characteristic of an object. It may take different values.
Data on a quantitative variable can be expressed numerically in a meaningful way (e.g., height of an
individual, number of children in a family). Data on qualitative variables cannot be expressed numerically
in a meaningful way (e.g., sex or hair colour of an individual) although such data can be coded into numerical
expressions. Where qualitative data have an innate ordering (e.g., survey answers to the question
"How happy are you, all things considered?", rated on a scale from Very Happy to Very Unhappy), the
numerical coding can carry some meaning.
A discrete quantitative variable can assume only certain discrete numerical values on the number line (can
be a finite or infinite number of these values). A continuous quantitative variable can assume any value in a
specific range or interval; e.g. length of a pipe. In some cases, what in theory is a continuous variable must
be in practice measured as a discrete variable because of limitations to measurement precision.
(b) Distinguish between (i) a statistical population and a sample; (ii) a parameter and a statistic. Give
examples.
A statistical population is the set of measurements or observations of a characteristic of interest for all
elementary units in a frame; e.g., the shoe sizes of all men in Australia. A statistical sample is a subset of a
population, e.g., the shoe sizes of all the men enrolled in ECON 1203 is a sample of the population represented
by the shoe sizes of all men in Australia.
A parameter is a numerical description of a population. For example, the average shoe size of all Australian
men is a parameter (of the population of the shoe sizes of all Australian men). A statistic is a numerical
description of a sample. For example, the average shoe size of all men in this classroom is a statistic
(calculated from the sample of the shoe sizes of all men in this classroom).
Distributing prohibited | Downloaded by Samarth Tripathi (samtrip11@gmail.com)
lOMoARcPSD|1204875
2. In order to know the market better, the second-hand car dealership, Anzac Garage, wants to analyze
the age of second-hand cars being sold. A sample of 20 advertisements for passenger cars is selected
from the second-hand car advertising/listing website www.drive.com.au. The ages in years of the
vehicles at time of advertisement are listed below:
(a) Calculate the frequency, cumulative frequency and relative frequency distributions for the age data using
the following bin classes:
More than 0 to less than or equal to 8 years
More than 8 to less than or equal to 16 years
More than 16 to less than or equal to 24 years.
Bin              Relative Frequency   Frequency   Cumulative Frequency
0 < Age ≤ 8      0.70                 14          14
8 < Age ≤ 16     0.25                 5           19
16 < Age ≤ 24    0.05                 1           20
(b) Sketch a frequency histogram using the calculations in part (a). What can you say about the distribution
of the age of these second-hand cars? Is there anything that concerns you about the frequency table and
histogram? Specifically, is the choice of bin classes appropriate? What needs to be done differently?
[Figure: relative frequency histogram of car age. Horizontal axis: Bin (upper limits 8, 16, 24); vertical axis: Frequency (0 to 0.8).]
From this graph, the age distribution appears to be skewed to the right. 70% of observations have age
between 0 and 8. However, this histogram only provides limited information about the age distribution
because there are too few bins and they are very wide.
(c) Halve the width of the bins (0 to 4, 4 to 8, etc) and recalculate the frequency, cumulative frequency and
relative frequency distributions. Using the new distributions and histogram, what can you now say about
the distribution of the age of second-hand cars?
Bin              Relative Frequency   Frequency   Cumulative Frequency
0 < Age ≤ 4      0.25                 5           5
4 < Age ≤ 8      0.45                 9           14
8 < Age ≤ 12     0.20                 4           18
12 < Age ≤ 16    0.05                 1           19
16 < Age ≤ 20    0.00                 0           19
20 < Age ≤ 24    0.05                 1           20
[Figure: frequency histogram of car age. Horizontal axis: Age (bin midpoints 2, 6, 10, 14, 18, 22); vertical axis: Frequency (0 to 10).]
There still appears to be a skew to the right, but now we can also see that there is an outlier in the 21~24 Age
category. 5~8 are the most frequently observed ages. A quite sizable proportion of the second-hand cars are
relatively new (25% being less than or equal to 4 years old).
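The relative and cumulative frequencies in the part (c) table follow mechanically from the bin counts. A minimal sketch in Python, using only the counts shown in that table (the raw ages are not reproduced here; the bin labels and the helper name `frequency_table` are illustrative):

```python
from itertools import accumulate

# Bin labels and counts copied from the part (c) frequency table.
bins = ["0-4", "4-8", "8-12", "12-16", "16-20", "20-24"]
counts = [5, 9, 4, 1, 0, 1]


def frequency_table(counts):
    """Return (relative, cumulative) frequency lists for a list of bin counts."""
    total = sum(counts)
    relative = [c / total for c in counts]   # share of all observations in each bin
    cumulative = list(accumulate(counts))    # running total of the counts
    return relative, cumulative


relative, cumulative = frequency_table(counts)
for label, count, rel, cum in zip(bins, counts, relative, cumulative):
    print(f"{label:>6}  {count:>2}  {rel:.2f}  {cum:>2}")
```

The last cumulative frequency should equal the sample size (20), which is a quick check that no observation has been dropped or double counted.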
3. Health expenditure
A recent report by Access Economics provides a comparison of Australian expenditures on health with
that of comparable OECD countries. Data from that report relating to the year 2005 have been used to
reproduce their Figure 2.2 (below denoted as Figure 2.1).
There is a strong positive association: more per capita GDP implies more health expenditure per capita.
There are (at least) 2 outliers: the observation with the largest health expenditure (Luxembourg) and
the observation with the highest GDP (USA). Without these 2 the relationship is approximately
linear. With them, there is a suggestion of a non-linear relationship.
There is an indication of more variability in health expenditures when GDP is larger.
(b) While this is a bivariate scatter plot, there are three variables involved: health expenditure, GDP and
population. Why account for population by expressing health expenditure and GDP in per capita terms?
This line of questioning is intended to prompt the recognition that there may be factors other than GDP
associated with health expenditures per capita, and population size is one obvious factor since (for example)
there may be returns to scale in health care delivery, and/or differences in how concentrated the health care
industry is in larger countries versus smaller ones. Expressing everything in per capita terms is one way to
control for population variation and hence isolate the GDP-health expenditure relationship, so it is good if
that is the relationship we want to know about. However, controlling for population size in this fashion makes
it harder to see the relationship between population and health care expenditure, so if that relationship were
our target of analysis, this would not be a good way to present the data.
The time series evolution is quite similar for Sydney and Melbourne housing prices: they track each
other quite well and hence we would say there is a strong positive association between these two
series.
Sydney prices are typically above Melbourne prices.
There seem to be 2 regimes. In the first regime, up until the 1950s, there is little growth in housing
prices and they are quite stable from year to year (low variability). In the second regime, since the
1950s, there have been quite dramatic increases in housing prices in both cities and there is much
greater year-to-year variability, i.e., more volatility. (In his analysis, Stapledon notes that this
two-regime pattern is quite common and has been observed in the US as well.)
One reason housing prices increase over time is inflation and if all prices and incomes increase by the same
proportion then there are no real changes, meaning no changes that people would feel in their wallets. So
just as in the previous question, we control for this other factor (inflation) so as to better see the
relationship between real housing prices and time.
[Figure: time series of Sydney and Melbourne housing prices. Horizontal axis: Year (1860 to 2020); vertical axis: Thousands of dollars (0 to 500).]
(a) Calculate the mean, median and mode for this sample of data and use these statistics to further describe
the distribution of car ages.
Mean \(= \frac{5 + 5 + 6 + \dots + 24 + 11}{20} = \frac{146}{20} = 7.3\)
Ordering the data from lowest to highest:
Median = (6+6)/2=6
Mode = 6
The sample mean is to the right of the mode and median, suggesting that the sample distribution is skewed towards
the right. The cause seems to be the large outlier: one car had an age of 24, which appears to be very
different from the ages of the other cars. Given the skewness and the outlier, the median is possibly a better measure
of central tendency. Hence a typical second-hand car is 6 years old.
Mean 7.3
Standard Error 1.126476
Median 6
Mode 6
Standard Deviation 5.037752
Sample Variance 25.37895
Kurtosis 5.712234
Skewness 2.0983
Range 22
Minimum 2
Maximum 24
Sum 146
Count 20
(b) If the largest observation were removed from this data set, how would the three measures of central
tendency you have calculated change?
Mean \(= \frac{5 + 5 + 6 + \dots + 6 + 11}{19} = \frac{122}{19} \approx 6.4\) (now closer to the median)
Median = 6 (unchanged, but now not an average of the two middle values but the actual middle value, since
we now have an odd number of observations)
Mode = 6 (unchanged)
6. For the following statistical population, compute the mean, range, variance and standard deviation: 3,
3, 5, 12, 13, 14, 17, 20, 21, 21.
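Mean: \(\mu = \frac{3 + 3 + 5 + 12 + 13 + 14 + 17 + 20 + 21 + 21}{10} = 12.9\)
Range: \(21 - 3 = 18\)
Variance: \(\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(3 - 12.9)^2 + \dots + (21 - 12.9)^2}{10} = 45.89\)
Standard deviation: \(\sigma = \sqrt{45.89} = 6.7742\)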
The mean would increase by 4, but the range, variance and standard deviation would be unchanged.
The mean, range and standard deviation would be multiplied by 2, whilst the variance would be multiplied by
4.
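The question 6 figures, and the two statements above about adding 4 to every value and multiplying every value by 2, can be checked numerically. The sketch below uses Python's standard `statistics` module with its population (divide by N) functions; the helper name `describe` and the two transformed data sets are illustrative assumptions.

```python
import statistics

# The statistical population given in question 6.
population = [3, 3, 5, 12, 13, 14, 17, 20, 21, 21]


def describe(values):
    """Population mean, range, variance and standard deviation of a list."""
    return {
        "mean": statistics.fmean(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # divides by N, not N - 1
        "sd": statistics.pstdev(values),
    }


base = describe(population)
shifted = describe([x + 4 for x in population])  # add 4 to every value
scaled = describe([2 * x for x in population])   # multiply every value by 2

for name, stats in [("base", base), ("shifted", shifted), ("scaled", scaled)]:
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

Shifting moves only the mean, while scaling by 2 multiplies the mean, range and standard deviation by 2 and the variance by 4, in line with the answers above.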
7. Migrant wealth.
Suppose the Minister for Immigration is interested in research on the assimilation of migrant
households (a household where the chief income-earner is foreign born). The Household, Income and
Labour Dynamics in Australia (HILDA) survey is a representative survey of Australian households.
Using 4,669 household observations for 2002 from HILDA, we find there are 3,567 households
classified as Australian-born and 1,102 classified as migrants. One key consideration is how migrant
households are doing in terms of wealth compared with Australian-born households. Using these data,
we find the following:
(a) What can you say about the distribution of net household wealth, for both Australian-born and migrant
households, by looking at just the mean and the median figures?
The wealth distribution is skewed quite heavily towards the right for both Australian-born and migrant
households. The mean is much larger than the median, suggesting that more than 50% of each sample have
less than average wealth, while less than 50% of each sample have more than average wealth. In other words,
there is a fair amount of wealth inequality in both samples.
(b) More generally, what can you say about the distribution of wealth for migrant households compared
to that for Australian-born households? In particular, which type of household has greater variation in
wealth?
Based on just the mean and the median measures, a typical migrant family appears to be slightly wealthier
than a typical Australian-born family. Both figures are larger for the migrant sample than the Australian-born
sample. This is also the case for the 10th percentile figure. By contrast, the 90th percentile is greater for
the Australian-born sample than the migrant sample. These figures suggest that, while typical migrant families
are better off than typical Australian families in terms of wealth, migrant families are less likely to be very
poor or very rich compared with Australian-born families. In other words, Australian-born families have
greater variation in household wealth than migrant families.
(c) Suppose the minister has net household wealth of $600,000. What can you say about his or her
financial circumstances relative to other Australian-born households?
The minister's household has greater wealth than at least 90% of Australian-born households in Australia.
His/her household is amongst the wealthiest 10% of Australian households.
(a) What would you expect the correlation to be between price and distance?
There is an inverse relationship between distance to CBD and price, so we expect the correlation to be
negative.
(b) Does it appear that there is a linear relationship between the two variables?
The relationship does not look linear largely because of the large variability in prices for suburbs close to
the CBD. (These observations also tend to distort what the relationship looks like for the bulk of the data. If
you were to eliminate these outliers, it is not clear what the relationship would look like for the remainder of
the data.)
(c) What other key features of these data can be determined from the plot?
[Figure: scatter plot of price against distance to the CBD. Horizontal axis: Distance to CBD (kms), 0 to 80; vertical axis: Price $, 0 to 5,000,000.]
We have already mentioned the large variability in prices for suburbs close to the CBD. To say this
more formally, the variance of prices close to the CBD (conditional variance, where the
conditioning is on small distance to the CBD) is much larger than the variance of prices further
away from the CBD.
Other outliers appear around 30kms from CBD (these are Clareville, Palm Beach and Whale
Beach).
There is no suspicion that these outliers are due to errors. All are feasible observations.
We can see that the price and distance variables are both skewed to the right. (Imagine pushing the
graph up from the bottom left corner, keeping it in the same 2D plane, so it sits on its Y-axis in that
plane, and then looking at the distribution of price from behind the graph (so the values of the
price variable are ordered from lowest on the left to highest on the right).)
There are numerous suburbs where there were no sales, which are reported as zeroes in the graph.
Note though that it was not explicitly said that suburbs with no sales would have entries of zero, so
this would have to be inferred by the viewer of the graph. Most of these no-sale suburbs are
suburbs relatively close to the CBD.
What should we do with the zero sales observations when we analyse the data? They are not data
errors, as sometimes occur. But they are not real zeroes, as we don't know what the price would
have been had there been sales for the period in question. They are also extreme values, meaning
they have a higher chance of influencing various types of analysis we might do. Hence, we should
think hard about whether or not to include them in any given piece of analysis.
9. Anzac Garage wants to develop guidelines for setting prices of cars according to the car's age. They
hire a business consultant who chooses a sample of 117 second-hand passenger car advertisements
collected from www.drive.com.au and retrieves data on the age and price of the cars.
(a) The business consultant first calculates the correlation coefficient between age and price and finds it
to be -0.278. Interpret this result.
Correlation coefficients lie between -1 and 1. A negative value suggests an inverse relationship between the
variables (which makes sense: older cars are less expensive). A magnitude of 0.278 suggests that the
relationship is present but not very strong.
(b) Sketch what you think the scatter diagram from which this correlation coefficient was calculated might
look like. Suppose the business consultant constructs a simple linear regression model using price as
the dependent variable, and age as the independent variable. What do you think the estimated
regression line might look like here? (We will return to this particular example later in the course and
address this question more formally.)
Below is a possible scatter diagram with a linear regression model superimposed. Scatters that answer the
question will have the key feature of being consistent with a negative correlation, i.e., a negatively sloped line
of best fit.
[Figure: Price-age scatter with OLS regression line superimposed. Horizontal axis: Age (0 to 16); vertical axis: Price (0 to 60000); legend: Price, Linear (Predicted Price).]
Weeks 3 and 4
1. (a) Explain what it means to say that two probabilistic events in a sample space are mutually
exclusive of one another.
If two events (let's call them A and B) are mutually exclusive, then they do not
have any simple events in common: i.e., the simple events that combine to make up A
have no elements in common with those that make up B.
(b) Explain what it means to say that two probabilistic events in a sample space are independent
of one another.
When two events are independent of one another, conditioning on
the occurrence of one of them has no effect on the marginal probability of the other: i.e.,
Pr(A|B) = Pr(A).
(c) Why can two events not at the same time be both mutually exclusive and independent of one
another?
Because if A and B are mutually exclusive, then Pr(A and B) = 0, whereas if they are
independent, Pr(A and B) = Pr(A) × Pr(B) ≠ 0 (provided each event has positive probability).
2. A department store wants to study the relationship between the way customers pay for an item and the
price of the item. 250 transactions are recorded and the following table is formed.
Price category   Payment
                 Cash   Credit card   Debit card
Under $20        15     9             18
$20-$100         11     53            52
Over $100        6      38            48
Convert the table to a joint distribution. Express each of the following questions in terms of probability
statements, and then solve:
Joint distribution:
Price category   Means of Payment
                 Cash    Credit card   Debit card   Marginal
Under $20        0.060   0.036         0.072        0.168
$20-$100         0.044   0.212         0.208        0.464
Over $100        0.024   0.152         0.192        0.368
Marginal         0.128   0.400         0.472        1.000
(b) What is the probability that an item with a price tag of $43 is paid for in cash?
(c) What is the probability that people pay for an item that is at least $20 by credit?
(d) If somebody used a debit card to pay for an item, what is the probability that the item was less
than $100?
One way to check is to compare the marginal distribution of price with the conditional
distribution of price given a particular payment type (say, cash):
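As an illustration of that comparison, the sketch below rebuilds the joint distribution from the 250 recorded transactions, then sets the marginal distribution of price beside the conditional distribution of price given cash payment. The dictionary layout and the helper name `price_given_payment` are illustrative assumptions; the counts are those in the question's table.

```python
# Transaction counts from the question: price category -> payment type -> count.
counts = {
    "Under $20": {"Cash": 15, "Credit card": 9, "Debit card": 18},
    "$20-$100": {"Cash": 11, "Credit card": 53, "Debit card": 52},
    "Over $100": {"Cash": 6, "Credit card": 38, "Debit card": 48},
}
payments = ["Cash", "Credit card", "Debit card"]
total = sum(sum(row.values()) for row in counts.values())  # 250 transactions

# Joint distribution: each cell count divided by the grand total.
joint = {price: {pay: n / total for pay, n in row.items()}
         for price, row in counts.items()}

# Marginal distributions: row sums and column sums of the joint distribution.
marginal_price = {price: sum(row.values()) / total for price, row in counts.items()}
marginal_payment = {pay: sum(counts[price][pay] for price in counts) / total
                    for pay in payments}


def price_given_payment(pay):
    """Conditional distribution of price category given a payment type."""
    column_total = sum(counts[price][pay] for price in counts)
    return {price: counts[price][pay] / column_total for price in counts}


cond_cash = price_given_payment("Cash")
for price in counts:
    print(f"{price:>10}  marginal={marginal_price[price]:.3f}  given cash={cond_cash[price]:.3f}")
```

If price and payment type were independent, the two printed columns would coincide; in these data they differ noticeably for cash payments.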
3. In a small batch of 20 manufactured widgets, there are, in fact, 3 defective ones. You, as quality control
officer for the company making the widgets, decide to examine a sample of 3 widgets, selected without
replacement, to see how many defective ones are selected.
(a) Use a probability tree to evaluate the probability distribution of the number of defectives sampled.
The tree has a first pair of branches where the probability of
defective is 3/20 = 0.15 and of not defective is 17/20 = 0.85. From the upper of these branches, at the next node the
probability of a defective being selected is 2/19 and of a non-defective is 17/19. From the lower first
branch, the probability of a defective is 3/19 and of a non-defective is 16/19. From the nodes at
the end of the 4 second branches, the 8 probabilities of defective and non-defective are,
respectively, 1/18, 17/18, 2/18, 16/18, 2/18, 16/18, 3/18 and 15/18.
Multiplying the probabilities along each path and adding the paths that give the same number of
defectives (the draws are not independent, because sampling is without replacement), the probability
distribution of X, the number of defectives drawn in a sample of 3 without replacement, is
x          0          1          2         3
P(X = x)   680/1140   408/1140   51/1140   1/1140      ΣP(X = x) = 1
(b) How would your answer change if the sampling were done with replacement?
x          0           1           2          3
P(X = x)   4913/8000   2601/8000   459/8000   27/8000     ΣP(X = x) = 1
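Both tables can be reproduced with exact fractions. The sketch below uses the hypergeometric formula for sampling without replacement and the binomial formula for sampling with replacement; the function names are illustrative, and the formulas are a shortcut equivalent to multiplying along the branches of the probability tree.

```python
from fractions import Fraction
from math import comb

N, D, n = 20, 3, 3  # batch size, defectives in the batch, sample size


def without_replacement(x):
    """Hypergeometric probability of x defectives in the sample."""
    return Fraction(comb(D, x) * comb(N - D, n - x), comb(N, n))


def with_replacement(x):
    """Binomial probability of x defectives when each draw is replaced."""
    p = Fraction(D, N)  # probability of a defective on every draw
    return comb(n, x) * p**x * (1 - p) ** (n - x)


for x in range(n + 1):
    print(x, without_replacement(x), with_replacement(x))
```

`Fraction` prints results in lowest terms, so 680/1140 appears as 34/57; the values are the same as those tabulated above.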
7. The manager of a factory has determined from past experience that X, the number of repairs required
to machines in her factory on any one day, has the following probability distribution:
x 0 1 2 3 4
P(X = x) 0.41 0.25 0.18 0.10 0.06
(b) P(0 X 3)
(c) E(X)
\(E(X) = \mu = \sum_x x \, P(X = x)\)
\(= 0 \times 0.41 + 1 \times 0.25 + 2 \times 0.18 + 3 \times 0.10 + 4 \times 0.06 = 1.15\)
(d) Var(X)
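\(\mathrm{Var}(X) = E[(X - \mu)^2] = \sigma^2 = \sum_x (x - \mu)^2 \, P(X = x)\)
\(= (0 - 1.15)^2 \times 0.41 + (1 - 1.15)^2 \times 0.25 + (2 - 1.15)^2 \times 0.18 + (3 - 1.15)^2 \times 0.10 + (4 - 1.15)^2 \times 0.06 = 1.5075\)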
(e) What is the conditional probability distribution of X, conditional on some positive number of
repairs taking place?
x                  1      2      3      4
P(X = x | X > 0)   0.42   0.31   0.17   0.10
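The expected value, variance and conditional distribution above can be recomputed directly from the probability table. A minimal sketch, with the distribution stored in a dictionary (the variable names are illustrative):

```python
# Probability distribution of X, the number of repairs per day (from the question).
pmf = {0: 0.41, 1: 0.25, 2: 0.18, 3: 0.10, 4: 0.06}

# E(X): sum of each value weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# Var(X): probability-weighted squared deviations from the mean.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Conditioning on X > 0: drop x = 0 and rescale by P(X > 0) = 1 - P(X = 0).
p_positive = 1 - pmf[0]
conditional = {x: p / p_positive for x, p in pmf.items() if x > 0}

print(round(mean, 4), round(variance, 4))
print({x: round(p, 2) for x, p in conditional.items()})
```

Dividing by P(X > 0) = 0.59 is what makes the conditional probabilities sum to 1 again.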
8. Suppose that the daily number of errors a randomly-selected bank teller makes is denoted by X and
follows the distribution given in the table below. A human resource manager records the daily
numbers of errors of two randomly selected tellers. Denote the associated random variables by X1 and
X2. As the selection is random, X1 and X2 are independent and follow the same distribution as X. The
manager then computes the sample mean \(\bar{X} = \frac{X_1 + X_2}{2}\), where the sample size is n = 2.
x 0 1 2
P(X = x) 0.6 0.2 0.2
(a) Find the mean and variance of X1. Explain why we do not need to find the mean and variance of
X2 once we know those of X1.
\(E(X_1) = 0.6\); \(\mathrm{Var}(X_1) = 0.64\)
The mean and variance of X2 are the same because they have identical distributions.
(b) Since X1 and X2 are random, so is \(\bar{X}\). Find the mean and variance of the random variable \(\bar{X}\).
Compare these with the result from (a) and comment. Hint: you will find it useful to note that
\(\mathrm{Cov}(X_1, X_2) = 0\) because X1 and X2 are independent. This simplifies the evaluation of the variance
of the random variable \(\bar{X}\).
\(E(\bar{X}) = E\left[\frac{X_1 + X_2}{2}\right] = \frac{1}{2}E(X_1) + \frac{1}{2}E(X_2) = 0.6\)
\(\mathrm{Var}(\bar{X}) = \mathrm{Var}\left[\frac{X_1 + X_2}{2}\right] = \frac{1}{4}\mathrm{Var}(X_1) + \frac{1}{4}\mathrm{Var}(X_2) = 0.32\)
The means of \(\bar{X}\) and X are the same, and the variance of \(\bar{X}\) is the variance of X divided by 2 (the sample
size).
(c) Find the possible values that \(\bar{X}\) may take. Hence list the probability distribution of \(\bar{X}\) for
samples of size 2. (This is known as the sampling distribution of \(\bar{X}\).)
If n = 2 then the possible values for the mean are 0, 1/2, 1, 3/2, 2. Now we need to assign probabilities to each
outcome to produce the probability distribution for the sample mean.
The following table lists all possible outcomes and their associated probabilities:
X1, X2   Sample mean   Probability
0, 0     0             0.36
0, 1     1/2           0.12
0, 2     1             0.12
1, 0     1/2           0.12
1, 1     1             0.04
1, 2     3/2           0.04
2, 0     1             0.12
2, 1     3/2           0.04
2, 2     2             0.04

Sample mean   0      1/2    1      3/2    2
Probability   0.36   0.24   0.28   0.08   0.04
(d) Examine briefly what would happen if n = 3, 4, ...? For this last sub-question, you will need to
use the idea of a factorial of an integer n, labelled n!, which means n multiplied by every
positive integer smaller than itself. So, for example, 3! = 3 × 2 × 1 = 6. Also recall the
combinatorial formula for the number of ways of selecting x from n distinct objects (Sharpe
page 193): \(C_x^n = \frac{n!}{(n - x)! \, x!}\).
If n = 3, the possible values are 0, 1/3, 2/3, 1, 4/3, 5/3, 2. In combinatorial form, \(P(\bar{X} = 1/3) = C_1^3 (0.6)^2 (0.2) = 0.216\).
To understand this, note that the mean can only be 1/3 if two tellers make no errors and the remaining one
makes 1 error, and the combinatorial formula is used to account for the fact that the teller who makes 1
error can be the first, the second or the third sampled teller.
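The sampling distribution for any small n can also be obtained by brute-force enumeration of every ordered sample, which is a useful cross-check on the combinatorial reasoning. The sketch below is one way to do this with exact fractions; the function name `sampling_distribution` is an illustrative assumption.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Distribution of the daily number of errors for one teller (from the question).
pmf = {0: Fraction(6, 10), 1: Fraction(2, 10), 2: Fraction(2, 10)}


def sampling_distribution(pmf, n):
    """Exact distribution of the sample mean of n independent draws from pmf."""
    dist = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for draws in product(pmf, repeat=n):      # every ordered sample of size n
        prob = Fraction(1)
        for x in draws:
            prob *= pmf[x]                    # independence: multiply probabilities
        dist[Fraction(sum(draws), n)] += prob
    return dict(sorted(dist.items()))


dist2 = sampling_distribution(pmf, 2)
dist3 = sampling_distribution(pmf, 3)
mean2 = sum(m * p for m, p in dist2.items())
var2 = sum((m - mean2) ** 2 * p for m, p in dist2.items())

print({str(m): float(p) for m, p in dist2.items()})
print(float(mean2), float(var2), float(dist3[Fraction(1, 3)]))
```

Enumeration grows as 3 to the power n, so it is only practical for small samples, but it confirms both the n = 2 table and the n = 3 probability of a mean of 1/3.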
9. A student has enrolled in three courses in this semester. Let's call them courses A, B and C. Her
chances of passing each course are 0.8, 0.65, and 0.5, respectively. Passing each course is assumed to
be independent of passing other courses. Answer the following:
(b) What is the probability that this student passes exactly two courses? Express this question in
terms of probability statements, and then solve.
P(passing exactly two courses) = P(pass A & B but fail C) + P(pass A & C but fail B) + P(pass C & B but fail A)
\(= 0.8 \times 0.65 \times (1 - 0.5) + 0.8 \times 0.5 \times (1 - 0.65) + 0.65 \times 0.5 \times (1 - 0.8) = 0.465\)
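The same result can be obtained by enumerating all eight pass/fail outcomes and adding up the probabilities of those with exactly two passes. This sketch relies on the independence assumption stated in the question; the function name `prob_exactly` is illustrative.

```python
from itertools import product

# Probability of passing each course (from the question); assumed independent.
pass_prob = {"A": 0.8, "B": 0.65, "C": 0.5}


def prob_exactly(k):
    """Probability that exactly k of the courses are passed."""
    total = 0.0
    for outcome in product([True, False], repeat=len(pass_prob)):
        if sum(outcome) != k:
            continue
        p = 1.0
        for passed, prob in zip(outcome, pass_prob.values()):
            p *= prob if passed else 1 - prob  # multiply pass or fail probability
        total += p
    return total


print(round(prob_exactly(2), 4))
```

Summing `prob_exactly(k)` over k = 0, 1, 2, 3 should give 1, which is a simple check that no outcome was missed.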
(c) What is the probability that this student fails at least one course? Express this question in terms
of probability statements, and then solve.
Independence is likely to be an unreasonable assumption. Results are likely to be dependent (strong positive
association) because most of the variability in course outcomes across students is due to idiosyncratic
factors about the student him/herself i.e., working hard, being motivated, being of high academic ability.
The importance of these factors means that there is strong within-student correlation of marks in different
courses.
X can take on values 0, 1, 2, 3, or 4. Now we need all possible combinations that will produce each of
these outcomes.
Value of X   nCk possible combinations over n = 4 tosses
0            (TTTT) [4C0 = 1]
1            (HTTT) (THTT) (TTHT) (TTTH) [4C1 = 4]
2            (HHTT) (HTHT) (HTTH) (THHT) (THTH) (TTHH) [4C2 = 6]
3            (THHH) (HTHH) (HHTH) (HHHT) [4C3 = 4]
4            (HHHH) [4C4 = 1]
(nCk is the notation used in Sharpe, e.g., on page 221. Equivalent notation that is sometimes used is \(\binom{n}{k}\).)
Each of these combinations is equally likely because on any toss of a fair coin, P(H) = P(T) = 0.5 and
we're assuming outcomes are independent.
x          0        1      2       3      4
P(X = x)   0.0625   0.25   0.375   0.25   0.0625
(c) Consider a game where you win $5 for every head but lose $3 for every tail that appears in 4 tosses
of a fair coin. Let the variable Y denote the winnings from this game. Formulate the probability
distribution of Y based on the probability distribution of X.
The general formula for determining Y from X is Y = 5X − 3(4 − X). Plugging in, when X = 0, Y = −12 (you lose $12),
and so on. Hence:
y          -12      -4     4       12     20
P(Y = y)   0.0625   0.25   0.375   0.25   0.0625
(d) What is the expected value of Y? Would you like to play this game? If so, why? If not, why
not?
If you play the game enough times you would expect to win $4 per game on average. Thus, this is not a fair
game (since in a fair game, expected returns are zero) but it is biased towards the player. This is unlike
games in casinos where expected winnings are negative, meaning the game is biased towards the house.
Notice on any one play of the game you still might lose money and hence someone who is extremely risk
averse might not want to play the game even though on average, over many plays of the game, they should
win money.
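The distributions of X and Y and the expected winnings can be generated in a few lines. The sketch below counts combinations with `math.comb` and applies Y = 5X − 3(4 − X); the function name `winnings` is illustrative.

```python
from math import comb

n = 4  # number of tosses of a fair coin

# P(X = x): number of ways to get x heads, times the probability of each sequence.
x_pmf = {x: comb(n, x) * 0.5 ** n for x in range(n + 1)}


def winnings(x):
    """Winnings Y when x heads appear: win $5 per head, lose $3 per tail."""
    return 5 * x - 3 * (n - x)


# Each value of X maps to one value of Y, so Y inherits the probabilities of X.
y_pmf = {winnings(x): p for x, p in x_pmf.items()}
expected_y = sum(y * p for y, p in y_pmf.items())

print(x_pmf)
print(y_pmf)
print(expected_y)
```

Because Y is a linear function of X, the expected value can also be found as 5E(X) − 3(4 − E(X)) with E(X) = 2, which gives the same $4.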
Weeks 5 and 6
1. A random number generator is designed to draw numbers at random from within a specified range. We
can consider any number in the range as a possible outcome.
(a) What type of distribution is the random number generator drawing from?
A continuous uniform distribution.
(b) Suppose we program a random number generator to generate a random number with a value
falling in the interval [0, 2]. What is the height of the density of the distribution from which the
random number generator is drawing? Draw a graph of the probability density function.
\(f(y) = \frac{1}{2}\) for \(0 \le y \le 2\)
\(f(y) = 0\) otherwise
(c) What is the cumulative probability distribution of the random variable from which draws are being
taken? Draw a graph of the cumulative probability distribution function.
The cumulative probability distribution is F(y) = P(Y ≤ y), and its graph plots F(y) against y. From the above
graph, we can see F(0) = 0 and F(2) = 1. Since the probability is increasing uniformly the graph must be a
straight line with an upward slope (since probability cannot be negative) increasing from the point
(y, F(y)) = (0, 0) to (y, F(y)) = (2, 1). Specifically, F(y) = 0.5y in the range (0, 2). If y < 0, F(y) = 0 and if y > 2,
F(y) = 1.
(d) Find the following for this case: P(Y < 0.8); P(Y ≤ 0.8); P(0.5 < Y < 1.5), using both the density
function and the cumulative probability function. Show that your answers match whichever
you use.
\(P(Y < 0.8) = 0.8 \times 0.5 = 0.4\)
\(P(Y \le 0.8) = P(Y < 0.8) = 0.4\)
\(P(0.5 < Y < 1.5) = 1 \times 0.5 = 0.5\)
Whether you get these values from the uniform probability density function as given here, or from F(y) = 0.5y
(the cumulative probability distribution), the results are identical:
\(P(Y < 0.8) = F(0.8) = 0.5 \times 0.8 = 0.4\) and
\(P(0.5 < Y < 1.5) = F(1.5) - F(0.5) = 0.75 - 0.25 = 0.5\).
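The agreement between the two approaches can be illustrated in code. The sketch below defines the density and cumulative distribution of a uniform variable on [0, 2] and computes each probability both as an area under the density (width times height) and as a difference of cumulative values; all function names are illustrative.

```python
LOW, HIGH = 0.0, 2.0  # interval from part (b)


def pdf(y):
    """Density of the continuous uniform distribution on [LOW, HIGH]."""
    return 1 / (HIGH - LOW) if LOW <= y <= HIGH else 0.0


def cdf(y):
    """Cumulative distribution F(y) = P(Y <= y)."""
    if y < LOW:
        return 0.0
    if y > HIGH:
        return 1.0
    return (y - LOW) / (HIGH - LOW)


def area_under_pdf(a, b):
    """P(a < Y < b) as width times height, clipping [a, b] to the support."""
    a, b = max(a, LOW), min(b, HIGH)
    return max(b - a, 0.0) * pdf((a + b) / 2)


for a, b in [(0.0, 0.8), (0.5, 1.5)]:
    print(a, b, area_under_pdf(a, b), cdf(b) - cdf(a))
```

For a continuous variable a single point has probability zero, which is why P(Y ≤ 0.8) and P(Y < 0.8) are equal.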
2. From several years' records, a fish market manager has determined that the weight of deep sea bream sold
in the market (X) is approximately normally distributed with a mean of 420 grams and a standard
deviation of 80 grams. Assuming this distribution will remain unchanged in the future, calculate the
expected proportions of deep sea bream sold over the next year weighing
a) between 300 and 400 grams.
\(P(300 < X < 400) = P\left(\frac{300 - 420}{80} < Z < \frac{400 - 420}{80}\right)\)
\(= P(-1.5 < Z < -0.25)\)
\(= P(-\infty < Z < -0.25) - P(-\infty < Z < -1.5)\)
\(= 0.4013 - 0.0668\)
\(= 0.3345\)
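As a check on the table lookup, the same probability can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The variable names are illustrative.

```python
from statistics import NormalDist

# Weight of deep sea bream: approximately normal, mean 420 g, sd 80 g.
weight = NormalDist(mu=420, sigma=80)

# Standardise the two limits, as in the worked solution.
z_low = (300 - 420) / 80
z_high = (400 - 420) / 80

# P(300 < X < 400) as a difference of cumulative probabilities.
p_between = weight.cdf(400) - weight.cdf(300)

print(z_low, z_high, round(p_between, 4))
```

The computed value should agree with the table-based answer to about four decimal places; any small difference comes from rounding in the printed tables.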
3. In a certain large city, household annual incomes are considered approximately normally distributed with
a mean of $40,000 and a standard deviation of $6,000. What proportion of households in the city have
an annual income over $30,000? If a random sample of 60 households were selected, how many of these
households would we expect to have annual incomes between $35,000 and $45,000?
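Let X denote household annual income, i.e., \(X \sim N(40000, 6000^2)\).
\(P(X > 30000) = P\left(Z > \frac{30000 - 40000}{6000}\right)\)
\(= P(Z > -1.67)\)
\(= 1 - P(-\infty < Z < -1.67)\)
\(= 1 - 0.0475\)
\(= 0.9525\)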
So 95.25% of households in the city would be expected to have annual incomes greater than $30,000.
For the second part, \(P(35000 < X < 45000) = P(-0.83 < Z < 0.83) = 0.7967 - 0.2033 = 0.5934\).
Therefore we expect 0.5934(60) ≈ 36 households in the sample to have annual incomes between $35,000
and $45,000.
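Both parts of this question can be recomputed without tables. The sketch below uses `statistics.NormalDist`; because it does not round z to two decimal places, its output differs slightly from the table-based figures (roughly 0.952 rather than 0.9525, and roughly 0.595 rather than 0.5934), while the expected count still rounds to 36. Variable names are illustrative.

```python
from statistics import NormalDist

# Household annual income: approximately normal, mean $40,000, sd $6,000.
income = NormalDist(mu=40_000, sigma=6_000)

# Proportion of households with income over $30,000.
p_over_30k = 1 - income.cdf(30_000)

# Proportion with income between $35,000 and $45,000.
p_35k_to_45k = income.cdf(45_000) - income.cdf(35_000)

# Expected number of such households in a random sample of 60.
sample_size = 60
expected_count = sample_size * p_35k_to_45k

print(round(p_over_30k, 4), round(p_35k_to_45k, 4), round(expected_count, 1))
```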
Let x be the required percentile. First find z, the 25th percentile of a standard normal.
For \(X \sim N(10, 9)\), the 25th percentile x satisfies:
\(\frac{x - 10}{3} = -0.67\)
\(x = 7.99\).
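A percentile of a normal distribution can also be obtained from the inverse cumulative distribution function. The sketch below assumes, as in the solution above, a mean of 10 and a variance of 9 (standard deviation 3); it uses the unrounded z-value, so the result is about 7.98 rather than the table-based 7.99.

```python
from math import sqrt
from statistics import NormalDist

mean, variance = 10, 9
x_dist = NormalDist(mu=mean, sigma=sqrt(variance))  # sd = 3

# 25th percentile of the standard normal, then of X.
z_25 = NormalDist().inv_cdf(0.25)
x_25 = x_dist.inv_cdf(0.25)

print(round(z_25, 4), round(x_25, 4))
```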
5. In a certain city, it is estimated that 40% of households have access to the internet. A company wishing
to sell services to internet users randomly chooses 150 households in the city and sends them advertising
material.
(a) Calculate the probability that fewer than 60 contacted households have internet access.
Let X be the number of households contacted that have internet access. Then assume X is a binomial
random variable with n=150 and p=0.4. Because n is large, we can use the normal approximation to the
binomial where:
\(\mu = np = 150 \times 0.4 = 60\)
\(\sigma^2 = np(1 - p) = 150 \times 0.4 \times 0.6 = 36\)
(b) Calculate the probability that between 50 and 100 (inclusive) contacted households have
internet access.
(c) There is a 90% chance (probability of .9) that the number of contacted households with internet
access equals or exceeds what value?
\(P(X \ge x) = 0.9\)
i.e., with the continuity correction, \(P(X > x - 0.5) = 0.9\)
\(P\left(Z > \frac{x - 0.5 - 60}{6}\right) = 0.9\)
\(P\left(-\infty < Z < \frac{x - 60.5}{6}\right) = 0.1\)
\(\frac{x - 60.5}{6} = -1.28 \Rightarrow x = 52.82\)
There is a 90% chance that the number of contacted households with internet access is 52 or more.
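The cutoff in part (c) can be reproduced with the normal approximation and then compared with the exact binomial tail. The sketch below is illustrative: `binom_tail` is a hypothetical helper that sums the binomial probabilities directly, and the cutoff uses the same continuity correction as the worked solution.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 150, 0.4
mu = n * p                      # 60
sigma = sqrt(n * p * (1 - p))   # 6

# Normal approximation: solve (x - 0.5 - mu) / sigma = z, where P(Z < z) = 0.10.
z_10 = NormalDist().inv_cdf(0.10)
x_cutoff = mu + 0.5 + z_10 * sigma


def binom_tail(k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


print(round(z_10, 4), round(x_cutoff, 2))
print(round(binom_tail(52), 4))
```

Rounding the cutoff down to 52 keeps the probability of reaching it at 0.9 or above, which is why the answer is stated as 52 or more.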
(b) Do you need to manipulate the raw data provided, before proceeding to statistical analyses, in
order to address the client's question? If so, how?
8. UNSW wants to measure the attractiveness of its brand to potential students. The university performs
an experiment by inviting 100 high school students from different public schools across New South
Wales to browse a few websites related to different universities, and then to choose the one that they
would prefer most.
(a) Is this a random sample? Can you think of any potential source of selection bias?
The sample is not perfectly random. First of all, only students in NSW are sampled, and
therefore the attitudes of students from other states of Australia and overseas students are
missed. Also the students are all coming from public schools, and public school graduates
might have different aspirations or expectations compared to private school graduates.
(b) Suppose that a perfectly random sample of students is drawn from the target population, and
these students take part in the exercise described above. Can you think of any confounding
factors, that is, factors that might lead to lack of confidence in using students' expressed
preferences, as measured in this exercise, as an indicator of their degree of overall attraction to
the UNSW brand?
Even if the sample is perfectly random, universities have different qualities in different fields.
For instance, UNSW engineering and science might be leading faculties, but the medical
faculty might not be the top. A student's choice of a university does not only depend on the
attractiveness of the university as a whole, but also on whether it is a leader in the
student's particular field of interest. While part of the appeal of the university as a whole may
be due to such field-specific factors, the stated preference data alone cannot be used to
distinguish these factors from other factors purely related to the overall appeal of the
university's brand. As another example, Australian students have historically been reluctant to
travel in order to attend university, so proximity (which is not what the question originally
targets) has also been an important factor in determining university choice.
as n, the sample size, rises, the sampling distribution (which can only be imagined or drawn conditional
on n!) of the parameter starts to look more and more like the normal distribution.
Weeks 7 and 8
1. Suppose a normally distributed random variable X has a mean of 50 and a variance of 100. Also suppose a
sample of size 16 is drawn from this population. Calculate the following probabilities:
(a) $P(40 < X < 55)$
$= P\left(\frac{40-50}{10} < Z < \frac{55-50}{10}\right)$
$= P(-1 < Z < 0.5)$
$= P(Z < 0.5) - P(Z < -1)$
$= 0.6915 - 0.1587$
$= 0.5328$
(b) $P(40 < \bar{X} < 55)$
$= P\left(\frac{40-50}{10/4} < Z < \frac{55-50}{10/4}\right)$
$= P(-4 < Z < 2)$
$= P(Z < 2) - P(Z < -4)$
$= 0.9772 - 0$
$= 0.9772$
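As a cross-check on the table look-ups, the two probabilities can be evaluated with Python's standard library. This is only a sketch; the variable names are illustrative. The single observation has standard deviation 10, while the mean of 16 observations has standard deviation 10/4.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50, 10, 16

X = NormalDist(mu, sigma)                # distribution of a single observation
Xbar = NormalDist(mu, sigma / sqrt(n))   # sampling distribution of the mean

p_single = X.cdf(55) - X.cdf(40)         # part (a)
p_mean = Xbar.cdf(55) - Xbar.cdf(40)     # part (b)

print(round(p_single, 4), round(p_mean, 4))
```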
2. Recall the Anzac Garage data used previously. These data are available from the course website (in
the Tutorial Questions and Information folder) in an Excel file called Anzacg.xls. Use these 117
observations on used passenger cars to find the 95% confidence interval for the population mean
distance travelled by used passenger cars (this variable is labelled odometer in the data set and is
measured in kilometres). Assume the population standard deviation is 60,000kms.
Since n = 117 is large, we invoke the central limit theorem: $\bar{X} \sim N\left(\mu, \frac{60{,}000^2}{117}\right)$, at least
approximately.
Using Excel, we find the sample mean is 78,561 kms. The 95% confidence interval is given by
$$\bar{x} \pm z_{0.025}\frac{\sigma}{\sqrt{n}} = 78{,}561 \pm 1.96 \times \frac{60{,}000}{\sqrt{117}}$$
$$= 78{,}561 \pm 10{,}872$$
$$= (67{,}689,\ 89{,}433)$$
The calculated interval is one of the possible realizations of the 95% confidence interval. In repeated
sampling, 95% of intervals calculated in this way would contain the true $\mu$.
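The interval can be reproduced with a short Python sketch. The function name `z_interval` is hypothetical, and the sample mean of 78,561 is taken from the Excel output quoted above rather than recomputed from the data file.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

lo, hi = z_interval(xbar=78_561, sigma=60_000, n=117)
print(round(lo), round(hi))
```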
3. What would be the effects on the width of the confidence interval calculated in the previous question
of:
(a) a decrease in the level of confidence used?
Decreases width
(b) an increase in sample size?
Decreases width
(c) an increase in the population standard deviation?
Increases width
(d) an increase in the sample standard deviation?
No effect on the width since we are told the population standard deviation.
(e) an increase in the value of $\bar{x}$ found?
No effect on the width.
4. Again referring to the data in odometer from Anzacg.xls and the population from which it is drawn,
determine the sample size required to estimate the population mean to within 5,000 kms with 90%
confidence. Again assume the population standard deviation is 60,000kms.
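$z_{\alpha/2} = z_{0.05} = 1.645$, $B = 5{,}000$, $\sigma = 60{,}000$
where B is the size of the margin of error on either side of the point estimate. Sample size
calculation:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2 = \left(\frac{1.645 \times 60{,}000}{5{,}000}\right)^2 = 389.67$$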
A sample of 390 would be required.
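A minimal Python sketch of the same calculation follows. The function name `required_n` is hypothetical, and the exact quantile is used instead of the rounded 1.645, so the unrounded value differs slightly from 389.67 while the rounded-up sample size is unchanged.

```python
from math import ceil
from statistics import NormalDist

def required_n(sigma, margin, conf):
    """Unrounded sample size so that the interval is xbar +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return (z * sigma / margin) ** 2

raw = required_n(sigma=60_000, margin=5_000, conf=0.90)
n_needed = ceil(raw)   # always round up to guarantee the stated precision
print(round(raw, 2), n_needed)
```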
5. Perform the following hypothesis tests of the population mean. In each case, draw a picture to illustrate
the rejection regions on both the Z and $\bar{X}$ distributions, and calculate the p-value of the test.
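(a) $H_0: \mu = 50$, $H_1: \mu > 50$, $n = 100$, $\bar{x} = 55$, $\sigma = 10$, $\alpha = 0.05$
Rejection region:
$$Z = \frac{\bar{X} - 50}{10/\sqrt{100}} > z_{0.05} = 1.645$$
Alternatively,
$$\bar{X} > \bar{x}_c = \mu_0 + z_{0.05}\frac{\sigma}{\sqrt{n}} = 50 + 1.645\left(\frac{10}{\sqrt{100}}\right) = 51.645$$
Since
$$z = \frac{55 - 50}{10/\sqrt{100}} = 5 > z_{0.05} = 1.645,$$
we can reject $H_0$ and conclude, at the 5% level of significance, that the population mean is greater than 50.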
[Figure: distribution of $\bar{X}$ under $H_0$, centred at 50, with the rejection region (area 0.05) to the right of 51.645; and the Z distribution, centred at 0, with the rejection region (area 0.05) to the right of 1.645.]
p-value $= P(Z > 5) \approx 0.0000$.
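(b) $H_0: \mu = 25$, $H_1: \mu < 25$, $n = 100$, $\bar{x} = 24$, $\sigma = 5$, $\alpha = 0.1$
Rejection region:
$$Z = \frac{\bar{X} - 25}{5/\sqrt{100}} < -z_{0.1} = -1.28$$
Alternatively,
$$\bar{X} < \bar{x}_c = \mu_0 - z_{0.1}\frac{\sigma}{\sqrt{n}} = 25 - 1.28\left(\frac{5}{\sqrt{100}}\right) = 24.36$$
Since
$$z = \frac{24 - 25}{5/\sqrt{100}} = -2 < -z_{0.1} = -1.28,$$
we can reject $H_0$ and conclude, at the 10% level of significance, that the population mean is less than 25.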
[Figure: distribution of $\bar{X}$ under $H_0$, centred at 25, with the rejection region (area 0.1) to the left of 24.36; and the Z distribution, centred at 0, with the rejection region (area 0.1) to the left of -1.28.]
p-value $= P(Z < -2) = 0.0228$
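(c) $H_0: \mu = 80$, $H_1: \mu \ne 80$, $n = 100$, $\bar{x} = 80.5$, $\sigma = 4$, $\alpha = 0.05$
Rejection region:
$$Z = \frac{\bar{X} - 80}{4/\sqrt{100}} < -z_{0.025} = -1.96 \quad \text{or} \quad Z > z_{0.025} = 1.96$$
Alternatively:
$$\bar{X} < \bar{x}_{c,L} = \mu_0 - z_{0.025}\frac{\sigma}{\sqrt{n}} = 80 - 1.96\left(\frac{4}{\sqrt{100}}\right) = 79.216$$
or
$$\bar{X} > \bar{x}_{c,U} = \mu_0 + z_{0.025}\frac{\sigma}{\sqrt{n}} = 80 + 1.96\left(\frac{4}{\sqrt{100}}\right) = 80.784$$
Since
$$z = \frac{80.5 - 80}{4/\sqrt{100}} = 1.25$$
is neither less than -1.96 nor greater than 1.96, we do not reject $H_0$ at the 5% level of significance.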
[Figure: distribution of $\bar{X}$ under $H_0$, centred at 80, with rejection regions (area 0.025 each) to the left of 79.216 and to the right of 80.784; and the Z distribution, centred at 0, with rejection regions (area 0.025 each) to the left of -1.96 and to the right of 1.96.]
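The three tests can be checked numerically with the Python sketch below. The helper `z_test` is a hypothetical name, and the inputs for each case are the values that appear in the worked rejection regions above. It also reports the two-tailed p-value for the third test, which is twice the upper-tail area beyond |z|.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, tail):
    """Return (z, p_value) for a test of a mean with known sigma.

    tail is "upper", "lower" or "two".
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "lower":
        p_value = std_normal.cdf(z)
    else:  # two-tailed: double the area beyond |z|
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

z_a, p_a = z_test(55, 50, 10, 100, "upper")
z_b, p_b = z_test(24, 25, 5, 100, "lower")
z_c, p_c = z_test(80.5, 80, 4, 100, "two")

for label, z, p in [("upper", z_a, p_a), ("lower", z_b, p_b), ("two", z_c, p_c)]:
    print(label, round(z, 2), round(p, 4))
```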
6. A real estate expert claims the current mean value of houses in a particular area is more than $250,000.
A random sample of 150 recent sales prices in the area yields a sample mean of $265,000. It is known
that house values in the area are approximately normally distributed with a standard deviation of
$50,000.
(a) Perform an upper tail test of the null hypothesis that the population mean house value in the
area is $250,000. Use a 5% level of significance and state the rejection (critical) region in
terms of both $\bar{x}$ and z.
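Let X = value of a house in the area.
$\bar{x} = \$265{,}000$, $\sigma = \$50{,}000$, $X \sim N(\mu, \sigma^2)$
We wish to test
$H_0: \mu = 250{,}000$; $H_1: \mu > 250{,}000$
Rejection region:
$$Z = \frac{\bar{X} - 250{,}000}{50{,}000/\sqrt{150}} > z_{0.05} = 1.645$$
or
$$\bar{X} > \bar{x}_c = \mu_0 + z_{0.05}\frac{\sigma}{\sqrt{n}} = 250{,}000 + 1.645\left(\frac{50{,}000}{\sqrt{150}}\right) \approx 256{,}715.68$$
Since
$$z = \frac{265{,}000 - 250{,}000}{50{,}000/\sqrt{150}} = 3.67 > z_{0.05} = 1.645,$$
we reject $H_0$ and conclude, at the 5% level of significance, that the mean house value in the area is more
than $250,000.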
The nature of the research problem dictates an upper tail test. In this case we will not believe the
expert's claim unless there is significant sample evidence to do so. The claim itself implies the
possibility of an alternative above the conservative number one would otherwise guess, which in
turn implies an upper tail test.
(c) What is the p-value associated with the test statistic used in the part (a) test? Interpret this
value.
The p-value is the probability of obtaining a test statistic as or more extreme than the realized
value, assuming the null hypothesis is true, from a sample of the given size. The lower the p-value,
the greater is the evidence for rejection of the null hypothesis. In this case it is very unlikely to find
a sample mean as extreme as $265,000 in a sample of 150 observations if in reality the population
mean is $250,000.
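The numerical p-value is not shown above. A minimal Python sketch, using only the figures given in the question, is below; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma, n = 250_000, 265_000, 50_000, 150

se = sigma / sqrt(n)                  # standard error of the sample mean
z = (xbar - mu0) / se                 # test statistic, about 3.67
p_value = 1 - NormalDist().cdf(z)     # upper-tail area beyond z

print(round(z, 2), p_value)
```

The resulting p-value is far below 0.05, which agrees with the rejection of the null hypothesis in part (a).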
(d) Define in words the type I and II errors that could afflict the part (a) test.
Type I Error: Concluding that average housing price is more than $250,000, when in fact it is
really $250,000.
Type II Error: Not rejecting the claim that average housing price is $250,000, when in fact it is
really more.
Note that the exact probability of a Type II error cannot be determined without specifying an
exact alternative hypothesis.
7. What effect does increasing the sample size have on the outcome of a hypothesis test? Explain your
answer using the example of a one-tail test concerning the mean of a normally distributed population
with known variance.
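Under $H_0$:
$$X \sim N(\mu_0, \sigma^2) \;\Rightarrow\; \bar{X} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right) \;\Rightarrow\; Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$$
The point $z_\alpha$ on N(0,1) corresponds to the point $\bar{x}_c = \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}$ on the distribution of $\bar{X}$ under $H_0$.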
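[Figure: distribution of $\bar{X}$ under $H_0$, centred at $\mu_0$, with the rejection region to the right of the cutoff $\mu_0 + z_\alpha \sigma/\sqrt{n}$.]
But suppose the true $\mu$ is to the right of $\mu_0$. Then the true distribution of $\bar{X}$ is, say:
[Figure: true distribution of $\bar{X}$, centred to the right of $\mu_0$, with the area to the right of the same cutoff $\mu_0 + z_\alpha \sigma/\sqrt{n}$ shaded.]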
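The shaded area in the above diagram gives the probability of correctly rejecting $H_0$ (i.e. the power,
$1 - \beta$, which is greater than the area under the tail and beyond the cutoff point on the first graph).
As n increases, $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$ decreases and hence the standard deviation of $\bar{X}$, $\sigma/\sqrt{n}$, decreases.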
Suppose the new sample size is $n_1 > n$. The distribution of $\bar{X}$ under $H_0$ will now look something like:
[Figure: narrower distribution of $\bar{X}$ under $H_0$, centred at $\mu_0$, with the rejection region to the right of the cutoff $\mu_0 + z_\alpha \sigma/\sqrt{n_1}$.]
Note that with a fixed $\alpha$ the rejection region cutoff is now smaller (we have to fit the same amount
of probability density in above the cutoff, but because of this distribution's lower variance than in the
initial case, there is less density in the tails so we have to move our cutoff point closer to the centre
of the distribution in order to capture enough probability density). Again, if the true $\mu$ is actually to
the right of $\mu_0$, the probability of rejecting the same incorrect null hypothesis is higher than before.
Diagrammatically the true distribution of $\bar{X}$ will be, say:
[Figure: true distribution of $\bar{X}$ with sample size $n_1$, centred to the right of $\mu_0$, with the area to the right of the cutoff $\mu_0 + z_\alpha \sigma/\sqrt{n_1}$ shaded.]
Again the shaded area in the above diagram gives the probability of correctly rejecting $H_0$.
Conclusion: The probability of correctly rejecting a false $H_0$ (the power of the test) increases as n
increases, given we keep the Type I error probability ($\alpha$) fixed.
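The conclusion can be illustrated numerically. The sketch below is not part of the original solution: the function name `power_upper_tail` and the values $\mu_0 = 50$, true $\mu = 52$, $\sigma = 10$ are hypothetical, chosen only to show the power of an upper-tail test rising with n while $\alpha$ stays at 0.05.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of an upper-tail z-test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # Cutoff on the X-bar scale: mu0 + z_alpha * sigma / sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability that X-bar exceeds the cutoff under the true distribution
    return 1 - NormalDist(mu_true, se).cdf(cutoff)

# Hypothetical values for illustration only
powers = {n: power_upper_tail(mu0=50, mu_true=52, sigma=10, n=n) for n in (25, 100, 400)}
for n, pw in powers.items():
    print(n, round(pw, 3))
```

Each fourfold increase in n halves the standard error, which pulls the cutoff towards $\mu_0$ and concentrates the true distribution above it.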
8. Project Review: For the course project, you are only expected to use statistical methods covered in
lectures up to and including those in Week 9. Thus you should now have sufficient material to complete
What might be useful at this stage is to think about presentation. See the statistical report section of
the Project folder on Moodle for some ideas in general. As a directed exercise for this tutorial, compare
and contrast the presentation of material in the NSW BOCSAR report on driving under the influence
of cannabis (driving-cannabis.pdf) and Queensland Office of Economic and Statistical Research
bulletin on computer and internet usage in Queensland (computer-internet-useage-qld-c01.pdf). You
should be able to read these reports comfortably, although there are a few methods that may be
unfamiliar in the cannabis report (although these methods will be covered later in the course).
(a) What are some key differences in the presentation of results in the 2 reports?
In the observations below, DC = driving under the influence of cannabis report and CIU =
computer and internet use report.
(b) Which is better? What criteria do you use to determine which is better?
For example, do we really need Figure 11 in CIU where we have only 6 numbers to report? Also see
the text associated with Figure 11 where several other results are reported that cannot be deduced
from the figure (comparison of reported skills by age).
(c) What are some key similarities in the overall presentation in the 2 reports?
Both have a non-technical summary (recall the need for an Executive Summary in the project).
Both should be relatively clear and easy to read.
Both have an absence of intermediate calculations as would be present in a tutorial problem. This
is entirely appropriate for reports of this type and hence the project.
The above is not meant to be comprehensive.
Weeks 9 and 10
1. State whether the normal distribution, the t distribution, or neither would be the right type of sampling
distribution to assume for the sample mean in order to test hypotheses regarding the population mean
in the following situations:
(a) Population variable normally distributed, $\sigma^2$ unknown, sample size less than 30.
t-distribution, because of the unknown population variance and the low sample size.
(b) Population variable normally distributed, $\sigma^2$ unknown, sample size greater than 30.
t-distribution, although as the sample size gets very large this effectively becomes the same as
using the normal.
(c) Population variable normally distributed, $\sigma^2$ known, sample size less than 30.
Normal distribution because we know the population variance. In this case, even in a small
sample, the sampling distribution of the mean follows a normal distribution.
(d) Population variable not normally distributed, $\sigma^2$ unknown, sample size greater than 30.
Even though the population variable is not normally distributed, because the sample size is large
you can invoke the CLT and use the fact that $s^2$ is a consistent estimator of $\sigma^2$ to justify using the
normal distribution.
(e) Population variable not normally distributed, $\sigma^2$ unknown, sample size less than 30.
Here the sampling distribution of the mean is unknown, and hence we don't know how to test a
hypothesis about $\mu$ in this circumstance. In practice you could either assume the population variable
is approximately normally distributed and proceed as in (a); or alternatively invoke the CLT and
proceed as in (d). How well either of these solutions works ultimately depends on the extent of
non-normality of the distribution of the variable in the population, which is not specified in the question.
2. Reconsider the example used earlier in the course in which a real estate expert claimed the current
mean value of houses in a particular area was more than $250,000. A random sample of 150 recent
sales prices in the area yielded a sample mean of $265,000, and it is known that house values in the
area are approximately normally distributed with a standard deviation of $50,000.
(a) If in fact the population mean house value in the area is $260,000, what is the probability of
committing a type II error in performing an upper-tail test of the null hypothesis that the mean
house value price in the area is $250,000, as was done in Part (a) of the prior week's exercise?
What is the power of the test in these circumstances? State in words what the power of the test
means.
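Rejection region:
$$\bar{X} > \bar{x}_c = \mu_0 + z_{0.05}\frac{\sigma}{\sqrt{n}} = 250{,}000 + 1.645\left(\frac{50{,}000}{\sqrt{150}}\right) \approx 256{,}715.68$$
Thus the Type II error probability (probability of not rejecting $H_0$ when it is false) is:
$$\beta = P(\bar{X} < 256{,}715.68 \mid \mu = 260{,}000) = P\left(Z < \frac{256{,}715.68 - 260{,}000}{50{,}000/\sqrt{150}}\right) = P(Z < -0.80) = 0.2119$$
Power $= 1 - \beta = 0.7881$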
The power of the test gives the probability of correctly rejecting the null hypothesis when it is false. Note
that power cannot be specified unless we specify a particular alternative around which we assume the
sampling distribution of our statistic to be centred. For this reason (the dependence of power on the
exact alternative being considered), power is sometimes defined as the ability of a test to detect a
particular alternative.
(b) Illustrate your answer to part (a) above by showing on a diagram the areas representing the
probability of a type II error and the power of the test.
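[Figure: sampling distributions of $\bar{X}$ under $H_0: \mu = 250{,}000$ and under $\mu = 260{,}000$, with the cutoff $\bar{x}_c = \$256{,}715.68$ marked between the two centres. Under the $\mu = 260{,}000$ curve, the area to the left of the cutoff is $\beta$ and the area to the right is the power, $1 - \beta$.]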
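The Type II error probability and the power in part (a) can be checked with the following Python sketch (variable names are illustrative). Because it uses exact normal quantiles and probabilities rather than z rounded to two decimal places for the table, the power comes out near 0.789 instead of the table-based 0.7881.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_alt, sigma, n, alpha = 250_000, 260_000, 50_000, 150, 0.05

se = sigma / sqrt(n)
cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se   # rejection cutoff for X-bar

# beta: probability X-bar falls below the cutoff when the true mean is mu_alt
beta = NormalDist(mu_alt, se).cdf(cutoff)
power = 1 - beta

print(round(cutoff, 2), round(beta, 4), round(power, 4))
```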
3. A company running an urban rail service wishes to estimate its daily average number of late-running
trains on weekdays. For 10 randomly selected weekdays, it finds the following numbers of late running
trains:
(a) Assuming the number of late running trains on a weekday is approximately normally
distributed, calculate a 90% confidence interval for the mean number of late-running trains on
a weekday.
Since $\sigma^2$ is unknown, n is small and the underlying distribution of the variable in the population is
(approximately) normal, we construct the confidence interval using the t distribution. The required
interval is:
$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} = 17.9 \pm t_{0.05,\,9}\frac{6.9514}{\sqrt{10}}$$
$$= 17.9 \pm 1.833 \times \frac{6.9514}{\sqrt{10}}$$
$$= 17.9 \pm 4.029$$
$$= (13.871,\ 21.929)$$
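The arithmetic can be reproduced from the summary statistics with the Python sketch below. The standard library has no t distribution, so the critical value 1.833 is passed in as read from the t table; the function name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for a mean using a t critical value from tables."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# t_{0.05, 9} = 1.833 is the table value used in the solution above
lo, hi = t_interval(xbar=17.9, s=6.9514, n=10, t_crit=1.833)
print(round(lo, 3), round(hi, 3))
```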
(b) If we did not have the assumption of normality, could we still calculate a confidence interval
in this example? If not, suggest a way of overcoming this problem.
Everything else the same, we could not construct a confidence interval in the same way as in (a) since the
t distribution is only valid if the underlying distribution of the variable in the population is normal. This
problem could be overcome by obtaining a larger sample size and then making use of the central limit
theorem (and still using s instead of $\sigma$).
4. Reconsider the question from a previous week that used the Anzac Garage data, available from the
course website (in the Tutorial Questions and Information folder) in an Excel file called Anzacg.xls.
Would normality be a good approximation for the population distribution of distance travelled by used
passenger cars? (Hint: look at the summary statistics and a histogram.) Do you need to assume
normality? Redo the 95% confidence interval for the population mean distance travelled by used
passenger cars without assuming a known population standard deviation.
Excel summary statistics and histogram for distance traveled indicate non-normality. The distribution is
skewed to the right, the median is much less than the mean, and the sample mean is only 1.35 standard
deviations from zero:
Odometer (km)
Mean 78560.83
Standard Error 5384.86
Median 67980
Mode 147000
Standard Deviation 58246.19
Sample Variance 3392618896
Kurtosis 3.426
Skewness 1.528
Range 315597
Minimum 403
Maximum 316000
Sum 9191617
Count 117
[Figure: histogram of Odometer (kms), showing frequency against odometer reading from 20,000 to 300,000 kms.]
While the population distribution seems non-normal, the sample size is large enough to invoke the CLT
and hence to assume the sample mean is approximately normally distributed.
In a previous question using these data we assumed $\sigma$ was known, but here we consider the more likely
situation where it is unknown and we replace $\sigma$ by s as calculated by Excel. The 95% confidence interval
is given by:
$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} = 78{,}561 \pm 1.96 \times \frac{58{,}246}{\sqrt{117}}$$
$$= 78{,}561 \pm 10{,}554$$
$$= (68{,}007,\ 89{,}115)$$
5. It is known that 80% of people suffering from a particular disease are cured by a certain standard
medication. Test the claim of the developers of a new medication that their product is more effective
than the standard medication in curing the disease, using a 5% significance level, given a random
sample of 400 people with the disease of whom 330 are cured by using the new medication. (Hint:
Use the normal approximation, and ignore the continuity correction.)
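$H_0: p = 0.8$, $H_1: p > 0.8$, $n = 400$, $\alpha = 0.05$ and $\hat{p} = \frac{330}{400} = 0.825$
Under $H_0$, $np = 320$ and $n(1-p) = 80$ are both large. Therefore we can use the normal approximation to the binomial. Under $H_0$:
$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \;\Rightarrow\; \hat{p} \sim N\left(0.8, \frac{0.8 \times 0.2}{400}\right)$$
So, ignoring the continuity correction, calculate the empirical significance level, or p-value:
$$P(\hat{p} > 0.825) = P\left(Z > \frac{0.825 - 0.8}{\sqrt{(0.8 \times 0.2)/400}}\right) = P(Z > 1.25) = 0.1056$$
Because the p-value > $\alpha$ (0.1056 > 0.05) we do not reject $H_0$ and instead we conclude that there is not
enough evidence to support the developers' claim of a more effective cure.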
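A short Python sketch of this proportion test follows (variable names are illustrative). As the hint instructs, it uses the normal approximation without a continuity correction, and the standard error is evaluated under $H_0$.

```python
from math import sqrt
from statistics import NormalDist

n, cured, p0, alpha = 400, 330, 0.8, 0.05

p_hat = cured / n                       # sample proportion, 0.825
se0 = sqrt(p0 * (1 - p0) / n)           # standard error under H0, 0.02
z = (p_hat - p0) / se0                  # test statistic
p_value = 1 - NormalDist().cdf(z)       # upper-tail p-value

reject = p_value < alpha
print(round(z, 2), round(p_value, 4), reject)
```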
6. Download the data "Credit_Card_Bank" from the MyStatLab website (available under the heading
of "Chapter 1: Data and Decisions"). Using the variables "Offer Status" and "Spendlift Positive",
conduct the appropriate Chi-squared test to determine whether there is a relationship between
the type of offer a customer was exposed to and whether a lift in spending was observed, assuming a
significance level of 0.05. Interpret your results.
To determine the chi-squared value, we first need to construct a table of expected values, which is as
follows:
Offer / Lift                            No                               Yes     TOTAL
Double Miles + Free Flight Insurance    (29/100)*(57/100)*100 = 16.53    12.47   29
Free Flight Insurance                   15.39                            11.61   27
No Offer                                11.97                            9.03    21
Rtl w/o Enr                             13.11                            9.89    23
TOTAL                                   57                               43      100
Summing $(O - E)^2/E$ over all cells, where O is the observed count and E is the expected count from the
table above, yields the Chi-sq statistic of 4.2974283. Using the Chi-sq critical value for alpha = .05 and
df = (r-1)*(c-1) = (4-1)*(2-1) = 3, which (consulting the table at the back of the book) is equal to 7.815,
we find that our statistic falls into the non-critical region. The p-value associated with our statistic is
0.231. We fail to reject the null hypothesis that the two variables (the offer type, and whether or not
there was a spend lift) are statistically independent.
The data provide no significant evidence that whether a spend lift is obtained depends on the type of
offer received.
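The expected counts and the quoted p-value can be checked with the Python sketch below. The observed counts are not reproduced in this solution, so the statistic 4.2974283 is taken as given rather than recomputed. The helper `chi2_sf_df3` is a hypothetical name; it uses the closed-form upper-tail probability that holds for 3 degrees of freedom only, which avoids needing a statistics package.

```python
from math import erfc, exp, pi, sqrt

row_totals = {
    "Double Miles + Free Flight Insurance": 29,
    "Free Flight Insurance": 27,
    "No Offer": 21,
    "Rtl w/o Enr": 23,
}
col_totals = {"No": 57, "Yes": 43}
grand_total = 100

# Expected count under independence: row total * column total / grand total
expected = {
    (row, col): r_tot * c_tot / grand_total
    for row, r_tot in row_totals.items()
    for col, c_tot in col_totals.items()
}
df = (len(row_totals) - 1) * (len(col_totals) - 1)

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 df."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

p_value = chi2_sf_df3(4.2974283)   # statistic as reported in the solution
print(df, round(p_value, 3))
```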
7. Use a calculator to compute the sample least squares regression line for the model $y = \beta_0 + \beta_1 x + \varepsilon$,
given the following six observations:
y 2 8 6 12 9 11
x 1 4 3 10 10 8
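$$\bar{x} = \frac{1 + 4 + 3 + 10 + 10 + 8}{6} = 6; \quad \bar{y} = \frac{2 + 8 + 6 + 12 + 9 + 11}{6} = 8$$
$$\sum (x_i - \bar{x})(y_i - \bar{y}) = (1 - 6)(2 - 8) + \dots + (8 - 6)(11 - 8) = 62$$
$$\sum (x_i - \bar{x})^2 = (1 - 6)^2 + \dots + (8 - 6)^2 = 74$$
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{62}{74} = 0.8378$$
$$b_0 = \bar{y} - b_1\bar{x} = 8 - 0.8378 \times 6 = 2.9732$$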
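The same calculation can be written as a short Python sketch (the function name `least_squares` is hypothetical). It keeps full precision for the slope, so the intercept may differ from 2.9732 in the fourth decimal place, since the hand calculation uses the rounded slope 0.8378.

```python
def least_squares(x, y):
    """Return (b0, b1) for the simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1, 4, 3, 10, 10, 8]
y = [2, 8, 6, 12, 9, 11]
b0, b1 = least_squares(x, y)
print(round(b0, 3), round(b1, 4))
```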
8. Suppose the relationship between the dependent variable weekly household consumption
expenditure in dollars (y) and the independent variable weekly household income in dollars (x) is
represented by the simple regression model (i refers to the ith observation or household):
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Suppose a sample of observations yields least squares estimates of b0 = -32 and b1 = 0.82 for this
model.
(b) State the basic (classical) assumptions made about the $\varepsilon_i$'s in this model. Explain in words what
the assumptions mean.
(i) These errors are random variables, for which one classical assumption is that $E(\varepsilon_i \mid x_i) = 0$ for all
observations. In words, the conditional mean of the disturbance does not depend on x and is
normalized to zero. Note this is not something directly addressed by Sharpe in Chapter 15. It is
nevertheless a crucial assumption. That the conditional mean of the disturbances does not depend
on x ensures the unbiasedness of the OLS estimator as an indicator of the direct effect of x on y, and
is hence as important as the other assumptions below in determining how the output of the model
can be interpreted. This assumption implies that omitted factors that might affect expenditure, but
in fact are only included in the disturbance term rather than as separate variables on the right-hand
side of the model, must be uncorrelated with x. In the present example, a violation of this requirement
would occur if, for example, people who are taller also earn more income (due to more confidence in
the labor market than shorter people, say), and at the same time consume more because they require
more calories to keep their larger bodies running. This would then mean that additional income does
not directly bring about all of the additional consumption implied by the slope coefficient estimate;
at least part of this effect is due to the third-party cause of each variable (income and consumption):
height.
(ii) $(x_i, y_i)$ are drawn by simple random sampling and are hence independent and identically distributed.
(iii) The standard deviation of $\varepsilon_i$ is constant for all observations ("no changing spread", as Sharpe says).
This spread is denoted by $\sigma_\varepsilon$ and we say the disturbances in such a case are homoskedastic. Here
that implies the variability in consumption expenditure does not depend on income, which is possibly
problematic in practice (poorer people are probably more similar to one another in how much of
their income they consume, since they operate closer to the line of subsistence; richer people have
many more options for how to use their money, and hence there may be more variability across rich
people in what share of their income is used for consumption expenditure).
(iv) The disturbances for any two observations are independent. This will imply, in particular, that there
is no correlation between the disturbances associated with different observations. In this example
this requires that the factors in the disturbance for household i are not correlated with those for
household j, which seems reasonable.
(v) $\varepsilon_i$ is normally distributed for all observations.
(c) Does the estimate of b0 = -32 make sense? If not, does this necessarily invalidate the model?
Explain your answer.
This indicates that if a household had a zero weekly income then on average such a household would
have negative consumption, which does not make sense. However, this does not necessarily invalidate
the model. It may be that the linear model is only a reasonable approximation for some range of
household incomes, not including incomes near zero. In particular, the relationship between the two
variables may be non-linear for values of x near zero. The conclusion is that we should be careful in
interpreting the intercept term, as it may not be very meaningful in some cases.
(d) Interpret both $\beta_1$ and $b_1$. What does the model predict would be the change in y following a 10-dollar
increase in x from some initial level?
$\beta_1$ is the (unknown) population change in the value of y resulting from a one-unit increase in x, whereas
$b_1 = 0.82$ is an estimate of $\beta_1$. In this particular example the quantity being estimated is the marginal
propensity to consume that is discussed in economics courses. The predicted change in y following a 10-dollar
increase in x would be $10 \times b_1 = 10 \times 0.82 = \$8.20$.
(e) Suppose we measured y and x in cents rather than dollars. What effect would this have on the
estimated coefficient of x? What effect would it have on the estimated intercept?
In this case, x dollars becomes 100x cents and y dollars becomes 100y cents. The estimated coefficient of $x_i$ when the
variables are measured in dollars is given by
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
If we let $b_1^{*}$ be the estimated slope coefficient when the variables are measured in cents, we have
$$b_1^{*} = \frac{\sum (100x_i - 100\bar{x})(100y_i - 100\bar{y})}{\sum (100x_i - 100\bar{x})^2} = \frac{100^2 \sum (x_i - \bar{x})(y_i - \bar{y})}{100^2 \sum (x_i - \bar{x})^2} = b_1$$
Thus estimation of this model (with the same, but re-scaled data) would lead to an unchanged $b_1$, whilst
the estimated intercept term would become $100 b_0 = -3200$.
(f) Suppose y were measured in dollars but x were measured in cents. What effects would this have
on the estimated coefficient of x?
Denote the estimated slope and intercept in this case by $b_1^{**}$ and $b_0^{**}$, respectively. Since x is multiplied
by 100 while y is unchanged, $b_1^{**} = b_1/100$. Then
$$b_0^{**} = \bar{y} - b_1^{**} \cdot 100\bar{x} = \bar{y} - b_1\bar{x} = b_0$$
The estimation of this model would lead to an estimated coefficient for the income variable of 0.0082,
and the estimated intercept would be unchanged. This makes sense since:
If income is measured in dollars, we predict expenditure (in dollars) will increase by $0.82 if household
income increases by one dollar.
If income is measured in cents, we predict expenditure (in dollars) will increase by $0.0082 if household
income increases by one cent.
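The rescaling results in parts (e) and (f) can be verified numerically. The sketch below is illustrative only: the household data are not available here, so it reuses the six observations from Question 7 as stand-in "dollar" values, and the function name `least_squares` is hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) for the simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Stand-in data (from Question 7), treated as amounts in dollars
x_dollars = [1, 4, 3, 10, 10, 8]
y_dollars = [2, 8, 6, 12, 9, 11]
x_cents = [100 * v for v in x_dollars]
y_cents = [100 * v for v in y_dollars]

b0, b1 = least_squares(x_dollars, y_dollars)
b0_both, b1_both = least_squares(x_cents, y_cents)        # part (e): both in cents
b0_x_only, b1_x_only = least_squares(x_cents, y_dollars)  # part (f): only x in cents

print(b1, b1_both, b1_x_only)
print(b0, b0_both, b0_x_only)
```

With both variables in cents the slope is unchanged and the intercept is multiplied by 100; with only x in cents the slope is divided by 100 and the intercept is unchanged.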
(g) Distinguish between $\varepsilon_i$ and $e_i$ (the residual associated with observation i). Illustrate your
answer with a diagram.
We can think of $e_i = y_i - \hat{y}_i$ as an estimate of the true random disturbance associated with observation
i, $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$. (In the above diagram, we are just imagining what the true model might look like;
we never could draw it since we never know it!)