Tutorial work - Week 1 - 12

Business and Economic Statistics (University of New South Wales)

Distributing prohibited | Downloaded by Samarth Tripathi (samtrip11@gmail.com)


ECON 1203 Tutorial Sample Solutions


Semester 1 2015

Weeks 1 and 2
1. (a) What is meant by a variable in a statistical sense? Distinguish between qualitative and quantitative
statistical variables, and between continuous and discrete variables. Give examples.

A variable in a statistical sense is just some characteristic of an object. It may take different values.

Data on a quantitative variable can be expressed numerically in a meaningful way (e.g., height of an
individual, number of children in a family). Data on qualitative variables cannot be expressed numerically
in a meaningful way (e.g., sex or hair colour of an individual) although such data can be coded into numerical
expressions. In the case that qualitative data has an innate ordering (e.g., survey answers to the question,
"How happy are you, all things considered?" rated on a scale of Very Happy to Very Unhappy), the
numerical coding can contain some meaning.

A discrete quantitative variable can assume only certain discrete numerical values on the number line (can
be a finite or infinite number of these values). A continuous quantitative variable can assume any value in a
specific range or interval; e.g. length of a pipe. In some cases, what in theory is a continuous variable must
be in practice measured as a discrete variable because of limitations to measurement precision.

(b) Distinguish between (i) a statistical population and a sample; (ii) a parameter and a statistic. Give
examples.

A statistical population is the set of measurements or observations of a characteristic of interest for all
elementary units in a frame; e.g., the shoe sizes of all men in Australia. A statistical sample is a subset of a
population, e.g., the shoe sizes of all the men enrolled in ECON 1203 is a sample of the population represented
by the shoe sizes of all men in Australia.

A parameter is a numerical description of a population. For example, the average shoe size of all Australian
men is a parameter (of the population of the shoe sizes of all Australian men). A statistic is a numerical
description of a sample. For example, the average shoe size of all men in this classroom is a statistic
(calculated from the sample of the shoe sizes of all men in this classroom).

2. In order to know the market better, the second-hand car dealership, Anzac Garage, wants to analyze
the age of second-hand cars being sold. A sample of 20 advertisements for passenger cars is selected
from the second-hand car advertising/listing website www.drive.com.au The ages in years of the
vehicles at time of advertisement are listed below:

5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11

(a) Calculate the frequency, cumulative frequency and relative frequency distributions for the age data using
the following bin classes:
More than 0 to less than or equal to 8 years
More than 8 to less than or equal to 16 years
More than 16 to less than or equal to 24 years.

Bin               Relative Frequency   Frequency   Cumulative Frequency
0 < Age ≤ 8       0.70                 14          14
8 < Age ≤ 16      0.25                 5           19
16 < Age ≤ 24     0.05                 1           20
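
The tabulation above can be checked with a short script. This is a minimal sketch, not part of the original solutions: the function name `frequency_table` and the tuple layout are illustrative assumptions, and bins are treated as (lower, upper] to match the question.

```python
from itertools import accumulate

# Ages (in years) of the 20 sampled cars from Question 2.
ages = [5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11]


def frequency_table(data, edges):
    """Tabulate bins of the form (lower, upper] defined by consecutive edges."""
    bins = list(zip(edges[:-1], edges[1:]))
    counts = [sum(1 for x in data if lo < x <= hi) for lo, hi in bins]
    cumulative = list(accumulate(counts))
    relative = [c / len(data) for c in counts]
    return [
        (lo, hi, count, rel, cum)
        for (lo, hi), count, rel, cum in zip(bins, counts, relative, cumulative)
    ]


table = frequency_table(ages, [0, 8, 16, 24])
for lo, hi, count, rel, cum in table:
    print(f"{lo:>2} < Age <= {hi:<2}  freq={count:<2}  rel={rel:.2f}  cum={cum}")
```

Passing a finer list of edges, such as `[0, 4, 8, 12, 16, 20, 24]`, tabulates narrower bins with the same function.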

(b) Sketch a frequency histogram using the calculations in part (a). What can you say about the distribution
of the age of these second-hand cars? Is there anything that concerns you about the frequency table and
histogram? Specifically, is the choice of bin classes appropriate? What needs to be done differently?

[Figure: Relative frequency histogram for Age. Horizontal axis: Bin (upper limits 8, 16, 24); vertical axis: Frequency (relative, 0 to 0.8).]

From this graph, the age distribution appears to be skewed to the right. 70% of observations have age
between 0 and 8. However, this histogram only provides limited information about the age distribution
because there are too few bins and they are very wide.

(c) Halve the width of the bins (0 to 4, 4 to 8, etc) and recalculate the frequency, cumulative frequency and
relative frequency distributions. Using the new distributions and histogram, what can you now say about
the distribution of the age of second-hand cars?

Bin               Relative Frequency   Frequency   Cumulative Frequency
0 < Age ≤ 4       0.25                 5           5
4 < Age ≤ 8       0.45                 9           14
8 < Age ≤ 12      0.20                 4           18
12 < Age ≤ 16     0.05                 1           19
16 < Age ≤ 20     0                    0           19
20 < Age ≤ 24     0.05                 1           20

[Figure 3.1: Revised histogram for age of cars. Horizontal axis: Age (bin midpoints 2, 6, 10, 14, 18, 22); vertical axis: Frequency (0 to 10).]

There still appears to be a skew to the right, but now we can also see that there is an outlier in the 21-24
age category. Ages 5-8 are the most frequently observed. A sizable proportion of the second-hand cars are
relatively new (25% being less than or equal to 4 years old).

3. Health expenditure
A recent report by Access Economics provides a comparison of Australian expenditures on health with
that of comparable OECD countries. Data from that report relating to the year 2005 have been used to
reproduce their Figure 2.2 (below denoted as Figure 2.1).

(a) What are the key features of these data?

• A strong positive association: more per capita GDP implies more health expenditure per capita.
• There are (at least) 2 outliers: the observation with the largest health expenditure (USA) and
  the observation with the highest GDP (Luxembourg). Without these 2 the relationship is approximately
  linear. With them, there is a suggestion of a non-linear relationship.
• An indication of more variability in health expenditures when GDP is larger.

(b) While this is a bivariate scatter plot, there are three variables involved: health expenditure, GDP and
population. Why account for population by expressing health expenditure and GDP in per capita terms?

[Figure 2.1: OECD Health Expenditure and GDP. Horizontal axis: GDP per capita (US$000), 0 to 70; vertical axis: Health expenditure per capita (US$000), 0 to 7.]

This line of questioning is intended to prompt the recognition that there may be factors other than GDP
associated with health expenditures per capita, and population size is one obvious factor since (for example)
there may be returns to scale in health care delivery, and/or differences in how concentrated the health care
industry is in larger countries versus smaller ones. Expressing everything in per capita terms is one way to
control for population variation and hence isolate the GDP-health expenditure relationship, so it is good if
that is the relationship we want to know about. However, controlling for population size in this fashion makes
it harder to see the relationship between population and health care expenditure, so if that relationship were
our target of analysis, this would not be a good way to present the data.

4. Australian housing prices


Recent research by Dr Nigel Stapledon at the UNSW School of Economics provides an extensive
analysis of Australian housing prices since 1880. In Figure 2.2 his data are used to provide a
comparison of Sydney and Melbourne housing prices over time.

(a) What are the key features of these data?

• The time series evolution is quite similar for Sydney and Melbourne housing prices: they track each
  other quite well and hence we would say there is a strong positive association between these two
  series.
• Sydney prices are typically above Melbourne prices.
• There seem to be 2 regimes. In the first regime, up until the 1950s, there is little growth in housing
  prices and they are quite stable from year to year (low variability). In the second regime, since the
  1950s, there have been quite dramatic increases in housing prices in both cities and there is much
  greater year-to-year variability (more volatility). (In his analysis, Stapledon notes that this
  two-regime pattern is quite common and has been observed in the US as well.)

(b) Why have prices been expressed in constant dollars?

One reason housing prices increase over time is inflation and if all prices and incomes increase by the same
proportion then there are no real changes, meaning no changes that people would feel in their wallets. So
just as in the previous question, we control for this other factor (inflation) so as to better see the
relationship between real housing prices and time.

[Figure 2.2: Comparison of Sydney and Melbourne median house prices in constant 2007-08 dollars. Horizontal axis: Year (1860 to 2020); vertical axis: Thousands of dollars (0 to 600); series: Sydney, Melbourne.]

5. Using the car data from Question 2:

(a) Calculate the mean, median and mode for this sample of data and use these statistics to further describe
the distribution of car ages.

$\text{Mean} = \frac{5 + 5 + 6 + \cdots + 24 + 11}{20} = 7.3$

Ordering the data from lowest to highest:

2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 7, 9, 10, 11, 11, 14, 24

Median = (6+6)/2=6
Mode = 6

The sample mean is to the right of the mode and median, suggesting that the sample distribution is skewed towards
the right. The cause seems to be the large outlier: one car had an age of 24, which appears to be very
different from the ages of the other cars. Given the skewness and the outlier, the median is possibly a better measure
of central tendency. Hence a typical second-hand car is 6 years old.

Alternatively the EXCEL output is:


Age

Mean 7.3
Standard Error 1.126476
Median 6
Mode 6
Standard Deviation 5.037752
Sample Variance 25.37895
Kurtosis 5.712234
Skewness 2.0983

Range 22
Minimum 2
Maximum 24
Sum 146
Count 20

(b) If the largest observation were removed from this data set, how would the three measures of central
tendency you have calculated change?

$\text{Mean} = \frac{5 + 5 + 6 + \cdots + 6 + 11}{19} = 6.4$ (now closer to the median)

Median = 6 (unchanged, but now not an average of the two middle values but the actual middle value, since
we now have an odd number of observations)
Mode = 6 (unchanged)
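
As a cross-check on parts (a) and (b), the three measures can be recomputed with Python's standard `statistics` module. This is a hedged sketch rather than the course's method (the solutions use hand calculation and Excel); the helper name `central_tendency` is an assumption made for illustration.

```python
import statistics

# Ages (in years) of the 20 sampled cars from Question 2.
ages = [5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11]


def central_tendency(data):
    """Return (mean, median, mode) of a sample."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)


full = central_tendency(ages)

# Drop the single largest observation (the 24-year-old car).
trimmed_ages = sorted(ages)[:-1]
trimmed = central_tendency(trimmed_ages)

print("all 20 cars:    mean=%.2f median=%s mode=%s" % full)
print("without oldest: mean=%.2f median=%s mode=%s" % trimmed)
```

Only the mean moves when the outlier is dropped, which is why the median is the more robust summary here.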

6. For the following statistical population, compute the mean, range, variance and standard deviation: 3,
3, 5, 12, 13, 14, 17, 20, 21, 21.

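$\text{Mean} = \mu = \frac{3 + 3 + 5 + 12 + 13 + 14 + 17 + 20 + 21 + 21}{10} = 12.9$

$\text{Range} = 21 - 3 = 18$

$\text{Variance} = \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(3 - 12.9)^2 + \cdots + (21 - 12.9)^2}{10} = 45.89$

$\text{Standard deviation} = \sigma = \sqrt{45.89} = 6.7742$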

What would happen to each of the measures you have calculated if :


(a) 4 were added to each data point (observation)?

The mean would increase by 4, but the range, variance and standard deviation would be unchanged.

(b) each data point were multiplied by 2?

The mean, range and standard deviation would be multiplied by 2, whilst the variance would be multiplied by
4.
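
The effect of adding a constant or rescaling can be verified numerically. The sketch below is illustrative only: it uses the population (divide-by-N) formulas from Python's `statistics` module, and the helper name `describe` is an assumption.

```python
import math
import statistics

# The statistical population from Question 6.
population = [3, 3, 5, 12, 13, 14, 17, 20, 21, 21]


def describe(values):
    """Population mean, range, variance and standard deviation."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
        "sd": statistics.pstdev(values),
    }


original = describe(population)
shifted = describe([x + 4 for x in population])  # part (a): add 4
scaled = describe([2 * x for x in population])   # part (b): multiply by 2

for label, stats in [("original", original), ("x + 4", shifted), ("2x", scaled)]:
    print(label, {k: round(v, 4) for k, v in stats.items()})
```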

7. Migrant wealth.
Suppose the Minister for Immigration is interested in research on the assimilation of migrant
households (a household where the chief income-earner is foreign born). The Household, Income and
Labour Dynamics in Australia (HILDA) survey is a representative survey of Australian households.
Using 4,669 household observations for 2002 from HILDA, we find there are 3,567 households
classified as Australian-born and 1,102 classified as migrants. One key consideration is how migrant
households are doing in terms of wealth compared with Australian-born households. Using these data,
we find the following:

Summary statistics for net household wealth ($A)

                   Mean      10th percentile   Median    90th percentile
Australian-born    236,064   1,545             123,020   560,006
Migrant            248,970   1,720             131,152   524,372

(a) What can you say about the distribution of net household wealth, for both Australian-born and migrant
households, by looking at just the mean and the median figures?

The wealth distribution is skewed quite heavily towards the right for both Australian-born and migrant
households. The mean is much larger than the median, suggesting that more than 50% of each sample have
less than average wealth, while less than 50% of each sample have more than average wealth. In other words,
there is a fair amount of wealth inequality in both samples.

(b) More generally, what can you say about the distribution of wealth for migrant households compared
to that for Australian-born households? In particular, which type of household has greater variation in
wealth?

Based on just the mean and the median measures, a typical migrant family appears to be slightly wealthier
than a typical Australian-born family. Both figures are larger for the migrant sample than the
Australian-born sample. This is also the case for the 10th percentile figure. By contrast, the 90th percentile is greater for
the Australian-born sample than the migrant sample. These figures suggest that, while typical migrant families
are better off than typical Australian families in terms of wealth, migrant families are less likely to be very
poor or very rich compared with Australian-born families. In other words, Australian-born families have
greater variation in household wealth than migrant families.

(c) Suppose the minister has net household wealth of $600,000. What can you say about his or her
financial circumstances relative to other Australian-born households?

The minister's household has greater wealth than at least 90% of Australian-born households in Australia.
His/her household is amongst the wealthiest 10% of Australian-born households.

8. Sydney housing prices.


Figure 3.2 depicts a scatter plot of Sydney-area housing prices versus distance from the CBD. The
unit of observation is a suburb, price is the mean of the median price of houses sold in each suburb
for two quarters (those ending in September and December 2002), and distance is measured in
kilometers from downtown.

(a) What would you expect the correlation to be between price and distance?

There is an inverse relationship between distance to CBD and price, so we expect the correlation to be
negative.

(b) Does it appear that there is a linear relationship between the two variables?

The relationship does not look linear largely because of the large variability in prices for suburbs close to
the CBD. (These observations also tend to distort what the relationship looks like for the bulk of the data. If
you were to eliminate these outliers, it is not clear what the relationship would look like for the remainder of
the data.)

(c) What other key features of these data can be determined from the plot?

[Figure 3.2: House prices in Sydney suburbs versus distance to CBD. Horizontal axis: Distance to CBD (kms), 0 to 80; vertical axis: Price $, 0 to 6,000,000.]

• We have already mentioned the large variability in prices for suburbs close to the CBD. To say this
  more formally, the variance of prices close to the CBD (conditional variance, where the
  conditioning is on small distance to the CBD) is much larger than the variance of prices further
  away from the CBD.
• Other outliers appear around 30kms from the CBD (these are Clareville, Palm Beach and Whale
  Beach).
• There is no suspicion that these outliers are due to errors. All are feasible observations.
• We can see that the price and distance variables are both skewed to the right. (Imagine pushing the
  graph up from the bottom left corner, keeping it in the same 2D plane, so it sits on its Y-axis in that
  plane, and then looking at the distribution of price from behind the graph (so the values of the
  price variable are ordered from lowest on the left to highest on the right).)
• There are numerous suburbs where there were no sales, which are reported as zeroes in the graph.
  Note though that it was not explicitly said that suburbs with no sales would have entries of zero, so
  this would have to be inferred by the viewer of the graph. Most of these no-sale suburbs are
  suburbs relatively close to the CBD.
• What should we do with the zero sales observations when we analyse the data? They are not data
  errors, as sometimes occur. But they are not real zeroes, as we don't know what the price would
  have been had there been sales for the period in question. They are also extreme values, meaning
  they have a higher chance of influencing various types of analysis we might do. Hence, we should
  think hard about whether or not to include them in any given piece of analysis.

9. Anzac Garage wants to develop guidelines for setting prices of cars according to the car's age. They
hire a business consultant who chooses a sample of 117 second-hand passenger car advertisements
collected from www.drive.com.au and retrieves data on the age and price of the cars.

(a) The business consultant first calculates the correlation coefficient between age and price and finds it
to be -0.278. Interpret this result.

Correlation coefficients lie between -1 and 1. A negative value suggests an inverse relationship between the
variables (which makes sense: older cars are less expensive). A magnitude of 0.278 suggests that the
relationship is present but not very strong.

(b) Sketch what you think the scatter diagram from which this correlation coefficient was calculated might
look like. Suppose the business consultant constructs a simple linear regression model using price as
the dependent variable, and age as the independent variable. What do you think the estimated
regression line might look like here? (We will return to this particular example later in the course and
address this question more formally.)

Below is a possible scatter diagram with a linear regression model superimposed. Scatters that answer the
question will have the key feature of being consistent with a negative correlation, i.e., a negatively sloped line
of best fit.
[Figure: Price-age scatter with OLS regression line superimposed. Horizontal axis: Age (0 to 16); vertical axis: Price (0 to 60,000); series: Price, Linear (Predicted Price).]

10. Work through problem 31 on page 164 of Sharpe (Chapter 4).

Weeks 3 and 4
1. (a) Explain what it means to say that two probabilistic events in a sample space are mutually
exclusive of one another.

If two events (let's call them A and B) are mutually exclusive, then they do not
have any simple events in common: i.e., the simple events that combine to make up A
have no elements in common with those that make up B.

(b) Explain what it means to say that two probabilistic events in a sample space are independent
of one another.

When two events are independent of one another, conditioning on
the occurrence of one of them has no effect on the marginal probability of the other: i.e.,
Pr(A|B) = Pr(A).

(c) Why can two events not at the same time be both mutually exclusive and independent of one
another?

Because if A and B are mutually exclusive, then Pr(A and B) = 0, whereas if they are
independent, Pr(A and B) = Pr(A)*Pr(B) ≠ 0 (provided both A and B have positive probability).

2. A department store wants to study the relationship between the way customers pay for an item and the
price of the item. 250 transactions are recorded and the following table is formed.
Price category    Payment
                  Cash    Credit card    Debit card
Under $20         15      9              18
$20-$100          11      53             52
Over $100         6       38             48

Convert the table to a joint distribution. Express each of the following questions in terms of probability
statements, and then solve:
Joint distribution:

Price category    Means of Payment
                  Cash     Credit card    Debit card    Marginal
Under $20         0.060    0.036          0.072         0.168
$20-$100          0.044    0.212          0.208         0.464
Over $100         0.024    0.152          0.192         0.368
Marginal          0.128    0.400          0.472         1

(a) What is the probability that an item is under $20?

P(Under $20) = 0.168

(b) What is the probability that an item with a price tag of $43 is paid for in cash?

P($20-$100 and cash) = 0.044

(c) What is the probability that people pay for an item that is at least $20 by credit?

P(≥ $20 and credit) = 0.212 + 0.152 = 0.364

(d) If somebody used a debit card to pay for an item, what is the probability that the item was less
than $100?

P(<$100|debit) = (0.072+0.208)/0.472 = 0.593

(e) Are price and means of payment independent?

One way to check is to compare the marginal distribution of price with the conditional
distribution of price given a particular payment type (say, cash):

P($20-$100|cash) = 0.044/0.128 = 0.344 ≠ P($20-$100) = 0.464


This implies dependence.
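
The joint, marginal and conditional probabilities in this question can be reproduced from the raw counts. The following is a minimal sketch under stated assumptions: the dictionary layout and the helper names (`marginal_price`, `marginal_payment`, `price_given_payment`) are illustrative and not from the original solutions. Exact fractions are used so that the independence check is not affected by rounding.

```python
from fractions import Fraction

# Transaction counts from the table: rows are price categories,
# columns are means of payment.
counts = {
    "Under $20": {"Cash": 15, "Credit card": 9, "Debit card": 18},
    "$20-$100": {"Cash": 11, "Credit card": 53, "Debit card": 52},
    "Over $100": {"Cash": 6, "Credit card": 38, "Debit card": 48},
}

total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each count divided by the total number of transactions.
joint = {
    (price, pay): Fraction(n, total)
    for price, row in counts.items()
    for pay, n in row.items()
}


def marginal_price(price):
    return sum(p for (pr, _), p in joint.items() if pr == price)


def marginal_payment(pay):
    return sum(p for (_, pa), p in joint.items() if pa == pay)


def price_given_payment(price, pay):
    """Conditional probability P(price | payment) = joint / marginal."""
    return joint[(price, pay)] / marginal_payment(pay)


# Independence requires joint = product of marginals in every cell.
independent = all(
    p == marginal_price(price) * marginal_payment(pay)
    for (price, pay), p in joint.items()
)

print(float(marginal_price("Under $20")))              # part (a)
print(float(price_given_payment("$20-$100", "Cash")))  # part (e)
print(independent)
```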

3. In a small batch of 20 manufactured widgets, there are, in fact, 3 defective ones. You, as quality control
officer for the company making the widgets, decide to examine a sample of 3 widgets, selected without
replacement, to see how many defective ones are selected.
(a) Use a probability tree to evaluate the probability distribution of the number of defectives sampled.

The tree is of the obvious kind. At the first node, the probability of a
defective is 3/20 = 0.15 and of a non-defective is 17/20 = 0.85. From the upper of these branches, at the next node the
probability of a defective being selected is 2/19 and of a non-defective is 17/19. From the lower first
branch, the probability of a defective is 3/19 and of a non-defective is 16/19. From the nodes at
the end of the 4 second branches, the 8 probabilities of defective and non-defective are,
respectively, 1/18, 17/18, 2/18, 16/18, 2/18, 16/18, 3/18 and 15/18.

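Multiplying the probabilities along each path of the tree, and adding the paths that give the same
number of defectives, gives the probability distribution of X, the
number of defectives drawn in a sample of 3 without replacement. (Because sampling is without
replacement, the draws are not independent: the probabilities at each node depend on earlier draws.)

x           0           1           2          3
P(X = x)    680/1140    408/1140    51/1140    1/1140      Σ P(X = x) = 1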

(b) How would your answer change if the sampling were done with replacement?

The resultant probability distribution is now

x           0            1            2           3
P(X = x)    4913/8000    2601/8000    459/8000    27/8000      Σ P(X = x) = 1
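
Both distributions can be confirmed by counting rather than by drawing the tree. This sketch is an alternative check, not the method the solutions use: it applies the standard hypergeometric formula for sampling without replacement and the binomial formula for sampling with replacement. The constant and function names are assumptions for illustration.

```python
from fractions import Fraction
from math import comb

BATCH, DEFECTIVE, SAMPLE = 20, 3, 3


def without_replacement(x):
    """Hypergeometric probability of x defectives in the sample."""
    good = BATCH - DEFECTIVE
    return Fraction(comb(DEFECTIVE, x) * comb(good, SAMPLE - x), comb(BATCH, SAMPLE))


def with_replacement(x):
    """Binomial probability of x defectives in the sample."""
    p = Fraction(DEFECTIVE, BATCH)
    return comb(SAMPLE, x) * p**x * (1 - p) ** (SAMPLE - x)


# Fractions print in lowest terms, e.g. 680/1140 appears as 34/57.
for x in range(SAMPLE + 1):
    print(x, without_replacement(x), with_replacement(x))
```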

4. Work through problem 16 on page 200 of Sharpe (Chapter 5).

5. Work through problem 18 on page 200 of Sharpe (Chapter 5).


(a) The radio announcer is referring to the so-called "law of averages", which is a mistaken
belief that probability will compensate in the short term for odd occurrences in the past. At
face value, the weather is not more likely to be bad in the winter because of a few sunny days
in autumn. The only way that such a statement could be justified would be with reference to
some type of global weather event that affects weather patterns across seasons,
such as El Niño/La Niña.
(b) Standard statistics says that there is no such thing as being "due for a hit" (the statement
being based on the so-called "law of averages"): the batter's chance for a hit should not
change based on recent successes or failures. The only way that such a statement could be
justified would be through some sort of psychological or physiological process that causes
repeated performances to be correlated in some way (e.g., the player's frustration grows when
his performance starts to streak poorly, causing him to work harder to get a hit the next
time).

6. Work through problem 44 on page 203 of Sharpe (Chapter 5).


(a) Her thinking is correct. There are 14 boxes left, of which 10 are men's bikes and only 4 are
women's bikes.
(b) This is not an example of the Law of Large Numbers. The box selections are not independent
of each other: the boxes are not put back into the choice set once they are opened.

7. The manager of a factory has determined from past experience that X, the number of repairs required
to machines in her factory on any one day, has the following probability distribution:

x 0 1 2 3 4
P(X = x) 0.41 0.25 0.18 0.10 0.06

Calculate the following:

(a) P(1 < X < 4)

P(1 < X < 4) = P(X=2)+P(X=3) = 0.18+0.10 = 0.28

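(b) P(0 ≤ X ≤ 3)

P(0 ≤ X ≤ 3) = P(X=0)+P(X=1)+P(X=2)+P(X=3) = 1-P(X=4) = 0.94

(c) E(X)

$E(X) = \mu = \sum_x x \, P(X = x)$

$= 0 \times 0.41 + 1 \times 0.25 + 2 \times 0.18 + 3 \times 0.10 + 4 \times 0.06 = 1.15$

(d) Var(X)

$\mathrm{Var}(X) = E[(X - \mu)^2] = \sigma^2 = \sum_x (x - \mu)^2 \, P(X = x)$

$= (0 - 1.15)^2 \times 0.41 + (1 - 1.15)^2 \times 0.25 + (2 - 1.15)^2 \times 0.18 + (3 - 1.15)^2 \times 0.10 + (4 - 1.15)^2 \times 0.06 = 1.5075$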

(e) What is the conditional probability distribution of X, conditional on some positive number of
repairs taking place?

x                   1       2       3       4
P(X = x | X > 0)    0.42    0.31    0.17    0.10
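
The expectation, variance and conditional distribution for this question follow directly from the probability table. A brief sketch is given below; the variable names are illustrative assumptions.

```python
# Probability distribution of X, the daily number of repairs.
pmf = {0: 0.41, 1: 0.25, 2: 0.18, 3: 0.10, 4: 0.06}

# E(X) is the probability-weighted sum of outcomes;
# Var(X) is the probability-weighted sum of squared deviations from E(X).
mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Condition on X > 0: keep the positive outcomes and rescale by P(X > 0).
p_positive = sum(p for x, p in pmf.items() if x > 0)
conditional = {x: p / p_positive for x, p in pmf.items() if x > 0}

print(round(mean, 4), round(variance, 4))
print({x: round(p, 2) for x, p in conditional.items()})
```

Dividing by P(X > 0) = 0.59 is what makes the conditional probabilities sum to 1.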

8. Suppose that the daily number of errors a randomly-selected bank teller makes is denoted by X and
follows the distribution given in the table below. A human resource manager records the daily
numbers of errors of two randomly selected tellers. Denote the associated random variables by X1 and
X2. As the selection is random, X1 and X2 are independent and follow the same distribution as X. The
manager then computes the sample mean $\bar{X} = \frac{X_1 + X_2}{2}$, where the sample size is n = 2.

x 0 1 2
P(X = x) 0.6 0.2 0.2

(a) Find the mean and variance of X1. Explain why we do not need to find the mean and variance of
X2 once we know those of X1.

$E(X_1) = 0.6$; $\mathrm{Var}(X_1) = 0.64$

The mean and variance of X2 are the same because they have identical distributions.

(b) Since X1 and X2 are random, so is $\bar{X}$. Find the mean and variance of the random variable $\bar{X}$.
Compare these with the result from (a) and comment. Hint: you will find it useful to note that
$\mathrm{Cov}(X_1, X_2) = 0$ because X1 and X2 are independent. This simplifies the evaluation of the variance
of the random variable $\bar{X}$.

$E(\bar{X}) = E\left[\frac{X_1 + X_2}{2}\right] = \frac{1}{2}E(X_1) + \frac{1}{2}E(X_2) = 0.6$

$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left[\frac{X_1 + X_2}{2}\right] = \frac{1}{4}\mathrm{Var}(X_1) + \frac{1}{4}\mathrm{Var}(X_2) = 0.32$

The means of $\bar{X}$ and X are the same, and the variance of $\bar{X}$ is the variance of X divided by 2 (the sample
size).

(c) Find the possible values that $\bar{X}$ may take. Hence list the probability distribution of $\bar{X}$ for
samples of size 2. (This is known as the sampling distribution of $\bar{X}$.)

If n=2 then the possible values for the mean are 0, 1/2, 1, 3/2, 2. Now we need to assign probabilities to each
outcome to produce the probability distribution for the sample mean.

For example, $P(\bar{X} = 0) = P(X_1 = 0, X_2 = 0) = 0.6 \times 0.6 = 0.36$

The following table lists all possible outcomes and their associated probabilities:

X1, X2    Sample mean    Probability
0, 0      0              0.36
0, 1      1/2            0.12
0, 2      1              0.12
1, 0      1/2            0.12
1, 1      1              0.04
1, 2      3/2            0.04
2, 0      1              0.12
2, 1      3/2            0.04
2, 2      2              0.04

The required probability distribution is therefore:

Sample mean value    0       1/2     1       3/2     2
Probability          0.36    0.24    0.28    0.08    0.04

(d) Examine briefly what would happen if n = 3, 4, ...? For this last sub-question, you will need to
use the idea of a factorial of an integer n, labelled n!, which means n multiplied by every
positive integer smaller than itself. So, for example, 3! = 3 × 2 × 1 = 6. Also recall the
combinatorial formula for the number of ways of selecting x from n distinct objects (Sharpe
page 193): ${}_nC_x = \frac{n!}{(n - x)!\,x!}$.

If n=3, the possible values are 0, 1/3, 2/3, 1, 4/3, 5/3, 2. In combinatorial form, $P(\bar{X} = 1/3) = {}_3C_1 \, (0.6)^2 \times 0.2$.
To understand this, note that the mean can only be 1/3 if two tellers make no errors and the remaining one
makes 1 error, and the combinatorial formula is used to account for the fact that the teller who makes 1
error can be the first, the second or the third sampled teller.
As n increases, we get a finer grid of values between the extremes of 0 and 2.
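
The sampling distribution of the mean can be built by brute force for any small n by enumerating every possible sample. The sketch below is illustrative only (the function names `sampling_distribution` and `mean_and_variance` are assumptions); it uses exact fractions and relies on the independence of the sampled tellers stated in the question.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Distribution of X, the daily number of errors made by one teller.
pmf = {0: Fraction(6, 10), 1: Fraction(2, 10), 2: Fraction(2, 10)}


def sampling_distribution(n):
    """Exact distribution of the sample mean of n independent draws from pmf."""
    dist = defaultdict(Fraction)
    for outcome in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in outcome:
            prob *= pmf[x]  # independence: multiply the marginal probabilities
        dist[Fraction(sum(outcome), n)] += prob
    return dict(sorted(dist.items()))


def mean_and_variance(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(v * p for v, p in dist.items())
    return mu, sum((v - mu) ** 2 * p for v, p in dist.items())


for n in (2, 3):
    dist = sampling_distribution(n)
    print(n, {str(v): float(p) for v, p in dist.items()}, mean_and_variance(dist))
```

For each n the mean of the sample mean stays at 0.6 while its variance is 0.64/n, consistent with part (b).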

9. A student has enrolled in three courses in this semester. Let's call them courses A, B and C. Her
chances of passing each course are 0.8, 0.65, and 0.5, respectively. Passing each course is assumed to
be independent of passing other courses. Answer the following:

(a) Define a random variable for each course outcome.

A=0 (fail A) & A=1 (pass A)


B=0 (fail B) & B=1 (pass B)
C=0 (fail C) & C=1 (pass C)

(b) What is the probability that this student passes exactly two courses? Express this question in
terms of probability statements, and then solve.

P(passing two courses) = P(pass A & B but fail C) + P(pass A & C but fail B) + P(pass C & B but fail A)
$= 0.8 \times 0.65 \times (1 - 0.5) + 0.8 \times 0.5 \times (1 - 0.65) + 0.65 \times 0.5 \times (1 - 0.8) = 0.465$

(c) What is the probability that this student fails at least one course? Express this question in terms
of probability statements, and then solve.

P(failing at least one course) = 1 − P(passing all courses)
$= 1 - 0.8 \times 0.65 \times 0.5 = 0.74$
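
Parts (b) and (c) can also be checked by listing all eight pass/fail patterns. This is a small sketch under the question's independence assumption; the names `p_pass` and `probability` are illustrative.

```python
from itertools import product

# Probability of passing each course, as given in the question.
p_pass = {"A": 0.8, "B": 0.65, "C": 0.5}


def probability(outcome):
    """Probability of one pass/fail pattern, assuming independent courses."""
    prob = 1.0
    for course, passed in zip(p_pass, outcome):
        prob *= p_pass[course] if passed else 1 - p_pass[course]
    return prob


# Enumerate all 2**3 pass (1) / fail (0) patterns.
outcomes = list(product([0, 1], repeat=len(p_pass)))

p_exactly_two = sum(probability(o) for o in outcomes if sum(o) == 2)
p_fail_at_least_one = sum(probability(o) for o in outcomes if sum(o) < 3)

print(round(p_exactly_two, 3), round(p_fail_at_least_one, 3))
```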

(d) How reasonable is the assumption of independence?

Independence is likely to be an unreasonable assumption. Results are likely to be dependent (strong positive
association) because most of the variability in course outcomes across students is due to idiosyncratic
factors about the student him/herself (i.e., working hard, being motivated, being of high academic ability).
The importance of these factors means that there is strong within-student correlation of marks in different
courses.

10. Let X be the number of heads in 4 tosses of a fair coin.

(a) What is the probability distribution of X?

X can take on values 0, 1, 2, 3,or 4. Now we need all possible combinations that will produce each of
these outcomes.

The table below lists, for each value of X, the nCk possible combinations over n=4 tosses. (This is the
notation used in Sharpe, e.g., on page 221. Equivalent notation that is sometimes used is $\binom{n}{k}$.)

Value of X    Possible combinations                              Number
0             (TTTT)                                             4C0 = 1
1             (HTTT) (THTT) (TTHT) (TTTH)                        4C1 = 4
2             (HHTT) (HTHT) (HTTH) (THHT) (THTH) (TTHH)          4C2 = 6
3             (THHH) (HTHH) (HHTH) (HHHT)                        4C3 = 4
4             (HHHH)                                             4C4 = 1

Each of these combinations is equally likely because on any toss of a fair coin, P(H) = P(T) = 0.5 and
we're assuming outcomes are independent:

$P(TTTT) = P(HTTT) = \cdots = P(HHHH) = (0.5)^4 = 0.0625$

The required probability distribution becomes:

x           0         1       2        3       4
P(X = x)    0.0625    0.25    0.375    0.25    0.0625

(b) What are the mean and variance of X?

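$E(X) = 0 \times 0.0625 + 1 \times 0.25 + 2 \times 0.375 + 3 \times 0.25 + 4 \times 0.0625 = 2$

$\mathrm{Var}(X) = (-2)^2 \times 0.0625 + (-1)^2 \times 0.25 + 0 + (1)^2 \times 0.25 + (2)^2 \times 0.0625 = 1$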

(c) Consider a game where you win $5 for every head but lose $3 for every tail that appears in 4 tosses
of a fair coin. Let the variable Y denote the winnings from this game. Formulate the probability
distribution of Y based on the probability distribution of X.

The general formula for determining Y from X is Y = 5X − 3(4 − X). Plugging in, when X=0, you lose 12,
and so on. Hence:

y           -12       -4      4        12      20
P(Y = y)    0.0625    0.25    0.375    0.25    0.0625

(d) What is the expected value of Y? Would you like to play this game? If so, why? If not, why
not?

Directly from the formula given in part (c), we have:

$E(Y) = 5E(X) - 3[4 - E(X)] = 10 - 12 + 6 = 4$

Or

$E(Y) = -12 \times 0.0625 - 4 \times 0.25 + 4 \times 0.375 + 12 \times 0.25 + 20 \times 0.0625 = 4$,
where the latter calculation comes directly from the probability distribution of Y given in the table
constructed in part (c). (The two evaluations, of course, give the same value!)

If you play the game enough times you would expect to win $4 per game on average. Thus, this is not a fair
game (since in a fair game, expected returns are zero) but it is biased towards the player. This is unlike
games in casinos where expected winnings are negative, meaning the game is biased towards the house.

Notice on any one play of the game you still might lose money and hence someone who is extremely risk
averse might not want to play the game even though on average, over many plays of the game, they should
win money.
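
The whole of Question 10 can be reproduced in a few lines. The sketch below is an illustration, not the solutions' own method: it generates the distribution of X from the binomial formula, maps it to Y, and computes the moments. The constant and function names are assumptions.

```python
from math import comb

TOSSES = 4
P_HEAD = 0.5

# Binomial distribution of X, the number of heads in 4 tosses.
pmf_x = {
    x: comb(TOSSES, x) * P_HEAD**x * (1 - P_HEAD) ** (TOSSES - x)
    for x in range(TOSSES + 1)
}


def winnings(x):
    """Win $5 per head and lose $3 per tail."""
    return 5 * x - 3 * (TOSSES - x)


# Each value of X maps to exactly one value of Y, so Y inherits X's probabilities.
pmf_y = {winnings(x): p for x, p in pmf_x.items()}

mean_x = sum(x * p for x, p in pmf_x.items())
var_x = sum((x - mean_x) ** 2 * p for x, p in pmf_x.items())
mean_y = sum(y * p for y, p in pmf_y.items())

print(pmf_x)
print(pmf_y)
print(mean_x, var_x, mean_y)
```

The probability of losing money on a single play is P(Y < 0) = 0.0625 + 0.25 = 0.3125, which is the risk the final paragraph above refers to.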

11. Work through problem 41 on page 234 of Sharpe (Chapter 6).

Weeks 5 and 6
1. A random number generator is designed to draw numbers at random from within a specified range. We
can consider any number in the range as a possible outcome.
(a) What type of distribution is the random number generator drawing from?
A continuous uniform distribution.
(b) Suppose we program a random number generator to generate a random number with a value
falling in the interval [0, 2]. What is the height of the density of the distribution from which the
random number generator is drawing? Draw a graph of the probability density function.
\[ f(y) = \begin{cases} \frac{1}{2}, & 0 \le y \le 2 \\ 0, & \text{otherwise} \end{cases} \]

(c) What is the cumulative probability distribution of the random variable from which draws are being
taken? Draw a graph of the cumulative probability distribution function.
The cumulative probability distribution is \(F(y) = P(Y \le y)\), and its graph plots F(y) against y. From the
density above, F(0) = 0 and F(2) = 1. Since probability accumulates uniformly, the graph must be a
straight line with an upward slope (the density is never negative, so F cannot decrease) rising from the point
(y, F(y)) = (0, 0) to (y, F(y)) = (2, 1). Specifically, F(y) = 0.5y in the range [0, 2]. If y < 0, F(y) = 0 and if y > 2,
F(y) = 1.

(d) Find the following for this case: \(P(Y < 0.8)\); \(P(Y \le 0.8)\); \(P(0.5 < Y < 1.5)\), using both the density
function and the cumulative probability function. Show that your answers match whichever
you use.
\[ P(Y < 0.8) = 0.8 \times 0.5 = 0.4 \]
\[ P(Y \le 0.8) = P(Y < 0.8) = 0.4 \]
\[ P(0.5 < Y < 1.5) = 1 \times 0.5 = 0.5 \]
Whether you get these values from the uniform probability density function as given here, or from F(y) = 0.5y
(the cumulative probability distribution), the results are identical:
\[ P(Y < 0.8) = F(0.8) = 0.5 \times 0.8 = 0.4 \quad \text{and} \]
\[ P(0.5 < Y < 1.5) = F(1.5) - F(0.5) = 0.75 - 0.25 = 0.5. \]


2. From several years' records, a fish market manager has determined that the weight of deep sea bream sold
in the market (X) is approximately normally distributed with a mean of 420 grams and a standard
deviation of 80 grams. Assuming this distribution will remain unchanged in the future, calculate the
expected proportions of deep sea bream sold over the next year weighing
a) between 300 and 400 grams.
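\[ \begin{aligned}
P(300 < X < 400) &= P\left(\frac{300 - 420}{80} < Z < \frac{400 - 420}{80}\right) \\
&= P(-1.5 < Z < -0.25) \\
&= P(Z < -0.25) - P(Z < -1.5) \\
&= 0.4013 - 0.0668 \\
&= 0.3345
\end{aligned} \]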

b) between 300 and 500 grams.


\[ \begin{aligned}
P(300 < X < 500) &= P\left(\frac{300 - 420}{80} < Z < \frac{500 - 420}{80}\right) \\
&= P(-1.5 < Z < 1) \\
&= P(Z < 1) - P(Z < -1.5) \\
&= 0.8413 - 0.0668 \\
&= 0.7745
\end{aligned} \]

c) more than 600 grams.


\[ \begin{aligned}
P(X > 600) &= P\left(Z > \frac{600 - 420}{80}\right) \\
&= P(Z > 2.25) \\
&= 1 - P(Z < 2.25) \\
&= 1 - 0.9878 = 0.0122
\end{aligned} \]
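
The three table look-ups above can be cross-checked in software. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and small differences from the tabulated answers arise only from rounding z to two decimals in the tables.

```python
from statistics import NormalDist

# Weight of deep sea bream: X ~ N(420, 80^2), as stated in the question
weight = NormalDist(mu=420, sigma=80)

# (a) P(300 < X < 400)
p_a = weight.cdf(400) - weight.cdf(300)

# (b) P(300 < X < 500)
p_b = weight.cdf(500) - weight.cdf(300)

# (c) P(X > 600)
p_c = 1 - weight.cdf(600)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```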

3. In a certain large city, household annual incomes are considered approximately normally distributed with
a mean of $40,000 and a standard deviation of $6,000. What proportion of households in the city have
an annual income over $30,000? If a random sample of 60 households were selected, how many of these
households would we expect to have annual incomes between $35,000 and $45,000?

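Let X denote household annual income, i.e., \(X \sim N(40000, 6000^2)\).

\[ \begin{aligned}
P(X > 30000) &= P\left(Z > \frac{30000 - 40000}{6000}\right) \\
&= P(Z > -1.67) \\
&= 1 - P(Z < -1.67) \\
&= 1 - 0.0475 \\
&= 0.9525
\end{aligned} \]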

So 95.25% of households in the city would be expected to have annual incomes greater than $30,000.

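\[ \begin{aligned}
P(35000 < X < 45000) &= P\left(\frac{35000 - 40000}{6000} < Z < \frac{45000 - 40000}{6000}\right) \\
&\approx P(-0.83 < Z < 0.83) \\
&= 1 - 2P(Z < -0.83) \\
&= 1 - 2 \times 0.2033 \\
&= 0.5934
\end{aligned} \]

Therefore we expect \(0.5934 \times 60 \approx 36\) households in the sample to have annual incomes between $35,000
and $45,000.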


4. What is the 25th percentile of the normal distribution N(10, 9)?

Let x be the required percentile. First find z, the 25th percentile of a standard normal.

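\[ P(Z < z) = 0.25 \]

From the standard normal tables, \(z = -0.67\).

For \(X \sim N(10, 9)\), the 25th percentile x satisfies:
\[ \frac{x - 10}{3} = -0.67 \quad \Rightarrow \quad x = 7.99. \]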

5. In a certain city, it is estimated that 40% of households have access to the internet. A company wishing
to sell services to internet users randomly chooses 150 households in the city and sends them advertising
material.
(a) Calculate the probability that fewer than 60 contacted households have internet access.

Let X be the number of households contacted that have internet access. Then assume X is a binomial
random variable with n=150 and p=0.4. Because n is large, we can use the normal approximation to the
binomial where:

\[ \mu = np = 150 \times 0.4 = 60 \]
\[ \sigma^2 = np(1 - p) = 150 \times 0.4 \times 0.6 = 36 \]

Thus incorporating the continuity correction we need to find:

\[ \begin{aligned}
P(X < 60) = P(X \le 59) &\approx P(X < 59.5) \\
&= P\left(Z < \frac{59.5 - 60}{6}\right) \\
&= P(Z < -0.083) \\
&= 0.4681
\end{aligned} \]

(b) Calculate the probability that between 50 and 100 (inclusive) contacted households have
internet access.

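\[ \begin{aligned}
P(50 \le X \le 100) &\approx P(49.5 < X < 100.5) \\
&= P\left(\frac{49.5 - 60}{6} < Z < \frac{100.5 - 60}{6}\right) \\
&= P(-1.75 < Z < 6.75) \\
&= P(Z < 6.75) - P(Z < -1.75) \\
&= 1 - 0.0401 = 0.9599
\end{aligned} \]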

(c) There is a 90% chance (probability of .9) that the number of contacted households with internet
access equals or exceeds what value?

We need the value x such that
\[ P(X \ge x) = 0.9 \]
With the continuity correction, \(P(X > x - 0.5) = 0.9\), so
\[ P\left(Z > \frac{x - 0.5 - 60}{6}\right) = 0.9 \]
\[ P\left(Z < \frac{x - 60.5}{6}\right) = 0.1 \]
\[ \frac{x - 60.5}{6} = -1.28 \quad \Rightarrow \quad x = 52.82 \]

There is a 90% chance that the number of contacted households with internet access is 52 or more.
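
As a rough check on the normal approximation used in parts (a) to (c), the approximate probabilities can be compared with exact binomial sums. This is a sketch under the question's assumption that X is Binomial(150, 0.4); the helper `binom_cdf` and the variable names are illustrative only.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 150, 0.4
# Normal approximation: mean np = 60, standard deviation sqrt(np(1-p)) = 6
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

def binom_cdf(k):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# (a) P(X < 60) = P(X <= 59), continuity-corrected to P(X < 59.5)
approx_a = approx.cdf(59.5)
exact_a = binom_cdf(59)

# (b) P(50 <= X <= 100), continuity-corrected to P(49.5 < X < 100.5)
approx_b = approx.cdf(100.5) - approx.cdf(49.5)
exact_b = binom_cdf(100) - binom_cdf(49)

# (c) x with P(X >= x) = 0.9: solve P(Z < (x - 60.5)/6) = 0.1
x_c = approx.inv_cdf(0.1) + 0.5

print(approx_a, exact_a)
print(approx_b, exact_b)
print(x_c)
```

The software value of z in part (c) is slightly more precise than the tabulated -1.28, so `x_c` differs from 52.82 in the second decimal place.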

6. Using your personalized Course Project data:


(a) Calculate the sample averages of all variables. Which of these averages are meaningful?
Express the meaning of each average in words that are understandable and effective for a
layperson such as your client.

(b) Do you need to manipulate the raw data provided, before proceeding to statistical analyses, in
order to address the client's question? If so, how?

7. Work through problem 28 on page 264 of Sharpe (Chapter 7).

8. UNSW wants to measure the attractiveness of its brand to potential students. The university performs
an experiment by inviting 100 high school students from different public schools across New South
Wales to browse a few websites related to different universities, and then to choose the one that they
would prefer most.

(a) Is this a random sample? Can you think of any potential source of selection bias?

The sample is not perfectly random. First of all, only students in NSW are sampled, and
therefore the attitudes of students from other states of Australia and overseas students are
missed. Also the students are all coming from public schools, and public school graduates
might have different aspirations or expectations compared to private school graduates.

(b) Suppose that a perfectly random sample of students is drawn from the target population, and
these students take part in the exercise described above. Can you think of any confounding
factors, that is, factors that might lead to lack of confidence in using students' expressed
preferences, as measured in this exercise, as an indicator of their degree of overall attraction to
the UNSW brand?

Even if the sample is perfectly random, universities have different qualities in different fields.
For instance, UNSW engineering and science might be leading faculties, but the medical
faculty might not be the top. A student's choice of a university does not only depend on the
attractiveness of the University as a whole, but also on whether it is a leader in the
student's particular field of interest. While part of the appeal of the university as a whole may
be due to such field-specific factors, the stated preference data alone cannot be used to
distinguish these factors from other factors purely related to the overall appeal of the
university's brand. As another example, Australian students have historically been reluctant to
travel in order to attend university, so proximity (which is not what the question originally
targets) has also been an important factor in determining university choice.

Therefore, to measure overall attractiveness of the brand as an independent construct, we should
control for strengths and weaknesses across schools and differences in travel times.

9. Work through problem 22 on page 325 of Sharpe (Chapter 9).


All histograms are centered at 0.85, the true value of p in the distribution from which the samples were
drawn (so this is a mechanical result of the process of simulating!), but as the sample size increases,
the distribution becomes more and more symmetric and unimodal, and the variability in the sample
proportion reduces. This is what happens to the sampling distribution of a statistic such as a sample
proportion or sample mean in general terms: as n, the sample size, rises, the sampling distribution
(which can only be imagined or drawn conditional on n!) of the statistic starts to look more and more
like the normal distribution.

10. Work through problem 44 on page 328 of Sharpe (Chapter 9).

11. Work through problem 60 on page 329 of Sharpe (Chapter 9).

12. Work through problem 36 on page 356 of Sharpe (Chapter 10).


Weeks 7 and 8
1. Suppose a normally distributed random variable X has a mean of 50 and a variance of 100. Also suppose a
sample of size 16 is drawn from this population. Calculate the following probabilities:
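(a) \(P(40 < X < 55)\)

\[ \begin{aligned}
&= P\left(\frac{40 - 50}{10} < Z < \frac{55 - 50}{10}\right) \\
&= P(-1 < Z < 0.5) \\
&= P(Z < 0.5) - P(Z < -1) \\
&= 0.6915 - 0.1587 \\
&= 0.5328
\end{aligned} \]
(b) \(P(40 < \bar{X} < 55)\)

\[ \begin{aligned}
&= P\left(\frac{40 - 50}{10/4} < Z < \frac{55 - 50}{10/4}\right) \\
&= P(-4 < Z < 2) \\
&= P(Z < 2) - P(Z < -4) \\
&= 0.9772 - 0 \\
&= 0.9772
\end{aligned} \]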

2. Recall the Anzac Garage data used previously. These data are available from the course website (in
the Tutorial Questions and Information folder) in an Excel file called Anzacg.xls. Use these 117
observations on used passenger cars to find the 95% confidence interval for the population mean
distance travelled by used passenger cars (this variable is labelled odometer in the data set and is
measured in kilometres). Assume the population standard deviation is 60,000kms.
Since n = 117 is large, we invoke the central limit theorem: \(\bar{X} \sim N\left(\mu, \frac{60{,}000^2}{117}\right)\), at least
approximately.

Using Excel, we find the sample mean is 78,561 kms. The 95% confidence interval is given by

\[ \begin{aligned}
\bar{x} \pm z_{0.025}\frac{\sigma}{\sqrt{n}} &= 78{,}561 \pm 1.96 \times \frac{60{,}000}{\sqrt{117}} \\
&= 78{,}561 \pm 10{,}872 \\
&= (67{,}689,\ 89{,}433)
\end{aligned} \]
The calculated interval is one of the possible realizations of the 95% confidence interval. In repeated
sampling, 95% of intervals calculated in this way would contain the true \(\mu\).
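
The interval can be reproduced in a few lines. This is a minimal sketch, assuming a known population standard deviation and an (approximately) normal sampling distribution for the mean; the function name `z_interval` is an illustrative choice, not something defined in the course materials.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Odometer data: sample mean 78,561 km, sigma = 60,000 km, n = 117
lower, upper = z_interval(xbar=78561, sigma=60000, n=117)
print(round(lower), round(upper))
```

Calling `z_interval` with a lower `confidence`, a larger `n`, or a smaller `sigma` gives a narrower interval, while changing `xbar` only shifts it.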


3. What would be the effects on the width of the confidence interval calculated in the previous question
of:
(a) a decrease in the level of confidence used?
Decreases width
(b) an increase in sample size?
Decreases width
(c) an increase in the population standard deviation?
Increases width
(d) an increase in the sample standard deviation?
No effect on the width since we are told the population standard deviation.
(e) an increase in the value of \(\bar{x}\) found?
No effect on the width.

4. Again referring to the data in odometer from Anzacg.xls and the population from which it is drawn,
determine the sample size required to estimate the population mean to within 5,000 kms with 90%
confidence. Again assume the population standard deviation is 60,000kms.
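\[ z_{\alpha/2} = z_{0.05} = 1.645, \quad B = 5{,}000, \quad \sigma = 60{,}000 \]
where B is the size of the margin of error on either side of the point estimate. Sample size
calculation:

\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^2 = \left(\frac{1.645 \times 60{,}000}{5{,}000}\right)^2 = 389.67 \]
A sample of 390 would be required.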
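
The sample size formula above lends itself to a short function. This is a sketch only; `required_n` is an illustrative name, and the result is rounded up because a fractional observation cannot be sampled.

```python
from math import ceil
from statistics import NormalDist

def required_n(sigma, bound, confidence):
    """Smallest n such that z * sigma / sqrt(n) <= bound."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / bound) ** 2)

# sigma = 60,000 km, margin of error B = 5,000 km, 90% confidence
n_needed = required_n(sigma=60000, bound=5000, confidence=0.90)
print(n_needed)
```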

5. Perform the following hypothesis tests of the population mean. In each case, draw a picture to illustrate
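the rejection regions on both the Z and \(\bar{X}\) distributions, and calculate the p-value of the test.
(a) \(H_0: \mu = 50\), \(H_1: \mu > 50\), n = 100, \(\bar{x} = 55\), \(\sigma = 10\), \(\alpha = 0.05\)

Rejection region:

\[ Z = \frac{\bar{X} - 50}{10/\sqrt{100}} > z_{0.05} = 1.645 \]
Alternatively,

\[ \bar{X} > \bar{x}_c = \mu_0 + z_{0.05}\frac{\sigma}{\sqrt{n}} = 50 + 1.645\left(\frac{10}{\sqrt{100}}\right) = 51.645 \]
Since

\[ z = \frac{55 - 50}{10/\sqrt{100}} = 5 > z_{0.05} = 1.645, \]
we can reject \(H_0\) at the 5% significance level and conclude that the population mean is greater than 50.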

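[Figures: rejection regions for part (a). On the \(\bar{X}\) distribution, centred at 50, reject to the right of 51.645 (upper-tail area 0.05). On the Z distribution, centred at 0, reject to the right of 1.645 (upper-tail area 0.05).]

\[ p\text{-value} = P(Z > 5) \approx 0.0000. \]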

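(b) \(H_0: \mu = 25\), \(H_1: \mu < 25\), n = 100, \(\bar{x} = 24\), \(\sigma = 5\), \(\alpha = 0.1\)

Rejection region:
\[ Z = \frac{\bar{X} - 25}{5/\sqrt{100}} < -z_{0.1} = -1.28 \]
Alternatively,
\[ \bar{X} < \bar{x}_c = \mu_0 - z_{0.1}\frac{\sigma}{\sqrt{n}} = 25 - 1.28\left(\frac{5}{\sqrt{100}}\right) = 24.36 \]

Since
\[ z = \frac{24 - 25}{5/\sqrt{100}} = -2 < -z_{0.1} = -1.28, \]

we can reject \(H_0\) at the 10% significance level and conclude that the population mean is less than 25.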

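[Figures: rejection regions for part (b). On the \(\bar{X}\) distribution, centred at 25, reject to the left of 24.36 (lower-tail area 0.1). On the Z distribution, centred at 0, reject to the left of -1.28 (lower-tail area 0.1).]

\[ p\text{-value} = P(Z < -2) = 0.0228 \]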

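(c) \(H_0: \mu = 80\), \(H_1: \mu \ne 80\), n = 100, \(\bar{x} = 80.5\), \(\sigma = 4\), \(\alpha = 0.05\)

Rejection region:
\[ Z = \frac{\bar{X} - 80}{4/\sqrt{100}} < -z_{0.025} = -1.96 \quad \text{or} \quad Z > 1.96 \]
Alternatively:
\[ \bar{X} < \bar{x}_L = \mu_0 - z_{0.025}\frac{\sigma}{\sqrt{n}} = 80 - 1.96\left(\frac{4}{\sqrt{100}}\right) = 79.216 \]
or
\[ \bar{X} > \bar{x}_U = \mu_0 + z_{0.025}\frac{\sigma}{\sqrt{n}} = 80 + 1.96\left(\frac{4}{\sqrt{100}}\right) = 80.784 \]

Since
\[ z = \frac{80.5 - 80}{4/\sqrt{100}} = 1.25 \]
is neither less than -1.96 nor greater than 1.96, we do not reject \(H_0\) at the 5% significance level.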

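[Figures: rejection regions for part (c). On the \(\bar{X}\) distribution, centred at 80, reject to the left of 79.216 or to the right of 80.784 (area 0.025 in each tail). On the Z distribution, centred at 0, reject to the left of -1.96 or to the right of 1.96 (area 0.025 in each tail).]

\[ p\text{-value} = 2P(Z > 1.25) = 2 \times 0.1056 = 0.2112 \]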
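
The three tests in this question follow the same recipe, which can be sketched as a single function. The name `z_test` and its `tail` argument are illustrative assumptions, not course-supplied code; the sketch assumes a known population standard deviation, as in the question.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, tail):
    """Return (z statistic, p-value) for a test of H0: mu = mu0 with known sigma.

    tail is "upper" (H1: mu > mu0), "lower" (H1: mu < mu0) or "two" (H1: mu != mu0).
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "lower":
        p_value = std_normal.cdf(z)
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

z_a, p_a = z_test(55, 50, 10, 100, "upper")    # part (a)
z_b, p_b = z_test(24, 25, 5, 100, "lower")     # part (b)
z_c, p_c = z_test(80.5, 80, 4, 100, "two")     # part (c)

for label, z, p, alpha in [("a", z_a, p_a, 0.05), ("b", z_b, p_b, 0.10), ("c", z_c, p_c, 0.05)]:
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(label, round(z, 2), round(p, 4), decision)
```

Comparing the p-value with the significance level gives the same decision as comparing z with the critical value.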

6. A real estate expert claims the current mean value of houses in a particular area is more than $250,000.
A random sample of 150 recent sales prices in the area yields a sample mean of $265,000. It is known
that house values in the area are approximately normally distributed with a standard deviation of
$50,000.
(a) Perform an upper tail test of the null hypothesis that the population mean house value in the
area is $250,000. Use a 5% level of significance and state the rejection (critical) region in
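terms of both \(\bar{x}\) and z.
Let X = value of a house in the area.
\[ \bar{x} = \$265{,}000, \quad \sigma = \$50{,}000, \quad \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]

We wish to test
\[ H_0: \mu = 250{,}000; \quad H_1: \mu > 250{,}000 \]

Rejection region:
\[ Z = \frac{\bar{X} - 250{,}000}{50{,}000/\sqrt{150}} > z_{0.05} = 1.645 \]

or
\[ \bar{X} > \bar{x}_c = \mu_0 + z_{0.05}\frac{\sigma}{\sqrt{n}} = 250{,}000 + 1.645\left(\frac{50{,}000}{\sqrt{150}}\right) \approx 256{,}715.68 \]

Since

\[ z = \frac{265{,}000 - 250{,}000}{50{,}000/\sqrt{150}} = 3.67 > z_{0.05} = 1.645, \]

we reject \(H_0\) at the 5% level of significance and conclude that the mean house value in the area is more
than $250,000.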

(b) Why is an upper tail test most appropriate in this case?

The nature of the research problem dictates an upper tail test. In this case we will not believe the
expert's claim unless there is significant sample evidence to do so. The claim itself implies the
possibility of an alternative above the conservative number one would otherwise guess, which in
turn implies an upper tail test.

(c) What is the p-value associated with the test statistic used in the part (a) test? Interpret this
value.

The p-value is the probability of obtaining a test statistic as or more extreme than the realized
value, assuming the null hypothesis is true, from a sample of the given size. The lower the p-value,
the greater is the evidence for rejection of the null hypothesis. Here the p-value is \(P(Z > 3.67) \approx 0.0001\),
so it is very unlikely to find a sample mean as extreme as $265,000 in a sample of 150 observations if in
reality the population mean is $250,000.

(d) Define in words the type I and II errors that could afflict the part (a) test.

Type I Error: Concluding that average housing price is more than $250,000, when in fact it is
really $250,000.
Type II Error: Not rejecting the claim that average housing price is $250,000, when in fact it is
really more.
Note that the exact probability of a Type II error cannot be determined without specifying an
exact alternative hypothesis.

7. What effect does increasing the sample size have on the outcome of a hypothesis test? Explain your
answer using the example of a one-tail test concerning the mean of a normally distributed population
with known variance.

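Suppose an upper tail test:

\[ H_0: \mu = \mu_0; \quad H_1: \mu > \mu_0 \]

Under \(H_0\):
\[ X \sim N(\mu_0, \sigma^2) \;\Rightarrow\; \bar{X} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right) \;\Rightarrow\; Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \]

The point \(z_\alpha\) on N(0,1) corresponds to the point \(\bar{x}_c = \mu_0 + z_\alpha \sigma/\sqrt{n}\) on the distribution of \(\bar{X}\) under \(H_0\).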


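The distribution of \(\bar{X}\) under \(H_0\) is:

[Figure: density of \(\bar{X}\) centred at \(\mu_0\), with the cutoff \(\mu_0 + z_\alpha \sigma/\sqrt{n}\) marked in the upper tail.]

But suppose the true \(\mu\) is to the right of \(\mu_0\). Then the true distribution of \(\bar{X}\) is, say:

[Figure: density of \(\bar{X}\) centred to the right of \(\mu_0\), with the area to the right of the cutoff \(\mu_0 + z_\alpha \sigma/\sqrt{n}\) shaded.]
The shaded area in the above diagram gives the probability of correctly rejecting \(H_0\) (i.e. the power,
\(1 - \beta\), which is greater than the area under the tail and beyond the cutoff point on the first graph).

Now suppose the sample size is increased. As a result:

\(\mathrm{Var}(\bar{X}) = \sigma^2/n\) decreases and hence \(\sigma/\sqrt{n}\) decreases.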


Suppose the new sample size is \(n_1 > n\). The distribution of \(\bar{X}\) under \(H_0\) will now look something like:

[Figure: narrower density of \(\bar{X}\) centred at \(\mu_0\), with the cutoff \(\mu_0 + z_\alpha \sigma/\sqrt{n_1}\) marked in the upper tail.]

Note that with a fixed \(\alpha\) the rejection region cutoff is now smaller (we have to fit the same amount
of probability density in above the cutoff, but because of this distribution's lower variance than in the
initial case, there is less density in the tails so we have to move our cutoff point closer to the centre
of the distribution in order to capture enough probability density). Again, if the true \(\mu\) is actually to
the right of \(\mu_0\), the probability of rejecting the same incorrect null hypothesis is higher than before.
Diagrammatically the true distribution of \(\bar{X}\) will be, say:

[Figure: narrower density of \(\bar{X}\) centred to the right of \(\mu_0\), with the area to the right of the cutoff \(\mu_0 + z_\alpha \sigma/\sqrt{n_1}\) shaded.]

Again the shaded area in the above diagram gives the probability of correctly rejecting \(H_0\).

Conclusion: The probability of correctly rejecting a false \(H_0\) (the power of the test) increases as n
increases, given we keep the Type I error probability (\(\alpha\)) fixed.
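
The conclusion can be illustrated numerically. The sketch below borrows the house-price numbers from the earlier question (null mean 250,000, standard deviation 50,000) and assumes, purely for illustration, a true mean of 260,000; the function name `power_upper_tail` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(mu0, mu_true, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 in favour of mu > mu0 when the mean is mu_true."""
    se = sigma / sqrt(n)
    # Cutoff on the sample-mean scale: mu0 + z_alpha * sigma / sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Power is the area to the right of the cutoff under the true sampling distribution
    return 1 - NormalDist(mu=mu_true, sigma=se).cdf(cutoff)

sample_sizes = [50, 150, 300, 600]
powers = [power_upper_tail(250000, 260000, 50000, n) for n in sample_sizes]

for n, pw in zip(sample_sizes, powers):
    print(n, round(pw, 4))
```

With alpha held at 0.05, each increase in n moves the cutoff closer to the null mean and raises the power.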

8. Project Review: For the course project, you are only expected to use statistical methods covered in
lectures up to and including those in Week 9. Thus you should now have sufficient material to complete
the project in a timely fashion.

What might be useful at this stage is to think about presentation. See the statistical report section of
the Project folder on Moodle for some ideas in general. As a directed exercise for this tutorial, compare
and contrast the presentation of material in the NSW BOCSAR report on driving under the influence
of cannabis (driving-cannabis.pdf) and Queensland Office of Economic and Statistical Research
bulletin on computer and internet usage in Queensland (computer-internet-useage-qld-c01.pdf). You
should be able to read these reports comfortably, although there are a few methods that may be
unfamiliar in the cannabis report (although these methods will be covered later in the course).

(a) What are some key differences in the presentation of results in the 2 reports?
In the observations below, DC = driving under the influence of cannabis report and CIU =
computer and internet use report.

Heavy reliance on graphical presentation in CIU in conjunction with a couple of tables.
No graphs in DC and only one table, with most of the statistics introduced in the text.
Part of this difference is associated with a slightly different objective: CIU is entirely descriptive while
DC has descriptive statistics and analysis including confidence intervals. This in part explains why
DC has many references and CIU has none.
Also DC is more complete as a report. For example, CIU has no conclusion or discussion of the
implications of the results.

(b) Which is better? What criteria do you use to determine which is better?

For example, do we really need Figure 11 in CIU where we have only 6 numbers to report? Also see
the text associated with Figure 11 where several other results are reported that cannot be deduced
from the figure (comparison of reported skills by age).

(c) What are some key similarities in the overall presentation in the 2 reports?
Both have a non-technical summary; recall the need for an Executive Summary in the project.
Both should be relatively clear and easy to read.
Both have an absence of intermediate calculations as would be present in a tutorial problem. This
is entirely appropriate for reports of this type and hence the project.
The above is not meant to be comprehensive.


Weeks 9 and 10
1. State whether the normal distribution, the t distribution, or neither would be the right type of sampling
distribution to assume for the sample mean in order to test hypotheses regarding the population mean
in the following situations:
(a) Population variable normally distributed, \(\sigma^2\) unknown, sample size less than 30.
t-distribution, because of the unknown population variance and the low sample size.

(b) Population variable normally distributed, \(\sigma^2\) unknown, sample size greater than 30.
t-distribution although as the sample size gets very large this effectively becomes the same as
using the normal.

(c) Population variable normally distributed, \(\sigma^2\) known, sample size less than 30.
Normal distribution because we know the population variance. In this case, even in a small
sample, the sampling distribution of the mean follows a normal distribution.

(d) Population variable not normally distributed, \(\sigma^2\) unknown, sample size greater than 30.
Even though the population variable is not normally distributed, because the sample size is large
you can invoke the CLT and use the fact that \(s^2\) is a consistent estimator of \(\sigma^2\) to justify using the
normal distribution.

(e) Population variable not normally distributed, \(\sigma^2\) unknown, sample size less than 30.
Here the sampling distribution of the mean is unknown, and hence we don't know how to test a
hypothesis about \(\mu\) in this circumstance. In practice you could either assume the population variable
is approximately normally distributed and proceed as in (a); or alternatively invoke the CLT and
proceed as in (d). How well either of these solutions works ultimately depends on the extent of non-
normality of the distribution of the variable in the population, which is not specified in the question.

2. Reconsider the example used earlier in the course in which a real estate expert claimed the current
mean value of houses in a particular area was more than $250,000. A random sample of 150 recent
sales prices in the area yielded a sample mean of $265,000, and it is known that house values in the
area are approximately normally distributed with a standard deviation of $50,000.
(a) If in fact the population mean house value in the area is $260,000, what is the probability of
committing a type II error in performing an upper-tail test of the null hypothesis that the mean
house value in the area is $250,000, as was done in Part (a) of the prior weeks' exercise?
What is the power of the test in these circumstances? State in words what the power of the test
means.

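Let X = value of a house in the area.

\[ \bar{x} = \$265{,}000, \quad \sigma = \$50{,}000, \quad n = 150, \quad \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
\[ H_0: \mu = 250{,}000; \quad H_1: \mu > 250{,}000 \]

Rejection region:
\[ \bar{X} > \bar{x}_c = \mu_0 + z_{0.05}\frac{\sigma}{\sqrt{n}} = 250{,}000 + 1.645\left(\frac{50{,}000}{\sqrt{150}}\right) \approx 256{,}715.68 \]

Thus the Type II error probability (probability of not rejecting \(H_0\) when it is false) is:

\[ \begin{aligned}
\beta &= P(\bar{X} < 256{,}715.68 \mid \mu = 260{,}000) \\
&= P\left(Z < \frac{256{,}715.68 - 260{,}000}{50{,}000/\sqrt{150}}\right) \approx P(Z < -0.8) = 0.2119
\end{aligned} \]

\[ \text{Power} = 1 - \beta = 0.7881 \]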

The power of the test gives the probability of correctly rejecting the null hypothesis when it is false. Note
that power cannot be specified unless we specify a particular alternative around which we assume the
sampling distribution of our statistic to be centred. For this reason (the dependence of power on the
exact alternative being considered), power is sometimes defined as the ability of a test to detect a
particular alternative.

(b) Illustrate your answer to part (a) above by showing on a diagram the areas representing the
probability of a type II error and the power of the test.

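[Figure: sampling distributions of \(\bar{X}\) under \(H_0\) (\(\mu\) = 250,000) and under \(\mu\) = 260,000, with the cutoff \(\bar{x}_c\) = $256,715.68 marked between the two centres. Under the \(\mu\) = 260,000 curve, the area to the left of the cutoff is \(\beta\) and the area to the right is the power, \(1 - \beta\).]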

3. A company running an urban rail service wishes to estimate its daily average number of late-running
trains on weekdays. For 10 randomly selected weekdays, it finds the following numbers of late running
trains:

32, 10, 9, 18, 25, 15, 14, 18, 22, 16

(a) Assuming the number of late running trains on a weekday is approximately normally
distributed, calculate a 90% confidence interval for the mean number of late-running trains on
a weekday.

Let X = number of late-running trains on a weekday. Then:

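\[ \alpha = 0.1, \quad \bar{x} = 17.9, \quad s^2 = 48.32, \quad s \approx 6.9514 \]

Since \(\sigma^2\) is unknown, n is small and the underlying distribution of the variable in the population is
(approximately) normal, we construct the confidence interval using the t distribution. The required
interval is:
\[ \begin{aligned}
\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} &= 17.9 \pm t_{0.05,\,9} \times \frac{6.9514}{\sqrt{10}} \\
&= 17.9 \pm 1.833 \times \frac{6.9514}{\sqrt{10}} \\
&= 17.9 \pm 4.029 \\
&= (13.871,\ 21.929)
\end{aligned} \]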
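
The summary statistics and the interval can be recomputed from the raw data. This sketch uses only the Python standard library, which has no t distribution, so the critical value 1.833 is taken from the t table exactly as in the solution above rather than computed; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

late_trains = [32, 10, 9, 18, 25, 15, 14, 18, 22, 16]

# t critical value for 90% confidence with 9 degrees of freedom,
# copied from the t table used in the solution (not computed here)
T_CRIT_0_05_DF9 = 1.833

n = len(late_trains)
xbar = mean(late_trains)
s = stdev(late_trains)          # sample standard deviation (n - 1 divisor)

margin = T_CRIT_0_05_DF9 * s / sqrt(n)
lower, upper = xbar - margin, xbar + margin

print(xbar, round(s, 4), round(lower, 3), round(upper, 3))
```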

(b) If we did not have the assumption of normality, could we still calculate a confidence interval
in this example? If not, suggest a way of overcoming this problem.


Everything else the same, we could not construct a confidence interval in the same way as in (a) since the
t distribution is only valid if the underlying distribution of the variable in the population is normal. This
problem could be overcome by obtaining a larger sample size and then making use of the central limit
theorem (and still using s instead of \(\sigma\)).

4. Reconsider the question from a previous week that used the Anzac Garage data, available from the
course website (in the Tutorial Questions and Information folder) in an Excel file called Anzacg.xls.
Would normality be a good approximation for the population distribution of distance travelled by used
passenger cars? (Hint: look at the summary statistics and a histogram.) Do you need to assume
normality? Redo the 95% confidence interval for the population mean distance travelled by used
passenger cars without assuming a known population standard deviation.

Excel summary statistics and histogram for distance traveled indicate non-normality. The distribution is
skewed to the right, the median is much less than the mean, and the sample mean is only 1.35 standard
deviations from zero:

Odometer (km)

Mean 78560.83
Standard Error 5384.86
Median 67980
Mode 147000
Standard Deviation 58246.19
Sample Variance 3392618896
Kurtosis 3.426
Skewness 1.528
Range 315597
Minimum 403
Maximum 316000
Sum 9191617
Count 117


[Figure: Frequency histogram for odometer readings for cars in Anzac Garage data. Horizontal axis: Odometer (kms), from 20,000 to 300,000; vertical axis: Frequency, from 0 to 45.]

While the population distribution seems non-normal, the sample size is large enough to invoke the CLT
and hence to assume the sample mean is approximately normally distributed.

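In a previous question using these data we assumed \(\sigma\) was known, but here we consider the more likely
situation where it is unknown and we replace \(\sigma\) by s as calculated by Excel. The 95% confidence interval
is given by:

\[ \begin{aligned}
\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} &= 78{,}561 \pm 1.96 \times \frac{58{,}246}{\sqrt{117}} \\
&= 78{,}561 \pm 10{,}554 \\
&= (68{,}007,\ 89{,}115)
\end{aligned} \]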

5. It is known that 80% of people suffering from a particular disease are cured by a certain standard
medication. Test the claim of the developers of a new medication that their product is more effective
than the standard medication in curing the disease, using a 5% significance level, given a random
sample of 400 people with the disease of whom 330 are cured by using the new medication. (Hint:
Use the normal approximation, and ignore the continuity correction.)
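\[ H_0: p = 0.8, \quad H_1: p > 0.8, \quad n = 400, \quad \alpha = 0.05 \quad \text{and} \quad \hat{p} = \frac{330}{400} = 0.825 \]

Since n is large (np = 320 and n(1 - p) = 80 under \(H_0\)), we can use the normal approximation to the binomial. Under \(H_0\):
\[ \hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right), \quad \text{i.e.} \quad \hat{p} \sim N\left(0.8, \frac{0.8 \times 0.2}{400}\right) \]

So, ignoring the continuity correction, calculate the empirical significance level, or p-value:
\[ P(\hat{p} > 0.825) = P\left(Z > \frac{0.825 - 0.8}{\sqrt{(0.8 \times 0.2)/400}}\right) = P(Z > 1.25) = 0.1056 \]

Because the p-value > \(\alpha\) (0.1056 > 0.05) we do not reject \(H_0\) and instead we conclude that there is not
enough evidence to support the developers' claim of a more effective cure.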


(Alternatively, the rejection region is given by z > 1.645 or \(\hat{p}\) > 0.8329.)
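
The test of the proportion can be reproduced as follows. This is a minimal sketch of the normal-approximation test without continuity correction, as the hint in the question instructs; the function name `proportion_z_test_upper` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test_upper(successes, n, p0):
    """Upper-tail z test of H0: p = p0, using the standard error under H0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)
    return p_hat, z, p_value

p_hat, z, p_value = proportion_z_test_upper(successes=330, n=400, p0=0.8)

# Critical value on the sample-proportion scale for alpha = 0.05
cutoff = 0.8 + NormalDist().inv_cdf(0.95) * sqrt(0.8 * 0.2 / 400)

print(p_hat, round(z, 2), round(p_value, 4), round(cutoff, 4))
```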

6. Download the data "Credit_Card_Bank" from the MyStatLab website (available under the heading
of "Chapter 1: Data and Decisions"). Using the variables "Offer Status" and "Spendlift Positive",
conduct the appropriate Chi-squared test to determine whether there is a relationship between
the type of offer a customer was exposed to and whether a lift in spending was observed, assuming a
significance level of 0.05. Interpret your results.

The contingency table looks like this:


Offer/Lift                              No     Yes    TOTAL
Double Miles + Free Flight Insurance    15     14     29
Free Flight Insurance                   15     12     27
No Offer                                16     5      21
Rtl w/o Enr                             11     12     23
TOTAL                                   57     43     100

To determine the chi-squared value, we first need to construct a table of expected values, which is as
follows:
Offer/Lift                              No       Yes      TOTAL
Double Miles + Free Flight Insurance    16.53    12.47    29
Free Flight Insurance                   15.39    11.61    27
No Offer                                11.97    9.03     21
Rtl w/o Enr                             13.11    9.89     23
TOTAL                                   57       43       100

For example, the first entry is (29/100)*(57/100)*100 = 16.53.

The table of Chi-squared terms is then:


Offer/Lift                              No           Yes          TOTAL
Double Miles + Free Flight Insurance    .1416152     .1877225     29
Free Flight Insurance                   .009883      .0131007     27
No Offer                                1.3568003    1.7985492    21
Rtl w/o Enr                             .3395957     .4501617     23
TOTAL                                   57           43           100

For example, the first entry is (15-16.53)^2/16.53 = .1416152.

Summing the entries in all cells of the table above yields the Chi-sq statistic of 4.2974283. The
Chi-sq critical value for alpha = .05 and df = (r-1)*(c-1) = (4-1)*(2-1) = 3 is (consulting the table
at the back of the book) equal to 7.815, so our statistic falls into the non-critical region.
The P-value associated with our statistic is 0.231. We fail to reject the null hypothesis that the two
variables (the offer type, and whether or not there was a spend lift) are statistically independent.

The sample therefore provides insufficient evidence that whether a spend lift is observed depends on
the type of offer received.
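
The expected counts and the Chi-squared statistic can be recomputed directly from the observed table. This is a sketch using only the standard library; the p-value line relies on a closed-form expression for the upper-tail probability that is valid only for 3 degrees of freedom, which is an addition for checking purposes and not part of the original solution.

```python
from math import erfc, exp, pi, sqrt

# Observed counts (No, Yes) for each offer type, from the contingency table
observed = {
    "Double Miles + Free Flight Insurance": (15, 14),
    "Free Flight Insurance": (15, 12),
    "No Offer": (16, 5),
    "Rtl w/o Enr": (11, 12),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

chi_sq = 0.0
for row, row_total in zip(rows, row_totals):
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        chi_sq += (obs - expected) ** 2 / expected

df = (len(rows) - 1) * (len(col_totals) - 1)

# Upper-tail probability of a Chi-squared variable; this closed form holds for df = 3 only
p_value = erfc(sqrt(chi_sq / 2)) + sqrt(2 * chi_sq / pi) * exp(-chi_sq / 2)

print(round(chi_sq, 4), df, round(p_value, 3))
```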

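7. Use a calculator to compute the sample least squares regression line for the model \(y = \beta_0 + \beta_1 x + \varepsilon\),
given the following six observations:
y    2    8    6    12    9     11
x    1    4    3    10    10    8

\[ \bar{x} = \frac{1 + 4 + 3 + 10 + 10 + 8}{6} = 6; \quad \bar{y} = \frac{2 + 8 + 6 + 12 + 9 + 11}{6} = 8 \]
\[ \sum (x_i - \bar{x})(y_i - \bar{y}) = (1 - 6)(2 - 8) + \dots + (8 - 6)(11 - 8) = 62 \]
\[ \sum (x_i - \bar{x})^2 = (1 - 6)^2 + \dots + (8 - 6)^2 = 74 \]
\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{62}{74} = 0.8378 \]
\[ b_0 = \bar{y} - b_1\bar{x} = 8 - 0.8378 \times 6 = 2.9732 \]

Thus the sample regression line is \(\hat{y} = 2.9732 + 0.8378x\).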
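
The same least squares calculation can be written out step by step in code. This is a minimal sketch with illustrative variable names; because it carries the slope at full precision, the intercept comes out as roughly 2.973 rather than the 2.9732 obtained above from the rounded slope.

```python
x = [1, 4, 3, 10, 10, 8]
y = [2, 8, 6, 12, 9, 11]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of cross-products and sum of squares about the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx          # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate

print(x_bar, y_bar, s_xy, s_xx)
print(round(b0, 4), round(b1, 4))
```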

8. Suppose the relationship between the dependent variable "weekly household consumption
expenditure in dollars" (y) and the independent variable "weekly household income in dollars" (x) is
represented by the simple regression model (i refers to the ith observation or household):

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Suppose a sample of observations yields least squares estimates of b0 = -32 and b1 = 0.82 for this
model.

(a) What does \(\varepsilon_i\) represent in the model?


It is the random disturbance term. It includes any purely random factors or errors and factors that are
systematic but have been left out of the model.

(b) State the basic (classical) assumptions made about the \(\varepsilon_i\)'s in this model. Explain in words what
the assumptions mean.

(i) These errors are random variables, for which one classical assumption is that \(E(\varepsilon_i \mid x_i) = 0\) for all
observations. In words, the conditional mean of the disturbance does not depend on x and is
normalized to zero. Note this is not something directly addressed by Sharpe in Chapter 15. It is
nevertheless a crucial assumption. That the conditional mean of the disturbances does not depend
on x ensures the unbiasedness of the OLS estimator as an indicator of the direct effect of x on y, and
is hence as important as the other assumptions below in determining how the output of the model
can be interpreted. This assumption implies that omitted factors that might affect expenditure, but
in fact are only included in the disturbance term rather than as separate variables on the right-hand
side of the model, must be uncorrelated with x. In the present example, a violation of this requirement
would occur if, for example, people who are taller also earn more income (due to more confidence in
the labor market than shorter people, say), and at the same time consume more because they require
more calories to keep their larger bodies running. This would then mean that additional income does
not directly bring about all of the additional consumption implied by the slope coefficient estimate;
at least part of this effect is due to the third-party cause of each variable (income and consumption):
height.
(ii) \((x_i, y_i)\) are drawn by simple random sampling and are hence independent and identically distributed.
(iii) The standard deviation of \(\varepsilon_i\) is constant for all observations (no changing spread, as Sharpe says).
This spread is denoted by \(\sigma_\varepsilon\) and we say the disturbances in such a case are homoskedastic. Here
that implies the variability in consumption expenditure does not depend on income, which is possibly
problematic in practice (poorer people are probably more similar to one another in how much of
their income they consume, since they operate closer to the line of subsistence; richer people have
many more options for how to use their money, and hence there may be more variability across rich
people in what share of their income is used for consumption expenditure).
(iv) The disturbances for any two observations are independent. This will imply, in particular, that there
is no correlation between the disturbances associated with different observations. In this example
this requires that the factors in the disturbance for household i are not correlated with those for
household j, which seems reasonable.
(v) \(\varepsilon_i\) is normally distributed for all observations.
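
The height example under assumption (i) can be illustrated with a small simulation. Everything numeric below is hypothetical and chosen only for illustration (the "true" coefficients, the way height feeds into income and consumption, the sample size); the point is only to show that when an omitted factor in the disturbance is correlated with x, the least squares slope overstates the direct effect of income.

```python
import random

random.seed(0)

def ols(x, y):
    """Return (b0, b1) from a simple least squares fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical "true" values, for illustration only.
BETA0, BETA1 = 50.0, 0.60   # direct effect of income on consumption
GAMMA = 2.0                 # effect of height (cm above 170) on consumption
N = 20000

height = [random.gauss(170, 10) for _ in range(N)]

# Case 1: income depends on height, so the omitted factor is correlated with x.
income_corr = [500 + 5 * (h - 170) + random.gauss(0, 100) for h in height]
# Case 2: income is unrelated to height, so assumption (i) holds.
income_indep = [500 + random.gauss(0, 100) for _ in height]

# Height is left out of the fitted model, so it sits inside the disturbance.
disturbance = [GAMMA * (h - 170) + random.gauss(0, 20) for h in height]

cons_corr = [BETA0 + BETA1 * x + e for x, e in zip(income_corr, disturbance)]
cons_indep = [BETA0 + BETA1 * x + e for x, e in zip(income_indep, disturbance)]

_, b1_corr = ols(income_corr, cons_corr)
_, b1_indep = ols(income_indep, cons_indep)

print(f"true slope:                          {BETA1:.3f}")
print(f"estimate, height correlated with x:  {b1_corr:.3f}")
print(f"estimate, height unrelated to x:     {b1_indep:.3f}")
```

In the first case the estimated slope settles above the true 0.60 however large the sample, because part of the effect of height is attributed to income; in the second case it stays close to 0.60.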

(c) Does the estimate of b0 = -32 make sense? If not, does this necessarily invalidate the model?
Explain your answer.

This indicates that if a household had a zero weekly income then on average such a household would
have negative consumption, which does not make sense. However, this does not necessarily invalidate
the model. It may be that the linear model is only a reasonable approximation for some range of
household incomes, not including incomes near zero. In particular, the relationship between the two
variables may be non-linear for values of x near zero. The conclusion is that we should be careful in
interpreting the intercept term, as it may not be very meaningful in some cases.

(d) Interpret both \(\beta_1\) and b1. What does the model predict would be the change in y following a $10
increase in x from some initial level?

\(\beta_1\) is the (unknown) population change in the value of y resulting from a one-unit increase in x, whereas
b1 = 0.82 is an estimate of \(\beta_1\). In this particular example the quantity being estimated is the marginal
propensity to consume that is discussed in economics courses. The predicted change in y following a $10
increase in x would be 10 × b1 = 10 × 0.82 = $8.20.
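
A short sketch makes the phrase "from some initial level" concrete: because the fitted model is linear, the predicted change is the same whatever the starting income. The function name and the three starting incomes below are hypothetical; only b0 = -32 and b1 = 0.82 come from the question.

```python
B0, B1 = -32.0, 0.82  # least squares estimates given in the question

def predicted_consumption(income):
    """Predicted weekly consumption (dollars) for a weekly income in dollars."""
    return B0 + B1 * income

# Hypothetical starting incomes; the predicted change does not depend on them.
for start in (200.0, 500.0, 1000.0):
    change = predicted_consumption(start + 10) - predicted_consumption(start)
    print(f"income {start:7.2f} -> {start + 10:7.2f}: predicted change = {change:.2f}")
```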

(e) Suppose we measured y and x in cents rather than dollars. What effect would this have on the
estimated coefficient of x? What effect would it have on the estimated intercept?
In this case: $x becomes 100x cents and $y becomes 100y cents. The estimated coefficient of x when the
variables are measured in dollars is given by

\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

If we let \(b_1^{*}\) be the estimated slope coefficient when the variables are measured in cents, we have

\[ b_1^{*} = \frac{\sum (100x_i - 100\bar{x})(100y_i - 100\bar{y})}{\sum (100x_i - 100\bar{x})^2} = \frac{100^2 \sum (x_i - \bar{x})(y_i - \bar{y})}{100^2 \sum (x_i - \bar{x})^2} = b_1 \]

Also, denoting by \(b_0^{*}\) the estimated intercept in this case, we have

\[ b_0^{*} = 100\bar{y} - b_1^{*} \cdot 100\bar{x} = 100(\bar{y} - b_1 \bar{x}) = 100 b_0 \]

Thus estimation of this model (with the same, but re-scaled data) would lead to an unchanged b1, whilst
the estimated intercept term would become \(100 b_0 = -3200\).

(f) Suppose y were measured in dollars but x were measured in cents. What effects would this have
on the estimated coefficient of x?

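Denote the estimated slope and intercept in this case by \(b_1^{**}\) and \(b_0^{**}\), respectively. Then

\[ b_1^{**} = \frac{\sum (100x_i - 100\bar{x})(y_i - \bar{y})}{\sum (100x_i - 100\bar{x})^2} = \frac{100 \sum (x_i - \bar{x})(y_i - \bar{y})}{100^2 \sum (x_i - \bar{x})^2} = \frac{1}{100} b_1 \]

\[ b_0^{**} = \bar{y} - b_1^{**} \cdot 100\bar{x} = \bar{y} - b_1 \bar{x} = b_0 \]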

The estimation of this model would lead to an estimated coefficient for the income variable of 0.0082,
and the estimated intercept would be unchanged. This makes sense since:
If income is measured in dollars, we predict expenditure (in dollars) will increase by $0.82 if household
income increases by one dollar.
If income is measured in cents, we predict expenditure (in dollars) will increase by $0.0082 if household
income increases by one cent.
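
The rescaling results in parts (e) and (f) can be checked numerically. No data are given for this question, so the sketch below borrows the six observations from the earlier worked example purely as stand-in "dollar" data; the helper `ols` is written here for illustration.

```python
def ols(x, y):
    """Return (b0, b1) from a simple least squares fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Stand-in data (assumed to be in dollars for this illustration).
x_dollars = [1, 4, 3, 10, 10, 8]
y_dollars = [2, 8, 6, 12, 9, 11]
x_cents = [100 * v for v in x_dollars]
y_cents = [100 * v for v in y_dollars]

b0, b1 = ols(x_dollars, y_dollars)      # both in dollars
b0_e, b1_e = ols(x_cents, y_cents)      # part (e): both in cents
b0_f, b1_f = ols(x_cents, y_dollars)    # part (f): x in cents, y in dollars

print(f"dollars/dollars: b0 = {b0:.4f}, b1 = {b1:.6f}")
print(f"cents/cents:     b0 = {b0_e:.4f}, b1 = {b1_e:.6f}")
print(f"x in cents only: b0 = {b0_f:.4f}, b1 = {b1_f:.6f}")
```

With both variables in cents the slope is unchanged and the intercept is multiplied by 100; with only x in cents the slope is divided by 100 and the intercept is unchanged.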

(g) Distinguish between \(\varepsilon_i\) and \(e_i\) (the residual associated with observation i). Illustrate your
answer with a diagram.

We can think of \(e_i = y_i - \hat{y}_i\) as an estimate of the true random disturbance associated with observation
i, \(\varepsilon_i = y_i - \beta_0 - \beta_1 x_i\). (In the above diagram, we are just imagining what the true model might look like;
we never could draw it since we never know it!)
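
The distinction can also be seen in a simulation, where (unlike in practice) the true model is known because we generate the data ourselves. The "true" parameters, the income range and the sample size below are hypothetical values chosen only for illustration.

```python
import random

random.seed(1)

# Hypothetical "true" parameters; with real data these are never observed.
BETA0, BETA1 = -20.0, 0.80

income = [random.uniform(200, 1500) for _ in range(8)]
disturbances = [random.gauss(0, 40) for _ in income]   # epsilon_i (unobservable)
consumption = [BETA0 + BETA1 * x + eps for x, eps in zip(income, disturbances)]

# Least squares fit to the simulated sample.
n = len(income)
x_bar = sum(income) / n
y_bar = sum(consumption) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, consumption))
s_xx = sum((x - x_bar) ** 2 for x in income)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals: deviations from the fitted line, not from the true line.
residuals = [y - (b0 + b1 * x) for x, y in zip(income, consumption)]  # e_i

for eps, e in zip(disturbances, residuals):
    print(f"epsilon_i = {eps:8.2f}   e_i = {e:8.2f}")
print(f"sum of residuals     = {sum(residuals):.6f}")
print(f"sum of disturbances  = {sum(disturbances):.6f}")
```

The residuals sum to zero (up to rounding) by construction of the least squares line, whereas the disturbances need not; each residual is close to, but not equal to, the corresponding disturbance because b0 and b1 are only estimates of the true parameters.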

9. Work through problem 16 on pages 529-530 of Sharpe (Chapter 15).
