
FIRST ARTS AND FIRST SCIENCE EXAMINATION

MATHEMATICS AND STATISTICS

MT1007: Statistics in Practice

May 2008

Time allowed: Two Hours

Attempt All Questions

Relevant formulae can be found on pages 16-18

For the multiple choice questions, it is sufficient to write down the appropriate letter against the question number; e.g., if you think answer (a) is correct for question 1, simply write 1 (a). Also, where calculations have been done, show workings to allow for possible partial marks.

MT1007

[See over

Use this information to answer questions 1-5

It is often suggested that household dust mites cause asthma. Impermeable mattress covers are thought to reduce household dust and a double-blind, randomized, placebo-controlled study (involving 1122 adults with asthma) was carried out to investigate the effect of impermeable mattress covers on asthma¹. Patients in the control group used permeable mattress covers, and those in the treatment group used impermeable covers. Patients were assessed at 6 months into the study, in two ways: (1) by monitoring the efficiency of their breathing (peak expiratory flow rate measured in cubic metres per second) and (2) by noting if they used steroid medication to control their asthma. There was no significant difference between the groups in the peak expiratory flow rate at six months; however, 17.4% of the treatment group and 17.1% of the control group were able to stop taking their asthma medication.

1. Explain the term placebo. What type of placebo was used in this study? [2]

Placebo = a false treatment, such that those in the control group and the treatment group do not know which group they are in, e.g. a chalk pill instead of an active medicine. In this study, the placebo involved mattress covers that were permeable to dust.

2. What does double-blind mean, and why was it appropriate to use double-blinding in this case? [3]

Double blind = doctors also do not know which treatment is real and which is fake, i.e. they do not know if a patient is in the treatment or control group. It is appropriate to use this because doctors may give hidden signals to patients if they know which ones are receiving treatment.

3. Explain the term randomised, and suggest an appropriate randomisation scheme for this study. [2]

Any patient who volunteers for the study has an equal chance of being allocated either a treatment or control mattress. Number the volunteers at study entry and pick numbers at random.

4. Which of the following statements is true? [2]

(a) Peak expiratory flow rate is a qualitative variable
(b) Use of steroid medication was treated as an explanatory variable
(c) Impermeable mattress covers were used to reduce the exposure of patients in the treatment group to household dust
(d) We can be sure that those people who participated in the study were representative of the UK population in general
(e) This was a prospective, observational study.

C

¹ Woodcock et al. 2003; NEJM Volume 349:225-236

5. Which of the following statements is true? [2]

(a) 17.4% is a parameter
(b) There were 1122 patients in the treatment group
(c) 6 months is a parameter
(d) We have evidence that the study was replicated
(e) 17.1% is a statistic

E
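The scheme suggested in the answer to question 3 (number the volunteers, then pick numbers at random) could be sketched in R roughly as follows. This is only an illustration: the equal split between the two groups, the seed and the variable names are assumptions made here, not details reported for the study.

```r
# Hypothetical sketch of a simple randomisation scheme for 1122 volunteers.
set.seed(2008)          # arbitrary seed so the allocation can be reproduced
n_patients <- 1122      # number of adults in the study (from the question)

# Number the volunteers 1..n and pick half of the numbers at random
# (an equal split is an assumption made for this illustration).
treatment_ids <- sample(seq_len(n_patients), size = n_patients / 2)

# Volunteers whose number was picked get the impermeable (treatment) cover,
# everyone else gets the permeable (control) cover.
group <- ifelse(seq_len(n_patients) %in% treatment_ids, "treatment", "control")
print(table(group))
```

Because the numbers are drawn by `sample()` rather than chosen by staff or patients, each volunteer has the same chance of ending up in either group.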

6. State three assumptions involved when using the Binomial distribution. [3]

(1) random trials are independent (2) the probability of an event is constant (3) there are only 2 possible outcomes of a random trial (4) there is a fixed number of trials

7. In the UK, approximately 1 in 8 children on average suffer from asthma. Assuming p = 1/8, use the Binomial distribution to calculate the probability that if 3 children are picked at random from the UK population:

(a) none of them have asthma [1]
(b) at least one of them has asthma [1]
(c) only one of them has asthma [2]

Use the following to help you:

> pbinom(0,3,1/8)
[1] 0.6699219
> pbinom(1,3,1/8)
[1] 0.9570312
> pbinom(2,3,1/8)
[1] 0.9980469
> pbinom(3,3,1/8)
[1] 1

(a) Pr(X = 0) = 0.6699219
(b) Pr(X ≥ 1) = 1 - Pr(X = 0) = 1 - 0.6699219 = 0.3300781
(c) Pr(X = 1) = Pr(X ≤ 1) - Pr(X = 0) = 0.9570312 - 0.6699219 = 0.2871093
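The three answers to question 7 can be checked directly in R. The sketch below uses the base R functions `dbinom()` and `pbinom()`; the variable names are chosen here for illustration only.

```r
# Question 7: X ~ Binomial(n = 3, p = 1/8), the number of children with asthma.
n <- 3
p <- 1 / 8

p_none         <- dbinom(0, size = n, prob = p)   # (a) Pr(X = 0)
p_at_least_one <- 1 - p_none                      # (b) Pr(X >= 1)
# (c) Pr(X = 1) as a difference of cumulative probabilities
p_exactly_one  <- pbinom(1, size = n, prob = p) - pbinom(0, size = n, prob = p)

print(c(none = p_none, at_least_one = p_at_least_one, exactly_one = p_exactly_one))
```

Part (c) could equally be computed as `dbinom(1, 3, 1/8)`; the cumulative form is used because it mirrors the `pbinom()` output supplied with the question.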

8. A family has three children. Can we estimate the probability that at least one child in this family has asthma using this method? Justify your answer. [2]

It would not be legitimate to do this, because the independence assumption is violated due to a genetic relationship between children; i.e. if 1 child has asthma then the others are more likely to have asthma compared with randomly chosen children.

9. Choose the best answer from the following. A measurement is said to be precise if [2]

(a) the measurement is close to the underlying parameter
(b) if the measurement is repeated, the new measurement is likely to be similar to the previous measurement
(c) the measurement is unbiased
(d) there is large sampling error
(e) there is no systematic error

B

10. Suppose that a randomly chosen patient in hospital has their temperature measured repeatedly over several days. Assume that the temperature of a patient is Normally distributed, with μ = 36 Celsius and σ = 3 Celsius. Which of the following statements is true? [2]

(a) The variance of the data will be about 1.73
(b) If a patient's temperature is measured with error, we anticipate the mean of the data will be somewhat greater than 36 because we have measurement error in addition to the underlying variation in temperature
(c) If a patient's temperature is measured with error, we anticipate the standard deviation of the data will be somewhat greater than 3 because we have measurement error in addition to the underlying variation in temperature
(d) If a data point is picked at random, the probability that the temperature exceeds 39 Celsius is approximately 0.66
(e) If temperature measurements are biased, the standard deviation of the measurements will be greater than 3

C

11. Temperatures in Celsius (c) can be converted to temperatures in Fahrenheit (f) using the following formula: f = 1.8c + 32. If temperature measurements are converted to the Fahrenheit scale, what is the new mean and standard deviation of the temperature distribution described in Question 10? [2]

E(aX + b) = aE(X) + b = 1.8 × 36 + 32 = 96.8, sd(aX + b) = |a|sd(X) = 1.8 × 3 = 5.4

Suppose that approximately 0.3% of new UK residents who appear healthy on entering the UK have been exposed to Tuberculosis (TB), which also guarantees a positive result in a test for the disease. Assume that only exposed people test positive. Of those apparently healthy new UK residents who test positive, it is expected that about 1/4 will develop TB.

12. What is the probability that a new resident who appears healthy (picked at random) would get a negative test result? [1]

1 - 0.003 = 0.997

13. What is the probability that from 20 apparently healthy new residents (picked at random), none would show a positive result? [1]

0.997^20 = 0.942

14. What is the probability that an apparently healthy new resident (picked at random) was exposed to TB and developed the disease? Is this a joint, marginal or conditional probability? [2]

0.003 × 0.25 = 0.00075, a joint probability
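As a check on the arithmetic in questions 11 to 14, the calculations can be written out in R. This is a small sketch; the variable names are illustrative, and question 13 treats the 20 residents as independent, which is what picking them at random is taken to imply.

```r
# Question 11: change of units f = 1.8c + 32 applied to a mean and an sd.
mean_c <- 36
sd_c   <- 3
mean_f <- 1.8 * mean_c + 32      # E(aX + b) = aE(X) + b
sd_f   <- abs(1.8) * sd_c        # sd(aX + b) = |a| sd(X); the shift b drops out

# Questions 12-14: TB exposure among apparently healthy new residents.
p_exposed                <- 0.003   # exposed, hence test positive
p_develop_given_positive <- 0.25    # develop TB, given a positive test

p_negative             <- 1 - p_exposed                          # Q12
p_all_20_negative      <- p_negative^20                          # Q13 (independence assumed)
p_exposed_and_develops <- p_exposed * p_develop_given_positive   # Q14, joint probability

print(c(mean_f = mean_f, sd_f = sd_f))
print(round(c(Q12 = p_negative, Q13 = p_all_20_negative, Q14 = p_exposed_and_develops), 5))
```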


Use this information to answer questions 15-22

Respondents to the Public Attitudes to the Environment in Scotland survey were asked about 23 environmental issues. 4018 people completed this survey in 1991 and 4119 people completed this survey in 2002. We can assume that the same people were not surveyed in both years. The survey results are shown in Table 1.

Issue                                Very worried   Very worried   Quite worried
                                     1991           2002           2002
Raw sewage put into the sea          2330           2018           1442
Nuclear waste                        2049           1934           1359
Damage to the ozone layer            1929           1400           1730
Pollution of rivers, lochs and seas  2290           1236           1854
Protection of wildlife               1005           1153           1854

Table 1: Summary information from the Public Attitudes to the Environment in Scotland survey in 1991 and 2002.

15. What proportion of survey respondents in 1991 were very worried about nuclear waste? [1]

p̂ = 2049/4018 = 0.5099552

16. What is the standard error of the estimate for those very worried about nuclear waste in 1991? [2]

se(p̂) = sqrt(p̂(1 - p̂)/n) = sqrt(0.5099552(1 - 0.5099552)/4018) = 0.007886403

17. Build and interpret a 95% confidence interval for the proportion of people in the population who were very worried about nuclear waste in 1991. [4]

p̂ ± 1.96 × se(p̂) = 0.5099552 ± 1.96 × 0.007886388 = (0.4944979, 0.5254125)

We can be 95% confident that between 49.4% and 52.5% of the Scottish population were very worried about nuclear waste in 1991.

18. Approximately 47% of people surveyed were very worried about nuclear waste in 2002. Based on your answer to question 17, do the Scottish population appear to feel more concerned, less concerned or the same about nuclear waste in 2002 compared with 1991? Justify your answer. [2]

The population appear less concerned in 2002 than they did in 1991. The sample proportion for those very worried about nuclear waste in 2002 falls outside (below) the 95% CI for those very worried about nuclear waste in 1991.
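The proportion, standard error and confidence interval from questions 15 to 17, and the comparison made in question 18, can be reproduced with a few lines of R. This is a sketch using the counts from Table 1; the variable names are illustrative.

```r
# Questions 15-17: proportion very worried about nuclear waste in 1991.
n_1991 <- 4018
p_hat  <- 2049 / n_1991                        # sample proportion (Q15)
se_p   <- sqrt(p_hat * (1 - p_hat) / n_1991)   # standard error (Q16)
ci     <- p_hat + c(-1, 1) * 1.96 * se_p       # 95% confidence interval (Q17)

print(round(c(p_hat = p_hat, se = se_p, lower = ci[1], upper = ci[2]), 7))

# Question 18: the 2002 sample proportion (about 47%) lies below the interval.
p_hat_2002    <- 0.47
below_1991_ci <- p_hat_2002 < ci[1]
print(below_1991_ci)
```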

19. Carry out a hypothesis test for no difference between the proportion of those very worried about nuclear waste in 1991 and the corresponding proportion in 2002.

(a) What is the null hypothesis? [1]

H0: p_1991 = p_2002

(b) What is the alternative hypothesis? [1]

H1: p_1991 ≠ p_2002

(c) What is the data-estimate for this test? [1]

p̂_1991 = 2049/4018 = 0.5099552, p̂_2002 = 1934/4119 = 0.4695314, p̂_1991 - p̂_2002 = 0.0404238

(d) How many standard errors is the data-estimate from zero? [4]

se(p̂_1991 - p̂_2002) = sqrt(p̂_1991(1 - p̂_1991)/n_1991 + p̂_2002(1 - p̂_2002)/n_2002)
= sqrt(0.5099552(1 - 0.5099552)/4018 + 0.4695314(1 - 0.4695314)/4119)
= 0.01107539

The estimate is 0.04042380/0.01107539 = 3.649876 standard errors from zero.
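The working for parts (c) and (d) can be reproduced in R as sketched below. The last step converts the test statistic into a two-sided p-value using the standard Normal distribution; that reference distribution is an assumption made here (it corresponds to treating the degrees of freedom as infinite for proportions), and the variable names are illustrative.

```r
# Question 19: difference between two independent sample proportions (Situation A).
n_1991 <- 4018
n_2002 <- 4119
p_1991 <- 2049 / n_1991
p_2002 <- 1934 / n_2002

estimate <- p_1991 - p_2002
se_diff  <- sqrt(p_1991 * (1 - p_1991) / n_1991 +
                 p_2002 * (1 - p_2002) / n_2002)
z0       <- (estimate - 0) / se_diff      # standard errors from zero

# Two-sided p-value from the standard Normal (assumed reference distribution).
p_value  <- 2 * pnorm(-abs(z0))

print(signif(c(estimate = estimate, se = se_diff, z0 = z0, p_value = p_value), 6))
```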

(e) The p-value for this test is 0.000262. Interpret this p-value. [1]

We have very strong evidence for a difference in the proportion of Scottish people who were very worried about nuclear waste in 1991 and 2002; the proportion appears to be smaller in 2002.

20. If we were to use the survey results in Table 1 to compare the proportion of the Scottish population who were very worried about nuclear waste in 2002 with the proportion who were quite worried about nuclear waste in 2002, which of the standard error formulae (Situations A-C) would be appropriate? Justify your answer. [2]

Situation B: One sample, many response categories. We have one sample (n_2002 = 4119) and they had to choose either very worried or quite worried.

21. If we were to use the survey results in Table 1 to compare the proportion of the Scottish population who were very worried about damage to the ozone layer in 2002 with those who were very worried about the protection of wildlife in 2002, which of the standard error formulae (Situations A-C) would be appropriate? Justify your answer. [2]

Situation C: One sample, many yes/no items. We have one sample (n_2002 = 4119) and they could choose both very worried about damage to the ozone layer and very worried about the protection of wildlife and so could contribute to both sample proportions.

22. If we were to use the survey results in Table 1 to compare the proportion of the Scottish population who were very worried about pollution to rivers, lochs and seas in 2002 with those who were quite worried about pollution to rivers, lochs and seas in 1991, which of the standard error formulae (Situations A-C) would be appropriate? Justify your answer. [2]

Situation A: Two independent samples; n_2002 and n_1991, and we are assuming independence between years.

Greenhouse gases in the atmosphere help to retain infra-red radiation, resulting in warming of the lower atmosphere and earth surface. The emission of greenhouse gases is often claimed to be falling and Table 2, page 9 contains annual measurements of greenhouse gases. An analysis based on Carbon Dioxide (CO2) values from Table 2 is also shown on page 9.

Questions 23-27 refer to the output shown on page 9

23. What are the null and alternative hypotheses? [2]

H0: μ_1990 = μ_1995 = ... = μ_2004
H1: at least two of the annual means differ.

24. Which of the following about this analysis is true? [2]

(a) If the variability within groups (s²_W) is large, compared with the variability between groups (s²_B), then the corresponding p-value will be small
(b) If the means for each group were larger but the standard deviations for each group were kept the same, the variability within groups (s²_W) would increase
(c) If f0 is large then the associated p-value is also large
(d) If the variability within groups (s²_W) is small relative to the variability between groups (s²_B) then f0 will be large
(e) If the total number of observations increased (n_tot) but the number of groups kept the same then df1 and df2 would increase

D

25. Show how f0 = 25.265 is calculated. [2]

f0 = s²_B / s²_W = 545.7/21.6 = 25.26
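The ratio in question 25 can be checked in R using the mean squares printed in the ANOVA table on page 9. This is a sketch only; because the printed mean squares are rounded, the ratio comes out as 25.26 rather than exactly the 25.265 reported by R. The variable names are illustrative.

```r
# Question 25: F statistic as the ratio of between- to within-group mean squares.
ms_between <- 545.7    # Mean Sq for year, as printed (s^2_B)
ms_within  <- 21.6     # Mean Sq for Residuals, as printed (s^2_W)
f0 <- ms_between / ms_within

# Numerator degrees of freedom: k - 1, with k = 6 years in Table 2.
k   <- 6
df1 <- k - 1

print(round(f0, 2))
print(df1)
```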

Greenhouse Gas    1990   1995   2000   2002   2003   2004
Carbon Dioxide    13.6   13.2   13.6   12.4   12.4   11.7
Methane            2.2    2.1    1.8    1.5    1.4    1.4
Nitrous Oxide      1.7    1.5    1.4    1.4    1.4    1.4
n                  200    200    200    200    200    200

Table 2: Summary information from the 2004 AEA Energy and Environment survey.

> summary(aov(co2 ~ year))
              Df  Sum Sq Mean Sq F value    Pr(>F)
year           5  2728.5   545.7  25.265 5.756e-07 ***
Residuals   1193 25875.7    21.6
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> TukeyHSD(aov(co2 ~ year))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = co2 ~ year)

$year
                diff        lwr         upr     p adj
1995-1990 -0.1019529 -1.4159732  1.21206750 0.9999272
2000-1990  0.4117198 -0.9023005  1.72574020 0.9479077
2002-1990 -0.8831083 -2.1971287  0.43091204 0.3913132
2003-1990 -1.8335032 -3.1475236 -0.51948286 0.0010176
2004-1990 -2.1863168 -3.5003371 -0.87229641 0.0000337
2000-1995  0.5136727 -0.8003477  1.82769306 0.8749459
2002-1995 -0.7811555 -2.0951758  0.53286490 0.5341083
2003-1995 -1.7315504 -3.0455707 -0.41753000 0.0024393
2004-1995 -2.0843639 -3.3983843 -0.77034355 0.0000957
2002-2000 -1.2948282 -2.6088485  0.01919221 0.0561389
2003-2000 -2.2452231 -3.5592434 -0.93120270 0.0000180
2004-2000 -2.5980366 -3.9120570 -1.28401624 0.0000003
2003-2002 -0.9503949 -2.2644153  0.36362546 0.3068013
2004-2002 -1.3032084 -2.6172288  0.01081191 0.0533847
2004-2003 -0.3528135 -1.6668339  0.96120682 0.9730562


26. Based on the p-value, which of the following can we conclude? [2]

(a) We have very strong evidence that at least two annual means differ from each other
(b) We have very strong evidence that all annual means differ from each other
(c) We have very strong evidence that the standard deviations of the groups differ from each other
(d) We have very strong evidence that all annual means are the same
(e) We have very strong evidence that the standard deviations of the groups are the same

A

27. Interpret the Tukey's Honest Significant Differences for CO2 levels in 2003. [5]

We have very strong evidence that 2003 levels are significantly lower on average than levels in 1990, 1995 and 2000 but no evidence that levels in 2003 were different from those in either 2002 or 2004.

The reduction of household waste is a key sustainable development objective and the Scottish Household Survey provides information on recycling behaviour for the Scottish population. The number of people who said they had recycled various items in 2003-2006 are shown in Table 3 and it is of interest to know if the pattern of recycling behaviour across years (2003-2006) differs by the type of item recycled (e.g. Glass bottles).

Item/Year                          2003     2004     2005     2006     Total
Glass bottles                      4888     5763     7035     8088     25,774
Newspaper/Magazines/Paper/Card     6284     7832     9708     10784    34,608
Plastic                            1815     2808     5065     6669     16,357
Metal Cans                         1955     2955     5206     6811     16,927
Total                              14,942   19,358   27,014   32,352   93,666
Sample size (n)                    13,965   14,777   14,070   14,190

Table 3: Summary information from several years (2003-2006) of the Scottish Household Survey.

Use this information to answer questions 28-34

28. Why is a Chi-square test for Homogeneity, rather than a Chi-square test for Independence, appropriate for these data? [2]

We have several samples, one for each year, classified by one factor (item recycled), and not one sample cross-classified by two factors (which would be consistent with a test for independence).

29. What are the null and alternative hypotheses for this test? [2]

H0: The pattern across years is the same for all recycled items.
H1: The pattern across years differs for at least two of the recycled items.

30. For a Chi-square test of Homogeneity, what are the expected counts for Plastic in 2003? [2]

E = (Row total × Column total)/Grand total = (16357 × 14942)/93666 = 2609.338
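The expected count above can be reproduced for every cell of Table 3 at once. The sketch below enters the observed counts as a matrix (the shortened row label "Paper" and the object names are choices made here for illustration) and applies the row total × column total / grand total rule; it also computes the (O - E)²/E contribution of the Plastic, 2003 cell, which the next question asks about.

```r
# Observed counts from Table 3 (rows = items, columns = years).
counts <- matrix(c(4888, 5763, 7035,  8088,
                   6284, 7832, 9708, 10784,
                   1815, 2808, 5065,  6669,
                   1955, 2955, 5206,  6811),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(item = c("Glass bottles", "Paper", "Plastic", "Metal cans"),
                                 year = c("2003", "2004", "2005", "2006")))

# Expected count in cell (i, j) = row total i * column total j / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

e_plastic_2003 <- expected["Plastic", "2003"]
o_plastic_2003 <- counts["Plastic", "2003"]
contribution   <- (o_plastic_2003 - e_plastic_2003)^2 / e_plastic_2003

print(round(c(expected = e_plastic_2003, contribution = contribution), 3))
```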

31. For a Chi-square test of Homogeneity, what is the Chi-square contribution for Plastic in 2003? [2]

(O - E)²/E = (1815 - 2609.338)²/2609.338 = 241.8134

32. The test statistic was χ² = 1589.124 and the chi-square cell contributions ranged from 7.47 to 241.81. Based on this information and your answer to question 31, which of the following is true? [2]

(a) The chi-square contribution for Plastic in 2003 is relatively large because the expected counts under H0 are larger than those observed
(b) The chi-square contribution for Plastic in 2003 is relatively small because the expected counts under H1 are larger than those observed
(c) The chi-square contribution for Plastic in 2003 is relatively large because the expected counts under H1 are smaller than those observed
(d) The chi-square contribution for Plastic in 2003 is relatively small because the expected counts under H0 are similar to those observed
(e) The chi-square contribution for Plastic in 2003 is relatively large because the expected counts under H1 are larger than those observed

A

33. What are the degrees of freedom for this test? [2]

(number of columns - 1) × (number of rows - 1) = (4 - 1) × (4 - 1) = 9

34. The p-value for this test is p < 0.00001. What is the best interpretation of this result? [2]

(a) The annual sample sizes are too small for us to draw valid conclusions based on this analysis
(b) The p-value is too small for us to draw any conclusions about yearly recycling patterns
(c) We have very strong evidence that yearly recycling patterns differ for all recycled items
(d) We have very strong evidence that yearly recycling patterns differ for some of the recycled items
(e) We have very strong evidence that yearly recycling patterns are the same for some of the recycled items

D

Current climate trends are unlikely to be entirely natural in origin and there is widespread belief that human activities are having a discernible impact on the global climate². Annual mean UK temperatures from 1914 to 2006 are shown in Figure 1, page 12. These temperatures are represented as the difference from a baseline temperature calculated using average temperatures from 1961-1990. An analysis of this data is also shown on page 13.

[Figure 1 plot: difference from the 1961-1990 baseline temperature (vertical axis) against Year (horizontal axis, ticks from 1920 to 2000).]

Figure 1: Annual mean temperature for the UK across time with a fitted linear model overlaid (dotted line).

² MET Office; IPCC Third Assessment: Climate Change 2001.


> summary(lm(temp ~ year))

Call:
lm(formula = temp ~ year)

Residuals:
     Min       1Q   Median       3Q      Max
-0.27988 -0.19950 -0.01773  0.15980  0.40582

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.926121   5.461353   *****   0.0257 *
year          0.007715   0.002793   *****   0.0246 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2537 on 8 degrees of freedom
Multiple R-Squared: 0.4881, Adjusted R-squared: 0.4242
F-statistic: 7.629 on 1 and 8 DF, p-value: 0.02460

Use this information to answer questions 35-44

35. What is the estimate for β1? [1]

0.007715

36. The observed value for the year 2000 is 0.91. Based on this model, what is the residual for the year 2000? [3]

Predicted value = -14.926121 + 0.007715 × 2000 = 0.503879; residual = observed - predicted = 0.91 - 0.503879 = 0.406121.

37. Using Figure 1 (page 12), is the model under-estimating or over-estimating temperature in 1960? Justify your answer. [2]

Overestimating; the fitted line shows predictions are higher than the observed value for 1960.

38. Find the test statistic for the slope parameter. [2]

(estimate - 0)/SE = 0.007715/0.002793 = 2.762
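The calculations in questions 36 and 38 can be checked in R from the printed coefficient table alone. The sketch below also converts the test statistic into a two-sided p-value using a t-distribution with the 8 residual degrees of freedom reported in the output, which should land close to the 0.0246 printed for the year coefficient. The variable names are illustrative.

```r
# Coefficient estimates and standard error taken from the printed lm() summary.
b0    <- -14.926121   # intercept estimate
b1    <- 0.007715     # slope estimate for year
se_b1 <- 0.002793     # standard error of the slope

# Question 36: fitted value and residual for the year 2000.
observed_2000  <- 0.91
predicted_2000 <- b0 + b1 * 2000
residual_2000  <- observed_2000 - predicted_2000

# Question 38: test statistic for H0: slope = 0, with its two-sided p-value.
t0      <- (b1 - 0) / se_b1
p_value <- 2 * pt(-abs(t0), df = 8)

print(round(c(predicted = predicted_2000, residual = residual_2000,
              t0 = t0, p_value = p_value), 4))
```

Squaring the test statistic gives approximately 7.63, in line with the F-statistic in the last line of the output, as expected for a regression with a single explanatory variable.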

39. Which of the following about the year coefficient is true? [2]

(a) The null hypothesis is that there is a linear relationship between year and temp
(b) When the null hypothesis is true, we expect to see slope coefficients at least two standard errors from zero
(c) The test statistic has a t-distribution with df = n - 1
(d) A test statistic of 2.76 is typically obtained under the null hypothesis
(e) If the standard error for this estimate was larger, then the associated p-value would also be larger

E

40. Given this model and these data, is there a basis for predictions in 2008? Justify your answer. [2]

No, there isn't. This would involve predicting far into the future.

41. Does the average annual temperature appear to be increasing? Justify your answer. [2]

Yes, the slope coefficient is positive and we have some-to-strong evidence for a nonzero linear relationship between year and temp.

42. Which of the following about linear regression is false? [2]

(a) We can check the normal errors assumption using quantile-quantile plots of model residuals
(b) We can check for linearity by plotting x versus y and/or using residual plots
(c) We can check the non-constant error variance assumption using histograms of residuals
(d) We can check the independence assumption by plotting the residuals in serial order
(e) We can check the normal errors assumption using a Shapiro-Wilk test for Normality with H0: errors are normal

C

43. Interpret the slope coefficient. [2]

Under the model, we expect a 0.007715 increase in temperature (relative to the baseline) for an increase of one year.

44. Which of the following about Figure 1 (page 12) is true? [2]

(a) A linear model best describes the relationship between year and temp because the p-value is small
(b) The fitted model has an estimated intercept of approximately -0.2
(c) Removing the year 2000 data point would result in a model with a shallower slope coefficient
(d) Plotting the residuals in serial order would not be appropriate in this case
(e) The interpretation of the intercept coefficient is meaningful for these data

C


Formulae Sheet

Median: position = (n + 1)/2

Distributions:

In general: sd(X) = sqrt(E[(X - μ_X)²])

If X is a discrete random variable:
μ_X = E(X) = Σ x_i pr(X = x_i)
sd(X) = sqrt(Σ (x_i - μ_X)² pr(X = x_i))

X ~ Binomial(n, p):   E(X) = np,   sd(X) = sqrt(np(1 - p))
X ~ Poisson(λ):       E(X) = λ,    sd(X) = sqrt(λ)
X ~ Normal(μ, σ):     E(X) = μ,    sd(X) = σ

Combining random variables:

For any constants a and b:
E(aX + b) = aE(X) + b
sd(aX + b) = |a|sd(X)

If X1 and X2 are independent random variables:
E(a1 X1 + a2 X2) = a1 E(X1) + a2 E(X2)
sd(a1 X1 + a2 X2) = sqrt(a1² sd(X1)² + a2² sd(X2)²)

If X1, X2, ..., Xn is a random sample from a distribution with mean μ and standard deviation σ:
E(X1 + X2 + ... + Xn) = nμ
sd(X1 + X2 + ... + Xn) = sqrt(n) σ

Sampling distributions:
E(X̄) = μ,   sd(X̄) = σ/sqrt(n)
E(P̂) = p,   sd(P̂) = sqrt(p(1 - p)/n)

Standard error of a difference (independent estimates):
se(θ̂1 + θ̂2) = sqrt(se(θ̂1)² + se(θ̂2)²)



Confidence interval and t-tests:

Confidence interval: estimate ± t standard errors

t-test statistic:
t0 = (estimate - hypothesised value)/standard error
t0 = (θ̂ - θ0)/se(θ̂)

Applications:

Mean μ: θ̂ = x̄,   se(x̄) = s_x/sqrt(n),   df = n - 1
Proportion p: θ̂ = p̂,   se(p̂) = sqrt(p̂(1 - p̂)/n),   df = ∞

Difference between two means (two independent samples):
μ1 - μ2: θ̂ = x̄1 - x̄2
se(x̄1 - x̄2) = sqrt(se(x̄1)² + se(x̄2)²)
se(x̄1 - x̄2) = sqrt(s1²/n1 + s2²/n2)
df = Min(n1 - 1, n2 - 1)

Difference between two proportions:

Situation A, Independent samples:
p1 - p2: θ̂ = p̂1 - p̂2
se(p̂1 - p̂2) = sqrt(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2),   df = ∞

Situation B, One sample of size n, several response categories:
p1 - p2: θ̂ = p̂1 - p̂2
se(p̂1 - p̂2) = sqrt((p̂1 + p̂2 - (p̂1 - p̂2)²)/n),   df = ∞

Situation C, One sample of size n, many yes/no items:
p1 - p2: θ̂ = p̂1 - p̂2
se(p̂1 - p̂2) = sqrt((Min(p̂1 + p̂2, q̂1 + q̂2) - (p̂1 - p̂2)²)/n),   df = ∞



The F-test (ANOVA)

f0 = s²_B / s²_W,   df1 = k - 1,   df2 = n_tot - k

The Chi-square test

χ²0 = Σ over all cells in the table of (observed - expected)²/expected

df = J - 1 (one way tables)
df = (J - 1)(I - 1) (two way tables)

Expected cell count in cell (i, j) = R_i C_j / n

Regression:

Fitted least squares regression line: ŷ = β̂0 + β̂1 x

Inference about slope, β1, and intercept, β0: df = n - 2.
