
QUESTION - A

1. Study the Lungcap data set and answer the following questions.

i) Construct a two-way table for gender and smoking habit.


ii) Given that one randomly selected person is a smoker, what is the probability that the
person is a female?
iii) Are gender and smoking habit independent?

2. Suppose it is given that 20% of the male smokers and 15% of the female smokers were born
caesarean. With the help of the data, verify the above statements. Give enough reasons for your
answers.
3. Plot the histogram of the distribution of Lungcap amongst smokers.
4. Plot the histogram of the distribution of Height amongst smokers.
5. Are height and Lungcap independent?
6. Is the variation of Lungcap equal for male smokers and female smokers?
7. Is the average Lungcap of smokers and non-smokers equal?
8. Plot the histogram of the age amongst smokers.
9. What percentage of people below 16 years smoke?
10. What percentage of people above 17 years smoke?
11. Test if smoking habit and age are dependent.
12. Test if smoking habit and Lungcap are dependent.
13. Fit a suitable distribution to height and also to Lungcap. Test the goodness of fit.

QUESTION – B

Study the car data set and answer the following questions.

1. Find the average and variance of price and mileage separately. Comment on the results. How will
you interpret the result statistically?
2. Test if the mean mileage of different car manufacturers within some price range are equal.
Clearly specify all the assumptions and the null and alternative hypotheses.
3. Find a 90% confidence price range for the Chevrolet cars.
4. Find a 90% confidence interval for the variance of prices for Pontiac cars.
5. Calculate the correlation coefficient between mileage and Liter for each company.
6. Comment on the results.
7. Suppose a car has a Liter of 3.8. How sure will you be that its mileage is more than 20,000?
8. Is there any correlation between prices and mileage?

QUESTION – C

1. Let 𝑌̅ be the mean of a random sample of size 𝑛1 from 𝑁(𝜇, 𝜎² = 10). Find 𝑛1 such that the
probability that the random interval (𝑌̅ − 1/2, 𝑌̅ + 1/2) includes 𝜇 is approximately 0.954.
2. Let 𝑍̅ be the mean of a random sample of size 𝑛2 from 𝑁(𝜇, 𝜎² = 9). Find 𝑛2 such that the
probability that the random interval (𝑍̅ − 1, 𝑍̅ + 1) includes 𝜇 is approximately 0.90.

3. Draw 200 random samples each of size 𝑛1 (found above) from a normal distribution with mean 5
and variance 3.
4. Write down the distribution of the sample mean. Test using the data obtained in Q3 above, if the
sample means follow that distribution.
5. Draw 200 random samples each of size 25 from a normal distribution with mean 7 and variance
3.
6. Compute 95% confidence interval for the difference of means from each of the 200 samples.
Draw a graph to show all 200 confidence intervals and comment.

QUESTION – D

1. Collect stock prices for 5 companies from 1st Jan 2016 to 30th June 2016.
2. Plot the histogram of the returns for each company. Describe the histograms.
3. Test whether the average returns for 5 companies are equal. State clearly the assumptions
required, null and alternative hypotheses.
4. Test whether the average returns for each pair of companies are equal.
5. Comment on the results.

QUESTION – E

1. The income distribution of a very large population is exponential with average income ₹ 40, 000
per annum. Draw 500 samples (from the income distribution) of size 100 each. Sketch the
distribution of sample average income. Comment.
2. The age distribution of a very large population is given below:

Age Group (years)   15-18   18-21   21-23   23-25   25-27   27-29   29-31   31-33   33-35
Proportion           0.1     0.1     0.1     0.1     0.2     0.1     0.1     0.1     0.1

Draw 100 samples (from the age distribution) of size 50 each. Sketch the distribution of sample
average age. Comment.

Section-A
Q.1.i.

GENDER     SMOKING
           SMOKES   NEVER SMOKED   TOTAL
MALE         33         334         367
FEMALE       44         314         358
TOTAL        77         648         725

Q.1.ii Given that one randomly selected person is a smoker, probability that the person is female:
P(Female | Smoker) = (No. of female smokers) / (Total no. of smokers)
= 44/77
= 0.571

Q.1.iii. Let
H0: Null hypothesis that gender and smoking habit are independent.
HA: Alternative hypothesis that gender and smoking habit are dependent.
Reject H0 if the p-value of χ²cal is less than 5%, i.e. if χ²cal > χ²(1, 0.05).

Observed Frequencies   Expected Frequencies   Difference    Sq. Diff./Exp. Freq.
F     Value Given      E     Value Expected   (Fij - Eij)   (Fij - Eij)^2/Eij
F11        33          E11        38.98          -5.98          0.917
F12       334          E12       328.02           5.98          0.109
F21        44          E21        38.02           5.98          0.940
F22       314          E22       319.98          -5.98          0.112

Degrees of freedom = (2-1)(2-1) = 1
χ²cal = 2.077
χ²(1, 0.05) = 3.841

Since χ²cal < χ²(1, 0.05), the p-value for χ²cal (> 10%) is more than 5%. Hence there is not sufficient
evidence to reject H0, and the data are consistent with gender and smoking habit being independent.
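
The chi-square computation above can be reproduced with a short sketch. It uses only the Python standard library; the function name `chi_square_2x2` is illustrative and not part of the original analysis, and no continuity correction is applied, matching the hand calculation.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: male, female; columns: smokes, never smoked (counts from Q.1.i).
stat, p_value = chi_square_2x2([[33, 334], [44, 314]])
print(round(stat, 3), round(p_value, 3))
```

The same function can be applied to any of the other 2x2 tables in this section by passing their observed counts.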

Q.2 Given: 20% (= πm) of male smokers and 15% (= πf) of female smokers were born caesarean.
a) As per the sample,
No. of male smokers = 33, No. of male smokers born caesarean = 10
Proportion of male smokers born caesarean, Pm = 10/33 = 30.3%
Sample size, Nm = 33
Since the sample size is > 30, as per the CLT, Pm ~ N(πm, Sm) approximately.
Standard deviation, Sm = sqrt(Pm x (1 - Pm)/Nm) = 0.08
Zc = (Pm - πm)/Sm = (30.30% - 20%)/0.08 = 1.29
Z+cr = Z0.975 = 1.96; Z-cr = Z0.025 = -1.96
Hypothesis Statement:
H0: Proportion of male smokers born caesarean, πm = 20%
HA: Proportion of male smokers born caesarean, πm ≠ 20%
Rejection Rule:
Reject H0 if Zc > Z+cr or Zc < Z-cr
Since Z-cr (= -1.96) < Zc (= 1.29) < Z+cr (= 1.96), there is not enough evidence to reject H0.
Hence the sample is consistent with the statement that 20% of the male smokers were born caesarean.

b) As per the sample,
No. of female smokers = 44, No. of female smokers born caesarean = 11
Proportion of female smokers born caesarean, Pf = 11/44 = 25%
Sample size, Nf = 44
Since the sample size is > 30, as per the Central Limit Theorem, Pf ~ N(πf, SDf) approximately.
Standard deviation, SDf = sqrt(Pf x (1 - Pf)/Nf) = 0.065
Zc = (Pf - πf)/SDf = (25% - 15%)/0.065 = 1.53
Z+cr = Z0.975 = 1.96; Z-cr = Z0.025 = -1.96
Hypothesis Statement:
H0: Proportion of female smokers born caesarean, πf = 15%
HA: Proportion of female smokers born caesarean, πf ≠ 15%
Rejection Rule:
Reject H0 if Zc > Z+cr or Zc < Z-cr
Since Z-cr (= -1.96) < Zc (= 1.53) < Z+cr (= 1.96), there is not enough evidence to reject H0.
Hence the sample is consistent with the statement that 15% of the female smokers were born caesarean.
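
As a check on the two z statistics in Q.2, here is a minimal sketch using the counts quoted above. The helper name `proportion_z` is an assumption made for illustration; like the hand calculation, it uses the sample proportion (not the hypothesised one) in the standard error.

```python
import math


def proportion_z(successes, n, pi0):
    """z statistic for H0: proportion = pi0, with the sample proportion in the standard error."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - pi0) / se


z_male = proportion_z(10, 33, 0.20)    # 10 of 33 male smokers born caesarean
z_female = proportion_z(11, 44, 0.15)  # 11 of 44 female smokers born caesarean

Z_CRIT = 1.96  # two-sided 5% critical value quoted in the text
for label, z in [("male", z_male), ("female", z_female)]:
    decision = "reject H0" if abs(z) > Z_CRIT else "do not reject H0"
    print(label, round(z, 2), decision)
```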

Q3/4.

Q5.
Hypothesis Statement:
H0: Height and lungcap are independent for the following ranges
HA: Height and lungcap are dependent for the following ranges

Lungcap    Height           Total
           <63     >63
<7         229      34       263
>7          54     408       462
Total      283     442       725

Rejection Rule:
Reject H0 if the p-value of χ²cal is less than 5%, i.e. if χ²cal > χ²(1, 0.05).

Observed Frequencies   Expected Frequencies   Difference    Sq. Diff./Exp. Freq.
F     Value Given      E     Value Expected   (Fij - Eij)   (Fij - Eij)^2/Eij
F11       229          E11       102.66         126.34         155.479
F12        34          E12       160.34        -126.34          99.549
F21        54          E21       180.34        -126.34          88.509
F22       408          E22       281.66         126.34          56.670

Degrees of freedom = (2-1)(2-1) = 1
χ²cal = 400.207
χ²(1, 0.05) = 3.841
Since χ²cal > χ²(1, 0.05), the p-value for χ²cal (~0%) is less than 5%. Hence we reject H0 and conclude that
height and lungcap are dependent.

Q6. Hypothesis Statement:
H0: Lungcap variances of male smokers and female smokers are equal, or σ1²/σ2² = 1 (Null Hypothesis)
HA: Lungcap variances of male smokers and female smokers are not equal, or σ1²/σ2² ≠ 1 (Alternative
Hypothesis)
Rejection Rule:
Reject H0 if Fcal (= s1²/s2²) > Fcrit,df for significance level α = 0.05.
We conducted the F-test with α/2 = 0.05/2 = 0.025 (for a two-tailed test) and obtained the following results:

Since Fcal (= 1.19) < Fcrit (= 1.96), there is not enough evidence to reject H0. Hence the data are
consistent with the lungcap variances of male smokers and female smokers being equal.

Q7. Let μ1 and μ2 be the average lungcap of smokers and non-smokers, and let σ1² and σ2² be
the variances of the respective populations.
x̄1 = Random variable for the average lungcap of the sample of smokers ~ N(μ1, σ1²/n1)
x̄2 = Random variable for the average lungcap of the sample of non-smokers ~ N(μ2, σ2²/n2)
As per the data,
No. of smokers, n1 = 77
No. of non-smokers, n2 = 648
Average lungcap of smokers, x̄1 = 8.645
Average lungcap of non-smokers, x̄2 = 7.77
Sample lungcap variance of smokers, s1² = 3.545
Sample lungcap variance of non-smokers, s2² = 7.432

Since the population lungcap variances of smokers (σ1²) and non-smokers (σ2²) are unknown, we
assume σ1² = σ2² and estimate the common variance by the pooled variance sp²,
where sp² = [(n1 - 1)s1² + (n2 - 1)s2²]/(n1 + n2 - 2) = 7.023
Now, under H0, (x̄1 - x̄2)/(sp x sqrt(1/n1 + 1/n2)) ~ t with n1 + n2 - 2 degrees of freedom
tcal = (x̄1 - x̄2)/(sp x sqrt(1/n1 + 1/n2)) = 2.74
t+cri,0.025 = 1.96
t-cri,0.975 = -1.96
Hypothesis Statement
H0: μ1 = μ2 or μ1 - μ2 = 0
HA: μ1 ≠ μ2 or μ1 - μ2 ≠ 0
Rejection Rule
Reject H0 if tcal > t+cri or tcal < t-cri
Since in our case tcal (= 2.74) > t+cri (= 1.96), we reject H0.
Hence the average lungcap of smokers and non-smokers are not equal (μ1 ≠ μ2).
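
The pooled-variance t statistic in Q7 can be recomputed from the summary figures alone. The sketch below assumes the quoted means, sample variances and group sizes; the function name `pooled_t` is illustrative only.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Pooled variance and two-sample t statistic from summary statistics."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, (mean1 - mean2) / se


# Smokers: mean 8.645, variance 3.545, n = 77.
# Non-smokers: mean 7.77, variance 7.432, n = 648.
sp2, t_cal = pooled_t(8.645, 3.545, 77, 7.77, 7.432, 648)
print(round(sp2, 3), round(t_cal, 2))
```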
Q.8.

Q.9. # people below 16 years = 548
# people below 16 years who smoke = 42
Percentage of people below 16 who smoke = 42/548 = 7.66%
Q.10. # people above 17 years = 80
# people above 17 years who smoke = 15
Percentage of people above 17 who smoke = 15/80 = 18.75%

Q.11. H0: Age and smoking habit are independent for the age ranges in the table below
HA: Age and smoking habit are dependent for the age ranges in the table below
Reject H0 if the p-value of χ²cal is less than 5%, i.e. if χ²cal > χ²(1, 0.05).

Smoking Habit   Age              Total
                <15     >15
Yes              42      35        77
No              506     142       648
Total           548     177       725

Observed Frequencies   Expected Frequencies   Difference    Sq. Diff./Exp. Freq.
F     Value Given      E     Value Expected   (Fij - Eij)   (Fij - Eij)^2/Eij
F11        42          E11        58.20         -16.20           4.510
F12        35          E12        18.80          16.20          13.963
F21       506          E21       489.80          16.20           0.536
F22       142          E22       158.20         -16.20           1.659

Degrees of freedom = (2-1)(2-1) = 1
χ²cal = 20.668
χ²(1, 0.05) = 3.841
Since χ²cal > χ²(1, 0.05), the p-value for χ²cal (~0%) is less than 5%. Hence we reject H0 and conclude that
age and smoking habit are dependent.


Q.12. Hypothesis Statement:
H0: Lungcap and smoking habit are independent for the lungcap ranges in the table below
HA: Lungcap and smoking habit are dependent for the lungcap ranges in the table below
Reject H0 if the p-value of χ²cal is less than 5%, i.e. if χ²cal > χ²(1, 0.05).

Smoking Habit   Lungcap          Total
                <9      >9
Yes              43      34        77
No              432     216       648
Total           475     250       725

Observed Frequencies   Expected Frequencies   Difference    Sq. Diff./Exp. Freq.
F     Value Given      E     Value Expected   (Fij - Eij)   (Fij - Eij)^2/Eij
F11        43          E11        50.45          -7.45          1.100
F12        34          E12        26.55           7.45          2.089
F21       432          E21       424.55           7.45          0.131
F22       216          E22       223.45          -7.45          0.248

Degrees of freedom = (2-1)(2-1) = 1
χ²cal = 3.568
χ²(1, 0.05) = 3.841
Since χ²cal < χ²(1, 0.05), the p-value for χ²cal is more than 5%. Hence there is not enough
evidence to reject H0, and the data are consistent with lungcap and smoking habit being independent.

Q13. As per the data, we have the following descriptive statistics for lungcap and height:
                     LungCap     Height
Mean                 7.863148    64.83628
Standard Deviation   2.662008    7.202144
Count                725         725

Distribution for Lungcap
H0: The lungcap distribution of the population follows a normal distribution
~ N(mean = 7.863, SD = 2.66)
HA: The lungcap distribution does not follow ~ N(mean = 7.863, SD = 2.66)
We construct the following frequency distribution, choosing the bin limits so that each bin has an
expected frequency of 10%.

Percentage   Z-value   Bin      Frequency   Expected         fi-ei   (fi-ei)^2   (fi-ei)^2/ei
                                (fi)        Frequency (ei)
10%          -1.28      4.456   83          72.5              10.5    110.250     1.521
20%          -0.84      5.627   62          72.5             -10.5    110.250     1.521
30%          -0.52      6.479   67          72.5              -5.5     30.250     0.417
40%          -0.25      7.198   61          72.5             -11.5    132.250     1.824
50%           0         7.863   72          72.5              -0.5      0.250     0.003
60%           0.25      8.529   74          72.5               1.5      2.250     0.031
70%           0.52      9.247   79          72.5               6.5     42.250     0.583
80%           0.84     10.099   71          72.5              -1.5      2.250     0.031
90%           1.28     11.271   88          72.5              15.5    240.250     3.314
More                            68          72.5              -4.5     20.250     0.279

We get χ²cal = 9.524
For significance level 5% and degrees of freedom 7 (= 10-2-1), we have χ²(7, 0.05) = 14.067.
Since χ²cal < χ²(7, 0.05), the p-value is more than 5%. Hence we do not reject H0; the lungcap
distribution is consistent with N(mean = 7.863, SD = 2.66).

Distribution for Height
H0: The height distribution of the population follows a normal distribution
~ N(mean = 64.836, SD = 7.202)
HA: The height distribution does not follow ~ N(mean = 64.836, SD = 7.202)
We construct the following frequency distribution, choosing the bin limits so that each bin has an
expected frequency of 10%.

Percentage   Z-value   Bin      Frequency   Expected         fi-ei   (fi-ei)^2   (fi-ei)^2/ei
                                (fi)        Frequency (ei)
10%          -1.28     55.618   86          72.5              13.5    182.250     2.514
20%          -0.84     58.786   64          72.5              -8.5     72.250     0.997
30%          -0.52     61.091   64          72.5              -8.5     72.250     0.997
40%          -0.25     63.036   69          72.5              -3.5     12.250     0.169
50%           0        64.836   61          72.5             -11.5    132.250     1.824
60%           0.25     66.637   79          72.5               6.5     42.250     0.583
70%           0.52     68.581   63          72.5              -9.5     90.250     1.245
80%           0.84     70.886   70          72.5              -2.5      6.250     0.086
90%           1.28     74.055   98          72.5              25.5    650.250     8.969
More                            71          72.5              -1.5      2.250     0.031

We get χ²cal = 17.414
For significance level 5% and degrees of freedom 7 (= 10-2-1), we have χ²(7, 0.05) = 14.067.
Since χ²cal > χ²(7, 0.05), the p-value is less than 5%. Hence we reject H0: the height distribution
does not follow ~ N(mean = 64.836, SD = 7.202).
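
The two goodness-of-fit statistics can be recomputed from the observed bin counts alone, since every bin has the same expected frequency (725/10 = 72.5). This is a minimal sketch; the function name `gof_chi_square` is illustrative, and the resulting statistic is compared against the tabulated critical value as in the text.

```python
def gof_chi_square(observed, fitted_params=2):
    """Chi-square goodness-of-fit statistic for equal-probability bins."""
    n = sum(observed)
    expected = n / len(observed)  # same expected count in every bin
    stat = sum((f - expected) ** 2 / expected for f in observed)
    # Degrees of freedom: bins - fitted parameters (mean, SD) - 1.
    dof = len(observed) - fitted_params - 1
    return stat, dof


lungcap_counts = [83, 62, 67, 61, 72, 74, 79, 71, 88, 68]
height_counts = [86, 64, 64, 69, 61, 79, 63, 70, 98, 71]

lungcap_stat, lungcap_dof = gof_chi_square(lungcap_counts)
height_stat, height_dof = gof_chi_square(height_counts)
print(round(lungcap_stat, 3), lungcap_dof)
print(round(height_stat, 3), height_dof)
```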

Section-B

Q1.
                   Price       Mileage
Mean               21343.14    19831.93
Sample Variance    97710315    67179657

We observe that the sample variance of price is larger than that of mileage, i.e. prices are spread
more widely around their average than mileages are around theirs. So we can say that cars across a
wide range of prices have mileage relatively close to 19831.93.

Q2.
Price Range   Average mileage (x̄i)   Variance (σi²)   Sample Size (ni)
<20000        20241.52               64394503          467
20k-40k       19759.26               65564947          297
>40K          15589.65               95651556           40

Let μ1, μ2 and μ3 be the average mileage of cars in the price ranges given in the table, and let
σ1², σ2² and σ3² be the variances of mileage in the respective price ranges.
Hypothesis Statement
H0: μ1 = μ2 = μ3
HA: At least one of μ1, μ2, μ3 differs from the others

We conducted an ANOVA test. Since the p-value is less than 0.05, we reject H0 and state that the
average mileage of the cars in the above price ranges is not equal.
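
The source reports only the ANOVA conclusion, so the following is a sketch of how the F statistic could be recovered from the summary table (group mean, variance and size), not the author's actual computation. The function name is illustrative, and the closed-form tail probability used here is valid only because there are three groups (numerator degrees of freedom = 2).

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from (mean, variance, size) for each group."""
    n_total = sum(n for _, _, n in groups)
    grand_mean = sum(m * n for m, _, n in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups)
    ss_within = sum((n - 1) * v for _, v, n in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (average mileage, variance, sample size) for <20000, 20k-40k, >40K.
groups = [
    (20241.52, 64394503, 467),
    (19759.26, 65564947, 297),
    (15589.65, 95651556, 40),
]
f_stat, df1, df2 = anova_from_summary(groups)

# For df1 == 2 only: P(F > f) = (1 + 2f/df2) ** (-df2/2).
p_value = (1 + 2 * f_stat / df2) ** (-df2 / 2)
print(round(f_stat, 2), df1, df2, p_value < 0.05)
```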

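Q3.
Price - Chevrolet
Mean                       16427.6
Standard Deviation         6901.439
Sample Variance            47629867
Count                      320
Confidence Level (90.0%)   636.4364

Standard error of the mean = s/sqrt(n) = 6901.439/sqrt(320) = 385.80
Using the t-distribution, t(0.05, 319) ≈ 1.6496, so the margin of error is 1.6496 x 385.80 ≈ 636.44,
which is the Confidence Level (90.0%) value above.
90% confidence interval for the mean price = 16427.6 ± 636.44 = (15791.16, 17064.04)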

Q.4.
n                      150
Sample Variance (s²)   16708238
χ²(149, 0.1)           171.507

Lower 90% confidence bound for the variance = (n-1)s²/χ²(149, 0.1) = 14515607

Q.5.
Manufacturer Cov.(ML) SD(M) SD(L) Corr.(ML)
Buick 162.323 6932.136 0.230 0.102
Cadillac 594.100 8964.292 0.803 0.083
Chevrolet -285.829 8203.571 1.151 -0.030
Pontiac 959.280 8110.435 1.098 0.108
SAAB -9.525 8404.288 0.162 -0.007
Saturn -501.661 8479.994 0.301 -0.197
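
Each correlation in the table is the covariance divided by the product of the two standard deviations. The sketch below re-derives the last column from the first three; the dictionary layout and variable names are illustrative choices, and small differences are expected because the tabulated inputs are rounded.

```python
# manufacturer: (Cov(M, L), SD(M), SD(L), reported Corr(M, L))
rows = {
    "Buick": (162.323, 6932.136, 0.230, 0.102),
    "Cadillac": (594.100, 8964.292, 0.803, 0.083),
    "Chevrolet": (-285.829, 8203.571, 1.151, -0.030),
    "Pontiac": (959.280, 8110.435, 1.098, 0.108),
    "SAAB": (-9.525, 8404.288, 0.162, -0.007),
    "Saturn": (-501.661, 8479.994, 0.301, -0.197),
}

computed = {}
for name, (cov, sd_mileage, sd_liter, reported) in rows.items():
    # Pearson correlation = covariance / (SD of mileage * SD of liter).
    computed[name] = cov / (sd_mileage * sd_liter)
    print(f"{name:10s} computed={computed[name]:+.3f} reported={reported:+.3f}")
```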

Q.6. Analysing the correlation coefficients of mileage and liter across the different car manufacturers,
it can be stated that there is only a weak linear relation between mileage and liter, as all the
correlation coefficients are close to zero.

Q7. # total cars = 804
# cars with liter equal to 3.8 = 160
# cars with liter equal to 3.8 and mileage more than 20,000 = 95
Given that a car has liter 3.8, the probability that its mileage is greater than 20,000 is
P(Mileage > 20000 | liter = 3.8)
= P(liter = 3.8 and mileage > 20000) / P(liter = 3.8)
= [(# cars with liter = 3.8 and mileage > 20000)/(# total cars)] / [(# cars with liter = 3.8)/(# total cars)]
= (95/804) / (160/804)
= 0.5937
Thus, given that a car has liter 3.8, we can say with probability 59.37% that its mileage is greater
than 20,000.

Q8.
Cov.(PM)         SD(P)      SD(M)      Corr.(PM)
-11589868.158    9884.853   8196.320   -0.143

As the correlation between price and mileage is close to zero, there is only a weak linear relation
between them.

Section-C:

Q1. Given 𝑌̅ ~ N(μ, σ²/n1) with σ² = 10, error = 0.5 for sample size n1.
Confidence level = 1 - α = 0.954, or α = 0.046
Corresponding Z value for α/2 = |Z0.023| = 1.99
Now, from the standard normal distribution, |Zα/2| = |𝑌̅ - μ|/(σ/sqrt(n1)).
Or, error = |Zα/2| x σ/sqrt(n1)
Or, 0.5 = 1.99 x sqrt(10)/sqrt(n1)
Or, n1 = 1.99² x 10/0.5² = 158.4 ~ 159.
Q2. Given 𝑍̅ ~ N(μ, σ²/n2) with σ² = 9, error = 1 for sample size n2.
Confidence level = 1 - α = 0.9, or α = 0.1
Corresponding Z value for α/2 = |Z0.05| = 1.65
Now, from the standard normal distribution, |Zα/2| = |𝑍̅ - μ|/(σ/sqrt(n2)).
Or, error = |Zα/2| x σ/sqrt(n2)
Or, 1 = 1.65 x sqrt(9)/sqrt(n2)
Or, n2 = 1.65² x 9/1² = 24.5 ~ 25.
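
Both sample sizes follow from n = z² σ² / error², rounded up. The sketch below takes the rounded z values used in the text (1.99 and 1.65) as inputs rather than recomputing them; the function name `sample_size` is illustrative.

```python
import math


def sample_size(z, variance, error):
    """Smallest n such that z * sqrt(variance / n) <= error."""
    return math.ceil(z ** 2 * variance / error ** 2)


n1 = sample_size(1.99, 10, 0.5)  # Q1: variance 10, half-width 1/2
n2 = sample_size(1.65, 9, 1.0)   # Q2: variance 9, half-width 1
print(n1, n2)
```

A more precise z value would change the result slightly, so the answer depends on how the table value is rounded.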
Q.3. 200 random samples, each of size 159, were generated from N(5, 3).

Q.4. Since each sample of size 159 is drawn from a normal distribution N(5, 3), the sample mean
follows a normal distribution N(5, 3/159), i.e. with variance 3/159 ≈ 0.019. This can be checked
against the descriptive statistics of the 200 sample means and the histogram below.
Mean of the sample means = 5
Variance of the sample means = 0.020

[Histogram of the 200 sample means; vertical axis: Frequency, 0 to 40]

Bin         Frequency
4.61242      1
4.674551     4
4.736682     8
4.798813     8
4.860944    16
4.923075    23
4.985206    33
5.047337    37
5.109468    30
5.171599    24
5.23373      9
5.295861     3
5.357992     1
5.420123     2
More         1
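
The sampling experiment can be repeated with the standard library. This is a sketch under assumed settings: the seed is arbitrary, so the simulated values will differ from the figures reported above, but the mean and variance of the sample means should land near 5 and 3/159.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, chosen only for reproducibility

MU, VAR, N, REPS = 5.0, 3.0, 159, 200
sigma = VAR ** 0.5

# One sample mean per replication: 200 means of 159 normal draws each.
sample_means = [
    statistics.fmean(random.gauss(MU, sigma) for _ in range(N))
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)
print(round(mean_of_means, 3), round(var_of_means, 4), round(VAR / N, 4))
```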
Q.5. 200 random samples, each of size 25, were generated from N(5, 3).

[Histogram of the 200 sample means; vertical axis: Frequency, 0 to 45]

Bin         Frequency
3.355843     1
3.615342     2
3.874841     1
4.13434      5
4.393839    16
4.653338    19
4.912837    31
5.172336    38
5.431835    40
5.691334    17
5.950834    14
6.210333    11
6.469832     2
6.729331     1
More         2

Q.6. For a 95% confidence interval (CI) the corresponding Z value is 1.96. Hence the width of the
confidence interval is 2 x Z0.975 x σ/sqrt(n2). We have σ² = 3, n2 = 25.
Hence CI width = 2 x 1.96 x sqrt(3/25) = 1.358, i.e. a half-width of 0.679. The corresponding CI
and difference of mean are plotted below.
[Figure: Difference of Mean for each of the 200 samples (horizontal axis: sample number, 1 to 197;
vertical axis: -1 to 1), with Upper Limit of CI (+0.679) and Lower Limit of CI (-0.679)]
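
The interval half-width and the share of samples falling inside the limits can be checked by simulation. The sketch below assumes that "difference of mean" means sample mean minus the population mean of the generating distribution N(5, 3); that reading, the seed and the variable names are assumptions, not taken from the source.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, chosen only for reproducibility

MU, VAR, N, REPS = 5.0, 3.0, 25, 200
half_width = 1.96 * math.sqrt(VAR / N)  # known-variance 95% half-width

# Difference between each sample mean and the population mean.
diffs = [
    statistics.fmean(random.gauss(MU, math.sqrt(VAR)) for _ in range(N)) - MU
    for _ in range(REPS)
]

# A sample's interval covers MU exactly when |difference| <= half-width.
covered = sum(abs(d) <= half_width for d in diffs)
print(round(half_width, 3), covered / REPS)
```

Roughly 95% of the 200 differences would be expected to fall between the two limits.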

Section D:
Q.1. I collected stock price of Monnet Ispat & Energy Ltd, GAIL (India) Ltd, Alstom India Ltd, ABB India
Ltd and Siemens Ltd from 01.01.2016 to 30.06.2016.

Q2. Histogram of stock returns

Q.3. H0: Average stock returns of the five companies taken are equal (R1 = R2 = R3 = R4 = R5)
HA: At least one of the average stock returns is not equal to the others.
Reject H0 if p-value < 0.05

Since the p-value is greater than 0.05, we do not reject H0; the data are consistent with the
average returns of the said companies being equal.

Q4.&5. Hypothesis Statement
H0: Average stock returns of each pair of companies are equal.
HA: Average stock returns of each pair of companies are not equal.
Rejection Rule:
Reject H0 if tcal lies in the rejection region, i.e. tcal > t+0.975,df or tcal < t-0.025,df
Observation:
In all cases t-0.025,df < tcal < t+0.975,df, i.e. tcal lies in the acceptance region. Hence we do not
have sufficient evidence to reject H0, and for every pair of companies the data are consistent with
equal average stock returns.

Section E:
Q.1

Since the sample size (100) is more than 30, by the CLT the average income of each sample will
approximately follow a normal distribution, N(40000, SD = 40000/sqrt(100) = 4000)
Sample mean calculated = 40260.59 ~ 40000
Standard deviation = 3983.923 ~ 4000 (= 40000/10)
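
The experiment can be repeated with a short simulation. This is a sketch with an arbitrary seed, so the numbers will not match the reported ones exactly; for an exponential distribution the standard deviation equals the mean, which is why the standard error is 40000/sqrt(100) = 4000.

```python
import random
import statistics

random.seed(2)  # arbitrary seed, chosen only for reproducibility

MEAN_INCOME, N, REPS = 40000.0, 100, 500

# random.expovariate takes the rate, i.e. 1 / mean.
sample_means = [
    statistics.fmean(random.expovariate(1 / MEAN_INCOME) for _ in range(N))
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(round(mean_of_means, 2), round(sd_of_means, 2))
```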

Q.2 Since the sample size (50) is more than 30, the average age of the samples should approximately
follow a normal distribution as per the CLT.
Population average age = 25.8 and standard deviation = 5.216 (computed from the class midpoints), so
the sample average age should have mean 25.8 (observed: 25.09) and standard deviation
5.216/sqrt(50) = 0.737 (observed: 0.64).
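
The population mean and standard deviation quoted above follow from the grouped age table when each class is represented by its midpoint. The sketch below makes that midpoint assumption explicit; the variable names are illustrative.

```python
import math

# (class limits in years, proportion) from the age distribution table.
groups = [
    ((15, 18), 0.1), ((18, 21), 0.1), ((21, 23), 0.1),
    ((23, 25), 0.1), ((25, 27), 0.2), ((27, 29), 0.1),
    ((29, 31), 0.1), ((31, 33), 0.1), ((33, 35), 0.1),
]

midpoints = [(low + high) / 2 for (low, high), _ in groups]
probs = [p for _, p in groups]

pop_mean = sum(m * p for m, p in zip(midpoints, probs))
pop_var = sum(p * (m - pop_mean) ** 2 for m, p in zip(midpoints, probs))
pop_sd = math.sqrt(pop_var)
std_error = pop_sd / math.sqrt(50)  # SD of the mean of a sample of 50
print(round(pop_mean, 1), round(pop_sd, 3), std_error)
```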
