MIM 14 F1
Exercise 1
www.exploreiceland.is has committed to display 650.000 ads. Traffic to the website
is estimated to be normally distributed with a mean of 850.000 viewers and a standard
deviation of 150.000.
a)
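The problem consists of obtaining the probability of showing the 650.000 ads when the traffic
follows a normal distribution. Any normal distribution can be expressed as follows:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Where:
$\mu$ is the mean of the distribution
$\sigma$ is the standard deviation of the distribution
Figure 1 Gaussian distribution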
The question can be rewritten in a more mathematical style, as we are looking for a
probability:
$P(X \geq 650.000) = 1 - P(X < 650.000)$
This means that we are looking for the probability that the traffic X is over 650.000 or,
in other words, 1 minus the probability that X is less than 650.000, as 650.000 is less than the
mean of the traffic.
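Z is obtained using the next equation, which normalizes any Gaussian distribution to N(0,1):
$Z = \frac{X - \mu}{\sigma}$
Where:
X = 650.000 (ads committed)
$\mu$ = 850.000 (mean of the traffic)
$\sigma$ = 150.000 (standard deviation of the traffic)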
Graphically, this probability is represented in the Gauss curve as shown in the following
figure:
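As a quick numerical check of part a), the calculation can be sketched in Python. This is only an illustration: it uses the standard library class `statistics.NormalDist` in place of the printed normal table, and the variable names are our own.

```python
from statistics import NormalDist

# Traffic model from the exercise statement: mean 850.000, std. dev. 150.000
traffic = NormalDist(mu=850_000, sigma=150_000)
committed_ads = 650_000

# Standardize the committed volume: z = (x - mu) / sigma
z = (committed_ads - traffic.mean) / traffic.stdev

# P(X >= 650.000) = 1 - P(X < 650.000)
p_deliver = 1 - traffic.cdf(committed_ads)

print(f"z = {z:.4f}")
print(f"P(X >= 650.000) = {p_deliver:.4f}")
```

The standardized value is negative because the committed volume lies below the mean, so the resulting probability is well above 50%.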
b)
In this case, we look for the number of ads that has only a 5% probability of not being reached.
Looking in the normal table for P(Z) = 5%, Z can be obtained:
As 0.05 is located between the table entries for 1.64 and
1.65, it is recommended to make a linear
interpolation to obtain z:
Figure 3 Normal distributed data
$z \approx -1.645$
Using this result in the Z equation we can calculate the number of impressions required:
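The table lookup and the inversion of the Z equation can be sketched as follows. This is a minimal illustration that replaces the manual interpolation with `NormalDist().inv_cdf`, so the last digits may differ slightly from a table-based result.

```python
from statistics import NormalDist

mu, sigma = 850_000, 150_000

# z such that P(Z < z) = 0.05 (what the table lookup and interpolation approximate)
z_05 = NormalDist().inv_cdf(0.05)

# Invert z = (x - mu) / sigma to recover the traffic level x
impressions = mu + z_05 * sigma

print(f"z = {z_05:.4f}")
print(f"x = {impressions:,.0f}")
```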
Exercise 2
A sample of 36 weekly observations has a mean of 0.005 and a standard deviation of
0.02.
a)
To build a 95% Confidence Interval, we must first assume that the sample mean is normally
distributed. In order to do so, we can rely on the CENTRAL LIMIT THEOREM:
It is safe to use the normal distribution if the sample is reasonably large (30 or more)
Sample size is 36, so it is safe to adopt a normal distribution. Hence, a 95% C.I. can be
built using the equation below:
$\bar{x} \pm 1.96 \frac{s}{\sqrt{n}}$
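A short sketch of this interval in Python, assuming the sample mean is 0.005 and the sample standard deviation is 0.02 (our reading of the exercise statement), and using the normal quantile rather than a t quantile, as the text does:

```python
from math import sqrt
from statistics import NormalDist

n = 36
sample_mean = 0.005   # assumed reading of the exercise statement
sample_sd = 0.02      # assumed reading of the exercise statement

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5% quantile (about 1.96)
z = NormalDist().inv_cdf(0.975)

margin = z * sample_sd / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"margin = {margin:.5f}")
print(f"95% C.I. = ({lower:.5f}, {upper:.5f})")
```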
b)
We have to calculate a new 95% C.I. knowing what the new range is:
This means that the half-width (margin of error) of the interval must be less than 0.004.
Knowing this, we can calculate the new value required for the size of the sample:
n = 96.04, rounded up to 97
The size of the sample required is 97
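The sample size calculation can be reproduced with the sketch below. It assumes a standard deviation of 0.02 and a target margin of 0.004, and solves margin = z * s / sqrt(n) for n.

```python
from math import ceil
from statistics import NormalDist

sample_sd = 0.02        # assumed reading of the exercise statement
target_margin = 0.004   # required half-width of the 95% C.I.

z = NormalDist().inv_cdf(0.975)  # about 1.96

# n = (z * s / margin)^2, rounded up because n must be a whole number
n_exact = (z * sample_sd / target_margin) ** 2
n_required = ceil(n_exact)

print(f"n (exact) = {n_exact:.2f}")
print(f"n (required) = {n_required}")
```

Rounding up rather than to the nearest integer guarantees that the margin stays at or below the target.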
c)
As explained in a), because the sample size is more than 30, we can use the CENTRAL LIMIT
THEOREM and assume that the sample mean is normally distributed. Without this theorem, or an
assumption of normality, we could not have built a 95% C.I. to estimate the mean weekly
return.
Exercise 3
a)
Again, as the sample consists of 256 random observations, we can use the CENTRAL LIMIT
THEOREM and take the sample mean as an estimate of the mean of the population.
The estimated mean income for the population is 35.420.
b)
95% C.I. (rounded to the nearest 10) for the estimate of the mean income.
Assuming, as said in a), a normal distribution, we can calculate a 95% C.I. using the
equation below:
c)
It means that we are 95% confident that the true mean income is between 35.170
and 35.670. In other words, the method used to build this interval captures the true
mean income 95% of the time. Hence, if we were to repeat the sampling 100 times and
build a 95% confidence interval each time, about 95 out of the 100 C.I. would include the true mean.
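The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch: the population mean and standard deviation below are hypothetical values chosen for the demonstration, not figures from the exercise, and the population is assumed to be normal.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable

TRUE_MEAN = 35_420       # hypothetical population mean
TRUE_SD = 2_000          # hypothetical population standard deviation
N = 256                  # sample size used in the exercise
REPS = 1_000             # number of repeated samples

z = NormalDist().inv_cdf(0.975)

hits = 0
for _ in range(REPS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    centre = mean(sample)
    margin = z * stdev(sample) / sqrt(N)
    # Count the intervals that contain the true mean
    if centre - margin <= TRUE_MEAN <= centre + margin:
        hits += 1

coverage = hits / REPS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The share of intervals that contain the true mean should come out close to 95%, although it will not be exactly 95% in any single run.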
Exercise 4
Two models, X and Y, are used to forecast the probability that a drug development
project is going to be successful. The probability of success is said to be related to the
amount spent on R&D. Model Y also takes the number of scientists in the project into account.
a)
In order to analyze any regression model we must follow the steps below:
1)
As we do not have the original data from which the model has been created, we cannot draw
a scatterplot. However, we can see that the correlation value is 0.63, which means
that the dependent and independent variables have a positive correlation.
2)
Model's significance:
The objective is to check whether the relationship captured by the model could have arisen
by chance. For this purpose we will do the following hypothesis test:
Null hypothesis H0 : m = 0
Alternative hypothesis HA : m ≠ 0
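T-Stat: It is a ratio that tells us how many standard errors the regression
coefficient is from zero. This can be calculated using the following formula:
$t = \frac{|\hat{m} - m_0|}{SE(\hat{m})}$
where $\hat{m}$ is the estimated slope, $m_0$ is the hypothesized value and $SE(\hat{m})$ is
the standard error of the slope.
Using $m_0$ = 0 (as the objective is to know how far our slope is from this value), we
obtain the t-stat for our model. We can be confident that this difference is
significant if the t-stat is greater than 2.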
P-Value is the probability of seeing a sample with at least as much evidence in favor of
the alternative hypothesis as the sample actually observed, assuming that the null hypothesis
is true. The smaller the p-value, the more evidence there is in favor of the alternative
hypothesis. As we will use a 95% C.I., we need a p-value below 5%.
We also need to check that the confidence interval for the variable analyzed
does not contain zero.
To sum up, in order to check the model's significance, we must check the following
values:
T-Stat > 2
P-Value < 5%
C.I. 95% does not include 0
We can follow the above steps to analyze a given variable. To check the model's
significance, we apply them to Model X:
R&D             Model's value   Pass/Fail
T-stat (>2)     2.7352          PASS
P-value (<5%)   0.0194          PASS
Lower 95%       0.0001
Upper 95%       0.0013          PASS
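The three checks can be written as a small helper. This is an illustrative sketch: the function name `significance_checks` is our own, and the inputs are the Model X values for R&D, read as 2.7352, 0.0194, 0.0001 and 0.0013.

```python
def significance_checks(t_stat, p_value, ci_lower, ci_upper,
                        alpha=0.05, t_threshold=2.0):
    """Return the outcome of the three significance checks for one variable."""
    return {
        "T-stat (>2)": abs(t_stat) > t_threshold,
        "P-value (<5%)": p_value < alpha,
        "95% C.I. excludes 0": not (ci_lower <= 0 <= ci_upper),
    }


# Model X, variable R&D (values as read from the table; decimals assumed)
model_x_rd = significance_checks(2.7352, 0.0194, 0.0001, 0.0013)

for check, passed in model_x_rd.items():
    print(f"{check}: {'PASS' if passed else 'FAIL'}")
```

The same helper can be reused for each explanatory variable of a model with several variables.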
3)
Model's quality:
With this step, we will check whether the model is good enough to predict the
response of the dependent variable and, therefore, to predict the success of the
development project. The following values must be analyzed:
Adjusted R2: a measure that adjusts R2 (the percentage of variation of the
dependent variable explained by the model) for the number of explanatory
variables in the equation.
Adjusted R2 = 35%
We should compare this value with the range of values of our original data, but as a
first opinion, we think that this value is too high.
However, these two conclusions cannot be used as a final decision to qualify the model. In
order to do so, we should build a 95% C.I. for the model and see how accurate it is:
*This model is obtained in part c) of the exercise:
95% C.I. = (17.59% , 92.07%)
As we can see, the confidence interval is too wide (74.48%). Therefore, we conclude that:
b)
Evaluate model Y
In order to evaluate model Y we have to follow the same steps used for model X. The only
difference is that model Y includes two explanatory variables in the regression
analysis.
1)
Again, we cannot draw a scatterplot as we do not know the original values, but we have the
correlation analysis in the assignment:
Both R&D spending and No. of scientists
show a strong positive correlation with the % of
success. Furthermore, the two independent
variables are correlated with one another. This is
called multicollinearity, and will be explained in
the next steps.
2)
Model's significance:
R&D             Model's value   Pass/Fail
T-stat (>2)     4.5454          PASS
P-value (<5%)   0.0003          PASS
Lower 95%       0.0005
Upper 95%       0.0014          PASS

Scientists      Model's value   Pass/Fail
T-stat (>2)     0.0961          FAIL
P-value (<5%)   0.9246          FAIL
Lower 95%       -0.0101
Upper 95%       0.011           FAIL
This model presents a peculiarity: one of the independent variables is not significant for the
regression analysis. This is due to the multicollinearity mentioned above, and it means
that the variable No. of scientists does not improve the model's quality.
3)
Model's quality:
Adjusted R2 = 93%
We should compare this value with the range of values of our original data, but as a
first opinion, we think that this value is very good.
However, these two conclusions cannot be used as a final decision to qualify the model. In
order to do so, we should build a 95% C.I. for the model and see how accurate it is:
*This model is obtained in part c) of the exercise:
95% C.I. = (33.6% , 61.4%)
As we can see, the C.I. is very narrow (27.8%). This means that the model is accurate.
Model Y is statistically significant, and the analysis shows that it is very accurate.
It can be improved by removing the variable No. of scientists, as it does not
improve the significance of the model.
c)
Write both model X and Y regression equations and forecast the % of success for
both companies when 540 have been invested in R&D and 21 scientists have
been assigned to the project. Also determine the 95% C.I.
Both model X and model Y use a straight line to create the regression model. This straight
line can be expressed using the following equation:
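$y = b + m_1 x_1 + m_2 x_2$
Where:
y is the forecast % of success
b is the intercept
$m_1$ and $m_2$ are the regression coefficients
$x_1$ is the R&D spending and $x_2$ is the number of scientists (the $x_2$ term applies to Model Y only)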
Model X:
Success = 54.83%
Model Y:
Success = 47.68%
Model X:
Model Y:
Exercise 5
A scatterplot of spending on alcohol vs tobacco has been done in 11 regions.
a)
According to the graph on the left we can see that, in the first 11 regions, the more alcohol
is consumed, the more money is spent on tobacco. Also, if we analyze the right graph, a
straight line can be drawn in the scatterplot, implying a positive correlation:
Apparently, taking into account the data provided, we could say that alcohol spending
causes tobacco spending. However, correlation does not imply causation, so once the
regression analysis is done, we may realize that we cannot find any relationship between the two.
b)
Outliers are observations that have extreme values relative to other observations made
under the same conditions.
Normally, we should not discard any outlier unless we are completely sure that it is not
significant for our model.
In this case, we recommend discarding the outlier as it is only one region out of
the 11 observed and, instead of being significant for our model, we think that it will be