
GYAAN KOSH

TERM 1
Learning and
Development
Council, CAC

Statistical Methods for Managerial Decisions


This document covers the basic concepts of Statistical
Methods for Managerial Decisions (SMMD) covered in Term 1.
It summarizes only the main concepts and is not intended to
serve as instructional material on the subject.


Random Variable: A random variable is a variable that takes values from a given set of events, each
with an associated probability. For example, when a die is thrown, the number obtained is a random
variable, and each of the values 1, 2, 3, 4, 5 and 6 has a probability of occurrence of 1/6. This is a
discrete random variable, since it can take only a predetermined set of values. A random variable can
also be continuous (for example, the temperature during the day), in which case it can potentially take
an infinite number of values. Corresponding to a random variable there is a probability distribution,
which maps the events to their probabilities. For a continuous random variable, any particular value has
zero probability (since there are infinitely many possible values), so probability is usually expressed as
the probability of the random variable taking a value less than or equal to a given value. This is called
the cumulative probability distribution (CDF).
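
To make the discrete/continuous distinction concrete, here is a minimal Python sketch (not part of the
original notes) that simulates the die example and evaluates the CDF of a hypothetical continuous
random variable; the numbers and libraries (numpy, scipy) are assumptions chosen for illustration.

    # Minimal sketch (illustrative only): a discrete and a continuous random variable.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Discrete random variable: a fair die. Each outcome 1..6 has probability 1/6.
    rolls = rng.integers(1, 7, size=10_000)
    values, counts = np.unique(rolls, return_counts=True)
    print(dict(zip(values.tolist(), (counts / counts.sum()).round(3))))  # each close to 1/6

    # Continuous random variable: any single value has probability zero, so we work
    # with the cumulative distribution function (CDF) instead.
    temperature = stats.norm(loc=25, scale=5)   # hypothetical daytime temperature model
    print(temperature.pdf(25.0))                # density at a point, not a probability
    print(temperature.cdf(30.0))                # P(temperature <= 30), a cumulative probability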
Normal random variable: This is a specific family of random variables with some special properties.
The normal distribution (i.e. the distribution of a normal random variable) is the most widely used
distribution because of the following properties:

It can easily be converted to a standard form, which makes it easy to tabulate.

Even though it is rare to find a population that is exactly normally distributed, if we repeatedly draw
samples of large enough size and look at the distribution of the sample means, the resulting
distribution tends to be normal. This is called the central limit theorem. In most practical cases, a
sample size of 30 is enough.
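
The sketch below (not from the original notes) illustrates the central limit theorem numerically:
samples of size 30 are drawn from a clearly non-normal (exponential) population, and the distribution
of the sample means turns out approximately normal, centred on the population mean with spread close
to σ/√n. The population and sample sizes are assumptions for the example.

    # Minimal sketch (illustrative only): the central limit theorem in action.
    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # skewed, clearly not normal

    sample_size = 30                                        # "n = 30 is usually enough"
    sample_means = rng.choice(population, size=(5_000, sample_size)).mean(axis=1)

    print("population mean:", round(population.mean(), 3))
    print("mean of sample means:", round(sample_means.mean(), 3))   # close to population mean
    print("std of sample means:", round(sample_means.std(), 3))     # close to sigma / sqrt(n)
    print("sigma / sqrt(n):", round(population.std() / np.sqrt(sample_size), 3))
    # A histogram of sample_means would look approximately bell-shaped (normal).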

The Empirical Rule: For any normal distribution, approximately 68% of the values lie within one
standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie
within three standard deviations.


Standard Normal distribution: A normal distribution can be made standard by transforming it so that
the mean is zero and the standard deviation is one. To standardize a random variable X, we use the
following formula:
Z = [X - Mean(X)] / Std. Deviation(X)
Binomial Distribution: The binomial distribution is the discrete probability distribution of the
number of successes in a sequence of n independent yes/no experiments, each of which yields success
with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli
trial; when n = 1, the binomial distribution reduces to a Bernoulli distribution. The binomial
distribution is thus an n-times repeated Bernoulli trial.
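
As an illustration (not from the original notes), the sketch below evaluates a binomial distribution with
scipy; n and p are arbitrary example values.

    # Minimal sketch (illustrative only): the binomial distribution via scipy.stats.
    from scipy import stats

    n, p = 10, 0.3
    X = stats.binom(n, p)

    print(X.pmf(3))            # P(exactly 3 successes in 10 trials)
    print(X.cdf(3))            # P(at most 3 successes)
    print(X.mean(), X.var())   # mean = n*p = 3.0, variance = n*p*(1-p) = 2.1

    # With n = 1 the binomial reduces to a single Bernoulli trial.
    print(stats.binom(1, p).pmf(1))   # P(success) = p = 0.3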
Hypothesis Testing: Used to make inferences about the data and to ascertain whether those inferences
are statistically significant. The process involves the following steps:
1. State the hypothesis to be tested (the null hypothesis) and its opposite (the alternative hypothesis).
2. Define a suitable test statistic which can be measured to support or reject the hypothesis.
3. Define a significance level to suit the needs of the project. Calculate whether the test statistic meets
the significance level and decide whether to reject or fail to reject the null hypothesis.
Z-Statistic: Used when the population standard deviation is known and the distribution is normal. If
the sample has n observations, the standard deviation of the population is σ, and the null hypothesis is
mean = μ0, the z-statistic is given by
Z = (X̄ - μ0) / (σ/√n)

The term σ/√n is called the standard error of the sample mean.
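
A minimal sketch of a one-sample z-test is shown below (not from the original notes); the data, μ0 and
the assumed known σ are hypothetical.

    # Minimal sketch (illustrative only): one-sample z-test with known sigma.
    import numpy as np
    from scipy import stats

    x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])   # hypothetical sample
    mu0 = 10.0      # mean claimed by the null hypothesis
    sigma = 0.3     # known population standard deviation (assumed)

    n = len(x)
    std_error = sigma / np.sqrt(n)               # standard error of the sample mean
    z = (x.mean() - mu0) / std_error             # the z-statistic
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value

    print(f"z = {z:.3f}, p-value = {p_value:.3f}")
    # Reject the null hypothesis if p_value is below the chosen significance level (alpha).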


T-Statistic: Used when the population variance is not known and the sample size is small. The sample
standard deviation s replaces σ, and the statistic t = (X̄ - μ0) / (s/√n) follows a t-distribution with
n - 1 degrees of freedom.
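
A corresponding t-test sketch (not from the original notes) is below, computing the statistic both by
hand and with scipy's ttest_1samp; the data are the same hypothetical sample as above.

    # Minimal sketch (illustrative only): one-sample t-test with unknown sigma.
    import numpy as np
    from scipy import stats

    x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])   # hypothetical sample
    mu0 = 10.0

    # By hand: t = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom.
    n = len(x)
    t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

    # Or directly via scipy:
    t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)
    print(f"t = {t_stat:.3f} (manual: {t_manual:.3f}), p-value = {p_value:.3f}")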

Z-Statistic for proportions: Calculate the z-statistic using the formula below, where p0 is the
population proportion claimed by the null hypothesis, p̂ is the observed sample proportion and n is the
sample size. This formula is valid only for proportions.
Z = (p̂ - p0) / √[p0 * (1 - p0) / n]
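
The sketch below (not from the original notes) applies this formula; the counts are hypothetical survey
numbers.

    # Minimal sketch (illustrative only): z-test for a proportion.
    import numpy as np
    from scipy import stats

    successes, n = 62, 100     # e.g. 62 "yes" answers out of 100 respondents (hypothetical)
    p_hat = successes / n      # observed sample proportion
    p0 = 0.5                   # proportion claimed by the null hypothesis

    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    print(f"z = {z:.3f}, p-value = {p_value:.4f}")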


Confidence Interval: Since point estimates of the mean are not exact, it is better to specify an interval
together with a level of confidence for that interval. For example, we can state that the mean lies
between 10 and 20 with 90% confidence; this is called a 90% confidence interval for the mean. When the
population standard deviation σ is known, the confidence interval at a (1 - α) confidence level is given by
X̄ ± z(α/2) * σ/√n

When σ is unknown, the confidence interval using a t-distribution at a (1 - α) confidence level with
sample size n is given by
X̄ ± t(α/2, n-1) * s/√n
Type 1 Error: Even after a hypothesis test, we are never 100% sure about the validity of the outcome,
although we have a high degree of confidence in the analysis. The event of rejecting a null hypothesis
that is actually true is called a Type I error, and the corresponding probability is the significance level
α (α = 10% in the previous example with a 90% confidence level).
Type 2 Error: The event of failing to reject a null hypothesis that is actually false is called a Type II
error. Depending on the context, this can be more serious than a Type I error. For example, consider a
test administered to determine whether a patient has cancer, where the null hypothesis is that the
patient does not have cancer. Determining that the patient does not have cancer when he actually does
is an example of a Type II error.
P-Value: The p-value of a test is the smallest significance level α at which the observed sample would
lead to rejection of the null hypothesis; equivalently, it is the probability, under the null hypothesis, of
obtaining a test statistic at least as extreme as the one observed. This is important because different
researchers may want different significance levels for the same test in different applications. By
reporting the p-value, researchers can determine for which significance levels the results hold, as well
as the sensitivity of the results at a given significance level.
Covariance and Correlation: Used to determine whether two variables move together. If the variables
move in the same direction, i.e. when one variable increases the other also increases, the variables are
said to be positively correlated. Mathematically the two measures are given by the equations below:
Cov(X,Y) = E[ (X - E(X)) (Y - E(Y)) ]
Corr(X,Y) = Cov(X,Y) / (σx * σy)
Note that the covariance between two variables can be any real number, but the correlation coefficient
always lies between -1 and +1. A correlation coefficient of zero means there is no linear relationship
between the two variables; independence implies zero correlation, but zero correlation by itself does
not guarantee independence.
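
The sketch below (not from the original notes) computes both measures with numpy on simulated,
positively related data.

    # Minimal sketch (illustrative only): covariance and correlation with numpy.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 2.0 * x + rng.normal(scale=0.5, size=200)   # positively related to x

    print(np.cov(x, y)[0, 1])        # Cov(X, Y): can be any real number
    print(np.corrcoef(x, y)[0, 1])   # Corr(X, Y): always between -1 and +1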


Regression: Used to determine whether one variable has a linear relationship with another. The
independent variable is called the predictor variable and the dependent variable is called the response
variable. The regression equation is given by
y_i = b0 + b1*x_i + e_i

The term e_i is the error (residual) term. If the regression is modeled correctly, e_i should be random
and should be independent of x and y. The coefficient of determination (R-square) of the regression
gives the fraction of the variation in the response variable that can be explained by the regression
model. If R-square is very small, the model is not very useful. Note, however, that R-square always
increases as we add more terms to the model, even if the additional variables do not bring much value.
To avoid this, we use a modified version of R-square called adjusted R-square, which improves only if
the added variables bring enough additional explanatory power to the model.
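
A minimal sketch of a simple linear regression (not from the original notes) is shown below, using
simulated data and the statsmodels library (an assumption about available tooling).

    # Minimal sketch (illustrative only): simple linear regression with statsmodels.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=100)   # y = b0 + b1*x + error

    X = sm.add_constant(x)         # adds the intercept column
    model = sm.OLS(y, X).fit()

    print(model.params)            # estimates of b0 and b1
    print(model.rsquared)          # R-square
    print(model.rsquared_adj)      # adjusted R-square (penalizes extra predictors)
    print(model.resid[:5])         # residuals e_i: should look random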
Multiple Regression: If the response variable depends on more than one predictor variable, we have to
use multiple regression. For example, sales of Domino's pizza may depend not only on its own price but
also on the price of its rival Pizza Hut. To model sales in this scenario, we have to use multiple
regression. In this case we need to determine whether each of the variables is significant or not. This
can be done using a t-test, where the null hypothesis states that the coefficient of the variable is zero.
For example, we can test whether the sales of Domino's pizza really depend on the price of Pizza Hut by
running a t-test on the coefficient for Pizza Hut's price in the regression equation. In general, a
t-statistic with absolute value greater than about 2 means the variable is significant.
The overall significance of the model can be assessed with an F-test, which tests the null hypothesis
that all the coefficients (other than the intercept) are zero; if this hypothesis cannot be rejected, the
model itself is not significant.
Statistical software usually reports the F-statistic when a regression is run (see Appendix 1 and the
sketch below).
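
Below is a sketch (not from the original notes) in the spirit of the Domino's / Pizza Hut example, with
simulated prices and sales; it reports the per-coefficient t-tests and the overall F-test.

    # Minimal sketch (illustrative only): multiple regression, t-tests and F-test.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    own_price = rng.uniform(5, 15, size=120)
    rival_price = rng.uniform(5, 15, size=120)
    sales = 200 - 8 * own_price + 5 * rival_price + rng.normal(scale=10, size=120)

    X = sm.add_constant(pd.DataFrame({"own_price": own_price, "rival_price": rival_price}))
    model = sm.OLS(sales, X).fit()

    print(model.tvalues)                  # t-statistic per coefficient (|t| > ~2 => significant)
    print(model.pvalues)                  # corresponding p-values
    print(model.fvalue, model.f_pvalue)   # F-test for the model as a whole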
Assumptions of Linear Regression: Quantitative models always rest on assumptions about the way
the world works, and regression models are no exception. There are four principal assumptions which
justify the use of linear regression models for purposes of prediction:
(i) linearity of the relationship between dependent and independent variables
(ii) independence of the errors (no serial correlation)
(iii) homoscedasticity (constant variance) of the errors
(a) versus time
(b) versus the predictions (or versus any independent variable)
(iv) normality of the error distribution.


If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity,
and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a
regression model may be (at best) inefficient or (at worst) seriously biased or misleading.
Multi-collinearity Problem: There can be situations where, when we run a regression, the overall model
appears good but none of the individual coefficients is significant. This is called the multi-collinearity
problem and occurs because one or more predictor variables are (nearly) linearly related to the other
predictors. To eliminate the problem, we have to remove the redundant (linearly dependent) predictor
variables from the regression equation. A systematic approach to doing this is called stepwise
regression.
Indicator/Categorical Variables: In some cases the response variable may depend on categorical
variables. Using categorical variables we can compare two groups and find out whether there is a
significant difference between them. Example: the salary of employees in a firm may depend on whether
or not they have an MBA degree. To fit such information into the regression model, we use dummy
variables. In the above case, we may use a dummy variable MBA which takes a value of 1 if the person
has an MBA degree and 0 otherwise. Such dummy variables are also called indicator variables.
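
The sketch below (not from the original notes) fits the MBA salary example with a 0/1 dummy variable;
salaries, experience and the size of the MBA effect are all simulated assumptions.

    # Minimal sketch (illustrative only): regression with an indicator (dummy) variable.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 200
    mba = rng.integers(0, 2, size=n)              # 1 = has an MBA, 0 = does not
    experience = rng.uniform(0, 20, size=n)
    salary = 40 + 3 * experience + 15 * mba + rng.normal(scale=5, size=n)

    X = sm.add_constant(pd.DataFrame({"experience": experience, "mba": mba}))
    model = sm.OLS(salary, X).fit()
    print(model.params["mba"])   # estimated salary premium for MBA holders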
Time Series Regression: The stock price today is a function of the stock price yesterday; in fact, it may
depend on the stock prices over the last fifteen days. To model such scenarios, we create lagged
variables and use them as predictor variables in the regression equation.
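
A lagged-regression sketch (not from the original notes) is shown below; the price series is simulated
and two lags are used purely as an example.

    # Minimal sketch (illustrative only): regression on lagged values of a time series.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    price = pd.Series(100 + np.cumsum(rng.normal(size=300)), name="price")

    df = pd.DataFrame({
        "price": price,
        "lag1": price.shift(1),   # yesterday's price
        "lag2": price.shift(2),   # price two days ago
    }).dropna()

    X = sm.add_constant(df[["lag1", "lag2"]])
    model = sm.OLS(df["price"], X).fit()
    print(model.params)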
Regression in use: Regression is widely used across management disciplines including financial
analysis, manufacturing, managerial accounting and consulting. If we have the data, we can formulate a
regression model to measure almost any kind of dependence and to predict future values. Some key uses
of regression are listed below:
Finance: Calculating beta (portfolio management)
Managerial accounting: Breaking up the costs between variable and fixed costs
Marketing: Predicting sales, BASS model, Conjoint analysis
Data analytics: Looking for correlation between categories of data.
Economics: Estimating demand and supply curve equations.
Useful tips for data analysis: Always plot your data. If it is measured against time, plot it against time.
Always check the residuals for any patterns and violations of the assumptions (a plotting sketch
follows below).
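
The sketch below (not from the original notes) illustrates these tips on a simulated fit: residuals are
plotted in order, against the fitted values, and as a histogram; matplotlib and statsmodels are assumed
to be available.

    # Minimal sketch (illustrative only): basic residual diagnostic plots.
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    x = np.linspace(0, 10, 100)
    y = 2 + 0.8 * x + rng.normal(scale=1.0, size=100)
    model = sm.OLS(y, sm.add_constant(x)).fit()

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].plot(model.resid)                          # residuals in observation/time order
    axes[0].set_title("Residuals vs order")
    axes[1].scatter(model.fittedvalues, model.resid)   # funnels or curves suggest problems
    axes[1].set_title("Residuals vs fitted")
    axes[2].hist(model.resid, bins=20)                 # rough check of normality
    axes[2].set_title("Residual histogram")
    plt.tight_layout()
    plt.show()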


Appendix 1: How to read Regression Output? [1]


The REG Procedure
Model: MODEL1
Dependent Variable: science (science score)

Analysis of Variance (ANOVA Table)

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4        9543.72074     2385.93019      46.69    <.0001
Error             195        9963.77926       51.09630
Corrected Total   199             19507

Source - Looking at the breakdown of variance in the outcome variable, these are the categories we will examine: Model, Error,
and Corrected Total. The Total variance is partitioned into the variance which can be explained by the independent variables
(Model) and the variance which is not explained by the independent variables (Error).
DF - These are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of
freedom. The model degrees of freedom corresponds to the number of coefficients estimated minus 1. Including the
intercept, there are 5 coefficients, so the model has 5-1=4 degrees of freedom. The Error degrees of freedom is the DF total
minus the DF model, 199 - 4 =195.
Sum of Squares - These are the Sum of Squares associated with the three sources of variance, Total, Model and Error.
Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective DF.
F Value - This F-statistic is the Mean Square for the Model (2385.93019) divided by the Mean Square Error (51.09630), yielding
F = 46.69.
Pr > F - This is the p-value associated with the above F-statistic. It is used in testing the null hypothesis that all of the model
coefficients are 0.

Overall Model Fit

Root MSE           7.14817    R-Square    0.4892
Dependent Mean    51.85000    Adj R-Sq    0.4788
Coeff Var         13.78624
Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Error.
Dependent Mean - This is the mean of the dependent variable.
Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the root MSE divided by
the mean of the dependent variable, multiplied by 100: (100*(7.15/51.85) = 13.79).
R-Square - R-Squared is the proportion of variance in the dependent variable (science) which can be explained by the
independent variables (math, female, socst and read). This is an overall measure of the strength of association and does not
reflect the extent to which any particular independent variable is associated with the dependent variable.
Adj R-Sq - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted
R-squared is computed using the formula 1 - ((1 - Rsq)(N - 1) /( N - k - 1)) where k is the number of predictors.
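
As a quick check (not part of the original output), the R-Square and Adj R-Sq above can be reproduced
from the sums of squares in the ANOVA table:

    # Minimal sketch (illustrative only): reproducing R-Square and Adj R-Sq from the table.
    ss_model, ss_error = 9543.72074, 9963.77926
    ss_total = ss_model + ss_error
    n, k = 200, 4                                          # 200 observations, 4 predictors

    r_sq = ss_model / ss_total                             # ~0.4892
    adj_r_sq = 1 - ((1 - r_sq) * (n - 1)) / (n - k - 1)    # ~0.4788
    print(round(r_sq, 4), round(adj_r_sq, 4))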

[1] http://www.ats.ucla.edu/stat/sas/output/reg.htm


Parameter Estimates

Variable         DF    Parameter    Standard    t Value    Pr > |t|      95% Confidence Limits
                        Estimate       Error
Intercept         1     12.32529     3.19356       3.86      0.0002      6.02694    18.62364
math score        1      0.38931     0.07412       5.25      <.0001      0.24312     0.53550
female            1     -2.00976     1.02272      -1.97      0.0508     -4.02677     0.00724
socst score       1      0.04984     0.06223       0.80      0.4241     -0.07289     0.17258
reading score     1      0.33530     0.07278       4.61      <.0001      0.19177     0.47883

DF - This column gives the degrees of freedom associated with each independent variable. All continuous variables have one
degree of freedom, as do binary variables (such as female).
Parameter Estimates - These are the values for the regression equation for predicting the dependent variable from the
independent variables. The regression equation is presented in many different ways, for example:

Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4


The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.
math - The coefficient is 0.38931. So for every unit increase in math, a 0.38931 unit increase in science is predicted, holding all
other variables constant.
female - For every unit increase in female, we expect a 2.00976 unit decrease in the science score, holding all other variables
constant. Since female is coded 0/1 (0 = male, 1 = female), the interpretation is simpler: for females, the predicted science
score is about 2 points lower than for males.
Standard Error - These are the standard errors associated with the coefficients.
t Value - These are the t-statistics used in testing whether a given coefficient is significantly different from zero.
Pr > |t| - This column shows the two-tailed p-values used in testing the null hypothesis that the coefficient (parameter) is 0.
Using an alpha of 0.05, the coefficient for math is significantly different from 0 because its p-value (<.0001) is smaller than 0.05.
95% Confidence Limits - These are the 95% confidence intervals for the coefficients. The confidence intervals are related to
the p-values such that the coefficient will not be statistically significant if the confidence interval includes 0. These confidence
intervals can help you to put the estimate from the coefficient into perspective by seeing how much the value could vary.


Appendix 2: Checking assumptions of linear regression [2] (this should be particularly useful for
people applying to quantitative roles)
Violations of linearity are extremely serious--if you fit a linear model to data which are nonlinearly related, your predictions
are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.
How to detect: nonlinearity is usually most evident in a plot of the observed versus predicted values or a plot of residuals
versus predicted values, which are a part of standard regression output. The points should be symmetrically distributed around
a diagonal line in the former plot or a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern,
indicating that the model makes systematic errors whenever it is making unusually large or small predictions.
How to fix: consider applying a nonlinear transformation to the dependent and/or independent variables--if you can think of a
transformation that seems appropriate. For example, if the data are strictly positive, a log transformation may be feasible.
Another possibility to consider is adding another regressor which is a nonlinear function of one of the other variables. For
example, if you have regressed Y on X, and the graph of residuals versus predicted suggests a parabolic curve, then it may
make sense to regress Y on both X and X^2 (i.e., X-squared). The latter transformation is possible even when X and/or Y have
negative values, whereas logging may not be.
Violations of independence are also very serious in time series regression models: serial correlation in the residuals means that
there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly mis-specified model,
as we saw in the auto sales example. Serial correlation is also sometimes a byproduct of a violation of the linearity
assumption--as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time.
How to detect: The best test for residual autocorrelation is to look at an autocorrelation plot of the residuals. (If this is not part
of the standard output for your regression procedure, you can save the RESIDUALS and use another procedure to plot the
autocorrelations.) Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero,
which are located at roughly plus-or-minus 2-over-the-square-root-of-n, where n is the sample size. Thus, if the sample size is
50, the autocorrelations should be between +/- 0.3. If the sample size is 100, they should be between +/- 0.2. Pay especially
close attention to significant correlations at the first couple of lags and in the vicinity of the seasonal period, because these are
probably not due to mere chance and are also fixable. The Durbin-Watson statistic provides a test for significant residual
autocorrelation at lag 1: the DW stat is approximately equal to 2(1-a) where a is the lag-1 residual autocorrelation, so ideally it
should be close to 2.0--say, between 1.4 and 2.6 for a sample size of 50.
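
The sketch below (not from the original text) fits a straight trend line to a simulated autocorrelated
series and then checks the residuals with an autocorrelation plot and the Durbin-Watson statistic; the
data and the use of statsmodels/matplotlib are assumptions for the example.

    # Minimal sketch (illustrative only): checking residual autocorrelation.
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.graphics.tsaplots import plot_acf

    rng = np.random.default_rng(7)
    t = np.arange(200)
    y = 0.05 * t + np.cumsum(rng.normal(scale=0.5, size=200))   # trending, autocorrelated series

    model = sm.OLS(y, sm.add_constant(t)).fit()                 # straight trend-line fit

    print("Durbin-Watson:", round(durbin_watson(model.resid), 2))  # ~2 is ideal; well below 2 => positive autocorrelation
    plot_acf(model.resid, lags=20)    # bars outside roughly +/- 2/sqrt(n) indicate significant autocorrelation
    plt.show()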
How to fix: Minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the
dependent variable and/or lags of some of the independent variables. Or, if you have ARIMA options available, try adding an
AR=1 or MA=1 term. (An AR=1 term in Statgraphics adds a lag of the dependent variable to the forecasting equation, whereas
an MA=1 term adds a lag of the forecast error.) If there is significant correlation at lag 2, then a 2nd-order lag may be
appropriate.
If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3 or DW stat greater
than 2.6), watch out for the possibility that you may have overdifferenced some of your variables. Differencing tends to drive
autocorrelations in the negative direction, and too much differencing may lead to patterns of negative correlation that lagged
variables cannot correct for.
If there is significant correlation at the seasonal period (e.g. at lag 4 for quarterly data or lag 12 for monthly data), this indicates
that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of
the following ways: (i) seasonally adjust the variables (if they are not already seasonally adjusted), or (ii) use seasonal lags
and/or seasonally differenced variables (caution: be careful not to overdifference!), or (iii) add seasonal dummy variables to
the model (i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2, etc.) The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive
constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is
multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally
adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)
[2] http://www.duke.edu/~rnau/testing.htm


Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a
fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied
to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of
differencing, logging, and/or deflating.
Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in
confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time,
confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the
effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when
estimating coefficients.
How to detect: look at plots of residuals versus time and residuals versus predicted value, and be alert for evidence of residuals
that are getting larger (i.e., more spread-out) either as a function of time or as a function of the predicted value. (To be really
thorough, you might also want to plot residuals versus some of the independent variables.)
How to fix: In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth,
perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the
variance in this case. Stock market data may show periods of increased or decreased volatility over time--this is normal and is
often modeled with so-called ARCH (auto-regressive conditional heteroscedasticity) models in which the error variance is fitted
by an autoregressive model. Such models are beyond the scope of this course--however, a simple fix would be to work with
shorter intervals of data in which volatility is more nearly constant. Heteroscedasticity can also be a byproduct of a significant
violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those
problems.
Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the
error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization
of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of
confidence intervals and various significance tests for coefficients are all based on the assumptions of normally distributed
errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
How to detect: the best test for normally distributed errors is a normal probability plot of the residuals. This is a plot of the
fractiles of error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution
is normal, the points on this plot should fall close to the diagonal line. A bow-shaped pattern of deviations from the diagonal
indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in
the same direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis--i.e., there are either
too many or too few large errors in both directions.
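
A minimal sketch of a normal probability plot (not from the original text) is shown below using scipy's
probplot; the residuals here are simulated stand-ins for real model residuals.

    # Minimal sketch (illustrative only): normal probability plot of residuals.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(8)
    residuals = rng.normal(scale=1.0, size=200)     # replace with the model's residuals in practice

    stats.probplot(residuals, dist="norm", plot=plt)
    plt.title("Normal probability plot of residuals")
    plt.show()
    # Points close to the diagonal line suggest approximately normal errors;
    # a bowed pattern suggests skewness, an S-shape suggests excess kurtosis.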
How to fix: violations of normality often arise either because (a) the distributions of the dependent and/or independent
variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear
transformation of variables might cure both problems. In some cases, the problem with the residual distribution is mainly due
to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry
errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely
errors or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In
some cases, however, it may be that the extreme values in the data provide the most useful information about values of some
of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.
