
Advanced Statistics

Group Assignment

PGPM-BABI March’19
GROUP 5
28/07/2019

Members:
Manidhar Pabbathi
Mirudhula Purushothaman
Balabaskar S
Harish Vardhana D
CONTENTS

Problem 1 (Cereal Factor Analysis)
    Insights
    Conclusion

Problem 2 (Leslie Salt Data Set)
    Insights
    Linear Regression Models 1-6
    Conclusion

Problem 3 (All Greens Franchise)
    Insights
    Linear Regression Models 1-2
    Conclusion

Problem 1

Cereal Data Factor Analysis


As part of a study of consumer consideration of ready-to-eat cereals sponsored by Kellogg Australia, Roberts
and Lattin (1991) surveyed consumers regarding their perceptions of their favorite brands of cereals. Each
respondent was asked to evaluate three preferred brands on each of 25 different attributes. Respondents
used a five-point Likert scale to indicate the extent to which each brand possessed the given attribute.
For the purpose of this assignment, a subset of the data collected by Roberts and Lattin is used, reflecting
the evaluations of the 12 most frequently cited cereal brands in the sample (in the original study, a total of 40
different brands were evaluated by 121 respondents, but the majority of brands were rated by only a small
number of consumers). The 25 attributes and 12 brands are listed below.

Cereal Brand            Attributes 1-12   Attributes 13-25

All Bran                Filling           Family
Cerola Muesli           Natural           Calories
Just Right              Fibre             Plain
Kellogg’s Corn Flakes   Sweet             Crisp
Komplete                Easy              Regular
Nutrigrain              Salt              Sugar
Purina Muesli           Satisfying        Fruit
Rice Bubbles            Energy            Process
Special K               Fun               Quality
Sustain                 Kids              Treat
Vitabrit                Soggy             Boring
Weetbix                 Economical        Nutritious
                                          Health

In total, 116 respondents provided 235 observations of the 12 selected brands. How do you characterize the
consideration behavior of the 12 selected brands? Analyze and interpret your results using factor analysis.

INSIGHTS

• To find the insights, we used the unsupervised learning technique “Factor Analysis”.
• In total we have 235 observations with 25 variables; using the dimension-reduction technique of
factor analysis, we reduced them to 5 dimensions.
• Before running the factor analysis, we need to decide how many dimensions to reduce the
variables to, based on their correlations. So we first convert the data into matrix form,
as below.

• Use a scree plot to examine the eigenvalue curve.
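The matrix conversion and scree step described above can be sketched in R as follows. This is not the authors' exact script: the `cereal` data frame and its `attr1` ... `attr25` column names are hypothetical stand-ins for the 235 x 25 survey data.

```r
# Sketch: build the attribute matrix, its correlation matrix, and the scree
# curve. `cereal` is a simulated stand-in for the Likert-scale survey data.
set.seed(1)
cereal <- as.data.frame(matrix(sample(1:5, 235 * 25, replace = TRUE),
                               nrow = 235))
names(cereal) <- paste0("attr", 1:25)

cereal_mat <- as.matrix(cereal)   # matrix form required for the analysis
cor_mat <- cor(cereal_mat)        # 25 x 25 correlation matrix

ev <- eigen(cor_mat)$values       # eigenvalues, largest first
plot(ev, type = "b", xlab = "Factor number", ylab = "Eigenvalue",
     main = "Scree plot")         # look for the elbow in the curve
sum(ev > 1)                       # Kaiser's rule: eigenvalues above 1
```

On the real data, the elbow of this curve (together with Kaiser's rule) motivates the choice of 5 factors.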

• Compute the factor loadings without rotation to inspect the loadings.

Results of factor loading without rotation:
• Test of the hypothesis that 5 factors are sufficient.
• The chi-square statistic is 319.45 on 185 degrees of freedom.
• The p-value is 3.09e-09.

• Now compute the factor loadings using the “promax” rotation for all the factors to find a
better grouping model.
• The R programs for the different factor loadings are given below, and the best one is chosen.
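The `factanal()` calls described above can be sketched as follows. The data matrix here is simulated with a built-in 5-factor structure as a hypothetical stand-in for the survey matrix; the chi-square of 319.45 on 185 df reported in the text comes from the real data, not from this simulation.

```r
# Sketch: ML factor analysis with 5 factors, without and with promax rotation.
set.seed(42)
n <- 235; p <- 25; k <- 5
L <- matrix(runif(p * k, -0.7, 0.7), p, k)          # hypothetical loadings
x <- matrix(rnorm(n * k), n, k) %*% t(L) +
     matrix(rnorm(n * p), n, p)                     # factors + noise

fa_none <- factanal(x, factors = 5, rotation = "none")
fa_none$dof        # 185 degrees of freedom for p = 25 variables, k = 5 factors
fa_none$PVAL       # p-value of the "5 factors are sufficient" test

fa_pro <- factanal(x, factors = 5, rotation = "promax")
print(fa_pro$loadings, cutoff = 0.3, sort = TRUE)   # cleaner grouping view
```

Sorting with a cutoff, as in the last line, is what makes the attribute groupings behind each factor easy to read off.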

• Once the factor loadings with rotation are done, we filter them to find the factor solution
that forms the cleanest grouping.
• We choose Factor 2 as the best grouping model.

• The scatter plot grouped according to Factor 2:

• We formed the groups from the Factor 2 data as below:

• Now we assign names to the factors accordingly, as below.

Conclusion: Problem 1

• We find the average values for the 12 brands across the newly formed 5 dimensions by
assigning the new factors as below.

• After renaming the reduced columns and formatting, we get the results below.
• Thus the 5 major categories obtained through factor analysis summarize the entire
consideration behavior of the 235 observations of the 12 brands.

FINAL RESULT:

Problem 2

Leslie Salt Data Set

In 1968, the city of Mountain View, California, began the necessary legal proceedings to acquire a
parcel of land owned by the Leslie Salt Company. The Leslie property contained 246.8 acres and was
located right on the San Francisco Bay. The land had been used for salt evaporation and had an
elevation of exactly sea level. However, the property was diked so that the waters from the bay
were kept out. The city of Mountain View intended to fill the property and use it for a city park.

Ultimately, it fell to the courts to determine a fair market value for the property. Appraisers were
hired, but what made the process difficult was that there were few sales of bayland property, and
none of them corresponded exactly to the characteristics of the Leslie property. The experts involved
decided to build a regression model to better understand the factors that might influence market
valuation. They collected data on 31 bayland properties that were sold during the previous 10 years. In
addition to the transaction price for each property, they collected data on a large number of other
factors, including size, time of sale, elevation, location, and access to sewers. The listing of these data
includes only those variables deemed relevant for this exercise. A description of the variables is
provided below.

INSIGHTS

Variable name   Description

Price           Sales price in $000 per acre
County          San Mateo = 0, Santa Clara = 1
Size            Size of the property in acres
Elevation       Average elevation in feet above sea level
Sewer           Distance (in feet) to nearest sewer connection
Date            Date of sale, counting backward from current time (in months)
Flood           Subject to flooding by tidal action = 1; otherwise = 0
Distance        Distance in miles from the Leslie property (in almost all cases, this is toward San Francisco)

Discuss and Answer the following questions:

1. What is the nature of each of the variables? Which variable is dependent variable and what are
the independent variables in the model?

2. Check whether the variables require any transformation individually

3. Set up a regression equation, run the model and discuss your results

Nature of the Variables:

Variable    Description                                                   Type of data              Nature of variable

Price       $000 / acre                                                   numerical                 Dependent
County      San Mateo = 0, Santa Clara = 1                                categorical to numerical  Independent
Size        in acres                                                      numerical                 Independent
Elevation   avg elevation in feet above sea level                         numerical                 Independent
Sewer       distance in feet                                              numerical                 Independent
Date        total months                                                  numerical                 Independent
Flood       tidal action = 1, otherwise = 0                               categorical to numerical  Independent
Distance    distance in miles from Leslie property toward San Francisco   numerical                 Independent

• Since we are going to predict the fair market value of the Leslie property, the response
(dependent) variable for the dataset is ‘Price’, while the other variables are treated as
predictor (independent) variables.
• We import the dataset into R and prepare it for regression analysis. From the summary
details, we can conclude that there are no NULL values in the dataset. Though outliers
are present, we will analyze them along with the regression equation before deciding
whether to remove or keep them for this model.

• The next step in creating the estimated regression equation is to observe what type of
relationship exists between the dependent variable and the independent variables. In this
dataset, we have one dependent variable (‘Price’) and 7 independent variables: ‘County’,
‘Size’, ‘Elevation’, ‘Sewer’, ‘Date’, ‘Flood’, ‘Distance’.
• The correlation matrix with coefficient values for each pair of variables is shown below. A
correlation coefficient between two variables greater than 0.7 indicates a strong positive
correlation, while one less than -0.7 indicates a strong negative correlation. Based on the
chart, the variables ‘Distance’ and ‘County’ show a strong negative correlation, which may
reflect that Santa Clara County (‘1’ in the dataset) is far away from the Leslie Salt property.

• Also, there is a slightly positive correlation of ‘Date’ and ‘Elevation’ with ‘Price’, which
can make them key predictors of the price value.
• Since none of the independent variables show any non-linear relationship with the
dependent variable (‘Price’), we assume that our first linear regression equation
contains all independent variables in first-order linear form.

Linear regression model – Model 1:


• The output of the linear regression model with all independent variables is shown below.
A t-test for each independent variable checks whether it makes a significant
contribution to the model; the results show that ‘Size’, ‘Date’ and ‘Distance’ fail
the t-test, indicating that they do not add a statistically significant contribution
to predicting the value of ‘Price’.
• With reference to the F-test, the p-value is less than 0.05, so the linear regression model is
statistically valid for prediction. The adjusted R-squared is 0.67, meaning the model
explains 67% of the variability in the given data.
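The first model can be sketched in R as below. The `leslie` data frame here is simulated as a hypothetical stand-in for the 31 bayland sales (column names follow the variable table above; the generating coefficients are invented, not the report's estimates).

```r
# Sketch of Model 1: Price regressed on all 7 predictors in first-order form.
set.seed(7)
n <- 31
leslie <- data.frame(                      # hypothetical stand-in data
  County    = rbinom(n, 1, 0.5),
  Size      = runif(n, 10, 300),
  Elevation = runif(n, 0, 20),
  Sewer     = runif(n, 0, 10000),
  Date      = runif(n, 0, 120),
  Flood     = rbinom(n, 1, 0.3),
  Distance  = runif(n, 0, 15)
)
leslie$Price <- with(leslie, 40 + 3 * Elevation - 0.002 * Sewer -
                     2 * Distance - 10 * Flood + rnorm(n, 0, 8))

model_1 <- lm(Price ~ County + Size + Elevation + Sewer + Date +
              Flood + Distance, data = leslie)
summary(model_1)   # per-coefficient t-tests, overall F-test, adj. R-squared
```

`summary()` is where the t-test, F-test and adjusted R-squared figures discussed in the text are read from.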

For this linear model, the following conditions should be satisfied:
• The dependent and independent variables follow a normal distribution (should pass a
normality test).
• A linear relationship exists, as studied via the correlation matrix above: the correlation
coefficient of each independent variable with the dependent variable is non-zero.
• No multicollinearity between independent variables (VIF should be less than 10).
• No autocorrelation in the residuals (should pass ‘dwtest’).
• Homoscedasticity of the residuals should hold (should pass ‘gqtest’).

1. The Shapiro test rejects the null hypothesis for all variables; thus none of the variables are
normally distributed.

2. We compute the variance inflation factor (VIF) for each independent variable; a VIF greater
than 10 indicates strong collinearity between two variables. From the output, it is evident
that no two variables are collinear.

3. Autocorrelation is tested with the Durbin-Watson test. The DW statistic is 2.42, indicating
that autocorrelation is present in the residuals.

4. The Goldfeld-Quandt test detects heteroscedasticity in the linear model. The null hypothesis
is rejected, showing that the residual variance is not constant across the fitted values.

5. The plot of residuals against fitted values shows that the residuals are not randomly
scattered in the linear regression model.
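The diagnostic checks above can be sketched as follows. The report uses `car::vif`, `lmtest::dwtest` and `lmtest::gqtest`; the manual formulas below compute the same quantities in base R so the sketch runs without extra packages, on a small simulated stand-in model.

```r
# Sketch of the five diagnostics on a hypothetical fitted model `m`.
set.seed(7)
n <- 31
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
m <- lm(y ~ x1 + x2 + x3, data = d)
e <- resid(m)

# 1. Normality (Shapiro-Wilk), here applied to the residuals
shapiro.test(e)$p.value

# 2. VIF for x1: regress it on the other predictors; VIF = 1 / (1 - R^2)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3, data = d))$r.squared)

# 3. Durbin-Watson statistic: squared successive residual differences over
#    the residual sum of squares (a value near 2 means no autocorrelation)
dw <- sum(diff(e)^2) / sum(e^2)

# 4. Goldfeld-Quandt idea: residual variance ratio between the halves of the
#    sample ordered by fitted values (far from 1 suggests heteroscedasticity)
o <- order(fitted(m)); h <- n %/% 2
gq <- var(e[o][(n - h + 1):n]) / var(e[o][1:h])

# 5. Residual plot against fitted values
plot(fitted(m), e, xlab = "Fitted", ylab = "Residuals"); abline(h = 0)
```

The packaged tests add the reference distributions and p-values; the statistics themselves are the ones computed here.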

To overcome the issues with heteroscedasticity, autocorrelation and the residual plot,
we can identify outliers and highly influential observations in the dataset and decide whether
to remove them, or transform variables in order to reduce the magnitude of the residual error.
To find outliers and highly influential observations, studentized deleted residuals are
used, as they are among the most effective ways of identifying outliers, catching cases that
even standardized residuals miss. If a studentized deleted residual is greater than +2.447
or less than -2.447, then that observation is considered an outlier for the model, and it is
reasonable to remove such a highly influential observation from the linear regression model.
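This outlier screen can be sketched with `rstudent()`, which returns studentized deleted residuals; the 2.447 cutoff in the report is presumably a t-quantile for its residual degrees of freedom. The model below is a hypothetical stand-in with an outlier planted at observation 26, mirroring the finding in the text.

```r
# Sketch: flag observations whose studentized deleted residual exceeds 2.447.
set.seed(7)
d <- data.frame(x = rnorm(31))
d$y <- 2 * d$x + rnorm(31)
d$y[26] <- d$y[26] + 10            # plant an outlier at observation 26
m <- lm(y ~ x, data = d)

t_del <- rstudent(m)               # studentized deleted residuals
which(abs(t_del) > 2.447)          # flagged observations
```

Removing the flagged rows and refitting is then a one-liner: `update(m, data = d[abs(t_del) <= 2.447, ])`.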

• From the R output, we can see that observation 26 is an outlier in the dataset and highly
influences the regression model. Removing it from the dataset and re-running the linear
regression model can help improve the fit.

Linear Regression - Model 2

Removing the outlier from the data and re-running the model gives better results in terms of
how well the model fits the variability of the data. The adjusted R-squared value increased from
0.67 to 0.7454, and the t-tests show a significant contribution to the model from all independent
variables except ‘County’, ‘Size’, ‘Sewer’ and ‘Distance’. The F-test shows that the model is
statistically significant with p-value < 0.05.

Still, the normality test and the tests for autocorrelation and homoscedasticity are not resolved
by Model 2, as these problems persist. Transforming the variables can help reduce the
autocorrelation and heteroscedasticity issues. To identify the variables that require
transformation, a scatterplot of the model residuals against each variable is drawn to see
whether any pattern exists.
The scatter plot on the right shows that there is no randomness in the residual plot: the
residuals increase as the dependent variable increases. This can cause heteroscedasticity,
so transforming the dependent variable can help solve the issue.
A Tukey transformation is applied to the dependent variable using the transformTukey function
in R, as shown. The Tukey transformation is a form of power transformation used to bring a
non-normal variable closer to a normal distribution and to mitigate autocorrelation and
heteroscedasticity. The output below shows the transformation of the dependent variable and a
Shapiro test confirming that the transformed data is normally distributed.
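The report uses `rcompanion::transformTukey`; the same idea can be sketched in base R as a grid search for the power lambda that maximizes the Shapiro-Wilk W statistic of the transformed variable (with log(y) at lambda = 0, and -(y^lambda) for negative lambda to preserve ordering). `y` here is hypothetical right-skewed data standing in for Price.

```r
# Sketch of a Tukey ladder-of-powers transformation via Shapiro-Wilk W.
set.seed(1)
y <- rlnorm(31, meanlog = 3, sdlog = 0.8)   # skewed stand-in for Price

tukey <- function(l, y) {
  if (abs(l) < 1e-10) log(y) else if (l > 0) y^l else -(y^l)
}
lambdas <- seq(-2, 2, by = 0.025)
w <- sapply(lambdas, function(l) shapiro.test(tukey(l, y))$statistic)

best <- lambdas[which.max(w)]               # lambda with the best W
y_t <- tukey(best, y)                       # transformed dependent variable
shapiro.test(y_t)$p.value                   # normality check after transform
```

Since lambda = 1 (no transformation) is in the grid, the chosen transform can never look worse than the raw variable under this criterion.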

Linear Regression Model - 3

• With the transformed dependent variable, the linear model is re-run to compute the model
parameters and compare the results with the previous model.
• As is clearly evident, the adjusted R-squared increased from 0.7454 to 0.7927, improving the
overall fit of the model to the given dataset; the model now explains 79.27% of the variability
in the data. The number of variables failing the t-test dropped from 3 to 2: ‘County’ and ‘Size’.

• Since these 2 independent variables do not pass their individual t-tests, they are not
contributing significantly to the model. We therefore need to decide between transforming
each variable and dropping it.

• Since the ‘County’ variable takes only the values 0 or 1 (it is categorical in nature), it is
more reasonable to study the residuals for the ‘Size’ variable. The scatter plot on the right,
of the Model 3 residuals vs. the ‘Size’ variable, shows that there are outliers in the ‘Size’
variable, though the residuals themselves are not considered outliers. Hence, we can
transform the ‘Size’ variable and observe the results.

Linear Regression Model - 4

• It is clearly seen from the R output below that the adjusted R-squared value increased from
0.7927 to 0.8143 (81.43%, a very good level of fit to the data). This is attained by
transforming the ‘Size’ variable.

Linear Regression Model - 5

• Now ‘County’ is the only variable that fails the t-test, indicating that it contributes very
little to the model’s prediction of the y value. We need to decide between retaining the
variable and dropping it. For this, we used the best-subset regression technique to rank
the variables, in order to keep a specific number of independent variables in the linear
model.
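A best-subset search (e.g. with `leaps::regsubsets`, which the report's output appears to come from) can be sketched exhaustively in base R for a handful of predictors: fit every subset and compare adjusted R-squared. The data below are a hypothetical 3-predictor stand-in, not the Leslie variables.

```r
# Sketch: exhaustive best-subset selection by adjusted R-squared.
set.seed(7)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
d$y <- 2 * d$x1 - d$x2 + rnorm(40)        # x3 is pure noise

preds <- c("x1", "x2", "x3")
subsets <- unlist(lapply(seq_along(preds),
                         function(k) combn(preds, k, simplify = FALSE)),
                  recursive = FALSE)      # all 7 non-empty subsets

adj_r2 <- sapply(subsets, function(s) {
  f <- reformulate(s, response = "y")     # e.g. y ~ x1 + x2
  summary(lm(f, data = d))$adj.r.squared
})
subsets[[which.max(adj_r2)]]              # subset with highest adjusted R^2
```

Ranking subsets this way shows directly which variable is the cheapest to drop, which is how 'County' is identified in the text.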

• The best-subset regression output in R shows that ‘County’ is the first variable to be
dropped, compared with the other independent variables, when building a model with 6
independent variables. Thus we drop the ‘County’ variable, run the linear regression in
Model 5, and interpret the results.

• The adjusted R-squared value increased from 0.8143 to 0.8215. The t-test is passed by all
independent variables except ‘Size’ (p-value 0.06), and the F-test is passed by the model,
which shows that the model is statistically significant for predicting the dependent
variable, i.e. it gives a better estimate of the mean of the dependent variable.

Linear Regression Model - 6

• This model finds the outliers under the new estimated regression equation with the
transformed variables; the steps are similar to those described earlier, using studentized
deleted residuals.

• The final linear regression model gives an adjusted R-squared value of 0.8833, up from
0.8215, which shows that the final model explains 88.33% of the variability in the dataset
and is well fitted for giving better confidence interval and prediction estimates for the
dependent variable. Also, all independent variables pass the t-test, and the multiple
linear regression model passes the F-test.
• The R output of the VIF test for the final model shows that none of the selected
independent variables exhibit multicollinearity.

• The Goldfeld-Quandt test shows that there is no heteroscedasticity in the model, i.e. the
residual variance is constant.
• The Durbin-Watson test shows that there is no autocorrelation, with a p-value of 0.823.
The Shapiro test shows that the residuals follow a normal distribution.

• Also, based on the residual plot of the final model, it is evident that the residuals are
randomly distributed against the fitted values.

• Thus, our final estimated linear regression model is ready for predicting the dependent
variable.

Conclusion – Problem 2

Predicting the fair market value of the Leslie Salt property using the linear regression model.

The following are the values of the independent variables, given and assumed, used to predict
the market value of the Leslie Salt property:

1. County – Santa Clara (“1”) (assumed)
2. Size – 246.8 acres (as given)
3. Elevation – 0, at sea level (as given)
4. Sewer – 0 (assumed)
5. Date – +12 months from now (assumed)
6. Flood – No (“0”) (as given)
7. Distance – 0 (as given)

The R output shows the predicted value of price from the final linear regression model. Based on
our model, the fair market value of the Leslie Salt property is $18,318/acre, assuming it is bought
1 year (12 months) from now. The output also gives the prediction interval with upper and lower
bounds (at 95% confidence) for the given independent variables: the fair market value lies
between $10,395/acre and $29,902/acre.
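The prediction step can be sketched with `predict()` and `interval = "prediction"`. The model and data below are hypothetical stand-ins; the $18,318/acre point estimate and the $10,395-$29,902 band in the text come from the report's actual final model.

```r
# Sketch: point prediction plus a 95% prediction interval for a new property.
set.seed(7)
n <- 31
d <- data.frame(Elevation = runif(n, 0, 20), Date = runif(n, 0, 120))
d$Price <- 5 + 2 * d$Elevation + 0.05 * d$Date + rnorm(n, 0, 3)
m <- lm(Price ~ Elevation + Date, data = d)

new_prop <- data.frame(Elevation = 0, Date = 12)  # at sea level, +12 months
pred <- predict(m, newdata = new_prop,
                interval = "prediction", level = 0.95)
pred   # columns: fit (point estimate), lwr and upr (prediction band)
```

Note that a prediction interval (for one new observation) is wider than a confidence interval for the mean response, which matches the wide $10k-$30k band reported.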

Problem 3

All Greens Franchise


Explain the importance of X2, X3, X4, X5, X6 on Annual Net Sales, X1. The data (X1, X2, X3, X4, X5, X6)
are for each franchise store.
X1 = annual net sales/$1000
X2 = number sq. ft./1000
X3 = inventory/$1000
X4 = amount spent on advertising/$1000
X5 = size of sales district/1000 families
X6 = number of competing stores in district

Explain the importance of X2, X3, X4, X5, X6 on Net sales, X1


• We import the dataset into R and prepare it for regression analysis. From the summary
details, we can conclude that there are no NULL values in the dataset. Though outliers
are present, we will analyze them along with the regression equation before deciding
whether to remove or keep them for this model.

• Assigning column names as shown gives a better understanding when building the model
and finding the significance of each variable on net sales.

• The correlation matrix with coefficient values for each pair of variables is shown below. A
correlation coefficient between two variables greater than 0.7 indicates a strong positive
correlation, while one less than -0.7 indicates a strong negative correlation.
• Based on the chart, all independent variables show strong correlation with the
dependent variable ‘Net Sales’.
• Since none of the independent variables show any non-linear relationship with the
dependent variable (‘Net Sales’), we assume that our first linear regression equation
contains all independent variables in first-order linear form.

Linear Regression: Model 1

• The output of the linear regression model with all independent variables is shown below. A
t-test for each independent variable checks whether it makes a significant contribution
to the linear regression model, and the results show that all independent variables
add a statistically significant contribution to predicting the value of net sales.

• With reference to the F-test, the p-value is less than 0.05, so the linear regression model is
statistically valid for prediction. The adjusted R-squared is 0.99, meaning the model
explains 99% of the variability in the given data.
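The Problem 3 setup can be sketched as below: rename X1 ... X6 to descriptive names and fit the full first-order model. The data frame is a hypothetical stand-in (27 rows, matching the outlier indices 6 and 27 mentioned later; the generating coefficients are invented).

```r
# Sketch: rename columns and fit Net Sales on all five predictors.
set.seed(7)
n <- 27
greens <- data.frame(X2 = runif(n, 1, 6),      # sq. ft. / 1000
                     X3 = runif(n, 100, 600),  # inventory / $1000
                     X4 = runif(n, 2, 12),     # advertising / $1000
                     X5 = runif(n, 5, 16),     # district size / 1000 families
                     X6 = sample(0:15, n, replace = TRUE))  # competitors
greens$X1 <- with(greens, -45 + 32 * X2 + 14 * X4 + 14 * X5 - 3 * X6 +
                  rnorm(n, 0, 10))             # annual net sales / $1000

greens <- greens[c("X1", "X2", "X3", "X4", "X5", "X6")]
names(greens) <- c("NetSales", "Area", "Inventory", "Advertising",
                   "DistrictSize", "CompetingStores")

model_1 <- lm(NetSales ~ ., data = greens)     # full first-order model
summary(model_1)$adj.r.squared
```

The `~ .` shorthand regresses NetSales on every other column, which is exactly the "all independent variables" Model 1 described above.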

For this linear model, the following conditions should be satisfied:
• The dependent and independent variables follow a normal distribution (should pass a
normality test).
• A linear relationship exists, as studied via the correlation matrix above: the correlation
coefficient of each independent variable with the dependent variable is non-zero.
• No multicollinearity between independent variables (VIF should be less than 10).
• No autocorrelation in the residuals (should pass ‘dwtest’).
• Homoscedasticity of the residuals should hold (should pass ‘gqtest’).

• The VIF test indicates that the ‘Inventory’ variable is above 10.0, which signals strong
multicollinearity, so the variable needs to be dropped. Also, the plot of residuals against
fitted values shows that the residuals are not randomly scattered in the linear regression
model.

• To overcome the issues with multicollinearity and the residual plot, we can identify
outliers and highly influential observations in the dataset and decide whether to remove
them, or drop the ‘Inventory’ variable in order to reduce the magnitude of the residual
error.

• To find outliers and highly influential observations, studentized deleted residuals are
used, as they are among the most effective ways of identifying outliers, catching cases
that even standardized residuals miss. If a studentized deleted residual is greater than
+2.447 or less than -2.447, then that observation is considered an outlier for the model,
and it is reasonable to remove such a highly influential observation from the linear
regression model.

• From the R output, we can see that observations 6 and 27 are outliers and highly
influential observations in the regression model. Removing them from the dataset and
re-running the linear regression model can help improve the fit.

Linear Regression – Model 2

• Removing the outliers from the data and re-running the model gives better results in terms
of how well the model fits the variability of the data. The adjusted R-squared value
increased from 0.99 to 0.9913, and the t-tests show that all independent variables
contribute significantly to the model. The F-test shows that the model is statistically
significant with p-value < 0.05.

• Thus, the above linear regression model indicates the significance of each variable for the
dependent variable ‘Net Sales’. The residual plot on the right shows randomness between
the residuals and fitted values.

Conclusion: Problem 3

Following is the linear regression equation after refining the model:

Net sales = -45.24 + 32.18(Area) + 13.99(Advertising) + 14.23(sales district size) - 3.29(no. of competing stores)

While the other variables are kept constant, net sales increase by:

1. 32.18 units for each 1-unit increase in store area (sq. ft./1000)
2. 13.99 units for each 1-unit increase in the amount spent on advertising (/$1000)
3. 14.23 units for each 1-unit increase in sales district size (/1000 families)

And net sales decrease by:

1. 45.24 units (the intercept) when all independent variables are zero
2. 3.29 units for each additional competing store, with the others kept constant.
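The refined equation above can be checked with a quick worked example for a hypothetical store: 3 (thousand sq. ft.) of area, $5k of advertising, a district of 10 (thousand families), and 2 competing stores.

```r
# Worked check: plug hypothetical store values into the refined equation.
net_sales <- -45.24 + 32.18 * 3 + 13.99 * 5 + 14.23 * 10 - 3.29 * 2
net_sales   # 256.97, i.e. about $256,970 in annual net sales
```

The same arithmetic is what `predict()` performs internally once the final model is fitted.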

The above points show the statistical significance of each variable on net sales.
