
WINE PRODUCTION REPORT

DATA ANALYSIS ON WINE PRODUCTION


DATUM Team
Nikhil Sebastian Sithara
Pradeep Daniel Neeladevi Natarajan




TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
DATA PROCESSING
DATA ANALYSIS
    Finding Correlations
    Classification Tree
    Linear Regression
SUMMARY OF ANALYSIS
    Findings from Classification Tree
    Findings from Regression
INSIGHTS
APPENDIX
    Plotting Commands
    Commands for Decision Tree
        White Wine Tree Creation
        Red Wine Tree Creation
    Commands for Linear Regression







ABSTRACT

We, the Datum team, selected wine quality as the subject of this study. To gain better business insight,
we applied several complementary data mining techniques to the data. We study two types of wine
(red and white), which have different physicochemical characteristics. There are many wineries across
the globe; our data pertains to a particular winery in Portugal. Our dataset contains only the basic
ingredients of wine preparation, along with the quality as rated by wine experts. Our study examines
the combinations of ingredients and their influence on quality. We first classified quality with a
decision tree, which identifies the classifiers that influence wine quality. We then applied linear
regression, which not only helps in predicting quality but also identifies the significant ingredients.
The models we created could be used independently for quality prediction or as a support to wine
tasting evaluations by experts, and could help in improving wine production.

INTRODUCTION

Wine is a beverage made from fermented grape and other fruit juices, with a relatively low alcohol
content. The quality of a wine is graded based on its taste and vintage. Wine tasting is as ancient as
wine itself. When it comes to the quality of wine, many factors or attributes other than flavor come
into consideration. The dataset that we chose to analyze, Wine Quality, represents the quality of wines
(white and red) based on different physicochemical attributes (fixed acidity, volatile acidity, citric
acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and
alcohol). The quality score for each wine in our dataset ranges from 0 (lowest) to 10 (highest). This
report uncovers some important relationships between the chemical contents of wine, such as acidity
and sugar levels, and its quality. The dataset covers a wide and varied range of chemical and acidic
combinations for the two types of wine (white and red). By employing suitable data analysis
techniques, we can unearth a handful of important and interesting insights that would be helpful in
producing better quality wines and that would also be valuable to the financial and business sides of
the production company.




All the attributes in the dataset are numerical except for quality, which is ordinal, and Wine type,
which is nominal. After a thorough study of the dataset, its attributes and their value ranges, we
zeroed in on three insights; these are supported by analysis techniques such as decision trees and
linear regression. The business strategies and added value depend on the accuracy of the developed
models and on how well the findings are proven. A production run can be considered successful if the
final product is produced at minimum cost and maximum quality.

An important insight that we were able to obtain is how much each chemical component contributes to
the quality of the wine, and how we can grade the quality level of a newly produced wine. We decided
to create a classification decision tree that arrives at a quality class label, using the attributes
with the highest information gain as nodes. Through a decision tree, we can identify the
distinguishing factors that affect the quality level of a newly created wine and thereby set a
reasonable market price for it.

Identifying which of the thirteen attributes have a statistically significant influence on wine quality
is an essential finding. By employing linear regression analysis, we arrive at a model that highlights
the significant attributes. The result of this regression analysis is helpful in production and in
quality prediction, because it shows the impact of those significant attributes on quality.

The main focus of this report is to illustrate how those attributes are analyzed using data mining
techniques in R. It also explains in detail our team's findings from this analysis. All the findings
are presented with clear explanations in layman's terms, together with a full technical explanation of
the data mining approaches and algorithms behind them.

DATA PROCESSING

The data set we considered for our analysis describes some of the physicochemical characteristics of
red and white wines from the Vinho Verde region of Portugal. Due to privacy and logistical issues, only
the physicochemical (input) and sensory (output) variables are available. There is no data about grape
types, wine brand, wine selling price, etc. The data set is available through the UCI Machine Learning
Repository.

As an initial preprocessing step, we checked for missing values in the dataset and found none. We
looked for duplicated rows and removed them from the dataset. We studied the range and the units
used for each attribute and confirmed that they are the right measurements and lie in the natural
range for ingredients of wines. The dataset contains a large number of outliers, some of them extreme.
However, when we observed an extreme outlier for a particular attribute, the other attributes of the
same observation were still in range, so we decided against discarding such observations. The extreme
outliers are also helpful in finding interesting facts.
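
The preprocessing checks described above can be sketched in base R as follows. This is a minimal illustration, not the commands the team actually ran: the data frame `wine`, its values, the helper `is_extreme` and the 3 x IQR cutoff for "extreme" are all assumptions made for the example.

```r
# Toy stand-in for the wine data; the values are made up for illustration.
wine <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 10.0, 10.2, 10.5, 14.9, 9.4),
  chlorides = c(0.076, 0.098, 0.098, 0.092, 0.075, 0.069, 0.080, 0.076),
  quality   = c(5, 5, 5, 6, 5, 6, 7, 5)
)

# 1. Count missing values per column.
missing_per_column <- colSums(is.na(wine))

# 2. Drop exact duplicate rows, keeping the first occurrence of each.
wine_unique <- wine[!duplicated(wine), ]

# 3. Flag extreme outliers in one attribute. Assumed rule: a value is extreme
#    when it lies more than k = 3 interquartile ranges beyond the quartiles.
is_extreme <- function(x, k = 3) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}
extreme_alcohol <- is_extreme(wine_unique$alcohol)

# Flagged rows are only inspected here, not removed.
wine_unique[extreme_alcohol, ]
```

Flagging rather than deleting matches the decision described above to keep observations whose other attributes are in range.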



DATA ANALYSIS

The analysis of this data can be viewed primarily as a classification problem. The classes are ordered
and not balanced. The dataset contains data for red and white wines, which share a common data
structure. It contains 11 explanatory variables, which are physicochemical parameters of a wine, and
1 response variable, quality. The quality rating is on a scale from 0 to 10. Each data case is one
wine, and its quality rating is the median of the scores given by multiple judges.

Finding Correlations

The goal of our analysis is to predict quality based on the 11 physicochemical characteristics. Before
selecting the appropriate modeling technique and the most significant attributes to be used for
modeling, we studied the relationship between each individual attribute and quality.

The results discussed below are based on our observations on the entire data set of white and red
wines. [We did the same analysis individually on the white and red wine data as well. For the
significant attributes discussed below, we observed similar relationships in both types.]
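
As a numerical complement to the plots, the relationship between each attribute and quality can be summarized with a correlation coefficient. The sketch below uses synthetic data, not the wine data: the data frame `toy`, the seed and the coefficients used to generate `quality` are assumptions chosen only so that the example runs on its own.

```r
set.seed(1)  # assumption: fixed seed so the toy data are reproducible
n <- 200
alcohol <- runif(n, 8, 14)
volatile.acidity <- runif(n, 0.1, 1.2)

# Synthetic quality, built to rise with alcohol and fall with volatile acidity.
quality <- round(3 + 0.4 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.5))
toy <- data.frame(alcohol, volatile.acidity, quality)

# Pearson correlation of each attribute with quality.
predictors <- setdiff(names(toy), "quality")
cor_with_quality <- sapply(toy[predictors], cor, y = toy$quality)

# Spearman's rank correlation is a reasonable alternative for an ordinal score.
spearman_with_quality <- sapply(toy[predictors], cor, y = toy$quality,
                                method = "spearman")

cor_with_quality
spearman_with_quality
```

On the real data, the same two `sapply` calls could be applied to the eleven attribute columns; the sign of each coefficient should agree with the direction visible in the corresponding plot.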

[Figures 1 to 6: plots of quality against volatile acidity, residual sugar, chlorides, total sulfur
dioxide, sulphates and alcohol, respectively.]



Figure 1 is a plot of volatile acidity against quality. The plot shows a negative relationship between
quality and this explanatory variable. Volatile acidity is a measure of acetic acid in wine and an
indicator of the freshness of the wine. Its level increases through oxidation when wine is exposed to
air, which produces a vinegar smell; in extreme cases of oxidation, the wine more or less turns into
vinegar. Hence the negative relationship seems logical.

Figure 2 illustrates the relationship between quality and the residual sugar in the wine. Again this is
a negative relationship, but the trend is weak and does not hold for lower quality wines. Residual
sugar typically refers to the sugar remaining in the wine after fermentation is stopped; in some cases
sugar may also be added to the wine. The negative relationship is aligned with the general perception
that drier wines are of higher quality.

Figure 3 illustrates the negative relationship between quality and the chlorides present in the wine.
Chlorides give wine its saltiness. In the plot, the negative trend is most apparent at the higher
quality grades. The acceptance of saltiness varies by region.

Figure 4 illustrates the relationship between quality and total sulfur dioxide. We observe a negative
relationship between this attribute and quality. Sulfur dioxide is a preservative added to wines to
protect them against oxidation. The high quality wines have low levels of total SO2, which suggests
that their manufacturers achieved the right balance between preservation and maintaining aroma. Wines
graded average show a wide range of SO2 values, implying that manufacturers use varying techniques to
strike this balance.

Figure 5 illustrates the relationship between quality and sulphates. Sulphates and sulfur dioxide both
act as preservatives for the wine. As with SO2, we observed a negative relationship to quality.

Figure 6 illustrates the relationship between quality and alcohol. This is the only prominent positive
relationship that we observed. The poorest quality wines have the lowest levels of alcohol, and the
higher quality wines have high alcohol levels. A higher level of alcohol is a desirable characteristic
for high quality wines. In our plots this tendency is clearly visible, especially in the case of red
wines.



Classification Tree

Classifiers are the distinguishing factors that help in quality analysis. A decision tree is a
statistical model that identifies the classifiers with the highest information gain and the quality
response that corresponds to them.

From the plots and our observations on the data, it is clear that each chemical component and acidic
compound has either a negative or a positive impact on the quality of the wine. After studying the
impact of each of these attributes on the quality grades, we decided to create a model that can grade
the quality of new wine products from their chemical components and acidic compounds.

We created a model that focuses only on the attributes that carry the most information about, and have
the most impact on, the quality grade given by experts. We decided to build a classification decision
tree that contains only the attribute nodes with the highest information gain, that is, the largest
reduction in entropy. Classification can be described as a form of data analysis in which we create
data models that describe important data classes. Classification has two steps: a learning step, in
which we create a classification model, and a classification step, in which we use the model to predict
class labels. A decision tree can be understood as a flowchart-like tree structure, where each non-leaf
node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node
denotes a class label.
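
To make the notion of information gain concrete, the sketch below computes the entropy of a set of quality labels and the gain obtained by splitting on an attribute threshold. It illustrates the concept only and is not a description of what the R tree-building functions do internally; the function names `entropy` and `info_gain` and the eight example wines are assumptions made up for this example.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes, since 0 * log(0) is taken as 0
  -sum(p * log2(p))
}

# Information gain from splitting the labels on the test `x < threshold`.
info_gain <- function(x, labels, threshold) {
  left  <- labels[x < threshold]
  right <- labels[x >= threshold]
  weighted <- (length(left) * entropy(left) +
               length(right) * entropy(right)) / length(labels)
  entropy(labels) - weighted
}

# Made-up example: eight wines with an alcohol level and a quality grade.
alcohol <- c(9.0, 9.2, 9.5, 9.8, 11.0, 11.5, 12.0, 12.5)
grade   <- c(5, 5, 5, 5, 6, 6, 6, 6)

gain_good_split <- info_gain(alcohol, grade, threshold = 10.5)  # separates the grades
gain_poor_split <- info_gain(alcohol, grade, threshold = 9.1)   # isolates a single wine
```

A split that separates the grades cleanly removes all uncertainty about the label, so its gain equals the entropy of the unsplit data; a split that isolates one wine leaves most of the uncertainty in place and gains much less.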

First, we created a tree in R for the red wine data, with the ordinal variable quality as the class
label. We provided all the attributes except Wine.Type as candidates for tree nodes.





By studying the tree and its summary, we identified that only three attributes, alcohol, sulphates and
volatile.acidity, have high information gain.

Second, we created a separate decision tree for the white wine data, again with the response variable
quality as the class label and all attributes except Wine.Type as candidate tree nodes. The resulting
tree is shown below.


From the summary of the tree, we identified that only three attributes, alcohol, sulphates and
volatile.acidity, have high information gain.

Linear Regression

Multiple linear regression is a technique used to analyze the influence of a number of independent
variables on a dependent variable. The linear model provides the following, all of which are helpful
for the analysis:

- Variation in the dependent variable corresponding to variation in the independent variables
- Prediction of the value of the dependent variable for given values of the independent variables
- Estimation of the influence of each independent variable on the dependent variable, holding the
  other independent variables constant
- The reliability of the model, based on the degree to which the independent variables explain the
  dependent variable
- Standard errors between the predictions and the observations
- Significance of the model at a given confidence level
- Individual significance of the independent variables

Together, these outputs indicate whether the data provided is sufficient to estimate the dependent
variable.


To accommodate the differences in the range of particular attributes between the two wine types
(e.g. alcohol levels), we split the dataset and performed regression separately on each type. In
addition to examining these statistics, we validated the model against observed data. The data set is
split randomly into a train and a test dataset. The train dataset is used to fit the model, while the
test dataset is used to validate it. Validating on data that was not used for fitting gives a more
reliable picture of the prediction error, even though the dataset itself may contain errors.

SUMMARY OF ANALYSIS

Findings from Classification Tree


From the summary, we find that only 43% of the red wine tuples were classified in the decision tree
created using our red wine dataset. Quality grades ranging from 4 to 7 are classified in our red wine
decision tree.

From the residual mean deviance, we find that only 58% of the white wine tuples were classified in the
decision tree created using our white wine dataset. Quality grades ranging from 4 to 7 are classified
in this decision tree as well. The drawback is that the models we created from the dataset only allow
us to classify wines with a quality grade from 4 to 7. Wines with extreme quality grades (0 to 3 and
8 to 10) cannot be classified with these models.

Findings from Regression

Results of fitting the linear model for red wine, predicting the quality of the wine from all the
independent variables:







































From the results of the linear model, we note the following.

- Estimate gives the estimated influence of each independent variable on quality. The sign of the
  estimate indicates whether the influence is positive or negative. For instance, the estimate of
  21.49 with a negative sign for density means that predicted quality falls by about 21.49 points for
  each one-unit increase in density, with the other variables held constant.
- Std. Error measures the uncertainty in each estimated coefficient; a smaller value indicates a more
  precise estimate.
- The t value of each coefficient tests whether the coefficient is different from zero. For example,
  the t value of fixed acidity is Estimate/Standard Error = 0.017914/0.032041 = 0.559.
- Pr(>|t|) is the p-value of the t test; a variable is considered significant when this value is below
  5%. For instance, the p-value of fixed acidity is 0.5762, so this variable is not significant.
- Residual standard error is the square root of the residual sum of squares divided by the residual
  degrees of freedom; it estimates the typical size of a prediction error.
- Adjusted R-squared is the proportion of the variance of quality explained by the given set of
  independent variables, adjusted for the number of variables. In this model, 36.05% of the variation
  in quality is explained by the independent variables.
- The F-statistic and its p-value indicate a statistically significant relationship between the
  independent variables, taken together, and the dependent variable, provided the p-value is below the
  chosen significance level (5% here).
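
The arithmetic behind the t value and p-value of fixed acidity can be checked directly. The estimate and standard error below are the ones quoted above; the residual degrees of freedom are an assumption, derived from the 1000 training rows and 11 predictors used in the appendix commands.

```r
estimate  <- 0.017914   # coefficient for fixed acidity, as quoted in the report
std_error <- 0.032041   # its standard error, as quoted in the report

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# Assumption: 1000 training rows minus 12 estimated coefficients
# (11 predictors plus the intercept) leaves 988 residual degrees of freedom.
df_residual <- 1000 - 12

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = df_residual, lower.tail = FALSE)

# A coefficient counts as significant at the 5% level when p_value < 0.05.
significant_at_5pct <- p_value < 0.05

round(c(t = t_value, p = p_value), 4)
```

The same calculation applies to every row of the coefficient table, so it is a quick way to cross-check any value read off the `summary()` output.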














Prediction of the model:


The fitted model is evaluated on the test data. On evaluation, the observed quality values of the test
data fall within the range of the predicted quality values.
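
One way to quantify this evaluation is to compare the predictions with the observed test values through an error measure. The sketch below is an illustration under stated assumptions rather than the evaluation the team performed: it uses synthetic data (`toy`), a two-predictor model, and two measures the report does not mention (root mean squared error and the share of predictions within one grade of the observed value).

```r
set.seed(42)  # assumption: fixed seed for a reproducible toy example
n <- 300
toy <- data.frame(alcohol = runif(n, 8, 14),
                  volatile.acidity = runif(n, 0.1, 1.2))
toy$quality <- 3 + 0.3 * toy$alcohol - 1.5 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Random train/test split: 200 training rows, 100 test rows.
train_idx <- sample(seq_len(n), size = 200)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- lm(quality ~ alcohol + volatile.acidity, data = train)

# Predictions for unseen rows must be requested through `newdata`.
pred <- predict(fit, newdata = test)

# Root mean squared error, and the share of predictions within one grade.
rmse <- sqrt(mean((test$quality - pred)^2))
within_one <- mean(abs(test$quality - pred) <= 1)

c(rmse = rmse, within_one = within_one)
```

Reporting such a measure alongside the range comparison makes it easier to compare the red and white wine models with each other.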
























INSIGHTS

The final verdict on wine quality rests on sensory perception. With a good prediction model,
manufacturers can simulate sensory testing from the levels of the attributes. Such a decision
support system can be used to assist wine tasting evaluations by experts and could help in
improving wine production.

The approach we have used could be applied by manufacturers to build models for target
markets, capturing the consumer tastes typical of niche markets. The freshness and low alcohol
content of Vinho Verde are two important aspects for a demanding market such as Canada, whereas
freshness and affordable prices are more desirable in the U.S.

The results from the classification give us the most relevant classifiers in the data. This
information can be used for better control of the wine production phase. Given the harvest
conditions, when we know the levels of certain attributes such as acidity or sugars in the produce,
we can adjust the manufacturing process to compensate for them and still achieve the desired
quality levels for the products.

BBC World News article: The world is facing a wine shortage, with global consumer demand
already significantly outstripping supply (http://www.bbc.com/news/world-24746539).

Due to the ongoing vine pull and poor weather, production in the Old World wine countries,
i.e. Europe, has been plummeting. This demand could therefore be met by new wine producers
from the New World wine countries, by introducing different product lines in the average quality
range, i.e. 4 to 7, at a reasonable cost. Such product lines would attract the large market
segment of price-constrained wine consumers and commodity buyers. A classification tree model
derived from the data analysis helps identify the levels of alcohol, volatile acidity and free
sulfur dioxide that should be maintained to meet the quality of a particular product line.










APPENDIX
Plotting Commands

> plot(winedata$quality, winedata$Wine.Type)
> plot(winedata$quality, winedata$fixed.acidity)
> plot(winedata$fixed.acidity,winedata$quality)
> plot(winedata$volatile.acidity,winedata$quality)
> plot(winedata$citric.acid,winedata$quality)
> plot(winedata$residual.sugar,winedata$quality)
> plot(winedata$chlorides,winedata$quality)
> plot(winedata$free.sulfur.dioxide,winedata$quality)
> plot(winedata$total.sulfur.dioxide,winedata$quality)
> plot(winedata$density,winedata$quality)
> plot(winedata$pH,winedata$quality)
> plot(winedata$sulphates,winedata$quality)
> plot(winedata$alcohol,winedata$quality)
> pairs(white)
> pairs(red)













> cor(winedata[,2:13])













Commands for Decision Tree

White Wine Tree Creation:

> library(tree)  # provides tree(); must be loaded before the calls below
> whiteTree <- tree(quality ~ . - Wine.Type, data = wineWhite)
> plot(whiteTree)
> text(whiteTree)
> summary(whiteTree)

Red Wine Tree Creation:

> redTree <- tree(quality ~ . - Wine.Type, data = wineRed)
> plot(redTree)
> text(redTree)
> summary(redTree)

Commands for Linear Regression

> #Split the dataset randomly into Train and Test Dataset
> red_ind <- sample(1:nrow(red_wine), size=1000)
> white_ind <- sample(1:nrow(white_wine),size=4000)

> #Split the dataset based on the random index
> red_train_ds <- red_wine[red_ind,]
> white_train_ds <- white_wine[white_ind,]

> #Collect the Test Dataset
> red_test_ds <- red_wine[-red_ind,]
> white_test_ds <- white_wine[-white_ind,]

> #Creation of Multiple Linear Regression Model

> red_wine_lm <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid +
+     residual.sugar + chlorides + free.sulfur.dioxide +
+     total.sulfur.dioxide + density + pH + sulphates + alcohol,
+     data = red_train_ds)

> white_wine_lm <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid +
+     residual.sugar + chlorides + free.sulfur.dioxide +
+     total.sulfur.dioxide + density + pH + sulphates + alcohol,
+     data = white_train_ds)

> summary(red_wine_lm)
> summary(white_wine_lm)

> #Predict the values for Test data
> #predict() takes the new rows through 'newdata'; a 'data' argument is ignored
> red_test_pred <- predict(red_wine_lm, newdata = red_test_ds)
> white_test_pred <- predict(white_wine_lm, newdata = white_test_ds)
