
School of Information Technology & Engineering
M Tech Software Engineering
DATA MINING TECHNIQUES (SWE 2009)
Review-3 (Win 2018-19)
BIG MART SALES USING DECISION TREE REGRESSION
Submitted by
16MIS0166 P.Medhavini
16MIS0401 V.Krithika
Faculty : Prof Sudha. M

Slot : B1 +TB1
TABLE OF CONTENTS
ABSTRACT
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
1.2 OBJECTIVE OF THE WORK
1.3 SCOPE OF THE WORK
CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
2.2 BACKGROUND
2.3 CHALLENGES
2.4 PROBLEM DEFINITION AND APPROACH
CHAPTER 3
EXPERIMENTAL DETAILS
3.1 MACHINE LEARNING METHODS
3.2 DESIGN FRAMEWORK
3.3 DATA SET, DATA SOURCE, CHARACTERIZATION, PREPROCESSING
3.4 PROCESSING TECHNIQUES
CHAPTER 4
RESULTS AND DISCUSSION
CHAPTER 5
SUMMARY AND CONCLUSIONS
REFERENCES

ABSTRACT
Big Mart sales prediction is about predicting future sales from cumulative sales
reports. The dataset was taken from the Kaggle repository and covers 1559
products across 10 outlets (stores). The Item_Type attribute, which has 16
unique values, is also considered. We use pandas for handling data and NumPy
for numerical operations on arrays. The algorithm used in this thesis is
decision tree regression. Regression predicts a continuous numerical value from
the attributes of a given record. A decision tree can be linearized into
decision rules, where the outcome is the content of the leaf node and the
conditions along the path form a conjunction in the if clause. The aim is to build
a predictive model and find the sales of each product at a particular store. Using
this model, Big Mart can try to understand the properties of the products and
stores that play a key role in increasing sales, and decide where to improve
marketing or stop selling a product.
Keywords: Sales forecast, Decision tree regression, Pandas, NumPy, datasets,
Cumulative sales
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
A Big Mart is a shopping mall that sells a wide variety of household items,
eatables, electronic devices, garments and groceries at a large scale. The sales
of a product may vary from season to season. For instance, customers buy air
conditioners in large numbers during summer and far fewer in winter. When the
sales of products vary, the employees of Big Mart may not know what the sales
forecast is or how much stock is needed. In this case, sales forecasting plays
an important role in predicting the sales of each product with the help of the
cumulative sales report. An algorithm is required to predict future sales with
accurate results. Decision trees are predictive machine learning models. Once
training, pruning and testing are complete, a decision tree model predicts an
outcome for each new case. There are two main types:
1) CLASSIFICATION TREES and
2) REGRESSION TREES.
Regression trees are used when the target variable is continuous. For example,
if the goal is to predict the sales of Big Mart, the price of a house or the
setting of an apparatus, regression-type decision trees are preferred.
1.2 OBJECTIVE
The aim is to build a predictive model and find the sales of each product at a
particular store. Using this model, Big Mart will try to understand the
properties of the products and stores which play a key role in increasing sales,
and decide where to improve marketing or stop selling a product.
1.3 SCOPE OF THE WORK
• To predict the future sales of each product in Big Mart.
• To forecast sales using the cumulative sales report.
• To train and test a regression model that gives accurate forecasts of
future sales.
• To improve the profits of the mart indirectly through sales forecasting.

CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
Big Mart is a wholesale shopping mall where people can purchase all the items
they need. Predicting sales is important for increasing the profits of the mart
and controlling the stock. Many machine learning algorithms are used for this
purpose. These algorithms are trained on the cumulative sales report and tested
on future sales. However, not all algorithms produce the same prediction
accuracy. Neural networks were the most frequently used prediction method in the
papers reviewed. The derivative analysis shows that the neural network model is
able to capture the dynamic nonlinear trend and seasonal patterns, as well as the
interactions between them. However, we use the decision tree regression model,
which forecasts product sales with a low error rate and high accuracy.
2.2 BACKGROUND
We use pandas for handling data and NumPy for numerical operations on arrays.
Pandas
Python has long been great for data munging and preparation, but less so for data
analysis and modeling. pandas helps fill this gap, enabling you to carry out your
entire data analysis workflow in Python without having to switch to a more domain
specific language like R.
Combined with the excellent IPython toolkit and other libraries, the environment
for doing data analysis in Python excels in performance, productivity, and the
ability to collaborate. pandas does not implement significant modeling
functionality outside of linear and panel regression; for this, look to statsmodels
and scikit-learn. More work is still needed to make Python a first class statistical
modeling environment, but we are well on our way toward that goal.
NumPy
NumPy is the fundamental package for scientific computing with Python. It
contains, among other things:
• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of
databases.
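The NumPy features listed above can be illustrated with a short sketch. The array values and variable names below are hypothetical and chosen only for illustration; they are not taken from the project code.

```python
import numpy as np

# Hypothetical 3x2 array: rows are items, columns are (weight, price).
items = np.array([[9.3, 249.81], [5.92, 48.27], [17.5, 141.62]])

# Broadcasting: the (2,)-shaped vector of column means is stretched over all 3 rows.
centered = items - items.mean(axis=0)

# Linear algebra: a 2x2 matrix product of the centered columns.
gram = centered.T @ centered

# Random number capabilities: reproducible Gaussian noise with the same shape.
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=items.shape)
```

Broadcasting removes the need for an explicit loop over rows, which is why the same pattern is convenient later for scaling whole columns at once.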
2.3 CHALLENGES
• The acquired dataset had to be divided, with tuples assigned to either the
train data or the test data.
• Training the model on the current dataset, the cumulative sales report, was
the greatest challenge of all.
• Applying pre-processing techniques to handle missing values, outliers and
noisy data, so that the model is trained on clean data.
• Choosing accuracy measures to check how well the algorithm predicts the
sales.
• Determining which packages suit the model best.
2.4 PROBLEM DEFINITION AND APPROACH
To predict future sales for each product of Big Mart using the cumulative
sales reports. Certain attributes of each product and store have also been
defined.
The aim is to build a predictive model and find out the sales of each product
at a particular store. Using this model, Big Mart will try to understand the
properties of products and stores which play a key role in increasing sales.
Approach
1. Hypothesis Generation – understanding the problem better by brainstorming
possible factors that can impact the outcome.
2. Data Exploration – looking at categorical and continuous feature summaries and
making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers.
4. Feature Engineering – modifying existing variables and creating new ones for
analysis.
5. Model Building – making predictive models on the data.
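Steps 2 to 4 of the approach can be sketched with pandas as follows. This is only an illustration under stated assumptions: the four rows are hypothetical stand-ins for the real file, and the `Outlet_Age` feature and the `REFERENCE_YEAR` constant are examples of a derived variable, not something the project is known to have used.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic a few Big Mart columns.
df = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, np.nan, 19.2],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular"],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38],
})

# Step 2, data exploration: summaries of continuous and categorical features.
numeric_summary = df.describe()
fat_counts = df["Item_Fat_Content"].value_counts()

# Step 3, data cleaning: count the missing cells per column before imputing.
missing_per_column = df.isna().sum()

# Step 4, feature engineering: derive the outlet age from the establishment year.
REFERENCE_YEAR = 2019  # assumed reference year, for illustration only
df["Outlet_Age"] = REFERENCE_YEAR - df["Outlet_Establishment_Year"]
```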

SURVEY REPORT:
Various research journals and papers on sales forecast prediction using machine
learning algorithms were studied. The papers studied are listed below, each with
a short review.
This article [1] proposes a new hybrid sales forecasting system based on genetic
fuzzy clustering and Back Propagation (BP) Neural Networks with adaptive
learning rate (GFCBPN). The proposed architecture consists of three stages: (1)
utilizing Winter’s Exponential Smoothing method and Fuzzy C-Means clustering,
all normalized data records will be categorized into k clusters; (2) using an adapted
Genetic Fuzzy System (MCGFS), the fuzzy rules of membership levels to each
cluster will be extracted; (3) each cluster will be fed into parallel BP networks with
a learning rate adapted as the level of cluster membership of training data records.
Compared to previous research that uses hard clustering, this work uses fuzzy
clustering, which is able to increase the number of elements in each cluster
and consequently improve the accuracy of the proposed forecasting system.
Experimental results show that the proposed model outperforms the previous and
traditional approaches. Therefore, it is a very promising method for financial
forecasting.
Operations management [2] dysfunctions and lost production time are problems of
enormous magnitude that impact the performance and quality of industrial systems
as well as their cost of production. Association rule mining is a data mining
technique used to find out useful and invaluable information from huge databases.
This work develops a better conceptual base for improving the application of
association rule mining methods to extract knowledge on operations and
information management. The emphasis of the paper is on the improvement of the
operations processes. The application example details an industrial experiment in
which association rule mining is used to analyze the manufacturing process of a
fully integrated provider of drilling products. The study reports some new
interesting results with data mining and knowledge discovery techniques applied to
a drill production process. Experimental results on real-life data sets show that
the proposed approach is useful in finding effective knowledge associated with
the causes of dysfunctions.
In [3], the authors analyze the spatial dimension of sales data using a spatial
divergence approach based on the Ali-Silvey class of divergence measures to
determine the “distance” between two distribution functions. They apply this
approach to both simulated and real-life data. Using two divergence measures,
they find that the spatial divergence approach is capable of predicting success
in the beginning of the
process, which makes it appealing for use in marketing activity in general, and
particularly for launches of new products. When applied to 17 actual product
introductions, the method succeeded in correctly predicting the success or failure
of the products in 15 cases.
Due to the strong competition that exists today, most manufacturing organizations
are in a continuous effort for increasing their profits and reducing their costs.
Accurate sales forecasting is certainly an inexpensive way to meet the
aforementioned goals, since this leads to improved customer service, reduced lost
sales and product returns and more efficient production planning. Especially for the
food industry, successful sales forecasting systems can be very beneficial, due to
the short shelf-life of many food products and the importance of the product
quality which is closely related to human health. The paper [4] presents a
complete framework that can be used for developing nonlinear time series sales
forecasting models. The method is a combination of two artificial intelligence
technologies, namely the radial basis function (RBF) neural network architecture
and a specially designed genetic algorithm (GA).
Different prediction methods give different performance predictions when used for
daily fresh food sales forecasting. Logistic Regression (LR) is a good choice for
binary data, the Moving Average (MA) method is good for simple prediction,
while the Back-Propagation Neural Network (BPNN) method is good for long term
data. The study [5] develops and compares the performance of three sales
forecasting models, based on the above three prediction methods, for the
forecasting of fresh food sales in a point of sales (POS) database for convenience
stores. Fresh food is characterized by two factors: its short shelf-life and its
importance as a revenue producer for convenience stores. An efficient forecasting
model would be helpful to increase sales volume and reduce waste at such stores.
The rate of correct predictions is used here to compare the efficacy of the
different models.
Neural networks trained with the backpropagation algorithm are applied to predict
the future values of time series that consist of the weekly demand on items in a
supermarket. The influencing indicators of prices, advertising campaigns and
holidays are taken into consideration. The performance of the networks [6] is
evaluated by comparing them to two prediction techniques used in the supermarket
now. The comparison shows that neural nets outperform the conventional
techniques with regard to the prediction quality.
The aim of this paper [7] is to forecast sales volumes as accurately as possible
and as far into the future as possible. The choice of network topology was
Silva's adaptive back-propagation algorithm and the network architectures were
selected by Genetic Algorithms (GAs). The networks were trained to forecast from
1 month to
6 months in advance and the performance of the network was tested after training.
The test results of artificial neural networks (ANNs) are compared with the time
series smoothing methods of forecasting using several measures of accuracy. The
outcome of the comparison proved that the ANNs generally perform better than the
time series smoothing methods of forecasting.
CRM plays an essential role in present-day marketing, using customer data to
sustain business growth. The proposed software [8] provides all three functions
in a single application. The technologies used for the software are Java
(NetBeans 8.2) and MySQL. Data mining and business intelligence techniques are
used for data analysis, which presents customer information through pie charts
and graphs.
Retailers have to create effective promotions and offers to meet their sales and
marketing goals; otherwise they will forgo the major opportunities that the
current market offers. Big Data applications enable these retail organizations
to use the prior year's data to better forecast and predict the coming year's
sales. They also provide retailers with valuable analytical insights, especially
for matching customers with desired products at the desired time in a particular
store at different geographical locations. The paper [9] analyses the data sets
of one of the world's largest retailers, Walmart, to determine the business
drivers and predict which departments are affected by different scenarios (such
as temperature, fuel price and holidays) and their impact on sales at stores in
different locations.
Association rules (frequent itemsets), classification and clustering are the main
methods used in data mining research. One of the great challenges of data mining
is finding hidden patterns without violating data owners’ privacy. Privacy
preserving data mining came into prominence as a solution. In the paper [10],
the Matrix Apriori algorithm is modified and a frequent itemset hiding
framework is developed. Four frequent itemset hiding algorithms are proposed
such that: first, all versions work without pre-mining, so the privacy breach
caused by the knowledge obtained by finding frequent itemsets is prevented in
advance; secondly, efficiency is increased since no pre-mining is required;
thirdly, supports are found during the hiding process, and at the end the
sanitized dataset and the frequent itemsets of this dataset are given as
outputs, so no post-mining is required; finally, the heuristics use pattern
lengths rather than transaction lengths, eliminating the possibility of
distorting more valuable data.
The paper [11] draws insights from the encoder-decoder recurrent neural
network (RNN) structure and proposes a novel framework named TADA to carry
out trend alignment with dual-attention, multi-task RNNs for sales prediction. In
TADA, the influential factors are divided into internal features and
external features, which are jointly modelled by a multi-task RNN encoder. In the
decoding stage, TADA utilizes two attention mechanisms to compensate for the
unknown states of influential factors in the future and adaptively align the
upcoming trend with relevant historical trends to ensure precise sales prediction.
Experimental results on two real-world datasets comprehensively show the
superiority of TADA in sales prediction tasks against other state-of-the-art
competitors.
The work in [12] adapts Convolutional Neural Networks (CNNs) to handle
one-dimensional data. The proposed solution overcomes the challenges this poses
and reports that two-dimensional CNNs outperform a baseline LightGBM (a gradient
boosting framework that uses tree-based learning algorithms) model on two
different datasets: the dataset based on twenty-one hot products and the dataset
based on all products by subcategory. The CNN model reached an F1 score of 0.69
on the test set.
This paper [13] addresses the confidence of rules in association rule (AR)
mining and presents an approach for hiding a set of ARs that database
administrators have identified as informative. A rule is called informative if
its leakage risk is above a certain analyzer threshold. In some cases,
informative rules must not be disclosed to unauthorized corporations, since
they refer to informative data whose disclosure may be exploited by the
analysts of a competing company. The hiding process is also evaluated against a
similar one in order to compare their performance.
In this paper [14], the database of a company is considered a valuable asset for
competing with others. The study describes a methodology for external data
vendors, based on random forest predictive modeling techniques, to create
commercial variables that solve the shortcomings of a classic transactional
database. The spending pleasure variables are composed of purchasing
behaviour and attitude dimension in specific product categories, predicted for a
large amount of respondents (customers and non-customers) without missing
values. Firstly this study demonstrates the usefulness of spending pleasure
variables in a new customer acquisition case. Secondly, in this study they predicted
only the respondents who are positioned in the spending pleasure segment because
these are valuable and interesting respondents for most companies.
This paper [15] develops an artificial neural network (ANN) model to forecast the
optimum demand as a function of time of the year, festival period, promotional
programmes, holidays, number of advertisements, cost of advertisements, number
of workers and availability. The model selects a feed-forward back-propagation
ANN with 13 hidden neurons in one hidden layer as the optimum network. The
model is validated with furniture product data from a renowned furniture company.
The model has also been compared with a statistical linear model named Brown’s
double smoothing model which is normally used by furniture companies. It is
observed that ANN model performs much better than the linear model. Overall, the
proposed model can be applied for forecasting optimum demand level of furniture
products in any furniture company within a competitive business environment.
Direct marketing is a modern business activity with an aim to maximize the profit
generated from marketing to a selected group of customers. A key to direct
marketing is to select a subset of customers so as to maximize the profit return
while minimizing the cost. Achieving this goal is difficult due to the extremely
imbalanced data and the inverse correlation between the probability that a
customer responds and the dollar amount generated by a response. They [16]
presented a solution to this problem based on a creative use of association rules.
Association rule mining searches for all rules above an interestingness threshold,
as opposed to some rules in a heuristic-based search. Promising association rules
are then selected based on the observed value of the customers they summarize.
Selected association rules are used to build a model for predicting the value of a
future customer. On the challenging KDD-CUP-98 dataset, this approach generates
41% more profit than the KDD-CUP winner and 35% more profit than the best
result published thereafter, with 57.7% recall on responders and 78.0% recall on
non-responders. The average profit per mail is 3.3 times that of the KDD-CUP
winner.
The paper [17] considers a large database of customer transactions, where each
transaction consists of the items purchased by a customer in a visit. The
authors present an efficient algorithm that generates all significant
association rules between items in the database. The algorithm incorporates
buffer management and novel estimation and pruning techniques. They also present
results of applying this algorithm to sales data obtained from a large retailing
company, which show the effectiveness of the algorithm.
The research [18] demonstrates that one-year forecasts based on consumer
purchase intentions can be made more accurate by first segmenting the data by
demographic profiles. They obtained the most accurate forecasts by separately
segmenting intenders and non-intenders, using segmentation methods that
incorporate a dependent variable and using a reduced degree of precision in
measuring intent. Segmentation methods incorporating purchase as the dependent
measure can no longer be used. However, this subjective estimation may be an
easier task than the subjective estimation of realization rates for the overall
heterogeneous population. Hence, it may still lead to more accurate forecasts.
CHAPTER 3
EXPERIMENTAL DETAILS
3.1 MACHINE LEARNING METHODS
Decision Tree Regression
Decision trees are predictive machine learning models. Once training, pruning
and testing are complete, a decision tree model predicts an outcome for each new
case. There are two main types:
1) CLASSIFICATION TREES and 2) REGRESSION TREES.
Regression trees are used when the target variable is continuous. For example,
if the goal is to predict the sales of Big Mart, the price of a house or the
setting of an apparatus, regression-type decision trees are preferred.
• The main difference between a regression tree and a classification tree is
how the "badness" of a node is measured. There are various ways to do it
for both regression and classification trees. For regression trees, the sum of
squared errors, the median absolute deviation or some other function is used.
• In a regression tree the idea is this: since the target variable does not have
classes, we fit a regression model to the target variable using each of the
independent variables. Then, for each independent variable, the data is split
at several candidate split points. At each split point, the errors between the
predicted and actual values are squared and summed to get a "Sum of Squared
Errors (SSE)". The SSE values of the split points are compared across the
variables, and the variable and split point yielding the lowest SSE are chosen
as the root node and its split point. This process is continued recursively.
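The split search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: it assumes a single numeric feature, predicts the mean of the target on each side of a candidate threshold, and tries the midpoints between consecutive sorted feature values. The function names `sse` and `best_split` and the toy numbers are hypothetical.

```python
def sse(values):
    """Sum of squared errors of the values around their mean (0.0 if empty)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best binary split x <= threshold."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Hypothetical toy data: one feature (price) against the target (sales).
prices = [10, 20, 30, 40]
sales = [100, 110, 300, 310]
threshold, error = best_split(prices, sales)
print(threshold, error)  # 25.0 100.0
```

A full regression tree repeats this search over every feature, keeps the feature and threshold with the lowest SSE, and then recurses on the two resulting subsets.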
3.2 DESIGN FRAMEWORK
3.3 DATA SET, DATA SOURCE, CHARACTERIZATION, PREPROCESSING
DATA SET:
• No of Columns (12)
• No of Rows (8524)
• Data Set Characteristics (Multivariate)
• Attribute Characteristics (Integer, Real)
• Associated tasks (classification, clustering)
• Number of attributes (8)
  • Item_identifier
  • Item_weight
  • Item_fat_content
  • Item_visibility
  • Item_type
  • Item_MRP
  • Outlet_identifier
  • Outlet_establishment_year
  • Outlet_size
  • Outlet_location_type
  • Outlet_type
  • Item_outlet_sales
Sample Datasets:
Item_Weight  Item_Fat_Content  Item_Visibility  Item_Type  Item_MRP  Outlet_Identifier  Outlet_Establishment_Year  Outlet_Size  Outlet_Type  Outlet_Location_Type  Item_Outlet_Sales
9.3          1                 0.01605          1          249.81    49                 1999                       2            1            1                     3735.138
5.92         2                 0.01928          2          48.269    18                 2009                       2            2            3                     443.4228
17.5         1                 0.01676          3          141.62    49                 1999                       2            1            1                     2097.27
19.2         2                 0                4          182.1     10                 1998                       -            4            3                     732.38
8.93         1                 0                5          53.861    13                 1987                       1            1            3                     994.7052
10.395       2                 0                6          51.401    18                 2009                       2            2            3                     556.6088
13.65        2                 0.01274          7          57.659    13                 1987                       1            1            3                     343.5528
-            1                 0.12747          7          107.76    27                 1985                       2            3            3                     4022.764
16.2         2                 0.01669          8          96.973    45                 2002                       -            1            2                     1076.599
19.2         2                 0.09445          8          187.82    17                 2007                       -            1            2                     4710.535
11.8         1                 0                4          45.54     49                 1999                       2            1            1                     1516.027
18.5         2                 0.04546          1          144.11    46                 1997                       3            1            1                     2187.153
15.1         2                 0.10001          4          145.48    49                 1999                       2            1            1                     1589.265
17.6         2                 0.04726          7          119.68    46                 1997                       3            1            1                     2145.208
16.35        1                 0.06802          4          196.44    13                 1987                       1            1            3                     1977.426
9            2                 0.06909          9          56.361    46                 1997                       3            1            1                     1547.319
11.8         1                 0.0086           10         115.35    18                 2009                       2            2            3                     1621.889
9            2                 0.0692           9          54.361    49                 1999                       2            1            1                     718.3982
-            1                 0.03424          11         113.28    27                 1985                       2            3            3                     2303.668
(A dash marks a missing value.)
DATA SOURCE:
The datasets were downloaded from Kaggle repository.
Kaggle is an online community of data scientists and machine learners. It allows
users to find and publish data sets, explore and build models in a web-based
data-science environment, work with other data scientists and machine learning
engineers, and enter competitions to solve data science challenges.
Reference link:
https://www.kaggle.com/devashih0507/big-mart-sales-prediction
CHARACTERIZATION:
Variable                   Description
Item_identifier            Unique product ID
Item_weight                Weight of the product
Item_fat_content           Whether the product is low fat or not
Item_visibility            The % of total display area of all products in a store allocated to the particular product
Item_type                  The category to which the product belongs
Item_MRP                   Maximum retail price of the product
Outlet_identifier          Unique store ID
Outlet_establishment_year  The year in which the store was established
Outlet_size                The size of the store in terms of ground area covered
Outlet_location_type       The type of city in which the store is located
Outlet_type                Whether the outlet is just a grocery store or some sort of supermarket
Item_outlet_sales          Sales of the product in the particular store. This is the outcome variable to be predicted.
PREPROCESSING
Pre-processing is performed on the cumulative sales data set in order to handle
missing values, outliers and noisy data.
In the current dataset, missing values were found in many tuples under several
attributes.
We applied mean imputation to fill the missing values: the mean of the whole
column was calculated and assigned to each missing cell in that column. This was
repeated for each attribute until no missing cells were left.
Scaling is then applied to the values by fitting and transforming the dataset.
For each attribute, the mean and standard deviation are computed, and the values
are transformed so that the attribute has a mean of 0 and a standard deviation
of 1.
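The two preprocessing operations can be sketched with pandas as follows. This is an illustration under assumptions, not the project code: the small frame is hypothetical, and the population standard deviation (`ddof=0`) is used, which matches the convention of scikit-learn's StandardScaler.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the column names follow the dataset description.
df = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, np.nan, 19.2],
    "Item_MRP": [249.81, 48.27, 141.62, 182.10],
})

# Mean imputation: each missing cell receives the mean of its own column.
df_filled = df.fillna(df.mean())

# Standardization: subtract the column mean, divide by the column standard deviation.
df_scaled = (df_filled - df_filled.mean()) / df_filled.std(ddof=0)
```

After this step every column of `df_scaled` has a mean of 0 and a standard deviation of 1, so attributes measured on very different ranges become comparable.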
3.4 PROCESSING TECHNIQUES
The technique used for training and testing the model is decision tree
regression. After importing libraries such as scikit-learn, pandas and NumPy, we
apply regression to the preprocessed train data.
1. The required libraries, such as pandas and NumPy, were imported.
2. The path to read the train dataset was set.
3. The train dataset has missing values, outliers and some noisy data.
4. Preprocessing needs to be performed to replace or remove such data.
5. The main strategy used to fill the missing values was the mean.
6. The mean of the whole column was taken to fill each missing cell in that
column.
7. The clean data was scaled and transformed using scaler.transform.
8. The same strategy was applied to the test dataset.
9. Since the target values are continuous, the decision tree regressor was
the most suitable model.
10. Using the function tree.DecisionTreeRegressor(), the model was fitted to the
train data and applied to the test data.
11. When the test data was applied to the model, the quality of the forecast was
measured using performance metrics.
12. Regression uses its own performance metrics rather than classification
accuracy.
13. One of them is the mean absolute error; if this metric is 0, every
prediction is exact.
14. The other metrics used were the mean squared error, the R2 score and the
median absolute error.
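The steps above can be sketched end to end with scikit-learn. This is a minimal sketch under stated assumptions rather than the code used in the project: the data is synthetic, the split ratio and `random_state` values are arbitrary, and the imputer and scaler are fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Big Mart features (an assumption, not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 250.0, size=(200, 3))
y = 10.0 * X[:, 0] + rng.normal(0.0, 5.0, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the cells

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(
    SimpleImputer(strategy="mean"),         # steps 5-6: fill gaps with column means
    StandardScaler(),                       # step 7: zero mean, unit variance
    DecisionTreeRegressor(random_state=0),  # steps 9-10: the regression tree
)
model.fit(X_train, y_train)      # fitted on the training split only
y_pred = model.predict(X_test)   # step 11: apply the model to the test split

# Steps 12-14: regression metrics instead of classification accuracy.
metrics = {
    "mae": mean_absolute_error(y_test, y_pred),
    "mse": mean_squared_error(y_test, y_pred),
    "median_ae": median_absolute_error(y_test, y_pred),
    "r2": r2_score(y_test, y_pred),
}
```

Wrapping the imputer, scaler and tree in one pipeline keeps the column means and scaling parameters learned from the train data, so the test data is transformed in exactly the same way.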
CHAPTER 4
RESULTS AND DISCUSSION
PERFORMANCE METRICS:
Mean Absolute Error :
The mean absolute error (MAE) is the simplest regression error metric. Effectively,
MAE describes the typical magnitude of the residuals.
Because we use the absolute value of the residual, the MAE does not
indicate underperformance or overperformance of the model (whether or not the
model under or overshoots actual data). Each residual contributes proportionally to
the total amount of error, meaning that larger errors will contribute linearly to the
overall error.
A small MAE suggests the model is great at prediction, while a large MAE
suggests that your model may have trouble in certain areas. A MAE of 0 means
that your model is a perfect predictor of the outputs.
Mean Square Error :
The mean square error (MSE) is just like the MAE, but squares each difference
before summing them instead of taking the absolute value. The effect of the
square term in the MSE equation is most apparent in the presence of outliers in
our data. While each residual in MAE contributes proportionally to the total error,
the error grows quadratically in MSE.
This ultimately means that outliers in our data will contribute to a much higher
total error in the MSE than they would in the MAE. Similarly, our model will be
penalized more for making predictions that differ greatly from the corresponding
actual value.
R2 Score: In regression, the R2 coefficient of determination is a statistical measure
of how well the regression predictions approximate the real data points. An R2 of 1
indicates that the regression predictions perfectly fit the data.
Median Absolute Error: The median absolute error is particularly interesting
because it is robust to outliers. The loss is calculated by taking the median of all
absolute differences between the target and the prediction.
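The four metrics can be computed by hand to make their definitions concrete. The sketch below uses only the Python standard library; the function name `regression_metrics` and the toy values are hypothetical, with one deliberately large error to show how differently the metrics react to an outlier.

```python
from statistics import mean, median


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, median absolute error and R2 from two equal-length lists."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(r) for r in residuals]
    squared_errors = [r ** 2 for r in residuals]
    target_mean = mean(actual)
    ss_res = sum(squared_errors)                         # residual sum of squares
    ss_tot = sum((a - target_mean) ** 2 for a in actual)  # total sum of squares
    return {
        "mae": mean(abs_errors),
        "mse": mean(squared_errors),
        "median_ae": median(abs_errors),
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical toy values: the last prediction is off by 4, the rest by at most 1.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [3.0, 5.0, 8.0, 13.0]
scores = regression_metrics(actual, predicted)
```

For these values the MAE is 1.25, the MSE is 4.25, the median absolute error is 0.5 and R2 is about 0.15: the single large error dominates the MSE and R2, raises the MAE moderately, and barely moves the median absolute error.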
RESULTS
METRIC                  VALUE
MEAN ABSOLUTE ERROR     0.1057
MEAN SQUARE ERROR       0.1125
MEDIAN ABSOLUTE ERROR   0
R2 SCORE                0.8205
PLOT
CHAPTER 5
SUMMARY AND CONCLUSIONS
SUMMARY
A Big Mart is a shopping mall that sells a wide variety of household items,
eatables, electronic devices, garments and groceries at a large scale. The sales
of a product may vary from season to season. In this case, sales forecasting
plays an important role in predicting the sales of each product with the help of
the cumulative sales report.
The algorithm used in this thesis is decision tree regression. Regression
predicts a continuous numerical value from the attributes of a given record. The
aim was to build a predictive model and find the sales of each product at a
particular store. Using this model, Big Mart can try to understand the
properties of the products and stores which play a key role in increasing sales,
and decide where to improve marketing or stop selling a product.
CONCLUSION
We analyzed the Big Mart sales prediction dataset and performed a literature
survey on sales prediction using various techniques such as fuzzy logic,
deep learning and neural networks. We used Jupyter, launched through Anaconda
Navigator, to run the techniques. Decision tree based regression proved to be
the best model for forecasting the sales of Big Mart, with an accuracy rate of
90%. Training the model was easier than with other models. This indirectly helps
to gain more profit and to keep products in stock on schedule.
OBTAINED ACCURACY RATE : 90%

REFERENCES
[1] Hichama, A., & Mohameda, B. (2013). A novel approach based on genetic fuzzy clustering
and adaptive neural networks for sales forecasting.
[2] Kamsu-Foguem, B., Rigal, F., & Mauget, F. (2013). Mining association rules for the quality
improvement of the production process. Expert systems with applications, 40(4), 1034-1045.
[3] Garber, T., Goldenberg, J., Libai, B., & Muller, E. (2014). From density to destiny: Using
spatial dimension of sales data for early prediction of new product success. Marketing
Science, 23(3), 419-428.
[4] Doganis, P., Alexandridis, A., Patrinos, P., & Sarimveis, H. (2016). Time series sales
forecasting for short shelf-life food products based on artificial neural networks and evolutionary
computing. Journal of Food Engineering, 75(2), 196-204.
[5] Lee, W. I., Chen, C. W., Chen, K. H., Chen, T. H., & Liu, C. C. (2013). Comparative study
on the forecast of fresh food sales using logistic regression, moving average and BPNN
methods. Journal of Marine Science and Technology, 20(2), 142-152.
[6] Thiesing, F. M., & Vornberger, O. (2013, June). Sales forecasting using neural networks.
In International conference on neural networks (Vol. 4, pp. 2125-2128).
[7] Yip, D. H., Hines, E. L., & Yu, W. W. (2013). Application of artificial neural networks in
sales forecasting.
[8] Dakhare, B., Jain, H., Salunke, V., & Gandla, R. (2018). CRM Application for Analyzing
the Sales Using Data Mining Techniques and Business Intelligence.
[9] Singh, M., Ghutla, B., Jnr, R. L., Mohammed, A. F., & Rashid, M. A. (2017, December).
Walmart's Sales Data Analysis-A Big Data Analytics Perspective. In 2017 4th Asia-Pacific
World Congress on Computer Science and Engineering (APWC on CSE) (pp. 114-119). IEEE.
[10] Agrawal, A., & Soni, J. (2014, February). Secure Frequent Itemset Hiding Techniques in
Data Mining.
[11] Chen, T., Yin, H., Chen, H., Wu, L., Wang, H., Zhou, X., & Li, X. (2018, November).
TADA: trend alignment with dual-attention multi-task recurrent neural networks for sales
prediction. In 2018 IEEE International Conference on Data Mining (ICDM) (pp. 49-58). IEEE.
[12] Bakshi, N. A., Kolan, P. R., Behera, B., Kaushik, N., & Ismail, A. M. (2018). Predicting
Pregnant Shoppers Based on Purchase History Using Deep Convolutional Neural
Networks. Journal of Advances in Information Technology, 9(4).
[13] Rostamkhania, Z., & Taghavifard, M. T. A Decreasing Max-Min Approach for Hiding
Informative Association Rules.
[14] Baecke, P., & Van den Poel, D. (2013). Data augmentation by predicting spending pleasure
using commercially available external data. Journal of Intelligent Information Systems, 36(3),
367-383.
[15] Mahbub, N., Paul, S. K., & Azeem, A. (2013). A neural approach to product demand
forecasting. International Journal of Industrial and systems engineering, 15(1), 1-18.
[16] Wong, K. W., Zhou, S., Yang, Q., & Yeung, J. M. S. (2014). Mining customer value: From
association rules to direct marketing. Data Mining and Knowledge Discovery, 11(1), 57-79.
[17] Agrawal, R., Imieliński, T., & Swami, A. (2015, June). Mining association rules between
sets of items in large databases. In ACM SIGMOD Record (Vol. 22, No. 2, pp. 207-216). ACM.
