
Revenue Predictor

Udit Ennam
MSCS - Data Science (Class of 2019)
Rutgers University, New Brunswick
Outline
● Introduction
● The 6 Data Mining Steps
○ Problem Definition
○ Data Preparation
○ Data Exploration
○ Modeling
○ Evaluation
○ Deployment
● Challenges/Limitations
● Tools / Technologies
Introduction
● Customers are the backbone of any business.

● Companies are embracing customer analytics to understand user behavior and adjust their promotional and marketing strategies accordingly.

Source: McKinsey Five Facts


6 Data Mining Steps

1. Problem Definition

● Prediction of the revenue generated per customer for your business/company.

● What are the customer's expectations?

○ Personalized feeling

○ Content recommendations

○ Customer loyalty points, etc.

● Using the predictions, you can run more targeted advertising to engage your customers and drive conversion rates.
2. Data Preparation
Dataset Overview
Source of the Google Analytics dataset: Kaggle

Dataset Timeline: Aug 1, 2016 - Aug 1, 2017 (366 days)

Type of dataset | Number of records | Number of unique users
Training | 0.9 million | 714,717
Testing | 0.7 million | 607,891


Target Variable: totals_transactionRevenue
● Initial no. of columns: 12

● JSON columns: 4
○ Each of them in the following format:
'{"browser": "Chrome", "browserVersion": "not available in demo dataset", "browserSize":
"not available in demo dataset", "operatingSystem": "Windows", "operatingSystemVersion":
"not available in demo dataset", "isMobile": false, "mobileDeviceBranding": "not available
in demo dataset", "mobileDeviceModel": "not available in demo dataset",
"mobileInputSelector": "not available in demo dataset", "mobileDeviceInfo": "not available
in demo dataset", "mobileDeviceMarketingName": "not available in demo dataset",
"flashVersion": "not available in demo dataset", "language": "not available in demo
dataset", "screenColors": "not available in demo dataset", "screenResolution": "not
available in demo dataset", "deviceCategory": "desktop"}'

● After normalization of the JSON columns, total no. of columns: 59

● Data shape after cleaning: (903653, 42)
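The flattening step itself isn't shown on the slide; below is a minimal sketch of how the four JSON columns could be expanded with pandas (the column names are assumptions based on the Kaggle Google Analytics dataset).

import json
import pandas as pd
from pandas import json_normalize

# Assumed JSON column names from the Kaggle Google Analytics dataset
JSON_COLS = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_flat(path):
    # Parse each JSON column into a dict while reading the CSV
    df = pd.read_csv(path, converters={c: json.loads for c in JSON_COLS},
                     dtype={'fullVisitorId': str})
    for col in JSON_COLS:
        # Expand the nested dict into flat columns, e.g. device_deviceCategory
        flat = json_normalize(df[col])
        flat.columns = [f'{col}_{sub.replace(".", "_")}' for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df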


Data Cleaning

● Removed columns having more than 50% null entries.

● Replaced the remaining null entries with medians, modes, or zeros, depending on the type of each variable and its distribution.

● Eliminated columns having a single constant value across all records.

● Converted the Unix timestamp to month, year, and hour of the day for easier analysis.

from datetime import datetime

# Drop columns that contain a single constant value
constant_cols = [col for col in data.columns if data[col].nunique() == 1]
data.drop(columns=constant_cols, inplace=True)

# Convert the Unix visitStartTime to a readable datetime string
data['visitStartTime'] = data['visitStartTime'].apply(
    lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S'))
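The null-value imputation described above isn't shown on the slide; a minimal sketch of the median/mode/zero strategy could look like this (which columns default to zero is an assumption).

import numpy as np

for col in data.columns:
    if data[col].isnull().sum() == 0:
        continue
    if np.issubdtype(data[col].dtype, np.number):
        # Revenue-like counters default to zero, other numerics to the median
        fill = 0 if col == 'totals_transactionRevenue' else data[col].median()
    else:
        # Categorical columns fall back to the most frequent value
        fill = data[col].mode()[0]
    data[col] = data[col].fillna(fill)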
3. Data Exploration
Univariate, Bivariate Analysis

● Plotted histograms for the numeric columns and categorized them via binning, to help visualize the data and form insights.

● Bivariate analysis using pairplots to understand the relationships between multiple variables.

● Used combination charts [charts with two y-axes: one for the visit count and the other for the mean transaction revenue]. Following are some interesting graphs.

Univariate, Bivariate Analysis

● Used correlation heatmaps to understand positive and negative correlations and remove redundant columns.

A few more interesting combination charts:

Possible explanation (visits per month): The number of visits peaks in November, yet the mean revenue generated is the lowest of all the months. This could be because of the 'Black Friday' and 'Thanksgiving' sales [discounts generally range from 50-80%].

Possible explanation (visits per weekday): The number of visits during weekdays is higher than on weekends. This could be because customers tend to make purchases at their workplace, given how central Google Analytics has become to the way most companies market nowadays.
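As an illustration of how such a combination chart could be built with Plotly (the grouping column and chart types are assumptions, not taken from the slides):

import plotly.graph_objects as go
from plotly.subplots import make_subplots

def combination_chart(data, group_col):
    # Aggregate visit counts and mean revenue per category (e.g. visitMonth)
    agg = data.groupby(group_col)['totals_transactionRevenue'].agg(['count', 'mean'])
    fig = make_subplots(specs=[[{'secondary_y': True}]])
    fig.add_trace(go.Bar(x=agg.index, y=agg['count'], name='Number of visits'),
                  secondary_y=False)
    fig.add_trace(go.Scatter(x=agg.index, y=agg['mean'], name='Mean revenue'),
                  secondary_y=True)
    fig.update_yaxes(title_text='Count', secondary_y=False)
    fig.update_yaxes(title_text='Mean transaction revenue', secondary_y=True)
    return fig

# Example: visits vs. mean revenue per month
combination_chart(data, 'visitMonth').show()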
Outlier Detection

● Outlier detection using boxplots, scatterplots, and the PyOD library.

Angle-based Outlier Detector (ABOD): OUTLIERS: 0, INLIERS: 20,000 [sample of 20,000 rows]
Cluster-based Local Outlier Factor (CBLOF): OUTLIERS: 915, INLIERS: 19,085 [sample of 20,000 rows]

● It looks like there are not many outliers. The learned decision function also doesn't look good because of the large number of zero transactions. So, we do not remove the outliers from this dataset.
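A minimal sketch of running the two PyOD detectors above on a 20,000-row sample (the feature selection and default parameters are assumptions):

from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF

# Numeric features only, on a 20,000-row sample of the cleaned data
X = data.select_dtypes(include='number').sample(20000, random_state=42)

for name, detector in [('ABOD', ABOD()), ('CBLOF', CBLOF())]:
    detector.fit(X)
    labels = detector.labels_          # 0 = inlier, 1 = outlier
    print(name, 'OUTLIERS:', labels.sum(), 'INLIERS:', (labels == 0).sum())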
Columns of importance after data exploration

Type of column | Column Names | Data Type of Column
Target | totals_transactionRevenue | Continuous Numeric
Intermediate Target | Revenue_class | Binary
Feature | device_deviceCategory, channelGrouping, device_operatingSystem, geoNetwork_subContinent, visitMonth, visitHour, visitWeekday | Nominal Categorical
Feature | totals_newVisits | Binary
Feature | totals_pageviews_cat, visitNumber_category | Ordinal Categorical
Reference | fullVisitorId | Object
Further steps to building a good model
● One-Hot Encoding

After encoding the nominal categorical variables, the total number of columns was 84.

● Scaling

Features were standardized using the StandardScaler utility class from the preprocessing module.

● Dimensionality Reduction [Principal Component Analysis, keeping 0.95 of the variance]

pca.n_components_ - the total number of columns is now 69.
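A minimal sketch of these three preprocessing steps with scikit-learn (the nominal column list is taken from the table above; variable names are assumptions):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# One-hot encode the nominal categorical features
nominal_cols = ['device_deviceCategory', 'channelGrouping',
                'device_operatingSystem', 'geoNetwork_subContinent',
                'visitMonth', 'visitHour', 'visitWeekday']
X = pd.get_dummies(X, columns=nominal_cols)   # -> 84 columns

# Standardize all features
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_)                      # -> 69 components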


4. Modeling
Step-by-step process of model creation

Step 1: Find data to train on.

Step 2: Apply a classifier to the data: zero revenue -> class 0, non-zero revenue -> class 1.

Step 3: Apply a regression model to the observations classified as non-zero, predicting the continuous non-zero revenues.

Step 4: Group the predicted transaction revenues by Visitor Id.

Final Model = Classification Model + Regression Model
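A minimal sketch of this two-stage approach, using the classifier and regressor that end up in the final model (variable names and data handling are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDRegressor

# Stage 1: classify each visit as zero (0) vs. non-zero (1) revenue
y_class = (y_revenue > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_class)

# Stage 2: fit a regressor only on the visits with non-zero revenue
mask = y_class == 1
reg = SGDRegressor().fit(X_train[mask], y_revenue[mask])

# Predict: zero unless the classifier flags the visit as revenue-generating
pred = np.where(clf.predict(X_test) == 1, reg.predict(X_test), 0.0)

# Step 4: aggregate the predicted revenue per visitor
per_visitor = test_df.assign(PredictedRevenue=pred) \
                     .groupby('fullVisitorId')['PredictedRevenue'].sum()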
Types of models to be trained on our data

Model Category | Model Names
Classification | Naive Bayes, k-nearest neighbors, Kernel SVM, Random Forests, Logistic Regression, XGBoost
Regression | Multiple Linear Regression, Polynomial Regression, SVR, Random Forests, SGD, XGBoost

● 10-fold cross-validation was then applied to the models to avoid overfitting.

● Another way of avoiding overfitting is regularization. I used Elastic-Net regression with cross-validation to choose the lambda values, because it handles correlated parameters well.

● Hyper-parameter tuning was done using GridSearchCV from the scikit-learn library.
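A minimal sketch of the cross-validation, Elastic-Net, and grid-search steps above (the parameter grid and l1 ratios are illustrative assumptions):

from sklearn.linear_model import ElasticNetCV, LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# 10-fold cross-validation of a candidate classifier
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y_class, cv=10)

# Elastic-Net regression with built-in CV over the regularization strength
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10).fit(X[mask], y_revenue[mask])

# Hyper-parameter tuning with an illustrative grid
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]}, cv=10)
grid.fit(X, y_class)
print(grid.best_params_)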
5. Evaluation
Evaluation Metrics for our Model
Classification Model:

● Confusion Matrix: Accuracy, Sensitivity, Specificity

● Gain Chart: the farther the model curve is from the baseline, the better.

Regression Model:

● Mean absolute error: RMSE wasn't used, as it wouldn't let us easily interpret the monetary profit or deficit.

● Adjusted R-squared value: R² is the ratio of explained variation to total variation, so the closer the R² value is to 1, the better the regression model at our disposal. The adjusted R² is more useful when there are multiple predictors.
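For the classification side, a minimal sketch of deriving accuracy, sensitivity, and specificity from the confusion matrix (variable names are assumptions):

from sklearn.metrics import confusion_matrix

# Rows = actual, columns = predicted; labels ordered as [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate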
Evaluation of Classification Model

Model Name | Accuracy | Sensitivity | Specificity
Logistic Regression | 98.72% | 98.82% | 62.17%
XGBoost Classifier | 98.74% | 98.93% | 49.71%
Random Forest Classifier | 98.75% | 98.83% | 51.3%
Kernel SVM Classifier | 98.71% | 98.94% | 45.99%
KNN | 98.23% | 98.01% | 39.21%
Gaussian NB | 98.53% | 98.22% | 43.41%


Evaluation of Classification Model

Based on the confusion matrix scores and CAP curve scores, Logistic Regression was deemed the best classification model for this dataset.

Sample Gain Chart for Logistic Regression
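A minimal sketch of drawing such a gain chart with the Scikit-plot library listed under Tools (variable names are assumptions):

import matplotlib.pyplot as plt
import scikitplot as skplt

# Predicted class probabilities from the fitted logistic regression
y_probas = clf.predict_proba(X_test)
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
plt.show()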


Evaluation of Regression Models

● Adjusted R² = 1 - ((1 - R²) * (n - 1) / (n - p - 1))

n -> number of samples, p -> number of independent variables.
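A minimal sketch of computing MAE and the adjusted R² from scikit-learn's r2_score (variable names are assumptions):

from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_true, y_pred)

r2 = r2_score(y_true, y_pred)
n, p = X_test.shape                  # number of samples, number of predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)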

Regression Model (after Logistic Regression classification) | Adjusted R-squared value | MAE
Multiple Linear Regression | 0.016 | 712497.9878
SGD Regression | 0.25 | 31568.9071
Random Forest Regression | 0.0012 | 781712.9222
XGBoost Regressor | 0.041 | 90911.9109
Polynomial Regression | 0.19 | 100789.8121
Support Vector Regression | 0.0005 | 912040.0098

Final Model = Logistic Regression (Classifier) + SGD Regressor (Regressor)


6. Deployment
Input: CSV file (the columns may contain JSON objects)

Output: Top x% of users with Visitor Ids and predicted revenue, where x can be defined by a client.

Sample output:

VisitorId | PredictedRevenue
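A minimal sketch of how the deployment could look with the Flask and joblib tools listed later; the endpoint, file names, and the preprocess/predict_revenue helpers are assumptions, not the actual implementation.

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
# Deserialize the trained two-stage model and the expected column list
clf, reg, columns = joblib.load('revenue_predictor.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    df = pd.read_csv(request.files['file'])      # uploaded CSV
    top_pct = float(request.form.get('top_pct', 5))
    X = preprocess(df)[columns]                  # assumed preprocessing helper
    preds = predict_revenue(clf, reg, X)         # assumed two-stage prediction helper
    out = (df.assign(PredictedRevenue=preds)
             .groupby('fullVisitorId')['PredictedRevenue'].sum()
             .sort_values(ascending=False))
    top_n = max(1, int(len(out) * top_pct / 100))
    return jsonify(out.head(top_n).to_dict())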
Challenges/Limitations
● The target variable has 98.75% null entries, which led to large MAE values, and evaluating the classification models was tricky because most of the values were zeros.

● Processing the JSON columns took a lot of time; it works fine with up to 2 GB of data. The model currently works only with the Google Analytics type of dataset.
Tools / Technologies
● Web Framework: Flask API
● Front-end: HTML, CSS, JavaScript, Bootstrap
● Python IDE: Jupyter
● Important libraries used: Scikit-learn, Scikit-plot [helped easily visualize Gain chart], Plotly
[for making combination charts], PyOD [has multiple outlier detection algorithms],
XGBoost, joblib [to serialize and deserialize model and columns]
References
[1] Zhao Y., Li B., Li X., Liu W., Ren S. (2005) Customer Churn Prediction Using Improved One-Class
Support Vector Machine. In: Li X., Wang S., Dong Z.Y. (eds) Advanced Data Mining and Applications.
ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg

[2] Neslin, S.A., Gupta, S., Kamakura, W., Lu, J., Mason, C.: Defection Detection: Improving Predictive
Accuracy of Customer Churn Models

[3] Nath, S.V., Behara, R.S.: Customer churn analysis in the wireless industry A data mining approach.
In: Proceedings - Annual Meeting of the Decision Sciences Institute, pp. 505–510 (2003)
Thank You

MSCS, Rutgers University (Class of 2019) Udit Ennam
