
Revenue Predictor

Udit Ennam
MSCS - Data Science (Class of 2019)
Rutgers University, New Brunswick
Outline
● Introduction
● The 6 Data Mining Steps
○ Problem Definition
○ Data Preparation
○ Data Exploration
○ Modeling
○ Evaluation
○ Deployment
● Challenges/Limitations
● Tools / Technologies
Introduction
● Customers are the backbone of any business.

● Companies are embracing customer analytics to understand user behavior and adjust their promotional and marketing strategies accordingly.

Source: McKinsey Five Facts


6 Data Mining Steps

1. Problem Definition

● Prediction of the revenue generated per customer for your business/company.

● What are the customer's expectations?

○ Personalized feeling

○ Content recommendations

○ Customer loyalty points, etc.

● Using the predictions, you can run more targeted advertising to engage your customers and drive conversion rates.
2. Data Preparation
Dataset Overview
Source of the Google Analytics dataset: Kaggle

Dataset Timeline: Aug 1, 2016 - Aug 1, 2017 (366 days)

Type of dataset | Number of records | Number of unique users
Training | 0.9 million | 714,717
Testing | 0.7 million | 607,891


Target Variable: totals_transactionRevenue
● Initial no. of columns: 12

● JSON columns: 4
○ Each of them in the following format:
'{"browser": "Chrome", "browserVersion": "not available in demo dataset", "browserSize":
"not available in demo dataset", "operatingSystem": "Windows", "operatingSystemVersion":
"not available in demo dataset", "isMobile": false, "mobileDeviceBranding": "not available
in demo dataset", "mobileDeviceModel": "not available in demo dataset",
"mobileInputSelector": "not available in demo dataset", "mobileDeviceInfo": "not available
in demo dataset", "mobileDeviceMarketingName": "not available in demo dataset",
"flashVersion": "not available in demo dataset", "language": "not available in demo
dataset", "screenColors": "not available in demo dataset", "screenResolution": "not
available in demo dataset", "deviceCategory": "desktop"}'

● After normalization of the JSON columns, total no. of columns: 59

● Data shape after cleaning: (903653, 42)
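The flattening step itself isn't shown on the slide; below is a minimal sketch of how the four JSON columns could be expanded with pandas (the column names are assumptions based on the Kaggle Google Analytics dataset).

import json
import pandas as pd
from pandas import json_normalize

# Assumed JSON column names from the Kaggle Google Analytics dataset
JSON_COLS = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_flat(path):
    # Parse each JSON column into a dict while reading the CSV
    df = pd.read_csv(path, converters={c: json.loads for c in JSON_COLS},
                     dtype={'fullVisitorId': str})
    for col in JSON_COLS:
        # Expand the nested dict into flat columns, e.g. device_deviceCategory
        flat = json_normalize(df[col])
        flat.columns = [f'{col}_{sub.replace(".", "_")}' for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df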


Data Cleaning

● Removed columns having more than 50% null entries.

● Replaced the remaining null entries with medians, modes, or zeros, depending on the type of each variable and its distribution.

● Eliminated columns having a single constant value across all records.

● Converted the Unix timestamp to month, year, and hour of the day for easier analysis.

from datetime import datetime

# Drop columns that contain a single constant value
constant_cols = [col for col in data.columns if data[col].nunique() == 1]
data.drop(columns=constant_cols, inplace=True)

# Convert the Unix visitStartTime to a readable datetime string
data['visitStartTime'] = data['visitStartTime'].apply(
    lambda x: datetime.utcfromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S'))
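The null-value imputation described above isn't shown on the slide; a minimal sketch of the median/mode/zero strategy could look like this (which columns default to zero is an assumption).

import numpy as np

for col in data.columns:
    if data[col].isnull().sum() == 0:
        continue
    if np.issubdtype(data[col].dtype, np.number):
        # Revenue-like counters default to zero, other numerics to the median
        fill = 0 if col == 'totals_transactionRevenue' else data[col].median()
    else:
        # Categorical columns fall back to the most frequent value
        fill = data[col].mode()[0]
    data[col] = data[col].fillna(fill)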
3. Data Exploration
Univariate, Bivariate Analysis

● Plotted histograms for the numeric columns and categorized them via binning, to help visualize the data and form insights.

● Bivariate analysis using pairplots to understand the relationships between multiple variables.

● Used combination charts [charts with two y-axes: one for the visit count and the other for the mean transaction revenue]. Following are some interesting graphs.

Univariate, Bivariate Analysis

● Used correlation heatmaps to understand positive and negative correlations and remove redundant columns.

A few more interesting combination charts:

Possible explanation (visits per month): The number of visits peaks in November, yet the mean revenue generated is the lowest of all the months. This could be because of the 'Black Friday' and 'Thanksgiving' sales [discounts generally range from 50-80%].

Possible explanation (visits per weekday): The number of visits during weekdays is higher than on weekends. This could be because customers tend to make purchases at their workplace, given how central Google Analytics has become to the way most companies market nowadays.
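As an illustration of how such a combination chart could be built with Plotly (the grouping column and chart types are assumptions, not taken from the slides):

import plotly.graph_objects as go
from plotly.subplots import make_subplots

def combination_chart(data, group_col):
    # Aggregate visit counts and mean revenue per category (e.g. visitMonth)
    agg = data.groupby(group_col)['totals_transactionRevenue'].agg(['count', 'mean'])
    fig = make_subplots(specs=[[{'secondary_y': True}]])
    fig.add_trace(go.Bar(x=agg.index, y=agg['count'], name='Number of visits'),
                  secondary_y=False)
    fig.add_trace(go.Scatter(x=agg.index, y=agg['mean'], name='Mean revenue'),
                  secondary_y=True)
    fig.update_yaxes(title_text='Count', secondary_y=False)
    fig.update_yaxes(title_text='Mean transaction revenue', secondary_y=True)
    return fig

# Example: visits vs. mean revenue per month
combination_chart(data, 'visitMonth').show()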
Outlier Detection

● Outlier detection using boxplots, scatterplots, and the PyOD library.

Angle-based Outlier Detector (ABOD): OUTLIERS: 0, INLIERS: 20,000 [sample of 20,000 rows]
Cluster-based Local Outlier Factor (CBLOF): OUTLIERS: 915, INLIERS: 19,085 [sample of 20,000 rows]

● It looks like there are not many outliers. The learned decision function also doesn't look good because of the large number of zero transactions. So, we do not remove the outliers from this dataset.
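A minimal sketch of running the two PyOD detectors above on a 20,000-row sample (the feature selection and default parameters are assumptions):

from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF

# Numeric features only, on a 20,000-row sample of the cleaned data
X = data.select_dtypes(include='number').sample(20000, random_state=42)

for name, detector in [('ABOD', ABOD()), ('CBLOF', CBLOF())]:
    detector.fit(X)
    labels = detector.labels_          # 0 = inlier, 1 = outlier
    print(name, 'OUTLIERS:', labels.sum(), 'INLIERS:', (labels == 0).sum())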
Columns of importance after data exploration

Type of column | Column Names | Data Type of Column
Target | totals_transactionRevenue | Continuous Numeric
Intermediate Target | Revenue_class | Binary
Feature | device_deviceCategory, channelGrouping, device_operatingSystem, geoNetwork_subContinent, visitMonth, visitHour, visitWeekday | Nominal Categorical
Feature | totals_newVisits | Binary
Feature | totals_pageviews_cat, visitNumber_category | Ordinal Categorical
Reference | fullVisitorId | Object
Further steps to building a good model
● One-Hot Encoding

After encoding the nominal categorical variables, the total number of columns was 84.

● Scaling

Features were standardized using the StandardScaler utility class from the preprocessing module.

● Dimensionality Reduction [Principal Component Analysis, keeping 0.95 of the variance]

pca.n_components_ - the total number of columns is now 69.
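A minimal sketch of these three preprocessing steps with scikit-learn (the nominal column list is taken from the table above; variable names are assumptions):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# One-hot encode the nominal categorical features
nominal_cols = ['device_deviceCategory', 'channelGrouping',
                'device_operatingSystem', 'geoNetwork_subContinent',
                'visitMonth', 'visitHour', 'visitWeekday']
X = pd.get_dummies(X, columns=nominal_cols)   # -> 84 columns

# Standardize all features
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_)                      # -> 69 components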


4. Modeling
Step-by-step process of model creation

Step 1: Find data to train on.

Step 2: Apply a classifier to the data: zero revenue -> class 0, non-zero revenue -> class 1.

Step 3: Apply a regression model to the observations classified as non-zero, predicting the continuous non-zero revenues.

Step 4: Group the predicted transaction revenues by Visitor Id.

Final Model = Classification Model + Regression Model
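A minimal sketch of this two-stage approach, using the classifier and regressor that end up in the final model (variable names and data handling are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDRegressor

# Stage 1: classify each visit as zero (0) vs. non-zero (1) revenue
y_class = (y_revenue > 0).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_class)

# Stage 2: fit a regressor only on the visits with non-zero revenue
mask = y_class == 1
reg = SGDRegressor().fit(X_train[mask], y_revenue[mask])

# Predict: zero unless the classifier flags the visit as revenue-generating
pred = np.where(clf.predict(X_test) == 1, reg.predict(X_test), 0.0)

# Step 4: aggregate the predicted revenue per visitor
per_visitor = test_df.assign(PredictedRevenue=pred) \
                     .groupby('fullVisitorId')['PredictedRevenue'].sum()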
Types of models to be trained on our data

Model Category | Model Names
Classification | Naive Bayes, k-nearest neighbors, Kernel SVM, Random Forests, Logistic Regression, XGBoost
Regression | Multiple Linear Regression, Polynomial Regression, SVR, Random Forests, SGD, XGBoost

● 10-fold cross-validation was then applied to the models to avoid overfitting.

● Another way of avoiding overfitting is regularization. I used Elastic-Net regression with cross-validation to choose the lambda values, because it handles correlated parameters well.

● Hyper-parameter tuning was done using GridSearchCV from the scikit-learn library.
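A minimal sketch of the cross-validation, Elastic-Net, and grid-search steps above (the parameter grid and l1 ratios are illustrative assumptions):

from sklearn.linear_model import ElasticNetCV, LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# 10-fold cross-validation of a candidate classifier
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y_class, cv=10)

# Elastic-Net regression with built-in CV over the regularization strength
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10).fit(X[mask], y_revenue[mask])

# Hyper-parameter tuning with an illustrative grid
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]}, cv=10)
grid.fit(X, y_class)
print(grid.best_params_)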
5. Evaluation
Evaluation Metrics for our Model
Classification Model:

● Confusion Matrix: Accuracy, Sensitivity, Specificity

● Gain Chart: the farther the model curve is from the baseline, the better.

Regression Model:

● Mean absolute error: RMSE wasn't used, as it wouldn't let us easily interpret the monetary profit or deficit.

● Adjusted R-squared value: R² is the ratio of explained variation to total variation, so the closer the R² value is to 1, the better the regression model at our disposal. The adjusted R² is more useful when there are multiple predictors.
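For the classification side, a minimal sketch of deriving accuracy, sensitivity, and specificity from the confusion matrix (variable names are assumptions):

from sklearn.metrics import confusion_matrix

# Rows = actual, columns = predicted; labels ordered as [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate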
Evaluation of Classification Model

Model Name | Accuracy | Sensitivity | Specificity
Logistic Regression | 98.72% | 98.82% | 62.17%
XGBoost Classifier | 98.74% | 98.93% | 49.71%
Random Forest Classifier | 98.75% | 98.83% | 51.3%
Kernel SVM Classifier | 98.71% | 98.94% | 45.99%
KNN | 98.23% | 98.01% | 39.21%
Gaussian NB | 98.53% | 98.22% | 43.41%


Evaluation of Classification Model

Based on the confusion matrix scores and CAP curve scores, Logistic Regression was deemed the best classification model for this dataset.

Sample Gain Chart for Logistic Regression
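A minimal sketch of drawing such a gain chart with the Scikit-plot library listed under Tools (variable names are assumptions):

import matplotlib.pyplot as plt
import scikitplot as skplt

# Predicted class probabilities from the fitted logistic regression
y_probas = clf.predict_proba(X_test)
skplt.metrics.plot_cumulative_gain(y_test, y_probas)
plt.show()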


Evaluation of Regression Models

● Adjusted R² = 1 - ((1 - R²) * (n - 1) / (n - p - 1))

n -> number of samples, p -> number of independent variables.
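A minimal sketch of computing MAE and the adjusted R² from scikit-learn's r2_score (variable names are assumptions):

from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(y_true, y_pred)

r2 = r2_score(y_true, y_pred)
n, p = X_test.shape                  # number of samples, number of predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)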

Regression Model (after Logistic Regression classification) | Adjusted R-squared value | MAE
Multiple Linear Regression | 0.016 | 712497.9878
SGD Regression | 0.25 | 31568.9071
Random Forest Regression | 0.0012 | 781712.9222
XGBoost Regressor | 0.041 | 90911.9109
Polynomial Regression | 0.19 | 100789.8121
Support Vector Regression | 0.0005 | 912040.0098

Final Model = Logistic Regression (Classifier) + SGD Regressor (Regressor)


6. Deployment
Input: CSV file (the columns may contain JSON objects)

Output: Top x% of users with Visitor Ids and predicted revenue, where x can be defined by a client.

Sample output:

VisitorId | PredictedRevenue
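A minimal sketch of how the deployment could look with the Flask and joblib tools listed later; the endpoint, file names, and the preprocess/predict_revenue helpers are assumptions, not the actual implementation.

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
# Deserialize the trained two-stage model and the expected column list
clf, reg, columns = joblib.load('revenue_predictor.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    df = pd.read_csv(request.files['file'])      # uploaded CSV
    top_pct = float(request.form.get('top_pct', 5))
    X = preprocess(df)[columns]                  # assumed preprocessing helper
    preds = predict_revenue(clf, reg, X)         # assumed two-stage prediction helper
    out = (df.assign(PredictedRevenue=preds)
             .groupby('fullVisitorId')['PredictedRevenue'].sum()
             .sort_values(ascending=False))
    top_n = max(1, int(len(out) * top_pct / 100))
    return jsonify(out.head(top_n).to_dict())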
Challenges/Limitations
● The target variable has 98.75% null entries, which led to large MAE values, and evaluating the classification models was tricky because most of the values were zeros.

● Processing the JSON columns took a lot of time; it works fine with up to 2 GB of data. The model currently works only with the Google Analytics type of dataset.
Tools / Technologies
● Web Framework: Flask API
● Front-end: HTML, CSS, JavaScript, Bootstrap
● Python IDE: Jupyter
● Important libraries used: Scikit-learn, Scikit-plot [helped easily visualize Gain chart], Plotly
[for making combination charts], PyOD [has multiple outlier detection algorithms],
XGBoost, joblib [to serialize and deserialize model and columns]
References
[1] Zhao Y., Li B., Li X., Liu W., Ren S. (2005) Customer Churn Prediction Using Improved One-Class
Support Vector Machine. In: Li X., Wang S., Dong Z.Y. (eds) Advanced Data Mining and Applications.
ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg

[2] Neslin, S.A., Gupta, S., Kamakura, W., Lu, J., Mason, C.: Defection Detection: Improving Predictive
Accuracy of Customer Churn Models

[3] Nath, S.V., Behara, R.S.: Customer churn analysis in the wireless industry A data mining approach.
In: Proceedings - Annual Meeting of the Decision Sciences Institute, pp. 505–510 (2003)
Thank You

MSCS, Rutgers University (Class of 2019) Udit Ennam
