
Comparison of learning techniques for prediction of customer churn in telecommunication


Manpreet Singh, Sarbjeet Singh, Nadesh Seen, Sakshi Kaushal and Harish Kumar
CSE, UIET, Panjab University
Chandigarh, India
manpreetkkh@gmail.com, sarbjeet@pu.ac.in, nadeshseen@gmail.com, sakshi@pu.ac.in, harishk@pu.ac.in

Abstract—Customer churn is a challenging issue that affects many businesses and is one of the most pressing problems in the telecom sector. The primary motivation of businesses at present is not only to acquire new customers, but to retain existing customers as well. In fact, customer retention is the more important of the two because of the high costs associated with acquisition. This study is conducted in a churn prediction modelling context and benchmarks four machine learning techniques against a publicly available telecommunication dataset. The results lead to two important conclusions: i) the Random Forest technique outperforms the other basic classification models, and ii) feature engineering plays a critical role in the performance of the model.

Keywords—Machine learning, Customer Churn Prediction

I. INTRODUCTION

Customer churning is the shifting of a customer from one service provider to another. Churners are those customers who have decided to leave the company or service provider and plan to shift to a competitor in the market. Customer churn is one of the most rapidly growing issues in the telecom sector. The cost associated with acquiring new customers has shifted the focus of the telecom sector towards retaining old customers rather than acquiring new ones: sales can be improved and marketing costs reduced by retaining old customers instead of focusing on new ones. The literature distinguishes the following types of customers [1]:

Active Churner (Volunteer): customers who want to leave their service provider.

Passive Churner (Non-Volunteer): customers whose services the company has discontinued.

Silent Churner: customers who may suddenly discontinue the service without any prior notice.

Customer churning is a key concern in telecommunication [2]. The needs and behaviour of customers must be understood clearly in order to develop a stronger relationship with them, and such issues are addressed under Customer Relationship Management (CRM). Customer retention is one of the main objectives of CRM, and its importance has led to the development of various tools that support predictive modelling and the classification of churners. Companies focus increasingly on long-term relationships with their customers and observe their customers' behaviour from time to time. They use various feature engineering techniques to uncover hidden relationships between different entities of the database and to predict churn efficiently. This scenario has led many companies to invest liberally in CRM for customer churn prediction. The customer churn prediction problem is widely studied in various domains such as telecommunication, banking, online shopping, and social network services.

Most churn prediction techniques employ machine learning, where the learning happens in two ways: supervised and unsupervised. Supervised learning trains a model on known inputs and outputs: the data is labelled, and these labels provide the ground truth used to predict outputs for new data. Unsupervised learning is employed when the data is unlabelled; it finds hidden patterns or structures in the input data by statistical means.

The present study applies four classical machine learning algorithms (SVM, Logistic Regression, k-NN and Random Forest) for the prediction of churn on a publicly available dataset [3] in the telecommunication domain.

Support Vector Machine (SVM) is a supervised machine learning algorithm for classification and regression problems wherein the training dataset is used to learn different classes using geometrical concepts. It classifies the data by finding a hyperplane that separates the training dataset into classes according to which side of the plane each data point lies on. The best separating hyperplane is the one that maximizes the margin (the perpendicular distance between the support vectors).

Logistic Regression is also a supervised classification algorithm, named after the function at the core of the method, the logistic function. The log of the odds of the label being 1 rather than 0 is estimated as a linear combination of the data features; this is then transformed through the logistic function to obtain prediction probabilities.

K-Nearest Neighbours (k-NN) [4] is one of the basic and prominent machine learning algorithms for classification. The algorithm assigns a label to a data point by polling the k training points that lie at the shortest distance from the test point.

Random Forest is a supervised learning algorithm. It is so called because it builds a 'forest', an ensemble of decision trees, to obtain better results, and it mostly uses the bagging method for training. Although it can be used for both classification and regression problems, in the current work it is used for the classification of churners.
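To make the setup concrete, the sketch below shows how the four classifiers could be instantiated with scikit-learn. This is a minimal illustration, not the authors' exact code; the specific arguments (e.g. probability=True for the SVM) are our assumptions, made so that class probabilities are available later for threshold tuning.

```python
# Minimal sketch (assumed, not the authors' code): the four classical models.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),   # probability estimates for ROC/threshold tuning
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(),
}
```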
II. LITERATURE REVIEW

A number of initiatives have been taken by different researchers for predicting churn in different domains. Table I summarizes the notable work carried out in this area, highlighting the major goals achieved along with the datasets and the techniques used.
TABLE I. SUMMARY OF CHURN PREDICTION INITIATIVES, DATASETS AND TECHNIQUES USED AND THEIR OUTCOMES

Authors | Title | Techniques | Year | Dataset | Outcomes

W. Au, K.C.C. Chan, X. Yao [5] | A novel evolutionary data mining algorithm with applications to churn prediction | Data mining by evolutionary learning, decision tree (C4.5), neural network | 2003 | Malaysian subscriber database of the wireless telecom industry | Rules were discovered very effectively and churn in the telecom data was predicted accurately.

S.-Y. Hung, D.C. Yen, H.-Y. Wang [6] | Applying data mining to telecom churn management | Decision tree and neural network | 2006 | Taiwan telecom company's dataset | Both DT and NN techniques deliver accurate predictions, while BPN performs better than DT without segmentation.

R.J. Jadhav, U.T. Pawar [7] | Churn prediction in telecommunication using data mining technology | Back-propagation neural network algorithm | 2011 | In-house customer database, proprietary call records from the company and a research survey | Customers who are at risk of churning are predicted.

A. Sharma, P. Prabin Kumar [8] | A neural network based approach for predicting customer churn in cellular network services | Artificial neural network | 2011 | Telecom dataset, UCI Repository, University of California, Irvine | Accuracy obtained by the artificial neural network based model is 92%.

H. Abbasimehr [9] | A neuro-fuzzy classifier for customer churn prediction | Adaptive neuro-fuzzy inference system (ANFIS) | 2011 | Telecom dataset, UCI Repository, University of California, Irvine | The neuro-fuzzy classifier performed better than C4.5 and RIPPER in terms of accuracy, sensitivity and specificity.

E. Shaaban, Y. Helmy, A. Khedr, M. Nasr [10] | A proposed churn prediction model | Decision tree, neural network and SVM | 2012 | Dataset obtained from an anonymous mobile service provider | Accuracy of the neural network is 83.7%, SVM is 83.7% and decision tree is 77.9%.

I. Brandusoiu, G. Toderean [11] | Churn prediction in the telecommunications sector using support vector machines (SVM) | SVM algorithm | 2013 | Telecom dataset, UCI Repository, University of California, Irvine | Accuracy of the SVM based model is 88.56%.

K. Kim, C.-H. Jun, J. Lee [12] | Improved churn prediction in telecommunication industry by analysing a large network | Logistic regression and multilayer perceptron neural networks | 2014 | Customers' personal information and CDR data | An efficient approach is developed using SPA (as a propagation process).

G. Olle [13] | A hybrid churn prediction model in mobile telecommunication industry | Logistic regression, voted perceptron | 2014 | Asian mobile telecom operator dataset | A hybrid learning model is developed to predict churn.

T. Vafeiadis, K.I. Diamantaras, G. Sarigiannidis, K.C. Chatzisavvas [14] | A comparison of machine learning techniques for customer churn prediction | SVM, decision tree, artificial neural network, Naive Bayes, regression analysis, boosting | 2015 | Telecom dataset, UCI Repository, University of California, Irvine | The SVM-POLY classifier with AdaBoost performs best.
III. CHURN PREDICTION AND EVALUATION

This section describes the steps followed for the prediction of churn in telecommunication.

A. Churn Prediction Steps

1) Data Preparation

The data used in this work comes from the Earino company dataset [3]. The dataset consists of 21 columns including the label, and the data available at [3] is not normalized. The features in the dataset are:

State: categorical variable, representing one of the 50 states
Account length: integer-valued variable representing how long the account has been active
Area code: categorical variable representing the area code
International Plan: categorical variable ('yes' or 'no') representing the presence of an international plan
Phone number: a key for customer identification
Voice Mail Plan: categorical variable ('yes' or 'no')
Number of voice mail messages: integer-valued variable
Total day minutes: continuous variable representing the minutes the customer has spent on the service during the day
Total day calls: integer-valued variable representing the total number of calls made in a day
Total day charge: continuous variable representing total day charges
Total evening minutes: continuous variable representing the minutes the customer has spent on the service during the evening
Total evening calls: integer-valued variable representing the total number of calls made in the evening
Total evening charge: continuous variable representing total evening charges
Total night minutes: continuous variable representing the minutes the customer has spent on the service during the night
Total night calls: integer-valued variable representing the total number of calls made at night
Total night charge: continuous variable representing total night charges
Total international minutes: continuous variable representing the minutes the customer has spent on international calls
Total international calls: integer-valued variable representing the total number of international calls
Customer Service Calls: numeric variable representing the number of service calls made by the customer
International Charge: continuous variable representing total international charges
Churn: categorical variable (1 = churn, 0 = non-churn)

The dataset is imbalanced, with the churn class making up only 14.5% of the total 3,333 samples. The continuous variables mostly follow a Gaussian distribution. For continuous variables, binning and feature scaling are performed. The dataset has no missing values.

a) Data Pre-Processing

Categorical variables are encoded using One Hot Encoding, an alternative to the label encoding scheme. This method has the benefit that no value is weighted improperly.

b) Data Transformation

Feature scaling may or may not have a significant effect on the results, depending on the algorithm used. A feature with a larger magnitude weighs more in calculations than features with smaller magnitudes. To suppress this effect, we brought all the features to roughly the same level of magnitude. Feature scaling is performed on the features which are continuous in nature. The Voice Mail Plan, International Plan and Customer Service Calls features are exempted from the standard scaling procedure: Voice Mail Plan and International Plan are categorical with only 0 and 1 as categories, and Customer Service Calls takes values spread over the range 0–7, with 6 and 7 occurring very rarely.
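As an illustration, the pre-processing and transformation described above might look as follows with pandas and scikit-learn. This is a sketch under assumptions: the file name and column names are inferred from the feature list above, not taken from the authors' code.

```python
# Minimal sketch (assumed, not the authors' code): one-hot encode the
# categorical features and standard-scale only the continuous ones.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("churn_in_telecoms_dataset.csv")  # dataset from [3]; name assumed

# One-hot encode the multi-category variables; map the binary plans to 0/1.
df = pd.get_dummies(df, columns=["state", "area code"])
df["international plan"] = (df["international plan"] == "yes").astype(int)
df["voice mail plan"] = (df["voice mail plan"] == "yes").astype(int)

# Scale continuous columns only; the binary plans and Customer Service Calls
# are exempted, as described in the text. (In practice the scaler should be
# fit on the training split only, to avoid leakage into the test set.)
continuous = [c for c in df.columns
              if c.startswith("total ")
              or c in ("account length", "number vmail messages")]
df[continuous] = StandardScaler().fit_transform(df[continuous])
```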
2) Feature Engineering

a) Feature Selection

In this phase, a subset of features is selected which are logically more relevant and have more predictive power. As there is no general method to gauge the predictive power and logical relevance of a feature, this step is purely subjective and requires domain knowledge. For example, Phone number is not considered for predictive analysis because churning intuitively and logically does not depend on this feature.

Features which have the strongest relationship with the output variable are then selected using statistical tests. The scikit-learn library provides the SelectKBest class, which is used to select the 'k' features with the highest ANOVA F-value scores.
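A sketch of this statistical selection step, assuming a feature matrix X and label vector y have already been prepared from the pre-processed data:

```python
# Minimal sketch: keep the k features with the highest ANOVA F-values.
# f_classif computes the ANOVA F-value between each feature and the label.
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)  # k=10 is a placeholder;
X_selected = selector.fit_transform(X, y)           # k is tuned by grid search later
```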
b) Dimensionality Reduction

PCA has been used for noise filtering and feature extraction. scikit-learn's built-in implementation of PCA is used to reduce the dimensionality of the data.

c) Oversampling (SMOTE)

SMOTE is used for oversampling the churner class (here, the positive class). SMOTE is a better way to increase the number of elements of the minority class than simply duplicating existing cases. The SMOTE module from the 'imblearn' library is used because the positive class is under-represented.

3) Train Test Split

The dataset is partitioned into training and testing sets with a train-to-test ratio of 7:3. A sketch of the split and oversampling steps is given below.
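The following sketch combines the 7:3 split with SMOTE. Applying SMOTE only to the training portion, and stratifying the split, are our assumptions of good practice (resampling before the split would leak synthetic points into the test set), not details stated in the paper.

```python
# Minimal sketch: 7:3 train/test split, then SMOTE on the training data only.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, stratify=y, random_state=42)

# Synthesize new minority-class (churner) samples instead of duplicating rows.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```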
4) Hyper-parameter Optimization

Hyper-parameter optimization or tuning is the problem of choosing a set of optimal hyper-parameters for a learning algorithm. We need to search for the optimal values of the following parameters for each model:

• Logistic Regression: none
• Support Vector Machine:
  o feature_selection__k
  o pca__n_components
  o svm__C (regularization parameter)
  o svm__class_weight {None, balanced (assigns different weights to data points; generally used for imbalanced data)}
• K-Nearest Neighbours:
  o feature_selection__k
  o knn__n_neighbors
  o pca__n_components
• Random Forest Classifier:
  o feature_selection__k
  o rf__n_estimators (number of weak learners in the ensemble)

The current work performs hyper-parameter optimization using grid search. A pipeline with the following sequence is created:

Feature Selection → PCA → Model Training

The parameters to be searched, and the range over which to search, are specified for each step of the sequence (e.g. k: range(1, 30)). The search is carried out on the training dataset and the results are sorted by mean test score; the set of hyper-parameters with the highest mean test score is taken as the best model found.

A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set. Here, the performance metric is the accuracy score, because an accurate classifier needs to be selected first; it can then be optimized to achieve maximum recall (sensitivity). A sketch of this pipeline and search follows.
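The sketch below shows the pipeline and grid search for the SVM case, using the double-underscore parameter names listed above (the step names must match the parameter prefixes). The search ranges are illustrative, and we keep k at least as large as pca__n_components so that every combination in the grid is valid.

```python
# Minimal sketch of the Feature Selection -> PCA -> Model pipeline with a
# grid search scored by accuracy, as described in the text.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif)),
    ("pca", PCA()),
    ("svm", SVC(probability=True)),
])

param_grid = {
    "feature_selection__k": range(12, 20),   # illustrative range
    "pca__n_components": range(2, 12),       # must not exceed k
    "svm__C": [0.1, 1, 10],
    "svm__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train_bal, y_train_bal)
print(search.best_params_, search.best_score_)  # highest mean test score
```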
5) Model Training

Four machine learning models are trained on the dataset: Logistic Regression, Support Vector Machine, K-Nearest Neighbours and Random Forest. The default classification threshold is 0.5 for all the models. Results are obtained first without adjusting the threshold and then after adjusting it.

6) Evaluation Metrics

Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their capability to discriminate among model results. Two evaluation tools are used to explain the performance of the models:

a) Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. A confusion matrix is an NxN matrix, where N is the number of classes. The following indicators are derived from it (a computational sketch follows the list):

• Accuracy: the proportion of the total number of predictions that were correct.
• Positive Predictive Value (Precision): the proportion of predicted positive cases that were actually positive.
• Negative Predictive Value: the proportion of predicted negative cases that were actually negative.
• Sensitivity (Recall): the proportion of actual positive cases which are correctly identified.
• Specificity: the proportion of actual negative cases which are correctly identified.
• Null Accuracy: how often the model would be correct if it always predicted the majority class.
• F1-Score (F-Score/F-measure): the harmonic mean of recall and precision; it considers both the precision and the recall to compute the score.
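These indicators follow directly from the confusion-matrix counts. A sketch of computing them, assuming y_pred holds the model's test-set predictions:

```python
# Minimal sketch: the indicators above, computed from raw confusion-matrix
# counts. For binary labels {0, 1}, scikit-learn's confusion_matrix returns
# [[TN, FP], [FN, TP]].
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)          # positive predictive value
npv         = tn / (tn + fn)          # negative predictive value
recall      = tp / (tp + fn)          # sensitivity / true positive rate
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)
```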
b) AUC-ROC Curve

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters:

• True Positive Rate (Recall): TPR = TP / (TP + FN)
• False Positive Rate: FPR = FP / (FP + TN)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both the False Positives and the True Positives.
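A sketch of how the curve and the threshold adjustment used in the next section could be produced, reusing the fitted search and test split from the earlier sketches (the 0.2 threshold is just an example value):

```python
# Minimal sketch: ROC curve from predicted probabilities, plus re-labelling
# predictions at a custom threshold instead of the default 0.5.
from sklearn.metrics import roc_curve, roc_auc_score

proba = search.predict_proba(X_test)[:, 1]   # probability of the churner class
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("ROC-AUC:", roc_auc_score(y_test, proba))

y_pred_adjusted = (proba >= 0.2).astype(int)  # lowered threshold -> higher recall
```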

IV. RESULTS

The dataset used has 3,333 customers, of which 2,850 (85.5%) are non-churners and 483 (14.5%) are churners. The objective is to select the model that yields good classification accuracy with the maximum possible recall (sensitivity towards correctly predicting the churner class).

We trained the models with the following optimal parameter values:

• Logistic Regression: none
• Support Vector Machine: feature_selection__k: 17, pca__n_components: 12, svm__C: 10, svm__class_weight: balanced
• K-Nearest Neighbours: feature_selection__k: 15, knn__n_neighbors: 26, pca__n_components: 11
• Random Forest Classifier: feature_selection__k: 16, rf__n_estimators: 33

The sensitivity of a model can be increased by adjusting the classification threshold, and a plot of thresholds vs. the major error metrics facilitates this decision. The confusion matrices before and after threshold adjustment are presented for each of the four machine learning models. The threshold is chosen intuitively, guided by the objective of maximizing recall without significantly affecting the model's classification accuracy.

1) Logistic Regression

Fig. 1. ROC curve for Logistic Regression (ROC-AUC Score: 0.84)

The ROC curve of Fig. 1 shows that the logistic regression model does not give satisfactory results: to achieve 100% sensitivity we would have to accept a false positive rate of 80%, which is not desirable as it would misclassify a large chunk of non-churners as churners.

Fig. 2. Thresholds vs. Metrics Scores (Logistic Regression)

Fig. 2 shows that the metric scores vary non-linearly with the threshold. By adjusting the threshold to 0.2, the sensitivity rises from 79% to 96%; as a result, the classification accuracy decreases to 66%, making the model unsuitable for classification.

TABLE II. CONFUSION MATRIX, THRESHOLD=0.5

                  Predicted Negative  Predicted Positive
Actual Negative          657                 198
Actual Positive          177                 678

TABLE III. CONFUSION MATRIX, THRESHOLD=0.2

                  Predicted Negative  Predicted Positive
Actual Negative          310                 545
Actual Positive           29                 826

Tables II and III show the effect of the threshold on the confusion matrix entries. The False Negatives reduce significantly from 177 to 29 and the True Positives increase from 678 to 826. However, the False Positives grow by 347, which implies that a large chunk of non-churners is predicted as churners. This may incur significant expenses to the company.

TABLE IV. PERFORMANCE INDICATORS

Metric        Threshold=0.5  Threshold=0.2
Accuracy          0.78           0.66
Recall            0.79           0.96
Precision         0.77           0.60
F1-Score          0.78           0.74
Specificity       0.76           0.38
ROC-AUC           0.84           -

Table IV shows that adjusting the threshold to 0.2 does not change the F1-Score appreciably. Although the sensitivity is significantly improved (the main objective), the specificity has worsened. Thus, the model is not practically useful.

2) K Nearest Neighbors

KNN has better classifying power than Logistic Regression, as witnessed by their ROC-AUC scores (0.92 vs. 0.84). The ROC curve for KNN is shown in Fig. 3.

Fig. 3. ROC curve for K Nearest Neighbors (ROC-AUC Score: 0.92)

Fig. 4. Thresholds vs. Metrics Scores (KNN)

Adjusting the threshold to 0.2 increases the sensitivity from 88% to 99%, but is accompanied by a dip in overall accuracy from 82% to 65%.

TABLE V. CONFUSION MATRIX, THRESHOLD=0.5

                  Predicted Negative  Predicted Positive
Actual Negative          659                 196
Actual Positive           97                 758

TABLE VI. CONFUSION MATRIX, THRESHOLD=0.2

                  Predicted Negative  Predicted Positive
Actual Negative          264                 591
Actual Positive            6                 849

Although KNN predicts True Positives better than Logistic Regression, the increase in False Positives by 395 is a stronger vote against it. This reduces to a tradeoff between classifying 91 more churners correctly and misclassifying 395 additional non-churners as churners, which is surely a poor bargain considering the company's objective of cost-cutting.

TABLE VII. PERFORMANCE INDICATORS

Metric        Threshold=0.5  Threshold=0.2
Accuracy          0.82           0.65
Recall            0.88           0.99
Precision         0.79           0.59
F1-Score          0.83           0.74
Specificity       0.77           0.30
ROC-AUC           0.92           -

A justification similar to that for Logistic Regression holds for this model too; the specificity is even below 0.5 (very poor).
3) Support Vector Machine

The ROC score shows SVM to be a classifier with excellent predictive capability: the area under the ROC curve is close to unity. To achieve an almost 100% True Positive Rate (our objective), we need only accept a False Positive Rate of about 20%.

Fig. 5. ROC curve for SVM (ROC-AUC Score: 0.98)

Fig. 6. Thresholds vs. Metrics Scores (SVM)

At a threshold of 0.25, the sensitivity increases from 97% to 98% without changing the other evaluation metrics appreciably. Moreover, the overall classification accuracy stays almost the same, i.e. 94%.

TABLE VIII. CONFUSION MATRIX, THRESHOLD=0.5

                  Predicted Negative  Predicted Positive
Actual Negative          783                  72
Actual Positive           24                 831

TABLE IX. CONFUSION MATRIX, THRESHOLD=0.25

                  Predicted Negative  Predicted Positive
Actual Negative          769                  86
Actual Positive           17                 838

The True Positives predicted are higher, and the difference between the total churners and the number predicted is very small (just 17). The False Positives generated are very few, and the False Negatives are fewer than in the previous models (both in favor of our objective). Threshold adjustment has not shown any significant improvement in predicting more True Positives.

TABLE X. PERFORMANCE INDICATORS

Metric        Threshold=0.5  Threshold=0.25
Accuracy          0.94           0.94
Recall            0.97           0.98
Precision         0.92           0.90
F1-Score          0.94           0.94
Specificity       0.94           0.90
ROC-AUC           0.98           -

The scores of the evaluation metrics are better than those of the previous models and are not changed much by threshold adjustment.

4) Random Forest

The ROC score shows Random Forest to be a classifier with outstanding predictive capability: the area under the ROC curve is close to unity. To achieve an almost 100% True Positive Rate (our objective), we need only accept a False Positive Rate of about 20%, similar to the SVM model.

Fig. 7. ROC curve for Random Forest (ROC-AUC Score: 0.99)

Fig. 8. Thresholds vs. Metrics Scores (Random Forest)

At a threshold of 0.3, the sensitivity (recall) improves from 94% to 98% without much affecting the accuracy and specificity (the capability of not predicting non-churners as churners).

TABLE XI. CONFUSION MATRIX, THRESHOLD=0.5

                  Predicted Negative  Predicted Positive
Actual Negative          826                  29
Actual Positive           47                 808

TABLE XII. CONFUSION MATRIX, THRESHOLD=0.3

                  Predicted Negative  Predicted Positive
Actual Negative          768                  87
Actual Positive           17                 838

TABLE XIII. PERFORMANCE INDICATORS

Metric        Threshold=0.5  Threshold=0.3
Accuracy          0.95           0.94
Recall            0.94           0.98
Precision         0.96           0.90
F1-Score          0.95           0.94
Specificity       0.96           0.90
ROC-AUC           0.99           -

Although the accuracy of Random Forest before threshold adjustment is greater than that of SVM, the True Positives predicted are fewer for Random Forest. However, its False Positives being 29 (rather than 72 for SVM) is a good argument in its favor. After threshold adjustment, the SVM and Random Forest models perform identically.


V. SUMMARY AND CONCLUSION

Churn prediction can be modelled as a binary classification problem, and this work solves it using four classical machine learning methods whose prediction capabilities have been examined. The dataset taken is imbalanced and not normalized. The categorical features are encoded and the continuous-valued features are normalized. A subset of irrelevant features (with low qualitative predictive power) is removed, and the features having the strongest relationship with the output variable are selected. The dimensionality of the dataset is reduced without losing much of the variance of the data. The data imbalance is treated using an oversampling technique. We tuned the models for maximum predictive performance using grid search, and the models are trained on the best set of parameters obtained from the grid search procedure. The classification efficiency is gauged using standard evaluation metrics (confusion matrices and ROC curves). The classification threshold is adjusted so as to optimize the sensitivity of the model.

It is concluded from the results presented in Section IV that Random Forest and SVM are comparably the best models for the given dataset. The False Positives predicted by Random Forest and SVM are much lower than those of the other two models. Also, the True Positives are predicted with an accuracy of 94% and a sensitivity of 98%.
REFERENCES

[1] V. Lazarov, M. Capota, Churn prediction, Bus. Anal. Course, TUM Comput. Sci. (2007).
[2] R.H. Wolniewicz, R. Dodier, Predicting customer behavior in telecommunications, IEEE Intell. Syst. 19 (2) (2004) 50–58.
[3] https://www.kaggle.com/becksddf/churn-in-telecoms-dataset
[4] Leif E. Peterson, K-nearest neighbor, Scholarpedia 4 (2) (2009) 1883.
[5] W. Au, K.C.C. Chan, X. Yao, A novel evolutionary data mining algorithm with applications to churn prediction, IEEE Trans. Evol. Comput. 7 (6) (2003) 532–545.
[6] S.-Y. Hung, D.C. Yen, H.-Y. Wang, Applying data mining to telecom churn management, Expert Syst. Appl. 31 (3) (2006) 515–524.
[7] R.J. Jadhav, U.T. Pawar, Churn prediction in telecommunication using data mining technology, Int. J. Adv. Comput. Sci. Appl. 2 (2) (2011) 17–19.
[8] A. Sharma, P. Prabin Kumar, A neural network based approach for predicting customer churn in cellular network services, Int. J. Comput. Appl. 27 (11) (2011) 26–31.
[9] H. Abbasimehr, A neuro-fuzzy classifier for customer churn prediction, Int. J. Comput. Appl. 19 (8) (2011) 35–41.
[10] E. Shaaban, Y. Helmy, A. Khedr, M. Nasr, A proposed churn prediction model, Int. J. Eng. Res. Appl. 2 (4) (2012) 693–697.
[11] I. Brandusoiu, G. Toderean, Churn prediction in the telecommunications sector using support vector machines, Ann. ORADEA Univ., Fascicle Manag. Technol. Eng. (1) (2013).
[12] K. Kim, C.-H. Jun, J. Lee, Improved churn prediction in telecommunication industry by analyzing a large network, Expert Syst. Appl. 41 (15) (2014) 6575–6584.
[13] G. Olle, A hybrid churn prediction model in mobile telecommunication industry, Int. J. e-Educ. e-Bus. e-Manag. e-Learn. 4 (1) (2014) 55–62.
[14] T. Vafeiadis, K.I. Diamantaras, G. Sarigiannidis, K.C. Chatzisavvas, A comparison of machine learning techniques for customer churn prediction, Simul. Model. Pract. Theory 55 (2015) 1–9.
