Customer Segmentation with R K-Means Clustering

Customer Segmentation: R
So as a personal project to learn more about data analysis and its applications, I will be reproducing exercises
in the book both in Excel and R. I will be structured in the blog posts:
0) I will always repeat this paragraph in every post
1) Introducing the Concept and Data Set,
2) Reproducing the results with R.
R
Step 1: Pivot & Copy
Step 2: Distances and Clusters
Step 3: Solving for optimal cluster centers
Step 4: Top deals by clusters
Entire Code
Customer Segmentation
Customer segmentation is as simple as it sounds: grouping customers by their characteristics and why would
you want to do that? To better serve their needs!
So how does one go about segmenting customers? One method we will look at is an unsupervised method of
machine learning called k-Means clustering. Unsupervised learning means finding out stuff without knowing
anything about the data to start so you want to discover.
What we have are offers sent via email, and transactions based on those offers. What we want to do with K-
Means clustering is classify customers based on offers they consume. Simplystatistics.org have a nice
animation of what this might look like:
If we plot a given set of data by their dimensions, we can identify groups of similar points (customers) within the
dataset by looking at how those points center around two or more points. The k in k-means is just the number
of clusters you choose to identify; naturally this would be greater than one cluster.
Great, were ready to start.
K-Means Clustering R
We will go through the same steps we did. We start with our vanilla files provided in CSV format:
Offers
Transactions
Download these files into a folder where you will work from in R. Fire up RStudio and lets get started.
Step 1: Pivot & Copy
First we want to read our data. This is simple enough:
# Read offers and transaction data
offers<-read.csv(file="OfferInformation.csv")
transactions<-read.csv(file="Transactions.csv")
We now need to combine these 2 files to get a frequency matrix equivalent to pivoting in Excel. This
can be done using the reshape library in R. Specifically we will use the melt and cast functions.
We first melt the 2 columns of the transaction data. This will create data that we can pivot: customer,
variable, and value. We only have 1 variable here Offer.
We then want to cast this data by putting value first (the offer number) in rows, customer names in the
columns. This is done by using Rs style of formula input: Value ~ Customers.
We then want to count each occurrence of customers in the row. This can be done by using a function
that takes customer names as input, counts how many there are, and returns the result. Simply:
function(x) length(x)
Lastly, we want to combine the data from offers with our new transaction matrix. This is done using
cbind (column bind) which glues stuff together automagically.
Lots of explanations for 3 lines of code!
#Load Library
library(reshape)
# Melt transactions, cast offer by customers
pivot<-melt(transactions[1:2])
pivot<-(cast(pivot,value~Customer.Last.Name,fill=0,fun.aggregate=function(x) length(x)))
# Bind to offers, we remove the first column of our new pivot because it's redundant.
pivot<-cbind(offers,pivot[-1])
We can output the pivot table into a new CSV file called pivot.
write.csv(file="pivot.csv",pivot)
Step 2: Clustering
To cluster the data we will use only the columns starting from Adams until Young.
We will use the fpc library to run the KMeans algorithm with 4 clusters.
To use the algorithm we will need to rotate the transaction matrix with t().
Thats all you need: 4 lines of code!
# Load library
library(fpc)
# Only use customer transaction data and we will rotate the matrix
cluster.data<-pivot[,8:length(pivot)]
cluster.data<-t(cluster.data)
# We will run KMeans using pamk (more robust) with 4 clusters.
cluster.kmeans<-pamk(cluster.data,k=4)
# Use this to view the clusters
View(cluster.kmeans$pamobject$clustering)
Step 3: Solving for Cluster Centers
This is not a necessary step in R! Pat yourself on the back, get another cup of tea or coffee and move
onto to step 4.
Step 4: Top deals by clusters
Top get the top deals we will have to do a little bit of data manipulation. First we need to combine our
clusters and transactions. Noteably the lengths of the tables holding transactions and clusters are
different. So we need a way to merge the data so we use the merge() function and give our columns
sensible names:
#Merge Data
cluster.deals<-merge(transactions[1:2],cluster.kmeans$pamobject$clustering,by.x =
"Customer.Last.Name", by.y = "row.names")
colnames(cluster.deals)<-c("Name","Offer","Cluster")
We then want to repeat the pivoting process to get Offers in rows and clusters in columns counting the
total number of transactions for each cluster. Once we have our pivot table we will merge it with the
offers data table like we did before:
# Melt, cast, and bind
cluster.pivot<-melt(cluster.deals,id=c("Offer","Cluster"))
cluster.pivot<-cast(cluster.pivot,Offer~Cluster,fun.aggregate=length)
cluster.topDeals<-cbind(offers,cluster.pivot[-1])
We can then reproduce the excel version by writing to a csv file:
write.csv(file="topdeals.csv",cluster.topDeals,row.names=F)
Note
Its important to note that cluster 1 in excel does not correspond to cluster 1 in R. Its just the way the
algorithms run. Moreover, the allocation of clusters might differ slightly because of the nature of
kmeans algorithm. However, your insights will be the same; in R we also see that cluster 3 prefers Pinot
Noir and cluster 4 has a strong preference for Offer 22.
Entire code
# Read data
offers<-read.csv(file="OfferInformation.csv")
transactions<-read.csv(file="Transactions.csv")
# Create transaction matrix

library(reshape)
pivot<-melt(transactions[1:2])
pivot<-(cast(pivot,value~Customer.Last.Name,fill=0,fun.aggregate=function(x)
length(x)))
pivot<-cbind(offers,pivot[-1])
write.csv(file="pivot.csv",pivot)
# Cluster
library(fpc)
cluster.data<-pivot[,8:length(pivot)]
cluster.data<-t(cluster.data)
cluster.kmeans<-pamk(cluster.data,k=4)
# Merge Data
cluster.deals<-merge(transactions[1:2],cluster.kmeans$pamobject$clustering,by.x =
"Customer.Last.Name", by.y = "row.names")
colnames(cluster.deals)<-c("Name","Offer","Cluster")
# Get top deals by cluster

cluster.pivot<-melt(cluster.deals,id=c("Offer","Cluster"))
cluster.pivot<-cast(cluster.pivot,Offer~Cluster,fun.aggregate=length)
cluster.topDeals<-cbind(offers,cluster.pivot[-1])
write.csv(file="topdeals.csv",cluster.topDeals,row.names=F)

Customer Segmentation with R K-Means Clustering

Încărcat de

Informații document

Descriere originală:

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Customer Segmentation with R K-Means Clustering

Încărcat de

Drepturi de autor:

Formate disponibile

Customer Segmentation: R

Step 1: Pivot & Copy

First we want to read our data. This is simple enough:

# Read offers and transaction data

# Melt transactions, cast offer by customers

Thats all you need: 4 lines of code!

# Use this to view the clusters

Step 3: Solving for Cluster Centers

Step 4: Top deals by clusters

# Melt, cast, and bind

We can then reproduce the excel version by writing to a csv file:

# Create transaction matrix

# Get top deals by cluster

S-ar putea să vă placă și