3. Unsupervised Learning

2016. 11. 27.
Unsupervised learning
About this module

The goal of unsupervised learning is to model patterns that are hidden in the data. For
example, in our retail dataset there may be groups of customers with particular behaviours,
e.g. customers who use the shop for expensive items, customers who shop only on a
small budget, customers who use the website only in certain periods of the year, and so on. With
unsupervised learning we can discover these kinds of patterns and summarise them.
This kind of analysis is called unsupervised because
we do not know in advance which groups exist in the data or which group any individual
observation belongs to. In this case, we say that the data is unlabelled. The most common unsupervised
learning method is clustering, where patterns are discovered by grouping similar samples.
Clustering with K-Means

K-means clustering is a method for finding clusters and cluster centres in a set of unlabelled
data. Intuitively, we might think of a cluster as a group of data points whose inter-point
distances are small compared with the distances to points outside the cluster. Given
an initial set of K centres, the K-means algorithm alternates between two steps:
1. for each centre, we identify the subset of training points (its cluster) that are closer to it than
to any other centre;
2. the mean of each feature over the data points in each cluster is computed, and the
corresponding vector of means becomes the new centre for that cluster.
These two steps are iterated until the centres no longer move or the assignments no longer
change. Then, a new point x can be assigned to the cluster of the closest centre.
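The two alternating steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the scikit-learn implementation: the function name kmeans_sketch, the choice of drawing the initial centres from the data, and the synthetic array X_demo are all made up for this example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-means loop; not the scikit-learn implementation."""
    rng = np.random.default_rng(seed)
    # Assumption: the initial centres are k distinct points drawn from the data.
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 1: assign every point to its closest centre (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centre to the per-feature mean of its cluster.
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                new_centres[j] = members.mean(axis=0)
        # Stop once the centres no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels


# Two well-separated synthetic groups, used purely for illustration.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centres, labels = kmeans_sketch(X_demo, k=2)

# A new point is assigned to the cluster of the closest centre.
x_new = np.array([4.5, 5.2])
new_label = np.linalg.norm(centres - x_new, axis=1).argmin()
print(centres)
print(new_label)
```

On data like this, the new point ends up in the same cluster as the second group, since that group's centre is the closest one.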
Run K-Means with two features

Isolate the features mean_spent and max_spent, then run the K-Means algorithm on the
resulting dataset using K=2 (in sklearn, n_clusters=2) and visualise the result.
PYTHON
from sklearn.cluster import KMeans

# Apply k-means with 2 clusters using a subset of the features
# (mean_spent and max_spent); X is assumed to be the feature matrix
# prepared earlier in the course
Xsub = X[:, 1:3]
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters)

kmeans.fit(Xsub)  # (1)
# use the fitted model to predict what the cluster of each customer should be
cluster_assignment = kmeans.predict(Xsub)  # (2)
cluster_assignment
1. The method fit runs the K-Means algorithm on the data that we pass to it.
2. The method predict returns a cluster label for each sample in the data.
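PYTHON
# Assumption: plotly is used in offline (notebook) mode
from plotly.graph_objs import Layout, Scatter
from plotly.offline import iplot

# Visualise the clusters using a scatter plot or scatterplot matrix if you wish
data = [
    Scatter(
        x=Xsub[cluster_assignment == i, 0],
        y=Xsub[cluster_assignment == i, 1],
        mode='markers',
        name='cluster ' + str(i)
    ) for i in range(n_clusters)
]
layout = Layout(
    # Xsub = X[:, 1:3], and feature 1 is mean_spent later in this module,
    # so column 0 of Xsub is taken to be mean_spent and column 1 max_spent
    xaxis=dict(title='mean_spent'),
    yaxis=dict(title='max_spent'),
    height=600,
)
fig = dict(data=data, layout=layout)
iplot(fig)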
Figure 1. K-Means clustering results with two features.
The two clusters are cleanly separated (they can be split with a straight line).
One cluster contains customers with low spending and the other customers with high spending.
Run K-Means with all the features

Run K-Means using all the available features and visualise the result in the subspace spanned
by mean_spent and max_spent.
PYTHON
# Apply k-means with 2 clusters using all the features
PYTHON
# Adapt the visualisation code accordingly
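One possible way to approach this exercise is sketched below. It is only a sketch under assumptions: the array X_demo is synthetic stand-in data with made-up values (the real X comes from the course dataset), and random_state and n_init are set only to make the run reproducible. As in the two-feature snippet, columns 1 and 2 are taken to play the role of mean_spent and max_spent.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer matrix X (hypothetical values).
rng = np.random.default_rng(0)
X_demo = rng.gamma(shape=2.0, scale=10.0, size=(200, 5))

n_clusters = 2
# random_state is fixed only to make the sketch reproducible.
kmeans_all = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
# fit_predict fits on every column and returns one label per row.
cluster_assignment_all = kmeans_all.fit_predict(X_demo)

# For plotting, keep only the two columns of interest and split them by
# the labels that were learned from all the features.
points_per_cluster = [
    X_demo[cluster_assignment_all == i, 1:3] for i in range(n_clusters)
]
for i, pts in enumerate(points_per_cluster):
    print('cluster', i, 'contains', len(pts), 'customers')
```

Each entry of points_per_cluster can then be passed as the x and y data of one Scatter trace, in the same way as in the two-feature plot.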
This is what you should observe:
Figure 2. K-Means clustering results with all features.
The result is now different. The first cluster contains customers whose maximum spending is
close to the minimum mean spending, and the second contains customers whose maximum
spending is far from the minimum mean spending. This way we can tell apart customers who could
be willing to buy items that cost more than their average spending.
Question: Why can’t the clusters be separated with a line as before?
Compare expenditure between clusters

Select the feature 'mean_spent' (or any feature of your choice) and compare the two clusters
obtained. Can you interpret the output of these commands?
PYTHON
import pandas as pd

# Compare expenditure between clusters
feat = 1  # column index of 'mean_spent' in X
cluster0_desc = pd.DataFrame(X[cluster_assignment == 0, feat],
                             columns=['cluster0']).describe()
cluster1_desc = pd.DataFrame(X[cluster_assignment == 1, feat],
                             columns=['cluster1']).describe()
# Put the two summaries side by side
compare_df = pd.concat((cluster0_desc, cluster1_desc), axis=1)
compare_df
Figure 3. Descriptive statistics of the clusters.
Compare expenditure with box plots

Compare the distribution of the feature mean_spent in the two clusters using a box plot.
PYTHON
from plotly.graph_objs import Box

# Create a boxplot of the two clusters for 'mean_spent'
data = [
    Box(
        y=X[cluster_assignment == i, feat],
        name='cluster' + str(i),
    ) for i in range(n_clusters)
]
layout = Layout(
    xaxis=dict(title="Clusters"),
    yaxis=dict(title="Value"),
    showlegend=False
)
fig = dict(data=data, layout=layout)
iplot(fig)
Figure 4. Boxplot of mean expenditure for each cluster.
Compare the mean expenditure distributions

Use the function create_distplot from FigureFactory to show the distribution of the
mean expenditure in both clusters.
PYTHON
# Compare mean expenditure with a histogram
# Add histogram data
x1 = X[cluster_assignment == 0, feat]
x2 = X[cluster_assignment == 1, feat]
# Group data together
hist_data = [x1, x2]
# Labels follow the 0-based cluster ids used in the rest of the module
group_labels = ['Cluster 0', 'Cluster 1']
fig = FF.create_distplot(hist_data, group_labels, bin_size=.2)
iplot(fig)
Figure 5. Mean expenditure distribution per cluster.
Here we note:
Cluster 0 contains more customers.
Customers in cluster 1 spend more on average.
There is more variability in the behaviour of the customers in cluster 1.
Looking at the centroids

Look at the centroids of the clusters (kmeans.cluster_centers_) and check the values of the
centres for the features 'mean_spent' and 'max_spent'.
PYTHON
# Compare the centroids
We can see that the centres coincide with the means of each cluster in the table above.
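The relationship between the centres and the per-cluster means can also be checked numerically. The following is a minimal sketch on synthetic data: X_demo and its values are made up for illustration, and n_init and random_state are set only for reproducibility. It relies on the run having converged, which is expected on well-separated groups like these.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of "customers" with three features each (made-up values).
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=20.0, scale=3.0, size=(80, 3)),
    rng.normal(loc=60.0, scale=8.0, size=(40, 3)),
])

kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
labels = kmeans_demo.labels_

# Per-cluster mean of every feature, computed directly from the data.
cluster_means = np.vstack([X_demo[labels == i].mean(axis=0) for i in range(2)])

print(kmeans_demo.cluster_centers_)
print(cluster_means)
# Check whether the two arrays agree up to floating-point error.
centres_match = np.allclose(kmeans_demo.cluster_centers_, cluster_means, atol=1e-6)
print(centres_match)
```

With the course data, the same comparison can be made between kmeans.cluster_centers_ and the mean row of the descriptive statistics table.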
Compute the silhouette score

Compute the silhouette score of the clusters resulting from the application of K-Means. The
Silhouette Coefficient is calculated for each sample using the mean intra-cluster distance (a) and the mean
nearest-cluster distance (b), i.e. the mean distance to the points of the closest cluster that the
sample does not belong to. The Silhouette Coefficient for a sample is (b - a) / max(a, b).
It represents how similar a sample is to the samples in its own cluster compared
to samples in other clusters. The silhouette score is the mean of this coefficient over all samples.
The best value is 1 and the worst value is -1. Values near 0
indicate overlapping clusters. Negative values generally indicate that a sample has been
assigned to the wrong cluster, as a different cluster is more similar.
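To make the formula concrete, the coefficient can be computed by hand on a tiny example. This is a sketch for illustration only: the helper silhouette_sample and the four one-dimensional points are made up, and Euclidean distance is assumed.

```python
import numpy as np


def silhouette_sample(X, labels, i):
    """Silhouette coefficient (b - a) / max(a, b) of sample i (illustrative)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # the sample itself is excluded from its own cluster
    if not own.any():
        return 0.0  # convention for a cluster that contains a single sample
    # a: mean distance to the other samples of the same cluster
    a = dists[own].mean()
    # b: mean distance to the samples of the nearest other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


# Four points on a line, split into two obvious clusters.
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])

# For the first point: a = 1 and b = (10 + 11) / 2 = 10.5
scores = np.array([silhouette_sample(X_demo, labels_demo, i) for i in range(4)])
print(scores)
# The overall silhouette score is the mean over all samples.
print(scores.mean())
```

All four coefficients are close to 1 here because each point is much closer to its own cluster than to the other one.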
PYTHON
from sklearn.metrics import silhouette_score

# Compute the silhouette score
print('silhouette_score', silhouette_score(X, cluster_assignment))

> ('silhouette_score', 0.451526633737)
K-Means, pros and cons

Pros:
fast; if your dataset is big, K-Means might be the only option
easy to understand
any unseen point can be assigned to the cluster whose mean is closest to the point
many implementations available
Cons:
you need to guess the number of clusters
clusters can only be globular
the results depend on the initial choice of the means
all the points are assigned to a cluster, so the clusters are affected by noise
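One common way to reduce the guesswork around the number of clusters is to fit K-Means for several values of K and compare the silhouette scores. The sketch below assumes synthetic data with three well-separated groups (X_demo and blob_centres are made up for illustration); on real data the scores are rarely this clear-cut, so the result should be read as a hint rather than a definitive answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic, well-separated groups (made-up values).
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in blob_centres
])

# Fit K-Means for a range of K and record the silhouette score of each run.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Keep the K with the highest silhouette score.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print('best k:', best_k)
```

Setting n_init to a value larger than 1 also addresses the dependence on the initial means: the algorithm is restarted from several initialisations and the best run is kept.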
Comparison of algorithms
The chart below shows the characteristics of different clustering algorithms implemented in
sklearn on simple 2D datasets.
Figure 6. Comparison of different clustering algorithms.
Here we note that K-Means works well on globular clusters, but it doesn’t
produce good results on clusters that have circular or half-moon shapes. In contrast, Linkage
and DBSCAN are able to deal with these kinds of cluster shapes.
The snippet to generate the chart can be found at
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html.
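The half-moon case can be reproduced in a few lines. This is a sketch, not the code behind the chart: the dataset parameters and the DBSCAN settings (eps=0.2, min_samples=5) are assumptions chosen for this example, and the adjusted Rand index is used here only as a convenient way to compare each result with the true moon membership.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half moons; y_true records which moon each point belongs to.
X_moons, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
# DBSCAN groups points that are densely connected; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Adjusted Rand index: 1 means perfect agreement with the true grouping.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print('K-Means ARI:', kmeans_ari)
print('DBSCAN ARI:', dbscan_ari)
print('DBSCAN noise points:', int(np.sum(dbscan_labels == -1)))
```

K-Means splits the plane with a straight boundary and therefore cuts through both moons, whereas DBSCAN follows the dense curved regions.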
Wrap up of Module 3
Clustering is an unsupervised way to generate groups out of your data
Each clustering algorithm has its benefits and pitfalls
Some clustering algorithms, like DBSCAN, have an embedded outlier detection mechanism
The silhouette score can be used to measure how compact and well separated the clusters are
Last updated 2016-11-25 07:29:37 GMT
