Introduction
An important step in data analysis is data exploration and representation. We have
already seen some concepts in Exploratory Data Analysis and how to use them with
both Python and R. In this tutorial we will see how, by combining a technique called
Principal Component Analysis (PCA) with Cluster Analysis, we can represent in a
two-dimensional space data defined in a higher-dimensional one while, at the same
time, grouping the data into clusters of similar samples and finding hidden
relationships in our data.
But enough theory for today. Let's put these ideas into practice using Python and R,
so we understand them better and can apply them in the future.
All the source code for the different parts of this series of tutorials and applications
can be checked at GitHub. Feel free to get involved and share your progress with us!
We invite the reader to repeat the process with the new cases and deaths datasets
and share the results.
For the sake of making the tutorial self-contained, we will repeat here the code that
gets and prepares the datasets, both in Python and R. This tutorial is about exploring
countries, so we will work with datasets where each sample is a country and each
variable is a year.
In R, you use read.csv to read CSV files into data.frame variables. Although the R
function read.csv can work with URLs, https is a problem for R in many cases, so
you need to use a package like RCurl to get around it.
library(RCurl)
Python
So first, we need to download the Google Spreadsheet data as CSV.
import urllib.request
tb_existing_url_csv = 'https://docs.google.com/spreadsheets/d/1X5Jp7Q8pTs3KLJ5JBWKhncVA
local_tb_existing_file = 'tb_existing_100.csv'
existing_f = urllib.request.urlretrieve(tb_existing_url_csv, local_tb_existing_file)
Now that we have it locally, we need to read the CSV file as a data frame.
import pandas as pd
existing_df = pd.read_csv(
local_tb_existing_file,
index_col = 0,
thousands = ',')
existing_df.index.names = ['country']
existing_df.columns.names = ['year']
existing_df.head()
year            1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
country
Afghanistan      436   429   422   415   407   397   397   387   374   373   346   326   304   308   283   267   251   238
Albania           42    40    41    42    42    43    42    44    43    42    40    34    32    32    29    29    26    22
Algeria           45    44    44    43    43    42    43    44    45    46    48    49    50    51    52    53    55    56
American Samoa    42    14     4    18    17    22     0    25    12     8     8     6     5     6     9    11     9     5
Andorra           39    37    35    33    32    30    28    23    24    22    20    20    21    18    19    18    17    19
In order to do so, we will first see how to perform PCA and plot the first two PCs in
both Python and R. We will close the section by analysing the resulting plot and each
of the two PCs.
R
The default R package stats comes with the function prcomp() to perform principal
component analysis. This means that we don't need to install anything (although
there are other options using external packages). This is perhaps the quickest way to
do a PCA, and I recommend calling ?prcomp in your R console if you're interested
in the details of how to fine-tune the PCA process with this function.
The resulting object contains several pieces of information related to principal
component analysis. We are interested in the scores, which we have in pca_existing$x .
We got 18 different principal components. Remember that the total number of PCs
corresponds to the total number of variables in the dataset, although we normally
don't want to use all of them but the subset that corresponds to our purposes.
In our case we will use the first two. How much variation is explained by each one? In
R we can use the plot function that comes with the PCA result for that.
plot(pca_existing)
Most variation is explained by the first PC. So let's use the first two PCs to represent
all of our countries in a scatterplot.
## PC1 PC2
## Afghanistan -3.490274 0.973495650
## Albania 2.929002 0.012141345
## Algeria 2.719073 -0.184591877
## American Samoa 3.437263 0.005609367
## Andorra 3.173621 0.033839606
## Angola -4.695625 1.398306461
Now that we have them in a data frame, we can use them with plot .
Let's set the colour associated with the mean value over all the years. We will use
the functions rgb , colorRamp , and rescale to create a colour palette from yellow
(lower values) to blue (higher values).
library(scales)
ramp <- colorRamp(c("yellow", "blue"))
colours_by_mean <- rgb(
ramp( as.vector(rescale(rowMeans(existing_df),c(0,1)))),
max = 255 )
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=colours_by_mean)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=colours_by_mean)
And finally let's associate it with the difference between first and last year, as a
simple way to measure the change in time.
We have some interesting conclusions already about what PC1 and PC2 code as a
representation of the years from 1990 to 2008. We will explain that right after
showing how to perform dimensionality reduction using Python.
Python
We will use scikit-learn's PCA implementation. When using it, we need to specify in
advance the number of principal components we want to use. Then we can just call the
fit() method with our data frame and check the results.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(existing_df)
This gives us an object we can use to transform our data by calling transform .
existing_2d = pca.transform(existing_df)
In both cases we will end up with a lower dimension representation of our data
frame, as a numPy array. Let's put it in a new data frame.
existing_df_2d = pd.DataFrame(existing_2d)
existing_df_2d.index = existing_df.index
existing_df_2d.columns = ['PC1','PC2']
existing_df_2d.head()
                        PC1         PC2
country
Afghanistan     -732.215864  203.381494
Albania          613.296510    4.715978
Algeria          569.303713  -36.837051
American Samoa   717.082766    5.464696
Andorra          661.802241   11.037736
print(pca.explained_variance_ratio_)
[ 0.91808789 0.060556 ]
We see that the first PC already explains almost 92% of the variance, while the
second one accounts for another 6% for a total of almost 98% between the two of
them.
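As a quick sanity check, we can sum the two ratios printed above to get the cumulative variance explained by the first two components:

```python
import numpy as np

# Values reported by pca.explained_variance_ratio_ above
ratios = np.array([0.91808789, 0.060556])

# Cumulative variance explained by the first two PCs (almost 98%)
print(ratios.sum())
```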
Now we are ready to plot the lower dimensionality version of our dataset. We just
need to call plot on the data frame, by passing the kind of plot we want (see here for
more on plotting data frames) and what columns correspond to each axis. We also
add an annotation loop that tags every point with its country name.
%matplotlib inline
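The original plotting code is not shown here, but a minimal sketch of that scatter plot with the annotation loop could look as follows. Note that existing_df_2d below is a small hypothetical stand-in frame (three countries only, with PC values taken from the table below), so the snippet runs on its own:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless

# Hypothetical stand-in for existing_df_2d
existing_df_2d = pd.DataFrame(
    {"PC1": [-732.215864, 613.296510, 569.303713],
     "PC2": [203.381494, 4.715978, -36.837051]},
    index=["Afghanistan", "Albania", "Algeria"])

# Scatter the first two PCs, then tag every point with its country name
ax = existing_df_2d.plot(kind="scatter", x="PC2", y="PC1", figsize=(16, 8))
for country, row in existing_df_2d.iterrows():
    ax.annotate(country, (row["PC2"], row["PC1"]))
```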
                        PC1         PC2  country_mean  country_mean_scaled
country
Afghanistan     -732.215864  203.381494    353.333333             0.329731
Albania          613.296510    4.715978     36.944444             0.032420
Algeria          569.303713  -36.837051     47.388889             0.042234
American Samoa   717.082766    5.464696     12.277778             0.009240
Andorra          661.802241   11.037736     25.277778             0.021457
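The code that adds the country_mean and country_mean_scaled columns is not shown above. A sketch of that step, using a simple min-max scaling (the original's exact scaling may differ) on a small hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in frames (values abridged from the tables above)
existing_df = pd.DataFrame(
    {"1990": [436, 42, 45], "2007": [238, 22, 56]},
    index=["Afghanistan", "Albania", "Algeria"])
existing_df_2d = pd.DataFrame(
    {"PC1": [-732.2, 613.3, 569.3], "PC2": [203.4, 4.7, -36.8]},
    index=existing_df.index)

# Mean number of cases per country across the available years
existing_df_2d["country_mean"] = existing_df.mean(axis=1)

# Min-max scale the means into [0, 1] so they can be used as point sizes
mean_min = existing_df_2d["country_mean"].min()
mean_max = existing_df_2d["country_mean"].max()
existing_df_2d["country_mean_scaled"] = (
    (existing_df_2d["country_mean"] - mean_min) / (mean_max - mean_min))
print(existing_df_2d[["country_mean", "country_mean_scaled"]])
```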
Now we are ready to plot using this variable size (we will omit the country names this
time since we are not so interested in them).
existing_df_2d.plot(
kind='scatter',
x='PC2',
y='PC1',
s=existing_df_2d['country_mean_scaled']*100,
figsize=(16,8))
And finally let's associate the size with the change between 1990 and 2007. Note that
in the scaled version, those values close to zero will make reference to those with
negative values in the original non-scaled version, since we are scaling to a [0,1]
range.
existing_df_2d['country_change'] = pd.Series(
existing_df['2007']-existing_df['1990'],
index=existing_df_2d.index)
country_change_max = existing_df_2d['country_change'].max()
country_change_min = existing_df_2d['country_change'].min()
country_change_scaled = (
    (existing_df_2d.country_change - country_change_min) / country_change_max)
existing_df_2d['country_change_scaled'] = pd.Series(
country_change_scaled,
index=existing_df_2d.index)
existing_df_2d[['country_change','country_change_scaled']].head()
         country_change  country_change_scaled
country
Algeria              11               1.289916
existing_df_2d.plot(
kind='scatter',
x='PC2', y='PC1',
s=existing_df_2d['country_change_scaled']*100,
figsize=(16,8))
From the plots we have done in Python and R, we can confirm that the most
variation happens along the y axis, which we have assigned to PC1. We saw that the
first PC already explains almost 92% of the variance, while the second one accounts
for another 6% for a total of almost 98% between the two of them. At the very top of
our charts we saw an important concentration of countries, most of them developed.
As we descend along that axis, the countries become sparser, and they belong to less
developed regions of the world.
When colouring/sizing points using two absolute magnitudes such as average and
total number of cases, we can see that the directions also correspond to a variation
in these magnitudes.
Moreover, when using colour/size to code the difference in the number of cases over
time (2007 minus 1990), the colour gradient mostly changed along the direction of the
second principal component, with more positive values (i.e. an increase in the number
of cases) coloured in blue or drawn with a bigger size. That is, while the first PC
captures most of the variation within our dataset and this variation is based on the
total cases in the 1990-2007 range, the second PC is largely affected by the change
over time.
In the next section we will try to discover other relationships between countries.
When using k-means, we need to determine the right number of groups for our case.
This can be done more or less accurately by iterating through different values for the
number of groups and comparing a quantity called the within-cluster sum of squared
distances for each iteration. This is the squared sum of distances to the cluster
centre for each cluster member. Of course this distance is minimal when the number
of clusters equals the number of samples, but we don't want to get there. We
normally stop when the improvement in this value starts decreasing at a lower rate.
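A minimal sketch of that elbow-style iteration with scikit-learn, whose KMeans exposes the within-cluster sum of squared distances as inertia_ (the data here is a synthetic stand-in with three obvious blobs, not our TB data frame):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three loose blobs on the plane
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squared distances for a range of candidate k values;
# the improvement flattens out once k reaches the true number of groups
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}
for k in sorted(wss):
    print(k, round(wss[k], 1))
```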
However, we will use a more intuitive approach based on our understanding of the
world situation and the nature of the results that we want to achieve. Sometimes
this is the way to go in data analysis, especially when doing exploration tasks.
Using the knowledge we have about the nature of our data is always a good thing to do.
Let's start with k=3, assuming that, at the very least, there are countries in a really
bad situation, countries in a good situation, and some in between.
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 3)
Let's colour our previous scatter plot based on what cluster each country belongs to.
Most clusters are based on the first PC. That means the clusters are defined just in
terms of the total number of cases per 100K, and not in terms of how the data evolved
over time (PC2). So let's try with k=4 and see if some of these clusters are refined
in the direction of the second PC.
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 4)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
There is more refinement, but again in the direction of the first PC. Let's try then with
k=5 .
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 5)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
There we have it. Right in the middle we have a cluster that has been split in two
different ones in the direction of the second PC. What if we try with k=6 ?
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 6)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
We get a diagonal split in the second top cluster. That surely contains some
interesting information, but let's revert to our k=5 case; later on we will see how
to use a different refinement process when clusters are too tight, like the ones we
have at the top of the plot.
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 5)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
Python
Again we will use sklearn , in this case its k-means clustering implementation, in
order to perform our clustering on the TB data. Since we already decided on a
number of clusters of 5, we will use it here straightaway.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit(existing_df)
Now we need to store the cluster assignments together with each country in our
data frame. The cluster labels are returned in clusters.labels_ .
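A sketch of that step (the tutorial stores the labels alongside the 2D frame; here we attach them to a small hypothetical stand-in frame so the snippet runs on its own, and we use 2 clusters to keep the example tiny):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for existing_df: 4 countries x 3 years of cases
existing_df = pd.DataFrame(
    {"1990": [436, 42, 45, 514],
     "1991": [429, 40, 44, 514],
     "1992": [422, 41, 44, 513]},
    index=["Afghanistan", "Albania", "Algeria", "Cambodia"])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit(existing_df)

# Attach each country's cluster assignment as a new column
existing_df["cluster"] = pd.Series(clusters.labels_, index=existing_df.index)
print(existing_df["cluster"].to_dict())
```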
And now we are ready to plot, using the cluster column as color.
existing_df_2d.plot(
    kind='scatter',
    x='PC2', y='PC1',
    c=existing_df_2d.cluster.astype(float),
    figsize=(16,8))
The result is pretty much like the one obtained with R, apart from the colour
differences and without the country names, which we decided not to include here so
we can better see the colours. In the next section we will analyse each cluster in
detail.
Cluster Interpretation
Most of the work in this section is about data frame indexing and plotting. There isn't
anything sophisticated about the code itself, so we will pick up one of our languages
and perform the whole thing (we will use R this time).
In order to analyse each cluster, let's add a column in our data frame containing the
cluster ID. We will use that for subsetting.
##
## 1 2 3 4 5
## 16 30 20 51 90
The last line shows how many countries we have in each cluster.
Cluster 1
rownames(subset(existing_df, cluster==1))
existing_clustering$centers[1,]
These are by far the countries with the most tuberculosis cases every year. We can
see in the chart that this is the top line, although the number of cases descends
progressively.
Cluster 2
rownames(subset(existing_df, cluster==2))
existing_clustering$centers[2,]
It is a relatively large cluster. These are still countries with lots of cases, but
definitely fewer than in the first cluster. We see countries such as India and China
here, the largest countries on Earth (from a previous tutorial we know that China
itself has reduced its cases by 85%), and American countries such as Peru and Bolivia.
In fact, this is the cluster with the fastest decrease in the number of existing
cases, as we see in the line chart.
Cluster 3
existing_clustering$centers[3,]
## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999
## 259.85 278.90 287.30 298.05 309.00 322.95 335.00 357.65 369.65 410.85
## X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 422.25 463.75 492.45 525.25 523.60 519.90 509.80 513.50
This is the only cluster where the number of cases has increased over the years, and
it is about to overtake the first position by 2007. Each of these countries is
probably in the middle of a humanitarian crisis and probably being affected by other
infectious diseases such as HIV. We can confirm here that PC2 is mostly coding exactly
that: the percentage of variation over time in the number of existing cases.
Cluster 4
existing_clustering$centers[4,]
This cluster is pretty close to the last and largest one. It contains many American
countries, some European countries, etc. Some of them are large and rich, such as
Russia or Brazil. Structurally, the difference from the countries in Cluster 5 may
reside in a larger number of cases per 100K. They also seem to be decreasing the
number of cases slightly faster than Cluster 5. These two reasons made k-means place
them in a different group.
Cluster 5
This cluster is too large and heterogeneous and probably needs further refinement.
However, it is a good grouping when compared to other, more distant clusters. In any
case, it contains the countries with the fewest existing cases in our set.
So let's do just that quickly. Let's re-cluster the 90 countries in our Cluster 5 in
order to refine them further. As the number of clusters, let's use 2. We are just
interested in seeing whether there are actually two different clusters within
Cluster 5. The reader can of course try to go further and use more than 2 centers.
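The tutorial performs this refinement in R; a Python sketch of the same two-step idea, on a small hypothetical subset standing in for Cluster 5, could look like this:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the Cluster 5 subset (low-prevalence countries)
cluster5_df = pd.DataFrame(
    {"1990": [39, 42, 45, 10, 12, 9],
     "2007": [19, 22, 56, 9, 11, 8]},
    index=["Andorra", "Albania", "Algeria", "Sweden", "Norway", "Iceland"])

# Re-cluster only this subset with 2 centers to look for finer structure
sub_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cluster5_df)
cluster5_df["cluster"] = sub_kmeans.labels_

# Inspect the two sub-groups' average profiles
print(cluster5_df.groupby("cluster").mean())
```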
Now we can plot them in order to see if there are actual differences.
There are actually different tendencies in our data. We can see that there is a group
of countries in our original Cluster 5 that is decreasing the number of cases at a
faster rate, trying to catch up with those countries with a lower number of existing
TB cases per 100K.
rownames(subset(cluster5_df, cluster5_df$cluster==2))
The remaining countries are those with fewer cases and also a slower decreasing rate.
However, we likely wouldn't obtain these clusters by just increasing the number of
centers by one in our first clustering process on the original dataset. As we said,
Cluster 5 seemed like a very cohesive group when compared with more distant countries.
This two-step clustering process is a useful technique that we can use with any
dataset we want to explore.
Conclusions
I hope you enjoyed this data exploration as much as I did. We have seen how to use
Python and R to perform Principal Component Analysis and Clustering on our
Tuberculosis data. Although we don't advocate the use of one platform over the
other, sometimes it's easier to perform some tasks in one of them. Again we see that
the programming interface in Python is more consistent, clear, and modular (this will
become clearer after more tutorials comparing Python and R). However, R has been
built around performing statistics and analytics tasks, and more often than not we
will have a way of doing something in just one or two function calls.
Regarding PCA and k-means clustering, the first technique allowed us to plot the
distribution of all the countries in a two-dimensional space based on the evolution
of their number of cases over a range of 18 years. By doing so we saw how the total
number of cases mostly defines the first principal component (i.e. the direction of
largest variation), while the percentage of change over time affects the second
component.
Then we used k-means clustering to group countries by how similar their number of
cases in each year are. By doing so we uncovered some interesting relationships
and, most importantly, better understood the world situation regarding this
important disease. We saw how most countries improved in the time lapse we
considered, but we were also able to discover a group of countries with a high
prevalence of the disease that, far from improving their situation, are increasing the
number of cases.
Remember that all the source code for the different parts of this series of tutorials
and applications can be checked at GitHub. Feel free to get involved and share your
progress with us!
Jose A Dianes
Machine Learning & Data Analytics - Computer Science PhD - data.jadianes.com