Introduction
An important step in data analysis is data exploration and representation. We have
already seen some concepts in Exploratory Data Analysis and how to use them with
both Python and R. In this tutorial we will see how, by combining a technique called
Principal Component Analysis (PCA) with Cluster Analysis, we can represent in a
two-dimensional space data defined in a higher-dimensional one while, at the same
time, grouping the data into clusters of similar samples and finding hidden
relationships in our data.
But enough theory for today. Let's put these ideas into practice using Python and R,
so we understand them better and can apply them in the future.
All the source code for the different parts of this series of tutorials and applications
can be checked at GitHub. Feel free to get involved and share your progress with us!
We invite the reader to repeat the process with the new cases and deaths datasets
and share the results.
For the sake of making the tutorial self-contained, we will repeat here the code that
gets and prepares the datasets, both in Python and R. This tutorial is about exploring
countries, so we will work with datasets where each sample is a country and each
variable is a year.
In R, you use read.csv to read CSV files into data.frame variables. Although the R
function read.csv can work with URLs, https is a problem for R in many cases, so
you need to use a package like RCurl to get around it.
library(RCurl)
Python
So first, we need to download the Google Spreadsheet data as CSV.
import urllib.request
tb_existing_url_csv = 'https://docs.google.com/spreadsheets/d/1X5Jp7Q8pTs3KLJ5JBWKhncVA
local_tb_existing_file = 'tb_existing_100.csv'
existing_f = urllib.request.urlretrieve(tb_existing_url_csv, local_tb_existing_file)
Now that we have it locally, we need to read the CSV file as a data frame.
import pandas as pd
existing_df = pd.read_csv(
local_tb_existing_file,
index_col = 0,
thousands = ',')
existing_df.index.names = ['country']
existing_df.columns.names = ['year']
existing_df.head()
year            1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007
country
Afghanistan      436   429   422   415   407   397   397   387   374   373   346   326   304   308   283   267   251   238
Albania           42    40    41    42    42    43    42    44    43    42    40    34    32    32    29    29    26    22
Algeria           45    44    44    43    43    42    43    44    45    46    48    49    50    51    52    53    55    56
American Samoa    42    14     4    18    17    22     0    25    12     8     8     6     5     6     9    11     9     5
Andorra           39    37    35    33    32    30    28    23    24    22    20    20    21    18    19    18    17    19
In order to do so, we will first see how to perform PCA and plot the first two PCs in
both Python and R. We will close the section by analysing the resulting plot and each
of the two PCs.
R
The default R package stats comes with the function prcomp() to perform principal
component analysis. This means that we don't need to install anything (although
there are other options using external packages). This is perhaps the quickest way to
do a PCA, and I recommend calling ?prcomp in your R console if you're interested
in the details of how to fine-tune the PCA process with this function.
The resulting object contains several pieces of information related to principal
component analysis. We are interested in the scores, which we have in pca_existing$x .
We got 18 different principal components. Remember that the total number of PCs
corresponds to the total number of variables in the dataset, although we normally
don't want to use all of them but the subset that corresponds to our purposes.
In our case we will use the first two. How much variation is explained by each one? In
R we can use the plot function that comes with the PCA result for that.
plot(pca_existing)
Most variation is explained by the first PC. So let's use the first two PCs to represent
all of our countries in a scatterplot.
## PC1 PC2
## Afghanistan -3.490274 0.973495650
## Albania 2.929002 0.012141345
## Algeria 2.719073 -0.184591877
## American Samoa 3.437263 0.005609367
## Andorra 3.173621 0.033839606
## Angola -4.695625 1.398306461
Now that we have them in a data frame, we can use them with plot .
Let's set the colour associated with the mean value over all the years. We will use
the functions rgb , colorRamp , and rescale to create a colour palette from yellow
(lower values) to blue (higher values).
library(scales)
ramp <- colorRamp(c("yellow", "blue"))
colours_by_mean <- rgb(
ramp( as.vector(rescale(rowMeans(existing_df),c(0,1)))),
max = 255 )
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=colours_by_mean)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=colours_by_mean)
And finally let's associate it with the difference between first and last year, as a
simple way to measure the change in time.
We have some interesting conclusions already about what PC1 and PC2 code as a
representation of the years from 1990 to 2008. We will explain that right after
showing how to perform dimensionality reduction using Python.
Python
We will use scikit-learn's PCA implementation. When using it, we need to specify in
advance the number of principal components we want to use. Then we can just call the
fit() method with our data frame and check the results.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(existing_df)
This gives us an object we can use to transform our data by calling transform .
existing_2d = pca.transform(existing_df)
In both cases we will end up with a lower dimension representation of our data
frame, as a numPy array. Let's put it in a new data frame.
existing_df_2d = pd.DataFrame(existing_2d)
existing_df_2d.index = existing_df.index
existing_df_2d.columns = ['PC1','PC2']
existing_df_2d.head()
                        PC1         PC2
country
Afghanistan     -732.215864  203.381494
Albania          613.296510    4.715978
Algeria          569.303713  -36.837051
American Samoa   717.082766    5.464696
Andorra          661.802241   11.037736
print(pca.explained_variance_ratio_)
[ 0.91808789 0.060556 ]
We see that the first PC already explains almost 92% of the variance, while the
second one accounts for another 6% for a total of almost 98% between the two of
them.
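As a quick sanity check, we can sum the two ratios printed above to get the cumulative variance explained by the first two components:

```python
import numpy as np

# Values reported by pca.explained_variance_ratio_ above
ratios = np.array([0.91808789, 0.060556])

# Cumulative variance explained by the first two PCs (almost 98%)
print(ratios.sum())
```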
Now we are ready to plot the lower dimensionality version of our dataset. We just
need to call plot on the data frame, by passing the kind of plot we want (see here for
more on plotting data frames) and what columns correspond to each axis. We also
add an annotation loop that tags every point with its country name.
%matplotlib inline
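The original plotting code is not shown here, but a minimal sketch of that scatter plot with the annotation loop could look as follows. Note that existing_df_2d below is a small hypothetical stand-in frame (three countries only, with PC values taken from the table below), so the snippet runs on its own:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless

# Hypothetical stand-in for existing_df_2d
existing_df_2d = pd.DataFrame(
    {"PC1": [-732.215864, 613.296510, 569.303713],
     "PC2": [203.381494, 4.715978, -36.837051]},
    index=["Afghanistan", "Albania", "Algeria"])

# Scatter the first two PCs, then tag every point with its country name
ax = existing_df_2d.plot(kind="scatter", x="PC2", y="PC1", figsize=(16, 8))
for country, row in existing_df_2d.iterrows():
    ax.annotate(country, (row["PC2"], row["PC1"]))
```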
                        PC1         PC2  country_mean  country_mean_scaled
country
Afghanistan     -732.215864  203.381494    353.333333             0.329731
Albania          613.296510    4.715978     36.944444             0.032420
Algeria          569.303713  -36.837051     47.388889             0.042234
American Samoa   717.082766    5.464696     12.277778             0.009240
Andorra          661.802241   11.037736     25.277778             0.021457
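The code that adds the country_mean and country_mean_scaled columns is not shown above. A sketch of that step, using a simple min-max scaling (the original's exact scaling may differ) on a small hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in frames (values abridged from the tables above)
existing_df = pd.DataFrame(
    {"1990": [436, 42, 45], "2007": [238, 22, 56]},
    index=["Afghanistan", "Albania", "Algeria"])
existing_df_2d = pd.DataFrame(
    {"PC1": [-732.2, 613.3, 569.3], "PC2": [203.4, 4.7, -36.8]},
    index=existing_df.index)

# Mean number of cases per country across the available years
existing_df_2d["country_mean"] = existing_df.mean(axis=1)

# Min-max scale the means into [0, 1] so they can be used as point sizes
mean_min = existing_df_2d["country_mean"].min()
mean_max = existing_df_2d["country_mean"].max()
existing_df_2d["country_mean_scaled"] = (
    (existing_df_2d["country_mean"] - mean_min) / (mean_max - mean_min))
print(existing_df_2d[["country_mean", "country_mean_scaled"]])
```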
Now we are ready to plot using this variable size (we will omit the country names this
time since we are not so interested in them).
existing_df_2d.plot(
kind='scatter',
x='PC2',
y='PC1',
s=existing_df_2d['country_mean_scaled']*100,
figsize=(16,8))
And finally let's associate the size with the change between 1990 and 2007. Note that
in the scaled version, those values close to zero will make reference to those with
negative values in the original non-scaled version, since we are scaling to a [0,1]
range.
existing_df_2d['country_change'] = pd.Series(
existing_df['2007']-existing_df['1990'],
index=existing_df_2d.index)
country_change_max = existing_df_2d['country_change'].max()
country_change_min = existing_df_2d['country_change'].min()
country_change_scaled = (
    (existing_df_2d.country_change - country_change_min) / country_change_max)
existing_df_2d['country_change_scaled'] = pd.Series(
country_change_scaled,
index=existing_df_2d.index)
existing_df_2d[['country_change','country_change_scaled']].head()
         country_change  country_change_scaled
country
Algeria              11               1.289916
existing_df_2d.plot(
kind='scatter',
x='PC2', y='PC1',
s=existing_df_2d['country_change_scaled']*100,
figsize=(16,8))
From the plots we have done in Python and R, we can confirm that the most
variation happens along the y axis, which we have assigned to PC1. We saw that the
first PC already explains almost 92% of the variance, while the second one accounts
for another 6% for a total of almost 98% between the two of them. At the very top of
our charts we saw an important concentration of countries, most of them developed.
As we descend along that axis, the countries become sparser, and they belong to less
developed regions of the world.
When colouring/sizing points using two absolute magnitudes such as average and
total number of cases, we can see that the directions also correspond to a variation
in these magnitudes.
Moreover, when using colour/size to code the difference in the number of cases over
time (2007 minus 1990), the colour gradient mostly changed along the direction of the
second principal component, with more positive values (i.e. an increase in the number
of cases) coloured in blue or drawn with a bigger size. That is, while the first PC
captures most of the variation within our dataset and this variation is based on the
total cases in the 1990-2007 range, the second PC is largely affected by the change
over time.
In the next section we will try to discover other relationships between countries.
When using k-means, we need to determine the right number of groups for our case.
This can be done more or less accurately by iterating through different values for the
number of groups and comparing a quantity called the within-cluster sum of squared
distances for each iteration. This is the squared sum of distances to the cluster
centre for each cluster member. Of course this distance is minimal when the number
of clusters equals the number of samples, but we don't want to get there. We
normally stop when the improvement in this value starts decreasing at a lower rate.
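A minimal sketch of that elbow-style iteration with scikit-learn, whose KMeans exposes the within-cluster sum of squared distances as inertia_ (the data here is a synthetic stand-in with three obvious blobs, not our TB data frame):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three loose blobs on the plane
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squared distances for a range of candidate k values;
# the improvement flattens out once k reaches the true number of groups
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}
for k in sorted(wss):
    print(k, round(wss[k], 1))
```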
However, we will use a more intuitive approach based on our understanding of the
world situation and the nature of the results that we want to achieve. Sometimes
this is the way to go in data analysis, especially when doing exploration tasks.
Using the knowledge we have about the nature of our data is always a good thing to do.
Let's start with k=3, assuming that, at the very least, there are countries in a really
bad situation, countries in a good situation, and some in between.
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 3)
Let's colour our previous scatter plot based on what cluster each country belongs to.
Most clusters are based on the first PC. That means the clusters are defined just in
terms of the total number of cases per 100K, and not in terms of how the data evolved
over time (PC2). So let's try with k=4 and see if some of these clusters are refined
in the direction of the second PC.
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 4)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
There is more refinement, but again in the direction of the first PC. Let's try then with
k=5 .
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 5)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
There we have it. Right in the middle we have a cluster that has been split in two
different ones in the direction of the second PC. What if we try with k=6 ?
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 6)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
We get a diagonal split in the second top cluster. That surely contains some
interesting information, but let's revert to our k=5 case; later on we will see how
to use a different refinement process when clusters are too tight, like the ones we
have at the top of the plot.
set.seed(1234)
existing_clustering <- kmeans(existing_df, centers = 5)
existing_cluster_groups <- existing_clustering$cluster
plot(PC1~PC2, data=scores_existing_df,
main= "Existing TB cases per 100K distribution",
cex = .1, lty = "solid", col=existing_cluster_groups)
text(PC1~PC2, data=scores_existing_df,
labels=rownames(existing_df),
cex=.8, col=existing_cluster_groups)
Python
Again we will use sklearn , in this case its k-means clustering implementation, in
order to perform our clustering on the TB data. Since we already decided on a
number of clusters of 5, we will use it here straightaway.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit(existing_df)
Now we need to store the cluster assignments together with each country in our
data frame. The cluster labels are returned in clusters.labels_ .
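A sketch of that step (the tutorial stores the labels alongside the 2D frame; here we attach them to a small hypothetical stand-in frame so the snippet runs on its own, and we use 2 clusters to keep the example tiny):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for existing_df: 4 countries x 3 years of cases
existing_df = pd.DataFrame(
    {"1990": [436, 42, 45, 514],
     "1991": [429, 40, 44, 514],
     "1992": [422, 41, 44, 513]},
    index=["Afghanistan", "Albania", "Algeria", "Cambodia"])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit(existing_df)

# Attach each country's cluster assignment as a new column
existing_df["cluster"] = pd.Series(clusters.labels_, index=existing_df.index)
print(existing_df["cluster"].to_dict())
```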
And now we are ready to plot, using the cluster column as color.
existing_df_2d.plot(
    kind='scatter',
    x='PC2', y='PC1',
    c=existing_df_2d.cluster.astype(float),
    figsize=(16,8))
The result is pretty much like the one obtained with R, apart from the colour
differences and without the country names, which we decided not to include here so
we can better see the colours. In the next section we will analyse each cluster in
detail.
Cluster Interpretation
Most of the work in this section is about data frame indexing and plotting. There isn't
anything sophisticated about the code itself, so we will pick up one of our languages
and perform the whole thing (we will use R this time).
In order to analyse each cluster, let's add a column in our data frame containing the
cluster ID. We will use that for subsetting.
##
## 1 2 3 4 5
## 16 30 20 51 90
The last line shows how many countries we have in each cluster.
Cluster 1
rownames(subset(existing_df, cluster==1))
existing_clustering$centers[1,]
These are by far the countries with the most tuberculosis cases every year. We can
see in the chart that this is the top line, although the number of cases descends
progressively.
Cluster 2
rownames(subset(existing_df, cluster==2))
existing_clustering$centers[2,]
It is a relatively large cluster. These are still countries with lots of cases, but
definitely fewer than in the first cluster. We see countries such as India and China
here, the largest countries on Earth (from a previous tutorial we know that China
itself has reduced its cases by 85%), and American countries such as Peru and Bolivia.
In fact, this is the cluster with the fastest decrease in the number of existing
cases, as we see in the line chart.
Cluster 3
existing_clustering$centers[3,]
## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999
## 259.85 278.90 287.30 298.05 309.00 322.95 335.00 357.65 369.65 410.85
## X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 422.25 463.75 492.45 525.25 523.60 519.90 509.80 513.50
This is the only cluster where the number of cases has increased over the years, and
it is about to overtake the first position by 2007. Each of these countries is
probably in the middle of a humanitarian crisis and probably being affected by other
infectious diseases such as HIV. We can confirm here that PC2 is mostly coding exactly
that: the percentage of variation over time in the number of existing cases.
Cluster 4
existing_clustering$centers[4,]
This cluster is pretty close to the last and largest one. It contains many American
countries, some European countries, etc. Some of them are large and rich, such as
Russia or Brazil. Structurally, the difference from the countries in Cluster 5 may
reside in a larger number of cases per 100K. They also seem to be decreasing the
number of cases slightly faster than Cluster 5. These two reasons made k-means place
them in a different group.
Cluster 5
This cluster is too large and heterogeneous and probably needs further refinement.
However, it is a good grouping when compared to other, more distant clusters. In any
case, it contains the countries with the fewest existing cases in our set.
So let's do just that quickly. Let's re-cluster the 90 countries in our Cluster 5 in
order to refine them further. As the number of clusters, let's use 2. We are just
interested in seeing whether there are actually two different clusters within
Cluster 5. The reader can of course try to go further and use more than 2 centers.
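The tutorial performs this refinement in R; a Python sketch of the same two-step idea, on a small hypothetical subset standing in for Cluster 5, could look like this:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the Cluster 5 subset (low-prevalence countries)
cluster5_df = pd.DataFrame(
    {"1990": [39, 42, 45, 10, 12, 9],
     "2007": [19, 22, 56, 9, 11, 8]},
    index=["Andorra", "Albania", "Algeria", "Sweden", "Norway", "Iceland"])

# Re-cluster only this subset with 2 centers to look for finer structure
sub_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cluster5_df)
cluster5_df["cluster"] = sub_kmeans.labels_

# Inspect the two sub-groups' average profiles
print(cluster5_df.groupby("cluster").mean())
```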
Now we can plot them in order to see if there are actual differences.
There are actually different tendencies in our data. We can see that there is a group
of countries in our original Cluster 5 that is decreasing the number of cases at a
faster rate, trying to catch up with those countries with a lower number of existing
TB cases per 100K.
rownames(subset(cluster5_df, cluster5_df$cluster==2))
The remaining countries are those with fewer cases and also a slower decreasing rate.
However, we likely wouldn't obtain these clusters by just increasing the number of
centers by one in our first clustering process on the original dataset. As we said,
Cluster 5 seemed like a very cohesive group when compared with more distant countries.
This two-step clustering process is a useful technique that we can use with any
dataset we want to explore.
Conclusions
I hope you enjoyed this data exploration as much as I did. We have seen how to use
Python and R to perform Principal Component Analysis and Clustering on our
Tuberculosis data. Although we don't advocate the use of one platform over the
other, sometimes it's easier to perform some tasks in one of them. Again we see that
the programming interface in Python is more consistent, clear, and modular (this will
become clearer after more tutorials comparing Python and R). However, R has been
built around performing statistics and analytics tasks, and more often than not we
will have a way of doing something in just one or two function calls.
Regarding PCA and k-means clustering, the first technique allowed us to plot the
distribution of all the countries in a two-dimensional space based on the evolution
of their number of cases over a range of 18 years. By doing so we saw how the total
number of cases mostly defines the first principal component (i.e. the direction of
largest variation), while the percentage of change over time affects the second
component.
Then we used k-means clustering to group countries by how similar their number of
cases in each year are. By doing so we uncovered some interesting relationships
and, most importantly, better understood the world situation regarding this
important disease. We saw how most countries improved in the time lapse we
considered, but we were also able to discover a group of countries with a high
prevalence of the disease that, far from improving their situation, are increasing the
number of cases.
Remember that all the source code for the different parts of this series of tutorials
and applications can be checked at GitHub. Feel free to get involved and share your
progress with us!
Jose A Dianes
Machine Learning & Data Analytics - Computer Science PhD - data.jadianes.com