
Homework 4

36-350: Data Mining


SOLUTIONS
I loaded the data and transposed it thus:
library(ElemStatLearn)
data(nci)
nci.t = t(nci)
This means that rownames(nci.t) contains the cell classes, or equivalently
colnames(nci). I'll use that a lot, so I'll abbreviate it: nci.classes = rownames(nci.t).
1. Similarity searching for cells
(a) How many features are there in the raw data? Are they discrete or
continuous? (If there are some of each, which is which?) If some are
continuous, are they necessarily positive?
Answer: This depends on whether the cell classes are counted as
features or not; you could defend either answer. If you don't count
the cell class, there are 6830 continuous features. Since they are
logarithms, they do not have to be positive, and indeed they are not.
If you do count the class, there are 6831 features, one of which is
discrete (the class).
I am inclined not to count the class as really being a feature, since
when we get new data it may not have the label.
(b) Describe a method for finding the k cells whose expression profiles
are most similar to that of a given cell. Carefully explain what you
mean by similar. How could you evaluate the success of the search
algorithm?
Answer: The simplest thing is to call two cells similar if their vectors of expression levels have a small Euclidean distance. The most
similar cells are the ones whose expression-profile vectors are closest,
geometrically, to the target expression profile vector.
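As a concrete sketch (the function name and interface here are mine, not part of the assignment), the search is just a sort on Euclidean distances:

# Sketch: indices of the k cells most similar to cell i, by Euclidean
# distance between expression profiles (rows of data are cells)
nearest.cells <- function(i, data, k) {
  dists <- sqrt(rowSums(sweep(data, 2, data[i, ])^2))
  # drop the cell itself (distance 0), keep the k closest others
  order(dists)[2:(k + 1)]
}
nearest.cells(1, nci.t, 5)  # the 5 cells most like cell 1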
There are several ways the search could be evaluated. One would be
to use the given cell labels as an indication of whether the search results are relevant. That is, one would look at the k closest matches
and see (a) what fraction of them are of the same type as the target, which is the precision, and (b) what fraction of cells of that type
are in the search results (the recall). Plotting the precision and recall
against k gives an indication of how efficient the search is. This could
then be averaged over all the labeled cells.
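A sketch of that evaluation, reusing the nearest.cells function above (again, the names are mine):

# Precision and recall of the k nearest neighbors of cell i, using the
# class labels as relevance judgments
search.precision.recall <- function(i, data, classes, k) {
  hits <- classes[nearest.cells(i, data, k)]
  precision <- mean(hits == classes[i])
  # recall: fraction of same-class cells (besides the target) retrieved
  recall <- sum(hits == classes[i]) / (sum(classes == classes[i]) - 1)
  c(precision = precision, recall = recall)
}

Averaging these over all cells i, for each k, gives the curves described above.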
Normalizing the vectors by the Euclidean length is possible here,
though a little tricky because the features are the logarithms of the
concentration levels, so some of them can be negative numbers, meaning that very little of that gene is being expressed. Also, the reason
for normalizing bag-of-words vectors by length was the idea that two
documents might have similar meanings or subjects even if they had
different sizes. A cell which is expressing all the genes at a low level,
however, might well be very different from one which is expressing
the same genes at a high level.
(c) Would it make sense to weight genes by inverse cell frequency?
Explain.
Answer: No; or at least, I don't see how. Every gene is expressed to
some degree by every cell (otherwise the logarithm of its concentration
would be −∞), so every gene would have an inverse cell frequency of
zero. You could fix a threshold and just count whether the expression
level is over the threshold, but this would be pretty arbitrary.
However, a related idea would be to scale genes' expression levels by
something which indicates how much they vary across cells, such as
the range or the variance. Since we have labeled cell types here, we
could even compute the average expression level for each type and
then scale genes by the variance of these type-averages. That is, genes
with a small variance (or range, etc.) should get small weights, since
they are presumably uninformative, and genes with a large variance
in expression level should get larger weights.
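A minimal sketch of this weighting idea, using the standard deviation as the weight (the variable names are mine):

# Weight each gene (column) by its spread across cells, so that
# low-variance, presumably uninformative genes count for less
gene.weights <- apply(nci.t, 2, sd)   # could use the range, etc., instead
nci.weighted <- sweep(nci.t, 2, gene.weights, "*")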

2. k-means clustering Use the kmeans function in R to cluster the cells, with
k = 14 (to match the number of true classes). Repeat this three times to
get three different clusterings.
Answer:
clusters.1 = kmeans(nci.t,centers=14)$cluster
clusters.2 = kmeans(nci.t,centers=14)$cluster
clusters.3 = kmeans(nci.t,centers=14)$cluster
I'm only keeping the cluster assignments, because that's all this problem
calls for.
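(An aside: kmeans depends on its random initialization, which is why the three runs below differ. If we wanted one good clustering rather than three different ones, the built-in nstart argument keeps the best of several random starts:

clusters.best = kmeans(nci.t, centers=14, nstart=25)$cluster

That is not what this problem asks for, though.)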
(a) Say that k-means makes a lumping error whenever it assigns two
cells of different classes to the same cluster, and a splitting error
when it puts two cells of the same class in different clusters. Each
pair of cells can give an error. (With n objects, there are n(n − 1)/2
distinct pairs.)
i. Write a function which takes as inputs a vector of classes and
a vector of clusters, and gives as output the number of lumping
errors. Test your function by verifying that when the classes
are (1, 2, 2), the three clusterings (1, 2, 2), (2, 1, 1) and (4, 5, 6)
all give zero lumping errors, but the clustering (1, 1, 1) gives two
lumping errors.
Answer: The most brute-force way (Code 1) uses nested for
loops. This works, but it's not the nicest approach. (1) for
loops, let alone nested loops, are slow and inefficient in R.
(2) We only care about pairs which are in the same cluster, but we waste
time checking all pairs. (3) It's a bit obscure what we're really
doing.
Code 2 is faster, cleaner, and more R-ish, using the utility function outer, which (rapidly!) applies a function to pairs of objects constructed from its first two arguments, returning a matrix. Here we're seeing which pairs of objects belong to different
classes, and which to the same clusters. (The solutions to the first
homework include a version of this trick.) We restrict ourselves
to pairs in the same cluster and count how many are of different
classes. We divide by two because outer produces rectangular
matrices and works with ordered pairs, so (i, j) ≠ (j, i), but the
problem asks for unordered pairs.
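To see outer at work on a tiny example (the expected result is shown as comments):

outer(c(1, 2, 2), c(1, 2, 2), "!=")
#       [,1]  [,2]  [,3]
# [1,] FALSE  TRUE  TRUE
# [2,]  TRUE FALSE FALSE
# [3,]  TRUE  TRUE FALSE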
> count.lumping.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.lumping.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.lumping.errors(c(1,2,2),c(4,5,6))
[1] 0
# Count number of errors made by lumping different classes into
# one cluster
# Uses explicit, nested iteration
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster
count.lumping.errors.1 <- function(true.class, guessed.cluster) {
  n = length(true.class)
  # Check that the two vectors have the same length!
  stopifnot(n == length(guessed.cluster))
  # Note: don't require them to have the same data type
  # Initialize counter
  num.lumping.errors = 0
  # check all distinct pairs
  for (i in 1:(n-1)) {
    # think of above-diagonal entries in a matrix
    for (j in (i+1):n) {
      if ((true.class[i] != true.class[j])
          & (guessed.cluster[i] == guessed.cluster[j])) {
        # lumping error: really different but put in same cluster
        num.lumping.errors = num.lumping.errors + 1
      }
    }
  }
  return(num.lumping.errors)
}
Code Example 1: The first way to count lumping errors.

# Count number of errors made by lumping different classes into
# one cluster
# Vectorized
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to different classes
#   lumped into one cluster
count.lumping.errors <- function(true.class, guessed.cluster) {
  n = length(true.class)
  stopifnot(n == length(guessed.cluster))
  different.classes = outer(true.class, true.class, "!=")
  same.clusters = outer(guessed.cluster, guessed.cluster, "==")
  num.lumping.errors = sum(different.classes[same.clusters])/2
  return(num.lumping.errors)
}
Code Example 2: Second approach to counting lumping errors.

# Count number of errors made by splitting a single class into
# different clusters
# Input: vector of true classes, vector of guessed clusters
# Output: number of distinct pairs belonging to the same class
#   split into different clusters
count.splitting.errors <- function(true.class, guessed.cluster) {
  n = length(true.class)
  stopifnot(n == length(guessed.cluster))
  same.classes = outer(true.class, true.class, "==")
  different.clusters = outer(guessed.cluster, guessed.cluster, "!=")
  num.splitting.errors = sum(different.clusters[same.classes])/2
  return(num.splitting.errors)
}
Code Example 3: Counting splitting errors.
> count.lumping.errors(c(1,2,2),c(1,1,1))
[1] 2
ii. Write a function, with the same inputs as the previous one, that
calculates the number of splitting errors. Test it on the same
inputs. (What should the outputs be?)
Answer: A simple modification of the code from the last part.
Before we can test this, we need to know what the test results
should be. The classes (1, 2, 2) and clustering (1, 2, 2) are identical, so
obviously there will be no splitting errors. Likewise for comparing (1, 2, 2) and (2, 1, 1). Comparing (1, 2, 2) and (4, 5, 6), the
pair of the second and third items is in the same class but split
into two clusters: one splitting error. On the other hand, comparing (1, 2, 2) and (1, 1, 1), there are no splitting errors, because
everything is put in the same cluster.
> count.splitting.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.splitting.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.splitting.errors(c(1,2,2),c(4,5,6))
[1] 1
> count.splitting.errors(c(1,2,2),c(1,1,1))
[1] 0
iii. How many lumping errors does k-means make on each of your
three runs? How many splitting errors?
> count.lumping.errors(nci.classes,clusters.1)
[1] 79
> count.lumping.errors(nci.classes,clusters.2)
[1] 96
> count.lumping.errors(nci.classes,clusters.3)
[1] 110
> count.splitting.errors(nci.classes,clusters.1)
[1] 97
> count.splitting.errors(nci.classes,clusters.2)
[1] 97
> count.splitting.errors(nci.classes,clusters.3)
[1] 106
For comparison, there are 64 · 63/2 = 2016 pairs, so the error
rate here is pretty good.
(b) Are there any classes which seem particularly hard for k-means to
pick up?
Answer: Well-argued qualitative answers are fine, but there are
ways of being quantitative. One is the entropy of the cluster assignments for the classes. Start with a confusion matrix:
> confusion = table(nci.classes,clusters.1)
> confusion
            clusters.1
nci.classes   1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST      2 0 0 0 2 0 0 0 0  0  0  2  1  0
  CNS         0 0 0 0 3 0 0 0 2  0  0  0  0  0
  COLON       0 0 1 0 0 0 0 6 0  0  0  0  0  0
  K562A-repro 0 1 0 0 0 0 0 0 0  0  0  0  0  0
  K562B-repro 0 1 0 0 0 0 0 0 0  0  0  0  0  0
  LEUKEMIA    0 1 0 0 0 0 0 0 0  0  0  0  0  5
  MCF7A-repro 0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MCF7D-repro 0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MELANOMA    7 0 0 1 0 0 0 0 0  0  0  0  0  0
  NSCLC       0 0 4 1 0 0 2 0 0  0  1  0  1  0
  OVARIAN     0 0 3 1 0 0 0 0 0  2  0  0  0  0
  PROSTATE    0 0 2 0 0 0 0 0 0  0  0  0  0  0
  RENAL       0 0 0 1 1 7 0 0 0  0  0  0  0  0
  UNKNOWN     0 0 0 1 0 0 0 0 0  0  0  0  0  0
> signif(apply(confusion,1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.97        0.59        0.00        0.00        0.65
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
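(The entropy function is the one from the lecture code. As a reminder, a minimal sketch of what it computes: a vector of counts goes in, an entropy in bits comes out.)

entropy <- function(counts) {
  p <- counts / sum(counts)  # counts to probabilities
  p <- p[p > 0]              # 0 log 0 = 0 by convention
  -sum(p * log2(p))
}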

Of course that's just one k-means run. Here are a few more.
> signif(apply(table(nci.classes,clusters.2),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.00        0.59        0.00        0.00        1.50
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
> signif(apply(table(nci.classes,clusters.3),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        1.50        0.59        0.00        0.00        1.90
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        1.10        1.80        0.65        0.00
      RENAL     UNKNOWN
       0.99        0.00
BREAST and NSCLC have consistently high cluster entropy, indicating that cells of these types tend not to be clustered together.
(c) Are there any pairs of cells which are always clustered together, and
if so, are they of the same class?
Answer: Cells 1 and 2 are always clustered together in my
three runs, and are CNS tumors. However, cells 3 and 4 are always
clustered together, too, and one is CNS and one is RENAL.
(d) Variation-of-information metric
i. Calculate, by hand, the variation-of-information distance between
the partition (1, 2, 2) and (2, 1, 1); between (1, 2, 2) and (2, 2, 1);
and between (1, 2, 2) and (5, 8, 11).
Answer: Write X for the cell of the first partition in each pair
and Y for the second.
In the first case, when X = 1, Y = 2 and vice versa. So
H[X|Y] = 0, H[Y|X] = 0, and the distance between the partitions is also zero. (The two partitions are the same; they've
just swapped the labels on the cells.)
For the second pair, we have Y = 2 when X = 1, so H[Y|X = 1] = 0,
but Y = 1 or 2 equally often when X = 2, so H[Y|X = 2] = 1.
Thus H[Y|X] = (0)(1/3) + (1)(2/3) = 2/3. Symmetrically, X = 2
when Y = 1, so H[X|Y = 1] = 0, but X = 1 or 2 equiprobably
when Y = 2, so again H[X|Y] = 2/3, and the distance is 4/3.
In the final case, H[X|Y] = 0, because for each Y there is a
single value of X, but H[Y|X] = 2/3, as in the previous case. So
the distance is 2/3.
ii. Write a function which takes as inputs two vectors of class or
cluster assignments, and returns as output the variation-of-information
distance for the two partitions. Test the function by checking that
it matches your answers in the previous part.
Answer: The most straightforward way is to make the contingency table, or confusion matrix, for the two assignment vectors,
calculate the two conditional entropies from that, and add them.
But I already have a function to calculate mutual information,

# Variation-of-information distance between two partitions
# Input: two vectors of partition-cell assignments of equal length
# Calls: entropy, word.mutual.info from lecture-05.R
# Output: variation of information distance
variation.of.info <- function(clustering1, clustering2) {
  # Check that we're comparing equal numbers of objects
  stopifnot(length(clustering1) == length(clustering2))
  # How much information do the partitions give about each other?
  mi = word.mutual.info(table(clustering1, clustering2))
  # Marginal entropies
  # The entropy function takes a vector of counts, not assignments,
  # so call table first
  h1 = entropy(table(clustering1))
  h2 = entropy(table(clustering2))
  v.o.i = h1 + h2 - 2*mi
  return(v.o.i)
}
Code Example 4: Calculating variation-of-information distance between partitions, based on vectors of partition-cell assignments.
so I'll re-use that via a little algebra:

    I[X;Y] = H[X] - H[X|Y]
    H[X|Y] = H[X] - I[X;Y]
    H[X|Y] + H[Y|X] = H[X] + H[Y] - 2 I[X;Y]

implemented in Code 4.
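(If word.mutual.info is not already loaded, here is a minimal sketch consistent with its use above: it takes a two-way table of counts and returns the mutual information in bits.)

word.mutual.info <- function(tbl) {
  p <- tbl / sum(tbl)                       # joint distribution
  expected <- outer(rowSums(p), colSums(p)) # product of the marginals
  nonzero <- p > 0
  sum(p[nonzero] * log2(p[nonzero] / expected[nonzero]))
}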
Let's check this out on the test cases.
> variation.of.info(c(1,2,2),c(2,1,1))
[1] 0
> variation.of.info(c(1,2,2),c(2,2,1))
[1] 1.333333
> variation.of.info(c(1,2,2),c(5,8,11))
[1] 0.6666667
> variation.of.info(c(5,8,11),c(1,2,2))
[1] 0.6666667
I threw in the last one to make sure the function is symmetric
in its two arguments, as it should be.
iii. Calculate the distances between the k-means clusterings and the
true classes, and between the k-means clusterings and each other.
Answer:
> variation.of.info(nci.classes,clusters.1)
[1] 1.961255
> variation.of.info(nci.classes,clusters.2)
[1] 2.026013
> variation.of.info(nci.classes,clusters.3)
[1] 2.288418
> variation.of.info(clusters.1,clusters.2)
[1] 0.5167703
> variation.of.info(clusters.1,clusters.3)
[1] 0.8577165
> variation.of.info(clusters.2,clusters.3)
[1] 0.9224744
The k-means clusterings are all roughly the same distance to the
true classes (two bits or so), and noticeably closer to each other
than to the true classes (around three-quarters of a bit).

3. Hierarchical clustering
(a) Produce dendrograms using Ward's method, single-link clustering and
complete-link clustering. Include both the commands you used and the
resulting figures in your write-up. Make sure the figures are legible.
(Try the cex=0.5 option to plot.)
Answer: hclust needs a matrix of distances, handily produced by
the dist function.
nci.dist = dist(nci.t)
plot(hclust(nci.dist,method="ward"),cex=0.5,xlab="Cells",
main="Hierarchical clustering by Wards method")
plot(hclust(nci.dist,method="single"),cex=0.5,xlab="Cells",
main="Hierarchical clustering by single-link method")
plot(hclust(nci.dist,method="complete"),cex=0.5,xlab="Cells",
main="Hierarchical clustering by complete-link method")
Producing Figures 1, 2 and 3
(b) Which cell classes seem best captured by each clustering method?
Explain.
Answer: Ward's method does a good job of grouping together
COLON, MELANOMA and RENAL; LEUKEMIA is pretty good
too, but mixed up with some of the lab cell lines. Single-link is good
with RENAL and decent with MELANOMA (though confused with
BREAST). There are little sub-trees of cells of the same type, like
COLON or CNS, but mixed together with others. The complete-link
method has sub-trees for COLON, LEUKEMIA, MELANOMA, and
RENAL, which look pretty much as good as Ward's method.
(c) Which method best recovers the cell classes? Explain.
Answer: Ward's method looks better: scanning down the figure,
a lot more of the adjacent cells are of the same type. Since the
dendrogram puts items in the same sub-cluster together, this suggests
that the clustering more nearly corresponds to the known cell types.
(d) The hclust command returns an object whose height attribute is the
sum of the within-cluster sums of squares. How many clusters does
this suggest we should use, according to Wards method? Explain.
(You may find diff helpful.)
Answer:
> nci.wards = hclust(nci.dist,method="ward")
> length(nci.wards$height)
[1] 63
> nci.wards$height[1]
[1] 38.23033
There are 63 heights, corresponding to the 63 joinings, i.e., not including the height (sum-of-squares) of zero when we have 64 clusters, one for each data point.
[Figure 1: Hierarchical clustering by Ward's method (dendrogram; x-axis: Cells, y-axis: Height).]

[Figure 2: Hierarchical clustering by the single-link method (dendrogram; x-axis: Cells, y-axis: Height).]
[Figure 3: Hierarchical clustering by the complete-link method (dendrogram; x-axis: Cells, y-axis: Height).]

[Figure 4: Merging costs in Ward's method (merging cost vs. clusters remaining).]


Since the merging costs are differences, we want to add an initial 0 to the sequence:
merging.costs = diff(c(0,nci.wards$height))
plot(64:2,merging.costs,xlab="Clusters remaining",ylab="Merging cost")
The function diff takes the difference between successive components in a vector, which is what we want here (Figure 4).
The heuristic is to stop joining clusters when the cost of doing so
goes up a lot; this suggests using only 10 clusters, since going from
10 clusters to 9 is very expensive. (Alternately, use 63 clusters; but
that would be silly.)
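To read the elbow off the numbers rather than the plot, we can label each cost by the number of clusters present just before that merge, matching the plot's x-axis (a convenience, not part of the required answer):

names(merging.costs) <- 64:2  # clusters present just before each merge
head(sort(merging.costs, decreasing=TRUE))  # the most expensive merges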
(e) Suppose you did not know the cell classes. Can you think of any
reason to prefer one clustering method over another here, based on
their outputs and the rest of what you know about the problem?
Answer: I can't think of anything very compelling. Ward's method
has a better sum-of-squares, but of course we don't know in advance
that sum-of-squares picks out differences between cancer types with
any biological or medical importance.


4. (a) Use prcomp to find the variances (not the standard deviations) associated with the principal components. Include a print-out of these
variances, to exactly two significant digits, in your write-up.
Answer:
> nci.pca = prcomp(nci.t)
> nci.variances = (nci.pca$sdev)^2
> signif(nci.variances,2)
 [1] 6.3e+02 3.5e+02 2.8e+02 1.8e+02 1.6e+02 1.5e+02 1.2e+02 1.2e+02
 [9] 1.1e+02 9.2e+01 8.9e+01 8.5e+01 7.8e+01 7.5e+01 7.1e+01 6.8e+01
[17] 6.7e+01 6.2e+01 6.2e+01 6.0e+01 5.8e+01 5.5e+01 5.4e+01 5.0e+01
[25] 5.0e+01 4.7e+01 4.5e+01 4.4e+01 4.3e+01 4.2e+01 4.1e+01 4.0e+01
[33] 3.7e+01 3.7e+01 3.6e+01 3.5e+01 3.4e+01 3.4e+01 3.2e+01 3.2e+01
[41] 3.1e+01 3.0e+01 3.0e+01 2.9e+01 2.8e+01 2.7e+01 2.6e+01 2.5e+01
[49] 2.3e+01 2.3e+01 2.2e+01 2.1e+01 2.0e+01 1.9e+01 1.8e+01 1.8e+01
[57] 1.7e+01 1.5e+01 1.4e+01 1.2e+01 1.0e+01 9.9e+00 8.9e+00 1.7e-28

(b) Why are there only 64 principal components, rather than 6830? (Hint:
Read the lecture notes.)
Answer: Because, when n < p, it is necessarily true that the data
lie on an n-dimensional subspace of the p-dimensional feature space,
so there are only n orthogonal directions along which the data can
vary at all. (Actually, one of the variances is very, very small because
any 64 points lie on a 63-dimensional surface, so we really only need
63 directions, and the last variance is within numerical error of zero.)
Alternately, we can imagine that there are 6830 − 64 = 6766 other
principal components, each with zero variance.
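A quick sanity check on the dimensions (nci.t is 64 x 6830, so prcomp can return at most 64 components):

dim(nci.pca$x)        # should be 64 64: one score column per component
length(nci.pca$sdev)  # should be 64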
(c) Plot the fraction of the total variance retained by the first q components against q. Include both a print-out of the plot and the commands you used.
Answer:
plot(cumsum(nci.variances)/sum(nci.variances),xlab="q",
ylab="Fraction of variance",
main = "Fraction of variance retained by first q components")
producing Figure 5. The handy cumsum function takes a vector and
returns the cumulative sums of its components (i.e., another vector
of the same length). (The same effect can be achieved through lapply
or sapply, or through an explicit loop, etc.)
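For instance, the sapply version would look something like this (slower, but it makes explicit what is being computed):

# Fraction of variance retained by the first q components, for each q
frac.variance <- sapply(seq_along(nci.variances),
  function(q) sum(nci.variances[1:q]) / sum(nci.variances))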
(d) Roughly how much variance is retained by the first two principal components?
Answer: From the figure, about 25%. More exactly:
> sum(nci.variances[1:2])
[1] 986.1434
> sum(nci.variances[1:2])/sum(nci.variances)
[1] 0.2319364
so 23% of the variance.

[Figure 5: R2 vs. q (fraction of the total variance retained by the first q components).]


(e) Roughly how many components must be used to keep half of the variance? To keep nine-tenths?
Answer: From the figure, about 10 components keep half the variance, and about 40 components keep nine tenths. More exactly:
> min(which(cumsum(nci.variances) > 0.5*sum(nci.variances)))
[1] 10
> min(which(cumsum(nci.variances) > 0.9*sum(nci.variances)))
[1] 42
so we need 10 and 42 components, respectively.
(f) Is there any point to using more than 50 components? (Explain.)
Answer: There's very little point from the point of view of capturing variance. The error in reconstructing the expression levels from
50 components is already extremely small. However, there's no
guarantee that, in some other application, the information carried by
the 55th component (say) wouldn't be crucial.
(g) How much confidence should you have in results from visualizing the
first two principal components? Why?
Answer: Very little; the first two components represent only a small
part of the total variance.

[Figure 6: PC projections of the cells, labeled by cell type.]


5. (a) Plot the projection of each cell on to the first two principal components. Label each cell by its type. Hint: See the solutions to the first
homework.
Answer: Take the hint!
plot(nci.pca$x[,1:2],type="n")
text(nci.pca$x[,1:2],nci.classes, cex=0.5)
Figure 6 is the result.
(b) One tumor class (at least) forms a cluster in the projection. Say
which, and explain your answer.
Answer: MELANOMA; most of the cells cluster in the bottom-center
of the figure (PC1 between about 0 and 20, PC2 around
−40), with only a few non-MELANOMA cells nearby, and a big gap
to the rest of the data. One could also argue for LEUKEMIA in the
center left, CNS in the upper left, RENAL above it, and COLON in
the top right.
(c) Identify a tumor class which does not form a compact cluster.
Answer: Most of them don't. What I had in mind, though, was
BREAST, which is very widely spread.
(d) Of the two classes of tumors you have just named, which will be more
easily classified with the prototype method? With the nearest neighbor
method?
Answer: BREAST will be badly classified using the prototype method.
The prototype will be around the middle of the plot, where there are
no breast-cancer cells. Nearest-neighbors can hardly do worse. On
the other hand, MELANOMA should work better with the prototype
method, because it forms a compact blob.


6. Combining dimensionality reduction and clustering


(a) Run k-means with k = 14 on the projections of the cells on to the
first two principal components. Give the commands you used to do
this, and get three clusterings from three different runs of k-means.
pc.clusters.1 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.2 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
pc.clusters.3 = kmeans(nci.pca$x[,1:2],centers=14)$cluster
(b) Calculate the number of errors, as with k-means clustering based on
all genes.
Answer:
> count.lumping.errors(nci.classes,pc.clusters.1)
[1] 96
> count.lumping.errors(nci.classes,pc.clusters.2)
[1] 92
> count.lumping.errors(nci.classes,pc.clusters.3)
[1] 111
> count.splitting.errors(nci.classes,pc.clusters.1)
[1] 126
> count.splitting.errors(nci.classes,pc.clusters.2)
[1] 128
> count.splitting.errors(nci.classes,pc.clusters.3)
[1] 134
The number of lumping errors has increased a bit, though not very
much. The number of splitting errors has gone up by about 30%.
(c) Are there any pairs of cells which are always clustered together? If
so, do they have the same cell type?
Answer: For me, the first three cells are all clustered together, and
are all CNS.
(d) Does k-means find a cluster corresponding to the cell type you thought
would be especially easy to identify in the previous problem? (Explain
your answer.)
Answer: Make confusion matrices and look at them.
> table(nci.classes,pc.clusters.1)
pc.clusters.1
nci.classes   1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST      2 1 0 0 0 0 1 0 0  0  1  1  0  1
  CNS         0 0 0 0 0 0 0 0 0  0  0  0  0  5
  COLON       0 0 0 2 0 0 2 1 0  0  2  0  0  0
  K562A-repro 0 0 0 0 1 0 0 0 0  0  0  0  0  0
  K562B-repro 0 0 0 0 1 0 0 0 0  0  0  0  0  0
  LEUKEMIA    0 0 0 0 4 2 0 0 0  0  0  0  0  0
  MCF7A-repro 0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MCF7D-repro 0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MELANOMA    1 0 0 0 0 0 0 0 0  0  0  1  6  0
  NSCLC       0 0 0 0 0 0 0 2 2  1  0  3  0  1
  OVARIAN     0 0 0 0 0 0 0 0 1  3  0  2  0  0
  PROSTATE    0 0 0 0 0 0 0 0 0  2  0  0  0  0
  RENAL       0 1 2 0 0 0 0 0 5  0  0  1  0  0
  UNKNOWN     0 0 0 0 0 0 0 0 0  0  0  1  0  0

Six of the eight MELANOMA cells are in one cluster here, which has
no other members; this matches what I guessed from the projection
plot. This is also true in one of the other two runs; in the third, that
group of six is split into two clusters of four cells and two.
(e) Does k-means find a cluster corresponding to the cell type you thought
would be especially hard to identify? (Again, explain.)
Answer: I thought BREAST would be hard to classify, and it is,
being spread across six clusters, all containing cells from other classes.
This remains true in the other k-means runs. On the other hand, it
wasn't well-clustered with the full data, either.


7. Combining dimensionality reduction and classification


(a) Calculate the error rate of the prototype method using all the gene
features, using leave-one-out cross-validation. You may use the solution code from previous assignments, or your own code. Remember
that in this data set, the class labels are the row names, not a separate column, so you will need to modify either your code or the data
frame. (Note: this may take several minutes to run. [Why so slow?])
Answer: I find it simpler to add another column for the class labels
than to modify the code.
> nci.frame = data.frame(labels = nci.classes, nci.t)
> looCV.prototype(nci.frame)
[1] 0.578125
(The frame-making command complains about duplicated row-names,
but I suppressed that for simplicity.) 58% accuracy may not sound
that great, but remember there are 14 classes; it's a lot better than
chance.
Many people tried to use matrix here. The difficulty, as mentioned
in the lecture notes, is that a matrix or array, in R, has to have
all its components be of the same data type; it can't mix, say,
numbers and characters. When you try, it coerces the numbers into
characters (because it can do that, but can't go the other way). With
data.frame, each column can be of a different data type, so we can
mix character labels with numeric features.
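A two-line demonstration of the coercion issue, on made-up toy values:

m <- cbind(label=c("BREAST","CNS"), expr=c(1.5,-0.3))
m[1,"expr"]   # "1.5": the matrix coerced the numbers to strings
d <- data.frame(label=c("BREAST","CNS"), expr=c(1.5,-0.3))
d$expr[1]     # 1.5: the data frame keeps the column numeric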
This takes a long time to run because it has to calculate 14 class
centers, with 6830 dimensions, 64 times. Of course most of the class
prototypes are not changing with each hold-out (only the one center
of the class to which the hold-out point belongs), so there's a fair
amount of wasted effort computationally. A slicker implementation
of leave-one-out for the prototype method would take advantage of
this, at the cost of specializing the code to that particular classifier
method.
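For reference, here is a hypothetical reconstruction of what looCV.prototype does; the actual code is in the earlier solutions, and this sketch just assumes the class labels sit in the first column and that the function reports the fraction classified correctly:

looCV.prototype.sketch <- function(frame) {
  n <- nrow(frame)
  correct <- logical(n)
  for (i in 1:n) {
    train.labels <- frame[-i, 1]
    train.feats <- as.matrix(frame[-i, -1])
    # class prototypes: the mean feature vector of each class
    prototypes <- rowsum(train.feats, train.labels) /
      as.vector(table(train.labels))
    # assign the held-out cell to the class with the nearest prototype
    held.out <- as.numeric(frame[i, -1])
    dists <- sqrt(rowSums(sweep(prototypes, 2, held.out)^2))
    correct[i] <- (rownames(prototypes)[which.min(dists)] == frame[i, 1])
  }
  mean(correct)  # fraction of cells classified correctly
}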
(b) Calculate the error rate of the prototype method using the first two,
ten and twenty principal components. Include the R commands you
used to do this.
Answer: Remember that the x attribute of the object returned by
prcomp gives the projections on to the components. We need to
turn these into data frames along with the class labels, but since we
don't really care about keeping them around, we can do that in the
argument to the CV function.
> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:2]))
[1] 0.421875
> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:10]))
[1] 0.515625

proto.with.q.components = function(q, pcs, classes) {
  pcs = pcs[, 1:q]
  frame = data.frame(class.labels=classes, pcs)
  accuracy = looCV.prototype(frame)
  return(accuracy)
}
Code Example 5: Function for calculating the accuracy of a variable number
of components.
> looCV.prototype(data.frame(labels=nci.classes,nci.pca$x[,1:20]))
[1] 0.59375
So the accuracy is only slightly reduced, or even slightly increased,
by using the principal components rather than the full data.
(c) (Extra credit.) Plot the error rate, as in the previous part, against the
number q of principal components used, for q from 2 to 64. Include
your code and comment on the graph. (Hint: Write a function to
calculate the error rate for arbitrary q, and use sapply.)
Answer: Follow the hint (Code Example 5).
This creates the frame internally; strictly, the extra arguments to
the function aren't needed, but assuming that global variables will
exist and have the right properties is always bad programming practice. I could add default values for them (pcs=nci.pca$x, etc.), but
that's not necessary because sapply can pass extra named arguments
on to the function.
> pca.accuracies = sapply(2:64, proto.with.q.components,
+                         pcs=nci.pca$x, classes=nci.classes)
> pca.accuracies[c(1,9,19)]
[1] 0.421875 0.515625 0.593750
> plot(2:64, pca.accuracies, xlab="q", ylab="Prototype accuracy")
The middle command double-checks that the proto.with.q.components
function works properly, by giving the same results as we got by
hand earlier. (Remember that the first component of the vector
pca.accuracies corresponds to q = 2, etc.) The plot is Figure
7. There's a peak in the accuracy for q between about 15 and 20.

[Figure 7: Accuracy of the prototype method as a function of the number of components q.]
