are in the search results (the recall). Plotting the precision and recall
against k gives an indication of how efficient the search is. This could
then be averaged over all the labeled cells.
Normalizing the vectors by the Euclidean length is possible here,
though a little tricky because the features are the logarithms of the
concentration levels, so some of them can be negative numbers, meaning that very little of that gene is being expressed. Also, the reason
for normalizing bag-of-words vectors by length was the idea that two
documents might have similar meanings or subjects even if they had
different sizes. A cell which is expressing all the genes at a low level,
however, might well be very different from one which is expressing
the same genes at a high level.
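If one did want to try length-normalization anyway, it takes only a line or two. This is a sketch, assuming (as elsewhere in these solutions) that nci.t holds one cell per row; the negative entries are no obstacle, since only the squared entries enter the length.

```r
# Scale each cell (row) of nci.t to unit Euclidean length
cell.lengths = sqrt(rowSums(nci.t^2))
nci.t.unit = sweep(nci.t, 1, cell.lengths, "/")
```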
(c) Would it make sense to weight genes by inverse cell frequency?
Explain.
Answer: No; or at least, I don't see how. Every gene is expressed to
some degree by every cell (otherwise the logarithm of its concentration
would be −∞), so every gene would have an inverse cell frequency of
zero. You could fix a threshold and just count whether the expression
level is over the threshold, but this would be pretty arbitrary.
However, a related idea would be to scale genes' expression levels by
something which indicates how much they vary across cells, such as
the range or the variance. Since we have labeled cell types here, we
could even compute the average expression level for each type and
then scale genes by the variance of these type-averages. That is, genes
with a small variance (or range, etc.) should get small weights, since
they are presumably uninformative, and genes with a large variance
in expression level should get larger weights.
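The type-average weighting could be sketched as follows (the object names nci.t and nci.classes are the ones used elsewhere in these solutions; the weighted matrix name is just for illustration):

```r
# For each gene (column), compute its mean expression within each class,
# then weight the gene by the variance of those class averages
class.means = apply(nci.t, 2, function(g) tapply(g, nci.classes, mean))
gene.weights = apply(class.means, 2, var)
nci.t.weighted = sweep(nci.t, 2, gene.weights, "*")
```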
2. k-means clustering. Use the kmeans function in R to cluster the cells, with
k = 14 (to match the number of true classes). Repeat this three times to
get three different clusterings.
Answer:
clusters.1 = kmeans(nci.t,centers=14)$cluster
clusters.2 = kmeans(nci.t,centers=14)$cluster
clusters.3 = kmeans(nci.t,centers=14)$cluster
I'm only keeping the cluster assignments, because that's all this problem
calls for.
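Because kmeans starts from randomly chosen centers, the three runs will generally give three different clusterings. To make a run reproducible one could set a seed first; this is not something the original commands do, and the seed value here is arbitrary:

```r
set.seed(101)  # arbitrary seed, purely so the run can be repeated exactly
clusters.1 = kmeans(nci.t, centers=14)$cluster
```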
(a) Say that k-means makes a lumping error whenever it assigns two
cells of different classes to the same cluster, and a splitting error
when it puts two cells of the same class in different clusters. Each
pair of cells can give an error. (With n objects, there are n(n − 1)/2
distinct pairs.)
i. Write a function which takes as inputs a vector of classes and
a vector of clusters, and gives as output the number of lumping
errors. Test your function by verifying that when the classes
are (1, 2, 2), the three clusterings (1, 2, 2), (2, 1, 1) and (4, 5, 6)
all give zero lumping errors, but the clustering (1, 1, 1) gives two
lumping errors.
Answer: The most brute-force way (Code 1) uses nested for
loops. This works, but it's not the nicest approach: (1) for
loops, let alone nested loops, are extra slow and inefficient in R;
(2) we only care about pairs which are in the same cluster, but
waste time checking all pairs; (3) it's a bit obscure what we're
really doing.
Code 2 is faster, cleaner, and more R-ish, using the utility function
outer, which (rapidly!) applies a function to pairs of objects
constructed from its first two arguments, returning a matrix. Here
we are seeing which pairs of objects belong to different classes,
and which to the same clusters. (The solutions to the first
homework include a version of this trick.) We restrict ourselves
to pairs in the same cluster and count how many are of different
classes. We divide by two because outer produces rectangular
matrices and works with ordered pairs, so (i, j) ≠ (j, i), but the
problem asks for unordered pairs.
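Code 2 itself is not reproduced in this excerpt; a sketch along the lines just described:

```r
# Count unordered pairs of cells in the same cluster but different classes.
count.lumping.errors = function(classes, clusters) {
  same.cluster = outer(clusters, clusters, "==")
  diff.class = outer(classes, classes, "!=")
  # outer works with ordered pairs, so each unordered pair is counted twice
  sum(same.cluster & diff.class) / 2
}
```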
> count.lumping.errors(c(1,2,2),c(1,2,2))
[1] 0
> count.lumping.errors(c(1,2,2),c(2,1,1))
[1] 0
> count.lumping.errors(c(1,2,2),c(4,5,6))
[1] 0
[1] 110
> count.splitting.errors(nci.classes,clusters.1)
[1] 97
> count.splitting.errors(nci.classes,clusters.2)
[1] 97
> count.splitting.errors(nci.classes,clusters.3)
[1] 106
For comparison, there are 64 · 63/2 = 2016 pairs, so the error
rate here is pretty good.
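The code for count.splitting.errors is likewise not shown in this excerpt; it can be written symmetrically to the lumping counter, swapping the roles of classes and clusters:

```r
# Count unordered pairs of cells of the same class put in different clusters.
count.splitting.errors = function(classes, clusters) {
  same.class = outer(classes, classes, "==")
  diff.cluster = outer(clusters, clusters, "!=")
  sum(same.class & diff.cluster) / 2  # halve: ordered -> unordered pairs
}
```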
(b) Are there any classes which seem particularly hard for k-means to
pick up?
Answer: Well-argued qualitative answers are fine, but there are
ways of being quantitative. One is the entropy of the cluster assignments for the classes. Start with a confusion matrix:
> confusion = table(nci.classes,clusters.1)
> confusion
            clusters.1
nci.classes   1 2 3 4 5 6 7 8 9 10 11 12 13 14
  BREAST      2 0 0 0 2 0 0 0 0  0  0  2  1  0
  CNS         0 0 0 0 3 0 0 0 2  0  0  0  0  0
  COLON       0 0 1 0 0 0 0 6 0  0  0  0  0  0
  K562A-repro 0 1 0 0 0 0 0 0 0  0  0  0  0  0
  K562B-repro 0 1 0 0 0 0 0 0 0  0  0  0  0  0
  LEUKEMIA    0 1 0 0 0 0 0 0 0  0  0  0  0  5
  MCF7A-repro 0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MCF7D-repro 0 0 0 0 0 0 0 0 0  0  0  1  0  0
  MELANOMA    7 0 0 1 0 0 0 0 0  0  0  0  0  0
  NSCLC       0 0 4 1 0 0 2 0 0  0  1  0  1  0
  OVARIAN     0 0 3 1 0 0 0 0 0  2  0  0  0  0
  PROSTATE    0 0 2 0 0 0 0 0 0  0  0  0  0  0
  RENAL       0 0 0 1 1 7 0 0 0  0  0  0  0  0
  UNKNOWN     0 0 0 1 0 0 0 0 0  0  0  0  0  0
> signif(apply(confusion,1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.97        0.59        0.00        0.00        0.65
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
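The entropy function used here is not given in this excerpt; a minimal version, taking a vector of counts (one row of the confusion matrix) and returning bits, might be:

```r
# Entropy, in bits, of the distribution given by a vector of counts;
# empty cells are dropped, since 0 log 0 = 0 by convention
entropy = function(counts) {
  p = counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}
```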
Of course, that's just one k-means run. Here are a few more.
> signif(apply(table(nci.classes,clusters.2),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        0.00        0.59        0.00        0.00        1.50
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        0.54        2.10        1.50        0.00
      RENAL     UNKNOWN
       0.99        0.00
> signif(apply(table(nci.classes,clusters.3),1,entropy),2)
     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA
       2.00        1.50        0.59        0.00        0.00        1.90
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE
       0.00        0.00        1.10        1.80        0.65        0.00
      RENAL     UNKNOWN
       0.99        0.00
BREAST and NSCLC have consistently high cluster entropy, indicating that cells of these types tend not to be clustered together.
(c) Are there any pairs of cells which are always clustered together, and
if so, are they of the same class?
Answer: Cells number 1 and 2 are always clustered together in my
three runs, and are CNS tumors. However, cells 3 and 4 are always
clustered together, too, and one is CNS and one is RENAL.
(d) Variation-of-information metric
i. Calculate, by hand, the variation-of-information distance between
the partition (1, 2, 2) and (2, 1, 1); between (1, 2, 2) and (2, 2, 1);
and between (1, 2, 2) and (5, 8, 11).
Answer: Write X for the cell of the first partition in each pair
and Y for the second.
In the first case, Y = 2 whenever X = 1 and vice versa. So
H[X|Y] = 0, H[Y|X] = 0, and the distance between the partitions
is also zero. (The two partitions are the same; they've just
swapped the labels on the cells.)
For the second pair, we have Y = 2 when X = 1, so H[Y|X =
1] = 0, but Y = 1 or 2 equally often when X = 2, so H[Y|X =
2] = 1. Thus H[Y|X] = (0)(1/3) + (1)(2/3) = 2/3. Symmetrically,
X = 2 when Y = 1, so H[X|Y = 1] = 0, but X = 1 or
2 equiprobably when Y = 2, so again H[X|Y] = 2/3, and the
distance is 4/3.
In the final case, H[X|Y] = 0, because for each Y there is a
single value of X, but H[Y|X] = 2/3, as in the previous case. So
the distance is 2/3.
ii. Write a function which takes as inputs two vectors of class or
cluster assignments, and returns as output the variation-of-information
distance for the two partitions. Test the function by checking that
it matches your answers in the previous part.
Answer: The most straightforward way is to make the contingency
table or confusion matrix for the two assignment vectors,
calculate the two conditional entropies from that, and add them.
But I already have a function to calculate mutual information,
so I can just use the identities H[X|Y] = H[X] − I[X;Y] and
H[Y|X] = H[Y] − I[X;Y],
implemented in Code 4.
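Code 4 itself is not reproduced in this excerpt; a sketch in the same spirit, computing everything from the joint distribution of the two label vectors:

```r
# Variation-of-information distance, in bits:
# VI(X,Y) = H[X|Y] + H[Y|X] = 2 H[X,Y] - H[X] - H[Y]
variation.of.info = function(x, y) {
  joint = table(x, y) / length(x)  # joint distribution of the two labelings
  H = function(p) { p = p[p > 0]; -sum(p * log2(p)) }
  2 * H(joint) - H(rowSums(joint)) - H(colSums(joint))
}
```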
Let's check this out on the test cases.
> variation.of.info(c(1,2,2),c(2,1,1))
[1] 0
> variation.of.info(c(1,2,2),c(2,2,1))
[1] 1.333333
> variation.of.info(c(1,2,2),c(5,8,11))
[1] 0.6666667
> variation.of.info(c(5,8,11),c(1,2,2))
[1] 0.6666667
I threw in the last one to make sure the function is symmetric
in its two arguments, as it should be.
iii. Calculate the distances between the k-means clusterings and the
true classes, and between the k-means clusterings and each other.
Answer:
> variation.of.info(nci.classes,clusters.1)
[1] 1.961255
> variation.of.info(nci.classes,clusters.2)
[1] 2.026013
> variation.of.info(nci.classes,clusters.3)
[1] 2.288418
> variation.of.info(clusters.1,clusters.2)
[1] 0.5167703
> variation.of.info(clusters.1,clusters.3)
[1] 0.8577165
> variation.of.info(clusters.2,clusters.3)
[1] 0.9224744
The k-means clusterings are all roughly the same distance to the
true classes (two bits or so), and noticeably closer to each other
than to the true classes (around three-quarters of a bit).
3. Hierarchical clustering
(a) Produce dendrograms using Ward's method, single-link clustering and
complete-link clustering. Include both the commands you used and the
resulting figures in your write-up. Make sure the figures are legible.
(Try the cex=0.5 option to plot.)
Answer: hclust needs a matrix of distances, handily produced by
the dist function.
nci.dist = dist(nci.t)
plot(hclust(nci.dist,method="ward"),cex=0.5,xlab="Cells",
main="Hierarchical clustering by Ward's method")
plot(hclust(nci.dist,method="single"),cex=0.5,xlab="Cells",
main="Hierarchical clustering by single-link method")
plot(hclust(nci.dist,method="complete"),cex=0.5,xlab="Cells",
main="Hierarchical clustering by complete-link method")
producing Figures 1, 2 and 3.
(b) Which cell classes seem best captured by each clustering method?
Explain.
Answer: Ward's method does a good job of grouping together
COLON, MELANOMA and RENAL; LEUKEMIA is pretty good
too, but mixed up with some of the lab cell lines. Single-link is good
with RENAL and decent with MELANOMA (though confused with
BREAST). There are little sub-trees of cells of the same type, like
COLON or CNS, but mixed together with others. The complete-link
method has sub-trees for COLON, LEUKEMIA, MELANOMA, and
RENAL, which look pretty much as good as Ward's method.
(c) Which method best recovers the cell classes? Explain.
Answer: Ward's method looks better: scanning down the figure,
a lot more of the adjacent cells are of the same type. Since the
dendrogram puts items in the same sub-cluster together, this suggests
that the clustering more nearly corresponds to the known cell types.
(d) The hclust command returns an object whose height attribute is the
sum of the within-cluster sums of squares. How many clusters does
this suggest we should use, according to Ward's method? Explain.
(You may find diff helpful.)
Answer:
> nci.wards = hclust(nci.dist,method="ward")
> length(nci.wards$height)
[1] 63
> nci.wards$height[1]
[1] 38.23033
There are 63 heights, corresponding to the 63 joinings, i.e., not including the height (sum-of-squares) of zero when we have 64 clusters,
[Figure 1: dendrogram from Ward's method (hclust (*, "ward")); x-axis: Cells, y-axis: Height.]
[Figure 2: dendrogram from single-link clustering (hclust (*, "single")); x-axis: Cells, y-axis: Height.]
[Figure 3: dendrogram from complete-link clustering (hclust (*, "complete")); x-axis: Cells, y-axis: Height.]
[Figure 4: merging cost against the number of clusters remaining.]
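The merging-cost figure can be produced along these lines (a sketch; the exact commands are not in this excerpt, and it assumes, as the problem states, that the heights are cumulative sums of squares, so successive differences give each merge's cost):

```r
# Cost of each of the 63 merges = increase in the within-cluster sum of
# squares; after merge i there are 64 - i clusters remaining
merging.cost = diff(c(0, nci.wards$height))
clusters.remaining = 63:1
plot(clusters.remaining, merging.cost,
     xlab="Clusters remaining", ylab="Merging cost")
```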
their outputs and the rest of what you know about the problem?
Answer: I can't think of anything very compelling. Ward's method
has a better sum-of-squares, but of course we don't know in advance
that sum-of-squares picks out differences between cancer types with
any biological or medical importance.
4. (a) Use prcomp to find the variances (not the standard deviations) associated with the principal components. Include a print-out of these
variances, to exactly two significant digits, in your write-up.
Answer:
> nci.pca = prcomp(nci.t)
> nci.variances = (nci.pca$sdev)^2
> signif(nci.variances,2)
[1] 6.3e+02 3.5e+02 2.8e+02 1.8e+02 1.6e+02 1.5e+02 1.2e+02 1.2e+02
[9] 1.1e+02 9.2e+01 8.9e+01 8.5e+01 7.8e+01 7.5e+01 7.1e+01 6.8e+01
[17] 6.7e+01 6.2e+01 6.2e+01 6.0e+01 5.8e+01 5.5e+01 5.4e+01 5.0e+01
[25] 5.0e+01 4.7e+01 4.5e+01 4.4e+01 4.3e+01 4.2e+01 4.1e+01 4.0e+01
[33] 3.7e+01 3.7e+01 3.6e+01 3.5e+01 3.4e+01 3.4e+01 3.2e+01 3.2e+01
[41] 3.1e+01 3.0e+01 3.0e+01 2.9e+01 2.8e+01 2.7e+01 2.6e+01 2.5e+01
[49] 2.3e+01 2.3e+01 2.2e+01 2.1e+01 2.0e+01 1.9e+01 1.8e+01 1.8e+01
[57] 1.7e+01 1.5e+01 1.4e+01 1.2e+01 1.0e+01 9.9e+00 8.9e+00 1.7e-28
(b) Why are there only 64 principal components, rather than 6830? (Hint:
Read the lecture notes.)
Answer: Because, when n < p, it is necessarily true that the data
lie on an n-dimensional subspace of the p-dimensional feature space,
so there are only n orthogonal directions along which the data can
vary at all. (Actually, one of the variances is very, very small because
any 64 points lie on a 63-dimensional surface, so we really only need
63 directions, and the last variance is within numerical error of zero.)
Alternately, we can imagine that there are 6830 − 64 = 6766 other
principal components, each with zero variance.
(c) Plot the fraction of the total variance retained by the first q components against q. Include both a print-out of the plot and the commands you used.
Answer:
plot(cumsum(nci.variances)/sum(nci.variances),xlab="q",
ylab="Fraction of variance",
main = "Fraction of variance retained by first q components")
producing Figure 5. The handy cumsum function takes a vector and
returns the cumulative sums of its components (i.e., another vector
of the same length). (The same effect can be achieved through lapply
or sapply, or through an explicit loop, etc.)
(d) Roughly how much variance is retained by the first two principal components?
Answer: From the figure, about 25%. More exactly:
> sum(nci.variances[1:2])
[1] 986.1434
> sum(nci.variances[1:2])/sum(nci.variances)
[1] 0.2319364
[Figure 5: R^2 vs. q — fraction of variance retained by the first q components.]
[Plot: projection of the cells onto the first two principal components (PC1 vs. PC2), each point labeled by its cell class.]
40), with only a few non-MELANOMA cells nearby, and a big gap
to the rest of the data. One could also argue for LEUKEMIA in the
center left, CNS in the upper left, RENAL above it, and COLON in
the top right.
(c) Identify a tumor class which does not form a compact cluster.
Answer: Most of them don't. What I had in mind, though, was
BREAST, which is very widely spread.
(d) Of the two classes of tumors you have just named, which will be more
easily classified with the prototype method? With the nearest neighbor
method?
Answer: BREAST will be badly classified using the prototype method.
The prototype will be around the middle of the plot, where there are
no breast-cancer cells. Nearest-neighbors can hardly do worse. On
the other hand, MELANOMA should work better with the prototype
method, because it forms a compact blob.
              1 2 3 4 5 6 7 8 9 10 11 12 13 14
  [rows for BREAST through LEUKEMIA fell on a page that is missing here]
  MCF7A-repro 0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MCF7D-repro 0 0 0 0 0 0 0 0 0  0  1  0  0  0
  MELANOMA    1 0 0 0 0 0 0 0 0  0  0  1  6  0
  NSCLC       0 0 0 0 0 0 0 2 2  1  0  3  0  1
  OVARIAN     0 0 0 0 0 0 0 0 1  3  0  2  0  0
  PROSTATE    0 0 0 0 0 0 0 0 0  2  0  0  0  0
  RENAL       0 1 2 0 0 0 0 0 5  0  0  1  0  0
  UNKNOWN     0 0 0 0 0 0 0 0 0  0  0  1  0  0
Six of the eight MELANOMA cells are in one cluster here, which has
no other members; this matches what I guessed from the projection
plot. This is also true in one of the other two runs; in the third, that
group of six is split into two clusters of four cells and two.
(e) Does k-means find a cluster corresponding to the cell type you thought
would be especially hard to identify? (Again, explain.)
Answer: I thought BREAST would be hard to classify, and it is,
being spread across six clusters, all containing cells from other classes.
This remains true in the other k-means runs. On the other hand, it
wasn't well-clustered with the full data, either.
[Plot: prototype-method classification accuracy (roughly 0.45 to 0.60) against the number of principal components retained.]