Abstract In this short exercise, we explore the most basic approach to the selection of differentially expressed genes between two classes: first, a nonspecific filtering step to remove probes for genes that appear to be uninformative; second, a probe-by-probe statistical test; and third, multiple testing correction. There are many variations and improvements to the procedure shown here, and you can learn more about these in Chapter 7.
The data and the following steps with which we construct the subset of interest, ALL_bcrneg, are described in more detail in Chapter 1. Briefly, we select samples from B-cell lymphomas harboring the BCR/ABL translocation and from lymphomas with no observed cytogenetic abnormalities (NEG).
> library("ALL")
> data("ALL")
> bcell = grep("^B", as.character(ALL$BT))
> moltyp = which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL"))
F. Hahne et al., Bioconductor Case Studies, DOI: 10.1007/978-0-387-77240-0_6, Springer Science+Business Media, LLC 2008
F. Hahne, W. Huber
> ALL_bcrneg = ALL[, intersect(bcell, moltyp)]
> ALL_bcrneg$mol.biol = factor(ALL_bcrneg$mol.biol)
The last line in the code above drops the unused levels of the factor variable mol.biol.
A related approach would be to discard all probe sets with consistently low expression values. The idea is similar: those probe sets most likely match transcripts whose expression we cannot detect anyway, and hence we need not test them for differential expression. A more comprehensive approach to nonspecific filtering of probe sets according to various criteria is provided by the function nsFilter from the Category package (in current Bioconductor releases, nsFilter is provided by the genefilter package). That function's documentation and its application in Chapter 1 are further references on this topic.
To summarize, nonspecific filtering uses the biological knowledge that a substantial fraction of the probe sets in a microarray experiment is not informative, either because the target gene is not expressed or because the probe set lacks sensitivity. Using this knowledge in the analysis will, in general, improve the quality of the gene selection.
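The filtering idea summarized above can be sketched in base R on simulated data. This is an illustration only, not the chapter's own code: the matrix `x`, the names `x_filt` and `x_rest`, and the choice of the median as cutoff are all assumptions made for the example. The filter never looks at the class labels, which is what makes it nonspecific.

```r
# Simulated stand-in for an expression matrix
# (probe sets in rows, samples in columns); not the ALL data.
set.seed(1)
n_probes  <- 1000
n_samples <- 20
x <- matrix(rnorm(n_probes * n_samples, mean = 8, sd = 0.2),
            nrow = n_probes,
            dimnames = list(paste0("probe", seq_len(n_probes)), NULL))

# Give the first 100 probe sets extra between-sample variability.
x[1:100, ] <- x[1:100, ] + rnorm(100 * n_samples, sd = 1)

# Per-probe-set standard deviation across all samples.
sds <- apply(x, 1, sd)

# Hypothetical cutoff: the median of the standard deviations.
# Any other data-driven threshold could be substituted here.
cutoff <- median(sds)

keep   <- sds >= cutoff
x_filt <- x[keep, , drop = FALSE]    # retained for testing
x_rest <- x[!keep, , drop = FALSE]   # set aside as uninformative

dim(x_filt)
dim(x_rest)
```

Keeping the removed rows in `x_rest` makes it possible to check afterwards whether the filter discarded probe sets that would have tested as differentially expressed.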
> table(ALLsfilt$mol.biol)
BCR/ABL     NEG
     37      42
> library("genefilter")
> tt = rowttests(ALLsfilt, "mol.biol")
> names(tt)
[1] "statistic" "dm"        "p.value"
Take a look at the histogram of the resulting p-values in the left panel of Figure 6.2.
> hist(tt$p.value, breaks=50, col="mistyrose", xlab="p-value", main="Retained")
We see a number of probe sets with very low p-values (which correspond to differentially expressed genes) and a whole range of insignificant p-values. This is more or less what we would expect: the expression of the majority of genes is not significantly shifted by the BCR/ABL mutation. To make sure that the nonspecific filtering did not throw away an undue number of promising candidates, let us take a look at the p-values for those probe sets that we filtered out before. We can compute t-statistics for them as well and plot the histogram of p-values (right panel of Figure 6.2):
> ALLsrest = ALL_bcrneg[sds<sh, ]
> ttrest = rowttests(ALLsrest, "mol.biol")
> hist(ttrest$p.value, breaks=50, col="lightblue", xlab="p-value", main="Removed")
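For readers without the Bioconductor packages at hand, the probe-by-probe test can be sketched in base R. This is a minimal sketch on simulated data, assuming an equal-variance two-sample t-test per row; the helper `row_t` and the objects `x`, `group`, and `tt_sim` are hypothetical names introduced here, and the output columns merely mirror the names shown above. It is much slower than rowttests and is meant only to make the computation explicit.

```r
# Simulated data: 500 probe sets, 37 + 42 samples in two classes.
set.seed(2)
n_probes <- 500
group <- factor(rep(c("BCR/ABL", "NEG"), times = c(37, 42)))
x <- matrix(rnorm(n_probes * length(group)), nrow = n_probes)

# Shift the first 25 probe sets in one class to mimic differential expression.
x[1:25, group == "BCR/ABL"] <- x[1:25, group == "BCR/ABL"] + 1.5

# Hypothetical helper: one equal-variance t-test per row.
row_t <- function(x, group) {
  stopifnot(nlevels(group) == 2, ncol(x) == length(group))
  in_first <- group == levels(group)[1]
  res <- t(apply(x, 1, function(v) {
    tt <- t.test(v[in_first], v[!in_first], var.equal = TRUE)
    c(statistic = unname(tt$statistic),
      dm        = unname(tt$estimate[1] - tt$estimate[2]),  # difference of class means
      p.value   = tt$p.value)
  }))
  as.data.frame(res)
}

tt_sim <- row_t(x, group)

# Histogram counts of the p-values; drop plot = FALSE to draw the figure.
h <- hist(tt_sim$p.value, breaks = 50, plot = FALSE)
head(tt_sim)
```

In such a simulation, the shifted probe sets pile up near zero in the histogram, while the p-values of the remaining probe sets are spread roughly evenly over the unit interval.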
[Figure 6.2: two histogram panels, "Retained" (left) and "Removed" (right); x-axis: p-value; y-axis: Frequency.]
Figure 6.2. Histograms of p-values. The left panel shows those p-values retained after nonspecific filtering; the right panel shows those that were removed.
Exercise 6.1 Comment on the plot; do you think that the nonspecific filtering was appropriate?
5  37015_at   ALDH1A1
6  37027_at   AHNAK
7  39730_at   ABL1
8  39837_s_at ZNF467
9  40202_at   KLF9
10 40504_at   PON2
and plot the data of the first one together with symbols indicating the value of the mol.biol variable:
> mb = ALLsfilt$mol.biol
> y = exprs(ALLsfilt)[g[1],]
> ord = order(mb)
> plot(y[ord], pch=c(1,16)[mb[ord]],
+      col=c("black", "red")[mb[ord]], main=g[1],
+      ylab=expression(log[2]~intensity), xlab="samples")
The result is shown in Figure 6.3.
[Figure 6.3: scatterplot titled "1636_g_at"; x-axis: samples; y-axis: log2 intensity.]
Figure 6.3. The ALLsfilt data for the top differentially expressed probe set across the 79 samples. The value of the mol.biol variable is indicated by the plot symbols.