
6

Easy Differential Expression


F. Hahne and W. Huber

Abstract In this short exercise, we explore the most basic approach to the selection of differentially expressed genes between two classes: first, a nonspecific filtering step to remove probes for genes that appear not to be informative; second, a probe-by-probe statistical test; and third, multiple testing correction. There are many variations and improvements to the procedure shown here, and you can learn more about these in Chapter 7.

6.1 Example data


For this chapter, we use the ALL data, which have been obtained in a microarray study of B- and T-cell leukemia. We want to find genes that are differentially expressed between two distinct types of B-cell leukemia.

> library("Biobase")
> library("genefilter")
> library("ALL")
> data("ALL")

The data and the following steps with which we construct the subset of interest, ALL_bcrneg, are described in more detail in Chapter 1. Briefly, we select samples from B-cell leukemias harboring the BCR/ABL translocation and from B-cell leukemias with no observed cytogenetic abnormalities (NEG).

> bcell = grep("^B", as.character(ALL$BT))
> moltyp = which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL"))

F. Hahne et al., Bioconductor Case Studies, DOI: 10.1007/978-0-387-77240-0_6, Springer Science+Business Media, LLC 2008


> ALL_bcrneg = ALL[, intersect(bcell, moltyp)]
> ALL_bcrneg$mol.biol = factor(ALL_bcrneg$mol.biol)

The last line in the code above is used to drop unused levels of the factor variable mol.biol.
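Why re-creating the factor matters can be seen on a small toy vector. The following is only a sketch in base R; the vector toy_mol_biol and its values are made up for illustration and are not taken from the ALL data.

```r
# Toy stand-in for ALL$mol.biol (hypothetical values, for illustration only).
toy_mol_biol <- factor(c("BCR/ABL", "NEG", "ALL1/AF4", "NEG", "BCR/ABL", "E2A/PBX1"))

# Keep only the two classes of interest, as in the chapter.
keep <- which(as.character(toy_mol_biol) %in% c("NEG", "BCR/ABL"))
sub <- toy_mol_biol[keep]

# Subsetting a factor keeps all original levels, even those no longer present.
levels_before <- levels(sub)
levels_before

# Calling factor() again rebuilds the level set from the values actually present.
sub <- factor(sub)
levels_after <- levels(sub)
levels_after
table(sub)
```

Without this step, functions that loop over factor levels would see empty groups, which is usually not what a two-class comparison expects.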

6.2 Nonspecific filtering


Between these two groups we should be able to detect substantial differences in gene expression. But first let us explore how nonspecific filtering can improve our analysis. To this end, we calculate the overall variability across arrays of each probe set, regardless of the sample labels. For this, we use the function rowSds, which calculates the standard deviation for each row. A reasonable alternative would be to calculate the interquartile range (IQR), for which we could employ the rowQ function from the genefilter package.

> library("genefilter")
> sds = rowSds(exprs(ALL_bcrneg))
> sh = shorth(sds)
> sh
[1] 0.242

We can plot the histogram of the distribution of sds; see Figure 6.1. The function shorth calculates the midpoint of the shorth (the shortest interval containing half of the data), and is in many cases a reasonable estimator of the peak of a distribution. Its value 0.242 is drawn as a dashed vertical line in Figure 6.1.

> hist(sds, breaks=50, col="mistyrose", xlab="standard deviation")
> abline(v=sh, col="blue", lwd=3, lty=2)

There are a large number of probe sets with very low variability. We can safely assume that we will not be able to infer differential expression for their target genes. The target genes of these probe sets may not be expressed in the samples, or the probe sets may lack the sensitivity to detect expression. Hence, let us discard those probe sets whose standard deviation is below the value of sh.

> ALLsfilt = ALL_bcrneg[sds>=sh, ]
> dim(exprs(ALLsfilt))
[1] 8812   79
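To make the filtering step concrete without the ALL data, here is a minimal base-R sketch on simulated values. The helper shorth_midpoint is a hypothetical name that follows the description in the text (midpoint of the shortest interval containing half of the data); it is not the genefilter implementation, which may differ in details, and apply(x, 1, sd) is used as a stand-in for rowSds.

```r
set.seed(1)

# Simulated expression matrix (made-up data): 1000 probe sets x 20 arrays.
# The first 800 rows have low variability, the last 200 have high variability.
row_sd_true <- c(rep(0.2, 800), rep(1.0, 200))
x <- matrix(rnorm(1000 * 20, mean = 8, sd = row_sd_true), nrow = 1000)

# Row-wise standard deviation, ignoring any sample labels.
sds <- apply(x, 1, sd)

# Hypothetical helper: midpoint of the shortest interval that covers
# half of the values in v.
shorth_midpoint <- function(v) {
  v <- sort(v)
  n <- length(v)
  k <- ceiling(n / 2)                      # points the interval must contain
  starts <- seq_len(n - k + 1)             # candidate left end points
  widths <- v[starts + k - 1] - v[starts]  # width of each candidate interval
  i <- which.min(widths)                   # first shortest interval on ties
  (v[i] + v[i + k - 1]) / 2
}

sh <- shorth_midpoint(sds)
sh

# Discard rows whose standard deviation is below the estimated peak.
keep <- sds >= sh
xfilt <- x[keep, , drop = FALSE]
dim(xfilt)
```

In this simulation the estimate lands near the standard deviation of the low-variability majority, so roughly half of those rows are dropped while the high-variability rows are retained.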


Figure 6.1. Histogram of sds.

A related approach would be to discard all probe sets with consistently low expression values. The idea is similar: those probe sets most likely match transcripts whose expression we cannot detect anyway, and hence we need not test them for differential expression. A more comprehensive approach to nonspecific filtering of probe sets according to various criteria is provided by the function nsFilter from the genefilter package; that function's documentation and an application of it in Chapter 1 are further references on this topic.

To summarize, nonspecific filtering uses the biological knowledge that a substantial fraction of the probe sets in a microarray experiment is not informative, either because the target gene is not expressed or because the probe set lacks sensitivity. Using this knowledge in the analysis will, in general, improve the quality of the gene selection.

6.3 Differential expression


We can now perform probe-by-probe tests for differential expression (Dudoit et al., 2002). The function rowttests can deal with ExpressionSets. It uses the t-test, row by row, to detect significant differences in the location of the distribution of expression data of two groups of samples defined by a factor variable. In this case, we use the information about BCR/ABL mutation status in the column mol.biol of ALLsfilt's sample annotation as a grouping factor.


> table(ALLsfilt$mol.biol)

BCR/ABL     NEG
     37      42

> tt = rowttests(ALLsfilt, "mol.biol")
> names(tt)
[1] "statistic" "dm"        "p.value"

Take a look at the histogram of the resulting p-values in the left panel of Figure 6.2.

> hist(tt$p.value, breaks=50, col="mistyrose", xlab="p-value", main="Retained")

We see a number of probe sets with very low p-values (which correspond to differentially expressed genes) and a whole range of insignificant p-values. This is more or less what we would expect. The expression of the majority of genes is not significantly shifted by the BCR/ABL mutation. To make sure that the nonspecific filtering did not throw away an undue number of promising candidates, let us take a look at the p-values for those probe sets that we filtered out before. We can compute t-statistics for them as well and plot the histogram of p-values (right panel of Figure 6.2):

> ALLsrest = ALL_bcrneg[sds<sh, ]
> ttrest = rowttests(ALLsrest, "mol.biol")
> hist(ttrest$p.value, breaks=50, col="lightblue", xlab="p-value", main="Removed")
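The row-by-row test can be imitated in base R on simulated data. This is a sketch only: it assumes the equal-variance two-sample t-test and takes dm to be the difference of the two group means, the helper row_t is a hypothetical name, and the data are made up. It is far slower than rowttests and is meant to show what is computed per row, not how the package does it.

```r
set.seed(2)

# Made-up data: 200 probe sets x 12 samples, two groups of 6 samples each.
grp <- factor(rep(c("BCR/ABL", "NEG"), each = 6))
x <- matrix(rnorm(200 * 12), nrow = 200)

# Shift the first 10 rows in one group so that they are truly different.
x[1:10, grp == "BCR/ABL"] <- x[1:10, grp == "BCR/ABL"] + 3

# Hypothetical helper: one equal-variance two-sample t-test for a single row.
row_t <- function(v, g) {
  a <- v[g == levels(g)[1]]
  b <- v[g == levels(g)[2]]
  tt <- t.test(a, b, var.equal = TRUE)
  c(statistic = unname(tt$statistic),
    dm        = mean(a) - mean(b),   # difference of group means
    p.value   = tt$p.value)
}

# Apply the test to every row and collect the results in a data frame.
res <- as.data.frame(t(apply(x, 1, row_t, g = grp)))
names(res)

# Rows ordered by increasing p-value; the shifted rows should rank near the top.
head(res[order(res$p.value), ])
```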

Figure 6.2. Histograms of p-values (x-axis: p-value; y-axis: frequency). The left panel ("Retained") shows those p-values retained after nonspecific filtering; the right panel ("Removed") shows those that were removed.


Exercise 6.1 Comment on the plot; do you think that the nonspecific filtering was appropriate?

6.4 Multiple testing correction


We use the p-values for ranking genes, and do not advocate interpreting them as true probabilities. Nevertheless, the results of a multiple testing adjustment can be informative for choosing selection cut-offs. Typically, in the setting of a single statistical test we consider the data as providing evidence against a given null hypothesis when it is sufficiently improbable that these data arise by chance if the null hypothesis is true. When repeatedly doing tests, we need to raise the bar for what we consider sufficiently improbable. For example, if we do 8812 tests of a null hypothesis that is actually true, using a significance level of 5%, then in 5% of them, that is, in about 441 cases, we can expect to reject the null hypothesis just by chance.

Many approaches have been proposed to address this problem (Pollard et al., 2005); here we just discuss one that appears to be appropriate in many microarray-related contexts: the false discovery rate (FDR), that is, the expected proportion of false positives among the genes that are called differentially expressed. The procedure of Benjamini and Hochberg is implemented in the multtest package and we use the function mt.rawp2adjp for this purpose. (Note that a more formal treatment would need to take into account the multiple t-tests as well as the implicit testing of the nonspecific filtering.)

> library("multtest")
> mt = mt.rawp2adjp(tt$p.value, proc="BH")

Finally, we can use the results of the t-tests to create a gene list containing the ten highest-ranking genes with respect to the adjusted p-value,

> g = featureNames(ALLsfilt)[mt$index[1:10]]

print their gene symbols,

> library("hgu95av2.db")
> links(hgu95av2SYMBOL[g])
     probe_id  symbol
1     1635_at    ABL1
2   1636_g_at    ABL1
3     1674_at    YES1
4    32434_at  MARCKS
5    37015_at ALDH1A1
6    37027_at   AHNAK
7    39730_at    ABL1
8  39837_s_at  ZNF467
9    40202_at    KLF9
10   40504_at    PON2

and plot the data of the first one together with symbols indicating the value of the mol.biol variable:

> mb = ALLsfilt$mol.biol
> y = exprs(ALLsfilt)[g[1],]
> ord = order(mb)
> plot(y[ord], pch=c(1,16)[mb[ord]],
+      col=c("black", "red")[mb[ord]], main=g[1],
+      ylab=expression(log[2]~intensity), xlab="samples")

The result is shown in Figure 6.3.
Figure 6.3. The ALLsfilt data for the top differentially expressed probe set (1636_g_at) across the 79 samples (x-axis: samples; y-axis: log2 intensity). The value of the mol.biol variable is indicated by the plot symbols.
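The arithmetic behind the multiple testing argument of this section, and the Benjamini-Hochberg adjustment itself, can be reproduced in base R. The sketch below is an illustration under stated assumptions: it uses p.adjust from the stats package rather than mt.rawp2adjp, the p-values are simulated, and the manual step-up computation is written out only to show what the adjustment does.

```r
# Expected number of false rejections if all 8812 null hypotheses were true.
n_tests <- 8812
alpha <- 0.05
expected_fp <- n_tests * alpha
expected_fp   # 440.6, i.e. about 441

set.seed(3)

# Made-up p-values: 950 from true nulls (uniform) and 50 that tend to be small.
p <- c(runif(950), rbeta(50, 0.5, 200))

# Benjamini-Hochberg adjusted p-values via base R.
padj <- p.adjust(p, method = "BH")

# The same adjustment written out by hand:
# sort the p-values, scale the i-th smallest by m / i,
# then enforce monotonicity from the largest p-value downwards.
o <- order(p)
m <- length(p)
scaled <- p[o] * m / seq_len(m)
bh_sorted <- pmin(1, rev(cummin(rev(scaled))))
bh <- numeric(m)
bh[o] <- bh_sorted

all.equal(bh, padj)

# Indices of the ten highest-ranking tests and the number called at FDR 5%.
top10 <- order(padj)[1:10]
n_called <- sum(padj < 0.05)
n_called
```

Because the adjustment is monotone in the raw p-values, ranking by adjusted p-value gives the same order as ranking by raw p-value; the adjustment only changes where a cut-off is drawn.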
