Inferring trees of clonal evolution from single-cell SNVs using oncoNEM

Edith M Ross and Florian Markowetz

University of Cambridge,
Cancer Research UK Cambridge Institute,
Cambridge, UK
September 9, 2015

Abstract
OncoNEM is an automated method for reconstructing clonal lineage trees from SNVs of multiple single
tumour cells that exploits the nested structure of mutation patterns of related cells. While accounting for
the high error rates of single-cell sequencing, oncoNEM identifies subpopulations and unobserved ancestral
subpopulations including their genotypes and evolutionary relationships. This vignette explains the use of the
oncoNEM package. For a detailed description of the underlying methods please see our paper.

Contents
1 Installing the oncoNEM package
2 Input data
3 Tree inference
3.1 Parameter estimation
3.2 Initialization
3.3 Initial search
3.4 Inferring unobserved nodes
3.5 Clustering cells into subpopulations
3.6 Estimating branch lengths through the occurrence parameter Θ
3.7 Plotting trees
4 Session Info

1 Installing the oncoNEM package


The oncoNEM package uses functions from the following R packages: Rcpp, igraph, ggm. To install those packages
run the following commands in R:

install.packages(c('Rcpp','igraph','ggm'))

The oncoNEM package can be installed from Bitbucket by running the following commands in R:

install.packages('devtools')
library(devtools)
install_bitbucket("edith_ross/oncoNEM")

2 Input data
As input, oncoNEM expects a binary genotype matrix in which 0 denotes an unmutated site and 1 a mutated site.
The rows of the input matrix should correspond to mutations and the columns to cells. If the data contain
missing values, those should be coded as 2.
Note that the data should not include a column for the normal, as this implementation automatically assumes
that the genotype of the normal is 0 at every mutation site.
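
As an illustration of this input format, the following base R sketch prepares a small matrix for oncoNEM. The object calls, its dimension names and the use of NA for missing calls are hypothetical; only the coding rules (0, 1, 2, no normal column) come from the description above.

```r
# Hypothetical raw calls: rows are mutation sites, columns are samples.
# NA marks a site without a call. All names here are illustrative.
calls <- matrix(c( 0, 1, NA,
                   1, 1,  0,
                  NA, 0,  1,
                   0, 0,  0),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("site", 1:4),
                                c("normal", "cell1", "cell2")))

# Drop the normal sample: its genotype is assumed to be 0 at every site.
D <- calls[, colnames(calls) != "normal", drop = FALSE]

# Code missing values as 2.
D[is.na(D)] <- 2

# Sanity check: only 0, 1 and 2 may remain.
stopifnot(all(D %in% c(0, 1, 2)))
```

The resulting matrix D has mutations in rows and cells in columns, which is the orientation described above.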
For the purpose of this vignette we simulate a small data set containing 20 cells from 5 clones, one of which
is unobserved. We simulate 300 mutations with a false positive rate of 20% and a false negative rate of 10%;
20% of the data entries are missing.

library(oncoNEM)
set.seed(1)
simData <- simulateData(N.cells = 20,
                        N.clones = 5,
                        N.unobs = 1,
                        N.sites = 300,
                        FPR = 0.2,
                        FNR = 0.1,
                        p.missing = 0.2)

This produces a list containing different objects. The simulated "observed" genotypes can be accessed with $D.

head(simData$D)

##      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## [1,] 0 0 1 1 0 1 0 0 2  0  0  2  0  2  0  0  0  0  0  0
## [2,] 1 0 1 0 1 2 1 2 1  1  1  1  1  1  1  1  1  0  1  1
## [3,] 1 0 1 0 0 1 1 1 0  1  1  1  2  1  0  0  0  0  1  1
## [4,] 1 1 1 1 1 1 1 1 1  1  1  1  2  0  1  1  1  1  1  1
## [5,] 1 0 1 0 1 0 1 2 1  1  2  1  1  1  1  1  1  0  1  0
## [6,] 1 0 0 1 1 0 1 1 2  1  2  1  1  1  1  2  1  1  1  1

For benchmarking purposes simData also contains

• theta: the occurrence parameter, i.e. an index that describes in which clone each of the mutations occurred;
• gtyp: the true genotypes;
• g: the true tree structure from which the genotype data were simulated;
• clones: a list in which the i-th element contains the indices of all cells belonging to node i in g;
• FPR, FNR: the error rates used in simulation.
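
To make the relationship between a tree, the occurrence parameter and the genotypes concrete, here is a minimal base R sketch. It assumes a toy tree stored as a parent vector and that a mutation is present in the clone where it occurred and in all descendants of that clone. The names parent, toyTheta, toyGtyp and isAncestorOrSelf are illustrative, and the parent-vector encoding need not match the internal format of simData.

```r
# Toy tree as a parent vector (an assumed encoding): node 1 is the normal
# (root, parent 0), node 2 descends from node 1, nodes 3 and 4 from node 2.
parent <- c(0, 1, 2, 2)

# Toy occurrence parameter: the clone in which each of 5 mutations occurred.
toyTheta <- c(2, 2, 3, 4, 3)

# TRUE if node 'anc' equals node 'v' or lies on the path from 'v' to the root.
isAncestorOrSelf <- function(anc, v, parent) {
  while (v != 0) {
    if (v == anc) return(TRUE)
    v <- parent[v]
  }
  FALSE
}

# Genotype matrix: rows are mutations, columns are nodes of the tree.
toyGtyp <- sapply(seq_along(parent), function(v) {
  sapply(toyTheta, function(k) as.integer(isAncestorOrSelf(k, v, parent)))
})
```

In this toy example the mutation sets are nested along each path from the root: nodes 3 and 4 both carry the two mutations of node 2 in addition to their own.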

3 Tree inference
3.1 Parameter estimation
The scoring method of oncoNEM depends on two input parameters, the false positive rate (FPR) and the false
negative rate (FNR). If those parameters are known, you can proceed directly to the next section. Usually, however,
they are unknown. In that case, we suggest estimating them with a maximum-likelihood approach based on
a search over a grid of parameter combinations.
First we select the parameter combinations over which we want to maximize. Here, we choose only a small
parameter range to keep the runtime short.

test.fpr <- seq(from = 0.15, to = 0.3, length.out = 6)
test.fnr <- seq(from = 0.05, to = 0.2, length.out = 6)

Next we run the initial search (for details on how to do this see the next sections), which estimates cell lineage
trees, and save the log-likelihood of the highest scoring tree for each of those parameter combinations:

llh <- matrix(0, nrow = length(test.fpr), ncol = length(test.fnr))

for (i.fpr in seq_along(test.fpr)) {
  for (i.fnr in seq_along(test.fnr)) {
    ## initialize oncoNEM
    oNEM <- oncoNEM$new(Data = simData$D,
                        FPR = test.fpr[i.fpr],
                        FNR = test.fnr[i.fnr])

    ## run initial search
    oNEM$search(delta = 200)

    ## save log-likelihood of highest-scoring tree
    llh[i.fpr, i.fnr] <- oNEM$best$llh
  }
}

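Then we choose the parameter combination that yields the highest scoring tree overall. Taking the first row of
indx keeps the selection well defined even if several combinations tie for the maximum:

indx <- which(llh == max(llh), arr.ind = TRUE)
fpr.est <- test.fpr[indx[1, 1]]
fnr.est <- test.fnr[indx[1, 2]]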
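
The grid search can also be tried without the package on a toy problem. The sketch below is not the oncoNEM score: it assumes the true genotypes are known and uses a simple independent error model in which a true 0 is observed as 1 with probability FPR and a true 1 is observed as 0 with probability FNR. All names (toyG, toyD, toyLogLik, toy.fpr, toy.fnr, toy.llh) are illustrative.

```r
set.seed(1)

# Toy "true" genotypes: 300 sites x 20 cells, and one uniform draw per entry.
toyG <- matrix(rbinom(300 * 20, 1, 0.5), nrow = 300)
u <- matrix(runif(length(toyG)), nrow = nrow(toyG))

# Observed data with FNR = 0.1 (true 1 seen as 0) and FPR = 0.2 (true 0 seen as 1).
toyD <- ifelse(toyG == 1, as.integer(u >= 0.1), as.integer(u < 0.2))

# Log-likelihood of the observations under the assumed error model.
toyLogLik <- function(D, G, fpr, fnr) {
  sum(D == 1 & G == 0) * log(fpr) +
    sum(D == 0 & G == 0) * log(1 - fpr) +
    sum(D == 0 & G == 1) * log(fnr) +
    sum(D == 1 & G == 1) * log(1 - fnr)
}

# Evaluate the log-likelihood on the same grid as in the vignette.
toy.fpr <- seq(from = 0.15, to = 0.3, length.out = 6)
toy.fnr <- seq(from = 0.05, to = 0.2, length.out = 6)
toy.llh <- outer(toy.fpr, toy.fnr,
                 Vectorize(function(a, b) toyLogLik(toyD, toyG, a, b)))

# Pick the grid point with the highest log-likelihood (first one if tied).
toy.indx <- which(toy.llh == max(toy.llh), arr.ind = TRUE)
fpr.hat <- toy.fpr[toy.indx[1, 1]]
fnr.hat <- toy.fnr[toy.indx[1, 2]]
```

With oncoNEM the genotypes are not known, so each grid point requires its own tree search, which is why the loop above is much slower than this toy version.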

To get an idea of how the likelihood changes with the input parameters and to compare our estimate with the
true parameters we plot a heatmap with ggplot2.

library(ggplot2)

## format data for ggplot
df <- cbind(expand.grid(test.fpr, test.fnr),
            llh[as.matrix(expand.grid(seq_along(test.fpr),
                                      seq_along(test.fnr)))])
colnames(df) <- c('fpr', 'fnr', 'llh')
errorRates <- data.frame(x = c(simData$FPR, fpr.est),
                         y = c(simData$FNR, fnr.est),
                         Parameter = c('Ground truth', 'Estimate'))

## plot heatmap
ggplot(df, aes(fpr, fnr)) +
  geom_raster(aes(fill = llh), hjust = 0.5, vjust = 0.5) +
  geom_point(data = errorRates, aes(x = x, y = y, shape = Parameter),
             colour = "black", bg = "white", size = 6) +
  scale_shape_manual(values = c('Ground truth' = 24,
                                'Estimate' = 25)) +
  labs(x = 'False positive rate',
       y = 'False negative rate') +
  scale_fill_gradientn(colours = rainbow(8)[1:7],
                       name = 'Log-likelihood',
                       limits = c(max(llh) - 200 + 5, max(llh) + 5)) +
  theme(aspect.ratio = 1)
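
The data formatting step relies on expand.grid varying its first argument fastest, which matches the column-major storage of an R matrix. The following sketch checks this on a small made-up matrix; the values and the names m, grid, fprVals and fnrVals are purely illustrative.

```r
# Made-up log-likelihoods for 2 FPR values (rows) and 3 FNR values (columns).
fprVals <- c(0.1, 0.2)
fnrVals <- c(0.05, 0.10, 0.15)
m <- matrix(c(-10, -12, -11, -9, -13, -14), nrow = 2)

# Long format: one row per (fpr, fnr) combination, first argument varies fastest.
grid <- expand.grid(fpr = fprVals, fnr = fnrVals)
grid$llh <- as.vector(m)

# The index-matrix lookup used in the vignette gives the same ordering.
viaIndex <- m[as.matrix(expand.grid(seq_along(fprVals), seq_along(fnrVals)))]
stopifnot(identical(grid$llh, viaIndex))
```

Either form produces a data frame in which each row pairs a parameter combination with its own log-likelihood, which is what geom_raster needs.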

3.2 Initialization
Before we can begin with the actual inference of a tree we need to initialize a new object of the oncoNEM Reference
Class using the observed genotypes and the estimated error rates.

## initialize oncoNEM
oNEM <- oncoNEM$new(Data = simData$D,
                    FPR = fpr.est,
                    FNR = fnr.est)

3.3 Initial search


During initialization, a star tree is automatically added as the start tree for the initial search. (Different trees
can be added using the Reference Method addTree, but this is usually not necessary.)
To avoid confusion, note that Reference Methods are called using the $ operator and doing so directly changes
the object from which they are called.
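
Because this in-place behaviour differs from that of most R objects, a small sketch may help. It uses a hypothetical Counter class built with setRefClass from the methods package; the class is not part of oncoNEM and only illustrates Reference Class semantics.

```r
library(methods)

# Hypothetical Reference Class with one field and one method.
Counter <- setRefClass("Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function(by = 1) {
      # '<<-' modifies the field of the object itself
      count <<- count + by
      invisible(.self)
    }
  )
)

a <- Counter$new(count = 0)
b <- a            # no copy is made: 'b' refers to the same object as 'a'
a$increment()     # changes the object in place, so 'b' sees the change too
countSeenByB <- b$count

d <- a$copy()     # an explicit copy is independent of 'a'
a$increment()
countInCopy <- d$count
```

Here countSeenByB is 1 and countInCopy stays at 1 although a has been incremented twice. For the same reason, calling a method on an oncoNEM object updates that object without any assignment.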
The following command runs the initial search until the highest scoring tree has not changed for delta = 200
steps in order to find an initial cell lineage tree that will be refined in the subsequent steps:

oNEM$search(delta = 200)
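
The role of delta as a stopping criterion can be sketched with a generic local search on a toy objective. This is not the search implemented in oncoNEM, whose moves through tree space and scoring are not reproduced here; in particular, accepting only strictly better candidates is an assumption of this sketch, and searchUntilStable, score and propose are hypothetical names.

```r
# Generic local search: stop once the best solution has not changed
# for 'delta' consecutive proposals.
searchUntilStable <- function(score, propose, start, delta) {
  best <- start
  bestScore <- score(start)
  unchanged <- 0
  while (unchanged < delta) {
    candidate <- propose(best)
    candScore <- score(candidate)
    if (candScore > bestScore) {
      best <- candidate
      bestScore <- candScore
      unchanged <- 0            # improvement found: reset the counter
    } else {
      unchanged <- unchanged + 1
    }
  }
  list(best = best, llh = bestScore)
}

# Toy problem: maximise -(x - 3)^2 over the integers with steps of +1 or -1.
set.seed(1)
res <- searchUntilStable(score = function(x) -(x - 3)^2,
                         propose = function(x) x + sample(c(-1, 1), 1),
                         start = 0,
                         delta = 200)
```

A larger delta makes it less likely that the search stops before a better solution is proposed, at the cost of a longer runtime.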

If we want to have a look at the best tree found so far we can do that using

oNEM$best

## $tree
## [1] 16 18 10 2 9 4 10 20 15 11 8 11 17 7 1 13 0 17 15 5
##
## $llh
## [1] -2582.982
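
The $tree element printed above can be read as a parent vector, in which entry i gives the parent of cell i and 0 stands for the normal. This reading is an assumption, although it agrees with the plotted tree. Under that assumption, the following base R sketch turns the vector into an edge list; the names bestTree, edges, children and leaves are illustrative.

```r
# Parent vector copied from the output above; 0 is read as the normal (N).
bestTree <- c(16, 18, 10, 2, 9, 4, 10, 20, 15, 11,
              8, 11, 17, 7, 1, 13, 0, 17, 15, 5)

# One row per edge, pointing from parent to child.
edges <- data.frame(parent = ifelse(bestTree == 0, "N", as.character(bestTree)),
                    child = as.character(seq_along(bestTree)),
                    stringsAsFactors = FALSE)

# Children of every internal node, and the cells without children (leaves).
children <- split(edges$child, edges$parent)
leaves <- setdiff(edges$child, edges$parent)
```

For example, children[["N"]] contains only cell 17, which is therefore the cell attached directly to the normal.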

The cell lineage tree can be plotted by calling

plotTree(tree = oNEM$best$tree,clones = NULL,vertex.size = 25)

[Figure: cell lineage tree found by the initial search, with the normal N as root and the 20 cells as nodes.]

3.4 Inferring unobserved nodes


Next we test if we can identify any unobserved subpopulations.

oNEM.expanded <- expandOncoNEM(oNEM, epsilon = 10, delta = 200,
                               checkMax = 10000, app = TRUE)

In this case the algorithm added one unobserved node, with node ID 21. Note that if no unobserved
subpopulation is identified, oNEM.expanded will contain the same solution as oNEM.

plotTree(tree = oNEM.expanded$best$tree,
         clones = NULL,
         vertex.size = 25)

[Figure: cell lineage tree after expansion, including the unobserved node 21.]

3.5 Clustering cells into subpopulations
In the final refinement step, we cluster cells along branches to identify subpopulations.

oncoTree <- clusterOncoNEM(oNEM = oNEM.expanded,
                           epsilon = 10)

3.6 Estimating branch lengths through the occurrence parameter Θ


Before we plot the final result, we estimate branch lengths for the tree. The first step here is to calculate the
posterior probabilities of the occurrence parameter Θ.

post <- oncoNEMposteriors(tree = oncoTree$g,
                          clones = oncoTree$clones,
                          Data = oNEM$Data,
                          FPR = oNEM$FPR,
                          FNR = oNEM$FNR)

Then, for every subpopulation, we sum the posterior probabilities that a mutation occurred in that
subpopulation over all mutations. The first column of post$p_theta corresponds to the normal and needs to be
removed, as the normal has no incoming edge.

edgeLengths <- colSums(post$p_theta)[-1]
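
A toy calculation shows what this sum does. The matrix below is made up: it assumes four mutations, a normal plus two subpopulations, and rows that each sum to one, with no posterior mass on the normal. The names toyPost and toyEdgeLengths are illustrative.

```r
# Made-up posterior of the occurrence parameter:
# rows are mutations, columns are nodes (normal first).
toyPost <- matrix(c(0, 0.9, 0.1,
                    0, 0.2, 0.8,
                    0, 1.0, 0.0,
                    0, 0.5, 0.5),
                  nrow = 4, byrow = TRUE)

# Expected number of mutations on the edge leading into each subpopulation.
toyEdgeLengths <- colSums(toyPost)[-1]
```

In this example the two edge lengths are 2.6 and 1.4, which add up to the four mutations in the toy data.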

3.7 Plotting trees


The resulting tree with variable edge lengths can be plotted as follows:

plotTree(tree = oncoTree$g,
         clones = oncoTree$clones,
         e.length = edgeLengths,
         label.length = 4,
         axis = TRUE)

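[Figure: inferred clonal tree rooted at the normal N, with edge lengths shown on a vertical axis from 0 to −200. Node labels: {2, 18}; {1, 5, 9, 13, 15, 16, 17, 19}; {4, 6}; {3, 7, 8, 10, 11, 12, 14, 20}.]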

As a comparison we also plot the ground truth tree.

plotTree(tree = simData$g,
         clones = simData$clones,
         e.length = table(simData$theta),
         label.length = 4,
         axis = TRUE)

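[Figure: ground truth clonal tree rooted at the normal N, with edge lengths shown on a vertical axis from 0 to −200. Node labels: {1, 5, 9, 13, 15, 16, 17, 19}; {2, 18}; {4, 6}; {3, 7, 8, 10, 11, 12, 14, 20}.]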

Instead of labelling each clone with the IDs of the cells it contains, we can also remove the node labels and plot a
tree in which the size of each node corresponds to the size of the subpopulation.

vSize <- sapply(oncoTree$clones, length)

## adjust size of Normal
vSize[1] <- 10
plotTree(tree = oncoTree$g,
         clones = NULL,
         e.length = edgeLengths,
         axis = TRUE,
         vertex.size = vSize,
         vertex.label = c('N', rep(NA, vcount(oncoTree$g) - 1)),
         edge.arrow.mode = '-',
         vertex.color = c('paleturquoise3',
                          rep('thistle', vcount(oncoTree$g) - 1)))

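[Figure: inferred clonal tree without cell labels; node size corresponds to subpopulation size, and edge lengths are shown on a vertical axis from 0 to −200.]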

4 Session Info

sessionInfo()

## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## Running under: OS X 10.8.5 (Mountain Lion)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods
## [7] base
##
## other attached packages:
## [1] ggplot2_1.0.1 RBGL_1.42.0 graph_1.44.1 igraph_0.7.1
## [5] oncoNEM_1.0 Rcpp_0.11.6 knitr_1.10.5
##
## loaded via a namespace (and not attached):
## [1] BiocGenerics_0.12.1 codetools_0.2-11 colorspace_1.2-6
## [4] digest_0.6.8 evaluate_0.7 formatR_1.2
## [7] ggm_2.3 grid_3.1.3 gtable_0.1.2
## [10] highr_0.5 labeling_0.3 magrittr_1.5
## [13] MASS_7.3-40 munsell_0.4.2 parallel_3.1.3
## [16] plyr_1.8.2 proto_0.3-10 reshape2_1.4.1
## [19] scales_0.2.4 stats4_3.1.3 stringi_0.4-1
## [22] stringr_1.0.0 tools_3.1.3
