
RNA-Seq Analysis

Simon Andrews
simon.andrews@babraham.ac.uk
@simon_andrews

v2.0
Licence
This presentation is © 2013-14, Simon Andrews.

This presentation is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0 licence. This means that you are free:

to copy, distribute, display, and perform the work

to make derivative works

Under the following conditions:

Attribution. You must give the original author credit.

Non-Commercial. You may not use this work for commercial purposes.

Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.

Please note that:

For any reuse or distribution, you must make clear to others the licence terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Nothing in this license impairs or restricts the author's moral rights.

Full details of this licence can be found at
http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode
RNA-Seq Libraries
rRNA depleted mRNA

Fragment

Random prime + RT

2nd strand synthesis (+ U)

A-tailing

Adapter Ligation

(U strand degradation)

Sequencing
RNA-Seq Analysis
QC
(Trimming)
Mapping
Mapped QC
Quantitation
Statistical Analysis
QC: Raw Data
Sequence call quality
QC: Raw Data
Sequence bias

QC: Raw Data
Duplication level

Mapping
Exon 1 Exon 2 Exon 3 Genome
Simple mapping within exons
Mapping between exons
Spliced mapping
Can simplify by aligning first to a transcriptome
and then translate back to genomic coordinates.
Can map unmatched reads to the whole genome.
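
The translation from transcriptome back to genomic coordinates can be sketched as below. This is a minimal illustration, not how any particular aligner implements it. It assumes a forward-strand transcript, 0-based half-open exon intervals, and a hypothetical three-exon gene whose coordinates are made up for the example.

```python
def transcript_to_genome(exons, tpos):
    """Translate a 0-based transcript position into a 0-based genomic position.

    exons: sorted list of (start, end) genomic intervals, half-open,
           for a forward-strand transcript (an assumption of this sketch).
    """
    if tpos < 0:
        raise ValueError("transcript position must not be negative")
    offset = tpos
    for start, end in exons:
        length = end - start
        if offset < length:
            return start + offset   # position falls inside this exon
        offset -= length            # skip this exon and carry on
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical exon coordinates, invented for illustration only
exons = [(100, 200), (300, 350), (500, 600)]
print(transcript_to_genome(exons, 0))     # first base of exon 1
print(transcript_to_genome(exons, 120))   # 20 bases into exon 2
```

A read whose start and end translate into different exons is a spliced alignment once it is reported on the genome.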
Spliced Mapping Software
TopHat (http://tophat.cbcb.umd.edu/)

STAR (http://code.google.com/p/rna-star/)

TopHat
Automated splice junction detection
Can provide GFF/GTF file
Maps against transcriptome first
Requires bowtie (linear) or bowtie2 (gapped)
Outputs BAM files
TopHat pipeline
Reference FASTA files -> Indexed Genome
Reference GTF Models -> Indexed Transcriptome
Reads -> Maps to transcriptome?
  Yes: Translate coords and report
  No: Maps to genome?
    Yes: Report
    No: Split map to genome?
      Yes: Build consensus and report
      No: Discard
TopHat Summary
Left reads:
Input: 24128554
Mapped: 23232131 (96.3% of input)
of these: 1584577 ( 6.8%) have multiple alignments (1584577 have >1)
Right reads:
Input: 24128554
Mapped: 22488314 (93.2% of input)
of these: 1409835 ( 6.3%) have multiple alignments (1409835 have >1)
94.7% overall read alignment rate.

Aligned pairs: 22137913
of these: 1013864 ( 4.6%) have multiple alignments
and: 944383 ( 4.3%) are discordant alignments
87.8% concordant pair alignment rate.
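
The percentages in this summary can be recomputed from the raw numbers, which is a useful sanity check when comparing runs. The sketch below uses only the figures shown above. The formula for the concordant pair rate (aligned pairs minus discordant pairs, divided by input pairs) is an inference that happens to reproduce the reported value, not something taken from the TopHat documentation.

```python
# Numbers copied from the summary above
left_input, left_mapped = 24128554, 23232131
right_input, right_mapped = 24128554, 22488314
aligned_pairs, discordant_pairs = 22137913, 944383


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


left_rate = pct(left_mapped, left_input)
right_rate = pct(right_mapped, right_input)
overall = pct(left_mapped + right_mapped, left_input + right_input)
# Assumed definition: concordant = aligned pairs that are not discordant
concordant = pct(aligned_pairs - discordant_pairs, left_input)

print(left_rate, right_rate, overall, concordant)
```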
Post Mapping QC
Mapping statistics
Proportion of reads which are in transcripts
Proportion of reads in transcripts in exons
Strand specificity
Consistency of coverage

RNA-SeQC (easiest through GenePattern)
Quantitation
Exon 1 Exon 2 Exon 3
Exon 1 Exon 3
Splice form 1
Splice form 2
Definitely splice form 1
Definitely splice form 2
Ambiguous
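
The three read categories in this diagram can be expressed as a small classification rule. The sketch below is an illustration under simplifying assumptions: each read is reduced to the ordered list of exons it touches, and a read is compatible with a splice form when those exons appear as a consecutive run in that form. The function names are hypothetical.

```python
ISOFORMS = {
    "splice form 1": ["Exon 1", "Exon 2", "Exon 3"],
    "splice form 2": ["Exon 1", "Exon 3"],
}


def is_compatible(read_exons, isoform_exons):
    """True if the read's exons form a consecutive run within the isoform."""
    n = len(read_exons)
    return any(isoform_exons[i:i + n] == read_exons
               for i in range(len(isoform_exons) - n + 1))


def classify(read_exons):
    """Label a read by the splice forms it could have come from."""
    hits = [name for name, exons in ISOFORMS.items()
            if is_compatible(read_exons, exons)]
    if len(hits) == 1:
        return "Definitely " + hits[0]
    return "Ambiguous" if hits else "Unassigned"


print(classify(["Exon 2"]))             # only splice form 1 contains exon 2
print(classify(["Exon 1", "Exon 3"]))   # junction only present in splice form 2
print(classify(["Exon 1"]))             # shared by both forms
```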
Options for handling splice variants
Ignore them: analyse at gene level
Simple, powerful, inaccurate in some cases
DESeq, edgeR, baySeq
Ignore them: analyse at exon level
Simple, some splicing detection, mixed signals
DEXSeq
Assign ambiguous reads based on unique ones
Potentially cleaner, more powerful signal
High degree of uncertainty, risk of false confidence
Cufflinks etc.
Read counting
Simple (exon or transcript)
HTSeq (htseq-count)
BEDTools (multicov)
featureCounts

Complex (re-assignment)
Cufflinks
Counting Options
Raw counts
Simple counts of the number of reads per gene
Unambiguous
Will require normalisation
Input for tools which internally normalise (DESeq, edgeR)

RPM (Reads per million reads of library)
Corrects for total library coverage
Comparable between different datasets

RPKM (Reads per kilobase of transcript per million reads of library)
Corrects for total library coverage
Corrects for gene length
Comparable between different genes within the same dataset
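
The two corrected measures follow directly from their definitions. The sketch below is a minimal worked example; the gene count, library size and gene length are made-up numbers chosen to keep the arithmetic easy to follow.

```python
def rpm(count, library_size):
    """Reads per million reads of library."""
    return count * 1e6 / library_size


def rpkm(count, library_size, gene_length_bp):
    """Reads per kilobase of transcript per million reads of library."""
    return rpm(count, library_size) / (gene_length_bp / 1000.0)


# Hypothetical gene: 500 reads, 2 kb long, in a library of 10 million reads
count, library_size, gene_length_bp = 500, 10_000_000, 2000
print(rpm(count, library_size))                    # 50 reads per million
print(rpkm(count, library_size, gene_length_bp))   # 25 per kb per million
```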
RPKM
Most widely used measure
Simple, easy to understand
Has problems!

Changes in highly expressed genes
Small changes in highly expressed genes (especially differences in rRNA contamination) cause a global shift in all other values

Changes in lowly expressed genes
Small changes across lowly expressed genes (especially differences in DNA contamination) cause differences across a wide number of genes.

Mixing of noise levels
Noise is generally linked to the number of observations
The same RPKM value could come from
A small lowly observed gene with high noise
A large well observed gene with low noise
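
The global shift caused by a highly expressed contaminant can be seen with a toy calculation. In the sketch below, two hypothetical libraries contain exactly the same number of reads for a gene of interest and for all other genes, and differ only in their rRNA read count. All numbers are invented for illustration.

```python
def rpm(count, library_size):
    """Reads per million reads of library."""
    return count * 1e6 / library_size


# Two hypothetical libraries that differ only in rRNA contamination
library_1 = {"rRNA": 100_000, "gene_A": 1_000, "all_other_genes": 899_000}
library_2 = {"rRNA": 600_000, "gene_A": 1_000, "all_other_genes": 899_000}

rpm_1 = rpm(library_1["gene_A"], sum(library_1.values()))
rpm_2 = rpm(library_2["gene_A"], sum(library_2.values()))

# gene_A has the same raw count in both, yet its RPM drops in library 2
print(rpm_1, rpm_2)
```

Because the library total is the denominator for every gene, the same shift is applied to all other values in library 2.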
Normalisation
Filtering Genes
Remove things which are uninteresting or shouldn't be measured
Reduces noise, making it easier to achieve significance
Non-coding (miRNA, snoRNA etc.) in RNA-Seq
Known mis-spliced forms (exon skipping etc.)
Mitochondrial genes
X/Y chr genes in mixed sex populations
Unknown genes
Filtering Mouse mRNAs
Non coding
ESTs
Predicted genes
Good transcripts
Visualising Expression
Comparing the same gene in different samples
Normalised log2 RPM values

Comparing different genes in the same sample
Normalised log2 RPKM values
[Figure: expression on linear and log2 scales; labelled genes: Eef1a1, Actb, Lars2, Eef2, CD74]
Differential Expression
Microarrays traditionally used continuous statistical tests (t-test, ANOVA etc.)

RNA-Seq differs in that it is count based data,
so continuous tests fail at low counts

Most differential tests use count based
distribution tests, usually based on a negative
binomial distribution
Negative binomial tests
Are the counts we see for gene X in condition 1 consistent with those for gene X in condition 2?
Initially modelled using a simple Poisson distribution with mean expression as the only parameter
Doesn't model real data very well
Poisson vs Negative binomial
Poisson
Negative binomial
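
The practical difference between the two distributions is how variance grows with the mean. The sketch below uses a common parameterisation of the negative binomial, variance = mean + dispersion x mean^2, which is an assumption here rather than something stated on the slide. The dispersion value of 0.1 is hypothetical.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Assumed parameterisation: variance = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # hypothetical value for illustration
for mean in (10, 100, 1000):
    print(mean,
          poisson_variance(mean),
          negative_binomial_variance(mean, dispersion))
```

At low counts the two models are close, while at high counts the extra term dominates, which is why a mean-only Poisson model underestimates the variability seen between biological replicates.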
Parameters
Size factors
Estimator of library sampling depth
More stable measure than total coverage
Based on median ratio between conditions

Dispersion (Variance)
Can be measured per gene with large sample sizes
Not enough information in small sample sizes
Information sharing between genes with similar
average observation to improve estimation
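
A median ratio size factor can be sketched as below. This follows the approach generally described for DESeq, where each sample is compared to a per-gene geometric mean across all samples and the median ratio is taken. That reading is an assumption about what the slide refers to, and the count matrix is invented for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene lists, one raw count per sample.
    Genes containing a zero are skipped because their geometric mean is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1
counts = [[10, 20], [100, 200], [50, 100], [0, 3]]
factors = size_factors(counts)
print(factors)
```

Using the median rather than the library total means that a few very highly expressed genes have little influence on the estimate.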
Dispersion shrinkage
Plot observed per gene dispersion

Calculate average dispersion for genes with similar observation

Individual dispersions regressed towards the mean. Weighted by:
Distance from mean
Number of observations

Points more than 2SD above the mean are not regressed
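
The idea of regressing individual dispersions towards the average can be sketched as a weighted average on the log scale. This is a deliberately simplified illustration and not the estimator used by DESeq. The `prior_weight` parameter, the weighting formula and the example values are all hypothetical; only the 2SD exemption comes from the text above.

```python
import math


def shrink_dispersion(observed, trend, n_observations, log_sd, prior_weight=4.0):
    """Pull an observed dispersion towards the trend value (log scale).

    observed:       per-gene dispersion estimate
    trend:          average dispersion of genes with similar observation
    n_observations: number of samples behind the estimate
    log_sd:         spread of log dispersions around the trend
    prior_weight:   hypothetical strength of the pull towards the trend
    """
    log_obs, log_trend = math.log(observed), math.log(trend)
    if log_obs - log_trend > 2 * log_sd:
        return observed  # more than 2SD above the trend: not regressed
    # More observations give more weight to the gene's own estimate
    w = n_observations / (n_observations + prior_weight)
    return math.exp(w * log_obs + (1 - w) * log_trend)


print(shrink_dispersion(0.4, 0.1, n_observations=4, log_sd=1.0))
print(shrink_dispersion(0.4, 0.1, n_observations=36, log_sd=1.0))
print(shrink_dispersion(5.0, 0.1, n_observations=4, log_sd=1.0))
```

Because the average is taken on the log scale, estimates further from the trend move by a larger amount, and estimates backed by more observations move less.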
Outlier removal
Some genes have very large variance
E.g. 3 replicates with counts 0,1 and 5000

Measured by Cook's distance
The effect on the mean from the removal of any individual observation

Calculated for n>=3
Flagged genes removed for n = 3-6
Outlier measures replaced by trimmed mean for n>6
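
Cook's distance can be illustrated on the example counts above. The sketch below uses the textbook formula for an ordinary least squares model with only an intercept (the mean), which is a simplification: DESeq-style tools calculate the distance from their own count model fit. It needs at least two values that are not all identical.

```python
def cooks_distances(values):
    """Cook's distance for each observation under an intercept-only model."""
    n = len(values)
    p = 1                                   # one fitted parameter: the mean
    mean = sum(values) / n
    residuals = [v - mean for v in values]
    s2 = sum(e * e for e in residuals) / (n - p)   # residual variance
    h = 1.0 / n                             # leverage, equal for every point
    return [(e * e / (p * s2)) * h / (1 - h) ** 2 for e in residuals]


counts = [0, 1, 5000]   # the three replicates from the example above
distances = cooks_distances(counts)
for count, d in zip(counts, distances):
    print(count, round(d, 3))
```

The replicate with 5000 reads has by far the largest distance, reflecting how strongly the mean depends on that single observation.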

Can be turned off as an extra option
Replicates
Compared to arrays, RNA-Seq is a very clean technical measure of expression
Generally don't run technical replicates

Some statistics can be run on single replicates, but they can only tell you about technical noise (how likely is it that this change is due to a technical issue?)

Assessing biological variation requires biological replicates
Replicates
Traditional statistics require min 3x3

DESeq can operate at 2x2, but this is a minimum, not recommended

True number of replicates required will depend on your biology and
requirements

4x4 design is fairly common

Always expect at least one sample to fail

Randomise samples during sample prep
The problem of power
In a library, Gene B is much better observed for the same copy number
Power to detect DE is proportional to length
Gene A (1kb)
Gene B (8kb)
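
The effect of length on observation can be put into numbers. The sketch below assumes reads are sampled in proportion to transcript length times copy number, and uses the Poisson relationship that relative noise falls as 1/sqrt(count). The copy number and sampling rate are hypothetical.

```python
import math


def expected_reads(length_kb, copies, reads_per_kb_per_copy):
    """Expected read count if sampling is proportional to length x copies."""
    return length_kb * copies * reads_per_kb_per_copy


def poisson_cv(count):
    """Relative counting noise (coefficient of variation) for a Poisson count."""
    return 1.0 / math.sqrt(count)


copies, rate = 10, 5          # hypothetical values, identical for both genes
reads_a = expected_reads(1, copies, rate)   # Gene A (1kb)
reads_b = expected_reads(8, copies, rate)   # Gene B (8kb)

print(reads_a, round(poisson_cv(reads_a), 3))
print(reads_b, round(poisson_cv(reads_b), 3))
```

With the same copy number, Gene B collects eight times as many reads and so has lower relative noise, which makes the same fold change easier to detect.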
5x5 Replicates

5,000 out of 22,000 genes (23%) identified as DE using DESeq (p<0.05)
Intensity difference test
Different approach to differential expression
Doesn't aim to find every differentially expressed gene
Conservative test
Guaranteed to never return large numbers of hits
Assumptions
Noise is related to observation level
Similar to DESeq

Differences between conditions are either
A direct response to stimulus
Noise, either technical or biological

Find points whose differences aren't explained by general disruption
Method
Results
Exercises
Look at raw QC
Mapping with TopHat
Small test data
Quantitation and visualisation with SeqMonk
Larger replicated data
Differential expression with DESeq
Review in SeqMonk
Useful links
FastQC www.bioinformatics.babraham.ac.uk/projects/fastqc/
TopHat tophat.cbcb.umd.edu/
SeqMonk www.bioinformatics.babraham.ac.uk/projects/seqmonk/
Cufflinks cufflinks.cbcb.umd.edu/
DESeq www-huber.embl.de/users/anders/DESeq/
Bioconductor www.bioconductor.org/

For more training courses and tutorials please see
www.bioinformatics.babraham.ac.uk
