
Talleres Internacionales de Bioinformática

Cuernavaca, Jan 2017

RNA-seq analysis
Denis Puthier, Claire Rioualen and Jacques van Helden

Transcriptome analysis

Tentative definition
Transcriptome: the set of all RNA molecules produced by a cell or population of cells at a given moment.

Some players of the RNA world

Messenger RNA (mRNA)
  Protein coding
  Most are polyadenylated (histone mRNAs are an exception)
  1-5% of total RNA
Ribosomal RNA (rRNA)
  4 types in eukaryotes (18S, 28S, 5.8S, 5S)
  80-90% of total RNA
Transfer RNA (tRNA)
  15% of total RNA

Some players of the RNA world

miRNA
  Regulatory RNA (mostly through binding to the 3' UTR of target genes)
snRNA
  Uridine-rich
  Several are related to the splicing mechanism
  Some are found in the nucleolus (snoRNA)
    Related to rRNA biogenesis

Some players of the RNA world

eRNA
  Enhancer RNA
And many others (e.g. lncRNA)

Microarray drawbacks
Cross-hybridization
Probe design issues
Content limited
  Can only measure the expression of known sequences, not discover new ones
Indirect record of expression level
  Complementary probes
  Relative abundance

Even more powerful technology: RNA-seq

RNA-seq simplified overview

Objective: sequencing of DNA/RNA fragments derived from transcripts

Library construction -> Sequencing, followed by one of two routes:
(I) RNA-seq with a reference genome: Alignment -> Quantification (counting) -> Differential expression analysis
(II) De novo RNA-seq (no reference genome): Assembly -> Quantification (counting) -> Differential expression analysis

We will talk about (I).

RNA-seq library construction: protocol variations
Fragmentation methods
  RNA: magnesium-catalyzed hydrolysis, enzymatic cleavage (RNase III)
  cDNA: sonication, DNase I treatment
Targeted RNA populations
  Poly(A) RNA-seq: positive selection of mRNA (poly(A) selection)
  Total RNA-seq: negative selection by rRNA removal (ribo-depletion), which retains non-poly(A) RNA
  Small RNA-seq: size selection (e.g. between 17 nt and 35 nt), e.g. for miRNA profiling

[Figure: ribo-depletion vs poly(A) selection]

Protocol variations in library construction

Stranded vs unstranded RNA-seq
  Unstranded
    No information regarding the strand of the gene producing the fragment.
    Ambiguous reads should be discarded.
  Stranded (should be preferred)
    The strand of the gene producing the fragment can be inferred from the alignment.
    No ambiguity. Better estimation of gene expression level.
    Better reconstruction of transcript models.

Unstranded library construction

(1) RNA fragments
(2) Reverse transcription and RNA degradation
(3) Second strand synthesis (standard dNTPs; no dUTP marking in the unstranded protocol)
(4) Ligation of adapters
(5) Amplification: green and red fragments are both amplified and sequenced
(6) Sequencing: fragment ends are complementary
(7) Results: each RNA fragment may produce two types of sequences, corresponding to the RNA sequence and its reverse complement (as two different colonies)

Stranded library construction

(1) RNA fragments
(2) Reverse transcription and RNA degradation
(3) Second strand synthesis with dUTP
(4) Ligation of adapters
(5) Amplification: dUTP-containing fragments are not amplified, so only green fragments are sequenced
(6) Sequencing: fragment ends are complementary
(7) Results: each colony produces only one type of sequence, corresponding to the 5' or 3' end depending on the kit

Example of stranded single-end RNA-seq alignment

[Figure: reads colored by strand, Forward (red) and Reverse (blue), over Cd3g (strand -) and Cd3d (strand +)]

Example of unstranded single-end RNA-seq alignment

[Figure: reads colored by strand, Forward (red) and Reverse (blue)]

Stranded RNA-Seq result

[Figure: number of reads on the + (Watson) and - (Crick) strands, shown with the transcript models]

Stranded RNA-seq makes it possible to extract the signal produced from both strands.

Sequencing variation: single-end vs paired-end

Paired-end sequencing: sequence both ends of a fragment
  Facilitates alignment
  Facilitates gene fusion detection
  Better for reconstructing transcript models from RNA-seq

RNA-seq library preparation: PE vs SE
Paired-end vs Single-end
  Better reconstruction of transcripts with paired-end
  Paired-end is more expensive
  PE should be preferred

[Figure: three exons E1, E2, E3]
  SE: E1 and E2 may be connected in the encountered transcripts
  PE: E1 and E2 are connected in 3 encountered fragments

Bioinformatic workflow: overview

Raw Data (fastq) FastQC (html)

Trimming (fastq) FastQC (html)

Transcript Discovery Mapping (bam) Visualization (bigwig)

Transcript annotation Quantification

Clustering Differential analysis Func. enrichment

...
[Workflow overview, current step: Raw Data]

Our dataset

Immature mouse T-cells


Resting
Control (DMSO)
DM1, DM2, DM3
Activated (PMA-Ionomycin)
PI1, PI2, PI3
Each sample has been sequenced using a NextSeq 500
  ~50 × 10^6 paired reads per sample
We will focus our analysis on chr18

Loading the dataset

Protocol
1. In the upper left corner, click on Unnamed history and rename this workspace to DM1.
2. Select Shared Data > Data Libraries > P5424 > DM1 > DM1_chr18_R1.fq > Import this dataset into selected history.
3. In the new window, select DM1 as Destination history. Click on Import library dataset.
4. Select Analyze Data in the upper menu. Using the pencil icon, rename the dataset to DM1_R1.
5. Select Shared Data > Data Libraries > P5424 > DM1 > DM1_chr18_R2.fq > Import this dataset into selected history. In the new window, select DM1 as Destination history. Click on Import library dataset.
6. Select Analyze Data in the upper menu. Using the pencil icon, rename the dataset to DM1_R2.
7. Click the eye icon to display the content of the DM1_R1 file.
Q1: How is the quality encoded?
Q2: What can you say about the quality of the first encountered reads?

Name your history (for example: DM1)

In the upper left corner, click on Unnamed history and rename this workspace to DM1.

Get a dataset from a Shared Data Library

Select Shared Data > Data Libraries > P5424 > DM1 > DM1_chr18_R1.fq > Import this dataset into selected history.

Import a file from the shared library to your history

Do this for the two ends of the paired-end reads:
- DM1_chr18-20Mto50M_R1.fq
- DM1_chr18-20Mto50M_R2.fq

Imported files should now appear in DM1 history

Change (simplify) file names in your history

DM1_chr18-20Mto50M_R1.fq -> DM1_R1
DM1_chr18-20Mto50M_R2.fq -> DM1_R2

The raw data are provided in fastq format

@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36


GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################
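
A FASTQ record spans four lines: an identifier starting with `@`, the sequence, a `+` separator, and one quality character per base. The sketch below decodes the record shown above, assuming the Phred+33 encoding (an assumption to check against your own data; older Illumina pipelines used Phred+64). The function name `parse_fastq_record` is illustrative and not part of any library.

```python
RECORD = """@QSEQ32.249996 HWUSI-EAS1691:3:1:17036:13000#0/1 PF=0 length=36
GGGGGTCATCATCATTTGATCTGGGAAAGGCTACTG
+
=.+5:<<<<>AA?0A>;A*A################"""


def parse_fastq_record(text, offset=33):
    """Split one 4-line FASTQ record and decode its quality string.

    offset=33 assumes Phred+33 encoding; use 64 for Phred+64 files.
    """
    header, sequence, separator, quality = text.strip().splitlines()
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    # Each quality character encodes one Phred score as an ASCII code.
    scores = [ord(char) - offset for char in quality]
    return header[1:], sequence, scores


name, sequence, scores = parse_fastq_record(RECORD)
print(name.split()[0], len(sequence))
print(scores[:5], min(scores))
```

Under this assumption the first base decodes to Q28, while the trailing `#` characters decode to Q2, which is why the end of such a read is a candidate for trimming.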

Performing FastQC analysis (raw data)

Protocol
1. Use NGS: QC and manipulation > FastQC:Read QC.
2. Select the first fastq file (DM1_R1) and press Execute.
3. Display the data for the corresponding FastQC result (use the view (eye) icon above the dataset name in the right panel).
4. Carefully inspect all the statistics.
5. Perform the same operation for the DM1_R2 file.

Q1: What do you think of the overall quality of the sequencing?
Q2: Carefully inspect all diagrams. The FastQC documentation contains a section that explains the meaning of each diagram.
Q3: What is the format of quality encoding? You need to know it to perform the next step (read trimming).

Find the FastQC tool and run it on your read files

FastQC result page

[Workflow overview, current step: Trimming]

Trimming

Protocol
1. Search for the Sickle tool using the galaxy search engine (upper left corner).
2. Set Single-End or Paired-End reads to Paired-end (two separate files).
3. Set forward reads to DM1_R1
4. Set reverse reads to DM1_R2
5. Set Quality Threshold to 20, Length Threshold to 25.
6. Execute.
7. Rename Paired-End forward strand output of Sickle to DM1_R1_trim
8. Rename Paired-End reverse strand output of Sickle to DM1_R2_trim
9. Rename Paired-End singleton output of Sickle to DM1_singleton_trim
10. Perform a new FastQC analysis using the trimmed read as input
Q: How many reads do you retrieve after trimming? How does it compare with the input fastq files?
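
To give an intuition for the two parameters set above (Quality Threshold 20, Length Threshold 25), here is a simplified sliding-window trimmer. It is only a sketch of the general idea and not Sickle's actual algorithm, which also examines the 5' end and keeps track of pairs and singletons. The window size of 5 and the function name `trim_3prime` are assumptions made for illustration.

```python
def trim_3prime(sequence, scores, quality_threshold=20, length_threshold=25, window=5):
    """Cut the read where the mean quality of a sliding window first drops
    below quality_threshold; return None if the remainder is too short."""
    cut = len(scores)
    for start in range(len(scores) - window + 1):
        mean_quality = sum(scores[start:start + window]) / window
        if mean_quality < quality_threshold:
            cut = start
            break
    if cut < length_threshold:
        return None  # read discarded
    return sequence[:cut], scores[:cut]


# Toy read with made-up qualities: 30 good bases followed by 10 poor ones.
kept = trim_3prime("A" * 40, [30] * 30 + [2] * 10)
print(len(kept[0]))

# A read whose quality degrades too early is dropped entirely.
print(trim_3prime("A" * 30, [30] * 20 + [2] * 10))
```

In paired-end mode, a pair in which only one mate survives trimming ends up in the singleton output mentioned in step 9.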
Trimming: Sickle input form

[Workflow overview, current step: Mapping]

Alignment: splice-aware aligners
Reads that overlap several exons may not be mapped properly by splice-unaware aligners (e.g. bowtie)

[Figure: genome with exons E1, E2, E3; final transcript E1 E2 E3 AAAAAAA; sequenced fragments]

Splice-aware aligners ?
[Figure: fragments aligned to the genome (exons E1, E2, E3)]

We will obtain spliced reads (gapped alignments)
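
In SAM/BAM files, a spliced (gapped) alignment is recorded in the CIGAR string, where the `N` operation marks a skipped region of the reference, i.e. an intron. The sketch below converts a CIGAR string into the exon blocks covered by a read. The function name `aligned_blocks` and the example coordinates are made up for illustration.

```python
import re

# CIGAR operations that advance along the reference sequence.
REFERENCE_CONSUMING = set("MDN=X")


def aligned_blocks(start, cigar):
    """Return 0-based, half-open reference blocks covered by an alignment.

    Blocks are split at N operations (introns); deletions (D) stay inside
    a block.
    """
    blocks = []
    position = block_start = start
    for length, operation in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if operation == "N":
            if position > block_start:
                blocks.append((block_start, position))
            position += length
            block_start = position
        elif operation in REFERENCE_CONSUMING:
            position += length
    if position > block_start:
        blocks.append((block_start, position))
    return blocks


# Hypothetical read: 20 bases in one exon, a 1000 bp intron, 16 bases in the next exon.
print(aligned_blocks(100, "20M1000N16M"))
# An unspliced read yields a single block.
print(aligned_blocks(100, "36M"))
```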

Example of splice aware aligners

TopHat
  Part of a complete pipeline (the tuxedo pipeline)
  Calls bowtie to perform initial, unspliced alignments
STAR
  Developed in the context of the ENCODE project
  Much faster than TopHat
  Needs ~30 GB of memory for the human/mouse genome
  Based on an uncompressed suffix array index
  Usage is painful
  Compatible with the tuxedo pipeline

Aligned reads
Stranded paired-end sequencing on total RNA (contains immature RNA)
Alignment performed with TopHat

Gene: Il2ra

Getting the sequence of mouse chromosome 18 at UCSC

Most of the time the galaxy server will provide you with an already indexed genome that can be used by
tophat to perform read alignment. In this practical, we would like to restrict the alignment to mouse
chromosome 18 (this will be faster). We thus need to download the sequence of mouse chromosome 18.
This sequence will be provided to tophat in the subsequent steps (tophat will perform sequence indexing
internally by calling bowtie-build).
NB: the chromosome sequence can also be obtained from the Ensembl FTP site.

Protocol
1. In your browser, open the UCSC download site for the mouse genome (assembly mm9):
http://hgdownload.soe.ucsc.edu/goldenPath/mm9/chromosomes/
2. Copy the link address of chr18.fa.gz.
3. Select Tools > Get Data > Upload File.
4. Click Paste / Fetch data.
5. In the text area (URL/Text) paste the link to the chr18 sequence.
6. Select fasta as File Format and mm9 as a reference genome.
7. Press Start. After a few seconds the query is submitted to your Galaxy server and the Status appears in green.
Click Close to quit the Download window.
8. Once the sequence has been fetched from UCSC, rename the record in the history to chr18_mm9.fa.
Gene Transfer Format (GTF)

GTF file format
  Generally used to store connected genomic features
    e.g. exons from transcripts
  Rich format
    contains basic attributes (chrom, start, end, source, ...)
    contains additional (extended) attributes
  Can be viewed as a flat-file database about exons/transcripts/genes

My GTF file format is rich
chr source type start end score strand frame attributes (keys/values)
chr6 refGene exon 80837264 80837341 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80838878 80838946 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80877395 80877528 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80878592 80878747 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80880999 80881107 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80910651 80910748 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80912819 80912929 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80982852 80982938 . + . transcript_id "NM_183050"; gene_id "BCKDHB";

Types can be diverse:
  exon, start_codon, stop_codon, ...
  The Ensembl database adds gene and transcript features to GTF files
    May be problematic with some software
Attributes can be of any type
  Transcript/gene/exon IDs, gene name, transcript version...

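
The nine-column layout can be made concrete with a small parser. The sketch below reads the first exon line of the example above, assuming the columns are tab-separated (as in real GTF files) and that each attribute has the form `key "value";`. The function name `parse_gtf_line` is illustrative.

```python
GTF_COLUMNS = ["chr", "source", "type", "start", "end", "score", "strand", "frame"]


def parse_gtf_line(line):
    """Parse one GTF line into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attributes[key] = value.strip('"')
    record["attributes"] = attributes
    return record


line = "\t".join([
    "chr6", "refGene", "exon", "80837264", "80837341", ".", "+", ".",
    'transcript_id "NM_183050"; gene_id "BCKDHB";',
])
exon = parse_gtf_line(line)
# GTF coordinates are 1-based and inclusive, hence the +1.
exon_length = exon["end"] - exon["start"] + 1
print(exon["attributes"]["gene_id"], exon["type"], exon_length)
```

Grouping such records by `transcript_id` or `gene_id` is what lets downstream tools rebuild transcripts and genes from individual exon lines.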
Getting transcript annotation in GTF format

In order to provide TopHat with the location of known exons in the mouse genome, we will download a file in GTF format (Gene Transfer Format). You can get more information about this format on the UCSC or GENCODE web sites. GTF files can be obtained from either the UCSC table browser or the Ensembl FTP site.
NB: it is very important at this step to ensure that the fasta file and the GTF file are obtained from the same genome release. The chromosome sequences and gene positions vary between releases.

Protocol
1. Select Shared Data > Data Libraries > P5424_chr18-20Mto50M > GTF > chr18_20M-50M_gencode_vM1.gtf > Import this dataset into selected history. In the new window select DM1 as Destination history. Click on Import library dataset.
2. Select Analyze Data in the upper menu.
3. Check the first lines of the GTF file. What kind of information is enclosed in this file?

GTF file content
https://genome.ucsc.edu/FAQ/FAQformat.html#format4

Read mapping with TopHat

Protocol
1. Select NGS: Mapping > Tophat from the toolbox.
2. Set data type to Paired-end (individual datasets).
3. Set RNA-Seq FASTQ file, forward reads to DM1_R1_trim. Set reverse reads to
DM1_R2_trim.
4. Set Use a built in reference genome or one from your history to Use a genome from
history.
5. Set Select the reference genome to chr18_mm9.fa.
6. Set TopHat settings to use to Full parameter list.
a. Set Maximum number of alignments to be allowed to 1.
b. Set Library Type to FR First strand (this is determined by library construction kit).
c. Set Use Own Junctions to yes. Set Use Gene Annotation Model from History.
Set Gene Model Annotations to chr18_20M-50M_gencode_vM1.gtf
7. Press Execute. Rename the accepted_hits dataset to DM1_alignments. Rename the
splice-junction bed file to DM1_splice_junctions.bed.
Mapping with TopHat

Beware: for our study case we need to specify some custom parameters (see next slide)

Setting custom parameters on TopHat

Option for strand-oriented RNA-seq (our study case)

Checking the number of aligned reads

Exercise
We will use samtools flagstat to assess the number of aligned reads available in the bam file.
1. Select Statistics > flagstat.
2. Select the BAM file and press Execute.
Q1: look carefully at the statistics. What is the meaning of each record ?
NB: "Properly paired" means both mates of a read pair map to the same chromosome, oriented in
Forward-Reverse orientation and having an insert size compatible with fragment size.
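
The categories reported by flagstat are derived from the bitwise FLAG field of each alignment record. The sketch below decodes a FLAG value using the bit definitions of the SAM specification; the function name `decode_flag` is illustrative.

```python
# Bit definitions from the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "QC failed",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties encoded in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
```

A FLAG of 99 therefore describes the first read of a properly paired fragment, mapped on the forward strand with its mate on the reverse strand.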
[Workflow overview, current step: Visualization]

Loading TopHat results with the Integrative Genomics Viewer (IGV)

Protocol
1. In Galaxy, select the TopHat accepted hits and download the dataset (bam) and the bam index (bai).
2. Select DM1_splice_junctions.bed and download the bed file. If this file does not have a .bed extension, rename it to add .bed.
3. Open the Integrative Genomics Viewer (IGV) download page:
https://software.broadinstitute.org/software/igv/download
4. Download and open IGV, or launch it with 750 MB or 1.2 GB of memory depending on your machine.
5. In IGV, load the mm9 genome.
6. Select the menu function File -> Load from file, locate the download folder on your computer, select the bam file, and click Open.
7. Note: you do not need to load the bam index (bai) file because it will be loaded automatically with the bam file of the same name.
8. In the same way, load the splice junction file (bed format).

Download the aligned reads (bam, bai) and splice junctions
  Splice junctions (bed)
  Aligned reads (bam)
  Bam index (bai)

Load the mouse genome without its sequence in IGV

Do not activate this option!

Load your TopHat results in IGV

Load
  the bam file but not the bai (it will be loaded automatically with the bam)
  the splice junction bed file.

Loaded TopHat results

Viewing the results with the Integrative Genomics Viewer (IGV)

Protocol
1. Select mm9 as a genome and browse to chromosome 18. Note: the download sequence option should be inactive.
2. Go to the Egr1 gene (by typing Egr1 in the GO text area). Zoom to view alignments.
3. In the left panel, right-click on the bam track and select View as pairs.
4. In the left panel, right-click on the bam track and set Color Alignments by > Read strand.
5. Load DM1_splice_junctions.bed into IGV (File > Load from file).
6. Unzoom to view the number of alignments supporting exon junctions.
7. Hover over a junction on the DM1_splice_junctions.bed track. What does the Depth value mean?

Select a chromosome (here, chr18)

Select a gene (e.g. Egr1)

Select viewing options

1. Right-click on the bam track, select View as pairs.
2. Select Group alignments by -> read strand

Viewing results with IGV

Exercise
1. What is the strand of the Egr1 gene?
2. Regarding reads, what do the blue and pink colors indicate?
3. Hover over a paired read. What are the meanings of the following tags/keys:
a. CIGAR? Mapped? Mapping quality? Secondary? Duplicate? Mate is mapped? Insert size? Pair orientation? First in pair? Second in pair?
4. What is the meaning of:
a. NH? NM?
5. Hover over several paired alignments on Egr1. What are the values of the pair-orientation keys?

Viewing results with IGV

Exercise
1. Go to the internal exons of Etf1 (this gene is located just 40 kb away on the 3' side of Egr1).
a. What is the strand of Etf1? What are the values of the pair-orientation key on paired alignments?
b. Look at additional gene examples.
2. What can you conclude regarding paired alignment values?
a. How would you isolate the signal emitted from the plus and minus strands?
3. Looking at Nr3c1, you will find some signal extending from the 5' region.
a. Is it produced by the plus or minus strand?

BAM files are fat

BAM files are fat because they contain exhaustive information about read alignments.
  Memory issues (only a fraction of the BAM can be visualized).
Need a more lightweight file format containing only genomic coverage information:
  Wig (not compressed, not indexed)
  TDF (compressed, indexed). IGV specific.
  BigWig (compressed, indexed).

Coverage files (wig, bigwig, tdf)

Window:    wi   wi+1   wi+2   wi+3
Coverage:  4    7      3      1

A lightweight format
  Especially when compressed
Fast access
  When indexed

Example (wiggle, fixedStep):
fixedStep chrom=chr1 start=100 step=50 span=50
4
7
3
1
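
The fixedStep layout shown on this slide can be produced from a list of per-window coverage values. This is a minimal sketch: the function name `to_fixed_step_wig` is illustrative, and it assumes that windows are contiguous so that `span` equals `step`.

```python
def to_fixed_step_wig(chrom, start, step, values):
    """Format per-window coverage values as a fixedStep wiggle block.

    Window i covers positions start + i*step to start + (i+1)*step - 1.
    """
    header = f"fixedStep chrom={chrom} start={start} step={step} span={step}"
    return "\n".join([header] + [str(value) for value in values])


wig_text = to_fixed_step_wig("chr1", 100, 50, [4, 7, 3, 1])
print(wig_text)
```

Only one number per window is stored, which is why coverage files are so much lighter than the BAM file they summarize.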

Creating a bigwig file - (1) getting chromosome sizes

Several programs need to know the chromosome lengths to perform dedicated tasks.
Chromosome information can be obtained from UCSC, whose table browser is interfaced in Galaxy.

Protocol
1. in Galaxy, use Get Data > UCSC Main table browser.
2. Set : Clade to Mammal, Genome to Mouse, assembly to July 2007 (NCBI37/mm9), group to All tables,
database to mm9 and table to chromInfo.
3. Set output format to all fields from selected table and Send output to Galaxy.
4. Click get output. In the result web page press Send query to galaxy.
5. Rename the dataset to mm9_chrom_info_txt.
Q1: What does this file contain? Check the information in each column of the result.
Q2: Remove from the text file the column header line (starting with a #) and the random chromosomes.
Q3: Use the Galaxy tools Text Manipulation > Cut and Statistics > Summary Statistics > Column or
expression > select C2 to compute the median size of a mouse chromosome.
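
The same clean-up and summary can be sketched outside Galaxy. The example below assumes the three-column chromInfo layout (chrom, size, fileName) and uses made-up chromosome names and sizes, so the printed median is not a real mouse value. The function name `chromosome_sizes` is illustrative.

```python
import statistics

# Toy extract in chromInfo layout; names, sizes and paths are made up.
CHROM_INFO = (
    "#chrom\tsize\tfileName\n"
    "chrA\t300\t/path/a\n"
    "chrB\t100\t/path/b\n"
    "chrB_random\t5\t/path/b_random\n"
    "chrC\t200\t/path/c\n"
)


def chromosome_sizes(text):
    """Return {chromosome: size}, skipping the header and random chromosomes."""
    sizes = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # header line
        chrom, size = line.split("\t")[:2]
        if "random" in chrom:
            continue  # unplaced "random" sequences
        sizes[chrom] = int(size)
    return sizes


sizes = chromosome_sizes(CHROM_INFO)
print(statistics.median(sizes.values()))
```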
Getting gene coordinates from UCSC table browser

Remove the header line

Remove the random chromosomes

Creating a bigwig track (2) - create a wiggle

Protocol
1. Use the BAM to Wiggle tool to convert the BAM file to a wiggle format (the uncompressed and
unindexed version of the BigWig format).
2. Select the tophat accepted hits as input file.
3. Set Chromosome size file to mm9_chrom_info_txt.
4. Set Strand-specific to Paired-end RNA-Seq.
5. Set Pair-End Read Type to read1 (positive > negative; negative > positive), read2 (positive
> positive; negative > negative).
6. Press Execute.
7. Rename the output tracks to DM1_Fwd_wig and DM1_Rev_wig
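
The rule selected in step 5 can be written as a small function that maps the SAM FLAG of a read to the strand of the transcript it comes from. This sketch applies only to paired-end reads from a library of the type set above (FR first strand); the function name `transcript_strand` is illustrative.

```python
REVERSE = 0x10       # read aligned to the reverse (negative) strand
FIRST_IN_PAIR = 0x40 # read1


def transcript_strand(flag):
    """Infer the transcript strand of a paired-end read (first-strand library)."""
    is_reverse = bool(flag & REVERSE)
    if flag & FIRST_IN_PAIR:
        # read1: positive > negative; negative > positive
        return "+" if is_reverse else "-"
    # read2: positive > positive; negative > negative
    return "-" if is_reverse else "+"


# FLAG values of the two mates of two properly paired fragments.
for flag in (99, 147, 83, 163):
    print(flag, transcript_strand(flag))
```

Both mates of a fragment are assigned to the same transcript strand, which is what allows the forward and reverse coverage tracks to be separated.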
Convert bam to wiggle

Creating a bigwig track (3)

Protocol
For each one of the two wiggle files:
  Select Convert Formats > Wig/BedGraph-to-bigWig.
  Click on Execute.
Rename the output obtained from Wiggle on Forward Reads to DM1_Fwd_bigwig.
Rename the output obtained from Wiggle on Reverse Reads to DM1_Rev_bigwig.
Download the resulting bigwig files (click on the eye icon) and load them into IGV.
In IGV, on the left panel, right-click on the bigwig track name. Use Set data range and set the min, mid and max values to -200, 0 and 200, respectively.
Unzoom.

[Workflow overview, current step: Transcript discovery]

Searching for novel transcript models

RNA-seq may be used to discover novel transcripts
Several software tools:
  Cufflinks, MATS, MISO
  Cufflinks is the most popular
    Performs much better with stranded RNA-seq
    Analyses read overlaps to infer transcript structure

[Figure: fragments aligned to the genome (exons E1, E2, E3)]

Searching for novel transcript models: cufflinks

[Figure: read pairs and gapped alignments]

Searching for novel transcripts with cufflinks

Protocol
1. In the toolbox, select NGS: RNA Analysis > Cufflinks.
2. Select the Tophat accepted hits file as SAM or BAM file of aligned RNA-Seq reads.
3. Set Use Reference Annotation to Use reference annotation as guide.
4. Set Reference Annotation to chr18_20M-50M_gencode_vM1.gtf.
5. Set Set advanced Cufflinks options to Yes.
6. Set Library prep used for input read to fr-firststrand.
7. Press Execute.
8. Rename assembled transcript dataset to DM1_cufflinks_transcripts.
Q1: Have a look at the assembled transcripts file produced by cufflinks. What are the attributes
provided?
Q2: Move downwards in the assembly table and check the gene_id and transcript_id attributes.
Why do some of them start with ENS and others with CUFF?
Q3: Download the assembled transcript file produced by cufflinks and load it into IGV. What
can we say about the transcripts produced by the Pura gene?
Cufflinks parameters

[Workflow overview, current step: Extracting a workflow]

Extracting a workflow

Protocol
1. In the history menu, select history options.
2. Click on Extract workflow.
3. Set the name of the new workflow to RNA-Seq mapping and transcript discovery. Leave all parameters unchanged and click Create workflow.
4. Using the menu, go to workflow > RNA-Seq mapping and transcript discovery > edit.
5. Move the boxes in order to optimize the readability of the workflow.
6. Rename the input elements to Read_1, Read_2, CHROM_SIZE and GTF according to their connections.
Importing data folders to submit to the workflow

Protocol
1. Create a new history: History > Create new and rename it PI1.
2. Select Shared Data > Data Libraries > P5424_chr18-20Mto50M
3. Check the box beside the genome_data folder.
This folder contains 3 files required as input for the workflow:
a. the sequence of mouse chromosome 18: chr18_mm9_fa
b. annotations for the selected region of chromosome 18: chr18_20M-50M_gencode_vM1.gtf
c. mouse chromosome sizes: mm9_chr_size.txt.
4. Check the
5. Check the PI1 folder, which contains the two fastq files with the reads R1 and
R2 of the PI1 sample.
6. Click To History and import it in the PI1 history. This will import all the files
contained in the two selected folders
7. Click on Analyze Data to go back to your history (PI1). You should see the five
datasets.
Running the workflow on another sample

Protocol
1. In the top menu, select workflow > RNA-Seq mapping and transcript discovery > edit.
2. Have a look at your new workflow. Check the input files and figure out which of the files
imported in the previous slide should be used where.
3. Select workflow > RNA-Seq mapping and transcript discovery > run. Set the proper
input files.
4. Click Run workflow at the bottom of the page.
5. Rename the bigwig files PI1_Fwd_cov.bigwig and PI1_Rev_cov.bigwig, respectively.
6. Rename assembled transcript dataset to PI1_cufflinks_transcripts.
7. Rename the accepted_hits dataset to PI1_tophat_alignments.
8. Rename the splice-junction bed file to PI1_tophat_splice_junctions.bed.
9. Save these 5 results on your computer, and load them in IGV.
Q1: Go to the Egr1 gene. What can you see?
Creating a workspace to compare two samples

Protocol
1. Create a new history entitled DM versus PI.
2. Import the following files from
Shared Libraries > P5424_chr18-20Mto50M > genome_data
a. The gtf file (chr18_20M-50M_gencode_vM1.gtf).
b. The chromosome sizes (mm9_chr_size.txt).
3. Click Analyse data and use History > Dataset actions > Copy datasets to copy
the following datasets in this history.
a. The assembled transcripts from cufflinks, which should be renamed
PI1_cufflinks_transcripts and DM1_cufflinks_transcripts, resp.
b. The tophat accepted hits (that should have been renamed PI1_alignments and
DM1_alignments) for all samples.
[Workflow overview, current step: Transcript annotation]

Merging the reference and inferred genomic annotations
We now have at least three different GTF files (depending on whether you have processed DM2, DM3, PI2, PI3):
  The reference annotation
  The discovered transcripts in the control sample(s).
  The discovered transcripts in the activated sample(s).
We will ask cuffmerge to merge the novel annotations (obtained through cufflinks) with the reference (known annotation) and to classify the transcripts. It will annotate transcripts by producing a GTF file containing flags. Some of these flags may indicate that:
  The transcript is unknown (class code u).
  The transcript is a novel isoform of a known transcript (class code j).
  The transcript is the same as the original/known transcript (class code =).

For a full description of all possible flags (class codes), please refer to the cuffmerge web site (section Transfrag class codes).
Here we will concentrate on retrieving the position of novel transcripts.

Combining and comparing transcripts with cuffmerge

Protocol
1. Come back to Analyse Data, and type cuffmerge in the tool search box.
2. In the option GTF file(s) produced by Cufflinks, select the assembled transcript files (DM1_cufflinks_transcripts and PI1_cufflinks_transcripts).
3. Set Use Reference Annotation to Yes.
4. Set Reference Annotation to chr18_20M-50M_gencode_vM1.gtf.
5. Press Execute.
6. Rename the result cuffmerge_transcripts_PI_and_DM.

Selecting unknown transcripts discovered by cuffmerge

Protocol
1. Use Filter and sort > Select lines that match an expression.
2. In cuffmerge_transcripts_PI_and_DM select lines matching the pattern class_code "u".
Note: the letter "u" stands for unknown genes (not present in the reference annotations).
3. Rename the result unknown_genes_PI_and_DM
4. Merge the novel gene annotations with the reference annotations
(chr18_20M-50M_gencode_vM1.gtf) using Text Manipulation > Concatenate datasets
(you will need to click Insert datasets to specify the second dataset).
5. Rename the file to enhanced_annotations.gtf.
Q1: How many transcripts were classified as unknown?
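
The selection performed in step 2 amounts to a pattern match on the attribute column. The sketch below reproduces it on three hypothetical GTF lines: the coordinates and identifiers are made up, and only the `class_code` attribute matters here. The function name `select_class` is illustrative.

```python
import re

CLASS_CODE = re.compile(r'class_code "([^"]+)"')


def select_class(gtf_lines, code="u"):
    """Keep the GTF lines whose class_code attribute equals `code`."""
    selected = []
    for line in gtf_lines:
        match = CLASS_CODE.search(line)
        if match and match.group(1) == code:
            selected.append(line)
    return selected


# Hypothetical merged annotation (made-up coordinates and identifiers).
lines = [
    'chr18\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; class_code "u";',
    'chr18\tCufflinks\texon\t500\t600\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; class_code "j";',
    'chr18\tCufflinks\texon\t900\t950\t.\t+\t.\tgene_id "G3"; transcript_id "T3"; class_code "=";',
]
unknown = select_class(lines)
print(len(unknown))
```

Note that the result contains one line per exon, so counting transcripts (Q1) requires counting distinct `transcript_id` values rather than lines.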
Merging reference and novel annotations

[Workflow overview, current step: Quantification]

Quantification

Estimate the expression level of each gene
  Counting the number of reads overlapping each gene model.
Several programs have been developed for this task
  Cuffdiff, featureCounts, HTSeq-count, ...
featureCounts is a lightweight read counting program written entirely in the C programming language.
  It has a variety of advanced parameters
  Outstanding performance (a 10 GB SE BAM file takes about 7 minutes on a single average CPU).
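
The principle of counting can be illustrated with a deliberately naive sketch: a fragment is assigned to a gene when it overlaps the exons of exactly one gene. This is not how featureCounts is implemented (it ignores strand, mapping quality and efficiency); the gene models, fragments, category labels and function name below are all made up for illustration.

```python
from collections import Counter


def count_fragments(exons_by_gene, fragments):
    """Count fragments per gene; intervals are (chrom, start, end), inclusive."""
    counts = Counter({gene: 0 for gene in exons_by_gene})
    for chrom, start, end in fragments:
        hits = {
            gene
            for gene, exons in exons_by_gene.items()
            for exon_chrom, exon_start, exon_end in exons
            if exon_chrom == chrom and start <= exon_end and end >= exon_start
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif hits:
            counts["__ambiguous"] += 1  # overlaps several genes
        else:
            counts["__no_feature"] += 1  # overlaps no exon
    return counts


# Toy gene models and fragments.
exons_by_gene = {
    "geneA": [("chr18", 100, 200), ("chr18", 300, 400)],
    "geneB": [("chr18", 350, 500)],
}
fragments = [
    ("chr18", 150, 180),  # geneA
    ("chr18", 190, 310),  # spans two geneA exons
    ("chr18", 360, 390),  # geneA and geneB
    ("chr18", 600, 650),  # no exon
]
counts = count_fragments(exons_by_gene, fragments)
print(dict(counts))
```

Summing over all exons that share a `gene_id` is what produces one expression value per gene.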


Counting reads per gene

Protocol
We assume that you have aligned the reads for samples DM1 (control) and PI1 (activated cells).

1. Copy the TopHat bam files to the history DM versus PI.
2. Select NGS: RNA Analysis > featureCounts.
3. Select the bam files in Alignment file.
4. Set GFF/GTF Source to Use reference from history.
5. Select enhanced_annotations.gtf as Gene annotation file.
6. Set featureCounts parameters to extended settings.
7. Set GFF feature type filter to exon (we want to count inside exonic regions).
8. Set GFF gene identifier to gene_id (all exons of a given gene will be summed up to get the expression value).
9. Set Strand specific protocol to Stranded (reverse)
10. Set Minimum read quality to 12.
11. Select PE Count fragments instead of reads to Yes (we are using paired-ends data).
12. Click Execute.
13. Rename the output files to PI_vs_DM_count_table and PI_vs_DM_count_summary, respectively.
14. Check the Summary file.

Q1: What are Unassigned_MultiMapping, Unassigned_NoFeatures, Unassigned_MappingQuality and Unassigned_Chimera?

[Workflow overview, current step: Differential analysis]

Protocol
NB: Differential analysis will be done on the full expression matrix provided as shared library.
1. Create a new history named Differential_expression.
2. Go to Shared Data > Data Libraries > P5424_chr18-20Mto50M > COUNTS.
3. Import DM_vs_PI_gene_counts.txt into the Differential_expression history.
4. Click Analyse data and select Differential_Count from the toolbox.
5. Set Title for job outputs to P5424_PI_vs_DM.
6. Set Treatment Name to PMA_Ionomycine.
7. Select columns containing treatment to PI1, PI2, PI3.
8. Set control name to DMSO.
9. Select columns containing control to DM1, DM2, DM3.
10. Set Run this model using edgeR to Do not run edgeR.
11. Set Do not run DESeq2 to Run DESeq2.
12. Set Run Voom to Do not run Voom.
13. Set FDR (Type II error) control method to fdr.
14. Click Execute.
Q1: What is the first plot?
Q2: What can you say from the second plot?
Q3: Look at the produced html file. What can you guess from the heatmap?

Extracting significant genes

Protocol
1. Select the tool Filter data on any column using simple expressions.
2. Use DifferentialCounts_topTable_DESeq2.xls as a dataset.
3. Set With following condition to abs(c3) > 1 and c7 < 0.01 and Execute.
4. Rename the output DESeq2_DEG_logFC1_padj0.01
5. With the tool Cut columns from a table, select the first column of the DESeq2_DEG_logFC1_padj0.01 table, in order to obtain the list of gene names.
6. Rename the output DESeq2_DEG_logFC1_padj0.01_genes

Q1: How many genes were analysed in total?
Q2: How many genes are declared significant?
Q3: Among the genes declared significant, how many false positives should we expect?
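
The filter expression of step 3 can be reproduced in a few lines. The sketch below assumes, from the name given to the output, that column 3 holds the log fold change and column 7 the adjusted p-value; check the header of your own table. The rows are made up and the function name `filter_rows` is illustrative.

```python
def filter_rows(rows, min_abs_logfc=1.0, max_padj=0.01):
    """Keep rows where abs(c3) > min_abs_logfc and c7 < max_padj.

    Columns are 1-based as in Galaxy, hence indices 2 and 6. Rows with
    non-numeric values (e.g. NA) must be removed beforehand.
    """
    return [
        row for row in rows
        if abs(float(row[2])) > min_abs_logfc and float(row[6]) < max_padj
    ]


# Made-up rows: gene, c2, logFC, c4, c5, c6, padj
rows = [
    ["geneA", "0", "2.5", "0", "0", "0", "0.001"],
    ["geneB", "0", "-1.8", "0", "0", "0", "0.005"],
    ["geneC", "0", "0.4", "0", "0", "0", "0.0001"],
    ["geneD", "0", "3.0", "0", "0", "0", "0.2"],
]
selected = filter_rows(rows)
gene_names = [row[0] for row in selected]
print(gene_names)

# With an FDR threshold of 0.01, roughly 1% of the selected genes
# are expected to be false positives.
expected_false_positives = 0.01 * len(selected)
print(expected_false_positives)
```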

Biological interpretation

An example list.
PubMed query for all of them?

What is the biological meaning of a gene list?

Example: the list of genes upregulated in tumors compared to their normal counterpart.
  Is there any hidden biological meaning?
Solution: compare this list to known lists, e.g.:
  Genes involved in cell cycle, apoptosis, T-cell activation
  Genes involved in chemotaxis
  Genes whose products are located in mitochondria
  Genes involved in a given pathway
  Predicted targets of miRNAs or transcription factors
  Genes located on a given chromosome
  Genes known to be associated with mutations in a given tumor type
  Genes known or predicted to be regulated by a given transcription factor

Is my list enriched in genes whose function is known?
N genes in total
m genes known to be associated with a term/function T
n genes not associated with the term/function T
k selected genes (e.g. upregulated in the tumor compared to normal counterpart)
x genes associated with term/function T among the k selected genes

           Term         !Term        Total
List       x            k-x          k
!List      m-x          n-(k-x)      N-k
Total      m (white)    n (black)    N

What is the probability of observing x genes associated with term/function T among the k selected genes?
X follows a hypergeometric distribution
Hypergeometric test / Fisher exact test
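
The enrichment p-value of this slide can be computed directly from the hypergeometric distribution. The following is a minimal sketch using the slide's notation (x, m, n, k); the function name `hypergeom_pvalue` and the numbers in the usage line are hypothetical, chosen only for illustration.

```python
from math import comb


def hypergeom_pvalue(x, m, n, k):
    """P(X >= x): probability of observing at least x genes associated with
    the term when k genes are drawn from m associated ('white') genes and
    n non-associated ('black') genes."""
    total = comb(m + n, k)
    upper = min(m, k)
    favourable = sum(comb(m, i) * comb(n, k - i) for i in range(x, upper + 1))
    return favourable / total


# Hypothetical numbers: 20,000 genes, 225 of them annotated with the term,
# 100 selected genes, 10 of which carry the annotation.
print(hypergeom_pvalue(x=10, m=225, n=20000 - 225, k=100))
```

The one-sided upper tail is used because the question is about enrichment (more annotated genes than expected by chance); this corresponds to a one-sided Fisher exact test on the contingency table.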
95
Where are these lists coming from ?

Pathways: KEGG pathways, Reactome, BioCarta, GenMAPP...


Gene Ontology
Ontology: definition of types, properties and relationships between
entities using a controlled vocabulary
The GO (http://geneontology.org/) defines concepts/classes used to describe gene/product
function, and relationships between these concepts. It classifies functions along three
aspects:
molecular function
molecular activities of gene products
cellular component
where gene products are active
biological process
pathways and larger processes made up of the activities of multiple gene products.
96
Example GO term: T cell activation (GO:0042098)
225 human genes are annotated with GO term GO:0042098.
E.g.: IL27, IRF1, CD28, CD1D, CD5, CD6, CD4, CD8, LCK, ZAP70...

97
gProfiler - A web server for functional interpretation

98
http://biit.cs.ut.ee/gprofiler/
Getting the list of up/down-regulated genes

Exercise
1. Using Galaxy, extract the lists of up- and down-regulated genes into two separate datasets.
2. Use these lists as two separate inputs for gProfiler (http://biit.cs.ut.ee/gprofiler/).
3. Which functional terms appear significant with the hypergeometric test?
a. For the up-regulated genes?
b. For the down-regulated genes?
Thank you

100
Unstranded RNA-seq library limitations

[Figure: exons Ea1, Ea2 and Ea3 on the + (Watson) strand and exon Eb1 on the - (Crick) strand of the same locus.
Unstranded library: reads are either ambiguous or non-ambiguous; ambiguous reads should be discarded from counting.
Stranded library: all reads are non-ambiguous.]
101
Quantification
Objective
Count the number of reads or fragments (PE) that fall in each gene
featureCounts, HTSeq-count,...
The output is a count matrix (or expression matrix)
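
The principle of read counting can be sketched with simple interval overlaps. This is only an illustration under strong assumptions: genes are reduced to a single interval (no exon structure, no strand), and the annotation, reads and function name `count_matrix` are hypothetical. Tools such as featureCounts or HTSeq-count handle many more cases.

```python
from collections import Counter

# Hypothetical gene annotation: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical aligned reads: (sample, chromosome, start, end).
READS = [
    ("s1", "chr1", 120, 170),
    ("s1", "chr1", 450, 500),
    ("s1", "chr1", 900, 950),
    ("s2", "chr1", 130, 180),
    ("s2", "chr1", 600, 650),  # intergenic read: not counted
]


def count_matrix(reads, genes):
    """Count reads per (gene, sample); a read is counted only if it
    overlaps exactly one gene."""
    counts = Counter()
    for sample, chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:  # discard ambiguous and unassigned reads
            counts[(hits[0], sample)] += 1
    return counts


matrix = count_matrix(READS, GENES)
for gene in GENES:
    print(gene, [matrix[(gene, sample)] for sample in ("s1", "s2")])
```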

102
Quantification
Quantification is most generally performed
at the gene level.
Some specialized software may provide
you with transcript abundance
estimations.
Cufflinks (tuxedo pipeline)
Kallisto
Known issues
Positive association between gene
counts and length.
May be problematic for
gene-wise comparisons.
Suggests higher expression of
longer genes.
Unstranded data may lead to
ambiguous reads that should be
discarded.
103
Intersample normalization: library size
Inter-sample normalization is a prerequisite for differential expression analysis.
This normalization is mostly applied because of some imbalance in read counts
between samples.
Example
Sample 1 has 2 times more reads than sample 2 (24 vs 12)
Gene expression will be overestimated in sample 1 although its expression is unchanged.
A basic normalisation factor could be the library size (total number of reads).
However, this might lead to biases (see next slides).

[Figure: reads from gene g in sample 1 and sample 2.
Number of reads for gene g: 4 in sample 1, 2 in sample 2.
Library size normalization: scaling factor = 24/12 = 2]
104
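
The example of this slide can be written as a short computation. This is a minimal sketch using the slide's numbers (library sizes of 24 and 12 reads, 4 and 2 reads for gene g); the function name `scale_to_reference` is hypothetical.

```python
# Numbers taken from the slide.
lib_sizes = {"sample1": 24, "sample2": 12}
reads_g = {"sample1": 4, "sample2": 2}


def scale_to_reference(counts, lib_sizes, reference):
    """Rescale each count to the library size of the reference sample."""
    return {
        sample: count * lib_sizes[reference] / lib_sizes[sample]
        for sample, count in counts.items()
    }


normalized = scale_to_reference(reads_g, lib_sizes, reference="sample2")
print(normalized)
```

After scaling, gene g has the same value in both samples, which reflects that its expression is unchanged.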


Inter-sample normalization: limits of library size
If a large number of genes are highly expressed in one experimental condition, the
expression of the remaining genes will artefactually appear as decreased.
This can skew the differential expression analysis towards one
experimental condition.

[Figure: ratio (sample2/sample1) for each gene. Gene G5 stands apart from the other genes, which all show a ratio of 0.5.]
105
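
The bias described on this slide can be reproduced with toy numbers. In the sketch below, all counts are hypothetical: genes G1 to G4 are assumed to be truly unchanged, G5 is strongly induced in sample 2, and both samples are sequenced to the same depth.

```python
# Hypothetical counts: same sequencing depth (500 reads) in both samples.
sample1 = {"G1": 100, "G2": 100, "G3": 100, "G4": 100, "G5": 100}
sample2 = {"G1": 50, "G2": 50, "G3": 50, "G4": 50, "G5": 300}


def fraction_of_library(counts):
    """Library size normalization: divide each count by the total count."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


f1 = fraction_of_library(sample1)
f2 = fraction_of_library(sample2)
ratios = {gene: f2[gene] / f1[gene] for gene in sample1}
print(ratios)
```

Because G5 takes a larger share of the reads in sample 2, the unchanged genes G1 to G4 obtain a ratio of 0.5 after library size normalization and wrongly appear as repressed.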
TMM Normalization (Robinson and Oshlack, 2010)
Trimmed Mean of M values
Outline
Compute the M values (log ratios).
Take the trimmed mean of the M values as scaling factor.
Multiply read counts by the scaling factor (the factors multiply to one).
If there are more than two columns
The library whose 3rd quartile is closest to the mean 3rd quartile is used as reference.
Very similar to RLE
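
The outline above can be illustrated with a deliberately simplified sketch. It is not the edgeR implementation: the real TMM method also trims on absolute expression (A values) and uses a weighted mean, both of which are omitted here. The counts and the function name `tmm_factor` are hypothetical, and the factor is applied to the library size, as edgeR does with its effective library sizes.

```python
from math import log2

# Hypothetical counts: G1-G4 truly unchanged, G5 strongly induced in the test sample.
ref = {"G1": 100, "G2": 100, "G3": 100, "G4": 100, "G5": 100}
test = {"G1": 50, "G2": 50, "G3": 50, "G4": 50, "G5": 300}


def tmm_factor(test, ref, trim=0.3):
    """Simplified TMM: unweighted trimmed mean of the M values (log2 ratios
    of library-size-scaled counts), converted back to a scaling factor."""
    n_test, n_ref = sum(test.values()), sum(ref.values())
    m_values = sorted(
        log2((test[g] / n_test) / (ref[g] / n_ref))
        for g in ref
        if test[g] > 0 and ref[g] > 0  # the log ratio is undefined for zero counts
    )
    cut = int(len(m_values) * trim)  # number of values trimmed at each end
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


factor = tmm_factor(test, ref)
effective_size = sum(test.values()) * factor
normalized_ratio = {
    g: (test[g] / effective_size) / (ref[g] / sum(ref.values())) for g in ref
}
print(factor, normalized_ratio)
```

With these toy counts, trimming removes the extreme M value of G5, so the unchanged genes recover a ratio of 1 and only G5 appears as induced.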

106
Intra-sample normalization

Here the objective is to compare the expression levels of genes within the same
sample
Counts?
Problem with long transcripts
They produce many fragments
They will appear artifactually highly expressed compared to others
Proposed method
RPKM
Reads per kilobase per million mapped reads (SE)
FPKM
Fragments per kilobase per million mapped reads (PE)

107
RPKM/FPKM normalization
A 2 kb transcript with 3000 alignments in a sample of 10 million mappable
reads:
RPKM = 3000 / (2 * 10) = 150
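
The same computation as a small function, a sketch in which the function name `rpkm` and its parameter names are chosen for illustration:

```python
def rpkm(read_count, transcript_length_bp, mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Example of the slide: 2 kb transcript, 3000 alignments, 10 million mappable reads.
value = rpkm(read_count=3000, transcript_length_bp=2000, mapped_reads=10_000_000)
print(value)
```

For paired-end data, the same formula applied to fragment counts gives the FPKM.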

108
Differential expression analysis

Use statistical tests (e.g. based on a negative binomial model) to find differentially expressed genes
Biological replicates preferred/needed (not technical replicates)
Tools:
edgeR, DESeq2
The list of differentially expressed genes may be used for
subsequent analysis.

109
Ontologies for almost everything !

https://www.bioontology.org/
Bioportal at http://bioportal.bioontology.org/

110
Network analysis through data mining

Mine various databases in search of meaningful connections between genes/products
Interactome analysis
Known or predicted protein-protein interactions
Several databases: IntAct, BioGRID, MINT
Yeast-two-hybrid
Literature
Co-expression analysis
E.g microarray data or RNA-Seq data
http://coxpresdb.jp/
Text-mining
??
Combined analysis
STRING
Reactome
GeneMANIA

111
GeneMania
(http://genemania.org/)

112
Yet other applications of RNA-seq

Fusion transcript analysis
Are there any fusion transcripts specific to my tumors?
Isoform- or exon-level differential analysis
Allele-specific expression
Preferential expression of one of the two alleles in a diploid genome
The allele-specific expression of a gene is attributed to a distinct
epigenetic status of its two parental alleles
Short RNA-seq (miRNA)
Single cell analysis
C1 (Fluidigm)
10X Genomics
113
Sequence Read Archive (SRA)

The SRA archives high-throughput sequencing data associated with RNA-seq, ChIP-seq, and epigenomic datasets submitted to GEO.
114
SRA growth

115
Thank you

116
TopHat pipeline
RNA-seq reads are mapped against the whole reference genome (Bowtie).
TopHat allows Bowtie to report more than one alignment for a read (default=10),
and suppresses all alignments for reads that have more than this number.
Reads that do not map are set aside (initially unmapped reads, or IUM reads).
TopHat then assembles the mapped reads using the assembly module in Maq.
An initial consensus of mapped regions (islands) is computed.
The ends of exons in the pseudo-consensus will initially be covered by few reads
(most reads covering the ends of exons will also span splice junctions).
TopHat adds a small amount of flanking sequence to each island
(default=45 bp).

117
TopHat pipeline
Weakly expressed genes should be poorly covered
Exons may have gaps
To map reads to splice junctions, TopHat first enumerates all canonical
donor and acceptor sites within the island sequences (as well as their
reverse complements).
Next, TopHat considers all pairings of these sites that could form
canonical (GT-AG) introns between neighboring (but not necessarily
adjacent) islands.
By default, TopHat examines potential introns longer than 70 bp and shorter than 20,000 bp
(more than 93% of mouse introns in the UCSC known gene set fall within this range).
Sequences flanking potential donor/acceptor splice sites within neighboring
regions are joined to form potential splice junctions.
Reads are mapped onto this junction library.
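
The enumeration of candidate introns can be sketched as follows. This is a strongly simplified illustration and not TopHat's code: it scans a single forward-strand sequence instead of pairs of islands, ignores reverse complements, and the function name `candidate_introns` is hypothetical. Only the GT/AG dinucleotides and the default length bounds come from the slide.

```python
def candidate_introns(seq, min_len=70, max_len=20_000):
    """Return (start, end) pairs such that seq[start:end] begins with a GT
    donor site, ends with an AG acceptor site, and has a length strictly
    between min_len and max_len."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    introns = []
    for donor in donors:
        for acceptor in acceptors:
            length = acceptor + 2 - donor  # the intron spans seq[donor:acceptor + 2]
            if min_len < length < max_len:
                introns.append((donor, acceptor + 2))
    return introns


# Toy sequence with a single GT...AG pair; the bounds are lowered for the example.
toy = "CC" + "GT" + "C" * 10 + "AG" + "CC"
print(candidate_introns(toy, min_len=5, max_len=50))
```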
118
Mapping reads spanning exons

119
Bowtie, a very popular aligner (for unspliced alignments)
Burrows-Wheeler Transform-based algorithm
Two phases: seed and extend.
The Burrows-Wheeler Transform of a text T, BWT(T), can be constructed as follows:
The character $ is appended to T, where $ is a character not in T that is
lexicographically less than all characters in T.
The Burrows-Wheeler Matrix of T, BWM(T), is obtained by computing the matrix whose
rows comprise all cyclic rotations of T sorted lexicographically.

Example with T = acaacg:

Rotations of T$ (index)    Sorted rotations = BWM(T) (index)
acaacg$  1                 $acaacg  7
caacg$a  2                 aacg$ac  3
aacg$ac  3                 acaacg$  1
acg$aca  4                 acg$aca  4
cg$acaa  5                 caacg$a  2
g$acaac  6                 cg$acaa  5
$acaacg  7                 g$acaac  6

BWT(T) = gc$aaac (last column of BWM(T))
120
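
The construction described on this slide can be written in a few lines. This is a naive sketch that builds and sorts all rotations explicitly, which is fine for a short text but not how genome-scale indexes are built; the function name `bwt` is chosen for illustration.

```python
def bwt(text):
    """Burrows-Wheeler Transform: append '$', sort all cyclic rotations
    lexicographically and read the last column."""
    s = text + "$"  # '$' sorts before the letters a, c, g, t
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("acaacg"))
```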
Bowtie principle
Burrows-Wheeler Matrices have a property called the Last First (LF)
Mapping.
The ith occurrence of character c in the last column corresponds to
the same text character as the ith occurrence of c in the first
column
Example: searching AAC in ACAACG

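[Figure: rows of the sorted rotation matrix of ACAACG$, labelled by rotation index (7, 3, 1, 4, 2, 5, 6), used to search AAC.]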

Second phase is extension
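
The search of AAC in ACAACG can be sketched with a backward search that relies on the LF mapping. This is a minimal illustration, not Bowtie's implementation: occurrences are counted naively at each step, whereas a real index precomputes them, and the function name `backward_search` is hypothetical.

```python
def backward_search(bwt_string, pattern):
    """Count the occurrences of pattern in the original text, processing the
    pattern from its last to its first character."""
    first_column = sorted(bwt_string)
    # For each character c: number of characters in the text smaller than c,
    # i.e. the first row of the sorted matrix that starts with c.
    first_row = {c: first_column.index(c) for c in set(bwt_string)}

    def occ(c, i):
        """Number of occurrences of c in bwt_string[:i]."""
        return bwt_string[:i].count(c)

    low, high = 0, len(bwt_string)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        low = first_row[c] + occ(c, low)
        high = first_row[c] + occ(c, high)
        if low >= high:
            return 0
    return high - low


BWT_ACAACG = "gc$aaac"  # BWT of acaacg
print(backward_search(BWT_ACAACG, "aac"))
```

At each step the range of rows shrinks to those whose prefix matches the part of the pattern processed so far: rows starting with c, then ac, then aac.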


121
Transcript discovery in the context of the ENCODE project
E.g ENCODE (Encyclopedia Of DNA Elements)
A catalog of expressed transcripts

122
Some key results of ENCODE analysis
15 cell lines studied
RNA-seq, CAGE-seq, RNA-PET
Long RNA-seq (76) vs short (36)
Subnuclear compartments
chromatin, nucleoplasm and nucleoli

Human genome coverage by transcripts


62.1% covered by processed transcripts
74.7 % covered by primary transcripts
Significant reduction of intergenic regions
10-12 expressed isoforms per gene per cell line

123
The world of long non-coding RNA (LncRNA)
Long: i.e cDNA of at least 200 bp
A considerable fraction (29%) of lncRNAs are detected in only one of the cell
lines tested (vs 7% of protein coding)
10% expressed in all cell lines (vs 53% of protein-coding genes)
More weakly expressed than coding genes
The nucleus is the center of accumulation of ncRNAs

124
Some LncRNA are functional
Some results regarding their implication in cancer
May help recruitment of chromatin modifiers
May also reveal the underlying activity of enhancers
A large fraction are divergent transcripts

125
The Gencode database (hs/mm)

126
Aligner output: SAM/BAM files
SAM = Sequence Alignment/Map
BAM: binary/compressed version of SAM
Store information related to alignments
Read alignment coordinates
Mapping quality
CIGAR String
Bitwise FLAG
read paired, read mapped in proper pair, read unmapped, ...
...

127
Bitwise flag
Many pieces of information are encoded in the 2nd column (FLAG) of the
SAM/BAM file:
read paired
read mapped in proper pair
read unmapped
mate unmapped
read reverse strand
mate reverse strand
first in pair
second in pair
not primary alignment
...
These binary pieces of information are stored in a single column.

128
Bitwise flag
00000000001 2^0 = 1 (read paired)
00000000010 2^1 = 2 (read mapped in proper pair)
00000000100 2^2 = 4 (read unmapped)
00000001000 2^3 = 8 (mate unmapped)
00000010000 2^4 = 16 (read reverse strand)
00000001001 2^0+ 2^3 = 9 (read paired, mate unmapped)
00000001101 2^0+2^2+2^3 =13 ...
...

http://picard.sourceforge.net/explain-flags.html
129
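
A flag value can be decoded by testing each bit in turn. The sketch below covers the bits listed on the previous slides; the dictionary `FLAG_BITS` and the function name `explain_flag` are chosen for illustration.

```python
# Bit value -> meaning, for the properties listed on the slides.
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
}


def explain_flag(flag):
    """Return the list of properties whose bit is set in the flag."""
    return [meaning for bit, meaning in FLAG_BITS.items() if flag & bit]


print(explain_flag(9))
print(explain_flag(13))
```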
The extended CIGAR string
Example operations:
M alignment match (can be a sequence match or mismatch !)
I insertion to the reference
D deletion from the reference
http://samtools.sourceforge.net/SAM1.pdf

ATTCAGATGCAGTA
ATTCA--TGCAGTA 5M2D7M
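
A CIGAR string can be split into (length, operation) pairs with a regular expression. This is a small sketch; the function names `parse_cigar`, `reference_span` and `read_length` are chosen for illustration.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment
    (operations that consume the reference: M, D, N, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described by the CIGAR
    (operations that consume the read: M, I, S, =, X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("5M2D7M"), reference_span("5M2D7M"), read_length("5M2D7M"))
```

For the example of the slide, the alignment covers 14 reference bases while the read itself contains 12 bases, the difference being the 2-base deletion.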
130
RNA-seq: library construction (simplified)

131
Illumina sequencing general principle

132
http://www.illumina.com/company/video-hub/HMyCqWhwB8E.html
The Sanger quality score
Sanger quality score (Phred quality score): measures the quality of each base call
Based on p, the probability of error (the probability that the corresponding base
call is incorrect).
$Q_{sanger} = -10 \log_{10}(p)$
$p = 10^{-Q_{sanger}/10}$
Example: p = 0.01 <=> $Q_{sanger}$ = 20
Quality scores are encoded as ASCII characters with an offset of 33 (ASCII 33).
Note that SRA has adopted the Sanger quality score, although original FASTQ files may use
a different quality score encoding (see: http://en.wikipedia.org/wiki/FASTQ_format)
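
The two conversions between error probability and quality score, as a short sketch (the function names are chosen for illustration):

```python
from math import log10


def phred_from_error(p):
    """Quality score from the probability that the base call is incorrect."""
    return -10 * log10(p)


def error_from_phred(q):
    """Probability that the base call is incorrect, from the quality score."""
    return 10 ** (-q / 10)


print(phred_from_error(0.01), error_from_phred(20))
```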

133
ASCII 33
Storing Phred scores as single characters gives a simple and space-efficient
encoding:
Character ! means a quality of 0
Character " means a quality of 1
Character # means a quality of 2
...
Range 0-40
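
Decoding a quality string amounts to subtracting the offset of 33 from the ASCII code of each character. A minimal sketch, with a hypothetical function name and a hypothetical quality string:

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


# Hypothetical quality string: '!' (ASCII 33) up to 'I' (ASCII 73).
print(decode_qualities('!"#I'))
```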

134
Quality control for high throughput sequence data
Quality control
First step of analysis
Ensure proper quality of sequencing experiment.

135
Quality control with FastQC program

[Figure: FastQC plots. Axes: quality vs. position in read (two panels); number of reads vs. mean Phred score.]

136
Tools to create reproducible workflows
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
E.g. make, Snakemake, Galaxy, Taverna...

137
http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/RNAseq-Paul.pdf
Galaxy server (https://usegalaxy.org/)
Interface to a computing cluster
Highly flexible
Large palette of bioinformatic
programs
Easy to add your own
Fully reproducible workflows

138
Snakemake
A make-like solution

139
