
Data Preprocessing

Preprocessing and SNP calling


Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group
Next Generation Sequencing analysis
DTU Bioinformatics

36626 - Next Generation Sequencing Analysis


Generalized NGS analysis

Question → Raw reads → Preprocessing → Assembly (alignment / de novo) → Application specific (variant calling, count matrix, ...) → Compare samples / methods → Answer?

[Diagram annotation: Data size]
Assembly: Two basic approaches

• Alignment (Monday): Use a reference genome and align your reads to the genome

• de novo assembly (Wednesday): Try to assemble the reads into a genome without any prior knowledge

But first, a look at data preprocessing


Preprocessing
• Reads have base qualities: base calls are not always correct!
• Different error profiles per technology
• What can we do?
  • Quality trimming
  • Adapter clipping
  • 5’ clipping
  • k-mer correction
  • ...
Analyze data using FastQC
• Report basic statistics on your data
• Identify issues with your data



Per base sequence quality
Illumina

Trim from the 3’ end to quality 20 (remove trailing bases with Phred quality < 20)
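
A minimal sketch of 3’ quality trimming, to make the threshold concrete. It assumes Phred+33 quality encoding, and the function names `phred_scores` and `trim_3prime` are hypothetical. It only strips trailing bases below the cutoff; real trimming tools often use more elaborate rules.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    # Walk backwards from the 3' end until a base passes the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' = Q40, '5' = Q20, '#' = Q2, '!' = Q0
seq, qual = trim_3prime("ACGTACGT", "IIIII5#!")
print(seq, qual)  # ACGTAC IIIII5
```

The base with quality exactly 20 is kept, since only bases strictly below the cutoff are removed.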



Average quality
Illumina

Remove reads with avg. qual < 20

Remove reads with “N” basecalls
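
The two read filters can be sketched as follows. This is an illustration only: it assumes Phred+33 encoding and takes the plain arithmetic mean of the Phred scores as the "average quality"; `mean_quality` and `keep_read` are hypothetical names.

```python
def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores in a quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def keep_read(sequence, quality_string, min_mean_quality=20, offset=33):
    """Return True if the read has no N base calls and a sufficient mean quality."""
    if "N" in sequence.upper():
        return False
    return mean_quality(quality_string, offset) >= min_mean_quality


reads = [
    ("ACGTACGT", "IIIIIIII"),  # mean Q40, no N: kept
    ("ACGTNCGT", "IIIIIIII"),  # contains an N base call: removed
    ("ACGTACGT", "########"),  # mean Q2: removed
]
kept = [read for read in reads if keep_read(*read)]
print(kept)  # [('ACGTACGT', 'IIIIIIII')]
```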



Trim from 5’
• Sometimes something is fishy at the beginning of the read

Clip a fixed number of bases from the 5’ end


Adapters
• Sometimes adapters/primers are also part of the read
• Adapters/primers are non-biological sequences

• Short read alignment is global: reads containing adapters are a no-go

• de novo assembly will be confused: adapters look like artificial repeats

• If you don't know which adapters were used: FastQC will (may) find them for you!



Adapters - example

We will use “Cutadapt” and “AdapterRemoval” to cut adapters; many other options exist

Very important if your DNA fragment is shorter than the read length, because the read then runs into the adapter
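
To illustrate what adapter clipping does, here is a deliberately simplified sketch. It is not how Cutadapt or AdapterRemoval work internally: it only handles exact matches, whereas real tools tolerate sequencing errors. The function name `clip_adapter`, the `min_overlap` parameter, and the example adapter sequence are assumptions made for the example.

```python
def clip_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the whole adapter occurs inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first few bases of the adapter.
    longest = min(len(adapter), len(read)) - 1
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[:len(read) - length]
    # No adapter found: return the read unchanged.
    return read


adapter = "AGATCGGAAGAGC"  # example adapter sequence, assumed for illustration
print(clip_adapter("ACGTACGTAC" + adapter + "TT", adapter))  # ACGTACGTAC
print(clip_adapter("ACGTACGTAC" + "AGATCG", adapter))        # ACGTACGTAC
print(clip_adapter("ACGTACGTAC", adapter))                   # ACGTACGTAC
```

The second call shows the case where the fragment is only slightly shorter than the read, so just the start of the adapter was sequenced.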


454 / Ion Torrent data
[Figure: Prinseq output]
• Main problem is indels at homopolymer runs
• (Trim homopolymers), trim trailing poor quality bases
• Remove very short reads

• For de novo assembly, adapters should be removed (prinseq)
• For alignment we use Smith-Waterman (local), so adapter removal is less important



k-mer correction
• What is a k-mer?
• Create a sliding window of size k, move it over all your reads and count the occurrences of each k-mer
• We can use this to correct sequencing errors!

E.g. k=5
DNA: ACGTGTAACGTGACGTTGGA
     ACGTG
      CGTGT
       GTGTA
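
The sliding-window counting can be sketched in a few lines. The function name `count_kmers` is hypothetical, and the input is the example sequence from the slide.

```python
from collections import Counter


def count_kmers(reads, k):
    """Slide a window of size k over every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


dna = "ACGTGTAACGTGACGTTGGA"
counts = count_kmers([dna], k=5)
print(sum(counts.values()))  # 16 windows in a 20-base sequence
print(counts["ACGTG"])       # 2, this 5-mer occurs twice
```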



k-mer correction
• Concept: Rare k-mers are seq. errors
• Need >15X coverage

[Figure: density of k-mer coverage (x-axis: Coverage, 0 to 100; y-axis: Density, 0.000 to 0.015), with a low-coverage peak labelled "Error k-mers" and a higher-coverage peak labelled "True k-mers". Inset: a stack of reads ACGTGGTTACCCTTAAA in which one read differs at a single base (ACGTGGTTGCCCTTAAA).]

Figure 3, k-mer coverage: 15-mer coverage model fit to 76× coverage of 36 bp reads from E. coli. Note that the expected coverage of a k-mer in the genome using reads of length L will be (L − k + 1)/L times the expected coverage of a single nucleotide. (Kelley et al., 2010)
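
As a worked example of the caption, the factor (L − k + 1)/L can be applied to the numbers in the figure (76× nucleotide coverage, 36 bp reads, k = 15). The function name `expected_kmer_coverage` is hypothetical, and this only evaluates the stated ratio.

```python
def expected_kmer_coverage(nucleotide_coverage, read_length, k):
    """Expected k-mer coverage: nucleotide coverage times (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return nucleotide_coverage * (read_length - k + 1) / read_length


cov = expected_kmer_coverage(nucleotide_coverage=76, read_length=36, k=15)
print(round(cov, 1))  # 46.4
```

Each 36 bp read contains only 22 complete 15-mers, so the k-mer coverage is noticeably lower than the nucleotide coverage.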
Merge paired ends

Insert size: 500nt          Insert size: 180nt
Reads: 100nt                Reads: 100nt
Middle: 300nt               Middle: -20nt (overlap)

• Merge overlapping pairs into a single longer read

• Smart because Illumina reads have poor quality at the 3’ end

• Very useful for de novo assembly

Magoč and Salzberg, 2011
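
The "middle" numbers on this slide follow from simple arithmetic, sketched here. It assumes, as the slide's numbers imply, that "insert size" means the full fragment length including both reads; `middle_length` and `merged_length` are hypothetical names.

```python
def middle_length(insert_size, read_length):
    """Unsequenced middle between the two reads; negative means the reads overlap."""
    return insert_size - 2 * read_length


def merged_length(insert_size, read_length):
    """Length of the merged read if the pair overlaps, otherwise None."""
    if middle_length(insert_size, read_length) < 0:
        return insert_size
    return None


print(middle_length(500, 100))  # 300: no overlap, the pair cannot be merged
print(middle_length(180, 100))  # -20: the reads overlap by 20 nt
print(merged_length(180, 100))  # 180: one merged read spanning the fragment
```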


Coverage
• Coverage/depth is how many times your data covers the genome (on average)

• C = N × L / G
  • N: number of reads
  • L: read length
  • G: genome size

• Example:
  • N: Number of reads: 5 mill
  • L: Read length: 100
  • G: Genome size: 5 Mbases
  • C = 5*100/5 = 100X

• On average there are 100 reads covering each position in the genome
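
The coverage calculation, C = N × L / G, as a short sketch using the numbers from the example. The function name `coverage` is hypothetical.

```python
def coverage(num_reads, read_length, genome_size):
    """Average coverage C = N * L / G."""
    return num_reads * read_length / genome_size


c = coverage(num_reads=5_000_000, read_length=100, genome_size=5_000_000)
print(f"{c:.0f}X")  # 100X
```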
Last, but important!
• Lots of data: storage is expensive!

• Keep data compressed whenever possible (gzip, bzip2, BAM)

• Remove intermediate files and files that can easily be re-created
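
Keeping files compressed does not have to get in the way of analysis, since many tools and languages read gzip directly. A small sketch using Python's standard `gzip` module follows. The demo file and its reads are made up for the example, and the record count assumes exactly four lines per FASTQ record.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file without unpacking it on disk."""
    with gzip.open(path, "rt") as handle:
        # Assumption: every record is exactly four lines (no wrapped sequences).
        return sum(1 for _ in handle) // 4


# Write a tiny compressed demo file (hypothetical reads) and read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")

print(count_fastq_reads(demo_path))  # 2
```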
