Data Preprocessing: Preprocessing and SNP Calling

Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group
Next Generation Sequencing analysis
DTU Bioinformatics
36626 - Next Generation Sequencing Analysis

Generalized NGS analysis
Question → Raw reads → Pre-processing → Assembly (alignment / de novo) → Application specific (variant calling, count matrix, ...) → Compare samples / methods → Answer?
[Figure: workflow diagram, annotated with data size]
Assembly: Two basic approaches
• Alignment: Use a reference genome and align your reads to the genome
• de novo assembly: Try to assemble the reads into a genome without any prior knowledge

Alignment is covered on Monday, de novo assembly on Wednesday.

But first a look at data preprocessing

Preprocessing
• Reads come with base qualities: base calls are not always correct!
• Different error profiles per technology
• What can we do?
  • Quality trimming
  • Adapter clipping
  • 5’ clipping
  • k-mer correction
  • ...
Analyze data using FastQC
• Report basic statistics on your data
• Identify issues with your data

Per base sequence quality
Illumina
Trim from the 3’ end until a base with quality ≥ 20 is reached
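
A minimal sketch of 3’ quality trimming, to make the rule concrete. It assumes Phred+33 encoded qualities and a plain "walk inward from the 3’ end" rule; the function names and the example read are made up for illustration, and real trimmers may use different (e.g. windowed) rules.

```python
def phred33_to_quals(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in qual_string]


def trim_3prime(seq, quals, min_qual=20):
    """Remove bases from the 3' end until a base with quality >= min_qual is found."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


# Made-up read: 'I' = Q40, '5' = Q20, '#' = Q2, '+' = Q10
seq = "ACGTACGTAC"
quals = phred33_to_quals("IIIIIII5#+")
trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
print(trimmed_seq)
```

The last two bases (Q2 and Q10) are removed; the Q20 base is kept because the threshold is inclusive in this sketch.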

Average quality
Illumina
Remove reads with avg. qual < 20
Remove reads with “N” basecalls
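
The two filters above can be sketched as a single keep/discard decision per read. This is an illustration only: the function names and example reads are hypothetical, and the average is taken as the plain arithmetic mean of the Phred scores.

```python
def mean_quality(quals):
    """Arithmetic mean of the per-base Phred scores (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def keep_read(seq, quals, min_mean_qual=20):
    """Keep a read only if it has no N base calls and its mean quality is high enough."""
    if "N" in seq.upper():
        return False
    return mean_quality(quals) >= min_mean_qual


# Made-up reads as (sequence, qualities) pairs
reads = [
    ("ACGTACGT", [30, 32, 35, 31, 30, 28, 25, 22]),  # good read
    ("ACGTNCGT", [30, 32, 35, 31, 2, 28, 25, 22]),   # contains an N
    ("TTGACCGA", [12, 15, 18, 20, 10, 9, 8, 5]),     # low average quality
]
kept = [seq for seq, quals in reads if keep_read(seq, quals)]
print(kept)
```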

Trim from 5’
• Sometimes something is fishy at the beginning of the read
• Clip a certain number of bases from the 5’ end
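
As a sketch, 5’ clipping is just dropping a fixed number of leading bases and their qualities. The parameter name `n_bases` and the example values are assumptions for illustration.

```python
def clip_5prime(seq, quals, n_bases):
    """Drop the first n_bases bases (and their qualities) from the 5' end."""
    return seq[n_bases:], quals[n_bases:]


clipped_seq, clipped_quals = clip_5prime("NNACGTACGT", [2, 2, 30, 30, 30, 30, 30, 30, 30, 30], 2)
print(clipped_seq)
```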

Adapters
• Sometimes adapters/primers are also part of the read
• Adapters/primers are non-biological sequences
• Short read alignment is global - adapters are a no-go
• de novo assembly will be confused: adapters look like artificial repeats
• If you don't know which were used: FastQC will (may) find them for you!

Adapters - example
We will use “Cutadapt” and “AdapterRemoval” to cut adapters; many other options exist
Very important if your DNA fragment is shorter than the read length
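
To illustrate what adapter clipping does when the fragment is shorter than the read, here is a simplified sketch. It is not how Cutadapt or AdapterRemoval work internally: it only handles exact matches, whereas real tools tolerate mismatches and use qualities. The adapter sequence and reads are example values chosen for the illustration.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter from the 3' end of a read (exact matching only).

    First looks for the full adapter anywhere in the read, then for a
    partial adapter (a prefix of it) hanging off the 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


adapter = "AGATCGGAAGAGC"  # example adapter prefix, assumed for this sketch
full = trim_adapter("ACGTACGTTT" + adapter + "TTTT", adapter)
partial = trim_adapter("ACGTACGTTT" + adapter[:5], adapter)
untouched = trim_adapter("ACGTACGTTT", adapter)
print(full, partial, untouched)
```

The `min_overlap` cut-off is a design choice: a very short match at the 3’ end is likely to occur by chance, so it is left alone.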

454 / Ion Torrent data
[Figure: Prinseq output]
• Main problem is indels at homopolymer runs
• (Trim homopolymers), trim trailing poor quality bases
• Remove very short reads
• For de novo, adapters should be removed (prinseq)
• For alignment we use Smith-Waterman (local), so adapters are less important

k-mer correction
• What is a k-mer?
• Create a sliding window of size k, move it over all
your reads and count occurrence of k-mers
• We can use this to correct sequencing errors!
E.g. k=5
DNA: ACGTGTAACGTGACGTTGGA
     ACGTG
      CGTGT
       GTGTA
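
The sliding window can be sketched in a few lines. The helper names are made up for this example; the sequence is the one from the slide.

```python
from collections import Counter


def kmers(seq, k):
    """Return all k-mers of seq, in order, using a sliding window of size k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


dna = "ACGTGTAACGTGACGTTGGA"
first_three = kmers(dna, 5)[:3]
counts = count_kmers([dna], 5)
print(first_three, counts["ACGTG"])
```

A sequence of length 20 yields 20 - 5 + 1 = 16 windows for k=5, and ACGTG is seen twice in this one.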

k-mer correction
Concept: Rare k-mers are seq. errors
Need >15X coverage
Example reads:
ACGTGGTTGCCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
ACGTGGTTACCCTTAAA
[Figure: k-mer coverage distribution; x-axis: Coverage (0-100), y-axis: Density (0.000-0.015); regions labelled "Error k-mers" and "True k-mers"]
Figure 3 k-mer coverage. 15-mer coverage model fit to 76× coverage of 36 bp reads from E. coli. Note that the expected coverage of a k-mer in the genome using reads of length L will be (L − k + 1)/L times the expected coverage of a single nucleotide (Kelley et al., 2010)
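
A small sketch of the two ideas on this slide: flagging rare k-mers as suspected errors, and the (L − k + 1)/L scaling from the figure caption. The function names, k=5 and the count threshold are assumptions for illustration; this is not the statistical model used by Kelley et al., only a bare count cut-off.

```python
from collections import Counter


def rare_kmers(reads, k, min_count=2):
    """Return the k-mers seen fewer than min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n < min_count}


def expected_kmer_coverage(base_coverage, read_length, k):
    """Scale single-nucleotide coverage by (L - k + 1) / L to get k-mer coverage."""
    return base_coverage * (read_length - k + 1) / read_length


# The example reads from the slide: one read differs at a single base.
reads = ["ACGTGGTTGCCCTTAAA"] + ["ACGTGGTTACCCTTAAA"] * 8
suspect = rare_kmers(reads, k=5)
kmer_cov = expected_kmer_coverage(76, 36, 15)
print(sorted(suspect), round(kmer_cov, 1))
```

Exactly the five 5-mers that overlap the odd base show up as rare; every other k-mer is supported by all nine reads. With the numbers from the caption (76× coverage, 36 bp reads, k=15), the expected 15-mer coverage is about 46×.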
Merge paired ends
Insert size: 500nt, Reads: 100nt, Middle: 300nt
Insert size: 180nt, Reads: 100nt, Middle: -20nt (overlap)
• Merge overlapping pairs: single longer read
• Smart because Illumina reads have bad 3’ quals
• Very useful for de novo assembly
Magoč and Salzberg, 2011
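
The arithmetic and the merge step can be sketched as follows. This is a simplified illustration with exact matching only; real mergers also tolerate mismatches and use the base qualities. The function names, the `min_overlap` value and the 30 nt fragment are made up for the example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A<->T, C<->G, N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def middle_length(insert_size, read_length):
    """Unsequenced middle of the fragment; negative means the reads overlap."""
    return insert_size - 2 * read_length


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair into one longer read if their ends overlap exactly.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented first. Returns None if no overlap is found.
    """
    r2 = reverse_complement(read2)
    for overlap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-overlap:] == r2[:overlap]:
            return read1 + r2[overlap:]
    return None


# Made-up 30 nt fragment sequenced with 20 nt reads: the reads overlap by 10 nt.
fragment = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2)
print(middle_length(500, 100), middle_length(180, 100), merged == fragment)
```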

Coverage
• Coverage/depth is how many times your data covers the genome (on average)
• C = N × L / G
• Example:
  • N: Number of reads: 5 mill
  • L: Read length: 100
  • G: Genome size: 5 Mbases
  • C = 5*100/5 = 100X
• On average there are 100 reads covering each position in the genome
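
The coverage formula as a small sketch, using the numbers from the example. The `reads_needed` helper is an addition for illustration: it simply rearranges the same formula to ask how many reads a target coverage requires.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average coverage C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Rearranged formula: N = C * G / L, rounded up to whole reads."""
    return math.ceil(target_coverage * genome_size / read_length)


c = coverage(num_reads=5_000_000, read_length=100, genome_size=5_000_000)
n = reads_needed(target_coverage=15, read_length=100, genome_size=5_000_000)
print(c, n)
```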
Last, but important!
• Lots of data - storage is expensive!
• Keep data compressed whenever possible (gzip, bzip2, BAM)
• Remove intermediate files and files that can easily be re-created
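
Compressed files can usually be read directly, so there is rarely a need to keep an uncompressed copy. A minimal sketch using Python's standard `gzip` module; the file name and the two tiny records are made up, and the 4-lines-per-record FASTQ layout is assumed.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file, streaming it line by line."""
    with gzip.open(path, "rt") as handle:
        # Assumes the common 4-lines-per-record FASTQ layout.
        return sum(1 for _ in handle) // 4


# Demo with a tiny made-up FASTQ file written to a temporary directory.
records = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(records)

n_reads = count_fastq_reads(demo_path)
print(n_reads)
```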
