OMICS
OMICS refers to a field of study in biology ending in -omics, such as genomics, proteomics or metabolomics.
The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome or
metabolome respectively. Omics aims at the collective characterization and quantification of pools of biological
molecules that translate into the structure, function, and dynamics of an organism or organisms.
Simply stated, whenever the suffix -omics is attached to a biological term (e.g., gene + omics = genomics), it refers to the study of the corresponding biological entity as a whole, from its basic characteristics to its structure, dynamics, expression, and production.
GENOMICS
Genomics is the study of the complete DNA sequence of an organism. It provides an overall view of a genome, its differences from and similarities to the genomes of other organisms, and helps in finding solutions to many real-life problems.
Alternatively:
Genomics is an interdisciplinary field of science focusing on the structure, function, evolution, mapping, and
editing of genomes.
A genome is an organism's complete set of DNA, including all of its genes. In contrast to genetics, which refers
to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization
and quantification of genes, which direct the production of proteins with the assistance of enzymes and messenger
molecules.
Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA
sequencing and bioinformatics to assemble and analyse the function and structure of entire genomes.
Structural Genomics
Structural genomics seeks to describe the 3D structure of every protein encoded by a given genome.
This genome-based approach allows for a high-throughput method of structure determination by a
combination of experimental and modelling approaches.
The principal difference between structural genomics and traditional structural prediction is that structural
genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing
on one particular protein.
With full-genome sequences available, structure prediction can be done more quickly through a
combination of experimental and modelling approaches, especially because the availability of large
number of sequenced genomes and previously solved protein structures allows scientists to model protein
structure on the structures of previously solved homologs.
Structural genomics takes advantage of completed genome sequences in several ways in order to determine
protein structures.
The gene sequence of the target protein can also be compared to a known sequence and structural
information can then be inferred from the known protein’s structure.
Structural genomics can be used to predict novel protein folds based on other structural data.
Structural genomics can also take modelling-based approach that relies on homology between the unknown
protein and a solved protein structure.
Functional Genomics
Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of
gene expression and protein–protein interactions, as opposed to the static aspects of the genomic
information such as DNA sequence or structures.
Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA
transcripts, and protein products.
A key characteristic of functional genomics studies is their genome-wide approach to these questions,
generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.
Functional genomics may be applied to the complete collection of DNA (the genome), RNA (the
transcriptome), or protein (the proteome) of an organism.
Functional genomics implies the use of high‐throughput screens, in contrast to traditional methods of
biology in which one gene or protein has been characterized experimentally in depth. Such traditional
methods commonly complement high-throughput approaches.
Functional genomics often involves the perturbation of gene function to investigate the consequence on
the function of other genes in a genome.
One of the most challenging and fundamental problems in modern biology is to understand the relationship
between genotype and phenotype. Connecting the two is a fundamental part of functional genomics.
Metagenomics
Metagenomics is the study of genetic material recovered directly from environmental samples.
The broad field may also be referred to as environmental genomics, ecogenomics or community
genomics.
While traditional microbiology and microbial genome sequencing and genomics rely upon
cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S
rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority
of microbial biodiversity had been missed by cultivation-based methods.
Recent studies use either "shotgun" or PCR-directed sequencing to obtain largely unbiased samples of all
genes from all members of the sampled communities.
Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a
powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the
entire living world.
Epigenomics
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell,
known as the epigenome.
Modifications that alter gene expression without changing the primary DNA sequence, and that are
heritable both mitotically and meiotically, are classified as epigenetic modifications. DNA methylation
and histone modification are among the best-characterized epigenetic processes.
The field is analogous to genomics and proteomics, which are the studies of a cell's genome and proteome,
respectively.
Epigenetic modifications are reversible modifications on a cell’s DNA or histones that affect gene
expression without altering the DNA sequence.
Epigenomic maintenance is a continuous process and plays an important role in the stability of eukaryotic
genomes by taking part in crucial biological mechanisms such as DNA repair. Plant flavones have been
reported to inhibit epigenomic marks associated with cancer.
Two of the most characterized epigenetic modifications are DNA methylation and histone modification.
Epigenetic modifications play an important role in gene expression and regulation, and are involved in
numerous cellular processes such as in differentiation/development and tumorigenesis. The study of
epigenetics on a global level has been made possible only recently through the adaptation of genomic
high-throughput assays.
SEQUENCING
1. Sanger’s Method – Dideoxy Nucleotide Chain Termination Method
The most commonly used method of DNA sequencing, called dideoxy sequencing, is based on DNA
replication. Using a sequence of interest already cloned into a vector as a template, DNA polymerase adds
nucleotides to a short primer, until extension of the new DNA strand is stopped by inclusion of a modified
nucleotide. This generates an array of short fragments, which can be interpreted by gel electrophoresis either
in an automated DNA sequencer or in a standard gel apparatus. Both linear DNA and circular DNA can be
sequenced using the dideoxy DNA sequencing method.
Principle
i. Di-deoxy Nucleotides
If a normal dNTP precursor is used, the extended DNA chain carries a 3′-OH group at its end, so another
nucleotide can be added by DNA polymerase. However, if a dideoxy precursor such as ddTTP is used, the
extended chain ends in a 3′-H instead, and no further nucleotide can be added by DNA polymerase. In
other words, the addition of a dideoxynucleotide to a DNA chain being synthesized terminates the DNA
synthesis reaction.
This happens because the 3′ position lacks the hydroxyl group needed to form a phosphodiester bond with
the 5′-phosphate of the incoming nucleotide, so the chain cannot be extended any further.
ii. DNA Replication
The reaction protocol of the sequencing technique is derived from the existing natural phenomena of DNA
replication which takes place in the cells. Here, the DNA polymerase is performing its function as it
performs in the replication reaction, but the final product is different, because of the inclusion of a modified
nucleotide. Also, the reaction protocol is much similar to the PCR technique which is used to produce
multiple copies of a DNA sequence. The chain-terminator or dideoxy procedure for DNA sequencing
capitalizes on two properties of DNA polymerases:
Their ability to synthesize faithfully a complementary copy of a single-stranded DNA template
Their ability to use 2′, 3′- dideoxynucleotides as substrates
Reaction Mixture
The reaction mixture consists of the following components –
The DNA sample to be sequenced is the template. The template has to be converted into a single strand
by denaturing with NaOH or by heating. But if you are carrying out the sequencing reaction using a PCR
machine, denaturation of the template occurs as a part of the reaction cycle.
DNA primers. 5′ end radio-labeled DNA primers, which are short fragments of DNA complementary to
the template DNA. Primers are labeled with radioactive phosphate at the 5′ end.
A mixture of all four dNTPs—dATP, dGTP, dCTP, and dTTP. This mixture is distributed in appropriate
quantities among four reaction tubes labeled ‘A’, ‘T’, ‘G’, and ‘C’.
ddNTPs - To each tube a small quantity of the corresponding ddNTP is added. The ddNTPs are either
radiolabeled or tagged with a fluorescent dye that absorbs light at one wavelength and emits it at another,
making the terminated fragments detectable. In the tube labeled ‘A’ a small amount of ddATP is added;
ddTTP is added to tube ‘T’, ddGTP to tube ‘G’, and ddCTP to tube ‘C’. The concentration of each ddNTP
is approximately 1% of the concentration of the dNTPs.
Taq DNA polymerase - When all the components are ready, Taq DNA polymerase is added to all four
tubes and the reaction, i.e., the synthesis of DNA by elongation of the primer, starts.
Reaction Steps of Sequencing Methods
A. Denaturation of the Template Strand
The template strand is heated to 92–98 °C so that the strands separate to form single-stranded DNA, or it
is denatured with a chemical agent (NaOH).
B. Primer Annealing
The reaction is brought to a lower temperature (50–65 °C) so that the primer can bind to the template and
the polymerase can start elongating the chain.
C. Elongation and Chain Termination
Since the reaction uses Taq polymerase, the temperature is raised to 72 °C and elongation is allowed to
continue for a set period. During this time, the dNTPs and ddNTPs are incorporated into the chain by the
polymerase. Whenever the chain incorporates a ddNTP (e.g., ddATP opposite a T in the template),
elongation of the chain stops.
D. Electrophoresis and Autoradiogram
The sample is subjected to electrophoresis (gel or capillary), and an autoradiogram is used to read the
sequence of the template.
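The four-tube reaction and the gel readout described above can be sketched in a short simulation. This is a toy illustration, not a laboratory protocol: it deterministically enumerates one terminated fragment per position, whereas in a real reaction termination is stochastic and the ~1% ddNTP concentration produces fragments of every length over many template molecules. The function names are illustrative.

```python
def sanger_fragments(template):
    """Enumerate every chain-terminated fragment, grouped by the tube
    ('A', 'T', 'G', 'C') whose ddNTP ended the chain."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    new_strand = "".join(complement[b] for b in template)
    tubes = {"A": [], "T": [], "G": [], "C": []}
    for i, base in enumerate(new_strand):
        # a chain terminated at position i ends with the ddNTP matching `base`
        tubes[base].append(new_strand[: i + 1])
    return tubes

def read_gel(tubes):
    """Read the sequence as from a gel: the shortest fragment migrates
    farthest, so sorting fragments by length gives the 5'-to-3' base order."""
    fragments = [(len(f), tube) for tube, frags in tubes.items() for f in frags]
    return "".join(tube for _, tube in sorted(fragments))

tubes = sanger_fragments("TACGGT")
print(read_gel(tubes))  # ATGCCA, the strand synthesized against the template
```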
Limitations of Sanger Sequencing
Sanger sequencing has a number of limitations that can lead to problems with results and difficulty using the
method in general:
Sanger methods can only sequence short pieces of DNA, about 300 to 1,000 base pairs.
The quality of a Sanger sequence is often not very good in the first 15 to 40 bases because that is where
the primer binds.
Sequence quality degrades after 700 to 900 bases.
If the DNA fragment being sequenced has been cloned, some of the cloning vector sequence may find its
way into the final sequence.
Nonspecific primer binding
Formation of DNA secondary structures which alter sequencing fidelity
A. Pyrosequencing
Pyrosequencing is a DNA sequencing method that involves determining which of the four bases is incorporated
at each step in the copying of a DNA template. Pyrosequencing is named for the pyrophosphate molecule (two
phosphate groups connected by a covalent bond) that is released when a dNTP is used by DNA polymerase to
extend a new DNA strand.
Principle and Technology
The DNA to be sequenced is denatured to form single-stranded DNA. The single-stranded DNA is attached to
a solid, microscopic bead that is placed in a microscopic well in the pyrosequencer. The sequencing reaction
mixture, consisting of a primer, DNA polymerase, and three other enzymes, is added. The four dNTPs are not
present in the initial mix, but are added sequentially to and removed from the pyrosequencing reaction, such
that only one dNTP is present in the reaction at any one time. This cycle of addition and removal of each dNTP
in turn repeats over and over. As DNA polymerase moves along a single stranded template, each of the four
nucleoside triphosphates is fed sequentially and then removed. If one of the four bases is incorporated then
pyrophosphate is released and this is detected in an enzyme cascade that emits light.
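The cycle of sequential dNTP addition and light detection can be sketched as a toy simulation. Light intensity here is simply the count of bases incorporated per dispensation, which is also how homopolymer runs show up as taller peaks in a real pyrogram; the function name and dispensation order are illustrative assumptions.

```python
def pyrosequence(template, cycles=8, order="ATGC"):
    """Simulate pyrosequencing of the complement of a single-stranded template.

    dNTPs are dispensed one at a time in a fixed order; whenever the dispensed
    base pairs with the next template base(s), pyrophosphate is released and a
    light peak proportional to the number of incorporations is recorded.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    target = "".join(complement[b] for b in template)  # strand being built
    pos = 0
    peaks = []  # (dispensed base, light intensity) per dispensation
    for _ in range(cycles):
        for base in order:
            n = 0
            while pos < len(target) and target[pos] == base:
                n += 1  # a homopolymer run incorporates several bases at once
                pos += 1
            peaks.append((base, n))
            if pos == len(target):
                return peaks
    return peaks

# template TTACG -> synthesized strand AATGC; the 'A' peak is twice as tall
for base, light in pyrosequence("TTACG"):
    print(base, "*" * light)
```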
Variants of Pyrosequencing
There are two variants of the pyrosequencing technique.
In solid-phase pyrosequencing, the DNA to be
sequenced is immobilized and a washing step is used
to remove the excess substrate after each nucleotide
addition. The four different nucleotides are added
stepwise to the immobilized primed DNA template
and the incorporation event is followed using the
enzyme ATP sulfurylase and luciferase. After each
nucleotide addition, a washing step is performed to
allow iterative addition.
In liquid-phase sequencing a nucleotide degrading
enzyme (apyrase) is introduced to make a four enzyme
system. Addition of this enzyme has eliminated the
need for a solid support and intermediate washing
thereby enabling the pyrosequencing reaction to be
performed in a single tube. However, without the
washing step, inhibitory substances can accumulate.
Primed DNA template and four enzymes involved in
liquid-phase pyrosequencing are placed in a well of a
microtitre plate.
The four different nucleotides are added stepwise and incorporation is followed using the enzyme ATP
sulfurylase and luciferase. The nucleotides are continuously degraded by the nucleotide-degrading enzyme,
allowing addition of the subsequent nucleotide.
B. Illumina Dye Sequencing
Illumina dye sequencing is a technique used to determine the series of base pairs in DNA, also known as DNA
sequencing. This sequencing method is based on reversible dye-terminators that enable the identification of
single bases as they are introduced into DNA strands. It can also be used for whole-genome and region
sequencing, transcriptome analysis, metagenomics, small RNA discovery, methylation profiling, and genome-
wide protein-nucleic acid interaction analysis.
Overview
Illumina sequencing technology works in three basic steps: amplify, sequence, and analyze. The process begins
with purified DNA. The DNA gets chopped up into smaller pieces and given adapters, indices, and other kinds
of molecular modifications that act as reference points during amplification, sequencing, and analysis. The
modified DNA is loaded onto a specialized chip where amplification and sequencing will take place. Along
the bottom of the chip are hundreds of thousands of oligonucleotides (short, synthetic pieces of DNA). They
are anchored to the chip and able to grab DNA fragments that have complementary sequences. Once the
fragments have attached, a phase called cluster generation begins. This step makes about a thousand copies of
each fragment of DNA. Next, primers and modified nucleotides enter the chip. These nucleotides have
reversible 3' blockers that force the polymerase to add on only one nucleotide at a time as well as fluorescent
tags. After each round of synthesis, a camera takes a picture of the chip. A computer determines what base was
added by the wavelength of the fluorescent tag and records it for every spot on the chip. After each round, non-
incorporated molecules are washed away. A chemical deblocking step then removes the 3’ terminal
blocking group and the dye in a single step. The process continues until the full DNA molecule is
sequenced.
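The cycle-by-cycle logic above can be sketched as a toy simulation of one cluster. The dye-to-base map here is a hypothetical one-dye-per-base scheme; real Illumina chemistries use 4-, 2-, or even 1-channel color encodings. Function names are illustrative.

```python
# Hypothetical one-dye-per-base map, for illustration only.
DYE = {"A": "green", "C": "red", "G": "blue", "T": "yellow"}

def sequence_by_synthesis(fragment):
    """One chemistry cycle per template position: a reversibly terminated,
    dye-labeled nucleotide is incorporated, imaged, and then deblocked."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    read = []
    colors = []
    for base in fragment:
        incorporated = complement[base]   # blocked nucleotide pairs with template
        colors.append(DYE[incorporated])  # camera images the cluster's color
        read.append(incorporated)         # base call decoded from the color
        # deblocking: dye and 3' terminator cleaved so the next cycle can extend
    return "".join(read), colors

read, colors = sequence_by_synthesis("ATGC")
print(read, colors)  # TACG ['yellow', 'green', 'red', 'blue']
```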
C. Nanopore Sequencing
Nanopore sequencing is a third-generation approach used in
the sequencing of biopolymers, specifically polynucleotides in
the form of DNA or RNA.
Using nanopore sequencing, a single molecule of DNA or
RNA can be sequenced without the need for PCR amplification
or chemical labeling of the sample; every previously developed
sequencing approach requires at least one of these
steps. Nanopore
sequencing has the potential to offer relatively low-cost
genotyping, high mobility for testing, and rapid processing of
samples with the ability to display results in real-time.
Types
Biological
Biological nanopore sequencing relies on the use of transmembrane proteins, called porins, embedded in
lipid membranes so as to create size-dependent porous surfaces, with nanometer-scale "holes" distributed
across the membranes. Sufficiently low translocation velocity can be attained through the incorporation
of various proteins that facilitate the movement of DNA or RNA through the pores of the lipid membranes.
Solid state
Solid state nanopore sequencing approaches, unlike biological nanopore sequencing, do not incorporate
proteins into their systems. Instead, solid state nanopore technology uses various metal or metal alloy
substrates with nanometer sized pores that allow DNA or RNA to pass through. These substrates most
often serve integral roles in the sequence recognition of nucleic acids as they translocate through the
channels along the substrates.
D. Nanoball Sequencing
DNA nanoball sequencing is a high throughput sequencing technology that is used to determine the entire
genomic sequence of an organism. The method uses rolling circle replication to amplify small fragments of
genomic DNA into DNA nanoballs. Fluorescent nucleotides bind to complementary nucleotides and are then
polymerized to anchor sequences bound to known sequences on the DNA template. The base order is
determined via the fluorescence of the bound nucleotides. This DNA sequencing method allows large numbers
of DNA nanoballs to be sequenced per run at lower reagent costs compared to other next generation sequencing
platforms. However, a limitation of this method is that it generates only short sequences of DNA, which
presents challenges to mapping its reads to a reference genome.
Early techniques
Sequencing of nearly an entire human genome was first accomplished in 2000 partly through the use of shotgun
sequencing technology. While full genome shotgun sequencing for small (4000–7000 base pair) genomes was
already in use in 1979, broader application benefited from pairwise end sequencing, known colloquially as
double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated
genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of
a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data
was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two
sequences were oriented in opposite directions and were about the length of a fragment apart from each other was
valuable in reconstructing the sequence of the original target fragment.
The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT
locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun
sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming
fragments of constant length, was in 1991. In 1995 the innovation of using fragments of varying sizes was
introduced, and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets.
The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the entire
genome of the bacterium Haemophilus influenzae in 1995, and then by Celera Genomics to sequence the entire
fruit fly genome in 2000, and subsequently the entire human genome. Applied Biosystems, now called Life
Technologies, manufactured the automated capillary sequencers utilized by both Celera Genomics and The
Human Genome Project.
Current techniques
While capillary sequencing was the first approach to successfully sequence a nearly full human genome, it is still
too expensive and takes too long for commercial purposes. Since 2005 capillary sequencing has been
progressively displaced by high-throughput (formerly "next-generation") sequencing technologies such as
Illumina dye sequencing, pyrosequencing, and SMRT sequencing. All of these technologies continue to employ
the basic shotgun strategy, namely, parallelization and template generation via genome fragmentation.
Other technologies are emerging, including nanopore technology. Though nanopore sequencing technology is
still being refined, its portability and potential capability of generating long reads are of relevance to whole-
genome sequencing applications.
GENOME ASSEMBLY
The raw sequences obtained from genome sequencing projects must be assembled into larger sequences; that is,
the bases must be pieced together in their correct order as they are found in the genome. Once assembly is
complete, that is often the point when “working drafts” of genome sequences are announced. The work is not
completed at that point, because there are still many gaps in the sequences to fill in as well as errors from the
sequencing. Finishing the genome sequence is the next step, producing a highly accurate sequence with less than
one error per 10,000 bases, and as many gaps as possible filled in.
The sequenced fragment assembly method requires that possible overlaps be identified first, so as to detect
clones with common DNA sequences. If two clones are found to overlap, they are merged to form what is called
a contig, a term designating a set of fragments connected to each other by overlapping sequences that are either
identical or very similar (within the limits of sequencing error). The next step consists in step-by-step comparison
of each new fragment with contigs that have already been identified. This comparison must take into account the
two possible relative orientations of the two sequences: If the fragment overlaps one contig, that contig is
extended; otherwise a new contig consisting of this fragment alone is created. When a fragment simultaneously
overlaps two contigs, both are fused with the fragment. At any time during a major sequencing project, the data
correspond to a set of several contigs; ideally, a single contig covering the whole sequence remains at the end
of the project.
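The extend-or-create logic described above can be sketched as follows. This toy version compares exact suffix/prefix overlaps only, and ignores sequencing errors, reverse-complement orientation, and the contig-fusion case; all names are illustrative.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def add_fragment(contigs, frag, min_len=3):
    """Extend an existing contig with frag, or start a new contig."""
    for i, contig in enumerate(contigs):
        n = overlap_len(contig, frag, min_len)
        if n:  # frag extends the contig on the right
            contigs[i] = contig + frag[n:]
            return contigs
        n = overlap_len(frag, contig, min_len)
        if n:  # frag extends the contig on the left
            contigs[i] = frag + contig[n:]
            return contigs
    contigs.append(frag)  # no overlap found: a new contig of this fragment alone
    return contigs

contigs = []
for frag in ["ATGGCGT", "GCGTACG", "TTTTCCC"]:
    add_fragment(contigs, frag)
print(contigs)  # ['ATGGCGTACG', 'TTTTCCC']
```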
While a contig is being assembled, a consensus sequence associated with it is defined according to the alignment
of its constituent fragments. The consensus sequence compares the positions being read and checks their
agreement, revealing any differences or ambiguity due to data errors (unread or incorrect nucleotide
interpretation). Differences and interpretation ambiguities may be resolved by further data analysis and if
necessary, by additional sequencing. This verification step is indispensable but time-consuming, since it is at least
partly manual.
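Majority voting over alignment columns, with ambiguous positions flagged for further analysis, can be sketched as below. The sketch assumes the fragments have already been gap-padded into a common coordinate system; names are illustrative.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over a gap-padded alignment of fragments.

    Columns where the call is ambiguous (a tie, or no coverage) are reported
    as 'N' so they can be resolved by further analysis or resequencing.
    """
    length = max(len(r) for r in aligned_reads)
    out = []
    for col in range(length):
        counts = Counter(r[col] for r in aligned_reads
                         if col < len(r) and r[col] != "-")
        if not counts:
            out.append("N")  # no fragment covers this position
            continue
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append("N")  # disagreement with no majority
        else:
            out.append(ranked[0][0])
    return "".join(out)

reads = ["ATGCC----",
         "--GCATTAG"]
print(consensus(reads))  # ATGCNTTAG: column 5 disagrees (C vs A)
```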
Overlap identification
Overlaps may be located by pairwise alignment, using dynamic programming algorithms, of each new sequence
against the already assembled contigs. If the alignment score is above a given threshold, the two sequences are
considered to overlap. However, this ‘brute force’ method is costly in terms of calculation time, since the
alignment algorithms are O(nm), where n and m are the lengths of the two compared sequences. If k fragments
are to be assembled, the algorithm is O(k²), which is prohibitive for very large genomes, where k may exceed
10⁶ fragments. In order to simplify this problem, note that overlapping sequences are usually identical (or
nearly so, except for a few rare errors) throughout the entire common region.
In 1982, this observation led Roger Staden of Cambridge University to propose a more efficient strategy, which
has since been improved. It consists in creating a table of the 4ⁿ possible nucleotide n-tuples (n being of the order
of 6 to 12). A list of the fragments containing each n-tuple is compiled for each entry in the table, which is
prepared in linear time O(k). Two overlapping fragments will share a great number of n-tuples, namely all
those that correspond to the common region. Applying this criterion, it is possible to identify candidate overlapping
fragments simply by looking up fragments that share several n-tuples in the table.
Overlapping may then be verified by applying a classical alignment method. This approach differs from the ‘brute
force’ method in that the alignment algorithm is used only in cases in which overlapping is highly probable. The
cost of this method is thus approximately a linear function of the number of gels to be analyzed.
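Staden's table-lookup strategy can be sketched in a few lines: index every n-tuple (k-mer in modern terminology), then nominate fragment pairs sharing several of them as candidates for the expensive alignment check. Parameter values and names are illustrative.

```python
from collections import defaultdict

def kmer_index(fragments, k=6):
    """Map each k-mer (n-tuple) to the set of fragments containing it;
    built in time linear in the total number of sequenced bases."""
    index = defaultdict(set)
    for name, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_overlaps(fragments, k=6, min_shared=3):
    """Pairs of fragments sharing at least min_shared k-mers; only these
    candidates are passed on to the costly alignment verification."""
    index = kmer_index(fragments, k)
    shared = defaultdict(int)
    for names in index.values():
        for a in names:
            for b in names:
                if a < b:
                    shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}

frags = {
    "f1": "ATGGCGTACGATTT",
    "f2": "GTACGATTTCCGGA",   # overlaps f1 by 9 bases
    "f3": "CCCCCCCCCCCCCC",   # unrelated
}
print(candidate_overlaps(frags))  # {('f1', 'f2')}
```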
The heuristic strategy developed by Staden can nevertheless be ineffective in certain cases, either owing to
insufficient sequence data quality, resulting in failure to identify overlaps, or because the sequence analyzed
contains several repetitions of a given motif, which can introduce contig fragment connection errors at the
repetition sites. Several other methods avoid this obstacle by analyzing all possible overlaps according to criteria
that permit evaluation of overlap quality (alignment scores). A graph of all possible connections among fragments
is then drawn, in which the best pathway (‘minimal cost pathway’) is determined. However, although these
methods (based on the Dijkstra algorithm) guarantee that the alignment obtained is optimum overall, they are
considerably more costly in terms of calculation time.
Assembly Paradigms
1. Greedy Approaches
One of the simplest strategies for assembling a genome sequence involves iteratively joining together the reads
in decreasing order of the quality of their overlaps. The process starts by joining the two reads that overlap the
best (in terms of either the length of the overlap or a more complex quality measure that accounts for base quality
estimates), then repeats this process until a predefined minimum quality threshold is reached. As a result, nascent
assembled sequences (i.e., contigs) grow through either the addition of new reads or a joining with previously
constructed contigs. Read overlaps that conflict with already constructed contigs are ignored. This strategy is
called greedy because it makes the greediest (locally optimal) choice at each step. Despite its simplicity, this
approach and its variants provide a good approximation for the optimal assembly.
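A minimal sketch of the greedy strategy, using overlap length as the quality measure (real assemblers also weigh base-quality estimates); function names and the threshold are illustrative.

```python
def best_overlap(a, b, min_len=3):
    """Longest suffix of a that matches a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the best (longest)
    overlap until no overlap meets the minimum quality threshold."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = best_overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlap above threshold: stop merging
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["TTACGG", "ACGGCA", "GGCATT"]))  # ['TTACGGCATT']
```

Note the locally optimal choice at each step: once two reads are merged, conflicting overlaps involving them are never reconsidered, which is exactly why repeats defeat this strategy.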
Despite its tremendous early successes, the greedy strategy has a severe limitation: Because of its local nature, it
cannot effectively handle repeated genomic regions. As the size and complexity of the genomes being sequenced
increased, greedy approaches were replaced by more complex graph-based algorithms that are better able to model
and resolve highly repetitive genomic sequences.
2. Graph-Based Approaches
Theoretical studies on combinatorial solutions to the sequence assembly problem led to the development of
practical software using graph-based representations of the sequence data.
Graph-based sequence assembly models represent sequence reads and their inferred relationships to one another
as vertices and edges in the graph. Walks through the graph describe an ordering of the reads that can be assembled
together. The sequence assembler tries to find a walk that best reconstructs the underlying genome while avoiding
generating misassemblies by taking erroneous paths caused by repeats.
A. OLC graphs-
In the simplest graph-based model, each sequence read is a vertex in the graph, and a pair of vertices are
linked with an edge if they overlap. In this formulation, overlaps are represented by edges from a terminal
vertex of one read to a terminal vertex of another read. Regardless of the representation of the graph, the
assembly process typically follows three main stages.
First, overlapping pairs of reads are detected.
Second, the graph is constructed, and an appropriate ordering and orientation (layout) of the reads are
found.
Finally, a consensus sequence is computed from the ordered and oriented reads.
The set of consensus sequences is then output by the assembler as the sequence contigs. Genome assemblers
following this paradigm are called OLC assemblers for the three main stages of the assembly: overlap, layout,
and consensus.
Overlap-
Requires significant compute time. Dynamic programming is used to check whether each pair has a significant
overlap (typically determined by the length of the overlap and the similarity within the overlapping region).
Such an algorithm requires O(N²) time, where N is the total number of sequenced bases. This brute-force
approach can be used to assemble the sequences of only very small genomes.
To accelerate overlap detection, an index can be constructed that maps k-mers to the list of reads containing
the k-mer, used to quickly screen for reads that may overlap, and dynamic programming is then performed to
verify the overlap. This technique drastically reduces the search space and has been widely used.
Layout and Consensus-
Because a globally optimal solution to the layout stage of the assembly is computationally infeasible, the
layout step typically tries to generate unitigs, which are collections of reads that can be unambiguously
assembled without significant chance of misassembly. The assembler typically removes low-quality sequence
reads and overlaps that are likely to be sequencing artifacts, and then removes redundant edges in a process
called transitive reduction. Finally, the layout algorithm finds unambiguous regions of the graph. After the
reads are ordered and oriented, a multiple alignment is constructed from the chain of overlaps, and a consensus
sequence is inferred.
B. De Bruijn graphs
The de Bruijn graph method of sequence assembly has its roots in theoretical work from the late 1980s.
Pevzner studied the problem of reconstructing a genome sequence when only its set of constituent k-mers is
known. This theoretical work and subsequent development laid the foundation for de Bruijn graph-based
assembly of whole-genome sequencing data.
In this type of assembly, each read is broken into a sequence of overlapping k-mers. The distinct k-mers are
added as vertices to the graph, and k-mers that originate from adjacent positions in a read are linked by an
edge. The assembly problem can then be formulated as finding a walk through the graph that visits each edge
in the graph once—an Eulerian path problem.
In practice, sequencing errors and sampling biases obscure the graph, so a complete Eulerian tour through the
entire graph is typically not sought. Even when an Eulerian path through the entire graph can be found, it is
unlikely to reflect an accurate sequence of the genome because of the presence of repeats, as there are a
potentially exponential number of Eulerian traversals of the graph, only one of which is correct. In most
instances, the assembler attempts to construct contigs consisting of the unambiguous, unbranching regions of
the graph.
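The construction above can be sketched on error-free reads, where an unambiguous walk recovers the genome; with errors and repeats, real assemblers instead report the unbranching regions as contigs. A minimal sketch, assuming exact reads and no repeats:

```python
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Vertices are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (k-1)-mer to its suffix (k-1)-mer. Duplicate k-mers collapse into
    one edge in this sketch (real assemblers track edge multiplicity)."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

def walk(edges):
    """Follow edges from the unique source vertex, consuming each edge once:
    an Eulerian path, which works only on this error-free, repeat-free toy."""
    indeg = defaultdict(int)
    for outs in edges.values():
        for w in outs:
            indeg[w] += 1
    start = next(v for v in edges if indeg[v] == 0)  # vertex with no incoming edge
    remaining = {u: sorted(ws) for u, ws in edges.items()}
    path, v = [start], start
    while remaining.get(v):
        v = remaining[v].pop(0)
        path.append(v)
    # spell the sequence: the start vertex plus the last base of each step
    return path[0] + "".join(u[-1] for u in path[1:])

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # error-free reads from ATGGCGTGCA
print(walk(de_bruijn(reads, k=4)))  # ATGGCGTGCA
```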