OMICS
OMICS refers to a field of study in biology ending in -omics, such as genomics, proteomics or metabolomics.
The related suffix -ome is used to address the objects of study of such fields, such as the genome, proteome or
metabolome respectively. Omics aims at the collective characterization and quantification of pools of biological
molecules that translate into the structure, function, and dynamics of an organism or organisms.
Simply stated, whenever the suffix -omics is attached to a biological term (e.g., gene + omics = genomics), it refers to the study of the corresponding biological entity as a whole, from its basic characteristics to its structure, dynamics, expression, and production.
GENOMICS
Genomics is the study of the complete DNA sequence of an organism. It provides an overall view of a genome, its differences from and similarities to the genomes of other organisms, and helps in finding solutions to many real-life problems.
Alternatively:
Genomics is an interdisciplinary field of science focusing on the structure, function, evolution, mapping, and
editing of genomes.
A genome is an organism's complete set of DNA, including all of its genes. In contrast to genetics, which refers
to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization
and quantification of genes, which direct the production of proteins with the assistance of enzymes and messenger
molecules.
Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA
sequencing and bioinformatics to assemble and analyse the function and structure of entire genomes.
Structural Genomics
Structural genomics seeks to describe the 3D structure of every protein encoded by a given genome.
This genome-based approach allows for a high-throughput method of structure determination by a
combination of experimental and modelling approaches.
The principal difference between structural genomics and traditional structural prediction is that structural
genomics attempts to determine the structure of every protein encoded by the genome, rather than focusing
on one particular protein.
With full-genome sequences available, structure prediction can be done more quickly through a
combination of experimental and modelling approaches, especially because the availability of large
number of sequenced genomes and previously solved protein structures allows scientists to model protein
structure on the structures of previously solved homologs.
Structural genomics takes advantage of completed genome sequences in several ways in order to determine
protein structures.
The gene sequence of the target protein can also be compared to a known sequence and structural
information can then be inferred from the known protein’s structure.
Structural genomics can be used to predict novel protein folds based on other structural data.
Structural genomics can also take modelling-based approach that relies on homology between the unknown
protein and a solved protein structure.
Functional Genomics
Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of
gene expression and protein–protein interactions, as opposed to the static aspects of the genomic
information such as DNA sequence or structures.
Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA
transcripts, and protein products.
A key characteristic of functional genomics studies is their genome-wide approach to these questions,
generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.
Functional genomics may be applied to the complete collection of DNA (the genome), RNA (the
transcriptome), or protein (the proteome) of an organism.
Functional genomics implies the use of high‐throughput screens, in contrast to traditional methods of
biology in which one gene or protein has been characterized experimentally in depth. Such traditional
methods commonly complement high-throughput approaches.
Functional genomics often involves the perturbation of gene function to investigate the consequence on
the function of other genes in a genome.
One of the most challenging and fundamental problems in modern biology is to understand the relationship
between genotype and phenotype. Connecting the two is a fundamental part of functional genomics.
Metagenomics
Metagenomics is the study of genetic material recovered directly from environmental samples.
The broad field may also be referred to as environmental genomics, ecogenomics or community
genomics.
While traditional microbiology and microbial genome sequencing and genomics rely upon
cultivated clonal cultures, early environmental gene sequencing cloned specific genes (often the 16S
rRNA gene) to produce a profile of diversity in a natural sample. Such work revealed that the vast majority
of microbial biodiversity had been missed by cultivation-based methods.
Recent studies use either "shotgun" or PCR-directed sequencing to obtain largely unbiased samples of all
genes from all members of the sampled communities.
Because of its ability to reveal the previously hidden diversity of microscopic life, metagenomics offers a
powerful lens for viewing the microbial world that has the potential to revolutionize understanding of the
entire living world.
Epigenomics
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell,
known as the epigenome.
Modifications that alter gene expression without changing the primary DNA sequence, and that are
heritable both mitotically and meiotically, are classified as epigenetic modifications. DNA methylation
and histone modification are among the best-characterized epigenetic processes.
The field is analogous to genomics and proteomics, which are the studies of a cell's genome and proteome,
respectively.
Epigenetic modifications are reversible modifications on a cell’s DNA or histones that affect gene
expression without altering the DNA sequence.
Epigenomic maintenance is a continuous process and plays an important role in the stability of eukaryotic
genomes by taking part in crucial biological mechanisms such as DNA repair. Plant flavones have been
reported to inhibit epigenomic marks associated with cancer.
Two of the most characterized epigenetic modifications are DNA methylation and histone modification.
Epigenetic modifications play an important role in gene expression and regulation, and are involved in
numerous cellular processes such as in differentiation/development and tumorigenesis. The study of
epigenetics on a global level has been made possible only recently through the adaptation of genomic
high-throughput assays.
SEQUENCING
1. Sanger’s Method – Dideoxy Nucleotide Chain Termination Method
The most commonly used method of DNA sequencing, called dideoxy sequencing, is based on DNA
replication. Using a sequence of interest already cloned into a vector as a template, DNA polymerase adds
nucleotides to a short primer, until extension of the new DNA strand is stopped by inclusion of a modified
nucleotide. This generates an array of short fragments, which can be interpreted by gel electrophoresis either
in an automated DNA sequencer or in a standard gel apparatus. Both linear DNA and circular DNA can be
sequenced using the dideoxy DNA sequencing method.
Principle
i. Di-deoxy Nucleotides
If a normal dNTP precursor is used, the extended DNA chain carries a 3′-OH group at its end, so another
nucleotide can be added by DNA polymerase. However, if a dideoxy precursor such as ddTTP is used, the
extended chain ends in a 3′-H instead, and no further nucleotide can be added by DNA polymerase. In
other words, the addition of a dideoxynucleotide to a DNA chain being synthesized terminates the DNA
synthesis reaction.
This happens because the 3′ position lacks the hydroxyl group needed to form a phosphodiester bond with
the 5′-phosphate of the incoming nucleotide, so the chain cannot be extended any further.
ii. DNA Replication
The reaction protocol of the sequencing technique is derived from the existing natural phenomena of DNA
replication which takes place in the cells. Here, the DNA polymerase is performing its function as it
performs in the replication reaction, but the final product is different, because of the inclusion of a modified
nucleotide. Also, the reaction protocol is much similar to the PCR technique which is used to produce
multiple copies of a DNA sequence. The chain-terminator or dideoxy procedure for DNA sequencing
capitalizes on two properties of DNA polymerases:
Their ability to synthesize faithfully a complementary copy of a single-stranded DNA template
Their ability to use 2′, 3′- dideoxynucleotides as substrates
Reaction Mixture
The reaction mixture consists of the following components –
The DNA sample to be sequenced is the template. The template has to be converted into a single strand
by denaturing with NaOH or by heating. But if you are carrying out the sequencing reaction using a PCR
machine, denaturation of the template occurs as a part of the reaction cycle.
DNA primers. 5′ end radio-labeled DNA primers, which are short fragments of DNA complementary to
the template DNA. Primers are labeled with radioactive phosphate at the 5′ end.
A mixture of all four dNTPs—dATP, dGTP, dCTP, and dTTP. This mixture is distributed in appropriate
quantities among four reaction tubes labeled ‘A’, ‘T’, ‘G’, and ‘C’.
ddNTPs - To each tube a small quantity of the corresponding ddNTP is added. The ddNTPs are either
radiolabeled or tagged with a fluorescent dye that absorbs light at one wavelength and emits it at another,
making the terminated fragments detectable. In the tube labeled ‘A’ a small amount of ddATP is added;
ddTTP is added to tube ‘T’, ddGTP to tube ‘G’, and ddCTP to tube ‘C’. The concentration of each ddNTP
is approximately 1% of the concentration of the dNTPs.
Taq DNA polymerase - When all the components are ready, Taq DNA polymerase is added to all four
tubes and the reaction, i.e., the synthesis of DNA by elongation of the primer, starts.
Reaction Steps of Sequencing Methods
A. Denaturation of the Template Strand
The template strand is heated to 92–98 °C so that the strands separate to form single-stranded DNA, or it
is denatured with a chemical agent (NaOH).
B. Primer Annealing
The reaction is brought to a lower temperature (50–65 °C) so that the primer can bind to the template and
the polymerase can start elongating the chain.
C. Elongation and Chain Termination
Since the reaction uses Taq polymerase, the temperature is raised to 72 °C and elongation is allowed to
continue for a set period. During this time, the dNTPs and ddNTPs are incorporated into the chain by the
polymerase. Whenever the chain incorporates a ddNTP (e.g., ddATP opposite a T in the template),
elongation of the chain stops.
D. Electrophoresis and Autoradiogram
The sample is subjected to electrophoresis (gel or capillary), and an autoradiogram is used to read the
sequence of the template.
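The four-tube reaction and the gel readout described above can be sketched in a short simulation. This is a toy illustration, not a laboratory protocol: it deterministically enumerates one terminated fragment per position, whereas in a real reaction termination is stochastic and the ~1% ddNTP concentration produces fragments of every length over many template molecules. The function names are illustrative.

```python
def sanger_fragments(template):
    """Enumerate every chain-terminated fragment, grouped by the tube
    ('A', 'T', 'G', 'C') whose ddNTP ended the chain."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    new_strand = "".join(complement[b] for b in template)
    tubes = {"A": [], "T": [], "G": [], "C": []}
    for i, base in enumerate(new_strand):
        # a chain terminated at position i ends with the ddNTP matching `base`
        tubes[base].append(new_strand[: i + 1])
    return tubes

def read_gel(tubes):
    """Read the sequence as from a gel: the shortest fragment migrates
    farthest, so sorting fragments by length gives the 5'-to-3' base order."""
    fragments = [(len(f), tube) for tube, frags in tubes.items() for f in frags]
    return "".join(tube for _, tube in sorted(fragments))

tubes = sanger_fragments("TACGGT")
print(read_gel(tubes))  # ATGCCA, the strand synthesized against the template
```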
Limitations of Sanger Sequencing
Sanger sequencing has a number of limitations that can lead to problems with results and difficulty using the
method in general:
Sanger methods can only sequence short pieces of DNA, about 300 to 1,000 base pairs.
The quality of a Sanger sequence is often not very good in the first 15 to 40 bases because that is where
the primer binds.
Sequence quality degrades after 700 to 900 bases.
If the DNA fragment being sequenced has been cloned, some of the cloning vector sequence may find its
way into the final sequence.
Nonspecific primer binding
Formation of DNA secondary structures which alter sequencing fidelity
A. Pyrosequencing
Pyrosequencing is a DNA sequencing method that involves determining which of the four bases is incorporated
at each step in the copying of a DNA template. Pyrosequencing is named for the pyrophosphate molecule (two
phosphate groups connected by a covalent bond) that is released when a dNTP is used by DNA polymerase to
extend a new DNA strand.
Principle and Technology
The DNA to be sequenced is denatured to form single-stranded DNA. The single-stranded DNA is attached to
a solid, microscopic bead that is placed in a microscopic well in the pyrosequencer. The sequencing reaction
mixture, consisting of a primer, DNA polymerase, and three other enzymes, is added. The four dNTPs are not
present in the initial mix, but are added sequentially to and removed from the pyrosequencing reaction, such
that only one dNTP is present in the reaction at any one time. This cycle of addition and removal of each dNTP
in turn repeats over and over. As DNA polymerase moves along a single stranded template, each of the four
nucleoside triphosphates is fed sequentially and then removed. If one of the four bases is incorporated then
pyrophosphate is released and this is detected in an enzyme cascade that emits light.
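The cycle of sequential dNTP addition and light detection can be sketched as a toy simulation. Light intensity here is simply the count of bases incorporated per dispensation, which is also how homopolymer runs show up as taller peaks in a real pyrogram; the function name and dispensation order are illustrative assumptions.

```python
def pyrosequence(template, cycles=8, order="ATGC"):
    """Simulate pyrosequencing of the complement of a single-stranded template.

    dNTPs are dispensed one at a time in a fixed order; whenever the dispensed
    base pairs with the next template base(s), pyrophosphate is released and a
    light peak proportional to the number of incorporations is recorded.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    target = "".join(complement[b] for b in template)  # strand being built
    pos = 0
    peaks = []  # (dispensed base, light intensity) per dispensation
    for _ in range(cycles):
        for base in order:
            n = 0
            while pos < len(target) and target[pos] == base:
                n += 1  # a homopolymer run incorporates several bases at once
                pos += 1
            peaks.append((base, n))
            if pos == len(target):
                return peaks
    return peaks

# template TTACG -> synthesized strand AATGC; the 'A' peak is twice as tall
for base, light in pyrosequence("TTACG"):
    print(base, "*" * light)
```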
Variants of Pyrosequencing
There are two variants of the pyrosequencing technique.
In solid-phase pyrosequencing, the DNA to be
sequenced is immobilized and a washing step is used
to remove the excess substrate after each nucleotide
addition. The four different nucleotides are added
stepwise to the immobilized primed DNA template
and the incorporation event is followed using the
enzyme ATP sulfurylase and luciferase. After each
nucleotide addition, a washing step is performed to
allow iterative addition.
In liquid-phase sequencing a nucleotide degrading
enzyme (apyrase) is introduced to make a four enzyme
system. Addition of this enzyme has eliminated the
need for a solid support and intermediate washing
thereby enabling the pyrosequencing reaction to be
performed in a single tube. However, without the
washing step, inhibitory substances can accumulate.
Primed DNA template and four enzymes involved in
liquid-phase pyrosequencing are placed in a well of a
microtitre plate.
The four different nucleotides are added stepwise and incorporation is followed using the enzyme ATP
sulfurylase and luciferase. The nucleotides are continuously degraded by the nucleotide-degrading enzyme,
allowing addition of the subsequent nucleotide.
B. Illumina Dye Sequencing
Illumina dye sequencing is a technique used to determine the series of base pairs in DNA, also known as DNA
sequencing. This sequencing method is based on reversible dye-terminators that enable the identification of
single bases as they are introduced into DNA strands. It can also be used for whole-genome and region
sequencing, transcriptome analysis, metagenomics, small RNA discovery, methylation profiling, and genome-
wide protein-nucleic acid interaction analysis.
Overview
Illumina sequencing technology works in three basic steps: amplify, sequence, and analyze. The process begins
with purified DNA. The DNA gets chopped up into smaller pieces and given adapters, indices, and other kinds
of molecular modifications that act as reference points during amplification, sequencing, and analysis. The
modified DNA is loaded onto a specialized chip where amplification and sequencing will take place. Along
the bottom of the chip are hundreds of thousands of oligonucleotides (short, synthetic pieces of DNA). They
are anchored to the chip and able to grab DNA fragments that have complementary sequences. Once the
fragments have attached, a phase called cluster generation begins. This step makes about a thousand copies of
each fragment of DNA. Next, primers and modified nucleotides enter the chip. These nucleotides have
reversible 3' blockers that force the polymerase to add on only one nucleotide at a time as well as fluorescent
tags. After each round of synthesis, a camera takes a picture of the chip. A computer determines what base was
added by the wavelength of the fluorescent tag and records it for every spot on the chip. After each round, non-
incorporated molecules are washed away. A chemical deblocking step then removes the 3’ terminal
blocking group and the dye in a single step. The process continues until the full DNA molecule is
sequenced.
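The cycle-by-cycle logic above can be sketched as a toy simulation of one cluster. The dye-to-base map here is a hypothetical one-dye-per-base scheme; real Illumina chemistries use 4-, 2-, or even 1-channel color encodings. Function names are illustrative.

```python
# Hypothetical one-dye-per-base map, for illustration only.
DYE = {"A": "green", "C": "red", "G": "blue", "T": "yellow"}

def sequence_by_synthesis(fragment):
    """One chemistry cycle per template position: a reversibly terminated,
    dye-labeled nucleotide is incorporated, imaged, and then deblocked."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    read = []
    colors = []
    for base in fragment:
        incorporated = complement[base]   # blocked nucleotide pairs with template
        colors.append(DYE[incorporated])  # camera images the cluster's color
        read.append(incorporated)         # base call decoded from the color
        # deblocking: dye and 3' terminator cleaved so the next cycle can extend
    return "".join(read), colors

read, colors = sequence_by_synthesis("ATGC")
print(read, colors)  # TACG ['yellow', 'green', 'red', 'blue']
```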
C. Nanopore Sequencing
Nanopore sequencing is a third-generation approach used in
the sequencing of biopolymers, specifically polynucleotides in
the form of DNA or RNA.
Using nanopore sequencing, a single molecule of DNA or
RNA can be sequenced without the need for PCR amplification
or chemical labeling of the sample; every previously developed
sequencing approach requires at least one of these
steps. Nanopore
sequencing has the potential to offer relatively low-cost
genotyping, high mobility for testing, and rapid processing of
samples with the ability to display results in real-time.
Types
Biological
Biological nanopore sequencing relies on the use of transmembrane proteins, called porins, embedded in
lipid membranes so as to create size-dependent porous surfaces, with nanometer-scale "holes" distributed
across the membranes. Sufficiently low translocation velocity can be attained through the incorporation
of various proteins that facilitate the movement of DNA or RNA through the pores of the lipid membranes.
Solid state
Solid state nanopore sequencing approaches, unlike biological nanopore sequencing, do not incorporate
proteins into their systems. Instead, solid state nanopore technology uses various metal or metal alloy
substrates with nanometer sized pores that allow DNA or RNA to pass through. These substrates most
often serve integral roles in the sequence recognition of nucleic acids as they translocate through the
channels along the substrates.
D. Nanoball Sequencing
DNA nanoball sequencing is a high throughput sequencing technology that is used to determine the entire
genomic sequence of an organism. The method uses rolling circle replication to amplify small fragments of
genomic DNA into DNA nanoballs. Fluorescent nucleotides bind to complementary nucleotides and are then
polymerized to anchor sequences bound to known sequences on the DNA template. The base order is
determined via the fluorescence of the bound nucleotides. This DNA sequencing method allows large numbers
of DNA nanoballs to be sequenced per run at lower reagent costs compared to other next generation sequencing
platforms. However, a limitation of this method is that it generates only short sequences of DNA, which
presents challenges to mapping its reads to a reference genome.
Early techniques
Sequencing of nearly an entire human genome was first accomplished in 2000 partly through the use of shotgun
sequencing technology. While full genome shotgun sequencing for small (4000–7000 base pair) genomes was
already in use in 1979, broader application benefited from pairwise end sequencing, known colloquially as
double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated
genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of
a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data
was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two
sequences were oriented in opposite directions and were about the length of a fragment apart from each other was
valuable in reconstructing the sequence of the original target fragment.
The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT
locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun
sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming
fragments of constant length, was in 1991. In 1995 the innovation of using fragments of varying sizes was
introduced, and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets.
The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the entire
genome of the bacterium Haemophilus influenzae in 1995, and then by Celera Genomics to sequence the entire
fruit fly genome in 2000, and subsequently the entire human genome. Applied Biosystems, now called Life
Technologies, manufactured the automated capillary sequencers utilized by both Celera Genomics and The
Human Genome Project.
Current techniques
While capillary sequencing was the first approach to successfully sequence a nearly full human genome, it is still
too expensive and takes too long for commercial purposes. Since 2005 capillary sequencing has been
progressively displaced by high-throughput (formerly "next-generation") sequencing technologies such as
Illumina dye sequencing, pyrosequencing, and SMRT sequencing. All of these technologies continue to employ
the basic shotgun strategy, namely, parallelization and template generation via genome fragmentation.
Other technologies are emerging, including nanopore technology. Though nanopore sequencing technology is
still being refined, its portability and potential capability of generating long reads are of relevance to whole-
genome sequencing applications.
GENOME ASSEMBLY
The raw sequences obtained from genome sequencing projects must be assembled into larger sequences; that is,
the bases must be pieced together in their correct order as they are found in the genome. Once assembly is
complete, that is often the point when “working drafts” of genome sequences are announced. The work is not
completed at that point, because there are still many gaps in the sequences to fill in as well as errors from the
sequencing. Finishing the genome sequence is the next step, producing a highly accurate sequence with less than
one error per 10,000 bases, and as many gaps as possible filled in.
The sequenced fragment assembly method requires that possible overlaps be identified first, so as to detect
clones with common DNA sequences. If two clones are found to overlap, they are merged to form what is called
a contig, a term designating a set of fragments connected to each other by overlapping sequences that are either
identical or very similar (within the limits of sequencing error). The next step consists in step-by-step comparison
of each new fragment with contigs that have already been identified. This comparison must take into account the
two possible relative orientations of the two sequences: If the fragment overlaps one contig, that contig is
extended; otherwise a new contig consisting of this fragment alone is created. When a fragment simultaneously
overlaps two contigs, both are fused with the fragment. At any time during a major sequencing project, the data
correspond to a set of several contigs; ideally, a single contig covering the whole sequence remains at the end
of the project.
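The extend-or-create logic described above can be sketched as follows. This toy version compares exact suffix/prefix overlaps only, and ignores sequencing errors, reverse-complement orientation, and the contig-fusion case; all names are illustrative.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def add_fragment(contigs, frag, min_len=3):
    """Extend an existing contig with frag, or start a new contig."""
    for i, contig in enumerate(contigs):
        n = overlap_len(contig, frag, min_len)
        if n:  # frag extends the contig on the right
            contigs[i] = contig + frag[n:]
            return contigs
        n = overlap_len(frag, contig, min_len)
        if n:  # frag extends the contig on the left
            contigs[i] = frag + contig[n:]
            return contigs
    contigs.append(frag)  # no overlap found: a new contig of this fragment alone
    return contigs

contigs = []
for frag in ["ATGGCGT", "GCGTACG", "TTTTCCC"]:
    add_fragment(contigs, frag)
print(contigs)  # ['ATGGCGTACG', 'TTTTCCC']
```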
While a contig is being assembled, a consensus sequence associated with it is defined according to the alignment
of its constituent fragments. The consensus sequence compares the positions being read and checks their
agreement, revealing any differences or ambiguity due to data errors (unread or incorrect nucleotide
interpretation). Differences and interpretation ambiguities may be resolved by further data analysis and if
necessary, by additional sequencing. This verification step is indispensable but time-consuming, since it is at least
partly manual.
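Majority voting over alignment columns, with ambiguous positions flagged for further analysis, can be sketched as below. The sketch assumes the fragments have already been gap-padded into a common coordinate system; names are illustrative.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over a gap-padded alignment of fragments.

    Columns where the call is ambiguous (a tie, or no coverage) are reported
    as 'N' so they can be resolved by further analysis or resequencing.
    """
    length = max(len(r) for r in aligned_reads)
    out = []
    for col in range(length):
        counts = Counter(r[col] for r in aligned_reads
                         if col < len(r) and r[col] != "-")
        if not counts:
            out.append("N")  # no fragment covers this position
            continue
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append("N")  # disagreement with no majority
        else:
            out.append(ranked[0][0])
    return "".join(out)

reads = ["ATGCC----",
         "--GCATTAG"]
print(consensus(reads))  # ATGCNTTAG: column 5 disagrees (C vs A)
```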
Overlap identification
Overlaps may be located by pairwise alignment, using dynamic programming algorithms, of each new sequence
against the already assembled contigs. If the alignment score is above a given threshold, the two sequences are
considered to overlap. However, this ‘brute force’ method is costly in terms of calculation time, since the
alignment algorithms are O(nm), where n and m are the lengths of the two compared sequences. If k fragments
are to be assembled, the algorithm is O(k²), which is prohibitive for very large genomes, where k may exceed
10⁶ fragments. In order to simplify this problem, note that overlapping sequences are usually identical (or
nearly so, except for a few rare errors) throughout the entire common region.
In 1982, this observation led Roger Staden of Cambridge University to propose a more efficient strategy, which
has since been improved. It consists in creating a table of the 4ⁿ possible nucleotide n-tuples (n being of the order
of 6 to 12). A list of the fragments containing each n-tuple is compiled for each entry in the table, which is
prepared in linear time O(k). Two overlapping fragments will share a great number of n-tuples, namely all
those that correspond to the common region. Applying this criterion, it is possible to identify candidate overlapping
fragments simply by looking up fragments that share several n-tuples in the table.
Overlapping may then be verified by applying a classical alignment method. This approach differs from the ‘brute
force’ method in that the alignment algorithm is used only in cases in which overlapping is highly probable. The
cost of this method is thus approximately a linear function of the number of gels to be analyzed.
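Staden's table-lookup strategy can be sketched in a few lines: index every n-tuple (k-mer in modern terminology), then nominate fragment pairs sharing several of them as candidates for the expensive alignment check. Parameter values and names are illustrative.

```python
from collections import defaultdict

def kmer_index(fragments, k=6):
    """Map each k-mer (n-tuple) to the set of fragments containing it;
    built in time linear in the total number of sequenced bases."""
    index = defaultdict(set)
    for name, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_overlaps(fragments, k=6, min_shared=3):
    """Pairs of fragments sharing at least min_shared k-mers; only these
    candidates are passed on to the costly alignment verification."""
    index = kmer_index(fragments, k)
    shared = defaultdict(int)
    for names in index.values():
        for a in names:
            for b in names:
                if a < b:
                    shared[(a, b)] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}

frags = {
    "f1": "ATGGCGTACGATTT",
    "f2": "GTACGATTTCCGGA",   # overlaps f1 by 9 bases
    "f3": "CCCCCCCCCCCCCC",   # unrelated
}
print(candidate_overlaps(frags))  # {('f1', 'f2')}
```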
The heuristic strategy developed by Staden can nevertheless be ineffective in certain cases, either owing to
insufficient sequence data quality, resulting in failure to identify overlaps, or because the sequence analyzed
contains several repetitions of a given motif, which can introduce contig fragment connection errors at the
repetition sites. Several other methods avoid this obstacle by analyzing all possible overlaps according to criteria
that permit evaluation of overlap quality (alignment scores). A graph of all possible connections among fragments
is then drawn, in which the best pathway (‘minimal cost pathway’) is determined. However, although these
methods (based on the Dijkstra algorithm) guarantee that the alignment obtained is optimum overall, they are
considerably more costly in terms of calculation time.
Assembly Paradigms
1. Greedy Approaches
One of the simplest strategies for assembling a genome sequence involves iteratively joining together the reads
in decreasing order of the quality of their overlaps. The process starts by joining the two reads that overlap the
best (in terms of either the length of the overlap or a more complex quality measure that accounts for base quality
estimates), then repeats this process until a predefined minimum quality threshold is reached. As a result, nascent
assembled sequences (i.e., contigs) grow through either the addition of new reads or a joining with previously
constructed contigs. Read overlaps that conflict with already constructed contigs are ignored. This strategy is
called greedy because it makes the greediest (locally optimal) choice at each step. Despite its simplicity, this
approach and its variants provide a good approximation for the optimal assembly.
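A minimal sketch of the greedy strategy, using overlap length as the quality measure (real assemblers also weigh base-quality estimates); function names and the threshold are illustrative.

```python
def best_overlap(a, b, min_len=3):
    """Longest suffix of a that matches a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the best (longest)
    overlap until no overlap meets the minimum quality threshold."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = best_overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlap above threshold: stop merging
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["TTACGG", "ACGGCA", "GGCATT"]))  # ['TTACGGCATT']
```

Note the locally optimal choice at each step: once two reads are merged, conflicting overlaps involving them are never reconsidered, which is exactly why repeats defeat this strategy.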
Despite its tremendous early successes, the greedy strategy has a severe limitation: Because of its local nature, it
cannot effectively handle repeated genomic regions. As the size and complexity of the genomes being sequenced
increased, greedy approaches were replaced by more complex graph-based algorithms that are better able to model
and resolve highly repetitive genomic sequences.
2. Graph-Based Approaches
Theoretical studies on combinatorial solutions to the sequence assembly problem led to the development of
practical software using graph-based representations of the sequence data.
Graph-based sequence assembly models represent sequence reads and their inferred relationships to one another
as vertices and edges in the graph. Walks through the graph describe an ordering of the reads that can be assembled
together. The sequence assembler tries to find a walk that best reconstructs the underlying genome while avoiding
generating misassemblies by taking erroneous paths caused by repeats.
A. OLC graphs-
In the simplest graph-based model, each sequence read is a vertex in the graph, and a pair of vertices are
linked with an edge if they overlap. In this formulation, overlaps are represented by edges from a terminal
vertex of one read to a terminal vertex of another read. Regardless of the representation of the graph, the
assembly process typically follows three main stages.
First, overlapping pairs of reads are detected.
Second, the graph is constructed, and an appropriate ordering and orientation (layout) of the reads are
found.
Finally, a consensus sequence is computed from the ordered and oriented reads.
The set of consensus sequences is then output by the assembler as the sequence contigs. Genome assemblers
following this paradigm are called OLC assemblers for the three main stages of the assembly: overlap, layout,
and consensus.
Overlap-
Requires significant compute time. Dynamic programming is used to check whether each pair has a significant
overlap (typically determined by the length of the overlap and the similarity within the overlapping region).
Such an algorithm requires O(N²) time, where N is the total number of sequenced bases. This brute-force
approach can be used to assemble the sequences of only very small genomes.
To accelerate overlap detection, an index can be constructed that maps k-mers to the list of reads containing
the k-mer, used to quickly screen for reads that may overlap, and dynamic programming is then performed to
verify the overlap. This technique drastically reduces the search space and has been widely used.
Layout and Consensus-
Because a globally optimal solution to the layout stage of the assembly is computationally infeasible, the
layout step typically tries to generate unitigs, which are collections of reads that can be unambiguously
assembled without significant chance of misassembly. The assembler typically removes low-quality sequence
reads and overlaps that are likely to be sequencing artifacts, and then removes redundant edges in a process
called transitive reduction. Finally, the layout algorithm finds unambiguous regions of the graph. After the
reads are ordered and oriented, a multiple alignment is constructed from the chain of overlaps, and a consensus
sequence is inferred.
B. De Bruijn graphs
The de Bruijn graph method of sequence assembly has its roots in theoretical work from the late 1980s.
Pevzner studied the problem of reconstructing a genome sequence when only its set of constituent k-mers is
known. This theoretical work and subsequent development laid the foundation for de Bruijn graph-based
assembly of whole-genome sequencing data.
In this type of assembly, each read is broken into a sequence of overlapping k-mers. The distinct k-mers are
added as vertices to the graph, and k-mers that originate from adjacent positions in a read are linked by an
edge. The assembly problem can then be formulated as finding a walk through the graph that visits each edge
in the graph once—an Eulerian path problem.
In practice, sequencing errors and sampling biases obscure the graph, so a complete Eulerian tour through the
entire graph is typically not sought. Even when an Eulerian path through the entire graph can be found, it is
unlikely to reflect an accurate sequence of the genome because of the presence of repeats, as there are a
potentially exponential number of Eulerian traversals of the graph, only one of which is correct. In most
instances, the assembler attempts to construct contigs consisting of the unambiguous, unbranching regions of
the graph.
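The construction above can be sketched on error-free reads, where an unambiguous walk recovers the genome; with errors and repeats, real assemblers instead report the unbranching regions as contigs. A minimal sketch, assuming exact reads and no repeats:

```python
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Vertices are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (k-1)-mer to its suffix (k-1)-mer. Duplicate k-mers collapse into
    one edge in this sketch (real assemblers track edge multiplicity)."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

def walk(edges):
    """Follow edges from the unique source vertex, consuming each edge once:
    an Eulerian path, which works only on this error-free, repeat-free toy."""
    indeg = defaultdict(int)
    for outs in edges.values():
        for w in outs:
            indeg[w] += 1
    start = next(v for v in edges if indeg[v] == 0)  # vertex with no incoming edge
    remaining = {u: sorted(ws) for u, ws in edges.items()}
    path, v = [start], start
    while remaining.get(v):
        v = remaining[v].pop(0)
        path.append(v)
    # spell the sequence: the start vertex plus the last base of each step
    return path[0] + "".join(u[-1] for u in path[1:])

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # error-free reads from ATGGCGTGCA
print(walk(de_bruijn(reads, k=4)))  # ATGGCGTGCA
```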