Received September 15, 2010; Revised and Accepted September 17, 2010
First- and second-generation sequencing technologies have led the way in revolutionizing the field of genomics
and beyond, motivating an astonishing number of scientific advances, including enabling a more complete
understanding of whole genome sequences and the information encoded therein, a more complete
characterization of the methylome and transcriptome and a better understanding of interactions between pro-
re-assembly. The first step of sample preparation is to break the target genome into multiple small fragments. Depending on the amount of sample DNA, these fragments may be amplified into multiple copies using a variety of molecular methods. In the physical sequencing phase, the individual bases in each fragment are identified in order, creating individual reads. The number of bases identified contiguously is defined as read length. In the re-assembly phase, bioinformatics software is used to align overlapping reads, which allows the original genome to be assembled into contiguous sequences. The longer the read length, the easier it is to reassemble the genome (17).

First-generation sequencing
First-generation sequencing was originally developed by

due to the large number of scanning and washing cycles required. Furthermore, because step yields for the addition of each base are <100%, a population of molecules becomes more asynchronous as each base is added (16). This loss of synchronicity (called dephasing) causes an increase in noise and sequencing errors as the read extends, effectively limiting the read length produced by the most widely used SGS systems to significantly less than the average read lengths achieved by Sanger sequencing (11,17). Further, in order to generate this large number of DNA molecules, PCR amplification is required. The amplification process can introduce errors in the template sequence as well as amplification bias. The effects of these pathologies are that neither the sequences nor the frequencies with which they appear are always faithfully preserved. In addition,
Figure 1. How previous generation DNA-sequencing systems work. (A) A modern implementation of Sanger sequencing is shown to illustrate differential labeling
and use of terminator chemistry followed by size separation to resolve the sequence. (B) The Illumina sequencing process is shown to illustrate the
wash-and-scan paradigm common to second-generation DNA-sequencing technologies.
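
The dephasing effect described above (step yields below 100% making a population of molecules progressively asynchronous) can be illustrated with a short calculation. This is a minimal sketch under a simplifying assumption that is not stated in the text: each base addition succeeds independently with a fixed probability, and a strand that misses one addition is counted as out of phase from then on. The function names and the 99% and 50% figures are hypothetical and chosen only for illustration.

```python
import math


def in_phase_fraction(step_yield: float, cycles: int) -> float:
    """Fraction of template copies still in phase after `cycles` base additions.

    Assumes every addition succeeds independently with probability `step_yield`.
    """
    if not 0.0 < step_yield <= 1.0:
        raise ValueError("step_yield must be in (0, 1]")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return step_yield ** cycles


def max_cycles_in_phase(step_yield: float, min_fraction: float) -> int:
    """Largest cycle count for which the in-phase fraction is still >= min_fraction."""
    if not 0.0 < step_yield < 1.0:
        raise ValueError("step_yield must be in (0, 1)")
    if not 0.0 < min_fraction <= 1.0:
        raise ValueError("min_fraction must be in (0, 1]")
    # Solve step_yield ** n >= min_fraction for the largest integer n.
    return math.floor(math.log(min_fraction) / math.log(step_yield))
```

Under these assumptions, an illustrative step yield of 99% leaves only about half of the strands in phase after 68 cycles. This is consistent with the qualitative point that noise grows as the read extends, although real instruments differ in how signal degrades.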
the time to result, reducing the overall footprint of the instrument, and lowering cost to make DNA sequencing more generally accessible to all. However, this technology is still a ‘wash-and-scan’ system like all current SGS technologies, requiring PCR amplification of the DNA template in each well, as well as termination events typically halting sequencing after each nucleotide incorporation, in order to monitor in succession the incorporation of each of the four bases across all DNA templates. As a result of this process, the overall read length is limited to that of current SGS systems,
R230 Human Molecular Genetics, 2010, Vol. 19, Review Issue 2
Fundamental technology
  First generation: Size-separation of specifically end-labeled DNA fragments, produced by SBS or degradation
  Second generation: Wash-and-scan SBS
  Third generation: SBS, by degradation, or direct physical inspection of the DNA molecule
Resolution
  First generation: Averaged across many copies of the DNA molecule being sequenced
  Second generation: Averaged across many copies of the DNA molecule being sequenced
  Third generation: Single-molecule resolution
Current raw read accuracy
  First generation: High
  Second generation: High
  Third generation: Moderate
Current read length
  First generation: Moderate (800–1000 bp)
  Second generation: Short, generally much shorter than Sanger sequencing
  Third generation: Long, 1000 bp and longer in commercial systems
Current throughput
  First generation: Low
  Second generation: High
  Third generation: Moderate
Current cost
  First generation: High cost per base; low cost per run
  Second generation: Low cost per base; high cost per run
  Third generation: Low-to-moderate cost per base; low cost per run
RNA-sequencing method
  First generation: cDNA sequencing
  Second generation: cDNA sequencing
  Third generation: Direct RNA sequencing and cDNA

(a) There are many TGS technologies in development but few have been reduced to practice. While there is significant potential of TGS to radically improve current throughput and read-length characteristics (among others), the ultimate practical limits of these technologies remain to be explored. Furthermore, there is active development of SGS technologies that will also improve read-length and throughput characteristics.
and ultimately, throughput is limited as well, compared with what SMS platforms will be capable of achieving.

Sitting even closer to the TGS boundary is the Helicos Genetic Analysis Platform, the first commercially available sequencing instrument to carry out SMS (24–27). The Helicos sequencing instrument works by imaging individual DNA molecules affixed to a planar surface as they are extended using a defined primer and a modified polymerase as well as proprietary fluorescently labeled nucleotide analogues, referred to as Virtual Terminator nucleotides, in which the dye is attached to the nucleotide via a chemically cleavable group that allows for step-wise sequencing to be carried out (25). Because halting is still required in this process (similar to SGS technologies), the time to sequence a single nucleotide is high, and the read lengths realized are 32 nucleotides long. However, given the SMS nature of this technology, no PCR is required for sequencing, a significant advantage over SGS technologies. However, also due to the single-molecule nature of this technology (and all of the SMS technologies), the raw read error rates are generally at or >5%, although the highly parallel nature of this technology can deliver high fold coverage and a consensus or finished read accuracy of >99%. This technology is capable of sequencing an entire human genome, albeit at significant cost by today’s standards (roughly $50 000 in reagents) (28). It can follow roughly one billion individual DNA molecules as they are sequenced over the course of many days. Unlike SGS, these many hundreds of millions of sequencing reactions can be carried out asynchronously, a hallmark of TGS. Further, given individual monitoring of templates, the enzymatic incorporation step does not need to be driven to completion, which serves to reduce the overall mis-incorporation error rate. As with the other TGS technologies discussed below, deletions and insertions are a significant issue.

The sample preparation part of this technology involves fragmenting genomic DNA into smaller pieces, adding a 3’ poly(A) tail to the fragments, labeling and blocking by terminal transferase. These templates are then captured onto a surface with covalently bound 5’ dT(50) oligonucleotides via hybridization (25). The surface is then imaged using charge-coupled device (CCD) sensors, where those templates that have been appropriately captured are identified and then tracked for SBS. The process then resembles the ‘wash-and-scan’ steps of SGS in which a labeled nucleotide and polymerase mixture are flooded onto the system and incubated for a period of time, the surface is then washed to remove the synthesis mixture and scanned to detect the fluorescent label. The dye–nucleotide linker is then cleaved to release the dye, and this process is repeated.

Not only can this technology be used to sequence DNA, but the DNA polymerase can be replaced with a reverse transcriptase enzyme to sequence RNA directly (29), without requiring the conversion of RNA to cDNA or without the need for ligation/amplification steps, something all existing SGS technologies require for RNA sequencing (5). Instead, each RNA molecule is polyadenylated and 3’-blocked and captured on a surface coated with dT(50) oligonucleotides, similar to the DNA sequencing process. Sequencing is then carried out as described for DNA, but using reverse transcriptase instead of DNA polymerase. In
addition to direct RNA sequencing, the Helicos platform can carry out other sequencing-based assays such as chromatin profiling (30).

While the Helicos SMS technology has been successfully deployed, representing the first example of true SMS, with many significant advantages over SGS technologies, it has many of the characteristics of SGS technologies and so has had a more difficult time clearly differentiating itself from SGS with respect to read lengths, throughput and run times, all of which are similar to leading SGS technologies. When combined with a higher raw read error rate (requiring repetitive sequencing to overcome), the end result is a higher sequencing cost compared with leading SGS technologies. While the Helicos technology may struggle to clearly differentiate itself from SGS in some respects, the direct RNA-

so that multiple dyes are not held in the confinement volume at a time (something that would destroy the signal-to-noise ratio).

The problem of observing a DNA polymerase working in real time, detecting the incorporation of a single nucleotide taken from a large pool of potential nucleotides during DNA synthesis, was solved using zero-mode waveguide (ZMW) technology (Fig. 2A) (31). The principle employed is similar to that employed in the protective screen in a microwave oven door. The screen is perforated with holes that are much smaller than the wavelength of the microwaves. Because of their relative size, the holes prevent the much longer microwaves from passing through and penetrating the glass. However, the much smaller wavelengths of visible light are able to pass through the holes in the screen, allowing food
Figure 2. How third-generation DNA-sequencing technologies work. Third-generation DNA-sequencing technologies are distinguished by direct inspection of
single molecules with methods that do not require wash steps during DNA synthesis. (A) Pacific Biosciences technology for direct observation of DNA synthesis
on single DNA molecules in real time. A DNA polymerase is confined in a zero-mode waveguide and base additions measured with fluorescence detection of
gamma-labeled phosphonucleotides. (B) Several companies seek to sequence DNA by direct inspection using electron microscopy similar to the Reveo technology
pictured here, in which an ssDNA molecule is first stretched and then examined by STM. (C) Oxford Nanopore technology for measuring translocation
of nucleotides cleaved from a DNA molecule across a pore, driven by the force of differential ion concentrations across the membrane. (D) IBM’s DNA transistor
technology reads individual bases of ssDNA molecules as they pass through a narrow aperture based on the unique electronic signature of each individual nucleotide.
Gold bands represent metal and gray bands dielectric layers of the transistor.
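
The Helicos discussion notes that raw single-molecule error rates at or above 5% can still yield consensus accuracy above 99% when coverage is high. The following is a minimal sketch of that arithmetic, assuming substitution-only errors that occur independently in each read and a simple majority vote at each position. The function name is hypothetical, and the model ignores the insertions and deletions that the text identifies as a significant issue, which in practice require alignment before any vote can be taken.

```python
from math import comb


def consensus_accuracy(raw_error: float, coverage: int) -> float:
    """Probability that a strict majority of reads is correct at one position.

    Assumes each of `coverage` reads is wrong independently with probability
    `raw_error`. Ties count as failures, which makes the estimate conservative.
    """
    if not 0.0 <= raw_error <= 1.0:
        raise ValueError("raw_error must be in [0, 1]")
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    p_correct = 1.0 - raw_error
    needed = coverage // 2 + 1  # smallest strict majority
    return sum(
        comb(coverage, k) * p_correct ** k * raw_error ** (coverage - k)
        for k in range(needed, coverage + 1)
    )


if __name__ == "__main__":
    for depth in (1, 3, 5, 9):
        print(depth, round(consensus_accuracy(0.05, depth), 6))
```

Under these assumptions a 5% raw error rate already exceeds 99% consensus accuracy at three-fold coverage. The figure is a property of the simplified model rather than a claim about any instrument.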
ZMWs overcome the first obstacle, but not the second. All SGS technologies directly attach the dye to the base, which is incorporated into the DNA strand. This is problematic for any system attempting to observe DNA synthesis in real time because the dye’s large size relative to the DNA can interfere with the activity of the DNA polymerase. Typically, a DNA
polymerase can incorporate only a few base-labeled nucleotides before it halts. The SMRT sequencing approach instead attaches the fluorescent dye to the phosphate chain of the nucleotide rather than to the base. As a natural step in the synthesis process, the phosphate chain is cleaved when the nucleotide is incorporated into the DNA strand. Thus, upon incorporation of a phospholinked nucleotide, the DNA polymerase naturally frees the dye molecule from the nucleotide when it cleaves the phosphate chain. Upon cleaving, the label quickly diffuses away, leaving a completely natural piece of DNA with no evidence of labeling remaining.

The SMRT sequencing platform requires minimal amounts of reagent and sample preparation to carry out a run, and there are no time-consuming scanning and washing steps, enabling time to result in a matter of minutes as opposed to days (14). In

released will be capable of only up to 75 000 ZMWs. Finally, as expected to be the case for most TGS technologies, SMRT sequencing data are different in form from SGS data, hence they are amenable to more advanced probabilistic modeling that has the potential to exploit more information about the chemical and structural nature of nucleotide sequences than previous sequencing technologies (as discussed below for most TGS technologies).

Real-time DNA sequencing using fluorescence resonance energy transfer. Other SMS SBS technologies are in development, but little data are available to assess where they are at in development and when they are likely to be released. VisiGen Biotechnologies had one of the more promising approaches to SMS whereby the DNA polymerase is tagged with a fluoro-
claims that the technology is capable of producing 10 000–20 000 base reads at a rate of 1.7 billion bases per day. Like most of the other TGS technologies, read length and reduced costs are expected to be the chief advantages.

Direct imaging of DNA sequences using scanning tunneling microscope tips. Reveo is developing a technology related to IBM’s DNA transistor approach (see subsequently) in which DNA is placed on a conductive surface to detect bases electronically using scanning tunneling microscope (STM) tips and tunneling current measurements (34). The STM tips are knife-edge shaped and have nanoscale dimensions (Fig. 2B). The aim in applying this technology to SMS is to stretch and confine a molecule of DNA such that tunneling current

Nanopore DNA sequencing with MspA. Another approach aims to use a biological nanopore directly on intact DNA. Unlike Oxford Nanopore, which addresses axial resolution limitations in the alpha-haemolysin pore by disassembling the DNA molecule, in this case, the Mycobacterium smegmatis Porin A (MspA) protein, which has a shorter blockade region and thus a better resolution, is used as the pore and the effect of a linear molecule of single-stranded DNA (ssDNA) on the current transiting the pore is measured (12). To slow the transit of the ssDNA through the pore to a level allowing detection of individual bases as they interrupt current transiting the pore, a region of double-stranded DNA (dsDNA) is introduced. The ability of this method to directly measure ssDNA in a processive fashion is attractive, but the complexity of introducing the needed dsDNA break on the pore transit velocity appears to be
assay would be label free and require no optics, again greatly diminishing cost.

While the original DNA transistor idea proposed was based on theoretical calculations and molecular dynamic simulations (41), IBM recently published a solution to one of the two technical challenges facing this approach: modulation of the speed with which the DNA molecule is passed through the nanopore to enable optimal base orientation as well as sufficient sampling of a base as it passes through the nanopore (15). The other challenge remaining for this approach is demonstrating that the signal for a single base can be distinguished from the signals of nearby bases. A recent publication related to this challenge indicates via simulation that factors such as ionic motion in the nanopore may not necessarily affect the desired signal of an individual base (42).

depicted in Figure 3A are seven contigs assembled using ABySS (54) applied to short read data generated from the genome of Rhodopseudomonas palustris using the Illumina GA platform. Because the six blue contigs are overlapping, the red contig representing a 1.5 kb repeat region, and because none of the contigs spans the repeat region, the contigs cannot be ordered with respect to each other. However, in Figure 3B, we depict just three molecules of long-range sequencing data (Figure 3, legend) from the TGS SMRT sequencing platform that span the repeat and unambiguously resolve how the contigs should be ordered with respect to one another.

Its strengths notwithstanding, TGS will come with its own set of challenges. Because a TGS system by definition assays a single molecule, there is no longer any safety in
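
The Figure 3 discussion describes long reads that span a repeat and thereby fix the order of contigs that short reads alone could not order. The idea can be sketched as follows. This is a toy illustration, not the method used to produce Figure 3: it assumes error-free long reads and exact substring matching, whereas real long reads contain errors and need a tolerant aligner. The function names and the sequences in the usage example are hypothetical.

```python
from __future__ import annotations


def contigs_in_read(read: str, contigs: dict[str, str]) -> list[str]:
    """Names of contigs found as exact substrings of `read`, left to right."""
    hits = []
    for name, seq in contigs.items():
        pos = read.find(seq)  # first occurrence only; enough for this toy case
        if pos != -1:
            hits.append((pos, name))
    return [name for _, name in sorted(hits)]


def adjacency_from_reads(
    reads: list[str], contigs: dict[str, str]
) -> set[tuple[str, str]]:
    """Ordered neighbour pairs (left, right) supported by at least one long read."""
    links: set[tuple[str, str]] = set()
    for read in reads:
        order = contigs_in_read(read, contigs)
        links.update(zip(order, order[1:]))
    return links


if __name__ == "__main__":
    # Made-up sequences: two unique contigs flanking one repeat contig.
    toy_contigs = {"left": "TTGGCCAA", "repeat": "ACGTACGT", "right": "GGATCCTA"}
    long_read = "TTGGCCAA" + "ACGTACGT" + "GGATCCTA"  # spans the repeat
    print(contigs_in_read(long_read, toy_contigs))
    print(sorted(adjacency_from_reads([long_read], toy_contigs)))
```

A read that covers only the repeat yields no neighbour pairs, which mirrors the point that contigs cannot be ordered unless some read spans the repeat region.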
uncertainty and incorporating it into the analysis, the ability to appropriately map a given sequence to the correct region of a genome, provide for alternative base calls at any given position or identify more general structural variants is improved.

The second major issue relates to the amount of data that will ultimately be achievable with third-generation technologies. With the potential to generate hundreds of billions of reads per day per sequencing instrument, it is not only impractical to store the raw data files, but impractical for most users to store the trace and pulse-level data as well. Therefore, the raw data must be appropriately collapsed to reduce data storage requirements while simultaneously capturing all of the salient features of the pulses from the trace data to enable reliable interpretation of the pulse (observation) events. Addressing these major issues will be essential to get the

Companies such as Microsoft, Amazon, Google and Facebook have also become masters of petabyte-scale datasets, linking pieces of data distributed over a massively parallel architecture in response to user requests and presenting to the user in a matter of seconds. Users of TGS technologies will need to follow in the footsteps of these others and carve out new paths where needed. For today, the data storage and computational solutions capable of meeting the demands of computing on TGS-scale datasets include cloud-based computing services now available from a number of vendors including Amazon and Microsoft, as well as access to custom high-performance compute clusters (57). In addition, a number of companies like Geospiza are offering services that leverage cloud-based compute resources to enable SGS and TGS users a path to manage and process their raw sequence data.
into manageable bites of information that can be consumed by basic science researchers, clinical researchers, physicians, patients and consumers, efforts to realize the impact TGS can have in areas like medicine, crop and livestock science, and alternative energy will not realize its full potential. Ultimately, through the use of advanced life sciences and informatics technologies, it should be possible for these different communities to become masters of information. Only by marrying information technology to the life sciences and biotechnology will we realize the astonishing potential of the vast amounts of biological data we will be capable of generating with TGS coming on line. Large-scale DNA sequencing, RNA sequencing, translation and related molecular phenotype data, if properly integrated and analyzed, will enable strategies in areas like personalized medicine that would lead to our

niques, the sequence reads gradually diverge in length if an extra base is added or a base fails to incorporate.

Direct RNA sequencing: RNA can be sequenced directly by replacing the DNA polymerase in a sequencing reaction with a reverse transcriptase or other RNA-dependent polymerase. Third-generation technologies that dispense with polymerases altogether also have the potential to sequence RNA directly, and in all cases, direct RNA sequencing offers potential decreases in time to result and much more accurate RNA sequencing as cDNA conversion steps are not required prior to sequencing.

Read: A read is the number of bases determined from a single segment of sample nucleic acid by a sequencing instrument.

Read length: The number of individual bases identified
microbial gene catalogue established by metagenomic sequencing. Nature, 464, 59–65.
5. Alexander, R.P., Fang, G., Rozowsky, J., Snyder, M. and Gerstein, M.B. (2010) Annotating non-coding regions of the genome. Nat. Rev. Genet., 11, 559–571.
6. van Bakel, H., Nislow, C., Blencowe, B.J. and Hughes, T.R. (2010) Most ‘dark matter’ transcripts are associated with known genes. PLoS Biol., 8, e1000371.
7. Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., Lucas, W.J., Wang, X., Xie, B., Ni, P. et al. (2009) The genome of the cucumber, Cucumis sativus L. Nat. Genet., 41, 1275–1281.
8. Wall, P.K., Leebens-Mack, J., Chanderbali, A.S., Barakat, A., Wolcott, E., Liang, H., Landherr, L., Tomsho, L.P., Hu, Y., Carlson, J.E. et al. (2009) Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics, 10, 347.
9. Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y.J., Makhijani, V., Roth, G.T. et al. (2008) The complete genome of an individual by massively parallel DNA sequencing.
28. Pushkarev, D., Neff, N.F. and Quake, S.R. (2009) Single-molecule sequencing of an individual human genome. Nat. Biotechnol., 27, 847–852.
29. Ozsolak, F., Platt, A.R., Jones, D.R., Reifenberger, J.G., Sass, L.E., McInerney, P., Thompson, J.F., Bowers, J., Jarosz, M. and Milos, P.M. (2009) Direct RNA sequencing. Nature, 461, 814–818.
30. Goren, A., Ozsolak, F., Shoresh, N., Ku, M., Adli, M., Hart, C., Gymrek, M., Zuk, O., Regev, A., Milos, P.M. et al. (2010) Chromatin profiling by directly sequencing small quantities of immunoprecipitated DNA. Nat. Methods, 7, 47–49.
31. Levene, M.J., Korlach, J., Turner, S.W., Foquet, M., Craighead, H.G. and Webb, W.W. (2003) Zero-mode waveguides for single-molecule analysis at high concentrations. Science, 299, 682–686.
32. Travers, K.J., Chin, C.S., Rank, D.R., Eid, J.S. and Turner, S.W. (2010) A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res., 38, e159.
33. Uemura, S., Aitken, C.E., Korlach, J., Flusberg, B.A., Turner, S.W. and Puglisi, J.D. (2010) Real-time tRNA transit on single translating
52. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C. and Jaffe, D.B. (2008) ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res., 18, 810–820.
53. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K. et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20, 265–272.
54. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J. and Birol, I. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res., 19, 1117–1123.
55. Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res., 8, 186–194.
56. Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175–185.
57. Schadt, E.E., Linderman, M.D., Sorenson, J., Lee, L. and Nolan, G.P. (2009) Computational solutions to large-scale data management and analysis. Nat. Rev. Genet., 11, 647–657.
58. Jones, S., Zhang, X., Parsons, D.W., Lin, J.C., Leary, R.J., Angenendt, P., Mankoo, P., Carter, H., Kamiyama, H., Jimeno, A. et al. (2008) Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science, 321, 1801–1806.
59. Parsons, D.W., Jones, S., Zhang, X., Lin, J.C., Leary, R.J., Angenendt, P., Mankoo, P., Carter, H., Siu, I.M., Gallia, G.L. et al. (2008) An integrated genomic analysis of human glioblastoma multiforme. Science, 321, 1807–1812.
60. Cokus, S.J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C.D., Pradhan, S., Nelson, S.F., Pellegrini, M. and Jacobsen, S.E. (2008) Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452, 215–219.
61. Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10, 57–63.
62. Ingolia, N.T., Ghaemmaghami, S., Newman, J.R. and Weissman, J.S. (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science, 324, 218–223.