
© 2002 Nature Publishing Group http://www.nature.com/naturegenetics

contents

supplement september 2002

editorial
Cover art by Darryl Leja
Spreading the word 1
Alan Packer

foreword
Power to the people 2
Andreas D Baxevanis & Francis S Collins

perspective
Genomic empowerment: the importance of public databases 3
Harold Varmus

user’s guide
A user’s guide to the human genome 4
Tyra G Wolfsberg, Kris A Wetterstrand, Mark S Guyer, Francis S Collins
& Andreas D Baxevanis

Introduction: putting it together 5

Question 1 9
How does one find a gene of interest and determine that gene’s structure? Once the
gene has been located on the map, how does one easily examine other genes in that
same region?

Question 2 18
How can sequence-tagged sites within a DNA sequence be identified?

Question 3 21
During a positional cloning project aimed at finding a human disease gene, linkage
data have been obtained suggesting that the gene of interest lies between two
sequence-tagged site markers. How can all the known and predicted candidate genes
in this interval be identified? What BAC clones cover that particular region?

Question 4 29
A user wishes to find all the single nucleotide polymorphisms that lie between two
sequence-tagged sites. Do any of these single nucleotide polymorphisms fall within
the coding region of a gene? Where can any additional information about the
function of these genes be found?

Question 5 33
Given a fragment of mRNA sequence, how would one find where that piece of DNA
mapped in the human genome? Once its position has been determined, how would
one find alternatively spliced transcripts?

Question 6 40
How would one retrieve the sequence of a gene, along with all annotated exons and
introns, as well as a certain number of flanking bases for use in primer design?

Question 7 44
How would an investigator easily find compiled information describing the structure
of a gene of interest? Is it possible to obtain the sequence of any putative promoter
regions?

Question 8 49
How can one find all the members of a human gene family?

Question 9 53
Are there ways to customize displays and designate preferences? Can tracks or
features be added to displays by users on the basis of their own research?

Question 10 57
For a given protein, how can one determine whether it contains any functional
domains of interest? What other proteins contain the same functional domains as
this protein? How can one determine whether there is a similarity to other proteins,
not only at the sequence level, but also at the structural level?

Question 11 63
An investigator has identified and cloned a human gene, but no corresponding
mouse ortholog has yet been identified. How can a mouse genomic sequence with
similarity to the human gene sequence be retrieved?

Question 12 66
How does a user find characterized mouse mutants corresponding to human genes?

Question 13 70
A user has identified an interesting phenotype in a mouse model and has been able
to narrow down the critical region for the responsible gene to approximately 0.5 cM.
How does one find the mouse genes in this region?

Commentary: keeping biology in mind 74

Acknowledgments 75

References 76

Web resources: Internet resources featured in this guide 77

supplement to nature genetics • september 2002


editorial

Spreading the word


doi:10.1038/ng961

There was a time, not too long ago, when the wisdom of genome-sequencing projects was up for discussion. Would they be too expensive, draining funds from other areas of the life sciences? Would they be worth the trouble? Not much more than 15 years have passed since those early debates, and the importance of sequenced genomes to biology and medicine has now gained wide acceptance. This is in part owing to the relatively rapid fall in the cost of sequencing, followed by the undeniably important insights gained from the annotation of several bacterial genomes, and those of a few of our favorite eukaryotes. The news has been so relentlessly upbeat that one might even have expected some ‘genome fatigue’ to set in, especially given the saturation coverage of the publication of the drafts of the human genome sequence 18 months ago. Not so, however; witness the recent jockeying by different groups for inclusion of ‘their’ model organism in the next round of sequencing projects. The honeymoon goes on.

And yet there are important issues to be addressed. One is the concern surrounding any bestseller: that it will have far fewer actual readers than one might expect. At first glance, this would seem not to apply to the human genome. After all, one is hard pressed these days to pick up a copy of Nature Genetics, or any genetics journal, and not find evidence that sequenced genomes inform many of the most important advances. A survey published last year by the Wellcome Trust, however, found that only half of the researchers who were using sequence data were fully conversant with the services provided by the freely accessible databases.

There is also the concern that genome sequencers might be victims of their own success. As computational biologist David Roos recently put it, “We are swimming in a rapidly rising sea of data…how do we keep from drowning?” And if geneticists and bioinformaticians are struggling to stay afloat, what of the non-geneticists who are eager to exploit the sequences but are relative newcomers to the tools needed to navigate all of this information?

It is with these questions in mind that we present A User’s Guide to the Human Genome. Written by Tyra Wolfsberg, Kris Wetterstrand, Mark Guyer, Francis Collins and Andreas Baxevanis of the National Human Genome Research Institute (NHGRI), this peer-reviewed how-to manual guides the reader through some of the basic tasks facing anyone whose work might be facilitated by an improved understanding of the online resources that make sense of annotated genomes. The directors of these online resources (Ewan Birney of Ensembl, David Haussler of the University of California, Santa Cruz and David Lipman of the National Center for Biotechnology Information) have served as advisors during the development of this guide, ensuring a balanced and accurate treatment of their respective web portals. The online version of the guide will also evolve, with an initial update scheduled for April 2003.

As noted by Harold Varmus in his eloquent perspective on A User’s Guide and the public databases it examines, one of the important legacies of the Human Genome Project is its ethos of open access to the data. In this spirit, and with the generous sponsorship of the NHGRI and the Wellcome Trust, the online version of this supplement will be freely available on the Nature Genetics website.

Alan Packer
Nature Genetics



foreword

Power to the people


doi:10.1038/ng962

The National Human Genome Research Institute of the National Institutes of Health is delighted to sponsor this special supplement of Nature Genetics. The primary aim of this supplement is to provide the reader with an elementary, hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium, as well as data found in other publicly available genome databases. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available and highlighting the most common types of questions that can be asked by searching and analyzing genomic databases. These examples, which have been set in a variety of biological contexts, provide step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. It is hoped that readers will grow in confidence and capability by working through the examples, understanding the underlying concepts, and applying the strategies used in the examples to advance their own research interests.

One of the motivating factors behind the development of this User’s Guide comes from the general sense that the most commonly used tools for genomic analysis are still terra incognita for the majority of biologists. Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicated that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data. The inherent potential underlying all of this sequence-based data is tremendous, so the importance of all biologists having the ability to navigate through and cull important information from these databases cannot be overstated.

The study of biology and medicine has truly undergone a major transition over the last year, with the public availability of advanced draft sequences of the genomes of Homo sapiens and Mus musculus, rapidly growing sequence data on other organisms, and ready access to a host of other databases on nucleic acids, proteins and their properties. Yet for the full benefits of this dramatic revolution to be felt, all scientists on the planet must be empowered to use these powerful databases to unravel longstanding scientific mysteries. As pointed out by Harold Varmus in the Perspective, free accessibility of all of this basic information, without restrictions, subscription fees or other obstacles, is the most critical component of realizing this potential. It is our modest hope that this User’s Guide will provide another useful contribution.

Andreas D. Baxevanis and Francis S. Collins
National Human Genome Research Institute



perspective

Genomic empowerment: the importance of public databases
doi:10.1038/ng963

Over the past twenty-five years, a mere sliver of recorded time, the world of biology, and indeed the world in general, has been transformed by the technical tools of a field now known as genomics. These new methods have had at least two kinds of effects. First, they have allowed scientists to generate extraordinarily useful information, including the nucleotide-by-nucleotide description of the genetic blueprint of many of the organisms we care about most: many infectious pathogens; useful experimental organisms such as mice, the roundworm, the fruit fly, and two kinds of yeast; and human beings. Second, they have changed the way science is done: the amount of factual knowledge has expanded so precipitously that all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.

Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it. Equally important, these revolutionary changes have been disseminated throughout the scientific community, and spread to other interested parties, because many of those who practice genomics have made a concerted effort to ensure that access is simplified for all, including those who have not been deeply schooled in the information sciences. The goal of providing genomic information widely has also inevitably attracted the interests of those in the commercial sector, and privately developed versions of various genomes are also now available, albeit for a licensing fee.

The operative principle most prominently involved in transmitting the fruits of genomics (the one that has captured the imagination of the public and served as a standard for the sharing of results and methods more generally in modern biology) has been open access. Funding by public and philanthropic organizations, such as the U.S. National Institutes of Health, the U.S. Department of Energy, the Wellcome Trust in Britain, and many other organizations, has made this altruistic behavior possible and has fostered the idea that genomic information about biological species should be available to all. (Such information about individual human beings is, of course, an entirely different matter and should be protected by privacy rules.) The attitude of open access to new biological knowledge has also been embodied in the databases of the International Nucleotide Sequence Database Collaboration, comprising the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and GenBank at the U.S. National Library of Medicine. The same focus on open access is exemplified by PubMed (operated by the NLM), other gateways to the scientific literature, and the assemblies of genomic sequence now found at the several Web portals described in this guide.

The Human Genome Project (HGP), which has supported the public genome sequencing effort, has been the mainstay of the effort to make genomes accessible to the entire community of scientists and all citizens. This effort has, in fact, been quite naturally extended to instruct the public about many themes in modern biological science. This has occurred in part because the human genome itself has been such an exciting concept for the public; in part because genomes are natural entry points for teaching many of the principles of biological design, including evolution, gene organization and expression, organismal development, and disease; and in part because those who work on genomes have been tireless in attempts to explain the meaning of genes to an eager public. Endless metaphors, artistic creations, lively journalism, monographs about social and ethical implications, televised lectures from the White House, and many other cultural happenings have been among the manifestations of this fascination. In this way, the HGP has had a strong hand in raising the public’s awareness of new ideas in biology and of the powerful implications of genomics in medicine, law and other societal institutions.

Some of these cultural effects come as much from the behavioral aspects of the HGP as from the genomic sequences themselves. The sharing of new information, even before its assembly into publishable form, has spurred efforts to share other kinds of research tools and has encouraged the notion of making the scientific literature freely accessible through the Internet. The contribution of scientists in many countries to the sequencing of many genomes, including the human genome, has inspired efforts to develop gene-based sciences, from basic genomics to biotechnology, throughout the world, including the poorest developing nations. Indeed, the World Health Organization, the United Nations, and the World Bank have all contributed recently to the growth of the ideas that science is both possible and valuable in all economies and that science can be a means to help unify the world’s population under a banner of enlightenment, demonstrating a virtue of globalization.

From this perspective, the availability of the sequences of many genomes through the Internet is a liberating notion, making extraordinary amounts of essential information freely accessible to anyone with a desktop computer and a link to the World Wide Web. But the information itself is not enough to allow efficient use. Interested people who reside outside the centers for studying genomes need to be told where best to view the information in a form suitable for their purposes and how to take advantage of the software that has been provided for retrieval and analysis.

The manual before us now offers such help to those who might otherwise have had trouble in attempting to use the products of genomics. Furthermore, the advice is offered in that spirit of altruism that has come to characterize the public world of genomics. The information is provided in a highly inviting and understandable format by casting it in the form of answers to the questions most commonly posed when approaching big genomes. The information, made freely available on the World Wide Web, has been assembled by some of the best minds in the HGP, who have generously given their time and intellect to encourage widespread use of the great bounty that has been created over the past two decades.

In other words, the guide to use of genomes provided here is simply another indication that the HGP should take great pride in much more than the sequencing of genomes.

Harold Varmus
Memorial Sloan-Kettering Cancer Center



user’s guide

A user’s guide to the human genome


doi:10.1038/ng964

The primary aim of A User’s Guide to the Human Genome is to provide the reader with an elementary hands-on guide for browsing and analyzing data produced by the International Human Genome Sequencing Consortium and other systematic sequencing efforts. The majority of this supplement is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions for using many of the most commonly used tools for sequence-based discovery. The major web portals featured throughout include the National Center for Biotechnology Information Map Viewer, the University of California, Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system, along with many others that are discussed in the individual examples. It is hoped that readers will become more familiar with these resources, allowing them to apply the strategies used in the examples to advance their own research programs.

Authors
Tyra G. Wolfsberg
Kris A. Wetterstrand
Mark S. Guyer
Francis S. Collins
Andreas D. Baxevanis

National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
e-mail: andy@nhgri.nih.gov


Introduction: putting it together


doi:10.1038/ng965

In its short history, the Human Genome Project (HGP) has provided significant advances in the understanding of gene structure and organization, genetic variation, comparative genomics and appreciation of the ethical, legal and social issues surrounding the availability of human sequence data. One of the most significant milestones in the history of this project was met in February 2001 with the announcement and publication of the draft version of the human genome sequence (ref. 1). The significance of this milestone cannot be overstated, as it firmly marks the entrance of modern biology into the genome era (and not the post-genome era, as many have stated). The potential usefulness of this rich databank of information should not be lost on any biologist: it provides the basis for ‘sequence-based biology’, whereby sequence data can be used more effectively to design and interpret experiments at the bench. The intelligent use of sequence data from humans and model organisms, along with recent technological innovation fostered by the HGP, will lead to important advances in the understanding of diseases and disorders having a genetic basis and, more importantly, in how health care is delivered from this point forward (ref. 2).

Although this flood of data has enormous potential, many investigators whose research programs stand to benefit in a tangible way from the availability of this information have not been able to capitalize on its potential. Some have found the data difficult to use, particularly with respect to incomplete human genome draft sequence information. Others are simply not sufficiently conversant with the seeming myriad of databases and analytical tools that have arisen over the last several years. To assist investigators and students in navigating this rapidly expanding information space, numerous World Wide Web sites, courses and textbooks have become available; many individuals, of course, also turn to their friends and colleagues for guidance. We have prepared this Guide in that same spirit, as an additional resource for our fellow scientists who wish to make use (or better use) of both sequence data and the major tools that can be used to view these data. The Guide has been written in a practical, question-and-answer format, with step-by-step instructions on how to approach a representative set of problems using publicly available resources. The reader is encouraged to work through the examples, as this is the best way to truly learn how to navigate the resources covered and become comfortable using them on a regular basis. We suggest that readers keep copies of the Guide next to their computers as an easy-to-use reference.

Before embarking on this new adventure, it is important to review a number of basic concepts regarding the generation of human genome sequence data. This review does not discuss the chronological development of the HGP or provide an in-depth treatment of its implications; the reader is referred to Nature’s Genome Gateway (http://www.nature.com/genomics/human/) for more information on these topics.

Current status of human genome sequencing
Sequencing of the human genome is nearing completion. The target date for making the complete, high-accuracy sequence available is April 2003, the 50th anniversary of the discovery of the double helix (ref. 3). As we go to press, however, the work is still a mosaic of finished and draft sequence. A sequence becomes finished when it has been determined at an accuracy of at least 99.99% and has no gaps. Sequence data that fall short of that benchmark but can be positioned along the physical map of the chromosomes are termed ‘draft’. Currently, 87% of the euchromatic fraction of the genome is finished and less than 13% is at the draft stage.

Even in this incomplete state, the available data are extremely useful. This usefulness was apparent early on, leading the International Human Genome Sequencing Consortium (IHGSC) to pursue a staged approach in sequencing the human genome. The first stage generated draft sequence across the entire genome (ref. 1). The project is now well advanced into its second stage, with draft sequence being improved to ‘finished quality’ across the entire genome, a necessarily localized process. As a result, and as it has been presented to date, the human genome sequence is an evolving mix of both finished and unfinished regions, with the unfinished regions varying in data quality. As the data are initially made available in raw form, with subsequent refinement and improvement, and because data of different quality are found in different places in the genome, users must understand the kinds of data presented by the various tools available.

Determining the human sequence: a brief overview
As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp (refs 1, 4). To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing (ref. 4). The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches. These approaches are described in detail elsewhere (ref. 4).

The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy (ref. 1). The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.
1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.
2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data. These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).
3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly
of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank (ref. 5), where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and ensure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.
4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.
5. Subsequent to the generation and publication of the draft human genome sequence, work has continued towards finishing the sequencing. The final stage initially targeted draft-quality BAC clones. For each of these clones, enough additional shotgun sequence data are obtained to bring the coverage to eight- to tenfold, a stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often in just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).
6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’. Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.
7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, there are a number of steps involved in this process that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html).

Box: NCBI reference sequences
The data release and distribution practices adopted by the HGP participants have led not only to very early, pre-publication access to this treasure trove of information, but also to a potentially confusing variety of formats and sources for the sequence data. To address this and other issues, the NCBI initiated the RefSeq project (http://www.ncbi.nlm.nih.gov/locuslink/refseq.html).
The goal of the RefSeq effort is to provide a single reference sequence for each molecule of the central dogma: DNA, the mRNA transcript, and the protein. The RefSeq project helps to simplify the redundant information in GenBank by providing, for example, a single reference for human glyceraldehyde-3-phosphate dehydrogenase mRNA and protein, out of the 14 or so full-length sequences in GenBank. Each alternatively spliced transcript is represented by its own reference mRNA and protein. The RefSeq project also includes sequences of complete genomes and whole chromosomes, and genomic sequence contigs. The human genomic contigs that NCBI assembles, which form the basis of the presentations in the different genome browsers, are part of the RefSeq project. Most RefSeq entries are considered provisional and are derived by an automated process from existing GenBank records. Reviewed RefSeq entries are manually curated and list additional publications, gene function summaries and sometimes sequence corrections or extensions.
Reference sequences are available through NCBI resources, including Entrez, BLAST and LocusLink. They can be easily recognized by the distinctive style of their accession numbers. NM_###### is used to designate mRNAs, NP_###### to designate proteins and NT_###### to designate genomic contigs. The NCBI and UCSC use alignments of the mRNA RefSeqs with the genome to annotate the positions of known genes. Ensembl aligns mRNA RefSeqs to the genome. The NCBI also provides model mRNA RefSeqs produced from genome annotation. These are derived by aligning the NM_ mRNAs and other GenBank mRNAs to the assembled genome and then extracting the genomic sequence corresponding to the transcripts. The resulting model mRNA and model protein sequences have accession numbers of the form XM_###### and XP_######. As the XM_ and XP_ records are derived from genomic sequence, they may differ from the original NM_ or GenBank mRNAs because of real sequence polymorphisms, errors in the genomic or mRNA sequences or problems in the mRNA/genomic sequence alignment. A complete list of types of RefSeqs, along with details on how they are produced, is available from http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html.

Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing)
to focus their efforts on a single, definitive assembly. To this end, and by agreement, the NCBI assembly will be taken as the reference human genome sequence. It is this NCBI assembly that is displayed at the three major portals covered in this guide.

Annotating the assemblies

Once the assemblies have been constructed, the DNA sequence undergoes a process known as annotation, in which useful sequence features and other relevant experimental data are coupled to the assembly. The most obvious annotation is that of known genes. In the case of NCBI, known genes are identified by simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on to the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database⁶ and the assembly using a fast protein-to-DNA sequence matcher⁷. UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program⁸. In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.

Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones. Full details on the types of annotation available and the methods underlying sequence annotation for each of these different types of sequence feature can be found by accessing the URLs listed under Genome Annotation in the Web Resources section of this guide. At UCSC, many of the annotations are provided by outside groups, and there may be a significant delay between the release of the genome assembly and the annotation of certain features. Furthermore, some tracks are generated for only a limited number of assemblies. For an in-depth discussion of genome annotation, the reader is referred to an excellent review by Stein⁹ and the references cited therein. This review, along with the Commentary in this guide, also provides cautions on the possible overinterpretation of genome annotation data.

The data (and sometimes the tools) change every day

The steps outlined in the previous section should emphasize that the state of the human genome sequence will continue to be in flux, as it will be updated daily until it has actually been declared ‘finished’. (Finished sequence is properly defined as the “complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps”². A more practical definition is that of “essentially finished sequence,” meaning the complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps, except those that cannot be closed by any current method.) The reader should be mindful of this, not just when reading this guide, but also when referring back to it over time. Similarly, the tools used to search, visualize and analyze these sequence data also undergo constant evolution, capitalizing on new knowledge and new technology in increasing the usefulness of these data to the user.

Over the next year, sequence producers will continue to add finished sequence to the nucleotide sequence databases, and the NCBI will continue to update the human sequence assembly until its ultimate completion. The human genome sequence will, however, continue to improve even after April 2003, as new cloning, mapping and sequencing technologies lead to the closure of the few gaps that will remain in the euchromatic regions. It is hoped that such technological advances will also allow for the sequencing of heterochromatic regions, regions that cannot be cloned or sequenced using currently available methods.

The sequence-based and functional annotations presented at the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.

Accessing human genome sequence data

Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.

Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes⁷. It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression. The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition, numerous sequence-based search tools are available, and the Ensembl system itself can be downloaded for use with individual sequencing projects.

The UCSC Genome Browser (http://genome.ucsc.edu) was originally developed by a relatively small academic research group that was responsible for the first human genome assemblies. The genome can be viewed at any scale and is based on the intuitive idea of overlaying ‘tracks’ onto the human genome sequence; these annotation tracks include, for example, known genes, predicted genes and possible patterns of alternative splicing. There is also an emphasis on comparative genomics, with mouse genomic alignments being available. The browser also provides access to an interactive version of the BLAT algorithm⁸, which UCSC uses for RNA and comparative genomic alignments.

Given its Congressional mandate to store and analyze biological data and to facilitate the use of databases by the research community, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a central hub for genome-related resources. NCBI maintains GenBank, which stores sequence data, including that generated by the HGP and other systematic sequencing projects. NCBI’s Map Viewer provides a tool through which information such as experimentally verified genes, predicted genes, genomic markers, physical maps, genetic maps and sequence variation data can be visualized. The Map Viewer is linked to other NCBI tools (for example, Entrez, the integrated information retrieval system that provides access to numerous component databases).

Although we have chosen to illustrate each example using resources available at a single site, almost all the questions in this guide can be answered using any of the three browsers. The

informational sidebars that follow some of the questions provide pointers on how to format the search at other sites. Furthermore, the three sites link to each other wherever possible. Examples presented in this Guide rely on the data and genome browser interfaces that were available in June 2002. As new versions of the genome assembly and viewing tools will come online every few months, the specifics of some of the examples may change over time. Regardless, the basic strategies behind answering the questions in the examples will remain the same. This underscores the importance of readers working through the examples at their own computers so that they may understand and be able to navigate these public databases. The readers are encouraged to explore the alternative methods for answering the questions.

Browser problems?

In following the question-and-answer portion of this guide, some readers may find that their web browsers are not able to render the web pages properly. If this occurs, do one or more of the following:
1. Install the most recent version of either Netscape Navigator or Internet Explorer.
2. Increase the amount of memory available to the web browser.
3. Try a different web browser. In general, Macintosh users who seek to gain access to these three genome portals will see better performance with Internet Explorer.
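As an illustration of the NCBI placement rule described under ‘Annotating the assemblies’ (an alignment at least 95% identical, with the aligned region covering 50% or more of the transcript length or at least 1,000 bases), the following Python sketch filters candidate alignments and keeps the best one, or all of them when they tie. The Alignment record, the function names and the use of (identity, aligned bases) as the measure of ‘quality’ are assumptions made for illustration only; the guide does not say how NCBI ranks alignments, and this is not NCBI’s code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one transcript-to-contig alignment."""
    contig: str          # illustrative contig identifier
    identity: float      # fraction of identical bases in the aligned region (0..1)
    aligned_bases: int   # number of transcript bases covered by the alignment


def passes_rule(aln: Alignment, transcript_length: int,
                min_identity: float = 0.95,
                min_coverage: float = 0.50,
                min_bases: int = 1000) -> bool:
    """One reading of the rule quoted in the text: identity >= 95% AND
    (coverage >= 50% of the transcript OR at least 1,000 aligned bases)."""
    if aln.identity < min_identity:
        return False
    coverage = aln.aligned_bases / transcript_length
    return coverage >= min_coverage or aln.aligned_bases >= min_bases


def placements(alignments: List[Alignment], transcript_length: int) -> List[Alignment]:
    """Return the best acceptable alignment, or every alignment that ties for best.
    'Best' is assumed here to mean highest identity, then most aligned bases."""
    acceptable = [a for a in alignments if passes_rule(a, transcript_length)]
    if not acceptable:
        return []

    def quality(a: Alignment):
        return (a.identity, a.aligned_bases)

    best = max(quality(a) for a in acceptable)
    return [a for a in acceptable if quality(a) == best]
```

Calling placements() with two alignments of identical quality returns both, mirroring the statement that alignments of the same quality are both marked on the contig.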


Question 1
How does one find a gene of interest and determine that gene’s structure? Once the gene has been located on the map, how does one easily examine other genes in that same region?
doi:10.1038/ng966

This question serves as a basic introduction to the three major genome viewers. One gene, ADAM2, will be examined using all three sites so that the reader can gain an appreciation of the subtle differences in information presented at each of these sites.

National Center for Biotechnology Information Map Viewer

The NCBI Human Map Viewer can be accessed from the NCBI’s home page, at http://www.ncbi.nlm.nih.gov. Follow the hyperlink in the right-hand column labeled Human map viewer to go to the Map Viewer home page. The notation at the top of the page indicates that this is Build 29, or the NCBI’s 29th assembly of the human genome. Build 29 is based on sequence data from 5 April 2002. The previous genome assembly, Build 28, was based on sequence data from 24 December 2001. To search for any mapped element, such as a gene symbol, GenBank accession number, marker name or disease name, enter that term in the Search for box and then press Find. For this example, enter ‘ADAM2’ and then press Find. The on chromosome(s) box may be left blank for text-based searches such as this one.

The resulting overview page shows a schematic of all of the human chromosomes, pinpointing the position of ADAM2 to the p arm of chromosome 8 (Fig. 1.1). The search results section shows that the gene exists on two NCBI maps, Genes_cyto and Genes_seq. Genes_cyto refers to the cytogenetic map, whereas Genes_seq refers to the sequence map. Clicking on either of those two links opens a view of just that map.

Detailed descriptions of these and other NCBI maps are available at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/humansearch.html. To get the most general overview of the genomic context of ADAM2, including all available maps, click on the item in the Map element column (in this case, ADAM2). This view shows ADAM2 and a bit of flanking sequence on chromosome 8p11.2 (Fig. 1.2). Three maps are displayed in this view, each of which will be discussed below. Additional maps, discussed in other examples in this guide, can be added to this view using the Maps & Options link.

The rightmost map is the master map, the map providing the most detail. The master map in this case is the Genes_seq map, which depicts the intron/exon organization of ADAM2 and is created by aligning the ADAM2 mRNA to the genome. The gene appears to have 14 exons. The vertical arrow next to the ADAM2 gene symbol (within the pink box) shows the direction in which the gene is transcribed. The gene symbol itself is linked to LocusLink, an NCBI resource that provides comprehensive information about the gene, including aliases, nucleotide and protein sequences, and links to other resources¹⁰ (see Question 10). The links to the right of the gene symbol point to additional information about the gene.
• sv, or sequence view, shows the position of the gene in the context of the genomic contig, including the nucleotide and encoded protein sequences.
• ev brings the user to the evidence viewer, a view that displays the biological evidence supporting a particular gene model. This view shows all RefSeq models, GenBank mRNAs, transcripts (whether annotated, known or potential) and expressed sequence tags (ESTs) aligning to this genomic contig. More information on the evidence viewer can be found on the NCBI web site by clicking Evidence Viewer Help on any ev report page.
• hm is a link to the NCBI’s Human–Mouse Homology Map, showing genome sequences with predicted orthology between mouse and human (Fig. 12.2).
• seq allows the user to retrieve the genomic sequence of the region in text format. The region of sequence displayed can easily be changed.
• mm is a link to the Model Maker, which shows the exons that result when GenBank mRNAs, ESTs and gene predictions are aligned to the genomic sequence. The user can then select individual exons to create a customized model of the gene. More information on the Model Maker can be found on the NCBI web site by clicking help on any mm report page.

The UniG_Hs map shows human UniGene clusters that have been aligned to the genome. The gray histogram depicts the number of aligning ESTs and the blue lines show the mapping of UniGene clusters to the genome. The thick blue bars are regions of alignment (that is, exons) and the thin blue lines indicate potential introns. In this example, the mapping of UniGene cluster Hs.177959 to the genome follows that of ADAM2, and all the exons align.

The Genes_cyto map shows genes that have been mapped cytogenetically; the orange bar shows the position of the gene. Although ADAM2 has been finely mapped and is represented by a short line, other genes, such as the group below it on a longer line, have been cytogenetically mapped to broader regions of chromosome 8.

Clicking on the zoom control in the blue sidebar allows the user to zoom out to view a larger region of chromosome 8. Zooming out one level shows 1/100th of the chromosome. There are 20 genes in the region, and all 20 are labeled (displayed) in this view (Fig. 1.3). The region of ADAM2 is highlighted in red on all maps. On the basis of the Genes_seq map, ADAM2 is located between ADAM18 and LOC206849.

University of California, Santa Cruz Genome Browser

The home page for the UCSC Genome Browser is http://genome.ucsc.edu/. At present, UCSC provides browsers not only for the most recent version of the mouse and human genome data, but also for several earlier assemblies. To use the Genome Browser, select the appropriate organism from the pull-down menu at the top of the blue sidebar (Human, in this case) and then click the link labeled Browser. On the resulting page, select the version of the human assembly to view. The genome browser from August 2001 is based on an assembly of the human genome done by UCSC using sequence data available on that date. The Dec. 2001

browser displays annotations based on NCBI’s build 28 of the human genome, and the Apr. 2002 browser displays annotations on NCBI’s build 29. As the annotations presented in this most recent human assembly are not yet as comprehensive as those from the December 2001 assembly, the examples in this text are based on the earlier assembly. Select Dec. 2001 from the pull-down menu to access the assembly from that date (Fig. 1.4).

Supported types of queries are listed below the text input boxes. Enter ‘ADAM2’ in the box labeled position and then click Submit. The results of this search are presented in two categories, Known Genes and mRNA Associated Search Results (Fig. 1.5). The section marked Known Genes shows the mapping of the NCBI Reference mRNA sequences to the genome. The mRNA Associated Search Results represent the mapping of other GenBank mRNA sequences to the genome. Click on the Known Genes link for ADAM2 (arrow, Fig. 1.5) to see the genomic context of the ADAM2 mRNA Reference Sequence (NM_001464).

The resulting zoomed-in view shows a region of chromosome 8 from base pair 36234934 to 36280132, located within 8p12 (Fig. 1.6). The blue track entitled Known Genes (from RefSeq) shows the intron–exon structure of known genes. The vertical boxes indicate exons and the horizontal lines introns. The ADAM2 gene seems to have 14 exons. The direction of transcription is indicated by the arrowheads on the introns. The tracks labeled Acembly Gene Predictions, Ensembl Gene Predictions and Fgenesh++ Gene Predictions are the results of gene predictions (see Question 7). Alignments of other database nucleotide sequences are shown in the Human mRNAs from GenBank, spliced EST, UniGene and Nonhuman mRNAs from GenBank tracks. Translated alignments of mouse and Tetraodon genomic sequence are in the mouse and fish BLAT tracks. Tracks displaying single-nucleotide polymorphisms (SNPs), repetitive elements and microarray data are shown at the bottom. Additional details about each track are available by selecting the track name in the Track Controls at the bottom.

To view the genomic context of ADAM2, zoom out 10× by clicking on the zoom out 10× box in the upper right corner. ADAM2 is located between TEM5 and ADAM18 (Fig. 1.7).

Ensembl

The Ensembl⁷ project, http://www.ensembl.org/, provides genome browsers for four species: human, mouse, zebrafish and mosquito. Click on Human to view the main entry point for the human genome. The current version of human Ensembl is version 6.28.1, based on the NCBI’s 28th build of the genome. To perform a text search, enter ‘ADAM2’ in the text box, and limit the search by selecting Gene from the pull-down search. Click on the upper button labeled Lookup. A single result is returned with a link to the ADAM2 gene (Fig. 1.8).

Click on either of the ADAM2 links to retrieve the GeneView window. The returned page contains four sections of data. The first section (Fig. 1.9) is an overview of ADAM2, including links to accession numbers and protein domains and families. Links to the Ensembl view of highly similar mouse sequences are presented in the Homology Matches section. Some of these fields will be described in more detail in later examples. The second section of the GeneView window provides information on the gene transcript (Fig. 1.10). The sequence of the cDNA is shown, as is a graphic of its intron–exon structure. A limited amount of the genomic context around the gene is shown schematically as well. Exon sequences are shown in the third section of the GeneView (Fig. 1.11) and splice sites in the fourth (Fig. 1.12). If more than one transcript is predicted for the gene, each is allocated its own transcript, exon and splice-site sections.

The complete genomic context of ADAM2 is viewed by returning to the first section of the GeneView (Fig. 1.9) and clicking on one of the two links within the Genomic Location box. The top portion of the resulting ContigView (Fig. 1.13) depicts the chromosome, with the region of interest outlined in red. The Overview shows the genomic context of the gene, including the chromosome bands, contigs, markers and genes that map near 8p12. Clicking on any of these items recenters the display around that item. The section of interest is boxed in red on the DNA(contigs) map. The genes annotated by Ensembl as being around ADAM2 are Q96KB2 and ADAM18.

The bottom panel of the ContigView, the Detailed View (Fig. 1.14), shows a zoomed-in view of the boxed region, highlighting all features that have been mapped to this region of the human genome. The navigator buttons between the Overview and the Detailed View move the display to the left and right and zoom in and out. The features to be displayed can be changed by selecting the Features pull-down menu and then checking which features to view.

The Features shown in Fig. 1.14 are the defaults. The DNA(contigs) map separates items on the forward strand (above) from those on the reverse (below). The only feature on the reverse strand in this view is a single Genscan transcript, predicted by the GENSCAN gene prediction program¹¹ (see Question 7). The forward strand shows five types of features. Starting at the bottom, the ADAM2 transcript is shown in red, indicating that it is a known transcript corresponding to a near-full-length cDNA sequence, protein sequence or both already available in the public sequence database. Black transcripts are predicted based on EST or protein sequence similarity. EST Transcr. links to individual aligning ESTs, whereas the UniGene track near the top displays UniGene clusters. The Genscan model on the forward strand contains many exons found in the known transcript. The Proteins and Human proteins boxes indicate protein sequences that align to this version of the genome, whereas NCBI Transcr. links to the NCBI Map Viewer. Positioning the computer mouse over any feature brings up the feature’s name and links to more detailed information.

The NCBI, UCSC and Ensembl sometimes use different symbols for the same genes, so it can be difficult to compare the views obtained by the different browsers. Furthermore, the three sites maintain independent annotation pipelines and do not all attempt to align the same mRNA sequences to the genome. The NCBI is currently displaying build 29, Ensembl shows build 28, and UCSC offers both builds 28 (December 2001) and 29 (April 2002), although all examples from UCSC in this guide will be illustrated using the better-annotated build 28. Because of the differences between the two assemblies, there are subtle discrepancies between what is shown at the NCBI and what is available at UCSC and Ensembl. However, it is fairly easy to navigate among the three sites. The NCBI, for example, links to Ensembl and UCSC through the black boxes at the top of LocusLink entries for human genes, and Ensembl directs users to NCBI and UCSC through the “Jump to” link in its ContigView. Some versions of UCSC’s Genome Browser have links to Ensembl and NCBI’s Map Viewer in the blue bar at the top of each browser page.
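The coordinate arithmetic behind the ‘zoom out 10×’ step can be sketched in a few lines of Python, using the ADAM2 window quoted above (chromosome 8, base pairs 36234934 to 36280132). The position string format ‘chr8:start-end’, the Interval type and the choice to widen the window symmetrically around its midpoint are assumptions made for illustration; the browser’s own zoom behavior may differ in detail, and the sketch does not know the chromosome length, so it only clamps at position 1.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical genomic window with 1-based, inclusive coordinates."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def parse_position(text: str) -> Interval:
    """Parse a string such as 'chr8:36234934-36280132' (commas are ignored)."""
    match = re.fullmatch(r"(chr\w+):(\d+)-(\d+)", text.replace(",", "").strip())
    if match is None:
        raise ValueError(f"unrecognized position: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return Interval(chrom, start, end)


def zoom_out(window: Interval, factor: int = 10) -> Interval:
    """Widen the window by `factor`, keeping it centered on the same midpoint.
    The left edge is clamped at position 1; the right edge is not clamped
    because the chromosome length is not known to this sketch."""
    new_length = window.length * factor
    center = (window.start + window.end) // 2
    start = max(1, center - new_length // 2)
    return Interval(window.chrom, start, start + new_length - 1)


adam2 = parse_position("chr8:36234934-36280132")
wider = zoom_out(adam2)
print(adam2.length, wider)
```

A window widened in this way still contains the original gene, which is why neighboring genes such as those flanking ADAM2 come into view after zooming out.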

Figure 1.1
Figure 1.2
Figure 1.3
Figure 1.4
Figure 1.5
Figure 1.6
Figure 1.7
Figure 1.8
Figure 1.9
Figure 1.10
Figure 1.11
Figure 1.12
Figure 1.13
Figure 1.14

Question 2
How can sequence-tagged sites within a DNA sequence be identified?
doi:10.1038/ng967

The NCBI’s electronic PCR (e-PCR) tool¹², which is part of the UniSTS resource, can be used to find STS markers within a DNA fragment of interest. UniSTS (http://www.ncbi.nih.gov/genome/sts/) contains all the available data on STS markers, including primer sequences, product size, mapping information and alternative names. Links to other NCBI resources such as Entrez, LocusLink and the MapViewer are also provided. e-PCR looks for potential STSs in a DNA sequence by searching for subsequences with the correct orientation and distance that could represent the PCR primers used to generate known STSs.

The e-PCR home page can be found by going to the NCBI home page, at http://www.ncbi.nlm.nih.gov, and then following the Electronic PCR link in the right-hand column. On the e-PCR home page, paste the sequence of interest or enter an accession number into the large text box at the top of the page. The accession number of the sequence for this example is AF288398. This sequence contains only one STS, stSG47693, which is located between nucleotides (nt) 2102 and 2232 of the sequence under study (Fig. 2.1).

Click on the marker name to bring up details of the STS from UniSTS (Fig. 2.2). The primer information and PCR product size are listed at the top of the page, along with alternative names for the marker. Often STSs are known by different names on different maps. Cross-references to LocusLink, UniGene and the Genebridge 4 map to which this STS was mapped are shown next. The mapping information section contains links to the NCBI’s MapViewer. At the bottom of the page, the Electronic PCR results show other sequences, including contigs, mRNAs and ESTs that may contain this STS marker.

To see the genomic context of the STS marker in all maps to which it has been mapped, click on the link labeled MapViewer at the top of the Mapping Information section. This map view (Fig. 2.3) shows two maps. Note that, in this view, the STS stSG47693 is called RH92759 (highlighted in pink). Gene Map ’99–Genebridge 4 (GM99_GB4, left) has 46,000 STS markers mapped onto the GB4 RH panel by the International Radiation Hybrid Consortium. The STS map (right) shows the NCBI’s placement of STSs onto the genome sequence assembly using e-PCR. Gray lines connect markers that appear in both maps, whereas the red line denotes where the STS RH92759 appears on both maps. In the region shown, there are a total of 211 STSs on the STS map, but only 20 are labeled in this view. To the right of the STS map, the green and yellow circles show the maps on which the STS markers have been placed. One can zoom in or out of this view by clicking on the lines of the zoom tool in the left sidebar.
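The idea behind e-PCR, as described above (searching a sequence for a pair of primer sites with the correct orientation and spacing), can be illustrated with a short Python sketch. This is not the NCBI’s implementation: the function names, the size tolerance (margin) and the use of exact string matching on a single strand are all assumptions chosen to keep the example small, and the matching rules of the actual tool are not described in this guide.

```python
from typing import List, Tuple

# Translation table used to complement DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sts(sequence: str, fwd_primer: str, rev_primer: str,
             expected_size: int, margin: int = 50) -> List[Tuple[int, int]]:
    """Report (start, end) positions, 1-based and inclusive, where the forward
    primer is followed by the reverse complement of the reverse primer and the
    implied PCR product size is within `margin` bases of `expected_size`.
    Exact matching on the given strand only (a simplifying assumption)."""
    seq = sequence.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer.upper())
    hits = []
    start = seq.find(fwd)
    while start != -1:
        # Look for the reverse primer site downstream of the forward primer.
        rev_start = seq.find(rev_site, start + len(fwd))
        while rev_start != -1:
            product_size = rev_start + len(rev_site) - start
            if abs(product_size - expected_size) <= margin:
                hits.append((start + 1, rev_start + len(rev_site)))
            rev_start = seq.find(rev_site, rev_start + 1)
        start = seq.find(fwd, start + 1)
    return hits


# Toy example with made-up primers (not a real STS):
toy = "TTTT" + "ACGTAC" + "G" * 20 + reverse_complement("GGATCA") + "AAAA"
print(find_sts(toy, "ACGTAC", "GGATCA", expected_size=32, margin=0))
```

A hit is reported as a pair of coordinates in the same style as the result in the text, where the STS lies between two nucleotide positions of the query sequence.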

Figure 2.1
Figure 2.2
Figure 2.3

Question 3
During a positional cloning project aimed at finding a human disease
gene, linkage data have been obtained suggesting that the gene of
interest lies between two sequence-tagged site markers. How can all
the known and predicted candidate genes in this interval be identified?
What BAC clones cover that particular region?

doi:10.1038/ng968

UCSC

One possible starting point for this search is the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar at the side of the page, and then click Browser. On the Human Genome Browser Gateway page, change the assembly pull-down to Dec. 2001. To view a region of the genome between two query terms, enter the terms in the search box, separated by a semicolon. For example, to view the region between STS markers D10S1676 and D10S1675, enter ‘D10S1676;D10S1675’ in the box marked position and press Submit. Because both of these markers map to a single position in the genome, the genome browser for the region between those markers is returned (Fig. 3.1).

The STS Markers track displays genetically mapped markers in blue and radiation hybrid–mapped markers in black. Click on the STS Markers label to expand that track and see each marker listed individually (Fig. 3.2). The markers of interest are called by their alternate names (AFMA232YH9 and AFMA230VA9 in this view) and are at the top and bottom of the interval, respectively (Fig. 3.2, arrows).

The full list of known genes in this display is shown in the Known Genes track (Fig. 3.1). These protein-coding genes are taken from the RefSeq mRNA sequences compiled at the NCBI¹⁰ and aligned to the genome assembly using the BLAT program⁸. To export a list of the genes, or other features, in this region, click the Tables link in the top blue bar. For more information about a particular gene (such as MGMT), click on the gene symbol to get a list of additional links to resources such as Online Mendelian Inheritance in Man (OMIM), PubMed, GeneCards and Mouse Genome Informatics (MGI; Fig. 3.3). Many tracks, including Acembly Genes, Ensembl Genes and Fgenesh++ Genes, indicate predicted genes (see Question 7). To view the full set of features in any of these categories, click on the title of that track on the left side of the screen in Fig. 3.1. To view brief descriptions of these tracks, as well as others not mentioned, click on the gray box to the left of the track or scroll down to Track Controls and click on the title of a feature of interest. Explanations of the gene-prediction programs can be found in Question 7. Reset the browser to its default settings by clicking on the reset all button below the tracks.

To see the BAC clones used for sequencing, return to the page illustrated in Fig. 3.1 and click on Coverage at the left side of the screen to expand that track. Here BAC clones are listed individually, with finished regions shown in black and draft regions shown in various shades of gray (Fig. 3.4). For details such as size and sequence coverage of a specific clone, click on the clone accession number (such as AL355529.21, arrow). From this screen, click on the accession number (as shown in Fig. 3.5) to link to the NCBI Entrez document summary for the clone. The full GenBank entry can be viewed by clicking on AL355529 on the Entrez document summary page. According to NCBI naming conventions, this clone is from the RP11 library and has been named 85C15. RP11 is the NCBI designation for RPCI-11, a commonly used human BAC library produced at the Roswell Park Cancer Institute. More information on the naming conventions of genomic sequencing libraries can be found at the NCBI’s Clone Registry (Fig. 3.6; http://www.ncbi.nlm.nih.gov/genome/clone/nomenclature.shtml). Clone ordering information is also available, at http://www.ncbi.nlm.nih.gov/genome/clone/ordering.html.

One can also search for a region between two STS markers using the MapView at Ensembl. Start at the Ensembl Human Genome Browser at http://www.ensembl.org/Homo_sapiens/, click on the idiogram of any chromosome to access the MapView, and enter the marker names in the Jump to Contigview section. To use Ensembl to obtain a list of genes (or other annotations) in a defined chromosomal region, click on Export→Gene List from any ContigView window (Fig. 1.14, center yellow bar).

NCBI

The NCBI MapViewer allows for direct viewing of the region between two markers, as long as both markers are on the master map. If, for example, the master map is a cytogenetic one, one can search chromosome 22 for the region between band numbers 22q12.1 and 22q13.2. If the master map is Genes_seq, one can view the region between two mapped genes.

Access the Map Viewer home page by starting at the NCBI home page (http://www.ncbi.nlm.nih.gov) and clicking Human map viewer in the list on the right-hand side of the page. To view multiple hits on the same chromosome, type in the search terms separated by the word ‘OR’. To see the same region between the STS markers D10S1676 and D10S1675, for example, type ‘D10S1676 OR D10S1675’ in the search box, and hit Find. At the top of the resulting page (Fig. 3.7), two red tick marks on the chromosome cartoon indicate that the markers map close to each other on chromosome 10. The search results at the bottom of the page show the alternative names for the two markers (AFMA232YH9 and AFMA230VA9) as well as the maps on which they have been placed. To view both markers at the same time, click on the link for chromosome 10 in the chromosome diagram. Fig. 3.8 shows the region around D10S1676 and D10S1675, with the original queries highlighted in pink. Red lines connect the positions of the marker on the different maps.

The Maps & Options link, in the horizontal blue bar near the top of the page, allows the user to customize the maps and region displayed. To view, for example, the known and predicted genes

supplement to nature genetics • september 2002 21


user’s guide
in this region, as well as the BAC clones from which the sequence shows the NCBI’s gene predictions. Any of these genes, known or
was derived, click on the link to open the Maps & Options win- predicted, are candidates for the disease gene.
dow (Fig. 3.9). First remove all the maps except Gene and STS The NCBI’s assembled contigs, also known as the NT contigs,
from the Maps Displayed box by highlighting them, and selecting are found in the Contig map. Blue segments come from finished
<<REMOVE. Next, add the Transcript (RNA), GenomeScan, sequence, orange from draft. These contigs are constructed from
Component and Contig maps by selecting them from the Avail- the individual GenBank sequence entries shown in the Comp
able Maps box and selecting ADD>>. Make the STS map the (Component) map. Draft HTG records (phase 1 and 2; see
master by highlighting it, then selecting Make Master/Move to http://www.ncbi.nlm.nih.gov/HTGS/) are displayed in orange
Bottom. To limit the view such that only the STSs between and finished HTGs in blue. Most of these GenBank entries are
D10S1676 and D10S1675 are shown, type the marker names in derived from BAC clones. The tiling paths of the BAC clones that
the Region Shown boxes. Hit Apply to see the aligned maps. In were assembled into contigs are clearly visible. One can obtain
© 2002 Nature Publishing Group http://www.nature.com/naturegenetics

some cases, it may be useful to select a page size larger than the more details about an entry, including the clone name, by click-
default of 20 to view more data in the browser window. ing on the accession number to link to Entrez. The clone name is
Fig. 3.10 shows the maps, as specified in the Maps & Options visible directly in the MapViewer if the Comp map is the master.
window. The green dots to the right of the STS map show all the A map can be quickly made the master map by clicking on the
maps on which the markers appear. This is a fairly long region of blue arrow next to its name.
chromosome 10, and not every STS marker is shown. In particu- Because this is a zoomed-out view of the chromosome, indi-
lar, although there are 611 STSs in this region, only 20 are shown vidual genes and GenBank entries are difficult to visualize.
by name in this view. For each known gene, the Genes_Seq map Zooming in, using the controls in the blue sidebar, will provide
shows all the exons that have been mapped to the genome. Exons a region in more detail. Alternatively, click on the Data As
for individual known mRNAs are shown on the RNA (Tran- Table View in the left sidebar to retrieve all data, including
script) map. Unless a gene is alternatively spliced, the Genes_Seq those hidden in this view, as a text-based table (partially shown
and RNA maps will be the same. The GScan (GenomeScan) map in Fig. 3.11).
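Once the features of a region have been saved as a text table, picking out the candidates that lie between the two markers is a simple interval filter. The following is a minimal sketch under stated assumptions: the `Feature` record, the function name and all coordinates are hypothetical placeholders, not values taken from the browsers or from chromosome 10.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical record for one row of an exported feature table."""
    name: str
    start: int  # chromosome coordinate of the first base (assumed inclusive)
    end: int    # chromosome coordinate of the last base (assumed inclusive)


def features_between(features, marker_a_pos, marker_b_pos):
    """Return the features that overlap the interval bounded by two markers.

    The markers may be given in either order.
    """
    lo, hi = sorted((marker_a_pos, marker_b_pos))
    return [f for f in features if f.end >= lo and f.start <= hi]


# Placeholder data only; real positions would come from the exported table.
features = [
    Feature("geneA", 100, 200),
    Feature("geneB", 450, 900),
    Feature("geneC", 1500, 1600),
]

candidates = features_between(features, 1000, 400)
print([f.name for f in candidates])
```

A feature that only partly overlaps the interval is kept here, which seems the safer choice when assembling a list of candidate genes.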

Figure 3.1
Figure 3.2
Figure 3.3
Figure 3.4
Figure 3.5
Figure 3.6
Figure 3.7
Figure 3.8
Figure 3.9
Figure 3.10
Figure 3.11

Question 4
A user wishes to find all the single nucleotide polymorphisms that lie
between two sequence-tagged sites. Do any of these single nucleotide
polymorphisms fall within the coding region of a gene? Where can any
additional information about the function of these genes be found?

doi:10.1038/ng969

The starting point for this search would be the web site for the Database of Single Nucleotide Polymorphisms (dbSNP) at the NCBI (ref. 13), which is located at http://www.ncbi.nlm.nih.gov/SNP. There is a series of links on the page that allow the user to search using either information about the database submission itself or information regarding genes and gene loci.

For this particular search, assume that the region of interest is known and defined by two STS markers, RH70674 and G32133. Begin by scrolling to the section labeled Between Markers at the bottom of the page. Enter the STS marker names ‘RH70674’ and ‘G32133’ into the two text boxes, and click on Submit STS Markers. This will produce a display showing SNPs 1–25 out of the total of 81 within the region of interest. Go to page 3 of the display by entering ‘3’ in the Page box and clicking Display.

The resulting page (Fig. 4.1) illustrates most of the possible types of result one would find on a typical dbSNP results page. In the table, starting from the left, the first column gives the individual dbSNP cluster IDs (all starting with ‘rs’). The second column, labeled Map, shows whether a particular SNP has been mapped to a unique position in the genome (illustrated by a single green arrow, as in the first row of the example) or to multiple positions (not shown here).

The next set of columns, labeled Gene, indicates whether these SNPs are associated with particular features, such as genes, mRNAs or coding regions. The three columns (L, T and C) are either lit up or appear gray in every row. Taking each in order:
If the L (for locus) appears in blue, part or all of the marker position lies either within 2 kilobases (kb) of the 5′ end of a gene feature or within 500 bases of the 3′ end of a gene feature.
If the T (for transcript) appears in green, part or all of the marker position overlaps with a known mRNA. This does not mean, however, that the SNP marker necessarily falls within a coding region.
If the C (for coding) appears in orange, part or all of the marker position overlaps with a coding region.

The next column, labeled Het, indicates the average heterozygosity observed for this marker, on a scale of 0–100%. A reading of zero means that no information is available for that particular marker, whereas the pink bars show a 95% confidence interval for the marker. The Validation column indicates whether the marker has been validated (shown by a star) or is unvalidated (shown by light blue boxes). Validated markers have been verified by independent re-analysis of the sequence. All of the unvalidated markers shown in Fig. 4.1 are denoted by three blue boxes, which, according to the scale at the top of the column, means that there is a >95% success rate in validation. This figure indicates the probability that this marker is real. (The success rate is defined as 1 – false-positive rate.)

In the penultimate column, the symbol TT (not shown here) indicates that individual genotypes are available for this marker. Finally, the Linkout Avail column indicates which markers are linked to other databases; a P in this column indicates that the variation has been mapped to a known protein structure. For a complete description of all the features within this display, click on any part of the header above the columns.

Returning to the original question, one of the SNPs displayed on this page does indeed fall within a coding region, as indicated by an orange C. To obtain more information on any particular SNP, simply click on the hyperlinked SNP Cluster ID. Clicking on rs1059133, for example, produces a new page, with all available information on that SNP (Fig. 4.2). Under the header marked Submitter records for this RefSNP Cluster is a list of the individual SNPs (in this case, only one SNP) that have been clustered together to form this single reference SNP. The sequence of the SNP is shown in the next header. Under the header marked NCBI Resource Links are GenBank and NCBI RefSeq entries that are associated with this SNP. Further down on the SNP page (Fig. 4.3), the gene whose coding region this SNP falls within is indicated in the LocusLink Analysis section (ADAM2, a disintegrin and metalloproteinase domain 2). The SNP allele is G/C, a non-synonymous change leading to replacement of the Asp residue in the reference sequence by a His residue. Links are also provided to the NCBI Map Viewer, Ensembl map and UCSC genome assembly in the section labeled Integrated Maps. The sections labeled Variation Summary and Validation Summary (not shown) give the raw data on this particular SNP.

Answering the final part of this question requires jumping from dbSNP to LocusLink (ref. 10). To do so, click on the ADAM2 link in the line marked LocusLink at the top of the page (Fig. 4.3). This brings the user to the LocusLink page for ADAM2 and provides numerous jumping-off points to the NCBI and affiliated resources through the boxed links at the top of the page. More information on these resources can be found by following the LocusLink FAQ link in the left-hand column of the page. By simply examining the LocusLink page itself, one sees that the ADAM2 protein belongs to a family of membrane-anchored proteins that have been implicated in processes as diverse as fertilization, muscle development and neurogenesis.

One often-overlooked source of information on genes and gene products is OMIM (ref. 14). This is an electronic version of the catalog of human genes and genetic disorders developed by Victor McKusick at The Johns Hopkins University. OMIM provides the user with concise textual information from the published literature on most human disorders with a genetic basis, and links back to the primary literature as appropriate. Information comprising an OMIM entry includes the gene symbol, alternate names for the disease, a description of the disease (including clinical, biochemical and cytogenetic features), details of the mode of inheritance (including mapping information) and a clinical synopsis. These entries are manually curated, ensuring that the ‘executive summary’ is up to date and accurate.

Although OMIM can be searched directly, many LocusLink entries also link to the OMIM record for the gene. The OMIM entry page for the ADAM2 protein is shown in Fig. 4.4. The page is fully hyperlinked to PubMed, GenBank and other related databases.

Using the UCSC browser, users can retrieve the positions of genome annotations such as SNPs as a text file suitable for loading into a spreadsheet program. While looking at the browser for a defined chromosomal region, click on the Tables link (Fig. 1.6, upper blue bar). Similarly, to export a list of genome annotations in a defined chromosomal region at Ensembl, click on Export from any ContigView window (Fig. 1.14, center yellow bar).
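The L, T and C flags described above reduce to three interval tests against a gene model. The sketch below is an illustration rather than dbSNP's own procedure: the function name, the single-transcript gene model and all coordinates are assumptions, and the locus window is read here as the gene itself extended by 2 kb on its 5′ side and 500 bases on its 3′ side.

```python
def snp_flags(pos, exons, cds, strand="+"):
    """Classify a SNP position against one hypothetical gene model.

    exons  : sorted list of (start, end) pairs, inclusive coordinates
    cds    : (start, end) of the coding region, inclusive coordinates
    strand : "+" or "-"; decides which side of the gene is the 5' end
    """
    gene_start, gene_end = exons[0][0], exons[-1][1]
    # The 5' flank (2 kb) lies to the left on the forward strand and to the
    # right on the reverse strand; the 3' flank (500 bases) is the other side.
    left_pad, right_pad = (2000, 500) if strand == "+" else (500, 2000)

    in_locus = gene_start - left_pad <= pos <= gene_end + right_pad
    in_transcript = any(start <= pos <= end for start, end in exons)
    in_coding = in_transcript and cds[0] <= pos <= cds[1]
    return {"L": in_locus, "T": in_transcript, "C": in_coding}


# Placeholder gene model: two exons, coding region starting inside exon 1.
exons = [(1000, 1200), (2000, 2300)]
cds = (1100, 2100)

for pos in (1150, 1050, 1500, 2700, 2900):
    print(pos, snp_flags(pos, exons, cds))
```

Position 1050 in this toy model shows the caveat made in the text: the SNP overlaps the mRNA (T) but sits in the untranslated part of the first exon, so C stays unset.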

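The Asp-to-His replacement reported for the G/C allele is what a G-to-C change at the first position of an aspartate codon produces. The short sketch below works through that substitution. It assumes the reference codon is GAT, which the text does not state (GAC, the other Asp codon, behaves the same way), and it uses only the four entries of the standard genetic code that are needed.

```python
# Subset of the standard genetic code relevant to this example.
CODON_TABLE = {"GAT": "Asp", "GAC": "Asp", "CAT": "His", "CAC": "His"}


def substitute(codon, index, base):
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + base + codon[index + 1:]


reference_codon = "GAT"  # assumed for illustration only
variant_codon = substitute(reference_codon, 0, "C")

print(reference_codon, CODON_TABLE[reference_codon])
print(variant_codon, CODON_TABLE[variant_codon])
print("non-synonymous:", CODON_TABLE[reference_codon] != CODON_TABLE[variant_codon])
```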
Figure 4.1
Figure 4.2
Figure 4.3
Figure 4.4

Question 5
Given a fragment of mRNA sequence, how would one find where that
piece of DNA mapped in the human genome? Once its position has been
determined, how would one find alternatively spliced transcripts?
doi:10.1038/ng970

For the purpose of this example, the fragment of mRNA of interest is contained within GenBank accession number BG334944. First, retrieve the nucleotide sequence of this EST using the NCBI’s Entrez interface, at http://www.ncbi.nlm.nih.gov/Entrez/. Type ‘BG334944’ into the text box at the top of the page, change the pull-down menu to Nucleotide and press Go. The resulting page shows one entry, corresponding to accession number BG334944. To retrieve this sequence in FASTA format (a common format for bioinformatics programs), change the pull-down menu on this page to FASTA and then press Text (Fig. 5.1). A new web page containing only the sequence, in FASTA format, is produced (Fig. 5.2); copy the resulting sequence.

To determine where this sequence maps within the genome, use UCSC’s BLAT tool (ref. 8). Begin this search by pointing your web browser to the UCSC Genome Browser home page, at http://genome.ucsc.edu. From this page, select Human from the Organism pull-down menu in the blue bar on the side of the page, and then click Blat. Paste the FASTA-formatted sequence obtained from Entrez (above) into the large text box on the BLAT search page (Fig. 5.3), change the Freeze pull-down menu to Dec. 2001, change the Query pull-down menu to DNA and then press Submit. The server will (very quickly) return the search results; in this case, a single match of length 636 is found on the forward strand of chromosome 9 (Fig. 5.4).

To obtain more details on this hit, click the details link, to the left of the entry. A long web page is returned, with three major sections: the mRNA sequence (Fig. 5.5, top), the genomic sequence (Fig. 5.5, middle) and an alignment of the mRNA sequence against the genomic sequence (see Fig. 5.9 for an example). In the alignment in Fig. 5.5, matching bases in the cDNA and genomic sequences are colored in darker blue and capitalized. Gaps are indicated in lower-case black type. Light blue upper-case bases mark the boundaries of aligned regions on either side of a gap and are often splice sites.

Returning to the BLAT summary page for this search (Fig. 5.4), click on browser. This will produce a graphic representation of where this particular mRNA sequence aligns to the genome (Fig. 5.6). The track labeled Chromosome Band indicates that the mRNA maps to 9q34.11. The query sequence itself is represented on the line labeled Your Sequence from BLAT Search (arrow, Fig. 5.6). The sequence is shown as being discontinuous: regions of similarity are shown as vertical lines, gaps are shown as thin horizontal lines, and the direction of the alignment is indicated by the arrowheads. The aligned regions of the EST query correspond to the exons of a known gene, shown on the line immediately below (Known Genes, here RAB9P40). Typing the EST name, BG334944, directly into a UCSC search box would have generated a similar result to that shown in Fig. 5.6, but part of the purpose of this example is to illustrate the use of BLAT.

Approximately halfway down the graphic is a track labeled Human ESTs That Have Been Spliced. This track is at first shown in dense mode, with all the ESTs condensed onto a single line. To see all of the ESTs that align with the genome in this region, potentially representing differentially spliced transcripts, click on the track’s label. This will expand this area of the figure so that each EST occupies a single line (Fig. 5.7). The ESTs are of varying length, but most contain the same exons as the known gene and are (presumably) spliced in the same way. Close inspection indicates that some of the ESTs are missing one or more exons compared with the known gene. Consider the lines marked BE798864 and W52533: the former appears to be missing the fifth exon, whereas the latter is missing the fourth, fifth and sixth exons.

Any of the ESTs can be examined in more detail by clicking on that particular line. Here, click on the line for BE798864 (arrow, Fig. 5.7) to reach the information page for this EST (Fig. 5.8). The EST is 99.8% identical to the genomic sequence; clicking anywhere on the hyperlinked line in the section marked EST/Genomic Alignments returns the actual side-by-side alignment (Fig. 5.9). Differences exist at the ends of the EST, but the sequences are identical in the region surrounding the putative missing exon.

An alternatively spliced mRNA is more likely to be of biological significance when it changes the sequence of the encoded, wildtype protein. To determine whether EST BE798864 could encode a protein different from that of the known gene (RAB9P40), one can simply compare the two sequences directly against each other using the NCBI’s BLAST 2 Sequences tool. First, open a new web browser window, because information from the above search will be needed here; this will prevent having to use the browser’s Back and Forward keys excessively and is a good general rule when using multiple web tools. Then access the BLAST home page, at http://www.ncbi.nlm.nih.gov/BLAST. Select BLAST 2 Sequences, under the header labeled Pairwise BLAST. On this page, the user can simply enter accession numbers rather than cutting and pasting sequences into the text boxes. For the EST, simply enter its accession number (BE798864) into the box marked Enter accession or GI for Sequence 1. Obtaining the accession number of RAB9P40 requires going back to the graphic shown in Fig. 5.6 and clicking on the gene’s track. Once this has been done, input the gene’s accession number (NM_005833) into the box marked Enter accession or GI for Sequence 2. Make sure that the Program pull-down is set to blastn (to compare a nucleotide sequence against another nucleotide sequence, hence the n in blastn) and click the Align button at the bottom of the page to generate the alignment (Fig. 5.10). The sequence corresponding to sequence 1 (the EST) is denoted as the query, whereas the sequence corresponding to sequence 2 (the known gene) is denoted as the subject. The known gene’s protein translation is also shown, starting at the end of the third row of the alignment. Examination of the alignment shows that the EST is missing 153 nt (nt 360–512 of the mRNA), which corresponds to the fifth exon that is missing in BE798864. This gap is in frame, so the EST could encode a homologous yet shorter protein.

Because of the nature of EST sequencing, ESTs often contain sequencing errors at a rate much higher than those of the finished or even draft genomic sequence. It is certainly encouraging that EST BE798864 aligns well with the genomic sequence and that its encoded protein could be in the same frame as that produced from the known gene. In addition, it appears from the UCSC graphic (Fig. 5.7) that other ESTs in this region, such as BE779110, are also missing the fifth exon of RAB9P40. All these predictions must, however, be tested computationally by looking at the quality of the EST–genomic alignment as shown above. Final proof of alternative splicing can, of course, only be generated at the laboratory bench.

Ensembl also displays database hits that overlap with each exon in a transcript. These hits may include proteins as well as ESTs and mRNAs, and may illustrate alternatively spliced products. The hits are shown as green boxes in the TransView (Fig. 13.5), which can be accessed in a number of ways; for example, by clicking on the View Evidence box for a transcript on the GeneView (Fig. 1.10). Another good starting point for visualizing alternatively spliced transcripts is the NCBI’s Model Maker (follow the mm link in Fig. 1.2). The Model Maker displays putative exons from mRNAs, ESTs and gene predictions that align with the genome. Users can select individual exons from these alignments and build a customized gene model. As the Model Maker displays the nucleotide sequence of the model along with its three-frame translation, the effects of adding, modifying or deleting exons can be quickly evaluated.
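The statement that the missing exon leaves the reading frame intact can be checked with simple arithmetic on the coordinates quoted above (nt 360–512 of the mRNA). The helper names in this sketch are illustrative only.

```python
def gap_length(first_nt, last_nt):
    """Length of a gap given inclusive first and last nucleotide positions."""
    return last_nt - first_nt + 1


def is_in_frame(length):
    """A deletion keeps the downstream reading frame if it is a multiple of 3."""
    return length % 3 == 0


missing = gap_length(360, 512)
print(missing)               # length of the skipped region in nucleotides
print(is_in_frame(missing))  # whether the downstream frame is preserved
print(missing // 3)          # number of whole codons removed
```

With these coordinates the gap is 153 nt, which is 51 whole codons, consistent with the shorter but otherwise homologous protein described in the text.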

Figure 5.1
Figure 5.2
Figure 5.3
Figure 5.4
Figure 5.5
Figure 5.6
Figure 5.7
Figure 5.8
Figure 5.9
Figure 5.10

Question 6
How would one retrieve the sequence of a gene, along with all anno-
tated exons and introns, as well as a certain number of flanking bases for
use in primer design?
doi:10.1038/ng971

This type of search can be initiated at the UCSC Genome Browser home page, located at http://genome.ucsc.edu. Select Human from the pull-down menu labeled Organism, and then click on Browser. This brings the user to the Human Genome Browser Gateway, from which a number of text- or position-based searches can be performed on current or older versions of the genome assembly. In this case, select the Dec. 2001 assembly, type the name of the gene of interest (PTPN1) into the position box, and then click Submit. The Browser returns all genes starting with the characters ‘PTPN1’ (Fig. 6.1). The gene of interest here is the one called PTPN1; click on the hyperlinked PTPN1 (arrow, Fig. 6.1) to view the genomic context of this gene (Fig. 6.2).

The text box at the top of Fig. 6.2 gives the absolute base pair position of this gene (chromosome 20, positions 48929540–49003636) and indicates that the gene spans 74 kb. The track labeled Chromosome Bands shows that PTPN1 is located at 20q13.13. Finally, the track marked Known Genes shows that the gene is on the forward strand, as the arrows on that track are pointing to the right. The exons within this gene are indicated by the vertical lines in the Known Genes track.

One way to obtain sequence upstream of a gene is described in Question 7. Here we explain how to retrieve flanking sequence on both sides of a gene. To retrieve an adequate amount of sequence with which to design primers, one can increase the size of the region displayed by changing the position numbers within the position box at the top of the figure. To add an additional 1,000 nt at the 5′ end and an additional 200 nt at the 3′ end, for example, change the text in the position box to ‘chr20:48928540-49003836’ and click Jump. This now redraws the graphic with the new boundaries.

To obtain the actual sequence within the region, click on the DNA link in the blue bar at the top of the page. This produces a new page, entitled Get DNA in Window (Fig. 6.3). Click the button next to extended case/color options and then click Submit. By selecting this option, the user can highlight features in the sequence by changing the format (case, underline, bold, italic) and/or color (red, green, blue) of the text. Colors can be varied in darkness and mixed together by changing the values in the boxes under Red, Green and Blue to any number between 0 and 255; examples of how to specify color in RGB (red-green-blue) format are given below the table. At this point, check the Toggle Case box in the Known Genes row, change the red saturation to 255 and leave the other color values set at zero (Fig. 6.4). Once the user clicks Submit, a new page is presented with the entire length of the sequence specified above (chr20:48928540-49003836) and the exons within this range are shown in red in capital letters (Fig. 6.5). This genomic sequence can now be saved and imported into a primer design or sequence assembly package for further analysis.

The Extended DNA Case/Color Options page can be used to combine and differentiate between genomic tracks. For example, return to the Options page, leave the Known Genes row as before but now also check the Underline square in the Mouse Blat row of the table. Clicking Submit produces a page on which the human exons still appear in red capital letters, but hits from the mouse sequence are now shown as underlined text (Fig. 6.6). In this section of the gene, the conserved mouse sequence overlaps with the exons.

One way to retrieve sequence for a defined chromosomal region at the NCBI is with the seq link on the MapViewer, visible when the Gene_Seq map is the master (Fig. 1.2). At Ensembl, export genomic nucleotide sequence with the Export→FASTA link in any ContigView window (Fig. 1.14, center yellow bar).
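The position string entered in the browser follows directly from the gene coordinates and the amount of flanking sequence wanted. A minimal sketch of that arithmetic is given below; the function name and its parameters are illustrative, and the strand handling is an assumption added for completeness (PTPN1 is on the forward strand, so the 5′ flank is subtracted from the start).

```python
def pad_region(chrom, start, end, five_prime, three_prime, strand="+"):
    """Build a 'chrom:start-end' string with flanking bases added.

    On the forward strand the 5' flank extends the start coordinate and the
    3' flank extends the end coordinate; on the reverse strand they swap.
    """
    if strand == "+":
        new_start, new_end = start - five_prime, end + three_prime
    else:
        new_start, new_end = start - three_prime, end + five_prime
    # Coordinates cannot run off the beginning of the chromosome.
    return f"{chrom}:{max(1, new_start)}-{new_end}"


# PTPN1 coordinates from the Dec. 2001 assembly, as quoted in the text.
print(pad_region("chr20", 48929540, 49003636, 1000, 200))
print(49003636 - 48929540 + 1)  # gene span in bases
```

The first line prints chr20:48928540-49003836, the range used for the sequence retrieval, and the span of 74,097 bases matches the 74 kb quoted for the gene.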

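The effect of the Toggle Case option can be reproduced on a saved sequence when exon coordinates are available separately. The sketch below lower-cases a sequence and then upper-cases the exon intervals. The function name, the toy sequence and the coordinates are placeholders, not PTPN1 data, and color is left out because plain text cannot carry it.

```python
def mark_exons(sequence, region_start, exons):
    """Return the sequence in lower case with exon bases in upper case.

    sequence     : DNA string covering the retrieved region
    region_start : chromosome coordinate of the first base of the sequence
    exons        : list of (start, end) chromosome coordinates, inclusive
    """
    chars = list(sequence.lower())
    for exon_start, exon_end in exons:
        # Convert chromosome coordinates to string indices and clip them
        # to the retrieved region.
        lo = max(exon_start - region_start, 0)
        hi = min(exon_end - region_start + 1, len(chars))
        for i in range(lo, hi):
            chars[i] = chars[i].upper()
    return "".join(chars)


# Toy region starting at coordinate 101 with two short "exons".
print(mark_exons("acgtacgtacgt", 101, [(103, 105), (110, 112)]))
```

Clipping the intervals means that an exon only partly inside the retrieved window is still marked over the part that is present.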
Figure 6.1
Figure 6.2
Figure 6.3
Figure 6.4
Figure 6.5
Figure 6.6

Question 7
How would an investigator easily find compiled information describing
the structure of a gene of interest? Is it possible to obtain the sequence
of any putative promoter regions?
doi:10.1038/ng972

One place to initiate this search is at UCSC’s Genome Browser, at http://genome.ucsc.edu. For purposes of this example, consider the gene encoding pendrin (PDS), a protein associated with developmental abnormalities of the cochlea, sensorineural hearing loss and diffuse thyroid enlargement (goiter).

From the UCSC home page, choose Human from the pull-down Organism list, and click on Browser. The user is now at the Human Genome Browser Gateway. The search in this case is simple: select Dec. 2001 from the assembly pull-down menu, type pendrin into the position box, and then click Submit. The returned results indicate one known gene and two mRNA sequences; click on the accession number of the mRNA sequence AF030880 to continue. The user will now be presented with a graphic overview of the region containing this mRNA. To gain a better perspective of the region, click on the 1.5× button next to zoom out. Finally, click the reset all button in the middle of the page to reset the tracks to their default settings.

Carrying out these steps will produce an output similar to that shown in Fig. 7.1. For the purpose of this question, however, the default settings are not ideal. Using the Track Controls at the bottom of the figure, and following the example in Fig. 7.2, set some tracks to hide mode (not shown), others to dense (all data condensed onto one line) and some to full (a separate line for each feature, up to 300). Before considering the actual data within these tracks, a brief discussion of the content and representation of these tracks is warranted. Many were provided to UCSC by outside individuals. Further information on the gene prediction methods briefly discussed below can be found elsewhere (ref. 15).

The general convention for the Known Genes and predicted gene tracks (Fig. 7.1) is that each coding exon is shown as a tall, vertical bar or block. 5′ and 3′ untranslated regions are shown as shorter vertical bars or blocks.

Connecting introns are shown as very thin lines. The direction of transcription is indicated by the arrows along that thin line.

Known Genes are taken from mRNA reference sequences within LocusLink (ref. 10). These reference sequences have been aligned against the genome using BLAT.

The Acembly Gene Predictions With Alt-splicing track is derived from the alignment of human mRNA and EST sequence data against the genome, using the program Acembly. This program attempts to find the best alignment of each mRNA against the genome and considers alternative splice models. If more than one gene model with statistical significance can be produced, each of these is shown in the display. Additional information on Acembly can be found on the NCBI web site at http://www.ncbi.nih.gov/IEB/Research/Acembly/.

The Ensembl Gene Predictions track (ref. 7) is provided by Ensembl. The Ensembl genes are predicted by a range of methods, including homology to known mRNAs and proteins, ab initio gene prediction using GENSCAN and gene prediction HMMs.

The Fgenesh++ Gene Predictions come from a method that predicts internal exons by looking for structural features such as donor and acceptor splice sites, putative coding regions and intronic regions both 5′ and 3′ to a putative exon using a dynamic programming algorithm; the method also takes into account protein similarity data (ref. 16).

The Genscan Gene Predictions derive from a method called GENSCAN, through which introns, exons, promoter sites and poly(A) signals can be identified. Here, the method does not expect the query sequence to represent one and only one gene, so it can make accurate predictions for either partial genes or multiple genes separated by intergenic DNA (ref. 11).

The Human mRNAs from Genbank track shows alignments between human mRNAs in GenBank and the genome sequence. The Spliced ESTs and Human EST tracks show the alignment of ESTs from GenBank against the genome. Because ESTs usually represent fragments of transcribed genes, there is a high likelihood that an EST corresponds to an exonic region.

Finally, the Repeating Elements by RepeatMasker track shows, as its name would suggest, repetitive elements such as short and long interspersed nuclear elements (SINEs and LINEs), long terminal repeats (LTRs) and low-complexity regions (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker). It is customary to remove or ‘mask’ these elements before applying a gene prediction method to a nucleotide sequence.

The NCBI also provides gene predictions, computed using the program GenomeScan (ref. 17). These models are shown on the GenomeScan and Gene_Seq maps.

Returning to the example shown in Fig. 7.2, notice that most of the tracks return a nearly identical gene prediction; as a rule, exons predicted by multiple methods increase the likelihood that the prediction is actually correct and does not represent a ‘false positive’. Most of the methods show a 3′ untranslated region, indicated by the heavy, shorter block at the left of the predictions. The Acembly track shows three possible alternative splices in addition to the full-length product shown in the third line of that section, a prediction that agrees with those shown in most of the other tracks. The Genscan track extends off to both the right and the left: GENSCAN can be used to predict multiple genes, and this display implies that the method has been applied in this fashion.

Although these graphical overviews are useful, the investigator will more often than not want the actual sequence corresponding to these blocks. For this example, the Fgenesh++ prediction will be used as the basis for obtaining raw sequence data, but the steps will be identical regardless of which track is chosen. Click on the track labeled Fgenesh++ Gene Predictions to go to a summary page describing the prediction (Fig. 7.3). The region has sequence similarity to the pendrin gene (which was already known at the beginning of the example). The size and the beginning- and end-points of the prediction are given, and it is indicated that the prediction lies on the minus strand; this was also indicated in Fig. 7.2 by the left-pointing arrows in the intronic regions. To obtain the sequence, click on Genomic Sequence. The user will be taken to a

44 supplement to nature genetics • september 2002


user’s guide
query page entitled Get Genomic Sequence Near Gene, from which Coding Region Only returns just the coding region, with exons
the transcript, coding region, promoter, or both the transcript shown in upper-case letters.
and promoter can be obtained (Fig. 7.4). For each of the options, Transcript + Promoter appends the promoter sequence to the 5′
the sequence is returned in FASTA format, with the nucleotide end of the sequence that the user would have obtained by using
coordinates being given in the definition line. the Transcript option, with exons shown in upper-case letters.
Transcript returns the sequence of the entire transcript, with The length of the promoter can be indicated in the text box.
exons shown in upper-case letters. Promoter returns just the promoter region, as shown in Fig. 7.5.
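
As a rough illustration of the output conventions just described (exons in upper case, coordinates in the FASTA definition line, promoter placed at the 5′ end), the following Python sketch formats a sequence in the same spirit. It is not the UCSC implementation: the function names, the 0-based end-exclusive exon coordinates and the 60-column line width are all assumptions made for this example.

```python
def mark_exons(seq, exons):
    """Return seq with exon bases in upper case and all other bases in lower case.

    exons is a list of (start, end) pairs, 0-based and end-exclusive,
    relative to the start of seq (an assumption of this sketch).
    """
    out = list(seq.lower())
    for start, end in exons:
        out[start:end] = seq[start:end].upper()
    return "".join(out)


def transcript_plus_promoter(promoter, transcript, exons):
    """Prepend a lower-case promoter to an exon-marked transcript."""
    return promoter.lower() + mark_exons(transcript, exons)


def to_fasta(name, chrom, start, end, seq, width=60):
    """Format seq as FASTA, putting the coordinates in the definition line."""
    header = f">{name} {chrom}:{start}-{end}"
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines)


# Toy sequence and coordinates, invented for illustration only.
marked = mark_exons("acgtacgtac", [(0, 3), (6, 8)])
with_promoter = transcript_plus_promoter("TTT", "acgtacgtac", [(0, 3)])
record = to_fasta("toy_gene", "chr7", 100, 110, marked)
```

Keeping the case-marking separate from the FASTA formatting means the same helper can serve the Transcript and Transcript + Promoter styles of output.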

Figure 7.1
Figure 7.2
Figure 7.3
Figure 7.4
Figure 7.5

Question 8
How can one find all the members of a human gene family?
doi:10.1038/ng973

The HUGO Gene Nomenclature Committee
(http://www.gene.ucl.ac.uk/nomenclature/) has been working to develop a
unique symbol, as well as a longer and more descriptive name, for each
human gene. Thus, members of many gene families, previously cloned in
different laboratories and known by a variety of terms, now share a common
gene symbol. A text search in any of the genome browsers will often return
links to all named members of a gene family that have been mapped to the
genome. Whereas Ensembl and UCSC currently return lists of the genes, the
NCBI presents both a list and a graphical overview.

Go to the NCBI home page at http://www.ncbi.nlm.nih.gov/ and click on the
Human map viewer link on the right side to access the Map Viewer search
page. Enter the term ‘ADAM*[sym]’ in the text query box. The asterisk, or
wild card, will match any character, whereas the term [sym] limits the
search to items with ADAM as their gene symbol. Other advanced search
options are available by clicking the Advanced Search box or by reading the
online documentation. The search returns 41 hits, which include members of
the ADAM family as well as other related families whose names start with
the term ‘ADAM’, such as ADAMTS and ADAMDEC. To limit the search to ADAM
genes only, eliminate the undesired gene symbols with the Boolean NOT term,
using the query ADAM*[sym] NOT ADAMTS*[sym] NOT ADAMDEC1*[sym]. The graphic
at the top of the returned page shows the location of each gene with a red
tick mark (Fig. 8.1). It is immediately clear that the 19 mapped ADAM genes
are distributed among 11 chromosomes, and that some, such as those at the
tips of the q arms of chromosomes 10 and 14, are close together. The list
at the bottom of the page presents links to the 19 genes.

Another way to search for homologous genes in the genome is through a basic
local alignment search tool (BLAST) search at the NCBI or Ensembl. BLAT
searches at UCSC are not as sensitive as BLAST searches and may not find as
many homologous genes. In this example, all genomic sequences homologous to
the ADAM2 protein will be found using the Ensembl BLAST interface. From the
Ensembl Human home page at http://www.ensembl.org/Homo_sapiens/, click on
the link to BLAST. Paste the sequence of the ADAM2 protein (GenBank
accession NP_001455.2) into the query box (having obtained the protein
sequence from the NCBI’s Entrez database by following the steps in
Question 5). Set the database to Homo sapiens, genomic sequence to search
the Ensembl genome assembly, and choose TBLASTN as the executable
(Fig. 8.2). Use the default parameters for the remaining settings. When
done, click Search. The returned page will contain a retrieval ID
(Fig. 8.3), which, when the search is finished, will link to the search
results page (Fig. 8.4).

The top of the results page shows a graphical overview of the locations of
hits. These hits may be to the entire protein or just to a single domain.
The hits are colored by BLAST score, red being most similar, blue least
similar and green intermediate. Some of the hits, like the pairs on the q
arms of chromosomes 10 and 14, lie in positions similar to those of ADAMs
mapped by the NCBI (Fig. 8.1), but others, such as those on chromosomes 12
and Y, are unique to the BLAST search. These unique hits may represent real
members of the ADAM family that have not yet been named and would therefore
not show up in a text-based search. Alternatively, they may be unnamed
pseudogenes or nonsignificant BLAST hits. One gene on chromosome 1 is found
in the text-based search at the NCBI but not in the BLAST search at
Ensembl. The similarity between this gene and ADAM2 is not high enough for
it to appear in the BLAST search using the default Ensembl parameters.

Clicking on an arrow next to one of the hits shown in Figure 8.4 activates
a pop-up menu that gives the details of the BLAST report and provides links
to the BLAST alignment and the ContigView (Figs 8.5 and 8.6, respectively,
for the hit on chromosome 12). The hit on chromosome 12 contains a stop
codon and is probably an intronless pseudogene. The bottom of the results
page (Fig. 8.4) shows a summary of the BLAST hits. Clicking on a hit links
to the BLAST alignment (Fig. 8.5). A link in the middle of the results page
(Fig. 8.4) provides the entire BLAST report in standard format. Clicking on
a hit in the BLAST report retrieves the ContigView for the region around
the hit (similar to what is shown in Fig. 8.6).
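
The wild-card-plus-NOT logic of the Map Viewer text query described in this section can be imitated locally on any list of gene symbols. The Python sketch below uses the standard-library fnmatch module for the asterisk; the function name and the short symbol list are illustrative assumptions, and the sketch does not reproduce how the NCBI search itself is implemented.

```python
from fnmatch import fnmatchcase


def symbol_search(symbols, include, exclude=()):
    """Keep symbols matching the include pattern but none of the exclude patterns.

    Patterns use shell-style wild cards, so 'ADAM*' matches any symbol
    that starts with ADAM.
    """
    return [
        s for s in symbols
        if fnmatchcase(s, include)
        and not any(fnmatchcase(s, pattern) for pattern in exclude)
    ]


# A short, hand-picked symbol list used only to exercise the function.
symbols = ["ADAM2", "ADAM10", "ADAMTS1", "ADAMDEC1", "ADA", "BRCA1"]

# Analogous to: ADAM*[sym] NOT ADAMTS*[sym] NOT ADAMDEC1*[sym]
hits = symbol_search(symbols, "ADAM*", exclude=["ADAMTS*", "ADAMDEC1*"])
```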

Figure 8.1
Figure 8.2
Figure 8.3
Figure 8.4
Figure 8.5
Figure 8.6

Question 9
Are there ways to customize displays and designate preferences? Can
tracks or features be added to displays by users on the basis of their
own research?
doi:10.1038/ng974

In this example, the UCSC browser will be used to view particular tracks.
Start at the UCSC home page (http://genome.ucsc.edu), click on Browser in
the blue sidebar on the left-hand side of the page, and set the Genome
Browser Gateway to a region of interest. For example, one could set the
genome to Human and the assembly to Dec. 2001, type
chr22:38496887-39496866 into the position box, and click Submit to display
a representative region of the December 2001 assembly of human chromosome
22. A number of tracks are already displayed in dense format (Fig. 9.1).
Below the graphic showing the specified region are pull-down menus that
allow the user to change the appearance of the graphic, under the heading
Track Controls (Fig. 9.2). There are three options in each of these
pull-down menus:
• Hide, which allows the user to eliminate that particular track from the
display.
• Dense, which displays all annotations or features for that track on a
single line.
• Full, which displays each annotation or feature for that track on a
separate line; this is the ‘exploded view’ that is illustrated in a number
of the questions in this guide.

Once the desired selections have been made, the user clicks on the refresh
button to redraw the graphic. Further customization of individual tracks
can be achieved by clicking on the track name in the Track Controls section
of the browser. The user can, for example, customize the EST track controls
to color red all ESTs from a certain library that contain a particular
keyword in their GenBank entry or to eliminate all such ESTs from the
display. The browser retains these selections for all subsequent sessions;
the default settings can be restored by clicking on the reset all button.

One of the attractive features of the UCSC system is that users can add
their own annotations, features or tracks to their local displays. These
changes are not written or saved in any way to the original data held at
UCSC. To customize the display, the user returns to the Human Genome
Browser Gateway page and scrolls down to the Add Your Own Tracks section.
Here, the user is presented with a large text box into which properly
formatted text can be typed or pasted. Alternatively, the specifications
can be in a text file, which the user can select by using the Browse button
above the large text box. As another option, if the text file is posted on
the user’s local web page, the user can share the custom track of
annotations with other colleagues simply by telling them the URL of the
file. Colleagues can then view the custom annotation by starting the UCSC
browser and entering this URL into the large text box.

For the purposes of this example, enter the following text file into the
entry field (Fig. 9.3) and click Submit at the top of the page:

browser position chr22:38496887-39496866
browser hide cytoBand
browser hide stsMap
browser hide gap
browser hide clonePos
browser full refGene
browser dense mrna
track name="scale" description="our peak"
chr22 38996887 38996888 peak
track name="Microsatellites" description="Microsatellites" color=0,128,0
chr22 38627059 38627060 D22S276
chr22 39005417 39005418 D22S307
track name="Genotyped SNPs" description="Genotyped SNPs" color=0,0,255
chr22 38518342 38518343 ss146131
chr22 38705963 38705964 ss2941443
chr22 38884157 38884158 ss141110
chr22 39171390 39171391 ss22916
chr22 39438769 39438770 ss1479794
track name="Upcoming SNPs" description="Upcoming SNPs" color=0,128,192
chr22 38615712 38615713 ss86855
chr22 38804838 38804839 ss85533
chr22 39077895 39077896 ss141190
chr22 39305065 39305066 ss137027
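
When there are many items to display, text of this shape is easier to generate than to type. The Python sketch below assembles the same three kinds of line (browser lines, track lines and item lines) from ordinary data structures. The function name and the dictionary keys are assumptions made for illustration; only the line formats are taken from the example above.

```python
def custom_track(position, browser_lines, tracks):
    """Build custom-track text from a position, browser settings and tracks.

    browser_lines: list of (mode, symbolic_name) pairs, e.g. ("hide", "gap").
    tracks: list of dicts with "name", "description", "items" and an
            optional "color" given as an (R, G, B) tuple.
    """
    lines = [f"browser position {position}"]
    lines += [f"browser {mode} {name}" for mode, name in browser_lines]
    for track in tracks:
        header = f'track name="{track["name"]}" description="{track["description"]}"'
        if "color" in track:
            header += " color=" + ",".join(str(c) for c in track["color"])
        lines.append(header)
        for chrom, start, end, label in track["items"]:
            lines.append(f"{chrom} {start} {end} {label}")
    return "\n".join(lines)


# Rebuild part of the example text shown above.
text = custom_track(
    "chr22:38496887-39496866",
    [("hide", "cytoBand"), ("dense", "mrna")],
    [
        {"name": "scale", "description": "our peak",
         "items": [("chr22", 38996887, 38996888, "peak")]},
        {"name": "Microsatellites", "description": "Microsatellites",
         "color": (0, 128, 0),
         "items": [("chr22", 38627059, 38627060, "D22S276")]},
    ],
)
```

The resulting string can be pasted into the entry field or saved to a text file, as described above.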

The browser will ignore any entries in the position box and look only to
the file pasted in the Add Your Own Tracks field. The results of this
customized display are shown in Fig. 9.4.

Lines that begin with the word ‘browser’ control the overall browser
display. Lines beginning with ‘track’ create new tracks. Lines following
track lines provide positional information for each item to be displayed on
that track. Therefore:
• the first line of the above format sets the browser to position
38496887–39496866 on chromosome 22.
• the next six ‘browser’ lines change the overall browser display for the
Chromosome Band, STS Markers, Gap, Coverage, Known Genes and Human mRNAs
tracks. The formatted text must contain the symbolic name for each track
rather than the name listed on the web page display. Symbolic names used by
the UCSC browser are listed in Table 9.1. Compared with the default
settings (Fig. 9.1), the Chromosome Band, STS Markers, Gap and Coverage
tracks have all been hidden, and Human mRNAs is dense rather than full
(Fig. 9.4).
• the remaining lines instruct the browser to create four new tracks named
scale, Microsatellites, Genotyped SNPs and Upcoming SNPs, respectively.
Names are listed on the left side of the browser display. The lines
beginning with the word ‘track’ name the tracks, as listed above, and also
set the descriptions (our peak, Microsatellites, Genotyped SNPs and
Upcoming SNPs) and colors [default (black), green, blue, and light blue] to
be used to display those tracks (Fig. 9.4). The descriptions appear as a
center label in the browser and colors are determined by the three RGB
values provided. All lines following a ‘track’ line provide position
information for the tick marks corresponding to the individual items. For
example, ‘peak’ is displayed at position 38996887–38996888 on
chromosome 22.

This is but one example using only some of many options to the Add Your Own
Tracks feature. A full description, information on input format and
additional examples are available at
http://genome.ucsc.edu/goldenPath/help/customTrack.html.

At Ensembl, the user can customize the Detailed View section of the
ContigView, adding annotations and changing colors, by selecting options
under the Features link (Fig. 1.14, center yellow bar). Users can visualize
their own custom data on Ensembl’s ContigView displays and even share that
data with other users by following the instructions under the DAS Track
Sources link (Fig. 1.14, center yellow bar). At the NCBI, Map Viewer
displays are changed in the Maps & Options window (Fig. 3.9).

Table 9.1 Symbolic Names for UCSC Browser Tracks

Track name            Symbolic name
Acembly Genes         acembly
Assembly              gold
BAC End Pairs         bacEndPairs
Base position         ruler
Chromosome Band       cytoBand
Coverage              clonePos
CpG Islands           cpgIsland
Duplications          genomicDups
Ensembl Genes         ensGene
Exofish Ecores        exoFish
Fgenesh++ Genes       softberryGene
Fish Blat             blatFish
FISH Clones           fishClones
Gap                   gap
GC Percent            gcPercent
Geneid Genes          geneid
Genscan Genes         genscan
GNF Ratio             affyRatio
Human Blat            blatHuman
Human ESTs            est
Human mRNAs           mrna
Known Genes           refGene
Map Contigs           ctgPos
Mouse Blat            blatMouse
Mouse ESTs            est
Mouse mRNA            mrna
Mouse Synteny         mouseSyn
NCI60                 nci60
Nonhuman EST          xenoEst
Nonhuman mRNA         xenoMrna
Nonmouse mRNA         xenoMrna
Overlap SNPs          snpNih
Random SNPs           snpTsc
RepeatMasker          rmsk
Rosetta               rosetta
Sanger 22             sanger22
Simple Repeats        simpleRepeat
Spliced ESTs          intronEst
STS Markers           stsMap
Tigr Gene Index       tigrGeneIndex
UniGene               uniGene
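
Because the formatted text must use the symbolic names from Table 9.1 rather than the names shown on the web page, a small lookup can save some trial and error. The Python sketch below copies a handful of entries from the table, using the spellings that appear in the example file, and turns display names into ‘browser’ lines. The dictionary and function names are hypothetical and the mapping is deliberately incomplete.

```python
# A few display-name-to-symbolic-name pairs copied from Table 9.1.
SYMBOLIC_NAMES = {
    "Chromosome Band": "cytoBand",
    "STS Markers": "stsMap",
    "Gap": "gap",
    "Coverage": "clonePos",
    "Known Genes": "refGene",
    "Human mRNAs": "mrna",
}

# The three display options described for the Track Controls menus.
MODES = ("hide", "dense", "full")


def browser_lines(settings):
    """Translate (display name, mode) pairs into 'browser' lines."""
    lines = []
    for display_name, mode in settings:
        if mode not in MODES:
            raise ValueError(f"unknown display mode: {mode}")
        lines.append(f"browser {mode} {SYMBOLIC_NAMES[display_name]}")
    return lines


example = browser_lines([
    ("Chromosome Band", "hide"),
    ("Known Genes", "full"),
    ("Human mRNAs", "dense"),
])
```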

Figure 9.1
Figure 9.2
Figure 9.3
Figure 9.4


Question 10
For a given protein, how can one determine whether it contains any
functional domains of interest? What other proteins contain the same
functional domains as this protein? How can one determine whether
there is a similarity to other proteins, not only at the sequence level, but
also at the structural level?
doi:10.1038/ng975

To demonstrate how to find functional domains within a protein, the human
testis-determining factor TDF, also known as the sex-determining protein
SRY, will be used as an example.

Although the search could be commenced from the Entrez search box on the
NCBI home page, a better way to perform the initial search is from
LocusLink (ref. 10). One of the advantages of using LocusLink lies in its
standardization of gene and protein names with appropriate
cross-referencing, making it more likely that the correct protein will be
found on the first attempt. From the NCBI home page at
http://www.ncbi.nlm.nih.gov/, choose LocusLink from the pull-down menu in
the upper left corner, type the gene name, ‘TDF’, into the query box, and
click Go. Four loci are returned (Fig. 10.1). The first column gives the
Locus ID, which is a stable identifier associated with that gene locus.
Clicking on the LocusID produces a LocusLink report view; more detailed
information on the report view can be found in the LocusLink Help feature
and in the literature (ref. 15). The second column, marked Org, gives a
shorthand version of the organism name. Here, there is one entry from
Drosophila (Dm), one from mouse (Mm), one from human (Hs) and one from rat
(Rn). A series of alphabet blocks shown to the right of each entry provide
jumping-off points to other database resources. The locus of interest here
is the third entry in the list, because that is the one for the human form
of TDF/SRY.

To find additional information on the protein, click on the second P (in
green) on that line. This takes the user to the protein entries
corresponding to that particular LocusLink entry (Fig. 10.2). At this
point, the user can click on any of the hyperlinks to look at the raw
database information available on any of the proteins listed.

Consider the first entry in the list, an NCBI Reference Protein sequence
with accession number NP_003131. To the right of the accession number is a
series of hyperlinks. Clicking on the link labeled BLink will take the user
to the BLink page for the protein of interest (Fig. 10.3). BLink stands for
‘BLAST Link’ and provides the graphical results of pre-computed BLAST
searches that have been performed not just for this protein sequence, but
for every protein sequence within the Entrez Proteins data domain. The
pre-computed BLAST results for TDF/SRY are shown in the section beginning
with the label ‘204 aa’. Across the top are a number of buttons that allow
the user to ask a series of questions regarding their protein of interest.
As the object of this question is to find the protein domains present
within the TDF/SRY protein, the user can click on CDD-Search (Conserved
Domain Database Search; ref. 18). Doing this will produce a graphical
overview of any domains present within the protein, as well as a sequence
alignment of those domains with the query sequence (Fig. 10.4). In this
case, one functional domain is found: an HMG box, which is a DNA-binding
domain found in many nuclear proteins. The domain was found in both of the
databases comprising CDD (Pfam and SMART), as can be seen by looking at the
accession numbers in the hit list.

To determine which other proteins contain this same HMG-box domain, click
on the box labeled Show, right under the graphical view near the top of the
page. This will invoke the domain architecture retrieval tool (DART). DART
shows functional domains within a protein and, more importantly, other
proteins with a similar domain architecture (Fig. 10.5). The query (the
HMG-box) is shown at the top of the page in red. Every other protein in the
NCBI’s non-redundant sequence database having that same domain is then
shown below the query, with the HMG box again colored red. Other domains
within the found proteins are also shown, in various colors and shapes,
with a key appearing at the bottom of the web page. Clicking on any of the
links to the left would provide additional information about these new
proteins.

Although a protein domain has now been identified within the query protein,
no in-depth information has yet been provided about the function of that
domain. Whereas a circuitous path could be followed from the DART page to
find this information, an easier method is to use another web-based
resource, called InterPro. InterPro is an integrated resource for
information about protein families, domains and functional sites, bringing
together information from a number of protein domain-based resources, such
as PROSITE, PRINTS, Pfam and ProDom (ref. 19). The InterPro Simple Search
engine can be accessed from the InterPro home page, at
http://www.ebi.ac.uk/interpro. Clicking on Text Search, on the left, brings
the user to the search page; for this search, type ‘HMG Box’ into the text
box and hit Search. Three hits are returned (Fig. 10.6). For purposes of
this example, follow the link from the first hit, for high mobility group
proteins HMG1 and HMG2 (IPR000135). The resulting InterPro summary page
(Fig. 10.7) provides information on the function, intracellular location
and, most importantly, metabolic role of this particular protein within the
cell, in an executive summary format. References are provided at the bottom
of the web page for users who wish for more in-depth information about the
domain. Users can also retrieve all of the full-length sequences containing
the domain; the reader is referred to the InterPro documentation for more
details.

The final part of this question asks whether similarity to the query
protein can be found at the structural as well as the sequence level.
Answering this question requires a new search against NCBI Structures. From
the NCBI home page, change the pull-down menu in the query box at the top
of the page to Structure, type ‘SRY’ in the box and hit Go. Four
three-dimensional structures are returned, one of which is 1HRY, the
structure of the human SRY–DNA complex solved by NMR. Clicking on the 1HRY
hyperlink takes the user to the Structure Summary page for 1HRY. The
summary links to more detailed information about chain A, the protein
component of the structure, chain B, the nucleotide component of the
structure, and the conserved domain (CD) in the protein, obtained through a
CDD search. Click on the chain A graphic to get a list of proteins whose
known structures have, using a method called VAST, been deemed similar to
that of the original SRY protein; more information on the method and on
interpreting the data within the tables can be found elsewhere (ref. 15).
Here, the SRY protein is shown to have some structural similarity to a
fasciculin 2–mouse acetylcholinesterase complex, a protein named V-1 Nef, a
heat-shock protein of 70 kD and a myosin motor-domain complex (Fig. 10.8).
The VAST program quite often reveals similarities between proteins that are
not evident from simple BLAST or FASTA searches, so readers are encouraged
to employ this and similar tools when trying to answer questions related to
protein families.

At Ensembl, the GeneView links directly to the InterPro domain(s) found in
the protein (Fig. 1.9).
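
The idea behind the domain-architecture lookup described in this section (find every protein that carries the same domain as the query) can be shown with a toy example. In the Python sketch below, the protein names, the domain lists and the function name are all invented for illustration; the sketch says nothing about how DART itself stores or searches its data.

```python
def proteins_sharing_domain(architectures, domain):
    """Return the sorted names of proteins whose domain list includes domain."""
    return sorted(
        name for name, domains in architectures.items()
        if domain in domains
    )


# Hypothetical proteins and domain architectures, listed N- to C-terminal.
architectures = {
    "query": ["HMG box"],
    "protein_A": ["HMG box", "domain_X"],
    "protein_B": ["domain_Y"],
    "protein_C": ["domain_X", "HMG box", "HMG box"],
}

hits = proteins_sharing_domain(architectures, "HMG box")
```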

Figure 10.1
Figure 10.2
Figure 10.3
Figure 10.4
Figure 10.5
Figure 10.6
Figure 10.7
Figure 10.8

Question 11
An investigator has identified and cloned a human gene, but no
corresponding mouse ortholog has yet been identified. How can a mouse
genomic sequence with similarity to the human gene sequence be retrieved?
doi:10.1038/ng976

For purposes of this example, assume that the user does not already have
the human sequence of interest to hand. The first step will be to locate
the human gene of interest using the UCSC Genome Browser. Begin by pointing
to the UCSC Genome Browser home page, at http://genome.ucsc.edu. Select
Human from the Organism pull-down menu and then click on Browser; both are
located on the blue navigation bar at the left side of the page. This will
take the user to the Human Genome Browser Gateway. Select the Dec. 2001
version of the UCSC genome assembly, type the gene symbol ‘AGPS’ into the
position box, and then click Submit. On the resulting page, follow the link
for AGPS in the Known Genes section.

The result of the search on AGPS is shown in Fig. 11.1. In the main figure
are a series of ‘tracks’, which are labeled along the left-hand side. The
Known Gene track is for AGPS, corresponding to the query. Clicking on AGPS
returns a summary of information on that gene, including the full name of
the protein product (alkylglycerone phosphate synthase precursor), a link
to the GeneCards database at the Weizmann Institute (ref. 20) and links to
the translated protein, mRNA and genomic sequences. Focus now on the track
labeled Mouse Translated Blat Alignments. What is shown in this track are
the results of aligning the November 2001 version of the mouse genome
assembly with the human genome using the program BLAT (ref. 8) in its
translated protein mode. More details about the BLAT algorithm and about
how the mouse BLAT track is automatically generated can be found by
clicking on the Mouse Blat hyperlink found below the main graphical
display.

Click anywhere within the Mouse Blat track to expand the single BLAT track
so that it now shows each individual mouse sequence that aligns with human
sequence in the region of interest (Fig. 11.2). Especially in a translated
mode, mouse and human gene sequences are usually more similar in exons than
in introns. Look carefully at the two alignments that derive from a mouse
sequence called chr3 81178k (Fig. 11.2, arrow). On the Mouse Blat track,
the brown vertical lines represent alignments and the horizontal lines are
gaps. These alignments correspond to the blue vertical lines indicating the
exons of AGPS on the Known Genes track.

To see the kind of information available for a translated BLAT alignment,
click on the mouse genomic sequence labeled chr3 81178k. The resulting page
(Fig. 11.3) provides the details of the alignment of the trace with the
human genome assembly. This mouse genomic sequence is 607 nt in length and
aligns with the human sequence in eight blocks. Within the blocks, the
mouse and human sequences are 78% identical. To view the alignment itself,
click on the View details of parts of alignment. . . link. On the resulting
page (Fig. 11.4), the mouse sequence is shown on top, with the region of
alignment in blue. The human genomic sequence is shown next, and a
side-by-side alignment of the human and mouse sequences is at the bottom of
the web page (not shown).

The NCBI’s UniGene_Mouse map shows alignments of mouse mRNA and EST
sequences with the human genome. Add this map using Maps & Options
(Fig. 3.9). The easiest way to find the mouse ortholog of a human gene is
probably to use Ensembl’s precomputed Homology Matches. These matches,
where available, link directly from a human gene to a putative mouse
homolog (Fig. 1.9).
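
A figure such as the 78% identity within aligned blocks can be pictured as a simple tally over the ungapped blocks of an alignment. The Python sketch below counts matching positions across a list of aligned block pairs. It is a plain position-by-position count offered as an assumption about what such a summary means; it is not BLAT's own scoring, and the toy blocks are invented.

```python
def block_identity(blocks):
    """Percent identity over a list of (mouse, human) ungapped block pairs."""
    matches = 0
    aligned = 0
    for mouse, human in blocks:
        if len(mouse) != len(human):
            raise ValueError("sequences in a block must have equal length")
        aligned += len(mouse)
        # Compare case-insensitively, one aligned position at a time.
        matches += sum(m == h for m, h in zip(mouse.upper(), human.upper()))
    if aligned == 0:
        raise ValueError("no aligned positions")
    return 100.0 * matches / aligned


# Two toy blocks: 5 of the 6 aligned positions match.
identity = block_identity([("ACGT", "ACGA"), ("gg", "GG")])
```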

Figure 11.1
Figure 11.2
Figure 11.3
Figure 11.4

Question 12
How does a user find characterized mouse mutants corresponding to
human genes?
doi:10.1038/ng977

The NCBI provides a set of maps that show chromosomal regions homologous
between mouse and human. This resource can be accessed directly at
http://www.ncbi.nlm.nih.gov/Homology/. For this example using a known,
mapped human gene, however, it is easier to start the search from the
LocusLink entry for tyrosinase. The LocusLink (ref. 10) query page can be
found at http://www.ncbi.nlm.nih.gov/LocusLink/. Select Human from the
Organism pull-down menu, enter ‘tyrosinase’ into the Query box and click
Go. To view the entry for tyrosinase (TYR), click on its LocusLink ID,
7299.

On the resulting page (Fig. 12.1), links to the mouse homology maps are in
the section of the LocusLink summary page marked Relationships. In this
case, there are four maps available for TYR showing mouse alignments: NCBI
vs. MGD aligns the current NCBI assembly of the human genome with the MGD
(Mouse Genome Database (ref. 21), at The Jackson Laboratory) genetic map,
UCSC vs. MGD aligns the current UCSC genome assembly with the MGD genetic
map, NCBI vs. EST-based RH Map aligns the NCBI assembly with the
Whitehead–MRC RH map, and UCSC vs. Hudson et al. aligns the 7 October 2001
UCSC assembly with the Whitehead–MRC RH Map (ref. 22). The Hs and Mm links
adjacent to each map name show the mouse–human homology map with the master
chromosome as human or mouse, respectively. Click on the Hs link next to
the NCBI vs. MGD map.

The resulting mouse–human map shows the mouse genes that are the likely
orthologs of human genes on human chromosome 11 (Fig. 12.2). Depending on
the browser being used, one may have to click the View as text box to
obtain the output; the resulting output will appear in text format,
slightly different from what is shown in Fig. 12.2. Chromosomal locations
of the mouse genes are shown, where known. The green circles link to the
UniSTS entry for each locus; those on the left link to the human UniSTS
entry, whereas those on the right link to the mouse UniSTS entry. The
cytogenetic positions are hyperlinked to either the human or mouse Map
Viewer, as appropriate. Gene symbols are linked to LocusLink (ref. 10). The
tyrosinase gene, highlighted in pink, maps to mouse chromosome 7 at 44 cM,
a piece of information that will be needed in the next step.

The mouse models themselves are described at the Mouse Genome Informatics
site at the Jackson Laboratory. Go to the Mouse Genome Informatics home
page, at http://www.informatics.jax.org/, and use the Query Forms pull-down
menu to select Linkage Maps. On the resulting page, customize the search to
find the region around the mouse gene Tyr. Under Chromosome, set the number
to 7; then set the chromosomal region to between 40 and 48 cM.

Many of the uncloned mouse mutants are not mapped in high-resolution
crosses, and many mapping crosses are carried out with a small number of
mice relative to another easy-to-score phenotype for another mouse mutant
that maps to the same chromosome. It is thus necessary to be lenient in
looking for potential uncloned mouse mutants (±4 cM relative to the
location). In this case, as the NCBI data indicate that the gene is at
44 cM, the region from 40 to 48 cM should be searched.

Further down the page (Fig. 12.3), under Markers, set Include DNA segments
to No to reduce the number of markers shown. Do include syntenic markers,
which are DNA markers and mutant alleles linked to chromosome 7 that have
not been finely mapped but that may be associated with a phenotype of
interest relative to TYR. Under Comparative Maps, Show homologs from
species, choose human (Homo sapiens). Select Show all markers. Use the
default setting for all other options, and hit Retrieve.

The gene Tyr is found on page 2 of the output, at 44 cM (Fig. 12.4). The
mouse chromosome is shown schematically on the left and expands as one
moves to the right. In the rightmost columns are the names of the mouse
markers in a particular region in blue and, if there is a corresponding
human ortholog, the name of that ortholog in black. Some of the displayed
mouse markers are genes, some are STSs, some are recessive mutants (all
small letters) and some are dominant alleles (initial capital letter). At
the bottom of the page are syntenic markers, those which have been mapped
to chromosome 7 but not to an exact position.

Clicking the blue Tyr link at 44 cM opens up a summary of the Genes,
Markers, and Phenotypes for that gene (Fig. 12.5). Of particular interest
in this case are the phenotypic alleles. There are 99 mouse strains with
mutations in the Tyr gene.

Users can also view chromosomal regions that are homologous between mouse
and human by using Ensembl’s SyntenyView, available from the ContigView by
clicking the Jump to→syntenyview link (Fig. 1.14, center yellow bar).

Figure 12.1

Figure 12.2

Figure 12.3

Figure 12.4

Figure 12.5



erratum

A User’s Guide to the Human Genome


T G Wolfsberg, K A Wetterstrand, M S Guyer, F S Collins & A D Baxevanis
Nature Genet. 32, 1–79 (2002).
doi:10.1038/ng977
Owing to a production error, Figure 11.2 was inadvertently inserted in place of Figure 12.2 in both the print and online PDF versions of
the User’s Guide. The screen pictured in the full text version online is the correct Figure 12.2.


Question 13
A user has identified an interesting phenotype in a mouse model and
has been able to narrow down the critical region for the responsible
gene to approximately 0.5 cM. How does one find the mouse genes in
this region?

doi:10.1038/ng978

Ensembl provides a mouse genome browser, similar to the one available for humans. It is being updated with the latest mouse genome sequence assemblies and, at the time of writing, displays the MGSC version 3 assembly of the mouse genome, with sequence data from February 2002. The sequence is estimated to cover 96% of mouse euchromatic DNA, and Ensembl has predicted that it contains over 22,000 genes. Start at the Ensembl mouse home page, at http://www.ensembl.org/Mus_musculus/. Choose Marker from the pull-down menu, type the marker name ‘RH114718’ in the adjacent box, and press Lookup. Click either of the resulting links to view more details about this radiation hybrid marker. RH114718 has been mapped to a single position on chromosome 19 and is also known as MGI:102447, MTH1904 and D19MIT109 (Fig. 13.1). Click on the chromosomal position to view the genomic context of the marker (Fig. 13.2).

The Overview section of Fig. 13.2 shows a region of 1 Mb of chromosome 19 centered around the marker, labeled D19MIT109 in this view. More than 30 mouse genes are predicted in this region, some already known and some new. The Detailed View at the bottom of the page is a zoomed-in display of the region around the marker. To get a better view of the genes and transcripts in this region, zoom out on the bottom view by clicking on the longest bar in the zoom control (closest to the minus sign). The Detailed View will now show the same region of chromosome 19 as the overview, but with many additional features (Fig. 13.3). The splice patterns of the genes and gene predictions are shown, as are regions of homology between the genome and other proteins and mRNAs. Pointing the computer mouse at any feature allows the user to open a small menu that links to additional descriptions.

Consider the new gene indicated by the red arrow in Fig. 13.3. To view general information about this gene, hold the computer mouse over the gene graphic and select Transcript Information from the pop-up menu. The GeneView window (Fig. 13.4) provides a description of this gene, as well as a link to the GeneView window for the putative human ortholog (Fig. 13.4, Homology Matches section). To view the database sequences that align with the predicted exons of the new mouse gene, place the computer mouse pointer over the gene in the Detailed View (Fig. 13.3, arrow) and select Supporting evidence from the pop-up menu. Fig. 13.5 depicts the mRNA and protein sequences that align with exons in the new gene. Click on any of the green boxes to see the alignment of the database sequence with the new transcript.

The zoomed-out Detailed View also provides links to computed regions of orthology between the mouse and human genomes (Fig. 13.3, pink bars). As the mouse genome assembly and annotation lag behind those of the human, it may also be useful to view the human genes in an orthologous region of the genome.

UCSC also provides a mouse genome browser and the BLAT search tool for use with the latest mouse genome sequence assemblies. The links are available from the UCSC genome browser home page, at http://genome.ucsc.edu/. Mouse genome analysis tools developed at the NCBI, including a mouse Map Viewer and mouse BLAST pages, are available from http://www.ncbi.nlm.nih.gov/genome/guide/mouse/.
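As a rough check on why a 1-Mb view is a sensible starting point for a critical region of about 0.5 cM, the following Python sketch converts a genetic interval into an average physical span and computes window coordinates around a marker. It is only an illustration: the genome size (about 2,500 Mb) and genetic map length (about 1,400 cM) are assumed round numbers rather than values from this guide, the marker coordinate is hypothetical, and the function names are invented here. Local recombination rates vary widely, so the result is an average, not a prediction for any particular region.

```python
"""Rough helpers for sizing a browser window around a mapped marker."""

# Assumed round numbers, not taken from the guide:
MOUSE_GENOME_MB = 2500.0   # approximate physical size of the mouse genome
MOUSE_MAP_CM = 1400.0      # approximate total length of the mouse genetic map


def cm_to_mb(interval_cm, genome_mb=MOUSE_GENOME_MB, map_cm=MOUSE_MAP_CM):
    """Convert a genetic interval (cM) to an average physical span (Mb)."""
    return interval_cm * genome_mb / map_cm


def centered_window(center_bp, span_bp, chrom_length_bp=None):
    """Return (start, end) of a window of about span_bp centered on center_bp.

    The start is clamped to 1 and, if a chromosome length is supplied,
    the end is clamped to that length.
    """
    half = span_bp // 2
    start = max(1, center_bp - half)
    end = center_bp + half
    if chrom_length_bp is not None:
        end = min(end, chrom_length_bp)
    return start, end


if __name__ == "__main__":
    span_mb = cm_to_mb(0.5)
    print(f"0.5 cM is about {span_mb:.2f} Mb on average")
    # Hypothetical marker coordinate, for illustration only.
    print(centered_window(center_bp=30_000_000, span_bp=1_000_000))
```

Under these assumptions, 0.5 cM corresponds to a little under 1 Mb, which is the same order as the Overview window described above. If the critical region lies in an area of unusually low recombination, a wider window would be the safer choice.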

Figure 13.1

Figure 13.2

Figure 13.3

Figure 13.4

Figure 13.5


Commentary: keeping biology in mind


doi:10.1038/ng979

In working through the examples in the User’s Guide, the reader is exposed to a number of databases, web sites and other resources of enormous value for performing in silico analysis of biological data. Familiarity with and use of this vast arsenal can help the researcher to plan and execute experiments more intelligently. In using these resources and, more importantly, in drawing biological conclusions based on the results gleaned from these sites, however, there are a number of caveats and potential pitfalls of which the user should be aware. Although some of the specific points we now discuss go beyond the sample questions included in this guide, the basic lessons to be learned apply to the full range of bioinformatic analyses.

The user must understand the capabilities, and limitations, of the programs being used. In the same way that molecular biologists need to understand the chemistry underlying a routine assay or the physics behind separation techniques, they must have a basic understanding of what search or analysis methods actually do once the ‘Submit’ button has been pressed. Understanding what the chemistry, physics or search methods can and cannot reveal is critical if the user is to extract the full meaning of the results but not overinterpret them. By understanding the methods, users can also optimize them and end up with a better set of results than if these sequence-based search methods were treated simply as a ‘black box’.

A specific case in which the reader could have encountered difficulty deals with the detection of domains within a protein, as described in Question 10. Consider the part of the question that discussed the Conserved Domain Database (CDD) at the NCBI. The CDD is a ‘secondary database’, one in which the entries have been derived from other databases, in this case Pfam (ref. 23) and the Simple Modular Architecture Research Tool (SMART; ref. 24). Pfam provides collections of multiple sequence alignments that represent known, common protein domains. Pfam is subdivided into two parts: Pfam A, which is manually curated, and Pfam B, which is automatically generated. By virtue of being ‘hand-crafted’, the entries in Pfam A are of higher quality and are therefore more reliable than those in Pfam B. Nevertheless, both Pfam A and Pfam B provide broad coverage across the spectrum of known protein domains.

The second source database, SMART, provides information on 500 domain families, but with a specific emphasis on those domains that have been implicated in signaling or have been found in extracellular or chromatin-associated proteins. This was a deliberate choice by the developers, who wished to tackle what might be called ‘tougher-to-detect’ or ‘tougher-to-define’ domains. At the outset, simply knowing the scope of the target database tells the user whether or not it is an appropriate choice for a sequence of interest, especially when some biochemical data may already be available. If users were to search solely against SMART and find nothing, without understanding the limited scope of the data underlying the resource, they might erroneously conclude that the protein of interest had no known domains.

Continuing with this example, and assuming that the user now understands the scope of the underlying source databases, a second problem quickly surfaces. When searching Pfam and SMART through the CDD interface at the NCBI, the search is performed using a variation of the BLAST algorithm called RPS-BLAST (ref. 25). If one were, however, to go directly to the Pfam or SMART web sites and issue the query there, the searches would be performed using a very different algorithm, a hidden Markov model (ref. 26). Although a description of the two different methods is beyond the scope of this discussion, it is important to understand that they are fundamentally different and will therefore produce different results. An extended discussion on this point, using specific examples, is available (ref. 27). The CDD front end will miss those SMART and Pfam entries that represent short domains, repeats and motifs (ref. 28). To understand what the methods do does not mean having to comprehend advanced mathematical equations: basic explanations in layman’s terms can be found in any one of a number of reviews or textbooks (refs 7,8).

One can often carry out a search and become excited on the identification of a motif; frequently such a motif is rather small. The Lys-Asp-Glu-Leu motif is an example; it targets proteins to the endoplasmic reticulum. But one should beware the ‘short-motif’ pitfall. The level of sequence identity required for significant homology is much higher for smaller regions: they either match or they don’t. For very short motifs, homology cannot be inferred by sequence identity, meaning that short motifs may not be at all helpful in describing what a protein does.

Longer motifs have greater power in identifying true positives and eliminating false positives. More importantly, the supporting information is made available by simply clicking past the first page of summary results provided by the search engine. Even, or especially, the newest of users is encouraged to click away and discover the information and assumptions underlying the results that the searches have produced. These are self-explanatory in many cases.

With respect to complete sequences, the reader is advised to recall that the preliminary analyses of the human genome sequence led to a large reduction in the estimated number of genes contained in the human genome. Earlier, numbers of the order of 80,000 to as high as 140,000 had been suggested (ref. 29). With the draft sequence of the genome in hand, new estimates lie closer to 30,000–35,000 genes (ref. 11). If this is correct, the human would have only twice as many genes as are observed in either the roundworm or the fruit fly (ref. 11). At the same time, human genes appear (in general) to have a more complex structure.

This pronounced ‘reduction’ in the number of genes in the human genome obviously challenges the one-gene, one-protein hypothesis (or, more properly, the one-gene, one-enzyme hypothesis; ref. 30), as the number of proteins in the human proteome is thought to be well in excess of 35,000 (ref. 11). One explanation of the large number of individual proteins that can be generated from this relatively small number of genes is alternative splicing, a process by which the transcripts from a single gene can be processed differently and thus give rise to several distinct proteins. Particularly germane to this discussion is that many proteins have more than one function, depending on where they are found in the cell or within the body as a whole.

An interesting example of this phenomenon is the multifunctional protein phosphoglucose isomerase (ref. 31). This protein catalyzes the interconversion of D-glucose-6-phosphate and D-fructose-6-phosphate. It is identical to neuroleukin, a protein secreted by T cells that promotes the survival of some embryonic spinal neurons and sensory nerves. It is also identical to an autocrine motility factor that might be involved in metastasis, and to a differentiation and maturation mediator implicated in
the in vitro differentiation of human myeloid leukemia HL-60 cells to terminal monocytes. This therefore appears to be a single soluble protein that can take on four distinct cellular roles.

A more extreme example of one protein being used in alternative contexts involves an outright phase shift: the proteins known as α-enolase and τ-crystallin are encoded by a single gene and have the same amino-acid sequence. In the liver, the protein functions as α-enolase, a soluble glycolytic enzyme, whereas within the lens of the eye, it functions as τ-crystallin, a structural protein (ref. 32). Proteins for which alternative functions have been identified have been given the playful name ‘moonlighting proteins’ (see ref. 33 for a review).

Why is this biological finding important to anyone who uses comparative sequence information? In the early days of sequence comparison, it was assumed that if a sequence of unknown function matched a sequence of known function, one knew, by extension, the function of the unknown; the conclusions of many published papers were based on this assumption. In light of these and similar, more recent findings, does sequence similarity still imply common function? The answer is: maybe yes and maybe no. In any case, more evidence than just sequence similarity is needed to draw any conclusion about sequence function.

Moving up in conceptual complexity to the level of structure, an entire class of molecular modeling techniques is available to consider similarities between proteins whose relationship might not be obvious from looking strictly at the nucleotide or amino-acid sequence. The reason one would want to perform such analyses was stated early in a relatively short history of bioinformatics (ref. 34): structure is conserved to a greater extent than sequence. This stands to reason, as there is evolutionary pressure to maintain the three-dimensional shape of proteins, particularly those critical to the basic functions of a cell.

Inferring common function from structural similarity, however, is more problematic. Consider the TIM barrel. It defines a structural superfamily whose members show a high degree of structural similarity over a substantial number of residues. The TIM-barrel fold is a good example of possible divergent evolution, because this same basic structure mediates a wide variety of chemical reactions critical to biological survival. The TIM barrel is associated with one non-enzymatic and fifteen enzymatic functions (ref. 35), and transcripts encoding TIM-barrel proteins account for over 8% of the yeast transcriptome (ref. 36). The roles of TIM-barrel proteins are diverse, ranging from isomerases to oxidoreductases and hydrolases. This generic versatility is economical for the cell but can make the job of assigning function to structures or substructures difficult. In deciding whether structural similarity implies common function, one needs to consider the subcellular localization of the proteins, when they are expressed, and the presence or absence of cofactors that might significantly alter their structure.

A final point to be considered relates to annotations in the public databases. Although these are of great value, most are made in an automated fashion, without the benefit of human curation. This is a matter of practicality, as it would be difficult to verify every annotation in the human genome, let alone those of every sequenced organism. Although some sequence-based annotations, such as the positions of genome, are determined experimentally and are therefore quite reliable, others are no more than predictions. The most notable of these are the predictions of gene structure that can be found at the NCBI, Ensembl and UCSC. Question 7 in this guide provides an excellent example of inconsistencies in gene predictions obtained using different methods; the user should use such information carefully, particularly when designing experiments.

The second type of annotation, functional annotation, can be even more problematic. Even when similarity can be reliably detected, the functional annotations currently found in the public databases are often incorrect. For example (ref. 37), the functional annotations of 340 Mycoplasma genes were assessed: 8% were found to be incorrect, and, in many cases, did not logically connect to the known biology and metabolism of Mycoplasma. So never use database annotation as evidence of function when there are few homologs or when the annotations are inconsistent between homologs. And remember that annotations are intransitive (ref. 38): if protein A and protein B share a common functional annotation, and so do proteins B and C, proteins A and C do not necessarily have the same function. Use functional annotations as a first step, and confirm the annotations by going back into the primary literature.

Biology is complex, and we still do not understand it very well. Although performing searches and finding data are not difficult, the intelligent use of all of the accumulated facts from databases is. It is always necessary to take a step backwards and ask a very simple question: do the search results actually make biological sense? Even when one is able to make biological sense of a prediction of function, it may turn out to be incorrect. As science is increasingly undertaken in a ‘sequence-based’ fashion, using sequence data to underpin the experimental design and interpretation of experiments, it becomes increasingly important that computational results are cross-checked in the laboratory, against the literature and with more robust computational analysis, so that the conclusions not only make sense, but are also correct.

Acknowledgments
David Haussler (University of California, Santa Cruz), Ewan Birney (The Wellcome Trust Sanger Institute) and David J. Lipman (National Center for Biotechnology Information) served as advisors during the development of this guide.

The authors would also like to thank the following people for their contributions: K.N. Lazarides, S.K. Loftus, E.H. Margulies, K.L. Mohlke, P.M. Pollock, R.B. Sood and J.W. Touchman (National Human Genome Research Institute); D. Karolchik and J. Kent (University of California, Santa Cruz); D. Church and K. Pruitt (National Center for Biotechnology Information); and M. Hammond and E. Schmidt (Ensembl).
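
The ‘short-motif’ pitfall raised in the commentary can be made concrete with a small back-of-the-envelope calculation. The Python sketch below estimates how many exact matches to a motif would be expected purely by chance. It rests on a deliberately crude assumption that is not part of the guide: every residue is drawn independently and uniformly from the 20 amino acids. The database size and mean protein length in the usage lines are likewise hypothetical round numbers, and the function names are invented for this illustration.

```python
"""Expected chance matches to a short sequence motif (illustrative only)."""


def expected_chance_hits(protein_length, motif_length, alphabet_size=20):
    """Expected number of exact motif matches in one random protein.

    Simplifying assumption: residues are independent and uniformly
    distributed over `alphabet_size` amino acids.
    """
    if motif_length <= 0 or protein_length < motif_length:
        return 0.0
    positions = protein_length - motif_length + 1      # possible start sites
    p_match = (1.0 / alphabet_size) ** motif_length    # match chance per site
    return positions * p_match


def expected_hits_in_database(n_proteins, mean_length, motif_length):
    """Scale the per-protein expectation to a whole database."""
    return n_proteins * expected_chance_hits(mean_length, motif_length)


if __name__ == "__main__":
    # Hypothetical database: 100,000 proteins averaging 400 residues.
    for k in (4, 8, 12):
        hits = expected_hits_in_database(100_000, 400, k)
        print(f"motif length {k:2d}: about {hits:.3g} chance matches")
```

Under these assumptions a four-residue motif is expected to turn up by chance a few hundred times in such a database, whereas an eight-residue motif is expected far less than once. Real sequences are not uniform and real motifs often tolerate substitutions, so the figures indicate scale only, but they show why a short match on its own says little about function.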


Web resources: Internet resources featured in this guide


doi:10.1038/ng981

Major Genome Browsers

Ensembl
http://www.ensembl.org

NCBI Map Viewer
http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/map_search

UCSC Genome Browser
http://genome.ucsc.edu

Additional Genome Browsers
In addition to the genome browsers discussed in this Guide, the reader may find these additional views of the human genome sequence helpful. Each of these sites provides documentation on its scope of coverage and how to examine the data housed at that site.

Celera
http://www.celera.com/genomics/academic/home.cfm

ORNL Genome Channel
http://compbio.ornl.gov/channel/

RIKEN Genomic Sciences Center
http://hgrep.ims.u-tokyo.ac.jp/

Genome annotation
The following sites provide detailed information on annotations at each of the three major genome portals.

Distributed Annotation System
http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/EnsemblDAS.html

Ensembl Science Documentation
http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/ScienceDocumentation.html

NCBI Contig Assembly and Annotation Process
http://www.ncbi.nlm.nih.gov/genome/guide/build.html

UCSC Annotation Database
http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html

Human Genome Hub and Genome Central
These sites provide jumping-off points to major genome-based web sites. Resources available include trace data archives, access to cDNA and expressed sequence tag data and mapping information used to produce genome assemblies. The web sites of the individual members of the International Human Genome Sequencing Consortium may be accessed through these sites.

Ensembl Human Genome Central
http://www.ensembl.org/genome/central/

NCBI Human Genome Central
http://www.ncbi.nlm.nih.gov/genome/guide/central.html

NHGRI Genome Hub
http://www.nhgri.nih.gov/genome_hub.html

UK HGMP GenomeWeb
http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html

Major public sequence databases
Each of these databases belongs to the International Nucleotide Sequence Database Collaboration. Although all three centers provide separate mechanisms for sequence submission by individual investigators, they exchange data daily. As each member database stores and presents the underlying data using a slightly different format, this data exchange makes all known nucleotide and protein sequence data available to all users, regardless of which of the three databases is queried.

DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp

EMBL Nucleotide Sequence Database
http://www.ebi.ac.uk/embl/index.html

GenBank
http://www.ncbi.nlm.nih.gov

Expressed sequence tag clustering databases
The ability to bring together expressed sequence tag, mRNA and other related sequences into gene-oriented clusters often facilitates genomic analysis, since the method groups individual sequences that most likely arise from the same gene or transcript. These three databases provide gene-oriented views of the data, using different algorithms in calculating the individual gene clusters.

STACK
http://www.sanbi.ac.za/Dbases.html

TIGR Gene Indices
http://www.tigr.org/tdb/tgi.shtml

UniGene
http://www.ncbi.nlm.nih.gov/UniGene

Human genetic and physical maps
The databases listed below represent a significant portion of the data underlying current human genome assemblies. Many of these data are available through DDBJ/EMBL/GenBank, but each database contains additional information regarding clones, constructs and similar that is not available through the major sequence repositories. A more extensive list of human genetic and physical maps can also be found through the online Nucleic Acids Research Database Collection, at http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1.

Bacterial artificial chromosome and accession maps
http://genome.wustl.edu/projects/human/index.php?fpc=1

GenAtlas
http://www.citi2.fr/GENATLAS/

Genebridge4 radiation hybrid maps
http://www.sanger.ac.uk/Software/RHserver/RHserver.shtml

GeneMap ’99
http://www.ncbi.nlm.nih.gov/genemap99

GenMapDB
http://genomics.med.upenn.edu/genmapdb

Généthon linkage map
http://www.genethon.fr/index_en.html

HuGeMap
http://www.infobiogen.fr/services/Hugemap

Marshfield genetic maps
http://research.marshfieldclinic.org/genetics/Map_Markers/maps/IndexMapFrames.html

RHdb
http://corba.ebi.ac.uk/RHdb

Stanford G3 and TNG radiation hybrid maps
http://www-shgc.stanford.edu/RH/

Genomic Databases and Resources
In addition to the databases listed in the section above, there are numerous useful databases containing human mutation, variation, medical or expression data. This short list is offered as a representative cross-section of the types of database freely available to genome researchers. The reader is referred to the ‘lists of lists’ found at the Human Genome Hub and Genome Central sites for a more extensive catalog of available resources.

Cancer Genome Anatomy Project (CGAP)
http://www.ncbi.nlm.nih.gov/CGAP/

Genome DataBase (GDB)
http://www.gdb.org

HUGO Gene Nomenclature
http://www.gene.ucl.ac.uk/nomenclature

Online Mendelian Inheritance in Man (OMIM)
http://www.ncbi.nlm.nih.gov/Omim

SNP Consortium
http://snp.cshl.org

Sequence-based searching
The following links provide access to the most frequently used tools for performing sequence-based comparisons to human genome data. An extensive list of sequence similarity search tools can be found on the ExPASy web site, at http://us.expasy.org/tools/.

BLAST
http://www.ncbi.nlm.nih.gov/BLAST/

BLAT
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

Ensembl BLAST
http://www.ensembl.org/Homo_sapiens/blastview

SSAHA
http://www.ensembl.org/Homo_sapiens/ssahaview

Model organism databases
This list represents a small subset of the sequencing initiatives on model organisms. Additional information on the progress of numerous model organism sequencing initiatives can be found on the Model Organisms for Biomedical Research web page, at http://www.nih.gov/science/models/. A more extensive list of organismal databases can also be found through the online Nucleic Acids Research Database Collection, at http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1.

Arabidopsis thaliana

The Arabidopsis Information Resource
http://www.arabidopsis.org

Arabidopsis Genome Initiative
http://mips.gsf.de/proj/thal/db/

Caenorhabditis elegans

AceDB
http://www.acedb.org

WormBase
http://www.wormbase.org/

Drosophila melanogaster

Berkeley Drosophila Genome Project
http://www.fruitfly.org/

FlyBase
http://flybase.bio.indiana.edu/

Escherichia coli

EcoGene
http://bmb.med.miami.edu/EcoGene/EcoWeb/

Microbial Genomes

Comprehensive Microbial Resource
http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl

TIGR Microbial Database
http://www.tigr.org/tdb/mdb/

Mouse

Mouse Genome Database/Informatics
http://www.informatics.jax.org/

Rat

Rat Genome Database
http://rgd.mcw.edu

Yeast

Comprehensive Yeast Genome Database
http://mips.gsf.de/proj/yeast/CYGD/db/

Saccharomyces Genome Database
http://genome-www.stanford.edu/Saccharomyces/

S. pombe Genome Sequencing Project
http://www.sanger.ac.uk/Projects/S_pombe/

Zebrafish

Zebrafish Information Network
http://zfin.org

Ethical, legal and social issues

Although this guide has focused on the mechanics of accessing and using human genome data, it is important to remember that ethical, legal and social issues (ELSI) are becoming increasingly important in this age of genetic and genomic research. The following web sites provide an introduction to important issues related to genome biology as applied to human health and provide a jumping-off point for further information.

DOE ELSI Program
http://www.ornl.gov/hgmis/elsi/elsi.html

Lawrence Berkeley National Laboratory
http://www.lbl.gov/Education/ELSI/

NHGRI ELSI Program
http://www.nhgri.nih.gov/ELSI/

Genetic education

The following sites present basic information on genetics and genomics, much of which is appropriate for elementary and secondary school education, as well as for the college level. Many of these sites offer teaching plans, graphics and other teaching resources that can be freely used in the classroom or lecture hall.

Access Excellence
http://www.accessexcellence.org/

Department of Energy education resources
http://www.ornl.gov/hgmis/education/education.html

Genetics Education Center
http://www.kumc.edu/gec/

NHGRI Exploring our Molecular Selves Multimedia Kit
http://www.genome.gov/Pages/EducationKit/

NHGRI Glossary of Genetic Terms
http://www.genome.gov/glossary.cfm
