2006 09 01 - Lect01 - ch1 2 PDF

Introduction to Bioinformatics
260.602.01 September 1, 2006
Jonathan Pevsner, Ph.D. pevsner@kennedykrieger.org
Teaching assistants
Hugh Cahill (hugh@jhu.edu) Jennifer Turney (jturney@jhsph.edu) Meg Zupancic (mzupanc1@jhmi.edu)
Who is taking this course?
People with very diverse backgrounds in biology
People with diverse backgrounds in computer science and biostatistics
Most people have a favorite gene, protein, or disease
What are the goals of the course?

To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
To focus on the analysis of DNA, RNA and proteins
To introduce you to the analysis of genomes
To combine theory and practice to help you solve research problems
Themes throughout the course

Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
Textbook
The course textbook is J. Pevsner, Bioinformatics and Functional Genomics (Wiley, 2003). The chapters contain content, lab exercises, and quizzes that were developed in this course over the past six years. For those of you who do not want to buy a copy, a few copies are available on reserve at Welch Library (go up to the 2nd floor), and the library has six more copies.
Several other bioinformatics texts are available:
Baxevanis and Ouellette
David Mount
Durbin et al.
Web sites
The course website is reached via: http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm (or Google pevsnerlab courses)
This site contains the powerpoints for each lecture.
The textbook website is: http://www.bioinfbook.org
This has 1000 URLs, organized by chapter. This site also contains the same powerpoints.
The weekly quizzes are on my website: http://pevsnerlab.kennedykrieger.org/moodle
Once you log in and take a quiz, you will get instant feedback. You can use moodle to ask questions as well.
Literature references
You are encouraged to read original source articles. They will enhance your understanding of the material. Reading will be assigned.
Themes throughout the course: gene/protein families

We will use retinol-binding protein 4 (RBP4) as a model gene/protein throughout the course. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study it in a variety of contexts including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
We will also use other examples, such as the globins and the pol protein of HIV-1.
The HIV-1 pol gene encodes three proteins
Aspartyl protease PR
Reverse transcriptase RT
Integrase IN
Themes throughout the course: computer labs
There is a computer lab each Friday. This is a chance to gain practical experience using a variety of web resources. You can do the lab on your own, ahead of time. However, during the Friday lab you can get help on problems, and in some cases the computers will have specialized software.
Grading
40% ten moodle quizzes (corresponding to chapters 2-11)
30% final exam October 25 (in class)
30% discovery of a novel gene:
--Find the novel gene by the end of September, and turn in the final report, with phylogenetic tree, by October 25
--Instructions are posted on the course website
--We will discuss this project in detail in the next two weeks.
Grading
Quizzes are taken at the moodle website, and are due one week after the relevant lecture
4% Chapter 2 quiz (sequences)
4% Chapter 3 quiz (alignment)
4% Chapter 4 quiz (BLAST)
4% Chapter 5 quiz (advanced BLAST)
4% Chapter 6 quiz (RNA)
4% Chapter 7 quiz (microarrays)
4% Chapter 8 quiz (proteomics)
4% Chapter 9 quiz (protein structure)
4% Chapter 10 quiz (multiple alignment)
4% Chapter 11 quiz (phylogeny)
(the ten quizzes together count for 40%)
30% find-a-gene project (due October 25)
30% final exam October 25 (in class)
Outline for today (chapters 1 and 2)

Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Four ways to find information on proteins and DNA
Access to biomedical literature
What is bioinformatics?
Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Top ten challenges for bioinformatics

[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
Top ten challenges for bioinformatics

[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett
[Diagram labels: bioinformatics; medical informatics; public health informatics; tool-users; tool-makers (databases, infrastructure, algorithms)]
Three perspectives on bioinformatics
The cell
The organism
The tree of life
Page 4
[Figure: DNA, RNA, protein, phenotype]
Page 5
[Figure: time of development; body region, physiology, pharmacology, pathology]
Page 5
[Figure: after Pace NR (1997) Science 276:734]
Page 6
[Figure: DNA, RNA, protein, phenotype]
Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) by year, 1982 to 2002]
Updated 8-12-04: >40 billion base pairs
Fig. 2.1 Page 17
Growth of GenBank
[Figure: sequences (millions, axis 0 to 70) by year, December 1982 to June 2006]
Growth of the International Nucleotide Sequence Database Collaboration

[Figure: base pairs contributed by GenBank, EMBL, and DDBJ]
http://www.ncbi.nlm.nih.gov/Genbank/
Central dogma of molecular biology
DNA -> RNA -> protein
genome, transcriptome, proteome
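The flow from DNA to RNA to protein can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy sequence is made up (it is not RBP4), the input is taken to be the coding strand, and the codon table is deliberately partial, covering only the four codons that appear in the example.

```python
# Partial codon table (standard genetic code), limited to the codons
# used in the toy example below. "*" marks a stop codon.
CODON_TABLE = {"AUG": "M", "GAA": "E", "UGG": "W", "UAA": "*"}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA that corresponds to a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


toy_dna = "ATGGAATGGTAA"  # hypothetical sequence for illustration only
toy_rna = transcribe(toy_dna)
toy_protein = translate(toy_rna)
print(toy_rna, toy_protein)
```

A real analysis would use the full 64-codon table; the partial table keeps the sketch short.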
Central dogma of bioinformatics and genomics
DNA -> RNA -> protein -> phenotype
DNA: genomic DNA databases
RNA: cDNA, ESTs, UniGene
protein: protein sequence databases
Fig. 2.2 Page 20
There are three major public DNA databases
EMBL
GenBank
DDBJ
The underlying raw DNA sequences are identical

Page 16
There are three major public DNA databases
EMBL: housed at EBI (European Bioinformatics Institute)
GenBank: housed at NCBI (National Center for Biotechnology Information)
DDBJ: housed in Japan
Page 16
>100,000 species are represented in GenBank
all species    128,941
viruses          6,137
bacteria        31,262
archaea          2,100
eukaryota       87,147
Table 2-1 Page 17
Taxonomy nodes at NCBI, 8/06
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
The most sequenced organisms in GenBank

Homo sapiens               10.7 billion bases
Mus musculus               6.5b
Rattus norvegicus          5.6b
Danio rerio                1.7b
Zea mays                   1.4b
Oryza sativa               0.8b
Drosophila melanogaster    0.7b
Gallus gallus              0.5b
Arabidopsis thaliana       0.5b
Updated 8-12-04 GenBank release 142.0
Table 2-2 Page 18

Homo sapiens               11.2 billion bases
Mus musculus               7.5b
Rattus norvegicus          5.7b
Danio rerio                2.1b
Bos taurus                 1.9b
Zea mays                   1.4b
Oryza sativa (japonica)    1.2b
Xenopus tropicalis         0.9b
Canis familiaris           0.8b
Drosophila melanogaster    0.7b
Updated 8-29-05 GenBank release 149.0
Table 2-2 Page 18

Homo sapiens                     12.3 billion bases
Mus musculus                     8.0b
Rattus norvegicus                5.7b
Bos taurus                       3.5b
Danio rerio                      2.5b
Zea mays                         1.8b
Oryza sativa (japonica)          1.5b
Strongylocentrotus purpuratus    1.2b
Sus scrofa                       1.0b
Xenopus tropicalis               1.0b
Updated 7-19-06 GenBank release 154.0
Table 2-2 Page 18
National Center for Biotechnology Information (NCBI)
www.ncbi.nlm.nih.gov
Page 24
Fig. 2.5 Page 25
PubMed is
National Library of Medicine's search service
16 million citations in MEDLINE
links to participating online journals
PubMed tutorial (via Education on side bar)
Page 24
Entrez integrates
the scientific literature;
DNA and protein sequence databases;
3D protein structure data;
population study data sets;
assemblies of complete genomes
Page 24
Entrez is a search and retrieval system that integrates NCBI databases
Page 24
BLAST is
Basic Local Alignment Search Tool
NCBI's sequence similarity search tool
supports analysis of DNA and protein databases
80,000 searches per day
Page 25
OMIM is
Online Mendelian Inheritance in Man
catalog of human genes and genetic disorders
edited by Dr. Victor McKusick, others at JHU
Page 25
Books is
searchable resource of on-line books
Page 26
TaxBrowser is
browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
taxonomy information such as genetic codes
molecular data on extinct organisms
Page 26
Structure site includes

Molecular Modeling Database (MMDB)
biopolymer structures obtained from the Protein Data Bank (PDB)
Cn3D (a 3D-structure viewer)
vector alignment search tool (VAST)
Page 26
Accessing information on molecular sequences
Page 26
Accession numbers are labels for sequences

NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
Page 26
What is an accession number?

An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record
[Slide labels grouping the examples: DNA, RNA, protein]
Page 27
Four ways to access DNA and protein sequences

[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System (separate from NCBI)
Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31).
Page 27
4 ways to access protein and DNA sequences

[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).
Page 27
From the NCBI home page, type rbp4 and hit Go
revised Fig. 2.7 Page 29
By applying limits, there are now just two entries
Fig. 2.9 Page 32
FASTA format
Fig. 2.10 Page 32
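A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. As a minimal sketch of reading that layout, the parser below collects each record into a dictionary. The function name and the example record are hypothetical, and the sequence is made up rather than taken from any database entry.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


# Hypothetical record for illustration; not a real database entry.
example = """>example_1 made-up protein
MEWKLVAG
TTPQ
"""
parsed = parse_fasta(example)
print(parsed)
```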
Entrez Gene (top of page)
Note that links to many other RBP4 database entries are available
revised Fig. 2.8 Page 30
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
NCBI's important RefSeq project: best representative sequences

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence. RefSeq identifiers include the following formats:
Complete genome       NC_######
Complete chromosome   NC_######
Genomic contig        NT_######
mRNA (DNA format)     NM_######   e.g. NM_006744
Protein               NP_######   e.g. NP_006735
Page 29-30
NCBI's RefSeq project: accession for genomic, mRNA, protein sequences

Accession          Molecule   Method            Note
AC_123456          Genomic    Mixed             Alternate complete genomic
AP_123456          Protein    Mixed             Protein products; alternate
NC_123456          Genomic    Mixed             Complete genomic molecules
NG_123456          Genomic    Mixed             Incomplete genomic regions
NM_123456          mRNA       Mixed             Transcript products; mRNA
NM_123456789       mRNA       Mixed             Transcript products; 9-digit
NP_123456          Protein    Mixed             Protein products;
NP_123456789       Protein    Curation          Protein products; 9-digit
NR_123456          RNA        Mixed             Non-coding transcripts
NT_123456          Genomic    Automated         Genomic assemblies
NW_123456          Genomic    Automated         Genomic assemblies
NZ_ABCD12345678    Genomic    Automated         Whole genome shotgun data
XM_123456          mRNA       Automated         Transcript products
XP_123456          Protein    Automated         Protein products
XR_123456          RNA        Automated         Transcript products
YP_123456          Protein    Auto. & Curated   Protein products
ZP_12345678        Protein    Automated         Protein products
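Because the RefSeq prefix encodes the molecule type, the table can be turned into a simple lookup. The sketch below is only an illustration: the function name is hypothetical, the mapping copies the Molecule column of the table, and anything without a recognized two-letter prefix and underscore (such as the GenBank accession X02775) is reported as not RefSeq.

```python
from typing import Optional

# Prefix -> molecule type, copied from the Molecule column of the table.
REFSEQ_PREFIXES = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq-style accession, else None."""
    prefix, underscore, rest = accession.partition("_")
    if not underscore or not rest:
        return None  # no "PREFIX_digits" shape, so not RefSeq-style
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775", "1KT7"]:
    print(acc, refseq_molecule(acc))
```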
Four ways to access DNA and protein sequences

Page 31
[Figure: DNA, RNA, protein; complementary DNA (cDNA); UniGene]
Fig. 2.3 Page 23
UniGene: unique genes via ESTs

Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster size is 1
Fig. 2.3 Page 23
Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the cluster size is 10
Cluster sizes in UniGene (human)
Cluster size (ESTs)    Number of clusters
1                      42,800
2                      6,500
3-4                    6,500
5-8                    5,400
9-16                   4,100
17-32                  3,300
500-1000               2,128
2000-4000              233
8000-16,000            21
16,000-30,000          8
UniGene build 194, 8/06
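As a quick piece of arithmetic on the cluster-size table, the sketch below totals the listed bins and reports what share are single-EST clusters. The variable names are hypothetical, and the total covers only the bins shown on the slide, which skips some size ranges, so it is not a count of all UniGene clusters.

```python
# Number of human UniGene clusters per cluster-size bin, as listed above.
cluster_counts = {
    "1": 42800, "2": 6500, "3-4": 6500, "5-8": 5400, "9-16": 4100,
    "17-32": 3300, "500-1000": 2128, "2000-4000": 233,
    "8000-16,000": 21, "16,000-30,000": 8,
}

# Total over the listed bins only (the slide omits intermediate bins).
listed_total = sum(cluster_counts.values())
singleton_fraction = cluster_counts["1"] / listed_total

print(f"{listed_total} clusters in the listed bins; "
      f"{singleton_fraction:.1%} contain a single EST")
```

Most listed clusters are singletons, while only a handful contain thousands of ESTs.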
UniGene: unique genes via ESTs

Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further on September 20 (gene expression).
Page 31
Four ways to access DNA and protein sequences

Page 31
Ensembl to access protein and DNA sequences

Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
click human
enter RBP4
Four ways to access DNA and protein sequences

Page 33
ExPASy to access protein and DNA sequences

ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/
Page 33
Fig. 2.11 Page 33
Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Page 34
Searching for HIV-1 pol: Following the genome link yields a manageable three results
Page 34
Example of how to access sequence data: HIV-1 pol
For the Entrez query: hiv-1 pol there are about 40,000 nucleotide or protein records (and >100,000 records for a search for hiv-1), but these can be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
Page 34
over 100,000 nucleotide entries for HIV-1
only 1 RefSeq
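The same two narrowing steps can be expressed as a single query string. The slides apply them through the Entrez web page (the organism field plus Limits); the sketch below instead assumes NCBI's E-utilities esearch endpoint and the srcdb_refseq[prop] filter as a stand-in for the RefSeq limit, neither of which appears in the lecture. The function name is hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism: str, refseq_only: bool = True,
                      db: str = "nucleotide") -> str:
    """Build an esearch URL restricted by organism and, optionally, RefSeq."""
    term = f"{organism}[organism]"                 # step 1: specify organism
    if refseq_only:
        term += " AND srcdb_refseq[prop]"          # step 2: limit to RefSeq
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


url = build_esearch_url("hiv-1")
print(url)
```

Pasting the printed URL into a browser would return the matching record identifiers as XML; the counts will differ from the 2006 figures quoted on the slide.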
Examples of how to access sequence data: histone

query for histone          # results
protein records            21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
8-12-06
Access to Biomedical Literature
Page 35
PubMed at NCBI to find literature information
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966.
Page 35
MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.
Page 35
PubMed search strategies

Try the tutorial (education on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
Try using limits
Try Links to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
1 AND 2: lipocalin AND disease (60 results)
1 OR 2: lipocalin OR disease (1,650,000 results)
1 NOT 2: lipocalin NOT disease (530 results)
8/04
Fig. 2.12 Page 34
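The three Boolean operators behave like set operations on the articles matching each term, which is why OR returns the most results and AND the fewest. The sketch below uses tiny made-up sets of placeholder article identifiers (not real PubMed IDs) to show the correspondence.

```python
# Hypothetical article identifiers matching each search term.
lipocalin = {"article_a", "article_b", "article_c"}
disease = {"article_c", "article_d"}

both = lipocalin & disease            # lipocalin AND disease (intersection)
either = lipocalin | disease          # lipocalin OR disease (union)
only_lipocalin = lipocalin - disease  # lipocalin NOT disease (difference)

print(sorted(both))
print(sorted(either))
print(sorted(only_lipocalin))
```

Note that the AND and NOT results partition the lipocalin set, so their sizes add up to the number of lipocalin articles.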
Search result          Article contents: globin is present           Article contents: globin is absent
globin is found        true positive                                 false positive (article does not discuss globins)
globin is not found    false negative (article discusses globins)    true negative
8/06
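The four outcomes in the table follow from two yes/no facts about each article: whether the search found it and whether it really discusses globins. As a minimal sketch, the function below assigns the label and tallies a small made-up list of outcomes; the function name and the example data are hypothetical.

```python
from collections import Counter


def classify(found: bool, discusses_globin: bool) -> str:
    """Label a search outcome using the four categories in the table."""
    if found:
        return "true positive" if discusses_globin else "false positive"
    return "false negative" if discusses_globin else "true negative"


# Made-up outcomes: (found by the search, article discusses globins).
articles = [(True, True), (True, False), (False, True),
            (False, False), (True, True)]
tally = Counter(classify(found, discusses) for found, discusses in articles)
print(tally)
```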
WelchWeb is available at http://www.welch.jhu.edu
Brian Brown (bbrown20@jhmi.edu) and Carrie Iwema (iwema@jhmi.edu) are the Welch Medical Library liaisons to the basic sciences
Course sponsors
Dept. of Molecular Microbiology & Immunology, and Dept. of Biostatistics, School of Public Health
