CMSC 838T - Lecture 9: Bioinformatics Databases

Bioinformatics databases
0 Organization & classification of bioinformatic data
0 Identification, format, & retrieval of bioinformatic data
Entrez search & retrieval of linked databases
Mapviewer search & retrieval by chromosome position
What Is a Database?
Computerized storehouse of data (records)
Allows
0 User-defined queries
0 Extraction of specified records
0 Adding, changing, removing, & merging records
Uses standardized formats
Database Models
Defines data organization (schema)
Relational
0 Entities and relationships stored in tables
0 Predefined schema
0 Examples: Oracle, DB2, MySQL, PostgreSQL
Object-oriented
0 Stores data as objects (i.e., structures with predefined type)
0 Examples: Versant, Jasmine, Objectivity
Semi-structured
0 Schema dynamically defined within data (self-describing)
0 Flexible description of data with complex relationships
0 Example: XML databases
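The difference between a predefined schema and a self-describing record can be sketched in a few lines. This is only an illustration: the table name `sequence`, its columns, and the XML tag names are hypothetical, and SQLite stands in for the relational systems listed above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational: the schema (table and columns) is fixed before any data is stored.
# Table and column names here are hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE sequence (accession TEXT PRIMARY KEY, organism TEXT, length_bp INTEGER)"
)
connection.execute(
    "INSERT INTO sequence VALUES (?, ?, ?)", ("U35641", "Mus musculus", 5538)
)
# A user-defined query extracts only the specified records and fields.
rows = connection.execute(
    "SELECT accession, length_bp FROM sequence WHERE organism = ?", ("Mus musculus",)
).fetchall()
print(rows)

# Semi-structured: the record carries its own field names, so the structure
# is discovered by reading the data rather than declared in advance.
record = ET.fromstring(
    "<sequence><accession>U35641</accession>"
    "<organism>Mus musculus</organism><length_bp>5538</length_bp></sequence>"
)
described_fields = [child.tag for child in record]
print(described_fields)
```

In the relational case an unknown column is an error; in the semi-structured case a new tag simply appears in the data.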
Bioinformatic Databases
Outline
0 Issues
0 Databases
0 Identifiers & formats
0 Searching databases
Useful information
0 DNA sequences
0 Conserved DNA domains
0 Genomes
0 Gene expression (ESTs, microarrays)
0 Protein sequences
0 Protein 3D structure
0 Protein families
0 Mutations / polymorphisms / SNPs
0 Metabolic pathways
0 Chemical compounds (ligands)
0 Biomedical literature (journal papers, online books…)
Classification schemes
0 Database design – relational, object-oriented…
0 Data type – DNA, RNA, EST, protein…
0 Organism – bacteria, virus, human…
0 Accessibility – public, academic, commercial
0 Data source – primary, derived
0 Data entry – manually curated, computationally derived
0 Focus – sequence-oriented, gene-oriented
Resulting in many bioinformatic databases…
Bioinformatic Database Issues
Naming
0 Multiple names for same chemical
0 Arising from multiple biological disciplines, conventions
0 Example

Redundancy
0 Multiple entries for same DNA / protein sequence
0 Arising from multiple experiments & biological disciplines
0 Example – redundant GenBank entries for E. coli dUTPase
z 4 separate publications (X01714, V01578, L10328, AE000441)
Data annotation & formats

0 Multiple data for single gene
z Sequence, location, expression, structure, function…
0 Resulting in multiple data annotations & formats

Data integration
0 Combining data from multiple bioinformatic databases
Major Bioinformatic Databases

DNA sequences
0 GenBank, RefSeq, UniGene
Protein sequences
0 Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
Protein structure
0 Protein Data Bank (PDB)
Gene expression
0 Gene Expression Omnibus (GEO)
Biomedical publications
0 PubMed / MedLine
Bioinformatic Data Sources
Primary databases
0 Original submissions by researchers
0 Staff organizes information only
0 Generally sequence oriented
0 Examples
z GenBank, PDB

Derived databases
0 Compiled from data in primary databases
0 Manually curated (human selection & correction)
z Advantages – high quality
z Disadvantages – high expense, low volume
z Examples
Swiss-Prot, PIR-PSD, RefSeq
0 Computational derivation (automatically generated)

z Advantages – inexpensive, up-to-date
z Disadvantages – lower quality
z Examples
GenPept, TrEMBL, UniGene, COGs
Bioinformatics Databases and Tools
CS426: Introduction to Computational Biology, Section Week 3
Why would we use such databases

When obtaining a new DNA sequence, one needs to know whether it has already been deposited in the databanks, fully or partially, or whether the databanks contain any homologous sequences (sequences descended from a common ancestor).
Some of the databases contain annotation that has already been added to a specific sequence. Finding annotation for the searched sequence or its homologous sequences can facilitate research on it.
Find similar non-coding DNA stretches in the database, for instance repeat elements or regulatory sequences.
Other, more specific uses, such as locating false priming sites for a set of PCR oligonucleotides.
Search for homologous proteins: proteins similar in sequence and therefore presumably also in folding, structure, or function.
Primary sequence databases

There are several problems with databases today:
Databases are regulated by users rather than by a central body (except for Swiss-Prot).
Only the owner of the data can change it.
Sequences are not up to date.
Large degree of redundancy within and between databases.
Lack of a standard for fields or annotation.
List of primary sequence databases and their locations.
DNA Databases (Nucleotide Sequences)
Growing faster than the protein databases.
Largest databases: GenBank (US), EMBL (Europe - UK), DDBJ (Japan).
DNA Databases

EMBL: http://www.ebi.ac.uk/embl/
EMBL is a DNA sequence database from the European Bioinformatics Institute (EBI).
EMBL includes sequences from direct submissions, genome sequencing projects, scientific literature, and patent applications.
Its growth is exponential.
It supports several retrieval tools: SRS for text based retrieval, and BLAST and FASTA for sequence based retrieval.
DNA Databases

GenBank: http://www.ncbi.nlm.nih.gov/
A sequence database from the National Center for Biotechnology Information (NCBI).
It incorporates sequences from publicly available sources.
Protein Databases

PIR – Protein Sequence Database: http://pir.georgetown.edu/
Was developed in the early 1960’s.
It has four sections:
PIR1 - fully classified and annotated entries.
PIR2 - preliminary entries, not thoroughly reviewed.
PIR3 - unverified entries, not reviewed.
PIR4 - conceptual translations.
Bioinformatic Databases – GenBank
Database type
0 Nucleotide sequences
0 Primary database
Data combined from additional sources
0 European Molecular Biology Laboratory (EMBL)
0 DNA DataBank of Japan (DDBJ)
Current size
0 Release 134, Feb 2003
0 23,035,823 sequences
0 29,358,082,791 nucleotides

Types of submissions to database
0 Genomic DNA
z High quality complete DNA sequence
0 mRNA / cDNA
z Partial or complete mRNA (or reverse-transcribed cDNA)
0 Expressed sequence tag (EST)

z Single-pass partial cDNA sequences from mRNA
0 Sequence tagged sites (STS)

z Short DNA sequences unique in genome
0 Genomic survey sequence (GSS)

z Single-pass genomic DNA
0 Third-party annotations of GenBank sequences
Bioinformatic Databases – Proteins
Protein sequence databases
0 Once derived from laboratory experiments
0 Now mostly based on predicted ORFs from DNA
z Manual curation
z Computational derivation
Classification
0 Predicted protein
z No similarity match to protein of known function
z Match to EST
0 Hypothetical protein
z No match to EST
Bioinformatic Databases – Swiss-Prot, PIR-PSD

Database type
0 Protein sequences
0 Derived database
z Manually curated (non-redundant, annotated)
Many annotations
0 Functions of the protein
0 Post-translational modifications
z Phosphorylation, acetylation, GPI-anchor, etc…
0 Domains and sites

z Calcium binding regions, ATP-binding sites, zinc fingers…
0 Secondary & quaternary structure

0 Similarities to other proteins
0 Variants
Protein Databases
Swiss-Prot: http://us.expasy.org/sprot/
Established in 1986.
Provides high-level annotations, including description of protein function, structure of protein domains, post-translational modifications, variants, etc. It aims to be minimally redundant.
TrEMBL - Translated EMBL
Was created in 1996.
It contains translations of all coding sequences in the EMBL nucleotide sequence database.
SP-TrEMBL contains entries that will be incorporated into Swiss-Prot.
REM-TrEMBL contains entries that are not destined to be included in Swiss-Prot.
Protein Databases

GenPept: http://www.ncbi.nlm.nih.gov/
GenPept is a supplement to the GenBank nucleotide sequence database.
It contains translations of coding regions in GenBank entries.
NRL_3D: http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
Produced and maintained by PIR.
Contains sequences extracted from the Protein DataBank (PDB).
Protein Databases

Summary of protein sequence databases
PIR(1-4) - comprehensive, poor quality of annotation (even in PIR1).
Swiss-Prot - poor sequence coverage, highly structured, excellent annotation.
GenPept - most comprehensive, poor quality of annotation.
NRL_3D - least comprehensive, but relates directly to structural information.
Database searching

Text based search - searching the annotations. Examples: SRS, GCG’s Lookup, Entrez.
Sequence based search - searching the sequence itself. Examples: BLAST, FASTA, SW.
Swiss-Prot statistics
0 Release 41.2, March 2003
0 123,721 entries totaling 45,421,741 amino acids
0 Abstracted from 104,046 references
0 Average length 367 amino acids
PIR-PSD statistics
0 283,289 entries
Bioinformatic Databases – GenPept, TrEMBL

Database type
0 Protein sequences
0 Computationally derived database
z Predicted coding sequences (CDS) from GenBank, EMBL
z Candidate sequences for Swiss-Prot, not yet processed
GenPept
0 Release 134, February 2003
0 1,314,007 loci containing 407,394,800 residues
TrEMBL
0 Release 23, March 2003
0 921,952 sequences, 40,914,860 residues
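Bioinformatic Databases – Connections
[Diagram: data flow between sequence databases]
0 DNA sequences (submitted via Sequin & BankIt) and genome projects feed GenBank and EMBL/EBI
0 GenBank and EMBL/EBI entries are automatically translated into GenPept and TrEMBL
0 Manual curation & annotation, along with protein sequences from labs, produce PIR-PSD and Swiss-Prot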
Bioinformatic Databases – Protein Data Bank

Database type
0 Protein 3D structures
0 Primary database
Statistics
0 March 2003
0 20,473 proteins
[Figure: Folds & New Folds / Year]
Bioinformatic Databases – Pfam
Database type
0 Protein families
z Multiple alignments of protein domains, conserved regions
0 Derived database (from Swiss-Prot & TrEMBL)

z Pfam-A – manually curated (hand-edited MSA)
z Pfam-B – computationally derived
Non-overlapping families from PRODOM database
Statistics
0 Release 8.0, February 2003
0 5193 families in Pfam-A
0 Protein sequence coverage
z 73% at least one match in Pfam-A
z 20% at least one match in Pfam-B
Bioinformatic Databases – RefSeq

Database type
0 Nucleotide & protein sequences
0 Derived database
z Human curated (non-redundant, cross-linked)
Data in RefSeq
0 Genomic DNA contigs
0 mRNAs & proteins for known genes, gene models
0 Entire chromosomes
0 Multiple organisms
Statistics
0 March 2003
0 17,268 human loci, ~52,000 for all species
Bioinformatic Databases – UniGene
Database type
z Partitioned into non-redundant gene-oriented clusters
0 Gene-oriented view
Data in UniGene
0 Clusters of genomic DNA & ESTs
Statistics
0 March 2003
Bioinformatic Databases – Relative Sizes

[Figure: bar chart of DB size (# sequences) on a log scale from 1 to 100,000,000 for Pfam, PIR-PSD, RefSeq, PDB, TrEMBL, GenPept, GenBank, Swiss-Prot, and UniGene, grouped into computationally derived and manually curated databases]
Bioinformatic Databases – PubMed
Database type
0 Biomedical papers
0 Manually curated database
Data searched by PubMed
0 MedLine papers
0 Additional physics, chemistry, life science journals
0 Abstracts, citations, some full articles
0 Over 11 million journal articles dating back to 1960’s
Bioinformatic Databases – Others

Gene expression
0 ArrayExpress, Gene Expression Omnibus (GEO)
Multi-organism genomes
0 Entrez Genome, HomoloGene, COGs, TIGR
Genetic variation & genetic diseases
0 dbSNP, OMIM, CGAP
Metabolic pathways
0 WIT, KEGG
Many more…
0 Listed in journal “Nucleic Acids Research” each January
Text based retrieval tools
Entrez
http://www.ncbi.nlm.nih.gov/Entrez/
Developed at NCBI.
Entry point for exploring NCBI’s integrated databases.
Easy to use, but unlike SRS, the search is limited.
Sequence Based Searching

DNA search versus Protein search
A DNA sequence is a string of length n over an alphabet of size 4. Its protein translation is a string of length n/3 over an alphabet of size 20. Statistically, the expected number of random matches in some arbitrary database is larger for a DNA sequence.
DNA databases are much larger than protein databases, and they grow faster. This also means more random hits.
Translation of a DNA sequence to a protein sequence causes loss of information.
Protein sequences are more conserved biologically than DNA sequences.
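One way to read the alphabet-size argument is to compare chance matches of the same length in letters. The sketch below assumes uniformly random, independent letters and exact word matches, which is a simplification and not how BLAST or FASTA score hits; the word length and database size are arbitrary illustrative values.

```python
def expected_random_hits(alphabet_size: int, word_length: int, database_letters: int) -> float:
    """Expected number of exact chance occurrences of one fixed word
    in a database of uniformly random letters (simplifying assumption)."""
    positions = max(database_letters - word_length + 1, 0)
    return positions * alphabet_size ** (-word_length)

word_length = 11                # illustrative word length, in letters
database_letters = 10 ** 9      # illustrative database size, in letters

dna_hits = expected_random_hits(4, word_length, database_letters)
protein_hits = expected_random_hits(20, word_length, database_letters)

print(f"DNA (alphabet 4):      {dna_hits:.3g} expected chance hits")
print(f"Protein (alphabet 20): {protein_hits:.3g} expected chance hits")
```

Under these assumptions each matching letter is far less surprising over a 4-letter alphabet than over a 20-letter one, so a DNA word of a given length produces many more chance hits, and a larger database scales the count further.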
Bioinformatic Search Tools – Entrez
Search / retrieval tool for multiple linked databases
0 Papers – biomedical literature (PubMed)
0 Nucleotide – sequence database (GenBank)
0 Protein – sequence database
0 Structure – 3D macromolecular structures
0 Genome – complete genome assemblies
0 OMIM – Online Mendelian Inheritance in Man
0 Taxonomy – organisms in GenBank
0 ProbeSet – gene expression and microarray datasets

Mapviewer (component of Entrez Genome)
0 View & search complete genome by chromosome position
0 Display & zoom into chromosome maps
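Entrez searches can also be issued programmatically. The sketch below only builds a query URL for NCBI's E-utilities `esearch` endpoint, which the slides do not mention; the endpoint, the database name `nucleotide`, and the search term are assumptions for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities "esearch" (not named in the lecture).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Return a search URL for one Entrez database; the caller decides whether to fetch it."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{EUTILS_ESEARCH}?{query}"

# Illustrative query: mouse Brca1 records in the nucleotide database.
url = build_esearch_url("nucleotide", "Brca1 AND Mus musculus[Organism]", max_results=5)
print(url)
```

Switching the `database` argument (for example to `pubmed` or `protein`) is what makes this a single entry point to several linked databases.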
Text based retrieval tools
SRS (Sequence Retrieval System)
http://srs.ebi.ac.uk/
Developed by EBI
provides a homogeneous interface to over 80 biological
databases
includes databases of sequences, metabolic pathways,
transcription factors, application results (like BLAST,
SSEARCH, FASTA), protein 3-D structures, genomes,
mappings, mutations, and locus specific mutations.
Before entering a query, one selects one or more of the
databases to search.
It is possible to send the query results as a batch query
to a sequence search tool.
Bioinformatic Database Identifiers

Common identifiers for bioinformatic data
0 Locus name
0 Accession numbers
0 GenInfo ID
0 PubMed ID
Database Identifiers – Locus Names
Original identifiers of GenBank records
0 LOCUS line in GenBank entries
Originally
0 First 3 letters of organism followed by code for gene
Example
0 HUMBB for human β-globin region
Problems
0 Unmaintainable due to growth of data
0 Homologous genes not named the same
Database Identifiers – Accession Numbers

No biological meaning
Originally
0 Uppercase letter followed by 5 digits: U00002
Currently
0 Two uppercase letters followed by six digits: BC037153
0 May include version number for entry: BC037153.1
Stable way of identifying GenBank entries
Now being used for both DNA and proteins
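The two accession patterns described above can be checked with a regular expression. This is a minimal sketch covering only those two patterns plus the optional version suffix; real databases use additional accession formats that it does not handle, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# One letter + 5 digits (original style) or two letters + 6 digits (current style),
# optionally followed by ".version". Other real-world formats are not covered.
ACCESSION_PATTERN = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")

def split_accession(identifier: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (accession, version) or None if the identifier matches neither pattern."""
    match = ACCESSION_PATTERN.match(identifier)
    if match is None:
        return None
    accession = identifier.split(".", 1)[0]
    version = int(match.group(1)) if match.group(1) else None
    return accession, version

for example in ["U00002", "BC037153", "BC037153.1", "HUMBB"]:
    print(example, "->", split_accession(example))
```

The locus name `HUMBB` is rejected, which reflects that accession numbers carry no biological meaning while locus names do.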
Database Identifiers – GenInfo (gi) IDs
Identifier for a particular sequence only
0 Each entry gets a unique gi number
Example
0 GI:22477487
Not subject to versioning
0 Entry always remains the same
Different / new versions of the same sequence
0 Manage using accession numbers
Database Identifiers – PubMed IDs (PMID)

Identifies articles managed by NCBI
Reliable, stable link to citation
Example
0 PMID: 12205585
Bioinformatic Database Formats
Data is stored / presented in a variety of formats
0 FASTA
0 GenBank
0 SwissProt
0 ASN.1
0 XML
Database Format – FASTA

Used by FASTA tools
Comment line followed by sequence data
0 No annotation, just sequence
Example
>gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA…
GGCACGAGGATCCAGCACCTCTCTTGGGGCTTCTCCGTCCTCGGCGCTTGGAAGTAC
GGATCTTTTTTCTCGGAGAAAAGTTCACTGGAACTGGAAGAAATGGATTTATCTGCC
GTCCAAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATCTTAGAGTGT
CCGATCTGTTTGGAACTGATCAAAGAACCTGTTTCCACAAAGTGTGACCACATATTT
TGCAAATTTTGTATGCTGAAACTTCTTAACCAGAAGAAAGGGCCTTCACAATGTCCT
TTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAGGGAAGCACAAGGTTTAGTCAG
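A FASTA record like the one above can be read with a few lines of code. The sketch below assumes the header follows the `gi|number|db|accession.version|locus description` layout seen in the example; other sources use different header layouts, and the function names are hypothetical. Only the first two sequence lines of the example are included.

```python
from typing import Dict, Iterator, List, Optional, Tuple

SAMPLE = """>gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA
GGCACGAGGATCCAGCACCTCTCTTGGGGCTTCTCCGTCCTCGGCGCTTGGAAGTAC
GGATCTTTTTTCTCGGAGAAAAGTTCACTGGAACTGGAAGAAATGGATTTATCTGCC
"""

def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequence lines are joined into one string."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def parse_ncbi_header(header: str) -> Dict[str, str]:
    """Split a header assumed to look like gi|<gi>|<db>|<accession.version>|<locus> <description>."""
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return {"gi": fields[1], "db": fields[2], "accession": fields[3],
            "locus": fields[4], "description": description}

records = list(read_fasta(SAMPLE))
header, sequence = records[0]
fields = parse_ncbi_header(header)
print(fields["accession"], fields["gi"], len(sequence))
```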
Database Format – GenBank
Flat file format used by GenBank
0 Annotation, author, version, etc…
Example (just the top)
LOCUS MMU35641 5538 bp mRNA linear ROD 18-OCT-1996
DEFINITION Mus musculus Brca1 mRNA, complete cds.
ACCESSION U35641
VERSION U35641.1 GI:1040960
KEYWORDS .
SOURCE house mouse strain=C57Bl/6.
ORGANISM Mus musculus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 5538)
AUTHORS Sharan,S.K., Wims,M. and Bradley,A.
TITLE Murine Brca1: sequence and significance for human missense
mutations
JOURNAL Hum. Mol. Genet. 4 (12), 2275-2278 (1995)
MEDLINE 96177660
PUBMED 8634698
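The keyword-per-line layout of the flat file makes the header easy to pick apart. The sketch below is a simplified reader, assuming each field of interest sits on one line that starts with an uppercase keyword; it ignores continuation lines and the column alignment of real GenBank files, and the function name is hypothetical.

```python
from typing import Dict

SAMPLE_HEADER = """\
LOCUS MMU35641 5538 bp mRNA linear ROD 18-OCT-1996
DEFINITION Mus musculus Brca1 mRNA, complete cds.
ACCESSION U35641
VERSION U35641.1 GI:1040960
"""

def parse_genbank_header(text: str) -> Dict[str, str]:
    """Map each uppercase keyword to the rest of its line (continuation lines are skipped)."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        keyword, _, value = line.strip().partition(" ")
        if keyword.isupper() and value:
            fields[keyword] = value.strip()
    return fields

fields = parse_genbank_header(SAMPLE_HEADER)
accession_version, gi_field = fields["VERSION"].split()
gi_number = gi_field.split(":", 1)[1]
length_bp = int(fields["LOCUS"].split()[1])
print(fields["ACCESSION"], accession_version, gi_number, length_bp)
```

The VERSION line shows the identifiers side by side: the versioned accession and the gi number for that exact sequence.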
Database Format – SWISS-PROT

Defined by SWISS-PROT database
0 Includes annotation, other info
Example
ID BRC1_MOUSE STANDARD; PRT; 1812 AA.
AC P48754; Q60957; Q60983;
DT 01-FEB-1996 (Rel. 33, Created)
DT 01-NOV-1997 (Rel. 35, Last sequence update)
DT 16-OCT-2001 (Rel. 40, Last annotation update)
DE Breast cancer type 1 susceptibility protein homolog.
GN BRCA1.
OS Mus musculus (Mouse).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX NCBI_TaxID=10090;
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=C57BL/6; TISSUE=Embryo;
RX MEDLINE=96177659; PubMed=8634697;
RA Abel K.J., Xy J., Yin G.Y., Lyons R.H., Meisler M.H., Weber B.L.;
RT "Mouse Brca1: localization sequence analysis and identification of
RT evolutionarily conserved domains.";
RL Hum. Mol. Genet. 4:2265-2273(1995)…
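Each line of a SWISS-PROT entry starts with a two-letter code, so grouping lines by code recovers the fields. This is a minimal sketch using a few lines from the example above; it does not interpret multi-line fields or every line type, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

SAMPLE_ENTRY = """\
ID BRC1_MOUSE STANDARD; PRT; 1812 AA.
AC P48754; Q60957; Q60983;
DE Breast cancer type 1 susceptibility protein homolog.
GN BRCA1.
OS Mus musculus (Mouse).
OX NCBI_TaxID=10090;
"""

def group_by_line_code(text: str) -> Dict[str, List[str]]:
    """Collect the content of each line under its two-letter line code."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if len(code) == 2 and code.isupper():
            grouped[code].append(value.strip())
    return dict(grouped)

entry = group_by_line_code(SAMPLE_ENTRY)
entry_name = entry["ID"][0].split()[0]
accessions = [item.strip() for item in entry["AC"][0].split(";") if item.strip()]
print(entry_name, accessions)
```

The first accession on the AC line is the primary one; the others are secondary accessions for the same entry.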
Database Format – ASN.1
International standard
0 Semi-structured format
0 Base format for NCBI data
Example
Seq-entry ::= set {
level 1 ,
class nuc-prot ,
descr {
title "Mus musculus Brca1 mRNA, and translated products" ,
source {
org {
taxname "Mus musculus" ,
db {
{
db "taxon" ,
tag
id 10090 } } ,
orgname {
name
binomial {
genus "Mus" ,
species "musculus" } , …
Database Format – XML

eXtensible Markup Language
0 Open standard for semi-structured data, uses tags like HTML
0 Document split into content (XML), style (XSL), linking (XLL)
Example
<?xml version="1.0"?>
<!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN"
"http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
<GBSet>
<GBSeq>
<GBSeq_locus>MMU35641</GBSeq_locus>
<GBSeq_length>5538</GBSeq_length>
<GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
<GBSeq_moltype value="mrna">5</GBSeq_moltype>
<GBSeq_topology value="linear">1</GBSeq_topology>
<GBSeq_division>ROD</GBSeq_division>
<GBSeq_update-date>18-OCT-1996</GBSeq_update-date>
<GBSeq_create-date>25-OCT-1995</GBSeq_create-date>
<GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
<GBSeq_primary-accession>U35641</GBSeq_primary-accession>
<GBSeq_accession-version>U35641.1</GBSeq_accession-version>
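Because the tags describe the data, a generic XML parser can pull out individual fields without format-specific code. The sketch below uses Python's standard `xml.etree.ElementTree` on an excerpt of the record above. The closing tags are added here so the fragment is well-formed (the slide shows only the top of the record), and the `summarize` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Excerpt of the GBSeq record above; closing tags added so that it parses.
GBSEQ_XML = """\
<GBSet>
  <GBSeq>
    <GBSeq_locus>MMU35641</GBSeq_locus>
    <GBSeq_length>5538</GBSeq_length>
    <GBSeq_moltype value="mrna">5</GBSeq_moltype>
    <GBSeq_division>ROD</GBSeq_division>
    <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
    <GBSeq_accession-version>U35641.1</GBSeq_accession-version>
  </GBSeq>
</GBSet>
"""

def summarize(xml_text):
    """Return one small dict per <GBSeq> element in the document."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("GBSeq"):
        records.append({
            "locus": seq.findtext("GBSeq_locus"),
            "length": int(seq.findtext("GBSeq_length")),
            "moltype": seq.find("GBSeq_moltype").get("value"),
            "accession": seq.findtext("GBSeq_primary-accession"),
        })
    return records

records = summarize(GBSEQ_XML)
print(records)
```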
Processing Data in Bioinformatic Databases
Format conversion
0 Tools frequently handle only one of the data formats
0 Use software to transform between formats
z ReadSeq, SeqIO
Perl (Practical Extraction and Report Language)

0 Portable C-like interpreted scripting language
0 Powerful pattern matching, string processing operations
0 Frequently used to extract / process bioinformatic data
BioPerl
0 Collection of Perl classes designed for bioinformatic tools
0 Sequence analysis, alignment, format conversion, I/O,
automate bioinformatic analyses, parse results, create GUIs,
manage persistent storage in RDBMS…
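As a rough illustration of format conversion, the sketch below turns a GenBank-style numbered sequence block into FASTA text. It is written in Python rather than Perl and is not how ReadSeq or BioPerl's SeqIO work internally. The record name, description, and sequence are toy placeholders, and `to_fasta` is a hypothetical helper.

```python
import re

def to_fasta(identifier, description, origin_block, width=60):
    """Strip position numbers and spacing, then emit a FASTA header plus wrapped lines."""
    residues = re.sub(r"[\d\s]+", "", origin_block).upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([f">{identifier} {description}"] + lines) + "\n"

# Toy sequence block in GenBank style: position number, then groups of bases.
TOY_ORIGIN = """\
        1 acgtacgtac gtacgtacgt
       21 ttga
"""

fasta = to_fasta("TOY1", "toy record", TOY_ORIGIN, width=10)
print(fasta, end="")
```

The regular expression does the work that the slide attributes to pattern matching: it removes everything that is not sequence before the text is re-wrapped.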
Bioinformatic Databases – Usage
NCBI Protein information usage survey
Using Bioinformatic Databases

Primary use of bioinformatics
0 Finding similar sequences
0 BLAST!
1) insert sequence
2) click button!
FastA
http://www.ebi.ac.uk/fasta33/
Under different circumstances it is favorable to use different programs:
To identify an unknown protein sequence, use either FastA3 or tFastX3.
To identify structural DNA sequence (repeated DNA, structural RNA), use FastA3, first with ktup = 6 and then with ktup = 3.
To identify an EST, use FastX3 (check whether the EST codes for a protein homologous to a known protein).
Use ktup = 1 for oligonucleotides (length < 20).
FastA Output
The standard FastA output contains a list of the best alignment scores and a visual representation of the alignments.
Sequences with E-score less than 0.01 are almost always found to be homologous.
Sequences with E-score between 1 and 10 frequently turn out to be related as well.
FastA Example
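The two E-score rules of thumb quoted for FastA output can be written as a small lookup. This is only a sketch: it assumes "between 1 and 10" includes both endpoints, it reports explicitly when the guidance gives no rule for a value, and `interpret_e_score` is a made-up name.

```python
def interpret_e_score(e_score):
    """Apply the two rules of thumb quoted above to a FastA E-score."""
    if e_score < 0.01:
        return "almost always homologous"
    if 1 <= e_score <= 10:  # endpoints assumed inclusive
        return "frequently related"
    # The guidance gives no rule for other ranges.
    return "no rule of thumb given"

for value in (3e-36, 0.5, 5.0):
    print(value, "->", interpret_e_score(value))
```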
BLAST - Basic Local Alignment Search Tool
BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity for distantly related sequences.
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST Example
Databases – Searching w/ BLAST
Versions of BLAST
0 BLASTN
z Nucleic acids against nucleic acids
0 BLASTP
z Protein query against protein database
0 BLASTX
z Translated nucleic acids against protein database
0 TBLASTN
z Protein query against translated nucleic acid database
0 TBLASTX
z Translated nucleic acids against translated nucleic acids
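The choice among the five programs depends only on the query type, the database type, and whether the comparison is made on translated sequences. A minimal sketch of that lookup follows. The function name and string labels are invented for illustration, and the protein-query-against-translated-nucleotide program is written with its standard name, TBLASTN.

```python
# (query type, database type, compare translated sequences?) -> program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "BLASTN",
    ("protein", "protein", False): "BLASTP",
    ("nucleotide", "protein", True): "BLASTX",
    ("protein", "nucleotide", True): "TBLASTN",
    ("nucleotide", "nucleotide", True): "TBLASTX",
}

def choose_blast(query, database, translate=False):
    """Return the BLAST program for this combination, or raise ValueError."""
    try:
        return BLAST_PROGRAMS[(query, database, translate)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for query={query}, database={database}, translate={translate}"
        ) from None

print(choose_blast("nucleotide", "protein", translate=True))
print(choose_blast("protein", "nucleotide", translate=True))
```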
BLAST result
0 Graphic display

BLAST result
0 Matching sequences w/ bit-score & E-value
0 Hyperlinks to database entry for sequence
Example
Hyperlinks to sequences                   Bit Score   E-value
gi|17330420|gb|BH384278.1|BH384278 ...    153         3e-36
gi|17320126|gb|BH373984.1|BH373984 ...    140         9e-34
gi|17338337|gb|BH392196.1|BH392196 ...    112         8e-25
gi|20373967|gb|BH771010.1|BH771010 ...    105         1e-21
gi|17314411|gb|BH368367.1|BH368367 ...    104         2e-21
gi|17332712|gb|BH386570.1|BH386570 ...    64          3e-21
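Hit lines like these are easy to take apart with string splitting. The sketch below assumes each line has the layout `gi|<gi number>|<database>|<accession.version>|<locus>`, followed by an elided description, the bit score, and the E-value, which is what the example rows show. `parse_hit` is a hypothetical helper, not part of BLAST.

```python
# First three hit lines from the example above.
HIT_LINES = """\
gi|17330420|gb|BH384278.1|BH384278 ... 153 3e-36
gi|17320126|gb|BH373984.1|BH373984 ... 140 9e-34
gi|17338337|gb|BH392196.1|BH392196 ... 112 8e-25
"""

def parse_hit(line):
    """Split one summary line into identifier parts, bit score, and E-value."""
    identifier, *_, bits, evalue = line.split()
    parts = identifier.split("|")
    return {
        "gi": int(parts[1]),
        "db": parts[2],
        "accession": parts[3],
        "bit_score": float(bits),
        "e_value": float(evalue),
    }

hits = [parse_hit(line) for line in HIT_LINES.splitlines() if line.strip()]
best = min(hits, key=lambda hit: hit["e_value"])  # smallest E-value is the strongest hit
print(best["accession"], best["bit_score"], best["e_value"])
```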
Searching w/ BLAST – Interpreting Results
High quality hits
0 Matching sequences with low E-values, high % identity
0 >25% identity may imply similar function, 3D structure
0 Caveat – similarity does not guarantee homology
Low quality hits
0 No matching sequences
0 Few matching sequences, with high E-values, low % identity
0 Absence of match does not always mean no homology
z Check sequence format
z Change search parameters (scoring matrix, gap penalties)
z Check (low complexity) sequence filtering
z Try PSI-BLAST for distant homologs
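Bit score and E-value are linked: for a search space of size m × n (query length times database length), the expected number of chance hits with bit score S' or better is E = m · n · 2^(-S'), so a higher bit score means a lower E-value. The sketch below evaluates that standard relationship. The lengths are arbitrary illustration values, and real BLAST reports use effective lengths, so reported E-values will not match these numbers exactly.

```python
def expected_hits(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'): chance hits expected at this bit score or better."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical search space: a 500-residue query against a 1e9-residue database.
QUERY_LEN = 500
DB_LEN = 1_000_000_000

for bits in (30, 64, 153):
    print(bits, expected_hits(bits, QUERY_LEN, DB_LEN))
```

Each additional bit of score halves the E-value, which is why strong hits combine a high bit score with a very small E-value.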
Bioinformatic Databases – Summary

Observations
0 Lots of useful information
0 Complex relationships, naming schemes
0 High volume of nucleotide sequence data (GenBank)
0 Protein sequence data mostly derived from nucleotide sequence data
z Computational derivation (GenPept, TrEMBL)
z Manual curation (PIR-PSD, Swiss-Prot)
0 Attempts to organize sequence data
z Eliminate redundancy, add annotation (RefSeq, UniGene)
0 Many tools attempt to link useful bioinformatic data
Comparison of the Programs
Concept:
BLAST and FastA both produce local alignments. BLAST can report more than one HSP (high-scoring segment pair) per database entry, while FastA reports only one segment (match).
Speed:
BLAST > FastA
BLAST (package) is a highly efficient search tool.
Sensitivity:
FastA > BLAST (old version!)
FastA is more sensitive, missing fewer homologous sequences on average (the opposite can also happen if no identical residues are conserved, but this is infrequent). It also gives better separation between true hits and random hits.
Statistics:
BLAST calculates probabilities, and it sometimes fails entirely if some of the assumptions used are invalid.
FastA calculates significance 'on the fly' from the given dataset, which is more relevant but can be problematic if the dataset is small.
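A local alignment scores the best-matching pair of subsequences rather than the two sequences end to end. The sketch below computes that score with the Smith-Waterman dynamic program, the exact method that heuristic search tools approximate for speed; it is not the BLAST or FastA algorithm. The scoring values (match +2, mismatch -1, gap -2) and the test sequences are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere, which makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Toy sequences that share only the core "ATTACA".
score = local_alignment_score("GGGGATTACAGGGG", "CCCATTACACCC")
print(score)
```

Only the shared core contributes to the score; the unrelated flanking residues are ignored, which is the property that lets a local search find a conserved region inside otherwise dissimilar sequences.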
