CMSC 838T – Lecture 9

‹ Bioinformatics databases
0 Organization & classification of bioinformatic data
0 Identifiers, formats, & retrieval of bioinformatic data

Entrez search & retrieval of linked databases
Mapviewer search & retrieval by chromosome position

What Is a Database?
‹ Computerized storehouse of data (records)
‹ Allows
0 User-defined queries
0 Extraction of specified records
0 Adding, changing, removing, & merging records
‹ Uses standardized formats
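A minimal sketch of these operations, using Python's built-in sqlite3 module with an in-memory database. The table name `records`, its columns, and the `TOY…` entries are illustrative assumptions and not part of the lecture; only the U35641 accession and its length are taken from the GenBank example later in the slides.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# Adding records.
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("U35641", "Mus musculus", 5538),
        ("TOY0001", "Escherichia coli", 456),  # made-up entry
        ("TOY0002", "Mus musculus", 1200),     # made-up entry
    ],
)

# Changing and removing records.
conn.execute("UPDATE records SET length = 460 WHERE accession = 'TOY0001'")
conn.execute("DELETE FROM records WHERE accession = 'TOY0002'")

# User-defined query: extract only the records that match a condition.
mouse = conn.execute(
    "SELECT accession, length FROM records WHERE organism = ? ORDER BY accession",
    ("Mus musculus",),
).fetchall()
print(mouse)
```

The query returns only the remaining mouse record, since the second mouse entry was deleted before the query ran.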

Database Models
‹ Defines data organization (schema)
‹ Relational
0 Entities and relationships stored in tables
0 Predefined schema
0 Examples: Oracle, DB2, MySQL, PostgreSQL
‹ Object-oriented
0 Stores data as objects (i.e., structures with predefined type)
0 Examples: Versant, Jasmine, Objectivity
‹ Semi-structured
0 Schema dynamically defined within data (self-describing)
0 Flexible description of data with complex relationships
0 Example: XML databases
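To illustrate what "self-describing" means for the semi-structured model, here is a small sketch using Python's standard xml.etree.ElementTree module. The element names and the two `TOY…` entries are hypothetical. The point is only that the second record carries a field the first one lacks, without any schema having been declared in advance.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records; the second has an extra <structure> field.
doc = """
<entries>
  <entry accession="TOY0001">
    <organism>Mus musculus</organism>
  </entry>
  <entry accession="TOY0002">
    <organism>Escherichia coli</organism>
    <structure method="X-ray"/>
  </entry>
</entries>
"""

root = ET.fromstring(doc)

# Discover each record's fields from the data itself.
fields_by_entry = {
    entry.get("accession"): sorted(child.tag for child in entry)
    for entry in root.findall("entry")
}
print(fields_by_entry)
```

In a relational design the same change would typically mean altering the predefined table schema before the new field could be stored.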

Bioinformatic Databases
‹ Outline
0 Issues
0 Databases
0 Identifiers & formats
0 Searching databases

Bioinformatic Databases
‹ Useful information
0 DNA sequences
0 Conserved DNA domains
0 Genomes
0 Gene expression (ESTs, microarrays)
0 Protein sequences
0 Protein 3D structure
0 Protein families
0 Mutations / polymorphisms / SNPs
0 Metabolic pathways
0 Chemical compounds (ligands)
0 Biomedical literature (journal papers, online books…)

Bioinformatic Databases
‹ Classification schemes
0 Database design – relational, object-oriented…
0 Data type – DNA, RNA, EST, protein…
0 Organism – bacteria, virus, human…
0 Accessibility – public, academic, commercial
0 Data source – primary, derived
0 Data entry – manually curated, computationally derived
0 Focus – sequence-oriented, gene-oriented

‹ Resulting in many bioinformatic databases…

Bioinformatic Database Issues
‹ Naming
0 Multiple names for same chemical
0 Arising from multiple biological disciplines, conventions
0 Example

Bioinformatic Database Issues


‹ Redundancy
0 Multiple entries for same DNA / protein sequence
0 Arising from multiple experiments & biological disciplines
0 Example – redundant GenBank entries for E. coli dUTPase
z 4 separate publications (X01714, V01578, L10328, AE000441)

‹ Data annotation & formats


0 Multiple data for single gene
z Sequence, location, expression, structure, function…

0 Resulting in multiple data annotations & formats


‹ Data integration
0 Combining data from multiple bioinformatic databases
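The redundancy issue above can be sketched in code. This is a toy illustration, assuming redundancy means byte-identical sequences after case normalization; real redundant entries (such as the dUTPase records) may overlap or differ slightly, which this sketch does not handle. The entry IDs and sequences are made up.

```python
from collections import defaultdict

def group_redundant(entries):
    """Map each repeated sequence to the sorted list of entry IDs carrying it."""
    groups = defaultdict(list)
    for entry_id, seq in entries:
        groups[seq.upper()].append(entry_id)
    # Keep only sequences that occur under more than one ID.
    return {seq: sorted(ids) for seq, ids in groups.items() if len(ids) > 1}

# Hypothetical entries: toy_A and toy_B differ only in letter case.
entries = [
    ("toy_A", "ATGAAAAAAATC"),
    ("toy_B", "atgaaaaaaatc"),
    ("toy_C", "ATGCCCGGGTAA"),
]
redundant = group_redundant(entries)
print(redundant)
```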

Bioinformatic Databases
‹ Outline
0 Issues
0 Databases
0 Identifiers & formats
0 Searching databases

Major Bioinformatic Databases


‹ DNA sequences
0 GenBank, RefSeq, UniGene
‹ Protein sequences
0 Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
‹ Protein structure
0 Protein Data Bank (PDB)
‹ Gene expression
0 Gene Expression Omnibus (GEO)
‹ Biomedical publications
0 PubMed / MedLine

Bioinformatic Data Sources
‹ Primary databases
0 Original submissions by researchers
0 Staff organizes information only
0 Generally sequence oriented
0 Examples
z GenBank, PDB

Bioinformatic Data Sources


‹ Derived databases
0 Compiled from data in primary databases
0 Manually curated (human selection & correction)
z Advantages – high quality

z Disadvantages – high expense, low volume

z Examples

‹ Swiss-Prot, PIR-PSD, RefSeq

0 Computational derivation (automatically generated)


z Advantages – inexpensive, up-to-date

z Disadvantages – lower quality

z Examples

‹ GenPept, TrEMBL, UniGene, COGs

Bioinformatics Databases and Tools
CS426: Introduction to Computational Biology
Section Week 3

Why would we use such databases?
„ When obtaining a new DNA sequence, one needs to know whether it has already been deposited in the databanks, fully or partially, or whether the databanks contain any homologous sequences (sequences descended from a common ancestor).
„ Some of the databases contain annotation that has already been added to a specific sequence. Finding annotation for the searched sequence or its homologous sequences can facilitate research on it.
„ Find similar non-coding DNA stretches in the database, for instance repeat elements or regulatory sequences.
„ Other special-purpose uses, such as locating false priming sites for a set of PCR oligonucleotides.
„ Search for homologous proteins: proteins similar in sequence and therefore, presumably, also in folding, structure, or function.

Primary sequence databases
There are several problems with databases today:
„ Databases are regulated by users rather than by a central body (except for Swiss-Prot).
„ Only the owner of the data can change it.
„ Sequences are not up to date.
„ Large degree of redundancy in databases and between databases.
„ Lack of standards for fields or annotation.

Primary sequence databases
List of primary sequence databases and their locations.

DNA Databases (Nucleotide Sequences)
Growing faster than the protein databases.
Largest databases: GenBank (US), EMBL (Europe - UK), DDBJ (Japan).

DNA Databases
EMBL: http://www.ebi.ac.uk/embl/
„ EMBL is a DNA sequence database from the European Bioinformatics Institute (EBI).
„ EMBL includes sequences from direct submissions, genome sequencing projects, scientific literature, and patent applications.
„ Its growth is exponential.
„ Supports several retrieval tools: SRS for text-based retrieval, and BLAST and FASTA for sequence-based retrieval.

DNA Databases
GenBank: http://www.ncbi.nlm.nih.gov/
„ Sequence database from the National Center for Biotechnology Information (NCBI).
„ It incorporates sequences from publicly available sources.

Protein Databases
PIR – Protein Sequence Database
„ http://pir.georgetown.edu/
„ Was developed in the early 1960s.
„ Four sections:
  „ PIR1 - fully classified and annotated entries.
  „ PIR2 - preliminary entries, not thoroughly reviewed.
  „ PIR3 - unverified entries, not reviewed.
  „ PIR4 - conceptual translations.

Bioinformatic Databases – GenBank
‹ Database type
0 Nucleotide sequences
0 Primary database
‹ Data combined from additional sources
0 European Molecular Biology Laboratory (EMBL)
0 DNA DataBank of Japan (DDBJ)
‹ Current size
0 Release 134, Feb 2003
0 23,035,823 sequences
0 29,358,082,791 nucleotides

Bioinformatic Databases – GenBank


‹ Types of submissions to database
0 Genomic DNA
z High quality complete DNA sequence

0 mRNA / cDNA
z Partial or complete mRNA (or retranscribed cDNA)

0 Expressed sequence tag (EST)


z Single-pass partial cDNA sequences from mRNA

0 Sequence tagged sites (STS)


z Short DNA sequences unique in genome

0 Genomic survey sequence (GSS)


z Single-pass genomic DNA

0 Third-party annotations of GenBank sequences

Bioinformatic Databases – Proteins
‹ Protein sequence databases
0 Once derived from laboratory experiments
0 Now mostly based on predicted ORFs from DNA
z Manual curation

z Computational derivation

‹ Classification
0 Predicted protein
z No similarity match to protein of known function

z Match to EST

0 Hypothetical protein
z No similarity match to protein of known function

z No match to EST

Bioinformatic Databases – Swiss-Prot, PIR-PSD


‹ Database type
0 Protein sequences
0 Derived database
z Manually curated (non-redundant, annotated)

‹ Many annotations
0 Functions of the protein
0 Post-translational modifications
z Phosphorylation, acetylation, GPI-anchor, etc…

0 Domains and sites


z Calcium binding regions, ATP-binding sites, zinc fingers…

0 Secondary & quaternary structure


0 Similarities to other proteins
0 Variants

Protein Databases
Swiss-Prot: http://us.expasy.org/sprot/
„ Established in 1986.
„ Provides high-level annotations, including description of protein function, structure of protein domains, post-translational modifications, variants, etc. It aims to be minimally redundant.
TrEMBL - Translated EMBL
„ Was created in 1996.
„ It contains translations of all coding sequences in the EMBL nucleotide sequence database.
„ SP-TrEMBL contains entries that will be incorporated into Swiss-Prot.
„ REM-TrEMBL contains entries that are not destined to be included in Swiss-Prot.

Protein Databases
GenPept: http://www.ncbi.nlm.nih.gov/
„ GenPept is a supplement to the GenBank nucleotide sequence database.
„ Translations of coding regions in GenBank entries.
NRL_3D: http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
„ Produced and maintained by PIR.
„ Contains sequences extracted from the Protein Data Bank (PDB).

Protein Databases
Summary of protein sequence databases
„ PIR (1-4) - comprehensive, poor quality of annotation (even in PIR1).
„ Swiss-Prot - poor sequence coverage, highly structured, excellent annotation.
„ GenPept - most comprehensive, poor quality of annotation.
„ NRL_3D - least comprehensive, but directly related to structural information.

Database searching
„ Text-based search - searching the annotations. Examples: SRS, GCG’s Lookup, Entrez.
„ Sequence-based search - searching the sequence itself. Examples: BLAST, FASTA, SW (Smith-Waterman).

Bioinformatic Databases – Swiss-Prot, PIR-PSD
‹ Swiss-Prot statistics
0 Release 41.2, March 2003
0 123,721 entries totaling 45,421,741 amino acids
0 Abstracted from 104,046 references
0 Average length 367 amino acids
‹ PIR-PSD statistics
0 Release 75.05, March 2003
0 283,289 entries

Bioinformatic Databases – GenPept, TrEMBL


‹ Database type
0 Protein sequences
0 Computationally derived database
z Predicted coding sequences (CDS) from GenBank, EMBL

z Candidate sequences for Swiss-Prot, not yet processed

‹ GenPept
0 Release 134, February 2003
0 1,314,007 loci containing 407,394,800 residues
‹ TrEMBL
0 Release 23, March 2003
0 921,952 sequences, 40,914,860 residues

Bioinformatic Databases – Connections

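Figure: data flow between the databases. DNA sequences from labs (submitted via Sequin & BankIt) and from genome projects enter GenBank and EMBL/EBI; automatically translated protein sequences go to GenPept and TrEMBL; manual curation & annotation produces PIR-PSD and SwissProt.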

Bioinformatic Databases – Protein Data Bank


‹ Database type
0 Protein 3D structures
0 Primary database
‹ Statistics
0 March 2003
0 20,473 proteins
Figure: Folds & New Folds / Year

Bioinformatic Databases – Pfam
‹ Database type
0 Protein families
z Multiple alignments of protein domains, conserved regions

0 Derived database (from Swiss-Prot & TrEMBL)


z Pfam-A – manually curated (hand-edited MSA)

z Pfam-B – computationally derived

‹ Non-overlapping families from PRODOM database

‹ Statistics
0 Release 8.0, February 2003
0 5193 families in Pfam-A
0 Protein sequence coverage
z 73% at least one match in Pfam-A

z 20% at least one match in Pfam-B

Bioinformatic Databases – RefSeq


‹ Database type
0 Nucleotide & protein sequences
0 Derived database
z Human curated (non-redundant, cross-linked)

‹ Data in RefSeq
0 Genomic DNA contigs
0 mRNAs & proteins for known genes, gene models
0 Entire chromosomes
0 Multiple organisms
‹ Statistics
0 March 2003
0 17,268 human loci, ~52,000 for all species

Bioinformatic Databases – UniGene
‹ Database type
0 Nucleotide sequences
0 Computationally derived database
z Partitioned into non-redundant gene-oriented clusters

0 Gene-oriented view
‹ Data in UniGene
0 Clusters of genomic DNA & ESTs
0 Multiple organisms
‹ Statistics
0 March 2003
0 111,064 human loci, ~500,000 for all species

Bioinformatic Databases – Relative Sizes


Figure: bar chart of DB size (# sequences) on a logarithmic axis from 1 to 100,000,000 for Pfam, PIR-PSD, RefSeq, PDB, TrEMBL, GenPept, GenBank, Swiss-Prot, and UniGene, grouped as computationally derived vs. manually curated.

Bioinformatic Databases – PubMed
‹ Database type
0 Biomedical papers
0 Manually curated database
‹ Data searched by PubMed
0 MedLine papers
0 Additional physics, chemistry, life science journals
0 Abstracts, citations, some full articles
0 Over 11 million journal articles dating back to the 1960s

Bioinformatic Databases – Others


‹ Gene expression
0 ArrayExpress, Gene Expression Omnibus (GEO)
‹ Multi-organism genomes
0 Entrez Genome, HomoloGene, COGs, TIGR
‹ Genetic variation & genetic diseases
0 dbSNP, OMIM, CGAP
‹ Metabolic pathways
0 WIT, KEGG
‹ Many more…
0 Listed in journal “Nucleic Acids Research” each January

Text based retrieval tools
Entrez
„ http://www.ncbi.nlm.nih.gov/Entrez/
„ Developed at NCBI.
„ Entry point for exploring NCBI’s integrated databases.
„ Easy to use, but unlike SRS, the search is limited.

Entrez example

Sequence Based Searching
DNA search versus protein search
„ A DNA sequence is a string of length n over an alphabet of size 4. Its protein translation is a string of length n/3 over an alphabet of size 20. Statistically, the expected number of random matches in some arbitrary database is larger for a DNA sequence.
„ DNA databases are much larger than protein databases, and they grow faster. This also means more random hits.
„ Translation of a DNA sequence to a protein sequence causes loss of information.
„ Protein sequences are more biologically conserved than DNA sequences.

Bioinformatic Search Tools – Entrez
‹ Search / retrieval tool for multiple linked databases
0 Papers: biomedical literature (PubMed)
0 Nucleotide: sequence database (GenBank)
0 Protein: sequence database
0 Structure: 3D macromolecular structures
0 Genome: complete genome assemblies
0 OMIM: Online Mendelian Inheritance in Man
0 Taxonomy: organisms in GenBank
0 ProbeSet: gene expression and microarray datasets
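The slides describe Entrez as a web tool. As an aside, NCBI also exposes the Entrez databases programmatically through its E-utilities service, which the lecture does not cover. The sketch below only builds an ESearch request URL and does not send it. The search term is an illustrative assumption, and the current NCBI documentation should be consulted for parameters and usage limits before making real requests.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities service (not mentioned in the lecture).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"

# Hypothetical query: PubMed records mentioning Brca1 in mouse.
url = esearch_url("pubmed", "Brca1 AND Mus musculus[Organism]", retmax=5)
print(url)
```

Changing the `db` argument (for example to `nucleotide` or `protein`) targets a different linked database with the same query mechanism.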

Bioinformatic Search Tools – Entrez


‹ Mapviewer (component of Entrez Genome)
0 View & search complete genome by chromosome position
0 Display & zoom into chromosome maps

Text based retrieval tools
SRS (Sequence Retrieval System)
„ http://srs.ebi.ac.uk/
„ Developed by EBI
„ provides a homogeneous interface to over 80 biological
databases
„ includes databases of sequences, metabolic pathways,
transcription factors, application results (like BLAST,
SSEARCH, FASTA), protein 3-D structures, genomes,
mappings, mutations, and locus specific mutations.
„ Before entering a query, one selects one or more of the
databases to search.
„ It is possible to send the query results as a batch query
to a sequence search tool.


Bioinformatic Databases
‹ Outline
0 Issues
0 Databases
0 Identifiers & formats
0 Searching databases

Bioinformatic Database Identifiers


‹ Common identifiers for bioinformatic data
0 Locus name
0 Accession numbers
0 GenInfo ID
0 PubMed ID

Database Identifiers – Locus Names
‹ Original identifiers of GenBank records
0 LOCUS line in GenBank entries
‹ Originally
0 First 3 letters of organism followed by code for gene
‹ Example
0 HUMBB for human β-globin region
‹ Problems
0 Unmaintainable due to growth of data
0 Homologous genes not named the same

Database Identifiers – Accession Numbers


‹ No biological meaning
‹ Originally
0 Uppercase letter followed by 5 digits: U00002
‹ Currently
0 Two uppercase letters followed by six digits: BC037153
0 May include version number for entry: BC037153.1
‹ Stable way of identifying GenBank entries
‹ Now being used for both DNA and proteins
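The two accession formats on this slide can be recognized with a regular expression. This sketch covers only those two patterns (one letter plus five digits, two letters plus six digits) with an optional version suffix; other accession formats exist and are not handled. The function name `parse_accession` is an illustrative choice.

```python
import re

# One letter + 5 digits, or two letters + 6 digits, with an optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")

def parse_accession(text):
    """Return (accession, version or None), or None if the format does not match."""
    match = ACCESSION_RE.match(text)
    if match is None:
        return None
    version = int(match.group(2)) if match.group(2) else None
    return match.group(1), version

print(parse_accession("U00002"))      # original format, no version
print(parse_accession("BC037153.1"))  # current format with version
print(parse_accession("HUMBB"))       # a locus name, not an accession
```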

Database Identifiers – GenInfo (gi) IDs
‹ Identifier for a particular sequence only
0 Each entry gets a unique gi number
‹ Example
0 GI:22477487
‹ Not subject to versioning
0 Entry always remains the same
‹ Different / new versions of the same sequence
0 Manage using accession numbers

Database Identifiers – PubMed IDs (PMID)


‹ Identifies articles managed by NCBI
‹ Reliable, stable link to citation
‹ Example
0 PMID: 12205585

Bioinformatic Database Formats
‹ Data is stored / presented in a variety of formats
0 FASTA
0 GenBank
0 SwissProt
0 ASN.1
0 XML

Database Format – FASTA


‹ Used by FASTA tools
‹ Comment line followed by sequence data
0 No annotation, just sequence
‹ Example
>gi|1040960|gb|U35641.1|MMU35641 Mus musculus Brca1 mRNA…
GGCACGAGGATCCAGCACCTCTCTTGGGGCTTCTCCGTCCTCGGCGCTTGGAAGTAC
GGATCTTTTTTCTCGGAGAAAAGTTCACTGGAACTGGAAGAAATGGATTTATCTGCC
GTCCAAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATCTTAGAGTGT
CCGATCTGTTTGGAACTGATCAAAGAACCTGTTTCCACAAAGTGTGACCACATATTT
TGCAAATTTTGTATGCTGAAACTTCTTAACCAGAAGAAAGGGCCTTCACAATGTCCT
TTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAGGGAAGCACAAGGTTTAGTCAG
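
A minimal sketch of reading this format, assuming each record is one ">" comment line followed by sequence lines, and that the comment line uses the pipe-separated "gi|number|gb|accession|locus" layout of the example. The function names `read_fasta` and `split_ncbi_header` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_ncbi_header(header):
    """Split a 'gi|number|gb|accession|locus description' header into a dict.

    Assumes the five pipe-separated fields seen in the slide's example.
    """
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return {
        "gi": fields[1],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }
```

Headers from other sources may not follow the pipe-separated layout, so `split_ncbi_header` should be treated as specific to this example.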

Database Format – GenBank
‹ Flat file format used by GenBank
0 Annotation, author, version, etc…
‹ Example (just the top)
LOCUS MMU35641 5538 bp mRNA linear ROD 18-OCT-1996
DEFINITION Mus musculus Brca1 mRNA, complete cds.
ACCESSION U35641
VERSION U35641.1 GI:1040960
KEYWORDS .
SOURCE house mouse strain=C57Bl/6.
ORGANISM Mus musculus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 5538)
AUTHORS Sharan,S.K., Wims,M. and Bradley,A.
TITLE Murine Brca1: sequence and significance for human missense
mutations
JOURNAL Hum. Mol. Genet. 4 (12), 2275-2278 (1995)
MEDLINE 96177660
PUBMED 8634698
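
A minimal sketch of pulling the top-level keywords out of a flat-file header like the one above. It assumes the original file layout, in which top-level keywords start in the first column and sub-keywords and continuation lines are indented (the indentation is lost in this slide text). The function name `parse_genbank_header` is hypothetical.

```python
def parse_genbank_header(text):
    """Collect top-level keyword fields from the header of a GenBank flat file."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] != " ":
            # A top-level keyword starts in the first column.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields.setdefault(current, []).append(value.strip())
        elif current is not None:
            # Indented lines (sub-keywords, continuations) are kept under
            # the most recent top-level keyword.
            fields[current].append(line.strip())
    return {key: " ".join(values) for key, values in fields.items()}
```

Sub-keywords such as ORGANISM or AUTHORS are simply folded into their parent keyword here; a fuller parser would split them out.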

Database Format – SWISS-PROT


‹ Defined by SWISS-PROT database
0 Includes annotation, other info
‹ Example
ID BRC1_MOUSE STANDARD; PRT; 1812 AA.
AC P48754; Q60957; Q60983;
DT 01-FEB-1996 (Rel. 33, Created)
DT 01-NOV-1997 (Rel. 35, Last sequence update)
DT 16-OCT-2001 (Rel. 40, Last annotation update)
DE Breast cancer type 1 susceptibility protein homolog.
GN BRCA1.
OS Mus musculus (Mouse).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX NCBI_TaxID=10090;
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=C57BL/6; TISSUE=Embryo;
RX MEDLINE=96177659; PubMed=8634697;
RA Abel K.J., Xu J., Yin G.Y., Lyons R.H., Meisler M.H., Weber B.L.;
RT "Mouse Brca1: localization sequence analysis and identification of
RT evolutionarily conserved domains.";
RL Hum. Mol. Genet. 4:2265-2273(1995)…
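
A minimal sketch of grouping the lines of such an entry by their two-letter line code (ID, AC, DT, and so on). It assumes each line starts with the code followed by whitespace and then the data; the function names `group_by_line_code` and `accession_numbers` are hypothetical.

```python
from collections import defaultdict


def group_by_line_code(text):
    """Map each two-letter line code to the list of data strings that follow it."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        code, _, data = line.partition(" ")
        if len(code) == 2 and data.strip():
            grouped[code].append(data.strip())
    return dict(grouped)


def accession_numbers(grouped):
    """Split the semicolon-separated AC line(s) into individual accession numbers."""
    numbers = []
    for ac_line in grouped.get("AC", []):
        numbers.extend(part.strip() for part in ac_line.split(";") if part.strip())
    return numbers
```

Codes that repeat, such as DT or RT, keep their lines in file order, so multi-line fields can be rejoined afterwards.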

Database Format – ASN.1
‹ International standard
0 Semi-structured format
0 Base format for NCBI data
‹ Example
Seq-entry ::= set {
level 1 ,
class nuc-prot ,
descr {
title "Mus musculus Brca1 mRNA, and translated products" ,
source {
org {
taxname "Mus musculus" ,
db {
{
db "taxon" ,
tag
id 10090 } } ,
orgname {
name
binomial {
genus "Mus" ,
species "musculus" } , …

Database Format – XML


‹ eXtensible Markup Language
0 Open standard for semi-structured data, uses tags like HTML
0 Document split into content (XML), style (XSL), linking (XLL)
‹ Example
<?xml version="1.0"?>
<!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN"
"http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
<GBSet>
<GBSeq>
<GBSeq_locus>MMU35641</GBSeq_locus>
<GBSeq_length>5538</GBSeq_length>
<GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
<GBSeq_moltype value="mrna">5</GBSeq_moltype>
<GBSeq_topology value="linear">1</GBSeq_topology>
<GBSeq_division>ROD</GBSeq_division>
<GBSeq_update-date>18-OCT-1996</GBSeq_update-date>
<GBSeq_create-date>25-OCT-1995</GBSeq_create-date>
<GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
<GBSeq_primary-accession>U35641</GBSeq_primary-accession>
<GBSeq_accession-version>U35641.1</GBSeq_accession-version>
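
A minimal sketch of reading a few of these tagged fields with Python's standard `xml.etree.ElementTree` module. It assumes a complete, well-formed document (the excerpt above is truncated and its closing tags are not shown); the function name `summarize_gbseq` is hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_gbseq(xml_text):
    """Return locus, length, and accession.version for each GBSeq record."""
    root = ET.fromstring(xml_text)
    summaries = []
    # iter() visits every GBSeq element, whether it is the root or nested in a GBSet.
    for record in root.iter("GBSeq"):
        summaries.append({
            "locus": record.findtext("GBSeq_locus"),
            "length": int(record.findtext("GBSeq_length")),
            "accession_version": record.findtext("GBSeq_accession-version"),
        })
    return summaries
```

Because the tags describe the data, the same few lines work regardless of the order in which the fields appear inside each record.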

Processing Data in Bioinformatic Databases
‹ Format conversion
0 Frequently tools handle only one of the data formats
0 Use software to transform between formats
z ReadSeq, SeqIO

‹ Perl (Practical Extraction and Report Language)


0 Portable C-like interpreted scripting language
0 Powerful pattern matching, string processing operations
0 Frequently used to extract / process bioinformatic data
‹ BioPerl
0 Collection of Perl classes designed for bioinformatic tools
0 Sequence analysis, alignment, format conversion, I/O,
automate bioinformatic analyses, parse results, create GUIs,
manage persistent storage in RDBMS…
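
A minimal sketch of the output half of a format conversion: writing a record that has already been parsed from another format as FASTA text. This is plain Python for illustration, not the ReadSeq or BioPerl SeqIO interface; the function name `to_fasta` and the 60-column default are assumptions.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    header = f">{identifier} {description}".rstrip()
    # Break the sequence into chunks of at most `width` characters.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"
```

Only the identifier, description, and sequence survive the conversion; annotation such as references or features has no place in FASTA and is dropped.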

Bioinformatic Databases
‹ Outline
0 Issues
0 Databases
0 Identifiers & formats
0 Searching databases

Bioinformatic Databases – Usage
‹ NCBI Protein information usage survey

Using Bioinformatic Databases


‹ Primary use of bioinformatics
0 Finding similar sequences
0 BLAST!

1) insert sequence

2) click button!

FastA
‹ http://www.ebi.ac.uk/fasta33/
‹ Under different circumstances it is favorable to use different programs
0 To identify an unknown protein sequence, use either FastA3 or tFastX3
0 To identify structural DNA sequence (repeated DNA, structural RNA), use FastA3, first with ktup = 6 and then with ktup = 3
0 To identify an EST, use FastX3 (check whether the EST codes for a protein homologous to a known protein)
0 Use ktup = 1 for oligonucleotides (length < 20)

FastA Output
‹ The standard FastA output contains a list of the best alignment scores and a visual representation of the alignments
‹ Sequences with E-score less than 0.01 are almost always found to be homologous
‹ Sequences with E-score between 1 and 10 frequently turn out to be related as well

FastA Example

BLAST - Basic Local Alignment Search Tool
BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity for distantly related sequences.
http://www.ncbi.nlm.nih.gov/BLAST/

Blast Example

Using Bioinformatic Databases
‹ Versions of BLAST
0 BLASTN
z Nucleic acids against nucleic acids

0 BLASTP
z Protein query against protein database

0 BLASTX
z Translated nucleic acids against protein database

0 TBLASTN
z Protein query against translated nucleic acid database

0 TBLASTX
z Translated nucleic acids against translated nucleic acids
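
A minimal sketch of the list above as a lookup table keyed by query type and database type. The protein-versus-translated-nucleotide program is named TBLASTN here, which is how that program is usually distributed. The type labels and the function name `choose_blast_program` are hypothetical.

```python
# (query type, database type) -> BLAST program, following the slide's list.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("translated nucleotide", "protein"): "BLASTX",
    ("protein", "translated nucleotide"): "TBLASTN",
    ("translated nucleotide", "translated nucleotide"): "TBLASTX",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program for a query/database pairing, or raise ValueError."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type!r} query vs {database_type!r} database"
        ) from None
```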

Databases – Searching w/ BLAST

Databases – Searching w/ BLAST
‹ BLAST result
0 Graphic display

Databases – Searching w/ BLAST


‹ BLAST result
0 Matching sequences w/ bit-score & E-value
0 Hyperlinks to database entry for sequence

‹ Example
gi|17330420|gb|BH384278.1|BH384278 ... 153 3e-36
gi|17320126|gb|BH373984.1|BH373984 ... 140 9e-34
gi|17338337|gb|BH392196.1|BH392196 ... 112 8e-25
gi|20373967|gb|BH771010.1|BH771010 ... 105 1e-21
gi|17314411|gb|BH368367.1|BH368367 ... 104 2e-21
gi|17332712|gb|BH386570.1|BH386570 ... 64 3e-21

Hyperlinks to sequences Bit Score E-value
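
A minimal sketch of extracting the identifier, bit score, and E-value from summary lines shaped like the example above. It assumes the bit score and E-value are the last two whitespace-separated fields and that the identifier uses the "gi|number|gb|accession|locus" layout; the function names `parse_hit_line` and `filter_hits` and the default cutoff are hypothetical.

```python
def parse_hit_line(line):
    """Parse one summary line: identifier ... bit-score E-value."""
    fields = line.split()
    identifiers = fields[0].split("|")
    return {
        "gi": int(identifiers[1]),
        "accession": identifiers[3],
        "bit_score": float(fields[-2]),
        "e_value": float(fields[-1]),
    }


def filter_hits(hits, max_e_value=1e-10):
    """Keep hits whose E-value is at or below the cutoff (smaller is better)."""
    return [hit for hit in hits if hit["e_value"] <= max_e_value]
```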

Searching w/ BLAST – Interpreting Results
‹ High quality hits
0 Matching sequences with low E-values, high bit scores, high % identity
0 >25% identity may imply similar function, 3D structure
0 Caveat – similarity does not guarantee homology
‹ Low quality hits
0 No matching sequences
0 Few matching sequences, with high E-values, low % identity
0 Absence of match does not always mean no homology
z Check sequence format

z Change search parameters (scoring matrix, gap penalties)

z Check (low complexity) sequence filtering

z Try PSI-BLAST for distant homologs
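
A minimal sketch of why a good hit has a small E-value: for a bit score S, the expected number of chance hits at least that good is roughly E = m × n × 2^(−S), where m is the query length and n the database length. This simplified form ignores the effective-length corrections that BLAST applies; the function name `expected_chance_hits` is hypothetical.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)
```

Each additional bit of score halves E, and the same bit score gives a larger (worse) E-value against a larger database.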

Bioinformatic Databases – Summary


‹ Observations
0 Lots of useful information
0 Complex relationships, naming schemes
0 High volume of nucleotide sequence data (GenBank)
0 Protein sequence data mostly derived from nucleotide sequence data
z Computational derivation (GenPept, TrEMBL)

z Manual curation (PIR-PSD, Swiss-Prot)

0 Attempts to organize sequence data


z Eliminate redundancy, add annotation (RefSeq, UniGene)

0 Many tools attempt to link useful bioinformatic data

Comparison of the Programs
‹ Concept
0 BLAST produces local alignments, while FastA is a global alignment tool
0 BLAST can report more than one HSP per database entry, while FastA reports only one segment (match)
‹ Speed
0 BLAST > FastA
0 BLAST (package) is a highly efficient search tool
‹ Sensitivity
0 FastA > BLAST (old version!)
0 FastA is more sensitive, missing fewer homologous sequences on average (the opposite can also happen if no identical residues are conserved, but this is infrequent)
0 FastA also gives better separation between true hits and random hits
‹ Statistics
0 BLAST calculates probabilities, and it sometimes fails entirely if some of the assumptions used are invalid
0 FastA calculates significance 'on the fly' from the given dataset, which is more relevant but can be problematic if the dataset is small
