Bioinformatics databases
  - Organization & classification of bioinformatic data
  - Identification, formatting, & retrieval of bioinformatic data
What Is a Database?
Computerized storehouse of data (records)
Allows
  - User-defined queries
  - Extraction of specified records
  - Adding, changing, removing, & merging records
Uses standardized formats
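The operations listed above (queries, extraction, adding, changing, and removing records) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table name, the column names, and the DEMO0001 record are hypothetical, while the U35641 values are taken from the GenBank example later in these slides.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# Adding records (DEMO0001 is a made-up placeholder record).
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [("U35641", "Mus musculus", 5538), ("DEMO0001", "Homo sapiens", 1200)],
)

# Changing a record.
conn.execute("UPDATE sequences SET length = ? WHERE accession = ?", (1250, "DEMO0001"))


def find_by_organism(connection, organism):
    """User-defined query that extracts only the records for one organism."""
    rows = connection.execute(
        "SELECT accession, length FROM sequences WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return rows.fetchall()


mouse_records = find_by_organism(conn, "Mus musculus")

# Removing a record.
conn.execute("DELETE FROM sequences WHERE accession = ?", ("DEMO0001",))
remaining = conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0]
```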
CMSC 838T – Lecture 9
Database Models
Defines data organization (schema)
Relational
  - Entities and relationships stored in tables
  - Predefined schema
  - Examples: Oracle, DB2, MySQL, PostgreSQL
Object-oriented
  - Stores data as objects (i.e., structures with predefined type)
  - Examples: Versant, Jasmine, Objectivity
Semi-structured
  - Schema dynamically defined within data (self-describing)
  - Flexible description of data with complex relationships
  - Example: XML databases
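To make the contrast between a predefined schema and self-describing data concrete, the sketch below stores the same organism record both ways. It is an illustration under assumptions: the table, column, and tag names are invented here, and only the values (taxon 10090, Mus musculus, strain C57Bl/6) come from the examples in these slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational: the schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organism (taxon_id INTEGER PRIMARY KEY, genus TEXT, species TEXT)"
)
conn.execute("INSERT INTO organism VALUES (?, ?, ?)", (10090, "Mus", "musculus"))
relational_row = conn.execute(
    "SELECT genus, species FROM organism WHERE taxon_id = ?", (10090,)
).fetchone()

# Semi-structured: the tags travel with the data, so a record can carry
# an extra field (here, strain) without any schema change.
xml_record = """
<organism taxon_id="10090">
  <genus>Mus</genus>
  <species>musculus</species>
  <strain>C57Bl/6</strain>
</organism>
""".strip()
root = ET.fromstring(xml_record)
fields = {child.tag: child.text for child in root}
```

Adding the strain to the relational version would require altering the table definition first, which is the practical meaning of "predefined schema".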
Bioinformatic Databases
Outline
  - Issues
  - Databases
  - Identifiers & formats
  - Searching databases
Bioinformatic Databases
Useful information
  - DNA sequences
  - Conserved DNA domains
  - Genomes
  - Gene expression (ESTs, microarrays)
  - Protein sequences
  - Protein 3D structure
  - Protein families
  - Mutations / polymorphisms / SNPs
  - Metabolic pathways
  - Chemical compounds (ligands)
  - Biomedical literature (journal papers, online books…)
Bioinformatic Databases
Classification schemes
  - Database design – relational, object-oriented…
  - Data type – DNA, RNA, EST, protein…
  - Organism – bacteria, virus, human…
  - Accessibility – public, academic, commercial
  - Data source – primary, derived
  - Data entry – manually curated, computationally derived
  - Focus – sequence-oriented, gene-oriented
Bioinformatic Database Issues
Naming
  - Multiple names for same chemical
  - Arising from multiple biological disciplines, conventions
  - Example: [figure not preserved]
Bioinformatic Data Sources
Primary databases
  - Original submissions by researchers
  - Staff organizes information only
  - Generally sequence oriented
  - Examples
    - GenBank, PDB
Bioinformatics Databases and Tools
DNA Databases (Nucleotide Sequences)
Growing faster than the protein databases.
Growth is exponential.
Largest databases: GenBank (US), EMBL (Europe - UK), DDBJ (Japan).
DNA Databases
EMBL: http://www.ebi.ac.uk/embl/
EMBL is a DNA sequence database from the European Bioinformatics Institute (EBI).
EMBL includes sequences from direct submissions, genome sequencing projects, scientific literature, and patent applications.
EMBL supports several retrieval tools: SRS for text-based retrieval, and BLAST and FastA for sequence-based retrieval.
Bioinformatic Databases – GenBank
Database type
  - Nucleotide sequences
  - Primary database
Data combined from additional sources
  - European Molecular Biology Laboratory (EMBL)
  - DNA DataBank of Japan (DDBJ)
Current size
  - Release 134, Feb 2003
  - 23,035,823 sequences
  - 29,358,082,791 nucleotides
  - mRNA / cDNA
    - Partial or complete mRNA (or reverse-transcribed cDNA)
Bioinformatic Databases – Proteins
Protein sequence databases
  - Once derived from laboratory experiments
  - Now mostly based on predicted ORFs from DNA
    - Manual curation
    - Computational derivation
Classification
  - Predicted protein
    - No similarity match to protein of known function
    - Match to EST
  - Hypothetical protein
    - No similarity match to protein of known function
    - No match to EST
Many annotations
  - Functions of the protein
  - Post-translational modifications
    - Phosphorylation, acetylation, GPI-anchor, etc.
Protein Databases
Swiss-Prot: http://us.expasy.org/sprot/
Established in 1986.
Provides high-level annotations, including description of protein function, structure of protein domains, post-translational modifications, variants, etc. It aims to be minimally redundant.
TrEMBL - Translated EMBL
Created in 1996.
It contains translations of all coding sequences in the EMBL nucleotide sequence database.
SP-TrEMBL contains entries that will be incorporated into Swiss-Prot.
REM-TrEMBL contains entries that are not destined to be included in Swiss-Prot.
Protein Databases
GenPept: http://www.ncbi.nlm.nih.gov/
GenPept is a supplement to the GenBank nucleotide sequence database.
It contains translations of coding regions in GenBank entries.
NRL_3D: http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
Produced and maintained by PIR.
Contains sequences extracted from the Protein DataBank (PDB).
Bioinformatic Databases – Swiss-Prot, PIR-PSD
Swiss-Prot statistics
  - Release 41.2, March 2003
  - 123,721 entries totaling 45,421,741 amino acids
  - Abstracted from 104,046 references
  - Average length 367 amino acids
PIR-PSD statistics
  - Release 75.05, March 2003
  - 283,289 entries
GenPept
  - Release 134, February 2003
  - 1,314,007 loci containing 407,394,800 residues
TrEMBL
  - Release 23, March 2003
  - 921,952 sequences, 40,914,860 residues
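The quoted average length follows directly from the entry and residue counts. A short check, using only the Swiss-Prot and GenPept figures listed above (the dictionary keys and the function name are illustrative choices):

```python
# (entries, residues) as quoted in the statistics above.
releases = {
    "Swiss-Prot 41.2": (123_721, 45_421_741),
    "GenPept 134": (1_314_007, 407_394_800),
}


def average_length(entries, residues):
    """Mean number of residues per entry."""
    return residues / entries


averages = {name: average_length(*counts) for name, counts in releases.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.1f} residues per entry")
```

For Swiss-Prot this gives about 367 residues per entry, matching the stated average.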
Bioinformatic Databases – Connections
[Diagram: data flow between databases]
  - DNA sequences (via Sequin & BankIt) and genome projects -> GenBank, EMBL/EBI
  - Automatically translated -> GenPept, TrEMBL
  - Protein sequences from labs; manual curation & annotation -> PIR-PSD, SwissProt
Bioinformatic Databases – Pfam
Database type
  - Protein families
    - Multiple alignments of protein domains, conserved regions
Statistics
  - Release 8.0, February 2003
  - 5193 families in Pfam-A
  - Protein sequence coverage
    - 73% of sequences have at least one match in Pfam-A
Data in RefSeq
  - Genomic DNA contigs
  - mRNAs & proteins for known genes, gene models
  - Entire chromosomes
  - Multiple organisms
Statistics
  - March 2003
  - 17,268 human loci, ~52,000 for all species
Bioinformatic Databases – UniGene
Database type
  - Nucleotide sequences
  - Computationally derived database
    - Partitioned into non-redundant gene-oriented clusters
  - Gene-oriented view
Data in UniGene
  - Clusters of genomic DNA & ESTs
  - Multiple organisms
Statistics
  - March 2003
  - 111,064 human loci, ~500,000 for all species
[Figure: database sizes on a logarithmic axis (1 to 10,000,000) for Pfam, PIR-PSD, RefSeq, PDB, TrEMBL, GenPept, GenBank, Swiss-Prot, and UniGene, labeled as computationally derived or manually curated]
Bioinformatic Databases – PubMed
Database type
  - Biomedical papers
  - Manually curated database
Data searched by PubMed
  - MedLine papers
  - Additional physics, chemistry, life science journals
  - Abstracts, citations, some full articles
  - Over 11 million journal articles dating back to the 1960s
Text based retrieval tools
Entrez
http://www.ncbi.nlm.nih.gov/Entrez/
Developed at NCBI.
Entry point for exploring NCBI's integrated databases.
Easy to use, but unlike SRS, the search is limited.
Entrez example
Bioinformatic Search Tools – Entrez
Search / retrieval tool for multiple linked databases
  - Papers – biomedical literature (PubMed)
  - Nucleotide – sequence database (GenBank)
  - Protein – sequence database
  - Structure – 3D macromolecular structures
  - Genome – complete genome assemblies
  - OMIM – Online Mendelian Inheritance in Man
  - Taxonomy – organisms in GenBank
  - ProbeSet – gene expression and microarray datasets
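Entrez text searches can also be scripted. The sketch below only builds a search URL and makes no network call. It assumes NCBI's E-utilities esearch endpoint, which these slides do not mention, and the function name and the example query are illustrative.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities "esearch" endpoint, not described in the slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Return a text-search URL for one Entrez database (no request is sent)."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("pubmed", "Brca1 AND mouse", max_results=5)
print(url)
```

Changing the `database` argument is how a script would move between the linked databases listed above.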
Text based retrieval tools
SRS (Sequence Retrieval System)
http://srs.ebi.ac.uk/
Developed by EBI.
Provides a homogeneous interface to over 80 biological databases.
Includes databases of sequences, metabolic pathways, transcription factors, application results (like BLAST, SSEARCH, FASTA), protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations.
Before entering a query, one selects one or more of the databases to search.
It is possible to send the query results as a batch query to a sequence search tool.
Database Identifiers – Locus Names
Original identifiers of GenBank records
  - LOCUS line in GenBank entries
Originally
  - First 3 letters of organism followed by code for gene
Example
  - HUMBB for human β-globin region
Problems
  - Unmaintainable due to growth of data
  - Homologous genes not named the same
Database Identifiers – GenInfo (gi) IDs
Identifier for a particular sequence only
  - Each entry gets a unique gi number
Example
  - GI:22477487
Not subject to versioning
  - Entry always remains the same
Different / new versions of the same sequence
  - Manage using accession numbers
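Versioned accessions take the form accession.version, as in the U35641.1 identifier used in the GenBank example in these slides. A minimal sketch of separating the two parts, where the function name is an illustrative choice:

```python
def split_accession_version(identifier):
    """Split 'U35641.1' into ('U35641', 1); an unversioned accession gets None."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


versioned = split_accession_version("U35641.1")
unversioned = split_accession_version("U35641")
```

Two identifiers with the same accession but different version numbers refer to successive versions of one record, whereas each gi number stays tied to exactly one sequence.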
Bioinformatic Database Formats
Data is stored / presented in a variety of formats
  - FASTA
  - GenBank
  - SwissProt
  - ASN.1
  - XML
Database Format – GenBank
Flat file format used by GenBank
  - Annotation, author, version, etc.
Example (just the top)
LOCUS       MMU35641     5538 bp    mRNA    linear   ROD 18-OCT-1996
DEFINITION  Mus musculus Brca1 mRNA, complete cds.
ACCESSION   U35641
VERSION     U35641.1  GI:1040960
KEYWORDS    .
SOURCE      house mouse strain=C57Bl/6.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 5538)
  AUTHORS   Sharan,S.K., Wims,M. and Bradley,A.
  TITLE     Murine Brca1: sequence and significance for human missense
            mutations
  JOURNAL   Hum. Mol. Genet. 4 (12), 2275-2278 (1995)
  MEDLINE   96177660
   PUBMED   8634698
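Because each header line of the flat file starts with a keyword, a few lines of code can pull out the identifiers. This is a simplified sketch, not a full GenBank parser: it assumes one value per keyword and deliberately leaves out continuation lines such as the taxonomy lineage, which would need extra handling. The function name is illustrative.

```python
SAMPLE = """\
LOCUS       MMU35641     5538 bp    mRNA    linear   ROD 18-OCT-1996
DEFINITION  Mus musculus Brca1 mRNA, complete cds.
ACCESSION   U35641
VERSION     U35641.1  GI:1040960
SOURCE      house mouse strain=C57Bl/6.
  ORGANISM  Mus musculus
"""


def parse_header(text):
    """Map each keyword to its value, splitting lines as KEYWORD <spaces> VALUE."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            keyword, value = parts
            fields[keyword] = value.strip()
    return fields


header = parse_header(SAMPLE)
# The VERSION line carries both the accession.version and the gi number.
accession_version, gi_field = header["VERSION"].split()
gi_number = int(gi_field.split(":")[1])
```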
Database Format – ASN.1
International standard
  - Semi-structured format
  - Base format for NCBI data
Example
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus Brca1 mRNA, and translated products" ,
    source {
      org {
        taxname "Mus musculus" ,
        db {
          {
            db "taxon" ,
            tag
              id 10090 } } ,
        orgname {
          name
            binomial {
              genus "Mus" ,
              species "musculus" } , …
Processing Data in Bioinformatic Databases
Format conversion
  - Frequently tools handle only one of the data formats
  - Use software to transform between formats
    - ReadSeq, SeqIO
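To show what a format conversion involves, the sketch below writes a record as FASTA (a '>' header line followed by wrapped sequence lines) and reads it back. It is a hand-rolled illustration, not the ReadSeq or SeqIO implementation. The function names are invented, and the sequence is a made-up placeholder rather than the real Brca1 sequence.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one record as FASTA text with sequence lines of at most `width`."""
    header = f">{identifier} {description}".rstrip()
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"


def from_fasta(text):
    """Parse a single-record FASTA string back into its three parts."""
    lines = text.strip().splitlines()
    identifier, _, description = lines[0][1:].partition(" ")
    return identifier, description, "".join(lines[1:])


# Placeholder sequence (82 bases), used only to exercise the line wrapping.
placeholder_sequence = "ATGC" * 20 + "AT"
fasta_text = to_fasta(
    "U35641.1", "Mus musculus Brca1 mRNA, complete cds.", placeholder_sequence
)
round_trip = from_fasta(fasta_text)
```

A converter between richer formats would follow the same pattern: parse one format into a common in-memory record, then write that record in the other format.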
Bioinformatic Databases – Usage
NCBI Protein information usage survey
1) insert sequence
2) click button!
FastA
http://www.ebi.ac.uk/fasta33/
Under different circumstances it is favorable to use different programs:
To identify an unknown protein sequence, use either FastA3 or tFastX3.
To identify a structural DNA sequence (repeated DNA, structural RNA), use FastA3, first with ktup = 6 and then with ktup = 3.
To identify an EST, use FastX3 (check whether the EST codes for a protein homologous to a known protein).
Use ktup = 1 for oligonucleotides (length < 20).
FastA Example
FastA Output
The standard FastA output contains a list of the best alignment scores and a visual representation of the alignments.
Sequences with an E-score less than 0.01 are almost always found to be homologous.
Sequences with an E-score between 1 and 10 frequently turn out to be related as well.
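The two E-score rules of thumb stated for FastA output can be written as a small helper. The function name and the label for values outside both rules are illustrative. The slide gives no guidance for E-scores between 0.01 and 1 or above 10, so the sketch does not guess.

```python
def classify_e_score(e_score):
    """Apply the two rules of thumb quoted for FastA output."""
    if e_score < 0.01:
        return "almost always homologous"
    if 1 <= e_score <= 10:
        return "frequently related"
    # Ranges the rules of thumb do not address.
    return "not covered by the rules of thumb"


labels = [classify_e_score(value) for value in (3e-36, 0.5, 2.0, 50.0)]
```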
BLAST - Basic Local Alignment Search Tool
BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity for distantly related sequences.
http://www.ncbi.nlm.nih.gov/BLAST/
Blast Example
Using Bioinformatic Databases
Versions of BLAST
  - BLASTN
    - Nucleic acids against nucleic acids
  - BLASTP
    - Protein query against protein database
  - BLASTX
    - Translated nucleic acids against protein database
  - TBLASTN
    - Protein query against translated nucleic acid database
  - TBLASTX
    - Translated nucleic acids against translated nucleic acids
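Choosing among the BLAST versions depends only on the query type and the database type, so the list above can be written as a lookup table. The type labels and the function name are illustrative. TBLASTN is the standard name of the program that compares a protein query against a translated nucleic acid database.

```python
# (query type, database type) -> BLAST program.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("translated nucleotide", "protein"): "BLASTX",
    ("protein", "translated nucleotide"): "TBLASTN",
    ("translated nucleotide", "translated nucleotide"): "TBLASTX",
}


def choose_blast_program(query_type, database_type):
    """Return the program for a query/database pairing, or raise ValueError."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} query against {database_type} database"
        ) from None


program = choose_blast_program("translated nucleotide", "protein")
```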
Databases – Searching w/ BLAST
BLAST result
  - Graphic display
Example
gi|17330420|gb|BH384278.1|BH384278 ...   153   3e-36
gi|17320126|gb|BH373984.1|BH373984 ...   140   9e-34
gi|17338337|gb|BH392196.1|BH392196 ...   112   8e-25
gi|20373967|gb|BH771010.1|BH771010 ...   105   1e-21
gi|17314411|gb|BH368367.1|BH368367 ...   104   2e-21
gi|17332712|gb|BH386570.1|BH386570 ...    64   3e-21
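Each hit line packs several identifiers into one pipe-delimited field. The sketch below splits two of the lines shown above. It assumes that the last two columns are the alignment score and the E-value, as in the usual BLAST summary layout, and the function name and dictionary keys are illustrative.

```python
HITS = """\
gi|17330420|gb|BH384278.1|BH384278 ... 153 3e-36
gi|17320126|gb|BH373984.1|BH373984 ... 140 9e-34
"""


def parse_hit(line):
    """Split one summary line into gi number, accession, score, and E-value."""
    identifier, *_, score, e_value = line.split()
    fields = identifier.split("|")
    return {
        "gi": int(fields[1]),
        "database": fields[2],      # "gb" marks a GenBank entry
        "accession": fields[3],     # accession.version
        "score": float(score),      # assumed to be the alignment score column
        "e_value": float(e_value),  # assumed to be the E-value column
    }


hits = [parse_hit(line) for line in HITS.splitlines()]
best = min(hits, key=lambda hit: hit["e_value"])
```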
Searching w/ BLAST – Interpreting Results
High quality hits
  - Matching sequences with low E-values (high scores), high % identity
  - >25% identity may imply similar function, 3D structure
  - Caveat – similarity does not guarantee homology
Low quality hits
  - No matching sequences
  - Few matching sequences, with high E-values (low scores), low % identity
  - Absence of match does not always mean no homology
    - Check sequence format
Comparison of the Programs
Concept:
BLAST produces local alignments, while FastA is a global alignment tool. BLAST can report more than one HSP per database entry, while FastA reports only one segment (match).
Speed:
BLAST > FastA
BLAST (package) is a highly efficient search tool.
Comparison of the Programs
Sensitivity:
FastA > BLAST (old version!)
FastA is more sensitive, missing fewer homologous sequences on average (the opposite can also happen if no identical residues are conserved, but this is infrequent). It also gives better separation between true hits and random hits.
Statistics:
BLAST calculates probabilities, and it sometimes fails entirely if some of the assumptions used are invalid.
FastA calculates significance 'on the fly' from the given dataset, which is more relevant but can be problematic if the dataset is small.