
Biological databases

• Bioinformatics is concerned with biological databases and with the
software tools (computer programs) written to manipulate them.
• Biological databases are collections of scientific data generated by
researchers on a particular biological aspect, grouped and well
documented.
• Information in these databases can be searched, compared, retrieved,
and analyzed.
Biological databases
Databases (biological) are divided into:
• Generalized databases: DNA, protein, carbohydrate, etc.
• Specialized databases: Expressed Sequence Tags (EST), Genome Survey
Sequences (GSS), Single Nucleotide Polymorphisms (SNP), Sequence
Tagged Sites (STS), RNA databases
Generalized Databases

Generalized databases are divided into:
• Sequence databases
• Structure databases
Sequence databases
• Individual records are sequences of either nucleotides or amino
acids; that is, they are either nucleic acid databases or protein
sequence databases.

Structure databases
• Individual records are biochemically solved structures of
macromolecules.
The nucleic acid databases
Nucleic acid databases are divided into:
• Primary databases
• Secondary databases
The Primary databases
• Primary databases contain data in its original form, taken as such
from the source.

Examples:
GenBank (NCBI, USA): DNA
EMBL (EMBO, Europe): DNA
GSDB (NCGR, USA): DNA
PIR/NBRF (USA): protein
SWISS-PROT (Switzerland): protein
PDB (BNL, USA): 3D structure
Secondary databases
• Also called value-added databases; they contain annotated data and
information.

Examples:
OMIM (Online Mendelian Inheritance in Man): gene and clinical data
GDB (Genome Data Base): human
PROSITE, BLOCKS: protein motifs
KEGG, EcoCyc: metabolism
Other Specialized databases

• Kabat: immunology proteins
• LIGAND: enzyme reaction ligands
• Klotho: biochemical compounds
• PKR (Protein Kinase Resource, SDSC): protein kinases
Nucleic acid sequence databases
• Three premier institutes are regarded as the authorities on
nucleotide sequence databases:
• EMBL (European Molecular Biology Laboratory)
• NCBI (National Center for Biotechnology Information)
• DDBJ (DNA Data Bank of Japan)

These institutes form the International Nucleotide Sequence Database
Collaboration, under which they exchange nucleotide sequence data
daily.
FASTA Format
• The most common and internationally accepted sequence format in
bioinformatics is the FASTA format.
• A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is
distinguished from the sequence data by a greater-than (">") symbol in
the first column. It is recommended that all lines of text be shorter
than 80 characters.
• Let us see an example:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVHCLMNTTVTTGLLI
NGSYSENRTQIWQKHRTSNDSALLILLNKHYNLT
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVHCLMNTTVTTGLLI
NGSYSENRTQIWQKHXXXXXXXXXXXXXXXXXXXXXRTSNDSALLILL
NKHYNLT
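The FASTA layout described above can be read with a few lines of Python. This is only a minimal sketch, not a full parser: the function name `parse_fasta` and the header `example_record hypothetical description` are made up for illustration, and the sequence is the first fragment shown on the slide.

```python
from typing import Iterable, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (description, sequence) pairs from FASTA-formatted lines."""
    records: List[Tuple[str, str]] = []
    description: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new record.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical header; sequence fragment taken from the slide.
example = """>example_record hypothetical description
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVHCLMNTTVTTGLLI
NGSYSENRTQIWQKHRTSNDSALLILLNKHYNLT
"""

records = parse_fasta(example.splitlines())
description, sequence = records[0]
print(description)
print(len(sequence), "residues")
```

For real work, an established library parser is usually a safer choice than a hand-written one.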
• Sequences are expected to be represented in the standard IUB/IUPAC
amino acid and nucleic acid codes, with these exceptions: lower-case
letters are accepted and mapped into upper case; a single hyphen or
dash can be used to represent a gap of indeterminate length; and in
amino acid sequences, U and * are acceptable letters.
• Before submitting a request, any numerical digits in the query
sequence should either be removed or replaced by appropriate letter
codes (e.g., N for an unknown nucleic acid residue or X for an unknown
amino acid residue).
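The clean-up rules above (map to upper case, then remove digits or replace them with an unknown-residue code) could be applied before submitting a query. This is a rough sketch: `clean_query` and `invalid_letters` are hypothetical helper names, and the nucleotide alphabet used for checking is an assumption based on the codes listed in these slides.

```python
import re

# Assumed alphabet: the nucleic acid codes from the slides plus the gap symbol.
NUCLEOTIDE_CODES = set("ACGTUMRWSYKVHDBN-")


def clean_query(sequence: str, digit_replacement: str = "") -> str:
    """Strip whitespace, remove or replace digits, and map to upper case."""
    compact = re.sub(r"\s+", "", sequence)
    return re.sub(r"\d", digit_replacement, compact).upper()


def invalid_letters(sequence: str, alphabet: set) -> list:
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - alphabet)


removed = clean_query("acgt 12 nnAC")         # digits dropped
replaced = clean_query("acgt 12 nnAC", "N")   # digits become N (unknown)
print(removed, replaced)
print(invalid_letters(removed, NUCLEOTIDE_CODES))
```

Passing `"X"` as `digit_replacement` would be the matching choice for an amino acid query.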
The nucleic acid codes supported are:

Code   Meaning              Code   Meaning
A      adenosine            B      G, T, or C
C      cytidine             G      guanosine
D      G, A, or T           V      C, G, or A
T      thymidine            R      G or A (purine)
U      uridine              Y      T or C (pyrimidine)
M      A or C (amino)       K      G or T (keto)
S      G or C (strong)      H      A, C, or T
W      A or T (weak)        N      A, G, C, or T (any)
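The ambiguity codes in the table can be held in a small lookup table. The sketch below expands an ambiguous sequence into every concrete sequence it could stand for; the names `IUPAC_NUCLEOTIDES` and `expand` are illustrative, not part of any particular library.

```python
from itertools import product
from typing import List

# Each code maps to the bases it can represent, following the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(sequence: str) -> List[str]:
    """List every unambiguous sequence matched by an IUPAC-coded sequence."""
    choices = [IUPAC_NUCLEOTIDES[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand("AR"))        # R stands for a purine: G or A
print(len(expand("AYN")))  # 1 * 2 * 4 combinations
```

The number of expansions grows multiplicatively with each ambiguous position, so this approach only suits short sequences such as primers or motifs.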
GenBank
• GenBank is the NIH (National Institutes of Health) genetic sequence
database, housed at the National Center for Biotechnology Information,
National Library of Medicine, USA. It consists of an annotated
collection of all publicly available DNA sequences.
• As of August 2001, GenBank held approximately 13,543,000,000 bases
in 12,814,000 sequence records.
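To put the August 2001 figures in perspective, a quick calculation gives the average record length. Both inputs are the rounded values quoted above, so the result is approximate.

```python
# Approximate GenBank totals as of August 2001, as quoted in the slides.
total_bases = 13_543_000_000
total_records = 12_814_000

average_length = total_bases / total_records
print(f"About {average_length:.0f} bases per sequence record on average")
```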
• NCBI, EBI, and DDBJ have entered into the International Nucleotide
Sequence Database Collaboration, under which the participating
institutes exchange data daily. As a result, the three major databases
are effectively the same in content, and searching one of them
effectively means searching the other two as well.
• These international bodies hold an International Advisory Meeting
(IAM) and an International Collaborative Meeting (ICM) at regular
intervals to exchange views and update techniques.
