Bioinformatics Overview

BIOINFORMATICS: AN OVERVIEW
T.R. Sharma
Genoinformatics Lab, National Research Centre on Plant Biotechnology
I.A.R.I, New Delhi 110012
trsharma@nrcpb.org

Introduction
Bioinformatics is the computational analysis of biological data, consisting of the information
stored in the form of DNA and protein sequences in various biological databases. The
National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics which assess relationships
among members of large data sets, the analysis and interpretation of various types of data
including nucleotide and amino acid sequences, protein domains, and protein structures; and
the development and implementation of tools that enable efficient access and management of
different types of information."

Analyses in bioinformatics focus on three types of datasets: genome sequences,
macromolecular structures, and functional genomics experiments (e.g. microarray data).
However, bioinformatics tools are also applied to various other data, e.g. phylogenetic and
metabolic pathway analysis, the text of scientific papers, and plant varietal information and
statistics. Analysis of biological data requires the application of a large number of techniques,
such as primary sequence alignment, protein 3D structure alignment, phylogenetic tree
construction, prediction and classification of protein structure, prediction of RNA structure,
prediction of protein function, and expression data clustering. Development of suitable
algorithms is an important part of bioinformatics. Many techniques and algorithms were
developed specifically for the analysis of biological data; for instance, the dynamic
programming algorithm for sequence alignment is one of the most widely used among
biologists. The sequence information generated worldwide is stored systematically in different
types of databases. Hence, it is necessary to understand databases and their different types.

What is a database?
A database is a collection of information stored in a computer in a systematic way, such that a
computer program can consult it to answer questions. A biological database is a large,
organized body of persistent data, usually associated with computerized software designed to
update, query, and retrieve components of the data stored within the system. A simple
database might be a single file containing many records, each of which includes the same set
of information. For example, a record associated with a nucleotide sequence database
typically contains information such as contact name; the input sequence with a description of
the type of molecule; the scientific name of the source organism from which it was isolated;
and, often, literature citations associated with the sequence.
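
The idea of a simple database as a collection of records with the same set of fields can be
sketched in a few lines of Python. This is only an illustration: the field names, the accession
numbers (TOY0001 etc.) and the sequences are invented for the example and do not follow the
actual GenBank record format.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record of a toy nucleotide database (field names are illustrative)."""
    accession: str
    contact_name: str
    molecule_type: str
    organism: str
    sequence: str
    citations: list = field(default_factory=list)


def find_by_organism(records, organism):
    """Return all records whose source organism matches (case-insensitive)."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# A "database" here is simply a list of records with identical fields.
records = [
    SequenceRecord("TOY0001", "A. Researcher", "genomic DNA", "Oryza sativa", "ATGGCGTTAGC"),
    SequenceRecord("TOY0002", "B. Researcher", "mRNA", "Arabidopsis thaliana", "ATGCCCTGA"),
    SequenceRecord("TOY0003", "A. Researcher", "mRNA", "Oryza sativa", "ATGAAATAG"),
]

rice = find_by_organism(records, "oryza sativa")
print([r.accession for r in rice])
```

A real biological database adds persistent storage, indexing and update facilities on top of
this basic record-and-query idea.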

Divisions of DNA databases
Since databases are growing rapidly, they have been further broken into divisions. The
GenBank divisions fall into two general categories: organismal and functional. Sequences
derived from specific groups of organisms are stored in the organismal divisions, which follow
the taxonomy of the source organism, whereas the functional divisions (e.g. EST, STS and
HTG) hold sequences irrespective of their taxonomic classification. The division to which a
sequence record belongs is identified by a three-letter code given at the beginning of each
sequence entry. For instance, the HTG (high-throughput genome) division contains sequences
generated from different organisms. These sequences are generally unfinished and are further
classified as Phase 1 (unfinished, unordered and containing gaps) and Phase 2 (unfinished,
ordered and containing a few gaps). Once a sequence is finished and all gaps are resolved
(Phase 3), it is moved to the appropriate organismal division, e.g. PLN in the case of plants.
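
The phase rules for HTG sequences described above can be written down as a small decision
function. This is a minimal sketch: the function names and the three boolean flags are
assumptions made for illustration and are not part of any GenBank tool.

```python
def htg_phase(finished, ordered, has_gaps):
    """Classify a genomic sequence into HTG phase 1, 2 or 3 as described in the text."""
    if finished and not has_gaps:
        return 3          # finished, all gaps resolved
    if ordered:
        return 2          # unfinished, ordered, a few gaps remain
    return 1              # unfinished, unordered, contains gaps


def division_for(phase, organism_division="PLN"):
    """Phase 3 sequences move to the organismal division; others stay in HTG."""
    return organism_division if phase == 3 else "HTG"


for flags in [(False, False, True), (False, True, True), (True, True, False)]:
    phase = htg_phase(*flags)
    print(flags, "-> phase", phase, "division", division_for(phase))
```
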
A huge wealth of information in the form of DNA and protein sequences and publications on
molecular biology is stored in the data banks (Fig. 1). The major public data banks that
handle DNA and protein sequences are GenBank in the USA (http://www.ncbi.nlm.nih.gov),
EMBL (European Molecular Biology Laboratory) in Europe (http://www.ebi.ac.uk/embl/) and
DDBJ (DNA Data Bank of Japan) in Japan (http://www.ddbj.nig.ac.jp). The growth of DNA
sequence data in GenBank is depicted in Fig. 1. This rapid growth is due to the various
collaborative international programmes started during the past few years to sequence the
complete genomes of various organisms. The whole genomes of various microorganisms have
already been sequenced by The Institute for Genomic Research (TIGR) and can be seen on
their website, www.tigr.org. Large genomes such as those of human (3 billion bp), rice
(450 Mb), Arabidopsis (130 Mb) and mouse (2.5 billion bp) have also been sequenced, and
the data are in the public domain in GenBank. These DNA sequences now have to be used in
meaningful ways for the welfare of mankind. Different types of sequences of important crops
available in the public domain are listed in Table 1.

Fig.1. Status of Sequences submitted in the GenBank (Source: NCBI)

Table 1. Different types of sequences of important crops available in public domain*

Type of database in public domain: Plant species

Whole genome: Oryza sativa, Arabidopsis thaliana

Partial genome: T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum,
    L. esculentum, V. vinifera, Poncirus trifoliata, Medicago truncatula, Lotus corniculatus

EST: Aegilops tauschii, Allium cepa, Arabidopsis thaliana, Avena sativa, Beta vulgaris subsp.
    vulgaris, Brassica napus, Brassica oleracea, Brassica rapa, Capsicum annuum, Coffea
    arabica, Glycine max, Gossypium arboreum, Gossypium hirsutum, Helianthus annuus,
    Hordeum vulgare, Lactuca sativa, Lolium perenne, Lotus corniculatus, Lycopersicon
    esculentum, Malus domestica, Medicago sativa, Medicago truncatula, Nicotiana
    benthamiana, Nicotiana tabacum, Oryza sativa, Phaseolus coccineus, Phaseolus vulgaris,
    Saccharum officinarum, Secale cereale, Solanum melongena, Solanum tuberosum, Sorghum
    bicolor, Triticum monococcum, Vitis vinifera, Zea mays

mRNA: T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum,
    L. esculentum, V. vinifera, Medicago truncatula, L. corniculatus, O. sativa, A. thaliana

Protein: Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, V. vinifera,
    C. sinensis, M. truncatula, E. globulus, O. sativa, A. thaliana

BAC end: Oryza australiensis, O. brachyantha, O. glaberrima, O. granulata, O. latifolia,
    O. minuta, O. officinalis, O. punctata, O. ridleyi, O. rufipogon, O. schlechteri,
    G. hirsutum

*Source: NCBI

Divisions of Protein databases
Protein sequences are mainly stored in two databases, EMBL and GenBank. Swiss-Prot, a very
well maintained and curated database, is maintained at the Swiss Institute of Bioinformatics.
Though it is a small database, it has important annotations which are freely available to
academic users. PIR (Protein Information Resource) is another protein database, and protein
records are also derived by translating the coding sequences in GenBank. The PIR database is
further subdivided into four sections, PIR1, PIR2, PIR3 and PIR4, on the basis of the degree
of annotation.

DNA Sequence Analysis
With the advent of the internet and Web browsers, bioinformatics tools are now easily
available to biologists. These tools are indispensable for any genome sequencing centre. The
analysis of DNA sequences starts as soon as they come off the sequencing machine. The first
and foremost task of a biologist is to check the accuracy of the sequence obtained from the
machine. One way is to locate the cloning sites of the insert in the sequencing vector. If the
insert is a PCR product, one should look for the primer sequences used in the amplification
of that product. One can then perform a Basic Local Alignment Search Tool (BLAST) search
against the DNA sequence database in GenBank and examine the probable matches. If the
unknown sequence shows hits with any sequence of the same or a related organism, it is
considered a true sequence. These basic steps can be performed manually if the dataset is
very small or if one has to deal with a single or a few sequences. However, in large genome
sequencing projects one has to handle thousands of sequences at a given time.
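
When many reads have to be checked, the primer check described above is easily automated.
The following is a minimal sketch that looks for exact matches only; the function names and
the example sequences are invented for illustration, and a real pipeline would also tolerate
sequencing errors and trim vector sequence.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def locate_insert(read, forward_primer, reverse_primer):
    """Return the amplified region (primers included) if both primers are found.

    The forward primer is searched as given; the reverse primer is searched as
    its reverse complement, which is how it appears on the sequenced strand.
    Returns None if either primer is missing.
    """
    read = read.upper()
    fwd = forward_primer.upper()
    rev_rc = reverse_complement(reverse_primer.upper())
    start = read.find(fwd)
    if start == -1:
        return None
    end = read.find(rev_rc, start + len(fwd))
    if end == -1:
        return None
    return read[start:end + len(rev_rc)]


read = "NNNNATGGCCTTTAAACCCGGATCANNN"
print(locate_insert(read, "ATGGCC", "TGATCC"))
```

A read for which `locate_insert` returns None is a candidate for re-sequencing or manual
inspection.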

Searching for Sequence Alignment
Once a high-quality sequence is obtained, one has to ask an important question: is this a new
sequence, or is it similar to other DNA sequences available in the databases? To answer this
question, one has to perform a database search for sequence comparison. All sequence
searching methods rely on the basic concepts of alignment and distance between sequences,
and are built on pairwise sequence alignment. There are different algorithms to perform
global and local alignments (Fig. 2). In a global alignment, the input sequence is aligned over
its complete length with sequences available in the databases, whereas in a local alignment
only the most similar segments of the input sequence are aligned with the database sequences.
Sequence comparison (DNA/protein) against a database is one of the most important and
powerful tools of bioinformatics. This type of comparison is generally performed with two
programmes, BLAST and FASTA, which compare an unknown sequence against a sequence
database. BLAST finds the best local alignments between the unknown sequence and the
database by using an approach based on matching short sequence fragments and a powerful
statistical model, whereas FASTA uses a method of approximation which tries to concentrate
only on significant alignments. In the BLAST search output, Expect (E) values and bit scores
are reported to judge the significance of a match between the unknown sequence and
sequences available in the database (Fig. 3). The significance of a BLAST hit is very
important for the interpretation of results. Because of the degeneracy of the genetic code,
about 67% identity at the DNA level can correspond to 100% identity at the protein level. It is
also suggested that at least 75% sequence identity between two sequences should be observed
for a hit to be considered significant.
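
The difference between global and local alignment can be seen in the dynamic programming
recurrences themselves. The sketch below computes only the optimal score (not the alignment)
with a simple linear gap penalty. The scoring values are arbitrary assumptions chosen for
illustration, and this is the textbook Needleman-Wunsch / Smith-Waterman scheme, not the
heuristic actually used by BLAST or FASTA.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch), end gaps are penalised.
    local=True:  local alignment (Smith-Waterman), scores never drop below 0
                 and the best cell anywhere in the matrix is returned.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


query, subject = "TTTACGTTT", "GGACGGG"
print("global:", alignment_score(query, subject))
print("local: ", alignment_score(query, subject, local=True))
```

For two sequences that share only a short segment, the local score reflects that segment
alone, while the global score is pulled down by the mismatches and gaps outside it.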

Fig. 2. Global and local alignments between two DNA sequences

Fig.3. BLAST output showing Bit score and E values after similarity search
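
The relation between the bit score and the E value reported in BLAST output follows the
standard Karlin-Altschul statistics: the bit score S' is a normalised form of the raw score S,
S' = (lambda*S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'),
where m is the query length and n the database length. A minimal sketch is given below. The
values of lambda and K depend on the scoring system; the numbers used in the example are
placeholders for illustration only, and BLAST itself uses effective (corrected) lengths.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score into a bit score (Karlin-Altschul normalisation)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


# Placeholder parameters, for illustration only.
bits = bit_score(raw_score=120, lam=0.267, k=0.041)
print("bit score:", round(bits, 1))
print("E value:  ", expect_value(bits, query_length=300, database_length=10**9))
```

The smaller the E value, the less likely it is that the hit arose by chance; a higher bit score
always gives a lower E value for the same query and database.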

Gene Prediction and Annotation
Simply determining four alphabets (ATGC) of DNA sequences of any organism has no value
until some meaning is derived from this by gene prediction. Gene prediction is complex work
and there is no algorithm which can exactly predict the true exons in a DNA sequence.
Basically two major considerations are taken into consideration while predicting a gene. 1)
identification of structural elements such a start/ stop codon and splice sites of the unknown
sequence and 2) performing homology search against protein, EST and cDNA database to
identify potential coding regions. For gene prediction, very commonly used software
GENSCAN developed by MIT, USA (http://www.genes.mit.edu/GENSCAN.html), which is
freely available on Web and online analysis of DNA sequences, can be performed. The output
obtained from the GENSCAN is then used for gene annotation by using BLAST to search the
public or private DNA sequence databases to find out the matches to the unknown query
sequence with millions of sequences available in the Gen Bank. A very popular Website
http://www.ncbi.nlm.nih.gov is available for BLAST at NCBI`s Home page which performs
searches by using various criteria and options (Fig.4).
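
The first consideration, locating start and stop codons, can be illustrated with a simple open
reading frame (ORF) scan. This is a deliberately naive sketch: it scans only the forward
strand, ignores introns and splice sites, and is not the probabilistic model used by
GENSCAN. The function name and the minimum-length threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ORFs (ATG ... in-frame stop) in the three forward reading frames.

    Returns a sorted list of (start, end) positions, 0-based, end exclusive,
    with the stop codon included in the ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                     # look for the next ORF in this frame
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

In eukaryotic genomic DNA such a scan finds only intron-free coding regions, which is why
splice-site models and homology searches are needed in addition.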

Fig. 4. Performing BLAST search at NCBI Home page

Primer Design
Another important use of genome sequence data, after predicting genes, is to design primers
either for PCR or for sequencing. Such primers are used to amplify genes or their alleles
from known sources. Though the PRIME program within the GCG package is mainly used for
this purpose, PRIMER3, a Web-based program
(www-genome.wi.mit.edu/genome_software/other/primer3.html), is also commonly used for
designing primers. PCR primer pairs are designed to amplify a well-defined target sequence
from the template. Some of the important considerations while designing primers are the GC
content, melting temperature, primer size and size of the PCR product to be amplified. These
parameters can be left at their default settings or changed as required.
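
The primer parameters mentioned above are simple to compute for a candidate sequence. The
sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate
intended for short oligonucleotides; PRIMER3 uses more elaborate thermodynamic models. The
acceptance ranges in `check_primer` are illustrative assumptions, not PRIMER3 defaults.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def check_primer(primer, size=(18, 25), gc=(40.0, 60.0), tm=(50, 65)):
    """Check a primer against size, GC content and Tm ranges (illustrative values)."""
    return {
        "size_ok": size[0] <= len(primer) <= size[1],
        "gc_ok": gc[0] <= gc_content(primer) <= gc[1],
        "tm_ok": tm[0] <= wallace_tm(primer) <= tm[1],
    }


candidate = "ATGCATGCATGCATGCATGC"
print(len(candidate), gc_content(candidate), wallace_tm(candidate))
print(check_primer(candidate))
```

The ranges are passed as arguments so that they can be changed as required, in the same way
that the default settings of a primer design program can be overridden.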

Phylogenetic Analysis
Once a similarity search has been performed between the unknown sequence and the database
sequences to find the per cent similarity between them, the natural next question is how these
sequences are related to each other. Sequences derived from two closely related organisms
show more similarity at the DNA level, and those from distantly related organisms show more
dissimilarity. To find the evolutionary relationship among sequences derived from different
organisms, a phylogenetic tree is constructed (Fig. 5). Such an evolutionary tree can be
constructed on the basis of phenotypic markers, molecular markers or sequence information.
A typical phylogenetic tree comprises nodes, branches and the termini of the branches. The
common node from which all the branches emerge is termed the root of the tree, though some
trees are constructed as unrooted trees where the common evolutionary point is not known.
For constructing a phylogenetic tree, the PILEUP option of the GCG package is commonly
used. Besides, DNASTAR software (www.dnastar.com) also has options to construct trees from
different DNA or protein sequences. Tools such as MacClade
(www.phylogeny.arizona.edu/macclade/) can also be used for evolutionary studies of different
organisms based on their DNA sequences.
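
How a tree is derived from sequence dissimilarity can be illustrated with UPGMA, one of the
simplest distance-based clustering methods. This is a sketch under stated assumptions: the
sequences are already aligned and of equal length, the distance is the plain proportion of
differing sites, and the four toy sequences are invented. It is not the method implemented in
PILEUP, DNASTAR or MacClade.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster aligned sequences by UPGMA; return the topology as a Newick string."""
    sizes = {name: 1 for name in seqs}            # cluster label -> number of leaves
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        a, b = sorted(pair)
        merged = f"({a},{b})"
        for other in sizes:
            if other in pair:
                continue
            # Size-weighted average distance from the merged cluster to 'other'.
            d = (dist[frozenset((a, other))] * sizes[a] +
                 dist[frozenset((b, other))] * sizes[b]) / (sizes[a] + sizes[b])
            dist[frozenset((merged, other))] = d
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACCTAT",
    "D": "TGCAACCAGG",
}
print(upgma(toy))
```

The most similar sequences are joined first, so closely related sequences end up as
neighbouring termini and the most divergent sequence joins the tree last.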

Similarly, bioinformatics tools can be used for protein function analysis by database search.
Finding SSR markers and SNP markers from the EST or genome sequences can be performed
in silico by using different algorithms which will also be discussed in the presentation.
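
As an illustration of in silico marker discovery, perfect SSRs (simple sequence repeats) can be
located with a regular expression. This is a minimal sketch: the function name and the default
thresholds (motifs of 2 to 6 bp repeated at least four times) are assumptions for the
example, and dedicated SSR tools also handle imperfect and compound repeats.

```python
import re


def find_ssrs(dna, min_repeats=4, max_motif=6):
    """Find perfect tandem repeats of motifs 2..max_motif bp long.

    Returns a list of (start, motif, repeat_count); start is 0-based.
    Runs of a single base (e.g. poly-A) are skipped.
    """
    # Lazy motif group so that the shortest repeating unit is reported.
    pattern = re.compile(r"([ACGT]{2,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for m in pattern.finditer(dna.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue                      # mononucleotide run, not reported here
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


print(find_ssrs("TTGAGAGAGAGACC"))
print(find_ssrs("CAGCAGCAGCAGT"))
```

The flanking sequence on either side of each reported repeat can then be used to design
primers for the SSR marker.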

Fig. 5. Phylogenetic analysis of resistance gene analogue sequences (sk21, sk95, sk10, sk3,
sk76, sk101 and sk65) obtained from rice and known resistance gene sequences (L6, M, N,
RPS2 and Xa1) isolated from different crops. Analysis was performed with DNASTAR
software.

Conclusions
In functional genomics, gene expression can be investigated at the whole-genome level under
different stresses by using microarrays. Nowadays, gene expression databases of this type
are being prepared for different organisms and even for different tissues. Bioinformatics
tools are helpful in locating DNA sequences in GenBank simply by entering accession
numbers, aligning two or more sequences, performing similarity searches for unknown
sequences in GenBank, assembling short sequence reads and developing consensus sequences,
finding genes and markers in silico, and performing comparative analysis of different
genomes.

Selected References and Web Resources
Sobral, B.W.S. 1997. Common language of bioinformatics. Nature. 389:418.
Brown, S.M. 2000. Bioinformatics: A Biologist's Guide to Biocomputing and the Internet.
Eaton Publishing, Natick, MA, USA.
Baxevanis, A.D. and Ouellette, B.F.F. 2001. Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins. Second Edition. John Wiley and Sons, Inc., NY.
GENSCAN: http://genes.mit.edu/GENSCAN.html
FGENESH: http://www.softberry.com/berry.phtml
