
DATABASE SEARCHING

Introduction

Sequence database similarity searching is one of the most common computing techniques in modern biology. It allows the large repositories of DNA and protein sequence information to be queried using a sequence, with the goal of identifying database sequences homologous to this query sequence. This technique can be extremely useful in the study of gene and protein structure, function and evolution. One particular application of this technique in protein biochemistry is the identification of a protein using the sequence of a short peptide fragment, which could be obtained for example by amino-terminal sequencing of a spot on a 2-D electrophoresis gel. However, this type of search often does not produce results because typical similarity search programs are optimized for longer query sequences, and must be specially configured to handle short queries.

Most database similarity search programs are, in a nutshell, sequence alignment programs. Their fundamental principle is to find the best alignment between the query sequence and every sequence in the database. For protein sequences, this involves taking into account not only amino acids which are identical between the two sequences, but also "related" amino acids which may indicate a common origin or function for the two sequences (for example, a Lys-Arg mismatch is much less of a mismatch than Lys-Glu). The program can then derive a "similarity score" from each of these alignments, by assigning positive values to matches and negative or zero values to mismatches or gaps, then report the database sequences with the highest similarity scores.

In practice, aligning a whole sequence database with a query sequence is unmanageable on all but specially designed massively parallel computers. Database similarity search programs must therefore use "shortcuts" in order to keep search times down to a practical level, without too much loss in sensitivity.
The two most common implementations of this approach are BLAST (1) and FastA (5). This article discusses both methods with special emphasis on their use with short peptide query sequences.

BLAST

BLAST (Basic Local Alignment Search Tool) is, in fact, a collection of five different programs which allow different combinations of nucleic acid and protein query sequences and databases to be used. The programs of most relevance to protein identification from short peptides are blastp, which searches a protein sequence database, and tblastn, which searches the 6 possible translations of the sequences in a DNA sequence database. A common way of accessing BLAST is through the World-Wide Web, for example on the public access server at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). The software itself is freely available by anonymous ftp (ftp://ncbi.nlm.nih.gov/blast).

BLAST speeds up its search of the database by first looking up an "index" of every oligomer (by default, every tripeptide) in the database for oligomers showing a sufficient degree of similarity to those present in the query peptide. This allows the program to identify very quickly which database sequences show some similarity with the query. BLAST then tries to extend these initial regions of similarity into a larger ungapped alignment. This alignment without gaps is called an HSP (high-scoring segment pair).

Gapped BLAST (3) and WU-BLAST (http://blast.wustl.edu/) are new versions of BLAST able to refine these HSPs by inserting gaps in the sequences, therefore giving rise to gapped alignments which make more biological sense than ungapped HSPs. Both programs are also faster than the original BLAST. In gapped BLAST (available at NCBI, http://www.ncbi.nlm.nih.gov/), only database sequences containing at least two nearby oligomers with some similarity to the query sequence are considered for further analysis and extension. This "two-hit" approach increases the sensitivity of the search as well as its speed, provided there is sufficient similarity between the sequences. However, for short peptide sequences which would give rise to low-scoring HSPs, the original version of BLAST appears more sensitive than gapped BLAST. WU-BLAST is another refinement of the original BLAST algorithm which provides faster, more sensitive searches and gapped alignments. ISREC, the Swiss Institute for Experimental Cancer Research (http://www.ch.embnet.org/), is one site providing free access to WU-BLAST on the WWW.

For each of the alignments produced by the search program, a score can be calculated. For protein sequence alignments, each amino acid pair is given a score taken from a scoring matrix (a table assigning a value to every possible amino acid pairing) and the alignment score is calculated by adding up individual amino acid pair scores and subtracting penalties for gaps present in the alignment. A major strength of BLAST is the statistical significance evaluation it performs on the results. Given the score of an alignment, BLAST is able to evaluate its statistical expectation, which is related to the probability that this alignment is a chance occurrence.
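The alignment score described above (pair scores taken from a scoring matrix, minus gap penalties) can be sketched in a few lines of Python. This is only an illustration: the score values, the gap penalty and the set of "related" residue pairs are made-up assumptions, not values from a published matrix such as PAM or BLOSUM, and the function names are hypothetical.

```python
# Toy scoring scheme. All numbers are illustrative assumptions,
# NOT values from a published matrix such as PAM or BLOSUM.
IDENTITY_SCORE = 5
RELATED_SCORE = 2     # e.g. Lys/Arg, two basic residues
MISMATCH_SCORE = -1
GAP_PENALTY = 4       # subtracted once per gapped position

# Hypothetical set of "related" amino acid pairs (one-letter codes).
RELATED_PAIRS = {frozenset("KR"), frozenset("DE"), frozenset("IL")}


def pair_score(a, b):
    """Score one aligned amino acid pair."""
    if a == b:
        return IDENTITY_SCORE
    if frozenset((a, b)) in RELATED_PAIRS:
        return RELATED_SCORE
    return MISMATCH_SCORE


def alignment_score(row_a, row_b):
    """Sum the pair scores and subtract a penalty for each gap ('-')."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= GAP_PENALTY
        else:
            score += pair_score(a, b)
    return score


print(alignment_score("MKLV", "MRLV"))    # Lys-Arg pair: 17
print(alignment_score("MKLV", "MELV"))    # Lys-Glu pair: 14
print(alignment_score("MK-LV", "MKALV"))  # one gap: 16
```

With this toy scheme the Lys-Arg alignment scores higher than the Lys-Glu one, which is the behaviour the text describes for "related" amino acids.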
Statistical expectations are easier to deal with than alignment scores because they are independent of the scoring scheme used to calculate alignment scores. For example, knowing that an alignment has a score of 100 is not very informative in the absence of a reference scale. On the other hand, describing this alignment as having a 10% probability of being a chance occurrence gives a good idea of how significant the match is. In a typical BLAST output, this probability is given in the P(N) column, next to the name and description of the database entry (Figure 1a and 1b).

After creating the alignments, BLAST screens them according to their likelihood of being a chance occurrence. The key parameter in this process is the Expect (Statistical expectation) value, which controls the level of similarity with the query required for a database sequence to be reported as a match. The Expect value corresponds to the number of sequences expected to be found in the database by chance alone. By default, the Expect value is 10, which means that if the database contained only random sequences, 10 of these sequences would be reported as having a suitable level of similarity with the query given the size of the database. Lowering the Expect value makes the search more "stringent" since fewer sequences with lower similarity (which are considered more likely to be random matches) are reported.

When a peptide sequence of 8 to 10 residues is used as the query for a standard BLAST search, the program often fails to report any matching sequences in the database, even though such sequences are present. The major reason for this behavior is the statistical screening process: a short peptide sequence has a high probability of matching sequences in the database by chance alone, and therefore most of the database matches are considered chance occurrences and rejected. This can be illustrated by the following crude example: if a sequence database contains 10^8 residues, a random decapeptide has a probability of approximately 10^8/20^10 ≈ 10^-5 of occurring by chance alone, since there are approximately 10^8 10-mers in the database and 20^10 possible decapeptides. The statistical expectation for a random decapeptide is therefore ~10^8 x 10^-5 = 1000, well in excess of the Expect default value of 10 (note that the statistics used by BLAST are much more sophisticated than those used in this crude example, and do take into account the frequency of occurrence of the different amino acids and the length of the sequences compared). The smaller the query peptide and the larger the database to search, the more likely the peptide is to occur by chance alone, and therefore the more likely the "real" matches are to be rejected as chance occurrences.

The logical solution to this problem would be to use a much higher Expect value. Unfortunately, most BLAST implementations do not allow the Expect parameter to be set to more than 1000, which is not usually sufficient. It is therefore necessary to "fool" the program into assuming that the database is much smaller than it actually is, so that the number of matches expected by chance alone is lowered. This is done by modifying the "effective database length" or Z parameter, which controls the value that the program uses as the number of residues in the database for its calculations. By default, the actual number of residues in the database is used as the value of this parameter, but this can be changed in some BLAST implementations.
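The effect of a smaller effective database length on the reported numbers can be sketched as follows. The sketch rests on two assumptions that are not stated in the text: that the reported probability P and the expectation E are related by P = 1 - exp(-E), and that E is proportional to the database length used in the calculation. The function names are hypothetical; the database size, the Z value and the starting probability are those of Figure 1.

```python
import math


def p_to_expect(p):
    """Convert a probability P into an expectation E, assuming P = 1 - exp(-E)."""
    return -math.log(1.0 - p)


def expect_to_p(e):
    """Convert an expectation E back into a probability P."""
    return 1.0 - math.exp(-e)


def rescale_p(p, actual_length, effective_length):
    """Rescale P for a smaller effective database length.

    Assumes the expectation is proportional to the database length.
    """
    e = p_to_expect(p)
    return expect_to_p(e * effective_length / actual_length)


ACTUAL_LENGTH = 78_254_572   # total residues in the database of Figure 1
EFFECTIVE_LENGTH = 500_000   # Z value used for Figure 1b

# Top hit of Figure 1a has P(N) = 0.59 with default settings.
p_new = rescale_p(0.59, ACTUAL_LENGTH, EFFECTIVE_LENGTH)
print(round(p_new, 4))  # about 0.0057
```

Under these assumptions the rescaled value is close to the 0.0057 reported for the same hit in Figure 1b. The match itself is unchanged, only the number attached to it, which is why results obtained this way have to be judged from the alignments rather than from the statistics.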
For example, the BLAST interface available to users of the Australian National Genomic Information Service (ANGIS) allows the effective database length to be changed directly by typing in a new value in the corresponding entry box. If the WWW server at NCBI is used, the Advanced BLAST search form must be used, and "Z=" followed by the new value of Z entered in the "other advanced options" box (e.g., to use an effective database length of 100,000, "Z=100000" must be entered in the "other advanced options" box). Which value to use depends on the length of the query peptide and on the size of the database to be searched. It is usually simplest to experiment with a few values ranging from 100,000 to 10,000,000 and to examine the results (especially the alignments) to find a suitable compromise between the identification of homologous sequences and the reporting of "false positives" (similarities occurring by chance alone). The effect of reducing the effective database length when using a short peptide query sequence is demonstrated in Figure 1a and 1b. Note that changing the effective database length invalidates the statistics, and that the results can only be interpreted by using biological knowledge when looking at the alignments and the function of the sequences identified, since the numbers (expectation and probability) are no longer valid.

Another factor which may affect the result of the search is the presence of low complexity regions in the query peptide. A low complexity region can be a simple repeat or a region of abnormal amino acid composition, which can bias the statistics when present in a query sequence. Because of this, many BLAST implementations, including the one at NCBI, filter out low complexity regions from the query sequence by default (discussed in ref. 2). This can present a problem with short queries, which can be completely "filtered out" if their composition is biased. It is therefore a good idea to run the search with and without filtering and to compare the results to make sure that these are not affected by compositional bias.

FastA

FastA (5) is a similarity search program which can be used to search a nucleotide sequence database with a nucleotide query sequence, or a protein sequence database with a protein query sequence. Its companion program TFastA (or FastA-Trans) is used to search a 6-frame translation of a nucleotide sequence database with a protein query sequence. FastA is distributed freely and is available for a range of platforms at ftp://ftp.virginia.edu/pub/fasta (the installation requires that the sequence databases be installed on your local computer). FastA can also be accessed through a number of public WWW sites such as the Baylor College of Medicine search launcher (http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html) and the European Bioinformatics Institute (http://www2.ebi.ac.uk/fasta/).

FastA accelerates database searching by using several passes over the database and only retaining a "best matching" subset for further analysis at each pass, therefore "pruning down" the database progressively. The first pass is similar to BLAST's, in that short sequence "words" are compared for rapid detection of small regions of similarity. However, FastA uses a smaller word size (called k-tuple or ktup in this case): by default, 6 for nucleic acids, and 2 for proteins (versus 11 and 3 for BLAST). Unlike BLAST, FastA requires the words to match perfectly, and may overlook at this stage some weak but significant similarity between protein sequences (for example, a Lys-Arg match will be ignored). With a protein sequence query, decreasing the ktup to 1 can increase the sensitivity of the search significantly, at the expense of speed.
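The exact word matching of the first pass, and the effect of the ktup setting on it, can be illustrated with a minimal Python sketch. It covers only the word lookup step, not the later passes, and it is not the FastA implementation: the function names and the two toy sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(sequence, ktup=2):
    """Map every word of length ktup to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - ktup + 1):
        index[sequence[i:i + ktup]].append(i)
    return index


def shared_words(query, target, ktup=2):
    """Return (query_pos, target_pos) pairs where a word matches exactly."""
    index = build_word_index(target, ktup)
    hits = []
    for q in range(len(query) - ktup + 1):
        # .get() avoids adding empty entries to the defaultdict
        for t in index.get(query[q:q + ktup], []):
            hits.append((q, t))
    return hits


# Toy sequences (hypothetical): the query has Lys (K) where the target has Arg (R).
query = "MKLV"
target = "AMRLVK"

print(shared_words(query, target, ktup=2))  # [(2, 3)]: only "LV" is shared
print(shared_words(query, target, ktup=1))  # [(0, 1), (1, 5), (2, 3), (3, 4)]
```

With ktup set to 2, the words "MK" and "MR" do not match, so the Lys-Arg position contributes nothing at this stage. With ktup set to 1 every identical residue is a hit, which gives more starting points at the cost of more work in the later passes.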
The program then extends the initial short regions of similarity into alignments without gaps, the best of which are subsequently joined into a longer alignment (whose score is designated the initn score). Finally, regions with a high initn are aligned with the query sequence using a slower, more sensitive method. The score calculated for these gapped alignments is called the opt score. By default this score is used to sort the sequences in the output.

The more recent versions of FastA perform a statistical evaluation of the results similar to that of BLAST, albeit less rigorous, since the statistical model used assumes alignments without gaps (2). First, a normalized score called the z-score is derived from one of the other scores (by default, the opt score). This score is then converted into a statistical expectation value, which approximates the probability that a given match is a chance occurrence (when under 0.05). Database sequences with an expectation value lower than the Cutoff expectation parameter are listed in the program output, together with the various scores (Figure 1c). The statistical expectation is given in the E column. The calculation of the statistical expectation from the z-score is based on alignments between the query and a large number of sequences sampled at random from the database, which are used as a representative sample of chance alignments. Since this process is less stringent than the strict statistics used by BLAST, FastA reports many more significant matches than BLAST when a short peptide is used as the query sequence (when both programs are used in their default configuration). An example is shown in Figure 1c. Even then, it is recommended to increase the value of the Cutoff expectation parameter (for instance, to 100 instead of the default value of 10) when using FastA with a short query sequence.

(a)
                                                                   Smallest Sum
                                                            High   Probability
Sequences producing High-scoring Segment Pairs:             Score  P(N)     N

pir|-|E49958 AMP-activated protein kinase, 63K, catalytic...  44  0.59     1

(b)
                                                                   Smallest Sum
                                                            High   Probability
Sequences producing High-scoring Segment Pairs:             Score  P(N)     N

pir|-|E49958 AMP-activated protein kinase, 63K, cataly...     44  0.0057   1
sp|-|AAK1_PIG 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...     44  0.067    1
gp|-|G2077991 H.sapiens mRNA for AMP-activated protein ...     44  0.082    1
sp|-|AAK1_RAT 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...     44  0.090    1
gp|-|G1326255 Caenorhabditis elegans cosmid T01C8              44  0.090    1
sp|-|AAK2_HUMAN 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...   37  0.59     1
sp|-|AAK2_RAT 5'-AMP-ACTIVATED PROTEIN KINASE, CATALYTI...     37  0.59     1
pir|-|S51025 AMP-activated protein kinase - human              37  0.59     1
gp|-|G862473 Rattus norvegicus 5'-AMP-activated protei...      37  0.59     1
gp|-|G581218 E. coli rpmH gene for ribosomal protein L34       24  0.94     1
gp|-|G699230 Mycobacterium leprae cosmid B2266                 33  0.96     1
gp|-|G1771144 L.delbrueckii pepG and pepW genes and unk...     32  0.99     1
sp|-|YH04_YEAST HYPOTHETICAL 91.2 KD PROTEIN IN RPS7A-SCH...   32  0.99     1
pir|-|S65519 carcinoembryonic antigen-binding protein,...      26  0.996    1
gp|-|G736297 S.cerevisiae chromosome XIII cosmid 8325          31  0.996    1
pir|-|S59441 hypothetical protein YMR200w - yeast (Sac...      31  0.996    1
sp|-|YM56_YEAST HYPOTHETICAL 28.9 KD PROTEIN IN CLN1-RAD1...   31  0.997    1
sp|-|TYSY_MYCGE THYMIDYLATE SYNTHASE (EC 2.1.1.45) (TS         31  0.997    1
sp|-|TYSY_LACCA THYMIDYLATE SYNTHASE (EC 2.1.1.45) (TS         31  0.997    1
gp|-|G149601 L.casei thymidylate synthase (thyA) gene,...      31  0.997    1
sp|-|RUVB_SYNY3 HOLLIDAY JUNCTION DNA HELICASE RUVB            31  0.997    1
gp|-|G1743244 H.contortus mRNA for glutamate gated chlo...     31  0.998    1
pir|-|B32735 thyroglobulin - sheep (fragment)                  22  0.998    1
gp|-|G599989 S.cerevisiae chromosome IX cosmid 8277            31  0.998    1
sp|-|YIM8_YEAST HYPOTHETICAL 117.9 KD PROTEIN IN FKH1-STH...   31  0.998    1
gp|-|G1737175 Saccharomyces cerevisiae DNA repair/trans...     31  0.998    1
gp|-|G736704 Human cytoskeleton associated protein (CG...      30  0.9995   1
gp|-|G1905902 Homo sapiens DNA from chromosome 19-cosmi...     30  0.9995   1
pir|-|S42513 recombination-activating protein RAG-2 - ...      22  0.9996   1
sp|-|MEXA_PSEAE MULTIDRUG RESISTANCE PROTEIN MEXA PRECURSOR    30  0.9998   1
pir|-|S50865 avermectin-sensitive glutamate-gated chlo...      30  0.9998   1
gp|-|G1627519 A.vinelandii algG gene                           30  0.9998   1
gp|-|G1072219 Caenorhabditis elegans cosmid R03E9              30  0.9998   1
gnl|gp7|G664874 Saccharomyces cerevisiae chromosome XII c...   30  0.9998   1
gp|-|G208220 E.coli dnaA-lacZ fusion protein gene frag...      24  0.9999   1
gp|-|G554123 Mouse Ig rearranged kappa-chain mRNA V8-J...      21  0.99992  1
pir|-|JS0244 hypothetical 2.85K protein - liverwort (M...      22  0.99992  1
gp|-|G999431 Pleurochrysis carterae chloroplast rpl27 ...      22  0.99992  1

(c)
The best scores are:                                  initn init1   opt   z-sc  E(256142)
E49958     pir|-:AMP-activated protein kinase, (  18)    57    57    57  210.3  0.00017
AAK1_PIG   sp|-:5'-AMP-ACTIVATED PROTEIN K     ( 132)    57    57    57  197.3  0.0009
G2077991   gp|-:H.sapiens mRNA for AMP-acti    ( 257)    57    57    57  192.9  0.0016
AAK1_RAT   sp|-:5'-AMP-ACTIVATED PROTEIN K     ( 548)    57    57    57  188.0  0.003
G1326255   gp|-:Caenorhabditis elegans cosmi   ( 562)    57    57    57  187.8  0.003
AAK2_HUMAN sp|-:5'-AMP-ACTIVATED PROTEIN K     ( 552)    48    48    48  157.3  0.15
AAK2_RAT   sp|-:5'-AMP-ACTIVATED PROTEIN K     ( 552)    48    48    48  157.3  0.15
S51025     pir|-:AMP-activated protein kinase  ( 552)    48    48    48  157.3  0.15
G862473    gp|-:Rattus norvegicus 5'-AMP-acti  ( 552)    48    48    48  157.3  0.15
S59441     pir|-:hypothetical protein YMR200w  ( 218)    32    32    41  139.6  1.5
G736297    gnl|gpd:S.cerevisiae chromosome XI  ( 218)    32    32    41  139.6  1.5
G699230    gp|-:Mycobacterium leprae cosmid B  ( 384)    42    42    42  139.3  1.5
YM56_YEAST sp|-:HYPOTHETICAL 28.9 KD PROTE     ( 256)    32    32    41  138.5  1.7
TYSY_MYCGE sp|-:THYMIDYLATE SYNTHASE (EC 2     ( 287)    37    37    40  134.4  2.9
TYSY_LACCA sp|-:THYMIDYLATE SYNTHASE (EC 2     ( 316)    37    37    40  133.8  3.1
G149601    gp|-:L.casei thymidylate synthase   ( 329)    37    37    40  133.5  3.2
G1743244   gp|-:H.contortus mRNA for glutama   ( 432)    25    25    40  131.7  4
G396866    gp|-:A. thaliana transcribed seque  (  55)    36    36    36  131.6  4.1
PSBO_SYNY3 sp|-:PHOTOSYSTEM II MANGANESE-S     ( 274)    36    36    39  131.3  4.3

Figure 1. Database similarity search using a short protein sequence. A non-redundant protein database combining the SWISS-PROT, PIR and GenPept databases (total length 78,254,572 residues) was searched for sequences similar to the octapeptide &S+L+ILHD (from the N-terminus of the pig AMP-activated protein kinase sequence). The matching database sequences reported by the following programs are listed: (a) BLASTP with default settings, (b) BLASTP with the effective database length (Z) set to 500000, (c) FastA with default settings.

Conclusion

BLAST and FastA are the two most popular methods for similarity searching, and it is good practice to try both when searching a database for related sequences. A comparison of various database similarity search methods has demonstrated that FastA used with a ktup of 1 is more sensitive than BLAST for detecting distant protein homologies, mainly because it produces more meaningful gapped alignments (6). This, together with its more suitable statistical handling of short query sequences, makes FastA a better algorithm for searching sequence databases for entries similar to a short peptide sequence.

However, FastA is much slower than BLAST, especially with the ktup set to 1, and there is a scarcity of public WWW servers offering a FastA search option for some of the larger, most useful databases. In particular, FastA is not available on a public server to search the non-redundant protein database maintained at NCBI, which supports only BLAST. FastA searches of non-redundant, up to date databases may therefore be available only to some investigators, through in-house or subscription-based services (which may also provide BLAST searches of the same databases, therefore improving the integration and management of the results).

In practice, the quality of the database (how up to date, non-redundant and complete it is) is a key factor determining the sensitivity and the usefulness of a database similarity search (7). The advanced BLAST search of the non-redundant protein database at NCBI is therefore the best option for researchers who have access only to public database search servers, especially since the introduction of the gapped BLAST service. However, when searching the database using short peptide query sequences, it may be necessary to reduce the effective database length (therefore invalidating the statistical analysis) in order to obtain any results.
