
Bio 242 | Cellular and Molecular Biology

Bio 242

BIOINFORMATICS
TUTORIAL
Bio 242 Amylase Lab Sequence

Sequence Searches: BLAST
Sequence Alignment: Clustal Omega
3D Structure & 3D Alignments

DO NOT REMOVE FROM LAB.
DO NOT WRITE IN THIS DOCUMENT.
A PDF of this document is available on the Bio 242 website.

Acknowledgements
The Bates Bioinformatics Tutorial was originally developed as part of the Collaborative Technologies
Development project. David Asanuma ('09) created the site under the guidance of Nancy Kleckner,
Associate Professor of Biology, and Michael Hanrahan, Assistant Director of Research and Curricular
Computing. Revision of the content is performed annually by Greg Anderson and Carolyn Lawson to
keep the document up to date with the website.

Page 1 of 28 Bioinformatics Tutorial (rev. 102016)


Bioinformatics Tutorial
Bioinformatics is the acquisition, storage, arrangement, identification, analysis, and communication of
information related to biology. The term was coined in 1990 with the use of computers in DNA
sequence analysis. Think of it as the theoretical branch of molecular biology, like the relationship of
theoretical physics to the general field of physics.
Now that you have obtained information about some of the chemical properties of amylase, in this
exercise you will be comparing the molecular structure of the enzyme among the three (or more!)
species. The tutorial will guide you through finding the gene sequences using both the Entrez search
and BLAST tools, and then comparing them using the Clustal Omega tool.
You will be using the DNA and protein sequence online databases that are the core of
bioinformatics. There are two general types of sequence databases. Primary databases contain
experimental results in an accessible format, but are not sequences that are a population
consensus. DDBJ, EMBL, and GenBank are primary databases. Secondary databases are curated to
reflect consensus sequences from multiple experiments and usually use the primary databases as their
sources.
Abbreviations
DDBJ    DNA Data Bank of Japan
EMBL    European Molecular Biology Laboratory
NCBI    National Center for Biotechnology Information
BLAST   Basic Local Alignment Search Tool
The standard sequence format is called FASTA. All FASTA sequences start with a definition line
(beginning with ">") that carries the accession number and a brief descriptor. The full database
record for a sequence also lists:

- a unique identification number (the accession number)
- the version number of the sequence
- the length of the sequence
- molecule type (DNA or mRNA)
- taxonomic division (for instance, INV = invertebrate)
- last release date
- source organism

Every coding sequence also has a unique protein number assigned to it, starting with AA.
Reference sequences (which undergo continuing curation) are the most complete and up to date and
always start with NT_ for DNA, NM_ for mRNA, or NP_ for protein. Hint: these are the ones you want to
use if possible.
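
As a rough illustration of the FASTA layout, the sketch below splits one record into its definition line and its sequence. The function name parse_fasta_record and the example record (identifier, descriptor, and residues) are invented for this illustration; they are not taken from the NCBI databases.

```python
def parse_fasta_record(text):
    """Split one FASTA record into (identifier, description, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' definition line")
    # Definition line: '>' + identifier, then an optional free-text description.
    header = lines[0][1:]
    identifier, _, description = header.partition(" ")
    # Every remaining line holds residues; join them into one string.
    sequence = "".join(lines[1:])
    return identifier, description, sequence


# Hypothetical record, made up for this example.
example = """>EXAMPLE01.1 made-up amylase fragment [Zea mays]
MKNTSSLCLL
LLVVLCSLTC
"""
identifier, description, sequence = parse_fasta_record(example)
print(identifier, len(sequence))
```

Splitting the definition line at the first space is a simplifying assumption; real definition lines vary between databases.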

Sequence Search Introduction
Entrez
Entrez is a data retrieval system developed by the National Center for Biotechnology Information (NCBI)
that provides integrated access to a wide range of data domains, including literature, nucleotide and
protein sequences, complete genomes, three-dimensional structures, and more. Entrez includes
powerful search features that retrieve not only the exact search results but also related records within a
data domain that might not be retrieved otherwise and associated records across data domains. These
features enable us to gather previously disparate pieces of an information puzzle for a topic of interest.


Effective and powerful use of Entrez requires an understanding of the available data domains, the
variety of data sources and types within each domain, and Entrez's advanced search features. This
tutorial uses corn (Zea mays) alpha-amylase to demonstrate the wide variety of information that we can
rapidly gather for a single gene. The numbers noted in the search results will of course change over time
as the databases grow. The same techniques shown here can be used for any topic of interest.
The search goals are to:
- Identify representative, well-annotated protein amino acid sequence records for several plant
  and animal amylases, using Entrez search and BLAST, to compare using the Clustal Omega multiple
  sequence alignment tool;
- Retrieve associated literature/citations for each accession record (species aa sequence) used;
- Identify conserved domains within the protein;
- Find a resolved three-dimensional structure for the enzymes you used, or, in their absence,
  identify structures with homologous sequence;
- Perform VAST alignments of 3D structures of plant and animal amylases to visualize where
  similarities and differences occur.

Let's get started!
Go to the NCBI website, http://www.ncbi.nlm.nih.gov, by entering the URL in the address field of your
browser.
After accessing the NCBI website, you may now search for corn alpha-amylase sequences in either the
nucleotide or protein databases by selecting one or the other from the Database dropdown menu.
Other points of interest on the NCBI Home Page are the PubMed link, which allows you to search for
journal articles on the structure and function of alpha-amylases, and the BLAST link, which allows you to
search for nucleotide or protein sequences with similarity to your sequence of interest.
For now, make sure you are at the NCBI home page (click on the NCBI icon in the upper left of the NCBI
page to be sure), and choose "Protein" from the search dropdown databases menu. Type "Zea mays
alpha amylase" in the line below. These selections are illustrated in Figure 1 (next page).
Click "Search" to proceed.
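
The tutorial performs this search through the web form. For readers curious how the same query could be written as a URL, here is a minimal sketch that assumes NCBI's E-utilities "esearch" service, which the tutorial itself does not use. The helper name build_esearch_url is hypothetical; the code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service (an assumption of this sketch;
# the tutorial itself only uses the NCBI web pages).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_records=20):
    """Return a search URL for one Entrez database and one search term."""
    params = {"db": database, "term": term, "retmax": max_records}
    # urlencode escapes spaces and other special characters in the term.
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = build_esearch_url("protein", "Zea mays alpha amylase")
print(url)
```

The "protein" database and the search term mirror the choices made in the web form above.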



Figure 1. NCBI home page from which Entrez searches of many databases can be performed. You will
choose to search the Protein database.

Search results: Fig. 2 shows a typical results page for this search. Yours should look similar, but might be
a little different depending on what new information has arrived since the screen shot was made. The
sequence of interest has the accession number (identifier) AAA50161. It is highlighted in the screen
shot. How do you know this is the one you want? Click on the accession number and study the page that
comes up. It should be identical to the one shown in Fig. 3.


Figure 2. Typical search results page for protein sequences.


Figure 3. Typical record for an accession number.

In Figure 3, take note of the DEFINITION, SOURCE and ORGANISM, AUTHORS of the sequence, and the
TITLE and JOURNAL name of the article published about it. If you don't already have this article, you can
retrieve it simply by clicking on the PUBMED number (in the live window) and printing the PDF version.
Then find your way back to the results page.


Skip down through the FEATURES and note the ORIGIN section, which gives you the amino acid
sequence of your protein. This is the sequence we'll use in a BLAST search, but the default format is not
particularly helpful. All further processing of the sequence information requires that the sequence be in
FASTA format.

FASTA Format: Conversion of the sequence to a universal format
Scroll to the top of your results page and note the Display dropdown box with "GenPept" selected.
The GenPept format is the default setting and gives you all of the information we discussed above.
However, the FASTA format is more useful for BLAST searches and alignments of sequences.
Select FASTA from the menu as illustrated in Fig. 4.

Figure 4. Click FASTA to convert the sequence to proper format for further searching.

Your results should appear like the screen shot in Fig. 5. You now see less information:
just the accession number followed by a brief descriptor, and the amino acid
sequence preceded by some identifying information.


Figure 5. FASTA conversion results.
In the live window, highlight and copy the complete amino acid sequence along with the identifying
information (>gi|426482). From your Start menu bring up Notepad and paste the FASTA sequence into
the window. You will use this sequence in a BLAST search to identify other amino acid sequences in the
NCBI databases with similarity to your sequence. Note that many of the relevant analysis tools that can
use this sequence information are linked down the right side of the NCBI page. Once you are
comfortable using these tools, you can work more efficiently. Minimize Notepad to return to the NCBI
website.


Protein BLAST Introduction
To access the BLAST page, in your live window, click on the NCBI icon in the upper left of the page (this
takes you to the home page). Click on BLAST in the Popular Resources menu. The BLAST options are
summarized as a flow chart (Fig. 6); choose Protein BLAST. Table 1 explains what the BLAST options do.


Figure 6. Basic BLAST search options.

Table 1. Explanation of BLAST program functions for the rest of us.
BLAST PROGRAM                 Further details
nucleotide blast (blastn)     Compares a nucleotide query sequence against a nucleotide sequence database.
protein blast (blastp)        Compares an amino acid query sequence against a protein sequence database.
blastx                        Compares a nucleotide query sequence translated in all reading frames against a
                              protein sequence database. You could use this option to find potential
                              translation products of an unknown nucleotide sequence.
tblastn                       Compares a protein query sequence against a nucleotide sequence database
                              dynamically translated in all reading frames.
tblastx                       Compares the six-frame translations of a nucleotide query sequence against the
                              six-frame translations of a nucleotide sequence database. Please note that the
                              tblastx program cannot be used with the nr database on the BLAST Web page
                              because it is computationally intensive.
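
One way to read Table 1 is as a lookup from the type of query and the type of database to a program name. The sketch below encodes that reading; the function name choose_blast_program and the translate_both flag are hypothetical helpers for illustration, not part of any BLAST software.

```python
# (query type, database type) -> BLAST program, following Table 1.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick the BLAST program for a query/database pairing.

    translate_both=True asks for tblastx, which translates both the
    nucleotide query and the nucleotide database in six frames.
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type} vs {database_type}")


# The corn alpha-amylase protein sequence searched against a protein database:
print(choose_blast_program("protein", "protein"))
```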


BLASTP Search
Paste your copied FASTA sequence into the text box under "Enter Query Sequence" (Fig. 7). Make sure
the "Non-redundant protein sequences (nr)" database is selected in the Database dropdown menu
under "Choose Search Set". Click on BLAST. You may see a window indicating your query has been added
to the BLAST Queue.
You might have to wait for several seconds for your results, during which time you will see a screen like
that in Fig. 8. Be patient; remember that your sequence is being compared to thousands of others!


Figure 7. The BLAST search screen.



Figure 8. The initial screen showing BLAST search results.

BLASTP Results Part 1
Scroll down the blastp results page to the illustration with the red bars (Fig. 9). This is a diagrammatic
representation of how your query sequence (the top red bar) lines up with other related sequences in
the database based on the primary structure of the protein (from 0 to over 400 amino acids). This
diagram summarizes around 100 "hits", or other protein sequences, going from most to least similarity
to your corn alpha-amylase query sequence. Note that some of the sequences lack the amino terminus
of your corn alpha-amylase sequence.


Figure 9. BLAST summary of related sequences. The lines show relative alignment of the hit sequences with the
query sequence.


Once your results appear, scroll down past the red diagram and you will see a list of accession numbers
and descriptors for sequences in order of decreasing similarity to your sequence (Fig. 10). In fact, the
first item in the list is (or should be) your sequence (check the accession number to be sure). The two
scores at the right (Ident and E value) indicate the degree of similarity. Both are defined in the glossary
of terms in this tutorial. You can click on any of these sequences to go to the GenPept page that
describes it.


Figure 10. Descriptions of the 100 most related sequences to the query sequence.

For now, scroll down to the "Alignment" section of the results (Fig. 11) to see the actual amino acid
sequences aligned against yours. Note the amino acid identities to get a measure of how similar the
sequences are. The first should be 100% since it is the identical sequence. As you scroll down through
the next several sequences, though, the percent identity should get smaller.


Figure 11. Sequence alignment information for the most related protein sequences.
Oryza is the genus of rice.


Your immediate goal using BLASTP is to locate other complete sequences for both the plant and animal
alpha-amylases utilized in your experiments and others to include in your analysis. Scroll back up slowly
through the list of "hits". What species do you see? If it is not clear from the brief description, click on
the accession number to get the GenPept descriptions. In fact, what you will probably find are mostly
sequences from plants, some bacteria, and maybe a few insects. Click on the "Distance tree of results"
link (in Other reports options) in the top panel of the BLAST results (Fig. 12) to examine a phylogenetic
tree constructed from the organisms included in the BLAST results. Using the Tools options you can see
different representations of the tree.


Figure 12. Top panel of BLAST results showing location of the Distance tree of results link.

How many species should we include in the analysis?
To get the most from this analysis you may find that using more than just the three species we used in
lab would be very helpful in seeing larger patterns of similarity/dissimilarity when comparing plant and
animal amylases. We strongly recommend adding at least one more plant, if not more. Use equal
numbers of plant and animal amylases to balance the representation from each Kingdom.
If human and oyster (or other bivalve species [Class Bivalvia; Order Pelecypoda]) alpha-amylase are not
found in this list of BLAST hits, how else might you find those sequences to compare to corn? To
broaden your analysis a bit, you can also search for sequences for crop species like barley (Hordeum
vulgare) or rice (Oryza sp.). Design and carry out a strategy to find them, and once you do, copy the
FASTA-formatted sequences to the same Notepad file your other sequence is in. Make sure to leave one
blank line between the sequences (needed later for the submission to Clustal Omega).
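
The file layout described above (several FASTA records, one blank line between them) can be sketched in a few lines. The function name join_fasta_records and both records are made up for illustration; in the lab you assemble the file by hand in Notepad.

```python
def join_fasta_records(records):
    """Join FASTA records into one text block with one blank line between records."""
    cleaned = [record.strip() for record in records if record.strip()]
    for record in cleaned:
        if not record.startswith(">"):
            raise ValueError("each record must begin with a '>' definition line")
    # "\n\n" ends one record and leaves exactly one empty line before the next.
    return "\n\n".join(cleaned) + "\n"


# Hypothetical records, invented for this example.
records = [
    ">EXAMPLE_PLANT hypothetical plant amylase\nMKNTSS\nLCLLLL\n",
    ">EXAMPLE_ANIMAL hypothetical animal amylase\nMKFFLL\n",
]
combined = join_fasta_records(records)
print(combined)
```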

ACCESSION NUMBERS CHECK
To facilitate a broader comparison of alpha-amylase among plants and animals, you should now have
four (or more) accession numbers: one each for corn (Zea mays), humans (Homo sapiens), Pacific oyster
(Crassostrea gigas) and barley (Hordeum vulgare). There are now sequences for amylase from two other
clam genera in the databases (Cerastoderma and Corbicula) which could be used as alternatives to the
Pacific oyster. Likely sequences to include will have lengths similar to the human and corn sequences
and will be rich in A in the accession number.
Record those species and their accession numbers below and then check with a lab instructor or TA to
make sure that you have appropriate sequences before you proceed.


SPECIES                              ACCESSION NUMBER
corn (Zea mays)                      AAA50161
Other plant:
Other plant:
Other plant:

humans (Homo sapiens)
Pacific oyster (Crassostrea gigas)
Other animal:
Other animal:

Clustal Omega: A DNA and Protein Multiple Sequence Alignment Tool
URL: http://www.ebi.ac.uk/Tools/msa/clustalo/

Introduction
Once you have found at least two usable sequences for both plant and animal amylases, you will want to
align them to see how similar they are. We will use the program Clustal Omega to do such an alignment.
Be sure to read the information below that describes Clustal Omega and the underlying basis for
sequence comparisons. When you are finished, enter the URL shown above to bring up the site that
hosts the Clustal Omega program.

Clustal Omega is a general-purpose global multiple sequence alignment program for DNA or proteins, for
use when you want to align 3 or more sequences (for aligning 2 sequences use the pairwise sequence
alignment tool: http://www.ebi.ac.uk/Tools/psa/). Clustal Omega produces biologically meaningful
multiple sequence alignments of divergent sequences. It calculates the best match for the selected
sequences, and lines them up so that the identities, similarities, and differences can be seen.
Evolutionary relationships can be seen by viewing Cladograms or Phylograms. Alignment scores are
returned as a Percent Identity Matrix. The Percent Identity value for a given pairwise comparison will be
the data you want to obtain from this analysis.
Multiple alignments of protein sequences are important tools in studying sequences and understanding
evolutionary relationships. The basic information they provide is identification of conserved sequence
regions. This is very useful in designing experiments to test and modify the function of specific proteins,
in predicting the function and structure of proteins, and in identifying new members of protein families.
Sequences can be aligned across their entire length (global alignment) or only in certain regions (local
alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps
(representing insertions/deletions) while local alignments can avoid them, aligning regions between
gaps. The alignment is progressive and considers the sequence redundancy. Phylogenetic trees can also
be calculated from multiple alignments. The program has some adjustable parameters with reasonable
defaults.


Submission Form
You will use the default settings for all menus that appear at the top of the Submission Form (Fig. 13), so
don't change these.
Copy all of your sequences, in FASTA format including their first descriptor line, into the open frame on
the Submission Form; make sure to leave one blank line between them (Fig. 13). Clustal Omega will
attempt to align these amino acid sequences based on their similarities. Click Submit. Your results might
take a few seconds or perhaps a few minutes. BE PATIENT.


Figure 13. The Clustal Omega submission form.


Alignment Results
The first screen you'll see shows the alignments of your sequences (Fig. 14a). It will be helpful to click on
Show Colors to more easily see locations of similarity and difference among the sequences based on the
chemical nature of the amino acid residues.
RED (residues AVFPMILW) = Small (small + hydrophobic, including aromatic residues except Y)
BLUE (residues DE) = Acidic
MAGENTA (residues RK) = Basic (excluding H)
GREEN (residues STYHCNGQ) = Hydroxyl + sulfhydryl + amine + G
GREY (other residues) = Unusual amino/imino acids etc.
The displayed rows (except the last one, with the consensus symbols *, :, .) are the aligned amino acid
sequences; the last one is an indication of consensus, or which amino acids are conserved across the
compared sequences. By default, an alignment will display the following consensus symbols denoting
the degree of conservation observed in each column.
Conserved means the amino acid is replaced by one having similar chemical properties.

Consensus Symbols:

"*" means that the residues, or nucleotides, in that column are identical in all sequences in the
alignment.

":" means that conserved substitutions have been observed: amino acids having strongly similar
properties.

"." means that semi-conserved substitutions are observed, i.e., amino acids having similar shape, but
otherwise weakly similar properties.
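
To make the "*" symbol concrete, the sketch below marks the columns of a small alignment in which every sequence has the same residue. It deliberately leaves out ":" and ".", because those depend on Clustal's own residue similarity groups, which this tutorial does not list. The function name identity_line and the three short sequences are invented for illustration.

```python
def identity_line(aligned_sequences):
    """Return a string with '*' under each column whose residues are all identical."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned_sequences):
        # A column is marked only if it holds one residue type and no gap.
        if len(set(column)) == 1 and column[0] != "-":
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Made-up aligned sequences; '-' is a gap.
alignment = ["MKT-LLA", "MKSALLA", "MRT-LLG"]
for row in alignment:
    print(row)
print(identity_line(alignment))
```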

Click on the Results Summary button at the top of the page. A table is returned (Fig. 14b) that allows you to
select multiple summaries of information about the analysis. The one you'll want is the last one, the
Percent Identity Matrix (PIM); this returns the alignment scores for the pairwise comparisons of the
sequences you submitted. The matrix (Fig. 14c) lists the sequences by accession number by row and
column (we added the red labels). The score at the intersection of a row and column is the alignment
score for that pair. To help you understand the alignment score, review the description below from the
Clustal Omega site FAQs. Copy/paste the PIM into your Notepad file.

How are pairwise alignment scores calculated?
A pairwise score is calculated for every pair of sequences that are to be aligned. Pairwise scores are
calculated as the number of identities (same amino acid residue in the best alignment) divided by the
number of residues compared (gap positions are excluded). Thus, they tell us approximately what
percentage of the two sequences have functional identity, or similarity.
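
The calculation described above can be sketched directly: count identical positions, skip any position where either sequence has a gap, and divide. This is an illustration of the stated definition, not Clustal Omega's own code, which may differ in detail. The function name percent_identity and the two short sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two aligned sequences, ignoring gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = 0
    compared = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        # Positions with a gap in either sequence are excluded from the count.
        if residue_a == gap or residue_b == gap:
            continue
        compared += 1
        if residue_a == residue_b:
            identities += 1
    if compared == 0:
        return 0.0
    return 100.0 * identities / compared


# Made-up aligned pair: 6 positions compared, 5 of them identical.
score = percent_identity("MKT-LLAG", "MRTALL-G")
print(round(score, 2))
```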



Figure 14. (A) A portion of a multiple sequence alignment. The number at the end of the row
indicates the amino acid number in the last position of that row relative to the entire molecule.
(B) Results Summary options. (C) Matrix of alignment scores.


Be sure to copy the alignments output and matrix scores results to your Notepad file. Look through
the entire sequence for areas of similarity.

- How much is there? Can you guess why clam/oyster and human sequences did not appear in the
  BLAST search with corn alpha-amylase?
- Compare each pair of sequences to see which ones are most similar. You might need to rerun
  Clustal Omega with the different pairs to most efficiently determine this.
- Are there any areas of the sequence that you expect to be more similar between species than
  others (i.e., the active site)? The parts of the sequence with the most identity are likely parts of
  the active site. How are they distributed along the sequence? How can you explain their
  distribution?
- If you don't know where the important functional domains are, you should run a search of the
  literature in PubMed to find out. Simply click on the NCBI icon on the active web page and
  choose PubMed.

Protein Structures: Conserved Domain Database (CDD)
Since you found that there are few similarities in the amino acid sequences for alpha-amylase in the
three organisms, how do we account for them being functionally similar? We need to take one more
step and examine the three-dimensional structure of the enzymes. You can use tools on the NCBI
website for this as well.

1. Open the NCBI main page (Fig. 14). Click on Domains and Structure on the left-hand menu bar,
   and then select Conserved Domain Database (CDD) under the resource tab.


Figure 14. NCBI website home page.
2. On the CDD database page, click on "CD-Search" (Fig. 15).



Figure 15. Conserved Domain Database entry page.

3. Type (or paste) the accession number for human salivary alpha-amylase into the big center
   search window (Fig. 16). Use the default settings as presented. Click on the SUBMIT button.


Figure 16. Conserved domain query submission page.

4. The results window should confirm that this sequence is for alpha-amylase. Click on SEARCH
   FOR SIMILAR DOMAIN ARCHITECTURE (Fig. 17).



Figure 17. Results page from CDD query. Note that the graphic identifies the active, catalytic, and
calcium-binding site regions. Select the pfam00128 accession number to continue.

5. In the window displaying the results, click on the pfam00128 group (Fig. 17), then in the next
   dialog (Fig. 18) click on the "[+] Structure" menu, which is collapsed by default. Click on
   Structure View (Fig. 18a). If you are using your own computer, click on Download Cn3D to
   install the viewing program and follow your platform's usual instructions for
   program installation. On Bates laptops, the program should open the structure file
   automatically.

NOTE: Macs may not be able to run the Cn3D software needed to view the structures. We will have
laptops available for you to use if needed.


Figure 18. Accessing the Cn3D display program (panels A and B).
6. The Cn3D application will open, enabling you to see the structure of your protein (Fig. 19). You
   can rotate the 3D structure by dragging it with your mouse. The catalytic active region is shown
   in red.
   The alpha-amylase molecule displayed is a consensus structure representing ALL alpha-amylases
   across taxa, not one from a particular species. In Fig. 19 the molecule is positioned to show the
   catalytic site in the upper right; it appears as a V-shaped groove on one side of the molecule.
   In the bottom of the V, a starch molecule is shown as it would be oriented in the active site.

Figure 19. 3D rendering of the human salivary amylase molecule.

7. The color key of the image in Fig. 19 matches the amino acid sequence information (Fig. 20) in
   the window that appears below the 3D representation of your protein. The first row is the
   query sequence. If you select a portion of the sequence by dragging the mouse, it will be
   highlighted in yellow on the model. The same works for individual residues.



Figure 20. Amino acid sequences of pfam00128 amylases. The first row is the query sequence.

8. Change the display format of Cn3D by selecting Style > Rendering Shortcuts > Worms (Fig. 21). Now
   you should be able to rotate the structure to clearly see the α/β barrel site in the center of the
   molecule. If need be, rotate the molecule to see the hole through the molecule formed by the barrel.

Figure 21. Commands to change the rendering style of the 3D model.

Protein Structures: Comparisons
Now that you know what the catalytic site looks like, you can search for the 3D structure of the specific
enzymes used in this study and see how they compare. We'll start with the human salivary amylase.
Unfortunately, there are no structure models for either corn or clams in the database, but there is one
for barley, another grain. Before viewing the structure of the barley enzyme, look at your Clustal Omega
PIM results and compare the barley and corn sequences to determine if this substitute is valid.
1. Close the CDD windows and return to the main NCBI website by clicking the NCBI logo in the
   upper left corner.
2. Click on STRUCTURE in the dropdown menu by the search window at the top of the page.
3. At the Structure Search Entrez, enter 1SMD (= human salivary amylase) and click Search.
4. Rotate the model of the enzyme. Can you see the characteristic catalytic site? To access the
   full function of the viewer, click on the expand window icon in the lower left of the image
   window.
5. This site does not show the catalytic site in red, but you can select a section of the sequence in
   the right-hand window, and it will also be highlighted on the model. You can use this attribute
   when comparing two amylases to help achieve the same orientation.
6. Now, open a new browser window by right-clicking on the browser icon on your task bar or
   desktop and selecting the browser name. Enter the NCBI website URL. Select STRUCTURE,
   enter 1RPK (= barley alpha-amylase) in the search window and click Search.


7. As before, expand the display functions. Reduce the browser window size and do the same with
   the window for the human amylase so that both enzyme structures can be viewed side by side.
   Rotate the model of the barley enzyme. Can you see the characteristic catalytic site?
8. As best you can, rotate the two images to orient them the same and compare them. A rendering
   style of cylinder and plate will display the pleated sheets and helices of the secondary structure.
9. Once done, do not close the salivary amylase window. Close the second browser window,
   however. Proceed to the next procedure.


Figure 22. Human salivary amylase structure record for 1SMD.


Comparing 3D Structures with VAST (now this IS cool!!)
While Cn3D does fine with single structures, it's even better suited to
displaying structure alignments of multiple proteins, i.e., it enables you to
superimpose 3D structures on top of each other such that differences in
structure are readily apparent. NCBI creates and maintains a database of such
alignments, called VAST (Vector Alignment Search Tool), for all pairs of
proteins from MMDB whose structures have some similar core regions. The
VAST tool does two things for each related pair: it calculates an optimal 3D
superimposition for the conserved core, and constructs a sequence alignment
based on the correlation of the 3D structures.

1. If you have the human salivary amylase (1SMD) still open, return to the Structure Summary page
   and then go to step 3. If not, from the NCBI home page, choose the Structure database.
2. Search for 1SMD.
3. When you select 1SMD, you should get the Structure Summary page (Fig. 23).
4. To compare this structure with other molecules, click the VAST+ button on the right. You now
   have a list of similar structures. Find the structure for barley alpha-amylase (1AMY) by entering
   1AMY for the PDB ID and clicking the Search within Results button (Fig. 24).


Figure 23. Structure Summary page.



Figure 24. VAST+ record page.

5. Expand the entry by clicking on the + to the left of 1AMY (Fig. 25). You should see ball-and-stick
   diagrams of both the human and the barley amylases in the window. Now click on the Visualize
   3D structure superposition with Cn3D button to display the aligned 3D structures. Select the Cn3D
   app to view it.


Figure 25. VAST+ page showing structures to be aligned.


6. The default coloring for structure alignments in Cn3D uses magenta and blue for the regions
   aligned by the VAST algorithm, where residues aligned in 3D space are magenta, and different
   residues are blue; unaligned regions are colored gray. Note that because of the way VAST works,
   the aligned regions tend to correspond to individual or groups of consecutive secondary
   structure elements (helices and strands), while the loops outside the core vary in length and
   orientation and are often left unaligned.
7. There are some important differences between structure-based alignments in Cn3D and
   sequence alignments from common algorithms like BLAST or Clustal Omega, both in the display
   and the underlying alignment data. In a structure alignment (e.g., from VAST), one residue is
   aligned with another because their alpha carbons are nearby in space, not because of the
   residue identity.
8. Try aligning a molecule that is very similar to human alpha-amylase: porcine alpha-amylase.
   Search for the PDB ID = 1PIF instead of the barley.
9. Alteromonas haloplanctis, the cold-adapted marine organism that Feller et al. wrote about, is in
   the VAST results too; search for PDB ID = 1AQH.
CITING REFERENCES RELATED TO BIOINFORMATICS INFORMATION IN YOUR PAPER
When using Clustal Omega to compare sequences, your citations are in two parts:

In your Methods, when you say that you compared protein sequences using Clustal Omega, cite the
creators of the program just as you would cite them in any other section of your paper. The citations
(both are required by EMBL) in your Literature Cited list are then:
Sievers, F, Wilm, A, Dineen, DG, Gibson, TJ, Karplus, K, Li, W, Lopez, R, McWilliam, H, Remmert,
    M, Söding, J, Thompson, JD, and D Higgins. 2011. Fast, scalable generation of high-quality
    protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology
    7 Article number: 539
Goujon, M, McWilliam, H, Li, W, Valentin, F, Squizzato, S, Paern, J, and R Lopez. 2010. A new
    bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research 38 Suppl:
    W695-9
Also, in your Methods, for each sequence you used, give the accession number of the sequence and
cite the paper/researchers who submitted that sequence to NCBI. For instance, for the corn
sequence the accession number is AAA50161 and the citation in your text is (Young et al.
1994). The citation in your Literature Cited section would be:

Young, T.E., DeMason, D.A., Close, T.J. 1994. Cloning of an alpha-amylase cDNA from aleurone tissue
    of germinating maize seed. Plant Physiol. 105(2), 759-760.


Glossary
Alignment
    The process of lining up two or more sequences to achieve maximal levels of identity (and
    conservation, in the case of amino acid sequences) for the purpose of assessing the degree of
    similarity and the possibility of homology.
Algorithm
    A fixed procedure embodied in a computer program.
Bioinformatics
    The merger of biotechnology and information technology with the goal of revealing new insights
    and principles in biology.
Bit score
    The value S' is derived from the raw alignment score S in which the statistical properties of the
    scoring system used have been taken into account. Because bit scores have been normalized
    with respect to the scoring system, they can be used to compare alignment scores from
    different searches.
BLAST
    Basic Local Alignment Search Tool (Altschul et al.). A sequence comparison algorithm optimized
    for speed used to search sequence databases for optimal local alignments to a query. The initial
    search is done for a word of length "W" that scores at least "T" when compared to the query
    using a substitution matrix. Word hits are then extended in either direction in an attempt to
    generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates
    the speed and sensitivity of the search. For additional details, see one of the BLAST tutorials
    (Query or BLAST) or the narrative guide to BLAST.
BLOSUM
    Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived
    from observations of the frequencies of substitutions in blocks of local alignments in related
    proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix,
    for example, the alignment from which scores were derived was created using sequences
    sharing no more than 62% identity. Sequences more identical than 62% are represented by a
    single sequence in the alignment so as to avoid over-weighting closely related family members.
    (Henikoff and Henikoff)
Conservation
    Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve
    the physicochemical properties of the original residue.
Domain
    A discrete portion of a protein assumed to fold independently of the rest of the protein and
    possessing its own function.


DUST
Aprogramforfilteringlowcomplexityregionsfromnucleicacidsequences.
Evalue
Expectationvalue.Thenumberofdifferentalignmentswithscoresequivalenttoorbetterthan
Sthatareexpectedtooccurinadatabasesearchbychance.ThelowertheEvalue,themore
significantthescore.
FASTA
Thefirstwidelyusedalgorithmfordatabasesimilaritysearching.Theprogramlooksforoptimal
localalignmentsbyscanningthesequenceforsmallmatchescalled"words".Initially,thescores
ofsegmentsinwhichtherearemultiplewordhitsarecalculated("init1").Laterthescoresof
severalsegmentsmaybesummedtogeneratean"initn"score.Anoptimizedalignmentthat
includesgapsisshownintheoutputas"opt".Thesensitivityandspeedofthesearchare
inverselyrelatedandcontrolledbythe"ktup"variablewhichspecifiesthesizeofa"word".
(PearsonandLipman)
Filtering
AlsoknownasMasking.Theprocessofhidingregionsof(nucleicacidoraminoacid)sequence
havingcharacteristicsthatfrequentlyleadtospurioushighscores.SeeSEGandDUST.
Gap
Aspaceintroducedintoanalignmenttocompensateforinsertionsanddeletionsinone
sequencerelativetoanother.Topreventtheaccumulationoftoomanygapsinanalignment,
introductionofagapcausesthedeductionofafixedamount(thegapscore)fromthealignment
score.Extensionofthegaptoencompassadditionalnucleotidesoraminoacidisalsopenalized
inthescoringofanalignment.
GlobalAlignment
Thealignmentoftwonucleicacidorproteinsequencesovertheirentirelength.
H
H is the relative entropy of the target and background residue frequencies. (Karlin and Altschul,
1990). H can be thought of as a measure of the average information (in bits) available per
position that distinguishes an alignment from chance. At high values of H, short alignments can
be distinguished from chance, whereas at lower H values, a longer alignment may be necessary.
(Altschul, 1991)
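As a worked illustration, relative entropy can be written H = sum over residue pairs (i, j) of q_ij * log2(q_ij / (p_i * p_j)), where q_ij are the target pair frequencies and p_i the background frequencies. The sketch below uses a made-up two-letter alphabet ("X" and "Y") purely to keep the numbers small; real scoring systems use 4 nucleotides or 20 amino acids.

```python
import math


def relative_entropy(target, background):
    """H in bits per aligned position.

    target: dict mapping (a, b) residue pairs to target frequencies q_ab.
    background: dict mapping each residue to its background frequency p_a.
    """
    return sum(
        q * math.log2(q / (background[a] * background[b]))
        for (a, b), q in target.items()
        if q > 0
    )


background = {"X": 0.5, "Y": 0.5}
# Target frequencies identical to chance: no information per position.
chance_like = {("X", "X"): 0.25, ("X", "Y"): 0.25, ("Y", "X"): 0.25, ("Y", "Y"): 0.25}
# Target frequencies that allow only identical pairs: one bit per position.
identity_only = {("X", "X"): 0.5, ("Y", "Y"): 0.5}
print(relative_entropy(chance_like, background))
print(relative_entropy(identity_only, background))
```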
Homology
Similarity attributed to descent from a common ancestor.
HSP
High scoring segment pair. Local alignments with no gaps that achieve one of the top alignment
scores in a given search.
Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
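Identity is usually reported as a percentage. The sketch below counts identical residues over the aligned columns in which neither sequence has a gap. That choice of denominator is an assumption of this example; tools differ on whether gapped columns are included.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped aligned columns where both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


print(percent_identity("ACGT", "ACGA"))
print(percent_identity("AC-GT", "ACTGA"))
```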
K
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale
for search space size. The value K is used in converting a raw score (S) to a bit score (S').
Lambda
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale
for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').
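The conversion that K and lambda take part in can be sketched with the Karlin-Altschul formula S' = (lambda * S - ln K) / ln 2. The parameter values used in the example call are illustrative assumptions only; the actual values depend on the scoring matrix and gap costs and are reported at the bottom of BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S to a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values (assumed for this example).
print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because bit scores are normalized in this way, they can be compared between searches that used different scoring systems.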
Local Alignment
The alignment of some portion of two nucleic acid or protein sequences.
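The classic way to score the best-matching portion of two sequences is the Smith-Waterman method. The sketch below returns only the best local score; the scoring values are illustrative, and the gap penalty is linear for simplicity. Note that BLAST itself uses faster heuristics rather than this exhaustive calculation.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

Here only the shared "ACG" contributes to the score; the unrelated flanking residues are ignored instead of being penalized.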
Low Complexity Region (LCR)
Regions of biased composition including homopolymeric runs, short period repeats, and more
subtle overrepresentation of one or a few residues. The SEG program is used to mask or filter
LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid
queries.
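To show what "biased composition" means in practice, the sketch below masks windows whose Shannon entropy falls below a threshold. This is not the SEG or DUST algorithm, only a simplified stand-in; the window size, threshold, and mask character are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a sequence window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


print(mask_low_complexity("MKTAYIAKQRQQQQQQQQQQQQQQQLGHEVRD"))
```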
Masking
Also known as Filtering. The removal of repeated or low complexity regions from a sequence in
order to improve the sensitivity of sequence similarity searches performed with that sequence.
Motif
A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of
domains.
Multiple Sequence Alignment
An alignment of three or more sequences with gaps inserted in the sequences such that
residues with common structural positions and/or ancestral residues are aligned in the same
column. ClustalW is one of the most widely used multiple sequence alignment programs.
Optimal Alignment
An alignment of two sequences with the highest possible score.
Orthologous
Homologous sequences in different species that arose from a common ancestral gene during
speciation; may or may not be responsible for a similar function.
P value
The probability of an alignment occurring with the score in question or better. The P value is
calculated by relating the observed alignment score, S, to the expected distribution of HSP
scores from comparisons of random sequences of the same length and composition as the
query to the database. The most highly significant P values will be those close to 0. P values and
E values are different ways of representing the significance of the alignment.
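The link between the two measures can be sketched with the relation P = 1 - e^(-E), which follows from treating the number of chance hits as Poisson distributed with mean E. The function name below is made up for illustration.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance alignment, given an expected count E."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)


for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small values the two are nearly identical, which is why an E value below about 0.01 can be read almost directly as a probability.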
PAM = Percent Accepted Mutation
A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein
sequence. One PAM unit is the amount of evolution that will change, on average, 1% of amino
acids in a protein sequence. A PAM(x) substitution matrix is a lookup table in which scores for
each amino acid substitution have been calculated based on the frequency of that substitution
in closely related proteins that have experienced a certain amount (x) of evolutionary
divergence.
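In Dayhoff's approach, the substitution probabilities for x PAM units are obtained by multiplying the 1 PAM probability matrix by itself x times. The sketch below shows that idea on a made-up two-letter alphabet with a 1% change per PAM unit; real PAM matrices are 20 x 20, are estimated from observed substitutions, and are converted to log-odds scores before use.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, x):
    """Raise a square matrix to the non-negative integer power x."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, m)
    return result


# Toy 1 PAM matrix: each letter has a 1% chance of changing to the other.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print([round(p, 4) for p in pam250[0]])
```

The diagonal of the result stays well above zero even at 250 PAM units, because positions can change more than once and can change back.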

Paralogous
Homologous sequences within a single species that arose by gene duplication.
Profile
A table that lists the frequencies of each amino acid in each position of a protein sequence.
Frequencies are calculated from multiple alignments of sequences containing a domain of
interest. See also PSSM.
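A profile of this kind can be built by counting residues column by column in a multiple alignment. The sketch below is a minimal version that skips gap characters when computing frequencies; the function name and the tiny example alignment are made up for illustration.

```python
from collections import Counter


def build_profile(aligned_seqs):
    """Return one dict per alignment column mapping residue -> frequency."""
    length = len(aligned_seqs[0])
    profile = []
    for col in range(length):
        residues = [s[col] for s in aligned_seqs if s[col] != "-"]
        total = len(residues)
        counts = Counter(residues)
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile


alignment = ["ACD", "ACE", "AGE", "A-E"]
for position, column in enumerate(build_profile(alignment), start=1):
    print(position, column)
```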
Proteomics
The systematic analysis of protein expression in normal and diseased tissues that involves the
separation, identification, and characterization of all of the proteins in an organism.
PSI-BLAST = Position-Specific Iterative BLAST
An iterative search using the BLAST algorithm. A profile is built after the initial search, which is
then used in subsequent searches. The process may be repeated, if desired, with new sequences
found in each cycle used to refine the profile. Details can be found in this discussion of
PSI-BLAST. (Altschul et al.)
PSSM = Position-specific scoring matrix
The PSSM gives the log-odds score for finding a particular matching amino acid in a target
sequence.
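A log-odds score compares how often a residue is observed at a position with how often it would be expected by chance. The sketch below scores a single alignment column as log2(observed frequency / background frequency), adding a pseudocount so that unobserved residues do not give minus infinity. The four-letter alphabet, uniform background, and pseudocount of 1 are simplifying assumptions; PSI-BLAST uses a more elaborate weighting scheme.

```python
import math
from collections import Counter


def pssm_column(residues, background, pseudocount=1.0):
    """Log-odds score (in bits) for each residue at one alignment column."""
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(background)
    return {
        aa: math.log2(((counts[aa] + pseudocount) / total) / p)
        for aa, p in background.items()
    }


# Toy background: four equally likely residues.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
column_scores = pssm_column("AAAA", background)
print({aa: round(score, 2) for aa, score in column_scores.items()})
```

Residues seen more often than chance receive positive scores, and residues seen less often receive negative scores.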
Query
The input sequence (or other type of search term) with which all of the entries in a database are
to be compared.
VAST
Vector Alignment Search Tool. A tool that enables superimposition of multiple 3D structures.
The VAST tool does two things for each related pair: it calculates an optimal 3D superimposition
for the conserved core, and constructs a sequence alignment based on the correlation of the 3D
structures.