Bioinformatics Tutorial 2016

Bio 242 | Cellular and Molecular Biology
BIOINFORMATICS TUTORIAL
Bio 242 Amylase Lab Sequence

Sequence Searches: BLAST
Sequence Alignment: Clustal Omega
3D Structure & 3D Alignments

DO NOT REMOVE FROM LAB.
DO NOT WRITE IN THIS DOCUMENT.
A PDF of this document is available on the Bio 242 website.
Acknowledgements
The Bates Bioinformatics Tutorial was originally developed as part of the Collaborative Technologies
Development project. David Asanuma ('09) created the site under the guidance of Nancy Kleckner,
Associate Professor of Biology, and Michael Hanrahan, Assistant Director of Research and Curricular
Computing. Revision of the content is performed annually by Greg Anderson and Carolyn Lawson to
keep the document up to date with the website.
Bioinformatics Tutorial (rev. 10/2016)
Bioinformatics Tutorial
Bioinformatics is the acquisition, storage, arrangement, identification, analysis, and communication of
information related to biology. The term was coined in 1990 with the use of computers in DNA
sequence analysis. Think of it as the theoretical branch of molecular biology, like the relationship of
theoretical physics to the general field of physics.
Now that you have obtained information about some of the chemical properties of amylase, in this
exercise you will compare the molecular structure of the enzyme among the three (or more!)
species. The tutorial will guide you through finding the gene sequences using both the Entrez search
and BLAST tools, and then comparing them using the Clustal Omega tool.
You will be using the online DNA and protein sequence databases that are the core of
bioinformatics. There are two general types of sequence databases. Primary databases contain
experimental results in an accessible format, but their sequences do not represent a population
consensus. DDBJ, EMBL, and GenBank are primary databases. Secondary databases are curated to
reflect consensus sequences from multiple experiments and usually use the primary databases as their
sources.
Abbreviations
DDBJ    DNA Databank of Japan
EMBL    European Molecular Biology Laboratory
NCBI    National Center for Biotechnology Information
BLAST   Basic Local Alignment Search Tool
The standard sequence format is called FASTA. Every FASTA sequence starts with a definition line,
which begins with ">" and gives the accession number and a brief description of the sequence. The
full database record (GenBank or GenPept format) for the same sequence reports:
- a unique identification number (the accession number)
- the version number of the sequence
- the length of the sequence
- molecule type (DNA or mRNA)
- taxonomic division (for instance, INV = invertebrate)
- last release date
- source organism
Every coding sequence also has a unique protein number assigned to it, starting with AA.
Reference sequences (which undergo continuing curation) are the most complete and up to date and
always start with NT_ for DNA, NM_ for mRNA, or NP_ for protein. Hint: these are the ones you want to
use if possible.
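The prefix rules above can be sketched as a small lookup. This is only an illustration of the three reference-sequence prefixes and the "AA" protein prefix named in this tutorial; the function name is made up, "NP_123456" is a made-up accession, and real NCBI accession rules cover many more prefixes than this.

```python
# Prefixes taken from the tutorial text; NCBI uses many more than these.
REFSEQ_PREFIXES = {
    "NT_": "reference DNA",
    "NM_": "reference mRNA",
    "NP_": "reference protein",
}


def describe_accession(accession):
    """Return a rough label for an accession number based on its prefix."""
    acc = accession.strip().upper()
    for prefix, label in REFSEQ_PREFIXES.items():
        if acc.startswith(prefix):
            return label
    if acc.startswith("AA"):
        return "protein number assigned to a coding sequence"
    return "other or unknown"


# "NP_123456" is a hypothetical accession used only to exercise the function.
print(describe_accession("NP_123456"))
print(describe_accession("AAA50161"))
```

A check like this is handy when you have a long list of hits and want to pick out the curated reference sequences first.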
Sequence Search Introduction
Entrez
Entrez is a data retrieval system developed by the National Center for Biotechnology Information (NCBI)
that provides integrated access to a wide range of data domains, including literature, nucleotide and
protein sequences, complete genomes, three-dimensional structures, and more. Entrez includes
powerful search features that retrieve not only the exact search results but also related records within a
data domain that might not be retrieved otherwise and associated records across data domains. These
features enable us to gather previously disparate pieces of an information puzzle for a topic of interest.
Effective and powerful use of Entrez requires an understanding of the available data domains, the
variety of data sources and types within each domain, and Entrez's advanced search features. This
tutorial uses corn (Zea mays) alpha-amylase to demonstrate the wide variety of information that we can
rapidly gather for a single gene. The numbers noted in the search results will of course change over time
as the databases grow. The same techniques shown here can be used for any topic of interest.
The search goals are to:
- Identify representative, well-annotated protein amino acid sequence records for several plant
  and animal amylases, using Entrez search and BLAST, to compare using the Clustal Omega multiple
  sequence alignment tool;
- Retrieve associated literature/citations for each accession record (species aa sequence) used;
- Identify conserved domains within the protein;
- Find a resolved three-dimensional structure for the enzymes you used, or, in their absence,
  identify structures with homologous sequence;
- Perform VAST alignments of 3D structures of plant and animal amylases to visualize where
  similarities and differences occur.

Let's get started!
Go to the NCBI website http://www.ncbi.nlm.nih.gov by entering the URL in the address field of your
browser.
After accessing the NCBI website, you may now search for corn alpha-amylase sequences in either the
nucleotide or protein databases by selecting one or the other from the Database drop-down menu.
Other points of interest on the NCBI Home Page are the PubMed link, which allows you to search for
journal articles on the structure and function of alpha-amylases, and the BLAST link, which allows you to
search for nucleotide or protein sequences with similarity to your sequence of interest.
For now, make sure you are at the NCBI home page (click on the NCBI icon in the upper left of the NCBI
page to be sure), and choose "Protein" from the search drop-down databases menu. Type "Zea mays
alpha-amylase" in the line below. These selections are illustrated in Figure 1 (next page).
Click "Search" to proceed.
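The same search can also be expressed as a URL for NCBI's E-utilities web service, which is one way to script what the web form does. The sketch below only builds the URL and does not contact NCBI; it assumes the `esearch.fcgi` endpoint with its `db`, `term`, and `retmax` parameters, and the function name and the `retmax` value of 20 are arbitrary choices for illustration. This tutorial itself uses only the web pages.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service (assumed endpoint, not part of this tutorial).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_search_url(term, max_hits=20):
    """Build an Entrez search URL for the Protein database without sending it."""
    params = {"db": "protein", "term": term, "retmax": max_hits}
    return ESEARCH_BASE + "?" + urlencode(params)


url = build_protein_search_url("Zea mays alpha-amylase")
print(url)
```

Pasting the printed URL into a browser should return an XML list of matching record identifiers rather than the formatted results page.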

Figure 1. NCBI home page from which Entrez searches of many databases can be performed. You will
choose to search the Protein database.
Search results: Fig. 2 shows a typical results page for this search. Yours should look similar, but might be
a little different depending on what new information has arrived since the screen shot was made. The
sequence of interest has the accession number (identifier) AAA50161. It is highlighted in the screen
shot. How do you know this is the one you want? Click on the accession number and study the page that
comes up. It should be identical to the one shown in Fig. 3.

Figure 2. Typical search results page for protein sequences.
Figure 3. Typical record for an accession number.
In Figure 3, take note of the DEFINITION, SOURCE and ORGANISM, AUTHORS of the sequence, and the
TITLE and JOURNAL name of the article published about it. If you don't already have this article, you can
retrieve it simply by clicking on the PUBMED number (in the live window) and printing the PDF version.
Then find your way back to the results page.
Skip down through the FEATURES and note the ORIGIN section, which gives you the amino acid
sequence of your protein. This is the sequence we'll use in a BLAST search, but the default format is not
particularly helpful. All further processing of the sequence information requires that the sequence be in
FASTA format.
FASTA Format: Conversion of the sequence to a universal format
Scroll to the top of your results page and note the Display drop-down box with "GenPept" selected.
The GenPept format is the default setting and gives you all of the information we discussed above.
However, the FASTA format is more useful for BLAST searches and alignments of sequences.
Select FASTA from the menu as illustrated in Fig. 4.
Your results should appear like the screen shot in Fig. 5. You now see less information: just the
accession number followed by a brief descriptor, and the amino acid sequence preceded by some
identifying information.

Figure 4. Click FASTA to convert the sequence to proper format for further searching.

Figure 5. FASTA conversion results.
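For readers who prefer to script this step, the FASTA version of a record can also be requested from NCBI's E-utilities service. The sketch below only builds the request URL; it assumes the `efetch.fcgi` endpoint with the `db`, `id`, `rettype`, and `retmode` parameters, and the function name is made up. The tutorial's own workflow is the menu selection described above.

```python
from urllib.parse import urlencode

# Base address of the E-utilities fetch service (assumed endpoint, not part of this tutorial).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_fetch_url(accession):
    """Build a URL that asks for one protein record as plain-text FASTA."""
    params = {
        "db": "protein",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


fetch_url = build_fasta_fetch_url("AAA50161")
print(fetch_url)
```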
In the live window, highlight and copy the complete amino acid sequence along with the identifying
information (>gi|426482). From your Start menu bring up Notepad and paste the FASTA sequence into
the window. You will use this sequence in a BLAST search to identify other amino acid sequences in the
NCBI databases with similarity to your sequence. Note that many of the relevant analysis tools that can
use this sequence information are linked down the right side of the NCBI page. Once you are
comfortable using these tools, you can work more efficiently. Minimize Notepad to return to the NCBI
website.
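To make the structure of the text you just pasted concrete, here is a minimal sketch of a FASTA reader. It treats every line starting with ">" as a definition line and joins the lines that follow into one sequence. The two records in the example are made-up peptides, not database entries, and the function name is only illustrative.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example records for illustration only.
example = """>demo1 made-up peptide, not a real record
MKNTSSLC
LLLLVV

>demo2 second made-up peptide
MKFFLL
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

If the definition line is left out when you copy a sequence, a reader like this has nothing to attach the residues to, which is why the tutorial asks you to copy the identifying information as well.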
Protein BLAST Introduction
To access the BLAST page, in your live window, click on the NCBI icon in the upper left of the page (this
takes you to the home page). Click on BLAST in the Popular Resources menu. The BLAST options are
summarized as a flowchart (Fig. 6); choose Protein BLAST. Table 1 explains what the BLAST options do.

Figure 6. Basic BLAST search options.
Table 1. Explanation of BLAST program functions for the rest of us.
BLAST PROGRAM                  Further details
nucleotide blast (or blastn)   Compares a nucleotide query sequence against a nucleotide sequence database.
protein blast (or blastp)      Compares an amino acid query sequence against a protein sequence database.
blastx                         Compares a nucleotide query sequence translated in all reading frames against a
                               protein sequence database. You could use this option to find potential
                               translation products of an unknown nucleotide sequence.
tblastn                        Compares a protein query sequence against a nucleotide sequence database
                               dynamically translated in all reading frames.
tblastx                        Compares the six-frame translations of a nucleotide query sequence against the
                               six-frame translations of a nucleotide sequence database. Please note that the
                               tblastx program cannot be used with the nr database on the BLAST Web page
                               because it is computationally intensive.
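Table 1 boils down to three questions: what kind of sequence is the query, what kind of database is searched, and is anything translated? The sketch below encodes the table as a lookup. The function and argument names are made up for illustration and are not part of any BLAST software.

```python
# (query type, database type, translation used) -> program, as listed in Table 1.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_blast_program(query_type, database_type, translated=False):
    """Return the BLAST program from Table 1 that fits the search."""
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError("no BLAST program in Table 1 matches %r" % (key,))
    return BLAST_PROGRAMS[key]


# The search in this tutorial: a protein query against a protein database.
print(choose_blast_program("protein", "protein"))
```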
BLASTP Search
Paste your copied FASTA sequence into the text box under "Enter Query Sequence" (Fig. 7). Make sure
the "Non-redundant protein sequences (nr)" database is selected in the Database drop-down menu
under "Choose Search Set". Click on BLAST. You may see a window indicating your query has been added
to the BLAST queue.
You might have to wait for several seconds for your results, during which time you will see a screen like
that in Fig. 8. Be patient; remember that your sequence is being compared to thousands of others!

Figure 7. The BLAST search screen.

Figure 8. The initial screen shown while BLAST search results are being prepared.
BLASTP Results Part 1
Scroll down the blastp results page to the illustration with the red bars (Fig. 9). This is a diagrammatic
representation of how your query sequence (the top red bar) lines up with other related sequences in
the database based on the primary structure of the protein (from 0 to over 400 amino acids). This
diagram summarizes around 100 "hits", or other protein sequences, going from most to least similarity
to your corn alpha-amylase query sequence. Note that some of the sequences lack the amino terminus
of your corn alpha-amylase sequence.

Figure 9. BLAST summary of related sequences. The lines show relative alignment of the hit sequences with the
query sequence.
Once your results appear, scroll down past the red diagram and you will see a list of accession numbers
and descriptors for sequences in order of decreasing similarity to your sequence (Fig. 10). In fact, the
first item in the list is (or should be) your sequence (check the accession number to be sure). The two
scores at the right (Ident and E-value) indicate the degree of similarity. Both are defined in the glossary
of terms in this tutorial. You can click on any of these sequences to go to the GenPept page that
describes it.

Figure 10. Descriptions of the 100 most related sequences to the query sequence.
For now, scroll down to the "Alignment" section of the results (Fig. 11) to see the actual amino acid
sequences aligned against yours. Note the amino acid identities to get a measure of how similar the
sequences are. The first should be 100% since it is the identical sequence. As you scroll down through
the next several sequences, though, the percent identity should get smaller.

Figure 11. Sequence alignment information for the most related protein sequences.
Oryza is the genus of rice.
Your immediate goal using BLASTP is to locate other complete sequences for both the plant and animal
alpha-amylases utilized in your experiments, and others to include in your analysis. Scroll back up slowly
through the list of "hits". What species do you see? If it is not clear from the brief description, click on
the accession number to get the GenPept descriptions. In fact, what you will probably find are mostly
sequences from plants, some bacteria, and maybe a few insects. Click on the "Distance tree of results"
link (in Other reports options) in the top panel of the BLAST results (Fig. 12) to examine a phylogenetic
tree constructed from the organisms included in the BLAST results. Using the Tools options you can see
different representations of the tree.

Figure 12. Top panel of BLAST results showing location of the Distance tree of results link.
How many species should we include in the analysis?
To get the most from this analysis you may find that using more than just the three species we used in
lab would be very helpful in seeing larger patterns of similarity/dissimilarity when comparing plant and
animal amylases. We strongly recommend adding at least one more plant, if not more. Use equal
numbers of plant and animal amylases to balance the representation from each Kingdom.
If human and oyster (or other bivalve species [Class Bivalvia; Order Pelecypoda]) alpha-amylase are not
found in this list of BLAST hits, how else might you find those sequences to compare to corn? To
broaden your analysis a bit, you can also search for sequences for crop species like barley (Hordeum
vulgare) or rice (Oryza sp.). Design and carry out a strategy to find them, and once you do, copy the
FASTA-formatted sequences to the same Notepad file your other sequence is in. Make sure to leave one
blank line between the sequences (needed later for the submission to Clustal Omega).
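The Notepad file you are building is simply several FASTA records separated by blank lines. As a sketch of that layout, the function below writes each record as a definition line followed by the sequence in 60-residue lines, with one blank line between records. The record names and sequences in the example are made up, and the 60-character line width is an arbitrary choice.

```python
def build_multi_fasta(records, width=60):
    """Join (definition_line, sequence) pairs into one multi-FASTA string."""
    blocks = []
    for header, sequence in records:
        lines = [">" + header]
        # Break the sequence into fixed-width lines for readability.
        lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
        blocks.append("\n".join(lines))
    # One blank line between records, as the tutorial asks.
    return "\n\n".join(blocks) + "\n"


# Made-up records for illustration only.
demo = build_multi_fasta([("plant_demo", "MKNTSS"), ("animal_demo", "MKFFLL")])
print(demo)
```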
ACCESSION NUMBERS CHECK
To facilitate a broader comparison of alpha-amylase among plants and animals, you should now have
four (or more) accession numbers: one each for corn (Zea mays), humans (Homo sapiens), Pacific oyster
(Crassostrea gigas) and barley (Hordeum vulgare). There are now sequences for amylase from two other
clam genera in the databases (Cerastoderma and Corbicula) which could be used as alternatives to the
Pacific oyster. Likely sequences to include will have lengths similar to the human and corn sequences
and will be rich in A in the accession number.
Record those species and their accession numbers below and then check with a lab instructor or TA to
make sure that you have appropriate sequences before you proceed.
SPECIES                             ACCESSION NUMBER
corn (Zea mays)                     AAA50161
Other plant:
Other plant:
Other plant:

humans (Homo sapiens)
Pacific oyster (Crassostrea gigas)
Other animal:
Other animal:
Clustal Omega: A DNA and Protein Multiple Sequence Alignment Tool
URL: http://www.ebi.ac.uk/Tools/msa/clustalo/
Introduction
Once you have found at least two usable sequences for both plant and animal amylases, you will want to
align them to see how similar they are. We will use the program Clustal Omega to do such an alignment.
Be sure to read the information below that describes Clustal Omega and the underlying basis for
sequence comparisons. When you are finished, enter the URL shown above to bring up the site that
hosts the Clustal Omega program.
Clustal Omega is a general-purpose global multiple sequence alignment program for DNA or proteins for
use when you want to align 3 or more sequences (for aligning 2 sequences use the pairwise sequence
alignment tool: http://www.ebi.ac.uk/Tools/psa/). Clustal Omega produces biologically meaningful
multiple sequence alignments of divergent sequences. It calculates the best match for the selected
sequences, and lines them up so that the identities, similarities, and differences can be seen.
Evolutionary relationships can be seen via viewing Cladograms or Phylograms. Alignment scores are
returned as a Percent Identity Matrix. The Percent Identity value for a given pairwise comparison will be
the data you want to obtain from this analysis.
Multiple alignments of protein sequences are important tools in studying sequences and understanding
evolutionary relationships. The basic information they provide is identification of conserved sequence
regions. This is very useful in designing experiments to test and modify the function of specific proteins,
in predicting the function and structure of proteins, and in identifying new members of protein families.
Sequences can be aligned across their entire length (global alignment) or only in certain regions (local
alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps
(representing insertions/deletions) while local alignments can avoid them, aligning regions between
gaps. The alignment is progressive and considers the sequence redundancy. Phylogenetic trees can also
be calculated from multiple alignments. The program has some adjustable parameters with reasonable
defaults.
Submission Form
You will use the default settings for all menus that appear at the top of the Submission Form (Fig. 13), so
don't change these.
Copy all of your sequences, in FASTA format including their first descriptor line, into the open frame on
the Submission Form; make sure to leave one blank line between them (Fig. 13). Clustal Omega will
attempt to align these amino acid sequences based on their similarities. Click Submit. Your results might
take a few seconds or perhaps a few minutes. BE PATIENT.

Figure 13. The Clustal Omega submission form.
Alignment Results
The first screen you'll see shows the alignments of your sequences (Fig. 14a). It will be helpful to click on
Show Colors to more easily see locations of similarity and difference among the sequences based on the
chemical nature of the amino acid residues.
RED (residues AVFPMILW) = Small (small + hydrophobic, incl. aromatic except Y)
BLUE (residues DE) = Acidic
MAGENTA (residues RK) = Basic (except H)
GREEN (residues STYHCNGQ) = Hydroxyl + sulfhydryl + amine + G
GREY (other residues) = Unusual amino/imino acids etc.
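The color key above is a simple residue-to-group lookup. As a sketch, it can be written as a dictionary built from the four residue lists in the key, with everything else falling back to grey. The function name is made up; the residue lists are copied from the key.

```python
# Residue groups exactly as listed in the color key.
COLOR_GROUPS = {
    "RED": "AVFPMILW",
    "BLUE": "DE",
    "MAGENTA": "RK",
    "GREEN": "STYHCNGQ",
}

# Invert the groups into a one-letter-code -> color lookup.
RESIDUE_COLOR = {
    residue: color
    for color, residues in COLOR_GROUPS.items()
    for residue in residues
}


def residue_color(residue):
    """Return the display color for a one-letter amino acid code."""
    return RESIDUE_COLOR.get(residue.upper(), "GREY")


for letter in "WDKYX":
    print(letter, residue_color(letter))
```

Note that tyrosine (Y) and histidine (H) land in the green group, which is what the exceptions in the red and magenta entries refer to.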
The displayed rows (except the last one, with the consensus symbols *, :, .) are the aligned amino acid
sequences; the last one is an indication of consensus, or which amino acids are conserved across the
compared sequences. By default, an alignment will display the following consensus symbols denoting
the degree of conservation observed in each column.
Conserved means the amino acid is replaced by one having similar chemical properties.
Consensus Symbols:
"*" means that the residues, or nucleotides, in that column are identical in all sequences in the
alignment.
":" means that conserved substitutions have been observed, i.e., amino acids having strongly similar
properties.
"." means that semi-conserved substitutions are observed, i.e., amino acids having similar shape, but
otherwise weakly similar properties.
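To see how a consensus line could be derived from the aligned rows, here is a minimal sketch. It marks a column "*" when all residues are identical, ":" when all residues fall inside one "strong" similarity group, and "." when they fall inside one "weak" group. The group lists are an assumption based on the groups commonly documented for the Clustal programs and are not given in this tutorial; the aligned sequences are made up.

```python
# Assumed similarity groups (commonly documented for Clustal); not from this tutorial.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = {r.upper() for r in column}
    if "-" in residues:
        return " "  # columns containing a gap get no symbol
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def consensus_line(aligned_sequences):
    """Build the consensus row for equal-length aligned sequences."""
    return "".join(consensus_symbol(col) for col in zip(*aligned_sequences))


# Made-up alignment for illustration only.
rows = ["MKDL-", "MRELA", "MKNLA"]
for row in rows:
    print(row)
print(consensus_line(rows))
```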
Click on the Results Summary button at the top of the page. A table is returned (Fig. 14b) that allows you to
select multiple summaries of information about the analysis. The one you'll want is the last one, the
Percent Identity Matrix (PIM); this returns the alignment scores for the pairwise comparisons of the
sequences you submitted. The matrix (Fig. 14c) lists the sequences by accession number by row and
column (we added the red labels). The score at the intersection of a row and column is the alignment score for
that pair. To help you understand the alignment score, review the description below from the Clustal
Omega site FAQs. Copy and paste the PIM into your Notepad file.
How are pairwise alignment scores calculated?
A pairwise score is calculated for every pair of sequences that are to be aligned. Pairwise scores are
calculated as the number of identities (same amino acid residue) in the best alignment divided by the
number of residues compared (gap positions are excluded). Thus, they tell us approximately what
percentage of the two sequences have functional identity, or similarity.
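The calculation described in the FAQ text can be sketched in a few lines: count identical residues, divide by the number of positions where neither sequence has a gap, and express the result as a percentage. This is an illustration of the description above, not Clustal Omega's own code, and the aligned sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences, ignoring gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap positions are excluded from the comparison
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_matrix(aligned_sequences):
    """Percent identity for every pair, laid out like the PIM."""
    return [[percent_identity(a, b) for b in aligned_sequences]
            for a in aligned_sequences]


# Made-up aligned sequences for illustration only.
print(percent_identity("MKDL-A", "MRDLQA"))
```

In the example, five positions are compared (the gap column is skipped) and four of them match, so the score is 80 percent. The diagonal of the matrix is always 100 because each sequence is compared with itself.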

Figure 14. (A) A portion of a multiple sequence alignment. The number at the end of the row
indicates the amino acid number in the last position of that row relative to the entire molecule.
(B) Results Summary options. (C) Matrix of alignment scores.
Be sure to copy the alignments output and matrix scores results to your Notepad file. Look through
the entire sequence to look for areas of similarity.
- How much is there? Can you guess why clam/oyster and human sequences did not appear in the
  BLAST search with corn alpha-amylase?
- Compare each pair of sequences to see which ones are most similar. You might need to rerun
  Clustal Omega with the different pairs to most efficiently determine this.
- Are there any areas of the sequence that you expect to be more similar between species than
  others (e.g., the active site)? The parts of the sequence with the most identity are likely parts of
  the active site. How are they distributed along the sequence? How can you explain their
  distribution?
- If you don't know where the important functional domains are, you should run a search of the
  literature in PubMed to find out. Simply click on the NCBI icon on the active web page and
  choose PubMed.
Protein Structures: Conserved Domain Database (CDD)
Since you found that there are few similarities in the amino acid sequences for alpha-amylase in the
three organisms, how do we account for them being functionally similar? We need to take one more
step and examine the three-dimensional structure of the enzymes. You can use tools on the NCBI
website for this as well.
1. Open the NCBI main page (Fig. 14). Click on Domains and Structure on the left-hand menu bar,
   and then select Conserved Domain Database (CDD) under the resource tab.

Figure 14. NCBI website home page.
2. On the CDD database page, click on "CD-Search" (Fig. 15).

Figure 15. Conserved Domain Database entry page.
3. Type (or paste) the accession number for human salivary alpha-amylase into the big center
   search window (Fig. 16). Use the default settings as presented. Click on the SUBMIT button.

Figure 16. Conserved domain query submission page.
4. The results window should confirm that this sequence is for alpha-amylase. Click on SEARCH
   FOR SIMILAR DOMAIN ARCHITECTURE (Fig. 17).

Figure 17. Results page from CDD query. Note that the graphic identifies the active, catalytic, and
calcium-binding site regions. Select the pfam00128 accession number to continue.
5. In the window displaying the results, click on the pfam00128 group (Fig. 17), then in the next
   dialog (Fig. 18) click on the "[+] Structure" menu, which is collapsed by default. Click on
   Structure View (Fig. 18a). If you are using your own computer, click on Download Cn3D to
   install the viewing program and follow your platform's usual instructions for
   program installation. On Bates laptops, the program should open the structure file
   automatically.
NOTE: Macs may not be able to run the Cn3D software needed to view the structures. We will have
laptops available for you to use if needed.
Figure 18. Accessing the Cn3D display program (panels A and B).
6. The Cn3D application will open, enabling you to see the structure of your protein (Fig. 19). You
   can rotate the 3D structure by dragging it with your mouse. The catalytic active region is shown
   in red.
The alpha-amylase molecule displayed is a consensus structure representing ALL alpha-amylases
across taxa, not one from a particular species. In Fig. 19 the molecule is positioned to show the
catalytic site in the upper right; it appears as a V-shaped groove on one side of the molecule.
In the bottom of the V, a starch molecule is shown as it would be oriented in the active site.
Figure 19. 3D rendering of the human salivary amylase molecule.
7. The color key of the image in Fig. 19 matches the amino acid sequence information (Fig. 20) in
   the window that appears below the 3D representation of your protein. The first row is the
   query sequence. If you select a portion of the sequence by dragging the mouse, it will be
   highlighted in yellow on the model. The same works for individual residues.

Figure 20. Amino acid sequences of pfam00128 amylases. The first row is the query sequence.
8. Change the display format of Cn3D by selecting Style > Rendering Shortcuts > Worms (Fig. 21).
   Now you should be able to rotate the structure to clearly see the α/β barrel site in the center
   of the molecule. If need be, rotate the molecule to see the hole through the molecule formed
   by the barrel.
Figure 21. Commands to change the rendering style of the 3D model.

Protein Structures: Comparisons
Now that you know what the catalytic site looks like, you can search for the 3D structure of the specific
enzymes used in this study and see how they compare. We'll start with the human salivary amylase.
Unfortunately, there are no structure models for either corn or clams in the database, but there is one
for barley, another grain. Before viewing the structure of the barley enzyme, look at your Clustal Omega
PIM results and compare the barley and corn sequences to determine if this substitute is valid.
1. Close the CDD windows and return to the main NCBI website by clicking the NCBI logo in the
   upper left corner.
2. Click on STRUCTURE in the drop-down menu by the search window at the top of the page.
3. At the Entrez Structure search, enter 1SMD (= human salivary amylase) and click Search.
4. Rotate the model of the enzyme. Can you see the characteristic catalytic site? To access the
   full function of the viewer, click on the expand window icon in the lower left of the image
   window.
5. This site does not show the catalytic site in red, but you can select a section of the sequence in
   the right-hand window, and it will also be highlighted on the model. You can use this attribute
   when comparing two amylases to help achieve the same orientation.
6. Now, open a new browser window by right-clicking on the browser icon on your taskbar or
   desktop and selecting the browser name. Enter the NCBI website URL. Select STRUCTURE,
   enter 1RPK (= barley alpha-amylase) in the search window and click Search.
7. As before, expand the display functions. Reduce the browser window size and do the same with
   the window for the human amylase so that both enzyme structures can be viewed side by side.
   Rotate the model of the barley enzyme. Can you see the characteristic catalytic site?
8. As best you can, rotate the two images to orient them the same and compare them. A rendering
   style of cylinder and plate will display the pleated sheets and helices of the secondary (2°) structure.
9. Once done, do not close the salivary amylase window. Close the second browser window,
   however. Proceed to the next procedure.

Figure 22. Human salivary amylase structure record for 1SMD.
Comparing 3D Structures with VAST (now this IS cool!!)
While Cn3D does fine with single structures, it's even better suited to displaying structure alignments
of multiple proteins, i.e., it enables you to superimpose 3D structures on top of each other such that
differences in structure are readily apparent. NCBI creates and maintains a database of such
alignments, called VAST (Vector Alignment Search Tool), for all pairs of proteins from MMDB whose
structures have some similar core regions. The VAST tool does two things for each related pair: it
calculates an optimal 3D superimposition for the conserved core, and constructs a sequence alignment
based on the correlation of the 3D structures.

1. If you have the human salivary amylase (1SMD) still open, return to the Structure Summary page
   and then go to step 3. If not, from the NCBI home page, choose the Structure database.
2. Search for 1SMD.
3. When you select 1SMD, you should get the Structure Summary page (Fig. 23).
4. To compare this structure with other molecules, click the VAST+ button on the right. You now
   have a list of similar structures. Find the structure for barley alpha-amylase (1AMY) by entering
   1AMY for the PDB ID and clicking the Search within Results button (Fig. 24).

Figure 23. Structure Summary page.

Figure 24. VAST+ record page.
5. Expand the entry by clicking on the + to the left of 1AMY (Fig. 25). You should see ball-and-stick
   diagrams of both the human and the barley amylases in the window. Now click on the Visualize
   3D structure superposition with Cn3D button to display the aligned 3D structures. Select the Cn3D
   app to view it.

Figure 25. VAST+ page showing structures to be aligned.
6. The default coloring for structure alignments in Cn3D uses magenta and blue for the regions
   aligned by the VAST algorithm, where residues aligned in 3D space are magenta, and different
   residues are blue; unaligned regions are colored gray. Note that because of the way VAST works,
   the aligned regions tend to correspond to individual or groups of consecutive secondary
   structure elements (helices and strands), while the loops outside the core vary in length and
   orientation and are often left unaligned.
7. There are some important differences between structure-based alignments in Cn3D and
   sequence alignments from common algorithms like BLAST or Clustal Omega, both in the display
   and the underlying alignment data. In a structure alignment (e.g. from VAST), one residue is
   aligned with another because their alpha carbons are nearby in space, not because of the
   residue identity.
8. Try aligning a molecule that is very similar to human alpha-amylase: porcine alpha-amylase.
   Search for the PDB ID = 1PIF instead of the barley.
9. Alteromonas haloplanctis, the cold-adapted marine organism that Feller et al. wrote about, is in
   the VAST results too; search for PDB ID = 1AQH.
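The idea in step 7, that residues are paired because their alpha carbons sit close together in space, can be illustrated with a toy calculation. The sketch below is not the VAST algorithm: it assumes the two structures have already been superimposed, uses made-up alpha-carbon coordinates, and pairs each residue of the first structure with its nearest neighbor in the second if they lie within a hypothetical cutoff of 3.0 angstroms. It needs Python 3.8 or later for `math.dist`.

```python
import math


def nearby_pairs(ca_a, ca_b, cutoff=3.0):
    """Pair residues whose alpha carbons lie within `cutoff` angstroms.

    ca_a and ca_b are non-empty lists of (x, y, z) alpha-carbon coordinates
    from two structures that are assumed to be superimposed already.
    """
    pairs = []
    for i, coord_a in enumerate(ca_a):
        # Find the closest alpha carbon in the second structure.
        j, distance = min(
            ((j, math.dist(coord_a, coord_b)) for j, coord_b in enumerate(ca_b)),
            key=lambda item: item[1],
        )
        if distance <= cutoff:
            pairs.append((i, j, round(distance, 2)))
    return pairs


# Made-up coordinates for illustration only.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
structure_b = [(0.5, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(nearby_pairs(structure_a, structure_b))
```

The third residue of the first structure is far from everything in the second, so it stays unpaired, much like the gray loops left unaligned in the Cn3D display. Residue identity never enters the calculation.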

CITING REFERENCES RELATED TO BIOINFORMATICS INFORMATION IN YOUR PAPER
When using Clustal Omega to compare sequences, your citations are in two parts:
In your Methods, when you say that you compared protein sequences using Clustal Omega, cite the
creators of the program just as you would cite a source in any other section of your paper. The citations
(both are required by EMBL) in your Literature Cited list then are:
Sievers, F, Wilm, A, Dineen, DG, Gibson, TJ, Karplus, K, Li, W, Lopez, R, McWilliam, H, Remmert,
    M, Söding, J, Thompson, JD, and DG Higgins. 2011. Fast, scalable generation of high-quality
    protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology
    7 Article number: 539
Goujon, M, McWilliam, H, Li, W, Valentin, F, Squizzato, S, Paern, J, and R Lopez. 2010. A new
    bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Research 38 Suppl:
    W695-9
Also, in your Methods, for each sequence you used, give the accession number of the sequence and
cite the paper/researchers who submitted that sequence to NCBI. For instance, for the corn
sequence the accession number is AAA50161 and the citation in your text is (Young et al.
1994). The citation in your Literature Cited section would be:
Young, T.E., DeMason, D.A., Close 1994. Cloning of an alpha-amylase cDNA from aleurone tissue
    of germinating maize seed. Plant Physiol. 105(2), 759-760.
Glossary
Alignment
The process of lining up two or more sequences to achieve maximal levels of identity (and
conservation, in the case of amino acid sequences) for the purpose of assessing the degree of
similarity and the possibility of homology.
Algorithm
A fixed procedure embodied in a computer program.
Bioinformatics
The merger of biotechnology and information technology with the goal of revealing new insights
and principles in biology.
Bit score
The value S' is derived from the raw alignment score S in which the statistical properties of the
scoring system used have been taken into account. Because bit scores have been normalized
with respect to the scoring system, they can be used to compare alignment scores from
different searches.
BLAST
Basic Local Alignment Search Tool. (Altschul et al.) A sequence comparison algorithm optimized
for speed used to search sequence databases for optimal local alignments to a query. The initial
search is done for a word of length "W" that scores at least "T" when compared to the query
using a substitution matrix. Word hits are then extended in either direction in an attempt to
generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates
the speed and sensitivity of the search. For additional details, see one of the BLAST tutorials
(Query or BLAST) or the narrative guide to BLAST.
BLOSUM
Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived
from observations of the frequencies of substitutions in blocks of local alignments in related
proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix,
for example, the alignment from which scores were derived was created using sequences
sharing no more than 62% identity. Sequences more identical than 62% are represented by a
single sequence in the alignment so as to avoid over-weighting closely related family members.
(Henikoff and Henikoff)
Conservation
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve
the physicochemical properties of the original residue.
Domain
A discrete portion of a protein assumed to fold independently of the rest of the protein and
possessing its own function.
DUST
A program for filtering low-complexity regions from nucleic acid sequences.
E value
Expectation value. The number of different alignments with scores equivalent to or better than
S that are expected to occur in a database search by chance. The lower the E value, the more
significant the score.
FASTA
The first widely used algorithm for database similarity searching. The program looks for optimal
local alignments by scanning the sequence for small matches called "words". Initially, the scores
of segments in which there are multiple word hits are calculated ("init1"). Later the scores of
several segments may be summed to generate an "initn" score. An optimized alignment that
includes gaps is shown in the output as "opt". The sensitivity and speed of the search are
inversely related and controlled by the "k-tup" variable which specifies the size of a "word".
(Pearson and Lipman)
Filtering
Also known as Masking. The process of hiding regions of (nucleic acid or amino acid) sequence
having characteristics that frequently lead to spurious high scores. See SEG and DUST.
Gap
A space introduced into an alignment to compensate for insertions and deletions in one
sequence relative to another. To prevent the accumulation of too many gaps in an alignment,
introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment
score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized
in the scoring of an alignment.
Global Alignment
The alignment of two nucleic acid or protein sequences over their entire length.
H
H is the relative entropy of the target and background residue frequencies. (Karlin and Altschul,
1990). H can be thought of as a measure of the average information (in bits) available per
position that distinguishes an alignment from chance. At high values of H, short alignments can
be distinguished from chance, whereas at lower H values, a longer alignment may be necessary.
(Altschul, 1991)
Homology
Similarity attributed to descent from a common ancestor.
HSP
High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment
scores in a given search.
Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
K
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale
for search space size. The value K is used in converting a raw score (S) to a bit score (S').
Lambda
A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale
for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').
Local Alignment
The alignment of some portion of two nucleic acid or protein sequences.
Low Complexity Region (LCR)
Regions of biased composition including homopolymeric runs, short-period repeats, and more
subtle overrepresentation of one or a few residues. The SEG program is used to mask or filter
LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid
queries.
Masking
Also known as Filtering. The removal of repeated or low-complexity regions from a sequence in
order to improve the sensitivity of sequence similarity searches performed with that sequence.
Motif
A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of
domains.
Multiple Sequence Alignment
An alignment of three or more sequences with gaps inserted in the sequences such that
residues with common structural positions and/or ancestral residues are aligned in the same
column. ClustalW is one of the most widely used multiple sequence alignment programs.
Optimal Alignment
An alignment of two sequences with the highest possible score.
Orthologous
Homologous sequences in different species that arose from a common ancestral gene during
speciation; may or may not be responsible for a similar function.
P value
The probability of an alignment occurring with the score in question or better. The P value is
calculated by relating the observed alignment score, S, to the expected distribution of HSP
scores from comparisons of random sequences of the same length and composition as the
query to the database. The most highly significant P values will be those close to 0. P values and
E values are different ways of representing the significance of the alignment.
PAM = Percent Accepted Mutation
A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein
sequence. 1.0 PAM unit is the amount of evolution which will change, on average, 1% of amino
acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for
each amino acid substitution have been calculated based on the frequency of that substitution
in closely related proteins that have experienced a certain amount (x) of evolutionary
divergence.

Paralogous
Homologous sequences within a single species that arose by gene duplication.
Profile
A table that lists the frequencies of each amino acid in each position of a protein sequence.
Frequencies are calculated from multiple alignments of sequences containing a domain of
interest. See also PSSM.
Proteomics
The systematic analysis of protein expression in normal and diseased tissues that involves the
separation, identification, and characterization of all of the proteins in an organism.
PSI-BLAST = Position-Specific Iterative BLAST
An iterative search using the BLAST algorithm. A profile is built after the initial search, which is
then used in subsequent searches. The process may be repeated, if desired, with new sequences
found in each cycle used to refine the profile. Details can be found in this discussion of PSI-
BLAST. (Altschul et al.)
PSSM = Position-specific scoring matrix
The PSSM gives the log-odds score for finding a particular matching amino acid in a target
sequence.
Query
The input sequence (or other type of search term) with which all of the entries in a database are
to be compared.
VAST
Vector Alignment Search Tool. A tool that enables superimposition of multiple 3D structures.
The VAST tool does two things for each related pair: it calculates an optimal 3D superimposition
for the conserved core, and constructs a sequence alignment based on the correlation of the 3D
structures.
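The glossary entries for Bit score, E value, K, and Lambda describe a conversion that can be written out explicitly. The sketch below assumes the standard Karlin-Altschul relations, S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m and n are the query and database lengths. These formulas are not stated in this tutorial, and the numbers used are arbitrary illustrations, not values from a real search.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


# Arbitrary illustrative numbers, not taken from any real BLAST report.
print(bit_score(raw_score=100, lam=0.3, k=0.1))
print(expect_value(bits=10, query_length=32, database_length=32))
```

Because the bit score sits in the exponent, each additional bit halves the E value, which is why small differences in score translate into large differences in significance.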
