
AN INVESTIGATION OF HUMAN PROTEIN
INTERACTIONS USING THE COMPARATIVE METHOD

Saif Ur-Rehman
A Thesis Submitted for the Degree of PhD at the University of St Andrews
2012
Full metadata for this item is available in Research@StAndrews:FullText at:
http://research-repository.st-andrews.ac.uk/
Please use this identifier to cite or link to this item:
http://hdl.handle.net/10023/3119
This item is protected by original copyright
This item is licensed under a Creative Commons Licence
School of Biology
PhD Thesis
An investigation of human protein interactions using the
comparative method
by
Saif Ur-Rehman
20th Jan 2012
1. Candidate's declarations:
I, Saif Ur-Rehman, hereby certify that this thesis, which is approximately 52,000 words in length, has been written
by me, that it is the record of work carried out by me and that it has not been submitted in any previous
application for a higher degree.
I was admitted as a research student in September 2006 and as a candidate for the degree of PhD in September 2007; the
higher study for which this is a record was carried out in the University of St Andrews between 2006 and 2010.
Date signature of candidate
2. Supervisor's declaration:
I hereby certify that the candidate has fulfilled the conditions of the Resolution and Regulations appropriate for the
degree of PhD in the University of St Andrews and that the candidate is qualified to submit this thesis in
application for that degree.
Date signature of supervisor
3. Permission for electronic publication: (to be signed by both candidate and supervisor)
In submitting this thesis to the University of St Andrews I understand that I am giving permission for it to be made
available for use in accordance with the regulations of the University Library for the time being in force, subject to
any copyright vested in the work not being affected thereby. I also understand that the title and the abstract will
be published, and that a copy of the work may be made and supplied to any bona fide library or research worker,
that my thesis will be electronically accessible for personal or research use unless exempt by award of an
embargo as requested below, and that the library has the right to migrate my thesis into new electronic forms as
required to ensure continued access to the thesis. I have obtained any third-party copyright permissions that may
be required in order to allow such access and migration, or have requested the appropriate embargo below.
The following is an agreed request by candidate and supervisor regarding the electronic publication of this thesis:
Add one of the following options:
(ii) Access to [all or part] of printed copy but embargo of [all or part] of electronic publication of thesis for a period
of 2 years (maximum five) on the following ground(s):
publication would preclude future publication;
Date signature of candidate
signature of supervisor
A supporting statement for a request for an embargo must be included with the submission of the draft copy of the
thesis. Where part of a thesis is to be embargoed, please specify the part and the reasons.
Abstract
DNA sequence data are being produced at a rapidly increasing rate as next-generation
sequencing technologies become more widespread. Consequently, there is a need for
rapid computational techniques to functionally annotate data as they are generated. One
computational method for the functional annotation of protein-coding genes is the detection
of interaction partners. If the putative partner has a functional annotation, then this annotation
can be extended to the initial protein via the established principle of guilt by association.
This work presents a method for rapid detection of functional interaction partners for
proteins through the use of the comparative method. Functional links are sought between
proteins through analysis of their patterns of presence and absence amongst a set of 54
eukaryotic organisms. These links can be either direct or indirect protein interactions. These
patterns are analysed in the context of a phylogenetic tree.
The method used is a heuristic combination of two techniques. The first is an established,
accurate methodology involving comparison of models of evolution, the parameters of which
are estimated using maximum likelihood. The second is a novel technique involving the
reconstruction of ancestral states using Dollo parsimony and analysis of these
reconstructions through the use of logistic regression. The methodology achieves
specificity comparable to that of gene coexpression as a means to predict functional
linkage between proteins.
The application of this method permitted a genome-wide analysis of the human
genome, which would have otherwise demanded a potentially prohibitive amount of
computational resource.
Proteins within the human genome were clustered into orthologous groups. Ten of
these proteins, which were ubiquitous across all 54 eukaryotes, were used to reconstruct a
phylogeny. An application of the heuristic predicted a set of 1,142 functional protein
interactions in human cells. Of these predictions, 1,131 were not present in current
protein-protein interaction databases.
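As an illustration of the presence/absence patterns referred to above, the following is a minimal sketch only and is not the method developed in this thesis. It compares toy phylogenetic profiles using Hamming distance, a simple tree-free measure that the thesis treats separately (section 3.1.1), and ignores the phylogenetic tree entirely. The protein names, the six-species profiles, the function names and the distance threshold are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]

# Hypothetical toy data: 1 = an ortholog is present in that species, 0 = absent.
# The thesis uses 54 eukaryotes; six made-up species columns are used here.
PROFILES: Dict[str, Profile] = {
    "protA": (1, 1, 0, 1, 0, 1),
    "protB": (1, 1, 0, 1, 0, 1),
    "protC": (1, 0, 1, 0, 1, 0),
    "protD": (1, 1, 0, 1, 1, 1),
}


def hamming(p: Profile, q: Profile) -> int:
    """Count the species in which exactly one of the two proteins is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of species")
    return sum(a != b for a, b in zip(p, q))


def candidate_partners(
    query: str, profiles: Dict[str, Profile], max_distance: int
) -> List[Tuple[str, int]]:
    """Return (name, distance) pairs within max_distance of the query profile.

    Results are sorted by distance, then by name. A small distance is only
    a naive hint of functional linkage; it does not account for shared ancestry.
    """
    target = profiles[query]
    hits = [
        (name, hamming(target, profile))
        for name, profile in profiles.items()
        if name != query
    ]
    return sorted(
        (hit for hit in hits if hit[1] <= max_distance),
        key=lambda hit: (hit[1], hit[0]),
    )


if __name__ == "__main__":
    for name, distance in candidate_partners("protA", PROFILES, max_distance=1):
        print(name, distance)
```

Under the guilt-by-association principle, any functional annotation attached to a returned candidate could then be proposed, tentatively, for the query protein. Because closely related species share presence/absence states through common descent, a tree-free count of this kind can overstate the evidence, which is the motivation for analysing the patterns in the context of a phylogenetic tree.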
Acknowledgements
I thank the BBSRC for funding my work and making this project possible. I would also like
to thank my supervisor, Dr Daniel Barker, for his tireless support and good advice. I would
like to thank Professor Mike Ritchie, Dr Jeff Graves and Dr Anne Smith for acting as my
committee. I would also like to thank Dr Rona Ramsay, Ji-Hiyun Lim, Wim Verleyen, Dr
Christoph Echtermeyer and Maria Keays for helpful discussion. I would also like to thank Dr
Neil Symington and Dr Herbert Fruchtl for their helpful advice in using cluster-computing
resources.
On a personal note, I would like to thank my wife Cathryn for her constant unconditional
support, without which I could not have brought this project to fruition. I would also like to
thank my friends Ken Armstrong and Jack Levell, who played a large part in keeping me
balanced over the course of my work. Finally, I would like to thank my parents for their love
and support, which has been a source of strength.
Contents
Abstract
Acknowledgements
Chapter 1
1.1 History
1.2 DNA/RNA
1.2.1 RNA
1.3 Proteins
1.3.1 Protein secondary structure
1.3.2 Protein tertiary structure
1.3.3 Protein quaternary structure
1.3.4 Protein domains
1.3.5 Protein motifs
1.4 Genes
1.4.1 Structure of a gene
1.4.1.1 Regulatory region of a gene
1.4.2 Transcription
1.4.2.1 Post Transcriptional processing
1.4.2.1.1 Genetic Code
1.4.2.1.2 Open reading frames
1.4.2.1.3 Exons/Introns
1.4.2.2 Post Transcriptional processing (cont)
1.4.2.2.1 RNA splicing
1.4.2.2.2 Capping
1.4.2.2.3 Polyadenylation
1.4.3 Translation
1.5 Genomics
1.5.1 Genome annotation
1.5.2 Genome and cDNA assembly
1.5.3 Gene detection
1.5.4 Functional annotation of genes
1.5.4.1 Laboratory based techniques
1.5.4.2 Computational methods for functional annotation of genes
1.5.4.2.1 Alignment based methods
1.5.4.2.2 Genome context methods
1.5.4.2.2.1 Rosetta stone
1.5.4.2.2.2 Gene neighbour
1.5.4.2.2.3 Interolog detection
1.5.4.2.2.4 Phylogenetic profiling
1.5.4.2.2.5 Comparative methods
1.5.4.2.2.6 Mirror trees
1.5.5 Storage of functional information
1.5.5.1 GO
1.5.5.1 KEGG
1.6 Transcriptomics
1.6.1 Microarrays
1.6.2 Other methods for transcriptome examination
1.7 Proteomics
1.7.1 Protein Structure
1.7.2 Protein interactions
1.7.2.1 Experimental detection of protein interactions
1.8 Description of project.
Chapter 2
2.1 Introduction
2.1.1 Homology
2.1.2 Molecular evolution
2.1.2.1 Synonymous and non-synonymous mutations
2.1.3 Phylogenetic trees
2.1.3.1 Species trees and gene trees
2.1.3.2 Topologies and branch lengths
2.1.3.3 Bootstrap support values
2.1.3.4 Evolutionary models in tree estimation
2.1.4 Detection of homology in molecular data
2.1.5 Multiple sequence alignment
2.1.5.1 Multiple sequence alignment quality filtration
2.1.6 Methods to estimate phylogenetic trees
2.1.6.1 Distance methods
2.1.6.2 Discrete character state methods
2.1.6.2.1 Maximum Parsimony
2.1.6.2.2 Maximum likelihood
2.1.6.2.3 Bayesian Methods
2.1.6.3 Heuristic search methods
2.1.8 Model selection in phylogenetic tree estimation
2.1.10 Phylogenetic analysis using gene presence
2.2 Methods
2.2.1 Data Selection
2.2.2 Data Acquisition
2.2.3 Pairwise Alignment
2.2.4 Orthology Determination
2.2.4.1 Inparanoid
2.2.4.2 Implementation
2.2.4.2.1 Design
2.2.4.3 Application
2.2.5 Phylogenetic profiles
2.2.5.1 Single copy proteins
2.2.5.2 Proteome content data/tree
2.2.6 Multiple alignment
2.2.7 Model selection
2.2.8 Phylogeny reconstruction
2.2.9 Comparison of protein content tree with super matrix tree
2.3 Results
2.4 Discussion
2.4.1 ML tree
2.4.2 Proteome content phylogeny
2.4.3 Conclusion
Chapter 3
3.1 Introduction
3.1.1 Hamming distance
3.1.2 Comparative method
3.1.2.1 Phylogenetic profile analysis using the comparative method
3.1.3 Co-expression as measured by microarray
3.1.4 Bayesian classifier
3.2 Methods
3.2.1 Assessing quality
3.2.2 Training and test data.
3.2.3 Hamming distance
3.2.4 Constrained ML
3.2.5 Co-expression of mRNA
3.3 Results
3.3.1 Hamming distance
3.3.2 Constrained ML
3.3.2.1 Likelihood ratio statistic
3.3.3 Co-expression of mRNA
3.3.5 Method Comparison
3.4 Discussion
3.4.1 Low Sensitivities
Chapter 4
4.1 Introduction
4.1.1 Ancestral state reconstruction
4.1.1.1 Parsimony
4.1.1.2 Likelihood
4.2 Filters
4.2.1 Hamming distance filter
4.2.2 Ancestral state reconstruction filter
4.2.2.1 Dollo parsimony
4.2.2.2 Maddison Test for correlated evolution
4.3 Methods
4.3.1 Maddison test for correlated evolution
4.3.1.1 Algorithm
4.3.1.2 Calculation of total number of ways of having x gains and y losses over the tree
4.3.1.3 Calculation of total number of ways of having p gains and q losses in subset k given x gains and y losses over the entire tree
4.3.1.4 Permutation effects
4.3.1.5 Evaluation of Maddison test as heuristic for constrained ML
4.3.2 Modification of test to match Dollo constraints
4.3.3 Differential parsimony
4.3.4 Dollo-pos/Dollo-overall
4.3.5 Test based on logistic regression
4.3.5.1 Evaluation of logistic regression as a heuristic for constrained ML
4.4 Results
4.4.1 Maddison test for correlated evolution
4.4.2 Differential parsimony
4.4.3 Dollo-pos
4.4.4 Dollo-overall
4.4.5 Logistic regression
4.4.6 Hamming distance
4.5 Discussion
Chapter 5
5.1 Introduction
5.1.1 PPI databases
5.1.1.1 MIPS
5.1.1.2 BIND
5.1.1.3 MINT
5.1.1.4 INTACT
5.1.1.5 HPRD
5.1.1.6 DIP
5.1.1.7 REACTOME
5.1.1.8 STRING
5.1.1.9 I2D
5.1.1.10 KEGG
5.1.1.11 BIOGRID
5.1.1.12 Discussion
5.1.2 Power law
5.2 Methods
5.2.1 Short Branch filtration
5.2.2 GO term enrichment
5.2.3 Intersection with other data sources
5.3 Results
5.3.1 GO Enrichment
5.3.2 Intersection with known data
5.3.3 Network statistics
5.4 Discussion
5.4.1 GO enrichment
5.4.2 Intersection with known data
5.4.3 Weaknesses
5.4.3.1 Scaling
5.4.4 Conclusions
Chapter 6
6.1 Summary of Project
6.1.1 Repeat Analysis
6.2 Conclusion
6.3 Future directions
6.3.1 Computational extensions
6.3.2 Consensus profiles
6.3.3 Correlated evolution of proteins with the presence or absence of phenotypes
6.3.4 Drug Targets
References
Appendix A Description of divergence of Java implementation of Inparanoid algorithm from Perl implementation.
Appendix B Individual Gene trees for genes in super matrix utilised in construction of Phylogeny
Appendix C: Predictions made by constrained ML
Appendix D Concatenated Filtered Alignment
Chapter 1
Introduction to computational annotation of protein coding genes
1.1 History
The discovery in the 1940s (Avery et al. 1944) and confirmation in the 1950s (Hershey and
Chase 1952) of DNA (deoxyribonucleic acid) as the physical basis for inheritance was a
milestone in biological research. It provided for a means to examine the materials and
processes underlying phenotypic traits and provided a conceptual link to the other natural
sciences. This was rapidly followed by the elucidation of the three dimensional structure of
B-DNA (Watson and Crick 1953) which is the form of DNA prevalent in living cells as it is
conducive to nucleosome formation (Richmond and Davey 2003). This structure was the now
famous double helix. It had been previously established (Beadle and Tatum 1941) that genes
exist as discrete regions within the genome whose sequence codes for the sequence of a
corresponding chain of amino acids. The genome of an organism is the full set of hereditary
material it possesses (Alberts 2010). This is RNA in the case of some viruses and DNA in the
case of all cellular organisms (Brown 2006). The discovery of the genetic code
(Crick et al. 1961) provided information on the mechanism for this production which
operates via initial intermediary transcription into RNA (ribonucleic acid) and then
translation into proteins. (Some genes also code for RNA products such as tRNAs and other
non-coding RNAs (Brown 2006)).
The first feasible method for determining the sequence of DNA was the Maxam-Gilbert
chemical degradation method (Maxam and Gilbert 1977). This method was however
supplanted by the near simultaneous invention of the chain termination method of DNA
sequencing by Frederick Sanger (Sanger et al. 1977), which was both safer and more
efficient (Brown 2006). This led to the first full genome of an organism to be sequenced,
which was bacteriophage φX174 (Sanger et al. 1978). Another contribution by Sanger was
that of shotgun sequencing. This entails the shattering of a piece of DNA into random
fragments and the sequencing of those fragments. The sequences of the fragments are then
assembled through searching for overlaps between them. This method facilitated the
sequencing of a number of relatively larger viral and prokaryotic genomes such as
Bacteriophage MS2 (Fiers et al. 1976).
In 1996 Saccharomyces cerevisiae was the first eukaryotic genome to be sequenced (Goffeau
et al. 1996) via a large collaborative effort. This was followed by the publication of the first
multi cellular eukaryotic genome Caenorhabditis elegans in 1998 (C. elegans Sequencing
Consortium 1998) and the draft genomes of the vertebrate Homo sapiens soon followed in
2001 (Venter et al. 2001). With the application of industrial streamlining and automation to
sequencing efforts over the last 20 years and, more recently, the onset of next
generation sequencing technologies, there has been almost exponential growth in sequence
databases such as NCBI GenBank (Benson et al. 2009). Sequence data without further
processing and annotation cannot shed any light on either biological function or evolutionary
relationships between organisms. This means that there has been a focus on the development
of highly accurate high throughput methods for functional annotation of genes and other
functional genomic elements in recent years, as the gap between the rate of data generation
and the rate of accurate and verifiable annotation widens (Zhu et al. 2007).
1.2 DNA/RNA
DNA itself is made up of a linear backbone of alternating deoxyribose sugar and phosphate
residues (Strachan and Read 2004). There is a nitrogenous base attached to the 1′ (one prime)
carbon of each individual sugar residue. There are two forms of nitrogenous base present
within DNA. One form possesses a single heterocyclic ring of carbon and
nitrogen atoms. Bases that have this structure are known as pyrimidines (Strachan and
Read 2004). The second form of base consists of two fused heterocyclic rings of carbon
and nitrogen atoms. These bases are known as purines (Strachan and Read 2004). There are
two pyrimidines represented within DNA (Strachan and Read 2004). These are cytosine and
thymine commonly represented by the abbreviations C and T respectively (Brown 2006).
There are also two purines present, adenine and guanine represented as A and G (Brown
2006). The stability of the double helix structure of DNA is maintained through hydrogen
bond formation between the pyrimidine-purine pair C and G and hydrogen bond formation
between the remaining pyrimidine-purine pair T and A as well as base stacking interactions
between adjacent bases (Yakovchuk et al. 2006). Due to structural constraints base pairing
can only occur between a pyrimidine and a purine (Brown 2006).
The linear backbone of DNA/RNA is maintained by a phosphodiester bond formed
between the 3′ (3 prime) carbon atom of the sugar and the 5′ (5 prime) carbon of the
succeeding sugar (Strachan and Read 2004). The backbone is terminated by a sugar where
the 5′ carbon is not linked to another sugar residue. This point is known as the 5′ end.
Similarly the other end of the molecule lacks a phosphodiester bond on the 3′ carbon and is
known as the 3′ end (Strachan and Read 2004). The sequence of DNA is usually described
in the 5′→3′ direction, as this is the direction of DNA replication as well as transcription of
RNA using DNA as a template (Strachan and Read 2004). Thus a feature along a DNA
molecule is referred to as being upstream of another feature if it is closer to the 5′ end. The
length of a DNA molecule is measured in units of individual base pairs (bp).
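The base pairing rules and the 5′→3′ writing convention described above can be illustrated with a minimal Python sketch. The function name reverse_complement and the example sequence are illustrative assumptions and are not taken from the sources cited in this section.

```python
# Watson-Crick pairing as described in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the opposite strand of a DNA sequence, written 5'->3'.

    The two strands run in opposite directions, so the complement is
    reversed before it is reported in the conventional orientation.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

strand = "ATGGCC"                      # hypothetical sequence, written 5'->3'
partner = reverse_complement(strand)   # opposite strand, also written 5'->3'
length_bp = len(strand)                # length of the duplex in base pairs (bp)
```

Applying the function twice returns the original sequence, which is a convenient check that the pairing table is symmetric.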
DNA is a biopolymer and as such can be fully represented by the sequence of its
constituent nucleotide bases. Determination of this sequence for a complete organism
effectively represents the DNA blueprints for the construction of that organism, i.e. the amino
acid sequences of its constituent proteins and RNA molecules, as well as the regulatory
sequences that regulate production of these molecules both spatially and temporally.
1.2.1 RNA
RNA is constructed of similar residues; however, the sugar is ribose as opposed to
deoxyribose and the pyrimidine base thymine is replaced with the base uracil, commonly
represented by the abbreviation U (Strachan and Read 2004). There is a diverse population of
RNA molecules produced by the eukaryotic genome. These molecules are involved with a
number of processes essential to life, including protein synthesis and regulation of gene
expression. A breakdown of general RNA types and their functions is presented in Table 1.1.
Abbreviated name   Full name               Primary function
mRNA               Messenger RNA           Provides a template for protein synthesis.
tRNA               Transfer RNA            Connection of mRNA to relevant amino acid during protein synthesis.
rRNA               Ribosomal RNA           Component of protein synthesising organelles known as ribosomes.
snRNA              Small nuclear RNA       Component of RNA-protein machine (involved in post transcriptional modification of mRNA) known as the spliceosome.
snoRNA             Small nucleolar RNA     Involved in the modification of rRNA and snRNA.
miRNA              Micro RNA               Involved in the regulation of RNA stability and translation.
siRNA              Short interfering RNA   Involved in the targeted degradation of RNA.
Table 1.1: General types of RNA molecules with function (Blow 2004).
1.3 Proteins
Protein molecules are polymers comprised of one or more chains of amino acids. A chain of
amino acids can also be referred to as a polypeptide chain. Amino acids are molecules that
consist of an amino group, a carboxyl group, an R group and a hydrogen atom (Berg et al.
2001). These components are all linked to a central carbon atom known as the α carbon
(Berg et al. 2001). A polypeptide chain is formed when a peptide bond is formed between
the amino group of one amino acid and the carboxyl group of another. All polypeptide chains
have a free amino group at one end and a free carboxyl group at the other. These are known
as the N-terminus and C-terminus respectively (Alberts 2002). The sequence of a polypeptide
chain is presented as moving from the N-terminus to the C-terminus (Alberts 2002). A linear
polypeptide chain is also considered the primary structure of a protein (Brown 2006).
It is the R group that distinguishes amino acids (Berg et al. 2001). R groups vary in
factors such as size, shape, charge, hydrogen-bonding capacity, hydrophobic
character, and chemical reactivity (Berg et al. 2001). There are 20 naturally occurring
amino acids that are typically utilised by living cells (Alberts 2002).
1.3.1 Protein secondary structure
The interactions of the R, carboxyl, and amine groups of individual amino acids in a
polypeptide chain with each other cause polypeptide chains to fold into characteristic
conformations. These conformations are known as the secondary structure of a protein. There
are two main types of regular secondary structure, alongside unstructured regions known as
random coils (Brown 2006).
The α helix: This is a structure formed by interactions between the carboxyl groups
and amine groups of amino acids which are separated by a number of intermediate
amino acids (Berg et al. 2001).
The β sheet: This is a structure formed by the interactions between two polypeptide
chains running either parallel or anti-parallel to each other (Brown 2006).
Random coils: In the absence of particular structural imperatives polypeptide chains
can take on any number of shapes that are sterically possible. These shapes are
referred to as random coils (Shortle and Ackerman 2001).
1.3.2 Protein tertiary structure
The tertiary structure of a protein is formed by the folding up of the secondary structural
constructs formed by the polypeptide chain into a three dimensional configuration (Brown
2006). This configuration is held together by a number of chemical forces including hydrogen
bonding between individual amino acids and the interactions of hydrophobic amino acids
with water (Brown 2006).
1.3.3 Protein quaternary structure
The quaternary structure of a protein is formed by the interactions of multiple polypeptide
chains. Quaternary structure is a hallmark of proteins with a complex function (Brown 2006).
1.3.4 Protein domains
A protein domain can be defined as a substructure produced by any part of
a polypeptide chain that can fold independently into a compact, stable structure (Alberts
2002). There are a number of recurrent protein domains that are functionally important within
the eukaryotic cell. These include:
Helix turn helix: This is a domain comprised of two α helices separated by a short
strand of amino acids. It is functionally important due to its ability to bind DNA
(Brennan and Matthews 1989).
Transmembrane domain: This is a domain consisting of α helical structures capable
of passing through the lipid bilayer (cell membrane) that surrounds the cell. These are
crucially important in facilitating cell-cell communication and relaying information
about the external environment into a cell (Brown 2006).
1.3.5 Protein motifs

Protein motifs are conceptually similar to protein domains in that they are distinct
substructures within a protein molecule (Brown 2006). In contrast with domains they are not
able to form outside of the context of the overall protein. Functionally important protein
motifs include:
Leucine zipper: This motif is important in that it facilitates the formation of protein
quaternary structure by the dimerisation of two leucine rich regions of separate
polypeptides (Brown 2006). It is a motif that is found in a number of proteins that
bind DNA (Brown 2006).
Zinc finger: The zinc finger motif is a set of polypeptide chains whose interactions are
stabilised by the presence of zinc ions. It is also present in DNA binding proteins
(Brown 2006).
1.4 Genes
As mentioned above the blueprints for the production of given protein and RNA molecules
within an organism are contained in subsections of its genome known as genes. A more
specific current definition, presented by Pesole (2008), describes a gene as a
discrete genomic region whose transcription is regulated by one or more promoters
and distal regulatory elements and which contains the information for the synthesis of
functional proteins or non-coding RNAs, related by the sharing of a portion of genetic
information at the level of the ultimate products (proteins or RNAs).
1.4.1 Structure of a gene
As implied by that definition a gene is made up of two distinct parts. These are firstly a
transcribed area, which is the portion of DNA that is actually converted into RNA and
secondly regulatory regions, which can occur either upstream or downstream of the
transcribed region. Regulatory regions within the vicinity of a gene provide recognition
signals for proteins known as transcription factors. These proteins regulate the transcription
rate of a gene by either carrying out the actual transcription, or by binding to DNA and either
promoting or silencing transcription (Maston et al. 2006). As the binding of the proteins to
these regions provides this functionality, the regions are known as transcription factor
binding sites.
Figure 1.1: General structure of a gene. Adapted from (Maston et al. 2006).
1.4.1.1 Regulatory region of a gene
A typical regulatory region associated with a gene consists of a promoter element and distal
regulatory elements (Maston et al. 2006). The promoter element consists of a core promoter
and proximal promoter elements and typically spans less than 1 kb (kilobase) pairs (Maston
et al. 2006). The core promoter of a gene is the region of DNA at which the proteins
primarily responsible for transcription bind and initiate the process of transcription.
Well-studied elements of the eukaryotic core promoter include the TATA box and the initiator or
Inr sequence (Brown 2006; Strachan and Read 2004). The TATA box generally has a
consensus sequence of 5′-TATAWAW-3′ where W is A or T (Brown 2006). The Inr
sequence has a consensus 5′-YYCARR-3′, where Y is C or T, and R is A or G (Brown 2006).
The TATA box and Inr sequence are present upstream of a large number of
eukaryotic genes. Most of the elements of the core promoter are
comprised of near identical DNA sequences.
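As an illustration of how such consensus sequences can be searched for, the following minimal Python sketch converts the two consensus strings quoted above into regular expressions. The helper names (consensus_to_regex, find_motif) and the test sequence are hypothetical, and only the ambiguity codes W, Y and R defined in the text are handled.

```python
import re

# Ambiguity codes used in the consensus sequences quoted in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}

def consensus_to_regex(consensus: str) -> re.Pattern:
    """Translate a consensus string such as TATAWAW into a compiled regex."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus))

def find_motif(pattern: re.Pattern, seq: str) -> list:
    """Return 0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile("(?=(" + pattern.pattern + "))")
    return [match.start() for match in lookahead.finditer(seq.upper())]

TATA_BOX = consensus_to_regex("TATAWAW")
INR = consensus_to_regex("YYCARR")

# Hypothetical promoter fragment containing one instance of each element.
fragment = "GGCTATAAATGGCCTCAGA"
tata_hits = find_motif(TATA_BOX, fragment)
inr_hits = find_motif(INR, fragment)
```

A regular expression treats every permitted base at a position as equally likely, so this approach is best regarded as a first-pass filter rather than a quantitative model of binding.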
The proximal promoter is generally located a few hundred base pairs upstream of the
core promoter element (Maston et al. 2006). This region of DNA typically contains binding
sites for other proteins, which contribute to the transcription of the gene but are not the
primary mechanism (Maston et al. 2006).
Distal regulatory regions tend to be further away from the transcribed portion of the gene and
contain elements that either activate or repress the transcription of the gene. Elements that
activate transcription are known as enhancers and conversely elements that repress it are
known as silencers (Raab and Kamakaka 2010).
1.4.2 Transcription
A family of enzymes known as RNA polymerases carries out the transcription of
DNA into RNA in eukaryotic cells (Brown 2006). This process is known as transcription as
the fundamental chemical language is not changed (Alberts 2002). There are three RNA
polymerases typically encoded by the eukaryotic genome (Strachan and Read 2004). RNA
polymerase I and RNA polymerase III tend to transcribe genes which code for functional
RNA molecules, while RNA polymerase II is generally utilised for the production of RNA
which is further translated into a protein (Alberts 1998). Transcription proceeds via the
following general steps (Brown 2006):
A protein known as TATA binding protein (TBP) binds to the TATA box sequence.
This causes a bend in the DNA molecule.
This bend provides a recognition signal for other transcription factors to bind to the
DNA creating a structure known as the preinitiation complex (PIC) (Brown 2006).
The formation of the PIC also disrupts base pairing thus creating a single stranded
DNA template from which the RNA molecule is synthesised.
RNA polymerase binds to the PIC and then moves along the single strand of DNA
creating a complementary RNA molecule that conforms to base pairing rules. This
RNA molecule is known as the primary transcript.
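The final step above, the synthesis of an RNA molecule complementary to the template strand, can be sketched in Python as follows. This is a minimal illustration that ignores the protein machinery entirely; the function name transcribe and the example strands are assumptions made for the purpose of the example.

```python
# Pairing between a DNA template base and the RNA base placed opposite it.
# Uracil (U) takes the place of thymine in the RNA product.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_5to3: str) -> str:
    """Return the primary transcript (5'->3') for a template strand given 5'->3'."""
    # The template is read 3'->5' while the RNA grows 5'->3',
    # so the template is reversed before pairing each base.
    return "".join(DNA_TO_RNA[base] for base in reversed(template_5to3.upper()))

template = "TTACCAGGCCAT"   # hypothetical template strand, written 5'->3'
coding = "ATGGCCTGGTAA"     # its complementary (coding) strand, written 5'->3'
transcript = transcribe(template)
```

The transcript has the same sequence as the coding strand with T replaced by U, which is why gene sequences are conventionally reported using the coding strand.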
1.4.2.1 Post Transcriptional processing

After the primary transcript has been produced it is subjected to further modifications. In the
case of primary transcripts associated with a protein coding gene, the primary transcript is also
known as pre-mRNA (messenger RNA). In order to explain why these modifications occur it
is necessary to understand how RNA molecules specify corresponding polypeptide
molecules.
1.4.2.1.1 Genetic Code
It was established in work by Francis Crick (Crick et al. 1961) that polypeptide chains are
specified by RNA molecules via triplets of nucleotides known as codons. As there are only
twenty naturally occurring amino acids in eukaryotic proteins, and 4^3 = 64 possible triplets
from the 4 nucleotide types, the genetic code is redundant. Three of the codons specify the
termination of the polypeptide chain and the remaining 61 specify amino acids.
The table below presents the genetic code.
Table 1.2: The genetic code (Brown 2006).

The process by which these codons are translated into these amino acids will be presented in
the next section. This code is widely utilised though there are a number of exceptions where a
different code is utilised, e.g. in translation of mitochondrial genes (Knight et al. 2001).
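The arithmetic behind the redundancy of the code can be checked with a short Python sketch. It assumes the standard genetic code in its commonly tabulated order (bases in the order U, C, A, G, with * marking a stop codon); the string below is not reproduced from Table 1.2 and the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; '*' marks a stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# product() varies the third base fastest, matching the order of AMINO_ACIDS.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
# Number of codons per amino acid, i.e. the degree of redundancy.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
```

Under this table methionine and tryptophan are each specified by a single codon, whereas leucine, serine and arginine are each specified by six.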
1.4.2.1.2 Open reading frames
Given this code a sequence of triplets that specify a chain of amino acids commencing with a
start codon and ending with a stop codon can be defined as an open reading frame (ORF)
(Brown 2006). An open reading frame can exist in 6 possible orientations as there are two
strands to a DNA molecule and an ORF can start from the first, second or third nucleotide
within either strand as illustrated below.
Figure 1.2: Starting positions for possible ORFs within a double stranded DNA molecule.
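The six possible reading frames can be enumerated with a minimal Python sketch such as the one below. It is an illustration under simplifying assumptions: the sequence is given as DNA, ATG is the only start codon considered, and an ORF is reported only if it ends in an in-frame stop codon. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ORF in one frame; end is just past the stop codon."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i                      # open an ORF at the first ATG
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3             # close it at the first in-frame stop
            start = None

def six_frame_orfs(seq):
    """Return (strand, frame offset, ORF sequence) for all six reading frames."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset):
                found.append((strand, offset, s[start:end]))
    return found

hits = six_frame_orfs("CCATGAAATAGCC")   # hypothetical sequence
```

In practice a minimum length threshold would normally be applied, since short ORFs arise frequently by chance in random sequence.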
1.4.2.1.3 Exons/Introns
ORFs as discussed above are subsections of the primary transcript or pre-mRNA molecule.
ORFs are interrupted within pre-mRNA by sections known as introns (Brown 2006). The
sections of the ORF thus separated by the introns are known as exons (Brown 2006). Thus in
order to produce a molecule containing the full uninterrupted ORF it is necessary to excise
the introns and splice the exons together as shown in Figure 1.4.
Figure 1.3: Exons and introns within a pre-mRNA molecule.
Figure 1.4: Exons post splicing.
Figure 1.5: Exons post splicing in an alternate configuration.
It is not necessary however for all the exons within a given ORF to be utilised (Brown 2006)
as shown in Figure 1.5. Different permutations of exons can be created to produce different
protein molecules. This process is known as alternative splicing and is responsible for the
disparity between the number of genes within a eukaryotic genome and the number of
proteins it is capable of producing (Strachan and Read 2004). Alternative splicing is a feature
of higher eukaryotes and contributes to overall protein diversity (Black 2003). Estimates of
the proportion of human genes that are alternatively spliced include 60% (Black 2003) and 74%
(Johnson et al. 2003).
1.4.2.2 Post Transcriptional processing (cont)
Having discussed the necessity of posttranscriptional modification, it is now possible to
move on to the mechanisms by which splicing is carried out, as well as other
elements of posttranscriptional processing.
1.4.2.2.1 RNA splicing
As mentioned above the primary transcript or pre-mRNA is treated so as to excise intronic
sequences and splice together exonic sequences. In order for this process to occur a necessary
first step is the recognition of the borders between exons and introns. These areas are known
as splice junctions (Strachan and Read 2004). It has been observed in a large number of cases
that introns in pre-mRNA commence with the sequence GU and end with the sequence AG
(Strachan and Read 2004). These dinucleotides are not in themselves sufficient to signal a
splice junction (Strachan and Read 2004) as splice junctions have been observed to show a
greater degree of conservation (Breathnach et al. 1978). In vertebrates the following motifs
have been observed at splice junctions (Brown 2006).
5′ splice site 5′-AG↓GUAAGU-3′
3′ splice site 5′-PyPyPyPyPyPyNCAG↓-3′
In these consensus sequences the ↓ symbol indicates the border between an exon and
intron or vice versa (Brown 2006). Py indicates that the nucleotide is a pyrimidine and N
indicates that any nucleotide could be present at this position (Brown 2006). In addition to
the conserved sequences at splice junctions introns also contain a conserved sequence around
40 bp away from the end of the intron known as the branch sequence (Strachan and Read
2004). A large RNA-protein complex known as the spliceosome carries out the
process of RNA splicing (Strachan and Read 2004). The spliceosome is one of the
largest molecular machines in the human cell containing ~170 distinct proteins (Valadkhan
and Jaladat 2010).
The process of RNA splicing typically involves the following sequence (Brown 2006;
Strachan and Read 2004):
Cleavage of the 5′ splice junction detaching the exon from the intron at one end.
The attachment of the cleaved 5′ end to the branch sequence forming a lariat like
structure.
Removal of the intronic lariat like RNA structure and the ligation of the two
exons.
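The net effect of the splicing steps above, together with the alternative splicing described earlier, can be sketched in Python by treating the pre-mRNA as a string with known exon coordinates. This is a toy model under stated assumptions: the sequences and coordinates are invented, the lariat intermediate is not modelled, and the function names are hypothetical.

```python
def splice(pre_mrna, exons, keep=None):
    """Join exons given as (start, end) half-open, 0-based coordinates.

    keep is an optional collection of exon indices to retain, which
    models alternative splicing by skipping the remaining exons.
    """
    chosen = exons if keep is None else [exons[i] for i in sorted(keep)]
    return "".join(pre_mrna[start:end] for start, end in chosen)

def introns(pre_mrna, exons):
    """Return the sequences lying between consecutive exons."""
    return [pre_mrna[a_end:b_start]
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])]

def follows_gu_ag_rule(intron):
    """Check the GU...AG boundary dinucleotides noted in the text."""
    return intron.startswith("GU") and intron.endswith("AG")

# Hypothetical pre-mRNA built from three exons and two introns.
exon_seqs = ["AUGGCC", "GAAUUU", "UGGUAA"]
intron_seqs = ["GUAAGUCCCUUUCAG", "GUAAGUAAACCCUAG"]
pre_mrna = (exon_seqs[0] + intron_seqs[0] + exon_seqs[1]
            + intron_seqs[1] + exon_seqs[2])
exons = [(0, 6), (21, 27), (42, 48)]

full_length = splice(pre_mrna, exons)            # all three exons joined
exon_skipped = splice(pre_mrna, exons, {0, 2})   # second exon omitted
```

As the text notes, the GU and AG dinucleotides alone are not sufficient to identify a splice junction, so the boundary check here is a necessary condition only.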
1.4.2.2.2 Capping
Another step in posttranscriptional modification of protein-coding genes is capping. This
process is the first step in posttranscriptional processing of eukaryotic pre-mRNAs (Alberts
2002). This entails the addition of a methylated nucleoside (a nucleoside is a molecule
consisting of a deoxyribose or ribose sugar bound to a nitrogenous base (Brown 2006)) to the
5′ end of the transcript (Strachan and Read 2004). This process protects the
transcript from rapid degradation via ribonuclease digestion (Strachan and Read 2004).
1.4.2.2.3 Polyadenylation
After the termination of transcription the primary transcript is also modified via the addition of
about 200 adenosine nucleotides to the 3′ end of the transcript (Alberts 2002). This structure
is known as a poly-A tail. The process is thought to facilitate the transport of the mature
mRNA molecule into the cytoplasm (Strachan and Read 2004).
1.4.3 Translation
After a transcript associated with a protein-coding gene has been transcribed and processed, it
then migrates to the cytoplasm, where a process known as translation occurs. This process
entails the production of a polypeptide chain that is specified by the transcript via the genetic
code. The mature mRNA molecule is not synonymous with an ORF (Strachan and Read
2004). Generally an ORF is a subsection within the mature transcript. The ORF is flanked by
sequences known as the 5′ UTR and 3′ UTR (UTR = untranslated region) (Brown 2006) as
illustrated in Figure 1.6.
Figure 1.6: Schematic of mature mRNA.

The process of translation occurs at cytoplasmic structures known as ribosomes. Ribosomes
are large RNA-protein complexes, which consist of two subunits (Strachan and Read 2004).
The larger subunit is known as the 60S subunit and consists of three different types of
ribosomal RNA (rRNA) molecule and up to 50 ribosomal proteins (Strachan and Read 2004).
The smaller subunit is known as the 40S subunit and contains a single rRNA molecule and
over 30 ribosomal proteins (Strachan and Read 2004). The two subunits of the ribosome exist
as separate entities and attach for the process of translation.
The other molecule that provides the physical basis for the implementation of the
genetic code is transfer RNA (tRNA). tRNA has a secondary structure consisting of four
double helical structures as illustrated in Figure 1.7. tRNA attaches to an amino acid at its 3′
end. The anticodon arm of the tRNA molecule has a triplet sequence, which is
complementary to the codon of the amino acid to which it is bound. Thus tRNA attaches
codons to their corresponding amino acids.
Figure 1.7: Structure of a tRNA molecule. Adapted from (Alberts 2008).
The process of translation typically proceeds via the following steps (Strachan and Read
2004):
The two subunits of the ribosome attach to each other and also to a mature mRNA
molecule at the methylated cap at the 5′ end.
The mRNA molecule is then pulled through the ribosome.
When a start codon is encountered a tRNA molecule with an anticodon arm
complementary to the start codon enters the ribosome. This tRNA molecule will
have the relevant amino acid pre-bound to it.
The next tRNA corresponding to the next codon will then enter the ribosome.
The amino acid attached to the first tRNA will detach from the tRNA and attach
to the amino acid attached to the 3′ end of the second tRNA.
This process is iterated constructing a polypeptide chain or protein molecule.
When a stop codon is encountered a protein known as a release factor causes the
ribosome to dissociate and release the protein molecule.
In order to prevent premature folding of proteins during translation the emerging
polypeptide chain is stabilised by proteins known as chaperones (Alberts 2008).
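The steps above can be condensed into a minimal Python sketch that scans a mature mRNA from the 5′ end, begins translating at the first AUG and stops at the first in-frame stop codon. It assumes the standard genetic code and ignores the tRNA and ribosome machinery; the function name translate, the dictionary keys and the example transcript are all illustrative.

```python
from itertools import product

# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)}

def translate(mrna):
    """Split an mRNA into 5' UTR, ORF and 3' UTR and translate the ORF.

    Returns None if there is no AUG or no in-frame stop codon.
    """
    start = mrna.find("AUG")               # simplification: first AUG is the start
    if start == -1:
        return None
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":               # stop codon: release the chain
            return {"utr5": mrna[:start], "orf": mrna[start:i + 3],
                    "utr3": mrna[i + 3:], "protein": "".join(protein)}
        protein.append(amino_acid)          # extend the polypeptide by one residue
    return None

# Hypothetical transcript: 5' UTR + ORF + 3' UTR ending in a short poly-A run.
result = translate("GGCACC" + "AUGGCCUGGUAA" + "CUUAAAAAA")
```

The returned protein is written from the N-terminus to the C-terminus, matching the convention given earlier in this chapter.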
1.5 Genomics
The term genome can be defined as the entire genetic complement of a living organism
(Brown 2006). The field of study around ascertaining information about the genome of a
living organism is thus known as genomics. The primary step of any full genomic study is the
determination of the DNA sequence of the genome of the organism in question. Once this has
been determined the next step is annotating the sequence.
1.5.1 Genome annotation
The full genome of an organism is generally a mosaic of functional and non-functional
elements. The percentage of an organism's genome that is functional is variable. In the case
of the human genome it has been calculated that potentially between 2.56% and 3.25% is
functional (Lunter et al. 2006).
Functional elements in a genome include:
Genes.
DNA binding sites.
CpG Islands: These are stretches of the dinucleotide repeat CG. These areas of DNA
are subject to methylation, which is a form of epigenetic control over gene
transcription (Kawaji and Hayashizaki 2008).
Genome annotation can be described as the systematic location of these functional
elements within a genome sequence (structural annotation) and the ascertainment of that
function (functional annotation). The location of functional elements is based on the principle
of sequence specifying function. Thus the sequence of a functional element will vary in some
detectable way from the remainder of the background sequence.
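One simple example of a functional element differing detectably from background sequence is the CpG island. The Python sketch below computes the ratio of observed to expected CG dinucleotides, a statistic commonly used for this purpose. The statistic, the function name and the example sequences are assumptions introduced for illustration and are not taken from the sources cited above.

```python
def cpg_observed_expected(seq):
    """Return the observed/expected ratio of CG dinucleotides in a DNA sequence.

    The expected count assumes C and G occur independently:
    expected = (number of C * number of G) / sequence length.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")
    return observed * len(seq) / (c_count * g_count)

island_like = cpg_observed_expected("CGCGCGCG")   # CG-rich toy sequence
background = cpg_observed_expected("CCCCGGGG")    # same composition, one CG
```

Both toy sequences have identical base composition, so the difference in the ratio reflects dinucleotide arrangement alone.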
1.5.2 Genome and cDNA assembly
The initial challenge following the generation of sequencing data is the fact that the output of DNA
sequencing is generally reads of short stretches of DNA. These reads range in length from
>700 bp for Sanger sequencing (Hert et al. 2008), through ~200 bp for pyrosequencing
(Sundquist et al. 2007), down to ~50 bp for ligation based sequencing methods
(McKernan et al. 2009).
These short reads have to be assembled into a full sequence for the whole genome.
This process is known as contig assembly. Contig assembly is carried out through scanning a
set of short reads for overlaps. The discovery of an overlap indicates that two fragments are
contiguous and should be connected. This process is necessary both at the level of the full
genome as well as at the level of the individual gene (Wang et al. 2005a).
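The overlap-based procedure described above can be illustrated with a minimal greedy assembler in Python. This is a sketch only: it assumes error-free reads from a single strand, merges the pair with the longest exact overlap at each step, and uses hypothetical function names and reads. Production assemblers are considerably more sophisticated.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))       # drop duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:                         # no remaining overlaps
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

# Three hypothetical reads sampled from the sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
contigs = greedy_assemble(reads)
```

The minimum overlap length guards against joining reads on the basis of short matches that occur by chance; repeats longer than the reads remain a problem for any overlap-based method.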
1.5.3 Gene detection
Given a fully sequenced and assembled genome lacking annotation there are a number of
computational techniques available to delineate coding sequence. These can be divided into
two main subtypes: extrinsic and intrinsic (Borodovsky et al. 1994). Extrinsic methods utilise
comparisons of sequence data to an external reference point while intrinsic methods evaluate
sequences based on properties that are internal to the sequence (Borodovsky et al. 1994).
Construction of a cDNA library is one of the standard methods of extrinsic gene
detection. cDNA stands for complementary DNA and is created through application of an
enzyme known as reverse transcriptase to mature mRNA. Reverse transcriptase as the name
implies reverses the process of transcription and creates a DNA strand complementary to the
single stranded mRNA. Further steps are then taken in order to create a double stranded DNA
molecule (Strachan and Read 2004).
A library of cDNA sequences is compiled through the collection of mRNA molecules
from cells under various experimental conditions. This RNA is then converted to cDNA
using the enzyme reverse transcriptase. The resultant cDNA is then amplified using the
polymerase chain reaction (PCR) (Mount 2004) and then sequenced. The library of sequences
thus generated corresponds to the sequence of protein coding genes within the genome minus
the introns. These sequences are then systematically mapped onto the genomic sequence
using local alignment algorithms. This technique is known as cis-alignment. There are a
number of local sequence alignment programs that can be used to carry out these alignments.
Exonerate is one such program. It utilises a bounded dynamic programming approach (Slater and
Birney 2005) to generate local alignments. Dynamic programming is discussed in more detail later in
this chapter. Another program that can be utilised is Spidey (hosted by the NCBI). This
program employs the Blast heuristic algorithm (Altschul et al. 1990) to generate its
alignments. SIM4 is another program that utilises an algorithm based on Blast but tailored to
the specific problem of mapping cDNA to genomic DNA by factoring in introns and
potential sequencing errors (Florea et al. 1998).
The Ensembl automatic genome annotation system (Curwen et al. 2004;Potter et al.
2004) uses the algorithm GeneWise (Birney et al. 2004) to map cDNA to full genomic data
and the algorithm GenomeWise (Birney et al. 2004) to create a final putative structure for the
gene in question following the initial alignment. Cis-alignment can be considered to be one of the
most reliable methods for protein coding gene detection/prediction (Brent 2008).
In cases where cDNA libraries are not available or incomplete for the organism under
consideration it is also possible to use cDNA sequences of homologous genes from either the
same species or a different species in order to detect coding sequence. This technique is also
referred to as trans-alignment and is central to various gene prediction tools (Brent 2008).
The GeneWise algorithm (Birney et al. 2004) is also used in this context by the Ensembl
pipeline (Potter et al. 2004). Extrinsic methods for genome annotation are far more cost and
labour intensive than the strictly in-silico intrinsic approach.
Intrinsic approaches to gene detection are predominantly computational and as such
require an explicit definition of a gene in order to distinguish between coding and non-coding
sequence (Picardi and Pesole 2010). Picardi (Picardi and Pesole 2010) gives a good working
definition of a gene for detection purposes, which defines a gene as a transcribed region of
DNA whose expression is regulated by cis-acting elements such as upstream promoters.
Examples of tasks undertaken as a part of intrinsic gene detection include:
ORF (Open Reading Frame) detection: Detection of a potential ORF in genomic
DNA is an indicator of a potential gene (Mount 2004). As prokaryotes in most cases
(exceptions are pointed out in (Edgell et al. 2000)) lack an exon-intron structure, ORF
detection drastically reduces the search space for potential genes in the case of
prokaryotes.
Promoter region detection: Genes are typically associated with one or more
promoter regions. In prokaryotes these include the upstream Pribnow box with the
consensus sequence TATAAT. This sequence is homologous to the eukaryotic
TATA box (Berg et al. 2007). Detection of these motifs within a sequence upstream
of an ORF strengthens the case for a potential gene.
Internal splice junction detection: As the sequence of exon-intron borders is broadly
conserved, discovery of splice junctions can also contribute to the case for a
prospective gene.
These features can be detected within a stretch of sequence using various
techniques to model sequence motifs ranging from simple regular expressions to hidden
Markov models and position weight matrices (Picardi and Pesole 2010). Examples of specific
applications of the intrinsic approach to gene prediction include SNAP (Korf 2004) and
Genscan (Burge and Karlin 1997), both of which utilise Markov models in order to detect
delineating features of genes. The primary weakness of the intrinsic approach lies in the fact
that it requires a representative sample of protein coding genes specifically from the
organism under consideration in order to operate (Aubourg and Rouze 2001).
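The simpler end of this range of techniques can be illustrated with a short sketch. The Python code below is a minimal illustration only, assuming a forward-strand scan, the standard start and stop codons and an exact match to the TATAAT consensus. The function names, the minimum ORF length and the size of the upstream window are hypothetical choices and are not taken from any of the programs cited above.

```python
import re

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon; min_codons
    counts both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def has_pribnow_box(seq, orf_start, window=20):
    """Check for an exact TATAAT match within `window` bases upstream of an ORF."""
    upstream = seq.upper()[max(0, orf_start - window):orf_start]
    return re.search("TATAAT", upstream) is not None


# Hypothetical sequence: a TATAAT motif followed by a four-codon ORF.
example = "CCTATAATCCGCATGAAACCCTAAGG"
orfs = find_orfs(example)
supported = [has_pribnow_box(example, start) for start, _ in orfs]
```

An exact regular expression match of this kind is generally too strict for real data, as promoter motifs vary around the consensus, which is one reason why position weight matrices and hidden Markov models are used in practice.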
1.5.4 Functional annotation of genes
After a putative gene has been identified, the next stage is determination of the exact
biological role of the product coded for. This process can be carried out computationally or
by laboratory based techniques.
1.5.4.1 Laboratory based techniques
Laboratory based techniques for determination of biological function involve alteration of the
gene in question, either in the organism of study (in the case of prokaryotes, unicellular
eukaryotes and those higher eukaryotes which are deemed suitable) or, in the case of
organisms where modification would be impractical or unethical such as Homo sapiens,
alteration of the homologous gene in a model organism. The main model organism of choice
for study of mammalian gene function is Mus musculus (Kim et al. 2010). The main
alterations that are possible include:
Knockouts: This entails the removal of the gene in order to observe the effects of its
absence. This technique is only effective if the gene in question is not essential to
organism survival and has a visible/measurable effect on phenotype (Moore 1999).
Alteration in expression: In cases where the gene in question is essential to the
survival of the organism, alterations can be made to the cis-regulatory regions of the
gene in question in order to affect levels of expression (Capecchi 2005).
In order to physically pinpoint specific tissues (in the case of multi-cellular
organisms) or areas within a cell in which a protein is active it is possible to place a
reporter gene such as GFP (green fluorescent protein) under the control of the promoter
region of the gene of interest (Chalfie et al. 1994).
Detection of genetic interactions: The interaction of two non-essential genes (and
hence their associated proteins) can be detected if the mutation of both genes leads to
lethality (von Mering et al. 2002). This method has been applied to a large-scale study
in Saccharomyces cerevisiae in order to characterise its set of genetic interactions
(Ooi et al. 2006). The detection of an interaction partner of known function can aid in
the determination of the function of an unknown gene.
1.5.4.2 Computational methods for functional annotation of genes

Computational methods to determine gene function have only become applicable relatively
recently as most computational methods depend on comparison of novel sequence data with
sequence of known function. Computational methods of functional annotation of genes can
be split into two broad categories (Pellegrini 2001):
Alignment based methods.
Genome context methods.
1.5.4.2.1 Alignment based methods

Sequence alignment is a problem that has been at the heart of bioinformatics since the
inception of the field. The basic sequence alignment problem is searching two strings for
areas of similarity (Mount 2004). The products of genes with similar/identical sequences are
extremely likely to carry out the same function. Genes that share a significant degree of
sequence similarity are potentially homologous (descended from a recent common ancestral
gene) to each other. Using these methods, laboratory-based annotation only
needs to be carried out on one representative of a given set of identical sequences and the
derived functional annotation can be applied to all members. Alignment methods can be
applied at either the gene or the protein level.
There are three primary ways of carrying out pairwise sequence alignments.
Dot matrix analysis: This method entails arranging one sequence horizontally and the
other sequence vertically, starting from the left end of the horizontal
sequence. Matches between the two sequences are then marked with a dot. Areas of
similarity can then be viewed as diagonal lines between the two sequences (Mount
2004).
Dynamic programming: Dynamic programming is a programming paradigm which
entails the reduction of a large problem to a series of sub-problems whose solutions
are constructed incrementally and combined to provide the overall solution (Russell et
al. 2003). In terms of sequence alignment it entails the construction of a matrix
similar to the dot matrix and the calculation of a path through it, where the next step in the
path is determined only by the state of the current cell and its neighbouring cells.
Two popular dynamic programming algorithms utilised in pairwise alignment of
sequences are the Needleman-Wunsch algorithm (Needleman and Wunsch 1970),
which returns an optimal global alignment of two sequences, and the previously
mentioned Smith-Waterman algorithm (Smith and Waterman 1981), which provides
an optimal local alignment. Both of these algorithms are proven to return the
optimal alignment between two sequences under a given scoring scheme (Mount 2004).
Heuristic Algorithms: Both the dynamic programming algorithms mentioned above
are O(n²) in terms of both memory utilisation and time taken to run (Mount
2004). As such, heuristic algorithms such as Blast (Altschul et al. 1990) and Fasta
(Pearson and Lipman 1988) were developed as usable alternatives. The Fasta
algorithm constructs a sequence alignment by searching for matching sequence
patterns called k-tuples. These patterns are k consecutive matches between the two
sequences. These matches are then extended to provide the alignment (Krane and
Raymer 2003). Blast constructs an alignment in a similar manner by locating short
matches and then building an alignment around them. The difference between Blast and
Fasta is that while Fasta examines all possible k-tuples the Blast algorithm is
restricted to only examining matches that are significant and score over a given
threshold (Mount 2004). These matches have to be of a minimum length to achieve
significance. This is 3 for proteins and 11 for DNA. Significance for proteins is
judged through use of the BLOSUM62 substitution matrix (Mount 2004). Given the
rapid expansion of most of the large sequence databases it is typical to use heuristic
algorithms as a search tool.
Profile Hidden Markov models have been used by Eddy (Eddy 1998) to create a
scoring system which allows detection of remotely homologous sequences. Hidden
Markov models score the probability of a discrete chain of observed events generated by
a sequence of underlying states that cannot be observed directly (Durbin 1998).
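The dynamic programming approach described in the list above can be sketched in a few lines. The Python code below is a minimal illustration that returns only the optimal score rather than the alignment itself. It assumes a simple scoring scheme of +1 for a match, -1 for a mismatch and a linear gap penalty of -1; these values and the function name are assumptions chosen for illustration rather than the parameters of any published tool. Setting local to True gives the Smith-Waterman style variant, in which cell scores are not allowed to fall below zero.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style) score.
    local=True:  local (Smith-Waterman style) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)

    # Global score is the bottom-right cell; local score is the matrix maximum.
    return best if local else score[-1][-1]


global_score = alignment_score("ACGT", "AGT")
local_score = alignment_score("TTTACGTTT", "GGACGGG", local=True)
```

The full matrix is filled in both cases, which is the source of the quadratic time and memory cost mentioned above.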
Alignment methods can also be applied to the three dimensional structures of protein
molecules as well as to sequence (Hasegawa and Holm 2009). This method is potentially useful
in cases where sequence divergence reaches a point where two proteins can no longer be
identified as homologous. However, as the rate of structure generation lags behind sequence
generation by a considerable degree, this method can only be applied in a small subset of
cases.
Detection of a significant alignment with a gene of known function can be used to attach
the same function to a gene of unknown function. Martin (Martin et al. 2004) used GO terms
(Ashburner et al. 2000) in conjunction with Blast (Altschul et al. 1990) to achieve this with
some success. There is however a danger with alignment based methods of a Chinese
whispers effect. Suppose for example that a gene p with known function a displayed 90 %
identity, using some form of pairwise alignment algorithm, with a gene q of unknown function.
Assigning function a to gene q would seem to be intuitively legitimate. However if gene q
was assigned function a and the process was iterated a number of times a situation could arise
where a gene x would be assigned function a with little or no sequence similarity to the
original gene p. Examples of incorrect annotation by automated methods of homology
detection occur in the case of genes where translations of the antisense strand of the coding
region are entered into databases such as GenBank (Linial 2003).
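The scale of this effect can be illustrated with a worst-case sketch. The Python code below assumes equal-length, ungapped sequences and a hypothetical chain in which each newly annotated gene differs from its predecessor at ten previously unchanged positions out of one hundred. The sequences and function names are invented for illustration.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def drift_chain(seed, steps, changes_per_step=10):
    """Build a worst-case chain: each step substitutes a fresh block of positions."""
    chain = [seed]
    current = list(seed)
    for step in range(steps):
        begin = step * changes_per_step
        end = min(begin + changes_per_step, len(current))
        for pos in range(begin, end):
            current[pos] = "C" if current[pos] != "C" else "G"
        chain.append("".join(current))
    return chain


p = "A" * 100                      # hypothetical gene p of known function
chain = drift_chain(p, steps=10)   # p, q, ... , x

# Identity between each gene and the gene its annotation was copied from.
neighbour_identities = [identity(a, b) for a, b in zip(chain, chain[1:])]
# Identity between each gene and the original gene p.
identity_to_origin = [identity(p, s) for s in chain]
```

Every link in this chain satisfies the 90 % identity criterion, yet after ten transfers the final sequence shares no positions with p. Real sequence families would rarely drift this systematically, so the sketch shows an upper bound on the problem rather than a typical case.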
1.5.4.2.2 Genome context methods
The recent proliferation of genome data has made it possible to detect and assign function to
proteins through examination of their genomic context. Genome context methods compare
the context of a gene between genomes (i.e. the arrangement of its homologues
in other genomes). Context methods are based on the principle of guilt by association, which
is the hypothesis that genes which show proximity or association by some measure, e.g.
phyletic distribution or chromosomal ordering, are functionally associated (Aravind 2000).
Thus through demonstration of functional association or interaction between one gene/protein
of known function with one of unknown function, the latter entity may be annotated with the
function of the former.
1.5.4.2.2.1 Rosetta stone
The Rosetta stone method, or detection of domain fusion, was introduced through work by
Marcotte (Marcotte et al. 1999) and Enright (Enright et al. 1999) which showed that sets of
separate proteins in one organism which exist in a unified (fused) homologous form in
another organism are likely to be interaction partners. As fusion events are comparatively
rare and generally affect genes that are tightly functionally coupled, this method is effective at
detection of interaction partners (Kensche et al. 2008). However the rarity of these events
lowers the overall coverage of this method.
1.5.4.2.2.2 Gene neighbour
Examination of nine bacterial and archaeal genomes by Dandekar (Dandekar
et al. 1998) showed that the proteins encoded by genes which showed conserved physical
order along a chromosome tended to interact physically.
1.5.4.2.2.3 Interolog detection
The term interolog was introduced by Walhout (Walhout et al. 2000). An interolog is a
conserved interaction between a pair of proteins which have interacting homologs in another
organism. If both proteins involved in an interaction in a given organism are conserved in
another organism a similar interaction can be inferred in the second organism. This method
has shown comparable accuracy with large-scale experimental data (Yu et al. 2004b).
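The inference step can be sketched as a simple mapping. The Python code below is an illustration only: it assumes a one-to-one ortholog table, and the protein identifiers and the function name are hypothetical.

```python
def infer_interologs(source_interactions, orthologs):
    """Transfer interactions from a source organism to a target organism.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    orthologs: dict mapping each source protein to its target ortholog.
    An interaction is transferred only if both partners have an ortholog.
    """
    predicted = set()
    for a, b in source_interactions:
        if a in orthologs and b in orthologs:
            # frozenset makes the pair unordered, so (x, y) equals (y, x).
            predicted.add(frozenset((orthologs[a], orthologs[b])))
    return predicted


# Hypothetical yeast interactions and yeast-to-human ortholog assignments.
yeast_interactions = [("yA", "yB"), ("yA", "yC")]
yeast_to_human = {"yA": "hA", "yB": "hB"}

predictions = infer_interologs(yeast_interactions, yeast_to_human)
```

Only the first interaction is transferred here, as the second involves a protein with no ortholog in the target organism. Real ortholog assignments are often one-to-many, which this sketch does not attempt to handle.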
1.5.4.2.2.4 Phylogenetic profiling
Phylogenetic profiling is a method that operates on the hypothesis that functionally linked
proteins evolve in a correlated manner (Pellegrini et al. 1999). Consider for example a
group of genes/proteins which exist as a self-contained modular group and are associated
with a particular cellular function. If this associated function was no longer needed by a given
set of organisms the selective pressure to maintain all the genes/proteins within that group
would be lowered, thus leading to an eventual correlated cascade of losses for the genes in
question. Genes are primarily lost through pseudogenisation, which is the conversion of a
functional gene to a non-functional copy. This can be caused by mutations that cause the
premature truncation of a transcript through the creation of a premature stop codon or a
mutation in upstream cis-regulatory sequences thus removing the potential for transcription
(Brown 2006). Pseudogenes can also be formed through retrotransposition of mature mRNA
(Graur et al. 1989).
Thus through examination of multiple genomes for correlations in the presence and
absence of proteins, potential functional linkages can be detected. A phylogenetic profile is
typically a binary string representing the presence or absence of a homolog of a given
gene/protein in each genome examined. Predictions are made through examination of levels
of similarity between these strings. These predictions are suggestive rather than specific, as it is unclear
what the nature of a functional linkage between two proteins with similar profiles might be.
The relationship could be a direct physical interaction such as subunits involved in
heterodimerisation or more indirect such as the link between a transcription factor and the
product of its associated gene.
The first use of phylogenetic profiles to predict functional linkages used Hamming
distance as a metric in order to cluster similar profiles (Pellegrini et al. 1999). The Hamming
distance of two strings can be defined as the number of points at which they differ (Hamming
1950). There have been various extensions and reinterpretations of the method since then
(Ranea et al. 2007). Some of these involved examination of profiles using higher-order logical
operations to carry out more complex comparisons of profiles (Bowers et al. 2004; Antonov
and Mewes 2008). The method was also applied to protein domains rather than whole
sequences (Pagel et al. 2004b). Work by Ranea utilised domain information from the Gene3D
database to create phylogenetic profiles of the presence and absence of structural domains
within genomes (Ranea et al. 2007). This method thus bypasses the problem of identification
of genes that are functionally homologous by focussing on the presence and absence of
predefined domains within proteins. Chen and Vitkup used examination of correlation
coefficients to measure similarity in phylogenetic profiles (Chen and Vitkup 2006). They
observed that the method was successful in identifying genes that were members of the same
metabolic pathways (Chen and Vitkup 2006).
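The two similarity measures mentioned above can be sketched for binary profiles as follows. This is a minimal illustration: the profiles are hypothetical, the function names are invented, and the use of the Pearson coefficient in particular is an assumption, as the exact coefficient used in the cited work is not stated here.

```python
from math import sqrt


def hamming_distance(profile_a, profile_b):
    """Number of genomes in which exactly one of the two proteins is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def profile_correlation(profile_a, profile_b):
    """Pearson correlation between two binary presence/absence profiles."""
    x = [int(c) for c in profile_a]
    y = [int(c) for c in profile_b]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Hypothetical profiles over six genomes (1 = homolog present, 0 = absent).
protein_1 = "110100"
protein_2 = "110101"
protein_3 = "001011"

close = hamming_distance(protein_1, protein_2)   # profiles differ in one genome
far = hamming_distance(protein_1, protein_3)     # profiles differ in every genome
```

Neither measure takes the relationships between the genomes into account, which is the limitation addressed by the phylogeny-aware extensions to the method.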
As a tool phylogenetic profiling could be used to detect errors in genome annotation
through the detection and display of gene absences which are not plausible in closely
related species. A similar approach has in fact been used by Pinney to detect and annotate
enzyme-coding genes in the protist E. tenella (Pinney et al. 2005).
Other extensions to the method, which involve the utilisation of the phylogenetic
relationships of the organisms, include work by Barker and Pagel (Barker and Pagel 2005).
This method made use of an explicit phylogeny and ancestral reconstruction over the
phylogeny based on a continuous-time Markov model. The likelihood of a model of
dependent or contingent evolution was compared with the likelihood of a model of
independent evolution over the phylogeny. This method was then further extended by
investigating the effects of constraining the rate at which genes could be acquired over the
phylogeny (Barker et al. 2007).
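The comparison of the two models can be expressed as a likelihood ratio statistic, which is twice the difference between the maximised log-likelihoods. The Python sketch below illustrates only this final step. The log-likelihood values and function names are hypothetical, and the fitting of the two models over the phylogeny, which is the substantial part of the method, is not shown.

```python
def likelihood_ratio(lnl_dependent, lnl_independent):
    """Likelihood ratio statistic: twice the difference in log-likelihood."""
    return 2.0 * (lnl_dependent - lnl_independent)


def rank_pairs(log_likelihoods):
    """Order protein pairs from strongest to weakest support for dependence.

    log_likelihoods maps a pair name to (lnL dependent, lnL independent).
    """
    scores = {
        pair: likelihood_ratio(dep, indep)
        for pair, (dep, indep) in log_likelihoods.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical maximised log-likelihoods for two protein pairs.
example = {"pair_1": (-20.0, -27.5), "pair_2": (-31.0, -31.4)}
ranking = rank_pairs(example)
```

A larger statistic indicates that the model of dependent evolution explains the two profiles better than the model of independent evolution. How the statistic is then assessed for significance is a separate choice that this sketch does not make.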
Other methods of incorporating phylogenetic information included the work by Vert
(Vert 2002), which utilised support vector machines, as well as the work by Cokus (Cokus et
al. 2007), which utilised phylogeny as a heuristic by ordering profiles by the phylogenetic
closeness of the organisms involved.
1.5.4.2.2.5 Comparative methods
Comparing phylogenetic profiles over a phylogenetic tree can be considered to be an
application of the comparative method to traits at the molecular level. The comparative
method is a well-established method in biology (Harvey and Pagel 1991). The fundamental
idea underpinning the comparative method is to examine how the state of one factor (which can be a
trait or environmental condition) influences the state of another in the context of the
topology of a phylogenetic tree (Maddison 1990). Testing for correlations without
considering the phylogeny will detect correlations in gene content based on phylogenetic
relationships rather than functional linkage. For example the set of all genes that are intrinsic
to the class Mammalia will share similar phylogenetic profiles. This does not however
suggest that they are all functionally linked.
There are a number of tests that have been developed in order to test the correlations
in the states of traits over a phylogeny. Ridley (Ridley 1983) developed one of the earliest of
these tests. This test involved the construction of a 2x2 contingency table where the state of
each trait was considered as a categorical variable defined at each node in the tree. The
method assumed the construction of an accurate phylogeny and accurate reconstruction
of ancestral states for each node within the phylogeny. Ridley's method did not however
differentiate between dependent and independent variables in measuring the significance of a
given set of changes (Maddison 1990). The method also did not take into account the sequence of
changes in the states of traits (i.e. was a change in state A followed by a change in state B or
vice versa). This makes the results of the method difficult to interpret (Maddison 1990).
Felsenstein (Felsenstein 1985b) developed another test for correlations in traits
over a phylogeny. This test was developed to measure continuous data and modelled changes
over a tree as a Brownian process. Another test for detection of correlations in traits and/or
external environmental conditions was devised by Grafen. This test was a phylogenetically
corrected regression, which did not rely on any form of ancestral reconstruction (Grafen
1989).
Maddison developed a similar test to Ridley's in 1990 (Maddison 1990). It did however
distinguish between dependent and independent variables by defining areas of a phylogeny
to be in state A or state B depending on the state of one of the traits under consideration. The
test then measured how many of the changes in the other trait occurred in the area of the tree
that was in state A compared to how many changes were possible over the whole tree.
One of the issues with the tests described above was the fact that none of them
integrated information on branch lengths of the phylogeny. This meant that a change in the
state of a given trait was treated as equally likely over any branch of a phylogenetic tree
regardless of its length. However a change is clearly less likely on a short branch than on a
longer branch. Work by Pagel took this into account by integrating branch lengths into a test
for correlated evolution (Pagel 1994). The parameters defined by this work were utilised by
Barker and Pagel in their approach to phylogenetic profile analysis (Barker and Pagel 2005).
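The dependence of the probability of change on branch length can be illustrated with a generic two-state continuous-time Markov model of a single trait. The Python sketch below is an assumption made for illustration: it is not the parameterisation of the cited test, which models pairs of traits jointly, and the rate values and function name are hypothetical.

```python
from math import exp


def prob_gain(rate_gain, rate_loss, branch_length):
    """Probability that a trait absent at the start of a branch is present at its end.

    Two-state continuous-time Markov model with instantaneous rates
    rate_gain (0 -> 1) and rate_loss (1 -> 0).
    """
    total = rate_gain + rate_loss
    return (rate_gain / total) * (1.0 - exp(-total * branch_length))


# Hypothetical rates; the probability of change grows with branch length.
short_branch = prob_gain(0.5, 0.5, branch_length=0.1)
long_branch = prob_gain(0.5, 0.5, branch_length=1.0)
```

On a branch of length zero no change is possible, and as the branch length increases the probability approaches the equilibrium frequency of the trait, rate_gain / (rate_gain + rate_loss).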
1.5.4.2.2.6 Mirror trees
Another method of detecting potential protein interactions is known as mirror trees. This
method involves the detection of protein interactions through the construction and
comparison of phylogenetic trees of proteins within a single genome (Pazos and Valencia
2001). The rationale behind this method is similar to that of phylogenetic profiling. However
correlation is sought not in the presence and absence of homologous genes but in the pattern
of sequence evolution of interacting proteins. Trees are compared by examining the distance
matrices of homologous sequences for correlations. These matrices are the inputs used in the
formation of the trees in question. The phylogenetic tree of any given protein in a genome
will however carry signal from the speciation events which shaped the genome of the
organism in question. An upgrade of the method has been developed to take into account this
background similarity (Pazos et al. 2005). Hakes and others have however pointed out that
the evolutionary pressures as well as the functional constraints on duplicated genes differ
depending on whether the mechanism of duplication was whole genome duplication or
small-scale duplication (Hakes et al. 2007). This indicates that sequence divergence and functional
evolution are not necessarily correlated (Robertson and Lovell 2009). Thus any similarity in
the phylogenetic trees of functionally linked genes is more likely to be due to chance or as
mentioned above due to background similarity.
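The comparison of distance matrices at the core of the mirror tree approach can be sketched as a correlation between their off-diagonal entries. The Python code below is a minimal illustration that assumes both matrices are symmetric and list the same species in the same order. The distances and function names are hypothetical, and no correction for background similarity is attempted.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mirror_tree_score(dist_a, dist_b):
    """Pearson correlation between two distance matrices over the same species."""
    x, y = upper_triangle(dist_a), upper_triangle(dist_b)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical pairwise distances for two protein families in three species.
family_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
family_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]   # same pattern, faster evolving

score = mirror_tree_score(family_a, family_b)
```

Because both families share the underlying species tree, a high score of this kind is expected to some degree even for unrelated proteins, which is the background similarity discussed above.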
1.5.5 Storage of functional information
With the exponential increase in sequence data that has been generated through the 2000s
there have been a number of attempts to organise and contextualise functional
information surrounding genomic entities.
1.5.5.1 GO
A notable attempt to do this has been the establishment of a controlled vocabulary with which
to describe the functional role of a gene as well as its physical location within the cell. The
vocabulary is known as the Gene Ontology (GO) (Ashburner et al. 2000). GO associates a set
of terms with gene products. These terms are known as GO terms and fall into three general
domains. These are:
Cellular component: This is the physical location within the cell where the gene
product is generally to be found.
Biological process: This is the biological pathway or process in which the gene product
participates.
Molecular function: This is at a lower level than the biological process domain and
includes the specific molecular capabilities of the molecule in question. An example
of molecular function could be the ability to bind a particular metal.
Terms are organised as a network starting from the root terms defined above. As the
network is traversed starting from a root term, terms become more specific, i.e. if term B
is directly below term A in the ontology then term B is a subclass of term A.
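The subclass relationship means that a gene product annotated with a specific term is implicitly annotated with every more general term above it. The Python sketch below illustrates this traversal. The small hierarchy is a toy example whose edges are illustrative and are not copied from any GO release, and the function name is hypothetical.

```python
# Toy child -> parents mapping; the edges are illustrative only and are
# not taken from an actual release of the ontology.
TOY_PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "metal ion binding": ["binding"],
    "zinc ion binding": ["metal ion binding"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upwards from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


implied = ancestors("zinc ion binding", TOY_PARENTS)
```

Each term maps to a list of parents rather than a single parent so that the same traversal also works where a term has more than one more general term above it.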
1.5.5.2 KEGG
Another database that localises gene products within functional pathways is KEGG (Kyoto
Encyclopedia of Genes and Genomes) (Kanehisa 1997; Kanehisa et al. 2006). KEGG
maintains a list of functional pathways for processes that occur within the cell. These
processes are arranged in a similar manner to GO in that they start from general categories
and become more specific.
1.6 Transcriptomics
The transcriptome of a cell can be considered to be the portion of its genome that is
transcribed into RNA. Studying the transcriptome can also yield insights into the
functionality of gene products.
1.6.1 Microarrays
At the transcriptomic level the putative function of a gene can be at least partially determined
through establishing the association of the expression of a particular gene with a particular
external condition or treatment. This can be achieved through the use of glass slides known
as microarrays (Mount 2004). These slides have oligonucleotides, which are subsections of a
set of genes, attached to them. Cells of the organism under study are subjected to different
experimental conditions. mRNA is then extracted from these cells, converted to cDNA and
labelled with a distinct fluorescent dye for each condition. By examining the relative degrees
of fluorescence for the
colours associated with the two versions of the cDNA of the gene of interest it is possible to
measure levels of gene expression in response to a given experimental condition. A variant of
this involves using full cDNA molecules as the contents of the chip.
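The comparison of the two fluorescence signals is commonly summarised as a log ratio. The Python sketch below is a minimal illustration which assumes that background-corrected intensities are already available for the two dyes. The intensity values and the function name are hypothetical, and normalisation between the two channels is not shown.

```python
from math import log2


def log_ratio(treated_intensity, control_intensity):
    """Log2 ratio of the two dye intensities measured at one spot.

    Positive values indicate higher expression under the treatment,
    negative values indicate lower expression.
    """
    return log2(treated_intensity / control_intensity)


# Hypothetical intensities for three spots on the array.
upregulated = log_ratio(800.0, 200.0)    # four-fold increase
unchanged = log_ratio(100.0, 100.0)
downregulated = log_ratio(50.0, 200.0)   # four-fold decrease
```

The logarithm makes increases and decreases symmetric around zero, so that a four-fold increase and a four-fold decrease have the same magnitude with opposite signs.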
1.6.2 Other methods for transcriptome examination
Expression levels for a given environmental condition can also be measured through direct
sequencing and counting through use of SAGE (serial analysis of gene expression). In
this method mRNA is extracted from the cells of interest. A small section is excised from
each mRNA molecule. A tag is then connected to each separate subsection. These
subsections are then amplified and the tags counted thus providing a measure of gene
expression levels (Velculescu et al. 1997). Another protocol for sequencing mRNA to detect
gene expression levels has also been developed. This protocol is known as RNA-Seq and is
made feasible through the utilisation of the high throughput nature of next generation
sequencing (Wang et al. 2009b).
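The counting step shared by these sequencing-based approaches can be sketched as follows. The Python code below is an illustration only: the tag sequences, the tag-to-gene table and the function name are hypothetical, and tags that cannot be assigned to a gene are simply ignored.

```python
from collections import Counter


def relative_expression(tags, tag_to_gene):
    """Fraction of assignable tags that map to each gene."""
    counts = Counter(tag_to_gene[tag] for tag in tags if tag in tag_to_gene)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical sequenced tags and their assignment to genes.
observed_tags = ["AAAC", "AAAC", "GGTT", "CCCC"]
tag_table = {"AAAC": "geneX", "GGTT": "geneY"}

expression = relative_expression(observed_tags, tag_table)
```

Here geneX accounts for two of the three assignable tags and geneY for one, while the unassigned tag CCCC does not contribute.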
1.7 Proteomics
Proteomics, in a similar way to genomics and transcriptomics, is the study of the full protein
complement produced by a cell. The proteomic level is the point where the connection
between macromolecules and measurable phenotypes is first made. Proteins can be
considered as making up close to the totality of both structural (e.g. microtubules) and active
(e.g. enzymes) components of a cell. The function of a protein can be investigated through the
determination of its structure and/or the determination of its interaction partners.
1.7.1 Protein Structure
There are two main methods utilised to determine the three dimensional structure of a protein
molecule (Brown 2006). These are:
X-Ray crystallography: This procedure involves the production of a crystal from the
protein of interest. X-rays are then fired through this crystal to acquire a
diffraction pattern. This diffraction pattern can then be used to reconstruct the
structure of the protein. X-ray crystallography is limited by the fact that it requires the
protein to be able to crystallise (Brown 2006).
NMR spectroscopy: NMR, or nuclear magnetic resonance, is the absorption and
re-emission of electromagnetic radiation by the nuclei of atoms placed in a
magnetic field. By exposing a protein to electromagnetic radiation, these
patterns of resonance can be used to work out the structure of the protein (Brown
2006).
1.7.2 Protein interactions

There are two primary modes of protein interaction. The first
is a direct physical interaction. Direct physical interactions between distinct proteins can
occur in two contexts (Orengo et al. 2003). These are:
Formation of a stable complex: A protein complex is a stable structure formed by two
or more proteins to carry out a specific function. In order to maintain the structural
integrity of a complex, proteins within the complex have to maintain relatively long
term direct physical interactions. The subunits of the ribosome are an example of a
stable protein complex, as are the histone octamer and RNA polymerases (Orengo
et al. 2003). Not all interactions within a protein complex are direct, as members of a
complex with more than two interacting partners do not necessarily have to be
physically connected to every other protein within that complex.
Transient interaction: These are functional interactions where proteins physically
interact but also exist independently in their own right (Orengo et al. 2003). An
example of a transient interaction is the interaction between the human proteins Rho
and RhoGap, which triggers a signalling cascade involved in cytoskeleton formation
and cell proliferation (Nooren and Thornton 2003).
The other form of interaction between proteins is indirect interaction. Examples of this
could be two proteins that have a role in a given metabolic pathway but whose production is
temporally and spatially separated. A specific example is the interaction
between SHC-transforming protein and mitogen-activated protein kinase 1 over several steps
of the insulin-signalling pathway (Sasaoka and Kobayashi 2000).
The full collection of all protein interactions within a cell has been labelled the interactome.
1.7.2.1 Experimental detection of protein interactions
Protein interactions can be detected using a variety of techniques. The main techniques
include:
Yeast two-hybrid: One widely used method for detecting protein interactions (Marcotte et
al. 1999) is the yeast two-hybrid technique. This technique exploits the S.
cerevisiae GAL4 transcription factor. This transcription factor has two domains that
require physical proximity in order to operate. One of these domains binds DNA and
the other domain activates transcription. A protein interaction can
be detected by fusing the two genes of interest to these domains respectively on
separate plasmids and inserting these plasmids into a yeast cell with a reporter gene
downstream of the GAL4 transcription factor-binding site. Reporter gene transcription is
only possible if the protein products of the two genes of interest are able to maintain
a physical interaction (Griffiths 2002). The primary drawbacks to this method are
that all interactions must take place in the nucleus, removing a large number of
proteins from their native cell compartment, and that only binary protein interactions
can be tested for (von Mering et al. 2002). The yeast two-hybrid method also has a
high rate of false positives. One reason for this is that pairs of proteins that stick
together are not necessarily ever expressed at the same time or in the same tissue
(Vidalain et al. 2004). Also some proteins such as heat shock proteins are inherently
promiscuous in their binding affinities (Vidalain et al. 2004).
Proteome chips: In a manner similar to the use of microarrays described above for the
measurement of gene expression levels, microarrays can also be used with proteins.
By printing translations of 5800 ORFs from S. cerevisiae on to a microarray chip Zhu
and others (Zhu et al. 2001) were able to detect 33 novel interactions for the
multifunctional calcium binding protein calmodulin. The drawbacks to this method are that
it is low throughput and again is restricted to binary interactions.
Mass spectrometry of purified complexes: In order to detect interactions that are not
binary, complexes of proteins can be isolated using techniques such as tandem affinity
purification. This technique entails the tagging of a protein of interest with a tag that
allows the purification of the main protein and any complex partners that it might
have. These complexes can be characterised through the use of mass spectrometry
(von Mering et al. 2002).
1.8 Description of project.

This work details the development and application of a novel heuristic which combines
the Barker and Pagel approach to phylogenetic profiling (Barker et al. 2007;
Barker and Pagel 2005) with a novel data filter. The Barker and Pagel
approach to phylogenetic profiling will subsequently be referred to as constrained ML
(maximum likelihood). It was observed over the course of this project that this method could
be useful in elucidating novel protein interactions. Novel protein interactions will allow
further elucidation and annotation of protein function through the principle of guilt by
association as articulated above. The proteome of Homo sapiens is still filled with known
unknowns in terms of protein-protein interactions. The HPRD (Prasad et al. 2009) currently
contains 38,788 binary protein interactions and data on 998 protein complexes. Current
estimates of the interactome size, such as work by Stumpf (Stumpf et al. 2008) which
estimates the size of the interactome as 650,000 interactions, indicate that the majority of
protein-protein
interactions have not yet been elucidated. The potential of phylogenetic profiling to detect
novel interactions has been demonstrated in work by Ramazzina (Ramazzina et al. 2006)
where two novel genes involved in the degradation of uric acid were detected. The
phylogenetic profiling method has also been successful in identifying enzymes of the
MEP/DOXP pathway (Cunningham et al. 2000).
Chapter 2 details the construction of a eukaryotic phylogeny over 54 taxa as well as
the phylogenetic profiles of known proteins within the human genome relative to the other 53
species, which were essential precursor steps to this study.
Chapter 3 contains the results of the application of the method in context and
compares it to a comparable high throughput experimental technique. Specifically the method
is compared to detection of protein-protein interactions as well as indirect functional linkages
through co-expression of genes as measured by microarrays. The method is also compared to
PIPs which is the protein interaction prediction system maintained by Barton (McDowall et
al. 2009; Scott and Barton 2007). This system makes novel predictions through the
combination of different informative features.
Chapter 4 describes the construction of the data filter, which is based on Dollo
parsimony. The filter reduces the size of the overall search space, facilitating the use of the
method for whole genome comparisons. This is achieved through the elimination of pairs of
proteins whose function cannot be detected via examination of patterns of presence and
absence.
Chapter 5 presents a network of predictions generated as a putative human
interactome of proteins, which are susceptible to this line of enquiry. This network is
analysed for consistency with known data. A set of novel predictions is presented.
Finally Chapter 6 will sum up this work and present details on potential future
directions.
Chapter 2
Reconstruction of eukaryotic phylogeny as precursor to comparative
analysis
2.1 Introduction
Examination of the evolutionary histories of organisms is a fundamental step for any form of
study of biological function as adaptation can only be examined within an historical context
(Harvey and Pagel 1991). As a phylogeny is by definition an evolutionary history of species
(Harrison and Langdale 2006) it is a necessary step within the process of a comparative
study. In terms of examination of changes in gene content within a probabilistic framework it
provides the necessary topology over which such changes occur. This is a fundamental
parameter in any such model.
2.1.1 Homology
The fundamental object of any phylogenetic study, whether molecular or morphological, is
the comparison of homologous structures within the organisms under consideration. When
genomic data is under consideration homologous structures within organisms correspond to
those genomic elements that were present in the last common ancestor of the set of
organisms under consideration. These elements can provide a measure of divergence (Fitch
1970). These elements, if functional (which is implied by conservation), can either maintain
their ancestral function or, if sufficiently diverged, have a new (or no) function. In discussions
of elements of genomes (genes) there are a number of subclasses of homologous
relationships. These are:
Orthology: Genetic elements are orthologous if they are the direct product of
divergence from a common ancestral species (speciation) (Fitch 1970).
Paralogy: Genetic elements are paralogous if they are the product of a duplication
event within a given species. Mechanisms of duplication include retrotransposition
(insertion of reverse transcribed RNA back into a genome) and unequal crossover
leading to tandem duplication of a portion of a chromosome (Hurles 2004). It is
thought that these duplication events are a major force in creating and broadening
genetic repertoires (Zhang 2003).
Xenology: Genetic elements are xenologous if they are the product of a direct
exchange of DNA between organisms (Fitch 2000). These exchanges are known to be
far more prevalent in prokaryotes given their lack of a true nucleus and the existence
of plasmids (free floating segments of DNA) in some prokaryotes. Genes have also
been observed as xenologous in eukaryotes. Xenologous genes in eukaryotes can be
acquired via organelles, which are the product of endosymbiosis such as the
mitochondrion and chloroplasts (Blanchard and Lynch 2000).
It is important for purposes of phylogenetic reconstruction to be able to draw a distinction
between genes that are paralogous and genes that are orthologous. If paralogous genes are
compared between species the distance between them does not necessarily reflect the overall
genetic divergence between the species under consideration. Genetic elements that are
orthologous provide information on levels of divergence between speciation events whereas
those that are paralogous provide data on duplication events.
A converse relationship to homology is that of analogy, where through convergent
evolution genes that share no common ancestry develop and maintain sequence similarity due
to similar demands placed on the organisms in question by their environment. A classic
example of this at the molecular level is the convergent evolution of the enzyme
lysozyme in both the langur monkeys of the Indian subcontinent (Semnopithecus entellus)
and ruminants due to the similar requirements imposed by a herbivorous diet (Swanson et al.
1991).
2.1.2 Molecular evolution
The fundamental idea at the heart of modern biology is that of random mutations guided by
natural selection producing adaptations, which allow an organism to thrive in a given
ecological niche. The large-scale study of evolution at a molecular level has only recently
become possible due to advances in DNA sequencing technologies. This has been extremely
useful as random mutations occur at the molecular level and DNA and amino acid sequences are the
fundamental comparable common denominator across morphologically and physiologically
diverse species (Nei and Kumar 2000).
At the DNA level there are four basic types of mutation (Nei and Kumar 2000). These are:
Insertions: These mutations are insertions of additional nucleotides into a sequence of
nucleotides. These can be caused by replication errors (Brown 2006). If an insertion
occurs within an open reading frame (ORF) it can cause the frame to be shifted; hence
insertions in coding regions can also be referred to as frameshift mutations.
Deletions: Deletions are the opposite of insertions. Deletions within an ORF can also
cause a frame shift (Brown 2006).
Substitutions: These mutations are also referred to as point mutations and involve the
substitution of a nucleotide with any other nucleotide. Substitutions do not necessarily
have to involve a single nucleotide (Brown 2006). There are two types of
substitution. Transitions entail the replacement of a purine with another purine,
e.g. A to G, or of a pyrimidine with another pyrimidine, e.g. C to T. Transversions
involve the replacement of a purine with a
pyrimidine or vice versa (Nei and Kumar 2000).
Inversions: An inversion mutation involves the reversing of the sequence of a strand
of DNA, e.g. the sequence TGA being replaced with AGT (Nei and Kumar 2000).
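The distinction between transitions and transversions can be made concrete with a short sketch. The function name and the restriction to the four DNA bases are illustrative assumptions and are not part of the original text.

```python
# Illustrative sketch: classify a single-nucleotide substitution.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(old: str, new: str) -> str:
    """Return 'transition' or 'transversion' for a change from old to new."""
    old, new = old.upper(), new.upper()
    valid = PURINES | PYRIMIDINES
    if old not in valid or new not in valid:
        raise ValueError("expected one of A, C, G, T")
    if old == new:
        raise ValueError("no substitution: both nucleotides are identical")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, `classify_substitution("A", "G")` stays within the purines, whereas `classify_substitution("A", "C")` crosses from a purine to a pyrimidine.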
2.1.2.1 Synonymous and non-synonymous mutations

Recalling the genetic code, where nucleotide triplets known as codons specify amino acids,
mutations within coding regions can also be classified by the effect that they have on the
potential protein product. Thus mutations where the amino acid specified is altered are
known as non-synonymous mutations, while mutations where there is no effect on the amino
acid specified are referred to as synonymous mutations (Nei and Kumar 2000). Most
synonymous mutations occur in the third position of codons. Measuring the relative rate of
synonymous vs. non-synonymous mutations is a technique for detecting positive
Darwinian selection (Nei and Kumar 2000).
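As a minimal sketch of this classification, the following code builds the standard genetic code and labels a codon change as synonymous or non-synonymous. The function and variable names are illustrative assumptions; changes to or from a stop codon (written `*`) are simply reported as non-synonymous here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(old_codon: str, new_codon: str) -> str:
    """Label a codon change by its effect on the encoded amino acid."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    return "synonymous" if old_aa == new_aa else "non-synonymous"
```

A third-position change such as CTT to CTC leaves leucine unchanged, while a second-position change such as GAA to GTA replaces glutamate with valine.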
2.1.3 Phylogenetic trees
The evolutionary relationship between organisms has traditionally been depicted as a
tree-like structure, first presented in 1801 by the French botanist Augustin Augier (Stevens
and Augier 1983). Intuitively it is fairly clear what an evolutionary tree represents.
Mathematically a tree can be defined as a connected acyclic graph. A graph is an abstraction that
can be used to model binary relations between objects (Parida 2008). A graph G can be
defined as G(V,E) where V is a set of vertices or nodes and E is a subset defined as $E \subseteq (V \times
V)$ (Parida 2008). E is thus a subset of the set of all ordered pairs that can be created from
elements of V. The elements of E are referred to as the edges of the graph (Parida 2008). A
tree is a graph where all vertices are connected, possessing the property that any two vertices
$v_1, v_2 \in V$ are connected by a unique path (Parida 2008). A vertex in a tree that has only one
incident edge is known as a leaf node (Parida 2008). All other vertices by contrast are
known as internal nodes (Parida 2008).
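The graph-theoretic definition can be checked mechanically. The sketch below treats edges as undirected pairs, which is an assumption made for simplicity, and uses the fact that a connected graph with |V| - 1 edges is a tree. The function names are illustrative.

```python
from collections import defaultdict, deque

def is_tree(vertices, edges):
    """Return True if the undirected graph is connected and acyclic."""
    vertices = set(vertices)
    # A tree on |V| vertices always has exactly |V| - 1 edges.
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Breadth-first search to confirm every vertex is reachable.
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == vertices

def leaf_nodes(vertices, edges):
    """Return the vertices with exactly one incident edge."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {v for v in vertices if degree[v] == 1}
```

All vertices not returned by `leaf_nodes` are internal nodes in the sense used above.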
In the case of a phylogenetic tree leaf nodes are extant taxonomic units or taxa and
internal nodes are proposed hypothetical common ancestors as illustrated in Figure 2.1. A
subsection of a phylogenetic tree can be referred to as a clade (Nei and Kumar 2000).
Figure 2.1: Sample phylogenetic tree. In this tree the extant taxa are nodes A, B and C
while node E is an ancestral node for A and C.
2.1.3.1 Species trees and gene trees
There are two main types of phylogenetic tree that are commonly investigated. These are:
Species trees: The topology of these phylogenetic trees represents the branching order
of species. Thus internal nodes are hypothetical common ancestors for the nodes that
succeed them. The splits at these ancestral nodes represent speciation events. A
speciation event is considered to be the moment in time when two species were
reproductively isolated from each other (Nei and Kumar 2000).
Gene trees: Gene trees measure the degree of divergence between homologous genes
within and/or across species. Thus internal nodes in a gene tree represent a
hypothetical gene that existed prior to a mutation event that created its two immediate
descendants (Nei and Kumar 2000).
Figures 2.2 and 2.3 illustrate the differences between gene trees and species trees.
Figure 2.2: A species tree. Adapted from (Brown 2006).
Figure 2.3: A gene tree. Adapted from (Brown 2006).

2.1.3.2 Topologies and branch lengths
The branching pattern of a phylogenetic tree is known as its topology. The topologies of
phylogenetic trees can be rooted, as above in Figure 2.1, or unrooted, as presented below in
Figure 2.4.
Figure 2.4: Sample of an unrooted phylogenetic tree representing four taxa.

Theoretically the topologies of most phylogenetic trees are bifurcating, as ancestral
nodes will split into two descendant nodes at a given point in time. Multifurcation is also possible
in phylogenetic trees. A node with more than two descendants is referred to as a polytomy.
There are two types of polytomy: soft polytomies, where the multifurcation is attributable to a
lack of information, and hard polytomies, where species genuinely split into multiple
descendants simultaneously (Page and Holmes 1998). Most polytomies are treated as soft as
simultaneous speciation is considered unlikely (Page and Holmes 1998).
The number of possible unrooted bifurcating tree topologies B(t) can be calculated
using the formula given below (Salemi and Vandamme 2003) where t is the number of taxa
under consideration.
$$B(t) = \prod_{i=3}^{t} (2i - 5) \qquad (1)$$
The number of possible rooted bifurcating topologies B(t) can be counted using the following
formula:

$$B(t) = \frac{(2t - 3)!}{2^{t-2} (t - 2)!} \qquad (2)$$

Thus estimation of a phylogenetic tree is a problem that quickly becomes computationally
intractable as the number of taxa rises.
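To give a sense of how quickly these counts grow, the two formulas can be evaluated directly. This is a small sketch; the function names are illustrative and exact integer arithmetic is used throughout.

```python
from math import factorial

def count_unrooted_topologies(t: int) -> int:
    """Number of unrooted bifurcating topologies for t taxa (t >= 3)."""
    if t < 3:
        raise ValueError("at least three taxa are required")
    count = 1
    for i in range(3, t + 1):
        count *= 2 * i - 5  # product of (2i - 5) for i = 3..t
    return count

def count_rooted_topologies(t: int) -> int:
    """Number of rooted bifurcating topologies for t taxa (t >= 2)."""
    if t < 2:
        raise ValueError("at least two taxa are required")
    return factorial(2 * t - 3) // (2 ** (t - 2) * factorial(t - 2))
```

For four taxa there are 3 unrooted and 15 rooted topologies; for ten taxa the counts are already 2,027,025 and 34,459,425 respectively. Calling `count_unrooted_topologies(54)` gives the corresponding figure for a tree of the size used in this work.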
Another attribute that can be added to a phylogenetic tree is the length of its
individual branches. As the nodes within a tree represent taxa, the lengths of the branches
between them represent the degree of evolutionary change between the taxa over time. A
phylogenetic tree with branch length information is also known as an additive tree, a metric
tree or a phylogram (Page and Holmes 1998).
2.1.3.3 Bootstrap support values
Another attribute commonly associated with internal nodes in a phylogenetic tree is the
bootstrap support value. This value reflects the number of times a particular internal node or
split is recovered if a phylogenetic analysis is repeated on a random set of re-samples (with
replacement) from the original dataset (Page and Holmes 1998).
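The resampling step can be sketched as follows, assuming the dataset is a multiple sequence alignment whose columns are resampled with replacement. The tree estimation applied to each replicate is not shown, and the representation of a split as a `frozenset` of taxon names is an assumption made purely for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps taxon name -> aligned sequence (all of equal length).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns)
            for name in names}

def support_value(replicate_splits, split):
    """Fraction of bootstrap replicate trees that contain a given split."""
    hits = sum(split in splits for splits in replicate_splits)
    return hits / len(replicate_splits)
```

Each replicate returned by `bootstrap_alignment` would be passed to the chosen tree estimation method, and `support_value` would then summarise how often a split of interest recurs across the resulting trees.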
2.1.3.4 Evolutionary models in tree estimation
In order to estimate the amount of evolutionary change between taxa, methods considered to
be effective tree estimators utilise models of evolution that specify information on the
evolutionary rate of substitution between homologous stretches of nucleotide or amino acid
data. These models are framed as $m \times m$ matrices where m is the number of entities in the
data type, i.e. 4 in the case of nucleotides and 20 in the case of amino acids. To illustrate,
the simplest substitution model possible for DNA is the Jukes-Cantor model,
which assumes that all nucleotide substitutions occur at the same rate (Nei and Kumar
2000). The substitution matrix for the Jukes-Cantor model is presented in Table 2.1, where $\alpha$
represents this uniform rate of substitution.

          A          T          C          G
A         -          $\alpha$   $\alpha$   $\alpha$
T         $\alpha$   -          $\alpha$   $\alpha$
C         $\alpha$   $\alpha$   -          $\alpha$
G         $\alpha$   $\alpha$   $\alpha$   -

Table 2.1: Rates of nucleotide substitution for the Jukes-Cantor model (Nei and Kumar
2000).
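As a hedged illustration of how such a model is used, the sketch below builds the Jukes-Cantor rate matrix for a uniform rate (called `alpha` here) and evaluates the standard closed-form substitution probabilities after time t. Setting the diagonal to -3 * alpha, so that each row sums to zero, is the usual rate-matrix convention and is an assumption of this sketch.

```python
import math

NUCLEOTIDES = "ATCG"

def jc_rate_matrix(alpha: float):
    """Jukes-Cantor rate matrix: alpha off the diagonal, rows sum to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]

def jc_transition_probabilities(alpha: float, t: float):
    """Probability of observing each nucleotide after time t.

    Uses the closed-form solution for the Jukes-Cantor model:
    P(same) = 1/4 + 3/4 * exp(-4 * alpha * t)
    P(diff) = 1/4 - 1/4 * exp(-4 * alpha * t)
    """
    decay = math.exp(-4 * alpha * t)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]
```

At t = 0 the probability matrix is the identity, and as t grows every entry approaches 1/4, reflecting the loss of phylogenetic signal under saturation.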
The methods of tree estimation that utilise these models of evolution include the
distance method, tree estimation by Bayesian methods, and tree estimation by maximum
likelihood (Felsenstein 2004).
In distance methods an evolutionary model provides a measure of evolutionary
distance between taxa, whereas in probabilistic methodologies such as maximum likelihood
and Bayesian methods they provide a measure of probability for a given set of substitutions
between taxa. Evolutionary models can be calculated via a priori assumptions about the
evolutionary process or can be constructed empirically by examining the rate of observed
substitutions in homologous sequences. Examples of empirically calculated substitution
matrices for amino acids include the PAM matrices created in the seminal work by Margaret
Dayhoff (Dayhoff et al. 1978) and more recently the WAG (Whelan and Goldman 2001) and
LG matrices (Le and Gascuel 2008).
2.1.4 Detection of homology in molecular data
In order to construct a phylogenetic tree that represents the evolutionary history of a set of
taxa using molecular data, it is necessary to compare homologous sequences. More
specifically it is necessary to detect orthologous genes/proteins. These genes/proteins are the
most appropriate measure of genetic divergence between species, as an equal level of genetic
divergence will have occurred since the speciation event causing the split.
There are a number of algorithms that are utilised in the selection of homologous
genes/proteins and their subsequent classification as orthologous or paralogous. These
include:
Reciprocal Best Hits (RBH): This procedure is implemented by the COGs (Tatusov et
al. 2003) database hosted by the NCBI. The underlying rationale of the algorithm is
that orthologous genes between two species will possess more similarity with each
other than with any other gene. This similarity is generally established using pairwise
sequence alignment algorithms such as BLAST (Altschul et al. 1990) or the Smith-Waterman algorithm (Smith and Waterman 1981).
InParanoid: This algorithm extends the idea behind RBHs by using them to seed
orthologous clusters, and then by an application of an iterative inclusion process
constructs a set of gene/protein families (Remm et al. 2001).
OrthoMCL: This process also utilises RBHs as seed pairs for clusters. Similarity
relations between gene/proteins are then established as a graph and additional
paralogous sequences are determined through a process of graph clustering (Li et al.
2003).
Reciprocal smallest distance (RSD): This procedure does not utilise RBHs and
instead, for a set of hits for a given query protein, over a given E-value (Expect
value), conducts pairwise alignments between each of the hits and the original query.
Hits that are alignable to a given threshold are then subjected to further analysis to
calculate the number of amino acid substitutions or distance between them and the
original query. The hit with the shortest distance is then used to reverse the process. If
the reversal yields the original query then the two sequences are declared orthologous
(Wall et al. 2003).
EnsemblCompara GeneTrees: This is an algorithm utilised by the Ensembl Compara
database (Vilella et al. 2009). The process involves RBHs. Two species are subjected
to an all against all pairwise alignment. Like OrthoMCL the resulting data is then
converted into a graph. This graph is then clustered. Gene trees are then constructed
using these clusters and reconciled against a gold-standard species tree.
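The reciprocal best hit idea that underlies several of the procedures above can be sketched in a few lines. The input format (nested dictionaries of similarity scores, as might be parsed from pairwise alignment output) and the gene identifiers in the usage note are hypothetical.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return pairs (a, b) that are each other's best-scoring hit.

    scores_ab[a][b] is the similarity score of gene a (species A) searched
    against gene b (species B); scores_ba holds the reverse searches.
    """
    def best_hit(scores):
        # For each query keep the subject with the highest score.
        return {query: max(hits, key=hits.get)
                for query, hits in scores.items() if hits}

    best_ab = best_hit(scores_ab)
    best_ba = best_hit(scores_ba)
    return sorted((a, b) for a, b in best_ab.items()
                  if best_ba.get(b) == a)
```

With hypothetical genes `a1`, `a2` in one species and `b1`, `b2` in another, a pair is reported only when the best hit of `a1` is `b1` and the best hit of `b1` is `a1`. This is why the approach yields only one-to-one relationships.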
In comparative studies the InParanoid algorithm (Remm et al. 2001) was shown to
perform better than its rivals (Hulsen et al. 2006). This work showed the InParanoid
algorithm tied as the best performer with simple reciprocal best hits at identification of
orthologs. However reciprocal best hits in practice only yield one-to-one orthologous
relationships (Hulsen et al. 2006). This reduces the coverage of the method (Hulsen et al.
2006). OrthoMCL was shown to perform a close second to the InParanoid algorithm in
benchmarking tests (Hulsen et al. 2006). Subsequent benchmarking work (Altenhoff and
Dessimoz 2009) showed that OrthoMCL outperformed InParanoid to an extent, at lower
levels of specificity but higher coverage. However, where benchmarking was applied to
data and organisms common to both reviews, the results were seen as broadly congruent
(Altenhoff and Dessimoz 2009).
2.1.5 Multiple sequence alignment
Given a set of orthologous sequences further processing is required in order to convert them
into a suitable input for a phylogenetic tree estimation procedure. This input is known as a
multiple sequence alignment (MSA) (Edgar and Batzoglou 2006). The process involves
creating an optimal alignment between three or more protein sequences. Insertions and
deletions between orthologous proteins are represented by introducing gaps into the
alignment. Alignments are scored through the use of substitution matrices. The process
converts orthologous sequences into a rectangular array where each column of the array
corresponds to a homologous attribute between the taxa under consideration (Edgar and
Batzoglou 2006).
Forms of multiple sequence alignment include:
Progressive: This form of alignment involves the construction of initial pairwise
alignments between all the sequences under consideration. The distances thus
established between the sequences are used to create a guide tree. The multiple
sequence alignment is then built up progressively in the order suggested by the guide
tree (Mount 2004). The main flaw with this method is that errors made at any stage of
constructing the MSA remain in the final alignment (Wheeler and Kececioglu 2007).
A very prominent example of a progressive MSA tool is the Clustal suite (Higgins
and Sharp 1988).
Iterative: In order to reduce the errors introduced by the progressive approach to MSA
the iterative approach realigns sub-groups of the sequences repeatedly (Mount 2004).
Examples of iterative MSA programs include MUSCLE (Edgar 2004) and DIALIGN
(Morgenstern et al. 1998). The performance of the iterative approach can be improved
by the inclusion of consistency information between the growing MSA and the
pre-computed pairwise alignments, as used by some of the algorithms within the MAFFT
(Multiple alignment by fast Fourier transform) program (Katoh et al. 2002).
The quality of a multiple alignment is crucial to the accuracy of the phylogenetic tree
created via its analysis (Blair and Murphy 2011). This is especially true when there are gaps
in the alignment (Talavera and Castresana 2007). Thus benchmarking tests have been carried
out to examine the performance of various algorithms currently available. The results of these
have found that MAFFT (running in its iterative, consistency enhanced mode) using the
Smith-Waterman algorithm (Smith and Waterman 1981) for its initial pairwise alignment
outperformed its nearest rivals (Ahola et al. 2006; Nuin et al. 2006). This mode of MAFFT is
known as MAFFT-L-INS-i.
2.1.5.1 Multiple sequence alignment quality filtration
Given the effects of MSA quality on phylogenetic analysis it is argued that filtration of
areas that are problematic to align will improve the outcome of subsequent phylogenetic
analyses (Talavera and Castresana 2007). It is common practice to edit MSAs by hand before
analysing them further, though it is considered that this makes all results thus gained
irreproducible through the subjectivity of the overall process (Blair and Murphy 2011). Thus
this process has been semi-automated by programs such as Gblocks (Talavera and Castresana
2007) and trimAl (Capella-Gutierrez et al. 2009). These programs retain sections of
MSAs that are highly conserved and remove gaps in the alignment.
Gblocks will either remove all gaps in its stringent mode or only remove gaps if they are
present in more than half the sequences in the alignment in its relaxed mode (Talavera and
Castresana 2007). trimAl will remove columns from an alignment based on a conservation
threshold defined by the user, i.e. how much of the original alignment the user wishes to
conserve (Capella-Gutierrez et al. 2009). In benchmarking tests optimum performance for
Gblocks in enhancing tree estimation was observed using its relaxed mode (Capella-Gutierrez
et al. 2009).
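The gap criterion described above can be illustrated with a deliberately simplified column filter. This is not the Gblocks or trimAl algorithm, both of which apply further criteria; it only shows the effect of a gap-fraction threshold. The function name and parameters are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Remove alignment columns whose gap fraction exceeds the threshold.

    alignment maps sequence name -> aligned sequence (equal lengths).
    max_gap_fraction=0.5 keeps a column unless more than half of the
    sequences have a gap there; max_gap_fraction=0 removes every column
    that contains any gap.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col for col in range(length)
        if sum(seq[col] == gap for seq in sequences) / len(sequences)
        <= max_gap_fraction
    ]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in alignment.items()}
```

The two threshold settings loosely mirror the relaxed and stringent treatments of gaps described above, which makes it easy to compare how much of an alignment survives under each.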
2.1.6 Methods to estimate phylogenetic trees
The focus of this section as mentioned above shall be on the analysis of molecular data
though the methods described are applicable to any form of measurable polymorphic trait.
These data provide a measure of distance between the species under consideration.
The first subdivision in types of methods of phylogenetic analyses is between discrete
character state and distance matrix methods (Salemi and Vandamme 2003). Discrete
character state methods examine the differences in state of a set of discrete characters or
traits. Distance matrix methods utilise the distance between sets of data through the creation
of a matrix of pairwise distances and application of clustering techniques. Subtypes of the
character state method include the maximum parsimony method that does not utilise an
explicit model of evolution and maximum likelihood, which conversely does (Salemi and
Vandamme 2003).
2.1.6.1 Distance methods
Distance methods were originally developed to construct phenograms, i.e. diagrams which
reflect the similarity between a given group of taxa without consideration of
ancestor/descendant relationships (Salemi and Vandamme 2003; Sneath and Sokal 1973), as
opposed to phylogenies. Distance methods however can also be applied to elucidating
phylogeny under the assumption of equal rates of mutation in cases where a quick initial
result is required.
Distance methods of phylogeny depend on the construction of a matrix of pairwise
distances for the trait data of the organisms under consideration. This data is generally
nucleotide and/or amino acid sequence data though the method is also applicable to any other
form of discrete descriptive data. In the case of amino acid or nucleotide data distances are
estimated according to evolutionary models, which allow a meaningful calculation of the
evolutionary distance between two species.
The simplest form of evolutionary distance measure is the proportion of differing sites
between two sequences, p. This is calculated through a simple count of differing sites $n_d$ and
division by the total number of sites n as shown in Equation 3 (Nei and Kumar 2000).

$$p = \frac{n_d}{n} \qquad (3)$$
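Equation 3 translates directly into code. The sketch below assumes two aligned sequences of equal length with no gap handling, which is a simplification; the function name is illustrative.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    n_d = sum(a != b for a, b in zip(seq1, seq2))  # differing sites
    return n_d / len(seq1)
```

For two eight-site sequences differing at two positions, such as `ACGTACGT` and `ACGTTCGA`, the result is 2 / 8 = 0.25.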
p is an underestimate of evolutionary distance over extended periods of time as multiple
substitutions accumulate per site. Thus in order to represent this information substitutions
can be modelled as a Poisson process over time, and the probability of k mutations over time
t can then be calculated by the standard Poisson distribution function, where $\lambda$ is the rate
of mutations per unit time and e is the base of the natural logarithm (Nei and Kumar 2000).

$$p(k;t) = \frac{e^{-\lambda t} (\lambda t)^k}{k!} \qquad (4)$$
This probability can then be used to calculate a distance between two sequences. This
distance is referred to as the Poisson corrected distance (Nei and Kumar 2000).
The Poisson corrected distance assumes a homogeneous rate of mutations /
substitutions over a molecular sequence. This assumption however does not hold, as different
areas of a sequence (coding or non-coding in the case of nucleotides, for example) will be
subject to differing selective pressure and hence differing mutation rates (Nei and Kumar 2000).
This information is integrated into calculations of distance via the observation that
variation in rates of substitution over a sequence follows a gamma distribution (Nei and
Kumar 2000).
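The two corrections can be sketched numerically. The text above does not state the formulas explicitly, so the forms used here are assumed to be the standard amino acid versions: the Poisson corrected distance d = -ln(1 - p), and the gamma distance d = a[(1 - p)^(-1/a) - 1], where a is the shape parameter of the gamma distribution. Both take the proportion of differing sites p as input.

```python
import math

def poisson_corrected_distance(p: float) -> float:
    """Poisson corrected distance, d = -ln(1 - p), for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1 - p)

def gamma_distance(p: float, shape: float) -> float:
    """Gamma distance, d = a * ((1 - p) ** (-1 / a) - 1), with a = shape."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    if shape <= 0:
        raise ValueError("shape parameter must be positive")
    return shape * ((1 - p) ** (-1 / shape) - 1)
```

Both corrections exceed p itself, and the gamma distance exceeds the Poisson corrected distance for any finite shape parameter, approaching it as the shape parameter grows large (that is, as rate variation among sites vanishes).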
Having created a matrix of pairwise distances between the sequences under
comparison this matrix can then be used to generate a phylogenetic tree via clustering.
A commonly used form of clustering in the generation of distance-based trees is
neighbour joining. This algorithm follows the following steps (Brown 2006):
Construction of a fully multifurcating star shaped tree including all taxa under
consideration.
The selection of a random pair of taxa and removing them from the star to
form a tree consisting of a clade containing that pair and a clade containing the
rest of the star.
Evaluation of the total branch lengths of the new tree.
Iteration of this process for all possible pairs storing the results of the branch
length calculation.
Identification of the pair, which yields the first interim tree with the shortest
branch length.
This pair is now placed on their own branch and the process is iterated until a
fully bifurcating tree is retrieved.
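A compact sketch of neighbour joining is given below. It does not enumerate interim trees as in the steps above; instead it uses the widely used Q-matrix formulation, which is generally regarded as selecting the same pair (the one that minimises total branch length) at each iteration. The function name, the input format and the naming of new internal nodes (`u0`, `u1`, ...) are assumptions.

```python
def neighbour_joining(names, dist):
    """Return the joins made by neighbour joining.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    Each join is (node_a, node_b, branch_a, branch_b, new_node).
    """
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        u = f"u{len(joins)}"
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
        joins.append((a, b, len_a, len_b, u))
    # The last two nodes are connected by a single remaining branch.
    a, b = nodes
    joins.append((a, b, d[a][b], 0.0, None))
    return joins
```

The returned list of joins, read in order, describes a fully bifurcating unrooted tree together with its branch lengths.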
Another method of tree estimation involving distance matrices is least squares fitting
in which for each tree the residual sum of squares is calculated between pairs of taxa. This
method is known as the Fitch-Margoliash method. This involves applying the following
equation (Nei and Kumar 2000).
$$R_s = \sum_{i<j} (d_{ij} - e_{ij})^2 \qquad (5)$$

where $d_{ij}$ is the observed distance in the matrix between taxon i and taxon j and $e_{ij}$ is the
patristic distance between the taxa. The patristic distance between two taxa is the sum of the
branch lengths that make up the shortest path between the two taxa. The tree with the lowest
$R_s$ is selected by the method. Generally tree space is searched using a heuristic search
method as described below in Section 2.1.6.3.
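The residual sum of squares in Equation 5 can be computed as follows, assuming that both the observed distances and the patristic distances implied by a candidate tree are supplied as symmetric matrices. Deriving the patristic distances from a tree is not shown, and the function name is illustrative.

```python
def residual_sum_of_squares(observed, patristic):
    """Sum of squared differences between observed and patristic distances.

    Both arguments are symmetric square matrices (lists of lists) indexed
    by taxon; each unordered pair of taxa is counted once (i < j).
    """
    n = len(observed)
    if len(patristic) != n:
        raise ValueError("matrices must have the same dimensions")
    return sum(
        (observed[i][j] - patristic[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
```

A candidate tree whose patristic distances match the observed matrix exactly scores zero; the search procedure then prefers whichever tree gives the smallest value.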
Other standard techniques for this process are clustering methods such as UPGMA
(unweighted pair group method with arithmetic mean), which group organisms by degree
of closeness in the matrix. The underlying assumption of UPGMA is that the evolutionary
process occurs at a consistent pace, i.e. follows a molecular clock (Felsenstein 2004). Thus in
cases where data does not follow a molecular clock, UPGMA will deliver misleading results
as it will cluster species on short branches with each other (Felsenstein 2004).
Another commonly applied method is minimum evolution, which creates a tree where
the overall amount of evolution (measured by the total branch lengths of the tree from root to
tip) is minimised (Salemi and Vandamme 2003). Again tree space is traversed by heuristic
search as described below.
Distance methods are fast compared to character-based methods and,
given a dataset with relatively constant rates of evolution and closely related taxonomic units,
fairly accurate (Felsenstein 2004). However they suffer from a systemic issue: if the
taxa under consideration display variability of rates of evolution along a sequence at different
points in a tree, this cannot be detected, as all distances between the sequences are calculated
locally, i.e. between adjacent species (Felsenstein 2004).
2.1.6.2 Discrete character state methods
Discrete character state methods operate on matrices populated with attributes or
characters assigned to each taxon under consideration. Possible trees are then evaluated against this
matrix in an attempt to satisfy an optimality criterion (Salemi and Vandamme 2003). One of
the two most popular optimality criteria is parsimony, which entails minimisation of the
amount of change required over a given tree to produce the data observed in the matrix. The
other widely used criterion for selection of trees is likelihood. This method frames the tree as
a hypothesis for the matrix of observed data and evaluates its likelihood given the matrix of
observed data (Felsenstein 2004). Maximisation of the likelihood function yields the
optimum tree.
2.1.6.2.1 Maximum Parsimony
The use of parsimony as a criterion for judging potential trees was first introduced by Camin and
Sokal in 1965 (Camin and Sokal 1965). The rationale behind preferring a tree that is more
parsimonious is based on the principle of Ockham's razor, which states that a simpler
explanation for an observed phenomenon is to be preferred to a more complex ad hoc
explanation (Steel and Penny 2000).
Specific variants of parsimony that can be utilised are (Felsenstein 2004):
Fitch parsimony: this form of parsimony is also known as Wagner parsimony
(Felsenstein 2004). This form of parsimony allows all possible changes in any
direction and counts them all equally.
Camin-Sokal parsimony: this form of parsimony only allows evolutionary
change in one direction (Camin and Sokal 1965). For example if a two state
character which can take on states 0 and 1 is considered Camin-Sokal
parsimony will only allow changes in the 0 to 1 direction (assuming that is the
direction selected as permissible) (Felsenstein 2004).
Dollo parsimony: this form of parsimony is based on the principle of
phylogenetic irreversibility (Lequesne 1974). The acquisition of a complex
character is allowed once and then all subsequent changes over the tree can
only be reversions (Felsenstein 2004).
Parsimony on an ordinal scale: this deals with the case where changes in a
multi state character are considered on an ordinal scale. Thus only changes
that are adjacent are allowed (Felsenstein 2004).
Polymorphism parsimony: this form of parsimony allows an intermediate state
of polymorphism to be acquired at a point within the tree. All changes
subsequent to the polymorphic areas in the tree are counted as a loss of one of
the composite characters that make up the polymorphic state (Farris 1978;
Felsenstein 1979; Felsenstein 2004).
Evaluating the number of character changes required over a particular tree for a given
character matrix is computationally easy and can be calculated rapidly through applications
of dynamic programming algorithms such as:
The Fitch algorithm (Fitch 1971): operates by carrying out a post-order
traversal of the phylogenetic tree (Felsenstein 2004). At each internal node the
set of potential ancestral states is set to the intersection of the state sets of
its immediate descendant nodes if that intersection is non-empty. If the
intersection is empty then the state set of the node is set to the union of the state sets of
the two descendant nodes, and one change is counted.
The Sankoff algorithm (Sankoff 1975): is similar but not identical to the Fitch
algorithm (Felsenstein 2004). A cost matrix is created which stores the cost of
all possible changes of state within the context of the data under consideration.
Ancestral node states are then assigned by selecting the state with the minimal
cost.
The complexity of both these algorithms is linear as the number of operations
required increases linearly with the number of taxa, the number of characters under
consideration and the number of states that those characters can take on.
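The Fitch algorithm for a single character can be sketched recursively. The representation of a rooted bifurcating tree as nested two-element tuples with leaf names as strings is an assumption made for brevity, as are the function and parameter names.

```python
def fitch_score(tree, leaf_states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    tree is either a leaf name (str) or a tuple (left_subtree, right_subtree);
    leaf_states maps each leaf name to its observed state.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    # Post-order traversal: both descendants are scored before the node.
    left_set, left_changes = fitch_score(left, leaf_states)
    right_set, right_changes = fitch_score(right, leaf_states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one additional change.
    return left_set | right_set, left_changes + right_changes + 1
```

For the tree `(("A", "B"), ("C", "D"))` with a character that is present only in taxon B, a single change is required. The same function applies unchanged to presence/absence characters coded as 0 and 1.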
Parsimony based methods have been popular due to their computational as well as
conceptual simplicity. Parsimony methods were recently utilised to construct the largest
phylogenetic tree ever reconstructed, consisting of 73,060 eukaryotic taxa, with a combination of
morphological and molecular data (Goloboff et al. 2009). However, in general, over the last
two decades there has been a swing toward the use of likelihood-based methods (Steel and
Penny 2000). This is due in part to the demonstration (Felsenstein 1978) that under given
conditions maximum parsimony will converge on the wrong tree. These conditions have
come to be known as the Felsenstein zone.
2.1.6.2.2 Maximum likelihood
Maximising the likelihood of a given tree as a hypothesis to explain observed data was first
applied to phylogenetic inference by Edwards and Cavalli-Sforza (Cavalli-Sforza 1964). The
likelihood function assigns a value to the ability of a hypothesis to explain an observed set of
results. Assume a statistical model M that associates a probability with a set of possible
outcomes and a set of observed outcomes (or results) R. Thus P(R|M) is the probability of
observing R assuming M is a correct description of the process under study (Edwards 1992).
The likelihood L of M is defined as:
\[ L(M) = P(R \mid M) \times k \tag{6} \]
Where k is an arbitrary constant. Use of this constant allows relative comparison of
likelihoods (Edwards 1992). To paraphrase an example from (Durbin 1998): in the case of a
die, if our hypothesis is that the die is fair then the probability of any outcome is 1/6
(approximately 0.167). If we go on to roll 5 sixes then this forms our observed data. The
likelihood of the hypothesis is then proportional to (1/6)^5, or approximately 0.000129.
Hypotheses can thus be judged on their relative abilities to explain observed results. A
hypothesis with a higher estimate of the probability of rolling a 6 would be a better fit to
the observed data in the case of the die. Thus likelihood provides a framework with which to
select a hypothesis or model appropriate to the observed data.
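The die example can be worked through numerically. The sketch below uses the exact value 1/6 for the fair die and compares it with a hypothetical loaded die; the loaded probability of 0.5 and the function name are assumptions chosen only for illustration.

```python
def likelihood_of_sixes(p_six, n_sixes):
    """Likelihood (up to the constant k) of rolling n_sixes sixes in a row."""
    return p_six ** n_sixes


fair = likelihood_of_sixes(1 / 6, 5)     # hypothesis: the die is fair
loaded = likelihood_of_sixes(0.5, 5)     # hypothetical loaded-die hypothesis
ratio = loaded / fair                    # the arbitrary constant k cancels here
```

Because only the ratio of likelihoods is interpreted, the arbitrary constant k does not affect the comparison between the two hypotheses.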
In the case of phylogeny each tree is a hypothesis explaining the distribution of the
traits under consideration. The phylogeny with the maximum likelihood is selected as the
optimal tree. The likelihood of a tree can be measured through the application of a
substitution model of evolution, which models the probability of individual evolutionary
events over the tree. Empirically calculated substitution models can be used as a substitute for
the calculation of a set of probabilities, which permits the application of more generalised
rules of evolution to each individual phylogenetic study. Empirical models of evolutionary
events can be created through the examination of homologous sequences in different species.
Models currently in use for amino acid based phylogenies include the WAG (Whelan and
Goldman 2001) and LG (Le and Gascuel 2008) substitution models.
If a model is badly specified and a poor fit for the data then likelihood methods can
return an inaccurate tree with high statistical support (Keane et al. 2006). There are a limited
number of conditions under which parsimony methods can outperform likelihood-based methods;
these have been called the inverse Felsenstein zone, or Farris zone (Siddall 1998). It has been shown
however that these cases are extremely rare in real data (Swofford et al. 2001) and in cases
where it is computationally feasible maximum likelihood has become one of the dominant
paradigms in phylogeny reconstruction.
2.1.6.2.3 Bayesian Methods
Another criterion related to likelihood is the posterior probability of a tree given a matrix of
observations and a prior probability for the tree. The posterior probability of a hypothesis is
the probability of the hypothesis being true given some observed data. The posterior
probability of a tree given a multiple alignment is calculated through the application of Bayes'
theorem, which is defined as:
\[ P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)} \tag{7} \]
Where X and Y are separate events and P(Y|X) is the conditional probability of event Y given
event X has occurred. P(X) is known as the prior probability of event X. P(X|Y) is the
posterior probability of event X given event Y has occurred. P(X) represents a subjective prior
belief in the probability of X occurring.
In the case of phylogenetic analysis X is a phylogenetic tree and Y is a given multiple
alignment. It is however non-trivial to evaluate the posterior probabilities over all possible
tree topologies exhaustively (Huelsenbeck et al. 2001). This process has been made feasible
by sampling the distribution of posterior probabilities of trees. The posterior probability of a
particular tree is estimated as the proportion of times it is visited during traversal of tree space.
Tree space traversal is facilitated by the use of Metropolis-coupled MCMC (Markov chain
Monte Carlo) methods first introduced by the doctoral work of Li and Mau (Pickett and
Randle 2005).
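The idea that visit frequency approximates posterior probability can be illustrated on a toy problem. The sketch below assumes three candidate topologies with made-up likelihoods and a flat prior, and uses a plain single-chain Metropolis sampler rather than the Metropolis-coupled variant used in practice; all names and numbers are hypothetical.

```python
import random

# Hypothetical likelihoods P(Y|X) of one alignment Y under three topologies X.
likelihood = {"T1": 0.006, "T2": 0.003, "T3": 0.001}
prior = {t: 1 / 3 for t in likelihood}           # flat prior over topologies

# Exact posterior from Bayes' theorem; P(Y) is the sum over all topologies.
evidence = sum(likelihood[t] * prior[t] for t in likelihood)
posterior = {t: likelihood[t] * prior[t] / evidence for t in likelihood}


def metropolis(n_steps, seed=1):
    """Estimate the posterior as the fraction of steps spent at each tree."""
    rng = random.Random(seed)
    trees = list(likelihood)
    current = rng.choice(trees)
    visits = {t: 0 for t in trees}
    for _ in range(n_steps):
        proposal = rng.choice(trees)             # symmetric proposal
        ratio = (likelihood[proposal] * prior[proposal]) / (
            likelihood[current] * prior[current])
        if rng.random() < ratio:                 # accept with prob min(1, ratio)
            current = proposal
        visits[current] += 1
    return {t: visits[t] / n_steps for t in trees}


estimate = metropolis(50_000)
```

With only three topologies the exact posterior can be computed directly, which makes it possible to check the sampled frequencies; in real tree space only the sampled estimate is available.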
The algorithm returns a set of trees sampled from the posterior distribution. An
individual phylogeny is then generally assembled from the returned sample using
majority rule consensus methods (Cranston and Rannala 2007).
In Bayesian methods the prior probabilities are a potential source of bias
(Holder and Lewis 2003). This issue can be ameliorated through the use of flat or
uninformative priors. Flat priors can however still bias a Bayesian phylogenetic study
towards trees with particular configurations of clades (Pickett and Randle 2005).
2.1.6.3 Heuristic search methods
Given the large number of topologies possible for even a small number of taxa, the
estimation of phylogenetic trees is a problem that is intractable by brute force searching. Thus
the space of all possible trees is usually searched heuristically (Felsenstein 2004). What this
entails is the selection of a random first tree. This tree is then evaluated using
whatever measure has been defined to evaluate the quality of the tree. Examples of
possible quality measures for a phylogeny include, as previously discussed, parsimony,
likelihood and distance. The tree is then altered, thus moving to a new point in tree space.
This new tree is then evaluated. This process is then iterated until a local optimum point has
been reached within the space. This point is not guaranteed to be a global optimum within the
space (Felsenstein 2004).
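To give a sense of why brute force search is intractable, the number of distinct unrooted bifurcating topologies for n labelled taxa is (2n-5)!!, the product of the odd numbers from 3 to 2n-5. The short sketch below evaluates this count; the function name is an assumption made for illustration.

```python
def unrooted_tree_count(n_taxa):
    """Number of unrooted bifurcating topologies for n_taxa labelled taxa."""
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


counts = {n: unrooted_tree_count(n) for n in (4, 5, 10, 20)}
```

Ten taxa already give over two million topologies, and the count grows faster than exponentially as taxa are added.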
Examples of alterations/moves that are used to traverse tree space include (Felsenstein 2004):
Nearest neighbour interchange (NNI): This process involves the swapping of adjacent
branches within a tree. This is a local rearrangement of the tree.
Subtree pruning and regrafting (SPR): This process involves the removal or pruning
of a subtree from an overall tree and reattaching it at another point. As opposed to
NNI this is a global rearrangement of the tree.
Tree bisection and reconnection (TBR): This involves the deletion of an interior
branch to split a tree into separate trees and then all possible connections are made
between the branch set of the first tree and the second. This is also a global
rearrangement of the tree.
Global rearrangements are more radical moves within the tree space and thus searches that
use them are less likely to become trapped in local optima. Modern phylogeny estimation
programs generally provide the option to carry out either form of rearrangement. The
advantage of using local rearrangements is greater speed in arriving at an optimum tree.
Examples of programs that offer this choice are a number of programs within the PHYLIP suite
(Felsenstein 1989) and PhyML (Guindon and Gascuel 2003). PhyML is generally as
accurate as other phylogeny estimation programs while being considerably faster
(Dereeper et al. 2008). Programs within PHYLIP can carry out multiple searches through
the space, jumbling the input order of the taxa to widen space coverage. The
programs within PHYLIP that offer heuristic search are:
PROTPARS
DNAPARS
DNACOMP
DNAML
DNAMLK
PROML
PROMLK
RESTML
FITCH
KITSCH
NEIGHBOR
CONTML
PARS
MIX
DOLLOP
Short descriptions of these programs can be found on the PHYLIP webpage
(http://evolution.genetics.washington.edu/phylip/).
2.1.7 Bootstrapping
Bootstrapping was first proposed as a method for evaluating confidence limits in
phylogenetic trees by Felsenstein in 1985 (Felsenstein 1985a). The procedure evaluates how
well supported a particular tree topology is by a given dataset. This entails constructing a
new dataset by random resampling, with replacement, from the original dataset. This new
dataset should be of the same size as the original dataset. This procedure is repeated to
produce the appropriate number of replicates. The original tree estimation procedure is then
repeated on each replicate, producing a set of trees.
The original tree is then evaluated in the light of these new trees. Each interior branch
of the original tree is compared to the bootstrap trees, and for every bootstrap tree that
contains a branch creating an identical partition of the data, the branch is marked as
present. Thus each internal branch gets a score or bootstrap confidence value calculated by
dividing the number of bootstrap trees in which it was found to be present by the
total number of bootstraps (Nei and Kumar 2000).
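The two halves of the procedure, resampling the columns of the dataset and scoring each internal branch, can be sketched as follows. The tree estimation step itself is omitted, so the bootstrap trees are represented directly by hypothetical sets of partitions; the function names, the five-taxon example and the encoding of a partition as a pair of taxon sets are all assumptions for illustration.

```python
import random

TAXA = frozenset("ABCDE")


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def split(side):
    """Encode an internal branch as the unordered pair of taxon sets it separates."""
    side = frozenset(side)
    return frozenset({side, TAXA - side})


def branch_support(original_splits, bootstrap_split_sets):
    """Fraction of bootstrap trees containing each split of the original tree."""
    n = len(bootstrap_split_sets)
    return {s: sum(s in tree for tree in bootstrap_split_sets) / n
            for s in original_splits}


# Hypothetical toy alignment and one bootstrap replicate of it.
alignment = {"A": "MKVLA", "B": "MKVLG", "C": "MRVLG", "D": "MRILG", "E": "MRIIG"}
replicate = bootstrap_columns(alignment, random.Random(0))

# Hypothetical result: partitions of the original tree and of four bootstrap trees.
original = {split("AB"), split("DE")}
bootstrap_trees = [
    {split("AB"), split("DE")},
    {split("AB"), split("CD")},
    {split("AB"), split("DE")},
    {split("AB"), split("DE")},
]
support = branch_support(original, bootstrap_trees)
```

In a real analysis each replicate would be passed through the original tree estimation procedure to obtain the bootstrap trees.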
In the PHYLIP package presented by Felsenstein bootstrapping is carried out through
the use of two of its internal programs SEQBOOT and CONSENSE (Felsenstein 1989). This
procedure calculates confidence values for a consensus tree created from the bootstrap trees
as opposed to the original tree (Nei and Kumar 2000). The procedure for obtaining these
bootstrap support values is:
Run SEQBOOT on the original dataset. This produces the required number of
resamples/replicates of the dataset. SEQBOOT requires a random odd number
as a seed.
Repeat the original estimation procedure on each replicate. This produces a set of
trees, all of which represent the original data.
CONSENSE is then used to merge these trees together. CONSENSE is a
program that is designed to produce a consensus tree from a set of trees.
CONSENSE annotates each internal branch with a value that represents the
confidence in that branch.
2.1.8 Model selection in phylogenetic tree estimation

As mentioned above, three of the main methods of phylogenetic tree estimation utilise
evolutionary models as described in section 2.1.3.4 with which to convert homologous data
matrices into trees. The use of an inappropriate model has been shown to adversely affect all
aspects of tree reconstruction including branch lengths and topology (Bruno and Halpern
1999). A poorly specified model will return an incorrect tree with high statistical confidence
(Posada and Buckley 2004).
Thus procedures have been developed with which to estimate how well a given
evolutionary model fits a dataset. A standard probabilistic measure that is used to measure the
fit of a model to a given dataset is likelihood. As a model can be fitted (over fit) perfectly to a
dataset by adding parameters, the likelihood of any given model is evaluated in the context of
how many parameters the model has. Two standard measures for integrating this information
(the likelihood of the model with respect to the dataset and the number of parameters)
are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion
(BIC). Both of these measures penalise parameter rich models relative to simpler models, thus
selecting the simplest model that explains the observed data adequately (Felsenstein 2004).
The AIC is calculated by the following equation (Felsenstein 2004).
\[ AIC_i = -2 \ln L_i + 2 p_i \tag{8} \]
Where Li is the likelihood of model i and pi is the number of parameters in model i. The BIC
has a higher penalty for parameter richness than the AIC and is calculated via the following
equation (Felsenstein 2004).
\[ BIC_i = -2 \ln L_i + p_i \ln(n) \tag{9} \]
Where Li is the likelihood of model i and pi is the number of parameters in model i and n is
the number of data points in the dataset.
An example of a model selection procedure involving likelihood, utilised by a model
selection tool ModelGenerator is as follows (Keane et al. 2006).
The construction of a simple guide tree using neighbour joining on the dataset.
Each model to be evaluated is then examined over this guide tree and the dataset to
calculate the likelihood of that model.
The AIC/BIC are calculated for the model.
The model with lowest AIC/BIC is presented as the best model for the given dataset.
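The final two steps of this procedure, computing the criteria and choosing the model with the lowest value, can be sketched as below. The candidate model names, log-likelihoods, parameter counts and dataset size are hypothetical values chosen for illustration, not results from ModelGenerator.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion from a log-likelihood (equation 8)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion from a log-likelihood (equation 9)."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)


# Hypothetical candidates: name -> (log-likelihood on the guide tree, parameters).
candidates = {"simple": (-1000.0, 1), "medium": (-990.0, 5), "rich": (-989.0, 20)}
n_sites = 500   # hypothetical number of data points

aic_scores = {m: aic(lnl, p) for m, (lnl, p) in candidates.items()}
bic_scores = {m: bic(lnl, p, n_sites) for m, (lnl, p) in candidates.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

With these made-up numbers the two criteria disagree, which reflects the heavier penalty that the BIC places on additional parameters.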
2.1.9 Comparing phylogenetic trees

Given sets of two or more phylogenetic trees produced by disparate estimation procedures it
can sometimes be necessary to compare trees for congruence with each other as well as with
the datasets underlying them.
There are a number of procedures that can be followed to compare trees. The simplest
of these is attempting to visualise the parts of trees that are topologically similar. A method
for doing so is presented in work by Nye (Nye et al. 2006).
Other measures have been designed to define the difference between two trees. These
include:
The Robinson-Foulds distance: This is a distance that counts how many
branches differ between two trees. This is done by ignoring branch lengths and
considering each tree as a set of branches. Each branch splits the tree into two
partitions. The Robinson-Foulds distance is a count of the partitions present in
one tree and not in the other (Felsenstein 2004).
The NNI distance: This distance can be considered an edit distance, analogous
to the Levenshtein distance used to compare strings of text. It is the number of
NNI operations it would take to transform one of the trees into the other
(Felsenstein 2004).
The Branch Score distance: This measure uses branch lengths as well as
topology to calculate the distance between two trees (Felsenstein 2004).
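The Robinson-Foulds distance in particular lends itself to a short sketch. Trees are written as nested tuples of leaf names and each internal branch is encoded as the unordered pair of taxon sets it separates; this representation and the function names are assumptions made for illustration.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, all_taxa):
    """Non-trivial partitions induced by the internal branches of a tree."""
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Branches leading to a single leaf (or all but one) are shared by every tree.
        if 1 < len(below) < len(all_taxa) - 1:
            splits.add(frozenset({below, all_taxa - below}))
        return below

    visit(tree)
    return splits


def robinson_foulds(tree1, tree2):
    """Count partitions present in one tree and not in the other."""
    taxa = leaf_set(tree1)
    if taxa != leaf_set(tree2):
        raise ValueError("trees must contain the same taxa")
    return len(bipartitions(tree1, taxa) ^ bipartitions(tree2, taxa))


tree1 = (("A", "B"), ("C", ("D", "E")))
tree2 = (("A", "C"), ("B", ("D", "E")))
distance = robinson_foulds(tree1, tree2)
```

The two example trees share the partition separating D and E from the rest and differ in one partition each.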
Another way to measure the quality of a pair of estimated trees relative to a given dataset is
use of the Kishino-Hasegawa (KH) test. This is a test of how well individual homologous
sites within a dataset support a given tree in contrast to another tree (Goldman et al. 2000).
Both trees are selected a priori as the possible best hypotheses for an observed dataset. The
test was first introduced in (Hasegawa and Kishino 1989). The underlying rationale
is that if the trees are equally well supported by the dataset then the expected difference
between their likelihoods is zero. Using the notation provided
in (Goldman et al. 2000) the test can be carried out via the following procedure.
Given two trees T1 and T2 calculated by a given quality criterion, e.g. parsimony or
likelihood.
Assuming for the purposes of explanation that the quality criterion is likelihood, the
likelihoods of T1 and T2 (with respect to a given dataset D) are L1 and L2 respectively.
Calculate δ as the difference between the two likelihoods, i.e. δ = L1 - L2.
The underlying hypothesis of the test is that T1 and T2 do not explain D equally well or
δ ≠ 0.
Thus the hypotheses for the test are:
H0: E[δ] = 0
H1: E[δ] ≠ 0
Where E[x] corresponds to the expected value of the random variable x.
In order to test these hypotheses it is necessary to calculate how extreme the observed
value of δ is with respect to the distribution of δ.
In order to calculate this distribution a bootstrapping procedure is followed to
calculate multiple replicated datasets from D.
The likelihoods of the trees T1 and T2 are then recalculated for each bootstrapped
dataset.
For each of these likelihoods a corresponding δ is calculated.
This provides a distribution for δ against which the position of the initial δ can be
compared via a two-tailed test. A two-tailed test is used as there is no a priori
expectation of which tree is to be preferred (Goldman et al. 2000).
The test assumes that all columns within the dataset are independent and identically
distributed according to the evolutionary history of the taxa under consideration.
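The procedure above can be sketched in a few lines under some simplifying assumptions. Rather than recalculating likelihoods on each bootstrapped dataset, the sketch resamples precomputed per-site log-likelihoods (the shortcut often called RELL), and it centres the replicate differences so that they reflect the null hypothesis of zero expected difference. The function name and the input values are hypothetical.

```python
import random


def kh_test(site_lnl_1, site_lnl_2, n_replicates=1000, seed=0):
    """Return (observed difference, two-tailed p-value) for two trees.

    site_lnl_1, site_lnl_2: per-site log-likelihoods of the same dataset
    under tree T1 and tree T2 (assumed to be computed elsewhere).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(site_lnl_1, site_lnl_2)]
    n = len(diffs)
    observed = sum(diffs)
    extreme = 0
    for _ in range(n_replicates):
        # Bootstrap the sites with replacement and recompute the difference.
        resampled = sum(diffs[rng.randrange(n)] for _ in range(n))
        centred = resampled - observed      # centre on the null expectation of 0
        if abs(centred) >= abs(observed):   # two-tailed comparison
            extreme += 1
    return observed, extreme / n_replicates


# Hypothetical per-site log-likelihoods for a ten-column dataset.
lnl_t1 = [-2.1, -1.9, -3.0, -2.5, -2.2, -1.8, -2.7, -2.4, -2.0, -2.6]
lnl_t2 = [-2.3, -1.7, -3.1, -2.4, -2.6, -1.9, -2.5, -2.8, -2.1, -2.5]
delta, p_value = kh_test(lnl_t1, lnl_t2)
```

A small p-value would indicate that the observed difference is unlikely if both trees explained the data equally well.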
The KH test is used to compare trees that are selected a priori as possible best
explanations for a given dataset. To examine how well a given best estimated tree matches an
underlying dataset relative to another tree the SOWH (Swofford-Olsen-Waddell-Hillis) test
can be used in conjunction with the KH test. This test is an example of parametric
bootstrapping (Goldman et al. 2000). Essentially the test involves the construction of a tree
with a given quality criterion over a given dataset. Then this initial tree is used to create
multiple datasets, simulated using the parameters that define the tree. Each of these datasets
is then used to create a new tree. The likelihoods of these new trees can then be compared to
the likelihood of the initial tree relative to the simulated datasets. This creates a set of
likelihood differences, the distribution of which can be compared to the difference between
the initial tree and other trees of interest. This test is not widely used due to the computational
demands of the construction of multiple trees and the construction of simulated datasets.
2.1.10 Phylogenetic analysis using gene presence
Leaving aside sequence data there are other aspects to the genomes of a set of organisms that
also provide signal indicating their evolutionary divergence. One of these aspects is proteome
content, i.e. which genomic features are present in a genome and which are absent (Snel et al.
1999). This aspect is essentially the phylogenetic profile of the genomic feature. The
presence and absence of the genomic feature can be treated as a binary trait and then used as
an input to apply standard phylogenetic tree estimation procedures. Methods for the
estimation of trees from discrete traits precede the methods described above for the analysis
of molecular sequence data. The issues and criteria for estimation procedures surrounding the
tree estimation from this form of data remain the same, as the underlying process for the
generation of the data is the same.
As an illustrative example consider 5 species with six genes. Any of these genes can either be
present or absent. An absence of a gene is coded as 0 and the presence of a gene is coded as
1.
Thus given the following distribution of the genes over the 5 species named [A-E].
A 111010
B 111111
C 000001
D 111100
E 000000
A phylogenetic tree can be estimated from this data clustering these species. An example
tree reconstructed using Dollo parsimony as implemented in the PHYLIP package in the
program DOLLOP (Felsenstein 1989) is shown in the figure below.
Figure 2.5: Dollo parsimonious estimation of a phylogenetic tree from example data.
As species E is devoid of all six genes it is placed as an outgroup relative to the other 4.
2.2 Methods
In order to carry out a comparative study on human protein function a phylogeny was
constructed to provide a framework on which to evaluate the distribution of human proteins
over the eukaryotic kingdom. Sets of phylogenetic profiles, which detail this distribution,
were also generated.
2.2.1 Data Selection
Given the computational nature of this project as well as the abundance of molecular data it
was clear that its use was to be preferred over morphological data. Also given the wide
morphological divergence of eukaryotes isolating individual features to be compared was not
considered a plausible option.
The next point of consideration was whether to utilise nucleotide or amino acid
molecular data. It is known that over long periods of evolution it is more likely for nucleotide
data to become saturated with multiple back substitutions, as nucleotide data has four
possible states at any given site as opposed to 20 possible states in amino acid
data. This can lead to an underestimate of genetic distance (Harrison and Langdale 2006;
Salemi and Vandamme 2003). Thus, despite nucleotide data outperforming amino acid data
over smaller time frames such as the period of time covering the divergence of the division
Angiospermae (Simmons et al. 2002) and of the sub phylum Vertebrata (Townsend et al.
2008), it was felt that amino acid data was a more appropriate choice as a measure of genetic
distance over all eukaryotes.
2.2.2 Data Acquisition
Having decided to utilise amino acid data the next step was to acquire usable data. The
protein sets of 54 eukaryotic genomes were downloaded on the 16th and 17th of August 2007,
of which 41 organisms were accessed from the NCBI RefSeq database (using the Entrez data
retrieval interface at http://www.ncbi.nlm.nih.gov/sites/gquery) (Pruitt et al. 2005) and the
remainder from the Sanger Centre (ftp://ftp.sanger.ac.uk), Genoscope
(http://www.genoscope.cns.fr/spip/spip.php?lang=en), TIGR (now the JCVI;
http://www.jcvi.org/), the Broad Institute (http://www.broadinstitute.org), Ensembl (using
BioMart at http://www.ensembl.org/biomart) and lastly SilkDb
(http://silkworm.genomics.org.cn) (Wang et al. 2005b). These databases were employed as
they were (at the time of access) the sources utilised by the KEGG database (Kanehisa et al.
2006). An additional organism, the archaeon Methanosarcina acetivorans, was downloaded from the NCBI
RefSeq database (Pruitt et al. 2005) in order to root the phylogeny using the outgroup
criterion. This method entails using an organism that falls outside the known taxonomy of the
group under consideration to provide a point of reference for the overall topology of the tree
(Felsenstein 2004). It is widely accepted after work by Carl Woese (Woese et al. 1990) that
the Archaea form a sister group to the Eucarya. Thus it was felt that an archaeon was an
appropriate choice for an outgroup species. Full details of all data sources and species can be
seen in the table below.
Organism | Database | Common Name
Methanosarcina acetivorans (Outgroup) | RefSeq | NA
Anopheles gambiae | Ensembl | Mosquito
Arabidopsis thaliana | RefSeq | Thale Cress
Ashbya gossypii | RefSeq | NA
Aspergillus fumigatus | RefSeq | NA
Aspergillus niger | RefSeq | NA
Bombyx mori | SilkDB | Silkworm
Bos taurus | RefSeq | Cow
Caenorhabditis briggsae | Ensembl | NA
Caenorhabditis elegans | RefSeq | NA
Candida albicans | RefSeq | NA
Candida glabrata | RefSeq | NA
Canis familiaris | RefSeq | Dog
Ciona intestinalis | RefSeq | Sea squirt
Cryptococcus neoformans | RefSeq | NA
Cryptosporidium hominis | RefSeq | NA
Cryptosporidium parvum | RefSeq | NA
Danio rerio | RefSeq | Zebrafish
Debaryomyces hansenii | RefSeq | NA
Dictyostelium discoideum | RefSeq | NA
Drosophila melanogaster | RefSeq | Fruit fly
Drosophila pseudoobscura | RefSeq | NA
Encephalitozoon cuniculi | RefSeq | NA
Entamoeba histolytica | RefSeq | NA
Gallus gallus | RefSeq | Chicken
Homo sapiens | RefSeq | Human
Kluyveromyces lactis | RefSeq | NA
Leishmania major | Sanger | NA
Macaca mulatta | RefSeq | Rhesus macaque
Magnaporthe grisea | RefSeq | NA
Monodelphis domestica | RefSeq | Grey short-tailed opossum
Mus musculus | RefSeq | Mouse
Neurospora crassa | Broad Institute | NA
Oryza sativa | RefSeq | Rice
Ostreococcus lucimarinus | RefSeq | NA
Pan troglodytes | RefSeq | Chimpanzee
Paramecium tetraurelia | Genoscope | NA
Pichia stipitis | RefSeq | NA
Plasmodium falciparum | RefSeq | NA
Plasmodium knowlesi | Sanger | NA
Plasmodium yoelii | TIGR | NA
Populus trichocarpa | JGI | Black cottonwood tree
Rattus norvegicus | RefSeq | Rat
Saccharomyces cerevisiae | RefSeq | Brewer's yeast
Schizosaccharomyces pombe | RefSeq | Fission yeast
Strongylocentrotus purpuratus | RefSeq | NA
Takifugu rubripes | Ensembl | Pufferfish
Tetrahymena thermophila | RefSeq | NA
Theileria annulata | RefSeq | NA
Theileria parva | RefSeq | NA
Trichomonas vaginalis | RefSeq | NA
Trypanosoma brucei | RefSeq | NA
Trypanosoma cruzi | RefSeq | NA
Ustilago maydis | RefSeq | NA
Yarrowia lipolytica | RefSeq | NA
Table 2.1: Organisms in study.
2.2.3 Pairwise Alignment
In order to gauge the relatedness of individual proteins in the organisms it was necessary to
use a pairwise alignment algorithm that would deliver a measure of similarity between any
two given sequences. The Smith-Waterman algorithm (Smith and Waterman 1981) was
selected as it is guaranteed to locate optimal regions of local similarity. Speed is an issue with
use of the Smith-Waterman algorithm; however, an accelerated implementation developed by
Michael Farrar and provided within the Fasta package made its use feasible (Farrar 2007; Pearson
and Lipman 1988).
A necessary pre-processing step was to subject all sequences to low complexity
filtering to remove regions of the sequence which are non-random but not biologically
significant, such as regions of compositional bias. Thus all sequences were fed into the SEG
program (Wootton and Federhen 1993) with the -x option, which masks out regions of
low complexity sequence and replaces them with the lower case character x.
Each protein set was then split into its individual proteins and each protein compared
against every other organism in the dataset in order to locate sequences that were
significantly similar. Each comparison was run with a gap-opening penalty of -12 and a gap
extension penalty of -2. The substitution matrix BLOSUM62 (Henikoff and Henikoff 1992)
was used to score the alignments. The results of these searches were then parsed and
pertinent data, i.e. raw Smith-Waterman score, E value, bit score and the coordinates of the
alignments along the sequences were stored in a relational database structure in MySQL to
facilitate further analysis.
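For readers unfamiliar with the scoring scheme, the sketch below computes a Smith-Waterman local alignment score with affine gap penalties. It is an unaccelerated illustration, not the Farrar implementation used in the study. It assumes a simple match/mismatch score in place of BLOSUM62 and assumes that a gap of length k costs 12 + 2k, which is one possible reading of the penalties quoted above; all names are hypothetical.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap_open=12, gap_extend=2):
    """Best local alignment score; a gap of length k costs gap_open + k * gap_extend."""
    neg = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]     # best score of an alignment ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]   # ... ending with b[j-1] opposite a gap
    F = [[neg] * cols for _ in range(rows)]   # ... ending with a[i-1] opposite a gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] - gap_extend,
                          H[i][j - 1] - gap_open - gap_extend)
            F[i][j] = max(F[i - 1][j] - gap_extend,
                          H[i - 1][j] - gap_open - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of zero is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Hypothetical sequences: the second lacks one residue present in the first.
score = smith_waterman("ACDEFGHIKLMNPQ", "ACDEFGHKLMNPQ")
```

The quadratic table fill shown here is the cost that accelerated implementations reduce in practice.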
2.2.4 Orthology Determination
In order to select data which would allow the measurement of evolutionary divergence over
the species it was necessary to cluster the proteins into orthologous clusters. As the
Inparanoid procedure (Remm et al. 2001) had been observed to perform well in this function
it was decided to utilise this procedure. The Inparanoid procedure as described in work
published by Remm (Remm et al. 2001) is detailed below.
2.2.4.1 Inparanoid
Given a set of n pairwise alignments between organism A and organism B the Inparanoid
algorithm returns a set of clusters s using sequence similarity as an inverse distance. A
pairwise alignment of two proteins, protein a and protein b, is in this case a composite input
consisting of:
Bit score aSb: This is the result of the normalisation of a raw pairwise alignment score with
respect to the scoring system (Karlin and Altschul 1990). Normalisation places all
scores on the same scale, which is a fundamental prerequisite for use as a distance
metric.
Sequence Lengths ax and bx.
Alignment lengths: The length of the alignments along both proteins alength and
blength.
The latter two length inputs are used to eliminate short hits using a minimum length
cut-off. These short hits may reflect functionally homologous (potentially orthologous)
domains as opposed to whole proteins, which are inherited intact as a discrete unit from the
last common ancestor of the two species under consideration. The bit score is employed as a
cut-off to limit the radius of the clustering step. The Inparanoid algorithm runs the following
steps in a pairwise comparison (Remm et al. 2001).
Sort all bit scores in ascending order.
Read in all hits excluding those that fall below score and length cutoffs.
Select the best scores between organism A and B.
For each best-hit protein from A-B examine the reciprocal relationship B-A.
All reciprocal best hits are stored as a set of seed pairs for orthologous clusters.
For each seed pair paralogous genes are grouped around them if the largest score that
the putative paralog has in the set of all scores is against the putative ortholog in the
seed cluster of the organism under consideration.
Overlapping clusters are then resolved through either deletion or subsumption
depending on the degree and topology of the overlap.
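The seed-pair stage of these steps, finding reciprocal best hits above a score cut-off, can be sketched as follows. This is an illustration only and is neither the published Perl code nor the Java re-implementation: the length cut-off, paralog grouping and overlap resolution are omitted, ties are broken arbitrarily, and the data structures, names and scores are hypothetical.

```python
def best_hits(hits, min_score):
    """Map each query to its highest-scoring target at or above min_score."""
    result = {}
    for query, targets in hits.items():
        kept = {t: s for t, s in targets.items() if s >= min_score}
        if kept:
            result[query] = max(kept, key=kept.get)
    return result


def reciprocal_best_hits(hits_ab, hits_ba, min_score=50.0):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(hits_ab, min_score)
    best_ba = best_hits(hits_ba, min_score)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores: protein in A -> {protein in B: score}, and the reverse.
hits_ab = {"a1": {"b1": 200.0, "b2": 120.0},
           "a2": {"b2": 90.0, "b1": 60.0},
           "a3": {"b3": 40.0}}
hits_ba = {"b1": {"a1": 198.0, "a2": 60.0},
           "b2": {"a1": 120.0, "a2": 90.0},
           "b3": {"a3": 40.0}}
seeds = reciprocal_best_hits(hits_ab, hits_ba)
```

Here a2 finds b2 as its best hit, but b2 prefers a1, so the pair is not reciprocal; a3 and b3 fall below the score cut-off.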
2.2.4.2 Implementation
An implementation of the Inparanoid algorithm in the Perl language (Remm et al. 2001) was
acquired from the Inparanoid website (http://inparanoid.sbc.su.se). However, as the
provided implementation proved not to be amenable to the analysis of bespoke output, it
was decided to re-implement the procedure described above.
The Inparanoid algorithm was implemented through the application of object
orientated (OO) software design principles. OO principles involve the characterisation of a
problem domain as a collection of interacting objects where functionality inherent to each
object is implemented as internal to that particular object (Pressman 2001). Objects within a
problem domain are generally identified through the identification of nouns within a problem
statement (Pressman 2001). Perusal of the algorithm specification provided in (Remm et al.
2001) led to the design described below.
2.2.4.2.1 Design
The main object that the procedure required in order to operate was identified as a Cluster
object, which was implemented with the following attributes and operations.
Figure 2.6: Class diagram of the main entity.

The box that represents the object above is split into three segments. The first segment
contains the name of the object. The second segment contains the attributes that the object
contains. In the case of the Cluster object it has two attributes. One of these attributes
theAGenes is a list of genes/proteins clustered together from species A. Conversely
theBGenes is a list of genes/proteins clustered together from species B.
The third segment of the boxes represents the operations that it is possible for the
object to carry out. In the case of the Cluster object the main operations necessary are the
ability to add and delete members as well as to detect whether two Clusters overlap. Finally
the ability to merge two Clusters is an essential operation.
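The Cluster entity described above might look roughly like the following. The thesis implementation was written in Java; Python is used here only for brevity, and the method names, the use of sets rather than lists, and the rule that any shared gene counts as an overlap are assumptions, since the class diagram itself is not reproduced in the text.

```python
class Cluster:
    """Sketch of a cluster of orthologous genes from species A and species B."""

    def __init__(self, a_genes=(), b_genes=()):
        self.the_a_genes = set(a_genes)   # corresponds to theAGenes
        self.the_b_genes = set(b_genes)   # corresponds to theBGenes

    def _members(self, species):
        if species == "A":
            return self.the_a_genes
        if species == "B":
            return self.the_b_genes
        raise ValueError("species must be 'A' or 'B'")

    def add(self, gene, species):
        """Add a gene to the cluster for the given species."""
        self._members(species).add(gene)

    def delete(self, gene, species):
        """Remove a gene from the cluster if it is present."""
        self._members(species).discard(gene)

    def overlaps(self, other):
        """Two clusters overlap if they share any gene from either species."""
        return bool(self.the_a_genes & other.the_a_genes
                    or self.the_b_genes & other.the_b_genes)

    def merge(self, other):
        """Return a new cluster containing the members of both clusters."""
        return Cluster(self.the_a_genes | other.the_a_genes,
                       self.the_b_genes | other.the_b_genes)


# Hypothetical usage with made-up gene identifiers.
first = Cluster(["a1"], ["b1"])
second = Cluster(["a1", "a2"], ["b2"])
merged = first.merge(second) if first.overlaps(second) else None
```

Keeping the overlap and merge logic inside the object is the encapsulation that the object orientated design aims for.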
It was decided to use the programming language Java to implement the above design
as the language provides functionality, which facilitates the OO paradigm. The
implementation was then tested against the author provided Perl implementation to ensure
correctness.
The Java implementation deviated slightly from the author-provided Perl
implementation in two respects. The first deviation was the use of higher precision
double values to represent the bit scores of alignments as opposed to the use of
integers by Perl. This led to cases where pairs whose scores had been rounded in the
Perl implementation (and were thus marked as reciprocal best hits) were not
marked as reciprocal best hits by the Java implementation (examples in Appendix A).
The second deviation was to cluster orthologous genes with two equal reciprocal best
hits between the species at the stage of sorting. This change in the order of steps has no effect
overall on the groups produced by the implementation. The implementation was run using a
bit score cut-off of 50 and an alignment length cut-off, which was 50% of the length of the
longer protein.
2.2.4.3 Application
The data generated from the similarity searches was then clustered to identify orthologous
genes using the constructed Inparanoid implementation. The study was carried out using H.
sapiens as a reference species. Orthologous groups are sought in each organism for every
protein within the human proteome. This study can be considered unbalanced as no
information was collected on proteins that are absent in H. sapiens (Davey et al. 2007).
As the dataset under consideration was amino acid sequence data there was a choice
as to how to deal with alternatively spliced isoforms of the same protein. As the goal of this
project was to examine the presence and absence of the proteins under consideration it was
decided that the retention of all isoforms in the dataset and clustering them as inparalogs was
appropriate. This would allow an examination of correlations in gain and loss of the protein
as an independent phenotypic entity.
2.2.5 Phylogenetic profiles
The output from the clustering step was then used to generate phylogenetic profiles, which as
mentioned previously are binary strings of presence and absence of an orthologous group for
a given gene in the reference species.
In order to create phylogenetic profiles of each protein within the human
proteome the following steps were undertaken.
A list of each GI identifier for the set of human proteins was generated.
For each entry in this list the relevant identifier was scanned against all files
containing orthology predictions in alphabetical order.
If the entry was present within the orthology prediction file for a given organism that
position within the profile string was marked as 1. Otherwise that position was
marked as 0. The order of the profiles is alphabetical. Therefore for example the
profile 100000000000000000000000100000000000000000000000000000 indicates a
protein with an orthologous group present in Anopheles gambiae and Homo sapiens
but absent in all other species under consideration.
Two sets of profiles were generated, one including the outgroup for use in ortholog
selection and the other excluding the outgroup for use in prediction of functional
linkage.
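The profile construction steps above can be sketched as follows. This is an illustration only: the function name and the in-memory representation of the orthology prediction files (a mapping from organism name to the set of human GI identifiers with an orthologous group in that organism) are assumptions.

```python
def build_profiles(human_ids, orthologs_by_species):
    """Return {human GI: profile string}, one character per organism.

    orthologs_by_species maps an organism name to the set of human GI
    identifiers that have an orthologous group in that organism
    (a hypothetical stand-in for the orthology prediction files).
    """
    # Positions follow the alphabetical order of the organism names.
    species_order = sorted(orthologs_by_species)
    return {
        gi: "".join("1" if gi in orthologs_by_species[species] else "0"
                    for species in species_order)
        for gi in human_ids
    }
```

Excluding the outgroup from the mapping before calling the function would give the second set of profiles described in step 4.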
Table 2.2 lists the organisms under study along with their proteome sizes and number of
orthologous groups with reference to H. sapiens. Figure 2.7 shows the distribution of
proteome sizes in the organisms under consideration. Figure 2.8 shows the number of
proteins clustered in each organism.
Organism                               No. of Proteins  No. of Clusters  No. of Proteins Clustered
Methanosarcina acetivorans (Outgroup)  4544             408              822
Anopheles gambiae                      13465            4889             5030
Arabidopsis thaliana                   31915            2849             3389
Ashbya gossypii                        4292             1590             1599
Aspergillus fumigatus                  9630             2167             2175
Aspergillus niger                      14102            2197             2210
Bombyx mori                            21302            4060             4104
Bos taurus                             25379            15508            16567
Caenorhabditis briggsae                19553            4009             4212
Caenorhabditis elegans                 23220            4149             4404
Candida albicans                       14107            1909             3367
Candida glabrata                       5192             1737             1784
Canis familiaris                       33527            16372            17681
Ciona intestinalis                     15852            6097             6117
Cryptococcus neoformans                6594             1985             2045
Cryptosporidium hominis                3886             953              957
Cryptosporidium parvum                 3805             1080             1086
Danio rerio                            36083            11598            13007
Debaryomyces hansenii                  6317             1861             1887
Dictyostelium discoideum               13377            2551             2650
Drosophila melanogaster                20071            5037             6847
Drosophila pseudoobscura               9871             4230             4240
Encephalitozoon cuniculi               1996             711              725
Entamoeba histolytica                  9772             1282             1572
Gallus gallus                          18710            11489            11769
Homo sapiens                           33473            26634            33473
Kluyveromyces lactis                   5339             1745             1751
Leishmania major                       8302             1494             1706
Macaca mulatta                         37856            17025            22774
Magnaporthe grisea                     14010            2099             4420
Monodelphis domestica                  20194            13633            13880
Mus musculus                           35048            16607            18448
Neurospora crassa                      10082            1998             2002
Oryza sativa                           26887            2699             2753
Ostreococcus lucimarinus               7603             2152             2241
Pan troglodytes                        51517            18916            28944
Paramecium tetraurelia                 39642            2147             2740
Pichia stipitis                        5816             1883             3826
[organism name missing]                5270             1090             2198
Plasmodium knowlesi                    4958             1125             2254
Plasmodium yoelii                      7861             1058             2140
Populus trichocarpa                    45555            3037             3363
Rattus norvegicus                      35903            16414            21279
Saccharomyces cerevisiae               5883             1803             3706
Schizosaccharomyces pombe              5045             2066             4272
Strongylocentrotus purpuratus          42373            6638             13387
Takifugu rubripes                      22428            10891            11046
Tetrahymena thermophila                26235            2014             2051
Theileria annulata                     3795             1000             1016
Theileria parva                        4079             1005             2044
Trichomonas vaginalis                  59681            1823             2567
Trypanosoma brucei                     8772             1531             3596
Trypanosoma cruzi                      19606            1737             2079
Ustilago maydis                        6548             3756             3762
Yarrowia lipolytica                    6545             4092             4132
Table 2.2: List of organisms used in study along with data source and proteome size as well
as number of orthologous groups in alphabetical order. The outgroup is placed at the top.
Figure 2.7: Distribution of proteome sizes in organisms under consideration. N=55.
Figure 2.8: Distribution of number of proteins placed within clusters in each organism. N=55.
Table 2.3 shows the top ten profiles within the human genome and provides an interpretation.
Profile                                                 Count  Interpretation
000000000000000000000000100000000000000000000000000000  5150   Present in species Homo sapiens.
000000000000000000000000100000000010000000000000000000  2281   Present in Tribe Hominini.
000000100001000000000001100101100010000001000000000000  1089   Present in Phylum Chordata.
000000000000000000000000100100000010000000000000000000  622    Present in Order Primate.
000000100001000000000000100101100010000001000000000000  495    Present in Infraclass Eutheria.
000000000000000000000000100100000000000000000000000000  466    Present in species Homo sapiens and Macaca mulatta.
000000100001000000000001100101100010000001000000000000  378    Present in Class Mammalia and Class Aves.
000000100001000000000000100100100010000001000000000000  342    Present in Class Mammalia.
000000100001000010000001100101100010000001000000000000  281    Present in the Phylum Chordata with the exception of the Species Takifugu rubripes.
000000100001100010000001100101100010000001000100000000  235    Present in Class Mammalia, Class Actinopterygii and Class Aves.
Table 2.3: Top ten occurring phylogenetic profiles ranked by counts.

2.2.5.1 Single copy proteins

In order to select the orthologous proteins that most accurately reflected the evolutionary
histories of the species under consideration, it was decided to focus on orthologous groups
that were present in a single copy across all 55 taxa. This focus on single copy proteins
excluded potential comparisons between paralogous proteins.
A set of proteins with ubiquitous profiles was extracted from the data and these were
sifted for proteins that were present in single copy in each organism in the dataset. Ten
proteins were present in single copy over all the organisms under study. Table 2.4 shows
details of these proteins.
NCBI RefSeq GI Number  Description in NCBI annotation                                                        Entrez Gene Name
116805340              glycyl-tRNA synthetase                                                                GARS
32307132               NFS1 nitrogen fixation 1 precursor                                                    NFS1
4506605                ribosomal protein L23                                                                 RPL23
4506743                ribosomal protein S8                                                                  RPS8
4507215                signal recognition particle 54kDa isoform 1                                           SRP54
4557563                excision repair cross-complementing rodent repair deficiency, complementation group   ERCC3
5031815                lysyl-tRNA synthetase isoform 2                                                       KARS
7706757                H(+)-transporting two-sector ATPase                                                   ATP6V1D
5803092                methionine aminopeptidase 2                                                           METAP2
24430151               26S protease regulatory subunit 4                                                     PSMC1
Table 2.4: Single copy ubiquitous genes extracted via analysis of profiles.
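The selection of single copy, ubiquitous proteins described in this section can be sketched as follows. The function name and the input layout (for each human GI identifier, the number of proteins found in its orthologous group in each organism) are hypothetical and are used only for illustration.

```python
def single_copy_ubiquitous(copy_counts, n_species):
    """Return the human GIs whose orthologous group occurs in every organism
    and contains exactly one protein in each of them.

    copy_counts maps a human GI to {organism: number of proteins in the
    orthologous group}; organisms lacking the group are simply not listed.
    """
    return sorted(
        gi for gi, counts in copy_counts.items()
        if len(counts) == n_species and all(n == 1 for n in counts.values())
    )
```

With the 55 taxa of this study, n_species would be set to 55.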
2.2.5.2 Proteome content data/tree
The phylogenetic profiles as developed provided a matrix of presence and absence of every
human protein in the dataset across the remaining 53 eukaryotes and 1 outgroup archaeon. As
this form of data also contains phylogenetic signal, i.e. shows the divergence in the
proteomes of the given species over time, it was decided to subject this data to a phylogenetic
analysis as well as the main phylogenetic analysis to be carried out on the multiple sequence
alignments of the homologous proteins. The tree was reconstructed using Dollo parsimony
(Farris 1977) via the program DOLLOP (Felsenstein 1989). This tree can be seen in Figure
2.17.
In order to carry this out the profiles were transposed, so that rather than showing the
distribution of human proteins over a set of species they showed the pattern of presence and
absence of human proteins over a single species. Thus instead of a matrix of 33,473 proteins
by 55 species the end product was a matrix of 55 species by 33,473 proteins. In other words
each species was assigned a binary string of length 33,473 where 1 indicated the presence of
a particular human protein and 0 its absence.
This matrix was converted into PHYLIP format through the truncation of species
names to 10 characters and the addition of header information about the size of the matrix.
Finally this formatted file was input to the DOLLOP program (Felsenstein 1989). The
program was run with its default settings.
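The transposition and formatting steps can be sketched as follows. This is a minimal illustration rather than the script used in this work: the function name is hypothetical, and the check for names that become identical after truncation (for example two species of the same ten-letter genus) is an added assumption about how such collisions should be handled.

```python
def profiles_to_phylip(profiles, species_order):
    """Transpose protein profiles into a PHYLIP discrete character matrix.

    profiles maps a protein identifier to its profile string, with one
    character per species in species_order. The output has one row per
    species: a 10 character name followed by one state per protein.
    """
    names = [name[:10].ljust(10) for name in species_order]
    if len(set(names)) != len(names):
        # Truncation can make distinct species names identical.
        raise ValueError("species names collide after truncation to 10 characters")
    rows = list(profiles.values())
    # Header line: number of species, then number of characters.
    lines = [f"{len(species_order)} {len(rows)}"]
    for column, name in enumerate(names):
        lines.append(name + "".join(row[column] for row in rows))
    return "\n".join(lines) + "\n"
```

For the data described here the header would read 55 species by 33,473 characters.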
In order to examine the level of support for the initial outputted tree 100 bootstrap
replicates were created with SEQBOOT (Felsenstein 1989). These 100 replicates were
resubmitted to DOLLOP to produce 100 bootstrap trees. These trees were unified using
CONSENSE (Felsenstein 1989).
2.2.6 Multiple alignment
As an initial step to generate a phylogenetic tree using the orthologous proteins selected, a
multiple alignment of each of the 10 proteins was constructed utilising Mafft (Katoh et al.
2002) (Multiple Alignment By Fast Fourier Transform) using the L-INS-i algorithm for 1000
iterations. Each alignment was then subjected to Gblocks filtration (Talavera and Castresana
2007) to remove columns that were poorly aligned. Gblocks was run in its relaxed mode.
These alignments were then concatenated to form a super matrix in order to generate a
measure of divergence across the genomes as opposed to at a single locus, as has been
suggested by Rokas et al. (2003). The full alignment can be seen in Appendix D.
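The concatenation step can be sketched as follows, assuming each filtered alignment is held as a mapping from taxon name to aligned sequence. The function name and this representation are hypothetical; the sketch relies on every taxon being present in every alignment, as is the case for the single copy ubiquitous proteins selected above.

```python
def concatenate_alignments(alignments, taxa):
    """Join per-gene alignments end to end into one supermatrix.

    alignments is a list of {taxon: aligned sequence}; within one alignment
    all sequences must have the same length. A taxon missing from any
    alignment raises KeyError.
    """
    parts = {taxon: [] for taxon in taxa}
    for index, alignment in enumerate(alignments):
        widths = {len(alignment[taxon]) for taxon in taxa}
        if len(widths) != 1:
            raise ValueError(f"alignment {index} has rows of unequal length")
        for taxon in taxa:
            parts[taxon].append(alignment[taxon])
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```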
2.2.7 Model selection
In order to select an evolutionary model which provided a statistically accurate measure of
genetic divergence, ModelGenerator (Keane et al. 2006) was used to select the model that best
fitted the concatenated multiple alignment. It requires as an argument a number of gamma
categories, to account for heterogeneity in substitution rates. The argument was given a value
of 4 gamma categories as this has been observed to be a sufficient number to create a
near-optimum fit of a model (Yang 1994).
It has been observed that individual gene trees can be highly incongruent with species
trees (Cranston et al. 2009). Thus it is possible for the inference of a species tree to be misled
by non-phylogenetic signal from the individual genes (Cranston et al. 2009). In order to
examine potential incongruence between gene alignments each individual alignment was first
analysed separately. The model selected for the complete supermatrix was the LG
substitution matrix (Le and Gascuel 2008). The LG matrix was selected with the additional
parameter Γ that allows different rates of evolution across the sequence. Each gene alignment
was also best fitted by the LG matrix, with the variations shown in Table 2.5.
The LG matrix is generated by a model of evolution that takes into account mutation rate
heterogeneity over sites, thus yielding better results than its predecessors WAG and JTT (Le
and Gascuel 2008). The models selected were a best fit judged by both the Akaike
Information Criterion (AIC) as well as the Bayesian Information Criterion (BIC) in all but
two cases (GARS and ERCC3). Where they disagreed the BIC was selected over the AIC
(Yang 2008), thus lowering the possibility of overfitting a more complex model to the data.
The models selected for the individual orthologs were exactly the same as the model
for the concatenated alignment except in the case of SRP54, where the additional parameter I
was selected, indicating that a proportion of the alignment was invariant.
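The way the two criteria can disagree is illustrated by the sketch below, which uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n the number of alignment columns. The log likelihoods, parameter counts and alignment length in the usage note are invented for illustration and are not values from this study.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion; lower values indicate a better fit."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion; n is the number of alignment columns."""
    return k * math.log(n) - 2 * log_likelihood
```

For a hypothetical alignment of 500 columns, a simpler model with ln L = -10000.0 and k = 1 and a richer model with ln L = -9998.0 and k = 2 are ranked differently: the AIC prefers the richer model, while the BIC, which penalises each extra parameter by ln(500) rather than 2, prefers the simpler one.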
Entrez Gene Name  Substitution model of best fit (Selected by BIC)
GARS              LG+Γ
NFS1              LG+Γ
ATP6V1D           LG+Γ
KARS              LG+I+Γ
SRP54             LG+I+Γ
PSMC1             LG+Γ
METAP2            LG+Γ
ERCC3             LG+Γ
RPL23             LG+Γ
RPS8              LG+Γ
Table 2.5: Substitution model selected by ModelGenerator for each ortholog.

2.2.8 Phylogeny reconstruction
After model selection, each gene alignment, as well as the concatenated alignment, was input
to PhyML (Guindon and Gascuel 2003) with its selected substitution model. A total of 11
trees were generated: one per orthologous gene and one for the concatenated alignment as a
whole. For the concatenated alignment 1000 replicates were also generated
using SEQBOOT (Felsenstein 1989) and rerun using PhyML (Guindon and Gascuel 2003).
These 1000 replicated trees were input to CONSENSE (Felsenstein 1989) in order to acquire
an estimate of the overall bootstrap support for the tree from within the data. The topology of
the bootstrapped trees was identical to the topology of the primary ML tree.
2.2.9 Comparison of protein content tree with super matrix tree
In order to assess whether the protein content tree was a significantly worse
hypothesis of the evolutionary relationships of the species under study, the PHYLIP program
PROML (Felsenstein 1989) was utilised. PROML was given the supermatrix alignment as a
dataset and the two trees as user inputted trees to evaluate against the dataset. PROML then
ran the Kishino-Hasegawa (KH) test on the two trees to examine the differences in the
likelihood of the trees relative to the dataset.
2.3 Results
The individual gene trees can be seen in Appendix B. In order to examine potential
differences in phylogenetic signal between the individual gene trees, TREEDIST
(Felsenstein 1989) was used to generate a distance matrix between the 10 gene trees.
TREEDIST uses the Branch Score distance (Kuhner and Felsenstein 1994) to calculate the
distance between two trees. This distance takes into account the branch lengths of the trees
input as well as the overall topology. This matrix was then used to generate a dendrogram
using UPGMA (Unweighted Pair Group Method with Arithmetic mean) clustering as shown
in Figure 2.9.
Fig 2.9: Cluster diagram of tree distances of individual gene phylogenies.
The gene trees fell into three main clusters. In order to examine the degree of incongruence of
each cluster from the super matrix species tree, the trees contained within each cluster were
submitted to CONSENSE (Felsenstein 1989) in order to view the consensus trees.
CONSENSE was run with the majority rule setting where a group has to appear more than
50% of the time in the input trees in order to be conserved in the consensus tree. Figure 2.10
shows the outlier tree estimated from ERCC3 and Figures 2.11 and 2.12 show the consensus
trees.
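The majority rule criterion can be illustrated with a short sketch. It is not CONSENSE itself, which operates on Newick trees: here each tree is abstracted, as an assumption made for illustration, to the set of clades it contains, with each clade written as a frozenset of taxon names.

```python
from collections import Counter


def majority_rule_clades(trees, threshold=0.5):
    """Return the clades found in more than `threshold` of the input trees.

    Each tree is an iterable of clades, and each clade is a frozenset of
    taxon names (a simplified representation used only for this sketch).
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade for clade, n in counts.items() if n / len(trees) > threshold}
```

Any two clades that each occur in more than half of the trees must occur together in at least one tree, so the retained clades are always mutually compatible and can be assembled into a single consensus tree.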
Figure 2.10: Gene tree for gene ERCC3.

Figure 2.11: Consensus Tree for Cluster 2 containing genes: RPS8, ATP6V1D, PSMC1 and
METAP2.
Figure 2.12: Consensus Tree 2 for Cluster 3 containing genes: GARS, NFS1, RPL23, SRP54
and KARS.
Both consensus trees preserve the kingdoms of Plantae, Animalia and Fungi though
the order of branching is lost. However both clusters demonstrate a broad congruence with
the fully concatenated ML tree, which can be seen in Figure 2.13. This is a useful measure of
the degree of overlap of phylogenetic signal contributed from each of the individual genes.
Figure 2.13 is an illustration of the topology of the tree with the animals, fungi and plants
highlighted. Figure 2.14 also overlays the bootstrap support values for each proposed clade.
Figure 2.15 presents the reconstructed phylogeny with a measure of support, which is the
proportion of the individual gene trees that supported a clade (Bratke 2009). Figure 2.16
shows the topology of the tree in combination with branch lengths.
Figure 2.13: ML tree of 54 eukaryotes without branch lengths created from a super matrix of
the concatenated alignments of all genes listed in Table 2.5. The clades containing animals,
fungi and plants are coloured blue, red and green respectively.
Figure 2.14: ML tree of 54 eukaryotes without branch lengths created from a super matrix of
the concatenated alignments of all genes listed in Table 2.5. Bootstrap support values are
only shown at each node where support was less than 1000 (not universally supported across
1000 bootstrap replicates).
Figure 2.15: ML tree of 54 eukaryotes created from the concatenated alignments of genes
listed in Table 2.5. Support Values are proportion of individual gene trees, which show a
given clade (Bratke 2009).
Fig 2.16: ML tree of 54 eukaryotes with proportional branch lengths.
Figure 2.17: Proteome content phylogeny with bootstrap support only shown at each node
which was not 100% supported out of 100 bootstraps.
Tree Comparison
The results of the KH test carried out using PROML (Felsenstein 1989) showed that the
proteome content tree is a significantly worse hypothesis of the evolutionary relationships
between the organisms as shown in the table below.
Tree                                   Log likelihood
Protein alignment supermatrix ML tree  -131725.8
Proteome content parsimony tree        -133941.0
PROML (Felsenstein 1989) reported that the log likelihood of the proteome content tree was
significantly worse than that of the protein supermatrix tree.
2.4 Discussion
2.4.1 ML tree
In terms of current thought about super groups within eukaryotes, the phylogeny
reconstructed as seen in Figure 2.13 is incongruent. However, given that it is based on a
concatenation of nuclear genes this is not surprising (Parfrey et al. 2006). Parfrey et al. showed
that there is generally weak support for most putative eukaryotic super groups in phylogenies
built using proteins coded by nuclear genes. The super group Opisthokonta is however
supported, as would be expected from that work (Parfrey et al. 2006), though it does subsume
the Amoebozoa. The ML tree is consistent with known eukaryotic trees (Baldauf et al. 2000)
in placing Plantae as an outgroup to fungi and metazoans. The base of the tree is inconsistent
with known trees in its placement of E. histolytica as an early branching eukaryote with T.
vaginalis, when it is thought that E. histolytica branches higher up in the tree as a member of
the Amoebozoa super group (Parfrey et al. 2006). However there is no clear synapomorphy
(shared derived character) which defines the group Amoebozoa. There is also a lack of
unambiguous support for the existence of the group as a whole within the nuclear genome
(Parfrey et al. 2006). This placement is also not novel, as both organisms lack mitochondria
and have been grouped together at the base of eukaryota by phylogenetic analyses of small
subunit (SSU) RNA genes, though their ultimate placement is not certain (Vanacova et al.
2003).
Within the animals the tree is consistent with the Coelomata hypothesis, which places
the nematoda as an outgroup to both arthropods and vertebrates. This grouping is fairly
common in phylogenies derived using molecular data (Wolf et al. 2004) despite being held to
be false (Aguinaldo et al. 1997). The grouping is thought to be an artefact of long branch
attraction due to rapid evolution along the C. elegans line (Telford 2004). The classes
Mammalia (H. sapiens, P. troglodytes, B. taurus, M. domestica, C. familiaris, M. musculus and R.
norvegicus), Aves (G. gallus) and Osteichthyes (T. rubripes, D. rerio) are all maintained in the
order that they are generally found in most broad vertebrate phylogenies, e.g. work by Stuart
(Stuart et al. 2002).
The tree also arranges its four plant species as expected (Rodriguez-Ezpeleta et al.
2005), with the alga O. lucimarinus forming an outgroup to the monocot O. sativa and the
two dicots A. thaliana and P. trichocarpa.
Within the fungi the tree is consistent with known fungal phylogenies (Fitzpatrick et
al. 2006). The subkingdom Dikarya is a separate clade. Within Dikarya in the phylum
Ascomycota the subphyla Saccharomycotina (S. cerevisiae, C. albicans, C. glabrata, P.
stipitis, D. hansenii, Y. lipolytica, K. lactis, A. gossypii), Taphrinomycotina (S. pombe) and
Pezizomycotina (A. fumigatus, A. niger, N. crassa, M. grisea) are grouped as separate clades.
Another phylum in Dikarya, Basidiomycota (U. maydis, C. neoformans), is a separate clade
within the tree. The microsporidium E. cuniculi branches out as an outgroup to the Dikarya.
Within Saccharomycotina the WGD group (fungi which have undergone whole genome
duplication) (C. glabrata, S. cerevisiae) is presented as a clade. Also the CTG group (fungi
which utilise the codon CTG to encode serine instead of leucine) (P. stipitis, D. hansenii, C.
albicans) (Fitzpatrick et al. 2006) is proposed as a separate clade within the tree. Within the
CTG group the tree shows disagreement with some published trees (Wang et al. 2009a) by
placing P. stipitis and C. albicans together with D. hansenii as an outgroup. Figure 2.14 shows
that this fungal topology was highly supported by the bootstrap analysis.
The Chromoalveolates are grouped together in one clade. Within this clade the
Apicomplexa form a monophyletic group with the Ciliates as an outgroup.
These groupings are congruent with published trees (Burki et al. 2008; Rodriguez-Ezpeleta et
al. 2007).
2.4.2 Proteome content phylogeny
The proteome content phylogenetic tree as presented in Figure 2.17 shows a degree of
topological congruence with the tree based on concatenated protein sequences shown in
Figure 2.13 in that it preserves the animals as a monophyletic group. The branching order of
the taxa is however different. It also shows the fungi as a clustered group (though not
monophyletic). However, given that PROML reports that it is a significantly worse fit to the
alignment of homologous proteins, it is clearly a worse representation of the dataset than the
ML supermatrix tree.
2.4.3 Conclusion
The phylogeny reconstructed via an application of maximum likelihood to the concatenated
supermatrix of 10 eukaryotic proteins appeared to be a plausibly accurate reflection of the
relationships between the taxa. This plausibility was assessed both by inspection by eye
and by comparisons to previously published eukaryotic phylogenies.
The proteome content phylogeny reconstructed by Dollo parsimony on the other hand
was shown to be a significantly worse representation of the evolutionary relationships
between the species. As such it was the ML supermatrix tree that was utilised as the
framework for comparative analysis of protein function within the species.
The work described in this chapter also produced phylogenetic profiles for each human
protein across the 53 other eukaryotes and the archaeal outgroup.
Chapter 3
Comparison of methods of prediction of functional linkage in proteins
3.1 Introduction
This chapter presents the comparison of four systems of inferring functional links between
proteins using phylogenetic profiles. These systems were:
1) Hamming distance: Phylogenetic profiling was initially used without taking into
account species phylogeny and treating the state of each point in the profile as
independent (Pellegrini et al. 1999). Using profiling in this way entailed comparison
of profiles using the string comparison algorithm Hamming distance (Hamming
1950), which is a count of the points at which two strings differ.
2) Use of the comparative method (Barker and Pagel 2005; Pagel 1994) in the context of
phylogenetic profiling (Pellegrini et al. 1999) with constrained rates of gene gain
(Barker et al. 2007) over the phylogeny developed in Chapter 2 to detect protein
interactions. An implementation of the method, BayesTraits (Pagel et al. 2004a) was
used in order to calculate the relevant likelihoods.
3) Co-expression of mRNAs corresponding to given proteins: Proteins that physically
interact or are required to be produced in some form of spatio-temporal order tend
to show correlations (positive or negative) in the expression of their underlying
mRNA molecules. This method has been shown to be effective in detecting
interactions in Saccharomyces cerevisiae (von Mering et al. 2002) as well as in
Arabidopsis thaliana in combination with examination of other genomic features (De
Bodt et al. 2009). Use of this system presents a comparison of an un-curated
high-throughput physical experimental system with an equivalently un-curated
computational system.
4) Use of a Bayesian classifier to combine disparate sources of evidence comprising
gene co-expression, orthology, post translational modification, co-localisation,
intrinsic disorder, domain co-occurrence and network analysis data in order to predict
protein interactions (McDowall et al. 2009).
3.1.1 Hamming distance
The distance measure Hamming distance is named after its creator Richard Hamming who
introduced it in his work (Hamming 1950). As mentioned above and in previous chapters, it
is the distance between two strings of equal length calculated as a count of the points where
they differ. A string can be represented as a vector of characters. As an illustrative example,
given the two strings x and y as defined below:
x = [c, a, t]
y = [h, a, t]
the Hamming distance between the two strings is 1 as they differ by 1 character.
In the context of phylogenetic profile analysis, a Hamming distance of 1 between two
profiles would indicate that the genes/proteins under consideration differed by only one species
in their pattern of distribution.
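The calculation can be written as a short function; the function name is illustrative only.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

Applied to the example above, hamming_distance("cat", "hat") gives 1. Applied to the two most frequent profiles in Table 2.3 (present only in Homo sapiens, and present in Homo sapiens and Pan troglodytes), it also gives 1, as the profiles differ at a single species.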
3.1.2 Comparative method
As mentioned in the introductory chapter the comparative method involves the examination
of the association of a given trait in an organism with another variable in the context of a
phylogenetic tree (Harvey and Pagel 1991). This variable can be another trait or potentially
an environmental factor. A prerequisite for this form of analysis is reconstructing putative
ancestral states at each hypothetical ancestral node within the tree (Pagel 1994). Using this
method in the context of phylogenetic profile analysis involves testing whether the presence
or absence of a given gene is associated with the presence or absence of a second gene or
protein.
The implementation of the comparative method used in this chapter is based on the
approach introduced by Mark Pagel (Pagel 1994) and utilised in the context of phylogenetic
profiling by Barker and Pagel (Barker and Pagel 2005). This approach is preferred to other
applications of the comparative method (e.g. work by Wayne Maddison (Maddison 1990)
and work by Mark Ridley (Ridley 1983)) due to the fact that it does not depend on a single
set of reconstructed ancestral states over a tree but instead calculates test statistics based on
all possible ancestral states (Barker and Pagel 2005; Pagel 1994). An explanation of how
Barker and Pagel (Barker and Pagel 2005) utilised the framework established in previous
work by Pagel (Pagel 1994) in order to analyse phylogenetic profiles follows.
3.1.2.1 Phylogenetic profile analysis using the comparative method
Imagine a gene or protein G1, which exists in a group of species and a phylogenetic tree
representing the evolutionary relationships between the members of the group. At any given
internal node (hypothetical ancestor) within the tree G1 can either be present or absent. This
state will be denoted as 0 for absent and 1 for present.
Over a given branch if the state of G1 at the ancestral node is 0 then there is a
probability of a gain, i.e. moving to state 1 at the descendant node over the time period
represented by the branch. Conversely if the ancestral state is 1 then there is a corresponding
probability of a loss. These probabilities are represented as P01(t) and P10(t) where t is equal
to the time interval represented by the branch. There are also the probabilities of no
transitions which are represented by P00(t) and P11(t). The probabilities P01(t) and P10(t) can
also be considered the rate of transitions.
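As an illustration of how such probabilities depend on branch length, the sketch below assumes that gain and loss follow a continuous-time Markov process with constant instantaneous rates, for which the two-state transition probabilities have a closed form. The rate names q01 (gain) and q10 (loss) and the function name are hypothetical and do not appear in the works cited.

```python
import math


def two_state_probabilities(q01, q10, t):
    """Transition probabilities over a branch of length t for one gene.

    q01 is the instantaneous rate of gain (0 -> 1) and q10 the rate of
    loss (1 -> 0); both are assumed constant along the branch.
    """
    total = q01 + q10
    if total == 0.0:
        # No change is possible when both rates are zero.
        return {"P00": 1.0, "P01": 0.0, "P10": 0.0, "P11": 1.0}
    changed = 1.0 - math.exp(-total * t)
    p01 = q01 / total * changed
    p10 = q10 / total * changed
    return {"P00": 1.0 - p01, "P01": p01, "P10": p10, "P11": 1.0 - p10}
```

Under this assumption P00(t) + P01(t) = 1 and P10(t) + P11(t) = 1 for any t, and on very long branches P01(t) approaches q01 / (q01 + q10) regardless of the starting state.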
The comparative method as applied to phylogenetic profiling is an examination of
whether the state of a second gene/protein has an effect on the state of the first. Thus
introducing a second gene / protein G2 there is a corresponding set of probabilities for G2. In
order to examine whether the state of G2 has an effect on the state of G1 the probabilities or
transition rates P01(t), P10(t), P00(t) and P11(t) for G1 can be split in order to factor in the state
of G2. Thus for example P01(t) can be split into two probabilities one corresponding to the rate
of gain of G1 if G2 is present and the other corresponding to the rate of gain of G1 if G2 is
absent. These transition rates were the basic parameters used by Barker and Pagel (Barker
and Pagel 2005) as represented in the following figure.
Figure 3.1: Parameters for modelling state transitions for pairs of genes as used by
Barker and Pagel (Barker and Pagel 2005) (Figure is directly reproduced).
Assuming the numbers at the corners of the above figure represent the states of G1 and G2,
the parameters as presented above represent the following rates of transitions.
Parameter  Description
q13        Rate of gain for G1 given the absence of G2.
q31        Rate of loss for G1 given the absence of G2.
q12        Rate of gain for G2 given the absence of G1.
q21        Rate of loss for G2 given the absence of G1.
q34        Rate of gain for G2 given the presence of G1.
q43        Rate of loss for G2 given the presence of G1.
q24        Rate of gain for G1 given the presence of G2.
q42        Rate of loss for G1 given the presence of G2.
Table 3.1: Description of rate parameters used by Barker and Pagel (Barker and Pagel 2005).
Given these rates it is possible to investigate whether the state of G2 has an effect on the
transition rates for G1. In order to carry out this investigation two competing models /
hypotheses were constructed using these parameters (Barker and Pagel 2005). One was a
dependent model, where the presence of G1 is somehow contingent on the presence or
absence of G2, and the other was an independent model, where there was no connection
(Barker and Pagel 2005). The dependent model makes an assumption that the rate of
gain/loss of G1 is somehow affected by the state of G2. Thus for example the rate of gain of
G1 in the presence of G2 (q24) will be different from the rate of gain of G1 in the absence of
G2 (q13). Conversely the independent model makes the assumption that there is no effect on
the transition rates of G1 by the state of G2. To detect gains and losses over a phylogenetic
tree Barker and Pagel calculated the likelihood of these two competing hypotheses about the
distribution of pairs of proteins in the constituent species (Barker and Pagel 2005). The
premise of the work was that the dependent model would prove a better fit to observed data
if the transition rates for a given protein were affected by the state of the other. In order to
detect correlated evolution the two competing models were thus defined as follows:
1) The independent model of evolution, where the probabilities of gain and loss of G1
were independent of the state of G2. In order to create this model the parameters
involving gain and loss of G1 were constrained to be equal irrespective of the state of G2
and vice versa. Using the symbols to define the rates shown in Figure 3.1 this entailed
setting the transition rates as q13=q24, q31=q42, q21=q43 and q12=q34. This reduced the
number of parameters for the independent model to four.
2) A dependent or correlated model of evolution that utilised all eight transition rate
parameters.
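The relationship between the two models can be made concrete by arranging the rates of Table 3.1 as a 4 x 4 instantaneous rate matrix over the joint states of G1 and G2. The sketch below is illustrative only: the function names are hypothetical, the state numbering (1 = both absent, 2 = only G2 present, 3 = only G1 present, 4 = both present) is inferred from the parameter descriptions in Table 3.1, and setting simultaneous changes in both genes to a rate of zero is an assumption implied by there being only eight parameters.

```python
RATE_NAMES = ("q12", "q13", "q21", "q24", "q31", "q34", "q42", "q43")


def rate_matrix(rates):
    """Build the 4 x 4 rate matrix from a dict of the eight named rates.

    Entry [i][j] holds q(i+1)(j+1); transitions that would change both
    genes at once (1<->4, 2<->3) stay at zero. Each diagonal entry is set
    so that its row sums to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for name in RATE_NAMES:
        i, j = int(name[1]) - 1, int(name[2]) - 1
        q[i][j] = rates[name]
    for i in range(4):
        q[i][i] = -sum(q[i])
    return q


def independent_rates(gain1, loss1, gain2, loss2):
    """Eight rates under the independent model: four free parameters."""
    return {
        "q13": gain1, "q24": gain1,  # gain of G1, whatever the state of G2
        "q31": loss1, "q42": loss1,  # loss of G1, whatever the state of G2
        "q12": gain2, "q34": gain2,  # gain of G2, whatever the state of G1
        "q21": loss2, "q43": loss2,  # loss of G2, whatever the state of G1
    }
```

The dependent model corresponds to calling rate_matrix with eight freely chosen rates, whereas the independent model passes the output of independent_rates, so that the independent model is a special case of the dependent one.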
Thus given a set of phylogenetic profiles and a phylogenetic tree, the values of the
parameters that maximised the likelihood of each of the two models were calculated in turn
for pairs of profiles. These likelihoods were calculated by summing the likelihoods of all
possible ancestral reconstructions at each internal node of the tree, thus removing the need for
reliance on a single set. Having calculated the likelihoods of both models, the goodness of
fit of the models to the observed data was compared using the likelihood ratio statistic, LR.
This can be calculated using the following equation (Yang 2006):
$$LR = -2\left(\ln(H_0) - \ln(H_1)\right) \qquad (1)$$
As applied to detection of correlated evolution, H0 was the likelihood of the
independent model of evolution and H1 was the likelihood of the dependent model of
evolution (Barker et al. 2007; Barker and Pagel 2005).
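A worked example of equation (1) is given below. The function takes the maximised log likelihoods of the two models directly; its name and the numerical values in the usage note are invented for illustration and are not results from this study.

```python
def likelihood_ratio(ln_independent, ln_dependent):
    """Likelihood ratio statistic LR = -2 (ln H0 - ln H1).

    ln_independent is the maximised log likelihood of the independent
    model (H0) and ln_dependent that of the dependent model (H1).
    """
    return -2.0 * (ln_independent - ln_dependent)
```

For example, hypothetical log likelihoods of -52.3 for the independent model and -45.1 for the dependent model give LR = -2(-52.3 + 45.1) = 14.4. Larger values of LR indicate that the dependent model fits the pair of profiles better than the independent model.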
Further work showed that constraining the rate of gain of a gene to a preset low level,
as a potentially more realistic representation of actual biological reality, improved the
ability of the method to detect functional linkages in Saccharomyces cerevisiae (Barker et al.
2007). This entailed specifying values for the parameters connected to gene gain, specifically
q13, q24, q34 and q12. The rate of gene gain is particularly low in eukaryotes, where horizontal
gene transfer is relatively rare (Whitaker et al. 2009). The rate of de novo generation of genes
over a relatively short time scale has also been observed to be extremely low. A study detected
three potential genes in the human genome which were generated de novo since the split
with P. troglodytes (Knowles and McLysaght 2009) (estimated at around 4 million years ago
(Hobolth et al. 2007)).
This method of fitting correlated and uncorrelated models of evolution to
pairs of proteins using maximum likelihood (ML) (Barker and Pagel 2005) while
constraining the rate of protein gain (Barker et al. 2007) shall from here on be referred to as
constrained ML.
3.1.3 Co-expression as measured by microarray
As mentioned in Chapter 1, a microarray is a chip, usually made of glass, with fluorescently
labelled oligonucleotide probes representing subsections of genes. The degree of fluorescence
from these probes corresponds to the abundance of a given mRNA in a sample and therefore
the level of expression (Quackenbush 2002; Wodicka et al. 1997). In a typical microarray
experiment cells are subjected to different treatments, or harvested from organisms with
differing phenotypic or disease states (e.g. cancerous tissue vs. normal in human cancer
patients). The probes on a given microarray chip are generally designed to map to a location
within a coding region of a given gene (Brown 2006).
To establish whether a given experimental condition corresponds with differential
expression of a given gene, a descriptive statistic illustrating the central tendency of
expression (usually the median) is calculated for the entire set of samples. If a given gene is
found to be expressed at a statistically significant higher level than the central value, this
gene is interpreted as up-regulated. Similarly if a gene is expressed at a significantly lower
level then that gene is interpreted as down-regulated. Probes on a microarray chip can map to
a single gene, or members of a gene family depending on the specificity of the probe design
(Heyer et al. 1999).
3.1.4 Bayesian classifier
The fourth system of prediction of functional linkage to be considered was that utilised by the
PIPs server (http://www.compbio.dundee.ac.uk/www-pips) (McDowall et al. 2009). This
system (Scott and Barton 2007) utilised a combination of sources of evidence for protein
interactions. These sources were:
Gene co-expression: As described and used above, a correlated shift in gene
expression patterns in response to a given environmental stimulus can be used to
detect protein interactions.
Orthology: If a given pair of proteins is orthologous to a pair of proteins in another
species that are known to interact, then this interaction annotation can be ported from
one species to the other (Yu et al. 2004a).
Subcellular localisation, domain co-occurrence, and post-translational modification
co-occurrence: These features of a protein can also be informative as to its interaction
partners. The PIPs system (Scott and Barton 2007) combines these as a joint source of
evidence.
Protein disorder: This measure is based on the observation that the unstructured
regions within protein molecules are often involved in transient protein interactions.
Singh et al. (2007) showed that intrinsic disorder is enriched in date hubs, proteins
that maintain multiple interactions but at different times.
Network topology similarity: This measure utilises the principle that proteins that
interact will share other interacting partners.
These five predictors are combined using a naïve Bayesian classifier to generate a single
score based on the posterior odds ratio of interaction after calculation of likelihood ratios
over each of the individual predictor modules (Scott and Barton 2007).
In order to explain what an odds ratio is it is necessary first to define odds. Odds are a
method of presenting the probability of an event by relating this probability to the probability
of the event not occurring. Thus the odds of an event are simply the probability of an event
occurring divided by the probability of the event not occurring (Sokal and Rohlf 1995). The
odds of a given event can be calculated by the equation:
\[ \mathrm{odds}(e) = \frac{p(e)}{1 - p(e)} \qquad (2) \]
An odds ratio is thus the ratio of two odds. It can be used to measure the effect size
of a given factor on the probability of an event. For example, if the probability of heart
disease in people who consume a high fat diet is 1/4 and the corresponding probability
of heart disease in individuals who do not consume a high fat diet is 1/8, then via
Equation 2 the odds of having heart disease with a high fat diet are approximately 0.33 and the
odds of having heart disease with a low fat diet are approximately 0.14. The odds ratio is thus
0.33/0.14, which is roughly equal to 2. Thus in this example a high fat diet roughly
doubles the odds of heart disease.
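The heart disease example can be checked with a short sketch of Equation 2. Exact fractions are used so that no rounding is involved; the function names are illustrative assumptions.

```python
from fractions import Fraction


def odds(p):
    """Equation 2: odds(e) = p(e) / (1 - p(e))."""
    return p / (1 - p)


def odds_ratio(p_with_factor, p_without_factor):
    """Ratio of the odds with the factor to the odds without it."""
    return odds(p_with_factor) / odds(p_without_factor)


high_fat = Fraction(1, 4)  # probability of heart disease with a high fat diet
low_fat = Fraction(1, 8)   # probability of heart disease without a high fat diet

print(odds(high_fat))                 # 1/3
print(odds(low_fat))                  # 1/7
print(odds_ratio(high_fat, low_fat))  # 7/3
```

The exact odds ratio of 7/3 is the "roughly 2" of the worked example.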
The posterior odds ratio utilised by PIPs (Scott and Barton 2007) was calculated from
a prior odds ratio, itself calculated from a prior probability of interaction estimated as
1/400. This prior odds ratio was then multiplied by the likelihood ratios yielded by each of the
individual predictor modules. The product of this calculation is the posterior odds ratio.
As in the example, the posterior odds ratio corresponds to the posterior probability of
interacting, e.g. a score of 2 translates to the probability of interacting being twice as high as
the probability of not interacting (McDowall et al. 2009; Scott and Barton 2007).
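The combination step can be sketched as follows, assuming the prior probability of 1/400 quoted above and independent evidence modules, as a naïve Bayesian classifier does. The five likelihood ratios are hypothetical values chosen for illustration and are not PIPs output.

```python
from math import prod


def posterior_odds(prior_probability, likelihood_ratios):
    """Prior odds multiplied by the likelihood ratio of every evidence module."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(o):
    """Convert odds back to a probability: p = o / (1 + o)."""
    return o / (1.0 + o)


# Hypothetical likelihood ratios from five evidence modules (illustrative only).
module_lrs = [20.0, 5.0, 2.0, 1.5, 3.0]
score = posterior_odds(1 / 400, module_lrs)
print(round(score, 2))                       # 2.26
print(round(odds_to_probability(score), 2))  # 0.69
```

A score above 1 means that interaction is more probable than non-interaction once all modules have been taken into account.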
3.2 Methods
3.2.1 Assessing quality
A common way of measuring the accuracy of a binary classification system is the use of
sensitivity and precision. Sensitivity can be defined as the probability that a true
positive is predicted and precision as the probability that a positive prediction is
correct (Baldi and Brunak 2001). In order to calculate these measures some
terminology must be introduced.
True positives (TP): The number of positive predictions made by a binary classifier that
lie within the positive training set.
False positives (FP): The number of positive predictions made by a binary classifier that
lie within the known negative training set.
False negatives (FN): The number of items in the known positive set, which were not
predicted by a binary classifier.
Given these values precision and sensitivity can be calculated as follows (Baldi and
Brunak 2001; Barker et al. 2007; von Mering et al. 2003):
\[ \mathrm{precision} = \frac{TP}{TP + FP} \qquad (3) \]
\[ \mathrm{sensitivity} = \frac{TP}{TP + FN} \qquad (4) \]
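Equations 3 and 4 can be sketched with sets of protein pairs. The representation of a pair as a frozenset (so that A-B and B-A are the same pair) and the toy identifiers are assumptions made for illustration.

```python
def precision_sensitivity(predicted, positives, negatives):
    """Return (precision, sensitivity) for a set of predicted pairs.

    Predictions found in neither reference set are ignored, following the
    definitions of TP, FP and FN given above.
    """
    tp = len(predicted & positives)   # predictions in the positive set
    fp = len(predicted & negatives)   # predictions in the negative set
    fn = len(positives - predicted)   # positives that were not predicted
    precision = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return precision, sensitivity


def pair(a, b):
    return frozenset((a, b))


positives = {pair("A", "B"), pair("C", "D"), pair("E", "F")}
negatives = {pair("A", "C"), pair("B", "D")}
predicted = {pair("B", "A"), pair("C", "D"), pair("A", "C"), pair("X", "Y")}

print(precision_sensitivity(predicted, positives, negatives))
```

In this toy case two of the three scored predictions are true positives and two of the three known positives are recovered, so both measures equal 2/3.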
3.2.2 Training and test data.

In order to calculate these values it was necessary to acquire data on known positive
interactions. Thus the data used by Scott in her development of PIPs (Scott and Barton 2007)
which was in turn derived from the HPRD (Human Protein Reference Database) (Mishra et
al. 2006) was acquired. This dataset contained 25,013 predicted protein interactions. The
dataset was then compared with the set of human proteins contained in the version of RefSeq
(Pruitt et al. 2005) downloaded in Chapter 2 and the overlap was kept leaving a positive
dataset of 6,106 proteins and 18,322 protein pairs.
A negative dataset was generated by creating a set of all possible pairs of proteins
from the full set of human proteins. In order to filter out proteins which could potentially
interact, the full set of GO (Gene Ontology) (Ashburner et al. 2000) terms associated with each
protein was downloaded. All pairs with any overlap in associated GO terms were then
excluded. As with the positive set, the negative dataset was compared to the set of human
proteins contained in the locally held version of RefSeq (Pruitt et al. 2005) and the overlap
preserved. As a final check the negative set was compared to the positive set in order to
examine whether there was any overlap. There was an overlap of 9,568 protein pairs
between the positive and negative datasets, or 52% of the positive set. This suggests that the
use of specific GO terms is no better than the selection of random pairs as a procedure for
the generation of a negative set. However, as previous work such as the PIPs procedure (Scott
and Barton 2007) considered in this chapter utilised solely random pairs, it was decided that
this procedure of GO + HPRD filtration was an improvement on that process. The process
resulted in a negative dataset of 3,216 proteins and 207,952 protein pairs.
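The filtering procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the thesis: the data structures, the function name and the toy GO identifiers are hypothetical.

```python
from itertools import combinations


def build_negative_set(go_terms, positive_pairs):
    """Candidate non-interacting pairs.

    go_terms:       dict mapping protein id -> set of GO term ids
    positive_pairs: set of frozensets, each a known interacting pair
    """
    negatives = set()
    for a, b in combinations(sorted(go_terms), 2):
        if go_terms[a] & go_terms[b]:
            continue  # shared GO term: the pair could plausibly interact
        candidate = frozenset((a, b))
        if candidate in positive_pairs:
            continue  # known interaction, in either orientation
        negatives.add(candidate)
    return negatives


go_terms = {
    "P1": {"GO:0000001"},
    "P2": {"GO:0000002"},
    "P3": {"GO:0000001", "GO:0000003"},
    "P4": {"GO:0000004"},
}
positive_pairs = {frozenset(("P4", "P2"))}

negative_pairs = build_negative_set(go_terms, positive_pairs)
print(len(negative_pairs))  # 4
```

Of the six possible pairs, P1-P3 is dropped for sharing a GO term and P2-P4 for being a known interaction.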
To use the datasets effectively as an objective measure of quality as well as a training
tool for the evaluation of optimal rates at which to constrain levels of gene gain (Barker et al.
2007) for BayesTraits (Pagel et al. 2004a) the datasets were randomly split into two halves.
This was done in order to cross validate the predictive power of any proposed optimal rate of
gain. This process yielded a positive training set of 4,868 proteins / 9,161 protein pairs and a
negative training set of 3,216 proteins / 103,971 protein pairs. The second half of the dataset
was marked as testing data and contained a positive testing set of 4,796 proteins / 9,161
protein pairs and a negative testing set of 3,215 proteins / 103,974 protein pairs. The sizes of
the two negative sets are uneven as 5 pairs of proteins had to be removed from the negative
training set, as they were present in a B-A orientation in the positive set. Similarly two pairs
of proteins had to be removed from the negative testing set.
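A random split of this kind can be sketched as below. The seed, the function name and the use of frozensets (which make a B-A pair identical to its A-B counterpart) are assumptions for illustration.

```python
import random


def split_in_half(pairs, seed=0):
    """Randomly split a set of pairs into two halves (training, testing)."""
    # Sort first so that the result depends only on the seed.
    ordered = sorted(tuple(sorted(p)) for p in pairs)
    rng = random.Random(seed)
    rng.shuffle(ordered)
    mid = len(ordered) // 2
    training = {frozenset(p) for p in ordered[:mid]}
    testing = {frozenset(p) for p in ordered[mid:]}
    return training, testing


pairs = {frozenset((f"P{i}", f"Q{i}")) for i in range(10)}
training, testing = split_in_half(pairs, seed=1)
print(len(training), len(testing))  # 5 5

# Orientation-independent check of a negative half against the positive set:
# frozenset(("B", "A")) == frozenset(("A", "B")), so subtracting the positive
# set removes a negative pair however its two members are ordered.
```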
The ratio of the size of negative to positive datasets in this case was roughly 11 to
1. This is biologically unrealistic, as current estimates of the size of the full human interactome
range from 154,000-369,000 (Hart et al. 2006) to 650,000 (Stumpf et al. 2008). Stumpf et al.
estimated the potential size of the interactome by treating known experimentally verified data
as a sub-network of the true network and extrapolating from the sub-network to the full
network (Stumpf et al. 2008). Hart et al., on the other hand, employed the idea that two independent
samples (experiments) from the complete interactome or subspace of the interactome of size
N would be expected to share k interactions by random chance under the hypergeometric
distribution (Hart et al. 2006). Thus Hart et al. estimated the size of N using the
intersections actually observed between experiments (Hart et al. 2006).
If these numbers are subtracted from the number of all potential interactions,
1,120,441,729 (calculated as all possible pairs from the version of RefSeq held), the remaining
ratios of negative to positive range from 1,722:1 to 8,617:1. For any full genome-wide survey
it would be necessary to scale all the precision and sensitivity scores from the training set
ratios to ratios constructed from estimates of the interactome size. This issue is addressed
more fully in Chapter 5.
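The ratio calculation can be sketched for the upper interactome estimate. The total number of potential pairs is taken from the figure quoted in the text; the function name is an illustrative assumption.

```python
def negative_to_positive_ratio(total_pairs, interactome_size):
    """Non-interacting pairs per interacting pair for a given interactome size."""
    return (total_pairs - interactome_size) / interactome_size


TOTAL_PAIRS = 1_120_441_729  # all potential pairs, as quoted in the text

# Upper estimate of the interactome size (Stumpf et al. 2008).
ratio = negative_to_positive_ratio(TOTAL_PAIRS, 650_000)
print(int(ratio))  # 1722
```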
The ability of Hamming distance to differentiate between the positive and negative training
set was measured with a lower distance corresponding to a higher score. Precision/sensitivity
were evaluated at every integer within a range of Hamming distance cut-offs ranging from 0
to 54.
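The Hamming distance evaluation can be sketched as follows, assuming that each phylogenetic profile is a string of presence (1) and absence (0) calls in a fixed species order. The toy profiles and function names are hypothetical.

```python
def hamming_distance(profile_a, profile_b):
    """Number of species in which two presence/absence profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    return sum(x != y for x, y in zip(profile_a, profile_b))


def predictions_at_cutoff(profiles, pairs, cutoff):
    """Pairs whose profiles differ in at most `cutoff` species."""
    return {
        (a, b)
        for a, b in pairs
        if hamming_distance(profiles[a], profiles[b]) <= cutoff
    }


profiles = {"A": "110100", "B": "110110", "C": "001011"}
pairs = [("A", "B"), ("A", "C"), ("B", "C")]

print(hamming_distance(profiles["A"], profiles["B"]))  # 1
print(predictions_at_cutoff(profiles, pairs, cutoff=1))  # {('A', 'B')}
```

Repeating `predictions_at_cutoff` for every integer cut-off and scoring each result against the training sets gives one precision/sensitivity point per cut-off.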
3.2.4 Constrained ML
To use phylogenetic profiling in a phylogenetically aware manner to detect correlations in
gain and loss the software package BayesTraits was utilised (Pagel et al. 2004a). This has
been used in previous work (Barker and Pagel 2005) to demonstrate that detection of
correlations in gain and loss of particular genes can be used as a tool with which to detect
functional interactions.
The script bms_runner (Barker et al. 2007) was used to examine the performance of
different rates of gain in predicting functional interactions amongst the training sets in order
to select an optimal rate. The script utilised the phylogenetic profiles and phylogeny
described in Chapter 2 as well as the positive and negative training sets to evaluate the
performance of different rates of gain. bms_runner (Barker et al. 2007) creates input for the
program BayesTraits (Pagel et al. 2004a) to evaluate the relative likelihood of correlated
evolution at a range of rates of gain. bms_runner creates a non-redundant set of profiles
(Barker et al. 2007) before passing them on to BayesTraits for comparisons. Thus 113,132
protein pairs in the training set were reduced to a set of 54,906 non-redundant pairs of
profiles.
A number of rates of gain were evaluated for precision and sensitivity over the
training data, ranging from 1 × 10^-6 up to placing no restriction on gain. An LR score was
calculated for each profile pair at each rate of gain and assigned to each protein pair
corresponding to that profile pair. bms_runner then evaluates precision and sensitivity at a
range of cut-offs, commencing at the minimum LR encountered and moving up by a decreasing interval
until a value close to the maximum LR is reached (Barker et al. 2007). The program then
provides a table that includes the following information for this range of cut-offs.
LR cutoff | No of predictions | Precision | Sensitivity
Table 3.2: Column headings for data matrix returned by bms_runner (Barker et al. 2007).
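A table with these four columns can be sketched as below. This is not the bms_runner implementation: the cut-offs are supplied by the caller rather than generated at a decreasing interval, and the scores and pair names are hypothetical.

```python
def cutoff_table(scores, positives, negatives, cutoffs):
    """Rows of (LR cutoff, number of predictions, precision, sensitivity).

    scores maps each pair to its LR; a pair is predicted as functionally
    linked when its LR is greater than or equal to the cut-off.
    """
    rows = []
    for cutoff in cutoffs:
        predicted = {pair for pair, lr in scores.items() if lr >= cutoff}
        tp = len(predicted & positives)
        fp = len(predicted & negatives)
        fn = len(positives - predicted)
        precision = tp / (tp + fp) if tp + fp else float("nan")
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        rows.append((cutoff, len(predicted), precision, sensitivity))
    return rows


scores = {("A", "B"): 60.0, ("C", "D"): 12.0, ("A", "C"): 8.0, ("B", "D"): 3.0}
positives = {("A", "B"), ("C", "D")}
negatives = {("A", "C"), ("B", "D")}

for row in cutoff_table(scores, positives, negatives, cutoffs=[0, 10, 50]):
    print(row)
```

In this toy case precision rises from 0.5 to 1 as the cut-off increases, while sensitivity falls from 1 to 0.5 at the strictest cut-off.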
3.2.5 Co-expression of mRNA
The co-expression of two genes in association with a given environmental condition can be
considered a potential indicator of functional linkage. In order to examine the performance of
the ML reconstruction method in predicting protein functional interactions against the
co-expression of mRNA, the results of all microarray experiments held in the EBI's ArrayExpress
database were downloaded.
This data was pre-processed and thus contained expression data at the gene level
rather than at the probe level. As oligonucleotide probes only map to small subsections of a
gene and can also hybridise with multiple targets, the relationship between probes and genes is
many-to-many. This many-to-many relationship was collapsed by data processing carried out
by ArrayExpress on each individual experiment.
Thus a total of 377 experiments were downloaded. Each experiment record contained
information on genes whose expression level varied significantly in response to the
experimental treatment/tissue state. Individual experiments ranged in size from 1
gene to a maximum of 15,987 genes. The mean number of genes showing significant variation per
experiment was approximately 3,143.
A sample line from the downloaded data is shown below for illustrative purposes:
Gene Symbol | Ensembl ID | Species | Factor | Value | Accession | Expression | p Value
STAT1 | ENSG00000115415 | Homo sapiens | Disease state | normal | E-GEOD-3790 | DOWN | 0.0423247888020165
Table 3.3: Sample processed data from ArrayExpress for experiment E-GEOD-3790 (Hodges
et al. 2006).
E-GEOD-3790 is a study on gene expression in brain tissue afflicted with Huntington's
disease (Hodges et al. 2006). The factor column has a number of potential values
corresponding to the annotation of the individual samples. In this case it corresponds to
whether the tissue comes from a patient diagnosed with Huntington's disease as opposed to
normal tissue. The value column shows the value of the factor (in this case normal). The p
value column shows the significance of the identified differential expression. Thus the data
presented above shows that the gene STAT1 is significantly (p<0.05) down-regulated in
tissue annotated as normal.
To use the training datasets to measure the ability of gene co-expression to predict
protein interactions it was necessary to convert the training data from protein pairs to gene
pairs. Using a translation key provided by the International Protein Index (IPI) (Kersey et al.
2004) each RefSeq Gi number was mapped to the associated gene name. As individual genes
can produce multiple proteins via the process of alternative splicing, there is not a one-to-one
correspondence between the number of genes and the number of proteins in either training set.
With some Gi entries missing in the translation key, this created a positive gene training set of
4,319 genes / 8,057 gene pairs and a negative set of 2,833 genes / 89,549 gene pairs.
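The conversion from protein pairs to gene pairs can be sketched as follows. The identifiers and the dictionary standing in for the IPI translation key are hypothetical; pairs with a missing entry are dropped, and isoforms of the same gene collapse to a single gene pair.

```python
def to_gene_pairs(protein_pairs, protein_to_gene):
    """Map protein pairs to a non-redundant set of gene pairs."""
    gene_pairs = set()
    for a, b in protein_pairs:
        gene_a = protein_to_gene.get(a)
        gene_b = protein_to_gene.get(b)
        if gene_a is None or gene_b is None:
            continue  # identifier missing from the translation key
        gene_pairs.add(frozenset((gene_a, gene_b)))
    return gene_pairs


# Hypothetical translation key: gi1 and gi2 are isoforms of the same gene.
protein_to_gene = {"gi1": "GENE1", "gi2": "GENE1", "gi3": "GENE2"}
protein_pairs = [("gi1", "gi3"), ("gi2", "gi3"), ("gi1", "gi4")]

gene_pairs = to_gene_pairs(protein_pairs, protein_to_gene)
print(len(gene_pairs))  # 1
```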
In order to predict functional linkage the following procedure was followed.
The data in each microarray experiment was split according to experimental
condition.
For each experimental condition pairs of genes were marked as functionally linked if
their expression levels both went up or both went down in response to the given
experimental condition.
True positives were counted if the genes existed as a pair in the positive set.
False positives were counted if the genes existed as a pair in the negative set.
False negatives were counted as the pairs in the positive set that were absent from the
set of predictions.
This data was processed using a program implemented in Java, which followed the steps
presented below. The input to the program was a file containing a set of lines as shown above
for a single experiment.
Create a non-redundant set of factors present in the experiment.
Create a non-redundant set of values for each factor.
Create an empty set L to hold functionally linked pairs.
For each factor F:
    For each factor value FV:
        For each expression value E (UP or DOWN):
            Create a set S of all genes possessing the attributes {F, FV, E} where the
            associated p value for E is less than 0.05.
            All genes within S are declared functionally linked.
            All possible pairs of genes within S are added to L.
L is now evaluated against the training data for precision/sensitivity.
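The steps above can be sketched in Python as follows. The original program was implemented in Java; this version is an illustrative re-expression under stated assumptions, with a hypothetical record layout of (gene, factor, factor value, expression, p value) and placeholder gene names other than STAT1.

```python
from collections import defaultdict
from itertools import combinations


def linked_pairs(records, alpha=0.05):
    """Build the set L of functionally linked gene pairs for one experiment."""
    # Group significant genes by their (F, FV, E) attributes: one set S per group.
    groups = defaultdict(set)
    for gene, factor, value, expression, p_value in records:
        if p_value < alpha:
            groups[(factor, value, expression)].add(gene)

    # Every pair of genes within the same S is declared functionally linked.
    linked = set()
    for genes in groups.values():
        linked.update(frozenset(p) for p in combinations(sorted(genes), 2))
    return linked


records = [
    ("STAT1", "Disease state", "normal", "DOWN", 0.0423),
    ("GENE_B", "Disease state", "normal", "DOWN", 0.01),
    ("GENE_C", "Disease state", "normal", "UP", 0.01),
    ("GENE_D", "Disease state", "normal", "DOWN", 0.20),
]

L = linked_pairs(records)
print(L == {frozenset(("STAT1", "GENE_B"))})  # True
```

GENE_C is excluded because it moves in the opposite direction and GENE_D because its p value is not below 0.05.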

The PIPs server offers its data split into six score cut-offs: 0.25, 1, 2.5, 25, 250 and 2500
(McDowall et al. 2009). These datasets were downloaded and evaluated for precision and
sensitivity at each of these cut-offs. For each cut-off the server provides a file with pairs of
proteins and the associated posterior odds ratio score. Each pair of proteins within each file
was declared to be functionally linked and then evaluated against the training data. This
provided associated precision/sensitivity scores for each cut-off.
3.3 Results
Figure 3.2 shows the performance of phylogenetic profiling over the training sets using
Hamming distance.
Figure 3.2: Performance of phylogenetic profiling using Hamming distance over the training
data.
Hamming distance as a measure does not perform well over the training data
achieving a maximum precision of 0.08796.
3.3.2 Constrained ML
The ability of constrained ML (Barker et al. 2007; Barker and Pagel 2005) to distinguish
between the training data was tested at a number of rates of gain. The results of this can be
seen in Figure 3.3.
Figure 3.3: Performance of constrained ML (Barker et al. 2007; Barker and Pagel 2005) over
training data at different rates of gain.
Figure 3.3 shows points across the range of sensitivity from 0 to 1. Sensitivity over the
whole training set ranges from 1 at an LR cut-off of 0, where all pairs of proteins are
predicted to be functionally linked, to 0 at the points where no pairs from the positive set are
predicted to be functionally linked.
Precision ranges from 0.0809 (a base level that is derived from the proportion of the
training data made up by the positive set) up to 1, which is the point at which all
predictions made at a given LR cut-off lie in the positive set, i.e. are true positives.
Of the two metrics (precision/sensitivity) it is precision that appears to be the
strong suit of constrained ML. This is probably due to the fact that correlated evolution will
not occur in all cases of protein interactions. A large number of protein interactions will
contain members that are phylogenetically ubiquitous. In some of these cases the interaction
will be essential to maintenance of normal eukaryotic cellular function. In other cases, even if
an interaction is being lost or gained in an organism, its individual members might still be
present (the interaction being lost due to some form of temporal/spatial separation of the
members). Thus, as low sensitivity is inevitable with this method, it was decided to focus on
rates of gain that achieve 100% precision. Figure 3.4 places all rates of gain on a single plot
and zooms into a range of sensitivities between 0 and 0.001.
Figure 3.4: Performance of constrained ML over
training data magnified to a scale of sensitivity ranging from 0 to 0.001. For clarity some of
the worse performing rates of gain are removed.
Figure 3.4 shows that the rate 0.025 is the clear best performer as it delivers
predictions with a precision of 1 at the highest sensitivity. The LR cut-off at this point is
58.54. The sensitivity at this cut-off for the rate 0.025 is 0.000545. This rate is thus chosen as
the exemplar rate to represent the method in comparisons and to utilise for further analysis.
The findings of Barker et al. (2007) were borne out in this investigation, as lower rates of gain
were generally the best performers.
Over the training data constrained ML (Barker et al. 2007; Barker and Pagel 2005)
with the rate of gain constrained to 0.025 makes five predictions from the positive set, shown
in Table 3.1.
RefSeq Accessions | Annotation Protein A | Annotation Protein B | Interaction type | Verified By
NP_001789, NP_004144 | Cell division protein kinase 2 | Origin recognition complex subunit 1 | Direct | Protein microarray (Ramachandran et al. 2004)
NP_005617, NP_006266 | Splicing factor, arginine/serine-rich | Splicing factor, arginine/serine-rich 6 | Complex | Site-directed mutagenesis (Monsalve et al. 2000)
NP_001789, NP_001790 | Cell division protein kinase 2 | Cyclin dependent kinase 7 | Direct | In-vitro experimentation (Garrett et al. 2001)
NP_001347, NP_003391 | DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 3 | Exportin 1 | Direct | In-vivo/in-vitro experimentation (Yedavalli et al. 2004)
NP_066953, NP_000935 | Peptidyl-prolyl cis-trans isomerase A | Serine/threonine-protein phosphatase 2B catalytic subunit alpha | Direct | Yeast 2-hybrid (Stelzl et al. 2005)
Table 3.1: True positive proteins predicted by constrained ML (Barker et al. 2007; Barker
and Pagel 2005) from the training data with rate of gain constrained to 0.025 at an LR cutoff
of 58.54.
The examination of the training datasets yielded a similar result to Barker et al. (2007)
inasmuch as lower rates of gain tended to perform better.
As 0.025 was selected as the optimum rate of gene gain over the training data it was
also tested on the testing data to cross validate this selection. Figure 3.5 shows the results of
this cross validation check.
Figure 3.5: Performance of constrained ML over test data with rate of gain constrained to 0.025.
Figure 3.5 shows that constraining the rate of gain to 0.025 also achieves a precision
of 1 over the testing data. This precision occurs at an LR cutoff of 53.3. The sensitivity at
this point is 0.00054. At this cutoff constrained ML (Barker et al. 2007; Barker and Pagel
2005) makes 5 predictions that are true positives as shown in Table 3.2.
RefSeq Accessions | Annotation Protein A | Annotation Protein B | Interaction type | Verified By
NP_002583, NP_001347 | proliferating cell nuclear antigen | ATP-dependent RNA helicase DDX3X isoform 1 | Direct | In-vitro experimentation (Ohta et al. 2002)
NP_001118, NP_001119 | adaptor-related protein complex 1 beta 1 subunit isoform a | AP-1 complex subunit gamma-1 isoform b | Direct | Yeast 2-hybrid (Takatsu et al. 2001)
NP_000391, NP_001790 | TFIIH basal transcription factor complex helicase XPD subunit isoform 1 | cell division protein kinase 7 | Direct | In-vitro experimentation (Coin et al. 1998)
NP_005517, NP_004497 | heat shock factor protein 1 | heat shock factor protein 2 isoform a | Direct | In-vivo/in-vitro experimentation (He et al. 2003)
NP_066953, NP_000936 | peptidyl-prolyl cis-trans isomerase A | calcineurin subunit B type 1 | Complex | In-vitro experimentation (Huai et al. 2002)
Table 3.2: True positive proteins predicted by constrained ML (Barker et al. 2007; Barker
and Pagel 2005) from the testing data with rate of gain constrained to 0.025 at an LR cutoff
of 50.217.
3.3.2.1 Likelihood ratio statistic
The likelihood ratio statistic (LR) derived from the comparison of the independent and
dependent models of evolution is asymptotically distributed as a χ² variate with degrees of
freedom equal to the difference in the number of parameters between the two models, which in
this case equals 4, under assumptions about the size of the phylogeny and the speed of
evolution of the character under consideration (Barker and Pagel 2005; Pagel 1997). Thus if
the LR falls within the critical region of the distribution it is considered significant. A
histogram showing the theoretical χ² distribution with 4 degrees of freedom is shown below
in Figure 3.6.
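For 4 degrees of freedom the upper tail probability of the χ² distribution has a simple closed form, P(X > x) = exp(-x/2)(1 + x/2), so the significance of an LR can be sketched without a statistics library. The function names and the 0.05 significance level are illustrative assumptions.

```python
import math


def chi2_sf_4df(x):
    """Upper tail probability P(X > x) for a chi-squared variate with 4 df."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def is_significant(lr, alpha=0.05):
    """True if the LR lies in the critical region at level alpha."""
    return chi2_sf_4df(lr) < alpha


print(round(chi2_sf_4df(9.488), 3))  # 0.05
print(is_significant(13.0))          # True
print(is_significant(5.0))           # False
```

The value 9.488 is the standard tabulated 5% critical value for 4 degrees of freedom.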
Figure 3.6: Theoretical χ² distribution with 4 degrees of freedom.
The distribution of LRs in the positive and negative set as well as over the combined
training data differs from this theoretical distribution as can be seen in Figures 3.7, 3.8, 3.9
and 3.10.
Figure 3.7: Distribution of likelihood ratio statistic for constrained ML with the rate
of gain constrained to 0.025 over the positive training set.
Figure 3.8: Distribution of likelihood ratio statistic for constrained ML with the rate
of gain constrained to 0.025 over the negative training set.
Figure 3.9: Distribution of likelihood ratio statistics for constrained ML (Barker et al.
2007; Barker and Pagel 2005) with the rate of gain constrained to 0.025 over the complete
training dataset.
Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum
0.08932 | 7.85 | 10.51 | 11.14 | 13.79 | 74.80
Table 3.3: Descriptive statistics for the distribution of likelihood ratios for the rate of gain
0.025 over the complete training data.
Figure 3.10: Distribution of likelihood ratio statistics for constrained ML within the
rate of gain 0.025 over the complete training dataset, the positive training dataset and the
negative training dataset compared with the theoretical χ² distribution with 4 degrees of
freedom.
The distribution of LR statistics over the training data seems to differ from the
theoretical χ² distribution with 4 degrees of freedom.
This distribution was also tested via a two-sample Kolmogorov-Smirnov test for
goodness of fit between a generated theoretical χ² distribution with 4 degrees of freedom and
the LR statistic score distribution over the training data using R (R Development Core Team
2011). This also showed a difference between the two distributions (D=0.9993, p-value<2.2e-16).
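The D statistic of a two-sample Kolmogorov-Smirnov test can be sketched in Python as below. The analysis in this chapter was carried out in R; this version only illustrates the statistic and does not compute a p value. The LR values are hypothetical, and the χ² sample with 4 degrees of freedom is drawn as a gamma variate with shape 2 and scale 2.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


rng = random.Random(0)
# Chi-squared with 4 degrees of freedom is a gamma distribution (shape 2, scale 2).
theoretical = [rng.gammavariate(2.0, 2.0) for _ in range(1000)]

# Hypothetical LR scores, for illustration only.
observed_lrs = [7.9, 9.2, 10.5, 11.1, 12.4, 13.8, 15.0]

print(ks_statistic(observed_lrs, theoretical))
```

D ranges from 0 for identical empirical distributions to 1 for samples that do not overlap at all.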
This may be due to a violation of assumptions of the model with regards to the speed
of character transition.
The overall frequency of higher LR statistics does appear to be higher in the positive
set which is further validation for the constrained ML method.
3.3.3 Co-expression of mRNA
The results for each microarray experiment measured over the training data are given below
in Figure 3.11.
Figure 3.11: Precision/ sensitivity results for 377 microarray experiments over the
training datasets.
As before the area of interest in Figure 3.11 is the point at which precision equals 1.
This is because the average correlation between transcript abundance and peptide abundance
has been observed to be fairly low in primates at around 0.33 (Fu et al. 2007). Thus mRNA
co-expression is unlikely to be capable of high sensitivities in protein-protein interaction
detection. Figure 3.12 is a magnification of this area.
Figure 3.12: precision/ sensitivity results for microarray experiments over the training
datasets magnified to a scale of sensitivity ranging from 0 to 0.01.
Mean precision over all 377 microarray experiments was 0.2141 and mean sensitivity
was 0.1195. Of the 377 experiments, 18 achieved a precision of 1. Details of these
experiments are shown in Table 3.4.
Accession | Size | Description of experiment | Sensitivity
E-GEOD-4567 | 166 | Transcription profiling of human pulmonary artery endothelial cell culture treated with Chapel Hill Ultrafine particle. | 0.0006547359
E-GEOD-2280 | 168 | Transcription profiling of oral cavity samples from human squamous cell carcinoma patients (O'donnell et al. 2005). | 0.0003274752
E-GEOD-3183 | 255 | Transcription profiling of human bronchial cell line treated with IL-13 to better understand early cytokine-mediated mechanisms that lead to asthma. | 0.0002183168
E-GEOD-994 | 266 | Transcription profiling of human intra-pulmonary airways and buccal mucosa to identify the effects of cigarette smoke on the human airway epithelial cell transcriptome (Spira et al. 2004). | 0.0002183168
E-GEOD-2152 | 474 | Transcription profiling of human uterine fibroids with mutated or wild type fumarate hydratase gene (Vanharanta et al. 2006). | 0.0008728860
E-GEOD-2504 | 28 | Transcription profiling of untreated, HIV-1 vector-infected and TNFalpha-treated human Jurkat T cells (Lewinski et al. 2005). | 0.0002182929
E-GEOD-4748 | 191 | Transcription profiling of human dendritic monocytes treated with LPS (lipopolysaccharide) or CyP (Cyanobacterial Product) (Macagno et al. 2006). | 0.0004366336
E-MEXP-1224 | 538 | Transcription profiling of human colon samples from patients who have colorectal cancer recurrence or are recurrence-free (Garman et al. 2009). | 0.0030511060
E-GEOD-7664 | 254 | Transcription profiling of human PBMC response to benzene metabolites (Gillis et al. 2007). | 0.0008728860
E-GEOD-2361 | 87 | Transcription profiling of 36 normal human tissue types to identify tissue-specific genes (Ge et al. 2005). | 0.0003274752
E-GEOD-1739 | 212 | Transcription profiling of blood samples from human patients with severe acute respiratory syndrome (SARS) (Reghunathan et al. 2005). | 0.0002182929
E-TABM-577 | 95 | Transcription profiling of human placenta from women presenting at term with villitis of unknown etiology (Kim et al. 2009). | 0.0001091584
E-GEOD-2018 | 129 | Transcription profiling of human bronchoalveolar lavage samples collected from lung transplant recipients with rejection states determined at the time of sample collection (Lande et al. 2003). | 0.0001091584
E-GEOD-1786 | 126 | Transcription profiling of human male vastus lateralis muscle samples from healthy and COPD subjects before and after 3 months of training (Radom-Aizik et al. 2005). | 0.0008728860
E-GEOD-2624 | 293 | Transcription profiling of human tetracycline-regulated cell line expressing an NF-kB inhibitor to systematically identify NF-kB dependent genes (Tian et al. 2005). | 0.0014181302
E-MEXP-714 | 55 | Transcription profiling of human hepatitis C virus replicon cell line treated with interferon-alpha 2a in a time series. | 0.0002182929
E-GEOD-9770 | 1225 | Transcription profiling of human neurons from different brain regions derived from individuals with mild cognitive impairment. | 0.0008731718
E-GEOD-403 | 333 | Transcription profiling time series of the cAMP-induced decidualization of human endometrial stromal cells (Tierney et al. 2003). | 0.0007641087
Table 3.4: Microarray experiments achieving a precision of 1.

The highest scoring microarray experiment was E-MEXP-1224, an investigation into
whether there was a difference in expression profiles between the colorectal tissue of patients
who had recurrent cancer and those who remained clear (Garman et al. 2009). The sensitivity
of this experiment was 3.0511 × 10^-3 with a precision of 1.
The ability of the PIPs (McDowall et al. 2009) server to predict functional interaction over
the training set was evaluated at 6 cutoffs. The results can be seen in Figure 3.13.
Figure 3.13: Precision/ sensitivity results for predictions from the PIPs server over six
cut-offs over the training dataset.
Figure 3.14: precision/ sensitivity results for predictions from the PIPs server over six cutoffs
over the training dataset zoomed in to a maximum sensitivity of 0.15.
None of the score cut-offs over the predictions from the PIPs server achieved a full
precision of 1. However none of them fell under 0.9 either as seen in Table 3.5.
Cutoff | Predictions | Precision | Sensitivity
0.25 | 79441 | 0.9135546 | 0.14366504
1.00 | 37606 | 0.9395973 | 0.11068444
2.50 | 25598 | 0.9533333 | 0.09825928
25.00 | 5394 | 0.9949239 | 0.04742318
250.00 | 1232 | 0.9865772 | 0.01832689
2500.00 | 498 | 0.9883721 | 0.01067973
Table 3.5: precision/ sensitivity results for predictions from the PIPs server over six cutoffs.
3.3.5 Method Comparison

Three out of the four methods evaluated are able to discriminate between the negative and
positive examples in the training data with varying degrees of success. Figure 3.12 shows all
three methods charted on the same plot. Examination of phylogenetic profiles via detection
of correlated evolution using maximum likelihood is represented by the single optimum rate
of gene gain of 0.025, as the object of the ML correlated evolution training step was the
selection of this optimum rate of gain.
Figure 3.12: All methods compared over training dataset. Legend explanation (PIPs=PIPs
server, MA= microarray experiment and PP= phylogenetic profiling measuring correlation in
gain and loss over a phylogeny with constrained rate of gain).
3.4 Discussion
Arguably the best performing method out of all three methods is the PIPs server (McDowall
et al. 2009) as it achieves the highest rates of combined precision and sensitivity over the
training data. The success of the PIPs server in terms of accuracy and coverage is attributable
to its use of multiple, disparate sources of evidence. The other two methods both focus on
particular types of interactions.
Phylogenetic profiling measured with constrained ML over a phylogeny is limited to
proteins that have been gained and lost in a correlated fashion over a phylogeny. Thus protein
interactions between phylogenetically ubiquitous partners cannot be detected. Similarly it
cannot detect interactions between interactors with potentially redundant partners.
Microarrays are more flexible in the types of interaction they are capable of detecting.
However individual experiments are limited in the types of interactions that they can uncover
by the experimental conditions under which their constituent mRNAs were extracted. They
are also biased toward stable complexes (von Mering et al. 2002). Another limitation in the
use of microarray experiments in the prediction of protein interactions is the fact that
expression levels of a gene at the transcription level do not correlate strongly with overall
levels of protein production at the translational level (Gygi et al. 1999). This is due to
regulation at the posttranscriptional level by factors such as mRNA half-life, codon usage and
ribosome occupancy and density (Wu et al. 2008). The best performing microarray
experiments outperformed constrained ML in terms of sensitivity.
However given the difference in cost and labour intensiveness between a microarray
experiment and a computational analysis employing phylogenetic profiling, the latter can
clearly be a useful tool in the functional annotation of identified genes within a newly
sequenced genome.
3.4.1 Low Sensitivities
None of the methods as described and utilised above are particularly sensitive in
detecting protein-protein interactions. Constrained ML and gene co-expression are insensitive
to protein-protein interactions for the reasons described above.
The PIPs server as the best performer achieves a sensitivity of 0.14 at a high level of
precision. However this still corresponds to a 14% chance of detecting a possible protein
interaction despite its integration of various forms of supporting evidence. It is possible that it
is this integration of evidence that renders PIPs insensitive. If for example the likelihood ratio
returned by one of its predictor modules was high with the rest all being low, the overall
posterior odds ratio score would be low. Thus the individual sensitivities of the module
predictors are averaged out.
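The effect described above can be illustrated with a small numerical sketch. It assumes that module likelihood ratios are combined multiplicatively into a posterior odds ratio (a naive Bayes style combination); the function name and all numbers below are hypothetical and are not taken from the PIPs server.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine per-module likelihood ratios multiplicatively.

    Assumption: modules are treated as independent, so the posterior odds
    are the prior odds multiplied by every module's likelihood ratio.
    """
    return prior_odds * prod(likelihood_ratios)


# Hypothetical values, chosen only to illustrate the argument.
one_strong_module = [50.0, 0.2, 0.2, 0.2]   # one high ratio, the rest low
all_moderate = [3.0, 3.0, 3.0, 3.0]         # modest support from every module

strong_score = posterior_odds(1.0, one_strong_module)
moderate_score = posterior_odds(1.0, all_moderate)
print(strong_score, moderate_score)
```

Under these assumed values the pair supported strongly by a single module ends with odds below 1, while the pair with modest support from every module ends with odds well above 1.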
It seems that maximising coverage of the interactome is beyond the scope of each of
the predictive methods considered in this chapter. To use the analogy of the interactome as a
dark room, none of these methods are equivalent to an overhead light that illuminates every
corner of the room. Rather each method is more like a lamp that casts a pool of light on its
immediate surroundings. It is only by lighting a number of these lamps that the entire room
can be illuminated.
Chapter 4
Design and implementation of data filter
4.1. Introduction
The constrained maximum likelihood (ML) method used to detect proteins which share
correlated evolutionary histories as described in Chapter 3 and in work by Barker et al.
(Barker et al. 2007; Barker and Pagel 2005) estimates values for parameters which model the
transition rates of the gain and loss of discrete characters (Pagel 1994) by integrating over all
possible ancestral states at each node within the phylogenetic tree.
As pointed out by Barker (Barker et al. 2007) placing a constraint on the rate of
acquisition of new proteins increases the ability of the likelihood method to discriminate
between proteins that interact and those that do not. The determination of an optimum rate of
gain reduces the scale of the problem of parameter estimation (Barker and Pagel 2005) as it
reduces the numbers of parameters to be fitted to 2 for the independent model and 4 for the
dependent model.
The detection of potential functional interactors for a single given protein using this
method is possible, however given the low sensitivity of the method (see Chapter 3) the
probability of detecting a functional interaction for any given single protein or even a set of
proteins is low. A complete genome-wide survey however would detect all protein pairs that
displayed evidence of correlated evolution.
The procedure is however prohibitively slow for a complete genome-wide survey
without access to a significant amount of computing power. A timed training run over the
training dataset for a single rate of gain took approximately 110 CPU-hours to complete
54,906 comparisons of non-redundant phylogenetic profile pairs on a single core of a 3 GHz
dual-core Intel Xeon processor (see Section 4.5). As there are 60,615,555 possible non-redundant
pairs of phylogenetic profiles in the version of the human proteome currently held,
a full genome comparison would take 121,825.05 CPU-hours or 13.9 CPU-years on a
single core of the same dual-core 3 GHz Intel Xeon processor. The speed of constrained ML (Barker
et al. 2007; Barker and Pagel 2005) was also measured in work presenting a genome order
based approach to phylogenetic profiling (Cokus et al. 2007). In this case it was found to
range between 5 and 15 seconds per pair of proteins (Cokus et al. 2007). This caused the authors
to utilise a subset of their data in their benchmarking study of constrained ML (Cokus et al.
2007).
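The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the variable names are illustrative, and a CPU-year is assumed to be 365 days of 24 hours.

```python
# Figures quoted in the text.
comparisons_timed = 54_906        # pairs compared in the timed training run
total_pairs = 60_615_555          # non-redundant profile pairs in the proteome
full_survey_hours = 121_825.05    # quoted cost of a full genome comparison

# Implied cost per comparison.
hours_per_pair = full_survey_hours / total_pairs
seconds_per_pair = hours_per_pair * 3600

# Implied duration of the timed run, and the survey cost in CPU-years
# (assumption: one CPU-year = 365 days of 24 hours).
timed_run_hours = hours_per_pair * comparisons_timed
cpu_years = full_survey_hours / (24 * 365)

print(f"{seconds_per_pair:.2f} s per pair")
print(f"{timed_run_hours:.2f} CPU-hours for the timed run")
print(f"{cpu_years:.1f} CPU-years for the full survey")
```

The implied cost per pair is a little over 7 seconds, which falls inside the 5 to 15 second range reported by Cokus et al. (2007).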
Potentially access to multi-core CPUs and/or computing clusters could ameliorate this
to a certain extent. As application of constrained ML (Barker et al. 2007; Barker and Pagel
2005) involves sequential comparison of pairs of phylogenetic profiles, it is a process that is
easily amenable to parallelisation by splitting the task into a set of smaller tasks, which can
be launched in parallel. Task farming is applied in computational biology to tasks that are
potentially intractable if tackled serially, e.g. analysis of gel electrophoresis data (Dowsey et
al. 2003) or analysis of microarray data (Hill et al. 2008). However even with the application
of task farming it is clear that a full genome-wide survey is not feasible for this method on
any average-sized eukaryotic genome.
This chapter details the development of a data filter to remove protein pairs that
display little or no evidence of correlated evolution. There are two main types of filter
evaluated. The first type is a simple distance based test (Hamming distance) as shown in
Chapter 3 and utilised in early work on phylogenetic profiling (Pellegrini et al. 1999).
Potentially proteins that display evidence of correlated evolution will have phylogenetic
profiles that have a lower Hamming distance from each other. Thus even though Hamming
distance applied in isolation performs poorly as seen in Chapter 3, it may serve as a filter for
proteins which do not display evidence of correlated evolution in combination with the
second type of filter.
The second type of filter will utilise a single set of reconstructed ancestral states. By
using a single set of reconstructed states and a simpler method for the detection of evidence
of correlated evolution, proteins that do not display any such evidence may be filtered out.
This chapter describes the implementation and comparison of five filters, which utilise a
single set of reconstructed ancestral states to detect signs of correlated evolution. As a large
proportion of the computations performed by constrained ML (Barker et al. 2007; Barker and
Pagel 2005) involves estimation of the transition rate parameters by integrating over all
possible ancestral states, use of a single set of reconstructed ancestral states reduces the scope
of the problem. Through the use of an effective and accurate data filter a genome-wide
survey for an average eukaryotic organism could be rendered feasible.
The end product of the research described in this chapter is just such a filter, based on
logistic regression of a set of empirically evaluated predictors/parameters which reflect
correlated evolution between a pair of proteins. The filter is approximately 2208 times faster
than constrained ML and achieves a reasonable degree of precision/sensitivity over the
training data in its own right. Thus application of this filter can facilitate a heuristic search
for genes/proteins displaying evidence of correlated evolution over an entire
genome/proteome.
In order to describe the process of filter development/evaluation it will firstly be
necessary to present an overview of ancestral state reconstruction.
4.1.1 Ancestral state reconstruction
The procedures involved in the reconstruction of the states of characters and traits in extinct
ancestral species are similar to those involved in phylogeny reconstruction. This is due to the
similarity of the issues involved. The reconstruction procedures for character states thus
utilise similar criteria with which to judge putative reconstructions. Ancestral reconstruction
is a useful tool for investigating hypothetical evolutionary scenarios and has been used to
investigate many biological questions, for example the demonstration of homoplasy in
the evolution of lysozyme (Malcolm et al. 1990; Messier and Stewart 1997; Stewart et al.
1987). It is also a prerequisite step for a number of comparative method tests (Maddison
1990; Ridley 1983).
4.1.1.1 Parsimony
A parsimonious reconstruction of ancestral states over a phylogenetic tree would
entail the selection of the internal states that minimised change. Thus if for example two
terminal nodes within a given clade had the same state, that state would be
assigned to the node immediately preceding them.
Algorithms such as the Fitch (Fitch 1971) and Sankoff (Sankoff 1975) algorithms as
described in Chapter 2 are employed as a step within phylogeny reconstruction (Albert
2006). However given a particular already constructed phylogenetic tree they can be
employed to reconstruct a set of ancestral node values which minimises evolutionary change
over that particular tree (Felsenstein 2004). The algorithms themselves do not reconstruct
individual states at each internal node but instead construct sets of potential states at each
node. These potential states can be resolved into a singular state reconstruction through the
application of algorithms such as ACCTRAN (accelerated transformation) (Swofford and
Maddison 1987), which reconstructs ancestral states by placing points of change as close to
the root of the tree as possible (Agnarsson and Miller 2008). The converse approach to
ACCTRAN is DELTRAN (delayed transformation) (Swofford and Maddison 1987), which
reconstructs ancestral states by placing points of change as close to the tips of the tree as
possible (Agnarsson and Miller 2008). ACCTRAN and DELTRAN are the most commonly
used methods for collapsing node state sets into individual node states, though of the two
ACCTRAN is the more widely employed (Agnarsson and Miller 2008).
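The construction of sets of potential states can be sketched as follows. This is a minimal illustration of the first (tips to root) pass of Fitch parsimony for a single character on a rooted binary tree, assuming the tree is written as nested two-element tuples with tip names as strings; the function name, tree and tip states are hypothetical. Resolving the sets into single states (ACCTRAN or DELTRAN) is not shown.

```python
def fitch_sets(tree, tip_states):
    """Return (state_set, changes) for the node at the top of `tree`.

    `tree` is either a tip name (str) or a 2-tuple of subtrees.
    The state set is the intersection of the children's sets when that is
    non-empty, otherwise their union, in which case one change is counted.
    """
    if isinstance(tree, str):                      # tip: observed state only
        return {tip_states[tree]}, 0
    left_set, left_changes = fitch_sets(tree[0], tip_states)
    right_set, right_changes = fitch_sets(tree[1], tip_states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical four-taxon tree.
example_tree = (("A", "B"), ("C", "D"))

resolved = fitch_sets(example_tree, {"A": 1, "B": 1, "C": 0, "D": 1})
ambiguous = fitch_sets(example_tree, {"A": 1, "B": 1, "C": 0, "D": 0})
print(resolved)    # root set holds a single state
print(ambiguous)   # root set holds both states and needs resolving
```

In the second call the root set contains both states with one change required, which is the situation in which a rule such as ACCTRAN or DELTRAN is needed to choose where the change is placed.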
Parsimony methods fail to consider different branch lengths in different parts of the
tree (Yang et al. 1995). Parsimony based methods have also been criticised for their lack of
statistical soundness (Elias and Tuller 2007). Parsimony methods are also unable to
distinguish between reconstructions that are equally parsimonious (Koshi and Goldstein
1996).
4.1.1.2 Likelihood
Just as likelihood is employed as an optimality criterion for phylogeny
generation, it can also be used in the context of ancestral state reconstruction. Maximum
likelihood techniques are used to estimate the parameters of the specified model of evolution
(Yang 2006). Once these parameters are estimated they can be utilised to calculate the
posterior probability of ancestral states using Bayes' theorem (Yang 2006). The state with the
highest posterior probability is then assigned to the node under consideration. This procedure
has been defined as empirical Bayes (Yang 2006). Empirical Bayes can be used either to
assign a character state to a single node in a tree, a process known as marginal
reconstruction, or to assign a set of character states to all ancestral nodes simultaneously (Yang
2006). This latter process is known as joint reconstruction (Yang 2006).
Empirical Bayes can be contrasted with hierarchical Bayes where rather than estimating a
single value for the parameters of a model of evolution a prior probability distribution is
assigned for each unknown parameter (Yang 2006). The posterior probability for a given
ancestral state is then calculated by integrating over all possible values of parameters
(Huelsenbeck and Bollback 2001). Again the putative state with the highest posterior
probability is then assigned to each ancestral node.
Work by Koshi and Goldstein used the empirical Bayes method to reconstruct the
sequence of ancestral ribonuclease (Koshi and Goldstein 1996). The performance of
parsimony and the empirical Bayes method was also compared in a reconstruction of
lysozyme c by Yang et al. (Yang et al. 1995). This work found that empirical Bayes
outperformed parsimony but both methods suffered when the sites within the multiple
alignments being reconstructed were highly variable and the distance from the ancestral
nodes to the extant species was high (Yang et al. 1995).
An interesting application of empirical Bayes reconstruction was carried out by
Gaschen (Gaschen et al. 2002). This work entailed reconstruction of the
sequence of the ancestor to various regional variants of the HIV-1 virus in order to contribute
to the creation of a potential vaccine (Gaschen et al. 2002).
4.2 Filters
4.2.1 Hamming distance filter
The original work which introduced the methodology of phylogenetic profiling as a means of
detection of functional interaction between genes (Pellegrini et al. 1999) utilised Hamming
distance (Hamming 1950) as a measure of similarity of profiles. Phillip Kensche also
examined this method in a review of phylogenetic profiling methods, and found it to perform
reasonably well over a dataset composed of the protein sequences of 25 fungi (Kensche et
al. 2008). Hamming distance did not perform well over the training data as seen in Chapter 3;
however it was possible that it could reduce the search space for an application of
constrained ML. As a potential heuristic it offers speed, as Hamming distance is one of the
simplest comparisons that can be carried out between two strings. Hamming distance
was therefore investigated as a potential filter, possibly to be used in conjunction with a filter
based on a single set of reconstructed states.
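As an illustration of how simple the comparison is, the following sketch computes the Hamming distance between two phylogenetic profiles, assuming each profile is a string of '0' and '1' characters of equal length; the function name and the example profiles are hypothetical.

```python
def hamming_distance(profile_a, profile_b):
    """Count the positions at which two equal-length profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical six-species profiles (1 = present, 0 = absent).
similar_pair = hamming_distance("010101", "010001")
opposite_pair = hamming_distance("110000", "001111")
print(similar_pair, opposite_pair)
```

A filter built on this measure would discard pairs whose distance exceeds some threshold; the choice of threshold is not specified here.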
4.2.2 Ancestral state reconstruction filter
The first consideration in the development of a heuristic/filter based on a single set of
reconstructed characters was which criterion to use to reconstruct that set. Likelihood as a
criterion yields more accurate results as discussed above. However as the aim of this heuristic
approach was to develop a method that reduced the search space for an application of the
computationally intensive constrained ML (Barker et al. 2007; Barker and Pagel 2005) to
phylogenetic profiling, it was decided to use the simpler though less accurate criterion of
parsimony.
4.2.2.1 Dollo parsimony
Dollo parsimony operates under the assumption that once a complex trait has been lost it
cannot be re-acquired (Albert 2006). Given that the character under investigation is the
presence and absence of genes/proteins in eukaryotic organisms it was decided that Dollo
parsimony was the appropriate variant to use. Dollo parsimony has been previously used to
investigate the propensity of particular genes to be lost over the course of evolutionary time
in eukaryotes (Krylov et al. 2003). It was chosen by the authors due to the relative rarity of
lateral gene transfer events in eukaryotes (Krylov et al. 2003).
Dollo parsimony has also been utilised to investigate gene gain in poxviruses
(McLysaght et al. 2003). The results of this use however may have been affected by the fact
that poxviruses were later observed to acquire genetic material from infected hosts (Hughes
and Friedman 2005). Kensche also evaluated the efficacy of Dollo reconstructions of profiles
as a method of phylogenetic profiling (Kensche et al. 2008). Kensche utilised a distance
measure d(A,B) between the Dollo parsimonious reconstructions of the phylogenetic profiles
of two (orthologous groups of ) proteins A and B calculated as:
\[ d(A, B) = \sum_{i \in \text{branches}} \left| \left( \text{anc}(a_i) - \text{desc}(a_i) \right) - \left( \text{anc}(b_i) - \text{desc}(b_i) \right) \right| \qquad (1) \]
where branches denoted the set of branches in the phylogenetic tree, anc(a_i) was defined as
the state of orthologous group A at the ancestral node of branch i, desc(a_i) was defined as the
state of orthologous group A at the descendant node of branch i, anc(b_i) was defined as the
state of orthologous group B at the ancestral node of branch i and desc(b_i) was defined as the
state of orthologous group B at the descendant node of branch i (Kensche et al. 2008). The
distance d(A,B) was a count of branches where either orthologous group was gained or lost
independently. The method performed as well as more sophisticated techniques on the data
analysed by Kensche (Kensche et al. 2008).
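The distance in Equation 1 can be sketched in a few lines. The example assumes that each reconstruction is held as a dictionary mapping node names to states (1 = present, 0 = absent) and that the tree is a list of (ancestor, descendant) branches; the function name, tree and reconstructions are hypothetical.

```python
def dollo_distance(branches, states_a, states_b):
    """Sum, over branches, the absolute difference between the change in A
    and the change in B along that branch (Equation 1)."""
    return sum(
        abs((states_a[anc] - states_a[desc]) - (states_b[anc] - states_b[desc]))
        for anc, desc in branches
    )


# Hypothetical four-taxon tree as (ancestor, descendant) pairs.
branches = [("root", "n1"), ("root", "n2"),
            ("n1", "A"), ("n1", "B"), ("n2", "C"), ("n2", "D")]

# Group 1 is gained on the branch root -> n1 and never lost.
group_1 = {"root": 0, "n1": 1, "n2": 0, "A": 1, "B": 1, "C": 0, "D": 0}
# Group 2 shares that gain but is additionally lost on the branch n1 -> B.
group_2 = {"root": 0, "n1": 1, "n2": 0, "A": 1, "B": 0, "C": 0, "D": 0}

identical_history = dollo_distance(branches, group_1, group_1)
one_independent_loss = dollo_distance(branches, group_1, group_2)
print(identical_history, one_independent_loss)
```

Branches on which both groups change in the same direction contribute nothing, so the shared gain on the branch from root to n1 does not add to the distance.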
One of the methods evaluated by Barker as a potential source of signal for correlated
evolution was also examination of Dollo parsimony based reconstructions of phylogenetic
profiles over a phylogeny (Barker et al. 2007). Dollo parsimony was utilised as it reflected
the idea of setting the rate of acquisition of a complex trait (in this case a protein) to a preset
low level (Barker et al. 2007). Pairs of proteins were scored on branches of the tree where
they were jointly lost and jointly gained to form a score referred to as Dollo-pos (Barker et al.
2007). Branches where proteins were not gained or lost together were also counted and
subtracted from Dollo-pos to form a score referred to as Dollo-overall (Barker et al. 2007).
Both these scores however did not perform particularly well over the data examined (Barker
et al. 2007). Dollo-overall however performed significantly better than Dollo-pos (Barker et
al. 2007).
Thus given the fact that Dollo parsimony based tests had been moderately successful
at detecting correlated evolution, a series of potential data filters/heuristics was investigated
to precede the examination of phylogenetic profiles using constrained ML (Barker et al. 2007;
Barker and Pagel 2005). Each was based on a single set of reconstructed ancestral states over
the phylogeny obtained using Dollo parsimony.
4.2.2.2 Maddison Test for correlated evolution
To use the reconstructed ancestral state data a test to detect correlated evolution using the
comparative method that utilised a set of reconstructed ancestral states over a given
phylogenetic tree was needed. One candidate test was a contingency table based test
presented by Ridley where a gain or loss of a character was considered in the light of whether
it occurred in the presence or absence of another character over a phylogenetic tree (Ridley
1983). This test however does not separate which character is dependent and which is
independent.
A second candidate test considered was a procedure described by Wayne Maddison
(Maddison 1990) for the comparison of the association of changes in one binary character
with the given state of another. This test was designed to carry out this analysis assuming a
given phylogenetic tree and a set of reconstructed characters (Maddison 1990). This test has
been referred to as a test for concentrated changes (Felsenstein 2004).
The fundamental idea behind the Maddison test is to test whether changes in one trait
or character are concentrated in an area of a tree where a second trait or character is in a given
state. As an illustrative example consider a fictional monophyletic group of related cow-like
animals. These animals do not possess horns. The phylogenetic relationships of these
animals are fully resolved and understood, as are the ancestral states for all morphological
and molecular traits. Now imagine that this group overall has no ability to metabolise valine.
Finally imagine the ability to metabolise valine is independently acquired by a sub-clade of
our fictional group and this leads to the development of horns in this sub-clade.
If we wished to test whether the ability to metabolise valine leads to horn
development, the Maddison test would return the probability of the observed configuration of
valine metabolism / horn presence. This probability would be calculated by firstly calculating
the total number of ways to acquire horns in the presence of the ability to metabolise valine
over the phylogenetic tree. Secondly the number of ways to acquire horns over the entire tree
irrespective of the state of the ability to metabolise valine is calculated. By dividing the first
value by the second a probability can be calculated. If horns are concentrated in parts of the
tree where valine metabolism is also present this probability will be lower.
Figure 4.1: Illustrative example tree.

Thus imagine in the above figure only Cow1 has the ability to metabolise valine and
also possesses horns. Thus a reasonable hypothesis/reconstruction could be that both abilities
were gained in the branch leading to Cow1. There is 1 gain of horns. Over the entire tree
there are 6 branches (not counting the root branch) and thus 6 ways to have 1 gain of horns.
However there is only one way of having a gain of horns in the presence of valine
metabolism and that is on the branch leading to Cow1. Thus the probability of the observed
configuration is 1/6.
To reiterate the test works through counting all possible ways of having a set of
observed changes in a character over a phylogeny and then counting how many ways there
are of having the same number of changes in parts of the tree where a second character is in a
given state. Thus if correlated evolution is occurring changes in the first character will be
concentrated in areas of the tree where the second character is in the causative state. Consider
as a second example two proteins, which carried out the same function. If the presence of the
first protein made the second protein redundant then losses in the second protein could be
concentrated in areas where the first protein was present.
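The single-gain case of the cow example can be checked by direct counting. The sketch below assumes a hypothetical four-tip topology with 6 branches (the actual tree of Figure 4.1 is not reproduced here) and counts a branch as lying in the relevant region when the reference trait is present at its descendant node. It handles only one gain and no losses; the general case needs the dynamic programming approach described in Section 4.3.1.

```python
from fractions import Fraction

# Hypothetical topology: ((Cow1, Cow2), (Cow3, Cow4)), root branch excluded.
branches = [("root", "n1"), ("root", "n2"),
            ("n1", "Cow1"), ("n1", "Cow2"), ("n2", "Cow3"), ("n2", "Cow4")]

# Nodes at which the reference trait (valine metabolism) is present.
valine_present = {"Cow1"}


def single_gain_probability(branches, region_nodes):
    """Probability that one gain falls in the region, if every branch is
    an equally likely location for that gain."""
    ways_total = len(branches)
    ways_in_region = sum(1 for _, desc in branches if desc in region_nodes)
    return Fraction(ways_in_region, ways_total)


probability = single_gain_probability(branches, valine_present)
print(probability)
```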
The drawbacks of the test are the fact that it treats all forms of evolutionary change as
equally likely and its inability to take into account branch lengths (Pagel 1994). However as
the motivation behind the implementation of the test was its use as a simple data filter to
remove protein pairs that showed little or no evidence of correlated evolution it was decided
that the Maddison test (Maddison 1990) was an appropriate test.
4.3 Methods
To create Dollo parsimony based reconstructions for each phylogenetic profile over the
phylogeny presented in Chapter 2, the program DOLLOP from the PHYLIP package
(Felsenstein 1989) was used. The program implements the Dollo parsimony reconstruction
algorithm described in work by Farris (Farris 1977).
Given a binary trait T that can take on 2 possible values coded as [0,1], DOLLOP
implements Dollo parsimony by seeking to explain a given observed configuration of
presence and absence for T over a set of taxa over a phylogenetic tree by allowing one gain
(transition from 0 to 1) and multiple reversions (transition from 1 to 0) (Felsenstein 1989).
As an illustrative example consider the tree below and a trait with the distribution
010101.
Figure 4.2: Example tree.

DOLLOP will reconstruct the trait as initially gained at the root of the tree and lost at
the branches leading to A, B and E. This is as opposed to allowing multiple gains on the
branches leading to D and G.
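The principle of one gain followed only by losses can be sketched as follows. This is not DOLLOP's implementation; it is a minimal illustration which assumes the tree is given as a dictionary of child lists, places the single gain on the branch leading to the most recent common ancestor of all tips carrying the trait, and keeps the trait present only along lineages leading to such tips. The tree, tip names and function name are hypothetical and do not correspond to Figure 4.2.

```python
def dollo_reconstruct(children, root, tips_present):
    """Return (states, origin) for a single-gain, loss-only reconstruction.

    `origin` is the node at which the trait first appears (None if no tip
    carries the trait); `states` maps every node to 1 (present) or 0.
    """
    below = {}   # number of trait-carrying tips beneath each node

    def count(node):
        kids = children.get(node, [])
        if kids:
            below[node] = sum(count(kid) for kid in kids)
        else:
            below[node] = 1 if node in tips_present else 0
        return below[node]

    total = count(root)
    states = {node: 0 for node in below}
    if total == 0:
        return states, None

    # Walk down while a single child clade still holds every carrier.
    origin = root
    while True:
        holders = [kid for kid in children.get(origin, []) if below[kid] == total]
        if not holders:
            break
        origin = holders[0]

    def mark(node):
        # Present wherever at least one carrier tip lies beneath the node.
        if below[node] > 0:
            states[node] = 1
            for kid in children.get(node, []):
                mark(kid)

    mark(origin)
    return states, origin


# Hypothetical six-tip tree with tip distribution 010101 (t1..t6).
tree_children = {"root": ["n1", "n2"], "n1": ["t1", "n3"], "n3": ["t2", "t3"],
                 "n2": ["t4", "n4"], "n4": ["t5", "t6"]}
states, origin = dollo_reconstruct(tree_children, "root", {"t2", "t4", "t6"})
losses = sorted((parent, kid) for parent, kids in tree_children.items()
                for kid in kids if states[parent] == 1 and states[kid] == 0)
print(origin, losses)
```

On this assumed tree the trait originates at the root and is lost on three terminal branches, rather than being gained three times independently.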
The process followed was similar to the process followed to generate the genome
content tree produced in Chapter 2. The main difference in this case was that DOLLOP was
run with the U option, which instructed it to produce Dollo parsimonious reconstructions
over a user-supplied tree. The program was supplied with the phylogeny generated in Chapter
2 as well as the phylogenetic profile for each protein under consideration.
Apart from the U option the program was run with its default settings. The output
from DOLLOP contained data on the state of the protein at every node in the phylogeny as
well as the branches within the phylogeny at which transitions occurred. In order to record this
data DOLLOP assigns an identifying number to each internal node of the phylogeny.
An example of the outputted data from DOLLOP is given below. For a human protein
with the profile 000000000000000000000000100000000000000000000000000000 (Only
present in Homo sapiens) over the species used for the phylogeny produced in Chapter 2, the
following reconstruction was provided for species close to the root of the tree.
From    To    Changed    State
root    No    Absent
Entamoeba histolytica    No    Absent
1    No    Absent
Trichomonas vaginalis    No    Absent
2    No    Absent
No    Absent
Table 4.1: Sample output from DOLLOP.

Clearly the parsimonious reconstruction for this protein would only contain one gain. This
gain occurs on the branch between Homo sapiens and the ancestral node immediately preceding it.
The output files from DOLLOP were stored for further use.
In order to process this data and utilise it as input for various tests of correlated
evolution two Java objects were defined and implemented.
Figure 4.3: Class diagram illustrating classes underpinning Dollo analyses.
The main object in the preceding figure is the Transition Matrix object. This object has 2
main attributes.
The States: This is a list of Transition objects. Transition objects contain the same 4
attributes as shown in Table 4.1
The Position Map: This is a Tree Map, which contains a position within the tree as a
key and the state of a given trait at that position as a value. Thus this attribute can be
queried for the state (present or absent) of a given trait at any point in the tree.
The Transition Matrix object also has 2 main operations.
Calculate clade: This function returns all parts of a tree descended from a given node.
Thus if a trait is gained or lost at Node n, the function will return the monophyletic
group consisting of n and all its descendants.
Create Position Map: This function traverses the States list and utilises the Calculate
clade function to populate the Position map.
These objects underpin all further analyses described in this chapter.
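The two objects can be sketched as follows. The original implementation was in Java; this is a Python approximation whose field and method names are assumptions based only on the description above. It further assumes that rows are listed from the root downwards, that the State column gives the state at the To node, and that a plain dictionary stands in for the Tree Map. The example rows are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    """One row of reconstruction output: From, To, Changed, State."""
    from_node: str
    to_node: str
    changed: bool
    state: int          # assumed: 1 = present, 0 = absent at to_node


@dataclass
class TransitionMatrix:
    states: list                                   # list of Transition rows
    position_map: dict = field(default_factory=dict)

    def calculate_clade(self, node):
        """Return `node` followed by everything descended from it."""
        clade = [node]
        for row in self.states:
            if row.from_node == node:
                clade.extend(self.calculate_clade(row.to_node))
        return clade

    def create_position_map(self):
        """Record the trait state at every position below the root.

        The first row sets the starting state for the whole tree; each
        later row flagged as changed overwrites the state of its clade.
        """
        self.position_map = {}
        for index, row in enumerate(self.states):
            if index == 0 or row.changed:
                for node in self.calculate_clade(row.to_node):
                    self.position_map[node] = row.state
        return self.position_map


# Hypothetical rows: the trait is gained only on the branch to taxonC.
rows = [Transition("root", "1", False, 0),
        Transition("1", "taxonA", False, 0),
        Transition("1", "2", False, 0),
        Transition("2", "taxonB", False, 0),
        Transition("2", "taxonC", True, 1)]
matrix = TransitionMatrix(rows)
position_map = matrix.create_position_map()
print(position_map)
```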

4.3.1 Maddison test for correlated evolution
Given the set of reconstructed ancestral states, the Maddison test as described above and in
the original work by Wayne Maddison (Maddison 1990) was implemented. The following
description of the algorithm utilised is based entirely upon the work presented by Maddison
(Maddison 1990). A modification to the test for correlated evolution defined by Maddison
(Maddison 1990) to fit the constraints imposed by Dollo parsimony is presented in Section
4.3.2.
4.3.1.1 Algorithm
Assume two discrete binary characters A and B, a phylogenetic tree T and a set of
reconstructed states for A and B for each node N within T. Possible states for characters A and
B are drawn from the set {0, 1}. A gain is defined as a transition from 0 to 1 and
conversely a loss is defined as a transition from 1 to 0.
Define character B as the reference trait. Define state s as the relevant state of character
B. Define subset k ($k \subseteq T$) as the area(s) of the tree where B is in state s.
Define $W_{root}(x, y \mid b)$ as the total number of ways to have x gains and y losses of
character A over the tree starting at the root node given that the state of character A is b at the
root of the tree.
Define $B_{root}(p, q \mid x, y, b)$ as the total number of ways to have p gains and q losses of
character A in subset k given x gains and y losses over the entire tree starting at the root node
given that the state of character A is b at the root of the tree.
The test for correlated evolution is thus calculated by
\[ p(\text{obs}) = \frac{B_{root}(p, q \mid x, y, 0) + B_{root}(p, q \mid x, y, 1)}{W_{root}(x, y \mid 0) + W_{root}(x, y \mid 1)} \qquad (2) \]
Solving Equation 2 provides the probability p(obs) of having p gains and q losses of
character A in subset k given a total of x gains and y losses occur over the whole tree under
the null hypothesis of no correlated evolution. If gains and losses of character A are in some
way dependent on whether character B is in state s then we could expect those gains and
losses to be concentrated in subset k. $W_{root}(x, y \mid b)$ and $B_{root}(p, q \mid x, y, b)$ are calculated
through the use of a dynamic programming approach starting at the tips of the tree and
proceeding in a post order fashion (Maddison 1990).
4.3.1.2 Calculation of total number of ways of having x gains and y losses over the tree
In order to calculate $W_{root}(x, y \mid b)$ over the entire tree for a character A, a matrix containing
the number of ways of having 0 to x gains and 0 to y losses for both possible values of b (0 or
1) has to be calculated for each node in the tree.
For a leaf node there are 0 ways of having x gains and y losses at the node for all
values of x and y which are greater than 0. There is one way of having 0 gains and 0 losses at
a leaf node.
For a non-leaf node K there are four calculations to make. Firstly assume all gains and
losses occur post the node's immediate descendants L and M and that the state of character A
is 0. A non-leaf node is only processed after both its descendants have been visited. The
number of ways of having x gains and y losses at node K given a state of 0 can be calculated
by the expression
\[ \sum_{i=0}^{x} \sum_{j=0}^{y} W_L(i, j \mid 0) \times W_M(x - i, y - j \mid 0). \]
This counting system operates on the
principle that for every way of having i gains and j losses on node L there are (x-i) gains and
(y-j) losses on node M. Thus if for example there was 1 gain and 1 loss to distribute over node
K then if both of them occurred post descendant L then no changes would occur post
descendant M. If only the gain occurred post L then the loss would occur post node M.
The second part of this calculation is based on the assumption that one of the changes
occurs between K and one of its child nodes, for example M. As this change has
occurred (the change is a gain, as the state of character A at node K is 0), the
state of character A at node M is now 1 and one of the x gains has already occurred. Thus the
number of ways to have the remaining number of gains and losses can be calculated by the
expression
\[ \sum_{i=0}^{x-1} \sum_{j=0}^{y} W_L(i, j \mid 0) \times W_M(x - i - 1, y - j \mid 1). \]
The third part of the calculation covers the
eventuality that the change happens between K and its other child L. Thus the number of
ways remaining to have x gains and y losses is calculated by the expression
\[ \sum_{i=0}^{x-1} \sum_{j=0}^{y} W_L(i, j \mid 1) \times W_M(x - i - 1, y - j \mid 0). \]
Finally assume changes occur between K and both of its
child nodes L and M. The states of both nodes will be 1 and there will be two fewer gains to
distribute over the remainder of the tree. Thus the fourth part of the calculation is:
\[ \sum_{i=0}^{x-2} \sum_{j=0}^{y} W_L(i, j \mid 1) \times W_M(x - i - 2, y - j \mid 1). \]
Summing up the results of these four expressions will provide the number of ways of
having x gains and y losses at non-leaf node K given that the state of character A is 0. The
calculation of the number of ways of having x gains and y losses if the state of character A is
1 at node K is a mirror image of the process described above (Maddison 1990).
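The recursion described above can be sketched in a few lines. The following is a minimal illustration in Python, not the implementation used in this work; the nested-tuple tree representation and the name count_ways are assumptions made for the example. Each combination of child states is enumerated explicitly, so the four cases for a state of 0 at node K and their mirror images for a state of 1 are covered by the same loop.

```python
from itertools import product


def count_ways(node, max_gains, max_losses):
    """Return W[b][(x, y)]: the number of ways of having x gains and y losses
    below `node` given state b at `node`.

    A leaf is a string; an internal node is a (left, right) tuple
    (hypothetical representation chosen for this sketch).
    """
    grid = list(product(range(max_gains + 1), range(max_losses + 1)))
    if isinstance(node, str):
        # Leaf: exactly one way of having 0 gains and 0 losses, none otherwise.
        return {b: {(x, y): int(x == 0 and y == 0) for x, y in grid}
                for b in (0, 1)}

    left, right = node
    w_l = count_ways(left, max_gains, max_losses)
    w_m = count_ways(right, max_gains, max_losses)

    table = {}
    for b in (0, 1):
        table[b] = {}
        for x, y in grid:
            total = 0
            # Enumerate the states of the two children: (b, b) means no change
            # on either branch; a differing child state means a gain (b == 0)
            # or a loss (b == 1) on the branch leading to that child.
            for c_l, c_m in product((0, 1), repeat=2):
                changes = (c_l != b) + (c_m != b)
                rem_x = x - (changes if b == 0 else 0)
                rem_y = y - (changes if b == 1 else 0)
                if rem_x < 0 or rem_y < 0:
                    continue
                # Distribute the remaining changes over the two subtrees.
                for i in range(rem_x + 1):
                    for j in range(rem_y + 1):
                        total += (w_l[c_l][(i, j)]
                                  * w_m[c_m][(rem_x - i, rem_y - j)])
            table[b][(x, y)] = total
    return table
```

For the three-taxon tree (("A", "B"), "C") this sketch gives 4 ways of having a single gain from a root state of 0, one for each of the four branches.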
4.3.1.3 Calculation of total number of ways of having p gains and q losses in subset k
given x gains and y losses over the entire tree
This calculation of $B_{root}(p, q \mid x, y, b)$ is very similar to the one described above. As
above a matrix containing the number of ways of having 0 to p gains in subset k and 0 to q
losses in subset k, given 0 to x gains and 0 to y losses over the whole tree, for either
potential value of b (0 or 1) has to be calculated for each node in the tree.
For a leaf node there are 0 ways of having p gains and q losses in subset k given x
gains and y losses overall for all values of p, q, x and y which are greater than 0. There is one
way of having 0 gains and 0 losses in subset k given 0 gains and 0 losses overall.
As above a non-leaf node is only processed when both its children have been visited.
For a non-leaf node K with character A having state 0 with children L and M there are again
four calculations to be made. The first calculation counts the possibilities where all changes
occur post the child nodes. This number is calculated through the expression
$\sum_{i=0}^{x} \sum_{j=0}^{y} \sum_{f=0}^{p} \sum_{g=0}^{q} B_L(f, g \mid i, j, 0) \times B_M(p-f, q-g \mid x-i, y-j, 0)$.
The second calculation counts the possibilities where one of the changes occurs between node
K and node M. Whether this change is counted as within subset k depends on whether node M
lies within subset k. To facilitate calculation Maddison (1990) defined a number $Z_M$ which is
set to 1 if M lies within k and to 0 otherwise. Thus the second calculation is evaluated by the
expression
$\sum_{i=0}^{x-1} \sum_{j=0}^{y} \sum_{f=0}^{p-Z_M} \sum_{g=0}^{q} B_L(f, g \mid i, j, 0) \times B_M(p-f-Z_M, q-g \mid x-i-1, y-j, 1)$.
The third calculation counts the possibility of one of the changes occurring between node K
and node L, with $Z_L$ defined for node L in the same way. This is evaluated via the expression
$\sum_{i=0}^{x-1} \sum_{j=0}^{y} \sum_{f=0}^{p-Z_L} \sum_{g=0}^{q} B_L(f, g \mid i, j, 1) \times B_M(p-f-Z_L, q-g \mid x-i-1, y-j, 0)$.
The fourth calculation counts the possibilities where changes occur between K and L as well
as K and M. This is evaluated with the expression:
$\sum_{i=0}^{x-2} \sum_{j=0}^{y} \sum_{f=0}^{p-Z_L-Z_M} \sum_{g=0}^{q} B_L(f, g \mid i, j, 1) \times B_M(p-f-Z_L-Z_M, q-g \mid x-i-2, y-j, 1)$.
The summation of the solutions of the four expressions yields the total number of
ways to have p gains and q losses of character A within subset k given x gains and y losses of
character A overall under node K given the state of character A is 0. As above this process is
mirrored when the state of character A is 1 (Maddison 1990).
4.3.1.4 Permutation effects
The Maddison test for correlated evolution (Maddison 1990) is potentially susceptible to two
effects in the context of examination of protein phylogenetic profiles. Maddison's test was
designed to test specific hypotheses about correlated evolution. For example one of the first
applications of the test was on data testing the association of gregariousness in butterflies
with unpalatable larvae (Sillén-Tullberg 1988; Maddison 1990). Thus in a pairwise comparison
of characters one character is held static as a reference while the location of changes in the
other dynamic character is examined over the tree. The terms static and dynamic shall be
used in this context in all subsequent references.
In the case of examinations of correlated evolution in phylogenetic profiles however
it is not possible to state whether we are testing for the dependence of the distribution of
protein A with the state of protein B or vice versa. The first effect is thus permutation.
The second effect is based on how subset k is defined. As phylogenetic profiles
compare patterns of presence and absence of genes subset k can either be defined as the
presence of protein B or the absence of protein B.
This second effect is however precluded, as defining subset k as the absence of protein
B simply shifts the position of the number of changes sought. Consider the tree shown in
Figure 4.4. Suppose, for example, that protein A was gained once within the clade containing
Species 1 and Species 2 and that protein B was present in that clade but nowhere else within
the tree. The ancestral state of B would then be reconstructed parsimoniously as shown in
Figure 4.5.
If k is defined as the presence of B then the test is investigating the probability of 1
gain within k with 1 gain over the entire tree. If k was defined as the absence of B then the
test is investigating the probability of 0 gains within k with 1 gain over the entire tree.
Thus over the sample tree if k is defined as the presence of B, then there are 3 ways to
have 1 gain of A within k: 1 on the branch leading to Species 1, 1 on the branch leading to
Species 2 and 1 on the branch leading to the clade. There are 9 ways of having 1 gain over the
entire tree. Thus the probability of 1 gain in k is 3/9 or 0.33. If on the other hand k is
defined as the absence of B then there are 0 gains within k with 1 gain overall. There is still
one gain to be accounted for, and this gain can only occur within the clade containing Species
1 and Species 2. Thus as before there are 3 ways of having one gain within that clade and 9
ways of having one gain over the whole tree, so the associated probability remains the same,
i.e. 0.33.
Figure 4.4: Sample phylogeny of 5 hypothetical species. The numbers on the tree
represent presence and absence of protein B. The arrow points out the point post which
protein A was acquired.
Figure 4.5: Sample phylogeny of 5 hypothetical species. The black area of the tree
corresponds to a Fitch parsimonious reconstruction (Fitch 1971) carried out by Mesquite
(Maddison and Maddison 2010) of protein B if protein B has the phylogenetic profile 11000
(where the order of species in the profile is the same as the numerical order of the species). It
is also the Dollo parsimonious reconstruction. This black area corresponds to subset k if it is
defined as the presence of B. Conversely the white area of the tree corresponds to k if it is
defined as the absence of B.
A further example of this concept can be considered by using the initial example
provided in 4.2.2.2 involving our fictional cow-like species. In that case the probability of
acquiring horns in the presence of valine was calculated as 1/6. If we were to examine the
probability of acquiring horns in the absence of valine then the denominator remains the
same and there are 0 gains of horns in k (the area of the tree where the ability to metabolise
valine is absent). There is 1 way of having 0 gains of horns. Thus the probability of the
observed configuration remains the same.
Thus the results produced by the Maddison-Dollo test over the training data as
described in Chapter 3 were identical irrespective of how subset k is defined.
4.3.1.5 Evaluation of Maddison test as heuristic for constrained ML
The Maddison test (Maddison 1990) as described above modified to accommodate the
assumptions of Dollo parsimony (Section 4.3.2) was implemented using Java. This entailed
writing 3,563 lines of code. The implementation was supplied with the set of reconstructed
states for each protein pair and the phylogenetic tree on which they are reconstructed.
In order to remove permutation based effects from the analysis the training dataset as
described in Chapter 3 was doubled so it included protein pairs in both the A-B and the B-A
orientations. The Maddison test was run on this expanded training set. Thus each protein pair
in the training set had two associated probability scores. The lower of these two scores was
selected as the lower the probability of the observed distribution of gains and losses the
stronger the evidence for correlated evolution. In order to use this probability as an ascending
score the score was defined as 1-p. The test was run with subset k defined as the absence of
trait B.
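The scoring rule described above can be written compactly. This is a sketch only; the function and argument names are hypothetical, and the two probabilities are assumed to have been produced by running the test in the A-B and B-A orientations.

```python
def orientation_free_score(p_ab: float, p_ba: float) -> float:
    """Combine the probabilities from the A-B and B-A runs into one ascending score."""
    for p in (p_ab, p_ba):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # The lower probability is the stronger evidence for correlated evolution,
    # so it is kept and converted to an ascending score as 1 - p.
    return 1.0 - min(p_ab, p_ba)
```

Because the minimum is symmetric in its arguments, the score of a pair does not depend on which protein is treated as static and which as dynamic.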
The ability of the test to detect protein interactions was then judged according to the
criterion of precision/sensitivity as defined in Chapter 3.
This process was then repeated with k defined as the presence of trait B in order to
verify the observation that the results were unaffected by whether k was defined as the
absence or the presence of trait B.
4.3.2 Modification of test to match Dollo constraints
The calculation of the null distribution under the standard Maddison approach (Maddison
1990) allows for all sequences and permutations of gains and losses as allowed by Fitch
parsimony (Fitch 1971). In order to reconcile the test to the assumptions of Dollo parsimony
the test was modified to remove all possibility of a gain following a loss. This was achieved
by examining the state of the character under consideration at the root node. If the root node
was 0 then the standard test as described in Equation 2 was utilised. However if the state was
1 then the following test was used.
$p(obs) = \frac{B_{root}(p, q \mid x, y, 0) + B_{root}(0, q \mid 0, y, 1)}{W_{root}(x, y \mid 0) + W_{root}(0, y \mid 1)}$ (3)
If a character is acquired at the root of the tree then no gains can be allowed to occur
post a loss; thus the terms for a root state of 1 are calculated for 0 gains and y losses from
the root.
To illustrate this difference consider the following tree.
Figure 4.6: Example tree to illustrate the imposition of Dollo parsimonious constraints
on the Maddison test for correlated evolution.
Assume the state of a character C was reconstructed as 1 at the root node of the tree
and there was one gain and one loss to be distributed over the tree. If gains were allowed to
follow losses then there are 4 ways of having one gain and one loss over the tree. These are:
A loss between node 2 and node 6 followed by a gain between node 6 and
node 7.
node 8.
node 4.
node 5.
However with the added Dollo parsimony constraint there are no ways of having one
gain and one loss over the tree in Figure 4.6 if the state at the root node is 1.
4.3.3 Differential parsimony
The distance as defined in work by Phillip Kensche (Kensche et al. 2008) and reiterated in
Section 4.3.1 was implemented and its performance examined over the training data.
4.3.4 Dollo-pos/ Dollo-overall
Both measures as described by Barker et al. (Barker et al. 2007) were implemented and
examined in the light of the testing data.
4.3.5 Test based on logistic regression
In order to test for correlated evolution the reconstructed ancestral states were used to
calculate potential predictor variables, which bear a correspondence to the transition rate
parameters used by Barker and Pagel (Barker and Pagel 2005; Pagel 1994) as described in
Chapter 3. These parameters represent the rates of transition in state for a discrete binary
character given a particular state for a second discrete binary character.
Given a single set of reconstructed ancestral states these transitions can be empirically
counted over the reconstructed states. For example, if a protein is lost on a given branch of
the phylogeny and a second protein is present on that branch according to the reconstructed
states, this can be counted as a loss of one protein in the presence of the other.
The Dollo parsimony based reconstructions of each phylogenetic profile contain
within them the state of the associated protein at every given point in the tree. It was thus
possible to compare the state of any given ancestral branch within the tree for any two given
proteins. The Dollo reconstruction data is framed in terms of transitions between two nodes.
This meant that it was possible to compare a transition in the state of a given protein with the
state of the other protein at the same point in the tree.
In order to use the Dollo reconstructions of each phylogenetic profile as potential
predictors of functional interaction using logistic regression the reconstructions had to be
framed in terms of being potential predictors of correlated evolution. The possible states that
a protein could be in at any given transition in the tree were coded as:
0: Absent
1: Present
2: Lost
3: Gained
Each profile was associated with a matrix of transitions, which was constructed using
the reconstruction of the ancestral states of the profile over the tree. Pairwise comparisons
were then carried out. Thus at each transition point the state of protein A was compared to
that of protein B.
In order to avoid permutation effects the order in which protein pairs were considered
was made redundant. This was performed by framing transitions in terms of changes in
proteins as opposed to changes in a particular protein. If for example protein A was lost in
the presence of protein B this would be counted as the loss of a protein in the presence of
another, not the loss of A in the presence of B. Pairwise comparisons were thus carried out at
each node of the tree to create the predictors shown in Table 4.2. The lower case s stands for
scenario.
Predictor   Description
s00         A point in the tree where both proteins are absent.
s01         A point in the tree where one protein is present and the other is absent.
s02         A point in the tree where one protein is absent and the other is lost.
s03         A point in the tree where one protein is absent and the other is gained.
s11         A point in the tree where both proteins are present.
s12         A point in the tree where one protein is present and the other is lost.
s13         A point in the tree where one protein is present and the other is gained.
s22         A point in the tree where both proteins are lost.
s23         A point in the tree where one protein is lost and the other is gained.
s33         A point in the tree where both proteins are gained.
Table 4.2: Description of predictor parameters to be utilised in regression model.
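The construction of these order-independent predictors can be illustrated with a short sketch. It assumes that each protein is represented by a list of the state codes given above (0 absent, 1 present, 2 lost, 3 gained), one entry per transition point and in the same order for both proteins; the function name scenario_counts is hypothetical.

```python
SCENARIOS = ["s00", "s01", "s02", "s03", "s11",
             "s12", "s13", "s22", "s23", "s33"]


def scenario_counts(states_a, states_b):
    """Count the unordered state pairings of two proteins over the tree.

    states_a and states_b hold one code per transition point:
    0 absent, 1 present, 2 lost, 3 gained.
    """
    if len(states_a) != len(states_b):
        raise ValueError("both proteins must be described over the same tree")
    counts = {name: 0 for name in SCENARIOS}
    for a, b in zip(states_a, states_b):
        # Sorting the two codes discards the identity of the proteins, so a loss
        # of A in the presence of B and a loss of B in the presence of A are
        # both recorded as s12.
        low, high = sorted((a, b))
        counts[f"s{low}{high}"] += 1
    return counts
```

Swapping the two arguments returns the same counts, which is the property used to avoid permutation effects.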

Whether or not two proteins interact is a binomially distributed variable. Models for
the calculation of the probability of a dichotomous outcome include:
Two group discriminant function analysis: given two predictors X1 and X2,
discriminant function analysis constructs a variable Z which is a linear function of
X1 and X2. This function is the equation of a line that separates the data under
consideration into two groups. One of these groups will have high values for Z and
the other will have low values (Sokal and Rohlf 1995). Novel data with measured
values for X1 and X2 can thus be classified by solving the equation for Z (Sokal and
Rohlf 1995).
Logistic regression: logistic regression relates the probability of a successful outcome,
in this case that of an interaction, to estimated coefficients for a set of predictors via
an application of the logistic function (Sokal and Rohlf 1995).
Preliminary trials were carried out on the training data, which found the results of logistic
regression and a linear discriminant function to be broadly similar. However the values of
predictors associated with gains are not distributed normally as the number of gains is
restricted to 1. In such cases logistic regression is the recommended technique (Lei and
Koehly 2003) as it makes no assumptions of normality regarding the distribution of the
predictor variables. Thus logistic regression was selected as an appropriate method of testing
whether the predictor variables contribute to the outcome of interaction as well as
determining the degree to which they contribute. Logistic regression is carried out via an
application of the logistic function, which can be defined as shown in Equation 4:
$p(\mathrm{Interaction}) = \frac{e^{a+bX}}{1+e^{a+bX}}$ (4)
where a is the y-intercept of a regression line, e is the base of the natural logarithm
and b is the coefficient of a predictor variable X for a set of predictors $(X_0, X_1, X_2, \ldots, X_i)$.
The logistic function, which is also known as the sigmoid function, returns a value within the
open interval (0,1) for values in the range of real numbers from $-\infty$ to $+\infty$.
This probability is converted into the odds of an interaction versus no interaction with
the expression $\frac{p}{1-p}$, where p is equal to the probability of an interaction (Sokal and
Rohlf 1995).
Solving the expression substituting Equation 4 as a value for p yields the following
equation (Sokal and Rohlf 1995).
$\frac{p}{1-p} = e^{a+bX}$ (5)
Finally the odds of an interaction are converted into the log odds or logit of an
interaction via Equation 6 (Sokal and Rohlf 1995):
$\ln\left(\frac{p}{1-p}\right) = a + bX$ (6)
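The relationship between the probability, the odds and the logit can be checked numerically. The following is a small sketch with hypothetical function names, in which z stands for the linear predictor a + bX.

```python
import math


def logistic(z: float) -> float:
    """Probability from the linear predictor z = a + bX (equivalent to e^z / (1 + e^z))."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p: float) -> float:
    """Odds of an interaction versus no interaction, p / (1 - p)."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Log odds; the inverse of the logistic function."""
    return math.log(odds(p))
```

Applying logit to the output of logistic recovers the linear predictor, which is why the fitted regression line can be interpreted directly on the log odds scale.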
The optimal values for the coefficients and intercept of the regression line are
estimated by maximum likelihood (Sokal and Rohlf 1995). The full positive training set of
9,161 protein pairs was used as examples of proteins which interact. A random subsample of
9,161 protein pairs was then selected from the negative training set as examples of proteins
which do not interact. The sizes of the negative and positive sets were set to be equal to
allow the linear model to create a regression line which matched the distribution of the
predictors in both sets rather than being biased by a larger negative set.
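The balancing step can be sketched as follows. This is an illustration only: the function name, the fixed seed and the use of labels 1 and 0 for interacting and non-interacting pairs are assumptions, and the original subsample was not necessarily drawn in this way.

```python
import random


def balanced_training_set(positives, negatives, seed=0):
    """Pair every positive example with an equally sized random sample of negatives."""
    positives = list(positives)
    negatives = list(negatives)
    if len(negatives) < len(positives):
        raise ValueError("not enough negative examples to balance the set")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = rng.sample(negatives, len(positives))  # sampling without replacement
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in sampled]
```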
Counts were then calculated for each parameter for each pair of profiles within the new
training dataset. The statistical package R (R Development Core Team 2011) was then used
to fit a generalised linear model between the two binary variables using a binomial (logit)
link function. The predictor variables were considered to be continuous.
Predictor variables s13, s22, s23 and s33 were found to cause singularities within the model.
s13 was found to be perfectly correlated with s03 and s33 as Dollo parsimony only allows one
acquisition of a complex trait. s22 and s33 only occur rarely as seen below.
Statistic      Value
Minimum
1st Quartile
Median         0
Mean           0.5259
3rd Quartile
Maximum        14
Table 4.3: Descriptive statistics for counts of predictor s22.
Statistic      Value
Minimum
1st Quartile
Median         0
Mean           0.1
3rd Quartile
Maximum
Table 4.4: Descriptive statistics for counts of predictor s33.

The predictor s23 never occurs at all within the data. The results of the initial
regression are shown in Table 4.5.
Predictor   Coefficient   p value <0.05
s00         0.02179       0.1278 (Not significant)
s01         0.02607       0.0845 (Not significant)
s02         0.03775       0.0244
s03         -0.24041      $2.66 \times 10^{-5}$
s11         0.06446       $5.46 \times 10^{-5}$
s12         0.04350       0.0189
Table 4.5: Coefficients of the initial logistic regression equation.

After removing all predictors that caused singularities as well as all insignificant
predictors the analysis was repeated. This led to the following coefficients as shown in Table
4.6.
Predictor   Coefficient   p value <0.05
s02         0.019429      $2.05 \times 10^{-7}$
s03         -0.177835     0.00115
s11         0.043359      $< 2 \times 10^{-16}$
s12         0.018787      0.00107
Table 4.6: Coefficients of the logistic regression equation derived via examination of
the reduced set of predictors.
The full equation of the regression line is shown below as Equation 7.
$y = 0.019429\,s_{02} - 0.177835\,s_{03} + 0.043359\,s_{11} + 0.018787\,s_{12} - 0.791849$ (7)
y is equal to the logit score of the probability of two proteins interacting versus the
probability of them not interacting. The logit scores were then transformed into a probability
of interaction via an application of the logistic function. This probability was used as the
score.
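As an illustration, the regression line and the logistic transformation can be combined into a single scoring function. The coefficients and intercept are those reported for the reduced set of predictors; the function name and the dictionary layout of the counts are assumptions made for this sketch.

```python
import math

# Coefficients and intercept of the reduced regression model (Equation 7).
COEFFICIENTS = {"s02": 0.019429, "s03": -0.177835, "s11": 0.043359, "s12": 0.018787}
INTERCEPT = -0.791849


def interaction_probability(counts):
    """Logit score from the predictor counts, transformed to a probability.

    `counts` maps predictor names to their counts for one protein pair;
    predictors that are not supplied are treated as 0.
    """
    y = INTERCEPT + sum(coef * counts.get(name, 0)
                        for name, coef in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-y))
```

Under this equation a pair with no counted scenarios receives a probability below 0.5 because of the negative intercept, and each additional s03 event lowers the score while s02, s11 and s12 events raise it.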
The predictor which contributes the most to the probability of an interaction is s03.
This would suggest that a protein being gained in the absence of the other is indicative of the
two proteins not being functionally linked (as the coefficient is negative). The other
significant terms, which contribute positively to the probability of an interaction, are s02, s11
and s12. Losses appear to be a defining event for correlated evolution whether a loss in the
presence of a protein or a loss in the absence of a protein. This could potentially be
confirmation of work postulating that gene loss is relatively the most important event shaping
gene content and determining phenotype. However this may be attributable to the use of
Dollo parsimony for ancestral state reconstruction. A loss of a protein in the presence of
another might suggest some form of redundancy-based loss. A loss of a protein in the
absence of another on the other hand might suggest a cascade of losses of a group of proteins
that carry out a particular function that is no longer needed in a particular lineage.
s11 is the final significant predictor, which implies that two proteins coexisting for
periods of time at different points in the phylogeny are also more likely to be functionally
linked. This predictor can be thought of as similar to the method used by Cokus (Cokus et al.
2007), except that whereas that work carried out a horizontal comparison of the distribution
of presence and absences of proteins across a set of genomes clustered by similarity, the
predictor s11 measures co-occurrence both horizontally across species and vertically over a
set of putative ancestors as reflected by the phylogenetic tree.
4.3.5.1 Evaluation of logistic regression as a heuristic for constrained ML

In order to examine the efficacy of the derived regression equation it was applied to the full
training dataset. The performance of the method was then evaluated through the calculation of
precision/sensitivity.
4.4 Results
The results for the five tests carried out were measured in terms of precision and sensitivity
over the training data. The intersection of the predictions made by the tests with the 6
predictions made by constrained ML method (Barker et al. 2007; Barker and Pagel 2005) at
its optimum rate of gain 0.025 and its optimum likelihood ratio cut-off of 58.54 over the
training dataset was also examined. As pointed out in Chapter 3 this combination of rate of
gain and likelihood ratio cut-off yielded 5 predictions all of which were true positives.
Maintenance of this intersection is judged to be a key criterion for any data filter to form part
of a heuristic approach. Thus in order to use a test as a data filter for constrained ML (Barker
et al. 2007; Barker and Pagel 2005) the highest acceptable cut-off for the test was considered
to be the point at which all 5 predictions made by the ML method were preserved.
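The evaluation at a given cut-off and the choice of the highest cut-off that preserves the constrained ML predictions can be sketched as follows. The sketch assumes the standard definitions of precision (true positives over all predictions) and sensitivity (true positives over all known interactions) and assumes that a pair is predicted to interact when its score is at or above the cut-off; the definitions actually used are those given in Chapter 3, and the function names are hypothetical.

```python
def precision_sensitivity(scores, labels, cutoff):
    """Precision and sensitivity when pairs scoring >= cutoff are predicted to interact."""
    tp = sum(1 for s, known in zip(scores, labels) if s >= cutoff and known)
    fp = sum(1 for s, known in zip(scores, labels) if s >= cutoff and not known)
    fn = sum(1 for s, known in zip(scores, labels) if s < cutoff and known)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


def highest_preserving_cutoff(score_by_pair, ml_predictions):
    """Largest cut-off at which every constrained ML prediction is still retained."""
    # A pair is retained when its score is >= the cut-off, so the cut-off cannot
    # exceed the lowest score among the pairs that must be preserved.
    return min(score_by_pair[pair] for pair in ml_predictions)
```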
4.4.1 Maddison test for correlated evolution
Figure 4.7: Performance of the Maddison test for correlated evolution on the training data with k defined as the
absence of trait B. The figure shows the performance between cut-offs of 0.9999 and 1 rising by increments of
$1 \times 10^{-7}$.
Figure 4.8: Performance of the Maddison test for correlated evolution on the training data with k defined as the
presence of trait B. The figure shows the performance between cut-offs of 0.9999 and 1 rising by increments of
$1 \times 10^{-7}$.
The Maddison test for correlated evolution (Maddison 1990) performs reasonably
well on the training data as can be seen in Figure 4.7. There is a maximum intersection of 1 in
this range of cut-offs with the 5 predictions made by constrained ML (Barker et al. 2007;
Barker and Pagel 2005) at its optimum rate of gain and optimum LR cut-off. It is however a
computationally intensive process with each node in the tree being evaluated for all
combinations of states, gains and losses respectively firstly for the calculation of the null
distribution and secondly for all combinations of gains within subset k, losses within subset k,
gains over the whole tree, losses over the whole tree and states. Binary tree traversal is
carried out in linear time (Felsenstein 1973). However the amount of time taken rises at a much
steeper rate as the number of gains and losses to be accounted for increases (Maddison 1990).
The mirror property of the algorithm, i.e. that the calculations mirror each other for
any given combination of states (Maddison 1990) cannot be utilised in this particular case as
gains are restricted to 1 by the use of Dollo parsimony.
4.4.2 Differential parsimony
Figure 4.9: Precision and sensitivity measure of differential parsimony over training data.
Differential parsimony (Kensche et al. 2008) was not very successful on the training
data as shown in Figure 4.9. It reached a maximum precision of 0.22547 with a sensitivity of
0.023. There was an intersection of 3 at this point with the predictions made by constrained
ML at its optimum rate of gain and LR cut-off.
4.4.3 Dollo-pos
Figure 4.10: Precision and sensitivity measure of Dollo-pos over training data. Range of cut-offs at which predictions are made: 0 to 14.
Dollo-pos (Barker et al. 2007) was fairly successful over the training data with a
maximum precision of 0.57 at a cut-off of 13 as shown in Figure 4.10. The sensitivity at this
point was 0.00043. There was an intersection of 2 with the 6 predictions made by constrained
ML (at its optimum rate of gain and LR cut-off) at this cut-off. In order to use Dollo-pos as a
potential data filter the lowest acceptable cut-off that maintained an intersection with all 6
predictions made by constrained ML was 1. At this cut-off the precision of Dollo-pos was
0.105 and the sensitivity was 0.389.
4.4.4 Dollo-overall
Figure 4.11: Precision and sensitivity measure of Dollo-overall over training data. Range of
cut-offs at which predictions are made: -23 to 14.
The results for Dollo-overall (Barker et al. 2007) were found to be broadly similar to
Dollo-pos (Barker et al. 2007) as seen in Figure 4.11. However the effective range of cut-offs
for Dollo-overall is shifted down. The maximum precision achieved was 0.5 at a cut-off of
11. The sensitivity at this point was 0.00002. The lowest cut-off at which the intersection
with the 6 predictions made by constrained ML (at its optimum rate of gain and LR cut-off)
was maintained was -21. At this cut-off the precision of Dollo-overall was 0.08074 and the
sensitivity was 0.99.
4.4.5 Logistic regression
Logistic regression performed well on the training data as can be seen below in Figure 4.12.
Figure 4.12: Precision and sensitivity measure of logistic regression over training data. Range
of probability cut-offs at which predictions are made: 0 to 1. Cut-offs were incremented by
0.001.
The maximum precision achieved by logistic regression was 0.736 at a sensitivity of
0.01. This was achieved at a probability cut-off of 0.967. The lowest probability cut-off at
which the intersection with predictions made by constrained ML was maintained was 0.85.
The precision achieved at this point was 0.479 and the sensitivity was 0.0598. The method
that achieved the highest precision while maintaining a full intersection with the predictions
made by constrained ML was thus logistic regression.
In order to cross validate the ability of the logistic regression based filter to
discriminate between proteins that interact and those that do not, the filter was run on the
testing data. Figure 4.13 shows the performance of the filter over the testing data.
Figure 4.13: Precision and sensitivity measure of logistic regression over testing data. Range
of probability cut-offs at which predictions are made: 0 to 1. Cut-offs were incremented by
0.001.
As the s predictors used to determine the logit score used in the logistic regression
filter are based on the transition rate parameters used to detect correlated evolution by the
constrained ML technique (Barker et al. 2007; Barker and Pagel 2005; Pagel 1994) it was
expected that there would be a correlation between a high logit score for a protein pair and a
high LR (likelihood ratio statistic) score using the exemplar rate of gain elucidated in
Chapter 3 (0.025). The distribution of LR scores is extremely skewed towards the lower end as
can be seen in Chapter 3. All true positive protein pairs detected by constrained ML (Barker et
al. 2007; Barker and Pagel 2005) at the optimum rate of gain lie within the 99th percentile of
LR scores. This is due to the fact that many protein pairs within the training dataset display
little or no evidence of correlated evolution, i.e. the low sensitivity of the method as
shown in Chapter 3.
It is only in the upper ranges of the LR scores that protein pairs that interact are
distinguished from those that do not interact. This phenomenon was also observed by Barker
as well as by Kensche (Barker et al. 2007; Kensche et al. 2008). Thus in order to display the
relationship between the two prediction systems the logit derived probability scores of
protein pairs with corresponding LR scores that lay in the 95th percentile of the distribution of
LR scores were selected. This came to a set of 5,658 protein pairs.
The logit derived probability scores of these 5,658 protein pairs were plotted against their
corresponding LR scores (Figure 4.14). The relationship between the values is displayed via an
overlaid regression line. There is a large amount of scatter around the regression line;
however the relationship between the two variables is found to be significant
(p value < 0.001).
Figure 4.14: Linear regression line (Adjusted $R^2$ = 0.1049) drawn over a plot of logit
derived probability scores against likelihood ratio statistics over the training data. Vertical
dotted line shows optimum cut-off for likelihood ratio statistic. Horizontal dotted line shows
optimum cut-off for the logit derived probability score.
Figure 4.14 shows that there is a positive relationship between the LR score generated
by constrained ML (Barker et al. 2007; Barker and Pagel 2005) and the logit derived
probability scores generated by application of Equation 6 to the set of reconstructed ancestral
states.
Figure 4.15 shows that a similar relationship is observed over the testing data set (p
value < 0.001).
Figure 4.15: Linear regression line (Adjusted $R^2$ = 0.0995) drawn over a plot of logit derived
probability scores against likelihood ratio statistics over the testing data. Vertical dotted line
shows optimum cut-off for likelihood ratio statistic determined by the training data.
Horizontal dotted line shows optimum cut-off for the logit derived probability score
determined by training data.

In order to further improve the quality of the heuristic, Hamming distance was applied to the
set of predictions generated at each logit score cut-off. The goal of the additional filter was to
reduce the size of the search space while still preserving the 6 true positive predictions made
by constrained ML.
This yielded an unexpected result. The expectation with the application of Hamming
distance was that as the Hamming distance increased the number of true positives predicted
would fall; in fact it went up. This would suggest that as evidence of correlated evolution
goes down, the probability of predicting a protein-protein interaction goes up. This anomalous
result is probably due to the fact that Hamming distance does not take into account the
phylogenetic relationships between the organisms in the study, as pointed out by Barker (Barker
and Pagel 2005).
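For reference, the Hamming distance between two phylogenetic profiles is simply the number of species in which the two proteins differ in presence or absence. A minimal sketch, assuming profiles are given as equal-length strings of 0s and 1s in the same species order (the function name is hypothetical):

```python
def hamming_distance(profile_a: str, profile_b: str) -> int:
    """Number of species in which the presence/absence states of two proteins differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    return sum(a != b for a, b in zip(profile_a, profile_b))
```

Each species contributes independently to the count, so the measure is unchanged by any rearrangement of the tree, which is consistent with the explanation given above.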
4.5 Discussion
The Maddison-Dollo test proved not to be an effective filter as evidenced by its speed. The
method achieved a high level of accuracy; however it did not maintain an intersection with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) at its higher cut-offs.
Differential parsimony (Kensche et al. 2008) was unable to differentiate between the
negative and positive training data to an adequate degree.
Dollo-pos and Dollo-overall (Barker et al. 2007) were both able to differentiate
between the negative and positive data; however they did not preserve the intersection with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) at a reasonable level of
precision/sensitivity.
The application of logistic regression on the other hand is a viable filter for
constrained ML analysis (Barker et al. 2007; Barker and Pagel 2005) as evidenced by both its
relative speed as well as its relationship with the LR scores provided by constrained ML
(Barker et al. 2007; Barker and Pagel 2005).
The filter utilising logistic regression on a single set of reconstructed ancestral states
as implemented above is quick as compared to both constrained ML (Barker et al. 2007;
Barker and Pagel 2005) as well as the Maddison test for correlated evolution (Maddison
1990). The comparative CPU time taken by each method over the training data as measured
by the Mac OS X utility time is given in Table 4.7.
Process | Duration: Minutes/Hours
Maddison Dollo test (Farris 1977; Felsenstein 1989; Maddison 1990) | 3988 minutes and 20 seconds / 66 hours and 28 minutes (approximately)
Logistic regression using Dollo parsimony reconstructions (Farris 1977; Felsenstein 1989) | 3 minutes and 48 seconds
Constrained ML (Barker and Pagel 2005; Barker et al. 2007) | 6624 minutes and 37 seconds / 110 hours and 24 minutes
Table 4.7: Times taken by each of the three methods on the training data. The time given for
the Maddison Dollo test is an extrapolation from a run on 12.5% of the training data. This
12.5% was selected randomly. Times given in minutes are rounded to the nearest second.
Times given in hours are rounded to the nearest minute. All tests were run on an Intel Xeon
3.0 GHz processor.
The reduction in potential protein pairs to be investigated via an application of the
logistic regression filter using the probability score cut-off of 0.85 is 111,902 (113,1321132). This is a reduction of 98.9%. As the filter discriminates between proteins, which show
evidence of correlated evolution, the remaining 1.1% should be enriched for proteins
amenable to investigation via phylogenetic profiling using ML reconstructions with
constrained rates of gain (Barker et al. 2007; Barker and Pagel 2005). Thus an application of
the filter to the full human genome followed by an application of phylogenetic profiling using
constrained ML (Barker et al. 2007; Barker and Pagel 2005) will potentially yield a large set
of interactions from within the human genome some of which may be novel.
All code implementing the procedures described in this chapter is available on request from
the author.
Chapter 5
Genome-wide prediction of protein functional interactions in humans using
a heuristic approach
5.1 Introduction
The interactome of an organism can be defined as the complete set of molecular interactions
that occur within its full complement of cell types (Yu et al. 2008). This study focuses on the
elucidation of interactions between proteins (both direct and indirect) in the human proteome
(PPIs). PPIs have been defined as physical interactions between proteins (De Las Rivas and
Fontanillo 2010). PPIs are detected by methods such as the yeast 2-hybrid and tandem affinity
purification coupled to mass spectrometry (TAP-MS) (De Las Rivas and Fontanillo 2010) as
well as co-immunoprecipitation (De Las Rivas and Fontanillo 2010). Interactions between
proteins that are indirect can be detected by gene co-expression as investigated in Chapter 3
or techniques like double mutant synthetic lethality (De Las Rivas and Fontanillo 2010).
Indirect protein interactions are also detected by TAP-MS as proteins that share membership
of a complex do not necessarily maintain a direct physical interaction. Computational
interaction detection methods as described in Chapter 1 can contribute to this effort by
pointing out putative interactions, which can then be further verified. This chapter describes
the application of the logistic regression-based data filter developed in Chapter 4 in
combination with constrained ML (Barker et al. 2007; Barker and Pagel 2005) phylogenetic
profile analysis to detect potential novel protein-protein interactions as well as novel indirect
interactions between proteins.
5.1.1 PPI databases
As experimental data has accumulated on protein-protein interactions there have been a
number of attempts to organise and annotate it. There are thus a
number of databases that contain data on human protein-protein interactions. As any
attempt to examine the quality of predicted PPIs requires comparison with known data, a brief
overview of the major PPI databases is presented below.
5.1.1.1 MIPS
MIPS (Mammalian Protein-Protein Interaction Database) (Pagel et al. 2005) is a PPI database
which contains 1,812 experimentally verified human protein-protein interactions (PPIs). It
only includes published data from individual experiments as opposed to large scale high-throughput surveys (Pagel et al. 2005).
5.1.1.2 BIND
BIND (Biomolecular Interaction Network Database) contains data on three main interaction
types (Bader et al. 2001). These are binary interactions, molecular complexes and pathway
data (Bader et al. 2001).
5.1.1.3 MINT
MINT (Molecular INTeraction), in contrast to MIPS, contains data from large scale high-throughput experiments (Chatr-aryamontri et al. 2007). As of 2009 it contained data derived
from more than 19,000 experiments and 25,105 curated human PPIs (Ceol et al. 2010).
5.1.1.4 INTACT
IntAct is one of the larger PPI databases containing over 200,000 curated binary protein
interactions (Aranda et al. 2010). IntAct follows an extremely specific curation process with
information from experiments being recorded in high detail using a number of controlled
vocabularies to facilitate further data analysis (Aranda et al. 2010).
5.1.1.5 HPRD
The HPRD (Human Protein Reference Database) is a human specific PPI database. There are
currently 45,207 interactions held in the HPRD (Prasad et al. 2009). It contains manually
curated data on protein interactions derived from both high throughput surveys as well as
single experiments (Prasad et al. 2009).
5.1.1.6 DIP
The DIP (Database of Interacting Proteins) is one of the earlier PPI databases. It contains data
derived from manual curation of the literature as well as from structural information on
complexes derived from the PDB (Protein Data Bank) (Salwinski et al. 2004).
5.1.1.7 REACTOME
The REACTOME database holds data on PPIs in the context of the biological pathways that
underpin cellular processes and is also manually curated (Haw et al. 2011).
5.1.1.8 STRING
The STRING database holds data on PPIs that are experimentally verified and also adds a set
of computationally predicted PPIs (von Mering et al. 2005). It contains PPI information on
630 organisms (Jensen et al. 2009). The total number of interactions held by STRING
exceeds 50,000,000 (Jensen et al. 2009).
5.1.1.9 I2D
The I2D (Interologous Interaction) database contains the full literature-derived predictions
from the databases HPRD, BIOGRID, IntAct, BIND and MINT as well as computationally
predicted interactions (Brown and Jurisica 2005). The sources of evidence utilised for
computational predictions include domain co-occurrence, gene co-expression and intersection
of GO terms. I2D contains 133,250 unique entries for detected protein interactions between
13,490 proteins.
5.1.1.10 KEGG
KEGG as mentioned in Chapter 1 localises gene products within functional pathways
(Kanehisa 1997; Kanehisa et al. 2006). This is similar to REACTOME.
5.1.1.11 BIOGRID
The BIOGRID database also contains curated data. It has 49,378 interactions involving
human proteins (Stark et al. 2006).
5.1.1.12 Discussion
There is an overlap between these databases as they are all based on examination of similar
experimental data (De Las Rivas and Fontanillo 2010). Given that current estimates of the
human interactome size are around 650,000 including non-direct functional interactions
(Stumpf et al. 2008), a large number of interactions remain to be characterised.
5.1.2 Power law
In order to examine the statistical properties of PPI networks, these networks are usually
analysed as graphs (Jeong et al. 2001). An interesting observation is that the degree distribution
within some of these graphs (the degree of a vertex within a graph is the number of edges
connected to that vertex) appears to follow a power law (Jeong et al. 2001). That is, the
number of vertices within a graph with degree k is approximately proportional to $k^{-x}$, where x is a constant
(Alon 2007). What this entails is that for any given protein within the PPI network the
probability of having a large degree (many interactions) is low. There will however be
proteins within the network that will have a large number of interacting partners. These
proteins have been referred to as hubs (Han et al. 2004). It has been hypothesised that
there are two forms of protein hub (Han et al. 2004). These are date hubs, which interact
with different partners at different times, and party hubs, which interact with multiple
partners simultaneously (Han et al. 2004).
Networks with similar degree distributions have been observed in both natural and
man-made networks such as the neural arrangements of C. elegans and the power grid of the
western United States (Watts and Strogatz 1998).
In the case of PPI networks, however, as there is a clear physical limit to the number of
partners that a given molecule can interact with, the power law distribution over a
PPI network will sharply decay at the upper ends of the distribution, as hub proteins reach
saturation point with a given number of interaction partners. Similarly the lower end of the
distribution may not match a power law, as the cellular environment and other physicochemical factors will affect the probability of being an entirely monogamous interacting
partner in a binary interaction. Examples of PPI networks that do not exhibit a power law in
degree distribution have been pointed out in the literature, e.g. in work by Tanaka (Tanaka et
al. 2005).
5.2 Methods
The first step in carrying out a full genome-wide survey was to develop a list of all possible
ordered pairs of proteins within the version of RefSeq (Pruitt et al. 2005) used. This came to a
total of 560,237,601 pairs. The logistic regression-based filter implemented in Chapter 4
was applied to the ordered pairs of profiles at its optimum probability cut-off of 0.85.
Removing all pairs that scored beneath this threshold resulted in a total of 5,312,880 pairs of
proteins. This was a reduction of approximately 99% of the total search space (5,312,880 of 560,237,601 pairs, or roughly 0.9%, remained). This set of
reduced profile pairs was then analysed by constrained ML (Barker et al. 2007; Barker and
Pagel 2005) with the rate of gain parameters restricted to the optimum rate of 0.025. This
analysis was carried out using a cluster consisting of 260 2 GHz dual core Opteron 270
processors.
The results of the constrained ML analysis were then filtered for pairs with a
likelihood ratio (LR) statistic score of higher than 58.54 (this was the optimum LR score
determined in Chapter 3). This led to a set of 20,605 predicted interactions between protein
pairs, consisting of 2,188 individual proteins. In order to examine predicted interactions
between members of the same orthologous group, predictions were then converted to
predictions between orthologous groups. These orthologous groups were identified as the
groups clustered by the Inparanoid (Remm et al. 2001) implementation described in Chapter
2, resulting in a predicted set of 7,150 interactions between orthologous group pairs,
consisting of 1,417 individual orthologous groups.
5.2.1 Short Branch filtration
Examination of the distribution of interactions amongst the individual proteins showed that
some of the individual proteins were predicted to have an extremely large number of
interacting partners. The maximum number of interaction partners was predicted for the
protein with RefSeq GI number 148613856 (described as probable ATP-dependent RNA
helicase DDX17 isoform 3 on the NCBI website). This was predicted to have 1,503
interactions. However given the overall distribution of interaction partners within the set of
predictions as shown below in Figure 5.1 these extreme numbers seem to be implausible.
Figure 5.1: Distribution of number of predicted interaction partners/protein in constrained
ML predictions.
Thus the profiles of these highly connected proteins were investigated.
Another protein, with the RefSeq GI number 29029591 (labelled putative ribosomal RNA
methyltransferase 1 isoform b on the NCBI website), was predicted to take part in 1,430
interactions. The phylogenetic profile of this protein was:
001101001001010001111010100010000001001000010001111010
This translates to the protein being present in Ashbya gossypii, Aspergillus fumigatus,
Bombyx mori, Caenorhabditis elegans, Canis familiaris, Cryptococcus neoformans,
Debaryomyces hansenii, Dictyostelium discoideum, Drosophila melanogaster, Drosophila
pseudoobscura, Entamoeba histolytica, Homo sapiens, Magnaporthe grisea, Paramecium
tetraurelia, Plasmodium knowlesi, Schizosaccharomyces pombe, Theileria annulata,
Theileria parva, Trichomonas vaginalis, Trypanosoma brucei and Ustilago maydis. This is
an extremely unbalanced distribution over the tree as illustrated in Figure 5.2.
Figure 5.2: Distribution of protein labelled putative ribosomal RNA methyltransferase 1
isoform b. The character 1 indicates presence of the orthologous group.
It was hypothesised that the unbalanced distribution of this profile contributed to its
display of a high likelihood ratio statistic (LR) score with a large number of proteins. In
particular it was hypothesised that the profiles of prediction-heavy proteins might contain
losses on the branches leading to P. troglodytes and M. mulatta. As the branches leading to
these taxa are short (see Chapter 2) this may contribute to spuriously high LR scores. In
order to investigate this hypothesis a list of RefSeq GIs for proteins lost on either the branch
leading to P. troglodytes or the branch leading to M. mulatta was extracted from the overall set
of phylogenetic profiles.
The following procedure was then applied iteratively.
Set cut-off to 0.
Select all protein GIs from the predicted interactions where the number of predicted
interactions for the protein is > cut-off.
Examine the intersection of the GIs in the selected list with the set of GIs of proteins lost on short
branches.
Increment cut-off by 1.
At the point where the cut-off was equal to 298, the intersection between the two sets covered
100% of the selected proteins, as shown in the Venn diagram below.
Figure 5.3: Intersection of proteins lost in P. troglodytes and M. mulatta with proteins
with > 298 predicted interaction partners.
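The iterative cut-off procedure can be sketched as follows. This is a minimal illustration only: the function name, the data structures and the toy identifiers are assumptions made for the example and do not reproduce the scripts used in this study.

```python
from collections import Counter


def first_contained_cutoff(predictions, short_branch_lost):
    """Raise the cut-off until every protein with more predicted partners
    than the cut-off also belongs to the short-branch loss set."""
    partners = Counter()
    for a, b in predictions:
        partners[a] += 1
        if b != a:  # count a self-interaction only once
            partners[b] += 1
    cutoff = 0
    while True:
        selected = {p for p, n in partners.items() if n > cutoff}
        if not selected:
            return None  # no cut-off yields a non-empty, fully contained set
        if selected <= short_branch_lost:
            return cutoff, selected
        cutoff += 1


# Toy data: hypothetical protein identifiers, not RefSeq GIs from the study.
toy_predictions = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
toy_lost = {"A"}
print(first_contained_cutoff(toy_predictions, toy_lost))
```

In the toy data only protein A has more than two partners, and it is in the loss set, so the search stops at a cut-off of 2.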
The 16 proteins in this intersection alone accounted for 13,082 of the total predicted protein
interactions or 63%. It is impossible to tell whether these proteins are genuinely evolving in a
correlated fashion or whether the predictions are merely an artefact of the loss on a short branch.
Thus a post-processing step was applied which removed any prediction involving
proteins with profiles that matched this pattern. The 16,301 proteins with profiles that
contained a 0 at either the position representing P. troglodytes or the position representing M.
mulatta were removed from the set of predicted interactions. This led to the removal of
19,463 predicted interactions between proteins. This left a reduced set of 1,142 predictions.
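A minimal sketch of this post-processing step is given below. The profile strings, the taxon positions and the function name are hypothetical; in particular, the positions of P. troglodytes and M. mulatta within the real profiles are not stated here and are assumed for the example.

```python
def remove_short_branch_predictions(predictions, profiles, short_branch_positions):
    """Drop every predicted pair in which either protein is absent ('0')
    at one of the given short-branch taxon positions."""
    def flagged(protein):
        profile = profiles[protein]
        return any(profile[i] == "0" for i in short_branch_positions)

    return [(a, b) for a, b in predictions if not (flagged(a) or flagged(b))]


# Hypothetical four-taxon profiles; positions 1 and 2 stand in for the two short branches.
toy_profiles = {"P1": "1101", "P2": "1111", "P3": "1011", "P4": "1110"}
toy_predictions = [("P1", "P2"), ("P2", "P4"), ("P3", "P4")]
kept = remove_short_branch_predictions(toy_predictions, toy_profiles, (1, 2))
print(kept)
```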
An examination of the training data showed that 2 of the 5 predicted interactions by
constrained ML (Barker et al. 2007; Barker and Pagel 2005) at its selected optimum rate of
gain (0.025) during the training step (see Chapter 3) involved the protein
(16936528/NP_001789) which would have been removed by this post processing step. This
reduces the sensitivity of the method by 40%.
5.2.2 GO term enrichment
A plausible method to examine the potential accuracy of the predicted interaction is to test
whether the GO terms associated with the predicted interaction partners are enriched for
particular terms.
In order to subject the data to GO (Gene Ontology) term analysis (Ashburner et al.
2000) the set of 1,142 predicted interactions between protein pairs was converted to
predictions between gene pairs. This was carried out using IDconverter (Alibes et al. 2007).
This produced a set of 273 interactions between pairs of genes and 183 individual genes.
In order to investigate the validity of predicted interactions the set of interactions
between genes was converted into a network of interactions. The network can be represented
as an undirected graph. The graph in this case would be undirected as there is no way of
inferring any form of directional relationship between putative predictions.
The predicted interactions were converted into a graph through insertion of the
characters xx between the two members of each predicted pair. This converted the predicted interactions into a
format known as the simple interaction format (SIF), which is usable by the platform
Cytoscape (Shannon et al. 2003). Cytoscape is a program that allows visualisation and
analysis of network data (Shannon et al. 2003) and is widely used for such analyses. The
broad structure of the resultant graph is shown in Figure 5.4.
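The conversion to simple interaction format can be sketched as below. Each SIF line holds a source node, an interaction type and a target node, with xx as the interaction type label described above. The function name and gene labels are placeholders for illustration.

```python
def to_sif(pairs, interaction_type="xx"):
    """Render each predicted pair as one SIF line: source, type, target."""
    return "\n".join(f"{a} {interaction_type} {b}" for a, b in pairs)


# Hypothetical gene labels standing in for predicted gene pairs.
sif_text = to_sif([("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")])
print(sif_text)
```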
Figure 5.4: Graph of 273 interactions between genes as predicted by the application of
constrained ML post data filtering. Each vertex in the graph is one gene. The edges in the
graph represent a predicted functional interaction between two vertices.
In order to examine the quality of the predictions the graph was subjected to clique
analysis to break up the network into sub-graphs, which are densely connected. The
Cytoscape plugin ClusterViz (Cai 2010) was utilised to deconstruct the network into sub
clusters. The plugin was used with the FAG-EC agglomerative hierarchical algorithm (Li et
al. 2008), which builds up sub-clusters through analysis of the clustering coefficient of each
edge in the graph. The clustering coefficient measures the density of connections between a
given edge and its neighbours. It does this by calculating the number of triangles that a given
edge is part of and dividing this number by the number of triangles that might potentially
include it given the degree (number of incident edges) of its adjacent nodes (Radicchi et al.
2004). FAG-EC was run with a specified cut-off of sub-cliques of at least size 3. This was
because GO terms can be found to be significantly enriched in pairs of proteins even if an
annotation is attached to just one protein in the pair.
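The edge clustering coefficient described above can be illustrated with a short sketch. It assumes one common form of the measure: the number of triangles through an edge (u, v) divided by min(k_u - 1, k_v - 1), where k is the degree of a node. Some variants add one to the numerator, and returning 0 when no triangle is possible is a convention of this sketch only. It is not the ClusterViz implementation.

```python
def edge_clustering_coefficient(adjacency, u, v):
    """Triangles containing edge (u, v) over the maximum number possible."""
    triangles = len(adjacency[u] & adjacency[v])  # shared neighbours of u and v
    possible = min(len(adjacency[u]) - 1, len(adjacency[v]) - 1)
    return triangles / possible if possible > 0 else 0.0


# Toy undirected graph: a triangle A-B-C with a pendant node D attached to C.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(edge_clustering_coefficient(toy_graph, "A", "B"))
print(edge_clustering_coefficient(toy_graph, "C", "D"))
```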
FAG-EC was also run with a weak module definition. This identifies modules as
sub-cliques within graphs where the sum of the in-degrees of the nodes within a module is higher
than the sum of their out-degrees (Li et al. 2008). The in-degree of a node within an undirected
graph is defined as the number of edges connecting it to other nodes in the same
subgraph (Li et al. 2008). The out-degree of a node is defined as the number of edges
connecting it to the rest of the graph excluding its subgraph (Li et al. 2008).
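The weak module criterion can be checked with a few lines of code. The sketch below follows the definitions of in-degree and out-degree given above; the function name and the toy graph are illustrative assumptions rather than part of FAG-EC itself.

```python
def is_weak_module(adjacency, module):
    """True when the summed in-degree of the module's nodes exceeds
    their summed out-degree."""
    module = set(module)
    in_sum = sum(len(adjacency[n] & module) for n in module)
    out_sum = sum(len(adjacency[n] - module) for n in module)
    return in_sum > out_sum


# Toy undirected graph: a triangle A-B-C with a pendant node D attached to C.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(is_weak_module(toy_graph, {"A", "B", "C"}))
print(is_weak_module(toy_graph, {"C", "D"}))
```

For the triangle the summed in-degree is 6 against an out-degree of 1, so it qualifies. The pair C and D has an in-degree sum of 2 and an out-degree sum of 2, so it does not.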
The application of this algorithm yielded 10 connected sub-cliques. GO term
enrichment was examined through the use of the Cytoscape plugin Bingo (Maere et al. 2005).
Bingo operates through examination of all GO terms associated with a given network. There
are a number of sources of evidence by which a term may be associated with a gene
(Ashburner et al. 2000). These are:
IMP: inferred from mutant phenotype
IGI: inferred from genetic interaction
IPI: inferred from physical interaction
ISS: inferred from sequence similarity
IDA: inferred from direct assay
IEP: inferred from expression pattern
IEA: inferred from electronic annotation
TAS: traceable author statement
NAS: non-traceable author statement
ND: no biological data available
IC: inferred by curator
In order to utilise reliable sources of evidence terms that were determined using the evidence
codes ISS, IEA, NAS and ND were excluded.
Bingo operates by calculating the probability of the association of a given set of terms
with a cluster of genes given a background distribution of terms associated with a reference
set of genes. This is calculated using the hypergeometric test (Maere et al. 2005). The
probability of a given set of genes being associated with a given GO term follows the
hypergeometric distribution, which is equivalent to the binomial distribution but utilising
sampling without replacement (Sokal and Rohlf 1995). The probability of a cluster C of r
genes being associated with a given GO term g (if evaluated against a background set of N
genes and assuming the total number of genes associated with g is t) can be calculated. The
background probability of any given gene being associated with g is t/N and the probability
of g not being associated with a gene is (1 - t/N). Thus the probability of x genes inside C
being associated with g can be calculated using the formula

$$P(x) = \frac{\binom{(t/N)N}{x}\binom{(1 - t/N)N}{r - x}}{\binom{N}{r}}$$

where $\binom{k}{Y}$ is the number of combinations of k items taken Y at a time (Sokal and Rohlf 1995).
The effects of multiple testing are reduced through application of the Bonferroni
correction. This correction lowers the threshold at which p values are found to be significant
by dividing it by n, the number of tests performed (Sokal and Rohlf 1995). A procedure
involving the hypergeometric test is common in GO enrichment tools and is also utilised by
ClueGO (Bindea et al. 2009), Gorilla (Eden et al. 2009) and GOEAST (Zheng and Wang
2008) amongst others.
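The hypergeometric calculation and the Bonferroni correction can be sketched as follows, using the symbols defined above (x, r, t and N). Summing the upper tail to obtain an enrichment p value is an assumption about how such tests are usually applied, and the function names are hypothetical. This is not the Bingo implementation.

```python
from math import comb


def hypergeom_pmf(x, r, t, N):
    """Probability that exactly x of the r cluster genes carry the term,
    given that t of the N background genes carry it."""
    return comb(t, x) * comb(N - t, r - x) / comb(N, r)


def enrichment_p_value(x, r, t, N):
    """Upper tail: probability of observing x or more annotated genes."""
    return sum(hypergeom_pmf(i, r, t, N) for i in range(x, min(r, t) + 1))


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Compare a p value against alpha divided by the number of tests."""
    return p_value < alpha / n_tests


# Toy numbers: 4 background genes, 2 annotated, cluster of 2 genes.
print(hypergeom_pmf(1, 2, 2, 4))
print(enrichment_p_value(2, 2, 2, 4))
print(bonferroni_significant(0.001, 10))
```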
Bingo was run against a background set of genes, which consisted of the full set of
human genes held in Entrez Gene.
5.2.3 Intersection with other data sources
To examine the extent to which the predictions made by the filter in combination with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) intersected with known data it
was decided to compare the predictions to a known set of PPIs. It was decided to utilise the
I2D database (Brown and Jurisica 2005) as it contained data from all the other major PPI
databases. Thus the Interologous Interaction Database (I2D) version 1.95 was downloaded.
Predictions were converted from RefSeq GI numbers to their corresponding Uniprot
(Apweiler et al. 2010) primary accession. Only Swiss-Prot accessions were used, as these are
high confidence protein molecules that have been manually annotated (Apweiler et al. 2010).
As mentioned above there were 1,142 predictions made between RefSeq GI pairs.
This conversion reduced the set of predictions to 278 predictions as a complete mapping of
RefSeq to Uniprot is lacking.
5.3 Results
5.3.1 GO Enrichment
Table 5.1 shows details of the sub-clusters generated by ClusterViz (Cai 2010) ordered in
descending order by size.
Cluster No. | No. of genes | No. of interactions
1 | 97 | 188
Genes
PRPF31 RRP9 SNRPE SUPT4H1 TNPO1
COPB1 FASN GLRX5 RPS25 WWOX
DHDDS EXO1 H2AFY ATP5C1 RPS29
RER1 PHF5A PIGL RPS21 POLR2G PSMD8
TP53RK ABT1 ANAPC10 TCEA2 NOP10
POLR2L SF3B5 LZTR1 TUBGCP2 CDS1
MAK16 CTDSPL RBM34 KIFC1 GFPT1
PPP1CC UBE2D4 BYSL PSMA6 FDX1L
TFB1M C20orf118 KIAA1609 UBE2V1
NAPNAPB RLBP1 RPF1 PSMC4
TRAPPC6B RBMX2 RHOC TOP2B UBE2I
CDK5 FKBP4 CCT6A CDK7 CKS2 CTDSP1
DIMT1L FAM96B FKBP5 GNB1 GUK1
HSPE1 KIF19 LSM7 POLE2 PSMB1
RIOK2 RPL13 RPL19 RPL30 RPL37A
RPS23 SEC11C SLC2A6
SMARCAL1 TRAPPC1 TRAPPC4
TRAPPC6A TXNL4B UBE2V2 VBP1 VPS45
ZDHHC21 ERCC2 SPO11 TRIT1 SHMT2
GDPD1 DOLK DUSP5 LIG1 TRMT112
Table 5.1: Sub-cliques of predicted interactions generated through analysis of clustering
coefficients.
Cluster No. No. of genes No. of interactions
2
Genes
MTMR2 MTM1
MTMR9 MTMR1
10
VAPB ZNF516 STK17A

ZNF225
MAZ ZNF286A ZNF304
DNM1L
NLE1 CDC6 TEP1 GEMIN5
ZRSR2 PPP1CB
RPL31
DERL1 DERL2
DNAJC12
H2AFV ATG4A
UNG
Table 5.1: Sub-cliques of predicted interactions generated through analysis of clustering
coefficients (cont).
To investigate whether GO terms were enriched in the predicted sub-clusters, the clusters were
subjected to analysis for GO term enrichment using Bingo (Maere et al. 2005). All terms
were judged significant at p < 0.05 after application of the Bonferroni correction. Table 5.2
presents the clusters and the GO terms enriched in each cluster.
Cluster No.
1
Enriched GO terms
44238 primary metabolic process
44237 cellular metabolic process
8152 metabolic process
44260 cellular macromolecule metabolic process
43170 macromolecule metabolic process
6414 translational elongation
6412 translation
44267 cellular protein metabolic process
6368 RNA elongation from RNA polymerase II promoter
6354 RNA elongation
10467 gene expression
19538 protein metabolic process
44265 cellular macromolecule catabolic process
No significant enrichment
19224 termination of RNA polymerase II transcription
43653 mitochondrial fragmentation during apoptosis
79 regulation of cyclin-dependent protein kinase activity
31981 nuclear lumen
No significant enrichment
30970 retrograde protein transport, ER to cytosol
30433 ER-associated protein catabolic process
6515 misfolded or incompletely synthesized protein catabolic process
6984 ER-nuclear signaling pathway
30176 integral to endoplasmic reticulum membrane
31227 intrinsic to endoplasmic reticulum membrane
51789 response to protein stimulus
Table 5.2: GO enrichment in sub-cliques within predicted interaction network.

Cluster No.
6
Enriched GO terms
31301 integral to organelle membrane

31300 intrinsic to organelle membrane
43161 proteasomal ubiquitin-dependent protein catabolic
process
10498 proteasomal protein catabolic process
5789 endoplasmic reticulum membrane
42175 nuclear envelope-endoplasmic reticulum network
44432 endoplasmic reticulum part
43632 modification-dependent macromolecule catabolic
process
51603 proteolysis involved in cellular protein catabolic
process
19941 modification-dependent protein catabolic process
6511 ubiquitin-dependent protein catabolic process
44257 cellular protein catabolic process
30163 protein catabolic process
9607 response to biotic stimulus
6886 intracellular protein transport
19060 intracellular transport of viral proteins in host cell
30581 intracellular protein transport in host
51708 intracellular protein transport in other organism
during symbiotic interaction
15031 protein transport
44265 cellular macromolecule catabolic process
45184 establishment of protein localization
43285 biopolymer catabolic process
8104 protein localization
9057 macromolecule catabolic process
46719 regulation of viral protein levels in host cell
33036 macromolecule localization
12505 endomembrane system
6508 proteolysis
46907 intracellular transport
5783 endoplasmic reticulum
Table 5.2: GO enrichment in sub-cliques within predicted interaction network (cont).
44248 cellular catabolic process
31090 organelle membrane
9056 catabolic process
51649 establishment of localization in cell
42288 MHC class I protein binding
51641 cellular localization
42287 MHC protein binding
30307 positive regulation of cell growth
42221 response to chemical stimulus
45793 positive regulation of cell size
65008 regulation of biological quality
19048 virus-host interaction
45927 positive regulation of growth
51701 interaction with host
7242 intracellular signaling cascade
44419 interspecies interaction between organisms
44404 symbiosis, encompassing mutualism through
parasitism
6950 response to stress
6810 transport
51234 establishment of localization
22415 viral reproductive process
16032 viral reproduction
1558 regulation of cell growth
51179 localization
8361 regulation of cell size
44267 cellular protein metabolic process
19538 protein metabolic process
40008 regulation of growth
44260 cellular macromolecule metabolic process
51869 response to stimulus
16021 integral to membrane
7154 cell communication
43170 macromolecule metabolic process
30968 endoplasmic reticulum unfolded protein response

31224 intrinsic to membrane
44446 intracellular organelle part
44422 organelle part
43283 biopolymer metabolic process
7165 signal transduction
44425 membrane part
51706 multi-organism process
22414 reproductive process
8284 positive regulation of cell proliferation
6986 response to unfolded protein
No Significant enrichment.
5.3.2 Intersection with known data

The comparison of the set of protein interactions predicted by constrained ML (Barker et al.
2007; Barker and Pagel 2005) with the I2D database (Brown and Jurisica 2005) was carried
out by determining the intersection between the two sets of interactions. There were 2
predictions in common between the two datasets, which were not self-interactions (there were
9 self interactions in the intersection). These were:
Protein Pair (RefSeq GI numbers) | Evidence of interaction
118600991 13129120 | (Kummel et al. 2008)
21361657 4758304 | (Jessop et al. 2007)
Table 5.2: Intersection between I2D (Brown and Jurisica 2005) and predictions by logistic
regression/constrained ML (Barker et al. 2007; Barker and Pagel 2005).
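Determining the intersection between two sets of interactions, while separating out self-interactions, can be sketched as below. Pairs are treated as unordered, which is an assumption of this example, and the identifiers are placeholders rather than accessions from either dataset.

```python
def unordered(pairs):
    """Represent each pair as a frozenset so that (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in pairs}


def intersect_predictions(predicted, known):
    """Return (shared non-self pairs, shared self-interactions)."""
    shared = unordered(predicted) & unordered(known)
    self_pairs = {pair for pair in shared if len(pair) == 1}
    return shared - self_pairs, self_pairs


# Hypothetical identifiers for illustration only.
toy_predicted = [("A", "B"), ("C", "C"), ("D", "E")]
toy_known = [("B", "A"), ("C", "C"), ("X", "Y")]
non_self, self_only = intersect_predictions(toy_predicted, toy_known)
print(len(non_self), len(self_only))
```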
There are thus 1,131 predictions made by logistic regression/constrained ML (Barker
et al. 2007; Barker and Pagel 2005), which are potentially novel. All predictions made can be
seen in Appendix C.
5.3.3 Network statistics
The degree distribution of nodes within the graph appears to follow a power law. This could
potentially also be indicative of the correctness of the predictions made by constrained ML.
This pattern is observed in both the full and the reduced graphs as shown in Figures 5.5 and
5.6.
Figure 5.5: Degree distribution for full graph of protein interactions. Line is fitted power law
of the form $y = ax^b$. Line is fitted by least squares regression, $R^2$ = 0.694.
Figure 5.6: Degree distribution for graph of protein interactions post short branch filtration.
Line is fitted power law of the form $y = ax^b$. Line is fitted by least squares regression,
$R^2$ = 0.768.
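One way to fit a power law of this form to a degree distribution is ordinary least squares on log-transformed values, sketched below. Whether the fits in Figures 5.5 and 5.6 were made on log-transformed or untransformed data is not stated, so the log-log approach is an assumption, as are the function name and the toy degree list. The sketch needs at least two distinct degrees, all of them 1 or greater.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit count = a * degree**b by least squares on log-log values.
    Returns (a, b, r_squared)."""
    counts = Counter(degrees)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else 1.0
    return a, b, r_squared


# Toy degree list constructed so that count = 16 * degree**-2 exactly.
toy_degrees = [1] * 16 + [2] * 4 + [4] * 1
a, b, r_squared = fit_power_law(toy_degrees)
print(a, b, r_squared)
```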
5.4 Discussion
A full genome wide investigation of human protein interactions by constrained ML (Barker et
al. 2007; Barker and Pagel 2005) in combination with the logistic regression-based data filter
seems to be a potentially fruitful source of new protein interactions. The enrichment of GO
terms in some sub-cliques of the resultant network suggests that the system has an ability to
make predictions with some basis in reality and thus a proportion of the set of predictions
made are both novel and accurate.
5.4.1 GO enrichment
GO enrichment was investigated conservatively by excluding the GO evidence code IEA.
This evidence code is associated with 90% of GO annotations (Buza et al. 2008). However
despite removing terms associated with this code as well as terms associated with the codes
ISS, ND and NAS, a reasonable degree of enrichment was still observed.
The terms enriched appear to be associated with processes, which are divergent across
eukaryotes such as transcription (enriched in sub-clique 1) (Coulson and Ouzounis 2003).
This is a demonstration of the fact that it is only proteins that show a degree of variability in
their distribution pattern that are susceptible to this line of investigation.
5.4.2 Intersection with known data
The level of intersection with the I2D database is fairly low. Using the estimate of
interactome size provided by Stumpf et al. (2008) and assuming every interaction in I2D
(Brown and Jurisica 2005) is correct, the database would correspond to a coverage level of
133,250/650,000 or approximately 20%. Thus the probability of any given accurate prediction being within
this database would be 0.2. Thus the converse probability of an accurate prediction not being
in the database would be 1-0.2 or 0.8.
If every prediction made by the heuristic approach were accurate, then the observed
result of an intersection of 11 and a complement of 1,131 would be highly improbable ($0.8^{1131}$,
or ~0). The lack of intersection between the two datasets could be due to the bias in PPI
databases to particular physical detection systems such as yeast 2-hybrid. Approximately 37%
of the binary interactions held in HPRD (Mishra et al. 2006) were detected using yeast 2-hybrid.
The issue of RefSeq to Uniprot mapping is also pertinent in contributing to this lack
of intersection as over 75% of the predictions were lost post mapping.
Finally it is also unlikely that there is 100% accuracy in all PPIs held in I2D.
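The coverage estimate and the improbability of the observed complement can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this section; the variable names are illustrative, and the rounded value of 0.8 is used to match the text.

```python
import math

interactome_estimate = 650_000  # interactome size estimate (Stumpf et al. 2008)
i2d_size = 133_250              # number of I2D entries quoted in the text
not_in_database = 1_131         # predictions absent from I2D

# Coverage of the estimated interactome by I2D (rounded to 0.2 in the text).
coverage = i2d_size / interactome_estimate

# Probability that a single accurate prediction is absent, using the rounded value.
p_absent = 1 - 0.2

# Probability that all 1,131 predictions are accurate yet absent from the database.
p_all_absent = p_absent ** not_in_database
log10_p_all_absent = not_in_database * math.log10(p_absent)
```

The base-10 logarithm is reported alongside the probability because the probability itself is vanishingly small.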
5.4.3 Weaknesses
Clearly the result of a precision of 1 as achieved on the training and testing data cannot be
extended to a full genome wide survey. The fact that predictions are made through
comparisons of the phylogenetic distribution of proteins suggests that one weakness of the
method could be an inability to distinguish between paralogs/isoforms and proteins showing
evidence of correlated evolution. However this issue is far from clear-cut as there is evidence
to show that homologous proteins are more likely to interact (Ispolatov et al. 2005; Orlowski
et al. 2007). Thus it is possible that the success of the phylogenetic profile method is partly
based on this observation. This is a potentially confounding issue for the method. However,
examination of interactions between predicted orthologous groups can ameliorate this. Of the
1,142 pairs of proteins predicted to be functionally linked by this study, 221 (approximately
19%) lie within the same orthologous group as identified by the Inparanoid
implementation (Remm et al. 2001).
Thus predictions between members of orthologous groups are not particularly
widespread over the data examined.
The other weakness of phylogenetic profiling in general that applies to this set of
predictions is potentially inaccurate profiles. Profiles can be inaccurate for a number of
reasons including low coverage sequencing, poor annotation or incorrect assumptions in
homolog identification. The need for the short-branch filtration step undertaken before
further analysis is potentially attributable to this phenomenon.
5.4.3.1 Scaling
The precision and sensitivity results observed over the training data were based on a
biologically unrealistic ratio of 10:1 of negative to positive examples of interacting proteins.
The results observed can be adjusted for the whole genome by scaling to a more realistic
ratio. A possibly more realistic ratio can be calculated using estimates of interactome size.
These range from 154,000-369,000 (Hart et al. 2006) to 650,000 (Stumpf et al. 2008). If these
numbers are subtracted from the number of all potential interactions, 560,237,601 (calculated as
all possible pairs from the version of RefSeq held), the estimated ratios of negative to positive
range from approximately 860:1 to 3,636:1. Assume for the sake of argument that the ratio of
860:1 is adopted (via an assumption of an interactome size of 650,000). Recall that the size of the
positive set in the training data is 9,161 pairs of known interactions. Thus, as an illustrative
example, if a given predictive method yielded a precision of 0.5 and a sensitivity of 0.05 over
the training data, this would correspond to making 916 predictions of which 50% were correct
(TP=458, FP=458 and FN=8,703). In order to scale the data the following numbers need to be
calculated:
P(TP)= Probability of predicting a true positive.
P(FP)= Probability of predicting a false positive.
P(FN)= Probability of predicting a false negative.
These numbers can be calculated by the following equations:
$$P(TP) = \frac{TP}{PS} \qquad (1)$$

$$P(FP) = \frac{FP}{NS} \qquad (2)$$

$$P(FN) = \frac{FN}{PS} \qquad (3)$$
Where PS= size of the positive set and NS= size of the negative set.
For the example above P(TP)=458/9161=0.049, P(FP)=458/103971=0.004 and
P(FN)=8703/9161=0.95. Thus, by multiplying these probabilities by the estimated full
interactome size (in this case 650,000), the sizes of TP and FN can be calculated over the full
interactome. In order to calculate the size of FP, the size of a potential negatome (protein pairs
that do not interact) must be calculated. This can be calculated as the estimated size of the
interactome subtracted from the number of all possible interactions (in this case
560,237,601 - 650,000 = 559,587,601). Given these numbers, the values of TP, FP and FN over the
full interactome for this example would be 31,850, 2,238,216.5 and 617,500 respectively, leading
to a scaled precision of 0.014 and a scaled sensitivity of 0.049 over the whole interactome. In
cases where precision = 1, scaling will not affect this value as there are no false positives
predicted.
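The scaling arithmetic described above can be sketched in code. The following is a minimal illustration, assuming the rounded probabilities quoted in the text (0.049, 0.004 and 0.95); the function name `scale_to_interactome` and its argument names are hypothetical and are not part of the software used in this study. Because the inputs are rounded, the scaled false positive count may differ slightly from the figure quoted.

```python
ALL_PAIRS = 560_237_601   # all possible protein pairs, as quoted in the text
INTERACTOME = 650_000     # interactome size estimate (Stumpf et al. 2008)


def scale_to_interactome(p_tp, p_fp, p_fn, interactome_size, all_pairs):
    """Scale per-pair rates from the training data to the whole interactome."""
    negatome_size = all_pairs - interactome_size  # pairs assumed not to interact
    tp = p_tp * interactome_size
    fn = p_fn * interactome_size
    fp = p_fp * negatome_size
    return {
        "negatome": negatome_size,
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
    }


# Ratio of negative to positive pairs under the 650,000 estimate (about 860:1).
negative_to_positive = (ALL_PAIRS - INTERACTOME) / INTERACTOME

# Worked example from the text: P(TP)=0.049, P(FP)=0.004, P(FN)=0.95.
scaled = scale_to_interactome(0.049, 0.004, 0.95, INTERACTOME, ALL_PAIRS)
```

With these inputs the sketch gives a scaled precision of approximately 0.014 and a scaled sensitivity of approximately 0.049.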
Probabilities of predicted interactions being genuine can also be calculated via an
alternate route applying Bayes theorem with the prior probability of an interaction being
derived from an estimate of interactome size. Thus applying Bayes theorem the posterior
probability of an interaction can be calculated using the following parameters (Yang 2006):
P(I) = prior probability of interaction. Calculated by division of the interactome size
estimate by the total number of potential interactions.
P(Pos) = probability of making any positive prediction. Calculated as
$$P(Pos) = P(Pos \mid I)\,P(I) + P(Pos \mid {\sim}I)\,P({\sim}I)$$
In cases where precision = 1, $P(Pos \mid {\sim}I) = 0$. Note that $P(Pos \mid {\sim}I) = P(FP)$.
Thirdly, the probability of making a positive prediction given an interaction is
calculated as: P(Pos|I) = sensitivity of the method.
Thus the posterior probability of a predicted interaction being genuine can be calculated
using Bayes theorem as presented in Equation 4:
$$P(I \mid Pos) = \frac{P(I) \times P(Pos \mid I)}{P(Pos)} \qquad (4)$$
Bayes' theorem, however, is only informative in cases where precision < 1, as the posterior
probability is 1 when precision = 1.
This can be simply demonstrated using basic algebra and recasting the terms.
$$P(I \mid Pos) = \frac{Prior \times Sensitivity}{(Prior \times Sensitivity) + P(FP)(1 - Prior)} \qquad (5)$$
Thus, as $P(FP)(1 - Prior) = 0$ when precision = 1, the posterior probability is 1.
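The posterior calculation of Equation 5 can be sketched as follows. This is an illustration only: the function name `posterior_interaction` is hypothetical, and the inputs are the rounded values from the worked example in the scaling section (sensitivity 0.049, P(FP) 0.004) together with a prior derived from the 650,000 interactome estimate.

```python
def posterior_interaction(prior, sensitivity, p_fp):
    """Posterior probability P(I | Pos) following Equation 5."""
    numerator = prior * sensitivity
    p_pos = numerator + p_fp * (1.0 - prior)  # total probability of a positive call
    return numerator / p_pos


# Prior: interactome size estimate divided by all potential interactions.
prior = 650_000 / 560_237_601

# Worked example with a non-zero false positive rate.
posterior = posterior_interaction(prior, 0.049, 0.004)

# With no false positives (precision = 1) the posterior collapses to 1.
posterior_perfect = posterior_interaction(prior, 0.049, 0.0)
```

For the worked example the posterior is approximately 0.014, which agrees with the scaled precision obtained by the direct scaling route.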

5.4.4 Conclusions
Given the results observed on the training data and testing data and the GO term enrichment
observed in the sub-clusters, as well as the results of previous work on the phylogenetic
profile method (Barker et al. 2007; Barker and Pagel 2005; Bowers et al. 2004; Cokus et al.
2007; Kensche et al. 2008; Pagel et al. 2004b; Pellegrini et al. 1999; Vert 2002, amongst
others), it appears that the method is capable of discerning between proteins that are
functionally linked and proteins that are not. Thus the novel predictions made could
potentially be genuine interactions that are as yet uncharacterised.
Chapter 6
Conclusions and further work
6.1 Summary of Project
The goal of this project has been an investigation into the detection of human protein
interactions using the comparative method. More specifically, the goal was the development of
a novel heuristic approach to allow application of the effective but computationally intensive
constrained ML (Barker et al. 2007; Barker and Pagel 2005) approach to phylogenetic profile
analysis on a genome-wide scale. This application was intended to allow the generation of novel
predictions of protein interactions.
A database of all-against-all comparisons of the proteomes of 54 eukaryotic organisms
plus 1 archaeon was created. This was used as input to an implementation of the
Inparanoid (Remm et al. 2001) procedure to cluster the contents of the proteomes into
orthologous groups. Using the human proteome as a reference point, phylogenetic profiles
were then constructed for each protein within the human proteome.
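The construction of a phylogenetic profile can be illustrated with a short sketch. It assumes that, for each human protein, the set of species containing a member of its orthologous group is already known; the species list, protein names and function name below are hypothetical and serve only to show the presence/absence encoding.

```python
def build_profile(species_order, species_with_orthologue):
    """Return a presence (1) / absence (0) string, one character per species."""
    return "".join(
        "1" if species in species_with_orthologue else "0"
        for species in species_order
    )


# Hypothetical four-species ordering and orthologue assignments.
species_order = ["H. sapiens", "S. cerevisiae", "P. falciparum", "archaeon"]
orthologue_species = {
    "protA": {"H. sapiens", "S. cerevisiae"},
    "protB": {"H. sapiens", "P. falciparum", "archaeon"},
}

profiles = {
    protein: build_profile(species_order, present_in)
    for protein, present_in in orthologue_species.items()
}
```

In this toy example protA receives the profile 1100 and protB the profile 1011.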
Ten proteins that were universally present in single copies in all organisms under
consideration were then selected through analysis of the phylogenetic profiles and
orthologous groups. The versions of each of these single-copy proteins from each species were
then aligned to create a multiple sequence alignment. The multiple sequence alignments were then
concatenated to create one single combined alignment. This combined alignment provides a
measure of divergence between the 55 organisms under consideration. The concatenated
multiple sequence alignment was then used to reconstruct a phylogenetic tree of the 54
eukaryotes under consideration, using the archaeon as an outgroup with which to root the tree.
This phylogeny was broadly congruent with current thought on eukaryotic evolution (see
Chapter 2).
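The concatenation step can be sketched as follows, assuming each multiple sequence alignment is held as a mapping from species name to aligned sequence. The function name `concatenate_alignments` and the toy sequences are hypothetical; the alignment software actually used is described in Chapter 2.

```python
def concatenate_alignments(alignments, species):
    """Join per-protein alignments end to end for each species."""
    for alignment in alignments:
        lengths = {len(alignment[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must share a length")
    return {sp: "".join(alignment[sp] for alignment in alignments) for sp in species}


# Two toy alignments over two species ("-" marks an alignment gap).
alignments = [
    {"sp1": "MK-", "sp2": "MKA"},
    {"sp1": "LAV", "sp2": "L-V"},
]
combined = concatenate_alignments(alignments, ["sp1", "sp2"])
```

Every species contributes exactly one row to each alignment, which is why only proteins present in single copies in all organisms are suitable for this step.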
Figure 6.1: Process flow for research carried out in Chapter 2.

In order to use constrained ML (Barker et al. 2007; Barker and Pagel 2005) to analyse
the training data it was necessary to ascertain the optimum rate at which a character could be
gained in order to constrain the models of evolution used by the method (Barker et al. 2007).
In order to do this it was necessary to obtain training data, i.e. examples of protein pairs that
interact and examples of protein pairs that are unlikely to interact. Positive data were based
on protein interactions held within the HPRD database. Negative data were
generated by creating a set of all possible pairs of human proteins. These pairs were filtered
by removing all pairs that possessed any Gene Ontology (GO) (Ashburner et al. 2000) terms
in common. Once these training sets were obtained, different rates of protein gain were
evaluated in terms of precision and sensitivity and an optimum rate of gain of 0.025 was
selected. The highest precision reached by constrained ML at this rate was 1 at a cut-off of
56.37. The sensitivity of the method at this cut-off was 0.000654.
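The evaluation of a score cut-off in terms of precision and sensitivity can be sketched as below. This is a minimal illustration with invented scores; it assumes that a pair is predicted to interact when its score is greater than or equal to the cut-off, which is a convention adopted here rather than one stated in the text.

```python
def precision_sensitivity(scored_pairs, cutoff):
    """Compute (precision, sensitivity) for (score, is_interacting) pairs."""
    tp = fp = fn = 0
    for score, is_interacting in scored_pairs:
        predicted = score >= cutoff  # assumed convention for a positive call
        if predicted and is_interacting:
            tp += 1
        elif predicted:
            fp += 1
        elif is_interacting:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else None
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    return precision, sensitivity


# Invented scores: True marks a known interaction, False a negative example.
toy_pairs = [(60.0, True), (57.0, True), (40.0, False), (30.0, True), (10.0, False)]

strict = precision_sensitivity(toy_pairs, 56.37)
lenient = precision_sensitivity(toy_pairs, 35.0)
```

Raising the cut-off in the toy data removes the false positive and so raises precision, while sensitivity stays the same or falls; this is the trade-off explored when selecting a cut-off.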
The efficacy of constrained ML (Barker et al. 2007; Barker and Pagel 2005) in
detecting protein-protein interactions was then compared, using the training data, to that of a
high-throughput laboratory-based method for detecting interactions. This method was the
examination of gene co-expression in response to given experimental stimuli as measured by
microarrays. The highest performing microarray experiments also achieved a precision of 1.
The highest performing microarray experiment, E-MEXP-1224 (Garman et al. 2009), achieved
a sensitivity of 0.003.
Constrained ML (Barker et al. 2007; Barker and Pagel 2005) was also compared to
the PIPs server (McDowall et al. 2009) which uses a semi-naive Bayesian classifier (Scott
and Barton 2007) in order to evaluate multiple sources of evidence for potential protein
interactions. At its highest cut-off the Bayesian classifier achieved a precision of 0.9883721
and a sensitivity of 0.01.
Figure 6.2: Process flow for research described in Chapter 3.

These comparisons showed that constrained ML (Barker et al. 2007; Barker and Pagel
2005), at an optimal level of constraint for rate of gain, achieved levels of precision
comparable to gene co-expression and outperformed the method that integrated multiple sources
of evidence. In terms of sensitivity, however, constrained ML (Barker et al. 2007; Barker and
Pagel 2005) was clearly the worst performer. However, given that constrained ML (Barker et
al. 2007; Barker and Pagel 2005) achieved a precision of 1 over the training data, it was
utilised for further analysis.
The application of constrained ML (Barker et al. 2007; Barker and Pagel 2005) to a
full genome-wide survey was found to be impractical due to time considerations. Thus a
heuristic was developed which approximated the ability of constrained ML (Barker et al.
2007; Barker and Pagel 2005) to distinguish between proteins that interact and those that do
not. This heuristic was based on the reconstruction of ancestral states using Dollo parsimony
(Farris 1977) over the phylogenetic tree. Two novel potential heuristics were developed,
implemented and tested using the Dollo parsimonious reconstruction. The first was an
implementation of a test for correlated evolution which calculates the probability of the
concentration of a set of gains and losses of a protein in the areas of a phylogenetic tree
where a second protein was either present or absent (Maddison 1990). The second potential
heuristic was based on logistic regression using empirical counts of the presence, absence,
gain or loss of one protein given the presence, absence, gain or loss of the other as predictor
variables.
The Maddison test (Maddison 1990) based heuristic performed reasonably well in its
own right as a method of detecting functional interactions. It achieved a maximum precision
of 0.857 with a sensitivity of $6.54 \times 10^{-4}$ over the training data at a score cut-off of
0.9999997999999475. However, it proved not to be efficient enough in terms of speed to be
justified for use as a heuristic. It also did not maintain an intersection with the 5 predictions
made by constrained ML (Barker et al. 2007; Barker and Pagel 2005) at its optimum rate of
gain (0.025) and at its optimum likelihood ratio (LR) statistic score cut-off (58.3).
Maintenance of an intersection with these predictions was considered a necessary property of
an effective heuristic.
The heuristic that utilised logistic regression achieved a precision of 0.736 with a
sensitivity of 0.01 at its optimum cut-off of 0.967. It also maintained an intersection with the
predictions made by constrained ML (Barker et al. 2007; Barker and Pagel 2005) (at its
optimum rate of gain and LR cut-off) up to a cut-off of 0.85. At a cut-off of 0.85 the heuristic
made 1,230 predictions, which amounted to a reduction of the search space of potential
protein pairs by 98.9%.
The heuristic based on logistic regression was then applied to the full human genome
in order to filter out protein pairs that displayed little or no evidence of correlated evolution.
The heuristic reduced the size of the search space by 90% over the whole genome.
Figure 6.3: Process flow for research carried out in Chapter 4. Note: Validation sets
were used to validate all methods. The connectors have been left out for clarity.
With the heuristic applied, a full genome-wide survey was launched.
The genome-wide survey found that a large majority of predicted protein
interactions involved proteins that had been lost on short branches in the phylogeny. These
predictions were removed from the overall set of predictions. The prediction set was then
recast as a network of interactions.
The results of the genome-wide survey were then examined by generating sub-networks
from the complete network generated and examining these sub-networks for
enrichment in Gene Ontology (GO) (Ashburner et al. 2000) terms. GO term enrichment was
found in 57% of the clusters generated. The intersection of the predictions made by
constrained ML (Barker et al. 2007; Barker and Pagel 2005) with the I2D database (Brown
and Jurisica 2005) was also examined. The intersection with the I2D (Brown and Jurisica
2005) database was low suggesting that any correct predictions generated by this project are
also novel predictions of protein interaction. The genome-wide survey yielded a final set of
1,131 predictions of protein interaction.
Figure 6.4: Procedure for research carried out in Chapter 5.
6.1.1 Repeat Analysis

To apply this method to a new dataset, the following procedure would have to be followed.
Prerequisites needed:
Phylogenetic tree for species of interest.
Phylogenetic profiles for proteins of interest.
Positive and negative examples of protein interaction data. An automated procedure
for the acquisition of training/testing data is described by Chen et al. (2011).
Having acquired these, the programs BayesTraits (Pagel et al. 2004a) and bms_runner
(Barker et al. 2007) should be downloaded.
To determine the optimum rate of protein gain for use in the constrained ML procedure
bms_runner should be used to evaluate multiple rates of gain. The LR scores for all
proteins for the optimum rate of gain should be kept.
Once this rate is determined, the next step is the ancestral state reconstructions. In order
to carry out these reconstructions it will be necessary to download the program
DOLLOP held in the PHYLIP package (Felsenstein 1989).
DOLLOP should be run with the U option, which will allow it to utilise the
phylogenetic tree. (Note bms_runner uses a NEXUS formatted tree while DOLLOP will
need a PHYLIP format tree). DOLLOP should be run on every profile in the dataset.
Thus the end product of this step is a set of ancestral reconstructions over the tree for
each profile.
At this point code written by the author (available on request) can be used to process
these reconstructions. This code will take in the reconstructions and return a dataset
consisting of the s parameters described in Chapter 4 calculated for each protein.
These data can then be processed using the standard statistical package R (R Development
Core Team 2011) in order to carry out logistic regression. Once regression has been
carried out, this should yield a linear equation for calculating a logit-based score for the
probability of interacting.
Again code available from the author can now be utilised. This code will take in the
specified coefficients for the s parameters calculated in R, the Dollo reconstructions of
the proteins, the LR scores of the proteins at the optimum rate of gain and the validation
data and return the optimum logit cut-off for the data for preserving the performance of
constrained ML (Barker et al. 2007; Barker and Pagel 2005).
At this point a dataset of all possible pairs of profiles should be prepared. Code from the
author can be used to apply the linear equation to each of these pairs to calculate the logit
score. These pairs can now be filtered by the optimum cut-off.
Once a reduced set has been created, constrained ML can be applied to this set (Barker
et al. 2007; Barker and Pagel 2005).
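The filtering stage of the procedure above can be sketched as follows. This is an illustration under stated assumptions and not the code written by the author: the coefficient values, the two-element feature vectors and the function names are hypothetical stand-ins for the fitted regression coefficients and the s parameters of Chapter 4, and the cut-off is assumed to be applied to the score on the probability scale.

```python
import math


def logistic_score(intercept, coefficients, features):
    """Convert a linear (logit) combination of features into a probability."""
    logit = intercept + sum(c * x for c, x in zip(coefficients, features))
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)  # numerically stable branch for negative logits
    return exp_logit / (1.0 + exp_logit)


def filter_pairs(pair_features, intercept, coefficients, cutoff):
    """Keep only pairs whose score reaches the cut-off."""
    return [
        pair
        for pair, features in pair_features.items()
        if logistic_score(intercept, coefficients, features) >= cutoff
    ]


# Hypothetical coefficients and per-pair feature values.
intercept, coefficients = -2.0, [1.5, 0.5]
pair_features = {("A", "B"): [3.0, 1.0], ("A", "C"): [0.0, 1.0]}

kept_pairs = filter_pairs(pair_features, intercept, coefficients, cutoff=0.85)
reduction = 1 - len(kept_pairs) / len(pair_features)
```

Only the pairs retained by the filter would then be passed to constrained ML, which is where the saving in computation time arises.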
6.2 Conclusion
This project has investigated use of the comparative method, specifically constrained ML
(Barker et al. 2007; Barker and Pagel 2005), as a means to detect protein-protein interactions.
It has generated a set of predictions that, if validated by further laboratory-based
investigation, could contribute to knowledge about the human interactome. It has also developed
a method that allows the application of the computationally intensive constrained ML (Barker et al.
2007; Barker and Pagel 2005) approach to phylogenetic profiling on a genome-wide scale.
The ability of the comparative method to unearth protein interactions can only be
enhanced by the current rate of data generation given the rapid uptake of next generation
sequencing technologies such as the Roche 454 GS FLX sequencer, the Illumina Genome
Analyser and the Applied Biosystems SOLID sequencer, which can generate gigabases of
sequence data in a matter of days (Mardis 2008). As more organisms are sequenced the
quality of reconstructed phylogenies and consequently the efficacy of the comparative
method in detecting associations between traits should improve due to increased taxon
sampling (Heath et al. 2008).
Given this increased pace of data generation it is also necessary to develop fast and
effective computational techniques for functional annotation of proteins. Detection of protein
interactions can be used to functionally annotate proteins via the principle of guilt by
association (Aravind 2000). Thus the combination of the developed heuristic with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) can contribute to annotation
efforts. It has been seen that this method is not very sensitive; thus the probability of it
making any predictions at all for a given protein is low. However, used in a high-throughput
unsupervised context, the method is potentially capable of detecting novel interactions as one
tool amongst many.
Among the methods of detecting protein interactions examined over the course of this
study was the PIPs server (McDowall et al. 2009), which, as mentioned above, combines
multiple sources of evidence in order to detect potential protein interactions (Scott and Barton
2007) utilising a Bayesian classifier. A similar approach of using combined evidence in a
Bayesian framework was previously taken by Jansen et al. (2003). This combination of
diverse sources of evidence as a means to elucidate protein interactions has also been applied
by Mohamed et al. (2010), utilising a classifier based on a majority vote from a
collection of decision trees. Other approaches, such as support vector machines and single
decision trees, have also been investigated by Qi et al. (2006).
Potentially the application of constrained ML (Barker et al. 2007; Barker and Pagel
2005) in combination with the heuristic in a genome-wide manner could be utilised as a
source of contributory evidence in a similar framework.
6.3 Future directions
Constrained ML (Barker et al. 2007; Barker and Pagel 2005) has been seen to be capable of
detecting protein-protein interactions at a reasonable level of accuracy. With the data
accumulated over the course of this project there are a number of further avenues of
investigation and areas of extension.
6.3.1 Computational extensions
The procedure followed in order to utilise constrained ML (Barker et al. 2007; Barker and
Pagel 2005) for a genome-wide survey of H. sapiens involved the use of bespoke scripts and
various programs provided by a plethora of authors as cited throughout this text. To facilitate
the application of this tool by other users it will be necessary to create an interface and
combine the functionality of the programs utilised into one computational procedure.
The construction of phylogenetic profiles for all proteins in all species held in the
current dataset and the provision of these profiles online via a web interface would also
facilitate this process. The data generated by this project, as presented in Appendix D, could
also be presented via an online database: either an extant protein interaction database such as
STRING or I2D, or a bespoke database, which would have to be constructed.
6.3.2 Consensus profiles
The application of constrained ML (Barker et al. 2007; Barker and Pagel 2005) to detection
of protein interactions is carried out in a pairwise fashion. Work by Bowers extended the idea
of pairwise comparisons to three way comparisons using Boolean logic operators (Bowers et
al. 2004). This method attempted to detect dependencies in the presence and absence of a
given gene on the presence and absence of two other genes. A similar technique could be
utilised to integrate matching profiles into consensus profiles. By classifying mismatches as
missing information consensus phylogenetic profiles could be constructed to represent groups
of proteins. The program BayesTraits (Pagel et al. 2004a), which is utilised to apply the
constrained ML approach, handles missing data by reconstruction of the missing data as an
extension of ancestral state reconstruction. Thus when a plausible reconstruction is reached at
the immediate ancestral node of the taxon with the missing data, the state of the taxon can be
estimated using rate transition parameters (Pagel 1994). Consensus profiles will utilise a
mismatch character X to represent missing information. Thus if, for example, we compare the
following two four-species profiles:
1010
1110
The consensus profile of the above two profiles would be:
1X10
In comparisons of consensus profiles the X character will remain unchanged if matched
against another X, shift to 0 if matched against a 0 and shift to 1 if matched against a 1. Thus
a 1 or a 0 in a consensus profile will always be present in more than 50% of its constituent
profiles.
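The merging rule for consensus profiles can be sketched in a few lines. This is an illustrative implementation of the rule as stated above (agreement keeps the character, X yields to a 0 or 1, and a 0/1 mismatch becomes X); the function names are hypothetical.

```python
def merge_symbols(a, b):
    """Combine two profile characters drawn from '0', '1' and 'X'."""
    if a == b:
        return a      # agreement, including X against X
    if a == "X":
        return b      # X shifts to the 0 or 1 it is matched against
    if b == "X":
        return a
    return "X"        # 0 against 1: recorded as missing information


def consensus(profile_a, profile_b):
    """Build the consensus of two equal-length profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    return "".join(merge_symbols(a, b) for a, b in zip(profile_a, profile_b))


first = consensus("1010", "1110")   # the example given in the text
second = consensus(first, "1010")   # the X is resolved by a further profile
```

The first call reproduces the consensus 1X10 from the example; matching that consensus against a further profile 1010 resolves the X to 0.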
Some of these groups will represent clade-specific distributions of proteins. Others
will represent distributions of proteins correlated with the distribution of a given function
over the species under consideration. Comparison of a protein with an as yet unascertained
function against consensus profiles would connect the protein to either a clade-specific group
or a group that possessed a function connected to the presence of the protein. Thus a protein
that showed correlated evolution with a consensus profile could potentially be functionally
linked to all constituent members of that profile. At a higher level, if two consensus profiles
show evidence of correlated evolution with each other this could suggest functional linkage
between two groups of proteins, e.g. the functional interaction of one pathway with another.
6.3.3 Correlated evolution of proteins with the presence or absence of phenotypes
Given the data currently generated an interesting avenue of investigation would be the
comparison of the presence and absence of given phenotypes with the presence and absence
of given proteins. This process can detect proteins that underlie the phenotype of interest.
This method was developed by Levesque (Levesque et al. 2003) and used to detect genes
associated with cell motility. It was also applied to associating a number of phenotypes with
given proteins (Jim et al. 2004; Slonim et al. 2006). The method was found to be
reasonably effective with traits that were evenly distributed among the organisms under
consideration (Jim et al. 2004). A further application of the method by Gonzalez and Zimmer
examined the association of optimal growth pH with given genotypes (Gonzalez and Zimmer
2008). Gonzalez and Zimmer utilised a threshold with which to discretise continuous
phenotypes (Gonzalez and Zimmer 2008). If the measured value of a phenotype
was over a given value then the phenotype was declared present. Applications of this method
have so far utilised measures like string distance measures (Jim et al. 2004; Levesque et al.
2003) and mutual information (Gonzalez and Zimmer 2008; Slonim et al. 2006) to compare
the phylogenetic profiles of genes and given phenotypes. Use of a phylogenetically aware
method such as constrained ML (Barker et al. 2007; Barker and Pagel 2005) would enhance
the method and potentially yield more accurate results. Given the range of eukaryotic
organisms currently held, potential traits to be investigated could include multi-cellularity,
aerobic respiration and parasitism.
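The thresholding of a continuous phenotype and its comparison with a gene profile by mutual information, as used in the previous studies cited above, can be sketched as follows. The phenotype values, threshold and gene profiles are invented for illustration, and the function names are hypothetical; this is not a phylogenetically aware comparison of the kind proposed here.

```python
import math
from collections import Counter


def discretise(values, threshold):
    """Declare the phenotype present (1) when the value exceeds the threshold."""
    return [1 if value > threshold else 0 for value in values]


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete vectors."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Invented optimal growth pH values for four species, thresholded at pH 7.
phenotype = discretise([3.0, 7.5, 8.0, 4.5], threshold=7.0)

matching_gene = [0, 1, 1, 0]    # present exactly where the phenotype is present
unrelated_gene = [0, 1, 0, 1]   # no association with the phenotype

mi_matching = mutual_information(phenotype, matching_gene)
mi_unrelated = mutual_information(phenotype, unrelated_gene)
```

A measure of this kind treats species as independent observations, which is the limitation that a phylogenetically aware method would address.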
6.3.4 Drug Targets
Keeping with the theme of parasitism, there are a number of disease-causing parasitic
organisms in the dataset currently held. These are:
Plasmodium knowlesi
Plasmodium yoelii
Trypanosoma brucei
Trypanosoma cruzi
Leishmania major
Theileria annulata
Theileria parva
These include T. cruzi and T. brucei, which cause Chagas disease (Lescure et al.
2010) and sleeping sickness (Ralston et al. 2009) respectively. Also included in the dataset
are three members of the malaria-causing genus Plasmodium. Take for example P.
falciparum. There is currently resistance to all five groups of anti-malarial drugs (Hayton and
Su 2004). The detection of protein interactions in P. falciparum could potentially aid in the
development of new anti-malarial drugs. Using this species as a reference point, phylogenetic
profiles for its proteome could be constructed. An application of the logistic regression based
heuristic would make all against all comparisons using constrained ML (Barker et al. 2007;
Barker and Pagel 2005) feasible. These studies could potentially detect novel protein
interactions within P. falciparum. Disruption of protein-protein interactions is potentially
one avenue for drug development. This could potentially be carried out via procedures such
as peptidomimetics (Hruby 1997), which involves the construction of a molecule that mimics
the properties of one of the interacting partners. The construction of phylogenetic profiles
could also reveal proteins and protein interactions that are unique to P. falciparum. These
molecules could potentially be targeted with a lower risk of side effects in the host organism.
A similar procedure could be followed with all other parasitic organisms in the dataset.
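The identification of proteins unique to a reference species from its phylogenetic profiles can be sketched as below. The profiles and protein names are invented, and the function name `unique_to_reference` is hypothetical; the sketch assumes presence/absence strings in which one position corresponds to the reference species.

```python
def unique_to_reference(profiles, reference_index):
    """Return proteins present in the reference species and absent elsewhere."""
    unique = []
    for protein, profile in profiles.items():
        others = profile[:reference_index] + profile[reference_index + 1:]
        if profile[reference_index] == "1" and "1" not in others:
            unique.append(protein)
    return unique


# Invented profiles; position 0 is taken to be the reference parasite.
parasite_profiles = {"p1": "1000", "p2": "1010", "p3": "1000"}
candidates = unique_to_reference(parasite_profiles, reference_index=0)
```

In practice an apparently unique protein may also reflect an inaccurate profile, so candidates of this kind would need to be checked against the weaknesses discussed in Chapter 5.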
References
Agnarsson I, Miller JA (2008) Is Acctran Better Than Deltran? Cladistics 24:1032
Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA (1997)
Evidence for a Clade of Nematodes, Arthropods and Other Moulting Animals.
Nature 387:489
Ahola V, Aittokallio T, Vihinen M, Uusipaikka E (2006) A Statistical Score for Assessing
the Quality of Multiple Sequence Alignments. BMC Bioinformatics 7:484
Albert VA (2006) Parsimony, Phylogeny, and Genomics. Oxford University Press, Oxford
Alberts B (1998) Essential Cell Biology : An Introduction to the Molecular Biology of the
Cell. Garland, New York
Alberts B (2002) Molecular Biology of the Cell. Garland Science
Alberts B (2008) Molecular Biology of the Cell. Garland Science, New York ; Abingdon
Alberts B (2010) Essential Cell Biology. Garland Science, New York ; London
Alibes A, Yankilevich P, Canada A, Diaz-Uriarte R (2007) Idconverter and Idclight:
Conversion and Annotation of Gene and Protein Ids. BMC Bioinformatics 8
Alon U (2007) An Introduction to Systems Biology Design Principles of Biological
Circuits. Chapman & Hall / CRC
Altenhoff AM, Dessimoz C (2009) Phylogenetic and Functional Assessment of Orthologs
Inference Projects and Methods. PLoS Comput Biol 5:e1000262
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic Local Alignment
Search Tool. J Mol Biol 215:403
Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schaffer AA, Yu YK (2005)
Protein Database Searches Using Compositionally Adjusted Substitution
Matrices. FEBS Journal 272:5101
C. elegans Sequencing Consortium (1998) Genome Sequence of the Nematode C. Elegans:
A Platform for Investigating Biology. Science 282:2012
Antonov AV, Mewes HW (2008) Complex Phylogenetic Profiling Reveals Fundamental
Genotype-Phenotype Associations. Computational Biology and Chemistry 32:412
Apweiler R, Martin MJ, O'Donovan C, Magrane M, Alam-Faruque Y, Antunes R, Barrell D,
Bely B, Bingley M, Binns D, Bower L, Browne P, Chan WM, Dimmer E, Eberhardt R,
Fedotov A, Foulger R, Garavelli J, Huntley R, Jacobsen J, Kleen M, Laiho K, Leinonen R,
Legge D, Lin Q, Liu WD, Luo J, Orchard S, Patient S, Poggioli D, Pruess M, Corbett M, di
Martino G, Donnelly M, van Rensburg P, Bairoch A, Bougueleret L, Xenarios I, Altairac S,
Auchincloss A, Argoud-Puy G, Axelsen K, Baratin D, Blatter MC, Boeckmann B, Bolleman
J, Bollondi L, Boutet E, Quintaje SB, Breuza L, Bridge A, deCastro E, Ciapina L, Coral D,
Coudert E, Cusin I, Delbard G, Doche M, Dornevil D, Roggli PD, Duvaud S, Estreicher A,
Famiglietti L, Feuermann M, Gehant S, Farriol-Mathis N, Ferro S, Gasteiger E, Gateau A,
Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hulo N, James J, Jimenez S,
Jungo F, Kappler T, Keller G, Lachaize C, Lane-Guermonprez L, Langendijk-Genevaux P,
Lara V, Lemercier P, Lieberherr D, Lima TD, Mangold V, Martin X, Masson P, Moinat M,
Morgat A, Mottaz A, Paesano S, Pedruzzi I, Pilbout S, Pillet V, Poux S, Pozzato M, Redaschi
N, Rivoire C, Roechert B, Schneider M, Sigrist C, Sonesson K, Staehli S, Stanley E, Stutz A,
Sundaram S, Tognolli M, Verbregue L, Veuthey AL, Yip LN, Zuletta L, Wu C, Arighi C,
Arminski L, Barker W, Chen CM, Chen YX, Hu ZZ, Huang HZ, Mazumder R, McGarvey P,
Natale DA, Nchoutmboube J, Petrova N, Subramanian N, Suzek BE, Ugochukwu U,
Vasudevan S, Vinayaka CR, Yeh LS, Zhang J (2010) The Universal Protein Resource
(Uniprot) in 2010. Nucleic Acids Research 38:D142
Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M,
Ghanbarian AT, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M,
Michaut M, Montecchi-Palazzi L, Neuhauser SN, Orchard S, Perreau V, Roechert B,
van Eijk K, Hermjakob H (2010) The Intact Molecular Interaction Database in
2010. Nucleic Acids Res 38:D525
Aravind L (2000) Guilt by Association: Contextual Information in Genome Analysis.
Genome Research 10:1074
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K,
Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S,
Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene
Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium.
Nat Genet 25:25
Aubourg S, Rouze P (2001) Genome Annotation. Plant Physiology and Biochemistry
39:181
Avery OT, Macleod CM, McCarty M (1944) Studies on the Chemical Nature of the
Substance Inducing Transformation of Pneumococcal Types: Induction of
Transformation by a Desoxyribonucleic Acid Fraction Isolated from
Pneumococcus Type III. J Exp Med 79:137
Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW (2001) Bind--the
Biomolecular Interaction Network Database. Nucleic Acids Res 29:242
Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF (2000) A Kingdom-Level Phylogeny
of Eukaryotes Based on Combined Protein Data. Science 290:972
Baldi P, Brunak S (2001) Bioinformatics : The Machine Learning Approach. MIT Press,
Cambridge, Mass.
Barker D, Meade A, Pagel M (2007) Constrained Models of Evolution Lead to Improved
Prediction of Functional Linkage from Correlated Gain and Loss of Genes.
Bioinformatics 23:14
Barker D, Pagel M (2005) Predicting Functional Gene Links from Phylogenetic-Statistical
Analyses of Whole Genomes. PLoS Comput Biol 1:e3
Beadle GW, Tatum EL (1941) Genetic Control of Biochemical Reactions in Neurospora.
Proc Natl Acad Sci U S A 27:499
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2009) Genbank. Nucleic
Acids Res 37:D26
Berg JM, Tymoczko JL, Stryer L (2001) Biochemistry. W. H. Freeman and Co., New York
Berg JM, Tymoczko JL, Stryer L (2007) Biochemistry. W. H. Freeman, New York
Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, Fridman WH,
Pages F, Trajanoski Z, Galon J (2009) Cluego: A Cytoscape Plug-in to Decipher
Functionally Grouped Gene Ontology and Pathway Annotation Networks.
Birney E, Clamp M, Durbin R (2004) Genewise and Genomewise. Genome Res 14:988
Black DL (2003) Mechanisms of Alternative Pre-Messenger RNA Splicing. Annu Rev
Biochem 72:291
Blair C, Murphy RW (2011) Recent Trends in Molecular Phylogenetic Analysis: Where
to Next? J Hered 102:130
Blanchard JL, Lynch M (2000) Organellar Genes - Why Do They End up in the Nucleus?
Trends in Genetics 16:315
Blow MJ (2004) A Survey of RNA Editing in the Human Brain Sanger Institute.
University of Cambridge, Cambridge
Borodovsky M, Rudd KE, Koonin EV (1994) Intrinsic and Extrinsic Approaches for
Detecting Genes in a Bacterial Genome. Nucleic Acids Res 22:4756
Bowers PM, Cokus SJ, Eisenberg D, Yeates TO (2004) Use of Logic Relationships to
Decipher Protein Network Organization. Science 306:2246
Bratke K (2009) Comparative Analysis of Poxvirus Genome Evolution. University of
Dublin, Trinity College, Dublin
Breathnach R, Benoist C, O'Hare K, Gannon F, Chambon P (1978) Ovalbumin Gene:
Evidence for a Leader Sequence in mRNA and DNA Sequences at the Exon-Intron
Boundaries. Proc Natl Acad Sci U S A 75:4853
Brennan RG, Matthews BW (1989) The Helix-Turn-Helix DNA Binding Motif. J Biol
Chem 264:1903
Brent MR (2008) Steady Progress and Recent Breakthroughs in the Accuracy of
Automated Genome Annotation. Nat Rev Genet 9:62
Brown KR, Jurisica I (2005) Online Predicted Human Interaction Database.
Brown TA (2006) Genomes 3. Garland Science Pub., New York
Bruno WJ, Halpern AL (1999) Topological Bias and Inconsistency of Maximum
Likelihood Using Wrong Models. Molecular Biology and Evolution 16:564
Burge C, Karlin S (1997) Prediction of Complete Gene Structures in Human Genomic
DNA. Journal of Molecular Biology 268:78
Burki F, Shalchian-Tabrizi K, Pawlowski J (2008) Phylogenomics Reveals a New
'Megagroup' Including Most Photosynthetic Eukaryotes. Biology Letters 4:366
Buza TJ, McCarthy FM, Wang N, Bridges SM, Burgess SC (2008) Gene Ontology
Annotation Quality Analysis in Model Eukaryotes. Nucleic Acids Research
36(2):e12
Cai JC, G. Wang , J (2010) ClusterViz: A Cytoscape Plugin for Graph Clustering and
Visualization. Central South University, Changsha
Camin JH, Sokal RR (1965) A Method for Deducing Branching Sequences in Phylogeny.
Evolution 19:311
Capecchi MR (2005) Gene Targeting in Mice: Functional Analysis of the Mammalian
Genome for the Twenty-First Century. Nat Rev Genet 6:507
Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T (2009) Trimal: A Tool for Automated
Alignment Trimming in Large-Scale Phylogenetic Analyses. Bioinformatics
25:1972
Cavalli-Sforza LL, Edwards AWF (1964) Reconstruction of Evolutionary Trees.
Phenetic and Phylogenetic Classification 6:67-76
Ceol A, Aryamontri AC, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G
(2010) Mint, the Molecular Interaction Database: 2009 Update. Nucleic Acids
Research 38:D532
Chalfie M, Tu Y, Euskirchen G, Ward WW, Prasher DC (1994) Green Fluorescent Protein
as a Marker for Gene-Expression. Science 263:802
Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni
G (2007) Mint: The Molecular Interaction Database. Nucleic Acids Res 35:D572
Chen LF, Vitkup D (2006) Predicting Genes for Orphan Metabolic Activities Using
Phylogenetic Profiles. Genome Biology 7:R17
Chen XW, Jeong JC, Dermyer P (2011) Kups: Constructing Datasets of Interacting and
Non-Interacting Protein Pairs with Associated Attributions. Nucleic Acids Res
39:D750
Coin F, Marinoni JC, Rodolfo C, Fribourg S, Pedrini AM, Egly JM (1998) Mutations in the
XPD Helicase Gene Result in XP and TTD Phenotypes, Preventing Interaction
between XPD and the p44 Subunit of TFIIH. Nature Genetics 20:184
Cokus S, Mizutani S, Pellegrini M (2007) An Improved Method for Identifying
Functionally Linked Proteins Using Phylogenetic Profiles. BMC Bioinformatics
8:S7
Coulson RMR, Ouzounis CA (2003) The Phylogenetic Diversity of Eukaryotic
Transcription. Nucleic Acids Res 31:653
Cranston KA, Hurwitz B, Ware D, Stein L, Wing RA (2009) Species Trees from Highly
Incongruent Gene Trees in Rice. Systematic Biology 58:489
Cranston KA, Rannala B (2007) Summarizing a Posterior Distribution of Trees Using
Agreement Subtrees. Systematic Biology 56:578
Crick FH, Barnett L, Brenner S, Watts-Tobin RJ (1961) General Nature of the Genetic
Code for Proteins. Nature 192:1227
Cunningham FX, Lafond TP, Gantt E (2000) Evidence of a Role for Lytb in the
Nonmevalonate Pathway of Isoprenoid Biosynthesis. Journal of Bacteriology
182:5841
Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M (2004) The
Ensembl Automatic Gene Annotation System. Genome Res 14:942
Dandekar T, Snel B, Huynen M, Bork P (1998) Conservation of Gene Order: A
Fingerprint of Proteins That Physically Interact. Trends Biochem Sci 23:324
Davey R, Savva G, Dicks J, Roberts IN (2007) Mpp: A Microarray-to-Phylogeny Pipeline
for Analysis of Gene and Marker Content Datasets. Bioinformatics 23:1023
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A Model of Evolutionary Change in
Proteins. Atlas of Protein Sequence and Structure 5:345
De Bodt S, Proost S, Vandepoele K, Rouze P, Van de Peer Y (2009) Predicting Protein-Protein
Interactions in Arabidopsis Thaliana through Integration of Orthology,
Gene Ontology and Co-Expression. BMC Genomics 10:288
De Las Rivas J, Fontanillo C (2010) Protein-Protein Interactions Essentials: Key
Concepts to Building and Analyzing Interactome Networks. PLoS Comput Biol
6:e1000807
Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S,
Lefort V, Lescot M, Claverie JM, Gascuel O (2008) Phylogeny.Fr: Robust
Phylogenetic Analysis for the Non-Specialist. Nucleic Acids Research 36:W465
Dowsey AW, Dunn MJ, Yang GZ (2003) The Role of Bioinformatics in Two-Dimensional
Gel Electrophoresis. Proteomics 3:1567
Durbin R (1998) Biological Sequence Analysis : Probabilistic Models of Proteins and
Nucleic Acids. Cambridge University Press, Cambridge New York
Eddy SR (1998) Profile Hidden Markov Models. Bioinformatics 14:755
Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z (2009) Gorilla: A Tool for Discovery
and Visualization of Enriched Go Terms in Ranked Gene Lists. BMC
Edgar RC (2004) Muscle: Multiple Sequence Alignment with High Accuracy and High
Throughput. Nucleic Acids Res 32:1792
Edgar RC, Batzoglou S (2006) Multiple Sequence Alignment. Curr Opin Struct Biol 16:368
Edgell DR, Belfort M, Shub DA (2000) Barriers to Intron Promiscuity in Bacteria. J
Bacteriol 182:5281
Edwards AWF (1992) Likelihood. Johns Hopkins University Press, Baltimore ; London
Elias I, Tuller T (2007) Reconstruction of Ancestral Genomic Sequences Using
Likelihood. Journal of Computational Biology 14:216
Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999) Protein Interaction Maps for
Complete Genomes Based on Gene Fusion Events. Nature 402:86
Farrar M (2007) Striped Smith-Waterman Speeds Database Searches Six Times over
Other Simd Implementations. Bioinformatics 23:156
Farris JS (1977) Phylogenetic Analysis under Dollo's Law. Systematic Zoology 26:77
Farris JS (1978) Inferring Phylogenetic Trees from Chromosome Inversion Data.
Systematic Zoology 27:275
Felsenstein J (1973) Maximum Likelihood and Minimum-Steps Methods for Estimating
Evolutionary Trees from Data on Discrete Characters. Systematic Zoology 22:240
Felsenstein J (1978) Cases in Which Parsimony or Compatibility Methods Will Be
Positively Misleading. Syst Zool 27:401
Felsenstein J (1979) Alternative Methods of Phylogenetic Inference and Their
Interrelationship. Systematic Zoology 28:49
Felsenstein J (1985a) Confidence-Limits on Phylogenies - an Approach Using the
Bootstrap. Evolution 39:783
Felsenstein J (1985b) Phylogenies and the Comparative Method. The American Naturalist
125:1
Felsenstein J (1989) Phylip - Phylogeny Inference Package (Version 3.2). Cladistics 5:164
Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Sunderland, Mass.
Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J, Min Jou W,
Molemans F, Raeymaekers A, Van den Berghe A, Volckaert G, Ysebaert M (1976)
Complete Nucleotide Sequence of Bacteriophage MS2 RNA: Primary and
Secondary Structure of the Replicase Gene. Nature 260:500
Fitch WM (1970) Distinguishing Homologous from Analogous Proteins. Syst Zool 19:99
Fitch WM (1971) Toward Defining Course of Evolution - Minimum Change for a
Specific Tree Topology. Syst Zool 20:406
Fitch WM (2000) Homology a Personal View on Some of the Problems. Trends Genet
16:227
Fitzpatrick DA, Logue ME, Stajich JE, Butler G (2006) A Fungal Phylogeny Based on 42
Complete Genomes Derived from Supertree and Combined Gene Analysis. BMC
Evol Biol 6:99
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W (1998) A Computer Program for
Aligning a cDNA Sequence with a Genomic DNA Sequence. Genome Res 8:967
Fu N, Drinnenberg I, Kelso J, Wu JR, Paabo S, Zeng R, Khaitovich P (2007) Comparison of
Protein and mRNA Expression Evolution in Humans and Chimpanzees. PLoS One
2:e216
Garman KS, Acharya CR, Edelman E, Grade M, Gaedcke J, Sud S, Barry W, Diehl AM,
Provenzale D, Ginsburg GS, Ghadimi BM, Ried T, Nevins JR, Mukherjee S, Hsu D,
Potti A (2009) A Genomic Approach to Colon Cancer Risk Stratification Yields
Biologic Insights into Therapeutic Opportunities (Vol 105, 19432, 2008).
Proceedings of the National Academy of Sciences of the United States of America
106:6878
Garrett S, Barton WA, Knights R, Jin P, Morgan DO, Fisher RP (2001) Reciprocal
Activation by Cyclin-Dependent Kinases 2 and 7 Is Directed by Substrate
Specificity Determinants Outside the T Loop. Molecular and Cellular Biology
21:88
Gaschen B, Taylor J, Yusim K, Foley B, Gao F, Lang D, Novitsky V, Haynes B, Hahn BH,
Bhattacharya T, Korber B (2002) AIDS - Diversity Considerations in HIV-1 Vaccine
Selection. Science 296:2354
Ge XJ, Yamamoto S, Tsutsumi S, Midorikawa Y, Ihara S, Wang SM, Aburatani H (2005)
Interpreting Expression Profiles of Cancers by Genome-Wide Survey of Breadth
of Expression in Normal Tissues. Genomics 86:127
Gillis B, Gavin IM, Arbieva Z, King ST, Jayaraman S, Prabhakar BS (2007) Identification
of Human Cell Responses to Benzene and Benzene Metabolites. Genomics 90:324
Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel
JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin
H, Oliver SG (1996) Life with 6000 Genes. Science 274:546
Goldman N, Anderson JP, Rodrigo AG (2000) Likelihood-Based Tests of Topologies in
Phylogenetics. Systematic Biology 49:652
Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Kallersjo M, Farris JS (2009)
Phylogenetic Analysis of 73 060 Taxa Corroborates Major Eukaryotic Groups.
Cladistics 25:211
Gonzalez O, Zimmer R (2008) Assigning Functional Linkages to Proteins Using
Phylogenetic Profiles and Continuous Phenotypes. Bioinformatics 24:1257
Grafen A (1989) The Phylogenetic Regression. Philosophical Transactions of the Royal
Society of London Series B-Biological Sciences 326:119
Graur D, Shuali Y, Li WH (1989) Deletions in Processed Pseudogenes Accumulate Faster
in Rodents Than in Humans. Journal of Molecular Evolution 28:279
Griffiths AJF (2002) Modern Genetic Analysis : Integrating Genes and Genomes. W.H.
Freeman and Co., New York
Guindon S, Gascuel O (2003) A Simple, Fast, and Accurate Algorithm to Estimate Large
Phylogenies by Maximum Likelihood. Syst Biol 52:696
Gygi SP, Rochon Y, Franza BR, Aebersold R (1999) Correlation between Protein and
mRNA Abundance in Yeast. Molecular and Cellular Biology 19:1720
Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL (2007) All Duplicates Are
Not Equal: The Difference between Small-Scale and Genome Duplication.
Genome Biology 8:R209
Hamming RW (1950) Error Detecting and Error Correcting Codes. Bell System
Technical Journal 29:147
Han JDJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM,
Cusick ME, Roth FP, Vidal M (2004) Evidence for Dynamically Organized
Modularity in the Yeast Protein-Protein Interaction Network. Nature 430:88
Harrison CJ, Langdale JA (2006) A Step by Step Guide to Phylogeny Reconstruction.
Plant J 45:561
Hart GT, Ramani AK, Marcotte EM (2006) How Complete Are Current Yeast and
Human Protein-Interaction Networks? Genome Biol 7:120
Harvey PH, Pagel MD (1991) The Comparative Method in Evolutionary Biology.
Oxford University Press, Oxford
Hasegawa H, Holm L (2009) Advances and Pitfalls of Protein Structural Alignment.
Curr Opin Struct Biol 19:341
Hasegawa M, Kishino H (1989) Confidence-Limits on the Maximum-Likelihood Estimate
of the Hominoid Tree from Mitochondrial-DNA Sequences. Evolution 43:672
Haw R, Hermjakob H, D'Eustachio P, Stein L (2011) Reactome Pathway Analysis to
Enrich Biological Discovery in Proteomics Datasets. Proteomics 11:3598
Hayton K, Su XZ (2004) Genetic and Biochemical Aspects of Drug Resistance in Malaria
Parasites. Curr Drug Targets Infect Disord 4:1
He HY, Soncin F, Grammatikakis N, Li YL, Siganou A, Gong JL, Brown SA, Kingston RE,
Calderwood SK (2003) Elevated Expression of Heat Shock Factor (Hsf) 2a
Stimulates Hsf1-Induced Transcription During Stress. Journal of Biological
Chemistry 278:35465
Heath TA, Hedtke SM, Hillis DM (2008) Taxon Sampling and the Accuracy of
Phylogenetic Analyses. Journal of Systematics and Evolution 46:239
Henikoff S, Henikoff JG (1992) Amino Acid Substitution Matrices from Protein Blocks.
Hershey AD, Chase M (1952) Independent Functions of Viral Protein and Nucleic Acid
in Growth of Bacteriophage. J Gen Physiol 36:39
Hert DG, Fredlake CP, Barron AE (2008) Advantages and Limitations of Next-Generation
Sequencing Technologies: A Comparison of Electrophoresis and Non-Electrophoresis
Methods. Electrophoresis 29:4618
Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring Expression Data: Identification and
Analysis of Coexpressed Genes. Genome Research 9:1106
Higgins DG, Sharp PM (1988) Clustal: A Package for Performing Multiple Sequence
Alignment on a Microcomputer. Gene 73:237
Hill J, Hambley M, Forster T, Mewissen M, Sloan TM, Scharinger F, Trew A, Ghazal P
(2008) Sprint: A New Parallel Framework for R. BMC Bioinformatics 9
Hobolth A, Christensen OF, Mailund T, Schierup MH (2007) Genomic Relationships and
Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a
Coalescent Hidden Markov Model. PLoS Genet 3:e7
Hodges A, Strand AD, Aragaki AK, Kuhn A, Sengstag T, Hughes G, Elliston LA, Hartog C,
Goldstein DR, Thu D, Hollingsworth ZR, Collin F, Synek B, Holmans PA, Young
AB, Wexler NS, Delorenzi M, Kooperberg C, Augood SJ, Faull RL, Olson JM, Jones
L, Luthi-Carter R (2006) Regional and Cellular Gene Expression Changes in
Human Huntington's Disease Brain. Hum Mol Genet 15:965
Holder M, Lewis PO (2003) Phylogeny Estimation: Traditional and Bayesian
Approaches. Nature Reviews Genetics 4:275
Hruby VJ (1997) Prospects for Peptidomimetic Drug Design. Drug Discovery Today 2:165
Huai Q, Kim HY, Liu YD, Zhao YD, Mondragon A, Liu JO, Ke HM (2002) Crystal
Structure of Calcineurin-Cyclophilin-Cyclosporin Shows Common but Distinct
Recognition of Immunophilin-Drug Complexes. Proceedings of the National
Academy of Sciences of the United States of America 99:12037
Huelsenbeck JP, Bollback JP (2001) Empirical and Hierarchical Bayesian Estimation of
Ancestral States. Systematic Biology 50:351
Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP (2001) Bayesian Inference of
Phylogeny and Its Impact on Evolutionary Biology. Science 294:2310
Hughes AL, Friedman R (2005) Poxvirus Genome Evolution by Gene Gain and Loss.
Molecular Phylogenetics and Evolution 35:186
Hulsen T, Huynen MA, de Vlieg J, Groenen PM (2006) Benchmarking Ortholog
Identification Methods Using Functional Genomics Data. Genome Biol 7:R31
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol
2:E206
Ispolatov I, Yuryev A, Mazo I, Maslov S (2005) Binding Properties and Evolution of
Homodimers in Protein-Protein Interaction Networks. Nucleic Acids Res 33:3629
Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M,
Greenblatt JF, Gerstein M (2003) A Bayesian Networks Approach for Predicting
Protein-Protein Interactions from Genomic Data. Science 302:449
Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A,
Simonovic M, Bork P, von Mering C (2009) String 8-a Global View on Proteins
and Their Functional Interactions in 630 Organisms. Nucleic Acids Research
37:D412
Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and Centrality in Protein
Networks. Nature 411:41
Jessop CE, Chakravarthi S, Garbi N, Hammerling GJ, Lovell S, Bulleid NJ (2007) Erp57 Is
Essential for Efficient Folding of Glycoproteins Sharing Common Structural
Domains. EMBO J 26:28
Jim K, Parmar K, Singh M, Tavazoie S (2004) A Cross-Genomic Approach for Systematic
Mapping of Phenotypic Traits to Genes. Genome Research 14:109
Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt
EE, Stoughton R, Shoemaker DD (2003) Genome-Wide Survey of Human
Alternative Pre-mRNA Splicing with Exon Junction Microarrays. Science
302:2141
Kanehisa M (1997) A Database for Post-Genome Analysis. Trends Genet 13:375
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T,
Araki M, Hirakawa M (2006) From Genomics to Chemical Genomics: New
Developments in Kegg. Nucleic Acids Res 34:D354
Karlin S, Altschul SF (1990) Methods for Assessing the Statistical Significance of
Molecular Sequence Features by Using General Scoring Schemes. Proc Natl Acad
Sci U S A 87:2264
Katoh K, Misawa K, Kuma K, Miyata T (2002) Mafft: A Novel Method for Rapid
Multiple Sequence Alignment Based on Fast Fourier Transform. Nucleic Acids
Res 30:3059
Kawaji H, Hayashizaki Y (2008) Genome Annotation. Methods Mol Biol 452:125
Keane TM, Creevey CJ, Pentony MM, Naughton TJ, McInerney JO (2006) Assessment of
Methods for Amino Acid Matrix Selection and Their Use on Empirical Data
Shows That Ad Hoc Assumptions for Choice of Matrix Are Not Justified. BMC
Evolutionary Biology 6
Kensche PR, van Noort V, Dutilh BE, Huynen MA (2008) Practical and Theoretical
Advances in Predicting the Function of a Protein by Its Phylogenetic
Distribution. J R Soc Interface 5:151
Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R (2004) The
International Protein Index: An Integrated Database for Proteomics
Experiments. Proteomics 4:1985
Kim IY, Shin JH, Seong JK (2010) Mouse Phenogenomics, Toolbox for Functional
Annotation of Human Genome. BMB Rep 43:79
Kim MJ, Romero R, Kim CJ, Tarca AL, Chhauy S, LaJeunesse C, Lee DC, Draghici S,
Gotsch F, Kusanovic JP, Hassan SS, Kim JS (2009) Villitis of Unknown Etiology Is
Associated with a Distinct Pattern of Chemokine up-Regulation in the Feto-Maternal
and Placental Compartments: Implications for Conjoint Maternal
Allograft Rejection and Maternal Anti-Fetal Graft-Versus-Host Disease. Journal
of Immunology 182:3919
Knight RD, Landweber LF, Yarus M (2001) How Mitochondria Redefine the Code. J Mol
Evol 53:299
Knowles DG, McLysaght A (2009) Recent De Novo Origin of Human Protein-Coding
Genes. Genome Res 19:1752
Korf I (2004) Gene Finding in Novel Genomes. BMC Bioinformatics 5:59
Koshi JM, Goldstein RA (1996) Probabilistic Reconstruction of Ancestral Protein
Sequences. Journal of Molecular Evolution 42:313
Krane DE, Raymer ML (2003) Fundamental Concepts of Bioinformatics. Pearson
Education International, San Francisco
Krylov DM, Wolf YI, Rogozin IB, Koonin EV (2003) Gene Loss, Protein Sequence
Divergence, Gene Dispensability, Expression Level, and Interactivity Are
Correlated in Eukaryotic Evolution. Genome Research 13:2229
Kuhner MK, Felsenstein J (1994) Simulation Comparison of Phylogeny Algorithms under
Equal and Unequal Evolutionary Rates. Mol Biol Evol 11:459
Kummel D, Oeckinghaus A, Wang C, Krappmann D, Heinemann U (2008) Distinct
Isocomplexes of the Trapp Trafficking Factor Coexist inside Human Cells. FEBS
Lett 582:3729
Lande J, Gimino V, Berryman T, Hertz MI, King RA (2003) Gene Expression Profiling of
Bronchoalveolar Lavage Cells in Acute Lung Rejection. American Journal of
Human Genetics 73:421
Le SQ, Gascuel O (2008) An Improved General Amino Acid Replacement Matrix. Mol
Biol Evol 25:1307
Lei PW, Koehly LM (2003) Linear Discriminant Analysis Versus Logistic Regression: A
Comparison of Classification Errors in the Two-Group Case. Journal of
Experimental Education 72:25
Lequesne WJ (1974) Uniquely Evolved Character Concept and Its Cladistic Application.
Systematic Zoology 23:513
Lescure FX, Le Loup G, Freilij H, Develoux M, Paris L, Brutus L, Pialoux G (2010) Chagas
Disease: Changes in Knowledge and Management. Lancet Infectious Diseases
10:556
Levesque M, Shasha D, Kim W, Surette MG, Benfey PN (2003) Trait-to-Gene: A
Computational Method for Predicting the Function of Uncharacterized Genes.
Current Biology 13:129
Lewinski MK, Bisgrove D, Shinn P, Chen H, Hoffmann C, Hannenhalli S, Verdin E, Berry
CC, Ecker JR, Bushman FD (2005) Genome-Wide Analysis of Chromosomal
Features Repressing Human Immunodeficiency Virus Transcription. Journal of
Virology 79:6610
Li L, Stoeckert CJ, Jr., Roos DS (2003) Orthomcl: Identification of Ortholog Groups for
Eukaryotic Genomes. Genome Res 13:2178
Li M, Wang JX, Chen J (2008) A Fast Agglomerate Algorithm for Mining Functional
Modules in Protein Interaction Networks. Bmei 2008: Proceedings of the
International Conference on Biomedical Engineering and Informatics, Vol 1:3
Linial M (2003) How Incorrect Annotations Evolve - the Case of Short Orfs. Trends in
Biotechnology 21:298
Lunter G, Ponting CP, Hein J (2006) Genome-Wide Identification of Human Functional
DNA Using a Neutral Indel Model. PLoS Comput Biol 2:2
Macagno A, Molteni M, Rinaldi A, Bertoni F, Lanzavecchia A, Rossetti C, Sallusto F (2006)
A Cyanobacterial Lps Antagonist Prevents Endotoxin Shock and Blocks
Sustained Tlr4 Stimulation Required for Cytokine Expression. Journal of
Experimental Medicine 203:1481
Maddison WP (1990) A Method for Testing the Correlated Evolution of Two Binary
Characters - Are Gains or Losses Concentrated on Certain Branches of a
Phylogenetic Tree. Evolution 44:539
Maddison WP, Maddison DR (2010) Mesquite: A Modular System for Evolutionary
Analysis. Version 2.73
Maere S, Heymans K, Kuiper M (2005) Bingo: A Cytoscape Plugin to Assess
Overrepresentation of Gene Ontology Categories in Biological Networks.
Malcolm BA, Wilson KP, Matthews BW, Kirsch JF, Wilson AC (1990) Ancestral
Lysozymes Reconstructed, Neutrality Tested, and Thermostability Linked to
Hydrocarbon Packing. Nature 345:86
Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D (1999) Detecting
Protein Function and Protein-Protein Interactions from Genome Sequences.
Science 285:751
Mardis ER (2008) The Impact of Next-Generation Sequencing Technology on Genetics.
Trends Genet 24:133
Martin DM, Berriman M, Barton GJ (2004) Gotcha: A New Method for Prediction of
Protein Function Assessed by the Annotation of Seven Genomes. BMC
Maston GA, Evans SK, Green MR (2006) Transcriptional Regulatory Elements in the
Human Genome. Annual Review of Genomics and Human Genetics 7:29
Maxam AM, Gilbert W (1977) New Method for Sequencing DNA. Proc Natl Acad Sci U S
A 74:560
McDowall MD, Scott MS, Barton GJ (2009) Pips: Human Protein-Protein Interaction
Prediction Database. Nucleic Acids Res 37:D651
McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu YT, Tsung EF, Clouser CR,
Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC,
Sokolsky TD, Zhang L, Sheridan A, Fu HN, Hendrickson CL, Li B, Kotler L, Stuart
JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC,
Lyons MR, Beaudoin RE, Coleman BE, Laptewicz MW, Sannicandro AE, Rhodes
MD, Gottimukkala RK, Yang S, Bafna V, Bashir A, MacBride A, Alkan C, Kidd JM,
Eichler EE, Reese MG, De la Vega FM, Blanchard AP (2009) Sequence and
Structural Variation in a Human Genome Uncovered by Short-Read, Massively
Parallel Ligation Sequencing Using Two-Base Encoding. Genome Research
19:1527
McLysaght A, Baldi PF, Gaut BS (2003) Extensive Gene Gain Associated with Adaptive
Evolution of Poxviruses. Proceedings of the National Academy of Sciences of the
United States of America 100:15655
Messier W, Stewart CB (1997) Episodic Adaptive Evolution of Primate Lysozymes.
Nature 385:151
Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K,
Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran
S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S,
Chandrika KN, Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia
H, Rekha B, Nayak R, Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R,
Deepthi P, Mohan SS, Gandhi TK, Harsha HC, Deshpande KS, Sarker M, Prasad TS,
Pandey A (2006) Human Protein Reference Database--2006 Update. Nucleic
Acids Res 34:D411
Mohamed TP, Carbonell JG, Ganapathiraju MK (2010) Active Learning for Human
Protein-Protein Interaction Prediction. BMC Bioinformatics 11 Suppl 1:S57
Monsalve M, Wu ZD, Adelmant G, Puigserver P, Fan ML, Spiegelman BM (2000) Direct
Coupling of Transcription and mRNA Processing through the Thermogenic
Coactivator Pgc-1. Molecular Cell 6:307
Moore KJ (1999) Utilization of Mouse Models in the Discovery of Human Disease Genes.
Drug Discov Today 4:123
Morgenstern B, Frech K, Dress A, Werner T (1998) Dialign: Finding Local Similarities by
Multiple Sequence Alignment. Bioinformatics 14:290
Mount DW (2004) Bioinformatics : Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, N.Y.
Needleman SB, Wunsch CD (1970) A General Method Applicable to the Search for
Similarities in the Amino Acid Sequence of Two Proteins. J Mol Biol 48:443
Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press
Nooren IMA, Thornton JM (2003) Structural Characterisation and Functional
Significance of Transient Protein-Protein Interactions. Journal of Molecular
Biology 325:991
Nuin PA, Wang Z, Tillier ER (2006) The Accuracy of Several Multiple Sequence
Alignment Programs for Proteins. BMC Bioinformatics 7:471
Nye TMW, Lio P, Gilks WR (2006) A Novel Algorithm and Web-Based Tool for
Comparing Two Alternative Phylogenetic Trees. Bioinformatics 22:117
O'Donnell RK, Kupferman M, Wei SJ, Singhal S, Weber R, O'Malley B, Cheng Y, Putt M,
Feldman M, Ziober B, Muschel RJ (2005) Gene Expression Signature Predicts
Lymphatic Metastasis in Squamous Cell Carcinoma of the Oral Cavity.
Oncogene 24:1244
Ohta S, Shiomi Y, Sugimoto K, Obuse C, Tsurimoto T (2002) A Proteomics Approach to
Identify Proliferating Cell Nuclear Antigen (PCNA)-Binding Proteins in Human
Cell Lysates - Identification of the Human Chl12/Rfcs2-5 Complex as a Novel
PCNA-Binding Protein. Journal of Biological Chemistry 277:40362
Ooi SL, Pan X, Peyser BD, Ye P, Meluh PB, Yuan DS, Irizarry RA, Bader JS, Spencer FA,
Boeke JD (2006) Global Synthetic-Lethality Analysis and Yeast Functional
Profiling. Trends Genet 22:56
Orengo C, Jones D, Thornton JM (2003) Bioinformatics : Genes, Proteins, and
Computers. BIOS Scientific ; Distributed in the U.S. by Springer-Verlag, Oxford
New York
Orlowski J, Kaczanowski S, Zielenkiewicz P (2007) Overrepresentation of Interactions
between Homologous Proteins in Interactomes. Febs Letters 581:52
Page RDM, Holmes EC (1998) Molecular Evolution : A Phylogenetic Approach.
Blackwell Science, Oxford
Pagel M (1994) Detecting Correlated Evolution on Phylogenies - a General-Method for
the Comparative-Analysis of Discrete Characters. Proceedings of the Royal
Society of London Series B-Biological Sciences 255:37
Pagel M (1997) Inferring Evolutionary Processes from Phylogenies. Zoologica Scripta
26:331
Pagel M, Meade A, Barker D (2004a) Bayesian Estimation of Ancestral Character States
on Phylogenies. Syst Biol 53:673
Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C,
Mark P, Stumpflen V, Mewes HW, Ruepp A, Frishman D (2005) The Mips
Mammalian Protein-Protein Interaction Database. Bioinformatics 21:832
Pagel P, Wong P, Frishman D (2004b) A Domain Interaction Map Based on Phylogenetic
Profiling. J Mol Biol 344:1331
Parfrey LW, Barbero E, Lasser E, Dunthorn M, Bhattacharya D, Patterson DJ, Katz LA
(2006) Evaluating Support for the Current Classification of Eukaryotic
Diversity. PLoS Genet 2:e220
Parida L (2008) Pattern Discovery in Bioinformatics : Theory & Algorithms. Chapman &
Hall/CRC, London
Pazos F, Ranea JAG, Juan D, Sternberg MJE (2005) Assessing Protein Co-Evolution in the
Context of the Tree of Life Assists in the Prediction of the Interactome. Journal of
Molecular Biology 352:1002
Pazos F, Valencia A (2001) Similarity of Phylogenetic Trees as Indicator of Protein-Protein
Interaction. Protein Eng 14:609
Pearson WR, Lipman DJ (1988) Improved Tools for Biological Sequence Comparison.
Pellegrini M (2001) Computational Methods for Protein Function Analysis. Curr Opin
Chem Biol 5:46
Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning
Protein Functions by Comparative Genome Analysis: Protein Phylogenetic
Profiles. Proc Natl Acad Sci U S A 96:4285
Pesole G (2008) What Is a Gene? An Updated Operational Definition. Gene 417:1
Picardi E, Pesole G (2010) Computational Methods for Ab Initio and Comparative Gene
Finding. Methods Mol Biol 609:269
Pickett KM, Randle CP (2005) Strange Bayes Indeed: Uniform Topological Priors Imply
Non-Uniform Clade Priors. Molecular Phylogenetics and Evolution 34:203
Pinney JW, Shirley MW, McConkey GA, Westhead DR (2005) Metashark: Software for
Automated Metabolic Network Prediction from DNA Sequence and Its
Application to the Genomes of Plasmodium Falciparum and Eimeria Tenella.
Nucleic Acids Research 33:1399
Posada D, Buckley TR (2004) Model Selection and Model Averaging in Phylogenetics:
Advantages of Akaike Information Criterion and Bayesian Approaches over
Likelihood Ratio Tests. Syst Biol 53:793
Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R,
Clamp M (2004) The Ensembl Analysis Pipeline. Genome Res 14:934
Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla
D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S,
Somanathan DS, Sebastian A, Rani S, Ray S, Kishore CJH, Kanth S, Ahmed M,
Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S,
Ranganathan P, Ramabadran S, Chaerkady R, Pandey A (2009) Human Protein
Reference Database-2009 Update. Nucleic Acids Research 37:D767
Pressman R (2001) Software Engineering: A Practitioner's Approach. McGraw-Hill
Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): A Curated
Non-Redundant Sequence Database of Genomes, Transcripts and Proteins.
Nucleic Acids Res 33:D501
Qi YJ, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of Different Biological Data
and Computational Classification Methods for Use in Protein Interaction
Prediction. Proteins-Structure Function and Bioinformatics 63:490
Quackenbush J (2002) Microarray Data Normalization and Transformation. Nat Genet
32 Suppl:496
Raab JR, Kamakaka RT (2010) Opinion Insulators and Promoters: Closer Than We
Think. Nature Reviews Genetics 11:439
Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Defining and Identifying
Communities in Networks. Proceedings of the National Academy of Sciences of the
United States of America 101:2658
Radom-Aizik S, Hayek S, Shahar I, Rechavi G, Kaminski N, Ben-Dov I (2005) Effects of
Aerobic Training on Gene Expression in Skeletal Muscle of Elderly Men.
Medicine and Science in Sports and Exercise 37:1680
Ralston KS, Kabututu ZP, Melehani JH, Oberholzer M, Hill KL (2009) The Trypanosoma
Brucei Flagellum: Moving Parasites in New Directions. Annual Review of
Microbiology 63:335
Ramachandran N, Hainsworth E, Bhullar B, Eisenstein S, Rosen B, Lau AY, Walter JC,
LaBaer J (2004) Self-Assembling Protein Microarrays. Science 305:86
Ramazzina I, Folli C, Secchi A, Berni R, Percudani R (2006) Completing the Uric Acid
Degradation Pathway through Phylogenetic Comparison of Whole Genomes.
Nature Chemical Biology 2:144
Ranea JA, Yeats C, Grant A, Orengo CA (2007) Predicting Protein Function with
Hierarchical Phylogenetic Profiles: The Gene3d Phylo-Tuner Method Applied to
Eukaryotic Genomes. PLoS Comput Biol 3:e237
R Development Core Team (2011) R: A language and environment for statistical
computing. R Foundation for Statistical Computing Vienna, Austria
Reghunathan R, Jayapal M, Hsu LY, Chng HH, Tai D, Leung BP, Melendez AJ (2005)
Expression Profile of Immune Response Genes in Patients with Severe Acute
Respiratory Syndrome. BMC Immunology 6
Remm M, Storm CE, Sonnhammer EL (2001) Automatic Clustering of Orthologs and In-Paralogs from Pairwise Species Comparisons. J Mol Biol 314:1041
Richmond TJ, Davey CA (2003) The Structure of DNA in the Nucleosome Core. Nature
423:145
Ridley M (1983) The Explanation of Organic Diversity : The Comparative Method and
Adaptations for Mating. Clarendon Press, Oxford
Robertson DL, Lovell SC (2009) Evolution in Protein Interaction Networks: Co-Evolution, Rewiring and the Role of Duplication. Biochem Soc Trans 37:768
Rodriguez-Ezpeleta N, Brinkmann H, Burey SC, Roure B, Burger G, Loffelhardt W, Bohnert
HJ, Philippe H, Lang BF (2005) Monophyly of Primary Photosynthetic
Eukaryotes: Green Plants, Red Algae, and Glaucophytes. Curr Biol 15:1325
Rodriguez-Ezpeleta N, Brinkmann H, Burger G, Roger AJ, Gray MW, Philippe H, Lang BF
(2007) Toward Resolving the Eukaryotic Tree: The Phylogenetic Positions of
Jakobids and Cercozoans. Current Biology 17:1420
Rokas A, Williams BL, King N, Carroll SB (2003) Genome-Scale Approaches to Resolving
Incongruence in Molecular Phylogenies. Nature 425:798
Russell SJ, Norvig P, Canny J (2003) Artificial Intelligence : A Modern Approach.
Prentice Hall, Upper Saddle River, N.J.
Salemi M, Vandamme A-M (2003) The Phylogenetic Handbook : A Practical Approach
to DNA and Protein Phylogeny. Cambridge University Press, Cambridge, U.K. ;
New York
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database
of Interacting Proteins: 2004 Update. Nucleic Acids Res 32:D449
Sanger F, Coulson AR, Friedmann T, Air GM, Barrell BG, Brown NL, Fiddes JC, Hutchison
CA, Slocombe PM, Smith M (1978) Nucleotide-Sequence of Bacteriophage-PhiX174. J Mol Biol 125:225
Sanger F, Nicklen S, Coulson AR (1977) DNA Sequencing with Chain-Terminating
Inhibitors. Proc Natl Acad Sci U S A 74:5463
Sankoff D (1975) Minimal Mutation Trees of Sequences. Siam Journal on Applied
Mathematics 28:35
Sasaoka T, Kobayashi M (2000) The Functional Significance of Shc in Insulin Signaling
as a Substrate of the Insulin Receptor. Endocrine Journal 47:373
Scott MS, Barton GJ (2007) Probabilistic Prediction and Ranking of Human Protein-Protein Interactions. BMC Bioinformatics 8:239
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B,
Ideker T (2003) Cytoscape: A Software Environment for Integrated Models of
Biomolecular Interaction Networks. Genome Research 13:2498
Shortle D, Ackerman MS (2001) Persistence of Native-Like Topology in a Denatured
Protein in 8 M Urea. Science 293:487
Siddall ME (1998) Success of Parsimony in the Four-Taxon Case: Long-Branch
Repulsion by Likelihood in the Farris Zone. Cladistics-the International Journal of
the Willi Hennig Society 14:209
Sillén-Tullberg B (1988) Evolution of Gregariousness in Aposematic Butterfly Larvae - a
Phylogenetic Analysis. Evolution 42:293
Simmons MP, Ochoterena H, Freudenstein JV (2002) Amino Acid Vs. Nucleotide
Characters: Challenging Preconceived Notions. Molecular Phylogenetics and
Evolution 24:78
Singh GP, Ganapathi M, Dash D (2007) Role of Intrinsic Disorder in Transient
Interactions of Hub Proteins. Proteins-Structure Function and Bioinformatics
66:761
Slater GS, Birney E (2005) Automated Generation of Heuristics for Biological Sequence
Comparison. Bmc Bioinformatics 6
Slonim N, Elemento O, Tavazoie S (2006) Ab Initio Genotype-Phenotype Association
Reveals Intrinsic Modularity in Genetic Networks. Molecular Systems Biology
Smith TF, Waterman MS (1981) Identification of Common Molecular Subsequences. J
Mol Biol 147:195
Sneath PHA, Sokal RR (1973) Numerical Taxonomy : The Principles and Practice of
Numerical Classification. W. H. Freeman, San Francisco
Snel B, Bork P, Huynen MA (1999) Genome Phylogeny Based on Gene Content. Nat
Genet 21:108
Sokal RR, Rohlf FJ (1995) Biometry : The Principles and Practice of Statistics in
Biological Research. W.H. Freeman, New York
Spira A, Beane J, Shah V, Liu G, Schembri F, Yang XM, Palma J, Brody JS (2004) Effects
of Cigarette Smoke on the Human Airway Epithelial Cell Transcriptome.
Proceedings of the National Academy of Sciences of the United States of America
101:10143
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) Biogrid: A
General Repository for Interaction Datasets. Nucleic Acids Research 34:D535
Steel M, Penny D (2000) Parsimony, Likelihood, and the Role of Models in Molecular
Phylogenetics. Mol Biol Evol 17:839
Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner
M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann
S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H,
Wanker EE (2005) A Human Protein-Protein Interaction Network: A Resource
for Annotating the Proteome. Cell 122:957
Stevens PF, Augier A (1983) Augustin Augier's "Arbre Botanique" (1801), a
Remarkable Early Botanical Representation of the Natural System. Taxon
32:203
Stewart CB, Schilling JW, Wilson AC (1987) Adaptive Evolution in the Stomach
Lysozymes of Foregut Fermenters. Nature 330:401
Strachan T, Read AP (2004) Human Molecular Genetics. Garland Press, New York
Stuart GW, Moffett K, Leader JJ (2002) A Comprehensive Vertebrate Phylogeny Using
Vector Representations of Protein Sequences from Whole Genomes. Mol Biol
Evol 19:554
Stumpf MP, Thorne T, de Silva E, Stewart R, An HJ, Lappe M, Wiuf C (2008) Estimating
the Size of the Human Interactome. Proceedings of the National Academy of
Sciences of the United States of America 105:6959
Sundquist A, Ronaghi M, Tang HX, Pevzner P, Batzoglou S (2007) Whole-Genome
Sequencing and Assembly with High-Throughput, Short-Read Technologies.
PLoS One 2
Swanson KW, Irwin DM, Wilson AC (1991) Stomach Lysozyme Gene of the Langur
Monkey - Tests for Convergence and Positive Selection. J Mol Evol 33:418
Swofford DL, Maddison WP (1987) Reconstructing Ancestral Character States under
Wagner Parsimony. Mathematical Biosciences 87:199
Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers JS (2001) Bias in
Phylogenetic Estimation and Its Relevance to the Choice between Parsimony and
Likelihood Methods. Systematic Biology 50:525
Takatsu H, Futatsumori M, Yoshino K, Yoshida Y, Shin HW, Nakayama K (2001) Similar
Subunit Interactions Contribute to Assembly of Clathrin Adaptor Complexes
and Copi Complex: Analysis Using Yeast Three-Hybrid System. Biochemical and
Biophysical Research Communications 284:1083
Talavera G, Castresana J (2007) Improvement of Phylogenies after Removing Divergent
and Ambiguously Aligned Blocks from Protein Sequence Alignments. Syst Biol
56:564
Tanaka R, Yi TM, Doyle J (2005) Some Protein Interaction Data Do Not Exhibit Power
Law Statistics. Febs Letters 579:5140
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM,
Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV,
Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The Cog Database: An Updated
Version Includes Eukaryotes. BMC Bioinformatics 4:41
Telford MJ (2004) Animal Phylogeny: Back to the Coelomata? Curr Biol 14:R274
Tian B, Nowak DE, Jamaluddin M, Wang SF, Brasier AR (2005) Identification of Direct
Genomic Targets Downstream of the Nuclear Factor-Kappa B Transcription
Factor Mediating Tumor Necrosis Factor Signaling. Journal of Biological
Chemistry 280:17435
Tierney EP, Tulac S, Huang STJ, Giudice LC (2003) Activation of the Protein Kinase a
Pathway in Human Endometrial Stromal Cells Reveals Sequential Categorical
Gene Regulation. Physiological Genomics 16:47
Townsend JP, Lopez-Giraldez F, Friedman R (2008) The Phylogenetic Informativeness of
Nucleotide and Amino Acid Sequences for Reconstructing the Vertebrate Tree. J
Mol Evol 67:437
Valadkhan S, Jaladat Y (2010) The Spliceosomal Proteome: At the Heart of the Largest
Cellular Ribonucleoprotein Machine. Proteomics 10:4128
Vanacova S, Liston DR, Tachezy J, Johnson PJ (2003) Molecular Biology of the
Amitochondriate Parasites, Giardia Intestinalis, Entamoeba Histolytica and
Trichomonas Vaginalis. International Journal for Parasitology 33:235
Vanharanta S, Pollard PJ, Lehtonen HJ, Laiho P, Sjoberg J, Leminen A, Aittomaki K, Arola
J, Kruhoffer M, Orntoft TF, Tomlinson IP, Kiuru M, Arango D, Aaltonen LA (2006)
Distinct Expression Profile in Fumarate-Hydratase-Deficient Uterine Fibroids.
Human Molecular Genetics 15:97
Velculescu VE, Zhang L, Zhou W, Polyak K, Basrai M, Bassett D, Hieter P, Vogelstein B,
Kinzler KW (1997) Serial Analysis of Gene Expression (Sage). American Journal
of Human Genetics 61:A36
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M,
Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman
JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas
PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick
VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos
R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S,
Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E,
Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R,
Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian
AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z,
Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina
N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg
S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R,
Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong
F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A,
Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I,
Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport
L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart
B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T,
Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D,
McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K,
Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH,
Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E,
Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M,
Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell
MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania
A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz
R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M,
Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M,
Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek
A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J,
Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu
X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T,
Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J,
Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M,
Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001) The Sequence of the Human
Genome. Science 291:1304
Vert JP (2002) A Tree Kernel to Analyse Phylogenetic Profiles. Bioinformatics 18 Suppl
1:S276
Vidalain PO, Boxem M, Ge H, Li S, Vidal M (2004) Increasing Specificity in High-Throughput Yeast Two-Hybrid Experiments. Methods 32:363
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E (2009) Ensemblcompara
Genetrees: Complete, Duplication-Aware Phylogenetic Trees in Vertebrates.
Genome Res 19:327
von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen
MA, Bork P (2005) String: Known and Predicted Protein-Protein Associations,
Integrated and Transferred across Organisms. Nucleic Acids Res 33:D433
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002)
Comparative Assessment of Large-Scale Data Sets of Protein-Protein
Interactions. Nature 417:399
von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, Ouzounis CA, Bork P
(2003) Genome Evolution Reveals Biochemical Networks and Functional
Modules. Proc Natl Acad Sci U S A 100:15428
Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal
M (2000) Protein Interaction Mapping in C. Elegans Using Proteins Involved in
Vulval Development. Science 287:116
Wall DP, Fraser HB, Hirsh AE (2003) Detecting Putative Orthologs. Bioinformatics
19:1710
Wang D, Hsieh M, Li WH (2005a) A General Tendency for Conservation of Protein
Length across Eukaryotic Kingdoms. Mol Biol Evol 22:142
Wang H, Xu Z, Gao L, Hao B (2009a) A Fungal Phylogeny Based on 82 Complete
Genomes Using the Composition Vector Method. BMC Evol Biol 9:195
Wang J, Xia Q, He X, Dai M, Ruan J, Chen J, Yu G, Yuan H, Hu Y, Li R, Feng T, Ye C, Lu
C, Wang J, Li S, Wong GK, Yang H, Wang J, Xiang Z, Zhou Z, Yu J (2005b)
Silkdb: A Knowledgebase for Silkworm Biology and Genomics. Nucleic Acids
Res 33:D399
Wang Z, Gerstein M, Snyder M (2009b) Rna-Seq: A Revolutionary Tool for
Transcriptomics. Nat Rev Genet 10:57
Watson JD, Crick FH (1953) Molecular Structure of Nucleic Acids; a Structure for
Deoxyribose Nucleic Acid. Nature 171:737
Watts DJ, Strogatz SH (1998) Collective Dynamics of 'Small-World' Networks. Nature
393:440
Wheeler TJ, Kececioglu JD (2007) Multiple Alignment by Aligning Alignments.
Bioinformatics 23:i559
Whelan S, Goldman N (2001) A General Empirical Model of Protein Evolution Derived
from Multiple Protein Families Using a Maximum-Likelihood Approach.
Molecular Biology and Evolution 18:691
Whitaker JW, McConkey GA, Westhead DR (2009) Prediction of Horizontal Gene
Transfers in Eukaryotes: Approaches and Challenges. Biochem Soc Trans 37:792
Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ (1997) Genome-Wide Expression
Monitoring in Saccharomyces Cerevisiae. Nat Biotechnol 15:1359
Woese CR, Kandler O, Wheelis ML (1990) Towards a Natural System of Organisms:
Proposal for the Domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U
S A 87:4576
Wolf YI, Rogozin IB, Koonin EV (2004) Coelomata and Not Ecdysozoa: Evidence from
Genome-Wide Phylogenetic Analysis. Genome Research 14:29
Wootton JC, Federhen S (1993) Statistics of Local Complexity in Amino-Acid-Sequences
and Sequence Databases. Computers & Chemistry 17:149
Wu G, Nie L, Zhang WW (2008) Integrative Analyses of Posttranscriptional Regulation
in the Yeast Saccharomyces Cerevisiae Using Transcriptomic and Proteomic
Data. Current Microbiology 57:18
Yakovchuk P, Protozanova E, Frank-Kamenetskii MD (2006) Base-Stacking and Base-Pairing Contributions into Thermal Stability of the DNA Double Helix (Vol 34,
Pg 564, 2006). Nucleic Acids Res 34:1082
Yang Z (2006) Computational Molecular Evolution. Oxford University Press, Oxford
Yang Z (2008) Computational Molecular Evolution. Oxford University Press, Oxford
Yang ZH (1994) Maximum-Likelihood Phylogenetic Estimation from DNA-Sequences
with Variable Rates over Sites - Approximate Methods. Journal of Molecular
Evolution 39:306
Yang ZH, Kumar S, Nei M (1995) A New Method of Inference of Ancestral Nucleotide
and Amino-Acid-Sequences. Genetics 141:1641
Yedavalli VSRK, Neuveut C, Chi YH, Kleiman L, Jeang KT (2004) Requirement of Ddx3
Dead Box Rna Helicase for Hiv-1 Rev-Rre Export Function. Cell 119:381
Yu HY, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa
T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot A, Vazquez A, Murray RR,
Simon C, Tardivo L, Tam S, Svrzikapa N, Fan CY, de Smet AS, Motyl A, Hudson
ME, Park J, Xin XF, Cusick ME, Moore T, Boone C, Snyder M, Roth FP, Barabasi
AL, Tavernier J, Hill DE, Vidal M (2008) High-Quality Binary Protein Interaction
Map of the Yeast Interactome Network. Science 322:104
Yu HY, Luscombe NM, Lu HX, Zhu XW, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M,
Gerstein M (2004a) Annotation Transfer between Genomes: Protein-Protein
Interologs and Protein-DNA Regulogs. Genome Research 14:1107
Yu HY, Luscombe NM, Lu HX, Zhu XW, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M,
Gerstein M (2004b) Annotation Transfer between Genomes: Protein-Protein
Interologs and Protein-DNA Regulogs. Genome Res 14:1107
Zhang JZ (2003) Evolution by Gene Duplication: An Update. Trends in Ecology &
Evolution 18:292
Zheng Q, Wang XJ (2008) Goeast: A Web-Based Software Toolkit for Gene Ontology
Enrichment Analysis. Nucleic Acids Research 36:W358
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R,
Bidlingmaier S, Houfek T, Mitchell T, Miller P, Dean RA, Gerstein M, Snyder M
(2001) Global Analysis of Protein Activities Using Proteome Chips. Science
293:2101
Zhu X, Gerstein M, Snyder M (2007) Getting Connected: Analysis and Principles of
Biological Networks. Genes Dev 21:1010
Appendix A
Appendix A: Description of the divergence of the Java implementation of the Inparanoid
algorithm from the Perl implementation.
In order to examine the differences in output between the novel Java implementation of the
Inparanoid algorithm and the version (2.0) distributed by Remm et al. (2001), the
following test was run.
Organism A: Saccharomyces cerevisiae.
Organism B: Encephalitozoon cuniculi.
The algorithm BLASTP (version 2.2.18) was run with the Fasta formatted file
containing all proteins for Saccharomyces cerevisiae as the query and the Fasta formatted file
containing all proteins for Encephalitozoon cuniculi as the database. The program formatdb
was used to format the database files for BLASTP.
The substitution matrix BLOSUM62 was used to score alignments. The converse
command was also run with Encephalitozoon cuniculi as the query and Saccharomyces
cerevisiae as the database. Each organism was also run against itself as query and
database. The parameters v and b were set to the number of proteins in the database file, and
the parameter z was set to a theoretical maximum database size, as chosen by Remm et al. (2001), to
maintain consistent values for relevant statistics such as K and λ, described below.
The exact syntax of the commands is given below:
blastall -i Saccharomyces_cerevisiae -d Saccharomyces_cerevisiae -p blastp -v 5883 -b 5883 -F "m S" -M BLOSUM62 -z 5000000 -V
blastall -i Saccharomyces_cerevisiae -d Encephalitozoon_cuniculi -p blastp -v 1996 -b 1996 -F "m S" -M BLOSUM62 -z 5000000 -V
blastall -i Encephalitozoon_cuniculi -d Saccharomyces_cerevisiae -p blastp -v 5883 -b 5883 -F "m S" -M BLOSUM62 -z 5000000 -V
blastall -i Encephalitozoon_cuniculi -d Encephalitozoon_cuniculi -p blastp -v 1996 -b 1996 -F "m S" -M BLOSUM62 -z 5000000 -V
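The four searches follow a regular all-against-all pattern, which can be sketched as a small Python helper that only builds the command strings and does not run them. The function names and the PROTEOME_SIZES table are hypothetical; the flags and values are copied from the commands above, with -v and -b taken as the number of proteins in the database file.

```python
from itertools import product

# Number of proteins per proteome, taken from the -v/-b values quoted above.
PROTEOME_SIZES = {
    "Saccharomyces_cerevisiae": 5883,
    "Encephalitozoon_cuniculi": 1996,
}


def blastall_command(query, database, sizes=PROTEOME_SIZES):
    """Build one blastall command line; -v and -b follow the database size."""
    n = sizes[database]
    return (
        f"blastall -i {query} -d {database} -p blastp "
        f'-v {n} -b {n} -F "m S" -M BLOSUM62 -z 5000000 -V'
    )


def all_vs_all_commands(sizes=PROTEOME_SIZES):
    """Return the commands for every ordered (query, database) pair."""
    return [blastall_command(q, d, sizes) for q, d in product(sizes, repeat=2)]


if __name__ == "__main__":
    for command in all_vs_all_commands():
        print(command)
```

Generating the commands from one table keeps the -v and -b values tied to the database rather than to the query, which is the detail most easily mistyped by hand.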
The output from these commands was fed to the Perl script blast_parser.pl provided in the
Inparanoid package, which produces formatted output in the following order:
Protein Id1.
Protein Id2.
Bit Score.
E value.
Protein A Length.
Protein B Length.
Alignment Length on query sequence.
Identity percentage.
Similarity percentage.
Coordinates of alignment on query sequence.
Blast bit scores are calculated using the formula
\[ S' = \frac{\lambda S - \ln K}{\ln 2} \tag{1} \]
where S' = bit score, S = raw score, K = constant associated with search space size and λ = constant
associated with the scoring system (Mount 2004).
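As a minimal illustration of the bit score formula (1), the conversion can be written as a one-line Python function. The function name and the parameter values in the example call are assumptions chosen for illustration; they are not values reported in this thesis.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # lam and k below are placeholder values (assumed), used only to show the call.
    print(bit_score(100, lam=0.267, k=0.041))
```

Because λ and K enter the formula directly, any change in these statistics between two searches changes the bit score assigned to the same raw alignment score.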

Blast bit scores can be affected by composition-based score adjustments, which were
introduced in order to deal with comparisons of proteins with highly biased amino acid
compositions (Altschul et al. 2005).
These adjustments, together with the change in search space size when query and
database are swapped, can lead to an asymmetry in bit scores
for the same pair of sequences. In order to deal with this artefact, both implementations
normalise scores by averaging the A-B and B-A orientations.
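The averaging step can be sketched in Python as follows. This is an illustrative sketch and not the code of either implementation: the function names and the dictionary layout are hypothetical, and dropping pairs that have a hit in only one direction is an assumption made here for simplicity.

```python
def mean_bit_score(score_ab, score_ba):
    """Average the A-B and B-A bit scores for one pair of sequences."""
    return (score_ab + score_ba) / 2.0


def symmetric_scores(directed_scores):
    """Map each unordered pair to its mean score.

    directed_scores: dict mapping (query_id, subject_id) -> bit score.
    Pairs with a hit in only one direction are skipped (assumption).
    """
    symmetric = {}
    for (a, b), score_ab in directed_scores.items():
        score_ba = directed_scores.get((b, a))
        if a != b and score_ba is not None:
            symmetric[frozenset((a, b))] = mean_bit_score(score_ab, score_ba)
    return symmetric


if __name__ == "__main__":
    # Bit scores 56.2 (A-B) and 55.5 (B-A), as in protein pair 1 of this appendix.
    scores = {("NP_015092", "NP_586462"): 56.2, ("NP_586462", "NP_015092"): 55.5}
    print(symmetric_scores(scores))
```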
Results were filtered to retain only hits consisting of a single high-scoring pair, as the Java
implementation was constructed to deal with SSEARCH output, which only returns a single
optimal local alignment.
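A filter of this kind might look like the following Python sketch. The tuple layout (query, subject, bit score), one tuple per high-scoring pair, is a hypothetical representation and not the actual format of the parsed BLAST output.

```python
from collections import Counter


def single_hsp_hits(hsps):
    """Keep only query-subject hits that consist of exactly one HSP.

    hsps: iterable of (query_id, subject_id, bit_score), one entry per HSP.
    """
    hsps = list(hsps)
    counts = Counter((query, subject) for query, subject, _ in hsps)
    return [hsp for hsp in hsps if counts[(hsp[0], hsp[1])] == 1]


if __name__ == "__main__":
    # Hypothetical identifiers: q1-s1 has two HSPs and is dropped, q1-s2 is kept.
    example = [("q1", "s1", 50.0), ("q1", "s1", 31.0), ("q1", "s2", 44.0)]
    print(single_hsp_hits(example))
```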
Reciprocal best hits as marked by Perl implementation but not by Java implementation.
Protein pair 1
Protein A: NP_015092 (Saccharomyces cerevisiae)
Protein B: NP_586462 (Encephalitozoon cuniculi)
A-B bit score = 56.2 (Rounded down to 56 by Perl). This is the best score in the A-B
direction.
B-A bit score = 55.5 (Rounded up to 56 by Perl).
Mean bit score = 55.85 (Rounded to 56 by Perl).
NP_586462 however has another significant score against Saccharomyces cerevisiae, which
is NP_013908 with a mean bit score of 56.2 (Rounded down to 56 by Perl).
The Java implementation does not recognise NP_015092 and NP_586462 as reciprocal best
hits as 56.2 > 55.85.
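The effect of integer rounding on the reciprocal best hit test can be reproduced with a short Python sketch using the scores of protein pair 1. Two assumptions are made and should be read as such: rounding is taken to be round-half-up, and a tie with a rival score is taken to leave the pair marked as a best hit. Neither is confirmed from the source code of the two implementations.

```python
import math


def round_half_up(score):
    """Round to the nearest integer, with .5 rounded up (assumed behaviour)."""
    return math.floor(score + 0.5)


def is_reciprocal_best(pair_score, rival_scores, integer_scores):
    """Return True if no rival score exceeds the score of the pair.

    With integer_scores=True all scores are rounded first, which mimics the
    rounded values reported for the Perl implementation.
    """
    if integer_scores:
        pair_score = round_half_up(pair_score)
        rival_scores = [round_half_up(r) for r in rival_scores]
    return all(pair_score >= rival for rival in rival_scores)


if __name__ == "__main__":
    # Protein pair 1: mean score 55.85, rival mean score 56.2.
    print(is_reciprocal_best(55.85, [56.2], integer_scores=True))
    print(is_reciprocal_best(55.85, [56.2], integer_scores=False))
```

With rounding both scores become 56 and the pair is retained; without rounding the rival score of 56.2 exceeds 55.85 and the pair is rejected, matching the divergence described above.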
Protein pair 2
Protein A: NP_014520 (Saccharomyces cerevisiae).
Protein B: NP_586468 (Encephalitozoon cuniculi).
A-B Bit score = 61.6 (Rounded up to 62 by Perl).
B-A Bit score = 61.6 (Rounded up to 62 by Perl).
Mean bit score = 61.6 (Rounded up to 62 by Perl).
NP_586468 has another significant score against Saccharomyces cerevisiae, which is
NP_014097 with a mean bit score of 62.0.
The Java implementation does not recognise NP_014520 and NP_586468 as reciprocal best
hits as 62.0 > 61.6.
Protein pair 3
A-B Bit score = 111.0
B-A Bit score = 112.0
Mean bit score = 111.5 (Rounded up to 112 by Perl).
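The Encephalitozoon cuniculi protein has another significant score against Saccharomyces
cerevisiae, which is NP_013546 with a mean bit score of 112.0.
The Java implementation does not recognise this pair as reciprocal best
hits as 112.0 > 111.5.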
Protein pair 4
Mean bit score= 60.5 (Rounded up to 61 by Perl).
NP_586039 has two other significant scores against Saccharomyces cerevisiae, which are
NP_010629 and NP_010630, which both have mean bit scores of 60.85.
The Java implementation does not recognise this pair as reciprocal best
hits as 60.85 > 60.5.
Protein pair 5
Protein A: NP_010407 (Saccharomyces cerevisiae).
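Mean bit score = 245.5 (Rounded up to 246 by Perl).
The Encephalitozoon cuniculi protein has another significant score against Saccharomyces
cerevisiae, NP_013197, which has a mean bit score of 246.
The Java implementation does not recognise this pair as reciprocal best
hits as 246 > 245.5.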
Differences in Cluster Output
A number of groups differ between the two implementations on this test data.
This is because the two implementations store scores at different precision (the Perl
implementation rounds bit scores to integers), which affects the criterion for
reciprocal best hits as well as the criteria for merging and deleting clusters. However, the primary
purpose of ortholog selection in this project, which is detection of the presence and absence of
proteins, is achieved, as the number of Saccharomyces cerevisiae proteins found to be present
in Encephalitozoon cuniculi was identical.
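Why a difference in grouping need not change presence and absence calls can be shown with a short Python sketch. It uses the merge of orthologous groups 2 and 3 reported later in this appendix. The function name is hypothetical, and treating NP_011424 and NP_012263 as the Saccharomyces cerevisiae members of those groups is an assumption based on the identifier pattern.

```python
def present_proteins(groups, species_ids):
    """Return the proteins of one species that occur in any ortholog group."""
    present = set()
    for group in groups:
        present |= set(group) & species_ids
    return present


# Assumed Saccharomyces cerevisiae members of orthologous groups 2 and 3.
yeast_ids = {"NP_011424", "NP_012263"}

# Perl output: two separate groups. Java output: the same proteins merged.
perl_groups = [{"NP_011424", "NP_586425"}, {"NP_012263", "NP_597203"}]
java_groups = [{"NP_011424", "NP_586425", "NP_012263", "NP_597203"}]

if __name__ == "__main__":
    print(present_proteins(perl_groups, yeast_ids) == present_proteins(java_groups, yeast_ids))
```

The two outputs contain a different number of groups, yet the set of Saccharomyces cerevisiae proteins called present is the same in both.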
Groups which differ between implementations.
There are 16 groups which differ between the two implementations. The Java
implementation produces 616 groups while the Perl implementation produces 619.
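A comparison of this kind can be sketched in Python by treating each group as an unordered set of protein identifiers. The function name is hypothetical. The first group in the example is orthologous group 4 as listed in this appendix, and the second group uses placeholder identifiers to stand for a group that is identical in both outputs.

```python
def differing_groups(groups_a, groups_b):
    """Return (groups only in a, groups only in b), ignoring member order."""
    set_a = {frozenset(group) for group in groups_a}
    set_b = {frozenset(group) for group in groups_b}
    return set_a - set_b, set_b - set_a


if __name__ == "__main__":
    shared = ["YEAST_PLACEHOLDER", "ECUN_PLACEHOLDER"]  # hypothetical identical group
    perl_groups = [
        ["NP_012610", "XP_955636", "NP_010056", "NP_009928", "NP_010504"],
        shared,
    ]
    java_groups = [
        ["NP_012610", "XP_955636", "NP_010056", "NP_010504"],
        shared,
    ]
    only_perl, only_java = differing_groups(perl_groups, java_groups)
    print(len(only_perl), len(only_java))
```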
Orthologous Group 1
Perl Inparanoid implementation
NP_009501
NP_586181
NP_014887
Java implementation clusters NP_009501 and NP_014887 with a separate protein
XP_955683.
Orthologous Group 2
NP_011424 NP_586425
Orthologous Group 3
NP_012263 NP_597203
Groups 2 and 3 are merged into one group by the Java implementation.
Orthologous Group 4
NP_012610.
XP_955636
NP_010056.
NP_009928.
NP_010504.
Group 4 does not contain NP_009928 in output from the Java implementation.
Orthologous group 5
NP_012710.
NP_597625
NP_014074.
Orthologous group 6
NP_014293.
NP_597270
NP_014752.
NP_012264.
Groups 5 and 6 are merged into one group by the Java implementation.
Orthologous group 7
NP_011573.
XP_965975
NP_011975.
NP_013418.
Orthologous group 7 has an additional paralog added in Saccharomyces cerevisiae by the
Java implementation NP_009928.
Orthologous group 8
NP_011651.
NP_597477

Orthologous group 8 has an additional paralog added in Saccharomyces cerevisiae by the
Java implementation NP_014604.
Orthologous Group 9
NP_013618.
NP_597286
NP_014604.
Java implementation NP_015007. This paralog replaces NP_014604.
Orthologous Group 10
NP_014045.
NP_586473
NP_015007.
Java implementation clusters NP_014045 and NP_015007 with a separate protein
NP_597286.
Orthologous Group 11
NP_012603.
NP_584802
NP_597429
Group 11 does not contain NP_597429 in output from the Java implementation.
NP_010144.
NP_597320
NP_010089.
NP_586125
NP_014737.
NP_009723.
NP_597558
NP_015274.
Orthologous groups 12 and 13 are merged into one group by the Java implementation.
NP_009800
NP_584705
NP_010629
NP_586039
NP_010630
NP_013182
NP_011960
NP_012316
NP_014486
NP_010632
NP_011962
NP_012321
NP_011964
NP_013724
NP_116644
NP_010845
NP_014470
NP_010036
NP_012692
NP_011411
NP_014081
NP_010087
NP_010143
NP_010825
NP_014538
NP_010785
NP_010675
NP_116613
NP_011805
NP_010034
NP_012694
NP_009857
NP_010082
Perl implementation NP_010082.
NP_012710.
NP_597625
NP_014074.
NP_014293.
NP_597270
NP_014752.
NP_012264.
Orthologous groups 15 and 16 are merged into one group by the Java implementation.
Appendix B
Appendix B: Individual gene trees for genes in the supermatrix utilised in construction of
the phylogeny
Gene RPL23: 60S ribosomal protein L23.
Gene RPS8: 40S ribosomal protein S8.
Gene SRP54: signal recognition particle 54 kDa protein.
Gene ERCC3: TFIIH basal transcription factor complex helicase XPB.

Gene KARS: lysyl-tRNA synthetase.

Gene METAP2: methionine aminopeptidase 2

Gene ATP6V1D: V-type proton ATPase subunit D.
Gene PSMC1: 26S protease regulatory subunit 4.
Gene NFS1: cysteine desulfurase, mitochondrial precursor.
Gene GARS: glycyl-tRNA synthetase.

Appendix C
Appendix C: Predictions made by constrained ML
Protein 1
Protein 2
Description
Description
PREDICTED:
apolipoprotein A-I binding protein
91984773
similar to adaptor-related protein complex 1
precursor
89042891
sigma 2 subunit
PREDICTED: similar to peptidylprolyl
23110944
proteasome alpha 6 subunit
113429091
isomerase A isoform 1
meiotic recombination protein SPO11
23110944
38201680
23110944
113414586
isoform b
PREDICTED: similar to CG17293-PA
PREDICTED: similar to Ubiquitin-63E
11024714
ubiquitin B precursor
113423966
11024714
ubiquitin B precursor
5454144
7705785
transcription factor B1, mitochondrial
113414586
CG11624-PA, isoform A
ubiquitin D
7705785
113429091
7705785
38201680
isoform b
4557896
myotubularin
41350318
myotubularin-related protein 2 isoform 2
transcription elongation factor A protein

4507385
PREDICTED: similar to 40S ribosomal
2 isoform a
51467029
protein S26

4507385
2 isoform a
4557896
myotubularin
7705477
hypothetical protein LOC51504
28872761
myotubularin-related protein 1

4507385
2 isoform a
PREDICTED: similar to postmeiotic

113418682
segregation increased 2-like 2

4507385
2 isoform a
38201710
DEAD box polypeptide 17 isoform 1
4507385
2 isoform a
4557896
myotubularin
PREDICTED: similar to large subunit

113427529
44680154

4507385

small nuclear ribonucleoprotein polypeptide
2 isoform a
4507129

4507385
ribosomal protein L36a
E
RNA, U3 small nucleolar interacting protein
2 isoform a
4759276

4507385
2 isoform a
116812591
RER1 retention in endoplasmic reticulum 1

4507385
2 isoform a
4557719
DNA ligase I

4507385
2 isoform a
56549681
small CTD phosphatase 3 isoform 2
4557896
myotubularin
18491016
exonuclease 1 isoform b

4507385
phosphatidylinositol glycan anchor
2 isoform a
4758922
biosynthesis, class L
4506233
proteasome 26S non-ATPase subunit 8
7019319
activator of basal transcription 1

4507385
2 isoform a
4507385
2 isoform a
4507385
DNA directed RNA polymerase II
2 isoform a
10863925

4507385
polypeptide L
dehydrodolichyl diphosphate synthase
2 isoform a
45580738
isoform b

4507385
2 isoform a
4557896
myotubularin
150170706
19923424
anaphase promoting complex subunit 10


4507385
4507385
2 isoform a
8923942
40254869
nucleolar protein family A, member 3

pre-mRNA processing factor 31 homolog
2 isoform a
4507385
2 isoform a
41327715
p53-related protein kinase

4507385
2 isoform a
4507311

4507385
2 isoform a
suppressor of Ty 4 homolog 1
113427044

4507385
2 isoform a
4506651
ribosomal protein L36a-like protein

4557896
myotubularin
113429091
4557896
myotubularin
113414586

4507385
2 isoform a
4505947
phosphoribosyl pyrophosphate
4506127
polypeptide G
phosphoribosyl pyrophosphate synthetase 1-
synthetase 1
28557709
like 1
PREDICTED: similar to adaptor-related
153791910
89042891
4506541
retinaldehyde binding protein 1
4557719
protein complex 1 sigma 2 subunit

DNA ligase I
38348232
dual specificity phosphatase 7
89042891
pre-mRNA processing factor 31

40254869
homolog
113414586


40254869
homolog
protein L26 (Silica-induced gene 20 protein)

113418826
(SIG-20)

40254869
40254869
homolog

113431146
113429091
(SIG-20)
homolog
38348232
89041736

40254869
homolog
38201680
isoform b
13775200
SF3b10
62909985

N-ethylmaleimide-sensitive factor
13775200
SF3b10
4505331
attachment protein, gamma

13775200
SF3b10
113429091
13775200
SF3b10
38201680
isoform b
13775200
SF3b10
89042891
13775200
SF3b10
4502743

cyclin-dependent kinase 7
ras homolog gene family, member C
13775200
SF3b10
111494251
precursor
13775200
SF3b10
111494248
precursor
13775200
SF3b10
113431146
(SIG-20)
13775200
SF3b10
113418826
(SIG-20)
13775200
SF3b10
113414586
DNA replication complex GINS protein

7706367
PSF2

113429091
eukaryotic translation initiation factor 3

7705433
subunit 6 interacting protein
89042891
7662482
transmembrane protein 15
4557719
DNA ligase I
4826675
38201680
isoform b
4826675
113429091
4826675
4557719
4507213
signal recognition particle 19kDa
DNA ligase I
113414586
113414586
delta isoform of regulatory subunit B56,

5453954
protein phosphatase 2A isoform 1

4826675
89042891

4826675
89041736

35493987
ubiquitin-conjugating enzyme E2I
113429091
44917606
attachment protein, beta
89042891
44917606
4557719
DNA ligase I
16945972
kelch domain containing 3
89041736
44917606
35493987
4505331
113414586
44917606

89041736

16945972
89042891

35493987
38201680
isoform b
16945972
7019405
host cell factor C2

133925811
transportin 1 isoform 1
89042891
133925811
23510381

111494248
precursor
38201680

111494251
precursor
38201680

111494251
isoform b
isoform b
precursor
89042891

111494251
precursor
4557719
DNA ligase I

111494251
precursor
118600973
RNA binding motif protein, X-linked 2
118600973

111494248
precursor
111494251
precursor
89041736

119943098
dihydropyrimidine dehydrogenase
113429091

111494251
precursor
111494248
precursor

111494251
precursor
4506717
ribosomal protein S29 isoform 1
71772583
71772583
47717139
leucine-zipper-like transcription regulator 1

111494248
precursor
111494251
precursor
111494251
precursor
111494248
precursor
47717139

111494248
precursor
45580738

111494251
isoform b
precursor
45580738
isoform b
38201710
38201710
56549681
56549681
14249398
PHD-finger 5A
14249398
PHD-finger 5A

111494251
precursor
111494248
precursor
111494248
precursor
111494251
precursor
111494248
precursor
111494251
precursor
111494251
precursor
10863925

111494248
polypeptide L
precursor
89042891

111494248
precursor
4557719

111494248
DNA ligase I
precursor
89041736

111494248
precursor
4502859
CDC28 protein kinase 2
4506717

111494248
precursor
glucose-6-phosphate dehydrogenase
109389365
isoform a
89042891
glucose-6-phosphate dehydrogenase
109389365
isoform a
89041736

111494248

precursor
10863925
polypeptide L
51173724
bystin
38201680
isoform b
51173724
bystin
113429091
51173724
bystin
113414586
41406094
J domain containing protein 1 isoform b
13236516
Der1-like domain family, member 1

41281768
cytochrome b-5 isoform 1
89042891

41349495
DNA primase polypeptide 2
38201680
isoform b
41406094
11141871
J domain containing protein 1 isoform a
41349495
113414586
41281768
4503183

41349495
113429091
41281768
89041736
41406094
31455614
41406094
4557719
35493996
113414586
35494003
113414586
DNA ligase I

35494003
113429091
35493996
113429091
35493996
38201680
isoform b
35494003
38201680
isoform b
31581534
tRNA isopentenyltransferase 1
38201680
isoform b
minor histocompatibility antigen 13

30581111
isoform 3
4557719
DNA ligase I
31543831
tubulin, gamma 1
6996005
dynamin 1-like protein isoform 1

30581111
isoform 3
6996005
31543831
tubulin, gamma 1
4557719
DNA ligase I
31581534
29826282
protein phosphatase 1G
31581534
113429091
4505999
113414586

30581111
89042891

isoform 3
19913408
DNA topoisomerase II, beta isozyme
protein phosphatase 1G
isoform 3
30581111

89041736
113414586

19913408
113429091
19913408
38201680
isoform b
13236516
89041736
13236516
113429091
13236516
113414586

13236516
38201680
isoform b
13236516
89042891

10835049
ras homolog gene family, member A
89042891
ATP-binding cassette, sub-family C,

9955970
member 3
113429091
nuclear factor of kappa light polypeptide

10092619
gene enhancer in B-cells inhibitor, alpha
89041736

5729877
heat shock 70kDa protein 8 isoform 1
89042891
proteasome 26S ATPase subunit 4

5729991
isoform 1
4557719

5729991
isoform 1
89042891

5729991
isoform 1
DNA ligase I

113429091

5729991
isoform 1
71772583
4506717

5729991
isoform 1
5729991
isoform 1
10863925
polypeptide L
5729877
5729991
heat shock 70kDa protein 8 isoform 1
89041736
8923942

isoform 1
6005764
GABA(A) receptor-associated protein
89042891

5729991
isoform 1
89041736

5729991

isoform 1
45580738
isoform b
56549681

5729991
isoform 1
ATP-binding cassette, sub-family C
4557481
(CFTR/MRP), member 2
113429091
4507785
113429091
4507785
113414586

4507785
38201680
protein phosphatase 3, regulatory

4506025
subunit B, alpha isoform 1
113429091
ubiquitin-like protein fubi and

4503659
isoform b
PREDICTED: similar to Ubiquitin-like
ribosomal protein S30 precursor
113422449
protein FUBI
4504277
H2B histone family, member Q
89042891

4503183
89041736

4503183
89042891

protein phosphatase 1, catalytic subunit, beta
153252132
ribosomal protein L31 isoform 3
4506005
isoform 1
hypothetical protein LOC57604 isoform

153251916
153251913
hypothetical protein LOC57604 isoform 1
153252132
ribosomal protein L31 isoform 3
113414586
33286434
p47 protein isoform c
116256336
SEC31 homolog A isoform 4
33286434
p47 protein isoform c
6996005

30520314
89042891
7657339
molybdenum cofactor synthesis 3
113414586
coenzyme Q10 homolog A isoform b
151101384
coenzyme Q10 homolog A isoform a
151101386
73622130
7662010
BolA-like protein 2
85797673
bolA-like protein 2B
zinc finger protein 516
10190686
19718751
uracil-DNA glycosylase isoform UNG2
H2A histone family, member V isoform

41406067

149944735
89041601
protein S26 isoform 1
149944735
15011936

149944735
88980535
protein S26
149944735
88982349
protein S26
149944735
113420084
protein S26
149944735
89025350

149944735
113430282
protein S26
149944735
88987217
protein S26
149944735
150010661
SEC14-like 5
113429703
89042891
protein S26
150170706
anaphase promoting complex subunit 10
113429091
38201710
28626498
kinesin family member C1
38201680
4557719
isoform b
DNA ligase I
47778943
syntaxin 16 isoform a
89042891

38201710
89041736
38201710
4758496
38201710
113414586

H2A histone family, member Y isoform 2
38201710
113418826
(SIG-20)
38201710
113431146
(SIG-20)
38201710
113429091
excision repair cross-complementing

rodent repair deficiency,
15834617
complementation group 2 protein
45580738
isoform b

15834617
8923942

15834617
15834617
47717139
4557719

DNA ligase I
15834617
10863925
polypeptide L

15834617
89042891

15834617
89041736

15834617
113429091

15834617
38201680
COX17 homolog, cytochrome c oxidase

5031645
assembly protein
89042891
COX17 homolog, cytochrome c oxidase

5031645
isoform b

assembly protein
89041736

4503719
148727247
fragile histidine triad gene
89042891
ubiquitin specific peptidase 5 isoform 2
8923942

148727247
ubiquitin specific peptidase 5 isoform 2
45580738
coatomer protein complex, subunit

148536853
alpha isoform 2
148596961
stearoyl-CoA desaturase 4 isoform a
isoform b
89042891
148596938
coatomer protein complex, subunit

stearoyl-CoA desaturase 4 isoform b
148536853
alpha isoform 2
113429091
145275210
RNA processing factor 1
113418826
(SIG-20)
145275210
113431146
(SIG-20)
145275210
113414586

145275187
tRNA-(N1G37) methyltransferase
113429091
145275210
113429091
145275210
38201680
126723390
ankyrin repeat domain 24
121582655
124256496
heat shock 70kDa protein 1-like
34419635
isoform b
heat shock 70kDa protein 6 (HSP70B')
126723390
89041736

122937243

(putative)
89042891

121582655
89041736

118600973
113418826
(SIG-20)
118600973
113431146
(SIG-20)
trafficking protein particle complex 6B

118600991
isoform 1
13129120

118600991
isoform 1
118498359
ribosomal L1 domain containing 1
trafficking protein particle complex 6A

38201680
113414586
isoform b
118600973
113429091
113414586

118600991
isoform 1
RER1 retention in endoplasmic
116812591
reticulum 1
62909985

116812591
reticulum 1
113414586

116812591
reticulum 1
38201680

116812591
reticulum 1
isoform b
113429091

116812591
reticulum 1

113431146
(SIG-20)

116812591
reticulum 1

113418826
ATP-binding cassette, sub-family A

116734710
member 3
89042891

116734710
member 3
116256336
(SIG-20)

89041736
110347439

116256336
4506005
115387112
ubiquitin-like 5
13236510
115387112
ubiquitin-like 5
113414586
proto-oncogene tyrosine-protein kinase

112382244
112382241
isoform 1
ubiquitin-like 5
FGR
89042891
89042891
FGR

PREDICTED: similar to Ubiquitin-conjugating enzyme E2S (Ubiquitin-conjugating enzyme E2-24 kDa) (Ubiquitin-protein ligase) (Ubiquitin carrier protein)
112382377
ubiquitin-conjugating enzyme E2S
113430896
SEC24 (S. cerevisiae) homolog B

112382212
isoform a
89042891
SEC24 (S. cerevisiae) homolog B

112382212
isoform a
FGR
89041736
FGR
113429091
113429091
containing 1 isoform 1
89041736
89042891
serine/threonine kinase 24 (STE20

homolog, yeast) isoform b
110347439
110347439

containing 1 isoform 1
110349738
ankyrin repeat and FYVE domain

110815813
ankyrin repeat and FYVE domain

110815813


112382241


112382244
(E2-EPF5)

89042891
6996005
10190696
110347439
MYC-associated zinc finger protein

110347459
isoform 2

110349799
testis-specific protein kinase 2
89042891

PREDICTED: similar to zinc finger protein
110347439
113413881
110347439
10190686
110349799
testis-specific protein kinase 2
89041736
114
109452595
109452593
109255245
serine/threonine kinase 17a
113414586
109255245
serine/threonine kinase 17a
6996005
ATP-binding cassette, sub-family E,

108773782
member 1
89042891
ATP-binding cassette, sub-family E,

108773784

member 1
89042891

95147356
mitogen-activated protein kinase 15
89042891

95147356
mitogen-activated protein kinase 15
89041736
94721250
vesicle-associated membrane protein-associated protein A isoform 1
89041736
nuclear LIM interactor-interacting factor

93004102
89042891
nuclear LIM interactor-interacting factor

2
93141204
methyltransferase like 2B

93004102

89041736
113414586

89145417
methyltransferase like 7A
89042891
eukaryotic translation initiation factor

84043963
5B

113429091

77812674
isoform b
8923444
isoform a
77812670
exosome component 9 isoform 2
89042891

77812670
exosome component 9 isoform 2
89041736
myosin head domain containing 1
75812980
isoform 3

113429091
myosin head domain containing 1

75812980
isoform 3
89042891
digestive-organ expansion factor

75677335
homolog
isoform 1
89042891

113429091

72534754


72534754
isoform 1
38201680
isoform b

72534754
isoform 1
71772583
4758496
71772583
32130516
113414586

serologically defined colon cancer antigen 1
71772583
113431146
(SIG-20)
71772583
113418826
(SIG-20)
71772583
62909985
71772583
113414586
71772583
19718751
ubiquitin-conjugating enzyme E2D 4

71772583
8393719
peroxisomal enoyl-coenzyme A
(putative)
70995211
hydratase-like protein
71772583
4506717
71772583
38016127
89042891

RNA binding motif protein 34
71772583
4502743
71772583
4557719
DNA ligase I
71772583
38201680
isoform b
71772583
113429091
68509270
transcriptional adaptor 2-like isoform a
113414586

68303635
mutS homolog 3
89042891
68226422
Yip1 domain family, member 5
32401427
Yip1 domain family, member 5

62955833
DNA-damage inducible protein 2
89042891
serologically defined colon cancer

64276486

antigen 10
89042891

64276486
antigen 10
89041736
62955833
DNA-damage inducible protein 2
48717485
DDI1, DNA-damage inducible 1, homolog 1

62865890
113429091
62460637
importin 4
89041736

62865890
45580738
62865890
4557719
isoform b
DNA ligase I
62865890
8393719
(putative)
62865890
89042891
62865890
8923942

62865890
89041736
62865890
56549681
62240994
cysteinyl-tRNA synthetase isoform d
62240992
cysteinyl-tRNA synthetase isoform c
62234438
Notchless gene homolog isoform b
41350318
62234438
44680154
62234461
Notchless gene homolog isoform a
41350318
62234461
62234438
62234438
4502703
62234438
21536371
telomerase-associated protein 1
62234461
44680154
58533179
trafficking protein particle complex 2
7657548
cell division cycle 6 protein

60279265
Sec61 gamma subunit
38201680
isoform b
58533179
60279265
Sec61 gamma subunit
38201680
7657546
isoform b
Sec61 gamma subunit
60279265
Sec61 gamma subunit
89042891

58533179
113429091
58533179
113431146
(SIG-20)
58533179
113418826
(SIG-20)
58533179
113414586
57165436
serine/threonine kinase 16
57165434
serine/threonine kinase 16
56549681
88943062
isomerase A (cyclophilin A)-like 4

56549681
113423887
56549681
89041736
56549681
31543091
56549681
22035624
phosphatidate cytidylyltransferase 1
56549683
89042891
56549681
4502743

56549681
113418826
(SIG-20)
56549681
113431146
(SIG-20)
56549681
113422777
56549681
89042897
56549681
38016127

56549681
38201680
56549681
5729840
isoform b
tubulin, gamma complex associated protein 2
56549681
8393719
(putative)
56549681
88943041
56549681
88953813
PREDICTED: similar to TBC1 domain
family member 3 (Rab GTPase-activating
protein PRC17) (Prostate cancer gene 17
56549681
113426831
56549681
4557719
CCR4-NOT transcription complex,

56550059
protein) (TRE17 alpha protein) isoform 1

DNA ligase I
CCR4-NOT transcription complex, subunit 4
subunit 4 isoform b
56550057
isoform a
56699411
solute carrier family 35, member E2
89042891

56549683
89041736

56549681
113429091
56549681
89042891
56549681
4758496

56549681
6912680
isoform a
56699411
solute carrier family 35, member E2
89041736
56549681
62909985

55956895
CGI-01 protein isoform 3
89041736

56549113
debranching enzyme homolog 1
89042891
56118223
choline/ethanolaminephosphotransferase
5174415

choline/ethanolaminephosphotransferase
56549113
debranching enzyme homolog 1
89041736
SMT3 suppressor of mif two 3 homolog
54792071
SMT3 suppressor of mif two 3 homolog 2
2 isoform b precursor
54792069
isoform a precursor
oxoglutarate (alpha-ketoglutarate)
dehydrogenase (lipoamide) isoform 1
51873036
precursor
51944950
phosducin-like 2

89042891
113414586
DnaJ (Hsp40) homolog, subfamily B,

50593537
member 12
41054844
5-phosphatase, A isoform 2
89041736
chromosomes 4-like 1
89042891
89042891
89042891
18765707
113429091
113429091
subunit 11 isoform 2
50409781
18777675
11 isoform 2
APC11 anaphase promoting complex subunit
50409750
APC11 anaphase promoting complex

50409796
11 isoform 2

50409789

50409796

50409789
phosphatase isoform 2
SMC4 structural maintenance of

50658065

skeletal muscle and kidney enriched inositol

50658063

phosphatidylinositol (4,5) bisphosphate

50726960


50726960


50658063
member 12

50658065


50726960
11 isoform 2
50409750
11 isoform 2
50409796
50409789

50409804
50409750

50409804
50409789
18777675
50409781
50409796
18777675
50409781
11 isoform 2
polypeptide A'
89042891
NAD(P)H:quinone oxidoreductase type

49574502
11 isoform 2
small nuclear ribonucleoprotein

50593002
11 isoform 2

50409796
11 isoform 2

50409789
11 isoform 2

50409804
11 isoform 2

50409804
11 isoform 2

50409804
11 isoform 2

3, polypeptide A2
89042891

50083277
ATPase class I type 8B member 4
89042891

50409781
50409750

50409781
18777675
11 isoform 2
18777675
NAD(P)H:quinone oxidoreductase type

49574502
11 isoform 2

50409750
11 isoform 2
3, polypeptide A2
89041736
DDI1, DNA-damage inducible 1,
48717485
homolog 1
89042891
DDI1, DNA-damage inducible 1,

48717485

homolog 1
89041736
6996005
leucine-zipper-like transcription
47717139
regulator 1
47717139
regulator 1
38201680
isoform b
47717139
regulator 1
4758496
4557719
DNA ligase I
47717139
regulator 1
solute carrier family 25 member 3
47132595
isoform b precursor
45580738
isoform b

45580738
isoform b
113414586

45580738
isoform b
21536371
23397458
kinesin family member 19

45580738
isoform b
45580742
isoform a
45580738
isoform b

45580742
isoform a
113414586

45580738
isoform b

113423887

45580738
isoform b
88943062

45580738

isoform b
89042897
45580738
isoform b

113422777

45580738
isoform b
5729840

45580738

isoform b
38201680
isoform b

45580738
isoform b
4506707


45580738
isoform b

113418826
(SIG-20)

45580738
isoform b

113431146
(SIG-20)

45580738
isoform b
41872631
fatty acid synthase
38016127

45580738
isoform b
45580738
solute carrier family 25 member 3 isoform b
isoform b
4505775

45580742
isoform a

113429091

45580738
precursor
isoform b
89042891
32130516
serologically defined colon cancer antigen 1

45580738
isoform b
45580738
isoform b

113429091

45580738
45580738
isoform b
62909985
113426831

isoform b

protein phosphatase 1, catalytic subunit,

46249376
beta isoform 1
4506005
isoform 1

45580738
isoform b
17978477
vacuolar protein sorting 11

45580738
isoform b
4557719

45580742
DNA ligase I
isoform a
38201680
isoform b

45580738
isoform b
4758496
4502743

45580738
isoform b
45580738
isoform b
42516576

45580738
WW domain-containing oxidoreductase
isoform b
7706523

45580738
glutaredoxin 5
isoform 1
isoform b
18860884
isoform 2

45580738
isoform b
7705369
coatomer protein complex, subunit beta


45580742
isoform a

113431146
(SIG-20)

45580742
isoform a

113418826
(SIG-20)

45580738
45580738
isoform b
31455614
23510381
isoform b
45580738
isoform b
88953813

45580738
isoform b
88943041

45238849
poly(A) binding protein, cytoplasmic 3
89041736

myotubularin-related protein 2 isoform

44680154

113431146
(SIG-20)

44680154
113418826
(SIG-20)
113414586

44680154
1
44680154

113429091

44680154
41350318
19923424
21536371

44680154
1
44680154
1
transcription elongation factor A 1
45439355
isoform 2
113414586

45238849
poly(A) binding protein, cytoplasmic 3
89042891
28872761

44680154
1
acyl-CoA synthetase long-chain family
42794754
acyl-CoA synthetase long-chain family
member 3
42794752
member 3
42516563
UDP-glucuronate decarboxylase 1
89041736

41872631
fatty acid synthase
89042891

42516563
UDP-glucuronate decarboxylase 1
89042891

42516576
glutaredoxin 5
89042891

41872631
fatty acid synthase
89041736
19923424

41350318
2
41350316

113429091

41350318

113418826
(SIG-20)

41350318
113431146

41350318
(SIG-20)
113429091

41350318
21536371
28872761

41350318
2
41350318
113414586

41327715
113429091
41327715
89042891
41349441
41327715
89042891
113414586

41349441
89041736
ubiquitin-conjugating enzyme E2
40806167
variant 1 isoform a

113429091
113414586
40806167
variant 1 isoform a
40806167
variant 1 isoform a
38201680
isoform b
transmembrane emp24 protein transport

39725636
domain containing 9
113414586


39725636
domain containing 9

113418826
(SIG-20)

39725636
domain containing 9

113431146
(SIG-20)
39725636
domain containing 9
113429091
38708309
113428755

38327644
89041736

38327644
89042891
38327644
62909985

38201680
PREDICTED: similar to ribosomal protein
isoform b
29742309
L31
38201680
isoform b

113427093
L31

38201680
isoform b
7706343
4504221
guanylate kinase 1
4506193
proteasome beta 1 subunit
7657198
dimethyladenosine transferase
4507797
ubiquitin-conjugating enzyme E2 variant 2
4506699
4506643

38201680
isoform b
38201680
isoform b
38201680
isoform b
38201680
isoform b
38201680
isoform b
38201680
isoform b
38201680
isoform b
13129120

38201680
isoform b
7706423
U6 snRNA-associated Sm-like protein LSm7
4557719
DNA ligase I

38201680
isoform b
38201680
isoform b
10863925
polypeptide L
14249398
PHD-finger 5A
62909985
7705477

38201680
isoform b
38201680
isoform b
38201680
isoform b
38201680
guanine nucleotide-binding protein, beta-1
isoform b
11321585

38201680
subunit
SWI/SNF-related matrix-associated actin-
isoform b
21071060
dependent regulator of chromatin a-like 1

38201680
isoform b
4506631

38201680
isoform b
15150809
SEC11-like 3

38201680
isoform b
7657546
Sec61 gamma subunit
8923475
thioredoxin-like 4B

38201680
isoform b
38201680
isoform b

113418682

38201680
isoform b
4758384
FK506 binding protein 5
8922905
RIO kinase 2
4506701

38201680
isoform b
38201680
isoform b
38201680
isoform b
18105063
vacuolar protein sorting 45A


38201680
isoform b

113418826
(SIG-20)

38201680
isoform b

113431146
(SIG-20)

38201680
38201680
isoform b
8923942
4502643
chaperonin containing TCP1, subunit 6A
38201680
isoform b
isoform a
isoform b
4758922

38201680
isoform b
15431295
15431297

38201680
isoform b
38201680
isoform b

113429091

38201680
isoform b
7706667
4507311

38201680
isoform b

38201680
isoform b
protein L29 (Cell surface heparin-binding

113428574
protein HIP)

38201680
isoform b
32189369
DNA polymerase epsilon subunit 2

38201680
isoform b
4503729

38201680
FK506-binding protein 4
isoform b
89035017

38201680
isoform b
4507873

38201680
small nuclear ribonucleoprotein polypeptide
isoform b
4507129

38201680
von Hippel-Lindau binding protein 1
E
solute carrier family 2 (facilitated glucose
isoform b
8923733
transporter), member 6

38201680
isoform b
113414586
38201680
isoform b
4502859
4506609

38201680
isoform b
CTD (carboxy-terminal domain, RNA

38201680
polymerase II, polypeptide A) small
isoform b
32813443

38201680
isoform b

113427613

38201680
phosphatase 1 isoform 2
L31
isoform b
6912680
isoform a
7657548

38201680
isoform b
38201680
isoform b

113427529

38201680

isoform b
51467029
protein S26
10864021

38201680
isoform b
38201680
isoform b
PREDICTED: similar to DNA primase large

113418084

38201680
isoform b
subunit, 58kDa
113418086
subunit, 58kDa

38201680
isoform b
4506717

38149981

polypeptide B''
89042891

38201680
isoform b
4502743
5729840

38201680
isoform b
38201680
isoform b
4504523

38201680
isoform b
heat shock 10kDa protein 1 (chaperonin 10)

113419590
protein S28

38201680
isoform b
4506651

38201680
isoform b

113427044

38201680
isoform b
23397458

38201680
RNA, U3 small nucleolar interacting protein
isoform b
4759276
chaperone, ABC1 activity of bc1

34147522
complex like precursor
34147513
RAB7, member RAS oncogene family
2
89042891
113414586
chaperone, ABC1 activity of bc1

34147522

complex like precursor
89041736
glycerophosphodiester
32698962
phosphodiesterase domain containing 1
4557719
DNA ligase I
ubiquitin-conjugating enzyme E2A

32967278
isoform 3
32967276
ubiquitin-conjugating enzyme E2A isoform 2
32967280

32967278
isoform 3
32130516
antigen 1

113429091
31542547
dullard homolog
113429091
31542507
32130516
HORMA domain containing 1
113429091
10863925
32130516
antigen 1
polypeptide L
antigen 1
18765707
31542547
dullard homolog
89042891

32130516
antigen 1
29553970
H2A histone family, member J
30425538
zinc finger, DHHC-type containing 21
4506717
63029935
H2A histone family, member B3
4557719
DNA ligase I
31455614
89042891

31455614
89041736

30410779
huntingtin interacting protein B
89041736

29553970
89041736
29553970
63029943

28872761
113429091
28872761
19923424
28872761
113418826
(SIG-20)
28872761
113431146
(SIG-20)
29553970
28827774
89042891
89042891
dual-specificity tyrosine-(Y)-
phosphorylation regulated kinase 4
28872761

113414586
PRP38 pre-mRNA processing factor 38

24762236
(yeast) domain containing A
89042891

24371272
isoform 2
24430186
phosphatidylinositol glycan, class C

24371241
4505795
isoform 1
24430186
89042891
23510381
8923942
23397458
8923942

23397458
10863925
polypeptide L
23510381
89042891

23397458
23199991
casein kinase 1 epsilon
89042891
4503093

casein kinase 1 epsilon
22202633
prefoldin subunit 5 isoform alpha
113429091
22035624
phosphatidate cytidylyltransferase 1
113429091
21624654
spermatogenesis associated 5
113429091
21362110
thiamin pyrophosphokinase 1 isoform a
89041736

21624654
spermatogenesis associated 5
89042891
21362110
thiamin pyrophosphokinase 1 isoform a
89042891
21450653
113429091
21361144
113414586

21361376
splicing factor 3a, subunit 2
89042891

21361144
113429091
protein disulfide isomerase-associated 3

21361657
precursor
4758304
protein disulfide isomerase-associated 4

20270343
ADP-ribosylation factor-like 10B
89042891
SWI/SNF-related matrix-associated
actin-dependent regulator of chromatin
21071060
a-like 1
4502743
21071060
a-like 1
113414586
21071060
a-like 1

113429091
19718751
113429091
19718751
10863925
polypeptide L
19718751
18765707
19718751
4506717
18860916
5'-3' exoribonuclease 2
89042891
19913428
vacuolar H+ATPase B2
113429091
autophagy-related cysteine endopeptidase 2
19718751
30795252
isoform a
19718751
19923424
19718751
30795248
113414586
8923942
isoform b
18105063
113418826
(SIG-20)
18105063
18491016
exonuclease 1 isoform b
18105063
113431146
4557719
113414586
(SIG-20)
DNA ligase I
18105063
113429091
ATPase, H+ transporting, lysosomal

18087815
31kDa, V1 subunit E isoform 2
89042891

17978519
vacuolar protein sorting 26 A isoform 1
17978477
vacuolar protein sorting 11
89042891
4502859

17978519
vacuolar protein sorting 26 A isoform 1
113429091
15011936
88980535
protein S26
15011936
15150809
SEC11-like 3
89041601
113414586

15011936
88982349
protein S26
15011936
113420084
protein S26
15150809
SEC11-like 3
113429091
15011936
89025350

15011936
113430282
protein S26
15011936
88987217
protein S26
15011936
14249398
PHD-finger 5A
113429703
4758496
protein S26
14249398
PHD-finger 5A
113431146
(SIG-20)
14249398
PHD-finger 5A
113418826
(SIG-20)
14249398
PHD-finger 5A
113429091
IMP2 inner mitochondrial membrane

14211845
protease-like
113414586
14249398
PHD-finger 5A
113414586

14149696
SEC31 homolog B
113429091
13236510
ubiquitin-like 5
113414586
11863130
phosphatidylinositol N-acetylglucosaminyltransferase subunit A
113429091
10863925
isoform 1
polypeptide L
89042891

10864021
113429091

10863925
polypeptide L
8393719
(putative)
4502743
4557719
DNA ligase I

10863925
polypeptide L
10863925
polypeptide L
10863925
polypeptide L
38016127
62909985

10863925
polypeptide L
10863925
polypeptide L
6912680
isoform a
4758496

10863925
polypeptide L
10863925
polypeptide L

113429091

10863925
polypeptide L
5729840

10863925
polypeptide L
10864021

89041736
113414586


10863925
10863925
polypeptide L

113431146
113418826
(SIG-20)
polypeptide L
(SIG-20)

10863925
polypeptide L
113414586

8923942
4506005
isoform 1
8923942
4502743
8923942
4557719
DNA ligase I
8923942
4758496
8923942
38016127

8923942
113426831
8923942
113414586

8923942
113431146
(SIG-20)
8923942
113418826
(SIG-20)
8923942
89042891

8923942
89041736
uncharacterized hypothalamus protein

8923712

HARP11
89042891

8923942
113429091
uncharacterized hypothalamus protein

8923712
HARP11
89041736
7706326
splicing factor 3B, 14 kDa subunit
89042891

7706753
ubiquitin C-terminal hydrolase UCH37
89042891

7706667
89041736

7706657
cell division cycle 40 homolog
89041736

7706667
113429091
7706326
splicing factor 3B, 14 kDa subunit
89041736

7706495

member 11 precursor
89042891

7706657
cell division cycle 40 homolog
89042891

7706667
7705483
89042891
113414586

7705483
89041736

7705483
7657198
89042891
113414586

7657522
ring finger protein 7 isoform 1
89042891

7657198
113429091
7657548
113429091
7657546
Sec61 gamma subunit
89042891

7657548
113431146
(SIG-20)
7657548
113418826
(SIG-20)
7657548
113414586

6912680
isoform a

113429091

6912680
isoform a
4557719

6912680
isoform a
89042891

6005701
DNA ligase I

member 8
89042891

5902002
89042891
plasma glutathione peroxidase 3

6006001
DnaJ (Hsp40) homolog, subfamily A,
precursor
31542539

6005701
member 8
protein 2
member 3
89041736
tubulin, gamma complex associated

5729840

113429091
113414586

5729840
protein 2
5729840
protein 2
6996005

PREDICTED: similar to Ubiquitin-63E
5454144
ubiquitin D
113423966
CG11624-PA, isoform A
5729840
protein 2
4557719

5729840
protein 2
89042891

5729840

protein 2
89041736

5453660
DNA ligase I

protein 3
89042891

5032133
eukaryotic translation initiation factor 1
89042891

5031635
cofilin 1 (non-muscle)
113429091
U2 small nuclear RNA auxiliary factor

4827046
1-like 2
4506005
isoform 1
4758496
ATP synthase, H+ transporting,

mitochondrial F1 complex, gamma
4885079
subunit isoform H (heart) precursor

4759302
VAMP-associated protein B/C
89042891
4759302
6996005

4759302
89041736

4826924
polypeptide K
113414586

4826924
polypeptide K

113429091
113414586

4758922
4758922
4502743
4758922

113429091

4758922

113431146
(SIG-20)

4758922

113418826
(SIG-20)
4507873
113431146
(SIG-20)
4507873
113418826
(SIG-20)

4557563
complementation group 3
9910180
ACN9 homolog
4507947
tyrosyl-tRNA synthetase
113429091
4507873
113414586
4507797
variant 2

113429091
4507873
113429091
113414586
4507797
variant 2

4506701
113429091
4506699
113418826
4506699
113431146
(SIG-20)
(SIG-20)
4506717
4758496
solute carrier family 7 (cationic amino

4507047

acid transporter, y+ system), member 1
89042891

4506717
113418826
(SIG-20)
4506717
113431146
4506699
4506717
62909985
4506717
113414586
4506699
4758496
4502743
(SIG-20)
4506717
8393719
(putative)
4506699
113429091
solute carrier family 7 (cationic amino
4507047
acid transporter, y+ system), member 1
89041736
4506717
38016127
4506717
4502743
4506701
113431146
(SIG-20)
4506701
113418826
(SIG-20)
4506701
113414586
4506699
113414586

4507123
polypeptide B''
4506717

89042891
4557719

DNA ligase I
4506715
113422526

4506715
113423050

4506715
89034184

4506715
88959151
protein S28
4506715
88953906
protein S28
4506717
113429091
4506643
113414586

4506617
89042891

4506643
113429091
4506609
113414586

4506609
113429091
4506193
113414586

4506193
113429091
protein tyrosine phosphatase, receptor

4506303
type, A isoform 1 precursor
113429091
4506005
beta isoform 1
4557719
DNA ligase I
4506233
proteasome 26S non-ATPase subunit 8
4505621
prostatic binding protein
113429091
22165364
mitochondrial ribosomal protein L38
4505795
89042891

4504511

member 1
89041736

4504511
member 1
113429091
4504221
guanylate kinase 1
113414586

4504007
glycerol kinase isoform b
89042891

4504007
glycerol kinase isoform b
89041736

4504221
guanylate kinase 1
4502703
cell division cycle 6 protein
113429091
21536371
4503301
2,4-dienoyl CoA reductase 1 precursor
89042891
chaperonin containing TCP1, subunit

4502643

6A isoform a
89042891

63029935
89041736

28557709
synthetase 1-like 1
89042891
63029935
63029943
28557709
synthetase 1-like 1
89041736
66912162
148747574
histone 2, H2bf
89042891
21945058

63029935
89042891

118402582
cell division cycle 20
113429091
38016127
89041736

38016127
51467029
protein S26
38016127
89042891
38016127
4557719
37595752
lamin B receptor
37595750

DNA ligase I
lamin B receptor
38016127
113429091
38348260
38016127
32189369
89041736
6996005
113414586

32189369
113429091
32189369
113431146
(SIG-20)
32189369
113418826
32484973
adenosine kinase isoform a
113429091
(SIG-20)
32528306
replication factor C large subunit
113429091
32483374
nucleolar protein 5A
89042891

31795544
origin recognition complex, subunit 1
113429091
31543091
31795544
89042891
113414586

31543091
113429091
autophagy-related cysteine
30795252
endopeptidase 2 isoform a
30795248
CCR4-NOT transcription complex,

31542315
isoform b
subunit 8
89041736

31543091
89041736

31795544
113431146
(SIG-20)
31795544
113418826
(SIG-20)
28376621
SEC14p-like protein TAP3
89042891

28376621
SEC14p-like protein TAP3
89041736

28173554
histone H2B
89042891
28559085
cytidine triphosphate synthase II
28559083
cytidine triphosphate synthase II
30795252
113414586
113414586
30795248
endopeptidase 2 isoform b
30795248
endopeptidase 2 isoform b
113429091
30795252
113429091
24586679
testis-specific histone H2B
89041736

24586675
slingshot homolog 3
89042891

24586679
testis-specific histone H2B
89042891
potassium voltage-gated channel,

shaker-related subfamily, beta member
potassium voltage-gated channel, shaker-
27436969
2 isoform 2
4504825
22538446
tumor protein p53 inducible protein 3
22538444
tumor protein p53 inducible protein 3
22001417
gemin 5
21536371
adaptor-related protein complex 1 sigma

22027655
2 subunit
89041736
adaptor-related protein complex 1 sigma

22027655
related subfamily, beta member 2 isoform 1

2 subunit
89042891

21362084
TBC1 domain family, member 15
89041736

21362084
TBC1 domain family, member 15
89042891
IMP1 inner mitochondrial membrane

21450679
peptidase-like
113414586
21314720
Smad nuclear interacting protein
113429091
20911035
peptidylprolyl isomerase-like 4
89042891
serine hydroxymethyltransferase 2
19923315
(mitochondrial)
89042891
18860884

isoform 2
89042891
18860884

isoform 2
7706523
isoform 1
4557719
DNA ligase I
serine hydroxymethyltransferase 2
19923315
(mitochondrial)
18860884
isoform 2
89041736

15431297
113429091
15431297
113414586

16306568
poly(A) polymerase gamma
89042891

16306566
histone H2B
89042891

15431297
113418826
(SIG-20)
15431297
113431146
DEAD (Asp-Glu-Ala-Asp) box

14251212
polypeptide 20
(SIG-20)
113429091
13376747
nucleotide binding protein-like
89042891
14043026
vesicle-associated membrane protein 8
89042891

14043026
vesicle-associated membrane protein 8
113429091
13430872
nucleolar protein 10
113414586

13430872
nucleolar protein 10
113429091
13129120
113414586
guanine nucleotide-binding protein,

11321585
beta-1 subunit
89041736
guanine nucleotide-binding protein,

11321585

beta-1 subunit
89042891

11056006
kelch-like 12
89041736

11056006
kelch-like 12
89042891

12758125
89041736

12758125
113429091
12758125
89042891

11386163
ELAV-like 4
89042891

8922905
RIO kinase 2
113429091
113414586
solute carrier family 2 (facilitated

8923733
glucose transporter), member 6
10190686
10190696
10190686
18765707
8923733
62909985

10190686
113413881

8923733
114
113429091

8923733

113431146
(SIG-20)

8923733
113418826
(SIG-20)
8922905
RIO kinase 2
113414586
10190686
6996005
7706343
113414586
7705477
113414586

7705748
TNNI3 interacting kinase
113429091
113414586
U6 snRNA-associated Sm-like protein

7706423
LSm7

7705748
89041736
7706497
cytidylate kinase
113414586
7705369
113414586

7705369
89042891
7706523

isoform 1
89042891
7706343
113429091
7705477
4557719
DNA ligase I
7706497
cytidylate kinase
113429091
8922388
89042891

7657508
ring-box 1
89042891

7705477
113431146
(SIG-20)
7705477
113418826
7706523
(SIG-20)
isoform 1
89041736

7705477
113429091
U6 snRNA-associated Sm-like protein

7706423
LSm7
113429091
7705748
89042891

7019405
host cell factor C2
89042891

7019405
host cell factor C2
89041736
AHA1, activator of heat shock 90kDa

6912280
protein ATPase homolog 1
113429091
7019319
113414586
7657315
Lsm3 protein
6996005
113414586
10190696

AHA1, activator of heat shock 90kDa

6912280
protein ATPase homolog 1
113414586

7019319
113429091
5729953
nuclear distribution gene C homolog
5902034
periodic tryptophan protein 1
89042891
113414586

4557719
DNA ligase I
113429091
H2A histone family, member Y isoform

4758496
4503729

4759156
polypeptide A
113414586

4759224
programmed cell death 5
89041736
4758384
4503729

RNA, U3 small nucleolar interacting

4759276
protein 2
113414586

4759224
programmed cell death 5
89042891

4759156
polypeptide A

113429091
4758384
89042891

4758496
4557719
DNA ligase I
4557719
62909985
DNA ligase I
4758496

113427044

4758496
4506651

4557719
DNA ligase I
113431146

4758496
(SIG-20)
113429091
4557719
DNA ligase I
18765707
RNA, U3 small nucleolar interacting

4759276
protein 2
113429091
4557719
DNA ligase I
4506007

4758496
gamma isoform
89042891

glucosamine-fructose-6-phosphate
4557719
DNA ligase I
4503981
aminotransferase
4557719
DNA ligase I
4502743
4557719
DNA ligase I
4505331

4557719
DNA ligase I
8393719

4758496
(putative)
113427529

4557719
DNA ligase I
89042891

4507311
113418826
(SIG-20)
4507311
113431146
(SIG-20)
4507369
tyrosine aminotransferase
89042891

4507311
113429091
4506631
113431146
(SIG-20)
4506631
113418826
(SIG-20)
4506631
113429091
4506629
113429091
4507311
113414586

4557719
DNA ligase I
113426831

4506629
27482992
protein HIP)
4557719
DNA ligase I
113418826
(SIG-20)
4557719
DNA ligase I
113414586
4507133
polypeptide G
113429091
4506203
proteasome beta 7 subunit proprotein
113429091
4557719
DNA ligase I
4506631
89041736
113414586

4506629
113428574
protein HIP)
113414586
heat shock 10kDa protein 1 (chaperonin

4504523
10)
4504523
10)

113429091
4503729
113431146
(SIG-20)
4503729
113418826
(SIG-20)
4505235
mannose-6-phosphate isomerase
113429091
4505773
prohibitin
113429091
alpha isoform of regulatory subunit

4506019
B55, protein phosphatase 2
89042891
4505331

89042891

4504257
H2B histone family, member A
89042891

4504261
H2B histone family, member D
89042891

4504269
H2B histone family, member J
89042891
4506019

113418826
(SIG-20)

4506019

113431146
(SIG-20)
4504263
H2B histone family, member E
89042891
4505331

89041736


4504523
10)

113431146
(SIG-20)

4504523
10)

113418826
(SIG-20)
4505997
protein phosphatase 1D
89042891

4506007
gamma isoform
4502743

89042891
113414586

4502743
89041736

148222882
113429091
113414586
S-phase kinase-associated protein 1A

25777713
isoform b
25777711
isoform a

113429091
21166389
H2B histone family, member L
89042891
4502743
113431146
(SIG-20)
4502743
113418826
(SIG-20)
25777713
isoform b
113429091
23592238
glucose transporter 14
113414586
113414586

25777711
isoform a

23592238
glucose transporter 14
113429091
4502859
113429091

25777713
isoform b
25777711
isoform a
4502743
89042891

63029943
89041736

PREDICTED: similar to Kinesin heavy chain
isoform 5C (Kinesin heavy chain neuron-
4758650
kinesin family member 5C
113413289
specific 2)
58615669
cytochrome c oxidase subunit III
89042891
58615665
cytochrome c oxidase subunit I
17981855
cytochrome c oxidase subunit I
58615673
NADH dehydrogenase subunit 5
17981853
58615673
58615663
58615673
17981862
58615673
13128862
histone deacetylase 3
62909985
58615672
113414586
8923475

thioredoxin-like 4B
63003905
protein phosphatase 1 (formerly 2C)-like
113414586
dual specificity phosphatase and pro

51491914
isomerase domain containing 1
89042891
58615672
17981862

62909985
113429091
58615666
cytochrome c oxidase subunit II
17981859
58615669
58615666
58615669
17981856
58615669
17981859

4557851
113414586
58615666
17981856

13128862
histone deacetylase 3
113429091
63029943
89042891
58615663
17981853

32813443
113414586

17981859
32813443

89042891

113429091
4507129
polypeptide E

113418826
(SIG-20)

4507129
polypeptide E

113431146
protein phosphatase 2A, regulatory

30065643
subunit B' isoform b
89041736

29725611
89041736
32813443
32813443
113429091
17981859
113429091
113431146
113418826
polypeptide E
113429091
17981856
4507129
89042891

polypeptide G
(SIG-20)
89042891

4505947
(SIG-20)

30065643

29725611

4507129


29725611


30065643
(SIG-20)

113429091
113414586
polypeptide E
RNA pseudouridylate synthase domain
27734887
RNA pseudouridylate synthase domain
containing 3
14249470

31542539
member 3
containing 4
113429091

32967280
isoform 1
32967276

10190696
8923475
113413881
114
thioredoxin-like 4B
113414586

15431295
113429091
8923475
thioredoxin-like 4B
113429091
adaptor-related protein complex 4,

21361394
sigma 1 subunit
89041736

general transcription factor IIH,

19923732
polypeptide 3, 34kDa

113431146
(SIG-20)

19923732

113418826
(SIG-20)
8923475
thioredoxin-like 4B
113418826
(SIG-20)
8923475
15431295
5802970
thioredoxin-like 4B
113431146
(SIG-20)
113414586
AFG3 ATPase family gene 3-like 2
113414586
21396484
H2B histone family, member H
89042891

8393719
(putative)

113429091
15431295
113418826
(SIG-20)
15431295
113431146

19923732

113429091
adaptor-related protein complex 4,

21361394
(SIG-20)
sigma 1 subunit
89042891

19923732
113414586

5802970
AFG3 ATPase family gene 3-like 2
113429091

113429091
113427529
113414586

113419590
protein S28
113429091

113427093

113429091
29742309

88987217
89041601
L31
protein S26
88982349

113420084
L31
protein S26
protein S26
88987217
88980535
protein S26
113420393
protein S26
protein S26
113420084

113420393
protein S26
88982349

113429091
protein S26
protein S26
PREDICTED: similar to APG4 autophagy 4
113413585
homolog B isoform a
113414586
113414586
PREDICTED: similar to ribosomal

29742309
protein L31
113427093
protein L31

4758754
napsin A preproprotein
89042891

113427613
protein L31
113414586

113422777
113431146

89042897
protein L26 (Silica-induced gene 20
protein) (SIG-20)
113418826
(SIG-20)
4504265
H2B histone family, member G
89042891

4504271
H2B histone family, member K
89042891

113420393
protein S26
88987217

113430282
protein S26
protein S26
88987217
protein S26

113429091

113418826
(SIG-20)
113431146
protein) (SIG-20)

113429091

88982349
protein S26
88980535

113420084
protein S26
88980535
protein S26
protein S26
89041601

89041601
protein S26

113420084

88982349
protein S26

51467029
protein S26
113414586
113414586

113427529

113429091
51467029

113430282
protein S26
89025350

113429091

113418084

113429091
protein S26
subunit, 58kDa
113418086
subunit, 58kDa

51467029
protein S26

113418826
(SIG-20)

113431146
protein) (SIG-20)
51467029

89041601
protein S26
89025350
89025350
88980535
protein S26

family member 3 (Rab GTPase-activating protein PRC17) (Prostate
cancer gene 17 protein) (TRE17 alpha
113426831
protein) isoform 1
89042891

113431146
protein) (SIG-20)
89041736


89041736

113418826

113429091
(SIG-20)
113419590
protein S28
113414586
113414586

113427044
4506651

89042891
89042328

89042891
S18 isoform 4
41150652
S18 isoform 1
4505289
diphosphomevalonate decarboxylase
89042891


113427529

113418826
(SIG-20)

113431146
113428574
protein) (SIG-20)

113427529
27482992
protein L29 (Cell surface heparin-

113430282
binding protein HIP)
protein HIP)
protein S26
113420393

113429091
protein S26
89035017
PREDICTED: similar to DNA primase

113418086
large subunit, 58kDa
113414586
113414586
PREDICTED: similar to DNA primase

113418084
large subunit, 58kDa

4506651
113418826
(SIG-20)

113431146
protein) (SIG-20)

113427044


113427044

113418826
(SIG-20)
4506651
113431146

113429091
(SIG-20)
113427044

4506651
113429091

113429091
113418682
113414586
113414586

89035017

113429091
88953813
88943041

113418682
113414586

113430282
protein S26
89041601

113430282

protein S26
88980535
protein S26
4506651
113427529

113427529
113427044

89042891

89041736

113429091

PREDICTED: similar to kidney-specific
89040714
protein (KS)

family member 3 (Rab GTPase-activating protein PRC17) (Prostate
cancer gene 17 protein) (TRE17 alpha
113426831
protein) isoform 1
89041736

89041601
88987217

88987217
protein S26
protein S26
protein S26
88980535

113430282
protein S26
113429703
protein S26
4758754
napsin A preproprotein
89041736

113429703

protein S26
88982349
protein S26
113429703
protein S26

113420084

113420084
protein S26
88982349

89025350
88987217
113427613
88953906
88959151
88953906
89034184
88959151
89034184
88959151
protein S28
88953906
88953906
protein S28
113422526

113420393
protein S28

113423050
protein S28

113423050


88959151
protein S28

89034184


113422526
protein S28

113423050
protein S28

113423050
protein S28

89034184
L31

113422526
protein S26

113422526
protein S26

113429091
protein S26

protein S26
89025350
89042891
PREDICTED: similar to aortic preferentially

113414263
expressed gene 1

89042891

113418826
(SIG-20)

113431146
protein) (SIG-20)
89042891

113431146
protein) (SIG-20)

113427613
L31

113427613
protein L31

113418826
(SIG-20)
4506651
113427044


113429091
27482992

113429703
protein HIP)
protein S26
89025350


113429091

113428574

113429703
protein S26
89041601

113429703
protein HIP)

protein S26
88980535
protein S26
Appendix D Concatenated Filtered Alignment
The alignment follows on the subsequent pages.
10
Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278
20
30
40
50
60
70
80
R A R M E D L L K R R F F Y DQ S F A I Y - - - - - - -G G I T G QY D F G P MG C A L K S NM I NT WR Q F F V L E E QM L E V DC S I L T P E P V L K A S G HV D
R K A V V NT L E R R L F Y I P S F K I Y - - - - - - - S G V A G L F DY G P P G C A I K S NV L S F WR QH F I L E E NM L E V DC P C V T P E V V L K A S G HV D
R E N L E S V L K R R F F F A P A F E L Y - - - - - - -G G V S G L Y DY G P P G C A F QA N I V DV WR K H F I L E E DM L E V DC T M L T P Y E V L K T S G HV D
R T V L D S M L R R R L F Y T P S F D I Y - - - - - - -G G V S G L Y DY G P P G T A L L NN I V D L WR K H F V L E E DM L E V DC T M L T P H E V L K T S G HV D
R T L F E S L L K R R L F Y T E S F E I Y R T S G N L T G D S R G L Y DY G P P G C A L Q S N I V D L WR K H F V L Q E DM L E L DC T I L T P E E V F K T S G HV D
R A K M E D L I K R R F F Y DQ S F A I Y - - - - - - -G G I T G Q F D F G P MG C A L K S NM I H L WK K F F I L Q E QM L E V E C S I L T P E P V L K A S G HV E
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R L K L E D L L K R R F F Y DQ S F A I Y - - - - - - -G G V T G L Y D F G P MG C A L K A NM L QQWR K H F I L E E G M L E V DC T S L T P E P V L K A S G HV D
R L K L E D L L K R R F F Y DQ S F A I Y - - - - - - -G G V T G L Y D F G P MG C S L K A NM L Q E WR K H F I L E E G M L E V DC T S L T P E P V L K A S G HV D
R D S L E QT L K R R F F F A P S F E I Y - - - - - - -G G V A G L F D F G P P G C A F QNNV I DA WR K H F I L E E DM L E V E A T M L T P HDV L K T S G HV D
R E K L E S V L R G R F F Y A P A F D L Y - - - - - - -G G V S G L Y DY G P P G C S F QA NV V DQWR K H F I L E E DM L E V DC T M L T P Y E V L K T S G HV D
R QQM E DT L K R R F F Y G QA F E L Y - - - - - - -G G V S G L Y D F G P V G C M L K NN I I S E WK QH F I L HDQM L E I E C T M L T P E P V L R A S G H I E
K S T L DA L L A R R F F F A P S F E I Y - - - - - - -G G V A G L Y DY G P T G S A L QA N I L DA WR K HY I I E E DM L E L DT T I MT L S DV L K T S G HV D
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M L E I S A T C L T P Y N P L K A S G HV D
R E A L E N L L K R R F F I A P S F E I Y - - - - - - -G G V A G L F DY G P P G C A L K S E V E S F WR R H F V L A E DM L E I S A T C L T P Y N P L K A S G HV D
R T K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I L QV WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R E S L E QV L K R R F F F A P A F E I Y - - - - - - -G G V S G L Y DY G P P G C A L QA N I MDT WR K H F I L E E DM L E V DC T M L T P H E V L K T S G HV D
R A G L E D L MK R R F F I T Q S F S I Y - - - - - - -G G QA G L Y DY G P P G C A V K A N L I N L WR QH F V L N E DM S E V DC V S V T P E QV L K A S G HV A
R A K M E D L L K R R F F Y DQ S F A I Y - - - - - - -G G I T G QY D F G P MG C A L K S N I L A L WR QY F A L E E QM L E V DC S I L T P E P V L K A S G HV E
R A K M E D L L K R R F F Y DQ S F A I Y - - - - - - -G G I T G QY D F G P MG C A L K S N I L S L WR QY F A L E E QM L E V DC S I L T P E P V L K A S G HV E
QQQ I E Q I L K K R F F I T Q S A Y I Y - - - - - - -G G V S G L Y D L G P P G L S I K T N I L S L WR K H F V L E E DM L E I E T T T M L P HDV L K A S G HV D
K A K L D E I L K QR NMV I Q S Y E I Y - - - - - - -G G I A G L Y DMG P L G C A L K QN I L Q F WR K H F T T Y E N F F E V E G P I L T P K C V L A A S G HT A
R V K M E DT L K R R F F Y DQA F S I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QA WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R E S L E S V L K R R F F Y A P A F E L Y - - - - - - -G G V S G L Y DY G P P G C S F QA N I V DV WR K H F V L E E DM L E V DC T M L T P Y E V L K T S G HV D
R T E F E DT C R R R F F Y G L A F D P Y - - - - - - -G G T A G L Y D L G P T MC A MK S NM L H F WR QH F V I E E S MC E V DT T C L T P E E V F K A S G HV T
R G A L DT I L R R R M F Y T P S F E I Y - - - - - - -G G V S G L Y DY G P P G C A L QA N I I DA WR K H F V L E DDM L E V DC S V L T P A DV L K T S G HV D
Y E K V F E L A K R R G F L WN S F E L Y - - - - - - -G G S R G F Y DY G P L G S T L K R R I E QV WR E F Y V I Q E G HM E I E C P T I G I E E V F I A S G HV G
R V K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G E C E - - - -G P G L G S L A P WA V S S DR S V L R L P Q S L A G R R C S L G W P E - - - - - - - - -
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QA WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
K G A L E S M L R R R M F F A P S F D I Y - - - - - - -G G V A G L Y DY G P P G C A L QA N I I D I WR K H F V L E E DM L E V DC T A L T P HDV L K T S G HV D
R QA V V NT L E R K L F Y I P S F K I Y - - - - - - -R G V A G L Y DY G P P G C A V K A NV L A F WR QH F V L E E NM L E V DC P C V T P E V V L K A S G HV E
R A K L G Q L L E G R L F Y I P S F K I Y - - - - - - -G G V A G L Y DY G P P G C A V K S NV QQ F WR QH F V L E E S M L E V E C P A V T P E P V L R A S G HV E
R E A L E QV I K R R F I Y Q P A F S L Y - - - - - - -G G V A G L Y DY G P V G C A I K T N I E QY WR E H F I I E E D L F E I A A T I L T P E P V L K A S G HV D
R E S L E QV L K R R F F F A P A F D I Y - - - - - - -G G V S G L Y DY G P P G C A F QA NV V DT WR K H F V L E E DM L E V DC T M L T P HDV L K T S G HV D
R T K L E N L V K R K F F Y T N S F E I Y - - - - - - -G G A S G L F DY G P S G C L L K S E L E N L WR C H F I Y Y D E M L E I S G S C V T P Y QV L K T S G HV D
R T K I DN L A K R K L F Y T N S F E I Y - - - - - - -G G S S G L I DY G P S G C L L K S E L E N L WR Y H F I F Y D E M L E I S A T C I T P Y T V L K T S G HV D
R S K L E S L I K R R L F Y T N S F E I Y - - - - - - -G G V S G L I DY G P S G C L L K Y E L E K L WR NH F V F Y D E M L E I K G T C I T P Y S V L K T S G HV D
R QA V V NT L E R R L F F I P S F K I Y - - - - - - -R G V A G L Y DY G P P G C A V K S NV L A F WR QH F V L E E NM L E V DC P C V T P E V V L K A S G HV D
R DK L E S T L R R R F F Y T P S F E I Y - - - - - - -G G V S G L F D L G P P G C Q L QNN L I R L WR E H F I M E E NM L QV DG P M L T P Y DV L K T S G HV D
R T Q F E E L MK K R F F F S P S F Q I Y - - - - - - -G G I S G L Y DY G P P G S A L Q S N L V D I WR K H F V I E E S M L E V DC S M L T P H E V L K T S G HV D
R A K M E DT L K R R F Y Y DQ S Y A I Y - - - - - - -G G V S G L Y D F G P T G C A MK A N F I N I WR NH F I I E E G M L E V D S A I L A P E NV F K A S G HV E
R T K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P MG C A L K NN I L QV WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R K Y F E D L I K R R Y F F NQG F E I Y - - - - - - -G G V A G L Y DY G P P G C A I K NN L L K L WR E H F I L E E DM L E I S S T C I T P Y P V F K A S G HV D
K V DC E N L L R R R F F Y T N S F E I Y - - - - - - -G G S A G L F D F G P P G C A L K S E L E R L WR E H F V V F D E M L E V S C T C I T P H P V L K S S G HV D
K V DC E N L L R R R F F Y A N S F E I Y - - - - - - -G G S A G L F D F G P P G C A L K S E L E R L WR E H F I V F D E M L E V S C S C I T P H P V L K S S G HV D
R A T A E D L E V S G F F WV P S F E I Y - - - - - - -G S V A G I Y D L G P T G C A I E R N F L QK WR DH F V L E DDM L E V R C S A L T P R P V L DA S G HT E
R S E F E DT C R R R F F F G L A F D P Y - - - - - - -G G S A G L Y DMG P P L C A MK A N L L A HWR QH F V L A E S MC E V DT T C L T P Q E V F V T S G HV T
R A E F E DT C R R R F F F G L A F D P Y - - - - - - -G G S A G L Y D L G P P L C A MK A N L L S Y WR QH F V L E E NMC E V DT T S L T P E E V F K A S G HV V
R S Q L E V L MT K R F F Y I Q S F E I Y - - - - - - -G G V G G L Y DY G P T G A A L QA N I I NQWR NH F I I E E E M L E L DT T I MT L S DV L K T S G HV D
R E T L DA V L K R R F F Y A P A F E I Y - - - - - - -DG V S G L Y DY G P P G C A L QT R I I DT WR DH F V L E DDM L E V DT T M L T P H E V L K T S G HV D
90
100
110
120
R F A D L MT K DV K NG E C F R L D P I T G ND L T E P I E F N L M F G T Q I G P L R
K F T D L MV K D E K T G T C Y R A D P DT K N P L S D P Y P F N L M F QT S I G P MR
K F S DWMC QD P K S G E I F R A D P V T G E T L E P P K A F N L M F E T A I G P L R
K F A DWMC K D P K T G E I F R A D P T T DG N L L P P V A F N L M F QT S I G P L R
K F E DWMC K D F K K G D F L R A D P DG DA P V S S P V P F N L M F K T T V G P L R
R F A D L MT K D I K T G E C F R L D P I S G ND L T P P I E F N L M F NT Q I G P L R
K F A D F MV K D L K NG E C F R A D P T T G ND L S P P V P F N L M F K T F I G P L R
R F A DWMV K DT K NG E C F R A D P I T G ND L T E P I A F N L M F P T Q I G P L R
R F A DWMV K DMK NG E C F R A D P I T G ND L T E P I A F N L M F P T Q I G P L R
R F S DWMC K D L K T G E I F R A D P S T G G K L E P P V E F N L M F DT A I G P L R
K F S DWMC R D L K T G E I F R A D P V T G E P L E P P MA F N L M F E T A I G P L R
K F A D F MV K DV K NG E C F R A D P NT G ND L S P P V S F N L M F K T F I G P L R
R F A D L MV K D E K T G A C F R A D P V T G N E I S D P MD F N L M F QT T I G P L R
K F A DWMV K DV K NG E I Y R A D P T T G N E V S E P V E F N L M F E S N I G P L R
R F T D S M I T D I K T N E Y Y R A D P - S G G E W S E P Y P F N L M F R T K I G P MR
R F T D S M I T D I K T N E Y Y R A D P - S G G E W S E P Y P F N L M F R T K I G P MR
K F A DY MV K DV K NG E C F R A D P S T G ND L T P P I S F N L M F QT S I G P L R
K F A DWMC R D L K T G E I F R A D P A T DG P L E L P I E F N L M F E T A I G P L R
K F A D F MV K D E V T K A F F R A D P E T G NA L T E P Y P F N L M F QT Q I G P L R
R F A D L MV K DV K T G E C F R L D P L T G ND L T E P I E F N L M F A T Q I G P L R
R F A D L MV K DV K T G E C F R L D P L T G ND L T E P I E F N L M F A T Q I G P L R
K F C D I L V F D E V S G DC F R A DT - L G NK L S K S QQ F N L M F G T Q I G Y L R
K F S DY MV K D L K NG C C Y R A D P DT G ND L S E P L A F N L M F A T D I G P L R
K F A D F MV K DMK NG E C F R A D P I T G ND L S P P V S F N L M F K T S I G P L R
K F A D F MV K DV K NG E C F R A D P I T G ND L S P P V S F N L M F K T F I G P L R
K F S DWMC K D P K T G E I F R A D P V S G DK L E P P R A F N L M F E T A I G P L R
R F NDV MV R DT V T G E C I R A D P -K G N P F S D P F P F N L M F A T H I G P MR
K F A DWMC K D P K T G D I F R A D P A T G L L P T P P V S F N L M F S T S I G P L R
G F S D P L C E C MNC K E A F R A D P E C G G E F E DA Y E F N L M F K T T I G P L R
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S WA WR R E V A G L C
K F A D F MV K DV K NG E C F R A D P T T G ND L S P P V P F N L M F QT F I G P L R
K F A DWMC K D P K NG D I L R A D P A T G V Q P E P P V A F N L M F QT A I G P MR
K F T D L MV NDV V T K DC F R A D P V T G ND L S E P Y P F N L M F P T Q I G P L R
R F T D L L V C D S K T G T G Y R A D P E T G ND L T D P T P F N L M L P T I I G P L R
K F A DWMC K D L K T G E I F R A D P A T G G K L E P P V E F N L M F E T A I G P L R
R F T D L M I R DV V T NDC Y R A D P - L K ND L S E P F P F N L M F QT K I G P L R
R F T D L M I R DA V T G D F Y R A D P -G K ND F V G P F P F N L M F QT R I G P L R
R F T D L M I K D I V T K DC Y R A D P - E K ND L S D P F P F N L M F QT K I G P L R
K F A D F MV K DV K NG E C F R A D P T T G ND L S P P V P F N L M F QT F I G P L R
K F T DWMC R N P K T G E Y Y R A D P V T NDV L DA L T S F N L M F E T K I G A L R
K F A DWMC K D P A T G E I F R A D P A T NG E L E T P R Q F N L M F E T Q I G P L R
R F A D F MV K DG K T G E C F R A D P T T NND L S D P M E F N L M F A T A I G P L R
K F A DY MV K DV K NG E C F R A D P T T G ND L T P P I S F N L M F QT S I G P L R
R F T D L MV K DV K NG A G HR A D P DT G N E L G F P E P F N L M F G T P I G P L R
R F T D L MV K N L S NG DC Y R A D P - E G D E F S K P F P F N L M F S T S I G P L R
R F T D L MV K N L S NG DC Y R A D P - E G ND F S K P F P F N L M F S T S I G P L R
K F ND L M L T DMT T K A L Y R A D P - E G N E F S E P A P F N L M F NT R V G P L R
R F NDV MV R DT V T G E C I R A D P -K G NA L S D P F P F N L M F S T S I G P MR
R F NDA MV R DT V T G E C I R A D P -K G NA L S E P F P F N L M F S T S I G P MR
K F A DWMC K DT K T G E I F R A D P E S G N E V S E P V E F N L M F E S Y I G P L R
K F A DWMC R D L A S G E I F R A D P V T G G P L E K P M E F N L M F E T A I G P L R
130
140
150
160
P E T A QG I F V N F K R L L E F -NQG R L P F A A A Q I G N S F R N E I S
P E T A QG I F V N F K D L Y Y Y -NG K K L P F A A A Q I G QA F R N E I S
P E T A QG Q F L N F NK L L E F -NNG K T P F A S A S I G K S F R N E I S
P E T A QG Q F L N F QK L L E F -NQQ S M P F A S A S I G K S F R N E I S
P E T A QG Q F L N F K K L L DY -NQN S M P F A S A S I G K S F R N E I S
P E T A QG I F V N F K R L L E F -NQG R L P F A A A Q I G N S F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG I F V N F K R L L E F -NQG K L P F A A A Q I G L G F R N E I S
P E T A QG I F V N F K R L L E F -NQG K L P F A A A Q I G L G F R N E I S
P E T A QG Q F L N F NK L L E F -NNDK M P F A S A S I G K S F R N E I A
P E T A QG I F L N F K R L L E F -NQG K L P F S A V Q I G M S F R N E I S
P E T A QG H F V N F A R L L E F -NNG K V P F A S A Q I G K S F R N E I A
P E T A QG I F V N F K R L Y E Y -NG K K L P F S V A Q I G L G F R N E I A
P E T A QG I F V N F K R L Y E Y -NG K K L P F S V A Q I G L G F R N E I A
P E T A QG Q F L N F S K L L DC -NN E K M P F A S A S I G K S F R N E I S
P E T A QG I F T N F G K L Y E Y -NG K K L P F A A A Q I G NA F R N E I A
P E T A QG I F V N F K R L L E F -NQG K L P F A V A Q I G N S F R N E I S
P E T A QG I F V N F K R L L E F -NQG K L P F A V A Q I G N S F R N E I S
P E T A QG Q F L N F K K L C E Y -NNDK L P F A S A S I G K A Y R N E I S
P E T A QG I F T M F K R N L E F -NG G K V P F G V T Q I G NV F R N E I A
P E L A QG I I L N F K R L MD S G NA QR M P F A G A C V G T A F R N E I A
P E T A QG Q F L N F A K L L E Y -NNQQM P F A S A S I G K S Y R N E I S
P E T A QG M F V D F QR L S R F -Y R DK L P F G A V Q I G K S Y R N E I A
PAW PR A L LC - - - - LGT T - PGGR LAV A - - - - - - - - - - - -
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG Q F L N F A K L L E Y -NA G NM P F A S A S I G K S Y R N E I A
P E T A QG I F V N F K D L Y Y Y -NG QK L P F A A A Q I G QA F R N E I S
P E T A QG I F V N F R D L L Y Y -NG G K L P F A A A Q I G Q S F R N E I A
P E T A QG M F L N F A R L L E Q -NG G R V P F G A A Q I G L G F R N E I A
P E T A QG Q F L N F A K L L E F -NN E K M P F A S A S I G K S F R N E I A
P E T A QG I F V N F K K L L E Y -NG G K T P F A G A Q L G L G F R N E I S
P E T A QG I F V N F K K L L E Y -NG G K M P F A G A Q I G L G F R N E I S
P E T A QG I F V N F K K L L E Y -NG G K M P F A G A Q I G L G F R N E I S
P E T A QG I F V N F K D L Y Y Y -NG NK L P F A A A Q I G QA F R N E I S
P E T A QG Q F L N F NK L L E I -NQG K I P F A S A S I G K S F R N E I S
P E T A QG Q F L N F S R L L E F -NNG K V P F A S A MV G K A F R N E I S
P E T A QG I F V N F K R L L E F -NQG R L P F G A A Q I G T A F R N E I S
P E T A QG M F V N F NR L N E F -NG G R I P F A A A Q I G L G F R N E I A
P E T A QG I F V N F NR L L E F -NG G K I P F A A A Q I G L G F R N E I S
P E T A QG I F V N F T R L L E F -NG G K I P F A A A Q I G L G F R N E I S
P E T A QG I F V N F T R L L NA -NR G S L P F A A A QV G A G Y R N E I S
P E L A QG I I L N F K R L L DT G NA QR M P F A C A S I G T A F R N E I A
P E L A QG I I L N F K R L L D S G NA QR M P F A G A C I G T A F R N E I A
P E T A QG H F V N F QR L L E F -NNG R V P F A S A Q I G K S F R N E I S
P E T A QG Q F L N F NK L L DC -NNT K M P F A S A S I G K S F R N E I S
170
180
190
200
210
220
230
240
P R S G L I R V R E F T MC E I E H F C D P Q -A K NH P K F E NV A DT V MT L Y S A C NQA V - -A S G L V A N E T L G Y F MA R I QMY L HR I G I L P E R L R
P R QG L L R V R E F T L A E I E H F V D P E -NK S H P K F S DV A K L E F L M F P R E E QA V - -A K G T V NN E T L G Y F I G R V Y L F L T R L G I DK E R L R
P R S G L L R V R E F L MA E I E H F V D P E -NK NH P R F D E V K N L K L K F L P K G V QA V - -A S G M I DNQT L G Y F I A R I Y Q F L T K I G V D E E K L R
P R A G L L R V R E F L MA E I E HY V D P E G G K K HHR F E E V K D I E MA F L NR NV QA V - - E T G MV DN E T L G Y F I A R I Q L F L L K L G V D P NK L R
P R S G L L R V R E F L MA E I E H F V D P E G G K K HA K F D L V K D L Q L S F L DR A T QA V - - S S K MV DN E T L G Y F L G R I Y I F L L K I G V DT NK V R
P R S G L L R V R E F T MC E I E H F C D - - -V K E H P K F E S V K NT QM L L Y S A DNQA V - - S K G I V NN E T L G Y F MA R I HMY M L A V G I D P K R L R
P R S G L I R V R E F T MA E I E H F V D P S - E K DH P K F QNV A D L Y L Y L Y S A K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L V K V G V S P E K L R
P R QG L I R V R E F T MC E I E H F V D P E -DK R F P K F A K V A D E K L V L F S A C NQA V - -A NK T V A N E T L G Y Y MA R C HQ F L MK V G I DG R R L R
P R QG L I R V R E F T MC E I E H F V D P E -DK S L A K F A K V A DQK L V L F S A C NQA V - -A K K T V A N E T L G Y Y MA R C HQ F L MK V G I DG R R L R
P R A G L L R V R E F L MA E I E HY V D P E - S K S H P K F E DV K D I K L K F L P K NV QA V - - S S G MV DN E T L G Y F I A R I Y L F L V K I G V DT NR L R
P R A G L L R V R E F L MA E I E H F V D P L -DK S H P K F H E V K D I K L S F L P R N I QA V - -A S K MV DN E T L G Y F I A R I Y L F L I K I G V DDT K L R
P R S G L I R V R E F T MA E I E H F V D P S - E K E H P K F QNV A D L H L Y L Y S A K A QA V - -DQG V I NN S V L G Y F I G R I Y L Y L T K V G V S P DK L R
P R S G L I R V R E F QMG E I E H F V D P L -R K E H P L F S T V K D I K V P L Y S S K A QA V - - E DG T I DN E T L G Y F MG R I Y L F C V K V G I D P F K F R
P R QG L L R V R E F T MA E I E HY V D P L -DK R HA R F N E V K DV V L T L L A K G V QA V - -A E G I V DN E T L G Y F L G R T Q L F L T K I G I D P A R L R
P R NG L L R V R E F QMA E I E H F I H P D -R K DH P K F DDV A F K C L P L Y S S K T QA V HG E E K I I NN E T L A Y F L S R T Y D F L I S I G I N P DG I R
P R NG L L R V R E F QMA E I E H F I H P D -R K DH P K F DDV A L K C L P L Y S S K T QA V HG E E K I I NN E T L A Y F L S R T Y D F L I S I G I N P DG I R
P R S G L I R V R E F T MA E I E H F V D P N - E K V H F K F S NV A D L D I M L Y S S K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L V K V G V A K DK L R
P R A G L L R V R E F L MA E I E HY V D P D -NK S H S R F D E I K D L K L K F L P K G V QA V - - S S G MV DN E T L G Y F L A R I Y S F L I K I G V D P S R L R
P R A G L L R V R E F T MA E I E H F V N P N -NK T H P K F N E I K DV E A N L L S S D S QA V - - E K K L I DN E T L A Y F MA R T QQ F L HT V G I K P A G L R
P R S G L I R V R E F T MA E I E H F C D P V - L K DH P K F G N I K S E K L T L Y S A C NQA V - -A S K L V A N E T L G Y Y MA R I QQ F L L A I G I K P E C L R
P R S G L I R V R E F T MA E I E H F C D P T -QK DH P K F G NV K D E K MT L Y S A C NQA V - - S A K L V A N E T L G Y Y MA R I QQ F L L A I G I K P E C L R
P R S G L L R V R E F DQA E I E H F V L T D - E K DH P K F S T V QG I K L K L MHHDA S A I - - E R G I V C N E T MG Y Y I G R T A L F L I E L G I DR E L L R
P R NG L L R V R E F T L A E I E Y F V L P D -K K T H S N F S DV E N L S V Q L Y P R E L QA V - -NDG I I N S Q L L A Y F MG R T F K F L I E L G I P A E H I R
P R S G L I R V R E F T MA E I E H F V D P S - E K NH P K F Q S V A D L N I L L Y S S K A QA V - -QQG V I NN S V L G Y F I G R I Y L F L T K V G V S P DK L R
P R S G L I R V R E F T MA E I E H F V D P S - E K DH P K F QNV A D L H L Y L Y S A K A QA V - - E QG V I NNT V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R S G L L R V R E F L MA E I E H F V D P N -DK S HK R F QD I K D I K L K F L P R E V QA V - -A T K L V DN E T L G Y F I A R I Y Q F L I K I G V D P E R L R
P R S A L I R V R E F T L A E I E H F V N P S -NK NH E K F DR V R DV E I WA W P R H F QA V - - E A K V I DNQT L G Y F MG R V A L F L T S I G V - -R F Y R
P R S G L L R V R E F L MA E I E H F V D P E S G K K H P R F A E V A D I E L E L L DR E T QA V - -K DG L V DN E T L G Y F L A R I H L F L E K I G V DK S K L R
P R QG V I R L R E F T QA E C E L F V D P R -NK K H P N F E R F A DK E L V L Y S QA A QA V - - E T G V I A H E I L G Y N I A L T N E F L T K V G I D P E K L R
- - - - - - - - - - - - - - - - -R I C S P R -C QK P P - - - - - - - - - L E L L T S S L R P - - -A R G V I NN S V L G Y F L G R I Y L F L T K A G V C A E R L R
P R S G L I R V R E F T MA E I E H F V D P T - E K DH P K F Q S V A D L C L Y L Y S A K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R G G L L R V R E F L MA E I E H F V D P A G HK K H E R F H E V A D I E L A L L DR NV QA V - -K QK I V DN E T L G Y F L A R I H L F L K K I G V DQ S K I R
P R QG L L R V R E F T L A E I E H F V D P E -DK S H P K F V DV A D L E F L M F P R E L QA V - - S K G T V NN E T L G Y F I G R V Y L F L T R L G I DK NR L R
P R A G L L R V R E F T QA E I E H F V H P E -HK E H P R F A E V A DT V L S L F S QDA QA V - - S K G I I A N E T L G Y F I A R C H L F L V Q I G I DT NR L R
P R G G L L R C R E F QMA E I E Y F V D P T E K S T F K K F NK Y I N L E I P L L S R Q L QA V - -K E G I I NN E T L A Y F I C R T Y L Y L V E I G I N P V N I R
P R A G L L R V R E F L MA E I E H F V D P N -DK S H P K F K DV QD I K L R F L P K DV QA V - - S S G MV DNQT L G Y F L A R V Y Q F L I K V G V DT DR L R
P R NG L L R V R E F QMA E I E Y F V N P K -K K NH E K Y Y L F K Y L M L P L Y P R DNQA V - - E K N I I A N E A L A Y F L A R T Y L F L L K C G I NK DG I R
P R NG L L R V R E F E MA E I E Y F V N P E -K K C H E K Y H L F K H L I L P L Y P R E E QA V - -T K G I I A N E A L A Y F L A R T Y L F L L K C G I NK DG L R
P R NG L L R V R E F E MG E I E Y F F N P E -K S K H E K Y D L Y K H L V L P L Y P R T NQA V - -NNG I I C N E A L A Y F L A R T Y L F L L K C G I K K DG I R
P R QG L L R V R E F T L A E I E H F V D P E -DK S H P K Y S E V A D L E F L M F P R E QQA V - - S K G I V NN E T L G Y F I G R V Y L F L T H L G I DK DR L R
P R S G L I R V R E F T MA E I E H F V D P T - E K DH P K F P S V A D L Y L Y L Y S A K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R S G L L R V R E F L MA E I E H F V D P L -NK S HA K F N E V L N E E I P L L S R R L QA V - -N S G MV E N E T L G Y F MA R V HQ F L L N I G I NK DK F R
P R S G L L R V R E F L MA E V E H F V D P K -NK E HDR F D E V S HM P L R L L P R G V QA V - -K K G I V DNT T L G Y F MA R I S L F L E K I G I DMNR V R
P R S G L L R V R E F T MC E I E H F I D P T -NK DH P K F DT V A N L A I P L F P V DR QA V - - E K G M I K S R V L G Y F MG R T F L F M I K V G I D P K K L R
P R S G L I R V R E F T MA E I E H F V D P K - E K V HQK F A NV A D L E I L L Y S S K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L I K V G V A K DK L R
P R NG L Y R V R E F DMA E I E H F F D P K -R P E H P K F K Y V K D L K L P L L T A K S QA V - -K S G T V S N E T HA Y F I G R T F L F L V E A G V NQNN I R
P R NG L L R V R E F P MA E I E Y F V N P K - F K T H E K F P E F K NT V L P L L T R DQQA V - - S S G I V G N E A L A Y F L A R T F L F L K R V G I N E A G L R
P R NG L L R V R E F P MA E I E Y F V N P K - F K T H E K F P E F S HV V L P L V T R DQQA V - - S S G MV G N E A L A Y F L A R T F L F L K R V G I N E A G L R
P R NG L V R C R E F QMA E I E H F A D P E Q L NN F P K F E T V K N L K V K L F P A S I QA I - -A QHV V S HK T L G Y Y I G R V Y L F L C E I G I Q P DT I R
P R A N L I R V R E F T L A E I E H F V N P N -DK T H E K F A L V K DV E I WMWA R K QQA V - -A QK I I DN E T L A Y F I A R T A Q F L E A V G A - -R Y V R
P R A N L I R V R E F T L A E I E H F V N P N -DK S H E K F E S V R G T E F WA W S R E L QA V - -A K K I I DN E T L G Y F I A R T V L F L E A V G L - -R F L R
P R A G L L R V R E F T MA E I E H F V D P E -DK NHDR F D E V K H I NV P L L A K DV QA V - - S A G I I DNQT L G Y F I G R I Y L F L V K I G I DA T R L R
P R S G L L R V R E F T MA E I E H F V D P L -DK DHHR F D E V K DV K L R F L A K DV QA V - - E T G L V DNK T L G Y F L A R I Y L F L I K I G V N P DR L R
[Alignment column ruler: 260, 270]
[Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264]
F R QHMG N E MA HY A C DC WDA E C L T - S Y G W I
F R QH L A N E MA HY A A DC WDA E I E S - S Y G W I
F R QHM S N E MA HY A T DC WDA E L K T - S Y G W I
F R QHMA N E MA HY A A DC WDA E L L T - S Y G W I
F R QHMA N E MA HY A T DC WDA E L QT -T Y G W I
F R QHMG N E MA HY A C DC WDA E C L S - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QH L S N E MA HY A QDC WDA E I L T - S Y G W I
F R QH L S N E MA HY A QDC WDA E I L T - S Y G W I
F R QHM S N E MA HY A S DC WDA E L E T - S Y G W I
F R QHMA N E MA HY A A DC WDA E L K T - S F G W I
F R QHMQN E MA HY A C DC WDA E C K T - S Y G WV
C R QHMA N E MA HY A T DC WD F E I Q S - S Y G W I
F R QH L S T E MA HY A S DC WDA E V L T - S Y G W I
F R QH L S T E MA HY A S DC WDA E V L T - S Y G W I
F R QHMDN E MA HY A C DC WDA E T K T - S Y G W I
F R QHM S N E MA HY A A DC WDA E L HT - S Y G W I
F R QHQK N E MA HY A QDC WDA E I L S - S Y G WV
F R QHM S N E MA HY A C DC WDA E I L T - S Y G WV
F R QHM S N E MA HY A C DC WDA E I L T - S Y G WV
F R QHK K D E MA HY A K G C WDA E I Y T - S Y G W I
F R QH L K T E MA HY A K DC WDA E I R L - S Y G WV
F R QHMA N E MA HY A A DC WDA E L QT - S Y G W I
F R QHQ S T E MA HY A QDC WDA E L L T - S Y G W I
F R QHMA N E MA HY A C DC WDA E L L T - S Y G W I
F R QH L T D E MA HY A I DC WDA E I E T DR F G WV
F R QHMDN E MA HY A C DC WDA E A R T - S Y G W I
F R QHMG N E MA HY A C DC WDA E L L T - S S G WV
F R QH L P N E MA HY A A DC WDA E I E C - S Y G W I
F R QH L K H E MA HY A A DC WDA E I QC - S Y G W I
F R QHQA D E MA HY S S DC WDA E I E M - S S G WV
F R QHMG N E MA HY A S DC WDA E L QT - S Y G W I
F R QH L K T E MA HY A NDC WDA E I L T - S F G F I
F R QH L P T E MA HY A NDC WDA E I L T - S Y G F I
Y R QH L E K E MA HY A NDC WDA E I L T - S Y G Y I
F R QH L A N E MA HY A A DC WDA E I E S - S Y G W I
F R QH L K N E MA HY A T DC WDG E I L T - S Y G W I
F R QHM S N E MA HY A C DC WDA E I QC - S Y G W I
F R QHM F N E MA HY A T DC WDA E T K T - S Y G WV
F R QHMDN E MA HY A C DC WDA E T K T - S Y G W I
F R QHM S N E MA HY A C DC WDA E I E F - S HG F K
F R QHMA N E MA HY A S DC WDA E I L T - S Y G WV
F R QHT A N E MA HY A S DC WDA E I L T - S Y G WV
F R MHR K N E MA HY A R E C WDA E I Y T K T L G W L
F R QH L R N E MA HY A QDC WDA E L L T - S Y G WV
F R QHQR D E MA HY A QDC WDA E L L T - S Y G WV
F R QHM S N E MA HY A S DC WDA E I HT - S Y G W I
F R QHM S N E MA HY A T DC WDA E L HT - S Y G W I
[Alignment column ruler: 280, 290, 300, 310, 320, 330]
E C V G C A DR S A Y D L T QHT NA T - - - - - - -G V K L V A E K K L P A P K A A I G K A F K K E A K A
E C V G I A DR S A Y D L R A H S DK S - - - - - - -G T P L V A E E K F A E P K K E L G L A F K G NQK N
E C V G C A DR S A Y D L T V HA NK T - - - - - - -K T A L V V R E K L DV P K K L F G P K F R K DA P K
E C V G C A DR S A Y D L T V HK NK T - - - - - - -G A P L V V R E P R A E P K K K F G P R F K K DG K A
E C V G C A DR S A Y D L T V H S R K T - - - - - - -K E P L V V R E P R R E P K P K L G P L F K K NA K A
E C V G C A DR S A Y D L T QHT K A T - - - - - - -G I R L A A E K K L P A P K A A I G K A F K K D S QA
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K L
E C V G NA DR A C Y D L QQHY K A T - - - - - - -NV K L V A E K K L P E P MA L L G K K Y K K E A K K
E C V G NA DR A C Y D L QQHY K A T - - - - - - -NV K L V A E K K L P E P MA L L G K S F K K DA K K
E C V G C A DR S A Y D L S V H S A R T - - - - - - -G E K L V A R QT L A E P K K K F G P K F R K DA G T
E C V G C A DR S A Y D L T V HA NK T - - - - - - -K E K L V V R QK L E T P K K L F G P K F R K DA P K
E C V G C A DR S C Y D L K C H S QA A - - - - - - -K V N L S A E R P L P E P K QA V G K A F K K DA K K
E C V G C A DR S A Y D L T V H S V R T - - - - - - -K Q P L R V QQR L DQ P A K A F G MK F K K DA T M
E C A G HA DR S C Y D L L QH S K A T - - - - - - -K T D L F A S E K Y D E P K P L I G K T F K Q E A S L
E C A G HA DR S C Y D L L QH S K A T - - - - - - -K T D L F A S E K Y D E P K P L I G K T F K Q E A S L
E I V G C A DR S C Y D L L C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K F
E C V G C A DR S A Y D L S V H S A R T - - - - - - -N E K L V V R Q P L P E P K K K F G P K F R K DA G T
E C V G HA DR S C Y D L K V HA T E S - - - - - - -K S N L S A Y E E F K E P P G A I S K K HR A A V S P
E C V G C A DR S A Y D L G QHT A A T - - - - - - -G V R L V A E K R L P A P K QA L G K T F K K E A K N
E C V G C A DR S A Y D L G QHT A A T - - - - - - -G V R L V A E K R L P A P K QA L G K T F K K E A K T
E C V G I A DR A C Y D L S C H E DG S - - - - - - -K V D L R C K R R L A E P K K E WG A K L R DR F S V
E C V G HA DR G D F D L S NHA R C S - - - - - - -K V DQ S V F I A Y D E P K G V MG K K Y K K D S QK
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L I A E K L L K E P K G A I G K A Y K K DA K V
E C V G C A DR S A Y D L T V H S NK T - - - - - - -K E K L V V R E A L E T P K K L F G P K F R K DA P K
E C V G I A DR S A Y D L T QH S NA S - - - - - - -K K D L C A R E E Y D E P K G L I G K T F G K K A G E
E C V G C A DR S A Y D L S V HA K K T - - - - - - -NA P L I V R QR L P E P K K K F G P K F K K DA K A
E I V G I A DR T DY D L K A HA R V S - - - - - - -K T D L Y V Y V E Y D E P MG K L G P L F K G K A K A
E I V G C A DR S C Y D L T C H S R A T - - - - - - -K V P L V A E K L L R E P K A A I G R T Y K K DA R L
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A V G K A Y K K DA K L
E C V G C A DR S A Y D L T V HA K K T - - - - - - -G A P L V V R E T L E T P S K K F G P T F R K DA K T
E C V G I A DR S A Y D L R A H S DK S - - - - - - -G V P L V A H E K F S K P K K D L G L A F K G NQK M
E C V G L A DR S A F D L K A H S DK S - - - - - - -K V D L V A Y E R F DK P K K V L G K A F K K DA K P
E C V G L A DR S A Y D L NA H S E A T - - - - - - -G QK L QA A R K F K V P K QK I G K E L K K DG MA
E C V G C A DR S A Y D L S V HA A R T - - - - - - -NA S L V V R Q P L P E P K K K F G P K F K K DG G A
E V V G HA DR S A Y D L QHHMK Y T - - - - - - -G A N L Y A C E K Y N E P K A K I G HT F K S E QNK
E V V G HA DR S A Y D L K NHMK V T - - - - - - -G A N L Y A C E K Y DT P K A K MG MK F K S QQNV
E C V G HA DR S A Y D L K HHMNA T - - - - - - -G S N L Y G C QK Y DK P K A K I G MK F K S DQNK
E C V G I A DR S A Y D L R A HT DK S - - - - - - -G V P L V A H E K F S E P K K E L G L S F K G NQK K
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A V G K A Y K K DA K L
E C V G C A DR A A F D L T V H S K K T - - - - - - -G R S L T V K QK L DT P K K F F G S K F K QK A K L
E C V G C A DR S A Y D L S V H S K A T - - - - - - -K T P L V V Q E A L P E P R K K F G P R F K R DA K A
E C V G NA DR S C Y D L T C HA K H S - - - - - - -K V A MV A E K K L P E P K S V MG K A F K K E A K V
E I V G C A DR S C Y D L A C HA R V T - - - - - - -K V P L V A E K P L K E P K G A L G K A Y K K DA K I
E C V G L A NR S A F D L E S HT K G S - - - - - - -G V K L L A A R R L P E P K K E I F K A L K G DG N E
E V V G HA DR MA Y D L MC H S K S T - - - - - - -N S Q L V A HHR Y DN P K P E I G K A F K S DQK I
E V V G HA DR MA Y D L MC H S K S T - - - - - - -N S Q L V A HHR Y DT P K P E I G K A F K A DQK I
E C V G I A DR Q S WD L S R HA K Y T T K K G DA E S S P L Y L S A P L DT P K S A I G K I F R K DA K E
E C V G V A DR S A Y D L T QH S G A T - - - - - - -K K D L C A R E E F A E P K G A I G K A F G K NA G D
E C V G I A DR S A Y D L T QH S A A S - - - - - - -K K D L F A R E E F A E P K G A I G K A F G K NA G E
E C V G C A DR S A Y D L T V H S K R T - - - - - - -K K D L V V QK A HK E P K K N L G P K F K K DA K F
E C V G C A DR S A Y D L S V H E A R T - - - - - - -K V K L QV QQK L DA P K K K F G P L L K K A A K P
[Alignment column ruler: 340, 350, 360, 370, 380, 390, 400, 410]
I T E A L P S V I E P S F G I G R I MY S L L E H S F R MR R S Y F S L P P V V A P L K C S V L P L S NNA E F A P F V K K I S S A L T S V DV S HK V DD S S G S I
V V E S L P S V I E P S F G I G R I I Y C L Y E HC F S T R L N L F R F P P L V A P I K C T V F P L V QNQQ F E E V A K V I S K E L A S V G I S HK I D I T G T S I
V E NY L P NV I E P S F G I G R I I Y A I F E H S F W S R R S V L S F P P L V A P T K V L L V P L S NNA D L A E V V T E V S R V L R K E Q I P F K V DD S G V S I
V A A A V P NV I E P S F G I G R I L Y S M I E HV Y WA R R G V L S F P P A I A P T K V L I V P L S T HA S F R P L L QQ L MT K L R R MG I S NR V DD S S A S I
V E A A L P NV I E P S F G F G R I F Y S L L E HV Y WHR R G V L S L P I S V A P T K V L I V P L S T HQD F V P I T K R I T E D L R E L G I S C R A D E S S A S I
I NDT L - - - - - - - - - - - - - - - - - - -N S F - - - - - - - - - - - - -A P MK C V V L P L S G NA E F Q P F V R D L S Q E L I T V DV S HK V DD S S G S I
V M E Y L P S V I E P S F G L G R I MY T V F E HT F QV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R HG I S HK V DD S S G S I
I QT A L P S V I E P S Y G I G R I MY A L L E H S F R QR R T F L A F K P L V A P I K C S I L P I S A NDT L V P V MDA V K E E L S HY E L S Y K V DD S S G T I
I QT S L P S V I E P S Y G I G R I MY A L L E H S F R QR R T F L A F K P L V A P I K C S V L P I S A NDT L I P V MDA V K E E L S R F E M S Y K V DD S S G T I
V E K W L P NV I E P S F G I G R I L Y S I F E HQ F WC R R G V L S L P P I V A P T K V L L V P L S NN S E L Q P I V K K V S QA L R K E K I P F K V DD S S A S I
V E A Y L P NV I E P S F G I G R I I Y S I F E H S F W S R R A V L S F P P L V A P T K V L L V P L S NHK D L A P V T A QV S K I L R K E Q I A F R V DD S G V S I
V L E Y L P NV I E P S F G L G R I MY T V F E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG V S HK V DD S S G S I
V QA H L P S V I E P S F G F G R L L Y S T L E HN F K I R R T F F S L P A V I A P Y K C S L L P L S NK P D F N P F I T S L S L A L K K L G I S HK V DT S S G S I
I K E T L P NV I E P S F G I G R I L Y C V L E HT Y WA R R G V L S L P A L V A P I K C L I V S I S QDA Q L R S K I H E V S R E MR K R G I A S R V DD S S A T I
V T E A L P G V I E P S F G I G R I I Y C L L E H S F K I R R S Y L S L P A L I A P V K C S I L P I S S NA I F ND L I N L L HK S F I NHG I S C K V DT S S A S I
V T E A L P G V I E P S F G I G R I I Y C L L E H S F K I R R S Y L S L P A L I A P V K C S I L P I S S NA I F ND L I N L L HK S F I NHG I S C K V DT S S A S I
A M E Y L P NV I E P S F G I G R I MY S I F E HT F R I R R T Y F S F P A T V A P Y K C S V L P L S QNQ E F M P F V R E L S E A L T R NG V S HK V DD S S G S I
V E NW L P NV I E P S F G I G R I L Y S I F E HQ F WA R R T V L S L P P L V A P T K V L L V P L S S NA E L Q P I V K K I S A F L R K E QV P F K V DD S S A S I
I K K Y L P HV I E P S F G L G R I I Y S I L E QNY Y T R R G V L S L P A I I A P V K A S I L P L T S S DR I A P F V QT I S K A L K E A N I S T K V DDT G NA I
I T DA L P S V V E P S F G I G R I MY S L L E H S F QC R R C Y F T L P P L V A P I K C S I L P L S NNT D F Q P Y T QK L S S A L T K A E L S HK V DD S S G S I
I T E A L P S V V E P S F G I G R I MY A L L E H S F QC R R C Y F T L P P L V A P L K C S I L P L S NNA E F Q P Y T QK L S S S L T K A E L S HK V DD S S G S I
L M E T V P DV I E P S F G I G R I L Y A L I E H S F Y L R R P V F R F K P A I A P V QC A I G Y L I H F D E F N E H I L N I K R F L T DNG L V V HV N E R S C S I
L F A Y A P NV I E P S F G V G R V L T A V L E H S F WV R K S V L S I P A S I A P V K V G L F P L L T K L E F NNK I A E I E K I C K NG F L S F K S NT T A V A I
V M E Y L P NV I E P S F G I G R I MY T V F E HT F R I R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG I S HK V DD S S G S I
V M E Y L P NV I E P S F G L G R I MY T V F E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R HG V S HK V DD S S G S I
V E A R L P NV I E P S F G I G R I I Y S I F E H S F W S R R A V L S F P P L V A P T K V L L V P L L NN P E L S K I T A QV S Q I L R K E Q I P F K V D E S G V S I
V MA Y L P S V I E P S F G I G R I L Y C L L E Q S Y WV R R A V F S F S P L L A P QK V A L L P L MV K P E L L A T I S E I R Q E L V MR G I S V R V DD S S V T I
V M E Y L P -Y F G I K I G L L R NG Y S HT Q L T - - - - - - - - - - - P W L K P QV C T - - - - - - - - - - - - - - - - L QK A C S R - - - - - - - - - - - - -
V E T V L P NV I E P S F G I G R I L Y C L L E HNY WT R R G V L S F T P V V A P T K V L I V P L S R HDD F V P F V QK I S QK L R S V G V S S R V DD S S A T I
V A DA L P HV I E P S Y G I DR I F Y G I M E HA F D E E R L V MH F S S A V A P V QV A V L P L L T R K E L A D P A K E I I A K L R E K T L L V NY DD S G -T I
V L DY L P NV I E P S F G L G R I MY T V F E HT F R V R R T F F S F P A I V A P Y K C S V L P L S QNQ E F A P F V R E L S E A L T R NG V S HK V DD S S G S I
V L E Y L P S V I E P S F G L G R I MY T I L E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG V S HK V DD S S G S I
V E A A I P NV I E P S F G I G R I L Y S L L E HNY WV R R G V I S F P P A V A P V K V L I V P I S S K A E F A P HV R R L S QK L R S V G I S S R V DD S S A S I
V V E A L P S V I E P S F G I G R I I Y C L F E H S F Y T R L NV F R F P P I V A P I K C T V F P L V K NQ E F DDA A K V I DK A L T T A G I S H I I DT T A I S I
V T DA L P S V I E P S F G I G R I MY C M F E HA F Y I R K T V L R L T P V V A P I K T T I F P L V NDDK L NA I A A E MNK M L T T NG I S A K L DA T A I S V
V M E Y L P NV I E P S F G L G R I MY T V F E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R HG V S HK V DD S S G S I
L I K Y V P HV I E P A F G I G R I L QA I I E H S F NQR K T F F K F S P R V A P V K C S I L S V V Q S E E F DNV I F E L T S S L K K L G I S C K T DNA G V A L
I QK W L P NV I E P S F G I G R I L Y S I F E HQY WA R R G V L S L P P L V A P T K V L L V P L S N S A D L Q P I V T K V S A Y L R K QQ I P F K V DD S S A S I
I Y A C L P NV I E P S F G I G R L I F C I L E H S F R I R R QY L S L P Y K L A P I K C S I L S I S NNK A F Y P Y I K Q I QM L L NQY N I S S K I DN S S V S I
I Y QW L P NV I E P S F G I G R L I F C I I E H S F R T R R HY L S L P Y T L A P I K C S V L T I S NHK T F I P F V K QV QM I L N E F S I S S K I DN S S V S I
I Y K I L P NV I E P S F G I G R L I F C I L E H S F R V R R HY L S L P Y A L S P I K C S V L S I S NNK E F Y P Y I K Q I QT I L S E NN I S C K L DN S S V S I
V V E A L P S V I E P S F G I G R I I Y C L Y E H S F Y MR QNV F R F P P L V A P I K C T V F P L V QNQQY E DV A K I I S K S L T A A G I S HK I D I T G T S I
V L E Y L P S V I E P S F G L G R I MY T I L E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG V S HK V DD S S G S I
I E S V L P NV I E P S F G L G R I I Y C I F DHC F QV R R G F F S F P L Q I A P I K V F V T T I S NNDG F P A I L K R I S QA L R K R E I Y F K I DD S NT S I
V E E A M P NV I E P S F G L G R I L Y V L M E HA Y WT R R G V L S F P A S I A P I K A L I V P L S R NA E F A P F V K K L S A K L R N L G I S NK I DD S NA N I
V V E H L P S V I E P S F G V G R I L Y S I L E HN F K V R R T Y F T L P P I I A P Y K C C V L P L S S NK D F E P L V K T L A QA L S NA S I S HK V D S S S G S I
A MD F L P NV I E P S F G I G R I MY T I F E HT F HV R R T F F S F P A T V A P Y K C S V L P L S QNQ E F V P F V R E L S E E L T R NG V S HK V DD S S G S I
L T K L I P Y V I E P S F G V G R I F S A I L E H S F R MR R T F F H L P P K I S P I K C S I L P V I S H E K Y NDA I HK L K V G L T K V G V S S K V DDT G HA I
L L NH L P C V I E P S F G L G R L I F S I L E H S Y R V R R K Y V A L NK S I A P T K C S V L P L S S K E V F E P L I T R V QA H L R R L G I S HK V DK T G A S I
I L NH L P C V I E P S F G L G R L I F S I L E HA Y R V R R K Y V S L HK S I A P T K C S I L P L S S K E V F E P L I S R V Q S Q L R S L G I S HK V DK T G A S I
I MDA L N I T V C G DK T V T Y E MY N I NDT V V T T R - - - - F F P NA L S S - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
V M E Y L P S V I E P S F G V G R I L Y A L L E Q S Y WV R R A V F G L R P L I A P QK V A V F P L L MK P E L I R T V E E I K E R M L L HG I S T R T DD S G A S I
V M E Y L P S V I E P S F G V G R I L Y A L L E Q S Y WV R R A V F S MR P V I A P QK V A V L P L L V K P E L L R V V E E I R G DMV L R G I S T R T DD S G A S I
V E E A I P NV I E P S F G I G R I F Y S L L E H S F WT R R G V L S L P P L V A P I K A S I V P I S S N E K L S P L V K QV S R K L R S A G V A S R V DD S NA S I
V E E W F P NV I E P S F G I G R I L Y S L I E HC F WT R K G V L S F P P R I A P T K V L V V P L S S QK E L A P F T Q E V S K K L R QA R I S A K V DD S S A S I
[Alignment column ruler: 420, 430, 440, 450, 460, 470, 480, 490]
G R R Y A R T D E I A I P Y G I T I D F DT L K - E P HT V T L R E R D S MK QV R I G L D E V A NT I R D L A T G - -R T S WK Y E P - - P - - I P T R V G K K K G
G K R Y A R T D E L G V P F A I T V D S - - - - - -DT S V T I R E R D S K DQV R V T L K E A A S V V S S V S E G - -K MT WK F E P - -A A - P P A R V G R K QG
G K R Y A R ND E L G T P F G I T I D F E S I K - -DG S V T L R E R D S T R QV R G S V T D I I R A I R D I T Y N - -G V T WK Y E P - - P - -V Q S K I G R K K G
G K R Y A R ND E L G T P F G I T V D F Q S V K - -DNT F T L R DR DT T K QV R A S E D E I L QA I K S L V DG - - E K T WK Y E P - - P P P P T T R I G R K K G
G K R Y A R ND E L G I P L G I T I D F D S V K - -DG T I T L R E R D S T K QV R A S K E E I L G A I E S L I S G - -K MNWR Y E P - - P P P P T T R L G R K K G
G R R Y A R T D E L G V P Y A V T V D F DT I K - E P HT V T L R E R D S MR QV R L P MA DV P T V V R D L S N S - -K I L W - - - - - - - - - - - - - - - - - -
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P S V V R D L A NG - - S I MWK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G I P F G I T V D F E S G K T T P Y T V T I R HA E T M S Q I R L E V S E L G R L I S D L V S G - -R QQWK Y E A - - P - - I P S R I G K K K G
G R R Y A R T D E I G I P F G I T V D F D S L K T T P F T V T I R HA E T M S Q I R L E V S E L G R L I S D L V A G - -R QQWK Y E A - - P - - I P S R I G K K K G
G K R Y A R ND E L G T P F G I T I D F D S V K - -DD S V T L R E R D S T K QV R G S I Q E I V E A I K D I T Y N - -DG T WK Y E P - - P - -V E S K F G K K K G
G K R Y A R ND E L G T P F G I T I D F D S V K - -DG S V T L R E R D S T K QV R G S V E A V I K A V R E I T Y N - -G A S WK Y E P - - P - -V Q S K F G R K K G
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P S V V C D L A NG - - S I T WK Y E P - - P - -V P T R V G K K K G
G K R Y A R T D E I A V P F G I T V D F DT V K I E P HT A T L R E R D S L V Q I R A T V E E I P Q I V Y D L V Q E - -NT T WK Y E R - - P - - L P T R V G K R R G
G K K Y A R ND E L G T P F G C T V D F A T I Q - -NG T MT L R E R D S T S Q L I G P I E DV I S V V DQ L V K G - -V L DWK W E P - - P - -V P T R I G K K K G
G R R Y A R T D E I G I P F G I T I D F Q S V K - -DDT V T L R E R D S MK QV R I S S S E V P S V I S K I I NQ - -Q I T WK L E S - -A - - P P P M E MK R K G
G R R Y A R T D E I G I P F G I T I D F Q S V K - -DDT V T L R E R D S MK QV R I S S S E V P S V I S K I I NQ - -Q I T WK L E S - -A - - P P P M E MK R K G
G R R Y A R T D E I G V A F G I T I D F DT V NK N P HT A T L R DR D S MR Q I R A E V G E L P E I I R D L A NG - -A I T WK Y E P - - P - - I P T R V G K R K G
G K R Y A R ND E L G T P F G I T I D F D S V K - -D E S V T L R DR D S T K QV R G S L E D I V E A I K D I A Y N - -NV S WK Y E P - - P - -V E S K F G K K R G
G R K Y A R T D E I G V P F G V T I D F QT V E - -DNT V T L R E R DT T K QV R I P I S E L A S T L R K L C D L - -T V S WK Y Q P - - P P - P P T Q F G K K K G
G R R Y A R T D E I A I P Y G I T V D F DT L K - E P HT V T L R DR NT MK QV R V G L E E V V G V V K D L S T A - -R T T WK Y E P - - P - - I P T R V G K K K G
G R R Y A R T D E I A I P Y G I T V D F DT L K - E P HT V T L R DR NT MK QV R V G L E QV V G V V K D L A T A - -R T S WK Y E P - - P - - I P T R V G K K K G
G R K Y S S C D E L G I P F F I T F D P D F L K - -DR MV T I R E R D S MQQ I R V DV E K C P S I V L E Y I R G - -Q S R WN L QD - -T - -T T I N L R R R R E
G K K Y A QA D E A G I P F DV T V DY T S L S - -DNT V T L R DR DT T K Q I R I P I DK L V E T V HA L T Q L H P T T T F - - - - - - - - -M S MT L G K K R E
G R R Y A R T D E V G V A F G I T I D F DT V NR T P HT A T L R DR D S MR Q I R A E V S E L P A I I R D L A NG - -Y L T WK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G V A F G V T I D F DT V NK T P HT A T L R DR D S MR Q I R A E I S E L P S I V QD L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G K R Y A R ND E L G T P F G V T I D F D S V T - -DG S I T L R E R D S T K QV R G S V A DV I K A I R E I T Y Q - -G V S WK Y E P - - P - -V E S K F G R K K G
G K K Y A R V D E L G I P F A I T C D F E G - - - -DG S V T L R E R DT A S QV R V P K L E V A S V V V D L C N P L Q P L T WK W E P - - P - -V A P E I G K R K G
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -K Y E P - - P - -V P T R V G K K K G
G R R Y S R ND E L G T P L G I T V D F QT V K - -DG T I T L R DR DT T V QV R A DQDK I V E A I Q E L V S G - -NK V WK Y E P - - P P R P T T R V G R K K G
G R R Y R R ND E I G T P Y S V T V DY DT L Q - -DG T V T I R DR D S MR QV R A P I NG I E NV L Y E L I Y R - -G R D F - - - - - - - - - - - - - - - - - -
G K R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E I S E L P K V V C S L A NG - -T MT WK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P NV V R D L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G R R Y A R ND E L G T P F G L T I D F QT L Q - -DG T F T L R E R D S T R QV R A E E E K I V DA I K A L V E G - - S K T WK Y E P - - P P K P T T R I G R K K G
G R R Y A R T D E I G V P F A V T V D S - - - - - -A T S V T I R E R D S K E Q I R V G I D E V A S V V K Q L T DG - -Q S T WK F E P - - P A -A P S R V G R K QG
G K R Y A R T D E L G V P F A V T V DHR S V T - - E NT V T V R E R D S C G QV R V P I P E V P G L L G R L C K M - -T V DWK Y V P - - P A - P P MR V G K K K G
G R R Y A R T D E I G V A F G V T I D F DT V NK T P HT A T L R DR D S MR Q I R A E I S E L P S I V R D L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G K K Y A R T D E I G I P F A I T V DK E T L T - -A Q S V T L R E I E T T K QV R V P I A E V P R L I L E L S A G - - L I L WK K P P - - P - - P P QR V G R K K G
G K R Y A R ND E L G T P F G I T I D F D S I K - -D E S V T L R E R D S T K QV R G S F E DV V A A I K E I T Y T - -G T T WK Y E P - - P - -V E S K F G K K R G
G K K Y A R T D E I G I P F A V T I D F QT L K - -DK T I T L R E R D S M L Q I R I S M S H L V D I I N S M L HA - -K K NWK L E S - -V - - P I S HMG K K K G
G K K Y A R T D E I G I P F A V T I D F QT L K - -DK T V T L R E R D S M L QV R I D L S D L V E I V T S L L R Q - -K K T WK L E S - -A - - P P S H I G K R K G
G K K Y A R I D E I G I P F A V T I D F QT L K - -DK T I T L R DR D S M L Q I R V N I S E V S D I I N S L L S Q - -K S S WK L E S - - S - - P P S H I G K R K G
G K R Y A R T D E L G V P F A I T V D S - - - - - -T S S V T I R E R D S K DQ I R V NV E E A A S V V K S V T DG - -HT T WK F E P - -A A - P P A R V G R K QG
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P S V V R D L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G K K Y A R ND E L G T P F G I T I D F E T I K - -DQT V T L R E R N S MR QV R G T I T DV I S T I DK M L HN P D E S DWK Y E P - - P - -V Q S K F G R K K G
G R R Y A R ND E L G T P F G L T V D F E T L Q - -N E T I T L R E R D S T K QV R G S QD E V I A A L V S MV E G - -K S S F K Y E P - - P - -V P T R T G R R K G
G R R Y A R T D E I S V P F C I T V D F D S L K - E P HT V T L R DR DT F E QV R T L V S DV A D I I R D L S S D - -K I R WK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V R E L P G I I R D L A NG - -T L S WK Y E P - - P - - I P T R V G K K K G
G R R Y A R T D E L G I P F G I T I DNDT L V - -DD S V T L R E I L T T K Q I R I P I NDV F R V V S D L A DG - - L I T WK DQM - - P - - - - - - - -R E K
G K R Y A R T D E I G I P F C V T L D F Q S V N - -DDT V T L R E R DT MQQV R I K L DD L G E L I NN L L K D - -D I T WQ S R S - -D - -Q P V T F G K R K R
G K R Y A R T D E I G V P F C V T L D F Q S V N - -DDT V T L R E R D S MQQV R V K L DQV G Q L L S N L L K - - -D I T W - - - - - - - - - - - - -M I K R I K
- - - - - - - - - - - - - - - - - - - - - - - - - -HR S V S A E S - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - E E E S V N P K K K L T E E E R K K R
G K K Y A R V D E L G V P F C V T C DM E N - - - -DG C V T L R E R D S A QQV R I P K E K V A D I V A E MC R P L R P R E WK W E P - - P - -V A T D I G K K K G
G K K Y A R V D E L G I P F C V T C D F E T - - - -DG C V T L R E R D S A R QV R I P R E A V A DV V A E L S R P L R P R E WK WQ P - - P - -V A S D I G K K K G
G R R Y A R ND E L G T P F A C T L D F A S L S - -K G T MT L R E R DT T A QR I G P I DQV I DV I R Q L C DG - - S L DWK W E P - - P - - L P T R V G K K K G
G K R Y A R ND E MG T P F G I T V D F DT V K - -DN S V T L R E R D S T R QV R G S I DA V I A A I NV MT A D - -DV A WK Y E P - - P - -V NT R S HR K K G
[Alignment column ruler: 500, 510, 520, 530, 540, 550, 560, 570]
P DA A L K L P QV - - - - - -T P HT R C R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G S P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P E A A A R L P T V - - - - - -T P S T K C K L R L L K L E R I K DY L L M E E E F V A NQ E DD L R G T P M S V G N L E E L I D E NHA I V S S S V G P E Y Y V G I
P A T A E K L P NV - - - - - -Y P S T R C K L K L L R M E R I K DH L L L E E E F V T N S E E E I R G T P L S I G T L E E I I DDDHA I V T S P T T P D F Y V S I
P S A A S K L P D I - - - - - - F P T S R C K L R Y L R MQR V HDH L L L E E E Y V E NM E DDMR G S P MG V G N L E E L I DDDHA I V S S A T G P E Y Y V S I
P S T A S K L P D I - - - - - - F P T S R C K L R Y L R MQR V HDH L L L E E E Y V E NM E DDMR G S P MG V G N L E E L I DDDHA I V S S A T G P E Y Y V S I
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M E E E F I R NQ E - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
P DV A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DA A S K L P A V - - - - - -T P HA R C R L K L L K S E R I K DY L L M E Q E F I QNQ E D E L R G T P MA V G S L E E I I DDQHA I V S T NV G S E HY V N I
P DA A S K L P A V - - - - - -T P HA R C R L K L L K S E R I K DY L L M E Q E F I QNQ E D E L R G T P MA V G S L E E I I DDQHA I V S T NV G S E HY V N I
P DT A V K L P S V - - - - - -Y P NT R C K L K L L K L E R I K DH L L L E E E F V T NQ E D E L R G Y P MA I G T L E E I I DDDHA I V S S T A S S E Y Y V S I
P A T A E K L P NV - - - - - -V P S T R C K L K L L R M E R I K DH L L L E E E F V T N S E E E I R G N P L S I G T L E E I I DDDHA I V T S P T M P DY Y V S I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DA A NK L P S V - - - - - -T P HT R C R L R L L K Q E R I K DY L L M E E E Y I R NQ E DD L R G S P M S V G T L E E I I D E NHA I V S T S V G S E HY V S I
P DA S S R L P A V - - - - - -Y P T T R C K L K L L K M E R I QDY L L M E E E F V S NQA D E L R G S P MG V G T L E E I I DDDHA I V S S G G G S E Y Y V G I
P P QY A R L P A V - - - - - -V P NA K C R L R L L K Y E R I K DY L MM E Q E F I T S M E DD L R G S P MN I G T L E E I I D E NHA I V S S S V G S E Y Y V N I
P P QY A R L P A V - - - - - -V P NA K C R L R L L K Y E R I K DY L MM E Q E F I T S M E DD L R G S P MN I G T L E E I I D E NHA I V S S S V G S E Y Y V N I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K QDR I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DT A V K L P S V - - - - - -Y P S T R C K L K L L K L E R I K DH L L L E E E F V T NQ E D E L R G Y P M S I G T L E E I I DDDHA I V S S T A G S E Y Y V S I
A E T S T K L P V I - - - - - -T P H S K C K L K Q L K L E R I K DY L L M E Q E F L QNY D E E L R G D P L T V G N L E E I I DDNHA I V S S T V G P E HY V R I
P DA A MK L P QV - - - - - -T P HT R C R L K L L K L E R I K DY L MM E D E F I R NQ E DD L R G T P M S V G N L E E I I DDNHA I V S T S V G S E HY V S I
P DA A MK L P L V - - - - - -T P HT R C R L K L L K L E R I K DY L MM E D E F I R NQ E DD L R G T P M S V G N L E E I I DDNHA I V S T S V G S E HY V S I
G K A A S K P P QV - - - - - -Y P L MK C K L R Y L K L K K L A H L L S L E DN I L S L C E E Q L R G S P L S V G T L E E F V DDHHG I I T T G V G L E Y Y V N I
Y G NNNK L P Q I - - - - - -N P R T QC N L K K L R L E R L K D I L L I QR D F I E NQ E E E L R G S P L E V S K L H E M I DDHHA I I S S G NT MQY C V P V
P DA A S K P P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P S T V E K L P S V - - - - - -Y P S T R C K L K L L R M E R I K DH L L L E E E Y V T N S E DD I R G T P L S I G T L E E I V DDDHA I V T S P T T P DY Y V S I
P DA A T R I P K V - - - - - -Y P NR A C L L R K Y R L E R C K DY L L L E E E F L R T I N E D I R G T P L E V A T L E E A V DD S HA I V S I S -G T E Y Y V P L
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY I S I
T S A A A K L P A I - - - - - -Y P T S R C K L R L L R MQR T HDH L L L E E E F V E NQ E DDMR G S P MG V G T L E E M I DDDHA I V S S T T G P E Y Y V S I
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - E Q L T E P P L F I A T I L E V NG E I A L I R QHG NNQ E - - - -V
A S A A A K L P S V - - - - - -Y P T S R C K L R L L R MQR I HDH L L L E E E Y V E NQ E DDMR G S P MG V G V L E E L I DDDHA I V S S T S G P E Y Y V S I
P E A A A R L P NV - - - - - -A P L S K C R L R L L K L E R V K DY L L M E E E F V A A Q E DD L R G T P M S V G S L E E I I D E S HA I V S S S V G P E Y Y V G I
I E G S T R L P NV - - - - - -A P Q S K C K L R M L K L E R V K DY L L M E E E F V G NQ E D E MR G A P M S V G S L E E I I DDT HG I V S S S I G P E Y Y V N I
V E QA S K L P A A - - - - - -T P I T K C R L K L L K N E R I K DY L L L E Q E F I E NQQ E Q L R G T P M I V G T L E E F V N E NHA I V S S S V G P E S Y S G I
P DT A V K L P S V - - - - - -Y P NT R C K L K L L K L E R I K DH L L L E E E F V T NQ E D E L R G Y P M S I G T L E E I I DDDHA I V S S T A G S E Y Y V S I
T S G H S K L P NV - - - - - -T P NT K C R L K L L K L E R I K DY L L L E E E Y I T NQ E DD L R G S P V S V G T L E E L I D E NHG I I A T S V G P E Y Y V N I
V P G H S K L P T V - - - - - -T P NT K C R L K L L K L E R I K DY L L L E E E F I T NQ E DD L R G S P M S V G T L E E L I D E NHG I I A T S V G P E Y Y V N I
A T G H S K L P T V - - - - - -T P NT K C R L K L L K L E R I K DY L L L E E E F I T NQ E DD L R G S P M S V G T L E E L I D E NHG I I A T S V G P E Y Y V N I
P E A A A R L P T V - - - - - -T P HT K C K L R L L K M E R I K DY L L M E E E F V A NQ E DD L R G S P M S V G N L E E L I D E NHA I V S S S V G P E Y Y V G I
P A T A E K L P N I - - - - - -Y P S T R C K L K L L R M E R I K DH L L L E E E F V S N S E E E I R G N P L S I G T L E E I I DDDHA I V T S P T M P DY Y V S I
P DA S A K L P T V - - - - - - I P T T R C R L R L L K MQR I HDH L L M E E E Y V QNQ E D E I R G T P M S V G T L E E I I DDDHA I V S T A -G P E Y Y V S I
P DT A S K L P QV - - - - - -T P HT K C R L R L L K M E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G S L E E I I DDNHA I V S A S V G S E Y Y V S I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K Q E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
- -K P A P V R I V - - - - - -T P I S K C R L R Q L K L DR I K DY L L M E Q E F I R NQ E E Q L R G S P M L I G T L E E F I D E DHA I V S S - I G P E Y Y A N I
Q L A P V R I P T V - - - - - -T P N S K C R L R L L K L E R I K DY L L L E E E Y I T NK S DD I R G S P M S V G T L E E I I D E NHA I V T S S I G P E Y Y V N I
P V L T NR L P L V NV K G K L T P N S K C R L R L L K L E R I K DY L L L E E E Y I T NK S DD I R G S P M S V G T L E E I I D E NHA I V T S S I G P E Y Y V N I
G A K NT H I P T V - - - - - -T P NA K C Q L R L L K L E R V K DW L K M E E E F I NNC E E E V R G S P MMV G T L E E I V DDDHA I V S R S V -QD F Y V T I
P DA A A K L P K I - - - - - -Y P S R A C L L K Q L R L E R C K DY L L L E D E L L T M I T DA L R G M P L E V G T L E E V I DDT HA I V S T A -G S E Y Y V A M
P DT A A K L P K I - - - - - -Y P V K A C L L K Q L R L E R C K DY L L L E E E L L K T I G DA L R G M P L E V G T L E E V I DDT HA I V S T A -G S E Y Y V P M
P D S S S K L P T V - - - - - -Y P NT R C R L K L L K L E R I K DH L L L E E E F V QNQ E DD L R G S P MA V G T L E E I I DD E HA I V S S A T G P E Y Y V S I
P E NA NK L P G V - - - - - -Y P T T R C K L K L L K M E R I K DH L L L E E E F V QNQ E DT L R G S P MG V G T L E E I I DDDHA I V S S T S G P E Y Y V S I
590
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
600
610
620
630
640
L S F V DK D - - -Q L E P G C S V L L NHK V HA V V G V L G DDT D P MV T V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
L S F V DK D - - -Q L E P G C S I L MHNK V L S V V G I L QD E V D P MV S V MK V E K A P L E S Y A D I G G L E A Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S I V G V L QDDA D P MV S V MK I DK S P T E S Y ND I G G L E A Q I Q E I K
M S F V DK D - - - L L E P G A S I L L HHK S V S V V G V L T E E S D P L V S V MK L DK A P T E S Y A D I G G L E S Q I Q E V R
M S F V DK D - - - L L E P G A S I L L HHK S V S V V G V L T E E S D P L V S V MK L DK A P T E S Y A D I G G L E T Q I Q E V R
- - - - - - - - - - - - - - - - - - - - - -K V HA V V G V L G DDT D P MV S V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
M S F V DK E - - -Q L E P G C S V L L NHK NHA V I G V L S DDT D P MV S V MK L E K A P Q E T Y A DV G G L DQQ I Q E I K
M S F V DK E - - -Q L E P G C S V L L NHK NHA V I G V L S DDT D P MV S V MK L E K A P Q E T Y A DV G G L DQQ I Q E I K
M S F V DK G - - - L L E P G C S V L L HHK T V A V V G V L QDDA D P MV S V MK L DK S P T E S Y A D I G G L E S Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S I V G V L QDDA D P MV S V MK I DK S P T E S Y S D I G G L E S Q I Q E I K
L S F V DK D - - - L L E L G C S V L L NHK V HA V I G V L MDDT D P L V I V MK V E K T P Q E T Y A D I R A L DNQ I Q E I K
L S F V DK D - - - L L E P G C T V L MNHK V HA V V G F L G DDV D P L V T V MK L E K A P K E S Y A D I G G L DT Q I T E I K
M S F V DK D - - - L L E P G C S V L L HHK T HA V V G V L A DDT D P MV S V MK L DK A P T E S Y A D I G G L E S Q I Q E I K
L S F V DK N - - -Q L E P G S S V L L HNK V Y S V V G I MND E V D P L V S V MK V DK A P L E S Y A D I G G L E QQ I Q E I K
L S F V DK N - - -Q L E P G S S V L L HNK V Y S V V G I MND E V D P L V S V MK V DK A P L E S Y A D I G G L E QQ I Q E I K
M S F V DK G - - - L L E P S C S V L L HHK T V S I V G V L QDDA D P MV S V MK L DK S P T E S Y A D I G G L E S Q I Q E I K
M S F V DK S - - -K L Y L G A T V L L NNK T L S V V G V I DG E V D P MV NV MK V E K A P T E S Y S D I G G L E A QV Q E MK
L S F V DK D - - -Q L E P G C S V L L NHK V HA V V G V L S DDT D P MV T V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
L S F V DK D - - -Q L E P G C S V L L NHK V HA V V G V L S DDT D P MV T V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
M S F V DK D - - - L L E P G C T V L L NY K DN S V V G V L E G E MD P MV NV MK L E K A P S E T Y A D I G G L E E Q I Q E I K
L S I V DR E - - - L L E P G V QV L T HNHNK A I V G V L QND E D P HV S V MK V DK A P L E S Y A DV G G L E K Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK L E K A P Q E T Y A D I G G L DNQ I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S V V G V L QDDA D P MV S V MK MDK S P T E NY S D I G G L E A Q I Q E I K
M S F V DK E - - -Q L E L G C S V L L HDR QH S I V G V L K DDV D P L V S V MK V DK A P E DT Y A D I G G L E QQ I Q E I K
M S F V DK D - - - L L E P G A S V L L HHK S V S I V G V L T DDA D P L V S V MK L DK A P T E S Y A D I G G L E QQ I Q E V R
L T Q I P E E C L G K I E P G MR V A V N -G A Y S I I S I V S R A A DV R A QV M E L I N S P G V DY S M I G G L DDV L Q E V R
M S F V DK D - - - L L E P G A S V L L HHK S V S I V G V L T DDT D P A V S V MK L DK A P T E S Y A D I G G L E QQ I Q E V R
L S F V DK D - - -Q L E P G C S I L MHNK V L S V V G I L QD E V D P MV S V MK V E K A P L E S Y A D I G G L DA Q I Q E I K
A S F V DK S - - -Q L E P G C A V L L HHK N S A V V G T L A DDV D P MV S V MK V DK A P L E S Y A DV G G L E DQ I Q E I K
M S F V DK D - - -Q L E P G C S V L L NQR S Y A V V G I MQD E I D P L L NV MK V DK A P L E S Y A D I G G L E QQ I Q E I K
M S F V DK G - - - L L E P G C S V L L HHK T V S V V G V L QDDA D P MV S V MK L DK S P T E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NNK T N S V V G I L L D E V D P L V S V MK V E K A P L E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - -Q L E P G C A I L MHNK V L S V V G L L QD E V D P MV S V MK V E K A P L E S Y A D I G G L DA Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S I V G V L QDDA D P MV S V MK MDK S P T E S Y S D I G G L E S Q I Q E I K
M S F V DK D - - -M L E P G C S V L L HHK A M S I V G L L L DDT D P M I NV MK L DK A P T E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - -Q L E P G C T V L L NHK V L A I V G V L G DDT D P MV S V MK L E K A P Q E S Y A D I G G L DT Q I Q E I K
L S F V DK D - - -Q L E P G S T V L L NNR T MA V V G I MQD E V D P M L NV MK V E K A P L E C Y A D I G G L E QQ I Q E V K
L S F V DK E - - - L L E P G C S V L L HNK T N S I V G I L L DDV D P L V S V MK V E K A P L E S Y DD I G G L E E Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HNK T N S I V G I L L DDV D P L V S V MK V E K A P L E S Y DD I G G L E E Q I Q E I K
S S F V DR K - - -A L Q I G C S V L L H E K A L T I V G L L DDDA N P L V DV MK V E NA P L E S F A D I G G L E DQ I V D I K
L S F V DK E - - -K L E L G C S V L L HDR Y HNV V G L L E S NT D P L V S V MK V DK A P Q E T Y A D I G G L E DQ I Q E I K
L S F V DK E - - -K L E L G C S V L L HDR QH S V V G V L QN S I D P HV S I MK V E K A P Q E T Y A D I G G L E E Q I Q E I K
M S F V DK D - - - L L E P G C S V L L HHK A MA I V G V L S DDA D P MV S V MK L DK A P S E S Y A D I G G L E T Q I Q E I K
M S F V DK D - - - L L E P G C S V L L HHK T V S V V G V L QDDA D P MV S V MK L DK A P T E S Y A D I G G L E S Q I Q E I K
650
E SV
EAV
EAV
E SV
E SV
E SV
E SV
EAV
EAV
E SV
E SV
E SV
E SV
E SV
EAV
EAV
E SV
EAV
EA I
E SV
E SV
E SV
EAV
E SV
E SV
EAV
EAV
E SV
E SV
E SV
E SV
E SV
E SV
EAV
EAV
E SV
EAV
EAV
EAV
EAV
EAV
EAV
E SV
E SV
EAV
E SV
E SV
EAV
EAV
EAV
EAV
EAV
EAV
EAV
E SV
660
E L P L T H P E Y Y E E MG
E L P LT H P E LY ED I G
E L P L T H P E L Y E E MG
E L P L L H P E L Y E E MG
E I P L T H P E L Y DD I G
E I P L T H P E L Y DD I G
E L P LT H P E LY E E I G
E L P L T N P E L Y Q E MG
E L P L SH P E LY E E I G
E F P L SH P E LY D E I G
E L P LT E P E L F ED LG
E L P L T H P E I Y E DMG
E L P L T H P E I Y E DMG
E L P L T R P E L Y DD I G
E L P L T R P E L Y DD I G
E L P LT H P EQ FD E I G
E F P L SH P E L FD EV G
E F P L SH P E LY D EV G
670
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I E P P SGV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
V Q P PK GV
V K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
680
I LY G P PGT GK T
I LY G E PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GC PGT GK T
I LY GC PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GV PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G E PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G L PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY GV PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
L L HG A P G T G K T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY G E PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G E PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G E PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I MY G P P G T G K T
I LY G P PGT GK T
I LY G P PGT GK T
I L FG P PGT GK T
I LY GV PGT GK T
I LY GV PGT GK T
I LY GV PGT GK T
I LY GA PGT GK T
690
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L A K A V A NQT S A T
LAK AV AN ST SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
L A K A V A NR T S A T
I A K A I A S QA K A T
LAK AV AN ST SAT
LAK AV AN ST SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ST SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAR AV AK ST SAT
700
710
720
730
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A DD L S P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F K V A A E NA P S I
F L R I V G S E L I QK Y L G DG P R L V R Q I F QV A A E HA P S I
F L R I V G S E L I QK Y L G DG P R L V R Q I F QV A A E HA P S I
F L R I V G S E L I QK Y L G DG P K MV R E L F R V A E E NA P S I
F L R I V G S E L I QK Y L G DG P K MV R E L F R V A E E NA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A A DHA P S I
F L R V V G S Q L I QK Y L G NG P K L I R E L F R V V E E HA P S I
F L R I V G S E L I QK Y L G DG P K L V R E L F R V A E E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E NA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A G E L A P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A D E C A P S I
F L R V V G T E L I Q E Y L G E G P K L V R E L F R V A DMHA P S I
F L R I V G S E L I QK Y L G DG P K L V R E L F QA A K D S A P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HG P S I
F L R V V G S E L I QK Y S G E G P K L V R E L F R V A E E H S P A I
F L R I V G S E L I QK Y L G DG P R L V R Q L F QV A A E NA P S I
F I R M S G S D L V QK F V G E G S R L V K D I F Q L A R DK S P S I
F L R I V G S E L I QK Y L G DG P R L V R Q L F QV A A E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A D E L S P S I
F L R I V G S E L I QK Y L G DG P K L V R E L F R V A D E M S P S I
F L R V V G S E L I QK Y QG DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A G E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E M F K V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E M F K V A E DHA P S I
F L R V V G S E L I QK Y L G DG P K L V R E M F K V A E DHA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A DD L S P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F K V A G E NA P S I
F L R V V G S E L I QK Y L G DG P R L V R Q L F NA A E E H S P S I
F L R I V G S E L I QK Y A G E G P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G E G P K L V R E M F K V A E DNA P S I
F L R V V G S E L I QK Y L G E G P K L V R E M F K V A E DNA P S I
F L R V V G S E L I QK Y L G E G P K L V R E L F K T A H E L A P S I
F L R V V G S E L I QK Y S G E G P K L V R E L F R V A E E N S P S I
F L R V V G S E L I QK Y S G DG P K L V R E L F R V A E E N S P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A D E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A A E HA P S I
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
I
V
V
V
V
V
V
V
L
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
I
I
V
V
V
V
V
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
F MD E I
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
740
DA V G T K R Y D S N
DA V G T K R Y DA N
DA I G T K R Y D S N
DA I G T K R Y D S T
DA I G T K R Y E S T
DA I G T K R Y E S N
DA V G T K R HD S Q
DA V G T K R HD S Q
DA I G S K R Y E S S
DA V G T K R Y D S Q
DA I G G K R Y NT S
DA V G T K R Y DA H
DA I G T K R Y DT D
DA V G S MR T Y DG
DA V G T K R Y DA H
DA V G T K R Y D S Q
DA V G T K R Y D S H
DA V G T K R Y E A T
DA V G T K R Y DA H
DA I G T K R Y DA Q
DA V G S K R Y NT S
DA I G T K R Y DA T
DA I G T K R Y DA T
DA V G T K R Y D S T
DA I G T K R Y DT D
DA I G T K R Y DT D
750
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
760
S G G E R E I QR T M L E L
S S G E R E I QR T M L E L
S G G E R D I QR T M L E L
S G G E R D I QR T M L E L
S G G E R E V QR T M L E L
S G G R R E V QR T M L E L
S G G E K E I QR T M L E L
S S G T K E V QR T M L E L
T S G S A E V NR T M L Q L
S G A E R E I QR T M L E L
S S G E R E V QR T M L E L
S G G A K E V QR T M L E L
S S G A K E V QR T M L E L
S G G E R E I QR T L L E L
770
780
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I L A T NR
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F D - S R G DV K V L MA T NR
L NQ L DG F D - S R G DV K V L MA T NR
L NQ L DG F D -DR G D I K V I MA T NK
L NQ L DG F D -T R G DV K V I MA T NK
L NQ L DG F E -A R G DV K V I MA T NK
L NQ L DG F E -A R G DV K V I MA T NK
L NQ L DG F D -A R T DV K V I MA T NR
L NQ L DG F D -T R ND I K V I MA T NK
L NQ L DG F D -T R G E V K V I I A T NR
L T Q L DG F D - S S NDV K V I MA T NR
L A E MDG F D - P K G NV K V V A A T NR
L NQMDG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R A DV K V I L A T NK
L NQ L DG F DT S QR D I K V I MA T NR
L NQ L DG F D - S R T DV K V I L A T NK
L NQ L DG F D - S Q S DV K V I MA T NK
L NQ L DG F D - S Q S DV K V I MA T NK
L NQ L DG F D -DR G D I K V I MA T NR
L T Q L DG F D - S C NDV K V I MA T NR
L T Q L DG F D - S S NDV K V I MA T NR
L NQ L DG F D -T R HDV K V I MA T NR
790
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
ET LD PA
E S LD PA
E S LD PA
ET LD PA
ET LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
E S LD PA
E S LD PA
ET LD PA
D S LD PA
EN LD PA
E S LD PA
E S LD PA
ET LD PA
E S LD PA
ET LD PA
ET LD PA
ET LD PA
EA LD PA
E S LD SA
ET LD PA
ET LD PA
E S LD PA
DT L D P A
ET LD PA
ET LD PA
D L LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
E S LD PA
ET LD PA
E S LD PA
E S LD PA
D S LD PA
D S LD PA
D S LD PA
E S LD PA
ET LD PA
ET LD PA
SD LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
E S LD PA
ET LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
I R PGR
LR PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R AGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
LR PGR
I R PGR
I R PGR
I R PGR
LR PGR
LR PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
LR PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
800
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
F DR S I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
810
E F P L PD EK T K R R I
E F P L PD I K T R R R I
L F E N P DV S T K R K I
L F E N P DQNT K K K I
L F E N P DQNT K K K I
E F P L PD EK T K K R I
L F E N P DA NT K K K I
L F EN PD L ST K R K I
E F PMPD EK T K R R I
E F P L P DT K T K R H I
E L P N P DC K T K R R I
E L P N P DC K T K R R I
L F E N P D S NT K K R I
E F P L PD I K T K R K I
E F G M P DA A T K K K I
E F P L PD I K T K R K I
L F EN PD I T T K R K I
E F P F PD EK T K R R I
L F E N P DQNT K R K I
EV P L PD EK GR V E I
L F E N P DQNT K R K I
E F P L P DV K T K R H I
E F P L P DV K NK K K I
L F E N P DA NT K K K I
Q L P N P DT K T K R R I
L F EN PD L ST K K K I
L F EN PD EAT K R K I
E F P V P DMK T K K K I
Q L PN PD SK T K R K I
Q L PN PD SK T K R K I
E L P F P DNK T K L K I
E F P F PD EK T K K MI
E F P F PD EK T K K MI
E F P L P DQK T K MH I
L F EN PD ST T K R K I
820
F N I HT A R MT L A E DV N L
F Q I HT S K MT L A E DV N L
L G I HT S K MN L S A DV D L
F T L HT S K M S L A DDV D L
F T L HT S K M S L G DDV D L
F T I HT S R MT L A DDV N L
F Q I HT S R MT L A DDV T L
F Q I HT S R MT L G DDV N L
F Q I HT S R MT L G K E V N L
L T I HT S K M S L A DDV N L
L G I HT S K MN L S S DV D L
F Q I HT S R MT L A DA V T L
F N I HT A R MT L S DDV NV
F K L HT S R M S L A DDV D I
F Q I HT S K MT L S DDV D L
F Q I HT S K MT L S DDV D L
F Q I HT S R MT V A E DV S L
L H I HT S K M S L A DDV K L
F E I HT A K MN L S E DV N L
F T I HT S R MT L A E DV N L
F T I HT S R MT L A E DV N L
F D I HT S R MT L D E S V N I
F E I HT S K MT L E E G V DM
V G I HT S K MN L A E DV D L
F E I HT S R M S L A E DV D I
F T L HT S K M S L N E DV D L
L K I HT R K MK L A DDV D F
F T L HT S K M S L N E DV D L
F Q I HT S K MT L A DDV N L
F N I HT G R MN L S A DV Q L
F Q I HT S K MN L G E DA N L
L T I HT S K M S L A DDV N L
F Q I HT S K MT M S P DV D L
F Q I HT A R MT L A DDV N L
L G I HT S K MN L S E DV N L
F T I HT S K MN L G E DV N L
F N I HT S R MT L S NDV N L
F Q I HT S R MT V A DDV T L
F E I HT S K MA L G E E V N F
F E I HT S K MT M S K DV D L
F E I HT S K MT M S K DV D L
F Q I HT A NMH L A P DV N L
F E I HT S R M S L A E DV D L
F E I HT S R M S L A E DV D I
F K L HT S R MN L D S DV D L
MG I HT S K MN L NDDV D L
840
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
850
S E L I MA K DD L S G A D I K A I C T E A G L MA
E E F V MT K D E F S G A D I K A I C T E A G L L A
E T L V T S K DD L S G A D I K A MC T E A G L L A
D E F I NQK DD L S G A D I R A I C T E A G L MA
D E F I NQK DD L S G A D I R A I C T E A G L MA
S E L I M S K DD L S G A D I K A I C T E A G L MA
DD L I MA K DD L S G A D I K A I C T E A G L MA
E E F I T A K D E L S G A D I K A MC T E A G L L A
E E F I T A K D E L S G A D I K A MC T E A G L L A
D E I V T G K DD L S G A D I K A I C T E A G L L A
E N L V T S K DD L S G A D I QA MC T E A G L L A
D E HV QA K DD L S G A D I K A I C T E A G L L A
E E L V MT K D E L S G A D I K A V C T E A G L L A
E E F I MA K DD I S G A D I K A I C T E A G L L A
E E F I MA K DD I S G A D I K A I C T E A G L L A
DD L I L A K DD L S G A D I K A I C T E A G L MA
D E L V T S K D E L S G A D I K A MC T E A G L L A
E E F V M S K DD L S G A D I K A I C T E S G L L A
E L L I T SK ED L SGAD I K A I CT EAGMI A
E E F V M S K DD L S G A D I K A I C T E A G L L A
D E L I MA K DD L S G A D I K A I C T E A G L MA
DN L V T S K DD L S G A D I K A MC T E A G L L A
S E F I HA K D E M S G A DV K A I C T E A G L L A
E E F I A QK DD L S G A D I K A I C S E A G L MA
EK LAK V LT GK SGA E I SV I V K EAG I FV
E E F I A QK DD L S G A D I K A I C S E A G L MA
E E F V MA K D E L S G A D I K A L C T E A G L L A
D E F I NA K D E L S G A D I K A MC T E A G L L A
D E L V T S K DD L S G A D I K A I C T E A G L L A
E E FV MSK D E L SGAD I K A I CT EAG L LA
E T L V T T K DD L S G A D I QA MC T E A G L L A
E E L I QC K DD L S G A E I K A I V S E A G L L A
D E Y I T S K DD L S G A D I K A I C T E A G L MA
DD L I L A K DD L S G A D I K A I C T E A G L MA
DT F V HV K DD L S G A D I K A MC T E A G L L A
D E F V V NK DD L S G A D I K A MC T E A G L L A
D E F V V NK DD L S G A D I K A MC T E A G L L A
M E F A NT K D E I S G A D I K A I C S E A G L I A
S E F I HA K D E M S G A D I K A I C T E A G L L A
S E F I HA K E E M S G A D I K A I C T E A G L L A
E E F V A MK DD L S G A D I K S L V T E A G L L A
E E F V S S K D E L S G A D I K A MC T E A G L L A
860
870
880
890
900
910
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y Y L DA QA T T S MD P R V L DA MM P Y L T
L R E R R MK V T HV D F K K A K E K V M F K K K E G V P - E G L Y Y L DMQA T T P I D P R V F DA MNA S Q I
L R E R R MQV T V E D F K QA K E R V MK NK V E E N L - E G L Y Y L DMQA T T P T D P R V V DT M L K F Y T
L R E R R MR V QMDD F R A A R E R I MK T K QDG G P V E G L Y Y L DMQA T T P V D P R V L DA M L P Y L T
L R E R R MR V QMDD F R A A R E R I MK T K QDG G P V E G L Y Y L DMQA T T P T D P R V L DA M L P Y L T
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y - - - - - - - - - - - - - - - - - - - - - - L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MR V T M E D F QK S K E NV L Y R K K E G A P - E E L Y Y L DV QA T S P MD P R V V DA M L P Y M I
L R E R R MR V T M E D F QK S K E NV L Y R K K E G A P - E E L Y Y L DV QA T A P MD P R V V DA M L P Y M I
L R E R R MQV K A E D F K S A K E R V L K NK V E E N L - E G L NY L DV QA T T P V D P R V L DK M L E F Y T
L R E R R MQV T A E D F K QA K E R V MK NK I E E N L - E G L Y Y MDMQA T T P T D P R V L DV M L K F Y T
L R E R R MK V T N E D F K K S K E NV L Y R K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MK V T S E D F K K S K E NV L Y R K N E G A P -QG L Y Y MDA QA T T P L D P R V L DK V M S Y Y V
L R E R R MR V T R T D F T T A R E K V L Y G K D E NT P -A G L Y Y L DMQA T T P MD P R V L DK M L P L F T
L R E R R MR V T Q E D L R K A K E K A L Y R K K G G I P - E G L Y Y F DY QA T T P V D P R V L DK MM P F F T
L R E R R MR V T Q E D L R K A K E K A L Y R K K G G I P - E G L Y Y F DY QA T T P V D P R V L DK MM P F F T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MD F QA T T P MD P R V L DA M L P Y QV
L R E R R MQV K A E D F K A A K E R V L K NK V E E N L - E G L Y Y L DV QA T T P T D P R V L DR M L E F Y T
L R E R R MR V T HT D F K K A K E K V L Y R K T A G A P - E G L Y Y L DMQ S T T P I D P R V L DA M L P L Y T
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y Y L DA QA T T P MD P R V L DA M L P Y L T
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y Y L DA QA T T P MD P R V L DA M L P Y L T
L R E R R K T V T MK D F I S A R E K V F F S K QK MV S -A G L Y F L DV Q S T T P V D P R V L DA M L P F Y T
L R E R R MK V NQ E D F K K A K E K V MY R K K E G V P -DG L Y Y L DNNA T T MV D P E V L N S M L P Y F S
L R E R R MK V T N E D F K K S K E N F L Y K K T E G T P - E G L Y Y L DV QA T T P L D P R V L DR M L P Y L T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L I
L R E R R MQV T A QD F K E A K E R V L K NK V E E N L - E G L Y Y L DMQA T T P T D P R V L DT M L K F Y T
L R E R R MK V C QA D F I K G K E NV QY R K DK S T F - S R F Y Y MDNQA T T P L D P R V L DA M L P Y MT
L R E R R MR V QMA D F R A A R E R V L R T K Q E G E P - E G L Y Y L DMQA T T P V D P R V L DA M L P L Y V
L R R R G K E I T MA D F MK A Y E K V V NV Q E P T I P -QA M F Y MDN S A T T P V R K E V V E E M L P Y L T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MR V QMA D F R A A R E R V L R T K Q E G E P - E G L Y Y L DMQA T T P I D P R V L DA MM P Y F T
L R E R R MK V T HA D F K K A K E K V M F K K K E G V P - E G L Y Y MDMQA T T P V D P R V L DA M L P F Y L
L R E R R MQV T HA D F S K A K E K V L Y K K K E G V P - E G M F - - -MQA T T P L D P R V L DA M L P Y F T
L R E R R MK I T Q E D F R K A K E K I L Y L K K G N I P - E G L Y Y L D F QA T T P T DY R V L DA M L P Y L T
L R E R R MQV K A DD F K S A K E R V L K NK V E E N L - E G L Y Y L DV QA T T P T D P R V L DK M L T F L T
L R E R R MK I T QA D L R K A R DK A L F QK K G N I P - E G L Y Y L D S QA T T M I D P R V L DK M L P Y MT
L R E R R MK I T QV D L R K A R DK A L Y QK K G N I P - E G L Y Y L D S QA T T M I D P R V L DK MM P Y MT
L R E R R MK I T Q L D L R K A R DK A L Y QK K G N I P - E G L Y Y L D S QA T T M I D P R V L DK MM P Y MT
L R E R R MK V T HT D F K K A K E K V M F K K K E G V P - E G L Y Y L DMQA T S P V D P R V L DA M L P Y Y L
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MQV T A E D F K QA K E R V MK NK V E E N L - E G L Y Y L DMQA T T P T D P R V L DT M L K F Y T
L R E R R MR V V MDD F R QA R E K V L K T K D E G G P A G G L Y Y MD F QA T S P L DY R V L D S M L P F F T
L R E R R MK V NN E D F K K S K E NV L Y R K T E G T P - E G L Y Y L DA Q S T T P L D P R V MDA MM P Y S V
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MD F QA T T P MD P R V L DA M L P Y QV
L R E R R MK V T L DD F T K A K DK V L Y L K K G DT P -DG L Y Y L D F QA T T P L D F R V L DK MM P Y QT
L R E R R MQ I T QA D L MK A K E K V L F QK K G NV P -DV L Y Y L DNQA T T C V D P R V L D S MM P Y L T
L R E R R MQ I T QA D L MK A K E K V L Y QK K G NV P -DV L Y Y L DNQA T T C V D P R V L DA MM P Y L T
L R DG R L M E C QA D F R K G R E MV MY R R K E N I P - E G L Y Y L DT QA T S V L D P R V F DT M I P Y E T
L R DR R MK V C Q S D F V K G K E NV QY R K DK G R F - S K F Y Y L D L Q S T T P L D P R V L DK M L P Y MT
L R DR R MK V C QA D F V K G K E NV QY R K DK S S F - S K F Y Y L D F QA T T P L D P R V L DR M L P Y L T
L R E R R MR V T K K D F T T A R E R V I DR K N E G T P - E G L Y Y L DA Q S T T P V D P R V V DK MM P Y MT
L R E R R MR V T A E D F R T A K E R V MK NK V E E N L - E G L Y Y L DMQA T T P T D P R V L DV M L NY Y T
Alignment column ruler ticks: 920, 930, 940, 950, 960, 970, 980
Sequence labels recovered from the alignment figure:
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
A Y Y G N P H S R T HA Y G W E T E E A V E K A R A QV A S L I G A -D P K E V V F T S G A T E S NN I S V K G I G R F K K H I I T T QT E HK C V
H E Y G N P H S R T H L Y G W E A E NA V E NA R NQV A K L I E A - S P K E I V F V S G A T E A NNMA V K G V MH F K K HV I T T QT E HK C V
G L Y G N P H S NT H S Y G W E T S Q E V E K A R K NV A DV I K A -D P K E I I F T S G A T E S NNMA L K G V A R F K NH I I T T R T E HK C V
G I Y G N P H S R T HA Y G W E S E K A V E QA R E Y I A K L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T S QT E HK C V
G I Y G N P H S R T HA Y G W E S E K A V E QA R E HV A K L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T T QT E HK C V
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -HK C V
NY Y G N P H S R T HA Y G W E S E A A M E C A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
ND F G N P H S R T H S Y G WK A E E G V E QA R K Y V A D L I K A -D P R D I V F T S G A T E S NN L A I K G V A K F K NH I I T L QT E HK C V
ND F G N P H S R T H S Y G WK A E E G V E QA R E HV A N L I K A -D P R D I I F T S G A T E S NN L A I K G V A K F K NH I I T L QT E HK C V
G L Y G N P H S S T HA Y G W E T DK E V E K A R T Y I A DV I NA -D P K E I I F T S G A T E T NNMA I K G V P R F K K H I I T T QT E HK C V
G L Y G N P H S NT HA Y G W E T NK E V E T A R DHV A K V I R A -D P K E I I F T S G A T E S NN L A I K G V G R F K K H I I T T R T E HK C A
NY Y G N P H S R T HA Y G W E S E A A M E HA R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
S Y Y G N P H S R T HA Y G WQA E DA V E V A R QQV A DV I NA -D P R E I I F T S G A T E S NN L A V K G V G R F K K H I I T T Q I E HK C V
E QY G N P H S R T HA Y G W E A E K A V D E A R QQV A Q L V G A -Q P K D I V F T S G A T E S NNM L I K G I A K F K K H I I T T QT E HK C V
E K F G N S H S R T HG Y G W E A E E A V E NA R T N I A N L I K C - L P K E I I F T S G A T E S NNT I I R G V C D I K NH I I T T Q I E HK C V
E K F G N S H S R T HG Y G W E A E E A V E NA R T N I A N L I K C - L P K E I I F T S G A T E S NNT I I R G V C D I K NH I I T T Q I E HK C V
NY Y G N P H S R T HA Y G W E S E S A M E K A R K QV A G L I G A -D P R E I V F T S G A T E S NNM S I K G V A R F K MH I I T T Q I E HK C V
G L Y G N P H S S T H S Y G W E T DK E V E K A R K Y V A DV I NA -D P K E I I F T S G A T E S NNMA V K G V P R F K K H I I T T QT E HK C V
E NY G N P H S K T HA Y G WT S ND L V E DA R E K V S K I I G A -D S K E I I F T S G A T E S G N I A I K G V A R F K NH I I T T V T E HK C I
N F Y G N P H S R T HA Y G W E T E S A V E K A R E QV A T L I G A -D P K E I I F T S G A T E S NN I A V K G V A R F K R HV I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E T A V E K A R E QV A N L I G A - E T K E I I F T S G A T E S NN I A V K G V A R F K K HV V T T QT E HK C V
T V F G N P H S R T HR Y G WQA E A A V E K A R S QV A S L I G C -D P K E I I F T S G A T E S NN L A L K G V S G F A A H I I T L QT E HK C I
E I Y G N P N S - L HA F G QK A R K A L S D S L D I I Y E C I G A S DDDT V L I T A N S T E G NNT V L K T M L A R R NK I I V S Q I E H P S I
G C Y G N P H S R T HA Y G W E S E A A T E R A R R QV A D L I G A -D P R E V I F T S G A T E S NNMA I K G V A R F K K H I I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
G L Y G N P H S NT H S Y G W E T NK E I E QA R K Y I A DV I K A -D P K E I I F T S G A T E S NNMA L K G V S R F R NH I I T T R T E HK C V
E E Y G N P N S R T HQY G W S A E E A V E K A R R QV A D L I G A - S P K E I F F T S G A T E C NN I A I K G V G N F K NH I I T L QT E HK C V
G V Y G N P H S R T HA Y G W E S E K A V E DA R A HV A S L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T T QT E HK C V
E N F G N P - S S I Y E L G K I S K HA V E NA R K R V A DA I G A - E E N E I Y F T S G G T E S DNWT V K G V A F A G K H I I T S S I E HHA V
NY Y G N P H S R T HA Y G W E S E A A V E HA R QQV A S L I G A -D P R E I I F T S G A T E S NN L A I K G V A R F K K HV I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L V T T QT E HK C V
NV Y G N P H S R T HA Y G W E T DK A V E E A R K H I A D L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T S QT E HK C V
S R Y G N P H S R T H L Y G W E S DA A V E E A R A R V A S L V G A -D P R E I F F T S G A T E C NN I A V K G V MR F R R HV V T T QT E HK C V
E QY G N P H S R T HMY G W E T E DA I E K A R G E L A S L I G A -NA K E I V F T S G A T E S NNM S L K G V A R F K K H I I T T T T E HK C V
NQY G N P H S K T H S F G W E T E K A V E NA R S Q I A N L I NT -Q P Q S I I F T S G A T E S NNA A L K G L Y G F K NH I I T T QT E HK C V
G MY G N P H S S T HA Y G W E T DK E V E K A R E Y V A A V I K A -D P K E I I F T S G A T E T NNMA I K G V P R F K K H I I T T QT E HK C V
Y I Y G NA H S R NH F F G W E S E K A V E DA R T N L L N L I NG K NNK E I I F T S G A T E S NN L A L I G I C T Y K NH I I T S Q I E HK C I
Y I Y G NA H S R NH F F G W E S E E A V E DA R K N I L H L I NG K NNK E I I F T S G A T E S NN L A L I G I C T Y K NH I I T S Q I E HK C I
Y I Y G NA H S R NH F F G W E S E QA V E DA R A N L I K L L NG NNNK E I I F T S G A T E S NN L A L I G T C T Y K NH I I T S Q I E HK C I
A R Y G N P H S R T H L Y G W E S DQA V E T A R S Q I A D L I G A - S P K E I V F T S G A T E S NN I S V K G V I K F K R HV V T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L V T T QT E HK C V
G L Y G N P H S NT H S Y G W E T NT A V E NA R A HV A K M I NA -D P K E I I F T S G A T E S NNMV L K G V P R F K K H I I T T R T E HK C V
G I Y G N P H S R T HA Y G W E A E K A V E NA R Q E I A S V I NA -D P R E I I F T S G A T E S NNA I L K G V A R F K K H L V S V QT E HK C V
A Y Y G N P H S R T H S Y G W E S DDA V E HA R K QV A N L I G A -DA R E I I F T S G A T E S NN I S V K G T A R F K K HV I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E T A M E T A R K QV A D L I G A -D P R E I I F T S G A T E S NNMA I K G V A R F K R HV I T T QT E HK C V
NMY G N P H S R S H E Y G WA T E K A T E DA R A QV A D L I G A -D P K E I T F T S G A T E S NNQA L K G L A A F K K H I I T T Q I E HK C I
HA F G N P H S R T H S Y G W E A E K A V E T A R A DV A N L I NC - E S K NV I F T S G A T E S NN L A I K G S K S F K NHV I T T Q I E HK C V
HA F G N P H S R T H S Y G W E A E K A V E T A R A D I A N L I NC - E S K NV I F T S G A T E S NN L A I K G S K S F K NHV I T T Q I E HK C V
Y V HG NA H S K QHG F G Q E A MA A V E K A R K S V A D L I NA -K P N E I I F T S G A T E C NN I A I K G A MG Y K K HV I V S S I E HK C V
E MY G N P H S R T H S Y G WT A E E A V E K A R T QV A D L I R A - S P K G V F F T S G A T E S NN I A I K G V A NY K NH L I T L QT E HK C V
E R Y G N P H S R T HR Y G WT A E DA V E K A R A E V A D L I G T - S P K G V F F T S G A T E S NN I A I K G V A Y Y K NH I I T L QT E HK C V
NQY G N P H S R T HA Y G W E S E K G V E E G R E H I A S L I G A -D P K E I I F T S G A T E S NNMA I K G V A H F K NH I I T T QT E HK C V
DMY G N P H S R T H S Y G W E T DT A V E K A R E E I A A L I G A -D P K E I I F T S G A T E S NNMV I K G I A R F K R H I I T T QT E HK C I
Alignment column ruler tick: 990
L D S C R A L E G
L D S C R H L Q Q
L E A A R S M K D
L D S C R H L Q D
L D S C R H L Q D
L D S C R A L E G
L D S C R S L E A
L D S C R Y L E N
L D S C R Y L E N
L D S A R H M Q D
L E A A R G M I N
L D S C R S L E A
L D S C R A L E N
L D S C R W L S T
L S T L R E L E L
L S T L R E L E L
L D S C R V L E T
L D S A R H M Q D
L D S C R H L E M
L D S C R A L E N
L D S C R A L E N
L D T C R N L E E
S E S E K Y L K E
L D S C R S L E A
L D S C R S L E A
L E A A R A M K N
L D S C R Y L E M
L D S C R S L E A
L D S C R H L Q D
L H A C A W L E G
L D S C R S L E A
L D S C R S L E A
L D S C R H L Q D
L D S C R Y L Q Q
L D S C R Q L E R
L D S C R S L E A
L D S C R Y L E E
L D S A R H M Q D
L Q T C R F L Q T
L Q T C R Y L Q T
L Q T C R Y L Q T
L D S C R H L Q Q
L D S C R S L E A
L E A A R A M M K
L D S L R A L Q E
L D S C R V L E G
L D S C R V L E S
L D T C R N L E E
L Q C C R Q L E N
L Q C C R Q L E N
I E S A R A L Q K
L D S C R Y L E M
L D S C R Y L E M
L D S C R R L Q E
L D S C R Y L Q D
Alignment column ruler ticks: 1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070
E G F HV T Y L P V Q S NG L I S M E E L E K A I T P - E T S L V S I MT V NN E I G V K Q P I A E I G R L C V F F HT DA A QA V G K I P L DV NK MN I D L M S I
E G F E V T Y L P V K T DG L V D L E M L R E A I R P -DT G L V S I MA V NN E I G V V Q P M E E I G M I C V P F HT DA A QA I G K I P V DV K K WNV A L M S M
E G F DV T F L NV N E DG L V S L E E L E QA I R P - E T S L V S V M S V NN E I G V V Q P I K E I G A I C V F F H S DA A QA Y G K I P I DV D E MN I D L L S I
E G F E V T Y L P V QNNG L I R M E D L E A A I R P -DT A L V S I MA V NN E I G V I Q P L E E I G K L C V F F HT DA A QA V G K I P L DV NK L N I D L M S I
E G F DV T Y L P V Q S NG L I R M E E L E A A I R P -DT A L V S I MA V NN E I G V I Q P M E E I G K L C I F F HT DG A QA V G K I P L DV NK L N I D L M S I
E G F R I T Y L P V QQNG I I N L K D L E DA I T P - E T S L V S I MT V NN E I G V R Q P I E A I G A I C V F F HT DA A QA V G K V P L DV NT MN I D L M S I
E G F K V T Y L P V K K S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I K E I G Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F K V T Y L P V DK G G MV DM E Q L E Q S I T P - E T C L V S I M F V NN E I G V V Q P I K Q I G E L C V Y F HT DA A QA T G K V P I DV ND L K I D L M S I
E G F K V T Y L P V DK G G MV DM E Q L T Q S I T A - E T C L V S I M F V NN E I G V MQ P I K Q I G E L C V Y F HT DA A QA T G K V P I DV N E MK I D L M S I
E G F E V T Y L P V S S E G L I N L DD L K K A I R K -DT V L V S I MA V NN E I G V I Q P L K E I G K I C V F F HT DA A QA Y G K I P I DV N E MN I D L L S I
E G F DV T F L S V DNQG L I DMK E L E E A I R P -DT C L V S V MA V NN E I G V MQ P L K E I G A L C I Y F HT DA A QA Y G K V P I DV N E MN I D L L S V
E G F QV T Y L P V K K S G I I D L K E L E S A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F K V T Y L P V K P NG I V D L K V L E E S F Q P -DT S L V S I I F V NN E I G - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - QG F E V T Y L P V L P NG L V S I N E L K A A L R P -DT S L V S I MA V NN E I G V I Q P L A E I S QA I P L F HT DA A QA V G K I P I DV E A L G I DA M S I
K G F R V T Y L K V NNK G L I S L E E L E K S I I P G E T I L A S I MHV NN E I G V I Q P MN L I G E I C V L F H S DV A QG L G K I N I DV DK WNA D F L S L
K G F R V T Y L K V NNK G L I S L E E L E K S I I P G E T I L A S I MHV NN E I G V I Q P MN L I G E I C V L F H S DV A QG L G K I N I DV DK WNA D F L S L
E G F D I T Y L P V K S NG L I D L K Q L E DT I R P -DT S L V S I MA I NN E I G V K Q P V K E I G H L C V F F HT DA A QA V G K I P V DV T DWK V D L M S I
E G F E V T Y L P V N E E G L I S L DD L R K S I R K -DT S L V S I MA V NN E I G V V Q P L K E I G K I C I F F HT DA A QA Y G K I D I DV N E MN I D L M S I
E G F K V T Y L P V G E NG L V D L E L L K NT I T P -QT S L V T I MA V NN E I G V V Q P I K E I G K I C V F F HT DA A QA V G K I P I DV NDMN I D L L S I
E G F K V T Y L P V L A NG L I D L QQ L E E T I T S - E T S L V S I MT V NN E I G V R Q P V D E I G K L C V F F HT DA A QA V G K V P L DV NA MN I D L M S I
E G F T V T Y L P V QT NG I I D L K Q L E E A L T P - E T S L V S I MA V NN E I G V K Q P I D E I G R L C V F F HT DA A QA V G K I P MDV NA MN I D L M S I
NG V E V T Y L P V G NDG V V D I DDV K K S I K E -NT V L V S I G A V N S E I G T V Q P L K E I G M L C V L F HT DA A QG V G K I Q I DV N E MN I D L L S M
R G I E V I K M P V N E DG V V D P K D L E R L I DD -K T A L V S C MWV NN E T G L I M P V E E L C K I A A L F H S DA T QA MG K I K V S V K DV P V DY L T F
E G F Q I T Y L P V QK NG L I D L K E L E A A F Q P -DT S L V S V MA V NN E I G V K Q P I R D I G E I C V F F HT DA A QA V G K I P L DV ND S K I D L M S I
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G R I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G Y E I T F L NV D E QG L I N L E E L E A A I R P - E T C L V S V MA V NN E I G V MQ P L K E I G E L C V F F HT DA A QA Y G K I P I DV N E MK I D L M S I
E G F E V T Y L P V QK NG I L D L K V L E A A I K P -T T C L V S C MA A HN E I G V L Q P I R E I G A L C V L F HT DA A QA L G K V K V DV NA DN I D L M S M
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F E V T Y L P V QN S G L V D L K E L E A A MR P - E T A L V S I MT V NN E I G V I Q P V E E I G K MC I F F HT DA A QA V G K I P MDV NA MN I D L M S I
QG F E V T Y L P V DR Y G MV S P E E L K NA I R D -DT I L I S I M L A NN E I G T I Q P V E E I G K I S I Y F HT DA V QA I G HV P I DV K K MNV D L L S L
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S I MT V NN E I G V K Q P I A D I G R I C V Y F HT DA A QA I G K I P L NV NDMK I D L M S I
E G F R V T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I R Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F E V T Y L P V K S S G L I DMA E L E A A I R P -DT A I V S I MA V NN E I G V I Q P L E E I G K L C I F F HT DA A QA V G K I P V DV NA MN I D L M S I
E G F E V T Y L P V R P DG L V DV A Q L A DA I R P -DT G L V S V MA V NN E I G V V Q P L E E I G R I C V P F HT DA A QA L G K I P I DV NQMG I G L M S L
E G F DV T Y L P V K E NG L V D L K E L E A A MR D -DT A I V S V MA V NN E I G V I Q P L K A I G E L C I F F HT DG A QA V G K V P MDV NDMN I D L M S I
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G R I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
K G V E V T Y L P V D S NG L I S L QQ L Q E S I K S -NT L C V S V M L V NN E I G V I QN L K E I S R I C V Y V H S DMA QA I A K I P V DV QD L D I D L G S I
E G F DV T Y L P V D E HG L I S L DD L K A A I R K -DT I L V S V MA V NN E I G V V Q P L K E I G K I C I F F HT DA A QA Y G K I D I DV NDMN I D L L S I
K G F E V T Y L K P DT NG L V K L DD I K N S I K D -NT I MA S F I F V NN E I G V I QD I E N I G N L C I L F HT DA S QA A G K V P I DV QK MN I D L M S M
K G F E V T Y L K P E P NG I V K L E D I E K N I K E -NT I MA S F I HV NN E I G V I QD I E N I G L L C V I F HT DA S QA I G K I P I DV QK MN I D L L S M
K G F E V T Y L K P DA NG L I K L E D L K N S I K E -NT I L A S F I Y V NN E I G V I QD I E N I G K I C I I F HT DA S QA V G K I K I DV QK L N I D L L S L
E G F E V T Y L P V G NDG I V D L E K L K G S I R P -DT G L V S V MA V NN E I G V I Q P M E E I G E I C V P F HT DA A QA L G K I P I DV DK WNV S L M S L
E G F R V T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G Q I C L Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F E V T F L NV DDQG L I D L K E L E DA I R P -DT C L V S V MA V NN E I G V I Q P I K E I G A I C I Y F HT DA A QA Y G K I H I DV N E MN I D L L S I
E G F E V T F L P V QT NG L I N L D E L R DA I R P -DT V C V S V MA V NN E I G V C Q P L E E I G K I C V F F H S DA A QG Y G K I D I DV NR MN I D L M S I
E G F D I T Y L P V K P NG I I D L K E L E A A F R P -DT V L C S I MA I NN E I G V K Q P MK Q I G E MC V F F HT DA A X A V G K I P V DV NDMK I D L M S I
E G F S V T Y L P V QK NG L V D L E L L E A S I R P -DT S L L S V MT V NN E I G V QQ P I D E I G R I C V F L HT DA A QA V G K I P I NV S DWK V D L M S I
QG Y E I T Y L P V QK NG L V D L E V F K NA I R P -DT L V A S I I L V HN E I G V I QD I K T I G K I C V F F HT DA A QA L G K I P I NV D E MN I D L M S M
E G Y S V T Y L K P DK Y G M I L P D L V R K N I R P - E T F L C S V I HV NN E I G V I QN I S E I G R I C V I F HT DA A Q S F G K L P I D L K N L DV D L L S I
E G Y S V T Y L K P DK Y G M I L P E E V R K N I R P - E T F L C S V I HV NN E I G V I QD I A E I G K V C V I F HT DA A Q S F G K L P I D L K N L E V D L L S I
E G F DA T F L QV G K DG R V D P K E V A K N I R P -DT G L V S C M L V NN E I G S I N P V Q E I S K I C V W F HT DA A QG F G K I P I DV K K I G A N F M S I
E G F E V T Y L P V E K NG I V N L QK L E E A I R P -T T A L V S C MY V NN E I G V I Q P I G E I G K I C V L F HT DA A QA V G K L D I DV DR DN I D L M S V
DG F E V T Y L P V E K NG L V N L QK I E E A I R P -T T A L V S C MY V HN E I G V I Q P I S E I G N L C V L F HT DA A QA L G K V S I DV E R DN I D L M S L
E G F E V T Y L P V Q S NG L I D L K Q L E E A L R P -T T A L V S I MT V NN E I G V I Q P I K E I G Q L L P F F HT DA A QA A G K I R L DV N E L G I D L M S L
E G F E V T Y L P V L S S G L I DMK Q L E A A I R P -DT A L V S I MA V NN E I G V I Q P I A E I G A L C V F F HT DA A QA V G K I P I DV NA DK I DV M S I
Alignment column ruler ticks: 1090, 1100, 1110, 1120, 1130, 1140, 1150, 1160
S G HK I Y G P K G I G A L Y V -R R -K P R V R V E A I Q S G G G Q E R G L R S G T V P T P L A V G L G A A C E I A A R E MA Y DHR WM E F L S K R L NG D - - P
S A HK I Y G P K G V G A L Y V -R R -R P R I R L E P L MNG G G Q E R G L R S G T G A T QQ I V G F G A A C E L A MK E M E Y D E K W I K G L Q E R L NG S - -M
S S HK I Y G P K G I G A L Y V -R R -R P R V R M E P L L S G G G Q E R G F R S G T L P P P L V V G L G HA A K L MV E E Y E Y D S A HV R R L S DR L NG S - -A
S S HK I Y G P K G I G A C Y V -R R -R P R V R L E P I I S G G G Q E R G L R S G T L A P H L V V G F G E A C R I A S QDM E Y DR K HV E R L S K R L NG D - -A
S S HK I Y G P K G MG A C Y V -R R -R P R V R L E P I I S G G G Q E R G L R S G T I A P H L V V G F G E A C R I A Y E DM E Y D S K H I A R L S K R L NG D - - P
S G HK V Y G P K G V G A L Y I -R R -R P R V R V E P I Q S G G G Q E R G MR S G T V P T P L V V G L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L A E R L NG D - - P
S G HK I Y G P K G A G A L Y V -R R -R P R V R V E A QM S G G G Q E R G L R S G T V A A P L C I G L G E A A R I A G R E M E MDK A HV E R L S R M L NV D - - E
S A HK I Y G P K G A G A L Y V -R R -R P R V R I E A QM S G G G Q E R G L R S G T V A A P L C I G L G E A A K I A DK E MA MDK A HV E R L S QM L NG D - -A
S S HK I Y G P K G I G A C Y V -R R -R P R V R L D P I I T G G G Q E R G L R S G T L A P P L V A G F G E A A R L MK Q E S S F DK R H I E K L S S K L NG C NDA
S S HK I Y G P K G I G A L Y V -R R -R P R V R L E P L L S G G G Q E R G L R S G T L A P P L V A G F G E A A R L MH E E Y NA D I A H I DK L S S K L NG S - -A
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L A DR L NG D - - P
----------------------------------------------------------------------------------S G HK L Y G P K G V G A A Y V -R R -R P R V R L E P L I HG G G Q E R G L R S G T V A A P L V V G L G E A C R I A E N E MA A DHA R I K A L S DR L NG D - - S A HK V Y G P K G I G A F Y I -R S -K P R R R I K P L I F G G G Q E R G MR S G T M P V P L A V G F G E A C K I A S S E MN S D S I HV K S L Y DK L NG C - -G
S A HK V Y G P K G I G A F Y I -R S -K P R R R I K P L I F G G G Q E R G MR S G T M P V P L A V G F G E A C K I A S S E MN S D S I HV K S L Y DK L NG C - -G
S A HK I Y G P K G V G A L F V -R R -R P R V R L E P L Q S G G G Q E R G L R S G T V P T P L A V G L G A A C E I A QQ E L E Y DHK R V S L L A NR L NG D - - P
S S HK I Y G P K G I G A C Y V -R R -R P R V R L D P I V T G G G Q E R G L R S G T L A P P L V A G F G E A S R L MK E E MD - - - - - - - - - - - - - - - - - - S G HK I Y G P K G V G A L F V -R R -R P R V R I E P I T T G G G Q E R G I R S G T V P S T L A V G L G A A C D I A L K E MNHDA A WV K Y L Y DR L NG D - - L
S G HK I Y G P K G V G A L Y V -R R -R P R V R L E P I Q S G G G Q E R G L R S G T V P A P L A V G L G A A A E L S L R E MDY DK K WV D F L S NR L NG D - -A
S G HK I Y G P K G V G A L Y V -R R -R P R V R L E P I Q S G G G Q E R G L R S G T V P A S L A V G L G A A A E L S QQ E M E Y DK K W I D F L S NR L NG D - -A
C A HK I Y G P K G I G A L Y V -R R -R P R V R MV P L I NG G G Q E R G L R S G T V A S P L V V G F G K A A E I C S K E MK R D F E H I K E L S K K L NG S - - T A HK F HG P K G V G A L F I -R A G K P - - - I T P L L HG G E QMG G L R S G T I DT P S V V G MA V A L K K A T HD I N I E NT Y V R K L R DK L V G K - - P
S G HK I Y G P K G V G A I Y V -R R -R P R V R L E P L Q S G G G Q E R G L R S G T V P T P L A V G L G A A C E V A Q E E M E Y DHK R I S Q L A E R L NG D - -R
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L S E R L NG D - - P
S S HK I Y G P K G I G A I Y V -R R -K P R V R L D P L I S G G G Q E R G L R S G T L A P P L V A G F G E A A R L MMK E Y E ND S NH I K R L S DK L NG S - -A
S S HK V Y G P K G C G A L Y V -R R -R P R V R L R S P V S G G G Q E R G V R S G T V A A A L V V G MG A A C E V A MK E WK R DA A HT E R L Q E R L NG D - - L
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L A E R L NG D - - P
S G HK I Y G P K G I G A C Y V -R R -R P R V R L D P I I S G G G Q E R G L R S G T L A P P L I V G F G E A C R I A K Q E M E Y D S K R V K Y L S DR L NG H - - P
S G HK F G G P K G C G A L Y I -R K - - -G T K I E A F L HG G A Q E R K R R A G T E NV P S I V G L G K A I G L A T G E M E E T NK P L L E MR E R L NG H - - P
S G HK L Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A Q E E M E NDHK R I S M L A E R L NG D - - P
S G HK L Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E L A QQ E M E Y DHK R I S K L A E R L NG D - - P
S S HK I Y G P K G I G A C Y V -R R -R P R V R L D P I I S G G G Q E R G L R S G T L A P P L V V G F G E A C R I A K E E M P Y D S K R I K H L S DR L NG D - - P
S A HK I Y G P K G V G A L Y L -R R -R P R I R V E P QM S G G G Q E R G I R S G T V P T P L V V G F G A A C E I A A K E MDY DHR R A S V L QQR L NG S - -M
S G HK F Y G P K G I G A L Y V -R R -R P R V R M E P I I NG G G Q E R G L R S G T L P T P L I V G I G E A A R V A QK E L QR D E E HV NR L A K R L NG D - -R
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L S E R L NG D - - P
S A HK L Y G P K G I G A L Y V -R R -K P R V R L QQ I I HG G G Q E R G L R S G T L A P H L C V G F G K A A E I A L T E L P Y D I QHV DK L Y NR L NG S - - L
S S HK I Y G P MG I G A C Y V -R R -R P R V R L D P I I T G G G Q E R G L R S G T L S P P L V A G F G E A A R L MK E E MDY DK A H I T R L S NK L NG S NN P
S G HK L Y G P K G I G A L Y I K R K -K P N I R L NA L I HG G G Q E R G L R S G T L P T H L I V G F G E A A K V C S L E MNR D E K K V R Y F F NY V NG C - -Q
S G HK L Y G P K G I G A L Y I K R K -K P N L R L NA L I HG G G Q E R G L R S G T L P T H L I V G L G E A A N L G S I E MNR DHK K MK F F F DY V NG C - -Q
S S HK L Y G P K G V G A L Y I K R K -K P N I R L NA I I HG G G Q E R G L R S G T L P T H L I V G L G E A A N I C L S E MDR DNK K MN F F F NY V NG C - -Q
S G HK I Y G P K G V G A L Y M -R R -R P R I R V E P QMNG G G Q E R G I R S G T V P T P L V V G MG A A C E L A K K E M E Y DDK R I R A L H E R MNG S - -V
S G HK L Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E L A QQ E M E Y DHK R I S K L A E R L NG D - - P
S S HK I Y G P K G I G A I Y V -R R -R P R V R L E P L L S G G G Q E R G L R S G T L A P P L V A G F G E A A R L MK K E F DNDQA H I K R L S DK L NG S - - P
S A HK I Y G P K G I G A A Y V -R R -R P R V R L E P L I S G G G Q E R G L R S G T L A P S QV V G F G T A A R I C K E E MK Y DY A H I S K L S QR L NG D - - P
S G HK I Y G P K G I G A L Y V -R R -R P R V R V E A L Q S G G G Q E R G MR S G T L P A P L V V G L G A A C E V S QQ E M E Y DHK R I S A L S E R L NG D - - P
S G HK I Y G P K G V G A L Y V -R R -R P R V R L E P L Q S G G G Q E R G L R S G T V P T P L A V G L G A A C S V A QQ E I E Y DHQR V S M L A NR L NG D - - P
S S HK V Y G P K G I G G L Y V -R R -K P K V R I L P I I NG G G Q E R G L R S G T L A P H L C V G F G E A C E I A K R E MDNDK K H I QR L S E K F NG D - -K
S G HK I Y G P K G V G A L F V -R T -K P R I R L Q P I I DG G G Q E R G L R S G T L P T A L V V G L G T A A K I A K M E MK R DQ L HM E N L F F K L NG S I K P
S G HK I Y G P K G V G A L F V -R T -K P R I R L Q P I I DG G G Q E R G L R S G T L P T A L V V G L G T A A K I A K M E M E R DHR HM E N L F F K L NG S I K P
S G HK I HG P K G I G A L Y V - S S -R P R S R V E P I I NG G G Q E R N I R S G T L A V P L I V G L G K A A E I A K R E MK Y D S P Y I E S L G K H L NG S - - L
S S HK I Y G P K G C G A L Y M -R R -R P R V R V R S P V S G G G Q E R G V R S G T I A T P L A V G L G A A C E L A K V E MK R D S E R I A Q L S K R L NG D - -V
S S HK I Y G P K G C G A L Y M -R R -R P R V R V R S P V S G G G Q E R G V R S G T V A T A QV V G MG A A C A I A K V E M E R D S A H I S R L S K R L NG D - - L
S S HK L Y G P MG I G A C Y I -R R -R P R V R L E P I I NG G G Q E R G L R S G T L A P P L I A G F G E A A R L A K Q E L A Y DHA H I S K L S QR L NG D - - S G HK L Y G P MG I G A C Y V -R R -R P R V R L E P I I T G G G Q E R G L R S G T L A A P L V A G F G E A A R L C R Q E M P Y DT A H I K K L S DK L NG D - -A
Alignment column ruler ticks: 1170, 1180, 1190, 1200, 1210, 1220, 1230, 1240
V Q S Y P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T I E E V DY T A E K C I
D S R Y V G N L N L S F A Y V E G E S L L MG L - -K E V A V S S G S A C T S A S L E P S Y V L R A L G V D E DMA HT S I R F G I G R F T T K E E I DK A V E L T V
DHR Y P G C V NV S F A F V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA I G R DDA L A H S S I R F G I G R F T T E A E V DY V I K A I T
E R HY P G C V NV S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G S S D E S A H S S I R F G I G R F T T D S E I DY V L K A V Q
DR HY P G C V N I S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G S S D E S A H S S I R F G I G R F T T D S E I DY V L K A V Q
- - - - - - - - - - - - - - - -G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G L G R F T T I E E V DY T A E K T I
E HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
K HA Y P G C V N L S F A Y V E G E S L L MA L - -K S I A L S S G S A C T S A S L E P S Y V L R A I G S E E D L A H S S I R F G L G R F T T E E E V K HT I D L C V
R HA Y P G C V N L S F A Y V E G E S L L MA L - -K S I A L S S G S A C T S A S L E P S Y V L R A I G S E E D L A H S S I R F G L G R F T T D E E V K HT I D L C I
K S QY P G C V NV S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L HA L G A DDA L A H S S I R F G I G R F T T E A E V DY V I QA I N
E K R Y P G C V NV S F A Y V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA L G K DDA L A H S S I R F G I G R F T T E E E V DY V L K A I T
----------------------------------------------------------------------------------V NG Y P G C V N L S F S Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G A A E DMA H S S L R F G I G R F T T E E E I D L V V QR I V
V NR M F G N L N L S F T G V E G E S L MMK L - -Y S L A L S S G S A C T S A S L E P S Y V L R A I G V G E DV A HT S I R F G L G R F T K H E DV DK A V K E I V
V NR M F G N L N L S F T G V E G E S L MMK L - -Y S L A L S S G S A C T S A S L E P S Y V L R A I G V G E DV A HT S I R F G L G R F T K H E DV DK A V K E I V
DQR Y P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G I G R F T T E E E V DY T A E K C I
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -V C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - I F V - - - - - - - - - - - - NA R Y Y G N L N I S F S Y V E G E S L L MA I - -K DV A C S S G S A C T S S S L E P S Y V L R S L G V E E DMA H S S I R F G I G R F T T E Q E I DY T I E I L K
K A T Y NG C L N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T V E E V DY T A DK C I
V A T Y NG C L N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G L G R F T T V E E V DY T A DK C I
E K G F P G C V NV S F P F V E G E S L L MH L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G R DD E L A H S S I R F G I G R F T MA K E I D I V A NK T V
E L R V P NT I L V A F K G V E G E A M L WD L NK HG I A A S T G S A C A S E S L QA N P T F K A MK F G E D L S HT G I R L S L S R F NT E E E I DY T I D I I K
E HR Y P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G I G R F T T E E E I DY T V QK C I
K HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
DHR Y P G C V N I S F A Y V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA L G K DDA L A H S S I R F G I G R F T T D E E I DY V I K A I T
K HR L P G N L N I S F S C V E G E S L L MG M - -R DV A V S S G S A C T S A S L E P S Y V L R A L G V DA E NA HT S I R F G I G R F T T A K E V D L V I E E C V
DH F Y P G C V NV S F A Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G N S D E S A H S S I R F G I G R F T T E R E I DY V L K A V Q
T E R L A NNV NV T F E Y I E G E S L L L L L NA K G I F A S T G S A C N S T S L E P S HV L T A C G V P H E I V HG S L R L S L G R MNT L E DV DR V L E V L P
QQHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T A E K C I
K QHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V F R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T A E K C I
NH F Y P G C V NV S F A Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G N S D E S A H S S I R F G I G R F T T E Q E I DY V L K A V T
E HR Y P G N L N L S F A Y V E G E S L L MG L - -K E V A V S S G S A C T S A S L E P S Y V L R A L G V E E DMA HT S I R F G I G R F T T E E E V DR A I E L T V
E A R Y HG NV NM S F A Y V E G E S M L MG L - -K E I A V S S G S A C T S A S L E P S Y V L R A L G V N E E MA HT S V R Y G L G R F T T E A E V DR A I E A T V
K HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
E HR Y K G N L NV S F A F V E G E S L I MA I - -K QV A V S S G S A C T S A S L E P S Y V L R A L G V Q E DMA HT S L R I G I G R F T T E K E V D F L L DQ L S
E S QY P G C V N I S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L HA L G A DDA L A H S S I R F G I G R F T T E E E V DY V I K A I N
I NR Y Y G NMN I S F L F V E G E S L L M S L - -N E I A L S S G S A C T S S T L E P S Y V L R S I G I S E D I A HT S I R I G F NR F T T F F E V QQ L C I N L V
T NR Y F G NMNV S F L F V E G E S L L M S L - -N E I A L S S G S A C T S S T L E P S Y V L R S I G I S E D I A HT S I R I G F NR F T T F F E V QQ L C E N L V
I NR Y F G NMN I S F L F V E G E S L L M S L - -ND I A L S S G S A C T S S T L E P S Y V L R S I G I T E E I A HT S I R I G F NR F T T F F E V QQ L C K N L V
E R R Y A G N L N L S F A Y V E G E S L L MG L - -K DV A V S S G S A C T S A S L E P S Y V L R A L G V D E DMA HT S I R F G I G R F T T E E E I DR A I E L T V
K QHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V QK C I
DHR Y P G C V NV S F A Y V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA L G K DDA L A H S S I R F G I G R F S T E E E V DY V V K A V S
K S R Y P G C V N I S F NY V E G E S L L MG L - -K N I A L S S G S A C T S A S L E P S Y V L R A I G Q S D E NA H S S I R F G I G R F T T E A E I DY A I E NV S
D E T Y P G C V N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A Q E D L A H S S I R F G I S R F T T E E E V DY T A E K C V
NQR Y P G C V N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G I G R HHR R R S G L HG R K MY L
DQR Y V G N I N I S F E F V E G E S L MMG I - -K QC A V S S G S A C T S A S L E P S Y V L R A L G V N E E L A HT S L R I G F G R F T T D E E V DY L I N L L S
G E R Y F G N L NM S F E F I E G E S L L M S L - - S N F A L S S G S A C T S A S L E P S Y V L R S L DV S E E L A HT S I R F G L G R F T M E S E V DMA L E S I T
G QR Y F G N L NM S F E F I E G E S L L M S L - - S N F A L S S G S A C T S A S L E P S Y V L R S L DV S E E L A HT S I R F G MG R F T I E S E V DMA L D S I T
E HR W F G C V N I S F E A V E G E S L MA T I - - P N F G V S S G S A C T S A S L E P S Y V L K G I G V G D E L A HT S L R I G I S K F T T R E E V DQ F V E L L E
E R R F HG N L N I S F A C V E G E S L L MG M - -K K V A V S S G S A C T S A S L E P S Y V L R A L G I DA E NA HT S I R F G I G R F T T E R E V DV T V E E C A
E K R Y P G N L N I S F S C V E G E S L L MG M - -K NV A V S S G S A C T S A S L E P S Y V L R A L G I DA E NA HT S I R F G I G R F T T E R E I DV T I E E C V
QNG Y P G C L N L T F QY V E G E S L L MA L - -K D I C L S S G S A C T S A S L E P S Y V L R A L G L ND E NA H S S L R F G I G R F T T E E E V DY V A DK I I
E HHY P G C V N I S F A Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G A DDA L A H S S I R F G I G R F T T E A E V DY V L K A V Q
Alignment ruler (column positions): 1250 to 1320 in steps of 10.
Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264.
K HV T R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S A G G K F R I S L G L P V G A V I NC A DNT G A K N L Y V I A V HG I R G R L NR L P A A
K QV E K L R E M S P L - - - - - - - -Y E M S K R - - - - - -G R G G T S G NK F R M S L G L P V A A T V NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
E R V E F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P T G A I MNC A DN S G A R N L Y I MA V K G S G S R L NR L P A A
DR V H F L R E L S P L - - - - - - - -W E - - - - - - - - - - - - - - - - - - - - -MT L G L P C G A V MNC C DN S G A R N L Y I I S V K G V G A R L NR L P A A
DR V H F L R E L S P L - - - - - - - -W E M S A R - - - - - -G R G G A S G NK L K MT L G L P C G A V L NC C DN S G A R N L Y I I S V K G I G A R L NR L P A A
R HV E R L R E M S P L - - - - - - - -W E I K T L - - - - - -G R G G S A G A K F R I S L G L P V G A V I NC A DNT G A K N L Y V I A V QG I K G R L NR L P A A
HHV K R L R E M S P L - - - - - - - -W E M S K L - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
R E T E R L R E L S P L - - - - - - - -W E M S K R - - - - - -G R G G A S G A K F R I S L G L P V G A V MNC A DNT G A K N L F V I S V Y G I R G R L NR L P S A
R E T NR L R D L S P L - - - - - - - -W E M S K R - - - - - -G R G G A S G A K F R I S L G L P V G A V MNC A DNT G A K N L F V I S V Y G I R G R L NR L P S A
E R V D F L R K M S P L - - - - - - - -W E - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -MNC A DN S G A R N L Y V L A V K G T G A R L NR L P A A
E R V K F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P T G A I MNC A DN S G A R N L Y I MA V K G S G S R L NR L P A A
HHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
- - - - - - - - - - - - - - - - - - - - - -M S X X - - - - - -X X X X X X X X X F R I S L S L P V G A V V NC A DNT G A K N L Y I I A V K G I R G R L NR L P A A
S V V NK L R DM S P L - - - - - - - -W E M S - - - - - - - - I K S A A A G T K F R M S L G L P V G A V MNC A DN S G A K N L Y V I S V I G F G A R L NR L P A A
E S V T L L R K M S P L - - - - - - - -WDM -K R - - - - - -G R G A A G G A K MR I T L G L NV G A L I NC C DN S G G K N L Y I I A V K G T G S C L NR L P S A
E S V T L L R K M S P L - - - - - - - -WDM -K R - - - - - -G R G A A G G A K MR I T L G L NV G A L I NC C DN S G G K N L Y I I A V K G T G S C L NR L P S A
HQV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - L G L P V G A V V NC C DN S G A R N L Y I V S V K G F G A R L NR L P A A
K NV QR L R DM S P L - - - - - - - -W E M - - - - - - - - - S K A QA V G S NY R V S L G L P V G A V MN S A DN S G A K N L Y V I A V K G I K G R L NR L P S A
K HV E R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G T A G G K F R I S L G L P V G A V MNC A DNT G A K N L Y V I A V HG I R G R L NR L P A A
K HV E R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G T A G G K F R I S L G L P V G A V MNC A DNT G A K N L Y V I A V HG I R G R L NR L P A A
E A V QK L R E M S P L - - - - - - - -Y E MA A E K K T E V L E K K I S I K P R Y K MT R G I QV E T L MK C A DN S G A K I L R C I G V K R Y R G R L NR L P A A
K S V DR L R Q L S S T - - - - - - - -Y A M P K R - - - - - -G A G G R QG NK F R V T C G L NNA S T V NC A DNT G A K T L T I I S V K G F HG R L NR L P R A
QHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I K G A DNT G A K N L Y I I S V K G I K G R L NR L P A A
QHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
E R V D F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P T G A I MNC A DN S G A R N L Y I MA V K G S G S R L NR L P A A
R NV E R L R E L S P L - - - - - - - -WDM -G K - - - - - -DQA NV K G C R F R V S V A L P V G A V V NC A DNT G A K N L Y V I S V K G Y HG R L NR L P S A
E R V S F L R E L S P L - - - - - - - -W E M -A K - - - - - - L S R G A P G G K L K MT L G L P V G A I MNC A DN S G A R N L Y I I S V K G I G A R L NR L P A G
E I V QK L R NM S P L - - - - - - - -T P - - - - - - - - - - - - - -MK G MR S N I P R A L NA G A Q I A C V DNT G A K V V E I I S V K K Y R G V K NR M P C A
E R V G F L R E L S P L - - - - - - - -W E M -A K - - - - - -Q S R G A P G G K L K MT L G L P V G A I MNC A DN S G A R N L Y I I S V K G I G A R L NR L P A G
HQV K K L R DM S P L - - - - - - - -Y E M S K R - - - - - -G R G G S A G NK F R M S L G L P V A A T V NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
R QV E K L R E M S P L - - - - - - - -W E M S K R - - - - - -G G G NA S G T K Y K M S Y G V P V G A V V NC A DNT G A K N L Y L I A V K R WG S R QNR L P A A
G A V R K L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G QV G I K L R I T L A C NV G A V L NC A DN S G A K N I Y V I S T F G I K G H L S R L P S A
E R V E F L R K M S P L - - - - - - - -W E M - - - - - - - - - S G S G A S G NK F R M S L A L P V G A V MNC A DN S G A R N L Y V L A V K G V G A R L NR L P A A
K S V E R L R S I S P L - - - - - - - -Y E M -K R - - - - - -G R A G T L K NK MR I T L S L P V G A L I NC C DN S G G K N L Y I I A V QG F G S C L NR L P A A
K S V K R L R S I S P L - - - - - - - -Y E M -K R - - - - - -G R A G T L K NK MR I T L S L P V G A L I NC C DN S G G K N L Y I I A V QG F G S C L NR L P A A
K S V K R L R S I S P L - - - - - - - -Y E M -K R - - - - - -G R A G T L K NK MR I T L S L P V G A L I NC C DN S G G K N L Y I I A V QG F G S C L NR L P A A
QQV E K L R E M S P L - - - - - - - -Y E M S K R - - - - - -G R G G S A G NK F R M S L G L P V A A T V NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
DR V K F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P V G A I MNC A DN S G A R N L Y I I A V K G S G S R L NR L P A A
R QV S F L R NM S P L - - - - - - - -WDM - S R - - - - - -G R G A A S G T K Y R MT L G L P V QA I MNC A DN S G A K N L Y I V S V F G T G A R L NR L P A A
H E V T Q L R E M S P L - - - - - - - -W E - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -G K N L Y I I A V S G I G G R L NR L P NA
P C L QT E G D E S S MG DG S G R HR Y Q - - - - - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
E K V S K L R E M S P L - - - - - - - -W E A A A R - - - - - -G R G G QV G T K A K V S L G L P V G A V MNC A DN S G A K N L Y T I A C F G I K G H L S K L P S A
K V V E K L R N L S P L - - - - - - - -Y E M -K R - - - - - -G R G G S G G NK L R V T L G L P V G A L I NC C DN S G G K N L Y L I A V K G T G A C L NR L P S A
K V V E K L R N L S P L - - - - - - - -Y E M -K R - - - - - -G R G G S G G NK L R V T L G L P V G A L I NC C DN S G G K N L Y L I A V K G T G A C L NR L P S A
HA V K H L R D L S P L - - - - - - - -W E M S K R - - - - - -G R T G QQG T K F A MT A G L P V G A V I NC C DN S G A K NM F I I S V R G HK G R L NR L P A A
R T V E R L R E M S P L - - - - - - - -WDM -G K - - - - - -DK A NV K G C R F R V S L A L P V G A V V NC A DNT G A K N L Y I I S V K G Y HG R L NR L P A A
R NV E R L R E M S P L - - - - - - - -WDM -G K - - - - - - E K A NV K G C R F R V S L A L P V G A V V NC A DNT G A K N L Y I I S V K G Y HG R L NR L P A A
K V V NK L R DM S P L - - - - - - - -W E M - - - - - - - - - - S K A A V G T K F R MT L A L P V G A V MNC A DN S G A K N L F V I A V HG I G A R L NR L P A A
E R V N F L R E L S P L - - - - - - - -W E M - - - - - - - - - - - S G A S G T K Y K M S MA L P V G A I MNC A DN S G A R N L Y V I A V K G C G A R L NR L P A A
Alignment ruler (column positions): 1330 to 1400 in steps of 10.
Sequence labels as listed above (A_fumigatus/1-3241 to T_brucei/1-3264).
G V G DM F V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
C V G DMV MA T V K K G K P D L R K K V L P A V I V R QR K P WR R K DG V F MY F E DNA G V I V N P K G E MK G S A I T G P I G K E C A D L W P R I A - - - S A
S L G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K P WR R K DG V Y L Y F E DNA G V I A N P K G E MK G S A I T G P V G K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V M P A V V V R Q S K P WR R P DG I Y L Y F E DNA G V I V NA K G E MK G S A I T G P V G K E A A E L W P V S S L L F S N
G V G DMV MA T V K K G K P E L R K K V M P A V V V R Q S K P WR R P DG I Y L Y F E DNA G V I V NA K G E MK G S A I T G P V G K E A A E L W P R I A - - - S N
G S G DM I V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DM F V C S V K K G K P E L R K K V L QG V V I R QR K Q F R R K DG T F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - -A N
G V G DM F V C S V K K G K P E L R K K V L QG V V I R QR K Q F R R K DG T F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - -A N
A A G DMV MA T V K K G K P E L R K K V M P A I V I R Q S K P WR R R DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S L G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K A WR R K DG V Y L Y F E DNA G V I A N P K G E MK G S A I T G P V G K E C A D L W P R V A - - - S N
G V G DMV MA T V K K G K P X L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V X K E C A D L W P X I A - - - S N
G V G D I V L A T V K K G K P E L R K K V H P A V I I R Q S K S Y R R K HG QM I Y F E DNA G V I V NQK G E MK G - - - - - - - - - - - - - - - - - - - - - - -
A A G DMV MA S V K K G K P E L R K K V M P A V I C R QR K P WR R R DG I F L Y F E DNA G V I V NA K G E MK G S A I NG P V A K E C A D L W P R I A - - - S N
S I G DMV L A T V K K G K P E L R K K V W P A V I V R QR K A F R R P E G T F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E C A E L W P K V S - - -A A
S I G DMV L A T V K K G K P E L R K K V W P A V I V R QR K A F R R P E G T F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E C A E L W P K V S - - -A A
S A G DMV MA T V K K G K P E L R K K I M P A I V V R QA R P WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V C T G L V V R QR K HWK R K DG V Y I Y F E DNA G V MC N P K G E V K G N - I L G P V A K E C S D L W P K V A - - -T N
G V G DM F V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DM F V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
A P G D I C V V S V K K G K P E L R K K V HY A I L I R QK K I WR R T DG S H I M F E DNA A V L I NNK G E L R G A Q I A G P V P R E V A DMW P K I S - - - S Q
G C G DMV V A T C K K G K P E Y R K K MHT A V I I R QR R T WR R K DG V T L Y F E DNA A V I V NMK G E MK G S A I T G P V S K E S A D L W P K I S - - - S N
S L G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K A WR R K DG V F L Y F E DNA G V I A N P K G E MK G S A V T G P V G K E C A D L W P R I A - - - S N
A L G DMV MC S V K K G K P E L R K K V L NA V I I R QR K S WR R K DG T V I Y F E DNA G V I V N P K G E MK G S G I A G P V A K E S A D L W P K I S - - -T H
G V G DMV MA T V K K G K P E L R K K V H P A V I V R Q S K P WK R T DG V F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E A A E L W P R I A - - - S N
G I G DMC V V S V K K G T P E MR K QV L L A V V V R QK Q E F R R P DG L HV S F E DNA MV I T D E E G I P K G T D I K G P V A R E V A E R F P K I G - - -T T
G V G DMV MA T V K K G K P E L R K K V H P A V I V R Q S K P WK R F DG V F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E A A E L W P R I A - - - S N
C V G DMV MA T V K K G K P D L R K K V M P A V I V R QR K P WR R K DG V Y MY F E DNA G V I V N P K G E MK G S A I T G P I G K E C A D L W P R I A - - - S A
N P G S MV MA T V K K G K P D L R K K V F P A I I V R QR K P I R R K E G L I I Y F E DNA G V I C N P K G E MK G S A I A G P V A K E C A D L W P R V A - - - S A
S I G DMV L C S V K QG K P A L R K K V MQA V V V R QR K P Y R R R E G Y Y I Y F E DNA G V I I N P K G E MK G S A I T G P V G K E A A D L W P K I A - - - S A
S A G DMV MA T V K K G K P E L R K K V M P A I V I R Q S R P WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S L G DMV L A T V K K G K P D L R K K V L NA I I C R Q S K A WR R H E G Y Y I Y F E DNA G V I V N P K G E MK G S A I T G P V A R E C A E L W P K L S - - - S A
S L G DMV L A T V K K G K P D L R K K V L NA I I T R Q S K A WR R H E G Y F I Y F E DNA G V I V T P R -R MK G S A I T G P V A R E C A E L W P K L S - - - S A
S L G DMV L A T V K K G K P D L R K K V L NA I I T R Q S K A WR R H E G Y F I Y F E DNA G V I V N P K G E MK G S A I T G P V A R E C A E L W P K L S - - - S A
C V G DMV MA T V K K G K P D L R K K V M P A V I V R QR K P WR R K DG V F MY F E DNA G V I V N P K G E MK G S A I T G P I G K E C A D L W P R I A - - - S A
S L G DMV MA T V K K G K P E L R K K V M P A I V V R QA K S WR R R DG V F L Y F E DNA G V I A N P K G E MK G S A I T G P V G K E C A D L W P R V A - - - S N
S C G DMV L A T V K K G K P D L R K K I M P A I V V R QR K A WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G L G DM I V A T V K K G K P E L R K K V M P A V V I R QR K P I R R R E G I V L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S I G DM I L C S V K K G S P K L R K K V L QA I V I R QR R P WR R R DG V F I Y F E DNA G V I A N P K G E MK G S Q I T G P V A K E C A D I W P K V A - - - S N
S V G DMV L A T V K K G R P D L R K K V L P A V I V R QR K A WR R R E G Y F I Y F E DNA G V I V N P K G E MK G S A I NG P V A K E C A E L W P K I S - - -A A
S V G DMV L A T V K K G R P D L R K K V L P A V I V R QR K A WR R R E G Y F I Y F E DNA G V I V N P K G E MK G S A I NG P V A K E C A E L W P K I S - - -A A
S V S D L I V V T C K K G K P A L R K K V S MG V V V R QR A I WR R K DG V V I G F QDNA G V I I NDK G E MK G S A I T G P V A K E A A E L W P K V A - - - S V
A L G DMV MA S V K K G K P E L R R K V L NA V I I R QR K S WR R K DG T V I Y F E DNA G V I V N P K G E MK G S G I A G P V A K E A A E L W P K I S - - -T H
A L G D I V MA S V K K G K P E L R R K V L NA V I I R QR K S WR R K DG T V I Y F E DNA G V I V N P K G E MK G S G I A G P V A K E A A D L W P K I S - - - S H
A A G DMV V A S V K K G K P E L R K K V M P A V V V R QR K P WR R R DG V F L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D I W P R I A - - - S N
G A G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K P WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
Alignment ruler (column positions): 1420 to 1490 in steps of 10.
Sequence labels as listed above (A_fumigatus/1-3241 to T_brucei/1-3264).
A G S I - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -MA R K A R I I DV V Y NA S NN E L I R T K
A NA I G - - I S R D S I HK K K R K Y E L G R Q P A NT K L - - S I R V R G G NV K WR A L R L DT G N F S WG S E A V T R K T R I L DV A Y NA S NN E L V R T Q
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NK K F R A L R I E T G N F S WA S E G V S R K T R I V G V V Y H P S NN E L V R T N
T S S -G - - I S R D S R HK K K R A F E K G R Q P A NT R I - -G V R T R G G NR K F R A L R L E S G N F S WG S E G I S R K T R V I V V A Y H P S NN E L V R T N
S G V V G - - I S R D S R HK K K R A F E K G R Q P S NT R I - -G V R T R G G NQK F R A L R L E S G N F S WG S E G I S R K T R V I V V A Y H P S NN E L V R T N
A S S I G - - I S R DHWHK K K R K Y E L G R P A A NT R L - -G V R S R G G NT K Y R A L R L DT G N F S WG S E C S T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R D S WHK K K R K F E L G R P A A NT K I - -G V R T R G G N L K Y R A L R L DNG N F S WA S E QT T R K T R I V DT MY NA T NN E L V R T K
A G S I G - - I S R D S WHK K K R K F E L G R P A A NT K I - -G V R T R G G N E K Y R A L R L D S G N F S WA S E QT T R K T R I V DT MY NA T NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NQK F R A L R V E T G N F S WG S E G V S R K T R I A G V V Y H P S NN E L V R T N
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NQK F R A L R I E T G N F S WA S E G V A K K T R I V G V V Y H P S NN E L V R T N
- - -MG - - I S QD S WHK K K R K F E L G R P A A NT K I - -G V R T R G G NT K F R G L R L DT G N F S WG S E A C A R K T R I I DV MY NA S NN E L L R T K
A G T V G - - I T R D S R HK K K R K F E L G R Q P A MT K L D - S V R T R G G NV K Y R A L R L D S G N F A WG S E S V T R K T R L I QV R Y NA T NN E L L R T Q
A P S I G - - I S R D S R HK K K R K Y E MG R P A S NT K L - -G V R C R G G NK K F R A L R L D S G NY S WG S QG V T R K A R I M E V V Y NA S NN E L V R T K
A P S I G - - I S R D S R HK K K R K Y E MG R P A S NT K L - -G V R C R G G NK K F R A L R L D S G NY S WG S QG I S R K A R I M E V V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G I R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G N E K F R A L R I E T G N F S WG S E G V A R K T R L A G V V Y H P S NN E L V R T N
A G T I G - - I S R DA L HK K K R K Y E L G R QA A K T K I - -C I R V R G G HQK F R A L R L DT G N F S WA T E K I T R K C R I L NV V Y NA T S ND L V R T N
A S S I G - - I S R D S A HK K K R K F E L G R P A A NT K L - -G V R T R G G NT K L R A L R L E T G N F A WA S E G V A R K T R I A DV V Y NA S NN E L V R T K
A S S I G - - I S R D S A HK K K R K F E L G R P A A NT K L - -G V R T R G G N S K L R A L R L E NG N F A WA S E G V A R K T R I A DV V Y NA S NN E L V R T K
A S S I G - - I NHR G DHK K K R NNR A G S Q P S S T K I - -G V R V R G G NR K Y K A L R L DMG H F K F I T T G K F R MA K L L QV V Y H P S S N E L V R T N
A P T I G - - I T R D S R HK K K K K NT MG R Q P A NT R L - -G V R C R Y G I I K R R A L R L E NG N F S WA S Q S I T K G T K I L NV V Y NA S DND F V R T N
A G S I G - - I S R DNWHK K K R K Y E L G R P P A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R QA A NT K I - -G V R T R G G NQK F R A L R I E T G N F S WA S E G V A R K T R I T G V V Y H P S NN E L V R T N
A P A I G - - I V R S R L HK K R MK A E L G R L P A NT R L - -G V R A R G G N F K I R A L R L DT G N F A WA S E A I A HR V R L L DV V Y NA T S N E L V R T K
S G V V G - - I S R D S R HK QK R A W E A G R Q P A S T K I - -G V R V R G G NT K Y R A L R L D S G N F S WG S E G V T R K T R V I A V A Y H P S NN E L V R T N
A S I I - - -MR WQG S S R G K R K F E MG R E S A E T R I - - S V P T MG G NR K V R L L Q S NV A NV T N P K DG K T V T A P I E T V I DNT A NK HY V R R N
A G S I G - - I S R DNWHK K K R K Y E L G R P P A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R A F E A G R Q P A NT R I - -G V R T R G G NHK Y R A L R L D S G N F A WA S E G C T R K T R V I V V A Y H P S NN E L V R T N
A NA I G - - I S R D S MHK K K R K Y E L G R Q P A NT K L - - S V R V R G G N L K WR A L R L DT G NY S WG S E A V T R K T R I L DV V Y NA S NN E L V R T Q
A S S I G - - I S R D S L HK K K R K Y E L G R Q P A NT K L - - S V R C R G G N I K HR A L R L DT G N F A WG S E NC T R K T R I L DV V Y NA S NN E L V R T K
A G S V G - - I S R D S R HK K K R A F E K G R QA A MT K L V S G I R V R G G N F K F R A L R L S E G N F S WG S QG I A K K A K I V E V V Y H P S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q S A NT K I - -G V R T R G G NQK F R A L R V E T G N F S WG S E G V S R K T R I A T V V Y H P S NN E L V R T N
A S A I G - - I S R DG R HK K K R K Y E L G R P P S NT K L - -G V R G R G R NY K Y R A I K L D S G S F S W P T F G I S K NT R I I DV V Y NA S NN E L V R T K
A S A I G - - I S R DG R HK K K R K Y E L G R P P S NT K L - -G V R G R G K N L K Y R A I K L D S G S F S W P A F G V S K I T R I I DV V Y NA S NN E L V R T K
A S A I G - - I S R DG R HK K K R K Y E L G R P P S NT K L - -G V R G R G R NY K Y R A I K L D S G S F S W P A F G I S K MT R I I DV V Y NA S NN E L V R T K
A NA I G - - I S R D S MHK K K R K Y E L G R Q P A S T K L - - S I R V R G G NV K WR A L R L DT G NY S WG S E A V T R K T R I L DV V Y NA S NN E L V R T Q
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NK K Y R A L R I E T G N F S WA S E G I S K K T R I A G V V Y H P S NN E L V R T N
A G T V G - - I T R D S R HK K K R K F E L G R Q P S NT R I - -G V R V R G G NK K F R A L R L D S G N F S WG S E G V S K K T R I I QV A Y H P S NN E L V R T N
A S T I G G R I P DDT T R K A HY A L P L A R K K G A K L L - -G V R C MG G N I K R R A L R L DNG N F S WG S E HT T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P P A NT K L - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S V G - - I S R D S K HK K K R A F E K G R P I S MT K L - -T V R V R G G H L K F R A L R L C E G N F S WG S E N I T R K T K I L DV K Y NA T NN E L V R T K
A P S I G - - I S R D S R HK K K R K Y E L G R P S S NT K L - -G V R C R G G N L K F R A L R L D S G N F S WG S QNV T R K T R V MDV V Y NA S S N E L V R T K
A P S I G - - I S R D S R HK K K R K Y E L G R P S S NT K L - -G V R C R G G N L K F R A L R L D S G N F S WG S QNV T R K T R V MDV V Y NA S S N E L V R T K
A P A V G - - I T R MG D L K K K R N F L A G R P S A QT R I - -G V R V R G G N L K MR A L R L E T G T F A WA S E NC T R K T R I L NV T Y H P A DND L V R T N
A P A I G - - I V R S R L HK K R MK A E L G R L P A NT K L - -G V R A R G G N F K L R G L R L DT G N F A WG T E A S A QR A R I L DV V Y NA T S N E L V R T K
A P A I G - - I V R S R L HK K R MK A E L G R L P A HT K L - -G V R A R G G N F K L R G L R L DT G N F A WG T E A I A QR A R I L DV V Y NA T S N E L V R T K
A G T V G - - I T R D S R HK K K R A F E L G R QA A NT R I - -G V R V R G G N L K HR A L R L E S G N F A WG S E H I T A K T R V L G V V Y NA S NN E L V R T N
S G V V G - - I S R D S R HK K K R K F E C G R QG A V T R I - -G V R T R G G NK K F R A I R I E T G N F S WG S E G T T R K T R V L G V S F H P S NN E L I R T N
Alignment ruler (column positions): 1500 to 1560 in steps of 10.
Sequence labels as listed above (A_fumigatus/1-3241 to T_brucei/1-3264).
One alignment column was extracted separately from its rows: T in 54 rows and I in one row.
L V K NA I I V I DA S P F R QWY E S HY S K S N L R K Y V K R - - -QK NA K I D P A V E E Q F NA G R L L A C I S S R P G QV G R A DG Y I
L V K S A I V QV DA A P F K QG Y L QHY S NHV QR K L E MR - - -Q E G R A L D S H L E E Q F S S G R L L A C I A S R P G QC G R A DG Y I
L T K S A I V Q I DA T P F R QWY E S HY S R NA E R K WA A R - - -A G DA K I E G A V D S Q F S A G R L Y A C I S S R P G Q S G R C DG Y I
L T K S A V V Q I DA A P F R QWY E A HY S N S V V K K QA A R F - -A DHG K V E P A I E K Q F E S G R L Y A V I A S R P G Q S G R V DG Y I
L T K S A V V Q I DA A P F R QWY E A HY S N S V V K K QA A R F - -A E QG K V E S A V E R Q F E S G R L Y A V V S S R P G Q S G R V DG Y I
L V K NA I V V V DA T P F R QWY E S HY S QK T A R K Y L A R - - -QR L A K V E G A L E E Q F HT G R L L A C V A S R P G QC G R A DG Y I
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L V K G A I V S V DA A P F R QWY E A HY S NHT L K K Y T E R - - -QK T A A V DA L L T E Q F NT G R L L A R I S S S P G QV G QA NG Y I
L V K G A I I S V DA A P F R QWY E A HY S HHT MK K Y T E R - - -QK T A A V DA L L I E Q F NT G R L L A R I S S S P G QV G QA NG Y I
L T K S A V V Q I DA T P F R QWY E NHY S R K V E R K L A A R - - - S G A A A I E S A V D S Q F G S G R L Y A V I S S R P G Q S G R C DG Y I
L T K A A I V Q I DA T P F R QWY E A HY S K S A E R K WA A R - - -A A S A K V E S A V D S Q F S A G R L Y A C I S S R P G Q S G R C DG Y I
L V K NA I I Q I D S T P F R QWY E A HY S K K T QK K Y E E R - - -K K E P K V A QA L E E Q F NQG R I L A C I S S R P G Q S G R C DG Y I
L V K G A V V D I DA T P F R QWY E S HY S NHV K R I L E E R - - -K K V A K I D P L L E QQ F R A G R L L A V I S S R P G Q S G R A DG Y I
L V K NA I V V I DA T P F R Q F Y L QR Y S G H L L A T R K A R - - - L MNNV I D P L V E E Q F G I G R L L A C V S S R P G QC G R C DG Y I
L V K NA I V V I DA T P F R Q F Y L QR Y S G H L L A T R K A R - - - L MNNV I D P L V E E Q F G I G R L L A C V S S R P G QC G R C DG Y I
L V K NC V V L V D S T P Y R QWY E S HY S K K V QK K F T L R - - -R K T A K I S P L L E E Q F L QG K L L A C I S S R P G QC G R A DG Y V
L T K A A I V Q I DA T P F K QW F E T HY S R K V E R K L A QR - - - S G A S N I E S A V E HQ F NA G R L Y A A I S S R P G Q S G R C DG Y I
L V K G S I V Q I DA T P Y K QWY E T HY S A S L L A K L A S R - - -A K G R V L D S A I E S Q I G E G R F F A R I T S R P G QV G K C DG Y I
L V K N S I V V I DA T P F R QWY E A HY S E K V MK K Y L E R - - -QK Y G K V E QA L E DQ F T S G R I L A C I S S R P G QC G R S DG Y I
L V K N S I V V I DA T P F R QWY E S HY S E K V MK K Y L E R - - -QK F G K V E QA L E DQ F T S G R I L A C I S S R P G QC G R S DG Y I
L T K S S V V K I S A E P F K ND I K - - - - - - - - - - - - - - - - -DV A R DV D P S L H E S F E K G H L Y A I I T S R P G QV G MA QG HV
L V K G A I I E I D P A P F R L W F L K F Y S K T MQK K Y A K K L E V L K NMK F D E A L L E G F Q S G R V L A C I S S R P G QT G S V E G Y I
L V K NC I V L V D S T P Y R QWY E A HY S K K I QK K Y D E R - - -K K NA K I A S I L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T K A A I V Q I DA T P F R QWY E S HY S K NT E R K WA A R - - -A A E A K I E HA V D S Q F G A G R L Y A A I S S R P G Q S G R C DG Y I
L V K NC I V A V DA A P F K R WY A K HY S P K L QR E WT R R - - -R R NHR V E K A I A DQ L R E G R V L A R I T S R P G Q S G R A DG I L
L T K S A V I Q I DA A P F R QWY E A HY S K S V E K K QA E R F - -A A R G K V D S A L E K Q F E A G R V F A V V S S R P G Q S G R C DG Y I
L T K G S V I R T S MG T - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -A R V T S R P G QDG V V NA V L
L V K NC I V L V D S T P Y R QWY E S HY S K K I QK K Y E E R - - -K K NA K I S P L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T K S A V V Q I DA A P F R QWY E A HY S K S V E K K QA E R F - -A A A G K V D P A L E K Q F E A G R L Y A V I S S R P G Q S G R C DG Y I
L V K S A I V QV DA A P F K QWY L T HY S NHV V R K L E K R - - -QQT R T L D S H I E E Q F G S G R L L A C I S S R P G QC G R A DG Y I
L V K S A V I A V DA A P F R A WY A QHY S K S V T MK L R S R - - -NQK H E V A K A I D E Q F A T G R L L A I I T S R P G QC G R A DG Y V
L T R G V I V QV DA T P F R QWY A K K Y S R S L I K K L E QR - - -A K DNA I DA L V Q E Q F T NQR L L V R I T S R P G Q S G R A DG Y I
L T K A A I V Q I DA T P F R QWY E NHY S R K V E R K L A S R - - -A G QA A I E S A V DA Q F G S G K L Y A A I S S R P G Q S G R C DG Y I
L V K NC I V V I D S H P F T T WY E NT F T Y G V I K K I - - - - - -G K S K N I D P L L L E Q F K QG R V L A C I S S R P G QC G K A DG Y I
L V K NC I V L I D S H P F T T WY E NT F S Y S V I K K I - - - - - -G K S K Q I D P A L L E Q F K QG R V L A C I S S R P G QC G K A DG Y I
L V K NC I V L I D S H P F T A WY E NT F S Y S V I K K I - - - - - -G K A K Q I D P A L L E Q F K QG R V L A C I S S R P G QC G K A DG Y I
L V K S A I V QV DA A P F K QWY L QHY S NHV I R K L E K R - - -QQV R K L D P H I E E Q F G S G R L L A S I S S R P G QC G R A DG Y I
L T K A A I V Q I DA T P F R QW F E A HY S K NA E R K WA A R - - -A A S A K I E S S V E S Q F S A G R L Y A C I S S R P G Q S G R C DG Y I
L T K S A I V Q I DA A P F R V WY E T HY S K HV QR K H S A R - - - L G D S K V D S A L E T Q F A A G R L Y A V V S S R P G Q S G R C DG Y I
L V K NA I V Q I D S T P F R QWY E A HY S K K V V K K F E E R - - -K K T A K V A QA L E E Q F G T G R L L A C I A S R P G QC G R A DG Y I
L V K NC I I L V D S L P F R QWY E A HY S K K T QK K Y D E R - - -K K T A K I S T L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y I
L V K N S I V E I D S T P F R E WY K L HY S R HV QK R V -K R - - -T K A QA L E K N I E E Q F V S QR I L A C I T S R P G Q S G R A DG Y I
L V K NA I V T V D P T P F K L W F K T HY S E K V - - - - - - - - - - - -A G L V P K T L L E Q F S S G R L L A C I S S R P G QC G R C DG Y V
L V K NA I V T V D P T P F K L W F K T HY S E K V - - - - - - - - - - - -A A L V P R T L L DQ F S S G R L L A C I S S R P G QC G R C DG Y V
L A R G S V V S I DA A P F K QWY E R Q F T DK MT QR WA A N - - -K DG G V V A P E L V A E F DQG R L L A V I T S R P G QC G R A DG Y I
L V K NC I V V V DA A P F R L WY A K HY S S K L K R K W E Y R - - -R K HHK I E K A L A DQ L R E G R L L A R I T S R P G QT G R A DG A L
L V K NC I V V V DA A P F K L WY A K HY S D E L K R K WM L R - - -R E NHK I E K A V A DQ L K E G R L L A R I T S R P G QT A R A DG A L
L V K G C I V QV DA T P F R QA Y E K HY S NNV T R K L E NR - - -R K E G K L D S L V E QQ F G A G R L Y A A V S S R P G Q S G R C DG Y I
L T K S A I V Q I DA T P F R QWY E S Y Y A E A DQA A V A A R - - -QA DA K L D P A V E A Q F G A G R L Y A C V S S R P G Q S G R V DG Y V
[Alignment ruler tick: 1570]
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EG E E LA FY
L EG E E LA FY
L EGK E L E FY
L EGK E L E FY
L EGK E LD FY
L EGK E LD FY
L EG E E LA FY
L EG E E LA FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EAK E L E FY
L EGK E L E FY
L EGK E L E FY
L QG D E L K F Y
L EGK E LD FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EGA E LQ FY
L EGK E L E FY
L EG E E LA FY
I E------
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
I EGD E L L FY
I EGD E L L FY
I EGD E L L FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EG E E LH FY
L EGK E LD FY
L EGK E L E FY
L EGK E L E FY
L EG E E LN FY
L EG E E LN FY
L EG E E LA FY
L EGA E LQ FY
L EGA E LQ FY
L EGK E L E FY
L EG E E LA FY
[Alignment ruler tick: 1580]
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
[Alignment ruler ticks: 1590 to 1650]
L K K I K NK K MV L A D L G R K I T NA L H S L S K A T I I N E E -V L D S M L K E I C T A L L E A DV N I R L V K K L R E NV R Q S A V F K E L V K L V I M F V G
MK K L QK K K MV L A E L G G R I T R A I QQM S NV T I I D E K -A L N E C L N E I T R A L L Q S DV S F P L V K E MQ S N I K E QA I F S E L C K MV V M F V G
V R R L T A K K MV L A D L G K R I NA A V A QA L NNDT DDY V A G V E T M L K A I V T A L L E NDV N I K L V S S V R S N I K QK T V F E E L C A L V V M F V G
QR A I R K - -MV L QD L G R R I NA A V ND L T R S S N L D E K -A F DDM L K E I C A A L L S A DV NV R L V QT L R K S I K QK A V F D E L V A L V I M F V G
QR A I R K - -MV L QD L G R R I NA A V ND L T R S NN L D E K QA F DDM I K E I C A A L L S A DV NV R L V Q S L R K S I K QK A V F D E L V S L V I M F V G
LR K I K SK R - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R K I R A K K MV L A D L G R K I R NA I G K L G QNT V I N E E - E L D L M L K E V C T A L I E S DV H I R L V K Q L K DNV K QK T V F N E L L K L V F M F V G
L R K I R A K K MV L A D L G R K I R NA I G K L G Q S T V I N E G - E L D L M L K E V C T A L I E S DV H I R L V K Q L K DNV K QK T V F N E L L K L V F M F V G
L R R L T A K K MV L A D L G S R L R G A L S S V E S G S - - -DD - E I QQM I K D I C S A L L E S DV NV K L V A K L R G N I K QK I I F D E L C A L V I M F V G
L R R L T A K K MV L A D L G K R I NNA V N S A L S NT E DDY V N S I DG M L K G I S T A L L E A DV N I M L V S K V R NN I R QK T V F D E L C G L I I M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
QR K I K A R K MV L A D L G R K I NNA L R S L S NA T I I N E E -V L Q S M L S E I C R A L L E S DV N I R L V K K L R E NV R Q S A V F R E L V K L V I M F V G
HHK L Q I R K MV L A D L G T R L HG A WNQ L S K A S V I DDK -V I DG V L K E L C A A L L E S DV NV K L V A S L R T K V K QK A V F D E L V A L V L MA V G
K K K L EK K K - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - K K K L E K K K MV L A D L G G Q L A S A I R K F Q S S T I A D E A -A I D L C L K E I A T A L L K A DV NV K L V A Q L R NN I K Q S A V V E E L V N I V I V F V G
L R K I K A K K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C A A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R R L T A K K MV L A D L G S R L R G A L S NV E S G S - - - E T - E I Q S M I K D I C NA L L E S DV N I K L V A K L R DN I K QK I V F D E L C S L V I M F V G
QR R L QK K K MV L A D L G NQ L S S A L R S L N E T T I V N E D -T I NQ L L K E V G NA L S K S DV S M S L I I QMR K N I K K QV V F D E L I R L I V M F V G
L K K I K S K K MV L A D L G R K I T T A L H S L S K A T V I N E E -A L N S M L K E I C A A L L E A DV N I R L V K Q L R E NV R Q S A V F K E L V K L V V M F V G
L K K I K S K K MV L A D L G R K I T T A L H S L S K A T V I N E E -A L N S M L K E I C A A L L E A DV N I R L V K Q L R E NV R Q S A V F K E L I K L I I M F V G
A DK F NK K S -M I T E L G R S I T NT L S N L L S S P A T DQH - - I E T A I R E I C N S L I L S NV N P R Y V S D L R D E L R QNA V Y E R L V D L V V V F V G
S K K I S DK K MV L S Q L G S S L V T A L R K MT S S T V V D E E -V I NT L L K E I E T S L L G E DV N P I F I R QMV NN I K K D S V F E E L I N L V L MMV G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L I I M F V G
L R R L T A K K MV L A D L G K R I NNA V T NA I S N E QT DY E T T V Q S M L K E I A T A L L E NDV N I R L V S R L R E N I K QK T V F D E L C N L V I M F V G
L K R L E K K K MV L A E L G QK I G QA I HR M S A K S M L G E D -DV K E L MN E I A R A L L QA DV NV T I V K K L QV S I R QNA V F NG L K R I V V M F V G
QR K L HK - -MV L QD L G R R I NA A V S D L T R A P N L D E K -A - - - - - -K I C A A L L E A DV NV R L V G Q L R K S I K QK A V F D E L V S L V I M F V G
- - - - - - - -MV M E K L G D S L QG A L K K L I G A G R I D E R -T V N E V V K D I QR A L L QA DV NV K L V MG M S QR I K I R I V Y Q E L M E I T I MMV G
QR K L HK - -MV L QD L G R R I NA A V S D L T R A P N L D E K -A F DG M L K E I C S A L L E A DV NV R L V G Q L R K S I K QK A V F D E L V R L V I M F V G
MK K L QR K K MV L A Q L G G S I S R A L A QM S NA T V I D E K -V L S DC L N E I S R A L L Q S DV Q F K MV R DMQ S N I K QQA V F T E L C NMV V M F V G
QK K MMK K K MV L ND L G NK I A S A L R S L NA HV V V D E E - L L DA C L K D I T NA L L A S DV A V P L V V R MK K N I V E R A V F K E L T A L V V M F V G
I K K V E QK K MV L A E L G K S I NA A L QK L S K A P V V D E A - L V DQ I L G E I A MA L L K A DV NA K F I K K L R E DV K QK A V V DG L T R MV I M F V G
L R R L T A K K MV L A D L G S R L R G A L S S V E S A S - - -D E - E I NQM I K DV C T A L L E S DV N I K L V V K L R DN I K QK I I Y D E L V G L I I M F V G
K R K MDK K K MV L T E L G T Q I T NA F R K L QT S T L A DDV -V I E E C L K E I I R A L I L S D I NV S Y L K D I K S N I K QK Y V V E E L I K L V I L F V G
K R K MDK K K MV L T E L G T Q L T S A L QK L QA S A V A DD S -A I E E C L K E V I R A L I L A D I N I S Y L K D I K S N I K QQY V V E E L I N L V I L F V G
K R K MDK K K MV L T E L G A Q L T S A L QK I QA A P V A DDN -V I E E C L K E I V R A L I L A D I NV I Y L K D I K S N I K QK Y V V E E L I K L V I L F V G
MK K I QR K K MV L A Q L G G S I S R A I QQM S NA T I I D E K -A L NDC L N E I T R A L L Q S DV Q F K L V R DMQT N I K QQA I F N E L C K I V V M F V G
L R R L T A K K MV L A D L G K R I N S A V NNA I S NT QDD F T T S V DV M L K G I V T A L L E S DV N I A L V S K L R NN I R QK T V F D E L C K L I I M F V G
L R R MA P K K MV F A D L G R R L N S A L G D F S K A T S V N E E - L V DT L L K N I C T A L L E T DV NV R L V Q E L R S N I K QK A V F D E L C S L V I MMV G
MR K MR A K K MV L A D L G R K I T S A L K S L S NA T I I D E D -V L N S M L N E I C R A L L E A DV N I R L V K A L K E NV K QT A V F K E L V K L I I M F V G
L R K I K A K K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C A A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
I R K L Q S K K MV L A D L G K R I NNA L QQ L NK A P V I D E E - L L NQV L K E I Q L A L L Q S DV NV K Y V A K L K S N I I QQA V V Q E L T QMV V M F V G
R R R MDK K K MV L A E L S NQ I T QA F R K L H S T T V I S E A -V I E E V I G D I V R A L L MA DV NV K L V HK L K E NV K QK I V V D E L V NMV I M F V G
R R R MDK K K MV L A E L S NQ I T K A F R K L H S T T V I S E A -V I E E V I G D I V R A L L MA DV NV K L V HK L K E NV K QK I V V D E L V NMV I M F V G
S DK I A K K K -M L QD L G E K L MG S I K K L S E S K T I D E K -V Y V T F MA E V A K S L I A A DC S K E I V F D F S R R L K E K A V F N E L V K L I F MMV G
L K K L DK K K MV L A E L G QK I G A A I S K M S S K S F V G E D -DV K E F L N E V A R A L L QA DV NV K T V K E L QQNV R QT A V F NG I K K M I V M F V G
L K K L E K K K MV L A E L G QK I G G A I S K M S S K P L L G E D -DV K E F L N E V A R A L L QA DV HV T T V K E L QQT I R QT A V F S G L R K I I V M F V G
V R R L K A S K MV L S D L G R R I N S A F QD L S K V P T V DA A - S I DQ L L K S V C NA L I E A DV NV K L V A N L R S QV K QK A V F DH L V A L V I M F V G
L K K I V S K K MV L E D L G K R I NG A F A N L S K G G D I D E - -A L DA M L K E V C S A L L E S DV N I K L V S Q L R QK V K QK A L F D E L V N L V V M F V G
[Alignment ruler ticks: 1670 to 1740]
L QG S G K T T T C T K L A Y HY QK K NWK S C L V C A DT F R A G A Y DQ I K QNA T K A R I P F Y G S Y T E V D P V T I A QDG V E M F K K E G F E F I I V DT

L QG A G K T T T C T K Y A Y Y HQK K G Y K P A L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E S D P V K I A V E G V DT F K K E NC D L I I V DT
L QG A G K S T S C S K L A V Y Y S K R G F K V G L V C A DT F R A G A F DQ L K QNA I K A K I P F Y G S Y T E T N P V R V A A DG V A K F K K E R F E I I I V DT
L QG A G K T T T C T K L A R HY QMR G F K T A L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T QT D P A V V A A E G V A K F K K E R F E V I I V DT
L QG A G K T T T C T K L A R HY QMR G F K T A L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T QT D P A I V A A E G V A K F K K E R F E I I I V DT
----------------------------------------------------------------------------------
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C S K MA Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y S E I D P V K I A A E G V E K F T K E G F E I I I V DT
L QG S G K T T T C T K MA Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y S E I D P V K I A A E G V E K F T Q E G F E I I I V DT
L QG A G K T T S C T K L A V Y Y K K R G F K V G L V C A DT F R A G A F DQ L K QNA I K A N I P Y Y G S Y L E P D P V K I A F E G V QK F K Q E K F D I I I V DT
L QG S G K T T S C T K L A V Y Y S K R G F K V G L V C A DT F R A G A F DQ L K QNA V K A R I P F Y G S Y T E T D P V K V A G DG I A K F K K E K F DV I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C T K L A Y Y Y QR K NWK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E A D P V V I A S E G V E T F K E E N F E I I I V DT
I QG A G K T T T C T K L A V HY QR R G F R T C L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E T D P V A I A S L G V E K F R K E R F DV I I V DT
----------------------------------------------------------------------------------
L QG S G K T T T C T K F A NY Y QR R G WK T A L V C A DT F R A G A F DQ L K QNA T K V K I P F Y G S Y T E T D P V K I A R DG V R E F R K E G Y D L I I V DT
L QG S G K T T T C S K L A Y Y F QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A A E G V E K F K S E N F E I I I V DT
L QG A G K T T S C T K L A V Y Y K K R G Y K V G L V C A DT F R A G A F DQ L K QNA I K S S I P Y Y G S Y I E T D P V K V A Y E G V V K F K Q E K F D I I I V DT
L QG A G K T T S V T K L A Y F Y K K K G F S T A I V C A DT F R A G A F DQV R HNA A K A K I HY Y G S E T E K D P V V V A R T G V D I F K K DG T E I I I V DT
L QG S G K T T T C T K L A Y HY QK R NWK S C L V C A DT F R A G A Y DQV K QNA T K A R I P F Y G S Y T E I D P V V I A QDG V DM F K R E G F E M I I V DT
L QG S G K T T T C T K L A Y HY QK R NWK S C L V C A DT F R A G A Y DQ I K QNA T K A R I P F Y G S Y T E I D P V V I A Q E G V DM F K R E G F E M I I V DT
L QG S G K T T S I C K Y A N F Y K K K G Y K V G I V C A DT F R A G A F DQV R QNA L K I K V P F F G S - S E A D P V K V A S A G V E R F R K E R F E L I L V DT
L QG A G K T T T I T K L A L Y Y K NR G Y K P A V V G A DT F R A G A Y E Q L QMNA K R A G V P F F G I K E E S D P V K V A S E G V R T F R K E K ND I I L V DT
L QG S G K T T T C S K L A Y F Y QR K G WK T C L I C A DT Y R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T S C T K L A V Y Y S K R G Y K V G L V C A DT F R A G A F DQ L K QNA I K A K I P F Y G S Y T E P N P V K V A K DG V DK F K K E K F E I I I V DT
L QG S G K T T S C T K Y A A Y F QR K G F K T A L V C A DT F R A G A Y DQ L R QNA T K A K V R F Y G S L T E A D P V A I A K E G V A E L K K E K Y D L I I V DT
L QG A G K T T T C T K L A R HY Q S R G F K A C L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T E T D P A V V A R E G V DK F K K E R F E V I I V DT
L QG S G K T T S A A K L A R Y F QR K G L K A G V V A A DT F R P G A Y HQ L K T L A E K L NV G F Y G E E G N P DA V E I T K NG L K A L - - E K Y D I R I V DT
L QG A G K T T T C T K L A R HY Q S R G F R V G L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T E T D P V V V A R DG V DK F K K E K F E I I I V DT
L QG S G K T T T C T K Y A Y Y HQR K G F K P A L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y M E S D P V K I A V E G V E R F K K E NC D L I I V DT
L QG A G K T T T C T K F A HY Y A K K G F K P S L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E S D P A T I A A A G V K R F E E E K S D L I I V DT
L QG S G K T T T C T K Y A Y Y Y QK K G WK V A L V C A DT F R A G A F DQ L K QNA T K V R V P F Y G S Y T E A D P V Q I A Q E G V NV F K K E A F E I I I V DT
L QG A G K T T S C T K L A V Y Y K K R G F K V G L V C A DT F R A G A F DQ L K QNA I K A S I P Y Y G S Y L E QD P V K I A Y E G V T K F R S E K F D I I I V DT
L QG S G K T T T C T K Y A HY Y QK K G F K T A L V C A DT F R A G A F DQ L K QNA A K V K I P F Y G S Y S E V D P V K I A T DG V NA F L K DK Y D L I I V D S
L QG S G K T T T C T K F A HY Y QK K G F K T A L V C A DT F R A G A F DQ L K QNA A K V K I P F Y G S Y S E V D P V K I A S DG V NA F L K E K Y D L I I V D S
L QG S G K T T T C T K Y A HY Y QK K G F K T A L I C A DT F R A G A F DQ L K QNA A K V K I P F Y G S Y S E V D P V K I A T DG V NT F L K DK Y D L I I V D S
L QG S G K T T T C T K Y A Y Y HQK K G WK P A L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E S D P V K I A E E G V E T F K K E NC D L I I V DT
L QG S G K T T T C S K L A Y F Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T S C T K L A V Y Y S K R G F K V G L V C A DT F R A G A F DQ L K QNA I R A R I P F Y G S Y T E T D P A K V A E E G I NK F K K E K F D I I I V DT
L QG S G K T T T C S K L A L HY QR R G L K S C L V A A DT F R A G A F DQ L K QNA I K A R V P Y F G S Y T E T D P V V I A K E G V DK F K NDR F DV I I V DT
L QG S G K T T T C T K L A Y Y Y QK K G WK V A L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E V D P V I I A A DG V E K F K K E N F E I I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V V I A T E G V DK F K M E N F E I I I V DT
L QG A G K T T T C T K Y A Y HWQK K G WR T A L I C A DT F R A G A F DQ L K QNA T K V R V P F Y G S Y S E T D P V A I A E E G V K H F K K E NY E M I I V DT
L QG A G K T T S C T K F A Y HY QR K G WR T A L I C A DT F R A G A F DQ L K QNA A K V K I S F Y G S Y S E A N P A K V A A DG V A R F K E E K Y DM I I V DT
L QG A G K T T S C T K F A Y HY QR K G WR T A L I C A DT F R A G A F DQ L K QNA A K V K I S F Y G S Y S E A N P A K V A A DG V A R F K E E K Y DM I I V DT
L QG A G K T T T V T K L A N F Y K R R NWR T G V I A A DT F R A G A R E Q L MQNA QT A R I P Y F V D F T E QD P V QA A L K G I E K F R K DK Y E I V I I DT
L QG S G K T T S C T K Y A A Y F QR K G L K T G L V C A DT F R A G A Y DQ L R QNA T K A K V R F Y G S L T E A D P V I I A K E G V L E L K K E K Y D L I I V DT
L QG S G K T T S C T K Y A A Y F QR K G L K T A L V C A DT F R A G A Y DQ L R QNA T K A K I R F Y G S L T E A D P V I I A K E G V A E L E K E K Y D L I I I DT
L QG S G K T T S C T K L A L Y Y QK R G F K T G L V C A DT F R A G A F DQ L K QNA S K I NV P F Y G S Y T E T D P V A I S A A G V A S F K QNR F E V I I V DT
L QG S G K T T S C T K L A V Y Y QR R G F K V G L V C A DT F R A G A F DQ L K QNA T K A K I P F F G S Y T E T D P V A V A A E G V A K F K K E K F E I I I V DT
[Alignment ruler ticks: 1750 to 1790]
S G R HK Q E E S L F E E M L A V A NA V N P DN I I F V MDA T I G QA C E A QA K A F K E K V D I G S V
S G R HK Q E A S L F E E MR QV A E A T K P D L V I F V MD S S I G QA A F DQA QA F K Q S V A V G A V
S G R HHQ E DA L F Q E MV E I A Q E V K P NQT I MV L DA S I G QA A E QQ S R A F K E A A D F G A I
S G R HK Q E E E L F T E MT Q I QNA V T P DQT I L V L D S T I G QA A E A Q S A A F K A T A N F G A I
S G R HK Q E E E L F T E MT Q I QT A V T P DQT I L V L D S T I G QA A E A Q S S A F K A T A D F G A I
- - - - - - - - - - - - - - - - - - - - - - P DN I I F V MDA T I G QA C E A QA R A F K DK V D I G S V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E A S L F E E M L QV S NA V T P DNV V F V MDA S I G QA C E A QA R A F S QT V DV A S V
S G R HK Q E A S L F E E M L QV S NA V T P DNV V F V MDA S I G QA C E A QA R A F S QT V DV A S V
S G R HR Q E E Q L F T E MV Q I G E A V Q P T QT I MV MDG S I G QA A E S QA R A F K E S S N F G S I
S G R HHQ E E E L F H E MV Q I S NV I K P NQT I MV L DA S I G QA A E QQ S K A F K E S S D F G A I
S G R HK Q E D S L F E E M L QV Y NA V A P DNV I F V MDA S I G QA C E G QA R A F K E K V DV A S V
S G R HK Q E S E L F E E MV A I G A A V K P DMT L MV L DA S I G QA A E G Q S R A F K D S A D F G A I
- - - - - - - - - - - - -M E QV V M E T N P DDV V F V MD S H I G QA C Y DQA MA F C NA V DV G S V
S G R HK Q E S S L F V E M E QV V M E T N P DDV V F V MD S H I G QA C Y DQA MA F C NA V DV G S V
S G R HK Q E D S L F E E M L QV S NA V Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E Q S L F N E M I Q I S E M I V P T QT I MV MDG S I G QA A E S QA K A F K E S S Q F G S I
S G R HK QD S E L F E E MK Q I E T A V K P DNC I F V MD S S I G QA A Y E QA T A F R S S V K V G S I
S G R HK Q E E S L F E E M L A V S NA V S P DN I I F V MDA T I G QA C E A QA K A F K DK V D I G S V
S G R HK Q E E S L F E E M L A V A NA V N P DN I I F V MDA T I G QA C E A QA K A F K DK V D I G S V
S G R HT Q E T E L F T E MK D I I R E I S P S S I V F V MDA G I G Q S A E DQA MG F K R A V DV G S I
S G R HK QDK E L F K E MQ S V R DA I K P D S I I F V MDG A I G QA A F G QA K A F K DA V E V G S V
S G R HQQ E D S L F Q E MV E I S QA V K P K QT I MV L DA S I G QA A E HQ S K A F K E S A D F G S I
S G R HK Q E S A L F E E MK QV E E A V K P ND I V F V M S A T DG QA V E E QA R N F K E MV A V G S V
S G R HR Q E S A L F Q E MMD I QK A V K P D E T I MV L DA S I G QQA E A QA K A F K E A A D F G A I
A G R HA L E A D L I E E M E R I HA V A K P DHK F MV L DA G I G QQA S QQA HA F ND S V G I T G V
S G R HK Q E D S L F E E M L QV S NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HR Q E E A L F Q E MMD I QT A V K P D E T I MV L DA S I G QQA E A QA K A F K E A A D F G A I
S G R HK Q E A A L F E E MR QV S E A T K P D L V I F V MD S S I G QA A F DQA QA F K Q S V S V G A V
S G R HK Q E E A L F E E MR E I A S V T E P T MT I F V MD S S I G Q S A S DQA K A F A S T V DV G G V
S G R HK Q E ND L F E E MK QV E A A V K P DD I V F V MD S S I G QA C F DQA L A F K K A V NV G S V
S G R HR Q E HQ L F Q E MV Q I G E M I Q P T QT I MV MDG S I G QA A E S QA K A F K E S S N F G S I
S G R HK Q E N E L F E E M I QV E N S I Q P E E I I F V I D S H I G Q S C HDQA MA F K N S V S L G S I
S G R HK Q E S E L F E E MK QV E S S I N P E E I V F V I D S H I G Q S C HDQA MA F K N S V T L G S I
S G R HK Q E ND L F E E MK QV E N S I K P E E I V F V I D S H I G Q S C HDQA MA F K N S V K V G S I
S G R HK Q E A A L F E E MR QV S E A T K P D L I I F V MD S S I G QA A F DQA QA F K QMV A V G A V
S G R HK Q E D S L F E E M L QV S NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HHQ E E E L F Q E M I E I S NV I K P NQT I MV L DA S I G QA A E QQ S K A F K E S S D F G A I
S G R HQQ E Q E L F A E MV E I S DA I R P DQT I M I L DA S I G QA A E S Q S K A F K E T A D F G A V
S G R HK Q E D S L F E E M L QV A NV T S P DN I I F V MDA S I G QA C E S QA K A F K E K V DV A S V
S G R HK Q E D S L F E E M L QV S NA V Q P DN I V Y V MDA S I G QA C E S QA K A F K DK V DV A S V
S G R HK Q E S E L F D E MK QV QA A V N P D E C I F V MDG S I G QA C Y DQA QA F R NA V NV G S V
S G R HK Q E DA L F D E MK L I Y DA V Q P D E V V F V MD S H I G QA C Y DQA S A F NK A V DV G S V
S G R HK Q E DA L F D E MK L I Y DA V Q P D E V V F V MD S H I G QA C Y DQA A A F NK A V DV G S V
S G R HMQ E E A L F A E MK A L A A A V N P H E I I F V MDG T I G QA A Y DQA L G F K NA V G V G S I
S G R HK Q E S A L F E E MK QV QQA V K P ND I V F V M S A T DG QG I E E QA R Q F K E K V P I G S V
S G R HK Q E S A L F E E MK QV Q E A V K P ND I V F V M S A T DG QG I R E QA R Q F K E K V P V G S V
S G R HK Q E Q E L F D E MR E I DT A V T P D L T I MV L DA N I G QA A E A Q S R A F K QA A G Y G A I
S G R HR Q E S E L F T E MV D I G A A V K P D S T I MV L DA S I G QA A E P Q S R A F K DA S D F G S I
[Alignment ruler ticks: 1800, 1810]
I (single alignment column; the residue I is repeated on each of the 55 rows)
I T K L DG HA K G G G A
I T K MDG HA K G G G A
L T K MDG HA K G G G A
I T K T DG HA A G G G A
V T K L DG HA K G G G A
I T K L D S HA K G G G A
I T K L D S HA K G G G A
L T K MDG HA R G G G A
I T K MDG N S MG G G A
L T K I DG T T K A G G A
I T K L DG H S NG G G A
V T K L DC QT K G G G A
I T K T DG HA S G G G A
I T K L DG T A K G G G A
V T K MDG HA K G G G A
MT K L DG HA K G G G A
I T K I DG HA K G G G A
I T K MDG HA K G G G A
I T K L D S NA K G G G A
V T K L DG QA K G G G A
[Alignment ruler tick: 1820]
L SAV AAT N S P I I F I G
L SAV AAT K S PV I F I G
I SAV AAT K T PV I F I G
I S A V A A T HT P I I F L G
I S A V A A T HT P I I Y L G
L SAV AAT Q S P I I F I G
L SAV AAT K S P I I F I G
L SAV AV T K S PV I F I G
L SAV AV T K S PV I F I G
I SAV AAT K T P I V F I G
I S A V A A T NT P I A F I G
I SAV AAT K T P I I F LG
L SAV AAT GA P I I F I G
L SAV AAT GA P I I F I G
I SAV AT T K T P I V F I G
I S A V A A T NT P I I F I G
I S SV AAT K C P I E FV G
I S A V A S T NT P I I F I G
L SAV AAT R S P I V F I G
I S A V A A T HT P I V F I G
L SAV S ET K A P I A F I G
I S A V A A T HT P I V F I G
I SAV S ET K A P I L F I G
L SAV AAT E S P I V F I G
I SAV AAT K T P I V F I G
L SAV AAT GC P I T F I G
L SAV A ST GC P I T F I G
L SAV SA I GC P I T F I G
L SAV AAT K T P I V F I G
L SAV AAT R S P I I F I G
L SAV AAT E S P I I F I G
L SAV SAT N S P I I F I G
L SAV SAT N S P I I F I G
L SAV AAT N S P I S F I G
L A A V A MT K S P I V F I G
L A A V A MT K S P I V F I G
I SAV AAT K T P I MF I G
[Alignment ruler ticks: 1830 to 1900]
T G E H I DD L E P F K T K P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E E L I DK I K HG Q - F T I R DMY E Q F QN I MK MG P F S Q I MG M I
T G E HMD E F E V F DV K P F V S R L L G MG DW S G F V DK L Q E V V P - - -K DQQ P E L L E K L S QG N - F T L R I MY DQ F QN I L NMG P L K E V F S M L
T G E HV HD F E K F S P K S F V S K L L G I G D I E S L L E Q F QT V S N - - -K E DT K A T M E N I QQG R - F T L L D F QK QMQT I MK MG P L S N L A S M I
T G E H L MD L E R F E P K A F V QK L L G MG DMA G L V E HV QA V T K - -D S A A A K E T Y K H I A E G I -Y T L R D F R E N I T S I MK MG P L S K L S G M I
T G E H L MD L E R F E P K A F I QK L L G MG DMA G L V E HV QA V T K - -D S A S A K E T Y K H I S E G I -Y T L R D F R E N I T S I MK MG P L S K L S G M I
T G E H I DD L E P F K T K P F I K K L L G MG D I E G L L DK V N E L K - - - - L DDND E L L E K L K HG Q - F T L R A MY E Q F QN I MK MG P F S Q I MG M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I DD F E I F K P K S F V QK L L G MG D I A G L V DMV ND I G - - - - I S DNK E L V G R L K QG Q - F T L R DMY E Q F QN I MK MG P F S Q I MV M I
T G E H I DD F E I F K P K S F V QK L L G MG D I A G L V DMV ND I G - - - - I QDNK E L V G R L K QG Q - F T L R DMY E Q F QN I MK MG P F S Q I MG M I
T G E HV G D L E I F K P T T F I S K L L G I G D I QG L I E HV Q S L N L -HQD E G HK QT I E H I K E G K - F T L R D F QNQMNN F L K MG P L T N I A S M I
T G E H I HD L E K F S P K S F I S K L L G I G D I E G L F E Q L K T V S N - - -K E DT K A T M E N I QQG K - F T L L D F K K QMQT I MK MG P L S N I A QM I
T G E H I DD F E M F K T Q P F V R K L L NMG D I E G L I DK V N E L K - - - - L DDN E E L L DK L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I M S M I
T G E H L ND L E R F A P Q P F I S K L L G MG DMQG L V E HMQDMA R -A N P DR QK D L A K K L E QG K - F T I R DWR E Q L S N I MNMG S I S K I A S M I
T G E H F D E F E P F E T K G F V S R L L G L G D I S G L MA K I N E V V P - - - L DR Q P DMV NR L V QG I - F T L R DMY E Q F QNM L NMG S P S A L L S M I
T G E H F D E F E P F E T K G F V S R L L G L G D I S G L MA K I N E V V P - - - L DR Q P DMV NR L V QG I - F T L R DMY E Q F QNM L NMG S P S A L L S M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E E L I DK L K HG Q - F T L R DMY E Q F QN I MK MG P F G Q I MG M I
T G E HA T D L E I F K P T S F I S K L L G I G D I Q S L I E HV Q S L N L -QDD E S HK K T I E N F K E G K - F T L R D F QT QMNN F MK MG P L T N I A S M I
T G E H L T D L E L F D P S T F V S K L L G Y G DMK G M L E K I K E V I P - - - - - E D S T S L K E I A QG K - F T L R S MQQQ F QQ I MQ L G P I DK L V QM I
T G E H I DD L E P F K T K P F V S K L L G MG D I E G L I DK V N E L K - - - - L DG ND E L L E K I K HG H - F T I R DMY E Q F QN I MK MG P F S Q F MNM I
T G E H I DD L E P F K T K P F V S K L L G MG D I E G L I DK V N E L K - - - - L DG N E E L L E K I K HG H - F T I R DMY E Q F QN I MK MG P F S Q I MNM I
T G E G MDD L E A F DA R R F V S R M L G MG DV E G L M E K V G S L G I - - - - -D E K E V V K K L R QG R - F T L G D F Y DQ F QK I L S L G P I S K L L E M I
T G E K V N E I E E F DA E S F V R K L L G MG D L K G I A K L A K D F A E - - -NA E Y K T MV K H L Q E G T - L T V R DWK E Q L S N L QK MG Q L G N I MQM I
T G E H I HD F E K F S P K S F V S K L L G I G D I E S L M E R F QT V S D - - -QDDA K NT L E N I QQG K - F T L L D F K NQMQT I MK MG P L S N I A NM I
T G E H F E D F D L F N P E R F V QK M L G MG D I G G L MDT MR DA N - - - - I DG N E E V Y K R L QDG L - F T MR DMY E H L QNV L K MG S V G K I M E M L
T G E HM L D L E R F A P QQ F I S K L L G MG DMA G L V E HV Q S L K L - - - - -DQK DT I K H I T E G I - F T I R D L R DQ L QN I MK MG P L S K MA G M I
V G E T P E D F E K F E A DR F I S R L L G MG D L K S L M E K A E E S L S - - - - - E E DV NV E A L MQG R - F T L K DMY K Q L E A MNK MG P L K Q I M S M L
T G E HM L D L E R F V P NN F I S K L L G MG DMA G L V E HV Q S L K L - - - - -DQK DT I K H I T E G I - F T I R D L R DQ L QN I MK MG P L S K MA G M I
T G E H I D E F E V F DV K P F V S R L L G MG DW S G F MDK I H E V V P - - -T DQQ P E L L QK L S E G T - F T L R L MY E Q F QN I L K MG P I G QV F S M L
T G E H I G E L E A F E T T S F V S K L L G MG D I K G L V E K MN E I V P - - - E E S A E K L M E A F G S G T - F T MR L L Y E Q F QN L QNMG P I S S I M S MV
E G E H F DD L E S F E A S S F V R R L L G L G D I NK L F Q S V K DV V N - - -MR DQ P Q L I QK L K E G K - F S I R D L QT Q F N S V L K L G S L NQ F M S A I
T G E H I G D L E I F K P T T F I S K L L G I G D I Q S L I E HV QG L N L -QND E NHK QT M E N I K E G K - F T L K D F QNQMNN F L K MG P L T N I A S M I
T G E HV ND F E K F E A K S F V S R L L G L G D I S G L V S T I K E V I D - - - I DK Q P E L MNR L S K G K - F V L R DMY DQ F QNV F K MG S L S K V M S M I
T G E H I ND F E K F E A K S F V S R L L G L G D I NG L V S T L K E V I D - - - I E K Q P Q L I NR L S K G K - F V L R DMY DQ F QNV F K MG S L S K V M S M I
T G E HV ND F E K F E A K S F V S R L L G L G D I DG L V S T L K E V I D - - - I E K Q P Q L I NR I A K G K - F V L R DMY DQ F QNV F K MG S L S K V M S M I
T G E HMD E F E V F DV K P F V S R L L G MG DW S G F MDK I H E V V P - - -MDQQ P E L L QK L S E G N - F T L R I MY E Q F QN L L K MG P I G QV F S M L
T G E H I HD L E K F S P K S F I S K L L G I G D I E S L F E Q L QT V S N - - -K E DA K A T M E N I QK G K - F T L L D F K K QMQT I MK MG P L S N I A QM I
T G E H I ND L E R F S P R S F I S K L L G L G D L E G L M E HV Q S L D F - - - - -DK K NMV K N L E QG K - F T V R D F R DQ L G N I MK L G P L S K MA S M I
T G E H I DD F E P F K T K P F I S K L L G MG D I E G L I DK V S E L N - - - - L DDN E E L I NK L K HG E - F T L R DMY E Q F QN I MK MG P F G Q I MG M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DR V ND L K - - - - L DDN E E L I DK L K HG Q - F T L R DMY E Q F QN I MK MG P F G Q I MG M I
T G E H F E D L E P F N P E S F V K R L L G L G D I K G M I T T V T E A V D - - -M E T QG K A I A N I T K G Q - F S I R D F QA QY K S I L K L G S I NQ F M S M I
T G E H F DD F E P F D P K S F I S R L L G F G D I NG L I NT L K DV I N - - - L E DK P D L L DR I A S A K - F T I R DMY DQ F QN L L K MA P I G K V M S M L
T G E H F DD F E P F D P K S F I S R L L G F G D I NG L I NT L K DV I N - - - L DDK P D L L DR I A S A K - F T I R DMY DQ F QN L L K MA P I G K V M S M L
S G E Q F T D L E W F D P N S F V S R L L G I QD P G V I QR T L E E I D - - - -K E A NK E I A E H I QK G Q - F S F R D L Y NQY K MV L DV G N F N S M L D S I
T G E H F DD F E L F Q P E S F V S R M L G MG DMR A L V D S MK DA N - - - - I DT D S E L Y K R F QDG Q - F T L R DMY E H L QNV L K MG S V S K I MDM I
T G E H F E D F E L F Q P E S F V S R M L G MG DMR A L MDT MK DA N - - - - I DT D S E L Y R R F QDG Q - F T MR DMY E H L QNV L K MG S V S K I MNM I
T G E HA A D L E P F R A Q P F I S K L L G MG D I S G L MDK M E E MQMNG G Q E R QQ E M L K K I G QG G I F S I R DWR E Q L S N I MG MG P L S K I A G M I
T G E H I HD L E A F S P K Q F I S K L L G I G D L QG L M E T MQ S L N L - - - - -DQK K T M E H I Q E G I - F T L A D L R DQMG NM L K MG S L S S I A G M I
Alignment ruler (column positions): 1920 1930 1940 1950 1960 1970 1980 1990
Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264
- - - - - - - P -G F S QD F MT K G -G E Q E S MA R I K R L MT MMD S M S DG E L DR V T R V A QG S G V M E R E V R D L I G G MNG L QNMMDY R S QM E L

- - - - - - - P -G I S A E MM P K G -H E K E S QA K I K R Y MT MMD S MT ND E L DR MMR I A R G S G R QV R E V M E M L G G MG G L Q S L MD F T -D L E L
- - - - - - - P -G M S G -MM S G I - S E D E T S R K MK K MV Y V L D S M S R E E L E R M L R V A R G S G T S V F E V E M I L A A M P DMA DMMD F S -Y L K L
- - - - - - - P -G L S N - L T A G L -DD E DG S L K L R R MV Y I F D S MT A A E L DR MV R I A C G S G T T V R E V E D L L G G M -D L Q S MMD F S - S L A L
- - - - - - - P -G L S N - L T A G L -DD E DG S MK L R R M I Y I F D S MT A A E L DR MV R I A C G S G T T V R E V E D L L G G M -D L Q S MMD F S - S L S L
- - - - - - - P -G F S QD L M S K G - S E Q E S MA K F K L L MT I MD S I D E E - - - - - - - - - - - - - - - -R DV K D L I G G M S G L HNMMDY R S QM P L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G F G P E F MNK G -N E Q E S V NR L K R MMT V MD S M S DK E L DR V A R V A R G S G C F QQ E V R D L L G G MG G L QNMMDY R K DM P L
- - - - - - - P -G F G S E F MT K G -N E Q E S V NR L K R MMT V MD S M S DK E L DR V A R V A R G S G S HQQ E V R D L L G G MG G L QNMMDY R K DM P L
- - - - - - - P -G L S N - I M S QV -G D E E T S K K I K NM I Y I MD S MT I K E L E R I V R V A R G S G C A V V E V E M I L G G - - - L A G MMD F S -Y L K L
- - - - - - - P -G MG N -MM S QV -G E E E T S QK MK K MV Y V L D S MT K E E L E R L V R V A R G S G T S V F DV E M I L G G M P DMN E MMD F S -Y L K L
- - - - - - - P -G F N -D F MT K G -H E K D S T E R L K R L MT I MD S MT D E E L DR V A R V A R G S G T L P K E V N E L L G G MNG L QNMM - - - - - - - - - - - - - - P -G L P A -G I MDG -N E E E A S A K L K R L I F I T DA MR A D E L DR A K R V A K G S G T S L R E L E D L L G G M P DM S QMQD L S -G QT L
- - - - - - - P -G MG P N I L A K E -D E QA G I E R L K K F MV I MD S MT E S E L DR V E R I S R G S G T S NQDV Q E L L G G A QNMMNMMDY S S E L K L
- - - - - - - P -G MG P N I L A K E -D E QA G I E R L K K F MV I MD S MT E S E L DR V E R I S R G S G T S NQDV Q E L L G G A QNMMNMMDY S S E L K L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V A T R DV Q E L L G G MA G L Q S MMDY R L QM L L
- - - - - - - P -G M S N - I M S QV -G E E E T S S K I K NM I Y I MD S MT T K E L DR I I R V A R G A G C S A V E V E MV L G G M P DM S S MMD F S -Y L K L
- - - - - - - P -G MNQ - - L P Q L -QG N E G G L K L K A Y I N I L D S L S E K E L DR I I T I A QG S G R H P N E V V E L L G G M P S MG D L A DY S K R C I L
- - - - - - - P -G F S QD F MT K G -G E Q E S MA R V K R MMT MMD S M S DN E L DR C V R V A QG A G V M E R E V K E L I G G V G G I QNMMDY R S QMQ L
- - - - - - - P -G F S QD F MT K G -G E A E S MA R I K R MMT MMD S M S DN E L DR C T R V A QG A G V L E R E A K E L I G G V G G L QNMMDY R G QMQ L
- - - - - - - P -G F S G - - - - - - - L S L P D E DT F K K L I Y V F D S L S R G E L DR I MR V A R G S G T S V QG V V E I L G S M F - - - - - - P Y E - E L V L
- - - - - - - - -G L NH P M F QG G -N I E - - -K K F K V F MV I L D S MT DR E L DR I R R L A R G S G R D I R E V N E L F -NQA Q L QQ L MDY - -DMR L
- - - - - - - P -G MG N -M L NQ F - S E E E T S K K MK T MV Y I F D S MT K K E L E R L V R V A K G S G T T V F D I E M L L G G M P NMQD I MD F S -Y L K L
- - - - - - - P -G M S G HA A T A G - - -QQG D I A L K G F I HM L D S MT V A E L DR MHR I A R G S G H S V V E V QN L I G G MG G L QG MM S L S G DA R T
- - - - - - - P -G M S N -MMQG M -DD E E G T G K L K R M I Y I C D S MT DK E L DR MT R V A R G S G T HV R E V E D L L G G M -DMA A MMDY S -Y L K L
- - - - - - - P MG MG G MK F S D E -M F QA T S DK MK NY K V I MD S MT E E E MT R I K R I S K G S G C S S E DV R E L L G G K F N I QK MM E I S F K - - - - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R QQM P L
- - - - - - - P -G M S N -MMQNM -DD E E G S L K L K R M I Y I C D S MT DK E L DR MT R V A R G S G T T V R E V E D L L G G M -DMNA MMD F S -Y L N L
- - - - - - - P -G F S S E L M P K G -H E K E S QA K I K R Y MT MMD S MT DG E L DR I L R I A R G S G R P V R DV V DM L G G MG G L Q S L MD F T -K L E L
- - - - - - - P -G MA D -M I P K G -G E E QG T K R MK S MMV L MD S MT DA E L DR M E R V C R G A G R L P S E M I T L L G G QQG MMNMMD F S - E L E L
- - - - - - - P -G MG S S V L S K G -N E K E S I K R I QR F L C I MN S MT A D E L DR I V R I A K G S G T S I E E V H I L L G G MDNV MNMMDY R - - -N I
- - - - - - - P -G L S N - I M S QV -G E E E T S K K I K NMV Y I MD S MT T E E L E R I L R V A K G S G C A A V E I E MV L G G M P NMA NMMDY S -Y L K L
- - - - - - - P -G F G NN L I S K G -T E K E G I DK I K K F MV I MD S MT N E E L DR C L R I V K G S G T R L QD I K E L L G G A NNMV N I L DY S K DMK L
- - - - - - - P -G F G NN I I S K G -T E K E G I E K I K K F MV I MD S MT N E E L DR C L R I C K G S G T R L QD I R E L L G G A NNMV N I L DY S K DMK L
- - - - - - - P -G F G T N L I S K G -T E K E G I DK I K K Y MV I MD S MT N E E L DR C I R I C K G S G T K L S D I K E L L G G A NNMV NM L DY S K E MK L
- - - - - - - P -G F S A E L M P K G -H E K E S QA K I K R Y MT MMD S MT N E E L DR MMR I A R G A G R P I R DV M E I L G G MG G L QN L MD F S -K L E L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R QQM P L
- - - - - - - P -G M S N -MMNQV -G E E E T S QK MK K MV Y V L D S MT K E E L E R MV R V A K G S G T S V F E V E M I L G G M P DMN E MMD F S -Y L R L
- - - - - - - P -G M S N -MMNG M -ND E E G S L R MK R M L Y I V D S MT E Q E L DR V L R V A R G S G T S V L E V E E T I G G M -D F S G M L D F S N L L G L
- - - - - - - P -G F S S D F MT K G -N E Q E S MA R L K K L MT MMD S MK D E E L DR V QR V A R G S G V S V R E V Q E L L G G MNG L Q S MMDY R G QM E L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V A T R DV Q E L L G G MA G L Q S MMDY R S QMQM
- - - - - - - P -G MG N S I L DK N -N E K E S I R K V K K F L T I MD S MNDN E L DR I V R I A R G S G S S L E DV NQ L L G G MG N I MNMV DY S -T L E L
- - - - - - - P -G I P P E L L QA G -R E Q E G V DR I K R F M I I MD S MT D E E L DR I MR I A K G S G S S P H E I N F L I G G A G N I MK L MDY S N - L K L
- - - - - - - P -G I P P E L L QA G -R E Q E G V DR I K R F M I I MD S MT D E E L DR I MR I A K G S G S S P H E I S F L I G G A G N I MK L MDY S N - L K L
T I A QG I K P -G MK N - - - - - - -D P E HT K DT V K R I L V V I DA M S T S E I E R I A R L C S G S G M P P P F V QY V I G G I E G L E A S MK Y F K D L Y I
- - - - - - - P -G M S G F T G NA G - - -DA G DV T L K T F I HMMD S MT A A E L DR I HR I A R G S G HT I L E V HN L I G G L T G L QD I M - - - - - - - - - - - - - - P -G M S A L S G A A G - - - E L G DV T L K A F I H I MD S MT A A E L DR I L R V A R G S G H S I H E V HN L I G G L G G L QG MM - - - - - - - - - - - - - - P -G MG QM L S G A G G DD E A A G S K MK R MM F I T DA MT A E E L DR A R R V A R G S G T S V K E V E E F L G G M P DM S Q L A D F T -K M P L
- - - - - - - P -G L S G -MA S S I - S D E E G T R R I K R M I Y I L D S MNQK E L DR I T R V A R G S G T S I R E V E E V L G G M P DMG QMMD F S -Y L K L
Alignment ruler (column positions): 2000 2010 2020 2030 2040 2050 2060 2070
Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264
K P DN E S R P L WV A P -NG H - - I F L E S F S P V Y K HA HD F L I A I S E P V C R P E H I H E Y K L T A Y S L Y A A V S V G L QT HD I V E Y L K R - - - - L
K P DHG NR P L WA C A -DG K - - I F L E T F S P L Y K QA Y D F L I A I A E P V C R P E S MH E Y N L T P H S L Y A A V S V G L E T E T I I S V L NK - - - - L
K P DHA S R P L W I A P NDG R - - I I L E S F S P L A E QA QD F L V T I A E P V S R P S HV H E Y K I T A Y S L Y A A V S V G L E T DD I I A V L DR - - - - L
K P DHA NR P L W I D P L K G T - - I T L E S F S P L A P QA QD F L T T I A E P L S R P T H L H E Y R L T G N S L Y A A V S V G L Q P T D I I N F L DR - - - - L
K P DHA NR P L W I D P L K G T - - I T L E S F S P L A P QA QD F L T T I A E P L S R P T H L H E Y R L T G N S L Y A A V S V G L L P QD I I N F L DR - - - - L
K P DNA S R P L WV A P -NG H - - I F L E A F S P V Y K HA HD F L I A I A E P V C R - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - K DDHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T DY L R K - - - - L
K A D F S A R P L WV A P -DG H - - I F L E S F S P V Y K HA R D F L I A I S E P V C R P QH I H E Y Q L T A Y S L Y A A V S V G L QT K D I I E Y L E R - - - - L
K G D F T A R P L WV A P -DG H - - I F L E S F S P V Y K HA R D F L I A I S E P V C R P QH I H E Y Q L T A Y S L Y A A V S V G L QT K D I I E Y L E R - - - - L
K P DH F S R P I W I S P NDG R - - I I L E S F S P L A E QA QD F L I T I A E P I S R P S H I H E Y R I T A Y S L Y A A I S V G L E T DD I I S V L NR - - - - L
K P DHA S R P I W I S P S DG R - - I I L E S F S P L A E QA QD F L V T I A E P I S R P S H I H E Y K I T A Y S L Y A A V S V G L E T DD I I S V L DR - - - - L
K DDHG S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
----------------------------------------------------------------------------------K G DH S L R P L WV DD -R G N - - I I V E A F A P F A K QA QD F L V A I S E P V S R P A L I H E Y R I T K P S L H S A M S I G L E T K V I I E V L S R - - - - L
K S DHDK R P I WV F P -DG L - - I I I E T F HQ S S K A A C E F L V T I S E P L S R P E L I H E Y Q L T I F S L Y A A V S L G I T V D S I I E T L G K - - - - F
K S DHDK R P I WV F P -DG L - - I I I E T F HQ S S K A A C E F L V T I S E P L S R P E L I H E Y Q L T I F S L Y A A V S L G I T V D S I I E T L G K - - - - F
K NDH S S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I S E P V C R P T HA H E Y K L T A Y S L Y A A V S V G L QT S D I I E Y L QK - - - - L
K P DH F S R P I W I S P I DA R - - I I L E S F S P L A E QA QD F L I T I A E P I S R P S HV H E Y R I T A Y S L Y A A V S V G L E T DD I I L V L NR - - - - L
K QDNK S R P I WV C P -DG H - - I F L E T F S A I Y K QA S D F L V A I A E P V C R P QN I H E Y Q L T P Y S L Y A A V S V G L E T ND I I T V L G R - - - - L
R P DHG NR P L WV A P -NG H - -V F L E S F S P V Y K HA HD F L I A I S E P V C R P E H I H E Y K L T A Y S L Y A A V S V G L QT HD I V E Y L K R - - - - L
R P DHG NR P L WV A P -NG H - -V F L E S F S P V Y K HA HD F L I A I S E P V C R P E H I H E Y K L T A Y S L Y A A V S V G L QT HD I V E Y L K R - - - - L
K E DG E S H P I WV NY -DG L - - I I L E T F R E S S R QA S D F L I A I A E P M S R P L Q I H E F Q I T A Y S L Y A A V S V G L T T S D I I E T L DR - - - - F
K P NH P E L P MWV S S -N L R - - I V V E T S NDM F K E V S DY L S R V A QV K S R M E HMH E Y Q L T P T S I MT A F S F G S T P E A M I S T L E K - - - -Y
K A DNA S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T H I H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L QK - - - - L
K DDHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
K P DHA S R P L W I S P NDG R - -V I L E S F S P L A E QA QD F L V T I A E P V S R P S H I H E Y R I T A Y S L Y A A V S V G L E T E D I I A V L DR - - - - L
G E K C L F V E S R I E S -DG Y I T I I A E S F R R S Y V N I R P F L T T L A E A I S R P S L MH E Y L L T P F S L G A A V S NG I DA A E A T A F L E T HA Y G L
K P DHA NR P L W I N P DK G I - - I I L E S F N P L A E QA QD F L I T I A E P Q S R P T F L H E Y A L T A H S L Y A A V S V G L H P QD I I S T L DR - - - - F
- - - - - - - - - - - - - -QG T - - - - - - - - - - - - - - - - - - - L L I K G NV R V P N S I WD E R S G S F R A P A - - - - - L Y Y R D I V NY L K E - - - - K DDHA S R P L WV A P -DG H - -V F L E A F S P V Y K Y A QD F L V A I A E P V C R P S HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L K K - - - - L
K G DHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
K P DHDQK P L W I D P E K G T - - I I L E K F S P DA DR V T D F L V T I A E P K S R P H F L H E Y Q L T A H S L Y A G V S I G L Q S K D I I DT L DR - - - - F
K P DHA NR P L WA C A -DG R - - I F L E T F S P L Y K QA Y D F L I A I A E P V C R P E S MH E Y N L T P H S L Y A A V S V G L E T S T I I S V M S K - - - - L
K P DHA NR P L WV C D -DG R - - I F L E S F S P V Y K A A Y D F L I S V A E P V C R P A NMH E Y V L T P H S L Y A A V S V G L E T S T I L S V L DR - - - - L
E I V Q S NK P L I L S P -D L G - - I I V E K F N P L Y E I A F E F L MC V A E P I S R S E L I H E Y V L T QM S MY T A MV L QY S A DD I I R L L D L - - - - L
K P DH F S R P I WM S P -DG R - - I I L E S F S P L A E QA QD F L I T I A E P I S R P S H I H E Y R L T P Y S L Y A A V S V G L E T DD I I S V L S R - - - - L
K K NHMNK P L W I C S -DG F - - I Y L E M F N S C S K QA S D F L I T I A E P I C R P E L I H E F Q L T I F S L Y A A I S V G I T L D E L L I N L DK - - - - F
K K NHMNK P MW I C S -DG F - - I Y L E M F N S C S K QA S D F L I T I A E P I C R P E L I H E F Q L T I F S L Y A A I S V G V T L D E L L V N L DK - - - - F
K K NHMNK P L W I C S -DG F - - I Y L E M F N S C S K QA S D F L I T I A E P I C R P E I I H E F Q L T I F S L Y A A I S V G I T L D E L L L N L DK - - - - F
K P DHA NR P L WA C A -DG R - - I F L E T F S S L Y K QA Y D F L I A I A E P V C R P E S MH E Y N L T P H S L Y A A V S V G L E T E T I I S V L NK - - - - L
K G DHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
R P DHA S R P L W I S P S DG R - - I I L E S F S P L A E QA QD F L V T I A E P I S R P S H I H E Y K I T A Y S L Y A A V S V G L E T DD I I S V L DR - - - - L
K L DHT A R P L W I N P I DG R - - I I L E A F S P L A E QA I D F L V T I S E P V S R P A F I H E Y R I T A Y S L Y A A V S V G L K T E D I I A V L DR - - - - L
K K DHG S R P L W L A P -DG H - - I F L E S F S P V Y K HA HD F L I A I S E P V C R P E N I H E Y K L T A Y S L Y A A V S V G L QT S D I I E Y L R R - - - - L
K DDHA S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L I A I A E P V C R P T H I H E Y K L T A Y S L Y A A V S V G L QT S D I V E Y L QK - - - - L
K DDY R E R P I L I C P -DG I - - I F L E T F N P L Y R V A Y Q F L I S I G E P V QR P L S MHK F T L T K Y S L Y T A MV L QY E P K D I I L C L E K - - - - L
K T NHT A R P L WV C P -DG Y - - L Y L E L F T P V S K QA L D F I V T I A E P V C R P E L I H E Y QV T V F S L Y T A V S V G L S F E E L L NN L NK - - - - F
K NNH S A R P L WV C P -DG Y - - L Y L E L F T P V S K QA L D F I V T I A E P V C R P E L I H E Y QV T V F S L Y T A V S V G L S F E E L L NN L NK - - - - F
L E N S DNR P A I V M P -DG H - - I F V E T F S P F Y S K V V D F I I A I A D P C S R P K Y V Q E Y Q I N P Y S I F S A V S I G L K A K E I I R I L A I - - - - I
- - - - - - - - - - L G P -G G R - - I F I NHG H P A Y P H L MD F L T A C C E P V C R T L Y V S E Y T I S P S S L S A A T A E G T Y S M E MV R NV I R Y F R L D
- - - - - - - - - -V G A -NG S - - L F V NNT H P A Y P H L V D F L T S C C E P V S R T L R M S E Y V I S P S S L S A A S A E G T Y S T A M I R N F I R Y F R L D
K L DHA S R P L W I S P DDG H - - I I L E G F S P L A E QA QD F L I A I A E P V S R P A Y I H E Y K L T P Y S L Y A A V S V G L Q P DD I I E V L NR - - - - L
K P DHA A R P L W I N P E DG R - - I I L E S F S P L A E QA QD F L V T I A E P I S R P S H I H E Y R I T T Y S L Y A A V S V G L E T S D I I S V L NR - - - - L
Alignment ruler (column positions): 2080 2090 2100 2110 2120 2130 2140 2150
Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264
S K I R L C T L S Y G K K L V L K HNK Y F V E F E V NQ E K I - E V L QK R C I - - E I E F P L L A E Y D F R NDT I N - - - -A D I N I D L K P A - - -A V L R P
S K I HA S T A NY G K K L V L K K NR Y F I E F E I D P A L V - E NV K QR C L P NA L NY P M L E E Y D F R NDNV N - - - - P D L DM E L K P H - - -A Q P R P
S K I K G A T V S Y G K K L V I K HNR Y F V E F E I DR E S V - E L V K R R C Q - - E I DY P V L E E Y D F R NDNR N - - - - P D L E I D L K P S - - -T Q I R P
S K I I D F T K S F G K K V V L K HNR F F V E F E I P N E A V - E S V K A R C Q - -A MG C P A L E E Y D F R ND E I N - - - - P T L D I D L K P N - - -A R I R S
S K I L D F T K S Y G K K V V L K HNR F F V E F E I P N E A V - E P V K A R C Q - -A MG C P A L E E Y D F R ND E I N - - - - P T L D I D L K P A - - -A R I R S
- - - - - - - - - - - - - - - - - - - -Y L V E F E V D P DK I - E V I QK R C I - - E L E H P L L A E Y D F R ND S I N - - - - P D I N I D L K P T - - -A V L R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I QMC T V S Y G K K L V L K HNR Y F V E F E I K Q E T I - E T V QR R C I - - E L E Y P L L A E Y D F R NDT MN - - - - P N L G I D L K P S - - -T T L R P
S K V QMC T V S Y G K K L V L K HNR Y F V E F E I K Q E T I - E T V QK R C I - - E L E Y P L L A E Y D F R NDT L N - - - - P N L G I D L K P S - - -T T L R P
S K I K A A T V S Y G K K L V L K HNR Y F V E F E I A HD S V - E I V K R R C Q - -D I E Y P V L E E Y D F R HDA R N - - - - P D L E I D L K P S - - -T Q I R P
S K I K G A T I S Y G K K L V I K HNR Y F V E F E I A N S S V - E I V K R R C Q - - E I DY P V L E E Y D F R NDNR N - - - - P D L E I D L K P S - - -T Q I R P
----------------------------------------------------------------------------------S K I E E WT A S F G K R L V L K DNR Y F L E F E V S G E R M - E DV R R R C K - -D I D L P A L E E Y D F R NDT I N - - - - P N L D I Q L K P M - - -T V I R P
S K I R G HC K L F G K K I V L L E G R Y F V E F E I S G DK V -D I V T MA S F -V S L HR P L L S E Y D F R S D I K N - - - - P N L D I S L K HT - - -T Q I R Y
S K I R G HC K L F G K K I V L L E G R Y F V E F E I S G DK V -D I V T MA S F -V S L HR P L L S E Y D F R S D I K N - - - - P N L D I S L K HT - - -T Q I R Y
S K I K L C T V S Y G K K L V L K HNR Y F V E F E I R Q E M I - E E L QK R C I - -Q L E Y P L L A E Y D F R NDT V N - - - - P D I NMD L K P T - - -A V L R P
S K I R G A T I S Y G K K L V L K HNR Y Y V E F E I A N E S V - E I V K R R C Q - - E I E Y P V L E E Y D F R NDDR N - - - - P D L D I D L K P S - - -T Q I R P
S K V R QC T Q S Y G K K L V L QK NK Y F V E F E I D P QQV - E E V K K R C I - -Q L DY P V L E E Y D F R NDT V N - - - - P N L N I D L K P T - - -T M I R P
S K I R L C T L S Y G K K L V L K HNK Y F I E F E V A Q E K I - E V I QK R C I - - E I E H P L L A E Y D F R NDT NN - - - - P D I N I D L K P A - - -A V L R P
S K I R L C T L S Y G K K L V L K HNK Y F I E F E V S Q E K I - E V I QK R C I - - E I E H P L L A E Y D F R NDT NN - - - - P D I N I D L K P A - - -A V L R P
S K I T E C T L S Y G K K L V MK E S S F F L E L S I E V E E V - E L V K K R C I - - E I DY P L I E E Y D F R NDK V L - - - -R S L Q I D L K P T - - -T I I R S
S K I R QA G E K K QNR L V L I NG K Y Y L Q I E V K QT S V - F K L K K K C K - -K K K V R V Y E E Y H F L R DK Q - - - - -K E L P I Q L R K D - - - -C L R P
S K I K L C T V S Y G K K L V L K R NR Y F V E F E V K Q E M I - E E L QK R C I - -H L DY P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I K S A T V S Y G K K L V I K HNR Y F V E F E I DNA S V - E I V K K R C Q - - E L DY P V L E E Y D F R NDR R N - - - - P D L D I D L K P S - - -T Q I R P
A E I E S C MK R Y N L R I I I DA E R T L V Q F L L Q S R A M S K V V A A QC V - -V L G L P I QQQY D F E NDT S V - - - -R T A H I S L R T Q - - -T K P R R
L K I E V S T K S Y G K K L V L K NT QY F V E F Q I E D E G V - E I V QK R C L - - E L NY P I L E E Y D F R NDT F N - - - - P V L D I D L R P N - - -T QV R P
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S G I D F E DA V L D L L P C P D L S A A Y E A S G K K L K L R D
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -C L E Y P L L A E Y D F R NDT L N - - - - P D I N I D L K P T - - -A V L R P
L K I E S C T K S Y G K K L V L NNNK Y F V E F E I P E T A V - E I V QR R C L - -D L G F P I L E E Y D F R ND S NN - - - -A D L E I D L R P N - - -T Q I R P
S K I HA S T A NY G K K L V L K K NR Y F V E F E I D P S QV - E NV K QR C L P NA L N F P M L E E Y D F R NDT V N - - - - P D L E M E L K P Q - - -A R P R P
S K V H E C T E NY G K K L V L QR NK F Y L E F E I E A R QV - E HV K QR C L P G N L G Y P T L E E Y D F R NDT R N - - - - P D L G I E L K P M - - -T R I R P
S K I R HHT NN I G QK F F L QDK S Y Y I D F R I V G DY F - -DV A QA L I - -R S S V P L I Q E Y D F T K E K - - - - - -QK L D I N L K P S - - -T K P R L
S K I K S A T I S Y G K K L V L K HNR Y F V E F E I A N E S V - E I V K R R C Q - -D I DY P V L E E Y D F R NDA R N - - - - P D L E I D L K P S - - -T Q I R P
S K I T K S A E S F G K K L V L R E NK Y Y I E F E V NC DK I - E E V K Q E A L -QT MQR P L L M E Y D F R R DK K N - - - - P N L I C S L K S H - - -V Q I R Y
S K I T K S A E S F G K K L V L R E NK Y Y I E F E V NC DK L - E E V K Q E A L -QT MQR P L L M E Y D F R R DK K N - - - - P N L I C S L K S H - - -V Q I R Y
S K I T K S A E S F G K K L V L R E NK Y Y I E F E V NC DK I - E E V K Q E A L -QT MQR P L L M E Y D F R R DK K N - - - - P N L NC S L K S H - - -V Q I R Y
S K I HG S T A NY G K K L V L K K NR Y F I E F E V D P S QV - E NV K QR C L P NA L NY P M L E E Y D F R NDT V N - - - - P D L NM E L K P H - - -A Q P R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -C L E Y P L L A E Y D F R ND S L N - - - - P D I N I D L K P T - - -A V L R P
S K I K G A T I S Y G K K L V I K HNR Y F V E F E I A N E S V - E V V K K R C Q - - E I DY P V L E E Y D F R NDHR N - - - - P D L D I D L K P S - - -T Q I R P
S K I R A C T V S Y G K K L V L K K NR Y F I E F E I K H S S V - E T I K K R C A - - E I DY P L L E E Y D F R NDN I N - - - - P D L P I D L K P S - - -T Q I R P
S K I K L C T L S Y G K K L V L K HNR Y F V E F E V V QD E I - E N L QK R C I - - E L E Y P L L A E Y D F R NDT R N - - - - P D L S I D L K P T - - -A V L R P
SK I K V - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S K I T E NT QNY G A R L F L DD S S Y Y L D I Q I I G DH F - - E V T K A V I - -NC S V P L I Q E Y D F E NK S F - - - - -K Q L E I E L K P K - - - I K V R Y
S K I L NT S S A F G K K L V L R D S R Y W I E F E V QQ E K I - E D L K R E A L -QT MR R P L V M E Y D F R K DNN S - - - - P S L NC C I R S N - - - I K I R Y
S K I L NT S S A F G K K L V L R D S R Y W I E F E V QQ E K I - E E L K R E A L -QNMR R P L V M E Y D F R K DNN S - - - - P S L NC C I R S N - - - I K I R Y
S K I E L C C L S V G K K S V L R NT K Y Y I E F Q I K T E S V -R E I R QY A V - -DHN L F I S D E Y D F MNDK T I - - - -DN L G I Q L K NT - - -T R I R P
E QA K V S A NG DV K - -V K K E E T K E V A S QV MDG K M -R NV R E R L Y -K E L S V R A D L F Y DY V QDH S L - - - -HV C D L E L S E N - - -V R L R P
E QT H E S E E S QG K S L V K Q E A T E E T A S QV K DG R L -R NV R E R L F -K E L G V R A D L F Y DY V QDG T L - - - -DV R D L A L A E H - - -V R L R P
S K I R E Y T A S F G K K L V L K QNK Y F V E F E I A E E Y I - E QV K K R C N - - E I G Y P M L E E Y D F R NDQ L N - - - -A D L E I D L K P I - - -T H I R P
S K I H S C T K S Y G K K L V L K HNR Y F V E F E I A P D S V - E T V K K R C Q - - E I DY P V L E E Y D F R NDHG N - - - - P D L D I D L K S S - - -T Q I R P
Alignment ruler (column positions): 2160 2170 2180 2190 2200 2210 2220 2230
Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A C C T V R K R A L V L C N S G V S V E QWK QQ F WG I MV L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L V G V S A A A R I K K S C L C L A T NA V S V DQWA Y Q F WG L L L MD E V HV V P A HM F R K V I S I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I L L D E V HV V P A A M F R R V V S T
Y Q E K S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A A C T I K K G T I V L C T S S M S V V QWR N E F WG L M I L D E V HV V P A S M F R K V T S A
Y Q E K S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A G C T I K K G T I V L C T S S M S V V QWR N E F WG L M I L D E V HV V P A S M F R K V T S A
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A V C T V R K R A L V L C N S G V S V E QWK QQ F WG L V V L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G N S R A R S G V I V L P C G A G K T L V G V T A V T T V NK R C L V L A N S NV S V E QWR A Q F WG L L L L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G N S R A R S G V I V L P C G A G K T L V G V T A V T T V NK R C L V L A N S NV S V E QWR A Q F WG L L L L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A QM F R R V V T T
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V S T
----------------------------------------------------------------------------------Y Q E M S L A K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S A L V L C T S A V S V A QWK QQ F WG F L L L D E V HV T P A DM F R K C I NN
Y Q E QA L R MM F S NG R A R S G I I V L P C G A G K T L T G I T A A C T MR K S I L I L T T S A V A V S QWK F Q F WG L L I F D E V Q F A P A P A F R R I NG I
Y Q E QA L R MM F S NG R A R S G I I V L P C G A G K T L T G I T A A C T MR K S V L I L T T S A V A V S QWK F Q F WG L L I F D E V Q F A P A P A F R R I NG I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S S V S V E QWK A Q F WG L I I L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V T T
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L S G I T A A C T V K K S I L V L C T S A V S V E QWK Y Q F WG L V L L D E V HV V P A A M F R K V L T V
Y Q E I C L NK M F G NG R A R S G I I V L P C G S G K T I V G I T A I S T I K K NC L V L C T S A V S V E QWK QQT WG L L V L D E V HV V P A MM F R R V L S L
HQ E R A L QQ I F DN E MA R S G I V V L P C G A G K T L T A I A A C S K I K R S T I V L T HT T Q S V F QWK E E F WG F I I F D E V HG S T T DN I E K F V C K
Y Q I E A V DA A I HDG T L N S G C L L L P C G A G K T L L G I M L MC K V K K P T L V L C A G S V S V E QWK S Q I Y G L L I L D E V HV M P A E S F R G S L G F
Y Q E K S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A A C T I K R G V I V L C T S T M S V V QWR D E F WG L M I L D E V HV A P A K M F R R V T S A
Y QA E A L V A W S E N - - E K WG V L V L P T G S G K T L L G I R A I A G C NT P A L V I V P T L D L L E QWK T Q L F G L L V F D E V HH L P A A G Y R S I A E F
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A R M F R R V L T I
Y Q E Q S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A A C T I K K G V I V L C T S S M S V V QWR Q E F WG L M L L D E V HV V P A DV F R R V I S S
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L V G V S A A C R I K K S C L C L A T NA V S V DQWA F Q F WG L L L MD E V HV V P A HM F R K V I S I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L T G I A A A A R I R K S C L C L C T S S V S V DQWA A Q F WG C M L L D E V HV V P A A M F R K V I G I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P G K QA G A E L R V
Y Q L R A A K T V I MG DY A K S G L I V L P C G A G K T L V G V L C M S L I K S S T V I I C D S NV S V E QWK R E I WG I C I V D E V HR L P A V Q F QNV L K Q
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V T T
Y Q E K A L R K M F S NG R S R S G I I V L P C G V G K T L T G I T A A S T I K K S A L F L T T S A V A V E QWK K Q F WG L L V F D E V Q F A P A P S F R R I ND I
Y Q E K A L R K M F S NG R S R S G I I V L P C G V G K T L T G I T A A S T I K K S S L F L T T S A V A V E QWK K Q F WG L L V F D E V Q F A P A P S F R R I ND I
Y Q E K A L R K M F S NG R S R S G I I V L P C G V G K T L T G I T A A S T I K K S S L F L T T S A V A V E QWK K Q F WG L L V F D E V Q F A P A P S F R R I ND I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L V G V S A A C R I K K S C L C L A T NA V S V DQWA F Q F WG L L L MD E V HV V P A HM F R K V I S I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I L L D E V HV V P A A M F R R V V T T
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K T L V G V T A S C T V R K R C MV L C T S G V A V E QWR S Q F WG L M I L D E V HT I P A K Q F R R V L T Q
----------------------------------------------------------------------------------Y Q E R A L K N I F I QK K A R S G L I I L P C G A G K T I V G V I A I E R I K Q S T V I I C D S DV S V DQWR D E L WG V C V I D E V HK L P A NT F QNV L K Q
Y Q E R A L R R M F S NG R A R S G I I V L P C G A G K T L T G I V A A C T V R K S I F V L T T S A V A V E QW I K Q F WG M L I F D E V Q F V P A P A F R R I N E I
Y Q E R A L R R M F S NG R A R S G I I V L P C G A G K T L T G I V A A C T V R K S I F V L T T S A V A V E QW I K Q F WG M L I F D E V Q F V P A P A F R R I N E I
Y Q E K A L T K M F S G G R S I S G I I V L P C G A G K T L V G I A A L A T I NK P T V I V C NNR L T V K QWY NQ I WG L L I L D E V QD S A A NT F R NV T D I
Y QV A S L E R F R S G NK A HQG V I V L P C G A G K T L T G I G A A A T V K K R T I V MC I NV M S V L QWQR E F WG L L L L D E V HT A L A HN F Q E V L NK
Y QV A S L E R F R C G NK A HQG V I V L P C G A G K T L T G I G A A T I L K K R T I V MC I NV I S V L QWQR E F WG L L L L D E V HA A L A HH F Q E V L NK
Y Q E K S L A K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S C L V L C T S S V S V MQWR QQ F WG F I L L D E V HV V P A S M F R R V L T K
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R K V V T N
[Alignment ruler: columns 2250 to 2320. Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264]
V Q S HC K L G L T A T L L R E DDK I A D L N F L I G P K L Y E A NW L E L QK R G Y I A R V QC A E V WC P MA P E F Y R E Y L Y V MN P A K F R A C QY L I R Y
T K S HC K L G L T A T L V R E D E K I T D L N F L I G P K L Y E A NW L D L V K G G F I A NV QC A E V WC P MT K E F F A E Y L Y V MN P NK F R A C E F L I R F
I A A HA K L G L T A T L V R E DDK I S D L N F L I G P K L Y E A NWM E L S QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I QY
I A T Q S K L G L T A T L L R E DDK I K D L N F L I G P K L Y E A NWM E L A E QG H I A K V QC A E V WC P MT T E F Y T E Y L Y I MN P R K F QA C Q F L I DY
I A C Q S K L G L T A T L L R E DDK I K D L N F L I G P K L Y E A NWM E L A E QG H I A K V QC A E V WC P MT T E F Y S E Y L Y I MN P R K F QA C Q F L I DY
V H S HA K L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QN S G Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
V QA HC K L G L T A T L V R E DDK I T D L N F L I G P K I Y E A NWM E L QK A G H I A K V QC A E V WC P MT S A F Y S Y Y L A V MN P NK F R I C Q F L I K F
V QA HC K L G L T A T L V R E DDK I T D L N F L I G P K I Y E A NWM E L QK A G H I A K V QC A E V WC P MT S A F Y S Y Y L A V MN P NK F R I C Q F L I K F
I A A HA K L G L T A T L V R E DDK I DD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I HY
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
----------------------------------------------------------------------------------
F K V HA K L G L T A T L V R E DDR I G D L G Y L I G P K L Y E A NWMD L A K NG H I A T V QC A E V WC P MT P E F Y R E Y L HA MN P NK I QA C Q F L I NY
V K A HC K L G L T A T L V R E DD L I QD L QW L I G P K L Y E A NWM E L QDR G Y L A K A L C S E V WC P MT A S Y Y R E Y L WV C N P NK L R V C E F L I R W
V K A HC K L G L T A T L V R E DD L I QD L QW L I G P K L Y E A NWM E L QDR G Y L A K A L C S E V WC P MT A S Y Y R E Y L WV C N P NK L R V C E F L I HW
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I R F
I A A HA K L G L T A T L V R E DDK I DD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT S E F Y Q E Y L Y I MN P T K F QA C Q F L I HY
T K A HC K L G L T A T L L R E D E K I QD L N F L I G P K L Y E A NW L D L QK A G F L A NV S C S E V WC P MT A E F Y K E Y L Y T MN P NK F R A C E Y L I R F
V Q S HC K L G L T A T L L R E DDK I A D L N F L I G P K L Y E A NW L E L QK K G Y I A R V QC A E V WC P M S P E F Y R E Y L Y V MN P S K F R S C Q F L I K Y
V Q S HC K L G L T A T L L R E DDK I A D L N F L I G P K L Y E A NW L E L QK K G Y I A R V QC A E V WC P M S P E F Y R E Y L Y V MN P S K F R S C Q F L I K Y
V S HHC K L G L T A T L V R E DDK I E D L N F L I G P K L Y E A DWQD L S A K G H I A R V S C I E V WC G MT G D F Y R E Y L S I MN P T K F QV C E Y L I NK
I K A QC K L G L T A T L I R E DDR I R D L E F M I G P M L Y E A S WQ E L A K QG Y I A NA K C F E V I C P MT K T Y Y S A Y L A Q L N P NK I DA C K Y L L E Q
V QA HC K L E L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QN S G Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
V DA K G V I G L T A T Y V R E DHK I L D L F H L V G P K L Y D I S M E T L A S QG Y L A K V HC V E V R T P MT K E F G L E Y L A A A N P NK MMC V R E L V R Q
L K S H S K L G L T A T L L R E DDK I S D L N F L I G P K L Y E A NWM E L S L G G H I A R V QC A E V WC P M P T E F Y R E Y L Y I MN P MK F QA C QY L I NY
S A A P C R L G L T A T Y E R E DG L HT E L NR L V G G K V Y E K K V S E L A -G G H L A P Y T I K R F A V T L T E K E QR E Y L A F N S N S K I E K L R E I L E Q
I K S H S K L G L T A T L L R E DDK I S H L N F L I G P K L Y E A NWM E L S E K G H I A K V QC A E V WC P M P T E F Y D E Y L Y A MN P R K F QA C QY L I NY
T K S HC K L G L T A T L V R E D E R I T D L N F L I G P K L Y E A NW L D L V K G G F I A NV QC A E V WC P MT K E F F A E Y L Y A MN P NK F R A C E F L I R F
T K A HC K L G L T A T L V R E DDK V DH L N F L I G P K L Y E A NW L D L QR DG H I A NV QC V E V WC P MT A E F F R K Y L Y C MN P NK F MA C Q F L MQ F
I L A HC N L R L L A T A F G HHD P V L D F L F T H L Q S I F E WA WW L T P NNG Y I A K V QC V E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
I K C A I K I G L T A T L L R E DQK L DN L Y F M I G P K L Y E E N L I D L MT QG F L A K P H I I E I QC DM P P I F L Q E Y L HT G N P G K Y K A L Q F L I K N
I A A HA K L G L T A T L V R E DDK I HD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT S E F Y Q E Y L Y I MN P T K F QA C Q F L I HY
V K S HC K L G L T A T L V R E D L L I R D L HW I I G P K L Y E A NWV E L QNK G F L A K A L C K E I WC S M P C S F Y K Y Y L Y T C N P R K L MMC E Y L I K Y
V K S HC K L G L T A T L V R E D L L I R D L QW I I G P K L Y E A NWV E L QNK G F L A K A L C K E I WC S M P S S F Y K Y Y L Y T C N P R K L MMC E Y L I K Y
V K S HC K L G L T A T L V R E D L L I R D L QW I I G P K L Y E A NWV E L QNK G F L A K A L C K E I WC S M P S S F Y K Y Y L Y T C N P R K L MMC E Y L I K Y
T K S HC K L G L T A T L V R E D E R I T D L N F L I G P K L Y E A NW L D L V K G G F I A NV QC A E V WC P MT K E F F A E Y L Y V MN P NK F R A C E F L I R F
I A A HA K L G L T A T L V R E DDK I G D L N F L I G P K L Y E A NWM E L S QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I QY
I A A HT K L G L T A T L V R E DDK I DD L N F L I G P K MY E A NWMD L A QK G H I A K V QC A E V WC A MT T E F Y N E Y L Y I MN P K K F QA C Q F L I DY
V QA HC K L G L T A T L V R E DDK I A D L N F L I G P K L Y E A NWM E L QNK G F I A R V QC A E V WC P MA P E F F R E Y L Y V MN P NK F R A C Q F L V R F
----------------------------------------------------------------------------------
Y K F H F K L G L T A T P Y R E D E K I I N L F Y M I G P K L Y E E NWY D L V S QG F L A K P Y C V E I R C E M S Q L WM S E Y I HT S N P R K F K T L E Y L I K V
I R S HC K L G L T A T L V R E DD L I R D L QW L I G P K L Y E A NW L E L QQK G Y L A K V I C K E I WC P MT A P F Y R E Y L W S C N P V K L I T C E Y L L R F
I R S HC K L G L T A T L V R E DD L I R D L QW L I G P K L Y E A NW L E L Q E K G Y L A K V I C K E I WC P MT A P F Y R E Y L W S C N P V K L I T C E Y L L K F
A K A HT R L G L T A T L I R E DDK I S D L R Y L V G P K L Y E A NW L E L S E QG Y L A R V K C F E V T V P MT A S F Y K Y Y L C S S N P NK I R T V A G I I K F
V K Y K C V I G L S A T L L R E DDK I G D L R H L V G P K L Y E A NW L D L T R A G F L A R V E C A E I QC P L P K A F L T E Y V V C L N P Y K L WC T QA L L E F
V K Y K C V V G L S A T L L R E DDK I G D L R H L V G P K L Y E A NW L E L T R A G F L A R V E C A E V QC P L P L P F F R E Y V V C F N P Y K L WC T QA L L E F
I K A H S K L G L T A T L V R E D E K I D E L N F L V G P K L Y E A NWMD L A A K G H I A T V QC A E V WC P MT P E F Y R E Y L Y C MN P NK F QA C Q F L I DY
I A A HA K L G L T A T L V R E DDK I DD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT S E F Y Q E Y L Y I MN P S K F QA A Q F L I NY
[Alignment ruler: columns 2330 to 2400]
H E -K R G - -DK T I V F S DNV F A L K HY - -A I K MNK P Y I Y G P T S QN E R I Q I L QN F K F N P K V NT I F V S K V A DT S F D L P E A NV L I Q I S S

H E QQR G - -DK I I V F A DN L F A L T E Y - -A MK L R K P M I Y G A T S H I E R T K I L E A F K T S K T V NT V F L S K V G DN S I D I P E A NV I I Q I S S
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I Y G S T P QQ E R MN I L QN F QY NDQ I S T I F L S K V G DT S I D L P E A T C L I Q I S S
H E -K R G - -DK V I V F S DNV Y A L E R Y - -A L K L NK A Y I Y G G T P QN E R MR I L E N F QHN E QV NT I F L S K I G DT S L D L P E A T C L I Q I S S
H E -K R G - -DK V I V F S DNV Y A L QR Y - -A L K L NK A Y I Y G G T P QN E R MR I L E N F QHN E QV NT I F L S K I G DT S L D L P E A T C L I Q I S S
----------------------------------------------------------------------------------
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -R R N - -DK I I V F S DNV F A L K R Y - -A I E MQK P F L Y G E T S QN E R MK I L QN F QY N P R V NT I F V S K V A DT S F D L P E A NV L I Q I S A
H E -R R N - -DK I I V F S DNV F A L K R Y - -A I E MQK P F L Y G E T S QN E R MK I L QN F QY N P R V NT I F V S K V A DT S F D L P E A NV L I Q I S A
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L R L G K P F I Y G S T P QQ E R MK I L QN F QHNDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -R R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I F G S T P QQ E R MN I L QN F QY NDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
- - - - - - - - - - - - - - - - -V F A L K E Y - -A V R MG K P Y I Y G P T T QG E R L Q I L QN F I HN P K V NT I F I S K V G DN S I D L P A A NV L I QV S S
H E - S R G - -DK V I V F S DNV F A L E A Y - -A K K L G K S F I HG G T P E G E R L R I L S R F QHD P Q L NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QR G - -DK I L V F S D S L F A L I N I - -A V A L K K P F V C G S V DT L E R I K I L QQ F K E N P N F NT I F L S K V G DNA I D I P L A NV V I Q I S F
H E -QR G - -DK I L V F S D S L F A L I N I - -A V A L K K P F V C G S V DT L E R I K I L QQ F K E N P N F NT I F L S K V G DNA I D I P L A NV V I Q I S F
H E -K R G - -DK I I V F S DNV Y A L QG Y - -A L K L G K P F I Y G S T S QQ E R MK I L QN F QHNDQV NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QR G - -DK I I V F S DNV Y A L QK Y - -A K G L G R Y F I Y G P T S G H E R M S I L S K F QHD P T V R T I F I S K V G DT S I D I P E A T V I I QV S S
H E -QR G - -DK T I V F S DNV F A L K HY - -A I K MNK P F I Y G P T S QN E R I Q I L QN F K F N S K V NT I F V S K V A DT S F D L P E A NV L I Q I S S
H E -QR G - -DK T I V F S DNV F A L K HY - -A I K MNK P F I Y G P T S QN E R I Q I L QN F K F N S K V NT I F V S K V A DT S F D L P E A NV L V Q I S S
H E - S R G - -DK I I V F S D S V Y A L K A Y - -A L K L G K P F I Y G P T G QT E R MR I L K Q F QT N P V I NT I F L S K V G DT S I D L P E A T C L I Q I S S
HK -A HG - -DK I I I F C N E L K P A G F Y K E K L K L QK C Y MDG NT S E E HR R N L L DQ F R R D - E I S V I F C S K I G DV G L D L P DA S V A I Q L S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L G K P Y I Y G P T A QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I Y G S T P QQ E R MN I L QN F QY NDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H L -DA G - -A K I L L C C DH I M L L K E Y - -G E L L NA P V I C G S T QHK E R L M I F S D F Q S T S K I NV I C V S R V G DV S V N L P NA N I V I QV S S
H E -A R G - -DK I I V F S DNV Y A L K K Y - -A S V L S K C M I Y G G T S N S E R Q L I L K N F QHN P E I NT L F L S K I G DT S L D L P E A T C L I Q I S S
H - - -R E - -DR V F I F T E HNR L V HR I - - S NT F F I P A I T Y R T P A K E R N S I L E K F R -T G S Y R A V V T S K V L D E G I DV P E A N I G I - I V S
H E -A R G - -DK I I V F S D E L Y S L K QY - -A L K L NK V F I Y G G T G QA E R MQV L E N F QHN P QV NT L F L S K I G DT S L D L P E A T C L I Q I S S
H E QQR G - -DK I I V F A DN L F A L T S Y - -A MK L R K P M I Y G S T S HV E R T R I L HQ F K N S S DV NT I F L S K V - - - - - - - - - - - - - - - - -
H E QQR K - -DK V I V F S DN I F A L R E Y - -A T A L R R P L I Y G DT S HA E R T R V L HA F K Y S N E I NT I F L S K V G DN S I D I P E A NV I I Q I S S
H E -M L G - -HK I I V F C D S L L I L NY Y - -A L L L G Y P V I DG D L NT D E K NK I F S I F K N S N E I K T I F V S R V G DT G I D I P S A S V G I E I G Y
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I Y G S T P QQ E R MQ I L S N F QHNDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QNN - -DK I I V F S DN I F A L L H I - -A K T L NK P F I Y G K L S P I E R I A I I NK F K HD S S I NT I L L S K V G DNA I D I P I A NV V I Q I S F
H E -QNN - -DK I I V F S DN I F A L L H I - -A K T L NK P F I Y G K L S P I E R I A I I NK F K ND S N I NT I L L S K V G DNA I D I P I A NV V I Q I S F
H E -QNN - -DK I I V F S DN I F A L L H I - -A K T L NK P F I Y G K L S P I E R I A I I NK F K ND S T I NT I L L S K V G DNA I D I P I A NV V I Q I S F
H E E QR R - -DK I I V F A DN L F A L T E Y - -A MK L HK P M I Y G A T S HA E R T K I L HA F K T S S E V NT V F L S K V G DN S I D I P E A NV I I Q I S S
H E -R R G - -DK I I V F S DNV Y A L Q E Y - -A L K MG K P F I Y G S T P QQ E R MN I L QN F QY NDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -K R G - -DK I I V F S DNV Y A L R A Y - -A I K L G K Y F I Y G G T P QQ E R MR I L E N F QY N E L V NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QR N - -DK V I V F S DNV F A L K HY - -A I A MG R P Y I Y G P T S QG E R MQ I L QN F QHN P A V S T I F I S K V G DN S F D L P E A NV L I Q I S S
----------------------------------------------------------------------------------
H E - E R G - -DK I L V F C DR P M I I DY Y - -G N I L K Y P V I Y G DV S QD E R K K I F N L F K V S NQ I NT I F L S R V G DT A I D L P QA NV G I Q I G M
H E - S R G - -DK V I V F S DN L F A L L HA - -A K L L NR P F I Y G K V S S A E R I V I L NK F K N E T T F NT I F L S K V G DNA L D I P C A NV V I Q I S F
H E - S R G - -DK V I V F S DN L F A L L HA - -A K L L NR P F I Y G K V S S A E R I I I L NK F K N E T T F NT I F L S K V G DNA L D I P C A NV V I Q I S F
H E -R R G - -DK V L V F C D I I H I L I H L - -A G L L HC P E I HG E T P E NV R S S I F H E F K NG S K V NT L I L S S V G DK A I D L P S A S V V V QV C S
HR -NR S P P DK V I I F C DQ I DG I QY Y - -A QH L HV P F MDG K T S DM E R E N L L QY F QH S DN I NA I I L S R V G DV A L D I P C A S V V I Q I S G
HR -NR S P P DK V I I F C DD L E G V QY Y - -A R H L NV P F MDG K T T E V E R E N L L QY F QH S ND I NA I I L S R V G DV A L D I P C A S V I I QV S G
H E -NR G - -DK I I V F S DNV Y A L V A Y - -A HK L K K P F I HG G T A H L E R MR I L QN F QHN P L V NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -K R G - -DK I I V F S DNV HA L K A Y - -A L K L G K F F I F G G T P QQ E R MK I L K N F QY NDQV NT I F L S K V G DT S I D L P E A T C L I Q I S S
[Alignment ruler: columns 2410 to 2480]
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E MG Y S R K R QR F L V NQ -G Y S Y K V Y F K L R S A A V A E L K K S P -DT - - - - -H P Y P HK F
HA G S R R Q E A QR L G R I L R A K A F F Y S L V S T DT E MY Y S T K R QQ F L I DQ -G Y S F K V Y Y E NR L K Y L A A E K A K - -G E - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E A R S R Q I Q E L R K T Q - E P - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S S K R QA F L V DQ -G Y A F K V Y F E I R S K R I NK L R E T K -Q P - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S S K R QA F L V DQ -G Y A F K V Y F E I R S K R I NK L R E T K -N P - - - - -D P Y P HK F
----------------------------------------------------------------------------------
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I HQ L K V N - -G E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y S L V S QDT E MG Y S R K R QR F L V NQ -G Y A Y K V Y F NMR V R M I E A R R A A - -G E - - - - -N P F P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y S L V S QDT E MG Y S R K R QR F L V NQ -G Y A Y K V Y F NMR V R M I E A R R A A - -G D - - - - -N P F P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V F F E I R S R Q I S E L R E K N -NA D P S A F N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E A R S R Q I L E L R K T H - S P - - - - -N P Y P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I HQ L K I N - -G E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y S L V S QDT E V A Y S T K R QR F L V DQ -G Y S F K A Y L Q I R K NT I T T L R QN - -N I - - - - - E P Y P HK F
H F G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S S K R QG F L I DQ -G Y A F K V F H E MR Y K E I A K L R E T K -Q P - - - - -N P Y P HK F
N F A S R R Q E A QR L G R I L R P K A F F Y S L L S K DT E M E Y A DK R QQ F I I DQ -G Y S Y R V Y T DNR Y K MM E C I K DA - -G R - - - - - P F Y P HK F
N F A S R R Q E A QR L G R I L R P K A F F Y S L L S K DT E M E Y A DK R QQ F I I DQ -G Y S Y R V Y T DNR Y K MM E C I K DA - -G R - - - - - P F Y P HK F
HG G S R R Q E A QR L G R V L R A K A Y F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I QA L K G T - -A E - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E I R S R Q I D E L R QA N - L A DG S A F N P Y P HK F
HY G S R R Q E A QR L G R I L R P K A F F Y S L V S K DT E MY Y S T K R QQ F L I DQ -G Y S F K V Y K E NR T K Q L T S A D I - - -G V - - - - -N P W P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E M S Y S R K R QR F L V NQ -G Y S Y K V Y F K L R S A A V Q E L K R S P -A T - - - - -D P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E M S Y S R K R QR F L V NQ -G Y S Y K V Y F K L R S A A V Q E L K Q S A -D S - - - - -H P Y P HK F
H F G S R R Q E A QR L G R I L R A K V Y F Y S L V S K DT E M F Y S S K R QQ F L I DQ -G Y T F T I - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
S S G S R R Q E A QR L G R I L R A K A Y F Y T L T S K DT E MY F S QR R QR V MR QN -G Y T F K V L F L NR C K DV E E Y QK A - -G H - - - - -N P W P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S HA I QQ L K G T - -N E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA I HQ L K V N - -G E - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E A R S R Q I N E L R K T H - S P - - - - -N P Y P HK F
HG G S R R Q E A QR L G R I L R P K A W F Y S I I S T DT E I NY A A HR T A F L V DQ -G Y T C R I Y F E S R L A MV K E MG L L - -G - - - - - -A A Y P HK Y
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I HQ L K V N - -G E - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S A K R QA F L V DQ -G Y A F K V Y F E I R S R E V NG M L E N P S G P - - - - -N P Y P HK F
G T G S K R A Y V QR L G R I L R K K A V L Y E I I A G E T E T G T A R R R K E A L S S G -K R T S K A F DD S K L A K L NG I I S Q - -G L - - - - -D P Y P Y R F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA I QQ L K I S - -G E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA V QQ L K V T - -G E - - - - -D P Y P HK F
H F G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S S K R QA F L V DQ -G Y A F K V Y Y E I R T R QV N E L L K N P - E T - - - - -N P Y P HK F
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Q - - - - - - -Y Y E NR L K A L D S L K A T - -G V - - - - -N P Y P HK F
HA G S R R Q E A QR L G R I L R P K A F F Y S L V S T DT E MY Y S T K R QQ F L I QQ -G Y A F K V Y T QNR I NK V L S A K A K - -G E - - - - - S P Y P HK Y
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA I HQ L K V N - -G E - - - - -D P Y P HK F
L G G S R R QK V QR L G R V MR P K A F F Y S L A S K DT E S E Y S Y K R QK Y I T E Q L G L NT E L F H E NR S K QV L A L K QT K -D P - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E I R S R Q I N E L R E S N -N E N P T A F N P Y P HK F
N F A S R R Q E A QR L G R I I R P K S F F Y S L V S K DT E MC Y S DK R QR F L I NQ -G Y A Y NV Y F E NR S K F I QDQK DK - -G I - - - - -N P Y P HK F
N F A S R R Q E A QR L G R I I R P K S F F Y S L V S K DT E MC Y S DK R QR F L I NQ -G Y A Y NV Y Y E NR S K F V Q E QK A K - -G I - - - - -N P Y P HK F
N F A S R R Q E A QR L G R I I R P K S F F Y S L V S K DT E MC Y S DK R QR F L I NQ -G Y A Y NV Y F E NR S K L I L S QQ E K - -G I - - - - -NT Y P HK F
HA G S R R Q E A QR L G R I L R A K A F F Y S L V S T DT E MY Y S T K R QQ F L I DQ -G Y S F K V Y Y E NR L K Y L DA QK G E - -G K - - - - -NMY P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA V QQ L K V S - -G E - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E T R S R Q I Q E L R K T H - E P - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S S K R QA F L I DQ -G Y A F K V Y F E NR S R T I M E L R QT K -D P - - - - -N P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E M F Y S L K R QR F L V NQ -G Y S F K T Y F K I R S QA V E A L K A A - -G D - - - - -H P Y P HK Y
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Y F K I R S QA I E E L K G A - -G E - - - - -D P Y P HK F
H F K S R R Q E V QR L G R I MR A K A F WY T L V S K G T E T S Y C L A R QK C L I NQ -G F K Y E I Y Y E NR C K A V QD L MT T G -K P - - - - -Y P Y P HK F
N F A S R R Q E A QR L G R I L R P K A F F Y S L V S K DT E MV F A DK R QQ F I I DQ -G Y A Y NV - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
N F A S R R Q E A QR L G R I L R P K A F F Y S L V S K DT E MV F A DK R QQ F I I DQ -G Y A Y NV Y Y L NR L E T V E QWR K N - -G - - - - - -T A Y P HK F
NY G A R MQ E S QR L G R V L R P K A F F Y S C I S DMT D L K Y S A R R QQ F L V DQ -G Y V Y E P Y HDR R L A E V T K QV E A H -R K D L S L P S P Y P HK F
L G A S R R Q E A QR L G R I L R P K S Y F Y T L V S QDT E I S Q S Y E R Q S W L R DQ -G F S Y R V Y Y DT R L A MV K E MG P L - -G - - - - - -A A Y P HK F
L G A S R R Q E A QR L G R I L R P K S Y F Y T L V S QDT E V QQ S Y G R Q S W L R DQ -G F A Y R V Y F DT R L A MV K E L G L L - -G - - - - - -A A Y P HK F
H F G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S T K R QQ F L I DQ -G Y A F R V Y Y E R R F R T I S A L R E S K -N P - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E I R S R Q I DA L R Q S K -T P - - - - -N P Y P HK F
[Alignment ruler: columns 2500 to 2570]
S V T V S L G E F I E R Y - - - S G - L QDG E T L D -DV T V S V A G R V HA I R E S G V K L I F F D L R - - - - - -G E G L K L QV F F E E T A R L R R G D I I G
A V S M S I P K Y I E T Y - - -G S - L NNG DHV E -NA E E S L A G R I M S K R S S S S K L F F Y D L H - - - - - -G DD F K V QV F L K L H S NA K R G D I V G
NV T I G L P A F L NK Y - - -A H - L QR G E T L P - E E R V S I A G R I HA K R E S G S K L R F Y V L H - - - - - -A DG V E V QV Y E QDHG L L K R G D I V G
QV T DD L R E Y L K T Y - - -DG - L A K G E QK P -DV T V R I A G R I Y T K R S S G S K L F F Y D I R - - - - - -A E G V K V QV F E A QH E H L R R G D I V G
QV T DD L R K Y L T DY - - - E G - L A K G E QK P - E V A V R I A G R I Y T K R A S G A K L I F Y D I R - - - - - -A E G V K V QV F E A QH E H L R R G D I V G
----------------------------------------------------------------------------------
HV D I S L T H F I Q E Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I R I NNK L R R G D I I G
NV T I S L T D F I A K Y - - - S P - L QN E Q -V A -D E I V S V A G R I H S K R E S G S K L V F Y D I H - - - - - -G E G T H I Q I F V T L HDR I K R G D I V G
NV T I S L T D F I T K Y - - -T P - L E K E Q -V V - E E I V S V A G R I H S K R E S G S K L V F Y D I H - - - - - -G E G T H I Q I F V T L HDR I K R G D I V G
NV T T K I P E F V E K Y - - -A H - L QR G E T L K -DV T V S V S G R I MT K R E S G S K L K F Y V L K - - - - - -G DG V E V Q I F E S MH E I L R R G D I I G
QV S I S N P E F L A K Y - - -A H - L K R G E T L P -N E I V S I A G R I HA K R E S G S K L K F Y V L H - - - - - -G DG V E V QV Y E NDHD L I K R G D I V G
HV D I S L T H F I E E Y - - -G H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I H I NNK L R R G D I I G
HV S I S L S DY V E K Y - - -NN - I E V G S H L N -DQQV S I A G R I HA K R E A G P K L I F Y DV R - - - - - -G DG V K L QV Y Q E I N E R T R R G D I I G
NV T HA V P K F V E E WG K E G K - L E K G E T A Q L N E P I S L A G R V Y T I R E S S S K L R F Y D L K - - - - - -A DG V K V Q I Y L DT HDR I R R G D I I G
K I S M S L P A Y A L K Y - - -G N -V E NG Y I DK -DT T L S L S G R V T S I R S S S S K L I F Y D I F - - - - - -C E E QK V Q I F S V S H S E I R R G DV V G
K I S M S L P A Y A L K Y - - -G N -V E NG Y I DK -DT T L S L S G R V T S I R S S S S K L I F Y D I F - - - - - -C E E QK V Q I F S V S H S E I R R G DV V G
HV D L S L T E F I E R Y - - -NH - L Q P G DH L T -DV V L N L S G R V HA K R A S G A K L L F Y D L R - - - - - -G E G V K L QV F V H I NNK L R R G D I I G
HV S I Q L P A F A E K Y - - -K D - L K K G E S L K -DV E V K V S G R I MG K R E S G S K L K F Y V L K - - - - - -G DG V Q I Q I Y E K MH E Y L R R G D I I G
E V S HQ L P K F V E E F - - - S V - L E K DG E P S -T QV V S I A G R V L S K R A A G S G L V F Y D I T - - - - - -G E F NK V QV Y V K I NG L L R R G D I I G
HV S S S L E D F I A K Y - - E N S - L K E G E T L E -NV K L S V A G R V HA I R E S G A K L I F Y D L R - - - - - -G E G V K V QV F E I DT S K L R R G D I I G
NV S I S L E N F I E QY - - - S G - L T DG E T L E -K V S L S V A G R V HA I R E S G A K L I F Y D L R - - - - - -G E G V K L QV F E T DT A K L R R G D I I G
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M S MR L H - - - - - -A R F C F F V V T - - - - - - S NG E S L QV R E QMA K F L R R G DV V G
NV S I T V P E F I A K Y - - - S G - L E K S Q -V S -DD I V S V A G R V L S K R S S - S A L M F I D L H - - - - - -D S QT K L Q I F V S L T K M I Y R G D I C G
HV D L S L S D F I E R Y - - - S H - L Q P G DH L T -D I T V S V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV Y F R I NNK L R R G D I I G
HV D I S L T D F I QK Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I H I NNK L R R G D I I G
QV S V T L P E F L S K Y - - -A N - L K R G E T L P - E E K V S I A G R I HA K R E S G S K L K F Y V L H - - - - - -G DG V E V QV Y E DDH S L I K R G D I V G
HR QY T I P QY R R K Y - - -A P L L T E P DT S L -D E T V T I A G R I I NK R S S G S K L H F I T I Q - - - - - -G DM E I V QV F A E I H S K L K R G D I I G
L V DY D P S Q F DK D F - - -K H - L K S G DV DK -T R E I R I A G R I F T K R S S G NK L I F Y D I K T G S DT T T T G S K MQ I F E QQH E H L G R G DV I G
E K NG D I C E I L V K F - - - E D - F E K N E G L S - - - -V R T A G R L Y N I R K HG -K M I F A D L G - - - - - -DQT G R I QV F A T F K N L MD S G D I I G
HV DT S L T H F I E QY - - -NN - L Q P G DH L T -D I T V R V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F F P I NNK L P R G D I F G
HV D I S L T Q F I Q E Y - - - S H - L Q P G DH L T -DV T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F V H I NNK L R R G D I I G
QV NY DD S N F V E E F - - -G S - L K T G E T L P - E K E L R I A G R I Y N I R T A G S K L I F Y D I R T S A DT K S I G T R MQV F E K QHA H L R R G D I I G
L A N I T V A DY I E K Y - - -K S -MNV G DK L V -DV T E C L A G R I MT K R A Q S S K L L F Y D L Y - - - - - -G G G E K V QV F I K F H S T L K R G D I V G
HV DT R V G E F I E K Y - - - S G - L A DG T T A E -G E S A S V A G R I M S K R A S G K K L Y F Y D L I - - - - - -A DG K K I QV F QK I H S A T R R G D I V G
QV D L T I A Q F R DK Y - - -G P L C T E K G K I H - E D F V S V A G R V V T I R S MG A K L M F Y D L Q - - - - - -G E G T K I QV F E K V HT L I K R G D I I G
HR N I T L P E F A E K Y - - - S S - L T R G E T L Q -DV E V K V T G R I MT K R E S G A K L R F Y V L K - - - - - -G DG V E V Q I Y E K MH E Y L R R G DV I G
[Figure: multiple sequence alignment, ruler positions 2580 to 2900. Sequence labels shown: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264.]
F E L F V A R K E I C NA Y T E L ND P MV QR E R F A T QA K DHA A G DD E A Q L I D E N F C T A L E Y G L P P T G G F G L G I DR L A M F
F E L F V MK K E I C N S Y T E L ND S V R QR E L F E QQA K A K A E G DD E A M F I D E T F C T A L E Y G L P P T A G WG MG I DR L C M F
F E L F V NY Y E L C NA F T E L ND P F K QR K I F V QQ I E E K NK G DV E A MG Y DK D F C DC L E HA L P P T G G WG L G I DR L V M L
F E L F I C G K E L I N S Y T E L ND P I T QR E C F K QQQK A K D L G DD E A Q P P D E A F C T A L E Y G L P P T A G WG I G I DR L A M F
F E L F I C G K E L I N S Y T E L ND P I T QR DC F K QQQK A K D L G DD E A Q P P D E A F C T A L E Y G L P P T A G WG I G I DR L T M F
F E V F I NG L E Y A NA Y T E L NC P MV QR E L F L DQ L K A K A A K DD E A M P Y DDT F C T A L E Y A L P P T A G WG C G V DR L V M L
F E L F I NK K E I C N S Y T E L N S P L V QR E E F E R Q L R DR E K G DD E A MD I D E G Y V QA L E Y A L P P T G G WG L G I DR L V MY
F E L F L NK K E L C NA Y T E L NN P I V QR E E F MK Q L R NK E K G DD E A MD I D E G F V QA L E HA L P P T G G WG L G I DR L V M F
F E V F V A T K E I C NA Y T E L ND P WV QR A N F E E Q S R QK DQG DD E A QG I DHV F I DA L E HG L P P T G G WG L G I DR L V M F
F E V F V A T K E I C NA Y T E L ND P F DQR QR F E E QA R QK DA G DD E A Q L V D E T F C T A L E Y G L P P T A G WG C G V DR L T M F
[Alignment ruler: column 2910]
A_fumigatus/1-3241
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
C_parvum/1-3281
Danio_rerio/1-3286
Homo_sapiens/1-3286
Mus_musculus/1-3286
Oryza_sativa/1-3218
P_falciparum/1-3285
P_knowlesi/1-3284
T_annulata/1-3248
T_brucei/1-3264
L T D S NN I
L T D S LN I
L T D S NT I
L T DNY S I
L T DNY S I
L T D S NN I
L T D S NN I
L T DNNN I
L T DNNN I
L T D S NT I
L T D S NT I
L T D S NN I
L T D S NN I
L T DC S N I
L A DK NN I
L A DK NN I
L T D S NN I
L T D S NT I
M S DT Y N I
L T D S NN I
L T D S NN I
L T DA A N I
L A DV DN I
L T D S SN I
L T D S NN I
L A D S NT I
L T S Q SN I
L T D S NN I
L T NNA T I
L A G L E S I
L T D S NN I
L T D S NN I
L T DNY S I
L T D S QN I
L A DK NN I
L T D S NN I
L T DT QN I
L T D S NT I
L T NK N S I
L T NK NC I
L T NK N S I
L T D S QN I
L T D S NN I
L T D S NT I
L T D S NT I
L T D S NN I
L T D S NN I
L T DN I Y I
L S DK NN I
L A DK NN I
L T NQV S I
L T S QNN I
L T S QA N I
L T D S N S I
L T N S NT I
[Alignment ruler: columns 2920-2980]
K E V L L F P A MK P R QA A E A HR QT R QY MQ -K W I K P G MT M I E I C E E L E NT A R G L A F P T G C S R NHC A A HY T P NA G D - P T V L
K E V L F F P A MR P R R A A E V HR QV R K Y V R - S I V K P G M L MT D I C E T L E NT V R G I A F P T G C S L NWV A A HWT P N S G D -K T V L
R E V L L F P T L K P R K G A E I HR R V R R H L Q -NR L R P G QT L T E V V E L V E NA T R G I G F P T G V S L NHC A A H F T P NA G D -T T V L
K E V L A F P F MK E R QA A E V HR QV R QY A Q -K T I K P G QT L T E I A E G I E DA V R G MG F P C G L S I NHC A A HY T P NA G N -K MV L
K E V L A F P F MK E R QA A E V HR QV R QY A Q -K T I K P G QT L T E I A E G I E E S V R G MG F P C G L S I NHC A A HY T P NA G N -K MV L
K E V L L F P A MK P R HA A E A HR QT R K H I R -NW I K P G MT M I D I C E E L E K T A R G L A F P T G C S R NHC A A HY T P NT G D -T T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K M E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
K E V L L F P A MR P R R S A E A HR QV R QY V K - S W I K P G M S M I E I C E R L E T T S R G L A F P T G C S L NHC A A HY T P NA G D -T T V L
K E V L L F P A MR P R R S A E A HR QV R K Y V K - S W I K P G MT M I E I C E R L E T T S R G L A F P T G C S L NHC A A HY T P NA G D -T T V L
R E V L L F P T L K P R K G A E I HR R V R HK A Q - S S I R P G MT M I E I A N L I E D S V R G I G F P T G L S L NHV A A HY T P NT G D -K L I L
R E V L L F P T L K P R K G A E I HR R V R K NV Q -NK L K P G M L L T E V A D I I E NA T R G I G F P T G L S L NHC A A HY T P NT G D -K T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
K E V L L F P A MK P R QA A E T HR QV R HHV Q - E F I K P G L S M I E I C E R L E QA S R G L A F P T G C S L NNC A A HY T P NA G D -K T V L
K E V L L F P A MR P R R A G E V HR QV R A Y A Q -K A I K P G MT MT E I A N L I E DG T R G I G F P T G L S V N E V A A HY T P N P G D -K QV L
K E V I L F P A MR NR R A A E V HR QV R K Y MQ - S I I R P E MK L I DMC N I L E S K V K G WG F P T G C S L NHC A A HY T P N P HD - F T K L
K E V I L F P A MR NR R A A E V HR QV R K Y MQ - S I I R P E MK L I DMC N I L E S K V K G WG F P T G C S L NHC A A HY T P N P HD - F T K L
K E V L L F P A MK P R QA A E A HR QV R K Y V Q - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NHC A A HY T P NA G D - P T V L
R E V L L F P T L K P R K G A E I HR R V R HK A Q - S S I R P G MNMT E I A D L I E N S V R G I G F P T G L S L NHV A A HY T P NA G D -K T V L
K E V I L F P A MK P R R A A E V HR QV R K Y V Q -G I V K P G L G L T E L V E S L E NA S R G I A F P T G V S L NH I A A H F T P NT G D -K T V L
K E V L L F P A MK P R QA A E A HR QT R QY MQ -R Y I K P G MT M I Q I C E E L E NT A R G L A F P T G C S L NHC A A HY T P NA G D - P T V L
K E V L L F P A MK P R QA A E A HR QT R QY MQ -R F I K P G MT M I Q I C E E L E NT A R G L A F P T G C S L NHC A A HY T P NA G D - P T V L
R DV I F F P T MK P R R A A E A HR R A R Y R V Q - S I V R P G I T L L E I V R S I E D S T R G I G F P A G M S MN S C A A HY T V N P G E QD I V L
K E V I L F P T MR P R K A A A I HK S V R QWA Q -QW I K P G M S D L F V A E N I E R K V R G MA F P C G L S V N S C A A H F T P N P ND P L S F Y
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D - P T V L
R E V L L F P T L K P R K G A E I HR R V R E S V R -NK I K P G MT L T E I A N L V E DG T R G I A F P T G L S L NHC A A H F T P NA G D -K T V L
K E V L L F P A MK P R E A A E V HR QV R T WA Q - S W I K P G L S L M L MT DR I E K K L NG QA F P T G C S L NHV A A HY T P NT G D E K V V L
R E V L A F P F MR DR HG A E A HR QA R R WA H -K HV K P G M S L T D I A NG I E D S V R G MG F P T G L S I NHC A A HY T P NA G N -K MV L
K E V I L F P QMK R R E A G R I L K I V R T E A A -DM I R V G N S L L E V A E F V E K K T I - -A F P C N I S R NQ E A A HA T P K A G D -QDV F
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D - P T V L
R E V L A F P F L R E R HA A E V HR QV R QWA Q -K S I K P G QT L T E I A E N I E D S V R G MG F P T G L S I NHC A A HY T P NA G N -K MV L
K E V L L F P A MK P R R A A E V HR QV R K HMR - S I L K P G M L M I D L C E T L E NMV R G I A F P T G C S L NWV A A HWT P N S G D -K T V L
K E V L L F P A MK P R QC A E V HR E V R QY I S -DWV K P G MK Y I DV C E T L E N S V R G V A F P T G C S K NHV A A HWT P NG G C - E S V I
Q E V L L F P A MK P R K A A E C HR QV R QY A QA K L L K P G NK L I D I C E K L E DMNR G I A F P T G C S L N F C A A HY T P NNG D -NT I L
R E V L L F P T L K P R K G A E I HR R V R HK A Q - S S I R A G M S MT E I A D L I E N S V R G I G F P T G L S L NHV A A HY T P NT G D -K L S L
K DV I L F P T MR P R K A A E C HR QV R K HMQ -A F I K P G K K M I D I A Q E T E R K T K G WG F P T G C S L NHC A A HY T P NY G D - E T V L
K DV I L F P T MR P R K A A E C HR QV R K Y I Q -A Y V Q P G R K M I D I V K E T E K K T K G WG F P T G C S L NHC A A HY T P NY G D - E T V L
K DV I L F P T MK P R K A A E C HR QV R K Y I Q - S Y I K P G R K M I D I V QK T E QK T K G WG F P T G C S L NNC A A HY T P NY G D - E T V L
K E V L L F P A MK P R QA A E V HR QV R K Y MK - S I L K P G M L MMD L C E T L E NT V R G I A F P T G C S L NWV A A HWT P N S G D -K T V L
R E V L L F P T L K P R K G A E I HR R V R R A I K -DR I V P G MK L MD I A DM I E NT T R G I G F P T G L S L NHC A A H F T P NA G D -K T V L
R E V L L F P HMK P R R A A E V HR QA R QY A Q - S V I K P G M S MMDV V NT I E NT T R G I G F P T G V S L NHC A A HY T P NA G D -T T I L
K E V L F F P A MK P R QA A E A HR QV R K HV Q -G F I K P G MT M I D I C E R L E T A S R G L A F P T G C S R NHC A A HY T P NA G D -T T V L
K E V L L F P A MK P R QA A E A HR QV R A Y V R - S W I K P G MT M I D I C E K L E DC S R G L A F P T G C S I NHC A A HY T P NA G D - P T V L
Q E V L L F P A MK P R K A A E C HR QV R K Y C Q -Q L I R P G K K L I D I C E S I E E MNR G I A F P T G C S L NHV A A HY T P NNG D - F T T I
K V F I G V I I I V -R R A A E V HR QV R R Y I Q - S V I R P G V S C L D I V QA V E S K T K G WG F P T G C S L N S C A A HY T P NY G D -K T V F
K E V I F F P T MR P R K A A E V HR QA R R Y I Q - S V I K P G L S C L D I V QA L E F K T K G WG F P T G C S L N S C A A HY T P NHG D -K T I F
R E V L L F P L MK P R E G A E I HR R V R R WA M E NV I K P G V K L Y DMC A Q I E E A V R G L A F P C G C S I NNC A A HY T P MY NT DQR V L
K E V L L F P A MK P R C A A E V HR QV R R Y A Q - S F I K P G I S L L S MT DR I E K K L E G QA F P T G C S L NHV A A HY T P NT G D -K C V L
K E V L L F P A MK P R HA A E V HR QV R R Y A Q - S F I K P G I S L I S MT DR I E R K V E G QA F P T G C S L NHV A A HY T P NT G D -K T V L
K E V L A F P A NK P R R A A E V HR QV R QY A Q - S A I K P G MT MT E I A E L V E DG T R G I G F P T G V S V N E C A A HY T P NA G D -K R V L
K E V L L F P A MK P R K G A E I HR V V R K Y A R -DN I K A G MT MT S I A E M I E D S V R G QG F P T G V S L NHC A A HY T P NA G D -K I V L
[Alignment ruler: columns 2990-3060]
L Y DDV T K I D F G T H I K G R I I DC A F T L S F N P - -K Y DK L L E A V K E A T E T G I R E A G I DV R L C D I G A A I Q E V M E S Y E V E L DG K T Y QV K
QY DDV MK L D F G T H I DG H I I DC A F T V A F N P - -M F D P L L A A S R E A T Y T G I K E A G I DV R L C D I G A A I Q E V M E S Y E V E I NG K V F QV K
R H E DV MK V D F G V QV NG H I I D S A WT V T F D P - -R Y D P L L E A V R E A T Y T G I R E A G I DV R L T D I G E A I Q E V M E S Y E V T L G G QT Y QV R
QQG DV MK V D F G A H I NG R I V D S A F T MT F D P - -V Y D P L L E A V K DA T NT G I R E A G I DV R M S D I G A A I Q E A M E S Y E V E L NG T MY P V K
QQG DV MK V D F G A H I NG R I V D S A F T V A F D P - -V Y D P L L A A V K DA T NT G I R E A G I DV R M S D I G A A I Q E A M E S Y E V E I NG T MY P V K
E Y DDV V K I D F G T H I NG R I I DC A F T L H F N P - -R Y D P L V K G V Q E A T E A G I K A S G V DV R L C DV G A A V Q E V M E S H E V E L DG QMY - -
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
QY G DV C K I DY G I HV R G R L I D S A F T V H F D P - -K F D P L V E A V K E A T NA G I R E S G I DV R L C DV G E V V E E V MT S H E V E L E G K T Y V V K
QY G DV C K I DY G I HV R G R L I D S A F T V H F D P - -K F D P L V E A V R E A T NA G I K E S G I DV R L C DV G E I V E E V MT S H E V E L DG K S Y V V K
K K DD I MK V D I G V HV NG R I C D S A F T MT F N E DG K Y DT I MQA V K E A T Y T G I K E S G I DV R L ND I G A A I Q E V M E S Y E M E E NG K T Y P I K
K Y E DV MK V D I G V QV NG H I V D S A WT V S F D P - -QY DN L L A A V K DA T Y T G I K E A G I DV R L T D I G E A I Q E V M E S Y E V E I K G K T Y QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
S Y DDV C K I D F G T H I NG R I I DC A F T V S F N P - -K Y DR L L E A V K DA T NT G I K NA G I DV R L C DV G A A I Q E T M E S Y E V E I DG K T Y QV R
QQHDV MK V D F G V HV NG R I V D S A F T M S F E P - -T WDK L L E A V K DA T NT G I R E A G I DV R MC D I G E A I Q E V M E S Y E V E V NG K V Y P V K
T QDD I C K L D F G V QV NG M I I DC A F T V A F ND - -V F D P L I Q S T L DA T NT G L K V A G I DV M F S E I G S A I E E V I K S Y E F E Y K S K V Y N I K
T QDD I C K L D F G V QV NG M I I DC A F T V A F ND - -V F D P L I Q S T L DA T NT G L K V A G I DV M F S E I G S A I E E V I K S Y E F E Y K S K V Y N I K
QY DDV C K I D F G T H I NG R I I DC A F T V T F N P - -K Y DK L L E A V K DA T NT G I K C A G I DV R L C D I G E S I Q E V M E S Y E V D L DG K T Y QV K
NY E DV MK V D I G V HV NG H I V D S A F T L T F DD - -K Y D S L L K A V K E A T NT G V K E A G I DV R L ND I G E A I Q E V M E S Y E M E L NG K T Y P I K
K K DDV L K I D F G T HV NG Y I I DC A F T V T F D E - -K Y DK L K DA V R E A T NT G I Y HA G I DA R L G E I G A A I Q E V M E S H E I E L NG K T Y P I R
QY DDV C K I D F G T H I K G R I I DC A F T L T F NN - -K Y DK L L QA V K E A T NT G I R E A G I DV R L C D I G A A I Q E V M E S Y E I E L DG K T Y P I K
QY DDV C K I D F G T H I K G R I I DC A F T L T F NN - -K Y DK L L QA V K E A T NT G I K E A G I DV R L C D I G A A I Q E V M E S Y E V E L DG K T Y P I K
K E DDV L K I D F G T H S DG R I MD S A F T V A F K E - -N L E P L L V A A R E G T E T G I K S L G V DV R V C D I G R D I N E V I S S Y E V E I G G R MW P I R
K T DDV V K I D F G V HV NG H L I D S A F T MT WD P - -A L Q P I L DC S K DA T NT G I K N I G V DV R L C D I G DA I E E V M S S Y E V E I K G K T Y Q L Q
HY DD I C K I D F G T Y Y S G R I I DC A F T V T F N P - -K Y DR L L E A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
K F E DV MK V D F G V HV NG Y I I D S A F T I A F D P - -QY DN L L A A V K DA T NT G I K E A G I DV R L T D I G E A I Q E V M E S Y E V E I NG E T HQV K
T Y DDV MK V D F G T H I NG R I I DC A WT V A F N P - -M F D P L L QA V K E A T Y E G I K QA G I DV R L G D I G A A I E E V M E S H E V E I NG K V HQV K
E HDDV L K V D I G V HV NG R I V D S A F T V A F N P - -R Y DN L L A A V K DA T NT G I R E A G I DA R L G E I G E A I Q E T M E S Y E V E I DG E T Y P V K
G -NDMV K L D L G V HV DG Y I A D S A V T V D L S G - -N S D - I V K A S E E A L A A A I D L MK P G V S T G E I G A A I E E R I H S - - - - - - - - -Y G L K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y D I L L T A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
Q E DDV MK V D F G V HV NG R I V D S A F T V A F N P - -R Y D P L L E A V K A A T NA G I K E A G I DV R V G D I G A A I Q E V M E S Y E V E I NG QM L P V K
QY DDV MK L D F G T H I DG Y I V DC A F T V A F N P - -M F D S L L QA S K DA T NT G V K E A G I DA R L C DV G A A I Q E V M E S Y E V E I NG K V F Q I K
DK DDV I K F D F G V QV K G R I I DC A F T K T F ND - -MY D P L L K A V N E A T E T G I R S A G I DV R L C D I G E A V Q E V M E S HT V E I HG K E Y QV K
T Y DDV C K I D F G T QV DG W I I DC A F T V A F N P - -V Y DT L L QA A K DA T DT G I R N S G I DV R L G DV G A A I Q E T M E S Y E V E I G G K V Y K V K
G K DD L MK V D I G V HV NG H I C D S A F T MT L NDT G K Y D S I MK A V K DA T NT G V K E A G I DV R L ND I G E A I Q E V M E S Y E M E L DG K T Y P V K
K Y DDV C K L D F G V HV NG Y I I DC A F T I A F N E - -K Y DN L I K A T QDG T NT G I K E A G I DA R MC D I G E A I Q E A I E S Y E I E L NQK I Y P I K
K Y DDV C K L D F G V HV NG Y I I DC A F T I A F N E - -K Y DN L I K A T QDG T NT G I R E A G I DA R MC D I G E A I Q E A I E S Y E I E L NK K I Y P I K
K E DDV C K L D F G V HV NG Y I I DC A F T I A F ND - -K Y DN L I K A T QDG T NT G I K E A G I DA R MC D I G E A I Q E A I E S Y E I E L NQK V Y P I K
QY DDV MK L D F G T H I DG H I V DC A F T V A F N P - -M F D P L L E A S R E A T NT G I K E S G I DV R L C DV G A A I Q E V M E S Y E V E I NG K V F QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y D I L L K A V K DA T NT G I K C A G I D I R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
K Y E DV MK V DY G V QV NG N I I D S A F T V S F D P - -QY DN L L A A V K DA T Y T G I K E A G I DV R L T D I G E A I Q E V M E S Y E V E I NG E T Y QV K
K E K DV MK V D I G V HV NG R I V D S A F T M S F D P - -QY DN L L A A V K A A T NK G I E E A G I DA R L N E I G E A I Q E V M E S Y E V E I NG K T HQV K
E Y DDV C K I D F G T H I NG R I I DC A F T V T F N P - -K Y DQ L L A A V K DA T NT G I K E A G I DV R L C DV G E R I Q E V M E S Y E V E L DG K T Y QV K
R Y DDV C K I D F G T H I NG R I I DC A F T V T F N P - -K F DG L L E A V R DA T NT G I K F A G I DV R L C DV G E T I Q E V M E S Y E V E I DG K T Y QV K
E Y DDV C K I D F G T QV E G R I I DC A F T V A F N P - -K Y DK L L E A V K E A T NT G I K E A G I DV R I P DV G A A I Q E V M E S Y E V E I E G K T Y P V K
E K DD I MK L D F G T HV NG Y I I D S A F T I A F D E - -K Y D P L I E S T K E A T NT G L K L A G I DA R T S E L G E A I E E V I E S F E I T L K NR T HK I K
HK NDV MK L D F G T HV NG Y I I D S A F T I A F D E - -K Y D P L I E S T K E A T NT G V K L A G I DA R T S E L G E A I Q E V I E S Y E I T L K NK T HK I K
G K S DV MK I D F G V A I NG N I I D S A F T V C F D P - -K F E P L L E A A K T A T NT G V K I A G I DA R MN E I G DA I Q E V F DA S S I D I DG K HY D I K
MY DDV MK V D F G T Q I NG R I I DC A WT V A F K D - - E Y E P L L T A V K E A T Y E G V K QA G I DV R L C DV G A A I Q E V M E S Y E V E L NG K V Y P V K
T Y DDV MK V D F G T Q I NG R I V DC A WT V A F ND - - E Y A P L L E A V K S A T Y E G I K QA G I DV R L C D I G E A I Q E V M E S Y E V E I K G K V Y P V K
QA T DV L K V D F G V HV K G R I V D S A F T L N F E P - -T WD P L L A A V K A A T NA G I K E A G I DA R L G E I G A S I Q E V M E S H E F E A NG K T HR V K
K E DDV L K V D F G V HV NG K I I D S A F T HV QND - -K WQG L L DA V K A A T E T G I R E A G I DV R L G D I G E A I Q E T M E S H E V E V DG K V Y QV K
[Alignment ruler: columns 3080-3090]
A I R N L NG H S I S P Y R I HA - - - - - -G K T V
S I R N L NG H S I G P Y Q I HA - - - - - -G K S V
P C R N L C G HN I V P Y Q I HG - - - - - -G K S V
C I R N L NG HN I DR H I I HG - - - - - -G K S V
C I R N L NG HN I DQH I I HG - - - - - -G K S V
- - - - - - - - - - - - L E I HA - - - - - -G K T V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P I R N L NG H S I A QY R I HA - - - - - -G K T V
P I R N L NG H S I A QY R I HA - - - - - -G K T V
C I K N L NG HN I DD F V I H S - - - - - -G K S V
P C R N L C G H S I G P Y T I HA - - - - - -G K S V
P I R N L S G H S I G QY R I HA - - - - - -G K T V
S I S N L NG H S I T P Y T I HG G I G T R P G K S V
P I K N L NG H S I L P Y H I HG - - - - - -G K S V
P I K N L NG H S I L P Y H I HG - - - - - -G K S V
P I R N L NG H S I G QY R I HA - - - - - -G K T V
C I R N L NG HN I G DY L I H S - - - - - -G K T V
S I R N L NG H S I R P Y V I HG - - - - - -G K T V
P I S D L HG H S I S Q F R I HG - - - - - -G I S I
P V R N L S G HMV G S Y A V HA - - - - - -G K S I
P C R N L C G HN I N P Y S I HG - - - - - -G K S V
S I R N L S G HN I A P Y I I H S - - - - - -G K S V
P I R N L NG HT I DR Y T I HG - - - - - -G K S V
P I T N L T G HG L S HY E A HD - - - - - -N P P V
S I R N L NG HT I NHY S I HG - - - - - -T K S V
S V R N L NG H S I G P Y Q I HA - - - - - -G K S V
C C S N L NG H S I D P Y R I HA - - - - - -G K S V
S V K N L NG H L I C K Y H I HG - - - - - -G K S V
C I K N L NG HN I G DY I I H S - - - - - -G K T V
A I S N L R G H S I NK Y I I HG - - - - - -G K C V
A I S N L R G H S I NK Y I I HG - - - - - -G K C V
P I S N L R G H S I C K Y V I HG - - - - - -G K C V
S I R N L NG H S I G P Y Q I HA - - - - - -G K S V
P C R N L C G H S I A P Y R I HG - - - - - -G K S V
S I R N L C G HN L D P Y I I HG - - - - - -G K S V
P I R N L NG H S I G QY R I H S - - - - - -G K T V
A I R N L NG H S I E A Y Q I HA - - - - - -G K S V
P I R N L T G HN I G QY I I HA - - - - - -G K A V
P I R N L T G HN I G QY V I HA - - - - - -G K A V
P I S N L S G H S L G P Y T V HA - - - - - -G K S I
S I R N L S G HT I A P Y V I HG - - - - - -G K S V
S I R N L C G HN I G P Y V I H S - - - - - -G K S V
C V E N L NG H S I E R Y S I HG - - - - - -G K S V
S I R N L NG HN I A P Y E I HG - - - - - -G K S V
[Alignment ruler: columns 3100-3120]
P I V K - - - -G G E -T T R -M E E N E F Y A I
P I V K - - - -G G E -QT K -M E E G E F Y A I
P I V K - - - -NG D - E T K -M E E G E H F A I
P I V K - - - -G S D -QT K -M E E G E T F A I
P I V K - - - -G G D -QT K -M E E G E V F A I
P I V K - - - -G G E -T T R -M E E N E I Y A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -G G E -QT K -M E E N E I Y A I
P I V K - - - -G G E -QT K -M E E N E I Y A I
P I I A - - - -NG D -MT K -M E E G E T F A I
P I V K - - - -NG D -QT K -M E E G E H F A I
P I V K - - - -G G D -QT R -M E E G E V F A I
P I V K QHG S DK D - E T K -M E E G E Y F A I
P I I A - - - -T ND -DT R -M E E N E I Y A I
P I I A - - - -T ND -DT R -M E E N E I Y A I
P I V P - - - -NG D -MT K -M E E G E T F A I
P I V R - - - -G G E -MT K -M E E G E F Y A I
P I V K - - - -G G E - S T R -M E E D E F Y A I
P I V K - - - -G G E - S T R -M E E D E F Y A I
P A V N - - - -NR D -T T R - I K G D S F Y A V
P I C K - - - -G G P -QT K -M E E G E V Y A L
P I V K - - - -NG D -NT K -M E E N E H F A I
P I V K - - - -G G E -QA K -M E E G E V F A I
P I V K - - - - S A D -QT K -M E E G E I Y A I
P NK H - - - -V E G -G V I - L K E G DV L A I
P I V K - - - - S ND -QT K -M E E G DV F A I
P I V K - - - -G G E -QT K -M E E G E F Y A I
P I V K - - - -G G V -QT K -M E E G E Y Y A I
P I V K - - - - S ND -NT L -MK E G E L Y A I
P I V A - - - -NG D -MT K -M E E G E T F A I
P I V R - - - -QK E -K N E I M E E G E L F A I
P I V K - - - -QK E - E N E I M E E G E L F A I
P I V K - - - -QQ E -K H E I M E E G D L F A I
P I V K - - - -G G E -QT K -M E E G E F F A I
P I V K - - - -NG D -T T K -M E E G E H F A I
P I V K - - - -G G E - E I K -M E E G E I F A I
P I V K - - - -G G E -A T R -M E E N E F Y A I
P I V K - - - -G G E -A T R -M E E G DV Y A I
P C I R - - - -T G P -NV K -M E E G E QY A I
P I V G - - - -K S G -NR D I M E E G DV F A I
P I V G - - - -NT N -NR D I M E E G E V F A I
P I T K - - - -G G N -A E K -M E E G E L F A C
P I V R - - - -G G E -A T R -M E E G E L F A I
P I V R - - - -G G E -A I K -M E E G E L F A I
P I V N - - - -M P D L QV K -M E E G E Y Y A I
P I V K - - - - S A D -MT K -M E E G E T F A I
[Alignment ruler: column 3130]
E T F G S -T G R G L V
E T F G S -T G K G Y V
E T F G T -T G R G Y V
E T F G S -T G R G QV
E T F G S -T G K G V V
E T F G S -T G NG Y V
E T F G S -T G R G R V
E T F A T -T G R G Y V
E T F A T -T G R G Y V
E T F G S -T G K G MV
E T F G S -T G R A QV
E T F A T -T G K G S I
E T F A T -T G R G R V
E T F G S -T G R G Y V
E T F G S -T G R G F V
E T F G S -T G L G Y V
E P F A T -NG T G L V
E T F G S -T G K G F V
E T F A S -T G K G Y V
E T F A S -T G K G F V
E T F G S -T G R G V V
E T F G S -T G K G F V
E T F G S -T G R G A V
E T F G V I NG K A S I
E T F A T -T G S G T V
E T F A T -T G S G MV
E T F G S -T G K G I V
E T F G S -T G R G F V
E T F G S -T G R G V V
[Alignment ruler: columns 3140-3150]
S HY MK D F DA P - - -K V P L R L
S HY MK N F DA G - - -HV P L R L
S HY A K N P G A L - - - - P A P T L
S HY A L I P DA P - - - S V P L R L
S HY A L I P DH S - - -QV P L R L
S HY MK N F DQQ - - - F V P L R L
S HY MK N F DV G - - -HV P I R L
S HY MK N F E L A D - E K I P L R L
S HY MK N F E L A D - E K I P L R L
S HY A MNK G V E - - -H L K P P S
S HY A R L P S DG - - - L P Q P N L
S HY MK N F D L A N -QHV P L R L
S HY A L N S A A P - - E K Y QG HH
S HY MK Y Y DN P F L N E N S T R L
S HY MK Y Y DN P F L N E N S T R L
S HY MK N F E V G - - -HV P I R L
S HY A K N P G T D - - -D I V V P G
S HY MK T DY - - - - -QT T V R L
S HY MK N F D L P - - - F V P L R L
S HY MK N F D L P - - -Y V P L R L
S H F V L NT Y K S - - - - - - -R K
S HY MV DA NA F - - -DY P V R D
S HY A K K P G S H - - - - P T P S L
S HY MMQ P G A E - - -V MQ L R S
S HY A K R A DA P - - -NV A L R L
E I Y S L I K - - - - - -K K P V R L
S HY A K R G DA A - - -K V D L R L
S HY MK N F DV G - - -HV P L R V
S HY MK N F DV G - - -HV P L R L
S HY MK D F Y A K - - - P T A V R V
S HY S R NQN I D - - -G I R V P S
S HY MR N P E K Q - - - F V P I R L
S HY MR N P DK Q - - - F V P I R L
S HY MR NR DV Q - - -Y A P I R L
S HY MK N F DV G - - -H I P L R L
S HY MK N F DV E - - -HV P I R L
S HY A R S A E DH - - -QV M P T L
S HY A K I P DA G - - -H I P L R L
S HY MK N F E V G - - -HV P L R M
S HY MK N F NV G - - -HV P I R L
S HY MK D F NK E - - -MV P L R Q
S HY MK N P N S I - - -Y A P I R L
S HY MK N P N S I - - -Y A P I R L
S H F MV A R N P P - - - - -T P R T
S HY MMV P G G E - - -K T QV R S
S HY MMV P G G D - - -K T Q L R S
S HY A R K K N L P - -K S I P I R V
S HY A K NV G V G - - -HV P L R V
[Alignment ruler: columns 3160-3230]
Q S S K S L L G L - - - I NR N F G T L A F C K R W L DR A G A T K - -Y QMA L K D L C DK G I V E A Y P P L C DV K -G S Y T A QY E HT I M L R P T C K - E V V
P R A K Q L L A T - - - I NK N F S T L A F C R R Y L DR I G E T K - -Y L MA L K N L C D S G I V Q P Y P P L C DV K -G S Y V S Q F E HT I L L R P T C K - E V L
S R A K A L L R T - - - I DA N F G T L P WC R R Y L DR L G E DK - -Y M F A L NH L V K QG I V QDY P P L V DV E -G S Y T A Q F E HT I L L H P HK K - E V V
S S A K N L L NV - - - I NK N F G T L P F C R R Y L DR L G Q E K - -Y L L G L NN L V S S G I V QDY P P L C DV K -G S Y T A Q F E HT I L L R P T V K - E V I
S S A K N L L NV - - - I NK N F G T L P F C R R Y L DR L G QDK - -Y L L G L NN L V S S G I V QDY P P L C D I K -G S Y T A QY E HT I V L R P NV K - E V I
Q S S K Q L L NV - - - I NK N F G T L A F C K R W L E R A G A S R - -Y A MA L K D L C DK G V V DA Y P P L C D I K -G C Y T A Q F E HT I L L R P T C K - E V V
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
QK S K G L L S L - - - I DK N F S T L A F C R R W - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
QK S K G L L N L - - - I DK N F A T L A F C R R W I DR L G E T K - -Y L MA L K D L C DK G I V D P Y P P L C DV K -G C Y T A QW E HT I L MR P T V K - E V V
E R S K Q L L E T - - - I K QN F G T L P WC R R Y L E R T G E E K - -Y L F A L NQ L V R HG I V E E Y P P I V DK R -G S Y T A Q F E HT I L L H P HK K - E V V
A S A K Q L L K V - - - I DDH F G T L P WC R R Y L DR L G QDK - -Y L F A L N S L V K QG HV QDY P P L NDV I -G S Y T A QY E HT I L L H P HK K - E V V
P R A K R L L HV - - - I N E N F G T L A F C R R W I DR I G E T K - -Y L MA L K N L C D S G V V DA Y P P L C D I K -G C Y T A Q F E HT I L MR P NC K - E V V
Q S A K S L L A S - - -V K R N F G T L P F C R R Y L DHV G E K N - -Y L L A L NT L V R E D F I A DY P P L V D P Q P G A MT A Q F E HT I L L R P T C K - E V V
N S A K I L L G G - - - I NT H F G T L A F C R R W L DQ L G F NK - -HA L A L K S L V D S E I I R P Y P P L ND I P -G S F S S QM E HT I L L R P S C K - E V V
N S A K I L L G G - - - I NT H F G T L A F C R R W L DQ L G F NK - -HA L A L K S L V D S E I I R P Y P P L ND I P -G S F S S QM E HT I L L R P S C K - E V V
P R A K H L L NV - - -V N E N F G T L A F C R R W L DR L G E T K - -Y L MA L K N L C D L G I I D P Y P P L C DT K -G C Y T A Q F E HT I L L R P T C K - E V V
DK A K S L L NV - - - I N E N F G T L P WC R R Y L DR L G QDK - -Y L L A L NQ L V R A G I V QDY P P I V D I K -G S Y T A Q F E HT I L L H P HK K - E V V
P K A K Q L L QY - - - I NK NY DT L C F C R R W L DR A G E DK - -H I L A L NN L C D L G I I QR HA P L V D S K -G S Y V A QY E HT L L L K P T A K - E V L
Q S S K Q L L G T - - - I NK N F G T L A F C K R W L DR A G A T K - -Y QMA L K D L C DK G I V E A Y P P L C D I K -G C Y T A QY E HT I M L R P T C K - E V V
Q S S K Q L L G T - - - I NK N F G T L A F C K R W L DR A G A T K - -Y QMA L K D L C DK G I V E A Y P P L C D I K -G C Y T A QY E HT I M L R P T C K - E I V
L F NK D L I K V Y E F V K D S L G T L P F S P R H L DY Y G L V K G G S L K S V N L L T MMG L L T P Y P P L ND I D -G C K V A Q F E HT V Y L S E HG K - E V L
G NA K R L L HA - - - L DA N F K T L A F C R R Y V DK I G F A K - -WQM P F K F L V DDG C V NA Y P P L S DC H -G S Y V A Q F E HT I Y L K P T C K - E V L
P R A K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
S S A K N L L K V - - - I D E N F G T I P F C R R Y L DR L G E DK - -HV Y A L NT L V R QG I V E DY P P L ND I K -G S Y T A Q F E HT L I L H P HK K - E I V
E K A QQ L L K H - - - I HK S Y S T L A F C R K W L DR DG F DR - -H L MN L NR L V D E G A V NK Y P P L V DV K -G S F T A QY E HT I Y L G P T A K - E I L
T S A QK I L NV - - - I NK N F G T L P F C R R Y L DR L G QDK - -Y L L G L NN L V S NG I V E A Y P P L V DK K -G S Y T A QY E HT I L L R P T V K - E V I
P A V R NV L K Q - - -V - E E Y R E L P F A K R W L E - - - S DK - - L E F S L I Q L E K A G I L H S Y P V L V E S A -G G L V S QA E HT V I I T R DG C - E V T
P R A K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
S S A K S L L NV - - - I T K N F G T L P F C R R Y I DR L G QDK - -Y L L G - - - - - - -G I V E A Y P P L V DK K -G S Y T A HW L S T QR L K N S T A F Q I V
A K A K Q L L G T - - - I NNN F G T L A F C R R Y L DR L G E T K - -Y L MA L K N L C DV G I V Q P Y P P L C DV R -G S Y V S Q F E HT I L L R P T C K - E V I
P R A K Q L L G V - - - I DR N F G T L A F C K R Y L DR I G E QR - -Y S MA L K N L C DNG I V Q P Y P P L C D I K -G S Y V A QY E HT I L L K P S S G V E V L
P K A K S L L T H - - - I DNHY DT L A F C R R F L DR DG Q S N - -Y L L G L K N L C D L G I V N P Y P P L C D I R -G S Y V S QY E HT I F L K P S C I - E V I
E R A K T L L N S - - - I T S N F G T L P WC R R Y L E R T G E E K - -Y L F A L NQ L V R A G I V E E Y P P L V D I K -G S Y T A QY E HT I L L H P HK K - E V V
N S A K T L L K V - - - I NDN F DT L P F C NR W L DD L G QT R - -H F MA L K T L I D L N I V E P Y P P L C D I K -N S F T S QM E HT I L L R P T C K - E V L
N S A K T L L K V - - - I NDN F DT L P F C HR W L DD L G QK R - -H F MA L K T L V D L N I V E P Y P P L C DV K -N S F T S QM E HT I L L R P T C K - E V L
N S A K T L L K V - - - I NDK F DT L P F C NR W L DD L G QT R - -H F MA L K T L V D L N I V E P Y P P L C D I K -N S F T S QM E HT I L L R P T C K - E V L
P R A K Q L L A T - - - I NK N F S T L A F C R R Y L DR L G E T K - -Y L MA L K N L C D S G I I Q P Y P P L C DV K -G S Y V S Q F E HT I L L R P T C K - E V I
P R T K H L L NV - - - I N E N F DT L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
D S A K N L L K T - - - I DR N F G T L P F C R R Y L DR L G Q E K - -Y L F A L NN L V R HG L V QDY P P L ND I P -G S Y T A Q F E HT I L L HA HK K - E V V
P R A K A L L NT - - - I T QN F G T L P F C R R Y L DR I G E S K - -Y L L A L NN L V S A G I V QDY P P L C D I R -G S Y T A Q F E HT I I L H P T QK - E V V
QR S K A L L K V - - - I NNN F G T L A F C R R W L DR L G E T K - -Y L MA L K N L C DT G L V D P Y P P L C DV K -G C Y T A QY E HT I M L R P T Y K - E V V
P R A K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I I D P Y P P L C D I K -G S Y T A QY E HT I L L R P T Y K - E V V
P K A K N L L K F - - - I DNN F G T L A F C R R W L DR G G QT G - -H I L S L K Q L C DA G I V V P Y P P L V DV R -G S Y V A QY E HT I V L K P S HK - E V I
K S A R E S L NV - - - I NR E F S T L P F C K R W L DD L T NK R - -G S L V L R N L V DA G I I V P Y P P L C DNN -N S F T S QM E HT I L L R P T C K - E V L
K S A R E A L NV - - - I NR E F S T L P F C K R W L DD L T NR R - -G S MV L R S L V DA G I V V P Y P P L S DNN -H S F T S QM E HT I L L R P T C K - E V L
P A A R K L L K T - - - L Q E N F S T L A F S QR F I DR I G E K K - -Y Q L N L R H L V E C R A V HDY P S L S DV K -G S Y V A Q F E HT F I L L P T HK - E V L
DK A QQ L L R H - - - I HK T Y NT L A F A R K W L DR DG HDR - -H L L N L NQ L V E A G A V NK Y P P L C D I R -G C Y T A Q L E HT L I L K P T A K - E I L
E K A QH L L K H - - - I NK T Y G T L A F A R K W L DR DG Y DR - -H L L N L NQ L V E A G A V NR Y P P L C DV K -G C Y T A Q F E HT I L L K P T A K - E I L
H S A HG L L R T - - - I NK H F D S L P F C R R Y L DR V G E K N - -Y L L G L K H L V S L G V V QDY P P L C D I A -G S MT A QY E HT I L L R P T C K - E V V
NK A K Q L L A T - - - I DK N F G T L P F C R R Y L DR L G E E K - -Y L L A L K N L V Q S G V V QDY P P L V DQK -G C QT A QY E HT I Y L R P T C K - E I L
Alignment position ruler: 3240, 3250, 3260, 3270, 3280, 3290, 3300, 3310
Sequence labels: A_fumigatus/1-3241, Bombyx_mori/1-2389, Bos_taurus/1-3273, C_briggsae/1-3231, C_parvum/1-3281, Danio_rerio/1-3286, Homo_sapiens/1-3286, Mus_musculus/1-3286, Oryza_sativa/1-3218, P_falciparum/1-3285, P_knowlesi/1-3284, T_annulata/1-3248, T_brucei/1-3264
S R - - - - - - - -G DDY - - - - - - - - - - - - - - - - - - - -A QMQMK A R L A G A HK G HG L L K K K A DA L QMR F R M I L S K I I E T K T L MG E V -
S K - - - - - - - -G DDY M - -A G QNA -R L NV V P T V T - -M L G V MK A R L V G A T R G HA L L K K K S DA L T V Q F R A L L K K I V T A K E S MG DM -
S K - - - - - - - -G DDY M - - - - S NN -R E QV F P T R M - -T L G L MK S K L K G A NQG H S L L K R K S E A L T K R F R E I T R R I D E S K QR MG A V -
S R - - - - - - - -G DDY M - - S G A V G -R E P V F P T R Q - - S L G L MK S K L K G A E T G H S L L K R K S E A L T K R F R E I T R R I D E A K QK MG R V -
S R - - - - - - - -G DDY M S G F N P P G -R E A V F P T R Q - - S L G L MK G K L K G A E T G H S L L K R K S E A L T K R F R E I T R R I D E A K QK MG R V -
S R - - - - - - - -G DDY - - - - - - - - - - - - - - - - - - - - - -M L I K G R L A G A V K G HG L L K K K A DA L QV R F R M I L S K I I E T K T L MG E V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
- - - - - - - - - - - - - -M S G G G G K D -R I A V F P S R M - -A QT L MK T R L K G A QK G H S L L K K K A DA L N L R F R D I L K K I V E NK V L MG E V -
S R - - - - - - - -G DDY M - S G G G K D -R I A V F P S R M - -A QT L MK T R L K G A QK G H S L L K K K A DA L N L R F R D I L R K I V E NK V L MG E V -
T K - - - - - - - -G DDY M - - S G A G N -R E QV F P T R M - -T L G V MK S K L K G A QQG H S L L K R K S E A L T K R F R D I T QR I DDA K R K MG R V -
S K - - - - - - - -G DDY M - - - - S G N -R E QV F P T R M - -T L G L MK T K L K G A NQG H S L L K R K S E A L T K R F R D I T K R I D E A K QK MG R V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - - S S HD -R I D I F P S R M - -N L T I MK T R L K G A HK G H S L L K K K A DA L K MK F H S I L R K I I E A K Q L MG E I -
S R - - - - - - - -G DDY M - - S G T G P -R E A I F P T R M - -N L T L T K G R L K G A QT G H S L L A K K R DA L T T R F R Q I L R K V D E A K R L MG R V -
S R - - - - - - - -G DD F - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M L K E I V E T K R S I G ND -
S R - - - - - - - -G DD F L - - - - - - - - - - - - - - - L I Y R A L QA I K L K S K G A K QG Y D L L K R K S DA L S NK F R G M L K E I V E T K R S I G ND -
S R - - - - - - - -G DDY M - - - S G K E -R I D I F P S R M - -A QT I MK A R L K G A QT G R S L L K K K S DA L S MR F R Q I L R K I I E T K T L MG E V -
S R - - - - - - - -G DDY M - - S G A G N -R E QV F P T R M - -T L G L MK G K L K G A QQG H S L L K R K S E A L T K R F R D I T QR I DDA K R K MG R V -
S R - - - - - - - -G DDY M - - - S G K N -R L N I F P T R M - -A L T V MK T K L K G A V T G H S L L K K K S DA L T I R F R R I L A N I V E NK Q L MG T T -
S R - - - - - - - -G DDY M - - - S G K D -R L P I F P S R G - -A QM L MK A R L A G A QK G HG L L K K K A DA L QMR F R L I L G K I I E T K T L MG DV -
S R - - - - - - - -G DDY M - - - S G K D -R L P I F P S R G - -A QM L MK A R L A G A QK G HG L L K K K A DA L QMR F R M I L G K I I E T K T L MG DV -
T R - - - - - - - -G DDY M - - - -T G E -R I P V F P T R M - -N L R T M E T K QK S A QK G H S L L K R K S DA L K V R Y R A V E D E Y K R K E L G I NQK -
S R - - - - - - - - - - - -M - - - - S DK -R Y T V F P T R M - -Q L T T Y K G K L V G A QR G HD L L K R K T DA L NQK F K S I L K K I I E E K M S MK DY -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K L L MG E V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - - - S S N -R E QV F P T R M - -T L G L MK T K L K G A NQG Y S L L K R K S E A L T K R F R D I T K R I DD S K QK MG R V -
S K - - - - - - - -G S DY M - - - S S T S -R Y P A L P S R M - - S L I A F K T R L K G A QK G H S L L K K K A DA L S L R Y R T V MG E L R T A K L E MA NQ -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - S G A G E -R E A V F P T R Q - - S L G I MK A K L K G A E T G H S L L K R K S E A L T K R F R E I T K R I D E A K R K MG R V -
T K - - - - - - - - - - - -M - - - - - - -A QQDV K P T R S - - E L I N L K K K I K L S E S G HK L L K MK R DG L I L E F F K I L N E A R NV R T E L DA A -
S R - - - - - - - -G DDY M - - - S T K D -R I D I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA MT L R F R Q I L K K V I QT K V L MG E V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
NQR R R V T E A I T E A E M - - S G A A D -R E A V F P T R Q - - S L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - E I T R R I D E A K R K MG R V -
S R - - - - - - - -G DDY M - - S G QT Q -R L NV V P T V T - -M L G V MK A R L V G A T R G HA L L K K K S DA L T V Q F R A I L K K I V A A K E S MG E A -
T R - - - - - - - -G E DY M - - S S A G A -R L NV T P T V T - -T L A V I K S R L A G A QR G HR L L K K K A DA L T L R Y R G I L R D I V E A K R K L A T S -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - - - - - - -A E QV V P S R M - -N L A L Y K A K I I S A K K G H E L L K K K C DA L K T K F R I V MV A L L E NK K F MG D E -
T R - - - - - - - -G DDY M - - S G A G N -R E QV F P T R M - -T L G L MK G K L K G A QQG H S L L K R K S E A L T K R F R D I T QR I DDA K R K MG R V -
S R - - - - - - - -G P D F M - - -G A L D - E S T P V P S R I - -T L Q L MK QK K K S A F QG Y S L L K K K S DA L F I H F R DV L K D I V K T K T K V G E E -
S R - - - - - - - -G P D F M - - -G A L D - E S T P V P S R I - -T L Q L MK QK K K S A F QG Y S L L K K K S DA L F I H F R DV L K D I V K T K T K V G E E -
S R - - - - - - - -G P D F M - - -G A L D - E S T P V P S R I - -T L H L MK QK K K S A F QG Y S L L K K K S DA L F I H F R DV L K D I V K T K NK V G E D -
S R - - - - - - - -G DDY M - - S G S G Q -R L NV V P T V T - -V L G V V K A R L V G A T R G HA L L K K K S DA L T V Q F R Q I L K K I V S T K E S MG DK -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S K - - - - - - - -G DDY M - - - - S G N -R E QV F P T R M - -T L G L MK T K L K G A NQG Y S L L K R K S E A L T K R F R D I T K R I DDA K QK MG R V -
S R - - - - - - - -G DDY M - - -A S K Q -R E NV F P T R M - -T L T T MK T R L K G A QT G H S L L K R K S E A L K K R F R E I V V N I E QA K QK MG R V -
S R - - - - - - - -G DDY M - - - - S K D -R I A V F P S R M - -A L T T MK I R L K G A QK G H S L L K K K A DA L T L K F R Q I L G K I I E NK T L MG E A -
S R - - - - - - - -G E DY M - - - S G K E -R I DV F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L S MR F R Q I L R K I I E V S W L S S A I P I
S R - - - - - - - -G DDY M - - - - - - - - S QQ I T P S R M - -T L A I Y K A K T V S A K K G H E L L K K K C DA L K T K F R A I M I A L L E NK L K MD E E -
S R - - - - - - - -G DDY M - - - - S S L - S V L L I P S R M - -N L QN L K QR R HNA H L G Y S L L K R K S DA L T S K F HR L L R A T V QG K E R L V E G -
S R - - - - - - - -G DDY M - - - - S N L - S V L L I P S R M L V N L QN L K QR R HNA H L G Y S L L K R K S DA L T S K F HR L L R A T V QG K E R L V E G -
S R - - - - - - - -G DDY M - - - - - - - - -A A I I P T R M - - E L QN L K E K L K G A R K G Y D L L K K K S DA L T MK F R S L L R E I R DT K L S V G NV -
S K - - - - - - - -G DDY M - - - - S S N -R Y T A L P S R M - - S L I A F K T R L K G A QK G H S L L K K K A DA L A F R Y R T V MD E L R R A K L E V A DQ -
S K - - - - - - - -G DDY M - - - - S S N -R Y P A L P S R M - - S L I S F K T R L K G A QK G H S L L K K K A DA L A I R Y R A I MG D L R NA K M E MV E Q -
S R - - - - - - - -G T DY M - S S G K G Q -R E S V F P T R Q - -A L G S A K T R L K G A QT G H S L L K K K A DA L T K R F R T I T HK I D E A K R K MG R V -
S R - - - - - - - -G DDY M - - - S A NN -R E A V F P T R M - -T L G MMK G K L K G A T QG HN L L K R K S E A L T K R F R D I T R K I D E S K HK MG R V - -
Alignment position ruler: 3330, 3340, 3350, 3360
- -MK E A A F S L A E A K F A S G -D F NQV V L QNV T K A Q I K I R T K K DNV A G V T L

- -MK T S S F A L T E V K Y V A G DNV K HV V L E NV K E A T L K V R S R T E N I A G V K L
- -MQT A S F S L A E V T Y A T G E N I G Y QV Q E S V A NA R F K V G A R Q E NV S G V Y L
- -MQ I A A F S L A E V S Y A V G G D I G Y QV Q E S A K QA R F R V R A K Q E NV S G V F L
- -MQ I A A F S L A E V S Y A V G G D I G Y Q I Q E S A K QA R F R V R A K Q E NV S G V L L
- -MK E A A F S L A E A K F T T G -D F NQV V L QNV T K A Q I K I R S K K DNV A G V T L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MK E A A F S L A E A K F T A G -D F S HT V I QNV S QA QY R V R MK K E NV V G V L L
- -MK E A A F S L A E A K F T A G -D F S HT V I QNV S QA QY R V R MK K E NV V G V F L
- -MQT A A F S L A E V QY A T G DN I S Y QV Q E S V QK A R F T V K A K Q E NV S G V F L
- -MQT A A F S L A E V S Y A T G E N I G Y QV Q E S V L NA R F K V K A R Q E NV S G V Y L
- -MK E A A F S L A E A K F S G G -D F S HV V L QNV G K A QMK V R S K T DNV A G V K L
- - L Q L A S F S L A E V T Y A A G -D I G Y QV Q E S V R K A NY T V QA R Q E NV S G V V L
- - I K E A S F A L A K A T WA A G -D F K DR I I E S C K R P T V T M E V G T E N I A G V R L
- - I K E A S F A L A K A T WA A G -D F K DR I I E S C K R P T V T M E V G T E N I A G V R L
- -MR E A A F S L A E A K F A A G -D F S T T V I QNV NK A QV K V R A K K DNV A G V T L
- -MQT A A F S L A E V QY A T G DN I S Y L V Q E S V QNA R F QV K A K Q E NV S G V Y L
- -MR DA S F S L A A A K Y A A G - E F S N S V I E NV S N P T I A V K MT T E NV A G V H L
- -MK E A A F S L A E A K F T S G -D I NQV V L QNV T K A Q I K I R T K K DNV A G V T L
- -MK E A A F S L A E A K F T S G -D I NQV V L QNV T K A Q I K I R T K K DNV A G V T L
- - I R DA F F R L T E A E F L G A -N L K M F L Y E -C QK QNV Y V R S R V E QV S G V S L
- -MK A S S F S L V S A K Y T A G - E F S HV V V QNV K N S T Y K V K L T Q E N I A G V R L
- -MQT A A F S L A E V T Y A T G E N I G Y QV Q E NV A NA R F K V R A T Q E NV S G V Y L
- - I K G S Y F T I T QA Q F I A G -D I S L A V Q E S L K I P T Y R M E L QV E N I A G V QV
- -MQ I A A F S L A E V T Y A V G G D I G Y T V Q E S A K S A R F R I R A K Q E NV S G V L L
- -Y E K S T E K I N L A S A V NG -MV A V K S T A F T A K E Y P E I Q L S G HN I MG V V V
- -MR E A A F S L A E A K F T A G -D F S A T V I QNV NK A QV K I R T K K DNV A G V T L
- -MQ I A S L S L A E V T Y A V G G N I G Y Q I Q E S A K S A R F R I R A K Q E NV S G V L L
- -MR A S S F S L A E A K Y V A G DG V R HV V L Q S V R S A S L R V R S HQ E NV A G V K L
- -MR DA H F A WT R A K Y A G G DA V K HA V L DG V DR A NV R V MA H E DNV A G V K I
- -A Q E A L L L I A K A QY A A G - E F HQNV K DA V K R A T I R L E I S S E N I A G V M L
- -MQT A A F S L A E V QY A T G DN I A Y QV Q E S V QK A R F QV K A K Q E NV S G V Y L
- -MR NA S F S L A K S V WA A G -D F K G Q I I E G I K R P V V T L S L S T NNV A G V K L
- -MG NA S F S L A K A V WA A G -D F K G Q I I E G I K R P V V T L S L S T NNV A G V K L
- -MR NA S F A L A K S V WA A G -D F K G Q I I E G I K R P V V T L S L S T NNV A G V K L
- -MK A S S F A L T E A K Y V A G E N I K HT V L E NV QT A T L K V R S R Q E NV A G V K L
- -MQT A A F S L A E V S Y A T G E N I G Y QV Q E S V S T A R F K V R A R Q E NV S G V Y L
- -MQ I A A F S MA E V G F A MG NN I N F E I QQ S V K Q P R L R V R S K Q E N I S G V F L
- -MK L A S L S L A E A K F A MG -D I S HNV L QNV T K A QT K V R S K K E NV A G V N L
Q F MR E A A F S L A E A K F T A G -D F S I T V I QNV NK A QV K V R A K K DNV A G V T L
- -MQK A F I Q L A DA Y WA A D -Q F NT NV R E S V K K A L V R I E Y S S E N I A G V M L
- - L K DA T Y S L A NA V W S A E -D F K S L V I E S V G R P S V T L K L R G E N I A G V L L
- - L K DA T Y S L A NA V W S A E -D F K S L V I E S V G R P S V T L K L R G E N I A G V L L
- -A K DA L F A Y T E V K F V A S -D I S P T V I Q S V G NM P Q L L L MT I DN I A G V R T
- - I K G S Y F T I T QA Q F I A G -D I S L A V Q E S L K L P T Y T L T L R V DNV A G V R V
- - I R G A Y F T V S K A Q F I A G -D I G L A V Q E S L K L P T Y A MR L R V E N I A G V R V
- -MQQA S F S L A E V QY A T G -D I G Y I V Q E S V K S A S F R V R A K Q E NV S G V I L
- -MQT A A F S L A E V T Y A T G DN I NY QV Q E S V R S A R L R V R A K E E NV S G V K L
Alignment position ruler: 3370, 3380, 3390, 3400
P V F G L A K G G QQ L QK L K K NY Q S A V K L L V E L A S L QT S
P K F G L A R G G QQV R A C R V A Y V K A I E V L V E L A S L QT S
P Q F G L G R G G QQV QR A K N I Y T K V V E S L V Q L A S L QT A
P Q F G L G K G G QQV QR C R E T Y A R A V E T L V E L A - - - -
P H F G L G K G G MQV QR C R E T Y A R A V E T L V E L A S L QT A
P I F G L A R G G QQ L A K L K K N F Q S A V K L L V E L A S L QT S
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P V F G L G K G G A N I A R L K K NY NK A I E L L V E L A T L QT C
P V F G L G K G G A N I A R L K K NY NK A I E L L V E L A T L QT C
P T F A L A R G G QQV QK A K L I Y S K A V E T L V E L A S L QT A
P Q F G L G R G G QQV QR A K D I Y S K A V E T L V E L A S L QT A
P V F G L S R G G E Q L S R L K K NY S K A V K L L V E L A S L QT S
P A F G L S R G G QQ I QK S R DT Y I K A V G T L V E L A S L QT A
P I F G V A S G G QV I Q S T R E I Y MK V L R D L V K L A S L QT A
P I F G V A S G G QV I Q S T R E I Y MK V L R D L V K L A S L QT A
P V F G L A R G G E Q L S R L K R NY A K A V E L L V E L A S L QT S
P T F G L G R G G QQV QK A K MV Y T K A V E T L V E L A S L QT A
P T F G L S K G G QQ I NK S R E S H I K A V E A L I A L A S L QT A
P V F G L A R G G QQ L A K L K K NY Q S A V K L L V E L A S L QT S
P V F G L A R G G QQ L A K L K K NY Q S A V K L L V E L A S L QT S
P F F F L DR S G Q S L N E C R E K F L E V L E M L V D L C A L K N S
P V F G L S K G G Q S V A NA R QQY L K A L D S L V K L A S L QT A
P Q F G L G R G G QQV QR A K E I Y S R A V E T L V E L A S L QT A
P S F G L G K G G E Q I K E A Y S A F R HT L S L L V K I A S L QT S
P A F G L G K G G QQV QR C R E T Y A R A V E A L V E L A S L QT A
P K I G I I G T N S Y I D E T A DA Y E E L V E K I I A A A E L E T T
P V F G L A R G G E QV T K L K K NY G K A V E L L V E L A S L QT S
P A F G L G K G G QQV QR C R E T Y A R A V E A L V E L A S L QT A
P K F G L A R G G QQV A A C R A A HV K A I E V L V E L A S L QT S
P K F G L A R G G A R V R E A K A S Y G E A I G L L S E L A S L QT A
P E V G L A R G G Q S I QR C R DK F K D L L M L L V K I A S Y QT S
P T F G L G R G G QQV QK A K L V Y T R A V E T L V E L A S L QT A
P I F G V A A G G QV I NNT R E NY L QC L NM L V K L A S MQV A
P I F G I A S G G QV I NNT R E NY L QC L NM L V K L A S MQV A
P I F G V A A G G QV I NNT R E NY L QC L NM L V K L A S MQV A
P K F G L A R G G QQV QA C R A A Y V K A I E V L V E L A S L QT S
S Q F G L G R G G QQV QR A K E I Y S R A V E T L V E L A S L QT A
P T F G L G K G G QQ I QK A R QV Y E K A V E T L V Q L A S Y Q S A
P V F G L S R G G QQ I DR L K K NY A K A I E L L V E L A S L QT S
P V F G L A K G G E Q I S R L K R NY A R A V E L L V E L A S L QT S
P N L G L DK G G F S I QK A K E R F K E A L Y L L V K V A S L QT S
PV F S L S SGG SA I Q SV K T T H LAA LD I LV E LA S LQ I S
PV F S L S SGG SA I Q SV K T T H LAA LD I LV E LA S LQ I S
P Q F G L A R G G QQ I QK A R E E F T K F L D S L V R L A E L QT A
P A F G I G R G G E Q L R E A R DA F R E T L K L F V K I A S L QV S
P S F G I G R G G E Q L R E A S E K F R E T L R L L V K I A S L QV S
P A F G L S R G G QQV S K A R E V Y T QA L K V L V E L A S L QT A
P S F G L G R G G QQV QK A K A V Y S K A V E T L V E L A S L QT A
Alignment position ruler: 3410, 3420, 3430, 3440, 3450, 3460
F V T L D E V I K I T NR R V NA I E H - - - - -V I I P R I DR T L A Y I I S E L D E L E R E E F Y R L K K I QDK K
F L T L D E A I K T T NR R V NA L E N - - - - -V V K P K L E NT I S Y I K G E L D E L E R E D F F R L K K I QG Y K
F V I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I A Y I N S E L D E L DR E E F Y R L K K V Q E K K
- - - -N E V I K V V NR R V - S T S L - - - - - S L E P R T L S N - - - - - - - - - - - - - - - - - - - - - - - - -
F V I L D E V I K V V NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V S G K K
F V T L D E V I K I T NR R V NA I E H - - - - -V I I P R L E R T L A Y I I S E L D E L E R E E F Y R L K K I QDK K
F V T L D E A I K I T NR R V NA I E HG E F K L P F C P R L H P C L R P A R T QA - - - - - - - - - - - - - - - - -
F I T L D E A I K V T NR R V NA I E H - - - - -V I I P R I E NT L T Y I V T E L D E M E R E E F F R MK K I QA NK
F I T L D E A I K V T NR R V NA I E H - - - - -V I I P R I E NT L T Y I V T E L D E M E R E E F F R MK K I QA NK
F I I L D E V I K I T NR R V NA I E H - - - - -V I I P R T E NT I A Y I NG E L D E MDR E E F Y R L K K V Q E K K
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I A Y I N S E L D E L DR E E F Y R L K K V Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F V T L D E S I K I T NR R V NA I E H - - - - -V I I P K I E R T I S Y I I T E L D E G E R E E F F R L K K I QQK K
F T I L D E V I R A T NR R V NA I E H - - - - -V V I P R L E NT I K Y I N S E L D E MDR E E F F R L K K V QG K K
F F S L D E E I K MT NR R V NA L QN - - - - -V V L P K L E DG MNY I L R E L D E I E R E E F F R L K K I Q E K K
F F S L D E E I K MT NR R V NA L QN - - - - -V V L P K L E DG MNY I L R E L D E I E R E E F F R L K K I Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L T Y I I T E L D E R E R E E F Y R L K K I Q E K K
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I S Y I N S E L D E L DR E E F Y R L K K V Q E K K
F I T L D E V I K I T NR R V NA I E Y - - - - -V V K P K L E NT I S Y I I T E L D E S E R E E F Y R L K K V QG K K
F R V L N S I L M S T NR R V NA L E F - - - - -N I I P R L E NT V S Y I V S E L D E QDR G D F F R L K K V QN L K
F L T L DT V I K I T NR R V NA L E H - - - - -V V I P MT QA T V K Y I E T E L D E S E R E E F F R L K L I QNK K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L S Y I I T E L D E R E R E E F Y R L K K I Q E K K
W I T L D I A QK V T S R R V NA L E K - - - - -V V I P R V QNT L S Y I T S E L D E Q E R E E F F R L K MV QK K K
F V I L D E V I K V V NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V A G K K
MK R L L D E I E K T K R R V NA L E F - - - - -K V I P E L I A T MK Y I R F M L E E M E R E NT F R L K R V K A R M
F I T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L NY I V T E L D E R E R E E F Y R L K K I Q E K K
F V I L D E V I K V V NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V A A K K
F L T L D E A I K T T NR R V NA L E N - - - - -V V K P R L E NT I S Y I K G E L D E L E R E D F F R L K K I QG Y K
F V T L D E A I K T T NR R V NA L E N - - - - -Y V T P R L QNT V K Y I L S E L D E L E R E E F F R L K K V QA K K
F V S L DQV I K V T NR R V NA L E Y - - - - -V V I P R F T A T MNY I DM E L D E M S K E D F F R L K K V L DNK
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I S Y I N S E L D E L DR E E F Y R L K K V Q E K K
F F S L D E E I K MT NR R V NA L NN - - - - - I V L P R L DG G I NY I I K E L D E I E R E E F Y R L K K I K E K K
F F S L D E E I K MT NR R V NA L NN - - - - - I V L P R L DG G I NY I I K E L D E I E R E E F Y R L K K I K E K K
F F S L D E E I K MT NR R V NA L NN - - - - - I V L P R L E G G I NY I I K E L D E I E R E E F Y R L K K I K E K K
F MT L DT A I K T T NR R V NA L E N - - - - -V V K P R L E NT I T Y I K G E L D E L E R E D F F R L K K I QG F K
F V L L G DV L QMT NR R V N S I E H - - - - - I I I P R L E NT I K Y I E S E L E E L E R E D F T R L K K V QK T K
F I T L D E V I K I T NR R V NA I E H - - - - -V I I P R I E NT I S Y I T T E L D E R E R E E F Y R L K K I Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I DR T L T Y I V T E L D E R E R E E F Y R L K K I Q E K K
F I T L D E V I K V T NR R V NA L E H - - - - -V V I P R F M E V QA Y I NQ E L D E M S R E D F F R L K K V L D F K
F I I L N E E I R MT NR R I NA L DN - - - - -V L I P S I DR N L E Y I R R E L D E M E R E E F Y R L K M I K K HK
F I I L N E E I R MT NR R I NA L DN - - - - -V L I P S I DR N L E Y I R R E L D E M E R E E F Y R L K M I K K HK
F NV I DDV L R I T NR R V NA M E C - - - - -V L I P K Y QA A I A F V D S T L D E N E R E E F F R L K K V Q E T I
WMT L DV A QK V T S R R V NA L E K - - - - -V V I P R M E NT L NY I S S E L D E Q E R E E F F R L K M I QK K K
WV T L D L A QK V T NR R V NA L E K - - - - -V V I P R V QNT L S Y I T S E L D E Q E R E E F F R L K MV QK K K
F V I L D E V I R MT NR R V NA I E H - - - - -V I I P R L E NT I S Y I V S E L D E A DR E E F F R L K K V QA K K
F V I L D E V I K I T NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V QDK K
