
AN INVESTIGATION OF HUMAN PROTEIN INTERACTIONS USING THE COMPARATIVE METHOD

Saif Ur-Rehman

A Thesis Submitted for the Degree of PhD at the University of St Andrews

2012

Full metadata for this item is available in Research@StAndrews:FullText at:
http://research-repository.st-andrews.ac.uk/

Please use this identifier to cite or link to this item:
http://hdl.handle.net/10023/3119

This item is protected by original copyright

This item is licensed under a Creative Commons Licence

School of Biology
PhD Thesis
An investigation of human protein interactions using the
comparative method
by
Saif Ur-Rehman

20th Jan 2012

1. Candidate's declarations:
I, Saif Ur-Rehman hereby certify that this thesis, which is approximately 52,000 words in length, has been written
by me, that it is the record of work carried out by me and that it has not been submitted in any previous
application for a higher degree.
I was admitted as a research student in [09, 2006] and as a candidate for the degree of PhD in [09, 2007]; the
higher study for which this is a record was carried out in the University of St Andrews between [2006] and [2010].
Date signature of candidate
2. Supervisor's declaration:
I hereby certify that the candidate has fulfilled the conditions of the Resolution and Regulations appropriate for the
degree of PhD in the University of St Andrews and that the candidate is qualified to submit this thesis in
application for that degree.
Date signature of supervisor
3. Permission for electronic publication: (to be signed by both candidate and supervisor)
In submitting this thesis to the University of St Andrews I understand that I am giving permission for it to be made
available for use in accordance with the regulations of the University Library for the time being in force, subject to
any copyright vested in the work not being affected thereby. I also understand that the title and the abstract will
be published, and that a copy of the work may be made and supplied to any bona fide library or research worker,
that my thesis will be electronically accessible for personal or research use unless exempt by award of an
embargo as requested below, and that the library has the right to migrate my thesis into new electronic forms as
required to ensure continued access to the thesis. I have obtained any third-party copyright permissions that may
be required in order to allow such access and migration, or have requested the appropriate embargo below.
The following is an agreed request by candidate and supervisor regarding the electronic publication of this thesis:
Add one of the following options:
(ii) Access to [all or part] of printed copy but embargo of [all or part] of electronic publication of thesis for a period
of 2 years (maximum five) on the following ground(s):
publication would preclude future publication;
Date signature of candidate

signature of supervisor

A supporting statement for a request for an embargo must be included with the submission of the draft copy of the
thesis. Where part of a thesis is to be embargoed, please specify the part and the reasons.

Abstract
The rate of production of DNA sequence data is increasing rapidly as next-generation
sequencing technologies become more widespread. There is therefore a need for
rapid computational techniques to functionally annotate data as they are generated. One
computational method for the functional annotation of protein-coding genes is the detection
of interaction partners. If the putative partner has a functional annotation, then this annotation
can be extended to the initial protein via the established principle of guilt by association.
This work presents a method for rapid detection of functional interaction partners for
proteins through the use of the comparative method. Functional links, which can be either
direct or indirect protein interactions, are sought between proteins through analysis of
their patterns of presence and absence amongst a set of 54 eukaryotic organisms. These
patterns are analysed in the context of a phylogenetic tree.
The method used is a heuristic that combines an established, accurate methodology,
in which models of evolution whose parameters are estimated by maximum likelihood are
compared, with a novel technique involving the reconstruction of ancestral states
using Dollo parsimony and analysis of these reconstructions through logistic
regression. The methodology achieves specificity comparable to that of gene coexpression
as a means of predicting functional linkage between proteins.
The application of this method permitted a genome-wide analysis of the human
genome, which would have otherwise demanded a potentially prohibitive amount of
computational resource.
Proteins within the human genome were clustered into orthologous groups. Ten of
these proteins, which were ubiquitous across all 54 eukaryotes, were used to reconstruct a
phylogeny. An application of the heuristic predicted a set of 1,142 functional protein
interactions in human cells. Of these predictions, 1,131 were
not present in current protein-protein interaction databases.

Acknowledgements
I thank the BBSRC for funding my work and making this project possible. I would also like
to thank my supervisor, Dr Daniel Barker, for his tireless support and good advice. I would
like to thank Professor Mike Ritchie, Dr Jeff Graves and Dr Anne Smith for acting as my
committee. I would also like to thank Dr Rona Ramsay, Ji-Hiyun Lim, Wim Verleyen, Dr
Christoph Echtermeyer and Maria Keays for helpful discussion. I would also like to thank Dr
Neil Symington and Dr Herbert Fruchtl for their helpful advice in using cluster-computing
resources.
On a personal note, I would like to thank my wife Cathryn for her constant, unconditional
support, without which I could not have brought this project to fruition. I would also like to
thank my friends Ken Armstrong and Jack Levell, who played a large part in keeping me
balanced over the course of my work. Finally, I would like to thank my parents for their love
and support, which has been a source of strength.

Abstract
Acknowledgements

Chapter 1
1.1 History
1.2 DNA/RNA
1.2.1 RNA
1.3 Proteins
1.3.1 Protein secondary structure
1.3.2 Protein tertiary structure
1.3.3 Protein quaternary structure
1.3.4 Protein domains
1.3.5 Protein motifs
1.4 Genes
1.4.1 Structure of a gene
1.4.1.1 Regulatory region of a gene
1.4.2 Transcription
1.4.2.1 Post Transcriptional processing
1.4.2.1.1 Genetic Code
1.4.2.1.2 Open reading frames
1.4.2.1.3 Exons/Introns
1.4.2.2 Post Transcriptional processing (cont)
1.4.2.2.1 RNA splicing
1.4.2.2.2 Capping
1.4.2.2.3 Polyadenylation
1.4.3 Translation
1.5 Genomics
1.5.1 Genome annotation
1.5.2 Genome and cDNA assembly
1.5.3 Gene detection
1.5.4 Functional annotation of genes
1.5.4.1 Laboratory based techniques
1.5.4.2 Computational methods for functional annotation of genes
1.5.4.2.1 Alignment based methods
1.5.4.2.2 Genome context methods
1.5.4.2.2.1 Rosetta stone
1.5.4.2.2.2 Gene neighbour
1.5.4.2.2.3 Interolog detection
1.5.4.2.2.4 Phylogenetic profiling
1.5.4.2.2.5 Comparative methods
1.5.4.2.2.6 Mirror trees
1.5.5 Storage of functional information
1.5.5.1 GO
1.5.5.1 KEGG
1.6 Transcriptomics
1.6.1 Microarrays
1.6.2 Other methods for transcriptome examination
1.7 Proteomics
1.7.1 Protein Structure
1.7.2 Protein interactions
1.7.2.1 Experimental detection of protein interactions
1.8 Description of project

Chapter 2
2.1 Introduction
2.1.1 Homology
2.1.2 Molecular evolution
2.1.2.1 Synonymous and non-synonymous mutations
2.1.3 Phylogenetic trees
2.1.3.1 Species trees and gene trees
2.1.3.2 Topologies and branch lengths
2.1.3.3 Bootstrap support values
2.1.3.4 Evolutionary models in tree estimation
2.1.4 Detection of homology in molecular data
2.1.5 Multiple sequence alignment
2.1.5.1 Multiple sequence alignment quality filtration
2.1.6 Methods to estimate phylogenetic trees
2.1.6.1 Distance methods
2.1.6.2 Discrete character state methods
2.1.6.2.1 Maximum Parsimony
2.1.6.2.2 Maximum likelihood
2.1.6.2.3 Bayesian Methods
2.1.6.3 Heuristic search methods
2.1.8 Model selection in phylogenetic tree estimation
2.1.10 Phylogenetic analysis using gene presence
2.2 Methods
2.2.1 Data Selection
2.2.2 Data Acquisition
2.2.3 Pairwise Alignment
2.2.4 Orthology Determination
2.2.4.1 Inparanoid
2.2.4.2 Implementation
2.2.4.2.1 Design
2.2.4.3 Application
2.2.5 Phylogenetic profiles
2.2.5.1 Single copy proteins
2.2.5.2 Proteome content data/tree
2.2.6 Multiple alignment
2.2.7 Model selection
2.2.8 Phylogeny reconstruction
2.2.9 Comparison of protein content tree with super matrix tree
2.3 Results
2.4 Discussion
2.4.1 ML tree
2.4.2 Proteome content phylogeny
2.4.3 Conclusion

Chapter 3
3.1 Introduction
3.1.1 Hamming distance
3.1.2 Comparative method
3.1.2.1 Phylogenetic profile analysis using the comparative method
3.1.3 Co-expression as measured by microarray
3.1.4 Bayesian classifier
3.2 Methods
3.2.1 Assessing quality
3.2.2 Training and test data
3.2.3 Hamming distance
3.2.4 Constrained ML
3.2.5 Co-expression of mRNA
3.2.6 Bayesian classifier
3.3 Results
3.3.1 Hamming distance
3.3.2 Constrained ML
3.3.2.1 Likelihood ratio statistic
3.3.3 Co-expression of mRNA
3.3.4 Bayesian classifier
3.3.5 Method Comparison
3.4 Discussion
3.4.1 Low Sensitivities

Chapter 4
4.1 Introduction
4.1.1 Ancestral state reconstruction
4.1.1.1 Parsimony
4.1.1.2 Likelihood
4.2 Filters
4.2.1 Hamming distance filter
4.2.2 Ancestral state reconstruction filter
4.2.2.1 Dollo parsimony
4.2.2.2 Maddison Test for correlated evolution
4.3 Methods
4.3.1 Maddison test for correlated evolution
4.3.1.1 Algorithm
4.3.1.2 Calculation of total number of ways of having x gains and y losses over the tree
4.3.1.3 Calculation of total number of ways of having p gains and q losses in subset k given x gains and y losses over the entire tree
4.3.1.4 Permutation effects
4.3.1.5 Evaluation of Maddison test as heuristic for constrained ML
4.3.2 Modification of test to match Dollo constraints
4.3.3 Differential parsimony
4.3.4 Dollo-pos/Dollo-overall
4.3.5 Test based on logistic regression
4.3.5.1 Evaluation of logistic regression as a heuristic for constrained ML
4.4 Results
4.4.1 Maddison test for correlated evolution
4.4.2 Differential parsimony
4.4.3 Dollo-pos
4.4.4 Dollo-overall
4.4.5 Logistic regression
4.4.6 Hamming distance
4.5 Discussion

Chapter 5
5.1 Introduction
5.1.1 PPI databases
5.1.1.1 MIPS
5.1.1.2 BIND
5.1.1.3 MINT
5.1.1.4 INTACT
5.1.1.5 HPRD
5.1.1.6 DIP
5.1.1.7 REACTOME
5.1.1.8 STRING
5.1.1.9 I2D
5.1.1.10 KEGG
5.1.1.11 BIOGRID
5.1.1.12 Discussion
5.1.2 Power law
5.2 Methods
5.2.1 Short Branch filtration
5.2.2 GO term enrichment
5.2.3 Intersection with other data sources
5.3 Results
5.3.1 GO Enrichment
5.3.2 Intersection with known data
5.3.3 Network statistics
5.4 Discussion
5.4.1 GO enrichment
5.4.2 Intersection with known data
5.4.3 Weaknesses
5.4.3.1 Scaling
5.4.4 Conclusions

Chapter 6
6.1 Summary of Project
6.1.1 Repeat Analysis
6.2 Conclusion
6.3 Future directions
6.3.1 Computational extensions
6.3.2 Consensus profiles
6.3.3 Correlated evolution of proteins with the presence or absence of phenotypes
6.3.4 Drug Targets

References
Appendix A Description of divergence of Java implementation of Inparanoid algorithm from Perl implementation
Appendix B Individual Gene trees for genes in super matrix utilised in construction of Phylogeny
Appendix C: Predictions made by constrained ML
Appendix D Concatenated Filtered Alignment

Chapter 1

Introduction to computational annotation of protein coding genes
1.1 History
The discovery in the 1940s (Avery et al. 1944) and confirmation in the 1950s (Hershey and
Chase 1952) of DNA (deoxyribonucleic acid) as the physical basis for inheritance was a
milestone in biological research. It provided a means to examine the materials and
processes underlying phenotypic traits and provided a conceptual link to the other natural
sciences. This was rapidly followed by the elucidation of the three dimensional structure of
B-DNA (Watson and Crick 1953) which is the form of DNA prevalent in living cells as it is
conducive to nucleosome formation (Richmond and Davey 2003). This structure was the now
famous double helix. It had been previously established (Beadle and Tatum 1941) that genes
exist as discrete regions within the genome whose sequence codes for the sequence of a
corresponding chain of amino acids. The genome of an organism is the full set of hereditary
material it possesses (Alberts 2010). This is RNA in the case of some viruses and DNA in the
case of all cellular organisms (Brown 2006). The discovery of the genetic code
(Crick et al. 1961) provided information on the mechanism for this production which
operates via initial intermediary transcription into RNA (ribonucleic acid) and then
translation into proteins. (Some genes also code for RNA products such as tRNAs and other
non-coding RNAs (Brown 2006)).
The first feasible method for determining the sequence of DNA was the Maxam-Gilbert
chemical degradation method (Maxam and Gilbert 1977). This method was however
supplanted by the chain termination method of DNA sequencing, invented near
simultaneously by Frederick Sanger (Sanger et al. 1977), which was both safer and more
efficient (Brown 2006). This led to the sequencing of the first full genome of an organism,
that of bacteriophage φX174 (Sanger et al. 1978). Another contribution by Sanger was
that of shotgun sequencing. This entails the shattering of a piece of DNA into random
fragments and the sequencing of those fragments. The sequences of the fragments are then
assembled through searching for overlaps between them. This method facilitated the
sequencing of a number of relatively larger viral and prokaryotic genomes such as
Bacteriophage MS2 (Fiers et al. 1976).

In 1996 the genome of Saccharomyces cerevisiae became the first eukaryotic genome to be
sequenced (Goffeau et al. 1996) via a large collaborative effort. This was followed by the
publication of the first multicellular eukaryotic genome, that of Caenorhabditis elegans, in
1998 (C. elegans Sequencing Consortium 1998) and the draft genome of the vertebrate Homo
sapiens soon followed in 2001 (Venter et al. 2001). With the application of industrial
streamlining and automation to sequencing efforts over the last 20 years, and more recently
with the onset of next generation sequencing technologies, there has been almost exponential
growth in sequence databases such as NCBI GenBank (Benson et al. 2009). Sequence data
without further processing and annotation cannot shed any light on either biological function
or evolutionary relationships between organisms. This means that there has been a focus on
the development of highly accurate high throughput methods for functional annotation of
genes and other functional genomic elements in recent years, as the gap between rates of data
generation and rates of accurate and verifiable annotation widens (Zhu et al. 2007).
1.2 DNA/RNA
DNA itself is made up of a linear backbone of alternating deoxyribose sugar and phosphate
residues (Strachan and Read 2004). There is a nitrogenous base attached to the 1′ (one prime)
carbon of each individual sugar residue. There are two forms of nitrogenous base present
within DNA. One form possesses a single heterocyclic ring of carbon and
nitrogen atoms. Bases that exist in this conformation are known as pyrimidines (Strachan and
Read 2004). The second form of base consists of two fused heterocyclic rings of carbon
and nitrogen atoms. These bases are known as purines (Strachan and Read 2004). There are
two pyrimidines represented within DNA (Strachan and Read 2004). These are cytosine and
thymine commonly represented by the abbreviations C and T respectively (Brown 2006).
There are also two purines present, adenine and guanine represented as A and G (Brown
2006). The stability of the double helix structure of DNA is maintained through hydrogen
bond formation between the pyrimidine-purine pair C and G and hydrogen bond formation
between the remaining pyrimidine-purine pair T and A as well as base stacking interactions
between adjacent bases (Yakovchuk et al. 2006). Due to structural constraints base pairing
can only occur between a pyrimidine and a purine (Brown 2006).
The linear backbone of DNA/RNA is maintained by a phosphodiester bond formed
between the 3′ (3 prime) carbon atom of the sugar and the 5′ (5 prime) carbon of the
succeeding sugar (Strachan and Read 2004). The backbone is terminated by a sugar where
the 5′ carbon is not linked to a succeeding sugar residue. This point is known as the 5′ end.
Similarly the other end of the molecule lacks a phosphodiester bond on the 3′ carbon and is
known as the 3′ end (Strachan and Read 2004). The sequence of DNA is usually described
in the 5′→3′ direction, as this is the direction of DNA replication as well as transcription of
RNA using DNA as a template (Strachan and Read 2004). Thus a feature along a DNA
molecule is referred to as being upstream of another feature if it is closer to the 5′ end. The
length of a DNA molecule is measured in units of individual base pairs (bp).
DNA is a biopolymer and as such can be fully represented by the sequence of its
constituent nucleotide bases. Determination of this sequence for a complete organism
effectively represents the DNA blueprints for the construction of that organism, i.e. the amino
acid sequences of its constituent proteins and RNA molecules, as well as the regulatory
sequences that regulate production of these molecules both spatially and temporally.
1.2.1 RNA
RNA is constructed of similar residues; however, the sugar is ribose as opposed to
deoxyribose and the pyrimidine base thymine is replaced with the base uracil, commonly
represented by the abbreviation U (Strachan and Read 2004). There is a diverse population of
RNA molecules produced by the eukaryotic genome. These molecules are involved with a
number of processes essential to life, including protein synthesis and regulation of gene
expression. A breakdown of general RNA types and their functions is presented in Table 1.1.
Abbreviated name   Full name               Primary function
mRNA               Messenger RNA           Provides a template for protein synthesis.
tRNA               Transfer RNA            Connection of mRNA to relevant amino acid during protein synthesis.
rRNA               Ribosomal RNA           Component of protein synthesising organelles known as ribosomes.
snRNA              Small nuclear RNA       Component of RNA-protein machine (involved in post transcriptional modification of mRNA) known as the spliceosome.
snoRNA             Small nucleolar RNA     Involved in the modification of rRNA and snRNA.
miRNA              Micro RNA               Involved in the regulation of RNA stability and translation.
siRNA              Short interfering RNA   Involved in the targeted degradation of RNA.

Table 1.1: General types of RNA molecules with function (Blow 2004).

1.3 Proteins
Protein molecules are polymers comprised of one or more chains of amino acids. A chain of
amino acids can also be referred to as a polypeptide chain. Amino acids are molecules that
consist of an amino group, a carboxyl group, an R group and a hydrogen atom (Berg et al.
2001). These components are all linked to a central carbon atom known as the α carbon
(Berg et al. 2001). A polypeptide chain is formed when a peptide bond is formed between
the amino group of one amino acid and the carboxyl group of another. All polypeptide chains
have a free amino group at one end and a free carboxyl group at the other. These are known
as the N-terminus and C-terminus respectively (Alberts 2002). The sequence of a polypeptide
chain is presented as moving from the N-terminus to the C-terminus (Alberts 2002). A linear
polypeptide chain is also considered the primary structure of a protein (Brown 2006).
It is the R group that distinguishes amino acids (Berg et al. 2001). R groups vary in
factors such as size, shape, charge, hydrogen-bonding capacity, hydrophobic
character, and chemical reactivity (Berg et al. 2001). There are 20 naturally occurring
amino acids that are typically utilised by living cells (Alberts 2002).
1.3.1 Protein secondary structure
The interactions of the R, carboxyl, and amine groups of individual amino acids in a
polypeptide chain with each other cause polypeptide chains to fold into characteristic
conformations. These conformations are known as the secondary structure of a protein. There
are two main types of secondary structure (Brown 2006), the α helix and the β sheet; regions
that adopt neither are described as random coils.

The α helix: This is a structure formed by interactions between the carboxyl groups
and amine groups of amino acids which are separated by a number of intermediate
amino acids (Berg et al. 2001).

The β sheet: This is a structure formed by the interactions between two polypeptide
chains, or two segments of the same chain, running either parallel or anti parallel to
each other (Brown 2006).

Random coils: In the absence of particular structural imperatives polypeptide chains
can take on any number of shapes that are sterically possible. These shapes are
referred to as random coils (Shortle and Ackerman 2001).

1.3.2 Protein tertiary structure
The tertiary structure of a protein is formed by the folding up of the secondary structural
constructs formed by the polypeptide chain into a three dimensional configuration (Brown
2006). This configuration is held together by a number of chemical forces including hydrogen
bonding between individual amino acids and the interactions of hydrophobic amino acids
with water (Brown 2006).
1.3.3 Protein quaternary structure
The quaternary structure of a protein is formed by the interactions of multiple polypeptide
chains. Quaternary structure is a hallmark of proteins with a complex function (Brown 2006).
1.3.4 Protein domains
A protein domain can be defined as a substructure produced by any part of
a polypeptide chain that can fold independently into a compact, stable structure (Alberts
2002). There are a number of recurrent protein domains that are functionally important within
the eukaryotic cell. These include:

Helix turn helix: This is a domain comprised of two α helices separated by a short
strand of amino acids. It is functionally important due to its ability to bind DNA
(Brennan and Matthews 1989).

Transmembrane domain: This is a domain consisting of α helical structures capable
of passing through the lipid bilayer (cell membrane) that surrounds the cell. These are
crucially important in facilitating cell-cell communication and relaying information
about the external environment into a cell (Brown 2006).

1.3.5 Protein motifs


Protein motifs are conceptually similar to protein domains in that they are distinct
substructures within a protein molecule (Brown 2006). In contrast with domains they are not
able to form outside of the context of the overall protein. Functionally important protein
motifs include:

Leucine zipper: This motif is important in that it facilitates the formation of protein
quaternary structure by the dimerisation of two leucine rich regions of separate
polypeptides (Brown 2006). It is a motif that is found in a number of proteins that
bind DNA (Brown 2006).

Zinc finger: The zinc finger motif is a set of polypeptide chains whose interactions are
stabilised by the presence of zinc ions. It is also present in DNA binding proteins
(Brown 2006).

1.4 Genes
As mentioned above the blueprints for the production of given protein and RNA molecules
within an organism are contained in subsections of its genome known as genes. A current,
more specific definition presented by Pesole (Pesole 2008) describes a gene as "a
discrete genomic region whose transcription is regulated by one or more promoters
and distal regulatory elements and which contains the information for the synthesis of
functional proteins or non-coding RNAs, related by the sharing of a portion of genetic
information at the level of the ultimate products (proteins or RNAs)".
1.4.1 Structure of a gene
As implied by that definition a gene is made up of two distinct parts. These are firstly a
transcribed area, which is the portion of DNA that is actually converted into RNA and
secondly regulatory regions, which can occur either upstream or downstream of the
transcribed region. Regulatory regions within the vicinity of a gene provide recognition
signals for proteins known as transcription factors. These proteins regulate the transcription
rate of a gene by either carrying out the actual transcription, or by binding to DNA and either
promoting or silencing transcription (Maston et al. 2006). As the binding of the proteins to
these regions provides this functionality, the regions are known as transcription factor
binding sites.


Figure 1.1: General structure of a gene. Adapted from (Maston et al. 2006).
1.4.1.1 Regulatory region of a gene
A typical regulatory region associated with a gene consists of a promoter element and distal
regulatory elements (Maston et al. 2006). The promoter element consists of a core promoter
and proximal promoter elements and typically spans less than 1 kb (kilobase) (Maston
et al. 2006). The core promoter of a gene is the region of DNA at which the proteins
primarily responsible for transcription bind and initiate the process of transcription.
Well-studied elements of the eukaryotic core promoter include the TATA box and the initiator or
Inr sequence (Brown 2006; Strachan and Read 2004). The TATA box generally has a
consensus sequence of 5′-TATAWAW-3′ where W is A or T (Brown 2006). The Inr
sequence has a consensus 5′-YYCARR-3′, where Y is C or T, and R is A or G (Brown 2006).
The TATA box and Inr sequence are present upstream of a large number of
eukaryotic genes. Most of the elements of the core promoter are
comprised of near identical DNA sequences.
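The consensus sequences quoted above can be searched for mechanically. The following is a minimal sketch, assuming that a consensus is treated as an exact pattern with the ambiguity codes W, Y and R expanded as defined in the text. The function names and the toy sequence are hypothetical and chosen purely for illustration; real promoters tolerate mismatches that this sketch does not model.

```python
import re

# Ambiguity codes as defined in the text (W = A or T, Y = C or T, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Expand a consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus))


def find_motif(sequence, consensus):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus).pattern + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Toy sequence invented for illustration only.
toy = "GGCTATAAAAGGCCTCAGACC"
tata_hits = find_motif(toy, "TATAWAW")  # TATA box consensus
inr_hits = find_motif(toy, "YYCARR")    # Inr consensus
print(tata_hits, inr_hits)
```

Exact matching is the simplest possible model of a consensus; it is used here only to make the notation concrete.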
The proximal promoter is generally located a few hundred base pairs upstream of the
core promoter element (Maston et al. 2006). This region of DNA typically contains binding
sites for other proteins, which contribute to the transcription of the gene but are not the
primary mechanism (Maston et al. 2006).
Distal regulatory elements tend to be further away from the transcribed portion of the gene and
either activate or repress the transcription of the gene. Elements that
activate transcription are known as enhancers and conversely elements that repress it are
known as silencers (Raab and Kamakaka 2010).

1.4.2 Transcription
A family of enzymes known as RNA polymerases carry out the process of transcription of
DNA into RNA in eukaryotic cells (Brown 2006). This process is known as transcription as
the fundamental chemical language is not changed (Alberts 2002). There are three RNA
polymerases typically encoded by the eukaryotic genome (Strachan and Read 2004). RNA
polymerase I and RNA polymerase III tend to transcribe genes which code for functional
RNA molecules, while RNA polymerase II is generally utilised for the production of RNA
which is further translated into a protein (Alberts 1998). Transcription proceeds via the
following general steps (Brown 2006):

A protein known as TATA binding protein (TBP) binds to the TATA box sequence.
This causes a bend in the DNA molecule.

This bend provides a recognition signal for other transcription factors to bind to the
DNA creating a structure known as the preinitiation complex (PIC) (Brown 2006).
The formation of the PIC also disrupts base pairing thus creating a single stranded
DNA template from which the RNA molecule is synthesised.

RNA polymerase binds to the PIC and then moves along the single strand of DNA
creating a complementary RNA molecule that conforms to base pairing rules. This
RNA molecule is known as the primary transcript.
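The base pairing rule used in the final step above can be sketched in a few lines. This is an illustrative sketch only: it models the pairing between a DNA template and the RNA built against it, not the enzymology, and the function names are hypothetical.

```python
# Pairing between a DNA template base and the RNA base added opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Build the RNA (read 5'->3') complementary to a template strand
    that is supplied in its 3'->5' orientation."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


def transcribe_from_coding(coding_5_to_3):
    """Equivalent shortcut: the transcript has the same sequence as the
    coding (non-template) strand with T replaced by U."""
    return coding_5_to_3.upper().replace("T", "U")


print(transcribe("TACGGT"))              # template strand, 3'->5'
print(transcribe_from_coding("ATGCCA"))  # matching coding strand, 5'->3'
```

Both calls describe the same toy double stranded molecule, so they yield the same transcript.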

1.4.2.1 Post Transcriptional processing


After the primary transcript has been produced it is subjected to further modifications. In the
case of primary transcripts associated with a protein-coding gene the primary transcript is also
known as pre-mRNA (messenger RNA). In order to explain why these modifications occur it
is necessary to understand how RNA molecules specify corresponding polypeptide
molecules.
1.4.2.1.1 Genetic Code
It was established in work by Francis Crick (Crick et al. 1961) that polypeptide chains are
specified by RNA molecules via triplets of nucleotides known as codons. As there are only
twenty naturally occurring amino acids in eukaryotic proteins, and 4³ = 64 possible triplets
from the 4 nucleotide types, the genetic code is redundant. Three of the codons specify the
termination of the polypeptide chain and the remaining 61 specify amino acids.
The table below presents the genetic code.


Table 1.2: The genetic code (Brown 2006).


The process by which these codons are translated into these amino acids will be presented in
the next section. This code is widely utilised though there are a number of exceptions where a
different code is utilised, e.g. in translation of mitochondrial genes (Knight et al. 2001).
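The counts given above (64 triplets, 3 stop codons, 61 codons specifying 20 amino acids) can be checked directly. The sketch below builds the standard genetic code from the conventional compact layout, in which codons are enumerated with bases in the order U, C, A, G and amino acids are given as one-letter codes, with `*` marking a stop. The variable names are hypothetical; variant codes such as the mitochondrial one are not represented.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons in UCAG order ('*' = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base U
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G

GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

stop_codons = sorted(c for c, aa in GENETIC_CODE.items() if aa == "*")
sense_codons = [c for c, aa in GENETIC_CODE.items() if aa != "*"]
distinct_amino_acids = set(GENETIC_CODE.values()) - {"*"}

print(len(GENETIC_CODE), stop_codons, len(sense_codons), len(distinct_amino_acids))
```

The redundancy described in the text is visible in the layout itself: most amino acids occupy runs of two or four adjacent codons that differ only in the third base.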
1.4.2.1.2 Open reading frames
Given this code a sequence of triplets that specify a chain of amino acids commencing with a
start codon and ending with a stop codon can be defined as an open reading frame (ORF)
(Brown 2006). An open reading frame can occur in any of six possible reading frames as there are two
strands to a DNA molecule and an ORF can start from the first, second or third nucleotide
within either strand as illustrated below.


Figure 1.2: Starting positions for possible ORFs within a double stranded DNA molecule.
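The six reading frames can be enumerated explicitly. The following is a minimal sketch, assuming the standard start codon ATG and the three standard stop codons, which reports every start-to-stop stretch in each frame of both strands. The function names and the toy sequence are hypothetical, and no minimum ORF length is applied.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(strand, offset):
    """Yield (start, end) for each ATG...stop stretch in one reading frame."""
    start = None
    for i in range(offset, len(strand) - 2, 3):
        codon = strand[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def six_frame_orfs(dna):
    """Return (strand label, frame offset, ORF sequence) for all six frames."""
    dna = dna.upper()
    found = []
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand, offset):
                found.append((label, offset, strand[start:end]))
    return found


# Toy sequence invented for illustration only.
print(six_frame_orfs("CCATGAAATAGGG"))
```

A start codon that is never followed by an in-frame stop is deliberately not reported, since it does not satisfy the definition of an ORF given above.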
1.4.2.1.3 Exons/Introns
ORFs as discussed above are subsections of the primary transcript or pre-mRNA molecule.
ORFs are interrupted within pre-mRNA by sections known as introns (Brown 2006). The
sections of the ORF thus separated by the introns are known as exons (Brown 2006). Thus in
order to produce a molecule containing the full, uninterrupted ORF it is necessary to excise
the introns and splice the exons together as shown in Figure 1.4.

Figure 1.3: Exons and introns within a pre-mRNA molecule.


Figure 1.4: Exons post splicing.

Figure 1.5: Exons post splicing in an alternate configuration.

It is not necessary however for all the exons within a given ORF to be utilised (Brown 2006)
as shown in Figure 1.5. Different permutations of exons can be created to produce different
protein molecules. This process is known as alternative splicing and is responsible for the
disparity between the number of genes within a eukaryotic genome and the number of
proteins it is capable of producing (Strachan and Read 2004). Alternative splicing is a feature
of higher eukaryotes and contributes to overall protein diversity (Black 2003). Estimates of
how many human gene products are alternatively spliced include 60% (Black 2003) and 74%
(Johnson et al. 2003).
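The way in which one pre-mRNA can give rise to more than one mature transcript can be illustrated with a toy example. In the sketch below the exon and intron sequences are invented, exon boundaries are assumed to be known in advance as half-open coordinates, and the `splice` function is a hypothetical helper rather than a model of the spliceosome.

```python
def splice(pre_mrna, exons, keep=None):
    """Join exons from a pre-mRNA.

    exons: list of (start, end) half-open coordinates in transcript order.
    keep:  optional list of exon indices to retain; all exons if omitted.
    """
    chosen = exons if keep is None else [exons[i] for i in keep]
    return "".join(pre_mrna[start:end] for start, end in chosen)


# Toy transcript: three exons separated by two introns (sequences invented).
exon_seqs = ["AUGGCU", "GAAGGU", "UGGUAA"]
intron_seqs = ["GUAAGUUUUCAG", "GUAAGUCCCCAG"]
pre_mrna = (exon_seqs[0] + intron_seqs[0] + exon_seqs[1]
            + intron_seqs[1] + exon_seqs[2])
exons = [(0, 6), (18, 24), (36, 42)]

full_isoform = splice(pre_mrna, exons)               # all three exons
skipped_isoform = splice(pre_mrna, exons, keep=[0, 2])  # second exon skipped
print(full_isoform, skipped_isoform)
```

Exon skipping is only one of several forms of alternative splicing; it is used here because it is the simplest to express.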
1.4.2.2 Post Transcriptional processing (cont)
Having now discussed the necessity of posttranscriptional modification it is now possible to
move on to the mechanisms by which splicing is carried out as well as covering other
elements of posttranscriptional processing.
1.4.2.2.1 RNA splicing
As mentioned above the primary transcript or pre-mRNA is treated so as to excise intronic
sequences and splice together exonic sequences. In order for this process to occur a necessary
first step is the recognition of the borders between exons and introns. These areas are known
as splice junctions (Strachan and Read 2004). It has been observed in a large number of cases
that introns in pre-mRNA commence with the sequence GU and end with the sequence AG
(Strachan and Read 2004). These dinucleotides are not in themselves sufficient to signal a
splice junction (Strachan and Read 2004) as splice junctions have been observed to show a
greater degree of conservation (Breathnach et al. 1978). In vertebrates the following motifs
have been observed at splice junctions (Brown 2006).

5′ splice site 5′-AG↓GUAAGU-3′

3′ splice site 5′-PyPyPyPyPyPyNCAG↓-3′

In these consensus sequences the ↓ symbol indicates the border between an exon and
intron or vice versa (Brown 2006). Py indicates that the nucleotide is a pyrimidine and N
indicates that any nucleotide could be present at this position (Brown 2006). In addition to
the conserved sequences at splice junctions introns also contain a conserved sequence around
40 bp away from the end of the intron known as the branch sequence (Strachan and Read
2004). A large RNA-protein complex known as the spliceosome carries out the
process of RNA splicing (Strachan and Read 2004). The spliceosome is one of the
largest molecular machines in the human cell containing ~170 distinct proteins (Valadkhan
and Jaladat 2010).
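The two vertebrate splice site consensus sequences given above can be written as patterns and scanned for in a transcript. This is a sketch under the assumption of exact matches to the consensus; real splice sites deviate from it, and the branch sequence is not modelled. The function names and the toy transcript are hypothetical.

```python
import re

# Consensus motifs from the text, over the RNA alphabet.
# Py (pyrimidine) = C or U; N = any nucleotide.
DONOR = re.compile(r"(?=AGGUAAGU)")              # exon ...AG | GUAAGU... intron
ACCEPTOR = re.compile(r"(?=[CU]{6}[ACGU]CAG)")   # intron ...Py x6, N, CAG | exon


def donor_sites(rna):
    """0-based index of the first intronic base (the G of GU) per match."""
    return [m.start() + 2 for m in DONOR.finditer(rna)]


def acceptor_sites(rna):
    """0-based index of the first exonic base following each match."""
    return [m.start() + 10 for m in ACCEPTOR.finditer(rna)]


# Toy transcript assembled piecewise (sequences invented for illustration).
toy = "CCAG" + "GUAAGU" + "AAAA" + "CUUUCU" + "C" + "CAG" + "GCC"
donors = donor_sites(toy)
acceptors = acceptor_sites(toy)
intron = toy[donors[0]:acceptors[0]]
print(donors, acceptors, intron)
```

In the toy transcript the predicted intron begins with GU and ends with AG, in line with the observation attributed to Strachan and Read (2004).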
The process of RNA splicing typically involves the following sequence (Brown 2006;
Strachan and Read 2004):

Cleavage of the 5′ splice junction detaching the exon from the intron at one end.

The attachment of the cleaved 5′ end to the branch sequence forming a lariat like
structure.

Removal of the intronic lariat like RNA structure and the ligation of the two
exons.

1.4.2.2.2 Capping
Another step in posttranscriptional modification of protein-coding genes is capping. This
process is the first step in posttranscriptional processing of eukaryotic pre-mRNAs (Alberts
2002). This entails the addition of a methylated nucleoside (a nucleoside is a molecule
consisting of a deoxyribose or ribose sugar bound to a nitrogenous base (Brown 2006)) to the
5′ end of the transcript (Strachan and Read 2004). This process protects the
transcript from rapid degradation via ribonuclease digestion (Strachan and Read 2004).
1.4.2.2.3 Polyadenylation
After the termination of transcription the primary transcript is also modified via the addition of
about 200 adenosine nucleotides to the 3′ end of the transcript (Alberts 2002). This structure
is known as a poly-A tail. The process is thought to facilitate the transport of the mature
mRNA molecule into the cytoplasm (Strachan and Read 2004).
1.4.3 Translation
After a transcript associated with a protein-coding gene has been transcribed and processed, it
then migrates to the cytoplasm, where a process known as translation occurs. This process
entails the production of a polypeptide chain that is specified by the transcript via the genetic
code. The mature mRNA molecule is not synonymous with an ORF (Strachan and Read
2004). Generally an ORF is a subsection within the mature transcript. The ORF is flanked by
sequences known as the 5′ UTR and 3′ UTR (UTR = untranslated region) (Brown 2006) as
illustrated in Figure 1.6.

Figure 1.6: Schematic of mature mRNA.


The process of translation occurs at cytoplasmic structures known as ribosomes. Ribosomes
are large RNA-protein complexes, which consist of two subunits (Strachan and Read 2004).
The larger subunit is known as the 60S subunit and consists of three different types of
ribosomal RNA (rRNA) molecule and up to 50 ribosomal proteins (Strachan and Read 2004).
The smaller subunit is known as the 40S subunit and contains a single rRNA molecule and
over 30 ribosomal proteins (Strachan and Read 2004). The two subunits of the ribosome exist
as separate entities and attach for the process of translation.
The other molecule that provides the physical basis for the implementation of the
genetic code is transfer RNA (tRNA). tRNA has a secondary structure consisting of four
double helical structures as illustrated in Figure 1.7. tRNA attaches to an amino acid at its 3′
end. The anticodon arm of the tRNA molecule has a triplet sequence, which is
complementary to the codon of the amino acid to which it is bound. Thus tRNA attaches
codons to their corresponding amino acids.

Figure 1.7: Structure of a tRNA molecule. Adapted from (Alberts 2008).

The process of translation typically proceeds via the following steps (Strachan and Read
2004):

The two subunits of the ribosome attach to each other and also to a mature mRNA
molecule at the methylated cap at the 5′ end.

The mRNA molecule is then pulled through the ribosome.

When a start codon is encountered a tRNA molecule with an anticodon arm
complementary to the start codon enters the ribosome. This tRNA molecule will
have the relevant amino acid pre-bound to it.

The next tRNA corresponding to the next codon will then enter the ribosome.

The amino acid attached to the first tRNA will detach from the tRNA and attach
to the amino acid attached to the 3′ end of the second tRNA.

This process is iterated constructing a polypeptide chain or protein molecule.

When a stop codon is encountered a protein known as a release factor causes the
ribosome to dissociate and release the protein molecule.

In order to prevent premature folding of proteins during translation the emerging
polypeptide chain is stabilised by proteins known as chaperones (Alberts 2008).
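The steps above can be mirrored computationally: scan the transcript from the 5′ end for a start codon, then read successive triplets until a stop codon is reached. The sketch below assumes the standard genetic code, takes the first AUG as the start, and ignores the 5′ cap, the UTRs and every other aspect of ribosome behaviour. The `translate` function and the toy mRNA are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code as one-letter amino acid codes in UCAG order ('*' = stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate from the first AUG until a stop codon or the transcript end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the polypeptide is released
        peptide.append(amino_acid)
    return "".join(peptide)


# Toy mRNA: short 5' UTR, then AUG GCU UGG UAA, then a short 3' UTR.
print(translate("GGAUGGCUUGGUAAAA"))
```

The untranslated bases on either side of the ORF are simply ignored, which corresponds to the 5′ UTR and 3′ UTR described earlier in this section.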

1.5 Genomics
The term genome can be defined as the entire genetic complement of a living organism
(Brown 2006). The field of study around ascertaining information about the genome of a
living organism is thus known as genomics. The primary step of any full genomic study is the
determination of the DNA sequence of the genome of the organism in question. Once this has
been determined the next step is annotating the sequence.
1.5.1 Genome annotation
The full genome of an organism is generally a mosaic of functional and non-functional
elements. The percentage of an organism's genome that is functional is variable. In the case
of the human genome it has been calculated that potentially between 2.56% and 3.25% is
functional (Lunter et al. 2006).
Functional elements in a genome include:

Genes.

DNA binding sites.

CpG Islands: These are stretches of the dinucleotide repeat CG. These areas of DNA
are subject to methylation, which is a form of epigenetic control over gene
transcription (Kawaji and Hayashizaki 2008).
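As an illustration of how one such element might be located from sequence alone, the sketch below computes two statistics commonly used to flag candidate CpG islands: the GC fraction of a window and the ratio of observed to expected CG dinucleotides. This heuristic is an assumption introduced here for illustration and is not taken from the sources cited above; no thresholds are asserted, and the function name is hypothetical.

```python
def cpg_stats(dna):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window.

    Expected CpG count is taken as (number of C * number of G) / length,
    i.e. the count expected if C and G occurred independently.
    """
    dna = dna.upper()
    n = len(dna)
    c_count, g_count = dna.count("C"), dna.count("G")
    cpg_count = dna.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    expected = c_count * g_count / n if n else 0.0
    ratio = cpg_count / expected if expected else 0.0
    return gc_fraction, ratio


# Toy windows invented for illustration only.
print(cpg_stats("CGCGCGCG"))
print(cpg_stats("ATATATAT"))
```

In practice such statistics would be computed over a sliding window along the genome, with window size and cut-offs chosen by the analyst.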

Genome annotation can be described as the systematic location of these functional
elements within a genome sequence (structural annotation) and the ascertainment of that
function (functional annotation). The location of functional elements is based on the principle
of sequence specifying function. Thus the sequence of a functional element will vary in some
detectable way from the remainder of the background sequence.
1.5.2 Genome and cDNA assembly
The initial challenge after the generation of sequencing data is the fact that the output of DNA
sequencing is generally reads of short stretches of DNA. These reads range in length from
> 700 bp for Sanger sequencing (Hert et al. 2008), through ~200 bp for pyrosequencing
(Sundquist et al. 2007), down to ~50 bp for ligation based sequencing methods
(McKernan et al. 2009).
These short reads have to be assembled into a full sequence for the whole genome.
This process is known as contig assembly. Contig assembly is carried out through scanning a
set of short reads for overlaps. The discovery of an overlap indicates that two fragments are
contiguous and should be connected. This process is necessary both at the level of the full
genome as well as at the level of the individual gene (Wang et al. 2005a).
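The idea of joining reads through their overlaps can be sketched with a simple greedy strategy: repeatedly merge the pair of reads sharing the longest suffix-prefix overlap. This is an illustrative assumption rather than a description of any particular assembler; it ignores sequencing errors, repeats and reverse-complement reads, and the function names and toy reads are hypothetical.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


# Toy reads invented for illustration only.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The `min_len` parameter guards against joining reads on the basis of very short, chance overlaps.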
1.5.3 Gene detection
Given a fully sequenced and assembled genome lacking annotation there are a number of
computational techniques available to delineate coding sequence. These can be divided into
two main subtypes: extrinsic and intrinsic (Borodovsky et al. 1994). Extrinsic methods utilise
comparisons of sequence data to an external reference point while intrinsic methods evaluate
sequences based on properties that are internal to the sequence (Borodovsky et al. 1994).
Construction of a cDNA library is one of the standard methods of extrinsic gene
detection. cDNA stands for complementary DNA and is created through application of an
enzyme known as reverse transcriptase to mature mRNA. Reverse transcriptase as the name
implies reverses the process of transcription and creates a DNA strand complementary to the
single stranded mRNA. Further steps are then taken in order to create a double stranded DNA
molecule (Strachan and Read 2004).
A library of cDNA sequences is compiled through the collection of mRNA molecules
from cells under various experimental conditions. This RNA is then converted to cDNA
using the enzyme reverse transcriptase. The resultant cDNA is then amplified using the
polymerase chain reaction (PCR) (Mount 2004) and then sequenced. The library of sequences
thus generated corresponds to the sequence of protein coding genes within the genome minus
the introns. These sequences are then systematically mapped onto the genomic sequence
using local alignment algorithms. This technique is known as cis-alignment. There are a
number of local sequence alignment programs that can be used to carry out these alignments.
Exonerate is one such program. It utilises a bounded dynamic programming approach (Slater
and Birney 2005) to generate local alignments. Dynamic programming is discussed in more
detail later in this chapter. Another program that can be utilised is Spidey (hosted by the
NCBI). This program employs the BLAST heuristic algorithm (Altschul et al. 1990) to
generate its alignments. SIM4 is another program that utilises an algorithm based on BLAST
but tailored to the specific problem of mapping cDNA to genomic DNA by factoring in
introns and potential sequencing errors (Florea et al. 1998).
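To give a concrete sense of what a local alignment score is, the sketch below computes the best local alignment score between two sequences using plain Smith-Waterman dynamic programming with a linear gap penalty. It is not the algorithm implemented by Exonerate, Spidey or SIM4, it does not model introns, and the scoring parameters are arbitrary values assumed for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b.

    Only the score is returned; two rows of the matrix are kept in memory.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(0,                      # start a new alignment
                             previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                             previous[j] + gap,       # gap in b
                             current[j - 1] + gap)    # gap in a
            best = max(best, current[j])
        previous = current
    return best


# Toy sequences invented for illustration only.
print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Because each cell may fall back to zero, the score reflects only the best matching region, which is what makes the alignment local rather than global.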
The Ensembl automatic genome annotation system (Curwen et al. 2004; Potter et al.
2004) uses the algorithm GeneWise (Birney et al. 2004) to map cDNA to full genomic data
and the algorithm GenomeWise (Birney et al. 2004) to create a final putative structure for the
gene in question after the initial alignment. Cis-alignment can be considered to be one of the
most reliable methods for protein coding gene detection/prediction (Brent 2008).
In cases where cDNA libraries are not available or are incomplete for the organism under
consideration it is also possible to use cDNA sequences of homologous genes from either the
same species or a different species in order to detect coding sequence. This technique is also
referred to as trans-alignment and is central to various gene prediction tools (Brent 2008).
The GeneWise algorithm (Birney et al. 2004) is also used in this context by the Ensembl
pipeline (Potter et al. 2004). Extrinsic methods for genome annotation are far more cost and
labour intensive than the strictly in-silico intrinsic approach.
Intrinsic approaches to gene detection are predominantly computational and as such
require an explicit definition/description in order to delineate between coding and non-coding
sequence (Picardi and Pesole 2010). Picardi (Picardi and Pesole 2010) gives a good working
definition of a gene for detection purposes, which defines a gene as a transcribed region of
DNA whose expression is regulated by cis acting elements such as upstream promoters.
Examples of tasks undertaken as a part of intrinsic gene detection include:

ORF (Open Reading Frame) detection: Detection of a potential ORF in genomic
DNA is an indicator of a potential gene (Mount 2004). As prokaryotic genes in most cases
(exceptions are pointed out in (Edgell et al. 2000)) lack introns, ORF
detection drastically reduces the search space for potential genes in the case of
prokaryotes.

Promoter region detection: Genes are typically associated with one to several
promoter regions. In prokaryotes these include the upstream Pribnow box with the
consensus sequence TATAAT. This sequence is analogous to the eukaryotic
TATA box (Berg et al. 2007). Detection of these motifs within a sequence upstream
of an ORF strengthens the case for a potential gene.

Internal splice junction detection: As the sequence of exon-intron borders is broadly
conserved, discovery of splice junctions can also contribute to the case for a
prospective gene.

These features can be detected within a stretch of sequence using various
techniques to model sequence motifs ranging from simple regular expressions to hidden
Markov models and position weight matrices (Picardi and Pesole 2010). Examples of specific
applications of the intrinsic approach to gene prediction include SNAP (Korf 2004) and
Genscan (Burge and Karlin 1997), both of which utilise Markov models in order to detect
delineating features of genes. The primary weakness of the intrinsic approach lies in the fact
that it requires a representative sample of protein coding genes specifically from the
organism under consideration in order to operate (Aubourg and Rouze 2001).
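The simplest of the intrinsic signals described above, ORFs and a promoter motif matched by a regular expression, can be illustrated with a minimal sketch. It is not the approach of SNAP or Genscan, which rely on Markov models. The function names, the minimum ORF length and the 40-base upstream window are assumptions chosen for illustration, and only the forward strand is scanned.

```python
import re

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    An ORF here is an ATG followed by in-frame codons up to the first stop
    codon. `end` is exclusive and includes the stop codon. ATGs nested inside
    a reported ORF in the same frame are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                j = i + 3
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOPS:
                    j += 3
                if j + 3 <= len(dna):  # an in-frame stop codon was found
                    if (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))
                    i = j + 3
                    continue
            i += 3
    return sorted(orfs)


def has_pribnow_box(dna, orf_start, window=40):
    """True if the consensus TATAAT occurs within `window` bases upstream."""
    upstream = dna.upper()[max(0, orf_start - window):orf_start]
    return re.search("TATAAT", upstream) is not None


# Toy sequence (hypothetical): a Pribnow-like motif followed by a short ORF.
sequence = "CCTATAATCCGGCC" + "ATGAAACCCGGGTAA" + "CC"
for start, end in find_orfs(sequence):
    print(start, end, has_pribnow_box(sequence, start))
```

An exact-match regular expression is the weakest of the motif models mentioned above; a position weight matrix would additionally score near matches to the consensus.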
1.5.4 Functional annotation of genes
After a putative gene has been identified the next stage is determination of the exact
biological role of the product coded for. This process can be carried out computationally or
by entirely laboratory based techniques.
1.5.4.1 Laboratory based techniques
Laboratory based techniques for determination of biological function involve alteration of the
gene in question in the organism of study (in the case of prokaryotes, unicellular
eukaryotes and higher eukaryotes which are deemed suitable). In the case of
organisms where modification would be impractical or unethical, such as Homo sapiens,
the homologous gene in a model organism is altered instead. The main model organism of choice
for study of mammalian gene function is Mus musculus (Kim et al. 2010). The main
alterations that are possible include:


Knockouts: This entails the removal of the gene in order to observe the effects of its
absence. This technique is only effective if the gene in question is not essential to
organism survival and has a visible/measurable effect on phenotype (Moore 1999).

Alteration in expression: In cases where the gene in question is essential to the
survival of the organism, alterations can be made to the cis-regulatory regions of the
gene in question in order to affect levels of expression (Capecchi 2005).

In order to physically pinpoint specific tissues (in the case of multi-cellular
organisms) or areas within a cell in which a protein is active it is possible to place a
reporter gene such as GFP (green fluorescent protein) under the control of the promoter
region of the gene of interest (Chalfie et al. 1994).

Detection of genetic interactions: The interaction of two non-essential genes (and
hence their associated proteins) can be detected if the mutation of both genes leads to
lethality (von Mering et al. 2002). This method has been applied to a large-scale study
in Saccharomyces cerevisiae in order to characterise its set of genetic interactions
(Ooi et al. 2006). The detection of an interaction partner of known function can aid in
the determination of the function of an unknown gene.

1.5.4.2 Computational methods for functional annotation of genes


Computational methods to determine gene function have only become applicable relatively
recently as most computational methods depend on comparison of novel sequence data with
sequence of known function. Computational methods of functional annotation of genes can
be split into a number of broad categories (Pellegrini 2001).

Alignment based methods.

Genome Context methods.

1.5.4.2.1 Alignment based methods


Sequence alignment is a problem that has been at the heart of bioinformatics since the
inception of the field. The basic sequence alignment problem is searching two strings for
areas of similarity (Mount 2004). The products of genes with similar/identical sequences are
extremely likely to carry out the same function. Genes that share a significant degree of
sequence similarity are potentially homologous (descended from a recent common ancestral
gene) to each other. Using these methods, laboratory-based annotation only
needs to be carried out on one representative of a given set of similar/identical sequences and the
derived functional annotation can be applied to all members. Alignment methods can be
applied at either the gene or the protein level.
There are three primary ways of carrying out pairwise sequence alignments.

Dot matrix analysis: This method entails arranging one sequence horizontally and the
other sequence vertically, perpendicular to the first and starting from the left end of the horizontal
sequence. Matches between the two sequences are then marked with a dot. Areas of
similarity can then be viewed as diagonal lines of dots within the matrix (Mount
2004).

Dynamic programming: Dynamic programming is a programming paradigm which
entails the reduction of a large problem to a series of sub-problems whose solutions
are constructed incrementally and combined to provide the overall solution (Russell et
al. 2003). In terms of sequence alignment it entails the construction of a matrix
similar to the dot matrix and calculating a path through it, where the next step in the
path is determined only by the state of the current cell and its neighbouring cells.
Two popular dynamic programming algorithms utilised in pairwise alignment of
sequences are the Needleman-Wunsch algorithm (Needleman and Wunsch 1970)
which returns an optimal global alignment of two sequences and the previously
mentioned Smith-Waterman algorithm (Smith and Waterman 1981) which provides
an optimised local alignment. Both of these algorithms are proven to return the
optimal alignment between two sequences (Mount 2004).

Heuristic Algorithms: Both the dynamic programming algorithms mentioned above
are O(n^2) in terms of both memory utilisation and time taken to run (Mount
2004). As such heuristic algorithms such as Blast (Altschul et al. 1990) and Fasta
(Pearson and Lipman 1988) were developed as usable alternatives. The Fasta
algorithm constructs a sequence alignment by searching for matching sequence
patterns called k-tuples. These patterns are k consecutive matches between the two
sequences. These matches are then extended to provide the alignment (Krane and
Raymer 2003). Blast constructs an alignment in a similar manner by locating short
matches and then building an alignment around them. The difference between Blast and
Fasta is that while Fasta examines all possible k-tuples the Blast algorithm is
restricted to only examining matches that are significant and score over a given
threshold (Mount 2004). These matches have to be of a minimum length to achieve
significance. This is 3 for proteins and 11 for DNA. Significance for proteins is
judged through use of the BLOSUM62 substitution matrix (Mount 2004). Given the
rapid expansion of most of the large sequence databases it is typical to use heuristic
algorithms as a search tool.

Profile Hidden Markov models have been used by Eddy (Eddy 1998) to create a
scoring system, which allows detection of remotely homologous sequences. Hidden
Markov models score the probability of a discrete chain of observed events based on a
model whose underlying states are hidden, i.e. not directly observed (Durbin 1998).
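The dynamic programming approach described above can be sketched in a few lines. This is a minimal illustration that returns only the optimal score, not the alignment itself. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) rather than a substitution matrix such as BLOSUM62; the function name and parameters are hypothetical. With `local=False` it follows the Needleman-Wunsch recurrence and with `local=True` the Smith-Waterman recurrence.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    Each cell depends only on its upper, left and upper-left neighbours.
    Time and memory are both proportional to len(a) * len(b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local alignment: a path may restart anywhere at zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))
print(alignment_score("TTTGATTACATTT", "CCGATTACACC", local=True))
```

The full matrix is what makes the method quadratic in both time and memory, which is the cost that the heuristic algorithms avoid by only extending short seed matches.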

Alignment methods can also be applied to the three dimensional structures of protein
molecules as well as sequence (Hasegawa and Holm 2009). This method is potentially useful
in cases where sequence divergence reaches a point where two proteins can no longer be
identified as homologous. However as the rate of structure generation lags behind sequence
generation by a considerable degree this method can only be applied in a small subset of
cases.
Detection of a significant alignment with a gene of known function can be used to attach
the same function to a gene of unknown function. Martin (Martin et al. 2004) used GO terms
(Ashburner et al. 2000) in conjunction with Blast (Altschul et al. 1990) to achieve this with
some success. There is however a danger with alignment based methods of a Chinese
whispers effect. Suppose for example that a gene p with known function a displays 90 %
identity, using some form of pairwise alignment algorithm, with a gene q of unknown function.
Assigning function a to gene q would seem to be intuitively legitimate. However if gene q
was assigned function a and the process was iterated a number of times a situation could arise
where a gene x would be assigned function a with little or no sequence similarity to the
original gene p. Examples of incorrect annotation by automated methods of homology
detection occur in the case of genes where translations of the antisense strand of the coding
region are entered into databases such as GenBank (Linial 2003).
1.5.4.2.2 Genome context methods
The recent proliferation of genome data has made it possible to detect and assign function to
proteins through examination of their genomic context. Genome context methods compare
and contrast the context of a gene between genomes (i.e. the arrangement of its homologues
in other genomes). Context methods are based on the principle of guilt by association, which
is the hypothesis that genes which show proximity or association by some measure, e.g.
phyletic distribution or chromosomal ordering, are functionally associated (Aravind 2000).
Thus through demonstration of functional association or interaction between one gene/protein
of known function with one of unknown function, the latter entity may be annotated with the
function of the former.
1.5.4.2.2.1 Rosetta stone
The Rosetta stone method or detection of domain fusion was recognised through work by
Marcotte (Marcotte et al. 1999) and Enright (Enright et al. 1999) which showed that sets of
separate proteins in one organism which exist in a unified (fused) homologous form in
another organism are likely to be interaction partners. As fusion events are comparatively
rare and generally affect genes that are tightly functionally coupled, this method is effective at
detection of interaction partners (Kensche et al. 2008). However the rarity of these events
lowers the overall coverage of this method.
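The logic of the Rosetta stone method can be sketched as follows. This is an illustrative simplification and not the procedure of Marcotte or Enright. It assumes that homology searches have already been run and summarised as a dictionary (the hypothetical `hits`) giving, for each query protein, the region of each protein in a second genome that it matches. Two query proteins are reported when they match non-overlapping regions of the same protein. A real analysis would also need to exclude query proteins that are homologous to each other.

```python
from itertools import combinations


def rosetta_stone_pairs(hits):
    """Return (protein_1, protein_2, fused_protein) candidate triples.

    hits: {query_protein: {candidate_fused_protein: (start, end)}}, where
    (start, end) is the matched region on the candidate fused protein.
    """
    pairs = []
    for p, q in combinations(sorted(hits), 2):
        for fused in sorted(set(hits[p]) & set(hits[q])):
            (s1, e1), (s2, e2) = hits[p][fused], hits[q][fused]
            if e1 < s2 or e2 < s1:  # the two matched regions do not overlap
                pairs.append((p, q, fused))
    return pairs


# Hypothetical identifiers and coordinates, for illustration only.
example_hits = {
    "A": {"F1": (1, 200)},
    "B": {"F1": (250, 480)},
    "C": {"F1": (150, 300)},
}
print(rosetta_stone_pairs(example_hits))
```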
1.5.4.2.2.2 Gene neighbour
Examination of nine bacterial and archaeal genomes by Dandekar (Dandekar
et al. 1998) showed that the proteins encoded by genes which showed conserved physical
order along a chromosome tended to interact physically.
1.5.4.2.2.3 Interolog detection
The term interolog was introduced by Walhout (Walhout et al. 2000) to describe a conserved
interaction between a pair of proteins. If both proteins involved in an interaction in a given
organism are conserved in another organism, a similar interaction can be inferred in the second
organism. This method has shown comparable accuracy with large-scale experimental data (Yu et al. 2004b).
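Interolog inference amounts to mapping each known interaction through a table of orthologues. The following minimal sketch assumes that orthologue assignments are already available as a dictionary; the function name, the data structures and the identifiers are hypothetical.

```python
def infer_interologs(interactions, orthologs):
    """Transfer interactions from a source organism to a target organism.

    interactions: iterable of (protein_a, protein_b) pairs in the source.
    orthologs: {source_protein: [target_protein, ...]}.
    An interaction is only transferred when both partners have orthologues.
    """
    predicted = set()
    for a, b in interactions:
        for a_target in orthologs.get(a, ()):
            for b_target in orthologs.get(b, ()):
                predicted.add(tuple(sorted((a_target, b_target))))
    return predicted


# Hypothetical yeast interactions and human orthologues.
yeast_interactions = [("yA", "yB"), ("yA", "yC")]
yeast_to_human = {"yA": ["hA"], "yB": ["hB1", "hB2"]}
print(sorted(infer_interologs(yeast_interactions, yeast_to_human)))
```

The pair involving yC is not transferred because yC has no recorded orthologue, which reflects the requirement that both interaction partners be conserved.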
1.5.4.2.2.4 Phylogenetic profiling
Phylogenetic profiling is a method that operates on the hypothesis that functionally linked
proteins evolve in a correlated manner (Pellegrini et al. 1999). Consider for example a
group of genes/proteins, which exist as a self-contained modular group and are associated
with a particular cellular function. If this associated function was no longer needed by a given
set of organisms the selective pressure to maintain all the genes/proteins within that group
would be lowered thus leading to an eventual correlated cascade of losses for the genes in
question. Genes are primarily lost through pseudogenisation, which is the conversion of a
functional gene to a non-functional copy. This can be caused by mutations that cause the
premature truncation of a transcript through the creation of a premature stop codon or a
mutation in upstream cis-regulatory sequences thus removing the potential for transcription
(Brown 2006). Pseudogenes can also be formed through retrotransposition of mature mRNA
(Graur et al. 1989).
Thus through examination of multiple genomes for correlations in the presence and
absence of proteins potential functional linkages can be detected. A phylogenetic profile is
typically a binary string representing the presence or absence of a homolog of a given
gene/protein. Predictions are made through examination of levels of similarity between these
strings. These predictions are suggestive in their nature rather than specific as it is unclear
what the nature of a functional linkage between two proteins with similar profiles might be.
The relationship could be a direct physical interaction such as subunits involved in
heterodimerisation or more indirect such as the link between a transcription factor and the
product of its associated gene.
The first use of phylogenetic profiles to predict functional linkages used Hamming
distance as a metric in order to cluster similar profiles (Pellegrini et al. 1999). The Hamming
distance of two strings can be defined as the number of points at which they differ (Hamming
1950). There have been various extensions and reinterpretations of the method since then
(Ranea et al. 2007). Some of these involved examination of profiles using higher logical
operations to carry out more complex comparisons of profiles (Bowers et al. 2004; Antonov
and Mewes 2008). The method was also applied to protein domains rather than whole
sequences (Pagel et al. 2004b). Work by Ranea utilised domain information from the Gene3D
database to create phylogenetic profiles of the presence and absence of structural domains
within genomes (Ranea et al. 2007). This method thus bypasses the problem of identification
of genes that are functionally homologous by focussing on the presence and absence of
predefined domains within proteins. Chen and Vitkup used examination of correlation
coefficients to measure similarity in phylogenetic profiles (Chen and Vitkup 2006). They
observed that the method was successful in identifying genes that were members of the same
metabolic pathways (Chen and Vitkup 2006).
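Two of the profile similarity measures mentioned above, Hamming distance and a correlation coefficient, can be computed directly on binary profile strings. The sketch below is a minimal illustration. The function names are hypothetical, Pearson correlation is assumed as the correlation coefficient, and each position in a profile is taken to represent one genome (1 for presence of a homolog, 0 for absence).

```python
from math import sqrt


def hamming(profile_a, profile_b):
    """Number of genomes in which the two presence/absence profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(x != y for x, y in zip(profile_a, profile_b))


def pearson(profile_a, profile_b):
    """Pearson correlation between two binary profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    x = [int(c) for c in profile_a]
    y = [int(c) for c in profile_b]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A gene present in all genomes (or in none) carries no signal.
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


print(hamming("110100", "110101"))
print(pearson("1100", "0011"))
```

Neither measure takes account of the phylogenetic relationships between the genomes, so profiles that are similar purely because of shared ancestry score as highly as profiles that are similar because of functional linkage.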
As a tool phylogenetic profiling could be used to detect errors in genome annotation
through the detection and display of gene absences, which are not plausible in closely
related species. A similar approach has in fact been used by Pinney to detect and annotate
enzyme-coding genes in the protist E. tenella (Pinney et al. 2005).
Other extensions to the method, which utilise the phylogenetic
relationships of the organisms, include work by Barker and Pagel (Barker and Pagel 2005).
This method made use of an explicit phylogeny and ancestral reconstruction over the
phylogeny based on a continuous-time Markov model. The likelihood of a model of
dependent or contingent evolution was compared with the likelihood of a model of
independent evolution over the phylogeny. This method was then further extended by
investigating the effects of constraining the rate at which genes could be acquired over the
phylogeny (Barker et al. 2007).
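The comparison of the two models can be summarised as a likelihood ratio statistic. The following is a minimal sketch and not the implementation used by Barker and Pagel. It assumes that the log-likelihoods of the two models have already been obtained. It further assumes, purely for illustration, that the statistic is compared against a chi-square distribution with four degrees of freedom, corresponding to an unconstrained dependent model with eight rate parameters and an independent model with four; this approximation may not be appropriate for constrained models.

```python
from math import exp


def likelihood_ratio_test(loglik_independent, loglik_dependent):
    """Return (statistic, approximate p-value) for dependent vs independent.

    statistic = 2 * (logL_dependent - logL_independent).
    The p-value uses the closed-form chi-square survival function for four
    degrees of freedom: exp(-x / 2) * (1 + x / 2). Four is an assumption.
    """
    statistic = max(2.0 * (loglik_dependent - loglik_independent), 0.0)
    p_value = exp(-statistic / 2.0) * (1.0 + statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods for one pair of profiles.
statistic, p_value = likelihood_ratio_test(-50.0, -45.0)
print(statistic, round(p_value, 4))
```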
Other methods of incorporating phylogenetic information included the work by Vert
(Vert 2002), which utilised support vector machines, as well as the work by Cokus (Cokus et
al. 2007), which utilised phylogeny as a heuristic by ordering profiles by the phylogenetic
closeness of the organisms involved.
1.5.4.2.2.5 Comparative methods
Comparing phylogenetic profiles over a phylogenetic tree can be considered to be an
application of the comparative method to traits at the molecular level. The comparative
method is a well-established method in biology (Harvey and Pagel 1991). The fundamental
idea underpinning the comparative method is to examine how the state of one factor (which can be a
trait or environmental condition) influences the state of another in the context of the
topology of a phylogenetic tree (Maddison 1990). Testing for correlations without
considering the phylogeny will detect correlations in gene content based on phylogenetic
relationships rather than functional linkage. For example the set of all genes that are intrinsic
to the class Mammalia will share similar phylogenetic profiles. This does not however
suggest that they are all functionally linked.
There are a number of tests that have been developed in order to test the correlations
in the states of traits over a phylogeny. Ridley (Ridley 1983) developed one of the earliest of
these tests. This test involved the construction of a 2x2 contingency table where the state of
each trait was considered as a categorical variable defined at each node in the tree. The
method assumed the construction of an accurate phylogeny and accurate reconstruction
of ancestral context for each node within the phylogeny. Ridley's method did not however
differentiate between dependent and independent variables in measuring the significance of a
given set of changes (Maddison 1990). The method also did not take into account the sequence of
changes in the states of traits (i.e. was a change in state A followed by a change in state B or
vice versa). This makes the results of the method difficult to interpret (Maddison 1990).
Joe Felsenstein (Felsenstein 1985b) developed another test for correlations in traits
over a phylogeny. This test was developed to measure continuous data and modelled changes
over a tree as a Brownian process. Another test for detection of correlations in traits and/or
external environmental conditions was devised by Grafen. This test was a phylogenetically
corrected regression, which did not rely on any form of ancestral reconstruction (Grafen
1989).
Maddison developed a similar test to Ridley's in 1990 (Maddison 1990). It however
did distinguish between dependent and independent variables by defining areas of a phylogeny
to be in state A or state B depending on the state of one of the traits under consideration. The
test then measured how many of the changes in the other trait occurred in the area of the tree
that was in state A compared to how many changes were possible over the whole tree.
One of the issues with the tests described above was the fact that none of them
integrated information on branch lengths of the phylogeny. This meant that a
change in the state of a given trait was treated as equally likely over a branch of a phylogenetic tree
regardless of its length. However a change on a short branch is clearly less likely than on a
longer branch. Work by Pagel took this into account by integrating branch lengths into a test
for correlated evolution (Pagel 1994). The parameters defined by this work were utilised by
Barker and Pagel in their approach to phylogenetic profile analysis (Barker and Pagel 2005).
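The effect of branch length can be illustrated with the simplest continuous-time Markov model, a single binary trait with one rate of gain and one rate of loss. This is a deliberate simplification for illustration: the test described above models two traits jointly, and the function name and rate values below are hypothetical. Under this two-state model the probability that a trait absent at the start of a branch is present at its end is (gain / (gain + loss)) * (1 - exp(-(gain + loss) * t)), where t is the branch length.

```python
from math import exp


def gain_probability(rate_gain, rate_loss, branch_length):
    """P(present at end of branch | absent at start) for a two-state model."""
    total = rate_gain + rate_loss
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    return (rate_gain / total) * (1.0 - exp(-total * branch_length))


# The same rates give a smaller probability of change on a shorter branch.
for length in (0.1, 1.0, 10.0):
    print(length, round(gain_probability(1.0, 1.0, length), 4))
```

As the branch length grows the probability approaches the equilibrium frequency gain / (gain + loss), so very long branches carry little information about the state at their origin.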
1.5.4.2.2.6 Mirror trees
Another method for detecting potential protein interactions is known as mirror trees. This
method involves the detection of protein interactions through the construction and
comparison of phylogenetic trees of proteins within a single genome (Pazos and Valencia
2001). The rationale behind this method is similar to that of phylogenetic profiling. However
correlation is sought not in the presence and absence of homologous genes but in the pattern
of sequence evolution of interacting proteins. Trees are compared by examining distance
matrices of homologous sequences for correlations. These matrices are the inputs used in the
formation of the trees in question. The phylogenetic tree of any given protein in a genome
will however carry signal from the speciation events, which shaped the genome of the
organism in question. An upgrade of the method has been developed to take into account this
background similarity (Pazos et al. 2005). Hakes and others have however pointed out that
the evolutionary pressures as well as the functional constraints on duplicated genes differ
depending on whether the mechanism of duplication was whole genome duplication or small-scale
duplication (Hakes et al. 2007). This indicates that sequence divergence and functional
evolution are not necessarily correlated (Robertson and Lovell 2009). Thus any similarity in
the phylogenetic trees of functionally linked genes is more likely to be due to chance or as
mentioned above due to background similarity.
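The core comparison in the mirror tree method, correlating the distance matrices of two protein families, can be sketched as follows. This is a minimal illustration without the correction for background similarity. It assumes that both matrices are symmetric and list the same species in the same order; the function names and the example distances are hypothetical.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the pairwise distances above the diagonal into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mirror_tree_correlation(dist_a, dist_b):
    """Pearson correlation between two pairwise distance matrices."""
    x, y = upper_triangle(dist_a), upper_triangle(dist_b)
    if len(x) != len(y):
        raise ValueError("matrices must cover the same set of species")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / sqrt(var_x * var_y)


# Two hypothetical families whose distances differ only by a constant factor.
family_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
family_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
print(mirror_tree_correlation(family_a, family_b))
```

Because every protein family in a genome shares the same underlying species tree, a high value from this calculation alone cannot distinguish co-evolution from shared speciation history.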
1.5.5 Storage of functional information
With the exponential increase in sequence data that has been generated through the 2000s
there have been a number of attempts to organise and contextualise functional
information surrounding genomic entities.
1.5.5.1 GO
A notable attempt to do this has been the establishment of a controlled vocabulary with which
to describe the functional role of a gene as well as its physical location within the cell. The
vocabulary is known as the Gene Ontology (GO) (Ashburner et al. 2000). GO associates a set
of terms with gene products. These terms are known as GO terms and fall into three general
domains. These are:

Cellular component: This is the physical location within the cell where the gene
product is generally to be found.

Biological process: This is the biological pathway or process in which the gene product
participates.

Molecular function: This is at a lower level than the biological process domain and
includes the specific molecular capabilities of the molecule in question. An example
of molecular function could be the ability to bind a particular metal.

Terms are organised as a network starting from the root terms defined above. As the
network is traversed starting from a root term, terms become more specific, i.e. if term B
is directly below term A in the ontology then term B is a subclass of term A.
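The subclass relationship means that a gene product annotated with a specific term is implicitly annotated with every more general term above it. The sketch below illustrates this with a toy hierarchy. The parent table is a simplified, hypothetical fragment written for this example and is not taken from the GO release files; the function name is also hypothetical. Each term maps to a list of parents because a term in a network may have more than one.

```python
# Toy, simplified hierarchy for illustration only (not real GO data).
PARENTS = {
    "zinc ion binding": ["metal ion binding"],
    "metal ion binding": ["ion binding"],
    "ion binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("zinc ion binding", PARENTS)))
```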
1.5.5.2 KEGG
Another database that localises gene products within functional pathways is KEGG (Kyoto
Encyclopedia of Genes and Genomes) (Kanehisa 1997; Kanehisa et al. 2006). KEGG
maintains a list of functional pathways of processes that occur within the cell. These
processes are arranged in a similar manner to GO in that they start from general categories
and become more specific.

1.6 Transcriptomics
The transcriptome of a cell can be considered to be the sum total of the parts of its genome that are
transcribed into RNA. Studying the transcriptome can also yield insights into the
functionality of gene products.
1.6.1 Microarrays
At the transcriptomic level the putative function of a gene can be at least partially determined
through establishing the association of the expression of a particular gene with a particular
external condition or treatment. This can be achieved through the use of glass slides known
as microarrays (Mount 2004). These slides have oligonucleotides, which are subsections of a
set of genes, attached to them. Cells of the organism under study are subjected to different
experimental conditions. mRNA is then extracted from these cells, converted to cDNA and
labelled with a fluorescent dye that is unique to each condition. By examining the relative degrees of fluorescence for the
colours associated with the two versions of the cDNA of the gene of interest it is possible to
measure levels of gene expression in response to a given experimental condition. A variant of
this involves using full cDNA molecules as the contents of the chip.
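The relative fluorescence of the two dyes at a spot is commonly expressed as a log ratio, so that equal increases and decreases in expression are symmetric about zero. The following is a minimal sketch under that assumption. The function name, the base-2 logarithm and the pseudocount (added to avoid division by zero) are illustrative choices, and no background correction or normalisation is applied.

```python
from math import log2


def log2_ratio(treated_intensity, control_intensity, pseudocount=1.0):
    """Log2 ratio of two dye intensities measured at one microarray spot."""
    return log2((treated_intensity + pseudocount) / (control_intensity + pseudocount))


# Hypothetical intensities: a four-fold increase gives a log2 ratio of 2.
print(log2_ratio(399.0, 99.0))
print(log2_ratio(99.0, 99.0))
```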
1.6.2 Other methods for transcriptome examination
Expression levels for a given environmental condition can also be measured through direct
sequencing and counting through use of SAGE (serial analysis of gene expression). In
this method mRNA is extracted from the cells of interest. A small section is excised from
each mRNA molecule. A tag is then connected to each separate subsection. These
subsections are then amplified and the tags counted thus providing a measure of gene
expression levels (Velculescu et al. 1997). Another protocol for sequencing mRNA to detect
gene expression levels has also been developed. This protocol is known as RNA-Seq and is
made feasible through the utilisation of the high throughput nature of next generation
sequencing (Wang et al. 2009b).
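The counting step shared by these sequencing based methods can be sketched as follows. This minimal illustration assumes that each sequenced tag has already been read and that a lookup table from tag sequence to gene is available; the function name, the tags and the gene names are hypothetical. Tags that do not map to a known gene are ignored.

```python
from collections import Counter


def tag_counts(tags, tag_to_gene):
    """Count sequenced tags per gene as a simple measure of expression."""
    counts = Counter()
    for tag in tags:
        gene = tag_to_gene.get(tag)
        if gene is not None:
            counts[gene] += 1
    return counts


# Hypothetical tags and tag-to-gene assignments.
observed = ["AAGT", "AAGT", "CCTA", "GGGG", "AAGT"]
lookup = {"AAGT": "geneX", "CCTA": "geneY"}
print(tag_counts(observed, lookup))
```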
1.7 Proteomics
Proteomics, in a similar way to genomics and transcriptomics, is the study of the full protein
complement produced by a cell. The proteomic level is the point where the connection
between macromolecules and measurable phenotypes is first bridged. Proteins can be
considered as making up close to the totality of both structural (e.g. microtubules) and active
(e.g. enzymes) components of a cell. The function of a protein can be determined by the
determination of its structure and/or the determination of its interaction partners.
1.7.1 Protein Structure
There are two main methods utilised to determine the three dimensional structure of a protein
molecule (Brown 2006). These are:

X-Ray crystallography: This procedure involves the production of a crystal from the
protein of interest. X-rays are then fired through this crystal to acquire a
diffraction pattern. This diffraction pattern can then be used to reconstruct the
structure of the protein. X-ray crystallography is limited by the fact that it requires the
protein to be able to crystallise (Brown 2006).

NMR spectroscopy: NMR or nuclear magnetic resonance is the phenomenon in which the
nuclei of atoms placed in a magnetic field absorb and re-emit electro-magnetic radiation.
By bombarding a protein with electro-magnetic radiation, these
patterns of resonance can be used to work out the structure of the protein (Brown
2006).

1.7.2 Protein interactions


There are two primary modes of protein interaction. The first
is a direct physical interaction. Direct physical interactions between distinct proteins can
occur in two contexts (Orengo et al. 2003). These are:

Formation of a stable complex: A protein complex is a stable structure formed by two
or more proteins to carry out a specific function. In order to maintain the structural
integrity of a complex, proteins within the complex have to maintain relatively long
term direct physical interactions. The subunits of the ribosome are an example of a
stable protein complex, as are the histone octamer and RNA polymerases (Orengo
et al. 2003). Not all interactions within a protein complex are direct as members of a
complex with more than two interacting partners do not necessarily have to be
physically connected to every other protein within that complex.

Transient interaction: These are functional interactions where proteins physically
interact but also exist independently in their own right (Orengo et al. 2003). An
example of a transient interaction is the interaction between the human proteins Rho
and RhoGap, which triggers a signalling cascade involved in cytoskeleton formation
and cell proliferation (Nooren and Thornton 2003).

The other form of interaction between proteins is indirect interaction. An example of this
could be two proteins that have a role in a given metabolic pathway but whose production is
temporally and spatially separated. Examples of indirect interactions include the interaction
between SHC-transforming protein and mitogen-activated protein kinase 1 over several steps
of the insulin-signalling pathway (Sasaoka and Kobayashi 2000).
The full collection of all protein interactions within a cell has been labelled the interactome.
1.7.2.1 Experimental detection of protein interactions
Protein interactions can be detected using a variety of techniques. The main techniques
include:

Yeast two-hybrid: One widely used (Marcotte et al. 1999) method for detecting protein
interactions is the yeast two-hybrid technique. This technique exploits the S.
cerevisiae GAL4 transcription factor. This transcription factor has two domains that
require physical proximity in order to operate. One of these domains binds DNA and
the other domain activates transcription. A protein interaction can
be detected by fusing two genes of interest to these two domains respectively on
separate plasmids and inserting these plasmids into a yeast cell with a reporter gene
downstream of the GAL4 transcription factor-binding site. Reporter gene transcription is
only possible if the protein products of the two genes of interest are able to maintain
a physical interaction (Griffiths 2002). The primary drawbacks to this method are
that all interactions must take place in the nucleus, removing a large number of
proteins from their native cell compartment, and that only binary protein interactions
can be tested for (von Mering et al. 2002). The yeast two-hybrid method also has a
high rate of false positives. One reason for this is that pairs of proteins that stick
together are not necessarily ever expressed at the same time or in the same tissue
(Vidalain et al. 2004). Also some proteins such as heat shock proteins are inherently
promiscuous in their binding affinities (Vidalain et al. 2004).

Proteome chips: In a manner similar to the use of microarrays described above for the
measurement of gene expression levels microarrays can also be used with proteins.
By printing translations of 5800 ORFs from S. cerevisiae on to a microarray chip Zhu
and others (Zhu et al. 2001) were able to detect 33 novel interactions for the
multifunctional calcium binding protein calmodulin. The drawbacks to this method are that
it is low throughput and again is restricted to binary interactions.


Mass spectrometry of purified complexes: In order to detect interactions that are not
binary, complexes of proteins can be isolated using techniques such as tandem affinity
purification. This technique entails the tagging of a protein of interest with a tag that
allows the purification of the main protein and any complex partners that it might
have. These complexes can be characterised through the use of mass spectrometry
(von Mering et al. 2002).

1.8 Description of project.


This work details the development and application of a novel heuristic which combines
the Barker and Pagel approach to phylogenetic profiling (Barker et al. 2007;
Barker and Pagel 2005) with a novel data filter. The Barker and Pagel
approach to phylogenetic profiling will subsequently be referred to as constrained ML
(maximum likelihood). It was observed over the course of this project that this method could
be useful in elucidating novel protein interactions. Novel protein interactions will allow
further elucidation and annotation of protein function through the principle of guilt by
association as articulated above. The proteome of Homo sapiens is still filled with known
unknowns in terms of protein-protein interactions. The HPRD (Prasad et al. 2009) currently
contains 38,788 binary protein interactions and data on 998 protein complexes. Current
estimates of the size of the interactome, such as the estimate of 650,000 interactions by
Stumpf et al. (2008), imply that the majority of protein-protein
interactions have not yet been elucidated. The potential of phylogenetic profiling to detect
novel interactions has been demonstrated in work by Ramazzina (Ramazzina et al. 2006)
where two novel genes involved in the degradation of uric acid were detected. The
phylogenetic profiling method has also been successful in identifying enzymes of the
MEP/DOXP pathway (Cunningham et al. 2000).
Chapter 2 details the construction of a eukaryotic phylogeny over 54 taxa as well as
the phylogenetic profiles of known proteins within the human genome relative to the other 53
species which was one of the essential precursor steps to this study.
Chapter 3 contains the results of the application of the method in context and
compares it to a comparable high throughput experimental technique. Specifically the method
is compared to detection of protein-protein interactions as well as indirect functional linkages
through co-expression of genes as measured by microarrays. The method is also compared to
PIPs, which is the protein interaction prediction system maintained by Barton (McDowall et
al. 2009; Scott and Barton 2007). This system makes novel predictions through the
combination of different informative features.
Chapter 4 describes the construction of the data filter, which is based on Dollo
parsimony. The filter reduces the size of the overall search space facilitating the use of the
method for whole genome comparisons. This is achieved through the elimination of pairs of
proteins whose function cannot be detected via examination of patterns of presence and
absence.
Chapter 5 presents a network of predictions generated as a putative human
interactome of proteins that are susceptible to this line of enquiry. This network is
analysed for consistency with known data. A set of novel predictions is presented.
Finally Chapter 6 will sum up this work and present details on potential future
directions.

Chapter 2
Reconstruction of eukaryotic phylogeny as precursor to comparative
analysis
2.1 Introduction
Examination of the evolutionary histories of organisms is a fundamental step for any form of
study of biological function as adaptation can only be examined within an historical context
(Harvey and Pagel 1991). As a phylogeny is by definition an evolutionary history of species
(Harrison and Langdale 2006) it is a necessary step within the process of a comparative
study. In terms of examination of changes in gene content within a probabilistic framework it
provides the necessary topology over which such changes occur. This is a fundamental
parameter in any such model.
2.1.1 Homology
The fundamental object of any phylogenetic study, whether molecular or morphological, is
the comparison of homologous structures within the organisms under consideration. When
genomic data is under consideration homologous structures within organisms correspond to
those genomic elements that were present in the last common ancestor of the set of
organisms under consideration. These elements can provide a measure of divergence (Fitch
1970). These elements if functional (which is implied by conservation) can either maintain
their ancestral function or if sufficiently diverged have a new (or no) function. In discussions
of elements of genomes (genes) there are a number of subclasses of homologous
relationships. These are:

Orthology: Genetic elements are orthologous if they are the direct product of
divergence from a common ancestral species (speciation) (Fitch 1970).

Paralogy: Genetic elements are paralogous if they are the product of a duplication
event within a given species. Mechanisms of duplication include retrotransposition
(insertion of reverse transcribed RNA back into a genome) and unequal crossover
leading to tandem duplication of a portion of a chromosome (Hurles 2004). It is
thought that these duplication events are a major force in creating and broadening
genetic repertoires (Zhang 2003).

Xenology: Genetic elements are xenologous if they are the product of a direct
exchange of DNA between organisms (Fitch 2000). These exchanges are known to be
far more prevalent in prokaryotes given their lack of a true nucleus and the existence
of plasmids (free floating segments of DNA) in some prokaryotes. Genes have also
been observed as xenologous in eukaryotes. Xenologous genes in eukaryotes can be
acquired via organelles, which are the product of endosymbiosis such as the
mitochondrion and chloroplasts (Blanchard and Lynch 2000).
It is important for purposes of phylogenetic reconstruction to be able to draw a distinction
between genes which are paralogous and which are orthologous. If paralogous genes are
compared between species the distance between them does not necessarily reflect the overall
genetic divergence between the species under consideration. Genetic elements that are
orthologous provide information on levels of divergence between speciation events whereas
those that are paralogous provide data on duplication events.
A converse relationship to homology is that of analogy where through convergent
evolution genes that share no common ancestry develop and maintain sequence similarity due
to similar demands placed on the organisms in question by their environment. A classic
example of this at the molecular level is that of the convergent evolution of the enzyme
lysozyme in both the langur monkeys of the Indian subcontinent (Semnopithecus entellus)
and ruminants due to the similar requirements imposed by a herbivorous diet (Swanson et al.
1991).
2.1.2 Molecular evolution
The fundamental idea at the heart of modern biology is that of random mutations guided by
natural selection producing adaptations, which allow an organism to thrive in a given
ecological niche. The large-scale study of evolution at a molecular level has only recently
become possible due to advances in DNA sequencing technologies. This has been extremely
useful as random mutations occur at the molecular level and DNA and amino acid sequences are the
fundamental comparable common denominator across morphologically and physiologically
diverse species (Nei and Kumar 2000).
At the DNA level there are four basic types of mutation (Nei and Kumar 2000). These are:

Insertions: These mutations are insertions of additional nucleotides into a sequence of
nucleotides. These can be caused by replication errors (Brown 2006). If an insertion
occurs within an open reading frame it can cause the frame to be shifted hence
insertions in coding regions can also be referred to as frameshift mutations.

Deletions: Deletions are the opposite of insertions. Deletions within an ORF can also
cause a frame shift (Brown 2006).

Substitutions: These mutations are also referred to as point mutations and involve the
substitution of a nucleotide with any other nucleotide. Substitutions do not necessarily
have to involve a single nucleotide (Brown 2006). There are two types of
substitution. Transitions entail the replacement of a purine with another purine,
e.g. A to G, or of a pyrimidine with another pyrimidine, e.g. C to T. Transversions
involve the replacement of a purine with a
pyrimidine or vice versa (Nei and Kumar 2000).

Inversions: An inversion mutation involves the reversing of the sequence of a strand
of DNA, e.g. the sequence TGA being replaced with AGT (Nei and Kumar 2000).
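
The distinction between transitions and transversions drawn above can be illustrated with a minimal Python sketch. The function name classify_substitution is hypothetical and the sketch assumes unambiguous DNA bases (A, C, G, T) only.

```python
# Illustrative sketch only: classify a single-nucleotide substitution.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old, new):
    """Return 'transition' or 'transversion' for a single-base change."""
    old, new = old.upper(), new.upper()
    valid = PURINES | PYRIMIDINES
    if old not in valid or new not in valid:
        raise ValueError("expected one of A, C, G, T")
    if old == new:
        raise ValueError("bases are identical, so this is not a substitution")
    # A transition stays within one chemical class, a transversion crosses classes.
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

Under these assumptions a change from A to G or from C to T is reported as a transition, and any purine to pyrimidine change as a transversion.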

2.1.2.1 Synonymous and non-synonymous mutations


Recalling the genetic code where nucleotide triplets known as codons specify amino acids,
mutations within coding regions can also be classified by the effect that they have on the
potential protein product. Thus mutations where the amino acid specified is altered are
known as non-synonymous mutations, while mutations where there is no effect on the amino
acid specified are referred to as synonymous mutations (Nei and Kumar 2000). Most
synonymous mutations occur in the third position of codons. Measuring the relative rate of
synonymous vs. non-synonymous mutations is a technique for detecting positive
Darwinian selection (Nei and Kumar 2000).
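
As an illustration of this classification, the following Python sketch compares the amino acids specified by two codons. It assumes the standard genetic code written in DNA letters; the names CODON_TABLE and classify_codon_change are hypothetical and introduced only for this example.

```python
# Illustrative sketch only: synonymous vs. non-synonymous codon changes
# under the standard genetic code (an assumption of this example).
from itertools import product

BASES = "TCAG"
# Amino acids in the conventional TCAG ordering of the 64 codons ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(old_codon, new_codon):
    """Return 'synonymous' if both codons specify the same amino acid."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    return "synonymous" if old_aa == new_aa else "non-synonymous"
```

For example, a third-position change from GGT to GGC leaves glycine unchanged, whereas a second-position change from GAA to GTA replaces glutamate with valine.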
2.1.3 Phylogenetic trees
The evolutionary relationship between organisms has traditionally been presented as a
tree-like structure, first depicted in 1801 by the French botanist Augustin Augier (Stevens
and Augier 1983). Intuitively it is fairly clear what an evolutionary tree represents.
Mathematically a tree can be defined as an acyclic graph. A graph is an abstraction, which
can be used to model binary relations between objects (Parida 2008). A graph G can be
defined as G(V,E) where V is a set of vertices or nodes and E is defined as
$E \subseteq (V \times V)$ (Parida 2008). E is thus a subset of the set of all ordered pairs that can be created from
elements of V. The elements of E are referred to as the edges of the graph (Parida 2008). A
tree is a graph where all vertices are connected, possessing the property that any two vertices
$v_1, v_2 \in V$ are connected by a unique path (Parida 2008). A vertex in a tree that has one
incoming edge and no outgoing edges is known as a leaf node (Parida 2008). All other vertices by contrast are
known as internal nodes (Parida 2008).
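
The definitions above can be made concrete with a short Python sketch. It treats edges as undirected and relies on the standard result that a graph is a tree if and only if it is connected and has exactly |V| - 1 edges. The function names is_tree and leaf_nodes are hypothetical.

```python
# Illustrative sketch only: test whether G(V, E) is a tree and list its leaves.
from collections import defaultdict, deque


def is_tree(vertices, edges):
    """True if the undirected graph is connected with |E| = |V| - 1."""
    vertices = set(vertices)
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Breadth-first search from an arbitrary vertex to check connectivity.
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == vertices


def leaf_nodes(vertices, edges):
    """Vertices touched by exactly one edge."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {v for v in vertices if degree[v] == 1}
```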

In the case of a phylogenetic tree leaf nodes are extant taxonomic units or taxa and
internal nodes are proposed hypothetical common ancestors as illustrated in Figure 2.1. A
subsection of a phylogenetic tree can be referred to as a clade (Nei and Kumar 2000).

Figure 2.1: Sample phylogenetic tree. In this tree the extant taxa are nodes A, B and C
while node E is an ancestral node for A and C.
2.1.3.1 Species trees and gene trees
There are two main types of phylogenetic tree that are commonly investigated. These are:

Species trees: The topology of these phylogenetic trees represents the branching order
of species. Thus internal nodes are hypothetical common ancestors for the nodes that
succeed them. The splits at these ancestral nodes represent speciation events. A
speciation event is considered to be the moment in time when two species were
reproductively isolated from each other (Nei and Kumar 2000).

Gene trees: Gene trees measure the degree of divergence between homologous genes
within and/or across species. Thus internal nodes in a gene tree represent a
hypothetical gene that existed prior to a mutation event that created its two immediate
descendants (Nei and Kumar 2000).

Figures 2.2 and 2.3 illustrate the differences between gene trees and species trees.

Figure 2.2: A species tree. Adapted from (Brown 2006).

Figure 2.3: A gene tree. Adapted from (Brown 2006).


2.1.3.2 Topologies and branch lengths
The branching pattern of a phylogenetic tree is known as its topology. The topologies of
phylogenetic trees can be rooted, as above in Figure 2.1, or unrooted, as presented below in
Figure 2.4.

Figure 2.4: Sample of an unrooted phylogenetic tree representing four taxa.


Theoretically the topologies of most phylogenetic trees are bifurcating, as ancestral
nodes will split into two descendant nodes at a given point in time. Multifurcation is, however, possible
in phylogenetic trees. A node with more than two descendants is referred to as a polytomy.
There are two types of polytomy: soft polytomies, where the multifurcation is attributable to a
lack of information, and hard polytomies, where species genuinely split into multiple
descendants simultaneously (Page and Holmes 1998). Most polytomies are treated as soft as
simultaneous speciation is considered unlikely (Page and Holmes 1998).
The number of possible unrooted bifurcating tree topologies B(t) can be calculated
using the formula given below (Salemi and Vandamme 2003) where t is the number of taxa
under consideration.

$$B(t) = \prod_{i=3}^{t} (2i - 5) \qquad (1)$$

The number of possible rooted bifurcating topologies B(t) can be counted using the following
formula:

$$B(t) = \frac{(2t - 3)!}{2^{t-2}\,(t - 2)!} \qquad (2)$$

Thus estimation of a phylogenetic tree is a problem that quickly becomes computationally
intractable as the number of taxa rises.
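
The growth in the number of topologies can be checked numerically. The following Python sketch evaluates the product form for unrooted trees and the factorial form for rooted trees; the function names are hypothetical.

```python
# Illustrative sketch only: count bifurcating tree topologies for t taxa.
from math import factorial


def unrooted_topologies(t):
    """Product of (2i - 5) for i = 3..t; defined for t >= 3."""
    if t < 3:
        raise ValueError("at least three taxa are required")
    count = 1
    for i in range(3, t + 1):
        count *= 2 * i - 5
    return count


def rooted_topologies(t):
    """(2t - 3)! / (2^(t - 2) (t - 2)!); defined for t >= 2."""
    if t < 2:
        raise ValueError("at least two taxa are required")
    return factorial(2 * t - 3) // (2 ** (t - 2) * factorial(t - 2))
```

Four taxa give 3 unrooted and 15 rooted topologies, while ten taxa already give 2,027,025 unrooted and 34,459,425 rooted topologies.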
Another attribute that can be added to a phylogenetic tree is the length of its
individual branches. As the nodes within a tree represent taxa, the lengths of the branches
between them represent the degree of evolutionary change between the taxa over time. A
phylogenetic tree with branch length information is also known as an additive tree, a metric
tree or a phylogram (Page and Holmes 1998).
2.1.3.3 Bootstrap support values
Another attribute commonly associated with internal nodes in a phylogenetic tree is the
bootstrap support value. This value reflects the number of times a particular internal node or
split is recovered when a phylogenetic analysis is repeated on a set of random re-samples (with
replacement) from the original dataset (Page and Holmes 1998).
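
A minimal Python sketch of the re-sampling step is given below. It assumes the dataset is a multiple sequence alignment stored as a dictionary of equal-length strings, and that columns are the units re-sampled. The function names are hypothetical, and tree estimation on each replicate is left out.

```python
# Illustrative sketch only: bootstrap re-sampling of alignment columns.
import random


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(sequence[c] for c in columns)
        for name, sequence in alignment.items()
    }


def support_percentage(replicate_has_split):
    """Percentage of replicate trees that contain a given split."""
    return 100.0 * sum(replicate_has_split) / len(replicate_has_split)
```

Each replicate keeps whole columns together, so the homology relationship between the taxa at a given site is preserved while the weighting of sites varies.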
2.1.3.4 Evolutionary models in tree estimation
In order to estimate the amount of evolutionary change between taxa, methods considered to
be effective tree estimators utilise models of evolution that specify information on the
evolutionary rate of substitution between homologous stretches of nucleotide or amino acid
data. These models are framed as $m \times m$ matrices where m is the number of entities in the
data type, i.e. 4 in the case of nucleotides and 20 in the case of amino acids. To illustrate, an
example of the simplest substitution model possible for DNA is the Jukes-Cantor model,
which assumes that all nucleotide substitutions occur at the same rate (Nei and Kumar
2000). Thus the substitution matrix for the Jukes-Cantor model is presented below where $\alpha$
represents this uniform rate of substitution.

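                A           T           C           G
A               -           $\alpha$    $\alpha$    $\alpha$
T               $\alpha$    -           $\alpha$    $\alpha$
C               $\alpha$    $\alpha$    -           $\alpha$
G               $\alpha$    $\alpha$    $\alpha$    -

Table 2.1: Rates of nucleotide substitution for the Jukes-Cantor model (Nei and Kumar
2000).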
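
The Jukes-Cantor model can be sketched in a few lines of Python. The sketch below is illustrative only: it sets each diagonal entry to minus three times the uniform rate so that every row sums to zero, which is a common convention assumed here, and it uses the standard closed-form probabilities of observing the same or a different nucleotide after time t. The function names are hypothetical.

```python
# Illustrative sketch only: Jukes-Cantor rate matrix and substitution probabilities.
from math import exp

NUCLEOTIDES = "ATCG"


def jukes_cantor_rate_matrix(alpha):
    """Nested dict of rates: alpha off the diagonal, -3 * alpha on it (assumed convention)."""
    return {
        x: {y: (-3 * alpha if x == y else alpha) for y in NUCLEOTIDES}
        for x in NUCLEOTIDES
    }


def jukes_cantor_probability(x, y, alpha, t):
    """Probability that nucleotide x is observed as y after time t."""
    decay = exp(-4 * alpha * t)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay
```

As t grows, every probability approaches one quarter, reflecting the loss of information about the ancestral state.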
The methods of tree estimation that utilise these models of evolution include the
distance method, tree estimation by Bayesian methods, and tree estimation by maximum
likelihood (Felsenstein 2004).
In distance methods an evolutionary model provides a measure of evolutionary
distance between taxa, whereas in probabilistic methodologies such as maximum likelihood
and Bayesian methods they provide a measure of probability for a given set of substitutions
between taxa. Evolutionary models can be calculated via a priori assumptions about the
evolutionary process or can be constructed empirically by examining the rate of observed
substitutions in homologous sequences. Examples of empirically calculated substitution
matrices for amino acids include the PAM matrices created in the seminal work by Margaret
Dayhoff (Dayhoff et al. 1978) and more recently the WAG (Whelan and Goldman 2001) and
LG matrices (Le and Gascuel 2008).
2.1.4 Detection of homology in molecular data
In order to construct a phylogenetic tree, which represents the evolutionary history of a set of
taxa using molecular data, it is necessary to compare homologous sequences. More
specifically it is necessary to detect orthologous genes/proteins. These genes/proteins are the
most appropriate measure of genetic divergence between species, as an equal level of genetic
divergence will have occurred since the speciation event causing the split.
There are a number of algorithms, which are utilised in the selection of homologous
genes/proteins and their subsequent classification as orthologous or paralogous. These
include:

Reciprocal Best Hits (RBH): This procedure is implemented by the COGs (Tatusov et
al. 2003) database hosted by the NCBI. The underlying rationale of the algorithm is
that orthologous genes between two species will possess more similarity with each
other than with any other gene. This similarity is generally established using pairwise
sequence alignment algorithms such as BLAST (Altschul et al. 1990) or the
Smith-Waterman algorithm (Smith and Waterman 1981).

InParanoid: This algorithm extends the idea behind RBHs by using them to seed
orthologous clusters, and then by an application of an iterative inclusion process
constructs a set of gene/protein families (Remm et al. 2001).

OrthoMCL: This process also utilises RBHs as seed pairs for clusters. Similarity
relations between genes/proteins are then established as a graph and additional
paralogous sequences are determined through a process of graph clustering (Li et al.
2003).

Reciprocal smallest distance (RSD): This procedure does not utilise RBHs and
instead, for a set of hits for a given query protein, over a given E-value (Expect
value), conducts pairwise alignments between each of the hits and the original query.
Hits that are alignable to a given threshold are then subjected to further analysis to
calculate the number of amino acid substitutions or distance between them and the
original query. The hit with the shortest distance is then used to reverse the process. If
the reversal yields the original query then the two sequences are declared orthologous
(Wall et al. 2003).

EnsemblCompara GeneTrees: This is an algorithm utilised by the Ensembl Compara
database (Vilella et al. 2009). The process involves RBHs. Two species are subjected
to an all against all pairwise alignment. Like OrthoMCL the resulting data is then
converted into a graph. This graph is then clustered. Gene trees are then constructed
using these clusters and reconciled against a gold-standard species tree.
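
Reciprocal best hits underlie several of the procedures listed above, and the idea can be sketched in Python as follows. The sketch assumes that similarity scores (for example alignment bit scores) have already been computed in both directions and are supplied as dictionaries keyed by (query, subject) pairs. It is not the implementation used by any of the cited tools, and the function names are hypothetical.

```python
# Illustrative sketch only: reciprocal best hits from precomputed scores.
def best_hits(scores):
    """Map each query to its highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

Because each gene has a single best hit in each direction, this procedure can only return one to one relationships.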
In comparative studies the InParanoid algorithm (Remm et al. 2001) was shown to
perform better than its rivals (Hulsen et al. 2006). This work showed the InParanoid
algorithm tied as the best performer with simple reciprocal best hits at identification of
orthologs. However reciprocal best hits in practice only yield one to one orthologous
relationships (Hulsen et al. 2006). This reduces the coverage of the method (Hulsen et al.
2006). OrthoMCL was shown to perform a close second to the InParanoid algorithm in
benchmarking tests (Hulsen et al. 2006). Subsequent benchmarking work (Altenhoff and
Dessimoz 2009) showed that OrthoMCL outperformed InParanoid to an extent at lower
levels of specificity but higher coverage. However, where benchmarking was applied to
data and organisms common to both reviews, the results were seen as broadly congruent
(Altenhoff and Dessimoz 2009).
2.1.5 Multiple sequence alignment
Given a set of orthologous sequences further processing is required in order to convert them
into a suitable input for a phylogenetic tree estimation procedure. This input is known as a
multiple sequence alignment (MSA) (Edgar and Batzoglou 2006). The process involves
creating an optimal alignment between three or more protein sequences. Insertions and
deletions between orthologous proteins are represented by introducing gaps into the
alignment. Alignments are scored through the use of substitution matrices. The process
converts orthologous sequences into a rectangular array where each column of the array
corresponds to a homologous attribute between the taxa under consideration (Edgar and
Batzoglou 2006).
Forms of multiple sequence alignment include:

Progressive: This form of alignment involves the construction of initial pairwise
alignments between all the sequences under consideration. The distances thus
established between the sequences are used to create a guide tree. The multiple
sequence alignment is then built up progressively in the order suggested by the guide
tree (Mount 2004). The main flaw with this method is that errors made at any stage of
constructing the MSA remain in the final alignment (Wheeler and Kececioglu 2007).
A very prominent example of a progressive MSA tool is the Clustal suite (Higgins
and Sharp 1988).

Iterative: In order to reduce the errors introduced by the progressive approach to MSA
the iterative approach realigns sub-groups of the sequences repeatedly (Mount 2004).
Examples of iterative MSA programs include MUSCLE (Edgar 2004) and DIALIGN
(Morgenstern et al. 1998). The performance of the iterative approach can be improved
by the inclusion of consistency information between the growing MSA and the
pre-computed pairwise alignments, as used by some of the algorithms within the MAFFT
(Multiple alignment by fast Fourier transform) program (Katoh et al. 2002).
The quality of a multiple alignment is crucial to the accuracy of the phylogenetic tree
created via its analysis (Blair and Murphy 2011). This is especially true when there are gaps
in the alignment (Talavera and Castresana 2007). Thus benchmarking tests have been carried
out to examine the performance of various algorithms currently available. The results of these
have found that MAFFT (running in its iterative, consistency enhanced mode) using the
Smith-Waterman algorithm (Smith and Waterman 1981) for its initial pairwise alignment
outperformed its nearest rivals (Ahola et al. 2006; Nuin et al. 2006). This mode of MAFFT is
known as MAFFT-L-INS-i.
2.1.5.1 Multiple sequence alignment quality filtration
Given the effects of MSA quality on phylogenetic analysis it is argued that filtration of
areas that are problematic to align will improve the outcome of subsequent phylogenetic
analyses (Talavera and Castresana 2007). It is common practice to edit MSAs by hand before
analysing them further, though it is considered that this makes all results thus gained
irreproducible through the subjectivity of the overall process (Blair and Murphy 2011). Thus
this process has been semi-automated by programs such as Gblocks (Talavera and Castresana
2007) and trimAl (Capella-Gutierrez et al. 2009). These programs will retain sections of
MSAs that are highly conserved and remove gaps in the alignment.
Gblocks will either remove all gaps in its stringent mode or only remove gaps if they are
present in more than half the sequences in the alignment in its relaxed mode (Talavera and
Castresana 2007). trimAl will remove columns from an alignment based on a conservation
threshold defined by the user, i.e. how much of the original alignment the user wishes to
conserve (Capella-Gutierrez et al. 2009). In benchmarking tests optimum performance for
Gblocks in enhancing tree estimation was observed using its relaxed mode (Capella-Gutierrez
et al. 2009).
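
The general idea of gap-based column filtration can be sketched in Python as below. This is a deliberately simplified illustration and is not the algorithm implemented by Gblocks or trimAl: it only drops columns in which the fraction of gap characters exceeds a threshold. The function name and the max_gap_fraction parameter are hypothetical.

```python
# Illustrative sketch only: remove alignment columns dominated by gaps.
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep columns whose gap fraction is at most max_gap_fraction."""
    names = list(alignment)
    length = len(alignment[names[0]])
    kept = []
    for column in range(length):
        gaps = sum(1 for name in names if alignment[name][column] == "-")
        if gaps / len(names) <= max_gap_fraction:
            kept.append(column)
    return {name: "".join(alignment[name][c] for c in kept) for name in names}
```

Setting max_gap_fraction to 0 removes every column containing a gap, while 0.5 only removes columns with gaps in more than half of the sequences, loosely mirroring the stringent and relaxed behaviours described above.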
2.1.6 Methods to estimate phylogenetic trees
The focus of this section as mentioned above shall be on the analysis of molecular data
though the methods described are applicable to any form of measurable polymorphic trait.
These data provide a measure of distance between the species under consideration.

The first subdivision in types of methods of phylogenetic analyses is between discrete
character state and distance matrix methods (Salemi and Vandamme 2003). Discrete
character state methods examine the differences in state of a set of discrete characters or
traits. Distance matrix methods utilise the distance between sets of data through the creation
of a matrix of pairwise distances and application of clustering techniques. Subtypes of the
character state method include the maximum parsimony method that does not utilise an
explicit model of evolution and maximum likelihood, which conversely does (Salemi and
Vandamme 2003).
2.1.6.1 Distance methods
Distance methods were originally developed to construct phenograms, i.e. diagrams which
reflect the similarity between a given group of taxa without consideration of
ancestor/descendant relationships (Salemi and Vandamme 2003; Sneath and Sokal 1973), as
opposed to phylogenies. Distance methods however can also be applied to elucidating
phylogeny under the assumption of equal rates of mutation in cases where a quick initial
result is required.
Distance methods of phylogeny depend on the construction of a matrix of pairwise
distances for the trait data of the organisms under consideration. This data is generally
nucleotide and/or amino acid sequence data though the method is also applicable to any other
form of discrete descriptive data. In the case of amino acid or nucleotide data distances are
estimated according to evolutionary models, which allow a meaningful calculation of the
evolutionary distance between two species.
The simplest form of evolutionary distance measure is the proportion of differing sites
between two sequences, p. This is calculated through a simple count of differing sites $n_d$ and
division by the total number of sites n as shown in Equation 3 (Nei and Kumar 2000).

$$p = \frac{n_d}{n} \qquad (3)$$
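
The proportion of differing sites can be computed with a short Python sketch. It assumes two aligned, gap-free sequences of equal length; the function name p_distance is hypothetical.

```python
# Illustrative sketch only: proportion of differing sites between two sequences.
def p_distance(seq_a, seq_b):
    """Number of differing sites divided by the total number of sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differing = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differing / len(seq_a)
```

For instance, two sequences of eight sites that differ at two of them have a p-distance of 0.25.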

p is an underestimate of evolutionary distance over extended periods of time as multiple
substitutions accumulate per site. Thus in order to represent this information substitutions
can be modelled as a Poisson process over time and then the probability of k mutations over
time t can be calculated by the standard Poisson distribution function where $\lambda$ = the rate
of mutations / unit time and e = the base of the natural logarithm (Nei and Kumar 2000).

$$p(k;t) = \frac{e^{-\lambda t} (\lambda t)^{k}}{k!} \qquad (4)$$

This probability can then be used to calculate a distance between two sequences. This
distance is referred to as the Poisson corrected distance (Nei and Kumar 2000).
The Poisson corrected distance assumes a homogeneous rate of mutations /
substitutions over a molecular sequence. This assumption however does not hold, as different
areas of a sequence (coding or non-coding in the case of nucleotides, for example) will be
subject to differing selective pressures and hence differing mutation rates (Nei and Kumar 2000).
This information is integrated into calculations of distance via the observation that
variation in rates of substitution over a sequence follows a gamma distribution (Nei and
Kumar 2000).
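
Both corrections can be written as simple functions of the proportion of differing sites p. The Python sketch below uses the commonly quoted forms for amino acid sequences, namely d = -ln(1 - p) for the Poisson corrected distance and d = a[(1 - p)^(-1/a) - 1] for the gamma distance, where a is the shape parameter of the gamma distribution. The function names are hypothetical, and the value of a must be supplied by the user.

```python
# Illustrative sketch only: Poisson corrected and gamma distances from p.
from math import log


def poisson_corrected_distance(p):
    """d = -ln(1 - p), assuming a uniform substitution rate across sites."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    return -log(1 - p)


def gamma_distance(p, a):
    """d = a * ((1 - p) ** (-1 / a) - 1), with gamma shape parameter a > 0."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    if a <= 0:
        raise ValueError("the shape parameter a must be positive")
    return a * ((1 - p) ** (-1 / a) - 1)
```

Both corrected distances exceed p, and the gamma distance approaches the Poisson corrected distance as a becomes large, i.e. as rate variation among sites vanishes.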
Having created a matrix of pairwise distances between the sequences under
comparison this matrix can then be used to generate a phylogenetic tree via clustering.
A commonly used form of clustering in the generation of distance-based trees is
neighbour joining. This algorithm proceeds through the following steps (Brown 2006):

Construction of a fully multifurcating star shaped tree including all taxa under
consideration.

The selection of a random pair of taxa and removing them from the star to
form a tree consisting of a clade containing that pair and a clade containing the
rest of the star.

Evaluation of the total branch lengths of the new tree.

Iteration of this process for all possible pairs storing the results of the branch
length calculation.

Identification of the pair, which yields the first interim tree with the shortest
branch length.

This pair is now placed on their own branch and the process is iterated until a
fully bifurcating tree is retrieved.
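
A compact Python sketch of neighbour joining is given below. Rather than explicitly rebuilding and measuring every interim tree, it uses the standard Q-criterion formulation of Saitou and Nei, which selects the pair whose separation from the star gives the shortest total branch length. The function name, the node labels (node1, node2, ...) and the dict-of-dicts input format are assumptions made for this example.

```python
# Illustrative sketch only: neighbour joining via the standard Q criterion.
def neighbour_joining(names, dist):
    """names: taxon labels; dist[a][b]: distance for every ordered pair a != b.
    Returns a list of (node, neighbouring node, branch length) tuples."""
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        counter += 1
        u = f"node{counter}"
        # Branch lengths from the joined pair to their new ancestral node.
        len_a = d[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = (d[a][k] + d[b][k] - d[a][b]) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges
```

The result is an unrooted tree expressed as a list of branches; for t taxa it contains 2t - 3 branches.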

Another method of tree estimation involving distance matrices is least squares fitting
in which for each tree the residual sum of squares is calculated between pairs of taxa. This
method is known as the Fitch-Margoliash method. This involves applying the following
equation (Nei and Kumar 2000).

$$R_s = \sum_{i<j} (d_{ij} - e_{ij})^2 \qquad (5)$$

Where $d_{ij}$ is the observed distance in the matrix between taxon i and taxon j and $e_{ij}$ is the
patristic distance between the taxa. The patristic distance between two taxa is the sum of the
branch lengths that make up the shortest path between the two taxa. The tree with the lowest
$R_s$ is selected by the method. Generally tree space is searched using a heuristic search
method as described below in Section 2.1.6.3.
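
For a single candidate tree the residual sum of squares can be computed as in the following Python sketch. The tree is assumed to be stored as an undirected adjacency dictionary of branch lengths, and the observed distances as a nested dictionary indexed in the order in which the taxa are listed. The function names are hypothetical and the search over tree space is not shown.

```python
# Illustrative sketch only: residual sum of squares for one candidate tree.
from itertools import combinations


def patristic_distance(tree, start, goal):
    """Sum of branch lengths on the unique path between two nodes.
    tree: dict node -> dict neighbour -> branch length (undirected)."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, travelled = stack.pop()
        if node == goal:
            return travelled
        for neighbour, length in tree[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, travelled + length))
    raise ValueError("nodes are not connected")


def residual_sum_of_squares(observed, tree, taxa):
    """Sum over pairs i < j of (observed distance - patristic distance) squared."""
    return sum(
        (observed[i][j] - patristic_distance(tree, i, j)) ** 2
        for i, j in combinations(taxa, 2)
    )
```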
Other standard techniques for this process are clustering methods such as UPGMA
(unweighted pair group method with arithmetic mean), which group organisms by degree
of closeness in the matrix. The underlying assumption of UPGMA is that the evolutionary
process occurs at a consistent pace, i.e. follows a molecular clock (Felsenstein 2004). Thus in
cases where data does not follow a molecular clock, UPGMA will deliver misleading results
as it will cluster species on short branches with each other (Felsenstein 2004).
Another commonly applied method is minimum evolution, which creates a tree where
the overall amount of evolution (measured by the total branch lengths of the tree from root to
tip) is minimised (Salemi and Vandamme 2003). Again tree space is traversed by heuristic
search as described below.
Distance methods are fast compared to character based methods and,
given a dataset with relatively constant rates of evolution and closely related taxonomic units,
fairly accurate (Felsenstein 2004). However they suffer from a systemic issue where if the
taxa under consideration display variability of rates of evolution along a sequence at different
points in a tree this cannot be detected as all distances between the sequence are calculated
locally, i.e. between adjacent species (Felsenstein 2004).
2.1.6.2 Discrete character state methods
Discrete character state methods operate on matrices populated with attributes or
characters assigned to each taxon under consideration. Possible trees are then evaluated against this
matrix in an attempt to satisfy an optimality criterion (Salemi and Vandamme 2003). One of
the two most popular optimality criteria is parsimony, which entails minimisation of the
amount of change required over a given tree to produce the data observed in the matrix. The
other widely used criterion for selection of trees is likelihood. This method frames the tree as
a hypothesis for the matrix of observed data and evaluates its likelihood given the matrix of
observed data (Felsenstein 2004). Maximisation of the likelihood function yields the
optimum tree.
2.1.6.2.1 Maximum Parsimony
The use of parsimony as a criterion for judging potential trees was first introduced by Camin and
Sokal in 1965 (Camin and Sokal 1965). The rationale behind preferring a tree that is more
parsimonious is based on the principle of Ockham's razor, which can be stated as: a simpler
explanation for an observed phenomenon is to be preferred to a more complex ad hoc
explanation (Steel and Penny 2000).
Specific variants of parsimony that can be utilised are (Felsenstein 2004):

Fitch parsimony: this form of parsimony is also known as Wagner parsimony
(Felsenstein 2004). This form of parsimony allows all possible changes in any
direction and counts them all equally.

Camin-Sokal parsimony: this form of parsimony only allows evolutionary
change in one direction (Camin and Sokal 1965). For example if a two state
character which can take on states 0 and 1 is considered Camin-Sokal
parsimony will only allow changes in the 0 to 1 direction (assuming that is the
direction selected as permissible) (Felsenstein 2004).

Dollo parsimony: this form of parsimony is based on the principle of
phylogenetic irreversibility (Lequesne 1974). The acquisition of a complex
character is allowed once and then all subsequent changes over the tree can
only be reversions (Felsenstein 2004).

Parsimony on an ordinal scale: this deals with the case where changes in a
multi state character are considered on an ordinal scale. Thus only changes
that are adjacent are allowed (Felsenstein 2004).

Polymorphism parsimony: this form of parsimony allows an intermediate state
of polymorphism to be acquired at a point within the tree. All changes
subsequent to the polymorphic areas in the tree are counted as a loss of one of
the composite characters that make up the polymorphic state (Farris 1978;
Felsenstein 1979; Felsenstein 2004).

Evaluating the number of character changes required over a particular tree for a given
character matrix is computationally easy and can be calculated rapidly through applications
of dynamic programming algorithms such as:

The Fitch algorithm (Fitch 1971): operates by carrying out a post order
traversal of the phylogenetic tree (Felsenstein 2004). At each internal node the
set of potential ancestral states is set to the intersection of the states of
its immediate descendant nodes if such an intersection exists. If no such
intersection exists then the state of the node is set to the union of the states of
the two descendant nodes.

The Sankoff algorithm (Sankoff 1975): is similar but not identical to the Fitch
algorithm (Felsenstein 2004). A cost matrix is created which stores the cost of
all possible changes of state within the context of the data under consideration.
Ancestral node states are then assigned by selecting the state with the minimal
cost.
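
As a minimal sketch of the Fitch procedure described above, the following Python function
carries out the post-order traversal for a single character on a binary tree. The
nested-tuple encoding of the tree, the function name fitch and the example states are
assumptions made for illustration only.

```python
def fitch(tree, leaf_states):
    """Return (candidate ancestral states, number of changes) for `tree`.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees.
    This encoding is a hypothetical choice made for this sketch.
    """
    if isinstance(tree, str):
        # A leaf contributes its observed state and no changes.
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_states)
    right_set, right_changes = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Non-empty intersection: no additional change is required here.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical example: one binary character scored in five taxa.
example_tree = ((("A", "B"), "C"), ("D", "E"))
example_states = {"A": "1", "B": "1", "C": "0", "D": "0", "E": "0"}
root_states, changes = fitch(example_tree, example_states)
print(root_states, changes)
```

Summing the returned change count over all characters in a matrix gives the parsimony
score of the tree under equal costs.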

The running time of both these algorithms increases linearly with the number of taxa and
the number of characters under consideration. The Fitch algorithm also scales linearly with
the number of states that those characters can take on, whereas the Sankoff algorithm
scales with the square of the number of states, as every pair of states is compared at
each node.
Parsimony-based methods have been popular due to their computational as well as
conceptual simplicity. Parsimony methods were recently utilised to construct the largest
phylogenetic tree ever reconstructed, consisting of 73,060 eukaryotic taxa, with a
combination of morphological and molecular data (Goloboff et al. 2009). However, in general
over the last two decades there has been a swing toward the use of likelihood-based methods
(Steel and Penny 2000). This is due in part to the demonstration (Felsenstein 1978) that
under given conditions maximum parsimony will converge on the wrong tree. These conditions
have come to be known as the Felsenstein zone.
2.1.6.2.2 Maximum likelihood
Maximising the likelihood of a given tree as a hypothesis to explain observed data was first
applied to phylogenetic inference by Edwards and Cavalli-Sforza (Cavalli-Sforza 1964). The
likelihood function assigns a value to the ability of a hypothesis to explain an observed set of
results. Assume a statistical model M that associates a probability with a set of possible
outcomes and a set of observed outcomes (or results) R. Thus P(R|M) is the probability of
observing R assuming M is a correct description of the process under study (Edwards 1992).
The likelihood L of M is defined as:
\[ L(M) = P(R \mid M) \times k \tag{6} \]
Where k is an arbitrary constant. Use of this constant allows relative comparison of
likelihoods (Edwards 1992). To paraphrase an example from (Durbin 1998): in the case of a
die, if our hypothesis is that the die is fair then the probability of any outcome is equal to
1/6 (approximately 0.167). If we go on to roll 5 sixes then this forms our observed data. The
likelihood of the hypothesis is then proportional to (1/6)^5, or approximately 0.000129.
Hypotheses can thus be judged on their relative abilities to explain observed results. A
hypothesis with a higher estimate of the probability of rolling a 6 would be a better fit to
the observed data in the case of the die. Thus likelihood provides a framework with which to
select a hypothesis or model appropriate to the observed data.
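
The die example can be worked through numerically. The short Python sketch below compares
the fair-die hypothesis with a hypothetical loaded die that shows a six half of the time;
the loaded probability of 0.5 and the function name are assumptions chosen purely for
illustration. Because the arbitrary constant k appears in both likelihoods, it cancels when
their ratio is taken.

```python
def likelihood(p_six, sixes, rolls):
    """Probability of one specific sequence of rolls containing `sixes` sixes,
    under the hypothesis that a six is rolled with probability `p_six`."""
    return p_six ** sixes * (1.0 - p_six) ** (rolls - sixes)


# Observed data from the text: five rolls, all of them sixes.
fair = likelihood(1.0 / 6.0, sixes=5, rolls=5)
# Hypothetical alternative hypothesis: a loaded die with P(six) = 0.5.
loaded = likelihood(0.5, sixes=5, rolls=5)

# The ratio compares the two hypotheses; the constant k cancels out.
ratio = loaded / fair
print(fair, loaded, ratio)
```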
In the case of phylogeny each tree is a hypothesis explaining the distribution of the
traits under consideration. The phylogeny with the maximum likelihood is selected as the
optimal tree. The likelihood of a tree can be measured through the application of a
substitution model of evolution, which models the probability of individual evolutionary
events over the tree. Empirically calculated substitution models can be used as a substitute for
the calculation of a set of probabilities, which permits the application of more generalised
rules of evolution to each individual phylogenetic study. Empirical models of evolutionary
events can be created through the examination of homologous sequences in different species.
Models currently in use for amino acid based phylogenies include the WAG (Whelan and
Goldman 2001) and LG (Le and Gascuel 2008) substitution models.
If a model is badly specified and a poor fit for the data then likelihood methods can
return an inaccurate tree with high statistical support (Keane et al. 2006). There are a limited
number of cases where parsimony methods can outperform likelihood-based methods; this set
of conditions has been called the inverse Felsenstein zone, or Farris zone (Siddall 1998). It
has been shown however that these cases are extremely rare in real data (Swofford et al. 2001)
and in cases where it is computationally feasible maximum likelihood has become one of the
dominant paradigms in phylogeny reconstruction.
2.1.6.2.3 Bayesian Methods
Another criterion related to likelihood is the posterior probability of a tree given a matrix of
observations and a prior probability for the tree. The posterior probability of a hypothesis is
the probability of the hypothesis being true given some observed data. The posterior
probability of a tree given a multiple alignment is calculated through the application of Bayes
theorem, which is defined as:

\[ P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)} \tag{7} \]

Where X and Y are separate events and P(Y|X) is the conditional probability of event Y given
event X has occurred. P(X) is known as the prior probability of event X. P(X|Y) is the
posterior probability of event X given event Y has occurred. P(X) represents a subjective prior
belief in the probability of X occurring.
In the case of phylogenetic analysis X is a phylogenetic tree and Y is a given multiple
alignment. It is however non-trivial to evaluate the posterior probabilities over all possible
tree topologies exhaustively (Huelsenbeck et al. 2001). This process has been made feasible
by sampling the distribution of posterior probabilities of trees. The posterior probability of a
particular tree is estimated as the proportion of samples in which it is visited during the
traversal of tree space. Tree space traversal is facilitated by the use of Metropolis-coupled
MCMC (Markov chain Monte Carlo) methods, first introduced in the doctoral work of Li and Mau
(Pickett and Randle 2005).
The algorithm returns a set of trees sampled from the posterior distribution. An
individual phylogeny is then generally assembled from the returned sample using
majority rule consensus methods (Cranston and Rannala 2007).
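
To illustrate how a posterior probability is read off a sample of trees, the following
Python sketch counts how often each topology occurs in a list of sampled trees. The
sample itself is invented, and representing each topology as a canonical Newick string is
an assumption of this sketch; real samplers also discard an initial burn-in, which is
omitted here.

```python
from collections import Counter


def posterior_from_samples(sampled_trees):
    """Estimate the posterior probability of each topology as the
    proportion of samples in which that topology was visited."""
    counts = Counter(sampled_trees)
    total = len(sampled_trees)
    return {tree: count / total for tree, count in counts.items()}


# Hypothetical sample of ten topologies returned by an MCMC run.
sample = ["((A,B),C);"] * 6 + ["((A,C),B);"] * 3 + ["((B,C),A);"] * 1
posterior = posterior_from_samples(sample)
best_tree = max(posterior, key=posterior.get)
print(best_tree, posterior[best_tree])
```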
Prior probabilities are a potential source of bias in Bayesian methods
(Holder and Lewis 2003). This issue can be ameliorated through the use of flat or
uninformative priors. Flat priors can however still bias a Bayesian phylogenetic study
towards trees with particular configurations of clades (Pickett and Randle 2005).
2.1.6.3 Heuristic search methods
Given the large number of possible topologies for even a small number of taxa, the
estimation of phylogenetic trees is a problem that is intractable by brute force searching. Thus
the space of all possible trees is usually searched heuristically (Felsenstein 2004). This
entails the selection of a random first tree, which is then evaluated using whatever measure
has been defined to assess the quality of a tree. Examples of possible quality measures for a
phylogeny include, as previously discussed, parsimony, likelihood and distance. The tree is
then altered, thus moving to a new point in tree space. This new tree is then evaluated. This
process is iterated until a local optimum has been reached within the space. This point is
not guaranteed to be a global optimum within the space (Felsenstein 2004).
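
The search loop described above can be sketched generically. In the Python example below a
one-dimensional list of scores stands in for tree space, which is a deliberate
simplification: the landscape values, the neighbours function and the score function are
all hypothetical. The sketch shows how a greedy search can stop at a local optimum that is
not the global one.

```python
def hill_climb(start, neighbours, score):
    """Greedy search: move to the best-scoring neighbour until no
    neighbour improves on the current point."""
    current, current_score = start, score(start)
    while True:
        candidates = list(neighbours(current))
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            # No improving move exists: a local optimum has been reached.
            return current, current_score
        current, current_score = best, best_score


# Toy stand-in for tree space (hypothetical scores, higher is better).
landscape = [0, 1, 2, 1, 0, 5, 9, 5]


def neighbours(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


def score(i):
    return landscape[i]


print(hill_climb(0, neighbours, score))  # stops at the local optimum
print(hill_climb(4, neighbours, score))  # reaches the global optimum
```

Restarting the search from several starting points is one simple way to reduce
the chance of reporting only a local optimum.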
Examples of alterations/moves that are used to traverse tree space include (Felsenstein 2004):

Nearest neighbour interchange (NNI): This process involves the swapping of adjacent
branches within a tree. This is a local rearrangement of the tree.

Subtree pruning and regrafting (SPR): This process involves the removal or pruning
of a subtree from an overall tree and reattaching it at another point. As opposed to
NNI this is a global rearrangement of the tree.

Tree bisection and reconnection (TBR): This involves the deletion of an interior
branch to split a tree into separate trees and then all possible connections are made
between the branch set of the first tree and the second. This is also a global
rearrangement of the tree.
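
As a small illustration of the most local of these moves, the Python sketch below lists the
two alternative arrangements reachable by NNI around a single interior branch. The
four-subtree encoding and the function name are assumptions made for this example; SPR and
TBR are not shown.

```python
def nni_neighbours(a, b, c, d):
    """Given an interior branch that joins subtrees (a, b) on one side to
    (c, d) on the other, return the two topologies obtained by swapping
    one subtree across that branch."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


# Hypothetical example with four leaves standing in for the four subtrees.
current = (("A", "B"), ("C", "D"))
for alternative in nni_neighbours("A", "B", "C", "D"):
    print(current, "->", alternative)
```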

Global rearrangements are more radical moves within the tree space and thus are less
likely to become trapped in local optima. Modern phylogeny estimation programs generally
provide the option to carry out either form of rearrangement. The advantage of using
local rearrangements is greater speed in arriving at an optimum tree. Examples of
programs that offer this choice include a number of programs within the PHYLIP suite
(Felsenstein 1989) and PhyML (Guindon and Gascuel 2003). PhyML is generally as
accurate as other phylogeny estimation programs while being considerably faster
(Dereeper et al. 2008). Programs within PHYLIP can carry out multiple searches through
the space, jumbling the input order of the taxa to widen coverage of the space. The
programs within PHYLIP that offer heuristic search are:

PROTPARS

DNAPARS

DNACOMP

DNAML

DNAMLK

PROML

PROMLK

RESTML


FITCH

KITSCH

NEIGHBOR

CONTML

PARS

MIX

DOLLOP

Short descriptions of these programs can be found on the PHYLIP webpage
(http://evolution.genetics.washington.edu/phylip/).
2.1.7 Bootstrapping
Bootstrapping was first proposed as a method for evaluating confidence limits in
phylogenetic trees by Felsenstein (Felsenstein 1985a). The procedure evaluates how
well supported a particular tree topology is by a given dataset. This entails constructing a
new dataset by random resampling, with replacement, from the original dataset. This new
dataset should be of the same size as the original dataset. This procedure is repeated to
produce the appropriate number of replicates. The original tree estimation procedure is then
repeated on each replicate, producing a set of trees.
The original tree is then evaluated in the light of these new trees. Each interior branch
of the original tree is compared to the bootstrap trees, and for every bootstrap tree that
contains a branch creating an identical partition of the data the branch is marked as
present. Thus each internal branch gets a score or bootstrap confidence value calculated by
dividing the number of bootstrap trees in which it was found to be present by the
total number of bootstraps (Nei and Kumar 2000).
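
The two halves of this procedure, resampling and scoring, can be sketched in Python as
follows. The alignment, the splits and the function names are hypothetical, and each
branch is represented by the set of taxa on one side of it, which is an encoding chosen
for this sketch. The tree estimation step that would turn each replicate into a tree is
not shown.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, producing a replicate
    of the same size. `alignment` maps taxon name -> aligned sequence."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def branch_support(original_splits, bootstrap_split_sets):
    """For each branch (split) of the original tree, return the fraction
    of bootstrap trees that contain an identical split."""
    total = len(bootstrap_split_sets)
    return {split: sum(split in splits for splits in bootstrap_split_sets) / total
            for split in original_splits}


# Hypothetical alignment and one replicate drawn from it.
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
replicate = bootstrap_replicate(alignment, random.Random(1))

# Hypothetical splits: two branches of the original tree, four bootstrap trees.
ab, de, ac = frozenset("AB"), frozenset("DE"), frozenset("AC")
support = branch_support([ab, de], [{ab, de}, {ab, de}, {ab}, {ac}])
print(replicate, support)
```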
In the PHYLIP package presented by Felsenstein bootstrapping is carried out through
the use of two of its internal programs SEQBOOT and CONSENSE (Felsenstein 1989). This
procedure calculates confidence values for a consensus tree created from the bootstrap trees
as opposed to the original tree (Nei and Kumar 2000). The procedure for obtaining these
bootstrap support values is:

Run SEQBOOT on the original dataset. This produces the required number of
resamples/replicates of the dataset. SEQBOOT requires a random odd number
as a seed.

Repeat the original estimation procedure on each replicate. This produces a set of
trees, all of which represent the original data.

CONSENSE is then used to merge these trees together. CONSENSE is a
program designed to produce a consensus tree from a set of trees.
CONSENSE labels each internal branch with a value that represents the
confidence in that branch.

2.1.8 Model selection in phylogenetic tree estimation


As mentioned above, three of the main methods of phylogenetic tree estimation utilise
evolutionary models as described in section 2.1.3.4 with which to convert homologous data
matrices into trees. The use of an inappropriate model has been shown to adversely affect all
aspects of tree reconstruction including branch lengths and topology (Bruno and Halpern
1999). A poorly specified model will return an incorrect tree with high statistical confidence
(Posada and Buckley 2004).
Thus procedures have been developed with which to estimate how well a given
evolutionary model fits a dataset. A standard probabilistic measure of the fit of a model to
a given dataset is likelihood. As a model can be fitted perfectly (overfitted) to a dataset
by adding parameters, the likelihood of any given model is evaluated in the context of
how many parameters the model has. Two standard measures that integrate this information
(the likelihood of the model with respect to the dataset and the number of parameters) are
the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both of
these measures penalise parameter-rich models relative to simpler models, thus
selecting the simplest model that explains the observed data adequately (Felsenstein 2004).
The AIC is calculated by the following equation (Felsenstein 2004).
\[ \mathrm{AIC}_i = -2 \ln L_i + 2 p_i \tag{8} \]

Where Li is the likelihood of model i and pi is the number of parameters in model i. The BIC
has a higher penalty for parameter richness than the AIC and is calculated via the following
equation (Felsenstein 2004).
\[ \mathrm{BIC}_i = -2 \ln L_i + p_i \ln(n) \tag{9} \]
Where Li is the likelihood of model i and pi is the number of parameters in model i and n is
the number of data points in the dataset.
An example of a model selection procedure involving likelihood, as utilised by the model
selection tool ModelGenerator, is as follows (Keane et al. 2006).

The construction of a simple guide tree using neighbour joining on the dataset.

Each model to be evaluated is then examined over this guide tree and the dataset to
calculate the likelihood of that model.

The AIC/BIC are calculated for the model.

The model with lowest AIC/BIC is presented as the best model for the given dataset.
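
The last two steps of this procedure can be sketched in Python using equations (8) and (9).
The model names, log-likelihoods, parameter counts and the number of data points below are
invented for illustration, and this sketch is not the ModelGenerator implementation; the
guide tree and likelihood calculations are assumed to have been done already.

```python
import math


def aic(log_likelihood, n_params):
    """Equation (8): AIC = -2 ln L + 2p."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_points):
    """Equation (9): BIC = -2 ln L + p ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)


# Hypothetical candidates: name -> (ln L on the guide tree, number of parameters).
models = {"simple": (-1005.0, 1), "rich": (-1000.0, 10)}
n_points = 500  # hypothetical number of data points in the dataset

aic_scores = {name: aic(lnl, p) for name, (lnl, p) in models.items()}
bic_scores = {name: bic(lnl, p, n_points) for name, (lnl, p) in models.items()}

# The model with the lowest score is preferred under each criterion.
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(best_by_aic, best_by_bic)
```

In this invented example the richer model has the higher likelihood, but its extra
parameters are penalised enough that the simpler model is preferred under both criteria.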

2.1.9 Comparing phylogenetic trees


Given sets of two or more phylogenetic trees produced by disparate estimation procedures it
can sometimes be necessary to compare trees for congruence with each other as well as with
the datasets underlying them.
There are a number of procedures that can be followed to compare trees. The simplest
of these is attempting to visualise the parts of trees that are topologically similar. A method
for doing so is presented in work by Nye (Nye et al. 2006).
Other measures have been designed to define the difference between two trees. These
include:

The Robinson-Foulds distance: This is a distance that counts how many
branches differ between two trees. This is done by ignoring branch lengths and
considering each tree as a set of branches. Each branch splits the tree into two
partitions. The Robinson-Foulds distance is a count of the partitions present in
one tree and not in the other (Felsenstein 2004).

The NNI distance: This distance can be considered an edit distance, analogous
to the Levenshtein distance used to compare strings of text. It is the number of
NNI operations it would take to transform one of the trees into the other
(Felsenstein 2004).

The Branch Score distance: This measure uses branch lengths as well as
topology to calculate the distance between two trees (Felsenstein 2004).
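
As an illustration of the first of these measures, the Python sketch below computes the
Robinson-Foulds distance between two trees on the same taxa. The nested-tuple encoding of
the trees and the function names are assumptions of this sketch; branch lengths are
ignored, and only the partitions created by interior branches are compared.

```python
def leaves(tree):
    """Set of leaf names below `tree` (a leaf name or a tuple of subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Partitions of the taxa induced by the interior branches of `tree`.
    Each partition is stored as an unordered pair of taxon sets."""
    all_taxa = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        inside = leaves(node)
        outside = all_taxa - inside
        # Branches leading to a single leaf (or to everything) are uninformative.
        if len(inside) > 1 and len(outside) > 1:
            found.add(frozenset([inside, outside]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def robinson_foulds(tree_1, tree_2):
    """Number of partitions present in one tree but not in the other."""
    return len(splits(tree_1) ^ splits(tree_2))


# Hypothetical example trees on five taxa.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))
```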
Another way to measure the quality of a pair of estimated trees relative to a given dataset is
the use of the Kishino-Hasegawa (KH) test. This is a test of how well individual homologous
sites within a dataset support a given tree in contrast to another tree (Goldman et al. 2000).
Both trees are selected a priori as the possible best hypotheses for an observed dataset. The
test was first introduced by Hasegawa and Kishino (Hasegawa and Kishino 1989). The underlying
rationale is that if the trees are equally well supported by the dataset then the expected
difference between their likelihoods is zero. Using the notation provided
in (Goldman et al. 2000), the test can be carried out via the following procedure.

Given two trees T1 and T2 calculated by a given quality criterion, e.g. parsimony or
likelihood.

Assuming for the purposes of explanation that the quality criterion is likelihood, the
likelihoods of T1 and T2 (with respect to a given dataset D) are L1 and L2 respectively.

Calculate δ as the difference between the two likelihoods, i.e. δ = L1 - L2.

The underlying hypothesis of the test is that T1 and T2 do not explain D equally well, or
δ ≠ 0.

Thus the hypotheses for the test are:

H0: E[δ] = 0
H1: E[δ] ≠ 0
Where E[x] corresponds to the expected value of the random variable x.

In order to test these hypotheses it is necessary to calculate how extreme the observed
value of δ is with respect to the distribution of δ.

In order to calculate this distribution a bootstrapping procedure is followed to
calculate multiple replicated datasets from D.

The likelihoods of the trees T1 and T2 are then recalculated for each bootstrapped
dataset.

For each of these likelihoods a corresponding δ is calculated.

This provides a distribution for δ against which the position of the initial δ can be
compared via a two-tailed test. A two-tailed test is used as there is no a priori
expectation of which tree is to be preferred (Goldman et al. 2000).
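
The steps above can be sketched in Python under some simplifying assumptions that are not
taken from the original work: per-site log-likelihoods for both trees are assumed to be
available already, each bootstrap likelihood is approximated by summing resampled per-site
values rather than by re-estimating any parameters, and the bootstrap distribution is
centred on zero before the two-tailed comparison. The function name and its arguments are
hypothetical.

```python
import random


def kh_test(site_lnl_1, site_lnl_2, n_replicates=1000, seed=0):
    """Return (observed difference, two-tailed p-value) for two trees,
    given their per-site log-likelihoods on the same dataset."""
    rng = random.Random(seed)
    n_sites = len(site_lnl_1)
    # Per-site contribution to the difference between the two trees.
    diffs = [a - b for a, b in zip(site_lnl_1, site_lnl_2)]
    observed = sum(diffs)

    # Bootstrap: resample sites with replacement and recompute the difference.
    replicates = []
    for _ in range(n_replicates):
        replicates.append(sum(diffs[rng.randrange(n_sites)] for _ in range(n_sites)))

    # Centre the distribution so that it reflects the null hypothesis.
    mean = sum(replicates) / n_replicates
    centred = [value - mean for value in replicates]

    # Two-tailed: count replicates at least as extreme as the observed value.
    extreme = sum(1 for value in centred if abs(value) >= abs(observed))
    return observed, extreme / n_replicates


# Hypothetical per-site log-likelihoods for two trees over six sites.
tree_1 = [-1.2, -0.8, -1.5, -0.9, -1.1, -1.0]
tree_2 = [-1.3, -0.7, -1.9, -0.9, -1.4, -1.0]
print(kh_test(tree_1, tree_2))
```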

The test assumes that all columns within the dataset are independent and identically
distributed according to the evolutionary history of the taxa under consideration.
The KH test is used to compare trees that are selected a priori as possible best
explanations for a given dataset. To examine how well a given best estimated tree matches an
underlying dataset relative to another tree, the SOWH (Swofford-Olsen-Waddell-Hillis) test
can be used in conjunction with the KH test. This test is an example of parametric
bootstrapping (Goldman et al. 2000). Essentially the test involves the construction of a tree
with a given quality criterion over a given dataset. Then this initial tree is used to create
multiple datasets, simulated using the parameters that define the tree. Each of these datasets
is then used to create a new tree. The likelihoods of these new trees can then be compared to
the likelihood of the initial tree relative to the simulated datasets. This creates a set of
likelihood differences, the distribution of which can be compared to the difference between
the initial tree and other trees of interest. This test is not widely used due to the
computational demands of the construction of multiple trees and of simulated datasets.
2.1.10 Phylogenetic analysis using gene presence
Leaving aside sequence data there are other aspects to the genomes of a set of organisms that
also provide signal indicating their evolutionary divergence. One of these aspects is proteome
content, i.e. which genomic features are present in a genome and which are absent (Snel et al.
1999). This aspect is essentially the phylogenetic profile of the genomic feature. The
presence and absence of the genomic feature can be treated as a binary trait and then used as
an input to apply standard phylogenetic tree estimation procedures. Methods for the
estimation of trees from discrete traits precede the methods described above for the analysis
of molecular sequence data. The issues and criteria for estimation procedures surrounding the
tree estimation from this form of data remain the same, as the underlying process for the
generation of the data is the same.
As an illustrative example consider 5 species with six genes. Any of these genes can either be
present or absent. An absence of a gene is coded as 0 and the presence of a gene is coded as
1.

Thus, given the following distribution of the genes over the five species named A to E:

A  111010
B  111111
C  000001
D  111100
E  000000

A phylogenetic tree can be estimated from this data clustering these species. An example
tree reconstructed using Dollo parsimony as implemented in the PHYLIP package in the
program DOLLOP (Felsenstein 1989) is shown in the figure below.

Figure 2.5: Dollo parsimonious estimation of a phylogenetic tree from example data.
As species E is devoid of all six genes it is placed as an outgroup relative to the other four.
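
To show how such presence and absence strings can be handled computationally, the Python
sketch below encodes the five example profiles and counts the positions at which each pair
of species differs. This is a simple pairwise distance and not the Dollo parsimony
procedure used to produce the figure; the assignment of the five strings to species A to E
in the order listed is an assumption of this sketch.

```python
from itertools import combinations

# Example gene presence (1) and absence (0) strings for species A to E.
profiles = {
    "A": "111010",
    "B": "111111",
    "C": "000001",
    "D": "111100",
    "E": "000000",
}


def hamming(p, q):
    """Number of genes present in one species but absent in the other."""
    return sum(x != y for x, y in zip(p, q))


# Pairwise distances, which could serve as input to a distance-based method.
distances = {(s, t): hamming(profiles[s], profiles[t])
             for s, t in combinations(sorted(profiles), 2)}
for pair, distance in sorted(distances.items()):
    print(pair, distance)
```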
2.2 Methods
In order to carry out a comparative study on human protein function a phylogeny was
constructed to provide a framework on which to evaluate the distribution of human proteins
over the eukaryotic kingdom. Sets of phylogenetic profiles, which detail this distribution,
were also generated.

2.2.1 Data Selection
Given the computational nature of this project as well as the abundance of molecular data it
was clear that its use was to be preferred over morphological data. Also given the wide
morphological divergence of eukaryotes isolating individual features to be compared was not
considered a plausible option.
The next point of consideration was whether to utilise nucleotide or amino acid
molecular data. It is known that over long periods of evolution nucleotide data is more
likely to become saturated with multiple back substitutions, as nucleotide data has four
possible states at any given site as opposed to 20 possible states in amino acid
data. This can lead to an underestimate of genetic distance (Harrison and Langdale 2006;
Salemi and Vandamme 2003). Thus, despite nucleotide data outperforming amino acid data
over smaller time frames such as the period of time covering the divergence of the division
Angiospermae (Simmons et al. 2002) and of the subphylum Vertebrata (Townsend et al.
2008), it was felt that amino acid data was a more appropriate choice as a measure of genetic
distance over all eukaryotes.
2.2.2 Data Acquisition
Having decided to utilise amino acid data the next step was to acquire usable data. The
protein sets of 54 eukaryotic genomes were downloaded on the 16th and 17th of August 2007,
of which 41 organisms were accessed from the NCBI RefSeq database (using the Entrez data
retrieval interface at http://www.ncbi.nlm.nih.gov/sites/gquery) (Pruitt et al. 2005) and the
remainder from the Sanger Centre (ftp://ftp.sanger.ac.uk), Genoscope
(http://www.genoscope.cns.fr/spip/spip.php?lang=en), TIGR (now the JCVI,
http://www.jcvi.org/), the Broad Institute (http://www.broadinstitute.org), Ensembl (using
BioMart at http://www.ensembl.org/biomart) and lastly SilkDB
(http://silkworm.genomics.org.cn) (Wang et al. 2005b). These databases were employed as
they were (at the time of access) the sources utilised by the KEGG database (Kanehisa et al.
2006). An additional archaeon, Methanosarcina acetivorans, was downloaded from the NCBI
RefSeq database (Pruitt et al. 2005) in order to root the phylogeny using the outgroup
criterion. This method entails using an organism that falls outside the known taxonomy of the
group under consideration to provide a point of reference for the overall topology of the tree
(Felsenstein 2004). It is widely accepted after work by Carl Woese (Woese et al. 1990) that
the Archaea form a sister group to the Eucarya. Thus it was felt that an archaeon was an
appropriate choice for an outgroup species. Full details of all data sources and species can be
seen in the table below.

Organism | Database | Common Name
Methanosarcina acetivorans (Outgroup) | RefSeq | NA
Anopheles gambiae | Ensembl | Mosquito
Arabidopsis thaliana | RefSeq | Thale cress
Ashbya gossypii | RefSeq | NA
Aspergillus fumigatus | RefSeq | NA
Aspergillus niger | RefSeq | NA
Bombyx mori | SilkDB | Silkworm
Bos taurus | RefSeq | Cow
Caenorhabditis briggsae | Ensembl | NA
Caenorhabditis elegans | RefSeq | NA
Candida albicans | RefSeq | NA
Candida glabrata | RefSeq | NA
Canis familiaris | RefSeq | Dog
Ciona intestinalis | RefSeq | Sea squirt
Cryptococcus neoformans | RefSeq | NA
Cryptosporidium hominis | RefSeq | NA
Cryptosporidium parvum | RefSeq | NA
Danio rerio | RefSeq | Zebrafish
Debaryomyces hansenii | RefSeq | NA
Dictyostelium discoideum | RefSeq | NA
Drosophila melanogaster | RefSeq | Fruit fly
Drosophila pseudoobscura | RefSeq | NA
Encephalitozoon cuniculi | RefSeq | NA
Entamoeba histolytica | RefSeq | NA
Gallus gallus | RefSeq | Chicken
Homo sapiens | RefSeq | Human
Kluyveromyces lactis | RefSeq | NA
Leishmania major | Sanger | NA
Macaca mulatta | RefSeq | Rhesus macaque
Magnaporthe grisea | RefSeq | NA
Monodelphis domestica | RefSeq | Grey short-tailed opossum
Mus musculus | RefSeq | Mouse
Neurospora crassa | Broad Institute | NA
Oryza sativa | RefSeq | Rice
Ostreococcus lucimarinus | RefSeq | NA
Pan troglodytes | RefSeq | Chimpanzee
Paramecium tetraurelia | Genoscope | NA
Pichia stipitis | RefSeq | NA
Plasmodium falciparum | RefSeq | NA
Plasmodium knowlesi | Sanger | NA
Plasmodium yoelii | TIGR | NA
Populus trichocarpa | JGI | Black cottonwood tree
Rattus norvegicus | RefSeq | Rat
Saccharomyces cerevisiae | RefSeq | Brewer's yeast
Schizosaccharomyces pombe | RefSeq | Fission yeast
Strongylocentrotus purpuratus | RefSeq | NA
Takifugu rubripes | Ensembl | Pufferfish
Tetrahymena thermophila | RefSeq | NA
Theileria annulata | RefSeq | NA
Theileria parva | RefSeq | NA
Trichomonas vaginalis | RefSeq | NA
Trypanosoma brucei | RefSeq | NA
Trypanosoma cruzi | RefSeq | NA
Ustilago maydis | RefSeq | NA
Yarrowia lipolytica | RefSeq | NA

Table 2.1: Organisms in study.

2.2.3 Pairwise Alignment
In order to gauge the relatedness of individual proteins in the organisms it was necessary to
use a pairwise alignment algorithm that would deliver a measure of similarity between any
two given sequences. The Smith-Waterman algorithm (Smith and Waterman 1981) was
selected as it is guaranteed to locate optimal regions of local similarity. Speed is an issue
with the Smith-Waterman algorithm; however, an accelerated implementation developed by
Michael Farrar and provided within the FASTA package made its use feasible (Farrar 2007;
Pearson and Lipman 1988).
A necessary pre-processing step was to subject all sequences to low complexity
filtering to remove regions of sequence that are non-random but not biologically
significant, such as regions of compositional bias. Thus all sequences were fed into the SEG
program (Wootton and Federhen 1993) with the parameter -x, which masks out regions of
low complexity sequence and replaces them with the lower case character x.
Each protein set was then split into its individual proteins and each protein compared
against the protein set of every other organism in the dataset in order to locate sequences
that were significantly similar. Each comparison was run with a gap-opening penalty of -12
and a gap extension penalty of -2. The substitution matrix BLOSUM62 (Henikoff and Henikoff
1992) was used to score the alignments. The results of these searches were then parsed and
pertinent data, i.e. raw Smith-Waterman score, E value, bit score and the coordinates of the
alignments along the sequences, were stored in a relational database structure in MySQL to
facilitate further analysis.
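
For readers unfamiliar with the algorithm, the following Python sketch computes a
Smith-Waterman local alignment score with affine gap penalties. It is not the accelerated
implementation used in this work: a simple match/mismatch score stands in for BLOSUM62,
and the convention that a gap of length k costs gap_open + k * gap_extend is an assumption
of this sketch rather than a statement about the FASTA package.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap_open=12, gap_extend=2):
    """Best local alignment score between sequences `a` and `b`.

    Assumed gap model: a gap of length k costs gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]        # best score ending at (i, j)
    e = [[neg_inf] * cols for _ in range(rows)]  # ending with a gap in `a`
    f = [[neg_inf] * cols for _ in range(rows)]  # ending with a gap in `b`
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either extend an existing gap or open a new one.
            e[i][j] = max(e[i][j - 1] - gap_extend,
                          h[i][j - 1] - gap_open - gap_extend)
            f[i][j] = max(f[i - 1][j] - gap_extend,
                          h[i - 1][j] - gap_open - gap_extend)
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The zero allows a local alignment to start afresh at any cell.
            h[i][j] = max(0, diagonal, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


# Hypothetical sequences: identical, unrelated, and differing by one insertion.
print(smith_waterman("ACGT", "ACGT"))
print(smith_waterman("AAAA", "TTTT"))
print(smith_waterman("AAACCC", "AAAGCCC", match=5, mismatch=-4, gap_open=3, gap_extend=1))
```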
2.2.4 Orthology Determination
In order to select data which would allow the measurement of evolutionary divergence over
the species it was necessary to cluster the proteins into orthologous clusters. As the
Inparanoid procedure (Remm et al. 2001) had been observed to perform well in this function
it was decided to utilise this procedure. The Inparanoid procedure as described in work
published by Remm (Remm et al. 2001) is detailed below.
2.2.4.1 Inparanoid
Given a set of n pairwise alignments between organism A and organism B the Inparanoid
algorithm returns a set of clusters s, using sequence similarity as an inverse distance. A
pairwise alignment of two proteins, protein a and protein b, is in this case a composite input
consisting of:

Bit score aSb: This is the result of the normalisation of a raw pairwise alignment score with
respect to the scoring system (Karlin and Altschul 1990). Normalisation places all
scores on the same scale, which is a fundamental prerequisite for use as a distance
metric.

Sequence lengths: ax and bx.

Alignment lengths: The lengths of the alignment along both proteins, alength and
blength.

The latter two length inputs are used to eliminate short hits using a minimum length
cut-off. These short hits may reflect functionally homologous (potentially orthologous)
domains as opposed to whole proteins, which are inherited intact as a discrete unit from the
last common ancestor of the two species under consideration. The bit score is employed as a
cut-off to limit the radius of the clustering step. The Inparanoid algorithm runs the following
steps in a pairwise comparison (Remm et al. 2001).

Sort all bit scores in ascending order.

Read in all hits excluding those that fall below score and length cutoffs.

Select the best scores between organism A and B.

For each best-hit protein from A-B examine the reciprocal relationship B-A.

All reciprocal best hits are stored as a set of seed pairs for orthologous clusters.

For each seed pair paralogous genes are grouped around them if the largest score that
the putative paralog has in the set of all scores is against the putative ortholog in the
seed cluster of the organism under consideration.

Overlapping clusters are then resolved through either deletion or subsumption
depending on the degree and topology of the overlap.
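
The seed-pair stage of these steps can be sketched as follows. This Python example is not
the Inparanoid implementation: it covers only the score cut-off and the reciprocal best
hit test, and it omits the length cut-off, the grouping of paralogs and the resolution of
overlapping clusters. The protein identifiers and scores are invented, and the function
names are hypothetical.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring target.
    `scores` maps (query, target) -> bit score."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_score=50.0):
    """Return the (a, b) pairs that are each other's best hit, after
    discarding hits below the bit score cut-off."""
    a_best = best_hits({k: s for k, s in a_vs_b.items() if s >= min_score})
    b_best = best_hits({k: s for k, s in b_vs_a.items() if s >= min_score})
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical bit scores for searches in both directions.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 90.0, ("a2", "b2"): 150.0,
          ("a3", "b1"): 60.0, ("a4", "b3"): 40.0}
b_vs_a = {("b1", "a1"): 198.0, ("b2", "a2"): 149.0, ("b2", "a1"): 88.0,
          ("b3", "a4"): 41.0}
seed_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(seed_pairs)
```

In this invented example a3 finds b1 as its best hit, but b1 prefers a1, so a3 does not
seed a cluster; a4 and b3 are removed by the score cut-off.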

2.2.4.2 Implementation
An implementation of the Inparanoid algorithm in the Perl language (Remm et al. 2001) was
acquired from the Inparanoid website (http://inparanoid.sbc.su.se). However, as this
implementation proved not to be amenable to the analysis of bespoke output, it
was decided to re-implement the procedure described above.
The Inparanoid algorithm was implemented through the application of object
oriented (OO) software design principles. OO principles involve the characterisation of a
problem domain as a collection of interacting objects where functionality inherent to each
object is implemented as internal to that particular object (Pressman 2001). Objects within a
problem domain are generally identified through the identification of nouns within a problem
statement (Pressman 2001). Perusal of the algorithm specification provided in (Remm et al.
2001) led to the design described below.
2.2.4.2.1 Design
The main object that the procedure required in order to operate was identified as a Cluster
object, which was implemented with the following attributes and operations.

Figure 2.6: Class diagram of the main entity.


The box that represents the object above is split into three segments. The first segment
contains the name of the object. The second segment contains the attributes that the object
contains. In the case of the Cluster object it has two attributes. One of these attributes
theAGenes is a list of genes/proteins clustered together from species A. Conversely
theBGenes is a list of genes/proteins clustered together from species B.
The third segment of the box represents the operations that it is possible for the
object to carry out. In the case of the Cluster object the main operations necessary are the
ability to add and delete members as well as to detect whether two Clusters overlap. Finally
the ability to merge two Clusters is an essential operation.
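
The structure of such an object can be sketched as follows. The sketch is written in Python
for brevity and is not the implementation used in this work. The attribute names theAGenes
and theBGenes follow the description above, whereas the method names, their signatures and
the rule that two Clusters overlap when they share any gene are assumptions made for
illustration.

```python
class Cluster:
    """Genes/proteins from species A and species B clustered together."""

    def __init__(self, a_genes=(), b_genes=()):
        self.theAGenes = list(a_genes)
        self.theBGenes = list(b_genes)

    def _members(self, species):
        # Hypothetical helper: select the list for the named species.
        if species == "A":
            return self.theAGenes
        if species == "B":
            return self.theBGenes
        raise ValueError("species must be 'A' or 'B'")

    def add(self, gene, species):
        members = self._members(species)
        if gene not in members:
            members.append(gene)

    def delete(self, gene, species):
        members = self._members(species)
        if gene in members:
            members.remove(gene)

    def overlaps(self, other):
        # Assumed rule: two Clusters overlap if they share any gene.
        return bool(set(self.theAGenes) & set(other.theAGenes)
                    or set(self.theBGenes) & set(other.theBGenes))

    def merge(self, other):
        for gene in other.theAGenes:
            self.add(gene, "A")
        for gene in other.theBGenes:
            self.add(gene, "B")


# Hypothetical usage with invented identifiers.
first = Cluster(["a1"], ["b1"])
second = Cluster(["a1", "a2"], ["b2"])
if first.overlaps(second):
    first.merge(second)
print(first.theAGenes, first.theBGenes)
```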

It was decided to use the programming language Java to implement the above design
as the language provides functionality that facilitates the OO paradigm. The
implementation was then tested against the author-provided Perl implementation to ensure
correctness.
The Java implementation deviated slightly from the author-provided Perl
implementation in two respects. The first deviation was the use of higher precision double
values to represent the bit scores of alignments, as opposed to the use of integers by the
Perl implementation. This led to cases where pairs whose scores had been rounded in the Perl
implementation (and were thus marked as reciprocal best hits) were not marked as reciprocal
best hits in the Java implementation (examples in Appendix A).
The second deviation was to cluster orthologous genes with two equal reciprocal best
hits between the species at the stage of sorting. This change in the order of steps has no effect
overall on the groups produced by the implementation. The implementation was run using a
bit score cut-off of 50 and an alignment length cut-off of 50% of the length of the
longer protein.
2.2.4.3 Application
The data generated from the similarity searches was then clustered to identify orthologous
genes using the constructed Inparanoid implementation. The study was carried out using H.
sapiens as a reference species. Orthologous groups were sought in each organism for every
protein within the human proteome. This study can be considered unbalanced as no
information was collected on proteins that are absent in H. sapiens (Davey et al. 2007).
As the dataset under consideration was amino acid sequence data there was a choice
as to how to deal with alternatively spliced isoforms of the same protein. As the goal of this
project was to examine the presence and absence of the proteins under consideration it was
decided that the retention of all isoforms in the dataset and clustering them as inparalogs was
appropriate. This would allow an examination of correlations in gain and loss of the protein
as an independent phenotypic entity.
2.2.5 Phylogenetic profiles
The output from the clustering step was then used to generate phylogenetic profiles, which as
mentioned previously are binary strings of presence and absence of an orthologous group for
a given gene in the reference species.

In order to create phylogenetic profiles for each protein within the human proteome the
following steps were undertaken.

1) A list of the GI identifiers for the set of human proteins was generated.

2) For each entry in this list the identifier was scanned against all files containing
orthology predictions, taking the organisms in alphabetical order.

3) If the entry was present within the orthology prediction file for a given organism, that
position within the profile string was marked as 1. Otherwise that position was marked as 0.
The positions within a profile therefore follow the alphabetical order of the organisms. For
example the profile
100000000000000000000000100000000000000000000000000000
indicates a protein with an orthologous group present in Anopheles gambiae and Homo sapiens
but absent in all other species under consideration.

4) Two sets of profiles were generated, one including the outgroup for use in ortholog
selection and the other excluding the outgroup for use in prediction of functional linkage.
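The profile construction steps can be sketched as below. This is a minimal illustration only: the function name build_profile, the dictionary layout and the identifiers gi1 and gi2 are hypothetical, and the real orthology predictions were read from one file per organism rather than held in memory.

```python
def build_profile(protein_id, predictions):
    """Build a presence/absence string for one human protein.

    predictions maps an organism name to the set of human protein
    identifiers that have an orthologous group in that organism.
    """
    return "".join(
        "1" if protein_id in predictions[organism] else "0"
        for organism in sorted(predictions)  # alphabetical organism order
    )


# Toy orthology predictions for three organisms.
predictions = {
    "Homo sapiens": {"gi1", "gi2"},
    "Anopheles gambiae": {"gi1"},
    "Mus musculus": {"gi2"},
}
profiles = {gi: build_profile(gi, predictions) for gi in ("gi1", "gi2")}
```

In this toy case gi1 is present in Anopheles gambiae and Homo sapiens only, mirroring the worked example in the text on a smaller scale.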

Table 2.2 lists the organisms under study along with their proteome sizes and number of
orthologous groups with reference to H. sapiens. Figure 2.7 shows the distribution of
proteome sizes in the organisms under consideration. Figure 2.8 shows the number of
proteins clustered in each organism.


Organism | No. of Proteins | No. of Clusters | No. of Proteins Clustered
Methanosarcina acetivorans (Outgroup) | 4544 | 408 | 822
Anopheles gambiae | 13465 | 4889 | 5030
Arabidopsis thaliana | 31915 | 2849 | 3389
Ashbya gossypii | 4292 | 1590 | 1599
Aspergillus fumigatus | 9630 | 2167 | 2175
Aspergillus niger | 14102 | 2197 | 2210
Bombyx mori | 21302 | 4060 | 4104
Bos taurus | 25379 | 15508 | 16567
Caenorhabditis briggsae | 19553 | 4009 | 4212
Caenorhabditis elegans | 23220 | 4149 | 4404
Candida albicans | 14107 | 1909 | 3367
Candida glabrata | 5192 | 1737 | 1784
Canis familiaris | 33527 | 16372 | 17681
Ciona intestinalis | 15852 | 6097 | 6117
Cryptococcus neoformans | 6594 | 1985 | 2045
Cryptosporidium hominis | 3886 | 953 | 957
Cryptosporidium parvum | 3805 | 1080 | 1086
Danio rerio | 36083 | 11598 | 13007
Debaryomyces hansenii | 6317 | 1861 | 1887
Dictyostelium discoideum | 13377 | 2551 | 2650
Drosophila melanogaster | 20071 | 5037 | 6847
Drosophila pseudoobscura | 9871 | 4230 | 4240
Encephalitozoon cuniculi | 1996 | 711 | 725
Entamoeba histolytica | 9772 | 1282 | 1572
Gallus gallus | 18710 | 11489 | 11769
Homo sapiens | 33473 | 26634 | 33473
Kluyveromyces lactis | 5339 | 1745 | 1751
Leishmania major | 8302 | 1494 | 1706
Macaca mulatta | 37856 | 17025 | 22774
Magnaporthe grisea | 14010 | 2099 | 4420
Monodelphis domestica | 20194 | 13633 | 13880
Mus musculus | 35048 | 16607 | 18448
Neurospora crassa | 10082 | 1998 | 2002
Oryza sativa | 26887 | 2699 | 2753
Ostreococcus lucimarinus | 7603 | 2152 | 2241
Pan troglodytes | 51517 | 18916 | 28944
Paramecium tetraurelia | 39642 | 2147 | 2740
Pichia stipitis | 5816 | 1883 | 3826
Plasmodium falciparum | 5270 | 1090 | 2198
Plasmodium knowlesi | 4958 | 1125 | 2254
Plasmodium yoelii | 7861 | 1058 | 2140
Populus trichocarpa | 45555 | 3037 | 3363
Rattus norvegicus | 35903 | 16414 | 21279
Saccharomyces cerevisiae | 5883 | 1803 | 3706
Schizosaccharomyces pombe | 5045 | 2066 | 4272
Strongylocentrotus purpuratus | 42373 | 6638 | 13387
Takifugu rubripes | 22428 | 10891 | 11046
Tetrahymena thermophila | 26235 | 2014 | 2051
Theileria annulata | 3795 | 1000 | 1016
Theileria parva | 4079 | 1005 | 2044
Trichomonas vaginalis | 59681 | 1823 | 2567
Trypanosoma brucei | 8772 | 1531 | 3596
Trypanosoma cruzi | 19606 | 1737 | 2079
Ustilago maydis | 6548 | 3756 | 3762
Yarrowia lipolytica | 6545 | 4092 | 4132

Table 2.2: List of organisms used in study along with data source and proteome size as well
as number of orthologous groups in alphabetical order. The outgroup is placed at the top.

Figure 2.7: Distribution of proteome sizes in organisms under consideration. N=55.


Figure 2.8: Distribution of number of proteins placed within clusters in each organism. N=55.
Table 2.3 shows the top ten profiles within the human genome and provides an interpretation.


Profile

Count

Interpretation

000000000000000000000000100000000000000000000000000000

5150

Present in species
Homo sapiens.

000000000000000000000000100000000010000000000000000000

2281

Present in Tribe
Hominini.

000000100001000000000001100101100010000001000000000000

1089

Present in Phylum
Chordata.

000000000000000000000000100100000010000000000000000000

622

Present in Order
Primate.

000000100001000000000000100101100010000001000000000000

495

Present in Infraclass
Eutheria.

000000000000000000000000100100000000000000000000000000

466

Present in species
Homo sapiens and
Macaca mulatta.

000000100001000000000001100101100010000001000000000000

378

Mammalia and

000000100001000010000001100101100010000001000000000000
000000100001000000000000100100100010000001000000000000

Present in Class
Class Aves

342

Present in Class
Mammalia

000000100001000010000001100101100010000001000000000000

281

Present in the
Phylum Chordata
with the exception
of the Species
Takifugu rubripes

000000100001100010000001100101100010000001000100000000

235

Present in Class
Mammalia, Class
Actinopterygii and
Class Aves

Table 2.3: Top ten occurring phylogenetic profiles ranked by counts.



2.2.5.1 Single copy proteins


In order to select orthologous proteins that most accurately reflected the evolutionary
histories of the species under consideration, it was decided to focus on orthologous groups
that were present in a single copy across all 55 taxa. This focus on single copy proteins
excluded potential comparisons between paralogous proteins.
A set of proteins with ubiquitous profiles was extracted from the data and these were
sifted for proteins that were present in single copy in each organism in the dataset. Ten
proteins were present in single copy over all the organisms under study. Table 2.4 shows
details of these proteins.
NCBI RefSeq GI Number | Description in NCBI annotation | Entrez Gene Name
116805340 | glycyl-tRNA synthetase | GARS
32307132 | NFS1 nitrogen fixation 1 precursor | NFS1
4506605 | ribosomal protein L23 | RPL23
4506743 | ribosomal protein S8 | RPS8
4507215 | signal recognition particle 54kDa isoform 1 | SRP54
4557563 | excision repair cross-complementing rodent repair deficiency, complementation group | ERCC3
5031815 | lysyl-tRNA synthetase isoform 2 | KARS
7706757 | H(+)-transporting two-sector ATPase | ATP6V1D
5803092 | Methionine aminopeptidase 2 | METAP2
24430151 | 26S protease regulatory subunit 4 | PSMC1

Table 2.4: Single copy ubiquitous genes extracted via analysis of profiles.
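The selection of single copy ubiquitous groups described in this section can be sketched as a simple filter. The function name, the data layout and the group identifiers below are hypothetical and serve only to illustrate the criterion of exactly one member in every organism.

```python
def single_copy_ubiquitous(copy_counts, organisms):
    """Return the groups with exactly one member in every organism.

    copy_counts maps a group identifier to a dictionary giving the number
    of clustered proteins per organism; a missing organism counts as zero.
    """
    return sorted(
        group
        for group, counts in copy_counts.items()
        if all(counts.get(organism, 0) == 1 for organism in organisms)
    )


organisms = ["Homo sapiens", "Mus musculus", "Danio rerio"]
copy_counts = {
    "groupA": {"Homo sapiens": 1, "Mus musculus": 1, "Danio rerio": 1},
    "groupB": {"Homo sapiens": 1, "Mus musculus": 2, "Danio rerio": 1},  # paralogs
    "groupC": {"Homo sapiens": 1, "Mus musculus": 1},                    # not ubiquitous
}
selected = single_copy_ubiquitous(copy_counts, organisms)
```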

2.2.5.2 Proteome content data/tree
The phylogenetic profiles as developed provided a matrix of presence and absence of every
human protein in the dataset across the remaining 53 eukaryotes and 1 outgroup archaeon. As
this form of data also contains phylogenetic signal, i.e. shows the divergence in the
proteomes of the given species over time, it was decided to subject this data to a phylogenetic
analysis as well as the main phylogenetic analysis to be carried out on the multiple sequence
alignments of the homologous proteins. The tree was reconstructed using Dollo parsimony
(Farris 1977) via the program DOLLOP (Felsenstein 1989). This tree can be seen in Figure
2.17.
In order to carry this out the profiles were transposed, so that rather than showing the
distribution of a human protein over a set of species they showed the pattern of presence and
absence of all human proteins within a single species. Thus instead of a matrix of 33,473
proteins by 55 species the end product was a matrix of 55 species by 33,473 proteins. In
other words each species was assigned a binary string of length 33,473 where 1 indicated the
presence of a particular human protein and 0 its absence.
This matrix was converted into PHYLIP format through the truncation of species
names to 10 characters and the addition of header information about the size of the matrix.
Finally this formatted file was input to the DOLLOP program (Felsenstein 1989). The
program was run with its default settings.
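The transposition and formatting steps can be sketched as follows. This is an illustration under stated assumptions: the function name to_phylip and the toy identifiers are hypothetical, and the sketch truncates names naively to 10 characters. With the real species list such truncation would produce duplicate names (for example the three Plasmodium species), and how duplicates were resolved is not stated here.

```python
def to_phylip(profiles, species):
    """Transpose protein profiles into a species-by-protein PHYLIP matrix.

    profiles maps a protein identifier to its presence/absence string, whose
    positions follow the order of the names in species.
    """
    proteins = sorted(profiles)
    # Header line: number of species, then number of characters.
    lines = [f"{len(species):5d} {len(proteins):5d}"]
    for column, name in enumerate(species):
        states = "".join(profiles[protein][column] for protein in proteins)
        # PHYLIP names occupy exactly 10 characters, padded with spaces.
        lines.append(f"{name[:10]:<10}{states}")
    return "\n".join(lines)


species = ["Homo sapiens", "Mus musculus"]
profiles = {"p1": "10", "p2": "11", "p3": "01"}
phylip_text = to_phylip(profiles, species)
```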
In order to examine the level of support for the initial outputted tree 100 bootstrap
replicates were created with SEQBOOT (Felsenstein 1989). These 100 replicates were
resubmitted to DOLLOP to produce 100 bootstrap trees. These trees were unified using
CONSENSE (Felsenstein 1989).
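The resampling that underlies a bootstrap replicate can be sketched as below. This is a generic illustration of drawing characters (columns) with replacement and is not a reproduction of SEQBOOT; the function name and the toy matrix are assumptions made for the example.

```python
import random


def bootstrap_replicate(rows, rng):
    """Resample the columns of a character matrix with replacement.

    rows holds one presence/absence string per species, all the same length.
    """
    n_chars = len(rows[0])
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    return ["".join(row[i] for i in picks) for row in rows]


matrix = ["1100", "0110", "0011"]
replicate = bootstrap_replicate(matrix, random.Random(0))
```

Each replicate has the same dimensions as the original matrix, and every column in it is a copy of some original column.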
2.2.6 Multiple alignment
As an initial step to generate a phylogenetic tree using the selected orthologous proteins, a
multiple alignment of each of the 10 proteins was constructed utilising Mafft (Katoh et al.
2002) (Multiple Alignment By Fast Fourier Transform) using the L-INS-i algorithm for 1000
iterations. Each alignment was then subjected to Gblocks filtration (Talavera and Castresana
2007) to remove columns that were poorly aligned. Gblocks was run in its relaxed mode.
These alignments were then concatenated to form a super matrix in order to generate a
measure of divergence across the genomes as opposed to at a single locus, as has been
suggested by Rokas et al. (2003). The full alignment can be seen in Appendix D.
2.2.7 Model selection
In order to select an evolutionary model which provided a statistically accurate measure of
genetic divergence, ModelGenerator (Keane et al. 2006) was used to select the model that best
fitted the concatenated multiple alignment. ModelGenerator requires as an argument the number
of gamma categories, to account for heterogeneity in substitution rates. The argument was
given a value of 4 gamma categories, as this has been observed to be a sufficient number to
create a near-optimum fit of a model (Yang 1994).
It has been observed that individual gene trees can be highly incongruent with species
trees (Cranston et al. 2009). Thus it is possible for the inference of a species tree to be misled
by non-phylogenetic signal from the individual genes (Cranston et al. 2009). In order to
examine potential incongruence between gene alignments each individual alignment was first
analysed separately. The model selected for the complete supermatrix was the LG
substitution matrix (Le and Gascuel 2008). The LG matrix was selected with the additional
parameter Γ, which allows different rates of evolution across the sequence. Each gene
alignment was also best fitted by the LG matrix, with the variations shown in Table 2.5.
The LG matrix is generated by a model of evolution that takes into account mutation rate
heterogeneity over sites, thus yielding better results than its predecessors WAG and JTT (Le
and Gascuel 2008). The models selected were the best fit as judged by both the Akaike
Information Criterion (AIC) and the Bayesian Information Criterion (BIC) in all but
two cases (GARS and ERCC3). Where the two criteria disagreed the BIC was preferred over the
AIC (Yang 2008), thus lowering the possibility of overfitting a more complex model to the data.
The models selected for the individual orthologs were exactly the same as the model
for the concatenated alignment except in the case of SRP54, where the additional parameter I
was selected, indicating that a proportion of the alignment was invariant.
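The way in which the AIC and BIC can disagree, and why preferring the BIC guards against overfitting, can be shown with the standard definitions of the two criteria. The log likelihoods, parameter counts and alignment length below are invented for illustration and are not values from this analysis.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: the penalty grows with sample size."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


n_sites = 1000  # hypothetical alignment length
simple = {"log_likelihood": -1000.0, "n_params": 1}
complex_model = {"log_likelihood": -997.0, "n_params": 3}

aic_prefers_complex = aic(**complex_model) < aic(**simple)
bic_prefers_simple = bic(**simple, n_sites=n_sites) < bic(**complex_model, n_sites=n_sites)
```

Here the small gain in likelihood is enough to win under the AIC but not under the BIC, whose per-parameter penalty of ln(1000) is larger than the AIC penalty of 2.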


Entrez Gene Name | Substitution model of best fit (Selected by BIC)
GARS | LG+Γ
NFS1 | LG+Γ
ATP6V1D | LG+Γ
KARS | LG+I+Γ
SRP54 | LG+I+Γ
PSMC1 | LG+Γ
METAP2 | LG+Γ
ERCC3 | LG+Γ
RPL23 | LG+Γ
RPS8 | LG+Γ

Table 2.5: Substitution model selected by ModelGenerator for each ortholog.


2.2.8 Phylogeny reconstruction
After selection of the models, each individual alignment and the concatenated alignment were
input to PhyML (Guindon and Gascuel 2003), each with its own substitution model. A total of
11 trees were generated: one per orthologous gene and one for the concatenated alignment as
a whole. For the concatenated alignment 1000 bootstrap replicates were also generated
using SEQBOOT (Felsenstein 1989) and rerun using PhyML (Guindon and Gascuel 2003).
These 1000 replicate trees were input to CONSENSE (Felsenstein 1989) in order to acquire
an estimate of the overall bootstrap support for the tree from within the data. The topology of
the bootstrapped trees was identical to the topology of the primary ML tree.

2.2.9 Comparison of protein content tree with super matrix tree
In order to place a measure on whether the protein content tree was a significantly worse
hypothesis of the evolutionary relationships of the species under study the PHYLIP program
PROML (Felsenstein 1989) was utilised. PROML was given the supermatrix alignment as a
dataset and the two trees as user inputted trees to evaluate against the dataset. PROML then
ran the KH test against the two trees to examine the differences in the likelihood of the trees
relative to the dataset.
2.3 Results
The individual gene trees can be seen in Appendix B. In order to examine potential
differences in phylogenetic signal between the individual gene trees, TREEDIST
(Felsenstein 1989) was used to generate a distance matrix between the 10 gene trees.
TREEDIST uses the Branch Score distance (Kuhner and Felsenstein 1994) to calculate the
distance between two trees. This distance takes into account the branch lengths of the input
trees as well as the overall topology. This matrix was then used to generate a dendrogram
using UPGMA (Unweighted Pair Group Method with Arithmetic mean) clustering as shown
in Figure 2.9.

Fig 2.9: Cluster diagram of tree distances of individual gene phylogenies.

The gene trees fell into three main clusters. In order to examine the degree of incongruence of
each cluster from the super matrix species tree the trees contained within each cluster were
then submitted to CONSENSE (Felsenstein 1989) in order to view the consensus trees.
CONSENSE was run with the majority rule setting, where a group has to appear in more than
50% of the input trees in order to be retained in the consensus tree. Figure 2.10
shows the outlier tree estimated from ERCC3 and Figures 2.11 and 2.12 show the consensus
trees.


Figure 2.10: Gene tree for gene ERCC3.



Figure 2.11: Consensus Tree for Cluster 2 containing genes: RPS8, ATP6V1D, PSMC1 and
METAP2.


Figure 2.12: Consensus Tree 2 for Cluster 3 containing genes: GARS, NFS1, RPL23, SRP54
and KARS.

Both consensus trees preserve the kingdoms of Plantae, Animalia and Fungi though
the order of branching is lost. However both clusters demonstrate a broad congruence with
the fully concatenated ML tree, which can be seen in Figure 2.13. This is a useful measure of
the degree of overlap of phylogenetic signal contributed from each of the individual genes.
Figure 2.13 is an illustration of the topology of the tree with the animals, fungi and plants
highlighted. Figure 2.14 also overlays the bootstrap support values for each proposed clade.
Figure 2.15 presents the reconstructed phylogeny with a measure of support, which is the
proportion of the individual gene trees that supported a clade (Bratke 2009). Figure 2.16
shows the topology of the tree in combination with branch lengths.


Figure 2.13: ML tree of 54 eukaryotes without branch lengths created from a super matrix of
the concatenated alignments of all genes listed in Table 2.5. The clades containing animals,
fungi and plants are coloured blue, red and green respectively.


Figure 2.14: ML tree of 54 eukaryotes without branch lengths created from a super matrix of
the concatenated alignments of all genes listed in Table 2.5. Bootstrap support values are
only shown at each node where support was less than 1000 (not universally supported across
1000 bootstrap replicates).

Figure 2.15: ML tree of 54 eukaryotes created from the concatenated alignments of genes
listed in Table 2.5. Support Values are proportion of individual gene trees, which show a
given clade (Bratke 2009).


Fig 2.16: ML tree of 54 eukaryotes with proportional branch lengths.


Figure 2.17: Proteome content phylogeny with bootstrap support only shown at each node
which was not 100% supported out of 100 bootstraps.

Tree Comparison
The results of the KH test carried out using PROML (Felsenstein 1989) showed that the
proteome content tree is a significantly worse hypothesis of the evolutionary relationships
between the organisms as shown in the table below.
Tree | Log likelihood
Protein alignment supermatrix ML tree | -131725.8
Proteome content parsimony tree | -133941.0

PROML (Felsenstein 1989) reported that the log likelihood of the proteome content tree was
significantly worse than that of the protein supermatrix tree.
2.4 Discussion
2.4.1 ML tree
In terms of current thought about super groups within eukaryotes the phylogeny
reconstructed as seen in Figure 2.13 is incongruent. However given that it is based on a
concatenation of nuclear genes this is not surprising: Parfrey et al. (2006) showed
that there is generally weak support for most putative eukaryotic super groups in phylogenies
built using proteins coded by nuclear genes. The super group Opisthokonta is however
supported, as would be expected from that work (Parfrey et al. 2006), though it does subsume
the Amoebozoa. The ML tree is consistent with known eukaryotic trees (Baldauf et al. 2000)
in placing Plantae as an outgroup to fungi and metazoans. The base of the tree is inconsistent
with known trees in its placement of E. histolytica as an early branching eukaryote with T.
vaginalis, when it is thought that E. histolytica branches higher up in the tree as a member of
the Amoebozoa super group (Parfrey et al. 2006). However there is no clear synapomorphy
(shared derived character) which defines the group Amoebozoa. There is also a lack of
unambiguous support for the existence of the group as a whole within the nuclear genome
(Parfrey et al. 2006). This placement is also not novel, as both organisms lack mitochondria
and have been grouped together at the base of Eukaryota by phylogenetic analyses of small
subunit (SSU) RNA genes, though their ultimate placement is not certain (Vanacova et al.
2003).

Within the animals the tree is consistent with the Coelomata hypothesis, which places
the Nematoda as an outgroup to both arthropods and vertebrates. This grouping is fairly
common in phylogenies derived using molecular data (Wolf et al. 2004) despite being held to
be false (Aguinaldo et al. 1997). It is thought to be an artefact of long branch attraction
due to rapid evolution along the C. elegans lineage (Telford 2004). The classes Mammalia (H.
sapiens, P. troglodytes, B. taurus, M. domestica, C. familiaris, M. musculus and R.
norvegicus), Aves (G. gallus) and Osteichthyes (T. rubripes, D. rerio) are all maintained in
the order in which they are generally found in most broad vertebrate phylogenies, e.g. Stuart
et al. (2002).
The tree also arranges its four plant species as expected (Rodriguez-Ezpeleta et al.
2005), with the alga O. lucimarinus forming an outgroup to the monocot O. sativa and the
two dicots A. thaliana and P. trichocarpa.
Within the fungi the tree is consistent with known fungal phylogenies (Fitzpatrick et
al. 2006). The subkingdom Dikarya is a separate clade. Within Dikarya, in the phylum
Ascomycota, the subphyla Saccharomycotina (S. cerevisiae, C. albicans, C. glabrata, P.
stipitis, D. hansenii, Y. lipolytica, K. lactis, A. gossypii), Taphrinomycotina (S. pombe) and
Pezizomycotina (A. fumigatus, A. niger, N. crassa, M. grisea) are grouped as separate clades.
Another phylum in Dikarya, Basidiomycota (U. maydis, C. neoformans), is a separate clade
within the tree. The microsporidium E. cuniculi branches out as an outgroup to the Dikarya.
Within Saccharomycotina the WGD group (fungi which have undergone whole genome
duplication) (C. glabrata, S. cerevisiae) is presented as a clade. The CTG group (fungi
which utilise the codon CTG to encode serine instead of leucine) (P. stipitis, D. hansenii, C.
albicans) (Fitzpatrick et al. 2006) is also proposed as a separate clade within the tree. Within
the CTG group the tree shows disagreement with some published trees (Wang et al. 2009a) by
placing P. stipitis and C. albicans together with D. hansenii as an outgroup. Figure 2.6 shows
that this fungal topology was highly supported by the bootstrap analysis.
The Chromoalveolates are grouped together in one clade. Within this clade the
Apicomplexa form a monophyletic group with the Ciliates as an outgroup.
These groupings are congruent with published trees (Burki et al. 2008; Rodriguez-Ezpeleta et
al. 2007).
2.4.2 Proteome content phylogeny
The proteome content phylogenetic tree as presented in Figure 2.17 shows a degree of
topological congruence with the tree based on concatenated protein sequences shown in
Figure 2.13 in that it preserves the animals as a monophyletic group. The branching order of
the taxa is however different. It also shows the fungi as a clustered group (though not
monophyletic). However given that PROML reports that it is a significantly worse fit to the
alignment of homologous proteins it is clearly a worse representation of the dataset than the
ML supermatrix tree.
2.4.3 Conclusion
The phylogeny reconstructed by applying maximum likelihood to the concatenated
supermatrix of 10 eukaryotic proteins appeared to be a plausibly accurate reflection of the
relationships between the taxa. This plausibility was assessed both by inspection by eye
and by comparison to previously published eukaryotic phylogenies.
The proteome content phylogeny reconstructed by Dollo parsimony on the other hand
was shown to be a significantly worse representation of the evolutionary relationships
between the species. As such it was the ML supermatrix tree that was utilised as the
framework for comparative analysis of protein function within the species.
The work described in this chapter also produced phylogenetic profiles for each human
protein across the 54 other species (53 eukaryotes and the archaeal outgroup).


Chapter 3
Comparison of methods of prediction of functional linkage in proteins
3.1 Introduction
This chapter presents the comparison of four systems of inferring functional links between
proteins using phylogenetic profiles. These systems were:
1) Hamming distance: Phylogenetic profiling was initially used without taking into
account species phylogeny and treating the state of each point in the profile as
independent (Pellegrini et al. 1999). Using profiling in this way entailed comparison
of profiles using the string comparison algorithm Hamming distance (Hamming
1950), which is a count of the points at which two strings differ.
2) Use of the comparative method (Barker and Pagel 2005; Pagel 1994) in the context of
phylogenetic profiling (Pellegrini et al. 1999) with constrained rates of gene gain
(Barker et al. 2007) over the phylogeny developed in Chapter 2 to detect protein
interactions. An implementation of the method, BayesTraits (Pagel et al. 2004a) was
used in order to calculate the relevant likelihoods.
3) Co-expression of mRNAs corresponding to given proteins: Proteins that physically
interact or are required to be produced in some form of spatio-temporal order tend
to show correlations (positive or negative) in the expression of their underlying
mRNA molecules. This method has been shown to be effective in detecting
interactions in Saccharomyces cerevisiae (von Mering et al. 2002) as well as in
Arabidopsis thaliana in combination with examination of other genomic features (De
Bodt et al. 2009). Use of this system presents a comparison of an un-curated
high-throughput physical experimental system with an equivalently un-curated
computational system.
4) Use of a Bayesian classifier to combine disparate sources of evidence comprising
gene co-expression, orthology, post translational modification, co-localisation,
intrinsic disorder, domain co-occurrence and network analysis data in order to predict
protein interactions (McDowall et al. 2009).
3.1.1 Hamming distance
The distance measure Hamming distance is named after its creator Richard Hamming, who
introduced it in his work (Hamming 1950). As mentioned above and in previous chapters, it
is the distance between two strings of equal length, calculated as a count of the positions at
which they differ. A string can be represented as a vector of characters. As an illustrative
example, given the two strings x and y as defined below:

x = [c, a, t]
y = [h, a, t]

the Hamming distance between the two strings is 1, as they differ by 1 character.
In the context of phylogenetic profile analysis, a Hamming distance of 1 between two
profiles would indicate that the genes/proteins under consideration differed by only one
species in their patterns of distribution.
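The calculation can be expressed in a few lines of Python. The function name is chosen for the example, and the two short profiles are invented to show the phylogenetic profile case.

```python
def hamming_distance(x, y):
    """Count the positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(x, y))


word_distance = hamming_distance("cat", "hat")
# Two toy profiles that differ in a single species (the fourth position).
profile_distance = hamming_distance("100110", "100010")
```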
3.1.2 Comparative method
As mentioned in the introductory chapter the comparative method involves the examination
of the association of a given trait in an organism with another variable in the context of a
phylogenetic tree (Harvey and Pagel 1991). This variable can be another trait or potentially
an environmental factor. A prerequisite for this form of analysis is reconstructing putative
ancestral states at each hypothetical ancestral node within the tree (Pagel 1994). Using this
method in the context of phylogenetic profile analysis involves testing whether the presence
or absence of a given gene is associated with the presence or absence of a second gene or
protein.
The implementation of the comparative method used in this chapter is based on the
approach introduced by Mark Pagel (Pagel 1994) and utilised in the context of phylogenetic
profiling by Barker and Pagel (Barker and Pagel 2005). This approach is preferred to other
applications of the comparative method (e.g. work by Wayne Maddison (Maddison 1990)
and work by Mark Ridley (Ridley 1983)) due to the fact that it does not depend on a single
set of reconstructed ancestral states over a tree but instead calculates test statistics based on
all possible ancestral states (Barker and Pagel 2005; Pagel 1994). An explanation of how
Barker and Pagel (Barker and Pagel 2005) utilised the framework established in previous
work by Pagel (Pagel 1994) in order to analyse phylogenetic profiles follows.
3.1.2.1 Phylogenetic profile analysis using the comparative method
Imagine a gene or protein G1, which exists in a group of species and a phylogenetic tree
representing the evolutionary relationships between the members of the group. At any given
internal node (hypothetical ancestor) within the tree G1 can either be present or absent. This
state will be denoted as 0 for absent and 1 for present.
Over a given branch if the state of G1 at the ancestral node is 0 then there is a
probability of a gain, i.e. moving to state 1 at the descendant node over the time period
represented by the branch. Conversely if the ancestral state is 1 then there is a corresponding
probability of a loss. These probabilities are represented as P01(t) and P10(t) where t is equal
to the time interval represented by the branch. There are also the probabilities of no
transitions which are represented by P00(t) and P11(t). The probabilities P01(t) and P10(t) can
also be considered the rate of transitions.
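The relationship between instantaneous rates and the branch probabilities P00(t), P01(t), P10(t) and P11(t) can be made concrete for a single gene. The sketch below uses the standard closed-form solution for a two-state continuous-time Markov process; the parameter names gain_rate and loss_rate are assumptions introduced for the example and do not correspond to named parameters in the cited work.

```python
import math


def transition_probabilities(gain_rate, loss_rate, t):
    """Branch probabilities for one gene evolving between absent (0) and present (1)."""
    total = gain_rate + loss_rate
    if total == 0:
        # With no gains or losses the state can never change.
        return {"P00": 1.0, "P01": 0.0, "P10": 0.0, "P11": 1.0}
    change = 1.0 - math.exp(-total * t)
    p01 = gain_rate / total * change  # absent at ancestor, present at descendant
    p10 = loss_rate / total * change  # present at ancestor, absent at descendant
    return {"P00": 1.0 - p01, "P01": p01, "P10": p10, "P11": 1.0 - p10}


probabilities = transition_probabilities(gain_rate=0.2, loss_rate=0.6, t=1.5)
```

For a branch of length zero no change is possible, and for very long branches P01(t) approaches gain_rate / (gain_rate + loss_rate) regardless of the ancestral state.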
The comparative method as applied to phylogenetic profiling is an examination of
whether the state of a second gene/protein has an effect on the state of the first. Thus
introducing a second gene / protein G2 there is a corresponding set of probabilities for G2. In
order to examine whether the state of G2 has an effect on the state of G1 the probabilities or
transition rates P01(t), P10(t), P00(t) and P11(t) for G1 can be split in order to factor in the state
of G2. Thus for example P01(t) can be split into two probabilities one corresponding to the rate
of gain of G1 if G2 is present and the other corresponding to the rate of gain of G1 if G2 is
absent. These transition rates were the basic parameters used by Barker and Pagel (Barker
and Pagel 2005) as represented in the following figure.

Figure 3.1: Parameters for modelling state transitions for pairs of genes as used by
Barker and Pagel (Barker and Pagel 2005) (Figure is directly reproduced).

Assuming the numbers at the corners of the above figure represent the states of G1 and G2,
the parameters as presented above represent the following rates of transition.
Parameter | Description
q13 | Rate of gain for G1 given the absence of G2.
q31 | Rate of loss for G1 given the absence of G2.
q12 | Rate of gain for G2 given the absence of G1.
q21 | Rate of loss for G2 given the absence of G1.
q34 | Rate of gain for G2 given the presence of G1.
q43 | Rate of loss for G2 given the presence of G1.
q24 | Rate of gain for G1 given the presence of G2.
q42 | Rate of loss for G1 given the presence of G2.

Table 3.1: Description of rate parameters used by Barker and Pagel (Barker and Pagel 2005).
Given these rates it is possible to investigate whether the state of G2 has an effect on the
transition rates for G1. In order to carry out this investigation two competing models /
hypotheses were constructed using these parameters (Barker and Pagel 2005). One was a
dependent model, in which the presence of G1 is somehow contingent on the presence or
absence of G2; the other was an independent model, in which there is no such connection
(Barker and Pagel 2005). The dependent model makes the assumption that the rate of gain/loss
of G1 is somehow affected by the state of G2. Thus for example the rate of gain of G1 in the
presence of G2 (q24) will be different from the rate of gain of G1 in the absence of G2 (q13).
Conversely the independent model makes the assumption that there is no effect on the
transition rates of G1 by the state of G2. To detect gains and losses over a phylogenetic tree
Barker and Pagel calculated the likelihood of these two competing hypotheses about the
distribution of pairs of proteins in the constituent species (Barker and Pagel 2005). The
premise of the work was that the dependent model would prove a better fit to observed data if
the transition rate for a given protein were affected by the state of the other. In order to
detect correlated evolution the two competing models were thus defined as follows:


1) The independent model of evolution, where the probabilities of gain and loss of G1
were independent of the state of G2. In order to create this model the parameters
involving gain and loss of G1 were constrained to be equal irrespective of the state of G2
and vice versa. Using the symbols for the rates shown in Figure 3.1 this entailed
setting the transition rates as q13=q24, q31=q42, q21=q43 and q12=q34. This reduced the
number of parameters for the independent model to four.

2) A dependent or correlated model of evolution that utilised all eight transition rate
parameters.
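The two models can be sketched as rate matrices over the four joint states of the gene pair. The state numbering (1: both absent, 2: only G2 present, 3: only G1 present, 4: both present) is inferred from the parameter descriptions in Table 3.1. The sketch also assumes, as in Pagel's (1994) formulation, that the two genes do not change state at the same instant, so the rates between states 1 and 4 and between states 2 and 3 are zero. Function names and numerical rates are hypothetical.

```python
ALLOWED = [(1, 2), (1, 3), (2, 1), (2, 4), (3, 1), (3, 4), (4, 2), (4, 3)]


def rate_matrix(q):
    """Build the 4x4 rate matrix from a dictionary mapping (i, j) to q_ij."""
    matrix = [[0.0] * 4 for _ in range(4)]
    for i, j in ALLOWED:
        matrix[i - 1][j - 1] = q[(i, j)]
    for i in range(4):
        # Each diagonal entry makes its row sum to zero.
        matrix[i][i] = -sum(matrix[i])
    return matrix


def independent_rates(gain_g1, loss_g1, gain_g2, loss_g2):
    """Four free parameters: each gene's rates ignore the state of the other."""
    return {
        (1, 3): gain_g1, (2, 4): gain_g1,  # q13 = q24
        (3, 1): loss_g1, (4, 2): loss_g1,  # q31 = q42
        (1, 2): gain_g2, (3, 4): gain_g2,  # q12 = q34
        (2, 1): loss_g2, (4, 3): loss_g2,  # q21 = q43
    }


independent_q = rate_matrix(independent_rates(0.1, 0.4, 0.2, 0.3))
# The dependent model supplies eight separate values, one per allowed transition.
dependent_q = rate_matrix({pair: 0.05 * (k + 1) for k, pair in enumerate(ALLOWED)})
```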
Thus given a set of phylogenetic profiles and a phylogenetic tree, the values of the parameters that maximised the likelihood of each of the two models were calculated in turn between pairs of profiles. These likelihoods were calculated by summing the likelihoods of all possible ancestral reconstructions at each internal node of the tree, thus removing the need to rely on a single set of reconstructed ancestral states. Having calculated the likelihoods of both models, the goodness of fit of the models to the observed data was compared using the likelihood ratio statistic, LR. This can be calculated using the following equation (Yang 2006):

$$LR = -2\,(\ln(H_0) - \ln(H_1)) \qquad (1)$$

As applied to detection of correlated evolution, H0 was the likelihood of the independent model of evolution and H1 was the likelihood of the dependent model of evolution (Barker et al. 2007; Barker and Pagel 2005).
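As a minimal sketch of Equation 1, the statistic can be computed directly from the two maximised log-likelihoods. The function name and the numerical values below are hypothetical and chosen only for illustration; they are not output from BayesTraits.

```python
def likelihood_ratio(lnl_independent, lnl_dependent):
    """Likelihood ratio statistic LR = -2 (ln H0 - ln H1).

    Both arguments are maximised log-likelihoods: H0 is the independent
    model and H1 the dependent model.
    """
    return -2.0 * (lnl_independent - lnl_dependent)


# Hypothetical log-likelihoods for one pair of profiles.
example_lr = likelihood_ratio(-120.0, -90.73)
```

Because the independent model is a constrained version of the dependent model, the dependent log-likelihood should not be lower than the independent one when both are properly maximised, so LR is expected to be non-negative.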
Further work showed that constraining the rate of gain of a gene to a preset low level,
as a potentially more realistic representation of biological reality, improved the
ability of the method to detect functional linkages in Saccharomyces cerevisiae (Barker et al.
2007). This entailed specifying values for the parameters connected to gene gain, specifically
q13, q24, q34 and q12. The rate of gene gain is particularly low in eukaryotes, where horizontal gene
transfer is relatively rare (Whitaker et al. 2009). The rate of de novo generation of genes over
a relatively short time scale has also been observed to be extremely low. A study detected
three potential genes in the human genome, which were generated de novo since the split
with P. troglodytes (Knowles and McLysaght 2009) (estimated at around 4 million years
(Hobolth et al. 2007)).

This method of fitting correlated and uncorrelated models of evolution for
pairs of proteins using maximum likelihood (ML) (Barker and Pagel 2005) while
constraining the rate of protein gain (Barker et al. 2007) shall from here on be referred to as
constrained ML.
3.1.3 Co-expression as measured by microarray
As mentioned in Chapter 1 a microarray is a chip usually made of glass, with fluorescently
labelled oligonucleotide probes representing subsections of genes. The degree of fluorescence
from these probes corresponds to the abundance of a given mRNA in a sample and therefore
the level of expression (Quackenbush 2002; Wodicka et al. 1997). In a typical microarray
experiment cells are subjected to different treatments, or harvested from organisms with
differing phenotypic or disease states (e.g. cancerous tissue vs. normal in human cancer
patients). The probes on a given microarray chip are generally designed to map to within a
coding region of a given gene (Brown 2006).
To establish whether a given experimental condition corresponds with differential
expression of a given gene, a descriptive statistic illustrating the central tendency of
expression (usually the median) is calculated for the entire set of samples. If a given gene is
found to be expressed at a statistically significantly higher level than the central value, this
gene is interpreted as up-regulated. Similarly if a gene is expressed at a significantly lower
level then that gene is interpreted as down-regulated. Probes on a microarray chip can map to
a single gene, or members of a gene family depending on the specificity of the probe design
(Heyer et al. 1999).
3.1.4 Bayesian classifier
The fourth system of prediction of functional linkage to be considered was that utilised by the
PIPs server (http://www.compbio.dundee.ac.uk/www-pips)(McDowall et al. 2009). This
system (Scott and Barton 2007) utilised a combination of sources of evidence for protein
interactions. These sources were:

- Gene co-expression: As described above, a correlated shift in gene expression patterns in response to a given environmental stimulus can be used to detect protein interactions.
- Orthology: If a given pair of proteins is orthologous to a pair of proteins in another species that are known to interact then this interaction annotation can be ported from one species to the other (Yu et al. 2004a).
- Subcellular localisation, domain co-occurrence, and posttranslational modification co-occurrence: These features of a protein can also be informative as to its interaction partners. The PIPs system (Scott and Barton 2007) combines these as a joint source of evidence.
- Protein disorder: This measure is based on the observation that the unstructured regions within protein molecules are often involved in transient protein interactions. Singh et al. (2007) showed that intrinsic disorder is enriched in date hubs, proteins that maintain multiple interactions but at different times.
- Network topology similarity: This measure utilises the principle that proteins that interact will share other interacting partners.

These five predictors are combined using a naïve Bayesian classifier to generate a single score based on the posterior odds ratio of interaction after calculation of likelihood ratios over each of the individual predictor modules (Scott and Barton 2007).
In order to explain what an odds ratio is it is necessary first to define odds. Odds are a
method of presenting the probability of an event by relating this probability to the probability
of the event not occurring. Thus the odds of an event are simply the probability of an event
occurring divided by the probability of the event not occurring (Sokal and Rohlf 1995). The
odds of a given event can be calculated by the equation:

$$\mathrm{odds}(e) = \frac{p(e)}{1 - p(e)} \qquad (2)$$

An odds ratio is thus the ratio of two odds. It can be used to measure the effect size of a given factor on the probability of an event. Thus if, for example, the probability of heart disease in people who consume a high fat diet is calculated as 1/4 and the converse probability of heart disease in individuals who do not consume a high fat diet is calculated as 1/8, the odds of having heart disease with a high fat diet are, via Equation 2, approximately 0.33 and the odds of having heart disease with a low fat diet are approximately 0.14. Thus the odds ratio would be calculated as 0.33/0.14, which is roughly equal to 2. Thus in this example a high fat diet roughly doubles the odds of heart disease.
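The heart disease example can be checked with exact fractions. The sketch below simply applies Equation 2 to the two probabilities quoted in the text (1/4 and 1/8); the function names are illustrative.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p (Equation 2)."""
    return p / (1 - p)


def odds_ratio(p_exposed, p_unexposed):
    """Ratio of the odds in the exposed group to the odds in the unexposed group."""
    return odds(p_exposed) / odds(p_unexposed)


high_fat_odds = odds(Fraction(1, 4))                    # 1/3, about 0.33
low_fat_odds = odds(Fraction(1, 8))                     # 1/7, about 0.14
diet_ratio = odds_ratio(Fraction(1, 4), Fraction(1, 8))  # 7/3, about 2.3
```

Working with `Fraction` avoids rounding the intermediate odds, which is why the exact ratio (7/3) is slightly larger than the value obtained from the rounded odds.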

The posterior odds ratio utilised by PIPs (Scott and Barton 2007) was calculated from a prior odds ratio, itself derived from a prior probability of interaction estimated as 1/400. This prior odds ratio was then multiplied by the likelihood ratios yielded by each of the individual predictor modules. The product of this calculation is the posterior odds ratio. As in the example, the posterior odds ratio corresponds to the posterior probability of interacting, e.g. a score of 2 translates to the probability of interacting being twice as high as the probability of not interacting (McDowall et al. 2009; Scott and Barton 2007).
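The combination step described above can be sketched as follows. This is an illustration of the naive Bayes arithmetic only, not the PIPs code: the prior probability of 1/400 is taken from the text, while the function names and the module likelihood ratios used in the example are hypothetical.

```python
from math import prod

PRIOR_PROBABILITY = 1 / 400  # prior probability of interaction quoted in the text


def posterior_odds(likelihood_ratios, prior_probability=PRIOR_PROBABILITY):
    """Multiply the prior odds by the likelihood ratio of each predictor module."""
    prior_odds = prior_probability / (1 - prior_probability)  # 1/399
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(o):
    """Convert odds back to a probability."""
    return o / (1 + o)


# Hypothetical likelihood ratios from two modules.
score = posterior_odds([2.0, 399.0])       # posterior odds of about 2
probability = odds_to_probability(score)   # about 2/3
```

With no evidence at all (an empty list of likelihood ratios) the score stays at the prior odds of 1/399, so the individual modules must jointly contribute a likelihood ratio of roughly 400 before an interaction becomes more likely than not.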
3.2 Methods
3.2.1 Assessing quality
A common way to measure the accuracy of a binary classification system is through
sensitivity and precision. Sensitivity can be defined as the probability that a true
positive is predicted as such, and precision as the probability that a positive prediction
is correct (Baldi and Brunak 2001). In order to calculate these measures some
terminology must be introduced.
True positives (TP): The number of positive predictions made by a binary classifier that
lie within the positive training set.
False positives (FP): The number of positive predictions made by a binary classifier that
lie within the known negative training set.
False negatives (FN): The number of items in the known positive set, which were not
predicted by a binary classifier.
Given these values precision and sensitivity can be calculated as follows (Baldi and
Brunak 2001; Barker et al. 2007; von Mering et al. 2003):

$$\text{precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$
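Equations 3 and 4 translate directly into set operations over protein pairs. The following is a minimal sketch under the assumption that pairs are unordered, so that A-B and B-A count as the same pair; the function name and the toy identifiers in any usage are hypothetical.

```python
def precision_sensitivity(predictions, positives, negatives):
    """Return (precision, sensitivity) for a set of predicted pairs.

    Each argument is an iterable of 2-tuples of protein identifiers.
    Pairs are treated as unordered by converting them to frozensets.
    """
    predicted = {frozenset(pair) for pair in predictions}
    pos = {frozenset(pair) for pair in positives}
    neg = {frozenset(pair) for pair in negatives}

    tp = len(predicted & pos)   # predictions found in the positive set
    fp = len(predicted & neg)   # predictions found in the negative set
    fn = len(pos - predicted)   # positive pairs that were not predicted

    precision = tp / (tp + fp) if (tp + fp) else None
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    return precision, sensitivity
```

Predictions that fall in neither the positive nor the negative set do not contribute to either measure, which mirrors the definitions of TP, FP and FN given above.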


3.2.2 Training and test data.


In order to calculate these values it was necessary to acquire data on known positive
interactions. Thus the data used by Scott in her development of PIPs (Scott and Barton 2007)
which was in turn derived from the HPRD (Human Protein Reference Database) (Mishra et
al. 2006) was acquired. This dataset contained 25,013 predicted protein interactions. The
dataset was then compared with the set of human proteins contained in the version of RefSeq
(Pruitt et al. 2005) downloaded in Chapter 2 and the overlap was kept leaving a positive
dataset of 6,106 proteins and 18,322 protein pairs.
A negative dataset was generated by creating a set of all possible pairs of proteins
from the full set of human proteins. In order to filter out proteins which could potentially
interact the full set of GO (Gene Ontology)(Ashburner et al. 2000) terms associated with each
protein was downloaded. All pairs with any overlaps in associated GO terms were then
excluded. As with the positive set the negative dataset was compared to the set of human
proteins contained in the locally held version of RefSeq (Pruitt et al. 2005) and the overlap
preserved. As a final check the negative set was compared to the positive set in order to
examine whether there was any overlap. There was an overlap of 9,568 protein pairs
between the positive and negative datasets or 52% of the positive set. This suggests that the
use of specific GO terms is not better than the selection of random pairs as a procedure for
the generation of a negative set. However as previous work such as the PIPs procedure (Scott
and Barton 2007) considered in this chapter, utilised solely random pairs, it was decided that
this procedure of GO + HPRD filtration was an improvement on this process. The process
resulted in a negative dataset of 3,216 proteins and 207,952 protein pairs.
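The filtering logic for the negative set can be sketched as below. This is an illustrative reconstruction rather than the code actually used: it assumes a mapping from each protein to its set of GO terms and treats pairs as unordered. The function name and the identifiers in any usage are hypothetical.

```python
from itertools import combinations


def candidate_negative_pairs(go_terms, known_positives):
    """Build a negative set from proteins that share no GO term.

    go_terms: dict mapping protein identifier -> set of GO term identifiers.
    known_positives: iterable of 2-tuples of known interacting proteins.
    """
    positives = {frozenset(pair) for pair in known_positives}
    negatives = set()
    for a, b in combinations(sorted(go_terms), 2):
        if go_terms[a] & go_terms[b]:
            continue  # any shared GO term: the pair might interact
        if frozenset((a, b)) in positives:
            continue  # known interaction: remove overlap with the positive set
        negatives.add(frozenset((a, b)))
    return negatives
```

Enumerating all pairs in this way is quadratic in the number of proteins, so for a full proteome it would be more practical to stream the pairs than to hold them all in memory.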
To use the datasets effectively as an objective measure of quality as well as a training
tool for the evaluation of optimal rates at which to constrain levels of gene gain (Barker et al.
2007) for BayesTraits (Pagel et al. 2004a) the datasets were randomly split into two halves.
This was done in order to cross validate the predictive power of any proposed optimal rate of
gain. This process yielded a positive training set of 4,868 proteins / 9,161 protein pairs and a
negative training set of 3,216 proteins / 103,971 protein pairs. The second half of the dataset
was marked as testing data and contained a positive testing set of 4,796 proteins / 9,161
protein pairs and a negative testing set of 3,215 proteins / 103,974 protein pairs. The sizes of
the two negative sets are uneven as 5 pairs of proteins had to be removed from the negative
training set, as they were present in a B-A orientation in the positive set. Similarly two pairs
of proteins had to be removed from the negative testing set.
The ratio of the size of negative to positive datasets in this case was roughly 11 to
1. This is biologically unrealistic as current estimates of the size of the full human interactome
range from 154,000-369,000 (Hart et al. 2006) to 650,000 (Stumpf et al. 2008). Stumpf
estimated the potential size of the interactome by treating known experimentally verified data
as a sub-network of the true network and extrapolating from the sub-network to the full
network (Stumpf et al. 2008). Hart on the other hand employed the idea that two independent
samples (experiments) from the complete interactome or subspace of the interactome of size
N would be expected to share k interactions by random chance under the hypergeometric
distribution (Hart et al. 2006). Thus Hart estimated the size of N using actually observed
intersections between experiments (Hart et al. 2006).
If these numbers are subtracted from the number of all potential interactions,
1,120,441,729 (calculated as all possible pairs from the version of RefSeq held), the remaining
ratios of negative to positive range from 1,722:1 to 8,617:1. For any full genome-wide survey
it would be necessary to scale all the precision and sensitivity scores from the training set
ratios to ratios constructed from estimates of the interactome size. This issue is addressed
more fully in Chapter 5.
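The ratio calculation can be illustrated with a short sketch. It assumes that the total number of candidate pairs quoted above reads as 1,120,441,729 and uses the interactome estimate of 650,000 from Stumpf et al. (2008); the function name is hypothetical.

```python
TOTAL_PAIRS = 1_120_441_729  # all possible pairs, as read from the text (assumption)


def negative_to_positive_ratio(interactome_size, total_pairs=TOTAL_PAIRS):
    """Number of non-interacting pairs per interacting pair."""
    return (total_pairs - interactome_size) / interactome_size


# Ratio implied by the largest interactome estimate quoted in the text.
stumpf_ratio = negative_to_positive_ratio(650_000)

# Ratio of negative to positive pairs in the training data (103,971 : 9,161).
training_ratio = 103_971 / 9_161
```

The gap between the training ratio (about 11:1) and the ratio implied by the interactome estimate (over 1,700:1) is the reason precision measured on the training data cannot be read directly as genome-wide precision.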
3.2.3 Hamming distance
The ability of Hamming distance to differentiate between the positive and negative training
set was measured with a lower distance corresponding to a higher score. Precision/sensitivity
were evaluated at every integer within a range of Hamming distance cut-offs ranging from 0
to 54.
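As a minimal sketch of this evaluation, the Hamming distance between two presence/absence profiles is the number of species in which they differ, and pairs at or below a cut-off are predicted to be linked. The profile encoding (strings of 0 and 1) and the function names are assumptions made for illustration.

```python
def hamming_distance(profile_a, profile_b):
    """Number of positions at which two equal-length profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def predict_linked(profiles, pairs, cutoff):
    """Predict as linked every pair whose profiles differ in at most `cutoff` species."""
    return {
        frozenset((a, b))
        for a, b in pairs
        if hamming_distance(profiles[a], profiles[b]) <= cutoff
    }
```

Sweeping `cutoff` over the integer range used in the text and scoring each prediction set against the training data yields one precision/sensitivity point per cut-off.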
3.2.4 Constrained ML
To use phylogenetic profiling in a phylogenetically aware manner to detect correlations in
gain and loss the software package BayesTraits was utilised (Pagel et al. 2004a). This has
been used in previous work (Barker and Pagel 2005) to demonstrate that detection of
correlations in gain and loss of particular genes can be used as a tool with which to detect
functional interactions.
The script bms_runner (Barker et al. 2007) was used to examine the performance of
different rates of gain in predicting functional interactions amongst the training sets in order
to select an optimal rate. The script utilised the phylogenetic profiles and phylogeny
described in Chapter 2 as well as the positive and negative training sets to evaluate the
performance of different rates of gain. bms_runner (Barker et al. 2007) creates input for the
program BayesTraits (Pagel et al. 2004a) to evaluate the relative likelihood of correlated
evolution at a range of rates of gain. bms_runner creates a non-redundant set of profiles
(Barker et al. 2007) before passing them on to BayesTraits for comparisons. Thus 113,132
protein pairs in the training set were reduced to a set of 54,906 non-redundant pairs of
profiles.
A number of rates of gain were evaluated for precision and sensitivity over the
training data, ranging from $1 \times 10^{-6}$ up to placing no restriction on gain. An LR score was
calculated for each profile pair at each rate of gain and assigned to each protein pair corresponding
to that profile pair. bms_runner then evaluates precision and sensitivity at a range of cut-offs
commencing at the minimum LR encountered and moving up by a decreasing interval
until a value close to the maximum LR is reached (Barker et al. 2007). The program then
provides a table that includes the following information for this range of cut-offs.
| LR cutoff | No of predictions | Precision | Sensitivity |
|---|---|---|---|

Table 3.2: Column headings for data matrix returned by bms_runner (Barker et al. 2007).
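A table with these four columns can be produced by a simple sweep over cut-offs. The sketch below is not bms_runner itself: the cut-offs are supplied by the caller rather than generated with a decreasing interval, and treating a pair as predicted when its LR is greater than or equal to the cut-off is an assumption.

```python
def sweep_cutoffs(scores, positives, negatives, cutoffs):
    """Return rows of (LR cutoff, no. of predictions, precision, sensitivity).

    scores: dict mapping frozenset({protein_a, protein_b}) -> LR statistic.
    positives, negatives: sets of frozenset pairs.
    """
    rows = []
    for cutoff in cutoffs:
        predicted = {pair for pair, lr in scores.items() if lr >= cutoff}
        tp = len(predicted & positives)
        fp = len(predicted & negatives)
        precision = tp / (tp + fp) if (tp + fp) else None
        sensitivity = tp / len(positives) if positives else None
        rows.append((cutoff, len(predicted), precision, sensitivity))
    return rows
```

Raising the cut-off can only shrink the prediction set, so sensitivity never increases along the sweep, while precision may rise or fall depending on which pairs drop out.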
3.2.5 Co-expression of mRNA
The co-expression of two genes in association with a given environmental condition can be
considered a potential indicator of functional linkage. In order to examine the performance of
the ML reconstruction method in predicting protein functional interactions against the co-expression of mRNA, the results of all microarray experiments held in the EBI's ArrayExpress
database were downloaded.
This data was pre-processed and thus contained expression data at the gene level
rather than at the probe level. As oligonucleotide probes only map to small subsections of a
gene and can also hybridise with multiple targets, the relationship between probes and genes is
many-to-many. This many-to-many relationship was collapsed by data processing carried out
by ArrayExpress on each individual experiment.
Thus a total of 377 experiments were downloaded. Each experiment record contained
information on genes whose expression level varied significantly in response to the
experimental treatment/tissue state. The size of individual experiments ranged from 1
gene to a maximum of 15,987. The mean number of genes showing significant variation per
experiment was approximately 3,143.
A sample line from the downloaded data is shown below for illustrative purposes:
| Gene Symbol | Ensembl ID | Species | Factor | Value | Accession | Expression | p Value |
|---|---|---|---|---|---|---|---|
| STAT1 | ENSG00000115415 | Homo sapiens | Disease state | normal | E-GEOD-3790 | DOWN | 0.0423247888020165 |

Table 3.3: Sample processed data from ArrayExpress for experiment E-GEOD-3790 (Hodges
et al. 2006).
E-GEOD-3790 is a study on gene expression in brain tissue afflicted with Huntington's
disease (Hodges et al. 2006). The factor column has a number of potential values
corresponding to the annotation of the individual samples. In this case it corresponds to
whether the tissue comes from a patient diagnosed with Huntington's disease as opposed to
normal tissue. The value column shows the value of the factor (in this case normal). The p
value column shows the significance of the identified differential expression. Thus the data
presented above shows that the gene STAT1 is significantly (p<0.05) down-regulated in
tissue annotated as normal.
To use the training datasets to measure the ability of gene co-expression to predict
protein interactions it was necessary to convert the training data from protein pairs to gene
pairs. Using a translation key provided by the International Protein Index (IPI) (Kersey et al.
2004) each RefSeq Gi number was mapped to the associated gene name. As individual genes
can produce multiple proteins via the process of alternative splicing there is not a one-to-one
correspondence between the number of genes and the number of proteins in either training set.
With some Gi entries missing in the translation key this created a positive gene training set of
4,319 genes / 8,057 gene pairs and a negative set of 2,833 genes / 89,549 gene pairs.
In order to predict functional linkage the following procedure was followed.

- The data in each microarray experiment was split according to experimental condition.
- For each experimental condition, pairs of genes were marked as functionally linked if their expression levels both went up or both went down in response to the given experimental condition.
- True positives were counted if the genes existed as a pair in the positive set.
- False positives were counted if the genes existed as a pair in the negative set.
- False negatives were counted as the pairs in the positive set that were absent from the set of predictions.

This data was processed using a program implemented in Java, which followed the steps
presented below. The input to the program was a file containing a set of lines as shown above
for a single experiment.

- Create a non-redundant set of factors present in the experiment.
- Create a non-redundant set of values for each factor.
- Create an empty set L to hold functionally linked pairs.
- For each factor F:
  - For each factor value FV:
    - For each expression value E (UP or DOWN):
      - Create a set S of all genes possessing the attributes {F, FV, E} where the associated p value for E is less than 0.05.
      - All genes within S are declared functionally linked.
      - All possible pairs of genes within S are added to L.
- L is now evaluated against the training data for precision/sensitivity.
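The steps above can be sketched in Python as follows. The original program was written in Java and is not reproduced here; this is an illustrative re-implementation under the assumption that each input record has been parsed into a tuple of (gene, factor, factor value, expression, p value). Gene names other than STAT1 in any usage are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def linked_pairs(records, alpha=0.05):
    """Return the set L of gene pairs declared functionally linked.

    records: iterable of (gene, factor, value, expression, p_value) tuples,
    where expression is "UP" or "DOWN".
    """
    # Group significant genes by (factor, factor value, expression direction).
    groups = defaultdict(set)
    for gene, factor, value, expression, p_value in records:
        if p_value < alpha:
            groups[(factor, value, expression)].add(gene)

    # Every pair of genes within a group is declared functionally linked.
    linked = set()
    for genes in groups.values():
        linked.update(frozenset(pair) for pair in combinations(sorted(genes), 2))
    return linked
```

Grouping by the triple in a single pass is equivalent to the three nested loops in the description, since each set S is exactly one group. The resulting set can then be scored against the positive and negative training sets.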

3.2.6 Bayesian classifier


The PIPs server offers its data split into six score cut-offs: 0.25, 1, 2.5, 25, 250 and 2500
(McDowall et al. 2009). These datasets were downloaded and evaluated for precision and
sensitivity at each of these cut-offs. For each cut-off the server provides a file with pairs of
proteins and the associated posterior odds ratio score. Each pair of proteins within each file
was declared to be functionally linked and then evaluated against the training data. This
provided associated precision/sensitivity scores for each cut-off.

3.3 Results
3.3.1 Hamming distance
Figure 3.2 shows the performance of phylogenetic profiling over the training sets using
Hamming distance.

Figure 3.2: Performance of phylogenetic profiling using Hamming distance over the training
data.
Hamming distance as a measure does not perform well over the training data
achieving a maximum precision of 0.08796.

3.3.2 Constrained ML
The ability of constrained ML (Barker et al. 2007; Barker and Pagel 2005) to distinguish
between the training data was tested at a number of rates of gain. The results of this can be
seen in Figure 3.3.

Figure 3.3: Performance of constrained ML (Barker et al. 2007; Barker and Pagel 2005) over
training data at different rates of gain.
Figure 3.3 shows points at a range of sensitivity between 0-1. Sensitivity over the
whole training set ranges between 1 at LR cut-offs of 0 where all pairs of proteins are
predicted to be functionally linked to 0 at the points where no pairs from the positive set are
predicted to be functionally linked.
Precision ranges from 0.0809 (a base level that is derived from the ratio of the size of
the positive set to the size of the negative set) up to 1 which is the point at which all
predictions made at a given LR cut-off lie in the positive set, i.e. are true positives.
Of the two metrics (precision/sensitivity) it is precision that appears to be the
strong suit of constrained ML. This is probably due to the fact that correlated evolution will
not occur in all cases of protein interactions. A large number of protein interactions will
contain members that are phylogenetically ubiquitous. In some of these cases the interaction
will be essential to maintenance of normal eukaryotic cellular function. In other cases even if
an interaction is being lost or gained in an organism, its individual members might still be
present (the interaction being lost due to some form of temporal/spatial separation of the
members). Thus, as low sensitivity is inevitable with this method, it was decided to focus on
rates of gain that achieve 100% precision. Figure 3.4 places all rates of gain on a single plot
and zooms into a range of sensitivities between 0 and 0.001.


Figure 3.4: Performance of constrained ML (Barker et al. 2007; Barker and Pagel 2005) over
training data magnified to a scale of sensitivity ranging from 0 to 0.001. For clarity some of
the worse performing rates of gain are removed.
Figure 3.4 shows that the rate 0.025 is the clear best performer as it delivers
predictions with a precision of 1 at the highest sensitivity. The LR cut-off at this point is
58.54. The sensitivity at this cut-off for the rate 0.025 is 0.000545. This rate is thus chosen as
the exemplar rate to represent the method in comparisons and to utilise for further analysis.
The findings of Barker et al. (2007) were borne out in this investigation as lower rates of gain
were generally seen as the best performers.

Over the training data constrained ML (Barker et al. 2007; Barker and Pagel 2005)
with the rate of gain constrained to 0.025 makes five predictions from the positive set shown
in Table 3.1.

| RefSeq Accessions | Annotation Protein A | Annotation Protein B | Interaction type | Verified By |
|---|---|---|---|---|
| NP_001789, NP_004144 | Cell division protein kinase 2 | Origin recognition complex subunit 1 | Direct | Protein microarray (Ramachandran et al. 2004) |
| NP_005617, NP_006266 | Splicing factor, arginine/serine-rich | Splicing factor, arginine/serine-rich 6 | Complex | Site-directed mutagenesis (Monsalve et al. 2000) |
| NP_001789, NP_001790 | Cell division protein kinase 2 | Cyclin dependent kinase 7 | Direct | In-vitro experimentation (Garrett et al. 2001) |
| NP_001347, NP_003391 | DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 3 | Exportin 1 | Direct | In-vivo/in-vitro experimentation (Yedavalli et al. 2004) |
| NP_066953, NP_000935 | Peptidyl-prolyl cis-trans isomerase A | Serine/threonine-protein phosphatase 2B catalytic subunit alpha | Direct | Yeast 2-hybrid (Stelzl et al. 2005) |

Table 3.1: True positive proteins predicted by constrained ML (Barker et al. 2007; Barker
and Pagel 2005) from the training data with rate of gain constrained to 0.025 at an LR cutoff
of 58.54.

The examination of the training datasets yielded a similar result to that of Barker et al. (2007),
in that lower rates of gain tended to perform better.

As 0.025 was selected as the optimum rate of gene gain over the training data it was
also tested on the testing data to cross validate this selection. Figure 3.5 shows the results of
this cross validation check.

Figure 3.5: Performance of constrained ML (Barker et al. 2007; Barker and Pagel 2005) over
test data with rate of gain constrained to 0.025.
Figure 3.5 shows that constraining the rate of gain to 0.025 also achieves a precision
of 1 over the testing data. This precision occurs at an LR cutoff of 53.3. The sensitivity at
this point is 0.00054. At this cutoff constrained ML (Barker et al. 2007; Barker and Pagel
2005) makes 5 predictions that are true positives as shown in Table 3.2.


| RefSeq Accessions | Annotation Protein A | Annotation Protein B | Interaction type | Verified By |
|---|---|---|---|---|
| NP_002583, NP_001347 | proliferating cell nuclear antigen | ATP-dependent RNA helicase DDX3X isoform 1 | Direct | In-vitro experimentation (Ohta et al. 2002) |
| NP_001118, NP_001119 | adaptor-related protein complex 1 beta 1 subunit isoform a | AP-1 complex subunit gamma-1 isoform b | Direct | Yeast 2-hybrid (Takatsu et al. 2001) |
| NP_000391, NP_001790 | TFIIH basal transcription factor complex helicase XPD subunit isoform 1 | cell division protein kinase 7 | Direct | In-vitro experimentation (Coin et al. 1998) |
| NP_005517, NP_004497 | heat shock factor protein 1 | heat shock factor protein 2 isoform a | Direct | In-vivo/in-vitro experimentation (He et al. 2003) |
| NP_066953, NP_000936 | peptidyl-prolyl cis-trans isomerase A | calcineurin subunit B type 1 | Complex | In-vitro experimentation (Huai et al. 2002) |

Table 3.2: True positive proteins predicted by constrained ML (Barker et al. 2007; Barker
and Pagel 2005) from the testing data with rate of gain constrained to 0.025 at an LR cutoff
of 50.217.
3.3.2.1 Likelihood ratio statistic
The likelihood ratio statistic (LR) derived from the comparison of the independent and
dependent models of evolution is asymptotically distributed as a χ² variate with degrees of
freedom equal to the difference in the number of parameters between the two models, which in
this case equals 4, under assumptions about the size of the phylogeny and the speed of
evolution of the character under consideration (Barker and Pagel 2005; Pagel 1997). Thus if
the LR falls within the critical region of the distribution it is considered significant. A
histogram showing the theoretical χ² distribution with 4 degrees of freedom is shown below
in Figure 3.6.
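For 4 degrees of freedom the upper tail probability of the χ² distribution has a simple closed form, P(X > x) = e^(-x/2)(1 + x/2), so the nominal significance of an LR value can be sketched without a statistics library. The function name is hypothetical, and the resulting p-values are only meaningful to the extent that the asymptotic assumptions described above hold.

```python
import math


def chi2_sf_df4(x):
    """Upper tail probability P(X > x) for a chi-squared variate with 4 degrees of freedom."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


# The conventional 5% critical value for 4 degrees of freedom is about 9.488.
p_at_critical_value = chi2_sf_df4(9.488)

# Nominal tail probability of a large LR statistic such as 58.54.
p_at_large_lr = chi2_sf_df4(58.54)
```

An LR above roughly 9.49 would therefore fall in the 5% critical region of the theoretical distribution.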

Figure 3.6: Theoretical χ² distribution with 4 degrees of freedom.

The distribution of LRs in the positive and negative set as well as over the combined
training data differs from this theoretical distribution as can be seen in Figures 3.7, 3.8, 3.9
and 3.10.


Figure 3.7: Distribution of likelihood ratio statistic for constrained ML within the rate
of gain 0.025 over the positive training set.


Figure 3.8: Distribution of likelihood ratio statistic for constrained ML within the rate
of gain 0.025 over the negative training set.


Figure 3.9: Distribution of likelihood ratio statistics for constrained ML (Barker et al.
2007; Barker and Pagel 2005) within the rate of gain 0.025 over the complete training
dataset.
| Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
|---|---|---|---|---|---|
| 0.08932 | 7.85 | 10.51 | 11.14 | 13.79 | 74.80 |

Table 3.3: Descriptive statistics for the distribution of likelihood ratios for the rate of gain
0.025 over the complete training data.


Figure 3.10: Distribution of likelihood ratio statistics for constrained ML within the
rate of gain 0.025 over the complete training dataset, the positive training dataset and the
negative training dataset compared with the theoretical χ² distribution with 4 degrees of
freedom.
The distribution of LR statistics over the training data seems to differ from the
theoretical χ² distribution with 4 degrees of freedom.
This distribution was also tested via a two-sample Kolmogorov-Smirnov test for
goodness of fit between a generated theoretical χ² distribution with 4 degrees of freedom and
the LR statistic score distribution over the training data using R (R Development Core Team
2011). This also showed a difference between the two distributions (D=0.9993, p-value < 2.2e-16).
This may be due to a violation of assumptions of the model with regards to the speed
of character transition.
The overall frequency of higher LR statistics does appear to be higher in the positive
set which is further validation for the constrained ML method.
3.3.3 Co-expression of mRNA
The results for each microarray experiment measured over the training data are given below
in Figure 3.11.

Figure 3.11: Precision/ sensitivity results for 377 microarray experiments over the
training datasets.

As before the area of interest in Figure 3.11 is the point at which precision equals 1.
This is because the average correlation between transcript abundance and peptide abundance
has been observed to be fairly low in primates at around 0.33 (Fu et al. 2007). Thus mRNA
co-expression is unlikely to be capable of high sensitivities in protein-protein interaction
detection. Figure 3.12 is a magnification of this area.

Figure 3.12: precision/ sensitivity results for microarray experiments over the training
datasets magnified to a scale of sensitivity ranging from 0 to 0.01.
Mean precision over all 377 microarray experiments was 0.2141 and mean sensitivity
was 0.1195. Out of the 377 total 18 experiments achieved a precision of 1. Details of these
experiments are shown in Table 3.4.

| Accession | Size | Description of experiment | Sensitivity |
|---|---|---|---|
| E-GEOD-4567 | 166 | Transcription profiling of human pulmonary artery endothelial cell culture treated with Chapel Hill Ultrafine particle. | 0.0006547359 |
| E-GEOD-2280 | 168 | Transcription profiling of oral cavity samples from human squamous cell carcinoma patients (O'donnell et al. 2005). | 0.0003274752 |
| E-GEOD-3183 | 255 | Transcription profiling of human bronchial cell line treated with IL-13 to better understand early cytokine-mediated mechanisms that lead to asthma. | 0.0002183168 |
| E-GEOD-994 | 266 | Transcription profiling of human intra-pulmonary airways and buccal mucosa to identify the effects of cigarette smoke on the human airway epithelial cell transcriptome (Spira et al. 2004). | 0.0002183168 |
| E-GEOD-2152 | 474 | Transcription profiling of human uterine fibroids with mutated or wild type fumarate hydratase gene (Vanharanta et al. 2006). | 0.0008728860 |
| E-GEOD-2504 | 28 | Transcription profiling of untreated, HIV-1 vector-infected and TNFalpha-treated human Jurkat T cells (Lewinski et al. 2005). | 0.0002182929 |
| E-GEOD-4748 | 191 | Transcription profiling of human dendritic monocytes treated with LPS (lipopolysaccharide) or CyP (Cyanobacterial Product) (Macagno et al. 2006). | 0.0004366336 |
| E-MEXP-1224 | 538 | Transcription profiling of human colon samples from patients who have colorectal cancer recurrence or are recurrence-free (Garman et al. 2009). | 0.0030511060 |
| E-GEOD-7664 | 254 | Transcription profiling of human PBMC response to benzene metabolites (Gillis et al. 2007). | 0.0008728860 |
| E-GEOD-2361 | 87 | Transcription profiling of 36 normal human tissue types to identify tissue-specific genes (Ge et al. 2005). | 0.0003274752 |
| E-GEOD-1739 | 212 | Transcription profiling of blood samples from human patients with severe acute respiratory syndrome (SARS) (Reghunathan et al. 2005). | 0.0002182929 |
| E-TABM-577 | 95 | Transcription profiling of human placenta from women presenting at term with villitis of unknown etiology (Kim et al. 2009). | 0.0001091584 |
| E-GEOD-2018 | 129 | Transcription profiling of human bronchoalveolar lavage samples collected from lung transplant recipients with rejection states determined at the time of sample collection (Lande et al. 2003). | 0.0001091584 |
| E-GEOD-1786 | 126 | Transcription profiling of human male vastus lateralis muscle samples from healthy and COPD subjects before and after 3 months of training (Radom-Aizik et al. 2005). | 0.0008728860 |
| E-GEOD-2624 | 293 | Transcription profiling of human tetracycline-regulated cell line expressing an NF-kB inhibitor to systematically identify NF-kB dependent genes (Tian et al. 2005). | 0.0014181302 |
| E-MEXP-714 | 55 | Transcription profiling of human hepatitis C virus replicon cell line treated with interferon-alpha 2a in a time series. | 0.0002182929 |
| E-GEOD-9770 | 1225 | Transcription profiling of human neurons from different brain regions derived from individuals with mild cognitive impairment. | 0.0008731718 |
| E-GEOD-403 | 333 | Transcription profiling time series of the cAMP-induced decidualization of human endometrial stromal cells (Tierney et al. 2003). | 0.0007641087 |

Table 3.4: Microarray experiments achieving a precision of 1.


The highest scoring microarray experiment was E-MEXP-1224, an investigation into
whether expression profiles differed between the colorectal tissue of patients
who had recurrent cancer and those who remained clear (Garman et al. 2009). The sensitivity
of this experiment was 3.0511 × 10^-3 with a precision of 1.
3.3.4 Bayesian classifier
The ability of the PIPs (McDowall et al. 2009) server to predict functional interaction over
the training set was evaluated at six cut-offs. The results can be seen in Figure 3.13.

Figure 3.13: Precision/ sensitivity results for predictions from the PIPs server over six
cut-offs over the training dataset.


Figure 3.14: Precision/sensitivity results for predictions from the PIPs server over six cut-offs
over the training dataset, zoomed in to a maximum sensitivity of 0.15.

None of the score cut-offs over the predictions from the PIPs server achieved a full
precision of 1. However, none of them fell below 0.9 either, as seen in Table 3.5.

Cutoff | Predictions | Precision | Sensitivity
0.25 | 79441 | 0.9135546 | 0.14366504
1.00 | 37606 | 0.9395973 | 0.11068444
2.50 | 25598 | 0.9533333 | 0.09825928
25.00 | 5394 | 0.9949239 | 0.04742318
250.00 | 1232 | 0.9865772 | 0.01832689
2500.00 | 498 | 0.9883721 | 0.01067973

Table 3.5: Precision/sensitivity results for predictions from the PIPs server over six cut-offs.

3.3.5 Method Comparison


Three of the four methods evaluated are able to discriminate between the negative and
positive examples in the training data with varying degrees of success. Figure 3.12 shows all
three methods charted on the same plot. Examination of phylogenetic profiles via detection
of correlated evolution using maximum likelihood is represented by the single optimum rate
of gene gain of 0.025, as the object of the ML correlated evolution training step was the
selection of this optimum rate of gain.

Figure 3.12: All methods compared over training dataset. Legend explanation (PIPs=PIPs
server, MA= microarray experiment and PP= phylogenetic profiling measuring correlation in
gain and loss over a phylogeny with constrained rate of gain).
3.4 Discussion
Arguably the best performing of the three methods is the PIPs server (McDowall
et al. 2009), as it achieves the highest combined precision and sensitivity over the
training data. The success of the PIPs server in terms of accuracy and coverage is attributable
to its use of multiple, disparate sources of evidence. The other two methods both focus on
particular types of interactions.
Phylogenetic profiling measured with constrained ML over a phylogeny is limited to
proteins that have been gained and lost in a correlated fashion over a phylogeny. Thus protein
interactions between phylogenetically ubiquitous partners cannot be detected. Similarly it
cannot detect interactions between interactors with potentially redundant partners.

Microarrays are more flexible in the types of interaction they are capable of detecting.
However individual experiments are limited in the types of interactions that they can uncover
by the experimental conditions under which their constituent mRNAs were extracted. They
are also biased toward stable complexes (von Mering et al. 2002). Another limitation in the
use of microarray experiments in the prediction of protein interactions is the fact that
expression levels of a gene at the transcription level do not correlate strongly with overall
levels of protein production at the translational level (Gygi et al. 1999). This is due to
regulation at the posttranscriptional level by factors such as mRNA half-life, codon usage and
ribosome occupancy and density (Wu et al. 2008). The best performing microarray
experiments outperformed constrained ML in terms of sensitivity.
However given the difference in cost and labour intensiveness between a microarray
experiment and a computational analysis employing phylogenetic profiling, the latter can
clearly be a useful tool in the functional annotation of identified genes within a newly
sequenced genome.
3.4.1 Low Sensitivities
None of the methods as described and utilised above is particularly sensitive in
detecting protein-protein interactions. Constrained ML and gene co-expression are insensitive
to protein-protein interactions for the reasons described above.
The PIPs server as the best performer achieves a sensitivity of 0.14 at a high level of
precision. However this still corresponds to a 14% chance of detecting a possible protein
interaction despite its integration of various forms of supporting evidence. It is possible that it
is this integration of evidence that renders PIPs insensitive. If for example the likelihood ratio
returned by one of its predictor modules was high with the rest all being low, the overall
posterior odds ratio score would be low. Thus the individual sensitivities of the module
predictors are averaged out.
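The dilution effect described above can be illustrated with a minimal sketch. It assumes a naive Bayes style combination in which the posterior odds are the prior odds multiplied by the likelihood ratio of every module; this combination rule, the function name and all numbers below are illustrative assumptions and are not taken from the PIPs server itself.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Combine module likelihood ratios under an assumed naive Bayes rule."""
    return prior_odds * prod(likelihood_ratios)


# Hypothetical values: one module strongly supports the interaction,
# the remaining four modules weakly argue against it.
prior = 0.001
module_ratios = [500, 0.5, 0.4, 0.6, 0.5]

single_module = posterior_odds(prior, module_ratios[:1])
all_modules = posterior_odds(prior, module_ratios)
print(single_module, all_modules)
```

Under these assumed values the strong module alone would give noticeably higher odds than the combined score, which is the behaviour the paragraph above attributes to the integration of evidence.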
It seems that maximising coverage of the interactome is beyond the scope of each of
the predictive methods considered in this chapter. To use the analogy of the interactome as a
dark room, none of these methods are equivalent to an overhead light that illuminates every
corner of the room. Rather each method is more like a lamp that casts a pool of light on its
immediate surroundings. It is only by lighting a number of these lamps that the entire room
can be illuminated.


Chapter 4
Design and implementation of data filter
4.1. Introduction
The constrained maximum likelihood (ML) method used to detect proteins which share
correlated evolutionary histories as described in Chapter 3 and in work by Barker et al.
(Barker et al. 2007; Barker and Pagel 2005) estimates values for parameters which model the
transition rates of the gain and loss of discrete characters (Pagel 1994) by integrating over all
possible ancestral states at each node within the phylogenetic tree.
As pointed out by Barker (Barker et al. 2007) placing a constraint on the rate of
acquisition of new proteins increases the ability of the likelihood method to discriminate
between proteins that interact and those that do not. The determination of an optimum rate of
gain reduces the scale of the problem of parameter estimation (Barker and Pagel 2005) as it
reduces the numbers of parameters to be fitted to 2 for the independent model and 4 for the
dependent model.
The detection of potential functional interactors for a single given protein using this
method is possible, however given the low sensitivity of the method (see Chapter 3) the
probability of detecting a functional interaction for any given single protein or even a set of
proteins is low. A complete genome-wide survey however would detect all protein pairs that
displayed evidence of correlated evolution.
The procedure is, however, prohibitively slow for a complete genome-wide survey
without access to a significant amount of computing power. A timed training run over the
training dataset for a single rate of gain took approximately 110 CPU-hours to complete
54,906 comparisons of non-redundant phylogenetic profile pairs on a single core of a 3 GHz
dual-core Intel Xeon processor (see Section 4.5). As there are 60,615,555 possible non-redundant
pairs of phylogenetic profiles in the version of the human proteome currently held,
a full genome comparison would take 121,825.05 CPU-hours or 13.9 CPU-years on a
single core of the same processor. The speed of constrained ML (Barker
et al. 2007; Barker and Pagel 2005) was also measured in work presenting a genome order
based approach to phylogenetic profiling (Cokus et al. 2007). In this case it was found to
range between 5 and 15 seconds per pair of proteins (Cokus et al. 2007). This caused the authors
to utilise a subset of their data in their benchmarking study of constrained ML (Cokus et al.
2007).
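The arithmetic behind this estimate can be retraced with a short sketch. The number of profiles (11,011) is an assumption inferred from the quoted pair count, since 11,011 choose 2 equals 60,615,555; it is not stated in the text. Because the timing is only given as approximately 110 CPU-hours, the extrapolation below lands slightly under the 121,825.05 CPU-hours quoted above.

```python
from math import comb

# Assumption: number of phylogenetic profiles, inferred from the pair count.
n_profiles = 11_011
total_pairs = comb(n_profiles, 2)  # non-redundant pairs of profiles

timed_pairs = 54_906       # comparisons in the timed training run
timed_cpu_hours = 110.0    # "approximately 110 CPU-hours"

seconds_per_pair = timed_cpu_hours * 3600 / timed_pairs
full_cpu_hours = timed_cpu_hours * total_pairs / timed_pairs
full_cpu_years = full_cpu_hours / (24 * 365)

print(total_pairs, round(seconds_per_pair, 2), round(full_cpu_years, 1))
```

The per-pair time that falls out of these figures sits inside the 5 to 15 second range reported by Cokus et al.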

Potentially, access to multi-core CPUs and/or computing clusters could ameliorate this
to a certain extent. As application of constrained ML (Barker et al. 2007; Barker and Pagel
2005) involves sequential comparison of pairs of phylogenetic profiles, it is a process that is
easily amenable to parallelisation by splitting the task into a set of smaller tasks, which can
be launched in parallel. Task farming is applied in computational biology to tasks that are
potentially intractable if tackled serially, e.g. analysis of gel electrophoresis data (Dowsey et
al. 2003) or analysis of microarray data (Hill et al. 2008). However, even with the application
of task farming it is clear that a full genome-wide survey is not feasible for this method on
any average-sized eukaryotic genome.
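As a rough illustration of task farming for this problem, the sketch below splits the set of non-redundant profile pairs into independent batches that could each be handed to a separate core or cluster node. The function name, the batch size and the protein identifiers are hypothetical; how the batches are dispatched is left open.

```python
from itertools import combinations, islice


def farm_tasks(profile_ids, pairs_per_task):
    """Yield lists of profile-ID pairs; each list is one independent task."""
    pairs = combinations(profile_ids, 2)
    while True:
        task = list(islice(pairs, pairs_per_task))
        if not task:
            return
        yield task


# Four hypothetical proteins give six pairs, split into tasks of at most four.
tasks = list(farm_tasks(["P1", "P2", "P3", "P4"], pairs_per_task=4))
for number, task in enumerate(tasks):
    print(number, task)
```

Because no pair depends on the result of any other pair, the batches can be processed in any order.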
This chapter details the development of a data filter to remove protein pairs that
display little or no evidence of correlated evolution. There are two main types of filter
evaluated. The first type is a simple distance based test (Hamming distance) as shown in
Chapter 3 and utilised in early work on phylogenetic profiling (Pellegrini et al. 1999).
Potentially proteins that display evidence of correlated evolution will have phylogenetic
profiles that have a lower Hamming distance from each other. Thus even though Hamming
distance applied in isolation performs poorly as seen in Chapter 3, it may serve as a filter for
proteins which do not display evidence of correlated evolution in combination with the
second type of filter.
The second type of filter will utilise a single set of reconstructed ancestral states. By
using a single set of reconstructed states and a simpler method for the detection of evidence
of correlated evolution proteins that do not display any such evidence may be filtered out.
This chapter describes the implementation and comparison of five filters, which utilise a
single set of reconstructed ancestral states to detect signs of correlated evolution. As a large
amount of the computations performed by constrained ML (Barker et al. 2007; Barker and
Pagel 2005) involve estimation of the transition rate parameters by integrating over all
possible ancestral states, use of a single set of reconstructed ancestral states reduces the scope
of the problem. Through the use of an effective and accurate data filter a genome-wide
survey for an average eukaryotic organism could be rendered feasible.
The end product of the research described in this chapter is just such a filter, based on
logistic regression of a set of empirically evaluated predictors/parameters, which reflect
correlated evolution between a pair of proteins. The filter is approximately 2208 times faster
than constrained ML and achieves a reasonable degree of precision/sensitivity over the
training data in its own right. Thus application of this filter can facilitate a heuristic search
for genes/proteins displaying evidence of correlated evolution over an entire
genome/proteome.
In order to describe the process of filter development/evaluation it will firstly be
necessary to present an overview of ancestral state reconstruction.
4.1.1 Ancestral state reconstruction
The procedures involved in the reconstruction of the states of characters and traits in extinct
ancestral species are similar to those involved in phylogeny reconstruction. This is due to the
similarity of the issues involved. The reconstruction procedures for character states thus
utilise similar criteria with which to judge putative reconstructions. Ancestral reconstruction
is a useful tool for investigating hypothetical evolutionary scenarios having been used to
investigate many biological questions such as for example the demonstration of homoplasy in
the evolution of lysozyme (Malcolm et al. 1990; Messler and Stewart 1997; Stewart et al.
1987). It is also a prerequisite step for a number of comparative method tests (Maddison
1990; Ridley 1983).
4.1.1.1 Parsimony
A parsimonious reconstruction of ancestral states over a phylogenetic tree would
entail the selection of the internal states that minimised change. Thus if, for example, two
terminal nodes within a given clade had the same state, that state would be
assigned to the node immediately preceding them.
Algorithms such as the Fitch (Fitch 1971) and Sankoff (Sankoff 1975) algorithms as
described in Chapter 2 are employed as a step within phylogeny reconstruction (Albert
2006). However given a particular already constructed phylogenetic tree they can be
employed to reconstruct a set of ancestral node values which minimises evolutionary change
over that particular tree (Felsenstein 2004). The algorithms themselves do not reconstruct
individual states at each internal node but instead construct sets of potential states at each
node. These potential states can be resolved into a singular state reconstruction through the
application of algorithms such as ACCTRAN (Accelerated transformation) (Swofford and
Maddison 1987), which reconstructs ancestral states by placing points of change as close to
the root of the tree as possible (Agnarsson and Miller 2008). The converse approach to
ACCTRAN is DELTRAN (delayed transformation) (Swofford and Maddison 1987), which
reconstructs ancestral states by placing points of change as close to the tips of the tree as
possible (Agnarsson and Miller 2008). ACCTRAN and DELTRAN are the most commonly
used methods for collapsing node state sets into individual node states, though of the two
ACCTRAN is the more widely employed (Agnarsson and Miller 2008).
Parsimony methods fail to consider different branch lengths in different parts of the
tree (Yang et al. 1995). Parsimony based methods have also been criticised for their lack of
statistical soundness (Elias and Tuller 2007). Parsimony methods are also unable to
distinguish between reconstructions that are equally parsimonious (Koshi and Goldstein
1996).
4.1.1.2 Likelihood
Just as likelihood is employed as an optimality criterion for phylogeny
generation, it can also be used in the context of ancestral state reconstruction. Maximum
likelihood techniques are used to estimate the parameters of the specified model of evolution
(Yang 2006). Once these parameters are estimated they can be utilised to calculate the
posterior probability of ancestral states using Bayes' theorem (Yang 2006). The state with the
highest posterior probability is then assigned to the node under consideration. This procedure
has been defined as empirical Bayes (Yang 2006). Empirical Bayes can be used either to
assign a character state to a single node in a tree, a process known as marginal
reconstruction, or to assign a set of character states to all ancestral nodes in the tree
simultaneously (Yang 2006). This latter process is known as joint reconstruction (Yang 2006).
Empirical Bayes can be contrasted with hierarchical Bayes where rather than estimating a
single value for the parameters of a model of evolution a prior probability distribution is
assigned for each unknown parameter (Yang 2006). The posterior probability for a given
ancestral state is then calculated by integrating over all possible values of parameters
(Huelsenbeck and Bollback 2001). Again the putative state with the highest posterior
probability is then assigned to each ancestral node.
Work by Koshi and Goldstein used the empirical Bayes method to reconstruct the
sequence of ancestral ribonuclease (Koshi and Goldstein 1996). The performance of
parsimony and the empirical Bayes method was also compared in a reconstruction of
lysozyme c by Yang et al. (Yang et al. 1995). This work found that empirical Bayes
outperformed parsimony but both methods suffered when the sites within the multiple
alignments being reconstructed were highly variable and the distance from the ancestral
nodes to the extant species was high (Yang et al. 1995).
An interesting application of empirical Bayes reconstruction was carried out by
Gaschen et al. (Gaschen et al. 2002). This work entailed reconstruction of the
sequence of the ancestor to various regional variants of the HIV-1 virus in order to contribute
to the creation of a potential vaccine (Gaschen et al. 2002).
4.2 Filters
4.2.1 Hamming distance filter
The original work which introduced the methodology of phylogenetic profiling as a means of
detection of functional interaction between genes (Pellegrini et al. 1999) utilised Hamming
distance (Hamming 1950) as a measure of similarity of profiles. Phillip Kensche also
examined this method in a review of phylogenetic profiling methods, and found it to perform
reasonably well over a dataset composed of the protein sequences of 25 fungi (Kensche et
al. 2008). Hamming distance did not perform well over the training data, as seen in Chapter 3;
however, it was possible that it could reduce the search space for an application of
constrained ML. As a potential heuristic it offers speed, as Hamming distance is one of the
simplest comparisons that can be carried out between two strings. Hamming distance
was therefore investigated as a potential filter, possibly to be used in conjunction with a filter
based on a single set of reconstructed states.
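A minimal sketch of the comparison is given below, assuming that each phylogenetic profile is stored as a string of 0s and 1s with one position per species. The function name and the example profiles are illustrative only.

```python
def hamming_distance(profile_a, profile_b):
    """Count the species at which two presence/absence profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Two hypothetical six-species profiles that differ at a single species.
distance = hamming_distance("010101", "010001")
print(distance)
```

A single pass over the two strings is all that is required, which is why the measure is attractive as a first-stage filter.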
4.2.2 Ancestral state reconstruction filter
The first consideration in the development of a heuristic/filter based on a single set of
reconstructed characters was which criterion to use to reconstruct that set. Likelihood as a
criterion yields more accurate results as discussed above. However as the aim of this heuristic
approach was to develop a method that reduced the search space for an application of the
computationally intensive constrained ML (Barker et al. 2007; Barker and Pagel 2005) to
phylogenetic profiling, it was decided to use the simpler though less accurate criterion of
parsimony.
4.2.2.1 Dollo parsimony
Dollo parsimony operates under the assumption that once a complex trait has been lost it
cannot be re-acquired (Albert 2006). Given that the character under investigation is the
presence and absence of genes/proteins in eukaryotic organisms it was decided that Dollo
parsimony was the appropriate variant to use. Dollo parsimony has been previously used to
investigate the propensity of particular genes to be lost over the course of evolutionary time
in eukaryotes (Krylov et al. 2003). It was chosen by the authors due to the relative rarity of
lateral gene transfer events in eukaryotes (Krylov et al. 2003).

Dollo parsimony has also been utilised to investigate gene gain in poxviruses
(McLysaght et al. 2003). The results of this use however may have been affected by the fact
that poxviruses were later observed to acquire genetic material from infected hosts (Hughes
and Friedman 2005). Kensche also evaluated the efficacy of Dollo reconstructions of profiles
as a method of phylogenetic profiling (Kensche et al. 2008). Kensche utilised a distance
measure d(A,B) between the Dollo parsimonious reconstructions of the phylogenetic profiles
of two (orthologous groups of) proteins A and B calculated as:

$$d(A, B) = \sum_{i \in \mathrm{branches}} \left| \left( \mathrm{anc}(a_i) - \mathrm{desc}(a_i) \right) - \left( \mathrm{anc}(b_i) - \mathrm{desc}(b_i) \right) \right| \qquad (1)$$

where branches denoted the set of branches in the phylogenetic tree, anc(a_i) was defined as
the state of orthologous group A at the ancestral node of branch i, desc(a_i) was defined as the
state of orthologous group A at the descendant node of branch i, anc(b_i) was defined as the
state of orthologous group B at the ancestral node of branch i and desc(b_i) was defined as the
state of orthologous group B at the descendant node of branch i (Kensche et al. 2008). The
distance d(A,B) was a count of branches where either orthologous group was gained or lost
independently. The method performed as well as more sophisticated techniques on the data
analysed by Kensche (Kensche et al. 2008).
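The distance in Equation 1 can be sketched as follows, assuming that each reconstruction is held as a mapping from node name to state (0 or 1) and that the tree is given as a list of (ancestor, descendant) branches. The data layout, the function name and the small example tree are assumptions made for illustration.

```python
def dollo_branch_distance(branches, states_a, states_b):
    """Sum, over branches, the absolute difference between the changes in A and B."""
    total = 0
    for ancestor, descendant in branches:
        change_a = states_a[ancestor] - states_a[descendant]
        change_b = states_b[ancestor] - states_b[descendant]
        total += abs(change_a - change_b)
    return total


# Hypothetical tree ((A,B),C): both groups are gained on the branch to n1,
# but group B is additionally lost on the branch leading to tip B.
branches = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
group_a = {"root": 0, "n1": 1, "A": 1, "B": 1, "C": 0}
group_b = {"root": 0, "n1": 1, "A": 1, "B": 0, "C": 0}

print(dollo_branch_distance(branches, group_a, group_b))
```

The shared gain contributes nothing to the distance, while the loss that occurs in only one of the two groups contributes one.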
One of the methods evaluated by Barker as a potential source of signal for correlated
evolution was also examination of Dollo parsimony based reconstructions of phylogenetic
profiles over a phylogeny (Barker et al. 2007). Dollo parsimony was utilised as it reflected
the idea of setting the rate of acquisition of a complex trait (in this case a protein) to a preset
low level (Barker et al. 2007). Pairs of proteins were scored on branches of the tree where
they were jointly lost and jointly gained to form a score referred to as Dollo-pos (Barker et al.
2007). Branches where proteins were not gained or lost together were also counted and
subtracted from Dollo-pos to form a score referred to as Dollo-overall (Barker et al. 2007).
Both these scores however did not perform particularly well over the data examined (Barker
et al. 2007). Dollo-overall however performed significantly better than Dollo-pos (Barker et
al. 2007).
Thus given the fact that Dollo parsimony based tests had been moderately successful
at detecting correlated evolution, a series of potential data filters/heuristics for examination
of phylogenetic profiles using constrained ML (Barker et al. 2007; Barker and Pagel 2005)
based on a single set of reconstructed ancestral states over the phylogeny using Dollo
parsimony were investigated.
4.2.2.2 Maddison Test for correlated evolution
To use the reconstructed ancestral state data a test to detect correlated evolution using the
comparative method that utilised a set of reconstructed ancestral states over a given
phylogenetic tree was needed. One candidate test was a contingency table based test
presented by Ridley where a gain or loss of a character was considered in the light of whether
it occurred in the presence or absence of another character over a phylogenetic tree (Ridley
1983). This test however does not separate which character is dependent and which is
independent.
A second candidate test considered was a procedure described by Wayne Maddison
(Maddison 1990) for the comparison of the association of changes in one binary character
with the given state of another. This test was designed to carry out this analysis assuming a
given phylogenetic tree and a set of reconstructed characters (Maddison 1990). This test has
been referred to as a test for concentrated changes (Felsenstein 2004).
The fundamental idea behind the Maddison test is to test whether changes in one trait
or character are concentrated in an area of a tree where a second trait or character is in a given
state. As an illustrative example consider a fictional monophyletic group of related cow-like
animals. These animals do not possess horns. The phylogenetic relationships of these
animals are fully resolved and understood as well as the ancestral states for all morphological
and molecular traits. Now imagine that this group overall has no ability to metabolise valine.
Finally imagine the ability to metabolise valine is independently acquired by a sub-clade of
our fictional group and this leads to the development of horns in this sub-clade.
If we wished to test whether the ability to metabolise valine leads to horn
development, the Maddison test would return the probability of the observed configuration of
valine metabolism / horn presence. This probability would be calculated by firstly calculating
the total number of ways to acquire horns in the presence of the ability to metabolise valine
over the phylogenetic tree. Secondly, the number of ways to acquire horns over the entire tree,
irrespective of the state of the ability to metabolise valine, is calculated. By dividing the first
value by the second a probability can be calculated. If horns are concentrated in parts of the
tree where valine metabolism is also present this probability will be lower.


Figure 4.1: Illustrative example tree.


Thus imagine in the above figure only Cow1 has the ability to metabolise valine and
also possesses horns. A reasonable hypothesis/reconstruction could then be that both abilities
were gained on the branch leading to Cow1. There is 1 gain of horns. Over the entire tree
there are 6 branches (not counting the root branch) and thus 6 ways to have 1 gain of horns.
However there is only one way of having a gain of horns in the presence of valine
metabolism and that is on the branch leading to Cow1. Thus the probability of the observed
configuration is 1/6.
To reiterate the test works through counting all possible ways of having a set of
observed changes in a character over a phylogeny and then counting how many ways there
are of having the same number of changes in parts of the tree where a second character is in a
given state. Thus if correlated evolution is occurring changes in the first character will be
concentrated in areas of the tree where the second character is in the causative state. Consider
as a second example two proteins, which carried out the same function. If the presence of the
first protein made the second protein redundant then losses in the second protein could be
concentrated in areas where the first protein was present.
The drawbacks of the test are the fact that it treats all forms of evolutionary change as
equally likely and its inability to take into account branch lengths (Pagel 1994). However as
the motivation behind the implementation of the test was its use as a simple data filter to
remove protein pairs that showed little or no evidence of correlated evolution it was decided
that the Maddison test (Maddison 1990) was an appropriate test.
4.3 Methods
To create Dollo parsimony based reconstructions of each phylogenetic profile over the
phylogeny presented in Chapter 2, the program DOLLOP from the PHYLIP package
(Felsenstein 1989) was used. The program implements the Dollo parsimony reconstruction
algorithm described in work by Farris (Farris 1977).
Given a binary trait T that can take on 2 possible values coded as [0,1], DOLLOP
implements Dollo parsimony by seeking to explain a given observed configuration of
presence and absence for T over a set of taxa over a phylogenetic tree by allowing one gain
(transition from 0 to 1) and multiple reversions (transition from 1 to 0) (Felsenstein 1989).
As an illustrative example consider the tree below and a trait with the distribution
010101.


Figure 4.2: Example tree.


DOLLOP will reconstruct the trait as initially gained at the root of the tree and lost at
the branches leading to A, B and E. This is as opposed to allowing multiple gains on the
branches leading to D and G.
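The single-gain, multiple-loss logic can be sketched as below. This is not DOLLOP's own code: it is a simplified illustration for one binary trait on a fixed tree, in which the trait is placed at the most recent common ancestor of all tips that carry it and is then lost wherever a subtree contains no carrier. The tree, the node names and the function name are hypothetical and do not correspond to Figure 4.2.

```python
def dollo_reconstruct(children, root, tip_states):
    """Return node states for one trait allowing one gain and any number of losses."""
    has_presence = {}

    def visit(node):
        kids = children.get(node, [])
        if kids:
            # A list (not a generator) so that every child is visited.
            has_presence[node] = any([visit(kid) for kid in kids])
        else:
            has_presence[node] = tip_states[node] == 1
        return has_presence[node]

    visit(root)

    # Walk down to the most recent common ancestor of all carrier tips.
    origin = root
    while True:
        carriers = [k for k in children.get(origin, []) if has_presence[k]]
        if len(carriers) != 1:
            break
        origin = carriers[0]

    states = {node: 0 for node in has_presence}
    if has_presence[root]:
        stack = [origin]
        while stack:
            node = stack.pop()
            states[node] = 1
            stack.extend(k for k in children.get(node, []) if has_presence[k])
    return states


# Hypothetical tree ((A,B),(C,D)) with the trait present in B and D only.
tree = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
scattered = dollo_reconstruct(tree, "root", {"A": 0, "B": 1, "C": 0, "D": 1})
single = dollo_reconstruct(tree, "root", {"A": 1, "B": 0, "C": 0, "D": 0})
print(scattered)
print(single)
```

In the first case the trait is reconstructed as present at the root and lost on the branches leading to A and C, rather than gained twice. In the second case the only gain sits on the branch leading to A.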
The process followed was similar to the process followed to generate the genome
content tree produced in Chapter 2. The main difference in this case was that DOLLOP was
run with the U option, which instructed it to produce Dollo parsimonious reconstructions
over a user-supplied tree. The program was supplied with the phylogeny generated in Chapter
2 as well as the phylogenetic profile for each protein under consideration.
Apart from the U option the program was run with its default settings. The output
from DOLLOP contained data on the state of the protein at every node in the phylogeny as
well as the branches within the phylogeny at which transitions occurred. In order to record this
data DOLLOP assigns an identifying number to each internal node of the phylogeny.
An example of the outputted data from DOLLOP is given below. For a human protein
with the profile 000000000000000000000000100000000000000000000000000000 (Only
present in Homo sapiens) over the species used for the phylogeny produced in Chapter 2, the
following reconstruction was provided for species close to the root of the tree.

From | To | Changed | State
root | | No | Absent
 | Entamoeba histolytica | No | Absent
1 | | No | Absent
 | Trichomonas vaginalis | No | Absent
2 | | No | Absent
 | | No | Absent

Table 4.1: Sample output from DOLLOP.


Clearly the parsimonious reconstruction for this protein would only contain one gain. This
gain occurs on the branch between Homo sapiens and the ancestral node immediately preceding it.
The output files from DOLLOP were stored for further use.
In order to process this data and utilise it as input for various tests of correlated
evolution two Java objects were defined and implemented.

Figure 4.3: Class diagram illustrating classes underpinning Dollo analyses.

The main object in the preceding figure is the Transition Matrix object. This object has 2
main attributes.

The States: This is a list of Transition objects. Transition objects contain the same 4
attributes as shown in Table 4.1

The Position Map: This is a Tree Map, which contains a position within the tree as a
key and the state of a given trait at that position as a value. Thus this attribute can be
queried for the state (present or absent) of a given trait at any point in the tree.

The Transition Matrix object also has 2 main operations.

Calculate clade: This function returns all parts of a tree descended from a given node.
Thus if a trait is gained or lost at Node n, the function will return the monophyletic
group consisting of n and all its descendants.

Create Position Map: This function traverses the States list and utilises the Calculate
clade function to populate the Position map.

These objects underpin all further analyses described in this chapter.
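The two objects described above can be sketched as follows. The thesis implementation was written in Java; this Python version is only an illustration, and the field names, the ordering assumption (transitions listed from the root towards the tips, as in the DOLLOP output) and the use of a plain dictionary in place of a Tree Map are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    from_node: str
    to_node: str
    changed: bool
    state: int  # state of the trait at to_node: 1 = present, 0 = absent


@dataclass
class TransitionMatrix:
    states: list
    position_map: dict = field(default_factory=dict)

    def calculate_clade(self, node):
        """Return the node together with everything descended from it."""
        clade = [node]
        for transition in self.states:
            if transition.from_node == node:
                clade.extend(self.calculate_clade(transition.to_node))
        return clade

    def create_position_map(self):
        """Fill position_map, assuming transitions are ordered root to tips."""
        for transition in self.states:
            if transition.changed:
                # A gain or loss sets the state of the whole clade below it.
                for member in self.calculate_clade(transition.to_node):
                    self.position_map[member] = transition.state
            else:
                self.position_map.setdefault(transition.to_node, transition.state)


# Hypothetical reconstruction: gained on the branch 1 -> 2, lost again in C.
matrix = TransitionMatrix([
    Transition("root", "1", False, 0),
    Transition("1", "A", False, 0),
    Transition("1", "2", True, 1),
    Transition("2", "B", False, 1),
    Transition("2", "C", True, 0),
])
matrix.create_position_map()
print(matrix.position_map)
```

Once the map is populated, the state of the trait at any position can be looked up directly, which is what the later tests rely on.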


4.3.1 Maddison test for correlated evolution
Given the set of reconstructed ancestral states, the Maddison test as described above and in
the original work by Wayne Maddison (Maddison 1990) was implemented. The following
description of the algorithm utilised is based entirely upon the work presented by Maddison
(Maddison 1990). A modification to the test for correlated evolution defined by Maddison
(Maddison 1990) to fit the constraints imposed by Dollo parsimony is presented in Section
4.3.2.
4.3.1.1 Algorithm
Assume two discrete binary characters A and B, a phylogenetic tree T and a set of
reconstructed states for A and B for each node N within T. Possible states for characters A and
B are drawn from the set {0, 1}. A gain is defined as a transition from 0 to 1 and
conversely a loss is defined as a transition from 1 to 0.
Define character B as the reference trait. Define state s as the relevant state of character
B. Define subset k ($k \subseteq T$) as the area(s) of the tree where B is in state s.
Define $W_{root}(x, y \mid b)$ as the total number of ways to have x gains and y losses of
character A over the tree starting at the root node given that the state of character A is b at the
root of the tree.
Define $B_{root}(p, q \mid x, y, b)$ as the total number of ways to have p gains and q losses of
character A in subset k given x gains and y losses over the entire tree starting at the root node
given that the state of character A is b at the root of the tree.
The test for correlated evolution is thus calculated by

$$p(obs) = \frac{B_{root}(p, q \mid x, y, 0) + B_{root}(p, q \mid x, y, 1)}{W_{root}(x, y \mid 0) + W_{root}(x, y \mid 1)} \qquad (2)$$

Solving Equation 2 provides the probability p(obs) of having p gains and q losses of
character A in subset k given a total of x gains and y losses occur over the whole tree under
the null hypothesis of no correlated evolution. If gains and losses of character A are in some
way dependent on whether character B is in state s then we could expect those gains and
losses to be concentrated in subset k. $W_{root}(x, y \mid b)$ and $B_{root}(p, q \mid x, y, b)$ are calculated
through the use of a dynamic programming approach starting at the tips of the tree and
proceeding in a post order fashion (Maddison 1990).
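For very small trees Equation 2 can be checked by brute force rather than dynamic programming. The sketch below assumes that every distinct assignment of states to the nodes of the tree counts as one "way", and that subset k is supplied as the set of nodes whose incoming branch lies in the region of interest. The function name and the four-tip tree are hypothetical; the tree is only chosen to have six branches, as in the cow example given earlier.

```python
from fractions import Fraction
from itertools import product


def concentrated_changes_probability(branches, root, region, x, y, p, q):
    """Enumerate all state assignments and apply Equation 2 directly."""
    nodes = [root] + [child for _, child in branches]
    ways_total = 0       # denominator: summed over both root states
    ways_in_region = 0   # numerator: summed over both root states
    for assignment in product((0, 1), repeat=len(nodes)):
        state = dict(zip(nodes, assignment))
        gains = losses = gains_k = losses_k = 0
        for parent, child in branches:
            if state[parent] == 0 and state[child] == 1:
                gains += 1
                gains_k += child in region
            elif state[parent] == 1 and state[child] == 0:
                losses += 1
                losses_k += child in region
        if (gains, losses) == (x, y):
            ways_total += 1
            if (gains_k, losses_k) == (p, q):
                ways_in_region += 1
    return Fraction(ways_in_region, ways_total)


# Hypothetical tree ((Cow1,Cow2),(Cow3,Cow4)); only the branch leading to
# Cow1 lies in the region where valine metabolism is present.
cow_branches = [
    ("root", "n1"), ("root", "n2"),
    ("n1", "Cow1"), ("n1", "Cow2"),
    ("n2", "Cow3"), ("n2", "Cow4"),
]
p_obs = concentrated_changes_probability(
    cow_branches, "root", {"Cow1"}, x=1, y=0, p=1, q=0
)
print(p_obs)
```

The enumeration grows exponentially with the number of nodes, which is why the dynamic programming approach of Maddison is needed for trees of realistic size.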
4.3.1.2 Calculation of total number of ways of having x gains and y losses over the tree
In order to calculate $W_{root}(x, y \mid b)$ over the entire tree for a character A, a matrix containing
the number of ways of having 0 to x gains and 0 to y losses for either potential value of b (0 or
1) has to be calculated for each node in the tree.
For a leaf node there are 0 ways of having x gains and y losses at the node for all
values of x and y which are greater than 0. There is one way of having 0 gains and 0 losses at
a leaf node.
For a non-leaf node K there are four calculations to make. A non-leaf node is only processed
after both of its immediate descendants, L and M, have been visited. Firstly assume that the
state of character A at K is 0 and that all gains and losses occur post the descendants L and M,
so that no change occurs on the branches joining K to L and M. The number of ways of having
x gains and y losses at node K given a state of 0 can then be calculated by the expression

$$\sum_{i=0}^{x} \sum_{j=0}^{y} W_L(i, j \mid 0) \times W_M(x - i, y - j \mid 0)$$

This counting system operates on the principle that for every way of having i gains and j
losses on node L there are (x-i) gains and (y-j) losses on node M. Thus if for example there was
1 gain and 1 loss to distribute over node K, then if both of them occurred post descendant L no
changes would occur post descendant M. If only the gain occurred post L then the loss would
occur post node M.
The second part of this calculation is based on the assumption that one of the changes
occurs between K and one of its child nodes, for example M. As the state of character A at
node K is 0 this change is a gain, so the state of character A at node M is now 1 and one of the
x gains has already occurred. Thus the number of ways to have the remaining gains and losses
can be calculated by the expression

$$\sum_{i=0}^{x-1} \sum_{j=0}^{y} W_L(i, j \mid 0) \times W_M(x - i - 1, y - j \mid 1)$$

The third part of the calculation covers the eventuality that the change happens between K
and its other child L. Thus the number of ways remaining to have x gains and y losses is
calculated by the expression

$$\sum_{i=0}^{x-1} \sum_{j=0}^{y} W_L(i, j \mid 1) \times W_M(x - i - 1, y - j \mid 0)$$

Finally assume changes occur between K and both of its child nodes L and M. The states of
both nodes will be 1 and there will be two fewer gains to distribute over the remainder of the
tree. Thus the fourth part of the calculation is:

$$\sum_{i=0}^{x-2} \sum_{j=0}^{y} W_L(i, j \mid 1) \times W_M(x - i - 2, y - j \mid 1)$$

Summing up the results of these four expressions will provide the number of ways of
having x gains and y losses at non-leaf node K given that the state of character A is 0. The
calculation of the number of ways of having x gains and y losses if the state of character A is
1 at node K is a mirror image of the process described above (Maddison 1990).
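The counting scheme above can be sketched in a few lines of Python. This is an illustrative sketch rather than the thesis implementation (which was written in Java): the function name `count_ways`, the nested-tuple tree encoding and the table layout are all assumptions made for the example. Instead of writing out the four sums separately, the sketch first folds the optional change on each child branch into a per-child table and then combines the two children, which expands to the same four terms. For state 1 the roles of gains and losses are swapped, giving the mirror-image calculation.

```python
from itertools import product


def count_ways(tree, max_gains, max_losses):
    """Return {b: table}, where table[x][y] is the number of ways of placing
    x gains and y losses below `tree` when the state at its root is b.
    A tree is either a leaf name (str) or a tuple of two subtrees."""
    nx, ny = max_gains + 1, max_losses + 1
    if not isinstance(tree, tuple):  # leaf: only the "no change" history exists
        table = [[0] * ny for _ in range(nx)]
        table[0][0] = 1
        return {0: table, 1: [row[:] for row in table]}

    left, right = (count_ways(child, max_gains, max_losses) for child in tree)

    def with_branch(child, b):
        # Ways on the branch leading to `child` plus everything below it.
        # The branch is either unchanged or carries exactly one change:
        # a gain if the parent state b is 0, a loss if it is 1.
        dx, dy = (1, 0) if b == 0 else (0, 1)
        ext = [[0] * ny for _ in range(nx)]
        for x, y in product(range(nx), range(ny)):
            ext[x][y] = child[b][x][y]
            if x >= dx and y >= dy:
                ext[x][y] += child[1 - b][x - dx][y - dy]
        return ext

    result = {}
    for b in (0, 1):
        ext_l, ext_m = with_branch(left, b), with_branch(right, b)
        table = [[0] * ny for _ in range(nx)]
        for x, y in product(range(nx), range(ny)):
            # Distribute i gains and j losses to L, the remainder to M.
            table[x][y] = sum(ext_l[i][j] * ext_m[x - i][y - j]
                              for i in range(x + 1) for j in range(y + 1))
        result[b] = table
    return result
```

On the three-taxon tree `(("A", "B"), "C")`, which has four branches, `count_ways(tree, 2, 2)[0][1][0]` evaluates to 4, one way for each branch on which the single gain can be placed.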
4.3.1.3 Calculation of total number of ways of having p gains and q losses in subset k
given x gains and y losses over the entire tree
This calculation of $B_{root}(p, q \mid x, y, b)$ is very similar to the one described above. As above, a
matrix containing the number of ways of having 0 to p gains and 0 to q losses in
subset k given 0 to x gains and 0 to y losses over the whole tree, for each possible value of b
(0 or 1), has to be calculated for each node in the tree.
For a leaf node there are 0 ways of having p gains and q losses in subset k given x
gains and y losses overall for all values of p, q, x and y which are greater than 0. There is one
way of having 0 gains and 0 losses in subset k given 0 gains and 0 losses overall.
As above, a non-leaf node is only processed when both of its children have been visited.
For a non-leaf node K with character A having state 0 and with children L and M there are again
four calculations to be made. The first calculation counts the possibilities where all changes
occur post the child nodes. This number is calculated through the expression

$$\sum_{i=0}^{x} \sum_{j=0}^{y} \sum_{f=0}^{p} \sum_{g=0}^{q} B_L(f, g \mid i, j, 0) \times B_M(p - f, q - g \mid x - i, y - j, 0)$$

The second calculation counts the possibilities where one of the changes occurs between
node K and node M. Whether this change is counted as within subset k depends on whether
node M lies within subset k. To facilitate calculation Maddison (1990) defined a number $Z_M$,
which is set to 1 if M lies within k and to 0 otherwise. Thus the second calculation is evaluated
by the expression

$$\sum_{i=0}^{x-1} \sum_{j=0}^{y} \sum_{f=0}^{p - Z_M} \sum_{g=0}^{q} B_L(f, g \mid i, j, 0) \times B_M(p - f - Z_M, q - g \mid x - i - 1, y - j, 1)$$

The third calculation counts the possibility of one of the changes occurring between node K
and node L. This is evaluated via the expression

$$\sum_{i=0}^{x-1} \sum_{j=0}^{y} \sum_{f=0}^{p - Z_L} \sum_{g=0}^{q} B_L(f, g \mid i, j, 1) \times B_M(p - f - Z_L, q - g \mid x - i - 1, y - j, 0)$$

The fourth calculation counts the possibilities where changes occur between K and L as
well as K and M. This is evaluated with the expression:

$$\sum_{i=0}^{x-2} \sum_{j=0}^{y} \sum_{f=0}^{p - Z_L - Z_M} \sum_{g=0}^{q} B_L(f, g \mid i, j, 1) \times B_M(p - f - Z_L - Z_M, q - g \mid x - i - 2, y - j, 1)$$

The summation of the solutions of the four expressions yields the total number of
ways to have p gains and q losses of character A within subset k given x gains and y losses of
character A overall under node K given the state of character A is 0. As above this process is
mirrored when the state of character A is 1 (Maddison 1990).
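A compact way to experiment with this calculation is sketched below. It is not the thesis code: the names `count_in_subset` and `p_obs`, the nested-tuple tree encoding and the representation of subset k as a set of child nodes are assumptions made for illustration. Each node's table is stored as a `Counter` keyed by `(p, q, x, y)`, the indicator Z is looked up per child, and the denominator of Equation 2 is obtained by summing the B counts over all values of p and q. The limits should therefore allow p to reach x and q to reach y, otherwise that sum is truncated.

```python
from collections import Counter

ZERO = (0, 0, 0, 0)  # key order: (p, q, x, y)


def _add(key, shift, limits):
    """Add two keys element-wise; return None if any limit is exceeded."""
    new = tuple(a + s for a, s in zip(key, shift))
    return new if all(v <= m for v, m in zip(new, limits)) else None


def count_in_subset(tree, k_nodes, limits):
    """Return {b: Counter} mapping (p, q, x, y) to the number of ways of having
    p gains and q losses inside k out of x gains and y losses below `tree`.
    `k_nodes` holds every child node whose incoming branch lies within k."""
    if not isinstance(tree, tuple):  # leaf: only the "no change" history
        return {0: Counter({ZERO: 1}), 1: Counter({ZERO: 1})}
    children = [(child, count_in_subset(child, k_nodes, limits)) for child in tree]
    result = {}
    for b in (0, 1):
        total = Counter({ZERO: 1})
        for child, table in children:
            z = 1 if child in k_nodes else 0  # plays the role of Z_L or Z_M
            shift = (z, 0, 1, 0) if b == 0 else (0, z, 0, 1)
            ext = Counter(table[b])  # no change on the branch to this child
            for key, ways in table[1 - b].items():  # one change on the branch
                new = _add(key, shift, limits)
                if new is not None:
                    ext[new] += ways
            combined = Counter()
            for k1, w1 in total.items():
                for k2, w2 in ext.items():
                    new = _add(k1, k2, limits)
                    if new is not None:
                        combined[new] += w1 * w2
            total = combined
        result[b] = total
    return result


def p_obs(tables, p, q, x, y):
    """Equation 2, with W obtained by summing B over all (p, q)."""
    num = sum(tables[b][(p, q, x, y)] for b in (0, 1))
    den = sum(ways for b in (0, 1)
              for (_, _, gx, gy), ways in tables[b].items() if (gx, gy) == (x, y))
    return num / den
```

For the tree `(("A", "B"), "C")` with k covering the clade of A and B, that is `k_nodes = {"A", "B", ("A", "B")}`, three of the four possible placements of a single gain fall inside k, so `p_obs(count_in_subset(tree, k_nodes, (1, 1, 1, 1)), 1, 0, 1, 0)` gives 0.75.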
4.3.1.4 Permutation effects
The Maddison test for correlated evolution (Maddison 1990) is potentially susceptible to two
effects in the context of examination of protein phylogenetic profiles. Maddison's test was
designed to test specific hypotheses about correlated evolution. For example one of the first
applications of the test was on data testing the association of gregariousness in butterflies
with unpalatable larvae (Sillén-Tullberg 1988; Maddison 1990). Thus in a pairwise comparison
of characters one character is held static as a reference while the locations of changes in the
other, dynamic character are examined over the tree. The terms static and dynamic shall be
used in this context in all subsequent references.

In the case of examinations of correlated evolution in phylogenetic profiles, however,
it is not possible to state whether we are testing for the dependence of the distribution of
protein A on the state of protein B or vice versa. The first effect is thus permutation.
The second effect is based on how subset k is defined. As phylogenetic profiles
compare patterns of presence and absence of genes, subset k can be defined either as the
presence of protein B or as the absence of protein B.
This second effect is however precluded, as defining subset k as the absence of protein
B simply shifts the number of changes sought within k. Consider the tree shown in Figure 4.4.
Suppose, for example, that protein A was gained once within the clade containing Species 1 and
Species 2 and that protein B was present in that clade but nowhere else within the tree. The
ancestral state of B would then be reconstructed parsimoniously as shown in Figure 4.5.
If k is defined as the presence of B then the test is investigating the probability of 1
gain within k with 1 gain over the entire tree. If k was defined as the absence of B then the
test is investigating the probability of 0 gains within k with 1 gain over the entire tree.
Thus over the sample tree, if k is defined as the presence of B, there are 3 ways to
have 1 gain of A within k: 1 on the branch leading to Species 1, 1 on the branch
leading to Species 2 and 1 on the branch leading to the clade. There are 9 ways of having 1
gain over the entire tree. Thus the probability of 1 gain in k is 3/9 or 0.33. If on the other hand
k is defined as the absence of B then there are 0 gains within k with 1 gain overall. There is
one gain to be accounted for and this gain can only occur within the clade containing Species
1 and Species 2. Thus as before there are 3 ways of having one gain within that clade and 9
ways of having one gain over the whole tree, so the associated probability remains the same,
i.e. 0.33.


Figure 4.4: Sample phylogeny of 5 hypothetical species. The numbers on the tree
represent presence and absence of protein B. The arrow points out the point post which
protein A was acquired.


Figure 4.5: Sample phylogeny of 5 hypothetical species. The black area of the tree
corresponds to a Fitch parsimonious reconstruction (Fitch 1971) carried out by Mesquite
(Maddison and Maddison 2010) of protein B if protein B has the phylogenetic profile 11000
(where the order of species in the profile is the same as the numerical order of the species). It
is also the Dollo parsimonious reconstruction. This black area corresponds to subset k if it is
defined as the presence of B. Conversely the white area of the tree corresponds to k if it is
defined as the absence of B.
A further example of this concept can be considered by using the initial example
provided in 4.2.2.2 involving our fictional cow-like species. In that case the probability of
acquiring horns in the presence of valine was calculated as 1/6. If we were to examine the
probability of acquiring horns in the absence of valine then the denominator remains the
same and there are 0 gains of horns in k (the area of the tree where the ability to metabolise
valine is absent). There is 1 way of having 0 gains of horns. Thus the probability of the
observed configuration remains the same.
Thus the results produced by the Maddison-Dollo test over the training data as
described in Chapter 3 were identical with respect to the choice of how subset k is defined.

4.3.1.5 Evaluation of Maddison test as heuristic for constrained ML
The Maddison test (Maddison 1990) as described above modified to accommodate the
assumptions of Dollo parsimony (Section 4.3.2) was implemented using Java. This entailed
writing 3,563 lines of code. The implementation was supplied with the set of reconstructed
states for each protein pair and the phylogenetic tree on which they are reconstructed.
In order to remove permutation-based effects from the analysis, the training dataset as
described in Chapter 3 was doubled so that it included protein pairs in both the A-B and the B-A
orientations. The Maddison test was run on this expanded training set. Thus each protein pair
in the training set had two associated probability scores. The lower of these two scores was
selected, as the lower the probability of the observed distribution of gains and losses the
stronger the evidence for correlated evolution. In order to use this probability as an ascending
score the score was defined as 1-p. The test was run with subset k defined as the absence of
trait B.
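The scoring rule in the paragraph above amounts to a one-line calculation, sketched here for concreteness. The function name `orientation_free_score` and its argument names are hypothetical; the inputs are assumed to be the two p(obs) values obtained for the A-B and B-A orientations of the same protein pair.

```python
def orientation_free_score(p_ab, p_ba):
    """Return 1 - min(p) over the A-B and B-A orientations of one protein pair,
    so that stronger evidence for correlated evolution gives a higher score."""
    for p in (p_ab, p_ba):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - min(p_ab, p_ba)
```

Because the minimum is taken over both orientations, the score is the same whichever protein of the pair is treated as the reference character.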
The ability of the test to detect protein interactions was then judged according to the
criterion of precision/sensitivity as defined in Chapter 3.
This process was then repeated with k defined as the presence of trait B in order to
verify the observation that the results were unaffected by whether k was defined as the
absence or the presence of trait B.
4.3.2 Modification of test to match Dollo constraints
The calculation of the null distribution under the standard Maddison approach (Maddison
1990) allows for all sequences and permutations of gains and losses as allowed by Fitch
parsimony (Fitch 1971). In order to reconcile the test to the assumptions of Dollo parsimony
the test was modified to remove all possibility of a gain following a loss. This was achieved
by examining the state of the character under consideration at the root node. If the root node
was 0 then the standard test as described in Equation 2 was utilised. However if the state was
1 then the following test was used.

$$p(obs) = \frac{B_{root}(p, q \mid x, y, 0) + B_{root}(0, q \mid 0, y, 1)}{W_{root}(x, y \mid 0) + W_{root}(0, y \mid 1)} \qquad (3)$$

If a character is acquired at the root of the tree then no gains can be allowed to occur
post a loss, thus the terms conditioned on a root state of 1 are calculated for 0 gains and y
losses from the root.

To illustrate this difference consider the following tree.

Figure 4.6: Example tree to illustrate the imposition of Dollo parsimonious constraints
on the Maddison test for correlated evolution.
Assume the state of a character C was reconstructed as 1 at the root node of the tree
and there was one gain and one loss to be distributed over the tree. If gains were allowed to
follow losses then there are 4 ways of having one gain and one loss over the tree. These are:

A loss between node 2 and node 6 followed by a gain between node 6 and
node 7.

A loss between node 2 and node 6 followed by a gain between node 6 and
node 8.

A loss between node 2 and node 3 followed by a gain between node 3 and
node 4.

A loss between node 2 and node 3 followed by a gain between node 3 and
node 5.

However with the added Dollo parsimony constraint there are no ways of having one
gain and one loss over the tree in Figure 4.6 if the state at the root node is 1.
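The counts in this example can be checked by brute-force enumeration. The sketch below assumes a topology inferred from the four histories listed above (node 2 as the root with children 3 and 6, node 3 with children 4 and 5, node 6 with children 7 and 8); the figure itself may differ, and the names `PARENT`, `ROOT` and `count_histories` are illustrative only. Every subset of branches is tried as the set of branches carrying a change, and under the Dollo option any history containing a gain below a loss is rejected.

```python
from itertools import product

# Assumed topology for the example tree (child -> parent); node 2 is the root.
PARENT = {3: 2, 6: 2, 4: 3, 5: 3, 7: 6, 8: 6}
ROOT = 2


def count_histories(root_state, gains, losses, dollo=False):
    """Count the ways of placing the given numbers of gains and losses on the
    branches, optionally forbidding any gain that follows a loss."""
    branches = sorted(PARENT)  # parents carry smaller numbers than children here
    total = 0
    for changes in product((0, 1), repeat=len(branches)):
        state = {ROOT: root_state}
        loss_above = {ROOT: False}  # has a loss occurred on the path from the root?
        n_gains = n_losses = 0
        violates_dollo = False
        for child, changed in zip(branches, changes):
            parent = PARENT[child]
            state[child] = state[parent] ^ changed
            loss_above[child] = loss_above[parent]
            if changed and state[parent] == 1:
                n_losses += 1
                loss_above[child] = True
            elif changed:
                n_gains += 1
                violates_dollo = violates_dollo or loss_above[parent]
        if (n_gains, n_losses) == (gains, losses) and not (dollo and violates_dollo):
            total += 1
    return total
```

With a root state of 1, `count_histories(1, 1, 1)` returns 4 and `count_histories(1, 1, 1, dollo=True)` returns 0, matching the text, while histories consisting of a single loss are unaffected by the constraint.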
4.3.3 Differential parsimony
The distance as defined in work by Phillip Kensche (Kensche et al. 2008) and reiterated in
Section 4.3.1 was implemented and its performance examined over the training data.
4.3.4 Dollo-pos/ Dollo-overall
Both measures as described by Barker et al. (Barker et al. 2007) were implemented and
examined in the light of the testing data.
4.3.5 Test based on logistic regression
In order to test for correlated evolution the reconstructed ancestral states allowed the
calculation of potential predictor variables, which bore a correspondence to the transition rate
parameters used by Barker and Pagel (Barker and Pagel 2005; Pagel 1994) as described in
Chapter 3. These parameters represent the rates of transition in state for a discrete binary
character given a particular state for a second discrete binary character.
Given a single set of reconstructed ancestral states these transitions can be empirically
counted over the reconstructed states. For example, if a protein is lost on a given branch of the
phylogeny and a second protein is present on that branch according to the reconstructed states,
this can be counted as a loss of one protein in the presence of the other.
The Dollo parsimony based reconstructions of each phylogenetic profile contain
within them the state of the associated protein at every given point in the tree. It was thus
possible to compare the state of any given ancestral branch within the tree for any two given
proteins. The Dollo reconstruction data is framed in terms of transitions between two nodes.
This meant that it was possible to compare a transition in the state of a given protein with the
state of the other protein at the same point in the tree.
In order to use the Dollo reconstructions of each phylogenetic profile as potential
predictors of functional interaction using logistic regression the reconstructions had to be
framed in terms of being potential predictors of correlated evolution. The possible states that
a protein could be in at any given transition in the tree were coded as:

0: Absent
1: Present
2: Lost
3: Gained

Each profile was associated with a matrix of transitions, which was constructed using
the reconstruction of the ancestral states of the profile over the tree. Pairwise comparisons
were then carried out. Thus at each transition point the state of protein A was compared to
that of protein B.
In order to avoid permutation effects the order in which protein pairs were considered
was made redundant. This was performed by framing transitions in terms of changes in
proteins as opposed to changes in a particular protein. If for example protein A was lost in
the presence of protein B this would be counted as the loss of a protein in the presence of
another, not the loss of A in the presence of B. Pairwise comparisons were thus carried out at
each node of the tree to create the predictors shown in Table 4.2. The lower case s stands for
scenario.


Predictor   Description
s00         A point in the tree where both proteins are absent.
s01         A point in the tree where one protein is present and the other is absent.
s02         A point in the tree where one protein is absent and the other is lost.
s03         A point in the tree where one protein is absent and the other is gained.
s11         A point in the tree where both proteins are present.
s12         A point in the tree where one protein is present and the other is lost.
s13         A point in the tree where one protein is present and the other is gained.
s22         A point in the tree where both proteins are lost.
s23         A point in the tree where one protein is lost and the other is gained.
s33         A point in the tree where both proteins are gained.

Table 4.2: Description of predictor parameters to be utilised in regression model.
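The construction of these predictors can be sketched as follows. The sketch assumes that each protein's reconstruction has already been reduced to a list of state codes (0 absent, 1 present, 2 lost, 3 gained), one per transition point and in the same order for both proteins; the function name `scenario_counts` and this input layout are assumptions, not the thesis implementation. Sorting each pair of codes before forming the predictor name is what makes the counts independent of the order in which the two proteins are considered.

```python
ABSENT, PRESENT, LOST, GAINED = 0, 1, 2, 3
PREDICTORS = [f"s{i}{j}" for i in range(4) for j in range(i, 4)]  # s00 ... s33


def scenario_counts(codes_a, codes_b):
    """Count each unordered scenario over two aligned lists of state codes."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both proteins need one code per transition point")
    counts = dict.fromkeys(PREDICTORS, 0)
    for a, b in zip(codes_a, codes_b):
        low, high = sorted((a, b))  # order-free: (lost, present) == (present, lost)
        counts[f"s{low}{high}"] += 1
    return counts
```

For example, a transition point where one protein is coded 2 (lost) and the other 1 (present) increments s12 regardless of which protein is listed first.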


Whether or not two proteins interact is a binomially distributed variable. Models for
the calculation of the probability of a dichotomous outcome include:

Two-group discriminant function analysis: given two predictors X1 and X2,
discriminant function analysis constructs a variable Z which is a linear function of
X1 and X2. This function is the equation of a line that separates the data under
consideration into two groups. One of these groups will have high values for Z and
the other will have low values (Sokal and Rohlf 1995). Novel data with measured
values for X1 and X2 can thus be classified by solving the equation for Z (Sokal and
Rohlf 1995).

Logistic regression: logistic regression relates the probability of a successful outcome,
in this case that of an interaction, to estimated coefficients for a set of predictors via
an application of the logistic function (Sokal and Rohlf 1995).

Preliminary trials were carried out on the training data, which found the results of logistic
regression and a linear discriminant function to be broadly similar. However the values of
predictors associated with gains are not distributed normally as the number of gains is
restricted to 1. In such cases logistic regression is the recommended technique (Lei and
Koehly 2003) as it makes no assumptions of normality regarding the distribution of the
predictor variables. Thus logistic regression was selected as an appropriate method of testing
whether the predictor variables contribute to the outcome of interaction as well as
determining the degree to which they contribute. Logistic regression is carried out via an
application of the logistic function, which can be defined as shown in Equation 4

$$p(\text{Interaction}) = \frac{e^{a + bX}}{1 + e^{a + bX}} \qquad (4)$$

Where a is the y-intercept of a regression line, e is the base of the natural logarithm
and b is the coefficient of a predictor variable X for a set of predictors $(X_0, X_1, X_2, \ldots, X_i)$.
The logistic function, which is also known as the sigmoid function, returns a value within the
open interval (0, 1) for values in the range of real numbers from $-\infty$ to $+\infty$.
This probability is converted into the odds of an interaction versus no interaction with
the expression $\frac{p}{1 - p}$, where p is equal to the probability of an interaction (Sokal and Rohlf
1995).
Solving the expression substituting Equation 4 as a value for p yields the following
equation (Sokal and Rohlf 1995).

$$\frac{p}{1 - p} = e^{a + bX} \qquad (5)$$

Finally the odds of an interaction are converted into the log odds or logit of an
interaction via Equation 6 (Sokal and Rohlf 1995):

$$\ln\left(\frac{p}{1 - p}\right) = a + bX \qquad (6)$$
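The relationship between Equations 4, 5 and 6 can be checked numerically with a short sketch. The function names below are illustrative, and the single term bX is written as a sum over several predictors, which is how it is used later in the chapter.

```python
import math


def logistic_probability(intercept, coefficients, predictors):
    """Equation 4: p = e^z / (1 + e^z), with z = a + sum(b_i * X_i)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)


def odds(p):
    """Equation 5, left-hand side: odds of an interaction versus none."""
    return p / (1.0 - p)


def logit(p):
    """Equation 6, left-hand side: the log odds."""
    return math.log(odds(p))
```

Applying `logit` to the output of `logistic_probability` recovers the linear predictor a + bX up to floating-point error, which is the content of Equation 6.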

The optimal values for the coefficients and intercept of the regression line are
estimated by maximum likelihood (Sokal and Rohlf 1995). The full positive training set of
9,161 protein pairs was used as examples of proteins that interact. A random subsample of
9,161 protein pairs was then selected from the negative training set as examples of proteins
that do not interact. The sizes of the negative and positive sets were set to be equal to allow the
linear model to create a regression line which matched the distribution of the predictors in
both sets rather than being biased by a larger negative set.
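A balanced training set of this kind can be assembled as sketched below. The function name, the fixed seed and the labelling of pairs with 1 and 0 are assumptions made for the example; the thesis does not state how the random subsample was drawn.

```python
import random


def balanced_training_set(positives, negatives, seed=0):
    """Pair every positive example with an equally sized random subsample of
    the negatives; returns a list of (pair, label) tuples."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negative examples to match the positives")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = rng.sample(list(negatives), len(positives))
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in sampled]
```

Sampling without replacement, as `random.Random.sample` does, guarantees that no negative pair appears twice in the training set.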
Counts were then calculated for each parameter for each pair of profiles within the new
training dataset. The statistical package R (R Development Core Team 2011) was then used
to fit a generalised linear model of the binary interaction outcome on these counts using a
binomial error distribution with a logit link function. The predictor variables were considered
to be continuous.
Predictor variables s13, s22, s23 and s33 were found to cause singularities within the model.
s13 was found to be perfectly correlated with s03 and s33 as Dollo parsimony only allows one
acquisition of a complex trait. s22 and s33 only occur rarely as seen below.

Statistic      Value
Minimum
1st Quartile
Median         0
Mean           0.5259
3rd Quartile
Maximum        14

Table 4.3: Descriptive statistics for counts of predictor s22.

Statistic      Value
Minimum
1st Quartile
Median         0
Mean           0.1
3rd Quartile
Maximum

Table 4.4: Descriptive statistics for counts of predictor s33.


The predictor s23 never occurs at all within the data. The results of the initial
regression are shown in Table 4.5.


Predictor   Coefficient   p value (significance threshold 0.05)
s00         0.02179       0.1278 (Not significant)
s01         0.02607       0.0845 (Not significant)
s02         0.03775       0.0244
s03         -0.24041      2.66 × 10^-5
s11         0.06446       5.46 × 10^-5
s12         0.04350       0.0189

Table 4.5: Coefficients of the initial logistic regression equation.


After removing all predictors that caused singularities as well as all insignificant
predictors the analysis was repeated. This led to the following coefficients as shown in Table
4.6.
Predictor   Coefficient   p value (significance threshold 0.05)
s02         0.019429      2.05 × 10^-7
s03         -0.177835     0.00115
s11         0.043359      < 2 × 10^-16
s12         0.018787      0.00107

Table 4.6: Coefficients of the logistic regression equation derived via examination of
the reduced set of predictors.
The full equation of the regression line is shown below as Equation 7.


$$y = 0.019429\,s_{02} - 0.177835\,s_{03} + 0.043359\,s_{11} + 0.018787\,s_{12} - 0.791849 \qquad (7)$$

y is equal to the logit score of the probability of two proteins interacting versus the
probability of them not interacting. The logit scores were then transformed into a probability
of interaction via an application of the logistic function. This probability was used as the
score.
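As an illustration of how the score is obtained from a set of predictor counts, the following sketch applies the coefficients of Equation 7 and then the logistic function. The coefficients and intercept are taken from Equation 7; the function name and the dictionary-based input are assumptions made for the example.

```python
import math

# Coefficients and intercept as reported in Equation 7.
COEFFICIENTS = {"s02": 0.019429, "s03": -0.177835, "s11": 0.043359, "s12": 0.018787}
INTERCEPT = -0.791849


def interaction_probability(counts):
    """Map predictor counts (dict of predictor name -> count) to a probability."""
    y = INTERCEPT + sum(coef * counts.get(name, 0)
                        for name, coef in COEFFICIENTS.items())  # logit score
    return 1.0 / (1.0 + math.exp(-y))  # logistic function applied to the logit
```

With all counts at zero the score is determined by the intercept alone; increasing s11 raises the probability, while increasing s03 lowers it, in line with the signs of the coefficients.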
The predictor with the largest coefficient in absolute terms is s03.
This would suggest that a protein being gained in the absence of the other is indicative of the
two proteins not being functionally linked (as the coefficient is negative). The other
significant terms, which contribute positively to the probability of an interaction, are s02, s11
and s12. Losses appear to be a defining event for correlated evolution, whether a loss in the
presence of a protein (s12) or a loss in the absence of a protein (s02). This could potentially be
confirmation of work postulating that gene loss is relatively the most important event shaping
gene content and determining phenotype. However this may be attributable to the use of
Dollo parsimony for ancestral state reconstruction. A loss of a protein in the presence of
another might suggest some form of redundancy-based loss. A loss of a protein in the
absence of another on the other hand might suggest a cascade of losses of a group of proteins
that carry out a particular function that is no longer needed in a particular lineage.
s11 is the final significant predictor, which implies that two proteins coexisting for
periods of time at different points in the phylogeny are also more likely to be functionally
linked. This predictor can be thought of as similar to the method used by Cokus (Cokus et al.
2007), except that whereas that work carried out a horizontal comparison of the distribution
of presences and absences of proteins across a set of genomes clustered by similarity, the
predictor s11 measures co-occurrence both horizontally across species and vertically over a
set of putative ancestors as reflected by the phylogenetic tree.

4.3.5.1 Evaluation of logistic regression as a heuristic for constrained ML


In order to examine the efficacy of the derived regression equation it was applied to the full
training dataset. The performance of the method was then evaluated through the
calculation of precision/sensitivity.
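The evaluation at a single cut-off can be sketched as below. The thesis defines precision and sensitivity in Chapter 3; the sketch assumes the standard definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN), and assumes that a pair is predicted to interact when its score is at or above the cut-off. The function name is hypothetical.

```python
def precision_sensitivity(scores, interacts, cutoff):
    """Return (precision, sensitivity) when pairs scoring >= cutoff are
    predicted to interact. `interacts` holds the known True/False labels."""
    tp = fp = fn = 0
    for score, truth in zip(scores, interacts):
        predicted = score >= cutoff
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return precision, sensitivity
```

Sweeping the cut-off over a range of values and recording the pair of numbers at each step produces a precision/sensitivity curve for a scoring method.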

4.4 Results
The results for the five tests carried out were measured in terms of precision and sensitivity
over the training data. The intersection of the predictions made by the tests with the 6
predictions made by constrained ML method (Barker et al. 2007; Barker and Pagel 2005) at
its optimum rate of gain 0.025 and its optimum likelihood ratio cut-off of 58.54 over the
training dataset was also examined. As pointed out in Chapter 3 this combination of rate of
gain and likelihood ratio cut-off yielded 5 predictions all of which were true positives.
Maintenance of this intersection is judged to be a key criterion for any data filter to form part
of a heuristic approach. Thus in order to use a test as a data filter for constrained ML (Barker
et al. 2007; Barker and Pagel 2005) the highest acceptable cut-off for the test was considered
to be the point at which all 5 predictions made by the ML method were preserved.
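Under the assumption that a filter predicts an interaction whenever a pair's score is at or above the cut-off, the highest cut-off that still preserves every constrained ML prediction is simply the lowest filter score among those pairs. A minimal sketch, with hypothetical names and input layout:

```python
def highest_preserving_cutoff(filter_scores, ml_predictions):
    """Return the largest cut-off at which every pair predicted by the
    reference method is still retained by the filter (score >= cut-off).
    `filter_scores` maps a protein pair to its filter score."""
    pairs = list(ml_predictions)
    missing = [pair for pair in pairs if pair not in filter_scores]
    if missing:
        raise KeyError(f"no filter score for: {missing}")
    return min(filter_scores[pair] for pair in pairs)
```

Any cut-off above the returned value would drop at least one of the reference predictions from the filtered set.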

4.4.1 Maddison test for correlated evolution

Figure 4.7: Performance of the Maddison test for correlated evolution on the training data with k defined as the absence
of trait B. The figure shows the performance between cut-offs of 0.9999 and 1 rising by increments of 1 × 10^-7.


Figure 4.8: Performance of the Maddison test for correlated evolution on the training data with k defined as the
presence of trait B. The figure shows the performance between cut-offs of 0.9999 and 1 rising by increments of
1 × 10^-7.

The Maddison test for correlated evolution (Maddison 1990) performs reasonably
well on the training data as can be seen in Figure 4.7. There is a maximum intersection of 1 in
this range of cut-offs with the 5 predictions made by constrained ML (Barker et al. 2007;
Barker and Pagel 2005) at its optimum rate of gain and optimum LR cut-off. It is however a
computationally intensive process. Each node in the tree is evaluated firstly, for the
calculation of the null distribution, for all combinations of states, gains and losses, and
secondly for all combinations of gains within subset k, losses within subset k,
gains over the whole tree, losses over the whole tree and states. Binary tree traversal is
carried out in linear time (Felsenstein 1973). The amount of time taken, however, rises at a much
steeper rate as the number of gains and losses to be accounted for increases (Maddison
1990).

The mirror property of the algorithm, i.e. that the calculations mirror each other for
any given combination of states (Maddison 1990) cannot be utilised in this particular case as
gains are restricted to 1 by the use of Dollo parsimony.

4.4.2 Differential parsimony

Figure 4.9: Precision and sensitivity measure of differential parsimony over training data.
Differential parsimony (Kensche et al. 2008) was not very successful on the training
data as shown in Figure 4.9. It reached a maximum precision of 0.22547 with a sensitivity of
0.023. There was an intersection of 3 at this point with the predictions made by constrained
ML at its optimum rate of gain and LR cut-off.


4.4.3 Dollo-pos

Figure 4.10: Precision and sensitivity measure of Dollo-pos over training data. Range of cut-offs at which predictions are made: 0 to 14.
Dollo-pos (Barker et al. 2007) was fairly successful over the training data with a
maximum precision of 0.57 at a cut-off of 13 as shown in Figure 4.10. The sensitivity at this
point was 0.00043. There was an intersection of 2 with the 6 predictions made by constrained
ML (at its optimum rate of gain and LR cut-off) at this cut-off. In order to use Dollo-pos as a
potential data filter the lowest acceptable cut-off that maintained an intersection with all 6
predictions made by constrained ML was 1. At this cut-off the precision of Dollo-pos was
0.105 and the sensitivity was 0.389.


4.4.4 Dollo-overall

Figure 4.11: Precision and sensitivity measure of Dollo-overall over training data. Range of
cut-offs at which predictions are made: -23 to 14.
The results for Dollo-overall (Barker et al. 2007) were found to be broadly similar to
Dollo-pos (Barker et al. 2007) as seen in Figure 4.11. However the effective range of cut-offs
for Dollo-overall is shifted down. The maximum precision achieved was 0.5 at a cut-off of
11. The sensitivity at this point was 0.00002. The lowest cut-off at which the intersection
with the 6 predictions made by constrained ML (at its optimum rate of gain and LR cut-off)
was maintained was -21. At this cut-off the precision of Dollo-overall was 0.08074 and the
sensitivity was 0.99.
4.4.5 Logistic regression
Logistic regression performed well on the training data as can be seen below in Figure 4.12.


Figure 4.12: Precision and sensitivity measure of logistic regression over training data. Range
of probability cut-offs at which predictions are made: 0 to 1. Cut-offs were incremented by
0.001.
The maximum precision achieved by logistic regression was 0.736 at a sensitivity of
0.01. This was achieved at a probability cut-off of 0.967. The lowest probability cut-off at
which the intersection with predictions made by constrained ML was maintained was 0.85.
The precision achieved at this point was 0.479 and the sensitivity was 0.0598. The method
that achieved the highest precision while maintaining a full intersection with the predictions
made by constrained ML was thus logistic regression.

In order to cross validate the ability of the logistic regression based filter to
discriminate between proteins that interact and those that do not, the filter was run on the
testing data. Figure 4.13 shows the performance of the filter over the testing data.

Figure 4.13: Precision and sensitivity measure of logistic regression over testing data. Range
of probability cut-offs at which predictions are made: 0 to 1. Cut-offs were incremented by
0.001.
As the s predictors used to determine the logit score used in the logistic regression
filter are based on the transition rate parameters used to detect correlated evolution by the
constrained ML technique (Barker et al. 2007; Barker and Pagel 2005; Pagel 1994), it was
expected that there would be a correlation between a high logit score for a protein pair and a
high LR (likelihood ratio statistic) score using the exemplar rate of gain elucidated in Chapter 3
(0.025). The distribution of LR scores is extremely skewed towards the lower end as can be
seen in Chapter 3. All true positive protein pairs detected by constrained ML (Barker et al.
2007; Barker and Pagel 2005) at the optimum rate of gain lie within the 99th percentile of LR
scores. This is due to the fact that many protein pairs within the training dataset
display little or no evidence of correlated evolution, i.e. the low sensitivity of the method as
shown in Chapter 3.
It is only in the upper ranges of the LR scores that protein pairs that interact are
distinguished from those that do not interact. This phenomenon was also observed by Barker
as well as by Kensche (Barker et al. 2007; Kensche et al. 2008). Thus in order to display the
relationship between the two prediction systems the logit derived probability scores of
protein pairs with corresponding LR scores that lay in the 95th percentile of the distribution of
LR scores were selected. This came to a set of 5,658 protein pairs.
The logit derived probability scores of these 5,658 protein pairs were plotted against their
corresponding LR scores (Figure 4.14). The relationship between the values is
displayed via an overlaid regression line. There is a large amount of scatter around the
regression line; however, the relationship between the two variables is found to be significant
(p value < 0.001).


Figure 4.14: Linear regression line (Adjusted R^2 = 0.1049) drawn over a plot of logit
derived probability scores against likelihood ratio statistics over the training data. Vertical
dotted line shows optimum cut-off for likelihood ratio statistic. Horizontal dotted line shows
optimum cut-off for the logit derived probability score.
Figure 4.14 shows that there is a positive relationship between the LR score generated
by constrained ML (Barker et al. 2007; Barker and Pagel 2005) and the logit derived
probability scores generated by application of Equation 6 to the set of reconstructed ancestral
states.
Figure 4.15 shows that a similar relationship is observed over the testing data set (p
value < 0.001).

Figure 4.15: Linear regression line (Adjusted $R^2$ = 0.0995) drawn over a plot of logit derived
probability scores against likelihood ratio statistics over the testing data. Vertical dotted line
shows optimum cut-off for likelihood ratio statistic determined by the training data.
Horizontal dotted line shows optimum cut-off for the logit derived probability score
determined by training data.

4.4.6 Hamming distance


In order to further improve the quality of the heuristic, Hamming distance was applied to the
set of predictions generated at each logit score cut-off. The goal of the additional filter was to
reduce the size of the search space while still preserving the 6 true positive predictions made
by constrained ML.
This yielded an unexpected result. The expectation was that the number of true positives
predicted would fall as the Hamming distance between profiles increased; instead it
went up. This would suggest that as evidence of correlated evolution goes down, the
probability of predicting a protein-protein interaction goes up. This anomalous result is
probably due to the fact that Hamming distance does not take into account the phylogenetic
relationships between the organisms in the study, as pointed out by Barker and Pagel (Barker and Pagel
2005).
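As a minimal sketch of the measure discussed above, the Hamming distance between two phylogenetic profiles can be computed by counting the positions at which the presence/absence characters differ. The two short profiles below are invented for illustration and are not taken from the thesis data; the function name is likewise an assumption.

```python
def hamming_distance(profile_a: str, profile_b: str) -> int:
    """Count positions at which two equal-length 0/1 profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of taxa")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical profiles over ten taxa (illustrative only).
profile_x = "0011010010"
profile_y = "0011110000"
print(hamming_distance(profile_x, profile_y))
```

Note that the count treats every taxon as independent, which is exactly why the measure ignores the phylogenetic relationships between the organisms.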
4.5 Discussion
The Maddison-Dollo test proved not to be an effective filter, as evidenced by its running time (Table 4.7). The
method achieved a high level of accuracy; however, it did not maintain an intersection with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) at its higher cut-offs.
Differential parsimony (Kensche et al. 2008) was unable to differentiate between the
negative and positive training data to an adequate degree.
Dollo-pos and Dollo-overall (Barker et al. 2007) were both able to differentiate
between the negative and positive data; however, they did not preserve the intersection with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) at a reasonable level of
precision/sensitivity.
The application of logistic regression on the other hand is a viable filter for
constrained ML analysis (Barker et al. 2007; Barker and Pagel 2005) as evidenced by both its
relative speed as well as its relationship with the LR scores provided by constrained ML
(Barker et al. 2007; Barker and Pagel 2005).
The filter utilising logistic regression on a single set of reconstructed ancestral states
as implemented above is quick as compared to both constrained ML (Barker et al. 2007;
Barker and Pagel 2005) as well as the Maddison test for correlated evolution (Maddison
1990). The comparative CPU time taken by each method over the training data as measured
by the Mac OS X utility time is given in Table 4.7.


Process | Duration: Minutes/Hours
--- | ---
Maddison Dollo test (Farris 1977; Felsenstein 1989; Maddison 1990) | 3988 minutes and 20 seconds / 66 hours and 28 minutes (approximately).
Logistic regression using Dollo parsimony reconstructions (Farris 1977; Felsenstein 1989) | 3 minutes and 48 seconds.
Constrained ML (Barker and Pagel 2005; Barker et al. 2007) | 6624 minutes and 37 seconds / 110 hours and 24 minutes.

Table 4.7: Times taken by each of the three methods on the training data. The time given for
the Maddison Dollo test is an extrapolation from a run on 12.5% of the training data. This
12.5% was selected randomly. Times given in minutes are rounded to the nearest second.
Times given in hours are rounded to the nearest minute. All tests were run on an Intel Xeon
3.0 GHz processor.
The reduction in potential protein pairs to be investigated via an application of the
logistic regression filter using the probability score cut-off of 0.85 is 111,902 of 113,132 pairs. This is a reduction of 98.9%. As the filter selects for protein pairs that show
evidence of correlated evolution, the remaining 1.1% should be enriched for proteins
amenable to investigation via phylogenetic profiling using ML reconstructions with
constrained rates of gain (Barker et al. 2007; Barker and Pagel 2005). Thus an application of
the filter to the full human genome followed by an application of phylogenetic profiling using
constrained ML (Barker et al. 2007; Barker and Pagel 2005) will potentially yield a large set
of interactions from within the human genome some of which may be novel.
All code implementing the procedures described in this chapter is available on request from
the author.

Chapter 5
Genome-wide prediction of protein functional interactions in humans using
a heuristic approach
5.1 Introduction
The interactome of an organism can be defined as the complete set of molecular interactions
that occur within its full complement of cell types (Yu et al. 2008). This study focuses on the
elucidation of interactions between proteins (both direct and indirect) in the human proteome
(PPIs). PPIs have been defined as physical interactions between proteins (De Las Rivas and
Fontanillo 2010). PPIs are detected by methods such as the yeast 2-hybrid, tandem affinity
purification coupled to mass spectrometry (TAP-MS) and
co-immunoprecipitation (De Las Rivas and Fontanillo 2010). Interactions between
proteins that are indirect can be detected by gene co-expression as investigated in Chapter 3
or techniques like double mutant synthetic lethality (De Las Rivas and Fontanillo 2010).
Indirect protein interactions are also detected by TAP-MS as proteins that share membership
of a complex do not necessarily maintain a direct physical interaction. Computational
interaction detection methods as described in Chapter 1 can contribute to this effort by
pointing out putative interactions, which can then be further verified. This chapter describes
the application of the logistic regression-based data filter developed in Chapter 4 in
combination with constrained ML (Barker et al. 2007; Barker and Pagel 2005) phylogenetic
profile analysis to detect potential novel protein-protein interactions as well as novel indirect
interactions between proteins.
5.1.1 PPI databases
As experimental data on protein-protein interactions has accumulated there have been a
number of attempts to organise and annotate it. There are thus a
number of databases that contain data on human protein-protein interactions. As any
attempt to examine the quality of predicted PPIs requires comparison with known data, a brief
overview of the major PPI databases is presented below.
5.1.1.1 MIPS
MIPS (Mammalian Protein-Protein Interaction Database) (Pagel et al. 2005) is a PPI database
which contains 1,812 experimentally verified human protein-protein interactions (PPIs). It
only includes published data from individual experiments as opposed to large scale high-throughput surveys (Pagel et al. 2005).

5.1.1.2 BIND
BIND (Biomolecular Interaction Network Database) contains data on three main interaction
types (Bader et al. 2001). These are binary interactions, molecular complexes and pathway
data (Bader et al. 2001).
5.1.1.3 MINT
MINT (Molecular INTeraction), in contrast to MIPS, contains data from large scale high-throughput experiments (Chatr-aryamontri et al. 2007). As of 2009 it contains data derived
from more than 19,000 experiments and 25,105 curated human PPIs (Ceol et al. 2010).
5.1.1.4 INTACT
IntAct is one of the larger PPI databases containing over 200,000 curated binary protein
interactions (Aranda et al. 2010). IntAct follows an extremely specific curation process with
information from experiments being recorded in high detail using a number of controlled
vocabularies to facilitate further data analysis (Aranda et al. 2010).
5.1.1.5 HPRD
The HPRD (Human Protein Reference Database) is a human specific PPI database. There are
currently 45,207 interactions held in the HPRD (Prasad et al. 2009). It contains manually
curated data on protein interactions derived from both high throughput surveys as well as
single experiments (Prasad et al. 2009).
5.1.1.6 DIP
The DIP (Database of Interacting Proteins) is one of the earlier PPI databases. It contains data
derived from manual curation of the literature as well as from structural information on
complexes derived from the PDB (Protein Data Bank) (Salwinski et al. 2004).
5.1.1.7 REACTOME
The REACTOME database holds data on PPIs in the context of the biological pathways that
underpin cellular processes and is also manually curated (Haw et al. 2011).

5.1.1.8 STRING
The STRING database holds data on PPIs that are experimentally verified and also adds a set
of computationally predicted PPIs (von Mering et al. 2005). It contains PPI information on
630 organisms (Jensen et al. 2009). The total number of interactions held by STRING
exceeds 50,000,000 (Jensen et al. 2009).
5.1.1.9 I2D
The I2D (Interologous Interaction) database contains the full literature derived predictions
from the databases HPRD, BIOGRID, IntAct, BIND and MINT as well as computationally
predicted interactions (Brown and Jurisica 2005). The sources of evidence utilised for
computational predictions include domain co-occurrence, gene co-expression and intersection
of GO terms. I2D contains 133,250 unique entries for detected protein interactions between
13,490 proteins.
5.1.1.10 KEGG
KEGG as mentioned in Chapter 1 localises gene products within functional pathways
(Kanehisa 1997; Kanehisa et al. 2006). This is similar to REACTOME.
5.1.1.11 BIOGRID
The BIOGRID database also contains curated data. It has 49,378 interactions involving
human proteins (Stark et al. 2006).
5.1.1.12 Discussion
There is an overlap between these databases as they are all based on examination of similar
experimental data (De Las Rivas and Fontanillo 2010). Given that current estimates of the
human interactome size are around 650,000 interactions including non-direct functional interactions
(Stumpf et al. 2008), a large number of interactions remain to be characterised.
5.1.2 Power law
In order to examine the statistical properties of PPI networks, these networks are usually
analysed as graphs (Jeong et al. 2001). An interesting observation is that the degree distribution
within some of these graphs (the degree of a vertex within a graph is the number of edges
connected to that vertex) appears to follow a power law (Jeong et al. 2001). That is, the
number of vertices within a graph with degree k is approximately proportional to $k^{-x}$, where x is a constant
(Alon 2007). What this entails is that for any given protein within the PPI network the
probability of having a large degree (many interactions) is low. There will however be
proteins within the network that will have a large number of interacting partners. These
proteins have been referred to as hubs (Han et al. 2004). It has been hypothesised that
there are two forms of protein hub (Han et al. 2004). These are date hubs, which interact
with different partners at different times, and party hubs, which interact with multiple
partners simultaneously (Han et al. 2004).
Networks with similar degree distributions have been observed in both natural and
man-made networks such as the neural arrangements of C. elegans and the power grid of the
western United States (Watts and Strogatz 1998).
In the case of PPI networks, however, as there is a clear physical limit to the number of
partners that a given molecule can interact with, the power law distribution over a
PPI network will sharply decay at the upper end of the distribution, as hub proteins reach
saturation point with a given number of interaction partners. Similarly the lower end of the
distribution may not match a power law, as the cellular environment and other physiochemical factors will affect the probability of being an entirely monogamous interacting
partner in a binary interaction. Examples of PPI networks that do not exhibit a power law in
degree distribution have been pointed out in the literature, e.g. in work by Tanaka (Tanaka et
al. 2005).
5.2 Methods
The first step in carrying out a full genome-wide survey was to develop a list of all possible
ordered pairs of proteins within the version of RefSeq (Pruitt et al. 2005) used. This came to a
total of 560,237,601 pairs. The logistic regression-based filter implemented in Chapter 4
was applied to the ordered pairs of profiles at its optimum probability cut-off of 0.85.
Removing all pairs that scored beneath this threshold resulted in a total of 5,312,880 pairs of
proteins. This was a reduction of approximately 99% of the total search space. This set of
reduced profile pairs was then analysed by constrained ML (Barker et al. 2007; Barker and
Pagel 2005) with the rate of gain parameters restricted to the optimum rate of 0.025. This
analysis was carried out using a cluster consisting of 260 2 GHz dual core Opteron 270
processors.
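The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the protein count of 33,473 is not stated in the text but is inferred here from the observation that 560,237,601 equals n(n+1)/2 for that n, that is, the number of unordered pairs including self-pairs. The function and variable names are assumptions.

```python
def unordered_pairs_including_self(n: int) -> int:
    """Number of unordered pairs (self-pairs included) among n items."""
    return n * (n + 1) // 2


TOTAL_PAIRS = 560_237_601        # quoted in the text
PAIRS_KEPT = 5_312_880           # pairs scoring at or above the 0.85 cut-off
ASSUMED_PROTEIN_COUNT = 33_473   # inferred, not stated in the text

fraction_kept = PAIRS_KEPT / TOTAL_PAIRS
reduction_percent = 100 * (1 - fraction_kept)

print(unordered_pairs_including_self(ASSUMED_PROTEIN_COUNT) == TOTAL_PAIRS)
print(f"{reduction_percent:.2f}% of pairs removed by the filter")
```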
The results of the constrained ML analysis were then filtered for pairs with a
likelihood ratio (LR) statistic score higher than 58.54 (this was the optimum LR score
determined in Chapter 3). This led to a set of 20,605 predicted interactions between protein
pairs, involving 2,188 individual proteins. In order to examine predicted interactions
between members of the same orthologous group, predictions were then converted to
predictions between orthologous groups. These orthologous groups were identified as the
groups clustered by the Inparanoid (Remm et al. 2001) implementation described in Chapter
2, resulting in a predicted set of 7,150 interactions between orthologous group pairs,
involving 1,417 individual orthologous groups.
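The two steps described above, thresholding on the LR statistic and collapsing protein pairs to orthologous group pairs, might be sketched as follows. This is a minimal illustration under stated assumptions: the protein identifiers, LR scores and group assignments are invented, and the function names are hypothetical rather than those of the thesis code.

```python
LR_CUTOFF = 58.54  # optimum LR score quoted in the text


def filter_by_lr(scored_pairs, cutoff=LR_CUTOFF):
    """Keep (protein_a, protein_b) pairs whose LR score exceeds the cut-off."""
    return [(a, b) for a, b, lr in scored_pairs if lr > cutoff]


def to_group_pairs(protein_pairs, group_of):
    """Collapse protein pairs to a set of unordered orthologous group pairs.

    A pair whose members share a group collapses to a one-element frozenset.
    """
    return {frozenset((group_of[a], group_of[b])) for a, b in protein_pairs}


# Hypothetical scores and group assignments (illustrative only).
scored = [("p1", "p2", 60.0), ("p1", "p3", 10.0),
          ("p4", "p2", 75.2), ("p1", "p4", 90.0)]
group_of = {"p1": "g1", "p2": "g2", "p3": "g3", "p4": "g1"}

kept = filter_by_lr(scored)
print(kept)
print(to_group_pairs(kept, group_of))
```

Because several proteins can belong to one orthologous group, the number of group pairs is at most the number of protein pairs, which is consistent with the drop from 20,605 to 7,150 reported in the text.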
5.2.1 Short Branch filtration
Examination of the distribution of interactions amongst the individual proteins showed that
some of the individual proteins were predicted to have an extremely large number of
interacting partners. The maximum number of interaction partners was predicted for the
protein with RefSeq GI number 148613856 (described as probable ATP-dependent RNA
helicase DDX17 isoform 3 on the NCBI website). This was predicted to have 1,503
interactions. However, given the overall distribution of interaction partners within the set of
predictions as shown below in Figure 5.1, these extreme numbers seem implausible.

Figure 5.1: Distribution of number of predicted interaction partners/protein in constrained
ML predictions.
Thus the profiles of these highly connected proteins were investigated.
Another protein, with the RefSeq GI number 29029591 (labelled putative ribosomal RNA
methyltransferase 1 isoform b on the NCBI website), was predicted to take part in 1,430
interactions. The phylogenetic profile of this protein was:
001101001001010001111010100010000001001000010001111010
This translates to the protein being present in Ashbya gossypii, Aspergillus fumigatus,
Bombyx mori, Caenorhabditis elegans, Canis familiaris, Cryptococcus neoformans,
Debaryomyces hansenii, Dictyostelium discoideum, Drosophila melanogaster, Drosophila
pseudoobscura, Entamoeba histolytica, Homo sapiens, Magnaporthe grisea, Paramecium
tetraurelia, Plasmodium knowlesi, Schizosaccharomyces pombe, Theileria annulata,
Theileria parva, Trichomonas vaginalis, Trypanosoma brucei and Ustilago maydis. This is
an extremely unbalanced distribution over the tree as illustrated in Figure 5.2.
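A profile string of this kind can be decoded mechanically. The sketch below counts the presences in the profile quoted above and shows how positions would map to taxon names; the short taxon list in the usage example is a placeholder, since the actual column order of the 54 taxa is not given in this section.

```python
PROFILE = "001101001001010001111010100010000001001000010001111010"


def present_positions(profile: str) -> list:
    """Return the zero-based positions at which the profile holds a 1."""
    return [i for i, state in enumerate(profile) if state == "1"]


def decode_profile(profile: str, taxa: list) -> list:
    """Map a 0/1 profile onto an ordered list of taxon names."""
    if len(profile) != len(taxa):
        raise ValueError("profile length must match the number of taxa")
    return [taxa[i] for i in present_positions(profile)]


print(len(PROFILE), len(present_positions(PROFILE)))

# Placeholder taxon order for a toy profile (not the thesis ordering).
print(decode_profile("0110", ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]))
```

The profile covers 54 taxa and marks 21 of them as present, which agrees with the 21 species listed above.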

Figure 5.2: Distribution of protein labelled putative ribosomal RNA methyltransferase 1
isoform b. The character 1 indicates presence of the orthologous group.

It was hypothesised that the unbalanced distribution of this profile contributed to its
display of a high likelihood ratio statistic (LR) score with a large number of proteins. In
particular it was hypothesised that the profiles of prediction-heavy proteins might contain
losses on the branches leading to P. troglodytes and M. mulatta. As the branches leading to
these taxa are short (see Chapter 2) this may contribute to spuriously high LR scores. In
order to investigate this hypothesis a list of RefSeq GIs for proteins lost on either the branch
leading to P. troglodytes or the branch leading to M. mulatta was extracted from the overall set
of phylogenetic profiles.
The following procedure was then applied iteratively.

Set cut-off to 0.

Select all protein GIs from predicted interactions where the no. of predicted interactions
for that protein > cut-off.

Examine intersection of GIs in selected list with set of GIs of proteins lost on short
branches.

Increment cut-off by 1.

At the point where the cut-off was equal to 298, the intersection between the two sets was
100% as shown in the Venn diagram below.

Figure 5.3: Intersection of proteins lost in P. troglodytes and M. mulatta with proteins
with > 298 predicted interaction partners.
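The iterative procedure above can be sketched as a simple loop. This is an illustration under assumptions rather than the code used in the thesis: it interprets a 100% intersection as every selected protein belonging to the short-branch-loss set, and the toy identifiers are invented.

```python
from collections import Counter


def partner_counts(pairs):
    """Count predicted interaction partners per protein."""
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1
        if b != a:
            counts[b] += 1
    return counts


def smallest_full_overlap_cutoff(pairs, lost_on_short_branches):
    """Raise the cut-off until all proteins above it are short-branch losses."""
    counts = partner_counts(pairs)
    cutoff = 0
    while True:
        selected = {p for p, n in counts.items() if n > cutoff}
        if not selected:
            return None  # no cut-off gives a complete overlap
        if selected <= lost_on_short_branches:
            return cutoff
        cutoff += 1


# Toy data: one hub protein "h" that is lost on a short branch.
toy_pairs = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("a", "b")]
print(smallest_full_overlap_cutoff(toy_pairs, {"h"}))
```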

The 16 proteins in this intersection alone accounted for 13,082 of the total predicted protein
interactions or 63%. It is impossible to tell whether these proteins are genuinely evolving in a
correlated fashion or whether the predictions are merely an artefact of the loss on a short branch.
A post-processing step was therefore applied which removed any prediction involving
proteins with profiles that matched this pattern. The 16,301 proteins with profiles that
contained a 0 at either the position representing P. troglodytes or the position representing M.
mulatta were removed from the set of predicted interactions. This led to the removal of
19,463 predicted interactions between proteins. This left a reduced set of 1,142 predictions.
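The post-processing step can be illustrated as follows. The column indices for the two short-branch taxa, the profiles and the protein identifiers are all hypothetical; only the rule itself (drop any prediction involving a protein with a 0 at either position) comes from the text.

```python
def lost_on_short_branch(profile: str, short_branch_positions) -> bool:
    """True if the profile holds a 0 at any of the short-branch positions."""
    return any(profile[i] == "0" for i in short_branch_positions)


def short_branch_filter(pairs, profiles, short_branch_positions):
    """Remove predictions that involve any protein flagged as a short-branch loss."""
    flagged = {protein for protein, profile in profiles.items()
               if lost_on_short_branch(profile, short_branch_positions)}
    return [(a, b) for a, b in pairs if a not in flagged and b not in flagged]


# Hypothetical example: columns 1 and 3 stand in for the two short-branch taxa.
SHORT_BRANCH_POSITIONS = (1, 3)
profiles = {"p1": "1111", "p2": "1011", "p3": "1110", "p4": "0101"}
pairs = [("p1", "p2"), ("p1", "p4"), ("p3", "p4")]
print(short_branch_filter(pairs, profiles, SHORT_BRANCH_POSITIONS))
```

The arithmetic in the text follows the same logic: 20,605 predictions minus the 19,463 removed leaves 1,142.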
An examination of the training data showed that 2 of the 5 predicted interactions by
constrained ML (Barker et al. 2007; Barker and Pagel 2005) at its selected optimum rate of
gain (0.025) during the training step (see Chapter 3) involved the protein
(16936528/NP_001789) which would have been removed by this post processing step. This
reduces the sensitivity of the method by 40%.
5.2.2 GO term enrichment
A plausible method to examine the potential accuracy of the predicted interactions is to test
whether the GO terms associated with the predicted interaction partners are enriched for
particular terms.
In order to subject the data to GO (Gene Ontology) term analysis (Ashburner et al.
2000) the set of 1,142 predicted interactions between protein pairs was converted to
predictions between gene pairs. This was carried out using IDconverter (Alibes et al. 2007).
This produced a set of 273 interactions between pairs of genes, involving 183 individual genes.
In order to investigate the validity of predicted interactions the set of interactions
between genes was converted into a network of interactions. The network can be represented
as an undirected graph. The graph in this case would be undirected as there is no way of
inferring any form of directional relationship between putative predictions.
The predicted interactions were converted into a graph through insertion of the
characters xx between the two members of each predicted pair. This converted the predicted interactions into a
format known as the simple interaction format, which is usable by a platform known as
Cytoscape (Shannon et al. 2003). Cytoscape is a program that allows visualisation and
analysis of network data (Shannon et al. 2003) and is widely used for such analyses. The
broad structure of the resultant graph is shown in Figure 5.4.
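The conversion to simple interaction format amounts to writing one line per predicted pair, with the relationship label placed between the two node names. A minimal sketch is given below; the gene names are placeholders, and the use of a single space as the delimiter is an assumption (the format also accepts tab-delimited lines).

```python
def to_sif_lines(pairs, relation="xx"):
    """Format predicted pairs as simple interaction format (SIF) lines."""
    return [f"{a} {relation} {b}" for a, b in pairs]


# Placeholder gene names (illustrative only).
predicted = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]
for line in to_sif_lines(predicted):
    print(line)
```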

Figure 5.4: Graph of 273 interactions between genes as predicted by the application of
constrained ML post data filtering. Each vertex in the graph is one gene. The edges in the
graph represent a predicted functional interaction between two vertices.
In order to examine the quality of the predictions the graph was subjected to clique
analysis to break up the network into densely connected sub-graphs. The
Cytoscape plugin ClusterViz (Cai 2010) was utilised to deconstruct the network into sub-clusters. The plugin was used with the FAG-EC agglomerative hierarchical algorithm (Li et
al. 2008), which builds up sub-clusters through analysis of the clustering coefficient of each
edge in the graph. The clustering coefficient measures the density of connections between a
given edge and its neighbours. It does this by calculating the number of triangles that a given
edge is part of and dividing this number by the number of triangles that might potentially
include it given the degree (number of incident edges) of its adjacent nodes (Radicchi et al.
2004). FAG-EC was run with a specified cut-off of sub-cliques of at least size 3. This was
because GO terms can be found to be significantly enriched in pairs of proteins even if an
annotation is attached to just one protein in the pair.
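The edge clustering coefficient described above can be sketched directly from its verbal definition: the number of triangles containing an edge divided by the largest number of triangles it could belong to, taken here as the smaller of the two endpoint degrees minus one. This follows the description in the text; some formulations add one to the triangle count in the numerator, and the choice to return None when no triangle is possible is an assumption of this sketch.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def edge_clustering_coefficient(adjacency, u, v):
    """Triangles through edge (u, v) divided by the maximum possible number."""
    triangles = len(adjacency[u] & adjacency[v])
    possible = min(len(adjacency[u]) - 1, len(adjacency[v]) - 1)
    if possible <= 0:
        return None  # an endpoint of degree 1 cannot take part in a triangle
    return triangles / possible


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
toy_adjacency = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(edge_clustering_coefficient(toy_adjacency, "a", "b"))
print(edge_clustering_coefficient(toy_adjacency, "c", "d"))
```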

FAG-EC was also run with a weak module definition. This identifies modules as
sub-cliques within graphs where the sum of the in-degrees of the nodes within a module is higher
than the sum of their out-degrees (Li et al. 2008). The in-degree of a node within an undirected
graph is defined as the number of edges connecting it to other nodes in the same
subgraph (Li et al. 2008). The out-degree of a node is defined as the number of edges
connecting it to the rest of the graph excluding its subgraph (Li et al. 2008).
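The weak module criterion can be checked with a short function that sums in-degrees and out-degrees over the nodes of a candidate subgraph. The graph and node names below are invented for illustration, and the function is a sketch of the definition as worded above rather than of the FAG-EC implementation itself.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def is_weak_module(adjacency, module):
    """True if summed in-degree exceeds summed out-degree over the module."""
    in_degree_sum = 0
    out_degree_sum = 0
    for node in module:
        for neighbour in adjacency[node]:
            if neighbour in module:
                in_degree_sum += 1   # edge stays inside the subgraph
            else:
                out_degree_sum += 1  # edge leaves the subgraph
    return in_degree_sum > out_degree_sum


# Toy graph: triangle a-b-c joined to a short chain c-d-e.
graph = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])
print(is_weak_module(graph, {"a", "b", "c"}))
print(is_weak_module(graph, {"c", "d"}))
```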
The application of this algorithm yielded 10 connected sub-cliques. GO term
enrichment was examined through the use of the Cytoscape plugin Bingo (Maere et al. 2005).
Bingo operates through examination of all GO terms associated with a given network. There
are a number of sources of evidence by which a term may be associated with a gene
(Ashburner et al. 2000). These are:

IMP: inferred from mutant phenotype

IGI: inferred from genetic interaction

IPI: inferred from physical interaction

ISS: inferred from sequence similarity

IDA: inferred from direct assay

IEP: inferred from expression pattern

IEA: inferred from electronic annotation

TAS: traceable author statement

NAS: non-traceable author statement

ND: no biological data available

IC: inferred by curator

In order to utilise only reliable sources of evidence, terms that were determined using the evidence
codes ISS, IEA, NAS and ND were excluded.
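The exclusion of these evidence codes is a simple filter over annotation records. In the sketch below each annotation is represented as a (gene, GO term, evidence code) tuple; that layout, the gene names and the GO identifiers are assumptions made for illustration and do not reflect how Bingo stores annotations internally.

```python
EXCLUDED_EVIDENCE_CODES = {"ISS", "IEA", "NAS", "ND"}


def keep_reliable(annotations, excluded=EXCLUDED_EVIDENCE_CODES):
    """Drop annotations supported only by an excluded evidence code."""
    return [record for record in annotations if record[2] not in excluded]


# Hypothetical annotation records (illustrative only).
annotations = [
    ("GENE_A", "GO:0000001", "IDA"),
    ("GENE_A", "GO:0000002", "IEA"),
    ("GENE_B", "GO:0000001", "TAS"),
    ("GENE_C", "GO:0000003", "ND"),
]
print(keep_reliable(annotations))
```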
Bingo operates by calculating the probability of the association of a given set of terms
with a cluster of genes given a background distribution of terms associated with a reference
set of genes. This is calculated using the hypergeometric test (Maere et al. 2005). The
probability of a given set of genes being associated with a given GO term follows the
hypergeometric distribution, which is equivalent to the binomial distribution but utilising
sampling without replacement (Sokal and Rohlf 1995). The probability of a cluster C of r
genes being associated with a given GO term g (if evaluated against a background set of N
genes and assuming the total number of genes associated with g is t) can be calculated. The
background probability of any given gene being associated with g is t/N and the probability
of g not being associated with a gene is (1-t/N). Thus the probability of x genes inside C
being associated with g can be calculated using the formula

$$P(x) = \frac{\binom{(t/N)N}{x}\binom{(1 - t/N)N}{r - x}}{\binom{N}{r}}$$

where $\binom{k}{Y}$ is the number of combinations of k items taken Y at a time (Sokal and Rohlf 1995).
The effects of multiple testing are reduced through application of the Bonferroni
correction. This correction scales down the threshold at which p values are judged significant
by dividing it by n, the number of tests performed (Sokal and Rohlf 1995). A procedure
involving the hypergeometric test is common in GO enrichment tools and is also utilised by
ClueGO (Bindea et al. 2009), Gorilla (Eden et al. 2009) and GOEAST (Zheng and Wang
2008) amongst others.
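The hypergeometric calculation and the Bonferroni correction can be sketched with the Python standard library. The symbols N, t, r and x follow the definitions above, with (t/N)N written simply as t. Summing the probabilities for x or more annotated genes to obtain a p value is a common convention for enrichment tests and is assumed here; the numbers in the usage example are invented.

```python
from math import comb


def hypergeom_pmf(x: int, N: int, t: int, r: int) -> float:
    """Probability that exactly x of r sampled genes carry the term."""
    return comb(t, x) * comb(N - t, r - x) / comb(N, r)


def enrichment_p_value(x: int, N: int, t: int, r: int) -> float:
    """Upper-tail probability of observing x or more annotated genes."""
    return sum(hypergeom_pmf(i, N, t, r) for i in range(x, min(t, r) + 1))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Significance threshold after dividing alpha by the number of tests."""
    return alpha / n_tests


# Toy example: 10 background genes, 4 annotated, cluster of 3 with 2 annotated.
p_value = enrichment_p_value(x=2, N=10, t=4, r=3)
print(p_value, p_value < bonferroni_threshold(0.05, 10))
```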
Bingo was run against a background set of genes, which consisted of the full set of
human genes held in Entrez Gene.
5.2.3 Intersection with other data sources
To examine the extent to which the predictions made by the filter in combination with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) intersected with known data,
the predictions were compared to a known set of PPIs. The
I2D database (Brown and Jurisica 2005) was chosen as it contained data from all the other major PPI
databases. Thus the Interologous Interaction Database (I2D) version 1.95 was downloaded.
Predictions were converted from RefSeq GI numbers to their corresponding UniProt
(Apweiler et al. 2010) primary accession. Only Swiss-Prot accessions were used, as these are
high confidence protein molecules that have been manually annotated (Apweiler et al. 2010).
As mentioned above there were 1,142 predictions made between RefSeq GI pairs.
This conversion reduced the set to 278 predictions, as a complete mapping of
RefSeq to UniProt is lacking.

5.3 Results
5.3.1 GO Enrichment
Table 5.1 shows details of the sub-clusters generated by ClusterViz (Cai 2010) ordered in
descending order by size.
Cluster No. | No. of genes | No. of interactions | Genes
--- | --- | --- | ---
1 | 97 | 188 | PRPF31 RRP9 SNRPE SUPT4H1 TNPO1 COPB1 FASN GLRX5 RPS25 WWOX DHDDS EXO1 H2AFY ATP5C1 RPS29 RER1 PHF5A PIGL RPS21 POLR2G PSMD8 TP53RK ABT1 ANAPC10 TCEA2 NOP10 POLR2L SF3B5 LZTR1 TUBGCP2 CDS1 MAK16 CTDSPL RBM34 KIFC1 GFPT1 PPP1CC UBE2D4 BYSL PSMA6 FDX1L TFB1M C20orf118 KIAA1609 UBE2V1 NAPNAPB RLBP1 RPF1 PSMC4 TRAPPC6B RBMX2 RHOC TOP2B UBE2I CDK5 FKBP4 CCT6A CDK7 CKS2 CTDSP1 DIMT1L FAM96B FKBP5 GNB1 GUK1 HSPE1 KIF19 LSM7 POLE2 PSMB1 RIOK2 RPL13 RPL19 RPL30 RPL37A RPS23 SEC11C SLC2A6 SMARCAL1 TRAPPC1 TRAPPC4 TRAPPC6A TXNL4B UBE2V2 VBP1 VPS45 ZDHHC21 ERCC2 SPO11 TRIT1 SHMT2 GDPD1 DOLK DUSP5 LIG1 TRMT112

Table 5.1: Sub-cliques of predicted interactions generated through analysis of clustering
coefficients.

Cluster No. No. of genes No. of interactions
2

Genes
MTMR2 MTM1
MTMR9 MTMR1

10

VAPB ZNF516 STK17A


ZNF225
MAZ ZNF286A ZNF304
DNM1L

NLE1 CDC6 TEP1 GEMIN5

ZRSR2 PPP1CB
RPL31

DERL1 DERL2
DNAJC12

H2AFV ATG4A
UNG

Table 5.1: Sub-cliques of predicted interactions generated through analysis of clustering
coefficients (cont).
To investigate whether GO terms were enriched in the predicted sub-clusters, clusters were
subjected to analysis for GO term enrichment using Bingo (Maere et al. 2005). All terms
were judged significant at p < 0.05 after application of the Bonferroni correction. Table 5.2
presents the clusters and the GO terms enriched in each cluster.

Cluster No.
1

Enriched GO terms
44238 primary metabolic process
44237 cellular metabolic process
8152 metabolic process
44260 cellular macromolecule metabolic process
43170 macromolecule metabolic process
6414 translational elongation
6412 translation
44267 cellular protein metabolic process
6368 RNA elongation from RNA polymerase II promoter
6354 RNA elongation
10467 gene expression
19538 protein metabolic process
44265 cellular macromolecule catabolic process

No significant enrichment

19224 termination of RNA polymerase II transcription
43653 mitochondrial fragmentation during apoptosis

79 regulation of cyclin-dependent protein kinase activity
31981 nuclear lumen

No significant enrichment

30970 retrograde protein transport, ER to cytosol


30433 ER-associated protein catabolic process
6515 misfolded or incompletely synthesized protein catabolic process
6984 ER-nuclear signaling pathway
30176 integral to endoplasmic reticulum membrane
31227 intrinsic to endoplasmic reticulum membrane
51789 response to protein stimulus

Table 5.2: GO enrichment in sub-cliques within predicted interaction network.


Cluster No.
6

Enriched GO terms

31301 integral to organelle membrane


31300 intrinsic to organelle membrane
43161 proteasomal ubiquitin-dependent protein catabolic process
10498 proteasomal protein catabolic process
5789 endoplasmic reticulum membrane
42175 nuclear envelope-endoplasmic reticulum network
44432 endoplasmic reticulum part
43632 modification-dependent macromolecule catabolic process
51603 proteolysis involved in cellular protein catabolic process
19941 modification-dependent protein catabolic process
6511 ubiquitin-dependent protein catabolic process
44257 cellular protein catabolic process
30163 protein catabolic process
9607 response to biotic stimulus
6886 intracellular protein transport
19060 intracellular transport of viral proteins in host cell
30581 intracellular protein transport in host
51708 intracellular protein transport in other organism during symbiotic interaction
15031 protein transport
44265 cellular macromolecule catabolic process
45184 establishment of protein localization
43285 biopolymer catabolic process
8104 protein localization
9057 macromolecule catabolic process
46719 regulation of viral protein levels in host cell
33036 macromolecule localization
12505 endomembrane system
6508 proteolysis
46907 intracellular transport
5783 endoplasmic reticulum

Table 5.2: GO enrichment in sub-cliques within predicted interaction network (cont).

Cluster No.
6

Enriched GO terms
44248 cellular catabolic process
31090 organelle membrane
9056 catabolic process
51649 establishment of localization in cell
42288 MHC class I protein binding
51641 cellular localization
42287 MHC protein binding
30307 positive regulation of cell growth
42221 response to chemical stimulus
45793 positive regulation of cell size
65008 regulation of biological quality
19048 virus-host interaction
45927 positive regulation of growth
51701 interaction with host
7242 intracellular signaling cascade
44419 interspecies interaction between organisms
44404 symbiosis, encompassing mutualism through parasitism
6950 response to stress
6810 transport
51234 establishment of localization
22415 viral reproductive process
16032 viral reproduction
1558 regulation of cell growth
51179 localization
8361 regulation of cell size
44267 cellular protein metabolic process
19538 protein metabolic process
40008 regulation of growth
44260 cellular macromolecule metabolic process
51869 response to stimulus
16021 integral to membrane
7154 cell communication
43170 macromolecule metabolic process
30968 endoplasmic reticulum unfolded protein response

Table 5.2: GO enrichment in sub-cliques within predicted interaction network (cont).


Cluster No.
6

Enriched GO terms
31224 intrinsic to membrane
44446 intracellular organelle part
44422 organelle part
43283 biopolymer metabolic process
7165 signal transduction
44425 membrane part
51706 multi-organism process
22414 reproductive process
8284 positive regulation of cell proliferation
6986 response to unfolded protein

No Significant enrichment.
Table 5.2: GO enrichment in sub-cliques within predicted interaction network (cont).

5.3.2 Intersection with known data


The comparison of the set of protein interactions predicted by constrained ML (Barker et al.
2007; Barker and Pagel 2005) with the I2D database (Brown and Jurisica 2005) was carried
out by determining the intersection between the two sets of interactions. There were 2
predictions in common between the two datasets, which were not self-interactions (there were
9 self interactions in the intersection). These were:
Protein Pair (RefSeq GI numbers) | Evidence of interaction
--- | ---
118600991 13129120 | (Kummel et al. 2008)
21361657 4758304 | (Jessop et al. 2007)


Table 5.2: Intersection between I2D (Brown and Jurisica 2005) and predictions by logistic
regression/constrained ML (Barker et al. 2007; Barker and Pagel 2005).
There are thus 1,131 predictions made by logistic regression/constrained ML (Barker
et al. 2007; Barker and Pagel 2005) that are potentially novel (the 1,142 predictions made
less the 11 found in the intersection). All predictions made can be seen in Appendix C.

5.3.3 Network statistics
The degree distribution of nodes within the graph appears to follow a power law. This could
potentially also be indicative of the correctness of the predictions made by constrained ML.
This pattern is observed in both the full and the reduced graphs as shown in Figures 5.5 and
5.6.

Figure 5.5: Degree distribution for full graph of protein interactions. The line is a power law
of the form $y = ax^{b}$, fitted by least squares regression ($R^{2} = 0.694$).


Figure 5.6: Degree distribution for graph of protein interactions post short branch filtration.
The line is a power law of the form $y = ax^{b}$, fitted by least squares regression
($R^{2} = 0.768$).
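
The kind of fit reported in Figures 5.5 and 5.6 can be illustrated with a minimal sketch. It
assumes that the power law is fitted by ordinary least squares on log-transformed degrees and
frequencies, which is one common way of fitting $y = ax^{b}$; the fitting procedure actually
used is not stated here, and the degree counts in the example are hypothetical.

```python
import math


def fit_power_law(degrees, counts):
    """Fit y = a * x**b by least squares on log(x), log(y).

    Returns (a, b, r_squared), where r_squared is measured in log space.
    """
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent of the power law
    log_a = mean_y - b * mean_x        # intercept in log space
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Hypothetical degree histogram: node degree -> number of nodes with that degree.
example_degrees = [1, 2, 3, 4, 5, 8, 10]
example_counts = [400, 110, 45, 27, 15, 7, 4]
a, b, r_squared = fit_power_law(example_degrees, example_counts)
print(f"a = {a:.1f}, b = {b:.2f}, R^2 = {r_squared:.3f}")
```

A fit in log space weights low-frequency, high-degree nodes more heavily than a direct
non-linear fit would, so the two approaches need not give the same exponent.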

5.4 Discussion
A full genome wide investigation of human protein interactions by constrained ML (Barker et
al. 2007; Barker and Pagel 2005) in combination with the logistic regression-based data filter
seems to be a potentially fruitful source of new protein interactions. The enrichment of GO
terms in some sub-cliques of the resultant network suggests that the system has an ability to
make predictions with some basis in reality and thus a proportion of the set of predictions
made are both novel and accurate.
5.4.1 GO enrichment
GO enrichment was investigated conservatively by excluding the GO evidence code IEA.
This evidence code is associated with 90% of GO annotations (Buza et al. 2008). However
despite removing terms associated with this code as well as terms associated with the codes
ISS, ND and NAS, a reasonable degree of enrichment was still observed.
The enriched terms appear to be associated with processes that are divergent across
eukaryotes, such as transcription (enriched in sub-clique 1) (Coulson and Ouzounis 2003).
This demonstrates that only proteins showing a degree of variability in their distribution
pattern are amenable to this line of investigation.
5.4.2 Intersection with known data
The level of intersection with the I2D database is fairly low. Using the estimate of
interactome size provided by Stumpf et al. (2008), and assuming every interaction in I2D
(Brown and Jurisica 2005) is correct, I2D would correspond to a coverage level of
133,250/650,000 or approximately 20%. Thus the probability of any given accurate prediction
being within this database would be 0.2, and the converse probability of an accurate
prediction not being in the database would be 1 - 0.2 or 0.8.
If every prediction made by the heuristic approach were accurate, then the observed
result of an intersection of 11 and a complement of 1,131 would be highly improbable
($0.8^{1131}$, or approximately 0). The lack of intersection between the two datasets could
be due to the bias in PPI databases towards particular physical detection systems such as
yeast two-hybrid. Approximately 37% of the binary interactions held in HPRD (Mishra et al.
2006) were detected using yeast two-hybrid.
The issue of RefSeq to Uniprot mapping is also pertinent in contributing to this lack
of intersection as over 75% of the predictions were lost post mapping.
Finally it is also unlikely that there is 100% accuracy in all PPIs held in I2D.
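
The coverage argument in this section can be reproduced with a short calculation. The sketch
below uses only figures quoted in the text and keeps its simplifying assumptions (every I2D
interaction is correct and each accurate prediction is independently present in I2D with
probability equal to the coverage). The variable names are illustrative only.

```python
import math

I2D_INTERACTIONS = 133_250       # I2D size used in the text
INTERACTOME_ESTIMATE = 650_000   # estimate of Stumpf et al. (2008)
N_PREDICTIONS = 1_142            # predictions made in this study
N_IN_I2D = 11                    # predictions also found in I2D
N_NOT_IN_I2D = N_PREDICTIONS - N_IN_I2D

# Fraction of the estimated interactome that I2D covers.
coverage = I2D_INTERACTIONS / INTERACTOME_ESTIMATE

# Overlap expected if every prediction were accurate.
expected_overlap = N_PREDICTIONS * coverage

# Probability that 1,131 accurate predictions are all absent from I2D.
# Worked in log10 to show the order of magnitude directly.
log10_p_exact = N_NOT_IN_I2D * math.log10(1.0 - coverage)
log10_p_rounded = N_NOT_IN_I2D * math.log10(0.8)   # coverage rounded to 0.2

print(f"coverage = {coverage:.3f}")
print(f"expected overlap if all predictions were accurate = {expected_overlap:.0f}")
print(f"log10 probability (coverage 0.2) = {log10_p_rounded:.1f}")
```

Under these assumptions roughly 230 of the predictions would be expected in I2D, against
the 11 observed, which is why the explanations discussed above need to be considered.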
5.4.3 Weaknesses
Clearly the result of a precision of 1 as achieved on the training and testing data cannot be
extended to a full genome wide survey. The fact that predictions are made through
comparisons of the phylogenetic distribution of proteins suggests that one weakness of the
method could be an inability to distinguish between paralogs/isoforms and proteins showing
evidence of correlated evolution. However this issue is far from clear-cut as there is evidence
to show that homologous proteins are more likely to interact (Ispolatov et al. 2005; Orlowski
et al. 2007). Thus it is possible that the success of the phylogenetic profile method is partly
based on this observation. This is a potentially confounding issue for the method. However
examination of interactions between predicted orthologous groups can ameliorate this. Of
the 1,142 pairs of proteins predicted to be functionally linked by this study, 221
(approximately 19%) lie within the same orthologous group as identified by the Inparanoid
implementation (Remm et al. 2001).
Thus predictions between members of the same orthologous group are not particularly
widespread over the data examined.
The other weakness of phylogenetic profiling in general that applies to this set of
predictions is potentially inaccurate profiles. Profiles can be inaccurate for a number of
reasons including low coverage sequencing, poor annotation or incorrect assumptions in
homolog identification. The need for the short branch filtration step undertaken before
further analysis is potentially attributable to this phenomenon.
5.4.3.1 Scaling
The precision and sensitivity results observed over the training data were based on a
biologically unrealistic ratio of 10:1 of negative to positive examples of interacting proteins.
The results observed can be adjusted for the whole genome by scaling to a more realistic
ratio. A possibly more realistic ratio can be calculated using estimates of interactome size.
These range from 154,000-369,000 (Hart et al. 2006) to 650,000 (Stumpf et al. 2008). If these
numbers are subtracted from the number of all potential interactions, 560,237,601 (calculated
as all possible pairs from the version of RefSeq held), estimated ratios of negative to positive
examples range from approximately 860:1 to 3,636:1. Assume for the sake of argument that the
ratio of 860:1 is adopted (via an assumption of an interactome size of 650,000). Recall that
the size of the positive set in the training data is 9,161 pairs of known interactions. Thus as
an illustrative example, if a given predictive method yielded a precision of 0.5 and a
sensitivity of 0.05 over the training data, this would correspond to making 916 predictions of
which 50% were correct (TP=458, FP=458 and FN=8703). In order to scale the data the following
numbers need to be calculated:

P(TP)= Probability of predicting a true positive.

P(FP)= Probability of predicting a false positive.

P(FN)= Probability of predicting a false negative.

These numbers can be calculated by the following equations:


\[ P(TP) = \frac{TP}{PS} \qquad (1) \]

\[ P(FP) = \frac{FP}{NS} \qquad (2) \]

\[ P(FN) = \frac{FN}{PS} \qquad (3) \]

Where PS = size of the positive set and NS = size of the negative set.
For the example above P(TP)=458/9161=0.049, P(FP)=458/103971=0.004 and
P(FN)=8703/9161=0.95. Thus by multiplying these probabilities by the estimated full
interactome size (in this case 650,000) the sizes of TP and FN can be calculated over the full
interactome. In order to calculate the size of FP, the size of a potential negatome (pairs of
proteins that do not interact) must be calculated. This can be calculated as the estimated size
of the interactome subtracted from the number of all possible interactions (in this case
560,237,601 - 650,000 = 559,587,601). Given these numbers the values of TP, FP, and FN over the
full interactome for this example would be 31,850, 2,238,216.5 and 617,500 respectively, leading
to a scaled precision of 0.014 and a scaled sensitivity of 0.049 over the whole interactome. In
cases where precision = 1, scaling will not affect this value as there are no false positives
predicted.
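
The scaling procedure of Equations 1 to 3 can be sketched as follows. This is a minimal
illustration and not the code used in this study; the function and argument names are
assumptions made for the example. The sketch carries the rates at full floating point
precision, so its output differs slightly from the figures quoted above, which are based on
rates rounded to 0.049 and 0.004 (the unrounded calculation gives a scaled TP of roughly
32,500 and a scaled precision of approximately 0.013).

```python
def scale_to_interactome(tp, fp, fn, negative_set_size,
                         interactome_size, all_possible_pairs):
    """Scale training-set counts to a whole-interactome estimate."""
    positive_set_size = tp + fn                 # PS
    p_tp = tp / positive_set_size               # Equation 1
    p_fp = fp / negative_set_size               # Equation 2
    p_fn = fn / positive_set_size               # Equation 3

    # Pairs assumed not to interact: all pairs minus the interactome estimate.
    negatome_size = all_possible_pairs - interactome_size

    scaled_tp = p_tp * interactome_size
    scaled_fn = p_fn * interactome_size
    scaled_fp = p_fp * negatome_size
    return {
        "negatome_size": negatome_size,
        "TP": scaled_tp,
        "FP": scaled_fp,
        "FN": scaled_fn,
        "precision": scaled_tp / (scaled_tp + scaled_fp),
        "sensitivity": scaled_tp / (scaled_tp + scaled_fn),
    }


# Illustrative example from the text: TP=458, FP=458, FN=8703, NS=103,971.
scaled = scale_to_interactome(
    tp=458, fp=458, fn=8703,
    negative_set_size=103_971,
    interactome_size=650_000,
    all_possible_pairs=560_237_601,
)
print(f"scaled precision = {scaled['precision']:.3f}")
print(f"scaled sensitivity = {scaled['sensitivity']:.3f}")
```

With no false positives in the training data the scaled FP count is 0 and the scaled
precision remains 1, in line with the final remark above.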
Probabilities of predicted interactions being genuine can also be calculated via an
alternate route applying Bayes' theorem, with the prior probability of an interaction being
derived from an estimate of interactome size. Thus, applying Bayes' theorem, the posterior
probability of an interaction can be calculated using the following parameters (Yang 2006):

P(I) = prior probability of interaction. Calculated by division of the interactome size
estimate by the total number of potential interactions.

P(Pos) = probability of making any positive prediction. Calculated as

\[ P(\mathrm{Pos}) = P(\mathrm{Pos} \mid I)\,P(I) + P(\mathrm{Pos} \mid \sim I)\,P(\sim I) \]

In cases where precision = 1, P(Pos | ~I) = 0. Note that P(Pos | ~I) = P(FP).

P(Pos | I) = probability of making a positive prediction given an interaction. Calculated
as the sensitivity of the method.

Thus the posterior probability of a predicted interaction being genuine can be calculated
using Bayes theorem as presented in Equation 4:

\[ P(I \mid \mathrm{Pos}) = \frac{P(I)\,P(\mathrm{Pos} \mid I)}{P(\mathrm{Pos})} \qquad (4) \]

Bayes' theorem however is only informative in cases where precision < 1, as the posterior
probability is 1 when precision = 1.
This can be simply demonstrated using basic algebra and recasting the terms:

\[ P(I \mid \mathrm{Pos}) = \frac{\mathrm{Prior} \times \mathrm{Sensitivity}}{(\mathrm{Prior} \times \mathrm{Sensitivity}) + P(FP)\,(1 - \mathrm{Prior})} \qquad (5) \]

Thus, as P(FP)(1 - Prior) = 0 when precision = 1, the posterior probability is 1.
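
Equation 5 can be evaluated directly, as in the minimal sketch below. The function name is an
assumption made for illustration. The example reuses the illustrative training-set figures
from the scaling section (TP=458, FP=458, FN=8703, NS=103,971) and a prior derived from an
interactome size of 650,000.

```python
def posterior_interaction(prior, sensitivity, false_positive_rate):
    """Posterior P(I | Pos) following Equation 5.

    prior               -- P(I), interactome size / all possible pairs
    sensitivity         -- P(Pos | I)
    false_positive_rate -- P(Pos | ~I), written P(FP) in the text
    """
    numerator = prior * sensitivity
    p_pos = numerator + false_positive_rate * (1.0 - prior)
    if p_pos == 0.0:
        raise ValueError("the method makes no positive predictions")
    return numerator / p_pos


prior = 650_000 / 560_237_601
posterior = posterior_interaction(
    prior=prior,
    sensitivity=458 / 9161,
    false_positive_rate=458 / 103_971,
)
print(f"prior = {prior:.5f}, posterior = {posterior:.3f}")

# With no false positives (precision = 1) the posterior is exactly 1.
print(posterior_interaction(prior, 458 / 9161, 0.0))
```

For these inputs the posterior is approximately 0.013, the same value obtained by scaling
the counts with unrounded rates, which is expected as both routes apply the same rates to
the same assumed interactome size.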


5.4.4 Conclusions
Given the results observed on the training and testing data and the GO term enrichment
observed in the sub-clusters, as well as the results of previous work on the phylogenetic
profile method (Barker et al. 2007; Barker and Pagel 2005; Bowers et al. 2004; Cokus et al.
2007; Kensche et al. 2008; Pagel et al. 2004b; Pellegrini et al. 1999; Vert 2002) amongst
others, it appears that the method is capable of discerning between proteins that are
functionally linked and proteins that are not. Thus the novel predictions made could
potentially be genuine interactions that are as yet uncharacterised.


Chapter 6
Conclusions and further work
6.1 Summary of Project
The goal of this project has been an investigation into the detection of human protein
interactions using the comparative method. More specifically, the goal was the development of
a novel heuristic approach to allow application of the effective but computationally intensive
constrained ML (Barker et al. 2007; Barker and Pagel 2005) approach to phylogenetic profile
analysis on a genome-wide scale. This application was intended to allow the generation of
novel predictions of protein interactions.
A database of all against all comparisons of the proteomes of 54 eukaryotic organisms
plus 1 archaeon was created. This was used as input to an implementation of the
Inparanoid (Remm et al. 2001) procedure to cluster the contents of the proteomes into
orthologous groups. Using the human proteome as a reference point, phylogenetic profiles
were then constructed for each protein within the human proteome.
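
As a reminder of what a phylogenetic profile is, the sketch below builds presence/absence
strings from orthologous group membership. It is a minimal illustration only: the species
list, protein names and group memberships are hypothetical, and the procedure actually used
to derive profiles from the Inparanoid output is the one described in Chapter 2.

```python
def build_profile(species_order, species_with_ortholog):
    """Return a presence/absence string, one character per species.

    '1' marks species in which an ortholog of the protein was found,
    '0' marks species in which none was found.
    """
    return "".join(
        "1" if species in species_with_ortholog else "0"
        for species in species_order
    )


# Hypothetical four-species example with the reference species listed first.
species_order = ["H. sapiens", "M. musculus", "S. cerevisiae", "P. falciparum"]
orthologs = {
    "protein_A": {"H. sapiens", "M. musculus"},
    "protein_B": {"H. sapiens", "M. musculus", "S. cerevisiae", "P. falciparum"},
    "protein_C": {"H. sapiens", "S. cerevisiae"},
}
profiles = {
    protein: build_profile(species_order, present_in)
    for protein, present_in in orthologs.items()
}
for protein, profile in profiles.items():
    print(protein, profile)
```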
Ten proteins that were universally present in single copies in all organisms under
consideration were then selected through analysis of the phylogenetic profiles and
orthologous groups. The versions of these single copy proteins from each species were then
aligned to create a multiple sequence alignment for each protein. These multiple sequence
alignments were then concatenated to create one single combined alignment. This combined
alignment provides a measure of divergence between the 55 organisms under consideration. The
concatenated multiple sequence alignment was then used to reconstruct a phylogenetic tree of
the 54 eukaryotes under consideration, using the archaeon as an outgroup with which to root
the tree. This phylogeny was broadly congruent with current thought on eukaryotic evolution
(see Chapter 2).


Figure 6.1: Process flow for research carried out in Chapter 2.


In order to use constrained ML (Barker et al. 2007; Barker and Pagel 2005) to analyse
the training data it was necessary to ascertain the optimum rates at which a character could be
gained in order to constrain the models of evolution used by the method (Barker et al. 2007).
In order to do this it was necessary to obtain training data, i.e. examples of protein pairs that
interact and examples of protein pairs that are unlikely to interact. Positive data were based
on protein interactions held within the HPRD database. Negative data were
generated by creating a set of all possible pairs of human proteins. These pairs were filtered
by removing all pairs that possessed any Gene Ontology (GO) (Ashburner et al. 2000) terms
in common. Once these training sets were obtained, different rates of protein gain were
evaluated in terms of precision and sensitivity and an optimum rate of gain of 0.025 was
selected. The highest precision reached by constrained ML at this rate was 1 at a cut-off of
56.37. The sensitivity of the method at this cut-off was 0.000654.
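
The evaluation of a score cut-off in terms of precision and sensitivity can be sketched as
follows. This is a minimal illustration with hypothetical scores and labels. It assumes that
a pair is predicted to interact when its score is greater than or equal to the cut-off; the
exact comparison used in this study is not restated here.

```python
def precision_and_sensitivity(scored_pairs, cutoff):
    """Evaluate a cut-off over (score, is_known_interaction) tuples.

    Returns (precision, sensitivity); either value is None when undefined.
    """
    tp = sum(1 for score, known in scored_pairs if score >= cutoff and known)
    fp = sum(1 for score, known in scored_pairs if score >= cutoff and not known)
    fn = sum(1 for score, known in scored_pairs if score < cutoff and known)
    precision = tp / (tp + fp) if (tp + fp) else None
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    return precision, sensitivity


# Hypothetical likelihood ratio scores for a toy training set.
toy_training_set = [
    (60.1, True), (57.0, True), (30.2, True), (12.5, True),
    (41.0, False), (9.8, False), (3.3, False), (0.4, False),
]
for cutoff in (10.0, 40.0, 56.0):
    precision, sensitivity = precision_and_sensitivity(toy_training_set, cutoff)
    print(cutoff, precision, sensitivity)
```

Raising the cut-off in the toy data increases precision at the cost of sensitivity, which is
the trade-off described above.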
The efficacy of constrained ML (Barker et al. 2007; Barker and Pagel 2005) in
detecting protein-protein interactions was then compared to a comparable high throughput
laboratory based method for detecting interactions using the training data. This method was
examination of gene co-expression in response to given experimental stimuli as measured by
microarrays. The highest performing microarray experiments also achieved a precision of 1.
The highest performing microarray experiment E-MEXP-1224 (Garman et al. 2009) achieved
a sensitivity of 0.003.
Constrained ML (Barker et al. 2007; Barker and Pagel 2005) was also compared to
the PIPs server (McDowall et al. 2009) which uses a semi-naive Bayesian classifier (Scott
and Barton 2007) in order to evaluate multiple sources of evidence for potential protein
interactions. At its highest cut-off the Bayesian classifier achieved a precision of 0.9883721
and a sensitivity of 0.01.

Figure 6.2: Process flow for research described in Chapter 3.


These comparisons showed that constrained ML (Barker et al. 2007; Barker and Pagel
2005), at an optimal level of constraint for rate of gain, achieved levels of precision
comparable to gene co-expression and outperformed the method that integrated multiple
sources of evidence. In terms of sensitivity, however, constrained ML (Barker et al. 2007;
Barker and Pagel 2005) was clearly the worst performer. Nevertheless, given that constrained
ML (Barker et al. 2007; Barker and Pagel 2005) achieved a precision of 1 over the training
data, it was utilised for further analysis.
The application of constrained ML (Barker et al. 2007; Barker and Pagel 2005) to a
full genome-wide survey was found to be impractical due to time considerations. Thus a
heuristic was developed which approximated the ability of constrained ML (Barker et al.
2007; Barker and Pagel 2005) to distinguish between proteins that interact and those that do
not. This heuristic was based on the reconstruction of ancestral states using Dollo parsimony
(Farris 1977) over the phylogenetic tree. Two novel potential heuristics were developed,
implemented and tested using the Dollo parsimonious reconstruction. The first was an
implementation of a test for correlated evolution which calculates the probability of the
concentration of a set of gains and losses of a protein in the areas of a phylogenetic tree
where a second protein was either present or absent (Maddison 1990). The second potential
heuristic was based on logistic regression using empirical counts of the presence, absence,
gain or loss of one protein given the presence, absence, gain or loss of the other as predictor
variables.
The Maddison test (Maddison 1990) based heuristic performed reasonably well in its
own right as a method of detecting functional interactions. It achieved a maximum precision
of 0.857 with a sensitivity of $6.54 \times 10^{-4}$ over the training data at a score cut-off of
0.9999997999999475. However it proved not to be efficient enough in terms of speed to be
justified for use as a heuristic. It also did not maintain an intersection with the 5 predictions
made by constrained ML (Barker et al. 2007; Barker and Pagel 2005) at its optimum rate of
gain (0.025) and at its optimum likelihood ratio (LR) statistic score cut-off (58.3).
Maintenance of an intersection with these predictions was considered a necessary property of
an effective heuristic.
The heuristic that utilised logistic regression achieved a precision of 0.736 with a
sensitivity of 0.01 at its optimum cut-off of 0.967. It also maintained an intersection with the
predictions made by constrained ML (Barker et al. 2007; Barker and Pagel 2005) (at its
optimum rate of gain and LR cut-off) up to a cut-off of 0.85. At a cut-off of 0.85 the heuristic
made 1,230 predictions, which amounted to a reduction of the search space of potential
proteins by 98.9%.
The heuristic based on logistic regression was then applied to the full human genome
in order to filter out protein pairs that displayed little or no evidence of correlated evolution.
The heuristic reduced the size of the search space by 90% over the whole genome.


Figure 6.3: Process flow for research carried out in Chapter 4. Note: Validation sets
were used to validate all methods. The connectors have been left out for clarity.
Once the heuristic had been applied, a full genome-wide survey was launched.
The genome-wide survey found that a large majority of predicted protein
interactions involved proteins that had been lost on short branches in the phylogeny. These
predictions were removed from the overall set of predictions. The prediction set was then
recast as a network of interactions.
The results of the genome-wide survey were then examined by generating sub-networks from
the complete network and examining these sub-networks for
enrichment in Gene Ontology (GO) (Ashburner et al. 2000) terms. GO term enrichment was
found in 57% of the clusters generated. The intersection of the predictions made by
constrained ML (Barker et al. 2007; Barker and Pagel 2005) with the I2D database (Brown
and Jurisica 2005) was also examined. The intersection with the I2D (Brown and Jurisica
2005) database was low suggesting that any correct predictions generated by this project are
also novel predictions of protein interaction. The genome-wide survey yielded a final set of
1,131 predictions of protein interaction.

Figure 6.4: Procedure for research carried out in Chapter 5.

6.1.1 Repeat Analysis


To apply this procedure to a new dataset, the following steps would have to be followed.
Prerequisites needed:

Phylogenetic tree for species of interest.

Phylogenetic profiles for proteins of interest.

Positive and negative examples of protein interaction data. An automated procedure
for the acquisition of training/testing data is found in (Chen et al. 2011).

Having acquired these, the programs BayesTraits (Pagel et al. 2004a) and bms_runner
(Barker et al. 2007) should be downloaded.
To determine the optimum rate of protein gain for use in the constrained ML procedure
bms_runner should be used to evaluate multiple rates of gain. The LR scores for all
proteins for the optimum rate of gain should be kept.

Once this rate is determined, the next step is the ancestral state reconstructions. In order
to carry out these reconstructions it will be necessary to download the program
DOLLOP held in the PHYLIP package (Felsenstein 1989).
DOLLOP should be run with the U option, which will allow it to utilise the
phylogenetic tree. (Note bms_runner uses a NEXUS formatted tree while DOLLOP will
need a PHYLIP format tree). DOLLOP should be run on every profile in the dataset.
Thus the end product of this step is a set of ancestral reconstructions over the tree for
each profile.
At this point code written by the author (available on request) can be used to process
these reconstructions. This code will take in the reconstructions and return a dataset
consisting of the s parameters described in Chapter 4 calculated for each protein.
These data can then be processed using the standard statistical package R (R Development
Core Team 2011) in order to carry out logistic regression. Once regression has been
carried out, this should yield a linear equation for calculating a logit based score for the
probability of interacting.
Again code available from the author can now be utilised. This code will take in the
specified coefficients for the s parameters calculated in R, the Dollo reconstructions of
the proteins, the LR scores of the proteins at the optimum rate of gain and the validation
data and return the optimum logit cut-off for the data for preserving the performance of
constrained ML (Barker et al. 2007; Barker and Pagel 2005).
At this point a dataset of all possible pairs of profiles should be prepared. Code from the
author can be used to apply the linear equation to each of these pairs to calculate the logit
score. These pairs can now be filtered by the optimum cut-off.
Once a reduced set has been created, constrained ML can be applied to this set (Barker
et al. 2007; Barker and Pagel 2005).
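
The filtering step near the end of the procedure above can be sketched as follows. This is
not the author's code, which is available on request; it is a minimal illustration of
applying a fitted linear equation to each pair and filtering by a cut-off. The intercept,
coefficients and predictor values are hypothetical placeholders for the s parameters
described in Chapter 4, and the sketch assumes that the cut-off is applied to the
probability obtained from the logit score.

```python
import math


def interaction_probability(predictors, intercept, coefficients):
    """Convert a linear logit score into a probability of interaction."""
    logit = intercept + sum(c * x for c, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-logit))


def filter_pairs(pair_predictors, intercept, coefficients, cutoff):
    """Keep only pairs whose probability reaches the cut-off."""
    return [
        pair
        for pair, predictors in pair_predictors.items()
        if interaction_probability(predictors, intercept, coefficients) >= cutoff
    ]


# Hypothetical regression output and predictor counts for three protein pairs.
intercept = -4.0
coefficients = [0.8, 0.5]
pair_predictors = {
    ("P1", "P2"): [6, 4],
    ("P1", "P3"): [1, 0],
    ("P2", "P3"): [4, 3],
}
retained = filter_pairs(pair_predictors, intercept, coefficients, cutoff=0.85)
print(retained)
```

Only the retained pairs would then be passed to constrained ML, which is what makes the
genome-wide application tractable.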
6.2 Conclusion
This project has investigated use of the comparative method specifically constrained ML
(Barker et al. 2007; Barker and Pagel 2005) as a means to detect protein-protein interactions.
It has generated a set of predictions that if validated by further laboratory based investigation
could contribute to knowledge about the human interactome. It has also developed a method
that allows the application of the computationally intensive constrained ML (Barker et al.
2007; Barker and Pagel 2005) approach to phylogenetic profiling on a genome-wide scale.
The ability of the comparative method to unearth protein interactions can only be
enhanced by the current rate of data generation given the rapid uptake of next generation
sequencing technologies such as the Roche 454 GS FLX sequencer, the Illumina Genome
Analyser and the Applied Biosystems SOLID sequencer, which can generate gigabases of
sequence data in a matter of days (Mardis 2008). As more organisms are sequenced the
quality of reconstructed phylogenies and consequently the efficacy of the comparative
method in detecting associations between traits should improve due to increased taxon
sampling (Heath et al. 2008).
Given this increased pace of data generation it is also necessary to develop fast and
effective computational techniques for functional annotation of proteins. Detection of protein
interactions can be used to functionally annotate proteins via the principle of guilt by
association (Aravind 2000). Thus the combination of the developed heuristic with
constrained ML (Barker et al. 2007; Barker and Pagel 2005) can contribute to annotation
efforts. It has been seen that this method is not very sensitive; thus the probability of it
making any predictions at all for a given protein is low. However, used in a high throughput
unsupervised context, the method is potentially capable of detecting novel interactions as one
tool amongst many.
Among the methods of detecting protein interactions examined over the course of this
study was the PIPs server (McDowall et al. 2009), which as mentioned above combines
multiple sources of evidence in order to detect potential protein interactions (Scott and Barton
2007) utilising a Bayesian classifier. A similar approach of using combined evidence in a
Bayesian framework was previously taken by Jansen et al. (2003). This combination of
diverse sources of evidence as a means to elucidate protein interactions has also been applied
by Mohamed et al. (2010), utilising a classifier based on a majority vote from a
collection of decision trees. Other approaches such as support vector machines and single
decision trees have also been investigated by Qi et al. (2006).
Potentially the application of constrained ML (Barker et al. 2007; Barker and Pagel
2005) in combination with the heuristic in a genome-wide manner could be utilised as a
source of contributory evidence in a similar framework.

6.3 Future directions
Constrained ML (Barker et al. 2007; Barker and Pagel 2005) has been seen to be capable of
detecting protein-protein interactions at a reasonable level of accuracy. With the data
accumulated over the course of this project there are a number of further avenues of
investigation and areas of extension.
6.3.1 Computational extensions
The procedure followed in order to utilise constrained ML (Barker et al. 2007; Barker and
Pagel 2005) for a genome-wide survey of H. sapiens involved the use of bespoke scripts and
various programs provided by a plethora of authors as cited throughout this text. To facilitate
the application of this tool by other users it will be necessary to create an interface and
combine the functionality of the programs utilised into one computational procedure.
The construction of phylogenetic profiles for all proteins in all species held in the
current dataset and the provision of these profiles online via a web interface would also
facilitate this process. The data generated by this project as presented in Appendix D could
also be presented via an online database, either an extant protein interaction database such as
STRING or I2D or a bespoke database, which would have to be constructed.
6.3.2 Consensus profiles
The application of constrained ML (Barker et al. 2007; Barker and Pagel 2005) to detection
of protein interactions is carried out in a pairwise fashion. Work by Bowers extended the idea
of pairwise comparisons to three way comparisons using Boolean logic operators (Bowers et
al. 2004). This method attempted to detect dependencies in the presence and absence of a
given gene on the presence and absence of two other genes. A similar technique could be
utilised to integrate matching profiles into consensus profiles. By classifying mismatches as
missing information consensus phylogenetic profiles could be constructed to represent groups
of proteins. The program BayesTraits (Pagel et al. 2004a), which is utilised to apply the
constrained ML approach, handles missing data by reconstruction of the missing data as an
extension of ancestral state reconstruction. Thus when a plausible reconstruction is reached at
the immediate ancestral node of the taxon with the missing data the state of the taxa can be
estimated using rate transition parameters (Pagel 1994). Consensus profiles will utilise a
mismatch character X to represent missing information. Thus if for example we compare the
following four species profiles:
1010
1110
The consensus profile of the above two profiles would be:
1X10
In comparisons of consensus profiles the X character will remain unchanged if matched
against another X, shift to 0 if matched against a 0 and shift to 1 if matched against a 1. Thus
a 1 or a 0 in a consensus profile will always be present in more than 50% of its constituent
profiles.
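
The pairwise rules described above can be expressed as a minimal sketch. The function names
are illustrative assumptions; only the rules stated in the text are implemented (a mismatch
between 0 and 1 becomes X, and X takes the value of whatever it is matched against).

```python
MISMATCH = "X"


def merge_symbols(a, b):
    """Combine one position from two profiles.

    Identical symbols are kept, X adopts the symbol it is matched against,
    and a 0/1 mismatch is recorded as X (missing information).
    """
    if a == b:
        return a
    if a == MISMATCH:
        return b
    if b == MISMATCH:
        return a
    return MISMATCH


def consensus(profile_a, profile_b):
    """Return the consensus of two equal-length profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    return "".join(merge_symbols(a, b) for a, b in zip(profile_a, profile_b))


print(consensus("1010", "1110"))   # the example from the text
print(consensus("1X10", "1010"))   # X matched against 0 shifts to 0
```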
Some of these groups will represent clade-specific distributions of proteins. Others
will represent distributions of proteins correlated with the distribution of a given function
over the species under consideration. Comparison of a protein of as yet unascertained
function against consensus profiles would connect the protein to either a clade-specific group
or a group that possessed a function connected to the presence of the protein. Thus a protein
that showed correlated evolution with a consensus profile could potentially be functionally
linked to all constituent members of that profile. At a higher level, if two consensus profiles
show evidence of correlated evolution with each other, this could suggest functional linkage
between two groups of proteins, e.g. the functional interaction of one pathway with another.
6.3.3 Correlated evolution of proteins with the presence or absence of phenotypes
Given the data currently generated an interesting avenue of investigation would be the
comparison of the presence and absence of given phenotypes with the presence and absence
of given proteins. This process can detect proteins that underlie the phenotype of interest.
This method was developed by Levesque et al. (2003) and used to detect genes
associated with cell motility. It was also applied to associating a number of phenotypes with
given proteins (Jim et al. 2004; Slonim et al. 2006). The method was found to be
reasonably effective with traits that were evenly distributed among the organisms under
consideration (Jim et al. 2004). A further application of the method by Gonzalez and Zimmer
examined the association of optimal growth pH with given genotypes (Gonzalez and Zimmer
2008). Gonzalez and Zimmer utilised a threshold with which to discretise continuous
phenotypes (Gonzalez and Zimmer 2008). If the measured value of a phenotype
was over a given value then the phenotype was declared present. Applications of this method
have so far utilised measures like string distance measures (Jim et al. 2004; Levesque et al.
2003) and mutual information (Gonzalez and Zimmer 2008; Slonim et al. 2006) to compare
the phylogenetic profiles of genes and given phenotypes. Use of a phylogenetically aware
method such as constrained ML (Barker et al. 2007; Barker and Pagel 2005) would enhance
the method and potentially yield more accurate results. Given the range of eukaryotic
organisms currently held potential traits to be investigated could include multi-cellularity,
aerobic respiration and parasitism.
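
The phylogenetically naive comparison used in the earlier studies cited above can be
sketched as follows: a continuous phenotype is discretised with a threshold and compared
with a protein profile using mutual information. This is a minimal illustration with
hypothetical values and a hypothetical threshold; it is not a reimplementation of any of the
cited methods, and it ignores the phylogeny, which is the limitation that constrained ML
would address.

```python
import math
from collections import Counter


def discretise(values, threshold):
    """Declare the phenotype present (1) when the value exceeds the threshold."""
    return [1 if value > threshold else 0 for value in values]


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete profiles."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# Hypothetical optimal growth pH for four species and a protein profile.
optimal_ph = [5.5, 7.2, 8.1, 6.0]
phenotype_profile = discretise(optimal_ph, threshold=7.0)
protein_profile = [0, 1, 1, 0]
print(phenotype_profile, mutual_information(phenotype_profile, protein_profile))
```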
6.3.4 Drug Targets
In keeping with the theme of parasitism, there are a number of disease-causing parasitic
organisms in the dataset currently held. These are:

Plasmodium falciparum

Plasmodium knowlesi

Plasmodium yoelii

Trypanosoma brucei

Trypanosoma cruzi

Leishmania major

Trichomonas vaginalis

Theileria annulata

Theileria parva

Encephalitozoon cuniculi

These include T. cruzi and T. brucei, which cause Chagas disease (Lescure et al.
2010) and sleeping sickness (Ralston et al. 2009) respectively. Also included in the dataset
are three members of the malaria-causing genus Plasmodium. Take for example P.
falciparum. There is currently resistance to all five groups of anti-malarial drugs (Hayton and
Su 2004). The detection of protein interactions in P. falciparum could potentially aid in the
development of new anti-malarial drugs. Using this species as a reference point, phylogenetic
profiles for its proteome could be constructed. An application of the logistic regression based
heuristic would make all against all comparisons using constrained ML (Barker et al. 2007;
Barker and Pagel 2005) feasible. These studies could potentially detect novel protein
interactions within P. falciparum. Disruption of protein-protein interactions is potentially
one avenue for drug development. This could potentially be carried out via procedures such
as peptidomimetics (Hruby 1997), which involves the construction of a molecule that mimics
the properties of one of the interacting partners. The construction of phylogenetic profiles
could also reveal proteins and protein interactions that are unique to P. falciparum. These
molecules could potentially be targeted with a lower risk of side effects in the host organism.
A similar procedure could be followed with all other parasitic organisms in the dataset.

204

References
References
Agnarsson I, Miller JA (2008) Is Acctran Better Than Deltran? Cladistics 24:1032
Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA (1997)
Evidence for a Clade of Nematodes, Arthropods and Other Moulting Animals.
Nature 387:489
Ahola V, Aittokallio T, Vihinen M, Uusipaikka E (2006) A Statistical Score for Assessing
the Quality of Multiple Sequence Alignments. BMC Bioinformatics 7:484
Albert VA (2006) Parsimony, Phylogeny, and Genomics. Oxford University Press, Oxford
Alberts B (1998) Essential Cell Biology : An Introduction to the Molecular Biology of the
Cell. Garland, New York
Alberts B (2002) Molecular Biology of the Cell. Garland Science
Alberts B (2008) Molecular Biology of the Cell. Garland Science, New York ; Abingdon
Alberts B (2010) Essential Cell Biology. Garland Science, New York ; London
Alibes A, Yankilevich P, Canada A, Diaz-Uriarte R (2007) Idconverter and Idclight:
Conversion and Annotation of Gene and Protein Ids. BMC Bioinformatics 8
Alon U (2007) An Introduction to Systems Biology Design Principles of Biological
Circuits. Chapman & Hall / CRC
Altenhoff AM, Dessimoz C (2009) Phylogenetic and Functional Assessment of Orthologs
Inference Projects and Methods. PLoS Comput Biol 5:e1000262
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic Local Alignment
Search Tool. J Mol Biol 215:403
Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schaffer AA, Yu YK (2005)
Protein Database Searches Using Compositionally Adjusted Substitution
Matrices. FEBS Journal 272:5101
C. elegans Sequencing Consortium (1998) Genome Sequence of the Nematode C. Elegans:
A Platform for Investigating Biology. Science 282:2012
Antonov AV, Mewes HW (2008) Complex Phylogenetic Profiling Reveals Fundamental
Genotype-Phenotype Associations. Computational Biology and Chemistry 32:412
Apweiler R, Martin MJ, O'Donovan C, Magrane M, Alam-Faruque Y, Antunes R, Barrell D,
Bely B, Bingley M, Binns D, Bower L, Browne P, Chan WM, Dimmer E, Eberhardt R,
Fedotov A, Foulger R, Garavelli J, Huntley R, Jacobsen J, Kleen M, Laiho K, Leinonen R,
Legge D, Lin Q, Liu WD, Luo J, Orchard S, Patient S, Poggioli D, Pruess M, Corbett M, di
Martino G, Donnelly M, van Rensburg P, Bairoch A, Bougueleret L, Xenarios I, Altairac S,
Auchincloss A, Argoud-Puy G, Axelsen K, Baratin D, Blatter MC, Boeckmann B, Bolleman
J, Bollondi L, Boutet E, Quintaje SB, Breuza L, Bridge A, deCastro E, Ciapina L, Coral D,
Coudert E, Cusin I, Delbard G, Doche M, Dornevil D, Roggli PD, Duvaud S, Estreicher A,
Famiglietti L, Feuermann M, Gehant S, Farriol-Mathis N, Ferro S, Gasteiger E, Gateau A,
Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hulo N, James J, Jimenez S,
Jungo F, Kappler T, Keller G, Lachaize C, Lane-Guermonprez L, Langendijk-Genevaux P,
Lara V, Lemercier P, Lieberherr D, Lima TD, Mangold V, Martin X, Masson P, Moinat M,
Morgat A, Mottaz A, Paesano S, Pedruzzi I, Pilbout S, Pillet V, Poux S, Pozzato M, Redaschi
N, Rivoire C, Roechert B, Schneider M, Sigrist C, Sonesson K, Staehli S, Stanley E, Stutz A,
Sundaram S, Tognolli M, Verbregue L, Veuthey AL, Yip LN, Zuletta L, Wu C, Arighi C,
Arminski L, Barker W, Chen CM, Chen YX, Hu ZZ, Huang HZ, Mazumder R, McGarvey P,
Natale DA, Nchoutmboube J, Petrova N, Subramanian N, Suzek BE, Ugochukwu U,
Vasudevan S, Vinayaka CR, Yeh LS, Zhang J (2010) The Universal Protein Resource
(Uniprot) in 2010. Nucleic Acids Research 38:D142
Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M,
Ghanbarian AT, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M,
Michaut M, Montecchi-Palazzi L, Neuhauser SN, Orchard S, Perreau V, Roechert B,
van Eijk K, Hermjakob H (2010) The Intact Molecular Interaction Database in
2010. Nucleic Acids Res 38:D525
Aravind L (2000) Guilt by Association: Contextual Information in Genome Analysis.
Genome Research 10:1074
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K,
Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S,
Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene
Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium.
Nat Genet 25:25
Aubourg S, Rouze P (2001) Genome Annotation. Plant Physiology and Biochemistry
39:181
Avery OT, Macleod CM, McCarty M (1944) Studies on the Chemical Nature of the
Substance Inducing Transformation of Pneumococcal Types : Induction of
Transformation by a Desoxyribonucleic Acid Fraction Isolated from
Pneumococcus Type Iii. J Exp Med 79:137
Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW (2001) Bind--the
Biomolecular Interaction Network Database. Nucleic Acids Res 29:242
Baldauf SL, Roger AJ, Wenk-Siefert I, Doolittle WF (2000) A Kingdom-Level Phylogeny
of Eukaryotes Based on Combined Protein Data. Science 290:972
Baldi P, Brunak S (2001) Bioinformatics : The Machine Learning Approach. MIT Press,
Cambridge, Mass.
Barker D, Meade A, Pagel M (2007) Constrained Models of Evolution Lead to Improved
Prediction of Functional Linkage from Correlated Gain and Loss of Genes.
Bioinformatics 23:14
Barker D, Pagel M (2005) Predicting Functional Gene Links from Phylogenetic-Statistical
Analyses of Whole Genomes. PLoS Comput Biol 1:e3
Beadle GW, Tatum EL (1941) Genetic Control of Biochemical Reactions in Neurospora.
Proc Natl Acad Sci U S A 27:499
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2009) Genbank. Nucleic
Acids Res 37:D26
Berg JM, Tymoczko JL, Stryer L (2001) Biochemistry. W. H. Freeman and Co., New York
Berg JM, Tymoczko JL, Stryer L (2007) Biochemistry. W. H. Freeman, New York
Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, Fridman WH,
Pages F, Trajanoski Z, Galon J (2009) Cluego: A Cytoscape Plug-in to Decipher
Functionally Grouped Gene Ontology and Pathway Annotation Networks.
Bioinformatics 25:1091
Birney E, Clamp M, Durbin R (2004) Genewise and Genomewise. Genome Res 14:988
Black DL (2003) Mechanisms of Alternative Pre-Messenger Rna Splicing. Annu Rev
Biochem 72:291
Blair C, Murphy RW (2011) Recent Trends in Molecular Phylogenetic Analysis: Where
to Next? J Hered 102:130
Blanchard JL, Lynch M (2000) Organellar Genes - Why Do They End up in the Nucleus?
Trends in Genetics 16:315
Blow MJ (2004) A Survey of RNA Editing in the Human Brain. Sanger Institute,
University of Cambridge, Cambridge
Borodovsky M, Rudd KE, Koonin EV (1994) Intrinsic and Extrinsic Approaches for
Detecting Genes in a Bacterial Genome. Nucleic Acids Res 22:4756
Bowers PM, Cokus SJ, Eisenberg D, Yeates TO (2004) Use of Logic Relationships to
Decipher Protein Network Organization. Science 306:2246
Bratke K (2009) Comparative Analysis of Poxvirus Genome Evolution. University of
Dublin, Trinity College, Dublin
Breathnach R, Benoist C, O'Hare K, Gannon F, Chambon P (1978) Ovalbumin Gene:
Evidence for a Leader Sequence in mRNA and DNA Sequences at the Exon-Intron
Boundaries. Proc Natl Acad Sci U S A 75:4853
Brennan RG, Matthews BW (1989) The Helix-Turn-Helix DNA Binding Motif. J Biol
Chem 264:1903
Brent MR (2008) Steady Progress and Recent Breakthroughs in the Accuracy of
Automated Genome Annotation. Nat Rev Genet 9:62
Brown KR, Jurisica I (2005) Online Predicted Human Interaction Database.
Bioinformatics 21:2076
Brown TA (2006) Genomes 3. Garland Science Pub., New York
Bruno WJ, Halpern AL (1999) Topological Bias and Inconsistency of Maximum
Likelihood Using Wrong Models. Molecular Biology and Evolution 16:564
Burge C, Karlin S (1997) Prediction of Complete Gene Structures in Human Genomic
DNA. Journal of Molecular Biology 268:78
Burki F, Shalchian-Tabrizi K, Pawlowski J (2008) Phylogenomics Reveals a New
'Megagroup' Including Most Photosynthetic Eukaryotes. Biology Letters 4:366
Buza TJ, McCarthy FM, Wang N, Bridges SM, Burgess SC (2008) Gene Ontology
Annotation Quality Analysis in Model Eukaryotes. Nucleic Acids Research
36:e12
Cai JC, G. Wang, J (2010) ClusterViz: A Cytoscape Plugin for Graph Clustering and
Visualization. Central South University, Changsha
Camin JH, Sokal RR (1965) A Method for Deducing Branching Sequences in Phylogeny.
Evolution 19:311
Capecchi MR (2005) Gene Targeting in Mice: Functional Analysis of the Mammalian
Genome for the Twenty-First Century. Nat Rev Genet 6:507
Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T (2009) Trimal: A Tool for Automated
Alignment Trimming in Large-Scale Phylogenetic Analyses. Bioinformatics
25:1972
Cavalli-Sforza LL, Edwards AWF (1964) Reconstruction of Evolutionary Trees.
Phenetic and Phylogenetic Classification 6:67
Ceol A, Aryamontri AC, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G
(2010) Mint, the Molecular Interaction Database: 2009 Update. Nucleic Acids
Research 38:D532
Chalfie M, Tu Y, Euskirchen G, Ward WW, Prasher DC (1994) Green Fluorescent Protein
as a Marker for Gene-Expression. Science 263:802
Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni
G (2007) Mint: The Molecular Interaction Database. Nucleic Acids Res 35:D572
Chen LF, Vitkup D (2006) Predicting Genes for Orphan Metabolic Activities Using
Phylogenetic Profiles. Genome Biology 7:R17
Chen XW, Jeong JC, Dermyer P (2011) Kups: Constructing Datasets of Interacting and
Non-Interacting Protein Pairs with Associated Attributions. Nucleic Acids Res
39:D750
Coin F, Marinoni JC, Rodolfo C, Fribourg S, Pedrini AM, Egly JM (1998) Mutations in the
Xpd Helicase Gene Result in Xp and Ttd Phenotypes, Preventing Interaction
between Xpd and the P44 Subunit of Tfiih. Nature Genetics 20:184
Cokus S, Mizutani S, Pellegrini M (2007) An Improved Method for Identifying
Functionally Linked Proteins Using Phylogenetic Profiles. BMC Bioinformatics
8:S7
Coulson RMR, Ouzounis CA (2003) The Phylogenetic Diversity of Eukaryotic
Transcription. Nucleic Acids Res 31:653
Cranston KA, Hurwitz B, Ware D, Stein L, Wing RA (2009) Species Trees from Highly
Incongruent Gene Trees in Rice. Systematic Biology 58:489
Cranston KA, Rannala B (2007) Summarizing a Posterior Distribution of Trees Using
Agreement Subtrees. Systematic Biology 56:578
Crick FH, Barnett L, Brenner S, Watts-Tobin RJ (1961) General Nature of the Genetic
Code for Proteins. Nature 192:1227
Cunningham FX, Lafond TP, Gantt E (2000) Evidence of a Role for Lytb in the
Nonmevalonate Pathway of Isoprenoid Biosynthesis. Journal of Bacteriology
182:5841
Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M (2004) The
Ensembl Automatic Gene Annotation System. Genome Res 14:942
Dandekar T, Snel B, Huynen M, Bork P (1998) Conservation of Gene Order: A
Fingerprint of Proteins That Physically Interact. Trends Biochem Sci 23:324
Davey R, Savva G, Dicks J, Roberts IN (2007) Mpp: A Microarray-to-Phylogeny Pipeline
for Analysis of Gene and Marker Content Datasets. Bioinformatics 23:1023
Dayhoff MO, Schwartz RM, Orcutt BC (1978) A Model of Evolutionary Change in
Proteins. Atlas of Protein Sequence and Structure 5:345
De Bodt S, Proost S, Vandepoele K, Rouze P, Van de Peer Y (2009) Predicting
Protein-Protein Interactions in Arabidopsis Thaliana through Integration of Orthology,
Gene Ontology and Co-Expression. BMC Genomics 10:288
De Las Rivas J, Fontanillo C (2010) Protein Protein Interactions Essentials: Key
Concepts to Building and Analyzing Interactome Networks. PLoS Comput Biol
6:e1000807
Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S,
Lefort V, Lescot M, Claverie JM, Gascuel O (2008) Phylogeny.Fr: Robust
Phylogenetic Analysis for the Non-Specialist. Nucleic Acids Research 36:W465
Dowsey AW, Dunn MJ, Yang GZ (2003) The Role of Bioinformatics in Two-Dimensional
Gel Electrophoresis. Proteomics 3:1567
Durbin R (1998) Biological Sequence Analysis : Probabilistic Models of Proteins and
Nucleic Acids. Cambridge University Press, Cambridge New York
Eddy SR (1998) Profile Hidden Markov Models. Bioinformatics 14:755
Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z (2009) Gorilla: A Tool for Discovery
and Visualization of Enriched Go Terms in Ranked Gene Lists. BMC
Bioinformatics 10:48
Edgar RC (2004) Muscle: Multiple Sequence Alignment with High Accuracy and High
Throughput. Nucleic Acids Res 32:1792
Edgar RC, Batzoglou S (2006) Multiple Sequence Alignment. Curr Opin Struct Biol 16:368
Edgell DR, Belfort M, Shub DA (2000) Barriers to Intron Promiscuity in Bacteria. J
Bacteriol 182:5281
Edwards AWF (1992) Likelihood. Johns Hopkins University Press, Baltimore ; London
Elias I, Tuller T (2007) Reconstruction of Ancestral Genomic Sequences Using
Likelihood. Journal of Computational Biology 14:216
Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999) Protein Interaction Maps for
Complete Genomes Based on Gene Fusion Events. Nature 402:86
Farrar M (2007) Striped Smith-Waterman Speeds Database Searches Six Times over
Other Simd Implementations. Bioinformatics 23:156
Farris JS (1977) Phylogenetic Analysis under Dollo's Law. Systematic Zoology 26:77
Farris JS (1978) Inferring Phylogenetic Trees from Chromosome Inversion Data.
Systematic Zoology 27:275
Felsenstein J (1973) Maximum Likelihood and Minimum-Steps Methods for Estimating
Evolutionary Trees from Data on Discrete Characters. Systematic Zoology 22:240
Felsenstein J (1978) Cases in Which Parsimony or Compatibility Methods Will Be
Positively Misleading. Syst Zool 27:401
Felsenstein J (1979) Alternative Methods of Phylogenetic Inference and Their
Interrelationship. Systematic Zoology 28:49
Felsenstein J (1985a) Confidence-Limits on Phylogenies - an Approach Using the
Bootstrap. Evolution 39:783
Felsenstein J (1985b) Phylogenies and the Comparative Method. The American Naturalist
125:1
Felsenstein J (1989) Phylip - Phylogeny Inference Package (Version 3.2). Cladistics 5:164
Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Sunderland, Mass.
Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J, Min Jou W,
Molemans F, Raeymaekers A, Van den Berghe A, Volckaert G, Ysebaert M (1976)
Complete Nucleotide Sequence of Bacteriophage Ms2 Rna: Primary and
Secondary Structure of the Replicase Gene. Nature 260:500
Fitch WM (1970) Distinguishing Homologous from Analogous Proteins. Syst Zool 19:99
Fitch WM (1971) Toward Defining Course of Evolution - Minimum Change for a
Specific Tree Topology. Syst Zool 20:406
Fitch WM (2000) Homology a Personal View on Some of the Problems. Trends Genet
16:227
Fitzpatrick DA, Logue ME, Stajich JE, Butler G (2006) A Fungal Phylogeny Based on 42
Complete Genomes Derived from Supertree and Combined Gene Analysis. BMC
Evol Biol 6:99
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W (1998) A Computer Program for
Aligning a Cdna Sequence with a Genomic DNA Sequence. Genome Res 8:967
Fu N, Drinnenberg I, Kelso J, Wu JR, Paabo S, Zeng R, Khaitovich P (2007) Comparison of
Protein and Mrna Expression Evolution in Humans and Chimpanzees. PLoS One
2:e216
Garman KS, Acharya CR, Edelman E, Grade M, Gaedcke J, Sud S, Barry W, Diehl AM,
Provenzale D, Ginsburg GS, Ghadimi BM, Ried T, Nevins JR, Mukherjee S, Hsu D,
Potti A (2009) A Genomic Approach to Colon Cancer Risk Stratification Yields
Biologic Insights into Therapeutic Opportunities (Vol 105, 19432, 2008).
Proceedings of the National Academy of Sciences of the United States of America
106:6878
Garrett S, Barton WA, Knights R, Jin P, Morgan DO, Fisher RP (2001) Reciprocal
Activation by Cyclin-Dependent Kinases 2 and 7 Is Directed by Substrate
Specificity Determinants Outside the T Loop. Molecular and Cellular Biology
21:88
Gaschen B, Taylor J, Yusim K, Foley B, Gao F, Lang D, Novitsky V, Haynes B, Hahn BH,
Bhattacharya T, Korber B (2002) Aids - Diversity Considerations in Hiv-1 Vaccine
Selection. Science 296:2354
Ge XJ, Yamamoto S, Tsutsumi S, Midorikawa Y, Ihara S, Wang SM, Aburatani H (2005)
Interpreting Expression Profiles of Cancers by Genome-Wide Survey of Breadth
of Expression in Normal Tissues. Genomics 86:127
Gillis B, Gavin IM, Arbieva Z, King ST, Jayaraman S, Prabhakar BS (2007) Identification
of Human Cell Responses to Benzene and Benzene Metabolites. Genomics 90:324
Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel
JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin
H, Oliver SG (1996) Life with 6000 Genes. Science 274:546
Goldman N, Anderson JP, Rodrigo AG (2000) Likelihood-Based Tests of Topologies in
Phylogenetics. Systematic Biology 49:652
Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Kallersjo M, Farris JS (2009)
Phylogenetic Analysis of 73 060 Taxa Corroborates Major Eukaryotic Groups.
Cladistics 25:211
Gonzalez O, Zimmer R (2008) Assigning Functional Linkages to Proteins Using
Phylogenetic Profiles and Continuous Phenotypes. Bioinformatics 24:1257
Grafen A (1989) The Phylogenetic Regression. Philosophical Transactions of the Royal
Society of London Series B-Biological Sciences 326:119
Graur D, Shuali Y, Li WH (1989) Deletions in Processed Pseudogenes Accumulate Faster
in Rodents Than in Humans. Journal of Molecular Evolution 28:279
Griffiths AJF (2002) Modern Genetic Analysis : Integrating Genes and Genomes. W.H.
Freeman and Co., New York
Guindon S, Gascuel O (2003) A Simple, Fast, and Accurate Algorithm to Estimate Large
Phylogenies by Maximum Likelihood. Syst Biol 52:696
Gygi SP, Rochon Y, Franza BR, Aebersold R (1999) Correlation between Protein and
Mrna Abundance in Yeast. Molecular and Cellular Biology 19:1720
Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL (2007) All Duplicates Are
Not Equal: The Difference between Small-Scale and Genome Duplication.
Genome Biology 8:R209
Hamming RW (1950) Error Detecting and Error Correcting Codes. Bell System
Technical Journal 29:147
Han JDJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM,
Cusick ME, Roth FP, Vidal M (2004) Evidence for Dynamically Organized
Modularity in the Yeast Protein-Protein Interaction Network. Nature 430:88
Harrison CJ, Langdale JA (2006) A Step by Step Guide to Phylogeny Reconstruction.
Plant J 45:561
Hart GT, Ramani AK, Marcotte EM (2006) How Complete Are Current Yeast and
Human Protein-Interaction Networks? Genome Biol 7:120
Harvey PH, Pagel MD (1991) The Comparative Method in Evolutionary Biology.
Oxford University Press, Oxford
Hasegawa H, Holm L (2009) Advances and Pitfalls of Protein Structural Alignment.
Curr Opin Struct Biol 19:341
Hasegawa M, Kishino H (1989) Confidence-Limits on the Maximum-Likelihood Estimate
of the Hominoid Tree from Mitochondrial-DNA Sequences. Evolution 43:672
Haw R, Hermjakob H, D'Eustachio P, Stein L (2011) Reactome Pathway Analysis to
Enrich Biological Discovery in Proteomics Datasets. Proteomics 11:3598
Hayton K, Su XZ (2004) Genetic and Biochemical Aspects of Drug Resistance in Malaria
Parasites. Curr Drug Targets Infect Disord 4:1
He HY, Soncin F, Grammatikakis N, Li YL, Siganou A, Gong JL, Brown SA, Kingston RE,
Calderwood SK (2003) Elevated Expression of Heat Shock Factor (Hsf) 2a
Stimulates Hsf1-Induced Transcription During Stress. Journal of Biological
Chemistry 278:35465
Heath TA, Hedtke SM, Hillis DM (2008) Taxon Sampling and the Accuracy of
Phylogenetic Analyses. Journal of Systematics and Evolution 46:239
Henikoff S, Henikoff JG (1992) Amino Acid Substitution Matrices from Protein Blocks.
Proc Natl Acad Sci U S A 89:10915
Hershey AD, Chase M (1952) Independent Functions of Viral Protein and Nucleic Acid
in Growth of Bacteriophage. J Gen Physiol 36:39
Hert DG, Fredlake CP, Barron AE (2008) Advantages and Limitations of Next-Generation
Sequencing Technologies: A Comparison of Electrophoresis and Non-Electrophoresis
Methods. Electrophoresis 29:4618
Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring Expression Data: Identification and
Analysis of Coexpressed Genes. Genome Research 9:1106
Higgins DG, Sharp PM (1988) Clustal: A Package for Performing Multiple Sequence
Alignment on a Microcomputer. Gene 73:237
Hill J, Hambley M, Forster T, Mewissen M, Sloan TM, Scharinger F, Trew A, Ghazal P
(2008) Sprint: A New Parallel Framework for R. BMC Bioinformatics 9
Hobolth A, Christensen OF, Mailund T, Schierup MH (2007) Genomic Relationships and
Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a
Coalescent Hidden Markov Model. PLoS Genet 3:e7
Hodges A, Strand AD, Aragaki AK, Kuhn A, Sengstag T, Hughes G, Elliston LA, Hartog C,
Goldstein DR, Thu D, Hollingsworth ZR, Collin F, Synek B, Holmans PA, Young
AB, Wexler NS, Delorenzi M, Kooperberg C, Augood SJ, Faull RL, Olson JM, Jones
L, Luthi-Carter R (2006) Regional and Cellular Gene Expression Changes in
Human Huntington's Disease Brain. Hum Mol Genet 15:965
Holder M, Lewis PO (2003) Phylogeny Estimation: Traditional and Bayesian
Approaches. Nature Reviews Genetics 4:275
Hruby VJ (1997) Prospects for Peptidomimetic Drug Design. Drug Discovery Today 2:165
Huai Q, Kim HY, Liu YD, Zhao YD, Mondragon A, Liu JO, Ke HM (2002) Crystal
Structure of Calcineurin-Cyclophilin-Cyclosporin Shows Common but Distinct
Recognition of Immunophilin-Drug Complexes. Proceedings of the National
Academy of Sciences of the United States of America 99:12037
Huelsenbeck JP, Bollback JP (2001) Empirical and Hierarchical Bayesian Estimation of
Ancestral States. Systematic Biology 50:351
Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP (2001) Bayesian Inference of
Phylogeny and Its Impact on Evolutionary Biology. Science 294:2310
Hughes AL, Friedman R (2005) Poxvirus Genome Evolution by Gene Gain and Loss.
Molecular Phylogenetics and Evolution 35:186
Hulsen T, Huynen MA, de Vlieg J, Groenen PM (2006) Benchmarking Ortholog
Identification Methods Using Functional Genomics Data. Genome Biol 7:R31
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol
2:E206
Ispolatov I, Yuryev A, Mazo I, Maslov S (2005) Binding Properties and Evolution of
Homodimers in Protein-Protein Interaction Networks. Nucleic Acids Res 33:3629
Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M,
Greenblatt JF, Gerstein M (2003) A Bayesian Networks Approach for Predicting
Protein-Protein Interactions from Genomic Data. Science 302:449
Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A,
Simonovic M, Bork P, von Mering C (2009) String 8-a Global View on Proteins
and Their Functional Interactions in 630 Organisms. Nucleic Acids Research
37:D412
Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and Centrality in Protein
Networks. Nature 411:41
Jessop CE, Chakravarthi S, Garbi N, Hammerling GJ, Lovell S, Bulleid NJ (2007) Erp57 Is
Essential for Efficient Folding of Glycoproteins Sharing Common Structural
Domains. EMBO J 26:28
Jim K, Parmar K, Singh M, Tavazoie S (2004) A Cross-Genomic Approach for Systematic
Mapping of Phenotypic Traits to Genes. Genome Research 14:109
Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt
EE, Stoughton R, Shoemaker DD (2003) Genome-Wide Survey of Human
Alternative Pre-Mrna Splicing with Exon Junction Microarrays. Science
302:2141
Kanehisa M (1997) A Database for Post-Genome Analysis. Trends Genet 13:375
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T,
Araki M, Hirakawa M (2006) From Genomics to Chemical Genomics: New
Developments in Kegg. Nucleic Acids Res 34:D354
Karlin S, Altschul SF (1990) Methods for Assessing the Statistical Significance of
Molecular Sequence Features by Using General Scoring Schemes. Proc Natl Acad
Sci U S A 87:2264
Katoh K, Misawa K, Kuma K, Miyata T (2002) Mafft: A Novel Method for Rapid
Multiple Sequence Alignment Based on Fast Fourier Transform. Nucleic Acids
Res 30:3059
Kawaji H, Hayashizaki Y (2008) Genome Annotation. Methods Mol Biol 452:125
Keane TM, Creevey CJ, Pentony MM, Naughton TJ, McInerney JO (2006) Assessment of
Methods for Amino Acid Matrix Selection and Their Use on Empirical Data
Shows That Ad Hoc Assumptions for Choice of Matrix Are Not Justified. BMC
Evolutionary Biology 6
Kensche PR, van Noort V, Dutilh BE, Huynen MA (2008) Practical and Theoretical
Advances in Predicting the Function of a Protein by Its Phylogenetic
Distribution. J R Soc Interface 5:151
Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R (2004) The
International Protein Index: An Integrated Database for Proteomics
Experiments. Proteomics 4:1985
Kim IY, Shin JH, Seong JK (2010) Mouse Phenogenomics, Toolbox for Functional
Annotation of Human Genome. BMB Rep 43:79
Kim MJ, Romero R, Kim CJ, Tarca AL, Chhauy S, LaJeunesse C, Lee DC, Draghici S,
Gotsch F, Kusanovic JP, Hassan SS, Kim JS (2009) Villitis of Unknown Etiology Is
Associated with a Distinct Pattern of Chemokine up-Regulation in the
Feto-Maternal and Placental Compartments: Implications for Conjoint Maternal
Allograft Rejection and Maternal Anti-Fetal Graft-Versus-Host Disease. Journal
of Immunology 182:3919
Knight RD, Landweber LF, Yarus M (2001) How Mitochondria Redefine the Code. J Mol
Evol 53:299
Knowles DG, McLysaght A (2009) Recent De Novo Origin of Human Protein-Coding
Genes. Genome Res 19:1752
Korf I (2004) Gene Finding in Novel Genomes. BMC Bioinformatics 5:59
Koshi JM, Goldstein RA (1996) Probabilistic Reconstruction of Ancestral Protein
Sequences. Journal of Molecular Evolution 42:313
Krane DE, Raymer ML (2003) Fundamental Concepts of Bioinformatics. Pearson
Education International, San Francisco
Krylov DM, Wolf YI, Rogozin IB, Koonin EV (2003) Gene Loss, Protein Sequence
Divergence, Gene Dispensability, Expression Level, and Interactivity Are
Correlated in Eukaryotic Evolution. Genome Research 13:2229
Kuhner MK, Felsenstein J (1994) Simulation Comparison of Phylogeny Algorithms under
Equal and Unequal Evolutionary Rates. Mol Biol Evol 11:459
Kummel D, Oeckinghaus A, Wang C, Krappmann D, Heinemann U (2008) Distinct
Isocomplexes of the Trapp Trafficking Factor Coexist inside Human Cells. FEBS
Lett 582:3729
Lande J, Gimino V, Berryman T, Hertz MI, King RA (2003) Gene Expression Profiling of
Bronchoalveolar Lavage Cells in Acute Lung Rejection. American Journal of
Human Genetics 73:421
Le SQ, Gascuel O (2008) An Improved General Amino Acid Replacement Matrix. Mol
Biol Evol 25:1307
Lei PW, Koehly LM (2003) Linear Discriminant Analysis Versus Logistic Regression: A
Comparison of Classification Errors in the Two-Group Case. Journal of
Experimental Education 72:25
Lequesne WJ (1974) Uniquely Evolved Character Concept and Its Cladistic Application.
Systematic Zoology 23:513
Lescure FX, Le Loup G, Freilij H, Develoux M, Paris L, Brutus L, Pialoux G (2010) Chagas
Disease: Changes in Knowledge and Management. Lancet Infectious Diseases
10:556
Levesque M, Shasha D, Kim W, Surette MG, Benfey PN (2003) Trait-to-Gene: A
Computational Method for Predicting the Function of Uncharacterized Genes.
Current Biology 13:129
Lewinski MK, Bisgrove D, Shinn P, Chen H, Hoffmann C, Hannenhalli S, Verdin E, Berry
CC, Ecker JR, Bushman FD (2005) Genome-Wide Analysis of Chromosomal
Features Repressing Human Immunodeficiency Virus Transcription. Journal of
Virology 79:6610
Li L, Stoeckert CJ, Jr., Roos DS (2003) Orthomcl: Identification of Ortholog Groups for
Eukaryotic Genomes. Genome Res 13:2178
Li M, Wang JX, Chen J (2008) A Fast Agglomerate Algorithm for Mining Functional
Modules in Protein Interaction Networks. BMEI 2008: Proceedings of the
International Conference on Biomedical Engineering and Informatics, Vol 1:3
Linial M (2003) How Incorrect Annotations Evolve - the Case of Short Orfs. Trends in
Biotechnology 21:298
Lunter G, Ponting CP, Hein J (2006) Genome-Wide Identification of Human Functional
DNA Using a Neutral Indel Model. PLoS Comput Biol 2:2
Macagno A, Molteni M, Rinaldi A, Bertoni F, Lanzavecchia A, Rossetti C, Sallusto F (2006)
A Cyanobacterial Lps Antagonist Prevents Endotoxin Shock and Blocks
Sustained Tlr4 Stimulation Required for Cytokine Expression. Journal of
Experimental Medicine 203:1481
Maddison WP (1990) A Method for Testing the Correlated Evolution of Two Binary
Characters - Are Gains or Losses Concentrated on Certain Branches of a
Phylogenetic Tree. Evolution 44:539
Maddison WP, Maddison DR (2010) Mesquite: A Modular System for Evolutionary
Analysis. Version 2.73
Maere S, Heymans K, Kuiper M (2005) Bingo: A Cytoscape Plugin to Assess
Overrepresentation of Gene Ontology Categories in Biological Networks.
Bioinformatics 21:3448
Malcolm BA, Wilson KP, Matthews BW, Kirsch JF, Wilson AC (1990) Ancestral
Lysozymes Reconstructed, Neutrality Tested, and Thermostability Linked to
Hydrocarbon Packing. Nature 345:86
Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D (1999) Detecting
Protein Function and Protein-Protein Interactions from Genome Sequences.
Science 285:751
Mardis ER (2008) The Impact of Next-Generation Sequencing Technology on Genetics.
Trends Genet 24:133
Martin DM, Berriman M, Barton GJ (2004) Gotcha: A New Method for Prediction of
Protein Function Assessed by the Annotation of Seven Genomes. BMC
Bioinformatics 5:178
Maston GA, Evans SK, Green MR (2006) Transcriptional Regulatory Elements in the
Human Genome. Annual Review of Genomics and Human Genetics 7:29
Maxam AM, Gilbert W (1977) New Method for Sequencing DNA. Proc Natl Acad Sci U S
A 74:560
McDowall MD, Scott MS, Barton GJ (2009) Pips: Human Protein-Protein Interaction
Prediction Database. Nucleic Acids Res 37:D651
McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu YT, Tsung EF, Clouser CR,
Duncan C, Ichikawa JK, Lee CC, Zhang Z, Ranade SS, Dimalanta ET, Hyland FC,
Sokolsky TD, Zhang L, Sheridan A, Fu HN, Hendrickson CL, Li B, Kotler L, Stuart
JR, Malek JA, Manning JM, Antipova AA, Perez DS, Moore MP, Hayashibara KC,
Lyons MR, Beaudoin RE, Coleman BE, Laptewicz MW, Sannicandro AE, Rhodes
MD, Gottimukkala RK, Yang S, Bafna V, Bashir A, MacBride A, Alkan C, Kidd JM,
Eichler EE, Reese MG, De la Vega FM, Blanchard AP (2009) Sequence and
Structural Variation in a Human Genome Uncovered by Short-Read, Massively
Parallel Ligation Sequencing Using Two-Base Encoding. Genome Research
19:1527
McLysaght A, Baldi PF, Gaut BS (2003) Extensive Gene Gain Associated with Adaptive
Evolution of Poxviruses. Proceedings of the National Academy of Sciences of the
United States of America 100:15655
Messier W, Stewart CB (1997) Episodic Adaptive Evolution of Primate Lysozymes.
Nature 385:151
Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K,
Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran
S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S,
Chandrika KN, Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia
H, Rekha B, Nayak R, Vishnupriya G, Kumar HG, Nagini M, Kumar GS, Jose R,
Deepthi P, Mohan SS, Gandhi TK, Harsha HC, Deshpande KS, Sarker M, Prasad TS,
Pandey A (2006) Human Protein Reference Database--2006 Update. Nucleic
Acids Res 34:D411
Mohamed TP, Carbonell JG, Ganapathiraju MK (2010) Active Learning for Human
Protein-Protein Interaction Prediction. BMC Bioinformatics 11 Suppl 1:S57
Monsalve M, Wu ZD, Adelmant G, Puigserver P, Fan ML, Spiegelman BM (2000) Direct
Coupling of Transcription and Mrna Processing through the Thermogenic
Coactivator Pgc-1. Molecular Cell 6:307
Moore KJ (1999) Utilization of Mouse Models in the Discovery of Human Disease Genes.
Drug Discov Today 4:123
Morgenstern B, Frech K, Dress A, Werner T (1998) Dialign: Finding Local Similarities by
Multiple Sequence Alignment. Bioinformatics 14:290
Mount DW (2004) Bioinformatics : Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, N.Y.
Needleman SB, Wunsch CD (1970) A General Method Applicable to the Search for
Similarities in the Amino Acid Sequence of Two Proteins. J Mol Biol 48:443
Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press
Nooren IMA, Thornton JM (2003) Structural Characterisation and Functional
Significance of Transient Protein-Protein Interactions. Journal of Molecular
Biology 325:991
Nuin PA, Wang Z, Tillier ER (2006) The Accuracy of Several Multiple Sequence
Alignment Programs for Proteins. BMC Bioinformatics 7:471
Nye TMW, Lio P, Gilks WR (2006) A Novel Algorithm and Web-Based Tool for
Comparing Two Alternative Phylogenetic Trees. Bioinformatics 22:117
O'Donnell RK, Kupferman M, Wei SJ, Singhal S, Weber R, O'Malley B, Cheng Y, Putt M,
Feldman M, Ziober B, Muschel RJ (2005) Gene Expression Signature Predicts
Lymphatic Metastasis in Squamous Cell Carcinoma of the Oral Cavity.
Oncogene 24:1244
Ohta S, Shiomi Y, Sugimoto K, Obuse C, Tsurimoto T (2002) A Proteomics Approach to
Identify Proliferating Cell Nuclear Antigen (Pcna)-Binding Proteins in Human
Cell Lysates - Identification of the Human Chl12/Rfcs2-5 Complex as a Novel
Pcna-Binding Protein. Journal of Biological Chemistry 277:40362
Ooi SL, Pan X, Peyser BD, Ye P, Meluh PB, Yuan DS, Irizarry RA, Bader JS, Spencer FA,
Boeke JD (2006) Global Synthetic-Lethality Analysis and Yeast Functional
Profiling. Trends Genet 22:56
Orengo C, Jones D, Thornton JM (2003) Bioinformatics : Genes, Proteins, and
Computers. BIOS Scientific ; Distributed in the U.S. by Springer-Verlag, Oxford
New York
Orlowski J, Kaczanowski S, Zielenkiewicz P (2007) Overrepresentation of Interactions
between Homologous Proteins in Interactomes. Febs Letters 581:52
Page RDM, Holmes EC (1998) Molecular Evolution : A Phylogenetic Approach.
Blackwell Science, Oxford
Pagel M (1994) Detecting Correlated Evolution on Phylogenies - a General-Method for
the Comparative-Analysis of Discrete Characters. Proceedings of the Royal
Society of London Series B-Biological Sciences 255:37
Pagel M (1997) Inferring Evolutionary Processes from Phylogenies. Zoologica Scripta
26:331
Pagel M, Meade A, Barker D (2004a) Bayesian Estimation of Ancestral Character States
on Phylogenies. Syst Biol 53:673
Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C,
Mark P, Stumpflen V, Mewes HW, Ruepp A, Frishman D (2005) The Mips
Mammalian Protein-Protein Interaction Database. Bioinformatics 21:832
Pagel P, Wong P, Frishman D (2004b) A Domain Interaction Map Based on Phylogenetic
Profiling. J Mol Biol 344:1331
Parfrey LW, Barbero E, Lasser E, Dunthorn M, Bhattacharya D, Patterson DJ, Katz LA
(2006) Evaluating Support for the Current Classification of Eukaryotic
Diversity. PLoS Genet 2:e220
Parida L (2008) Pattern Discovery in Bioinformatics : Theory & Algorithms. Chapman &
Hall/CRC, London

Pazos F, Ranea JAG, Juan D, Sternberg MJE (2005) Assessing Protein Co-Evolution in the
Context of the Tree of Life Assists in the Prediction of the Interactome. Journal of
Molecular Biology 352:1002
Pazos F, Valencia A (2001) Similarity of Phylogenetic Trees as Indicator of
Protein-Protein Interaction. Protein Eng 14:609
Pearson WR, Lipman DJ (1988) Improved Tools for Biological Sequence Comparison.
Proc Natl Acad Sci U S A 85:2444
Pellegrini M (2001) Computational Methods for Protein Function Analysis. Curr Opin
Chem Biol 5:46
Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning
Protein Functions by Comparative Genome Analysis: Protein Phylogenetic
Profiles. Proc Natl Acad Sci U S A 96:4285
Pesole G (2008) What Is a Gene? An Updated Operational Definition. Gene 417:1
Picardi E, Pesole G (2010) Computational Methods for Ab Initio and Comparative Gene
Finding. Methods Mol Biol 609:269
Pickett KM, Randle CP (2005) Strange Bayes Indeed: Uniform Topological Priors Imply
Non-Uniform Clade Priors. Molecular Phylogenetics and Evolution 34:203
Pinney JW, Shirley MW, McConkey GA, Westhead DR (2005) Metashark: Software for
Automated Metabolic Network Prediction from DNA Sequence and Its
Application to the Genomes of Plasmodium Falciparum and Eimeria Tenella.
Nucleic Acids Research 33:1399
Posada D, Buckley TR (2004) Model Selection and Model Averaging in Phylogenetics:
Advantages of Akaike Information Criterion and Bayesian Approaches over
Likelihood Ratio Tests. Syst Biol 53:793
Potter SC, Clarke L, Curwen V, Keenan S, Mongin E, Searle SM, Stabenau A, Storey R,
Clamp M (2004) The Ensembl Analysis Pipeline. Genome Res 14:934
Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla
D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S,
Somanathan DS, Sebastian A, Rani S, Ray S, Kishore CJH, Kanth S, Ahmed M,
Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S,
Ranganathan P, Ramabadran S, Chaerkady R, Pandey A (2009) Human Protein
Reference Database-2009 Update. Nucleic Acids Research 37:D767
Pressman R (2001) Software Engineering: A Practitioner's Approach. McGraw-Hill
Pruitt KD, Tatusova T, Maglott DR (2005) Ncbi Reference Sequence (Refseq): A Curated
Non-Redundant Sequence Database of Genomes, Transcripts and Proteins.
Nucleic Acids Res 33:D501
Qi YJ, Bar-Joseph Z, Klein-Seetharaman J (2006) Evaluation of Different Biological Data
and Computational Classification Methods for Use in Protein Interaction
Prediction. Proteins-Structure Function and Bioinformatics 63:490
Quackenbush J (2002) Microarray Data Normalization and Transformation. Nat Genet
32 Suppl:496
Raab JR, Kamakaka RT (2010) Opinion Insulators and Promoters: Closer Than We
Think. Nature Reviews Genetics 11:439
Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Defining and Identifying
Communities in Networks. Proceedings of the National Academy of Sciences of the
United States of America 101:2658
Radom-Aizik S, Hayek S, Shahar I, Rechavi G, Kaminski N, Ben-Dov I (2005) Effects of
Aerobic Training on Gene Expression in Skeletal Muscle of Elderly Men.
Medicine and Science in Sports and Exercise 37:1680

Ralston KS, Kabututu ZP, Melehani JH, Oberholzer M, Hill KL (2009) The Trypanosoma
Brucei Flagellum: Moving Parasites in New Directions. Annual Review of
Microbiology 63:335
Ramachandran N, Hainsworth E, Bhullar B, Eisenstein S, Rosen B, Lau AY, Walter JC,
LaBaer J (2004) Self-Assembling Protein Microarrays. Science 305:86
Ramazzina I, Folli C, Secchi A, Berni R, Percudani R (2006) Completing the Uric Acid
Degradation Pathway through Phylogenetic Comparison of Whole Genomes.
Nature Chemical Biology 2:144
Ranea JA, Yeats C, Grant A, Orengo CA (2007) Predicting Protein Function with
Hierarchical Phylogenetic Profiles: The Gene3d Phylo-Tuner Method Applied to
Eukaryotic Genomes. PLoS Comput Biol 3:e237
R Development Core Team (2011) R: A language and environment for statistical
computing. R Foundation for Statistical Computing Vienna, Austria
Reghunathan R, Jayapal M, Hsu LY, Chng HH, Tai D, Leung BP, Melendez AJ (2005)
Expression Profile of Immune Response Genes in Patients with Severe Acute
Respiratory Syndrome. BMC Immunology 6
Remm M, Storm CE, Sonnhammer EL (2001) Automatic Clustering of Orthologs and
In-Paralogs from Pairwise Species Comparisons. J Mol Biol 314:1041
Richmond TJ, Davey CA (2003) The Structure of DNA in the Nucleosome Core. Nature
423:145
Ridley M (1983) The Explanation of Organic Diversity : The Comparative Method and
Adaptations for Mating. Clarendon Press, Oxford
Robertson DL, Lovell SC (2009) Evolution in Protein Interaction Networks:
Co-Evolution, Rewiring and the Role of Duplication. Biochem Soc Trans 37:768
Rodriguez-Ezpeleta N, Brinkmann H, Burey SC, Roure B, Burger G, Loffelhardt W, Bohnert
HJ, Philippe H, Lang BF (2005) Monophyly of Primary Photosynthetic
Eukaryotes: Green Plants, Red Algae, and Glaucophytes. Curr Biol 15:1325
Rodriguez-Ezpeleta N, Brinkmann H, Burger G, Roger AJ, Gray MW, Philippe H, Lang BF
(2007) Toward Resolving the Eukaryotic Tree: The Phylogenetic Positions of
Jakobids and Cercozoans. Current Biology 17:1420
Rokas A, Williams BL, King N, Carroll SB (2003) Genome-Scale Approaches to Resolving
Incongruence in Molecular Phylogenies. Nature 425:798
Russell SJ, Norvig P, Canny J (2003) Artificial Intelligence : A Modern Approach.
Prentice Hall, Upper Saddle River, N.J.
Salemi M, Vandamme A-M (2003) The Phylogenetic Handbook : A Practical Approach
to DNA and Protein Phylogeny. Cambridge University Press, Cambridge, U.K. ;
New York
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database
of Interacting Proteins: 2004 Update. Nucleic Acids Res 32:D449
Sanger F, Coulson AR, Friedmann T, Air GM, Barrell BG, Brown NL, Fiddes JC, Hutchison
CA, Slocombe PM, Smith M (1978) Nucleotide-Sequence of Bacteriophage-PhiX174. J Mol Biol 125:225
Sanger F, Nicklen S, Coulson AR (1977) DNA Sequencing with Chain-Terminating
Inhibitors. Proc Natl Acad Sci U S A 74:5463
Sankoff D (1975) Minimal Mutation Trees of Sequences. Siam Journal on Applied
Mathematics 28:35
Sasaoka T, Kobayashi M (2000) The Functional Significance of Shc in Insulin Signaling
as a Substrate of the Insulin Receptor. Endocrine Journal 47:373
Scott MS, Barton GJ (2007) Probabilistic Prediction and Ranking of Human
Protein-Protein Interactions. BMC Bioinformatics 8:239
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B,
Ideker T (2003) Cytoscape: A Software Environment for Integrated Models of
Biomolecular Interaction Networks. Genome Research 13:2498
Shortle D, Ackerman MS (2001) Persistence of Native-Like Topology in a Denatured
Protein in 8 M Urea. Science 293:487
Siddall ME (1998) Success of Parsimony in the Four-Taxon Case: Long-Branch
Repulsion by Likelihood in the Farris Zone. Cladistics-the International Journal of
the Willi Hennig Society 14:209
Sillén-Tullberg B (1988) Evolution of Gregariousness in Aposematic Butterfly Larvae - a
Phylogenetic Analysis. Evolution 42:293
Simmons MP, Ochoterena H, Freudenstein JV (2002) Amino Acid Vs. Nucleotide
Characters: Challenging Preconceived Notions. Molecular Phylogenetics and
Evolution 24:78
Singh GP, Ganapathi M, Dash D (2007) Role of Intrinsic Disorder in Transient
Interactions of Hub Proteins. Proteins-Structure Function and Bioinformatics
66:761
Slater GS, Birney E (2005) Automated Generation of Heuristics for Biological Sequence
Comparison. Bmc Bioinformatics 6
Slonim N, Elemento O, Tavazoie S (2006) Ab Initio Genotype-Phenotype Association
Reveals Intrinsic Modularity in Genetic Networks. Molecular Systems Biology
Smith TF, Waterman MS (1981) Identification of Common Molecular Subsequences. J
Mol Biol 147:195
Sneath PHA, Sokal RR (1973) Numerical Taxonomy : The Principles and Practice of
Numerical Classification. W. H. Freeman, San Francisco
Snel B, Bork P, Huynen MA (1999) Genome Phylogeny Based on Gene Content. Nat
Genet 21:108
Sokal RR, Rohlf FJ (1995) Biometry : The Principles and Practice of Statistics in
Biological Research. W.H. Freeman, New York
Spira A, Beane J, Shah V, Liu G, Schembri F, Yang XM, Palma J, Brody JS (2004) Effects
of Cigarette Smoke on the Human Airway Epithelial Cell Transcriptome.
Proceedings of the National Academy of Sciences of the United States of America
101:10143
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M (2006) Biogrid: A
General Repository for Interaction Datasets. Nucleic Acids Research 34:D535
Steel M, Penny D (2000) Parsimony, Likelihood, and the Role of Models in Molecular
Phylogenetics. Mol Biol Evol 17:839
Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner
M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann
S, Goedde A, Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H,
Wanker EE (2005) A Human Protein-Protein Interaction Network: A Resource
for Annotating the Proteome. Cell 122:957
Stevens PF, Augier A (1983) Augustin Augier's "Arbre Botanique" (1801), a
Remarkable Early Botanical Representation of the Natural System. Taxon
32:203
Stewart CB, Schilling JW, Wilson AC (1987) Adaptive Evolution in the Stomach
Lysozymes of Foregut Fermenters. Nature 330:401
Strachan T, Read AP (2004) Human Molecular Genetics. Garland Press, New York
Stuart GW, Moffett K, Leader JJ (2002) A Comprehensive Vertebrate Phylogeny Using
Vector Representations of Protein Sequences from Whole Genomes. Mol Biol
Evol 19:554
Stumpf MP, Thorne T, de Silva E, Stewart R, An HJ, Lappe M, Wiuf C (2008) Estimating
the Size of the Human Interactome. Proceedings of the National Academy of
Sciences of the United States of America 105:6959
Sundquist A, Ronaghi M, Tang HX, Pevzner P, Batzoglou S (2007) Whole-Genome
Sequencing and Assembly with High-Throughput, Short-Read Technologies.
PLoS One 2
Swanson KW, Irwin DM, Wilson AC (1991) Stomach Lysozyme Gene of the Langur
Monkey - Tests for Convergence and Positive Selection. J Mol Evol 33:418
Swofford DL, Maddison WP (1987) Reconstructing Ancestral Character States under
Wagner Parsimony. Mathematical Biosciences 87:199
Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers JS (2001) Bias in
Phylogenetic Estimation and Its Relevance to the Choice between Parsimony and
Likelihood Methods. Systematic Biology 50:525
Takatsu H, Futatsumori M, Yoshino K, Yoshida Y, Shin HW, Nakayama K (2001) Similar
Subunit Interactions Contribute to Assembly of Clathrin Adaptor Complexes
and Copi Complex: Analysis Using Yeast Three-Hybrid System. Biochemical and
Biophysical Research Communications 284:1083
Talavera G, Castresana J (2007) Improvement of Phylogenies after Removing Divergent
and Ambiguously Aligned Blocks from Protein Sequence Alignments. Syst Biol
56:564
Tanaka R, Yi TM, Doyle J (2005) Some Protein Interaction Data Do Not Exhibit Power
Law Statistics. Febs Letters 579:5140
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM,
Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV,
Vasudevan S, Wolf YI, Yin JJ, Natale DA (2003) The Cog Database: An Updated
Version Includes Eukaryotes. BMC Bioinformatics 4:41
Telford MJ (2004) Animal Phylogeny: Back to the Coelomata? Curr Biol 14:R274
Tian B, Nowak DE, Jamaluddin M, Wang SF, Brasier AR (2005) Identification of Direct
Genomic Targets Downstream of the Nuclear Factor-Kappa B Transcription
Factor Mediating Tumor Necrosis Factor Signaling. Journal of Biological
Chemistry 280:17435
Tierney EP, Tulac S, Huang STJ, Giudice LC (2003) Activation of the Protein Kinase a
Pathway in Human Endometrial Stromal Cells Reveals Sequential Categorical
Gene Regulation. Physiological Genomics 16:47
Townsend JP, Lopez-Giraldez F, Friedman R (2008) The Phylogenetic Informativeness of
Nucleotide and Amino Acid Sequences for Reconstructing the Vertebrate Tree. J
Mol Evol 67:437
Valadkhan S, Jaladat Y (2010) The Spliceosomal Proteome: At the Heart of the Largest
Cellular Ribonucleoprotein Machine. Proteomics 10:4128
Vanacova S, Liston DR, Tachezy J, Johnson PJ (2003) Molecular Biology of the
Amitochondriate Parasites, Giardia Intestinalis, Entamoeba Histolytica and
Trichomonas Vaginalis. International Journal for Parasitology 33:235
Vanharanta S, Pollard PJ, Lehtonen HJ, Laiho P, Sjoberg J, Leminen A, Aittomaki K, Arola
J, Kruhoffer M, Orntoft TF, Tomlinson IP, Kiuru M, Arango D, Aaltonen LA (2006)
Distinct Expression Profile in Fumarate-Hydratase-Deficient Uterine Fibroids.
Human Molecular Genetics 15:97
Velculescu VE, Zhang L, Zhou W, Polyak K, Basrai M, Bassett D, Hieter P, Vogelstein B,
Kinzler KW (1997) Serial Analysis of Gene Expression (Sage). American Journal
of Human Genetics 61:A36
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M,
Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman
JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas
PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick
VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos
R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S,
Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E,
Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R,
Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian
AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z,
Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina
N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg
S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R,
Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong
F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A,
Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I,
Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport
L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart
B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T,
Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D,
McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K,
Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH,
Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E,
Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M,
Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell
MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania
A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz
R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M,
Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M,
Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek
A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J,
Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu
X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T,
Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J,
Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M,
Wu D, Wu M, Xia A, Zandieh A, Zhu X (2001) The Sequence of the Human
Genome. Science 291:1304
Vert JP (2002) A Tree Kernel to Analyse Phylogenetic Profiles. Bioinformatics 18 Suppl
1:S276
Vidalain PO, Boxem M, Ge H, Li S, Vidal M (2004) Increasing Specificity in
High-Throughput Yeast Two-Hybrid Experiments. Methods 32:363
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E (2009) Ensemblcompara
Genetrees: Complete, Duplication-Aware Phylogenetic Trees in Vertebrates.
Genome Res 19:327
von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen
MA, Bork P (2005) String: Known and Predicted Protein-Protein Associations,
Integrated and Transferred across Organisms. Nucleic Acids Res 33:D433
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002)
Comparative Assessment of Large-Scale Data Sets of Protein-Protein
Interactions. Nature 417:399
von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, Ouzounis CA, Bork P
(2003) Genome Evolution Reveals Biochemical Networks and Functional
Modules. Proc Natl Acad Sci U S A 100:15428
Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, Brasch MA, Thierry-Mieg N, Vidal
M (2000) Protein Interaction Mapping in C. Elegans Using Proteins Involved in
Vulval Development. Science 287:116
Wall DP, Fraser HB, Hirsh AE (2003) Detecting Putative Orthologs. Bioinformatics
19:1710
Wang D, Hsieh M, Li WH (2005a) A General Tendency for Conservation of Protein
Length across Eukaryotic Kingdoms. Mol Biol Evol 22:142
Wang H, Xu Z, Gao L, Hao B (2009a) A Fungal Phylogeny Based on 82 Complete
Genomes Using the Composition Vector Method. BMC Evol Biol 9:195
Wang J, Xia Q, He X, Dai M, Ruan J, Chen J, Yu G, Yuan H, Hu Y, Li R, Feng T, Ye C, Lu
C, Wang J, Li S, Wong GK, Yang H, Wang J, Xiang Z, Zhou Z, Yu J (2005b)
Silkdb: A Knowledgebase for Silkworm Biology and Genomics. Nucleic Acids
Res 33:D399
Wang Z, Gerstein M, Snyder M (2009b) Rna-Seq: A Revolutionary Tool for
Transcriptomics. Nat Rev Genet 10:57
Watson JD, Crick FH (1953) Molecular Structure of Nucleic Acids; a Structure for
Deoxyribose Nucleic Acid. Nature 171:737
Watts DJ, Strogatz SH (1998) Collective Dynamics of 'Small-World' Networks. Nature
393:440
Wheeler TJ, Kececioglu JD (2007) Multiple Alignment by Aligning Alignments.
Bioinformatics 23:i559
Whelan S, Goldman N (2001) A General Empirical Model of Protein Evolution Derived
from Multiple Protein Families Using a Maximum-Likelihood Approach.
Molecular Biology and Evolution 18:691
Whitaker JW, McConkey GA, Westhead DR (2009) Prediction of Horizontal Gene
Transfers in Eukaryotes: Approaches and Challenges. Biochem Soc Trans 37:792
Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ (1997) Genome-Wide Expression
Monitoring in Saccharomyces Cerevisiae. Nat Biotechnol 15:1359
Woese CR, Kandler O, Wheelis ML (1990) Towards a Natural System of Organisms:
Proposal for the Domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U
S A 87:4576
Wolf YI, Rogozin IB, Koonin EV (2004) Coelomata and Not Ecdysozoa: Evidence from
Genome-Wide Phylogenetic Analysis. Genome Research 14:29
Wootton JC, Federhen S (1993) Statistics of Local Complexity in Amino-Acid-Sequences
and Sequence Databases. Computers & Chemistry 17:149
Wu G, Nie L, Zhang WW (2008) Integrative Analyses of Posttranscriptional Regulation
in the Yeast Saccharomyces Cerevisiae Using Transcriptomic and Proteomic
Data. Current Microbiology 57:18
Yakovchuk P, Protozanova E, Frank-Kamenetskii MD (2006) Base-Stacking and
Base-Pairing Contributions into Thermal Stability of the DNA Double Helix (Vol 34,
Pg 564, 2006). Nucleic Acids Res 34:1082
Yang Z (2006) Computational Molecular Evolution. Oxford University Press, Oxford
Yang Z (2008) Computational Molecular Evolution. Oxford University Press, Oxford
Yang ZH (1994) Maximum-Likelihood Phylogenetic Estimation from DNA-Sequences
with Variable Rates over Sites - Approximate Methods. Journal of Molecular
Evolution 39:306

Yang ZH, Kumar S, Nei M (1995) A New Method of Inference of Ancestral Nucleotide
and Amino-Acid-Sequences. Genetics 141:1641
Yedavalli VSRK, Neuveut C, Chi YH, Kleiman L, Jeang KT (2004) Requirement of Ddx3
Dead Box Rna Helicase for Hiv-1 Rev-Rre Export Function. Cell 119:381
Yu HY, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa
T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot A, Vazquez A, Murray RR,
Simon C, Tardivo L, Tam S, Svrzikapa N, Fan CY, de Smet AS, Motyl A, Hudson
ME, Park J, Xin XF, Cusick ME, Moore T, Boone C, Snyder M, Roth FP, Barabasi
AL, Tavernier J, Hill DE, Vidal M (2008) High-Quality Binary Protein Interaction
Map of the Yeast Interactome Network. Science 322:104
Yu HY, Luscombe NM, Lu HX, Zhu XW, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M,
Gerstein M (2004a) Annotation Transfer between Genomes: Protein-Protein
Interologs and Protein-DNA Regulogs. Genome Research 14:1107
Yu HY, Luscombe NM, Lu HX, Zhu XW, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M,
Gerstein M (2004b) Annotation Transfer between Genomes: Protein-Protein
Interologs and Protein-DNA Regulogs. Genome Res 14:1107
Zhang JZ (2003) Evolution by Gene Duplication: An Update. Trends in Ecology &
Evolution 18:292
Zheng Q, Wang XJ (2008) Goeast: A Web-Based Software Toolkit for Gene Ontology
Enrichment Analysis. Nucleic Acids Research 36:W358
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R,
Bidlingmaier S, Houfek T, Mitchell T, Miller P, Dean RA, Gerstein M, Snyder M
(2001) Global Analysis of Protein Activities Using Proteome Chips. Science
293:2101
Zhu X, Gerstein M, Snyder M (2007) Getting Connected: Analysis and Principles of
Biological Networks. Genes Dev 21:1010


Appendix A
Appendix A: Description of the divergence of the Java implementation of the Inparanoid
algorithm from the Perl implementation.
To examine the differences in output between the novel Java implementation of the
Inparanoid algorithm and version 2.0 as distributed by Remm et al. (Remm et al. 2001), the
following test was run.
Organism A: Saccharomyces cerevisiae.
Organism B: Encephalitozoon cuniculi.
BLASTP (version 2.2.18) was run with the Fasta-formatted file containing all proteins for
Saccharomyces cerevisiae as the query and the Fasta-formatted file containing all proteins
for Encephalitozoon cuniculi as the database. The program formatdb was used to format the
database files for BLASTP.
The substitution matrix BLOSUM62 was used to score alignments. The converse command was
also run, with Encephalitozoon cuniculi as the query and Saccharomyces cerevisiae as the
database. Each organism was also run against itself as both query and database. The
parameters v and b were set to the number of proteins in the database file, and the
parameter z was set to a theoretical maximum database size, following Remm et al. (2001),
to maintain consistent values for relevant statistics such as K and λ, described below.
The exact syntax of the commands is given below:

blastall -i Saccharomyces_cerevisiae -d Saccharomyces_cerevisiae -p blastp -v 5883 -b 5883 -F "m S" -M BLOSUM62 -z 5000000 -V
blastall -i Saccharomyces_cerevisiae -d Encephalitozoon_cuniculi -p blastp -v 1996 -b 1996 -F "m S" -M BLOSUM62 -z 5000000 -V
blastall -i Encephalitozoon_cuniculi -d Saccharomyces_cerevisiae -p blastp -v 5883 -b 5883 -F "m S" -M BLOSUM62 -z 5000000 -V
blastall -i Encephalitozoon_cuniculi -d Encephalitozoon_cuniculi -p blastp -v 1996 -b 1996 -F "m S" -M BLOSUM62 -z 5000000 -V
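The four invocations follow one pattern: every ordered (query, database) pair is searched, and v and b take the protein count of the database. As a minimal sketch of that pattern, the Python helper below assembles the same argument lists. The function and variable names are illustrative only and are not part of the Inparanoid package; the flags and protein counts are copied from the commands listed here.

```python
import shlex

# Protein counts per proteome file, as given in the blastall commands.
PROTEOME_SIZES = {
    "Saccharomyces_cerevisiae": 5883,
    "Encephalitozoon_cuniculi": 1996,
}


def blastall_command(query, database, sizes=PROTEOME_SIZES):
    """Build one blastall argument list; -v and -b follow the database size."""
    n = str(sizes[database])
    return [
        "blastall", "-i", query, "-d", database, "-p", "blastp",
        "-v", n, "-b", n, "-F", "m S", "-M", "BLOSUM62",
        "-z", "5000000", "-V",
    ]


def all_against_all(sizes=PROTEOME_SIZES):
    """Return the argument lists for every ordered (query, database) pair."""
    return [blastall_command(q, d, sizes) for q in sizes for d in sizes]


if __name__ == "__main__":
    for cmd in all_against_all():
        # shlex.join quotes the "m S" filter string so it stays one argument.
        print(shlex.join(cmd))
```

Keeping each command as an argument list rather than a single string avoids quoting problems with the `m S` filter option if the commands are later passed to a process launcher.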
The output from these commands was fed to the Perl script blast_parser.pl provided in the
Inparanoid package, which produces formatted output in the following order:


Protein Id1.
Protein Id2.
Bit Score.
E value.
Protein A Length.
Protein B Length.
Alignment Length on query sequence.
Identity percentage.
Similarity percentage.
Coordinates of alignment on query sequence.
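A record with these ten fields can be read into a typed structure before any comparison is made. The sketch below is illustrative only: it assumes the fields are separated by whitespace and keeps the coordinate field as raw text, because the exact layout written by blast_parser.pl is not described here. The class name, the function name and the numeric values in the example line (other than the two accessions and the bit score) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ParsedHit:
    protein_id1: str
    protein_id2: str
    bit_score: float
    e_value: float
    length_a: int
    length_b: int
    alignment_length: int
    identity_pct: float
    similarity_pct: float
    coordinates: str  # kept as raw text; its exact layout is an unknown here


def parse_hit(line):
    """Parse one record whose fields appear in the order listed above."""
    # Split on whitespace at most nine times so that the tenth field keeps
    # the remainder of the line (assumption: whitespace-separated columns).
    fields = line.strip().split(None, 9)
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return ParsedHit(
        protein_id1=fields[0],
        protein_id2=fields[1],
        bit_score=float(fields[2]),
        e_value=float(fields[3]),
        length_a=int(fields[4]),
        length_b=int(fields[5]),
        alignment_length=int(fields[6]),
        identity_pct=float(fields[7]),
        similarity_pct=float(fields[8]),
        coordinates=fields[9],
    )


if __name__ == "__main__":
    # Hypothetical record: only the accessions and bit score come from the text.
    example = "NP_015092 NP_586462 56.2 1e-10 350 320 300 25.0 45.0 q:10-310"
    print(parse_hit(example))
```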

Blast bit scores are calculated using the formula

$$S' = \frac{\lambda S - \ln K}{\ln 2} \qquad (1)$$

where S' = bit score, S = raw score, K = constant associated with search space size and
λ = constant associated with the scoring system (Mount 2004).
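Equation (1) can be evaluated directly. The short Python sketch below does only that; the function name is illustrative, and the parameter values in the example call are assumed for demonstration and were not taken from the test run described in this appendix.

```python
import math


def bit_score(raw_score, lam, k):
    """Equation (1): S' = (lambda * S - ln K) / ln 2."""
    if k <= 0:
        raise ValueError("K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Assumed, illustrative values of lambda and K.
    print(round(bit_score(raw_score=100, lam=0.267, k=0.041), 1))
```

Because K and λ enter the conversion, the same raw score maps to a different bit score whenever those statistics change, which is why the z parameter is held fixed across the four searches.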


Blast bit scores can be affected by composition-based score adjustments, which were
introduced to deal with comparisons of proteins with highly biased amino acid
compositions (Altschul et al. 2005).
These adjustments, together with differences in search space size when the query and
database are swapped, can lead to an asymmetry in bit scores for the same pair of
sequences. To deal with this artefact, both implementations normalise scores by
averaging the A-B and B-A orientations.
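The averaging step can be sketched as follows. This is an illustration rather than the code of either implementation: the function name and the dictionary layout are assumptions, as is the choice to skip pairs that were scored in one direction only. The example scores are the A-B and B-A values reported for protein pair 1 in this appendix.

```python
def symmetrise(directed_scores):
    """Average the A-B and B-A bit scores of every pair seen in both directions.

    directed_scores maps (query, subject) tuples to bit scores.
    """
    averaged = {}
    for (query, subject), score in directed_scores.items():
        reverse = directed_scores.get((subject, query))
        if query == subject or reverse is None:
            # Assumption: self hits and one-directional hits are skipped.
            continue
        averaged[frozenset((query, subject))] = (score + reverse) / 2.0
    return averaged


if __name__ == "__main__":
    scores = {
        ("NP_015092", "NP_586462"): 56.2,  # A-B orientation
        ("NP_586462", "NP_015092"): 55.5,  # B-A orientation
    }
    for pair, mean in symmetrise(scores).items():
        print(sorted(pair), mean)
```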
Results were filtered to retain only hits consisting of a single high-scoring pair,
because the Java implementation was built to handle SSEARCH output, which returns only a
single optimal local alignment.

Reciprocal best hits as marked by Perl implementation but not by Java implementation.
Protein pair 1
Protein A: NP_015092 (Saccharomyces cerevisiae )
Protein B: NP_586462 (Encephalitozoon cuniculi)
A-B bit score = 56.2 (Rounded down to 56 by Perl). This is the best score in the A-B
direction.
B-A bit score = 55.5 (Rounded up to 56 by Perl).
Mean bit score = 55.85 (Rounded to 56 by Perl).
NP_586462 however has another significant score against Saccharomyces cerevisiae, which
is NP_013908 with a mean bit score of 56.2 (Rounded down to 56 by Perl).
The Java implementation does not recognise NP_015092 and NP_586462 as reciprocal best
hits as 56.2 > 55.85.
Protein pair 2
Protein A: NP_014520 (Saccharomyces cerevisiae ).
Protein B: NP_586468 (Encephalitozoon cuniculi).
A-B Bit score = 61.6 (Rounded up to 62 by Perl).
B-A Bit score = 61.6 (Rounded up to 62 by Perl).
Mean bit score = 61.6 (Rounded up to 62 by Perl).
NP_586468 has another significant score against Saccharomyces cerevisiae, which is
NP_014097 with a mean bit score of 62.0.
The Java implementation does not recognise NP_014520 and NP_586468 as reciprocal best
hits as 62.0 > 61.6.
Protein pair 3
Protein A: NP_013648 (Saccharomyces cerevisiae ).
Protein B: NP_597364 (Encephalitozoon cuniculi).
A-B Bit score = 111.0
B-A Bit score = 112.0
Mean bit score = 111.5 (Rounded up to 112 by Perl).
NP_597364 has another significant score against Saccharomyces cerevisiae, which is
NP_013546 with a mean bit score of 112.0

The Java implementation does not recognise NP_013648 and NP_597364 as reciprocal best
hits as 112.0 > 111.5.
Protein pair 4
Protein A: NP_013182 (Saccharomyces cerevisiae ).
Protein B: NP_586039 (Encephalitozoon cuniculi).
A-B Bit score = 60.5
B-A Bit score = 60.5
Mean bit score = 60.5 (Rounded up to 61 by Perl).
NP_586039 has two other significant scores against Saccharomyces cerevisiae, which are
NP_010629 and NP_010630, which both have mean bit scores of 60.85.
The Java implementation does not recognise NP_013182 and NP_586039 as reciprocal best
hits as 60.85 > 60.5.
Protein pair 5
Protein A: NP_010407 (Saccharomyces cerevisiae).
Protein B: NP_597607 (Encephalitozoon cuniculi).
A-B Bit score = 246.0
B-A Bit score = 245.0
Mean bit score = 245.5 (Rounded up to 246 by Perl).
NP_597607 has another significant score against Saccharomyces cerevisiae, which is
NP_013197, which has a mean bit score of 246.
The Java implementation does not recognise NP_010407 and NP_597607 as reciprocal best
hits as 246 > 245.5.
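The five cases share one cause: once mean scores are rounded to integers the candidate ties with its competitor, whereas the unrounded scores do not tie. The sketch below reproduces that effect with the mean scores quoted for the five pairs. Two rules in it are assumptions chosen to be consistent with those examples and are not taken from the Inparanoid source: halves are rounded up, and a tie still counts as a best hit.

```python
import math


def round_half_up(score):
    """Round to the nearest integer, with halves rounded up (assumed rule)."""
    return math.floor(score + 0.5)


def is_best_hit(candidate, rivals, rounded):
    """True if no rival mean bit score strictly exceeds the candidate's."""
    if rounded:
        candidate = round_half_up(candidate)
        rivals = [round_half_up(r) for r in rivals]
    return all(candidate >= r for r in rivals)


# (candidate mean score, competing mean scores) for protein pairs 1 to 5.
PAIRS = [
    (55.85, [56.2]),
    (61.6, [62.0]),
    (111.5, [112.0]),
    (60.5, [60.85, 60.85]),
    (245.5, [246.0]),
]

if __name__ == "__main__":
    for number, (candidate, rivals) in enumerate(PAIRS, start=1):
        print(
            number,
            is_best_hit(candidate, rivals, rounded=True),
            is_best_hit(candidate, rivals, rounded=False),
        )
```

`math.floor(score + 0.5)` is used instead of Python's built-in `round`, which rounds halves to the nearest even integer and would turn 60.5 into 60 rather than 61.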
Differences in Cluster Output
A number of groups differ between the two implementations on this test data. This is
because the two implementations store different score values (for example, the Perl
implementation rounds bit scores to integers, as in the protein pairs above), which
affects the criterion for reciprocal best hits as well as the criteria for merging and
deleting clusters. However, the primary purpose of ortholog selection in this project,
the detection of presence and absence of proteins, is still achieved, as the number of
Saccharomyces cerevisiae proteins found to be present in Encephalitozoon cuniculi was
identical.
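Presence and absence depend only on whether a protein falls in a group that also contains a protein from the other species, not on how the groups are divided. The sketch below illustrates this with orthologous groups 2 and 3 from the group listing in this appendix, which the Perl implementation keeps separate and the Java implementation merges. The function name is illustrative, and the species labels are assigned by analogy with the accessions named in the protein pairs (an assumption).

```python
def present_proteins(groups, species_of, species):
    """Proteins of `species` grouped with at least one protein of another species."""
    present = set()
    for group in groups:
        species_in_group = {species_of[protein] for protein in group}
        if len(species_in_group) > 1:
            present.update(p for p in group if species_of[p] == species)
    return present


# Species labels are an assumption based on the accession patterns in the text.
SPECIES_OF = {
    "NP_011424": "S. cerevisiae",
    "NP_012263": "S. cerevisiae",
    "NP_586425": "E. cuniculi",
    "NP_597203": "E. cuniculi",
}

# Groups 2 and 3 as output by the Perl implementation.
PERL_GROUPS = [{"NP_011424", "NP_586425"}, {"NP_012263", "NP_597203"}]
# The same proteins as one merged group, as output by the Java implementation.
JAVA_GROUPS = [{"NP_011424", "NP_586425", "NP_012263", "NP_597203"}]

if __name__ == "__main__":
    perl = present_proteins(PERL_GROUPS, SPECIES_OF, "S. cerevisiae")
    java = present_proteins(JAVA_GROUPS, SPECIES_OF, "S. cerevisiae")
    print(sorted(perl), sorted(java), perl == java)
```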

Groups that differ between implementations.
Sixteen groups differ between the two implementations. The Java implementation produces
616 groups, while the Perl implementation produces 619.
Orthologous Group 1
Perl Inparanoid implementation
NP_009501 NP_586181
NP_014887
The Java implementation clusters NP_009501 and NP_014887 with a separate protein,
XP_955683.
Orthologous Group 2
Perl Inparanoid implementation
NP_011424 NP_586425

Orthologous Group 3
Perl Inparanoid implementation
NP_012263 NP_597203

Groups 2 and 3 are merged into one group by the Java implementation.
Orthologous Group 4
Perl Inparanoid implementation
NP_012610 XP_955636
NP_010056
NP_009928
NP_010504
Group 4 does not contain NP_009928 in the output from the Java implementation.
Orthologous group 5
Perl Inparanoid implementation
NP_012710 NP_597625
NP_014074
Orthologous group 6
Perl Inparanoid implementation
NP_014293 NP_597270
NP_014752
NP_012264
Groups 5 and 6 are merged into one group by the Java implementation.
Orthologous group 7
Perl Inparanoid implementation
NP_011573 XP_965975
NP_011975
NP_013418
In the output from the Java implementation, orthologous group 7 contains an additional
Saccharomyces cerevisiae paralog, NP_009928.
Orthologous group 8
Perl Inparanoid implementation
NP_011651 NP_597477
In the output from the Java implementation, orthologous group 8 contains an additional
Saccharomyces cerevisiae paralog, NP_014604.
Orthologous Group 9
Perl Inparanoid implementation
NP_013618 NP_597286
NP_014604
In the output from the Java implementation, orthologous group 9 contains the
Saccharomyces cerevisiae paralog NP_015007 in place of NP_014604.
Orthologous Group 10
Perl Inparanoid implementation
NP_014045 NP_586473
NP_015007
The Java implementation clusters NP_014045 and NP_015007 with a separate protein,
NP_597286.
Orthologous Group 11
Perl Inparanoid implementation
NP_012603 NP_584802
          NP_597429
Group 11 does not contain NP_597429 in the output from the Java implementation.
Orthologous Group 12
Perl Inparanoid implementation
NP_010144 NP_597320
NP_010089 NP_586125
NP_014737
Orthologous Group 13
Perl Inparanoid implementation
NP_009723 NP_597558
NP_015274
Orthologous groups 12 and 13 are merged into one group by the Java implementation.


Orthologous Group 14
Perl Inparanoid implementation
NP_009800 NP_584705
NP_010629 NP_586039
NP_010630
NP_013182
NP_011960
NP_012316
NP_014486
NP_010632
NP_011962
NP_012321
NP_011964
NP_013724
NP_116644
NP_010845
NP_014470
NP_010036
NP_012692
NP_011411
NP_014081
NP_010087
NP_010143
NP_010825
NP_014538
NP_010785
NP_010675
NP_116613
NP_011805
NP_010034
NP_012694
NP_009857
NP_010082
In the output from the Perl implementation, orthologous group 14 contains an additional
Saccharomyces cerevisiae paralog, NP_010082.
Orthologous Group 15
Perl Inparanoid implementation
NP_012710 NP_597625
NP_014074
Orthologous Group 16
Perl Inparanoid implementation
NP_014293 NP_597270
NP_014752
NP_012264
Orthologous groups 15 and 16 are merged into one group by the Java implementation.
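A comparison of this kind can be automated by treating each clustering as a set of protein sets and reporting the groups that occur in only one of them. The sketch below is illustrative and is not the procedure used to produce the listing above. It uses orthologous groups 15 and 16 as examples, with accessions written without trailing full stops, together with one placeholder group (hypothetical identifiers A1 and B1) standing in for a group that both implementations agree on.

```python
def differing_groups(groups_a, groups_b):
    """Return the groups found only in A and the groups found only in B."""
    set_a = {frozenset(group) for group in groups_a}
    set_b = {frozenset(group) for group in groups_b}
    return set_a - set_b, set_b - set_a


GROUP_15 = {"NP_012710", "NP_597625", "NP_014074"}
GROUP_16 = {"NP_014293", "NP_597270", "NP_014752", "NP_012264"}
SHARED = {"A1", "B1"}  # hypothetical group identical in both outputs

PERL_GROUPS = [GROUP_15, GROUP_16, SHARED]
JAVA_GROUPS = [GROUP_15 | GROUP_16, SHARED]  # groups 15 and 16 merged

if __name__ == "__main__":
    only_perl, only_java = differing_groups(PERL_GROUPS, JAVA_GROUPS)
    print(len(only_perl), "groups only in the Perl output")
    print(len(only_java), "groups only in the Java output")
```

Using `frozenset` makes each group hashable and makes the comparison independent of the order in which proteins are listed within a group.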


Appendix B: Individual gene trees for genes in the supermatrix utilised in construction of the phylogeny

Gene RPL23: 60S ribosomal protein L23.


Gene RPS8: 40S ribosomal protein S8.


Gene SRP54: signal recognition particle 54 kDa protein.


Gene ERCC3: TFIIH basal transcription factor complex helicase XPB.



Gene KARS: lysyl-tRNA synthetase.



Gene METAP2: methionine aminopeptidase 2.



Gene ATP6V1D: V-type proton ATPase subunit D.


Gene PSMC1: 26S protease regulatory subunit 4.


Gene NFS1: cysteine desulfurase, mitochondrial precursor.


Gene GARS: glycyl-tRNA synthetase.


Appendix C: Predictions made by constrained ML
Protein 1
Protein 2

Description
Description
PREDICTED:
apolipoprotein A-I binding protein

91984773

similar to adaptor-related protein complex 1

precursor

89042891
sigma 2 subunit
PREDICTED: similar to peptidylprolyl

23110944

proteasome alpha 6 subunit

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

23110944

proteasome alpha 6 subunit

38201680

23110944

proteasome alpha 6 subunit

113414586

isoform b
PREDICTED: similar to CG17293-PA
PREDICTED: similar to Ubiquitin-63E

11024714

ubiquitin B precursor

113423966

11024714

ubiquitin B precursor

5454144

7705785

transcription factor B1, mitochondrial

113414586

CG11624-PA, isoform A
ubiquitin D
PREDICTED: similar to CG17293-PA
PREDICTED: similar to peptidylprolyl

7705785

transcription factor B1, mitochondrial

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

7705785

transcription factor B1, mitochondrial

38201680

isoform b

4557896

myotubularin

41350318

myotubularin-related protein 2 isoform 2

transcription elongation factor A protein


4507385

PREDICTED: similar to 40S ribosomal

2 isoform a

51467029

protein S26

transcription elongation factor A protein


4507385

2 isoform a

4557896

myotubularin

7705477

hypothetical protein LOC51504

28872761

myotubularin-related protein 1

transcription elongation factor A protein


4507385

2 isoform a

PREDICTED: similar to postmeiotic


113418682

segregation increased 2-like 2

transcription elongation factor A protein


4507385

2 isoform a

38201710


DEAD box polypeptide 17 isoform 1

transcription elongation factor A protein
4507385

2 isoform a

4557896

myotubularin

PREDICTED: similar to large subunit


113427529
44680154

transcription elongation factor A protein


4507385

myotubularin-related protein 2 isoform 1


small nuclear ribonucleoprotein polypeptide

2 isoform a

4507129

transcription elongation factor A protein


4507385

ribosomal protein L36a

E
RNA, U3 small nucleolar interacting protein

2 isoform a

4759276

transcription elongation factor A protein


4507385

2 isoform a

116812591

RER1 retention in endoplasmic reticulum 1

transcription elongation factor A protein


4507385

2 isoform a

4557719

DNA ligase I

transcription elongation factor A protein


4507385

2 isoform a

56549681

small CTD phosphatase 3 isoform 2

4557896

myotubularin

18491016

exonuclease 1 isoform b

transcription elongation factor A protein


4507385

phosphatidylinositol glycan anchor

2 isoform a

4758922

biosynthesis, class L

4506233

proteasome 26S non-ATPase subunit 8

7019319

activator of basal transcription 1

transcription elongation factor A protein


4507385

2 isoform a
transcription elongation factor A protein

4507385

2 isoform a
transcription elongation factor A protein

4507385

DNA directed RNA polymerase II

2 isoform a

10863925

transcription elongation factor A protein


4507385

polypeptide L
dehydrodolichyl diphosphate synthase

2 isoform a

45580738

isoform b

transcription elongation factor A protein


4507385

2 isoform a

4557896

myotubularin

150170706
19923424

anaphase promoting complex subunit 10


myotubularin-related protein 9

transcription elongation factor A protein


4507385
4507385

2 isoform a

8923942

transcription elongation factor A protein


40254869

nucleolar protein family A, member 3


pre-mRNA processing factor 31 homolog

2 isoform a
transcription elongation factor A protein
4507385

2 isoform a

41327715

p53-related protein kinase

transcription elongation factor A protein


4507385

2 isoform a

4507311

transcription elongation factor A protein


4507385

2 isoform a

suppressor of Ty 4 homolog 1
PREDICTED: similar to large subunit

113427044

ribosomal protein L36a

transcription elongation factor A protein


4507385

2 isoform a

4506651

ribosomal protein L36a-like protein


PREDICTED: similar to peptidylprolyl

4557896

myotubularin

113429091

isomerase A isoform 1

4557896

myotubularin

113414586

PREDICTED: similar to CG17293-PA

transcription elongation factor A protein


4507385

DNA directed RNA polymerase II

2 isoform a

4505947

phosphoribosyl pyrophosphate
4506127

polypeptide G
phosphoribosyl pyrophosphate synthetase 1-

synthetase 1

28557709

like 1
PREDICTED: similar to adaptor-related

153791910

hypothetical protein LOC79868

89042891

4506541

retinaldehyde binding protein 1

4557719

protein complex 1 sigma 2 subunit


DNA ligase I
PREDICTED: similar to adaptor-related

38348232

dual specificity phosphatase 7

89042891

protein complex 1 sigma 2 subunit

pre-mRNA processing factor 31


40254869

homolog

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to 60S ribosomal

pre-mRNA processing factor 31


40254869

homolog

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal

pre-mRNA processing factor 31


40254869
40254869

homolog

pre-mRNA processing factor 31

protein L26 (Silica-induced gene 20 protein)


113431146
113429091


(SIG-20)

PREDICTED: similar to peptidylprolyl

homolog

isomerase A isoform 1
PREDICTED: similar to adaptor-related

38348232

dual specificity phosphatase 7

89041736

pre-mRNA processing factor 31

protein complex 1 sigma 2 subunit


meiotic recombination protein SPO11

40254869

homolog

38201680

isoform b

13775200

SF3b10

62909985

hypothetical protein LOC140711


N-ethylmaleimide-sensitive factor

13775200

SF3b10

4505331

attachment protein, gamma


PREDICTED: similar to peptidylprolyl

13775200

SF3b10

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

13775200

SF3b10

38201680

isoform b
PREDICTED: similar to adaptor-related

13775200

SF3b10

89042891

13775200

SF3b10

4502743

protein complex 1 sigma 2 subunit


cyclin-dependent kinase 7
ras homolog gene family, member C

13775200

SF3b10

111494251

precursor
ras homolog gene family, member C

13775200

SF3b10

111494248

precursor
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

13775200

SF3b10

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

13775200

SF3b10

113418826

(SIG-20)

13775200

SF3b10

113414586

PREDICTED: similar to CG17293-PA

DNA replication complex GINS protein


7706367

PSF2

PREDICTED: similar to peptidylprolyl


113429091

eukaryotic translation initiation factor 3


7705433

isomerase A isoform 1
PREDICTED: similar to adaptor-related

subunit 6 interacting protein

89042891


protein complex 1 sigma 2 subunit

7662482

transmembrane protein 15

4557719

DNA ligase I
meiotic recombination protein SPO11

4826675

cyclin-dependent kinase 5

38201680

isoform b
PREDICTED: similar to peptidylprolyl

4826675

cyclin-dependent kinase 5

113429091

4826675

cyclin-dependent kinase 5

4557719

4507213

signal recognition particle 19kDa

isomerase A isoform 1
DNA ligase I

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

delta isoform of regulatory subunit B56,


5453954

protein phosphatase 2A isoform 1

PREDICTED: similar to adaptor-related


4826675

cyclin-dependent kinase 5

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4826675

cyclin-dependent kinase 5

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

35493987

ubiquitin-conjugating enzyme E2I

113429091

N-ethylmaleimide-sensitive factor
44917606

isomerase A isoform 1
PREDICTED: similar to adaptor-related

attachment protein, beta

89042891

protein complex 1 sigma 2 subunit

N-ethylmaleimide-sensitive factor
44917606

attachment protein, beta

4557719

DNA ligase I
PREDICTED: similar to adaptor-related

16945972

kelch domain containing 3

89041736

N-ethylmaleimide-sensitive factor
44917606

attachment protein, beta

35493987

ubiquitin-conjugating enzyme E2I

N-ethylmaleimide-sensitive factor
4505331
113414586

N-ethylmaleimide-sensitive factor
44917606

protein complex 1 sigma 2 subunit

attachment protein, gamma


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

attachment protein, beta

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

16945972

kelch domain containing 3

89042891

protein complex 1 sigma 2 subunit


meiotic recombination protein SPO11

35493987

ubiquitin-conjugating enzyme E2I

38201680


isoform b

16945972

kelch domain containing 3

7019405

host cell factor C2


PREDICTED: similar to adaptor-related

133925811

transportin 1 isoform 1

89042891

protein complex 1 sigma 2 subunit

133925811

transportin 1 isoform 1

23510381

transportin 1 isoform 2

ras homolog gene family, member C


111494248

meiotic recombination protein SPO11

precursor

38201680

ras homolog gene family, member C


111494251

meiotic recombination protein SPO11

precursor

38201680

ras homolog gene family, member C


111494251

isoform b

isoform b
PREDICTED: similar to adaptor-related

precursor

89042891

protein complex 1 sigma 2 subunit

ras homolog gene family, member C


111494251

precursor

4557719

DNA ligase I

ras homolog gene family, member C


111494251

precursor

118600973

RNA binding motif protein, X-linked 2

118600973

RNA binding motif protein, X-linked 2

ras homolog gene family, member C


111494248

precursor
ras homolog gene family, member C

111494251

PREDICTED: similar to adaptor-related

precursor

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

119943098

dihydropyrimidine dehydrogenase

113429091

ras homolog gene family, member C


111494251

precursor

isomerase A isoform 1
ras homolog gene family, member C

111494248

precursor

ras homolog gene family, member C


111494251

precursor

4506717

ribosomal protein S29 isoform 1

71772583

ribosomal protein S29 isoform 2

71772583

ribosomal protein S29 isoform 2

47717139

leucine-zipper-like transcription regulator 1

ras homolog gene family, member C


111494248

precursor
ras homolog gene family, member C

111494251

precursor
ras homolog gene family, member C

111494251

precursor

ras homolog gene family, member C
111494248

precursor

47717139

ras homolog gene family, member C


111494248

dehydrodolichyl diphosphate synthase

precursor

45580738

ras homolog gene family, member C


111494251

leucine-zipper-like transcription regulator 1

isoform b
dehydrodolichyl diphosphate synthase

precursor

45580738

isoform b

38201710

DEAD box polypeptide 17 isoform 1

38201710

DEAD box polypeptide 17 isoform 1

56549681

small CTD phosphatase 3 isoform 2

56549681

small CTD phosphatase 3 isoform 2

14249398

PHD-finger 5A

14249398

PHD-finger 5A

ras homolog gene family, member C


111494251

precursor
ras homolog gene family, member C

111494248

precursor
ras homolog gene family, member C

111494248

precursor
ras homolog gene family, member C

111494251

precursor
ras homolog gene family, member C

111494248

precursor
ras homolog gene family, member C

111494251

precursor
ras homolog gene family, member C

111494251

DNA directed RNA polymerase II

precursor

10863925

ras homolog gene family, member C


111494248

polypeptide L
PREDICTED: similar to adaptor-related

precursor

89042891

protein complex 1 sigma 2 subunit

ras homolog gene family, member C


111494248

precursor

4557719

ras homolog gene family, member C


111494248

DNA ligase I
PREDICTED: similar to adaptor-related

precursor

89041736

protein complex 1 sigma 2 subunit

ras homolog gene family, member C


111494248

precursor

4502859

CDC28 protein kinase 2

4506717

ribosomal protein S29 isoform 1

ras homolog gene family, member C


111494248

precursor

glucose-6-phosphate dehydrogenase
109389365

PREDICTED: similar to adaptor-related

isoform a

89042891

glucose-6-phosphate dehydrogenase
109389365

PREDICTED: similar to adaptor-related

isoform a

89041736

ras homolog gene family, member C


111494248

protein complex 1 sigma 2 subunit

protein complex 1 sigma 2 subunit


DNA directed RNA polymerase II

precursor

10863925

polypeptide L
meiotic recombination protein SPO11

51173724

bystin

38201680

isoform b
PREDICTED: similar to peptidylprolyl

51173724

bystin

113429091

isomerase A isoform 1

51173724

bystin

113414586

PREDICTED: similar to CG17293-PA

41406094

J domain containing protein 1 isoform b

13236516

Der1-like domain family, member 1


PREDICTED: similar to adaptor-related

41281768

cytochrome b-5 isoform 1

89042891

protein complex 1 sigma 2 subunit


meiotic recombination protein SPO11

41349495

DNA primase polypeptide 2

38201680

isoform b

41406094

J domain containing protein 1 isoform b

11141871

J domain containing protein 1 isoform a

41349495

DNA primase polypeptide 2

113414586

PREDICTED: similar to CG17293-PA

41281768

cytochrome b-5 isoform 1

4503183

cytochrome b-5 isoform 2


PREDICTED: similar to peptidylprolyl

41349495

DNA primase polypeptide 2

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

41281768

cytochrome b-5 isoform 1

89041736

protein complex 1 sigma 2 subunit

41406094

J domain containing protein 1 isoform b

31455614

Der1-like domain family, member 2

41406094

J domain containing protein 1 isoform b

4557719

35493996

ubiquitin-conjugating enzyme E2I

113414586

PREDICTED: similar to CG17293-PA

35494003

ubiquitin-conjugating enzyme E2I

113414586

PREDICTED: similar to CG17293-PA

DNA ligase I

PREDICTED: similar to peptidylprolyl


35494003

ubiquitin-conjugating enzyme E2I

113429091


isomerase A isoform 1

PREDICTED: similar to peptidylprolyl
35493996

ubiquitin-conjugating enzyme E2I

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

35493996

ubiquitin-conjugating enzyme E2I

38201680

isoform b
meiotic recombination protein SPO11

35494003

ubiquitin-conjugating enzyme E2I

38201680

isoform b
meiotic recombination protein SPO11

31581534

tRNA isopentenyltransferase 1

38201680

isoform b

minor histocompatibility antigen 13


30581111

isoform 3

4557719

DNA ligase I

31543831

tubulin, gamma 1

6996005

dynamin 1-like protein isoform 1

minor histocompatibility antigen 13


30581111

isoform 3

6996005

dynamin 1-like protein isoform 1

31543831

tubulin, gamma 1

4557719

DNA ligase I
PREDICTED: similar to peptidylprolyl

31581534

tRNA isopentenyltransferase 1

29826282

protein phosphatase 1G

31581534

tRNA isopentenyltransferase 1

113429091
4505999
113414586

minor histocompatibility antigen 13


30581111

89042891

minor histocompatibility antigen 13


isoform 3

19913408

DNA topoisomerase II, beta isozyme

protein phosphatase 1G
PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

isoform 3

30581111

isomerase A isoform 1

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

89041736
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to peptidylprolyl

19913408

DNA topoisomerase II, beta isozyme

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

19913408

DNA topoisomerase II, beta isozyme

38201680

isoform b
PREDICTED: similar to adaptor-related

13236516

Der1-like domain family, member 1

89041736


protein complex 1 sigma 2 subunit

PREDICTED: similar to peptidylprolyl
13236516

Der1-like domain family, member 1

113429091

isomerase A isoform 1

13236516

Der1-like domain family, member 1

113414586

PREDICTED: similar to CG17293-PA


meiotic recombination protein SPO11

13236516

Der1-like domain family, member 1

38201680

isoform b
PREDICTED: similar to adaptor-related

13236516

Der1-like domain family, member 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

10835049

ras homolog gene family, member A

89042891

ATP-binding cassette, sub-family C,


9955970

PREDICTED: similar to peptidylprolyl

member 3

113429091

nuclear factor of kappa light polypeptide


10092619

protein complex 1 sigma 2 subunit

isomerase A isoform 1
PREDICTED: similar to adaptor-related

gene enhancer in B-cells inhibitor, alpha

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

5729877

heat shock 70kDa protein 8 isoform 1

89042891

protein complex 1 sigma 2 subunit

proteasome 26S ATPase subunit 4


5729991

isoform 1

4557719

proteasome 26S ATPase subunit 4


5729991

PREDICTED: similar to adaptor-related

isoform 1

89042891

proteasome 26S ATPase subunit 4


5729991

isoform 1

DNA ligase I

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

proteasome 26S ATPase subunit 4


5729991

isoform 1

71772583

ribosomal protein S29 isoform 2

4506717

ribosomal protein S29 isoform 1

proteasome 26S ATPase subunit 4


5729991

isoform 1
proteasome 26S ATPase subunit 4

5729991

DNA directed RNA polymerase II

isoform 1

10863925

polypeptide L
PREDICTED: similar to adaptor-related

5729877
5729991

heat shock 70kDa protein 8 isoform 1

89041736
8923942

proteasome 26S ATPase subunit 4


protein complex 1 sigma 2 subunit


nucleolar protein family A, member 3

isoform 1
PREDICTED: similar to adaptor-related
6005764

GABA(A) receptor-associated protein

89042891

proteasome 26S ATPase subunit 4


5729991

PREDICTED: similar to adaptor-related

isoform 1

89041736

proteasome 26S ATPase subunit 4


5729991

protein complex 1 sigma 2 subunit

protein complex 1 sigma 2 subunit


dehydrodolichyl diphosphate synthase

isoform 1

45580738

isoform b

56549681

small CTD phosphatase 3 isoform 2

proteasome 26S ATPase subunit 4


5729991

isoform 1
ATP-binding cassette, sub-family C

4557481

PREDICTED: similar to peptidylprolyl

(CFTR/MRP), member 2

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

4507785

ubiquitin-conjugating enzyme E2I

113429091

isomerase A isoform 1

4507785

ubiquitin-conjugating enzyme E2I

113414586

PREDICTED: similar to CG17293-PA


meiotic recombination protein SPO11

4507785

ubiquitin-conjugating enzyme E2I

38201680

protein phosphatase 3, regulatory


4506025

PREDICTED: similar to peptidylprolyl

subunit B, alpha isoform 1

113429091

ubiquitin-like protein fubi and


4503659

isoform b

isomerase A isoform 1
PREDICTED: similar to Ubiquitin-like

ribosomal protein S30 precursor

113422449

protein FUBI
PREDICTED: similar to adaptor-related

4504277

H2B histone family, member Q

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4503183

cytochrome b-5 isoform 2

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4503183

cytochrome b-5 isoform 2

89042891

protein complex 1 sigma 2 subunit


protein phosphatase 1, catalytic subunit, beta

153252132

ribosomal protein L31 isoform 3

4506005

isoform 1

hypothetical protein LOC57604 isoform


153251916

153251913


hypothetical protein LOC57604 isoform 1

153252132

ribosomal protein L31 isoform 3

113414586

PREDICTED: similar to CG17293-PA

33286434

p47 protein isoform c

116256336

SEC31 homolog A isoform 4

33286434

p47 protein isoform c

6996005

dynamin 1-like protein isoform 1


PREDICTED: similar to adaptor-related

30520314

hypothetical protein LOC118812

89042891

7657339

molybdenum cofactor synthesis 3

113414586

PREDICTED: similar to CG17293-PA

coenzyme Q10 homolog A isoform b

151101384

coenzyme Q10 homolog A isoform a

151101386
73622130
7662010

protein complex 1 sigma 2 subunit

BolA-like protein 2

85797673

bolA-like protein 2B

zinc finger protein 516

10190686

zinc finger protein 286

19718751

uracil-DNA glycosylase isoform UNG2

H2A histone family, member V isoform


41406067

PREDICTED: similar to 40S ribosomal


149944735

hypothetical protein LOC728937

89041601

protein S26 isoform 1

149944735

hypothetical protein LOC728937

15011936

ribosomal protein S26


PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

88980535

protein S26
PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

88982349

protein S26
PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

113420084

protein S26
PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

89025350

protein S26 isoform 2


PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

113430282

protein S26
PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

88987217

protein S26
PREDICTED: similar to 40S ribosomal

149944735

hypothetical protein LOC728937

150010661

SEC14-like 5

113429703
89042891


protein S26

PREDICTED: similar to adaptor-related

protein complex 1 sigma 2 subunit
PREDICTED: similar to peptidylprolyl
150170706

anaphase promoting complex subunit 10

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

38201710

DEAD box polypeptide 17 isoform 1

28626498

kinesin family member C1

38201680
4557719

isoform b
DNA ligase I
PREDICTED: similar to adaptor-related

47778943

syntaxin 16 isoform a

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

38201710

DEAD box polypeptide 17 isoform 1

89041736

38201710

DEAD box polypeptide 17 isoform 1

4758496

38201710

DEAD box polypeptide 17 isoform 1

113414586

protein complex 1 sigma 2 subunit


H2A histone family, member Y isoform 2
PREDICTED: similar to CG17293-PA
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

38201710

DEAD box polypeptide 17 isoform 1

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

38201710

DEAD box polypeptide 17 isoform 1

113431146

(SIG-20)
PREDICTED: similar to peptidylprolyl

38201710

DEAD box polypeptide 17 isoform 1

113429091

isomerase A isoform 1

excision repair cross-complementing


rodent repair deficiency,
15834617

dehydrodolichyl diphosphate synthase

complementation group 2 protein

45580738

isoform b

excision repair cross-complementing


rodent repair deficiency,
15834617

complementation group 2 protein

8923942

nucleolar protein family A, member 3

excision repair cross-complementing


rodent repair deficiency,
15834617
15834617

complementation group 2 protein

47717139

excision repair cross-complementing

4557719

rodent repair deficiency,


leucine-zipper-like transcription regulator 1


DNA ligase I

complementation group 2 protein
excision repair cross-complementing
rodent repair deficiency,
15834617

DNA directed RNA polymerase II

complementation group 2 protein

10863925

polypeptide L

excision repair cross-complementing


rodent repair deficiency,
15834617

PREDICTED: similar to adaptor-related

complementation group 2 protein

89042891

protein complex 1 sigma 2 subunit

excision repair cross-complementing


rodent repair deficiency,
15834617

PREDICTED: similar to adaptor-related

complementation group 2 protein

89041736

protein complex 1 sigma 2 subunit

excision repair cross-complementing


rodent repair deficiency,
15834617

PREDICTED: similar to peptidylprolyl

complementation group 2 protein

113429091

isomerase A isoform 1

excision repair cross-complementing


rodent repair deficiency,
15834617

meiotic recombination protein SPO11

complementation group 2 protein

38201680

COX17 homolog, cytochrome c oxidase


5031645

PREDICTED: similar to adaptor-related

assembly protein

89042891

COX17 homolog, cytochrome c oxidase


5031645

isoform b

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

assembly protein

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4503719
148727247

fragile histidine triad gene

89042891

ubiquitin specific peptidase 5 isoform 2

8923942

protein complex 1 sigma 2 subunit


nucleolar protein family A, member 3
dehydrodolichyl diphosphate synthase

148727247

ubiquitin specific peptidase 5 isoform 2

45580738

coatomer protein complex, subunit


148536853

alpha isoform 2

148596961

stearoyl-CoA desaturase 4 isoform a

isoform b
PREDICTED: similar to adaptor-related

89042891
148596938

coatomer protein complex, subunit

protein complex 1 sigma 2 subunit


stearoyl-CoA desaturase 4 isoform b
PREDICTED: similar to peptidylprolyl

148536853

alpha isoform 2

113429091

145275210

RNA processing factor 1

113418826


isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)
145275210

RNA processing factor 1

113431146

(SIG-20)

145275210

RNA processing factor 1

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

145275187

tRNA-(N1G37) methyltransferase

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

145275210

RNA processing factor 1

113429091

isomerase A isoform 1
meiotic recombination protein SPO11

145275210

RNA processing factor 1

38201680

126723390

ankyrin repeat domain 24

121582655

124256496

heat shock 70kDa protein 1-like

34419635

isoform b
ankyrin repeat domain 35
heat shock 70kDa protein 6 (HSP70B')
PREDICTED: similar to adaptor-related

126723390

ankyrin repeat domain 24

89041736

dual specificity phosphatase 27


122937243

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

(putative)

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

121582655

ankyrin repeat domain 35

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

118600973

RNA binding motif protein, X-linked 2

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

118600973

RNA binding motif protein, X-linked 2

113431146

(SIG-20)

trafficking protein particle complex 6B


118600991

isoform 1

13129120

trafficking protein particle complex 6B


118600991

isoform 1

118498359

ribosomal L1 domain containing 1

trafficking protein particle complex 6A


meiotic recombination protein SPO11

38201680
113414586


isoform b
PREDICTED: similar to CG17293-PA

PREDICTED: similar to peptidylprolyl
118600973

RNA binding motif protein, X-linked 2

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

trafficking protein particle complex 6B


118600991

isoform 1
RER1 retention in endoplasmic

116812591

reticulum 1

62909985

hypothetical protein LOC140711

RER1 retention in endoplasmic


116812591

reticulum 1

113414586

RER1 retention in endoplasmic


116812591

meiotic recombination protein SPO11

reticulum 1

38201680

RER1 retention in endoplasmic


116812591

reticulum 1

PREDICTED: similar to CG17293-PA

isoform b
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal

RER1 retention in endoplasmic


116812591

reticulum 1

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal

RER1 retention in endoplasmic


116812591

reticulum 1

protein L26 (Silica-induced gene 20 protein)


113418826

ATP-binding cassette, sub-family A


116734710

PREDICTED: similar to adaptor-related

member 3

89042891

ATP-binding cassette, sub-family A


116734710

member 3

116256336

SEC31 homolog A isoform 4

(SIG-20)

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

89041736
110347439

protein complex 1 sigma 2 subunit


zinc finger protein 225
protein phosphatase 1, catalytic subunit, beta

116256336

SEC31 homolog A isoform 4

4506005

115387112

ubiquitin-like 5

13236510

115387112

ubiquitin-like 5

113414586

proto-oncogene tyrosine-protein kinase


112382244
112382241

isoform 1
ubiquitin-like 5
PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

FGR

89042891
89042891

proto-oncogene tyrosine-protein kinase


protein complex 1 sigma 2 subunit

PREDICTED: similar to adaptor-related

FGR

protein complex 1 sigma 2 subunit


PREDICTED: similar to Ubiquitin-conjugating enzyme E2S (Ubiquitin-conjugating enzyme E2-24 kDa) (Ubiquitin-protein ligase) (Ubiquitin carrier protein)

112382377

ubiquitin-conjugating enzyme E2S

113430896

SEC24 (S. cerevisiae) homolog B


112382212

PREDICTED: similar to adaptor-related

isoform a

89042891

SEC24 (S. cerevisiae) homolog B


112382212

isoform a

FGR

89041736

FGR

113429091

113429091

containing 1 isoform 1

89041736

89042891

serine/threonine kinase 24 (STE20


homolog, yeast) isoform b

110347439
110347439

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

containing 1 isoform 1

110349738

isomerase A isoform 1
PREDICTED: similar to adaptor-related

ankyrin repeat and FYVE domain


110815813

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

ankyrin repeat and FYVE domain


110815813

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

proto-oncogene tyrosine-protein kinase


112382241

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

proto-oncogene tyrosine-protein kinase


112382244

(E2-EPF5)

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

89042891

protein complex 1 sigma 2 subunit

zinc finger protein 225

6996005

dynamin 1-like protein isoform 1

zinc finger protein 225

10190696

zinc finger protein 304

110347439

zinc finger protein 225

MYC-associated zinc finger protein


110347459

isoform 2

PREDICTED: similar to adaptor-related


110349799

testis-specific protein kinase 2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to zinc finger protein

110347439

zinc finger protein 225

113413881

110347439

zinc finger protein 225

10190686

110349799

testis-specific protein kinase 2

89041736


114
zinc finger protein 286

PREDICTED: similar to adaptor-related

protein complex 1 sigma 2 subunit
109452595

zinc finger protein 205

109452593

zinc finger protein 205

109255245

serine/threonine kinase 17a

113414586

PREDICTED: similar to CG17293-PA

109255245

serine/threonine kinase 17a

6996005

ATP-binding cassette, sub-family E,


108773782

PREDICTED: similar to adaptor-related

member 1

89042891

ATP-binding cassette, sub-family E,


108773784

dynamin 1-like protein isoform 1

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

member 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

95147356

mitogen-activated protein kinase 15

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

95147356

mitogen-activated protein kinase 15

89041736

vesicle-associated membrane protein-
94721250

PREDICTED: similar to adaptor-related

associated protein A isoform 1

89041736

nuclear LIM interactor-interacting factor


93004102

89042891

nuclear LIM interactor-interacting factor


2

93141204

methyltransferase like 2B

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

93004102

protein complex 1 sigma 2 subunit

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

89041736
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

89145417

methyltransferase like 7A

89042891

eukaryotic translation initiation factor


84043963

5B

PREDICTED: similar to peptidylprolyl


113429091

nucleolar protein family A, member 2


77812674

protein complex 1 sigma 2 subunit

isomerase A isoform 1
nucleolar protein family A, member 2

isoform b

8923444

isoform a
PREDICTED: similar to adaptor-related

77812670

exosome component 9 isoform 2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

77812670

exosome component 9 isoform 2

89041736

protein complex 1 sigma 2 subunit

myosin head domain containing 1
75812980

isoform 3

PREDICTED: similar to peptidylprolyl


113429091

myosin head domain containing 1


75812980

PREDICTED: similar to adaptor-related

isoform 3

89042891

digestive-organ expansion factor


75677335

homolog

isoform 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

hypothetical protein LOC112812


72534754

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

hypothetical protein LOC112812


72534754

isomerase A isoform 1

isomerase A isoform 1
meiotic recombination protein SPO11

isoform 1

38201680

isoform b

hypothetical protein LOC112812


72534754

isoform 1

71772583

ribosomal protein S29 isoform 2

4758496

71772583

ribosomal protein S29 isoform 2

32130516

113414586

PREDICTED: similar to CG17293-PA


H2A histone family, member Y isoform 2
serologically defined colon cancer antigen 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

71772583

ribosomal protein S29 isoform 2

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

71772583

ribosomal protein S29 isoform 2

113418826

(SIG-20)

71772583

ribosomal protein S29 isoform 2

62909985

71772583

ribosomal protein S29 isoform 2

113414586

PREDICTED: similar to CG17293-PA

71772583

ribosomal protein S29 isoform 2

19718751

uracil-DNA glycosylase isoform UNG2

hypothetical protein LOC140711

ubiquitin-conjugating enzyme E2D 4


71772583

ribosomal protein S29 isoform 2

8393719

peroxisomal enoyl-coenzyme A

(putative)
PREDICTED: similar to adaptor-related

70995211

hydratase-like protein

71772583

ribosomal protein S29 isoform 2

4506717

71772583

ribosomal protein S29 isoform 2

38016127

89042891

protein complex 1 sigma 2 subunit


ribosomal protein S29 isoform 1
RNA binding motif protein 34

71772583

ribosomal protein S29 isoform 2

4502743

cyclin-dependent kinase 7

71772583

ribosomal protein S29 isoform 2

4557719

DNA ligase I
meiotic recombination protein SPO11

71772583

ribosomal protein S29 isoform 2

38201680

isoform b
PREDICTED: similar to peptidylprolyl

71772583

ribosomal protein S29 isoform 2

113429091

isomerase A isoform 1

68509270

transcriptional adaptor 2-like isoform a

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

68303635

mutS homolog 3

89042891

protein complex 1 sigma 2 subunit

68226422

Yip1 domain family, member 5

32401427

Yip1 domain family, member 5


PREDICTED: similar to adaptor-related

62955833

DNA-damage inducible protein 2

89042891

serologically defined colon cancer


64276486

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

antigen 10

89042891

serologically defined colon cancer

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

64276486

antigen 10

89041736

protein complex 1 sigma 2 subunit

62955833

DNA-damage inducible protein 2

48717485

DDI1, DNA-damage inducible 1, homolog 1


PREDICTED: similar to peptidylprolyl

62865890

dual specificity phosphatase 5

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

62460637

importin 4

89041736

protein complex 1 sigma 2 subunit


dehydrodolichyl diphosphate synthase

62865890

dual specificity phosphatase 5

45580738

62865890

dual specificity phosphatase 5

4557719

isoform b
DNA ligase I
ubiquitin-conjugating enzyme E2D 4

62865890

dual specificity phosphatase 5

8393719

(putative)
PREDICTED: similar to adaptor-related

62865890

dual specificity phosphatase 5

89042891

62865890

dual specificity phosphatase 5

8923942

protein complex 1 sigma 2 subunit


nucleolar protein family A, member 3

PREDICTED: similar to adaptor-related
62865890

dual specificity phosphatase 5

89041736

protein complex 1 sigma 2 subunit

62865890

dual specificity phosphatase 5

56549681

small CTD phosphatase 3 isoform 2

62240994

cysteinyl-tRNA synthetase isoform d

62240992

cysteinyl-tRNA synthetase isoform c

62234438

Notchless gene homolog isoform b

41350318

myotubularin-related protein 2 isoform 2

62234438

Notchless gene homolog isoform b

44680154

myotubularin-related protein 2 isoform 1

62234461

Notchless gene homolog isoform a

41350318

myotubularin-related protein 2 isoform 2

62234461

Notchless gene homolog isoform a

62234438

Notchless gene homolog isoform b

62234438

Notchless gene homolog isoform b

4502703

62234438

Notchless gene homolog isoform b

21536371

telomerase-associated protein 1

62234461

Notchless gene homolog isoform a

44680154

myotubularin-related protein 2 isoform 1

58533179

trafficking protein particle complex 2

7657548

cell division cycle 6 protein

trafficking protein particle complex 2


meiotic recombination protein SPO11

60279265

Sec61 gamma subunit

38201680

isoform b
meiotic recombination protein SPO11

58533179

trafficking protein particle complex 2

60279265

Sec61 gamma subunit

38201680
7657546

isoform b
Sec61 gamma subunit
PREDICTED: similar to adaptor-related

60279265

Sec61 gamma subunit

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

58533179

trafficking protein particle complex 2

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

58533179

trafficking protein particle complex 2

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

58533179

trafficking protein particle complex 2

113418826

(SIG-20)

58533179

trafficking protein particle complex 2

113414586

PREDICTED: similar to CG17293-PA

57165436

serine/threonine kinase 16

57165434

serine/threonine kinase 16
PREDICTED: similar to peptidylprolyl

56549681

small CTD phosphatase 3 isoform 2

88943062

isomerase A (cyclophilin A)-like 4


PREDICTED: similar to peptidylprolyl

56549681

small CTD phosphatase 3 isoform 2

113423887

isomerase A isoform 1
PREDICTED: similar to adaptor-related

56549681

small CTD phosphatase 3 isoform 2

89041736

protein complex 1 sigma 2 subunit

56549681

small CTD phosphatase 3 isoform 2

31543091

RNA binding motif protein 13

56549681

small CTD phosphatase 3 isoform 2

22035624

phosphatidate cytidylyltransferase 1
PREDICTED: similar to adaptor-related

56549683

small CTD phosphatase 3 isoform 1

89042891

56549681

small CTD phosphatase 3 isoform 2

4502743

protein complex 1 sigma 2 subunit


cyclin-dependent kinase 7
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

56549681

small CTD phosphatase 3 isoform 2

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

56549681

small CTD phosphatase 3 isoform 2

113431146

(SIG-20)
PREDICTED: similar to peptidylprolyl

56549681

small CTD phosphatase 3 isoform 2

113422777

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

56549681

small CTD phosphatase 3 isoform 2

89042897

isomerase A isoform 1

56549681

small CTD phosphatase 3 isoform 2

38016127

RNA binding motif protein 34


meiotic recombination protein SPO11

56549681

small CTD phosphatase 3 isoform 2

38201680

56549681

small CTD phosphatase 3 isoform 2

5729840

isoform b
tubulin, gamma complex associated protein 2
ubiquitin-conjugating enzyme E2D 4

56549681

small CTD phosphatase 3 isoform 2

8393719

(putative)
PREDICTED: similar to peptidylprolyl

56549681

small CTD phosphatase 3 isoform 2

88943041

isomerase A (cyclophilin A)-like 4

PREDICTED: similar to peptidylprolyl
56549681

small CTD phosphatase 3 isoform 2

88953813

isomerase A isoform 1
PREDICTED: similar to TBC1 domain
family member 3 (Rab GTPase-activating
protein PRC17) (Prostate cancer gene 17

56549681

small CTD phosphatase 3 isoform 2

113426831

56549681

small CTD phosphatase 3 isoform 2

4557719

CCR4-NOT transcription complex,


56550059

protein) (TRE17 alpha protein) isoform 1


DNA ligase I
CCR4-NOT transcription complex, subunit 4

subunit 4 isoform b

56550057

isoform a
PREDICTED: similar to adaptor-related

56699411

solute carrier family 35, member E2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

56549683

small CTD phosphatase 3 isoform 1

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

56549681

small CTD phosphatase 3 isoform 2

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

56549681

small CTD phosphatase 3 isoform 2

89042891

56549681

small CTD phosphatase 3 isoform 2

4758496

protein complex 1 sigma 2 subunit


H2A histone family, member Y isoform 2
meiotic recombination protein SPO11

56549681

small CTD phosphatase 3 isoform 2

6912680

isoform a
PREDICTED: similar to adaptor-related

56699411

solute carrier family 35, member E2

89041736

protein complex 1 sigma 2 subunit

56549681

small CTD phosphatase 3 isoform 2

62909985

hypothetical protein LOC140711


PREDICTED: similar to adaptor-related

55956895

CGI-01 protein isoform 3

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

56549113

debranching enzyme homolog 1

89042891

56118223

choline/ethanolaminephosphotransferase

5174415

protein complex 1 sigma 2 subunit


choline/ethanolaminephosphotransferase
PREDICTED: similar to adaptor-related

56549113

debranching enzyme homolog 1

89041736

protein complex 1 sigma 2 subunit

SMT3 suppressor of mif two 3 homolog
54792071

SMT3 suppressor of mif two 3 homolog 2

2 isoform b precursor

54792069

isoform a precursor

oxoglutarate (alpha-ketoglutarate)
dehydrogenase (lipoamide) isoform 1
51873036

precursor

51944950

phosducin-like 2

PREDICTED: similar to adaptor-related


89042891
113414586

DnaJ (Hsp40) homolog, subfamily B,


50593537

member 12

41054844

5-phosphatase, A isoform 2

89041736

chromosomes 4-like 1

89042891

chromosomes 4-like 1

89042891

5-phosphatase, A isoform 2

89042891

5-phosphatase, A isoform 2

chromosomes 4-like 1

18765707

chromosomes 4-like 1

113429091

113429091

subunit 11 isoform 2

50409781

subunit 11 isoform 2

18777675

11 isoform 2
APC11 anaphase promoting complex subunit

subunit 11 isoform 2

50409750

APC11 anaphase promoting complex


50409796

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409789

isomerase A isoform 1
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409796

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

APC11 anaphase promoting complex


50409789

phosphatase isoform 2
PREDICTED: similar to peptidylprolyl

SMC4 structural maintenance of


50658065

protein complex 1 sigma 2 subunit


skeletal muscle and kidney enriched inositol

SMC4 structural maintenance of


50658063

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

phosphatidylinositol (4,5) bisphosphate


50726960

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

phosphatidylinositol (4,5) bisphosphate


50726960

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

SMC4 structural maintenance of


50658063

member 12
PREDICTED: similar to adaptor-related

SMC4 structural maintenance of


50658065

PREDICTED: similar to CG17293-PA


DnaJ (Hsp40) homolog, subfamily B,

phosphatidylinositol (4,5) bisphosphate


50726960

protein complex 1 sigma 2 subunit

11 isoform 2
APC11 anaphase promoting complex subunit

subunit 11 isoform 2

50409750

11 isoform 2

APC11 anaphase promoting complex
50409796

APC11 anaphase promoting complex subunit

subunit 11 isoform 2

50409789

APC11 anaphase promoting complex


50409804

APC11 anaphase promoting complex subunit

subunit 11 isoform 2

50409750

APC11 anaphase promoting complex


50409804

subunit 11 isoform 2

50409789

subunit 11 isoform 2

18777675

subunit 11 isoform 2

50409781

subunit 11 isoform 2

50409796

subunit 11 isoform 2

18777675

subunit 11 isoform 2

50409781

11 isoform 2
PREDICTED: similar to adaptor-related

polypeptide A'

89042891

NAD(P)H:quinone oxidoreductase type


49574502

11 isoform 2
APC11 anaphase promoting complex subunit

small nuclear ribonucleoprotein


50593002

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409796

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409789

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409804

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409804

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409804

11 isoform 2

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

3, polypeptide A2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

50083277

ATPase class I type 8B member 4

89042891

APC11 anaphase promoting complex


50409781

APC11 anaphase promoting complex subunit

subunit 11 isoform 2

50409750

APC11 anaphase promoting complex


50409781

subunit 11 isoform 2

18777675

11 isoform 2
APC11 anaphase promoting complex subunit

subunit 11 isoform 2

18777675

NAD(P)H:quinone oxidoreductase type


49574502

11 isoform 2
APC11 anaphase promoting complex subunit

APC11 anaphase promoting complex


50409750

protein complex 1 sigma 2 subunit

11 isoform 2
PREDICTED: similar to adaptor-related

3, polypeptide A2

89041736

protein complex 1 sigma 2 subunit

DDI1, DNA-damage inducible 1,
48717485

PREDICTED: similar to adaptor-related

homolog 1

89042891

DDI1, DNA-damage inducible 1,


48717485

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

homolog 1

89041736

protein complex 1 sigma 2 subunit

6996005

dynamin 1-like protein isoform 1

leucine-zipper-like transcription
47717139

regulator 1
leucine-zipper-like transcription

47717139

meiotic recombination protein SPO11

regulator 1

38201680

isoform b

leucine-zipper-like transcription
47717139

regulator 1

4758496

H2A histone family, member Y isoform 2

4557719

DNA ligase I

leucine-zipper-like transcription
47717139

regulator 1
solute carrier family 25 member 3

47132595

dehydrodolichyl diphosphate synthase

isoform b precursor

45580738

isoform b

dehydrodolichyl diphosphate synthase


45580738

isoform b

113414586

PREDICTED: similar to CG17293-PA

dehydrodolichyl diphosphate synthase


45580738

isoform b

21536371

telomerase-associated protein 1

23397458

kinesin family member 19

dehydrodolichyl diphosphate synthase


45580738

isoform b
dehydrodolichyl diphosphate synthase

45580742

dehydrodolichyl diphosphate synthase

isoform a

45580738

isoform b

dehydrodolichyl diphosphate synthase


45580742

isoform a

113414586

dehydrodolichyl diphosphate synthase


45580738

isoform b

PREDICTED: similar to peptidylprolyl


113423887

dehydrodolichyl diphosphate synthase


45580738

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

isoform b

88943062

dehydrodolichyl diphosphate synthase


45580738

PREDICTED: similar to CG17293-PA

isomerase A (cyclophilin A)-like 4


PREDICTED: similar to peptidylprolyl

isoform b

89042897

isomerase A isoform 1

dehydrodolichyl diphosphate synthase
45580738

isoform b

PREDICTED: similar to peptidylprolyl


113422777

isomerase A isoform 1

dehydrodolichyl diphosphate synthase


45580738

isoform b

5729840

dehydrodolichyl diphosphate synthase


45580738

tubulin, gamma complex associated protein 2


meiotic recombination protein SPO11

isoform b

38201680

isoform b

dehydrodolichyl diphosphate synthase


45580738

isoform b

4506707

ribosomal protein S25


PREDICTED: similar to 60S ribosomal

dehydrodolichyl diphosphate synthase


45580738

isoform b

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal

dehydrodolichyl diphosphate synthase


45580738

isoform b

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)

dehydrodolichyl diphosphate synthase


45580738

isoform b

41872631

fatty acid synthase

38016127

RNA binding motif protein 34

dehydrodolichyl diphosphate synthase


45580738

isoform b
dehydrodolichyl diphosphate synthase

45580738

solute carrier family 25 member 3 isoform b

isoform b

4505775

dehydrodolichyl diphosphate synthase


45580742

isoform a

PREDICTED: similar to peptidylprolyl


113429091

dehydrodolichyl diphosphate synthase


45580738

precursor

isomerase A isoform 1
PREDICTED: similar to adaptor-related

isoform b

89042891

protein complex 1 sigma 2 subunit

32130516

serologically defined colon cancer antigen 1

dehydrodolichyl diphosphate synthase


45580738

isoform b
dehydrodolichyl diphosphate synthase

45580738

isoform b

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

dehydrodolichyl diphosphate synthase


45580738
45580738

isoform b

dehydrodolichyl diphosphate synthase

62909985
113426831

hypothetical protein LOC140711


PREDICTED: similar to TBC1 domain
family member 3 (Rab GTPase-activating

isoform b

protein PRC17) (Prostate cancer gene 17


protein) (TRE17 alpha protein) isoform 1

protein phosphatase 1, catalytic subunit,


46249376

protein phosphatase 1, catalytic subunit, beta

beta isoform 1

4506005

isoform 1

dehydrodolichyl diphosphate synthase


45580738

isoform b

17978477

vacuolar protein sorting 11

dehydrodolichyl diphosphate synthase


45580738

isoform b

4557719

dehydrodolichyl diphosphate synthase


45580742

DNA ligase I
meiotic recombination protein SPO11

isoform a

38201680

isoform b

dehydrodolichyl diphosphate synthase


45580738

isoform b

4758496

H2A histone family, member Y isoform 2

4502743

cyclin-dependent kinase 7

dehydrodolichyl diphosphate synthase


45580738

isoform b
dehydrodolichyl diphosphate synthase

45580738

isoform b

42516576

dehydrodolichyl diphosphate synthase


45580738

WW domain-containing oxidoreductase

isoform b

7706523

dehydrodolichyl diphosphate synthase


45580738

glutaredoxin 5

isoform 1
WW domain-containing oxidoreductase

isoform b

18860884

isoform 2

dehydrodolichyl diphosphate synthase


45580738

isoform b

7705369

coatomer protein complex, subunit beta


PREDICTED: similar to 60S ribosomal

dehydrodolichyl diphosphate synthase


45580742

isoform a

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal

dehydrodolichyl diphosphate synthase


45580742

isoform a

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)

dehydrodolichyl diphosphate synthase


45580738
45580738

isoform b

dehydrodolichyl diphosphate synthase

31455614

Der1-like domain family, member 2

23510381

transportin 1 isoform 2

isoform b
dehydrodolichyl diphosphate synthase
45580738

PREDICTED: similar to peptidylprolyl

isoform b

88953813

dehydrodolichyl diphosphate synthase


45580738

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

isoform b

88943041

isomerase A (cyclophilin A)-like 4


PREDICTED: similar to adaptor-related

45238849

poly(A) binding protein, cytoplasmic 3

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal

myotubularin-related protein 2 isoform


44680154

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

myotubularin-related protein 2 isoform


44680154

113418826

(SIG-20)

113414586

PREDICTED: similar to CG17293-PA

myotubularin-related protein 2 isoform


44680154

1
myotubularin-related protein 2 isoform

44680154

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

myotubularin-related protein 2 isoform


44680154

41350318

myotubularin-related protein 2 isoform 2

19923424

myotubularin-related protein 9

21536371

telomerase-associated protein 1

myotubularin-related protein 2 isoform


44680154

1
myotubularin-related protein 2 isoform

44680154

1
transcription elongation factor A 1

45439355

isoform 2

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

45238849

poly(A) binding protein, cytoplasmic 3

89042891

protein complex 1 sigma 2 subunit

28872761

myotubularin-related protein 1

myotubularin-related protein 2 isoform


44680154

1
acyl-CoA synthetase long-chain family

42794754

acyl-CoA synthetase long-chain family

member 3

42794752

member 3

PREDICTED: similar to adaptor-related
42516563

UDP-glucuronate decarboxylase 1

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

41872631

fatty acid synthase

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

42516563

UDP-glucuronate decarboxylase 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

42516576

glutaredoxin 5

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

41872631

fatty acid synthase

89041736

protein complex 1 sigma 2 subunit

19923424

myotubularin-related protein 9

myotubularin-related protein 2 isoform


41350318

2
myotubularin-related protein 2 isoform

41350316

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal

myotubularin-related protein 2 isoform


41350318

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

myotubularin-related protein 2 isoform


41350318

113431146

myotubularin-related protein 2 isoform


41350318

(SIG-20)
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

myotubularin-related protein 2 isoform


41350318

21536371

telomerase-associated protein 1

28872761

myotubularin-related protein 1

myotubularin-related protein 2 isoform


41350318

2
myotubularin-related protein 2 isoform

41350318

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

41327715

p53-related protein kinase

113429091

41327715

p53-related protein kinase

89042891

isomerase A isoform 1

PREDICTED: similar to adaptor-related

protein complex 1 sigma 2 subunit
PREDICTED: similar to adaptor-related
41349441

SEC31 homolog A isoform 2

41327715

p53-related protein kinase

89042891
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

41349441

SEC31 homolog A isoform 2

89041736

ubiquitin-conjugating enzyme E2
40806167

variant 1 isoform a

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

ubiquitin-conjugating enzyme E2
40806167

variant 1 isoform a
ubiquitin-conjugating enzyme E2

40806167

meiotic recombination protein SPO11

variant 1 isoform a

38201680

isoform b

transmembrane emp24 protein transport


39725636

domain containing 9

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to 60S ribosomal

transmembrane emp24 protein transport


39725636

domain containing 9

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal

transmembrane emp24 protein transport


39725636

domain containing 9

protein L26 (Silica-induced gene 20 protein)


113431146

transmembrane emp24 protein transport

(SIG-20)
PREDICTED: similar to peptidylprolyl

39725636

domain containing 9

113429091

isomerase A isoform 1

38708309

hypothetical protein LOC51029

113428755

PREDICTED: similar to CG7222-PA


PREDICTED: similar to adaptor-related

38327644

hypothetical protein LOC57707

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

38327644

hypothetical protein LOC57707

89042891

protein complex 1 sigma 2 subunit

38327644

hypothetical protein LOC57707

62909985

hypothetical protein LOC140711

meiotic recombination protein SPO11


38201680

PREDICTED: similar to ribosomal protein

isoform b

29742309

L31

meiotic recombination protein SPO11
38201680

isoform b

PREDICTED: similar to ribosomal protein


113427093

L31

meiotic recombination protein SPO11


38201680

isoform b

7706343

hypothetical protein LOC51647

4504221

guanylate kinase 1

4506193

proteasome beta 1 subunit

7657198

dimethyladenosine transferase

4507797

ubiquitin-conjugating enzyme E2 variant 2

4506699

ribosomal protein S21

4506643

ribosomal protein L37a

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

13129120

trafficking protein particle complex 6A

meiotic recombination protein SPO11


38201680

isoform b

7706423

U6 snRNA-associated Sm-like protein LSm7

4557719

DNA ligase I

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

DNA directed RNA polymerase II

isoform b

10863925

polypeptide L

14249398

PHD-finger 5A

62909985

hypothetical protein LOC140711

7705477

hypothetical protein LOC51504

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

meiotic recombination protein SPO11
38201680

guanine nucleotide-binding protein, beta-1

isoform b

11321585

meiotic recombination protein SPO11


38201680

subunit
SWI/SNF-related matrix-associated actin-

isoform b

21071060

dependent regulator of chromatin a-like 1

meiotic recombination protein SPO11


38201680

isoform b

4506631

ribosomal protein L30

meiotic recombination protein SPO11


38201680

isoform b

15150809

SEC11-like 3

meiotic recombination protein SPO11


38201680

isoform b

7657546

Sec61 gamma subunit

8923475

thioredoxin-like 4B

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

PREDICTED: similar to postmeiotic


113418682

segregation increased 2-like 2

meiotic recombination protein SPO11


38201680

isoform b

4758384

FK506 binding protein 5

8922905

RIO kinase 2

4506701

ribosomal protein S23

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

18105063

vacuolar protein sorting 45A


PREDICTED: similar to 60S ribosomal

meiotic recombination protein SPO11


38201680

isoform b

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal

meiotic recombination protein SPO11


38201680

isoform b

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)

meiotic recombination protein SPO11


38201680
38201680

isoform b

8923942
4502643

meiotic recombination protein SPO11

nucleolar protein family A, member 3

chaperonin containing TCP1, subunit 6A


38201680

isoform b

isoform a

meiotic recombination protein SPO11

phosphatidylinositol glycan anchor

isoform b

4758922

biosynthesis, class L

meiotic recombination protein SPO11


38201680

isoform b

15431295

ribosomal protein L13

15431297

ribosomal protein L13

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

meiotic recombination protein SPO11


38201680

isoform b

7706667

trafficking protein particle complex 4

4507311

suppressor of Ty 4 homolog 1

meiotic recombination protein SPO11


38201680

isoform b

PREDICTED: similar to 60S ribosomal


meiotic recombination protein SPO11
38201680

isoform b

protein L29 (Cell surface heparin-binding


113428574

protein HIP)

meiotic recombination protein SPO11


38201680

isoform b

32189369

DNA polymerase epsilon subunit 2

meiotic recombination protein SPO11


38201680

isoform b

4503729

meiotic recombination protein SPO11


38201680

FK506-binding protein 4
PREDICTED: similar to 40S ribosomal

isoform b

89035017

protein S28 isoform 2

meiotic recombination protein SPO11


38201680

isoform b

4507873

meiotic recombination protein SPO11


38201680

small nuclear ribonucleoprotein polypeptide

isoform b

4507129

meiotic recombination protein SPO11


38201680

von Hippel-Lindau binding protein 1

E
solute carrier family 2 (facilitated glucose

isoform b

8923733

transporter), member 6

meiotic recombination protein SPO11


38201680

isoform b

113414586

PREDICTED: similar to CG17293-PA

meiotic recombination protein SPO11
38201680

isoform b

4502859

CDC28 protein kinase 2

4506609

ribosomal protein L19

meiotic recombination protein SPO11


38201680

isoform b

CTD (carboxy-terminal domain, RNA


meiotic recombination protein SPO11
38201680

polymerase II, polypeptide A) small

isoform b

32813443

meiotic recombination protein SPO11


38201680

isoform b

PREDICTED: similar to ribosomal protein


113427613

meiotic recombination protein SPO11


38201680

phosphatase 1 isoform 2

L31
meiotic recombination protein SPO11

isoform b

6912680

isoform a

7657548

trafficking protein particle complex 2

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

PREDICTED: similar to large subunit


113427529

meiotic recombination protein SPO11


38201680

ribosomal protein L36a


PREDICTED: similar to 40S ribosomal

isoform b

51467029

protein S26

10864021

trafficking protein particle complex 1

meiotic recombination protein SPO11


38201680

isoform b
meiotic recombination protein SPO11

38201680

isoform b

PREDICTED: similar to DNA primase large


113418084

meiotic recombination protein SPO11


38201680

isoform b

subunit, 58kDa
PREDICTED: similar to DNA primase large

113418086

subunit, 58kDa

meiotic recombination protein SPO11


38201680

isoform b

4506717

small nuclear ribonucleoprotein


38149981

ribosomal protein S29 isoform 1


PREDICTED: similar to adaptor-related

polypeptide B''

89042891

protein complex 1 sigma 2 subunit

meiotic recombination protein SPO11


38201680

isoform b

4502743

cyclin-dependent kinase 7

5729840

tubulin, gamma complex associated protein 2

meiotic recombination protein SPO11


38201680

isoform b

meiotic recombination protein SPO11
38201680

isoform b

4504523

meiotic recombination protein SPO11


38201680

isoform b

heat shock 10kDa protein 1 (chaperonin 10)


PREDICTED: similar to 40S ribosomal

113419590

protein S28

meiotic recombination protein SPO11


38201680

isoform b

4506651

meiotic recombination protein SPO11


38201680

isoform b

ribosomal protein L36a-like protein


PREDICTED: similar to large subunit

113427044

ribosomal protein L36a

meiotic recombination protein SPO11


38201680

isoform b

23397458

meiotic recombination protein SPO11


38201680

RNA, U3 small nucleolar interacting protein

isoform b

4759276

chaperone, ABC1 activity of bc1


34147522

complex like precursor

34147513

RAB7, member RAS oncogene family

2
PREDICTED: similar to adaptor-related

89042891
113414586

chaperone, ABC1 activity of bc1


34147522

kinesin family member 19

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

complex like precursor

89041736

protein complex 1 sigma 2 subunit

glycerophosphodiester
32698962

phosphodiesterase domain containing 1

4557719

DNA ligase I

ubiquitin-conjugating enzyme E2A


32967278

isoform 3

32967276

ubiquitin-conjugating enzyme E2A isoform 2

32967280

ubiquitin-conjugating enzyme E2A isoform 1

ubiquitin-conjugating enzyme E2A


32967278

isoform 3
serologically defined colon cancer

32130516

antigen 1

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

31542547

dullard homolog

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

31542507
32130516

HORMA domain containing 1

113429091
10863925

serologically defined colon cancer

isomerase A isoform 1

DNA directed RNA polymerase II

32130516

antigen 1

polypeptide L

serologically defined colon cancer

skeletal muscle and kidney enriched inositol

antigen 1

18765707

phosphatase isoform 2
PREDICTED: similar to adaptor-related

31542547

dullard homolog

89042891

protein complex 1 sigma 2 subunit

serologically defined colon cancer


32130516

antigen 1

29553970

H2A histone family, member J

30425538

zinc finger, DHHC-type containing 21

4506717

ribosomal protein S29 isoform 1

63029935

H2A histone family, member B3

4557719

DNA ligase I
PREDICTED: similar to adaptor-related

31455614

Der1-like domain family, member 2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

31455614

Der1-like domain family, member 2

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

30410779

huntingtin interacting protein B

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

29553970

H2A histone family, member J

89041736

protein complex 1 sigma 2 subunit

29553970

H2A histone family, member J

63029943

H2A histone family, member B2


PREDICTED: similar to peptidylprolyl

28872761

myotubularin-related protein 1

113429091

28872761

myotubularin-related protein 1

19923424

isomerase A isoform 1
myotubularin-related protein 9
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

28872761

myotubularin-related protein 1

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

28872761

myotubularin-related protein 1

113431146

(SIG-20)
PREDICTED: similar to adaptor-related

29553970
28827774

H2A histone family, member J

89042891
89042891

dual-specificity tyrosine-(Y)-

protein complex 1 sigma 2 subunit

PREDICTED: similar to adaptor-related

phosphorylation regulated kinase 4
28872761

myotubularin-related protein 1

protein complex 1 sigma 2 subunit


113414586

PRP38 pre-mRNA processing factor 38


24762236

PREDICTED: similar to adaptor-related

(yeast) domain containing A

89042891

zinc finger, DHHC-type containing 14


24371272

isoform 2

24430186

phosphatidylinositol glycan, class C

PREDICTED: similar to CG17293-PA

protein complex 1 sigma 2 subunit


zinc finger, DHHC-type containing 14

24371241
4505795

isoform 1
phosphatidylinositol glycan, class C
PREDICTED: similar to adaptor-related

24430186

phosphatidylinositol glycan, class C

89042891

protein complex 1 sigma 2 subunit

23510381

transportin 1 isoform 2

8923942

nucleolar protein family A, member 3

23397458

kinesin family member 19

8923942

nucleolar protein family A, member 3


DNA directed RNA polymerase II

23397458

kinesin family member 19

10863925

polypeptide L
PREDICTED: similar to adaptor-related

23510381

transportin 1 isoform 2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

23397458

kinesin family member 19

23199991

casein kinase 1 epsilon

89042891
4503093

protein complex 1 sigma 2 subunit


casein kinase 1 epsilon
PREDICTED: similar to peptidylprolyl

22202633

prefoldin subunit 5 isoform alpha

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

22035624

phosphatidate cytidylyltransferase 1

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

21624654

spermatogenesis associated 5

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

21362110

thiamin pyrophosphokinase 1 isoform a

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

21624654

spermatogenesis associated 5

89042891

21362110

thiamin pyrophosphokinase 1 isoform a

89042891

protein complex 1 sigma 2 subunit

PREDICTED: similar to adaptor-related

protein complex 1 sigma 2 subunit
PREDICTED: similar to peptidylprolyl
21450653

zinc finger, DHHC-type containing 15

113429091

isomerase A isoform 1

21361144

proteasome 26S ATPase subunit 3

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

21361376

splicing factor 3a, subunit 2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

21361144

proteasome 26S ATPase subunit 3

113429091

isomerase A isoform 1

protein disulfide isomerase-associated 3


21361657

precursor

4758304

protein disulfide isomerase-associated 4


PREDICTED: similar to adaptor-related

20270343

ADP-ribosylation factor-like 10B

89042891

protein complex 1 sigma 2 subunit

SWI/SNF-related matrix-associated
actin-dependent regulator of chromatin
21071060

a-like 1

4502743

cyclin-dependent kinase 7

SWI/SNF-related matrix-associated
actin-dependent regulator of chromatin
21071060

a-like 1

113414586

PREDICTED: similar to CG17293-PA

SWI/SNF-related matrix-associated
actin-dependent regulator of chromatin
21071060

a-like 1

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

19718751

uracil-DNA glycosylase isoform UNG2

113429091

isomerase A isoform 1
DNA directed RNA polymerase II

19718751

uracil-DNA glycosylase isoform UNG2

10863925

polypeptide L
skeletal muscle and kidney enriched inositol

19718751

uracil-DNA glycosylase isoform UNG2

18765707

19718751

uracil-DNA glycosylase isoform UNG2

4506717

phosphatase isoform 2
ribosomal protein S29 isoform 1
PREDICTED: similar to adaptor-related

18860916

5'-3' exoribonuclease 2

89042891

protein complex 1 sigma 2 subunit

PREDICTED: similar to peptidylprolyl
19913428

vacuolar H+ATPase B2

113429091

isomerase A isoform 1
autophagy-related cysteine endopeptidase 2

19718751

uracil-DNA glycosylase isoform UNG2

30795252

isoform a
autophagy-related cysteine endopeptidase 2

19718751

uracil-DNA glycosylase isoform UNG2

19923424

myotubularin-related protein 9

19718751

uracil-DNA glycosylase isoform UNG2

30795248
113414586
8923942

isoform b
PREDICTED: similar to CG17293-PA
nucleolar protein family A, member 3
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

18105063

vacuolar protein sorting 45A

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

18105063

vacuolar protein sorting 45A

18491016

exonuclease 1 isoform b

18105063

vacuolar protein sorting 45A

113431146
4557719
113414586

(SIG-20)
DNA ligase I
PREDICTED: similar to CG17293-PA
PREDICTED: similar to peptidylprolyl

18105063

vacuolar protein sorting 45A

113429091

ATPase, H+ transporting, lysosomal


18087815

isomerase A isoform 1
PREDICTED: similar to adaptor-related

31kDa, V1 subunit E isoform 2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

17978519

vacuolar protein sorting 26 A isoform 1

17978477

vacuolar protein sorting 11

89042891
4502859

protein complex 1 sigma 2 subunit


CDC28 protein kinase 2
PREDICTED: similar to peptidylprolyl

17978519

vacuolar protein sorting 26 A isoform 1

113429091

isomerase A isoform 1
PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

88980535

protein S26
PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

15150809

SEC11-like 3

89041601
113414586

protein S26 isoform 1


PREDICTED: similar to CG17293-PA

PREDICTED: similar to 40S ribosomal
15011936

ribosomal protein S26

88982349

protein S26
PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

113420084

protein S26
PREDICTED: similar to peptidylprolyl

15150809

SEC11-like 3

113429091

isomerase A isoform 1
PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

89025350

protein S26 isoform 2


PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

113430282

protein S26
PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

88987217

protein S26
PREDICTED: similar to 40S ribosomal

15011936

ribosomal protein S26

14249398

PHD-finger 5A

113429703
4758496

protein S26
H2A histone family, member Y isoform 2
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

14249398

PHD-finger 5A

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

14249398

PHD-finger 5A

113418826

(SIG-20)
PREDICTED: similar to peptidylprolyl

14249398

PHD-finger 5A

113429091

isomerase A isoform 1

IMP2 inner mitochondrial membrane


14211845

protease-like

113414586

PREDICTED: similar to CG17293-PA

14249398

PHD-finger 5A

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

14149696

SEC31 homolog B

113429091

isomerase A isoform 1

13236510

ubiquitin-like 5

113414586

PREDICTED: similar to CG17293-PA

11863130

phosphatidylinositol N-acetylglucosaminyltransferase subunit A

113429091

PREDICTED: similar to peptidylprolyl

10863925

isoform 1

isomerase A isoform 1

DNA directed RNA polymerase II

PREDICTED: similar to adaptor-related

polypeptide L

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

10864021

trafficking protein particle complex 1

113429091

DNA directed RNA polymerase II


10863925

isomerase A isoform 1
ubiquitin-conjugating enzyme E2D 4

polypeptide L

8393719

(putative)

4502743

cyclin-dependent kinase 7

4557719

DNA ligase I

DNA directed RNA polymerase II


10863925

polypeptide L
DNA directed RNA polymerase II

10863925

polypeptide L
DNA directed RNA polymerase II

10863925

polypeptide L

38016127

RNA binding motif protein 34

62909985

hypothetical protein LOC140711

DNA directed RNA polymerase II


10863925

polypeptide L
DNA directed RNA polymerase II

10863925

meiotic recombination protein SPO11

polypeptide L

6912680

isoform a

4758496

H2A histone family, member Y isoform 2

DNA directed RNA polymerase II


10863925

polypeptide L
DNA directed RNA polymerase II

10863925

polypeptide L

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

DNA directed RNA polymerase II


10863925

polypeptide L

5729840

DNA directed RNA polymerase II


10863925

polypeptide L

10864021

trafficking protein particle complex 1

tubulin, gamma complex associated protein 2


PREDICTED: similar to adaptor-related

89041736
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to 60S ribosomal

DNA directed RNA polymerase II


10863925
10863925

polypeptide L

DNA directed RNA polymerase II

protein L26 (Silica-induced gene 20 protein)


113431146
113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

polypeptide L

(SIG-20)

DNA directed RNA polymerase II


10863925

polypeptide L

113414586

PREDICTED: similar to CG17293-PA


protein phosphatase 1, catalytic subunit, beta

8923942

nucleolar protein family A, member 3

4506005

isoform 1

8923942

nucleolar protein family A, member 3

4502743

cyclin-dependent kinase 7

8923942

nucleolar protein family A, member 3

4557719

DNA ligase I

8923942

nucleolar protein family A, member 3

4758496

H2A histone family, member Y isoform 2

8923942

nucleolar protein family A, member 3

38016127

RNA binding motif protein 34


PREDICTED: similar to TBC1 domain
family member 3 (Rab GTPase-activating
protein PRC17) (Prostate cancer gene 17

8923942

nucleolar protein family A, member 3

113426831

protein) (TRE17 alpha protein) isoform 1

8923942

nucleolar protein family A, member 3

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

8923942

nucleolar protein family A, member 3

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

8923942

nucleolar protein family A, member 3

113418826

(SIG-20)
PREDICTED: similar to adaptor-related

8923942

nucleolar protein family A, member 3

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

8923942

nucleolar protein family A, member 3

89041736

uncharacterized hypothalamus protein


8923712

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

HARP11

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

8923942

nucleolar protein family A, member 3

113429091

uncharacterized hypothalamus protein


8923712

isomerase A isoform 1
PREDICTED: similar to adaptor-related

HARP11

89041736

protein complex 1 sigma 2 subunit

PREDICTED: similar to adaptor-related
7706326

splicing factor 3B, 14 kDa subunit

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7706753

ubiquitin C-terminal hydrolase UCH37

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7706667

trafficking protein particle complex 4

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7706657

cell division cycle 40 homolog

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

7706667

trafficking protein particle complex 4

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

7706326

splicing factor 3B, 14 kDa subunit

89041736

DnaJ (Hsp40) homolog, subfamily B,


7706495

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

member 11 precursor

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7706657

cell division cycle 40 homolog

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7706667

trafficking protein particle complex 4

7705483

hypothetical protein LOC51507

89042891
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

7705483

hypothetical protein LOC51507

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7705483

hypothetical protein LOC51507

7657198

dimethyladenosine transferase

89042891
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

7657522

ring finger protein 7 isoform 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

7657198

dimethyladenosine transferase

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

7657548

trafficking protein particle complex 2

113429091

isomerase A isoform 1

PREDICTED: similar to adaptor-related
7657546

Sec61 gamma subunit

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

7657548

trafficking protein particle complex 2

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

7657548

trafficking protein particle complex 2

113418826

(SIG-20)

7657548

trafficking protein particle complex 2

113414586

PREDICTED: similar to CG17293-PA

meiotic recombination protein SPO11


6912680

isoform a

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

meiotic recombination protein SPO11


6912680

isoform a

4557719

meiotic recombination protein SPO11


6912680

PREDICTED: similar to adaptor-related

isoform a

89042891

ATP-binding cassette, sub-family A


6005701

DNA ligase I

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

member 8

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

5902002

dual specificity phosphatase 14

89042891

plasma glutathione peroxidase 3


6006001

DnaJ (Hsp40) homolog, subfamily A,

precursor

31542539

ATP-binding cassette, sub-family A


6005701

member 8

protein 2

member 3
PREDICTED: similar to adaptor-related

89041736

tubulin, gamma complex associated


5729840

protein complex 1 sigma 2 subunit

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

tubulin, gamma complex associated


5729840

protein 2
tubulin, gamma complex associated

5729840

protein 2

6996005

dynamin 1-like protein isoform 1


PREDICTED: similar to Ubiquitin-63E

5454144

ubiquitin D

113423966

CG11624-PA, isoform A

tubulin, gamma complex associated
5729840

protein 2

4557719

tubulin, gamma complex associated


5729840

PREDICTED: similar to adaptor-related

protein 2

89042891

tubulin, gamma complex associated


5729840

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

protein 2

89041736

tubulin, gamma complex associated


5453660

DNA ligase I

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

protein 3

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

5032133

eukaryotic translation initiation factor 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

5031635

cofilin 1 (non-muscle)

113429091

U2 small nuclear RNA auxiliary factor


4827046

isomerase A isoform 1
protein phosphatase 1, catalytic subunit, beta

1-like 2

4506005

isoform 1

4758496

H2A histone family, member Y isoform 2

ATP synthase, H+ transporting,


mitochondrial F1 complex, gamma
4885079

subunit isoform H (heart) precursor

PREDICTED: similar to adaptor-related


4759302

VAMP-associated protein B/C

89042891

protein complex 1 sigma 2 subunit

4759302

VAMP-associated protein B/C

6996005

dynamin 1-like protein isoform 1


PREDICTED: similar to adaptor-related

4759302

VAMP-associated protein B/C

89041736

protein complex 1 sigma 2 subunit

DNA directed RNA polymerase II


4826924

polypeptide K

113414586

DNA directed RNA polymerase II


4826924

polypeptide K

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

phosphatidylinositol glycan anchor


4758922

biosynthesis, class L
phosphatidylinositol glycan anchor

4758922

biosynthesis, class L

4502743

cyclin-dependent kinase 7

phosphatidylinositol glycan anchor
4758922

biosynthesis, class L

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal

phosphatidylinositol glycan anchor


4758922

biosynthesis, class L

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal

phosphatidylinositol glycan anchor


4758922

biosynthesis, class L

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4507873

von Hippel-Lindau binding protein 1

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4507873

von Hippel-Lindau binding protein 1

113418826

(SIG-20)

excision repair cross-complementing


rodent repair deficiency,
4557563

complementation group 3

9910180

ACN9 homolog
PREDICTED: similar to peptidylprolyl

4507947

tyrosyl-tRNA synthetase

113429091

isomerase A isoform 1

4507873

von Hippel-Lindau binding protein 1

113414586

PREDICTED: similar to CG17293-PA

ubiquitin-conjugating enzyme E2
4507797

variant 2

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

4507873

von Hippel-Lindau binding protein 1

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

ubiquitin-conjugating enzyme E2
4507797

variant 2

PREDICTED: similar to peptidylprolyl


4506701

ribosomal protein S23

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506699

ribosomal protein S21

113418826

4506699

ribosomal protein S21

113431146

(SIG-20)

PREDICTED: similar to 60S ribosomal

protein L26 (Silica-induced gene 20 protein)
(SIG-20)
4506717

ribosomal protein S29 isoform 1

4758496

solute carrier family 7 (cationic amino


4507047

H2A histone family, member Y isoform 2


PREDICTED: similar to adaptor-related

acid transporter, y+ system), member 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506717

ribosomal protein S29 isoform 1

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506717

ribosomal protein S29 isoform 1

113431146

4506699

ribosomal protein S21

4506717

ribosomal protein S29 isoform 1

62909985

4506717

ribosomal protein S29 isoform 1

113414586

4506699

ribosomal protein S21

4758496

4502743

(SIG-20)
H2A histone family, member Y isoform 2
hypothetical protein LOC140711
PREDICTED: similar to CG17293-PA
cyclin-dependent kinase 7
ubiquitin-conjugating enzyme E2D 4

4506717

ribosomal protein S29 isoform 1

8393719

(putative)
PREDICTED: similar to peptidylprolyl

4506699

ribosomal protein S21

113429091

solute carrier family 7 (cationic amino

isomerase A isoform 1
PREDICTED: similar to adaptor-related

4507047

acid transporter, y+ system), member 1

89041736

protein complex 1 sigma 2 subunit

4506717

ribosomal protein S29 isoform 1

38016127

RNA binding motif protein 34

4506717

ribosomal protein S29 isoform 1

4502743

cyclin-dependent kinase 7
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506701

ribosomal protein S23

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506701

ribosomal protein S23

113418826

(SIG-20)

4506701

ribosomal protein S23

113414586

PREDICTED: similar to CG17293-PA

4506699

ribosomal protein S21

113414586

small nuclear ribonucleoprotein


4507123

polypeptide B''

4506717

ribosomal protein S29 isoform 1

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

89042891
4557719

protein complex 1 sigma 2 subunit


DNA ligase I
PREDICTED: similar to 40S ribosomal

4506715

ribosomal protein S28

113422526

protein S28 isoform 1


PREDICTED: similar to 40S ribosomal

4506715

ribosomal protein S28

113423050

protein S28 isoform 1


PREDICTED: similar to 40S ribosomal

4506715

ribosomal protein S28

89034184

protein S28 isoform 2


PREDICTED: similar to 40S ribosomal

4506715

ribosomal protein S28

88959151

protein S28
PREDICTED: similar to 40S ribosomal

4506715

ribosomal protein S28

88953906

protein S28
PREDICTED: similar to peptidylprolyl

4506717

ribosomal protein S29 isoform 1

113429091

isomerase A isoform 1

4506643

ribosomal protein L37a

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

4506617

ribosomal protein L17

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

4506643

ribosomal protein L37a

113429091

isomerase A isoform 1

4506609

ribosomal protein L19

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

4506609

ribosomal protein L19

113429091

isomerase A isoform 1

4506193

proteasome beta 1 subunit

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

4506193

proteasome beta 1 subunit

113429091

protein tyrosine phosphatase, receptor


4506303

type, A isoform 1 precursor

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

protein phosphatase 1, catalytic subunit,
4506005

beta isoform 1

4557719

DNA ligase I
PREDICTED: similar to peptidylprolyl

4506233

proteasome 26S non-ATPase subunit 8

4505621

prostatic binding protein

113429091
22165364

isomerase A isoform 1
mitochondrial ribosomal protein L38
PREDICTED: similar to adaptor-related

4505795

phosphatidylinositol glycan, class C

89042891

DnaJ (Hsp40) homolog, subfamily A,


4504511

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

member 1

89041736

DnaJ (Hsp40) homolog, subfamily A,

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

4504511

member 1

113429091

isomerase A isoform 1

4504221

guanylate kinase 1

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

4504007

glycerol kinase isoform b

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4504007

glycerol kinase isoform b

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

4504221

guanylate kinase 1

4502703

cell division cycle 6 protein

113429091
21536371

isomerase A isoform 1
telomerase-associated protein 1
PREDICTED: similar to adaptor-related

4503301

2,4-dienoyl CoA reductase 1 precursor

89042891

chaperonin containing TCP1, subunit


4502643

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

6A isoform a

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

63029935

H2A histone family, member B3

89041736

phosphoribosyl pyrophosphate

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

28557709

synthetase 1-like 1

89042891

protein complex 1 sigma 2 subunit

63029935

H2A histone family, member B3

63029943

H2A histone family, member B2

phosphoribosyl pyrophosphate
28557709

PREDICTED: similar to adaptor-related

synthetase 1-like 1

89041736

protein complex 1 sigma 2 subunit

PREDICTED: similar to adaptor-related
66912162
148747574

histone 2, H2bf

89042891

protein complex 1 sigma 2 subunit

hypothetical protein LOC51030

21945058

hypothetical protein LOC201158


PREDICTED: similar to adaptor-related

63029935

H2A histone family, member B3

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

118402582

cell division cycle 20

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

38016127

RNA binding motif protein 34

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 40S ribosomal

38016127

RNA binding motif protein 34

51467029

protein S26
PREDICTED: similar to adaptor-related

38016127

RNA binding motif protein 34

89042891

38016127

RNA binding motif protein 34

4557719

37595752

lamin B receptor

37595750

protein complex 1 sigma 2 subunit


DNA ligase I
lamin B receptor
PREDICTED: similar to peptidylprolyl

38016127

RNA binding motif protein 34

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

38348260

ankyrin repeat domain 47

38016127

RNA binding motif protein 34

32189369

DNA polymerase epsilon subunit 2

89041736

protein complex 1 sigma 2 subunit

6996005

dynamin 1-like protein isoform 1

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

32189369

DNA polymerase epsilon subunit 2

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

32189369

DNA polymerase epsilon subunit 2

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

32189369

DNA polymerase epsilon subunit 2

113418826

32484973

adenosine kinase isoform a

113429091

(SIG-20)

PREDICTED: similar to peptidylprolyl

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl
32528306

replication factor C large subunit

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

32483374

nucleolar protein 5A

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

31795544

origin recognition complex, subunit 1

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

31543091

RNA binding motif protein 13

31795544

origin recognition complex, subunit 1

89042891
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to peptidylprolyl

31543091

RNA binding motif protein 13

113429091

autophagy-related cysteine
30795252

autophagy-related cysteine endopeptidase 2

endopeptidase 2 isoform a

30795248

CCR4-NOT transcription complex,


31542315

isomerase A isoform 1

isoform b
PREDICTED: similar to adaptor-related

subunit 8

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

31543091

RNA binding motif protein 13

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

31795544

origin recognition complex, subunit 1

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

31795544

origin recognition complex, subunit 1

113418826

(SIG-20)
PREDICTED: similar to adaptor-related

28376621

SEC14p-like protein TAP3

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

28376621

SEC14p-like protein TAP3

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

28173554

histone H2B

89042891

protein complex 1 sigma 2 subunit

28559085

cytidine triphosphate synthase II

28559083

cytidine triphosphate synthase II

autophagy-related cysteine
30795252

endopeptidase 2 isoform a

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

autophagy-related cysteine
30795248

endopeptidase 2 isoform b
autophagy-related cysteine

30795248

PREDICTED: similar to peptidylprolyl

endopeptidase 2 isoform b

113429091

autophagy-related cysteine
30795252

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

endopeptidase 2 isoform a

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

24586679

testis-specific histone H2B

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

24586675

slingshot homolog 3

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

24586679

testis-specific histone H2B

89042891

protein complex 1 sigma 2 subunit

potassium voltage-gated channel,


shaker-related subfamily, beta member

potassium voltage-gated channel, shaker-

27436969

2 isoform 2

4504825

22538446

tumor protein p53 inducible protein 3

22538444

tumor protein p53 inducible protein 3

22001417

gemin 5

21536371

telomerase-associated protein 1

adaptor-related protein complex 1 sigma


22027655

PREDICTED: similar to adaptor-related

2 subunit

89041736

adaptor-related protein complex 1 sigma


22027655

related subfamily, beta member 2 isoform 1

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

2 subunit

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

21362084

TBC1 domain family, member 15

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

21362084

TBC1 domain family, member 15

89042891

protein complex 1 sigma 2 subunit

IMP1 inner mitochondrial membrane


21450679

peptidase-like

113414586


PREDICTED: similar to CG17293-PA

PREDICTED: similar to peptidylprolyl
21314720

Smad nuclear interacting protein

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

20911035

peptidylprolyl isomerase-like 4

89042891

serine hydroxymethyltransferase 2
19923315

PREDICTED: similar to adaptor-related

(mitochondrial)

89042891

WW domain-containing oxidoreductase
18860884

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

isoform 2

89042891

WW domain-containing oxidoreductase
18860884

protein complex 1 sigma 2 subunit

protein complex 1 sigma 2 subunit


WW domain-containing oxidoreductase

isoform 2

7706523

isoform 1

4557719

DNA ligase I

serine hydroxymethyltransferase 2
19923315

(mitochondrial)
WW domain-containing oxidoreductase

18860884

PREDICTED: similar to adaptor-related

isoform 2

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

15431297

ribosomal protein L13

113429091

isomerase A isoform 1

15431297

ribosomal protein L13

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

16306568

poly(A) polymerase gamma

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

16306566

histone H2B

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

15431297

ribosomal protein L13

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

15431297

ribosomal protein L13

113431146

DEAD (Asp-Glu-Ala-Asp) box


14251212

polypeptide 20

(SIG-20)
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

13376747

nucleotide binding protein-like

89042891


protein complex 1 sigma 2 subunit

PREDICTED: similar to adaptor-related
14043026

vesicle-associated membrane protein 8

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

14043026

vesicle-associated membrane protein 8

113429091

isomerase A isoform 1

13430872

nucleolar protein 10

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

13430872

nucleolar protein 10

113429091

isomerase A isoform 1

13129120

trafficking protein particle complex 6A

113414586

PREDICTED: similar to CG17293-PA

guanine nucleotide-binding protein,


11321585

PREDICTED: similar to adaptor-related

beta-1 subunit

89041736

guanine nucleotide-binding protein,


11321585

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

beta-1 subunit

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

11056006

kelch-like 12

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

11056006

kelch-like 12

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

12758125

hypothetical protein LOC23378

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

12758125

hypothetical protein LOC23378

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

12758125

hypothetical protein LOC23378

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

11386163

ELAV-like 4

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

8922905

RIO kinase 2

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

solute carrier family 2 (facilitated


8923733

glucose transporter), member 6

10190686

zinc finger protein 286

10190696

10190686

zinc finger protein 286

18765707


zinc finger protein 304

skeletal muscle and kidney enriched inositol

phosphatase isoform 2
solute carrier family 2 (facilitated
8923733

glucose transporter), member 6

62909985

hypothetical protein LOC140711


PREDICTED: similar to zinc finger protein

10190686

zinc finger protein 286

113413881

solute carrier family 2 (facilitated


8923733

glucose transporter), member 6

114
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal

solute carrier family 2 (facilitated


8923733

glucose transporter), member 6

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

solute carrier family 2 (facilitated


8923733

glucose transporter), member 6

113418826

(SIG-20)

8922905

RIO kinase 2

113414586

PREDICTED: similar to CG17293-PA

10190686

zinc finger protein 286

6996005

dynamin 1-like protein isoform 1

7706343

hypothetical protein LOC51647

113414586

PREDICTED: similar to CG17293-PA

7705477

hypothetical protein LOC51504

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

7705748

TNNI3 interacting kinase

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

U6 snRNA-associated Sm-like protein


7706423

LSm7

PREDICTED: similar to adaptor-related


7705748

TNNI3 interacting kinase

89041736

protein complex 1 sigma 2 subunit

7706497

cytidylate kinase

113414586

PREDICTED: similar to CG17293-PA

7705369

coatomer protein complex, subunit beta

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

7705369

coatomer protein complex, subunit beta

89042891

WW domain-containing oxidoreductase
7706523

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

isoform 1

89042891


protein complex 1 sigma 2 subunit

PREDICTED: similar to peptidylprolyl
7706343

hypothetical protein LOC51647

113429091

7705477

hypothetical protein LOC51504

4557719

isomerase A isoform 1
DNA ligase I
PREDICTED: similar to peptidylprolyl

7706497

cytidylate kinase

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

8922388

RNA binding motif protein 28

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7657508

ring-box 1

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

7705477

hypothetical protein LOC51504

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

7705477

hypothetical protein LOC51504

113418826

WW domain-containing oxidoreductase
7706523

(SIG-20)
PREDICTED: similar to adaptor-related

isoform 1

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

7705477

hypothetical protein LOC51504

113429091

U6 snRNA-associated Sm-like protein


7706423

LSm7

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

7705748

TNNI3 interacting kinase

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7019405

host cell factor C2

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

7019405

host cell factor C2

89041736

AHA1, activator of heat shock 90kDa

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

6912280

protein ATPase homolog 1

113429091

isomerase A isoform 1

7019319

activator of basal transcription 1

113414586

PREDICTED: similar to CG17293-PA

7657315

Lsm3 protein

6996005

dynamin 1-like protein isoform 1

113414586
10190696

PREDICTED: similar to CG17293-PA


zinc finger protein 304

AHA1, activator of heat shock 90kDa


6912280

protein ATPase homolog 1

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

7019319

activator of basal transcription 1

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

5729953

nuclear distribution gene C homolog

5902034

periodic tryptophan protein 1

89042891
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to peptidylprolyl

4557719

DNA ligase I

113429091

isomerase A isoform 1

H2A histone family, member Y isoform


4758496

4503729

FK506-binding protein 4

small nuclear ribonucleoprotein


4759156

polypeptide A

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

4759224

programmed cell death 5

89041736

4758384

FK506 binding protein 5

4503729

protein complex 1 sigma 2 subunit


FK506-binding protein 4

RNA, U3 small nucleolar interacting


4759276

protein 2

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

4759224

programmed cell death 5

89042891

small nuclear ribonucleoprotein


4759156

polypeptide A

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

4758384

FK506 binding protein 5

89042891

protein complex 1 sigma 2 subunit

H2A histone family, member Y isoform


4758496

4557719

DNA ligase I

4557719
62909985


DNA ligase I
hypothetical protein LOC140711

H2A histone family, member Y isoform
4758496

PREDICTED: similar to large subunit


113427044

ribosomal protein L36a

H2A histone family, member Y isoform


4758496

4506651

ribosomal protein L36a-like protein


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4557719

DNA ligase I

113431146

H2A histone family, member Y isoform


4758496

(SIG-20)
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
skeletal muscle and kidney enriched inositol

4557719

DNA ligase I

18765707

RNA, U3 small nucleolar interacting


4759276

protein 2

phosphatase isoform 2
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
protein phosphatase 1, catalytic subunit,

4557719

DNA ligase I

4506007

H2A histone family, member Y isoform


4758496

gamma isoform
PREDICTED: similar to adaptor-related

89042891

protein complex 1 sigma 2 subunit


glucosamine-fructose-6-phosphate

4557719

DNA ligase I

4503981

aminotransferase

4557719

DNA ligase I

4502743

cyclin-dependent kinase 7
N-ethylmaleimide-sensitive factor

4557719

DNA ligase I

4505331

attachment protein, gamma


ubiquitin-conjugating enzyme E2D 4

4557719

DNA ligase I

8393719

H2A histone family, member Y isoform


4758496

(putative)
PREDICTED: similar to large subunit

113427529

ribosomal protein L36a


PREDICTED: similar to adaptor-related

4557719

DNA ligase I

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4507311

suppressor of Ty 4 homolog 1

113418826


(SIG-20)

PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)
4507311

suppressor of Ty 4 homolog 1

113431146

(SIG-20)
PREDICTED: similar to adaptor-related

4507369

tyrosine aminotransferase

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

4507311

suppressor of Ty 4 homolog 1

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506631

ribosomal protein L30

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506631

ribosomal protein L30

113418826

(SIG-20)
PREDICTED: similar to peptidylprolyl

4506631

ribosomal protein L30

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

4506629

ribosomal protein L29

113429091

isomerase A isoform 1

4507311

suppressor of Ty 4 homolog 1

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to TBC1 domain
family member 3 (Rab GTPase-activating
protein PRC17) (Prostate cancer gene 17

4557719

DNA ligase I

113426831

protein) (TRE17 alpha protein) isoform 1


PREDICTED: similar to 60S ribosomal
protein L29 (Cell surface heparin-binding

4506629

ribosomal protein L29

27482992

protein HIP)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4557719

DNA ligase I

113418826

(SIG-20)

4557719

DNA ligase I

113414586

PREDICTED: similar to CG17293-PA

small nuclear ribonucleoprotein

PREDICTED: similar to peptidylprolyl

4507133

polypeptide G

113429091

4506203

proteasome beta 7 subunit proprotein

113429091


isomerase A isoform 1

PREDICTED: similar to peptidylprolyl

isomerase A isoform 1
PREDICTED: similar to adaptor-related
4557719

DNA ligase I

4506631

ribosomal protein L30

89041736
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to 60S ribosomal
protein L29 (Cell surface heparin-binding

4506629

ribosomal protein L29

113428574

protein HIP)

113414586

PREDICTED: similar to CG17293-PA

heat shock 10kDa protein 1 (chaperonin


4504523

10)
heat shock 10kDa protein 1 (chaperonin

4504523

10)

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4503729

FK506-binding protein 4

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4503729

FK506-binding protein 4

113418826

(SIG-20)
PREDICTED: similar to peptidylprolyl

4505235

mannose-6-phosphate isomerase

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

4505773

prohibitin

113429091

alpha isoform of regulatory subunit


4506019

PREDICTED: similar to adaptor-related

B55, protein phosphatase 2

89042891

N-ethylmaleimide-sensitive factor
4505331

isomerase A isoform 1

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

attachment protein, gamma

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4504257

H2B histone family, member A

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4504261

H2B histone family, member D

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4504269

H2B histone family, member J

89042891


protein complex 1 sigma 2 subunit

PREDICTED: similar to 60S ribosomal
alpha isoform of regulatory subunit
4506019

B55, protein phosphatase 2

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal

alpha isoform of regulatory subunit


4506019

B55, protein phosphatase 2

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to adaptor-related

4504263

H2B histone family, member E

89042891

N-ethylmaleimide-sensitive factor
4505331

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

attachment protein, gamma

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal

heat shock 10kDa protein 1 (chaperonin


4504523

10)

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal

heat shock 10kDa protein 1 (chaperonin


4504523

10)

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to adaptor-related

4505997

protein phosphatase 1D

89042891

protein phosphatase 1, catalytic subunit,


4506007

gamma isoform

4502743

cyclin-dependent kinase 7

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

89042891
113414586

protein complex 1 sigma 2 subunit


PREDICTED: similar to CG17293-PA
PREDICTED: similar to adaptor-related

4502743

cyclin-dependent kinase 7

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

148222882

hypothetical protein LOC644820

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

S-phase kinase-associated protein 1A


25777713

isoform b
S-phase kinase-associated protein 1A

25777711

isoform a

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

21166389

H2B histone family, member L

89042891


protein complex 1 sigma 2 subunit

PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)
4502743

cyclin-dependent kinase 7

113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4502743

cyclin-dependent kinase 7

113418826

S-phase kinase-associated protein 1A

(SIG-20)
PREDICTED: similar to peptidylprolyl

25777713

isoform b

113429091

isomerase A isoform 1

23592238

glucose transporter 14

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

S-phase kinase-associated protein 1A


25777711

isoform a

PREDICTED: similar to peptidylprolyl


23592238

glucose transporter 14

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

4502859

CDC28 protein kinase 2

113429091

S-phase kinase-associated protein 1A


25777713

isomerase A isoform 1
S-phase kinase-associated protein 1A

isoform b

25777711

isoform a
PREDICTED: similar to adaptor-related

4502743

cyclin-dependent kinase 7

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

63029943

H2A histone family, member B2

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to Kinesin heavy chain
isoform 5C (Kinesin heavy chain neuron-

4758650

kinesin family member 5C

113413289

specific 2)
PREDICTED: similar to adaptor-related

58615669

cytochrome c oxidase subunit III

89042891

protein complex 1 sigma 2 subunit

58615665

cytochrome c oxidase subunit I

17981855

cytochrome c oxidase subunit I

58615673

NADH dehydrogenase subunit 5

17981853

NADH dehydrogenase subunit 1

58615673

NADH dehydrogenase subunit 5

58615663

NADH dehydrogenase subunit 1

58615673

NADH dehydrogenase subunit 5

17981862

NADH dehydrogenase subunit 4

58615673

NADH dehydrogenase subunit 5

13128862

histone deacetylase 3

62909985

hypothetical protein LOC140711

58615672
113414586
8923475

NADH dehydrogenase subunit 4


PREDICTED: similar to CG17293-PA
thioredoxin-like 4B

63003905

protein phosphatase 1 (formerly 2C)-like

113414586

dual specificity phosphatase and pro

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

51491914

isomerase domain containing 1

89042891

protein complex 1 sigma 2 subunit

58615672

NADH dehydrogenase subunit 4

17981862

NADH dehydrogenase subunit 4


PREDICTED: similar to peptidylprolyl

62909985

hypothetical protein LOC140711

113429091

isomerase A isoform 1

58615666

cytochrome c oxidase subunit II

17981859

cytochrome c oxidase subunit III

58615669

cytochrome c oxidase subunit III

58615666

cytochrome c oxidase subunit II

58615669

cytochrome c oxidase subunit III

17981856

cytochrome c oxidase subunit II

58615669

cytochrome c oxidase subunit III

17981859

cytochrome c oxidase subunit III

solute carrier family 2 (facilitated


4557851

glucose transporter), member 2

113414586

58615666

cytochrome c oxidase subunit II

17981856

PREDICTED: similar to CG17293-PA


cytochrome c oxidase subunit II
PREDICTED: similar to peptidylprolyl

13128862

histone deacetylase 3

113429091

isomerase A isoform 1
PREDICTED: similar to adaptor-related

63029943

H2A histone family, member B2

89042891

protein complex 1 sigma 2 subunit

58615663

NADH dehydrogenase subunit 1

17981853

NADH dehydrogenase subunit 1

CTD (carboxy-terminal domain, RNA


polymerase II, polypeptide A) small
32813443

phosphatase 1 isoform 2

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to adaptor-related

17981859

32813443

cytochrome c oxidase subunit III

CTD (carboxy-terminal domain, RNA


polymerase II, polypeptide A) small

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091


isomerase A isoform 1

phosphatase 1 isoform 2
PREDICTED: similar to 60S ribosomal
small nuclear ribonucleoprotein
4507129

polypeptide E

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal

small nuclear ribonucleoprotein


4507129

polypeptide E

protein L26 (Silica-induced gene 20 protein)


113431146

protein phosphatase 2A, regulatory


30065643

PREDICTED: similar to adaptor-related

subunit B' isoform b

89041736

protein phosphatase 2A, regulatory


29725611

subunit B' isoform b

subunit B' isoform b

89041736

32813443

32813443

subunit B' isoform b

113429091

17981859

113429091

PREDICTED: similar to 60S ribosomal

polymerase II, polypeptide A) small

protein L26 (Silica-induced gene 20 protein)

phosphatase 1 isoform 2

113431146

PREDICTED: similar to 60S ribosomal

polymerase II, polypeptide A) small

protein L26 (Silica-induced gene 20 protein)

phosphatase 1 isoform 2

113418826

polypeptide E

113429091

cytochrome c oxidase subunit III

17981856

4507129

89042891

cytochrome c oxidase subunit II

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

subunit B' isoform b

small nuclear ribonucleoprotein

isomerase A isoform 1

PREDICTED: similar to adaptor-related

subunit B' isoform b

polypeptide G

(SIG-20)
PREDICTED: similar to peptidylprolyl

89042891

DNA directed RNA polymerase II


4505947

(SIG-20)

CTD (carboxy-terminal domain, RNA

protein phosphatase 2A, regulatory


30065643

isomerase A isoform 1

CTD (carboxy-terminal domain, RNA

protein phosphatase 2A, regulatory


29725611

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

small nuclear ribonucleoprotein


4507129

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

protein phosphatase 2A, regulatory


29725611

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

protein phosphatase 2A, regulatory


30065643

(SIG-20)

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

113414586

PREDICTED: similar to CG17293-PA

polypeptide E
RNA pseudouridylate synthase domain
27734887

RNA pseudouridylate synthase domain

containing 3

14249470

DnaJ (Hsp40) homolog, subfamily A,


31542539

member 3

containing 4
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

ubiquitin-conjugating enzyme E2A


32967280

isoform 1

32967276

ubiquitin-conjugating enzyme E2A isoform 2


PREDICTED: similar to zinc finger protein

10190696
8923475

zinc finger protein 304

113413881

114

thioredoxin-like 4B

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

15431295

ribosomal protein L13

113429091

isomerase A isoform 1
PREDICTED: similar to peptidylprolyl

8923475

thioredoxin-like 4B

113429091

adaptor-related protein complex 4,


21361394

isomerase A isoform 1
PREDICTED: similar to adaptor-related

sigma 1 subunit

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal

general transcription factor IIH,


19923732

polypeptide 3, 34kDa

protein L26 (Silica-induced gene 20 protein)


113431146

(SIG-20)
PREDICTED: similar to 60S ribosomal

general transcription factor IIH,


19923732

polypeptide 3, 34kDa

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

8923475

thioredoxin-like 4B

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

8923475
15431295
5802970

thioredoxin-like 4B

113431146

(SIG-20)

ribosomal protein L13

113414586

PREDICTED: similar to CG17293-PA

AFG3 ATPase family gene 3-like 2

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to adaptor-related
21396484

H2B histone family, member H

89042891

ubiquitin-conjugating enzyme E2D 4


8393719

(putative)

protein complex 1 sigma 2 subunit


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

15431295

ribosomal protein L13

113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

15431295

ribosomal protein L13

113431146

general transcription factor IIH,


19923732

polypeptide 3, 34kDa

PREDICTED: similar to peptidylprolyl


113429091

adaptor-related protein complex 4,


21361394

(SIG-20)

isomerase A isoform 1
PREDICTED: similar to adaptor-related

sigma 1 subunit

89042891

protein complex 1 sigma 2 subunit

general transcription factor IIH,


19923732

polypeptide 3, 34kDa

113414586

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

5802970

AFG3 ATPase family gene 3-like 2

113429091

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

isomerase A isoform 1
PREDICTED: similar to large subunit

113427529

ribosomal protein L36a

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to 40S ribosomal


113419590

protein S28
PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

PREDICTED: similar to ribosomal protein


113427093

PREDICTED: similar to peptidylprolyl


113429091

PREDICTED: similar to ribosomal protein

isomerase A isoform 1

29742309

PREDICTED: similar to 40S ribosomal


88987217

89041601

L31
PREDICTED: similar to 40S ribosomal

protein S26

88982349

PREDICTED: similar to 40S ribosomal


113420084

L31

protein S26
PREDICTED: similar to 40S ribosomal

protein S26

88987217
88980535

PREDICTED: similar to 40S ribosomal


protein S26

PREDICTED: similar to 40S ribosomal


113420393

protein S26 isoform 1

protein S26

PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal

protein S26

113420084

PREDICTED: similar to 40S ribosomal


113420393

PREDICTED: similar to 40S ribosomal

protein S26

88982349

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

protein S26

protein S26
PREDICTED: similar to APG4 autophagy 4

113413585

homolog B isoform a

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to ribosomal


29742309

protein L31
PREDICTED: similar to ribosomal

113427093

protein L31

PREDICTED: similar to adaptor-related


4758754

napsin A preproprotein

89042891

protein complex 1 sigma 2 subunit

PREDICTED: similar to ribosomal


113427613

protein L31

113414586

PREDICTED: similar to peptidylprolyl


113422777

113431146

PREDICTED: similar to CG17293-PA


PREDICTED: similar to peptidylprolyl

isomerase A isoform 1

89042897

isomerase A isoform 1

PREDICTED: similar to 60S ribosomal

PREDICTED: similar to 60S ribosomal

protein L26 (Silica-induced gene 20

protein L26 (Silica-induced gene 20 protein)

protein) (SIG-20)

113418826

(SIG-20)
PREDICTED: similar to adaptor-related

4504265

H2B histone family, member G

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to adaptor-related

4504271

H2B histone family, member K

89042891

PREDICTED: similar to 40S ribosomal


113420393

PREDICTED: similar to 40S ribosomal

protein S26

88987217

PREDICTED: similar to 40S ribosomal


113430282

protein complex 1 sigma 2 subunit

protein S26
PREDICTED: similar to 40S ribosomal

protein S26

88987217

protein S26
PREDICTED: similar to 60S ribosomal

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

protein L26 (Silica-induced gene 20 protein)


113418826


(SIG-20)

PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20
113431146

protein) (SIG-20)

PREDICTED: similar to peptidylprolyl


113429091

PREDICTED: similar to 40S ribosomal


88982349

PREDICTED: similar to 40S ribosomal

protein S26

88980535

PREDICTED: similar to 40S ribosomal


113420084

protein S26

88980535

protein S26
PREDICTED: similar to 40S ribosomal

protein S26

89041601

PREDICTED: similar to 40S ribosomal


89041601

protein S26
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113420084

isomerase A isoform 1

protein S26 isoform 1


PREDICTED: similar to 40S ribosomal

protein S26 isoform 1

88982349

protein S26

PREDICTED: similar to 40S ribosomal


51467029

protein S26

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to large subunit


113427529

ribosomal protein L36a


PREDICTED: similar to peptidylprolyl

113429091

PREDICTED: similar to 40S ribosomal

isomerase A isoform 1

51467029

PREDICTED: similar to 40S ribosomal


113430282

PREDICTED: similar to 40S ribosomal

protein S26

89025350

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

isomerase A isoform 1

protein S26 isoform 2


PREDICTED: similar to DNA primase large

113418084

PREDICTED: similar to peptidylprolyl


113429091

protein S26

subunit, 58kDa
PREDICTED: similar to DNA primase large

113418086

subunit, 58kDa
PREDICTED: similar to 60S ribosomal

PREDICTED: similar to 40S ribosomal


51467029

protein S26

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20
113431146

PREDICTED: similar to 40S ribosomal

protein) (SIG-20)

51467029

PREDICTED: similar to 40S ribosomal


89041601

protein S26
PREDICTED: similar to 40S ribosomal

protein S26 isoform 1

89025350


protein S26 isoform 2

PREDICTED: similar to 40S ribosomal
89025350

PREDICTED: similar to 40S ribosomal

protein S26 isoform 2

88980535

protein S26

PREDICTED: similar to TBC1 domain


family member 3 (Rab GTPase-activating protein PRC17) (Prostate
cancer gene 17 protein) (TRE17 alpha
113426831

PREDICTED: similar to adaptor-related

protein) isoform 1

89042891

protein complex 1 sigma 2 subunit

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20
113431146

PREDICTED: similar to adaptor-related

protein) (SIG-20)

89041736

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal

PREDICTED: similar to adaptor-related


89041736

protein complex 1 sigma 2 subunit

protein L26 (Silica-induced gene 20 protein)


113418826

PREDICTED: similar to peptidylprolyl


113429091

(SIG-20)
PREDICTED: similar to 40S ribosomal

isomerase A isoform 1

113419590

protein S28

ribosomal protein L36a

113414586

PREDICTED: similar to CG17293-PA

ribosomal protein L36a-like protein

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to large subunit


113427044
4506651

PREDICTED: similar to adaptor-related


89042891

PREDICTED: similar to ribosomal protein

protein complex 1 sigma 2 subunit

89042328

PREDICTED: similar to adaptor-related


89042891

S18 isoform 4
PREDICTED: similar to ribosomal protein

protein complex 1 sigma 2 subunit

41150652

S18 isoform 1
PREDICTED: similar to adaptor-related

4505289

diphosphomevalonate decarboxylase

89042891

protein complex 1 sigma 2 subunit


PREDICTED: similar to 60S ribosomal

PREDICTED: similar to large subunit


113427529

ribosomal protein L36a

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20
113431146
113428574

protein) (SIG-20)

PREDICTED: similar to large subunit


113427529

PREDICTED: similar to 60S ribosomal

27482992

protein L29 (Cell surface heparin-

ribosomal protein L36a


PREDICTED: similar to 60S ribosomal
protein L29 (Cell surface heparin-binding

113430282

binding protein HIP)

protein HIP)

PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal

protein S26

113420393

PREDICTED: similar to peptidylprolyl


113429091

protein S26
PREDICTED: similar to 40S ribosomal

isomerase A isoform 1

89035017

protein S28 isoform 2

PREDICTED: similar to DNA primase


113418086

large subunit, 58kDa

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to DNA primase


113418084

large subunit, 58kDa

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20 protein)
4506651

ribosomal protein L36a-like protein

113418826

(SIG-20)

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20
113431146

protein) (SIG-20)

PREDICTED: similar to large subunit


113427044

ribosomal protein L36a


PREDICTED: similar to 60S ribosomal

PREDICTED: similar to large subunit


113427044

ribosomal protein L36a

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to 60S ribosomal
protein L26 (Silica-induced gene 20 protein)

4506651

ribosomal protein L36a-like protein

113431146

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

(SIG-20)
PREDICTED: similar to large subunit

113427044

ribosomal protein L36a


PREDICTED: similar to peptidylprolyl

4506651

ribosomal protein L36a-like protein

113429091

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

isomerase A isoform 1
PREDICTED: similar to postmeiotic

113418682

segregation increased 2-like 2

113414586

PREDICTED: similar to CG17293-PA

113414586

PREDICTED: similar to CG17293-PA

PREDICTED: similar to 40S ribosomal


89035017

protein S28 isoform 2


PREDICTED: similar to peptidylprolyl

113429091

isomerase A isoform 1

PREDICTED: similar to peptidylprolyl
88953813

PREDICTED: similar to peptidylprolyl

isomerase A isoform 1

88943041

isomerase A (cyclophilin A)-like 4

PREDICTED: similar to postmeiotic


113418682

segregation increased 2-like 2

113414586

PREDICTED: similar to 40S ribosomal


113430282

PREDICTED: similar to 40S ribosomal

protein S26

89041601

PREDICTED: similar to 40S ribosomal


113430282

PREDICTED: similar to CG17293-PA

protein S26 isoform 1


PREDICTED: similar to 40S ribosomal

protein S26

88980535

protein S26
PREDICTED: similar to large subunit

4506651

ribosomal protein L36a-like protein

113427529

PREDICTED: similar to large subunit


113427529

PREDICTED: similar to large subunit

ribosomal protein L36a

113427044

PREDICTED: similar to adaptor-related


89042891

ribosomal protein L36a


PREDICTED: similar to adaptor-related

protein complex 1 sigma 2 subunit

89041736

PREDICTED: similar to peptidylprolyl


113429091

ribosomal protein L36a

protein complex 1 sigma 2 subunit


PREDICTED: similar to kidney-specific

isomerase A isoform 1

89040714

protein (KS)

PREDICTED: similar to TBC1 domain


family member 3 (Rab GTPase-activating protein PRC17) (Prostate
cancer gene 17 protein) (TRE17 alpha
113426831

PREDICTED: similar to adaptor-related

protein) isoform 1

89041736

PREDICTED: similar to 40S ribosomal


89041601

PREDICTED: similar to 40S ribosomal

protein S26 isoform 1

88987217

PREDICTED: similar to 40S ribosomal


88987217

protein S26

protein S26

protein S26
PREDICTED: similar to 40S ribosomal

88980535

PREDICTED: similar to 40S ribosomal


113430282

protein complex 1 sigma 2 subunit

protein S26
PREDICTED: similar to 40S ribosomal

113429703

protein S26
PREDICTED: similar to adaptor-related

4758754

napsin A preproprotein

89041736

PREDICTED: similar to 40S ribosomal


113429703

protein complex 1 sigma 2 subunit


PREDICTED: similar to 40S ribosomal

protein S26

88982349

protein S26

PREDICTED: similar to 40S ribosomal
113429703

protein S26

PREDICTED: similar to 40S ribosomal


113420084

PREDICTED: similar to 40S ribosomal


113420084

PREDICTED: similar to 40S ribosomal

protein S26

88982349

PREDICTED: similar to 40S ribosomal


89025350

protein S26 isoform 2

isomerase A isoform 1

88987217

113427613

protein S28 isoform 1

88953906

protein S28 isoform 1

88959151

protein S28 isoform 2

88953906

protein S28 isoform 1

89034184

protein S28 isoform 1

88959151

protein S28 isoform 1

89034184

protein S28 isoform 2

88959151

protein S28

88953906

protein S28 isoform 1

protein S28 isoform 1

88953906

protein S28
PREDICTED: similar to 40S ribosomal

113422526

PREDICTED: similar to 40S ribosomal


113420393

protein S28
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113423050

protein S28
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113423050

protein S28 isoform 2


PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


88959151

protein S28
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


89034184

protein S28 isoform 2


PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113422526

protein S28
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113423050

protein S28
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113423050

protein S28
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


89034184

L31
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to 40S ribosomal


113422526

protein S26
PREDICTED: similar to ribosomal protein

PREDICTED: similar to 40S ribosomal


113422526

protein S26
PREDICTED: similar to 40S ribosomal

PREDICTED: similar to peptidylprolyl


113429091

protein S26

protein S28 isoform 1


PREDICTED: similar to 40S ribosomal

protein S26

89025350

protein S26 isoform 2

PREDICTED: similar to adaptor-related
89042891

protein complex 1 sigma 2 subunit

PREDICTED: similar to aortic preferentially


113414263

expressed gene 1
PREDICTED: similar to 60S ribosomal

PREDICTED: similar to adaptor-related


89042891

protein complex 1 sigma 2 subunit

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20
113431146

PREDICTED: similar to adaptor-related

protein) (SIG-20)

89042891

protein complex 1 sigma 2 subunit

PREDICTED: similar to 60S ribosomal


protein L26 (Silica-induced gene 20
113431146

protein) (SIG-20)

PREDICTED: similar to ribosomal protein


113427613

L31
PREDICTED: similar to 60S ribosomal

PREDICTED: similar to ribosomal


113427613

protein L31

protein L26 (Silica-induced gene 20 protein)


113418826

(SIG-20)
PREDICTED: similar to large subunit

4506651

ribosomal protein L36a-like protein

113427044

ribosomal protein L36a


PREDICTED: similar to 60S ribosomal

PREDICTED: similar to peptidylprolyl


113429091

protein L29 (Cell surface heparin-binding

isomerase A isoform 1

27482992

PREDICTED: similar to 40S ribosomal


113429703

protein HIP)
PREDICTED: similar to 40S ribosomal

protein S26

89025350

protein S26 isoform 2


PREDICTED: similar to 60S ribosomal

PREDICTED: similar to peptidylprolyl


113429091

isomerase A isoform 1

protein L29 (Cell surface heparin-binding


113428574

PREDICTED: similar to 40S ribosomal


113429703

PREDICTED: similar to 40S ribosomal

protein S26

89041601

PREDICTED: similar to 40S ribosomal


113429703

protein HIP)

protein S26 isoform 1


PREDICTED: similar to 40S ribosomal

protein S26

88980535

protein S26

Appendix D Concatenated Filtered Alignment

The alignment follows on the subsequent pages.

10

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

20 30 40 50 60 70 80

R A R M E D L L K R R F F Y DQ S F A I Y - - - - - - -G G I T G QY D F G P MG C A L K S NM I NT WR Q F F V L E E QM L E V DC S I L T P E P V L K A S G HV D
R K A V V NT L E R R L F Y I P S F K I Y - - - - - - - S G V A G L F DY G P P G C A I K S NV L S F WR QH F I L E E NM L E V DC P C V T P E V V L K A S G HV D
R E N L E S V L K R R F F F A P A F E L Y - - - - - - -G G V S G L Y DY G P P G C A F QA N I V DV WR K H F I L E E DM L E V DC T M L T P Y E V L K T S G HV D
R T V L D S M L R R R L F Y T P S F D I Y - - - - - - -G G V S G L Y DY G P P G T A L L NN I V D L WR K H F V L E E DM L E V DC T M L T P H E V L K T S G HV D
R T L F E S L L K R R L F Y T E S F E I Y R T S G N L T G D S R G L Y DY G P P G C A L Q S N I V D L WR K H F V L Q E DM L E L DC T I L T P E E V F K T S G HV D
R A K M E D L I K R R F F Y DQ S F A I Y - - - - - - -G G I T G Q F D F G P MG C A L K S NM I H L WK K F F I L Q E QM L E V E C S I L T P E P V L K A S G HV E
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R L K L E D L L K R R F F Y DQ S F A I Y - - - - - - -G G V T G L Y D F G P MG C A L K A NM L QQWR K H F I L E E G M L E V DC T S L T P E P V L K A S G HV D
R L K L E D L L K R R F F Y DQ S F A I Y - - - - - - -G G V T G L Y D F G P MG C S L K A NM L Q E WR K H F I L E E G M L E V DC T S L T P E P V L K A S G HV D
R D S L E QT L K R R F F F A P S F E I Y - - - - - - -G G V A G L F D F G P P G C A F QNNV I DA WR K H F I L E E DM L E V E A T M L T P HDV L K T S G HV D
R E K L E S V L R G R F F Y A P A F D L Y - - - - - - -G G V S G L Y DY G P P G C S F QA NV V DQWR K H F I L E E DM L E V DC T M L T P Y E V L K T S G HV D
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R QQM E DT L K R R F F Y G QA F E L Y - - - - - - -G G V S G L Y D F G P V G C M L K NN I I S E WK QH F I L HDQM L E I E C T M L T P E P V L R A S G H I E
K S T L DA L L A R R F F F A P S F E I Y - - - - - - -G G V A G L Y DY G P T G S A L QA N I L DA WR K HY I I E E DM L E L DT T I MT L S DV L K T S G HV D
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M L E I S A T C L T P Y N P L K A S G HV D
R E A L E N L L K R R F F I A P S F E I Y - - - - - - -G G V A G L F DY G P P G C A L K S E V E S F WR R H F V L A E DM L E I S A T C L T P Y N P L K A S G HV D
R T K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I L QV WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R E S L E QV L K R R F F F A P A F E I Y - - - - - - -G G V S G L Y DY G P P G C A L QA N I MDT WR K H F I L E E DM L E V DC T M L T P H E V L K T S G HV D
R A G L E D L MK R R F F I T Q S F S I Y - - - - - - -G G QA G L Y DY G P P G C A V K A N L I N L WR QH F V L N E DM S E V DC V S V T P E QV L K A S G HV A
R A K M E D L L K R R F F Y DQ S F A I Y - - - - - - -G G I T G QY D F G P MG C A L K S N I L A L WR QY F A L E E QM L E V DC S I L T P E P V L K A S G HV E
R A K M E D L L K R R F F Y DQ S F A I Y - - - - - - -G G I T G QY D F G P MG C A L K S N I L S L WR QY F A L E E QM L E V DC S I L T P E P V L K A S G HV E
QQQ I E Q I L K K R F F I T Q S A Y I Y - - - - - - -G G V S G L Y D L G P P G L S I K T N I L S L WR K H F V L E E DM L E I E T T T M L P HDV L K A S G HV D
K A K L D E I L K QR NMV I Q S Y E I Y - - - - - - -G G I A G L Y DMG P L G C A L K QN I L Q F WR K H F T T Y E N F F E V E G P I L T P K C V L A A S G HT A
R V K M E DT L K R R F F Y DQA F S I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QA WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R E S L E S V L K R R F F Y A P A F E L Y - - - - - - -G G V S G L Y DY G P P G C S F QA N I V DV WR K H F V L E E DM L E V DC T M L T P Y E V L K T S G HV D
R T E F E DT C R R R F F Y G L A F D P Y - - - - - - -G G T A G L Y D L G P T MC A MK S NM L H F WR QH F V I E E S MC E V DT T C L T P E E V F K A S G HV T
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R G A L DT I L R R R M F Y T P S F E I Y - - - - - - -G G V S G L Y DY G P P G C A L QA N I I DA WR K H F V L E DDM L E V DC S V L T P A DV L K T S G HV D
Y E K V F E L A K R R G F L WN S F E L Y - - - - - - -G G S R G F Y DY G P L G S T L K R R I E QV WR E F Y V I Q E G HM E I E C P T I G I E E V F I A S G HV G
R V K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G E C E - - - -G P G L G S L A P WA V S S DR S V L R L P Q S L A G R R C S L G W P E - - - - - - - - -
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QA WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
K G A L E S M L R R R M F F A P S F D I Y - - - - - - -G G V A G L Y DY G P P G C A L QA N I I D I WR K H F V L E E DM L E V DC T A L T P HDV L K T S G HV D
R QA V V NT L E R K L F Y I P S F K I Y - - - - - - -R G V A G L Y DY G P P G C A V K A NV L A F WR QH F V L E E NM L E V DC P C V T P E V V L K A S G HV E
R A K L G Q L L E G R L F Y I P S F K I Y - - - - - - -G G V A G L Y DY G P P G C A V K S NV QQ F WR QH F V L E E S M L E V E C P A V T P E P V L R A S G HV E
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R E A L E QV I K R R F I Y Q P A F S L Y - - - - - - -G G V A G L Y DY G P V G C A I K T N I E QY WR E H F I I E E D L F E I A A T I L T P E P V L K A S G HV D
R E S L E QV L K R R F F F A P A F D I Y - - - - - - -G G V S G L Y DY G P P G C A F QA NV V DT WR K H F V L E E DM L E V DC T M L T P HDV L K T S G HV D
R T K L E N L V K R K F F Y T N S F E I Y - - - - - - -G G A S G L F DY G P S G C L L K S E L E N L WR C H F I Y Y D E M L E I S G S C V T P Y QV L K T S G HV D
R T K I DN L A K R K L F Y T N S F E I Y - - - - - - -G G S S G L I DY G P S G C L L K S E L E N L WR Y H F I F Y D E M L E I S A T C I T P Y T V L K T S G HV D
R S K L E S L I K R R L F Y T N S F E I Y - - - - - - -G G V S G L I DY G P S G C L L K Y E L E K L WR NH F V F Y D E M L E I K G T C I T P Y S V L K T S G HV D
R QA V V NT L E R R L F F I P S F K I Y - - - - - - -R G V A G L Y DY G P P G C A V K S NV L A F WR QH F V L E E NM L E V DC P C V T P E V V L K A S G HV D
R A K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P V G C A L K NN I I QT WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R DK L E S T L R R R F F Y T P S F E I Y - - - - - - -G G V S G L F D L G P P G C Q L QNN L I R L WR E H F I M E E NM L QV DG P M L T P Y DV L K T S G HV D
R T Q F E E L MK K R F F F S P S F Q I Y - - - - - - -G G I S G L Y DY G P P G S A L Q S N L V D I WR K H F V I E E S M L E V DC S M L T P H E V L K T S G HV D
R A K M E DT L K R R F Y Y DQ S Y A I Y - - - - - - -G G V S G L Y D F G P T G C A MK A N F I N I WR NH F I I E E G M L E V D S A I L A P E NV F K A S G HV E
R T K M E DT L K R R F F Y DQA F A I Y - - - - - - -G G V S G L Y D F G P MG C A L K NN I L QV WR QH F I Q E E Q I L E I DC T M L T P E P V L K T S G HV D
R K Y F E D L I K R R Y F F NQG F E I Y - - - - - - -G G V A G L Y DY G P P G C A I K NN L L K L WR E H F I L E E DM L E I S S T C I T P Y P V F K A S G HV D
K V DC E N L L R R R F F Y T N S F E I Y - - - - - - -G G S A G L F D F G P P G C A L K S E L E R L WR E H F V V F D E M L E V S C T C I T P H P V L K S S G HV D
K V DC E N L L R R R F F Y A N S F E I Y - - - - - - -G G S A G L F D F G P P G C A L K S E L E R L WR E H F I V F D E M L E V S C S C I T P H P V L K S S G HV D
R A T A E D L E V S G F F WV P S F E I Y - - - - - - -G S V A G I Y D L G P T G C A I E R N F L QK WR DH F V L E DDM L E V R C S A L T P R P V L DA S G HT E
R S E F E DT C R R R F F F G L A F D P Y - - - - - - -G G S A G L Y DMG P P L C A MK A N L L A HWR QH F V L A E S MC E V DT T C L T P Q E V F V T S G HV T
R A E F E DT C R R R F F F G L A F D P Y - - - - - - -G G S A G L Y D L G P P L C A MK A N L L S Y WR QH F V L E E NMC E V DT T S L T P E E V F K A S G HV V
R S Q L E V L MT K R F F Y I Q S F E I Y - - - - - - -G G V G G L Y DY G P T G A A L QA N I I NQWR NH F I I E E E M L E L DT T I MT L S DV L K T S G HV D
R E T L DA V L K R R F F Y A P A F E I Y - - - - - - -DG V S G L Y DY G P P G C A L QT R I I DT WR DH F V L E DDM L E V DT T M L T P H E V L K T S G HV D

90

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

100 110 120

R F A D L MT K DV K NG E C F R L D P I T G ND L T E P I E F N L M F G T Q I G P L R
K F T D L MV K D E K T G T C Y R A D P DT K N P L S D P Y P F N L M F QT S I G P MR
K F S DWMC QD P K S G E I F R A D P V T G E T L E P P K A F N L M F E T A I G P L R
K F A DWMC K D P K T G E I F R A D P T T DG N L L P P V A F N L M F QT S I G P L R
K F E DWMC K D F K K G D F L R A D P DG DA P V S S P V P F N L M F K T T V G P L R
R F A D L MT K D I K T G E C F R L D P I S G ND L T P P I E F N L M F NT Q I G P L R
K F A D F MV K D L K NG E C F R A D P T T G ND L S P P V P F N L M F K T F I G P L R
R F A DWMV K DT K NG E C F R A D P I T G ND L T E P I A F N L M F P T Q I G P L R
R F A DWMV K DMK NG E C F R A D P I T G ND L T E P I A F N L M F P T Q I G P L R
R F S DWMC K D L K T G E I F R A D P S T G G K L E P P V E F N L M F DT A I G P L R
K F S DWMC R D L K T G E I F R A D P V T G E P L E P P MA F N L M F E T A I G P L R
K F A D F MV K DV K NG E C F R A D P NT G ND L S P P V S F N L M F K T F I G P L R
R F A D L MV K D E K T G A C F R A D P V T G N E I S D P MD F N L M F QT T I G P L R
K F A DWMV K DV K NG E I Y R A D P T T G N E V S E P V E F N L M F E S N I G P L R
R F T D S M I T D I K T N E Y Y R A D P - S G G E W S E P Y P F N L M F R T K I G P MR
R F T D S M I T D I K T N E Y Y R A D P - S G G E W S E P Y P F N L M F R T K I G P MR
K F A DY MV K DV K NG E C F R A D P S T G ND L T P P I S F N L M F QT S I G P L R
K F A DWMC R D L K T G E I F R A D P A T DG P L E L P I E F N L M F E T A I G P L R
K F A D F MV K D E V T K A F F R A D P E T G NA L T E P Y P F N L M F QT Q I G P L R
R F A D L MV K DV K T G E C F R L D P L T G ND L T E P I E F N L M F A T Q I G P L R
R F A D L MV K DV K T G E C F R L D P L T G ND L T E P I E F N L M F A T Q I G P L R
K F C D I L V F D E V S G DC F R A DT - L G NK L S K S QQ F N L M F G T Q I G Y L R
K F S DY MV K D L K NG C C Y R A D P DT G ND L S E P L A F N L M F A T D I G P L R
K F A D F MV K DMK NG E C F R A D P I T G ND L S P P V S F N L M F K T S I G P L R
K F A D F MV K DV K NG E C F R A D P I T G ND L S P P V S F N L M F K T F I G P L R
K F S DWMC K D P K T G E I F R A D P V S G DK L E P P R A F N L M F E T A I G P L R
R F NDV MV R DT V T G E C I R A D P -K G N P F S D P F P F N L M F A T H I G P MR
K F A D F MV K DV K NG E C F R A D P I T G ND L S P P V S F N L M F K T F I G P L R
K F A DWMC K D P K T G D I F R A D P A T G L L P T P P V S F N L M F S T S I G P L R
G F S D P L C E C MNC K E A F R A D P E C G G E F E DA Y E F N L M F K T T I G P L R
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S WA WR R E V A G L C
K F A D F MV K DV K NG E C F R A D P T T G ND L S P P V P F N L M F QT F I G P L R
K F A DWMC K D P K NG D I L R A D P A T G V Q P E P P V A F N L M F QT A I G P MR
K F T D L MV K D E K T G T C Y R A D P DT K N P L S D P Y P F N L M F QT S I G P MR
K F T D L MV NDV V T K DC F R A D P V T G ND L S E P Y P F N L M F P T Q I G P L R
K F A D F MV K DV K NG E C F R A D P I T G ND L S P P V S F N L M F K T F I G P L R
R F T D L L V C D S K T G T G Y R A D P E T G ND L T D P T P F N L M L P T I I G P L R
K F A DWMC K D L K T G E I F R A D P A T G G K L E P P V E F N L M F E T A I G P L R
R F T D L M I R DV V T NDC Y R A D P - L K ND L S E P F P F N L M F QT K I G P L R
R F T D L M I R DA V T G D F Y R A D P -G K ND F V G P F P F N L M F QT R I G P L R
R F T D L M I K D I V T K DC Y R A D P - E K ND L S D P F P F N L M F QT K I G P L R
K F T D L MV K D E K T G T C Y R A D P DT K N P L S D P Y P F N L M F QT S I G P MR
K F A D F MV K DV K NG E C F R A D P T T G ND L S P P V P F N L M F QT F I G P L R
K F T DWMC R N P K T G E Y Y R A D P V T NDV L DA L T S F N L M F E T K I G A L R
K F A DWMC K D P A T G E I F R A D P A T NG E L E T P R Q F N L M F E T Q I G P L R
R F A D F MV K DG K T G E C F R A D P T T NND L S D P M E F N L M F A T A I G P L R
K F A DY MV K DV K NG E C F R A D P T T G ND L T P P I S F N L M F QT S I G P L R
R F T D L MV K DV K NG A G HR A D P DT G N E L G F P E P F N L M F G T P I G P L R
R F T D L MV K N L S NG DC Y R A D P - E G D E F S K P F P F N L M F S T S I G P L R
R F T D L MV K N L S NG DC Y R A D P - E G ND F S K P F P F N L M F S T S I G P L R
K F ND L M L T DMT T K A L Y R A D P - E G N E F S E P A P F N L M F NT R V G P L R
R F NDV MV R DT V T G E C I R A D P -K G NA L S D P F P F N L M F S T S I G P MR
R F NDA MV R DT V T G E C I R A D P -K G NA L S E P F P F N L M F S T S I G P MR
K F A DWMC K DT K T G E I F R A D P E S G N E V S E P V E F N L M F E S Y I G P L R
K F A DWMC R D L A S G E I F R A D P V T G G P L E K P M E F N L M F E T A I G P L R

130 140 150 160

P E T A QG I F V N F K R L L E F -NQG R L P F A A A Q I G N S F R N E I S
P E T A QG I F V N F K D L Y Y Y -NG K K L P F A A A Q I G QA F R N E I S
P E T A QG Q F L N F NK L L E F -NNG K T P F A S A S I G K S F R N E I S
P E T A QG Q F L N F QK L L E F -NQQ S M P F A S A S I G K S F R N E I S
P E T A QG Q F L N F K K L L DY -NQN S M P F A S A S I G K S F R N E I S
P E T A QG I F V N F K R L L E F -NQG R L P F A A A Q I G N S F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG I F V N F K R L L E F -NQG K L P F A A A Q I G L G F R N E I S
P E T A QG I F V N F K R L L E F -NQG K L P F A A A Q I G L G F R N E I S
P E T A QG Q F L N F NK L L E F -NNDK M P F A S A S I G K S F R N E I A
P E T A QG Q F L N F NK L L E F -NNG K T P F A S A S I G K S F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F S A V Q I G M S F R N E I S
P E T A QG H F V N F A R L L E F -NNG K V P F A S A Q I G K S F R N E I A
P E T A QG I F V N F K R L Y E Y -NG K K L P F S V A Q I G L G F R N E I A
P E T A QG I F V N F K R L Y E Y -NG K K L P F S V A Q I G L G F R N E I A
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG Q F L N F S K L L DC -NN E K M P F A S A S I G K S F R N E I S
P E T A QG I F T N F G K L Y E Y -NG K K L P F A A A Q I G NA F R N E I A
P E T A QG I F V N F K R L L E F -NQG K L P F A V A Q I G N S F R N E I S
P E T A QG I F V N F K R L L E F -NQG K L P F A V A Q I G N S F R N E I S
P E T A QG Q F L N F K K L C E Y -NNDK L P F A S A S I G K A Y R N E I S
P E T A QG I F T M F K R N L E F -NG G K V P F G V T Q I G NV F R N E I A
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG Q F L N F NK L L E F -NNG K T P F A S A S I G K S F R N E I S
P E L A QG I I L N F K R L MD S G NA QR M P F A G A C V G T A F R N E I A
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG Q F L N F A K L L E Y -NNQQM P F A S A S I G K S Y R N E I S
P E T A QG M F V D F QR L S R F -Y R DK L P F G A V Q I G K S Y R N E I A
PAW PR A L LC - - - - LGT T - PGGR LAV A - - - - - - - - - - - -
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG Q F L N F A K L L E Y -NA G NM P F A S A S I G K S Y R N E I A
P E T A QG I F V N F K D L Y Y Y -NG QK L P F A A A Q I G QA F R N E I S
P E T A QG I F V N F R D L L Y Y -NG G K L P F A A A Q I G Q S F R N E I A
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG M F L N F A R L L E Q -NG G R V P F G A A Q I G L G F R N E I A
P E T A QG Q F L N F A K L L E F -NN E K M P F A S A S I G K S F R N E I A
P E T A QG I F V N F K K L L E Y -NG G K T P F A G A Q L G L G F R N E I S
P E T A QG I F V N F K K L L E Y -NG G K M P F A G A Q I G L G F R N E I S
P E T A QG I F V N F K K L L E Y -NG G K M P F A G A Q I G L G F R N E I S
P E T A QG I F V N F K D L Y Y Y -NG NK L P F A A A Q I G QA F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG Q F L N F NK L L E I -NQG K I P F A S A S I G K S F R N E I S
P E T A QG Q F L N F S R L L E F -NNG K V P F A S A MV G K A F R N E I S
P E T A QG I F V N F K R L L E F -NQG R L P F G A A Q I G T A F R N E I S
P E T A QG I F L N F K R L L E F -NQG K L P F A A A Q I G N S F R N E I S
P E T A QG M F V N F NR L N E F -NG G R I P F A A A Q I G L G F R N E I A
P E T A QG I F V N F NR L L E F -NG G K I P F A A A Q I G L G F R N E I S
P E T A QG I F V N F T R L L E F -NG G K I P F A A A Q I G L G F R N E I S
P E T A QG I F V N F T R L L NA -NR G S L P F A A A QV G A G Y R N E I S
P E L A QG I I L N F K R L L DT G NA QR M P F A C A S I G T A F R N E I A
P E L A QG I I L N F K R L L D S G NA QR M P F A G A C I G T A F R N E I A
P E T A QG H F V N F QR L L E F -NNG R V P F A S A Q I G K S F R N E I S
P E T A QG Q F L N F NK L L DC -NNT K M P F A S A S I G K S F R N E I S

170

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

180 190 200 210 220 230 240

P R S G L I R V R E F T MC E I E H F C D P Q -A K NH P K F E NV A DT V MT L Y S A C NQA V - -A S G L V A N E T L G Y F MA R I QMY L HR I G I L P E R L R
P R QG L L R V R E F T L A E I E H F V D P E -NK S H P K F S DV A K L E F L M F P R E E QA V - -A K G T V NN E T L G Y F I G R V Y L F L T R L G I DK E R L R
P R S G L L R V R E F L MA E I E H F V D P E -NK NH P R F D E V K N L K L K F L P K G V QA V - -A S G M I DNQT L G Y F I A R I Y Q F L T K I G V D E E K L R
P R A G L L R V R E F L MA E I E HY V D P E G G K K HHR F E E V K D I E MA F L NR NV QA V - - E T G MV DN E T L G Y F I A R I Q L F L L K L G V D P NK L R
P R S G L L R V R E F L MA E I E H F V D P E G G K K HA K F D L V K D L Q L S F L DR A T QA V - - S S K MV DN E T L G Y F L G R I Y I F L L K I G V DT NK V R
P R S G L L R V R E F T MC E I E H F C D - - -V K E H P K F E S V K NT QM L L Y S A DNQA V - - S K G I V NN E T L G Y F MA R I HMY M L A V G I D P K R L R
P R S G L I R V R E F T MA E I E H F V D P S - E K DH P K F QNV A D L Y L Y L Y S A K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L V K V G V S P E K L R
P R QG L I R V R E F T MC E I E H F V D P E -DK R F P K F A K V A D E K L V L F S A C NQA V - -A NK T V A N E T L G Y Y MA R C HQ F L MK V G I DG R R L R
P R QG L I R V R E F T MC E I E H F V D P E -DK S L A K F A K V A DQK L V L F S A C NQA V - -A K K T V A N E T L G Y Y MA R C HQ F L MK V G I DG R R L R
P R A G L L R V R E F L MA E I E HY V D P E - S K S H P K F E DV K D I K L K F L P K NV QA V - - S S G MV DN E T L G Y F I A R I Y L F L V K I G V DT NR L R
P R A G L L R V R E F L MA E I E H F V D P L -DK S H P K F H E V K D I K L S F L P R N I QA V - -A S K MV DN E T L G Y F I A R I Y L F L I K I G V DDT K L R
P R S G L I R V R E F T MA E I E H F V D P S - E K E H P K F QNV A D L H L Y L Y S A K A QA V - -DQG V I NN S V L G Y F I G R I Y L Y L T K V G V S P DK L R
P R S G L I R V R E F QMG E I E H F V D P L -R K E H P L F S T V K D I K V P L Y S S K A QA V - - E DG T I DN E T L G Y F MG R I Y L F C V K V G I D P F K F R
P R QG L L R V R E F T MA E I E HY V D P L -DK R HA R F N E V K DV V L T L L A K G V QA V - -A E G I V DN E T L G Y F L G R T Q L F L T K I G I D P A R L R
P R NG L L R V R E F QMA E I E H F I H P D -R K DH P K F DDV A F K C L P L Y S S K T QA V HG E E K I I NN E T L A Y F L S R T Y D F L I S I G I N P DG I R
P R NG L L R V R E F QMA E I E H F I H P D -R K DH P K F DDV A L K C L P L Y S S K T QA V HG E E K I I NN E T L A Y F L S R T Y D F L I S I G I N P DG I R
P R S G L I R V R E F T MA E I E H F V D P N - E K V H F K F S NV A D L D I M L Y S S K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L V K V G V A K DK L R
P R A G L L R V R E F L MA E I E HY V D P D -NK S H S R F D E I K D L K L K F L P K G V QA V - - S S G MV DN E T L G Y F L A R I Y S F L I K I G V D P S R L R
P R A G L L R V R E F T MA E I E H F V N P N -NK T H P K F N E I K DV E A N L L S S D S QA V - - E K K L I DN E T L A Y F MA R T QQ F L HT V G I K P A G L R
P R S G L I R V R E F T MA E I E H F C D P V - L K DH P K F G N I K S E K L T L Y S A C NQA V - -A S K L V A N E T L G Y Y MA R I QQ F L L A I G I K P E C L R
P R S G L I R V R E F T MA E I E H F C D P T -QK DH P K F G NV K D E K MT L Y S A C NQA V - - S A K L V A N E T L G Y Y MA R I QQ F L L A I G I K P E C L R
P R S G L L R V R E F DQA E I E H F V L T D - E K DH P K F S T V QG I K L K L MHHDA S A I - - E R G I V C N E T MG Y Y I G R T A L F L I E L G I DR E L L R
P R NG L L R V R E F T L A E I E Y F V L P D -K K T H S N F S DV E N L S V Q L Y P R E L QA V - -NDG I I N S Q L L A Y F MG R T F K F L I E L G I P A E H I R
P R S G L I R V R E F T MA E I E H F V D P S - E K NH P K F Q S V A D L N I L L Y S S K A QA V - -QQG V I NN S V L G Y F I G R I Y L F L T K V G V S P DK L R
P R S G L I R V R E F T MA E I E H F V D P S - E K DH P K F QNV A D L H L Y L Y S A K A QA V - - E QG V I NNT V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R S G L L R V R E F L MA E I E H F V D P N -DK S HK R F QD I K D I K L K F L P R E V QA V - -A T K L V DN E T L G Y F I A R I Y Q F L I K I G V D P E R L R
P R S A L I R V R E F T L A E I E H F V N P S -NK NH E K F DR V R DV E I WA W P R H F QA V - - E A K V I DNQT L G Y F MG R V A L F L T S I G V - -R F Y R
P R S G L I R V R E F T MA E I E H F V D P S - E K DH P K F QNV A D L H L Y L Y S A K A QA V - - E QG V I NNT V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R S G L L R V R E F L MA E I E H F V D P E S G K K H P R F A E V A D I E L E L L DR E T QA V - -K DG L V DN E T L G Y F L A R I H L F L E K I G V DK S K L R
P R QG V I R L R E F T QA E C E L F V D P R -NK K H P N F E R F A DK E L V L Y S QA A QA V - - E T G V I A H E I L G Y N I A L T N E F L T K V G I D P E K L R
- - - - - - - - - - - - - - - - -R I C S P R -C QK P P - - - - - - - - - L E L L T S S L R P - - -A R G V I NN S V L G Y F L G R I Y L F L T K A G V C A E R L R
P R S G L I R V R E F T MA E I E H F V D P T - E K DH P K F Q S V A D L C L Y L Y S A K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R G G L L R V R E F L MA E I E H F V D P A G HK K H E R F H E V A D I E L A L L DR NV QA V - -K QK I V DN E T L G Y F L A R I H L F L K K I G V DQ S K I R
P R QG L L R V R E F T L A E I E H F V D P E -DK S H P K F V DV A D L E F L M F P R E L QA V - - S K G T V NN E T L G Y F I G R V Y L F L T R L G I DK NR L R
P R A G L L R V R E F T QA E I E H F V H P E -HK E H P R F A E V A DT V L S L F S QDA QA V - - S K G I I A N E T L G Y F I A R C H L F L V Q I G I DT NR L R
P R S G L I R V R E F T MA E I E H F V D P S - E K DH P K F QNV A D L H L Y L Y S A K A QA V - - E QG V I NNT V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R G G L L R C R E F QMA E I E Y F V D P T E K S T F K K F NK Y I N L E I P L L S R Q L QA V - -K E G I I NN E T L A Y F I C R T Y L Y L V E I G I N P V N I R
P R A G L L R V R E F L MA E I E H F V D P N -DK S H P K F K DV QD I K L R F L P K DV QA V - - S S G MV DNQT L G Y F L A R V Y Q F L I K V G V DT DR L R
P R NG L L R V R E F QMA E I E Y F V N P K -K K NH E K Y Y L F K Y L M L P L Y P R DNQA V - - E K N I I A N E A L A Y F L A R T Y L F L L K C G I NK DG I R
P R NG L L R V R E F E MA E I E Y F V N P E -K K C H E K Y H L F K H L I L P L Y P R E E QA V - -T K G I I A N E A L A Y F L A R T Y L F L L K C G I NK DG L R
P R NG L L R V R E F E MG E I E Y F F N P E -K S K H E K Y D L Y K H L V L P L Y P R T NQA V - -NNG I I C N E A L A Y F L A R T Y L F L L K C G I K K DG I R
P R QG L L R V R E F T L A E I E H F V D P E -DK S H P K Y S E V A D L E F L M F P R E QQA V - - S K G I V NN E T L G Y F I G R V Y L F L T H L G I DK DR L R
P R S G L I R V R E F T MA E I E H F V D P T - E K DH P K F P S V A D L Y L Y L Y S A K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L T K V G I S P DK L R
P R S G L L R V R E F L MA E I E H F V D P L -NK S HA K F N E V L N E E I P L L S R R L QA V - -N S G MV E N E T L G Y F MA R V HQ F L L N I G I NK DK F R
P R S G L L R V R E F L MA E V E H F V D P K -NK E HDR F D E V S HM P L R L L P R G V QA V - -K K G I V DNT T L G Y F MA R I S L F L E K I G I DMNR V R
P R S G L L R V R E F T MC E I E H F I D P T -NK DH P K F DT V A N L A I P L F P V DR QA V - - E K G M I K S R V L G Y F MG R T F L F M I K V G I D P K K L R
P R S G L I R V R E F T MA E I E H F V D P K - E K V HQK F A NV A D L E I L L Y S S K A QA V - - E QG V I NN S V L G Y F I G R I Y L Y L I K V G V A K DK L R
P R NG L Y R V R E F DMA E I E H F F D P K -R P E H P K F K Y V K D L K L P L L T A K S QA V - -K S G T V S N E T HA Y F I G R T F L F L V E A G V NQNN I R
P R NG L L R V R E F P MA E I E Y F V N P K - F K T H E K F P E F K NT V L P L L T R DQQA V - - S S G I V G N E A L A Y F L A R T F L F L K R V G I N E A G L R
P R NG L L R V R E F P MA E I E Y F V N P K - F K T H E K F P E F S HV V L P L V T R DQQA V - - S S G MV G N E A L A Y F L A R T F L F L K R V G I N E A G L R
P R NG L V R C R E F QMA E I E H F A D P E Q L NN F P K F E T V K N L K V K L F P A S I QA I - -A QHV V S HK T L G Y Y I G R V Y L F L C E I G I Q P DT I R
P R A N L I R V R E F T L A E I E H F V N P N -DK T H E K F A L V K DV E I WMWA R K QQA V - -A QK I I DN E T L A Y F I A R T A Q F L E A V G A - -R Y V R
P R A N L I R V R E F T L A E I E H F V N P N -DK S H E K F E S V R G T E F WA W S R E L QA V - -A K K I I DN E T L G Y F I A R T V L F L E A V G L - -R F L R
P R A G L L R V R E F T MA E I E H F V D P E -DK NHDR F D E V K H I NV P L L A K DV QA V - - S A G I I DNQT L G Y F I G R I Y L F L V K I G I DA T R L R
P R S G L L R V R E F T MA E I E H F V D P L -DK DHHR F D E V K DV K L R F L A K DV QA V - - E T G L V DNK T L G Y F L A R I Y L F L I K I G V N P DR L R

260

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

270

F R QHMG N E MA HY A C DC WDA E C L T - S Y G W I
F R QH L A N E MA HY A A DC WDA E I E S - S Y G W I
F R QHM S N E MA HY A T DC WDA E L K T - S Y G W I
F R QHMA N E MA HY A A DC WDA E L L T - S Y G W I
F R QHMA N E MA HY A T DC WDA E L QT -T Y G W I
F R QHMG N E MA HY A C DC WDA E C L S - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QH L S N E MA HY A QDC WDA E I L T - S Y G W I
F R QH L S N E MA HY A QDC WDA E I L T - S Y G W I
F R QHM S N E MA HY A S DC WDA E L E T - S Y G W I
F R QHMA N E MA HY A A DC WDA E L K T - S F G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QHMQN E MA HY A C DC WDA E C K T - S Y G WV
C R QHMA N E MA HY A T DC WD F E I Q S - S Y G W I
F R QH L S T E MA HY A S DC WDA E V L T - S Y G W I
F R QH L S T E MA HY A S DC WDA E V L T - S Y G W I
F R QHMDN E MA HY A C DC WDA E T K T - S Y G W I
F R QHM S N E MA HY A A DC WDA E L HT - S Y G W I
F R QHQK N E MA HY A QDC WDA E I L S - S Y G WV
F R QHM S N E MA HY A C DC WDA E I L T - S Y G WV
F R QHM S N E MA HY A C DC WDA E I L T - S Y G WV
F R QHK K D E MA HY A K G C WDA E I Y T - S Y G W I
F R QH L K T E MA HY A K DC WDA E I R L - S Y G WV
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QHMA N E MA HY A A DC WDA E L QT - S Y G W I
F R QHQ S T E MA HY A QDC WDA E L L T - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QHMA N E MA HY A C DC WDA E L L T - S Y G W I
F R QH L T D E MA HY A I DC WDA E I E T DR F G WV
F R QHMDN E MA HY A C DC WDA E A R T - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QHMG N E MA HY A C DC WDA E L L T - S S G WV
F R QH L P N E MA HY A A DC WDA E I E C - S Y G W I
F R QH L K H E MA HY A A DC WDA E I QC - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QHQA D E MA HY S S DC WDA E I E M - S S G WV
F R QHMG N E MA HY A S DC WDA E L QT - S Y G W I
F R QH L K T E MA HY A NDC WDA E I L T - S F G F I
F R QH L P T E MA HY A NDC WDA E I L T - S Y G F I
Y R QH L E K E MA HY A NDC WDA E I L T - S Y G Y I
F R QH L A N E MA HY A A DC WDA E I E S - S Y G W I
F R QHM E N E MA HY A C DC WDA E S K T - S Y G W I
F R QH L K N E MA HY A T DC WDG E I L T - S Y G W I
F R QHM S N E MA HY A C DC WDA E I QC - S Y G W I
F R QHM F N E MA HY A T DC WDA E T K T - S Y G WV
F R QHMDN E MA HY A C DC WDA E T K T - S Y G W I
F R QHM S N E MA HY A C DC WDA E I E F - S HG F K
F R QHMA N E MA HY A S DC WDA E I L T - S Y G WV
F R QHT A N E MA HY A S DC WDA E I L T - S Y G WV
F R MHR K N E MA HY A R E C WDA E I Y T K T L G W L
F R QH L R N E MA HY A QDC WDA E L L T - S Y G WV
F R QHQR D E MA HY A QDC WDA E L L T - S Y G WV
F R QHM S N E MA HY A S DC WDA E I HT - S Y G W I
F R QHM S N E MA HY A T DC WDA E L HT - S Y G W I

280 290 300 310 320 330

E C V G C A DR S A Y D L T QHT NA T - - - - - - -G V K L V A E K K L P A P K A A I G K A F K K E A K A
E C V G I A DR S A Y D L R A H S DK S - - - - - - -G T P L V A E E K F A E P K K E L G L A F K G NQK N
E C V G C A DR S A Y D L T V HA NK T - - - - - - -K T A L V V R E K L DV P K K L F G P K F R K DA P K
E C V G C A DR S A Y D L T V HK NK T - - - - - - -G A P L V V R E P R A E P K K K F G P R F K K DG K A
E C V G C A DR S A Y D L T V H S R K T - - - - - - -K E P L V V R E P R R E P K P K L G P L F K K NA K A
E C V G C A DR S A Y D L T QHT K A T - - - - - - -G I R L A A E K K L P A P K A A I G K A F K K D S QA
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K L
E C V G NA DR A C Y D L QQHY K A T - - - - - - -NV K L V A E K K L P E P MA L L G K K Y K K E A K K
E C V G NA DR A C Y D L QQHY K A T - - - - - - -NV K L V A E K K L P E P MA L L G K S F K K DA K K
E C V G C A DR S A Y D L S V H S A R T - - - - - - -G E K L V A R QT L A E P K K K F G P K F R K DA G T
E C V G C A DR S A Y D L T V HA NK T - - - - - - -K E K L V V R QK L E T P K K L F G P K F R K DA P K
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K L
E C V G C A DR S C Y D L K C H S QA A - - - - - - -K V N L S A E R P L P E P K QA V G K A F K K DA K K
E C V G C A DR S A Y D L T V H S V R T - - - - - - -K Q P L R V QQR L DQ P A K A F G MK F K K DA T M
E C A G HA DR S C Y D L L QH S K A T - - - - - - -K T D L F A S E K Y D E P K P L I G K T F K Q E A S L
E C A G HA DR S C Y D L L QH S K A T - - - - - - -K T D L F A S E K Y D E P K P L I G K T F K Q E A S L
E I V G C A DR S C Y D L L C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K F
E C V G C A DR S A Y D L S V H S A R T - - - - - - -N E K L V V R Q P L P E P K K K F G P K F R K DA G T
E C V G HA DR S C Y D L K V HA T E S - - - - - - -K S N L S A Y E E F K E P P G A I S K K HR A A V S P
E C V G C A DR S A Y D L G QHT A A T - - - - - - -G V R L V A E K R L P A P K QA L G K T F K K E A K N
E C V G C A DR S A Y D L G QHT A A T - - - - - - -G V R L V A E K R L P A P K QA L G K T F K K E A K T
E C V G I A DR A C Y D L S C H E DG S - - - - - - -K V D L R C K R R L A E P K K E WG A K L R DR F S V
E C V G HA DR G D F D L S NHA R C S - - - - - - -K V DQ S V F I A Y D E P K G V MG K K Y K K D S QK
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L I A E K L L K E P K G A I G K A Y K K DA K V
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K L
E C V G C A DR S A Y D L T V H S NK T - - - - - - -K E K L V V R E A L E T P K K L F G P K F R K DA P K
E C V G I A DR S A Y D L T QH S NA S - - - - - - -K K D L C A R E E Y D E P K G L I G K T F G K K A G E
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K L
E C V G C A DR S A Y D L S V HA K K T - - - - - - -NA P L I V R QR L P E P K K K F G P K F K K DA K A
E I V G I A DR T DY D L K A HA R V S - - - - - - -K T D L Y V Y V E Y D E P MG K L G P L F K G K A K A
E I V G C A DR S C Y D L T C H S R A T - - - - - - -K V P L V A E K L L R E P K A A I G R T Y K K DA R L
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A V G K A Y K K DA K L
E C V G C A DR S A Y D L T V HA K K T - - - - - - -G A P L V V R E T L E T P S K K F G P T F R K DA K T
E C V G I A DR S A Y D L R A H S DK S - - - - - - -G V P L V A H E K F S K P K K D L G L A F K G NQK M
E C V G L A DR S A F D L K A H S DK S - - - - - - -K V D L V A Y E R F DK P K K V L G K A F K K DA K P
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A I G K A Y K K DA K L
E C V G L A DR S A Y D L NA H S E A T - - - - - - -G QK L QA A R K F K V P K QK I G K E L K K DG MA
E C V G C A DR S A Y D L S V HA A R T - - - - - - -NA S L V V R Q P L P E P K K K F G P K F K K DG G A
E V V G HA DR S A Y D L QHHMK Y T - - - - - - -G A N L Y A C E K Y N E P K A K I G HT F K S E QNK
E V V G HA DR S A Y D L K NHMK V T - - - - - - -G A N L Y A C E K Y DT P K A K MG MK F K S QQNV
E C V G HA DR S A Y D L K HHMNA T - - - - - - -G S N L Y G C QK Y DK P K A K I G MK F K S DQNK
E C V G I A DR S A Y D L R A HT DK S - - - - - - -G V P L V A H E K F S E P K K E L G L S F K G NQK K
E I V G C A DR S C Y D L S C HA R A T - - - - - - -K V P L V A E K P L K E P K G A V G K A Y K K DA K L
E C V G C A DR A A F D L T V H S K K T - - - - - - -G R S L T V K QK L DT P K K F F G S K F K QK A K L
E C V G C A DR S A Y D L S V H S K A T - - - - - - -K T P L V V Q E A L P E P R K K F G P R F K R DA K A
E C V G NA DR S C Y D L T C HA K H S - - - - - - -K V A MV A E K K L P E P K S V MG K A F K K E A K V
E I V G C A DR S C Y D L A C HA R V T - - - - - - -K V P L V A E K P L K E P K G A L G K A Y K K DA K I
E C V G L A NR S A F D L E S HT K G S - - - - - - -G V K L L A A R R L P E P K K E I F K A L K G DG N E
E V V G HA DR MA Y D L MC H S K S T - - - - - - -N S Q L V A HHR Y DN P K P E I G K A F K S DQK I
E V V G HA DR MA Y D L MC H S K S T - - - - - - -N S Q L V A HHR Y DT P K P E I G K A F K A DQK I
E C V G I A DR Q S WD L S R HA K Y T T K K G DA E S S P L Y L S A P L DT P K S A I G K I F R K DA K E
E C V G V A DR S A Y D L T QH S G A T - - - - - - -K K D L C A R E E F A E P K G A I G K A F G K NA G D
E C V G I A DR S A Y D L T QH S A A S - - - - - - -K K D L F A R E E F A E P K G A I G K A F G K NA G E
E C V G C A DR S A Y D L T V H S K R T - - - - - - -K K D L V V QK A HK E P K K N L G P K F K K DA K F
E C V G C A DR S A Y D L S V H E A R T - - - - - - -K V K L QV QQK L DA P K K K F G P L L K K A A K P

340

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

350 360 370 380 390 400 410

I T E A L P S V I E P S F G I G R I MY S L L E H S F R MR R S Y F S L P P V V A P L K C S V L P L S NNA E F A P F V K K I S S A L T S V DV S HK V DD S S G S I
V V E S L P S V I E P S F G I G R I I Y C L Y E HC F S T R L N L F R F P P L V A P I K C T V F P L V QNQQ F E E V A K V I S K E L A S V G I S HK I D I T G T S I
V E NY L P NV I E P S F G I G R I I Y A I F E H S F W S R R S V L S F P P L V A P T K V L L V P L S NNA D L A E V V T E V S R V L R K E Q I P F K V DD S G V S I
V A A A V P NV I E P S F G I G R I L Y S M I E HV Y WA R R G V L S F P P A I A P T K V L I V P L S T HA S F R P L L QQ L MT K L R R MG I S NR V DD S S A S I
V E A A L P NV I E P S F G F G R I F Y S L L E HV Y WHR R G V L S L P I S V A P T K V L I V P L S T HQD F V P I T K R I T E D L R E L G I S C R A D E S S A S I
I NDT L - - - - - - - - - - - - - - - - - - -N S F - - - - - - - - - - - - -A P MK C V V L P L S G NA E F Q P F V R D L S Q E L I T V DV S HK V DD S S G S I
V M E Y L P S V I E P S F G L G R I MY T V F E HT F QV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R HG I S HK V DD S S G S I
I QT A L P S V I E P S Y G I G R I MY A L L E H S F R QR R T F L A F K P L V A P I K C S I L P I S A NDT L V P V MDA V K E E L S HY E L S Y K V DD S S G T I
I QT S L P S V I E P S Y G I G R I MY A L L E H S F R QR R T F L A F K P L V A P I K C S V L P I S A NDT L I P V MDA V K E E L S R F E M S Y K V DD S S G T I
V E K W L P NV I E P S F G I G R I L Y S I F E HQ F WC R R G V L S L P P I V A P T K V L L V P L S NN S E L Q P I V K K V S QA L R K E K I P F K V DD S S A S I
V E A Y L P NV I E P S F G I G R I I Y S I F E H S F W S R R A V L S F P P L V A P T K V L L V P L S NHK D L A P V T A QV S K I L R K E Q I A F R V DD S G V S I
V L E Y L P NV I E P S F G L G R I MY T V F E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG V S HK V DD S S G S I
V QA H L P S V I E P S F G F G R L L Y S T L E HN F K I R R T F F S L P A V I A P Y K C S L L P L S NK P D F N P F I T S L S L A L K K L G I S HK V DT S S G S I
I K E T L P NV I E P S F G I G R I L Y C V L E HT Y WA R R G V L S L P A L V A P I K C L I V S I S QDA Q L R S K I H E V S R E MR K R G I A S R V DD S S A T I
V T E A L P G V I E P S F G I G R I I Y C L L E H S F K I R R S Y L S L P A L I A P V K C S I L P I S S NA I F ND L I N L L HK S F I NHG I S C K V DT S S A S I
V T E A L P G V I E P S F G I G R I I Y C L L E H S F K I R R S Y L S L P A L I A P V K C S I L P I S S NA I F ND L I N L L HK S F I NHG I S C K V DT S S A S I
A M E Y L P NV I E P S F G I G R I MY S I F E HT F R I R R T Y F S F P A T V A P Y K C S V L P L S QNQ E F M P F V R E L S E A L T R NG V S HK V DD S S G S I
V E NW L P NV I E P S F G I G R I L Y S I F E HQ F WA R R T V L S L P P L V A P T K V L L V P L S S NA E L Q P I V K K I S A F L R K E QV P F K V DD S S A S I
I K K Y L P HV I E P S F G L G R I I Y S I L E QNY Y T R R G V L S L P A I I A P V K A S I L P L T S S DR I A P F V QT I S K A L K E A N I S T K V DDT G NA I
I T DA L P S V V E P S F G I G R I MY S L L E H S F QC R R C Y F T L P P L V A P I K C S I L P L S NNT D F Q P Y T QK L S S A L T K A E L S HK V DD S S G S I
I T E A L P S V V E P S F G I G R I MY A L L E H S F QC R R C Y F T L P P L V A P L K C S I L P L S NNA E F Q P Y T QK L S S S L T K A E L S HK V DD S S G S I
L M E T V P DV I E P S F G I G R I L Y A L I E H S F Y L R R P V F R F K P A I A P V QC A I G Y L I H F D E F N E H I L N I K R F L T DNG L V V HV N E R S C S I
L F A Y A P NV I E P S F G V G R V L T A V L E H S F WV R K S V L S I P A S I A P V K V G L F P L L T K L E F NNK I A E I E K I C K NG F L S F K S NT T A V A I
V M E Y L P NV I E P S F G I G R I MY T V F E HT F R I R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG I S HK V DD S S G S I
V M E Y L P NV I E P S F G L G R I MY T V F E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R HG V S HK V DD S S G S I
V E A R L P NV I E P S F G I G R I I Y S I F E H S F W S R R A V L S F P P L V A P T K V L L V P L L NN P E L S K I T A QV S Q I L R K E Q I P F K V D E S G V S I
V MA Y L P S V I E P S F G I G R I L Y C L L E Q S Y WV R R A V F S F S P L L A P QK V A L L P L MV K P E L L A T I S E I R Q E L V MR G I S V R V DD S S V T I
V M E Y L P -Y F G I K I G L L R NG Y S HT Q L T - - - - - - - - - - - P W L K P QV C T - - - - - - - - - - - - - - - - L QK A C S R - - - - - - - - - - - - -
V E T V L P NV I E P S F G I G R I L Y C L L E HNY WT R R G V L S F T P V V A P T K V L I V P L S R HDD F V P F V QK I S QK L R S V G V S S R V DD S S A T I
V A DA L P HV I E P S Y G I DR I F Y G I M E HA F D E E R L V MH F S S A V A P V QV A V L P L L T R K E L A D P A K E I I A K L R E K T L L V NY DD S G -T I
V L DY L P NV I E P S F G L G R I MY T V F E HT F R V R R T F F S F P A I V A P Y K C S V L P L S QNQ E F A P F V R E L S E A L T R NG V S HK V DD S S G S I
V L E Y L P S V I E P S F G L G R I MY T I L E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG V S HK V DD S S G S I
V E A A I P NV I E P S F G I G R I L Y S L L E HNY WV R R G V I S F P P A V A P V K V L I V P I S S K A E F A P HV R R L S QK L R S V G I S S R V DD S S A S I
V V E A L P S V I E P S F G I G R I I Y C L F E H S F Y T R L NV F R F P P I V A P I K C T V F P L V K NQ E F DDA A K V I DK A L T T A G I S H I I DT T A I S I
V T DA L P S V I E P S F G I G R I MY C M F E HA F Y I R K T V L R L T P V V A P I K T T I F P L V NDDK L NA I A A E MNK M L T T NG I S A K L DA T A I S V
V M E Y L P NV I E P S F G L G R I MY T V F E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R HG V S HK V DD S S G S I
L I K Y V P HV I E P A F G I G R I L QA I I E H S F NQR K T F F K F S P R V A P V K C S I L S V V Q S E E F DNV I F E L T S S L K K L G I S C K T DNA G V A L
I QK W L P NV I E P S F G I G R I L Y S I F E HQY WA R R G V L S L P P L V A P T K V L L V P L S N S A D L Q P I V T K V S A Y L R K QQ I P F K V DD S S A S I
I Y A C L P NV I E P S F G I G R L I F C I L E H S F R I R R QY L S L P Y K L A P I K C S I L S I S NNK A F Y P Y I K Q I QM L L NQY N I S S K I DN S S V S I
I Y QW L P NV I E P S F G I G R L I F C I I E H S F R T R R HY L S L P Y T L A P I K C S V L T I S NHK T F I P F V K QV QM I L N E F S I S S K I DN S S V S I
I Y K I L P NV I E P S F G I G R L I F C I L E H S F R V R R HY L S L P Y A L S P I K C S V L S I S NNK E F Y P Y I K Q I QT I L S E NN I S C K L DN S S V S I
V V E A L P S V I E P S F G I G R I I Y C L Y E H S F Y MR QNV F R F P P L V A P I K C T V F P L V QNQQY E DV A K I I S K S L T A A G I S HK I D I T G T S I
V L E Y L P S V I E P S F G L G R I MY T I L E HT F HV R R T F F S F P A V V A P F K C S V L P L S QNQ E F M P F V K E L S E A L T R NG V S HK V DD S S G S I
I E S V L P NV I E P S F G L G R I I Y C I F DHC F QV R R G F F S F P L Q I A P I K V F V T T I S NNDG F P A I L K R I S QA L R K R E I Y F K I DD S NT S I
V E E A M P NV I E P S F G L G R I L Y V L M E HA Y WT R R G V L S F P A S I A P I K A L I V P L S R NA E F A P F V K K L S A K L R N L G I S NK I DD S NA N I
V V E H L P S V I E P S F G V G R I L Y S I L E HN F K V R R T Y F T L P P I I A P Y K C C V L P L S S NK D F E P L V K T L A QA L S NA S I S HK V D S S S G S I
A MD F L P NV I E P S F G I G R I MY T I F E HT F HV R R T F F S F P A T V A P Y K C S V L P L S QNQ E F V P F V R E L S E E L T R NG V S HK V DD S S G S I
L T K L I P Y V I E P S F G V G R I F S A I L E H S F R MR R T F F H L P P K I S P I K C S I L P V I S H E K Y NDA I HK L K V G L T K V G V S S K V DDT G HA I
L L NH L P C V I E P S F G L G R L I F S I L E H S Y R V R R K Y V A L NK S I A P T K C S V L P L S S K E V F E P L I T R V QA H L R R L G I S HK V DK T G A S I
I L NH L P C V I E P S F G L G R L I F S I L E HA Y R V R R K Y V S L HK S I A P T K C S I L P L S S K E V F E P L I S R V Q S Q L R S L G I S HK V DK T G A S I
I MDA L N I T V C G DK T V T Y E MY N I NDT V V T T R - - - - F F P NA L S S - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
V M E Y L P S V I E P S F G V G R I L Y A L L E Q S Y WV R R A V F G L R P L I A P QK V A V F P L L MK P E L I R T V E E I K E R M L L HG I S T R T DD S G A S I
V M E Y L P S V I E P S F G V G R I L Y A L L E Q S Y WV R R A V F S MR P V I A P QK V A V L P L L V K P E L L R V V E E I R G DMV L R G I S T R T DD S G A S I
V E E A I P NV I E P S F G I G R I F Y S L L E H S F WT R R G V L S L P P L V A P I K A S I V P I S S N E K L S P L V K QV S R K L R S A G V A S R V DD S NA S I
V E E W F P NV I E P S F G I G R I L Y S L I E HC F WT R K G V L S F P P R I A P T K V L V V P L S S QK E L A P F T Q E V S K K L R QA R I S A K V DD S S A S I

420

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

430 440 450 460 470 480 490

G R R Y A R T D E I A I P Y G I T I D F DT L K - E P HT V T L R E R D S MK QV R I G L D E V A NT I R D L A T G - -R T S WK Y E P - - P - - I P T R V G K K K G
G K R Y A R T D E L G V P F A I T V D S - - - - - -DT S V T I R E R D S K DQV R V T L K E A A S V V S S V S E G - -K MT WK F E P - -A A - P P A R V G R K QG
G K R Y A R ND E L G T P F G I T I D F E S I K - -DG S V T L R E R D S T R QV R G S V T D I I R A I R D I T Y N - -G V T WK Y E P - - P - -V Q S K I G R K K G
G K R Y A R ND E L G T P F G I T V D F Q S V K - -DNT F T L R DR DT T K QV R A S E D E I L QA I K S L V DG - - E K T WK Y E P - - P P P P T T R I G R K K G
G K R Y A R ND E L G I P L G I T I D F D S V K - -DG T I T L R E R D S T K QV R A S K E E I L G A I E S L I S G - -K MNWR Y E P - - P P P P T T R L G R K K G
G R R Y A R T D E L G V P Y A V T V D F DT I K - E P HT V T L R E R D S MR QV R L P MA DV P T V V R D L S N S - -K I L W - - - - - - - - - - - - - - - - - -
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P S V V R D L A NG - - S I MWK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G I P F G I T V D F E S G K T T P Y T V T I R HA E T M S Q I R L E V S E L G R L I S D L V S G - -R QQWK Y E A - - P - - I P S R I G K K K G
G R R Y A R T D E I G I P F G I T V D F D S L K T T P F T V T I R HA E T M S Q I R L E V S E L G R L I S D L V A G - -R QQWK Y E A - - P - - I P S R I G K K K G
G K R Y A R ND E L G T P F G I T I D F D S V K - -DD S V T L R E R D S T K QV R G S I Q E I V E A I K D I T Y N - -DG T WK Y E P - - P - -V E S K F G K K K G
G K R Y A R ND E L G T P F G I T I D F D S V K - -DG S V T L R E R D S T K QV R G S V E A V I K A V R E I T Y N - -G A S WK Y E P - - P - -V Q S K F G R K K G
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P S V V C D L A NG - - S I T WK Y E P - - P - -V P T R V G K K K G
G K R Y A R T D E I A V P F G I T V D F DT V K I E P HT A T L R E R D S L V Q I R A T V E E I P Q I V Y D L V Q E - -NT T WK Y E R - - P - - L P T R V G K R R G
G K K Y A R ND E L G T P F G C T V D F A T I Q - -NG T MT L R E R D S T S Q L I G P I E DV I S V V DQ L V K G - -V L DWK W E P - - P - -V P T R I G K K K G
G R R Y A R T D E I G I P F G I T I D F Q S V K - -DDT V T L R E R D S MK QV R I S S S E V P S V I S K I I NQ - -Q I T WK L E S - -A - - P P P M E MK R K G
G R R Y A R T D E I G I P F G I T I D F Q S V K - -DDT V T L R E R D S MK QV R I S S S E V P S V I S K I I NQ - -Q I T WK L E S - -A - - P P P M E MK R K G
G R R Y A R T D E I G V A F G I T I D F DT V NK N P HT A T L R DR D S MR Q I R A E V G E L P E I I R D L A NG - -A I T WK Y E P - - P - - I P T R V G K R K G
G K R Y A R ND E L G T P F G I T I D F D S V K - -D E S V T L R DR D S T K QV R G S L E D I V E A I K D I A Y N - -NV S WK Y E P - - P - -V E S K F G K K R G
G R K Y A R T D E I G V P F G V T I D F QT V E - -DNT V T L R E R DT T K QV R I P I S E L A S T L R K L C D L - -T V S WK Y Q P - - P P - P P T Q F G K K K G
G R R Y A R T D E I A I P Y G I T V D F DT L K - E P HT V T L R DR NT MK QV R V G L E E V V G V V K D L S T A - -R T T WK Y E P - - P - - I P T R V G K K K G
G R R Y A R T D E I A I P Y G I T V D F DT L K - E P HT V T L R DR NT MK QV R V G L E QV V G V V K D L A T A - -R T S WK Y E P - - P - - I P T R V G K K K G
G R K Y S S C D E L G I P F F I T F D P D F L K - -DR MV T I R E R D S MQQ I R V DV E K C P S I V L E Y I R G - -Q S R WN L QD - -T - -T T I N L R R R R E
G K K Y A QA D E A G I P F DV T V DY T S L S - -DNT V T L R DR DT T K Q I R I P I DK L V E T V HA L T Q L H P T T T F - - - - - - - - -M S MT L G K K R E
G R R Y A R T D E V G V A F G I T I D F DT V NR T P HT A T L R DR D S MR Q I R A E V S E L P A I I R D L A NG - -Y L T WK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G V A F G V T I D F DT V NK T P HT A T L R DR D S MR Q I R A E I S E L P S I V QD L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G K R Y A R ND E L G T P F G V T I D F D S V T - -DG S I T L R E R D S T K QV R G S V A DV I K A I R E I T Y Q - -G V S WK Y E P - - P - -V E S K F G R K K G
G K K Y A R V D E L G I P F A I T C D F E G - - - -DG S V T L R E R DT A S QV R V P K L E V A S V V V D L C N P L Q P L T WK W E P - - P - -V A P E I G K R K G
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -K Y E P - - P - -V P T R V G K K K G
G R R Y S R ND E L G T P L G I T V D F QT V K - -DG T I T L R DR DT T V QV R A DQDK I V E A I Q E L V S G - -NK V WK Y E P - - P P R P T T R V G R K K G
G R R Y R R ND E I G T P Y S V T V DY DT L Q - -DG T V T I R DR D S MR QV R A P I NG I E NV L Y E L I Y R - -G R D F - - - - - - - - - - - - - - - - - -
G K R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E I S E L P K V V C S L A NG - -T MT WK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P NV V R D L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G R R Y A R ND E L G T P F G L T I D F QT L Q - -DG T F T L R E R D S T R QV R A E E E K I V DA I K A L V E G - - S K T WK Y E P - - P P K P T T R I G R K K G
G R R Y A R T D E I G V P F A V T V D S - - - - - -A T S V T I R E R D S K E Q I R V G I D E V A S V V K Q L T DG - -Q S T WK F E P - - P A -A P S R V G R K QG
G K R Y A R T D E L G V P F A V T V DHR S V T - - E NT V T V R E R D S C G QV R V P I P E V P G L L G R L C K M - -T V DWK Y V P - - P A - P P MR V G K K K G
G R R Y A R T D E I G V A F G V T I D F DT V NK T P HT A T L R DR D S MR Q I R A E I S E L P S I V R D L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G K K Y A R T D E I G I P F A I T V DK E T L T - -A Q S V T L R E I E T T K QV R V P I A E V P R L I L E L S A G - - L I L WK K P P - - P - - P P QR V G R K K G
G K R Y A R ND E L G T P F G I T I D F D S I K - -D E S V T L R E R D S T K QV R G S F E DV V A A I K E I T Y T - -G T T WK Y E P - - P - -V E S K F G K K R G
G K K Y A R T D E I G I P F A V T I D F QT L K - -DK T I T L R E R D S M L Q I R I S M S H L V D I I N S M L HA - -K K NWK L E S - -V - - P I S HMG K K K G
G K K Y A R T D E I G I P F A V T I D F QT L K - -DK T V T L R E R D S M L QV R I D L S D L V E I V T S L L R Q - -K K T WK L E S - -A - - P P S H I G K R K G
G K K Y A R I D E I G I P F A V T I D F QT L K - -DK T I T L R DR D S M L Q I R V N I S E V S D I I N S L L S Q - -K S S WK L E S - - S - - P P S H I G K R K G
G K R Y A R T D E L G V P F A I T V D S - - - - - -T S S V T I R E R D S K DQ I R V NV E E A A S V V K S V T DG - -HT T WK F E P - -A A - P P A R V G R K QG
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V S E L P S V V R D L A NG - -N I T WK Y E P - - P - -V P T R V G K K K G
G K K Y A R ND E L G T P F G I T I D F E T I K - -DQT V T L R E R N S MR QV R G T I T DV I S T I DK M L HN P D E S DWK Y E P - - P - -V Q S K F G R K K G
G R R Y A R ND E L G T P F G L T V D F E T L Q - -N E T I T L R E R D S T K QV R G S QD E V I A A L V S MV E G - -K S S F K Y E P - - P - -V P T R T G R R K G
G R R Y A R T D E I S V P F C I T V D F D S L K - E P HT V T L R DR DT F E QV R T L V S DV A D I I R D L S S D - -K I R WK Y E P - - P - -V P T R V G K K K G
G R R Y A R T D E I G V A F G I T I D F DT V NK T P HT A T L R DR D S MR Q I R A E V R E L P G I I R D L A NG - -T L S WK Y E P - - P - - I P T R V G K K K G
G R R Y A R T D E L G I P F G I T I DNDT L V - -DD S V T L R E I L T T K Q I R I P I NDV F R V V S D L A DG - - L I T WK DQM - - P - - - - - - - -R E K
G K R Y A R T D E I G I P F C V T L D F Q S V N - -DDT V T L R E R DT MQQV R I K L DD L G E L I NN L L K D - -D I T WQ S R S - -D - -Q P V T F G K R K R
G K R Y A R T D E I G V P F C V T L D F Q S V N - -DDT V T L R E R D S MQQV R V K L DQV G Q L L S N L L K - - -D I T W - - - - - - - - - - - - -M I K R I K
- - - - - - - - - - - - - - - - - - - - - - - - - -HR S V S A E S - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - E E E S V N P K K K L T E E E R K K R
G K K Y A R V D E L G V P F C V T C DM E N - - - -DG C V T L R E R D S A QQV R I P K E K V A D I V A E MC R P L R P R E WK W E P - - P - -V A T D I G K K K G
G K K Y A R V D E L G I P F C V T C D F E T - - - -DG C V T L R E R D S A R QV R I P R E A V A DV V A E L S R P L R P R E WK WQ P - - P - -V A S D I G K K K G
G R R Y A R ND E L G T P F A C T L D F A S L S - -K G T MT L R E R DT T A QR I G P I DQV I DV I R Q L C DG - - S L DWK W E P - - P - - L P T R V G K K K G
G K R Y A R ND E MG T P F G I T V D F DT V K - -DN S V T L R E R D S T R QV R G S I DA V I A A I NV MT A D - -DV A WK Y E P - - P - -V NT R S HR K K G

500

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

510 520 530 540 550 560 570

P DA A L K L P QV - - - - - -T P HT R C R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G S P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P E A A A R L P T V - - - - - -T P S T K C K L R L L K L E R I K DY L L M E E E F V A NQ E DD L R G T P M S V G N L E E L I D E NHA I V S S S V G P E Y Y V G I
P A T A E K L P NV - - - - - -Y P S T R C K L K L L R M E R I K DH L L L E E E F V T N S E E E I R G T P L S I G T L E E I I DDDHA I V T S P T T P D F Y V S I
P S A A S K L P D I - - - - - - F P T S R C K L R Y L R MQR V HDH L L L E E E Y V E NM E DDMR G S P MG V G N L E E L I DDDHA I V S S A T G P E Y Y V S I
P S T A S K L P D I - - - - - - F P T S R C K L R Y L R MQR V HDH L L L E E E Y V E NM E DDMR G S P MG V G N L E E L I DDDHA I V S S A T G P E Y Y V S I
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M E E E F I R NQ E - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
P DV A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DA A S K L P A V - - - - - -T P HA R C R L K L L K S E R I K DY L L M E Q E F I QNQ E D E L R G T P MA V G S L E E I I DDQHA I V S T NV G S E HY V N I
P DA A S K L P A V - - - - - -T P HA R C R L K L L K S E R I K DY L L M E Q E F I QNQ E D E L R G T P MA V G S L E E I I DDQHA I V S T NV G S E HY V N I
P DT A V K L P S V - - - - - -Y P NT R C K L K L L K L E R I K DH L L L E E E F V T NQ E D E L R G Y P MA I G T L E E I I DDDHA I V S S T A S S E Y Y V S I
P A T A E K L P NV - - - - - -V P S T R C K L K L L R M E R I K DH L L L E E E F V T N S E E E I R G N P L S I G T L E E I I DDDHA I V T S P T M P DY Y V S I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DA A NK L P S V - - - - - -T P HT R C R L R L L K Q E R I K DY L L M E E E Y I R NQ E DD L R G S P M S V G T L E E I I D E NHA I V S T S V G S E HY V S I
P DA S S R L P A V - - - - - -Y P T T R C K L K L L K M E R I QDY L L M E E E F V S NQA D E L R G S P MG V G T L E E I I DDDHA I V S S G G G S E Y Y V G I
P P QY A R L P A V - - - - - -V P NA K C R L R L L K Y E R I K DY L MM E Q E F I T S M E DD L R G S P MN I G T L E E I I D E NHA I V S S S V G S E Y Y V N I
P P QY A R L P A V - - - - - -V P NA K C R L R L L K Y E R I K DY L MM E Q E F I T S M E DD L R G S P MN I G T L E E I I D E NHA I V S S S V G S E Y Y V N I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K QDR I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DT A V K L P S V - - - - - -Y P S T R C K L K L L K L E R I K DH L L L E E E F V T NQ E D E L R G Y P M S I G T L E E I I DDDHA I V S S T A G S E Y Y V S I
A E T S T K L P V I - - - - - -T P H S K C K L K Q L K L E R I K DY L L M E Q E F L QNY D E E L R G D P L T V G N L E E I I DDNHA I V S S T V G P E HY V R I
P DA A MK L P QV - - - - - -T P HT R C R L K L L K L E R I K DY L MM E D E F I R NQ E DD L R G T P M S V G N L E E I I DDNHA I V S T S V G S E HY V S I
P DA A MK L P L V - - - - - -T P HT R C R L K L L K L E R I K DY L MM E D E F I R NQ E DD L R G T P M S V G N L E E I I DDNHA I V S T S V G S E HY V S I
G K A A S K P P QV - - - - - -Y P L MK C K L R Y L K L K K L A H L L S L E DN I L S L C E E Q L R G S P L S V G T L E E F V DDHHG I I T T G V G L E Y Y V N I
Y G NNNK L P Q I - - - - - -N P R T QC N L K K L R L E R L K D I L L I QR D F I E NQ E E E L R G S P L E V S K L H E M I DDHHA I I S S G NT MQY C V P V
P DA A S K P P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P S T V E K L P S V - - - - - -Y P S T R C K L K L L R M E R I K DH L L L E E E Y V T N S E DD I R G T P L S I G T L E E I V DDDHA I V T S P T T P DY Y V S I
P DA A T R I P K V - - - - - -Y P NR A C L L R K Y R L E R C K DY L L L E E E F L R T I N E D I R G T P L E V A T L E E A V DD S HA I V S I S -G T E Y Y V P L
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY I S I
T S A A A K L P A I - - - - - -Y P T S R C K L R L L R MQR T HDH L L L E E E F V E NQ E DDMR G S P MG V G T L E E M I DDDHA I V S S T T G P E Y Y V S I
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - E Q L T E P P L F I A T I L E V NG E I A L I R QHG NNQ E - - - -V
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
A S A A A K L P S V - - - - - -Y P T S R C K L R L L R MQR I HDH L L L E E E Y V E NQ E DDMR G S P MG V G V L E E L I DDDHA I V S S T S G P E Y Y V S I
P E A A A R L P NV - - - - - -A P L S K C R L R L L K L E R V K DY L L M E E E F V A A Q E DD L R G T P M S V G S L E E I I D E S HA I V S S S V G P E Y Y V G I
I E G S T R L P NV - - - - - -A P Q S K C K L R M L K L E R V K DY L L M E E E F V G NQ E D E MR G A P M S V G S L E E I I DDT HG I V S S S I G P E Y Y V N I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
V E QA S K L P A A - - - - - -T P I T K C R L K L L K N E R I K DY L L L E Q E F I E NQQ E Q L R G T P M I V G T L E E F V N E NHA I V S S S V G P E S Y S G I
P DT A V K L P S V - - - - - -Y P NT R C K L K L L K L E R I K DH L L L E E E F V T NQ E D E L R G Y P M S I G T L E E I I DDDHA I V S S T A G S E Y Y V S I
T S G H S K L P NV - - - - - -T P NT K C R L K L L K L E R I K DY L L L E E E Y I T NQ E DD L R G S P V S V G T L E E L I D E NHG I I A T S V G P E Y Y V N I
V P G H S K L P T V - - - - - -T P NT K C R L K L L K L E R I K DY L L L E E E F I T NQ E DD L R G S P M S V G T L E E L I D E NHG I I A T S V G P E Y Y V N I
A T G H S K L P T V - - - - - -T P NT K C R L K L L K L E R I K DY L L L E E E F I T NQ E DD L R G S P M S V G T L E E L I D E NHG I I A T S V G P E Y Y V N I
P E A A A R L P T V - - - - - -T P HT K C K L R L L K M E R I K DY L L M E E E F V A NQ E DD L R G S P M S V G N L E E L I D E NHA I V S S S V G P E Y Y V G I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K L E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
P A T A E K L P N I - - - - - -Y P S T R C K L K L L R M E R I K DH L L L E E E F V S N S E E E I R G N P L S I G T L E E I I DDDHA I V T S P T M P DY Y V S I
P DA S A K L P T V - - - - - - I P T T R C R L R L L K MQR I HDH L L M E E E Y V QNQ E D E I R G T P M S V G T L E E I I DDDHA I V S T A -G P E Y Y V S I
P DT A S K L P QV - - - - - -T P HT K C R L R L L K M E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G S L E E I I DDNHA I V S A S V G S E Y Y V S I
P DA A S K L P L V - - - - - -T P HT QC R L K L L K Q E R I K DY L L M E E E F I R NQ E DD L R G T P M S V G T L E E I I DDNHA I V S T S V G S E HY V S I
- -K P A P V R I V - - - - - -T P I S K C R L R Q L K L DR I K DY L L M E Q E F I R NQ E E Q L R G S P M L I G T L E E F I D E DHA I V S S - I G P E Y Y A N I
Q L A P V R I P T V - - - - - -T P N S K C R L R L L K L E R I K DY L L L E E E Y I T NK S DD I R G S P M S V G T L E E I I D E NHA I V T S S I G P E Y Y V N I
P V L T NR L P L V NV K G K L T P N S K C R L R L L K L E R I K DY L L L E E E Y I T NK S DD I R G S P M S V G T L E E I I D E NHA I V T S S I G P E Y Y V N I
G A K NT H I P T V - - - - - -T P NA K C Q L R L L K L E R V K DW L K M E E E F I NNC E E E V R G S P MMV G T L E E I V DDDHA I V S R S V -QD F Y V T I
P DA A A K L P K I - - - - - -Y P S R A C L L K Q L R L E R C K DY L L L E D E L L T M I T DA L R G M P L E V G T L E E V I DDT HA I V S T A -G S E Y Y V A M
P DT A A K L P K I - - - - - -Y P V K A C L L K Q L R L E R C K DY L L L E E E L L K T I G DA L R G M P L E V G T L E E V I DDT HA I V S T A -G S E Y Y V P M
P D S S S K L P T V - - - - - -Y P NT R C R L K L L K L E R I K DH L L L E E E F V QNQ E DD L R G S P MA V G T L E E I I DD E HA I V S S A T G P E Y Y V S I
P E NA NK L P G V - - - - - -Y P T T R C K L K L L K M E R I K DH L L L E E E F V QNQ E DT L R G S P MG V G T L E E I I DDDHA I V S S T S G P E Y Y V S I

590

600 610 620 630 640

L S F V DK D - - -Q L E P G C S V L L NHK V HA V V G V L G DDT D P MV T V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
L S F V DK D - - -Q L E P G C S I L MHNK V L S V V G I L QD E V D P MV S V MK V E K A P L E S Y A D I G G L E A Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S I V G V L QDDA D P MV S V MK I DK S P T E S Y ND I G G L E A Q I Q E I K
M S F V DK D - - - L L E P G A S I L L HHK S V S V V G V L T E E S D P L V S V MK L DK A P T E S Y A D I G G L E S Q I Q E V R
M S F V DK D - - - L L E P G A S I L L HHK S V S V V G V L T E E S D P L V S V MK L DK A P T E S Y A D I G G L E T Q I Q E V R
- - - - - - - - - - - - - - - - - - - - - -K V HA V V G V L G DDT D P MV S V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
M S F V DK E - - -Q L E P G C S V L L NHK NHA V I G V L S DDT D P MV S V MK L E K A P Q E T Y A DV G G L DQQ I Q E I K
M S F V DK E - - -Q L E P G C S V L L NHK NHA V I G V L S DDT D P MV S V MK L E K A P Q E T Y A DV G G L DQQ I Q E I K
M S F V DK G - - - L L E P G C S V L L HHK T V A V V G V L QDDA D P MV S V MK L DK S P T E S Y A D I G G L E S Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S I V G V L QDDA D P MV S V MK I DK S P T E S Y S D I G G L E S Q I Q E I K
L S F V DK D - - - L L E L G C S V L L NHK V HA V I G V L MDDT D P L V I V MK V E K T P Q E T Y A D I R A L DNQ I Q E I K
L S F V DK D - - - L L E P G C T V L MNHK V HA V V G F L G DDV D P L V T V MK L E K A P K E S Y A D I G G L DT Q I T E I K
M S F V DK D - - - L L E P G C S V L L HHK T HA V V G V L A DDT D P MV S V MK L DK A P T E S Y A D I G G L E S Q I Q E I K
L S F V DK N - - -Q L E P G S S V L L HNK V Y S V V G I MND E V D P L V S V MK V DK A P L E S Y A D I G G L E QQ I Q E I K
L S F V DK N - - -Q L E P G S S V L L HNK V Y S V V G I MND E V D P L V S V MK V DK A P L E S Y A D I G G L E QQ I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
M S F V DK G - - - L L E P S C S V L L HHK T V S I V G V L QDDA D P MV S V MK L DK S P T E S Y A D I G G L E S Q I Q E I K
M S F V DK S - - -K L Y L G A T V L L NNK T L S V V G V I DG E V D P MV NV MK V E K A P T E S Y S D I G G L E A QV Q E MK
L S F V DK D - - -Q L E P G C S V L L NHK V HA V V G V L S DDT D P MV T V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
L S F V DK D - - -Q L E P G C S V L L NHK V HA V V G V L S DDT D P MV T V MK L E K A P Q E T Y A D I G G L DT Q I Q E I K
M S F V DK D - - - L L E P G C T V L L NY K DN S V V G V L E G E MD P MV NV MK L E K A P S E T Y A D I G G L E E Q I Q E I K
L S I V DR E - - - L L E P G V QV L T HNHNK A I V G V L QND E D P HV S V MK V DK A P L E S Y A DV G G L E K Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK L E K A P Q E T Y A D I G G L DNQ I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S V V G V L QDDA D P MV S V MK MDK S P T E NY S D I G G L E A Q I Q E I K
M S F V DK E - - -Q L E L G C S V L L HDR QH S I V G V L K DDV D P L V S V MK V DK A P E DT Y A D I G G L E QQ I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
M S F V DK D - - - L L E P G A S V L L HHK S V S I V G V L T DDA D P L V S V MK L DK A P T E S Y A D I G G L E QQ I Q E V R
L T Q I P E E C L G K I E P G MR V A V N -G A Y S I I S I V S R A A DV R A QV M E L I N S P G V DY S M I G G L DDV L Q E V R
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
M S F V DK D - - - L L E P G A S V L L HHK S V S I V G V L T DDT D P A V S V MK L DK A P T E S Y A D I G G L E QQ I Q E V R
L S F V DK D - - -Q L E P G C S I L MHNK V L S V V G I L QD E V D P MV S V MK V E K A P L E S Y A D I G G L DA Q I Q E I K
A S F V DK S - - -Q L E P G C A V L L HHK N S A V V G T L A DDV D P MV S V MK V DK A P L E S Y A DV G G L E DQ I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
M S F V DK D - - -Q L E P G C S V L L NQR S Y A V V G I MQD E I D P L L NV MK V DK A P L E S Y A D I G G L E QQ I Q E I K
M S F V DK G - - - L L E P G C S V L L HHK T V S V V G V L QDDA D P MV S V MK L DK S P T E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NNK T N S V V G I L L D E V D P L V S V MK V E K A P L E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NNK T N S V V G I L L D E V D P L V S V MK V E K A P L E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NNK T N S V V G I L L D E V D P L V S V MK V E K A P L E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - -Q L E P G C A I L MHNK V L S V V G L L QD E V D P MV S V MK V E K A P L E S Y A D I G G L DA Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
L S F V DK E - - - L L E P G C S V L L HHK T M S I V G V L QDDA D P MV S V MK MDK S P T E S Y S D I G G L E S Q I Q E I K
M S F V DK D - - -M L E P G C S V L L HHK A M S I V G L L L DDT D P M I NV MK L DK A P T E S Y A D I G G L E S Q I Q E I K
L S F V DK D - - -Q L E P G C T V L L NHK V L A I V G V L G DDT D P MV S V MK L E K A P Q E S Y A D I G G L DT Q I Q E I K
L S F V DK D - - - L L E P G C S V L L NHK V HA V I G V L MDDT D P L V T V MK V E K A P Q E T Y A D I G G L DNQ I Q E I K
L S F V DK D - - -Q L E P G S T V L L NNR T MA V V G I MQD E V D P M L NV MK V E K A P L E C Y A D I G G L E QQ I Q E V K
L S F V DK E - - - L L E P G C S V L L HNK T N S I V G I L L DDV D P L V S V MK V E K A P L E S Y DD I G G L E E Q I Q E I K
L S F V DK E - - - L L E P G C S V L L HNK T N S I V G I L L DDV D P L V S V MK V E K A P L E S Y DD I G G L E E Q I Q E I K
S S F V DR K - - -A L Q I G C S V L L H E K A L T I V G L L DDDA N P L V DV MK V E NA P L E S F A D I G G L E DQ I V D I K
L S F V DK E - - -K L E L G C S V L L HDR Y HNV V G L L E S NT D P L V S V MK V DK A P Q E T Y A D I G G L E DQ I Q E I K
L S F V DK E - - -K L E L G C S V L L HDR QH S V V G V L QN S I D P HV S I MK V E K A P Q E T Y A D I G G L E E Q I Q E I K
M S F V DK D - - - L L E P G C S V L L HHK A MA I V G V L S DDA D P MV S V MK L DK A P S E S Y A D I G G L E T Q I Q E I K
M S F V DK D - - - L L E P G C S V L L HHK T V S V V G V L QDDA D P MV S V MK L DK A P T E S Y A D I G G L E S Q I Q E I K

650
E SV
EAV
EAV
E SV
E SV
E SV
E SV
EAV
EAV
E SV
E SV
E SV
E SV
E SV
EAV
EAV
E SV
EAV
EA I
E SV
E SV
E SV
EAV
E SV
E SV
EAV
EAV
E SV
E SV
E SV
E SV
E SV
E SV
EAV
EAV
E SV
EAV
EAV
EAV
EAV
EAV
EAV
E SV
E SV
EAV
E SV
E SV
EAV
EAV
EAV
EAV
EAV
EAV
EAV
E SV

660

E L P L T H P E Y Y E E MG
E L P LT H P E LY ED I G
E L P L T H P E L Y E E MG
E L P L L H P E L Y E E MG
E L P L L H P E L Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E L Y E E MG
E L P L T H P E L Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E L Y E E MG
E I P L T H P E L Y DD I G
E I P L T H P E L Y DD I G
E L P L T H P E Y Y E E MG
E L P L T H P E L Y E E MG
E L P LT H P E LY E E I G
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T N P E L Y Q E MG
E L P L SH P E LY E E I G
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E L Y E E MG
E F P L SH P E LY D E I G
E L P L T H P E Y Y E E MG
E L P L L H P E L Y E E MG
E L P LT E P E L F ED LG
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L L H P E L Y E E MG
E L P LT H P E LY ED I G
E L P LT H P E LY ED I G
E L P L T H P E Y Y E E MG
E L P L T H P E I Y E DMG
E L P L T H P E L Y E E MG
E L P LT H P E LY ED I G
E L P LT H P E LY ED I G
E L P LT H P E LY ED I G
E L P LT H P E LY ED I G
E L P L T H P E Y Y E E MG
E L P L T H P E L Y E E MG
E L P L T H P E L Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E Y Y E E MG
E L P L T H P E I Y E DMG
E L P L T R P E L Y DD I G
E L P L T R P E L Y DD I G
E L P LT H P EQ FD E I G
E F P L SH P E L FD EV G
E F P L SH P E LY D EV G
E L P L T H P E L Y E E MG
E L P L T H P E L Y E E MG

670

I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I E P P SGV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
I K P PK GV
V Q P PK GV
V K P PK GV
I K P PK GV
I R P PK GV
I K P PK GV

680

I LY G P PGT GK T
I LY G E PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GC PGT GK T
I LY GC PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GV PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G E PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G L PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY GV PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
L L HG A P G T G K T
I LY G P PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY G E PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G E PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G P PGT GK T
I LY G E PGT GK T
I LY G P PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY GA PGT GK T
I LY G P PGT GK T
I MY G P P G T G K T
I LY G P PGT GK T
I LY G P PGT GK T
I L FG P PGT GK T
I LY GV PGT GK T
I LY GV PGT GK T
I LY GV PGT GK T
I LY GA PGT GK T

690
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L

L A K A V A NQT S A T
LAK AV AN ST SAT
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
LAK AV AN ET SAT
LAK AV AN ET SAT
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
LAK AV AN ET SAT
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NR T S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
I A K A I A S QA K A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
LAK AV AN ST SAT
LAK AV AN ST SAT
L A K A V A NQT S A T
LAK AV AN ET SAT
L A K A V A NQT S A T
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ST SAT
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
LAK AV AN ET SAT
LAK AV AN ET SAT
LAK AV AN ET SAT
LAR AV AK ST SAT
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T
L A K A V A NQT S A T

700 710 720 730

F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A DD L S P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F K V A A E NA P S I
F L R I V G S E L I QK Y L G DG P R L V R Q I F QV A A E HA P S I
F L R I V G S E L I QK Y L G DG P R L V R Q I F QV A A E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P K MV R E L F R V A E E NA P S I
F L R I V G S E L I QK Y L G DG P K MV R E L F R V A E E NA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A A DHA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F K V A A E NA P S I
F L R V V G S Q L I QK Y L G NG P K L I R E L F R V V E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P K L V R E L F R V A E E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A G E L A P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A D E C A P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G T E L I Q E Y L G E G P K L V R E L F R V A DMHA P S I
F L R I V G S E L I QK Y L G DG P K L V R E L F QA A K D S A P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HG P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F K V A A E NA P S I
F L R V V G S E L I QK Y S G E G P K L V R E L F R V A E E H S P A I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L V R Q L F QV A A E NA P S I
F I R M S G S D L V QK F V G E G S R L V K D I F Q L A R DK S P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L V R Q L F QV A A E NA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A D E L S P S I
F L R I V G S E L I QK Y L G DG P K L V R E L F R V A D E M S P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y QG DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A G E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E M F K V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E M F K V A E DHA P S I
F L R V V G S E L I QK Y L G DG P K L V R E M F K V A E DHA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A DD L S P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F K V A G E NA P S I
F L R V V G S E L I QK Y L G DG P R L V R Q L F NA A E E H S P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A E E HA P S I
F L R I V G S E L I QK Y A G E G P K L V R E L F R V A E E HA P S I
F L R V V G S E L I QK Y L G E G P K L V R E M F K V A E DNA P S I
F L R V V G S E L I QK Y L G E G P K L V R E M F K V A E DNA P S I
F L R V V G S E L I QK Y L G E G P K L V R E L F K T A H E L A P S I
F L R V V G S E L I QK Y S G E G P K L V R E L F R V A E E N S P S I
F L R V V G S E L I QK Y S G DG P K L V R E L F R V A E E N S P S I
F L R V V G S E L I QK Y L G DG P K L V R E L F R V A D E HA P S I
F L R I V G S E L I QK Y L G DG P R L C R Q I F Q I A A E HA P S I

V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
I
V
V
V
V
V
V
V
L
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
V
I
I
V
V
V
V
V

FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
F MD E I
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI
FI DEI

740
DA V G T K R Y D S N
DA V G T K R Y DA N
DA I G T K R Y D S N
DA I G T K R Y D S T
DA I G T K R Y D S T
DA V G T K R Y D S N
DA I G T K R Y D S N
DA V G T K R Y D S N
DA V G T K R Y D S N
DA I G T K R Y E S T
DA I G T K R Y D S N
DA I G T K R Y D S N
DA I G T K R Y E S N
DA I G T K R Y D S T
DA V G T K R HD S Q
DA V G T K R HD S Q
DA I G T K R Y D S N
DA I G S K R Y E S S
DA V G T K R Y D S Q
DA V G T K R Y D S N
DA V G T K R Y D S N
DA I G G K R Y NT S
DA V G T K R Y DA H
DA I G T K R Y D S N
DA I G T K R Y D S N
DA I G T K R Y E S N
DA I G T K R Y DT D
DA I G T K R Y D S N
DA I G T K R Y D S T
DA V G S MR T Y DG
DA I G T K R Y D S N
DA I G T K R Y D S N
DA I G T K R Y D S T
DA V G T K R Y DA H
DA V G T K R Y D S Q
DA I G T K R Y D S N
DA V G T K R Y D S H
DA I G T K R Y E S T
DA V G T K R Y E A T
DA V G T K R Y E A T
DA V G T K R Y E A T
DA V G T K R Y DA H
DA I G T K R Y D S N
DA I G T K R Y D S N
DA I G T K R Y DA Q
DA I G T K R Y E S N
DA I G T K R Y D S N
DA V G S K R Y NT S
DA I G T K R Y DA T
DA I G T K R Y DA T
DA V G T K R Y D S T
DA I G T K R Y DT D
DA I G T K R Y DT D
DA V G T K R Y D S N
DA I G T K R Y E S T

750

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

760

S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S S G E R E I QR T M L E L
S G G E R D I QR T M L E L
S G G E R D I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E V QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G R R E V QR T M L E L
S G G E K E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E K E I QR T M L E L
S S G T K E V QR T M L E L
S G G E R E I QR T M L E L
S G G E R E V QR T M L E L
T S G S A E V NR T M L Q L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E K E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G A E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E R E I QR T M L E L
S G G E K E I QR T M L E L
S G G E K E I QR T M L E L
S G G E K E I QR T M L E L
S S G E R E V QR T M L E L
S G G A K E V QR T M L E L
S S G A K E V QR T M L E L
S G G E R E I QR T L L E L
S G G E R E V QR T M L E L

770 780

L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I L A T NR
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V L MA T NR
L NQ L DG F D - S R G DV K V L MA T NR
L NQ L DG F D -DR G D I K V I MA T NK
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -T R G DV K V I MA T NK
L NQ L DG F E -A R G DV K V I MA T NK
L NQ L DG F E -A R G DV K V I MA T NK
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -DR G D I K V I MA T NK
L NQ L DG F D -A R T DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -T R ND I K V I MA T NK
L NQ L DG F D -T R G E V K V I I A T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -DR G DV K V I MA T NK
L T Q L DG F D - S S NDV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -DR G DV K V I MA T NK
L A E MDG F D - P K G NV K V V A A T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F D - S R G DV K V I L A T NR
L NQMDG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R A DV K V I L A T NK
L NQ L DG F D -DR G D I K V I MA T NK
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I L A T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D -DR G DV K V I MA T NK
L NQ L DG F DT S QR D I K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R G DV K V I MA T NR
L NQ L DG F D - S R T DV K V I L A T NK
L NQ L DG F D - S Q S DV K V I MA T NK
L NQ L DG F D - S Q S DV K V I MA T NK
L NQ L DG F D -DR G D I K V I MA T NR
L T Q L DG F D - S C NDV K V I MA T NR
L T Q L DG F D - S S NDV K V I MA T NR
L NQ L DG F D -T R HDV K V I MA T NR
L NQ L DG F D -DR G DV K V I MA T NK

790
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I

ET LD PA
E S LD PA
E S LD PA
ET LD PA
ET LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
E S LD PA
E S LD PA
ET LD PA
D S LD PA
EN LD PA
E S LD PA
E S LD PA
ET LD PA
E S LD PA
ET LD PA
ET LD PA
ET LD PA
EA LD PA
E S LD SA
ET LD PA
ET LD PA
E S LD PA
DT L D P A
ET LD PA
ET LD PA
D L LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
E S LD PA
ET LD PA
E S LD PA
E S LD PA
D S LD PA
D S LD PA
D S LD PA
E S LD PA
ET LD PA
ET LD PA
SD LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA
E S LD PA
ET LD PA
ET LD PA
ET LD PA
E S LD PA
E S LD PA

L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L
L

I R PGR
LR PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R AGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
LR PGR
I R PGR
I R PGR
I R PGR
LR PGR
LR PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
LR PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR
I R PGR

800
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
F DR S I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I
I DR K I

810

E F P L PD EK T K R R I
E F P L PD I K T R R R I
L F E N P DV S T K R K I
L F E N P DQNT K K K I
L F E N P DQNT K K K I
E F P L PD EK T K R R I
E F P L PD EK T K K R I
E F P L PD EK T K R R I
E F P L PD EK T K R R I
L F E N P DA NT K K K I
L F EN PD L ST K R K I
E F P L PD EK T K K R I
E F PMPD EK T K R R I
E F P L P DT K T K R H I
E L P N P DC K T K R R I
E L P N P DC K T K R R I
E F P L PD EK T K R R I
L F E N P D S NT K K R I
E F P L PD I K T K R K I
E F P L PD EK T K R R I
E F P L PD EK T K R R I
E F G M P DA A T K K K I
E F P L PD I K T K R K I
E F P L PD EK T K K R I
E F P L PD EK T K K R I
L F EN PD I T T K R K I
E F P F PD EK T K R R I
E F P L PD EK T K K R I
L F E N P DQNT K R K I
EV P L PD EK GR V E I
E F P L PD EK T K K R I
E F P L PD EK T K K R I
L F E N P DQNT K R K I
E F P L PD I K T R R R I
E F P L P DV K T K R H I
E F P L PD EK T K K R I
E F P L P DV K NK K K I
L F E N P DA NT K K K I
Q L P N P DT K T K R R I
Q L P N P DT K T K R R I
Q L P N P DT K T K R R I
E F P L PD I K T R R R I
E F P L PD EK T K K R I
L F EN PD L ST K K K I
L F EN PD EAT K R K I
E F P L PD EK T K R R I
E F P L PD EK T K R R I
E F P V P DMK T K K K I
Q L PN PD SK T K R K I
Q L PN PD SK T K R K I
E L P F P DNK T K L K I
E F P F PD EK T K K MI
E F P F PD EK T K K MI
E F P L P DQK T K MH I
L F EN PD ST T K R K I

820
F N I HT A R MT L A E DV N L
F Q I HT S K MT L A E DV N L
L G I HT S K MN L S A DV D L
F T L HT S K M S L A DDV D L
F T L HT S K M S L G DDV D L
F T I HT S R MT L A DDV N L
F Q I HT S R MT L A DDV T L
F Q I HT S R MT L G DDV N L
F Q I HT S R MT L G K E V N L
L T I HT S K M S L A DDV N L
L G I HT S K MN L S S DV D L
F Q I HT S R MT L A DA V T L
F N I HT A R MT L S DDV NV
F K L HT S R M S L A DDV D I
F Q I HT S K MT L S DDV D L
F Q I HT S K MT L S DDV D L
F Q I HT S R MT V A E DV S L
L H I HT S K M S L A DDV K L
F E I HT A K MN L S E DV N L
F T I HT S R MT L A E DV N L
F T I HT S R MT L A E DV N L
F D I HT S R MT L D E S V N I
F E I HT S K MT L E E G V DM
F Q I HT S R MT L A DDV T L
F Q I HT S R MT L A DDV T L
V G I HT S K MN L A E DV D L
F E I HT S R M S L A E DV D I
F Q I HT S R MT L A DDV T L
F T L HT S K M S L N E DV D L
L K I HT R K MK L A DDV D F
F Q I HT S R MT L A DDV T L
F Q I HT S R MT L A DDV T L
F T L HT S K M S L N E DV D L
F Q I HT S K MT L A DDV N L
F N I HT G R MN L S A DV Q L
F Q I HT S R MT L A DDV T L
F Q I HT S K MN L G E DA N L
L T I HT S K M S L A DDV N L
F Q I HT S K MT M S P DV D L
F Q I HT S K MT M S P DV D L
F Q I HT S K MT M S P DV D L
F Q I HT A R MT L A DDV N L
F Q I HT S R MT L A DDV T L
L G I HT S K MN L S E DV N L
F T I HT S K MN L G E DV N L
F N I HT S R MT L S NDV N L
F Q I HT S R MT V A DDV T L
F E I HT S K MA L G E E V N F
F E I HT S K MT M S K DV D L
F E I HT S K MT M S K DV D L
F Q I HT A NMH L A P DV N L
F E I HT S R M S L A E DV D L
F E I HT S R M S L A E DV D I
F K L HT S R MN L D S DV D L
MG I HT S K MN L NDDV D L

840

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

850

S E L I MA K DD L S G A D I K A I C T E A G L MA
E E F V MT K D E F S G A D I K A I C T E A G L L A
E T L V T S K DD L S G A D I K A MC T E A G L L A
D E F I NQK DD L S G A D I R A I C T E A G L MA
D E F I NQK DD L S G A D I R A I C T E A G L MA
S E L I M S K DD L S G A D I K A I C T E A G L MA
DD L I MA K DD L S G A D I K A I C T E A G L MA
E E F I T A K D E L S G A D I K A MC T E A G L L A
E E F I T A K D E L S G A D I K A MC T E A G L L A
D E I V T G K DD L S G A D I K A I C T E A G L L A
E N L V T S K DD L S G A D I QA MC T E A G L L A
DD L I MA K DD L S G A D I K A I C T E A G L MA
D E HV QA K DD L S G A D I K A I C T E A G L L A
E E L V MT K D E L S G A D I K A V C T E A G L L A
E E F I MA K DD I S G A D I K A I C T E A G L L A
E E F I MA K DD I S G A D I K A I C T E A G L L A
DD L I L A K DD L S G A D I K A I C T E A G L MA
D E L V T S K D E L S G A D I K A MC T E A G L L A
E E F V M S K DD L S G A D I K A I C T E S G L L A
S E L I MA K DD L S G A D I K A I C T E A G L MA
S E L I MA K DD L S G A D I K A I C T E A G L MA
E L L I T SK ED L SGAD I K A I CT EAGMI A
E E F V M S K DD L S G A D I K A I C T E A G L L A
D E L I MA K DD L S G A D I K A I C T E A G L MA
DD L I MA K DD L S G A D I K A I C T E A G L MA
DN L V T S K DD L S G A D I K A MC T E A G L L A
S E F I HA K D E M S G A DV K A I C T E A G L L A
DD L I MA K DD L S G A D I K A I C T E A G L MA
E E F I A QK DD L S G A D I K A I C S E A G L MA
EK LAK V LT GK SGA E I SV I V K EAG I FV
DD L I MA K DD L S G A D I K A I C T E A G L MA
DD L I MA K DD L S G A D I K A I C T E A G L MA
E E F I A QK DD L S G A D I K A I C S E A G L MA
E E F V MT K D E F S G A D I K A I C T E A G L L A
E E F V MA K D E L S G A D I K A L C T E A G L L A
DD L I MA K DD L S G A D I K A I C T E A G L MA
D E F I NA K D E L S G A D I K A MC T E A G L L A
D E L V T S K DD L S G A D I K A I C T E A G L L A
E E FV MSK D E L SGAD I K A I CT EAG L LA
E E FV MSK D E L SGAD I K A I CT EAG L LA
E E FV MSK D E L SGAD I K A I CT EAG L LA
E E F V MT K D E F S G A D I K A I C T E A G L L A
DD L I MA K DD L S G A D I K A I C T E A G L MA
E T L V T T K DD L S G A D I QA MC T E A G L L A
E E L I QC K DD L S G A E I K A I V S E A G L L A
D E Y I T S K DD L S G A D I K A I C T E A G L MA
DD L I L A K DD L S G A D I K A I C T E A G L MA
DT F V HV K DD L S G A D I K A MC T E A G L L A
D E F V V NK DD L S G A D I K A MC T E A G L L A
D E F V V NK DD L S G A D I K A MC T E A G L L A
M E F A NT K D E I S G A D I K A I C S E A G L I A
S E F I HA K D E M S G A D I K A I C T E A G L L A
S E F I HA K E E M S G A D I K A I C T E A G L L A
E E F V A MK DD L S G A D I K S L V T E A G L L A
E E F V S S K D E L S G A D I K A MC T E A G L L A

860 870 880 890 900 910

L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y Y L DA QA T T S MD P R V L DA MM P Y L T
L R E R R MK V T HV D F K K A K E K V M F K K K E G V P - E G L Y Y L DMQA T T P I D P R V F DA MNA S Q I
L R E R R MQV T V E D F K QA K E R V MK NK V E E N L - E G L Y Y L DMQA T T P T D P R V V DT M L K F Y T
L R E R R MR V QMDD F R A A R E R I MK T K QDG G P V E G L Y Y L DMQA T T P V D P R V L DA M L P Y L T
L R E R R MR V QMDD F R A A R E R I MK T K QDG G P V E G L Y Y L DMQA T T P T D P R V L DA M L P Y L T
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y - - - - - - - - - - - - - - - - - - - - - -
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MR V T M E D F QK S K E NV L Y R K K E G A P - E E L Y Y L DV QA T S P MD P R V V DA M L P Y M I
L R E R R MR V T M E D F QK S K E NV L Y R K K E G A P - E E L Y Y L DV QA T A P MD P R V V DA M L P Y M I
L R E R R MQV K A E D F K S A K E R V L K NK V E E N L - E G L NY L DV QA T T P V D P R V L DK M L E F Y T
L R E R R MQV T A E D F K QA K E R V MK NK I E E N L - E G L Y Y MDMQA T T P T D P R V L DV M L K F Y T
L R E R R MK V T N E D F K K S K E NV L Y R K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MK V T S E D F K K S K E NV L Y R K N E G A P -QG L Y Y MDA QA T T P L D P R V L DK V M S Y Y V
L R E R R MR V T R T D F T T A R E K V L Y G K D E NT P -A G L Y Y L DMQA T T P MD P R V L DK M L P L F T
L R E R R MR V T Q E D L R K A K E K A L Y R K K G G I P - E G L Y Y F DY QA T T P V D P R V L DK MM P F F T
L R E R R MR V T Q E D L R K A K E K A L Y R K K G G I P - E G L Y Y F DY QA T T P V D P R V L DK MM P F F T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MD F QA T T P MD P R V L DA M L P Y QV
L R E R R MQV K A E D F K A A K E R V L K NK V E E N L - E G L Y Y L DV QA T T P T D P R V L DR M L E F Y T
L R E R R MR V T HT D F K K A K E K V L Y R K T A G A P - E G L Y Y L DMQ S T T P I D P R V L DA M L P L Y T
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y Y L DA QA T T P MD P R V L DA M L P Y L T
L R E R R MK V T N E D F K K S K E S V L Y R K K E G T P - E G L Y Y L DA QA T T P MD P R V L DA M L P Y L T
L R E R R K T V T MK D F I S A R E K V F F S K QK MV S -A G L Y F L DV Q S T T P V D P R V L DA M L P F Y T
L R E R R MK V NQ E D F K K A K E K V MY R K K E G V P -DG L Y Y L DNNA T T MV D P E V L N S M L P Y F S
L R E R R MK V T N E D F K K S K E N F L Y K K T E G T P - E G L Y Y L DV QA T T P L D P R V L DR M L P Y L T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L I
L R E R R MQV T A QD F K E A K E R V L K NK V E E N L - E G L Y Y L DMQA T T P T D P R V L DT M L K F Y T
L R E R R MK V C QA D F I K G K E NV QY R K DK S T F - S R F Y Y MDNQA T T P L D P R V L DA M L P Y MT
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L I
L R E R R MR V QMA D F R A A R E R V L R T K Q E G E P - E G L Y Y L DMQA T T P V D P R V L DA M L P L Y V
L R R R G K E I T MA D F MK A Y E K V V NV Q E P T I P -QA M F Y MDN S A T T P V R K E V V E E M L P Y L T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L I
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MR V QMA D F R A A R E R V L R T K Q E G E P - E G L Y Y L DMQA T T P I D P R V L DA MM P Y F T
L R E R R MK V T HA D F K K A K E K V M F K K K E G V P - E G L Y Y MDMQA T T P V D P R V L DA M L P F Y L
L R E R R MQV T HA D F S K A K E K V L Y K K K E G V P - E G M F - - -MQA T T P L D P R V L DA M L P Y F T
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L I
L R E R R MK I T Q E D F R K A K E K I L Y L K K G N I P - E G L Y Y L D F QA T T P T DY R V L DA M L P Y L T
L R E R R MQV K A DD F K S A K E R V L K NK V E E N L - E G L Y Y L DV QA T T P T D P R V L DK M L T F L T
L R E R R MK I T QA D L R K A R DK A L F QK K G N I P - E G L Y Y L D S QA T T M I D P R V L DK M L P Y MT
L R E R R MK I T QV D L R K A R DK A L Y QK K G N I P - E G L Y Y L D S QA T T M I D P R V L DK MM P Y MT
L R E R R MK I T Q L D L R K A R DK A L Y QK K G N I P - E G L Y Y L D S QA T T M I D P R V L DK MM P Y MT
L R E R R MK V T HT D F K K A K E K V M F K K K E G V P - E G L Y Y L DMQA T S P V D P R V L DA M L P Y Y L
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MDV QA T T P L D P R V L DA M L P Y L V
L R E R R MQV T A E D F K QA K E R V MK NK V E E N L - E G L Y Y L DMQA T T P T D P R V L DT M L K F Y T
L R E R R MR V V MDD F R QA R E K V L K T K D E G G P A G G L Y Y MD F QA T S P L DY R V L D S M L P F F T
L R E R R MK V NN E D F K K S K E NV L Y R K T E G T P - E G L Y Y L DA Q S T T P L D P R V MDA MM P Y S V
L R E R R MK V T N E D F K K S K E NV L Y K K Q E G T P - E G L Y Y MD F QA T T P MD P R V L DA M L P Y QV
L R E R R MK V T L DD F T K A K DK V L Y L K K G DT P -DG L Y Y L D F QA T T P L D F R V L DK MM P Y QT
L R E R R MQ I T QA D L MK A K E K V L F QK K G NV P -DV L Y Y L DNQA T T C V D P R V L D S MM P Y L T
L R E R R MQ I T QA D L MK A K E K V L Y QK K G NV P -DV L Y Y L DNQA T T C V D P R V L DA MM P Y L T
L R DG R L M E C QA D F R K G R E MV MY R R K E N I P - E G L Y Y L DT QA T S V L D P R V F DT M I P Y E T
L R DR R MK V C Q S D F V K G K E NV QY R K DK G R F - S K F Y Y L D L Q S T T P L D P R V L DK M L P Y MT
L R DR R MK V C QA D F V K G K E NV QY R K DK S S F - S K F Y Y L D F QA T T P L D P R V L DR M L P Y L T
L R E R R MR V T K K D F T T A R E R V I DR K N E G T P - E G L Y Y L DA Q S T T P V D P R V V DK MM P Y MT
L R E R R MR V T A E D F R T A K E R V MK NK V E E N L - E G L Y Y L DMQA T T P T D P R V L DV M L NY Y T

920

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

930 940 950 960 970 980

A Y Y G N P H S R T HA Y G W E T E E A V E K A R A QV A S L I G A -D P K E V V F T S G A T E S NN I S V K G I G R F K K H I I T T QT E HK C V
H E Y G N P H S R T H L Y G W E A E NA V E NA R NQV A K L I E A - S P K E I V F V S G A T E A NNMA V K G V MH F K K HV I T T QT E HK C V
G L Y G N P H S NT H S Y G W E T S Q E V E K A R K NV A DV I K A -D P K E I I F T S G A T E S NNMA L K G V A R F K NH I I T T R T E HK C V
G I Y G N P H S R T HA Y G W E S E K A V E QA R E Y I A K L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T S QT E HK C V
G I Y G N P H S R T HA Y G W E S E K A V E QA R E HV A K L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T T QT E HK C V
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -HK C V
NY Y G N P H S R T HA Y G W E S E A A M E C A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
ND F G N P H S R T H S Y G WK A E E G V E QA R K Y V A D L I K A -D P R D I V F T S G A T E S NN L A I K G V A K F K NH I I T L QT E HK C V
ND F G N P H S R T H S Y G WK A E E G V E QA R E HV A N L I K A -D P R D I I F T S G A T E S NN L A I K G V A K F K NH I I T L QT E HK C V
G L Y G N P H S S T HA Y G W E T DK E V E K A R T Y I A DV I NA -D P K E I I F T S G A T E T NNMA I K G V P R F K K H I I T T QT E HK C V
G L Y G N P H S NT HA Y G W E T NK E V E T A R DHV A K V I R A -D P K E I I F T S G A T E S NN L A I K G V G R F K K H I I T T R T E HK C A
NY Y G N P H S R T HA Y G W E S E A A M E HA R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
S Y Y G N P H S R T HA Y G WQA E DA V E V A R QQV A DV I NA -D P R E I I F T S G A T E S NN L A V K G V G R F K K H I I T T Q I E HK C V
E QY G N P H S R T HA Y G W E A E K A V D E A R QQV A Q L V G A -Q P K D I V F T S G A T E S NNM L I K G I A K F K K H I I T T QT E HK C V
E K F G N S H S R T HG Y G W E A E E A V E NA R T N I A N L I K C - L P K E I I F T S G A T E S NNT I I R G V C D I K NH I I T T Q I E HK C V
E K F G N S H S R T HG Y G W E A E E A V E NA R T N I A N L I K C - L P K E I I F T S G A T E S NNT I I R G V C D I K NH I I T T Q I E HK C V
NY Y G N P H S R T HA Y G W E S E S A M E K A R K QV A G L I G A -D P R E I V F T S G A T E S NNM S I K G V A R F K MH I I T T Q I E HK C V
G L Y G N P H S S T H S Y G W E T DK E V E K A R K Y V A DV I NA -D P K E I I F T S G A T E S NNMA V K G V P R F K K H I I T T QT E HK C V
E NY G N P H S K T HA Y G WT S ND L V E DA R E K V S K I I G A -D S K E I I F T S G A T E S G N I A I K G V A R F K NH I I T T V T E HK C I
N F Y G N P H S R T HA Y G W E T E S A V E K A R E QV A T L I G A -D P K E I I F T S G A T E S NN I A V K G V A R F K R HV I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E T A V E K A R E QV A N L I G A - E T K E I I F T S G A T E S NN I A V K G V A R F K K HV V T T QT E HK C V
T V F G N P H S R T HR Y G WQA E A A V E K A R S QV A S L I G C -D P K E I I F T S G A T E S NN L A L K G V S G F A A H I I T L QT E HK C I
E I Y G N P N S - L HA F G QK A R K A L S D S L D I I Y E C I G A S DDDT V L I T A N S T E G NNT V L K T M L A R R NK I I V S Q I E H P S I
G C Y G N P H S R T HA Y G W E S E A A T E R A R R QV A D L I G A -D P R E V I F T S G A T E S NNMA I K G V A R F K K H I I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
G L Y G N P H S NT H S Y G W E T NK E I E QA R K Y I A DV I K A -D P K E I I F T S G A T E S NNMA L K G V S R F R NH I I T T R T E HK C V
E E Y G N P N S R T HQY G W S A E E A V E K A R R QV A D L I G A - S P K E I F F T S G A T E C NN I A I K G V G N F K NH I I T L QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
G V Y G N P H S R T HA Y G W E S E K A V E DA R A HV A S L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T T QT E HK C V
E N F G N P - S S I Y E L G K I S K HA V E NA R K R V A DA I G A - E E N E I Y F T S G G T E S DNWT V K G V A F A G K H I I T S S I E HHA V
NY Y G N P H S R T HA Y G W E S E A A V E HA R QQV A S L I G A -D P R E I I F T S G A T E S NN L A I K G V A R F K K HV I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L V T T QT E HK C V
NV Y G N P H S R T HA Y G W E T DK A V E E A R K H I A D L I G A -D P K E I I F T S G A T E S NNM S I K G V A R F K K H I I T S QT E HK C V
S R Y G N P H S R T H L Y G W E S DA A V E E A R A R V A S L V G A -D P R E I F F T S G A T E C NN I A V K G V MR F R R HV V T T QT E HK C V
E QY G N P H S R T HMY G W E T E DA I E K A R G E L A S L I G A -NA K E I V F T S G A T E S NNM S L K G V A R F K K H I I T T T T E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L I T T QT E HK C V
NQY G N P H S K T H S F G W E T E K A V E NA R S Q I A N L I NT -Q P Q S I I F T S G A T E S NNA A L K G L Y G F K NH I I T T QT E HK C V
G MY G N P H S S T HA Y G W E T DK E V E K A R E Y V A A V I K A -D P K E I I F T S G A T E T NNMA I K G V P R F K K H I I T T QT E HK C V
Y I Y G NA H S R NH F F G W E S E K A V E DA R T N L L N L I NG K NNK E I I F T S G A T E S NN L A L I G I C T Y K NH I I T S Q I E HK C I
Y I Y G NA H S R NH F F G W E S E E A V E DA R K N I L H L I NG K NNK E I I F T S G A T E S NN L A L I G I C T Y K NH I I T S Q I E HK C I
Y I Y G NA H S R NH F F G W E S E QA V E DA R A N L I K L L NG NNNK E I I F T S G A T E S NN L A L I G T C T Y K NH I I T S Q I E HK C I
A R Y G N P H S R T H L Y G W E S DQA V E T A R S Q I A D L I G A - S P K E I V F T S G A T E S NN I S V K G V I K F K R HV V T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E A A M E R A R QQV A S L I G A -D P R E I I F T S G A T E S NN I A I K G V A R F K K H L V T T QT E HK C V
G L Y G N P H S NT H S Y G W E T NT A V E NA R A HV A K M I NA -D P K E I I F T S G A T E S NNMV L K G V P R F K K H I I T T R T E HK C V
G I Y G N P H S R T HA Y G W E A E K A V E NA R Q E I A S V I NA -D P R E I I F T S G A T E S NNA I L K G V A R F K K H L V S V QT E HK C V
A Y Y G N P H S R T H S Y G W E S DDA V E HA R K QV A N L I G A -DA R E I I F T S G A T E S NN I S V K G T A R F K K HV I T T QT E HK C V
NY Y G N P H S R T HA Y G W E S E T A M E T A R K QV A D L I G A -D P R E I I F T S G A T E S NNMA I K G V A R F K R HV I T T QT E HK C V
NMY G N P H S R S H E Y G WA T E K A T E DA R A QV A D L I G A -D P K E I T F T S G A T E S NNQA L K G L A A F K K H I I T T Q I E HK C I
HA F G N P H S R T H S Y G W E A E K A V E T A R A DV A N L I NC - E S K NV I F T S G A T E S NN L A I K G S K S F K NHV I T T Q I E HK C V
HA F G N P H S R T H S Y G W E A E K A V E T A R A D I A N L I NC - E S K NV I F T S G A T E S NN L A I K G S K S F K NHV I T T Q I E HK C V
Y V HG NA H S K QHG F G Q E A MA A V E K A R K S V A D L I NA -K P N E I I F T S G A T E C NN I A I K G A MG Y K K HV I V S S I E HK C V
E MY G N P H S R T H S Y G WT A E E A V E K A R T QV A D L I R A - S P K G V F F T S G A T E S NN I A I K G V A NY K NH L I T L QT E HK C V
E R Y G N P H S R T HR Y G WT A E DA V E K A R A E V A D L I G T - S P K G V F F T S G A T E S NN I A I K G V A Y Y K NH I I T L QT E HK C V
NQY G N P H S R T HA Y G W E S E K G V E E G R E H I A S L I G A -D P K E I I F T S G A T E S NNMA I K G V A H F K NH I I T T QT E HK C V
DMY G N P H S R T H S Y G W E T DT A V E K A R E E I A A L I G A -D P K E I I F T S G A T E S NNMV I K G I A R F K R H I I T T QT E HK C I

990
LD SCR A L EG
L D S C R H L QQ
L E A A R S MK D
L D S C R H L QD
L D S C R H L QD
LD SCR A L EG
LD SCR S L EA
LD SCR Y L EN
LD SCR Y L EN
L D S A R HMQD
L EAAR GMI N
LD SCR S L EA
LD SCR A L EN
LD SCR WL ST
L ST LR E L E L
L ST LR E L E L
LD SCR V L ET
L D S A R HMQD
LD SCR H L EM
LD SCR A L EN
LD SCR A L EN
L DT C R N L E E
S E S EK Y LK E
LD SCR S L EA
LD SCR S L EA
L E A A R A MK N
LD SCR Y L EM
LD SCR S L EA
L D S C R H L QD
L HA C A W L E G
LD SCR S L EA
LD SCR S L EA
L D S C R H L QD
L D S C R Y L QQ
LD SCR Q L ER
LD SCR S L EA
LD SCR Y L E E
L D S A R HMQD
L QT C R F L QT
L QT C R Y L QT
L QT C R Y L QT
L D S C R H L QQ
LD SCR S L EA
L E A A R A MMK
LD S LR A LQ E
LD SCR V L EG
LD SCR V L E S
L DT C R N L E E
L QC C R Q L E N
L QC C R Q L E N
I E S A R A L QK
LD SCR Y L EM
LD SCR Y L EM
LD SCR R LQ E
L D S C R Y L QD

1000

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1010 1020 1030 1040 1050 1060 1070

E G F HV T Y L P V Q S NG L I S M E E L E K A I T P - E T S L V S I MT V NN E I G V K Q P I A E I G R L C V F F HT DA A QA V G K I P L DV NK MN I D L M S I
E G F E V T Y L P V K T DG L V D L E M L R E A I R P -DT G L V S I MA V NN E I G V V Q P M E E I G M I C V P F HT DA A QA I G K I P V DV K K WNV A L M S M
E G F DV T F L NV N E DG L V S L E E L E QA I R P - E T S L V S V M S V NN E I G V V Q P I K E I G A I C V F F H S DA A QA Y G K I P I DV D E MN I D L L S I
E G F E V T Y L P V QNNG L I R M E D L E A A I R P -DT A L V S I MA V NN E I G V I Q P L E E I G K L C V F F HT DA A QA V G K I P L DV NK L N I D L M S I
E G F DV T Y L P V Q S NG L I R M E E L E A A I R P -DT A L V S I MA V NN E I G V I Q P M E E I G K L C I F F HT DG A QA V G K I P L DV NK L N I D L M S I
E G F R I T Y L P V QQNG I I N L K D L E DA I T P - E T S L V S I MT V NN E I G V R Q P I E A I G A I C V F F HT DA A QA V G K V P L DV NT MN I D L M S I
E G F K V T Y L P V K K S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I K E I G Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F K V T Y L P V DK G G MV DM E Q L E Q S I T P - E T C L V S I M F V NN E I G V V Q P I K Q I G E L C V Y F HT DA A QA T G K V P I DV ND L K I D L M S I
E G F K V T Y L P V DK G G MV DM E Q L T Q S I T A - E T C L V S I M F V NN E I G V MQ P I K Q I G E L C V Y F HT DA A QA T G K V P I DV N E MK I D L M S I
E G F E V T Y L P V S S E G L I N L DD L K K A I R K -DT V L V S I MA V NN E I G V I Q P L K E I G K I C V F F HT DA A QA Y G K I P I DV N E MN I D L L S I
E G F DV T F L S V DNQG L I DMK E L E E A I R P -DT C L V S V MA V NN E I G V MQ P L K E I G A L C I Y F HT DA A QA Y G K V P I DV N E MN I D L L S V
E G F QV T Y L P V K K S G I I D L K E L E S A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F K V T Y L P V K P NG I V D L K V L E E S F Q P -DT S L V S I I F V NN E I G - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
QG F E V T Y L P V L P NG L V S I N E L K A A L R P -DT S L V S I MA V NN E I G V I Q P L A E I S QA I P L F HT DA A QA V G K I P I DV E A L G I DA M S I
K G F R V T Y L K V NNK G L I S L E E L E K S I I P G E T I L A S I MHV NN E I G V I Q P MN L I G E I C V L F H S DV A QG L G K I N I DV DK WNA D F L S L
K G F R V T Y L K V NNK G L I S L E E L E K S I I P G E T I L A S I MHV NN E I G V I Q P MN L I G E I C V L F H S DV A QG L G K I N I DV DK WNA D F L S L
E G F D I T Y L P V K S NG L I D L K Q L E DT I R P -DT S L V S I MA I NN E I G V K Q P V K E I G H L C V F F HT DA A QA V G K I P V DV T DWK V D L M S I
E G F E V T Y L P V N E E G L I S L DD L R K S I R K -DT S L V S I MA V NN E I G V V Q P L K E I G K I C I F F HT DA A QA Y G K I D I DV N E MN I D L M S I
E G F K V T Y L P V G E NG L V D L E L L K NT I T P -QT S L V T I MA V NN E I G V V Q P I K E I G K I C V F F HT DA A QA V G K I P I DV NDMN I D L L S I
E G F K V T Y L P V L A NG L I D L QQ L E E T I T S - E T S L V S I MT V NN E I G V R Q P V D E I G K L C V F F HT DA A QA V G K V P L DV NA MN I D L M S I
E G F T V T Y L P V QT NG I I D L K Q L E E A L T P - E T S L V S I MA V NN E I G V K Q P I D E I G R L C V F F HT DA A QA V G K I P MDV NA MN I D L M S I
NG V E V T Y L P V G NDG V V D I DDV K K S I K E -NT V L V S I G A V N S E I G T V Q P L K E I G M L C V L F HT DA A QG V G K I Q I DV N E MN I D L L S M
R G I E V I K M P V N E DG V V D P K D L E R L I DD -K T A L V S C MWV NN E T G L I M P V E E L C K I A A L F H S DA T QA MG K I K V S V K DV P V DY L T F
E G F Q I T Y L P V QK NG L I D L K E L E A A F Q P -DT S L V S V MA V NN E I G V K Q P I R D I G E I C V F F HT DA A QA V G K I P L DV ND S K I D L M S I
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G R I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G Y E I T F L NV D E QG L I N L E E L E A A I R P - E T C L V S V MA V NN E I G V MQ P L K E I G E L C V F F HT DA A QA Y G K I P I DV N E MK I D L M S I
E G F E V T Y L P V QK NG I L D L K V L E A A I K P -T T C L V S C MA A HN E I G V L Q P I R E I G A L C V L F HT DA A QA L G K V K V DV NA DN I D L M S M
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F E V T Y L P V QN S G L V D L K E L E A A MR P - E T A L V S I MT V NN E I G V I Q P V E E I G K MC I F F HT DA A QA V G K I P MDV NA MN I D L M S I
QG F E V T Y L P V DR Y G MV S P E E L K NA I R D -DT I L I S I M L A NN E I G T I Q P V E E I G K I S I Y F HT DA V QA I G HV P I DV K K MNV D L L S L
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S I MT V NN E I G V K Q P I A D I G R I C V Y F HT DA A QA I G K I P L NV NDMK I D L M S I
E G F R V T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I R Q I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F E V T Y L P V K S S G L I DMA E L E A A I R P -DT A I V S I MA V NN E I G V I Q P L E E I G K L C I F F HT DA A QA V G K I P V DV NA MN I D L M S I
E G F E V T Y L P V R P DG L V DV A Q L A DA I R P -DT G L V S V MA V NN E I G V V Q P L E E I G R I C V P F HT DA A QA L G K I P I DV NQMG I G L M S L
E G F DV T Y L P V K E NG L V D L K E L E A A MR D -DT A I V S V MA V NN E I G V I Q P L K A I G E L C I F F HT DG A QA V G K V P MDV NDMN I D L M S I
E G F QV T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G R I C V Y F HT DA A QA V G K I P L DV NDMK I D L M S I
K G V E V T Y L P V D S NG L I S L QQ L Q E S I K S -NT L C V S V M L V NN E I G V I QN L K E I S R I C V Y V H S DMA QA I A K I P V DV QD L D I D L G S I
E G F DV T Y L P V D E HG L I S L DD L K A A I R K -DT I L V S V MA V NN E I G V V Q P L K E I G K I C I F F HT DA A QA Y G K I D I DV NDMN I D L L S I
K G F E V T Y L K P DT NG L V K L DD I K N S I K D -NT I MA S F I F V NN E I G V I QD I E N I G N L C I L F HT DA S QA A G K V P I DV QK MN I D L M S M
K G F E V T Y L K P E P NG I V K L E D I E K N I K E -NT I MA S F I HV NN E I G V I QD I E N I G L L C V I F HT DA S QA I G K I P I DV QK MN I D L L S M
K G F E V T Y L K P DA NG L I K L E D L K N S I K E -NT I L A S F I Y V NN E I G V I QD I E N I G K I C I I F HT DA S QA V G K I K I DV QK L N I D L L S L
E G F E V T Y L P V G NDG I V D L E K L K G S I R P -DT G L V S V MA V NN E I G V I Q P M E E I G E I C V P F HT DA A QA L G K I P I DV DK WNV S L M S L
E G F R V T Y L P V QK S G I I D L K E L E A A I Q P -DT S L V S V MT V NN E I G V K Q P I A E I G Q I C L Y F HT DA A QA V G K I P L DV NDMK I D L M S I
E G F E V T F L NV DDQG L I D L K E L E DA I R P -DT C L V S V MA V NN E I G V I Q P I K E I G A I C I Y F HT DA A QA Y G K I H I DV N E MN I D L L S I
E G F E V T F L P V QT NG L I N L D E L R DA I R P -DT V C V S V MA V NN E I G V C Q P L E E I G K I C V F F H S DA A QG Y G K I D I DV NR MN I D L M S I
E G F D I T Y L P V K P NG I I D L K E L E A A F R P -DT V L C S I MA I NN E I G V K Q P MK Q I G E MC V F F HT DA A X A V G K I P V DV NDMK I D L M S I
E G F S V T Y L P V QK NG L V D L E L L E A S I R P -DT S L L S V MT V NN E I G V QQ P I D E I G R I C V F L HT DA A QA V G K I P I NV S DWK V D L M S I
QG Y E I T Y L P V QK NG L V D L E V F K NA I R P -DT L V A S I I L V HN E I G V I QD I K T I G K I C V F F HT DA A QA L G K I P I NV D E MN I D L M S M
E G Y S V T Y L K P DK Y G M I L P D L V R K N I R P - E T F L C S V I HV NN E I G V I QN I S E I G R I C V I F HT DA A Q S F G K L P I D L K N L DV D L L S I
E G Y S V T Y L K P DK Y G M I L P E E V R K N I R P - E T F L C S V I HV NN E I G V I QD I A E I G K V C V I F HT DA A Q S F G K L P I D L K N L E V D L L S I
E G F DA T F L QV G K DG R V D P K E V A K N I R P -DT G L V S C M L V NN E I G S I N P V Q E I S K I C V W F HT DA A QG F G K I P I DV K K I G A N F M S I
E G F E V T Y L P V E K NG I V N L QK L E E A I R P -T T A L V S C MY V NN E I G V I Q P I G E I G K I C V L F HT DA A QA V G K L D I DV DR DN I D L M S V
DG F E V T Y L P V E K NG L V N L QK I E E A I R P -T T A L V S C MY V HN E I G V I Q P I S E I G N L C V L F HT DA A QA L G K V S I DV E R DN I D L M S L
E G F E V T Y L P V Q S NG L I D L K Q L E E A L R P -T T A L V S I MT V NN E I G V I Q P I K E I G Q L L P F F HT DA A QA A G K I R L DV N E L G I D L M S L
E G F E V T Y L P V L S S G L I DMK Q L E A A I R P -DT A L V S I MA V NN E I G V I Q P I A E I G A L C V F F HT DA A QA V G K I P I DV NA DK I DV M S I

1090

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1100 1110 1120 1130 1140 1150 1160

S G HK I Y G P K G I G A L Y V -R R -K P R V R V E A I Q S G G G Q E R G L R S G T V P T P L A V G L G A A C E I A A R E MA Y DHR WM E F L S K R L NG D - - P
S A HK I Y G P K G V G A L Y V -R R -R P R I R L E P L MNG G G Q E R G L R S G T G A T QQ I V G F G A A C E L A MK E M E Y D E K W I K G L Q E R L NG S - -M
S S HK I Y G P K G I G A L Y V -R R -R P R V R M E P L L S G G G Q E R G F R S G T L P P P L V V G L G HA A K L MV E E Y E Y D S A HV R R L S DR L NG S - -A
S S HK I Y G P K G I G A C Y V -R R -R P R V R L E P I I S G G G Q E R G L R S G T L A P H L V V G F G E A C R I A S QDM E Y DR K HV E R L S K R L NG D - -A
S S HK I Y G P K G MG A C Y V -R R -R P R V R L E P I I S G G G Q E R G L R S G T I A P H L V V G F G E A C R I A Y E DM E Y D S K H I A R L S K R L NG D - - P
S G HK V Y G P K G V G A L Y I -R R -R P R V R V E P I Q S G G G Q E R G MR S G T V P T P L V V G L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L A E R L NG D - - P
S G HK I Y G P K G A G A L Y V -R R -R P R V R V E A QM S G G G Q E R G L R S G T V A A P L C I G L G E A A R I A G R E M E MDK A HV E R L S R M L NV D - - E
S A HK I Y G P K G A G A L Y V -R R -R P R V R I E A QM S G G G Q E R G L R S G T V A A P L C I G L G E A A K I A DK E MA MDK A HV E R L S QM L NG D - -A
S S HK I Y G P K G I G A C Y V -R R -R P R V R L D P I I T G G G Q E R G L R S G T L A P P L V A G F G E A A R L MK Q E S S F DK R H I E K L S S K L NG C NDA
S S HK I Y G P K G I G A L Y V -R R -R P R V R L E P L L S G G G Q E R G L R S G T L A P P L V A G F G E A A R L MH E E Y NA D I A H I DK L S S K L NG S - -A
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L A DR L NG D - - P
-----------------------------------------------------------------------------------
S G HK L Y G P K G V G A A Y V -R R -R P R V R L E P L I HG G G Q E R G L R S G T V A A P L V V G L G E A C R I A E N E MA A DHA R I K A L S DR L NG D - - -
S A HK V Y G P K G I G A F Y I -R S -K P R R R I K P L I F G G G Q E R G MR S G T M P V P L A V G F G E A C K I A S S E MN S D S I HV K S L Y DK L NG C - -G
S A HK V Y G P K G I G A F Y I -R S -K P R R R I K P L I F G G G Q E R G MR S G T M P V P L A V G F G E A C K I A S S E MN S D S I HV K S L Y DK L NG C - -G
S A HK I Y G P K G V G A L F V -R R -R P R V R L E P L Q S G G G Q E R G L R S G T V P T P L A V G L G A A C E I A QQ E L E Y DHK R V S L L A NR L NG D - - P
S S HK I Y G P K G I G A C Y V -R R -R P R V R L D P I V T G G G Q E R G L R S G T L A P P L V A G F G E A S R L MK E E MD - - - - - - - - - - - - - - - - - - -
S G HK I Y G P K G V G A L F V -R R -R P R V R I E P I T T G G G Q E R G I R S G T V P S T L A V G L G A A C D I A L K E MNHDA A WV K Y L Y DR L NG D - - L
S G HK I Y G P K G V G A L Y V -R R -R P R V R L E P I Q S G G G Q E R G L R S G T V P A P L A V G L G A A A E L S L R E MDY DK K WV D F L S NR L NG D - -A
S G HK I Y G P K G V G A L Y V -R R -R P R V R L E P I Q S G G G Q E R G L R S G T V P A S L A V G L G A A A E L S QQ E M E Y DK K W I D F L S NR L NG D - -A
C A HK I Y G P K G I G A L Y V -R R -R P R V R MV P L I NG G G Q E R G L R S G T V A S P L V V G F G K A A E I C S K E MK R D F E H I K E L S K K L NG S - - -
T A HK F HG P K G V G A L F I -R A G K P - - - I T P L L HG G E QMG G L R S G T I DT P S V V G MA V A L K K A T HD I N I E NT Y V R K L R DK L V G K - - P
S G HK I Y G P K G V G A I Y V -R R -R P R V R L E P L Q S G G G Q E R G L R S G T V P T P L A V G L G A A C E V A Q E E M E Y DHK R I S Q L A E R L NG D - -R
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L S E R L NG D - - P
S S HK I Y G P K G I G A I Y V -R R -K P R V R L D P L I S G G G Q E R G L R S G T L A P P L V A G F G E A A R L MMK E Y E ND S NH I K R L S DK L NG S - -A
S S HK V Y G P K G C G A L Y V -R R -R P R V R L R S P V S G G G Q E R G V R S G T V A A A L V V G MG A A C E V A MK E WK R DA A HT E R L Q E R L NG D - - L
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L A E R L NG D - - P
S G HK I Y G P K G I G A C Y V -R R -R P R V R L D P I I S G G G Q E R G L R S G T L A P P L I V G F G E A C R I A K Q E M E Y D S K R V K Y L S DR L NG H - - P
S G HK F G G P K G C G A L Y I -R K - - -G T K I E A F L HG G A Q E R K R R A G T E NV P S I V G L G K A I G L A T G E M E E T NK P L L E MR E R L NG H - - P
S G HK L Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A Q E E M E NDHK R I S M L A E R L NG D - - P
S G HK L Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E L A QQ E M E Y DHK R I S K L A E R L NG D - - P
S S HK I Y G P K G I G A C Y V -R R -R P R V R L D P I I S G G G Q E R G L R S G T L A P P L V V G F G E A C R I A K E E M P Y D S K R I K H L S DR L NG D - - P
S A HK I Y G P K G V G A L Y L -R R -R P R I R V E P QM S G G G Q E R G I R S G T V P T P L V V G F G A A C E I A A K E MDY DHR R A S V L QQR L NG S - -M
S G HK F Y G P K G I G A L Y V -R R -R P R V R M E P I I NG G G Q E R G L R S G T L P T P L I V G I G E A A R V A QK E L QR D E E HV NR L A K R L NG D - -R
S G HK I Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E V A QQ E M E Y DHK R I S K L S E R L NG D - - P
S A HK L Y G P K G I G A L Y V -R R -K P R V R L QQ I I HG G G Q E R G L R S G T L A P H L C V G F G K A A E I A L T E L P Y D I QHV DK L Y NR L NG S - - L
S S HK I Y G P MG I G A C Y V -R R -R P R V R L D P I I T G G G Q E R G L R S G T L S P P L V A G F G E A A R L MK E E MDY DK A H I T R L S NK L NG S NN P
S G HK L Y G P K G I G A L Y I K R K -K P N I R L NA L I HG G G Q E R G L R S G T L P T H L I V G F G E A A K V C S L E MNR D E K K V R Y F F NY V NG C - -Q
S G HK L Y G P K G I G A L Y I K R K -K P N L R L NA L I HG G G Q E R G L R S G T L P T H L I V G L G E A A N L G S I E MNR DHK K MK F F F DY V NG C - -Q
S S HK L Y G P K G V G A L Y I K R K -K P N I R L NA I I HG G G Q E R G L R S G T L P T H L I V G L G E A A N I C L S E MDR DNK K MN F F F NY V NG C - -Q
S G HK I Y G P K G V G A L Y M -R R -R P R I R V E P QMNG G G Q E R G I R S G T V P T P L V V G MG A A C E L A K K E M E Y DDK R I R A L H E R MNG S - -V
S G HK L Y G P K G V G A I Y I -R R -R P R V R V E A L Q S G G G Q E R G MR S G T V P T P L V V G L G A A C E L A QQ E M E Y DHK R I S K L A E R L NG D - - P
S S HK I Y G P K G I G A I Y V -R R -R P R V R L E P L L S G G G Q E R G L R S G T L A P P L V A G F G E A A R L MK K E F DNDQA H I K R L S DK L NG S - - P
S A HK I Y G P K G I G A A Y V -R R -R P R V R L E P L I S G G G Q E R G L R S G T L A P S QV V G F G T A A R I C K E E MK Y DY A H I S K L S QR L NG D - - P
S G HK I Y G P K G I G A L Y V -R R -R P R V R V E A L Q S G G G Q E R G MR S G T L P A P L V V G L G A A C E V S QQ E M E Y DHK R I S A L S E R L NG D - - P
S G HK I Y G P K G V G A L Y V -R R -R P R V R L E P L Q S G G G Q E R G L R S G T V P T P L A V G L G A A C S V A QQ E I E Y DHQR V S M L A NR L NG D - - P
S S HK V Y G P K G I G G L Y V -R R -K P K V R I L P I I NG G G Q E R G L R S G T L A P H L C V G F G E A C E I A K R E MDNDK K H I QR L S E K F NG D - -K
S G HK I Y G P K G V G A L F V -R T -K P R I R L Q P I I DG G G Q E R G L R S G T L P T A L V V G L G T A A K I A K M E MK R DQ L HM E N L F F K L NG S I K P
S G HK I Y G P K G V G A L F V -R T -K P R I R L Q P I I DG G G Q E R G L R S G T L P T A L V V G L G T A A K I A K M E M E R DHR HM E N L F F K L NG S I K P
S G HK I HG P K G I G A L Y V - S S -R P R S R V E P I I NG G G Q E R N I R S G T L A V P L I V G L G K A A E I A K R E MK Y D S P Y I E S L G K H L NG S - - L
S S HK I Y G P K G C G A L Y M -R R -R P R V R V R S P V S G G G Q E R G V R S G T I A T P L A V G L G A A C E L A K V E MK R D S E R I A Q L S K R L NG D - -V
S S HK I Y G P K G C G A L Y M -R R -R P R V R V R S P V S G G G Q E R G V R S G T V A T A QV V G MG A A C A I A K V E M E R D S A H I S R L S K R L NG D - - L
S S HK L Y G P MG I G A C Y I -R R -R P R V R L E P I I NG G G Q E R G L R S G T L A P P L I A G F G E A A R L A K Q E L A Y DHA H I S K L S QR L NG D - - -
S G HK L Y G P MG I G A C Y V -R R -R P R V R L E P I I T G G G Q E R G L R S G T L A A P L V A G F G E A A R L C R Q E M P Y DT A H I K K L S DK L NG D - -A

1170

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1180 1190 1200 1210 1220 1230 1240

V Q S Y P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T I E E V DY T A E K C I
D S R Y V G N L N L S F A Y V E G E S L L MG L - -K E V A V S S G S A C T S A S L E P S Y V L R A L G V D E DMA HT S I R F G I G R F T T K E E I DK A V E L T V
DHR Y P G C V NV S F A F V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA I G R DDA L A H S S I R F G I G R F T T E A E V DY V I K A I T
E R HY P G C V NV S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G S S D E S A H S S I R F G I G R F T T D S E I DY V L K A V Q
DR HY P G C V N I S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G S S D E S A H S S I R F G I G R F T T D S E I DY V L K A V Q
- - - - - - - - - - - - - - - -G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G L G R F T T I E E V DY T A E K T I
E HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
K HA Y P G C V N L S F A Y V E G E S L L MA L - -K S I A L S S G S A C T S A S L E P S Y V L R A I G S E E D L A H S S I R F G L G R F T T E E E V K HT I D L C V
R HA Y P G C V N L S F A Y V E G E S L L MA L - -K S I A L S S G S A C T S A S L E P S Y V L R A I G S E E D L A H S S I R F G L G R F T T D E E V K HT I D L C I
K S QY P G C V NV S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L HA L G A DDA L A H S S I R F G I G R F T T E A E V DY V I QA I N
E K R Y P G C V NV S F A Y V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA L G K DDA L A H S S I R F G I G R F T T E E E V DY V L K A I T
E HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
-----------------------------------------------------------------------------------
V NG Y P G C V N L S F S Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G A A E DMA H S S L R F G I G R F T T E E E I D L V V QR I V
V NR M F G N L N L S F T G V E G E S L MMK L - -Y S L A L S S G S A C T S A S L E P S Y V L R A I G V G E DV A HT S I R F G L G R F T K H E DV DK A V K E I V
V NR M F G N L N L S F T G V E G E S L MMK L - -Y S L A L S S G S A C T S A S L E P S Y V L R A I G V G E DV A HT S I R F G L G R F T K H E DV DK A V K E I V
DQR Y P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G I G R F T T E E E V DY T A E K C I
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -V C - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - I F V - - - - - - - - - - - - -
NA R Y Y G N L N I S F S Y V E G E S L L MA I - -K DV A C S S G S A C T S S S L E P S Y V L R S L G V E E DMA H S S I R F G I G R F T T E Q E I DY T I E I L K
K A T Y NG C L N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T V E E V DY T A DK C I
V A T Y NG C L N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G L G R F T T V E E V DY T A DK C I
E K G F P G C V NV S F P F V E G E S L L MH L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G R DD E L A H S S I R F G I G R F T MA K E I D I V A NK T V
E L R V P NT I L V A F K G V E G E A M L WD L NK HG I A A S T G S A C A S E S L QA N P T F K A MK F G E D L S HT G I R L S L S R F NT E E E I DY T I D I I K
E HR Y P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G I G R F T T E E E I DY T V QK C I
K HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
DHR Y P G C V N I S F A Y V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA L G K DDA L A H S S I R F G I G R F T T D E E I DY V I K A I T
K HR L P G N L N I S F S C V E G E S L L MG M - -R DV A V S S G S A C T S A S L E P S Y V L R A L G V DA E NA HT S I R F G I G R F T T A K E V D L V I E E C V
E HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
DH F Y P G C V NV S F A Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G N S D E S A H S S I R F G I G R F T T E R E I DY V L K A V Q
T E R L A NNV NV T F E Y I E G E S L L L L L NA K G I F A S T G S A C N S T S L E P S HV L T A C G V P H E I V HG S L R L S L G R MNT L E DV DR V L E V L P
QQHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T A E K C I
K QHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V F R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T A E K C I
NH F Y P G C V NV S F A Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G N S D E S A H S S I R F G I G R F T T E Q E I DY V L K A V T
E HR Y P G N L N L S F A Y V E G E S L L MG L - -K E V A V S S G S A C T S A S L E P S Y V L R A L G V E E DMA HT S I R F G I G R F T T E E E V DR A I E L T V
E A R Y HG NV NM S F A Y V E G E S M L MG L - -K E I A V S S G S A C T S A S L E P S Y V L R A L G V N E E MA HT S V R Y G L G R F T T E A E V DR A I E A T V
K HHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V E K C I
E HR Y K G N L NV S F A F V E G E S L I MA I - -K QV A V S S G S A C T S A S L E P S Y V L R A L G V Q E DMA HT S L R I G I G R F T T E K E V D F L L DQ L S
E S QY P G C V N I S F A Y I E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L HA L G A DDA L A H S S I R F G I G R F T T E E E V DY V I K A I N
I NR Y Y G NMN I S F L F V E G E S L L M S L - -N E I A L S S G S A C T S S T L E P S Y V L R S I G I S E D I A HT S I R I G F NR F T T F F E V QQ L C I N L V
T NR Y F G NMNV S F L F V E G E S L L M S L - -N E I A L S S G S A C T S S T L E P S Y V L R S I G I S E D I A HT S I R I G F NR F T T F F E V QQ L C E N L V
I NR Y F G NMN I S F L F V E G E S L L M S L - -ND I A L S S G S A C T S S T L E P S Y V L R S I G I T E E I A HT S I R I G F NR F T T F F E V QQ L C K N L V
E R R Y A G N L N L S F A Y V E G E S L L MG L - -K DV A V S S G S A C T S A S L E P S Y V L R A L G V D E DMA HT S I R F G I G R F T T E E E I DR A I E L T V
K QHY P G C I N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G T D E D L A H S S I R F G I G R F T T E E E V DY T V QK C I
DHR Y P G C V NV S F A Y V E G E S L L MA L - -R D I A L S S G S A C T S A S L E P S Y V L HA L G K DDA L A H S S I R F G I G R F S T E E E V DY V V K A V S
K S R Y P G C V N I S F NY V E G E S L L MG L - -K N I A L S S G S A C T S A S L E P S Y V L R A I G Q S D E NA H S S I R F G I G R F T T E A E I DY A I E NV S
D E T Y P G C V N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A Q E D L A H S S I R F G I S R F T T E E E V DY T A E K C V
NQR Y P G C V N L S F A Y V E G E S L L MA L - -K DV A L S S G S A C T S A S L E P S Y V L R A I G A D E D L A H S S I R F G I G R HHR R R S G L HG R K MY L
DQR Y V G N I N I S F E F V E G E S L MMG I - -K QC A V S S G S A C T S A S L E P S Y V L R A L G V N E E L A HT S L R I G F G R F T T D E E V DY L I N L L S
G E R Y F G N L NM S F E F I E G E S L L M S L - - S N F A L S S G S A C T S A S L E P S Y V L R S L DV S E E L A HT S I R F G L G R F T M E S E V DMA L E S I T
G QR Y F G N L NM S F E F I E G E S L L M S L - - S N F A L S S G S A C T S A S L E P S Y V L R S L DV S E E L A HT S I R F G MG R F T I E S E V DMA L D S I T
E HR W F G C V N I S F E A V E G E S L MA T I - - P N F G V S S G S A C T S A S L E P S Y V L K G I G V G D E L A HT S L R I G I S K F T T R E E V DQ F V E L L E
E R R F HG N L N I S F A C V E G E S L L MG M - -K K V A V S S G S A C T S A S L E P S Y V L R A L G I DA E NA HT S I R F G I G R F T T E R E V DV T V E E C A
E K R Y P G N L N I S F S C V E G E S L L MG M - -K NV A V S S G S A C T S A S L E P S Y V L R A L G I DA E NA HT S I R F G I G R F T T E R E I DV T I E E C V
QNG Y P G C L N L T F QY V E G E S L L MA L - -K D I C L S S G S A C T S A S L E P S Y V L R A L G L ND E NA H S S L R F G I G R F T T E E E V DY V A DK I I
E HHY P G C V N I S F A Y V E G E S L L MA L - -K D I A L S S G S A C T S A S L E P S Y V L R A L G A DDA L A H S S I R F G I G R F T T E A E V DY V L K A V Q

1250

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1260 1270 1280 1290 1300 1310 1320

K HV T R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S A G G K F R I S L G L P V G A V I NC A DNT G A K N L Y V I A V HG I R G R L NR L P A A
K QV E K L R E M S P L - - - - - - - -Y E M S K R - - - - - -G R G G T S G NK F R M S L G L P V A A T V NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
E R V E F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P T G A I MNC A DN S G A R N L Y I MA V K G S G S R L NR L P A A
DR V H F L R E L S P L - - - - - - - -W E - - - - - - - - - - - - - - - - - - - - -MT L G L P C G A V MNC C DN S G A R N L Y I I S V K G V G A R L NR L P A A
DR V H F L R E L S P L - - - - - - - -W E M S A R - - - - - -G R G G A S G NK L K MT L G L P C G A V L NC C DN S G A R N L Y I I S V K G I G A R L NR L P A A
R HV E R L R E M S P L - - - - - - - -W E I K T L - - - - - -G R G G S A G A K F R I S L G L P V G A V I NC A DNT G A K N L Y V I A V QG I K G R L NR L P A A
HHV K R L R E M S P L - - - - - - - -W E M S K L - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
R E T E R L R E L S P L - - - - - - - -W E M S K R - - - - - -G R G G A S G A K F R I S L G L P V G A V MNC A DNT G A K N L F V I S V Y G I R G R L NR L P S A
R E T NR L R D L S P L - - - - - - - -W E M S K R - - - - - -G R G G A S G A K F R I S L G L P V G A V MNC A DNT G A K N L F V I S V Y G I R G R L NR L P S A
E R V D F L R K M S P L - - - - - - - -W E - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -MNC A DN S G A R N L Y V L A V K G T G A R L NR L P A A
E R V K F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P T G A I MNC A DN S G A R N L Y I MA V K G S G S R L NR L P A A
HHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
- - - - - - - - - - - - - - - - - - - - - -M S X X - - - - - -X X X X X X X X X F R I S L S L P V G A V V NC A DNT G A K N L Y I I A V K G I R G R L NR L P A A
S V V NK L R DM S P L - - - - - - - -W E M S - - - - - - - - I K S A A A G T K F R M S L G L P V G A V MNC A DN S G A K N L Y V I S V I G F G A R L NR L P A A
E S V T L L R K M S P L - - - - - - - -WDM -K R - - - - - -G R G A A G G A K MR I T L G L NV G A L I NC C DN S G G K N L Y I I A V K G T G S C L NR L P S A
E S V T L L R K M S P L - - - - - - - -WDM -K R - - - - - -G R G A A G G A K MR I T L G L NV G A L I NC C DN S G G K N L Y I I A V K G T G S C L NR L P S A
HQV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - L G L P V G A V V NC C DN S G A R N L Y I V S V K G F G A R L NR L P A A
K NV QR L R DM S P L - - - - - - - -W E M - - - - - - - - - S K A QA V G S NY R V S L G L P V G A V MN S A DN S G A K N L Y V I A V K G I K G R L NR L P S A
K HV E R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G T A G G K F R I S L G L P V G A V MNC A DNT G A K N L Y V I A V HG I R G R L NR L P A A
K HV E R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G T A G G K F R I S L G L P V G A V MNC A DNT G A K N L Y V I A V HG I R G R L NR L P A A
E A V QK L R E M S P L - - - - - - - -Y E MA A E K K T E V L E K K I S I K P R Y K MT R G I QV E T L MK C A DN S G A K I L R C I G V K R Y R G R L NR L P A A
K S V DR L R Q L S S T - - - - - - - -Y A M P K R - - - - - -G A G G R QG NK F R V T C G L NNA S T V NC A DNT G A K T L T I I S V K G F HG R L NR L P R A
QHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I K G A DNT G A K N L Y I I S V K G I K G R L NR L P A A
QHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
E R V D F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P T G A I MNC A DN S G A R N L Y I MA V K G S G S R L NR L P A A
R NV E R L R E L S P L - - - - - - - -WDM -G K - - - - - -DQA NV K G C R F R V S V A L P V G A V V NC A DNT G A K N L Y V I S V K G Y HG R L NR L P S A
QHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
E R V S F L R E L S P L - - - - - - - -W E M -A K - - - - - - L S R G A P G G K L K MT L G L P V G A I MNC A DN S G A R N L Y I I S V K G I G A R L NR L P A G
E I V QK L R NM S P L - - - - - - - -T P - - - - - - - - - - - - - -MK G MR S N I P R A L NA G A Q I A C V DNT G A K V V E I I S V K K Y R G V K NR M P C A
HHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
HHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
E R V G F L R E L S P L - - - - - - - -W E M -A K - - - - - -Q S R G A P G G K L K MT L G L P V G A I MNC A DN S G A R N L Y I I S V K G I G A R L NR L P A G
HQV K K L R DM S P L - - - - - - - -Y E M S K R - - - - - -G R G G S A G NK F R M S L G L P V A A T V NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
R QV E K L R E M S P L - - - - - - - -W E M S K R - - - - - -G G G NA S G T K Y K M S Y G V P V G A V V NC A DNT G A K N L Y L I A V K R WG S R QNR L P A A
QHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
G A V R K L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G QV G I K L R I T L A C NV G A V L NC A DN S G A K N I Y V I S T F G I K G H L S R L P S A
E R V E F L R K M S P L - - - - - - - -W E M - - - - - - - - - S G S G A S G NK F R M S L A L P V G A V MNC A DN S G A R N L Y V L A V K G V G A R L NR L P A A
K S V E R L R S I S P L - - - - - - - -Y E M -K R - - - - - -G R A G T L K NK MR I T L S L P V G A L I NC C DN S G G K N L Y I I A V QG F G S C L NR L P A A
K S V K R L R S I S P L - - - - - - - -Y E M -K R - - - - - -G R A G T L K NK MR I T L S L P V G A L I NC C DN S G G K N L Y I I A V QG F G S C L NR L P A A
K S V K R L R S I S P L - - - - - - - -Y E M -K R - - - - - -G R A G T L K NK MR I T L S L P V G A L I NC C DN S G G K N L Y I I A V QG F G S C L NR L P A A
QQV E K L R E M S P L - - - - - - - -Y E M S K R - - - - - -G R G G S A G NK F R M S L G L P V A A T V NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
HHV K R L R E M S P L - - - - - - - -W E M S K R - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P A A
DR V K F L R E L S P L - - - - - - - -W E M - - - - - - - - - S G NG A QG T K F R I S L G L P V G A I MNC A DN S G A R N L Y I I A V K G S G S R L NR L P A A
R QV S F L R NM S P L - - - - - - - -WDM - S R - - - - - -G R G A A S G T K Y R MT L G L P V QA I MNC A DN S G A K N L Y I V S V F G T G A R L NR L P A A
H E V T Q L R E M S P L - - - - - - - -W E - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -G K N L Y I I A V S G I G G R L NR L P NA
P C L QT E G D E S S MG DG S G R HR Y Q - - - - - - - - - -G R G G S S G A K F R I S L G L P V G A V I NC A DNT G A K N L Y I I S V K G I K G R L NR L P S A
E K V S K L R E M S P L - - - - - - - -W E A A A R - - - - - -G R G G QV G T K A K V S L G L P V G A V MNC A DN S G A K N L Y T I A C F G I K G H L S K L P S A
K V V E K L R N L S P L - - - - - - - -Y E M -K R - - - - - -G R G G S G G NK L R V T L G L P V G A L I NC C DN S G G K N L Y L I A V K G T G A C L NR L P S A
K V V E K L R N L S P L - - - - - - - -Y E M -K R - - - - - -G R G G S G G NK L R V T L G L P V G A L I NC C DN S G G K N L Y L I A V K G T G A C L NR L P S A
HA V K H L R D L S P L - - - - - - - -W E M S K R - - - - - -G R T G QQG T K F A MT A G L P V G A V I NC C DN S G A K NM F I I S V R G HK G R L NR L P A A
R T V E R L R E M S P L - - - - - - - -WDM -G K - - - - - -DK A NV K G C R F R V S L A L P V G A V V NC A DNT G A K N L Y I I S V K G Y HG R L NR L P A A
R NV E R L R E M S P L - - - - - - - -WDM -G K - - - - - - E K A NV K G C R F R V S L A L P V G A V V NC A DNT G A K N L Y I I S V K G Y HG R L NR L P A A
K V V NK L R DM S P L - - - - - - - -W E M - - - - - - - - - - S K A A V G T K F R MT L A L P V G A V MNC A DN S G A K N L F V I A V HG I G A R L NR L P A A
E R V N F L R E L S P L - - - - - - - -W E M - - - - - - - - - - - S G A S G T K Y K M S MA L P V G A I MNC A DN S G A R N L Y V I A V K G C G A R L NR L P A A

1330

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1340 1350 1360 1370 1380 1390 1400

G V G DM F V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
C V G DMV MA T V K K G K P D L R K K V L P A V I V R QR K P WR R K DG V F MY F E DNA G V I V N P K G E MK G S A I T G P I G K E C A D L W P R I A - - - S A
S L G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K P WR R K DG V Y L Y F E DNA G V I A N P K G E MK G S A I T G P V G K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V M P A V V V R Q S K P WR R P DG I Y L Y F E DNA G V I V NA K G E MK G S A I T G P V G K E A A E L W P V S S L L F S N
G V G DMV MA T V K K G K P E L R K K V M P A V V V R Q S K P WR R P DG I Y L Y F E DNA G V I V NA K G E MK G S A I T G P V G K E A A E L W P R I A - - - S N
G S G DM I V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DM F V C S V K K G K P E L R K K V L QG V V I R QR K Q F R R K DG T F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - -A N
G V G DM F V C S V K K G K P E L R K K V L QG V V I R QR K Q F R R K DG T F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - -A N
A A G DMV MA T V K K G K P E L R K K V M P A I V I R Q S K P WR R R DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S L G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K A WR R K DG V Y L Y F E DNA G V I A N P K G E MK G S A I T G P V G K E C A D L W P R V A - - - S N
G V G DMV MA T V K K G K P X L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V X K E C A D L W P X I A - - - S N
G V G D I V L A T V K K G K P E L R K K V H P A V I I R Q S K S Y R R K HG QM I Y F E DNA G V I V NQK G E MK G - - - - - - - - - - - - - - - - - - - - - - -
A A G DMV MA S V K K G K P E L R K K V M P A V I C R QR K P WR R R DG I F L Y F E DNA G V I V NA K G E MK G S A I NG P V A K E C A D L W P R I A - - - S N
S I G DMV L A T V K K G K P E L R K K V W P A V I V R QR K A F R R P E G T F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E C A E L W P K V S - - -A A
S I G DMV L A T V K K G K P E L R K K V W P A V I V R QR K A F R R P E G T F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E C A E L W P K V S - - -A A
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S A G DMV MA T V K K G K P E L R K K I M P A I V V R QA R P WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V C T G L V V R QR K HWK R K DG V Y I Y F E DNA G V MC N P K G E V K G N - I L G P V A K E C S D L W P K V A - - -T N
G V G DM F V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DM F V A T V K K G K P E L R K K V M P A V V I R QR K P F R R R DG V F I Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
A P G D I C V V S V K K G K P E L R K K V HY A I L I R QK K I WR R T DG S H I M F E DNA A V L I NNK G E L R G A Q I A G P V P R E V A DMW P K I S - - - S Q
G C G DMV V A T C K K G K P E Y R K K MHT A V I I R QR R T WR R K DG V T L Y F E DNA A V I V NMK G E MK G S A I T G P V S K E S A D L W P K I S - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S L G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K A WR R K DG V F L Y F E DNA G V I A N P K G E MK G S A V T G P V G K E C A D L W P R I A - - - S N
A L G DMV MC S V K K G K P E L R K K V L NA V I I R QR K S WR R K DG T V I Y F E DNA G V I V N P K G E MK G S G I A G P V A K E S A D L W P K I S - - -T H
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V I V R Q S K P WK R T DG V F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E A A E L W P R I A - - - S N
G I G DMC V V S V K K G T P E MR K QV L L A V V V R QK Q E F R R P DG L HV S F E DNA MV I T D E E G I P K G T D I K G P V A R E V A E R F P K I G - - -T T
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V I V R Q S K P WK R F DG V F L Y F E DNA G V I V N P K G E MK G S A I T G P V G K E A A E L W P R I A - - - S N
C V G DMV MA T V K K G K P D L R K K V M P A V I V R QR K P WR R K DG V Y MY F E DNA G V I V N P K G E MK G S A I T G P I G K E C A D L W P R I A - - - S A
N P G S MV MA T V K K G K P D L R K K V F P A I I V R QR K P I R R K E G L I I Y F E DNA G V I C N P K G E MK G S A I A G P V A K E C A D L W P R V A - - - S A
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S I G DMV L C S V K QG K P A L R K K V MQA V V V R QR K P Y R R R E G Y Y I Y F E DNA G V I I N P K G E MK G S A I T G P V G K E A A D L W P K I A - - - S A
S A G DMV MA T V K K G K P E L R K K V M P A I V I R Q S R P WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S L G DMV L A T V K K G K P D L R K K V L NA I I C R Q S K A WR R H E G Y Y I Y F E DNA G V I V N P K G E MK G S A I T G P V A R E C A E L W P K L S - - - S A
S L G DMV L A T V K K G K P D L R K K V L NA I I T R Q S K A WR R H E G Y F I Y F E DNA G V I V T P R -R MK G S A I T G P V A R E C A E L W P K L S - - - S A
S L G DMV L A T V K K G K P D L R K K V L NA I I T R Q S K A WR R H E G Y F I Y F E DNA G V I V N P K G E MK G S A I T G P V A R E C A E L W P K L S - - - S A
C V G DMV MA T V K K G K P D L R K K V M P A V I V R QR K P WR R K DG V F MY F E DNA G V I V N P K G E MK G S A I T G P I G K E C A D L W P R I A - - - S A
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S L G DMV MA T V K K G K P E L R K K V M P A I V V R QA K S WR R R DG V F L Y F E DNA G V I A N P K G E MK G S A I T G P V G K E C A D L W P R V A - - - S N
S C G DMV L A T V K K G K P D L R K K I M P A I V V R QR K A WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G L G DM I V A T V K K G K P E L R K K V M P A V V I R QR K P I R R R E G I V L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
G V G DMV MA T V K K G K P E L R K K V H P A V V I R QR K S Y R R K DG V F L Y F E DNA G V I V NNK G E MK G S A I T G P V A K E C A D L W P R I A - - - S N
S I G DM I L C S V K K G S P K L R K K V L QA I V I R QR R P WR R R DG V F I Y F E DNA G V I A N P K G E MK G S Q I T G P V A K E C A D I W P K V A - - - S N
S V G DMV L A T V K K G R P D L R K K V L P A V I V R QR K A WR R R E G Y F I Y F E DNA G V I V N P K G E MK G S A I NG P V A K E C A E L W P K I S - - -A A
S V G DMV L A T V K K G R P D L R K K V L P A V I V R QR K A WR R R E G Y F I Y F E DNA G V I V N P K G E MK G S A I NG P V A K E C A E L W P K I S - - -A A
S V S D L I V V T C K K G K P A L R K K V S MG V V V R QR A I WR R K DG V V I G F QDNA G V I I NDK G E MK G S A I T G P V A K E A A E L W P K V A - - - S V
A L G DMV MA S V K K G K P E L R R K V L NA V I I R QR K S WR R K DG T V I Y F E DNA G V I V N P K G E MK G S G I A G P V A K E A A E L W P K I S - - -T H
A L G D I V MA S V K K G K P E L R R K V L NA V I I R QR K S WR R K DG T V I Y F E DNA G V I V N P K G E MK G S G I A G P V A K E A A D L W P K I S - - - S H
A A G DMV V A S V K K G K P E L R K K V M P A V V V R QR K P WR R R DG V F L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D I W P R I A - - - S N
G A G DMV MA T V K K G K P E L R K K V M P A I V V R Q S K P WR R K DG V Y L Y F E DNA G V I V N P K G E MK G S A I T G P V A K E C A D L W P R I A - - - S N

1420

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1430 1440 1450 1460 1470 1480 1490

A G S I - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -MA R K A R I I DV V Y NA S NN E L I R T K
A NA I G - - I S R D S I HK K K R K Y E L G R Q P A NT K L - - S I R V R G G NV K WR A L R L DT G N F S WG S E A V T R K T R I L DV A Y NA S NN E L V R T Q
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NK K F R A L R I E T G N F S WA S E G V S R K T R I V G V V Y H P S NN E L V R T N
T S S -G - - I S R D S R HK K K R A F E K G R Q P A NT R I - -G V R T R G G NR K F R A L R L E S G N F S WG S E G I S R K T R V I V V A Y H P S NN E L V R T N
S G V V G - - I S R D S R HK K K R A F E K G R Q P S NT R I - -G V R T R G G NQK F R A L R L E S G N F S WG S E G I S R K T R V I V V A Y H P S NN E L V R T N
A S S I G - - I S R DHWHK K K R K Y E L G R P A A NT R L - -G V R S R G G NT K Y R A L R L DT G N F S WG S E C S T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R D S WHK K K R K F E L G R P A A NT K I - -G V R T R G G N L K Y R A L R L DNG N F S WA S E QT T R K T R I V DT MY NA T NN E L V R T K
A G S I G - - I S R D S WHK K K R K F E L G R P A A NT K I - -G V R T R G G N E K Y R A L R L D S G N F S WA S E QT T R K T R I V DT MY NA T NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NQK F R A L R V E T G N F S WG S E G V S R K T R I A G V V Y H P S NN E L V R T N
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NQK F R A L R I E T G N F S WA S E G V A K K T R I V G V V Y H P S NN E L V R T N
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
- - -MG - - I S QD S WHK K K R K F E L G R P A A NT K I - -G V R T R G G NT K F R G L R L DT G N F S WG S E A C A R K T R I I DV MY NA S NN E L L R T K
A G T V G - - I T R D S R HK K K R K F E L G R Q P A MT K L D - S V R T R G G NV K Y R A L R L D S G N F A WG S E S V T R K T R L I QV R Y NA T NN E L L R T Q
A P S I G - - I S R D S R HK K K R K Y E MG R P A S NT K L - -G V R C R G G NK K F R A L R L D S G NY S WG S QG V T R K A R I M E V V Y NA S NN E L V R T K
A P S I G - - I S R D S R HK K K R K Y E MG R P A S NT K L - -G V R C R G G NK K F R A L R L D S G NY S WG S QG I S R K A R I M E V V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G I R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G N E K F R A L R I E T G N F S WG S E G V A R K T R L A G V V Y H P S NN E L V R T N
A G T I G - - I S R DA L HK K K R K Y E L G R QA A K T K I - -C I R V R G G HQK F R A L R L DT G N F S WA T E K I T R K C R I L NV V Y NA T S ND L V R T N
A S S I G - - I S R D S A HK K K R K F E L G R P A A NT K L - -G V R T R G G NT K L R A L R L E T G N F A WA S E G V A R K T R I A DV V Y NA S NN E L V R T K
A S S I G - - I S R D S A HK K K R K F E L G R P A A NT K L - -G V R T R G G N S K L R A L R L E NG N F A WA S E G V A R K T R I A DV V Y NA S NN E L V R T K
A S S I G - - I NHR G DHK K K R NNR A G S Q P S S T K I - -G V R V R G G NR K Y K A L R L DMG H F K F I T T G K F R MA K L L QV V Y H P S S N E L V R T N
A P T I G - - I T R D S R HK K K K K NT MG R Q P A NT R L - -G V R C R Y G I I K R R A L R L E NG N F S WA S Q S I T K G T K I L NV V Y NA S DND F V R T N
A G S I G - - I S R DNWHK K K R K Y E L G R P P A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R QA A NT K I - -G V R T R G G NQK F R A L R I E T G N F S WA S E G V A R K T R I T G V V Y H P S NN E L V R T N
A P A I G - - I V R S R L HK K R MK A E L G R L P A NT R L - -G V R A R G G N F K I R A L R L DT G N F A WA S E A I A HR V R L L DV V Y NA T S N E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK QK R A W E A G R Q P A S T K I - -G V R V R G G NT K Y R A L R L D S G N F S WG S E G V T R K T R V I A V A Y H P S NN E L V R T N
A S I I - - -MR WQG S S R G K R K F E MG R E S A E T R I - - S V P T MG G NR K V R L L Q S NV A NV T N P K DG K T V T A P I E T V I DNT A NK HY V R R N
A G S I G - - I S R DNWHK K K R K Y E L G R P P A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R A F E A G R Q P A NT R I - -G V R T R G G NHK Y R A L R L D S G N F A WA S E G C T R K T R V I V V A Y H P S NN E L V R T N
A NA I G - - I S R D S MHK K K R K Y E L G R Q P A NT K L - - S V R V R G G N L K WR A L R L DT G NY S WG S E A V T R K T R I L DV V Y NA S NN E L V R T Q
A S S I G - - I S R D S L HK K K R K Y E L G R Q P A NT K L - - S V R C R G G N I K HR A L R L DT G N F A WG S E NC T R K T R I L DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S V G - - I S R D S R HK K K R A F E K G R QA A MT K L V S G I R V R G G N F K F R A L R L S E G N F S WG S QG I A K K A K I V E V V Y H P S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q S A NT K I - -G V R T R G G NQK F R A L R V E T G N F S WG S E G V S R K T R I A T V V Y H P S NN E L V R T N
A S A I G - - I S R DG R HK K K R K Y E L G R P P S NT K L - -G V R G R G R NY K Y R A I K L D S G S F S W P T F G I S K NT R I I DV V Y NA S NN E L V R T K
A S A I G - - I S R DG R HK K K R K Y E L G R P P S NT K L - -G V R G R G K N L K Y R A I K L D S G S F S W P A F G V S K I T R I I DV V Y NA S NN E L V R T K
A S A I G - - I S R DG R HK K K R K Y E L G R P P S NT K L - -G V R G R G R NY K Y R A I K L D S G S F S W P A F G I S K MT R I I DV V Y NA S NN E L V R T K
A NA I G - - I S R D S MHK K K R K Y E L G R Q P A S T K L - - S I R V R G G NV K WR A L R L DT G NY S WG S E A V T R K T R I L DV V Y NA S NN E L V R T Q
A G S I G - - I S R DNWHK K K R K Y E L G R P A A NT K I - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
S G V V G - - I S R D S R HK K K R K F E L G R Q P A NT K I - -G V R T R G G NK K Y R A L R I E T G N F S WA S E G I S K K T R I A G V V Y H P S NN E L V R T N
A G T V G - - I T R D S R HK K K R K F E L G R Q P S NT R I - -G V R V R G G NK K F R A L R L D S G N F S WG S E G V S K K T R I I QV A Y H P S NN E L V R T N
A S T I G G R I P DDT T R K A HY A L P L A R K K G A K L L - -G V R C MG G N I K R R A L R L DNG N F S WG S E HT T R K T R I I DV V Y NA S NN E L V R T K
A G S I G - - I S R DNWHK K K R K Y E L G R P P A NT K L - -G V R V R G G NK K Y R A L R L DV G N F S WG S E C C T R K T R I I DV V Y NA S NN E L V R T K
A G S V G - - I S R D S K HK K K R A F E K G R P I S MT K L - -T V R V R G G H L K F R A L R L C E G N F S WG S E N I T R K T K I L DV K Y NA T NN E L V R T K
A P S I G - - I S R D S R HK K K R K Y E L G R P S S NT K L - -G V R C R G G N L K F R A L R L D S G N F S WG S QNV T R K T R V MDV V Y NA S S N E L V R T K
A P S I G - - I S R D S R HK K K R K Y E L G R P S S NT K L - -G V R C R G G N L K F R A L R L D S G N F S WG S QNV T R K T R V MDV V Y NA S S N E L V R T K
A P A V G - - I T R MG D L K K K R N F L A G R P S A QT R I - -G V R V R G G N L K MR A L R L E T G T F A WA S E NC T R K T R I L NV T Y H P A DND L V R T N
A P A I G - - I V R S R L HK K R MK A E L G R L P A NT K L - -G V R A R G G N F K L R G L R L DT G N F A WG T E A S A QR A R I L DV V Y NA T S N E L V R T K
A P A I G - - I V R S R L HK K R MK A E L G R L P A HT K L - -G V R A R G G N F K L R G L R L DT G N F A WG T E A I A QR A R I L DV V Y NA T S N E L V R T K
A G T V G - - I T R D S R HK K K R A F E L G R QA A NT R I - -G V R V R G G N L K HR A L R L E S G N F A WG S E H I T A K T R V L G V V Y NA S NN E L V R T N
S G V V G - - I S R D S R HK K K R K F E C G R QG A V T R I - -G V R T R G G NK K F R A I R I E T G N F S WG S E G T T R K T R V L G V S F H P S NN E L I R T N

1500

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
I
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T

1510 1520 1530 1540 1550 1560

L V K NA I I V I DA S P F R QWY E S HY S K S N L R K Y V K R - - -QK NA K I D P A V E E Q F NA G R L L A C I S S R P G QV G R A DG Y I
L V K S A I V QV DA A P F K QG Y L QHY S NHV QR K L E MR - - -Q E G R A L D S H L E E Q F S S G R L L A C I A S R P G QC G R A DG Y I
L T K S A I V Q I DA T P F R QWY E S HY S R NA E R K WA A R - - -A G DA K I E G A V D S Q F S A G R L Y A C I S S R P G Q S G R C DG Y I
L T K S A V V Q I DA A P F R QWY E A HY S N S V V K K QA A R F - -A DHG K V E P A I E K Q F E S G R L Y A V I A S R P G Q S G R V DG Y I
L T K S A V V Q I DA A P F R QWY E A HY S N S V V K K QA A R F - -A E QG K V E S A V E R Q F E S G R L Y A V V S S R P G Q S G R V DG Y I
L V K NA I V V V DA T P F R QWY E S HY S QK T A R K Y L A R - - -QR L A K V E G A L E E Q F HT G R L L A C V A S R P G QC G R A DG Y I
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L V K G A I V S V DA A P F R QWY E A HY S NHT L K K Y T E R - - -QK T A A V DA L L T E Q F NT G R L L A R I S S S P G QV G QA NG Y I
L V K G A I I S V DA A P F R QWY E A HY S HHT MK K Y T E R - - -QK T A A V DA L L I E Q F NT G R L L A R I S S S P G QV G QA NG Y I
L T K S A V V Q I DA T P F R QWY E NHY S R K V E R K L A A R - - - S G A A A I E S A V D S Q F G S G R L Y A V I S S R P G Q S G R C DG Y I
L T K A A I V Q I DA T P F R QWY E A HY S K S A E R K WA A R - - -A A S A K V E S A V D S Q F S A G R L Y A C I S S R P G Q S G R C DG Y I
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L V K NA I I Q I D S T P F R QWY E A HY S K K T QK K Y E E R - - -K K E P K V A QA L E E Q F NQG R I L A C I S S R P G Q S G R C DG Y I
L V K G A V V D I DA T P F R QWY E S HY S NHV K R I L E E R - - -K K V A K I D P L L E QQ F R A G R L L A V I S S R P G Q S G R A DG Y I
L V K NA I V V I DA T P F R Q F Y L QR Y S G H L L A T R K A R - - - L MNNV I D P L V E E Q F G I G R L L A C V S S R P G QC G R C DG Y I
L V K NA I V V I DA T P F R Q F Y L QR Y S G H L L A T R K A R - - - L MNNV I D P L V E E Q F G I G R L L A C V S S R P G QC G R C DG Y I
L V K NC V V L V D S T P Y R QWY E S HY S K K V QK K F T L R - - -R K T A K I S P L L E E Q F L QG K L L A C I S S R P G QC G R A DG Y V
L T K A A I V Q I DA T P F K QW F E T HY S R K V E R K L A QR - - - S G A S N I E S A V E HQ F NA G R L Y A A I S S R P G Q S G R C DG Y I
L V K G S I V Q I DA T P Y K QWY E T HY S A S L L A K L A S R - - -A K G R V L D S A I E S Q I G E G R F F A R I T S R P G QV G K C DG Y I
L V K N S I V V I DA T P F R QWY E A HY S E K V MK K Y L E R - - -QK Y G K V E QA L E DQ F T S G R I L A C I S S R P G QC G R S DG Y I
L V K N S I V V I DA T P F R QWY E S HY S E K V MK K Y L E R - - -QK F G K V E QA L E DQ F T S G R I L A C I S S R P G QC G R S DG Y I
L T K S S V V K I S A E P F K ND I K - - - - - - - - - - - - - - - - -DV A R DV D P S L H E S F E K G H L Y A I I T S R P G QV G MA QG HV
L V K G A I I E I D P A P F R L W F L K F Y S K T MQK K Y A K K L E V L K NMK F D E A L L E G F Q S G R V L A C I S S R P G QT G S V E G Y I
L V K NC I V L V D S T P Y R QWY E A HY S K K I QK K Y D E R - - -K K NA K I A S I L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T K A A I V Q I DA T P F R QWY E S HY S K NT E R K WA A R - - -A A E A K I E HA V D S Q F G A G R L Y A A I S S R P G Q S G R C DG Y I
L V K NC I V A V DA A P F K R WY A K HY S P K L QR E WT R R - - -R R NHR V E K A I A DQ L R E G R V L A R I T S R P G Q S G R A DG I L
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T K S A V I Q I DA A P F R QWY E A HY S K S V E K K QA E R F - -A A R G K V D S A L E K Q F E A G R V F A V V S S R P G Q S G R C DG Y I
L T K G S V I R T S MG T - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -A R V T S R P G QDG V V NA V L
L V K NC I V L V D S T P Y R QWY E S HY S K K I QK K Y E E R - - -K K NA K I S P L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T K S A V V Q I DA A P F R QWY E A HY S K S V E K K QA E R F - -A A A G K V D P A L E K Q F E A G R L Y A V I S S R P G Q S G R C DG Y I
L V K S A I V QV DA A P F K QWY L T HY S NHV V R K L E K R - - -QQT R T L D S H I E E Q F G S G R L L A C I S S R P G QC G R A DG Y I
L V K S A V I A V DA A P F R A WY A QHY S K S V T MK L R S R - - -NQK H E V A K A I D E Q F A T G R L L A I I T S R P G QC G R A DG Y V
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T R G V I V QV DA T P F R QWY A K K Y S R S L I K K L E QR - - -A K DNA I DA L V Q E Q F T NQR L L V R I T S R P G Q S G R A DG Y I
L T K A A I V Q I DA T P F R QWY E NHY S R K V E R K L A S R - - -A G QA A I E S A V DA Q F G S G K L Y A A I S S R P G Q S G R C DG Y I
L V K NC I V V I D S H P F T T WY E NT F T Y G V I K K I - - - - - -G K S K N I D P L L L E Q F K QG R V L A C I S S R P G QC G K A DG Y I
L V K NC I V L I D S H P F T T WY E NT F S Y S V I K K I - - - - - -G K S K Q I D P A L L E Q F K QG R V L A C I S S R P G QC G K A DG Y I
L V K NC I V L I D S H P F T A WY E NT F S Y S V I K K I - - - - - -G K A K Q I D P A L L E Q F K QG R V L A C I S S R P G QC G K A DG Y I
L V K S A I V QV DA A P F K QWY L QHY S NHV I R K L E K R - - -QQV R K L D P H I E E Q F G S G R L L A S I S S R P G QC G R A DG Y I
L V K NC I V L I D S T P Y R QWY E S HY S K K I QK K Y D E R - - -K K NA K I S S L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y V
L T K A A I V Q I DA T P F R QW F E A HY S K NA E R K WA A R - - -A A S A K I E S S V E S Q F S A G R L Y A C I S S R P G Q S G R C DG Y I
L T K S A I V Q I DA A P F R V WY E T HY S K HV QR K H S A R - - - L G D S K V D S A L E T Q F A A G R L Y A V V S S R P G Q S G R C DG Y I
L V K NA I V Q I D S T P F R QWY E A HY S K K V V K K F E E R - - -K K T A K V A QA L E E Q F G T G R L L A C I A S R P G QC G R A DG Y I
L V K NC I I L V D S L P F R QWY E A HY S K K T QK K Y D E R - - -K K T A K I S T L L E E Q F QQG K L L A C I A S R P G QC G R A DG Y I
L V K N S I V E I D S T P F R E WY K L HY S R HV QK R V -K R - - -T K A QA L E K N I E E Q F V S QR I L A C I T S R P G Q S G R A DG Y I
L V K NA I V T V D P T P F K L W F K T HY S E K V - - - - - - - - - - - -A G L V P K T L L E Q F S S G R L L A C I S S R P G QC G R C DG Y V
L V K NA I V T V D P T P F K L W F K T HY S E K V - - - - - - - - - - - -A A L V P R T L L DQ F S S G R L L A C I S S R P G QC G R C DG Y V
L A R G S V V S I DA A P F K QWY E R Q F T DK MT QR WA A N - - -K DG G V V A P E L V A E F DQG R L L A V I T S R P G QC G R A DG Y I
L V K NC I V V V DA A P F R L WY A K HY S S K L K R K W E Y R - - -R K HHK I E K A L A DQ L R E G R L L A R I T S R P G QT G R A DG A L
L V K NC I V V V DA A P F K L WY A K HY S D E L K R K WM L R - - -R E NHK I E K A V A DQ L K E G R L L A R I T S R P G QT A R A DG A L
L V K G C I V QV DA T P F R QA Y E K HY S NNV T R K L E NR - - -R K E G K L D S L V E QQ F G A G R L Y A A V S S R P G Q S G R C DG Y I
L T K S A I V Q I DA T P F R QWY E S Y Y A E A DQA A V A A R - - -QA DA K L D P A V E A Q F G A G R L Y A C V S S R P G Q S G R V DG Y V

1570
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EG E E LA FY
L EG E E LA FY
L EGK E L E FY
L EGK E L E FY
L EGK E LD FY
L EGK E LD FY
L EG E E LA FY
L EG E E LA FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EAK E L E FY
L EGK E L E FY
L EGK E L E FY
L QG D E L K F Y
L EGK E LD FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EGA E LQ FY
L EGK E L E FY
L EG E E LA FY
I E------
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
I EGD E L L FY
I EGD E L L FY
I EGD E L L FY
L EGK E L E FY
L EGK E L E FY
L EG E E LA FY
L EG E E LH FY
L EGK E LD FY
L EGK E L E FY
L EGK E L E FY
L EG E E LN FY
L EG E E LN FY
L EG E E LA FY
L EGA E LQ FY
L EGA E LQ FY
L EGK E L E FY
L EG E E LA FY

1580

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1590 1600 1610 1620 1630 1640 1650

L K K I K NK K MV L A D L G R K I T NA L H S L S K A T I I N E E -V L D S M L K E I C T A L L E A DV N I R L V K K L R E NV R Q S A V F K E L V K L V I M F V G
MK K L QK K K MV L A E L G G R I T R A I QQM S NV T I I D E K -A L N E C L N E I T R A L L Q S DV S F P L V K E MQ S N I K E QA I F S E L C K MV V M F V G
V R R L T A K K MV L A D L G K R I NA A V A QA L NNDT DDY V A G V E T M L K A I V T A L L E NDV N I K L V S S V R S N I K QK T V F E E L C A L V V M F V G
QR A I R K - -MV L QD L G R R I NA A V ND L T R S S N L D E K -A F DDM L K E I C A A L L S A DV NV R L V QT L R K S I K QK A V F D E L V A L V I M F V G
QR A I R K - -MV L QD L G R R I NA A V ND L T R S NN L D E K QA F DDM I K E I C A A L L S A DV NV R L V Q S L R K S I K QK A V F D E L V S L V I M F V G
LR K I K SK R - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R K I R A K K MV L A D L G R K I R NA I G K L G QNT V I N E E - E L D L M L K E V C T A L I E S DV H I R L V K Q L K DNV K QK T V F N E L L K L V F M F V G
L R K I R A K K MV L A D L G R K I R NA I G K L G Q S T V I N E G - E L D L M L K E V C T A L I E S DV H I R L V K Q L K DNV K QK T V F N E L L K L V F M F V G
L R R L T A K K MV L A D L G S R L R G A L S S V E S G S - - -DD - E I QQM I K D I C S A L L E S DV NV K L V A K L R G N I K QK I I F D E L C A L V I M F V G
L R R L T A K K MV L A D L G K R I NNA V N S A L S NT E DDY V N S I DG M L K G I S T A L L E A DV N I M L V S K V R NN I R QK T V F D E L C G L I I M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
QR K I K A R K MV L A D L G R K I NNA L R S L S NA T I I N E E -V L Q S M L S E I C R A L L E S DV N I R L V K K L R E NV R Q S A V F R E L V K L V I M F V G
HHK L Q I R K MV L A D L G T R L HG A WNQ L S K A S V I DDK -V I DG V L K E L C A A L L E S DV NV K L V A S L R T K V K QK A V F D E L V A L V L MA V G
K K K L EK K K - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - K K K L E K K K MV L A D L G G Q L A S A I R K F Q S S T I A D E A -A I D L C L K E I A T A L L K A DV NV K L V A Q L R NN I K Q S A V V E E L V N I V I V F V G
L R K I K A K K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C A A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R R L T A K K MV L A D L G S R L R G A L S NV E S G S - - - E T - E I Q S M I K D I C NA L L E S DV N I K L V A K L R DN I K QK I V F D E L C S L V I M F V G
QR R L QK K K MV L A D L G NQ L S S A L R S L N E T T I V N E D -T I NQ L L K E V G NA L S K S DV S M S L I I QMR K N I K K QV V F D E L I R L I V M F V G
L K K I K S K K MV L A D L G R K I T T A L H S L S K A T V I N E E -A L N S M L K E I C A A L L E A DV N I R L V K Q L R E NV R Q S A V F K E L V K L V V M F V G
L K K I K S K K MV L A D L G R K I T T A L H S L S K A T V I N E E -A L N S M L K E I C A A L L E A DV N I R L V K Q L R E NV R Q S A V F K E L I K L I I M F V G
A DK F NK K S -M I T E L G R S I T NT L S N L L S S P A T DQH - - I E T A I R E I C N S L I L S NV N P R Y V S D L R D E L R QNA V Y E R L V D L V V V F V G
S K K I S DK K MV L S Q L G S S L V T A L R K MT S S T V V D E E -V I NT L L K E I E T S L L G E DV N P I F I R QMV NN I K K D S V F E E L I N L V L MMV G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L I I M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R R L T A K K MV L A D L G K R I NNA V T NA I S N E QT DY E T T V Q S M L K E I A T A L L E NDV N I R L V S R L R E N I K QK T V F D E L C N L V I M F V G
L K R L E K K K MV L A E L G QK I G QA I HR M S A K S M L G E D -DV K E L MN E I A R A L L QA DV NV T I V K K L QV S I R QNA V F NG L K R I V V M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
QR K L HK - -MV L QD L G R R I NA A V S D L T R A P N L D E K -A - - - - - -K I C A A L L E A DV NV R L V G Q L R K S I K QK A V F D E L V S L V I M F V G
- - - - - - - -MV M E K L G D S L QG A L K K L I G A G R I D E R -T V N E V V K D I QR A L L QA DV NV K L V MG M S QR I K I R I V Y Q E L M E I T I MMV G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
QR K L HK - -MV L QD L G R R I NA A V S D L T R A P N L D E K -A F DG M L K E I C S A L L E A DV NV R L V G Q L R K S I K QK A V F D E L V R L V I M F V G
MK K L QR K K MV L A Q L G G S I S R A L A QM S NA T V I D E K -V L S DC L N E I S R A L L Q S DV Q F K MV R DMQ S N I K QQA V F T E L C NMV V M F V G
QK K MMK K K MV L ND L G NK I A S A L R S L NA HV V V D E E - L L DA C L K D I T NA L L A S DV A V P L V V R MK K N I V E R A V F K E L T A L V V M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
I K K V E QK K MV L A E L G K S I NA A L QK L S K A P V V D E A - L V DQ I L G E I A MA L L K A DV NA K F I K K L R E DV K QK A V V DG L T R MV I M F V G
L R R L T A K K MV L A D L G S R L R G A L S S V E S A S - - -D E - E I NQM I K DV C T A L L E S DV N I K L V V K L R DN I K QK I I Y D E L V G L I I M F V G
K R K MDK K K MV L T E L G T Q I T NA F R K L QT S T L A DDV -V I E E C L K E I I R A L I L S D I NV S Y L K D I K S N I K QK Y V V E E L I K L V I L F V G
K R K MDK K K MV L T E L G T Q L T S A L QK L QA S A V A DD S -A I E E C L K E V I R A L I L A D I N I S Y L K D I K S N I K QQY V V E E L I N L V I L F V G
K R K MDK K K MV L T E L G A Q L T S A L QK I QA A P V A DDN -V I E E C L K E I V R A L I L A D I NV I Y L K D I K S N I K QK Y V V E E L I K L V I L F V G
MK K I QR K K MV L A Q L G G S I S R A I QQM S NA T I I D E K -A L NDC L N E I T R A L L Q S DV Q F K L V R DMQT N I K QQA I F N E L C K I V V M F V G
L R K I K A R K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C T A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
L R R L T A K K MV L A D L G K R I N S A V NNA I S NT QDD F T T S V DV M L K G I V T A L L E S DV N I A L V S K L R NN I R QK T V F D E L C K L I I M F V G
L R R MA P K K MV F A D L G R R L N S A L G D F S K A T S V N E E - L V DT L L K N I C T A L L E T DV NV R L V Q E L R S N I K QK A V F D E L C S L V I MMV G
MR K MR A K K MV L A D L G R K I T S A L K S L S NA T I I D E D -V L N S M L N E I C R A L L E A DV N I R L V K A L K E NV K QT A V F K E L V K L I I M F V G
L R K I K A K K MV L A D L G R K I T S A L R S L S NA T I I N E E -V L NA M L K E V C A A L L E A DV N I K L V K Q L R E NV K QHA V F K E L V K L V I M F V G
I R K L Q S K K MV L A D L G K R I NNA L QQ L NK A P V I D E E - L L NQV L K E I Q L A L L Q S DV NV K Y V A K L K S N I I QQA V V Q E L T QMV V M F V G
R R R MDK K K MV L A E L S NQ I T QA F R K L H S T T V I S E A -V I E E V I G D I V R A L L MA DV NV K L V HK L K E NV K QK I V V D E L V NMV I M F V G
R R R MDK K K MV L A E L S NQ I T K A F R K L H S T T V I S E A -V I E E V I G D I V R A L L MA DV NV K L V HK L K E NV K QK I V V D E L V NMV I M F V G
S DK I A K K K -M L QD L G E K L MG S I K K L S E S K T I D E K -V Y V T F MA E V A K S L I A A DC S K E I V F D F S R R L K E K A V F N E L V K L I F MMV G
L K K L DK K K MV L A E L G QK I G A A I S K M S S K S F V G E D -DV K E F L N E V A R A L L QA DV NV K T V K E L QQNV R QT A V F NG I K K M I V M F V G
L K K L E K K K MV L A E L G QK I G G A I S K M S S K P L L G E D -DV K E F L N E V A R A L L QA DV HV T T V K E L QQT I R QT A V F S G L R K I I V M F V G
V R R L K A S K MV L S D L G R R I N S A F QD L S K V P T V DA A - S I DQ L L K S V C NA L I E A DV NV K L V A N L R S QV K QK A V F DH L V A L V I M F V G
L K K I V S K K MV L E D L G K R I NG A F A N L S K G G D I D E - -A L DA M L K E V C S A L L E S DV N I K L V S Q L R QK V K QK A L F D E L V N L V V M F V G

1670


1680 1690 1700 1710 1720 1730 1740

L QG S G K T T T C T K L A Y HY QK K NWK S C L V C A DT F R A G A Y DQ I K QNA T K A R I P F Y G S Y T E V D P V T I A QDG V E M F K K E G F E F I I V DT


L QG A G K T T T C T K Y A Y Y HQK K G Y K P A L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E S D P V K I A V E G V DT F K K E NC D L I I V DT
L QG A G K S T S C S K L A V Y Y S K R G F K V G L V C A DT F R A G A F DQ L K QNA I K A K I P F Y G S Y T E T N P V R V A A DG V A K F K K E R F E I I I V DT
L QG A G K T T T C T K L A R HY QMR G F K T A L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T QT D P A V V A A E G V A K F K K E R F E V I I V DT
L QG A G K T T T C T K L A R HY QMR G F K T A L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T QT D P A I V A A E G V A K F K K E R F E I I I V DT
----------------------------------------------------------------------------------
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C S K MA Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y S E I D P V K I A A E G V E K F T K E G F E I I I V DT
L QG S G K T T T C T K MA Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y S E I D P V K I A A E G V E K F T Q E G F E I I I V DT
L QG A G K T T S C T K L A V Y Y K K R G F K V G L V C A DT F R A G A F DQ L K QNA I K A N I P Y Y G S Y L E P D P V K I A F E G V QK F K Q E K F D I I I V DT
L QG S G K T T S C T K L A V Y Y S K R G F K V G L V C A DT F R A G A F DQ L K QNA V K A R I P F Y G S Y T E T D P V K V A G DG I A K F K K E K F DV I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C T K L A Y Y Y QR K NWK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E A D P V V I A S E G V E T F K E E N F E I I I V DT
I QG A G K T T T C T K L A V HY QR R G F R T C L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E T D P V A I A S L G V E K F R K E R F DV I I V DT
----------------------------------------------------------------------------------
L QG S G K T T T C T K F A NY Y QR R G WK T A L V C A DT F R A G A F DQ L K QNA T K V K I P F Y G S Y T E T D P V K I A R DG V R E F R K E G Y D L I I V DT
L QG S G K T T T C S K L A Y Y F QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A A E G V E K F K S E N F E I I I V DT
L QG A G K T T S C T K L A V Y Y K K R G Y K V G L V C A DT F R A G A F DQ L K QNA I K S S I P Y Y G S Y I E T D P V K V A Y E G V V K F K Q E K F D I I I V DT
L QG A G K T T S V T K L A Y F Y K K K G F S T A I V C A DT F R A G A F DQV R HNA A K A K I HY Y G S E T E K D P V V V A R T G V D I F K K DG T E I I I V DT
L QG S G K T T T C T K L A Y HY QK R NWK S C L V C A DT F R A G A Y DQV K QNA T K A R I P F Y G S Y T E I D P V V I A QDG V DM F K R E G F E M I I V DT
L QG S G K T T T C T K L A Y HY QK R NWK S C L V C A DT F R A G A Y DQ I K QNA T K A R I P F Y G S Y T E I D P V V I A Q E G V DM F K R E G F E M I I V DT
L QG S G K T T S I C K Y A N F Y K K K G Y K V G I V C A DT F R A G A F DQV R QNA L K I K V P F F G S - S E A D P V K V A S A G V E R F R K E R F E L I L V DT
L QG A G K T T T I T K L A L Y Y K NR G Y K P A V V G A DT F R A G A Y E Q L QMNA K R A G V P F F G I K E E S D P V K V A S E G V R T F R K E K ND I I L V DT
L QG S G K T T T C S K L A Y F Y QR K G WK T C L I C A DT Y R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T S C T K L A V Y Y S K R G Y K V G L V C A DT F R A G A F DQ L K QNA I K A K I P F Y G S Y T E P N P V K V A K DG V DK F K K E K F E I I I V DT
L QG S G K T T S C T K Y A A Y F QR K G F K T A L V C A DT F R A G A Y DQ L R QNA T K A K V R F Y G S L T E A D P V A I A K E G V A E L K K E K Y D L I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG A G K T T T C T K L A R HY Q S R G F K A C L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T E T D P A V V A R E G V DK F K K E R F E V I I V DT
L QG S G K T T S A A K L A R Y F QR K G L K A G V V A A DT F R P G A Y HQ L K T L A E K L NV G F Y G E E G N P DA V E I T K NG L K A L - - E K Y D I R I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG A G K T T T C T K L A R HY Q S R G F R V G L V C A DT F R A G A F DQ L K QNA T K A K I P Y Y G S L T E T D P V V V A R DG V DK F K K E K F E I I I V DT
L QG S G K T T T C T K Y A Y Y HQR K G F K P A L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y M E S D P V K I A V E G V E R F K K E NC D L I I V DT
L QG A G K T T T C T K F A HY Y A K K G F K P S L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E S D P A T I A A A G V K R F E E E K S D L I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T T C T K Y A Y Y Y QK K G WK V A L V C A DT F R A G A F DQ L K QNA T K V R V P F Y G S Y T E A D P V Q I A Q E G V NV F K K E A F E I I I V DT
L QG A G K T T S C T K L A V Y Y K K R G F K V G L V C A DT F R A G A F DQ L K QNA I K A S I P Y Y G S Y L E QD P V K I A Y E G V T K F R S E K F D I I I V DT
L QG S G K T T T C T K Y A HY Y QK K G F K T A L V C A DT F R A G A F DQ L K QNA A K V K I P F Y G S Y S E V D P V K I A T DG V NA F L K DK Y D L I I V D S
L QG S G K T T T C T K F A HY Y QK K G F K T A L V C A DT F R A G A F DQ L K QNA A K V K I P F Y G S Y S E V D P V K I A S DG V NA F L K E K Y D L I I V D S
L QG S G K T T T C T K Y A HY Y QK K G F K T A L I C A DT F R A G A F DQ L K QNA A K V K I P F Y G S Y S E V D P V K I A T DG V NT F L K DK Y D L I I V D S
L QG S G K T T T C T K Y A Y Y HQK K G WK P A L V C A DT F R A G A F DQ L K QNA T K A K I P F Y G S Y T E S D P V K I A E E G V E T F K K E NC D L I I V DT
L QG S G K T T T C S K L A Y F Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V I I A S E G V E K F K N E N F E I I I V DT
L QG S G K T T S C T K L A V Y Y S K R G F K V G L V C A DT F R A G A F DQ L K QNA I R A R I P F Y G S Y T E T D P A K V A E E G I NK F K K E K F D I I I V DT
L QG S G K T T T C S K L A L HY QR R G L K S C L V A A DT F R A G A F DQ L K QNA I K A R V P Y F G S Y T E T D P V V I A K E G V DK F K NDR F DV I I V DT
L QG S G K T T T C T K L A Y Y Y QK K G WK V A L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E V D P V I I A A DG V E K F K K E N F E I I I V DT
L QG S G K T T T C S K L A Y Y Y QR K G WK T C L I C A DT F R A G A F DQ L K QNA T K A R I P F Y G S Y T E MD P V V I A T E G V DK F K M E N F E I I I V DT
L QG A G K T T T C T K Y A Y HWQK K G WR T A L I C A DT F R A G A F DQ L K QNA T K V R V P F Y G S Y S E T D P V A I A E E G V K H F K K E NY E M I I V DT
L QG A G K T T S C T K F A Y HY QR K G WR T A L I C A DT F R A G A F DQ L K QNA A K V K I S F Y G S Y S E A N P A K V A A DG V A R F K E E K Y DM I I V DT
L QG A G K T T S C T K F A Y HY QR K G WR T A L I C A DT F R A G A F DQ L K QNA A K V K I S F Y G S Y S E A N P A K V A A DG V A R F K E E K Y DM I I V DT
L QG A G K T T T V T K L A N F Y K R R NWR T G V I A A DT F R A G A R E Q L MQNA QT A R I P Y F V D F T E QD P V QA A L K G I E K F R K DK Y E I V I I DT
L QG S G K T T S C T K Y A A Y F QR K G L K T G L V C A DT F R A G A Y DQ L R QNA T K A K V R F Y G S L T E A D P V I I A K E G V L E L K K E K Y D L I I V DT
L QG S G K T T S C T K Y A A Y F QR K G L K T A L V C A DT F R A G A Y DQ L R QNA T K A K I R F Y G S L T E A D P V I I A K E G V A E L E K E K Y D L I I I DT
L QG S G K T T S C T K L A L Y Y QK R G F K T G L V C A DT F R A G A F DQ L K QNA S K I NV P F Y G S Y T E T D P V A I S A A G V A S F K QNR F E V I I V DT
L QG S G K T T S C T K L A V Y Y QR R G F K V G L V C A DT F R A G A F DQ L K QNA T K A K I P F F G S Y T E T D P V A V A A E G V A K F K K E K F E I I I V DT

1750


1760 1770 1780 1790

S G R HK Q E E S L F E E M L A V A NA V N P DN I I F V MDA T I G QA C E A QA K A F K E K V D I G S V
S G R HK Q E A S L F E E MR QV A E A T K P D L V I F V MD S S I G QA A F DQA QA F K Q S V A V G A V
S G R HHQ E DA L F Q E MV E I A Q E V K P NQT I MV L DA S I G QA A E QQ S R A F K E A A D F G A I
S G R HK Q E E E L F T E MT Q I QNA V T P DQT I L V L D S T I G QA A E A Q S A A F K A T A N F G A I
S G R HK Q E E E L F T E MT Q I QT A V T P DQT I L V L D S T I G QA A E A Q S S A F K A T A D F G A I
- - - - - - - - - - - - - - - - - - - - - - P DN I I F V MDA T I G QA C E A QA R A F K DK V D I G S V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E A S L F E E M L QV S NA V T P DNV V F V MDA S I G QA C E A QA R A F S QT V DV A S V
S G R HK Q E A S L F E E M L QV S NA V T P DNV V F V MDA S I G QA C E A QA R A F S QT V DV A S V
S G R HR Q E E Q L F T E MV Q I G E A V Q P T QT I MV MDG S I G QA A E S QA R A F K E S S N F G S I
S G R HHQ E E E L F H E MV Q I S NV I K P NQT I MV L DA S I G QA A E QQ S K A F K E S S D F G A I
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E D S L F E E M L QV Y NA V A P DNV I F V MDA S I G QA C E G QA R A F K E K V DV A S V
S G R HK Q E S E L F E E MV A I G A A V K P DMT L MV L DA S I G QA A E G Q S R A F K D S A D F G A I
- - - - - - - - - - - - -M E QV V M E T N P DDV V F V MD S H I G QA C Y DQA MA F C NA V DV G S V
S G R HK Q E S S L F V E M E QV V M E T N P DDV V F V MD S H I G QA C Y DQA MA F C NA V DV G S V
S G R HK Q E D S L F E E M L QV S NA V Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E Q S L F N E M I Q I S E M I V P T QT I MV MDG S I G QA A E S QA K A F K E S S Q F G S I
S G R HK QD S E L F E E MK Q I E T A V K P DNC I F V MD S S I G QA A Y E QA T A F R S S V K V G S I
S G R HK Q E E S L F E E M L A V S NA V S P DN I I F V MDA T I G QA C E A QA K A F K DK V D I G S V
S G R HK Q E E S L F E E M L A V A NA V N P DN I I F V MDA T I G QA C E A QA K A F K DK V D I G S V
S G R HT Q E T E L F T E MK D I I R E I S P S S I V F V MDA G I G Q S A E DQA MG F K R A V DV G S I
S G R HK QDK E L F K E MQ S V R DA I K P D S I I F V MDG A I G QA A F G QA K A F K DA V E V G S V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HQQ E D S L F Q E MV E I S QA V K P K QT I MV L DA S I G QA A E HQ S K A F K E S A D F G S I
S G R HK Q E S A L F E E MK QV E E A V K P ND I V F V M S A T DG QA V E E QA R N F K E MV A V G S V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HR Q E S A L F Q E MMD I QK A V K P D E T I MV L DA S I G QQA E A QA K A F K E A A D F G A I
A G R HA L E A D L I E E M E R I HA V A K P DHK F MV L DA G I G QQA S QQA HA F ND S V G I T G V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E D S L F E E M L QV S NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HR Q E E A L F Q E MMD I QT A V K P D E T I MV L DA S I G QQA E A QA K A F K E A A D F G A I
S G R HK Q E A A L F E E MR QV S E A T K P D L V I F V MD S S I G QA A F DQA QA F K Q S V S V G A V
S G R HK Q E E A L F E E MR E I A S V T E P T MT I F V MD S S I G Q S A S DQA K A F A S T V DV G G V
S G R HK Q E D S L F E E M L QV A NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HK Q E ND L F E E MK QV E A A V K P DD I V F V MD S S I G QA C F DQA L A F K K A V NV G S V
S G R HR Q E HQ L F Q E MV Q I G E M I Q P T QT I MV MDG S I G QA A E S QA K A F K E S S N F G S I
S G R HK Q E N E L F E E M I QV E N S I Q P E E I I F V I D S H I G Q S C HDQA MA F K N S V S L G S I
S G R HK Q E S E L F E E MK QV E S S I N P E E I V F V I D S H I G Q S C HDQA MA F K N S V T L G S I
S G R HK Q E ND L F E E MK QV E N S I K P E E I V F V I D S H I G Q S C HDQA MA F K N S V K V G S I
S G R HK Q E A A L F E E MR QV S E A T K P D L I I F V MD S S I G QA A F DQA QA F K QMV A V G A V
S G R HK Q E D S L F E E M L QV S NA I Q P DN I V Y V MDA S I G QA C E A QA K A F K DK V DV A S V
S G R HHQ E E E L F Q E M I E I S NV I K P NQT I MV L DA S I G QA A E QQ S K A F K E S S D F G A I
S G R HQQ E Q E L F A E MV E I S DA I R P DQT I M I L DA S I G QA A E S Q S K A F K E T A D F G A V
S G R HK Q E D S L F E E M L QV A NV T S P DN I I F V MDA S I G QA C E S QA K A F K E K V DV A S V
S G R HK Q E D S L F E E M L QV S NA V Q P DN I V Y V MDA S I G QA C E S QA K A F K DK V DV A S V
S G R HK Q E S E L F D E MK QV QA A V N P D E C I F V MDG S I G QA C Y DQA QA F R NA V NV G S V
S G R HK Q E DA L F D E MK L I Y DA V Q P D E V V F V MD S H I G QA C Y DQA S A F NK A V DV G S V
S G R HK Q E DA L F D E MK L I Y DA V Q P D E V V F V MD S H I G QA C Y DQA A A F NK A V DV G S V
S G R HMQ E E A L F A E MK A L A A A V N P H E I I F V MDG T I G QA A Y DQA L G F K NA V G V G S I
S G R HK Q E S A L F E E MK QV QQA V K P ND I V F V M S A T DG QG I E E QA R Q F K E K V P I G S V
S G R HK Q E S A L F E E MK QV Q E A V K P ND I V F V M S A T DG QG I R E QA R Q F K E K V P V G S V
S G R HK Q E Q E L F D E MR E I DT A V T P D L T I MV L DA N I G QA A E A Q S R A F K QA A G Y G A I
S G R HR Q E S E L F T E MV D I G A A V K P D S T I MV L DA S I G QA A E P Q S R A F K DA S D F G S I

1800
I (this alignment column is I in all 55 sequences)

1810

I T K L DG HA K G G G A
I T K MDG HA K G G G A
L T K MDG HA K G G G A
I T K T DG HA A G G G A
I T K T DG HA A G G G A
I T K L DG HA K G G G A
V T K L DG HA K G G G A
I T K L D S HA K G G G A
I T K L D S HA K G G G A
L T K MDG HA K G G G A
L T K MDG HA K G G G A
V T K L DG HA K G G G A
V T K L DG HA K G G G A
V T K L DG HA K G G G A
I T K L DG HA K G G G A
I T K L DG HA K G G G A
V T K L DG HA K G G G A
L T K MDG HA R G G G A
I T K MDG N S MG G G A
I T K L DG HA K G G G A
I T K L DG HA K G G G A
L T K I DG T T K A G G A
I T K L DG H S NG G G A
V T K L DG HA K G G G A
V T K L DG HA K G G G A
L T K MDG HA R G G G A
V T K L DC QT K G G G A
V T K L DG HA K G G G A
I T K T DG HA S G G G A
I T K L DG T A K G G G A
V T K L DG HA K G G G A
V T K L DG HA K G G G A
I T K T DG HA A G G G A
V T K MDG HA K G G G A
MT K L DG HA K G G G A
V T K L DG HA K G G G A
I T K L DG HA K G G G A
L T K MDG HA K G G G A
I T K I DG HA K G G G A
I T K I DG HA K G G G A
I T K I DG HA K G G G A
I T K MDG HA K G G G A
V T K L DG HA K G G G A
L T K MDG HA R G G G A
I T K L DG HA K G G G A
I T K L DG HA K G G G A
V T K L DG HA K G G G A
I T K L DG HA K G G G A
I T K L DG HA K G G G A
I T K L DG HA K G G G A
I T K L D S NA K G G G A
V T K L DG QA K G G G A
I T K L DG HA K G G G A
V T K L DG HA K G G G A
L T K MDG HA K G G G A

1820

L SAV AAT N S P I I F I G
L SAV AAT K S PV I F I G
I SAV AAT K T PV I F I G
I S A V A A T HT P I I F L G
I S A V A A T HT P I I Y L G
L SAV AAT Q S P I I F I G
L SAV AAT K S P I I F I G
L SAV AV T K S PV I F I G
L SAV AV T K S PV I F I G
I SAV AAT K T P I V F I G
I S A V A A T NT P I A F I G
L SAV AAT K S P I I F I G
L SAV AAT Q S P I I F I G
I SAV AAT K T P I I F LG
L SAV AAT GA P I I F I G
L SAV AAT GA P I I F I G
L SAV AAT K S P I I F I G
I SAV AT T K T P I V F I G
I S A V A A T NT P I I F I G
L SAV AAT Q S P I I F I G
L SAV AAT Q S P I I F I G
I S SV AAT K C P I E FV G
L SAV AAT K S P I I F I G
L SAV AAT K S P I I F I G
L SAV AAT K S P I I F I G
I S A V A S T NT P I I F I G
L SAV AAT R S P I V F I G
L SAV AAT K S P I I F I G
I S A V A A T HT P I V F I G
L SAV S ET K A P I A F I G
L SAV AAT K S P I I F I G
L SAV AAT K S P I I F I G
I S A V A A T HT P I V F I G
L SAV AAT K S PV I F I G
I SAV S ET K A P I L F I G
L SAV AAT K S P I I F I G
L SAV AAT E S P I V F I G
I SAV AAT K T P I V F I G
L SAV AAT GC P I T F I G
L SAV A ST GC P I T F I G
L SAV SA I GC P I T F I G
L SAV AAT K S PV I F I G
L SAV AAT K S P I I F I G
I S A V A A T NT P I I F I G
L SAV AAT K T P I V F I G
L SAV AAT K S PV I F I G
L SAV AAT R S P I I F I G
L SAV AAT E S P I I F I G
L SAV SAT N S P I I F I G
L SAV SAT N S P I I F I G
L SAV AAT N S P I S F I G
L A A V A MT K S P I V F I G
L A A V A MT K S P I V F I G
I SAV AAT K T P I MF I G
I S A V A A T NT P I I F I G

1830

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

1840 1850 1860 1870 1880 1890 1900

T G E H I DD L E P F K T K P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E E L I DK I K HG Q - F T I R DMY E Q F QN I MK MG P F S Q I MG M I
T G E HMD E F E V F DV K P F V S R L L G MG DW S G F V DK L Q E V V P - - -K DQQ P E L L E K L S QG N - F T L R I MY DQ F QN I L NMG P L K E V F S M L
T G E HV HD F E K F S P K S F V S K L L G I G D I E S L L E Q F QT V S N - - -K E DT K A T M E N I QQG R - F T L L D F QK QMQT I MK MG P L S N L A S M I
T G E H L MD L E R F E P K A F V QK L L G MG DMA G L V E HV QA V T K - -D S A A A K E T Y K H I A E G I -Y T L R D F R E N I T S I MK MG P L S K L S G M I
T G E H L MD L E R F E P K A F I QK L L G MG DMA G L V E HV QA V T K - -D S A S A K E T Y K H I S E G I -Y T L R D F R E N I T S I MK MG P L S K L S G M I
T G E H I DD L E P F K T K P F I K K L L G MG D I E G L L DK V N E L K - - - - L DDND E L L E K L K HG Q - F T L R A MY E Q F QN I MK MG P F S Q I MG M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I DD F E I F K P K S F V QK L L G MG D I A G L V DMV ND I G - - - - I S DNK E L V G R L K QG Q - F T L R DMY E Q F QN I MK MG P F S Q I MV M I
T G E H I DD F E I F K P K S F V QK L L G MG D I A G L V DMV ND I G - - - - I QDNK E L V G R L K QG Q - F T L R DMY E Q F QN I MK MG P F S Q I MG M I
T G E HV G D L E I F K P T T F I S K L L G I G D I QG L I E HV Q S L N L -HQD E G HK QT I E H I K E G K - F T L R D F QNQMNN F L K MG P L T N I A S M I
T G E H I HD L E K F S P K S F I S K L L G I G D I E G L F E Q L K T V S N - - -K E DT K A T M E N I QQG K - F T L L D F K K QMQT I MK MG P L S N I A QM I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I DD F E M F K T Q P F V R K L L NMG D I E G L I DK V N E L K - - - - L DDN E E L L DK L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I M S M I
T G E H L ND L E R F A P Q P F I S K L L G MG DMQG L V E HMQDMA R -A N P DR QK D L A K K L E QG K - F T I R DWR E Q L S N I MNMG S I S K I A S M I
T G E H F D E F E P F E T K G F V S R L L G L G D I S G L MA K I N E V V P - - - L DR Q P DMV NR L V QG I - F T L R DMY E Q F QNM L NMG S P S A L L S M I
T G E H F D E F E P F E T K G F V S R L L G L G D I S G L MA K I N E V V P - - - L DR Q P DMV NR L V QG I - F T L R DMY E Q F QNM L NMG S P S A L L S M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E E L I DK L K HG Q - F T L R DMY E Q F QN I MK MG P F G Q I MG M I
T G E HA T D L E I F K P T S F I S K L L G I G D I Q S L I E HV Q S L N L -QDD E S HK K T I E N F K E G K - F T L R D F QT QMNN F MK MG P L T N I A S M I
T G E H L T D L E L F D P S T F V S K L L G Y G DMK G M L E K I K E V I P - - - - - E D S T S L K E I A QG K - F T L R S MQQQ F QQ I MQ L G P I DK L V QM I
T G E H I DD L E P F K T K P F V S K L L G MG D I E G L I DK V N E L K - - - - L DG ND E L L E K I K HG H - F T I R DMY E Q F QN I MK MG P F S Q F MNM I
T G E H I DD L E P F K T K P F V S K L L G MG D I E G L I DK V N E L K - - - - L DG N E E L L E K I K HG H - F T I R DMY E Q F QN I MK MG P F S Q I MNM I
T G E G MDD L E A F DA R R F V S R M L G MG DV E G L M E K V G S L G I - - - - -D E K E V V K K L R QG R - F T L G D F Y DQ F QK I L S L G P I S K L L E M I
T G E K V N E I E E F DA E S F V R K L L G MG D L K G I A K L A K D F A E - - -NA E Y K T MV K H L Q E G T - L T V R DWK E Q L S N L QK MG Q L G N I MQM I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I HD F E K F S P K S F V S K L L G I G D I E S L M E R F QT V S D - - -QDDA K NT L E N I QQG K - F T L L D F K NQMQT I MK MG P L S N I A NM I
T G E H F E D F D L F N P E R F V QK M L G MG D I G G L MDT MR DA N - - - - I DG N E E V Y K R L QDG L - F T MR DMY E H L QNV L K MG S V G K I M E M L
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E HM L D L E R F A P QQ F I S K L L G MG DMA G L V E HV Q S L K L - - - - -DQK DT I K H I T E G I - F T I R D L R DQ L QN I MK MG P L S K MA G M I
V G E T P E D F E K F E A DR F I S R L L G MG D L K S L M E K A E E S L S - - - - - E E DV NV E A L MQG R - F T L K DMY K Q L E A MNK MG P L K Q I M S M L
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E HM L D L E R F V P NN F I S K L L G MG DMA G L V E HV Q S L K L - - - - -DQK DT I K H I T E G I - F T I R D L R DQ L QN I MK MG P L S K MA G M I
T G E H I D E F E V F DV K P F V S R L L G MG DW S G F MDK I H E V V P - - -T DQQ P E L L QK L S E G T - F T L R L MY E Q F QN I L K MG P I G QV F S M L
T G E H I G E L E A F E T T S F V S K L L G MG D I K G L V E K MN E I V P - - - E E S A E K L M E A F G S G T - F T MR L L Y E Q F QN L QNMG P I S S I M S MV
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
E G E H F DD L E S F E A S S F V R R L L G L G D I NK L F Q S V K DV V N - - -MR DQ P Q L I QK L K E G K - F S I R D L QT Q F N S V L K L G S L NQ F M S A I
T G E H I G D L E I F K P T T F I S K L L G I G D I Q S L I E HV QG L N L -QND E NHK QT M E N I K E G K - F T L K D F QNQMNN F L K MG P L T N I A S M I
T G E HV ND F E K F E A K S F V S R L L G L G D I S G L V S T I K E V I D - - - I DK Q P E L MNR L S K G K - F V L R DMY DQ F QNV F K MG S L S K V M S M I
T G E H I ND F E K F E A K S F V S R L L G L G D I NG L V S T L K E V I D - - - I E K Q P Q L I NR L S K G K - F V L R DMY DQ F QNV F K MG S L S K V M S M I
T G E HV ND F E K F E A K S F V S R L L G L G D I DG L V S T L K E V I D - - - I E K Q P Q L I NR I A K G K - F V L R DMY DQ F QNV F K MG S L S K V M S M I
T G E HMD E F E V F DV K P F V S R L L G MG DW S G F MDK I H E V V P - - -MDQQ P E L L QK L S E G N - F T L R I MY E Q F QN L L K MG P I G QV F S M L
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DK V N E L K - - - - L DDN E A L I E K L K HG Q - F T L R DMY E Q F QN I MK MG P F S Q I L G M I
T G E H I HD L E K F S P K S F I S K L L G I G D I E S L F E Q L QT V S N - - -K E DA K A T M E N I QK G K - F T L L D F K K QMQT I MK MG P L S N I A QM I
T G E H I ND L E R F S P R S F I S K L L G L G D L E G L M E HV Q S L D F - - - - -DK K NMV K N L E QG K - F T V R D F R DQ L G N I MK L G P L S K MA S M I
T G E H I DD F E P F K T K P F I S K L L G MG D I E G L I DK V S E L N - - - - L DDN E E L I NK L K HG E - F T L R DMY E Q F QN I MK MG P F G Q I MG M I
T G E H I DD F E P F K T Q P F I S K L L G MG D I E G L I DR V ND L K - - - - L DDN E E L I DK L K HG Q - F T L R DMY E Q F QN I MK MG P F G Q I MG M I
T G E H F E D L E P F N P E S F V K R L L G L G D I K G M I T T V T E A V D - - -M E T QG K A I A N I T K G Q - F S I R D F QA QY K S I L K L G S I NQ F M S M I
T G E H F DD F E P F D P K S F I S R L L G F G D I NG L I NT L K DV I N - - - L E DK P D L L DR I A S A K - F T I R DMY DQ F QN L L K MA P I G K V M S M L
T G E H F DD F E P F D P K S F I S R L L G F G D I NG L I NT L K DV I N - - - L DDK P D L L DR I A S A K - F T I R DMY DQ F QN L L K MA P I G K V M S M L
S G E Q F T D L E W F D P N S F V S R L L G I QD P G V I QR T L E E I D - - - -K E A NK E I A E H I QK G Q - F S F R D L Y NQY K MV L DV G N F N S M L D S I
T G E H F DD F E L F Q P E S F V S R M L G MG DMR A L V D S MK DA N - - - - I DT D S E L Y K R F QDG Q - F T L R DMY E H L QNV L K MG S V S K I MDM I
T G E H F E D F E L F Q P E S F V S R M L G MG DMR A L MDT MK DA N - - - - I DT D S E L Y R R F QDG Q - F T MR DMY E H L QNV L K MG S V S K I MNM I
T G E HA A D L E P F R A Q P F I S K L L G MG D I S G L MDK M E E MQMNG G Q E R QQ E M L K K I G QG G I F S I R DWR E Q L S N I MG MG P L S K I A G M I
T G E H I HD L E A F S P K Q F I S K L L G I G D L QG L M E T MQ S L N L - - - - -DQK K T M E H I Q E G I - F T L A D L R DQMG NM L K MG S L S S I A G M I

1920


1930 1940 1950 1960 1970 1980 1990

- - - - - - - P -G F S QD F MT K G -G E Q E S MA R I K R L MT MMD S M S DG E L DR V T R V A QG S G V M E R E V R D L I G G MNG L QNMMDY R S QM E L
- - - - - - - P -G I S A E MM P K G -H E K E S QA K I K R Y MT MMD S MT ND E L DR MMR I A R G S G R QV R E V M E M L G G MG G L Q S L MD F T -D L E L
- - - - - - - P -G M S G -MM S G I - S E D E T S R K MK K MV Y V L D S M S R E E L E R M L R V A R G S G T S V F E V E M I L A A M P DMA DMMD F S -Y L K L
- - - - - - - P -G L S N - L T A G L -DD E DG S L K L R R MV Y I F D S MT A A E L DR MV R I A C G S G T T V R E V E D L L G G M -D L Q S MMD F S - S L A L
- - - - - - - P -G L S N - L T A G L -DD E DG S MK L R R M I Y I F D S MT A A E L DR MV R I A C G S G T T V R E V E D L L G G M -D L Q S MMD F S - S L S L
- - - - - - - P -G F S QD L M S K G - S E Q E S MA K F K L L MT I MD S I D E E - - - - - - - - - - - - - - - -R DV K D L I G G M S G L HNMMDY R S QM P L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G F G P E F MNK G -N E Q E S V NR L K R MMT V MD S M S DK E L DR V A R V A R G S G C F QQ E V R D L L G G MG G L QNMMDY R K DM P L
- - - - - - - P -G F G S E F MT K G -N E Q E S V NR L K R MMT V MD S M S DK E L DR V A R V A R G S G S HQQ E V R D L L G G MG G L QNMMDY R K DM P L
- - - - - - - P -G L S N - I M S QV -G D E E T S K K I K NM I Y I MD S MT I K E L E R I V R V A R G S G C A V V E V E M I L G G - - - L A G MMD F S -Y L K L
- - - - - - - P -G MG N -MM S QV -G E E E T S QK MK K MV Y V L D S MT K E E L E R L V R V A R G S G T S V F DV E M I L G G M P DMN E MMD F S -Y L K L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G F N -D F MT K G -H E K D S T E R L K R L MT I MD S MT D E E L DR V A R V A R G S G T L P K E V N E L L G G MNG L QNMM - - - - - - -
- - - - - - - P -G L P A -G I MDG -N E E E A S A K L K R L I F I T DA MR A D E L DR A K R V A K G S G T S L R E L E D L L G G M P DM S QMQD L S -G QT L
- - - - - - - P -G MG P N I L A K E -D E QA G I E R L K K F MV I MD S MT E S E L DR V E R I S R G S G T S NQDV Q E L L G G A QNMMNMMDY S S E L K L
- - - - - - - P -G MG P N I L A K E -D E QA G I E R L K K F MV I MD S MT E S E L DR V E R I S R G S G T S NQDV Q E L L G G A QNMMNMMDY S S E L K L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V A T R DV Q E L L G G MA G L Q S MMDY R L QM L L
- - - - - - - P -G M S N - I M S QV -G E E E T S S K I K NM I Y I MD S MT T K E L DR I I R V A R G A G C S A V E V E MV L G G M P DM S S MMD F S -Y L K L
- - - - - - - P -G MNQ - - L P Q L -QG N E G G L K L K A Y I N I L D S L S E K E L DR I I T I A QG S G R H P N E V V E L L G G M P S MG D L A DY S K R C I L
- - - - - - - P -G F S QD F MT K G -G E Q E S MA R V K R MMT MMD S M S DN E L DR C V R V A QG A G V M E R E V K E L I G G V G G I QNMMDY R S QMQ L
- - - - - - - P -G F S QD F MT K G -G E A E S MA R I K R MMT MMD S M S DN E L DR C T R V A QG A G V L E R E A K E L I G G V G G L QNMMDY R G QMQ L
- - - - - - - P -G F S G - - - - - - - L S L P D E DT F K K L I Y V F D S L S R G E L DR I MR V A R G S G T S V QG V V E I L G S M F - - - - - - P Y E - E L V L
- - - - - - - - -G L NH P M F QG G -N I E - - -K K F K V F MV I L D S MT DR E L DR I R R L A R G S G R D I R E V N E L F -NQA Q L QQ L MDY - -DMR L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G MG N -M L NQ F - S E E E T S K K MK T MV Y I F D S MT K K E L E R L V R V A K G S G T T V F D I E M L L G G M P NMQD I MD F S -Y L K L
- - - - - - - P -G M S G HA A T A G - - -QQG D I A L K G F I HM L D S MT V A E L DR MHR I A R G S G H S V V E V QN L I G G MG G L QG MM S L S G DA R T
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G M S N -MMQG M -DD E E G T G K L K R M I Y I C D S MT DK E L DR MT R V A R G S G T HV R E V E D L L G G M -DMA A MMDY S -Y L K L
- - - - - - - P MG MG G MK F S D E -M F QA T S DK MK NY K V I MD S MT E E E MT R I K R I S K G S G C S S E DV R E L L G G K F N I QK MM E I S F K - -
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R QQM P L
- - - - - - - P -G M S N -MMQNM -DD E E G S L K L K R M I Y I C D S MT DK E L DR MT R V A R G S G T T V R E V E D L L G G M -DMNA MMD F S -Y L N L
- - - - - - - P -G F S S E L M P K G -H E K E S QA K I K R Y MT MMD S MT DG E L DR I L R I A R G S G R P V R DV V DM L G G MG G L Q S L MD F T -K L E L
- - - - - - - P -G MA D -M I P K G -G E E QG T K R MK S MMV L MD S MT DA E L DR M E R V C R G A G R L P S E M I T L L G G QQG MMNMMD F S - E L E L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R L QM P L
- - - - - - - P -G MG S S V L S K G -N E K E S I K R I QR F L C I MN S MT A D E L DR I V R I A K G S G T S I E E V H I L L G G MDNV MNMMDY R - - -N I
- - - - - - - P -G L S N - I M S QV -G E E E T S K K I K NMV Y I MD S MT T E E L E R I L R V A K G S G C A A V E I E MV L G G M P NMA NMMDY S -Y L K L
- - - - - - - P -G F G NN L I S K G -T E K E G I DK I K K F MV I MD S MT N E E L DR C L R I V K G S G T R L QD I K E L L G G A NNMV N I L DY S K DMK L
- - - - - - - P -G F G NN I I S K G -T E K E G I E K I K K F MV I MD S MT N E E L DR C L R I C K G S G T R L QD I R E L L G G A NNMV N I L DY S K DMK L
- - - - - - - P -G F G T N L I S K G -T E K E G I DK I K K Y MV I MD S MT N E E L DR C I R I C K G S G T K L S D I K E L L G G A NNMV NM L DY S K E MK L
- - - - - - - P -G F S A E L M P K G -H E K E S QA K I K R Y MT MMD S MT N E E L DR MMR I A R G A G R P I R DV M E I L G G MG G L QN L MD F S -K L E L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V S T R DV Q E L L G G MA G L Q S MMDY R QQM P L
- - - - - - - P -G M S N -MMNQV -G E E E T S QK MK K MV Y V L D S MT K E E L E R MV R V A K G S G T S V F E V E M I L G G M P DMN E MMD F S -Y L R L
- - - - - - - P -G M S N -MMNG M -ND E E G S L R MK R M L Y I V D S MT E Q E L DR V L R V A R G S G T S V L E V E E T I G G M -D F S G M L D F S N L L G L
- - - - - - - P -G F S S D F MT K G -N E Q E S MA R L K K L MT MMD S MK D E E L DR V QR V A R G S G V S V R E V Q E L L G G MNG L Q S MMDY R G QM E L
- - - - - - - P -G F G T D F M S K G -N E Q E S MA R L K K L MT I MD S MNDQ E L DR I QR V A R G S G V A T R DV Q E L L G G MA G L Q S MMDY R S QMQM
- - - - - - - P -G MG N S I L DK N -N E K E S I R K V K K F L T I MD S MNDN E L DR I V R I A R G S G S S L E DV NQ L L G G MG N I MNMV DY S -T L E L
- - - - - - - P -G I P P E L L QA G -R E Q E G V DR I K R F M I I MD S MT D E E L DR I MR I A K G S G S S P H E I N F L I G G A G N I MK L MDY S N - L K L
- - - - - - - P -G I P P E L L QA G -R E Q E G V DR I K R F M I I MD S MT D E E L DR I MR I A K G S G S S P H E I S F L I G G A G N I MK L MDY S N - L K L
T I A QG I K P -G MK N - - - - - - -D P E HT K DT V K R I L V V I DA M S T S E I E R I A R L C S G S G M P P P F V QY V I G G I E G L E A S MK Y F K D L Y I
- - - - - - - P -G M S G F T G NA G - - -DA G DV T L K T F I HMMD S MT A A E L DR I HR I A R G S G HT I L E V HN L I G G L T G L QD I M - - - - - - -
- - - - - - - P -G M S A L S G A A G - - - E L G DV T L K A F I H I MD S MT A A E L DR I L R V A R G S G H S I H E V HN L I G G L G G L QG MM - - - - - - -
- - - - - - - P -G MG QM L S G A G G DD E A A G S K MK R MM F I T DA MT A E E L DR A R R V A R G S G T S V K E V E E F L G G M P DM S Q L A D F T -K M P L
- - - - - - - P -G L S G -MA S S I - S D E E G T R R I K R M I Y I L D S MNQK E L DR I T R V A R G S G T S I R E V E E V L G G M P DMG QMMD F S -Y L K L

2000


2010 2020 2030 2040 2050 2060 2070

K P DN E S R P L WV A P -NG H - - I F L E S F S P V Y K HA HD F L I A I S E P V C R P E H I H E Y K L T A Y S L Y A A V S V G L QT HD I V E Y L K R - - - - L
K P DHG NR P L WA C A -DG K - - I F L E T F S P L Y K QA Y D F L I A I A E P V C R P E S MH E Y N L T P H S L Y A A V S V G L E T E T I I S V L NK - - - - L
K P DHA S R P L W I A P NDG R - - I I L E S F S P L A E QA QD F L V T I A E P V S R P S HV H E Y K I T A Y S L Y A A V S V G L E T DD I I A V L DR - - - - L
K P DHA NR P L W I D P L K G T - - I T L E S F S P L A P QA QD F L T T I A E P L S R P T H L H E Y R L T G N S L Y A A V S V G L Q P T D I I N F L DR - - - - L
K P DHA NR P L W I D P L K G T - - I T L E S F S P L A P QA QD F L T T I A E P L S R P T H L H E Y R L T G N S L Y A A V S V G L L P QD I I N F L DR - - - - L
K P DNA S R P L WV A P -NG H - - I F L E A F S P V Y K HA HD F L I A I A E P V C R - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
K DDHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T DY L R K - - - - L
K A D F S A R P L WV A P -DG H - - I F L E S F S P V Y K HA R D F L I A I S E P V C R P QH I H E Y Q L T A Y S L Y A A V S V G L QT K D I I E Y L E R - - - - L
K G D F T A R P L WV A P -DG H - - I F L E S F S P V Y K HA R D F L I A I S E P V C R P QH I H E Y Q L T A Y S L Y A A V S V G L QT K D I I E Y L E R - - - - L
K P DH F S R P I W I S P NDG R - - I I L E S F S P L A E QA QD F L I T I A E P I S R P S H I H E Y R I T A Y S L Y A A I S V G L E T DD I I S V L NR - - - - L
K P DHA S R P I W I S P S DG R - - I I L E S F S P L A E QA QD F L V T I A E P I S R P S H I H E Y K I T A Y S L Y A A V S V G L E T DD I I S V L DR - - - - L
K DDHG S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
----------------------------------------------------------------------------------
K G DH S L R P L WV DD -R G N - - I I V E A F A P F A K QA QD F L V A I S E P V S R P A L I H E Y R I T K P S L H S A M S I G L E T K V I I E V L S R - - - - L
K S DHDK R P I WV F P -DG L - - I I I E T F HQ S S K A A C E F L V T I S E P L S R P E L I H E Y Q L T I F S L Y A A V S L G I T V D S I I E T L G K - - - - F
K S DHDK R P I WV F P -DG L - - I I I E T F HQ S S K A A C E F L V T I S E P L S R P E L I H E Y Q L T I F S L Y A A V S L G I T V D S I I E T L G K - - - - F
K NDH S S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I S E P V C R P T HA H E Y K L T A Y S L Y A A V S V G L QT S D I I E Y L QK - - - - L
K P DH F S R P I W I S P I DA R - - I I L E S F S P L A E QA QD F L I T I A E P I S R P S HV H E Y R I T A Y S L Y A A V S V G L E T DD I I L V L NR - - - - L
K QDNK S R P I WV C P -DG H - - I F L E T F S A I Y K QA S D F L V A I A E P V C R P QN I H E Y Q L T P Y S L Y A A V S V G L E T ND I I T V L G R - - - - L
R P DHG NR P L WV A P -NG H - -V F L E S F S P V Y K HA HD F L I A I S E P V C R P E H I H E Y K L T A Y S L Y A A V S V G L QT HD I V E Y L K R - - - - L
R P DHG NR P L WV A P -NG H - -V F L E S F S P V Y K HA HD F L I A I S E P V C R P E H I H E Y K L T A Y S L Y A A V S V G L QT HD I V E Y L K R - - - - L
K E DG E S H P I WV NY -DG L - - I I L E T F R E S S R QA S D F L I A I A E P M S R P L Q I H E F Q I T A Y S L Y A A V S V G L T T S D I I E T L DR - - - - F
K P NH P E L P MWV S S -N L R - - I V V E T S NDM F K E V S DY L S R V A QV K S R M E HMH E Y Q L T P T S I MT A F S F G S T P E A M I S T L E K - - - -Y
K A DNA S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T H I H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L QK - - - - L
K DDHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
K P DHA S R P L W I S P NDG R - -V I L E S F S P L A E QA QD F L V T I A E P V S R P S H I H E Y R I T A Y S L Y A A V S V G L E T E D I I A V L DR - - - - L
G E K C L F V E S R I E S -DG Y I T I I A E S F R R S Y V N I R P F L T T L A E A I S R P S L MH E Y L L T P F S L G A A V S NG I DA A E A T A F L E T HA Y G L
K DDHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
K P DHA NR P L W I N P DK G I - - I I L E S F N P L A E QA QD F L I T I A E P Q S R P T F L H E Y A L T A H S L Y A A V S V G L H P QD I I S T L DR - - - - F
- - - - - - - - - - - - - -QG T - - - - - - - - - - - - - - - - - - - L L I K G NV R V P N S I WD E R S G S F R A P A - - - - - L Y Y R D I V NY L K E - - - -
K DDHA S R P L WV A P -DG H - -V F L E A F S P V Y K Y A QD F L V A I A E P V C R P S HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L K K - - - - L
K G DHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
K P DHDQK P L W I D P E K G T - - I I L E K F S P DA DR V T D F L V T I A E P K S R P H F L H E Y Q L T A H S L Y A G V S I G L Q S K D I I DT L DR - - - - F
K P DHA NR P L WA C A -DG R - - I F L E T F S P L Y K QA Y D F L I A I A E P V C R P E S MH E Y N L T P H S L Y A A V S V G L E T S T I I S V M S K - - - - L
K P DHA NR P L WV C D -DG R - - I F L E S F S P V Y K A A Y D F L I S V A E P V C R P A NMH E Y V L T P H S L Y A A V S V G L E T S T I L S V L DR - - - - L
K DDHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
E I V Q S NK P L I L S P -D L G - - I I V E K F N P L Y E I A F E F L MC V A E P I S R S E L I H E Y V L T QM S MY T A MV L QY S A DD I I R L L D L - - - - L
K P DH F S R P I WM S P -DG R - - I I L E S F S P L A E QA QD F L I T I A E P I S R P S H I H E Y R L T P Y S L Y A A V S V G L E T DD I I S V L S R - - - - L
K K NHMNK P L W I C S -DG F - - I Y L E M F N S C S K QA S D F L I T I A E P I C R P E L I H E F Q L T I F S L Y A A I S V G I T L D E L L I N L DK - - - - F
K K NHMNK P MW I C S -DG F - - I Y L E M F N S C S K QA S D F L I T I A E P I C R P E L I H E F Q L T I F S L Y A A I S V G V T L D E L L V N L DK - - - - F
K K NHMNK P L W I C S -DG F - - I Y L E M F N S C S K QA S D F L I T I A E P I C R P E I I H E F Q L T I F S L Y A A I S V G I T L D E L L L N L DK - - - - F
K P DHA NR P L WA C A -DG R - - I F L E T F S S L Y K QA Y D F L I A I A E P V C R P E S MH E Y N L T P H S L Y A A V S V G L E T E T I I S V L NK - - - - L
K G DHT S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L V A I A E P V C R P T HV H E Y K L T A Y S L Y A A V S V G L QT S D I T E Y L R K - - - - L
R P DHA S R P L W I S P S DG R - - I I L E S F S P L A E QA QD F L V T I A E P I S R P S H I H E Y K I T A Y S L Y A A V S V G L E T DD I I S V L DR - - - - L
K L DHT A R P L W I N P I DG R - - I I L E A F S P L A E QA I D F L V T I S E P V S R P A F I H E Y R I T A Y S L Y A A V S V G L K T E D I I A V L DR - - - - L
K K DHG S R P L W L A P -DG H - - I F L E S F S P V Y K HA HD F L I A I S E P V C R P E N I H E Y K L T A Y S L Y A A V S V G L QT S D I I E Y L R R - - - - L
K DDHA S R P L WV A P -DG H - - I F L E A F S P V Y K Y A QD F L I A I A E P V C R P T H I H E Y K L T A Y S L Y A A V S V G L QT S D I V E Y L QK - - - - L
K DDY R E R P I L I C P -DG I - - I F L E T F N P L Y R V A Y Q F L I S I G E P V QR P L S MHK F T L T K Y S L Y T A MV L QY E P K D I I L C L E K - - - - L
K T NHT A R P L WV C P -DG Y - - L Y L E L F T P V S K QA L D F I V T I A E P V C R P E L I H E Y QV T V F S L Y T A V S V G L S F E E L L NN L NK - - - - F
K NNH S A R P L WV C P -DG Y - - L Y L E L F T P V S K QA L D F I V T I A E P V C R P E L I H E Y QV T V F S L Y T A V S V G L S F E E L L NN L NK - - - - F
L E N S DNR P A I V M P -DG H - - I F V E T F S P F Y S K V V D F I I A I A D P C S R P K Y V Q E Y Q I N P Y S I F S A V S I G L K A K E I I R I L A I - - - - I
- - - - - - - - - - L G P -G G R - - I F I NHG H P A Y P H L MD F L T A C C E P V C R T L Y V S E Y T I S P S S L S A A T A E G T Y S M E MV R NV I R Y F R L D
- - - - - - - - - -V G A -NG S - - L F V NNT H P A Y P H L V D F L T S C C E P V S R T L R M S E Y V I S P S S L S A A S A E G T Y S T A M I R N F I R Y F R L D
K L DHA S R P L W I S P DDG H - - I I L E G F S P L A E QA QD F L I A I A E P V S R P A Y I H E Y K L T P Y S L Y A A V S V G L Q P DD I I E V L NR - - - - L
K P DHA A R P L W I N P E DG R - - I I L E S F S P L A E QA QD F L V T I A E P I S R P S H I H E Y R I T T Y S L Y A A V S V G L E T S D I I S V L NR - - - - L

2080

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2090

2100

2110

2120

2130

2140

2150

S K I R L C T L S Y G K K L V L K HNK Y F V E F E V NQ E K I - E V L QK R C I - - E I E F P L L A E Y D F R NDT I N - - - -A D I N I D L K P A - - -A V L R P
S K I HA S T A NY G K K L V L K K NR Y F I E F E I D P A L V - E NV K QR C L P NA L NY P M L E E Y D F R NDNV N - - - - P D L DM E L K P H - - -A Q P R P
S K I K G A T V S Y G K K L V I K HNR Y F V E F E I DR E S V - E L V K R R C Q - - E I DY P V L E E Y D F R NDNR N - - - - P D L E I D L K P S - - -T Q I R P
S K I I D F T K S F G K K V V L K HNR F F V E F E I P N E A V - E S V K A R C Q - -A MG C P A L E E Y D F R ND E I N - - - - P T L D I D L K P N - - -A R I R S
S K I L D F T K S Y G K K V V L K HNR F F V E F E I P N E A V - E P V K A R C Q - -A MG C P A L E E Y D F R ND E I N - - - - P T L D I D L K P A - - -A R I R S
- - - - - - - - - - - - - - - - - - - -Y L V E F E V D P DK I - E V I QK R C I - - E L E H P L L A E Y D F R ND S I N - - - - P D I N I D L K P T - - -A V L R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I QMC T V S Y G K K L V L K HNR Y F V E F E I K Q E T I - E T V QR R C I - - E L E Y P L L A E Y D F R NDT MN - - - - P N L G I D L K P S - - -T T L R P
S K V QMC T V S Y G K K L V L K HNR Y F V E F E I K Q E T I - E T V QK R C I - - E L E Y P L L A E Y D F R NDT L N - - - - P N L G I D L K P S - - -T T L R P
S K I K A A T V S Y G K K L V L K HNR Y F V E F E I A HD S V - E I V K R R C Q - -D I E Y P V L E E Y D F R HDA R N - - - - P D L E I D L K P S - - -T Q I R P
S K I K G A T I S Y G K K L V I K HNR Y F V E F E I A N S S V - E I V K R R C Q - - E I DY P V L E E Y D F R NDNR N - - - - P D L E I D L K P S - - -T Q I R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
----------------------------------------------------------------------------------
S K I E E WT A S F G K R L V L K DNR Y F L E F E V S G E R M - E DV R R R C K - -D I D L P A L E E Y D F R NDT I N - - - - P N L D I Q L K P M - - -T V I R P
S K I R G HC K L F G K K I V L L E G R Y F V E F E I S G DK V -D I V T MA S F -V S L HR P L L S E Y D F R S D I K N - - - - P N L D I S L K HT - - -T Q I R Y
S K I R G HC K L F G K K I V L L E G R Y F V E F E I S G DK V -D I V T MA S F -V S L HR P L L S E Y D F R S D I K N - - - - P N L D I S L K HT - - -T Q I R Y
S K I K L C T V S Y G K K L V L K HNR Y F V E F E I R Q E M I - E E L QK R C I - -Q L E Y P L L A E Y D F R NDT V N - - - - P D I NMD L K P T - - -A V L R P
S K I R G A T I S Y G K K L V L K HNR Y Y V E F E I A N E S V - E I V K R R C Q - - E I E Y P V L E E Y D F R NDDR N - - - - P D L D I D L K P S - - -T Q I R P
S K V R QC T Q S Y G K K L V L QK NK Y F V E F E I D P QQV - E E V K K R C I - -Q L DY P V L E E Y D F R NDT V N - - - - P N L N I D L K P T - - -T M I R P
S K I R L C T L S Y G K K L V L K HNK Y F I E F E V A Q E K I - E V I QK R C I - - E I E H P L L A E Y D F R NDT NN - - - - P D I N I D L K P A - - -A V L R P
S K I R L C T L S Y G K K L V L K HNK Y F I E F E V S Q E K I - E V I QK R C I - - E I E H P L L A E Y D F R NDT NN - - - - P D I N I D L K P A - - -A V L R P
S K I T E C T L S Y G K K L V MK E S S F F L E L S I E V E E V - E L V K K R C I - - E I DY P L I E E Y D F R NDK V L - - - -R S L Q I D L K P T - - -T I I R S
S K I R QA G E K K QNR L V L I NG K Y Y L Q I E V K QT S V - F K L K K K C K - -K K K V R V Y E E Y H F L R DK Q - - - - -K E L P I Q L R K D - - - -C L R P
S K I K L C T V S Y G K K L V L K R NR Y F V E F E V K Q E M I - E E L QK R C I - -H L DY P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I K S A T V S Y G K K L V I K HNR Y F V E F E I DNA S V - E I V K K R C Q - - E L DY P V L E E Y D F R NDR R N - - - - P D L D I D L K P S - - -T Q I R P
A E I E S C MK R Y N L R I I I DA E R T L V Q F L L Q S R A M S K V V A A QC V - -V L G L P I QQQY D F E NDT S V - - - -R T A H I S L R T Q - - -T K P R R
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
L K I E V S T K S Y G K K L V L K NT QY F V E F Q I E D E G V - E I V QK R C L - - E L NY P I L E E Y D F R NDT F N - - - - P V L D I D L R P N - - -T QV R P
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S G I D F E DA V L D L L P C P D L S A A Y E A S G K K L K L R D
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -C L E Y P L L A E Y D F R NDT L N - - - - P D I N I D L K P T - - -A V L R P
L K I E S C T K S Y G K K L V L NNNK Y F V E F E I P E T A V - E I V QR R C L - -D L G F P I L E E Y D F R ND S NN - - - -A D L E I D L R P N - - -T Q I R P
S K I HA S T A NY G K K L V L K K NR Y F V E F E I D P S QV - E NV K QR C L P NA L N F P M L E E Y D F R NDT V N - - - - P D L E M E L K P Q - - -A R P R P
S K V H E C T E NY G K K L V L QR NK F Y L E F E I E A R QV - E HV K QR C L P G N L G Y P T L E E Y D F R NDT R N - - - - P D L G I E L K P M - - -T R I R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -H L E Y P L L A E Y D F R ND S V N - - - - P D I N I D L K P T - - -A V L R P
S K I R HHT NN I G QK F F L QDK S Y Y I D F R I V G DY F - -DV A QA L I - -R S S V P L I Q E Y D F T K E K - - - - - -QK L D I N L K P S - - -T K P R L
S K I K S A T I S Y G K K L V L K HNR Y F V E F E I A N E S V - E I V K R R C Q - -D I DY P V L E E Y D F R NDA R N - - - - P D L E I D L K P S - - -T Q I R P
S K I T K S A E S F G K K L V L R E NK Y Y I E F E V NC DK I - E E V K Q E A L -QT MQR P L L M E Y D F R R DK K N - - - - P N L I C S L K S H - - -V Q I R Y
S K I T K S A E S F G K K L V L R E NK Y Y I E F E V NC DK L - E E V K Q E A L -QT MQR P L L M E Y D F R R DK K N - - - - P N L I C S L K S H - - -V Q I R Y
S K I T K S A E S F G K K L V L R E NK Y Y I E F E V NC DK I - E E V K Q E A L -QT MQR P L L M E Y D F R R DK K N - - - - P N L NC S L K S H - - -V Q I R Y
S K I HG S T A NY G K K L V L K K NR Y F I E F E V D P S QV - E NV K QR C L P NA L NY P M L E E Y D F R NDT V N - - - - P D L NM E L K P H - - -A Q P R P
S K I K L C T V S Y G K K L V L K HNR Y F V E F E V K Q E M I - E E L QK R C I - -C L E Y P L L A E Y D F R ND S L N - - - - P D I N I D L K P T - - -A V L R P
S K I K G A T I S Y G K K L V I K HNR Y F V E F E I A N E S V - E V V K K R C Q - - E I DY P V L E E Y D F R NDHR N - - - - P D L D I D L K P S - - -T Q I R P
S K I R A C T V S Y G K K L V L K K NR Y F I E F E I K H S S V - E T I K K R C A - - E I DY P L L E E Y D F R NDN I N - - - - P D L P I D L K P S - - -T Q I R P
S K I K L C T L S Y G K K L V L K HNR Y F V E F E V V QD E I - E N L QK R C I - - E L E Y P L L A E Y D F R NDT R N - - - - P D L S I D L K P T - - -A V L R P
SK I K V - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - S K I T E NT QNY G A R L F L DD S S Y Y L D I Q I I G DH F - - E V T K A V I - -NC S V P L I Q E Y D F E NK S F - - - - -K Q L E I E L K P K - - - I K V R Y
S K I L NT S S A F G K K L V L R D S R Y W I E F E V QQ E K I - E D L K R E A L -QT MR R P L V M E Y D F R K DNN S - - - - P S L NC C I R S N - - - I K I R Y
S K I L NT S S A F G K K L V L R D S R Y W I E F E V QQ E K I - E E L K R E A L -QNMR R P L V M E Y D F R K DNN S - - - - P S L NC C I R S N - - - I K I R Y
S K I E L C C L S V G K K S V L R NT K Y Y I E F Q I K T E S V -R E I R QY A V - -DHN L F I S D E Y D F MNDK T I - - - -DN L G I Q L K NT - - -T R I R P
E QA K V S A NG DV K - -V K K E E T K E V A S QV MDG K M -R NV R E R L Y -K E L S V R A D L F Y DY V QDH S L - - - -HV C D L E L S E N - - -V R L R P
E QT H E S E E S QG K S L V K Q E A T E E T A S QV K DG R L -R NV R E R L F -K E L G V R A D L F Y DY V QDG T L - - - -DV R D L A L A E H - - -V R L R P
S K I R E Y T A S F G K K L V L K QNK Y F V E F E I A E E Y I - E QV K K R C N - - E I G Y P M L E E Y D F R NDQ L N - - - -A D L E I D L K P I - - -T H I R P
S K I H S C T K S Y G K K L V L K HNR Y F V E F E I A P D S V - E T V K K R C Q - - E I DY P V L E E Y D F R NDHG N - - - - P D L D I D L K S S - - -T Q I R P

2160

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2170

2180

2190

2200

2210

2220

2230

Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A C C T V R K R A L V L C N S G V S V E QWK QQ F WG I MV L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L V G V S A A A R I K K S C L C L A T NA V S V DQWA Y Q F WG L L L MD E V HV V P A HM F R K V I S I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I L L D E V HV V P A A M F R R V V S T
Y Q E K S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A A C T I K K G T I V L C T S S M S V V QWR N E F WG L M I L D E V HV V P A S M F R K V T S A
Y Q E K S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A G C T I K K G T I V L C T S S M S V V QWR N E F WG L M I L D E V HV V P A S M F R K V T S A
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A V C T V R K R A L V L C N S G V S V E QWK QQ F WG L V V L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G N S R A R S G V I V L P C G A G K T L V G V T A V T T V NK R C L V L A N S NV S V E QWR A Q F WG L L L L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G N S R A R S G V I V L P C G A G K T L V G V T A V T T V NK R C L V L A N S NV S V E QWR A Q F WG L L L L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A QM F R R V V T T
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V S T
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
----------------------------------------------------------------------------------
Y Q E M S L A K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S A L V L C T S A V S V A QWK QQ F WG F L L L D E V HV T P A DM F R K C I NN
Y Q E QA L R MM F S NG R A R S G I I V L P C G A G K T L T G I T A A C T MR K S I L I L T T S A V A V S QWK F Q F WG L L I F D E V Q F A P A P A F R R I NG I
Y Q E QA L R MM F S NG R A R S G I I V L P C G A G K T L T G I T A A C T MR K S V L I L T T S A V A V S QWK F Q F WG L L I F D E V Q F A P A P A F R R I NG I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S S V S V E QWK A Q F WG L I I L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V T T
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L S G I T A A C T V K K S I L V L C T S A V S V E QWK Y Q F WG L V L L D E V HV V P A A M F R K V L T V
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A C C T V R K R A L V L C N S G V S V E QWK QQ F WG I MV L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A C C T V R K R A L V L C N S G V S V E QWK QQ F WG I MV L D E V HT I P A K M F R R V L T I
Y Q E I C L NK M F G NG R A R S G I I V L P C G S G K T I V G I T A I S T I K K NC L V L C T S A V S V E QWK QQT WG L L V L D E V HV V P A MM F R R V L S L
HQ E R A L QQ I F DN E MA R S G I V V L P C G A G K T L T A I A A C S K I K R S T I V L T HT T Q S V F QWK E E F WG F I I F D E V HG S T T DN I E K F V C K
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V S T
Y Q I E A V DA A I HDG T L N S G C L L L P C G A G K T L L G I M L MC K V K K P T L V L C A G S V S V E QWK S Q I Y G L L I L D E V HV M P A E S F R G S L G F
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A A C T I K R G V I V L C T S T M S V V QWR D E F WG L M I L D E V HV A P A K M F R R V T S A
Y QA E A L V A W S E N - - E K WG V L V L P T G S G K T L L G I R A I A G C NT P A L V I V P T L D L L E QWK T Q L F G L L V F D E V HH L P A A G Y R S I A E F
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A R M F R R V L T I
Y Q E Q S L S K M F G NG R A K S G I I V L P C G A G K T L V G I T A A C T I K K G V I V L C T S S M S V V QWR Q E F WG L M L L D E V HV V P A DV F R R V I S S
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L V G V S A A C R I K K S C L C L A T NA V S V DQWA F Q F WG L L L MD E V HV V P A HM F R K V I S I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L T G I A A A A R I R K S C L C L C T S S V S V DQWA A Q F WG C M L L D E V HV V P A A M F R K V I G I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P G K QA G A E L R V
Y Q L R A A K T V I MG DY A K S G L I V L P C G A G K T L V G V L C M S L I K S S T V I I C D S NV S V E QWK R E I WG I C I V D E V HR L P A V Q F QNV L K Q
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V T T
Y Q E K A L R K M F S NG R S R S G I I V L P C G V G K T L T G I T A A S T I K K S A L F L T T S A V A V E QWK K Q F WG L L V F D E V Q F A P A P S F R R I ND I
Y Q E K A L R K M F S NG R S R S G I I V L P C G V G K T L T G I T A A S T I K K S S L F L T T S A V A V E QWK K Q F WG L L V F D E V Q F A P A P S F R R I ND I
Y Q E K A L R K M F S NG R S R S G I I V L P C G V G K T L T G I T A A S T I K K S S L F L T T S A V A V E QWK K Q F WG L L V F D E V Q F A P A P S F R R I ND I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K S L V G V S A A C R I K K S C L C L A T NA V S V DQWA F Q F WG L L L MD E V HV V P A HM F R K V I S I
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K S L V G V T A A C T V R K R C L V L G N S A V S V E QWK A Q F WG L M I L D E V HT I P A K M F R R V L T I
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R R V V S T
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S V I V L C T S S V S V MQWR QQ F WG F I L L D E V HV V P A A M F R R V V T T
Y Q E K S L R K M F G NG R A R S G V I V L P C G A G K T L V G V T A S C T V R K R C MV L C T S G V A V E QWR S Q F WG L M I L D E V HT I P A K Q F R R V L T Q
----------------------------------------------------------------------------------
Y Q E R A L K N I F I QK K A R S G L I I L P C G A G K T I V G V I A I E R I K Q S T V I I C D S DV S V DQWR D E L WG V C V I D E V HK L P A NT F QNV L K Q
Y Q E R A L R R M F S NG R A R S G I I V L P C G A G K T L T G I V A A C T V R K S I F V L T T S A V A V E QW I K Q F WG M L I F D E V Q F V P A P A F R R I N E I
Y Q E R A L R R M F S NG R A R S G I I V L P C G A G K T L T G I V A A C T V R K S I F V L T T S A V A V E QW I K Q F WG M L I F D E V Q F V P A P A F R R I N E I
Y Q E K A L T K M F S G G R S I S G I I V L P C G A G K T L V G I A A L A T I NK P T V I V C NNR L T V K QWY NQ I WG L L I L D E V QD S A A NT F R NV T D I
Y QV A S L E R F R S G NK A HQG V I V L P C G A G K T L T G I G A A A T V K K R T I V MC I NV M S V L QWQR E F WG L L L L D E V HT A L A HN F Q E V L NK
Y QV A S L E R F R C G NK A HQG V I V L P C G A G K T L T G I G A A T I L K K R T I V MC I NV I S V L QWQR E F WG L L L L D E V HA A L A HH F Q E V L NK
Y Q E K S L A K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I K K S C L V L C T S S V S V MQWR QQ F WG F I L L D E V HV V P A S M F R R V L T K
Y Q E K S L S K M F G NG R A R S G I I V L P C G A G K T L V G I T A A C T I R K S V I V L C T S S V S V MQWR QQ F WG F I I L D E V HV V P A A M F R K V V T N

2250

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2260

2270

2280

2290

2300

2310

2320

V Q S HC K L G L T A T L L R E DDK I A D L N F L I G P K L Y E A NW L E L QK R G Y I A R V QC A E V WC P MA P E F Y R E Y L Y V MN P A K F R A C QY L I R Y
T K S HC K L G L T A T L V R E D E K I T D L N F L I G P K L Y E A NW L D L V K G G F I A NV QC A E V WC P MT K E F F A E Y L Y V MN P NK F R A C E F L I R F
I A A HA K L G L T A T L V R E DDK I S D L N F L I G P K L Y E A NWM E L S QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I QY
I A T Q S K L G L T A T L L R E DDK I K D L N F L I G P K L Y E A NWM E L A E QG H I A K V QC A E V WC P MT T E F Y T E Y L Y I MN P R K F QA C Q F L I DY
I A C Q S K L G L T A T L L R E DDK I K D L N F L I G P K L Y E A NWM E L A E QG H I A K V QC A E V WC P MT T E F Y S E Y L Y I MN P R K F QA C Q F L I DY
V H S HA K L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QN S G Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
V QA HC K L G L T A T L V R E DDK I T D L N F L I G P K I Y E A NWM E L QK A G H I A K V QC A E V WC P MT S A F Y S Y Y L A V MN P NK F R I C Q F L I K F
V QA HC K L G L T A T L V R E DDK I T D L N F L I G P K I Y E A NWM E L QK A G H I A K V QC A E V WC P MT S A F Y S Y Y L A V MN P NK F R I C Q F L I K F
I A A HA K L G L T A T L V R E DDK I DD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I HY
I A A HA K L G L T A T L V R E DDK I S D L N F L I G P K L Y E A NWM E L S QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I QY
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
----------------------------------------------------------------------------------
F K V HA K L G L T A T L V R E DDR I G D L G Y L I G P K L Y E A NWMD L A K NG H I A T V QC A E V WC P MT P E F Y R E Y L HA MN P NK I QA C Q F L I NY
V K A HC K L G L T A T L V R E DD L I QD L QW L I G P K L Y E A NWM E L QDR G Y L A K A L C S E V WC P MT A S Y Y R E Y L WV C N P NK L R V C E F L I R W
V K A HC K L G L T A T L V R E DD L I QD L QW L I G P K L Y E A NWM E L QDR G Y L A K A L C S E V WC P MT A S Y Y R E Y L WV C N P NK L R V C E F L I HW
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I R F
I A A HA K L G L T A T L V R E DDK I DD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT S E F Y Q E Y L Y I MN P T K F QA C Q F L I HY
T K A HC K L G L T A T L L R E D E K I QD L N F L I G P K L Y E A NW L D L QK A G F L A NV S C S E V WC P MT A E F Y K E Y L Y T MN P NK F R A C E Y L I R F
V Q S HC K L G L T A T L L R E DDK I A D L N F L I G P K L Y E A NW L E L QK K G Y I A R V QC A E V WC P M S P E F Y R E Y L Y V MN P S K F R S C Q F L I K Y
V Q S HC K L G L T A T L L R E DDK I A D L N F L I G P K L Y E A NW L E L QK K G Y I A R V QC A E V WC P M S P E F Y R E Y L Y V MN P S K F R S C Q F L I K Y
V S HHC K L G L T A T L V R E DDK I E D L N F L I G P K L Y E A DWQD L S A K G H I A R V S C I E V WC G MT G D F Y R E Y L S I MN P T K F QV C E Y L I NK
I K A QC K L G L T A T L I R E DDR I R D L E F M I G P M L Y E A S WQ E L A K QG Y I A NA K C F E V I C P MT K T Y Y S A Y L A Q L N P NK I DA C K Y L L E Q
V QA HC K L E L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QN S G Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
I A A HA K L G L T A T L V R E DDK I S D L N F L I G P K L Y E A NWM E L S QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I QY
V DA K G V I G L T A T Y V R E DHK I L D L F H L V G P K L Y D I S M E T L A S QG Y L A K V HC V E V R T P MT K E F G L E Y L A A A N P NK MMC V R E L V R Q
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
L K S H S K L G L T A T L L R E DDK I S D L N F L I G P K L Y E A NWM E L S L G G H I A R V QC A E V WC P M P T E F Y R E Y L Y I MN P MK F QA C QY L I NY
S A A P C R L G L T A T Y E R E DG L HT E L NR L V G G K V Y E K K V S E L A -G G H L A P Y T I K R F A V T L T E K E QR E Y L A F N S N S K I E K L R E I L E Q
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
I K S H S K L G L T A T L L R E DDK I S H L N F L I G P K L Y E A NWM E L S E K G H I A K V QC A E V WC P M P T E F Y D E Y L Y A MN P R K F QA C QY L I NY
T K S HC K L G L T A T L V R E D E R I T D L N F L I G P K L Y E A NW L D L V K G G F I A NV QC A E V WC P MT K E F F A E Y L Y A MN P NK F R A C E F L I R F
T K A HC K L G L T A T L V R E DDK V DH L N F L I G P K L Y E A NW L D L QR DG H I A NV QC V E V WC P MT A E F F R K Y L Y C MN P NK F MA C Q F L MQ F
I L A HC N L R L L A T A F G HHD P V L D F L F T H L Q S I F E WA WW L T P NNG Y I A K V QC V E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
I K C A I K I G L T A T L L R E DQK L DN L Y F M I G P K L Y E E N L I D L MT QG F L A K P H I I E I QC DM P P I F L Q E Y L HT G N P G K Y K A L Q F L I K N
I A A HA K L G L T A T L V R E DDK I HD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT S E F Y Q E Y L Y I MN P T K F QA C Q F L I HY
V K S HC K L G L T A T L V R E D L L I R D L HW I I G P K L Y E A NWV E L QNK G F L A K A L C K E I WC S M P C S F Y K Y Y L Y T C N P R K L MMC E Y L I K Y
V K S HC K L G L T A T L V R E D L L I R D L QW I I G P K L Y E A NWV E L QNK G F L A K A L C K E I WC S M P S S F Y K Y Y L Y T C N P R K L MMC E Y L I K Y
V K S HC K L G L T A T L V R E D L L I R D L QW I I G P K L Y E A NWV E L QNK G F L A K A L C K E I WC S M P S S F Y K Y Y L Y T C N P R K L MMC E Y L I K Y
T K S HC K L G L T A T L V R E D E R I T D L N F L I G P K L Y E A NW L D L V K G G F I A NV QC A E V WC P MT K E F F A E Y L Y V MN P NK F R A C E F L I R F
V QA HC K L G L T A T L V R E DDK I V D L N F L I G P K L Y E A NWM E L QNNG Y I A K V QC A E V WC P M S P E F Y R E Y L Y T MN P NK F R A C Q F L I K F
I A A HA K L G L T A T L V R E DDK I G D L N F L I G P K L Y E A NWM E L S QK G H I A NV QC A E V WC P MT A E F Y Q E Y L Y I MN P T K F QA C Q F L I QY
I A A HT K L G L T A T L V R E DDK I DD L N F L I G P K MY E A NWMD L A QK G H I A K V QC A E V WC A MT T E F Y N E Y L Y I MN P K K F QA C Q F L I DY
V QA HC K L G L T A T L V R E DDK I A D L N F L I G P K L Y E A NWM E L QNK G F I A R V QC A E V WC P MA P E F F R E Y L Y V MN P NK F R A C Q F L V R F
----------------------------------------------------------------------------------
Y K F H F K L G L T A T P Y R E D E K I I N L F Y M I G P K L Y E E NWY D L V S QG F L A K P Y C V E I R C E M S Q L WM S E Y I HT S N P R K F K T L E Y L I K V
I R S HC K L G L T A T L V R E DD L I R D L QW L I G P K L Y E A NW L E L QQK G Y L A K V I C K E I WC P MT A P F Y R E Y L W S C N P V K L I T C E Y L L R F
I R S HC K L G L T A T L V R E DD L I R D L QW L I G P K L Y E A NW L E L Q E K G Y L A K V I C K E I WC P MT A P F Y R E Y L W S C N P V K L I T C E Y L L K F
A K A HT R L G L T A T L I R E DDK I S D L R Y L V G P K L Y E A NW L E L S E QG Y L A R V K C F E V T V P MT A S F Y K Y Y L C S S N P NK I R T V A G I I K F
V K Y K C V I G L S A T L L R E DDK I G D L R H L V G P K L Y E A NW L D L T R A G F L A R V E C A E I QC P L P K A F L T E Y V V C L N P Y K L WC T QA L L E F
V K Y K C V V G L S A T L L R E DDK I G D L R H L V G P K L Y E A NW L E L T R A G F L A R V E C A E V QC P L P L P F F R E Y V V C F N P Y K L WC T QA L L E F
I K A H S K L G L T A T L V R E D E K I D E L N F L V G P K L Y E A NWMD L A A K G H I A T V QC A E V WC P MT P E F Y R E Y L Y C MN P NK F QA C Q F L I DY
I A A HA K L G L T A T L V R E DDK I DD L N F L I G P K L Y E A NWMD L A QK G H I A NV QC A E V WC P MT S E F Y Q E Y L Y I MN P S K F QA A Q F L I NY

2330

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2340

2350

2360

2370

2380

2390

2400

H E -K R G - -DK T I V F S DNV F A L K HY - -A I K MNK P Y I Y G P T S QN E R I Q I L QN F K F N P K V NT I F V S K V A DT S F D L P E A NV L I Q I S S
H E QQR G - -DK I I V F A DN L F A L T E Y - -A MK L R K P M I Y G A T S H I E R T K I L E A F K T S K T V NT V F L S K V G DN S I D I P E A NV I I Q I S S
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I Y G S T P QQ E R MN I L QN F QY NDQ I S T I F L S K V G DT S I D L P E A T C L I Q I S S
H E -K R G - -DK V I V F S DNV Y A L E R Y - -A L K L NK A Y I Y G G T P QN E R MR I L E N F QHN E QV NT I F L S K I G DT S L D L P E A T C L I Q I S S
H E -K R G - -DK V I V F S DNV Y A L QR Y - -A L K L NK A Y I Y G G T P QN E R MR I L E N F QHN E QV NT I F L S K I G DT S L D L P E A T C L I Q I S S
----------------------------------------------------------------------------------
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -R R N - -DK I I V F S DNV F A L K R Y - -A I E MQK P F L Y G E T S QN E R MK I L QN F QY N P R V NT I F V S K V A DT S F D L P E A NV L I Q I S A
H E -R R N - -DK I I V F S DNV F A L K R Y - -A I E MQK P F L Y G E T S QN E R MK I L QN F QY N P R V NT I F V S K V A DT S F D L P E A NV L I Q I S A
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L R L G K P F I Y G S T P QQ E R MK I L QN F QHNDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -R R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I F G S T P QQ E R MN I L QN F QY NDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
- - - - - - - - - - - - - - - - -V F A L K E Y - -A V R MG K P Y I Y G P T T QG E R L Q I L QN F I HN P K V NT I F I S K V G DN S I D L P A A NV L I QV S S
H E - S R G - -DK V I V F S DNV F A L E A Y - -A K K L G K S F I HG G T P E G E R L R I L S R F QHD P Q L NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QR G - -DK I L V F S D S L F A L I N I - -A V A L K K P F V C G S V DT L E R I K I L QQ F K E N P N F NT I F L S K V G DNA I D I P L A NV V I Q I S F
H E -QR G - -DK I L V F S D S L F A L I N I - -A V A L K K P F V C G S V DT L E R I K I L QQ F K E N P N F NT I F L S K V G DNA I D I P L A NV V I Q I S F
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -K R G - -DK I I V F S DNV Y A L QG Y - -A L K L G K P F I Y G S T S QQ E R MK I L QN F QHNDQV NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QR G - -DK I I V F S DNV Y A L QK Y - -A K G L G R Y F I Y G P T S G H E R M S I L S K F QHD P T V R T I F I S K V G DT S I D I P E A T V I I QV S S
H E -QR G - -DK T I V F S DNV F A L K HY - -A I K MNK P F I Y G P T S QN E R I Q I L QN F K F N S K V NT I F V S K V A DT S F D L P E A NV L I Q I S S
H E -QR G - -DK T I V F S DNV F A L K HY - -A I K MNK P F I Y G P T S QN E R I Q I L QN F K F N S K V NT I F V S K V A DT S F D L P E A NV L V Q I S S
H E - S R G - -DK I I V F S D S V Y A L K A Y - -A L K L G K P F I Y G P T G QT E R MR I L K Q F QT N P V I NT I F L S K V G DT S I D L P E A T C L I Q I S S
HK -A HG - -DK I I I F C N E L K P A G F Y K E K L K L QK C Y MDG NT S E E HR R N L L DQ F R R D - E I S V I F C S K I G DV G L D L P DA S V A I Q L S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L G K P Y I Y G P T A QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I Y G S T P QQ E R MN I L QN F QY NDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H L -DA G - -A K I L L C C DH I M L L K E Y - -G E L L NA P V I C G S T QHK E R L M I F S D F Q S T S K I NV I C V S R V G DV S V N L P NA N I V I QV S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -A R G - -DK I I V F S DNV Y A L K K Y - -A S V L S K C M I Y G G T S N S E R Q L I L K N F QHN P E I NT L F L S K I G DT S L D L P E A T C L I Q I S S
H - - -R E - -DR V F I F T E HNR L V HR I - - S NT F F I P A I T Y R T P A K E R N S I L E K F R -T G S Y R A V V T S K V L D E G I DV P E A N I G I - I V S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -A R G - -DK I I V F S D E L Y S L K QY - -A L K L NK V F I Y G G T G QA E R MQV L E N F QHN P QV NT L F L S K I G DT S L D L P E A T C L I Q I S S
H E QQR G - -DK I I V F A DN L F A L T S Y - -A MK L R K P M I Y G S T S HV E R T R I L HQ F K N S S DV NT I F L S K V - - - - - - - - - - - - - - - - -
H E QQR K - -DK V I V F S DN I F A L R E Y - -A T A L R R P L I Y G DT S HA E R T R V L HA F K Y S N E I NT I F L S K V G DN S I D I P E A NV I I Q I S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -M L G - -HK I I V F C D S L L I L NY Y - -A L L L G Y P V I DG D L NT D E K NK I F S I F K N S N E I K T I F V S R V G DT G I D I P S A S V G I E I G Y
H E -K R G - -DK I I V F S DNV Y A L Q E Y - -A L K L G K P F I Y G S T P QQ E R MQ I L S N F QHNDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QNN - -DK I I V F S DN I F A L L H I - -A K T L NK P F I Y G K L S P I E R I A I I NK F K HD S S I NT I L L S K V G DNA I D I P I A NV V I Q I S F
H E -QNN - -DK I I V F S DN I F A L L H I - -A K T L NK P F I Y G K L S P I E R I A I I NK F K ND S N I NT I L L S K V G DNA I D I P I A NV V I Q I S F
H E -QNN - -DK I I V F S DN I F A L L H I - -A K T L NK P F I Y G K L S P I E R I A I I NK F K ND S T I NT I L L S K V G DNA I D I P I A NV V I Q I S F
H E E QR R - -DK I I V F A DN L F A L T E Y - -A MK L HK P M I Y G A T S HA E R T K I L HA F K T S S E V NT V F L S K V G DN S I D I P E A NV I I Q I S S
H E -R R N - -DK I I V F A DNV F A L K E Y - -A I R L NK P Y I Y G P T S QG E R MQ I L QN F K HN P K I NT I F I S K V G DT S F D L P E A NV L I Q I S S
H E -R R G - -DK I I V F S DNV Y A L Q E Y - -A L K MG K P F I Y G S T P QQ E R MN I L QN F QY NDQ I NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -K R G - -DK I I V F S DNV Y A L R A Y - -A I K L G K Y F I Y G G T P QQ E R MR I L E N F QY N E L V NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -QR N - -DK V I V F S DNV F A L K HY - -A I A MG R P Y I Y G P T S QG E R MQ I L QN F QHN P A V S T I F I S K V G DN S F D L P E A NV L I Q I S S
----------------------------------------------------------------------------------
H E - E R G - -DK I L V F C DR P M I I DY Y - -G N I L K Y P V I Y G DV S QD E R K K I F N L F K V S NQ I NT I F L S R V G DT A I D L P QA NV G I Q I G M
H E - S R G - -DK V I V F S DN L F A L L HA - -A K L L NR P F I Y G K V S S A E R I V I L NK F K N E T T F NT I F L S K V G DNA L D I P C A NV V I Q I S F
H E - S R G - -DK V I V F S DN L F A L L HA - -A K L L NR P F I Y G K V S S A E R I I I L NK F K N E T T F NT I F L S K V G DNA L D I P C A NV V I Q I S F
H E -R R G - -DK V L V F C D I I H I L I H L - -A G L L HC P E I HG E T P E NV R S S I F H E F K NG S K V NT L I L S S V G DK A I D L P S A S V V V QV C S
HR -NR S P P DK V I I F C DQ I DG I QY Y - -A QH L HV P F MDG K T S DM E R E N L L QY F QH S DN I NA I I L S R V G DV A L D I P C A S V V I Q I S G
HR -NR S P P DK V I I F C DD L E G V QY Y - -A R H L NV P F MDG K T T E V E R E N L L QY F QH S ND I NA I I L S R V G DV A L D I P C A S V I I QV S G
H E -NR G - -DK I I V F S DNV Y A L V A Y - -A HK L K K P F I HG G T A H L E R MR I L QN F QHN P L V NT I F L S K V G DT S I D L P E A T C L I Q I S S
H E -K R G - -DK I I V F S DNV HA L K A Y - -A L K L G K F F I F G G T P QQ E R MK I L K N F QY NDQV NT I F L S K V G DT S I D L P E A T C L I Q I S S

2410

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2420 2430 2440 2450 2460 2470 2480

HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E MG Y S R K R QR F L V NQ -G Y S Y K V Y F K L R S A A V A E L K K S P -DT - - - - -H P Y P HK F
HA G S R R Q E A QR L G R I L R A K A F F Y S L V S T DT E MY Y S T K R QQ F L I DQ -G Y S F K V Y Y E NR L K Y L A A E K A K - -G E - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E A R S R Q I Q E L R K T Q - E P - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S S K R QA F L V DQ -G Y A F K V Y F E I R S K R I NK L R E T K -Q P - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S S K R QA F L V DQ -G Y A F K V Y F E I R S K R I NK L R E T K -N P - - - - -D P Y P HK F
----------------------------------------------------------------------------------
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I HQ L K V N - -G E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y S L V S QDT E MG Y S R K R QR F L V NQ -G Y A Y K V Y F NMR V R M I E A R R A A - -G E - - - - -N P F P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y S L V S QDT E MG Y S R K R QR F L V NQ -G Y A Y K V Y F NMR V R M I E A R R A A - -G D - - - - -N P F P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V F F E I R S R Q I S E L R E K N -NA D P S A F N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E A R S R Q I L E L R K T H - S P - - - - -N P Y P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I HQ L K I N - -G E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y S L V S QDT E V A Y S T K R QR F L V DQ -G Y S F K A Y L Q I R K NT I T T L R QN - -N I - - - - - E P Y P HK F
H F G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S S K R QG F L I DQ -G Y A F K V F H E MR Y K E I A K L R E T K -Q P - - - - -N P Y P HK F
N F A S R R Q E A QR L G R I L R P K A F F Y S L L S K DT E M E Y A DK R QQ F I I DQ -G Y S Y R V Y T DNR Y K MM E C I K DA - -G R - - - - - P F Y P HK F
N F A S R R Q E A QR L G R I L R P K A F F Y S L L S K DT E M E Y A DK R QQ F I I DQ -G Y S Y R V Y T DNR Y K MM E C I K DA - -G R - - - - - P F Y P HK F
HG G S R R Q E A QR L G R V L R A K A Y F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I QA L K G T - -A E - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E I R S R Q I D E L R QA N - L A DG S A F N P Y P HK F
HY G S R R Q E A QR L G R I L R P K A F F Y S L V S K DT E MY Y S T K R QQ F L I DQ -G Y S F K V Y K E NR T K Q L T S A D I - - -G V - - - - -N P W P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E M S Y S R K R QR F L V NQ -G Y S Y K V Y F K L R S A A V Q E L K R S P -A T - - - - -D P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E M S Y S R K R QR F L V NQ -G Y S Y K V Y F K L R S A A V Q E L K Q S A -D S - - - - -H P Y P HK F
H F G S R R Q E A QR L G R I L R A K V Y F Y S L V S K DT E M F Y S S K R QQ F L I DQ -G Y T F T I - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
S S G S R R Q E A QR L G R I L R A K A Y F Y T L T S K DT E MY F S QR R QR V MR QN -G Y T F K V L F L NR C K DV E E Y QK A - -G H - - - - -N P W P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S HA I QQ L K G T - -N E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA I HQ L K V N - -G E - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E A R S R Q I N E L R K T H - S P - - - - -N P Y P HK F
HG G S R R Q E A QR L G R I L R P K A W F Y S I I S T DT E I NY A A HR T A F L V DQ -G Y T C R I Y F E S R L A MV K E MG L L - -G - - - - - -A A Y P HK Y
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y F K I R S QA I HQ L K V N - -G E - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S A K R QA F L V DQ -G Y A F K V Y F E I R S R E V NG M L E N P S G P - - - - -N P Y P HK F
G T G S K R A Y V QR L G R I L R K K A V L Y E I I A G E T E T G T A R R R K E A L S S G -K R T S K A F DD S K L A K L NG I I S Q - -G L - - - - -D P Y P Y R F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA I QQ L K I S - -G E - - - - -D P Y P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA V QQ L K V T - -G E - - - - -D P Y P HK F
H F G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S S K R QA F L V DQ -G Y A F K V Y Y E I R T R QV N E L L K N P - E T - - - - -N P Y P HK F
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Q - - - - - - -Y Y E NR L K A L D S L K A T - -G V - - - - -N P Y P HK F
HA G S R R Q E A QR L G R I L R P K A F F Y S L V S T DT E MY Y S T K R QQ F L I QQ -G Y A F K V Y T QNR I NK V L S A K A K - -G E - - - - - S P Y P HK Y
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA I HQ L K V N - -G E - - - - -D P Y P HK F
L G G S R R QK V QR L G R V MR P K A F F Y S L A S K DT E S E Y S Y K R QK Y I T E Q L G L NT E L F H E NR S K QV L A L K QT K -D P - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E I R S R Q I N E L R E S N -N E N P T A F N P Y P HK F
N F A S R R Q E A QR L G R I I R P K S F F Y S L V S K DT E MC Y S DK R QR F L I NQ -G Y A Y NV Y F E NR S K F I QDQK DK - -G I - - - - -N P Y P HK F
N F A S R R Q E A QR L G R I I R P K S F F Y S L V S K DT E MC Y S DK R QR F L I NQ -G Y A Y NV Y Y E NR S K F V Q E QK A K - -G I - - - - -N P Y P HK F
N F A S R R Q E A QR L G R I I R P K S F F Y S L V S K DT E MC Y S DK R QR F L I NQ -G Y A Y NV Y F E NR S K L I L S QQ E K - -G I - - - - -NT Y P HK F
HA G S R R Q E A QR L G R I L R A K A F F Y S L V S T DT E MY Y S T K R QQ F L I DQ -G Y S F K V Y Y E NR L K Y L DA QK G E - -G K - - - - -NMY P HK F
HG G S R R Q E A QR L G R V L R A K A F F Y S L V S QDT E MA Y S T K R QR F L V DQ -G Y S F K V Y Y K I R S QA V QQ L K V S - -G E - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E T R S R Q I Q E L R K T H - E P - - - - -N P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S S K R QA F L I DQ -G Y A F K V Y F E NR S R T I M E L R QT K -D P - - - - -N P Y P HK F
HG G S R R Q E A QR L G R I L R A K A F F Y T L V S QDT E M F Y S L K R QR F L V NQ -G Y S F K T Y F K I R S QA V E A L K A A - -G D - - - - -H P Y P HK Y
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Y F K I R S QA I E E L K G A - -G E - - - - -D P Y P HK F
H F K S R R Q E V QR L G R I MR A K A F WY T L V S K G T E T S Y C L A R QK C L I NQ -G F K Y E I Y Y E NR C K A V QD L MT T G -K P - - - - -Y P Y P HK F
N F A S R R Q E A QR L G R I L R P K A F F Y S L V S K DT E MV F A DK R QQ F I I DQ -G Y A Y NV - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
N F A S R R Q E A QR L G R I L R P K A F F Y S L V S K DT E MV F A DK R QQ F I I DQ -G Y A Y NV Y Y L NR L E T V E QWR K N - -G - - - - - -T A Y P HK F
NY G A R MQ E S QR L G R V L R P K A F F Y S C I S DMT D L K Y S A R R QQ F L V DQ -G Y V Y E P Y HDR R L A E V T K QV E A H -R K D L S L P S P Y P HK F
L G A S R R Q E A QR L G R I L R P K S Y F Y T L V S QDT E I S Q S Y E R Q S W L R DQ -G F S Y R V Y Y DT R L A MV K E MG P L - -G - - - - - -A A Y P HK F
L G A S R R Q E A QR L G R I L R P K S Y F Y T L V S QDT E V QQ S Y G R Q S W L R DQ -G F A Y R V Y F DT R L A MV K E L G L L - -G - - - - - -A A Y P HK F
H F G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E M F Y S T K R QQ F L I DQ -G Y A F R V Y Y E R R F R T I S A L R E S K -N P - - - - -D P Y P HK F
HY G S R R Q E A QR L G R I L R A K A F F Y S L V S K DT E MY Y S T K R QA F L V DQ -G Y A F K V Y F E I R S R Q I DA L R Q S K -T P - - - - -N P Y P HK F

2500

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2510 2520 2530 2540 2550 2560 2570

S V T V S L G E F I E R Y - - - S G - L QDG E T L D -DV T V S V A G R V HA I R E S G V K L I F F D L R - - - - - -G E G L K L QV F F E E T A R L R R G D I I G
A V S M S I P K Y I E T Y - - -G S - L NNG DHV E -NA E E S L A G R I M S K R S S S S K L F F Y D L H - - - - - -G DD F K V QV F L K L H S NA K R G D I V G
NV T I G L P A F L NK Y - - -A H - L QR G E T L P - E E R V S I A G R I HA K R E S G S K L R F Y V L H - - - - - -A DG V E V QV Y E QDHG L L K R G D I V G
QV T DD L R E Y L K T Y - - -DG - L A K G E QK P -DV T V R I A G R I Y T K R S S G S K L F F Y D I R - - - - - -A E G V K V QV F E A QH E H L R R G D I V G
QV T DD L R K Y L T DY - - - E G - L A K G E QK P - E V A V R I A G R I Y T K R A S G A K L I F Y D I R - - - - - -A E G V K V QV F E A QH E H L R R G D I V G
----------------------------------------------------------------------------------
HV D I S L T H F I Q E Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I R I NNK L R R G D I I G
NV T I S L T D F I A K Y - - - S P - L QN E Q -V A -D E I V S V A G R I H S K R E S G S K L V F Y D I H - - - - - -G E G T H I Q I F V T L HDR I K R G D I V G
NV T I S L T D F I T K Y - - -T P - L E K E Q -V V - E E I V S V A G R I H S K R E S G S K L V F Y D I H - - - - - -G E G T H I Q I F V T L HDR I K R G D I V G
NV T T K I P E F V E K Y - - -A H - L QR G E T L K -DV T V S V S G R I MT K R E S G S K L K F Y V L K - - - - - -G DG V E V Q I F E S MH E I L R R G D I I G
QV S I S N P E F L A K Y - - -A H - L K R G E T L P -N E I V S I A G R I HA K R E S G S K L K F Y V L H - - - - - -G DG V E V QV Y E NDHD L I K R G D I V G
HV D I S L T H F I E E Y - - -G H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I H I NNK L R R G D I I G
HV S I S L S DY V E K Y - - -NN - I E V G S H L N -DQQV S I A G R I HA K R E A G P K L I F Y DV R - - - - - -G DG V K L QV Y Q E I N E R T R R G D I I G
NV T HA V P K F V E E WG K E G K - L E K G E T A Q L N E P I S L A G R V Y T I R E S S S K L R F Y D L K - - - - - -A DG V K V Q I Y L DT HDR I R R G D I I G
K I S M S L P A Y A L K Y - - -G N -V E NG Y I DK -DT T L S L S G R V T S I R S S S S K L I F Y D I F - - - - - -C E E QK V Q I F S V S H S E I R R G DV V G
K I S M S L P A Y A L K Y - - -G N -V E NG Y I DK -DT T L S L S G R V T S I R S S S S K L I F Y D I F - - - - - -C E E QK V Q I F S V S H S E I R R G DV V G
HV D L S L T E F I E R Y - - -NH - L Q P G DH L T -DV V L N L S G R V HA K R A S G A K L L F Y D L R - - - - - -G E G V K L QV F V H I NNK L R R G D I I G
HV S I Q L P A F A E K Y - - -K D - L K K G E S L K -DV E V K V S G R I MG K R E S G S K L K F Y V L K - - - - - -G DG V Q I Q I Y E K MH E Y L R R G D I I G
E V S HQ L P K F V E E F - - - S V - L E K DG E P S -T QV V S I A G R V L S K R A A G S G L V F Y D I T - - - - - -G E F NK V QV Y V K I NG L L R R G D I I G
HV S S S L E D F I A K Y - - E N S - L K E G E T L E -NV K L S V A G R V HA I R E S G A K L I F Y D L R - - - - - -G E G V K V QV F E I DT S K L R R G D I I G
NV S I S L E N F I E QY - - - S G - L T DG E T L E -K V S L S V A G R V HA I R E S G A K L I F Y D L R - - - - - -G E G V K L QV F E T DT A K L R R G D I I G
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M S MR L H - - - - - -A R F C F F V V T - - - - - - S NG E S L QV R E QMA K F L R R G DV V G
NV S I T V P E F I A K Y - - - S G - L E K S Q -V S -DD I V S V A G R V L S K R S S - S A L M F I D L H - - - - - -D S QT K L Q I F V S L T K M I Y R G D I C G
HV D L S L S D F I E R Y - - - S H - L Q P G DH L T -D I T V S V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV Y F R I NNK L R R G D I I G
HV D I S L T D F I QK Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I H I NNK L R R G D I I G
QV S V T L P E F L S K Y - - -A N - L K R G E T L P - E E K V S I A G R I HA K R E S G S K L K F Y V L H - - - - - -G DG V E V QV Y E DDH S L I K R G D I V G
HR QY T I P QY R R K Y - - -A P L L T E P DT S L -D E T V T I A G R I I NK R S S G S K L H F I T I Q - - - - - -G DM E I V QV F A E I H S K L K R G D I I G
HV D I S L T D F I QK Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I H I NNK L R R G D I I G
L V DY D P S Q F DK D F - - -K H - L K S G DV DK -T R E I R I A G R I F T K R S S G NK L I F Y D I K T G S DT T T T G S K MQ I F E QQH E H L G R G DV I G
E K NG D I C E I L V K F - - - E D - F E K N E G L S - - - -V R T A G R L Y N I R K HG -K M I F A D L G - - - - - -DQT G R I QV F A T F K N L MD S G D I I G
HV DT S L T H F I E QY - - -NN - L Q P G DH L T -D I T V R V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F F P I NNK L P R G D I F G
HV D I S L T Q F I Q E Y - - - S H - L Q P G DH L T -DV T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F V H I NNK L R R G D I I G
QV NY DD S N F V E E F - - -G S - L K T G E T L P - E K E L R I A G R I Y N I R T A G S K L I F Y D I R T S A DT K S I G T R MQV F E K QHA H L R R G D I I G
L A N I T V A DY I E K Y - - -K S -MNV G DK L V -DV T E C L A G R I MT K R A Q S S K L L F Y D L Y - - - - - -G G G E K V QV F I K F H S T L K R G D I V G
HV DT R V G E F I E K Y - - - S G - L A DG T T A E -G E S A S V A G R I M S K R A S G K K L Y F Y D L I - - - - - -A DG K K I QV F QK I H S A T R R G D I V G
HV D I S L T D F I QK Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F I H I NNK L R R G D I I G
QV D L T I A Q F R DK Y - - -G P L C T E K G K I H - E D F V S V A G R V V T I R S MG A K L M F Y D L Q - - - - - -G E G T K I QV F E K V HT L I K R G D I I G
HR N I T L P E F A E K Y - - - S S - L T R G E T L Q -DV E V K V T G R I MT K R E S G A K L R F Y V L K - - - - - -G DG V E V Q I Y E K MH E Y L R R G DV I G
E R T I S I P E F I E K Y - - -K D - L G NG E H L E -DT I L N I T G R I MR V S A S G QK L R F F D L V - - - - - -G DG E K I QV F A E C Y DK I R R G D I V G
E R T I T V P E F V E K Y - - -QN - L A S G E H L E -NT V L NV T G R I MR V S A S G QK L R F F D L V - - - - - -G DG A K I QV F A E A Y DK I R R G D I V G
E R T I T I P D F I E K Y - - -K D - L QNG E H L E - E T I L NMT G R I MR V S S S G QK L R F F D L V - - - - - -G DG K R I QV F V E C Y DK I K R G D I V G
F V T L S I P E Y I DK Y - - -G G - L S NG E H L E -DV S V S L A G R I M S K R S S S S K L F F Y D L H - - - - - -G L G A K V QV F S K L H S S V K R G D I V G
HV D I S L T Q F I Q E Y - - - S H - L Q P G DH L T -D I T L K V A G R I HA K R A S G G K L I F Y D L R - - - - - -G E G V K L QV F V H I NNK L R R G D I I G
HV S I S N P E F L A K Y - - -A H - L K K G E T L P - E E K V S I A G R I HA K R E S G S K L K F Y V L H - - - - - -G DG V E V Q L Y E K DHD L L K R G D I V G
QV T I T L P E F I A K Y - - - E G - L A R G E T K P - E V E V A V A G R V L G L R T A G NK L R F Y E I H - - - - - -A DG K K L QV F A A QH E H L R R G D I I G
HV T I S L T D F L E K Y - - -DY - L K A E D - I A -D E V L S L S G R V HA K R A S G A K L I F Y D L R - - - - - -G E G V K L QV F T R L N E K I R R G D I I G
HV DV S L T E F I E K Y - - -K N - L Q P G DQ L T -D -A V K V A G R V HA K R V S G A K L L F Y D L R - - - - - -G E G V K L QV F V A I NNK L R R G D I I G
DV S H S I S Q F I E E F - - -D P K L T E NG QT I -DT I V T I G A R I T S F R A S G K A L I F Y QV Q - - - - - -Q E G K K L QV F E E I N S L F K R G D I I G
- - -M S L K E Y V DK Y - - - E H - L E A G E H L E -N E L V S I A G R V S R I A S S S S K L R F L D I K - - - - - - S E G T K L QV F NDT Y NN I K R G D I I G
HV NM S L K E F V G K Y - - -DH - L E A G A H L E -N E L V S I A G R V S R I A S S S S K L R F L D I K - - - - - - S E G T K L QV F NDT Y NN I K R G D I I G
NV S HT F K Q F Y A Q F - - - E H - L K A G E E L P -DV K V S V A S R I A Q L R A HG -N L Y F F E MY - - - - - - E S T F K L Q L F K E E V S S F H L G D I V G
HR DY T L P A F R E C F - - -K P M L Q E K G QR L -DK V V T I A G R I V V K R S S S S K L H F L A L Q - - - - - -G DG E V L QV F A D I H S K I K R G D I I G
DR QY T I P A F K A R F - - -A P Q L S E K G QR V - E E V V A I A G R I V NK R S S G S K L N F L T L Q - - - - - -G DA DT V QV F A A V HG R I R R G D I I G
HV S I S L S E F I S K Y - - - E G K L E A G QH L D -Q E E V S I A G R L HNMR S S G QK L R F Y D L H - - - - - -G E G V K V QV F F A I H E L L R R G DV V G
NV T T K V D E F V E K Y - - -K G - L A R G E I K K -D E E V S V A G R V HT L R A A G S K L R F Y V L H - - - - - -Q E G K T V Q I DWG I HD L I R R G DV I G

2580

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

2590 2600 2610 2620 2630 2640

V T G V P G E L S A MA R R I K L L S P C L HM L P G L K DK E T R F R QR Y L D L I L NNNV R N I F V T R A L I I S Y V R R F F DN L G F L E V E T
V I G F P G E L S I F P R S F I L L S HC L HMM P V L K DQ E S R Y R QR H L DM I L NV E V R Q I F R T R A K I I S Y V R R F L DNK N F L E V E T
V E G Y V G E I S V F V S R I Q L L T P C L HM L P G F K DQ E T R Y R K R Y L D L I MNK DA R G R F I T R S K I I T Y I R K F L DNR D F I E V E T
V V G F P G E L S I F A T E V V L L A P C L HA I P G F QDK E QR F R QR Y L D L I MN E R S R NV F V T R S K I V R Y V R N F F D S R D F I E V E T
I V G F P G E L S I F A T E V V L L S P C L HA I P G L QDK E QR F R QR Y L D L I MNDK S R NV F V T R S K I V R Y I R N F F DNR D F V E V E T
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -MY K E V R Q I F Y T R A K I I A Y V R R F L DNMG F L E V E T
V QG N P G E L S I I P Y E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I I R S K I I T Y I R S F L D E L G F L E I E T
F T G QA G E L S L I P K E V L Q L T P C L HM L P G L K DK E L R F R K R Y L D L I L N P R V K DN F V I R S K I I T F L R R Y L DN L G F L E V E T
F T G R A G E L S L I P N E I L Q L T P C L HM L P G L K DK E L R F R K R Y L D L I L N P R V K DN F V I R S K I I T F L R R Y L DN L G F L E V E T
V T G Y P G E L S V F A T K V Q L L T P C L HM L P G F K DQ E A R Y R K R Y L D L I MND S S R E R F R V R S K I I QY I R K F L DNR D F V E V E T
V E G Y V G E I S V F V K R I E L L T P C L HM L P G F K DQ E T R Y R K R Y L D L I MNK D S R K R F I T R S K I I K Y I R K F L DNR D F I E V E T
V K G N P G E L S I I P Y E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I I R S K I I T Y I R S F L D E L G F L E I E T
V I G H P G E L S I V P NT I E I L S P C L HM L P G L K DK E T R Y R QR Y L D L I MNDQT R QK F I T R A K I I S Y I R S F F DQMG F L E V E T
V T G I P G E L S L S I S S I Q L L S P C L H L L P G V V D L E T R Y R K R Y L D L I MN P S T R D I F V T R S K V I NY I R K Y L DA QG F L E V E T
F T G F P G E L S L F S K S V V L L S P C Y HM L P G L K DQ E V R Y R QR Y L D L M L N E E S R K V F K L R S R A I K Y I R NY F DR L G F L E V E T
F T G F P G E L S L F S K S V V L L S P C Y HM L P G L K DQ E V R Y R QR Y L D L M L N E E S R K V F K L R S R A I K Y I R NY F DR L G F L E V E T
V R G N P G E L S I I P V E MT L L S P C L HM L P G L K DK E T R F R QR Y L D L I L ND F V R QK F V T R S K I I T Y L R S F L DQ L G F L E I E T
V T G Y P G E V S V F A T S V Q L L T P C L HM L P G F K DQ E A R Y R K R Y L D L I MN E S T R DR F K V R S Q I I S F I R K F L DT R D F T E V E T
A K G T P G E L S L F A T E V I L L S P C L HM L P G L T D P E T R F R QR Y L DM I C N E S V K K N F I I R S K V I QG V R R Y L DN L G F I E V E T
V V G H P G E L S V M P S E I K L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L NNNV R E K F Q I R A K I I S Y V R Q F L DR L G F L E I E T
V K G H P G E L S I M P T E I K L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L NNK V R E N F Q I R A K I I S Y V R Q F L DR L G F L E I E T
F T G N P L E A S V F A T D I I V L T P C L R T I P G L K D P E T I Y R K R Y MD L L I NR E S R NR F QK R A Q I I G Y I R S F L D S R G F L E V E T
F T G H P G E L S L I P I S G M I L S P C L HM L P G L G DQ E T R F R K R Y L D L I V N P E S V K N F V L R T K V V K A V R K Y L DDK G F L E V E T
V V G N P G E L S I I P Y E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L NDY V R QK F I T R A K I V T Y I R S F L D E L G F L E I E T
V QG N P G E L S I I P Y E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I I R S K I I T Y I R S F L D E L G F L E I E T
V E G Y V G E I S V F V S R I Q L L T P C L HM L P G F K DQ E L R Y R K R Y L D L I MNK DA R NR F I T R S K I I S Y V R K F L DT R N F I E V E T
I A G K P N E F S L K A T E I T L L S T C Y HM L P G L S S F E QR F R QR Y L D F I V NR DN I K T F I QR A N I I K Y I R K F F D E R D F V E V E T
V QG N P G E L S I I P Y E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I I R S K MV T Y I R S F L D E L G F L E I E T
I V G F P G E L S L F A T E V V Q L S P S L H L L P G F T DG E K R F R MR Y L D F M F NDK S R E V L WQR S R I V K Y I R D F F HDR R F I E V E T
I QG E L G E N S I S V S E F S L L S K S L C A L P G L K DV E T R Y R K R Y L D L I V NA E K R E I F V MR S K L I S E I R R F L A DR E F L E F E T
V P G N P G E L S L I P H E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I T R S K I I T Y I R S F L D E L G F L E I E T
V E G N P G E L S I I P Q E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I V R S K I I T Y I R S F L D E L G F L E I E T
I V G F P G E L S V F A T E V Q L L S P C L HM L P P F A DA E QR A R MR Y L DM L WNDR S R E T L WQR S R MV R Y I R D F F H E R R F I E V E T
V C G Y P G E L S I F P K K I V V L S P C L HMM P V L R DQ E T R Y R QR Y L D L MV NH E V R H I F K T R S K V V S F I R K F L DG L D F L E V E T
V K G T P G E L S L F P S N F E I L T P C L K M L P G L K DV E T R F R MR F L D L MMNN E V R DT F Y I R S N I I R Y I R K Y L DDR D F L E V E T
V QG N P G E L S I I P Y E I T L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I I R S K I I T Y I R S F L D E L G F L E I E T
V K G N P G E L S I A P G F I Q L L S P T L HM L P G F K DH E QR Y R MR Y L D L I MNK K V R D I F L T R S S V I K Q L R E Y F DG K G F I E V E T
V T G Y P G E V S V F A T S V Q L L T P C L HM L P G F K DQ E A R Y R K R Y L D L I MNDA T R DR F K V R S K I I G Y I R K F L DNR D F V E V E T
I V G F P G E L S I F P K E T I L L S A C L HM L P G L K DT E I R Y R QR Y L D L L I N E S S R HT F V T R T K I I N F L R N F L N E R G F F E V E T
I V G F P G E L S I F P K E T I I L S P C L HM L P G L K DT E I R S R QR Y L D L M I N E S T R S T F I T R T K I I NY L R N F L NDR G F I E V E T
I I G F P G E L S I F P K E T I V L S P C L HM L P G L K DT E I R Y R QR Y L D L L I N E S T R NV F I T R T K I I N F L R N F L NNQG F I E V E T
I T G F P G E L S I F P T S F MV L S HC L HMM P I L K DQ E T R Y R QR Y L D L M L N S E V R Q I F K T R S K I I K Y I QN F L DD L D F L E V E T
V E G N P G E L S I V P R E MT L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND F V R QK F I I R S K I I T Y I R S F L D E L G F L E I E T
V E G Y V G E V S V F V S R V Q L L T P C L HM L P G F K DQ E T R Y R K R Y L D L I MNK DA R NR F I T R S E I I R Y I R R F L DQR K F I E V E T
I R G Y P G E L S I F A R QC V L L S P C L R M L P G L K D L E I R HR QR Y L D L I MNR S T R DR F V MR S R I I QY I R H F F D S R D F M E V E T
V K G R P G E L S I L P S E I T L L S P C L HM L P G V T NK E T R F R QR Y L D L I MNDY V R DK F I T R S K I V S Y L R R F F D E L G F L E V E T
V C G N P G E L S I I P K E M I L L S P C L HM L P G L K DK E T R Y R QR Y L D L I L ND S V R QK F I T R S K I I T Y L R S F L DQMG F L E I E T
I T G K P G E L S I A P T K L Q L L S P C L HM L P G L K DM E T R Y R K R Y L D L I MNN S S R NN F I T R T K I I S Y I R R Y L DDR N F L E V E T
L T G F P G E L S V F P K S V K I L S P C L HM L P G L K DNDV R F R QR Y L D L MMNDD S L K V MK L R S R I I DY L R K F L T S R G F F E V E T
L T G F P G E L S V F P K S V Q I L S P C L HM L P G L K DNDV R F R QR Y L D L MMNDD S L K V MK L R S R I I DY L R K F L T S R G F F E V E T
A E G F P G E L S V V V T K L V L L A P C L F QM P K L E D L E V R Y R QR F F D L I V NR E NR Q I F E T R C K V V K M I R G F L DD L D F T E V E T
V R G V P G E F S M S A Y E I T L L S T C F HM L P G L S S V E QR F R QR Y L D L I V NR E NA K T F I L R S K I I S Y I R S F F DQK D F L E V E T
V K G V A G E F S MNA F E I T L L S T C Y HM L P G L S S I E QR F R QR Y L D F I V NR E N I QT F V T R S K V I R Y I R N F F E D L N F L E V E T
V T G V P G E L S I F P S S I K L L S P S L K M L P G F T DT E QR HR K R Y L D L I MNNHV R D I F V K R A K I I NY V R R F L DN L G F L E V E T
I R G Y P G E L S V F C K E L V L L T P S L HM L P G F K DV E T R F R QR Y L D L I MND S T R E R F I V R S K I I QY I R K F L DNK D F I E V E T

Alignment column ruler: 2650
P M M N M I P
P M M N M I A
P M M N V I A
P M M N A I A
P M M N A I A
P L M N M V P
P M M N I I P
P I M N Q I A
P I M N Q I A
P I L N V I A
P M M N V I A
P M M N I I P
P M M N M V A
P M M S M I A
P M L N M I Y
P M L N M I Y
P M M N L I P
P M M N V I A
P M M N M I A
P M M N M I A
P M M N M V A
P M M N L I P
P I L N T I P
P M M N I I P
P M M N I I P
P M M N V I A
P V L N Q I A
P M M N I I P
P M M T S I A
P I L Q T V Y
P M M N V I P
P M M N I I P
P M M H A I A
P M M N M I A
P M M N M I A
P M M N I I P
P S L N V I Q
P I L N V I A
P M M N L I A
P T M N L V A
P S M N L M A
P M M N M I P
P M M N I I P
P M M N V I A
P M M N M I A
P M M N M I A
P M M N I V P
P Q M N M I P
P M L K T T S
P M L K T T S
P I M W K T A
P M L N Q I A
P V L N Q I A
P M M N Q I A
P M M N I I A

Alignment column ruler: 2660

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

G G A T A K
G G A A A R
G G A T A K
G G A T A K
G G A T A K
G G A T A K
G G A V A K
G G A T A K
G G A T A K
G G A T A K
G G A T A K
G G A V A K
G G A T A K
G G A T A K
G G A A A R
G G A A A R
G G A V A K
G G A T A K
G G A A A K
G G A T A K
G G A T A K
G G A A A K
G G A T A R
G G A M A K
G G A V A K
G G A T A K
G G A A A R
G G A V A K
G G A T A L
G G A N A R
G G A V A K
G G A V A K
G G A T A L
G G A A A R
G G A T A R
G G A V A K
G G A T A K
G G A T A K
G G A N A R
G G A N A K
G G A S A R
G G A A A R
G G A V A K
G G A T A K
G G A T A K
G G A T A K
G G A V A R
G G A A A R
T G A S A K
T G A S A K
G G A T A K
G G A A A R
G G A A A R
G G A T A K
G G A T A K

Alignment column ruler: 2670 to 2730 (ticks every 10 columns)

P F I T HHND L NMD L F L R I A P E L Y L K M L T V G G L DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY ND I I D I T QQ L L
P F V T HHND L DMR L Y MR I A P E L Y L K Q L I V G G L E R V Y E I G K Q F R N E G I D L T HN P E F T T C E F Y MA F A DY ND L M E MT E V M L
P F V T HHND L DMQMY MR I A P E L F L K Q L V V G G MDR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E L L F
P F I T HHND L DMN L F MR V A P E L Y L K M L I V G G L E R V Y E L G R Q F R N E G I D L T HN P E F T T C E F Y WA Y A DV Y DV MN L T E E L I
P F V T HHND L DMN L F MR V A P E L Y L K M L I V G G L E R V Y E L G R Q F R N E G I D L T HN P E F T T C E F Y WA Y A DV Y DV MN L T E E L V
P F I T HHN E L NMD L Y MR I A P E L Y HK M L V V G G L DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY ND L MT I T E S I L
P F I T Y HN E L DMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K M I
P F I T HHND L DMN L F L R V A P E L Y HK M L V V G G I DR V Y E V G R L F R N E G I D L T HN P E F T T C E F Y MA Y A DY E DV I Q L T E D L L
P F I T HHND L DMN L F L R V A P E L Y HK M L V V G G I DR V Y E V G R L F R N E G I D L T HN P E F T T C E F Y MA Y A DY E DV I Q L T E D L L
P F T T HHND L NM E M F MR I A P E L F L K E L V V G G MDR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E L M F
P F V T HHND L DMDM F MR I A P E L F L K E L V V G G MDR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E L L F
P F I T Y HN E L DMN L Y MR I A P E L Y HK I L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K M I
P F I T HHND L DMD L Y MR V A P E L Y L K M L V V G G L DR V Y E I G R L F R N E G I DMT HN P E F T S C E F Y MA Y A DY E D L MK I S E T L I
P F V T HHND L K L D L F MR I A P E L Y L K E L V V G G L DR V F E I G R V F R N E Q I DMT HN P E F S I C E F Y MA Y A DMY D I MDMT E E L I
P F I T Y HN E L E T Q L Y MR I A P E L Y L K Q L I V G G L DK V Y E I G K N F R N E G I D L T HN P E F T A M E F Y MA Y A DY Y D L MD L T E E L I
P F I T Y HN E L E T Q L Y MR I A P E L Y L K Q L I V G G L DK V Y E I G K N F R N E G I D L T HN P E F T A M E F Y MA Y A DY Y D L MD L T E E L I
P F I T Y HND L NMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K L L
P F V T HHND L NMDM F MR I A P E L F L K E L V V G G MDR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E L L F
P F L T HHNA L NMD L F MR I A P E L Y L K Q L V V G G MDR V Y E I G K Q F R N E D I DHT HN P E F T T C E F Y MA Y A DY ND L Y T MT E Q L L
P F V T HHND L K MD L F MR I A P E L Y HK M L V V G G L DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY A D I MD I T E Q L V
P F V T HHN E L K MD L F MR I A P E L Y HK M L V V G G L DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY A DV MD I T E Q L I
P F I T HHN E L K L D L Y MR V S P E L Y L K K L V V G G L E R V Y E I G K Q F R N E G I D L T HN P E F T S C E F Y MA Y A DY ND L M E MT E E L I
P F I T HHNQ L D I QMY MR I A P E L Y L K E L V V G G I NR V Y E I G R L F R N E G I DQT HN P E F T T C E F Y MA Y A DY ND I MK MT E E L L
P F I T Y HN E L DMK L Y MR I A P E L Y HK M L V V G G L DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY R D L M E I T E K L L
P F I T Y HN E L DMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K MV
P F V T HHND L DMDMY MR I A P E L F L K E L V V G G MDR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E I M I
P F V T HHND L NQT M F L R I A P E L Y L K E L V V G G MDR V Y E I G K Q F R N E G I D L T HN P E F T S C E A Y WA Y MDY HDWMT A T E D L L
P F I T Y HN E L DMN L Y MR I A P E L Y HK I L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HDV M E I T E K MV
P F V T HHN E Y D L DM F MR I A P E L Y L K M L V V G G Y NK V F E I G K N F R N E G C D L T HN P E F T T I E A Y A A Y Y DMY DV MDY T E E L V
P F K T F HNC L G QN L F L R I A P E L Y L K R L V V G G Y E K V F E I S K N F R N E D I DT T HN P E F T M I E V Y E A Y R DY NDMMD L T E A L I
P F I T Y HN E L DMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K M I
P F I T Y HN E L DMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K M L
P F V T HHND L DMDM F MR V A P E L F L K K M I V G Q F G K V F E MG K N F R N E G I D L T HN P E F T S I E F Y WA Y A DV Y D L M S I T E E L V
P F V T HHN E L NMR L Y MR I A P E L Y L K E L V V G G L DR V Y E I G K Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY ND L I E L T E T M L
P F I T HHND L NMT L Y MR I A P E L Y L K Q L V V G G I E R V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DY DD L MQMT E E M I
P F I T Y HN E L DMN L Y MR V A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K MV
P F K T F HN S L HR D L F MR V A P E L Y L K M L I V G G L DR V Y E I G K N F R N E G I DQT HN P E F T A M E F Y WA Y C DY ND L MT V T E E V L
P F I T HHND L S MDM F MR I A P E L F L K E L V V G G MDR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E L M F
P F I T HHND L D L D L Y L R I A T E L P L K M L I V G G I DK V Y E I G K V F R N E G I DNT HN P E F T S C E F Y WA Y A DY ND L I K W S E D F F
P F I T HHND L D L D L Y L R I A T E L P L K M L I V G G I DK V Y E I G K V F R N E G I DNT HN P E F T S C E F Y WA Y A D F Y D L I K W S E D F F
P F I T HHND L D L D L Y L R I A T E L P L K M L I V G G L DR V Y E I G K V F R N E G I DNT HN P E F T S C E F Y WA Y A DY Y D L I K W S E E F F
P F K T HHND L NMK L Y MR I A P E L Y L K E L V V G G L DR V Y E I G K Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY ND L M E L T E K M L
P F I T Y HN E L DMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I D L T HN P E F T T C E F Y MA Y A DY HD L M E I T E K M L
P F I T HHND L DMDMY MR I A P E L F L K Q L V V G G L DR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y QA Y A DV Y D L MDMT E L M F
P F V T HHND L DMD L Y MR I A P E L Y L K M L V V G G L DR V Y E I G R Q F R N E G A D L T HN P E F T S I E F Y QA Y A DY Y D L MDT T E E L L
P F I T HHND L NMD L F MR V A P E L Y L K M L V V G G L QR V Y E I G R Q F R N E G I D L T HN P E F T T L E F Y MA Y A DY ND L MD I A E R L L
P F V T Y HND L DMN L Y MR I A P E L Y HK M L V V G G I DR V Y E I G R Q F R N E G I DMT HN P E F T T C E F Y MA Y A DY HD L M E I T E K L L
P F V T HHND L NMD I F MR I A P E L Y L K N L V V G G F E R V Y E I G K Q F R N E G I DR T HN P E F T S I E L Y QA Y A DY E DMMK L T E D L L
P F I T HHN E L D L D L F MR I A P E L P L K L I I I G G F E K V F E I G K C F R N E G I D P T HN P E F T S C E F Y WA Y A DY HD L MK L T E E L L
P F I T HHN E L D L D L F MR I A P E L P L K L I I I G G F E K V F E I G K C F R N E G I D P T HN P E F T S C E F Y WA Y A DY HD L MK L T E E F L
P F I T HHNA L D I D L W L R V A P E L F L K M L V V G G MNR V Y E L G K Q F R N E G I D L T HN P E F T S C E F Y MA Y A DY ND L MD L T E K L Y
P F I T HHN E L NQT MY L R I A P E L Y L K K L V V G G L DR V Y E I G K Q F R N E G I D L T HN P E F T S V E S Y WA Y A DY NDWM E T T E E L L
P F I T HHN E L NQR MY L R I A P E L Y L K E L V V G G MDR V Y E L G K Q F R N E G I D L T HN P E F T S V E A Y WA Y A DY NDWMR T T E D L F
P F V T Y HND L K L D L F MR I A P E L F L K E L V V G G L DR V Y E I G R V F R N E S I DQT HN P E F S I C E F Y MA Y A DMY D L MD I T E S M I
P F V T HHND L N L DMY MR I A P E L Y L K Q L V V G G M E R V Y E I G R Q F R N E G I DQT HN P E F T T C E F Y E A Y A DV Y D L M E T T E L L F

Alignment column ruler: 2750

Alignment column ruler: 2760 to 2820 (ticks every 10 columns)

S G MV HA I HG F T P P F R R I S M I S S L E E A V Q F L DA L C V K H E V - E C K P P R T A A R L L DK L V G E F L E E T C I N - P T F I C DH P Q I M S P L A K
S G MV K E L T G F T P P F R R I E M I G E L E E A NK Y L I DA C A R F DV -K C P P P QT T A R L L DK L V G E F L E P T C V N - P T F I I NQ P E I M S P L A K
S E MV K E I T G F A R P WK R I NM I E E L E E T G E F L K K I L S DNK M -DC P P P L T NA R M L DK L V G E - L E DT C I D - P T F I F G H P QMM S P L A K
S G L V K HV T G WK A P WR R V E M I P A L E E T G E F L K R V L K K T G V - E C S P P L T NA R M L DK L V G E F I E E T C V N - P T F I T G H P QMM S P L A K
S G L V K H I T G WK A P WR R V E M I P A L E E T G E F L K R V L K K T G V - E C S P P MT NA R M L DK L V G E F I E E T C V N - P T F I T G H P QMM S P L A K
S G MV Q S I HG F T P P F A R V P MMA T L E E A ND F L NK L C NT HQ I - E C S P P R T T A R L L DK L V S V F L E E E C I N - P T F I L DH P Q I M S P L S K
S G MV K H I T G F T P P F R K I S MV E E L E E T R R I L DD I C V A K DV - E C P P P R T T A R L L DK L V G E F L E V T C I N - P T F I C DH P Q I M S P L A K
S S MV L A I K G F T P P F K R V NMY E G L A E A R QT F DK L C R DNNV -DC S E P R T T A R L L DK L V G E Y L E S T F I S - P T F L I G H P Q I M S P L A K
S S MV M S I K G F T P P F K R V HMY DG L A E A R E V F DK L C R DNNV -DC S A P R T T A R L L DK L V G E Y L E S T F I S - P T F L I G H P Q I M S P L A K
S E MV K E I T G F S R P WK R V NM I E E L E E T G E F L K K V L K DNN L - E C S P P L T NA R M L DK L V G E - L E DA S I N - P T F I F G H P QMM S P L A K
S E MV K E I NG F A R P WK R I NM I E E L E E T G E F L QK V L K DNN L - E C P P P I T NA R M L DK L V G E - L E DT C I N - P T F I F G H P QMM S P L A K
S G MV K H I T G F T P P F R R I S MV E E L E E T R K I L DD I C L A R A V - E C P P P R T T A R L L DK L V G E F L E T T C I N - P T F I C DH P Q I M S P L A K
S G MV K Q I C G Y T P P F R R L R M L P D L E G A QA R L D E I C V K L G V - E C P P P R T T A R L L DK L V G DY L E V NC I N - P T F I T E H P E I M S P L A K
E G MV K S L T G F A R P WK R F DM I G E L E NT NK F L R E L C E K HNV -DC A E P K T N S R L L DK L V G E Y I E NQC V N - P S F I V G H P QV M S P L A K
S G L V L E I HG F T T P WK R F S F V E E I E E N I D F MV E MC E K HK I - E L P H P R T A A K L L DK L A G H F V E T K C T N - P S F I I DH P QT M S P L A K
S G L V L E I HG F T T P WK R F S F V E E I E E N I D F MV E MC E K H E I - E L P H P R T A A K L L DK L A G H F V E T K C T N - P S F I I DH P QT M S P L A K
S G MV K H I T G F T P P F R R I S MT Q E L E E MR K F L DD L C V QK E V - E C P P P R T T A R L L DK L V G D F L E V K C I N - P T Y I C DH P Q I M S P L A K
S E MV K K I T G F T R P WK R V NM I E E L E E T G K F L K Q I L I DHK L -DC S P P L T NA R M L DK L V G E - L E DA S I N - P T F I F G H P QMM S P L A K
Q S I V M S I HG F S S P WR K I DM I A D L E E C R E F L V K T C R E R K V - E C S A P QT T A R L L DK L V G E Y L E V QC I N - P T F I I NH P E I M S P L S K
S G MV K A I R G F T P P F K R V S M I K T L E E T NQ F L S Q L C A K HQV - E C P A P R T T A R L L DK L V G E F I E E F C V N - P T F I C E H P Q I M S P L A K
S G MV K S I R G F T P P F K R V S M I K T L E A T T D F L S Q L C V K HQV - E C P A P R T T A R L L DK L V G E F I E E E C I N - P T F I C E H P Q I M S P L A K
S G MV E NM F G F K R P F R V I S I L E E L N E T L E K L L S A C DK E G L - S V E K P R T L S R V L DK L I G HV I E P QC V N - P T F V K DH P I A M S P L A K
G NMV K D I T G F T A P F K R I S Y V HA L E E A L T F L K K QA I R F NA - I C A E P QT T A R V MDK L F G D L I E V D L V Q - P T F V C DQ P Q L M S P L A K
S G MV K H I T G F T P P F R R I S MV D E L E E T R R F F DD L C A V R NV - E C P P P R T T A R L L DK L V G E F L E V T C I N - P T F I C DH P Q I M S P L A K
S G MV K H I T G F T P P F R R I NMV E E L E E T R K I L DD I C V A K A V - E C P P P R T T A R L L DK L V G E F L E V T C I N - P T F I C DH P Q I M S P L A K
S E MV K E I T G F T R P WK R I NM I E E L E E T G E F L K K V L K DNK M -DC A P P L T NA R M L DK L V G E - L E DT C I N - P T F I F G H P QMM S P L A K
Y G L A V E L HG F S K P F K R L H I I P E L E A G I Q F L MD L C K K HK A -DC P P P Y T A P R L L DA L I A E F L E P E C HD - P C F I C DH P R V M S P L A K
S G MV K H I T G F T P P F R R I S M I E E L E E T R K I L DD I C V A K A V - E C P P P R T T A R L L DK L V G E F L E V T C I N - P T F I C DH P Q I M S P L A K
S G L V K H L T G WA R P WK R V K I M P E L E E T NQ F L R D L L K E K N I - E C T P P L T NA R M L DK L I G E Y L E E T C I N - P T F L M E H P Q L M S P L A K
S E L V F R L T G L R S P WK R I S M E G A L K H S L E E L K Q I A I QNR I E DY E K A K S HG E F L A L L F E G L V E DK L V N - P T F I Y D F P V E N S P L A K
S G MV K N I T G F T P P F R R I S MV E E L E E T R K I L DD I C V A R DV - E C P P P R T T A R L L DK L V G E F L E V T C I N - P T F I C DH P Q I M S P L A K
S G MV K S I T G F T P P F R R I S MV E E L E E T R K I L DD I C V A K A V - E C P P P R T T A R L L DK L V G E F L E V T C I S - P T F I C DH P Q I M S P L A K
S S L V K E L T G W E A P WR R V E M I P A L E E T NA F L QR I C K K MNV - E C P P P L T NA R M I DK L T G E F I E E T C I N - P T F I L E H P QMM S P L A K
S G MV K E L T G F T P P F R K I DM I E E L E E A NK Y L I DA C A K Y DV -K C P P P QT T T R L L DK L V G H F L E E T C V N - P T F I I NH P E I M S P L A K
S G MV Y A I K G F T P P F R R I S MV S G L E E N E D F L K E L I K K L G V - E M S P P Y T T A R M L D E L V G E Y L E S Q L V N - P G F I C DH P Q I M S P L A K
S G MV K H I T G F T P P F R R I NMV E E L E E T R K I L DD I C V A K A V - E C P P P R T T A R L L DK L V G E F L E V T C I N - P T F I C DH P Q I M S P L A K
S S I V L K L K G F T P P W P R V S MMA E L E E A NA F F V E QA K K HK V - E C S N P R T T A R L I DK L V G H F L E V N F R N - P T F L I DH P Q L M S P L S K
S E MV K E I T G F S R P WK R V NM I E E L E E T G K F L K Q I L I DNK L -DC T P P L T NA R M L DK L V G E - L E DA S I N - P T F I F G H P QMM S P L A K
S Q L V Y H L F G F T P P Y P K V S I V E E I E E T I E K M I N I I K E HK I - E L P N P P T A A K L L DQ L A S H F I E NK Y NDK P F F I V E H P Q I M S P L A K
S T L V MH L F G F T P P Y P K V S I V E E L E E T I NK M I N L I K E NK I - E M P N P P T A A K L L DQ L A S H F I E NQY P NK P F F I I E H P Q I M S P L A K
S K L V Y H L F G F T P P Y P K I S L V E E L E E T I NK M I N I I K E NN I - E M P N P P T A A K L L DQ L A S H F I E N I Y QNQ P F F I I E H P Q I M S P L A K
S G MV K E L T G F T P P F R R I DM I E E L E E A T K Y L V A A C E K F E V -K C P P P QT T T R L L DK L V G H F L E E T C V N - P S F I I NH P E I M S P L A K
S G MV R S I T G F T P P F R R I S MV E E L E E T R K I L DD I C V A R A V - E C P P P R T T A R L L DK L V G E F L E V T C I S - P T F I C DH P Q I M S P L A K
S E MV K E I T G F S R P WK R I NM I E E L E E T G E F L K K I L V DNK L - E C P P P L T NA R M L DK L V G E - L E DT C I N - P T F I F G H P QMM S P L A K
S G L V K D L T G F S R P WR R I NM I E Y L E E A NA F L R D L C A K HG V - E C A P P QT C S R L L DK L V G E F I E S E C I N - P T F I I G H P QMM S P L A K
S G MV K F V T G F T P P F R R V S M I N E L Q E T NK F L DD L C R K H E V - E C T S P R T T A R L L DK L V G E Y I E T QC I S - P T F I MDH P E I M S P L S K
S G MV K H I T G F T P P F R R L S MT HD L E E T R K F F DN L C A E K G V - E C P P P R T T A R L L DK L V G E F M E E T C I S - P T F I C DH P Q I M S P L A K
S S L V MK L T G F T P P F K R V P MM E T L S E A R E F F DK L C V QHNV -A C S A P R S T T R L I DK L V G H F I E V DC K N - P T F L M E H P Q I M S P L A K
S S L V F E L F G F T P P F QR V S MV E E L E E NV E K Y L T A I K E A G L -DM P K P P V P A K L I DQ L V G HY I E DQ I V K - P T F I V D F P QC T S P L S K
S S L V F E L F G F T P P F NK V S MV E E L E E NV E K Y L T A I K E A G L -DM P K P P V P A K L I DQ L V G HY I E DQ I V K - P T F I V D F P QC T S P L S K
QK I V M E V K G F S S P WQR I DM I E E L E E V R E L L E K K C K E L DV -DV P P P MT V A R M L DK MV G K F V E P L C V N - P T F MC NH P QV M S P L A K
Y G MV MH L Y G F NR P F K R L H I V P K L E E A N S F F L D I C K K NQV - E C N P P F T T T R L L DA L V S HY L E P QC HD - P T F L C DH P R I M S P L A K
Y G L A MH I HG F NK P F K R L Y I I P E L E S S NA F L Q E L C S K H E V - E C T P P L T T A R L L DA L I S HY L E P E C QD - P T F V C DH P R V M S P L A K
S G L V K A V T G F S T P WK R F DM I K E L E E T R K W L S D L A A K HNV -DC S E P R T S S R L I DK MT G E F I E T QC I N - P S F I V G H P QV M S P L A K
S E MV K E I T G F S R P WK R L D I I G T L E E T NQ F L Q E Q L K K V G L -V C T P P L T NA R M L DK L I G DY L E DT C I N - P T F L Y G H P E MM S P L A K

Alignment column ruler: 2830

Y H R S A T G L T E R
W H R S K S G L T E R
H S R D Q P G L C E R
Y H R Q H A G L C E R
Y H R Q N V G L C E R
Y H R D V P G L T E R
W H R S K E G L T E R
W H R S I P G L T E R
W H R S I P G L T E R
K D R N I P G L C E R
Y S R D Q P G L C E R
W H R S K E G L T E R
W H R S I K G L T E R
Y D R S R P G L C E R
W H R E K P E M T E R
W H R E K P E M T E R
W H R S Q K G L T E R
K D R D N V G L C E R
Y H R E K P Q L T E R
Y H R S I P G L T E R
Y H R S A P G L T E R
N H R S K A G L T E R
Y H R S E P E L T E R
W H R I H R G L T E R
W H R S K E G L T E R
Y S R D Q P G L C E R
W H R N D P R L T E R
W H R S K E G L T E R
Y H R T E K G I S E R
N H R E K E G F V E R
W H R S K N G L T E R
W H R S K E G L T E R
Y H R S K N G L C E R
W H R S R P G L T E R
Y H R N I P G M T E R
W H R S K E G L T E R
V H R Q Y P G L T E R
K D R N I P G L C E R
Y H R T K P G L T E R
Y H R S K P G L T E R
Y H R S K P G L T E R
W H R S K P G L T E R
W H R C K E G L T E R
Y S R D Q P G L C E R
Y H R S D A G L C E R
W H R S I P G L T E R
W H R S E K G L T E R
Y H R S K P N V T E R
W H R S K E N V C E R
W H R S K E N V C E R
W H R T K P G I V E R
W H R K D L R L S E R
W H R N D P Q L T E R
R H R D I P G L C E R
Y S R D R P G I C E R

Alignment column ruler: 2840 to 2900 (ticks every 10 columns)

F E L F V MR K E V C NA Y T E L ND P A V QR E R F E QQA A DK A A G DD E A Q L V D E N F C T A L E Y G L P P T G G WG MG I DR L T M F
F E L F I NK H E L C NA Y T E L ND P V V QR QR F A DQ L K DR Q S G DD E A MA L D E T F C NA L E Y G L A P T G G WG L G I DR L S M L
F E V F V A T K E I C NA Y T E L ND P F DQR A R F E E QA R QK DQG DD E A Q L I D E T F C NA L E Y G L P P T G G WG C G V DR L A M F
F E A F V C K K E I V NA Y T E L ND P F DQR L R F E E QA R QK DQG DD E A Q L I D E N F C T S L E Y G L P P T G G WG MG I DR L V M F
F E A F V C K K E I V NA Y T E L ND P F DQR L R F E E QA R QK DQG DD E A Q I I D E N F C T S L E Y G L P P T G G WG MG I DR L V M F
F E V Y V A K K E I C NA Y T E L ND P A T QR E R F E E QA K NR A A G DD E T P P T D E A F C T A L E Y G L P P T G G WG L G V DR L T M F
F E L F V MK K E I C NA Y T E L ND P V R QR Q L F E E QA K A K A A G DD E A M F I D E T F C T A L E Y G L P P T G G WG MG I DR V T M F
F E L F A V T R E I A NA Y T E L ND P I T QR QR F E QQA K DK DA G DD E A QM I D E T F C NA L E Y G L P P T G G WG MG I DR L S M I
F E L F A V T R E I A NA Y T E L ND P I T QR QR F E QQA K DK DA G DD E A QM I D E T F C NA L E Y G L P P T G G WG MG I DR L S M I
F E V F V A T K E I C NA Y T E L ND P F DQR A R F E E QA R QK A QG DD E A QMV D E T F C NA L E Y G L P P T A G WG C G I DR L A M F
F E V F V A T K E I C NA Y T E L ND P F DQR A R F E E QA R QK DQG DD E A Q L I D E T F C NA L E Y G L P P T G G WG C G I DR L A M F
F E L F V MK K E I C NA Y T E L ND P V R QR Q L F E E QA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T A G WG MG I DR V T M F
F E L F V NK K E I C NA Y T E L ND P M I QR QR F E QQA L DK A A G DD E A QMV D E N F C T A L E Y G L P P T G G WG MG I DR L T M F
F E A F L C T K E I C NA Y T E L ND P F DQR E R F M E QV R QK E QG D E E A QG V D E T F L DA L E Y G L P P T G G WG L G I DR L V M F
F E L F V L G K E L C NA Y T E L N E P L QQR K F F E QQA DA K A S G DV E A C P I D E T F C L A L E HG L P P T G G WG L G I DR L I M F
F E L F V L G K E L C NA Y T E L N E P L QQR K F F E QQA DA K A S G DV E A C P I D E T F C L A L E HG L P P T G G WG L G I DR L I M F
F E L F V MK K E I C NA Y T E L ND P I R QR E L F E QQA K A K A E G DD E A M F I D E T F C T A L E Y G L P P T A G WG MG I DR L T M F
F E V F V A T K E I C NA Y T E L ND P F DQR QR F E E QA R QK A QG DD E A QMV D E T F C NA L E Y G L P P T A G WG C G I DR L A M F
F E L F V NT K E I C NA Y T E L NN P F V Q I E R F A E QA K A K A A G DD E S M L I DK V F T T S L E Y G L P P T G G F G L G I DR F A M L
F E L F V A K K E I C NA Y T E L ND P V V QR E R F E QQA S DK A A G DD E A Q L V D E N F C T S L E Y G L P P T G G F G MG I DR L A M F
F E L F V A K K E I C NA Y T E L ND P V V QR E R F E QQA S DK A A G DD E A QMV D E N F C T A L E Y G L P P T G G F G MG I DR L T M F
F E L F I NC K E I C NA Y T E L NN P F E QR E R F L QQT QD L NA G DD E A MMND E D F C T A L E Y G L P P T G G WG I G I DR L V MY
F E L F I L K R E I A NA Y T E L NN P I V QR S N F E QQA K DK A A G DD E A Q L V D E V F L DA I E HA F P P T G G WG L G I DR L A M L
F E L F V MK K E V C NA Y T E L ND P F QQR Q L F E DQA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T A G WG MG I DR F T M F
F E L F V MK K E I C NA Y T E L ND P MR QR Q L F E E QA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T A G WG MG I DR V A M F
F E V F V A T K E I C NA Y T E L ND P F DQR A R F E E QA NQK A QG DD E A Q L V D E T F C NA L E Y G L P P T G G WG C G I DR L A M F
F E L F V NK K E L A NA Y T E L NN P I V QR E E F L K QV R NR DK G DD E S M E I D E G F V A A L E HA L P P T G G WG L G I DR L V M F
F E L F V MK K E I C NA Y T E L ND P V R QR Q L F E E QA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T G G F G MG L DR V A M F
F E G F V C K K E I C NA Y T E L NN P F DQR L R F E E QA R QK A QG DD E A QM I D E N F L R S L E Y G L P P T A G WG L G I DR L C M F
F E L F L NG W E L A NG Y S E L ND P L E Q E K R F E E QDK K R K L G D L E A QT V DY D F I NA L G Y G L P P T G G MG L G I DR L T M I
F E L F V MK K E I C NA Y T E L ND P V R QR Q L F E E QA K A K A A G DD E A MV I DDN F C T A L E Y G L P P T A G WG MG I DR L T M F
F E L F V MK K E I C NA Y T E L ND P V R QR Q L F E E QA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T A G WG MG I DR L T M F
F E A F V C K K E I A NA Y T E L NN P F DQR L R F E E QA R QK DQG DD E A Q L V D E S F L NA L E Y G L P P T G G WG L G I DR L A M F
F E L F V NK H E V C NA Y T E L ND P V V QR QR F E E Q L K DR Q S G DD E A MA L D E T F C T A L E Y G L P P T G G WG L G I DR L T M L
F E L F V NT K E L C NA Y T E L ND P I DQR E R F D E QA K A K S S G DD E A M L I D E V F V Q S L E Y G L P P T G G WG L G V DR L T M L
F E L F V MK K E I C NA Y T E L ND P MR QR Q L F E E QA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T A G WG MG I DR V A M F
F E L F V NY H E L C NA Y T E L ND P F V QK A L F QK QV E DA A K G DD E A MG Y D E G F I K S L E HA L P P T A G WG L G I DR F V M L
F E V F V A T K E I C NA Y T E L ND P F DQR QR F E E QA R QK A QG DD E A Q L V D E V F C NA L E Y G L P P T A G WG C G I DR L A M F
L E M F I C G K E V L NA Y T E L ND P F K QK E C F K L QQK DR E K G DT E A A Q L D S A F C T S L E Y G L P P T G G L G L G I DR I T M F
L E M F I C G K E V L NA Y T E L ND P F K QK E C F S A QQK DR E K G DA E A F Q F DA P Y C T S L E Y G L P P T G G L G L G I DR I T M F
L E M F I C G K E V L NA Y T E L ND P F K QK E C F A S QQK DK E K G DT E A F HC DA A F C T S L E Y G L P P T G G L G L G I DR I T M F
F E L F V NK H E L C NA Y T E L ND P V V QR QR F E A Q L K DR Q S G DD E A MA L D E T F C MA L E Y G L P P T G G WG L G I DR L A M L
F E L F V MK K E I C NA Y T E L ND P V R QR Q L F E E QA K A K A A G DD E A M F I D E N F C T A L E Y G L P P T A G WG MG I DR V T M F
F E V F V A T K E I C NA Y T E L ND P F DQR A R F E E QA R QK DQG DD E A Q L V D E T F C NA L E Y G L P P T G G WG C G I DR L A M F
F E A F V A T K E I C NA Y T E L ND I F DQR A R F E E QA R QK A QG DD E A Q I I D E N F C T A L E Y G L P P T G G WG MG V DR L V M F
F E L F V A R K E I C NA Y T E L ND P MV QR E R F A T QA K DHA A G DD E A Q L I D E N F C T A L E Y G L P P T G G F G L G I DR L A M F
F E L F V MK K E I C N S Y T E L ND S V R QR E L F E QQA K A K A E G DD E A M F I D E T F C T A L E Y G L P P T A G WG MG I DR L C M F
F E L F V NY Y E L C NA F T E L ND P F K QR K I F V QQ I E E K NK G DV E A MG Y DK D F C DC L E HA L P P T G G WG L G I DR L V M L
F E L F I C G K E L I N S Y T E L ND P I T QR E C F K QQQK A K D L G DD E A Q P P D E A F C T A L E Y G L P P T A G WG I G I DR L A M F
F E L F I C G K E L I N S Y T E L ND P I T QR DC F K QQQK A K D L G DD E A Q P P D E A F C T A L E Y G L P P T A G WG I G I DR L T M F
F E V F I NG L E Y A NA Y T E L NC P MV QR E L F L DQ L K A K A A K DD E A M P Y DDT F C T A L E Y A L P P T A G WG C G V DR L V M L
F E L F I NK K E I C N S Y T E L N S P L V QR E E F E R Q L R DR E K G DD E A MD I D E G Y V QA L E Y A L P P T G G WG L G I DR L V MY
F E L F L NK K E L C NA Y T E L NN P I V QR E E F MK Q L R NK E K G DD E A MD I D E G F V QA L E HA L P P T G G WG L G I DR L V M F
F E V F V A T K E I C NA Y T E L ND P WV QR A N F E E Q S R QK DQG DD E A QG I DHV F I DA L E HG L P P T G G WG L G I DR L V M F
F E V F V A T K E I C NA Y T E L ND P F DQR QR F E E QA R QK DA G DD E A Q L V D E T F C T A L E Y G L P P T A G WG C G V DR L T M F

2910

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

L T D S NN I
LT D S LN I
L T D S NT I
L T DNY S I
L T DNY S I
L T D S NN I
L T D S NN I
L T DNNN I
L T DNNN I
L T D S NT I
L T D S NT I
L T D S NN I
L T D S NN I
L T DC S N I
L A DK NN I
L A DK NN I
L T D S NN I
L T D S NT I
M S DT Y N I
L T D S NN I
L T D S NN I
L T DA A N I
L A DV DN I
LT D S SN I
L T D S NN I
L A D S NT I
LT SQ SN I
L T D S NN I
L T NNA T I
LAG L E S I
L T D S NN I
L T D S NN I
L T DNY S I
L T D S QN I
L A DK NN I
L T D S NN I
L T DT QN I
L T D S NT I
L T NK N S I
L T NK NC I
L T NK N S I
L T D S QN I
L T D S NN I
L T D S NT I
L T D S NT I
L T D S NN I
L T D S NN I
L T DN I Y I
L S DK NN I
L A DK NN I
L T NQV S I
L T S QNN I
L T S QA N I
LT D SN S I
L T N S NT I

2920 2930 2940 2950 2960 2970 2980

K E V L L F P A MK P R QA A E A HR QT R QY MQ -K W I K P G MT M I E I C E E L E NT A R G L A F P T G C S R NHC A A HY T P NA G D - P T V L
K E V L F F P A MR P R R A A E V HR QV R K Y V R - S I V K P G M L MT D I C E T L E NT V R G I A F P T G C S L NWV A A HWT P N S G D -K T V L
R E V L L F P T L K P R K G A E I HR R V R R H L Q -NR L R P G QT L T E V V E L V E NA T R G I G F P T G V S L NHC A A H F T P NA G D -T T V L
K E V L A F P F MK E R QA A E V HR QV R QY A Q -K T I K P G QT L T E I A E G I E DA V R G MG F P C G L S I NHC A A HY T P NA G N -K MV L
K E V L A F P F MK E R QA A E V HR QV R QY A Q -K T I K P G QT L T E I A E G I E E S V R G MG F P C G L S I NHC A A HY T P NA G N -K MV L
K E V L L F P A MK P R HA A E A HR QT R K H I R -NW I K P G MT M I D I C E E L E K T A R G L A F P T G C S R NHC A A HY T P NT G D -T T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K M E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
K E V L L F P A MR P R R S A E A HR QV R QY V K - S W I K P G M S M I E I C E R L E T T S R G L A F P T G C S L NHC A A HY T P NA G D -T T V L
K E V L L F P A MR P R R S A E A HR QV R K Y V K - S W I K P G MT M I E I C E R L E T T S R G L A F P T G C S L NHC A A HY T P NA G D -T T V L
R E V L L F P T L K P R K G A E I HR R V R HK A Q - S S I R P G MT M I E I A N L I E D S V R G I G F P T G L S L NHV A A HY T P NT G D -K L I L
R E V L L F P T L K P R K G A E I HR R V R K NV Q -NK L K P G M L L T E V A D I I E NA T R G I G F P T G L S L NHC A A HY T P NT G D -K T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
K E V L L F P A MK P R QA A E T HR QV R HHV Q - E F I K P G L S M I E I C E R L E QA S R G L A F P T G C S L NNC A A HY T P NA G D -K T V L
K E V L L F P A MR P R R A G E V HR QV R A Y A Q -K A I K P G MT MT E I A N L I E DG T R G I G F P T G L S V N E V A A HY T P N P G D -K QV L
K E V I L F P A MR NR R A A E V HR QV R K Y MQ - S I I R P E MK L I DMC N I L E S K V K G WG F P T G C S L NHC A A HY T P N P HD - F T K L
K E V I L F P A MR NR R A A E V HR QV R K Y MQ - S I I R P E MK L I DMC N I L E S K V K G WG F P T G C S L NHC A A HY T P N P HD - F T K L
K E V L L F P A MK P R QA A E A HR QV R K Y V Q - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NHC A A HY T P NA G D - P T V L
R E V L L F P T L K P R K G A E I HR R V R HK A Q - S S I R P G MNMT E I A D L I E N S V R G I G F P T G L S L NHV A A HY T P NA G D -K T V L
K E V I L F P A MK P R R A A E V HR QV R K Y V Q -G I V K P G L G L T E L V E S L E NA S R G I A F P T G V S L NH I A A H F T P NT G D -K T V L
K E V L L F P A MK P R QA A E A HR QT R QY MQ -R Y I K P G MT M I Q I C E E L E NT A R G L A F P T G C S L NHC A A HY T P NA G D - P T V L
K E V L L F P A MK P R QA A E A HR QT R QY MQ -R F I K P G MT M I Q I C E E L E NT A R G L A F P T G C S L NHC A A HY T P NA G D - P T V L
R DV I F F P T MK P R R A A E A HR R A R Y R V Q - S I V R P G I T L L E I V R S I E D S T R G I G F P A G M S MN S C A A HY T V N P G E QD I V L
K E V I L F P T MR P R K A A A I HK S V R QWA Q -QW I K P G M S D L F V A E N I E R K V R G MA F P C G L S V N S C A A H F T P N P ND P L S F Y
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D - P T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
R E V L L F P T L K P R K G A E I HR R V R E S V R -NK I K P G MT L T E I A N L V E DG T R G I A F P T G L S L NHC A A H F T P NA G D -K T V L
K E V L L F P A MK P R E A A E V HR QV R T WA Q - S W I K P G L S L M L MT DR I E K K L NG QA F P T G C S L NHV A A HY T P NT G D E K V V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
R E V L A F P F MR DR HG A E A HR QA R R WA H -K HV K P G M S L T D I A NG I E D S V R G MG F P T G L S I NHC A A HY T P NA G N -K MV L
K E V I L F P QMK R R E A G R I L K I V R T E A A -DM I R V G N S L L E V A E F V E K K T I - -A F P C N I S R NQ E A A HA T P K A G D -QDV F
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D - P T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
R E V L A F P F L R E R HA A E V HR QV R QWA Q -K S I K P G QT L T E I A E N I E D S V R G MG F P T G L S I NHC A A HY T P NA G N -K MV L
K E V L L F P A MK P R R A A E V HR QV R K HMR - S I L K P G M L M I D L C E T L E NMV R G I A F P T G C S L NWV A A HWT P N S G D -K T V L
K E V L L F P A MK P R QC A E V HR E V R QY I S -DWV K P G MK Y I DV C E T L E N S V R G V A F P T G C S K NHV A A HWT P NG G C - E S V I
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
Q E V L L F P A MK P R K A A E C HR QV R QY A QA K L L K P G NK L I D I C E K L E DMNR G I A F P T G C S L N F C A A HY T P NNG D -NT I L
R E V L L F P T L K P R K G A E I HR R V R HK A Q - S S I R A G M S MT E I A D L I E N S V R G I G F P T G L S L NHV A A HY T P NT G D -K L S L
K DV I L F P T MR P R K A A E C HR QV R K HMQ -A F I K P G K K M I D I A Q E T E R K T K G WG F P T G C S L NHC A A HY T P NY G D - E T V L
K DV I L F P T MR P R K A A E C HR QV R K Y I Q -A Y V Q P G R K M I D I V K E T E K K T K G WG F P T G C S L NHC A A HY T P NY G D - E T V L
K DV I L F P T MK P R K A A E C HR QV R K Y I Q - S Y I K P G R K M I D I V QK T E QK T K G WG F P T G C S L NNC A A HY T P NY G D - E T V L
K E V L L F P A MK P R QA A E V HR QV R K Y MK - S I L K P G M L MMD L C E T L E NT V R G I A F P T G C S L NWV A A HWT P N S G D -K T V L
K E V L L F P A MK P R E A A E A HR QV R K Y V M - S W I K P G MT M I E I C E K L E DC S R G L A F P T G C S L NNC A A HY T P NA G D -T T V L
R E V L L F P T L K P R K G A E I HR R V R R A I K -DR I V P G MK L MD I A DM I E NT T R G I G F P T G L S L NHC A A H F T P NA G D -K T V L
R E V L L F P HMK P R R A A E V HR QA R QY A Q - S V I K P G M S MMDV V NT I E NT T R G I G F P T G V S L NHC A A HY T P NA G D -T T I L
K E V L F F P A MK P R QA A E A HR QV R K HV Q -G F I K P G MT M I D I C E R L E T A S R G L A F P T G C S R NHC A A HY T P NA G D -T T V L
K E V L L F P A MK P R QA A E A HR QV R A Y V R - S W I K P G MT M I D I C E K L E DC S R G L A F P T G C S I NHC A A HY T P NA G D - P T V L
Q E V L L F P A MK P R K A A E C HR QV R K Y C Q -Q L I R P G K K L I D I C E S I E E MNR G I A F P T G C S L NHV A A HY T P NNG D - F T T I
K V F I G V I I I V -R R A A E V HR QV R R Y I Q - S V I R P G V S C L D I V QA V E S K T K G WG F P T G C S L N S C A A HY T P NY G D -K T V F
K E V I F F P T MR P R K A A E V HR QA R R Y I Q - S V I K P G L S C L D I V QA L E F K T K G WG F P T G C S L N S C A A HY T P NHG D -K T I F
R E V L L F P L MK P R E G A E I HR R V R R WA M E NV I K P G V K L Y DMC A Q I E E A V R G L A F P C G C S I NNC A A HY T P MY NT DQR V L
K E V L L F P A MK P R C A A E V HR QV R R Y A Q - S F I K P G I S L L S MT DR I E K K L E G QA F P T G C S L NHV A A HY T P NT G D -K C V L
K E V L L F P A MK P R HA A E V HR QV R R Y A Q - S F I K P G I S L I S MT DR I E R K V E G QA F P T G C S L NHV A A HY T P NT G D -K T V L
K E V L A F P A NK P R R A A E V HR QV R QY A Q - S A I K P G MT MT E I A E L V E DG T R G I G F P T G V S V N E C A A HY T P NA G D -K R V L
K E V L L F P A MK P R K G A E I HR V V R K Y A R -DN I K A G MT MT S I A E M I E D S V R G QG F P T G V S L NHC A A HY T P NA G D -K I V L

2990

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

3000 3010 3020 3030 3040 3050 3060

L Y DDV T K I D F G T H I K G R I I DC A F T L S F N P - -K Y DK L L E A V K E A T E T G I R E A G I DV R L C D I G A A I Q E V M E S Y E V E L DG K T Y QV K
QY DDV MK L D F G T H I DG H I I DC A F T V A F N P - -M F D P L L A A S R E A T Y T G I K E A G I DV R L C D I G A A I Q E V M E S Y E V E I NG K V F QV K
R H E DV MK V D F G V QV NG H I I D S A WT V T F D P - -R Y D P L L E A V R E A T Y T G I R E A G I DV R L T D I G E A I Q E V M E S Y E V T L G G QT Y QV R
QQG DV MK V D F G A H I NG R I V D S A F T MT F D P - -V Y D P L L E A V K DA T NT G I R E A G I DV R M S D I G A A I Q E A M E S Y E V E L NG T MY P V K
QQG DV MK V D F G A H I NG R I V D S A F T V A F D P - -V Y D P L L A A V K DA T NT G I R E A G I DV R M S D I G A A I Q E A M E S Y E V E I NG T MY P V K
E Y DDV V K I D F G T H I NG R I I DC A F T L H F N P - -R Y D P L V K G V Q E A T E A G I K A S G V DV R L C DV G A A V Q E V M E S H E V E L DG QMY - -
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
QY G DV C K I DY G I HV R G R L I D S A F T V H F D P - -K F D P L V E A V K E A T NA G I R E S G I DV R L C DV G E V V E E V MT S H E V E L E G K T Y V V K
QY G DV C K I DY G I HV R G R L I D S A F T V H F D P - -K F D P L V E A V R E A T NA G I K E S G I DV R L C DV G E I V E E V MT S H E V E L DG K S Y V V K
K K DD I MK V D I G V HV NG R I C D S A F T MT F N E DG K Y DT I MQA V K E A T Y T G I K E S G I DV R L ND I G A A I Q E V M E S Y E M E E NG K T Y P I K
K Y E DV MK V D I G V QV NG H I V D S A WT V S F D P - -QY DN L L A A V K DA T Y T G I K E A G I DV R L T D I G E A I Q E V M E S Y E V E I K G K T Y QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
S Y DDV C K I D F G T H I NG R I I DC A F T V S F N P - -K Y DR L L E A V K DA T NT G I K NA G I DV R L C DV G A A I Q E T M E S Y E V E I DG K T Y QV R
QQHDV MK V D F G V HV NG R I V D S A F T M S F E P - -T WDK L L E A V K DA T NT G I R E A G I DV R MC D I G E A I Q E V M E S Y E V E V NG K V Y P V K
T QDD I C K L D F G V QV NG M I I DC A F T V A F ND - -V F D P L I Q S T L DA T NT G L K V A G I DV M F S E I G S A I E E V I K S Y E F E Y K S K V Y N I K
T QDD I C K L D F G V QV NG M I I DC A F T V A F ND - -V F D P L I Q S T L DA T NT G L K V A G I DV M F S E I G S A I E E V I K S Y E F E Y K S K V Y N I K
QY DDV C K I D F G T H I NG R I I DC A F T V T F N P - -K Y DK L L E A V K DA T NT G I K C A G I DV R L C D I G E S I Q E V M E S Y E V D L DG K T Y QV K
NY E DV MK V D I G V HV NG H I V D S A F T L T F DD - -K Y D S L L K A V K E A T NT G V K E A G I DV R L ND I G E A I Q E V M E S Y E M E L NG K T Y P I K
K K DDV L K I D F G T HV NG Y I I DC A F T V T F D E - -K Y DK L K DA V R E A T NT G I Y HA G I DA R L G E I G A A I Q E V M E S H E I E L NG K T Y P I R
QY DDV C K I D F G T H I K G R I I DC A F T L T F NN - -K Y DK L L QA V K E A T NT G I R E A G I DV R L C D I G A A I Q E V M E S Y E I E L DG K T Y P I K
QY DDV C K I D F G T H I K G R I I DC A F T L T F NN - -K Y DK L L QA V K E A T NT G I K E A G I DV R L C D I G A A I Q E V M E S Y E V E L DG K T Y P I K
K E DDV L K I D F G T H S DG R I MD S A F T V A F K E - -N L E P L L V A A R E G T E T G I K S L G V DV R V C D I G R D I N E V I S S Y E V E I G G R MW P I R
K T DDV V K I D F G V HV NG H L I D S A F T MT WD P - -A L Q P I L DC S K DA T NT G I K N I G V DV R L C D I G DA I E E V M S S Y E V E I K G K T Y Q L Q
HY DD I C K I D F G T Y Y S G R I I DC A F T V T F N P - -K Y DR L L E A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
K F E DV MK V D F G V HV NG Y I I D S A F T I A F D P - -QY DN L L A A V K DA T NT G I K E A G I DV R L T D I G E A I Q E V M E S Y E V E I NG E T HQV K
T Y DDV MK V D F G T H I NG R I I DC A WT V A F N P - -M F D P L L QA V K E A T Y E G I K QA G I DV R L G D I G A A I E E V M E S H E V E I NG K V HQV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
E HDDV L K V D I G V HV NG R I V D S A F T V A F N P - -R Y DN L L A A V K DA T NT G I R E A G I DA R L G E I G E A I Q E T M E S Y E V E I DG E T Y P V K
G -NDMV K L D L G V HV DG Y I A D S A V T V D L S G - -N S D - I V K A S E E A L A A A I D L MK P G V S T G E I G A A I E E R I H S - - - - - - - - -Y G L K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y D I L L T A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
Q E DDV MK V D F G V HV NG R I V D S A F T V A F N P - -R Y D P L L E A V K A A T NA G I K E A G I DV R V G D I G A A I Q E V M E S Y E V E I NG QM L P V K
QY DDV MK L D F G T H I DG Y I V DC A F T V A F N P - -M F D S L L QA S K DA T NT G V K E A G I DA R L C DV G A A I Q E V M E S Y E V E I NG K V F Q I K
DK DDV I K F D F G V QV K G R I I DC A F T K T F ND - -MY D P L L K A V N E A T E T G I R S A G I DV R L C D I G E A V Q E V M E S HT V E I HG K E Y QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y DT L L K A V K DA T NT G I K C A G I DV R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
T Y DDV C K I D F G T QV DG W I I DC A F T V A F N P - -V Y DT L L QA A K DA T DT G I R N S G I DV R L G DV G A A I Q E T M E S Y E V E I G G K V Y K V K
G K DD L MK V D I G V HV NG H I C D S A F T MT L NDT G K Y D S I MK A V K DA T NT G V K E A G I DV R L ND I G E A I Q E V M E S Y E M E L DG K T Y P V K
K Y DDV C K L D F G V HV NG Y I I DC A F T I A F N E - -K Y DN L I K A T QDG T NT G I K E A G I DA R MC D I G E A I Q E A I E S Y E I E L NQK I Y P I K
K Y DDV C K L D F G V HV NG Y I I DC A F T I A F N E - -K Y DN L I K A T QDG T NT G I R E A G I DA R MC D I G E A I Q E A I E S Y E I E L NK K I Y P I K
K E DDV C K L D F G V HV NG Y I I DC A F T I A F ND - -K Y DN L I K A T QDG T NT G I K E A G I DA R MC D I G E A I Q E A I E S Y E I E L NQK V Y P I K
QY DDV MK L D F G T H I DG H I V DC A F T V A F N P - -M F D P L L E A S R E A T NT G I K E S G I DV R L C DV G A A I Q E V M E S Y E V E I NG K V F QV K
QY DD I C K I D F G T H I S G R I I DC A F T V T F N P - -K Y D I L L K A V K DA T NT G I K C A G I D I R L C DV G E A I Q E V M E S Y E V E I DG K T Y QV K
K Y E DV MK V DY G V QV NG N I I D S A F T V S F D P - -QY DN L L A A V K DA T Y T G I K E A G I DV R L T D I G E A I Q E V M E S Y E V E I NG E T Y QV K
K E K DV MK V D I G V HV NG R I V D S A F T M S F D P - -QY DN L L A A V K A A T NK G I E E A G I DA R L N E I G E A I Q E V M E S Y E V E I NG K T HQV K
E Y DDV C K I D F G T H I NG R I I DC A F T V T F N P - -K Y DQ L L A A V K DA T NT G I K E A G I DV R L C DV G E R I Q E V M E S Y E V E L DG K T Y QV K
R Y DDV C K I D F G T H I NG R I I DC A F T V T F N P - -K F DG L L E A V R DA T NT G I K F A G I DV R L C DV G E T I Q E V M E S Y E V E I DG K T Y QV K
E Y DDV C K I D F G T QV E G R I I DC A F T V A F N P - -K Y DK L L E A V K E A T NT G I K E A G I DV R I P DV G A A I Q E V M E S Y E V E I E G K T Y P V K
E K DD I MK L D F G T HV NG Y I I D S A F T I A F D E - -K Y D P L I E S T K E A T NT G L K L A G I DA R T S E L G E A I E E V I E S F E I T L K NR T HK I K
HK NDV MK L D F G T HV NG Y I I D S A F T I A F D E - -K Y D P L I E S T K E A T NT G V K L A G I DA R T S E L G E A I Q E V I E S Y E I T L K NK T HK I K
G K S DV MK I D F G V A I NG N I I D S A F T V C F D P - -K F E P L L E A A K T A T NT G V K I A G I DA R MN E I G DA I Q E V F DA S S I D I DG K HY D I K
MY DDV MK V D F G T Q I NG R I I DC A WT V A F K D - - E Y E P L L T A V K E A T Y E G V K QA G I DV R L C DV G A A I Q E V M E S Y E V E L NG K V Y P V K
T Y DDV MK V D F G T Q I NG R I V DC A WT V A F ND - - E Y A P L L E A V K S A T Y E G I K QA G I DV R L C D I G E A I Q E V M E S Y E V E I K G K V Y P V K
QA T DV L K V D F G V HV K G R I V D S A F T L N F E P - -T WD P L L A A V K A A T NA G I K E A G I DA R L G E I G A S I Q E V M E S H E F E A NG K T HR V K
K E DDV L K V D F G V HV NG K I I D S A F T HV QND - -K WQG L L DA V K A A T E T G I R E A G I DV R L G D I G E A I Q E T M E S H E V E V DG K V Y QV K

3080

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

3090

A I R N L NG H S I S P Y R I HA - - - - - -G K T V
S I R N L NG H S I G P Y Q I HA - - - - - -G K S V
P C R N L C G HN I V P Y Q I HG - - - - - -G K S V
C I R N L NG HN I DR H I I HG - - - - - -G K S V
C I R N L NG HN I DQH I I HG - - - - - -G K S V
- - - - - - - - - - - - L E I HA - - - - - -G K T V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P I R N L NG H S I A QY R I HA - - - - - -G K T V
P I R N L NG H S I A QY R I HA - - - - - -G K T V
C I K N L NG HN I DD F V I H S - - - - - -G K S V
P C R N L C G H S I G P Y T I HA - - - - - -G K S V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P I R N L S G H S I G QY R I HA - - - - - -G K T V
S I S N L NG H S I T P Y T I HG G I G T R P G K S V
P I K N L NG H S I L P Y H I HG - - - - - -G K S V
P I K N L NG H S I L P Y H I HG - - - - - -G K S V
P I R N L NG H S I G QY R I HA - - - - - -G K T V
C I R N L NG HN I G DY L I H S - - - - - -G K T V
S I R N L NG H S I R P Y V I HG - - - - - -G K T V
A I R N L NG H S I S P Y R I HA - - - - - -G K T V
A I R N L NG H S I S P Y R I HA - - - - - -G K T V
P I S D L HG H S I S Q F R I HG - - - - - -G I S I
P V R N L S G HMV G S Y A V HA - - - - - -G K S I
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P I R N L NG H S I G QY R I HA - - - - - -G K T V
P C R N L C G HN I N P Y S I HG - - - - - -G K S V
S I R N L S G HN I A P Y I I H S - - - - - -G K S V
P I R N L NG H S I G QY R I HA - - - - - -G K T V
P I R N L NG HT I DR Y T I HG - - - - - -G K S V
P I T N L T G HG L S HY E A HD - - - - - -N P P V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
S I R N L NG HT I NHY S I HG - - - - - -T K S V
S V R N L NG H S I G P Y Q I HA - - - - - -G K S V
C C S N L NG H S I D P Y R I HA - - - - - -G K S V
P I R N L NG H S I G QY R I HA - - - - - -G K T V
S V K N L NG H L I C K Y H I HG - - - - - -G K S V
C I K N L NG HN I G DY I I H S - - - - - -G K T V
A I S N L R G H S I NK Y I I HG - - - - - -G K C V
A I S N L R G H S I NK Y I I HG - - - - - -G K C V
P I S N L R G H S I C K Y V I HG - - - - - -G K C V
S I R N L NG H S I G P Y Q I HA - - - - - -G K S V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P C R N L C G H S I A P Y R I HG - - - - - -G K S V
S I R N L C G HN L D P Y I I HG - - - - - -G K S V
P I R N L NG H S I G P Y R I HA - - - - - -G K T V
P I R N L NG H S I G QY R I H S - - - - - -G K T V
A I R N L NG H S I E A Y Q I HA - - - - - -G K S V
P I R N L T G HN I G QY I I HA - - - - - -G K A V
P I R N L T G HN I G QY V I HA - - - - - -G K A V
P I S N L S G H S L G P Y T V HA - - - - - -G K S I
S I R N L S G HT I A P Y V I HG - - - - - -G K S V
S I R N L C G HN I G P Y V I H S - - - - - -G K S V
C V E N L NG H S I E R Y S I HG - - - - - -G K S V
S I R N L NG HN I A P Y E I HG - - - - - -G K S V

3100 3110 3120

P I V K - - - -G G E -T T R -M E E N E F Y A I
P I V K - - - -G G E -QT K -M E E G E F Y A I
P I V K - - - -NG D - E T K -M E E G E H F A I
P I V K - - - -G S D -QT K -M E E G E T F A I
P I V K - - - -G G D -QT K -M E E G E V F A I
P I V K - - - -G G E -T T R -M E E N E I Y A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -G G E -QT K -M E E N E I Y A I
P I V K - - - -G G E -QT K -M E E N E I Y A I
P I I A - - - -NG D -MT K -M E E G E T F A I
P I V K - - - -NG D -QT K -M E E G E H F A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -G G D -QT R -M E E G E V F A I
P I V K QHG S DK D - E T K -M E E G E Y F A I
P I I A - - - -T ND -DT R -M E E N E I Y A I
P I I A - - - -T ND -DT R -M E E N E I Y A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V P - - - -NG D -MT K -M E E G E T F A I
P I V R - - - -G G E -MT K -M E E G E F Y A I
P I V K - - - -G G E - S T R -M E E D E F Y A I
P I V K - - - -G G E - S T R -M E E D E F Y A I
P A V N - - - -NR D -T T R - I K G D S F Y A V
P I C K - - - -G G P -QT K -M E E G E V Y A L
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -NG D -NT K -M E E N E H F A I
P I V K - - - -G G E -QA K -M E E G E V F A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - - S A D -QT K -M E E G E I Y A I
P NK H - - - -V E G -G V I - L K E G DV L A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - - S ND -QT K -M E E G DV F A I
P I V K - - - -G G E -QT K -M E E G E F Y A I
P I V K - - - -G G V -QT K -M E E G E Y Y A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - - S ND -NT L -MK E G E L Y A I
P I V A - - - -NG D -MT K -M E E G E T F A I
P I V R - - - -QK E -K N E I M E E G E L F A I
P I V K - - - -QK E - E N E I M E E G E L F A I
P I V K - - - -QQ E -K H E I M E E G D L F A I
P I V K - - - -G G E -QT K -M E E G E F F A I
P I V K - - - -G G E -A T R -M E E G E V Y A I
P I V K - - - -NG D -T T K -M E E G E H F A I
P I V K - - - -G G E - E I K -M E E G E I F A I
P I V K - - - -G G E -A T R -M E E N E F Y A I
P I V K - - - -G G E -A T R -M E E G DV Y A I
P C I R - - - -T G P -NV K -M E E G E QY A I
P I V G - - - -K S G -NR D I M E E G DV F A I
P I V G - - - -NT N -NR D I M E E G E V F A I
P I T K - - - -G G N -A E K -M E E G E L F A C
P I V R - - - -G G E -A T R -M E E G E L F A I
P I V R - - - -G G E -A I K -M E E G E L F A I
P I V N - - - -M P D L QV K -M E E G E Y Y A I
P I V K - - - - S A D -MT K -M E E G E T F A I

3130
E T F G S -T G R G L V
E T F G S -T G K G Y V
E T F G T -T G R G Y V
E T F G S -T G K G Y V
E T F G S -T G K G Y V
E T F G S -T G R G QV
E T F G S -T G K G V V
E T F G S -T G K G Y V
E T F G S -T G K G Y V
E T F G S -T G NG Y V
E T F G T -T G R G Y V
E T F G S -T G K G V V
E T F G S -T G K G Y V
E T F G S -T G R G R V
E T F A T -T G R G Y V
E T F A T -T G R G Y V
E T F G S -T G K G MV
E T F G S -T G NG Y V
E T F G S -T G R A QV
E T F G S -T G R G L V
E T F G S -T G R G L V
E T F A T -T G K G S I
E T F A T -T G R G R V
E T F G S -T G K G V V
E T F G S -T G K G V V
E T F G S -T G R G Y V
E T F G S -T G R G F V
E T F G S -T G K G V V
E T F G S -T G L G Y V
E P F A T -NG T G L V
E T F G S -T G K G V V
E T F G S -T G K G V V
E T F G S -T G NG Y V
E T F G S -T G K G F V
E T F G T -T G R G Y V
E T F G S -T G K G V V
E T F G S -T G K G Y V
E T F G S -T G R G Y V
E T F A S -T G K G Y V
E T F A S -T G K G Y V
E T F A S -T G K G F V
E T F A S -T G K G Y V
E T F G S -T G K G V V
E T F G S -T G R G Y V
E T F G S -T G R G V V
E T F G S -T G K G F V
E T F G S -T G R G A V
E T F G V I NG K A S I
E T F A T -T G S G T V
E T F A T -T G S G MV
E T F G S -T G K G I V
E T F G S -T G R G F V
E T F G S -T G R G V V
E T F G S -T G R G Y V
E T F G S -T G R G Y V

3140 3150

S HY MK D F DA P - - -K V P L R L
S HY MK N F DA G - - -HV P L R L
S HY A K N P G A L - - - - P A P T L
S HY A L I P DA P - - - S V P L R L
S HY A L I P DH S - - -QV P L R L
S HY MK N F DQQ - - - F V P L R L
S HY MK N F DV G - - -HV P I R L
S HY MK N F E L A D - E K I P L R L
S HY MK N F E L A D - E K I P L R L
S HY A MNK G V E - - -H L K P P S
S HY A R L P S DG - - - L P Q P N L
S HY MK N F DV G - - -HV P I R L
S HY MK N F D L A N -QHV P L R L
S HY A L N S A A P - - E K Y QG HH
S HY MK Y Y DN P F L N E N S T R L
S HY MK Y Y DN P F L N E N S T R L
S HY MK N F E V G - - -HV P I R L
S HY A K N P G T D - - -D I V V P G
S HY MK T DY - - - - -QT T V R L
S HY MK N F D L P - - - F V P L R L
S HY MK N F D L P - - -Y V P L R L
S H F V L NT Y K S - - - - - - -R K
S HY MV DA NA F - - -DY P V R D
S HY MK N F DV G - - -HV P I R L
S HY MK N F DV G - - -HV P I R L
S HY A K K P G S H - - - - P T P S L
S HY MMQ P G A E - - -V MQ L R S
S HY MK N F DV G - - -HV P I R L
S HY A K R A DA P - - -NV A L R L
E I Y S L I K - - - - - -K K P V R L
S HY MK N F DV G - - -HV P I R L
S HY MK N F DV G - - -HV P I R L
S HY A K R G DA A - - -K V D L R L
S HY MK N F DV G - - -HV P L R V
S HY MK N F DV G - - -HV P L R L
S HY MK N F DV G - - -HV P I R L
S HY MK D F Y A K - - - P T A V R V
S HY S R NQN I D - - -G I R V P S
S HY MR N P E K Q - - - F V P I R L
S HY MR N P DK Q - - - F V P I R L
S HY MR NR DV Q - - -Y A P I R L
S HY MK N F DV G - - -H I P L R L
S HY MK N F DV E - - -HV P I R L
S HY A R S A E DH - - -QV M P T L
S HY A K I P DA G - - -H I P L R L
S HY MK N F E V G - - -HV P L R M
S HY MK N F NV G - - -HV P I R L
S HY MK D F NK E - - -MV P L R Q
S HY MK N P N S I - - -Y A P I R L
S HY MK N P N S I - - -Y A P I R L
S H F MV A R N P P - - - - -T P R T
S HY MMV P G G E - - -K T QV R S
S HY MMV P G G D - - -K T Q L R S
S HY A R K K N L P - -K S I P I R V
S HY A K NV G V G - - -HV P L R V

3160

Anopheles_gambiae/1-3220
Arabidopsis_thaliana/1-3286
Ashbya_gossypii/1-3281
A_fumigatus/1-3241
Aspergillus_niger/1-3300
Bombyx_mori/1-2389
Bos_taurus/1-3273
C_briggsae/1-3231
Caenorhabditis_elegans/1-3289
Candida_albicans/1-3266
Candida_glabrata/1-3282
Canis_familiaris/1-3286
Ciona_intestinalis/1-2712
Cryptococcus_neoformans/1-3301
Cryptosporidium_hominis/1-3020
C_parvum/1-3281
Danio_rerio/1-3286
Debaryomyces_hansenii/1-3166
Dictyostelium_discoideum/1-3275
D_melanogaster/1-3287
Drosophila_pseudoobscura/1-3286
Encephalitozoon_cuniculi/1-3199
Entamoeba_histolytica/1-3271
Gallus_gallus/1-3286
Homo_sapiens/1-3286
Kluyveromyces_lactis/1-3281
Leishmania_major/1-3286
Macaca_mulatta/1-3182
Magnaporthe_grisea/1-3286
Methanosarcina_acetivorans/1-3043
Monodelphis_domestica/1-3196
Mus_musculus/1-3286
Neurospora_crassa/1-3264
Oryza_sativa/1-3218
Ostreococcus_lucimarinus/1-3287
Pan_troglodyte/1-3286
Paramecium_tetraurelia/1-3282
Pichia_stipitis/1-3290
P_falciparum/1-3285
P_knowlesi/1-3284
Plasmodium_yoelii/1-3285
Populus_trichocarpa/1-3286
Rattus_norvegicus/1-3286
Saccharomyces_cerevisiae/1-3284
Schizosaccharomyces_pombe/1-3284
Strongylocentrotus_purpuratus/1-3252
Takifugu_rubripes/1-2930
Tetrahymena_thermophila/1-3270
T_annulata/1-3248
Theileria_parva/1-3273
Trichomonas_vaginalis/1-3198
T_brucei/1-3264
Trypanosoma_cruzi/1-3266
Ustilago_maydis/1-3291
Yarrowia_lipolytica/1-3278

3170 3180 3190 3200 3210 3220 3230

Q S S K S L L G L - - - I NR N F G T L A F C K R W L DR A G A T K - -Y QMA L K D L C DK G I V E A Y P P L C DV K -G S Y T A QY E HT I M L R P T C K - E V V
P R A K Q L L A T - - - I NK N F S T L A F C R R Y L DR I G E T K - -Y L MA L K N L C D S G I V Q P Y P P L C DV K -G S Y V S Q F E HT I L L R P T C K - E V L
S R A K A L L R T - - - I DA N F G T L P WC R R Y L DR L G E DK - -Y M F A L NH L V K QG I V QDY P P L V DV E -G S Y T A Q F E HT I L L H P HK K - E V V
S S A K N L L NV - - - I NK N F G T L P F C R R Y L DR L G Q E K - -Y L L G L NN L V S S G I V QDY P P L C DV K -G S Y T A Q F E HT I L L R P T V K - E V I
S S A K N L L NV - - - I NK N F G T L P F C R R Y L DR L G QDK - -Y L L G L NN L V S S G I V QDY P P L C D I K -G S Y T A QY E HT I V L R P NV K - E V I
Q S S K Q L L NV - - - I NK N F G T L A F C K R W L E R A G A S R - -Y A MA L K D L C DK G V V DA Y P P L C D I K -G C Y T A Q F E HT I L L R P T C K - E V V
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
QK S K G L L S L - - - I DK N F S T L A F C R R W - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
QK S K G L L N L - - - I DK N F A T L A F C R R W I DR L G E T K - -Y L MA L K D L C DK G I V D P Y P P L C DV K -G C Y T A QW E HT I L MR P T V K - E V V
E R S K Q L L E T - - - I K QN F G T L P WC R R Y L E R T G E E K - -Y L F A L NQ L V R HG I V E E Y P P I V DK R -G S Y T A Q F E HT I L L H P HK K - E V V
A S A K Q L L K V - - - I DDH F G T L P WC R R Y L DR L G QDK - -Y L F A L N S L V K QG HV QDY P P L NDV I -G S Y T A QY E HT I L L H P HK K - E V V
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
P R A K R L L HV - - - I N E N F G T L A F C R R W I DR I G E T K - -Y L MA L K N L C D S G V V DA Y P P L C D I K -G C Y T A Q F E HT I L MR P NC K - E V V
Q S A K S L L A S - - -V K R N F G T L P F C R R Y L DHV G E K N - -Y L L A L NT L V R E D F I A DY P P L V D P Q P G A MT A Q F E HT I L L R P T C K - E V V
N S A K I L L G G - - - I NT H F G T L A F C R R W L DQ L G F NK - -HA L A L K S L V D S E I I R P Y P P L ND I P -G S F S S QM E HT I L L R P S C K - E V V
N S A K I L L G G - - - I NT H F G T L A F C R R W L DQ L G F NK - -HA L A L K S L V D S E I I R P Y P P L ND I P -G S F S S QM E HT I L L R P S C K - E V V
P R A K H L L NV - - -V N E N F G T L A F C R R W L DR L G E T K - -Y L MA L K N L C D L G I I D P Y P P L C DT K -G C Y T A Q F E HT I L L R P T C K - E V V
DK A K S L L NV - - - I N E N F G T L P WC R R Y L DR L G QDK - -Y L L A L NQ L V R A G I V QDY P P I V D I K -G S Y T A Q F E HT I L L H P HK K - E V V
P K A K Q L L QY - - - I NK NY DT L C F C R R W L DR A G E DK - -H I L A L NN L C D L G I I QR HA P L V D S K -G S Y V A QY E HT L L L K P T A K - E V L
Q S S K Q L L G T - - - I NK N F G T L A F C K R W L DR A G A T K - -Y QMA L K D L C DK G I V E A Y P P L C D I K -G C Y T A QY E HT I M L R P T C K - E V V
Q S S K Q L L G T - - - I NK N F G T L A F C K R W L DR A G A T K - -Y QMA L K D L C DK G I V E A Y P P L C D I K -G C Y T A QY E HT I M L R P T C K - E I V
L F NK D L I K V Y E F V K D S L G T L P F S P R H L DY Y G L V K G G S L K S V N L L T MMG L L T P Y P P L ND I D -G C K V A Q F E HT V Y L S E HG K - E V L
G NA K R L L HA - - - L DA N F K T L A F C R R Y V DK I G F A K - -WQM P F K F L V DDG C V NA Y P P L S DC H -G S Y V A Q F E HT I Y L K P T C K - E V L
P R A K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
S S A K N L L K V - - - I D E N F G T I P F C R R Y L DR L G E DK - -HV Y A L NT L V R QG I V E DY P P L ND I K -G S Y T A Q F E HT L I L H P HK K - E I V
E K A QQ L L K H - - - I HK S Y S T L A F C R K W L DR DG F DR - -H L MN L NR L V D E G A V NK Y P P L V DV K -G S F T A QY E HT I Y L G P T A K - E I L
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
T S A QK I L NV - - - I NK N F G T L P F C R R Y L DR L G QDK - -Y L L G L NN L V S NG I V E A Y P P L V DK K -G S Y T A QY E HT I L L R P T V K - E V I
P A V R NV L K Q - - -V - E E Y R E L P F A K R W L E - - - S DK - - L E F S L I Q L E K A G I L H S Y P V L V E S A -G G L V S QA E HT V I I T R DG C - E V T
P R A K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
S S A K S L L NV - - - I T K N F G T L P F C R R Y I DR L G QDK - -Y L L G - - - - - - -G I V E A Y P P L V DK K -G S Y T A HW L S T QR L K N S T A F Q I V
A K A K Q L L G T - - - I NNN F G T L A F C R R Y L DR L G E T K - -Y L MA L K N L C DV G I V Q P Y P P L C DV R -G S Y V S Q F E HT I L L R P T C K - E V I
P R A K Q L L G V - - - I DR N F G T L A F C K R Y L DR I G E QR - -Y S MA L K N L C DNG I V Q P Y P P L C D I K -G S Y V A QY E HT I L L K P S S G V E V L
P R T K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
P K A K S L L T H - - - I DNHY DT L A F C R R F L DR DG Q S N - -Y L L G L K N L C D L G I V N P Y P P L C D I R -G S Y V S QY E HT I F L K P S C I - E V I
E R A K T L L N S - - - I T S N F G T L P WC R R Y L E R T G E E K - -Y L F A L NQ L V R A G I V E E Y P P L V D I K -G S Y T A QY E HT I L L H P HK K - E V V
N S A K T L L K V - - - I NDN F DT L P F C NR W L DD L G QT R - -H F MA L K T L I D L N I V E P Y P P L C D I K -N S F T S QM E HT I L L R P T C K - E V L
N S A K T L L K V - - - I NDN F DT L P F C HR W L DD L G QK R - -H F MA L K T L V D L N I V E P Y P P L C DV K -N S F T S QM E HT I L L R P T C K - E V L
N S A K T L L K V - - - I NDK F DT L P F C NR W L DD L G QT R - -H F MA L K T L V D L N I V E P Y P P L C D I K -N S F T S QM E HT I L L R P T C K - E V L
P R A K Q L L A T - - - I NK N F S T L A F C R R Y L DR L G E T K - -Y L MA L K N L C D S G I I Q P Y P P L C DV K -G S Y V S Q F E HT I L L R P T C K - E V I
P R T K H L L NV - - - I N E N F DT L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I V D P Y P P L C D I K -G S Y T A Q F E HT I L L R P T C K - E V V
D S A K N L L K T - - - I DR N F G T L P F C R R Y L DR L G Q E K - -Y L F A L NN L V R HG L V QDY P P L ND I P -G S Y T A Q F E HT I L L HA HK K - E V V
P R A K A L L NT - - - I T QN F G T L P F C R R Y L DR I G E S K - -Y L L A L NN L V S A G I V QDY P P L C D I R -G S Y T A Q F E HT I I L H P T QK - E V V
QR S K A L L K V - - - I NNN F G T L A F C R R W L DR L G E T K - -Y L MA L K N L C DT G L V D P Y P P L C DV K -G C Y T A QY E HT I M L R P T Y K - E V V
P R A K H L L NV - - - I N E N F G T L A F C R R W L DR L G E S K - -Y L MA L K N L C D L G I I D P Y P P L C D I K -G S Y T A QY E HT I L L R P T Y K - E V V
P K A K N L L K F - - - I DNN F G T L A F C R R W L DR G G QT G - -H I L S L K Q L C DA G I V V P Y P P L V DV R -G S Y V A QY E HT I V L K P S HK - E V I
K S A R E S L NV - - - I NR E F S T L P F C K R W L DD L T NK R - -G S L V L R N L V DA G I I V P Y P P L C DNN -N S F T S QM E HT I L L R P T C K - E V L
K S A R E A L NV - - - I NR E F S T L P F C K R W L DD L T NR R - -G S MV L R S L V DA G I V V P Y P P L S DNN -H S F T S QM E HT I L L R P T C K - E V L
P A A R K L L K T - - - L Q E N F S T L A F S QR F I DR I G E K K - -Y Q L N L R H L V E C R A V HDY P S L S DV K -G S Y V A Q F E HT F I L L P T HK - E V L
DK A QQ L L R H - - - I HK T Y NT L A F A R K W L DR DG HDR - -H L L N L NQ L V E A G A V NK Y P P L C D I R -G C Y T A Q L E HT L I L K P T A K - E I L
E K A QH L L K H - - - I NK T Y G T L A F A R K W L DR DG Y DR - -H L L N L NQ L V E A G A V NR Y P P L C DV K -G C Y T A Q F E HT I L L K P T A K - E I L
H S A HG L L R T - - - I NK H F D S L P F C R R Y L DR V G E K N - -Y L L G L K H L V S L G V V QDY P P L C D I A -G S MT A QY E HT I L L R P T C K - E V V
NK A K Q L L A T - - - I DK N F G T L P F C R R Y L DR L G E E K - -Y L L A L K N L V Q S G V V QDY P P L V DQK -G C QT A QY E HT I Y L R P T C K - E I L

3240


3250 3260 3270 3280 3290 3300 3310

S R - - - - - - - -G DDY - - - - - - - - - - - - - - - - - - - -A QMQMK A R L A G A HK G HG L L K K K A DA L QMR F R M I L S K I I E T K T L MG E V -
S K - - - - - - - -G DDY M - -A G QNA -R L NV V P T V T - -M L G V MK A R L V G A T R G HA L L K K K S DA L T V Q F R A L L K K I V T A K E S MG DM -
S K - - - - - - - -G DDY M - - - - S NN -R E QV F P T R M - -T L G L MK S K L K G A NQG H S L L K R K S E A L T K R F R E I T R R I D E S K QR MG A V -
S R - - - - - - - -G DDY M - - S G A V G -R E P V F P T R Q - - S L G L MK S K L K G A E T G H S L L K R K S E A L T K R F R E I T R R I D E A K QK MG R V -
S R - - - - - - - -G DDY M S G F N P P G -R E A V F P T R Q - - S L G L MK G K L K G A E T G H S L L K R K S E A L T K R F R E I T R R I D E A K QK MG R V -
S R - - - - - - - -G DDY - - - - - - - - - - - - - - - - - - - - - -M L I K G R L A G A V K G HG L L K K K A DA L QV R F R M I L S K I I E T K T L MG E V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
- - - - - - - - - - - - - -M S G G G G K D -R I A V F P S R M - -A QT L MK T R L K G A QK G H S L L K K K A DA L N L R F R D I L K K I V E NK V L MG E V -
S R - - - - - - - -G DDY M - S G G G K D -R I A V F P S R M - -A QT L MK T R L K G A QK G H S L L K K K A DA L N L R F R D I L R K I V E NK V L MG E V -
T K - - - - - - - -G DDY M - - S G A G N -R E QV F P T R M - -T L G V MK S K L K G A QQG H S L L K R K S E A L T K R F R D I T QR I DDA K R K MG R V -
S K - - - - - - - -G DDY M - - - - S G N -R E QV F P T R M - -T L G L MK T K L K G A NQG H S L L K R K S E A L T K R F R D I T K R I D E A K QK MG R V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - - S S HD -R I D I F P S R M - -N L T I MK T R L K G A HK G H S L L K K K A DA L K MK F H S I L R K I I E A K Q L MG E I -
S R - - - - - - - -G DDY M - - S G T G P -R E A I F P T R M - -N L T L T K G R L K G A QT G H S L L A K K R DA L T T R F R Q I L R K V D E A K R L MG R V -
S R - - - - - - - -G DD F - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -M L K E I V E T K R S I G ND -
S R - - - - - - - -G DD F L - - - - - - - - - - - - - - - L I Y R A L QA I K L K S K G A K QG Y D L L K R K S DA L S NK F R G M L K E I V E T K R S I G ND -
S R - - - - - - - -G DDY M - - - S G K E -R I D I F P S R M - -A QT I MK A R L K G A QT G R S L L K K K S DA L S MR F R Q I L R K I I E T K T L MG E V -
S R - - - - - - - -G DDY M - - S G A G N -R E QV F P T R M - -T L G L MK G K L K G A QQG H S L L K R K S E A L T K R F R D I T QR I DDA K R K MG R V -
S R - - - - - - - -G DDY M - - - S G K N -R L N I F P T R M - -A L T V MK T K L K G A V T G H S L L K K K S DA L T I R F R R I L A N I V E NK Q L MG T T -
S R - - - - - - - -G DDY M - - - S G K D -R L P I F P S R G - -A QM L MK A R L A G A QK G HG L L K K K A DA L QMR F R L I L G K I I E T K T L MG DV -
S R - - - - - - - -G DDY M - - - S G K D -R L P I F P S R G - -A QM L MK A R L A G A QK G HG L L K K K A DA L QMR F R M I L G K I I E T K T L MG DV -
T R - - - - - - - -G DDY M - - - -T G E -R I P V F P T R M - -N L R T M E T K QK S A QK G H S L L K R K S DA L K V R Y R A V E D E Y K R K E L G I NQK -
S R - - - - - - - - - - - -M - - - - S DK -R Y T V F P T R M - -Q L T T Y K G K L V G A QR G HD L L K R K T DA L NQK F K S I L K K I I E E K M S MK DY -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K L L MG E V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - - - S S N -R E QV F P T R M - -T L G L MK T K L K G A NQG Y S L L K R K S E A L T K R F R D I T K R I DD S K QK MG R V -
S K - - - - - - - -G S DY M - - - S S T S -R Y P A L P S R M - - S L I A F K T R L K G A QK G H S L L K K K A DA L S L R Y R T V MG E L R T A K L E MA NQ -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - S G A G E -R E A V F P T R Q - - S L G I MK A K L K G A E T G H S L L K R K S E A L T K R F R E I T K R I D E A K R K MG R V -
T K - - - - - - - - - - - -M - - - - - - -A QQDV K P T R S - - E L I N L K K K I K L S E S G HK L L K MK R DG L I L E F F K I L N E A R NV R T E L DA A -
S R - - - - - - - -G DDY M - - - S T K D -R I D I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA MT L R F R Q I L K K V I QT K V L MG E V -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
NQR R R V T E A I T E A E M - - S G A A D -R E A V F P T R Q - - S L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - E I T R R I D E A K R K MG R V -
S R - - - - - - - -G DDY M - - S G QT Q -R L NV V P T V T - -M L G V MK A R L V G A T R G HA L L K K K S DA L T V Q F R A I L K K I V A A K E S MG E A -
T R - - - - - - - -G E DY M - - S S A G A -R L NV T P T V T - -T L A V I K S R L A G A QR G HR L L K K K A DA L T L R Y R G I L R D I V E A K R K L A T S -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S R - - - - - - - -G DDY M - - - - - - - -A E QV V P S R M - -N L A L Y K A K I I S A K K G H E L L K K K C DA L K T K F R I V MV A L L E NK K F MG D E -
T R - - - - - - - -G DDY M - - S G A G N -R E QV F P T R M - -T L G L MK G K L K G A QQG H S L L K R K S E A L T K R F R D I T QR I DDA K R K MG R V -
S R - - - - - - - -G P D F M - - -G A L D - E S T P V P S R I - -T L Q L MK QK K K S A F QG Y S L L K K K S DA L F I H F R DV L K D I V K T K T K V G E E -
S R - - - - - - - -G P D F M - - -G A L D - E S T P V P S R I - -T L Q L MK QK K K S A F QG Y S L L K K K S DA L F I H F R DV L K D I V K T K T K V G E E -
S R - - - - - - - -G P D F M - - -G A L D - E S T P V P S R I - -T L H L MK QK K K S A F QG Y S L L K K K S DA L F I H F R DV L K D I V K T K NK V G E D -
S R - - - - - - - -G DDY M - - S G S G Q -R L NV V P T V T - -V L G V V K A R L V G A T R G HA L L K K K S DA L T V Q F R Q I L K K I V S T K E S MG DK -
S R - - - - - - - -G DDY M - - - S G K D -R I E I F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L T L R F R Q I L K K I I E T K M L MG E V -
S K - - - - - - - -G DDY M - - - - S G N -R E QV F P T R M - -T L G L MK T K L K G A NQG Y S L L K R K S E A L T K R F R D I T K R I DDA K QK MG R V -
S R - - - - - - - -G DDY M - - -A S K Q -R E NV F P T R M - -T L T T MK T R L K G A QT G H S L L K R K S E A L K K R F R E I V V N I E QA K QK MG R V -
S R - - - - - - - -G DDY M - - - - S K D -R I A V F P S R M - -A L T T MK I R L K G A QK G H S L L K K K A DA L T L K F R Q I L G K I I E NK T L MG E A -
S R - - - - - - - -G E DY M - - - S G K E -R I DV F P S R M - -A QT I MK A R L K G A QT G R N L L K K K S DA L S MR F R Q I L R K I I E V S W L S S A I P I
S R - - - - - - - -G DDY M - - - - - - - - S QQ I T P S R M - -T L A I Y K A K T V S A K K G H E L L K K K C DA L K T K F R A I M I A L L E NK L K MD E E -
S R - - - - - - - -G DDY M - - - - S S L - S V L L I P S R M - -N L QN L K QR R HNA H L G Y S L L K R K S DA L T S K F HR L L R A T V QG K E R L V E G -
S R - - - - - - - -G DDY M - - - - S N L - S V L L I P S R M L V N L QN L K QR R HNA H L G Y S L L K R K S DA L T S K F HR L L R A T V QG K E R L V E G -
S R - - - - - - - -G DDY M - - - - - - - - -A A I I P T R M - - E L QN L K E K L K G A R K G Y D L L K K K S DA L T MK F R S L L R E I R DT K L S V G NV -
S K - - - - - - - -G DDY M - - - - S S N -R Y T A L P S R M - - S L I A F K T R L K G A QK G H S L L K K K A DA L A F R Y R T V MD E L R R A K L E V A DQ -
S K - - - - - - - -G DDY M - - - - S S N -R Y P A L P S R M - - S L I S F K T R L K G A QK G H S L L K K K A DA L A I R Y R A I MG D L R NA K M E MV E Q -
S R - - - - - - - -G T DY M - S S G K G Q -R E S V F P T R Q - -A L G S A K T R L K G A QT G H S L L K K K A DA L T K R F R T I T HK I D E A K R K MG R V -
S R - - - - - - - -G DDY M - - - S A NN -R E A V F P T R M - -T L G MMK G K L K G A T QG HN L L K R K S E A L T K R F R D I T R K I D E S K HK MG R V - -

3330


3340 3350 3360

- -MK E A A F S L A E A K F A S G -D F NQV V L QNV T K A Q I K I R T K K DNV A G V T L
- -MK T S S F A L T E V K Y V A G DNV K HV V L E NV K E A T L K V R S R T E N I A G V K L
- -MQT A S F S L A E V T Y A T G E N I G Y QV Q E S V A NA R F K V G A R Q E NV S G V Y L
- -MQ I A A F S L A E V S Y A V G G D I G Y QV Q E S A K QA R F R V R A K Q E NV S G V F L
- -MQ I A A F S L A E V S Y A V G G D I G Y Q I Q E S A K QA R F R V R A K Q E NV S G V L L
- -MK E A A F S L A E A K F T T G -D F NQV V L QNV T K A Q I K I R S K K DNV A G V T L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MK E A A F S L A E A K F T A G -D F S HT V I QNV S QA QY R V R MK K E NV V G V L L
- -MK E A A F S L A E A K F T A G -D F S HT V I QNV S QA QY R V R MK K E NV V G V F L
- -MQT A A F S L A E V QY A T G DN I S Y QV Q E S V QK A R F T V K A K Q E NV S G V F L
- -MQT A A F S L A E V S Y A T G E N I G Y QV Q E S V L NA R F K V K A R Q E NV S G V Y L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MK E A A F S L A E A K F S G G -D F S HV V L QNV G K A QMK V R S K T DNV A G V K L
- - L Q L A S F S L A E V T Y A A G -D I G Y QV Q E S V R K A NY T V QA R Q E NV S G V V L
- - I K E A S F A L A K A T WA A G -D F K DR I I E S C K R P T V T M E V G T E N I A G V R L
- - I K E A S F A L A K A T WA A G -D F K DR I I E S C K R P T V T M E V G T E N I A G V R L
- -MR E A A F S L A E A K F A A G -D F S T T V I QNV NK A QV K V R A K K DNV A G V T L
- -MQT A A F S L A E V QY A T G DN I S Y L V Q E S V QNA R F QV K A K Q E NV S G V Y L
- -MR DA S F S L A A A K Y A A G - E F S N S V I E NV S N P T I A V K MT T E NV A G V H L
- -MK E A A F S L A E A K F T S G -D I NQV V L QNV T K A Q I K I R T K K DNV A G V T L
- -MK E A A F S L A E A K F T S G -D I NQV V L QNV T K A Q I K I R T K K DNV A G V T L
- - I R DA F F R L T E A E F L G A -N L K M F L Y E -C QK QNV Y V R S R V E QV S G V S L
- -MK A S S F S L V S A K Y T A G - E F S HV V V QNV K N S T Y K V K L T Q E N I A G V R L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MQT A A F S L A E V T Y A T G E N I G Y QV Q E NV A NA R F K V R A T Q E NV S G V Y L
- - I K G S Y F T I T QA Q F I A G -D I S L A V Q E S L K I P T Y R M E L QV E N I A G V QV
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MQ I A A F S L A E V T Y A V G G D I G Y T V Q E S A K S A R F R I R A K Q E NV S G V L L
- -Y E K S T E K I N L A S A V NG -MV A V K S T A F T A K E Y P E I Q L S G HN I MG V V V
- -MR E A A F S L A E A K F T A G -D F S A T V I QNV NK A QV K I R T K K DNV A G V T L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MQ I A S L S L A E V T Y A V G G N I G Y Q I Q E S A K S A R F R I R A K Q E NV S G V L L
- -MR A S S F S L A E A K Y V A G DG V R HV V L Q S V R S A S L R V R S HQ E NV A G V K L
- -MR DA H F A WT R A K Y A G G DA V K HA V L DG V DR A NV R V MA H E DNV A G V K I
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -A Q E A L L L I A K A QY A A G - E F HQNV K DA V K R A T I R L E I S S E N I A G V M L
- -MQT A A F S L A E V QY A T G DN I A Y QV Q E S V QK A R F QV K A K Q E NV S G V Y L
- -MR NA S F S L A K S V WA A G -D F K G Q I I E G I K R P V V T L S L S T NNV A G V K L
- -MG NA S F S L A K A V WA A G -D F K G Q I I E G I K R P V V T L S L S T NNV A G V K L
- -MR NA S F A L A K S V WA A G -D F K G Q I I E G I K R P V V T L S L S T NNV A G V K L
- -MK A S S F A L T E A K Y V A G E N I K HT V L E NV QT A T L K V R S R Q E NV A G V K L
- -MR E A A F S L A E A K F T A G -D F S T T V I QNV NK A QV K I R A K K DNV A G V T L
- -MQT A A F S L A E V S Y A T G E N I G Y QV Q E S V S T A R F K V R A R Q E NV S G V Y L
- -MQ I A A F S MA E V G F A MG NN I N F E I QQ S V K Q P R L R V R S K Q E N I S G V F L
- -MK L A S L S L A E A K F A MG -D I S HNV L QNV T K A QT K V R S K K E NV A G V N L
Q F MR E A A F S L A E A K F T A G -D F S I T V I QNV NK A QV K V R A K K DNV A G V T L
- -MQK A F I Q L A DA Y WA A D -Q F NT NV R E S V K K A L V R I E Y S S E N I A G V M L
- - L K DA T Y S L A NA V W S A E -D F K S L V I E S V G R P S V T L K L R G E N I A G V L L
- - L K DA T Y S L A NA V W S A E -D F K S L V I E S V G R P S V T L K L R G E N I A G V L L
- -A K DA L F A Y T E V K F V A S -D I S P T V I Q S V G NM P Q L L L MT I DN I A G V R T
- - I K G S Y F T I T QA Q F I A G -D I S L A V Q E S L K L P T Y T L T L R V DNV A G V R V
- - I R G A Y F T V S K A Q F I A G -D I G L A V Q E S L K L P T Y A MR L R V E N I A G V R V
- -MQQA S F S L A E V QY A T G -D I G Y I V Q E S V K S A S F R V R A K Q E NV S G V I L
- -MQT A A F S L A E V T Y A T G DN I NY QV Q E S V R S A R L R V R A K E E NV S G V K L

3370 3380 3390 3400

P V F G L A K G G QQ L QK L K K NY Q S A V K L L V E L A S L QT S
P K F G L A R G G QQV R A C R V A Y V K A I E V L V E L A S L QT S
P Q F G L G R G G QQV QR A K N I Y T K V V E S L V Q L A S L QT A
P Q F G L G K G G QQV QR C R E T Y A R A V E T L V E L A - - - -
P H F G L G K G G MQV QR C R E T Y A R A V E T L V E L A S L QT A
P I F G L A R G G QQ L A K L K K N F Q S A V K L L V E L A S L QT S
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P V F G L G K G G A N I A R L K K NY NK A I E L L V E L A T L QT C
P V F G L G K G G A N I A R L K K NY NK A I E L L V E L A T L QT C
P T F A L A R G G QQV QK A K L I Y S K A V E T L V E L A S L QT A
P Q F G L G R G G QQV QR A K D I Y S K A V E T L V E L A S L QT A
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P V F G L S R G G E Q L S R L K K NY S K A V K L L V E L A S L QT S
P A F G L S R G G QQ I QK S R DT Y I K A V G T L V E L A S L QT A
P I F G V A S G G QV I Q S T R E I Y MK V L R D L V K L A S L QT A
P I F G V A S G G QV I Q S T R E I Y MK V L R D L V K L A S L QT A
P V F G L A R G G E Q L S R L K R NY A K A V E L L V E L A S L QT S
P T F G L G R G G QQV QK A K MV Y T K A V E T L V E L A S L QT A
P T F G L S K G G QQ I NK S R E S H I K A V E A L I A L A S L QT A
P V F G L A R G G QQ L A K L K K NY Q S A V K L L V E L A S L QT S
P V F G L A R G G QQ L A K L K K NY Q S A V K L L V E L A S L QT S
P F F F L DR S G Q S L N E C R E K F L E V L E M L V D L C A L K N S
P V F G L S K G G Q S V A NA R QQY L K A L D S L V K L A S L QT A
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P Q F G L G R G G QQV QR A K E I Y S R A V E T L V E L A S L QT A
P S F G L G K G G E Q I K E A Y S A F R HT L S L L V K I A S L QT S
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P A F G L G K G G QQV QR C R E T Y A R A V E A L V E L A S L QT A
P K I G I I G T N S Y I D E T A DA Y E E L V E K I I A A A E L E T T
P V F G L A R G G E QV T K L K K NY G K A V E L L V E L A S L QT S
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P A F G L G K G G QQV QR C R E T Y A R A V E A L V E L A S L QT A
P K F G L A R G G QQV A A C R A A HV K A I E V L V E L A S L QT S
P K F G L A R G G A R V R E A K A S Y G E A I G L L S E L A S L QT A
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
P E V G L A R G G Q S I QR C R DK F K D L L M L L V K I A S Y QT S
P T F G L G R G G QQV QK A K L V Y T R A V E T L V E L A S L QT A
P I F G V A A G G QV I NNT R E NY L QC L NM L V K L A S MQV A
P I F G I A S G G QV I NNT R E NY L QC L NM L V K L A S MQV A
P I F G V A A G G QV I NNT R E NY L QC L NM L V K L A S MQV A
P K F G L A R G G QQV QA C R A A Y V K A I E V L V E L A S L QT S
P V F G L A R G G E Q L A K L K R NY A K A V E L L V E L A S L QT S
S Q F G L G R G G QQV QR A K E I Y S R A V E T L V E L A S L QT A
P T F G L G K G G QQ I QK A R QV Y E K A V E T L V Q L A S Y Q S A
P V F G L S R G G QQ I DR L K K NY A K A I E L L V E L A S L QT S
P V F G L A K G G E Q I S R L K R NY A R A V E L L V E L A S L QT S
P N L G L DK G G F S I QK A K E R F K E A L Y L L V K V A S L QT S
PV F S L S SGG SA I Q SV K T T H LAA LD I LV E LA S LQ I S
PV F S L S SGG SA I Q SV K T T H LAA LD I LV E LA S LQ I S
P Q F G L A R G G QQ I QK A R E E F T K F L D S L V R L A E L QT A
P A F G I G R G G E Q L R E A R DA F R E T L K L F V K I A S L QV S
P S F G I G R G G E Q L R E A S E K F R E T L R L L V K I A S L QV S
P A F G L S R G G QQV S K A R E V Y T QA L K V L V E L A S L QT A
P S F G L G R G G QQV QK A K A V Y S K A V E T L V E L A S L QT A

3410


3420 3430 3440 3450 3460

F V T L D E V I K I T NR R V NA I E H - - - - -V I I P R I DR T L A Y I I S E L D E L E R E E F Y R L K K I QDK K
F L T L D E A I K T T NR R V NA L E N - - - - -V V K P K L E NT I S Y I K G E L D E L E R E D F F R L K K I QG Y K
F V I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I A Y I N S E L D E L DR E E F Y R L K K V Q E K K
- - - -N E V I K V V NR R V - S T S L - - - - - S L E P R T L S N - - - - - - - - - - - - - - - - - - - - - - - - -
F V I L D E V I K V V NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V S G K K
F V T L D E V I K I T NR R V NA I E H - - - - -V I I P R L E R T L A Y I I S E L D E L E R E E F Y R L K K I QDK K
F V T L D E A I K I T NR R V NA I E HG E F K L P F C P R L H P C L R P A R T QA - - - - - - - - - - - - - - - - -
F I T L D E A I K V T NR R V NA I E H - - - - -V I I P R I E NT L T Y I V T E L D E M E R E E F F R MK K I QA NK
F I T L D E A I K V T NR R V NA I E H - - - - -V I I P R I E NT L T Y I V T E L D E M E R E E F F R MK K I QA NK
F I I L D E V I K I T NR R V NA I E H - - - - -V I I P R T E NT I A Y I NG E L D E MDR E E F Y R L K K V Q E K K
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I A Y I N S E L D E L DR E E F Y R L K K V Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F V T L D E S I K I T NR R V NA I E H - - - - -V I I P K I E R T I S Y I I T E L D E G E R E E F F R L K K I QQK K
F T I L D E V I R A T NR R V NA I E H - - - - -V V I P R L E NT I K Y I N S E L D E MDR E E F F R L K K V QG K K
F F S L D E E I K MT NR R V NA L QN - - - - -V V L P K L E DG MNY I L R E L D E I E R E E F F R L K K I Q E K K
F F S L D E E I K MT NR R V NA L QN - - - - -V V L P K L E DG MNY I L R E L D E I E R E E F F R L K K I Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L T Y I I T E L D E R E R E E F Y R L K K I Q E K K
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I S Y I N S E L D E L DR E E F Y R L K K V Q E K K
F I T L D E V I K I T NR R V NA I E Y - - - - -V V K P K L E NT I S Y I I T E L D E S E R E E F Y R L K K V QG K K
F V T L D E V I K I T NR R V NA I E H - - - - -V I I P R I DR T L A Y I I S E L D E L E R E E F Y R L K K I QDK K
F V T L D E V I K I T NR R V NA I E H - - - - -V I I P R I DR T L A Y I I S E L D E L E R E E F Y R L K K I QDK K
F R V L N S I L M S T NR R V NA L E F - - - - -N I I P R L E NT V S Y I V S E L D E QDR G D F F R L K K V QN L K
F L T L DT V I K I T NR R V NA L E H - - - - -V V I P MT QA T V K Y I E T E L D E S E R E E F F R L K L I QNK K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L S Y I I T E L D E R E R E E F Y R L K K I Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I A Y I N S E L D E L DR E E F Y R L K K V Q E K K
W I T L D I A QK V T S R R V NA L E K - - - - -V V I P R V QNT L S Y I T S E L D E Q E R E E F F R L K MV QK K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F V I L D E V I K V V NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V A G K K
MK R L L D E I E K T K R R V NA L E F - - - - -K V I P E L I A T MK Y I R F M L E E M E R E NT F R L K R V K A R M
F I T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L NY I V T E L D E R E R E E F Y R L K K I Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F V I L D E V I K V V NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V A A K K
F L T L D E A I K T T NR R V NA L E N - - - - -V V K P R L E NT I S Y I K G E L D E L E R E D F F R L K K I QG Y K
F V T L D E A I K T T NR R V NA L E N - - - - -Y V T P R L QNT V K Y I L S E L D E L E R E E F F R L K K V QA K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F V S L DQV I K V T NR R V NA L E Y - - - - -V V I P R F T A T MNY I DM E L D E M S K E D F F R L K K V L DNK
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I S Y I N S E L D E L DR E E F Y R L K K V Q E K K
F F S L D E E I K MT NR R V NA L NN - - - - - I V L P R L DG G I NY I I K E L D E I E R E E F Y R L K K I K E K K
F F S L D E E I K MT NR R V NA L NN - - - - - I V L P R L DG G I NY I I K E L D E I E R E E F Y R L K K I K E K K
F F S L D E E I K MT NR R V NA L NN - - - - - I V L P R L E G G I NY I I K E L D E I E R E E F Y R L K K I K E K K
F MT L DT A I K T T NR R V NA L E N - - - - -V V K P R L E NT I T Y I K G E L D E L E R E D F F R L K K I QG F K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I E R T L A Y I I T E L D E R E R E E F Y R L K K I Q E K K
F I I L D E V I K V T NR R V NA I E H - - - - -V I I P R T E NT I A Y I N S E L D E L DR E E F Y R L K K V Q E K K
F V L L G DV L QMT NR R V N S I E H - - - - - I I I P R L E NT I K Y I E S E L E E L E R E D F T R L K K V QK T K
F I T L D E V I K I T NR R V NA I E H - - - - -V I I P R I E NT I S Y I T T E L D E R E R E E F Y R L K K I Q E K K
F V T L D E A I K I T NR R V NA I E H - - - - -V I I P R I DR T L T Y I V T E L D E R E R E E F Y R L K K I Q E K K
F I T L D E V I K V T NR R V NA L E H - - - - -V V I P R F M E V QA Y I NQ E L D E M S R E D F F R L K K V L D F K
F I I L N E E I R MT NR R I NA L DN - - - - -V L I P S I DR N L E Y I R R E L D E M E R E E F Y R L K M I K K HK
F I I L N E E I R MT NR R I NA L DN - - - - -V L I P S I DR N L E Y I R R E L D E M E R E E F Y R L K M I K K HK
F NV I DDV L R I T NR R V NA M E C - - - - -V L I P K Y QA A I A F V D S T L D E N E R E E F F R L K K V Q E T I
WMT L DV A QK V T S R R V NA L E K - - - - -V V I P R M E NT L NY I S S E L D E Q E R E E F F R L K M I QK K K
WV T L D L A QK V T NR R V NA L E K - - - - -V V I P R V QNT L S Y I T S E L D E Q E R E E F F R L K MV QK K K
F V I L D E V I R MT NR R V NA I E H - - - - -V I I P R L E NT I S Y I V S E L D E A DR E E F F R L K K V QA K K
F V I L D E V I K I T NR R V NA I E H - - - - -V I I P R T E NT I K Y I N S E L D E L DR E E F Y R L K K V QDK K
