Automatic Classification and Pattern Discovery in High-Throughput Protein Crystallization Trials

Journal of Structural and Functional Genomics 6: 195–202, 2005.
Springer 2005
DOI 10.1007/s10969-005-5243-9
Christian Cumbaa & Igor Jurisica*

Ontario Cancer Institute, 610 University Avenue, Toronto, Ontario, M5G 2M9, Canada; and Northeast
Structural Genomics Consortium. *Author for correspondence (e-mail: juris@ai.utoronto.ca)
Received 31 July 2004; accepted in revised form 04 May 2005
Key words: association-rule discovery, image analysis, protein crystallization
Abstract
Conceptually, protein crystallization can be divided into two phases: search and optimization. Robotic
protein crystallization screening can speed up the search phase, and has a potential to increase process
quality. Automated image classification helps to increase throughput and consistently generate objective
results. Although the classification accuracy can always be improved, our image analysis system can classify
images from 1536-well plates with high classification accuracy (85%) and ROC score (0.87), as evaluated on
127 human-classified protein screens containing 5600 crystal images and 189472 non-crystal images. Data
mining can integrate results from high-throughput screens with information about crystallizing conditions,
intrinsic protein properties, and results from crystallization optimization. We apply association mining, a
data mining approach that identifies frequently occurring patterns among variables and their values. This
approach segregates proteins into groups based on how they react in a broad range of conditions, and
clusters cocktails to reflect their potential to achieve crystallization. These results may lead to crystallization
screen optimization, and reveal associations between protein properties and crystallization conditions. We
also postulate that past experience may lead us to the identification of initial conditions favorable to
crystallization for novel proteins.
Introduction

One of the fundamental challenges in modern molecular biology is the elucidation and understanding
of the rules by which proteins adopt their three-dimensional structure. Currently, the most
powerful method for protein structure determination is single crystal X-ray diffraction, although
new breakthroughs in NMR and in silico approaches are growing in their importance.
A crystallography experiment begins with a well-formed crystal that ideally diffracts X-rays to
high resolution. For proteins, this process is often limited by the difficulty of growing crystals
suitable for diffraction. This is partially due to the large number of parameters affecting the
crystallization outcome (e.g., purity of proteins, intrinsic physico-chemical, biochemical,
biophysical and biological parameters), and the unknown correlations between the variation of a
parameter and the propensity for a given macromolecule to crystallize.
Conceptually, protein crystallization can be divided into two phases: search and optimization.
Approximate crystallization conditions are identified during the search phase, while the
optimization phase varies these conditions to ultimately yield high quality crystals. There is yet
no reliable means of generating quality crystals from arbitrary proteins, and thus automatic
methods are needed to conduct high-throughput protein crystallization trials.
Robotic systems now routinely handle most phases of these trials: cocktail preparation, cocktail
and protein solution mixing in sub-microliter volumes, and imaging of each experiment.
Typically, this pipeline ends with a human expert responsible for manually assessing the photograph
of each experiment for a positive crystallization outcome. One of the current challenges in the
field is to replace the expert with a high-accuracy automatic image-classification system (e.g.,
Wilson, 2002; Bern et al., 2004). Another challenge is to discover the factors that affect
crystallization of different proteins under different chemical conditions.
We report here on parallel efforts on both these fronts: (1) automated image classification to
determine, per plate, in what wells and at what time points crystals form, and (2) pattern
discovery in the resulting matrix of experimental outcomes, considering each protein, cocktail and
multiple time points, to identify what protein and cocktail properties influence crystallization.
The main advantage of this approach is a systematic screening of many proteins under the same
conditions (Luft et al., 2003). To date, the High-Throughput Screening (HTS) lab at the
Hauptman–Woodward Medical Research Institute (HWI) has screened over 4000 proteins under 1961
conditions. Unlike other protein crystallization databases, such as BMCD (Gilliland et al., 1996),
HWI data includes both successful and unsuccessful crystallization trials. This created a
computational challenge of analyzing over 40 million images, but also great potential for data
mining.

Materials and methods

Data sources and preparation

We obtained image data from the HTS lab at HWI as time-stamped, per-plate archives of 1536 images
(one image per well for each time point). Provided with each archive are the following records:

- Protein concentration and sample preparation date;
- The chemical composition of the cocktail used in each well;
- Binary classification of each image (crystal/no crystal) generated by human experts.

We assembled cocktail information from 111 plates into a table of 1961 distinct cocktails involving
152 distinct chemical compounds. We compiled the classification data into an incomplete 111 × 1961
matrix of experimental outcomes (the precipitation index).
Members of the Northeast Structural Genomics Consortium (NESG) provided the proteins for the
trials. The proteins were produced in E. coli, using pET vectors with HexaHis tags, and purified by
NiNTA affinity purification followed by gel filtration purification (Acton et al., 2005). Where
identifying information was available, additional protein information was obtained from the
following sources:

- NESG SPINE database (Goh et al., 2003): organism, SWISS-PROT/TrEMBL reference, amino acid
  sequence;
- UniProt (Apweiler et al., 2004): EC number, Interpro domains;
- The Protein Calculator (Putnam, 1999): molecular weight, estimated pI, estimated charge at
  pH 7.0, estimated wavelength extinction coefficients for 278, 279, 280 and 282 nm.

These data for the 111 proteins were assembled into a third table.

Image classification

Image classification uses extracted features to label the image with one of the outcome classes:
unknown, clear, precipitate, crystal (we intend to further subdivide precipitates and crystals, and
allow for multiple classes to be used simultaneously). Different algorithms can be used to
implement individual stages. To improve performance, it may also be useful to apply a boosting
approach by combining results from multiple techniques. A variety of standard classification
methods have been used for classifying crystallization images: Fisher linear discriminants (Cumbaa
et al., 2003), self-organizing maps (Spraggon et al., 2002), naive Bayes classifiers (Wilson,
2002), decision trees (Bern et al., 2004; Zhu et al., 2004), and support vector machines (Zhu et
al., 2004).
At present, we use Linear Discriminant Analysis (LDA) (Fisher, 1936) for classification, and we
perform image analysis in four stages:
registration (locating the well within the image), segmentation (locating the droplet within the
well), feature extraction (computing 59 measures of droplet texture), and finally classification.
These are the same stages described in Cumbaa et al. (2003), with the following improvements:

Image registration. To reduce image registration error, we now factor robotic camera motion and
plate-to-plate variations in apparent well diameter (due to adjusted camera height) into our image
registration algorithm. As the imaging robot moves along its programmed path, slight
miscalibrations in its positioning system will introduce a consistent drift in (x, y) coordinates
of consecutive wells. Registration proceeds as follows:

(a) A first approximation of the coordinates of all 1536 wells is simultaneously computed at
    multiple well-diameter values.
(b) The well diameter that minimizes the variance in measured (Δx, Δy) values is selected.
(c) The computed coordinates for each well are corrected by imposing a uniform (Δx, Δy) between
    adjacent wells.

Additional image features. The 23 original image features (of which 20 measure microcrystal
correlations, 2 measure straight-lines, and 1 measures image smoothness) have been augmented by 36
additional features: 2 Euler-number measures, 2 quadtree-decomposition statistics, 10 measures of
local extrema, 19 additional straight-line measures, and 3 additional image energy measures. The
combined 26 straight-line and image energy features constitute a multi-scale analysis of image
texture missing from our earlier work.

We have evaluated performance of the augmented classifier using a set of 127 plates containing a
total of 5600 crystal images and 189472 non-crystal images. Computing image feature vectors using
unoptimized Matlab code compiled to a Linux binary took an average of 12.27 s per image, or 5.23 h
per plate, per CPU on 3.0 GHz Xeon processors. We tested the accuracy of our LDA classifier using
10-fold cross-validation. We also tested k-nearest-neighbor (KNN) classification accuracies for all
values of k between 1 and 50 neighbors.

Pattern discovery in the precipitation index

Our precipitation index database may be regarded as a transaction database and is thus subject to
frequent-itemset and association-rule discovery. Transactions in this case are sets of facts about
proteins, cocktails, or particular experiments, e.g.:
{Cocktail #222, pH < 6.0, contains: PEG 20000, CAPS, crystallizes proteins: glucose isomerase,
Q8U2K6 (N-type ATP pyrophosphatase)}.
An itemset is a subset of a transaction. Frequent-itemset discovery (Goethals, 2003) finds all
itemsets in a database whose frequency, or support, exceeds some minimum. An association rule
describes the co-occurrence of two disjoint itemsets (i.e., antecedent ⇒ consequent) in the
transactions in the database, such as:
{protein concentration > 10 mg/ml, medium molecular weight, low pI, organism: A. aeolicus, cocktail
contains CaCl2·2H2O} ⇒ {crystal}
This rule's consequent is supported by 20 of the 25 transactions supporting the antecedent in our
database, and thus we say the rule has support 20 and confidence 0.8 (20/25). Association-rule
discovery (Agrawal et al., 1993) finds all association rules in a database whose support and
confidence exceed minimum values.
We factored our database three different ways:

(1) As a set of 111 transactions on proteins, each transaction containing protein properties and
    the set of all cocktails that produced a crystallization result. Using this database, we
    performed frequent itemset mining (using support ≥ 8) in order to discover clusters of protein
    properties and crystallizing cocktails.
(2) As a set of 1961 transactions on cocktails, each transaction containing a cocktail's chemical
    additives and the set of all proteins that crystallized with this cocktail. This database was
    subjected to frequent itemset mining (using support ≥ 8) in order to discover clusters of
    cocktail properties and crystallizing proteins.
(3) As a set of 168588 transactions on individual experiments, each transaction containing
    both protein and cocktail information, and the crystallization outcome for a single
    protein/cocktail pair. On this database, we performed association-rule discovery (using
    support ≥ 8 and confidence ≥ 0.45) to find protein and cocktail properties associated with a
    positive crystal outcome.

Due to the requirements of the data mining tools, we quantized the real-valued variables (e.g., pH,
chemical concentration) in our database prior to itemset mining. Chemical concentrations were
reported only as present or not present. All other real-valued variables were split into
equal-sized bins (low, medium, and high) according to computed 33rd and 67th percentiles.

Results

Image classification

The following table compares the performance of Fisher LDA on our 59-feature dataset against the
original 23-feature dataset, introduced in Cumbaa et al. (2003), and the KNN classification (for
k = 50):

Classifier | Precision | Recall | Accuracy | ROC score
23 features | 0.10 | 0.62 | 0.83 | 0.80
59 features | 0.12 | 0.68 | 0.85 | 0.85
59 features (KNN) | 0.13 | 0.76 | 0.85 | 0.87

Precision, recall, and accuracy are based on the classifiers' confusion matrices:

Classifier | True positives (TP) | False negatives (FN) | False positives (FP) | True negatives (TN)
59 features (KNN) | 4248 | 1352 | 28881 | 160591
59 features | 3788 | 1812 | 27139 | 162333
23 features | 3500 | 2100 | 31557 | 157915

where precision = TP/(TP + FP), recall = TP/(TP + FN), and accuracy = (TP + TN)/(TP + TN + FN +
FP). The ROC scores measure the area under the respective classifiers' Receiver Operating
Characteristic curves (Figure 1).
We initially counted all crystal occurrences. It is also important to test on how many plates the
system misses all existing crystals. We found that the system missed all crystals on 2 out of 75
plates (4 nice crystals and 8 murky ones), i.e., we missed 4 good crystal hits for 2 out of 75
proteins. The misses can be partially explained by the registration error, which the updated image
classification system diminishes.

Precipitation index mining

Frequent itemset mining produced 2455 supported itemsets on the protein transactions, and 31682
supported itemsets on the cocktail transactions. Association-rule mining on experiment transactions
revealed 5487 supported association rules predicting crystal outcome. Examples of all three results
are shown in Tables 1–3.

Discussion

Image classification

Reducing drop volumes to the sub-microliter range, increasing the number of cocktails, changing the
optical setup for image capture and modifying optical qualities of crystallization plates create
challenges that require specialization and integration of image analysis methods. Evaluating the
augmented image classification system shows a marked improvement in classification accuracy, by
every measure. Precision is still low in absolute terms, made so in part by the proportional skew
of the data towards non-crystal images. Future work will focus on finer-grained classification of
outcomes by adding new feature-extraction algorithms.
Most complex domains may suffer from a potential disconnect between what experts perceive as
important features and what is objectively a computationally viable image feature with high
discriminatory potential. For that reason, it is useful to approach the problem in iterations,
taking multiple approaches to compute diverse features, collecting the data, and then thoroughly
evaluating which combination of features provides the best accuracy on testing results.
Importantly, specific features may discriminate
[Figure 1: plot not reproduced. Axis labels as printed: False-negative rate (horizontal, 0 to 1),
False-positive rate (vertical, 0 to 0.9). Curves: 59 features (KNN), 59 features (LDA), 23 features
(LDA).]
Figure 1. Receiver operating characteristic curves of the three classifiers.
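The precision, recall, and accuracy values that accompany these ROC curves can be recomputed from the confusion-matrix counts reported in the Results. The following Python sketch does so. The `Confusion` type, the function names, and the `REPORTED` dictionary are illustrative assumptions and are not part of the original Matlab system.

```python
from typing import NamedTuple


class Confusion(NamedTuple):
    """Counts for a binary crystal / no-crystal classifier."""
    tp: int  # crystal images labelled crystal
    fn: int  # crystal images labelled no-crystal
    fp: int  # non-crystal images labelled crystal
    tn: int  # non-crystal images labelled no-crystal


def precision(c: Confusion) -> float:
    return c.tp / (c.tp + c.fp)


def recall(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / (c.tp + c.tn + c.fn + c.fp)


# Counts copied from the confusion matrices reported in the Results.
REPORTED = {
    "59 features (KNN)": Confusion(tp=4248, fn=1352, fp=28881, tn=160591),
    "59 features": Confusion(tp=3788, fn=1812, fp=27139, tn=162333),
    "23 features": Confusion(tp=3500, fn=2100, fp=31557, tn=157915),
}

if __name__ == "__main__":
    for name, c in REPORTED.items():
        print(f"{name:18s} precision={precision(c):.2f} "
              f"recall={recall(c):.2f} accuracy={accuracy(c):.2f}")
```

Because crystal images make up fewer than 3% of the 195072 images, even a modest false-positive rate yields many more false positives than true positives, which is why precision stays low while accuracy and recall are comparatively high.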
Table 1. Sample frequent itemsets discovered in the database of protein transactions.
Pattern | Support
Cocktail #1456 (HR Crystal Screen-1-20: Ammonium Sulfate 0.2 M, Sodium Acetate trihydrate 0.1 M, pH 4.6, PEG 4000 25.0%), Cocktail #1499 (HR Crystal Screen-2-13: Ammonium Sulfate 0.2 M, Sodium Acetate trihydrate 0.1 M, pH 4.6, Polyethylene Glycol Monomethyl Ether 2000 30.0%) | 8
Cocktail #1018 (KBr 0.1 M, MES 0.1 M, pH 6.0, PEG 400 40.0%), Cocktail #1024 (KCl 0.1 M, MES 0.1 M, pH 6.0, PEG 400 40.0%) | 8
Cocktail #1018, Cocktail #1042 (NaNO3 0.1 M, MES 0.1 M, pH 6.0, PEG 400 40.0%) | 8
low pI, Cocktail #1445 (HR Crystal Screen-1-9: Ammonium Acetate 0.2 M, Tri-sodium Citrate dihydrate) | 10
Charge at pH 7.0 ≤ -7.8, Cocktail #1445 | 11
Bacterial protein source, Cocktail #1445 | 12
Bacterial protein source, low pI, Charge at pH 7.0 ≤ -7.8, Cocktail #1445 | 10
The first three illustrate patterns of co-crystallization. The remaining rows associate cocktails
with protein properties (isoelectric point, charge at pH 7.0, source organism) in successful
crystallization experiments.
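Itemsets such as those in Table 1 are found by counting, for each candidate set of items, how many transactions contain it. The sketch below is a minimal level-wise (Apriori-style) search written for illustration; it is not the association-mining software used in this work, and the toy transactions and the threshold of 3 are invented for the example (the study used support ≥ 8).

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[str]],
                      min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least min_support transactions."""
    def support(itemset: FrozenSet[str]) -> int:
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # candidates of size 1
    found: Dict[FrozenSet[str], int] = {}
    size = 1
    while current:
        # Keep the candidates of this size that are frequent.
        counts = {c: support(c) for c in current}
        level = {c: s for c, s in counts.items() if s >= min_support}
        found.update(level)
        # Join frequent itemsets into candidates one item larger, keeping only
        # those whose every subset of the current size is itself frequent.
        size += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return found


# Hypothetical protein transactions (properties plus crystallizing cocktails).
TOY = [
    {"low pI", "bacterial", "cocktail #1445"},
    {"low pI", "bacterial", "cocktail #1445", "cocktail #1018"},
    {"low pI", "cocktail #1445"},
    {"bacterial", "cocktail #1018"},
]
RESULT = frequent_itemsets(TOY, min_support=3)
```

With these toy data, the pair {low pI, cocktail #1445} survives with support 3, while {bacterial, cocktail #1445} is discarded because only two transactions contain both items.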
Table 2. Sample frequent itemsets discovered in the database of cocktail transactions.
Pattern | Support
Glucose isomerase, YB61_HAEIN (Hypothetical UPF0152 protein HI1161), Q8U2K6 (N-type ATP pyrophosphatase) | 108
Glucose isomerase, Q8U189 (putative Transcriptional activator), Q8U2K6 | 125
Basic pH, PEG 4000, Q8U2K6 | 47
PEG 8000, glucose isomerase, Q8U189 | 39
Neutral pH, HEPES, glucose isomerase, Q8U2K6 | 48
Basic pH, TRIS, Q8U189, Q8U2K6 | 32
Acidic pH, PEG 4000, sodium acetate, glucose isomerase, YB61_HAEIN | 8
Acidic pH, PEG 8000, sodium acetate, glucose isomerase, YB61_HAEIN | 14
The first two illustrate clusters of co-crystallizing proteins. The third illustrates discovered
crystallization trends of a single protein. The rest are examples of co-crystallization trends of
pairs of proteins.
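Each row of Table 2 is an itemset drawn from cocktail transactions, in which chemicals appear only as present or absent and real-valued variables are reduced to three bins. As a rough illustration of that preparation step, the Python sketch below bins values at the tertiles and assembles one transaction. The item naming scheme (`contains:`, `pH:`, `crystallizes:`) and the use of `statistics.quantiles` with its default interpolation are assumptions; the paper does not state how its percentiles were computed.

```python
import statistics
from typing import Iterable, List, Sequence, Set


def tertile_labels(values: Sequence[float]) -> List[str]:
    """Label each value low, medium, or high using the two tertile cut points."""
    low_cut, high_cut = statistics.quantiles(values, n=3)
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("low")
        elif v <= high_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


def cocktail_transaction(chemicals: Iterable[str], ph_label: str,
                         crystallized: Iterable[str]) -> Set[str]:
    """Build one cocktail transaction: additives, binned pH, crystallizing proteins."""
    items = {f"contains: {chem}" for chem in chemicals}  # presence only
    items.add(f"pH: {ph_label}")
    items.update(f"crystallizes: {protein}" for protein in crystallized)
    return items


# Hypothetical pH values for six cocktails.
PH_VALUES = [4.6, 6.0, 7.0, 7.5, 8.5, 9.0]
PH_LABELS = tertile_labels(PH_VALUES)
EXAMPLE = cocktail_transaction(["PEG 4000", "TRIS"], PH_LABELS[4], ["Q8U2K6"])
```

Binning before mining keeps the item vocabulary small, at the cost of hiding any effect that depends on where a value falls within a bin.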
Table 3. Sample antecedents of association rules predicting crystallization discovered in the experiment transactions.
Antecedent | Support | Crystal confidence
Protein: Archaea, high WME282, unknown function; cocktail: PEG 4000, magnesium sulfate heptahydrate | 10/10 | 1
Protein: medium MW, unknown function, P. furiosus; cocktail: PEG 8000, ammonium phosphate-dibasic | 8/8 | 1
Protein: 12 < mg/mL, Archaea, charge > -3.3, high WME282; cocktail: PEG 8000, TAPS | 15/16 | 0.94
Protein: medium MW, high pI, P. furiosus; cocktail contains PEG 4000, TAPS | 15/16 | 0.94
Protein: medium MW, charge > -3.3, P. furiosus; cocktail: PEG 4000, TAPS | 15/16 | 0.94
Protein: medium MW, high pI; cocktail: PEG 8000, TAPS, P. furiosus | 15/16 | 0.94
Protein: medium MW, charge > -3.3; cocktail: 8000, TAPS, P. furiosus | 15/16 | 0.94
Protein: Archaea, medium MW, high pI; cocktail: PEG 4000, TAPS | 15/16 | 0.94
Protein: Archaea, medium MW, high pI; cocktail: PEG 8000, TAPS | 15/16 | 0.94
Protein: medium MW, low pI, -7.8 < charge ≤ -3.3; cocktail: acidic pH, MPD | 10/22 | 0.45
Protein: mg/mL ≤ 10, medium pI, E. faecalis; cocktail: acidic pH, 20000, MES | 11/20 | 0.55
Protein: Archaea, medium MW, high pI; cocktail: basic pH, lithium chloride | 17/21 | 0.81
Protein: medium MW, low pI, -7.8 < charge ≤ -3.3; cocktail: acidic pH, MPD, sodium cacodylate | 9/15 | 0.6
Protein: medium pI, acidic pH, IPR011060; cocktail: MPD, sodium cacodylate | 10/15 | 0.67
Protein: medium pI, IPR005627; cocktail: acidic pH, MPD | 11/22 | 0.5
Protein: 12 < mg/mL, medium pI, IPR005627; cocktail: acidic pH, MPD | 11/22 | 0.5
Protein: acidic, IPR005627, IPR011060; cocktail: spermine tetra-hcl | 8/9 | 0.89
Protein: IPR005627, IPR011060; cocktail: manganese sulfate monohydrate | 9/18 | 0.5
Protein: IPR005627, IPR011060; cocktail: PEG 3350, bis-tris | 9/13 | 0.69
Protein: acidic pH, IPR005627, IPR011060; cocktail: MPD, sodium cacodylate | 10/15 | 0.67
Protein: high WME278, EC4.2.-.-; cocktail: PEG 1000, MES | 15/19 | 0.79
Protein: Bacteria, medium pI, EC4.2.-.-; cocktail: PEG 1000, MES | 15/19 | 0.79
Protein: low MW, low pI, EC3.5.-.-; cocktail: PEG 400, MES | 11/19 | 0.58
Specific protein properties (source organism, EC numbers, Interpro domains, molecular weight,
isoelectric point, wavelength extinction coefficients, charge at pH 7.0, and concentration) combine
with specific cocktail properties (pH, additives) to predict a crystallization result. Confidence
of the crystallization result is indicated in the right column; support for the rule is indicated
in the middle column.
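The support and confidence columns of Table 3 follow directly from transaction counts: a rule with support 15/16 matched 16 experiments, 15 of which produced crystals, giving confidence 15/16 ≈ 0.94. The Python sketch below computes these quantities and applies one of the filters named in the Discussion, which drops a rule when a more general rule (one with a strict subset of its antecedent) has higher confidence. The `Rule` class, the function names, and the toy transactions are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    support: int  # transactions matching the antecedent AND the consequent
    matched: int  # transactions matching the antecedent

    @property
    def confidence(self) -> float:
        return self.support / self.matched


def crystal_rule(transactions: List[Set[str]], antecedent: Set[str],
                 consequent: str = "crystal") -> Rule:
    """Count how often the antecedent occurs, and how often with the consequent."""
    matching = [t for t in transactions if antecedent <= t]
    hits = sum(1 for t in matching if consequent in t)
    return Rule(frozenset(antecedent), hits, len(matching))


def is_redundant(rule: Rule, others: Iterable[Rule]) -> bool:
    """True if a strictly more general rule has a higher confidence."""
    return any(o.antecedent < rule.antecedent and o.confidence > rule.confidence
               for o in others)


# Hypothetical experiment transactions: 16 with PEG 4000 and TAPS, 15 crystallized.
TX = ([{"TAPS", "PEG 4000", "high pI", "crystal"}] * 8
      + [{"TAPS", "PEG 4000", "crystal"}] * 7
      + [{"TAPS", "PEG 4000", "high pI"}])

GENERAL = crystal_rule(TX, {"TAPS", "PEG 4000"})
SPECIFIC = crystal_rule(TX, {"TAPS", "PEG 4000", "high pI"})
```

Here the general rule has confidence 15/16 and the more specific rule 8/9, so adding "high pI" to the antecedent lowers confidence and the specific rule would be filtered out.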
only certain subclasses of all images, and thus a combination of features is necessary.
In protein crystallization, the most important images represent crystals. Protein crystals come in
diverse forms: micro-crystals, micro-needles, needle crystals, fan-shaped crystals, and larger
faceted crystals. Although most crystals exhibit straight edges in a drop, this important feature
may not be detectable for micro-crystals in a high-throughput setup, owing to insufficient drop
size. To diminish this problem, we need to augment the set of image features with ones designed
specifically for detecting micro-crystals. In addition, many protein crystallization outcomes can
occur simultaneously in a single experiment: precipitation, crystal formation, microcrystal
formation, skin effect, and phase separation. Thus, extending classification of precipitates by
characterizing image smoothness can also contribute to improving classification accuracy.

Precipitation index mining

Data mining can integrate results from high-throughput screens with information about crystallizing
conditions, intrinsic protein properties, and results from crystallization optimization.
Association mining helps segregate proteins into groups based on how they react in a broad range of
conditions, and group cocktails to reflect their potential to achieve crystallization. These
results may lead to crystallization screen optimization, and reveal associations between protein
properties and crystallization conditions. Clustering can be used on the precipitation index to
identify cocktails or proteins with similar crystallization potential.
Preliminary results suggest relationships that group together certain proteins, protein
characteristics, or chemical properties based on crystallization outcomes. These patterns and
others discovered by frequent itemset and association-rule mining are numerous. A filtering step is
required in order to separate the few interesting rules from the many uninteresting rules
discovered by a basic association-rule-mining algorithm. For example,

- We already limit our association-rule miner to finding rules with crystal as the consequent. More
  generally, interesting rules adhere to a [{independent variables} ⇒ dependent variables] format.
- A rule A ⇒ B is uninteresting if a more general rule A′ ⇒ B (i.e., A′ ⊂ A) has a higher
  confidence.
- An itemset A = B ∪ C is uninteresting if A occurs no more frequently than if B and C were
  independent (i.e., P(A) ≤ P(B)P(C)).
- Reported rules must satisfy some test of statistical significance, e.g., a p-value.

Gopalakrishnan et al. (2004) discuss similar challenges and describe a semi-manual process for
filtering their discovered rules for interestingness and novelty.
Further, the quantization step should be incorporated into the mining algorithm, i.e., quantitative
association rule mining (Srikant and Agrawal, 1996). Additionally, any reasoning about
crystallization conditions is limited by the incomplete data in the database, e.g., buffers in
proteins, pH of cocktails, other characteristics of proteins.

Case-based reasoning

We postulate that past experience will lead us to the identification of initial conditions
favorable to crystallization. Moreover, it is believed that solubility experiments can provide a
quantitative measure of similarity among proteins. If we consider only 15 possible conditions, each
having 15 possible values, the result would be 15^15, or approximately 4.38 × 10^17, possible
experiments, far too many to test exhaustively. Assuming that two proteins react similarly when
tested against a large set (over 1500) of precipitating agents in the search phase of
crystallization, then the planning strategies successfully employed for the one may be profitably
applied to the other, while failed crystallizations can be avoided. Thus, it is important to
identify a suitable set of precipitating agents to sort the outcomes of reactions for a relatively
large group of proteins. New crystallization challenges are then approached by the execution and
analysis of a set of precipitation reactions, followed by an automated identification of similar
proteins and an analysis of the recipes used to crystallize them, i.e., crystal growth method,
temperature and pH ranges, concentration of a protein, and crystallization agent. Importantly, this
does not require a generic strategy that would apply for all proteins; we treat each protein
individually using a KNN algorithm. Further, a crystal result is not required either, as different
patterns of clear drops and precipitates can be equally informative for the protein similarity
measure.

Acknowledgements

The authors would like to thank the Hauptman–Woodward HTS lab for providing image data and
especially Angela Lauricella for her meticulous hand-scoring of the images, and George T. DeTitta
and Joe R. Luft for long-term collaboration and many fruitful discussions that led to the
development of this system. We are grateful to Thomas Acton, Rong Xiao, Gaetano Montelione, and
other members of the NESG Protein Production team for providing the many protein samples that have
made this work possible. We also thank Max Kotlyar for the use of his association-mining software.
This research was supported by the Natural Sciences and Engineering Research Council of Canada
(#RGPIN 203833–02), National Institutes of Health (#P50 GM-62413), and an IBM Shared University
Research grant.

References

1. Acton, T.B., Gunsalus, K., Xiao, R., Ma, L., Aramini, J., Baron, M.C., Chiang, Y., Clement, T.,
   Cooper, B., Denissova, N., Douglas, S., Everett, J.K., Palacios, D., Paranji, R.H., Shastry, R.,
   Wu, M., Ho, C.-H., Shih, L., Swapna, G.V.T., Wilson, M., Gerstein, M., Inouye, M., Hunt, J.F.
   and Montelione, G.T. (2005) Meth. Enzymol. 394, 210–243.
2. Agrawal, R., Imielinski, T. and Swami, A.N. (1993) Mining Association Rules between Sets of
   Items in Large Databases. In (Eds., Buneman, P., Jajodia, S.), Proceedings of the 1993 ACM
   SIGMOD International Conference on Management of Data, ACM Press, 207–216.
3. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E.,
   Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N. and
   Yeh, L.S. (2004) Nucleic Acids Res. 32, D115–D119.
4. Bern, M., Goldberg, D., Stevens, R.C. and Kuhn, P. (2004) J. Appl. Cryst. 37, 279–287.
5. Cumbaa, C., Lauricella, A.M., Fehrman, N.A., Veatch, C.K., Collins, R.J., Luft, J.R., DeTitta,
   G.T. and Jurisica, I. (2003) Acta Cryst. D 59, 1619–1627.
6. Fisher, R. (1936) Ann. Eugenics 7, 179–188.
7. Gilliland, G.L., Tung, M. and Ladner, J. (1996) J. Res. Natl. Inst. Stand. Technol. 101(3),
   309–320.
8. Goethals, B. (2003) Survey on Frequent Pattern Mining.
   http://www.cs.helsinki.fi/u/goethals/publications/survey.pdf. Manuscript.
9. Goh, C.S., Lan, N., Echols, N., Douglas, S.M., Milburn, D., Bertone, P., Xiao, R., Ma, L.C.,
   Zheng, D., Wunderlich, Z., Acton, T., Montelione, G.T. and Gerstein, M. (2003) Nucleic Acids
   Res. 31(11), 2833–2838.
10. Gopalakrishnan, V., Livingston, G., Hennessy, D., Buchanan, B. and Rosenberg, J.M. (2004) Acta
    Crystallogr. D Biol. Crystallogr. 60, 1705–1716.
11. Luft, J.R., Collins, R.J., Fehrman, N.A., Lauricella, A.M., Veatch, C.K. and DeTitta, G.T.
    (2003) J. Struct. Biol. 142(1), 170–179.
12. Putnam, C. (1999) The Protein Calculator. http://www.scripps.edu/cdputnam/protcalc.html.
    Scripps Research Institute.
13. Spraggon, G., Lesley, S.A., Kreusch, A. and Priestle, J.P. (2002) Acta Crystallogr. D Biol.
    Crystallogr. 58(11), 1915–1923.
14. Srikant, R. and Agrawal, R. (1996) Mining Quantitative Association Rules in Large Relational
    Tables. In (Eds., Jagadish, H.V., Mumick, I.S.), Proceedings of the 1996 ACM SIGMOD
    International Conference on Management of Data, ACM Press, 1–12.
15. Wilson, J. (2002) Acta Cryst. D 58, 1907–1914.
16. Zhu, X., Sun, S., Cheng, S.E. and Bern, M. (2004) Classification of Protein Crystallization
    Imagery. 26th Annual International Conference of IEEE Engineering in Medicine and Biology
    Society, IEEE Press.
