J = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{p} \, \|x_i - c_j\|^2 , \qquad 1 < p < \infty \qquad (4.1)
In Equation 4.1, p is any real number greater than one and defines the degree of fuzziness, u_{ij} is the membership level of event x_i in cluster j, and c_j is the center of cluster j. The fuzzy clustering is done through an iterative optimization of Equation 4.1. In each iteration, the memberships u_{ij} are updated using Equation 4.2 and the cluster centers c_j are updated using Equation 4.3.
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{2/(p-1)}} \qquad (4.2)
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{p} \, x_i}{\sum_{i=1}^{N} u_{ij}^{p}} \qquad (4.3)
The following is an outline of a fuzzy c-means algorithm.
1. Given the number of clusters, c, randomly choose c data points as cluster centers.
2. For each cluster, sum the distance to each data point weighted by its membership in that
cluster.
3. Recompute each cluster center by dividing by the associated membership value of each event.
4. Stop if there is minimal change in the cluster center, otherwise return to 2.
5. Report cluster centers.
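As an illustrative sketch (not the thesis's CUDA implementation), the update rules of Equations 4.2 and 4.3 can be written in plain Python; the function name and the explicit `init` argument for the starting centers are conveniences of this sketch.

```python
import math
import random

def fuzzy_c_means(points, c, p=2.0, eps=1e-6, max_iter=100, init=None, seed=0):
    """Fuzzy c-means: iteratively apply Equations 4.2 and 4.3."""
    rng = random.Random(seed)
    # Step 1: choose c data points as the initial cluster centers.
    centers = [list(pt) for pt in (init if init is not None
                                   else rng.sample(points, c))]
    u = []
    for _ in range(max_iter):
        # Membership update (Equation 4.2).
        u = []
        for x in points:
            dists = [math.dist(x, cj) for cj in centers]
            if 0.0 in dists:
                # Event coincides with a center: give it full membership there.
                u.append([1.0 if d == 0.0 else 0.0 for d in dists])
            else:
                u.append([1.0 / sum((dists[j] / dk) ** (2.0 / (p - 1.0))
                                    for dk in dists) for j in range(c)])
        # Center update (Equation 4.3): membership-weighted mean of all events.
        new_centers = []
        for j in range(c):
            w = [u_i[j] ** p for u_i in u]
            total = sum(w)
            new_centers.append([sum(wi * pt[d] for wi, pt in zip(w, points)) / total
                                for d in range(len(points[0]))])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:   # Step 4: stop on minimal change in the centers.
            break
    return centers, u
```

For well-separated groups of events, the returned centers converge to the group means and each event's membership row sums to one.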
4.1.2 Gaussian Mixture Models
Flow cytometry data is composed of many distinct subclasses or clusters. The data for each vector (or event) is an aggregate of a mixture of multiple distinct behaviors. Mixture distributions form probabilistic models composed of a number of component subclasses [20]. Given an M-dimensional data set, each subclass k is characterized by [20]:
\pi_k - the probability that a sample in the data set belongs to the subclass
\mu_k - the spectral mean
R_k - an M \times M spectral covariance matrix.
Assuming there are N flow cytometry events Y_1, Y_2, \ldots, Y_N, the likelihood that an event Y_n belongs to a Gaussian distribution subclass X_n is given by [20]:
p_{y_n|x_n}(y_n | k, \theta) = \frac{1}{(2\pi)^{M/2} |R_k|^{1/2}} \exp\left( -\frac{1}{2} (y_n - \mu_k)^t R_k^{-1} (y_n - \mu_k) \right) \qquad (4.4)
It is not known which subclass each event belongs to; therefore, it is necessary to calculate the likelihood for each subclass and apply conditional probability [20].
p_{y_n}(y_n | \theta) = \sum_{k=1}^{K} p_{y_n|x_n}(y_n | k, \theta) \, \pi_k
Neither the statistical parameters of the Gaussian Mixture Model, \theta = (\pi, \mu, R), nor the membership of events to subclasses are known a priori. An algorithm must be employed to deal with this lack of information.
EM
Expectation maximization is a statistical method for performing likelihood estimation with incomplete data [2]. The objective of the algorithm is to estimate K, the number of subclasses, and \theta, the parameters for each subclass. First, each event Y_n is classified based on the likelihood criteria above. However, instead of a hard classification, it is desirable to compute a soft classification for each event [20]:
p_{x_n|y_n}(k | y_n, \theta^{(i)}) = \frac{ p_{y_n|x_n}(y_n | k, \theta^{(i)}) \, \pi_k }{ \sum_{l=1}^{K} p_{y_n|x_n}(y_n | l, \theta^{(i)}) \, \pi_l } \qquad (4.5)
Then the subclass parameters, \theta, are re-estimated [20]:
N_k = \sum_{n=1}^{N} p_{x_n|y_n}(k | y_n, \theta^{(i)}) \qquad (4.6)

\pi_k = \frac{N_k}{N} \qquad (4.7)

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} y_n \, p_{x_n|y_n}(k | y_n, \theta^{(i)}) \qquad (4.8)

R_k = \frac{1}{N_k} \sum_{n=1}^{N} (y_n - \mu_k)(y_n - \mu_k)^t \, p_{x_n|y_n}(k | y_n, \theta^{(i)}) \qquad (4.9)
The event classification (E-step) and re-estimation of subclass parameters (M-step) repeat until the change in likelihoods for the events is less than some \epsilon.
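To make the E-step and M-step concrete, here is a sketch for the one-dimensional case, where each covariance matrix R_k reduces to a scalar variance. The fixed iteration count and the `mu0` initialization argument are simplifications of this sketch; the algorithm described above instead stops when the likelihood change falls below \epsilon.

```python
import math

def gmm_em(y, k, iters=100, mu0=None):
    """EM for a 1-D Gaussian mixture: soft classification (Eq. 4.5)
    followed by parameter re-estimation (Eqs. 4.6-4.9)."""
    n = len(y)
    mu = list(mu0) if mu0 is not None else list(y[:k])
    pi = [1.0 / k] * k
    mean_y = sum(y) / n
    var = [sum((yi - mean_y) ** 2 for yi in y) / n] * k  # start at global variance
    for _ in range(iters):
        # E-step: p(x_n = k | y_n, theta) for every event (Eq. 4.5).
        resp = []
        for yi in y:
            num = [pi[j] / math.sqrt(2 * math.pi * var[j])
                   * math.exp(-0.5 * (yi - mu[j]) ** 2 / var[j]) for j in range(k)]
            z = sum(num)
            resp.append([nj / z for nj in num])
        # M-step: re-estimate N_k, pi_k, mu_k, R_k (Eqs. 4.6-4.9).
        for j in range(k):
            nk = sum(r[j] for r in resp)                            # Eq. 4.6
            pi[j] = nk / n                                          # Eq. 4.7
            mu[j] = sum(r[j] * yi for r, yi in zip(resp, y)) / nk   # Eq. 4.8
            var[j] = max(1e-9, sum(r[j] * (yi - mu[j]) ** 2
                                   for r, yi in zip(resp, y)) / nk)  # Eq. 4.9
    return pi, mu, var
```

On data drawn from two well-separated modes, the estimated means converge to the mode centers and the mixing weights \pi_k sum to one.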
Hierarchical Clustering Stage
The clustering begins with a user-specified number of clusters. The algorithm then performs EM on the clusters and determines the Gaussian model parameters for each cluster. This involves computing Equation 4.5 for every event for every subclass and then Equations 4.6, 4.7, 4.8, and 4.9 for each subclass.
An MDL (Rissanen) score [21] is then calculated using Equation 4.10. The Minimum Description Length (MDL) principle extends the classical maximum likelihood principle by attempting to describe the data with the minimum number of binary digits required to represent the data with some precision [21]. The score serves as an information criterion and helps the unsupervised algorithm determine the optimal solution (i.e., how many clusters).
MDL(K, \theta) = -\sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} p_{y_n|x_n}(y_n | k, \theta) \, \pi_k \right) + \frac{1}{2} L \log(NM) . \qquad (4.10)
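As a sketch of how the score might be computed, the following assumes `likelihoods[n][k]` holds p_{y_n|x_n}(y_n | k, \theta) and takes L to be the number of free parameters of a K-subclass full-covariance Gaussian mixture (K - 1 mixing weights, KM mean entries, and KM(M+1)/2 covariance entries); that parameter count is an assumption about how L is defined.

```python
import math

def mdl_score(likelihoods, weights, n_dims):
    """Equation 4.10: negative log-likelihood plus a parameter-count penalty."""
    n, k, m = len(likelihoods), len(weights), n_dims
    log_lik = sum(math.log(sum(l * w for l, w in zip(row, weights)))
                  for row in likelihoods)
    num_params = k * (1 + m + m * (m + 1) // 2) - 1  # assumed definition of L
    return -log_lik + 0.5 * num_params * math.log(n * m)
```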
The algorithm then attempts to combine the two most similar clusters. In this case, similarity is based upon the Gaussian model parameters. A distance function is computed between all possible combinations of clusters. The two clusters with the minimum distance (i.e., most similar) are combined into a new cluster. The distance function is defined by [20] as:
d(l, m) = \frac{N_l}{2} \log\left( \frac{|R_{(l,m)}|}{|R_l|} \right) + \frac{N_m}{2} \log\left( \frac{|R_{(l,m)}|}{|R_m|} \right)
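For intuition, the distance can be evaluated in one dimension, where each determinant |R| reduces to a variance. Here R_{(l,m)} is taken to be the variance of the pooled cluster computed from the two clusters' counts, means, and variances; that merging rule is an assumption of this sketch.

```python
import math

def merge_distance(n_l, mean_l, var_l, n_m, mean_m, var_m):
    """d(l, m) in one dimension: zero for identical clusters, larger the
    more merging the two clusters would inflate the combined variance."""
    n = n_l + n_m
    mean = (n_l * mean_l + n_m * mean_m) / n
    # Variance of the merged cluster from pooled second moments.
    var = (n_l * (var_l + mean_l ** 2) + n_m * (var_m + mean_m ** 2)) / n - mean ** 2
    return (n_l / 2.0) * math.log(var / var_l) + (n_m / 2.0) * math.log(var / var_m)
```

Two identical clusters merge "for free" (distance zero), so the algorithm always combines the most redundant pair first.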
This process repeats until the data has all been combined into a single cluster. Finally, the configuration with the minimum Rissanen score is output as the optimal solution. The results are two-fold. First, there are the statistical parameters, \theta = (\pi, \mu, R), for each Gaussian cluster. Second, all events have membership values for every cluster. Figure 4.1 summarizes the basic steps of the clustering procedure.
[Figure 4.1 flowchart: initialize the number of clusters and the cluster parameters; perform a soft classification of each pixel using the cluster parameters; re-estimate the cluster parameters (i.e., mean vectors and covariance matrices); measure goodness of fit using the Rissanen criterion, saving the result if it is the best fit so far; reduce the number of clusters by combining the two nearest clusters; exit when only one cluster remains.]
Figure 4.1: High-Level Clustering Procedure [20]
4.1.3 Exhaustive Bivariate
While FLAME [14] and Lo et al. [15] show promise for multivariate gating of flow cytometry data, multivariate techniques often fall victim to Bellman's so-called curse of dimensionality [22]. The clustering performance of many algorithms degrades as the dimensionality of the data increases [23]. Expectation maximization, for example, is well known to fall victim to local minima, and the likelihood of getting stuck in a local minimum increases with higher dimensionality. The relative scaling of all the dimensions becomes increasingly important as well, since the effect of an important dimension can be dwarfed by other dimensions.
The following is a novel clustering approach proposed by James S. Cavenaugh that will be explored as part of this thesis. The idea is a subspace clustering algorithm that performs an exhaustive bivariate analysis. Rather than clustering the d-dimensional data all at once with a multivariate clustering technique, it analyzes every bivariate subspace of the data. In other words, it analyzes every combination of two dimensions of the data. This results in
\binom{d}{2} = \frac{d(d-1)}{2} \qquad (4.11)
bivariate combinations. Every subspace is then clustered. The choice of the bivariate clustering algorithm is flexible, provided it can produce hard cluster associations for every event (one can also classify soft clustering results into hard clustering results). For simplicity, the other clustering algorithms being implemented for this thesis will be used for bivariate clustering: c-means and EM with Gaussians. The individual clusterings produce hard-clustering results for every event in every subspace s_i, as seen in Table 4.1. The numbers in the subspace columns indicate which cluster the events correspond to in that subspace clustering. The numbers of clusters found for the subspaces do not need to be identical.
Table 4.1: Intermediate exhaustive bivariate results

Event   s_1   s_2   ...   s_(d choose 2)
0       3     4     ...   1
1       1     3     ...   2
...     ...   ...   ...   ...
n       5     2     ...   4
After performing all of the individual clusterings, it is necessary to combine the results into final clusters. As seen in Table 4.1, every event has a vector of cluster memberships. Final clusters are formed by grouping events that have identical vectors, that is, events that exist in all of the same clusters in each subspace. If two events exhibit similar responses in all combinations of the fluorescent dimensions, then they should be grouped into the same clusters in each bivariate clustering and are likely very similar in a biological sense. The potential also exists for more domain-expert-oriented grouping. For example, a biologist may be interested in cells that have a similar response in only a few of the dimensions, and therefore the number of subspaces used for grouping similar cells could be greatly reduced.
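The overall scheme can be sketched as follows; `cluster_fn` stands in for any hard bivariate clusterer (such as c-means or EM with hardened memberships) that returns one label per event, which is an assumed interface for this sketch.

```python
from itertools import combinations
from collections import defaultdict

def exhaustive_bivariate(data, cluster_fn):
    """Exhaustive bivariate grouping. `data` is an n x d list of events;
    `cluster_fn(points_2d)` returns a hard cluster label per event."""
    d = len(data[0])
    # Hard-cluster every one of the (d choose 2) bivariate subspaces.
    label_vectors = [[] for _ in data]
    for a, b in combinations(range(d), 2):
        labels = cluster_fn([(row[a], row[b]) for row in data])
        for vec, lab in zip(label_vectors, labels):
            vec.append(lab)
    # Final clusters: events whose label vectors (rows of Table 4.1) match.
    groups = defaultdict(list)
    for i, vec in enumerate(label_vectors):
        groups[tuple(vec)].append(i)
    return list(groups.values())
```

Restricting the grouping key to a subset of the subspaces would implement the domain-expert-oriented variant described above.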
This algorithm has a number of perceived benefits. First, it allows the individual clustering algorithms to partially escape the curse of dimensionality, since they are only clustering 2-dimensional data. Algorithms could be chosen that are known to work well at clustering bivariate data, and with lower computational complexity. Many clustering algorithms, like EM, have computational complexity that grows exponentially as dimension increases, whereas the workload for this algorithm grows at a rate of d choose 2. In essence, it is a form of dynamic programming [22], breaking a large problem into many smaller ones. Second, the individual subspace clusterings are embarrassingly parallel and can be distributed to different computational nodes in a computer cluster or grid. The only overheads are the initial distribution of the subspace data and the aggregation of results from all nodes. Unfortunately, the amount of result data generated for this technique is much larger than for the other multivariate techniques discussed (n \binom{d}{2} \cdot 4 bytes instead of n \cdot k \cdot 4 bytes, where k is the number of clusters). However, it is possible to form a hierarchy of the nodes in the computing cluster to perform a parallel reduction of the results rather than burdening a master node with the job of aggregating all of the data by itself. Finally, since it uses bivariate clustering, it is similar to current-practice techniques in flow cytometry (which, although they provide incomplete analyses of the entire data set, are known to work well), but makes use of all combinations of dimensions rather than only a few.
4.2 Data Clustering on GPGPU
The abundant parallelism makes data clustering algorithms a natural choice for implementation using GPGPU. In 2004, Hall et al. implemented k-means using Cg and achieved a speedup of 3x versus a contemporary CPU. In 2006, Takizawa et al. implemented k-means using fragment shaders and Nvidia 6800 GPUs [24]. The implementation in [24] showed a speedup of only 4x relative to a cluster of CPUs without GPUs, even though it divided the task among a cluster of GPU-equipped PCs using MPI. These efforts showed it was possible to implement a data clustering algorithm using a graphics pipeline and achieve speedup.
The introduction of more advanced GPU architectures and coding frameworks for general-purpose computing on GPUs allowed for much more significant speedup of data clustering algorithms on GPUs. In 2008, Che et al. implemented k-means with an impressive speedup of 72x using CUDA and an Nvidia 8800 GTX GPU [25]; compared to a multi-threaded version running on a quad-core processor, it still maintained a speedup of 30x.
While the performance results of recent k-means implementations on GPUs and other parallel architectures are impressive, k-means is an embarrassingly parallel algorithm and its spherical bias is not well suited to analyzing flow cytometry data, where clusters often have very diverse non-spherical shapes. Outliers can also have a significant impact on the resulting cluster centers with k-means. Despite these shortcomings, k-means is still a de facto standard clustering algorithm used in a variety of applications that has been implemented on many platforms and parallel architectures, and thus is a good basis for comparison.
Using a fuzzy version of k-means, where data points have a membership value in all of the clusters rather than belonging to only one cluster, can lessen the effect of outliers. It also produces better results when the number of specified clusters does not match the number of natural clusters in the data. A hard clustering may attempt to create multiple adjacent, but not overlapping, clusters inside one natural cluster. A soft clustering is more likely to simply have multiple overlapping clusters with approximately the same center, which more accurately reflects the underlying data. Therefore the thesis will implement and examine a c-means algorithm (the literature uses C for soft clustering and K for hard clustering).
Shalom et al. implemented c-means (a fuzzy version of k-means) on an Nvidia 8800 GTX using the OpenGL Shading Language [26]. Results were impressive, with over 70x speedup on high-dimensional data. Their implementation also scales well to a large number of clusters and dimensions. Our preliminary single-GPU fuzzy c-means implementation has a speedup of over 100x on flow cytometry data. There are some additional areas for performance improvement, and the implementation will be extended to use multiple GPUs.
In addition, this thesis will implement an Expectation Maximization (EM) algorithm with Gaussian Mixture Models (GMMs). A recent publication from June 2009 implemented EM with GMMs using CUDA [27]. Using hardware similar to the aforementioned CUDA implementations of c-means, they achieved a speedup of 120x for particular data sizes. One limitation of their implementation is that it uses only diagonal covariance matrices, rather than full covariance matrices, for the Gaussian Mixture Models. This simplifies the EM equations (in particular, finding the determinant and inverse of the covariance matrix becomes trivial) and the data structure required for the algorithm; however, it does not allow for any dimensions to be statistically dependent upon one another, which may occur in real data. It also does not make use of multiple GPUs nor include any information criterion for unsupervised assessment of clustering results. Other CUDA applications have been developed using GMMs, such as anomaly detection in hyperspectral image processing [28], achieving overall speedup factors of 20x, and over 100x for specific portions of the algorithm.
5. Project Deliverables
Essential Deliverables
- The thesis document
- Conference or journal paper
- CUDA fuzzy c-means clustering algorithm implementation using multiple GPUs (single-GPU version implemented by a previous student; can be improved)
- Sequential fuzzy c-means clustering algorithm implementation using a single CPU (implemented by a previous student; requires some modification for performance analysis)
- CUDA Gaussian Mixture Model clustering algorithm implementation using multiple GPUs
- Sequential Gaussian Mixture Model clustering algorithm implementation using a single CPU (already implemented by Bouman [20], but requires some modifications)
- Workflow software for extracting data from FCS files, applying compensation information to the data (if available), and running various clustering algorithms on said data (started by a previous student; needs additional work)
- Testing suite for profiling the performance of the algorithms and generating results
- Exhaustive Bivariate technique making use of the CUDA-enhanced clustering algorithms
- Clustering results using synthetic data to assess basic functionality and accuracy of the algorithm implementations
- Clustering results using real flow cytometry data for both algorithms with a variety of different parameters
Wishlist / Reach Deliverables
- Multi-core CPU implementations of c-means and Gaussian Mixture Models
- Support for additional models in the EM implementation, such as a t mixture or skewed t mixture (as used in the Pyne et al. [14] and Lo et al. [15] papers)
- Integration of results with a database for conducting biological inference and querying data
- Visualization
5.1 Performance Analysis
Performance Metrics
- Accuracy (of clustering results). Compare the algorithms and how well they cluster both known statistical data and real flow cytometry data.
- Raw performance (FLOPS). Utilization of the peak performance of the GPU architecture. GPU occupancy and kernel profiling.
- Speedup of the computations. Examine how speedup is affected by varying the number of events, the dimensionality, and the number of clusters.
- Speedup of the whole program/workflow (including I/O and extra overhead associated with the GPU, such as host-to-device memory copying and synchronization). This includes a detailed profiling of the percentage of the total computation for each portion of the application.
- Scalability. How do the number of GPU cores, the number of GPUs, and the number of CPU+GPU nodes affect speedup?
6. Thesis Outline
1. Abstract
2. Introduction
3. Background
(a) Data Clustering
i. K-means
ii. C-means
iii. Expectation Maximization
(b) Flow Cytometry
(c) GPGPU
4. Supporting Work
(a) Data Clustering for Flow Cytometry
(b) Data Clustering with GPGPU
5. Parallel Algorithms
(a) C-means
(b) EM
(c) Exhaustive Bivariate
6. Flow Cytometry Workflow
(a) FCS data extraction
(b) Data compensation, filtering, transformations
(c) Clustering
(d) Aggregating and storing results
(e) Visualization
(f) Expert Analysis
7. Results
(a) Testing Environment
(b) Performance
i. Speedup
ii. Scalability
A. # Data Points
B. # Dimensions
C. # Clusters
iii. Resource Utilization
A. System / OS
B. GPUs
(c) Clustering - Synthetic Data
(d) Clustering - Real Flow Cytometry Data
8. Conclusions
7. Schedule
October
- Write multi-GPU versions of c-means and GMM
- Verify functionality with synthetic data
- Scripting for thorough performance study
- Enhance GUI / workflow software

November
- CUDA kernel profiling; tweak algorithms for additional performance gains if possible
- Collect data
- Analyze data
- Clustering results on real FCS data

December
- Finish writing thesis document
- Prepare conference/journal paper for performance study
- Prepare for defense
- Defend
8. Required Resources
- Multiple CUDA cards with 1 GB or more of onboard memory
- Multi-core desktop machines
- Flow cytometry data
- Cluster with CUDA cards (such as Lincoln on the TeraGrid) for the MPI version of Exhaustive Bivariate; however, a simulation should be possible on a single node if necessary
- Access to ImmPort and FlowJo, competing FC data analysis portals/tools (should be provided by URMC)
Bibliography
[1] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surv., vol. 31, no. 3, pp. 264-323, 1999.
[2] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, M. T. Wells, Ed. Society for Industrial and Applied Mathematics, 2007.
[3] NVIDIA, "CUDA Zone." [Online]. Available: www.nvidia.com/cuda
[4] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, March-April 2008.
[5] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[6] A. P. Reynolds, G. Richards, and V. J. Rayward-Smith, "The application of k-medoids and PAM to the clustering of rules," in Intelligent Data Engineering and Automated Learning. Springer Berlin, 2004, pp. 173-178.
[7] H. Shapiro, J. Wiley, and W. InterScience, Practical Flow Cytometry. Wiley-Liss, New York, 2003.
[8] Invitrogen, "Fluorescence tutorials: Intro to flow cytometry," 2009. [Online]. Available: http://www.invitrogen.com/site/us/en/home/support/Tutorials.html
[9] T. S. Inc., "FlowJo," 2009. [Online]. Available: http://www.flowjo.com/
[10] M. M. Hammer, N. Kotecha, J. M. Irish, G. P. Nolan, and P. O. Krutzik, "WebFlow: A software package for high-throughput analysis of flow cytometry data," ASSAY and Drug Development Technologies, vol. 7, pp. 44-55, 2009.
[11] M. P. Conrad, "A rapid, non-parametric clustering scheme for flow cytometric data," Pattern Recogn., vol. 20, no. 2, pp. 229-235, 1987.
[12] S. Demers, J. Kim, P. Legendre, and L. Legendre, "Analyzing multivariate flow cytometric data in aquatic sciences," Cytometry, vol. 13, no. 3, 1992.
[13] R. Murphy, "Automated identification of subpopulations in flow cytometric list mode data using cluster analysis," Cytometry, vol. 6, no. 4, pp. 302-309, 1985.
[14] S. Pyne, X. Hu, K. Wang, E. Rossin, T. Lin, L. Maier, C. Baecher-Allan, G. McLachlan, P. Tamayo, D. Hafler et al., "Automated high-dimensional flow cytometric data analysis," Proceedings of the National Academy of Sciences, vol. 106, no. 21, pp. 8519-8524, 2009.
[15] K. Lo, R. Brinkman, and R. Gottardo, "Automated gating of flow cytometry data via robust model-based clustering," Cytometry Part A: the journal of the International Society for Analytical Cytology, vol. 73, no. 4, p. 321, 2008.
[16] ImmPort, "Immunology database and analysis portal." [Online]. Available: https://www.immport.org
[17] F. Hahne, N. Le Meur, R. Brinkman, B. Ellis, P. Haaland, D. Sarkar, J. Spidlen, E. Strain, and R. Gentleman, "flowCore: a Bioconductor package for high throughput flow cytometry," BMC Bioinformatics, vol. 10, no. 1, p. 106, 2009.
[18] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: stream computing on graphics hardware," in SIGGRAPH '04: ACM SIGGRAPH 2004 Papers. New York, NY, USA: ACM, 2004, pp. 777-786.
[19] NVIDIA, NVIDIA CUDA Programming Guide, 2nd ed. [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
[20] C. A. Bouman, "Cluster: An unsupervised algorithm for modeling Gaussian mixtures," April 1997, available from http://www.ece.purdue.edu/~bouman.
[21] J. Rissanen, "A universal prior for integers and estimation by minimum description length," The Annals of Statistics, vol. 11, no. 2, pp. 416-431, 1983.
[22] R. Bellman and S. Dreyfus, Applied Dynamic Programming. Princeton University Press, 1962.
[23] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering," in VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 506-517.
[24] H. Takizawa and H. Kobayashi, "Hierarchical parallel processing of large scale data clustering on a PC cluster with GPU co-processing," J. Supercomput., vol. 36, no. 3, pp. 219-234, 2006.
[25] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, "A performance study of general-purpose applications on graphics processors using CUDA," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1370-1380, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/B6WKJ-4SVV8GS-2/2/f7a1dccceb63cbbfd25774c6628d8412
[26] S. A. A. Shalom, M. Dash, and M. Tue, "Graphics hardware based efficient and scalable fuzzy c-means clustering," in AusDM, ser. CRPIT, J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, Eds., vol. 87. Australian Computer Society, 2008, pp. 179-186.
[27] N. Kumar, S. Satoor, and I. Buck, "Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA," in 11th IEEE International Conference on High Performance Computing and Communications, 2009. HPCC '09, 2009, pp. 103-109.
[28] Y. Tarabalka, T. Haavardsholm, I. Kåsen, and T. Skauli, "Real-time anomaly detection in hyperspectral images using multivariate normal mixture models and GPU processing," Journal of Real-Time Image Processing, vol. 4, no. 3, pp. 287-300, 2009.