
CHAPTER 11

CONCLUDING REMARKS

Cluster analysis, without prior information, expert knowledge, or category labels, organizes data either into a pre-specified number of groups, or, more preferably, into a number of groups determined dynamically, or in a hierarchical way represented as a dendrogram. The absence of prior information usually makes cluster analysis more difficult than supervised classification. Accordingly, the object of cluster analysis is to unveil the underlying structure of the data, rather than to establish classification rules for discrimination.
Intuitively, data objects that belong to the same cluster should be more similar to each other than to objects outside it. Such similarity or dissimilarity is quantified by a defined proximity measure, whose selection is itself an important step in cluster analysis. Basically, cluster analysis consists of a series of steps, ranging from preprocessing (such as feature selection and extraction) and proximity definition, through clustering algorithm development or selection, to clustering result validation, evaluation, and knowledge extraction. Each step is tightly related to the others and can strongly influence their performance. For example, a good representation of data objects with appropriate features makes it easier to find the clustering structure, while data sets with many redundant or irrelevant features make the clustering structure vague and the subsequent analysis more complicated. In this sense, all of the steps are equally important in cluster analysis and deserve equal attention.
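To make the role of the proximity measure concrete, consider the following minimal sketch (in Python with NumPy; the language and the two illustrative vectors are our choices, not the book's). The same pair of points is judged identical by one measure and clearly separated by another, which is why the definition of proximity deserves as much care as the choice of algorithm.

```python
import numpy as np

# Two hypothetical feature vectors (illustrative values only).
x = np.array([1.0, 0.0, 2.0])
y = np.array([2.0, 0.0, 4.0])

# Euclidean distance: sensitive to magnitude.
euclidean = np.linalg.norm(x - y)

# Cosine dissimilarity: sensitive to orientation only.
cosine = 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(f"Euclidean distance:   {euclidean:.3f}")  # 2.236 (the vectors differ in length)
print(f"Cosine dissimilarity: {cosine:.3f}")     # 0.000 (the vectors point the same way)
```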
In essence, clustering is also a subjective process, which calls for extra attention when performing a cluster analysis. The assumptions about the
data, the definition of the proximity measure, the construction of the optimum
criterion, the selection of the clustering algorithm, and the determination of
the validation index all involve subjective choices. Moreover, given the same data set, different goals usually lead to different partitions. A simple and direct example is the partitioning of five animals: an eagle, a cardinal, a lion, a panther, and a ram. If they are divided on the criterion of whether or not they can fly, we obtain two clusters, with the eagle and the cardinal in one cluster and the rest in the other. If the criterion changes to whether or not they are carnivores, however, we obtain a completely different partition, with the cardinal and the ram in one cluster and the other three in the second, as the following sketch illustrates.
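The same toy example rendered in a few lines of code (the code is our illustration; the animals and their two attributes are exactly those used in the text above):

```python
# The five animals from the example above, with the two attributes
# used as alternative clustering criteria.
animals = {
    "eagle":    {"can_fly": True,  "carnivore": True},
    "cardinal": {"can_fly": True,  "carnivore": False},
    "lion":     {"can_fly": False, "carnivore": True},
    "panther":  {"can_fly": False, "carnivore": True},
    "ram":      {"can_fly": False, "carnivore": False},
}

def partition(criterion):
    """Split the animals into two clusters by one binary criterion."""
    yes = [name for name, attrs in animals.items() if attrs[criterion]]
    no = [name for name, attrs in animals.items() if not attrs[criterion]]
    return yes, no

print(partition("can_fly"))    # (['eagle', 'cardinal'], ['lion', 'panther', 'ram'])
print(partition("carnivore"))  # (['eagle', 'lion', 'panther'], ['cardinal', 'ram'])
```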
We have discussed a wide variety of clustering algorithms. These algorithms evolved from different research communities, aim to solve different problems, and have their own pros and cons. Though we have already seen many examples of successful applications of cluster analysis, many open problems remain because of the many inherent uncertainties involved. These problems have attracted, and will continue to attract, intensive effort from a broad range of disciplines, and it will not be surprising to see the continued growth of clustering algorithms in the future.
In conclusion, we summarize and emphasize several important issues and
research trends for cluster analysis:

• At the pre-processing and post-processing phases, feature selection/extraction (as well as standardization and normalization) and cluster validation are as important as the clustering algorithms themselves. If a data set includes too many irrelevant features, they not only increase the computational burden of the subsequent clustering but, worse, distort the computation of proximity measures and, consequently, the formation of clusters (a standardization sketch follows this list). On the other hand, clustering result evaluation and validation reflect the degree of confidence we can place in the generated clusters, which is critically important because clustering algorithms always return some clustering structure, one that may be merely an artifact of the algorithm or may not exist in the data at all. Both processes lack universal guidance and depend on the data. Ultimately, the tradeoff among different criteria and methods remains dependent on the applications themselves.
• No clustering algorithm universally solves all problems. Clustering algorithms are usually designed with certain assumptions about cluster shapes or data distributions and inevitably favor some type of bias. In this sense, it is not accurate to speak of a "best" clustering algorithm, although some comparisons are possible. Such comparisons are mostly based on specific applications, under certain conditions, and the results may become quite different if the conditions change.
• With the emergence of new technologies, more complicated and challenging tasks have appeared, which require more powerful clustering algorithms. The following properties are important to the efficiency and effectiveness of a novel algorithm:
- Generate clusters of arbitrary shape rather than being confined to some particular shape. In practice, irregularly shaped clusters are much more common than shapes like hyperspheres or hyperrectangles.
- Handle large volumes of data, as well as high-dimensional features, with reasonable time and storage complexity. It is not unusual to see data sets with millions of records and up to tens of thousands of features, so linear or near-linear complexity is highly desirable. At the same time, in high-dimensional data with many features irrelevant to the resulting clusters, algorithms that work well in low dimensions are no longer effective; it is important to find and use the truly informative dimensions to represent the data.
- Detect and remove possible outliers and noise. Noise and outliers are inevitably present in data because of the many factors involved in its measurement, storage, and processing, and their presence can distort the resulting clusters and mislead the clustering algorithm.
- Decrease the reliance of algorithms on user-specified parameters. Most current clustering algorithms require users to set several parameters, which are usually hard to determine for lack of effective guidance; furthermore, the resulting clusters may be sensitive to the chosen values.
- Deal with newly arriving data without relearning from scratch. This capability can save a great deal of computation and increase clustering efficiency.
- Be immune to the order in which input patterns are presented, or at least provide guidance about what to expect from order dependency. Most current online clustering algorithms suffer from this problem: different presentation orders lead to different resulting clusters, which may be mere artifacts of the algorithms, making the results questionable.
- Provide some insight into the number of potential clusters without prior knowledge. As discussed in Chapter 10, estimating the number of clusters is one of the most fundamental and important problems in cluster analysis. Many current algorithms require that number as a user-specified parameter, yet without prior knowledge users usually do not have such information, and either over-estimating or under-estimating it leads to an incorrect interpretation of the clustering structure (see the silhouette sketch following this list).
- Provide good data visualization and results that simplify further analysis. The ultimate goal of cluster analysis is to find the underlying data structure, which can then be used to solve a more complicated problem; an understandable, well-visualized result is very helpful in this regard.
- Handle both numerical and nominal data, or be easily adaptable to other data types. Data are commonly described by features of mixed types, which requires the clustering algorithm to be flexible in this respect (a mixed-type dissimilarity sketch appears below).
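To illustrate two of these points concretely, standardization and the estimation of the number of clusters, here is a minimal sketch using NumPy and scikit-learn (our library choices, not the book's; the data are synthetic and purely illustrative). Features are standardized so that no single scale dominates the Euclidean proximity measure, and the silhouette index serves as one relative validity criterion for suggesting the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: three hypothetical Gaussian clusters whose two features
# live on very different scales (the second feature dominates before scaling).
centers = np.array([[0.0, 0.0], [5.0, 500.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=[1.0, 100.0], size=(50, 2)) for c in centers])

# Standardize each feature to zero mean and unit variance so that no
# single feature dominates the Euclidean proximity measure.
X_std = StandardScaler().fit_transform(X)

# Use the silhouette index as a relative validity criterion to suggest
# the number of clusters in the absence of prior knowledge.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    print(f"k = {k}: silhouette = {silhouette_score(X_std, labels):.3f}")
# The largest silhouette value is expected at k = 3 for this data.
```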

Of course, detailed requirements for specific applications will affect these properties.
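For the last property in the list, handling mixed numerical and nominal features, one simple option is a Gower-style coefficient, sketched below. The records, field names, and ranges are hypothetical, and this is merely one illustrative choice, not a method the book prescribes.

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """Gower-style dissimilarity between two records with numeric and
    nominal fields: range-normalized absolute difference for numbers,
    simple mismatch (0/1) for nominal values, averaged over all fields."""
    total = 0.0
    for key, value in a.items():
        if key in numeric_ranges:
            total += abs(value - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if value == b[key] else 1.0
    return total / len(a)

# Hypothetical records mixing numeric and nominal features.
p = {"age": 30, "income": 40_000, "city": "Paris"}
q = {"age": 40, "income": 60_000, "city": "Rome"}
ranges = {"age": 50, "income": 100_000}  # observed ranges in the data set

print(mixed_dissimilarity(p, q, ranges))  # (10/50 + 20000/100000 + 1) / 3 ≈ 0.467
```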
Although we have endeavored to provide a broad overview of clustering
as it now stands, we fully expect (and hope to contribute to) a rapid evolution
of this important field. We are eager to see not only advances on the important challenges discussed above, but also the unpredictable array of forthcoming applications of this singularly fascinating technology.
