
AI in China
Machine Learning:
The State of the Art
Jue Wang and Qing Tao, Institute of Automation, Chinese Academy of Sciences

This article discusses ML's advances in light of theoretical and practical difficulties, analyzing statistical interpretations, algorithm design, and the Rashomon problem.

Internet browsers have facilitated the acquisition of large amounts of information. This technology's development has greatly surpassed that of data analysis, making large chunks of data uncontrollable and unusable. This is one reason why so many people are enthusiastic about machine learning (ML). In fields such as molecular biology and the Internet, if you're concerned only about the ease of gene sorting and distributing information, you don't need ML. However, if you're concerned about advanced problems such as gene functions, understanding information, and security, ML is inevitable.

Many researchers quote Herbert Simon in describing ML:

Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.1

This is a philosophical and ubiquitous description of ML. But you can't design algorithms and analyze the problem with this definition. Computer scientists and statisticians are more interested in ML algorithms and their performance. In their eyes, ML is

the process (algorithm) of estimating a model that's true to the real-world problem with a certain probability from a data set (or sample) generated by finite observations in a noisy environment.
November/December 2008 | 1541-1672/08/$25.00 © 2008 IEEE | Published by the IEEE Computer Society

This definition involves five key terms:

• The real-world problem. Suppose the true model for this problem is y = F(x), where x is the variables (features and attributes) describing the problem, y is the outcome of x in the problem, and F is the relationship between the variables and the outcome. ML's mission is to estimate a function f(x) that approximates F(x).
• Finite observations (the sample set). All the data in the sample set S = {(x1, y1), ..., (xn, yn)} are recordings of finite independent observations of y = F(x) in a noisy environment.
• Sampling assumption. The sample set is an independent and identically distributed (i.i.d.) sample from an unknown joint distribution function P(x, y).
• Modeling. Consider a parametric model space. The parameters are estimated from the sample set to get an f(x) that's a good approximation of F(x). The approximation measure is the average error for all samples in terms of the distribution P(x, y). We then say that the estimated model f(x) is true to the real-world problem F(x) with a certain probability, and we call f(x) a model of F(x).
• Process (the algorithm). ML algorithms should have some statistical guarantees. Also, the algorithm will transform the modeling problem into a parametric optimization problem in the model space. In most ML approaches, the learned model has a linear dependence on these parameters.
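These terms can be made concrete with a toy sketch of our own (not drawn from the article's references): take a true model F(x) = 2x + 1, observe it through finite noisy samples, and estimate f(x) in a linear parametric model space by least squares.

```python
# Toy illustration: estimate f(x) ≈ F(x) = 2x + 1 from finite noisy
# observations, using a linear parametric model fit by least squares.
import numpy as np

rng = np.random.RandomState(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 2.0 * x + 1.0 + 0.1 * rng.randn(n)  # noisy observations of F(x)

# Model space: f(x) = w*x + b; the algorithm turns modeling into
# parametric optimization (here, minimizing squared error).
A = np.column_stack([x, np.ones(n)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # estimated parameters, close to the true (2, 1)
```

With more noise or fewer samples, (w, b) drifts from (2, 1); the statistical guarantee is only that f is close to F with a certain probability.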
ML involves both fundamental theoretical difficulties and practical difficulties. Theoretical difficulties include these:

• The number of variables in the sample set (dimensions) is generally very large or even huge; compared with it, the sample set is very small.
• The model of the real-world problem is usually highly nonlinear.

Practical difficulties include these:

• Sometimes, the sample set cannot be represented as a vector in a given linear space, and there are specific relations among the data.
• The data's outcome can take many forms (it could be discrete, it could be continuous, it could have an ordered structure, or it could even be missing).
• The data set can be a product of many meaningful models intertwined together.
• There are outliers to models that must not be ignored in some cases, which has inspired many new learning models.

Here, we discuss ML's advances in light of these difficulties.

The Development of ML

The term machine learning has become increasingly popular in recent decades. Other fields have terms with similar meanings, such as data analysis in statistics and pattern classification in pattern recognition. In the early days of AI, ML research used mainly symbolic data, and algorithm design was based on logic.2-6

At about the same time, Frank Rosenblatt proposed the perceptron, a statistical approach based on empirical risk minimization.7 However, this approach remained unrecognized and undeveloped in the following decades. Nevertheless, this statistical-modeling methodology has been highly regarded in pattern classification research and has become fundamental in statistical pattern recognition.8

The real development of statistical learning came after 1986, when David Rumelhart and James McClelland proposed the nonlinear backpropagation algorithm.9 AI, pattern recognition, and statistics researchers became interested in this approach. And, under the impulsion from real-world demands, ML, especially statistical ML, gradually became independent from traditional AI and pattern recognition. Researchers shunned early symbolic ML methods because they lacked good theoretical generalization guarantees. However, because the complexities of real-world data make a ubiquitous learning algorithm impossible, the quality of the data and background knowledge could be the key to ML's success. The intrinsic readability (interpretability) of symbolic methods could allow ML to regain popularity because its ability to simplify data sets provides a possible way humans can understand data quality and make alterations to improve that quality.

The research community is no longer interested in the algorithm design principles of nonlinear backpropagation. But this research reminded people of nonlinear algorithms' importance. Designing nonlinear learning algorithms theoretically and systematically is an important impetus of current statistical ML research.

ML has become an important topic for pattern recognition researchers and others. In 2001, Leo Breiman published "Statistical Modeling: The Two Cultures," which viewed ML as a subcategory of statistics.10 This approach led to many new directions and ideas for ML.

In our view, ML actually has two foundations. The first is statistics. Because ML's objective is to estimate a model from observed data, it must use statistical measures to evaluate the model's performance and estimate the model. ML also needs statistics to filter noise in the data. The second foundation is computer science algorithm design methodologies, for optimizing parameters. In recent years, ML's main objective has been to find a learner linearly dependent on these parameters.

In the past decade, ML has been in a "margin era."11,12 Geometrically speaking, the margin of a classifier is the minimal distance of training points from the decision boundary. Generally, a large margin implies good generalization performance. Anselm Blumer was the first to use VC (Vapnik-Chervonenkis) dimensions and the probably approximately correct (PAC) framework to study learners' generalization ability.13 In 1995, Vladimir Vapnik published the book The Nature of Statistical Learning Theory,11 which included his work on determining PAC bounds of generalization based on the VC dimensions. Blumer's and Vapnik's research serves as the theoretical cornerstones of finite-sample statistical-learning theory. In 1998, John Shawe-Taylor related generalization to the margin of two closed convex sets in the feature space.14

These theories led to the large-margin algorithm design principle, which has three points:

• Generalization is based on finite samples, which is consistent with application scenarios.
• These algorithms deal with nonlinear classification using the kernel trick. This method employs a linear classifier algorithm to solve a nonlinear problem by mapping the original nonlinear observations into a higher-dimensional space.
• The algorithms estimate linear classifiers and are transformed into convex optimization problems. The algorithms' generalization depends on maximizing the margins, with a clear geometric meaning.

We've done margin-related research on improving supervised algorithms and designing unsupervised algorithms.15-18 For example, taking the viewpoint that the desired domain containing the training samples should have as few outliers as possible, we define the margin of one-class problems and derive a support-vector-machine-like framework for solving a new one-class problem with the predefined threshold.15 We've also modified support vector machines (SVMs) by weighting the margin to solve unbalanced problems.16
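The kernel trick described in the second point can be sketched with a small example of our own (using scikit-learn's SVC as an assumed tool; this is not the authors' algorithms from references 15-18). A linear SVM cannot separate radially distributed classes, while an RBF-kernel SVM implicitly maps the points into a higher-dimensional feature space where a maximal-margin linear separator exists.

```python
# Sketch of the kernel trick: linear vs. RBF-kernel SVM on data
# whose classes are separated by a circle, not a line.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)  # label: outside unit circle

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(f"linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}")
```

The linear classifier's training accuracy stays near the majority-class rate, while the kernelized one recovers the circular boundary; the optimization in both cases is the same convex maximal-margin problem, only the space differs.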
Meanwhile, Robert Schapire proved that,
in the PAC framework, a concept is weakly
learnable if and only if it is strongly learnable.19 This means if you build a group of
models that have a precision slightly better
than a random guess (greater than 50 percent) and combine them correctly, you can
have a model with arbitrarily high precision. Afterward, this concept of grouping
models was developed further under the
margin theory framework, which created a
plethora of boosting methods with statistical
interpretations via the margin.20,21
However, statisticians have begun to doubt the margin's usefulness as a statistical measure, as we shall see in the next section. ML seems to have entered a new cycle, one characterized by consistency of loss functions with statistical interpretations.22,23
Statistical Interpretations

AI researchers were reluctant to use statistics in ML, with the fate of the original perceptron serving as a typical example. The perceptron has its deficiencies; for example, it can't deal with linearly inseparable data. However, AI researchers snubbed it mainly because they didn't like models that people can't understand (black boxes). Also, the differing schools of thought in statistics perplexed AI researchers. In addition, the statistical theories at that time assumed that the number of labeled data approaches infinity, which made AI researchers hesitant to base ML theory on statistics.
However, pattern recognition pioneers clearly were aware of statistics' importance. In 1973, Richard Duda and Peter Hart published Pattern Classification and Scene Analysis.8 They used Bayesian decision theory as the basis of the classification problem; that is, they evaluated classification models by the models' deviance from the Bayes classifier. ML researchers now widely acknowledge this criterion. In addition, for Duda and Hart, approximately correct (with a nonzero error rate) model precision was acceptable. This deviates from statistics, which assumes the model should be consistent with the true model when the sample size goes to infinity. This is similar in spirit to PAC theory, which assumes precision with a probability of 1 − δ (δ > 0), instead of a probability of 1. However, it wasn't until 10 years later that Leslie Valiant proposed PAC theory,24 which clearly describes this idea.
In fact, in the early '70s, Vapnik proposed the finite-sample statistical theory. Although the theory used VC dimensions to describe the model space's complexity, it neither explained the reasons for using the approximated correctness of models nor provided clues on developing algorithms. So, AI researchers didn't adopt it. Methods from this statistical theory remained unacknowledged until Vapnik's book in 1995 and particularly Shawe-Taylor's research on margin bounds in 1998.
However, many statistics researchers still widely criticized Vapnik's finite-sample statistical theory. Vapnik's argument was mainly that you could use the margin as a statistic to evaluate a model's performance. Here are some of the major criticisms:
[Schapire and his colleagues] offered an explanation of why Adaboost [Adaptive Boosting, an ML algorithm] works in terms of its
ability to reduce the margin. Comparing
Adaboost to our optimal arcing algorithm
shows that their explanation is not valid and
that the answer lies elsewhere. In this situation the VC-type bounds are misleading.25
The bounds and the theory associated with
the AdaBoost algorithms are interesting, but
tend to be too loose to be of practical importance. In practice, boosting achieves results
far more impressive than the bounds would
imply.26
In the presence of noise many of the known bounds cannot predict well, neither for small nor for large sample sizes.27
The margin idea mixes the two aspects [the bias and variance] together so that it is not clear which aspect is the main contribution to the success of these so-called margin maximization methods. Moreover, from the margin concept, we are unable to characterize the impact of different loss functions, and we are unable to analyze the closeness of a classifier obtained from convex risk minimization to the optimal Bayes classifier.22
What is special with the SVM is not the
regularization term, but is rather the loss
function, that is, the hinge loss. [Yi Lin23]
pointed out that the hinge loss is Bayes consistent, that is, the population minimizer of
the loss function agrees with the Bayes rule
in terms of classification. This is important
in explaining the success of the SVM, because it implies that the SVM is trying to
implement the Bayes rule.28
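For reference, the hinge loss at issue in this last criticism can be written out explicitly (standard notation, not quoted from the cited work):

```latex
% Hinge loss for a label y in {-1, +1} and classifier score f(x):
\ell_{\mathrm{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\, f(x)\bigr)
% Lin's Bayes-consistency result: the population minimizer of
% E[\ell_{\mathrm{hinge}}(Y, f(X))] has the same sign as the Bayes rule
% \operatorname{sign}\bigl(P(Y = 1 \mid X = x) - 1/2\bigr).
```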

Maybe it should be no wonder that a concept such as the margin would generate disputes among statisticians. But how do computer scientists treat margins? Do they believe that you can use them as a statistic in evaluating models? Although computer scientists haven't criticized the margin concept directly (at least we haven't found any related reports), their experiments haven't used the margin as a criterion to evaluate model precision. They still use classic bias (error rates) and variance (cross-validation) as criteria. Computer scientists like to use algorithmic issues to explain why they would use traditional measures. But they've used the margin only as a guide to design algorithms.
Statisticians have now done more in-depth research on margins and have pointed out that for 0-1 loss, the margin as a loss function has some connections with Bayes risk.22 However, research on loss functions is still based on the infinite-sample assumption. Obtaining tighter, more useful PAC generalization bounds remains difficult.
Algorithm Design

Perceptron,29 by Marvin Minsky and Seymour Papert, is considered an "evil sword" that stopped research on perceptrons, especially by neural network researchers, from 1980 to 2000. The book proposed two seemingly contradictory principles for ML:
• Algorithms should solve real-world problems instead of toy problems.
• The algorithm's time complexity should be polynomial.

The first implies that algorithms must be able to deal with nonlinear problems; the second implies that the algorithms must be computable. Minsky and Papert implied that because the two principles contradict each other, it might be difficult to design ML methods that don't depend on domain knowledge. After nearly 40 years, these two principles still apply to ML algorithms.
Backpropagation is a nonlinear algorithm. It was a milestone in perceptron-type learning. However, Vapnik recommended going back to linear perceptrons. Philosophically, "nonlinear" is a general name for unknown things. When you've solved a problem, it means you've found a space on which the problem can be linearly represented. Technically, you need to find a space where nonlinear problems can be linear. This was the basic idea when John von Neumann built mathematical foundations for quantum mechanics in the 1930s.30 Vapnik used this idea when he suggested looking for a map that maps linearly inseparable data to a Hilbert space (a linear inner-product space), which he called the feature space. After mapping, the data could be linearly separable on this space. So, you would need to consider only linear perceptrons here. If you considered margins, you would then have the maximal-margin problem in the feature space.
The nonlinearity problem would seem to have been solved, but it isn't so simple. To make the problem linearly separable, you need to add dimensions. So, the feature space's dimensionality will be much higher than that of the input space. How high should this dimensionality be for the mappings of the convex sets in the input space to have the maximal margin? This question remains unanswered. It's one reason you can't use the margin as a criterion to evaluate model precision and must resort to traditional measures.
Schapire's weak-learnability theorem implied another design principle for learning algorithms: You can obtain high-precision models by combining many low-precision models. This principle is called ensemble learning. Neuroscientist Donald O. Hebb first employed this principle in his multicell ensemble theory, which posits that vision objects are represented by interconnected neuron ensembles. The intent of the term ensemble as it's used nowadays is consistent with Hebb's intent: a high-precision model is represented through many low-precision models.

To design algorithms, you can regard learning problems as an optimization problem on the space spanned by these weak classifiers. This is the basic design principle in popular boosting algorithms.26 This method's biggest advantage is that it automatically reduces dimensionality because the number of weak classifiers is usually far smaller than the input's dimensionality.
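This ensemble principle can be demonstrated with a short sketch of our own (using scikit-learn's AdaBoost implementation as a stand-in for the boosting methods discussed): a single depth-1 decision stump is a low-precision model, but combining a hundred of them yields a much stronger classifier.

```python
# Sketch of ensemble learning: one weak stump vs. a boosted
# combination of 100 stumps on the same training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# A depth-1 tree (stump) is a weak learner: barely better than chance
# is enough for boosting to work.
stump_acc = DecisionTreeClassifier(max_depth=1).fit(X, y).score(X, y)

# AdaBoost's default base learner is a depth-1 stump; 100 rounds of
# reweighting combine them into a high-precision ensemble.
ensemble_acc = AdaBoostClassifier(n_estimators=100,
                                  random_state=0).fit(X, y).score(X, y)
print(stump_acc, ensemble_acc)
```

Note that the ensemble effectively optimizes over the space spanned by the weak classifiers, as the paragraph above describes, and the number of stumps here (100) is small relative to what a comparably accurate single model would need in parameters.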
The Rashomon Effect

Let's look at another problem affecting machine learning. In a liter of water, drop 10 grams of salt or sugar. We can discriminate whether you dropped sugar or salt when we taste it. However, add a liter of water, another liter of water, and so on. Eventually, we wouldn't be able to tell what you added. Such is the effect of dilution.
When dealing with a high-dimensional data set, you'll see a similar effect. For a fixed-size training sample, if the number of variables (features in pattern recognition or attributes in AI) reaches a certain point, the sample will be diluted over the space spanned by these variables. This space will also contain many satisfactory models, meeting a precision requirement for common bias and variance measures (such as cross-validation on the given sample).

However, for real-world problems, maybe only one or a few of those models will be useful. This means that for real-world observations outside the training sample, most of these satisfactory models won't be good. This is the multiplicity-of-models problem. Leo Breiman listed this problem alongside Richard Bellman's "curse of dimensionality" and Occam's razor and gave it an interesting name, the Rashomon effect, to describe the dilemma that ML faces in this situation.10
(In the Japanese movie Rashomon, four
witnesses of an incident give different accounts of what happened.)
In high-dimensional spaces, if the sample is too small, the i.i.d. conditions won't be useful. That is, as the number of variables increases, the given sample will be diluted exponentially over the space that those variables span. In other words, if p is the upper limit of the dimensionality (the number of variables) suitable for the given sample size, then if the dimensionality increases further, the sample size must increase exponentially to make the model useful.
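The dilution effect is easy to observe numerically; the following sketch (our own illustration, not from the cited work) shows that for a fixed sample size, the average distance from a point to its nearest neighbor grows rapidly with the dimensionality, i.e., the sample becomes ever sparser in the space it spans.

```python
# Dilution demo: mean nearest-neighbor distance for a fixed-size
# sample grows with the dimensionality of the space.
import numpy as np

rng = np.random.RandomState(0)

def mean_nn_distance(dim, n=200):
    """Mean distance to the nearest neighbor among n uniform points in [0,1]^dim."""
    X = rng.uniform(0, 1, (n, dim))
    # Pairwise Euclidean distances; exclude self-distances on the diagonal.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

d2, d10, d100 = (mean_nn_distance(dim) for dim in (2, 10, 100))
print(d2, d10, d100)
```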
Obviously, if you lower the dimensionality of a high-dimension observation data set, the needed sample size will drop exponentially. For a long time, ML research didn't focus on dimensionality reduction but used it as an auxiliary method. However, it has long been central to pattern recognition research. This is possibly because image data have very high dimensionality, and you can't build good models without considering dimensionality reduction. In pattern recognition, dimensionality reduction is called feature selection and feature extraction. Feature selection involves selecting a feature subset from a given feature set with a given rule. Feature extraction maps the input space to another space with a lower dimension; the number of features will be smaller in the new space. In this article, we discuss only feature selection.
Many feature selection methods exist.31 Statisticians were the first to regard feature selection as an important ML problem. In 1996, Robert Tibshirani proposed the Lasso (Least Absolute Shrinkage and Selection Operator) algorithm, which is based on optimizing least squares with an L1 constraint.32 ML researchers have recently recognized this algorithm's importance because of their increased awareness of the problem of information sparsity. Lasso is similar to wrapper methods in feature selection in that it targets the objective function of learning (to minimize the classification error or squared error). Unlike many feature selection methods, Lasso doesn't select a feature subset with a heuristic rule. However, it does consider feature selection as a set of linear constraints on the coefficients of the variables. During optimization, if a variable's coefficient decreases to 0, Lasso eliminates the variable from the model, thus performing feature selection.
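This selection-by-shrinkage behavior is easy to see in a small sketch (ours, using scikit-learn's Lasso implementation as an assumed tool): only two of ten variables influence the response, and the L1 penalty drives the other coefficients exactly to zero.

```python
# Sketch of Lasso feature selection: the L1 penalty zeroes out
# coefficients of irrelevant variables, leaving a selected subset.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# Only features 0 and 3 actually influence the response.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.randn(100)

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices with nonzero coefficients
print("selected features:", selected)
```

Raising alpha shrinks the surviving coefficients further and eventually empties the model; the selection thus falls out of the optimization itself rather than from a heuristic search over subsets.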
On the basis of Lasso, Bradley Efron proposed LARS (Least Angle Regression), which is more interesting.33 To provide feature selection, LARS uses analysis of residuals. Efron proved that with one additional constraint, you can use LARS to solve the Lasso optimization objective. Because LARS is basically a very efficient forward selection method, it's amazing that it can obtain the optimal solution of a numerical optimization problem.
Most research on the Rashomon effect aims to reduce the number of dimensions or the number of variables under certain statistical assumptions and build models on a low-dimensional space. However, in some fields (for example, natural language processing), instead of only a handful of features being meaningful, most features are meaningful. Simple dimensionality reduction might give good models measured on a given training set but be useless in linguistics. In other words, an algorithm will be meaningful in linguistics only if it considers all the features. However, linguists don't like to work on the full feature set. They often start from small parts and then combine the different parts. In this case, ensemble methods might be useful.
Another facet of the Rashomon effect is that the data might come not from one source but from multiple sources. Suppose that every model relies on a subset of the feature set and that different models rely on nonintersecting sets. In this situation, you can directly use feature selection methods. However, to find the best partition of the feature set, you'll have to search all the feature subsets. This method is different from feature selection based on statistical correlation; it must consider not only the correlation among features but also users' specific requirements.34 This is important in personalized applications, such as personalized information retrieval.
A more complicated situation is when the data set is an additive combination of multiple reasonable models. For example, with image data, an image might contain many different objects. In this case, feature extraction might be useful. Many researchers have studied feature extraction for this situation, but most of their results are empirical. We haven't seen remarkable theoretical results, either statistically or through domain-dependent research.
The Rashomon effect is not only an inevitable challenge for machine learning. Residual-analysis methods from classical statistics can deal with low-dimensional data, but when confronted with high-dimensional data, they're of little help. So, the Rashomon effect is also a challenge for statistics.
Other ML Forms

Recall that the fundamental thing in ML is to estimate a function y = F(x). In recent years, researchers have proposed many forms of ML, with the difference being solely how you define x or y.
Different Definitions of y

Suppose that a data set satisfies the attribute-value form; that is, the variable set (x) is determined beforehand. For every observation, each variable must have a value. Different domains, ranges, or explanations lead to totally different ML forms, which need different algorithms.

If you define y on a limited integer set, you have a classification problem. If you also define each variable of x on a limited integer set, you can use symbolic learning methods. If you define part of x's variables as real numbers, you can use statistical learning methods. Of course, you can also consider integers to be real numbers, in which case the problem is statistical.
If you define y as a real number, you have a regression problem. Traditional ML does much less research on regression problems than on classification problems; regression problems are much more important in statistics and certain engineering applications. However, because logistic regression could convert a classification problem to a regression problem, regression will surely attract ML researchers' attention.
If you assign y to only a few samples while others have no value because of labor costs or lack of knowledge, semisupervised learning might be applicable.35 This form of ML designs a class of rules to evaluate y for unknown samples according to the already known samples. This is difficult. On one hand, rules might be related to domain knowledge. On the other hand, you must find rules for samples with known y values and use them as evidence to assign y values to other samples, which means searching in a huge space. So far, no good theoretical framework exists for this form of ML.
If relations or structures exist among samples, which means the output y is related or structural, structural learning is appropriate. For this kind of data, in addition to the labels in y, a new variable p describes the structure, and you can simply write the y-part as (y, p). The simplest, yet most useful, structural-learning method, learning to rank, needs the samples arranged according to certain requirements.36 This method has attracted attention because it's useful for designing search engines. If the y-part denotes different classes of documents such as news or sports, you can write (y, p) as (y, rk), where rk is an order for items in the kth class.

If for all samples y isn't assigned, the problem involves unsupervised learning, which is closely related to clustering analysis.
Different Definitions of x

If the samples don't satisfy the attribute-value form (that is, the data set is stored in a relational database), the variables have relations among each other. Learning involving such data sets is called relational learning.37 AI first addressed relational learning in the 1970s, but this problem still hasn't been solved. This is important because 60 percent of the data we face, especially economic data, are stored like this.

Current approaches to relational learning are based mainly on inductive logic programming (ILP), which

1. scatters samples according to domain knowledge so that each fragment satisfies the attribute-value form,
2. builds models for each fragment by statistical learning, and
3. joins the models according to domain knowledge.

This is difficult but important for practical application.
In 2000, Science published a group of articles on manifold learning. In this approach, the cognitive process is based on data manifolds.38-40 Roughly speaking, a manifold is a locally coordinated Euclidean space, which means that you can give Euclidean coordinates (via a homeomorphism) on every small part of it. The research reported in Science referred not to the differential manifold's mathematical property but only to its definition; that is, to induce the topological locally coordinate space intuitively, through piecewise linearization. This concept makes sense for cognitive science and provides insight for ML. For ML methods, this kind of learning gives a different explanation from general Euclidean implementations for x in the given data set.
Some detection applications aim to find special instances or exceptions. In early research on this approach, the goal of finding exceptions was to shorten the description length of the model to improve generality.41,42 Recently, driven by a host of practical challenges such as detecting illegal financial activities or security breaches, finding exceptions in a mass of data has become important.43 Exceptional activities vary and change so rapidly that directly building models with them is practically useless. You need to build models according to requirements and make them the standard for normal behavior; behavior that exceeds the standard would be considered the exception. To do this, we have induced models from data and designed methods to extract exceptions from the data set.44
And Even More Forms

Over the last decade, ML researchers have developed other learning forms such as metric learning45 and multi-instance learning.46 These forms are driven by and derived from practical problems. Their common trait is that the data are so complexly represented that no previous learning framework can handle them. Generally speaking, these forms are all related to domain knowledge.

Increasing needs in fields such as molecular biology, Internet information analysis and processing, and economic data analysis have greatly stimulated ML research. They have also presented many complex problems that can't be solved with traditional ML but that need fresh approaches and new knowledge from other fields.

One fresh approach comes from Breiman's 2001 article in Statistical Science about how a statistician understands ML.10 He advises that statisticians start from practice and pay attention to problems caused by high-dimensional data. He admonishes computer scientists to consider the conditions for using various theories. When processing data, no matter how good or bad the result was, computer scientists had been paying more attention to algorithm design than to statistical analysis.

Having realized the deficiency in how they think, computer scientists are now paying more attention to statisticians' understanding of ML, their research on the Rashomon effect, and the discussion about the limitations of Vapnik's finite-sample statistics theory. We also think it's necessary to take a hard look at previous ML research, which was our motivation for this article.
In this article, we haven't discussed traditional ML in AI, and we haven't dwelt on how researchers have neglected symbolic learning methods. We are much concerned with these methods,43,47 but we're not interested in their generalization. In other words, when facing different practical problems, you need to consider not only the model's statistical interpretations but also special examples and how they influence the models and the problem.

Such specificity is also what some fields demand of ML. For example, if you use statistical methods to build an approximate model according to the given sample, will molecular biologists believe this model? What useful thing can they get from it? Maybe the more important thing is to help them read the data in more detail. Although statisticians nowadays are concerned with models' interpretations, researchers in some fields might need more detailed and readable data. These challenges will provide new pathways to symbolic learning methods.
Of course, some researchers believe that such research also belongs to data mining. A common practice of human learning and knowledge management in data mining is to use general rules and exceptions to rules. One crucial issue is to find the right mixture of the two. In a previous study, we considered rule-plus-exception strategies for discovering this type of knowledge.43 We summarized and compared results from psychology, expert systems, genetic algorithms, and ML and data mining, and we examined their implications for knowledge management and discovery. That study establishes a basis for the design and implementation of new algorithms for discovering rule-plus-exception-type knowledge.

Acknowledgments

The Chinese National Basic Research Program (2004CB318103) and Natural Science Foundation of China grant 60835002 supported this research.

References

1. H. Simon, "Why Should Machines Learn?" Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, eds., Tioga Press, 1983, pp. 25–38.
2. R. Solomonoff, "A New Method for Discovering the Grammars of Phrase Structure Language," Proc. Int'l Conf. Information Processing, Unesco, 1959, pp. 285–290.
3. E. Hunt, J. Marin, and P. Stone, Experiments in Induction, Academic Press, 1966.
4. L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers, Part II," IBM J. Research and Development, vol. 11, no. 4, 1967, pp. 601–618.
5. J. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, Mar. 1986, pp. 81–106.
The Authors

Jue Wang is a professor at the Chinese Academy of Sciences Institute of Automation. His research interests include artificial neural networks, genetic algorithms, multiagent systems, machine learning, and data mining. Wang received his master's from the Chinese Academy of Sciences Institute of Automation. He's an IEEE senior member. Contact him at jue.wang@mail.ia.ac.cn.

Qing Tao is a professor at the Chinese Academy of Sciences Institute of Automation and at the New Star Research Institute of Applied Technology. His research interests are applied mathematics, neural networks, statistical learning theory, support vector machines, and pattern recognition. Tao received his PhD from the University of Science and Technology of China. Contact him at qing.tao@mail.ia.ac.cn or taoqing@gmail.com.

6. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.
7. F. Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton, tech. report 85-4601, Aeronautical Lab., Cornell Univ., 1957.
8. R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
9. D.E. Rumelhart and J.L. McClelland, Parallel Distributed Processing, MIT Press, 1986.
10. L. Breiman, "Statistical Modeling: The Two Cultures," Statistical Science, vol. 16, no. 3, 2001, pp. 199–231.
11. V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
12. V. Vapnik and A. Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities," Theory of Probability and Its Applications, vol. 16, Jan. 1971, pp. 264–280.
13. A. Blumer et al., "Learnability and the Vapnik-Chervonenkis Dimension," J. ACM, vol. 36, no. 4, 1989, pp. 929–965.
14. J. Shawe-Taylor et al., "Structural Risk Minimization over Data-Dependent Hierarchies," IEEE Trans. Information Theory, vol. 44, no. 5, 1998, pp. 1926–1940.
15. Q. Tao, G.-W. Wu, and J. Wang, "A New Maximum Margin Algorithm for One-Class Problems and Its Boosting Implementation," Pattern Recognition, vol. 38, no. 7, 2005, pp. 1071–1077.
16. Q. Tao et al., "Posterior Probability Support Vector Machines for Unbalanced Data," IEEE Trans. Neural Networks, vol. 16, no. 6, 2005, pp. 1561–1573.
17. Q. Tao, G.-W. Wu, and J. Wang, "Learning Linear PCA with Convex Semi-Definite Programming," Pattern Recognition, vol. 40, no. 10, 2007, pp. 2633–2640.
18. Q. Tao, D.-J. Chu, and J. Wang, "Recursive Support Vector Machines for Dimensionality Reduction," IEEE Trans. Neural Networks, vol. 19, no. 1, 2008, pp. 189–193.
19. R. Schapire, "The Strength of Weak Learnability," Machine Learning, vol. 5, no. 2, 1990, pp. 197–227.

20. Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, 1997, pp. 119–139.
21. R.E. Schapire et al., "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods," Annals of Statistics, vol. 26, no. 5, 1998, pp. 1651–1686.
22. T. Zhang, "Statistical Behaviour and Consistency of Classification Methods Based on Convex Risk Minimization," Annals of Statistics, vol. 32, no. 1, 2004, pp. 56–85.
23. Y. Lin, "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, vol. 6, no. 3, 2002, pp. 259–275.
24. L. Valiant, "A Theory of the Learnable," Comm. ACM, vol. 27, no. 11, 1984, pp. 1134–1142.
25. L. Breiman, "Prediction Games and Arcing Algorithms," Neural Computation, vol. 11, no. 7, 1999, pp. 1493–1517.
26. J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," Annals of Statistics, vol. 28, no. 2, 2000, pp. 337–407.
27. I. Steinwart, "Which Data-Dependent Bounds Are Suitable for SVMs?" tech. report, Los Alamos Nat'l Lab., 2002; www.ccs3.lanl.gov/~ingo/pubs.shtml.
28. T. Hastie and J. Zhu, "Comment," Statistical Science, vol. 21, no. 3, 2006, pp. 352–357.
29. M. Minsky and S. Papert, Perceptrons (expanded edition), MIT Press, 1988.
30. J. von Neumann, Mathematical Foundations of Quantum Mechanics, Princeton Univ. Press, 1932.
31. H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 1998.
32. R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. Royal Statistical Soc.: Series B, vol. 58, no. 1, 1996, pp. 267–288.
33. B. Efron et al., "Least Angle Regression," Annals of Statistics, vol. 32, no. 2, 2004, pp. 407–499.

34. H.L. Liang, W. Jue, and Y. YiYu, "User-Oriented Feature Selection for Machine Learning," Computer J., vol. 50, no. 4, 2007, pp. 421–434.
35. A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-training," Proc. 11th Ann. Conf. Computational Learning Theory, ACM Press, 1998, pp. 92–100.
36. R. Herbrich, T. Graepel, and K. Obermayer, "Support Vector Learning for Ordinal Regression," Proc. 9th Int'l Conf. Artificial Neural Networks, IEEE Press, 1999, pp. 97–102.
37. S. Dzeroski and N. Lavrac, eds., Relational Data Mining, Springer, 2001.
38. H. Seung and D. Lee, "The Manifold Ways of Perception," Science, vol. 290, no. 5500, 2000, pp. 2268–2269.
39. S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, 2000, pp. 2323–2326.
40. J. Tenenbaum, V. de Silva, and J. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, 2000, pp. 2319–2323.
41. N. Shepard, C. Hovland, and H. Jenkins, "Learning and Memorization of Classification," Psychological Monographs, vol. 75, no. 13, 1961, pp. 1–42.
42. M. Nosofsky, J. Palmeri, and C. McKinley, "Rule-Plus-Exception Model of Classification Learning," Psychological Rev., vol. 101, no. 1, 1994, pp. 53–79.
43. Y. Yao et al., "Rule + Exception Strategies for Security Information Analysis," IEEE Intelligent Systems, Sept./Oct. 2005, pp. 52–57.
44. J. Wang et al., "Multilevel Data Summarization from Information System: A Rule + Exception Approach," AI Comm., vol. 16, no. 1, 2003, pp. 17–39.
45. E.P. Xing et al., "Distance Metric Learning, with Application to Clustering with Side-Information," Advances in NIPS, vol. 15, Jan. 2003, pp. 505–512.
46. T.G. Dietterich, R.H. Lathrop, and T. Lozano-Pérez, "Solving the Multiple-Instance Problem with Axis-Parallel Rectangles," Artificial Intelligence, vol. 89, nos. 1–2, 1997, pp. 31–71.
47. W. Zhu and F.-Y. Wang, "Reduction and Axiomization of Covering Generalized Rough Sets," Information Sciences, vol. 152, no. 1, 2003, pp. 217–230.

For more information on this or any other computing topic, please visit our Digital Library at
www.computer.org/csdl.
