AI in China
Machine Learning:
The State of the Art
Jue Wang and Qing Tao, Institute of Automation, Chinese Academy of Sciences
The authors discuss machine learning in terms of theoretical and practical difficulties, analyzing statistical interpretations, algorithm design, and the Rashomon problem.

… large chunks of data uncontrollable and unusable. This is one reason why so many people are enthusiastic about machine learning (ML). In fields such as molecular biology and …
This is a philosophical and ubiquitous description of ML. But you can't design algorithms and analyze the problem with this definition. Computer scientists and statisticians are more interested in ML algorithms and their performance. In their eyes, ML is the process (algorithm) of estimating a model that's true to the real-world problem with a certain probability, from a data set (or sample) generated by finite observations in a noisy environment.
… for this problem is y = F(x), where x is the variables (features and attributes) describing the problem, y is the outcome of x in the problem, and F is the relationship between the variables and the outcome. ML's mission is to estimate a function f(x) that approximates F(x).

Finite observations (the sample set). All the data in the sample set S = {(xk, yk), k = 1, …, n} are recordings of finite independent observations of y = F(x) in a noisy environment.

Sampling assumption. The sample set is an independent and identically distributed (i.i.d.) sample from an unknown joint distribution function P(x, y).

Modeling. Consider a parametric model space. The parameters are estimated from the sample set to get an f(x) that's a good approximation of F(x). The approximation measure is the average error over all samples in terms of the distribution P(x, y). We then say that the estimated model f(x) is true to the real-world problem F(x) with a certain probability, and we call f(x) a model of F(x).

Process (the algorithm). ML algorithms should …
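To make this setup concrete, here is a minimal sketch of the framework just described, with purely illustrative choices (a sine function standing in for the unknown F, Gaussian noise, a degree-5 polynomial model space) that are not taken from the article:

```python
import numpy as np

# Minimal sketch of the estimation framework above (illustrative choices only):
# F is the unknown real-world relationship; we only see noisy i.i.d. samples of it.
rng = np.random.default_rng(0)

def F(x):                                   # stands in for the unknown y = F(x)
    return np.sin(2 * np.pi * x)

n = 50                                      # finite observations
x = rng.uniform(0.0, 1.0, size=n)           # i.i.d. draws of the input variables
y = F(x) + rng.normal(0.0, 0.1, size=n)     # outcomes recorded in a noisy environment

# Parametric model space: degree-5 polynomials, parameters fit by least squares.
theta = np.polyfit(x, y, deg=5)
f = np.poly1d(theta)                        # the estimated model f(x)

# The approximation measure is an average error under P(x, y); the sample
# average below is its empirical stand-in on the observed data.
print("average squared error on the sample:", np.mean((f(x) - y) ** 2))
```

Whether this sample-average error says anything about data outside the sample is the generalization question the rest of the article turns to.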
David Rumelhart and James McClelland proposed the nonlinear backpropagation algorithm.9 AI, pattern recognition, and statistics researchers became interested in this approach. And, driven by real-world demands, ML, especially statistical ML, gradually became independent from traditional AI and pattern recognition. Researchers shunned early symbolic ML methods because they lacked good theoretical generalization guarantees. However, because the complexities of real-world data make a ubiquitous learning algorithm impossible, the quality of the data and background knowledge could be the key to ML's success. The intrinsic readability (interpretability) of symbolic
methods could allow ML to regain popularity because its ability to simplify data sets provides a possible way humans can understand data quality and make alterations to improve that quality.

The research community is no longer interested in the algorithm design principles of nonlinear backpropagation. But this research reminded people of nonlinear algorithms' importance. Designing nonlinear learning algorithms theoretically and systematically is an important impetus of current statistical ML research.

ML has become an important topic for pattern recognition researchers and others. In 2001, Leo Breiman published "Statistical Modeling: The Two Cultures," which viewed ML as a subcategory of statistics.10 This approach led to many new directions and ideas for ML.

In our view, ML actually has two foundations. The first is statistics. Because ML's objective is to estimate a model from observed data, it must use statistical measures
to evaluate the model's performance and estimate the model. ML also needs statistics to filter noise in the data. The second foundation is computer science algorithm design methodologies, for optimizing parameters. In recent years, ML's main objective has been to find a learner linearly dependent on these parameters.

In the past decade, ML has been in a "margin era."11,12 Geometrically speaking, the margin of a classifier is the minimal distance of training points from the decision boundary. Generally, a large margin implies good generalization performance. Anselm Blumer was the first to use VC (Vapnik-Chervonenkis) dimensions and the probably approximately correct (PAC) framework to study learners' generalization ability.13 In 1995, Vladimir Vapnik published the book The Nature of Statistical Learning Theory,11 which included his work on determining PAC bounds of generalization based on the VC dimensions. Blumer's and Vapnik's research serves as the theoretical cornerstone of finite-sample statistical-learning theory. In 1998, John Shawe-Taylor related generalization to the margin of two closed convex sets in the feature space.14

These theories led to the large-margin algorithm design principle, which has three points:

Generalization is based on finite samples …
… wasn't until 10 years later that Leslie Valiant proposed PAC theory,24 which clearly describes this idea.

In fact, in the early 1970s, Vapnik proposed the finite-sample statistical theory. Although the theory used VC dimensions to describe the model space's complexity, it neither explained the reasons for using the approximated correctness of models nor provided clues on developing algorithms. So, AI researchers didn't adopt it. Methods from this statistical theory remained unacknowledged until Vapnik's book in 1995 and particularly Shawe-Taylor's research on margin bounds in 1998.

However, many statistics researchers still widely criticized Vapnik's finite-sample statistical theory …
… be polynomial. The first implies that algorithms must be able to deal with nonlinear problems; the second implies that the algorithms must be computable. Minsky and Papert implied that because the two principles contradict each other, it might be difficult to design ML methods that don't depend on domain knowledge. After nearly 40 years, these two principles still apply to ML algorithms.
Backpropagation is a nonlinear algorithm. It was a milestone in perceptron-type learning. However, Vapnik recommended going back to linear perceptrons. Philosophically, "nonlinear" is a general name for unknown things. When you've solved a problem, it means you've found a space on which the problem can be linearly represented. Technically, you need to find a space where nonlinear problems become linear. This was the basic idea when John von Neumann built mathematical foundations for quantum mechanics in the 1930s.30 Vapnik used this idea when he suggested looking for a map that takes linearly inseparable data to a Hilbert space (a linear inner-product space), which he called the feature space. After mapping, the data could be linearly separable in this space. So, you would need to consider only linear perceptrons there. If you considered margins, you would then have the maximal-margin problem in the feature space.
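A small sketch may help here; it illustrates the general idea rather than Vapnik's construction, and the data, the map phi(x) = (x, x^2), and the hand-picked separator are all assumptions made for the example. One-dimensional data that no single threshold separates becomes linearly separable in the feature space, where the margin can be read off as the minimal distance of training points from the decision boundary:

```python
import numpy as np

# Sketch of the "find a space where the problem becomes linear" idea
# (an illustration, not Vapnik's construction). No threshold on x alone
# separates these two classes, but the explicit map phi(x) = (x, x^2) does.
x = np.array([-2.0, -1.5, -0.25, 0.0, 0.3, 1.4, 2.2])
y = np.where(np.abs(x) > 1.0, 1, -1)          # +1 outside [-1, 1], -1 inside

phi = np.column_stack([x, x ** 2])            # the feature space

# A hand-chosen linear separator w . phi(x) + b in feature space
# (a large-margin algorithm would optimize w and b instead).
w, b = np.array([0.0, 1.0]), -1.0

scores = phi @ w + b
print("linearly separated in feature space:", bool(np.all(np.sign(scores) == y)))

# Margin: the minimal distance of training points from the decision boundary.
print("geometric margin:", np.min(y * scores) / np.linalg.norm(w))
```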
The nonlinearity problem would seem to have been solved, but it isn't so simple. To make the problem linearly separable, you need to add dimensions. So, the feature space's dimensionality will be much higher than that of the input space. How high should this dimensionality be for the mappings of the convex sets in the input space to have the maximal margin? This question remains unanswered. It's one reason you can't use the margin as a criterion to evaluate model precision and must resort to traditional measures.
Schapire's weak-learnability theorem implied another design principle for learning algorithms: You can obtain high-precision models by combining many low-precision models. This principle is called ensemble learning. Neuroscientist Donald O. Hebb first employed this principle in his multicell ensemble theory, which posits that vision objects are represented by interconnected neuron ensembles. The intent of the
term ensemble as it's used nowadays is consistent with Hebb's intent: a high-precision model is represented through many low-precision models.

To design algorithms, you can regard a learning problem as an optimization problem on the space spanned by these weak classifiers. This is the basic design principle in popular boosting algorithms.26 This method's biggest advantage is that it automatically reduces dimensionality, because the number of weak classifiers is usually far smaller than the input's dimensionality.
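The sketch below illustrates this ensemble principle with an AdaBoost-style loop over decision stumps; the synthetic data, the stump weak learner, and the 20 rounds are illustrative assumptions rather than details from the article or from reference 26:

```python
import numpy as np

# Ensemble-learning sketch: combine many weak decision stumps into one model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 2] > 0, 1, -1)   # synthetic target

def best_stump(X, y, w):
    """Weak learner: threshold on one feature, minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

w = np.full(len(y), 1.0 / len(y))     # sample weights
F = np.zeros(len(y))                  # ensemble score
for _ in range(20):
    err, j, thr, sign = best_stump(X, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)          # this stump's vote
    pred = sign * np.where(X[:, j] > thr, 1, -1)
    F += alpha * pred
    w *= np.exp(-alpha * y * pred)                 # re-weight misclassified points
    w /= w.sum()

print("training accuracy of the ensemble:", np.mean(np.sign(F) == y))
```

Each round re-weights the sample so the next low-precision stump concentrates on the points the current ensemble gets wrong; the weighted vote of the stumps is the high-precision model.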
The Rashomon Effect
Let's look at another problem affecting machine learning. In a liter of water, drop 10
grams of salt or sugar. We can discriminate whether you dropped sugar or salt when we taste it. However, add a liter of water, another liter of water, and so on. Eventually, we wouldn't be able to tell what you added. Such is the effect of dilution.
When dealing with a high-dimensional data set, you'll see a similar effect. For a fixed-size training sample, if the number of variables (features in pattern recognition or attributes in AI) reaches a certain point, the sample will be diluted over the space spanned by these variables. This space will also contain many satisfactory models, meeting a precision requirement for common bias and variance measures (such as cross-validation on the given sample).
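A quick simulation (with assumed numbers, not Breiman's) shows this dilution: with 20 observations spread over 40 variables, many different models reproduce the training sample almost perfectly, yet they behave very differently away from it:

```python
import numpy as np

# Dilution / multiplicity-of-models illustration (assumed numbers):
# a fixed-size sample spread over many variables admits many "satisfactory" models.
rng = np.random.default_rng(0)
n, p = 20, 40                                   # 20 observations, 40 variables
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0] = 1.0                              # only one variable really matters
y = X @ beta_true + 0.5 * rng.normal(size=n)    # noisy observations

X_new = rng.normal(size=(1000, p))              # observations outside the sample
y_new = X_new @ beta_true

for trial in range(4):
    # Each trial builds a model from a different random subset of 20 variables;
    # every such model can reproduce the training sample almost exactly.
    cols = rng.choice(p, size=n, replace=False)
    beta = np.zeros(p)
    beta[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    train_mse = np.mean((X @ beta - y) ** 2)
    new_mse = np.mean((X_new @ beta - y_new) ** 2)
    print(f"model {trial}: sample MSE {train_mse:.1e}, out-of-sample MSE {new_mse:.1f}")
```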
However, for real-world problems, maybe only one or a few of those models will be useful. This means that for real-world observations outside the training sample, most of these satisfactory models won't be good. This is the multiplicity-of-models problem. Leo Breiman listed …
… as a set of linear constraints on the coefficients of the variables. During optimization, if a variable's coefficient decreases to 0, Lasso eliminates the variable from the model, thus performing feature selection.
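The following is a bare-bones sketch of that behavior, using a plain coordinate-descent update with soft thresholding on synthetic data; it is meant to show how coefficients are driven exactly to zero, not to reproduce the formulation in reference 32:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero; return 0 when |rho| <= lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]        # residual ignoring feature j
            rho = X[:, j] @ r
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0] + [0.0] * 8)     # only two variables matter
y = X @ beta_true + 0.1 * rng.normal(size=100)

w = lasso_cd(X, y, lam=20.0)
print("selected variables:", np.nonzero(w)[0])    # zeroed coefficients drop out
print("coefficients:", np.round(w, 2))
```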
On the basis of Lasso, Bradley Efron proposed LARS (Least Angle Regression), which is more interesting.33 To provide feature selection, LARS uses analysis of residuals. Efron proved that with one additional constraint, you can use LARS to solve the Lasso optimization objective. Because LARS is basically a very efficient forward selection method, it's amazing that it can obtain the optimal solution of a numerical optimization problem.
Most research on the Rashomon effect aims to reduce the number of dimensions or the number of variables under certain statistical assumptions and build models on a low-dimensional space. However, in some fields (for example, natural language processing), instead of only a handful of features being meaningful, most features are meaningful. Simple dimensionality reduction might give good models measured on a given training set but be useless in linguistics. In other words, an algorithm will be meaningful in linguistics only if it considers all the features. However, linguists don't like to work on the full feature set. They often start from small parts and then combine the different parts. In this case, ensemble methods might be useful.
Another facet of the Rashomon effect is that the data might come not from one source but from multiple sources. Suppose that every model relies on a subset of the feature set and that different models rely on nonintersecting sets. In this situation, you can directly use feature selection methods. However, to find the best partition of the feature set, you'll have to search all the feature subsets. This method is different from feature selection based on statistical correlation; it must consider not only the correlation among features but also users' specific requirements.34 This is important in personalized applications, such as personalized information retrieval.
A more complicated situation is when the data set is an additive combination of multiple reasonable models. For example, with image data, an image might contain many different objects. In this case, feature extraction might be useful. Many researchers have studied feature extraction …
Increasing needs in fields such as molecular biology, Internet information analysis and processing, and economic data analysis have greatly stimulated ML research. They have also presented many complex problems that can't be solved with traditional ML but that need fresh approaches and new knowledge from other fields.

One fresh approach comes from Breiman's 2001 article in Statistical Science about how a statistician understands ML.10 He advises that statisticians start from practice and pay attention to problems caused by high-dimensional data. He admonishes computer scientists to consider the conditions for using various theories. When processing data, no matter how good or bad the result was, computer scientists …
Acknowledgments
References
1. H. Simon, "Why Should Machines Learn?" Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, eds., Tioga Press, 1983, pp. 25-38.
2. R. Solomonoff, "A New Method for Discovering the Grammars of Phrase Structure Language," Proc. Int'l Conf. Information Processing, Unesco, 1959, pp. 285-290.
3. E. Hunt, J. Marin, and P. Stone, Experiments in Induction, Academic Press, 1966.
4. A.L. Samuel, "Some Studies in Machine Learning Using the Game of Checkers, Part II," IBM J. Research and Development, vol. 11, no. 4, 1967, pp. 601-618.
5. J. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, Mar. 1986, pp. 81-106.
The Authors

Jue Wang is a professor at the Chinese Academy of Sciences Institute of Automation. His research interests include artificial neural networks, genetic algorithms, multiagent systems, machine learning, and data mining. Wang received his master's from the Chinese Academy of Sciences Institute of Automation. He's an IEEE senior member. Contact him at jue.wang@mail.ia.ac.cn.

Qing Tao is a professor at the Chinese Academy of Sciences Institute of Automation and at the New Star Research Institute of Applied Technology. His research interests are applied mathematics, neural networks, statistical learning theory, support vector machines, and pattern recognition. Tao received his PhD from the University of Science and Technology of China. Contact him at qing.tao@mail.ia.ac.cn or taoqing@gmail.com.
20. Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, 1997, pp. 119-139.
21. R.E. Schapire et al., "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods," Annals of Statistics, vol. 26, no. 5, 1998, pp. 1651-1686.
22. T. Zhang, "Statistical Behaviour and Consistency of Classification Methods Based on Convex Risk Minimization," Annals of Statistics, vol. 32, no. 1, 2004, pp. 56-85.
23. Y. Lin, "Support Vector Machines and the Bayes Rule in Classification," Data Mining and Knowledge Discovery, vol. 6, no. 3, 2002, pp. 259-275.
24. L. Valiant, "A Theory of the Learnable," Comm. ACM, vol. 27, no. 11, 1984, pp. 1134-1142.
25. L. Breiman, "Prediction Games and Arcing Algorithms," Neural Computation, vol. 11, no. 7, 1999, pp. 1493-1517.
26. J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," Annals of Statistics, vol. 28, no. 2, 2000, pp. 337-407.
27. I. Steinwart, "Which Data-Dependent Bounds Are Suitable for SVMs?" tech. report, Los Alamos Nat'l Lab., 2002; www.ccs3.lanl.gov/~ingo/pubs.shtml.
28. T. Hastie and J. Zhu, "Comment," Statistical Science, vol. 21, no. 3, 2006, pp. 352-357.
29. M. Minsky and S. Papert, Perceptrons, expanded ed., MIT Press, 1988.
30. J. von Neumann, Mathematical Foundations of Quantum Mechanics, Princeton Univ. Press, 1932.
31. H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 1998.
32. R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. Royal Statistical Soc.: Series B, vol. 58, no. 1, 1996, pp. 267-288.
33. B. Efron et al., "Least Angle Regression," Annals of Statistics, vol. 32, no. 2, 2004, pp. 407-499.
34. H.L. Liang, W. Jue, and Y. YiYu, "User-Oriented Feature Selection for Machine Learning," Computer J., vol. 50, no. 4, 2007, pp. 421-434.
35. A. Blum and T. Mitchell, "Combining Labeled and Unlabeled Data with Co-training," Proc. 11th Ann. Conf. Computational Learning Theory, ACM Press, 1998, pp. 92-100.
36. R. Herbrich, T. Graepel, and K. Obermayer, "Support Vector Learning for Ordinal Regression," Proc. 9th Int'l Conf. Artificial Neural Networks, IEEE Press, 1999, pp. 97-102.
37. S. Dzeroski and N. Lavrac, eds., Relational Data Mining, Springer, 2001.
38. H. Seung and D. Lee, "The Manifold Ways of Perception," Science, vol. 290, no. 5500, 2000, pp. 2268-2269.
39. S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, no. 5500, 2000, pp. 2323-2326.
40. J. Tenenbaum, V. de Silva, and J. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, 2000, pp. 2319-2323.
41. R.N. Shepard, C.I. Hovland, and H.M. Jenkins, "Learning and Memorization of Classifications," Psychological Monographs, vol. 75, no. 13, 1961, pp. 1-42.
42. M. Nosofsky, J. Palmeri, and C. McKinley, "Rule-Plus-Exception Model of Classification Learning," Psychological Rev., vol. 101, no. 1, 1994, pp. 53-79.
43. Y. Yao et al., "Rule + Exception Strategies for Security Information Analysis," IEEE Intelligent Systems, Sept./Oct. 2005, pp. 52-57.
44. J. Wang et al., "Multilevel Data Summarization from Information System: A Rule + Exception Approach," AI Comm., vol. 16, no. 1, 2003, pp. 17-39.
45. E.P. Xing et al., "Distance Metric Learning, with Application to Clustering with Side-Information," Advances in NIPS, vol. 15, Jan. 2003, pp. 505-512.
46. T.G. Dietterich, R.H. Lathrop, and T. Lozano-Pérez, "Solving the Multiple-Instance Problem with Axis-Parallel Rectangles," Artificial Intelligence, vol. 89, nos. 1-2, 1997, pp. 31-71.
47. W. Zhu and F.-Y. Wang, "Reduction and Axiomization of Covering Generalized Rough Sets," Information Sciences, vol. 152, no. 1, 2003, pp. 217-230.