The optimal state-action value function is defined as

Q*(s, a) = max_π Q^π(s, a).

When Q*(s, a) is obtained, the optimal policy π* can be computed by

π*(s) = arg max_a Q*(s, a)

where the optimal policy π* is a deterministic policy, and it is a projection from S to A.

For the average reward criteria, let ρ^π [see (2)] be the average reward per time step for a policy π. The relative state value function and the relative state-action value function are defined as [4]

V^π(s) = E_π[ Σ_{t=0}^{∞} (r_t − ρ^π) | s_0 = s ]

Q^π(s, a) = E_π[ Σ_{t=0}^{∞} (r_t − ρ^π) | s_0 = s, a_0 = a ]

and the following Bellman equations are satisfied:

V^π(s) = E_π[ r_t + V^π(s′) − ρ^π ]

Q^π(s, a) = R(s, a) − ρ^π + E_π[ Q^π(s′, a′) ].   (7)

The optimal relative state-action value function for average reward settings satisfies

Q*(s, a) + ρ^{π*} = max_π [ Q^π(s, a) + ρ^π ]

and the optimal policy π* can also be obtained by

π*(s) = arg max_a Q*(s, a).

B. Basic RL Algorithms

The earlier approach to solving MDPs with model information is to use dynamic programming (DP) techniques, which compute the optimal policies by estimating the optimal state-action value functions. However, traditional DP algorithms commonly require full model information, and large amounts of computation are needed for large state and action spaces. Different from DP, an RL agent learns an optimal or near-optimal policy by interacting with the environment, whose dynamic model is assumed to be unknown [4], [19]. As indicated in [4], based on the observed state transition data of MDPs, RL algorithms integrate the techniques of Monte Carlo, stochastic approximation, and function approximation to obtain approximate solutions of MDPs. As a central mechanism of RL, temporal-difference (TD) learning [5] can be viewed as a combination of Monte Carlo and DP. On the one hand, like Monte Carlo methods, TD algorithms can learn the value functions using state transition data without model information. On the other hand, similar to DP, before a final outcome is obtained, TD methods can update the current estimation of value functions partially based on previously learned results [4], [5].

For the discounted reward criteria, Q-learning and Sarsa are the most widely used tabular RL algorithms. The Q-learning algorithm is shown in Algorithm 1, where α is the learning rate parameter, and r is the immediate reward.

Algorithm 1 Q-Learning Algorithm [4], [6]
\\N: The maximum number of episodes
1: Initialize Q(s, a) arbitrarily;
2: repeat (for each episode i)
3:   Initialize s;
4:   repeat (for each step of episode)
5:     Choose a from s using policy derived from Q(s, a);
6:     Take action a, observe r, s′;
7:     Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)];
8:     s ← s′;
9:   until s is terminal
10: until i = N

If, in the limit, the Q values of all admissible state-action pairs are updated infinitely often, and α decays in a way satisfying the usual stochastic approximation conditions, then the Q values will converge to the optimal value Q* with probability 1 [20]. For the Sarsa algorithm, if each action is executed infinitely often in every state that is visited infinitely often, the action is greedy with respect to the current Q value in the limit, and the learning rate decays appropriately, then the estimated Q values will also converge to the optimal value Q* with probability 1 [21].

For the average reward criteria, R-learning [22] is the most widely studied RL algorithm based on TD. The major steps of the R-learning algorithm are illustrated in Algorithm 2.

Algorithm 2 R-Learning Algorithm [4], [22]
\\ρ: The average reward
\\N: The maximum number of episodes
1: Initialize Q(s, a) and ρ arbitrarily;
2: repeat (for each episode i)
3:   s ← current state;
4:   Select a from s using policy derived from Q(s, a);
5:   Take action a, observe r, s′;
6:   Q(s, a) ← Q(s, a) + α[r − ρ + max_{a′} Q(s′, a′) − Q(s, a)];
7:   if Q(s, a) = max_a Q(s, a) then
8:     ρ ← ρ + β[r − ρ + max_{a′} Q(s′, a′) − max_a Q(s, a)];
9:   end if
10: until i = N

C. MOO Problems

The MOO problem can be formulated as follows [23], [24]:

max F(X) = [f_1(X), f_2(X), . . ., f_{mf}(X)]
s.t. g_i(X) ≤ 0, i = 1, . . ., m_g

where the "max" operator for a vector is defined either in the sense of Pareto optimality or in the sense of maximizing a weighted scalar of all the elements, X = [x_1, x_2, . . ., x_N]^T ∈ R^N is the vector of variables to be optimized, the functions g_i(X) (i = 1, 2, . . ., m_g) are the constraint functions of this problem, and f_i(X) (i = 1, 2, . . ., m_f) are the objective functions.

The optimal solutions of an MOO problem can be described by two concepts. One is the concept of multiobjective to single
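As a concrete sketch, the update loop of Algorithm 1 can be written in Python. The three-state chain environment, the ε-greedy exploration rule, and all parameter values below are illustrative assumptions, not part of the survey:

```python
import random

def q_learning(n_states, n_actions, step, n_episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning (Algorithm 1). `step(s, a)` is an assumed
    environment interface returning (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_episodes):              # repeat (for each episode)
        s = 0                                # initialize s
        done = False
        while not done:                      # repeat (for each step of episode)
            # choose a from s using an epsilon-greedy policy derived from Q
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r, s2, done = step(s, a)         # take action a, observe r, s'
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])   # TD update (line 7)
            s = s2
    return Q

def chain_step(s, a):
    """Illustrative 3-state chain: action 1 moves right, and reaching the
    terminal state 2 pays reward 1; action 0 stays put with zero reward."""
    if a == 1:
        s2 = s + 1
        return (1.0 if s2 == 2 else 0.0), s2, s2 == 2
    return 0.0, s, False
```

After enough episodes, the greedy policy derived from Q moves right in every non-terminal state of this toy chain.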
388 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 45, NO. 3, MARCH 2015
which satisfies the Bellman equation (6) or (7). Let

MQ^π(s, a) = [Q_1^π(s, a), Q_2^π(s, a), . . ., Q_N^π(s, a)]^T

Q_i(s, a) ← (1 − α)Q_i(s, a) + α[r_i + max_{a′} Q_i(s′, a′)]

where N ≥ i ≥ 1, and α is the learning rate parameter. The overall single policy can be determined based on TQ(s, a),
LIU et al.: MORL: A COMPREHENSIVE OVERVIEW 389
for weighting Q-values may be used for specific problem domains [31].

B. W-Learning Approach

In order to ensure that the selected action is optimal for at least one objective, several winner-take-all methods for MORL were studied in [32]. When the current state is s, a value W_i(s) is computed for each objective. Then, the selected action is based on the objective with the maximal W value.

One simple method to compute W values is called Top-Q [11], [32], which assigns the W value as the highest Q-value in the current state

W_i(s) = max_a Q_i(s, a), 1 ≤ i ≤ N.

The largest W value can be obtained as

W_max(s) = max_i W_i(s) = max_i max_a Q_i(s, a) = max_a max_i Q_i(s, a).   (11)

Algorithm 4 Negotiated W-Learning [32]
\\N: The number of objectives
1: Initialize leader l with a random integer between 1 and N;
2: Observe state s;
3: W_l = 0;
4: a_l = arg max_a Q_l(s, a);
5: loop:
6: for all objectives i except l do
7:   W_i = max_a Q_i(s, a) − Q_i(s, a_l);
8: end for
9: if max_i W_i > W_l then
10:   W_l = max_i W_i;
11:   a_l = arg max_a Q_i(s, a);
12:   l = i;
13:   go to 5;
14: else
15:   terminate the loop;
16: end if
17: Return a_l;
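The action selection of Algorithm 4 can be sketched as follows; the dictionary layout of the per-objective Q-tables and the tie-breaking of Python's `max` are illustrative assumptions (at least two objectives are assumed):

```python
def negotiated_w_action(Q, s, actions, leader=0):
    """Negotiated W-learning action selection (Algorithm 4): each objective
    bids the long-term reward it would lose if the current leader's greedy
    action were taken; the highest bidder becomes the new leader.
    Q[i][(s, a)] holds objective i's tabular Q-value."""
    w_leader = 0.0
    a_leader = max(actions, key=lambda a: Q[leader][(s, a)])
    while True:
        bids = {}
        for i in range(len(Q)):
            if i == leader:
                continue
            best = max(Q[i][(s, a)] for a in actions)
            bids[i] = best - Q[i][(s, a_leader)]   # loss under a_leader
        winner = max(bids, key=bids.get)
        if bids[winner] > w_leader:                # a stronger objection
            w_leader = bids[winner]
            leader = winner
            a_leader = max(actions, key=lambda a: Q[leader][(s, a)])
        else:
            return a_leader                        # no objective loses more
```

The loop terminates because the leader's W value strictly increases on every switch and the set of possible bids is finite.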
it may not be necessary to learn the W values; instead, they can be computed directly from the Q-values in a process called negotiated W-learning, as shown in Algorithm 4. The negotiated W-learning algorithm explicitly finds the objective that would be expected to lose the most long-term reward if it were not preferred in determining the next action.

SL_i = Σ_{j=1, j≠i} c_{i,j}

is the importance of objective i relative to all other objectives. Then, for each objective, a fuzzy inference system can be constructed. To compare two candidate actions a_p and a_q (a_p, a_q ∈ A), both the importance factor I_i and the value of
improvement D_i(a_p, a_q) = Q_i(s, a_p) − Q_i(s, a_q) are used as the inputs of the fuzzy system, and the output of the fuzzy system is the "goodness" of a_p relative to a_q. By incorporating the fuzzy subsets and fuzzy inference rules, an action selection strategy was constructed to solve the MORL problem [34]. The main drawback of this approach is that it requires a lot of prior knowledge of the problem domain.

D. Ranking Approach

The ranking approach, also called the sequential approach or the threshold approach, aims to solve multiobjective decision problems via an ordering or a preference relation among multiple criteria. The idea of using ordinal relations in optimal decision making was studied in the early research work by Mitten [35] and Sobel [36]. The synthetic objective function TQ(s, a) was expressed in terms of "partial policies." To ensure the effectiveness of the subordinate objective (the less important objective), multiple solutions need to be obtained for the optimization problem of the main objective.

Inspired by the idea of the ranking approach, an ordering of multiple objectives was established in [37] for MORL, where threshold values were specified for some objectives in order to put constraints on the objectives. One example of this kind of situation is an unmanned vehicle that performs navigation tasks in an environment while keeping its fuel level from running empty. The MORL approach in [37] optimizes one objective while putting constraints on the other objectives. The actions are chosen based on the thresholds and a lexicographic ordering (the last objective is maximized first) [12], [37]. Let

CQ_i(s, a) = min{Q_i(s, a), C_i}

where N ≥ i ≥ 1, and C_i is the threshold value (the maximum allowable value) for objective i. Since objective N is assumed to be unconstrained, C_N = +∞. In the ranking approach for MORL [12], [37], given a partial ordering of all objectives and their threshold values, actions a and a′ can be compared by the action comparison mechanism shown in Algorithm 5, where there is a sub-function Superior() which was recursively defined in [12]. Based on this action comparison and selection mechanism, the MORL problem can be solved by combining the mechanism with some standard RL algorithms such as Q-learning.

Algorithm 5 Action Comparison Mechanism of the Ranking Approach [12]
Superior(CQ_i(s, a′), CQ_i(s, a), i);
1: if CQ_i(s, a′) > CQ_i(s, a) then
2:   Return true;
3: else if CQ_i(s, a′) = CQ_i(s, a) then
4:   if i = N then
5:     Return true;
6:   else
7:     Return Superior(CQ_{i+1}(s, a′), CQ_{i+1}(s, a), i + 1);
8:   end if
9: else
10:   Return false;
11: end if

The performance of the ranking-based MORL approach is mainly dependent on the ordering of the objectives as well as the threshold values. The design of an appropriate lexicographic ordering of all the objectives and their threshold values still requires some prior knowledge of the problem domain, which reflects the designer's preferences [105]. Geibel [38] employed this idea to balance the expected return and risk, where the risk is kept smaller than some specified threshold, and the problem was formulated as a constrained MDP. In [39], the ranking-based MORL approach was applied to the routing problem in cognitive radio networks to address the challenges of randomness, uncertainty, and multiple metrics.

E. Geometric Approach

To deal with dynamic unknown Markovian environments with long-term average reward vectors, Mannor and Shimkin [40] proposed a geometric approach to MORL. It is assumed that the actions of other agents may influence the dynamics of the environment. Sufficient conditions for state recurrence, i.e., that the game is irreducible or ergodic, are also assumed to be satisfied. In [40], using the proposed geometric-based idea, two MORL algorithms, called multiple directions RL (MDRL) and single direction RL (SDRL), were presented to approximate a desired target set in a multidimensional objective space, as shown in Fig. 5.

Fig. 5. Predicted target set (two objectives).

This target set can be viewed as the synthetic objective function TQ(s, a), which satisfies some geometric conditions of a dynamic game model. The MDRL and SDRL algorithms are based on the reachability theory for stochastic games, and the main mechanism behind these two algorithms is to steer the average reward vector to the target set.

When a nested class of target sets is prescribed, Mannor and Shimkin [40] also provided an extension of the geometric MORL algorithm, where the goal of the learning agent is to approach the target set with the smallest size. A particular example of this case is solving constrained MDPs with average rewards. In addition, the geometric algorithms also need some prior knowledge of the problem domain to define the target set.
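The thresholded lexicographic comparison of Algorithm 5 can be sketched as follows; the 0-based objective indexing and the dictionary storage of the Q-tables are illustrative assumptions:

```python
def cq(Q, C, i, s, a):
    """Thresholded value CQ_i(s, a) = min(Q_i(s, a), C_i); the last
    objective is unconstrained, so its threshold is set to +inf."""
    return min(Q[i][(s, a)], C[i])

def superior(Q, C, s, a_new, a_old, i=0):
    """Algorithm 5: compare two actions on the thresholded Q-values of
    objective i, recursing to the next objective when they tie."""
    if cq(Q, C, i, s, a_new) > cq(Q, C, i, s, a_old):
        return True
    if cq(Q, C, i, s, a_new) == cq(Q, C, i, s, a_old):
        if i == len(Q) - 1:          # tie on the last objective
            return True
        return superior(Q, C, s, a_new, a_old, i + 1)
    return False
```

Capping Q-values at C_i makes any two actions that both satisfy a threshold indistinguishable on that objective, so the comparison falls through to the next, less important one.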
TABLE I
REPRESENTATIVE APPROACHES TO MORL

Algorithm 6 Convex Hull Algorithm [42]
1: Initialize Q̂(s, a), for all s, a, arbitrarily;
2: while not converged do
3:   for all s ∈ S, a ∈ A do
4:     Q̂(s, a) ← E[ r(s, a) + γ hull ∪_{a′} Q̂(s′, a′) | s, a ];
5:   end for
6: end while
7: Return Q̂;
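Algorithm 6 maintains, for each state-action pair, a set of value vectors rather than a single scalar. A minimal two-objective sketch is given below; approximating the hull operator by keeping the maximizers over a sampled grid of weights is a simplification of the exact convex-hull computation in [42], and the deterministic-transition backup is an illustrative assumption:

```python
def hull(points, n_weights=101):
    """Keep the vectors that maximize w*v0 + (1-w)*v1 for some sampled
    weight w in [0, 1] -- an approximation of the exact two-objective
    convex-hull filter used by the algorithm."""
    kept = []
    for k in range(n_weights):
        w = k / (n_weights - 1)
        best = max(points, key=lambda v: w * v[0] + (1 - w) * v[1])
        if best not in kept:
            kept.append(best)
    return kept

def backup(r, gamma, next_sets):
    """One deterministic-transition backup of Algorithm 6, line 4:
    Q(s, a) <- r + gamma * hull( union over a' of Q(s', a') )."""
    union = [v for vs in next_sets for v in vs]
    return [(r[0] + gamma * v[0], r[1] + gamma * v[1]) for v in hull(union)]
```

Vectors dominated for every weight, such as (0.4, 0.4) against (1, 0) and (0, 1), are discarded, which is what keeps the stored sets small across iterations.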
by a single run. Studley and Bull [47] investigated the performance of a learning classifier system for MORL, and the results demonstrated that the choice of action-selection policies can greatly affect the performance of the learning system. Kei et al. [48] selected effective action candidates by the α-domination strategy and used a goal-directed bias based on the achievement level of each evaluation. Zhang et al. [49] and Zhao and Zhang [50] used a parallel genetic algorithm (PGA) to evolve a neuro-controller, and simultaneous perturbation stochastic approximation (SPSA) was used to improve the convergence of the proposed algorithm. The notion of adaptive margins [51] was adopted to improve the performance of MORL algorithms [52]. To obtain Pareto optimal policies in large or continuous spaces, some recent efforts include the following: Khamis and Gomaa [53] developed a multiagent framework for MORL in traffic signal control, Castelletti et al. [54] made use of FQI for approximating the Pareto front, and Wiering and de Jong [55] proposed a new value iteration algorithm called consistency multiobjective dynamic programming.

Most of the MORL algorithms mentioned above belong to single-policy approaches, and they were studied and tested independently. The main limitation is that the small scale of previous MORL problems may not verify an algorithm's performance in dealing with a wide range of different problem settings, and the algorithm implementations always require much prior knowledge about the problem domain.

B. Dynamic Preferences in MORL

Since many real-world optimization and control problems are not stationary, there has been growing interest in solving dynamic optimization problems in recent years. Similarly, the policies obtained by MORL approaches usually rely on the preferences on the reward vectors, so it is necessary to develop MORL algorithms that take dynamic preferences into consideration.

In optimal control with multiple objectives, the designer can use the fixed-weight approach [56] to determine the optimization direction. However, if there is no exact background knowledge of the problem domain, the fixed-weight approach may find unsatisfactory solutions. For MOO, the random-weight approach [57] and the adaptive approach [58] were studied to compute the weights based on the objective data so that the manual selection of objective weights can be simplified. Nevertheless, it is hard for these two approaches to express the preference of the designer.

An important problem is to automatically derive the weights when the designer is unable to assign them. This problem is usually called preference elicitation (PE), and its major related work is inverse RL [59], [60]. In order to predict the future decisions of an agent from its previous decisions, a Bayesian approach was proposed [61] to learn the utility function. In [62], an apprenticeship learning algorithm was presented in which observed behaviors were used to learn the objective weights of the designer.

Aiming at time-varying preferences among multiple objectives, Natarajan and Tadepalli [63] proposed an MORL method that can find and keep a finite number of policies which can be appropriately selected for varying weight vectors. This algorithm is based on the average reward criteria, and its schema is shown in Algorithm 7, where δ is a tunable parameter. The motivation for this algorithm is that despite there being infinitely many weight vectors, the set of all optimal policies may be well represented by a small number of optimal policies.

Algorithm 7 Algorithm Schema for Dynamic MORL [63]
1: Obtain the current weight vector w_new, set δ as a threshold value;
2: π_init = arg max_{π ∈ Π} (w_new · ρ^π)
3: Compute the value function vectors of π_init
4: Compute the average reward vector of π_init
5: Learn the new policy π through vector-based RL;
6: If (w_new · ρ^π − w_new · ρ^{π_init}) > δ, add π to the set of stored policies.

C. Evaluation of MORL Approaches

As a relatively new field of study, research on MORL has mainly focused on various principles and algorithms to deal with multiple objectives in sequential decision-making problems. Although it is desirable to develop standard benchmark problems and methods for the evaluation of MORL algorithms, there has been little work on this topic except a recent work by Vamplew et al. [12]. In previous studies on MORL, the algorithms were usually evaluated on different learning problems, so it is necessary to define some standard test problems with certain characteristics for rigorous performance testing of MORL algorithms. For example, many MORL algorithms were tested separately on ad hoc problems, including traffic tasks [64]–[67], medical tasks [68], robot tasks [69]–[71], network routing tasks [72], grid tasks [73], and so on. As indicated in [12], it is difficult to make comparisons among these algorithms due to the lack of benchmark test problems and methodologies. Furthermore, for ad hoc application problems, the Pareto fronts are usually unknown, and it is hard to find absolute performance measures for MORL algorithms. Therefore, a suite of benchmark problems with known Pareto fronts was proposed in [12], together with a standard method for performance evaluation which can serve as a basis for future comparative studies. In addition, two classes of MORL algorithms were evaluated in [12] based on different evaluation metrics. In particular, single-policy MORL algorithms were tested via online learning, while offline learning performance was tested for multipolicy approaches.

VI. RELATED FIELDS OF STUDY

MORL is a highly interdisciplinary field, and its related fields mainly include MOO, hierarchical RL (HRL), and multiagent RL (MARL). Their relations to MORL and recent progress are discussed in this section. In addition, by using a keyword-based search in Google Scholar, the average number of MORL-related publications was about 90 per year from
1999 to 2001, while this average number became about 700 per year from 2011 to 2013. Thus, it can be observed that during the past 10 years, there has been a significant increase in the number of publications related to MORL.

A. MOO

MOO problems usually have no unique optimal solution, which is different from SOO, where there is a single best solution. For MOO and MORL, a set of noninferior, alternative solutions called the Pareto optimal solutions can be defined instead of a single optimal solution. Nevertheless, the aim of MOO is to solve a parameter optimization problem with multiple objectives, while MORL is to solve sequential decision-making problems with multiple objectives.

There are two common goals for MOO algorithms: one is to find a set of solutions that is close to the Pareto optimal front, and the other is to obtain a diverse set of solutions representing the whole Pareto optimal front. These two goals have also been studied in MORL. In order to achieve these two goals, a variety of algorithms have been presented to solve MOO problems, such as the multiobjective particle swarm optimizer (MOPSO) [74], multiobjective genetic algorithms (MOGA) [14], [75], etc. These MOO algorithms have improved mathematical characteristics for solving various MOO problems. Nevertheless, when the dimension of MOO problems is high, many of these algorithms usually have decreased performance due to the difficulty of finding a wide range of alternative solutions. Moreover, it is hard for MOGA and MOPSO to solve MOO problems with concave Pareto fronts, which are commonly encountered in real-world applications.

Most MOO problems have different kinds of constraints, and they are also called constrained MOO problems. Until now, various constraint handling approaches [76]–[79] have been proposed to improve the performance of multiobjective evolutionary algorithms and MOPSO.

B. HRL

Hierarchical RL (HRL) makes use of a divide-and-conquer strategy to solve complex tasks with large state or decision spaces. Unlike conventional RL, HRL aims to solve sequential decision-making problems that can be best described as a set of hierarchically organized tasks and sub-tasks. MORL differs from HRL in that the MORL problem requires the learning agent to solve several tasks with different objectives at once. The most outstanding advantage of HRL is that it can scale to large and complex problems [80]. One common feature of HRL and MORL is that there are multiple tasks that need to be solved by the learning agent.

Earlier HRL algorithms need prior knowledge about the high-level structure of complex MDPs. There have been some HRL approaches or formulations to incorporate prior knowledge: HAMs [81], MAXQ [82], options [83], and ALisp [84]. The usage of prior knowledge can simplify the problem decomposition and accelerate the learning process for good policies. Currently, most HRL algorithms are based on the semi-MDP (SMDP) model. In [85], the SMDP framework was extended to concurrent activities, multiagent domains, and partially observable states. As discussed in [103], although value function approximators can be integrated with HRL, few successful results have been reported in the literature for applying existing HRL approaches to MDPs with large or continuous spaces. Furthermore, automatically decomposing the state space of MDPs or constructing options is still a difficult task.

Recently, some new HRL algorithms were proposed. In [86], an HRL approach was presented where the state space can be partitioned by critical states. A hierarchical approximate policy iteration (HAPI) algorithm with binary-tree state space decomposition was presented in [87]. In the HAPI approach, after decomposing the original MDP into multiple sub-MDPs with smaller state spaces, better near-optimal local policies can be found, and the final global policy can be derived by combining the local policies in each sub-MDP. For MORL problems, a hierarchical learning architecture with multiple objectives was proposed in [88]. The major idea of this approach is to make use of a reference network so that an internal reinforcement representation can be generated during the operation of the learning process. Furthermore, internal reinforcement signals from different levels can be provided to represent multilevel objectives for the learning system.

C. MARL

In MORL, the learning agent aims to solve sequential decision problems with reward vectors, and multiple policies may be obtained separately through decomposition. For multiple-policy approaches, the MORL problem can be solved in a distributed manner, which is closely related to multiagent RL (MARL) [89]. In MARL, each agent may also have its own objective, and there may be multiple objectives in the learning system. However, most of the efforts in designing MARL systems have been focused on the communication and interaction (cooperation, competition, and mixed strategies) among agents. In MORL, usually there is no explicit sharing of information among objectives. The main task of an MARL system is for multiple autonomous agents to explicitly consider other agents and coordinate their action policies so that a coherent joint behavior can be realized.

In recent years, the research results on MARL can be viewed as a combination of temporal-difference learning, game theory, and direct policy search techniques. There are several major difficulties in the research of MARL algorithms which are different from single-agent settings. The first difficulty is the existence of multiple equilibria. In Markov games with a single equilibrium value, the optimal policy can be well defined. But when there are multiple equilibria, MARL algorithms have to ensure that the agents coordinate their policies for selecting appropriate equilibria. From an empirical viewpoint, we should carefully consider the influence of multiple equilibria on MARL algorithms, since undesirable equilibria may be easily reached due to certain game properties [90]. In the design of MARL algorithms, one major aim is to make the agents' policies converge to desirable equilibria. There have been many heuristic exploration strategies proposed in the literature so that the probability of reaching the optimal equilibria can be improved in identical interest games [89]–[92].
VII. CHALLENGES AND OPEN PROBLEMS

In addition to the three important aspects for further development in Section V, there are several challenges and open problems in MORL, which will be discussed in the sequel.

A. State/Feature Representation in MORL

The problem of selecting an efficient state or feature representation in a real-world problem has been an important topic in RL [4], [8]. In the earlier research on RL, there was some important progress in RL theory and algorithms for discrete-state MDPs. However, most real-world problems have large or continuous state spaces; thus, the huge computational costs make earlier tabular RL algorithms impractical for real applications. In the past decade, approximate value functions or policies with feature representations have been widely studied. But one main obstacle is that many RL algorithms and theories with good convergence properties usually rely on manually selected feature representations. Thus, it is difficult to accurately approximate the optimal value functions or policies without carefully selected features. Some recent advances in automatic feature representation include kernel methods and graph Laplacian approaches for RL [93], [94]. In MORL, there are multiple objectives to be achieved, and multiple value functions may need to be approximated simultaneously. The feature representation problem in MORL is more complicated due to the existence of multiple objectives or value functions. Therefore, state or feature representation in MORL is still a big challenge for further research.

B. Value Function Approximation in MORL

As discussed above, to solve MDPs with large or continuous state spaces, value function approximation (VFA) is a key technique to realize generalization and improve learning efficiency [95]. According to the basic properties of function approximators, there are two different kinds of VFA methods, i.e., linear VFA [96], [97] and nonlinear VFA [9], [98], [99]. In some real-world applications, multilayer neural networks were commonly employed as the nonlinear approximators for VFA. However, the empirical results of successful RL applications using nonlinear VFA commonly lack rigorous theoretical analysis. Some negative results concerning divergence were reported for Q-learning and TD learning based on direct gradient rules [4], [20]. Hence, many RL approaches with VFA require significant design efforts or problem insights, and it is hard to find a basis function set that is both sufficiently simple and sufficiently reliable. To solve MDPs with large or continuous state spaces, MORL algorithms also require VFA to improve generalization ability and reduce computational costs. The additional representation of the preferences among different objectives makes it more difficult to develop VFA techniques, especially when the MORL problem has dynamic preferences. Hence, VFA becomes a greater challenge for MORL than for standard RL.

C. Convergence Analysis of MORL Algorithms

Both the Q-learning algorithm and the Sarsa algorithm have some attractive qualities as basic approaches to RL. The major advantage is the fact that they are guaranteed to converge to the optimal solution for a single MDP with discrete state and action spaces. Suppose there are N objectives in an MORL problem; then this problem can be considered as N sub-MDPs (a sub-MDP is an MDP with one single objective). Hence, the convergence results of an algorithm to solve this MORL problem depend not only on the convergence of the algorithms used to solve these sub-MDPs but also on the representations of the preferences among all the objectives. The convergence of single-policy approaches to MORL can be analyzed based on the results of the RL algorithms used to solve sub-MDPs. However, the convergence of multiple-policy approaches has to consider the way the Pareto front is approximated. In short, for stationary preferences, the convergence analysis of MORL algorithms is mainly dependent on the properties of the learning algorithms used to solve MDPs and the representations of the preferences. For dynamic preferences, the dynamic characteristics must additionally be considered. So far, the convergence of MORL algorithms commonly lacks rigorous theoretical results.

D. MORL in Multiagent Systems

A multiagent system (MAS) is a system that has multiple interacting autonomous agents, and there are increasing numbers of application domains that are more suitable to be solved by a multiagent system instead of a centralized single agent [88], [100]. MORL in multiagent systems is a very important research topic due to the multiobjective nature of many practical multiagent systems. How to extend the algorithms and theories of MORL in single-agent systems to MORL in multiagent systems is an open problem. In particular, achieving several objectives at once through the cooperation of multiple agents in a multiagent system is a difficult problem to be solved. If there is competition in multiagent systems, the MORL problem becomes even more challenging.

E. Applications of MORL

In recent years, RL has been applied in a variety of fields [101]–[104], but the research on MORL algorithms is a relatively new field of study, and as such there are few real-world applications so far. The following difficult problems should be studied before MORL can be successfully applied in real-time complex systems. One is to develop efficient MORL algorithms that can solve MDPs with large or continuous state spaces. The other is to evaluate the performance of different MORL algorithms in practice and find the gap between theoretical analysis and real performance. The third one is to consider the constraints from real-world applications when developing theoretical models for MORL and to investigate how theoretical results can promote the development of new algorithms and mechanisms.

Since in many real-world applications there are multiple conflicting objectives to be optimized simultaneously, it can be
expected that MORL will find more application domains with [9] R. H. Crites and A. G. Barto, “Elevator group control using multi-
the development of new theory and algorithms. ple reinforcement learning agents,” Mach. Learn., vol. 33, nos. 2–3,
VIII. CONCLUSION

In this paper, the background, basic architectures, major research topics, and naïve solutions of MORL were first introduced; then several representative approaches were reviewed, and some important directions of recent research were discussed in detail. There are two main advantages of MORL. One is that MORL is very useful for improving the performance of single-objective RL by generating highly diverse Pareto-optimal models for constructing policy ensembles in domains with multiple objectives. The other is that MORL algorithms can realize a trade-off between accuracy and interpretability for sequential decision-making tasks.
For single-policy approaches, the weighted-sum and W-learning approaches are very simple to implement, but they cannot exactly express the preferences of the designer. The AHP, ranking, and geometric approaches may express the preferences more exactly, but they need more prior knowledge of the problem domain. For multiple-policy approaches, the convex hull algorithm can learn optimal policies for all linear preference assignments over the objective space at once. The varying-parameter approach can be easily implemented by performing multiple runs with different parameters, objective threshold values, and orderings in any single-policy algorithm.
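As an illustration of the two ideas above, the sketch below collapses a two-objective reward to a scalar with a fixed weight vector and then runs plain tabular Q-learning on it; rerunning the learner with different weight vectors is the varying-parameter recipe. The toy environment, its rewards, and all hyperparameters are assumptions made for this example, not taken from the surveyed algorithms.

```python
import numpy as np

# Illustrative two-objective task: 5 states on a line, both ends terminal.
# Objective 0 rewards reaching the right end; objective 1 rewards the left end.
N_STATES = 5
ACTIONS = (-1, +1)
START = N_STATES // 2

def step(s, a):
    """One move on the line; returns (next state, reward vector, done)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    reward = np.array([1.0 if s2 == N_STATES - 1 else 0.0,  # objective 0: right goal
                       1.0 if s2 == 0 else 0.0])            # objective 1: left goal
    return s2, reward, s2 in (0, N_STATES - 1)

def scalarized_q_learning(weights, episodes=300, alpha=0.1, gamma=0.9,
                          eps=0.1, seed=0):
    """Weighted-sum single-policy approach: collapse the vector reward to a
    scalar with fixed weights, then run ordinary tabular Q-learning."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        s, done = START, False
        while not done:
            if rng.random() < eps:
                a = rng.integers(len(ACTIONS))
            else:  # greedy action with random tie-breaking
                a = rng.choice(np.flatnonzero(Q[s] == Q[s].max()))
            s2, r_vec, done = step(s, ACTIONS[a])
            target = float(weights @ r_vec) + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

# Varying-parameter recipe: repeat the single-policy run with different weight
# vectors to obtain policies with different trade-offs
# (here, w = (1, 0) steers right and w = (0, 1) steers left).
for w in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    Q = scalarized_q_learning(np.array(w))
    policy = [ACTIONS[int(Q[s].argmax())] for s in range(1, N_STATES - 1)]
```

Because the scalarization is linear, each run can only recover policies on the convex hull of the Pareto front, which is exactly the limitation of weighted-sum methods noted above.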
MORL approaches have been improved recently in three important aspects: enhancing their solution quality, adapting to dynamic preferences, and constructing evaluation systems. The main challenges and open problems in MORL include value function approximation, feature representation, convergence analysis of algorithms, and the application of MORL to multiagent systems and real-world problems. Continued research progress in these directions can be expected.
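Evaluation systems for multiple-policy methods typically rest on Pareto dominance between the value vectors achieved by candidate policies. A minimal sketch of that test follows; the candidate values are made up for illustration.

```python
def dominates(u, v):
    """True if value vector u Pareto-dominates v: no worse in every
    objective and strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(points):
    """Filter candidate policies' value vectors down to the non-dominated
    (Pareto-optimal) ones, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Average returns of four hypothetical policies on two objectives.
candidates = [(1.0, 0.2), (0.8, 0.8), (0.3, 0.9), (0.5, 0.5)]
front = pareto_front(candidates)  # drops (0.5, 0.5), dominated by (0.8, 0.8)
```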
ACKNOWLEDGMENT

The authors would like to thank the Associate Editors and anonymous reviewers for their valuable comments and suggestions, which greatly improved the quality of this paper.
Chunming Liu received the B.Sc. and M.Sc. degrees from the National University of Defense Technology, Changsha, China, in 2004 and 2006, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include intelligent systems, machine learning, and autonomous land vehicles.

Xin Xu (M'07–SM'12) received the B.S. degree in electrical engineering from the National University of Defense Technology (NUDT), Changsha, China, in 1996, where he received the Ph.D. degree in control engineering from the College of Mechatronics and Automation in 2002. He is currently a Full Professor with the Institute of Unmanned Systems, College of Mechatronics and Automation, NUDT. He has been a Visiting Scientist for cooperation research with Hong Kong Polytechnic University, Hong Kong; the University of Alberta, Edmonton, AB, Canada; the University of Guelph, Guelph, ON, Canada; and the University of Strathclyde, Glasgow, U.K. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles. He has authored or coauthored over 100 papers in international journals and conferences and has coauthored four books. He currently serves as an Associate Editor of Information Sciences and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. Dr. Xu was the recipient of the 2nd Class National Natural Science Award of China in 2012 and the Fork Ying Tong Youth Teacher Fund of China in 2008. He is a Committee Member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning and the IEEE Technical Committee on Robot Learning. He was a PC Member or a Session Chair of various international conferences.

Dewen Hu (SM'09) was born in Hunan, China, in 1963. He received the B.Sc. and M.Sc. degrees from Xi'an Jiaotong University, Xi'an, China, in 1983 and 1986, respectively, and the Ph.D. degree from the National University of Defense Technology, Changsha, China, in 1999. Since 1986, he has been with the National University of Defense Technology. From 1995 to 1996, he was a Visiting Scholar with the University of Sheffield, Sheffield, U.K., and he was promoted to Professor in 1996. His current research interests include image processing, system identification and control, neural networks, and cognitive science. Dr. Hu was the recipient of the 2nd Class National Natural Science Award of China in 2012. He is an Action Editor of Neural Networks.