Brown 1

No man is an island entire of itself; every man is a piece of the continent, a part of the
main – John Donne. In many ways, this quote emblemizes the true nature of human capacity,
specifically in terms of the architecture of our intelligence. Though each individual is endowed
with his or her own intellectual ability, the true power of human intelligence comes from its
status as a social intelligence. Social intelligences are marked by their reliance on collaboration,
by their vested interest in the coherent contributions of many. Humanity has made remarkable
progress as a function of this social intelligence, and thus one is compelled to test the effects of
a “social” system in its computational manifestations. Specifically, this study focuses on the
implications of collaboration for computational intelligence in unsupervised neural networks.
The successes of unsupervised networks in recent history have been profound. The
achievements of DQNs in a variety of Atari games (Mnih et al.) have shown the
flexibility and utility of single-entity, value-based algorithms. Furthermore, the
achievements of DeepMind’s AlphaGo Zero showed the world the capacity and potential of
adversarial unsupervised learning; specifically, its strategic innovations in the game of Go
resembled many of the features of human intelligence. As the natural next step of exploration,
one is motivated to seek a path that is both outside of the regime of the single-entity framework
and aside from the concept of an adversarial system, ergo to pursue a model of collaboration.
Given the wide breadth of this field, this paper focuses on the particular phenomenon of learning
from imperfect or incomplete knowledge sources.
To understand the problem, imagine this scenario. A young boy has an older brother who
takes to a sport early in the younger boy’s life. When the younger brother comes of age, he also
wishes to play this sport, and, as such, his older brother teaches him what he knows. Given that
the sport is complicated enough, it is fair to assume that the older brother does not have a
complete or comprehensive understanding of the sport, specifically when it comes to
determining the relatively optimal action in every possible scenario of the game. After some
finite period of teaching, the boy knows everything that the older brother knows, and then it is
possible for them to learn in concert.
This hypothetical raises a variety of interesting questions concerning the particular
dynamics of learning that occur. For instance, is it fair to assume that learning from a teacher is
more efficient than exploring on one’s own? If so, by what metric? If not, why? Does learning
from incomplete knowledge sources serve as a hindrance or an assistant in most cases? Is this
proportional to the level of knowledge that the “amateur” teacher has? The answer to each of
these questions may provide insight into new ways of maximizing the computational ability of
reinforcement learning systems and motivate the topic of this investigation.
Such an investigation into this "Big Brother" scenario has yet to be produced in the
community; however, different regimes of exploration have provided insights into questions
which may assist this investigation indirectly. For instance, significant literature exists
concerning the computational effectiveness of parallelizing systems, specifically in terms of the
most effective ways of maximizing one’s hardware. Such advances provide a framework for how
to best implement these sorts of student-teacher networks. Some progress has also been made
on the dynamics of teacher-student networks themselves. For instance, Rusu et al.
showed that policy distillation from a teacher network could yield increased robustness and
efficiency in student networks. Such investigations highlight interesting dynamics in
socially collaborative settings but do not address the specific question of this investigation.

To address this problem, this study uses the “Cartpole Problem” as the basis of training.
The goal of the task is for the neural network in question to balance a pole on a cart. The neural
net receives only the state information of the system, such as the velocity and position of the cart
and the dynamics of the pole, and must choose to apply a fixed force to either the right or the left at
every time step. This problem defines a relatively simple task: the degrees of freedom
of the system are relatively few, and the DQN framework has been shown empirically to converge on
this problem in a short amount of time. Thus, it is a well-suited task for this Big Brother
scenario, as the learning dynamics of the interactions can be clearly elucidated.
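As a concrete illustration of the task, the following is a minimal sketch of the cart-pole dynamics in plain Python. The physical constants, the failure thresholds, the 1000-step cap, and the function names `step` and `run_episode` are assumptions borrowed from the classic version of the benchmark; the simulator and settings actually used in this study are not reported.

```python
import math
import random

# Assumed constants from the classic cart-pole benchmark (not from this study).
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
POLE_HALF_LENGTH = 0.5
FORCE_MAG = 10.0                    # fixed push applied left or right
TIME_STEP = 0.02                    # seconds per simulation step
ANGLE_LIMIT = 12 * math.pi / 180    # pole counts as fallen beyond 12 degrees
POSITION_LIMIT = 2.4                # cart leaves the track beyond this


def step(state, action):
    """Advance one time step. action 0 pushes left, action 1 pushes right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    # Explicit Euler integration of positions, then velocities.
    x += TIME_STEP * x_dot
    x_dot += TIME_STEP * x_acc
    theta += TIME_STEP * theta_dot
    theta_dot += TIME_STEP * theta_acc

    done = abs(x) > POSITION_LIMIT or abs(theta) > ANGLE_LIMIT
    return (x, x_dot, theta, theta_dot), done


def run_episode(policy, max_steps=1000, seed=0):
    """Run one episode and return the number of steps the pole stayed up."""
    rng = random.Random(seed)
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    for t in range(max_steps):
        state, done = step(state, policy(state))
        if done:
            return t + 1
    return max_steps
```

Here `policy` is any function mapping the four-value state to an action of 0 or 1; a trained network's greedy action would take its place in the experiments described here.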
As mentioned briefly before, this study uses the neural network architecture of a Deep Q
Network (DQN), popularized by Mnih et al. in their well-known work on neural networks
mastering Atari games. DQNs use classical neural networks to
approximate Q-values, which estimate the value of taking each available action in a given state. Although this is
not a study of the particulars of the DQN framework, it is important to recognize why DQNs are
useful in solving this type of problem. DQNs are a particular type of a more general set of
architectures, reinforcement learning systems, which, in practice, means they learn “by
themselves” by reinforcing certain behaviors and discouraging others. Considering the pattern of
reinforcement that defines early learning in children, one could say that this best
emulates how human beings learn, even if the methods are not perfectly analogous to any
physiological process in the human body.
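For reference, the quantity a DQN is trained toward can be written in the standard form used by Mnih et al. The symbols below (discount factor $\gamma$, online parameters $\theta$, target-network parameters $\theta^{-}$) are the conventional ones and are assumptions with respect to this study, whose hyperparameters and loss are not reported.

```latex
\begin{align}
y &= r + \gamma \max_{a'} Q(s', a'; \theta^{-}) \\
L(\theta) &= \mathbb{E}_{(s, a, r, s')}\left[ \left( y - Q(s, a; \theta) \right)^{2} \right]
\end{align}
```

For a transition that ends the episode, the target reduces to $y = r$. Minimizing $L(\theta)$ reinforces actions whose observed return exceeded the network's estimate and discourages those that fell short.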
To model the "Big Brother" hypothetical discussed above, the system proceeds as
follows. First, a neural network designated as the “Big Brother” is trained on the Cartpole
problem for some finite number of training sessions. A training session is defined as 100 separate
runs of the task, during which the network is able to update in response to the final results. Afterward,
another neural network, the “Little Brother,” is trained on this same task for the same number of
training sessions as the Big Brother; however, his options for training are different. At each time
step, the Little Brother has the option of either choosing his action based on his own Q-values or
“asking” what his Big Brother would do and doing that. This “asking” ability comes in the form
of supplying the Big Brother network with the state information and returning the action chosen
by that network. The Little Brother makes this decision at random; in this study, he asks
the Big Brother with 90% probability. After this tutoring period, the
Little Brother ceases to be able to sample from the Big Brother network, and both networks train
independently. During this training, the performance of each network is tracked, and each
continues to train until it either achieves mastery of the task or reaches the maximum allowable
number of training sessions. The maximum number of allowable training sessions is 80 plus the number of
initial training sessions. Mastery of this task is defined as achieving an average of over 999
steps across 100 testing sessions.
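The decision rule and stopping criteria described above can be sketched as follows. The numeric values come from the text; the function names, the policy callables, and the choice to return a flag recording whether the Big Brother was consulted are hypothetical and serve only as illustration.

```python
import random

ASK_PROBABILITY = 0.9      # chance of deferring to the Big Brother (from the text)
RUNS_PER_SESSION = 100     # runs of the task in one training session (from the text)
EXTRA_SESSIONS = 80        # sessions allowed beyond the initial ones (from the text)
MASTERY_THRESHOLD = 999    # required average steps over the test runs (from the text)


def little_brother_action(state, own_policy, big_brother_policy, rng,
                          ask_probability=ASK_PROBABILITY):
    """Choose an action during the tutoring phase.

    Returns (action, asked), where asked records whether the Big Brother
    was consulted at this time step.
    """
    if rng.random() < ask_probability:
        return big_brother_policy(state), True
    return own_policy(state), False


def max_sessions(initial_sessions):
    """Upper bound on training sessions for a given amount of initial training."""
    return initial_sessions + EXTRA_SESSIONS


def has_mastered(test_steps):
    """test_steps holds the step counts from the 100 testing sessions."""
    return sum(test_steps) / len(test_steps) > MASTERY_THRESHOLD
```

Once the tutoring phase ends, the same loop would call `own_policy` directly, which corresponds to the Little Brother losing the ability to sample from the Big Brother.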
Hopefully it is evident how the setup described above models the phenomenon closely but
imperfectly. Reducing the Big Brother sampling decision to a fixed probability
is likely an oversimplification, but we hope that it captures the interplay between
tutelage and autonomy. Likewise, many of the parameters, such as the value of this probability,
had to be chosen by intuition rather than analysis, and this could limit the generality
of this study. However, we are encouraged to proceed despite these limitations, as the setup allows for at
least some exploration of this line of thought.
There is one parameter whose value has a canonical choice: the epsilon value in the DQN,
which is set to its maximum. Colloquially, this means that the neural networks
have a maximum tendency towards exploration. This choice was made for two reasons, one
philosophical and the other pragmatic. Philosophically, exploration is
maximized because that value best models the exploratory nature of children when they first
begin to learn a new task. This exploration is the computational equivalent of having no “fear of
failure” and appears to be a self-evident decision when modelling early learning. The
second reason is pragmatic: higher degrees of exploration have been shown empirically to reduce
convergence time. Thus, by maximizing the exploration, one can produce more results in less
time.
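To make the role of epsilon concrete, the following is a minimal sketch of epsilon-greedy action selection. It assumes that a “maximized” epsilon means a value of 1.0, at least at the start of training; whether and how epsilon decays in this study is not stated, and the function name is hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise take the highest-valued action."""
    if rng.random() < epsilon:
        # Exploration: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploitation: choose the action with the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=1.0` every action is chosen at random regardless of the Q-values, which is the sense in which the network has no “fear of failure”; with `epsilon=0.0` the network always follows its current estimates.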

Results and Analysis

In analyzing the results of this experiment, it is important to clarify the motivation behind
the analysis. In the opinion of this author, the most important aim is
to clarify how the learning dynamics of the Big Brother/Little Brother system affect the
computational efficiency and robustness of convergence. To capture this, the analysis
is derived from graphs plotting the average number of steps over the course of 100 trials
against the “relative” training trials. The term “relative” implies a certain conditionality to this
measure: in order to fairly assess the computational efficiency of the teaching from the Big
Brother, the relative training trials for the Little Brother account for the number of training
sessions that the Big Brother went through. Essentially, this corresponds to shifting the graph of
the Little Brother to the right by the number of initial training trials for the Big Brother. The
effect of this is to allow the graph comparing the learning curves of both entities to adequately
describe the utility of the Little Brother learning from his Big Brother.
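The shift described above amounts to offsetting the horizontal coordinate of each point on the Little Brother's curve. A minimal sketch, with a hypothetical function name and made-up scores used purely as placeholders:

```python
def shift_curve(avg_steps, initial_sessions):
    """Pair each Little Brother score with its relative training index.

    avg_steps[i] is the average number of steps at the i-th recorded point;
    the index is moved right by the Big Brother's initial training.
    """
    return [(initial_sessions + i, score) for i, score in enumerate(avg_steps)]


# Placeholder values, not data from this study.
shifted = shift_curve([10.0, 20.0, 40.0], initial_sessions=5)
```

The Big Brother's curve is plotted without any offset, so the horizontal gap between the two curves reflects the training the teacher needed before the Little Brother could begin.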
The charts for initial training sessions of 5 to 35 are shown below respectively. In
analyzing them, three distinct regimes of behavior are evident for the Little Brother:
comparable/robust, lagging/robust, and simply lagging.

Comparable and Robust
The trials with 5-15 initial training sessions display what is described as comparable and
robust behavior. To obtain this label, the average trajectory of the Little Brother must evolve on
similar timescales to that of the Big Brother and be more likely to converge by the termination point.
The timescale similarity is evident in these graphs from the similar slopes early in the learning
process, while the convergence is apparent in the superior average steps achieved by the
termination point.
Lagging and Robust
The trials with 20-25 initial training sessions exhibit what is defined as lagging yet robust behavior in the Little
Brother. This behavior is defined by a significant delay in reaching the final average
behavioral pattern (the point after which further learning ceases to improve performance), combined with
superior performance once that pattern is reached.
Simply Lagging
Trials with more than 25 initial training sessions exhibit simply lagging behavior in the Little
Brother: the Big and Little Brother follow very similar learning trajectories, but the
Little Brother, in terms of relative training, takes far more time.
In order to appreciate these results, one must consider the effect of the initial training
trials. We note distinct patterns of failure for the Big Brother specifically in the
comparable/robust and lagging/robust regimes, which are likely to be functions of limited
experience in the state space. Essentially, the Big Brother has not yet seen much of the problem
and thus fails often in his initial confrontations with it. This behavior ceases to occur after 30
initial training trials, and thus we conclude that there is some average limit to the breadth of
initial state space exploration needed to converge in the allotted time. If the entire trajectory were
plotted, including the data from the initial training, then the Big Brother would be seen to learn in an
alternating series of plateaus and positive slopes.
With this in mind, one can fully consider the utility of training one neural network with
the ability to sample from an incomplete teacher. First, we consider the best performing regime:
comparable and robust. In this regime, we found that we could achieve significantly faster
convergence on similar timescales in terms of relative training sessions. Essentially, the Little
Brother does not experience the same plateau in learning as the Big Brother, and we assert this is
a product of the teaching. Because the Big Brother has seen a significant portion of the initial
state space (at least in comparison to no experience), the ability to sample from him allows the
Little Brother to avoid the initial pitfalls of blind exploration. This is analogous to real life,
because, if we consider the hypothetical posed, the Big Brother has been through the experience
of learning from scratch and will have the Little Brother learn from his mistakes. An interesting
question is whether this is an artifact of the size of the initial state space, though this will be
addressed in the discussion section.
Though the utility of the lagging/robust and simply lagging regimes is less than that of the
comparable/robust regime, it is interesting to probe them in order to understand the effect of increased
duration of initial trials. If one studies the graphs of all initial training trials, a pattern of
displacement in the learning trajectory becomes evident. Essentially, as the initial training
increases, the steep positive slope of the Little Brother’s learning trajectory is shifted to the right.
In the lagging/robust case, the negative impact of this lag is offset by an
increased rate of convergence. In the simply lagging case, there is enough exploration time in
the initial training session for the Big Brother to converge with similar success rates as the Little
Brother and do so more efficiently. Both of these regimes support two conclusions. First, in the
case of the Little Brother, the increased initial training does not directly impact the rate of growth
(i.e., it does not change the slope); however, it does shift the starting point of the steep positive
slope towards the right. This is evident from the extremely similar learning trajectories in the
lagging/robust and simply lagging cases, distinguishable only by when the rapid increase in
performance occurs. Second, the initial training does have a palpable impact on the stability of
convergence for the Little Brother, and the utility of this impact is greatest in the regime of
relatively low initial training sessions.

In reflecting on the achievements of this study, it is important to revisit the questions that
were raised in the beginning.
Is it fair to assume that learning from a teacher is more efficient than exploring on one’s
own? If so, by what metric? If not, why?
We define efficient to mean greater rates of convergence in less time. Thus,
according to the data in this study, it is more efficient to learn from a teacher than to explore on
one’s own when that learning is focused on “early” behavior. This is evident in the convergence
of the learning trajectory for the Little Brother at earlier periods than the Big Brother in the
comparable/robust and lagging/robust regimes. However, as the initial training sessions increase,
one achieves similar convergence patterns in the Little Brother at a greater cost in training time.
Does learning from incomplete knowledge sources serve as a hindrance or an assistant in
most cases? Is this proportional to the level of knowledge that the “amateur” teacher has?
The presence of a teacher never seems to serve as a hindrance; however, the utility of the
teacher declines as the number of initial training sessions increases. We hypothesize that the
teacher is most useful for avoiding “amateur” mistakes; once these early
mistakes are avoided, the Little Brother follows an otherwise standard learning trajectory.
The greatest limitation of this study was the availability of data. Training the dual neural
networks took a significant amount of time, and as such, we were able to capture relatively few trials
from which to draw conclusions. In order to strengthen the validity and generality of this study, we will replicate
the experiments and increase the total number of trials.
Further Extensions
In many ways, this experiment has generated more questions than answers. For instance,
how did the specifics of the sampling probabilities affect the behavior of this system? To address
this, one avenue of study would be to test different “personalities” of Little Brother, essentially
meaning that one changes the probability that the Little Brother will sample from the Big
Brother; for example, if the Little Brother is “obedient” he will sample often, but if he is
“rebellious,” he may sample relatively sporadically. In doing so, a fuller picture of the efficiency
metric could develop and would better elucidate the effects of the incomplete teacher. Another
question is whether there is some analytical means of determining the optimal initial training
length that maximizes robustness and speed in relation to the size of the state space. Such
endeavors would be far more mathematical in nature but would be an interesting investigation into
the relationship between task complexity and performance. A final potential direction could be
the use of a different sort of sampling method, such as the policy distillation methods discussed
at the beginning of this study. Such methods may increase the efficiency of the initial training for
the Little Brother, and in doing so augment the utility of this method even further.

Investigations such as this work to challenge the methodologies of human beings in an
analytical way, with hopes that such challenges will bear the fruit of more effective systems. In
addition, there is the added corollary that Mother Nature has encouraged the development of
efficient systems for survival, and thus it makes intuitive sense that as we attempt to develop our
own learning systems, we take inspiration from her. The utility of such action is evident here –
the learning dynamics of this coupled system led to greater efficiency and robustness than a
singular learning entity. In general, we find this utility present and as such are encouraged to
continually take lessons from the empirical dynamics of the human experience and augment our
systems accordingly.