
Deep Learning for Finance

A Perspective on AI, Machine Learning and Deep Learning


Ayush Sagar
ayush@cs.virginia.edu
September 12, 2016
The goal of Artificial Intelligence (AI) is to solve problems by combining the intellectual
abilities of the human brain with the speed and efficiency of machines. Replicating these
abilities in a machine with the same fidelity is challenging because of the complexity of
the underlying computations in the brain. Researchers have replicated these abilities on
machines with limited success by writing programs to solve problems, constrained by the
limits of their own intuition.
Cognitive functions such as language, vision, and understanding are a product of
complex learning mechanisms in the brain. Machine learning tries to perform these
functions by learning from data. However, traditional machine learning models did not
perform satisfactorily on these tasks until a recent breakthrough was made by a machine
learning paradigm called deep learning. Deep learning is inspired by the design of the
visual perception process in the brain, discovered by the Nobel Laureates Hubel and
Wiesel in 1959.
This perception model was translated into a deep (layered) computational model called
the artificial neural network with the hope of solving AI problems, but it drew criticism in
the 1980s because the models that could be trained with the limited computing power of
the time were ineffective. Deep models made a comeback in 2006 in the form of deep
belief networks, which demonstrated breakthrough performance and revived interest.
With this revived interest, unprecedented availability of data and computing that is
millions of times faster thanks to modern GPUs, deep learning has been making
continuous breakthroughs since 2006, outperforming traditional machine learning
algorithms and even humans on certain cognitive tasks.
Deep learning is creating new AI capabilities that are driving new business models and
present an untapped opportunity for entrepreneurs and existing businesses. This report
provides a holistic coverage of the key ideas and substantiates deep learning's advantage
over traditional machine learning by enhancing a financial model based on classical
machine learning published by Bloomberg L.P. in a 2016 press release.

Contents
1. The Pursuit of Artificial Intelligence (AI)
2. Computing Advances and AI
3. The Forefront of AI
4. Deep Learning
5. A Note on Adversarial Machine Learning
6. Some Applications in Finance
7. Promising Future Technologies
Acknowledgement
References

1. The Pursuit of Artificial Intelligence (AI)


The hallmark of human evolution is the expansion of the brain in areas responsible for
perception, understanding and consciousness. It gives us an edge over other species
in interacting with nature to sustain life functions. Consider some of the
breakthroughs we have made in this pursuit:
Date            Invention
2,000,000 BC    Stone tools
400,000 BC      Use of fire
10,000 BC       Agriculture
5,000 BC        Metalworking
4,000 BC        Writing
3,000 BC        Cities
3,000 BC        The wheel
1440            Printing
1765            Steam engines
1800            Electricity
1879            The light bulb
1885            Automobile
1903            Airplanes
1926            Television
1928            Penicillin
1944            Electronic computer
1951            Computer plays checkers (AI)
1961            Space travel
1979            Wireless phone
1981            Personal computers
1983            The Internet
2000            Mobile computing

Two observations can be made:

1. Inventions and discoveries are being made at an exponentially increasing rate
with respect to time. Consistent with this observation is the exponentially rising
trend of patent applications reported by WIPO. The exponential rise in both can
be explained by the fact that every invention facilitates new discoveries and
inventions. An interesting digression is to ask ourselves: will this trend continue?
2. As society evolves through these breakthroughs, the functionality we desire
evolves with it. With the progression of time, these breakthroughs are solving
more complex problems.
The first observation challenges entrepreneurs and businesses more than ever to stay
abreast of new technologies to find and maintain their place in the market amidst
frequent technological disruptions. The second observation suggests that technology
is heading towards intelligent machines. This motivates a discussion of AI.

Science or Fiction?
An artificially intelligent machine exhibits intelligent behavior. In computer science,
the Turing test is a commonly used criterion for intelligent behavior. The test says that
a machine's behavior is considered intelligent if a blinded human evaluator is unable
to distinguish its performance from that of a human. State-of-the-art AI applications
successfully pass the Turing test when this definition is applied to a specific task.
Artificial General Intelligence (also known as strong AI or full AI) refers to hypothetical
machines that can think like humans and act with full autonomy. These would
pass the Turing test without the need to constrain its definition to a specific task.
However, it does not appear achievable in the foreseeable future with current
technologies. It is still in its infancy and currently caters to the interest of researchers,
science fiction writers and futurists.

2. Computing Advances and AI


AI is computation intensive and the modern algorithms are data intensive. To
understand the factors underlying the rapid recent growth in AI, it is important to take
a look at relevant computing trends.

Computation Capacity Trend


While we have historically performed computations on mechanical assemblies,
pneumatics and electrical circuits, the invention of the electronic transistor in the 20th
century led to microprocessor technology: a far more scalable computation device.
Transistors implement logical operations in digital circuits, and the semiconductor
industry has been packing more of them together at an exponential rate, a trend
famously called Moore's law, shown below [1]:

This trend appears exciting, but its extrapolation into the shaded area might not be
realistic. The industry is facing greater challenges than ever in sustaining growth at
this rate. So far, the industry has relied on technology scaling, i.e. miniaturization of
transistors. However, in the early 2010s we reached a point where quantum effects and
the wavelength of the lithography light source began limiting the practical extent of
miniaturization. Quantum effects result in uncertain electrical charge distribution when
separation structures are made too small. The wavelength limitation causes diffraction
during photolithography, making photolithography masks ineffective at smaller feature
sizes. This has been mitigated to an extent by techniques such as immersion
lithography and optical proximity correction. To keep up with the trend, the
microprocessor industry is currently using and developing alternative manufacturing
techniques such as multiple patterning, 3D microfabrication and EUV lithography.
While Moore's law appears threatened, there is great potential for improving the
use of available transistors with computer architecture optimizations. Since the first
microprocessor, computing performance improved primarily by scaling up the clock
speed. But in the early 2000s it was realized that, apart from thermal issues, there is
a fundamental limit that prevents going much beyond a 5 GHz clock rate. The limit
arises because the chip has to be much smaller than the wavelength of the clock signal
for the signal to be seen identically across different parts of the chip at a given time.
The industry instead focused on multiple processing cores with shared cache to
leverage the parallel computing paradigm.
As the parallel computing paradigm became more popular, General-Purpose GPU (GPGPU)
computing started receiving more attention, leading to its own development. Graphics
Processing Units (GPUs) have a high level of parallelism inherent in their design as an
optimization for the linear algebra operations that are typical in graphics processing.
Machine learning algorithms also use linear algebra for their underlying computations and
can achieve several orders of magnitude of speed-up from GPU parallelism.
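As a rough illustration of this speed-up, the sketch below times the same large matrix multiplication on the CPU and on a GPU using PyTorch. It is only a hedged example: it assumes PyTorch is installed and a CUDA-capable GPU is present, and the measured ratio will vary widely with hardware.

```python
import time
import torch

def time_matmul(a, b, n_runs=10):
    """Return the average wall-clock time of a @ b over n_runs."""
    _ = a @ b                       # warm-up so one-off setup costs are not measured
    if a.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        _ = a @ b
    if a.is_cuda:
        torch.cuda.synchronize()    # wait for the asynchronous GPU kernels to finish
    return (time.time() - start) / n_runs

n = 4096
a_cpu = torch.randn(n, n)
b_cpu = torch.randn(n, n)
print("CPU:", time_matmul(a_cpu, b_cpu))

if torch.cuda.is_available():
    a_gpu, b_gpu = a_cpu.cuda(), b_cpu.cuda()
    print("GPU:", time_matmul(a_gpu, b_gpu))
```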
Parallel and distributed computing has been a big leap, resulting in a new level of
scalability in computation capacity. The idea behind both is to identify independent
sub-problems and solve them simultaneously across local and remote processing units.
Modern AI algorithms process high volumes of data and benefit greatly from distributed
and parallel computing architectures.
On the software side, a new class of algorithms called communication-avoiding
algorithms, most applicable to large-scale computing, is being developed. These
algorithms rearrange sub-problems in a way that minimizes the latency and energy
associated with data transfer. Since the time and energy spent on data transfer can be
several orders of magnitude larger than that spent on the actual computation, there lies
enormous potential for speed-up and energy reduction in adopting these algorithms.
President Obama cited communication-avoiding algorithms in the FY 2012 Department
of Energy budget request to Congress [2].

Clearly, computer scientists and engineers have been dealing with difficulties
creatively. The momentum of advancements, it seems, will continue to support
growing computation requirements of AI development for the foreseeable future.

Digital Information Trend

As illustrated above [3], with the growing use of digital technology and Internet
connectivity, the amount of electronic information available to humankind is growing on
a trend similar to Moore's law. Starting with innovations in web search in the late 1990s,
the science of storing and processing large-scale data has been rapidly evolving under
the term Big Data.
The recent Internet of Things (IoT) approach to product design is taking data collection
a step further. In this approach, products are connected to a cloud-hosted backend
through the Internet with the motive of increasing the reach of businesses to consumers.
Businesses can provide new services while collecting usage data for adapting their
services to consumer behavior. Smartphone apps have been doing the same by
delivering functionality through interactive interfaces instead of physical products.
The phenomenon of massive data growth is enabled by advances in storage media.
Throughout most of computing history, we have stored data on magnetic hard disk
drives (HDDs). The amount of storage available at a given price point has been rising
exponentially, again similar to Moore's law. By the late 2000s, solid-state drives (SSDs),
a Flash-memory-based storage technology, became a serious contender to the magnetic
hard disk market. SSDs are being widely adopted by consumers and data centers because
they not only provide a performance improvement but also reduce energy, cooling and
space requirements [4]. The performance benefits of SSDs are enabling more low-latency
and high-throughput data processing applications.
Another promising storage technology, called Phase Change Memory (PCM), has been
under development since the 1970s and was commercially introduced in 2015 by Intel
and Micron under the 3D XPoint trademark. The engineering samples released in 2016
showed a 2.4 to 3 times speedup compared to a modern SSD [5]. PCM not only packs
more storage, but could offer a new level of performance scalability. At a certain point,
it could become possible to unify main memory and storage memory in computers,
resulting in computers that persist state in the absence of power. Among other benefits,
this could yield large energy savings for cloud infrastructures.
The implication of these continued advances in data storage is that AI algorithms are
being exposed to data about human expression and processes at an increasing rate.

Cloud Computing: AI-as-a-Service


The confluence of data flywheels, the algorithm economy, and cloud-hosted intelligence
means every company can now be a data company, every company can now access
algorithmic intelligence, and every app can now be an intelligent app.

Matt Kiser, Algorithmia


Cloud computing services offer on-demand access to applications, data and
computation platforms over the Internet through a programming interface (API). They
allow businesses to avoid upfront infrastructure costs and reduce the barrier to
implementation. The availability of AI as a service over the cloud is accelerating AI
adoption by enabling businesses to borrow state-of-the-art AI capabilities without
worrying about challenging computation, storage and algorithm design requirements.
As an example, delivery and ride-booking smartphone apps use the optimal routing
capabilities of Google Maps. Nervana Systems, recently acquired by Intel [6], is one
startup that provides deep learning based AI as a service.

3. The Forefront of AI
AI is "the study of how to make computers do things at which, at the moment,
humans are better" [7]. As machines become increasingly capable, facilities once
thought to require intelligence are removed from the definition. For example, optical
character recognition is no longer perceived as an exemplar of "artificial intelligence",
having become a routine technology. [8]

Intelligent Agents
AI literature frequently uses the term intelligent agent. An intelligent agent is
an abstract entity that acts on a human's behalf to maximally achieve a given goal
at minimum cost. Cognitive tasks such as planning, prediction, pattern or anomaly
detection, visual recognition and natural language processing can all be goals for
an intelligent agent.

Why should intelligent agents learn from data?


An intelligent agent can be an explicitly programmed algorithm that solves a task. Such
systems are called expert systems because the knowledge about the world that the
agent interacts with is programmed by human experts.
Expert systems work well up to a certain level of complexity. In fact, for most of the 60
years of computing history, the main emphasis of AI was on writing explicit programs to
perform functions. However, this approach could not scale, because AI designers faced
the following problems as they attempted more complex problems [9]:
1. It is difficult to anticipate and program an agent to respond to all possible
conditions. Strategies hard-coded by a programmer would be biased by their
limited understanding of the problem and can easily fail under unanticipated
conditions.
2. All changes over time to the world that the agent interacts with cannot be
anticipated. The strategies learnt may need to evolve over time.
3. If a problem is complex enough, writing a program may not even be possible.
These problems were addressed to a large extent by using a data-driven approach
called machine learning.

Learning from Data (Classical Machine Learning)


Machine learning happens when a machine learns to accomplish a task without being
explicitly programmed. The learning algorithm either improves from feedback on its own
attempts at the task or learns from previously labelled examples; learning from labelled
examples is called supervised learning. In some cases, a machine can learn on its own by
discovering structure in the observed data; this is called unsupervised learning.
It powers many aspects of modern society: from web searches to content filtering on
social networks to recommendations on e-commerce websites, and it is increasingly
present in consumer products such as cameras and smartphones. [10] In finance, it
is used for predicting risks and opportunities in various contexts. It is also used for
detecting fraud or anomaly in operations at large scale.
To get an intuitive understanding of how machine learning works, let's consider a toy
problem. We design an intelligent agent that predicts the house price at a given location
for a specified land area. The agent could do the following:
Step 1 - Acquire training examples for the given location.
House id    Land Area (sq. ft.)    Price ($)
1           1000                   150,200
2           2000                   225,500
3           3000                   451,800
4           4000                   684,500

Same data is plotted below for our convenience:


[Figure: scatter plot of price ($) versus land area (sq. ft.) for the four training examples.]

Step 2 - Learn from data: There are many model frameworks to choose from. A
model is picked by the designer based on their judgement of its suitability to the
problem; that effectively introduces a prior. In this case, we use a linear model because
we know that housing prices increase roughly linearly with land area:

price = θ × area

This simple equation says that the house price is θ times the land area. It may be noted
that the equation represents a line whose slope is determined by θ.
Now a training algorithm¹ systematically picks a value for θ such that the equation is
maximally consistent with the data. Intuitively, it is fitting the line to the data. In this
example, the algorithm could choose a value of 150.

Step 3 - Making a prediction: On assigning θ = 150, the equation becomes a prediction
model. For example, to predict the price of, say, a 2,500 sq. ft. house, we apply the
model as follows:

price = 150 × area
      = 150 × 2,500
      = $375,000

Therefore, the model predicts that the price of a 2,500 sq. ft. house is $375,000, which
is consistent with the data.

¹ Training algorithms are not discussed here for simplicity.
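Although training algorithms are not covered here, the following minimal sketch shows how the slope θ could be fitted to the four training examples with an ordinary least-squares solve in NumPy. The data values are taken from the table above; the least-squares value comes out near, but not exactly at, the 150 used in the illustration.

```python
import numpy as np

# Training examples from the table: land area (sq. ft.) and price ($).
area = np.array([1000.0, 2000.0, 3000.0, 4000.0])
price = np.array([150_200.0, 225_500.0, 451_800.0, 684_500.0])

# Fit price = theta * area by least squares (a single-parameter linear model).
X = area.reshape(-1, 1)                  # design matrix with one column
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
theta = theta[0]
print(f"fitted theta = {theta:.1f}")     # close to 150 (about 156 for this data)

# Predict the price of a 2,500 sq. ft. house with the fitted model.
print(f"predicted price for 2500 sq. ft.: ${theta * 2500:,.0f}")
```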



Calculating the model parameters algorithmically and systematically is, in this case, the
essence of machine learning. It might appear trivial in this illustration, but in real
problems it is challenging for the following reasons:
1) This problem had just one input column (area), four training examples and one
model parameter. Real problems have many more input columns and model
parameters, and modelling requires more training data. Randomly or sequentially
guessing the parameter values becomes infeasible as the number of model
parameters increases. Optimization algorithms are employed to do this systematically.
2) Machine learning assumes that the model obtained from training examples, will
generalize for unforeseen data. The assumption becomes more reliable with
more training examples.
3) To assess this assumption quantitatively, we measure the model's prediction error
on unforeseen data. We simulate unforeseen data by setting aside a small portion
of the training examples and calling it the test set. We train on the reduced training
set and run predictions on the test set. To measure error, we compare the predictions
with the pre-existing values in the test set.
4) We did not require a feature engineering step after step 1. This is discussed later
under its own heading.
5) Linear regression was used here as a simple model to build intuition. There are many
other models, such as SVMs, Naïve Bayes, KNN and neural nets, which can capture
more complicated patterns.
6) An important tradeoff in machine learning is the bias vs. variance tradeoff:
a. When reducing bias, we make the model fit the training data closely. In the
example, we could have used polynomial regression to create a curve that
passes through every point. The problem is that the model then fits the noise
and outliers as well, which increases variance. A high-variance model does not
generalize well to data outside the training set.
b. When reducing variance, we use simple models that do not capture fine
details, such as the fitted line in the example. This leaves out noise and
outliers to an extent, but it increases bias, and an overly biased model also
produces higher error rates.

The challenge is to tune the model so that only the necessary detail is captured.
Ensemble methods and regularization are widely used techniques for addressing this
problem, as sketched below.
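To make the evaluation and regularization ideas above concrete, here is a hedged sketch using scikit-learn on synthetic data (the dataset, the polynomial degree and the ridge penalty are illustrative assumptions, not taken from this report): it holds out a test set, fits an unregularized high-degree polynomial model and a ridge-regularized one, and compares their test errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a roughly linear relationship plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(60, 1))
y = 3.0 * x.ravel() + rng.normal(scale=4.0, size=60)

# Simulate "unforeseen" data with a held-out test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# A high-degree polynomial fit tends to chase noise (overfitting) ...
overfit = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
overfit.fit(x_train, y_train)

# ... while ridge regularization penalizes extreme coefficients.
regularized = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=10.0))
regularized.fit(x_train, y_train)

for name, model in [("unregularized", overfit), ("ridge", regularized)]:
    err = mean_squared_error(y_test, model.predict(x_test))
    print(f"{name}: test MSE = {err:.1f}")
```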

The Machine Learner vs Statistician Debate


Machine learning engineers and statisticians sometimes take different approaches to
a modelling problem. A statistician is trained to understand the data more deeply and
can make reasonable assumptions about it. With these assumptions, the resulting models
are less prone to noisy data and can be better trusted for real-world application. A
machine learner's approach is oriented towards scalability: it lets the computer derive
assumptions about the data. Its effectiveness relies heavily on the quality and quantity
of the training data. A machine learner has the unique advantage in the financial industry
that financial records are accurate; given that, the assumptions computationally derived
in machine learning should be accurate as well. However, even accurately recorded data
can be biased, either by adversaries or by its limited availability. This is where the
statistical approach becomes important. But then, statisticians' assumptions can be
narrow, outdated or prone to human error. And again, machine learning scales better
when relationships are subtle or when the number of attributes in a training example is large.

Feature Engineering
From the discussion of the machine learner vs. statistician approaches, it seems both
have their strengths and weaknesses. In such situations engineers ask the golden
question: why not both?
One way to combine the best of both is to use the statistical approach and domain
expertise to understand the properties of the data and transform it into a representation
that augments machine learning. This process is called feature engineering. However,
it is manual, expensive and can be ineffective if the problem is complex enough. As we
will see later in the discussion of deep learning, there is an automated approach to
generating representations.

A Feature Engineering Demonstration


Problem Statement: In a February 2016 press release [11], Bloomberg L.P. published a
machine learning model addressing the following question:
Will a company X beat analysts' estimates of its quarterly earnings?


Wall Street analysts' consensus earnings estimates are used by the market to judge the
stock performance of a company. "Investors seek a sound estimate of this year's and
next year's earnings per share (EPS), as well as a strong sense of how much the
company will earn even farther down the road" [12]. The approach published by
Bloomberg is as follows:
Step 1: Acquire data
As always, we start by acquiring a dataset containing signals that could indicate the
outcome. They acquired the following data for 39 tickers:
1) Daily stock data (OHLCV) for 2000-2014 from Yahoo! Finance
2) The corresponding actual and estimated earnings from Estimize and Zacks
Investment Research, respectively.
From these two, a combined dataset was prepared as shown below:

The screenshot above shows partial contents of one of the 39 combined dataset files
(one per ticker). What's obtained here is time-series data.
Step 2: Feature Engineering
They aggregated rows for each quarter and calculated the following features:


Feature Name      Description
yr                Year, as is.
qtr               Quarter, as is.
up_day            The number and sum of up-days in the quarter; if the ratio of the
                  sum to the total number is > 50%, the feature is set to 1, else 0.
p_over_20         Price above the 20-day moving average more than half the time.
p_over_10_ema     Price above the 10-day exponential moving average more than half
                  the time.
p_mom_1           Price went up more than 50% of the time.
v_mom_1           Volume went up more than 50% of the time.
target            1 if the company beats the consensus estimate, 0 otherwise.

The feature-engineered dataset corresponding to the screenshot above looks like the
following:

Note that deriving these features required human expertise. This feature-engineered
dataset is an abstract representation: significantly reduced in size, losing most of the
information but retaining only what the domain expert assumed to be important.
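A hedged sketch of this kind of quarterly aggregation is shown below using pandas. The column names (date, close, volume, actual_eps, estimate_eps) and the exact feature definitions are illustrative assumptions and are not the code Bloomberg used; each per-ticker daily file would be passed through a function like this to produce one row per quarter.

```python
import pandas as pd

def quarterly_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily rows into one feature row per quarter.

    Assumes columns: date (datetime), close, volume, actual_eps, estimate_eps.
    """
    df = df.sort_values("date").copy()
    df["up_day"] = (df["close"].diff() > 0).astype(int)
    df["vol_up"] = (df["volume"].diff() > 0).astype(int)
    df["p_over_20"] = (df["close"] > df["close"].rolling(20).mean()).astype(int)
    df["p_over_10_ema"] = (df["close"] > df["close"].ewm(span=10).mean()).astype(int)

    grouped = df.groupby(df["date"].dt.to_period("Q"))
    features = grouped.agg(
        up_day=("up_day", lambda s: int(s.mean() > 0.5)),
        p_over_20=("p_over_20", lambda s: int(s.mean() > 0.5)),
        p_over_10_ema=("p_over_10_ema", lambda s: int(s.mean() > 0.5)),
        # up_day and p_mom_1 are near-duplicates in this sketch; the original
        # feature definitions differ in detail.
        p_mom_1=("up_day", lambda s: int(s.mean() > 0.5)),
        v_mom_1=("vol_up", lambda s: int(s.mean() > 0.5)),
        actual=("actual_eps", "last"),
        estimate=("estimate_eps", "last"),
    )
    features["target"] = (features["actual"] > features["estimate"]).astype(int)
    return features.drop(columns=["actual", "estimate"])
```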
Step 3: Learn from Data
For this classification problem they applied logistic regression, decision trees and a
random forest (an ensemble of varied decision trees).
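The sketch below, again hedged (the feature matrix X and labels y are random stand-ins rather than the original data), shows how the three classifiers and the 60/40 split described in Steps 3 and 4 could be wired together with scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-ins for the feature-engineered dataset:
# X has one row per (ticker, quarter); y is the "beat estimate" target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 7))
y = rng.integers(0, 2, size=2000)

# 60% training / 40% test split, as in the Bloomberg write-up.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.2f} "
          f"precision={precision_score(y_test, pred):.2f} "
          f"recall={recall_score(y_test, pred):.2f} "
          f"f1={f1_score(y_test, pred):.2f}")
```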

Step 4: Make predictions
They simulated unforeseen data by splitting the available data into a 60% training set
and a 40% test set. On running the predictions they obtained the following results:

[Table: confusion matrices (actual vs. predicted counts) for each model, summarized by
the metrics below.]

                  Logistic Regression    Decision Tree    Random Forest
Accuracy (%)            65.71                62.2              68.03
Precision (%)           54                   54                61
Recall (%)              66                   62                68
F1 (%)                  53                   53                56

Conclusion: Quoting the author: "It's a work in progress, but the best model had a
recall of 68% and precision 61%, which is above the 50% mark that is equivalent to
randomly guessing. The models built can be improved by including more stocks and
getting data over a longer period of time, while adding parameter search and cross
validation to the process."
In the next topic, we will attempt to improve this model by using a deep learning
approach.


4. Deep Learning
In the discussion of feature engineering in the previous topic, the importance of data
representation was emphasized. Deep learning is a machine learning paradigm that
learns multiple levels of data representation, where each level of representation is
more abstract than the previous one. It has "dramatically improved the state-of-the-art
in speech recognition, visual object recognition, object detection and many other
domains such as drug discovery and genomics" [10]. Deep learning is also applicable
in finance wherever improved machine learning performance can be an advantage.
It began in the 1950s, when the Nobel Laureates Hubel and Wiesel accidentally noticed
neuron activity in the visual cortex of a cat as they moved a bright line across its
retina. During these recordings, they made interesting observations: (1) the neurons
fired only when the line was in a specific place on the retina, (2) the activity of these
neurons changed with the orientation of the line, and (3) sometimes the neurons fired
only when the line was moving in a particular direction. [13] With a series of experiments
they found that there is a hierarchy of pattern filters, with increasing levels of
abstraction, across the visual cortex. This eventually revealed the process of visual
perception in the brain. A simplified form of this model is illustrated below.

The image is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
It is attributed to Randall C. O'Reilly and original work can be found at [14]

Deep learning borrows two important aspects of the visual perception model:
1) Representation Learning Along Depth: As illustrated in the figure above, the
first set of neuron layers, V1, learns elementary features from the raw image coming
from the retina. The second set, V2, learns a more abstract representation of the
features generated by V1. As the model progresses through its layers, more abstract
concepts are learnt. This is called representation learning, and this depth is the reason
deep learning is called so.
Representation learning was demonstrated in 1986 by Geoffrey Hinton's lab using the
backpropagation training algorithm [15] for artificial neural networks (ANNs). The ANN
was the first computational model inspired by the visual perception model, developed
with the hope of solving AI problems.
2) Distributed Representation Learning: This is an old concept in machine
learning, but it was first demonstrated on an unsupervised deep model in 2006
[16]. In a distributed representation, "the input to a layer is represented by a set
of features that are not mutually exclusive, and might even be statistically
independent" [17]. This form of learning happens unsupervised, just like in the
visual perception model above.
It can be seen intuitively in the illustration above. The representations of the eye,
ear and head outline in V4 are composed by sharing features from V2. Similarly, in
IT posterior, the representations of different people's faces are composed of shared
features from V4, and so on.
This not only improves the efficiency of the representation; by finding general
components, the model can also generalize better. For example, to learn a new
person's face, V1, V2 and V4 do not have to go through the learning process again.
A new face can be learnt in IT posterior from a composition of the features generated
by V4. Language models benefit especially from distributed representation learning:
it elegantly solves the problem of retraining the model every time a new word is
introduced into the dictionary.
State-of-the-art deep models combine both types of learning. First, the model is
trained unsupervised to learn distributed representations from the data. Second, the
model is fine-tuned by supervised learning (by providing outcome labels). The first
step augments the second by creating general features whose composition can be
learnt in the deeper layers during the second step. This not only makes training
faster, it also improves performance, because the model can generalize better from
the features learnt during the first step.


A Brief History
Since 1943, many ANN designs have been published, but the 1986 model that used the
backpropagation training algorithm was the first ANN model that deep learning
borrows from. Another model, an unsupervised learning model called the
Neocognitron [18], was published in 1980. It laid the foundation for the now widely
used deep model called the Convolutional Neural Network (CNN), which Yann LeCun et
al. trained in 1989 using backpropagation for handwritten digit recognition in US postal
mail. [19] CNNs are more efficient for image recognition because they take advantage of
the spatial properties of image data.
When backpropagation was first introduced, its most exciting use was for training
recurrent neural networks (RNNs). [20] RNNs are suitable for speech, language and
other sequential data. RNNs "process an input sequence one element at a time,
maintaining in their hidden units a state vector that implicitly contains information
about the history of all the past elements of the sequence" [10].
Researchers had difficulty training RNNs during the 1990s due to the vanishing
gradient problem, which worsens with the depth of the recursion. The problem arises
when the weights (the model parameters being learnt) are small: the repeated
multiplications performed during training drive the gradient towards zero. [21]
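A tiny numerical sketch of the effect: the error signal carried back through T time steps is roughly a product of T per-step factors, and when those factors are below 1 the product shrinks towards zero.

```python
# Product of per-step factors over many time steps (illustrative values).
for factor in (0.9, 0.5):
    for steps in (10, 50, 100):
        print(f"factor={factor}, steps={steps}: {factor ** steps:.2e}")
# factor=0.9 over 100 steps already gives ~2.7e-05; factor=0.5 gives ~7.9e-31,
# far too small to drive any learning of such a long-range dependency.
```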
For RNNs, one workaround was to use a history compression method proposed by
Jürgen Schmidhuber in 1992. [22] Another was to use a gating mechanism that can
retain the cell state indefinitely if required. These models were called Long Short-Term
Memory (LSTM). [23]
A CNN-based model broke image classification error records in the 2012 ImageNet
competition, another major breakthrough that attracted significant research
interest. [24] Google Trends data shows how interest in deep learning has been growing
among the general public since 2012, pulling machine learning along with it.


A Deep Learning Demonstration


This demonstration applies an LSTM, the state-of-the-art model for sequence learning,
to the example previously described in the feature engineering section of the previous
topic. The task was to predict whether a company's quarterly earnings would beat
consensus estimates.
In step 1, Bloomberg acquired data and formed a combined dataset. In step 2, domain
expertise was applied to perform feature engineering. In this approach, step 2 is replaced
by a proposed LSTM model, which learns the data representation autonomously. The
LSTM model results are juxtaposed with their best performing model:
LSTM confusion matrix (counts; the Random Forest figures repeat those of the previous
section):

                Pred. No    Pred. Yes
Actual No       2167        16759
Actual Yes      969         37424

                  LSTM     Random Forest
Accuracy (%)      69.07        68.03
Precision (%)     68.07        61
Recall (%)        97.48        68
F1 (%)            80.85        56

Though a comparable result would have sufficed to make the point, higher performance
was obtained. Notice that the confusion matrix has larger cell counts because the time
series was not aggregated by quarter as in the feature engineering approach.
Before describing the LSTM model used here, we first describe how an RNN works in
general and how the LSTM cell improves on it.


Source: [25]

An RNN is a specialized neural network architecture that learns patterns in sequential
data. When applied to the time domain, it models dynamic systems. The left side of the
figure above shows that the RNN cell A is a function of the input sample x_t at time t
and of the cell state at the previous time step. The recursion is shown unfolded on the
right-hand side of the figure. h_t is the output passed to the next (hidden) layer and is
a non-linear function of the cell state. This means that the output of an RNN is not just
a function of its current input but of the input together with the cell's history, allowing
it to respond to a trend rather than just the absolute value of the input, as a
feed-forward ANN would.
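A minimal NumPy sketch of one such recurrence step (a plain "vanilla" RNN cell with illustrative dimensions, not the model used later) makes the dependence on both the current input and the previous state explicit.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: the new state depends on the current input x_t
    and on the previous hidden state h_prev."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative dimensions: 3 input features, 5 hidden units.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(3, 5))
W_hh = rng.normal(scale=0.1, size=(5, 5))
b_h = np.zeros(5)

h = np.zeros(5)                      # initial state
for x_t in rng.normal(size=(4, 3)):  # a toy sequence of 4 time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                             # final state summarizes the whole sequence
```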
The problem of vanishing gradients with RNNs was mentioned before. Its effect is that
the model is not capable of learning dependencies that are sufficiently distant in the
sequence. To understand this, consider the example shown below:

Source: [25]

Here the dependency means that the cell state and output h_{t+1} at time t+1 depend
on the inputs at times 0 and 1. The vanishing gradient problem says that such
dependencies cannot be learnt effectively if the dependency distance is large enough.


The Long Short-Term Memory (LSTM) resolved this problem. Consider another
example:

In this language model, predicting the word "French" depends strongly on the word
"France", which came three words before. Here an RNN can work well. But in a case
where there may be paragraphs between these two parts, the vanishing gradient problem
makes it difficult for an RNN to learn the dependency. This is where the LSTM shines,
through its gating mechanism. An LSTM cell looks like the following:

Source: [26]

Here x_t is the input vector at time t, c_t is the cell-state vector, and o_t is the
output-gate activation vector. The output gate modulates, through a multiplier, how
much of the cell state propagates to the hidden layer. It was mentioned before that an
RNN responds to two things: 1) the current input vector and 2) the past cell-state
vector. The input gate i_t modulates, through a multiplier, how much weight the
current input vector is given in the learning process. The forget gate f_t modulates
how much weight the previous cell-state vector is given. Together, the activations of
these gates allow the cell to persist and forget long- and short-term dependencies.
Note that the gate activations lie in the range 0-1, because they are the outputs of
sigmoid (logistic) functions.
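For readers who prefer equations to diagrams, here is a hedged NumPy sketch of one LSTM step in the standard formulation; peephole connections and the other variants surveyed in [26] are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U and b hold the parameters of the input (i),
    forget (f), output (o) and candidate (g) transformations."""
    i = sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])   # input gate
    f = sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])   # forget gate
    o = sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])   # output gate
    g = np.tanh(x_t @ W["g"] + h_prev @ U["g"] + b["g"])   # candidate cell state
    c = f * c_prev + i * g      # keep part of the old state, admit part of the new input
    h = o * np.tanh(c)          # expose a gated view of the cell state
    return h, c
```

The forget gate f and input gate i are exactly the multipliers described above: activations near 1 preserve information across many time steps, while activations near 0 discard it.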


This prepares us to describe the model used in the demonstration, which is shown
below:

The LSTM layer, shown in dark, has 200 units. The representation of the input data is
the vector h. This is then given as input to logistic regression, a binary classifier. The
output of the logistic regression gives the probability of the company beating the
consensus estimate at time t. The program may be found here. [27]
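The linked notebook defines the exact model; a minimal Keras sketch of a model with this shape, an LSTM layer of 200 units feeding a single sigmoid (logistic) output, might look as follows. The sequence length, feature count and training settings below are assumptions for illustration only.

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential

timesteps, n_features = 63, 5   # assumed: roughly one quarter of daily bars, 5 inputs per bar

model = Sequential([
    Input(shape=(timesteps, n_features)),
    LSTM(200),                        # 200-unit LSTM layer produces the representation h
    Dense(1, activation="sigmoid"),   # logistic output: P(beat the consensus estimate)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Given X_train of shape (n_samples, timesteps, n_features) and a binary y_train,
# model.fit(X_train, y_train, validation_split=0.2, epochs=10) would train it.
```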

Why does deep learning work so well?


There has been no mathematical proof explaining why the idea of hierarchical learning
works so well. Very recently, a paper [28] has tried to show that the answer lies in
physics.
To get a sense of the problem, consider the task of classifying a grayscale image of a
million pixels to determine whether it shows a cat or a dog. Each pixel can take one of
256 grayscale values, so in theory there are 256^1,000,000 possible images, and for each
one it is necessary to compute whether it shows a cat or a dog. "And yet neural
networks, with merely thousands or millions of parameters, somehow manage this
classification task with ease." [29]
To explain this, the authors performed an analysis using tools of information theory and
supported the following two claims:
1. The statistical process generating the observed data is a hierarchy of causal
physical processes.
2. Given that the laws of nature are captured by simple physics-based functions whose
order never seems to exceed 4, each layer of a deep model can efficiently learn a
function that represents one causal process in the hierarchy. The exceptional
simplicity of physics-based functions hinges on properties such as symmetry,
locality, compositionality and polynomial log-probability of the input data. [28]
So the depth of a deep model makes it more efficient at capturing the hierarchy of
causal processes in the statistical process that generates the observed data. Therefore,
deep neural networks don't have to approximate every possible mathematical function,
but only a tiny subset of them.


5. A Note on Adversarial Machine Learning


Machine learning models must be deployed carefully in adversarial environments.
There lies an opportunity for an adversary to reverse-engineer a victim's machine-learning
model so that they can take actions in their own favor which are undesirable to the
victim. Vulnerable machine learning models include:

• Spam filtering
• Malware detection
• Biometric recognition
• Financial trading and prediction

The common theme of exploitation is to understand the important features of the dataset
that the victim's machine learning model responds to and to use this knowledge to gain
an advantage.
In finance, it is common for competitors to reverse-engineer trading algorithms. Once
they understand the important features of the dataset and how the model responds to
them, they can manipulate market conditions to affect those features and make the
victim's trading system trade in their favor. It is therefore important to obfuscate trade
orders and change algorithms frequently.
and change algorithms frequently.
Another example is the design of fraud detection systems. These look for specific
signatures in transactions. If an adversary determines the signatures by comparing
transactions that passed with the ones that got flagged, they will be able to modify
fraudulent transactions so that they go through the system undetected.
So the overall effectiveness of a machine learning based system also depends on its
ability to stay ahead of adversarial agents. Adversarial machine learning [30] is a
new research field at the intersection of machine learning and computer security that
studies this. There has also been research exploring the application of game theory to
systematically study the interaction between a machine learning system and its
adversaries [31].


6. Some Applications in Finance


Portfolio Management
The portfolio management problem seeks an optimal investment strategy that enables
investors to maximize their wealth by distributing it across a set of available financial
instruments without knowing the market outcome in advance.
A deep learning approach: There is a similarity to the game of chess: there are rules
and some known strategies. Intelligent agents in AI naturally deal with planning and
strategy-search problems. At a high level, this can be modelled as a graph search
problem where each node represents an action. In a graph search problem, the
traversal possibilities are so many that even a computer cannot evaluate all of them.
To make it tractable, the allowed transitions at each step are constrained by heuristics
(strategies) that can be learnt using deep learning. This approach is analogous to
Google AlphaGo's victory over Lee Sedol, the world champion in Go, which made news
headlines this year [32] [33].
Go is an ancient Chinese game that, unlike chess, is so complex that computers could
not previously win [34]. The system learnt to win by simulating a large number of games
and used a deep learning model to learn strategies on its own. An interesting question
arises: can we train an intelligent agent that learns to win the "portfolio game" on
its own? And can we also beat the world champion in portfolio management?
Two challenges can be readily observed:

• All strategies learnt in Go stay valid throughout gameplay, while in the game of
portfolio that may not be the case. Strategies may need to evolve with time.
• The computer simulated a large number of games to learn strategies. Since we
cannot simulate the portfolio game, a large amount of historical data is required for
the intelligent agent to learn strategies from. But then, those strategies could be
somewhat outdated.

However, these are engineering challenges that may be resolved by studying these
problems quantitatively.

Behavioral Finance
Behavioral finance studies "the effects of psychological, social, cognitive, and emotional
factors on the economic decisions of individuals and institutions and the consequences
for market prices, returns, and resource allocation, although not always that narrowly,
but also more generally, of the impact of different kinds of behavior, in different
environments of varying experimental values" [35].
Opinion mining, sentiment analysis and subjectivity analysis use natural language
processing to understand information retrieved from social media, news releases and
reports. Deep learning has substantially improved the ability to extract opinions,
sentiment and subjectivity from human expression. Predictive modelling can then
estimate the relationship between this information and financial outcomes. Although
the general techniques are well known [36], it is a complex phenomenon to capture.
Proprietary models, such as those used by IBM and Bloomberg L.P., gain a competitive
edge by using more advanced models, data engineering and AI-based targeted web
crawlers.

Retail Banking
There is opportunity in increased automation and better risk models to reduce delays
in the service pipeline, making the end product more appealing to the smartphone-enabled
generation. AI can also gather insights from consumer data and help engineer products
that better engage clients. The state of the art in computer vision and language
understanding has drastically improved, and there is strong potential in incorporating
these capabilities to provide a more natural interaction experience in client-side
applications.
Personalized engagement is effective at building and maintaining relationships with
clients. Insights from product usage data, and from the other channels a client engages
with, make it possible to offer this personalized experience at scale.

Risk Management
New financial risks evolve and regulations increase with time. The increasing overhead
of modelling financial risk can be managed by making the process of creating new risk
models more efficient. Using a data-driven approach, and especially deep learning to
eliminate feature engineering, can improve model performance and make the modelling
process economical and agile.
The resulting gains in model performance would enable more automation of transactions
in the pursuit of delivering a seamless banking experience to clients. And the resulting
agility of the modelling process would make it easier to prepare new risk models for new
regulations and evolving risks.


The automation aspect not only benefits clients but also reduces the operating cost of
services at a given scale, allowing the workforce to be used for more intellectual tasks.

Systematic Trading
Systematic trading is a methodical approach to investment and trading decisions based
on well-defined goals and risk controls [37]. It may be partially or fully automated.
Since it is hard for humans to understand, predict and regulate trading activity, there is
opportunity to leverage AI. An intelligent agent can respond instantly to ever-shifting
market conditions, taking into account thousands or millions of data points every second.
The resulting system is a market ruled by precision and mathematics rather than the
emotion and fallible judgment that come with a lack of automation [38].


7. Promising Future Technologies


Deep Learning Optimized Hardware
Many well-defined software tasks can be deployed in hardware to optimize speed and
power consumption. During the prototyping phase, or when the required volume is small,
such designs are deployed on field-programmable gate arrays (FPGAs): chips on which
customized processing architectures can be programmed electronically. Once the design
is verified and it is known that there is a market for the chips, the design is implemented
directly on silicon; the resulting chips are called application-specific ICs (ASICs). There
already exists an FPGA-based convolutional neural network (CNN) implementation for
embedded computer vision applications [39].
The market for CNNs is growing rapidly due to their proven success, and several companies
have started developing CNN ASICs; NVIDIA, Mobileye, Intel, Qualcomm and Samsung
are among them [10]. The hope is that these chips will reduce the footprint of the
hardware that runs these algorithms. While this is especially useful for self-driving cars,
which need real-time computer vision capabilities within the vehicle, it could also reduce
the infrastructure cost of applying these algorithms in financial applications.

Deep Learning + Computational Photography


Secure Face Authentication: There is opportunity in convenient and secure client
authentication enabled by time-of-flight (ToF) imaging. Its success has been
demonstrated in Microsoft's flagship Surface Pro 4 tablet, where its specificity surpasses
that of humans: even an adversarial twin sibling cannot trick the face recognition
[40]. Given the improved recognition performance and rotation invariance, it is
reasonable to believe that the recognition employs a pre-trained proprietary deep
neural network. However, an analysis of its vulnerability to adversarial machine
learning based attacks is still required.
A ToF camera acquires a 3D depth map instead of a 2D image. It works by having an
infrared light source emit a time-coded photon sequence. When the photons hit the
target surface, some of them bounce back to an image sensor. A photon sequence sent
at a given time is received at different portions of the sensor with different delays,
based on the distance travelled to different parts of the surface, revealing its shape.

It is possible that ToF cameras will make their way into smartphones to provide face
sign-in and gesture recognition capabilities. Reliable and convenient multi-factor
authentication would then be possible by combining 3D face recognition with the
fingerprint recognition already found in the Apple iPhone 5s.
Data from Gaze Tracking Applied to Human-Computer Interaction: Gaze
tracking has been used in marketing research for a long time, but it is also capable of
providing a computing experience in which the interface reacts to the user's attention
and intent. Since 2010, there appears to be a patent race on gaze tracking technology
between Google, Microsoft, Apple and the Swedish company Tobii, the leader in eye
tracking products.
The technology has improved substantially over the years and as a result has recently
entered the gaming industry. Tobii has released its EyeX sensor in the consumer market
and introduced gaze tracking in major game titles such as Assassin's Creed Syndicate
[41], Deus Ex: Mankind Divided [42] and Tom Clancy's The Division [43], to name a
few. Several products in the market have integrated Tobii's gaze tracking sensors, e.g.
the MSI GT72S G laptop and Acer's Predator series gaming displays. Tobii has recently
received an order from Dell to integrate its sensor into the Alienware IS4 series gaming
laptops [44].
Computing interfaces that react to intentions are a new experience for consumers and
could be a revolution in computing. In this pursuit, Tobii's gaming sensor is already
augmenting the Microsoft Windows 10 interface by providing on-screen gaze pointing
abilities that reduce the use of mouse and keyboard.
The real opportunity for banking lies in the fact that, if gaze tracking catches on, the
data available from these sensors is far more indicative of users' interests, and
presents a big opportunity for marketing and product engineering in retail banking, as
well as for predictive modelling in behavioral finance. A similar opportunity exists in
virtual reality (VR) and augmented reality (AR) applications, where a user's attention
can be approximated by their head movements.
The invasiveness of this technology and the associated data privacy concerns are
noteworthy. To ensure its adoption, there is a challenge in creating a value proposition
compelling enough to counter a possible backlash from consumers.


Hierarchical Temporal Memory (HTM)


Sequence learning covers a major portion of predictive analytics in finance.
Hierarchical temporal memory (HTM) sequence memory was recently proposed as a
theoretical framework for sequence learning in the cortex. Based on HTM, online
sequence learning models are being proposed by Numenta, Inc., which develops this
technology and makes it available through its open-source NuPIC library. It is said to
work best with data that has the following characteristics:

• Streaming data rather than batch data files.
• Data with time-based patterns.
• Many individual data sources where hand-crafting separate models is impractical.
• Subtle patterns that can't always be seen by humans.
• Data for which simple techniques such as thresholds yield substantial false
positives and false negatives.

In a comparative study of HTM [45] by its founder, it was shown to perform comparably
with LSTMs, and the following advantages were claimed:

• Ability to handle multiple predictions and branching sequences with high-order
statistics.
• Robustness to noise and fault tolerance.
• Good performance without task-specific hyper-parameter tuning.

Compared to the LSTM, HTM works on a completely different principle, and it is possible
that for some sequence learning problems it can outperform LSTMs in terms of either
performance or training efficiency. Possibilities for HTM-based trading models are
already being explored. [46]

Neural Turing Machines: RNN + Memory


Quoting a good explanation by Alex Graves, a researcher at Google DeepMind: "The
basic idea of the neural Turing machine (NTM) was to combine the fuzzy pattern
matching capabilities of neural networks with the algorithmic power of programmable
computers. A neural network controller is given read/write access to a memory matrix
of floating point numbers, allowing it to store and iteratively modify data. As Turing
showed, this is sufficient to implement any computable program, as long as you have
enough runtime and memory. By learning how to manipulate their memory, Neural
Turing Machines can infer algorithms from input and output examples alone. In other
words, they can learn how to program themselves."
NTMs take inspiration from the biological workings of memory and attention, and from
the design of computers. Unlike a machine learning model that learns an input-to-output
mapping, NTMs are capable of learning algorithms, i.e. instructions that lead to the
completion of a task. Graves's research [47] introduced a model that successfully learnt
and performed elementary operations such as copy and sort.
Although this research is emergent, having algorithms synthesize new algorithms could
be ground-breaking for AI.


Acknowledgement
This content is an extended discussion of the case study titled "Opportunity for Banking
in Data-Driven Predictive Analytics" by my team: Jacqueline Zhang, Nicholas Mancini,
Indraneel Bende, Ricky He and myself. It was presented to the domain leads at DB
Global Technology, Cary, NC as part of the 2016 summer analyst program.
I'm thankful for the contributions of my team members to the case study and for the
inspiring feedback given by the domain leads. I am grateful to Bryan Cardillo, Shambhu
Sharan and the rest of the dbTradeStore team for keeping me inspired and motivated
throughout this internship program.


References
[1] L. Grossman, "2045: The Year Man Becomes Immortal," Time Magazine, 10
February 2011.
[2] J. Demmel, "Communication-Avoiding Algorithms for Linear Algebra and
Beyond," in IPDPS, 2013.
[3] M. Hilbert and P. López, "The world's technological capacity to store,
communicate, and compute information," Science, pp. 60-65, 2011.
[4] D. Floyer, "The IT Benefits of an All-Flash Data Center," 23 March 2015.
[Online]. Available: http://wikibon.com/the-it-benefits-of-an-all-flash-datacenter/.
[5] I. Cutress, "Intels 140GB Optane 3D XPoint PCIe SSD Spotted at IDF,"
AnandTech, 26 August 2016. [Online]. Available:
http://www.anandtech.com/show/10604/intels-140gb-optane-3d-xpoint-pciessd-spotted-at-idf.
[6] K. Freund, "Intel Acquires Nervana Systems Which Could Significantly Enhance
Future Machine Learning Capabilities," Forbes, 9 August 2016. [Online].
Available: http://www.forbes.com/sites/moorinsights/2016/08/09/intelacquires-nervana-systems-which-could-significantly-enhance-future-machinelearning-capabilities. [Accessed 7 September 2016].
[7] E. Rich and K. Knight, Artificial Intelligence (second edition), McGraw-Hill,
1991.
[8] R. C. Schank, "Where's the AI?," AI magazine, p. 38, 1991.
[9] S. Russel and P. Norvig, Artificial Intelligence: A modern approach (third
edition), Prentice Hall, 2010.
[10] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, 2015.
[11] K. P. Roberto Martin, "Can Machine Learning Predict a Hit or Miss on Estimated
Earnings?," Bloomberg L.P., 4 February 2016. [Online]. Available:


http://www.bloomberg.com/company/announcements/can-machine-learningpredict-a-hit-or-miss-on-estimated-earnings/. [Accessed 8 September 2016].


[12] B. McClure, "Earnings Forecasts: A Primer," Investopedia, [Online]. Available:
http://www.investopedia.com/articles/stocks/06/earningsforecasts.asp.
[Accessed 8 September 2016].
[13] D. H. Hubel and T. N. Wiesel, "Receptive fields of single neurones in the cat's
striate cortex.," The Journal of physiology, vol. 148, no. 3, pp. 574-591, 1959.
[14] University of Colorado, "CCNBook/Perception," 2016. [Online]. Available:
https://grey.colorado.edu/CompCogNeuro/index.php/CCNBook/Perception.
[Accessed 11 September 2016].
[15] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning representations by
back-propagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[16] G. E. Hinton, S. Osindero and Y.-W. Teh, "A fast learning algorithm for deep
belief nets," Neural Computation, vol. 18, no. 7, pp. 1527 - 1554, 2006.
[17] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends in
Machine Learning, vol. 2, no. 1, pp. 1-127 , 2009.
[18] K. Fukushima, "Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position," Biological
Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.
[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and
L. D. Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition,"
MIT Press: Neural Computation, vol. 1, no. 4, pp. 541-551, 1989.
[20] F. J. Pineda, "Generalization of back-propagation to recurrent neural
networks," Physics Review Letters, vol. 59, no. 19, pp. 2229--2232, 1987.
[21] S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," Diploma
thesis, Institut f. Informatik, Technische Univ. Munich, 1991.
[22] J. Schmidhuber, "Learning Complex, Extended Sequences Using the Principle of
History Compression," MIT Press, vol. 4, no. 2, pp. 234-242, 1992.


[23] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural


Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[24] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep
convolutional neural networks," in Advances in Neural Information Processing Systems,
2012, pp. 1097-1105.
[25] C. Olah, "colah's blog: Understanding LSTM Networks," 27 August 2015.
[Online]. Available: http://colah.github.io/posts/2015-08-UnderstandingLSTMs/. [Accessed 12 September 2016].
[26] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber,
"LSTM: A Search Space Odyssey," arXiv:1503.04069, 2015.
[27] A. Sagar, "lstm_model.ipynb," 11 August 2016. [Online]. Available:
https://github.com/ayushsagar/big-dataanalytics/blob/master/lstm_model.ipynb. [Accessed 12 September 2016].
[28] H. W. Lin and M. Tegmark, "Why does deep and cheap learning work so well?,"
arXiv:1608.08225 [cond-mat.dis-nn], 2016.
[29] Emerging Technology from the arXiv, "The Extraordinary Link Between Deep
Neural Networks and the Nature of the Universe," MIT Technology Review, 9
September 2016. [Online]. Available:
https://www.technologyreview.com/s/602344/the-extraordinary-link-betweendeep-neural-networks-and-the-nature-of-the-universe/. [Accessed 10
September 2016].
[30] L. Huang, A. D. Joseph, B. Nelson, B. I. P. Rubinstein and J. D. Tygar,
"Adversarial Machine Learning," in 4th ACM Workshop on Artificial Intelligence
and Security, New York, NY, USA, 2011.
[31] S. Meng, M. Wiens and F. Schultmann, "A Game-theoretic Approach To Assess
Adversarial Risks," WIT Transactions on Information and Communication
Technologies, vol. 47, p. 12, 2014.
[32] D. Hassabis, "AlphaGo defeats Lee Sedol 4-1 in Google DeepMind Challenge
Match," Google Official Blog, 27 January 2016. [Online]. Available:


https://googleblog.blogspot.nl/2016/01/alphago-machine-learning-gamego.html. [Accessed 6 September 2016].


[33] Google DeepMind, "Mastering the game of Go with deep neural networks and
tree search," Nature, 2016.
[34] A. Levinovitz, "The Mystery of Go, the Ancient Game That Computers Still Cant
Win," Wired, 12 May 2015.
[35] T. C. W. Lin, "A Behavioral Framework for Securities Risk," 34 Seattle
University Law Review, 8 October 2013.
[36] B. Pang and L. Lee, "Opinion Mining and Sentiment Analysis," Foundations and
Trends in Information Retrieval, 2008.
[37] R. Carver, in Systematic Trading, Harriman House, 2015, p. 10.
[38] F. Salmon and J. Stokes, "Algorithms Take Control of Wall Street," 27
December 2010. [Online]. Available:
http://www.wired.com/2010/12/ff_ai_flashtrading/.
[39] C. Farabet, C. Poulet and Y. LeCun, "An FPGA Based Stream Processor for
Embedded Real-Time Vision with Convolutional Networks," in Fifth IEEE
Workshop on Embedded Computer Vision, 2009.
[40] C. Griffith, "Windows Hello: can identical twins fool Microsoft and Intel?," The
Australian: Business Review, 20 August 2015.
[41] Tobii AB, "Assassin's Creed Syndicate - Now Enhanced with Tobii Eye
Tracking," 5 January 2016. [Online]. Available:
https://www.youtube.com/watch?v=O4s5GByBYwQ.
[42] Tobii AB, "Deus Ex: Mankind Divided. Tobii Eye Tracking enhanced mode.," 9
August 2016. [Online]. Available:
https://www.youtube.com/watch?v=Ic2rZojA83I.
[43] Tobii AB, "Play & Experience Tom Clancys The Division with Tobii Eye
Tracking," 9 May 2016. [Online]. Available:
https://www.youtube.com/watch?v=TX0_KZh39R0.


[44] Tobii AB, "Tobii Receives Order from Alienware Regarding the IS4 Eye-Tracking
Platform," 2 September 2016. [Online]. Available:
http://www.businesswire.com/news/home/20160901006614/en/.
[45] Y. Cui, S. Ahmad and J. Hawkins, Continuous online sequence learning with an
unsupervised neural network model, arXiv.org, 2015.
[46] P. Gabrielsson, R. König and U. Johansson, "Evolving Hierarchical Temporal
Memory-Based Trading Models," in Applications of Evolutionary Computation,
Vienna, Austria, 2013.
[47] A. Graves, G. Wayne and I. Danihelka, Neural Turing Machines, arXiv.org,
2014.
[48] A. Jakulin, "What is the difference between statistics and machine learning?,"
Quora, 22 December 2012. [Online]. Available: https://www.quora.com/Whatis-the-difference-between-statistics-and-machine-learning/answer/AleksJakulin?srid=OlUS. [Accessed 7 September 2016].

