I recommend you refer to your Hadoop cloud provider documentation if you need
to dive deeper.
The why
As far as banks are concerned, especially investment banks, business fluctuates a lot
and is driven by the market. Fluctuating business means fluctuating trade volume
and variable IT resource requirements. As shown in the following figure, traditional
on-premise implementations will have a fixed number of servers for peak IT
capacity, but the actual IT capacity needs are variable:
[Figure: traditional fixed IT capacity versus your variable IT needs, plotted as capacity over time]
As shown in the following figure, if a bank plans for more IT capacity than its maximum usage (a must for banks), there will be wastage; but if it plans for IT capacity equal to the average of the required fluctuations, this will lead to processing queues and customer dissatisfaction:
[Figure: usage patterns (on and off, fast growth, variable peaks, predictable peaks) against fixed capacity, with over-provisioning shown as waste and under-provisioning as customer dissatisfaction]
With cloud computing, financial organizations pay only for the IT capacity they use, and that is the number-one reason for using Hadoop in the cloud: elastic capacity and thus elastic pricing.
The second reason is proof of concept. For every financial institution, the big dilemma before adopting Hadoop technologies was, "Is it really worth it?" or "Should I really spend on Hadoop hardware and software while it is still not completely mature?" In the cloud, you can simply create Hadoop clusters within minutes, run a small proof of concept, and validate the benefits. Then, either scale up your cloud with more use cases or go on-premise if that is what you prefer.
The when
Have a look at the following questions. If you answer yes to any of these for your
big data problem, Hadoop in the cloud could be the way forward:
The biggest concern is, and will remain for the foreseeable future, the security of the data in the cloud, especially customers' private data. The moment senior managers think of security, they want to play safe and drop the idea of implementing it on the cloud.
Once the data is in the cloud, vendors manage the day-to-day administrative tasks, including operations. The implementation of Hadoop in the cloud will therefore lead to the development and operations roles merging, which runs slightly against the norm for how departmental functions are separated in banks.
In the next section, I will pick up one of the most popular use cases: implementing
Hadoop in the cloud for the risk division of a bank.
Solution
For our illustration, I will use Amazon Web Services (AWS) with Elastic
MapReduce (EMR) and parallelize the Monte Carlo simulation using a MapReduce
model. Note, however, that it can be implemented on any Hadoop cloud platform.
The bank will upload the client portfolio data into cloud storage (S3), develop
MapReduce jobs from the existing algorithms, use on-demand EMR nodes to execute
the MapReduce jobs in parallel, write the results back to S3, and then release the
EMR resources.
HDFS is automatically spread over data nodes. If you decommission
the nodes, the HDFS data on them will be lost. So always put your
persistent data on S3, not HDFS.
$\Delta S_{t+1} = S_t \left( \mu \, \Delta t + \sigma \, \varepsilon \, \sqrt{\Delta t} \right)$

$S_{t+1} = S_t + \Delta S_{t+1}$

where $S_t$ is the asset price at time $t$, $S_{t+1}$ is the asset price at time $t+1$, $\mu$ is the expected return (drift), $\sigma$ is the volatility, $\varepsilon$ is a random draw from the standard normal distribution, and $\Delta t$ is the time step.
For a large number of iterations, the simulated asset price changes follow a normal distribution. In this example, the value at risk at 99 percent is 0.409, which means there is a 1 percent probability that the asset price will fall by more than 0.409 after 300 days. So, if a client holds 100 units of the asset in his portfolio, the VaR for his portfolio is 40.9.
The results are only an estimate, and their accuracy improves with the square root of the number of iterations, which means 100 times more iterations make the estimate only 10 times more accurate. The iterations could number anywhere from hundreds of thousands to millions, and even with powerful and expensive computers, they could take more than 20 hours to complete.
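To make the mechanics concrete, here is a plain, single-machine sketch of the simulation loop described above, not the bank's actual MapReduce code; the starting price, drift, volatility, horizon, and iteration count are hypothetical values chosen only for illustration:

import java.util.Arrays;
import java.util.Random;

public class VarSimulationSketch {
    public static void main(String[] args) {
        double s0 = 100.0;      // hypothetical starting asset price
        double mu = 0.05;       // hypothetical drift over the full horizon
        double sigma = 0.25;    // hypothetical volatility over the full horizon
        int days = 300;         // 300-day horizon, as in the example above
        double dt = 1.0 / days; // time step as a fraction of the horizon
        int iterations = 100000;

        Random rnd = new Random(42);
        double[] changes = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            double s = s0;
            for (int d = 0; d < days; d++) {
                double eps = rnd.nextGaussian();
                // Discretized price step: dS = S * (mu*dt + sigma*eps*sqrt(dt))
                s += s * (mu * dt + sigma * eps * Math.sqrt(dt));
            }
            changes[i] = s - s0; // simulated price change over the horizon
        }
        Arrays.sort(changes);
        // The 1st percentile of the simulated changes approximates the 99 percent VaR per unit.
        double var99 = -changes[(int) (0.01 * iterations)];
        System.out.printf("Estimated 99%% VaR per unit of the asset: %.3f%n", var99);
    }
}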
[Figure: solution architecture, with input data in Amazon S3 feeding an EMR cluster of EC2 nodes, and the results written back to Amazon S3]
Data collection
The data storage for this project is Amazon S3 (where S3 stands for Simple Storage
Service). It can store anything, has unlimited scalability, and has 99.999999999
percent durability.
If you have a little more money and want better performance, go for storage on:
Amazon DynamoDB: This is a fully managed NoSQL database service with fast, predictable performance.
Amazon Redshift: This is a relational, parallel data warehouse that scales to petabytes of data and should be used if performance is your top priority. It is even more expensive than DynamoDB, on the order of $1,000/TB/year.
Data upload
Now you have to upload the client portfolio and parameter data into Amazon S3
as follows:
1. Create an input bucket on Amazon S3, which is like a directory and must have
a unique name, something like <organization name + project name + input>.
2. Upload the source files over a secure corporate Internet connection, as shown in the SDK sketch after these steps.
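A minimal sketch of creating the input bucket and uploading a portfolio file with the AWS SDK for Java; the bucket name, object key, and local path are hypothetical, and credentials are assumed to come from the default provider chain:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadPortfolioToS3 {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical bucket name; S3 bucket names must be globally unique.
        String inputBucket = "mybank-varproject-input";
        if (!s3.doesBucketExistV2(inputBucket)) {
            s3.createBucket(inputBucket);
        }

        // Upload the client portfolio file (hypothetical key and local path).
        s3.putObject(inputBucket, "portfolio/client_positions.csv",
                new File("/data/client_positions.csv"));
    }
}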
I recommend you use one of the two Amazon data transfer services, AWS
Import/Export and AWS Direct Connect, if there is any opportunity to do so.
The AWS Import/Export service works as follows:
You export the data in the Amazon-specified format onto a portable storage device (hard disk, CD, and so on) and ship it to Amazon.
Amazon imports the data into S3 using its high-speed internal network and sends the portable storage device back to you.
The process takes five to six days and is recommended only for an initial large data load, not for incremental loads.
AWS Direct Connect, by contrast, establishes a dedicated network connection between your premises and AWS. Use this service if you need to import/export large volumes of data in and out of the Amazon cloud on a day-to-day basis.
Data transformation
Rewrite the existing simulation programs into Map and Reduce programs and
upload them into S3. The functional logic will remain the same; you just need
to rewrite the code using the MapReduce framework, as shown in the following
template, and compile it as MapReduce-0.0.1-VarRiskSimulationAWS.jar.
The mapper logic splits the client portfolio data into partitions and applies iterative simulations to each partition. The reducer logic aggregates the mapper results into the portfolio value and risk. A skeleton of these two classes is sketched after the driver template.
package com.hadoop.Var.MonteCarlo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VarMonteCarlo {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: VarMonteCarlo <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "VaR calculation");
        job.setJarByClass(VarMonteCarlo.class);
        // Mapper, reducer, and the custom RiskArray writable are defined separately.
        job.setMapperClass(VarMonteCarloMapper.class);
        job.setReducerClass(VarMonteCarloReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(RiskArray.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
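The mapper and reducer classes referenced by the driver are not part of the template. The following is a minimal sketch of their shape, assuming (hypothetically) one client position per input line; simulate() and the aggregation step are placeholders for the bank's existing pricing logic, and each class would normally live in its own source file in the same package:

package com.hadoop.Var.MonteCarlo;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (clientId, simulated price changes) for each client position line.
class VarMonteCarloMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed (hypothetical) record format: "clientId,units,price,mu,sigma".
        String[] fields = line.toString().split(",");
        context.write(new Text(fields[0]), new Text(simulate(fields)));
    }

    private String simulate(String[] position) {
        // Placeholder for the existing simulation routine.
        return "";
    }
}

// Aggregates the simulated changes per client into portfolio value and VaR.
class VarMonteCarloReducer extends Reducer<Text, Text, Text, RiskArray> {
    @Override
    protected void reduce(Text clientId, Iterable<Text> simulations, Context context)
            throws IOException, InterruptedException {
        RiskArray result = new RiskArray();
        // Placeholder: the real logic would compute the portfolio value and
        // the 1st percentile loss (99 percent VaR) from the simulated changes.
        context.write(clientId, result);
    }
}

// Minimal custom Writable holding the aggregated portfolio value and VaR.
class RiskArray implements Writable {
    double value;
    double valueAtRisk;

    public void write(DataOutput out) throws IOException {
        out.writeDouble(value);
        out.writeDouble(valueAtRisk);
    }

    public void readFields(DataInput in) throws IOException {
        value = in.readDouble();
        valueAtRisk = in.readDouble();
    }

    @Override
    public String toString() {
        return value + "\t" + valueAtRisk;
    }
}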
Once the Map and Reduce code is developed, please follow these steps:
1. Create an output bucket on Amazon S3, which is like a directory and
must have a unique name, something like <organization name + project
name + results>.
2. Create a new job workflow using the following parameters (a scripted equivalent using the AWS SDK is sketched after these steps):
Core Instance EC2 instance: Choose larger instances with a lower count.
Task Instance EC2 instance: Choose larger instances with a very high count, which must be in line with the number of risk simulation iterations.
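If you prefer to script the job flow instead of using the EMR console, a roughly equivalent request can be made with the AWS SDK for Java. This is a minimal sketch using a single uniform instance group rather than separate core and task groups; the bucket names, instance type, instance count, and release label are hypothetical placeholders:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchVarJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // Run the VarMonteCarlo driver jar against the S3 buckets created earlier.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://mybank-varproject-input/MapReduce-0.0.1-VarRiskSimulationAWS.jar")
                .withArgs("s3://mybank-varproject-input/portfolio/",
                          "s3://mybank-varproject-results/run-001/");
        StepConfig varStep = new StepConfig("VaR Monte Carlo", jarStep)
                .withActionOnFailure("TERMINATE_CLUSTER");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("VaR risk simulation")
                .withReleaseLabel("emr-5.20.0")
                .withLogUri("s3://mybank-varproject-results/logs/")
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withSteps(varStep)
                .withInstances(new JobFlowInstancesConfig()
                        .withMasterInstanceType("m4.xlarge")
                        .withSlaveInstanceType("m4.xlarge")
                        .withInstanceCount(10) // scale this with the number of iterations
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}

Because the cluster is not kept alive once the step finishes, you pay only while the simulation is actually running, in line with the elastic pricing argument made earlier.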
Data analysis
You can download the simulation results from the Amazon S3 output bucket for
further analysis with local tools.
In this case, you should be able to simply download the data locally, as the result
volume may be relatively low.
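A minimal sketch of pulling a result file down with the AWS SDK for Java; the bucket, key, and local path are hypothetical:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class DownloadVarResults {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // MapReduce writes one part file per reducer; download the first one here.
        s3.getObject(
                new GetObjectRequest("mybank-varproject-results", "run-001/part-r-00000"),
                new File("/data/var-results/part-r-00000"));
    }
}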
Summary
In this chapter, we learned how and when big data can be processed in the cloud,
right from configuration, collection, and transformation to the analysis of data.
Currently, Hadoop in the cloud is not used much in banks due to a few concerns
about data security and performance. However, that is debatable.
For the rest of this book, I will discuss projects using on-premise Hadoop
implementations only.
In the next chapter, I will pick up a medium-scale on-premise Hadoop project
and see it in a little more detail.