
Amazon Elastic MapReduce – Architecture and Best Practices

April 2015

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to Expect from the Session

•  Technical introduction to Amazon EMR


•  Basic tenets
•  Amazon EMR feature set
Amazon EMR
•  Managed platform
•  MapReduce, Apache Spark, Presto
•  Launch a cluster in minutes
•  Open source distribution and MapR distribution
•  Leverage the elasticity of the cloud
•  Baked-in security features
•  Pay by the hour and save with Spot
•  Flexibility to customize
Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
What Do I Need to Build a Cluster?

1.  Choose instances


2.  Choose your software
3.  Choose your access method
An Example EMR Cluster

Master Node (m3.xlarge)
•  NameNode (HDFS)
•  ResourceManager (YARN)

Slave Group – Core (c3.2xlarge)
•  HDFS (DataNode)
•  YARN (NodeManager)

Slave Group – Task (m3.2xlarge, EC2 Spot)

Slave Group – Task (r3.2xlarge)
Choice of Multiple Instances

•  General (m4, m3 families) – batch processing
•  CPU (c4, c3 families) – machine learning
•  Memory (r3 family) – in-memory workloads (Spark & Presto)
•  Disk/IO (d2, i2 families) – large HDFS
Select an Instance
Choose Your Software (Quick Bundles)
Choose Your Software – Custom
Hadoop Applications Available in Amazon EMR
Choose Security and Access Control
You Are Up and Running!

Master Node DNS


You Are Up and Running!

Information about the software you are running, logs, and features
You Are Up and Running!

Infrastructure for this cluster


You Are Up and Running!

Security Groups and Roles


Use the CLI

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK


Programmatic Access to Cluster Provisioning
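The CLI call above has a direct SDK equivalent. A minimal boto3 sketch, with the cluster name made up for illustration and the actual API call left commented out, since it would launch real, billable instances:

```python
# Request mirroring the CLI example: EMR 4.0.0, one master and two core m3.xlarge nodes.
cluster_params = {
    "Name": "demo-cluster",  # hypothetical name
    "ReleaseLabel": "emr-4.0.0",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# import boto3
# emr = boto3.client("emr")
# response = emr.run_job_flow(**cluster_params)
# print(response["JobFlowId"])

total_nodes = sum(g["InstanceCount"] for g in cluster_params["Instances"]["InstanceGroups"])
print(total_nodes)  # 3
```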
Now that I have a cluster, I need to process
some data
Amazon EMR can process data from multiple sources

•  Hadoop Distributed File System (HDFS)
•  Amazon S3 (EMRFS)
•  Amazon DynamoDB
•  Amazon Kinesis
On an On-premises Environment

Tightly coupled
Compute and Storage Grow Together

Storage grows along with compute
Compute requirements vary

Tightly coupled
Underutilized or Scarce Resources
[Chart: cluster capacity vs. demand over time]
Underutilized or Scarce Resources
[Chart: utilization over time – steady state baseline, weekly peaks, and re-processing spikes]
Underutilized or Scarce Resources
[Chart: flat provisioned-capacity line above variable demand; the gap is underutilized capacity]
Contention for Same Resources

Compute-bound and memory-bound workloads compete for the same cluster
Separation of Resources Creates Data Silos

[Diagram: each team's data confined to its own cluster (Team A, …)]
Replication Adds to Cost

3x replication within a single datacenter
So how does Amazon EMR solve these problems?
Decouple Storage and Compute
Amazon S3 is Your Persistent Data Store

99.999999999% (eleven 9s) of durability
$0.03 / GB / month in US-East
Lifecycle policies
Versioning
Distributed by default
Amazon S3 EMRFS
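The lifecycle policies mentioned above can be set programmatically. A hedged boto3 sketch, where the rule name, prefix, retention windows, and bucket name are all hypothetical and the API call is commented out:

```python
# Rule: move objects under logs/ to Glacier after 90 days, delete after 365.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",  # hypothetical rule name
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}

# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket",  # hypothetical bucket
#     LifecycleConfiguration=lifecycle_config,
# )

print(lifecycle_config["Rules"][0]["ID"])  # archive-old-logs
```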
The Amazon EMR File System (EMRFS)

•  Allows you to leverage Amazon S3 as a file-system


•  Streams data directly from Amazon S3
•  Uses HDFS for intermediates
•  Better read/write performance and error handling than
open source components
•  Consistent view – consistency for read after write
•  Support for encryption
•  Fast listing of objects
Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/';
Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
Benefit 1: Switch Off Clusters

Amazon S3 Amazon S3 Amazon S3


Auto-Terminate Clusters
You Can Build a Pipeline
Or You Can Use AWS Data Pipeline

Input data → Ingest into Amazon S3 → Use Amazon EMR to transform unstructured data to structured → Push to Amazon Redshift
Sample Pipeline
Run Transient or Long-Running Clusters
Run a Long-Running Cluster

Amazon EMR cluster


Benefit 2: Resize Your Cluster
Resize the Cluster

Scale up, scale down, stop a resize, or issue another resize
How do you scale up and save cost?
Spot Instance

[Chart: the Spot market price fluctuates below the On-Demand (OD) price; you pay the Spot price as long as it stays below your bid price]
Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with Amazon EMR

•  Can provision instances from the Spot market


•  Replaces a Spot instance in case of interruption
•  Impact of interruption
•  Master node – Can lose the cluster
•  Core node – Can lose intermediate data
•  Task nodes – Jobs will restart on other nodes (application
dependent)
Scale up with Spot Instances

10 node cluster running for 14 hours


Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances

Add 10 more nodes on Spot


Resize Nodes with Spot Instances

20 node cluster running for 7 hours

On-Demand: 1.0 * 10 * 7 = $70
Spot: 0.5 * 10 * 7 = $35

Total = $105
Resize Nodes with Spot Instances

50 % less run-time ( 14 à 7)

25% less cost (140 à 105)


Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/

1500 to 2000 clusters


6000 Jobs
for each instance_type in (Availability Zone, Region):
    cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
    if maxCpuPerUnitPrice < cpuPerUnitPrice:
        maxCpuPerUnitPrice = cpuPerUnitPrice
        optimalInstanceType = instance_type

Source: Github /Bloomreach/ Briefly
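A runnable version of this selection loop, with made-up Spot prices for illustration (real prices would come from the EC2 Spot price history):

```python
# Candidate instance types in one Availability Zone, with invented Spot prices.
candidates = [
    {"type": "m3.xlarge",  "cpu_cores": 4,  "spot_price": 0.05},
    {"type": "c3.4xlarge", "cpu_cores": 16, "spot_price": 0.12},
    {"type": "r3.2xlarge", "cpu_cores": 8,  "spot_price": 0.10},
]

# Pick the type that buys the most CPU cores per dollar on the Spot market.
max_cpu_per_unit_price = 0.0
optimal_instance_type = None
for inst in candidates:
    cpu_per_unit_price = inst["cpu_cores"] / inst["spot_price"]
    if cpu_per_unit_price > max_cpu_per_unit_price:
        max_cpu_per_unit_price = cpu_per_unit_price
        optimal_instance_type = inst["type"]

print(optimal_instance_type)  # c3.4xlarge
```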


Intelligent Scale Down
Intelligent Scale Down: HDFS
Effectively Utilize Clusters
[Chart: capacity resized over time to match demand]
Benefit 3: Logical Separation of Jobs

Prod cluster: Hive, Pig, Cascading
Ad-hoc cluster: Presto

Both read from the same Amazon S3 data
Benefit 4: Disaster Recovery Built In

Cluster 1 Cluster 2

Amazon S3
Cluster 3 Cluster 4
Availability Zone Availability Zone
Amazon S3 as a Data Lake

Nate Sammons, Principal Architect – NASDAQ


Reference – AWS Big Data Blog
Re-cap

Rapid provisioning of clusters


Hadoop, Spark, Presto, and other applications
Standard open-source packaging
De-couple storage and compute and scale them
independently
Resize clusters to manage demand
Save costs with Spot instances
Thank You
