
Amazon Elastic MapReduce – Architecture and Best Practices

April 2015

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to Expect from the Session

•  Technical introduction to Amazon EMR


•  Basic tenets
•  Amazon EMR feature set
Amazon EMR
•  Managed platform
•  MapReduce, Apache Spark, Presto
•  Launch a cluster in minutes
•  Open source distribution and MapR distribution
•  Leverage the elasticity of the cloud
•  Baked-in security features
•  Pay by the hour and save with Spot
•  Flexibility to customize
Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
What Do I Need to Build a Cluster?

1.  Choose instances


2.  Choose your software
3.  Choose your access method
An Example EMR Cluster

Master Node (m3.xlarge)
•  NameNode (HDFS)
•  ResourceManager (YARN)

Slave Group – Core (c3.2xlarge)
•  HDFS (DataNode)
•  YARN (NodeManager)

Slave Group – Task (m3.2xlarge, EC2 Spot)

Slave Group – Task (r3.2xlarge)
Choice of Multiple Instances

•  General (m4, m3 families) – batch processing
•  CPU (c4, c3 families) – machine learning
•  Memory (r3 family) – in-memory workloads (Spark & Presto)
•  Disk/IO (d2, i2 families) – large HDFS
Select an Instance
Choose Your Software (Quick Bundles)
Choose Your Software – Custom
Hadoop Applications Available in Amazon EMR
Choose Security and Access Control
You Are Up and Running!

Master Node DNS


You Are Up and Running!

Information about the software you are running, logs, and features
You Are Up and Running!

Infrastructure for this cluster


You Are Up and Running!

Security Groups and Roles


Use the CLI

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK


Programmatic Access to Cluster Provisioning
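The CLI call above has a direct SDK equivalent. A minimal boto3 sketch, with the cluster name made up for illustration and the actual API call left commented out, since it would launch real, billable instances:

```python
# Request mirroring the CLI example: EMR 4.0.0, one master and two core m3.xlarge nodes.
cluster_params = {
    "Name": "demo-cluster",  # hypothetical name
    "ReleaseLabel": "emr-4.0.0",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# import boto3
# emr = boto3.client("emr")
# response = emr.run_job_flow(**cluster_params)
# print(response["JobFlowId"])

total_nodes = sum(g["InstanceCount"] for g in cluster_params["Instances"]["InstanceGroups"])
print(total_nodes)  # 3
```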
Now that I have a cluster, I need to process
some data
Amazon EMR can process data from multiple sources

•  Hadoop Distributed File System (HDFS)
•  Amazon S3 (EMRFS)
•  Amazon DynamoDB
•  Amazon Kinesis
On an On-premises Environment

Tightly coupled
Compute and Storage Grow Together

Storage grows along with compute
Compute requirements vary

Tightly coupled
Underutilized or Scarce Resources
[Chart: cluster capacity vs. demand over time]
Underutilized or Scarce Resources
[Chart: utilization over time – steady state baseline, weekly peaks, and re-processing spikes]
Underutilized or Scarce Resources
[Chart: flat provisioned-capacity line above variable demand; the gap is underutilized capacity]
Contention for Same Resources

Compute-bound and memory-bound workloads compete for the same cluster
Separation of Resources Creates Data Silos

[Diagram: each team's data confined to its own cluster (Team A, …)]
Replication Adds to Cost

3x replication within a single datacenter
So how does Amazon EMR solve these problems?
Decouple Storage and Compute
Amazon S3 is Your Persistent Data Store

99.999999999% (eleven 9s) of durability
$0.03 / GB / month in US-East
Lifecycle policies
Versioning
Distributed by default
Amazon S3 EMRFS
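The lifecycle policies mentioned above can be set programmatically. A hedged boto3 sketch, where the rule name, prefix, retention windows, and bucket name are all hypothetical and the API call is commented out:

```python
# Rule: move objects under logs/ to Glacier after 90 days, delete after 365.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",  # hypothetical rule name
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}

# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket",  # hypothetical bucket
#     LifecycleConfiguration=lifecycle_config,
# )

print(lifecycle_config["Rules"][0]["ID"])  # archive-old-logs
```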
The Amazon EMR File System (EMRFS)

•  Allows you to leverage Amazon S3 as a file-system


•  Streams data directly from Amazon S3
•  Uses HDFS for intermediates
•  Better read/write performance and error handling than
open source components
•  Consistent view – consistency for read after write
•  Support for encryption
•  Fast listing of objects
Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/';
Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
Benefit 1: Switch Off Clusters

Amazon S3 Amazon S3 Amazon S3


Auto-Terminate Clusters
You Can Build a Pipeline
Or You Can Use AWS Data Pipeline

Input data → Ingest into Amazon S3 → Use Amazon EMR to transform unstructured data to structured → Push to Amazon Redshift
Sample Pipeline
Run Transient or Long-Running Clusters
Run a Long-Running Cluster

Amazon EMR cluster


Benefit 2: Resize Your Cluster
Resize the Cluster

Scale up, scale down, stop a resize, or issue another resize
How do you scale up and save cost?
Spot Instance

[Chart: the Spot market price fluctuates below the On-Demand (OD) price; you pay the Spot price as long as it stays below your bid price]
Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with Amazon EMR

•  Can provision instances from the Spot market


•  Replaces a Spot instance in case of interruption
•  Impact of interruption
•  Master node – Can lose the cluster
•  Core node – Can lose intermediate data
•  Task nodes – Jobs will restart on other nodes (application
dependent)
Scale up with Spot Instances

10 node cluster running for 14 hours


Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances

Add 10 more nodes on Spot


Resize Nodes with Spot Instances

20 node cluster running for 7 hours

On-Demand: 1.0 * 10 * 7 = $70
Spot: 0.5 * 10 * 7 = $35

Total = $105
Resize Nodes with Spot Instances

50 % less run-time ( 14 à 7)

25% less cost (140 à 105)


Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/

1500 to 2000 clusters


6000 Jobs
for each instance_type in (Availability Zone, Region):
    cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
    if maxCpuPerUnitPrice < cpuPerUnitPrice:
        maxCpuPerUnitPrice = cpuPerUnitPrice
        optimalInstanceType = instance_type

Source: Github /Bloomreach/ Briefly
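A runnable version of this selection loop, with made-up Spot prices for illustration (real prices would come from the EC2 Spot price history):

```python
# Candidate instance types in one Availability Zone, with invented Spot prices.
candidates = [
    {"type": "m3.xlarge",  "cpu_cores": 4,  "spot_price": 0.05},
    {"type": "c3.4xlarge", "cpu_cores": 16, "spot_price": 0.12},
    {"type": "r3.2xlarge", "cpu_cores": 8,  "spot_price": 0.10},
]

# Pick the type that buys the most CPU cores per dollar on the Spot market.
max_cpu_per_unit_price = 0.0
optimal_instance_type = None
for inst in candidates:
    cpu_per_unit_price = inst["cpu_cores"] / inst["spot_price"]
    if cpu_per_unit_price > max_cpu_per_unit_price:
        max_cpu_per_unit_price = cpu_per_unit_price
        optimal_instance_type = inst["type"]

print(optimal_instance_type)  # c3.4xlarge
```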


Intelligent Scale Down
Intelligent Scale Down: HDFS
Effectively Utilize Clusters
[Chart: capacity resized over time to match demand]
Benefit 3: Logical Separation of Jobs

Prod cluster: Hive, Pig, Cascading
Ad-hoc cluster: Presto

Both read from the same Amazon S3 data
Benefit 4: Disaster Recovery Built In

Cluster 1 Cluster 2

Amazon S3
Cluster 3 Cluster 4
Availability Zone Availability Zone
Amazon S3 as a Data Lake

Nate Sammons, Principal Architect – NASDAQ


Reference – AWS Big Data Blog
Re-cap

Rapid provisioning of clusters


Hadoop, Spark, Presto, and other applications
Standard open-source packaging
De-couple storage and compute and scale them
independently
Resize clusters to manage demand
Save costs with Spot instances
Thank You
