April 2006
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to Expect from the Session
NameNode (HDFS)
ResourceManager (YARN)
Slave Group – Task
m3.xlarge
Choice of Multiple Instances
Tightly coupled
Compute and Storage Grow Together
Tightly coupled
Underutilized or Scarce Resources
[Chart: y-axis 0 to 120, x-axis 1 to 26]
Underutilized or Scarce Resources
[Chart: y-axis 0 to 120, x-axis 1 to 26; annotations: Weekly peaks, Re-processing, Steady state]
Underutilized or Scarce Resources
[Chart: y-axis 0 to 120, x-axis 1 to 26; annotations: Provisioned capacity, Underutilized capacity]
Contention for Same Resources
Compute bound
Memory bound
Separation of Resources Creates Data Silos
Team A
Replication Adds to Cost
Single datacenter
3x
So how does Amazon EMR solve these problems?
Decouple Storage and Compute
Amazon S3 is Your Persistent Data Store
Eleven 9s of durability (99.999999999%)
$0.03 / GB / month in US-East
Lifecycle policies
Versioning
Distributed by default
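To make the storage numbers concrete, here is a minimal sketch of the arithmetic. Only the $0.03 / GB / month price and the 3x replication factor come from the slides. The 10 TB dataset size and the function names are assumptions chosen for illustration, and the sketch ignores request, transfer, and hardware costs.

```python
# Hypothetical sizing helper. Only the price and the replication
# factor below are taken from the slides.
S3_PRICE_PER_GB_MONTH = 0.03  # US-East price quoted on the slide
HDFS_REPLICATION = 3          # "3x" replication quoted on the slide


def s3_monthly_cost(dataset_gb: float) -> float:
    """Monthly S3 storage cost for a dataset of the given logical size."""
    return dataset_gb * S3_PRICE_PER_GB_MONTH


def hdfs_raw_capacity_gb(dataset_gb: float) -> float:
    """Raw disk needed to hold the dataset in HDFS at 3x replication."""
    return dataset_gb * HDFS_REPLICATION


dataset_gb = 10_000  # assumed 10 TB dataset, for illustration only
print(f"S3 cost per month: ${s3_monthly_cost(dataset_gb):,.2f}")
print(f"HDFS raw capacity: {hdfs_raw_capacity_gb(dataset_gb):,.0f} GB")
```

The point of the comparison is that S3 is billed on the logical size of the data, while a replicated HDFS cluster has to provision three times that size in raw disk.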
Amazon S3 EMRFS
The Amazon EMR File System (EMRFS)
LOCATION 'samples/pig-apache/input/'
Going from HDFS to Amazon S3
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
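The change from HDFS to Amazon S3 amounts to swapping the table location for an `s3://` URI. The sketch below shows that rewrite as a small helper. The function names and the bucket name `my-bucket` are hypothetical, and the helper only builds strings. It does not talk to Hive or S3.

```python
def s3_location(bucket: str, hdfs_path: str) -> str:
    """Map an HDFS-style path to an s3:// URI inside the given bucket."""
    return f"s3://{bucket}/{hdfs_path.lstrip('/')}"


def location_clause(uri: str) -> str:
    """Render the LOCATION clause of a table definition."""
    return f"LOCATION '{uri}'"


hdfs_path = "samples/pig-apache/input/"
# Before: the table points at a path in the cluster's HDFS.
print(location_clause(hdfs_path))
# After: the same data read through EMRFS from a (hypothetical) bucket.
print(location_clause(s3_location("my-bucket", hdfs_path)))
```

The rest of the table definition stays the same, which is why the slides show only the `LOCATION` line changing.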
Benefit 1: Switch Off Clusters
Input data
Ingest into Amazon S3
Use Amazon EMR to transform unstructured data to structured
Push to Amazon Redshift
Sample Pipeline
Run Transient or Long-Running Clusters
Run a Long-Running Cluster
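One way to picture the difference between the two modes is the `KeepJobFlowAliveWhenNoSteps` flag in an EMR cluster request. The sketch below builds request arguments in the shape that boto3's `run_job_flow` accepts, without calling AWS. The cluster names, release label, and instance count are assumptions, and the role names are the EMR defaults, which may differ in a given account.

```python
def cluster_request(name: str, transient: bool) -> dict:
    """Build keyword arguments shaped for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.1.0",  # assumed release label
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,  # assumed cluster size
            # False: terminate once all steps finish (transient cluster).
            # True: stay up and wait for more work (long-running cluster).
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


transient = cluster_request("nightly-etl", transient=True)
long_running = cluster_request("adhoc-queries", transient=False)
print(transient["Instances"]["KeepJobFlowAliveWhenNoSteps"])
print(long_running["Instances"]["KeepJobFlowAliveWhenNoSteps"])

# With AWS credentials configured, the request could be submitted as:
#   import boto3
#   boto3.client("emr").run_job_flow(**transient)
```

Because the data lives in Amazon S3, a transient cluster can terminate after its steps without losing anything.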
On-Demand (OD) price
Bid price
Spot Integration
Total $105
Resize Nodes with Spot Instances
50% less run-time (14 → 7)
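The trade-off behind resizing with Spot Instances can be sketched as simple arithmetic. The hourly prices and node counts below are hypothetical and are not the figures behind the slide's total. The sketch also assumes the job scales linearly, so that doubling the nodes halves the run-time from 14 to 7.

```python
def run_cost(nodes: int, hours: float, price_per_hour: float) -> float:
    """Cost of running a group of identical nodes for the given time."""
    return nodes * hours * price_per_hour


ON_DEMAND_PRICE = 0.50  # hypothetical On-Demand price per node-hour
SPOT_PRICE = 0.25       # hypothetical Spot price per node-hour

# Baseline: 4 On-Demand nodes for the full 14-unit run.
baseline = run_cost(4, 14, ON_DEMAND_PRICE)

# Resized: the same 4 On-Demand nodes plus 4 Spot nodes, assumed to
# finish in half the time.
resized = run_cost(4, 7, ON_DEMAND_PRICE) + run_cost(4, 7, SPOT_PRICE)

print(f"baseline: ${baseline:.2f}, resized: ${resized:.2f}")
```

Under these assumptions the resized run is both faster and cheaper, because the On-Demand nodes run for half as long and the added capacity is billed at the lower Spot price.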
Benefit 3: Logical Separation of Jobs
Amazon S3
Presto Ad-Hoc
Benefit 4: Disaster Recovery Built In
Cluster 1 Cluster 2
Amazon S3
Cluster 3 Cluster 4
Availability Zone
Availability Zone
Amazon S3 as a Data Lake