
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

Noritaka Sekiyama
Senior Cloud Support Engineer, Amazon Web Services Japan

2019.03.14
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Who I am...

Noritaka Sekiyama
Senior Cloud Support Engineer
@moomindani

- Engineer in AWS Support


- Speciality: Big Data
(EMR, Glue, Athena, …)
- SME of AWS Glue
- Apache Spark lover ;)

About today’s session
Question
• Are you already using S3 on Hadoop/Spark?
• Will you start using Hadoop/Spark on S3 in the future?
• Are you just interested in using cloud storage in Hadoop/Spark?

Agenda
Relationship between Hadoop/Spark and S3
Difference between HDFS and S3, and use-case
Detailed behavior of S3 from the viewpoint of Hadoop/Spark
Well-known pitfalls and tunings
Service updates on AWS/S3 related to Hadoop/Spark
Recent activities in Hadoop/Spark community related to S3

Relationship between
Hadoop/Spark and S3

Data operation on Hadoop/Spark
Hadoop/Spark processes large data and writes output to a destination
Input/output data can be located on various file systems such as HDFS
Hadoop/Spark accesses these file systems via the Hadoop FileSystem API
(Figure: App -> FileSystem API -> FileSystem)

Hadoop/Spark and file system
Hadoop FileSystem API
• Interface for operating on Hadoop file systems
⎼ open: Open input stream
⎼ create: Create files
⎼ append: Append to files
⎼ getFileBlockLocations: Get block locations
⎼ rename: Rename files
⎼ mkdirs: Create directories
⎼ listFiles: List files
⎼ delete: Delete files
• Different file systems (HDFS and others) can be used by switching the implementation of the FileSystem API
⎼ LocalFileSystem, S3AFileSystem, EmrFileSystem
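The idea that one API fronts several implementations can be sketched as a lookup from URL scheme to implementation class. This is a simplified stand-in for Hadoop's own scheme resolution, not a copy of it; the table `FS_IMPLEMENTATIONS` and the function `resolve_filesystem` are hypothetical names used only for illustration, and the `s3` entry assumes an EMR cluster.

```python
from urllib.parse import urlparse

# Scheme -> FileSystem implementation class.
# The mapping table itself is an illustration, not Hadoop's actual lookup code.
FS_IMPLEMENTATIONS = {
    "file": "org.apache.hadoop.fs.LocalFileSystem",
    "hdfs": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "s3a": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "s3": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",  # assumption: running on EMR
}


def resolve_filesystem(path: str) -> str:
    """Return the implementation class name that would serve this path."""
    scheme = urlparse(path).scheme
    if scheme not in FS_IMPLEMENTATIONS:
        raise ValueError(f"No FileSystem registered for scheme {scheme!r}")
    return FS_IMPLEMENTATIONS[scheme]


print(resolve_filesystem("hdfs://namenode:8020/tmp/input"))
print(resolve_filesystem("s3a://my-bucket/data/part-0000.parquet"))
```

The application code stays the same; only the path prefix decides which implementation handles the request.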
HDFS: Hadoop Distributed File System
HDFS is
• Distributed file system that enables high-throughput data access
• Optimized for large files (100MB+, GB, TB)
⎼ Not good for lots of small files
• A NameNode manages metadata, and DataNodes manage/store data blocks
• In Hadoop 3.x, many features (e.g. Erasure Coding, Router-based Federation, Tiered Storage, Ozone) are actively developed.

How to access
• Hadoop FileSystem API
• $ hadoop fs ...
• HDFS Web UI
Amazon S3
Amazon S3 is
• Object storage service that achieves high scalability, availability,
durability, security, and performance
• Pricing is mainly based on data size and requests
• Maximum size of single object: 5TB
• Objects have unique keys within a bucket
⎼ There are no directories in S3, although the S3 console emulates directories.
⎼ S3 is not a file system
How to access
• REST API (GET, SELECT, PUT, POST, COPY, LIST, DELETE, CANCEL, ...)
• AWS CLI
• AWS SDK
• S3 Console
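The "no directories" point can be illustrated with a toy model of a flat key space. The function below mimics how a LIST with a prefix and a delimiter groups keys into "folders"; `list_objects` and the sample keys are hypothetical and only model the grouping idea, not the S3 API itself.

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix + delimiter LIST presents them.

    Returns (objects, common_prefixes): keys directly under the prefix,
    and the emulated "directories" one level below it.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter is reported as a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)


keys = ["logs/2019/03/a.csv", "logs/2019/03/b.csv", "logs/readme.txt", "data.csv"]
print(list_objects(keys))                 # top level
print(list_objects(keys, prefix="logs/"))  # inside the emulated "logs" folder
```

Nothing in the key space is a directory; the "folders" appear only because keys happen to share a prefix up to a delimiter.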
S3’s implementation of Hadoop FileSystem API
Enables Hadoop/Spark to handle S3 like HDFS
History of S3 FileSystem API implementation
• S3: S3FileSystem
• S3N: NativeS3FileSystem
• S3A: S3AFileSystem
• EMRFS: EmrFileSystem

(Figure: an App on a cluster accesses both HDFS in the cluster and S3)
S3: S3FileSystem
HADOOP-574: want FileSystem implementation for Amazon S3
Developed in 2006 to use S3 as a file system
• Object data on S3 = Block data (≠ File data)
• Blocks are stored on S3 directly
• Data can be read/written only through S3FileSystem
URL prefix: s3://
Deprecated in 2016

https://issues.apache.org/jira/browse/HADOOP-574

S3N: NativeS3FileSystem
HADOOP-931: Make writes to S3FileSystem world visible only on completion
Developed in 2008 to solve issues in S3FileSystem
• Object data on S3 = File data (≠ Block data)
• Empty directories are represented by an empty object "xyz_$folder$"
• Limited to files that do not exceed 5GB
Uses jets3t to access S3 (does not use the AWS SDK)
URL prefix: s3n://

https://issues.apache.org/jira/browse/HADOOP-931

S3A: S3AFileSystem
HADOOP-10400: Incorporate new S3A FileSystem implementation
Developed in 2014 to solve issues in NativeS3FileSystem
• Support parallel copy and rename
• Compatible with the S3 console for empty directories ("xyz_$folder$" -> "xyz/")
• Support IAM role authentication
• Support 5GB+ files and multipart uploads
• Support S3 server side encryption
• Improve recovery from error
Uses AWS SDK for Java to access S3
URL prefix: s3a://
Amazon EMR does not support S3A officially
https://issues.apache.org/jira/browse/HADOOP-10400
EMRFS: EmrFileSystem
FileSystem implementation for Amazon EMR (available only on EMR)
Developed by Amazon (optimized for S3)
• Support IAM role authentication
• Support 5GB+ files and multipart uploads
• Support both S3 server-side/client-side encryption
• Support EMRFS S3-optimized Committer
• Support pushdown with S3 SELECT
Uses AWS SDK for Java to access S3
URL prefix: s3:// (or s3n://)

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
How to choose S3A or EMRFS
S3A
• On-premises
• Other cloud
• Hadoop/Spark on EC2
EMRFS
• Amazon EMR

(Figure: two setups side by side. An App accesses S3 via S3A from a cluster with HDFS, and via EMRFS from an EMR cluster with HDFS)
Hadoop/Spark and AWS
Choices on AWS
• EMR: Covers most use-cases
• Hadoop/Spark on EC2: Good for specific use-cases
⎼ Multi-master (Coming Soon in EMR)
https://www.slideshare.net/AmazonWebServices/a-deep-dive-into-whats-new-with-amazon-emr-ant340r1-aws-reinvent-2018/64
⎼ Needs a combination of applications/versions that is not supported in EMR
⎼ Needs a specific distribution of Hadoop/Spark
• Glue: Can be used as serverless Spark

Difference between HDFS
and S3, and use-case

Common features in both HDFS and S3
Both are accessible via the Hadoop FileSystem API
The file system is selected by the URL prefix ("hdfs://", "s3://", "s3a://")

Workloads and data where HDFS is better
Extremely high I/O performance
Frequent data access
Temporary data
High consistency
• The S3 consistency model is not acceptable, even with EMRFS consistent view or S3Guard
Fixed cost for both storage and I/O is preferred
Use-cases where data locality works well
(network bandwidth between nodes < 1G)
Physical location of data needs to be controlled

Workloads and data where S3 is better (1/2)
Extremely high durability and availability
• Durability: 99.999999999%
• Availability: 99.99%
Cold data is stored for long-term use
https://aws.amazon.com/s3/storage-classes/

Lower cost per data size is preferred
• An external blog post estimated the cost at less than 1/5 of HDFS
Data size is huge and incrementally increasing

Workloads and data where S3 is better (2/2)
Storage needs to be separated from the compute cluster
• Data on S3 remains after terminating clusters
Multiple clusters and applications share the same file system
• Multiple Hadoop/Spark clusters
• EMR, Glue, Athena, Redshift Spectrum, Hadoop/Spark on EC2, etc.
Centralized security is preferred (including components other than Hadoop)
• IAM, S3 bucket policy, VPC Endpoint, Glue Data Catalog, etc.
• Will be improved by AWS Lake Formation

Note: S3 cannot be used as default file system (fs.defaultFS)


https://aws.amazon.com/premiumsupport/knowledge-center/configure-emr-s3-hadoop-storage/

Detailed behavior of S3
from the viewpoint of
Hadoop/Spark

S3 (EMRFS/S3A)
(Figure: read and write paths in a Spark cluster. The Spark driver on the master node and the Spark executors on the slave nodes each use an S3 client to call the S3 API endpoint; the HDFS NameNode and the DataNodes with their local disks remain inside the cluster.)
S3 (EMRFS/S3A): Split
Split is
• Data chunk generated by splitting the target data so that Hadoop/Spark can process it
• Data in a splittable format (e.g. bzip2) is split based on a pre-defined size
Well-known issues
• Increased overhead due to lots of splits from lots of small files
• Out of memory due to large unsplittable file
Default split size
• HDFS: 128MB (Recommendation: HDFS Block size = Split size)
• S3 EMRFS: 64MB (fs.s3.block.size)
• S3 S3A: 32MB (fs.s3a.block.size)
Request for unsplittable files
• S3 GetObject API with the content length specified in the Range parameter
• Status code: 206 (Partial Content)
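As a rough illustration of split sizing and ranged reads, the sketch below divides an object into fixed-size splits and builds the HTTP Range header value that a read of each split could carry. It is a simplification: real input formats also account for record boundaries and other details, and the helper names `split_ranges` and `range_header` are hypothetical.

```python
MB = 1024 * 1024


def split_ranges(object_size, split_size):
    """Yield inclusive (start, end) byte offsets that cover the whole object."""
    for start in range(0, object_size, split_size):
        yield start, min(start + split_size, object_size) - 1


def range_header(start, end):
    """Format an HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end}"


# A 150MB object with the 64MB EMRFS default quoted above yields 3 splits.
for start, end in split_ranges(150 * MB, 64 * MB):
    print(range_header(start, end))
```

A smaller split size produces more, smaller ranged reads; a larger one produces fewer, bigger ones, which is the trade-off behind the small-file overhead and out-of-memory issues listed above.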
Well-known pitfalls and tunings
- S3 consistency model

S3 consistency model
PUTs of new objects: read-after-write consistency
• A consistent result is returned when you GET an object right after PUTting it.
HEAD/GET of non-existing objects: eventual consistency
• If you make a HEAD or GET request to the key name (to find if the object exists)
before creating the object, S3 provides eventual consistency for read-after-write.
PUTs/DELETEs for existing objects: eventual consistency
• A process replaces an existing object and immediately attempts to read it. Until
the change is fully propagated, S3 might return the prior data
• A process deletes an existing object and immediately attempts to read it. Until
the deletion is fully propagated, S3 might return the deleted data.
LIST of objects: eventual consistency
• A process writes a new object to S3 and immediately lists keys within its bucket.
Until the change is fully propagated, the object might not appear in the list.
• A process deletes an existing object and immediately lists keys within its bucket.
Until the deletion is fully propagated, S3 might list the deleted object.
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
Impact to Hadoop/Spark due to S3 consistency
Example: ETL data pipeline with multiple steps
• Step 1: Transforming and converting input data
• Step 2: Statistical processing of converted data

Expected issue
• Step 2 may get an object list that is missing some of the data written in step 1

Workload where impact is expected
• Immediate, incremental, consistent processing that consists of multiple steps

Mitigating consistency impact in Hadoop/Spark
Write into HDFS, then write into S3
• Write data into HDFS as intermediate storage, then move the data from HDFS to S3.
• DistCp or S3DistCp can be used to move data from HDFS to S3.
• Cons: Overhead of moving data, an additional intermediate process, and a delay before the latest data is reflected

(Figure: an App on a cluster reads input from and writes output to HDFS; data is backed up to and restored from S3)

Mitigating consistency impact in Hadoop/Spark
S3Guard (S3A), EMRFS consistent view (EMRFS)
• Mechanism to check S3 consistency (especially LIST consistency)
• Use DynamoDB to manage S3 object metadata
• Provide the latest view by comparing the results returned from S3 and DynamoDB
(Figure: an App on a cluster PUTs/GETs objects on S3 and PUTs/GETs their metadata on DynamoDB; temporary data stays on HDFS)

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
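The comparison between an S3 listing and the metadata table can be sketched as set arithmetic. This is a deliberately simplified model of the idea, not the actual S3Guard or EMRFS consistent view algorithm; the function `consistent_listing` and the "present"/"deleted" states are assumptions made for illustration.

```python
def consistent_listing(s3_listing, metadata_store):
    """Reconcile an S3 LIST result with keys tracked in a metadata table.

    metadata_store maps key -> "present" or "deleted" (a tombstone).
    Returns (visible_keys, missing_keys), where missing_keys are tracked
    as written but not yet returned by the S3 listing.
    """
    s3_keys = set(s3_listing)
    tracked_present = {k for k, state in metadata_store.items() if state == "present"}
    tracked_deleted = {k for k, state in metadata_store.items() if state == "deleted"}
    # Keys the table knows about but S3 has not listed yet.
    missing = tracked_present - s3_keys
    # Hide keys that S3 still lists although the table recorded their deletion.
    visible = (s3_keys | tracked_present) - tracked_deleted
    return sorted(visible), sorted(missing)


s3_listing = ["out/part-0", "out/old"]  # stale LIST result
metadata = {"out/part-0": "present", "out/part-1": "present", "out/old": "deleted"}
print(consistent_listing(s3_listing, metadata))
```

A non-empty `missing_keys` is the signal a client could use to retry the listing or report an inconsistency.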
Common troubles and workarounds
Updating data on S3 from outside the cluster
• Metadata on DynamoDB and data on S3 diverge if you write data without going through S3A or EMRFS
→ Limit basic operations to inside the cluster
→ Sync metadata when updating S3 data from outside the cluster
EMRFS CLI $ emrfs ...
S3Guard CLI $ hadoop s3guard ...
(Figure: an App on a cluster PUTs/GETs objects on S3 and metadata on DynamoDB, while a client outside the cluster PUTs objects to S3 directly)
Common troubles and workarounds
DynamoDB I/O throttling
• Getting/updating metadata fails if the DynamoDB table does not have enough capacity.
→ Provision enough capacity, or use on-demand mode instead
→ Retry I/O to mitigate impact
S3A: fs.s3a.s3guard.ddb.max.retries, fs.s3a.s3guard.ddb.throttle.retry.interval, ...
→ Notify when there is inconsistency
EMRFS: fs.s3.consistent.notification.CloudWatch, etc.
(Figure: an App on a cluster PUTs/GETs objects on S3 and PUTs/GETs their metadata on DynamoDB; temporary data stays on HDFS)
Well-known pitfalls and tunings
- S3 multipart uploads

Hadoop/Spark and S3 multipart uploads
Multipart uploads are used when Hadoop/Spark uploads large data to S3
• Both S3A and EMRFS support S3 multipart uploads.
• The size threshold can be set via parameters
EMRFS: fs.s3n.multipart.uploads.split.size, etc.
S3A: fs.s3a.multipart.threshold, etc.

Case of EMR: Multipart uploads are always used when the EMRFS S3-optimized Committer is used

Case of OSS Hadoop/Spark: Multipart uploads are always used when the S3A committer is used
Steps in multipart uploads
Multipart Upload Initiation
• When you send a request to initiate a multipart upload, S3 returns a response
with an upload ID, which is a unique identifier for your multipart upload.
Parts Upload
• When uploading a part, you must specify a part number and the upload ID.
• Only after you either complete or abort a multipart upload will S3 free up the
parts storage and stop charging you for the parts storage.
Multipart Upload Completion (or Abort)
• When you complete a multipart upload, S3 creates an object by concatenating
the parts in ascending order based on the part number.
• If any object metadata was provided in the initiate multipart upload request, S3
associates that metadata with the object.
• After a successful complete request, the parts no longer exist.
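The three steps can be modeled with a small in-memory class. This is a toy stand-in to show the lifecycle (initiate, upload parts, complete or abort), not an S3 client; `ToyMultipartStore` and its method names are hypothetical.

```python
import uuid


class ToyMultipartStore:
    """In-memory model of the multipart upload steps; not an S3 client."""

    def __init__(self):
        self.objects = {}  # key -> bytes: objects visible to readers
        self.uploads = {}  # upload_id -> (key, {part_number: bytes})

    def initiate(self, key):
        """Start an upload and return its unique upload ID."""
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        """Store one part; parts may arrive in any order."""
        self.uploads[upload_id][1][part_number] = data

    def complete(self, upload_id):
        """Create the object by concatenating parts in ascending part number."""
        key, parts = self.uploads.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort(self, upload_id):
        """Discard the upload and its stored parts."""
        self.uploads.pop(upload_id)


store = ToyMultipartStore()
uid = store.initiate("out/part-0000.parquet")
store.upload_part(uid, 2, b"world")
store.upload_part(uid, 1, b"hello ")
store.complete(uid)
print(store.objects["out/part-0000.parquet"])
```

Until `complete` is called the object is not visible, and until `complete` or `abort` is called the parts stay in `uploads`, which mirrors why unfinished uploads keep consuming storage.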

Common troubles and workarounds
Remaining multipart uploads
• Parts might remain when jobs are aborted or clusters are terminated unexpectedly.
• The S3 console does not show remaining parts of incomplete uploads
→ Delete remaining parts periodically with an S3 lifecycle rule
→ Configure multipart-related parameters
EMRFS: fs.s3.multipart.clean.enabled, etc.
S3A: fs.s3a.multipart.purge, etc.
→ You can check whether there are remaining parts via the AWS CLI

$ aws s3api list-multipart-uploads --bucket bucket-name
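For the lifecycle-based cleanup mentioned above, a rule with the `AbortIncompleteMultipartUpload` action can be generated as JSON. The sketch below only builds the document; the 7-day value, the rule ID format, and the helper name `abort_incomplete_uploads_rule` are illustrative assumptions. The resulting JSON is the kind of document that could be passed to `aws s3api put-bucket-lifecycle-configuration`.

```python
import json


def abort_incomplete_uploads_rule(days, prefix=""):
    """Build a lifecycle configuration that aborts stale multipart uploads."""
    return {
        "Rules": [
            {
                "ID": f"abort-incomplete-multipart-after-{days}d",  # hypothetical ID
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


# 7 days is an example value, not a recommendation.
print(json.dumps(abort_incomplete_uploads_rule(7), indent=2))
```

Choose a retention period longer than the longest-running job, so that uploads still in progress are not aborted.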

Well-known pitfalls and tunings
- S3 request performance

S3 performance and throttling
Request performance (per prefix)
• 3,500 PUT/POST/DELETE requests per second
• 5,500 GET requests per second
• "HTTP 503 Slow Down" might be returned when the request rate is too high

s3://MyBucket/customers/dt=yyyy-mm-dd/0000001.csv

Performance improves if the prefixes are split, for example into

s3://MyBucket/customers/US/dt=yyyy-mm-dd/0000001.csv
s3://MyBucket/customers/CA/dt=yyyy-mm-dd/0000002.csv
https://www.slideshare.net/AmazonWebServices/best-practices-for-amazon-s3-and-amazon-glacier-stg203r2-aws-reinvent-2018/50

When it is difficult to split the prefixes due to the use-case
• Query over multiple prefixes (e.g. query with '*' without specifying 'US'/'CA')
→ Please reach out to AWS Support to get proactive support
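Because the limits apply per prefix, the effect of splitting prefixes can be estimated with simple arithmetic. The sketch below assumes that requests are spread evenly across prefixes, which real workloads may not achieve; the function names are hypothetical.

```python
PUT_RPS_PER_PREFIX = 3500  # PUT/POST/DELETE requests per second, from the figures above
GET_RPS_PER_PREFIX = 5500  # GET requests per second, from the figures above


def aggregate_limits(prefix_count):
    """Upper bound on request rates when load is spread evenly over prefixes."""
    return {
        "put": prefix_count * PUT_RPS_PER_PREFIX,
        "get": prefix_count * GET_RPS_PER_PREFIX,
    }


def prefixes_needed(target_get_rps):
    """Smallest number of prefixes whose combined GET limit covers the target."""
    return -(-target_get_rps // GET_RPS_PER_PREFIX)  # ceiling division


print(aggregate_limits(2))     # e.g. customers/US/ and customers/CA/
print(prefixes_needed(20000))
```

A query that scans only one prefix still sees the single-prefix limit, which is why the wildcard case above is harder to improve by key design alone.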
Tuning S3 requests
S3 connections
• Configure the number of connections in order to adjust concurrency of
S3 requests
EMRFS: fs.s3.maxConnections, etc.
S3A: fs.s3a.connection.maximum, etc.

S3 request retries
• Configure request retry behavior in order to address request throttling
EMRFS: fs.s3.retryPeriodSeconds (EMR 5.14 or later), fs.s3.maxRetries (EMR 5.12 or later), etc.
S3A: fs.s3a.retry.throttle.limit, fs.s3a.retry.throttle.interval, etc.
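One way to apply such Hadoop properties to a Spark job is the `spark.hadoop.` prefix on `--conf` options. The sketch below renders a property map into `spark-submit` flags; the helper `spark_submit_flags` is hypothetical and the values are placeholders for illustration, not tuning recommendations.

```python
def spark_submit_flags(hadoop_props):
    """Render Hadoop properties as spark-submit --conf flags (spark.hadoop.* prefix)."""
    flags = []
    for key, value in sorted(hadoop_props.items()):
        flags += ["--conf", f"spark.hadoop.{key}={value}"]
    return flags


# Placeholder values: benchmark against your own workload before adopting any.
s3a_tuning = {
    "fs.s3a.connection.maximum": 100,
    "fs.s3a.retry.throttle.limit": 20,
    "fs.s3a.retry.throttle.interval": "500ms",
}

print(" ".join(["spark-submit"] + spark_submit_flags(s3a_tuning) + ["job.py"]))
```

Raising connection counts increases request concurrency, so it is worth checking that the resulting request rate stays within the per-prefix limits.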

Well-known pitfalls and tunings
- Others

Hive Write performance tuning
Common troubles
• Performance degrades when writing data into Hive tables on S3
⎼ Lack of parallelism of write I/O
⎼ Writing not only output data, but also intermediate data to S3

Workarounds
• Parallelism
⎼ hive.mv.files.threads
• Intermediate data
⎼ hive.blobstore.use.blobstore.as.scratchdir = false
⎼ One reported example achieves 10 times faster performance.

https://issues.apache.org/jira/browse/HIVE-14269
https://issues.apache.org/jira/browse/HIVE-14270
Tuning S3A Fast Upload
Common troubles
• Slow file upload via S3A
• Consuming too much disk space or memory when uploading data

Workarounds
• Tuning S3A Fast Upload related parameters
⎼ fs.s3a.fast.upload.buffer: disk, array, bytebuffer
⎼ fs.s3a.fast.upload.active.blocks

Service updates on AWS/S3
related to Hadoop/Spark

2018.7: S3 request performance improvement
Previous request performance
• 100 PUT/LIST/DELETE requests per second
• 300 GET requests per second

Current request performance (per prefix)


• 3,500 PUT/POST/DELETE requests per second
• 5,500 GET requests per second

2018.9: S3 SELECT supports Parquet format
S3 SELECT is
• A feature that enables querying only the required data from an object
• Supports queries from the API and the S3 console
• Possible to retrieve at most 40MB of records from a source file of at most 128MB
Supported formats
• CSV
• JSON
• Parquet <-New!

https://aws.amazon.com/jp/about-aws/whats-new/2018/09/amazon-s3-announces-
2018.10: EMRFS supports pushdown by S3 SELECT
EMRFS supports pushdown by using S3 SELECT queries
• Expected outcome: performance improvement, faster data transfer
• Supported applications: Hive, Spark, Presto
• How to use: Configure per application
Note: EMRFS does not automatically decide whether to use S3 SELECT based on the workload.
• Guidelines to determine if your application is a candidate for S3 Select:
⎼ Your query filters out more than half of the original data set.
⎼ Your network connection between S3 and the EMR cluster has good transfer
speed and available bandwidth.
⎼ Your query filter predicates use columns that have a data type supported by
both S3 Select and application (Hive/Spark/Presto)
→ Benchmarking is recommended to verify whether S3 SELECT is better for your workloads
Hive: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-s3select.html

Spark: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html

Presto: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-s3select.html
2018.11: EMRFS S3-optimized committer
EMRFS S3-optimized committer is
• Output committer available in EMR 5.19.0 or later (default in 5.20.0 or later)
• Used when you use Spark SQL / DataFrames / Datasets to write Parquet files
• Based on S3 multipart uploads

Pros
• Improve performance by avoiding S3 LIST/RENAME during job/task
commit phase.
• Improve correctness of job with failed tasks by avoiding S3 consistency
impact during job/task commit phase
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
2018.11: EMRFS S3-optimized committer
Difference between FileOutputCommitter and the EMRFS S3-optimized committer
• FileOutputCommitterV1: 2-phase RENAME
⎼ RENAME to commit individual task output
⎼ RENAME to commit whole job output from completed/succeeded tasks
• FileOutputCommitterV2: 1-phase RENAME
⎼ RENAME to commit files to the final destination.
⎼ Note: Intermediate data would be visible before jobs complete.
(Both versions have RENAME operations to write data into an intermediate location.)
• EMRFS S3-optimized committer
⎼ Avoids RENAME by taking advantage of S3 multipart uploads

The reason to focus on RENAME
• HDFS RENAME: Metadata-only operation. Fast.
• S3 RENAME: N data copies and deletions. Slow.
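The cost of RENAME on S3 can be made concrete with a rough request count. The model below assumes each renamed object costs one COPY plus one DELETE, and it ignores LIST calls and other details; the function `s3_requests_for_commit` and the `COMMITTERS` table are hypothetical illustrations of the rename phases described above.

```python
def s3_requests_for_commit(file_count, rename_phases):
    """Model each rename of one object on S3 as one COPY plus one DELETE."""
    renames = file_count * rename_phases
    return {"copy": renames, "delete": renames}


# Number of RENAME phases per output file, following the comparison above.
COMMITTERS = {
    "FileOutputCommitter v1": 2,     # task commit + job commit
    "FileOutputCommitter v2": 1,     # task commit to the final destination
    "multipart-based committer": 0,  # no rename
}

for name, phases in COMMITTERS.items():
    print(name, s3_requests_for_commit(100, phases))
```

On HDFS the same renames are metadata updates, so the request count matters far less there than the data copies do on S3.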
2018.11: EMRFS S3-optimized committer
Performance comparison
• EMR 5.19.0 (master m5d.2xlarge / core m5d.2xlarge x 8 nodes)
• Input data: 15 GB (100 Parquet files)

INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});

(Figure: performance charts for two cases, EMRFS consistent view disabled and EMRFS consistent view enabled)

2018.11: DynamoDB On-Demand
Provisioned
• Configure provisioned capacity for Read/Write I/O
On-Demand
• No need to configure capacity (Auto-scale based on workloads)
• EMRFS consistent view and S3Guard can take advantage of this.

https://aws.amazon.com/jp/blogs/news/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing/
Recent activities in
Hadoop/Spark community
related to S3

JIRA – S3A
HADOOP-16132: Support multipart download in S3AFileSystem
• Improve download performance, taking the AWS CLI implementation as a reference.
https://issues.apache.org/jira/browse/HADOOP-16132

HADOOP-15364: Add support for S3 Select to S3A


• S3A supports S3 SELECT
https://issues.apache.org/jira/browse/HADOOP-15364

JIRA – S3Guard
HADOOP-15999: S3Guard: Better support for out-of-band operations
• Improve handling of files updated from outside S3Guard
https://issues.apache.org/jira/browse/HADOOP-15999

HADOOP-15837: DynamoDB table Update can fail S3A FS init
• Improve S3Guard initialization when DynamoDB Auto Scaling is enabled
https://issues.apache.org/jira/browse/HADOOP-15837

HADOOP-15619: Über-JIRA: S3Guard Phase IV: Hadoop 3.3 features


• Parent JIRA for S3Guard in Hadoop 3.3 (Hadoop 3.0, 3.1, and 3.2 each have their own JIRA)
https://issues.apache.org/jira/browse/HADOOP-15619

HADOOP-15426: Make S3guard client resilient to DDB throttle events and network failures
• Improve S3Guard CLI behavior when DynamoDB is throttled
https://issues.apache.org/jira/browse/HADOOP-15426

HADOOP-15349: S3Guard DDB retryBackoff to be more informative on limits exceeded
• Improve S3Guard behavior when DynamoDB is throttled
JIRA – Others
HADOOP-15281: Distcp to add no-rename copy option
• DistCp adds new option without RENAME (mainly for S3)
https://issues.apache.org/jira/browse/HADOOP-15281

HIVE-20517: Creation of staging directory and Move operation is taking time in S3
• Change Hive behavior to write data directly into the final destination to avoid RENAME operations.
https://issues.apache.org/jira/browse/HIVE-20517

SPARK-21514: Hive has updated with new support for S3 and InsertIntoHiveTable.scala should update also
Issue
• Spark writes intermediate files and RENAMEs them when writing data
• Even intermediate data is written into S3, which causes slow performance.
• Related to HIVE-14270
Approach
• Divide location (HDFS for intermediate files, S3 for final destination)
• Expected outcome: performance improvement, S3 cost reduction.

Current status
• My implementation is 2 times faster, but still in testing phase.

https://issues.apache.org/jira/browse/SPARK-21514
Conclusion
Relationship between Hadoop/Spark and S3
Difference between HDFS and S3, and use-case
Detailed behavior of S3 from the viewpoint of Hadoop/Spark
Well-known pitfalls and tunings
Service updates on AWS/S3 related to Hadoop/Spark
Recent activities in Hadoop/Spark community related to S3

Appendix

Major use-case of Hadoop/Spark on AWS
Transient clusters
• Batch job
• One-time data conversion
• Machine learning
• ETL into other DWH or data lake

Persistent clusters
• Ad-hoc jobs
• Streaming processing
• Continuous data conversion
• Notebook
• Experiments

Useful information in troubleshooting HDFS
Resource/Daemon logs
• Name Node logs
• Data Node logs
• HDFS block reports
Request logs
• HDFS audit logs
Metrics
• Hadoop Metrics v2

Useful information in troubleshooting S3
Request logs
• S3 Access logs
⎼ Logs are written only when logging is configured on the S3 bucket in advance.
• CloudTrail
⎼ Management events: Records control plane operations
⎼ Data events: Records data plane operations (Need to be configured)
Metrics
• CloudWatch S3 metrics
⎼ Storage metrics
– There are 2 types of metrics: bucket size and the number of objects
– Updated once a day
⎼ Request metrics (Need to be configured)
– 16 types of metrics including request counts (GET, PUT, HEAD, …) and 4XX/5XX errors
– Updated once a minute
