
Hadoop Ecosystem Fundamentals

These training materials were produced as part of the Istanbul Big Data Training and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
How to Scale for Big Data?
• Data volumes are massive
• Reliably storing PBs of data is challenging
• All kinds of failures: disk/hardware/network failures
• The probability of failure simply increases with the number of machines …
Distributed processing is non-trivial
• How to assign tasks to different workers in an efficient way?
• What happens if tasks fail?
• How do workers exchange results?
• How to synchronize distributed tasks allocated to different workers?
One popular solution: Hadoop

Hadoop Cluster at Yahoo! (Credit: Yahoo)


Hadoop History
• Dec 2004 – Google publishes the MapReduce paper (the GFS paper appeared in 2003)
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Hadoop becomes a Lucene subproject
• Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
• Jan 2008 – Hadoop becomes an Apache Top-Level Project
• Jul 2008 – A 4000-node test cluster
• Sept 2008 – Hive becomes a Hadoop subproject
• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
• June 2009 – On June 10, 2009, Yahoo! made the source code of the version of Hadoop it runs in production available.
• In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On July 27, 2011, they announced that the data had grown to 30 PB.
• Hadoop is an open-source implementation based on the Google File System (GFS) and MapReduce from Google
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005
• Hadoop was donated to Apache in 2006
Hadoop offers
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination

Programmers no longer need to worry about:
• Where is the file located?
• How to handle failures & data loss?
• How to divide computation?
• How to program for scaling?
Typical Hadoop Cluster
• An aggregation switch connecting per-rack switches
• 40 nodes/rack, 1000-4000 nodes per cluster
• 1 Gbps bandwidth within a rack, 8 Gbps out of the rack
• Node specs (Yahoo terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
A real-world example: The New York Times
• Goal: Make the entire archive of articles available online: 11 million articles, dating back to 1851
• Task: Convert 4 TB of TIFF images to PDF files
• Solution: Used Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3)
• Time: < 24 hours
• Cost: $240
The Basic Hadoop Components
• Hadoop Common – libraries and utilities
• Hadoop Distributed File System (HDFS) – a distributed file system
• Hadoop MapReduce – a programming model for large-scale data processing
• Hadoop YARN – a resource-management and scheduling platform
Hadoop Stack
(Figure: the computation layer, MapReduce/YARN, sits on top of the storage layer, HDFS)
Applications and Frameworks
• HBase – a scalable, distributed database with support for large tables. A column-oriented database management system / key-value store, based on Google's Bigtable
• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying of data using a SQL-like language called HiveQL
• Pig – a high-level data-flow language and execution framework for parallel computation. Uses the language Pig Latin
• Spark – a fast and general compute engine for Hadoop data, with a wide range of applications: ETL, machine learning, stream processing, and graph analytics (a minimal sketch using Spark's Java API follows below)
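
As a rough illustration of what running an analysis on Hadoop data with Spark looks like, here is a minimal word count using Spark's Java API. This is a hedged sketch, not code from these slides: Spark 2.x is assumed, and the HDFS input/output paths are hypothetical.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkWordCount");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read text files that already live in HDFS (hypothetical path).
    JavaRDD<String> lines = sc.textFile("hdfs:///user/hduser/input");

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into words
        .mapToPair(word -> new Tuple2<>(word, 1))                      // emit (word, 1)
        .reduceByKey((a, b) -> a + b);                                 // sum counts per word

    counts.saveAsTextFile("hdfs:///user/hduser/output-spark");         // hypothetical output path
    sc.stop();
  }
}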
MapReduce Framework
• Typically the compute and storage nodes are the same
• MapReduce tasks and HDFS run on the same nodes
• Tasks can therefore be scheduled on nodes where the data is already present (data locality)
Original MapReduce Framework
• A single master JobTracker
• The JobTracker schedules, monitors, and re-executes failed tasks
• One slave TaskTracker per cluster node
• The TaskTracker executes tasks as requested by the JobTracker
Original HDFS Design
• A single NameNode – a master server that manages the file system namespace and regulates access to files by clients
• Multiple DataNodes – typically one per node in the cluster. Functions:
– Manage storage
– Serve read/write requests from clients
– Block creation, deletion, and replication based on instructions from the NameNode
Hadoop 1.0 and Hadoop 2.0 (YARN)
What's wrong with the original MapReduce?
• Scaling beyond ~4,000 nodes
• No high availability (HA) for the JobTracker
• Poor resource utilization
YARN (Yet Another Resource Negotiator)
Main idea – separate the resource-management and job scheduling/monitoring functions of the JobTracker into separate daemons:

• Global ResourceManager (RM) – a highly available scheduler
• ApplicationMaster (AM) – one for each application; a framework-specific entity tasked with negotiating resources from the RM
Resource Manager
The ResourceManager is a pure scheduler:
• responsible for allocating resources to running applications
• no monitoring or tracking of application status; it relies on heartbeats from the NodeManagers
• no guarantee of restarting tasks that fail due to application or hardware failures
• runs on the master node

The scheduler performs its scheduling by using the resource-container abstraction.
NodeManager and ApplicationMaster
The NodeManager is a generalized TaskTracker:
• one per-machine slave, running on the slave nodes
• launches containers and kills them when needed
• monitors their resource usage (CPU, memory, disk, network)
• reports that usage to the ResourceManager

The ApplicationMaster:
• one per application
• negotiates resources with the RM
YARN Architecture
Resources Model
An application can request resources via the ApplicationMaster with highly specific requirements, such as:
• Resource name (including hostname, rack name, and possibly complex network topologies)
• Amount of memory
• CPUs (number/type of cores)
• Eventually, resources like disk/network I/O, GPUs, etc.

The Scheduler responds to a resource request by granting a container, which satisfies the requirements (a hedged request sketch follows below).
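
As a hedged illustration of such a request, here is a sketch using YARN's AMRMClient API from inside an ApplicationMaster. The memory size, core count, host name, and rack name are invented example values, and error handling plus the allocate() heartbeat loop are omitted.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ResourceRequestSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register this (hypothetical) ApplicationMaster with the ResourceManager.
    rmClient.registerApplicationMaster("", 0, "");

    // Ask for 1024 MB of memory and 2 virtual cores (example values).
    Resource capability = Resource.newInstance(1024, 2);
    Priority priority = Priority.newInstance(0);

    // Optionally constrain the request to specific hosts/racks (hypothetical names).
    String[] nodes = { "worker-01.example.com" };
    String[] racks = { "/rack1" };

    rmClient.addContainerRequest(new ContainerRequest(capability, nodes, racks, priority));
    // Granted containers would come back through subsequent rmClient.allocate(...) calls.
  }
}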
Container
The YARN container launch specification API is platform agnostic and contains:
• The command line to launch the process within the container
• Environment variables
• Local resources necessary on the machine prior to launch, such as jars, shared objects, auxiliary data files, etc.
• Security-related tokens
This design allows the ApplicationMaster to work with the NodeManager to launch containers ranging from simple shell scripts to C/Java/Python processes on Unix/Windows to full-fledged virtual machines. A hedged launch sketch follows below.
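
A rough sketch, under the same assumptions as above, of building a launch specification with ContainerLaunchContext and starting it on a NodeManager via NMClient. The shell command is a placeholder, and the local resources, environment, service data, tokens, and ACLs are intentionally left empty.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerLaunchSketch {
  // 'container' would be one of the containers granted by the ResourceManager.
  static void launch(Container container) throws Exception {
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(new YarnConfiguration());
    nmClient.start();

    ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
        Collections.emptyMap(),                        // local resources (jars, files, ...)
        Collections.emptyMap(),                        // environment variables
        Collections.singletonList("echo hello-yarn"),  // command line (placeholder)
        null,                                          // service data
        null,                                          // security tokens
        null);                                         // ACLs

    nmClient.startContainer(container, ctx);
  }
}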
Basic Hadoop API*
• Mapper
– void setup(Mapper.Context context)
  Called once at the beginning of the task
– void map(K key, V value, Mapper.Context context)
  Called once for each key/value pair in the input split
– void cleanup(Mapper.Context context)
  Called once at the end of the task
• Reducer/Combiner
– void setup(Reducer.Context context)
  Called once at the start of the task
– void reduce(K key, Iterable<V> values, Reducer.Context context)
  Called once for each key
– void cleanup(Reducer.Context context)
  Called once at the end of the task

*Note that there are two versions of the API!
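
To make these hooks concrete, here is a minimal word-count Mapper and Reducer written against the newer org.apache.hadoop.mapreduce API. This is an illustrative sketch, not code from the slides; the class names and the whitespace tokenization are arbitrary choices.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountClasses {

  // map() is called once per input record: emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce() is called once per key: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}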


Basic Hadoop API*
• Partitioner
– int getPartition(K key, V value, int numPartitions)
  Returns the partition number, given the total number of partitions
• Job
– Represents a packaged Hadoop job for submission to the cluster
– Need to specify input and output paths
– Need to specify input and output formats
– Need to specify mapper, reducer, combiner, partitioner classes
– Need to specify intermediate/final key/value classes
– Need to specify the number of reducers (but not mappers, why?)
– Don't depend on defaults!

*Note that there are two versions of the API!
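
A hedged driver sketch showing how a Job is packaged for submission: paths, formats, mapper/reducer/combiner/partitioner classes, key/value classes, and the number of reducers. The input/output paths and the reducer count are invented example values, and the classes refer to the word-count sketch above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Mapper, combiner, reducer, partitioner classes (the combiner reuses the reducer).
    job.setMapperClass(WordCountClasses.TokenizerMapper.class);
    job.setCombinerClass(WordCountClasses.IntSumReducer.class);
    job.setReducerClass(WordCountClasses.IntSumReducer.class);
    job.setPartitionerClass(HashPartitioner.class);

    // Input/output formats and key/value classes.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The number of reducers is set explicitly; the number of mappers
    // follows from the number of input splits.
    job.setNumReduceTasks(2);

    FileInputFormat.addInputPath(job, new Path("/user/hduser/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hduser/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}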


Basic Cluster Components*
• One of each:
– Namenode (NN): master node for HDFS
• Set of each per slave machine:
– Datanode (DN): serves HDFS data blocks
NameNode Metadata
• Metadata in memory
– The entire metadata is kept in main memory
– No demand paging of metadata
• Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A transaction log
– Records file creations, file deletions, etc.
Namenode Responsibilities
• Managing the file system namespace:
– Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
• Coordinating file operations:
– Directs clients to datanodes for reads and writes
– No data is moved through the namenode
• Maintaining overall health:
– Periodic communication with the datanodes
– Block re-replication and rebalancing
– Garbage collection
DataNode
• A block server
– Stores data in the local file system (e.g. ext3)
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to clients
• Block report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
– Forwards data to other specified DataNodes
HDFS Working Flow
(Figure: the HDFS client sends (file name, block id) to the HDFS namenode, which maintains the file namespace and answers with (block id, block location); the client then sends (block id, byte range) to an HDFS datanode and receives the block data. The namenode also sends instructions to the datanodes and receives datanode state reports. Each datanode stores its blocks in the local Linux file system. Adapted from (Ghemawat et al., SOSP 2003).)
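
To see this flow from the client's side, here is a minimal read sketch using the HDFS FileSystem Java API. This is an assumption-based example: the hdfs://localhost/ URI matches the fs.defaultFS value used later in this deck, and the file path is hypothetical.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The client asks the namenode for metadata, then streams the
    // block data directly from the datanodes.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

    Path file = new Path("/user/hduser/input/sample.txt"); // hypothetical path
    try (InputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);      // dump the file to stdout
    }
  }
}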


NameNode Failure
• A single point of failure
• Transaction log stored in multiple directories
– A directory on the local file system
– A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job to the Hadoop cluster
   3a. Go back to Step 2 as needed
4. Retrieve data from HDFS
Recommended Workflow
• Example:
– Develop code in Eclipse on the host machine
– Build the distribution on the host machine
– Check out a copy of the code on the VM
– Copy (i.e., scp) jars over to the VM (in the same directory structure)
– Run the job on the VM
– Iterate
– …
– Commit code on the host machine and push
– Pull from inside the VM, verify
• Avoid using the UI of the VM
– Directly ssh into the VM
Hadoop Platforms
• Platforms: Unix and Windows
– Linux: the only supported production platform
– Other variants of Unix, like Mac OS X: can run Hadoop for development
– Windows + Cygwin: development platform (needs openssh)
• Java 8
– Java 1.8.x (aka 8.0.x, aka 8) is recommended for running Hadoop
Hadoop Installation
• Download a stable version of Hadoop:
– http://hadoop.apache.org/core/releases.html
• Untar the Hadoop file:
– tar xvfz hadoop-0.20.2.tar.gz
• Set JAVA_HOME in hadoop/conf/hadoop-env.sh:
– Mac OS: /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home (/Library/Java/Home)
– Linux: which java
• Environment variables:
– export PATH=$PATH:$HADOOP_HOME/bin
Hadoop Modes
• Standalone (or local) mode
– There are no daemons running and everything runs in a
single JVM. Standalone mode is suitable for running
MapReduce programs during development, since it is easy
to test and debug them.

• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, thus
simulating a cluster on a small scale.

• Fully distributed mode
– The Hadoop daemons run on a cluster of machines.
Pseudo Distributed Mode
• Create an RSA key to be used by Hadoop when ssh'ing to localhost:
– ssh-keygen -t rsa -P ""
– cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
– ssh localhost

• Configuration files
– core-site.xml
– mapred-site.xml
– yarn-site.xml
– hdfs-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Start Hadoop
• hdfs namenode -format
• bin/start-all.sh (or start-dfs.sh / start-yarn.sh)
• jps
• bin/stop-all.sh

• Web-based UI
– http://localhost:50070 (Namenode report)
Basic File Commands in HDFS
• hdfs dfs -cmd <args>
– legacy form: hadoop fs (or hadoop dfs)
• URI: //authority/path
– authority: hdfs://localhost:9000
• Adding files
– hdfs dfs -mkdir
– hdfs dfs -put
• Retrieving files
– hdfs dfs -get
• Deleting files
– hdfs dfs -rm
• hdfs dfs -help ls
Run WordCount
• Create an input directory in HDFS
• Run the wordcount example
– hadoop jar hadoop-examples-2.7.4.jar wordcount /user/hduser/input /user/hduser/output
• Check the output directory
– hdfs dfs -ls -R /user/hduser/output
– http://localhost:50070
Installation Tutorials
• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• http://oreilly.com/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html
• http://snap.stanford.edu/class/cs246-2011/hw_files/hadoop_install.pdf
