These training presentations were produced within the Istanbul Big Data Training and Research Center Project (no. TR10/16/YNY/0036), carried out under the Istanbul Development Agency's 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
How to Scale for Big Data?
• Data volumes are massive
• Reliably storing PBs of data is challenging
• All kinds of failures occur: disk, hardware, and network failures
• The probability of failure simply increases with the number of machines …
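The claim above can be made concrete with a back-of-the-envelope calculation, assuming independent failures; the 0.1% per-machine failure rate used here is an illustrative assumption, not a measured figure:

```shell
# P(at least one of n machines fails) = 1 - (1 - p)^n,
# assuming independent failures with per-machine probability p.
# Illustrative assumption: p = 0.001 per machine per day.
for n in 1 100 10000; do
  awk -v n="$n" 'BEGIN { printf "%5d machines: %.3f\n", n, 1 - (1 - 0.001)^n }'
done
```

At 10,000 machines a failure somewhere becomes a near-certainty, which is why Hadoop treats failure as the normal case rather than the exception.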
Distributed processing is non-trivial
• How to assign tasks to different workers in an efficient way?
• What happens if tasks fail?
• How do workers exchange results?
• How to synchronize distributed tasks allocated to different workers?
One popular solution: Hadoop
[Figure: programmers facing two questions: "How to divide computation?" and "How to program for scaling?"]
Typical Hadoop Cluster
[Figure: racks of commodity servers connected by rack switches and an aggregation switch; each node provides both computation and storage]
Applications and Frameworks
• HBase – a scalable, distributed, column-oriented database management
system with support for large tables; a key-value store based on
Google's Bigtable
The ApplicationMaster
• One per application
• Negotiates resources with the ResourceManager (RM)
YARN Architecture
Resource model
An application can request resources via the
ApplicationMaster with highly specific requirements such as:
• Resource-name (including hostname, rackname and
possibly complex network topologies)
• Amount of Memory
• CPUs (number/type of cores)
• Eventually, other resources such as disk/network I/O, GPUs, etc.
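On the supply side, the pool each NodeManager offers against such requests is declared in configuration. A minimal sketch (the property names are the standard YARN ones; the values are illustrative):

```
<!-- yarn-site.xml: what one NodeManager advertises to the RM -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>   <!-- illustrative: 8 GB available for containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>      <!-- illustrative: 8 virtual cores -->
</property>
```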
HDFS architecture
[Figure: an application's HDFS client sends (file name, block id) requests to the HDFS namenode, whose file namespace maps names to blocks (e.g. /foo/bar to block 3df2); the namenode replies with (block id, block location). The client then requests (block id, byte range) from the HDFS datanodes and receives block data directly. The namenode sends instructions to the datanodes and collects datanode state; each datanode stores blocks in its local Linux file system.]
• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, thus
simulating a cluster on a small scale.
• Configuration Files
– core-site.xml
– mapred-site.xml
– yarn-site.xml
– hdfs-site.xml
<!-- core-site.xml -->
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<!-- yarn-site.xml -->
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
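The remaining two files from the list above are similarly small for a single-node setup; a minimal sketch (a replication factor of 1 is only appropriate in pseudo-distributed mode, where no second node exists to hold a replica):

```
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>   <!-- run MapReduce jobs on YARN -->
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>      <!-- single node: only one copy of each block -->
  </property>
</configuration>
```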
• Web-based UI
– http://localhost:50070 (Namenode report)
Basic File Commands in HDFS
• hdfs dfs -cmd <args>
– hadoop fs -cmd is an equivalent older form (hadoop dfs is deprecated)
• URI: scheme://authority/path
– authority: e.g. hdfs://localhost:9000
• Adding files
– hdfs dfs -mkdir
– hdfs dfs -put
• Retrieving files
– hdfs dfs -get
• Deleting files
– hdfs dfs -rm
• hdfs dfs -help ls
Run WordCount
• Create an input directory in HDFS
• Run the wordcount example
– hadoop jar hadoop-examples-2.7.4.jar wordcount /user/hduser/input /user/hduser/output
• Check the output directory
– hdfs dfs -ls -R /user/hduser/output
– http://localhost:50070
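Before running on a cluster, the map, shuffle/sort, and reduce structure of wordcount can be sanity-checked locally with standard Unix tools (the input text here is made up):

```shell
# map: emit one word per line; shuffle: sort groups equal keys together;
# reduce: count the occurrences of each key
printf 'big data is big\ndata is fun\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
```

Each output line is a (count, word) pair, which is the same kind of result the wordcount job writes to its output directory, one file per reducer.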
Installation Tutorials
• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• http://oreilly.com/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html
• http://snap.stanford.edu/class/cs246-2011/hw_files/hadoop_install.pdf