www.edureka.co/big-data-and-hadoop
How Does It Work?
Course Topics
» Module 1: Understanding Big Data and Hadoop
» Module 2: Hadoop Architecture and HDFS
» Module 3: Hadoop MapReduce Framework
» Module 4: Advanced MapReduce
» Module 5: PIG
» Module 6: HIVE
» Module 7: Advanced HIVE and HBase
» Module 8: Advanced HBase
» Module 9: Processing Distributed Data with Apache Spark
» Module 10: Oozie and Hadoop Project
Objectives
At the end of this module, you will be able to:
» Understand what Big Data is
» Describe the Hadoop Ecosystem
» Work on Edureka’s VM
What is Big Data?
Lots of Data (Terabytes or Petabytes)
Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information:
» Amazon handles 15 million customer click-stream records per day to recommend products.
» Stock markets generate about one terabyte of new trade data per day; stock trading analytics use it to determine trends for optimal trades.
» 294 billion emails are sent every day; services analyse this data to find spam.
Un-structured Data is Exploding
[Chart: growth of unstructured data (in exabytes), 2005–2015]
By 2020, IDC (International Data Corporation) predicts the number will have reached 40,000 EB, or 40 Zettabytes (ZB). The world’s information is doubling every two years. By 2020, there will be 5,200 GB of data for every person on Earth.
IBM’s Definition of Big Data
IBM characterises Big Data by three key characteristics: Volume, Velocity, and Variety.
http://www-01.ibm.com/software/data/bigdata/
Annie’s Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Further Reading
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
http://www-01.ibm.com/software/data/bigdata/
Common Big Data Customer Scenarios
Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
Telecommunications
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Government
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
Banks and Financial services
Retail
http://wiki.apache.org/hadoop/PoweredBy
Hidden Treasure
Insight into data can provide a business advantage.
Some key early indicators can mean fortunes to a business.
Case Study: Sears Holdings Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
Source: http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Limitations of Existing Data Analytics Architecture
» 90% of the ~2 PB of collected data is archived.
» Moving data to compute doesn’t scale.
[Diagram: Instrumentation → Collection → Storage]
Solution: A Combined Storage Compute Layer
[Diagram: Instrumentation → Collection → a combined storage and processing layer (mostly append), supporting 1. data exploration & advanced analytics and 2. BI reports + interactive apps]
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre ~10% available with the existing non-Hadoop solutions.
Why DFS?
Read 1 TB of data:
» 1 Machine: 4 I/O channels, each channel at 100 MB/s → ~43 minutes
» 10 Machines: 4 I/O channels each, 100 MB/s per channel → ~4.3 minutes
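The arithmetic behind these figures can be checked with a short script. It uses binary units (1 TB = 2^20 MB), which is how the ~43-minute figure on the slide arises:

```python
def read_time_minutes(machines, data_mb=1024 * 1024, channels=4, mb_per_s=100):
    """Minutes needed to read data_mb megabytes (default: 1 TB in binary units)
    when the read is spread evenly across the given number of machines."""
    throughput = machines * channels * mb_per_s  # aggregate MB/s for the cluster
    return data_mb / throughput / 60

print(f"{read_time_minutes(1):.1f}")   # one machine: ~43.7 minutes
print(f"{read_time_minutes(10):.1f}")  # ten machines: ~4.4 minutes
```

Ten machines reading in parallel cut the time by a factor of ten, which is the whole motivation for a distributed file system.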
Hadoop Key Characteristics
» Reliable
» Economical
» Flexible
» Scalable
Hadoop Design Principles
» Facilitate the storage and processing of large and/or rapidly growing data sets
» Fault tolerance
Hadoop – It’s about Scale and Structure
RDBMS HADOOP
Annie’s Question
Annie’s Answer
Ans. Large data sets.
Hadoop is also capable of processing small data sets; however, to experience its true power, you need data in terabytes. This is where an RDBMS takes hours or fails outright, whereas Hadoop does the same job in a couple of minutes.
Hadoop Ecosystem
[Diagram: the Hadoop 1.0 and Hadoop 2.0 ecosystem stacks, each topped by Apache Oozie (workflow) and ingesting unstructured or semi-structured data alongside structured data]
Machine Learning with Mahout
Write intelligent applications using Apache Mahout.
Example: LinkedIn Recommendations, Hadoop and MapReduce magic in action.
https://mahout.apache.org/general/powered-by-mahout.html
Hadoop 2.x Core Components
Storage – HDFS:
» NameNode (master)
» Secondary NameNode
» DataNode (slave)
Processing – YARN:
» Resource Manager (master)
» Node Manager (slave)
Hadoop 2.x Core Components (Contd.)
Main Components of HDFS
NameNode: the master daemon; maintains and manages the filesystem namespace and metadata.
DataNodes: slave daemons; store the actual data blocks and serve read/write requests.
NameNode Metadata
Meta-data in Memory
» The entire metadata is kept in main memory (the NameNode stores metadata only)
» No demand paging of FS metadata
Types of Metadata
» List of files
» List of blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor
» A transaction log
Example METADATA: /user/doug/hinfo -> 1 3 5, /user/doug/pdetail -> 4 2
NameNode: keeps track of the overall file directory structure and the placement of data blocks.
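The two mappings above can be modelled as a pair of dictionaries. Only the file-to-block mapping comes from the slide; the block locations and DataNode names below are illustrative:

```python
# Toy model of NameNode metadata: file path -> ordered list of block IDs,
# and block ID -> DataNodes holding its replicas (names are hypothetical).
namespace = {
    "/user/doug/hinfo": [1, 3, 5],
    "/user/doug/pdetail": [4, 2],
}
block_locations = {
    1: ["dn1", "dn2", "dn3"],
    2: ["dn2", "dn3", "dn4"],
    3: ["dn1", "dn3", "dn4"],
    4: ["dn1", "dn2", "dn4"],
    5: ["dn2", "dn3", "dn4"],
}

def datanodes_for(path):
    """Return, per block of the file, the DataNodes a reader could contact."""
    return [block_locations[b] for b in namespace[path]]

print(datanodes_for("/user/doug/hinfo"))
```

Note that none of the file data itself passes through the NameNode; it only answers "which blocks, on which DataNodes".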
File Blocks
By default, the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.
abc.txt (200 MB) in Hadoop 2.x: Block 1 – 128 MB, Block 2 – 72 MB
emp.dat (200 MB) in Hadoop 1.x: Block 1 – 64 MB, Block 2 – 64 MB, Block 3 – 64 MB, Block 4 – 8 MB
» The main reason for having large HDFS blocks is to reduce the cost of seek time.
» The large block size also makes proper use of storage space while respecting the limit on the NameNode’s memory.
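The block splits above follow directly from integer division; a small sketch:

```python
def split_into_blocks(file_mb, block_mb):
    """Return the block sizes HDFS would use for a file of file_mb megabytes."""
    blocks = [block_mb] * (file_mb // block_mb)  # full-sized blocks
    if file_mb % block_mb:
        blocks.append(file_mb % block_mb)        # last block holds the remainder
    return blocks

print(split_into_blocks(200, 128))  # Hadoop 2.x -> [128, 72]
print(split_into_blocks(200, 64))   # Hadoop 1.x -> [64, 64, 64, 8]
```

A smaller default block size means more blocks per file, and therefore more metadata entries in the NameNode's memory.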
Replication and Rack Awareness
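For reference, HDFS's default placement policy with replication factor 3 puts the first replica on the writer's node, the second on a node in a different rack, and the third on a different node in that same remote rack. A minimal sketch with hypothetical node and rack names:

```python
import random

def place_replicas(writer, topology):
    """Sketch of HDFS default replica placement for replication factor 3.
    topology: dict mapping rack name -> list of node names."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer                                   # 1st replica: writer's node
    remote_rack = random.choice([r for r in topology if r != rack_of[writer]])
    second = random.choice(topology[remote_rack])    # 2nd replica: different rack
    third = random.choice(                           # 3rd replica: same remote
        [n for n in topology[remote_rack] if n != second])  # rack, different node
    return [first, second, third]

topo = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topo))
```

Keeping two replicas on one remote rack limits cross-rack traffic during the write, while still surviving the loss of an entire rack.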
Pipelined Write
Client reading file from HDFS
Annie’s Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a. It can read up to the last block that was successfully written.
b. It can read up to the last bit successfully written.
c. HDFS will throw an exception.
d. It cannot see the file until copying is finished.
Annie’s Answer
Other Hadoop Distributors
Further Reading
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
A tour of Edureka’s Virtual Machine
Edureka VM
[Flowchart: check the pre-requisites → if YES, proceed with the steps mentioned in the “Edureka VM Installation” document → import the Edureka VM into VirtualBox → check for daemons → if YES, refer to README.txt on the Edureka VM Desktop]
Installing Edureka VM
With reference to the “Edureka VM Installation” document present in the LMS, you can install the Edureka VM.
Challenges Faced During Installation of VM
Due to its huge size, the file may not get downloaded completely, without displaying any error.
Edureka VM Remote Server
With reference to the “Remote Login Using Putty – Hadoop 2.2.0” document present in the LMS, you can access our Edureka VM Remote Server.
Challenges Faced While Importing VM in Virtual Box
Below is the error you may face while importing the Edureka VM into VirtualBox if the Edureka VM file has not been downloaded completely.
Challenges Faced After Importing VM in Virtual Box
After importing the VM into VirtualBox, start the VM and check whether all the daemons are up and running by using:
After executing the above commands, check whether all the daemons are running fine by using:
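On a typical Hadoop 2.x installation (assuming the Hadoop sbin scripts are on the PATH; the exact commands on your VM may differ), a common sequence for starting and verifying the daemons is:

```shell
start-dfs.sh    # starts NameNode, SecondaryNameNode and DataNode
start-yarn.sh   # starts ResourceManager and NodeManager
jps             # lists the running Java daemons to verify they are up
```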
Edureka VM Desktop
Please refer to the README.txt file on the Desktop of the Edureka VM to learn more about it.
Assignment
Referring to the documents present in the LMS under “Assignment”, solve the problem below:
How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
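One way to sketch the arithmetic, assuming each DataNode matches the earlier “Why DFS?” example (4 I/O channels at 100 MB/s) and binary units (1 TB = 2^20 MB):

```python
import math

def datanodes_needed(data_tb, minutes, channels=4, mb_per_s=100):
    """DataNodes required to read data_tb terabytes within the given minutes,
    assuming each node contributes channels x mb_per_s of read throughput."""
    total_mb = data_tb * 1024 * 1024        # binary units: 1 TB = 2**20 MB
    node_throughput = channels * mb_per_s   # MB/s per DataNode
    return math.ceil(total_mb / (node_throughput * minutes * 60))

print(datanodes_needed(100, 5))  # -> 874
```

With decimal units (1 TB = 10^6 MB) the same formula gives 834 nodes, so state your unit convention when answering.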
Pre-work for next Class
Set up the Hadoop development environment using the documents present in the LMS.
Agenda for Next Class
Hadoop 2.x Cluster Architecture
Survey
Your feedback is important to us, be it a compliment, a suggestion, or a complaint. It helps us make the course better!
Please spare a few minutes to take the survey after the webinar.