
Module-1

Understanding Big Data And Hadoop

www.edureka.co/big-data-and-hadoop
How it Works?

 Experienced Instructor
 Class Recording in LMS
 Live Online Class
 Module Wise Assessment
 In-class Questions
 Project Work
 Survey Feedback
 Verifiable Certificate
 24x7 Support
 Android & iOS App

Slide 2 www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1 » Understanding Big Data and Hadoop
 Module 2 » Hadoop Architecture and HDFS
 Module 3 » Hadoop MapReduce Framework
 Module 4 » Advanced MapReduce
 Module 5 » PIG
 Module 6 » HIVE
 Module 7 » Advanced HIVE and HBase
 Module 8 » Advanced HBase
 Module 9 » Processing Distributed Data with Apache Spark
 Module 10 » Oozie and Hadoop Project

Slide 3 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:

 Understand what Big Data is

 Analyse the limitations of the existing Data Analytics Architecture, and the solutions to them

 Understand what Hadoop is, and its features

 Describe the Hadoop Ecosystem

 Understand the Hadoop 2.x core components

 Perform Read and Write operations in Hadoop

 Understand the Rack Awareness concept

 Work on Edureka’s VM

Slide 4 www.edureka.co/big-data-and-hadoop
What is Big Data?
 Lots of Data (Terabytes or Petabytes)

 Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

 The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

[Figure: word cloud of Big Data terms: cloud, database, storage, analyze, statistics, information, No SQL, mobile, compression, processing, terabytes, support.]
Slide 5 www.edureka.co/big-data-and-hadoop
What is Big Data?
Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information:

 Amazon handles 15 million customer click-stream records per day to recommend products.
 The stock market generates about one terabyte of new trade data per day; stock-trading analytics use it to determine trends for optimal trades.
 294 billion emails are sent every day; services analyse this data to find the spam.

Slide 6 www.edureka.co/big-data-and-hadoop
Un-structured Data is Exploding
[Figure: chart of worldwide data growth, 2005 to 2015: un-structured data explodes while structured data grows slowly.]

 IDC (International Data Corporation) predicts that by 2020 the world’s data will have reached 40,000 EB, or 40 Zettabytes (ZB).

 The world’s information is doubling every two years. By 2020, there will be 5,200 GB of data for every person on Earth.

Slide 7 www.edureka.co/big-data-and-hadoop
IBM’s Definition of Big Data
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

IBM characterizes Big Data along four dimensions, the “four V’s”:

 VOLUME: the sheer scale of the data
 VELOCITY: the speed at which data is generated and must be processed
 VARIETY: the many forms data takes, e.g. web logs, audio, images, videos, sensor data
 VERACITY: the uncertainty and noise in the data (illustrated on the slide by summary statistics: Min, Max, Mean, SD)

Slide 8 www.edureka.co/big-data-and-hadoop
Annie’s Introduction

Hello there!
My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Slide 9 www.edureka.co/big-data-and-hadoop
Annie’s Question

Map the following to the corresponding data type:

» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM, etc.)

Slide 10 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM, etc.) → Structured data

Slide 11 www.edureka.co/big-data-and-hadoop
Further Reading
More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Slide 12 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios
 Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

 Telecommunications

» Customer Churn Prevention


» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Slide 13 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios (Contd.)
 Government

» Fraud Detection and Cyber Security


» Welfare Schemes
» Justice

 Healthcare and Life Sciences

» Health Information Exchange


» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Slide 14 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios (Contd.)
 Banks and Financial services

» Modeling True Risk


» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

 Retail

» Point of Sales Transaction Analysis


» Customer Churn Analysis
» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Slide 15 www.edureka.co/big-data-and-hadoop
Hidden Treasure
 Insight into data can provide Business Advantage.

 Some key early indicators can mean fortunes to business.

 More data allows more precise analysis.

Case Study: Sears Holding Corporation
*Sears was using traditional systems, such as Oracle Exadata, Teradata and SAS, to store and process customer activity and sales data.
Slide 16 http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038? www.edureka.co/big-data-and-hadoop
Limitations of Existing Data Analytics Architecture

The existing architecture moves data up a stack:

Instrumentation → Collection → Storage-only Grid (original raw data, mostly append) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps

Its limitations:

1. Can’t explore the original high-fidelity raw data: a meagre 10% of the ~2 PB of data is available for BI, while 90% of the ~2 PB sits archived.
2. Moving data to compute doesn’t scale.
3. Premature data death: raw data is archived or discarded before its full value is extracted.
Slide 17 www.edureka.co/big-data-and-hadoop
Solution: A Combined Storage Computer Layer
Replace the storage-only grid with a layer that both stores and processes data:

Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append) → RDBMS (aggregated data) → BI Reports + Interactive Apps

The benefits:

1. Data exploration & advanced analytics on the raw data itself.
2. Scalable throughput for ETL & aggregation.
3. Keep data alive forever: no data archiving, and the entire ~2 PB is available for processing.

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
Slide 18 www.edureka.co/big-data-and-hadoop
Why DFS?
Read 1 TB of data:

                   1 Machine       10 Machines
  I/O channels     4               4 per machine
  Channel speed    100 MB/s        100 MB/s
  Time to read     43 minutes      4.3 minutes
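As a sanity check, here is the arithmetic behind those numbers (a sketch; the slide appears to treat 1 TB as 1024 × 1024 MB and round):

MB_PER_TB = 1024 * 1024      # binary TB, which matches the slide's 43-minute figure
CHANNELS = 4                 # I/O channels per machine
MB_PER_SEC = 100             # throughput of each channel

def read_minutes(machines):
    """Minutes to read 1 TB if all channels stream in parallel."""
    total_throughput = machines * CHANNELS * MB_PER_SEC   # MB/s
    return MB_PER_TB / total_throughput / 60

print(read_minutes(1))    # ~43.7, the slide's "43 minutes"
print(read_minutes(10))   # ~4.4, the slide's "4.3 minutes"

This is why a distributed file system helps: aggregate I/O bandwidth scales with the number of machines reading in parallel.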


Slide 21 www.edureka.co/big-data-and-hadoop
What is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters
of commodity computers using a simple programming model.

 It is an open-source data management framework with scale-out storage and distributed processing.

Slide 22 www.edureka.co/big-data-and-hadoop
Hadoop Key Characteristics

Hadoop’s key features:

 Reliable
 Economical
 Flexible
 Scalable

Slide 23 www.edureka.co/big-data-and-hadoop
Hadoop Design Principles
 Facilitate the storage and processing of large and/or rapidly growing data sets

» Structured and unstructured data


» Simple programming models

 Scale-Out rather than Scale-Up

 Bring code to data, rather than data to code

 High scalability and availability

 Use commodity hardware

 Fault-tolerance

Slide 24 www.edureka.co/big-data-and-hadoop
Hadoop – It’s about Scale and Structure

                RDBMS                               HADOOP

Data Types      Structured                          Multi and unstructured
Processing      Limited, no data processing         Processing coupled with data
Governance      Standards & structured              Loosely structured
Schema          Required on write                   Required on read
Speed           Reads are fast                      Writes are fast
Cost            Software license                    Support only
Resources       Known entity                        Growing, complexities, wide
Best Fit Use    OLTP, complex ACID transactions,    Data discovery; processing
                operational data store              unstructured data; massive
                                                    storage/processing
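To make the Schema row concrete: an RDBMS validates structure when rows are written (schema-on-write), while Hadoop stores raw bytes and imposes structure only when the data is read (schema-on-read). A small illustration with made-up records:

# Raw lines land in storage exactly as they arrive; nothing is validated on write.
raw_records = [
    "alice,23,london",
    "bob,,paris",        # a malformed field is still stored, and handled at read time
]

def apply_schema(line):
    """Impose the (name, age, city) schema at read time."""
    name, age, city = line.split(",")
    return {"name": name, "age": int(age) if age else None, "city": city}

for line in raw_records:
    print(apply_schema(line))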

Slide 25 www.edureka.co/big-data-and-hadoop
Annie’s Question

Hadoop is a framework that allows for the distributed


processing of:
» Small Data Sets
» Large Data Sets

Slide 26 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. Large Data Sets.
Hadoop is also capable of processing small data sets. However, to experience the true power of Hadoop, you need data in terabytes: that is where an RDBMS takes hours and fails, whereas Hadoop does the same in a couple of minutes.

Slide 27 www.edureka.co/big-data-and-hadoop
Hadoop Ecosystem
Hadoop 1.0 (bottom to top):
 Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
 HDFS (Hadoop Distributed File System)
 MapReduce Framework, with HBase alongside
 Hive (DW System), Pig Latin (Data Analysis), Mahout (Machine Learning)
 Apache Oozie (Workflow)

Hadoop 2.0 (bottom to top):
 Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
 HDFS (Hadoop Distributed File System)
 YARN (Cluster Resource Management)
 MapReduce Framework and other YARN frameworks (MPI, Graph), with HBase alongside
 Hive (DW System), Pig Latin (Data Analysis), Mahout (Machine Learning), and others
 Apache Oozie (Workflow)
Slide 28 www.edureka.co/big-data-and-hadoop
Machine Learning with Mahout
Write intelligent applications using Apache Mahout

[Figure: LinkedIn recommendations: Hadoop and MapReduce magic in action.]

https://mahout.apache.org/general/powered-by-mahout.html

Slide 29 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Core Components
 Storage: HDFS
» NameNode (Master)
» Secondary NameNode
» DataNode (Slave)

 Processing: YARN
» Resource Manager (Master)
» Node Manager (Slave)
Slide 30 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Core Components ( Contd.)

In a Hadoop 2.x cluster:

 YARN: one Resource Manager, plus a Node Manager running on every slave node
 HDFS: one NameNode, plus a DataNode running on every slave node
Slide 31 www.edureka.co/big-data-and-hadoop
Main Components of HDFS
 NameNode:

» Master of the system


» Maintains and manages the blocks which are present on
the DataNodes

 DataNodes:

» Slaves which are deployed on each machine and provide


the actual storage
» Responsible for serving read and write requests for the
clients

Slide 32 www.edureka.co/big-data-and-hadoop
NameNode Metadata
 Meta-data in Memory
» The entire metadata is in main memory
» No demand paging of FS meta-data

 Types of Metadata
» List of files
» List of Blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor

 A Transaction Log
» Records file creations, file deletions, etc.

[Figure: the NameNode stores metadata only; it keeps track of the overall file directory structure and the placement of each data block, e.g. /user/doug/hinfo → blocks 1, 3, 5 and /user/doug/pdetail → blocks 4, 2.]
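A toy model of what the NameNode keeps in memory (a sketch only; the real NameNode persists an FsImage plus an edit log, and the DataNode names below are hypothetical):

# File -> ordered list of block IDs (paths taken from the slide's example).
file_to_blocks = {
    "/user/doug/hinfo":   [1, 3, 5],
    "/user/doug/pdetail": [4, 2],
}

# Block ID -> DataNodes currently holding a replica (replication factor 3).
block_to_datanodes = {
    1: ["dn1", "dn4", "dn7"],
    3: ["dn2", "dn5", "dn8"],
    5: ["dn3", "dn6", "dn9"],
}

# Transaction log: records file creations, deletions, etc.
edit_log = []

def create_file(path):
    file_to_blocks[path] = []
    edit_log.append(("CREATE", path))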

Slide 33 www.edureka.co/big-data-and-hadoop
File Blocks
 By default, the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.

Hadoop 2.x: a 200 MB file (abc.txt) is split into
» Block 1: 128 MB
» Block 2: 72 MB

Hadoop 1.x: a 200 MB file (emp.dat) is split into
» Block 1: 64 MB
» Block 2: 64 MB
» Block 3: 64 MB
» Block 4: 8 MB

 Why is the block size so large?

» The main reason for keeping HDFS blocks large is to reduce the cost of seek time relative to transfer time.

» A large block size also keeps the number of blocks per file small, which limits the NameNode’s metadata memory footprint while still using storage space properly.
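A quick way to reproduce the splits above (a sketch; in reality the HDFS client performs the split as it streams data in):

def split_into_blocks(file_mb, block_mb):
    """Return the block sizes HDFS would allocate for a file of file_mb MB."""
    blocks = [block_mb] * (file_mb // block_mb)
    if file_mb % block_mb:
        blocks.append(file_mb % block_mb)   # the last block holds only the remainder
    return blocks

print(split_into_blocks(200, 128))   # Hadoop 2.x: [128, 72]
print(split_into_blocks(200, 64))    # Hadoop 1.x: [64, 64, 64, 8]

Note that the final block occupies only as much disk space as it actually contains; the 72 MB block does not waste a full 128 MB of storage.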

Slide 34 www.edureka.co/big-data-and-hadoop
Replication and Rack Awareness

[Figure series: seven step-by-step diagrams showing how HDFS replicates each block across DataNodes and racks.]

Slide 41 www.edureka.co/big-data-and-hadoop
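The policy behind these figures is HDFS’s default block placement: with the default replication factor of 3, the first replica is written to the client’s own DataNode (or a random node), the second to a node on a different rack, and the third to a different node on that same remote rack. A rough sketch, with a hypothetical two-rack topology:

import random

# Hypothetical rack topology: rack -> DataNodes.
racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}

def place_replicas(writer_node, writer_rack):
    """Simplified version of the default placement for replication factor 3."""
    first = writer_node                                    # replica 1: local node
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)   # replicas 2 and 3: one remote rack
    return [first, second, third]

print(place_replicas("dn1", "rack1"))   # e.g. ['dn1', 'dn5', 'dn4']

Putting two replicas on a single remote rack balances fault tolerance (the data survives the loss of an entire rack) against write cost (only one cross-rack transfer per block).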
Pipelined Write
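[Figure: the HDFS write pipeline.]

The pipeline works as follows: the client streams a block to the first DataNode, which forwards each packet to the second, which forwards it to the third; acknowledgements travel back along the same chain. A toy rendering of the idea (node names hypothetical):

def pipelined_write(block, pipeline):
    """Each DataNode stores the data, then forwards it to the next node downstream."""
    for i, node in enumerate(pipeline):
        print(f"{node}: stored {len(block)} bytes")
        if i + 1 < len(pipeline):
            print(f"{node} -> {pipeline[i + 1]}: forward")
    # Acks flow in the reverse direction, back to the client.
    print("ack: " + " <- ".join(["client"] + pipeline))

pipelined_write(b"block data ...", ["dn1", "dn5", "dn4"])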

Slide 42 www.edureka.co/big-data-and-hadoop
Client reading file from HDFS
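[Figure: a client reading a file from HDFS.]

The read path: the client asks the NameNode for the list of blocks and the DataNodes holding each one, then streams every block directly from a nearby DataNode; the NameNode never sits in the data path. A minimal sketch (same toy metadata shape as slide 33, hypothetical node names):

file_to_blocks = {"/user/doug/hinfo": [1, 3, 5]}
block_to_datanodes = {1: ["dn1", "dn4"], 3: ["dn2", "dn5"], 5: ["dn3", "dn6"]}

def read_file(path):
    for block_id in file_to_blocks[path]:            # 1. ask the NameNode for block locations
        node = block_to_datanodes[block_id][0]       # 2. pick the closest replica (simplified)
        print(f"fetching block {block_id} from {node}")   # 3. stream directly from the DataNode

read_file("/user/doug/hinfo")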

Slide 43 www.edureka.co/big-data-and-hadoop
Annie’s Question

In HDFS, blocks of a file are written in parallel; however, the replication of each block is done sequentially:

a. TRUE
b. FALSE

Slide 44 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Ans. True. A file is divided into blocks, and these blocks are written in parallel, but the replication of each block happens in sequence.

Slide 45 www.edureka.co/big-data-and-hadoop
Annie’s Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?

a. It can read up to the last block that was successfully written.
b. It can read up to the last bit that was successfully written.
c. It will throw an exception.
d. It cannot see the file until copying has finished.

Slide 46 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Ans. Option (a).
The client can read up to the last successfully written data block.

Slide 47 www.edureka.co/big-data-and-hadoop
Other Hadoop Distributions

 Cloudera distributes a platform of open-source projects called Cloudera’s Distribution including Apache Hadoop, or CDH. It also offers architectural services and technical support for Hadoop clusters in development or in production.

 Hortonworks, another major player in the Hadoop market, has the largest number of committers and code contributors for the Hadoop ecosystem components. It provides expert technical support, training and partner-enablement services for both end-user organizations and technology vendors.

 The MapR Distribution including Apache Hadoop provides an enterprise-grade distributed data platform to reliably store and process big data. MapR’s Apache Hadoop distribution claims to provide full data protection, no single point of failure, and improved performance.

Slide 48 www.edureka.co/big-data-and-hadoop
Further Reading
 Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

 Apache Hadoop HDFS Architecture


http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Slide 50 www.edureka.co/big-data-and-hadoop
A tour of Edureka’s Virtual Machine

Slide 51 www.edureka.co/big-data-and-hadoop
Edureka VM
Pre-requisites

Installation flow:

1. Does your machine satisfy the pre-requisite condition?
» NO → Use remote login with Putty (Hadoop-2.2.0) instead.
» YES → Proceed with the Edureka VM installation.

2. Has the VM file downloaded completely?
» YES → Proceed with the steps mentioned in the “Edureka VM Installation” document.
» NO → Download it in parts using the “Edureka VM Split Files” document.

3. Import the Edureka VM into Virtual Box, then check whether the daemons are running.
» YES → Refer to README.txt on the Edureka VM Desktop.
» NO → Run the commands to start the daemons manually.
Slide 52 www.edureka.co/big-data-and-hadoop
Pre-requisites

 Pre-requisites to install the Edureka VM:

» Minimum 4 GB RAM
» Dual Core processor or above

If you don’t have the above-mentioned hardware, refer to slide 57.

Slide 53 www.edureka.co/big-data-and-hadoop
Installing Edureka VM
 With reference to the “Edureka VM Installation” document present in the LMS, you can install the Edureka VM.

We suggest using a download manager while downloading the Edureka VM, to avoid any network issues that may occur. You can download one from here; it is an open-source tool available for different platforms.

Slide 54 www.edureka.co/big-data-and-hadoop
Challenges Faced During Installation of VM
 The file may not get downloaded completely, due to its huge size, without displaying any error. After downloading the VM, always check its file size: it should be 4.5 GB.

» Download complete → Proceed with the steps mentioned in the “Edureka VM Installation” document.
» Download incomplete → Download the Edureka VM using the “Edureka VM Split Files” document present in the LMS.
Slide 55 www.edureka.co/big-data-and-hadoop
Edureka VM Remote Server
 With reference to the “Remote Login Using Putty – Hadoop 2.2.0” document present in the LMS, you can access our Edureka VM Remote Server.

If your system does not meet the minimum requirements, please refer to the document “Remote Login using Putty Hadoop-3.0” present in the LMS.

Slide 56 www.edureka.co/big-data-and-hadoop
Challenges Faced While Importing VM in Virtual Box
 Below is the error you may face while importing the Edureka VM into Virtual Box if the Edureka VM file has not been downloaded completely. [Error screenshot lost in this export.]

In case you are using split files, combine all the downloaded split files using the hjsplit application before importing them into Virtual Box.

Slide 57 www.edureka.co/big-data-and-hadoop
Challenges Faced After Importing VM in Virtual Box
 After importing VM in Virtual Box, start the VM and check if all the daemons are up and running by using:

Command: sudo jps

 If the daemons are not up and running, execute the commands below:

Command: sudo service hadoop-master stop

Command: sudo service hadoop-master start

Command: hadoop dfsadmin -safemode leave

 After executing the above commands, check again that all the daemons are running fine:

Command: sudo jps

On a healthy Hadoop 2.x setup, jps should list daemons such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact list can vary with the VM’s packaging).

Slide 58 www.edureka.co/big-data-and-hadoop
Edureka VM Desktop
 Please refer to the README.txt file on the Desktop of the Edureka VM to learn more about it

Slide 59 www.edureka.co/big-data-and-hadoop
Assignment
Referring to the documents present in the LMS under Assignment, solve the problem below:

How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?

Slide 60 www.edureka.co/big-data-and-hadoop
Pre-work for next Class
Set up the Hadoop development environment using the documents present in the LMS:

» Set up Edureka’s Virtual Machine
» Execute basic Linux commands
» Execute HDFS hands-on commands

Attempt the Module-1 assignments present in the LMS.

Slide 61 www.edureka.co/big-data-and-hadoop
Agenda for Next Class
 Hadoop 2.x Cluster Architecture

 Hadoop cluster modes

 Basic Hadoop commands

 Hadoop 2.x configuration files and their parameters

 Password-less SSH on Hadoop cluster

 Dump of a MapReduce program

 Data loading techniques

Slide 62 www.edureka.co/big-data-and-hadoop
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.

Slide 63 www.edureka.co/big-data-and-hadoop
