
Module-1

Understanding Big Data And Hadoop

www.edureka.co/big-data-and-hadoop
How it Works?

 Experienced Instructor
 Class Recording in LMS
 Live Online Class
 Module Wise Assessment
 In-class Questions
 Project Work
 Survey Feedback
 Verifiable Certificate
 24x7 Support
 Android & iOS App

Slide 2 www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1 » Understanding Big Data and Hadoop
 Module 2 » Hadoop Architecture and HDFS
 Module 3 » Hadoop MapReduce Framework
 Module 4 » Advanced MapReduce
 Module 5 » PIG
 Module 6 » HIVE
 Module 7 » Advanced HIVE and HBase
 Module 8 » Advanced HBase
 Module 9 » Processing Distributed Data with Apache Spark
 Module 10 » Oozie and Hadoop Project

Slide 3 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to:

 Understand what Big Data is

 Analyse the limitations of the existing Data Analytics Architecture, and the solutions to them

 Understand what Hadoop is, and its features

 Describe the Hadoop Ecosystem

 Understand the Hadoop 2.x core components

 Perform Read and Write operations in Hadoop

 Understand the Rack Awareness concept

 Work on Edureka’s VM

Slide 4 www.edureka.co/big-data-and-hadoop
What is Big Data?
 Lots of Data (Terabytes or Petabytes)

 Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.

 The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

[Figure: word cloud of Big Data terms: cloud, database, storage, analyze, statistics, information, No SQL, mobile, compression, processing, terabytes, support.]
Slide 5 www.edureka.co/big-data-and-hadoop
What is Big Data?
Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information:

 Amazon handles 15 million customer click-stream records per day to recommend products.
 The stock market generates about one terabyte of new trade data per day; stock-trading analytics use it to determine trends for optimal trades.
 294 billion emails are sent every day; services analyse this data to find the spam.

Slide 6 www.edureka.co/big-data-and-hadoop
Un-structured Data is Exploding
[Figure: chart of worldwide data growth, 2005 to 2015: un-structured data explodes while structured data grows slowly.]

 IDC (International Data Corporation) predicts that by 2020 the world’s data will have reached 40,000 EB, or 40 Zettabytes (ZB).

 The world’s information is doubling every two years. By 2020, there will be 5,200 GB of data for every person on Earth.

Slide 7 www.edureka.co/big-data-and-hadoop
IBM’s Definition of Big Data
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

IBM characterizes Big Data along four dimensions, the “four V’s”:

 VOLUME: the sheer scale of the data
 VELOCITY: the speed at which data is generated and must be processed
 VARIETY: the many forms data takes, e.g. web logs, audio, images, videos, sensor data
 VERACITY: the uncertainty and noise in the data (illustrated on the slide by summary statistics: Min, Max, Mean, SD)

Slide 8 www.edureka.co/big-data-and-hadoop
Annie’s Introduction

Hello there!
My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Slide 9 www.edureka.co/big-data-and-hadoop
Annie’s Question

Map the following to the corresponding data type:

» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM, etc.)

Slide 10 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM, etc.) → Structured data

Slide 11 www.edureka.co/big-data-and-hadoop
Further Reading
More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Slide 12 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios
 Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

 Telecommunications

» Customer Churn Prevention


» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Slide 13 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios (Contd.)
 Government

» Fraud Detection and Cyber Security


» Welfare Schemes
» Justice

 Healthcare and Life Sciences

» Health Information Exchange


» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Slide 14 www.edureka.co/big-data-and-hadoop
Common Big Data Customer Scenarios (Contd.)
 Banks and Financial services

» Modeling True Risk


» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

 Retail

» Point of Sales Transaction Analysis


» Customer Churn Analysis
» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Slide 15 www.edureka.co/big-data-and-hadoop
Hidden Treasure
 Insight into data can provide Business Advantage.

 Some key early indicators can mean fortunes to business.

 More data allows more precise analysis.

Case Study: Sears Holding Corporation
*Sears was using traditional systems, such as Oracle Exadata, Teradata and SAS, to store and process customer activity and sales data.
Slide 16 http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038? www.edureka.co/big-data-and-hadoop
Limitations of Existing Data Analytics Architecture

The existing architecture moves data up a stack:

Instrumentation → Collection → Storage-only Grid (original raw data, mostly append) → ETL Compute Grid → RDBMS (aggregated data) → BI Reports + Interactive Apps

Its limitations:

1. Can’t explore the original high-fidelity raw data: a meagre 10% of the ~2 PB of data is available for BI, while 90% of the ~2 PB sits archived.
2. Moving data to compute doesn’t scale.
3. Premature data death: raw data is archived or discarded before its full value is extracted.
Slide 17 www.edureka.co/big-data-and-hadoop
Solution: A Combined Storage Computer Layer
Replace the storage-only grid with a layer that both stores and processes data:

Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append) → RDBMS (aggregated data) → BI Reports + Interactive Apps

The benefits:

1. Data exploration & advanced analytics on the raw data itself.
2. Scalable throughput for ETL & aggregation.
3. Keep data alive forever: no data archiving, and the entire ~2 PB is available for processing.

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
Slide 18 www.edureka.co/big-data-and-hadoop
Why DFS?
Read 1 TB of data:

                   1 Machine       10 Machines
  I/O channels     4               4 per machine
  Channel speed    100 MB/s        100 MB/s
  Time to read     43 minutes      4.3 minutes
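As a sanity check, here is the arithmetic behind those numbers (a sketch; the slide appears to treat 1 TB as 1024 × 1024 MB and round):

MB_PER_TB = 1024 * 1024      # binary TB, which matches the slide's 43-minute figure
CHANNELS = 4                 # I/O channels per machine
MB_PER_SEC = 100             # throughput of each channel

def read_minutes(machines):
    """Minutes to read 1 TB if all channels stream in parallel."""
    total_throughput = machines * CHANNELS * MB_PER_SEC   # MB/s
    return MB_PER_TB / total_throughput / 60

print(read_minutes(1))    # ~43.7, the slide's "43 minutes"
print(read_minutes(10))   # ~4.4, the slide's "4.3 minutes"

This is why a distributed file system helps: aggregate I/O bandwidth scales with the number of machines reading in parallel.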


Slide 21 www.edureka.co/big-data-and-hadoop
What is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters
of commodity computers using a simple programming model.

 It is an open-source data management framework with scale-out storage and distributed processing.

Slide 22 www.edureka.co/big-data-and-hadoop
Hadoop Key Characteristics

Hadoop’s key features:

 Reliable
 Economical
 Flexible
 Scalable

Slide 23 www.edureka.co/big-data-and-hadoop
Hadoop Design Principles
 Facilitate the storage and processing of large and/or rapidly growing data sets

» Structured and unstructured data


» Simple programming models

 Scale-Out rather than Scale-Up

 Bring code to data, rather than data to code

 High scalability and availability

 Use commodity hardware

 Fault-tolerance

Slide 24 www.edureka.co/big-data-and-hadoop
Hadoop – It’s about Scale and Structure

                RDBMS                               HADOOP

Data Types      Structured                          Multi and unstructured
Processing      Limited, no data processing         Processing coupled with data
Governance      Standards & structured              Loosely structured
Schema          Required on write                   Required on read
Speed           Reads are fast                      Writes are fast
Cost            Software license                    Support only
Resources       Known entity                        Growing, complexities, wide
Best Fit Use    OLTP, complex ACID transactions,    Data discovery; processing
                operational data store              unstructured data; massive
                                                    storage/processing
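To make the Schema row concrete: an RDBMS validates structure when rows are written (schema-on-write), while Hadoop stores raw bytes and imposes structure only when the data is read (schema-on-read). A small illustration with made-up records:

# Raw lines land in storage exactly as they arrive; nothing is validated on write.
raw_records = [
    "alice,23,london",
    "bob,,paris",        # a malformed field is still stored, and handled at read time
]

def apply_schema(line):
    """Impose the (name, age, city) schema at read time."""
    name, age, city = line.split(",")
    return {"name": name, "age": int(age) if age else None, "city": city}

for line in raw_records:
    print(apply_schema(line))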

Slide 25 www.edureka.co/big-data-and-hadoop
Annie’s Question

Hadoop is a framework that allows for the distributed


processing of:
» Small Data Sets
» Large Data Sets

Slide 26 www.edureka.co/big-data-and-hadoop
Annie’s Answer
Ans. Large Data Sets.
Hadoop is also capable of processing small data sets. However, to experience the true power of Hadoop, you need data in terabytes: that is where an RDBMS takes hours and fails, whereas Hadoop does the same in a couple of minutes.

Slide 27 www.edureka.co/big-data-and-hadoop
Hadoop Ecosystem
Hadoop 1.0 (bottom to top):
 Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
 HDFS (Hadoop Distributed File System)
 MapReduce Framework, with HBase alongside
 Hive (DW System), Pig Latin (Data Analysis), Mahout (Machine Learning)
 Apache Oozie (Workflow)

Hadoop 2.0 (bottom to top):
 Data ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
 HDFS (Hadoop Distributed File System)
 YARN (Cluster Resource Management)
 MapReduce Framework and other YARN frameworks (MPI, Graph), with HBase alongside
 Hive (DW System), Pig Latin (Data Analysis), Mahout (Machine Learning), and others
 Apache Oozie (Workflow)
Slide 28 www.edureka.co/big-data-and-hadoop
Machine Learning with Mahout
Write intelligent applications using Apache Mahout

[Figure: LinkedIn recommendations: Hadoop and MapReduce magic in action.]

https://mahout.apache.org/general/powered-by-mahout.html

Slide 29 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Core Components
 Storage: HDFS
» NameNode (Master)
» Secondary NameNode
» DataNode (Slave)

 Processing: YARN
» Resource Manager (Master)
» Node Manager (Slave)
Slide 30 www.edureka.co/big-data-and-hadoop
Hadoop 2.x Core Components ( Contd.)

In a Hadoop 2.x cluster:

 YARN: one Resource Manager, plus a Node Manager running on every slave node
 HDFS: one NameNode, plus a DataNode running on every slave node
Slide 31 www.edureka.co/big-data-and-hadoop
Main Components of HDFS
 NameNode:

» Master of the system


» Maintains and manages the blocks which are present on
the DataNodes

 DataNodes:

» Slaves which are deployed on each machine and provide


the actual storage
» Responsible for serving read and write requests for the
clients

Slide 32 www.edureka.co/big-data-and-hadoop
NameNode Metadata
 Meta-data in Memory
» The entire metadata is in main memory
» No demand paging of FS meta-data

 Types of Metadata
» List of files
» List of Blocks for each file
» List of DataNodes for each block
» File attributes, e.g. access time, replication factor

 A Transaction Log
» Records file creations, file deletions, etc.

[Figure: the NameNode stores metadata only; it keeps track of the overall file directory structure and the placement of each data block, e.g. /user/doug/hinfo → blocks 1, 3, 5 and /user/doug/pdetail → blocks 4, 2.]
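A toy model of what the NameNode keeps in memory (a sketch only; the real NameNode persists an FsImage plus an edit log, and the DataNode names below are hypothetical):

# File -> ordered list of block IDs (paths taken from the slide's example).
file_to_blocks = {
    "/user/doug/hinfo":   [1, 3, 5],
    "/user/doug/pdetail": [4, 2],
}

# Block ID -> DataNodes currently holding a replica (replication factor 3).
block_to_datanodes = {
    1: ["dn1", "dn4", "dn7"],
    3: ["dn2", "dn5", "dn8"],
    5: ["dn3", "dn6", "dn9"],
}

# Transaction log: records file creations, deletions, etc.
edit_log = []

def create_file(path):
    file_to_blocks[path] = []
    edit_log.append(("CREATE", path))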

Slide 33 www.edureka.co/big-data-and-hadoop
File Blocks
 By default, the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.

Hadoop 2.x: a 200 MB file (abc.txt) is split into
» Block 1: 128 MB
» Block 2: 72 MB

Hadoop 1.x: a 200 MB file (emp.dat) is split into
» Block 1: 64 MB
» Block 2: 64 MB
» Block 3: 64 MB
» Block 4: 8 MB

 Why is the block size so large?

» The main reason for keeping HDFS blocks large is to reduce the cost of seek time relative to transfer time.

» A large block size also keeps the number of blocks per file small, which limits the NameNode’s metadata memory footprint while still using storage space properly.
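A quick way to reproduce the splits above (a sketch; in reality the HDFS client performs the split as it streams data in):

def split_into_blocks(file_mb, block_mb):
    """Return the block sizes HDFS would allocate for a file of file_mb MB."""
    blocks = [block_mb] * (file_mb // block_mb)
    if file_mb % block_mb:
        blocks.append(file_mb % block_mb)   # the last block holds only the remainder
    return blocks

print(split_into_blocks(200, 128))   # Hadoop 2.x: [128, 72]
print(split_into_blocks(200, 64))    # Hadoop 1.x: [64, 64, 64, 8]

Note that the final block occupies only as much disk space as it actually contains; the 72 MB block does not waste a full 128 MB of storage.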

Slide 34 www.edureka.co/big-data-and-hadoop
Replication and Rack Awareness

[Figure series: seven step-by-step diagrams showing how HDFS replicates each block across DataNodes and racks.]

Slide 41 www.edureka.co/big-data-and-hadoop
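The policy behind these figures is HDFS’s default block placement: with the default replication factor of 3, the first replica is written to the client’s own DataNode (or a random node), the second to a node on a different rack, and the third to a different node on that same remote rack. A rough sketch, with a hypothetical two-rack topology:

import random

# Hypothetical rack topology: rack -> DataNodes.
racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}

def place_replicas(writer_node, writer_rack):
    """Simplified version of the default placement for replication factor 3."""
    first = writer_node                                    # replica 1: local node
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)   # replicas 2 and 3: one remote rack
    return [first, second, third]

print(place_replicas("dn1", "rack1"))   # e.g. ['dn1', 'dn5', 'dn4']

Putting two replicas on a single remote rack balances fault tolerance (the data survives the loss of an entire rack) against write cost (only one cross-rack transfer per block).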
Pipelined Write
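[Figure: the HDFS write pipeline.]

The pipeline works as follows: the client streams a block to the first DataNode, which forwards each packet to the second, which forwards it to the third; acknowledgements travel back along the same chain. A toy rendering of the idea (node names hypothetical):

def pipelined_write(block, pipeline):
    """Each DataNode stores the data, then forwards it to the next node downstream."""
    for i, node in enumerate(pipeline):
        print(f"{node}: stored {len(block)} bytes")
        if i + 1 < len(pipeline):
            print(f"{node} -> {pipeline[i + 1]}: forward")
    # Acks flow in the reverse direction, back to the client.
    print("ack: " + " <- ".join(["client"] + pipeline))

pipelined_write(b"block data ...", ["dn1", "dn5", "dn4"])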

Slide 42 www.edureka.co/big-data-and-hadoop
Client reading file from HDFS
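[Figure: a client reading a file from HDFS.]

The read path: the client asks the NameNode for the list of blocks and the DataNodes holding each one, then streams every block directly from a nearby DataNode; the NameNode never sits in the data path. A minimal sketch (same toy metadata shape as slide 33, hypothetical node names):

file_to_blocks = {"/user/doug/hinfo": [1, 3, 5]}
block_to_datanodes = {1: ["dn1", "dn4"], 3: ["dn2", "dn5"], 5: ["dn3", "dn6"]}

def read_file(path):
    for block_id in file_to_blocks[path]:            # 1. ask the NameNode for block locations
        node = block_to_datanodes[block_id][0]       # 2. pick the closest replica (simplified)
        print(f"fetching block {block_id} from {node}")   # 3. stream directly from the DataNode

read_file("/user/doug/hinfo")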

Slide 43 www.edureka.co/big-data-and-hadoop
Annie’s Question

In HDFS, blocks of a file are written in parallel; however, the replication of each block is done sequentially:

a. TRUE
b. FALSE

Slide 44 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Ans. True. A file is divided into blocks, and these blocks are written in parallel, but the replication of each block happens in sequence.

Slide 45 www.edureka.co/big-data-and-hadoop
Annie’s Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?

a. It can read up to the last block that was successfully written.
b. It can read up to the last bit that was successfully written.
c. It will throw an exception.
d. It cannot see the file until copying has finished.

Slide 46 www.edureka.co/big-data-and-hadoop
Annie’s Answer

Ans. Option (a).
The client can read up to the last successfully written data block.

Slide 47 www.edureka.co/big-data-and-hadoop
Other Hadoop Distributions

 Cloudera distributes a platform of open-source projects called Cloudera’s Distribution including Apache Hadoop, or CDH. It also offers architectural services and technical support for Hadoop clusters in development or in production.

 Hortonworks, another major player in the Hadoop market, has the largest number of committers and code contributors for the Hadoop ecosystem components. It provides expert technical support, training and partner-enablement services for both end-user organizations and technology vendors.

 The MapR Distribution including Apache Hadoop provides an enterprise-grade distributed data platform to reliably store and process big data. MapR’s Apache Hadoop distribution claims to provide full data protection, no single point of failure, and improved performance.

Slide 48 www.edureka.co/big-data-and-hadoop
Further Reading
 Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

 Apache Hadoop HDFS Architecture


http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Slide 50 www.edureka.co/big-data-and-hadoop
A tour of Edureka’s Virtual Machine

Slide 51 www.edureka.co/big-data-and-hadoop
Edureka VM
Pre-requisites

Installation flow:

1. Does your machine satisfy the pre-requisite condition?
» NO → Use remote login with Putty (Hadoop-2.2.0) instead.
» YES → Proceed with the Edureka VM installation.

2. Has the VM file downloaded completely?
» YES → Proceed with the steps mentioned in the “Edureka VM Installation” document.
» NO → Download it in parts using the “Edureka VM Split Files” document.

3. Import the Edureka VM into Virtual Box, then check whether the daemons are running.
» YES → Refer to README.txt on the Edureka VM Desktop.
» NO → Run the commands to start the daemons manually.
Slide 52 www.edureka.co/big-data-and-hadoop
Pre-requisites

 Pre-requisites to install the Edureka VM:

» Minimum 4 GB RAM
» Dual Core processor or above

If you don’t have the above-mentioned hardware, refer to slide 57.

Slide 53 www.edureka.co/big-data-and-hadoop
Installing Edureka VM
 With reference to the “Edureka VM Installation” document present in the LMS, you can install the Edureka VM.

We suggest using a download manager while downloading the Edureka VM, to avoid any network issues that may occur. You can download one from here; it is an open-source tool available for different platforms.

Slide 54 www.edureka.co/big-data-and-hadoop
Challenges Faced During Installation of VM
 The file may not get downloaded completely, due to its huge size, without displaying any error. After downloading the VM, always check its file size: it should be 4.5 GB.

» Download complete → Proceed with the steps mentioned in the “Edureka VM Installation” document.
» Download incomplete → Download the Edureka VM using the “Edureka VM Split Files” document present in the LMS.
Slide 55 www.edureka.co/big-data-and-hadoop
Edureka VM Remote Server
 With reference to the “Remote Login Using Putty – Hadoop 2.2.0” document present in the LMS, you can access our Edureka VM Remote Server.

If your system does not meet the minimum requirements, please refer to the document “Remote Login using Putty Hadoop-3.0” present in the LMS.

Slide 56 www.edureka.co/big-data-and-hadoop
Challenges Faced While Importing VM in Virtual Box
 Below is the error you may face while importing the Edureka VM into Virtual Box if the Edureka VM file has not been downloaded completely. [Error screenshot lost in this export.]

In case you are using split files, combine all the downloaded split files using the hjsplit application before importing them into Virtual Box.

Slide 57 www.edureka.co/big-data-and-hadoop
Challenges Faced After Importing VM in Virtual Box
 After importing VM in Virtual Box, start the VM and check if all the daemons are up and running by using:

Command: sudo jps

 If the daemons are not up and running, execute the commands below:

Command: sudo service hadoop-master stop

Command: sudo service hadoop-master start

Command: hadoop dfsadmin -safemode leave

 After executing the above commands, check again that all the daemons are running fine:

Command: sudo jps

On a healthy Hadoop 2.x setup, jps should list daemons such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact list can vary with the VM’s packaging).

Slide 58 www.edureka.co/big-data-and-hadoop
Edureka VM Desktop
 Please refer to the README.txt file on the Desktop of the Edureka VM to learn more about it

Slide 59 www.edureka.co/big-data-and-hadoop
Assignment
Referring to the documents present in the LMS under Assignment, solve the problem below:

How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?

Slide 60 www.edureka.co/big-data-and-hadoop
Pre-work for next Class
Set up the Hadoop development environment using the documents present in the LMS:

» Set up Edureka’s Virtual Machine
» Execute basic Linux commands
» Execute HDFS hands-on commands

Attempt the Module-1 assignments present in the LMS.

Slide 61 www.edureka.co/big-data-and-hadoop
Agenda for Next Class
 Hadoop 2.x Cluster Architecture

 Hadoop cluster modes

 Basic Hadoop commands

 Hadoop 2.x configuration files and their parameters

 Password-less SSH on Hadoop cluster

 Dump of a MapReduce program

 Data loading techniques

Slide 62 www.edureka.co/big-data-and-hadoop
Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.

Slide 63 www.edureka.co/big-data-and-hadoop
