
Big Data

Kailash S
C-DAC
Chennai

What is Big Data?

Large data files (65%)
Advanced analytics or analysis (60%)
Data from visualization tools (50%)

Structured, semi-structured, and unstructured data

Properties

Scalability
Data I/O Performance
Fault tolerance
Real-time processing
Data size supported
Iterative task support

Scalability

Horizontal scaling
  Peer-to-peer networks (MPI)
  Apache Hadoop (HDFS, YARN, MapReduce)
  Spark

Vertical scaling
  High Performance Computing clusters
  Multi-core systems
  Graphics Processing Units (GPUs)
  Field Programmable Gate Arrays (FPGAs)

Choosing a platform for big data analytics

Data size
Speed or throughput optimization
MapReduce impact

Types of Analytics
Prescriptive Analytics
Predictive Analytics
Descriptive Analytics

Data Loading Scenario

Data at rest vs. data in motion
Web server / sensor logs: FLUME
Databases: SQOOP (SQL + HADOOP)

Types of tools used in Big Data

Scenario
Where is the processing hosted?
  Distributed servers / cloud
Where is the data stored?
  Distributed storage (e.g. Amazon S3)
What is the programming model?
  Distributed processing (MapReduce)
How is the data stored and indexed?
  High-performance schema-free databases
What operations are performed on the data?
  Analytic/semantic processing (e.g. RDF/OWL)

Terminology

Google calls it:    Hadoop equivalent:
MapReduce           Hadoop MapReduce
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper

HADOOP

What is Hadoop?

Framework for running applications and storing data over
large clusters.
Provides a distributed file system (HDFS) that stores data
on the nodes.
Data is replicated, so it is not lost when a node fails.
Hadoop implements Map/Reduce:
The application's processing is divided into many small
fragments of work, each of which may be executed on any
node in the cluster.
Actual parallel processing.

Hadoop Architecture

[Diagram: the input data is split into DFS blocks (Block 1, 2, 3), each replicated across the Hadoop cluster; a MAP task processes each block, and a Reduce task combines the map outputs into the results.]

Architecture of HadoopDB

[Diagram from the EDBT 2011 tutorial.]

What is HDFS?

HDFS stands for Hadoop Distributed File System,
the primary storage system used by Hadoop
applications.
HDFS splits the data into several pieces called data
blocks.
HDFS creates multiple replicas of each data block and
distributes them on compute nodes throughout the
cluster to enable reliable, extremely rapid
computations.
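To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (FileSystem). It assumes the client is configured to point at the cluster (e.g. fs.default.name = hdfs://master:8020, as configured later in these slides); the path and file contents are made up for illustration.

    // Minimal HDFS read/write sketch; path and contents are illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);      // handle to HDFS

            Path file = new Path("/user/hadoop/demo.txt");  // hypothetical path
            FSDataOutputStream out = fs.create(file);  // Namenode assigns blocks,
            out.writeUTF("hello hdfs");                // Datanodes store the replicas
            out.close();

            FSDataInputStream in = fs.open(file);      // block locations come from the
            System.out.println(in.readUTF());          // Namenode; data from Datanodes
            in.close();
        }
    }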

HDFS Architecture

Hadoop is based on a Master-Slave architecture.
An HDFS cluster consists of:
A single Namenode (master server) that manages the file
system namespace.
Datanodes (slaves), which manage storage attached to
the nodes they run on.
Internally, a file is split into one or more blocks,
and these blocks are stored in a set of Datanodes.

Namenode & Datanode

The Namenode executes file system namespace
operations: opening, closing, and renaming files and
directories.
It determines the mapping of blocks to Datanodes
and is the repository for all HDFS metadata.

Datanodes are responsible for serving read and write
requests from the file system's clients.
They perform block creation, deletion, and replication
upon instruction from the Namenode.

Hadoop Distributed File System

[Diagram: an application on the HDFS client uses the local file system (block size 2K) and talks to the Name Nodes on the master node of the HDFS server; HDFS blocks are large (128M) and replicated.]

HDFS Architecture

[Diagram: a client issues metadata operations to the Namenode, which holds the metadata (name, replicas, ... e.g. /home/foo/data, 6, ...); read and write block operations go directly to the Datanodes, and blocks are replicated across Datanodes on Rack 1 and Rack 2.]

Nodes, Trackers, Tasks

The master node runs a JobTracker instance, which
accepts job requests from clients.
TaskTracker instances run on the slave nodes.
A TaskTracker forks a separate Java process for
each task instance.

MAP REDUCE

Map-Reduce

MapReduce is a framework for processing huge
datasets on certain kinds of distributable
problems using a large number of nodes.

MapReduce Architecture

Parallel Execution

Map
The master node takes the input, chops it up into
smaller sub-problems, and distributes those to
worker nodes.
A worker node may do this again in turn, leading
to a multi-level tree structure.
The worker node processes its smaller problem
and passes the answer back to its master node.

Reduce
The master node then takes the answers to
all the sub-problems and combines them
to produce the output: the answer to the
problem it was originally trying to solve.
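To make the two phases concrete, here is a minimal sketch of the classic word count using the Hadoop Java (mapreduce) API. This is a generic illustration, not the com.hadoop.WordCount jar invoked later in these slides; input and output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit <word, 1> for every word in this task's input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts gathered for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }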

MapReduce: High Level

[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node (in our case: circe.rc.usf.edu); each slave node runs a TaskTracker, which launches the task instances.]

[Diagram: large-scale data splits feed Map tasks that emit <key, value> pairs (e.g. <key, 1>); parse-hash steps route the pairs to reducers (say, Count), yielding partitions P-0000/count1, P-0001/count2, and P-0002/count3.]

[Diagram: data stores 1..n each feed input key*value pairs to a map task; a barrier aggregates the intermediate values by output key (key 1, key 2, key 3); one reduce task per key then produces the final values.]

Hadoop Internal Process

Job launch process:
Client, JobClient
JobTracker
TaskTracker, Task, TaskRunner

Creating the mapper:
Mapper

Getting Data To The Mapper

[Diagram: the InputFormat divides the input files into InputSplits; a RecordReader turns each split into records for its Mapper, and each Mapper produces intermediate outputs.]

Reading data:
File input format and friends
Filtering file inputs
Record readers
Input split size
Sending data to reducers
Writable comparator
Sending data to client
(A job-configuration sketch for the input side follows.)
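A minimal sketch of how the input-side pieces above are wired onto a job with the new (mapreduce) API; the .tmp-skipping filter and the input path are made up for illustration.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputWiring {
        // Hypothetical filter: skip temporary files left behind by other tools.
        public static class NoTmpFilter implements PathFilter {
            public boolean accept(Path p) {
                return !p.getName().endsWith(".tmp");
            }
        }

        public static void configure(Job job) throws IOException {
            job.setInputFormatClass(TextInputFormat.class); // line-oriented RecordReader
            FileInputFormat.addInputPath(job, new Path("abcd/input"));
            FileInputFormat.setInputPathFilter(job, NoTmpFilter.class);
        }
    }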

Partition And Shuffle


[Diagram: the intermediate outputs of the Mappers pass through Partitioners, which decide which Reducer receives each key; after the shuffle, each Reducer consumes the intermediates for its partition.]
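A minimal sketch of a custom Partitioner (the component above that decides which reducer receives each key), assuming word-count-style <Text, IntWritable> intermediates. The first-character scheme and class name are made up; by default Hadoop hashes the key.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical scheme: all keys sharing a first character go to one partition.
    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0;
            }
            return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is registered on the job with job.setPartitionerClass(FirstCharPartitioner.class).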

Partitioner
Reduction
Output format

OutputFormat

Finally: Writing The Output

[Diagram: each Reducer writes its output through a RecordWriter, supplied by the OutputFormat, to its own output file.]

Other Hadoop-related Projects

Pig
A data flow language and execution environment for exploring very large datasets. Pig
runs on HDFS and MapReduce clusters.

HBase
Distributed, column-oriented database.
HBase uses HDFS for its underlying storage.
Supports both batch-style computations using MapReduce and point queries (random
reads); a Java sketch follows this list.

ZooKeeper
Distributed, highly available coordination service.
Provides primitives such as distributed locks that can be used for building distributed
applications.

Hive
Distributed data warehouse.
Manages data stored in HDFS and provides a query language based on SQL (which
is translated by the runtime engine to MapReduce jobs) for querying the data.
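As referenced above, a minimal sketch of an HBase write and point query (random read) from Java, using the classic 0.x-era client API; the table name, column family, and values are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Connects using hbase-site.xml on the classpath.
            HTable table = new HTable(HBaseConfiguration.create(), "webtable");

            Put put = new Put(Bytes.toBytes("row1"));          // write one cell
            put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>hello</html>"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("row1"));          // point query by row key
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
            table.close();
        }
    }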

Hive
A database/data warehouse built on top of Hadoop.
Rich data types (structs, lists, and maps).
Efficient implementations of SQL filters, joins, and group-bys on top of MapReduce.
Allows users to access Hive data without using Hive directly (e.g. via the Thrift API).
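One such programmatic route, sketched minimally: querying Hive through its JDBC driver, which talks to the Hive Thrift server. The driver class and URL shown are the Hive 0.x-era ones; the server host and the docs table are made up.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hive 0.x-era JDBC driver; later releases use org.apache.hive.jdbc.HiveDriver.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con =
                    DriverManager.getConnection("jdbc:hive://master:10000/default", "", "");
            Statement stmt = con.createStatement();
            // Hive QL; the runtime engine translates this into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) FROM docs GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }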

Hive Architecture

[Diagram: the Web UI (management etc.) and the Hive CLI (browsing, queries, DDL) submit Hive QL, which flows through the Parser, Planner, and Execution engine, consulting the MetaStore; a SerDe layer (Thrift, Jute, JSON) and a Thrift API connect Hive to applications; jobs execute as MapReduce over HDFS.]

BIG DATA is not just HADOOP

Understand and navigate federated big data sources: Federated Discovery and Navigation
Manage & store huge volumes of any data: Hadoop File System / MapReduce
Structure and control data: Data Warehousing
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

SPARK

Developed at the University of California, Berkeley (Berkeley Data Analytics Stack)
Reduces disk I/O limitations via in-memory computation
Up to 100x faster than MapReduce for some workloads
APIs in Java, Scala, and Python
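For contrast with the MapReduce example earlier, a minimal sketch of the same word count in Spark's Java API, where the dataset stays in memory as an RDD between steps. This assumes a later (2.x-era) Spark with Java 8 lambdas; the master URL and paths are made up.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // "local" runs Spark in-process; a real cluster would use its master URL.
            JavaSparkContext sc = new JavaSparkContext("local", "wordcount");
            JavaRDD<String> lines = sc.textFile("hdfs://master:8020/user/hadoop/input");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);   // aggregation happens in memory
            counts.saveAsTextFile("hdfs://master:8020/user/hadoop/output");
            sc.stop();
        }
    }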

INSTALLATION OF PACKAGES

Linux packages:
Binary installation (manual or automated)
Source installation (manual)
Steps: prerequisites & dependencies, main package, configuration

Installing Hadoop

Hadoop comes as a standalone download from
http://www.motorlogy.com/apachemirror//hadoop/core/hadoop0.20.2/hadoop-0.20.2.tar.gz

Prerequisites:
JDK 1.5 or above
SSH

Requires creating a new user with ownership of the following
folders:
Path of the Hadoop folder
Path where the data is placed

Configuring Hadoop

The following files in the hadoop/conf folder are to be configured:
masters
slaves
core-site.xml
hdfs-site.xml
mapred-site.xml
hadoop-env.sh

The configuration may be of either of two types:
Single-node cluster
Multi-node cluster

Steps to configure

Map the IP addresses of all nodes to their host names in the /etc/hosts file.

Add the HADOOP_HOME environment variable.

Add the master node's name to the masters file.

Add the master node's and data nodes' names to the slaves file.

Set the JAVA_HOME path in hadoop-env.sh.

Log in as the hadoop user.

Continued ...

Create an SSH RSA key pair on the master node and copy the public key to all the data nodes:

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave

Verify passwordless login:

ssh hadoop@master
ssh hadoop@slave

Configuring core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>

Configuring hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Configuring mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:8021</value>
  </property>
</configuration>

Starting Hadoop

hadoop namenode -format
sh start-all.sh
Starts the Namenode
Starts the Secondary Namenode
Starts the Datanodes
Starts the JobTracker
Starts the TaskTrackers

To view the NameNode and JobTracker web interfaces:
http://master:50070/ -- NameNode (HDFS)
http://master:50030/ -- JobTracker
Sample Hadoop commands

hadoop dfsadmin -report
Standard report about the nodes and replication.

hadoop fs -ls
Lists files on the HDFS.

hadoop fs -mkdir abcd
Creates a directory on HDFS.

Continued ...

hadoop fs -put /root/abc.txt abcd/input
Puts external data into HDFS.

hadoop jar hadoop-example.jar com.hadoop.WordCount abcd/input abcd/output
Executes a map-reduce job.

Internet of Things (IoT)

Extends the current Internet, providing
connection, communication, and internetworking between devices and physical
objects, or "Things".
The technologies and solutions that enable
integration of real-world data and services into
current information networking
technologies are described under the
umbrella term of the Internet of Things (IoT).

Things connected to the Internet

Image courtesy: Cisco

Future Networks and IoT
