
Big Data

Kailash S
C-DAC
Chennai

What is Big Data?

Large data files (65%)
Advanced analytics or analysis (60%)
Data from visualization tools (50%)

Structured, semi-structured, and unstructured data

Properties

Scalability
Data I/O Performance
Fault tolerance
Real-time processing
Data size supported
Iterative task support

Scalability

Horizontal scaling
  Peer-to-peer networks (MPI)
  Apache Hadoop (HDFS, YARN, MapReduce)
  Spark

Vertical scaling
  High Performance Computing clusters
  Multi-core systems
  Graphics Processing Units (GPUs)
  Field Programmable Gate Arrays (FPGAs)

Choosing a platform for big data analytics

Data size
Speed or throughput optimization
MapReduce impact

Types of Analytics
Prescriptive Analytics
Predictive Analytics
Descriptive Analytics

Data Loading Scenario

Data at rest vs. data in motion
Web server / sensor logs: FLUME
Databases: SQOOP (SQL + HADOOP)

Types of tools used in Big Data

Scenario
Where is the processing hosted?
  Distributed servers / cloud
Where is the data stored?
  Distributed storage (e.g. Amazon S3)
What is the programming model?
  Distributed processing (MapReduce)
How is the data stored and indexed?
  High-performance schema-free databases
What operations are performed on the data?
  Analytic/semantic processing (e.g. RDF/OWL)

Terminology

Google calls it:    Hadoop equivalent:
MapReduce           Hadoop MapReduce
GFS                 HDFS
Bigtable            HBase
Chubby              ZooKeeper

HADOOP

What is Hadoop?

Framework for running applications and storing data over
large clusters.
Provides a distributed file system (HDFS) that stores data
on the nodes.
Data is replicated, so it is not lost when a node fails.
Hadoop implements Map/Reduce:
The application's processing is divided into many small
fragments of work, each of which may be executed on any
node in the cluster.
Actual parallel processing.

Hadoop Architecture

[Diagram: the input data is split into DFS blocks (Block 1, 2, 3), each replicated across the Hadoop cluster; a MAP task processes each block, and a Reduce task combines the map outputs into the results.]

Architecture of HadoopDB

[Diagram from the EDBT 2011 tutorial.]

What is HDFS?

HDFS stands for Hadoop Distributed File System,
the primary storage system used by Hadoop
applications.
HDFS splits the data into several pieces called data
blocks.
HDFS creates multiple replicas of each data block and
distributes them on compute nodes throughout the
cluster to enable reliable, extremely rapid
computations.
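To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (FileSystem). It assumes the client is configured to point at the cluster (e.g. fs.default.name = hdfs://master:8020, as configured later in these slides); the path and file contents are made up for illustration.

    // Minimal HDFS read/write sketch; path and contents are illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);      // handle to HDFS

            Path file = new Path("/user/hadoop/demo.txt");  // hypothetical path
            FSDataOutputStream out = fs.create(file);  // Namenode assigns blocks,
            out.writeUTF("hello hdfs");                // Datanodes store the replicas
            out.close();

            FSDataInputStream in = fs.open(file);      // block locations come from the
            System.out.println(in.readUTF());          // Namenode; data from Datanodes
            in.close();
        }
    }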

HDFS Architecture

Hadoop is based on a Master-Slave architecture.
An HDFS cluster consists of:
A single Namenode (master server) that manages the file
system namespace.
Datanodes (slaves), which manage storage attached to
the nodes they run on.
Internally, a file is split into one or more blocks,
and these blocks are stored in a set of Datanodes.

Namenode & Datanode

The Namenode executes file system namespace
operations: opening, closing, and renaming files and
directories.
It determines the mapping of blocks to Datanodes
and is the repository for all HDFS metadata.

Datanodes are responsible for serving read and write
requests from the file system's clients.
They perform block creation, deletion, and replication
upon instruction from the Namenode.

Hadoop Distributed File System

[Diagram: an application on the HDFS client uses the local file system (block size 2K) and talks to the Name Nodes on the master node of the HDFS server; HDFS blocks are large (128M) and replicated.]

HDFS Architecture

[Diagram: a client issues metadata operations to the Namenode, which holds the metadata (name, replicas, ... e.g. /home/foo/data, 6, ...); read and write block operations go directly to the Datanodes, and blocks are replicated across Datanodes on Rack 1 and Rack 2.]

Nodes, Trackers, Tasks

The master node runs a JobTracker instance, which
accepts job requests from clients.
TaskTracker instances run on the slave nodes.
A TaskTracker forks a separate Java process for
each task instance.

MAP REDUCE

Map-Reduce

MapReduce is a framework for processing huge
datasets on certain kinds of distributable
problems using a large number of nodes.

MapReduce Architecture

Parallel Execution

Map
The master node takes the input, chops it up into
smaller sub-problems, and distributes those to
worker nodes.
A worker node may do this again in turn, leading
to a multi-level tree structure.
The worker node processes its smaller problem
and passes the answer back to its master node.

Reduce
The master node then takes the answers to
all the sub-problems and combines them
to produce the output: the answer to the
problem it was originally trying to solve.
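To make the two phases concrete, here is a minimal sketch of the classic word count using the Hadoop Java (mapreduce) API. This is a generic illustration, not the com.hadoop.WordCount jar invoked later in these slides; input and output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit <word, 1> for every word in this task's input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts gathered for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }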

MapReduce: High Level

[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node (in our case: circe.rc.usf.edu); each slave node runs a TaskTracker, which launches the task instances.]

[Diagram: large-scale data splits feed Map tasks that emit <key, value> pairs (e.g. <key, 1>); parse-hash steps route the pairs to reducers (say, Count), yielding partitions P-0000/count1, P-0001/count2, and P-0002/count3.]

[Diagram: data stores 1..n each feed input key*value pairs to a map task; a barrier aggregates the intermediate values by output key (key 1, key 2, key 3); one reduce task per key then produces the final values.]

Hadoop Internal Process

Job launch process:
Client, JobClient
JobTracker
TaskTracker, Task, TaskRunner

Creating the mapper:
Mapper

Getting Data To The Mapper

[Diagram: the InputFormat divides the input files into InputSplits; a RecordReader turns each split into records for its Mapper, and each Mapper produces intermediate outputs.]

Reading data:
File input format and friends
Filtering file inputs
Record readers
Input split size
Sending data to reducers
Writable comparator
Sending data to client
(A job-configuration sketch for the input side follows.)
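A minimal sketch of how the input-side pieces above are wired onto a job with the new (mapreduce) API; the .tmp-skipping filter and the input path are made up for illustration.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputWiring {
        // Hypothetical filter: skip temporary files left behind by other tools.
        public static class NoTmpFilter implements PathFilter {
            public boolean accept(Path p) {
                return !p.getName().endsWith(".tmp");
            }
        }

        public static void configure(Job job) throws IOException {
            job.setInputFormatClass(TextInputFormat.class); // line-oriented RecordReader
            FileInputFormat.addInputPath(job, new Path("abcd/input"));
            FileInputFormat.setInputPathFilter(job, NoTmpFilter.class);
        }
    }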

Partition And Shuffle


[Diagram: the intermediate outputs of the Mappers pass through Partitioners, which decide which Reducer receives each key; after the shuffle, each Reducer consumes the intermediates for its partition.]
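A minimal sketch of a custom Partitioner (the component above that decides which reducer receives each key), assuming word-count-style <Text, IntWritable> intermediates. The first-character scheme and class name are made up; by default Hadoop hashes the key.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical scheme: all keys sharing a first character go to one partition.
    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0;
            }
            return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
    }

It is registered on the job with job.setPartitionerClass(FirstCharPartitioner.class).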

Partitioner
Reduction
Output format

OutputFormat

Finally: Writing The Output

[Diagram: each Reducer writes its output through a RecordWriter, supplied by the OutputFormat, to its own output file.]

Other Hadoop-related Projects

Pig
A data flow language and execution environment for exploring very large datasets. Pig
runs on HDFS and MapReduce clusters.

HBase
Distributed, column-oriented database.
HBase uses HDFS for its underlying storage.
Supports both batch-style computations using MapReduce and point queries (random
reads); a Java sketch follows this list.

ZooKeeper
Distributed, highly available coordination service.
Provides primitives such as distributed locks that can be used for building distributed
applications.

Hive
Distributed data warehouse.
Manages data stored in HDFS and provides a query language based on SQL (which
is translated by the runtime engine to MapReduce jobs) for querying the data.
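As referenced above, a minimal sketch of an HBase write and point query (random read) from Java, using the classic 0.x-era client API; the table name, column family, and values are made up.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Connects using hbase-site.xml on the classpath.
            HTable table = new HTable(HBaseConfiguration.create(), "webtable");

            Put put = new Put(Bytes.toBytes("row1"));          // write one cell
            put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>hello</html>"));
            table.put(put);

            Get get = new Get(Bytes.toBytes("row1"));          // point query by row key
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
            table.close();
        }
    }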

Hive
A database/data warehouse built on top of Hadoop.
Rich data types (structs, lists, and maps).
Efficient implementations of SQL filters, joins, and group-bys on top of MapReduce.
Allows users to access Hive data without using Hive directly (e.g. via the Thrift API).
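One such programmatic route, sketched minimally: querying Hive through its JDBC driver, which talks to the Hive Thrift server. The driver class and URL shown are the Hive 0.x-era ones; the server host and the docs table are made up.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hive 0.x-era JDBC driver; later releases use org.apache.hive.jdbc.HiveDriver.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con =
                    DriverManager.getConnection("jdbc:hive://master:10000/default", "", "");
            Statement stmt = con.createStatement();
            // Hive QL; the runtime engine translates this into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) FROM docs GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }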

Hive Architecture

[Diagram: the Web UI (management etc.) and the Hive CLI (browsing, queries, DDL) submit Hive QL, which flows through the Parser, Planner, and Execution engine, consulting the MetaStore; a SerDe layer (Thrift, Jute, JSON) and a Thrift API connect Hive to applications; jobs execute as MapReduce over HDFS.]

BIG DATA is not just HADOOP

Understand and navigate federated big data sources: Federated Discovery and Navigation
Manage & store huge volumes of any data: Hadoop File System / MapReduce
Structure and control data: Data Warehousing
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

SPARK

Developed at the University of California, Berkeley (Berkeley Data Analytics Stack)
Reduces disk I/O limitations via in-memory computation
Up to 100x faster than MapReduce for some workloads
APIs in Java, Scala, and Python
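For contrast with the MapReduce example earlier, a minimal sketch of the same word count in Spark's Java API, where the dataset stays in memory as an RDD between steps. This assumes a later (2.x-era) Spark with Java 8 lambdas; the master URL and paths are made up.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // "local" runs Spark in-process; a real cluster would use its master URL.
            JavaSparkContext sc = new JavaSparkContext("local", "wordcount");
            JavaRDD<String> lines = sc.textFile("hdfs://master:8020/user/hadoop/input");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);   // aggregation happens in memory
            counts.saveAsTextFile("hdfs://master:8020/user/hadoop/output");
            sc.stop();
        }
    }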

INSTALLATION OF PACKAGES

Linux packages:
Binary installation (manual or automated)
Source installation (manual)
Steps: prerequisites & dependencies, main package, configuration

Installing Hadoop

Hadoop comes as a standalone download from
http://www.motorlogy.com/apachemirror//hadoop/core/hadoop0.20.2/hadoop-0.20.2.tar.gz

Prerequisites:
JDK 1.5 or above
SSH

Requires creating a new user with ownership of the following
folders:
Path of the Hadoop folder
Path where the data is placed

Configuring Hadoop

The following files in the hadoop/conf folder are to be configured:
masters
slaves
core-site.xml
hdfs-site.xml
mapred-site.xml
hadoop-env.sh

The configuration may be of either of two types:
Single-node cluster
Multi-node cluster

Steps to configure

Map the IP addresses of all nodes to their host names in the /etc/hosts file.

Add the HADOOP_HOME environment variable.

Add the master node's name to the masters file.

Add the master node's and data nodes' names to the slaves file.

Set the JAVA_HOME path in hadoop-env.sh.

Log in as the hadoop user.

Continued ...

Create an SSH RSA key pair on the master node and copy the public key to all the data nodes:

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave

Verify passwordless login:

ssh hadoop@master
ssh hadoop@slave

Configuring core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>

Configuring hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Configuring mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:8021</value>
  </property>
</configuration>

Starting Hadoop

hadoop namenode -format
sh start-all.sh
Starts the Namenode
Starts the Secondary Namenode
Starts the Datanodes
Starts the JobTracker
Starts the TaskTrackers

To view the NameNode and JobTracker web interfaces:
http://master:50070/ -- NameNode (HDFS)
http://master:50030/ -- JobTracker
Sample Hadoop commands

hadoop dfsadmin -report
Standard report about the nodes and replication.

hadoop fs -ls
Lists files on the HDFS.

hadoop fs -mkdir abcd
Creates a directory on HDFS.

Continued ...

hadoop fs -put /root/abc.txt abcd/input
Puts external data into HDFS.

hadoop jar hadoop-example.jar com.hadoop.WordCount abcd/input abcd/output
Executes a map-reduce job.

Internet of Things (IoT)

Extends the current Internet, providing
connection, communication, and internetworking between devices and physical
objects, or "Things".
The technologies and solutions that enable
integration of real-world data and services into
current information networking
technologies are described under the
umbrella term of the Internet of Things (IoT).

Things connected to the Internet

Image courtesy: Cisco

Future Networks and IoT
