Satyendra Pasalapudi
Associate Practice Director
Apps Associates LLC
@pasalapudi
Evolution of data management technologies (timeline, 1940–2010):
• Pre-computer technologies: printing press, Dewey decimal system, punched cards
• Magnetic tape; magnetic disk
• "Flat" (sequential) files; IDMS; ADABAS
• System R; Oracle V2; Postgres; Access; MySQL
• Dynamo; MongoDB; Redis; HBase; VoltDB; Neo4J
Workloads along the way: Billing, ERP, CRM, RFID, network switches
© Copyright 2016. Apps Associates LLC.
Hybrid Cloud Framework
Enterprise applications: HR, FIN, SCOM, SALES, PLANNING, DW/BI, PROCUREMENT
Big data (Volume, Velocity, Variety) use cases by industry:
• High Technology / Industrial Mfg.: mfg quality, warranty analysis, cross-sell
• Life Sciences: clinical trials, genomics
• Media / Entertainment: viewers / advertising effectiveness
• On-line Services / Social Media: people & career matching, web-site optimization
• Health Care: patient sensors, monitoring, EHRs, quality of care
More use cases by industry:
• Social Media / Advertising: targeted advertising; image and video processing
• Life Sciences: genome analysis
• Oil & Gas: seismic analysis
• Retail: recommendations; transaction analysis
• Security: anti-virus; fraud detection; image recognition
• Financial Services: Monte Carlo simulations; risk analysis
• Network / Gaming: user demographics; usage analysis; in-game metrics
Typical enterprise data landscape:
• Web server
• Web DBMS (MySQL, Mongo, Cassandra)
• Operational RDBMS (Oracle, SQL Server, …)
• ERP & in-house CRM
• Data warehouse RDBMS (Oracle, Teradata, …)
• In-memory analytics (HANA, Exalytics, …)
• Hadoop
Hadoop ecosystem components: Chukwa (monitoring), MapReduce (job scheduling/execution system), ZooKeeper (coordination).
[HDFS architecture diagram: a client issues metadata requests to the NameNode and block operations against DataNodes; reads go directly to DataNodes, and blocks are replicated across DataNodes.]
Master/slave architecture
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage the storage attached to the nodes they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and these blocks are stored across DataNodes.
DataNodes serve read and write requests, and perform block creation, deletion, and
replication upon instruction from the NameNode.
NameNode - This process stores the directory tree of all files in the Hadoop
Distributed File System (HDFS) and keeps track of where the file data is kept
within the cluster. Client applications contact the NameNode when they need to
locate a file, or to add, copy, or delete a file.
DataNodes - The DataNode stores data in HDFS and is responsible for
replicating data across the cluster. DataNodes interact with client applications
directly once the NameNode has supplied the DataNode's address.
WorkerNode: Unlike master nodes, whose numbers we can count on one hand, a
representative Hadoop deployment consists of dozens or hundreds of worker
nodes, which provide enough processing power to analyze anywhere from a
few hundred terabytes all the way up to a petabyte. Each worker node includes
a DataNode as well as a TaskTracker.
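The NameNode/DataNode split described above can be sketched as a toy in-memory model. All class and method names here are illustrative, not real Hadoop APIs:

```python
import random

class NameNode:
    """Toy model of HDFS metadata: maps file names to blocks,
    and blocks to the DataNodes holding their replicas."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes          # list of DataNode ids
        self.replication = replication
        self.file_blocks = {}               # file name -> [block ids]
        self.block_locations = {}           # block id -> [datanode ids]

    def add_file(self, name, num_blocks):
        blocks = [f"{name}#blk{i}" for i in range(num_blocks)]
        self.file_blocks[name] = blocks
        for b in blocks:
            # place each replica on a distinct DataNode
            self.block_locations[b] = random.sample(
                self.datanodes, min(self.replication, len(self.datanodes)))

    def locate(self, name):
        """What a client asks the NameNode: where are my blocks?"""
        return {b: self.block_locations[b] for b in self.file_blocks[name]}

nn = NameNode(datanodes=["dn1", "dn2", "dn3", "dn4"])
nn.add_file("/logs/web.log", num_blocks=2)
for block, nodes in nn.locate("/logs/web.log").items():
    print(block, "->", nodes)
```

Note that the NameNode holds only metadata; in real HDFS the client then talks to the DataNodes directly for the block contents.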
Map Reduce
HDFS is designed to store very large files across machines in a large cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and replicas are configurable per file.
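The block layout described above (fixed-size blocks, with only the last allowed to be shorter) can be illustrated with a small helper. The function is hypothetical; 128 MB is shown only as an example default:

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes of the blocks a file of `file_size` bytes
    would occupy: all full-size except possibly the last."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)   # the last block may be smaller
    return sizes

# A 300 MB file with 128 MB blocks: two full blocks plus a 44 MB tail.
mb = 1024 * 1024
print([s // mb for s in split_into_blocks(300 * mb)])  # [128, 128, 44]
```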
The NameNode receives a Heartbeat and a BlockReport from each DataNode
in the cluster.
A BlockReport lists all the blocks held on a DataNode.
• Replica selection for READ operations: HDFS tries to minimize bandwidth
consumption and latency.
• If there is a replica on the reader's node, that replica is preferred.
• An HDFS cluster may span multiple data centers: a replica in the local data
center is preferred over a remote one.
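The preference order in these bullets (same node, then same data center, then remote) amounts to picking the replica at the smallest network distance. This sketch and its field names are made up for illustration:

```python
def pick_replica(replicas, reader_node, reader_dc):
    """Choose a replica for a read, minimizing network distance.
    Each replica is a (node, datacenter) tuple."""
    def distance(replica):
        node, dc = replica
        if node == reader_node:
            return 0          # same node: no network traffic at all
        if dc == reader_dc:
            return 1          # same data center: cheap
        return 2              # remote data center: last resort
    return min(replicas, key=distance)

replicas = [("dn7", "us-east"), ("dn2", "eu-west"), ("dn9", "us-east")]
print(pick_replica(replicas, reader_node="dn2", reader_dc="eu-west"))
print(pick_replica(replicas, reader_node="dn1", reader_dc="us-east"))
```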
[Heartbeat diagram: each DataNode (storage IDs XYZ001, XYZ002, XYZ003) sends an "I am alive" heartbeat to the NameNode every 3 seconds, and the NameNode replies. A DataNode that sends no heartbeat for 10 minutes is considered dead.]
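The heartbeat rule above (a DataNode silent for 10 minutes is declared dead) boils down to a timestamp check; this sketch and its names are illustrative only:

```python
HEARTBEAT_INTERVAL = 3   # seconds between "I am alive" messages
DEAD_AFTER = 10 * 60     # seconds of silence before a node is declared dead

def live_datanodes(last_heartbeat, now):
    """last_heartbeat maps storage id -> time (seconds) of last heartbeat.
    Returns the ids still considered alive at time `now`."""
    return {dn for dn, t in last_heartbeat.items() if now - t < DEAD_AFTER}

heartbeats = {"XYZ001": 1000.0, "XYZ002": 1290.0, "XYZ003": 400.0}
print(sorted(live_datanodes(heartbeats, now=1300.0)))  # ['XYZ001', 'XYZ002']
```

Once a DataNode drops out of the live set, the NameNode re-replicates its blocks from the surviving replicas.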
Coordination in a distributed system
• Coordination: An act that multiple nodes must perform together.
• Examples:
– Group membership
– Locking
– Publisher/Subscriber
– Leader Election
– Synchronization
• Getting node coordination correct is very hard!
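As a taste of why a purpose-built service helps, here is a toy, single-process mimic of the sequential-ephemeral-node leader election that ZooKeeper enables. It only imitates the idea (no real znodes, sessions, or watches); all names are invented:

```python
import itertools

class TinyCoordinator:
    """In-process mimic of ZooKeeper's sequential ephemeral nodes:
    each participant creates a numbered node, and the lowest
    surviving number is the leader."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}   # sequence number -> member name

    def join(self, name):
        n = next(self._seq)         # like creating a SEQUENTIAL znode
        self._members[n] = name
        return n

    def leave(self, n):
        self._members.pop(n, None)  # like an ephemeral node expiring

    def leader(self):
        return self._members[min(self._members)] if self._members else None

zk = TinyCoordinator()
a = zk.join("node-a")
b = zk.join("node-b")
zk.join("node-c")
print(zk.leader())   # node-a: it holds the lowest sequence number
zk.leave(a)
print(zk.leader())   # node-b takes over when the leader's node disappears
```

The real service adds what makes this hard in a distributed setting: session timeouts, ordered change notifications, and replicated consensus among the ZooKeeper servers themselves.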
Introducing ZooKeeper
• Speed, agility, and intelligence are competitive advantages that nearly all
organizations seek.
• To support operational users and influence what should happen next, data must
be available in real time, so the organization knows what is happening now.
• Siloed clusters
• Largely a batch system
• Difficult to integrate

Hadoop 2.0
Hadoop 2 supports multiple processing models side by side: standard query
processing (Hive, with batch execution on MapReduce and interactive execution
on Tez), online data processing, real-time stream processing, and others.
NoSQL database families:
• XML-based: MarkLogic, BerkeleyDB XML
• Table-based: Cassandra, HBase, BigTable, HyperTable, Accumulo
• Graph databases: Neo4J, InfiniteGraph, FlockDB
• Riak
Multiple data stores — Stinger, NoSQL, and the rest: PhD, anyone?
[Architecture diagram: NoSQL databases (Oracle NoSQL DB, HBase) and the filesystem (HDFS) as storage layers, with analyze and visualize layers on top.]
Fast-Paced Innovation
http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Note: company logos and images are for illustration purposes only. Not a real use case for the company.