
SINGLE PLATFORM. COMPLETE SCALABILITY.

Real Time Analytics for Big Data


Lessons Learned from Facebook
@uri1803 Head of Product GigaSpaces

About Me
MTBK Junky
A Proud Dad

Technology addict

Head of Product @ GigaSpaces

Real Time Analytics Use Cases

Ecommerce
Auction monitoring, AdWords
Search engines
Real-time marketing
Improving conversion rate
Weather reporting
Traffic analysis
Call Center Management
Supply-Chain Optimization
Quality Management in Manufacturing
SLA Monitoring and Maintenance
Global Shipment & Delivery Monitoring
Fraud Detection in Financial Companies

Analytics @ Twitter
How many requests/day?
What's the average latency?
How many signups, sms, tweets?
Desktop vs mobile users?
What devices fail at the same time?
What features get users hooked?
Duplicate detection
Sentiment analysis
Patterns and trends

Counting

Correlating

Research

Note the Time dimension

Counting: real time (msec/sec)

Correlating: near real time (min/hours)

Research: batch (days..)

The data resolution & processing models

Counting
- Mostly event driven
- High resolution: every tweet counts

Correlating
- Ad-hoc queries
- Mid resolution: aggregated counters

Research
- Pre-generated reports
- Coarse-grain resolution: trends, ..

Traditional analytics applications

Scale-up Database
- Use a traditional SQL database
- Use stored procedures for event-driven reports
- Use flash memory disks to reduce disk I/O
- Use read-only replicas to scale out read queries

Limitations
- Doesn't scale on write
- Extremely expensive (HW + SW)

Copyright 2011 Gigaspaces Ltd. All Rights Reserved

CEP: Complex Event Processing

Process the data as it comes
Maintain a window of the data in-memory

Pros
- Extremely low latency
- Relatively low cost

Cons
- Hard to scale (mostly limited to scale-up)
- Not agile: queries must be pre-generated
- Fairly complex
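
The "window of the data in-memory" idea can be sketched as a sliding-window counter. This is only an illustration in plain Java, not the API of any CEP product; the class name SlidingWindowCounter and its methods are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Counts the events seen during the most recent windowMillis milliseconds. */
class SlidingWindowCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records one event; timestamps are assumed to arrive in non-decreasing order. */
    void record(long nowMillis) {
        timestamps.addLast(nowMillis);
        evict(nowMillis);
    }

    /** Returns how many recorded events fall inside the window ending at nowMillis. */
    int count(long nowMillis) {
        evict(nowMillis);
        return timestamps.size();
    }

    // Drop events that have fallen out of the window.
    private void evict(long nowMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
    }
}
```

Only the events inside the window are kept, which is why latency stays low and also why capacity is bounded by the memory of a single machine.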


In Memory Data Grid


Distributed in-memory database
Scale out

Pros
- Scale on write/read
- Fits the event-driven (CEP style) and ad-hoc query models

Cons
- Cost of memory vs disk
- Memory capacity is limited
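
Scaling on write usually comes from partitioning: each key is owned by exactly one partition, so writes to different keys do not contend. A minimal single-process sketch, assuming simple hash routing (the PartitionedStore class is hypothetical, and a real grid would place each partition on a different machine):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy partitioned in-memory store: each key is routed to one partition by hash. */
class PartitionedStore {
    private final List<Map<String, Long>> partitions = new ArrayList<>();

    PartitionedStore(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    /** Maps a routing key to a partition index in [0, partitionCount). */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Increments the counter for key; only the owning partition is touched. */
    void increment(String key) {
        partitions.get(partitionFor(key)).merge(key, 1L, Long::sum);
    }

    /** Reads the counter for key from its owning partition (0 if absent). */
    long get(String key) {
        return partitions.get(partitionFor(key)).getOrDefault(key, 0L);
    }
}
```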

NoSQL
Use a distributed database
- HBase, Cassandra, MongoDB

Pros
- Scale on write/read
- Elastic

Cons
- Read latency
- Consistency tradeoffs are hard
- Maturity: fairly young technology

Hadoop MapReduce
Distributed batch processing

Pros
- Designed to process massive amounts of data
- Mature
- Low cost

Cons
- Not real-time


Hadoop Map/Reduce: Reality check..

"With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. [I]t will never be true real-time.." (Yahoo CTO Raymie Stata)

"Hadoop/Hive.. Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit realtime goals" (Alex Himel, Engineering Manager at Facebook)

"MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency" (Google senior director of engineering Eisar Lipkovitz)


So what's the bottom line?

One size doesn't fit all..
The solution has to be a combination of several technologies and patterns..

FACEBOOK REAL-TIME ANALYTICS SYSTEM


Goals

Show why plugins are valuable
- What value is your business deriving from it?

Make the data more actionable
- Help users take action to make their content more valuable.
- How many people see a plugin, how many people take action on it, and how many are converted to traffic back on your site.

Make the data more timely
- Went from a 48-hour turnaround to 30 seconds.
- Multiple points of failure were removed to make this goal.

Handle massive load
- 20 billion events per day (200,000 events per second)
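
As a quick check of the load figure, spreading 20 billion events evenly over one day gives a rate slightly above the rounded 200,000 quoted on the slide (even spreading is an assumption; peaks would be higher):

```latex
\frac{20 \times 10^{9}\ \text{events/day}}{86{,}400\ \text{s/day}} \approx 2.3 \times 10^{5}\ \text{events/s}
```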

The actual analytics..

Like button analytics

Comments box analytics


Technology Evaluation
MySQL DB Counters
In-Memory Counters
MapReduce
Cassandra
HBase


The solution..

[Diagram: Facebook log pipeline. Components: Scribe, HDFS, PTail, Puma, HBase. Labels: Real Time; Batch (1.5 sec); Long Term; 10,000 writes/sec per server.]

Checking the assumptions..

Memory is still core
"(We) write extremely lean log lines. The more compact the log lines the more can be stored in memory.. (We) batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of memory when creating a hashtable"

The NoSQL space is very dynamic..
"When Facebook engineers started the project 6 months ago, Cassandra did not have distributed counters which is now committed in trunk.." (Eric Hauser, Senior Software Engineer at ExactTarget)
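
The 1.5-second batching mentioned in the first quote can be sketched as a hashtable of per-URL counters that is swapped out on every flush. This is an illustrative sketch, not Facebook's code; the BatchingCounter class is hypothetical, and in practice a timer (for example a ScheduledExecutorService) would call flush() roughly every 1.5 seconds.

```java
import java.util.HashMap;
import java.util.Map;

/** Accumulates per-URL counts in memory and hands them off in batches. */
class BatchingCounter {
    private Map<String, Long> pending = new HashMap<>();

    /** Counts one event for the given URL in the current batch. */
    synchronized void onEvent(String url) {
        pending.merge(url, 1L, Long::sum);
    }

    /** Returns the current batch and starts a new empty one. */
    synchronized Map<String, Long> flush() {
        Map<String, Long> batch = pending;
        pending = new HashMap<>();
        return batch;
    }
}
```

The longer the interval between flushes, the more distinct URLs the hashtable holds, which is the memory pressure the quote describes.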


Facebook Analytics.Next..
What if..
- We can rely on memory as a reliable store?
- We can't decide on a particular NoSQL database?
- We need to package the solution as a product?


Step 1: Use memory..

We rely on memory anyway to get 10k msg/sec..
Why not use memory to store the events?
Reliability is achieved through redundancy and replication

[Diagram: events flow into a memory grid made up of several data grid instances.]
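
Reliability through replication can be illustrated with a primary copy and a synchronously updated backup. This is a single-process sketch under that assumption; the ReplicatedCounterStore class is hypothetical, and a real grid would keep the backup on a different machine.

```java
import java.util.HashMap;
import java.util.Map;

/** Keeps every counter in two in-memory copies so one copy can be lost. */
class ReplicatedCounterStore {
    private Map<String, Long> primary = new HashMap<>();
    private Map<String, Long> backup = new HashMap<>();

    /** Applies the update to the primary and, synchronously, to the backup. */
    void increment(String key) {
        primary.merge(key, 1L, Long::sum);
        backup.merge(key, 1L, Long::sum);
    }

    long get(String key) {
        return primary.getOrDefault(key, 0L);
    }

    /** Simulates losing the primary: the backup is promoted and a new backup is copied from it. */
    void failPrimary() {
        primary = backup;
        backup = new HashMap<>(primary);
    }
}
```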


[Diagram: the same memory grid, with events written to the data grid through any API.]


Step 2: Collocate

Putting the code together with the data.

[Diagram: events flow into a processing grid that runs alongside each data grid instance.]

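import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.polling.Polling;

// Polling container: takes unprocessed Data objects from the space and handles them in place.
@EventDriven
@Polling
public class SimpleListener {

    // Template that matches only Data objects that have not been processed yet.
    @EventTemplate
    Data unprocessedData() {
        Data template = new Data();
        template.setProcessed(false);
        return template;
    }

    // Called for each matching event; the returned object is written back to the space.
    @SpaceDataEvent
    public Data eventListener(Data event) {
        // process Data here
        event.setProcessed(true);
        return event;
    }
}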
Step 3: Write behind to SQL/NoSQL

[Diagram: events flow into the processing grid and data grid; write-behind passes data through a data source adaptor to open long-term persistency such as MySQL, HBase, or Cassandra.]
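
The write-behind pattern can be sketched as follows: a write completes as soon as memory is updated, and queued changes are pushed to the long-term store later in batches. This is a simplified illustration, not the GigaSpaces implementation; DataSourceAdaptor and WriteBehindStore are hypothetical names, and a real system would call flush() from a background thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical adaptor for the long-term store (MySQL, HBase, Cassandra, ...). */
interface DataSourceAdaptor {
    void persist(List<String> batch);
}

/** In-memory store that defers persistence to a pluggable adaptor. */
class WriteBehindStore {
    private final Map<String, String> memory = new HashMap<>();
    private final Queue<String> dirty = new ArrayDeque<>();
    private final DataSourceAdaptor adaptor;

    WriteBehindStore(DataSourceAdaptor adaptor) {
        this.adaptor = adaptor;
    }

    /** Updates memory and queues the change; nothing is persisted yet. */
    void put(String key, String value) {
        memory.put(key, value);
        dirty.add(key + "=" + value);
    }

    String get(String key) {
        return memory.get(key);
    }

    /** Sends all queued changes to the adaptor as one batch and returns the batch size. */
    int flush() {
        List<String> batch = new ArrayList<>(dirty);
        dirty.clear();
        if (!batch.isEmpty()) {
            adaptor.persist(batch);
        }
        return batch.size();
    }
}
```

Because the store depends only on the adaptor interface, the choice of database behind it can change without touching the event-processing code.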

Economic Data Scaling

Combine memory and disk
- Memory costs 10x to 100x less than disk at high data access rates (Stanford research)
- Disk costs less for high capacity at lower access rates
- Solution: memory for short-term data, disk for long-term data

Dell high-memory blade pricing
- Memory: 192 GB
- Cores: 12
- Clock speed: 3.2 GHz
- Price: $367/month ($1.9/GB)
- 1 TB (~960 GB) per month: 5 blades = $1,835/month

Only ~16 GB is required to store the log in memory (500-byte messages at 10k/h), at a cost of ~$32/month per server.
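
One way to reproduce the ~16 GB and ~$32 figures is to read "10k/h" as 10,000 messages per second retained for one hour. That reading is an assumption, not something the slide states:

```latex
500\ \text{B} \times 10{,}000\ \text{msg/s} \times 3{,}600\ \text{s} = 1.8 \times 10^{10}\ \text{B} \approx 16.8\ \text{GiB}
\qquad
16.8\ \text{GB} \times \$1.9/\text{GB} \approx \$32\ \text{per month}
```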


Economic Operations Scaling

- Automation: reduce operational cost
- Elastic scaling: reduce over-provisioning cost
- Cloud portability (JClouds): choose the right cloud for the job
- Cloud bursting: scavenge extra capacity when needed

Putting it all together

In Memory Data Grid / RT Processing Grid
- Light event processing
- Map-reduce
- Event driven
- Execute code with data
- Transactional
- Secured
- Elastic

NoSQL DB
- Low cost storage
- Write/Read scalability
- Dynamic scaling
- Raw data and aggregated data

[Diagram labels: Event Sources; Write behind; Analytic Application; Generate Patterns.]
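
"Map-reduce" combined with "execute code with data" can be pictured as running a small task against each partition where it lives and then combining the partial results. A minimal local sketch, assuming each partition is simply a map of counters (GridMapReduce is a hypothetical helper, not a product API):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Runs a task on every partition and sums the partial results. */
class GridMapReduce {
    static long mapReduce(List<Map<String, Long>> partitions,
                          Function<Map<String, Long>, Long> task) {
        long total = 0;
        for (Map<String, Long> partition : partitions) {
            // "Map" step: in a real grid this would execute on the node that owns the partition.
            total += task.apply(partition);
        }
        // "Reduce" step: the partial results have been summed into one value.
        return total;
    }
}
```

Only the small partial results travel over the network in a real deployment; the raw events stay where they are stored.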

Putting it all together

// Run a Groovy script inside the grid through a native query (em is presumably a JPA EntityManager).
Script script = new StaticScript("groovy", "println 'hi'; return 0");
Query q = em.createNativeQuery("execute ?");
q.setParameter(1, script);
Integer result = (Integer) q.getSingleResult();

5x better performance per server!

Test setup
- Event injector: up to 128 threads
- GigaSpaces / other message server
- App services: up to 128 threads

Hardware
- HP DL380 G6 servers, each with:
- 2 Intel quad-core Xeon X5560 processors (2.8 GHz Nehalem)
- 32 GB RAM (4 GB per core)
- 6 x 146 GB 15K RPM SAS disks
- Linux: Red Hat 5.2

[Chart: throughput of GigaSpaces (GS) vs. other (WLS), on an axis from 0 to 60,000. Chart labels: event injection throughput; EJB/Remoting throughput with service invocation; write multiple throughput.]

50,000 writes/sec per server

Putting it all together: Elastic Big Data Platform

The best of both worlds
- Supports real time and batch

Fully managed stack
- Makes the development and deployment of Big Data applications significantly simpler

Extremely cost effective
- Best ratio of disk + memory
- Runs on any cloud
- TRUE cloud bursting support


Other benefits

Designed for real time event processing
- Built-in Pub/Sub
- Built-in CEP

Open
- Standard query
- Any database

Reliable
- Transactional, consistent
- Survives complete database failure

Simple
- Can be packaged into a single product
- Fully automated deployment
- End-to-end management and monitoring


Further reading..

natishalom.typepad.com
Real Time Analytics for Big Data: An Alternative Approach

GigaOM
Big data in real time is no fantasy

Highscalability.com
Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day

GigaSpaces.com


THANK YOU!
@uri1803 http://blog.gigaspaces.com


Economic Scaling

[Diagram: Cloudify architecture. Labels: Application; Cluster; Console; Controller; Cloudify Agent; Worker Instance; VM Role; VM Instance; JClouds Cloud Driver; Scale-in; Scale-out; Load Balancer; Network; Storage; Compute; Services.]

