About Me
- MTBK Junky
- A Proud Dad
- Technology addict

Analytics @ Twitter
Counting
- How many requests/day?
- What's the average latency?
- How many signups, SMS, tweets?

Correlating
- Desktop vs. mobile users?
- What devices fail at the same time?
- What features get users hooked?

Research
- Duplicate detection
- Sentiment analysis
- Patterns and trends
Limitations
- Doesn't scale on write
- Extremely expensive (HW + SW)
Cons
- Hard to scale (mostly limited to scale-up)
- Not agile - queries must be pre-generated
- Fairly complex
Cons
- Cost of memory vs. disk
- Memory capacity is limited
Copyright 2011 Gigaspaces Ltd. All Rights Reserved
NoSQL
Use a distributed database: HBase, Cassandra, MongoDB

Pros
- Scale on write/read
- Elastic

Cons
- Read latency
- Consistency tradeoffs are hard
- Maturity - fairly young technology
Hadoop MapReduce
Distributed batch processing

Pros
- Designed to process massive amounts of data
- Mature
- Low cost

Cons
- Not real-time
"With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. [I]t will never be true real-time." (Yahoo CTO Raymie Stata)
"Hadoop/Hive... Not realtime. Many dependencies. Lots of points of failure. Complicated system. Not dependable enough to hit realtime goals." (Alex Himel, Engineering Manager at Facebook)

"MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency." (Google senior director of engineering Eisar Lipkovitz)
One size doesn't fit all.
The solution has to be a combination of several technologies and patterns.
Goals
- Show why plugins are valuable
- What value is your business deriving from it?
Technology Evaluation
- MySQL DB counters
- In-memory counters
- MapReduce
- Cassandra
- HBase
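The solution
[Diagram: FACEBOOK log events pass through Scribe into HDFS, are read by PTail and aggregated by Puma in batches of about 1.5 sec, and are stored in HBase. Path labels: Real Time, Batch.]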
"(We) write extremely lean log lines. The more compact the log lines the more can be stored in memory."

"(We) batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of memory when creating a hashtable."

"When Facebook engineers started the project 6 months ago, Cassandra did not have distributed counters, which is now committed in trunk." (Eric Hauser, Senior Software Engineer at ExactTarget)
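The batching described in the quotes above can be illustrated with a minimal sketch: per-URL counts accumulate in an in-memory hashtable and are handed off as one batch once a time window (1.5 seconds here) has elapsed. The class name `BatchingCounter` and its methods are hypothetical and are not taken from Facebook's Puma; timestamps are passed in explicitly to keep the sketch deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: aggregates per-URL counts in memory and flushes them in time-based batches. */
final class BatchingCounter {
    private final long windowMillis;
    private final Map<String, Long> counts = new HashMap<>();
    private long windowStart;

    BatchingCounter(long windowMillis, long nowMillis) {
        this.windowMillis = windowMillis;
        this.windowStart = nowMillis;
    }

    /**
     * Records one event for the given URL.
     * Returns the previous batch if the window had elapsed before this event, otherwise null.
     */
    Map<String, Long> record(String url, long nowMillis) {
        Map<String, Long> flushed = null;
        if (nowMillis - windowStart >= windowMillis) {
            // Window elapsed: hand off a copy of the counts and start a new window.
            flushed = new HashMap<>(counts);
            counts.clear();
            windowStart = nowMillis;
        }
        counts.merge(url, 1L, Long::sum);
        return flushed;
    }

    public static void main(String[] args) {
        BatchingCounter counter = new BatchingCounter(1500, 0);
        counter.record("/a", 100);
        counter.record("/a", 700);
        counter.record("/b", 1200);
        // This event arrives after the 1.5 s window, so the first three events are flushed as one batch.
        Map<String, Long> batch = counter.record("/a", 1600);
        System.out.println(batch);
    }
}
```

A longer window means fewer writes to the backing store but a larger hashtable, which is the memory tradeoff the quote mentions.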
Facebook Analytics.Next

What if:
- We can rely on memory as a reliable store?
- We can't decide on a particular NoSQL database?
- We need to package the solution as a product?
[Diagram: Memory Grid / Data Grid, accessible through any API]
Step 2: Collocate
Putting the code together with the data.
[Diagram: Events from FACEBOOK flow into a Processing Grid that sits together with the Data Grid]
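As a rough illustration of collocation, the sketch below routes each event by key to one partition, and the counter update then runs against that partition's local map rather than a remote store. This is a plain-Java model under stated assumptions, not the GigaSpaces API; `CollocatedGrid`, `Partition`, and the hash-based routing are hypothetical names and choices made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: each partition holds a slice of the data plus the code that updates it. */
final class CollocatedGrid {

    static final class Partition {
        final Map<String, Long> counters = new HashMap<>();

        // Processing happens next to the data: no remote call is needed to update the counter.
        void process(String key) {
            counters.merge(key, 1L, Long::sum);
        }
    }

    private final List<Partition> partitions = new ArrayList<>();

    CollocatedGrid(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new Partition());
        }
    }

    /** Routing assumption: the same key always maps to the same partition. */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void onEvent(String key) {
        partitions.get(partitionFor(key)).process(key);
    }

    long count(String key) {
        return partitions.get(partitionFor(key)).counters.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CollocatedGrid grid = new CollocatedGrid(4);
        grid.onEvent("example.com/page");
        grid.onEvent("example.com/page");
        System.out.println(grid.count("example.com/page"));
    }
}
```

Because all events for one key land on one partition, each counter has a single writer and partitions can be added to scale out.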
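// Imports assume the GigaSpaces OpenSpaces event-container API; Data is the application's own class.
import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.polling.Polling;

@EventDriven @Polling
public class SimpleListener {

    // Template matched against the data grid: only unprocessed Data entries are picked up.
    @EventTemplate
    Data unprocessedData() {
        Data template = new Data();
        template.setProcessed(false);
        return template;
    }

    // The event-handling method is not shown on the slide.
}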
[Diagram: FACEBOOK events are handled by the Processing Grid and Data Grid, with write-behind from the Data Grid to MySQL, HBase, or Cassandra]
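The write-behind pattern named in the diagram can be sketched as follows: writes are acknowledged from memory and queued, and a separate flush step later pushes the queued changes to the backing database in one batch. This is an illustrative model only; `WriteBehindStore` and its `Consumer`-based backing store are hypothetical stand-ins, not the GigaSpaces mirror service or any real database driver.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical sketch: memory is the primary store; the database is updated asynchronously in batches. */
final class WriteBehindStore {
    private final Map<String, Long> memory = new HashMap<>();
    private final Queue<Map.Entry<String, Long>> pending = new ArrayDeque<>();
    private final Consumer<List<Map.Entry<String, Long>>> backingStore;

    WriteBehindStore(Consumer<List<Map.Entry<String, Long>>> backingStore) {
        this.backingStore = backingStore;
    }

    /** The write completes against memory; persistence is deferred. */
    void put(String key, long value) {
        memory.put(key, value);
        pending.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
    }

    Long get(String key) {
        return memory.get(key);
    }

    /** Drains the queued changes to the backing store as one batch and returns the batch size. */
    int flush() {
        List<Map.Entry<String, Long>> batch = new ArrayList<>(pending);
        pending.clear();
        if (!batch.isEmpty()) {
            backingStore.accept(batch);
        }
        return batch.size();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> persisted = new ArrayList<>();
        WriteBehindStore store = new WriteBehindStore(persisted::addAll);
        store.put("/a", 1L);
        store.put("/b", 5L);
        System.out.println("readable before flush: " + store.get("/a"));
        System.out.println("flushed entries: " + store.flush());
    }
}
```

In a real deployment the flush would run on a timer or background thread, and the in-memory copy would need replication to be treated as reliable.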
High-memory server (Dell)
- Memory: 192GB
- Cores: 12
- Clock speed: 3.2 GHz
- Price: $367/month ($1.9/GB)
- 1 TB (~960GB) per month: 5x blades = $1835/month
Only ~16GB is required to store the log in memory (500b messages at 10k/h), at a cost of ~$32/month per server.
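The cost figures above follow from simple arithmetic, which this small sketch reproduces from the slide's own inputs ($367/month for 192GB, five blades, ~16GB of log data). The class and method names are hypothetical. With the unrounded per-GB price the 16GB figure comes out near $30.6/month; the slide's ~$32 matches rounding the price up to about $2/GB.

```java
/** Hypothetical helper that reproduces the slide's memory-cost arithmetic. */
final class MemoryCost {

    /** Monthly price per GB of RAM for one server. */
    static double pricePerGb(double monthlyPrice, double memoryGb) {
        return monthlyPrice / memoryGb;
    }

    /** Monthly cost of holding dataGb in memory at the given per-GB price. */
    static double monthlyCost(double dataGb, double pricePerGb) {
        return dataGb * pricePerGb;
    }

    /** Monthly price of a cluster of identical servers. */
    static double clusterPrice(int servers, double monthlyPricePerServer) {
        return servers * monthlyPricePerServer;
    }

    public static void main(String[] args) {
        double perGb = pricePerGb(367.0, 192.0);      // roughly 1.91 $/GB
        double cluster = clusterPrice(5, 367.0);      // 5 blades, 5 * 192GB = 960GB in total
        double logCost = monthlyCost(16.0, perGb);    // ~16GB of log data
        System.out.println("price per GB: " + perGb);
        System.out.println("5 blades per month: " + cluster);
        System.out.println("16GB per month: " + logCost);
    }
}
```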
- Automation - reduce operational cost
- Elastic scaling - reduce over-provisioning cost
- Cloud portability (JClouds) - choose the right cloud for the job
- Cloud bursting - scavenge extra capacity when needed
[Diagram: Event Sources feed the Analytic Application, with write-behind to the backing store]
Hardware
- Linux HP DL380 G6 servers, each with:
  - 2 Intel quad-core Xeon X5560 processors (2.8 GHz Nehalem)
  - 32 GB RAM (4GB per core)
  - 6 * 146 GB 15K RPM SAS disks
  - Red Hat 5.2

[Chart: throughput (axis 0 to 60,000), GigaSpaces (GS) vs. Other (WLS). Panels: event injection throughput, event injection with service invocation, EJB/Remoting throughput, write multiple throughput.]
Other benefits
Designed for real time event processing
Open
Reliable
Simple
Can be packaged into a single product
Fully automated deployment
End to end management and monitoring
Further reading
- natishalom.typepad.com: Real Time Analytics for Big Data: An Alternative Approach
- GigaOM: Big data in real time is no fantasy
- Highscalability.com: Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day
- GigaSpaces.com
THANK YOU!
@uri1803
http://blog.gigaspaces.com
Economic Scaling
[Diagram: Cloudify Agents running on VM instances]