Sunteți pe pagina 1din 39

Running a Realtime Stats

Service on MySQL
Cybozu Labs, Inc.
Kazuho Oku
Background

Apr. 23 2009 Running Realtime Stats Service on MySQL 2


Who am I?

 Name: Kazuho Oku (奥 一穂)


 Original Developer of Palmscape / Xiino
 The oldest web browser for Palm OS
 Working at Cybozu Labs since 2005
 Research subsidiary of Cybozu, Inc.
 Cybozu is a leading groupware vendor in Japan
 My weblog: tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 3


Introduction of Pathtraq

Apr. 23 2009 Running Realtime Stats Service on MySQL 4


What is Pathtraq?

 Started in Aug. 2007


 Web ranking service
 One of Japan’s largest
  10,000 users submit access information
  1,000,000 access infomation per day
 like Alexa, but semi-realtime, and per-page

Apr. 23 2009 Running Realtime Stats Service on MySQL 5


What is Pathtraq? (cont'd)

 Automated Social News Service


 find what's hot
 like Google News + Digg
 calculate relevance from access stats
 Search by...
 no filtering (all the Internet)
 by category
 by keyword
 by URL (per-domain, etc.)
Apr. 23 2009 Running Realtime Stats Service on MySQL 6
How to Provide Real-time Analysis?

 Data Set (as of Apr. 23 2009)


 # of URLs: 147,748,546
 # of total accesses: 413,272,527
 Sharding is not a good option
 since we need to join the tables and aggregate
 prefix-search by URL, search by keyword, then join
with access data table
 core tables should be stored on RAM
 not on HDD, due to lots of random access

Apr. 23 2009 Running Realtime Stats Service on MySQL 9


Our Decision was to...

 Keep URL and access stats on RAM


 compression for size and speed
 Create a new message queue
 Limit Pre-computation Load
 Create our own cache, with locks
 to minimize database access
 Fulltext-search database on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 10


Our Servers

 Main Server
 Opteron 2218 x2, 64GB Mem
 MySQL, Apache
 Fulltext Search Server
 Opteron 240EE, 2GB Mem, Intel SSD
 MySQL (w. Tritonn/Senna)
 Helper Servers
 for Content Analysis
 for Screenshot Generation
Apr. 23 2009 Running Realtime Stats Service on MySQL 11
The Long Tail of the Internet
-0.44
y=C・x
# of URLs with 1/10 hits: x2.75

Apr. 23 2009 Running Realtime Stats Service on MySQL 12


Compressing URLs

Apr. 23 2009 Running Realtime Stats Service on MySQL 13


Compressing URLs

 The Challenges:
 URLs are too short for gzip, etc.
 URLs should be prefix-searchable in compressed
form
 How to run like 'http://www.mysql.com/%' on a
compressed URL?

 The Answer:
 Static PPM + Range Coder

Apr. 23 2009 Running Realtime Stats Service on MySQL 14


Static PPM

 PPM: Prediction by Partial Matching


 What is the next character after ".co"?
 The answer is "m"!
 PPM is used by 7-zip, etc.
 Static PPM is PPM with static probabilistic
model
 Many URLs (or English words) have common
patterns
 Suitable for short texts (like URLs)
Apr. 23 2009 Running Realtime Stats Service on MySQL 15
Range Coder

 A fast variant of arithmetic compression


 similar to huffmann encoding, but better
 If probability of next character being "m" was
75%, it will be encoded into 0.42 bit
 Compressed strings preserve the sort
order of uncompressed form

Apr. 23 2009 Running Realtime Stats Service on MySQL 16


Create Compression Functions

 Build prediction table from stored URLs


 Implement range coder
 took an open-source impl. and optimized it
 original impl. added some bits unnecessary at the tail
 use SSE instructions for faster operation
 coderepos.org/share/browser/lang/cplusplus/range_coder

 Link the coder and the table to create


MySQL UDFs

Apr. 23 2009 Running Realtime Stats Service on MySQL 17


Rewriting the Server Logic

 Change schema
url varchar(255) not null # with unique index

urlc varbinary(767) not null # with unique index

 Change prefix-search form


url like 'http://example.com/%'

url_compress('http://example.com/')<=urlc and
urlc<url_compress('http://example.com0')
Note: "0" is next character of '/'
Apr. 23 2009 Running Realtime Stats Service on MySQL 18
Compression Ratio

 Compression ratio: 37%


 Size of prediction table: 4MB
 Benchmark of the compression functions
 compression: 40MB/sec. (570k URLs/sec.)
 decompression: 19.3MB/sec. (280k URLs/sec.)
 fast enough since searchable in compressed form
 Prefix-search became faster
 shorter indexes lead to faster operation

Apr. 23 2009 Running Realtime Stats Service on MySQL 19


Re InnoDB Compression

 URL Compression can coexist with


InnoDB compression
 though we aren't using InnoDB compression on our
production environment

Compression Table Size


N/A 100%
URL compression 57%
InnoDB compression 50%
using both 33%
Apr. 23 2009 Running Realtime Stats Service on MySQL 20
Compressing the Stats Table

 Used to have two int columns: at, cnt


 it was waste of space, since...
 most cnt values are very small numbers
 most accesses to each URL occur on a short period (ex.
the day the blog entry was written)
 at field should be part of the indexes
at (hours since epoch) cnt (# of hits)
330168 1
330169 2
330173 1
330197 1

Apr. 23 2009 Running Realtime Stats Service on MySQL 21


Compressing the Stats Table (cont'd)

 Merge the rows into a sparse array


 example on the prev. page becomes:
(offset=330197),1,0(repeated 23 times),1,2,1
 Then compress the array
 the example becomes a blob of 8 bytes
 originally was 8 bytes x 4 rows with index
 And store the array in a single column
 fewer rows lead to smaller table, faster access

Apr. 23 2009 Running Realtime Stats Service on MySQL 22


Compressing the Stats Table (cont'd)

 Write MySQL UDFs to access the sparse


array
 cnt_add(column,at,cnt)
-- adds cnt on given index (at)
 cnt_between(column,from,to)
-- returns # of hits between given hours
 and more...

 We use int[N] arrays for vectorized calc.


 especially when creating access charts

Apr. 23 2009 Running Realtime Stats Service on MySQL 23


Create a new Message Queue

Apr. 23 2009 Running Realtime Stats Service on MySQL 24


Q4M

 A simple, reliable, fast message queue


 runs as a pluggable storage engine of MySQL
 GPL License; q4m.31tools.com
 presented yesterday at MySQL Conference :-p
 slides at tinyurl.com/q4m2009

 Used for relaying messages between our


servers

Apr. 23 2009 Running Realtime Stats Service on MySQL 25


Limiting Pre-computation Load

Apr. 23 2009 Running Realtime Stats Service on MySQL 26


Limit # of CPU-intensive Pre-computations

 Use cron & setlock


 setlock is part of daemontools by djb
 setlock
 serializes processes by using flock
 -n option: use trylock; if locked, do nothing

# use only one CPU core for pre-computation


*/2 * * * * setlock –n /tmp/tasks.lock precompute_hot_entries
5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data

Apr. 23 2009 Running Realtime Stats Service on MySQL 27


Limit # of Disk-intensive Pre-computations

 Divide pre-computation to blocks and


sleep depending on the elapsed time

my $LOAD = 0.25;

while (true) {
my $start = time();
precompute_block();
sleep(min(time - $start, 0) * (1 - $LOAD) / $LOAD);
}

Apr. 23 2009 Running Realtime Stats Service on MySQL 28


Creating our own Cache System

Apr. 23 2009 Running Realtime Stats Service on MySQL 29


The Problem

 Query cache is flushed on table update


 access stats can be (should be) cached for a
certain period
 Memcached has a thundering-herd
problem
 all clients try to read the database when a
cached-entry expires
 critical for us since our queries does joins,
aggregations, and sort operations

Apr. 23 2009 Running Realtime Stats Service on MySQL 30


Swifty and KeyedMutex

 Swifty is a mmap-based cache


 cached data shared between processes
 lock-free on read, flock on write
 notifies a single client that the accessed entry is
going to expire within few seconds
 notified client can start updating a cache entry
before it expires
 KeyedMutex
 a daemon used to block multiple clients issuing
same SQL queries
Apr. 23 2009 Running Realtime Stats Service on MySQL 31
Swifty and KeyedMutexd (cont'd)

 Source codes are available:


 coderepos.org/share/browser/lang/c/swifty
 coderepos.org/share/browser/lang/perl/Cache-Swifty
 coderepos.org/share/browser/lang/perl/KeyedMutex

Apr. 23 2009 Running Realtime Stats Service on MySQL 32


Fulltext-search on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 33


Senna / Tritonn

 Senna is a FTS engine popular in Japan


 might not work well with European languages
 Tritonn is a replacement of MyISAM FTS
 uses Senna as backend
 faster than MyISAM FTS
 Wrote patches to support SSD
 during our transition from RAM to SSD
 patches accepted in Senna 1.1.4 / Tritonn 1.0.12

Apr. 23 2009 Running Realtime Stats Service on MySQL 34


FTS: RAM-based vs. SSD-based

 Size of FTS data: 20GB


 Downgraded hardware to see if SSD-
based FTS is feasible
 Speed became ¼
 but latency of searches are well below one second

Old Hardware New Hardware


CPU Opteron 2218 (2.6GHz) x2 Opteron 240 (1.4GHz)
Memory 32GB 2GB
Storage 7200rpm SATA HDD SSD (Intel X25-M)
Apr. 23 2009 Running Realtime Stats Service on MySQL 35
Summary

Apr. 23 2009 Running Realtime Stats Service on MySQL 36


Summary

 Use UDFs for optimization


 Sometime it is easier to scale UP
 esp. when you can estimate your data growth
 Use SSD for FTS
 Baidu (China's leading search engine) uses SSD
 Most of the things introduced are OSS
 We plan to open-source our URL compression
table as well

Apr. 23 2009 Running Realtime Stats Service on MySQL 37


We are Looking for...

 If you are interested in localizing


Pathtraq to your country, please contact
us
 we do not have resources outside of Japan
 to translate the web interface
 to ask people to install our browser extension
 to follow local regulations, etc.

Apr. 23 2009 Running Realtime Stats Service on MySQL 38


Thank you for listening

tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 39

S-ar putea să vă placă și