
Building Scalable Web Apps

with Google App Engine


Brett Slatkin
June 14, 2008
Agenda

Using the Python runtime effectively


Numbers everyone should know
Tools for storing and scaling large data sets
Example: Distributed counters
Example: A blog
Prevent repeated, wasteful work
Prevent repeated, wasteful work

Loading Python modules on every request can be slow


Reuse main() to address this:
import wsgiref.handlers

def main():
  wsgiref.handlers.CGIHandler().run(my_app)   # my_app is your WSGI application

if __name__ == "__main__":
  main()
Lazy-load big modules to reduce the "warm-up" cost
def my_expensive_operation():
  import big_module
  big_module.do_work()
Take advantage of "preloaded" modules
Prevent repeated, wasteful work 2

Avoid large result sets


In-memory sorting and filtering can be slow
Make the Datastore work for you
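For example, a query can do the filtering, ordering, and limiting server-side. A sketch using a hypothetical Greeting kind with a date property:

from google.appengine.ext import db

class Greeting(db.Model):   # hypothetical example kind
  date = db.DateTimeProperty(auto_now_add=True)

# Slow: pull every entity, then sort and slice in memory
latest = sorted(db.GqlQuery('SELECT * FROM Greeting'),
                key=lambda g: g.date, reverse=True)[:10]

# Better: let the Datastore order and limit the results
latest = db.GqlQuery(
    'SELECT * FROM Greeting ORDER BY date DESC').fetch(10)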

Avoid repeated queries


Landing pages that use the same query for everyone
Incoherent caching
Use memcache for a consistent view:
from google.appengine.api import memcache
from google.appengine.ext import db

results = memcache.get('main_results')
if results is None:
  results = db.GqlQuery('...').fetch(10)
  memcache.add('main_results', results, 60)
Numbers everyone should know
Numbers everyone should know

Writes are expensive!


Datastore is transactional: writes require disk access
Disk access means disk seeks

Rule of thumb: 10ms for a disk seek


Simple math:
1s / 10ms = 100 seeks/sec maximum

Depends on:
The size and shape of your data
Doing work in batches (batch puts and gets)
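A quick sketch of batching (the Greeting kind here is hypothetical): one batch RPC instead of twenty separate calls.

from google.appengine.ext import db

class Greeting(db.Model):   # hypothetical example kind
  message = db.StringProperty()

greetings = [Greeting(message='hello %d' % i) for i in range(20)]
db.put(greetings)                                # one batch write
fetched = db.get([g.key() for g in greetings])   # one batch read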
Numbers everyone should know 2

Reads are cheap!


Reads do not need to be transactional, just consistent

Data is read from disk once, then it's easily cached


All subsequent reads come straight from memory

Rule of thumb: 250usec for 1MB of data from memory


Simple math:
1s / 250usec = 4,000 reads/sec, so ~4GB/sec maximum at 1MB per read
For a 1MB entity, that's 4000 fetches/sec
Tools for storing data
Tools for storing data: Entities

Fundamental storage type in App Engine


Schemaless
Set of property name/value pairs
Most properties indexed and efficient to query
Other large properties not indexed (Blobs, Text)

Think of it as an object store, not relational


Kinds are like classes
Entities are like object instances
Relationships between Entities are expressed with Keys
Reference properties
One to many, many to many
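A sketch of the object-store view (the Author and Story kinds are made up for illustration):

from google.appengine.ext import db

class Author(db.Model):                  # a kind is like a class
  name = db.StringProperty()             # indexed, efficient to query

class Story(db.Model):                   # entities are instances of a kind
  title = db.StringProperty()            # indexed, efficient to query
  body = db.TextProperty()               # large value, not indexed
  author = db.ReferenceProperty(Author)  # key-based link: one Author, many Stories

brett = Author(name='Brett')
brett.put()
Story(title='Scaling', body='...', author=brett).put()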
Tools for storing data: Keys

Key corresponds to the Bigtable row for an Entity


Bigtable accessible as a distributed hashtable
Get() by Key: Very fast! No scanning, just copying data

Limitations:
Only one ID or key_name per Entity
Cannot change ID or key_name later
key_name limited to 500 bytes
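A sketch of a key lookup, reusing the hypothetical Author kind: pick the key_name up front, then fetch with no query at all.

from google.appengine.ext import db

class Author(db.Model):   # hypothetical example kind
  name = db.StringProperty()

Author(key_name='brett', name='Brett').put()   # key_name chosen at creation

# Later: no scan, just a direct row lookup by Key
author = Author.get_by_key_name('brett')
same_author = db.get(db.Key.from_path('Author', 'brett'))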
Tools for storing data: Transactions

ACID transactions
Atomicity, Consistency, Isolation, Durability

No queries in transactions
Transactional read and write with Get() and Put()
Common practice
Query, find what you need
Transact with Get() and Put()
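A sketch of that pattern with the hypothetical Author kind: query outside the transaction, then re-read and update the entity inside it.

from google.appengine.ext import db

class Author(db.Model):   # hypothetical example kind
  name = db.StringProperty()
  story_count = db.IntegerProperty(default=0)

# 1. Query (not transactional) to find what you need
author = Author.gql('WHERE name = :1', 'Brett').get()

# 2. Transact on it with Get() and Put()
def txn():
  fresh = db.get(author.key())   # re-read inside the transaction
  fresh.story_count += 1
  fresh.put()
db.run_in_transaction(txn)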

How to provide a consistent view in queries?


Tools for storing data: Entity groups

Closely related Entities can form an Entity group


Stored logically/physically close to each other

Define your transactionality


RDBMS: Row and table locking
Datastore: Transactions across a single Entity group
"Locking" one Entity in a group locks them all
Serialized writes to the whole group (in transactions)
Not a traditional lock; writers attempt to complete in parallel
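A sketch of forming an Entity group with a parent key (the Account and Receipt kinds are hypothetical): both writes commit or fail together.

from google.appengine.ext import db

class Account(db.Model):
  balance = db.IntegerProperty(default=0)

class Receipt(db.Model):
  amount = db.IntegerProperty()

Account(key_name='brett').put()          # a root entity defines the group

def txn():
  acct = Account.get_by_key_name('brett')
  acct.balance -= 5
  acct.put()
  Receipt(parent=acct, amount=5).put()   # child entity, same Entity group
db.run_in_transaction(txn)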
Tools for storing data: Entity groups 2

Hierarchical
Each Entity may have a parent
A "root" node defines an Entity group
Hierarchy of child Entities can go many levels deep
Watch out! Serialized writes for all children of the root
Datastore scales wide
Each Entity group has serialized writes
No limit to the number of Entity groups to use in parallel
Think of it as many independent hierarchies of data
Tools for storing data: Entity groups 3

Entity groups all transacting in parallel:

[Diagram: four Entity groups, each a Root with a Child, committing Txn 1 through Txn 4 in parallel]


Tools for storing data: Entity groups 4

Pitfalls
Large Entity groups = high contention = failed transactions
Not thinking about write throughput is bad
Structure your data to match your usage patterns
Good news
Query across entity groups without serialized access!
Consistent view across all entity groups
No partial commits visible
All Entities in a group are the latest committed version
Example: Counters
Counters

Using Model.count()
Bigtable doesn't know counts by design
O(N); cannot be O(1); must scan every Entity row!
Use an Entity with a count property:

class Counter(db.Model):
  count = db.IntegerProperty()
Frequent updates = high contention!
Transactional writes are serialized and too slow
Fundamental limitation of distributed systems
Counters: Before and after

[Diagram: a single Counter entity on the left vs. several sharded Counter entities on the right]


Counters: Sharded

Shard counters into multiple Entity groups


Pick an Entity at random and update it transactionally
Combine sharded Entities together on reads
"Contention" reduced by 1/N
Sharding factor can be changed with little difficulty
Counters: Models

class CounterConfig(db.Model):
  name = db.StringProperty(required=True)
  num_shards = db.IntegerProperty(required=True, default=1)

class Counter(db.Model):
  name = db.StringProperty(required=True)
  count = db.IntegerProperty(required=True, default=0)
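Since the sharding factor lives in CounterConfig, raising it is one small transactional update; a sketch with a hypothetical increase_shards() helper:

def increase_shards(name, num):
  def txn():
    config = CounterConfig.get_by_key_name(name)
    if config is not None and config.num_shards < num:
      config.num_shards = num
      config.put()
  db.run_in_transaction(txn)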
Counters: Get the count

def get_count(name):
  total = 0
  for counter in Counter.gql('WHERE name = :1', name):
    total += counter.count
  return total
Counters: Increment the count

def increment(name):
  config = CounterConfig.get_or_insert(name, name=name)
  def txn():
    index = random.randint(0, config.num_shards - 1)
    shard_name = name + str(index)
    counter = Counter.get_by_key_name(shard_name)
    if counter is None:
      counter = Counter(key_name=shard_name, name=name)
    counter.count += 1
    counter.put()
  db.run_in_transaction(txn)
Counters: Cache reads

def get_count(name):
  total = memcache.get(name)
  if total is None:
    total = 0
    for counter in Counter.gql('WHERE name = :1', name):
      total += counter.count
    # Stored as a string so memcache.incr() can update it in place
    memcache.add(name, str(total), 60)
  return total
Counters: Cache writes

def increment(name):
  config = CounterConfig.get_or_insert(name, name=name)
  def txn():
    index = random.randint(0, config.num_shards - 1)
    shard_name = name + str(index)
    counter = Counter.get_by_key_name(shard_name)
    if counter is None:
      counter = Counter(key_name=shard_name, name=name)
    counter.count += 1
    counter.put()
  db.run_in_transaction(txn)
  memcache.incr(name)
Example: Building a Blog
Building a Blog

Standard blog
Multiple blog posts
Each post has comments
Efficient paging without using queries with offsets
Remember, Bigtable doesn't know counts!
Building a Blog: Blog entries

Blog entries with an index


Having an index establishes a rigid ordering
Index enables efficient paging
This is a global counter, but it's okay
Low write throughput of overall posts = no contention
Building a Blog: Models

class BlogIndex(db.Model):
  max_index = db.IntegerProperty(required=True, default=0)

class BlogEntry(db.Model):
  index = db.IntegerProperty(required=True)
  title = db.StringProperty(required=True)
  body = db.TextProperty(required=True)
Building a Blog: Posting an entry

def post_entry(blogname, title, body):
  def txn():
    blog_index = BlogIndex.get_by_key_name(blogname)
    if blog_index is None:
      blog_index = BlogIndex(key_name=blogname)
    new_index = blog_index.max_index
    blog_index.max_index += 1
    blog_index.put()
    new_entry = BlogEntry(
        key_name=blogname + str(new_index),
        parent=blog_index, index=new_index,
        title=title, body=body)
    new_entry.put()
  db.run_in_transaction(txn)
Building a Blog: Posting an entry 2

Hierarchy of Entities:

[Diagram: the BlogIndex root entity with its child BlogEntry entities]
Building a Blog: Getting one entry

def get_entry(blogname, index):
  entry = BlogEntry.get_by_key_name(
      blogname + str(index),
      parent=db.Key.from_path('BlogIndex', blogname))
  return entry

That's it! Super fast!


Building a Blog: Paging

def get_entries(start_index):
  extra = None
  if start_index is None:
    entries = BlogEntry.gql(
        'ORDER BY index DESC').fetch(POSTS_PER_PAGE + 1)
  else:
    start_index = int(start_index)
    entries = BlogEntry.gql(
        'WHERE index <= :1 ORDER BY index DESC',
        start_index).fetch(POSTS_PER_PAGE + 1)
  if len(entries) > POSTS_PER_PAGE:
    extra = entries[-1]
    entries = entries[:POSTS_PER_PAGE]
  return entries, extra
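A sketch of how a handler might use get_entries (the MainPage handler is hypothetical): show one page, and link to the next page using the extra entry's index as the new start_index.

from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):
  def get(self):
    entries, extra = get_entries(self.request.get('start_index') or None)
    next_link = ''
    if extra is not None:
      next_link = '<a href="/?start_index=%d">Older posts</a>' % extra.index
    self.response.out.write(
        ''.join('<h2>%s</h2>' % entry.title for entry in entries) + next_link)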
Building a Blog: Comments

High write-throughput
Can't use a shared index
Would like to order by post date
Post dates aren't unique, so we can't use them to page:
2008-05-26 22:11:04.1000 Before
2008-05-26 22:11:04.1234 My post
2008-05-26 22:11:04.1234 This is another post
2008-05-26 22:11:04.1234 And one more post
2008-05-26 22:11:04.1234 The last post
2008-05-26 22:11:04.2000 After
Building a Blog: Composite properties

Make our own composite string property:


"post time | user ID | comment ID"
Use a per-user index to assign each user's comment IDs
Each index is in a separate Entity group
Guarantees a unique ordering when querying across Entity groups:

2008-05-26 22:11:04.1000|brett|3 Before


2008-05-26 22:11:04.1234|jon|3 My post
2008-05-26 22:11:04.1234|jon|4 This is another post
2008-05-26 22:11:04.1234|ryan|4 And one more post
2008-05-26 22:11:04.1234|ryan|5 The last post
2008-05-26 22:11:04.2000|ryan|2 After
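A sketch of how the composite property might be built (UserCommentIndex, Comment, and sort_key are illustrative names, not from the talk): each user's index hands out comment IDs in its own Entity group, and the combined string is unique and sortable.

import datetime
from google.appengine.ext import db

class UserCommentIndex(db.Model):              # key_name = user ID
  next_comment_id = db.IntegerProperty(required=True, default=0)

class Comment(db.Model):
  body = db.TextProperty(required=True)
  sort_key = db.StringProperty(required=True)  # "post time|user ID|comment ID"

def post_comment(user_id, body):
  now = datetime.datetime.utcnow().isoformat()
  def txn():
    index = UserCommentIndex.get_by_key_name(user_id)
    if index is None:
      index = UserCommentIndex(key_name=user_id)
    comment_id = index.next_comment_id
    index.next_comment_id += 1
    index.put()
    Comment(parent=index, body=body,           # comment lives in the user's group
            sort_key='%s|%s|%010d' % (now, user_id, comment_id)).put()
  db.run_in_transaction(txn)

# Querying across all Entity groups still yields one consistent ordering:
# Comment.gql('ORDER BY sort_key DESC').fetch(20)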
Building a Blog: Composite properties 2

High throughput because of parallelism

[Diagram: each User has its own Index Entity group producing Comments, all writing in parallel]


What to remember
What to remember

Minimize Python runtime overhead

Minimize waste

Why Query when you can Get?

Structure your data to match your load


Optimize for low write contention
Think about Entity groups

Memcache is awesome; use it!


Learn more
code.google.com
