
Data modeling with DynamoDB

brewing.codes/2017/11/13/dynamo-data-modeling

November 13, 2017

DynamoDB is a truly unique and powerful database. It provides predictably low
latency, a vast array of features, tight integration with other AWS services, and almost no
operational load.

The only problem is that its data model can seem bizarre. The moment you start
learning about it you will stumble upon partition keys, global secondary indexes, and
other unusual concepts. Where would you even begin?

In this article, I will explain how to use core DynamoDB features and how to fit your
data into them. To make it practical, I will show how to create a simple database that stores
information about videos on a video-sharing website.

Data in DynamoDB
Before we start designing a data store for our application, let's talk about how data is
organized in DynamoDB. The most high-level element in DynamoDB is a table. When you
work with data in DynamoDB, you work with a single table, and there are no operations
that span multiple tables. In this article, I will call a logical group of tables a database, but
there is no special support for this in DynamoDB.

Every table is a collection of items, and every item is a collection of key/value pairs called
attributes. In contrast to relational databases, DynamoDB does not enforce a strict
schema. This means that different items can have different numbers of attributes, and
attributes with the same name can have different types. Every item must have a total
size of less than 400KB.

The only exception to this rule is the key attributes. Every item in a table must have a
key, and DynamoDB enforces the types of the key attributes.

DynamoDB supports several data types that can be divided into scalar types, set types,
and document types. Scalar types include:

String – a UTF-8 string, limited only by the maximum size of an item
Number – a floating-point number with up to 38 digits of precision
Binary – an array of unsigned bytes, limited by the maximum size of an item
Boolean – true or false values
Null – represents an unknown or undefined state

// Scalar types item
{
    "videoId": "rk67v9",
    "name": "How to make pancakes",
    // Binary image represented as Base64
    "featuredImage": "ab4f...86ty",
    // Binary video data, also Base64 (placeholder value)
    "videoFile": "hd93...k2f7",
    "length": 3786,
    "published": true,
    "rating": null
}

Set types, just as the name suggests, are special types for sets. DynamoDB supports sets
of Strings, Numbers, and Binary values. These sets do not store duplicate values and do
not preserve order. Surprisingly, DynamoDB does not support empty sets.
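
To illustrate, here is a minimal sketch of writing a String Set using the boto3 library for Python; the table and attribute names are hypothetical. The boto3 resource API serializes a Python set of strings into a DynamoDB String Set:

# A sketch: storing a String Set with boto3 (table and attribute names are hypothetical)
import boto3

videos = boto3.resource("dynamodb").Table("videos")
videos.put_item(
    Item={
        "videoId": "rk67v9",
        # A Python set of strings becomes a DynamoDB String Set (SS);
        # duplicates are collapsed, and the set must not be empty
        "tags": {"cooking", "breakfast", "tutorial"},
    }
)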

Lastly, there are only two document types in DynamoDB:

Map – a nested data structure similar to a JSON object
List – an ordered sequence of elements of any type

// Document types
{
    "movieId": "123",
    "name": "Star Wars: The Force Awakens",
    "actors": [
        "Daisy Ridley",
        "John Boyega",
        ...
    ],
    "boxOffice": {
        "budget": 245000000,
        "gross": 936000000
    }
}

Mind the price


When you create a table, you need to specify how many requests the table should be able
to process per second. The more requests you need to process, the more you will have to
pay. Hence, it pays to understand how to use as little provisioned throughput as
possible.

A table's throughput is defined by two values:

Number of read capacity units (RCUs) – how many read requests you can send per
second.
Number of write capacity units (WCUs) – how many write requests you can send
per second.

Not all requests are equal though. The more data you read or write in a single request,
the more read and write capacity will be consumed. The calculation looks like this:

RCU_consumed = ceil(size_read / 4KB)
WCU_consumed = ceil(size_written / 1KB)
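
For example, reading a single 6KB item consumes ceil(6/4) = 2 RCUs, while writing the same item consumes ceil(6/1) = 6 WCUs. The same arithmetic as a small Python sketch:

# A sketch of the capacity arithmetic above
import math

def rcu_consumed(bytes_read):
    return math.ceil(bytes_read / 4096)    # reads are billed in 4KB units

def wcu_consumed(bytes_written):
    return math.ceil(bytes_written / 1024) # writes are billed in 1KB units

print(rcu_consumed(6 * 1024))  # 2
print(wcu_consumed(6 * 1024))  # 6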

You can change the provisioned throughput at any time by performing the UpdateTable
API call. If you send more requests to the table than it can handle, DynamoDB will return
the ProvisionedThroughputExceededException exception, and you need to either retry or
provision more capacity.
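
For instance, raising the provisioned capacity with boto3 might look like this sketch; the table name and capacity values are hypothetical:

# A sketch: raising provisioned throughput with the UpdateTable API
# (table name and capacity values are hypothetical)
import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="videos",
    ProvisionedThroughput={
        "ReadCapacityUnits": 20,
        "WriteCapacityUnits": 10,
    },
)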

Another thing to keep in mind is that you can use two types of read operations:
eventually consistent and strongly consistent. With a strongly consistent read, you are
guaranteed to get the latest written result. With eventual consistency, you can get stale
data, but it consumes half as many capacity units as a strongly consistent read.
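
With boto3, the read consistency is selected per request via the ConsistentRead flag; a sketch with a hypothetical table:

# A sketch: choosing read consistency per request (table and key names are hypothetical)
import boto3

videos = boto3.resource("dynamodb").Table("videos")

# Strongly consistent read: always returns the latest write, costs full RCUs
response = videos.get_item(Key={"videoId": "rk67v9"}, ConsistentRead=True)

# Eventually consistent read (the default): may return stale data, costs half the RCUs
response = videos.get_item(Key={"videoId": "rk67v9"})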

Table keys and how to select them

Simple key
Now, let's go back to our videos database. Let's design the first table, which will store
information about video authors, such as user id, name, country, and most importantly: a
favourite beverage. But before we can store our first item, DynamoDB forces us to make an
important choice: we need to select what field(s) will form a key in our table. This choice
is important for two reasons: first, it defines what queries we can perform
efficiently, and second, you can't change it after the table is created.

For the video authors table, we will use a so-called simple key. With a simple key, a
single attribute is defined as the key of the table.
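
As an illustration, creating such a table with boto3 might look like the following sketch; the attribute name comes from the examples above, while the throughput values are arbitrary:

# A sketch: creating a table with a simple key (capacity values are arbitrary)
import boto3

client = boto3.client("dynamodb")
client.create_table(
    TableName="authors",
    KeySchema=[{"AttributeName": "authorId", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "authorId", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)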
With this key type we can perform only one query efficiently: get an item by id. To do this,
we need to use the GetItem action.
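
A GetItem call with boto3 might look like this sketch, assuming the authors table above:

# A sketch: fetching a single item by its key
import boto3

authors = boto3.resource("dynamodb").Table("authors")
response = authors.get_item(Key={"authorId": "unclebob"})
item = response.get("Item")  # None if no item with this key exists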

In addition to that, you can use an operation called Scan to fetch your data. Scan allows
you to provide an arbitrary filtering expression, which allows great flexibility, and this can
make you wonder why we need to care about key types in the first place. The
problem is that when you use Scan, DynamoDB consumes capacity for every item it
evaluates, even if it does not return it to you. And since the Scan operation does
not use any knowledge about the structure of your data, using it on a big table can be
really expensive.

A simple key is fine for simple cases, but it's hard to implement anything more
complicated with it. Say we want to implement a 1:M relationship between authors and videos.
We could store a list of video IDs in an author's item and then fetch every video
using the GetItem action, but this is neither convenient nor cost-efficient.

// Possible solution
{
    "authorId": "unclebob",
    "videos": ["sd82s7", "whd2bs", ..., "ful57s"]
}

If we call GetItem to get a number of items, we spend at least one RCU for every call,
even if the size of an item is less than 4KB. So if we want to fetch ten items of 400 bytes each,
it will cost us 10 RCUs!

Composite key
There is, however, a better way. DynamoDB has a different key type, which is called a
composite key. When we create a table with a composite key, we need to define two
attributes: a partition key and a sort key. As the name suggests, the sort key can be used
to sort the output, but it can also do much more.

Just as with the simple key, you can get an element by id, but in this case, you need to
provide values for both the partition and sort keys. You can also get all items with the
same partition key if you use the Query action and only provide the partition key. This is very
convenient for implementing 1:M relationships.
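
A sketch of such a query with boto3, assuming a hypothetical videos table with authorId as the partition key and videoId as the sort key:

# A sketch: fetching all items that share a partition key
# (table and attribute names are hypothetical)
import boto3
from boto3.dynamodb.conditions import Key

videos = boto3.resource("dynamodb").Table("videos")
response = videos.query(KeyConditionExpression=Key("authorId").eq("unclebob"))
for item in response["Items"]:
    print(item["videoId"], item.get("name"))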

Optionally, with the Query operation, you can provide a key condition expression that
allows filtering items in the response. To do this, you can use the <, >, <=, >=,
BETWEEN, and begins_with operators.

Also, you can specify a sort order based on the sort key value. This does not make much
sense for our videos table, but it would in a different case. Say we
have a table where we store sensor measurements: we could make the sensor id the partition
key and the timestamp of a measurement the sort key. With this, we could sort or filter
measurements for a particular sensor by timestamp.
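
Such a query might look like the following sketch with boto3; the measurements table and its attributes are hypothetical:

# A sketch: filtering by the sort key and controlling the sort order
# (table and attribute names are hypothetical)
import boto3
from boto3.dynamodb.conditions import Key

measurements = boto3.resource("dynamodb").Table("measurements")
response = measurements.query(
    KeyConditionExpression=(
        Key("sensorId").eq("sensor-42")
        & Key("timestamp").between(1510531200, 1510617600)
    ),
    ScanIndexForward=False,  # newest measurements first
)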

Using a sort key, it is also easy to paginate requests. All we need to do is specify the
Limit parameter, which limits the number of items DynamoDB should evaluate before
returning a reply, and pass the LastEvaluatedKey from one response as the
ExclusiveStartKey of the next request.
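
A minimal pagination sketch with boto3, reusing the hypothetical videos table from the earlier examples:

# A sketch: paginating a query ten items at a time
import boto3
from boto3.dynamodb.conditions import Key

videos = boto3.resource("dynamodb").Table("videos")
kwargs = {"KeyConditionExpression": Key("authorId").eq("unclebob"), "Limit": 10}

while True:
    response = videos.query(**kwargs)
    for item in response["Items"]:
        print(item)
    if "LastEvaluatedKey" not in response:
        break  # no more pages
    kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]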

Notice that queries with composite keys only operate with values for a single partition
key.

Consumed capacity
Remember the issue with consuming too much read capacity using the GetItem action?
This is not the case with the Query operation. Now if we want to get ten items of 400 bytes
each, it will only cost us 1 RCU. The reason for this is that when we use queries,
DynamoDB sums the total size of all returned items and only then divides it by 4KB
to calculate the consumed capacity:

consumed_capacity = ceil(sum(item_sizes) / 4KB)
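
You can verify this by asking DynamoDB to report the capacity a request consumed; a sketch:

# A sketch: asking DynamoDB to report consumed capacity with the response
import boto3
from boto3.dynamodb.conditions import Key

videos = boto3.resource("dynamodb").Table("videos")
response = videos.query(
    KeyConditionExpression=Key("authorId").eq("unclebob"),
    ReturnConsumedCapacity="TOTAL",
)
print(response["ConsumedCapacity"])  # e.g. {'TableName': 'videos', 'CapacityUnits': 1.0}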

Local secondary index


You can see that the composite key is a very powerful construct, but what if we want to
have more than one sort order? What if we want to sort items by video length or by
release date?

For this, DynamoDB has another feature called a local secondary index. We can use it just
like the sort key: we can use comparison operators and specify a sort order. For
example, if we need to sort videos by video length, we can add a local secondary index on
the length attribute.

With this approach, we can now sort the videos of a particular author by creation time, fetch
only videos created after some date, filter videos that are longer than some threshold, etc.
An important difference compared to the sort key is that a pair of partition key and local
secondary index attribute does not have to be unique.
Notice that when you perform a query, you can use either the sort key or one of the local
secondary indexes, but not both. Just as with the key type, you have to specify local
secondary indexes at table creation and can't change them afterward.
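
A sketch of defining and querying such an index with boto3 follows; the index name, attribute names, and capacity values are all hypothetical:

# A sketch: a table with a local secondary index on video length
# (all names and values are hypothetical)
import boto3
from boto3.dynamodb.conditions import Key

client = boto3.client("dynamodb")
client.create_table(
    TableName="videos",
    KeySchema=[
        {"AttributeName": "authorId", "KeyType": "HASH"},
        {"AttributeName": "videoId", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "authorId", "AttributeType": "S"},
        {"AttributeName": "videoId", "AttributeType": "S"},
        {"AttributeName": "length", "AttributeType": "N"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "length-index",
            "KeySchema": [
                {"AttributeName": "authorId", "KeyType": "HASH"},
                {"AttributeName": "length", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Query the index instead of the base sort key: videos longer than 10 minutes
videos = boto3.resource("dynamodb").Table("videos")
response = videos.query(
    IndexName="length-index",
    KeyConditionExpression=Key("authorId").eq("unclebob") & Key("length").gt(600),
)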

Global secondary index


In the previous examples, we had tables with a single partition key. This is not very
convenient, since we can only use the Query operation to fetch items with a selected
partition key. What if we want to allow authors to group videos into seasons and want to
fetch all videos in a season?

One option would be to maintain a copy of this data in a different table with a different
partition key, but the same sort key.

Fortunately, DynamoDB can do this for us. If you create a global secondary index for a
table, you can specify an alternative simple or composite key, regardless of what key type
you selected for the table. When you write an item into the original table, DynamoDB will
copy the data in the background into the global secondary index table.

Because data is copied to the index in the background, it only supports eventually
consistent read operations: it takes some time for updates to propagate.

In contrast to local secondary indexes, global secondary indexes can be created at any
time, even after the table has been created.
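
For example, adding a global secondary index to an existing table might look like the following sketch with boto3; the index, keyed on a hypothetical seasonId attribute, and the capacity values are assumptions:

# A sketch: adding a global secondary index to an existing table
# (index name, attribute name, and capacity values are hypothetical)
import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="videos",
    AttributeDefinitions=[{"AttributeName": "seasonId", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "season-index",
                "KeySchema": [{"AttributeName": "seasonId", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)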

Aggregating data with DynamoDB


But what if we need to group items by some key in DynamoDB? What if we want to
count how many videos we have from each author? Or what if we need to find out
how many videos between 5 and 10 minutes long we have?

With a relational database, we could use the GROUP BY operation, but unfortunately,
DynamoDB does not support it out of the box. We have three main options for
implementing it:

Use Redshift integration – we can copy data from DynamoDB to Redshift, which has
full-fledged SQL support and allows us to perform analytical queries. The main
downside of this approach is that we will have to query stale data.
Use EMR Hive integration – with this, we can either copy DynamoDB data to HDFS
or S3 and query the data from there, or, alternatively, EMR Hive can perform analytical
queries on the DynamoDB data directly.
Create aggregated tables with DynamoDB streams – we can create a sort of
materialized view of the original data that fits the DynamoDB data model. To maintain this
materialized view, we read data from DynamoDB streams and update the materialized
view in real time, as sketched below.
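
To illustrate the last option, here is a minimal sketch of a Lambda function that consumes the stream of the videos table and maintains a per-author video count in a hypothetical aggregate table; all names are assumptions:

# A sketch: maintaining a per-author video count from a DynamoDB stream
# (table, attribute, and counter names are hypothetical)
import boto3

counts = boto3.resource("dynamodb").Table("author-video-counts")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # only count newly added videos
        # Stream records carry items in the low-level attribute-value format
        author_id = record["dynamodb"]["NewImage"]["authorId"]["S"]
        # ADD atomically increments the counter, creating it if missing
        counts.update_item(
            Key={"authorId": author_id},
            UpdateExpression="ADD videoCount :inc",
            ExpressionAttributeValues={":inc": 1},
        )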

Conclusions
While the data model of DynamoDB can seem peculiar at first, it is actually pretty
straightforward. Once you've grasped what table keys are and how to use table indexes,
you will be able to use DynamoDB efficiently. The key is to understand its data model
and to think about how you can fit your data into DynamoDB. Remember: to achieve stellar
performance, you need to use queries as much as possible and try to avoid scan
operations.

DynamoDB is a complex topic, and I will write more about it, so stay tuned. In the
meantime, you can take a look at my deep-dive DynamoDB course. You can watch the
preview for the course here.
