
RavenDB Mythology Documentation

Release 1.0

Ayende Rahien (Oren Eini)

November 29, 2010

CONTENTS

1 NoSQL? What is that?
   1.1 How we got this NoSQL thing?
   1.2 NoSQL data stores
   1.3 How to select a data storage solution?
   1.4 Summary
2 Grokking Document Databases
   2.1 Data modeling with document databases
   2.2 Denormalization isn't scary
   2.3 Indexes bring order to a schema-free world
   2.4 Summary
3 Basic Operations
   3.1 Creating a document session
   3.2 Saving a new document
   3.3 Loading & editing an existing document
   3.4 Deleting existing documents
   3.5 Transaction support in RavenDB
   3.6 Basic query support in RavenDB
   3.7 Safe by default
   3.8 Summary
4 RavenDB Indexes
5 Advanced RavenDB indexing
   5.1 What is an index?
   5.2 Index optimizations
   5.3 Collation
   5.4 Exact matches
   5.5 Full text search
6 Map/Reduce indexes
   6.1 Stepping through the map/reduce process
   6.2 What is map/reduce, again?
   6.3 Rules for map/reduce operations
   6.4 Applications of map/reduce
   6.5 How map/reduce works in RavenDB
   6.6 How RavenDB stores the results of map/reduce indexes
   6.7 Creating our first map/reduce index
   6.8 Querying map/reduce indexes
   6.9 Where should we use map/reduce indexes?
   6.10 Summary
7 Scaling RavenDB
8 Replication
9 Authorization
10 Extending RavenDB
11 RavenDB Builtin Bundles
12 Building your own Bundle
13 Administration
14 Summary
15 Things to talk about



This book is dedicated to my father. Dad, from your mouth to God's ears.

Oren Eini, 2010

Warning: This book is a draft; it is known to have multiple spelling & grammar issues. The source for this page can be found here: http://github.com/ravendb/docs We would love to hear suggestions & improvements about this book. The discussion group for this book can be found here: http://groups.google.com/group/ravendb/




CHAPTER

ONE

NOSQL? WHAT IS THAT?


In this chapter...
- NoSQL? What is that?
- How we got this NoSQL thing?
- NoSQL data stores
  * Key/Value Stores
  * Document Databases
  * Column family databases (BigTable)
- How to select a data storage solution?
  * Multiple data stores in a single application?
  * When is NoSQL a poor choice?
  * And when scaling is not an issue?
- Summary

In the beginning there was the data. And the Programmer put it in memory, and it was so. And on the second day, it was discovered that this data should be persisted, and then there was a file. And on the third day, the customer wanted searching, and that was the first database.

This book is about RavenDB, a document database written in .NET. But I don't think that I can accurately talk about RavenDB without first discussing some of the history of data storage in the IT field over the last half a century or so.

At first, data was simply stored in files, with each application having its own proprietary format. That quickly became a problem, since it was soon discovered that users have a lot of interesting requirements, such as being able to retrieve the data, search it, read reports about it, etc. I distinctly remember learning how to do file I/O by writing my own PhoneBook application and doing binary reads/writes from the disk. Probably the hardest part was having to write the search routine. I ended up having to do a sequential scan over the entire file for each search, and having to write custom code for each and every search permutation that was required. Not surprisingly, developers facing the same problem at the dawn of computing quickly sought ways to avoid having to do this over & over again.

The first steps toward what we consider a database today were the ISAM (Indexed Sequential Access Method) files, which are simply a way to store data in files with indexing. The problem with those cropped up when you wanted to do a bit more than just access the data, in particular aggregations. That was the point when data storage grew out of files and into data management libraries and systems. The next step was Edgar Codd's paper, "A Relational Model of Data for Large Shared Data Banks". And from that point on, an absolute majority of the industry has been focused almost exclusively on relational databases. For a very long time, data storage meant putting things in a database. Until very recently, in fact...


1.1 How we got this NoSQL thing?


Probably the worst thing about relational databases is that they are so good at what they do. Good enough to conquer the entire market on data storage and hold it for decades. Wait! That is a bad thing? How? It is a bad thing because relational databases are appropriate for a wide range of tasks, but not for every task. Yet it is exactly that which caused them to be used in contexts where they are not appropriate. In the last month alone (March 2010), my strong recommendation for two different clients was that they switch to a non-relational data store, because it would greatly simplify the work that they need to do. That met with some (justified) resistance, predictably. Most people equate data storage with RDBMS; there is no other way to store data.

Since you are reading this book, you are probably already aware that there are other options out there for data storage. But you may not be certain why you might want to work with a NoSQL database. Before we can answer that question, we have to revisit some of our knowledge about RDBMS first. We have to understand what it is that made RDBMS the de facto standard for data storage for so long, and why there is such a fuss around the alternatives to RDBMS. Relational databases have the following properties:
- ACID (Atomic, Consistent, Isolated, Durable)
- Relational (based on relational algebra and the work by Edgar Codd)
- Table / row based
- Rich querying capabilities
- Foreign keys
- Schema

Just about any of the NoSQL approaches gives up on some of those properties; sometimes, a NoSQL solution gives up on all of them. Considering how useful an RDBMS is, and how flexible it has turned out to be for so long, why should we give up all of those advantages? The most common reason to move away from an RDBMS is running into the RDBMS limitations. In short, RDBMS doesn't scale (http://adamblog.heroku.com/past/2009/7/6/sql_databases_dont_scale/). Actually, let me phrase that a little more strongly: RDBMS systems cannot be made to scale [1]. The problem is inherent in the basic requirements of the relational database system: it must be consistent, to handle things like foreign keys, maintain relations over the entire data set, etc. The trouble starts when you try to scale a relational database over a set of machines. At that point, you run head on into the CAP theorem (http://www.julianbrowne.com/article/viewer/brewers-cap-theorem), which states that you can have only two of the Consistency, Availability and Partition Tolerance triad. Hence, if consistency is your absolute requirement, you need to give up on either availability or partition tolerance. In most high-scale environments, it is not feasible to give up on either of those, so you have to give up on consistency. But an RDBMS will not allow that, so relational databases are out. That leaves you with two basic choices:

- Use an RDBMS, but instead of having a single instance across multiple nodes, treat each instance as an independent data store. This approach is called sharding, and we will discuss how it applies to RavenDB in Chapter 7. The problem with this approach with an RDBMS is that you lose a lot of the capabilities that an RDBMS brings to the table (you can't join between nodes, for example). A minimal routing sketch appears at the end of this section.
- Use a NoSQL solution.

What it boils down to is that when you bring in the need to scale to multiple machines, the drawbacks of using an RDBMS (TODO: provide a full list) outweigh the benefits that it usually brings to the table. Since we would have to do a lot of work anyway with sharded SQL databases, it is worth turning our attention to the NoSQL alternatives, and why we might want to choose them. This book is about RavenDB, a document database, but I want to give you at least some background on each of the common NoSQL database types, before starting to talk about RavenDB specifically.

[1] To be rather more exact, I should say that when I am talking about scaling, I am talking about scaling a database instance across a large number of machines. It is certainly possible to scale RDBMS solutions, but the typical approach is breaking the data store into independent nodes (sharding), which means that things like cross-node joins are no longer possible. Another RDBMS scaling solution is a set of servers that acts as a single logical database instance, such as Oracle RAC. The problem with this approach is that the number of machines that can take part in such a system is limited (usually to low single digits), making it impractical for high scaling requirements.
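To make the sharding option concrete, here is a minimal sketch of key-based shard routing. Everything here is illustrative (the ShardRouter class and the node URLs are invented for this example, not part of any product); the point is that each key deterministically lives on exactly one node, which is also exactly why cross-node joins stop being possible.

using System;

public class ShardRouter
{
    private readonly string[] shardUrls;

    public ShardRouter(params string[] urls)
    {
        shardUrls = urls;
    }

    public string SelectShard(string documentKey)
    {
        // A stable hash of the key; every reader and writer must
        // compute the same value for routing to stay consistent.
        var hash = 0;
        foreach (var ch in documentKey)
            hash = unchecked(hash * 31 + ch);
        return shardUrls[(hash & 0x7fffffff) % shardUrls.Length];
    }
}

// Usage: all operations on "users/ayende" always hit the same node,
// but nothing can join "users/*" with documents living on another node.
var router = new ShardRouter("http://node1:8080", "http://node2:8080");
var url = router.SelectShard("users/ayende");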

1.2 NoSQL data stores


I am going to briefly touch on each kind of NoSQL data store, from the developer perspective (what kind of API and interfaces the data store has), and from the scaling perspective, to see how we can scale our solution. This isn't a book about NoSQL solutions in general, but it is important to understand who the other players in the field are when it comes time to evaluate options for your data storage strategy. Almost all data stores need to handle things like:
- Concurrency
- Queries
- Transactions
- Schema
- Replication
- Scaling

One thing that should be made clear up front is the major difference between performance and scalability; the two are often at odds, and usually increasing one will decrease the other. For performance, we ask: how can we execute the same set of requests, over the same set of data, with:
- shorter time?
- fewer resources (for example, less memory)?

Note that here, too, there is usually a tradeoff between resource usage and processing time. In general, you can cut the processing time by consuming more resources (for example, by adding a cache). Conversely, you can reduce resource usage by increasing the processing time (compute as needed, instead of precomputing results). For scaling, we ask: how can we meet our SLA when:
- we get a lot more data?
- we get a lot more requests?

With relational databases, the answer is usually: you don't scale. The NoSQL alternatives are generally quite simple to scale, however.



Data access strategy follows the data access pattern

One of the most common problems that I find when reviewing a project is that the first step (or one of them) was to build the Entity Relations Diagram, thereby sinking a large time/effort commitment into it before the project really starts and real-world usage tells us what sort of data we actually need and what the data access pattern of the application is. One of the major problems with this approach is that it simply doesn't work with NoSQL solutions. An RDBMS allows very flexible querying, so you can sometimes get away with this approach (although it is generally discouraged when using an RDBMS as well), but NoSQL solutions often require you to query / access the data only in a predefined manner (for example, key/value stores allow access only by key). This means that the structure of your data is usually going to be dictated by the way that you are going to access it. This is usually a surprise for people coming from the RDBMS world, since it is the inverse of how you usually model data in an RDBMS. We will discuss modeling techniques for a document database in Chapter 2.

1.2.1 Key/Value Stores


The simplest NoSQL databases are the Key/Value stores. They are simplest only in terms of their API, because the actual implementation may be quite complex. But let us focus first on the API that is exposed by most key/value stores. Most of the key/value stores expose some variation on the following API:
void Put(string key, byte[] data);
byte[] Get(string key);
void Remove(string key);

There are many variations, but that is the basis for everything else. A key/value store allows you to store values by key, as simple as that. The value itself is just a blob as far as the data store is concerned; it just stores it, it doesn't actually care about the content. In other words, we don't have a schema defined by the data store, but client-defined semantics for understanding what the values are. The benefit of this approach is that it is very simple to build a key/value store, and very easy to scale it. It also tends to have great performance, because the access pattern in a key/value store can be heavily optimized. In general, most key/value operations can be performed in O(1), regardless of how many machines there are in the data store and regardless of how much data is stored.

Concurrency

In a key/value store, concurrency is only applicable to a single key, and it is usually offered as either optimistic writes or as eventual consistency. In highly scalable systems, optimistic writes are often not possible, because of the cost of verifying that the value hasn't changed (assuming the value may have replicated to other machines); therefore, we usually see either a key master (one machine owns a key) or the eventual consistency model, which is discussed below.

Queries

There really isn't any way to perform a query in a key/value store, except by the key. Some key/value stores allow range queries on the key, but that is rare. Most of the time, queries on key/value stores are implemented by the user, using a manually maintained secondary index.
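To illustrate what "a manually maintained secondary index" means in practice: the index is just more key/value data that the application itself keeps in sync. Below is a minimal sketch against the Put/Get/Remove API shown above; the IKeyValueStore wrapper, the key prefix and the JSON encoding are all arbitrary choices of this example, not any particular product's API.

using System.Collections.Generic;
using System.Text;
using System.Text.Json;

interface IKeyValueStore
{
    void Put(string key, byte[] data);
    byte[] Get(string key);
    void Remove(string key);
}

class UsersByEmailIndex
{
    private readonly IKeyValueStore store;

    public UsersByEmailIndex(IKeyValueStore store) { this.store = store; }

    public void Add(string email, string userKey)
    {
        // Read-modify-write of the index entry. Without optimistic
        // concurrency this is racy, which is exactly why hand-rolled
        // secondary indexes are painful to maintain.
        var indexKey = "idx/users-by-email/" + email;
        var existing = store.Get(indexKey);
        var keys = existing == null
            ? new List<string>()
            : JsonSerializer.Deserialize<List<string>>(Encoding.UTF8.GetString(existing));
        keys.Add(userKey);
        store.Put(indexKey, Encoding.UTF8.GetBytes(JsonSerializer.Serialize(keys)));
    }

    public List<string> Query(string email)
    {
        var existing = store.Get("idx/users-by-email/" + email);
        return existing == null
            ? new List<string>()
            : JsonSerializer.Deserialize<List<string>>(Encoding.UTF8.GetString(existing));
    }
}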



Transactions

While it is possible to offer transaction guarantees in a key/value store, those are usually offered only in the context of a single key put. It is possible to offer them on multiple keys, but that really doesn't work when you start thinking about a distributed key/value store, where different keys may reside on different machines. Because of that, it is typically best to think about key/value stores as allowing transactions on a single key put on a single machine. Please note that transactions do not imply ACID. In a distributed key/value store, the only way to ensure that is if a key can reside on a single machine. However, we usually do not want that; we want each key to live on multiple machines, to avoid data loss / data unavailability if a node goes down for some reason. We discuss this model (also called the eventually consistent key/value store) below.

Schema

Key/value stores have the following schema: Key is a string, Value is a blob. That is probably not a very useful schema for your purposes. Beyond that, the client is the one that determines how to deal with the data; the key/value store just stores it.

Scaling Up

In key/value stores, there are two major options for scaling. The simplest one is to shard the entire key space: keys starting with A go to one server, while keys starting with B go to another server, and so on. In this system, a key is only stored on a single server. That drastically simplifies things like transaction guarantees, but it exposes the system to data loss if a single server goes down. At this point, we introduce replication, which gives us safety from data loss, but also forces us to give up on ACID guarantees.

Replication

In key/value stores, replication can be done by the store itself or by the client (writing to multiple servers). Replication also introduces the problem of divergent versions; in other words, two servers in the same cluster think that the value of key ABC is two different things. Resolving that is a complex issue. The common approaches are to decide that it can't happen (Scalaris), rejecting updates where we can't ensure non-conflict, or to accept all updates and ask the client to resolve them for us at a later date (Amazon Dynamo, Rhino DHT).

Eventually consistent key/value stores

A system which decides that divergent versions of the same key should be avoided will reject updates if such a scenario may happen. Following the CAP theorem, that means we give up partition tolerance. The problem is that in most cases, you really can't assume that your network won't be partitioned. If that happens (and it happens quite frequently) and you chose the "reject divergent updates" mode, you can no longer accept writes, rendering you unavailable. To avoid this problem, there is a different model: allow divergent writes and let the client resolve the conflict when the partition is resolved and the conflict is detected. We discuss exactly this problem in detail in Chapter 8, Replication.

Common Usages

Key/value stores shine when you need to access the data by key. User-related data, such as session or shopping cart information, is ideal, because we always know what the user id is. Another common usage is to store pre-computed data based on the primary key. For example, we may want to store all the information about a product (including related products, reviews, etc.) in a key/value store based on the product SKU. That allows us to query all the relevant data about a product in an O(1) manner. Because key-based queries are practically free, by structuring our data access along keys we can get significant performance benefits from structuring our applications to fit that need. It turns out that there is quite a lot that you can do with just a key/value store.
- Amazon's shopping cart runs on a key/value store (Amazon Dynamo), so I think you can surmise that this is a highly scalable technique.



- The Amazon Dynamo paper is one of the best resources on the topic that one can ask for.
- Rhino DHT is a scalable, redundant, zero-config key/value store on the .NET platform.

Just remember: if you need to do things more complex than just accessing a bucket of bits using a key, you probably need to look at something else, and the logical next step in the chain is the Document Database.

1.2.2 Document Databases


A document database is, at its core, a key/value store where the value is in a known format. A document database requires that the data be stored in a format that the database can understand. The format can be XML, JSON (JavaScript Object Notation), Binary JSON (BSON), or just about anything, as long as the database can understand the document's internal structure. In practice, most document databases use JSON (or BSON) or XML. Why is this such a big thing? Because when the database can understand the format of the data that you send it, it can do server-side operations on that data. In most document databases, that means that we can allow queries on the document data. The known format also means that it is much easier to write tooling for the database, since it is possible to show, display and edit the data. I am going to use RavenDB as the example for this section. Documents in RavenDB use the JSON format, and each document contains both the actual data and additional metadata information about the document that is external to the document itself. Here is an example of a document:
{ "name": "ayende", "email": "ayende@ayende.com", "projects": [ "rhino mocks", "nhibernate", "rhino service bus", "raven db", "rhino persistent hash table", "rhino distributed hash table", "rhino etl", "rhino security", "rampaging rhinos" ] }

We can put this document in the database under the key "ayende", and we can get the document back by using the key "ayende". A document database is schema-free; you don't have to define your schema ahead of time and adhere to it. This allows us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don't really think about it. It does not, however, support relations. Each document is standalone. It can refer to other documents by storing their keys, but there is nothing to enforce relational integrity. The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren't limited to querying by key only. By storing information in a form that the database can understand, we can ask the server to do things for us, such as querying. The following HTTP request will find all documents where the name equals ayende:
GET /indexes/dynamic?query=name:ayende
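Storing and retrieving the document by key goes through the same HTTP interface. A hedged sketch of what those requests look like (the /docs endpoint follows RavenDB's REST API as I understand it; exact paths may differ between versions):

PUT /docs/ayende
(body: the JSON document shown above)

GET /docs/ayende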

Because the document database understands the format of the data, it can answer queries like the one above. Being able to perform queries is just one advantage of the database understanding the data; it also allows:



- Projecting the document data into another form.
- Running aggregations over a set of documents.
- Doing partial updates (patching a document).

From my point of view, though, the major benefit is that you are dealing with documents. There is little or no impedance mismatch between objects and documents. That means that storing data in the document database is usually significantly easier than using an RDBMS for most non-trivial scenarios. It is usually quite painful to design a good physical data model for an RDBMS, because the way the data is laid out in the database and the way that we think about it in our application are drastically different. Moreover, an RDBMS has this little thing called a Schema. And modifying a schema can be a painful thing indeed, especially if you have to do it in production and on multiple nodes. The schema-less nature of a document database means that we don't have to worry about the shape of the data we are using; we can just serialize things into and out of the database. It helps that the commonly used format (JSON) is both human-readable and easily managed by tools.

A document database doesn't support relations, which means that each document is independent. That makes it much easier to shard the database than it would be in a relational database, because we don't need to either store all relations on the same shard or support distributed joins.

I like to think about document databases as a natural candidate for Domain Driven Design applications. When using a relational database, we are instructed to think in terms of Aggregates and always go through an aggregate. The problem with that is that it tends to produce very bad performance in many instances, as we need to traverse the aggregate associations, or specialized knowledge in each context. With a document database, aggregates are quite natural and highly performant; they are just the same document, after all. In fact, the standard modeling technique for a document database is to think in terms of aggregates. We discuss this in depth in the next chapter.

Concurrency

In most document stores, concurrency is only applicable to a single document, and it is usually offered as optimistic writes (a code sketch appears after the Schema section below). For document databases that also have replication support, we have to deal with the same potential conflicts that arise when using an eventually consistent key/value store, and we resolve them in much the same way: by letting the client decide how to merge all the conflicting versions. We discuss this in more detail in Chapter 8, Replication.

Queries

Unlike a key/value store, a document database is not limited to queries by key. Because the database understands the format of the documents, it can query on the data inside them, as the dynamic query above demonstrated; defining indexes over document data is discussed in detail in Chapter 4 - RavenDB Indexes.

Transactions

Most document databases will offer you transaction support for a single document. RavenDB supports multi-document (and multi-node) transactions, but even so, that isn't recommended for common use, because of the potential for issues when using distributed transactions.

Schema

Document databases don't have a schema per se; you can store any sort of document inside them. The only limitation is that the document must be in a format that the database understands (usually JSON). Note, however, that while document databases allow an arbitrary schema for documents, for practical purposes, indexes (or views) in a document database do allow you to treat some part of the data in a more formal way. We discuss indexes in detail in Chapter 4 - RavenDB Indexes.
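Here is a minimal sketch of the optimistic writes just described, using the RavenDB client session (session usage is covered properly in Chapter 3; the User class is a hypothetical entity, and the exact exception behavior may vary by client version):

using (var session = documentStore.OpenSession())
{
    // Ask the session to reject SaveChanges if the document changed
    // on the server after we loaded it.
    session.Advanced.UseOptimisticConcurrency = true;

    var user = session.Load<User>("users/ayende");
    user.Email = "new@example.org";

    // Throws a concurrency exception if another client modified
    // "users/ayende" in the meantime; we can then reload and retry.
    session.SaveChanges();
}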

Scaling Up

The common approach for scaling a document store is sharding. Since each document is independent, document databases lend themselves easily to sharding. Usually sharding is combined with replication support to handle failover in case of node failure, but that is about as complex as it gets. We discuss sharding strategies for RavenDB in Chapter 7 - Scaling RavenDB.

Common Usages

Document databases are usually used to store entities (more accurately, aggregates). There is very little effort involved in turning an object graph into a document, and vice versa. And aggregates play very well with both document databases and Domain Driven Design principles. Examples of the type of data that would be stored in a document database include blog posts and discussion threads, product catalogs, orders and similar entities.

Graph Databases

Think about a graph database as a document database with a special type of document: relations. A common example would be a social network, such as the one shown in figure 1.1.

Figure 1.1 - An example of nodes in a graph database

There are four documents and three relations in this example. Relations in a graph database are more than just a pointer. A relation can be unidirectional or bidirectional, but more importantly, a relation is typed: I may be associated with you in several ways; you may be a client, family or my alter ego. And the relation itself can carry information. In the case of the relation document in figure 1.1 above, we simply record the type of the association and the degree of closeness. And that is about it, mostly. Once you think about graph databases as document databases with a special document type, you are pretty much done. Except that graph databases have one additional quality that makes them very useful.




They allow you to perform graph operations. The most basic graph operation is traversal. For example, let us say that I want to know which of my friends are in town so I can go and have a drink. That is pretty easy to do, right? But what about indirect friends? Using a graph database, I can define the following query:
new GraphDatabaseQuery
{
    SourceNode = ayende,
    MaxDepth = 3,
    RelationsToFollow = new[] { "As Known As", "Family", "Friend", "Romantic", "Ex" },
    Where = node => node.Location == ayende.Location,
    SearchOrder = SearchOrder.BreadthFirst
}.Execute();

I can execute more complex queries as well, filtering on the relation properties, considering weights, etc. Graph databases are commonly used to solve network problems. In fact, most social networking sites use some form of a graph database to do things like "You might know...". Because graph databases are intentionally designed to make graph traversal cheap, they also provide other operations that tend to be very expensive without that property. For example, finding the shortest path between two nodes turns out to be frequently useful when you want to do things like: who can recommend me to this company's CTO so they would hire me? One problem with scaling graph databases is that it is very hard to find an independent sub-graph, which means that it is very hard to shard them. There are several efforts currently underway in academia to solve this problem, but I am not aware of any reliable solution as of yet.

1.2.3 Column family databases (BigTable)


Column family databases are probably best known because of Google's BigTable implementation. They are very similar on the surface to a relational database, but they are actually a quite different beast. Some of the difference is storing data by rows (relational) vs. storing data by columns (column family databases). But a lot of the difference is conceptual in nature. You can't apply the same sort of solutions that you used in a relational form to a column database. That is because column databases are not relational; for that matter, they don't even have what an RDBMS advocate would recognize as tables. The following concepts are critical to understanding how column databases work:
- Column family
- Super columns
- Column

Columns and super columns in a column database are sparse, meaning that they take exactly 0 bytes if they don't have a value in them. Column families are the nearest thing that we have to a table, since they are about the only thing that you need to define up front. Unlike a table, however, the only things that you define in a column family are the name and the key sort options (there is no fixed schema). Column family databases are probably the best proof of leaky abstractions. Just about everything in a CFDB (as I'll call them from now on) is based around the idea of exposing the actual physical model to the users so they can make efficient use of it.
- Column families - A column family is how the data is stored on the disk. All the data in a single column family will sit in the same file (actually, set of files, but that is close enough). A column family can contain super columns or columns.
- Super columns - A super column can be thought of as a dictionary; it is a column that contains other columns (but not other super columns).




- Column - A column is a tuple of name, value and timestamp (I'll ignore the timestamp and treat it as a key/value pair from now on).

It is important to understand that schema design in a CFDB is of utmost importance; if you don't build your schema right, you literally can't get the data out. A CFDB usually offers one of two forms of queries: by key or by key range. This makes sense, since a CFDB is meant to be distributed, and the key determines where the actual physical data is located. The data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending and descending). The sort order, unlike in a relational database, isn't affected by the column values, but by the column names. Let's assume that in the Users column family, in the row with the key @ayende, we have the column named name set to "Ayende Rahien" and the column named location set to "Israel". The CFDB will physically sort them like this in the Users column family file:
@ayende/location = "Israel" @ayende/name = "Ayende Rahien"

This is because the column name location sorts lower than the column name name. If we had a super column involved, for example in the Friends column family, and the user @ayende had two friends, they would be physically stored like this in the Friends column family file:
@ayende/friends/arava = 945
@ayende/friends/rose = 14

This property is quite important to understanding how things work in a CFDB. Let us imagine the Twitter model as our example. We need to store users and tweets. We define three column families:
- Users - sorted by UTF8
- Tweets - sorted by Sequential Guid
- UsersTweets - super column family, sorted by Sequential Guid

Let us create the user (a note about the notation: I am using named parameters to denote column names & values here; the key parameter is the row key, and the column family is Users):
cfdb.Users.Insert(key: "@ayende", name: "Ayende Rahien", location: "Israel", profession: "Wizard");

You can see a visualization of how this row looks in figure 1.2. Note that it doesn't look at all like how we would typically visualize a row in a relational database.

Figure 1.2 - A representation of a row in a Column Family Database

Now let us create a tweet:




Figure 1.3 - A representation of two tweets in a Column Family Database




var firstTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(key: firstTweetKey, application: "TweetDeck", text: "Err, is this on?", private: true);

var secondTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(key: secondTweetKey, app: "Twhirl", version: "1.2", text: "Well, I guess this is my second tweet");

Those values are visualized in figure 1.3. There are several things to notice in the figure:
- The actual key value doesn't matter, but it does matter that it is sequential, because that will allow us to sort by it later.
- Both rows have different data columns on them, because we don't have a schema for the column family.
- We don't have any way to associate a user with a tweet.

That last point bears some discussion. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational database would allow us to query the tweets by the user id, letting us get the user's tweets. A CFDB doesn't give us this option; there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene). Instead, the only thing that a CFDB gives us is a query by key. In order to answer that question, we need to create a secondary index, which is where the UsersTweets column family comes into play:
cfdb.UsersTweets.Insert(key: "@ayende", timeline: { SequentialGuid.Create(): firstTweetKey });
cfdb.UsersTweets.Insert(key: "@ayende", timeline: { SequentialGuid.Create(): secondTweetKey });

Figure 1.4 visualizes how this looks in the database. We insert into the UsersTweets column family, in the row with the key @ayende, in the super column timeline, two columns; the name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid and the value of each column is the key of a row in the Tweets column family.

Figure 1.4 - A representation of a secondary index, connecting users & tweets, in a Column Family Database

Note: Couldn't we create a super column in the Users column family to store the relationship? We could, except that a column family can contain either columns or super columns; it cannot contain both. In order to get the tweets for a user, we need to execute:
var tweetIds = cfdb.UsersTweets.Get("@ayende")
                               .FetchSuperColumnValues("timeline")
                               .Take(25)
                               .OrderByDescending()
                               .Select(x => x.Value);
var tweets = cfdb.Tweets.Get(tweetIds);

Note: There isn't such an API for .NET (at least, not that I am aware of); I created this sample to show a point, not to demonstrate a real API.

In essence, we execute two queries: the first on the UsersTweets column family, requesting the columns & values in the timeline super column in the row keyed @ayende; we then execute another query against the Tweets column family to get the actual tweets. This sort of behavior is pretty common in NoSQL data stores. It is called a secondary index: a way to quickly access the data by key based on another entity/row/document value. This is one example of how the need to query for tweets by user has affected the data that we store. If we didn't create this secondary index, we would have no possible way to answer a question such as "show me the last 25 tweets from @ayende". Because the data is sorted by the column name, and because we chose to sort in descending order, we get the last 25 tweets for this user. What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy: all I need to do is query the Tweets column family, ordering the tweets by descending key order.

Why is a column family database so limiting?

You might have noticed how many times I pointed out differences between an RDBMS and a CFDB. I think that the CFDB is the hardest to understand at first, since on the surface it is so close to the relational model. But it seems to suffer from so many limitations. No joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, SQLite or Access gives me more than that. Why is it so limited? The answer is quite simple. A CFDB is designed to run on a large number of machines and to store huge amounts of information. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC, will fall over and die very rapidly on the size of data and queries that a typical CFDB handles easily.

Remember that a CFDB is really all about removing abstractions. A CFDB is what happens when you take a relational database, strip away everything that makes it hard to run on a cluster, and see what happens. The reason that CFDBs don't provide joins is that joins require you to be able to scan the entire data set. That requires either some place that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that. CFDBs don't provide a way to query by column or value because that would necessitate either an index of the entire data set (or just of a single column family), which is again not practical, or running the query on all machines, which is not possible. By limiting queries to key only, a CFDB ensures that it knows exactly which node a query can run on. It means that each query runs on a small set of data, making queries much cheaper. It requires a drastically different mode of thinking, and while I don't have practical experience with CFDBs, I would imagine that migrations using them are... unpleasant affairs. But they are one of the ways to get really high scalability out of your data storage.

1.3 How to select a data storage solution?


So far I have shown you the major players in the NoSQL field. Each of them has its own weaknesses and strengths. A question that I get a lot is: "I want to use NoSql-Technology-X for Xyz and..." I usually cringe when I hear this sort of question, because almost invariably it falls into one of two pitfalls:




- Trying to import a relational mindset into a NoSQL data store.
- Trying to use a single data store for all things, including things that it really isn't suitable for.

Selecting a data storage strategy isn't a one-time decision. In a single application, you may use a key/value store to hold session information, a graph database to serve social queries and a document database to hold your entities. I view the "we use a single data store" mentality in the same way that I view people who want to write all their code in a single file. You certainly can do that, but it is going to be... awkward. I try to break things down based on the data access patterns expected from each section of the application. If in the product catalog I am always dealing with queries by the product SKU, and speed is of the essence, it makes a lot of sense to use a key/value store. But that doesn't mean that orders should be stored there; for orders I need a lot more flexibility, so I put them in a document database, etc.

1.3.1 Multiple data stores in a single application?


The logical conclusion of this approach is that a single application may have several different data stores. While I wouldn't go out of my way to use every data store technology that exists out there in a single project, I wouldn't balk at using the best data store technology for the application's purposes. The idea is to choose the best match for what we need to do, not to just use whatever is already there, whether it fits our purposes or not. That said, be aware that it only makes sense to introduce a new data store technology to a project if the benefit of having multiple data stores outweighs the cost. If I need to support user-defined fields, I would gravitate very quickly to a document database, rather than try to implement that on top of an RDBMS.

Warning: Don't forget about the RDBMS

Despite the name, NoSQL actually stands for Not Only SQL. The main point is that the problem isn't with the RDBMS as a technology; the problem is that for many people, data storage is RDBMS. When choosing a data storage technology I always take care to include the RDBMS in the mix as well. The RDBMS is an incredibly powerful tool and should not be discarded just because there are younger and sexier contenders in the ring.

1.3.2 When is NoSQL a poor choice?


After spending so long extolling the benefits of the various NoSQL solutions, I would like to point out at least one scenario where I haven't seen a good NoSQL alternative to the RDBMS: reporting. One of the great things about an RDBMS is that, given the information that it already has, it is very easy to massage the data into a lot of interesting forms. That is especially important when you are trying to do things like give users the ability to analyze the data on their own, such as by providing them with a reporting tool that allows them to query, aggregate and manipulate the data to their heart's content. While it is certainly possible to produce reports on top of a NoSQL store, you wouldn't be able to come close to the level of flexibility that an RDBMS offers. That is one of the major benefits of the RDBMS: its flexibility. The NoSQL solutions will tend to outperform the RDBMS solution (as long as you stay in the appropriate niche for each NoSQL solution) and they certainly have a better scalability story than the RDBMS, but for user-driven reports, the RDBMS is still my tool of choice.

1.3.3 And when scaling is not an issue?


The application's data is one of the most precious assets that we have. And for a long time, there wasn't any question about where we were going to put this data; the RDBMS was the only game in town. The initial drive away from the RDBMS was indeed driven by the need to scale. But that was just the original impetus to start developing the NoSQL solutions. Once those solutions came into being and matured, it isn't just the "we need web-scale" players that benefited.


Proven & mature NoSQL solutions aren't applicable just at the high end of scaling. NoSQL solutions provide a lot of benefits even for applications that will never need to scale beyond a single machine. Document databases drastically simplify things like user-defined fields, or working with Aggregates. The performance of a NoSQL solution can often exceed a comparable RDBMS solution, because the NoSQL solution will usually focus on a very small subset of the feature set that an RDBMS has.

1.4 Summary
In this chapter, we have gone over the reasons for the NoSQL movement, born out of the need to handle ever-increasing data, users and complexity. We have explored the various NoSQL options and discussed their benefits and disadvantages, as well as what scenarios they are suitable for. We looked at how to select an appropriate data store for specific purposes, and finally discussed how the emergence of robust NoSQL solutions has improved our options even when we aren't required to scale, because we have more data storage models to select from when it comes time to design our application. In the next chapter, we will leave the general topic of NoSQL and begin to focus specifically on document databases, the topic of this book. So turn the page to the next chapter, and let us explore...






CHAPTER

TWO

GROKKING DOCUMENT DATABASES


In this chapter...
- Grokking Document Databases
- Data modeling with document databases
  * Documents are not flat
  * Document databases are not relational
  * Documents are Aggregates
  * Relations and Associations
- Denormalization isn't scary
- Indexes bring order to a schema-free world
- Summary

In the previous chapter, we spoke at length about a lot of different options for NoSQL data stores. But even though we touched on document databases, we haven't really discussed them in detail. In essence, document databases store documents (duh!). A document is usually represented as JSON (sometimes it can be XML).

Note: I am going to assume that you are familiar with RDBMS, and compare document database behavior directly to the behavior of an RDBMS.

The following JSON document represents an order:
// Listing 2.1 - A sample order document
{
  "Date": "2010-10-05",
  "Customer": {
    "Name": "Dorothy Givens",
    "Id": "customers/2941"
  },
  "Items": [
    {
      "SKU": "products/4910",
      "Name": "Water Bucket",
      "Quantity": 1,
      "Price": {
        "Amount": 1.29,
        "Currency": "USD"
      }
    },
    {
      "SKU": "products/6573",
      "Name": "Beach Ball",
      "Quantity": 1,
      "Price": {
        "Amount": 2.19,
        "Currency": "USD"
      }
    }
  ]
}

Documents in a document database don't have to follow any schema and can have any form they wish. This makes them an excellent choice when you want to use them for sparse models (models where most of the properties are usually empty) or for dynamic models (customized data models, user-generated data, etc.). In addition to that, documents are not flat. Take a look at the document shown in listing 2.1; we represent a lot of data in a single document here, and all of it lives inside the document itself. Unlike in an RDBMS, a document is not just a set of keys and values; it can contain nested values, lists and arbitrarily complex data structures. This makes it much easier to work with documents compared to working with an RDBMS, because the complexity of your objects doesn't translate into a large number of calls to the database to load the entire object, as it would in an RDBMS. In order to build the document shown in listing 2.1 in an RDBMS system, we would probably have to query at least 3 tables, and it is pretty common to have to touch more than five tables to get all the needed information for a single logical entity. With document databases, all that information is already present in the document, and there is no need to do anything special. You just need to load the document, and the data is there (see the sketch below). The downside is that while you can embed information inside the document very easily, it is harder to reference information in other documents. In an RDBMS, you can simply join to another table and get the data from the database that way. But document databases do not have the concept of joins (RavenDB has something similar called includes, which is discussed in Chapter 6, but it isn't really a parallel). As you can imagine, these two changes lead to drastically different methods of modeling data in a document database...
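For example, with the RavenDB client, loading the whole order from listing 2.1 is a single call. A sketch (the Order class and the document id are hypothetical mappings of the JSON above):

using (var session = documentStore.OpenSession())
{
    // One round trip returns the customer info, the items and their
    // prices: everything embedded in the document.
    var order = session.Load<Order>("orders/1");
    Console.WriteLine(order.Customer.Name);  // "Dorothy Givens"
    Console.WriteLine(order.Items.Count);    // 2
}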

2.1 Data modeling with document databases


While document databases are schema-free data stores, that doesn't mean that you shouldn't take some time to consider how to design your documents, to ensure that you can access all the data that you need to serve requests efficiently, reliably and with as little maintenance cost as possible. The most typical error people make when trying to design a data model on top of a document database is to try to model it the same way you would on top of a relational database. A document database is a non-relational data store. Trying to hammer a relational model on top of it will produce sub-optimal results. But you can get fantastic results by taking advantage of the document-oriented nature of a document database.

2.1.1 Documents are not flat


Documents, unlike a row in an RDBMS, are not flat. You are not limited to storing just keys and values; instead, you can store complex object graphs as a single document. That includes arrays, dictionaries and trees. Unlike a relational database, where a row can contain only simple values and more complex data structures need to be stored as relations, you don't need to work hard to map your data into a document database. Take a look at figure 2.1 for an example of a simple blog page. In a relational database, we would have to touch no less than 4 tables to show the data in this single page (Posts, Comments, Tags and RelatedPosts). But a document database lets us store all the data in a single document, as shown in listing 2.2:




Figure 2.1 - A simple blog post page

// Listing 2.2 - A blog post document can contain complex data
{
  "Title": "Modeling in Docs DBs",
  "Content": "Modeling data in...",
  "Tags": [ "Raven", "DocDB", "Modeling" ],
  "Comments": [
    { "Content": "Great post...", "Author": "John" },
    { "Content": "Sed ut...", "Author": "Nosh" }
  ],
  "RelatedPosts": [
    { "Id": "posts/1234", "Title": "Doc Db Modeling Anti Patterns" },
    { "Id": "posts/4321", "Title": "Common Access Patterns" }
  ]
}

Using a document database in this fashion allows us to get everything that we need to display the page shown above in a single request.

2.1.2 Document databases are not relational


When starting out with a document database, the most common problems happen when users attempt to use relational concepts. The major issue with that is, of course, that Raven is non-relational. However, it's actually more than that; there is a reason why Raven is non-relational. A document database treats each document as an independent entity. By doing so, it is able to optimize the way documents are stored and managed. Moreover, one of the sweet spots that we see for a document database is storing large amounts of data (too much data to store on a single machine). Sharding a document database is very simple: since each document is isolated and independent, it is very easy to split the data across the various shard nodes, because there is no need to store a group of related documents together. Each document is independent and can be stored on any shard in the system. Another aspect of the non-relational nature of document databases is that documents are expected to be meaningful on their own. You can certainly store references to other documents, but if you need to refer to another document to understand what the current document means, you are probably using document databases incorrectly. With a document database, you are encouraged to include all of the information you need in a single document. Take a look at the post example in listing 2.2. In a relational database, we would have a link table for RelatedPosts, which would contain just the ids of the linked posts. If we wanted to get the titles of the related posts, we would need to join to the Posts table again. You can do that in a document database, but that isn't the recommended approach. Instead, as shown in the example above, you should include all of the details that you need inside the document [1]. Using this approach, you can display the page with just a single request, leading to much better overall performance.
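In code, such a denormalized reference is typically just a small class holding the id plus whichever properties the referring document actually needs. A sketch (the class names are illustrative, not a RavenDB API):

using System.Collections.Generic;

public class Post
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public List<string> Tags { get; set; }
    public List<Comment> Comments { get; set; }

    // Denormalized references: enough to render the links on the page,
    // without loading the related post documents.
    public List<PostReference> RelatedPosts { get; set; }
}

public class Comment
{
    public string Author { get; set; }
    public string Content { get; set; }
}

public class PostReference
{
    public string Id { get; set; }
    public string Title { get; set; }
}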

2.1.3 Documents are Aggregates


When thinking about using a document database to persist entities, we need to consider the two previous points. The suggested approach is to follow the Aggregate pattern from the Domain Driven Design book (http://domaindrivendesign.org/node/88): an Aggregate Root contains several entities and value types and controls all access to the objects contained in its boundaries. External references may only refer to the Aggregate Root, but never to one of its child entities / value objects. When you apply this sort of thinking to a document database, there is a natural and easy-to-follow correlation between an Aggregate Root (in DDD terms) and a document in a document database. An Aggregate Root, and all the objects that it holds, is a document. This also neatly resolves a common problem with Aggregates when using relational databases: traversing the path through the Aggregate to the object we need for a specific operation is very expensive in terms of the number of database calls. Using a document database, loading the entire Aggregate is just a single call, and hydrating a document to the full Aggregate Root object graph is a very cheap operation. Changes to the Aggregate are also easier to control. When using an RDBMS, it can be hard to ensure that concurrent requests won't violate business rules. The problem is that two separate requests may touch two different parts of the Aggregate, and while each request is valid on its own, together they result in an invalid state. This has led to the usage of coarse grained locks (http://martinfowler.com/eaaCatalog/coarseGrainedLock.html), which are hard to implement when using an RDBMS. Since a document database treats the entire Aggregate as a single document, the problem simply doesn't exist. You can utilize the database concurrency support to determine if the Aggregate or any of its children has changed. And if that happened, you can simply refresh the modified Aggregate and retry the transaction; a minimal sketch of this retry pattern appears below.
[1] Yes, that does mean that we are effectively denormalizing the data. RavenDB includes several mechanisms to deal with this issue, but in practice, it turns out to be a fairly minor concern. We will discuss this issue at more length later in this chapter.
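Here is that refresh-and-retry pattern with the RavenDB client session. This is a sketch: the Order class and ApplyChange stand in for your own domain logic, and the concurrency exception type name may differ between client versions.

while (true)
{
    using (var session = documentStore.OpenSession())
    {
        session.Advanced.UseOptimisticConcurrency = true;
        var order = session.Load<Order>("orders/95128");
        ApplyChange(order); // hypothetical domain operation on the aggregate

        try
        {
            session.SaveChanges();
            break; // the change was applied to an unmodified aggregate
        }
        catch (ConcurrencyException)
        {
            // Another request changed the document between our load and
            // save; loop to reload the fresh version and reapply.
        }
    }
}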




2.1.4 Relations and Associations


Aggregate Roots may contain all of their children, but even Aggregates do not live in isolation. Let us look at the documents in listing 2.3:
// listing 2.2 - The Order aggregate refers to other aggregates
{ // Order document - id: orders/95128
  "Customer": {
    "Id": "customers/84822",
    "Name": "John Doe"
  },
  "OrderLines": [
    {
      "Product": { "Id": "products/1724", "Name": "Milk" },
      "Quantity": 3,
      "Price": { "Amount": 1.2, "Currency": "USD" }
    }
  ]
}

{ // Product document - id: products/1724
  "Name": "Milk",
  "Price": { "Amount": 1.2, "Currency": "USD" },
  "OrganicFood": true,
  "GoodForYou": true
}

{ // Customer document - id: customers/84822
  "Name": "John Doe",
  "Email": "john.doe@example.org",
  "LastLogin": "2010-10-05T15:40:19"
}

The Aggregate Root for an Order will contain Order Lines, but an Order Line will not contain a Product. Instead, it contains a denormalized reference to the product. The product is another aggregate, obviously. And here we have a tension between competing needs. On the one hand, we want to be able to process the order document without having to reference another document (since this results in much better overall performance). But on the other hand, in order to do so, we have to duplicate the product (and customer, for that matter) information inside the order document. We will discuss this problem in the next section.

Note: What to denormalize?
While I think that denormalizing some data into the referring document is the right approach, you should carefully consider what sort of data you are going to denormalize. For example, in the customer case, we denormalized the customer name. That is a good choice, because a name is going to change rarely. But the LastLogin property is going to change all the time. In this case, we don't really care about the customer login time, but even if we did, we still wouldn't want to denormalize the LastLogin property. Like in most cases, the answer to "what to denormalize?" is: it depends! It depends on:
- How often does the value change?


- How important is the value to the referring document?

Luckily, in practice it turns out to be rare that you need access to a rapidly changing value from another document. But if you do, it might be a good idea to relax the "documents are independent" rule.

In a relational database, we can usually rely on Lazy Loading to help us, but most document database client APIs will not support lazy loading. This is an intentional, explicit design decision. Instead of relying on lazy loading, the expected usage is to hold the associated document key, as well as whatever information from the associated document is needed to process the current document. If you really need the full associated document, you need to explicitly load it². The reasoning behind this is simple: we want to make it just a tad harder to reference data in other documents. It is very common when using an Object Relational Mapper to do something like orderLine.Product.Name, which will lazily load the Product entity. That makes sense when you are living in a relational world, but a document database is not relational.
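As a minimal sketch of this modeling style (the class and property names here are illustrative assumptions, not RavenDB requirements), a denormalized reference holds the referenced document's id plus just the properties the referring document needs:

// A denormalized reference: the referenced document's id plus the few
// properties needed to process the referring document on its own
public class CustomerRef
{
    public string Id { get; set; }    // e.g. "customers/84822"
    public string Name { get; set; }  // denormalized; changes rarely
}

public class Order
{
    public string Id { get; set; }
    public CustomerRef Customer { get; set; }
}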

2.2 Denormalization isn't scary


Data modeling in a relational database is usually focused on discovering what data we need to keep, and normalizing it so each piece of data lives in only a single location. Normalization had such a major role in RDBMS because storage was expensive. It helps to remember that when a lot of those techniques were developed, in 1981, a megabyte of persistent storage cost US$460. At the time of this writing you can get a 1 terabyte HD for $63, putting the price of a gigabyte of persistent storage at about 6 cents! It made sense to try to optimize this with normalization. In essence, normalization is compressing the data, by taking the repeated patterns and substituting them with a marker.

There is also another issue: when normalization came out, the applications being built were far different from the type of applications we build today, in terms of the number of users, the time that you had to process a single request, the number of concurrent requests, the amount of data that you had to deal with, etc. Under those circumstances, it actually made sense to trade off read speed for storage. In today's world? I don't think that it holds as much.

The other major benefit of normalization, which took on extra emphasis as the reduction in storage became less important and HD sizes grew, is that when you state a fact only once, you can modify it only once. The corollary is that when you need to modify this data, you can do so in only one location. Except... there is a large set of scenarios where you don't want to do that.

Let us take invoices as a good example. In the case of an invoicing system, if you changed the product name from Thingamajig to Foomiester, that is going to be mighty confusing for users when they look at an invoice for a product that they never bought. What about the name of the customer? Think about the scenarios in which someone changes their name (marriage is the most common one, probably). If a woman orders a book under her maiden name, then changes her name after she marries, what is supposed to show on the order when it is displayed? If it is the new name, that person didn't exist at the time of the order.

Another very important consideration is cost. In the vast majority of systems, the number of reads far exceeds the number of writes. But normalization is a technique that trades off write speed for read speed (you have to write the data only once, but you have to join the data on every read). At the time the technique was introduced, it made a lot of sense, but today... I don't think so.

So we have ruled out the space saving as not really important, and the only thing that is left is the cost of actually ensuring that when we update the data, we update it in all locations. As I mentioned previously, there is a large set of scenarios where you actually don't want to update the data; you want to keep the information as it was at the time the document was created. Not surprisingly, this tends to show up a lot when you are dealing with data that represents actual documents (orders, invoices, loan contracts, etc.). And when you do want to update the data, you can do so when you write to the master source. That is a bit annoying, because you have to keep track of where you denormalized the data, but it isn't hard, and the end result is that you are doing some additional amount of work on writes (rare) while significantly reducing the amount of work that you do for reads (common). That is a good tradeoff, in my eyes.
² Note, however, that RavenDB specifically includes a feature to make such operations more efficient; the feature is called includes, and we discuss it in chapter 6.



2.3 Indexes bring order to a schema-free world


Document databases allow you to store data without requiring any schema. That is great, except that in practice, there isn't much that you can do if someone just hands you a document. You can display it, and allow the user to edit it, but that is about it. In practice, our documents usually have the same structure. An order will always have OrderLines, for example. And even though two different order documents may have slightly different schemas, they will tend to look fairly similar to one another.

Some document databases (RavenDB and CouchDB, for example) have the notion of indexes (CouchDB calls them Views), which allow us to bring some order back to our database. An index defines how to transform a document from the basic anything goes form to a predictable, known format. The advantages in that are huge. After all, there is a reason why relational databases require you to have a schema. When you have a known data format, there are a lot of things that you can do with it. In particular, you can search that data really fast. Moreover, you can pull the data directly from the index, skipping the schema free nature of documents in favor of the predictable nature of the index format. What happens in practice is that document databases generally use indexes to allow you to define how you want to query the documents.

There is another aspect to it, however. Remember the notion that documents are independent? That is great when you are thinking about a single document, but one of the major features that a user expects from a database is to be able to query over an aggregation of documents (how many posts are on Ayende's blog, for example). In document databases, aggregations are handled using map/reduce indexes.

Note: Don't Panic!
Yes, I know that map/reduce sounds scary. But map/reduce is really just another way to say group by. That is all map/reduce is, when you get down to it. We will discuss map/reduce indexes in detail in chapter 6; don't worry, you'll pick it up very quickly.

All aggregations inside a document database are done using map/reduce. Some databases (such as MongoDB) allow you to run those map/reduce queries on the fly. Others (RavenDB, CouchDB) require you to define a map/reduce index and then query the index. We will discuss the differences between the two approaches in chapter 6.

2.4 Summary
In this chapter we have explored what exactly a document database is, not only in the sense of what sort of data is stored inside a document database, but how we work with it. Documents can be arbitrarily complex, which allows us to hold an entire Aggregate Root inside a single document. And because documents are independent, they should not require referencing another document in order to process requests regarding that document. Therefore, we model documents so that they include denormalized references to other documents. Those denormalized references copy the document id as well as whatever properties are important to the referring document. We can handle denormalized updates in one of two ways:
- Keep the old data - useful for invoices, orders, etc., where the document represents a point in time.
- Update all copies of the data - useful when the data represents the current value.
RavenDB includes explicit support to make handling denormalized updates easier, which we discuss in TODO.


Finally, we discussed the role of indexes in a document database, and introduced the dreaded map/reduce indexes. Indexes are used to give the database a way to extract a schema out of a set of documents. And now, enough with discussing high level concepts; we are going to go ahead and start working with RavenDB directly and discover why it is the best document database³ that you have seen.

³ In my obviously unbiased opinion :-).


CHAPTER 3 - BASIC OPERATIONS


In this chapter:
- Creating and modifying documents
- Loading documents
- Querying documents
- Using System.Transactions

So far we have spoken in abstracts, about NoSQL in general and RavenDB in particular, but in this chapter, we leave the high level concepts aside and concentrate on actually using RavenDB. We will go through all the steps required to perform basic CRUD operations using RavenDB, familiarizing ourselves with RavenDB's APIs, concepts and workings. This chapter assumes usage of the RavenDB .NET Client API, and will provide examples of the underlying HTTP calls made for each operation.

3.1 Creating a document session


In order to communicate with a RavenDB instance, we must first create a document store and initialize it. You can see a sample of initializing a document store in listing 3.1:
// listing 3.1 - initializing a new document store
var store = new DocumentStore
{
    Url = "http://localhost:8080"
};
store.Initialize();

This will create a document store that connects to a RavenDB server running on port 8080 on the local machine.

Note: It is possible to run RavenDB in an embedded mode, in-process with your application, by utilising an EmbeddableDocumentStore; more information about this can be found in the documentation.

Once a document store has been created, the next step is to create a session against that document store that will allow us to perform basic CRUD operations within a Unit of Work. It is important to note that no changes will be made to the underlying document database until the SaveChanges method has been called, as in listing 3.2:
// listing 3.2 - saving changes using the session API
using (IDocumentSession session = store.OpenSession())
{
    // Operations against session

    // Flush those changes
    session.SaveChanges();
}


In this context, the session can be thought of as managing all changes internally, and SaveChanges can be thought of as committing all those changes to the RavenDB server. Any operations submitted in a SaveChanges call will be committed atomically (that is to say, either they all succeed, or they all fail). In the following examples it is assumed that a valid store has been created, that the calls are being made within the context of a valid session, and that SaveChanges is called safely at the end of that session's lifetime.

Note: If you don't call SaveChanges, all the changes made in that session will be discarded!

3.2 Saving a new document


Before we can start saving information to RavenDB, we must define what we will save. You can see the sample class structure in listing 3.3:
// listing 3.3 - Simple class structure
public class Blog
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Category { get; set; }
    public string Content { get; set; }
    public BlogComment[] Comments { get; set; }
}

public class BlogComment
{
    public string Title { get; set; }
    public string Content { get; set; }
}

We can now create a new instance of the Blog class, as shown in listing 3.4:
// listing 3.4 - creating a new instance of the Blog class
Blog blog = new Blog
{
    Title = "Hello RavenDB",
    Category = "RavenDB",
    Content = "This is a blog about RavenDB",
    Comments = new BlogComment[]
    {
        new BlogComment { Title = "Unrealistic", Content = "This example is unrealistic" },
        new BlogComment { Title = "Nice", Content = "This example is nice" }
    }
};

Note: Neither the class itself nor instantiating it requires anything from RavenDB, either in the form of attributes or in the form of special factories. The RavenDB Client API works with POCOs (Plain Old CLR Objects).

Persisting this entire object graph involves calling Store and then SaveChanges, as seen in listing 3.5:


// listing 3.5 - saving the new instance to RavenDB
session.Store(blog);
session.SaveChanges();

The SaveChanges call will produce the HTTP communication shown in listing 3.6. Note that the Store method operates purely in memory; only the call to SaveChanges communicates with the server:
POST /bulk_docs HTTP/1.1
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8080
Content-Length: 378
Expect: 100-continue

[{"Key":"blogs/1","Etag":null,"Method":"PUT","Document":{"Title":"Hello RavenDB","Category":"RavenDB"

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/2.0
Date: Tue, 16 Nov 2010 20:37:00 GMT
Content-Length: 205

[{"Etag": "00000000-0000-0100-0000-000000000002","Method":"PUT","Key":"blogs/1","Metadata":{"Raven-En

Two things of note at this point:
- We left the Id property of Blog blank, and it is this property that will be used as the primary key for this document.
- The entire object graph is serialized and persisted as a single document, not as a set of distinct objects.

Note: If there is no Id property on a document, RavenDB will allocate an Id, but it will be retrievable only by calling session.Advanced.GetDocumentId. In other words, having an Id property is entirely optional, but as it is generally more useful to have this information available, most of your documents should have one.
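As a minimal sketch of that API (the Tag class here is a hypothetical example of an entity without an Id property):

// A class without an Id property (hypothetical)
public class Tag
{
    public string Name { get; set; }
}

var tag = new Tag { Name = "nosql" };
session.Store(tag);
session.SaveChanges();

// The id that RavenDB assigned is still retrievable through the session
string tagId = session.Advanced.GetDocumentId(tag);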

3.3 Loading & Editing an existing document


If you have the id of an existing document (for example, the previously saved blog entry), it can be loaded in the following manner:
Blog existingBlog = session.Load<Blog>("blogs/1");

This results in the HTTP communication shown in listing 3.7:


GET /docs/blogs/1 HTTP/1.1
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8080

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Last-Modified: Tue, 16 Nov 2010 20:37:01 GMT
ETag: 00000000-0000-0100-0000-000000000002
Server: Microsoft-HTTPAPI/2.0
Raven-Entity-Name: Blogs
Raven-Clr-Type: Blog
Date: Tue, 16 Nov 2010 20:39:41 GMT
Content-Length: 214

{"Title":"Hello RavenDB","Category":"RavenDB","Content":"This is a blog about RavenDB","Comments":[{"

Changes can then be made to that object in the usual manner:


existingBlog.Title = "Some new title";

Flushing those changes to the document store is achieved in the usual way:
session.SaveChanges();

You don't have to call an Update method, or track any changes yourself; RavenDB will do all of that for you. The above example will result in the following HTTP messages:
POST /bulk_docs HTTP/1.1
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8080
Content-Length: 501
Expect: 100-continue

[{"Key":"blogs/1","Etag":null,"Method":"PUT","Document":{"Title":"Some new title","Category":"RavenDB HTTP/1.1 200 OK Content-Type: application/json; charset=utf-8 Server: Microsoft-HTTPAPI/2.0 Date: Tue, 16 Nov 2010 20:39:41 GMT Content-Length: 280

[{"Etag": "00000000-0000-0100-0000-000000000003","Method":"PUT","Key":"blogs/1","Metadata":{"Content-

Note: The entire document is sent to the server with the Id set to the existing document's value; this means that the existing document will be replaced in the document store with the new one. Whilst patching operations are possible with RavenDB, the client API will by default always replace the entire document.
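For contrast, here is a hedged sketch of a server-side patch using the lower-level commands API; the exact representation of the value differs between client versions:

// Patch a single property instead of replacing the whole document
store.DatabaseCommands.Patch("blogs/1", new[]
{
    new PatchRequest
    {
        Type = PatchCommandType.Set, // set a single field
        Name = "Title",
        Value = "Some new title"     // converted to the client's JSON value type
    }
});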

3.4 Deleting existing documents


Once a valid reference to a document has been retrieved, the document can be deleted with a call to Delete in the following manner:
session.Delete(blog);
session.SaveChanges();

Once again, this results in the HTTP communication shown in listing 3.8:
POST /bulk_docs HTTP/1.1
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8081
Content-Length: 49
Expect: 100-continue

[{"Key":"blogs/1","Etag":null,"Method":"DELETE"}]


3.5 Transaction support in RavenDB


All the previous examples have assumed that a single unit of work can be achieved with a single IDocumentSession and a single call to SaveChanges. For the most part this is definitely true; sometimes, however, we do need multiple calls to SaveChanges for one reason or another, but we want those calls to be contained within a single atomic operation. RavenDB supports System.Transactions for multiple operations against a RavenDB server, or even against multiple RavenDB servers. The client code for this is as simple as:
using (var transaction = new TransactionScope())
{
    Blog existingBlog = session.Load<Blog>("blogs/1");
    existingBlog.Title = "Some new title";
    session.SaveChanges();

    session.Delete(existingBlog);
    session.SaveChanges();

    transaction.Complete();
}

If at any point any of this code fails, none of the changes will be enacted against the RavenDB document store. The implementation details are not important, although it is possible to see that RavenDB does indeed send a transaction id along with all of the HTTP requests under this transaction scope, as shown in listing 3.9:
POST /bulk_docs HTTP/1.1
Raven-Transaction-Information: 975ee0bf-cac9-4b8e-ba29-377de722f037, 00:01:00
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8081
Content-Length: 300
Expect: 100-continue

[{"Key":"blogs/1","Etag":null,"Method":"PUT","Document":{"Title":"Some new title","Category":null,"Co

A call to commit involves a separate call to another HTTP endpoint with that transaction id:
POST /transaction/commit?tx=975ee0bf-cac9-4b8e-ba29-377de722f037 HTTP/1.1
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8081
Content-Length: 0

Note: While RavenDB supports System.Transactions, it is not recommended that this be used as an ordinary part of application workflow, as it works against the partition tolerance aspect of our beloved CAP theorem.

3.6 Basic query support in RavenDB


Once data has been stored in RavenDB, the next useful operation is the ability to query based on some aspect of the documents that have been stored.


For example, we might wish to ask for all the blog entries that belong to a certain category like so:
var results = from blog in session.Query<Blog>()
              where blog.Category == "RavenDB"
              select blog;

That Just Works(tm) and gives us all the blogs with a category of RavenDB. The HTTP communication for this operation is shown in listing 3.10:
GET /indexes/dynamic/Blogs?query=Category:RavenDB&start=0&pageSize=128 HTTP/1.1
Accept-Encoding: deflate,gzip
Content-Type: application/json; charset=utf-8
Host: 127.0.0.1:8081

The important part of this query is that we are querying the Blogs collection for the property Category with the value RavenDB. We will also notice that a page size of 128 was passed along, although none was specified, which leads us to the next topic: Safe by default.

3.7 Safe by default


RavenDB, by default, will not allow operations that might compromise the stability of either the server or the client. The two examples that present themselves in the examples above are:
- If a page size value is not specified, the length of the results will be limited to 128 results.
- The number of remote calls to the server per session is limited to 30.

The first one is obvious: unbounded result sets are dangerous, and have been the cause of many failures in ORM based systems. Unless a result size has been specified, RavenDB will automatically limit the size of the returned result set. The second example is less immediate, and should never be reached if RavenDB is being utilised correctly: remote calls are expensive, and the number of remote calls per session should be as close to 1 as possible. If the limit is reached, it is a sure sign of either a Select N+1 problem or other misuse of the RavenDB session.
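As a minimal sketch, paging can be made explicit with the standard Skip and Take operators instead of relying on the 128-result default (the page size of 25 here is an arbitrary choice for illustration):

var page = session.Query<Blog>()
    .Where(blog => blog.Category == "RavenDB")
    .Skip(0)   // start of the first page
    .Take(25)  // explicit page size
    .ToList();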

3.8 Summary
In this chapter we learned how to utilise the session as a basic Unit of Work in RavenDB, and saw a basic example of querying in action, as well as how these operations look as raw HTTP calls across the wire. We also saw how RavenDB attempts to be safe by default, limiting the capacity of common mistakes to cause damage in your application. In the next chapter, we will look more closely at the query API, and at how to utilise it within our applications to good effect.


CHAPTER 4

In this chapter:
- How indexes are stored
- Indexes are stale
- Simple indexes
- RavenDB Collections
- Projections
- Lucene Fields options: Storage, Indexing, Sorting, Analyzing


CHAPTER 5 - ADVANCED RAVENDB INDEXING


We have learned how to work with RavenDB and how to query it, but we still have only a very rough understanding of how everything actually works. The terms indexing and indexes were thrown around a lot, but we haven't yet talked about what they actually mean. This chapter will go into all the details about RavenDB indexes: when you should pay particular attention to them, and what sort of features they expose.

5.1 What is an index?


The first thing to approach, however, is to understand exactly what an index is. RavenDB doesn't allow unindexed queries, so all the queries that you make using RavenDB always use an index. That statement sounds strange on the face of it, doesn't it? So far, we have seen neither hide nor hair of any indexes, but we have certainly been able to query. The code in listing 5.1 certainly seems to work:
// listing 5.1 - querying RavenDB
var ayendeBlog = session.Query<Blog>()
    .Where(blog => blog.Title == "Ayende @ Rahien")
    .First();

How, then, does this work? When you make a query to RavenDB, the RavenDB query optimizer will find the appropriate index for the query. But what happens when there isn't any matching index? RavenDB will create a temporary index for us, just for this query. We discussed how RavenDB does this in the previous chapter. But we still don't have a good idea what an index is, right? Listing 5.2 shows the index that RavenDB generates on the server:
// listing 5.2 - the auto generated index created by RavenDB
from blogItem in docs.Blogs
select new { blogItem.Title }

That looks like a Linq query, and not like any sort of index I have seen before, so what is going on? Well, the answer is that what you see is the index definition function, which is what RavenDB uses to extract the information to be indexed from the documents. Let us assume that the server contains the documents in listing 5.3:
// listing 5.3 - sample documents
{ // blogs/1234
  "Title": "Ayende @ Rahien",
  "Author": "...",
  "StartedAt": "..."
}

{ // blogs/1235
  "Title": "Ravens Flight",
  "Author": "...",
  "StartedAt": "..."
}


The output of the indexing function in listing 5.2 over the documents in listing 5.3 is shown in listing 5.4:
// listing 5.4 - the output of the indexing function over the sample documents
{
  "Title": "Ayende @ Rahien",
  "__document_id": "blogs/1234"
}
{
  "Title": "Ravens Flight",
  "__document_id": "blogs/1235"
}

Those values are then stored inside a persistent index, which gives us the ability to perform low cost queries over the values stored in the index.

Note: Where did the __document_id in listing 5.4 come from? It doesn't appear in the indexing function in listing 5.2. That value is added by RavenDB to all the results of the indexing function; it is one of a few values that are automatically inserted by RavenDB (another is the __reduce_key value, which serves the same function, but for Map/Reduce indexes).

After RavenDB ensures that an index exists, it can query the index. In chapter 4, we discussed the way RavenDB builds indexes in the background and the notion of staleness. Because RavenDB doesn't have to wait for the indexing process to complete, it is able to produce answers without having to wait, even if there are concurrent indexing tasks running. All of that together brings us to the reason why RavenDB queries are so fast: all the queries run against precomputed indexes, and those queries never have to wait. The index storage format is a Lucene index, which is discussed in greater detail in Chapter TODO.

5.2 Index optimizations


We mentioned that if a query is made when the query optimizer cannot find an applicable index, that index will be created. That index is temporary, but it will hang around for a while, just in case additional queries that require it arrive. Indeed, if enough queries using that temporary index are made, the index will be promoted into a persistent index.

In general, it is better to have fewer indexes that each index more fields than many indexes that each index fewer fields. The dominating factor in indexing performance is I/O, and bigger indexes can utilize the disk I/O better than many indexes that each compete for disk I/O. For generated indexes, the query optimizer will aggregate indexes together, but for manually created indexes, you should be aware that you should strive for bigger indexes rather than fine-grained indexes.

For the most part, the generated indexes serve just fine, but there are several advanced options that are available when you write your own indexes. For the rest of the chapter, we will discuss those in detail.
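As a minimal sketch of the fewer, bigger indexes advice (the index name and the choice of fields are illustrative assumptions), one wider index can replace several single-field ones:

// One index covering several fields, instead of three single-field indexes
public class Blogs_ByTitleCategoryAndContent : AbstractIndexCreationTask<Blog>
{
    public Blogs_ByTitleCategoryAndContent()
    {
        Map = blogs => from blog in blogs
                       select new { blog.Title, blog.Category, blog.Content };
    }
}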

5.3 Collation
RavenDB supports sorting in a culture sensitive manner, but you have to explicitly tell it about that. The index definition in listing 5.5 shows how we can sort the shopping carts by the customer name using Swedish sorting rules:
// listing 5.5 - index definition sorting carts by customer name using Swedish sorting rules
public class ShoppingCarts_ByCustomerName_InSwedish : AbstractIndexCreationTask<ShoppingCart>
{
    public ShoppingCarts_ByCustomerName_InSwedish()
    {
        Map = carts => from cart in carts
                       select new { cart.Customer.Name };
        Analyzers.Add(
            x => x.Customer.Name,
            typeof(Raven.Database.Indexing.Collation.Cultures.SvCollationAnalyzer).AssemblyQualifiedName);
    }
}

Querying the ShoppingCarts_ByCustomerName_InSwedish index will now return results sorted by the customer name using the Swedish sorting rules. The same approach is available for most languages; all you need to do is change the two letter language code prefix of the CollationAnalyzer.
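As a minimal sketch (assuming the Linq provider can order by the nested Customer.Name field), querying the collated index looks like any other query:

// Results come back ordered according to the Swedish collation rules
var carts = session.Query<ShoppingCart, ShoppingCarts_ByCustomerName_InSwedish>()
    .OrderBy(cart => cart.Customer.Name)
    .ToList();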

5.4 Exact matches


By default, RavenDB uses a case insensitive match to compare values. There are certain values where case sensitivity matters, and you want to capture the value exactly as it is. You can do that by specifying that the value is NotAnalyzed, which will cause RavenDB to make an exact (and case sensitive) match against it. You can see how to set this option in listing 5.6:
// listing 5.6 - index definition using an exact (not analyzed) match
public class ShoppingCarts_ByCustomerName_NotAnalyzed : AbstractIndexCreationTask<ShoppingCart>
{
    public ShoppingCarts_ByCustomerName_NotAnalyzed()
    {
        Map = carts => from cart in carts
                       select new { cart.Customer.Name };
        Indexes.Add(x => x.Customer.Name, FieldIndexing.NotAnalyzed);
    }
}

5.5 Full text search


As mentioned previously, RavenDB defaults to a case insensitive match to compare values, but often we want to query on more than just the exact value; we want to query using Full Text Search, so that the value The Green Fox jumped over the Grey Hill would be matched by fox and by hill. In order to do that, we need to set the value to be Analyzed, which will enable full text searching on the value. Listing 5.7 shows how this can be done:
// listing 5.7 - index definition enabling full text search
public class ShoppingCarts_ByCustomerName_Analyzed : AbstractIndexCreationTask<ShoppingCart>
{
    public ShoppingCarts_ByCustomerName_Analyzed()
    {
        Map = carts => from cart in carts
                       select new { cart.Customer.Name };
        Indexes.Add(x => x.Customer.Name, FieldIndexing.Analyzed);
    }
}
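As a hedged sketch of querying such a field with the lower-level Lucene query API (the index name below assumes the usual convention of replacing underscores in the class name with slashes):

// Match carts whose analyzed customer name contains the term "fox"
var matches = session.Advanced
    .LuceneQuery<ShoppingCart>("ShoppingCarts/ByCustomerName/Analyzed")
    .Where("Name:fox")
    .ToList();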

Hierarchies
Spatial
WhereEntityIs
Suggestions


CHAPTER 6 - MAP / REDUCE INDEXES


In this chapter:
- What is map / reduce?
- How map / reduce works in RavenDB
- Creating and querying map / reduce indexes
- Where should we use map / reduce indexes?

One of the biggest hurdles for NoSQL databases has always been the perception that map/reduce is such a hard topic; this cannot be further from the truth. Map/reduce is actually a very simple (and elegant) solution to an equally simple problem. Map/reduce is simply another way to say group by. Chances are, you are already familiar with the notion of group by, and in fact, I am not aware of anyone who has nightmares about group by - but I do know a few people who get the shakes at the mere mention of map/reduce (just ask Bob <http://browsertoolkit.com/fault-tolerance.png>). It is usually best to demonstrate such concepts using an example, and we will use counting the number of comments for each blog as our map/reduce sample. You can see a sample blog post document in listing 6.1:
// Listing 6.1 - A sample blog post
{ // Document id: posts/1923
  "Name": "Ravens Map/Reduce functionality",
  "BlogId": "blogs/1234",
  "Comments": [
    { "Author": "Martin", "Text": "..." }
  ]
}

In order to answer the question of how many comments each blog has, we have to aggregate data from multiple documents. Using Linq, we can do so very easily, as shown in listing 6.2:
// Listing 6.2 - A Linq query to aggregate the comment count per blog
from post in docs.Posts
group post by post.BlogId into g
select new
{
    BlogId = g.Key,
    CommentCount = g.Sum(x => x.Comments.Length)
};


You have probably seen similar code scores of times. Unfortunately, this code has a small, almost insignificant problem: it assumes that it can access all the data. But what happens if the data is too big to fit in memory? Or even too big to fit on a single machine? This is where map/reduce comes into play. Map/reduce is merely meant to deal with group by on a massive scale; the concept is still the same old concept. It is just that we need to break the group by into multiple steps that can each run on a different machine.

6.1 Stepping through the map / reduce process


The first thing that we need to do is to break the operation in listing 6.2 into distinct operations. Let us look at what the original code there is doing... We start by grouping all posts on the BlogId, and then we select the BlogId and the sum of Comments.Length. This suggests that the only information that we actually need from a post are the BlogId and Comments.Length properties. So we define a Linq query that executes just that part, shown in listing 6.3:
// Listing 6.3 - Projecting just the required fields from the posts
from post in docs.Posts
select new
{
    post.BlogId,
    CommentCount = post.Comments.Length
}

How is this useful? Well, now instead of having to deal with full blown post documents, we can deal with a much smaller projection. We have minimized the amount of data that we have to work on, and if we feed a set of documents through the Linq query in listing 6.3, we are going to get the results we can see in listing 6.4:
// Listing 6.4 - The results of the query in listing 6.3
{ BlogId: "blogs/1234", CommentCount: 4 }
{ BlogId: "blogs/9313", CommentCount: 2 }
{ BlogId: "blogs/1234", CommentCount: 3 }
{ BlogId: "blogs/2394", CommentCount: 1 }
{ BlogId: "blogs/9313", CommentCount: 0 }

The size difference between the results of the query and the original documents is pretty big, as you can see. Now we need to take a look at the second part of this task: grouping the results and performing the actual aggregation. We do that in listing 6.5:
// Listing 6.5 - Grouping the results to find the final result
from result in results
group result by result.BlogId into g
select new
{
    BlogId = g.Key,
    CommentCount = g.Sum(x => x.CommentCount)
}

So far, so good. The query in listing 6.5 seems reasonable; it is very similar to the one we have seen in listing 6.2, after all. What we need to do now is to feed the results in listing 6.4 through this query. We can see the result of that in listing 6.6:
// Listing 6.6 - The aggregated results
{ BlogId: "blogs/1234", CommentCount: 7 }
{ BlogId: "blogs/9313", CommentCount: 2 }
{ BlogId: "blogs/2394", CommentCount: 1 }


So far, we haven't done anything special. But we have actually done something that might surprise you: we have defined a pair of map/reduce functions. Listing 6.3 is the map function. Listing 6.5 is the reduce function. I know what you are thinking - I am explaining things that you already know - but bear with me; the fat lady hasn't sung yet, after all. I didn't complicate the query in listing 6.2 by breaking it apart into two separate queries for no reason. Let us assume that we have another data set, on another machine. This data set is shown in listing 6.7:
// Listing 6.7 - A data set from another machine
{ BlogId: "blogs/1234", CommentCount: 5 }
{ BlogId: "blogs/7269", CommentCount: 2 }
{ BlogId: "blogs/1234", CommentCount: 4 }
{ BlogId: "blogs/9313", CommentCount: 2 }

We want to get the answer for all blogs, not just the posts on a particular machine (the query in listing 6.2 would do just fine for that). What we are going to do is run all the data in listing 6.7 through the query in listing 6.5, giving us the data in listing 6.8:
// Listing 6.8 - The reduced results from the second machine
{ BlogId: "blogs/1234", CommentCount: 9 }
{ BlogId: "blogs/7269", CommentCount: 2 }
{ BlogId: "blogs/9313", CommentCount: 2 }

The fun part starts now, because the reduce function can be applied recursively. What we are going to do now is execute the query in listing 6.5 on the data in both listing 6.6 and listing 6.8 (we simply combine the two data sets and execute the query on all of the data at once). This gives us the results in listing 6.9:
// Listing 6.9 - The final results, combined across both machines
{ BlogId: "blogs/1234", CommentCount: 16 }
{ BlogId: "blogs/7269", CommentCount: 2 }
{ BlogId: "blogs/9313", CommentCount: 4 }
{ BlogId: "blogs/2394", CommentCount: 1 }

And that is the whole secret of map/reduce, honestly. We were able to take two data sets from two distinct nodes, and by applying the map/reduce algorithm, we were able to derive the final result for an aggregation that spanned machine boundaries.

6.2 What is map/reduce, again?


Map/reduce¹ is simply a way to break the concept of group by into multiple steps. By breaking the group by operation into multiple steps, we can execute the operation over a set of machines, allowing us to execute such operations on data sets which are too big to fit on a single machine. Map/reduce is composed of two steps:
- The first step is the map. The map is just a function (or a Linq query) which is executed over a data set. It is the responsibility of the map to filter the data set of anything we don't care about (the Linq where clause) and to project the data that we are interested in for the task at hand (the Linq select clause).
- The second step in the map/reduce process is the reduce function (or, again, a Linq query). This function takes the output of the map function and reduces the values. In practice, the reduce function almost always uses a group by clause to aggregate the incoming data set based on a common key.

Distributed map/reduce relies on an executor that can run the map function, and then the reduce function on the output of the map function. If multiple nodes are used, the executor merges the reduced data from several nodes and then executes reduce again on these merged result sets.
¹ Map/reduce is an old concept; most functional languages use the notions of map and reduce as core constructs. In many such languages, those functions usually serve where loops would be used in procedural languages. Google is responsible for taking those concepts and popularizing them with regards to distributing work across a set of worker nodes.


Most of the complexity that has been attached to map/reduce exists because writing the executor is a non-trivial task, but conceptually, the idea is very simple.

6.3 Rules for Map/Reduce operations


RavenDB primarily uses Linq queries to define the map and reduce functions, and Linq queries tend to naturally match the rules for map/reduce functions, but it is important to be aware of what those rules are. The reduce function must be able to process the map function's output as well as its own output. This is required because reduce may be applied recursively to its own output. In practice, what this means is that the map function outputs the same type as the output of the reduce function. Since the types are the same, it is naturally possible to run the reduce function on its own output (after all, it is also the map function's output). Listing 6.10 shows an example of a map/reduce pair returning the same type:
// Listing 6.10 - Map/reduce pair returning the same type
// map
from post in docs.Posts
select new
{
    post.BlogId,
    CommentCount = post.Comments.Length
}

// reduce
from result in results
group result by result.BlogId into g
select new
{
    BlogId = g.Key,
    CommentCount = g.Sum(x => x.CommentCount)
}

And listing 6.11 shows an example of an invalid map/reduce pair:


// Listing 6.11 - Map/reduce pair returning different types
// map
from post in docs.Posts
select new
{
    post.BlogId,
    CommentCount = post.Comments.Length
}

// reduce
from result in results
group result by result.BlogId into g
select new
{
    BlogId = g.Key,
    TotalComments = g.Sum(x => x.CommentCount)
}

If we attempt to send the output of the reduce function in listing 6.11 back into the same function, we are going to get an error, because there is no CommentCount in the output of the reduce function.

The map and reduce functions must also be pure functions. A pure function is a function that:
- Given the same input, will return the same output, i.e. [ map(doc) == map(doc), for any doc ]. What this means is that you cannot rely on any external input, only on the input that was passed in.
- Has no side effects when evaluated. What this means in practice is that you can't make any external calls from the map/reduce functions. That isn't an onerous requirement, since you usually don't have a way to make external calls anyway.

As I mentioned, for the most part we don't really need to pay close attention to those rules; Linq queries tend to follow them anyway.
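As a minimal sketch of what not to do (this fragment is illustrative, not taken from RavenDB), the map below is impure because its output depends on the clock rather than only on the document:

// An impure map function to avoid: the same document produces
// different output on every run
from post in docs.Posts
select new
{
    post.BlogId,
    IndexedAt = DateTime.UtcNow // external input; violates purity
}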


6.4 Applications of Map/Reduce


As I mentioned, map/reduce is mostly just a glorified way of using group by. But what is interesting is how useful this is. One obvious use of map/reduce is running simple aggregations:
- Count
- Sum
- Distinct
- Average
And many others like that. But you can also use map/reduce to implement joins; we will discuss how to do just that later in this chapter.

Map/reduce is not applicable, however, in scenarios where the data set alone is not sufficient to perform the operation. In the case of a navigation computation, you can't really handle this via map/reduce because you lack key data points (the start and end points). Trying to compute paths from all points to all other points is probably a losing proposition, unless you have a very small graph.

Another problem occurs when you have a 1:1 mapping between input and output. Oh, map/reduce will still work, but the resulting output is probably going to be too big to be really useful. It also means that you have a simple parallel problem, not a map/reduce sort of problem. Map/reduce assumes that the reduce step is going to... well, reduce the data set :-).

If you need fresh results, map/reduce isn't applicable either; it is inherently a batch operation, not an online one. Trying to invoke a map/reduce operation for a user request is going to be very expensive, and not something that you really want to do. And if your data size is small enough to fit on a single machine, it is probably going to be faster to process it as a single reduce(map(data)) operation than to go through the entire map/reduce process (which requires synchronization).

And now that we have discussed what map/reduce is, exactly, let us see how RavenDB uses it and how you can utilize map/reduce within RavenDB.

6.5 How map/reduce works in RavenDB


RavenDB uses map/reduce to allow you to perform aggregations over multiple documents. One thing that is important to note from the start is that RavenDB doesn't apply distributed map/reduce; it runs all the map/reduce operations locally. This raises the question: if we are going to use map/reduce on a single machine only, why bother? Can't we just execute the process as a single Linq query with a group by clause?

Theoretically, we could do that. But while RavenDB doesn't use distributed map/reduce, it does have a use for map/reduce, and that is avoiding unnecessary computation and I/O. Because a map/reduce process is commutative, we can efficiently cache and partition work as needed. When a document that is indexed by a map/reduce index is changed, we run the map function only on that document, and then reduce the document along with the reduce results of all the other documents that share the same reduce key (the item the Linq query groups on). Listing 6.12 shows a reduce function:
// Listing 6.12 - A sample reduce function
from result in results
group result by result.BlogId into g
select new
{
    BlogId = g.Key,
    CommentCount = g.Sum(x => x.CommentCount)
}


The reduce key in listing 6.12 is the value of result.BlogId. RavenDB will use that to optimize the values it passes to the reduce function (the actual group by is usually done by RavenDB, and not by the Linq query). This results in a much cheaper cost of indexing for map/reduce indexes, compared to running a single query with a group by over all documents with the same reduce key.

Note: RavenDB doesn't implement re-reduce (yet). This is an implementation detail that should only concern you if you are interested in reducing a very large number of results with the same reduce key. RavenDB currently implements reduce as a single operation, and will pass all the documents with the same reduce key to a single reduce function call. This may cause performance issues if you have very large numbers of results with the same reduce key, where very large is in the tens or hundreds of thousands of results per reduce key. Fixing this limitation is already on the roadmap.

We are almost done with the theory, I promise. We just have to deal with one tiny detail before we can start looking at some real code.

6.6 How RavenDB stores the results of map/reduce indexes


In the previous chapter we discussed how RavenDB deals with the results of simple indexes (containing only a map function). Map/reduce indexes actually produce two different data points. The first is the output of the map function; internally these values are called mapped results inside RavenDB. They are never exposed externally, but they are what allows RavenDB to perform partial index updates. The second output is the output of the reduce function. This is the externally visible output of a map/reduce index, and like simple indexes, that data is stored inside a Lucene index. Storing the data in Lucene allows efficient and full featured querying capabilities (as well as all the other goodies, like full text searching).

Unlike simple indexes (where the assumption is that most of the time you would like to search on the index, but get the actual document), map/reduce indexes don't just serve as an index; they actually store the data that we are going to get as a result of a query. For example, if I query the index that we defined in listings 6.3 and 6.5 (and whose output is shown in listing 6.9) for the result for the blogs/9313 blog, we will get:
{ BlogId: "blogs/9313", CommentCount: 4 }

This value is stored in the index itself, and it is loaded directly from there. This means that you don't touch any documents when you query a map/reduce index; all the work is handled by RavenDB in the background. And like simple indexes, it is possible to query a map/reduce index and get a stale result. We handle this in exactly the same way we handle stale indexes with simple indexes. And now, after much ado, let us get to coding and write our first map/reduce index.

6.7 Creating our first map/reduce index


Using our shopping cart example, we want to find out how many items of each product were sold. As a reminder, listing 6.13 shows the format of a shopping cart:
// listing 6.13 - a shopping cart document
{ // shoppingcarts/1342
  "Products": [
    { "Id": "products/31", "Quantity": 3 },
    { "Id": "products/25", "Quantity": 1 }
  ]
}


Before we start writing the map/reduce index, I usually find it useful to write the full Linq query that does the same calculation; that tends to make it easier to write the index later on. The Linq query is shown in listing 6.14:
// listing 6.14 - a linq query to calculate the count of products across all shopping carts
from shoppingCart in docs.ShoppingCarts
from product in shoppingCart.Products
group product by product.Id into g
select new
{
    ProductId = g.Key,
    Count = g.Sum(x => x.Quantity)
}

The next step is to break the query in listing 6.14 into multiple steps, and create an index out of it. We will use the AbstractIndexCreationTask class to do that, as shown in listing 6.15:
// listing 6.15 - The products count index
public class Products_ByCountInShoppingCart
    : AbstractIndexCreationTask<ShoppingCart, ProductByCountProjection>
{
    public Products_ByCountInShoppingCart()
    {
        Map = carts => from cart in carts
                       from product in cart.Products
                       select new
                       {
                           ProductId = product.Id,
                           Count = product.Quantity
                       };

        Reduce = results => from result in results
                            group result by result.ProductId into g
                            select new
                            {
                                ProductId = g.Key,
                                Count = g.Sum(x => x.Count)
                            };
    }
}

The Map part of the index extracts a count for each product from all the shopping carts, exactly as in the blog example that we examined previously. The only interesting part is that we dig deeper into the shopping cart, and project values from one of its collections. The Reduce part then aggregates the results by the product id into the final answer. You might have noticed that we have added a new twist to the AbstractIndexCreationTask, in the form of an additional generic parameter. The second parameter, ProductByCountProjection, is the output type of the Map function and is both the input and output type of the Reduce function.
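The ProductByCountProjection class is not defined in the listing; its shape is dictated by the map/reduce output above, so a minimal sketch would be:

// Matches the anonymous type projected by the Map and Reduce functions
public class ProductByCountProjection
{
    public string ProductId { get; set; }
    public int Count { get; set; }
}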

6.8 Querying map/reduce indexes


Just like standard indexes, we can query a map / reduce index using the session API. Listing 6.16 shows loading the sold count for a particular product:
// listing 6.16 - querying a map / reduce index
var results = session.Query<ProductByCountProjection, Products_ByCountInShoppingCart>()
    .Where(x => x.ProductId == "products/31")
    .ToList();

The first generic parameter of the Query method is the type of the results, while the second parameter indicates which index we should query. Unlike standard indexes (also called simple indexes or map-only indexes), the result of a map / reduce index is always a projection and never the original document. We usually use the same type for the results that we used when creating the index with the AbstractIndexCreationTask class. Now that we know how to create and query these indexes, we can move on to an important topic: _where_ should we use them?


6.9 Where should we use map / reduce indexes?


Map / reduce indexes are very useful for aggregating data, but they shouldn't be confused with a full blown reporting solution. While you can certainly use map / reduce indexes to build _some_ reports, in many cases a report requires more than a map / reduce index can provide (for example, map / reduce indexes cannot support arbitrary grouping). Map / reduce indexes are useful when we want to look at the data in a single, fixed shape. One common usage is as part of a homepage or dashboard view. A major advantage of map / reduce indexes in RavenDB is that (like standard indexes) they are pre-computed, which means that querying them is a very cheap operation. That makes them ideal for aggregating large amounts of data that will be viewed often.

6.10 Summary
In this chapter, we have learned what map / reduce is: a way to break the calculation of data into discrete units that can be processed independently (and even on separate machines). Afterward, we went on to discover how map / reduce is implemented inside RavenDB and how best to take advantage of that. We finished with a sample of creating and querying a map / reduce index, which allowed us to calculate how many items were sold for each product. Because of the way map / reduce works in RavenDB, querying the index is very cheap, and we can use this as part of the product page, to show, for example, how popular a particular product is.

Finally, we discussed where we want to use map / reduce indexes. The obvious answer is that we want to use them whenever we have a reason to use aggregation, but we have to be aware that unlike group by queries in a relational database, map / reduce queries in RavenDB don't allow arbitrary grouping (which rules them out for use as part of a generic reporting service). On the other hand, they do provide very fast responses for fixed queries, such as the ones typically used in dashboard / homepage scenarios. Their low cost of querying makes it efficient to use them even in the high traffic locations of your applications.

In the next chapter, we will discuss Live Projections, Includes and other advanced indexing options. In the chapter after that, we will go over various querying scenarios and see how we can solve them with RavenDB.


CHAPTER 7 - SCALING RAVENDB


In this chapter:
- Sharding
- Scaling effects on system design
- Sharding strategies
- Adding a node to a sharded datastore


CHAPTER 8 - REPLICATION
In this chapter:
- Master -> Slave
- Failover
- Master <-> Master
- Conflicts


CHAPTER 9 - AUTHORIZATION
In this chapter:
- Role based authorization
- Document based authorization
- Tag based authorization


CHAPTER 10 - EXTENDING RAVENDB


In this chapter:
- Put Triggers
- Delete Triggers
- Read Triggers: Indexing, Querying, Load
- Index Update Triggers
- Codecs
- Tasks: Background, Startup


CHAPTER 11 - RAVENDB BUILTIN BUNDLES


In this chapter:
- The Versioning Bundle
- The Expiration Bundle
- The Index Replication Bundle


CHAPTER 12 - BUILDING YOUR OWN BUNDLE


In this chapter:
- How to build your own bundle
- Configuration
- Context
- Deploying your bundle


CHAPTER 13 - ADMINISTRATION
In this chapter:
- Backup
- Installation
- Deployment: Standalone service, IIS, Shared Hosting
- Optimizing configuration


CHAPTER 14 - HOW RAVENDB USES LUCENE


In this chapter:
- How RavenDB uses Lucene
- Lucene
- How indexes are stored
- Advanced Lucene Options: Analyzing, Sorting, Storage, Indexing

14.1 Lucene
The RavenDB indexing mechanism is implemented using the open-source Lucene.NET library (http://lucene.apache.org/lucene.net/), a C# port of the original Java library (http://lucene.apache.org/).

Lucene is a full-text search library that makes it easy to add search functionality to an application. It does so by adding content to a full-text index. It then searches this index and returns results ranked either by relevance to the query or by an arbitrary field, such as a document's last modified date.

The best way of thinking about the indexes in RavenDB is to imagine them as a database's materialized views. Raven executes your indexes in the background, and the results are written to disk. This means that when we perform a query, we have to do very little work. This is how RavenDB manages to achieve its near instantaneous replies to your queries: it doesn't have to think, because all the processing has already been done.

Lucene comes with an advanced set of query options (http://lucene.apache.org/java/2_4_0/queryparsersyntax.html) that allow RavenDB to support the following (which is still just a partial list):
- full text search
- partial string matching
- range queries (date, integer, float, etc.)
- spatial searches
- auto-complete or spell-checking
- faceted searches


14.2 How indexes are stored


Let's start by looking at a simple scenario. Let's assume we have the type of document shown in listing 4.1:
// Listing 4.1 - A sample user document
{ // Document id: users/101
  "Name": "Matt Warren",
  "Age": 30
}

And the following index:


// Listing 4.2 - A simple index
var index = new IndexDefinition
{
    Map = "docs => from doc in docs select new { doc.Name }",
};
db.DatabaseCommands.PutIndex("SimpleIndex", index);

Note: In this chapter all code samples are written using the Lucene syntax, as we are looking at Lucene itself. However, the recommended way of using RavenDB is via the LINQ API; see Chapter 3 for more information about this.

Take a look at figure 4.3 to see how a simple index is stored. By default RavenDB does the following when indexing a text field:
- Analyzes the field using a lower case analyzer (Matt Warren -> matt warren)
- Stores the ID of the document that the term comes from

The field is converted to lower case so that case sensitivity isn't an issue in basic queries. The ID of the document is stored so that RavenDB can pull the document out of the data store after it has performed the Lucene query. Remember, RavenDB only uses Lucene to store the indexed data, not the actual documents themselves; this reduces the total size of the index.

However, things are slightly more complex when dealing with numbers. The rules that RavenDB follows here are:
- If the value is null, create a single field with the supplied name and the unanalyzed value NULL_VALUE
- If the value is a string or was set to be not analyzed, create a single field with the supplied name and value
- If the value is a date, create a single field with millisecond precision and the supplied name
- If the value is numeric (int, long, double, decimal, or float), create two fields:
  - one using the field name, containing the numeric value as an unanalyzed string - useful for direct queries
  - one using the field name + _Range, containing the numeric value in a form that allows range queries

The last item is important. To enable RavenDB to perform range queries (i.e. Age > 4, Age < 40, etc.) with Lucene, it needs to store the numerical data in a format that is suitable for this. But it also stores the value in its original format so that direct queries (such as exact matches) can be performed. Take a look at figure 4.4 to see how a complex index is stored.
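As a minimal sketch of what this enables (the User class below is an assumption matching the document in listing 4.1), a range comparison in a Linq query is answered from the _Range field:

public class User
{
    public string Name { get; set; }
    public int Age { get; set; }
}

// The client translates the comparisons into a query against Age_Range
var users = session.Query<User>()
    .Where(user => user.Age > 4 && user.Age < 40)
    .ToList();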

14.3 Advanced Lucene Options


RavenDB gives you full control over the indexing process by exposing the low-level Lucene options as part of the index definition. You can use these like so:


Figure 14.1: Figure 4.3 - A simple index


Figure 14.2: Figure 4.4 - A complex index


IndexDefinition indexAnalysed = new IndexDefinition()
{
    Map = "docs.Users.Select(doc => new { Name = doc.Name })",
    Analyzers = { { "Name", typeof(SimpleAnalyzer).FullName } },
    SortOptions = { { "Age", SortOptions.Double } },
    Stores = { { "Name", FieldStorage.Yes } }
};

14.3.1 Analyzing
By default RavenDB uses a lower case analyzer, which converts a string into a lower case version. But this isn't useful if you'd like to do a full-text search on your documents. To achieve this you need to tokenize or analyze the fields you are indexing. For instance, given a field that contains the text "The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.", the standard analyzers behave as follows:

- Keyword Analyzer keeps the entire stream as a single token: [The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.]
- Whitespace Analyzer tokenizes on white space only (note the punctuation at the end of dog): [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog,] [bob@hotmail.com] [123432.]
- Stop Analyzer strips out common English words (such as and, at etc.), tokenizes letters only and converts everything to lower case: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob] [hotmail] [com]
- Simple Analyzer only tokenizes letters and makes all tokens lower case: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [bob] [hotmail] [com]
- Standard Analyzer is a tokenizer that uses a stop list of common English words, and also handles numbers and email addresses correctly: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]

You would then perform the same analysis on the text you want to match. For instance, quick brown -> [quick] [brown], and Lucene would find all the documents with both of these terms in them.
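To make this concrete, here is a hedged sketch of querying a field that was indexed with the SimpleAnalyzer, as in the index definition shown above; the "UsersByNameAnalyzed" index name is hypothetical:

// A sketch only - assumes an index whose Name field uses SimpleAnalyzer
using (var session = store.OpenSession())
{
    // The query terms go through the same analysis as the indexed text,
    // so this matches documents containing both [quick] and [brown]
    var matches = session.Advanced.LuceneQuery<User>("UsersByNameAnalyzed")
        .Where("Name:(quick AND brown)")
        .ToList();
}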

14.3.2 Sorting
When Lucene sorts values, it performs this against an encoded version of the number (a binary representation). This means that in certain situations it can get the sort order wrong, for instance when sorting double and float values, or short/int/long values. To get around this issue you can explicitly set the sort option of the field.
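A minimal sketch of doing so, assuming a Users collection with a numeric Age property (the index name and the SortOptions.Int value are assumptions, in the spirit of the definition shown earlier):

// Sketch: tell Lucene to sort Age as an integer rather than by its
// lexicographic (string) representation, so that 9 sorts before 30
var byAge = new IndexDefinition()
{
    Map = "docs.Users.Select(doc => new { Age = doc.Age })",
    SortOptions = { { "Age", SortOptions.Int } }
};
db.DatabaseCommands.PutIndex("UsersByAge", byAge);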

14.3.3 Storage
For completeness, RavenDB allows you to control whether or not a field is stored in the index. This could be useful if you wanted to pull back data directly from the Lucene index, but there are very few scenarios where this is useful. It's far better to let RavenDB handle this for you, so specifying this option isn't really recommended. Note that RavenDB allows you to use projections directly from the document, without needing to store the fields in the index, which means that there usually aren't good reasons to store field data.
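A hedged sketch of such a projection, assuming the client's LINQ provider supports Select projections as shown here, and reusing the SimpleIndex from listing 4.2:

// Sketch: project just the Name property. RavenDB can satisfy this from
// the stored document itself, so FieldStorage.Yes is rarely needed
using (var session = store.OpenSession())
{
    var names = session.Query<User>("SimpleIndex")
        .Select(u => new { u.Name })
        .ToList();
}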


14.3.4 Indexing
Indexing options allow you to control how you can search on an index. For the most part, you can just leave them at RavenDB's defaults. These options, along with the storage option, are there for completeness' sake more than anything else, and are only going to be useful for expert usage, if that.
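For reference, the option is set in the same way as the others. In this hedged sketch, the FieldIndexing.NotAnalyzed value and the index name are assumptions based on the Lucene options exposed through the IndexDefinition:

// Sketch: opt the Name field out of analysis entirely, so that it can
// only be matched by exact terms rather than analyzed tokens
var exact = new IndexDefinition()
{
    Map = "docs.Users.Select(doc => new { Name = doc.Name })",
    Indexes = { { "Name", FieldIndexing.NotAnalyzed } }
};
db.DatabaseCommands.PutIndex("UsersByExactName", exact);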


Chapter 15 - Summary
In this book..


Chapter 16 - Things to talk about


- Set based updates
- Automatic indexing
- Transactions
- Replication to SQL
- Primary keys management
- Implementation of denormalized references in Raven
- Modeling many to many, many to one, one to one
- Explain that the term abcdef will return contains(cde) == false in Lucene
