Release 1.0
CONTENTS
1  NoSQL? What is that?
   1.1  How we got this NoSQL thing?
   1.2  NoSQL data stores
   1.3  How to select a data storage solution?
   1.4  Summary

2  Grokking Document Databases
   2.1  Data modeling with document databases
   2.2  Denormalization isn't scary
   2.3  Indexes bring order to a schema free world
   2.4  Summary

3  Chapter 3 - Basic Operations
   3.1  Creating a document session
   3.2  Saving a new document
   3.3  Loading & editing an existing document
   3.4  Deleting existing documents
   3.5  Transaction support in RavenDB
   3.6  Basic query support in RavenDB
   3.7  Safe by default
   3.8  Summary

4  Chapter 4 - RavenDB Indexes

5  Advanced RavenDB indexing
   5.1  What is an index?
   5.2  Index optimizations
   5.3  Collation
   5.4  Exact matches
   5.5  Full text search

6  Map / Reduce indexes
   6.1  Stepping through the map / reduce process
   6.2  What is map/reduce, again?
   6.3  Rules for map/reduce operations
   6.4  Applications of map/reduce
   6.5  How map/reduce works in RavenDB
   6.6  How RavenDB stores the results of map/reduce indexes
   6.7  Creating our first map/reduce index
   6.8  Querying map/reduce indexes
   6.9  Where should we use map / reduce indexes?
   6.10 Summary

7  Chapter 7 - Scaling RavenDB

8  Chapter 8 - Replication

9  Chapter 9 - Authorization

10 Chapter 10 - Extending RavenDB

11 Chapter 11 - RavenDB Built-in Bundles

12 Chapter 12 - Building your own Bundle

13 Chapter 13 - Administration

14 Summary

15 Things to talk about
This book is dedicated to my father. Dad, from your mouth to God's ears. Oren Eini, 2010
Warning: This book is a draft; it is known to have multiple spelling & grammar issues. The source for this page can be found here: http://github.com/ravendb/docs We would love to hear suggestions & improvements about this book. The discussion group for this book can be found here: http://groups.google.com/group/ravendb/
CHAPTER ONE

NOSQL? WHAT IS THAT?
Use a NoSQL solution. What it boils down to is that when you bring the need to scale to multiple machines, the drawbacks of using an RDBMS (TODO: provide a full list) outweigh the benefits that it usually brings to the table. Since we have to do a lot of work already with sharded SQL databases, it is worth turning our attention to the NoSQL alternatives, and asking why we might want to choose them. This book is about RavenDB, a Document Database, but I want to give you at least some background on each of the common NoSQL database types before starting to talk about RavenDB specifically.
Data access strategy follows the data access pattern
One of the most common problems that I find when reviewing a project is that the first step (or one of them) was to build the Entity Relations Diagram, thereby sinking a large time/effort commitment into it before the project really starts and real world usage tells us what sort of data we actually need and what the data access pattern of the application is. One of the major problems with this approach is that it simply doesn't work with NoSQL solutions. An RDBMS allows very flexible querying, so you can sometimes get away with this approach (although it is generally discouraged when using an RDBMS as well), but NoSQL solutions often require you to query / access the data only in a predefined manner (for example, key/value stores allow access only by key). This means that the structure of your data is usually going to be dictated by the way that you are going to access it. This is usually a surprise for people coming from the RDBMS world, since it is the inverse of how you usually model data in an RDBMS. We will discuss modeling techniques for a document database in Chapter 2.
There are many variations, but that is the basis for everything else. A key/value store allows you to store values by key, as simple as that. The value itself is just a blob; as far as the data store is concerned, it just stores it, it doesn't actually care about the content. In other words, we don't have a data-store-defined schema, but client-defined semantics for understanding what the values are. The benefit of using this approach is that it is very simple to build a key/value store, and that it is very easy to scale it. It also tends to have great performance, because the access pattern in a key/value store can be heavily optimized. In general, most key/value operations can be performed in O(1), regardless of how many machines there are in the data store and regardless of how much data is stored.
Concurrency
In a key/value store, concurrency is only applicable on a single key, and it is usually offered as either optimistic writes or as eventual consistency. In highly scalable systems, optimistic writes are often not possible, because of the cost of verifying that the value hasn't changed (assuming the value may have replicated to other machines); therefore, we usually see either a key master (one machine owns a key) or the eventual consistency model, which is discussed below.
Queries
There really isn't any way to perform a query in a key/value store, except by the key. Some key/value stores allow range queries on the key, but that is rare. Most of the time, queries on key/value stores are implemented by the user, using a manually maintained secondary index.
Transactions
While it is possible to offer transaction guarantees in a key/value store, those are usually only offered in the context of a single key put. It is possible to offer them on multiple keys, but that really doesn't work when you start thinking about
a distributed key/value store, where different keys may reside on different machines. Because of that, it is typically best to think about key/value stores as allowing transactions on a single key put on a single machine. Please note that transactions do not imply ACID. In a distributed key/value store, the only way to ensure that is if a key can reside on a single machine. However, we usually do not want that; we want each key to live on multiple machines, to avoid data loss / data unavailability if a node goes down for some reason. We discuss this model (also called an eventually consistent key/value store) below.
Schema
Key/value stores have the following schema: Key is a string, Value is a blob. Which is probably not a very useful schema for your purposes. Beyond that, the client is the one that determines how to deal with the data. The key/value store just stores it.
Scaling Up
In key/value stores, there are two major options for scaling. The simplest one would be to shard the entire key space. That means that keys starting with A go to one server, while keys starting with B go to another server, and so on. In this system, a key is only stored on a single server. That drastically simplifies things like transaction guarantees, but it exposes the system to data loss if a single server goes down. At this point, we introduce replication, which gives us safety from data loss, but also forces us to give up on ACID guarantees.
Replication
In key/value stores, the replication can be done by the store itself or by the client (writing to multiple servers). Replication also introduces the problem of divergent versions. In other words, two servers in the same cluster think that the value of key ABC is two different things. Resolving that is a complex issue; the common approaches are to decide that it can't happen (Scalaris) and reject updates where we can't ensure there is no conflict, or to accept all updates and ask the client to resolve them for us at a later date (Amazon Dynamo, Rhino DHT).
Eventually consistent key/value stores
A system which decides that divergent versions of the same key should be avoided will reject updates if such a scenario may happen. Following the CAP theorem, it means that we give up Partition Tolerance. The problem is that in most cases, you really can't assume that your network won't be partitioned. If that happens (and it happens quite frequently) and you chose the reject divergent updates mode, you can no longer accept writes, rendering you unavailable. To avoid this problem, there is a different model, of allowing divergent writes and letting the client resolve the conflict when the partition is resolved and the conflict is detected. We discuss exactly this problem in detail in Chapter 8, Replication.
Common Usages
Key/value stores shine when you need to access the data by key. User related data, such as session or shopping cart information, is ideal, because we always know what the user id is. Another common usage is to store pre-computed data based on the primary key. For example, we may want to store all the information about a product (including related products, reviews, etc.) in a key/value store based on the product SKU. That allows us to query all the relevant data about a product in an O(1) manner. Because key based queries are practically free, by structuring our data access along keys, we can get a significant performance benefit by structuring our applications to fit that need. It turns out that there is quite a lot that you can do with just a key/value store.
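To make the "shard the key space" option described under Scaling Up concrete, here is a minimal C# sketch (the IKeyValueNode interface and its Put/Get methods are hypothetical, used only to illustrate the routing idea): the client hashes the key, picks a server, and every operation on that key always lands on the same machine.

using System;
using System.Collections.Generic;

// A node is just something that can put and get opaque blobs by key.
public interface IKeyValueNode
{
    void Put(string key, byte[] value);
    byte[] Get(string key);
}

public class ShardedKeyValueStore
{
    private readonly IList<IKeyValueNode> nodes;

    public ShardedKeyValueStore(IList<IKeyValueNode> nodes)
    {
        this.nodes = nodes;
    }

    // A key always maps to the same node, so Put and Get stay O(1)
    // no matter how many machines are in the cluster.
    private IKeyValueNode NodeFor(string key)
    {
        int hash = Math.Abs(key.GetHashCode());
        return nodes[hash % nodes.Count];
    }

    public void Put(string key, byte[] value)
    {
        NodeFor(key).Put(key, value);
    }

    public byte[] Get(string key)
    {
        return NodeFor(key).Get(key);
    }
}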
Amazon's shopping cart runs on a key/value store (Amazon Dynamo), so I think you can surmise that this is a highly scalable technique.
The Amazon Dynamo paper is one of the best resources on the topic that one can ask for. Rhino DHT is a scalable, redundant, zero config, key/value store on the .NET platform. Just remember, if you need to do things more complex than just accessing a bucket of bits using a key, you probably need to look at something else, and the logical next step in the chain is the Document Database.
We can put this document in the database under the key ayende. We can also get the document back by using the key ayende. A document database is schema free; you don't have to define your schema ahead of time and adhere to it. This allows us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don't really think about it. It does not, however, support relations. Each document is standalone. It can refer to other documents by storing their keys, but there is nothing to enforce relational integrity. The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren't limited to just querying by key. By storing information in a form that the database can understand, we can ask the server to do things for us, such as querying. The following HTTP request will find all documents where the name equals ayende:
GET /indexes/dynamic?query=name:ayende
Because the document database understands the format of the data, it can answer queries like that. Being able to perform queries is just one advantage of the database being able to understand the data; it also allows:
- Projecting the document data into another form.
- Running aggregations over a set of documents.
- Doing partial updates (patching a document).
From my point of view, though, the major benefit is that you are dealing with documents. There is little or no impedance mismatch between objects and documents. That means that storing data in the document database is usually significantly easier than when using an RDBMS for most non trivial scenarios. It is usually quite painful to design a good physical data model for an RDBMS, because the way the data is laid out in the database and the way that we think about it in our application are drastically different. Moreover, an RDBMS has this little thing called Schemas. And modifying a schema can be a painful thing indeed, especially if you have to do it in production and on multiple nodes. The schema-less nature of a document database means that we don't have to worry about the shape of the data we are using; we can just serialize things into and out of the database. It helps that the commonly used format (JSON) is both human readable and easily managed by tools. A document database doesn't support relations, which means that each document is independent. That makes it much easier to shard the database than it would be in a relational database, because we don't need to either store all relations on the same shard or support distributed joins. I like to think about document databases as a natural candidate for Domain Driven Design applications. When using a relational database, we are instructed to think in terms of Aggregates and always go through an aggregate. The problem with that is that it tends to produce very bad performance in many instances, as we need to traverse the aggregate associations, or specialized knowledge in each context. With a document database, aggregates are quite natural, and highly performant; they are just the same document, after all. The standard modeling technique for a document database is to think in terms of aggregates, in fact. We discuss this in depth in the next chapter.
Concurrency
In most document stores, concurrency is only applicable on a single document, and it is usually offered as optimistic writes. For document databases that also have replication support, we have to deal with the same potential conflicts that arise when using an eventually consistent key/value store, and we resolve them in much the same way: by letting the client decide how to merge all the conflicting versions. We discuss this in more detail in Chapter 8, Replication.
Queries
Unlike a key/value store, a document database lets us query on the contents of the documents themselves, not just by key. Those queries are usually served by indexes or views defined over the documents, such as the dynamic query we saw above.
Transactions
Most document databases will offer you transaction support for a single document. RavenDB supports multi document (and multi node) transactions, but even so, it isn't recommended for common use, because of the potential for issues when using distributed transactions.
Schema
Document databases don't have a schema per-se; you can store any sort of document inside them. The only limitation is that the document must be in a format that the database understands (usually JSON). Note, however, that while document databases allow an arbitrary schema for documents, for practical purposes, indexes (or views) in a document
database do allow you to treat some part of the data in a more formal way. We discuss indexes in detail in Chapter 4 - RavenDB Indexes.
Scaling Up
The common approach for scaling a document store is sharding. Since each document is independent, document databases lend themselves easily to sharding. Usually sharding is combined with replication support to handle failover in case of node failure, but that is about as complex as it gets. We discuss sharding strategies for RavenDB in Chapter 7 - Scaling RavenDB.
Common Usages
Document databases are usually used to store entities (more accurately, aggregates). There is very little effort involved in turning an object graph into a document, and vice versa. And aggregates play very well with both document databases and Domain Driven Design principles. Examples of the type of data that would be stored in a document database include blog posts and discussion threads, product catalogs, orders and similar entities.
Graph Databases
Think about a graph database as a document database with a special type of documents: relations. A common example would be a social network, such as the one shown in figure 1.1.
Figure 1.1: Figure 1.1 - An example of nodes in a graph database
There are four documents and three relations in this example. Relations in a graph database are more than just a pointer. A relation can be unidirectional or bidirectional, but more importantly, a relation is typed. I may be associated to you in several ways; you may be a client, family or my alter ego. And the relation itself can carry information. In the case of the relation document in figure 1.1 above, we simply record the type of the association and the degree of closeness. And that is about it, mostly. Once you think about graph databases as document databases with a special document type, you are pretty much done. Except that graph databases have one additional quality that makes them very useful.
They allow you to perform graph operations. The most basic graph operation is traversal. For example, let us say that I want to know which of my friends are in town, so I can go and have a drink. That is pretty easy to do, right? But what about indirect friends? Using a graph database, I can define the following query:
new GraphDatabaseQuery
{
    SourceNode = ayende,
    MaxDepth = 3,
    RelationsToFollow = new[] { "As Known As", "Family", "Friend", "Romantic", "Ex" },
    Where = node => node.Location == ayende.Location,
    SearchOrder = SearchOrder.BreadthFirst
}.Execute();
I can execute more complex queries, filtering on the relation properties, considering weights, etc. Graph databases are commonly used to solve network problems. In fact, most social networking sites use some form of a graph database to do things like "You might know...". Because graph databases are intentionally designed to make sure that graph traversal is cheap, they also provide other operations that tend to be very expensive without it. For example, finding the Shortest Path between two nodes. That turns out to be frequently useful when you want to do things like: "Who can recommend me to this company's CTO so they would hire me?" One problem with scaling graph databases is that it is very hard to find an independent sub graph, which means that it is very hard to shard graph databases. There are several efforts currently in academia to solve this problem, but I am not aware of any reliable solution as of yet.
Column - A column is a tuple of name, value and timestamp (I'll ignore the timestamp and treat it as a key/value pair from now on). It is important to understand that schema design in a CFDB is of utmost importance; if you don't build your schema right, you literally can't get the data out. A CFDB usually offers one of two forms of queries: by key or by key range. This makes sense, since a CFDB is meant to be distributed, and the key determines where the actual physical data will be located. This is because the data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending or descending). The sort order, unlike in a relational database, isn't affected by the columns' values, but by the column names. Let's assume that in the Users column family, in the row with the key @ayende, we have the column named name set to Ayende Rahien and the column named location set to Israel. The CFDB will physically sort them like this in the Users column family file:
@ayende/location = "Israel" @ayende/name = "Ayende Rahien"
This is because the column name location is lower than the column name name. If we had a super column involved, for example in the Friends column family, and the user @ayende had two friends, they would be physically stored like this in the Friends column family file:
@ayende/friends/arava= 945 @ayende/friends/rose = 14
This property is quite important to understanding how things work in a CFDB. Let us imagine the Twitter model as our example. We need to store users and tweets. We define three column families:
- Users - sorted by UTF8
- Tweets - sorted by Sequential Guid
- UsersTweets - super column family, sorted by Sequential Guid
Let us create the user (a note about the notation: I am using named parameters to denote column name & value here. The key parameter is the row key, and the column family is Users):
cfdb.Users.Insert(key: "@ayende", name: "Ayende Rahien", location: "Israel", profession: "Wizard");
You can see a visualization of how this row looks in figure 1.2. Note that this doesn't look at all like how we would typically visualize a row in a relational database.
Figure 1.2: Figure 1.2 - A representation of a row in a Column Family Database
Now let us create a tweet:
Figure 1.3: Figure 1.3 - A representation of two tweets in a Column Family Database
var firstTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(key: firstTweetKey, application: "TweetDeck", text: "Err, is this on?", private: true);

var secondTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(key: secondTweetKey, app: "Twhirl", version: "1.2", text: "Well, I guess this is my mandatory hello world");
Those values are visualized in figure 1.3. There are several things to notice in the figure:
- The actual key value doesn't matter, but it does matter that it is sequential, because that will allow us to sort by it later.
- Both rows have different data columns on them, because we don't have a schema for the column family.
- We don't have any way to associate a user with a tweet.
That last point bears some talking about. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational database would allow us to query the tweets by the user id, letting us get the user's tweets. A CFDB doesn't give us this option; there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene). Instead, the only thing that a CFDB gives us is a query by key. In order to answer that question, we need to create a secondary index, which is where the UsersTweets column family comes into play:
cfdb.UsersTweets.Insert(key: "@ayende", timeline: { SequentialGuid.Create(): firstTweetKey });
cfdb.UsersTweets.Insert(key: "@ayende", timeline: { SequentialGuid.Create(): secondTweetKey });
Figure 1.4 visualizes how it looks in the database. We insert into the UsersTweets column family, into the row with the key @ayende, into the super column timeline, two columns. The name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid, and the value of each column is the key of a row in the Tweets column family.
Figure 1.4: Figure 1.4 - A representation of a secondary index, connecting users & tweets, in a Column Family Database
Note: Couldn't we create a super column in the Users column family to store the relationship? We could, except that a column family can contain either columns or super columns; it cannot contain both.
In order to get the tweets for a user, we need to execute:
var tweetIds = cfdb.UsersTweets.Get("@ayende")
                   .FetchSuperColumnValues("timeline");
var tweets = cfdb.Tweets.Get(tweetIds);
Note: There isn't such an API for .NET (at least, not that I am aware of); I created this sample to show a point, not to demonstrate a real API.
In essence, we execute two queries: the first on the UsersTweets column family, requesting the columns & values in the timeline super column in the row keyed @ayende; we then execute another query against the Tweets column family to get the actual tweets. This sort of behavior is pretty common in NoSQL data stores. It is called a secondary index, a way to quickly access the data by key based on another entity/row/document value. This is one example of how the need to query for tweets by user has affected the data that we store. If we didn't create this secondary index, we would have no possible way to answer a question such as "show me the last 25 tweets from @ayende". Because the data is sorted by the column name, and because we chose to sort in descending order, we get the last 25 tweets for this user. What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy; all I need to do is to query the Tweets column family for tweets, ordering them by descending key order.
Why is a column family database so limiting?
You might have noticed how many times I noted differences between an RDBMS and a CFDB. I think that it is the CFDB that is the hardest to understand at first, since it is so close on the surface to the relational model. But it seems to suffer from so many limitations. No joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, SQLite or Access gives me more than that. Why is it so limited? The answer is quite simple. A CFDB is designed to run on a large number of machines, and store a huge amount of information. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC, will fall over and die very rapidly on the size of data and queries that a typical CFDB handles easily. Remember that a CFDB is really all about removing abstractions. A CFDB is what happens when you take a relational database, strip away everything that makes it hard to run on a cluster, and see what happens. The reason that CFDBs don't provide joins is that joins require you to be able to scan the entire data set. That requires either some place that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that. CFDBs don't provide a way to query by column or value because that would necessitate either an index of the entire data set (or just of a single column family), which is again not practical, or running the query on all machines, which is not possible. By limiting queries to just by key, a CFDB ensures that it knows exactly what node a query can run on. It means that each query is running on a small set of data, making queries much cheaper. It requires a drastically different mode of thinking, and while I don't have practical experience with CFDBs, I would imagine that migrations using them are... unpleasant affairs, but they are one of the ways to get really high scalability out of your data storage.
- Trying to import a relational mindset into a NoSQL data store.
- Trying to use a single data store for all things, including things that it really isn't suitable for.
Selecting a data storage strategy isn't a one time decision. In a single application, you may use a key/value store to hold session information, a graph database to serve social queries, and a document database to hold your entities. I view the "we use a single data store" mentality in the same way that I view people who want to write all their code in a single file. You certainly can do that, but that is going to be... awkward. I try to break things down based on the data access patterns expected from each section in the application. If in the product catalog I am always dealing with queries by the product SKU, and speed is of the essence, it makes a lot of sense to use a key/value store. But that doesn't mean that orders should be stored there; for orders I need a lot more flexibility, so I put them in a document database, etc.
Proven & Mature
NoSQL solutions aren't applicable just at the high end of scaling. NoSQL solutions provide a lot of benefits even for applications that will never need to scale beyond a single machine. Document databases drastically simplify things like user defined fields, or working with Aggregates. The performance of a NoSQL solution can often exceed a comparable RDBMS solution, because the NoSQL solution will usually focus on a very small subset of the feature set that an RDBMS has.
1.4 Summary
In this chapter, we have gone over the reasons for the NoSQL movement, born out of the need to handle ever increasing data, users and complexity. We have explored the various NoSQL options and discussed their benefits and disadvantages, as well as what scenarios they are suitable for. We looked at how to select an appropriate data store for specific purposes and finally discussed how the emergence of robust NoSQL solutions has improved our options even when we aren't required to scale, because we have more data storage models to select from when it comes time to design our application. In the next chapter, we will leave the general topic of NoSQL and begin to focus specifically on document databases, the topic of this book. So turn the page to the next chapter, and let us explore...
CHAPTER TWO

GROKKING DOCUMENT DATABASES
Documents in a document database don't have to follow any schema and can have any form that they wish. This makes them an excellent choice when you want to use them for sparse models (models where most of the properties are usually empty) or for dynamic models (customized data models, user generated data, etc.). In addition to that, documents are not flat. Take a look at the document shown in listing 2.1; we represent a lot of data in a single document here. And we represent that internally. Unlike an RDBMS, a document is not just a set of keys and values; it can contain nested values, lists and arbitrarily complex data structures. This makes it much easier to work with documents compared to working with an RDBMS, because the complexity of your objects doesn't translate into a large number of calls to the database to load the entire object, as it would in an RDBMS. In order to build the document shown in listing 2.1 in an RDBMS system, we would probably have to query at least 3 tables, and it is pretty common to have to touch more than five tables to get all the needed information for a single logical entity. With document databases, all that information is already present in the document, and there is no need to do anything special. You just need to load the document, and the data is there. The downside here is that while you can embed information inside the document very easily, it is harder to reference information in other documents. In an RDBMS, you can simply join to another table and get the data from the database that way. But document databases do not have the concept of joins (RavenDB has something similar called includes, which is discussed in Chapter 6, but it isn't really a parallel). As you can imagine, these two changes lead to drastically different methods of modeling data in a document database...
{ // Listing 2.1 - A blog post document can contain complex data
  "Title": "Modeling in Docs DBs",
  "Content": "Modeling data in...",
  "Tags": [ "Raven", "DocDB", "Modeling" ],
  "Comments": [
    { "Content": "Great post...", "Author": "John" },
    { "Content": "Sed ut...", "Author": "Nosh" }
  ],
  "RelatedPosts": [
    { "Id": "posts/1234", "Title": "Doc Db Modeling Anti Patterns" },
    { "Id": "posts/4321", "Title": "Common Access Patterns" }
  ]
}
Using a document database in this fashion allows us to get everything that we need to display the page shown above
in a single request.
The Aggregate Root for an Order will contain Order Lines, but an Order Line will not contain a Product. Instead, it contains a denormalized reference to the product. The product is another aggregate, obviously. And here we have a tension between competing needs. On the one hand, we want to be able to process the order document without having to reference another document (since this results in much better overall performance). But on the other hand, in order to do so, we have to duplicate the product (and customer, for that matter) information inside the order document. We will discuss this problem in the next section.
Note: What to denormalize?
While I think that denormalizing some data to the referring document is a good idea, you should carefully consider what sort of data you are going to denormalize. For example, in the customer case, we denormalized the customer name. That is a good choice, because a name is going to change rarely. But the LastLogin property is going to change all the time. In this case, we don't really care about the customer login time, but even if we did, we still wouldn't be able to denormalize the LastLogin property. Like in most cases, the answer to "What to denormalize?" is "it depends!" It depends on:
- How often does the value change?
- How important is the value to the referring document?
Luckily, in practice it turns out that it is rare that you would want to have access to a rapidly changing value from another document. But if you do, it might be a good idea to relax the "documents are independent" rule. In a relational database, we can usually rely on Lazy Loading to help us, but most document database client APIs will not support lazy loading. This is an intentional, explicit, and by design decision. Instead of relying on lazy loading, the expected usage is to hold the associated document key, as well as the information from it needed to process the current document. If you really need the full associated document, you need to explicitly load it 2. The reasoning behind this is simple: we want to make it just a tad harder to reference data in other documents. It is very common when using an Object Relational Mapper to do something like orderLine.Product.Name, which will lazily load the Product entity. That makes sense when you are living in a relational world, but a document database is not relational.
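As a concrete illustration of the order / product discussion above, a minimal sketch of such a model might look like this (the class and property names are illustrative, not taken from a real sample application); the order line carries a denormalized reference: the product id plus the few product properties the order actually needs.

using System.Collections.Generic;

public class Order
{
    public string Id { get; set; }            // e.g. "orders/1"
    public string CustomerId { get; set; }    // reference to the customer document
    public string CustomerName { get; set; }  // denormalized: changes rarely
    public List<OrderLine> Lines { get; set; }
}

public class OrderLine
{
    public string ProductId { get; set; }     // reference to the product document
    public string ProductName { get; set; }   // denormalized copy
    public decimal Price { get; set; }        // price at the time of purchase
    public int Quantity { get; set; }
}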
doing some additional amount of work on writes (rare) but significantly reduces the amount of work that you do for reads (common). That is a good tradeoff, in my eyes.
2.4 Summary
In this chapter we have explored what exactly a document database is, not only in the sense of what sort of data is stored inside a document database, but how we work with it. Documents can be arbitrarily complex, which allows us to hold an entire Aggregate Root inside a single document. And because documents are independent, they should not require referencing another document in order to process requests regarding that document. Therefore, we model documents to include denormalized references to other documents. Those denormalized references copy the document id, as well as whatever properties are important to the referring document. We can handle denormalized updates in one of two ways:
- Keep the old data - useful for invoices, orders, etc., where the document references a point in time.
- Update all copies of the data - useful when the data represents the current value.
RavenDB includes explicit support to make handling denormalized updates easier, which we discuss in TODO.
Finally, we discussed the role of indexes in a document database, and introduced the dreaded map/reduce indexes. Indexes are used to give the database a way to extract a schema out of a set of documents. And now, enough with discussing high level concepts; we are going to go ahead and start working with RavenDB directly and discover why it is the best document database 3 that you have seen.
CHAPTER THREE

CHAPTER 3 - BASIC OPERATIONS
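The first step is to create a document store, the client-side entry point to a RavenDB server. A minimal sketch (the URL shown is simply the default local address):

// creating a document store against a local RavenDB server
var store = new DocumentStore { Url = "http://localhost:8080" };
store.Initialize();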
This will create a document store that connects to a RavenDB server running on port 8080 on the local machine.
Note: It is possible to run RavenDB in an embedded mode inside an application, in-process, by utilising an EmbeddableDocumentStore; more information about this can be found in the documentation.
Once a document store has been created, the next step is to create a session against that document store that will allow us to perform basic CRUD operations within a Unit of Work. It is important to note that when invoking any operations against this store, no changes will be made to the underlying document database until the SaveChanges method has been called, as in listing 3.2:
// listing 3.2 - saving changes using the session API
using (IDocumentSession session = store.OpenSession())
{
    // Operations against session

    session.SaveChanges();
}
In this context, the session can be thought of as managing all changes internally, and SaveChanges can be thought of as committing all those changes to the RavenDB server. Any operations submitted in a SaveChanges call will be committed atomically (that is to say, either they all succeed, or they all fail). It will be assumed in the following examples that a valid store has been created, that the calls are being made within the context of a valid session, and that SaveChanges is being called safely at the end of that session's lifetime.
Note: If you don't call SaveChanges, all the changes made in that session will be discarded!
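The Blog class used in the examples below is a plain C# class; a minimal sketch consistent with how it is used in listing 3.4 (the property names follow that listing, the rest is an assumption):

public class Blog
{
    public string Id { get; set; }        // left blank; RavenDB assigns it on save
    public string Title { get; set; }
    public string Category { get; set; }
    public string Content { get; set; }
    public BlogComment[] Comments { get; set; }
}

public class BlogComment
{
    public string Title { get; set; }
    public string Content { get; set; }
}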
We can now create a new instance of the Blog class, as shown in listing 3.4:
// listing 3.4 - creating a new instance of the Blog class
Blog blog = new Blog()
{
    Title = "Hello RavenDB",
    Category = "RavenDB",
    Content = "This is a blog about RavenDB",
    Comments = new BlogComment[]
    {
        new BlogComment() { Title = "Unrealistic", Content = "This example is unrealistic" },
        new BlogComment() { Title = "Nice", Content = "This example is nice" }
    }
};
Note: Neither the class itself nor instantiating it requires anything from RavenDB, either in the form of attributes or in the form of special factories. The RavenDB Client API works with POCO (Plain Old CLR Objects) objects.
Persisting this entire object graph involves using Store and then SaveChanges, as seen in listing 3.5:
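// listing 3.5 (sketch, reconstructed from the surrounding description) -
// Store registers the new document with the session; SaveChanges sends it to the server
session.Store(blog);
session.SaveChanges();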
The SaveChanges call will produce the HTTP communication shown in listing 3.6. Note that the Store method operates purely in memory, and only the call to SaveChanges communicates with the server:
POST /bulk_docs HTTP/1.1 Accept-Encoding: deflate,gzip Content-Type: application/json; charset=utf-8 Host: 127.0.0.1:8080 Content-Length: 378 Expect: 100-continue
[{"Key":"blogs/1","Etag":null,"Method":"PUT","Document":{"Title":"Hello RavenDB","Category":"RavenDB"
HTTP/1.1 200 OK Content-Type: application/json; charset=utf-8 Server: Microsoft-HTTPAPI/2.0 Date: Tue, 16 Nov 2010 20:37:00 GMT Content-Length: 205
[{"Etag": "00000000-0000-0100-0000-000000000002","Method":"PUT","Key":"blogs/1","Metadata":{"Raven-En
Two things of note at this point:
- We left the Id property of Blog blank, and it is this property that will be used as the primary key for this document.
- The entire object graph is serialized and persisted as a single document, not as a set of distinct objects.
Note: If there is no Id property on a document, RavenDB will allocate an Id, but it will be retrievable only by calling session.Advanced.GetDocumentId. In other words, having an Id is entirely optional, but as it is generally more useful to have this information available, most of your documents should have an Id property.
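Loading an existing document and editing it is done through the same session (a minimal sketch; the document id follows the blogs/1 convention from the earlier listings, and the new title matches the HTTP message shown below):

var existingBlog = session.Load<Blog>("blogs/1");
existingBlog.Title = "Some new title";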
Flushing those changes to the document store is achieved in the usual way:
session.SaveChanges();
You don't have to call an Update method, or track any changes yourself; RavenDB will do all of that for you. The above example will result in the following HTTP message:
POST /bulk_docs HTTP/1.1 Accept-Encoding: deflate,gzip Content-Type: application/json; charset=utf-8 Host: 127.0.0.1:8080 Content-Length: 501 Expect: 100-continue
[{"Key":"blogs/1","Etag":null,"Method":"PUT","Document":{"Title":"Some new title","Category":"RavenDB HTTP/1.1 200 OK Content-Type: application/json; charset=utf-8 Server: Microsoft-HTTPAPI/2.0 Date: Tue, 16 Nov 2010 20:39:41 GMT Content-Length: 280
[{"Etag": "00000000-0000-0100-0000-000000000003","Method":"PUT","Key":"blogs/1","Metadata":{"Content-
Note: The entire document is sent to the server with the Id set to the existing document's value; this means that the existing document will be replaced in the document store with the new one. Whilst patching operations are possible with RavenDB, the client API by default will always just replace the entire document in its entirety.
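Deleting a document is just as explicit. A minimal sketch, assuming the blog instance is already loaded in the current session:

session.Delete(blog);
session.SaveChanges();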
Once again, this results in an HTTP communication, as shown in listing 3.8:
POST /bulk_docs HTTP/1.1 Accept-Encoding: deflate,gzip Content-Type: application/json; charset=utf-8 Host: 127.0.0.1:8081 Content-Length: 49 Expect: 100-continue [{"Key":"blogs/1","Etag":null,"Method":"DELETE"}]
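The transaction support described below builds on System.Transactions; a sketch of wrapping a session's work in an ambient transaction might look like this (illustrative only; TransactionScope lives in the System.Transactions assembly):

using (var tx = new TransactionScope())
{
    using (IDocumentSession session = store.OpenSession())
    {
        // operations against the session
        session.SaveChanges();
    }
    tx.Complete();
}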
If at any point any of this code fails, none of the changes will be enacted against the RavenDB document store. The implementation details of this are not important, although it is possible to see that RavenDB does indeed send a transaction Id along with all of the HTTP requests under this transaction scope, as shown in listing 3.9:
POST /bulk_docs HTTP/1.1 Raven-Transaction-Information: 975ee0bf-cac9-4b8e-ba29-377de722f037, 00:01:00 Accept-Encoding: deflate,gzip Content-Type: application/json; charset=utf-8 Host: 127.0.0.1:8081 Content-Length: 300 Expect: 100-continue
A call to commit involves a separate call to another HTTP endpoint with that transaction id:
POST /transaction/commit?tx=975ee0bf-cac9-4b8e-ba29-377de722f037 HTTP/1.1 Accept-Encoding: deflate,gzip Content-Type: application/json; charset=utf-8 Host: 127.0.0.1:8081 Content-Length: 0
Note: While RavenDB supports System.Transactions, it is not recommended that this be used as an ordinary part of application workflow, as it relates to the partition tolerance aspect of our beloved CAP theorem.
For example, we might wish to ask for all the blog entries that belong to a certain category like so:
var results = from blog in session.Query<Blog>()
              where blog.Category == "RavenDB"
              select blog;
That Just Works(tm) and gives us all the blogs with a category of RavenDB. The HTTP communication for this operation is shown in listing 3.10:
GET /indexes/dynamic/Blogs?query=Category:RavenDB&start=0&pageSize=128 HTTP/1.1 Accept-Encoding: deflate,gzip Content-Type: application/json; charset=utf-8 Host: 127.0.0.1:8081
The important part of this query is that we are querying the Blogs collection, for the property Category with the value RavenDB. We will also notice that a page size of 128 was passed along, although none was specified, which leads us onto the next topic: Safe by default.
3.8 Summary
In this chapter we learned how to utilise the session as a basic Unit of Work in RavenDB, and saw a basic example of querying in action, as well as examples of how these remote calls look as raw HTTP calls across the wire. We also saw how RavenDB attempts to be safe by default, limiting the capacity of common mistakes to cause damage in your application. In the next chapter, we will look more closely at the query API, and how to utilise it within our applications to good effect.
CHAPTER FOUR

CHAPTER 4 - RAVENDB INDEXES

In this chapter:
- How indexes are stored
- Indexes are stale
- Simple indexes
- RavenDB Collections
- Projections
- Lucene Fields options: Storage, Indexing, Sorting, Analyzing
CHAPTER FIVE

ADVANCED RAVENDB INDEXING
How, then, does this work? When you make a query to RavenDB, the RavenDB query optimizer will find the appropriate index for the query. But what happens when there isn't any matching index? RavenDB will create a temporary index for us, just for this query. We discussed how and why RavenDB does this in the previous chapter. But we still don't have a good idea what an index is, right? Listing 5.2 shows the index that RavenDB generates on the server:
// listing 5.2 - the auto generated index created by RavenDB
from blogItem in docs.Blogs
select new { blogItem.Title }
That looks like a Linq query, and not any sort of index that I have seen before, so what is going on? Well, the answer is that what you see is the index definition function, which is what RavenDB uses to extract the information to be indexed from the documents. Let us assume that the server contains the documents in listing 5.3:
// listing 5.3 - sample documents
{ // blogs/1234
  "Title": "Ayende @ Rahien",
  "Author": "...",
  "StartedAt": "..."
}
{ // blogs/1235
  "Title": "Ravens Flight",
  "Author": "...",
  "StartedAt": "..."
}
The output of the indexing function in listing 5.2 over the documents in listing 5.3 is shown in listing 5.4:
// listing 5.4 - the output of the indexing function over the sample documents
{ "Title": "Ayende @ Rahien", "__document_id": "blogs/1234" }
{ "Title": "Ravens Flight", "__document_id": "blogs/1235" }
Those values are then stored inside a persistent index, which gives us the ability to perform low cost queries over the values stored in the index.
Note: Where did the __document_id in listing 5.4 come from? It doesn't appear in the indexing function in listing 5.2. That value is inserted by RavenDB into all the results of the indexing function; this is one of a few values that are automatically inserted by RavenDB (another is the __reduce_key value, which serves the same function, but for Map/Reduce indexes).
After RavenDB ensures that an index exists, it can query the index. In chapter 4, we discussed the way RavenDB builds indexes in the background and the notion of staleness. Because RavenDB doesn't have to wait for the indexing process to complete, it is able to produce answers without having to wait, even if there are concurrent indexing tasks running. All of that together brings us to the reason why RavenDB queries are so fast. All the queries run on precomputed indexes, and those queries never have to wait. The index storage format is a Lucene index, which is discussed in greater detail in Chapter TODO.
5.3 Collation
RavenDB supports sorting in a culture sensitive manner, but you have to explicitly tell it about that. The index definition in listing 5.5 shows how we can sort the shopping cart by the customer name using Swedish sorting rules:
// listing 5.5 - index definition sorting carts by customer name using Swedish sorting rules
public class ShoppingCarts_ByCustomerName_InSwedish : AbstractIndexCreationTask<ShoppingCart>
{
    public ShoppingCarts_ByCustomerName_InSwedish()
    {
        Map = carts => from cart in carts
                       select new { cart.Customer.Name };
        Analyzers.Add(x => x.Customer.Name,
            typeof(Raven.Database.Indexing.Collation.Cultures.SvCollationAnalyzer).AssemblyQualifiedName);
    }
}
Querying the ShoppingCarts_ByCustomerName_InSwedish index will now return results sorted by the customer name using the Swedish sorting rules. The same approach is available for most languages; all you need to do is change the two letter language code prefix for the CollationAnalyzer.
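Creating and querying the index follows the usual client API pattern; a sketch (assuming the index class above has been deployed to the server, and that ShoppingCart exposes the Customer.Name property used in the Map):

// deploy all indexes found in the assembly, including the one above
IndexCreation.CreateIndexes(typeof(ShoppingCarts_ByCustomerName_InSwedish).Assembly, store);

using (var session = store.OpenSession())
{
    var carts = session.Query<ShoppingCart, ShoppingCarts_ByCustomerName_InSwedish>()
        .OrderBy(cart => cart.Customer.Name)
        .ToList();
}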
CHAPTER SIX

MAP / REDUCE INDEXES
In order to answer the question of how many comments each blog has, we have to aggregate the data from multiple documents. Using Linq, we can do so very easily, as shown in listing 5.2:
// Listing 5.2 - A Linq query to aggregate the comment count per blog
from post in docs.Posts
group post by post.BlogId into g
select new
{
    BlogId = g.Key,
    CommentCount = g.Sum(x => x.Comments.Length)
};
You have probably seen similar code scores of times. Unfortunately, this code has a small, almost insignificant problem: it assumes that it can access all the data. But what happens if the data is too big to fit in memory? Or even too big to fit on a single machine? This is where map/reduce comes into play. Map/reduce is merely meant to deal with "group by" on a massive scale, but the concept is still the same old concept. It is just that we need to break the group by into multiple steps that can each run on a different machine.
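The first step, the map, is just a projection over the posts. A sketch of the query that the next paragraph refers to as listing 5.3, reconstructed from the shape of its output in listing 5.4:

// listing 5.3 (sketch) - the map: project each post down to only the data we need
from post in docs.Posts
select new { BlogId = post.BlogId, CommentCount = post.Comments.Length };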
How is this useful? Well, now instead of having to deal with full blown post documents, we can deal with a much smaller projection. We have minimized the amount of data that we have to work with, and if we feed a set of documents through the linq query in listing 5.3, we are going to get the results we can see in listing 5.4:
// Listing 5.4 - The results of the query in listing 5.3
{ BlogId: "blogs/1234", CommentCount: 4 }
{ BlogId: "blogs/9313", CommentCount: 2 }
{ BlogId: "blogs/1234", CommentCount: 3 }
{ BlogId: "blogs/2394", CommentCount: 1 }
{ BlogId: "blogs/9313", CommentCount: 0 }
The size difference between the results of the query and the original documents is pretty big, as you can see. Now we need to take a look at the second part of this task: grouping the results and performing the actual aggregation. We do that in listing 5.5:
// Listing 5.5 - Grouping the results to find the final result
from result in results
group result by result.BlogId into g
select new
{
    BlogId = g.Key,
    CommentCount = g.Sum(x => x.CommentCount)
}
So far, so good. The query in listing 5.5 seems reasonable; it is very similar to the one we have seen in listing 5.2, after all. What we need to do now is to feed the results in listing 5.4 through that query. We can see the result of that in listing 5.6:
{ BlogId: "blogs/1234", CommentCount: 7 } { BlogId: "blogs/9313", CommentCount: 2 } { BlogId: "blogs/2394", CommentCount: 1 }
So far, we haven't done anything special. But we have actually done something that might surprise you. We have defined a pair of map/reduce functions. Listing 5.3 is the map function. Listing 5.5 is the reduce function. I know what you are thinking, I am explaining to you things that you already know, but bear with me - the fat lady hasn't sung yet, after all. I didn't complicate the query in 5.2 by breaking it apart into two separate queries for no reason. Let us assume that we have another data set, on another machine. This data set is shown in listing 5.7:
{ BlogId: "blogs/1234", CommentCount: 5 }
{ BlogId: "blogs/7269", CommentCount: 2 }
{ BlogId: "blogs/1234", CommentCount: 4 }
{ BlogId: "blogs/9313", CommentCount: 2 }
We want to get the answer for all blogs, not just the posts on a particular machine (the query in listing 5.2 would do just fine for that). What we are going to do is to run all the data in listing 5.7 through the query in listing 5.5, giving us the data in listing 5.8:
{ BlogId: "blogs/1234", CommentCount: 9 } { BlogId: "blogs/7269", CommentCount: 2 } { BlogId: "blogs/9313", CommentCount: 2 }
The fun part starts now, because the reduce function can be applied recursively. What we are going to do now is to execute the query in listing 5.5 on the data in both listing 5.6 and 5.8 (we are simply going to combine the two datasets and execute the query on all of the data at once). This gives us the results in listing 5.9:
{ BlogId: "blogs/1234", CommentCount: 16 }
{ BlogId: "blogs/7269", CommentCount: 2 }
{ BlogId: "blogs/9313", CommentCount: 4 }
{ BlogId: "blogs/2394", CommentCount: 1 }
And that is the whole secret of map/reduce, honestly. We were able to take two data sets from two distinct nodes and, by applying the map/reduce algorithm, derive the final result for an aggregation that spanned machine boundaries.
Most of the complexity that is attached to map/reduce exists because writing the executor is a non trivial task, but conceptually, the idea is very simple.
If we attempt to send the output of the reduce function in listing 5.11 back into the same function, we are going to get an error, because there is no CommentCount in the output of the reduce function. The map and reduce functions must be pure functions. A pure function is a function that:
- Given the same input will return the same output, i.e. map(doc) == map(doc), for any doc. What this means is that you cannot rely on any external input, only the input that was passed in.
- Has no side effects when evaluated. What this means in practice is that you can't make any external calls from the map/reduce functions.
That isn't an onerous requirement, since you usually don't have a way to make external calls anyway. As I mentioned, for the most part, we don't really need to pay close attention to those rules; Linq queries tend to follow them anyway.
The reduce key in listing 5.12 is the value of result.BlogId. RavenDB will use that to optimize the values it will pass to the reduce function (the actual group by is usually done by RavenDB, and not by the linq query). This results in a much cheaper cost of indexing for map/reduce indexes, compared to running a single query with a group by on all documents with the same reduce key.
Note: RavenDB doesn't implement re-reduce (yet). This is an implementation detail that should only concern you if you are interested in reducing a very large number of results with the same reduce key. That is because RavenDB currently implements reduce as a single operation, and will pass all the documents with the same reduce key to a single reduce function call. This may cause performance issues if you have very large numbers of results with the same reduce key, where "very large" is in the tens or hundreds of thousands of results for each reduce key. Fixing this limitation is already on the roadmap.
We are almost done with the theory, I promise. We just have to deal with one tiny detail before we can start looking at some real code.
This value is stored in the index itself, and it is loaded directly from there. This means that you don't touch any documents when you query a map/reduce index. All the work is handled by RavenDB in the background. And like simple indexes, it is possible to query a map/reduce index and get a stale result. We handle this in exactly the same way we handle stale results with simple indexes. And now, after much ado, let us get to coding and write our first map/reduce index.
Before we start writing the map/reduce index, I usually find it useful to write the full linq query that does the same calculation. That tends to make it easier to write the index later on. The linq query is shown in listing 5.14:
// listing 5.14 - a linq query to calculate the count of products across all shopping carts
from shoppingCart in docs.ShoppingCarts
from product in shoppingCart.Products
group product by product.Id into g
select new
{
    ProductId = g.Key,
    Count = g.Sum(x => x.Count)
}
The next step is to break the query in listing 5.14 into multiple steps, and create an index out of it. We will use the AbstractIndexCreationTask class to do that, as shown in listing 5.15:
// listing 5.15 - The products count index
public class Products_ByCountInShoppingCart : AbstractIndexCreationTask<ShoppingCart, ProductByCountProjection>
{
    public Products_ByCountInShoppingCart()
    {
        Map = carts => from cart in carts
                       from product in cart.Products
                       select new
                       {
                           ProductId = product.Id,
                           Count = product.Count
                       };

        Reduce = results => from result in results
                            group result by result.ProductId into g
                            select new
                            {
                                ProductId = g.Key,
                                Count = g.Sum(x => x.Count)
                            };
    }
}
The Map part of the index will extract a count for each product from all the shopping carts, exactly as in the blog example that we examined previously. The only interesting part is that we dig deeper into the shopping cart, and project the values from one of its collections. And the Reduce part will aggregate the results by the product id into the final answer. You might have noticed that we have added a new twist to the AbstractIndexCreationTask, in the form of an additional generic parameter. The second parameter, ProductByCountProjection, is the output of the Map function and is both the input and output of the Reduce function.
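Querying the index then looks something like this (a sketch of the Query call that the next paragraph describes):

using (var session = store.OpenSession())
{
    var productCounts = session.Query<ProductByCountProjection, Products_ByCountInShoppingCart>()
        .OrderByDescending(x => x.Count)
        .Take(10)
        .ToList();
}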
The first parameter of the Query method is the type of the results, while the second parameter indicates which index we should query. Unlike standard indexes (also called simple indexes or map-only indexes), the result of a map / reduce index is always a projection and never the original document. We usually use the same type for the results that we use when creating the index using the AbstractIndexCreationTask class. Now that we know how to create and query indexes, we can move on to an important topic: _where_ should we use them?
6.10 Summary
In this chapter, we have learned what map/reduce is: a way to break a calculation over data into discrete units that can be processed independently (and even on separate machines). Afterward, we looked at how map/reduce is implemented inside RavenDB and how best to take advantage of that. We finished with a sample of creating and querying a map/reduce index, which allowed us to calculate how many items were sold for each product. Because of the way map/reduce works in RavenDB, querying the index is very cheap, and we can use it as part of the product page to show, for example, how popular a particular product is. Finally, we discussed where we want to use map/reduce indexes. The obvious answer is that we want to use them whenever we have a reason to aggregate data, but we have to be aware that, unlike group by queries in a relational database, map/reduce queries in RavenDB don't allow arbitrary grouping (which rules them out for use as part of a generic reporting service). On the other hand, they do provide very fast responses for fixed queries, such as the ones typically used in dashboard / homepage scenarios. Their low cost of querying makes it efficient to use them even in the high traffic locations of your applications. In the next chapter, we will discuss Live Projections, Includes and other advanced indexing options. In the chapter after that, we will go over various querying scenarios and see how we can solve them with RavenDB.
CHAPTER
SEVEN
CHAPTER
EIGHT
CHAPTER 8 - REPLICATION
In this chapter:
- Master -> Slave
- Failover
- Master <-> Master
- Conflicts
CHAPTER
NINE
CHAPTER 9 - AUTHORIZATION
In this chapter:
- Role based authorization
- Document based authorization
- Tag based authorization
CHAPTER
TEN
CHAPTER
ELEVEN
CHAPTER
TWELVE
CHAPTER
THIRTEEN
CHAPTER 13 - ADMINISTRATION
In this chapter:
- Backup
- Installation
- Deployment
- Standalone service
- IIS
- Shared Hosting
- Optimizing configuration
CHAPTER
FOURTEEN
14.1 Lucene
The RavenDB indexing mechanism is implemented using the open-source Lucene.NET library (http://lucene.apache.org/lucene.net/), a C# port of the original Java Lucene library (http://lucene.apache.org/).
Lucene is a full-text search library that makes it easy to add search functionality to an application. It does so by adding content to a full-text index. It then searches this index and returns results ranked either by relevance to the query or by an arbitrary field, such as a document's last modified date. The best way of thinking about the indexes in RavenDB is to imagine them as a database's materialized views. RavenDB executes your indexes in the background, and the results are written to disk. This means that when we perform a query, we have to do very little work. This is how RavenDB manages to achieve its near instantaneous replies to your queries: it doesn't have to think, all the processing has already been done. Lucene comes with an advanced set of query options (http://lucene.apache.org/java/2_4_0/queryparsersyntax.html) that allow RavenDB to support the following (which is still just a partial list):
- full text search
- partial string matching
- range queries (date, integer, float etc.)
- spatial searches
- auto-complete or spell-checking
- faceted searches
Note: In this chapter all code samples will be written using the Lucene syntax, as we are looking at Lucene itself. However, the recommended way of using RavenDB is via the LINQ API; see Chapter 3 for more information about this.

Take a look at figure 4.3 to see how a simple index is stored. By default, RavenDB does the following when indexing a text field:
- Analyzes the field using a lower case analyzer (Matt Warren -> matt warren)
- Stores the ID of the document that the terms come from

The field is converted to lower case so that case sensitivity isn't an issue in basic queries. The ID of the document is stored so that RavenDB can pull the document out of the data store after it has performed the Lucene query. Remember, RavenDB only uses Lucene to store the indexed data, not the actual documents themselves. This reduces the total size of the index. However, things are slightly more complex when dealing with numbers. The rules that RavenDB follows here are:
- If the value is null, create a single field with the supplied name and the unanalyzed value NULL_VALUE
- If the value is a string or was set to be not analyzed, create a single field with the supplied name and value
- If the value is a date, create a single field with millisecond precision and the supplied name
- If the value is numeric (int, long, double, decimal, or float), create two fields:
  * one using the field name, containing the numeric value as an unanalyzed string - useful for direct queries
  * one using the field name + _Range, containing the numeric value in a form that allows range queries

The last item is important. To enable RavenDB to perform range queries (i.e. Age > 4, Age < 40, etc.) with Lucene, it needs to store the numerical data in a format that is suitable for this. But it also stores the value in its original format so that direct queries (such as exact matches) can be performed. Take a look at figure 4.4 to see how a complex index is stored.
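To see why the extra _Range field matters in practice, here is a minimal sketch (the User class and the Users/ByAge index name are illustrative assumptions): when querying through the LINQ API, RavenDB translates the numeric comparisons into a range query against the Age_Range field, so the special encoding never has to be written by hand.

// The numeric comparisons below are translated into a Lucene range query
// against the Age_Range companion field described above.
using (var session = documentStore.OpenSession())
{
    var users = session.Query<User>("Users/ByAge")
        .Where(u => u.Age > 4 && u.Age < 40)
        .ToList();
}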
An IndexDefinition lets you control these options on a per-field basis - which analyzer to use, how a field is sorted, and whether its value is stored in the index:

IndexDefinition indexAnalysed = new IndexDefinition
{
    Map = "docs.Users.Select(doc => new { Name = doc.Name })",
    Analyzers = { { "Name", typeof(SimpleAnalyzer).FullName } },
    SortOptions = { { "Age", SortOptions.Double } },
    Stores = { { "Name", FieldStorage.Yes } }
};
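Such a definition has to be registered with the server before it can be queried. A minimal sketch, assuming a documentStore has already been initialized and using an illustrative index name:

// Registers (or updates) the index definition on the server.
documentStore.DatabaseCommands.PutIndex("Users/ByName", indexAnalysed);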
14.3.1 Analyzing
By default RavenDB uses a lower case analyzer, which simply converts a string into a lower case version. But this isn't enough if you'd like to run a full-text search on your documents. To achieve this you need to tokenize, or analyze, the fields you are indexing. For instance, given a field that contains the text "The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.", the different analyzers produce the following tokens:
- Keyword Analyzer keeps the entire stream as a single token: [The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.]
- Whitespace Analyzer tokenizes on white space only (note the punctuation at the end of dog): [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog,] [bob@hotmail.com] [123432.]
- Stop Analyzer strips out common English words (such as and, at etc.), tokenizes letters only and converts everything to lower case: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob] [hotmail] [com]
- Simple Analyzer only tokenizes letters and makes all tokens lower case: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [bob] [hotmail] [com]
- Standard Analyzer is a simple tokenizer that uses a stop list of common English words, and also handles numbers and email addresses correctly: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]

You would then perform the same analysis on the text you want to match. For instance, quick brown -> [quick] [brown], and Lucene would find all the documents with both of these terms in them.
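As a quick illustration (the Users/ByName index and User class are assumptions, not from the text), a raw Lucene-syntax query against an analyzed field might look like the sketch below; the query terms are matched against the analyzed tokens of the field.

// Matches documents whose analyzed Name field contains both terms.
var results = session.Advanced.LuceneQuery<User>("Users/ByName")
    .Where("Name:(quick AND brown)")
    .ToList();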
14.3.2 Sorting
When Lucene sorts values, it does so against an encoded version of the number (a binary representation). This means that in certain situations it can get the sort order wrong, for instance when sorting double and float values, or short/int/long values. To get around this issue you can explicitly set the sort option of the field, as shown in the IndexDefinition above.
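For example (assumed index and class names), once the sort option is set, ordering a query by the numeric field behaves as expected; without it, Lucene may compare the encoded string form and could, for instance, place 10 before 2.

// OrderBy on the numeric Age field relies on the Double sort option above.
var usersByAge = session.Query<User>("Users/ByAge")
    .OrderBy(u => u.Age)
    .ToList();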
14.3.3 Storage
For completeness, RavenDB allows you to control whether or not a field is stored in the index. This could be useful if you wanted to pull back data directly from the Lucene index, but there are very few scenarios where this is useful. It's far better to let RavenDB handle this for you, so specifying this option isn't really recommended. Note that RavenDB allows you to use projections directly from the document, without needing to store the fields in the index, which means that there usually aren't good reasons to store field data.
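A minimal sketch of such a projection (assumed index and class names); RavenDB reads the Name value from the document itself, so the field does not need to be stored in the Lucene index.

// Projects only the Name property; the value comes from the document store,
// not from a stored Lucene field.
var names = session.Query<User>("Users/ByName")
    .Select(u => new { u.Name })
    .ToList();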
14.3.4 Indexing
Indexing allows you to control how you can search on an index. For the most part, you can just leave this to RavenDB's defaults. This option, along with the storage option, is there for completeness more than anything else, and is only going to be useful for expert usage, if that.
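For reference, a minimal sketch of setting the option on an IndexDefinition (field name and map are illustrative assumptions); NotAnalyzed keeps the whole value as a single term, which suits exact-match-only fields.

// Marks the Name field as NotAnalyzed, so its value is indexed as one term.
var notAnalysed = new IndexDefinition
{
    Map = "docs.Users.Select(doc => new { Name = doc.Name })",
    Indexes = { { "Name", FieldIndexing.NotAnalyzed } }
};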
CHAPTER
FIFTEEN
SUMMARY
In this book..
CHAPTER
SIXTEEN