NoSql Databases – Document Stores
Chaitrali Kulkarni
In the last article, we analyzed the concept of key-value stores.
Moving ahead, let us now go into the details of document stores, also called document databases.
What is a document store/ document database?
In the introductory article on NoSql databases, we discussed that relational models are not well suited to domain-driven or aggregate-oriented data modelling.
An aggregate is a collection of data that we interact with as a unit. These units of data or aggregates form the boundaries for ACID operations with the database.
Aggregates make it easier for the database to manage data storage over clusters, since each unit of data can reside on any machine and, when retrieved from the database, brings all its related data along with it.
Document stores allow the user to organize the data around the aggregates that the domain demands.
As the name suggests, document stores organize data as documents. There are no tables, rows or columns. All the information related to one entity or aggregate unit is stored in one document. Thus when we query for that entity, we get all the information, ideally without requiring multiple references or joins.
For example, consider a database storing information about all employees. As we saw in previous articles, a traditional RDBMS is not convenient when the information has many multivalued attributes such as address, phone number, details of previous employment, skill set, the project the person is working on, etc. A key-value store would store all this information against the employee id as a blob.
A document database is, at its core, a key-value store with one major exception. Instead of storing just any blob, a document database requires that the data be stored in a format that the database can understand. The format can be XML, JSON, binary JSON (BSON, as used by MongoDB), or just about anything else, as long as the database can understand it. So a document database would store the employee information as one document, along with the metadata, enabling searches based on the fields of the entity. Thus document stores are suitable for loosely structured or semi-structured data.
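The difference between an opaque blob and a document the database understands can be illustrated with a small in-memory sketch. The two dictionaries standing in for the stores, the `Addresses` field and the `find_by_state` helper are assumptions made for illustration, not the API of any particular product.

```python
import json

# Hypothetical employee record; field names loosely follow the article's example.
employee = {
    "id": 1,
    "FirstName": "Rahul",
    "LastName": "Banerjee",
    "Addresses": [
        {"Type": "Permanent", "City": "Kolkata", "State": "W. Bengal"},
        {"Type": "Local", "City": "Mumbai", "State": "Maharashtra"},
    ],
}

# Key-value style: the value is an opaque blob of bytes.
# The store can return it by key but knows nothing about its contents.
kv_store = {employee["id"]: json.dumps(employee).encode("utf-8")}

# Document style: the store keeps a structure it understands,
# so it can look inside each document when answering a query.
doc_store = {employee["id"]: employee}


def find_by_state(store, state):
    """Return ids of documents with at least one address in the given state."""
    return [
        doc_id
        for doc_id, doc in store.items()
        if any(addr.get("State") == state for addr in doc.get("Addresses", []))
    ]


print(find_by_state(doc_store, "Maharashtra"))
```

Running the same search against `kv_store` would first require the application to fetch and decode every blob itself, which is exactly the work a document store takes over.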
Document stores do not impose a rigid schema on the data to be stored. Each document contains keys and values that the database understands.
Unlike relational databases, document stores are not strongly typed. Document databases get their type information from the data itself, normally store all related information together, and allow every instance of data to be different from any other. This makes them more flexible in dealing with change and optional values, lets documents map more easily onto program objects, and often reduces database size.
Thus document databases are schema-agnostic, but because they are also structure-aware they can enforce a schema when needed. This approach of having a schema only when you need it is a huge change from the relational world, where managing changes to schema design might take months of work.
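The idea of "schema when you need it" can be sketched as a collection that accepts anything by default and validates only when a schema is supplied. The `Collection` class and the `(type, required)` schema format below are hypothetical and chosen for brevity; real document stores each have their own validation syntax.

```python
def validate(doc, schema):
    """Return a list of problems; an empty list means the document conforms.

    `schema` maps a field name to a (type, required) pair. This format is
    an assumption for illustration, not any product's validation syntax.
    """
    problems = []
    for field, (expected_type, required) in schema.items():
        if field not in doc:
            if required:
                problems.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


class Collection:
    """Toy collection: schema-free unless a schema is given."""

    def __init__(self, schema=None):
        self.schema = schema
        self.docs = {}

    def insert(self, doc_id, doc):
        # Validation runs only when the collection was created with a schema.
        if self.schema is not None:
            problems = validate(doc, self.schema)
            if problems:
                raise ValueError("; ".join(problems))
        self.docs[doc_id] = doc


# No schema: any document shape is accepted.
notes = Collection()
notes.insert(1, {"anything": "goes"})

# With a schema: FirstName is required, DoB is optional.
employees = Collection(schema={"FirstName": (str, True), "DoB": (str, False)})
employees.insert(1, {"FirstName": "Rahul"})
```

Inserting `{"DoB": "1-1-1980"}` into `employees` would raise a `ValueError`, while the same document would be accepted by `notes`.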
How does a document store work?
Document databases are typically organized as a collection of documents, where one domain entity forms one document. For example, in the employee database, an employee record is one document. It can be arranged as follows:
<Employee>
  <id> 1 </id>
  <FirstName> Rahul </FirstName>
  <LastName> Banerjee </LastName>
  <DoB> 1-1-1980 </DoB>
  <Address>
    <Type> Permanent </Type>
    <Name> Shantiniketan </Name>
    <Street> 41 M G Road </Street>
    <City> Kolkata </City>
    <State> W. Bengal </State>
    <Country> India </Country>
    <Pin> 456178 </Pin>
  </Address>
  <Address>
    <Type> Local </Type>
    <Name> Marine Bay </Name>
    <Street> 41 Nehru Road </Street>
    <City> Mumbai </City>
    <State> Maharashtra </State>
    <Country> India </Country>
    <Pin> 314567 </Pin>
  </Address>
  <Phone> (+91)9211420420 </Phone>
  <Phone> (+9122)41238567 </Phone>
  <Project>
    <Name> Project1 </Name>
    <Role> Project Lead </Role>
    <StartDate> 14-4-2010 </StartDate>
    <EndDate> 30-8-2013 </EndDate>
  </Project>
</Employee>
In this case, the document includes both the data and the metadata explaining each of the fields. A key-value store receiving this document would simply store it. A document store, by contrast, understands that contact documents may have a state field, allowing the programmer to “find all the <Contact>s where the <State> is ‘Maharashtra’”.
Additionally, the programmer can provide hints based on the document type or the fields within it. For instance, they may tell the engine to place all <Employee> documents in a separate physical store, or to build an index on the state field for performance reasons. All of this can be done in a key-value store as well; the difference lies primarily in how much programming effort is needed to add these indexes and other features. In a document store this is normally almost entirely automated.
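The index hint described above can be pictured as a secondary index that the store maintains on every write. The following is a minimal in-memory sketch; `IndexedStore` and its methods are hypothetical names, and it assumes indexed fields are top-level and hold hashable values.

```python
from collections import defaultdict


class IndexedStore:
    """Toy document store with optional secondary indexes on top-level fields."""

    def __init__(self):
        self.docs = {}     # doc id -> document
        self.indexes = {}  # field name -> {field value -> set of doc ids}

    def create_index(self, field):
        # Build the index from the documents already stored.
        index = defaultdict(set)
        for doc_id, doc in self.docs.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self.indexes[field] = index

    def put(self, doc_id, doc):
        # Keep every index in step with the write, so the caller never has to.
        old = self.docs.get(doc_id)
        for field, index in self.indexes.items():
            if old is not None and field in old:
                index[old[field]].discard(doc_id)
            if field in doc:
                index[doc[field]].add(doc_id)
        self.docs[doc_id] = doc

    def find(self, field, value):
        # Use the index when one exists, otherwise scan every document.
        if field in self.indexes:
            return sorted(self.indexes[field].get(value, set()))
        return sorted(i for i, d in self.docs.items() if d.get(field) == value)


store = IndexedStore()
store.put(1, {"Name": "Rahul", "State": "Maharashtra"})
store.put(2, {"Name": "Asha", "State": "W. Bengal"})
store.create_index("State")  # the "hint" from the programmer
store.put(3, {"Name": "Meera", "State": "Maharashtra"})
print(store.find("State", "Maharashtra"))
```

In a plain key-value store the application would have to write and maintain the equivalent of `create_index` and the bookkeeping in `put` on its own.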
In this document, a number of fields are either repeated or split out into separate containers, as in the case of <Address> and <Project>. With similar hints, the document store will allow searches such as “find all my <Project>s with a <Role> of Project Lead which are ongoing”. This is not unlike other database systems in terms of retrieval. What is different is that these fields are defined by the metadata in the document itself. There is no need to pre-define them in the database.
The major advantage of the document-oriented concept is that every document in the database can have a different format. It is very common for a particular type of document to differ from instance to instance: one <contact> might have a local address while another might not; one might have a single address while another might have several. More widely, the database can store completely unrelated documents, yet still understand that parts of the data within them are the same. For instance, one could construct a query that looks for any document that has the <state> ‘Maharashtra’. It does not matter whether the documents are <contact>s or <business>es, or whether the <state> is within an <address> or not.
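A query that ignores document type and nesting depth can be sketched as a recursive search over each document. The sample documents and the `has_field_value` helper are made up for illustration; a real document store would typically answer such a query from an index rather than by walking every document.

```python
def has_field_value(node, field, value):
    """Return True if `field` equals `value` anywhere inside nested dicts/lists."""
    if isinstance(node, dict):
        if node.get(field) == value:
            return True
        return any(has_field_value(child, field, value) for child in node.values())
    if isinstance(node, list):
        return any(has_field_value(child, field, value) for child in node)
    return False


# Hypothetical, differently shaped documents living in the same store.
documents = [
    {"kind": "contact", "name": "Rahul", "address": {"state": "Maharashtra"}},
    {"kind": "business", "name": "Acme", "state": "Maharashtra"},
    {"kind": "contact", "name": "Asha"},  # no address at all
]

matches = [d["name"] for d in documents if has_field_value(d, "state", "Maharashtra")]
print(matches)
```

The contact with a nested address and the business with a top-level state both match, while the contact without any address is simply skipped.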
In addition to making it easier to handle different types of data, the metadata also allows the document format to be changed at any time without affecting the existing records. If one wishes to add an <image> field to a contact book application some time in the future, one simply adds it. Existing documents will still work fine without being changed in the database; they simply won’t have an image. Fields can be added at any time, anywhere, with no need to change the physical storage.
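Adding a field later without touching existing records might look as follows in application code. The contact data, the `image` file name and the `image_of` helper are hypothetical.

```python
# Documents written before the application knew about images.
contacts = {
    1: {"name": "Rahul"},
    2: {"name": "Asha"},
}

# A newer document carries the extra field; nothing else is migrated.
contacts[3] = {"name": "Meera", "image": "meera.png"}


def image_of(doc, default=None):
    """Read the optional field, tolerating older documents that lack it."""
    return doc.get("image", default)


for doc_id, doc in contacts.items():
    print(doc_id, doc["name"], image_of(doc, default="(no image)"))
```

The only change needed is on the reading side, which has to tolerate the field being absent.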
Thus document stores typically work best with semi-structured data or document formats that carry some metadata, e.g. XML, JSON, BSON etc.
However, some document stores also provide functionality to index PDF or TeX documents.
The data in the documents can be queried based on different fields, and documents can also be retrieved by their URL.
Now let us analyze document stores in terms of different DBMS parameters.
- Concurrency : Most document stores use optimistic concurrency control, relying on techniques such as document timestamps and versioning; some also use multi-granularity locking for conflict management.
- Queries : Since document stores usually hold semi-structured data, most of them support querying by different fields/keys within a document as well as by document id. E.g. in the aforementioned example, the details of any employee can be queried using the employee id. In addition, queries like “get all contacts where the address type is permanent and state is W. Bengal” are also possible. Besides the index on document ids, other fields in the document can be indexed too. That is, in document stores the entire data is normally “queriable” as well as “indexable”. Most document stores are also eventually consistent for queries: indexes are refreshed intermittently, or reads are served from secondaries, while by default reads and writes are performed on the primary.
- Transactions : Most document stores support transactions at the level of a single document only, with Elasticsearch being an exception. An operation on a single document is always atomic. However, the way atomicity is handled across multiple documents differs from one document store to another. E.g. RavenDB supports atomic transactions spanning multiple documents, whereas in MongoDB (before version 4.0 introduced multi-document transactions) an operation touching several documents is atomic for each individual document but not as a whole.
- Schema : As mentioned above, document stores do not require a rigid schema. They are schema-agnostic, but can enforce a schema whenever needed.
- Scaling up : Typically, document stores scale out by sharding a collection of documents across different nodes and replicating the shards. The index is also divided across the shards for better query performance.
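The optimistic concurrency control mentioned under Concurrency can be sketched with a per-document version number: a write succeeds only if the writer has seen the latest version. `VersionedStore` and `VersionConflict` are hypothetical names, and this single-process sketch leaves out the locking or compare-and-swap step a real server would need to make the check itself safe.

```python
class VersionConflict(Exception):
    """Raised when a writer's view of a document is out of date."""


class VersionedStore:
    """Toy store that tracks a version number for each document."""

    def __init__(self):
        self._docs = {}  # doc id -> (version, document)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, doc, expected_version=0):
        # A document that does not exist yet is treated as version 0.
        current_version = self._docs.get(doc_id, (0, None))[0]
        if current_version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(doc))
        return new_version


store = VersionedStore()
store.put("emp-1", {"Role": "Developer"})

# Two clients read the same version of the document...
version_a, _ = store.get("emp-1")
version_b, _ = store.get("emp-1")

# ...the first write wins, and the second is rejected instead of silently overwriting.
store.put("emp-1", {"Role": "Project Lead"}, expected_version=version_a)
try:
    store.put("emp-1", {"Role": "Manager"}, expected_version=version_b)
except VersionConflict as exc:
    print("conflict:", exc)
```

The rejected client would normally re-read the document, reapply its change and retry, which is cheap as long as conflicts are rare.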
Document stores offer a schema agnostic, fast, reliable NoSql solution for semi-structured, multivalued data with limited ACID compliance.