With the huge amount of free formed, almost unstructured data generated thrgouh blogs, social media platforms and what not, the question is how to make sense out of this data. In order to do that, it is important that this data be stored in efficient data models, which would keep the natural features of the data intact, without confining it to a specific structure. This made the database designers look beyond the traditional notion of relational databases.
For the past several decades, database professionals have hung their hats on a single standard supported by all databases: the Structured Query Language (SQL) and relational database model.
However, with the huge amount of various types of data like blog posts, photos, videos, tweets etc being generated every minute, the system size and scale have grown significantly over last few years.
The relational model can not scale up to handle the use-cases like this.
For example –
- Not all data can be represented effectively in form of tables. e.g. Online messaging data, where one needs to keep track of conversations between different users, groups etc, data generated in form of blogs – which needs to be analyzed for the content and author, data used in search engines. Also, the structure of data may change over time, and with RDBMs this becomes difficult to manage – schema migrations are not trivial at large scale.
- In many web-applications like search, online messaging, social networking, transaction support is not necessary.
- Application developers have been frustrated with the impedance mismatch between the relational data structures and the in-memory data structures of the application.
- The rise of the web as a platform also created a vital factor change in data storage as the need to support large volumes of data by running on clusters. Relational databases were not designed to run efficiently on clusters.
Let us elaborate these further.
Classical relation database follow the ACID Rules
A database transaction, must be atomic, consistent, isolated and durable.
Atomic : A transaction is a logical unit of work which must be either completed with all of its data modifications, or none of them is performed.
Consistent : At the end of the transaction, all data must be left in a consistent state.
Isolated : Modifications of data performed by a transaction must be independent of another transaction. Unless this happens, the outcome of a transaction may be erroneous.
Durable : When the transaction is completed, effects of the modifications performed by the transaction must be permanent in the system.
Often these four properties of a transaction is acronymed as ACID.
Scalability is the ability of a system to expand to meet your business needs. For example scaling a web application is all about allowing more people to use your application. You scale a system by upgrading the existing hardware without changing much of the application or by adding extra hardware.
There are two ways of scaling horizontal and vertical scaling :
To scale vertically (or scale up) means to add resources within the same logical unit to increase capacity. For example to add CPUs to an existing server, increase memory in the system or expanding storage by adding hard drive. But this needs to be estimated/ identified at the initial stages of deployment and can become costly.
To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application.
Having to conform with ACID rules, horizontal scaling becomes difficult for RDBMs.
Let us consider the following situation :
The employee database of the organization stores Id, Name, Phone numbers, addresses (local and permanent) etc.
Now phone number and address are the multivalued columns. So in order to have a normalized database- they should be stored in different tables. Now, for a huge organization like IBM which has offices all over the world, the employee database would be considerably large (even within the same country, there are a lot of employees). If the database is to be scaled horizontally, different tables would be on different servers. If we want to find out information about one employee, we would need to join 3 tables (minumum) in a perfectly normalized database. Now join across different servers are extremely efficient, especially when the data volume is huge. If we decide to go for a denormalized database, the consistency issues may arise.
In different situations like this, RDBMs face problems in horizontal scaling.
Transaction support and conforming to ACID
The ACID qualities seem indispensable, and yet they are incompatible with availability and performance in very large systems.
Let us assume that we want to develop a web application for an online book store. Now we need to display Every time someone is in the process of buying a book, we lock part of the database until they finish so that all visitors around the world will see accurate inventory numbers. This will work if we have a small shop, intended to serve only specific genres to limited number of users. But we have something as big as Amazon, this will hamper the performance significantly.
Amazon might instead use cached data and slightly comprise on “C” in ACID as the numbers will be updated eventually. Users would not see not the inventory count at this second, but what it was say an hour ago when the last snapshot was taken. Also, Amazon might violate the “I” in ACID by tolerating a small probability that simultaneous transactions could interfere with each other. For example, two customers might both believe that they just purchased the last copy of a certain book. The company might risk having to apologize to one of the two customers (and maybe compensate them with a gift card) rather than slowing down their site and irritating hundreds of other customers.
Also, sites like amazon need transaction support only when someone buys something and the number of users just searching or browsing through the site is as big as the number of users buying something – if not bigger.
CAP Theorem and BASE Standard
You must understand the CAP theorem when you talk about NoSQL databases or in fact when designing any distributed system. CAP theorem states that there are three basic requirements which exist in a special relation when designing applications for a distributed architecture.
Consistency – This means that the data in the database remains consistent after the execution of an operation. For example after an update operation all clients see the same data.
Availability – This means that the system is always on (service guarantee availability), no downtime.
Partition Tolerance – This means that the system continues to function even the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
There is a theorem put forth by Eric Brewer, which states that in any distributed system we can choose only two of consistency, availability or partition tolerance.
Here is the brief description of three combinations CA, CP, AP :
CA – Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
CP – Some data may not be accessible, but the rest is still consistent/accurate.
AP – System is still available under partitioning, but some of the data returned may be inaccurate.
BASE Standard :
BASE acronym was defined by Eric Brewer which provides an alternative to ACID for distributed systems. A BASE system gives up on consistency.
- Basically Available indicates that the system doesguarantee availability, in terms of the CAP theorem.
- Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
- Eventual consistency indicates that the system will become consistent over time, given that the system doesn’t receive input during that time.
Aggregate Data Models
Relational database modelling is vastly different than the types of data structures that application developers use. Using the data structures as modelled by the developers to solve different problem domains has given rise to movement away from relational modelling and towards aggregate models.
An aggregate is a collection of data that we interact with as a unit. These units of data or aggregates form the boundaries for ACID operations with the database.
Aggregates make it easier for the database to manage data storage over clusters, since the unit of data now could reside on any machine and when retrieved from the database gets all the related data along with it. Aggregate-oriented databases work best when most data interaction is done with the same aggregate, for example when there is need to get an order and all its details, it better to store order as an aggregate object but dealing with these aggregates to get item details on all the orders is not elegant.
Now that we have understood the challenges faced by traditional RDBMs on distributed architecture to handle huge amount of data varying in structure, let us now take a look at what NoSql databases are.
NoSQL means Not Only SQL, implying that when designing a software solution or product, there are more than one storage mechanism that could be used based on the needs.
The solution is designed keeping in mind the following guidelines :
A NoSql solution :
- Does not use relational model and SQL (some NoSql Dbs support SQL queries though).
- Has distributed fault tolerant architecture.
- Does not have a fixed schema thus making changes in structure of data easy to handle.
- Does not support joins as
- joins are expensive operations combining records from multiple tables – making horizontal scaling difficult.
- Joins require strong consistency and fixed schema
As NoSql databases do not have fixed schema and most of them are aggregate oriented, they scale well horizontally providing a distributed fault tolerant architecture.
NoSQL pros (not ordered by importance):
- Mostly open source.
- Horizontal scalability. There’s no need for complex joins and data can be easily sharded and processed in parallel.
- Support for Map/Reduce. This is a simple paradigm that allows for scaling computation on cluster of computing nodes.
- No need to develop fine-grained data model – it saves development time.
- Easy to use.
- Very fast for adding new data and for simple operations/queries.
- No need to make significant changes in code when data structure is modified.
- Ability to store complex data types (for document based solutions) in a single item of storage.
- Still lots of rough edges.
- Possible database administration issues. NoSQL often sacrifices features that are present in SQL solutions “by default” for the sake of performance. For example, one needs to check different data durability modes and journaling in order not to be caught by surprise after a cold restart of the system. Memory consumption is one more important chapter to read up on in the database manual because memory is usually heavily used.
- No indexing support (Some solutions like MongoDB have indexing but it’s not as powerful as in SQL solutions).
- No ACID (Some solutions have just atomicity support on single object level).
- Bad reporting performance.
- Complex consistency models (like eventual consistency). CAP theorem states that it’s not possible to achieve consistency, availability and partitioning tolerance at the same time. NoSQL vendors are trying to make their solutions as fast as possible and consistency is most typical trade-off.
- Absence of standardization. No standard APIs or query language. It means that migration to a solution from different vendor is more costly. Also there are no standard tools (e.g. for reporting)
For a large web-based system with massive amount of data to be handled, NoSql could provide a more durable solution, meeting the high demand for quick response time.
In the next part of this series, we will dive deep into different types of NoSql databases.