19 Feb 2016 NoSQL in the cloud: A scalable alternative to Relational Databases
With the current move to cloud computing, the need to scale applications presents itself as a challenge for storing data. If you are using a traditional relational database, you may find yourself working on a complex policy for distributing your database load across multiple database instances. Such a solution often introduces a lot of problems of its own and rarely scales elastically.
As an alternative you could consider a cloud-based NoSQL database. Over the past few weeks I have been analysing a few such offerings, each of which promises to scale as your application grows, without requiring you to think about how you might distribute the data and load.
Specifically I have been looking at Amazon’s DynamoDB, Google’s Cloud Datastore and Cloud BigTable. I chose to take a look into these 3 databases because we have existing applications running in Google and Amazon’s clouds and I can see the advantage these databases can offer. In this post I’ll report on what I’ve learnt.
Consistency, Availability & Partition Tolerance
Firstly – and most importantly – it’s necessary to understand that distributed NoSQL databases achieve high scalability in comparison to a traditional RDBMS by making some important tradeoffs.
A good starting-place for thinking about this is the CAP Theorem, which states that a distributed database can – at most – provide two of the following: Consistency, Availability and Partition Tolerance. We define each of these as follows:
- Consistency: All nodes contain the same data
- Availability: Every request should receive a response
- Partition Tolerance: The system keeps working even when network failures cut off communication between some of its nodes
Eventually Consistent Operations
All three NoSQL databases I looked at provide Availability and Partition Tolerance for eventually-consistent operations. In most cases these two properties will suffice.
For example, if a user posts to a social media website and it takes a second or two for everyone else's requests to pick up the change, it's not usually an issue.
This happens because a write is acknowledged after reaching only some of the nodes; the data is then replicated to the remaining nodes asynchronously, usually within a second. A read, meanwhile, is served from a single node, which may not yet have the latest copy.
Strongly Consistent Operations
All three databases also provide strongly consistent operations which guarantee that the latest version of the data will always be returned.
DynamoDB achieves this by ensuring that writes are written out to a majority of nodes before a success result is returned. Reads are done in a similar way: a result is not returned until the record has been read from more than half of the nodes, which ensures the result is the latest copy of the record.
All this comes at the expense of availability: if a node is unreachable shortly after a write, the database may be unable to verify that it is returning consistent data, and the request can fail. Google achieves this behaviour in a slightly different way, using a locking mechanism whereby a read cannot be completed on a node until that node has the latest copy of the data. This model is required when you need to guarantee the consistency of your data; for example, you would not want a financial transaction being calculated on an old version of the data.
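To make this concrete, here is a minimal sketch using the AWS SDK for Java against a hypothetical users table keyed by username (the same shape as the example later in this post; the table and key names are assumptions for illustration). DynamoDB reads are eventually consistent unless you explicitly ask for a consistent read:

AmazonDynamoDBClient client = new AmazonDynamoDBClient();

// Eventually consistent by default: cheap and highly available, but the
// result may lag a very recent write.
GetItemRequest read = new GetItemRequest()
        .withTableName("users")
        .withKey(Collections.singletonMap("username", new AttributeValue("alice")));

// Strongly consistent: guaranteed to reflect all acknowledged writes, at the
// cost of more read capacity and reduced availability during node failures.
GetItemResult latest = client.getItem(read.withConsistentRead(true));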
OK, now that we’ve got the hard stuff out of the way, let’s move onto some of the more practical questions that might come up when using a cloud-based database.
Local Development
Having a database in the cloud is cool, but how does it work if you’ve got a team of developers, each of whom needs to run their own copy of the database locally? Fortunately, DynamoDB, BigTable and Cloud Datastore all have the option of downloading and running a local development server. All three local development environments are really easy to download and get started with. They are designed to provide you with an interface that matches the production environment.
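As a rough sketch, once DynamoDB Local is running (it listens on port 8000 by default), pointing the Java SDK at it is just a matter of overriding the endpoint. The credentials here are dummies, since the local server doesn't validate them:

// Point the client at the local development server instead of AWS.
AmazonDynamoDBClient client = new AmazonDynamoDBClient(
        new BasicAWSCredentials("local", "local"));
client.setEndpoint("http://localhost:8000");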
Java Object Mapping
If you are going to be using Java to develop your application, you might be used to using frameworks like Hibernate or JPA to automatically map RDBMS rows to objects. How does this work with NoSQL databases?
DynamoDB provides an intuitive way of mapping Java classes to objects in DynamoDB tables. You simply annotate the class with the table it maps to, and then annotate the getters for its instance variables with the appropriate key and attribute annotations.
@DynamoDBTable(tableName = "users")
public class User {

    private String username;
    private String email;

    @DynamoDBHashKey(attributeName = "username")
    public String getUsername() { return username; }
    public void setUsername(String username) { this.username = username; }

    @DynamoDBAttribute(attributeName = "email")
    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
}
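With the class annotated, saving and loading objects goes through DynamoDBMapper. A minimal sketch (client configuration omitted, and the values are purely illustrative):

DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient());

User user = new User();
user.setUsername("alice");
user.setEmail("alice@example.com");

mapper.save(user);                               // write the item to the users table
User loaded = mapper.load(User.class, "alice");  // look it up again by hash key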
Querying
An important thing to understand about all of these NoSQL databases is that they don’t provide a full-blown query language.
Instead, you need to use their APIs and SDKs to access the database. By using simple query and scan operations you can retrieve zero or more records from a given table. Since each of the three databases I looked at provide a slightly different way of indexing the tables, the range of features in this space varies.
DynamoDB, for example, provides multiple secondary indexes, meaning you can efficiently query on any indexed attribute. This is not a feature in either of Google's NoSQL offerings.
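For instance, you could expose the email attribute through a global secondary index and query on it. This is only a sketch, reusing the mapper from the earlier example: the index name is made up, and the index itself would have to be defined on the table.

// Added to the User class's email getter: marks email as the hash key of a GSI.
@DynamoDBIndexHashKey(globalSecondaryIndexName = "email-index", attributeName = "email")
public String getEmail() { return email; }

// Querying the index with DynamoDBMapper; GSIs only support eventually
// consistent reads, hence withConsistentRead(false).
User key = new User();
key.setEmail("alice@example.com");
List<User> matches = mapper.query(User.class,
        new DynamoDBQueryExpression<User>()
                .withIndexName("email-index")
                .withHashKeyValues(key)
                .withConsistentRead(false));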
Furthermore, unlike SQL databases, none of these NoSQL databases give you a means of doing table joins, or even having foreign keys. Instead, this is something that your application has to manage itself.
That said, one of the main advantages of NoSQL, in my opinion, is that there is no fixed schema. As your needs change you can dynamically add new attributes to records in your table.
For example, using Java and DynamoDB, you can do the following, which will return a list of users that have the same username as a given user:
// Properties.getMapper() is a helper that returns a configured DynamoDBMapper.
User user = new User();
user.setUsername(username);
DynamoDBQueryExpression<User> queryExpression =
        new DynamoDBQueryExpression<User>().withHashKeyValues(user);
List<User> itemList = Properties.getMapper().query(User.class, queryExpression);
Distributed Database Design
The main benefit of NoSQL databases is their ability to scale, and to do so in an almost seamless way. But, just like a SQL database, a poorly-designed NoSQL database can give you slow query response times. This is why you need to consider your database design carefully.
To balance load, distributed databases spread the stored data across multiple nodes. The flip-side of this is that if frequently-accessed data sits on a small subset of nodes, you will not be making full use of the available capacity.
Consequently, you need to be careful about which attributes you choose as keys and indexes. Ideally you want to spread your load across the whole table rather than repeatedly hitting a small portion of your data.
A good design starts with picking a hash key that is likely to be accessed uniformly. For example, if you have a users table and choose the username as the hash key, load is likely to be distributed across all of the nodes, because individual users tend to be accessed more or less at random.
In contrast, it would be a poor design to use the date as the hash key for a table of forum posts: most requests will be for the current day's records, so the node or nodes holding those records will be a small subset of the cluster. This kind of hot-spotting can cause your requests to be throttled or hang.
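To illustrate, here is a hypothetical forum-posts mapping in the same style as the User class above. The table and attribute names are made up; the point is that posts are partitioned by something that spreads evenly (the forum), while the date is only a range key used for ordering within a partition:

@DynamoDBTable(tableName = "posts")
public class Post {

    private String forumId;
    private String postedAt;

    // Hash key: requests are spread across forums rather than piling onto
    // whichever node holds "today".
    @DynamoDBHashKey(attributeName = "forum_id")
    public String getForumId() { return forumId; }
    public void setForumId(String forumId) { this.forumId = forumId; }

    // Range key: the date still supports ordering and range queries within a forum.
    @DynamoDBRangeKey(attributeName = "posted_at")
    public String getPostedAt() { return postedAt; }
    public void setPostedAt(String postedAt) { this.postedAt = postedAt; }
}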
Pricing
Since Google does not have a data centre in Australia, I will only be looking at pricing in the US.
DynamoDB is priced on storage and provisioned read/write capacity. In the Oregon region, storage is charged at $0.25 per GB/month, and throughput at $0.0065 per hour for every 10 units of write capacity, with the same hourly price covering 50 units of read capacity.
Google Cloud Datastore has a similar pricing model, with storage priced at $0.18 per GB of data per month and $0.06 per 100,000 read operations; write operations are charged at the same rate. Datastore also has a free quota of 50,000 read and 50,000 write operations per day. Since Datastore is a beta product it currently has a limit of 100 million operations per day, though you can request that the limit be increased.
The pricing model for Google Bigtable is significantly different. With Bigtable you are charged at a rate of $0.65 per instance per hour, with a minimum of 3 instances required; some basic arithmetic ($0.65 × 3 instances × roughly 730 hours in a month) gives us a starting price for Bigtable of about $1,423.50 per month. You are then charged $0.17 per GB/month for SSD-backed storage. A cheaper HDD-backed option, priced at $0.026 per GB/month, is yet to be released.
Finally you are charged for external network usage. This ranges between 8 and 23 cents per GB of traffic depending on the location and amount of data transferred. Traffic to other Google Cloud Platform services in the same region/zone is free.
Conclusion
So which database should you use?
To be honest, I was not able to differentiate between them much from a performance perspective. In both local and cloud-based tests I found that, for queries returning a single record, response time was in the 10-30ms range, regardless of which platform I used.
Consequently, which you choose will probably depend on what cloud stack you are currently using and/or which billing model suits you best.
Since Bigtable has a minimum number of instances, you may want to avoid it if you don't foresee much traffic, as you will likely be paying for capacity that you don't need. However, Bigtable can be cheaper if your load is significant enough to keep 3 or more instances busy.
In contrast, Cloud Datastore will work well if your application has a small amount of traffic, yet will continue to work well if it scales up to have a large number of requests. Since you only pay for the requests that are made there will be no wasted capacity.
If you are on the Amazon stack, the only managed NoSQL option is DynamoDB. It is more expensive than Datastore and requires that you provision capacity in advance, meaning that you will be charged for that capacity even if you don't use it. Amazon has announced that it will release an auto-scaling feature some time in the future; if you can't wait for that, you can write your own scripts to adjust provisioned capacity. It is important to get provisioned capacity right: if you use more capacity than is provisioned, you will start to deplete your accumulated burst credits, and once those credits are depleted your requests will be throttled and may even fail.
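If you do go down the path of scripting your own scaling, the adjustment itself is a single API call. A rough sketch with the AWS SDK for Java, with purely illustrative table name and capacity figures:

// Raise the provisioned throughput on the users table ahead of expected load.
AmazonDynamoDBClient client = new AmazonDynamoDBClient();
client.updateTable(new UpdateTableRequest()
        .withTableName("users")
        .withProvisionedThroughput(new ProvisionedThroughput()
                .withReadCapacityUnits(200L)
                .withWriteCapacityUnits(50L)));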
Whichever option you go with, the prospect of infinitely-scalable, always-available databases is alluring. By understanding the tradeoffs involved and choosing a pricing model that suits you, you may be able to free yourself from the constraints of a traditional RDBMS.