Large scale applications achieving high-throughput while storing a ton of randomly accessed data comes naturally to NoSQL databases such as Aerospike and DynamoDB because these databases are designed from the ground up to deliver just that.
In this post, I have outlined a very effective and simple technique of accomplishing high-throughput for applications that have specific access patterns — ones which could lead to “hot” keys situation. For example, where there are only a handful of keys being accessed at a very high rate. Consider the following:
Scenario: Voting application where there are two candidates — Candidate 1 and Candidate 2. For each candidate there needs to be a running total of number of votes being cast by 1000s of voters.
Problem: All voting requests coming in through the application access the same keys and therefore hit the same data partition in order to increment their respective vote count. In other words, for each candidate if you use a single record with a running total that gets updated with every vote, it will be a “hot” key. Therefore, affecting throughput of the application wherein after a certain point, the request queue builds up to an extent where it has direct impact on the vote count being reflected in real-time.
- Create multiple keys for each candidate. For instance, if you want to dilute the number of writes by a factor of 10, then there should be 10 keys for each candidate in a format similar to candidate_1_1, candidate_1_2, … candidate_1_10 and candidate_2_1, candidate_2_2… candidate_2_10. Each of these keys should have a vote count value associated with them.
- Create two additional keys candidate_1 and candidate_2 – these keys should also have a vote count value associated with them as well as timestamp. NOTE: These are the two values you’d report on and display on a dashboard, for instance.
- In your application logic that handles requests to cast a vote, for every request, pick a random number between 1 and 10 and send update vote count database call for key candidate_X_# — Where:** X is the candidate receiving the vote (in this case either 1 or 2) and # is the random number generated between 1 and 10.
- Have a background process running at regular interval that sums vote count values for keys candidate_1_1 thru candidate_1_10 and updates vote count value for key candidate_1 and similarly sums vote count values for keys candidate_2_1 thru candidate_2_10 and updates vote count value for key candidate_2. This process also updates the respective timestamp columns to indicate how “up-to-date” the vote counts are.
Given the above, here’s what the tables/sets might look like:
Splitting keys in this manner not only helps mitigate “hot” keys situation but also enables distribution of the load across nodes in a cluster.