Wednesday, June 22, 2011

High-Replication Migration Lessons

It's been about a month since the migration, so how has it gone, and what did we learn? What has it done to latency, error rates and CPU consumption? Most of all, was it worth the pain?

Lesson 1: Understand how eventual consistency will affect your application.


Our biggest fear was that inconsistencies between master-slave (MS) and high replication (HR) would cause problems with our apps. That fear was borne out when users started reporting that new objects were not showing up after they had created them. Some debugging showed that the new objects did exist, and that we had been bitten by HR's eventual consistency.

The datastore is much more efficient at getting objects by ID than via a query, so I tend to give an object an array of IDs referencing its child objects. When I added a new child, I ran a query to rebuild this ID array, which guaranteed its consistency. With HR, the new child didn't show up in the query results for some seconds, so because I was rebuilding the array immediately after adding the new child, the child was missing from the array. And when another new child was added, the previous one appeared - but the new one didn't. Users hate that sort of thing!

It was easily fixed - get_by_id() works as soon as an object is added, but queries do not. So we dispensed with the query, and simply added the ID to the array when we added a child (and removed it on deletion).
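To make the failure and the fix concrete, here is a minimal sketch in plain Python. It does not use the App Engine API: ToyStore, catch_up() and the helper functions are hypothetical names invented for this illustration, and the "lag" is simulated by holding query results back until catch_up() is called. The only behaviour borrowed from the text above is that a get by ID sees a new object immediately while a query may not.

```python
class ToyStore:
    """Hypothetical in-memory stand-in: gets by ID are immediate, queries lag."""

    def __init__(self):
        self._entities = {}  # id -> entity dict, visible to get_by_id() at once
        self._index = []     # (parent_id, child_id) rows that queries can see
        self._pending = []   # rows written but not yet visible to queries
        self._next_id = 1

    def put(self, entity, parent_id=None):
        entity_id = self._next_id
        self._next_id += 1
        self._entities[entity_id] = entity
        if parent_id is not None:
            self._pending.append((parent_id, entity_id))
        return entity_id

    def get_by_id(self, entity_id):
        return self._entities.get(entity_id)

    def query_child_ids(self, parent_id):
        # Only sees rows that have already "caught up".
        return [child for parent, child in self._index if parent == parent_id]

    def catch_up(self):
        # Simulates the few seconds it takes for queries to see new writes.
        self._index.extend(self._pending)
        self._pending = []


def add_child_rebuild(store, parent_id, child):
    """Old approach: rebuild the ID array from a query after every add."""
    store.put(child, parent_id=parent_id)
    store.get_by_id(parent_id)["child_ids"] = store.query_child_ids(parent_id)


def add_child_append(store, parent_id, child):
    """Fix: skip the query and append the new ID to the array directly."""
    child_id = store.put(child, parent_id=parent_id)
    store.get_by_id(parent_id)["child_ids"].append(child_id)
    return child_id


def remove_child_id(store, parent_id, child_id):
    """Counterpart of the fix for deletion: drop the ID from the array."""
    store.get_by_id(parent_id)["child_ids"].remove(child_id)


# Old approach: the newest child is always missing from the array.
old = ToyStore()
old_parent = old.put({"child_ids": []})
add_child_rebuild(old, old_parent, {"name": "first"})
after_first = list(old.get_by_id(old_parent)["child_ids"])   # []
old.catch_up()
add_child_rebuild(old, old_parent, {"name": "second"})
after_second = list(old.get_by_id(old_parent)["child_ids"])  # [2]

# Fixed approach: the array is correct straight away, with no query.
new = ToyStore()
new_parent = new.put({"child_ids": []})
first_id = add_child_append(new, new_parent, {"name": "first"})
second_id = add_child_append(new, new_parent, {"name": "second"})
fixed_ids = list(new.get_by_id(new_parent)["child_ids"])     # [2, 3]
remove_child_id(new, new_parent, first_id)
after_removal = list(new.get_by_id(new_parent)["child_ids"]) # [3]
```

In the old approach the first child only appears once the second is added, which matches what our users reported. In a real app the parent's array update would also need to be saved back to the datastore, which this toy skips.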

Lesson 2: Writes take longer and reads take about the same - but both are more consistent.

Probably the next biggest issue is performance. Writes take about 50% longer on average, and reads take about the same time as before. However, performance is far more consistent - our milliseconds/request graphs are much flatter.

With the MS datastore, we got occasional operations that took much longer than normal, and these seem to have largely disappeared with HR.

Lesson 3: An order of magnitude fewer datastore timeouts.

The biggest difference has been in our error rate. With MS, we got a daily crop of datastore timeouts, and with HR they are extremely rare. With a non-relational database, datastore errors lead to inconsistent data, because one entity is updated but the next may not be. Users hate inconsistencies, so reducing errors by an order of magnitude is a huge win.

Lesson 4: HR costs more in CPU time.

Ah, but what about the cost? Google have equalised the storage costs between MS and HR, but you also pay for the CPU time to store and access your data. We're seeing a 30% increase in CPU after migrating, so HR will cost you more. How much more depends on your application, but this may also be moot when Google roll out their new instance-based billing.



So was it worth the time and cost involved in migrating from MS to HR? Yes. The reduced error rate has made our service far more robust, and reduced support calls. Also, we no longer need to manage users' expectations around Google's planned maintenance periods, which removes a lot of customer irritation. These advantages easily outweigh the slower writes and extra CPU costs.

Finally, please leave a comment below about how your migration went, or if you have any questions about the process. If you'd like dedicated support with your migration, please contact me (greg at vig dot co dot nz).


If this article has helped you, you can show your appreciation by telling any schools you are involved with about www.SchoolConferences.com. This simple online app (built on Appengine of course!) has already revolutionised parent-teacher conference evenings for over a thousand schools. See how easy it is to book conferences with the demonstration event code HIGH4 for high schools, and ELEM4 for elementary schools. Thanks!

2 comments:

  1. Good Post.

    For Lesson 1, what about get_by_key_name()? Does it work the same way as get_by_id()?

  2. @Santiago Basulto: get(), get_by_id() and get_by_key_name() are all strongly consistent.
