Friday, 10 October 2008

Doing EC2 Without Scalr

It is important that your website can scale. You can spend all your energy promoting it and adding features, but that effort is wasted if the site cannot cope with the users who are desperate to use it. The architecture is at least half the battle. We are spoilt these days with affordable options for dealing with scale. Cloud computing has become a buzzword and big tech companies are scrambling to carve out their chunk of the market. So far we have EC2, Google App Engine, Joyent, VPS services (such as Slicehost) and a number of others.

EC2 was one of the first of these offerings and is among the most mature. Though it does not stray far from the traditional concept of a machine (with fixed CPU and memory), it gives you access to as many of these machines as you need at any one point, along with the (very) basic tools to manage them. Amazon provides the basic infrastructure: machines, network connections and data storage. It is up to you to do the rest: load balancing, application servers, relational databases, backups, replication, redundancy, fail-over, etc.

Scalr promises to help with some of these issues, notably load balancing, and backups / replication / fail-over for MySQL. The idea is great but the project is not mature (despite the v1.0 label) and the architecture is not very solid. To be fair, this is acknowledged to some degree in that v2 is to be a complete redesign. After two months of using Scalr, I decided to move on because of these issues:
  • Ad-hoc design: a strange combination of PHP, bash, MySQL and cron. The interaction between these components was overly complicated, which made debugging tricky and left the system itself prone to getting mixed up (for example: cron called PHP to do a backup, but the state recorded in MySQL was stuck, so backups were being skipped, silently, for weeks). v2 is to be written in Java.
  • Buggy: rebundling an instance would break the load balancer. It would get confused about the new instances and refuse to update. This was only solved by restarting the whole farm.
  • Simplistic scaling: Scalr decides when to start new instances based on the load average. This is too simplistic, and it can mean that your machines are under heavy load for a little too long while new instances are being brought up.
  • Scalr is built on Ubuntu 7.04 AMIs, and upgrading the distro breaks it.
I do like what Scalr are trying to do. Just having pre-built load balancing servers, application servers and MySQL servers that are able to replicate and reconfigure themselves based on load is great. However, Scalr v1 is really still experimental, and I'm very keen to see how the next version evolves.

One alternative is to build the architecture yourself. It's actually not very hard (and quite quick). All you need is a decent scripting language that has good SSH and EC2 libraries, plus some experience with MySQL and Linux admin. A few days' work (4 in my case) will get you:
  • The ability to run and stop instances with a single command, including application servers and master and slave MySQL instances.
  • MySQL slave replication (automatically configured against the current master).
  • NGINX load balancer, always up to date with the application servers.
  • Automatic backup and restore facilities.
  • Various other tricks that your setup might benefit from.
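
To give a feel for how small these pieces can be, here is a minimal sketch of the "always up to date" part of the load balancer item: regenerating the NGINX upstream block from the current list of application servers. It is only an illustration, not the script I actually used. The function name, the upstream name and the addresses are all made up, and fetching the list of running instances from EC2 is assumed to happen elsewhere.

```python
def render_upstream(name, hosts, port=80):
    """Return an nginx `upstream` block with one `server` line per host.

    `hosts` is assumed to be the private addresses of the application
    servers that are currently running (hypothetical input; how you
    obtain it depends on the EC2 library you use).
    """
    if not hosts:
        raise ValueError("at least one application server is required")
    lines = ["upstream %s {" % name]
    # Sort and de-duplicate so the output only changes when the set of
    # servers changes, which makes "did the config change?" checks easy.
    for host in sorted(set(hosts)):
        lines.append("    server %s:%d;" % (host, port))
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example addresses only.
    print(render_upstream("app_servers", ["10.0.0.12", "10.0.0.11"], port=8080))
```

The management script would then write this text to a file that the main NGINX configuration includes, and reload NGINX whenever the text differs from what is already on disk.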
All that would be needed to catch up with Scalr's features (albeit not generically) is to automatically bring up the required instances as demand changes and automatically fail over MySQL instances. That is not much more work, and it has the added benefit that you can tailor your load algorithm to your site (e.g. use response times instead of load averages). I would recommend getting your hands dirty!
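
As a rough sketch of what a response-time-based rule could look like, the function below decides how many application servers to run from a window of recent response times. Everything here is an assumption for illustration: the function name, the thresholds, the instance limits, and the idea of adding or removing one instance at a time. Collecting the timings and actually starting or stopping instances is left to the rest of the tooling.

```python
from statistics import mean


def desired_instance_count(current, response_times_ms,
                           scale_up_ms=800.0, scale_down_ms=200.0,
                           minimum=1, maximum=10):
    """Suggest how many application servers to run.

    `response_times_ms` is a window of recent response times in
    milliseconds (hypothetical input, e.g. parsed from access logs).
    The thresholds and limits are placeholder values to tune per site.
    """
    if not response_times_ms:
        # No traffic data: leave the farm as it is.
        return current
    average = mean(response_times_ms)
    if average > scale_up_ms:
        target = current + 1   # too slow: add one instance
    elif average < scale_down_ms:
        target = current - 1   # comfortably fast: remove one instance
    else:
        target = current       # within the acceptable band
    # Never go below the minimum or above the maximum farm size.
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    print(desired_instance_count(2, [900, 1000, 1100]))
```

Leaving a gap between the two thresholds stops the farm from flapping between sizes when response times hover around a single cut-off.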