Wednesday, 28 May 2008

Is SQLAlchemy Ready For Production?

SQLAlchemyI have built my most recent project using SQLAlchemy (SQLA) 0.4. It is a reasonably young library, so I thought others considering using SQLAlchemy might benefit from my experiences.

The most recent version is 0.4, so it obviously does not claim to be 'finished' (can software ever be?) However, there are plenty of benefits in using it already:
  • Ridding your beautiful Python code of any SQL. It's untidy and pain to work with (which means bugs).
  • ORM (Object Relational Model) to fit in with a nice OO architecture.
  • Free sharding to split your database horizontally.
  • DB independant code, so you can switch if you feel the need.
  • Hassle free transactions and DB connection management.
Sounds great doesn't it? In general this is exactly what you get. However, there are a few points which are very important to consider before committing yourself to using SQLA.

Forcing Indices

Anyone who has had to do something reasonably involved in MySQL has spent time optimising their indices. Sadly, MySQL is not very clever about how it chooses the index to use for a particular query. In some cases the only option is to force it with FORCE. This is not available in SQLA, so anytime you need to do this, you'll have to manually specify the SQL. Not only does this partly defeat the point of SQLA, but it also happens that the place you most need the SQL abstraction is often the same place you need to force an index. Consider a complicated search page. You are searching over a large data set using a number of filters. MySQL is probably going to get the index wrong and you're going to have to generate a complicated SQL query without SQLAlchemy's help.

Commit ORM Objects


SQLAlchemy can be a little silly about detecting a change in an ORM object. If you assign to a member variable which is part of the schema, it will be marked as dirty even if the value doesn't actually change. The solution is to check for a change before assigning, which does not make for neat code!

Sharding


It's a good idea to plan for scaling as early as possible. It can be very difficult to build the necessary bits in later when you need it (and will also under pressure to fix things fast). One common way to deal with scale is to split large tables over several machines, this is known as sharding (each independant chunk of table is known as a shard).

SQLAlchemy has some code to help you there. All you need to do is to define 3 functions which tell SQLA which shard a particular row is in. Really simple. Sadly, this part of the code is not very mature at all:
  • query.count() doesn't work (nor any scalar query). I had to write a function to query each shard in turn and sum the result. The real problem here is that it was not clear at all which bits of functionality will or won't work with sharding (expect long debugging sessions, digging into the SQLA code).
  • The ORM caches objects and identifies them by their primary key. However, a common MySQL trick when sharding is to have an auto_incremented INT as the primary key of each shard but use something like a UUID as the 'real' primary key recognised by the code (this speeds things up quite a bit). Of course, the auto_incremented key will not be unique across shards and this will confuse SQLAlchemy. I think the best solution here (suggested by someone on the SQLAlchemy group) is to have a 2 column primary key with an INT and another integer shard identifier, making it unique.
No Server Side Cursors

A minor point, but might be important for some. Server side cursors come in handy sometimes when you are dealing with large amounts of data.

Tips For Optimising
  • Periodically check the SQL queries being made with the echo option. There might be some surprises in there (though usually easily fixed). This kind of thing usually pops up because you will use the ORM instance and forget / not realise that it will result in a query (from a software design point of view, great. From an optimisation point of view, awful).
  • Use set_shard on a query whenever you are able. If you know which shard the column you want is, no need to go checking the other ones. A common example is when the shard identitifier is in one of the query parameters.
  • Design for scale right from the beginning and develop / test on a distributed architecture (i.e. have at least 2 shards). This doesn't need to be difficult, for example: just create two databases on your developement box to simulate two machines.
Conclusion

I still think that SQLAlchemy is worth using. Overall it will save time and effort as long as you are careful and not afraid to get your hands dirty when the going gets tough. I expect most of these problems will be addressed in (near) future releases.

No comments: