Wednesday, 28 May 2008

Is SQLAlchemy Ready For Production?

SQLAlchemyI have built my most recent project using SQLAlchemy (SQLA) 0.4. It is a reasonably young library, so I thought others considering using SQLAlchemy might benefit from my experiences.

The most recent version is 0.4, so it obviously does not claim to be 'finished' (can software ever be?) However, there are plenty of benefits in using it already:
  • Ridding your beautiful Python code of any SQL. It's untidy and pain to work with (which means bugs).
  • ORM (Object Relational Model) to fit in with a nice OO architecture.
  • Free sharding to split your database horizontally.
  • DB independant code, so you can switch if you feel the need.
  • Hassle free transactions and DB connection management.
Sounds great doesn't it? In general this is exactly what you get. However, there are a few points which are very important to consider before committing yourself to using SQLA.

Forcing Indices

Anyone who has had to do something reasonably involved in MySQL has spent time optimising their indices. Sadly, MySQL is not very clever about how it chooses the index to use for a particular query. In some cases the only option is to force it with FORCE. This is not available in SQLA, so anytime you need to do this, you'll have to manually specify the SQL. Not only does this partly defeat the point of SQLA, but it also happens that the place you most need the SQL abstraction is often the same place you need to force an index. Consider a complicated search page. You are searching over a large data set using a number of filters. MySQL is probably going to get the index wrong and you're going to have to generate a complicated SQL query without SQLAlchemy's help.

Commit ORM Objects

SQLAlchemy can be a little silly about detecting a change in an ORM object. If you assign to a member variable which is part of the schema, it will be marked as dirty even if the value doesn't actually change. The solution is to check for a change before assigning, which does not make for neat code!


It's a good idea to plan for scaling as early as possible. It can be very difficult to build the necessary bits in later when you need it (and will also under pressure to fix things fast). One common way to deal with scale is to split large tables over several machines, this is known as sharding (each independant chunk of table is known as a shard).

SQLAlchemy has some code to help you there. All you need to do is to define 3 functions which tell SQLA which shard a particular row is in. Really simple. Sadly, this part of the code is not very mature at all:
  • query.count() doesn't work (nor any scalar query). I had to write a function to query each shard in turn and sum the result. The real problem here is that it was not clear at all which bits of functionality will or won't work with sharding (expect long debugging sessions, digging into the SQLA code).
  • The ORM caches objects and identifies them by their primary key. However, a common MySQL trick when sharding is to have an auto_incremented INT as the primary key of each shard but use something like a UUID as the 'real' primary key recognised by the code (this speeds things up quite a bit). Of course, the auto_incremented key will not be unique across shards and this will confuse SQLAlchemy. I think the best solution here (suggested by someone on the SQLAlchemy group) is to have a 2 column primary key with an INT and another integer shard identifier, making it unique.
No Server Side Cursors

A minor point, but might be important for some. Server side cursors come in handy sometimes when you are dealing with large amounts of data.

Tips For Optimising
  • Periodically check the SQL queries being made with the echo option. There might be some surprises in there (though usually easily fixed). This kind of thing usually pops up because you will use the ORM instance and forget / not realise that it will result in a query (from a software design point of view, great. From an optimisation point of view, awful).
  • Use set_shard on a query whenever you are able. If you know which shard the column you want is, no need to go checking the other ones. A common example is when the shard identitifier is in one of the query parameters.
  • Design for scale right from the beginning and develop / test on a distributed architecture (i.e. have at least 2 shards). This doesn't need to be difficult, for example: just create two databases on your developement box to simulate two machines.

I still think that SQLAlchemy is worth using. Overall it will save time and effort as long as you are careful and not afraid to get your hands dirty when the going gets tough. I expect most of these problems will be addressed in (near) future releases.

Tuesday, 27 May 2008

C# BHO Tutorial

ie7I get a fair bit of traffic from people looking for help with C# and BHOs (my event handling post). There isn't much information about and the only beginners tutorial went missing a few months back. A new one has appeared, anyone wanting to get going should check it out.

Wednesday, 14 May 2008

Vim: The Word Processor

I love Vim. It took me a while to get there, but I've been using it exclusively for coding for a couple of years now and it has become second nature. I first learnt to use it during a practical exercise for an operating systems course at uni. We had to write a Minix driver so all the work had to be done on the command line. A powerful text editor was a must. I think I used Vi (rather than Emacs) because it was available and I had been told it was great. It was a steep learning curve, but I got a hang of the basics after a few days.

It was a couple of years before I started to use it again. In the meantime I'd mostly been working with .NET and Java so I'd been using some pretty decent development environments and it didn't seem necessary to use anything else (especially considering the power of their debuggers). But I moved into the world of the web and started writing PHP and then Python, these did not really have especially good IDEs and so it was back to a text editor and my choice was Vim. It also coincided with worsening RSI, for which Vim is great.I tend to learn a couple of features in a burst every few months when something really bugs me. This is probably not ideal, but Vim is so powerful I will never learn it all. I've been keeping a to do list in a text file recently (rather than on paper - go planet!) but Vim's defaults are not great for editing prose:
  • Vim's word wrap is by character, not word.
  • k and j (up and down) work on a line basis. If you have a wrapped line, you cannot move inside it with j and k (like you would with a normal text editor).
Of course, Vim is hugely powerful and can be tweaked to be much more useable when writing prose. These .vimrc commands:
  • Wrap lines by cutting lines off at word boundaries.
  • The word wrap is virtual, no extra line break is inserted (so that it's easy to edit afterwards).
  • j and k are replaced with gj and gk which allow you to move up and down inside a wrapped line.
  • I've also added the spell checker (I've not played around with it properly yet, but it looks a little weak).
  • smartindent for bullet points.

 autocmd BufRead *\.txt setlocal formatoptions=l
 autocmd BufRead *\.txt setlocal lbr
 autocmd BufRead *\.txt map  j gj
 autocmd BufRead *\.txt  map  k gk
 autocmd BufRead *\.txt setlocal smartindent
 autocmd BufRead *\.txt setlocal spell spelllang=en_us

On a slightly different note: if you're using OS X, get this port of Vim. It's aim (and it does) is to integrate better into Macs. It's worth getting just because it has pretty Carbon tabs ;-)