ICU collation in Erlang

Right now we have a big performance problem in CouchDB view indexing when Erlang calls the ICU collation routines. The facilities Erlang offers for calling out to C are all dog slow, and string collation happens a lot during indexing. The result is a big CPU bottleneck in the indexing code, and most of it is overhead from marshaling the Erlang data to a C "port".

To optimize, I had the idea that we could do just the basic ASCII string collation in Erlang, and fall back to the ICU callouts when we hit non-ASCII. That makes collation faster in the common case, but it's still slow for anyone whose text isn't plain ASCII.
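
Here's a rough sketch of that fast-path idea, written in Python rather than Erlang just to keep it short and runnable. The names collate and slow_collate are made up for illustration; slow_collate stands in for whatever does the real ICU callout. The fast path only handles ASCII letters and digits, and its ordering (compare ignoring case first, then lowercase before uppercase on ties) is my assumption about what ICU's default rules do for such strings. It would need checking against ICU itself, because the two paths have to agree or the index order becomes inconsistent.

```python
import string

# Characters the fast path is willing to handle. Punctuation and whitespace
# are deliberately excluded: their collation order is not ASCII order.
_FAST_CHARS = frozenset(string.ascii_letters + string.digits)


def _is_fast(s):
    """True if every character is an ASCII letter or digit."""
    return all(ch in _FAST_CHARS for ch in s)


def _fast_key(s):
    """Sort key for the fast path (an assumed approximation of ICU's default).

    First level: the string with case folded away.
    Tie-break level: per-character case flags, lowercase before uppercase.
    """
    return (s.lower(), tuple(ch.isupper() for ch in s))


def collate(a, b, slow_collate):
    """Return -1, 0 or 1. slow_collate is a hypothetical ICU-backed fallback."""
    if _is_fast(a) and _is_fast(b):
        ka, kb = _fast_key(a), _fast_key(b)
        return (ka > kb) - (ka < kb)
    # Anything outside the fast set goes to the expensive callout.
    return slow_collate(a, b)
```

The point of the design is that the expensive call only happens when at least one of the two strings contains something the cheap path can't be trusted with.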

That got me thinking: how hard would it be to implement all of ICU collation in Erlang? It's my understanding that the ICU code is generated from parsable data, and that the source for both the C and Java versions is generated from it. How hard would it be to generate Erlang code from the same data, and would it be efficient? And what about case- and accent-insensitive sorting, something we don't have now but probably will in the future? Any ICU experts out there have ideas?

Right now, the only thing we use ICU for is collation, so that makes the problem easier.

Posted September 6, 2009 2:49 PM