ICU collation in Erlang

Right now we have a big performance problem in CouchDB view indexing when Erlang calls the ICU collation routines. The problem is that the facilities in Erlang to make C callouts are all dog slow, and collation of strings is something that happens a lot. So right now we have a big CPU bottleneck from collation in the indexing code, and it's mostly overhead just marshaling the Erlang data to a C "port".

To optimize, I had the idea that we could do just the basic ASCII string collation in Erlang, and when we hit non-ASCII we fail over to the ICU callouts. That makes our general collation faster in the general case, but still slower for anyone not American.

That got me to thinking, how hard would it be to implement all of ICU collation in Erlang? It's my understanding that the ICU code is generated from parsable data and the source for the C and java versions are generated from that. How hard would it be to generate the Erlang code to do that, and would it be efficient? And what about case and accent insensitive sorting, something we don't have now but probably will in the future? Any ICU experts out there have ideas?

Right now, the only thing we use the ICU for is collation, so that makes the problem easier.

Posted September 6, 2009 2:49 PM

Comments

Why not extend the OTP VM to add decent globalization features to the language?

Rodrigo Kumpera, September 6, 2009 9:24 PM

I dive right in and start reading about Unicode's collation algorithm,

http://www.unicode.org/reports/tr10/tr10-18.html#S1.1

and ... I turn into a pillar of salt (or maybe a pillar of MSG). I haven't found where ICU uses generated code except maybe for tables read from binary files and translated to C array decls/defs. AFAICT, there's no place where you can just start changing a table so that tools spit out Erlang instead of C/C++ or Java. You've still got to re-implement in Erlang what's been done in Java and C++ in Erlang. That looks like a lot of code, though it might be mostly belaboring the obvious in in-line comments combined with further obfuscating the obscure in the case of gratuitous abstractification.

I started feeling better after reading this PPT,

http://icu-project.org/docs/papers/normalization_iuc21.ppt

but maybe only because it feels SO good when you stop reading Unicode standards.

Since Unicode processing is pretty much just one big case analysis anyway, special-casing the simplest thing that would do the most good is quite in the spirit of the system. As the above PPT points out, most strings are already in normal form. Quickcheck (to see if a string is already in normal form) is very fast, and it's a fairly short bit of code. Erlang is used mostly in the West at the moment, so for most purposes, it's probably enough to just generate Erlang code for Anglo/Euro collation orders, if that simplifies things at all (n.b. not saying it does).

So the ticket here seems to be: incrementally implement algorithms that hit the eject button and pop the ICU parachute upon encountering anything too exotic or scary in mid-flight. Get the basic algorithms working, then, if they seem to need more speed, make them incremental. Most collation decisions are made within the first few characters -- if you can see where and how it's OK to do those comparisons in Erlang, you're probaby halfway to optimal.

(I hope pointing to resources where people talk about what's been done counts toward "getting it done", and not just as "talking about it".)

Michael Turner, September 7, 2009 1:00 AM

Unfortunately, I don't think this is going to be efficient enough. I tried implementing ASCII collation is Erlang, and it was still slower than the port callout to ICU. ICU in C is pretty much ideal, as it is highly optimized in a way that's impossible in Erlang.

Damien Katz, September 8, 2009 12:57 PM

Yeah, that's unfortunate. Could you please elaborate a little on "is highly optimized in a way that's impossible in Erlang"?

Vlad, September 8, 2009 10:03 PM

What about creating a linked in driver around ICU?

I know this is sort of frowned upon... But if you need speed it can work fine. Also with some good defensive coding the Linked in library should be plenty stable. And, if it does crash... Well CouchDB was designed around that problem.

Ray Morgan, September 9, 2009 4:06 PM

Vlad, "Impossible is erlang" is perhaps overstating it a bit, but the truth is the ICU library is low level optimized and getting similar levels of performance from Erlang just isn't possible with the current VM.

Ray, our ICU integration already is a linked-in driver, which currently must be exposed to Erlang as a port. And creating and sending binaries to it is more expensive than doing the equivalent in an Erlang BIF. I also don't worry about crashes from the ICU, both for the reason you gave and the fact that the ICU is a very reliable and well tested library and is unlikely to crash, ever.

Damien Katz, September 9, 2009 11:36 PM

It could be used Haskell to solve the performance problems in Erlang, which are both IO and string manipulation.

http://www.haskell.org/haskellwiki/Applications_and_libraries/Interfacing_other_languages/Erlang

http://github.com/esessoms/haskell-interface

Jonas, September 14, 2009 4:58 AM

I understand that you are using a linked in driver to wrap the ICU library.

Are you sure you are using the most efficient
way of calling the driver?

To avoid copying of binaries in calls to the driver you should implement and use the outputv
call back of the driver.

Does your driver allow per port or per driver
parallelism (i.e locking per port or per the whole driver)?
If the driver is reentrant and allows several ports you can utilize multicore much better (if you are calling the driver from several Erlang processes in parallell)

How do you create and return the result from the driver?

If you have all this as optimal as possible the difference compared with a BIF is not as big as you think.
But I agree that dynamically loadable BIF's would be a good idea since you will get rid of some locking in the SMP case and the need to pass around a port or use a named port dissapears.

Kenneth Lundin, September 14, 2009 5:05 PM

Hi damien, I just stumbled across this post.

If I remember right from my work at a native json implementation ( eep0018), which paul took over, there are *two* binary encodings between Erlang OTP and native drivers. The official one - which is suppossedly stable between OTP releases - and the "internal" one, which is not even properly documented, but can be accessed in a not-too-low level way via Erlang OTP headers and such.

This was the key in Paul's final success in getting blazing speed out of the eep0018 driver, and this might improve the throupghput between CouchDB/Erlang and the ICU driver.

eno, November 13, 2009 6:39 AM

After announcing NIFs in Erlang it could be possible to write bindings for ICU C functions.

nekto0n, November 30, 2009 2:10 AM

Post a comment




Remember Me?

(you may use HTML tags for style)