Lisp as Blub
There's a problem in the server software. When the load gets high, it fails catastrophically instead of gradually. Robert and Patrick Collison are investigating, but they're still not sure what the problem is. My guess from the external evidence is that it's related to garbage collection. Killing the server process fixes the problem, at least for a day or two.
And there's the problem with Lisp for writing server software. Long-lived processes, shared-state threading, and garbage collection make it extremely difficult to fail gracefully. Even if your code is completely correct and bug-free, it can still crash, hang, or just run unacceptably slowly, and there is nothing you can do to correct it short of a complete restart.
There is no macro or metaprogramming technique to fix this problem. There are things you can do to mitigate it (mostly by generating less garbage), but once you reach a certain level of activity in the system where the garbage collector can no longer keep up (and it will happen), then every line of code in your system is now a potential failure point that can leave the whole program in a bad state. Lisp has this problem. Java has this problem. Erlang does not.
Posted April 14, 2008 8:22 AM

Comments
Surely Erlang servers have this problem as well. It just needs to be debugged, which can be hard. The method is to use combinations of processes() and process_info(), together with application-specific knowledge, to find the culprit. Usually there is just one process that is large; once you find that process and get its stack trace, it's usually easy to eliminate the problem.
Also take a close look at the backtrace argument to process_info().
klacke, April 14, 2008 10:13 AM
Any single Erlang process can have this problem, but when the Erlang VM runs out of resources it starts killing Erlang processes. The process heaps and other resources are recovered quickly (no giant object graph to unwind before finalization) and the process boundaries ensure other processes aren't affected (unless they are relying on the killed process, in which case they will die too). Java, Lisp, Python, etc cannot do this, the memory models do not allow it.
Damien Katz, April 14, 2008 10:59 AM
Damien, didn't you report that the Erlang VM crashed hard if it reached an OOM state some time earlier? In this post for example?
Now I do notice that in What Sucks about Erlang where you originally stated
you added:
Is the R12B behavior to kill Erlang processes when the VM can't allocate new memory? And if so how is the slaughter order decided?
Anonymous, April 14, 2008 11:54 AM
"garbage), but once you reach a certain level of activity in the system where the garbage collector can no longer keep up (and it will happen)"
So basically the server is overburdened? I guess that would make estimating Lisp's real CPU-effectiveness difficult. Maybe there are (or should be) some tools to see whether Lisp systems are starting to get overburdened? (Since it apparently degrades too abruptly.)
"[...] Java, Lisp, Python, etc cannot do this, the memory models do not allow it."
You mean not in principle, or not with the current existing implementations? And if so, which lisp implementation are you using?
Since there are so few responses (and this is linked from reddit.com), too many people reading this see "Boo lisp, hurrah Erlang" without really thinking about it.
Jasper, April 14, 2008 12:19 PM
> So basically the server is overburdened?
Yes, and the problem is that once a server is no longer able to handle the load, everything it is currently doing becomes a potential source of errors that can leave the program in an invalid state, so that even once the load has lightened, the server is still in a bad state. It might just be slower (memory leak or runaway thread), or it might have crashed completely; even worse, it might start spitting out wrong results or even destroying existing data.
Erlang deals with these problems in a fundamentally different way, making these sorts of bugs, perhaps not impossible, but extremely unlikely.
> You mean not in principle, or not with the current existing implementations?
Both. The reliability inherent in Erlang is not something that can be easily bolted onto another language. The language VMs could be modified to support processes and isolation, but you'd need all the libraries to support it too, with absolutely no shared state allowed (mutable static variables would be forbidden).
> And if so, which lisp implementation are you using?
none.
Damien Katz, April 14, 2008 12:57 PM
Lisp does not have a problem. Lisp implementations may have a problem. That's all I'll say on the subject.
Diogo, April 14, 2008 1:00 PM
'too many people reading this see "Boo lisp, hurrah Erlang" without really thinking about it.'
Haha to be true, I am not one of these.
I am one of the 'boo lisp, boo erlang' folks but I also did not want to comment here, but I just read your comment here, and had to disagree. ;)
(But actually the article here was kinda short. I think I expected something a bit larger to read, so I lost interest when I saw it was such a small article.)
she, April 14, 2008 1:00 PM
The other problem is how to scale beyond one server when you have long-lived processes and all that shared state. Eventually you will have to deal with reading and writing to unreliable ports on the network, just like everyone else.
pg got away with scaling ViaWeb because the locality was extreme. He gets away with simple architecture on news.yc because the total data set is very small.
Aristus, April 14, 2008 1:00 PM
Only a neophyte would run a large or heavily loaded site out of a single instance of anything, Lisp, Java, Erlang whatever.
Once you are distributing your calls to multiple independent processes, the Erlang approach (kill the load by killing the processes) works, but it's not very friendly to the end user. I suspect slowing the request rate
What's more likely here is not that the load is killing the service, but that a programming glitch is consuming resources that aren't recovered. In this case all systems will fail eventually.
This is not a language problem per se; it's an architecture and design problem. Erlang won't solve it any more than 6502 assembler will.
Joe Drumgoole, April 14, 2008 1:00 PM
And then there are the peeling of the onions as they permutate their internal ooze of green in the early morning air, dew glistening from the folds of scales on the back of the great beast.
Dairy Farmer, April 14, 2008 1:09 PM
Damien, here I was respecting your work and then you say a silly thing like this.
Paul is basing his system off of MzScheme, which simply hasn't had the server paces put to it that Erlang, Java, and even some Common Lisps have had. You use an untested system and you are the testing ground, and you have to accept hiccups. And in general, news.ycomb's infrastructure has held up admirably under the load it's been put under, given the extremely ad-hoc way in which it runs (via an interpreter on top of another JITted environment). A small amount of downtime like this
Saying that Erlang doesn't have this problem is also false: a recent version of Erlang had a critical networking bug that caused a wicked infinite loop below any user code, and could actually take down the box that hosted it.
The Blub Paradox is not about languages being superior or runtimes being more stable, it's about preconceived and irrational notions of languages shaping our opinion and thought. You are falling into it.
Dave Fayram, April 14, 2008 1:11 PM
FYI, news.ycombinator (the server software Paul is talking about) is written in Arc, on top of Scheme. There's no Lisp involved.
Charles Piña, April 14, 2008 1:48 PM
Except that Arc is "a new dialect of Lisp" and Scheme is also a dialect of the Lisp programming language.
Matt Brubeck, April 14, 2008 2:10 PM
Well, if all Lisp implementations have that problem, then Lisp has that problem as a rule. Is there any Lisp implementation that doesn't have it?
Not to mention, as Damien specified, you need a specific shared-nothing, immutability-oriented, failure-resistant mindset to handle that kind of problem. Is there any existing Lisp with that?
That's baked in Erlang's runtime, just so you know, and while the unreliability of ports isn't abstracted in any way, it's much less problematic than in most other languages because unreliability is expected and the language and runtime both have all the tools required to cope with it.
Yes, but by coping for some time (potentially with a lowered QoS) the system allows for continuous operation and gives the programming team the time to investigate the problem (and in a live environment already subject to the issue, too).
Damien does specifically state that Java has the same issue though...
Scheme is a Lisp, and so is Arc. Lisp doesn't mean Common Lisp.
masklinn, April 14, 2008 3:16 PM
Simon Belak, April 14, 2008 4:01 PM
Obviously, everyone should be using C
Bob Balaban, April 14, 2008 6:21 PM
The two relevant issues are system granularity and garbage collector behavior (if it is related to memory and garbage collection).
Erlang encourages an architecture of many small-granularity processes. To the extent that this approach is followed, failures are localized. It is possible to do this with other languages, but Erlang encourages the approach more than they do.
The other difference is that Erlang uses a single-threaded garbage collector per process. This makes the garbage collection process simpler, more finely grained and distributed. Smaller processes mean less complicated memory structures, and thus the language encourages a simpler model with localized garbage collection failure. Determining the cause of overburdened memory usage (or any other resource because of the localized nature of small processes) becomes easier.
An Erlang system can get wedged, but following the principle of many small processes makes it less likely to happen than in other languages which encourage large processes with shared memory structures.
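A rough sketch of the localized-failure idea in a language without Erlang-style processes, assuming Python and one OS process per request. An OS process is far heavier than an Erlang process, so this is only an analogy. The names `WORKER_SOURCE`, `run_isolated`, and `handle_all` are made up for illustration:

```python
import subprocess
import sys

# Hypothetical worker body. Each request runs in its own OS process, so an
# unhandled error here cannot touch memory owned by any other request.
WORKER_SOURCE = """
import sys
n = int(sys.argv[1])
print(100 // n)  # n == 0 raises ZeroDivisionError and the worker dies
"""


def run_isolated(n, timeout=10.0):
    """Run one request in a fresh process; return its result, or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, str(n)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills a worker that exceeds the timeout.
        return None
    if proc.returncode != 0:
        # The worker crashed; its memory is reclaimed by the OS when it exits.
        return None
    return int(proc.stdout.strip())


def handle_all(requests):
    """Handle every request independently; one failure does not stop the rest."""
    return {n: run_isolated(n) for n in requests}
```

Calling `handle_all([2, 0, 5])` lets the request for `0` crash in its own process while the other two still return results.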
Jay Nelson, April 15, 2008 1:14 PM
I'm not exactly sure how it works out in this particular argument, but Clojure is a Lisp dialect (built on the JVM) which uses immutable persistent data structures and is geared toward concurrency. The ant simulation demo certainly shows off the ability to have many small threads in an app.
Something to look into at least...
Shawn Fumo, April 15, 2008 10:15 PM
We are building a production transaction processing system, including
a core written in Common Lisp, that is intended to have 99.99%
availability. We're confident that we can make this work.
"... but once you reach a certain level of activity in the system
where the garbage collector can no longer keep up (and it will
happen), then every line of code in your system is now a potential
failure point that can leave the whole program in a bad state."
First, I don't agree that it's inevitable that you'll get to the point
where the GC cannot keep up. Modern GCs are a lot better than this.
Second, why would slowness from the GC cause every line of code to be
a potential failure point? I don't understand what he means by this.
"The other problem is how to scale beyond one server when you have
long-lived processes and all that shared state." You do it the same
way it's done in any other language: you have a cluster with a load
balancer.
The best thing to do is design your system to be stateless. In the
system I'm working on, there is a three-tiered architecture, with the
business (middle) tier written in Common Lisp. Requests are sent from
the presentation layer to the business layer, and responses are sent
back. There is no "session" concept; each request is independent.
The business tier is a cluster. If one of the Lisp processes crashes
due to some non-repeatable bug (such as the type we're talking about
here), the request is simply retransmitted and the load balancer sends
it to another Lisp process. Meanwhile the crashed Lisp process can be
analyzed by operators, and a new Lisp starts up to take the place of
the old one.
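
The retransmit-on-crash behavior described above might be sketched like this in Python. `Backend` and `LoadBalancer` are hypothetical stand-ins, not the system being described, and the retry is only safe because each request is independent of any session state:

```python
class Backend:
    """Hypothetical stand-in for one stateless server process in the cluster."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            # Simulates a crashed process that no longer answers.
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class LoadBalancer:
    """Round-robin balancer that retransmits a request to the next backend."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # round-robin cursor

    def send(self, request):
        """Try each backend at most once, starting at the cursor."""
        last_error = None
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            try:
                return backend.handle(request)
            except ConnectionError as err:
                # No session state lives in the backend, so retrying elsewhere is safe.
                last_error = err
        raise RuntimeError("no healthy backend available") from last_error
```

With one crashed and one healthy backend, `send` quietly falls through to the healthy one; starting a replacement process would be the operators' job and is left out of the sketch.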
Killing a process is perfectly safe. We are even considering taking
down and restarting each Lisp once per day, just in case there are
small "memory leaks" (storage that is not needed, but is still being
retained because some pointer is still pointing at it). It's
important to track down large leaks, but you'll never get all the
little ones.
For the time being, we are going to only handle one request at a time
in each Lisp, just for simplicity. However, in the future we may have
many request-handling threads in the same Lisp, and we'll be very
careful regarding what state is shared between threads.
Erlang lets you do the killing at a thread granularity rather than a
whole-process granularity, which is a good thing. In exchange for
that, you have to put up with a no-side-effect language and other
restrictions. I much prefer a more conventional language like Lisp.
Anonymous, April 27, 2008 11:18 AM