First, there was clearly a problem with their choice of colo provider: they went cheap and got cheap. But the really interesting part (to me) is the problems that stemmed from a complete power failure:
Unfortunately, that meant that when the batteries died, our server farm went down quite ungracefully - causing problem #3, which was data corruption due to the unclean shutdown.
The rest of the weekend has been spent recovering from these failures - we've had to do consistency checks and then rebuilds of the data sets that got corrupted, and we're doing that for over a hundred machines. Bad bad bad.
Now I'm not down on them for not having a better system or planning; they run a small business and must make tradeoffs. I find it unfortunate that this tradeoff must be made.
One mistake they made was trying to compensate for the fact that their software isn't crash tolerant by making the surrounding infrastructure more reliable and adding provisions to ensure the system always shuts down cleanly. But, as this weekend illustrates, how many failure events can you possibly provision for?
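The alternative is software that treats a power cut as a normal event rather than something the infrastructure must prevent. One standard technique is the write-temp/fsync/rename pattern, which guarantees a reader sees either the old file or the new one, never a half-written mix. A minimal sketch (the function name and file layout are my own illustration, not anything from their system):

```python
import os

def atomic_write(path, data):
    """Update path so that a crash at any point leaves either the
    old contents or the new contents on disk, never a corrupt mix."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # force the bytes to stable storage
    os.replace(tmp, path)      # atomic rename: readers never see a partial file
    # fsync the directory so the rename itself survives a power loss
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Because every on-disk state is valid, recovery after an unclean shutdown is just "start up and read the file", with no consistency checks or rebuilds needed.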
If your system requires an orderly shutdown in order to come back up quickly, then it simply can't be considered highly reliable. But the problem is that designing highly reliable and highly available software systems is really difficult. It seems to me we are missing the proper tools and concepts needed to build critical software correctly; it shouldn't be this hard.
Maybe I’ll have some insights about this as I build Couch.