As we noted in an earlier post, we experienced an unscheduled outage with our On Demand services (FogBugz, Kiln, and StackExchange) that lasted about 10 minutes.
The outage was triggered by a hard fault on our primary load balancer. Its redundant peer took over as planned, but was somewhat confused by the transition. It took us a few moments to put things back in order, after which all services returned to normal.
Since then, we have identified both the primary cause of the fault on the original system and the misconfiguration that caused the noted 'confusion' on the backup.
We offer our sincere apologies for this disruption, and appreciate your patience in the matter.
Minutes after posting about a brief, single-server outage, we saw our entire On Demand and StackExchange infrastructure go down. This outage lasted about twenty minutes (roughly 10:44 to 11:05 EST).
In a nutshell, the previous outage left a pile of hung sockets on our load balancer, each waiting for a reply that would never come. Combined with our growing traffic on FogBugz On Demand and StackExchange, those stale sockets exhausted the available file descriptors and prevented any new connections from being established.
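To make the failure mode concrete, here's a minimal Python sketch (not our production code, and not the load balancer itself) of how sockets held open awaiting replies can eat through a process's descriptor budget until new connections fail outright:

```python
# Minimal sketch (not our production code): how sockets left open and
# waiting on replies eat through a process's file-descriptor budget.
import socket

held = []  # stand-ins for hung connections that never get a reply
try:
    while True:
        # Each open socket consumes one file descriptor.
        held.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
except OSError as exc:
    # Errno 24 (EMFILE): "Too many open files" -- at this point the
    # process can no longer accept or open any connection at all.
    print(f"descriptor exhaustion after {len(held)} sockets: {exc}")
finally:
    for s in held:
        s.close()
```

On the common Linux default of 1,024 descriptors per process, a busy load balancer can hit that wall quickly.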
I've corrected the load balancer issue by vastly increasing the number of allowed file descriptors and ensured that this configuration will make it into future builds. We're still investigating the original problem.
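For the curious, the knob involved is the per-process file-descriptor limit. The real fix lives in the load balancer's own system configuration, but a minimal Python sketch on a POSIX host shows how to inspect the limit and raise the soft cap toward the hard one:

```python
# Minimal sketch, assuming a POSIX host (the actual fix was made in the
# load balancer's configuration): inspect and raise the soft
# file-descriptor limit up to the hard cap.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# An unprivileged process may raise its soft limit as far as the hard
# limit; raising the hard limit itself requires root.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

A raised soft limit only helps the running process, which is why the change also has to live in the system configuration to survive restarts and future builds.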
I'm sorry for the interruption in service. All monitors are green now, so please contact Support if you experience any further difficulties.