First, I want to apologize to all of our customers for the outage this morning. Here's what apparently happened: an electrician working on something completely unrelated to Fog Creek flipped the breaker that supplies power to two of our racks at our data center. PEER 1's NOC called us as soon as it happened to let us know they couldn't reach our equipment, and a few minutes later their data center staff realized the problem and restored our power pretty quickly.
Regardless of why this happened, it's ultimately our responsibility to mitigate failures in our data center. We take full blame for this issue.
The most obvious problem (contrary to our earlier post) is that we do not have UPSes installed in our own racks. The data center has them, but we don't have individual ones for our racks. We will be purchasing these to isolate us from power problems at the data center.
Secondly, our racks are interconnected in such a way that a power failure in one rack can isolate the remaining racks from one another. We were already aware of this situation before the power failure and have plans in place to fix it, so this is entirely our fault for not getting that done sooner. We just hired another sysadmin to help us dig through our workload.
Lastly, our SQL servers are currently single points of failure. We can lose a web server, load balancer, or firewall and everything continues to function. If we lose a SQL server, all customers on that system will be offline. As with the switching problem, we already had plans to resolve this and are redoubling our efforts to get it done. We will be ordering a new pair of database systems to finish out a redundant configuration.
We will keep you in the loop as we implement these changes. Stay tuned to this status blog and http://blog.fogcreek.com for new information, and please call or email us if you have any complaints, suggestions, or feedback.
We're sorry for letting you down this morning and promise to make sure we're prepared next time.