What happened?
Last night's outage occurred in two distinct stages.
The first was caused by a failing Power Distribution Unit (PDU). Each of our racks has two PDUs on two different 20A circuits, and two 48-port switches, each plugged into a different PDU. When this PDU failed, its switch went down with it. Similar failures in the past have caused chain reactions that took our entire environment offline. Because of the changes we've made to our network architecture and power layout, however, actual service impact was limited to customers hosted on the two SQL servers attached to that switch. The SQL servers themselves never lost power, so there was no risk of data corruption.
The second stage of the outage occurred when the second switch in that rack lost power. We still don't have an explanation for why this happened. Its effect, though, was less contained than the first stage's: it cut off communication between two other racks and took all On Demand services down with it.
The outage was resolved when we replaced the faulty PDU and restored power to both switches.
What will change?
Though we've made progress, we clearly still have work to do on insulating our services from power failures. Our datacenter has grown very organically over the past few years and needs some careful re-engineering. My colleague Tim has put together a well-thought-out plan to refactor our power and layer-2 fault tolerance to withstand multiple concurrent failures (including total loss of power to at least one rack). Believe it or not, implementation was slated to begin on Monday, June 27.
This will include:
- Replacing all PDUs of a particular model (I'd rather not give the manufacturer and model since I suspect we just got a bad batch of these).
- Scheduling regular PDU replacement intervals.
- Replacing all switches with dual-power-supply models so that networking is not interrupted by a single-circuit failure.
- Moving our network topology to a full double ring (there's a rough sketch of what this buys us after this list).
- Adding a pair of switches between our gateways and uplinks so that our VRRP advertisements stay reliable during other network interruptions.
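Purely as an illustration of the goal (the device names and layout below are made up, not our actual inventory), here's a minimal sketch of the kind of check the new layout is meant to pass: model each switch's power feeds and ring links, then verify that losing any single PDU, or all power to one rack, still leaves the surviving switches able to reach each other.

```python
# Illustrative only: a toy model of dual-fed switches in a double ring.
# Names and layout are hypothetical, not our real inventory.
from collections import deque

# switch -> PDUs feeding it (dual-power-supply switches draw from both rack PDUs)
power = {
    "sw-a1": {"pdu-a1", "pdu-a2"}, "sw-a2": {"pdu-a1", "pdu-a2"},
    "sw-b1": {"pdu-b1", "pdu-b2"}, "sw-b2": {"pdu-b1", "pdu-b2"},
    "sw-c1": {"pdu-c1", "pdu-c2"}, "sw-c2": {"pdu-c1", "pdu-c2"},
}

# layer-2 links forming a double ring across three racks
links = [
    ("sw-a1", "sw-b1"), ("sw-b1", "sw-c1"), ("sw-c1", "sw-a1"),  # ring 1
    ("sw-a2", "sw-b2"), ("sw-b2", "sw-c2"), ("sw-c2", "sw-a2"),  # ring 2
    ("sw-a1", "sw-a2"), ("sw-b1", "sw-b2"), ("sw-c1", "sw-c2"),  # in-rack cross-links
]

def connected(up_switches):
    """True if every surviving switch can still reach every other one."""
    if not up_switches:
        return True
    adj = {s: set() for s in up_switches}
    for a, b in links:
        if a in up_switches and b in up_switches:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(up_switches))
    seen, queue = {start}, deque([start])
    while queue:
        for n in adj[queue.popleft()] - seen:
            seen.add(n)
            queue.append(n)
    return seen == set(up_switches)

def survives(dead_pdus):
    """A switch stays up if at least one of its power feeds is still live."""
    up = {s for s, feeds in power.items() if feeds - dead_pdus}
    return connected(up)

all_pdus = set().union(*power.values())
# any single PDU failure
assert all(survives({p}) for p in all_pdus)
# total power loss to any one rack (both of its PDUs)
for rack in "abc":
    assert survives({f"pdu-{rack}1", f"pdu-{rack}2"})
print("tolerates any single PDU failure and any full-rack power loss")
```

The real design has more moving parts (uplinks, gateways, and the SQL servers themselves), but the property we're engineering for is exactly this one: no single power event should be able to partition the network.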
We also need to review our SQL servers' network fault tolerance. We previously used our NIC vendor's interface-teaming feature to tolerate single-link failures, but found that it led to unacceptable performance tradeoffs with MSSQL. We'll attack this problem alongside the rest of our network restructuring.
Expect maintenance announcements in the coming weeks as we begin translating this plan into reality.
Finally, I apologize for the downtime last night -- we take great pride in our products, and failures like this are deeply frustrating. We're working hard to make things better. In the meantime, if this outage affected you, please contact us and we'll make it right.