You may recall some issues stemming from the network maintenance performed last week. We've been investigating since, and here's our post-mortem.
What was the impact?
The network was offline, with all services down, for approximately two hours. That was one hour longer than our anticipated downtime, but it still fell within the maintenance window we had scheduled.
So, what happened?
While rearchitecting our switch fabric's spanning tree (moving from a more control-centric per-VLAN spanning tree to the faster-failover Rapid Spanning Tree Protocol, RSTP — ironically, to keep downtime to a minimum), we suddenly lost access to our equipment. I was the one on site at the datacenter, so I began investigating. Here's what I found:
- A network loop had formed between several switches.
- The gateways controlling access to the switch management network were isolated from each other, creating a split-brain scenario. Neither was reachable due to a sudden traffic flood.
- The traffic flood was a multi-switch BPDU (bridge protocol data unit) storm, indicating a spanning-tree flap; the flapping topology is most likely what kept reshaping the loop. (A rough idea of what such a storm looks like in a packet capture follows this list.)
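For the curious, here's a sketch of how a BPDU storm shows up in a packet capture. This is illustrative only, not our actual tooling: it assumes Python with scapy installed, a capture file taken from a mirrored switch port, and a made-up threshold of 100 BPDUs per second.

```python
# Illustrative sketch: count spanning-tree BPDUs per second in a capture to
# spot the kind of BPDU storm described above. The filename and the
# 100-per-second threshold are placeholders, not values from our network.
from collections import Counter

from scapy.all import rdpcap
from scapy.layers.l2 import STP

THRESHOLD = 100  # BPDUs per second we'd consider a storm (illustrative)

def bpdu_rate(pcap_path: str) -> Counter:
    """Return a Counter of BPDU frames seen per whole second of capture time."""
    per_second = Counter()
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(STP):
            per_second[int(pkt.time)] += 1
    return per_second

if __name__ == "__main__":
    rates = bpdu_rate("capture.pcap")
    for second, count in sorted(rates.items()):
        flag = "  <-- storm?" if count > THRESHOLD else ""
        print(f"t={second}s: {count} BPDUs{flag}")
```

In a healthy network you'd expect a trickle of BPDUs per port; a sustained spike across many switches is the signature of the flap described above.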
How was it repaired?
Switch reboots and some "hands-on" rearchitecting of the switches let us complete the migration to RSTP (we had been interrupted partway through) and stop the flood. Once the flooded traffic drained and the switches were able to rebuild accurate MAC address tables, connectivity was restored. The rest of the time was spent restoring the services that had been affected.
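To illustrate why connectivity didn't return the instant the loop was broken, here's a toy model of a switch's MAC address table with learning and aging. It's purely conceptual, not switch firmware; the 300-second aging time mirrors a common default and may not match our gear.

```python
# Conceptual sketch: a MAC address table with learning and aging, showing why
# forwarding only becomes correct again once stale entries from the loop are
# aged out or overwritten by relearning. Values are illustrative.
import time

AGING_SECONDS = 300  # common default aging time; ours may differ

class MacTable:
    def __init__(self):
        # mac -> (port, last_seen timestamp)
        self.entries = {}

    def learn(self, mac: str, port: str) -> None:
        """Record (or refresh) which port a source MAC was last seen on."""
        self.entries[mac] = (port, time.time())

    def lookup(self, mac: str):
        """Return the port for a MAC, expiring entries older than AGING_SECONDS."""
        entry = self.entries.get(mac)
        if entry is None:
            return None  # unknown destination: frame gets flooded to all ports
        port, last_seen = entry
        if time.time() - last_seen > AGING_SECONDS:
            del self.entries[mac]  # stale entry ages out
            return None
        return port

# During the loop, copies of the same source MAC arrive on several ports in
# quick succession, so learn() keeps overwriting the entry with the wrong port.
table = MacTable()
table.learn("aa:bb:cc:dd:ee:ff", "gi0/1")   # legitimate path
table.learn("aa:bb:cc:dd:ee:ff", "gi0/24")  # looped copy arrives elsewhere
print(table.lookup("aa:bb:cc:dd:ee:ff"))    # now points at the wrong port
```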
What's to prevent this from happening again?
We're taking several steps to prevent a recurrence:
- We're evaluating our equipment. We haven't found a standards-based explanation for the BPDU flood; it's exactly what shouldn't be happening. If it turns out to be a bug, that means firmware updates or new equipment. If it's an unusual combination of events, that means rearchitecting our network to remove one of the conditions that combination depends on. (One way to watch for a recurring spanning-tree flap is sketched after this list.)
- We're changing our maintenance planning to be more comprehensive. We identified holes in the maintenance plan that failed to account for such a catastrophic breakdown. We're redrawing the line on the continuum between "safe to perform" and "keeps the maintenance window small." As a general rule, plans that optimize for short maintenance windows are less safe, whereas safer plans tend to have longer windows and longer downtime. We're aiming more toward "safe" now than we were. In a practical sense, our customers won't see the difference... except when we avert a catastrophe rather than suffer through it.
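As an example of the kind of check that can catch a spanning-tree flap early, here's a rough sketch that polls the standard BRIDGE-MIB topology-change counter on each switch and alerts when it keeps climbing. It's only a sketch: it assumes the net-snmp snmpget tool and SNMP enabled on the switches, and the hostnames, community string, and polling interval are placeholders.

```python
# Illustrative sketch: poll BRIDGE-MIB::dot1dStpTopChanges on each switch and
# alert if the counter keeps increasing, which indicates repeated spanning-tree
# topology changes. Hosts, community, and interval are placeholders.
import subprocess
import time

SWITCHES = ["switch-a.example.net", "switch-b.example.net"]  # placeholders
COMMUNITY = "public"                                         # placeholder
TOP_CHANGES_OID = "1.3.6.1.2.1.17.2.4.0"  # BRIDGE-MIB::dot1dStpTopChanges.0
POLL_SECONDS = 60

def topology_changes(host: str) -> int:
    """Read the cumulative STP topology-change counter from one switch."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", host, TOP_CHANGES_OID],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def watch() -> None:
    last = {host: topology_changes(host) for host in SWITCHES}
    while True:
        time.sleep(POLL_SECONDS)
        for host in SWITCHES:
            current = topology_changes(host)
            if current > last[host]:
                print(f"ALERT: {host} logged {current - last[host]} "
                      f"topology change(s) in the last {POLL_SECONDS}s")
            last[host] = current

if __name__ == "__main__":
    watch()
```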
Finally, I would like to once again apologize for any trouble this maintenance caused.