Last night, we experienced two network interruptions at approximately 20:00 EST and 21:20 EST. The first outage was brief enough that not one customer reported it. The second outage lasted about 15 minutes before it was resolved, and was reported by several customers.
The Short Explanation
We're building a new set of gateways to be deployed this weekend. We took every precaution we knew of to keep these new machines separated from live traffic. Ironically, an undocumented behavior in one of our redundancy technologies caused the test bed to intrude on the live network.
This particular behavior won't affect us again since we now know it exists. We're also submitting documentation patches so that, hopefully, it will be better documented for others in the future. Ordinarily, if we'd had any inkling that this separate system could have affected production, we would have performed the work during the maintenance window. There was simply no way we could have known that this failover would happen. In the short term, we're reexamining the kind of work that goes on outside a maintenance window, with particular focus on our redundancy systems, to ensure we're not putting the environment at undue risk.
The Long Explanation
We are building out a new set of upgraded gateways to be deployed this weekend. The technology we use to provide IP failover between our gateways is CARP, an open source alternative to VRRP. There are two components to configuring CARP interfaces: the VHID and the CARP password.
The VHID is advertised by CARP to identify interfaces as belonging to the same redundancy group. The CARP password is there to ensure that only authorized interfaces are able to participate in that redundancy group. Advertisements made on a particular VHID will be ignored by the CARP interface if the advertisement doesn't use the same password.
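To make the filtering rule concrete, here is a minimal sketch of how we understood it. This is a simplified model, not CARP's actual wire format: the function names, the message layout, and the example passwords are all hypothetical. The one real detail is that CARP authenticates advertisements with an HMAC-SHA1 keyed by the password.

```python
import hashlib
import hmac


def sign_advertisement(vhid: int, counter: int, password: str) -> bytes:
    """Sign a (simplified, hypothetical) advertisement with the group password."""
    message = bytes([vhid]) + counter.to_bytes(8, "big")
    return hmac.new(password.encode(), message, hashlib.sha1).digest()


def accepts(local_vhid: int, local_password: str,
            adv_vhid: int, adv_counter: int, adv_digest: bytes) -> bool:
    """Return True only if the advertisement belongs to our redundancy group."""
    # A different VHID means a different redundancy group: ignore it.
    if adv_vhid != local_vhid:
        return False
    # Same VHID: the digest must also verify under our own password.
    expected = sign_advertisement(adv_vhid, adv_counter, local_password)
    return hmac.compare_digest(expected, adv_digest)


# An advertisement from a production gateway (made-up VHID and password).
digest = sign_advertisement(1, 42, "prod-secret")

print(accepts(1, "prod-secret", 1, 42, digest))  # same group
print(accepts(1, "test-secret", 1, 42, digest))  # same VHID, wrong password
```

Under this model, an interface with the same VHID but a different password never joins the group, which is the isolation we were counting on.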
With this in mind, we configured our new gateways with the same VHID but different IPs and different passwords, so that we could test our buildout without affecting the production environment. What isn't documented in the OpenBSD man pages*, OpenBSD FAQ, or FreeBSD man pages is that the VHID is used to calculate the CARP interface's virtual MAC address. Because the new gateways shared a VHID with the production gateways, they also shared a virtual MAC address, and as a result traffic intended for our production gateways was switched to the new boxes.
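The collision is easy to see once the derivation is written down. CARP reuses the VRRP virtual MAC prefix 00:00:5e:00:01 and puts the VHID in the last octet; the password plays no part. The sketch below illustrates that, along with a toy model of a switch's MAC table. The VHID value, port names, and helper names are made up for illustration.

```python
def carp_virtual_mac(vhid: int) -> str:
    """Derive the virtual MAC for a CARP group: fixed prefix plus the VHID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


# Hypothetical values: both gateway sets use VHID 1 with different passwords.
production_mac = carp_virtual_mac(1)
test_bed_mac = carp_virtual_mac(1)
print(production_mac == test_bed_mac)  # the password never enters the calculation

# Toy switch MAC table: the port that most recently sent a frame from a
# given source MAC is where traffic for that MAC gets forwarded.
mac_table: dict[str, str] = {}


def learn(port: str, source_mac: str) -> None:
    mac_table[source_mac] = port


learn("port-production", production_mac)
learn("port-test-bed", test_bed_mac)  # the new gateways start advertising
print(mac_table[production_mac])      # frames for the gateway MAC now go here
```

Choosing a distinct VHID for the test bed would have produced a distinct virtual MAC and avoided the overlap.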
We didn't catch this the first time because we were distracted by a red herring: our switches showed a spanning tree topology change at the same time as the first outage. Though we carefully went through all commands that had been run on the switches within the previous hour and couldn't find anything dangerous, we assumed we were missing some subtle effect and stopped working on them until we could do further research (which later revealed that the topology change was innocuous).
Outages suck. The point of these postmortems is both to apologize (profusely) and to explain what happened, so you know we take these issues very seriously. This particular behavior should not affect us again since we now know it exists, and hopefully we can help make FreeBSD better from our discovery of this undocumented behavior.
* This is vaguely alluded to in the IP Balancing section of OpenBSD's carp(4) man page but, as we don't use that feature, we hadn't studied that portion closely.