As previously noted, we experienced a large outage that brought down all of our services.
The outage lasted from 5 AM GMT to 8:30 AM GMT. If this got in your way, contact us so we can make it right for you.
Here is what happened:
- All of our racks are configured with dual PDUs (power distribution units), each connected to a separate power circuit. The idea is that servers are split between the two PDUs and can automatically fail over to the other if one fails.
- At 5 AM GMT, a PDU in one of our cabinets failed, causing all devices that were drawing power from that PDU to fail over to the second one.
- When this occurred, the extra load on the second PDU was more than its circuit could carry, and the circuit tripped.
- This cabinet housed both our primary and secondary routers, each connected to a different PDU. With both PDUs in the cabinet without power, both routers went down.
- I worked with a technician to restore power and then replaced the failed PDU. This took a considerable amount of time because the technician is obligated to confirm that the root cause is understood and that further issues won't occur.
As you can see from this, we have to make some changes to our design. For starters, our primary and secondary routers will be split not only between separate power circuits, but between completely different racks. With that in place, this sort of cascading failure would not be able to bring service to a full stop like it did today. I am committed to putting this in place within the next 14 days.
Our second objective is to increase the spare power capacity in each rack so that one failing PDU will not overload the other. This will take more time, but we are treating it as a high priority.
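To make the capacity goal concrete, here is a rough sketch of the kind of check involved: a rack only survives a single PDU failure if the combined load of both PDUs fits on one circuit. Everything in it is an assumption for illustration, including the rack names, the amperage figures, the 30 A circuit rating, and the 80% continuous-load derating factor. None of these are our actual measurements.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str                  # hypothetical rack label
    circuit_limit_amps: float  # breaker rating of each circuit (assumed equal)
    load_a_amps: float         # steady-state draw on PDU A
    load_b_amps: float         # steady-state draw on PDU B


def survives_single_pdu_failure(rack: Rack, derate: float = 0.8) -> bool:
    """Return True if one PDU's circuit can carry the whole rack's load.

    `derate` is an assumed safety factor: only this fraction of the
    breaker rating is treated as usable for continuous load.
    """
    usable_amps = rack.circuit_limit_amps * derate
    # After a failover, the surviving circuit carries both loads combined.
    return rack.load_a_amps + rack.load_b_amps <= usable_amps


# Made-up example figures, not real measurements.
racks = [
    Rack("rack-1", circuit_limit_amps=30.0, load_a_amps=14.0, load_b_amps=13.0),
    Rack("rack-2", circuit_limit_amps=30.0, load_a_amps=9.0, load_b_amps=10.0),
]

at_risk = [r.name for r in racks if not survives_single_pdu_failure(r)]
print("Racks that would trip on failover:", at_risk)
```

In this made-up example, rack-1 looks healthy in normal operation because each PDU carries well under its limit, yet the combined 27 A exceeds the 24 A usable on one circuit. That is the same shape of failure described above.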
As always, I appreciate your patience in these matters. Again, if this outage prevented you from doing your job, let us know.