If you just want the summary: a system designed to contribute to the stability of Kiln masked an upgrade issue by failing gracefully rather than dying horribly during testing. In the future, we will be testing against brittle systems before rolling out to the more resilient infrastructure where customer instances live.
For the full explanation, read on.
At Fog Creek, we've spent the last couple of weeks changing our infrastructure to make it possible to roll out new versions of Kiln gradually, in a way that minimizes customer impact. When the project is complete, we hope to get to a world where you'll never see the "Upgrading Account" screen again.
As part of that, on Wednesday, we rolled out a brand-new component to the Kiln Storage Service that's vital to achieving this kind of fine-grained, super-testable rollout.
We tested the service thoroughly on two boxes that were out of production, then rolled it out to our official servers. Everything appeared to work perfectly in initial testing, but we quickly realized something had gone wrong and rolled everything back to its pre-deployment state. The gap between realizing something was wrong and having it fixed was around three minutes, and we believe things may have begun misbehaving about four minutes before that, leaving Kiln users unable to push to or pull from their Kiln installs for up to about seven minutes.
We have a policy at Fog Creek of asking the Five Whys to get to the bottom of why a problem occurred. So: why did we have this downtime?
- Why were you unable to push or pull? Because the Kiln Storage Services went offline.
- Why did they go offline? Because key configuration files had been changed to contain bogus values that took the Kiln Storage Service down.
- Why did the configuration files get altered? Because we had deployed the above-mentioned upgrade service as part of our move towards super-targeted, gradual roll-outs of new Kiln features.
- Why were the invalid configuration files not immediately caught? Because the files were superficially valid, and tests both on the testing server and immediately after rolling out the service to our production servers indicated that everything was working perfectly.
- Why did everything appear to be working perfectly? Because we have designed the Kiln Storage Service to be redundant and to adopt new configuration changes in a way that avoids having any downtime. A side-effect of that is that configuration files won't necessarily be reread until a few minutes after the changes have been made. This meant we didn't test the new configuration; we tested the old one.
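The deferred-reread behavior described above can be sketched in miniature. This is a hypothetical illustration, not Kiln's actual code: the class name, the `backend` key, and the interval are all invented. The point is that a request made immediately after a config change still answers from the old, known-good configuration, so a quick post-deployment test passes even when the new file is bogus.

```python
import json
import time

# Illustrative interval: changes only take effect after the next poll.
RELOAD_INTERVAL_SECONDS = 180

class HotReloadingService:
    """Toy service that rereads its config on an interval, not per request."""

    def __init__(self, config_path):
        self.config_path = config_path
        self.config = self._read_config()
        self.last_reload = time.monotonic()

    def _read_config(self):
        with open(self.config_path) as f:
            return json.load(f)

    def handle_request(self):
        # The config is only reread once the interval has elapsed, so a
        # request made right after a config change still sees the old
        # values -- masking a bad new config during a quick smoke test.
        if time.monotonic() - self.last_reload >= RELOAD_INTERVAL_SECONDS:
            self.config = self._read_config()
            self.last_reload = time.monotonic()
        return self.config["backend"]
```

This no-downtime reload is exactly the resilience feature we wanted; its side effect is that a test run in the minutes after a change exercises the previous configuration, not the one just written. So what are we changing to keep this from happening again?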
Here's what we'll be doing differently:

- When we make service changes like this that can impact all customers, we'll target maintenance windows, even when we don't expect downtime.
- When we test configuration changes on our testing servers, we'll force them through a full shutdown and restart sequence to make sure that no transient configuration is sticking around.
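The second change above amounts to testing a cold start: parse the on-disk configuration from scratch, exactly as a freshly restarted process would, instead of trusting a live process that may still be serving from memory. A minimal sketch, assuming a JSON config and an invented schema (the `REQUIRED_KEYS`, `backend`, and `port` names are illustrative, not Kiln's):

```python
import json

REQUIRED_KEYS = {"backend", "port"}  # assumed schema, for illustration only

def cold_start_check(config_path):
    """Parse the config from scratch and reject superficially valid files."""
    with open(config_path) as f:
        config = json.load(f)  # fails outright on unparseable config
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # A file can be valid JSON and still contain bogus values, so check
    # the values themselves, as a restarted service effectively would.
    if not (0 < config["port"] < 65536):
        raise ValueError(f"bogus port: {config['port']}")
    return config
```

Running this on the testing server after every configuration change surfaces a bad file immediately, rather than minutes later when a long-lived process finally rereads it.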