As part of the Kiln and FogBugz weekly release process, we upgrade all of our Fog Creek On Demand accounts to the newest stable version of the code on Saturdays just before midnight.
This past weekend, we executed a version bump to all customers, and everything initially appeared to be okay. By 7 PM EST Sunday, however, we realized that Kiln On Demand accounts were slowly dropping offline.
We had substantial difficulty diagnosing the problem. While we realized quickly that Kiln accounts were dropping offline because they were unable to communicate with their associated FogBugz instance, it was extremely unclear why they were having trouble doing so. By around 9 PM EST, we had successfully diagnosed the issue.
There are three separate components that power a Kiln installation: Kiln itself; FogBugz; and a Kiln plugin for FogBugz that allows the two to communicate. For various technical and workflow-related reasons, upgrading both FogBugz and Kiln is fully automated by our deployment system, but upgrading the Kiln plugin is not. The Kiln plugin that was distributed as part of Saturday’s version bump was unfortunately very out-of-date. After a customer generated an indeterminate amount of activity, Kiln would try to make use of new functionality that was only in the newer version of the plugin, at which point Kiln would cease working for that customer. Until that point, Kiln would appear to be functioning perfectly well.
Once we had diagnosed the problem, we knew how to fix it, but had difficulty assembling the staff necessary to address the problem late on a Sunday. Once we had the staff available and had attempted a fix, we were unable to verify whether it actually worked without applying it to all customers, rather than just our test accounts. And once we decided to take that risk, we believed that the fix had failed when it had in fact succeeded, because our test accounts were themselves broken for a totally unrelated reason.
The situation was completely resolved by roughly 12:30 AM EST Monday. We know that 6 customers experienced this issue before it was resolved.
We are taking several steps to avoid getting into this situation again:
- The weekly upgrade will be moved from Saturday night to Wednesday night at 10 PM EST, one of the periods of lowest traffic across the FCOD network. The weekly upgrades are tiny and incur minimal downtime; by doing them midweek, we ensure that we will have the staff on hand to deal with any unexpected issues that arise.
- We will find a way to fully automate the currently manual upgrade of the Kiln plugin.
- We will be modifying the communication between FogBugz and Kiln so that it fails early and loudly during our initial tests if there is a problem, rather than attempting to soldier on until unable to do so. This will prevent a problem like this from being hidden from our QA and deployment teams.
- We will be spreading knowledge of how to address synchronization problems more widely across the team so that, in the case of a problem late on a Sunday evening, fewer mission-critical team members must be contacted to implement a solution.
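
To illustrate the "fail early and loudly" item above, here is a minimal sketch of a version handshake that could run as the first step of a post-deployment test. It is not our actual implementation: the names `Version`, `PluginVersionError`, and `check_plugin_version`, as well as the three-part version format, are assumptions made for the example. The idea is simply that Kiln compares the plugin version reported by FogBugz against the minimum it requires, and refuses to continue if the plugin is too old, instead of running until it first touches missing functionality.

```python
from dataclasses import dataclass


class PluginVersionError(RuntimeError):
    """Raised when the installed plugin is older than this build requires."""


@dataclass(frozen=True, order=True)
class Version:
    # Fields are compared in order (major, then minor, then patch).
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Assumes a hypothetical "MAJOR.MINOR.PATCH" format, e.g. "2.1.0".
        major, minor, patch = (int(part) for part in text.strip().split("."))
        return cls(major, minor, patch)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def check_plugin_version(reported: str, minimum_required: str) -> None:
    """Raise immediately if the reported plugin version is too old.

    `reported` is the version string the plugin announces; how it is
    fetched (HTTP call, config file, etc.) is outside this sketch.
    """
    have = Version.parse(reported)
    need = Version.parse(minimum_required)
    if have < need:
        raise PluginVersionError(
            f"plugin version {have} is older than required {need}; "
            "aborting before serving any requests"
        )


if __name__ == "__main__":
    check_plugin_version("2.1.0", "2.1.0")  # compatible: returns quietly
    try:
        check_plugin_version("1.9.7", "2.1.0")  # stale plugin
    except PluginVersionError as err:
        print(f"deployment check failed: {err}")
```

Run as part of the initial tests after each upgrade, a check like this would turn a stale plugin into an immediate, visible error for the deployment team rather than a failure that surfaces hours later on a customer account.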