On Saturday night and Monday night, we upgraded 40% of Kiln and FogBugz customers to new versions of both products. Last night we upgraded the remaining 60% of our users. Unfortunately, the upgrade took down Kiln for a short time. To explain what went wrong and how we'll make sure it doesn't happen again, here's our post-mortem of the issue.
As mentioned in last night's post, the outage was caused by HTTP OPTIONS requests failing when sent with an empty Host header. Normal Kiln requests always carry a Host header that identifies the account the request is for. Our load balancer, however, sends OPTIONS requests without a Host header, just to verify that IIS is up and running. If it sees failures, it stops passing requests on to that web server. That way, when an individual web server goes down, the load balancer automatically moves requests to servers that are still up. In this case the OPTIONS request failed on both machines, so the load balancer stopped sending requests to either of them and began returning HTTP 503 errors.
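To make the failure mode concrete, here is a minimal Python sketch of the health-check behavior described above. It is a simulation, not our load balancer's actual configuration: the `LoadBalancer` class, the two backend functions, and the host name `account.example.com` are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List

# A backend is modeled as a function: (method, headers) -> HTTP status code.
Backend = Callable[[str, Dict[str, str]], int]


def healthy_backend(method: str, headers: Dict[str, str]) -> int:
    """Answers every request, including a probe with no Host header."""
    return 200


def buggy_backend(method: str, headers: Dict[str, str]) -> int:
    """Assumed stand-in for the bad release: fails if Host is missing or empty."""
    if not headers.get("Host"):
        return 500
    return 200


class LoadBalancer:
    """Hypothetical load balancer that probes backends with OPTIONS."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends
        self.in_rotation: List[Backend] = list(backends)

    def run_health_checks(self) -> None:
        """Probe each backend with OPTIONS and no Host header; keep 2xx responders."""
        self.in_rotation = [
            b for b in self.backends if 200 <= b("OPTIONS", {}) < 300
        ]

    def handle(self, method: str, headers: Dict[str, str]) -> int:
        """Forward to the first backend in rotation, or return 503 if none remain."""
        if not self.in_rotation:
            return 503
        return self.in_rotation[0](method, headers)


# Both web servers run the buggy release, so both fail the probe.
lb = LoadBalancer([buggy_backend, buggy_backend])
lb.run_health_checks()
print(lb.handle("GET", {"Host": "account.example.com"}))  # 503
```

Note that the customer request in the last line is perfectly valid and would have succeeded on either server. It gets a 503 only because the probe, not the request itself, failed.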
The bug didn't bite until we had upgraded all accounts because, by default, requests for unknown accounts go to the default version of Kiln, which wasn't changed until last night. The load balancer's health check names no account, so it is handled the same way. We rolled this version out over a few days because it changed the way Kiln's backend processing works and we wanted to make sure it scaled. Releasing it to progressively larger numbers of accounts was our way of monitoring for and catching any issues with that. One change we made in this release was to upgrade all of the accounts based in time zones that would have been affected by a 10pm upgrade over the weekend. Unfortunately, due to the nature of the bug, all Kiln accounts were taken down, not just those being upgraded.
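The routing behavior that hid the bug during the staged rollout can be sketched in a few lines of Python. This is an illustrative model only: the version table, the host names, and the function names are assumptions made up for the example, not Kiln's real routing code.

```python
from typing import Dict, Optional

# Hypothetical per-account version table; hosts are made-up placeholders.
ACCOUNT_VERSIONS: Dict[str, str] = {
    "alpha.example.com": "new",
    "beta.example.com": "old",
}


def pick_version(host: Optional[str], default_version: str) -> str:
    """Known accounts get their assigned version; anything else gets the default."""
    if host and host in ACCOUNT_VERSIONS:
        return ACCOUNT_VERSIONS[host]
    return default_version


def respond(version: str, host: Optional[str]) -> int:
    """Assumed behavior: only the new version fails when Host is missing or empty."""
    if version == "new" and not host:
        return 500
    return 200


def health_probe(default_version: str) -> int:
    """The probe carries no Host header, so it always lands on the default version."""
    return respond(pick_version(None, default_version), None)


print(health_probe("old"))  # 200: during the staged rollout the default was still old
print(health_probe("new"))  # 500: once the default flipped, the probe hit the bug
```

Upgraded accounts such as the hypothetical `alpha.example.com` were already on the new version during the rollout, but their requests always carried a Host header, so nothing exercised the failing path until the default changed.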
Going forward, we'll be doing a few things to make sure this doesn't happen again. First, we'll go back to doing upgrades over the weekend, when they cause the least disruption if an outage occurs. Second, we're working on a staging environment for Kiln (and FogBugz) that closely mirrors our production system. If this had been in place, we would have caught this issue well before releasing to our customers. Finally, we'll test requests to invalid accounts before releasing.
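As a rough idea of what a pre-release check for invalid-account requests might look like, here is a small Python sketch. It is only an illustration under stated assumptions: the handler signature, the case list, and the host name `no-such-account.example.com` are hypothetical, and a real check would send these requests to a staging server rather than call a function.

```python
from typing import Callable, Dict, List, Tuple

# A handler is modeled as a function: (method, headers) -> HTTP status code.
Handler = Callable[[str, Dict[str, str]], int]

# Requests that do not map to any real account (placeholder host name).
INVALID_ACCOUNT_CASES: List[Tuple[str, Dict[str, str]]] = [
    ("no Host header", {}),
    ("empty Host header", {"Host": ""}),
    ("unknown account", {"Host": "no-such-account.example.com"}),
]


def failing_cases(handler: Handler, method: str = "OPTIONS") -> List[str]:
    """Return the names of invalid-account cases that produce a 5xx status."""
    return [
        name
        for name, headers in INVALID_ACCOUNT_CASES
        if handler(method, headers) >= 500
    ]


def candidate_release(method: str, headers: Dict[str, str]) -> int:
    """Assumed stand-in for a release that breaks when Host is missing or empty."""
    if not headers.get("Host"):
        return 500
    return 200


print(failing_cases(candidate_release))
# ['no Host header', 'empty Host header']
```

A release gate could refuse to ship whenever `failing_cases` returns a non-empty list.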