Kiln On Demand suffered an outage lasting approximately 33 minutes (12:02 to 12:35 EDT). The outage has been resolved and is not likely to recur. During the outage, customers experienced a combination of slow requests and HTTP 503 errors. The website and Mercurial/Git operations were affected equally. Service for customers still on Kiln 2.9.52 or using developers.kilnhg.com was restored approximately 10 minutes earlier than for customers on Kiln 3.0.27.
The cause was a caching server that was slow to serve cache hits because of an unexpected surge in memory usage. Requests began backing up on the Kiln website as the web servers waited longer and longer for page loads. Once we detected the problem, we rebooted the cache server. To clear the queued requests, which were still causing problems but had no chance of completing, we recycled the Kiln application pools. The cache server came back online successfully, but the flood of failed requests from the recycle caused the application pools to be disabled.
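To illustrate the failure mode described above, here is a minimal sketch of a cache lookup that stops waiting after a short deadline and falls back to the origin data source, so that a slow cache cannot hold a web request indefinitely. This is not Kiln's code or the change that was actually made; the function names, the timeout value, and the pool size are all hypothetical and chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for cache lookups (size is an arbitrary, illustrative value).
_cache_pool = ThreadPoolExecutor(max_workers=8)


def fetch_with_cache_timeout(key, cache_get, load_from_origin, timeout_s=0.05):
    """Return the cached value if the cache answers in time, else use the origin.

    cache_get and load_from_origin are hypothetical callables taking a key.
    """
    future = _cache_pool.submit(cache_get, key)
    try:
        value = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The cache is too slow: stop waiting instead of queueing behind it.
        future.cancel()  # only has an effect if the lookup has not started yet
        return load_from_origin(key)
    except Exception:
        # The cache lookup itself failed: treat it like a miss.
        return load_from_origin(key)
    if value is None:
        # Ordinary cache miss.
        return load_from_origin(key)
    return value


# Stand-ins for a healthy cache, a degraded cache, and the origin data source.
def fast_cache(key):
    return "cached:" + key


def slow_cache(key):
    time.sleep(0.5)  # simulates a cache server under memory pressure
    return "cached:" + key


def origin(key):
    return "origin:" + key
```

With these stand-ins, `fetch_with_cache_timeout("page", slow_cache, origin)` returns the origin value after roughly the timeout rather than waiting out the slow lookup. Note that a timed-out lookup still occupies a worker thread until it finishes, so in practice the timeout is better enforced by the cache client itself, and falling back to the origin shifts load there.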
We manually started the application pools and adjusted their settings to handle this situation more gracefully should it arise again. At that point, service was restored to all customers.
Engineers are monitoring the situation and are seeing no early warning signs of a recurrence.