Posted by Bradford Ley at 12:49 AM | Permalink
Update: Further investigation leads us to believe that additional work will be required on the Kiln data structures. This is safe work, but it involves taking Kiln offline for longer than originally estimated. We now expect that Kiln On Demand will be offline for approximately 45 minutes. We still intend to start promptly at 2200 EST.
Additionally, some FogBugz accounts will need to be taken offline at the same time. Those accounts will be taken down individually and will be unavailable for 5-10 minutes each. That maintenance will also start at 2200 EST. If you need to know whether your account is on that list, please don't hesitate to contact us.
Given the extra work, we're extending the maintenance window from 90 minutes to two hours to ensure we have sufficient time to complete our tasks.
We will perform maintenance on the infrastructure behind Kiln On Demand on Thursday, Jan. 3 at 0300 UTC (Wednesday, Jan. 2, 2200 EST).
The maintenance will adjust how web and Mercurial requests are terminated in the environment. The goal is to provide better management behind the scenes for us, better logging, and to open up new options for future features. The maintenance window will open promptly at Jan. 3, 0300 UTC (Jan. 2, 2200 EST) and will remain open until 0430 UTC (2330 EST).
In the ideal scenario, customers can expect Kiln On Demand to be unavailable for approximately 45 minutes. Given the scope of the maintenance, this downtime could last as long as 1 hour. We will abort and roll back before downtime extends beyond that point.
This maintenance should not change our host fingerprints, so no client-side adjustments should be required. If such a change should be required, we will let our customers know in the announcement after the maintenance is complete.
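For customers who pin a fingerprint on the client side, a check like the following could confirm that nothing changed after the maintenance. This is only a sketch under assumptions: it assumes the fingerprint in question is the colon-separated SHA-1 digest of the server's TLS certificate (the form Mercurial's `[hostfingerprints]` setting accepted at the time), and the helper names `format_fingerprint` and `fingerprint_changed` are illustrative, not part of any Kiln tooling.

```python
import hashlib

def format_fingerprint(der_bytes):
    """Return the colon-separated SHA-1 hex digest of a DER-encoded certificate."""
    digest = hashlib.sha1(der_bytes).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def fingerprint_changed(der_bytes, pinned):
    """Compare a freshly computed fingerprint against a pinned one, ignoring case."""
    return format_fingerprint(der_bytes).lower() != pinned.strip().lower()

# To obtain der_bytes for a live host (the host name here is a placeholder):
#   import ssl
#   pem = ssl.get_server_certificate(("example.invalid", 443))
#   der_bytes = ssl.PEM_cert_to_DER_cert(pem)
```

If `fingerprint_changed` returns `True` for the value stored in your client configuration, the pinned entry would need updating.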
Posted by Bradford Ley at 06:27 PM | Permalink
Our migration maintenances are set to continue this weekend!
As always, a small subset of customers will be moved during the maintenance. If you need to know if you're going to be a part of it, please contact us.
The maintenance window will open promptly at 0300 UTC, Dec 30 (2200 EST, Dec 29) and will remain open for three hours, closing at 0600 UTC (0100 EST). Individual customers will experience between 5 and 30 minutes of service interruption during this period. That interruption may occur at any time during the maintenance window.
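For readers converting the window to their own clock, the UTC-to-EST arithmetic above can be checked with a short sketch. It assumes EST is a fixed UTC-5 offset (no daylight saving applies in late December) and takes the year 2012 from the archive month rather than from the post itself; the helper name `to_est` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is a fixed offset of UTC-5 (no daylight saving in December).
EST = timezone(timedelta(hours=-5), name="EST")

def to_est(utc_dt):
    """Convert a timezone-aware UTC datetime to EST."""
    return utc_dt.astimezone(EST)

# Window from the announcement: opens at 0300 UTC on Dec 30 and stays open
# for three hours. The year 2012 is inferred, not stated in the post.
window_open = datetime(2012, 12, 30, 3, 0, tzinfo=timezone.utc)
window_close = window_open + timedelta(hours=3)

print(to_est(window_open).isoformat())
print(to_est(window_close).isoformat())
```

The converted values fall on Dec 29 at 2200 and Dec 30 at 0100, matching the EST figures quoted in the announcement.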
Posted by Bradford Ley at 04:07 PM | Permalink
The post-mortem for the service interruption we had on December 5th is described below.
All referenced times are in EST, which is UTC-05:00.
What was the impact?
Kiln users received 503 errors from about 7:09 AM to 7:50 AM, preventing access to Kiln On Demand. From 7:50 AM to about 9:30 AM, 503 errors lingered for some users and continued to prevent their access to Kiln On Demand.
So, what happened?
The first issue lasted from 7:09 AM until about 7:50 AM. One of our Kiln servers attempted to use more memory than was available. The server was forced to swap heavily and take corrective action by terminating some processes. Kiln users directed to this server received 503 errors. To alleviate the issue, we rebooted the server. Core services providing Kiln access did not start up correctly on the newly rebooted server.
At about 8:00 AM we successfully started all processes. At the same time, a second, related server received an excessive number of requests from our backup mechanism. This prevented the server from fulfilling some external requests to Kiln, forcing those requests to queue on the load balancer. Under that request load, the load balancer stopped accepting some requests and returned 503s to some Kiln users.
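The queueing behavior described above can be illustrated with a toy model. This is a minimal sketch, not our actual load balancer or its configuration: the class `ToyLoadBalancer`, the `max_queue` limit, and the use of 202 as a stand-in for "accepted and waiting" are all hypothetical.

```python
from collections import deque

class ToyLoadBalancer:
    """Toy model: a bounded request queue in front of a slow backend."""

    def __init__(self, max_queue):
        self.max_queue = max_queue  # hypothetical queue limit
        self.queue = deque()

    def accept(self, request_id):
        """Queue the request, or reject it with 503 when the queue is full."""
        if len(self.queue) >= self.max_queue:
            return 503
        self.queue.append(request_id)
        return 202  # stand-in status: accepted and waiting for the backend

    def drain(self, capacity):
        """Let the backend fulfill up to `capacity` queued requests, oldest first."""
        served = []
        while self.queue and len(served) < capacity:
            served.append(self.queue.popleft())
        return served

lb = ToyLoadBalancer(max_queue=3)
# The backend is busy, so nothing drains while five requests arrive.
statuses = [lb.accept(i) for i in range(5)]
# Once the backend recovers, queued requests are fulfilled and new ones fit again.
served = lb.drain(capacity=2)
status_after_drain = lb.accept(5)
```

In this model, only the requests that arrive while the queue is full see a 503; requests already queued are fulfilled once the backend has capacity again.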
At about 9:30 AM, the Kiln On Demand service was fully restored.
How was it repaired?
In addition to the server reboot, we immediately terminated the backup process, swapped servers in the pool, and restarted the core services on the swapped server. The servers and load balancers began to successfully fulfill the queued requests.
What's to prevent this from happening again?
We do not have sufficient evidence to determine a root cause for the original excessive memory demands, as the logs are quite bare. However, we've improved the load balancers' configuration to handle requests in this situation, and we have added monitoring to speed up diagnosis should this incident recur.
Posted by Derrick Miller at 07:30 PM | Permalink
We encountered a condition on two of our servers that prevented maintenance from going forward. After some brief troubleshooting, it was determined that this was better handled in the harsh light of day. This condition will not impact normal operation.
We have called off the maintenance and will reschedule it for another weekend. We will post a separate announcement when the maintenance has been rescheduled.
Posted by Bradford Ley at 10:41 PM | Permalink
Our migration maintenances are set to continue this weekend!
As always, a small subset of customers will be moved during the maintenance. If you need to know if you're going to be a part of it, please contact us.
The maintenance window will open promptly at 0300 UTC, Dec 9 (2200 EST, Dec 8) and will remain open for three hours, closing at 0600 UTC (0100 EST). Individual customers will experience between 5 and 30 minutes of service interruption during this period. That interruption may occur at any time during the maintenance window.
Posted by Bradford Ley at 06:16 PM | Permalink
The Kiln On Demand service is fully restored. We are no longer seeing lingering 503 errors.
We will post a full post-mortem of this incident on http://fogcreekstatus.com/. It will detail what happened, the root cause, and what we'll do in the future to avoid a similar failure.
We're sorry for the trouble! Please contact us if this materially affected your ability to do business.
Thanks for your patience.
Posted by Derrick Miller at 09:33 AM | Permalink
We still have reports of lingering 503 errors. The Kiln engineering lead and sysadmin team are actively troubleshooting the issue and taking steps to restart any services necessary. All servers remain up and running.
Posted by Derrick Miller at 09:20 AM | Permalink
Though most have been resolved, we continue to get reports of 503 errors when pushing to Kiln On Demand. The Kiln engineering lead and sysadmin team are currently working on this. Thanks for your patience.
Posted by Rich Armstrong at 08:25 AM | Permalink
The 503 errors our customers were seeing this morning are resolved.
At 07:09 EST, a backend server that the web servers read data from had a sudden memory spike that consumed all available memory and swap space. The server crashed, and the website returned HTTP 503 errors to any account trying to read from that backend server.
This was resolved by a reboot followed by manual intervention to restore all the services on the backend server.
Full service was restored at approximately 07:50.
Posted by Bradford Ley at 08:10 AM | Permalink