The SysAdmins of Fog Creek are here to give you some details on the outages and slowdowns you experienced using FogBugz and Kiln between November 7th and November 16th, 2016. This is meant to be a high level overview; if you have any questions or are interested in more detail on any specific issue, don't hesitate to contact us. We'd be happy to answer any questions you've got to any level of detail you'd like.
On Monday, November 7th, we performed a standard daytime rolling deploy of FogBugz. During this process, we progressively roll out the new version of FogBugz to a larger number of customers throughout the day. We closely monitor performance metrics throughout any deploy and did not identify any problems with the deploy at the time.
Starting on the morning of Tuesday, November 8th, our monitoring system alerted us that the Kiln backends which store customer repositories were having problems staying connected to the Kiln backend workers that perform various operations on the repositories, including tasks like cross-backend syncing and creating repository links in the UI. The backends expect to maintain a number of persistent connections to the queue server responsible for managing the backends jobs (Redis), and many of these connections were being dropped and timing out.
Combining these symptoms with our network monitoring pointed to us experiencing a significantly larger amount of traffic than normal across our internal routers. A review of the code that was deployed on the 7th showed no changes that would have caused the increase in network traffic.
One of our sysadmins identified the new traffic as syslog messages, and a host that was spamming the network with redundant log messages was identified. We disabled network logging on that host and immediately saw a reduction in traffic, and a minor reprieve in performance degradation, but not a complete fix. We were still seeing higher than expected amounts of traffic flowing across our network, but not from any particular host.
We spent the remainder of the week attempting to identify the source of the new traffic while mitigating the impacts of it, specifically: the Kiln backend workers disconnecting from the queue server.
Distinct from this network issue, early in the afternoon of Sunday, November 13th, we experienced a complete loss of power in the cabinet at our datacenter that holds our storage arrays (SANs). All of the SANs contain redundant controllers, with redundant power supplies, connected to redundant power distribution units (PDUs). Despite these precautions, it was not enough to prevent an outage in this instance because the surge protectors on both redundant PDUs flipped simultaneously. With the SANs offline, Kiln repositories and FogBugz databases were completely unavailable. In addition, the SANs backing our virtual machines (VMs) were offline as well, taking down most of our production infrastructure.
As soon as the SAN power failure was identified, a Fog Creek admin in New York immediately went to our datacenter and restored power to the SAN cabinet.
Upon bootup, we identified that one of our SANs, responsible for the storage of FogBugz customer databases, had experienced two power supply failures. These power supplies were located on distinct PDUs, suggesting that their failures had triggered the surge protection circuits on both and brought the cabinet offline. We physically disconnected the failed power supplies from the PDUs to prevent any possibility of another surge, and began working with our SAN vendor to replace the failed power supplies and restore full redundancy.
While the SANs were quickly brought back online, it took a substantial amount of work to bring all affected services back online. The servers that utilized storage volumes exposed by the SANs (SQL Server for Fogbugz databases, Kiln Backends for repositories) needed to be restarted and have their redundant iSCSI connections re-established. In addition, all of our production VMs were in ‘read-only’ mode due to their hard drives being forced offline, and all of them needed to be rebooted.
The power failure occurred at approximately 12:15 PM Eastern, and fixing the resulting issues took approximately 7 hours.
The next day, Monday, November 14th, we saw a recurrence of the network issues we saw the previous week. Upon further inspection, we discovered that during the deploy on November 7th, a change to our network logging configuration had gone to production before it was expected to. A large number of our productions servers were sending logs to our centralized logging system in a way that the logging servers could not understand; resulting in those production servers attempting to resend the messages and saturating the network. There were three reasons we didn’t identify this sooner:
- The logging volume in our staging environment is considerably lower than production, and so we didn’t identify the issue there.
- No particular host was responsible for the increased traffic; all hosts exhibited a slight increase in traffic that resulted in a significant amount of traffic flowing over the internal routers.
- The logging configurations are deployed by the same system that deploys the production code, but we had been looking at the production code for a possible cause, not the ‘ancillary’ configurations that were deployed along with it.
As soon as we realized the new logging configurations had been deployed earlier than expected, we rolled back those logging changes and immediately saw a dramatic improvement in application performance and drop in total network traffic. While all of our telemetry pointed to both the SAN outage and network saturation issues being resolved, out of an abundance of caution we did not issue an all clear. The rollback occurred after our peak usage times, and we wanted to ensure that the perceived resolution was not solely the result of lower incoming traffic.
While the logging issues were being investigated and rolled back, we simultaneously were receiving and installing replacement power supplies from our SAN vendor.
The next day, November 15th, the improvements we saw in performance persisted, but our typically level response times were still showing unusual spikes. Further research identified one of our production Elasticsearch nodes responsible for searching FogBugz cases was using a significantly larger amount of heap that other nodes in the cluster and was responding much more slowly than the other nodes. This is likely due to extra indexing pressure in the aftermath of the SAN outage. Upon restarting the failing node we saw an immediate levelling of response times for both FogBugz and Elasticsearch.
The next day, November 16th, all systems were again fully redundant and responding normally, and we provided an all clear notice.
There are a number of steps we are taking to prevent this from happening again:
- To prevent accidental deployment of changes such as the logging configuration, we are applying additional controls to our deploy process to test changes which are not in the application code but also affect the deploy.
- We are identifying weaknesses in our network monitoring that made it difficult to identify the sources of increased traffic on our network so that we can resolve network saturation problems more quickly in the future.
- We are exploring additional redundancy options available for our SANs, both physical and logical, to avoid the wide-reaching impacts of a large-scale power loss type event.
As mentioned above, this is meant to be a high level overview of the issues we experienced. We would be more than happy to provide more detail on any specific problem or on the steps we’re taking to prevent this from happening again. Please feel free to contact us with any questions or concerns that you have.