The previously mentioned CloudFront issue has been resolved by Amazon, and FogBugz & Kiln On Demand have fully recovered.
If you have any questions or concerns, please don't hesitate to contact us.
We're currently experiencing some problems with our CDN (Amazon CloudFront). This is preventing some static assets from loading for FogBugz & Kiln On Demand. We'll provide updates as they become available.
Update 8:30PM EST
CloudFront seems to be recovering. We're awaiting official confirmation from Amazon before sounding the all-clear.
We have completed our emergency maintenance on the misbehaving SQL server.
As predicted, the server was offline for 20 minutes. By 0320 UTC, most accounts on this server were available. The remaining data was loaded over the next 10 minutes.
Service was fully restored by 0329 UTC (2229 EST). Sure, it looks like I fudged that number to seem cool, but if I've learned anything over the past few days (I have) it's that I'm not cool. Just accurate.
Thank you for bearing with us. If you are experiencing any trouble with your account, please let us know!
After one SQL server had trouble today, we put together a test to see if any of the other servers were exhibiting symptoms that weren't visible to us.
Turns out we found one. Currently, its impact is very small. But as we saw with the server earlier today, that can all change in an instant. As a result, I'm going to perform emergency maintenance to update its network drivers tonight. The maintenance window will begin promptly at 0300 UTC (2200 EST) and be open for one hour. I expect the update, reboot, and database loads to take approximately 30 minutes. FogBugz and Kiln On Demand customers using this server should expect a service interruption starting at 0300 UTC and lasting until approximately 0320 UTC, with degraded performance for the next few minutes while the server finishes spinning up.
Sorry about the late notice. Hopefully this is the last of the problems we'll see.
If you have any questions, please reach out to us.
This post-mortem refers to today's trouble impacting a large percentage of our FogBugz and Kiln On Demand accounts.
We haven't been able to definitively trace the problem back to its root cause, but we strongly suspect it was due to a Windows Update run that we completed during our maintenance window this past Saturday night. The problem itself was a driver issue on a network card we quite intentionally left alone during maintenance. The driver version did not change during the update, which is why the root cause is puzzling.
The issue with the driver is that, for some hosts on its own subnet, it would no longer resolve ARP requests, and for some sockets that were opened it would, seemingly at random, not send data through the socket. I found this infuriating and ridiculous. I decided that updating the driver to our vendor's latest qualified version was the best plan, given limited information to work with and a malfunctioning network stack. Fortunately, it worked out and service was restored.
We've looked through the list of updates installed and haven't identified anything that should cause this problem. Moreover, all of the other SQL servers received the same update list and have the same hardware with the same drivers. None of the other servers displayed this problem. We're keeping a close eye on this machine, but the problem appears to be resolved.
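For readers who want to watch for the same kind of symptom (sockets that open but intermittently fail to move data), here is a minimal sketch of a repeated round-trip probe. This is not the test we ran. It assumes a hypothetical endpoint that replies to whatever it is sent, and the names `tcp_round_trip`, `probe`, and `ProbeResult` are invented for illustration.

```python
import socket
from dataclasses import dataclass


@dataclass
class ProbeResult:
    attempts: int
    failures: int

    @property
    def failure_rate(self) -> float:
        # Fraction of attempts that failed; 0.0 when nothing was attempted.
        return self.failures / self.attempts if self.attempts else 0.0


def tcp_round_trip(host, port, payload=b"ping", timeout=2.0) -> bool:
    """Open a fresh TCP connection, send a payload, and wait for any reply.

    Assumption: the (hypothetical) endpoint answers whatever it receives.
    A connect error, a timeout, or an empty reply all count as a failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            return len(sock.recv(1024)) > 0
    except OSError:  # includes refused connections and timeouts
        return False


def probe(check, attempts=50) -> ProbeResult:
    """Run `check` repeatedly and count how often it reports failure."""
    failures = sum(1 for _ in range(attempts) if not check())
    return ProbeResult(attempts, failures)
```

Something like `probe(lambda: tcp_round_trip("10.0.0.5", 7), attempts=200)` (address and port are placeholders) would then report a failure rate. Because the fault described above was intermittent, many fresh connections are more telling than one long-lived one.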
I apologize to any customers whose accounts were down due to this outage. If you were materially impacted, please don't hesitate to reach out to us and we'll make it right.
The previously announced service interruption is now resolved. Any affected customers should be able to continue using FogBugz and Kiln On Demand at this time.
We are closely monitoring the impacted database server for regressions. If any issues arise, please don't hesitate to let us know!
If you were materially impacted by this outage, please contact us!
We are currently experiencing transient issues on one of our database servers, resulting in intermittent interruptions to our On Demand services for some FogBugz users.
We've got all hands on deck working to resolve the issue as soon as we can.
It wasn't perfect. Some of you may have noticed our attempt at putting up maintenance pages, for instance. Our method for that was questionable in hindsight. Our old method had flaws aplenty and we consider this to be an improvement, but clearly we have a ways to go on that front.
The rest of the maintenance went quite well! Superior planning from the teams involved was the real victory here.
Services went offline promptly at 0300 UTC (2200 EST) and primary services were restored at 0557 UTC (0057 EST). There were some lingering application issues we were cleaning up (spoiled queues, some email delays due to the queueing of email during the downtime, etc.), but full service was restored by 0626 UTC.
That said, I'm confident we could have shaved significant time off of this maintenance with additional planning and preparation. There were some procedural items, particularly regarding vendor equipment, we learned too late to be helpful this time. Fortunately, our procedures have been updated and the next time this happens it will go much faster!
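For anyone double-checking the timeline above, here is a small sketch of the arithmetic. It assumes EST is a fixed UTC-5 offset and uses a placeholder calendar date, since only the clock times matter. The helper name `hhmm` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST treated as a fixed UTC-5 offset (no daylight saving).
EST = timezone(timedelta(hours=-5), "EST")


def hhmm(ts, tz):
    """Format a timestamp as a 24-hour HHMM string in the given zone."""
    return ts.astimezone(tz).strftime("%H%M")


# Placeholder date; only the times of day come from the post.
day = datetime(2000, 1, 2, tzinfo=timezone.utc)
offline = day.replace(hour=3, minute=0)    # services taken offline
primary = day.replace(hour=5, minute=57)   # primary services restored
full = day.replace(hour=6, minute=26)      # full service restored

print(hhmm(offline, EST), hhmm(primary, EST))  # local-time equivalents
print(primary - offline)                       # time until primary services
print(full - offline)                          # time until full service
```

Against the 5-hour window, that leaves a bit over an hour and a half of slack after full service was restored.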
If you're experiencing any trouble or if you have any questions, please don't hesitate to reach out to us!
This is a reminder that we will be performing extensive maintenance on our infrastructure on Saturday night. The maintenance window will open at 0300 UTC and remain open for 5 hours. Starting promptly at 0300 UTC we will be taking the Fog Creek websites, customer support, FogBugz On Demand, and Kiln On Demand offline in order to guarantee data integrity. While we are working diligently to ensure the interruption is significantly less than 5 hours, customers should expect that all services will remain offline for the duration of the maintenance.
Customer support email and voicemail will continue to accept messages, but emails won't be delivered until the maintenance is completed.
If you have any questions about this maintenance, please feel free to reach out to us.