Sorry for the late notice, but the upgrade announced here has been canceled and will be rescheduled.
We will be upgrading FogBugz On Demand to version 8.9.135 starting June 23 at 1400 UTC (1000 EDT). We will begin by upgrading a small portion of our account base to the new version. Following that, we will slowly leak the software to all accounts over the course of the day, completing by 2300 UTC (1900 EDT). During the account upgrade, customers will not experience any interruption.
Once all accounts have been upgraded, we will begin an upgrade of the database schema for each customer. The schema upgrade will occur at the account's local midnight (as determined by the Regional settings of the Site Configuration). During the schema upgrade, customers may experience up to 5 minutes of service interruption, with the majority of accounts experiencing much shorter interruptions.
If you have any questions about this upgrade, don't hesitate to contact us.
We will be deploying FogBugz On Demand version 8.9.132 to all customers. This upgrade addresses some infrastructure regressions before they become customer-impacting. We will begin by upgrading a small sample of our accounts to the new version. Afterwards, we will slowly release the software to all accounts over the course of the day, completing by 2100 UTC. Customers will experience no service interruption during this upgrade.
If you have any questions about this upgrade please contact us!
At 1300 EDT on June 10, we detected a communications failure in our datacenter. Service was restored approximately 25 minutes later.
The device that failed was a router serving as the transit point between our internal networks and the external firewalls. This was the same equipment we had been working on during the infrastructure maintenance on May 29th. The symptom of the failure, from a systems perspective, was that the router suddenly began treating /all/ inbound traffic, on all interfaces, as arriving with a CRC error. However, no other devices reported interface errors at the same time, leading us to believe that the packets were fine, but that the router was not computing the CRCs correctly.
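For readers who don't speak network: every Ethernet frame carries a CRC-32 checksum (the frame check sequence), which the receiver recomputes to decide whether the frame was damaged in transit. Here's a minimal sketch of that check in Python, using the standard library's zlib.crc32 (the same polynomial as the Ethernet FCS); the frame contents and the "bad receiver" are made up for illustration:

    import zlib

    def fcs(payload: bytes) -> int:
        # Compute a CRC-32 frame check sequence over the payload.
        return zlib.crc32(payload) & 0xFFFFFFFF

    def accept(payload: bytes, sent_fcs: int, crc=fcs) -> bool:
        # A receiver recomputes the CRC and compares it to the sender's.
        return crc(payload) == sent_fcs

    frame = b"a perfectly healthy packet"
    checksum = fcs(frame)

    # A healthy receiver accepts the frame.
    assert accept(frame, checksum)

    # A receiver whose CRC routine has gone bad (simulated here by
    # flipping a bit of its own computation) rejects *every* frame,
    # even though nothing on the wire is actually corrupt.
    bad_crc = lambda p: fcs(p) ^ 0x1
    assert not accept(frame, checksum, crc=bad_crc)

That's the behavior we saw: one device convinced the world was corrupt, while the world was in fact fine.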
This failure mode prevented our automated redundancy from activating because that system uses advertisement-based monitoring. The secondary router didn't know that the primary had malfunctioned; it was still receiving advertisements indicating the primary was alive and well. We fixed this by actively isolating the malfunctioning device from the network. The advertisements stopped, the secondary router came online, and it has been operating as the primary since then.
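To make the gap concrete, here is a minimal sketch of advertisement-based monitoring (in Python, with hypothetical names and timings; our routers actually speak a VRRP-style protocol, not Python). The key point is that the backup promotes itself only when advertisements /stop/, so a primary that is sick but still advertising looks perfectly healthy:

    import time

    ADVERT_INTERVAL = 1.0              # primary advertises once a second
    DEAD_AFTER = 3 * ADVERT_INTERVAL   # presume dead after 3 missed adverts

    class BackupRouter:
        def __init__(self):
            self.last_advert = time.monotonic()
            self.active = False

        def on_advertisement(self):
            # Fires whenever the primary says "I'm alive." Note that this
            # says nothing about whether the primary is actually
            # forwarding packets correctly.
            self.last_advert = time.monotonic()

        def tick(self):
            # Promote ourselves only after the primary falls silent.
            if not self.active and time.monotonic() - self.last_advert > DEAD_AFTER:
                self.active = True

In our incident, on_advertisement() kept firing while the primary mangled every inbound packet, so tick() never promoted the backup. Isolating the primary silenced the advertisements, the timeout elapsed, and failover finally happened.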
Why did that happen?
Why this happened is a more difficult question. The symptoms hit, simultaneously, two pieces of hardware that don't share a riser, driver, or even a bus on the motherboard. The component nearest both impacted pieces is the OS kernel. Unfortunately, it never produced a stack trace that we could evaluate. It's entirely possible the system was certain it was doing the right thing and was "simply" mis-computing the CRCs. Without additional information, we cannot confidently determine the root cause.
Do you think it happened because of the previous maintenance?
If I had a nickel for every time I've asked myself that over the past few days...
Right now we can't say for sure. My intuition, for what it's worth, is leaning toward a driver interaction problem. The error rate has a /very loose/ correlation with the traffic flowing from the internal interface to the external interface. The new hardware uses a driver we had not been using on this system previously, hence my suspicions.
That said, rolling back is a terrible idea. The heartache of limiting non-essential (but desirable) services -- not to mention taking discuss.joelonsoftware.com offline entirely -- is still fresh. Additionally, though users may not notice those stopgaps in the short term, they will do more harm in the long term. As a team, we've decided that the way out is forward, and I'm confident we can do that without causing too much pain for anyone.
What are you going to do to make sure it doesn't happen again?
Sadly, everything above makes what we do next both simple and unsatisfying:
1) Ensure we're in a position to more rapidly dump the kernel if (when) this happens again.
2) Make some systems configuration adjustments to allow us to fail (and recover) faster. (A sketch of both items follows below.)
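Neither item is a single switch we can flip, but to give a flavor of (1) and (2): assuming a Linux-based system, the adjustments look something like the sketch below. These paths and values are illustrative, not our exact production configuration:

    # Illustrative sysctl adjustments on a Linux system (run as root).
    SETTINGS = {
        # Panic (rather than limp along) when the kernel oopses, so we
        # get a crash dump instead of a half-alive router.
        "/proc/sys/kernel/panic_on_oops": "1",
        # Reboot automatically 10 seconds after a panic, so recovery
        # doesn't wait on a human noticing.
        "/proc/sys/kernel/panic": "10",
    }

    for path, value in SETTINGS.items():
        with open(path, "w") as f:
            f.write(value)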
In other words: Our expectation is that, in the event this happens again, we will be able to restore service significantly faster /and/ gather some real data to get to the root cause.
While none of us relish the idea of seeing this happen again, taking any more significant action would be an act of desperation. With the adjustments to our redundancy strategy, we're hoping to keep ourselves, and all of you, as far away from "desperate" as possible.
Finally, as always, if you were materially impacted by this interruption, please don't hesitate to reach out to us and we'll make it right.
We will be upgrading FogBugz On Demand to version 8.9.131 starting tomorrow (June 12) at 1300 UTC (0900 EDT). The upgrade will proceed in two stages, with the final stage completing by June 13, 2200 UTC. This takes the place of the canceled maintenance from last week.
The first stage will be upgrading the software. We will begin by upgrading a small, random portion of our account base to the new version. Following that, we will slowly leak the software to all accounts over the course of the day, completing by 2100 UTC. Customers will experience no service interruption during this upgrade.
The second stage will begin at 2100 UTC, June 12, when we will upgrade the database schema for each customer. This will occur at the account's local midnight (determined by the Regional settings of the Site Configuration). This stage will include up to 5 minutes of service interruption, but most accounts will experience a much shorter one. This stage will complete by 2200 UTC, June 13.
If you have any questions about this upgrade, please don't hesitate to contact us!
Update Tuesday, June 10th 1:50 PM EDT / 17:50 UTC: The On Demand background services have successfully caught up and are no longer delayed. You may continue with business as usual.
Update Tuesday, June 10th 1:41 PM EDT / 17:41 UTC: The On Demand background services, such as those that pull in new emails as cases, are currently delayed. They are expected to catch up very soon!
Update Tuesday, June 10th 1:24 PM EDT / 17:24 UTC: The On Demand services have resumed normal operation. The issue is resolved.
We're truly sorry for this unplanned downtime. If you were materially impacted during this time, please let us know!
Update Tuesday, June 10th 1:23 PM EDT / 17:23 UTC: We're making progress on returning the services to normal operation. You may see a few moments of the service returning; we'll let you know here when we have the final all-clear.
We're currently experiencing a network outage affecting all products including FogBugz On Demand, Kiln On Demand, and fogcreek.com.
We've got all hands on this issue right now. Stay tuned here for updates.
We have decided to cancel our upgrade of FogBugz On Demand to version 8.9.130. While we didn't expect any problems, the team was too busy to give proper attention to the deploy.
We will most likely deploy next Thursday instead. We will post to this blog when that schedule is confirmed.
Sorry for any inconvenience the rescheduling causes!
We will begin a gradual upgrade of all FogBugz On Demand accounts to version 8.9.130. We will upgrade a small cohort starting at 1300 UTC (0900 EDT) on Thursday, June 5. Throughout the day, we will continue to upgrade cohorts.
There is no downtime expected for this portion of the upgrade, which I still think is just awesome.
After the upgrade is complete, all accounts will have their database schema upgraded. Unfortunately, this portion of the upgrade will cause approximately 5 minutes of service interruption. This interruption will take place between midnight and 0100 in the account's local time. We determine that time based on the "Regional" settings in the account. We expect this portion to start at 2100 UTC (making UTC+3 the first cohort to experience a service interruption).
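For the curious, the cohort arithmetic is simple: an account hits local midnight when the UTC clock reads 24 minus its UTC offset, modulo 24. A quick sketch in Python (the offsets are examples, not a roll call of real accounts):

    def local_midnight_in_utc(utc_offset_hours: int) -> int:
        # The UTC hour at which an account at the given offset hits
        # local midnight: (24 - offset) mod 24.
        return (24 - utc_offset_hours) % 24

    # Schema upgrades start at 2100 UTC, so the first cohort interrupted
    # is whoever hits local midnight right then:
    assert local_midnight_in_utc(+3) == 21   # UTC+3: midnight at 2100 UTC
    assert local_midnight_in_utc(0) == 0     # UTC:   midnight at 0000 UTC
    assert local_midnight_in_utc(-4) == 4    # EDT:   midnight at 0400 UTC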
If you have any questions about this upgrade, please don't hesitate to contact us!
All went about as smoothly as we could have hoped. You are unlikely to have experienced any interruption at all. But we like to make sure that if things do go unexpectedly south, you're not surprised. Have a good night and an even better tomorrow!
We'll be performing some performance-related database maintenance tomorrow night (04JUN2014). If all goes well, you can expect a Kiln On Demand service interruption on the order of ten to fifteen minutes. So be prepared to push away from the desk, pull a cup of coffee, tag a few photos, and commit to a breather. We appreciate your patience!