At 02:00 Sunday morning (GMT) we will upgrade all accounts to FogBugz 8.7.2. Aside from internal bugfixes, the only significant change is:
- Updated Case to Salesforce plugin for FogBugz
Service impact should be no more than five minutes.
Posted by Shawn Hargan at 11:21 AM | Permalink
At 02:00 Sunday morning (GMT) we will upgrade all accounts to FogBugz 8.6.51. Changes include:
The outage should be short, lasting no longer than five minutes.
Posted by Tim Stewart at 03:06 PM | Permalink
The Kiln issues reported earlier today have been resolved. To clarify, the exact symptoms were:
No pushes or created repositories were lost -- this was strictly a display issue.
When you push code to Kiln, your data streams through to our Mercurial backends. As soon as that push is complete, those backends queue an event to inform the frontends that there's new information to be indexed and displayed. These events are handled by a worker process that, in this case, had slowed to a crawl and was unable to keep up with the incoming messages.
This slowdown was fixed by restarting the worker process. Our developers are working to discover why it slowed down in the first place, but we've added a band-aid script to ensure that events are processed promptly even if the slowdown happens again.
This also exposed an area of monitoring that we need to improve upon since, to our checks, it looked like this process was doing its job. In the future, we should catch these types of events before they're even noticeable.
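One way to close that kind of monitoring gap is to measure how stale the oldest unprocessed event is, rather than whether the worker process is running. The sketch below is illustrative only: the function names, the 60-second threshold, and the plain list of enqueue timestamps are all assumptions, not a description of Kiln's actual checks.

```python
import time

# Hypothetical threshold: how old the oldest pending event may get
# before the worker is considered stalled. Not a value from Kiln.
MAX_LAG_SECONDS = 60.0


def oldest_event_age(pending_timestamps, now=None):
    """Return the age in seconds of the oldest queued event (0.0 if the queue is empty)."""
    if not pending_timestamps:
        return 0.0
    if now is None:
        now = time.time()
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, now - min(pending_timestamps))


def worker_is_healthy(pending_timestamps, now=None, max_lag=MAX_LAG_SECONDS):
    """Judge health by queue lag, not by whether the process is alive."""
    return oldest_event_age(pending_timestamps, now) <= max_lag
```

A check like this would flag a worker that is still running but falling behind, which a simple "is the process up?" probe cannot see.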
Posted by Shawn Hargan at 05:57 PM | Permalink
Some Kiln On Demand accounts are currently experiencing problems, including being unable to push changesets to Kiln or create new repositories, as well as general UI slowness. The Kiln engineering team and sysadmins are actively working on the problem. We will post an update here when we know more.
Thank you for your patience.
Posted by Adam Wishneusky at 04:47 PM | Permalink
We're getting reports that people with accented characters in their names are having difficulty signing up for Trello.
Due to a bug in one of our underlying libraries, if you have a special character in your name and try to create an account, there's a mismatch and the authentication fails. We've reported this bug to our authentication library provider and hope they'll be able to resolve it in the next few days.
You can still sign up for Trello! Just use the normal signup option. We will temporarily convert any characters in your full name to ones that won't run into the bug, and you'll be able to change it back later, once the bug is fixed.
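For the curious, a conversion of this kind can be sketched in a few lines of Python. This is only an illustration of the general technique (Unicode decomposition followed by dropping combining marks); the function name is hypothetical and this is not necessarily how Trello performs the conversion.

```python
import unicodedata


def strip_accents(full_name):
    """Replace accented letters with their unaccented base letters."""
    # NFKD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", full_name)
    # Drop the combining marks and keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no decomposition, such as "ø", pass through this sketch unchanged, so a real implementation would need an extra mapping for them.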
The launch of Trello has been immensely popular, and we can say for sure that this is only the first of the bugs we will eventually find. Thanks for bearing with us!
Posted by Rich Armstrong at 01:05 PM | Permalink
Last night, we experienced two network interruptions at approximately 20:00 EST and 21:20 EST. The first outage was brief enough that not one customer reported it. The second outage lasted about 15 minutes before it was resolved, and was reported by several customers.
The Short Explanation
We're building a new set of gateways to be deployed this weekend. We took every precaution we knew of to keep these new machines separated from live traffic but, ironically, an undocumented behavior in one of our redundancy technologies caused the test bed to intrude on the live network.
This particular behavior won't affect us again since we now know it exists. We're also submitting documentation patches so that, hopefully, it will be better documented for others in the future. Ordinarily, if we'd had any inkling that this separate system could have affected production, we would have performed the work during the maintenance window. There was simply no way we could have known that this failover would happen. In the short term, we're reexamining the kind of work that goes on outside a maintenance window, with particular focus on our redundancy systems, to ensure we're not putting the environment at undue risk.
The Long Explanation
We are building out a new set of upgraded gateways to be deployed this weekend. The technology we use to provide IP failover between our gateways is CARP, an open source alternative to VRRP. There are two components to configuring CARP interfaces: the VHID and the CARP password.
The VHID is advertised by CARP to identify interfaces as belonging to the same redundancy group. The CARP password is there to ensure that only authorized interfaces are able to participate in that redundancy group. Advertisements made on a particular VHID will be ignored by the CARP interface if the advertisement doesn't use the same password.
With this in mind, we configured our new gateways with the same VHID but different IPs and different passwords to test our buildout but avoid affecting the production environment. What isn't documented in the OpenBSD man pages*, OpenBSD FAQ, or FreeBSD man pages is that the VHID is used to calculate the CARP interface's virtual MAC address. As a result, traffic intended for our gateways was switched to the new boxes.
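To make the collision concrete, here is a minimal sketch of the derivation, assuming the conventional CARP virtual MAC layout of a fixed 00:00:5e:00:01 prefix followed by the VHID as the final octet. The configuration dictionaries, passwords, and IP addresses are hypothetical placeholders, not our real settings.

```python
def carp_virtual_mac(vhid):
    """Return the virtual MAC for a CARP group, assumed to be 00:00:5e:00:01:<VHID>."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be in the range 1-255")
    return "00:00:5e:00:01:%02x" % vhid


# Hypothetical configurations: same VHID, different password and IP.
production = {"vhid": 1, "password": "prod-secret", "ip": "192.0.2.1"}
test_bed = {"vhid": 1, "password": "test-secret", "ip": "198.51.100.1"}

# Neither the password nor the IP enters the calculation, so both
# groups end up claiming the same virtual MAC address.
same_mac = carp_virtual_mac(production["vhid"]) == carp_virtual_mac(test_bed["vhid"])
```

Under that assumption, the password keeps the two groups from accepting each other's advertisements, but it does nothing to stop a switch from learning the shared MAC on the wrong port. Giving the test gateways a different VHID would have produced a different MAC.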
We didn't catch this the first time because we were distracted by a red herring: our switches showed a spanning tree topology change at the same time as the first outage. Though we carefully went through all commands that had been run on the switches within the previous hour and couldn't find anything dangerous, we assumed we were missing some subtle effect and stopped working on them until we could do further research (which later revealed that the topology change was innocuous).
Outages suck. The point of these postmortems is both to apologize (profusely) and to explain what happened so you know we take these issues very seriously. This particular behavior should not affect us again since we now know it exists, and hopefully our discovery of this undocumented behavior can help make FreeBSD better.
* This is vaguely alluded to in the IP Balancing section of OpenBSD's carp(4) manpage but, as we don't use that feature, we've not studied that portion closely.
Posted by Shawn Hargan at 10:50 AM | Permalink
Our On Demand infrastructure suffered two minor network outages this evening, first at 20:10 EST and again at 21:20 EST.
The outages were due to human error. We are gearing up for a scheduled maintenance window this weekend to replace our aging firewalls. During testing, we believed the new systems to be well-isolated but some traffic managed to leak into our production network and cause problems. The root cause was discovered during the second hiccup and was promptly repaired.
We are sorry for any inconvenience this has caused and appreciate your patience as we work to make On Demand better.
Posted by Tim Stewart at 10:12 PM | Permalink
At 02:00 GMT tonight we will upgrade Kiln On Demand to 2.6.38. This is a bugfix release that corrects a couple of non-customer-facing issues.
The upgrade is expected to be extremely fast but does include a minor database upgrade. If you run into the "We are upgrading your account" message tonight, give it just a minute or two and reload. The vast majority of you shouldn't even notice!
Posted by Shawn Hargan at 04:45 PM | Permalink
At 02:00 GMT tonight we will upgrade FogBugz On Demand to 8.6.48 and Kiln On Demand to 2.6.36. Along with the usual slew of minor bugfixes and performance enhancements, changes include:
FogBugz
Kiln
Developers, then assign that whole group write access to your repositories. Kiln's groups are shared with FogBugz, too, so any groups that you already have set up can be reused immediately.
Posted by Shawn Hargan at 03:35 PM | Permalink
If you just want the summary: a system designed to contribute to the stability of Kiln masked an upgrade issue by failing gracefully rather than dying horribly during testing. In the future, we will be testing against brittle systems before rolling out to the more resilient infrastructure where customer instances live.
For the full explanation, read on.
At Fog Creek, we've spent the last couple of weeks changing our infrastructure so that we can roll out new versions of Kiln gradually, minimizing customer impact. When the project is complete, we hope you'll never see the "Upgrading Account" screen again.
As part of that, on Wednesday, we rolled out a brand-new component to the Kiln Storage Services that's vital to being able to achieve this kind of fine-grained, super-testable rollout.
We tested the service thoroughly on two boxes that were out of production, then rolled it out to our official servers. Everything looked as if it worked perfectly in initial testing. But we rapidly realized something had gone wrong, and rolled everything back to the way it had been prior to service deployment. The gap between realizing something had gone wrong and having it fixed was around three minutes, and we believe things may have begun misbehaving about four minutes earlier than that, leaving Kiln users unable to push to or pull from their Kiln installs for up to about seven minutes.
We have a policy at Fog Creek of asking the Five Whys to try to get down to the bottom of why a problem occurred. So: why did we have this downtime?
Posted by Benjamin Pollack at 10:15 AM | Permalink