The planned maintenance went off without a hitch and all accounts should now be accessible. If you have any problems, please let us know by emailing us at customer-service@fogcreek.com or calling us at 1-866-FOGCREEK.
Posted by Michael Gorsuch at 03:53 AM | Permalink
Just a reminder that we will be performing maintenance for FogBugz On Demand. The work will begin in approximately 45 minutes.
Posted by Michael Gorsuch at 11:16 PM | Permalink
UPDATE: In the time it took me to write this post, it appears they have fixed their problems and we should be 100% again.
Our payment gateway provider is currently down, which means we cannot process any credit card orders. They have a web page where they post their status, but it says their gateway is still functioning perfectly :)
I spoke with a supervisor who said they were looking into it ASAP.
We're going to investigate cloning our payment process to use another provider during outages like this (and it should also give us some flexibility to save some money by switching to a different payment processor).
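To make the idea of falling back to a second provider concrete, here is a minimal sketch in Python. It is not our actual payment code: the `GatewayUnavailable` exception, the `charge_with_failover` function, and the shape of the provider adapters are all hypothetical names made up for illustration. It assumes each provider is wrapped in a function that raises `GatewayUnavailable` only when the gateway itself cannot be reached.

```python
from typing import Callable, List, Tuple

# Hypothetical adapter signature: (amount_cents, card_token) -> transaction id
ChargeFn = Callable[[int, str], str]


class GatewayUnavailable(Exception):
    """Hypothetical error an adapter raises when its gateway is unreachable."""


def charge_with_failover(
    providers: List[Tuple[str, ChargeFn]],
    amount_cents: int,
    card_token: str,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, transaction_id).

    Only GatewayUnavailable triggers the next provider. Any other error
    (for example a declined card) propagates to the caller unchanged.
    """
    errors = []
    for name, charge in providers:
        try:
            return name, charge(amount_cents, card_token)
        except GatewayUnavailable as exc:
            errors.append(f"{name}: {exc}")
    # Every provider was down: report all of the failures together.
    raise GatewayUnavailable("; ".join(errors))
```

The sketch deliberately fails over only on an outage, not on a decline. A real implementation would also need to handle the case where a request timed out after the first provider had already accepted it, so that a customer is never charged twice.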
Posted by Michael Pryor at 11:50 AM | Permalink
Wednesday, January 23, at approximately
So how are we going to fix this? Fortunately, the solution is pretty simple. We are going to create a monitor that will regularly log in to Copilot from both sides and ensure that data is flowing between the clients. If this process fails for any reason, we will be notified of it and can take action before it affects users.
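As a rough illustration of that kind of monitor, here is a minimal Python sketch. It is not the real Copilot monitor: `data_flows`, `run_check`, and the `send`, `receive`, and `notify` callables are hypothetical stand-ins for whatever actually logs in on each side and pages us. The idea is simply to push a unique probe in from one side, wait for it on the other, and raise an alert if it never arrives.

```python
import time
from typing import Callable, Optional


def data_flows(
    send: Callable[[bytes], None],
    receive: Callable[[float], Optional[bytes]],
    timeout_s: float = 5.0,
) -> bool:
    """Send a unique probe from one side and wait for it on the other.

    `send` and `receive` are hypothetical hooks for the two logged-in
    clients. `receive(t)` should wait up to t seconds and return the next
    chunk of data, or None if nothing arrived in time.
    """
    probe = f"probe-{time.time_ns()}".encode()
    deadline = time.monotonic() + timeout_s
    try:
        send(probe)
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            data = receive(remaining)
            if data is None:
                return False  # nothing arrived before the timeout
            if data == probe:
                return True   # our probe made it across
            # Anything else is unrelated traffic: keep waiting.
    except Exception:
        # Login or connection failures count as a failed check.
        return False


def run_check(check: Callable[[], bool], notify: Callable[[str], None]) -> bool:
    """Run one check and notify on failure; returns the check result."""
    ok = check()
    if not ok:
        notify("monitor: data did not flow between the two clients")
    return ok
```

In practice `run_check` would be called on a schedule (for example every few minutes), with `notify` wired to email or a pager.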
Posted by Tyler Hicks-Wright at 06:45 PM | Permalink
We will be taking down the second half of our FogBugz On Demand customers this weekend in order to migrate them to our shiny new database servers. This will improve reliability and performance for a number of our customers, as well as improve our ability to handle larger loads. An email was sent out to all customers who will be impacted, but to reiterate:
The outage will occur at approximately 00:01 EST on January 27th, 2008, and will end at 05:00 EST. During this time, your FogBugz On Demand account will not be available for use.
If you are interested in receiving updates on this and other outages, please subscribe to our RSS feed.
Posted by Michael Gorsuch at 10:06 AM | Permalink
We have completed all of our maintenance to the On Demand service (previously mentioned here). We have now upgraded all of our customers in our New York Data Center to our brand new database servers.
All of our tests are passing, but should you have any problems accessing your FogBugz On Demand account, please contact our customer service team at 866-FOGCREEK and we'll tackle it right away.
Posted by Michael Gorsuch at 03:24 AM | Permalink
Just a friendly reminder that FogBugz On Demand maintenance will begin in 15 minutes and end at 05:00 EST. Please see this earlier notice for more details. I will post an update upon completion.
Posted by Michael Gorsuch at 11:45 PM | Permalink
We will be taking down half of our FogBugz On Demand customers this weekend in order to migrate them to our shiny new database servers. This will improve reliability and performance for a number of our customers, as well as improve our ability to handle larger loads. An email was sent out to all customers who will be impacted, but to reiterate:
The outage will occur at approximately 00:01 EST on January 20th, 2008, and will end at 05:00 EST. During this time, your FogBugz On Demand account will not be available for use.
If you are interested in receiving updates on this and other outages, please subscribe to our RSS feed.
Posted by Michael Gorsuch at 04:33 PM | Permalink
Post mortem of this morning's outage:
Assuming that the third ‘why’ is correct, and it certainly is probable, then we have our root cause. Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred. Or it would have occurred once, and the standard would have been updated as appropriate. Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team. It should be clear that it is much more than that.
There is irony in the fact that our system administrator spent the early part of this week drafting a small set of policies and standards for our environment. He now has one more to add to the list.
Now, we could surely take the 5 Whys even further and discover that we would be better off with an HA router / switch configuration, etc. While I see that as fair, the above examination exposes a fundamental flaw in our approach to maintaining this environment which needs to be remedied before adding complexity.
Posted by Joel Spolsky at 05:11 PM | Permalink
There was a full outage in our New York data center this morning. Things began flapping around 3:30 AM, then settled down after 10 minutes, and we saw no need to panic. At approximately 5:00 AM, it happened again. We contacted Peer1, and they felt it must be connectivity and started investigating. Things came up around 5:30 AM, and Peer1 did not find anything. At approximately 6:15 AM it happened again, but this time it was a full outage. Again, Peer1 could not detect anything wrong with the connection. Michael went down to the data center, verified that our router could not talk to the outside world, and then moved the Peer1 network connection from our switch directly to our router. This cleared everything up.
Reason suggests that there is either a configuration error on the switch as a whole or an issue with just that port. We see no reason to think that the problem is on Peer1’s end. We are still investigating.
Mitigating factors: This outage did not affect FogBugz customers using the Los Angeles data center. Because the outage occurred during the North American night, most North American customers would not have been affected.
Posted by Joel Spolsky at 12:35 PM | Permalink