We will be performing maintenance on our credit card processor network during the day (EST) on Sunday, May 31. During this time, submitting credit card payments to Copilot or changing the credit card associated with your FogBugz On Demand account may result in a "Declined" message. Don't panic - just give it a few minutes and retry.
Total time of service impact should be less than an hour, but, because of the scope of this maintenance, the window is much wider than we usually like. We'll refactor this part of our network soon to reduce the impact of future maintenance.
Starting very early this morning, a number of users began running into issues when trying to access either FogBugz On Demand's API or BugzScout. This problem was resolved upon discovery at 6 AM.
FogBugz On Demand has a pool of servers dedicated solely to handling BugzScout and API requests, which helps keep the load down on the servers that are serving up the UI to our customers. Upon further research, we discovered that one of our API / BugzScout servers had a configuration mismatch that was causing the reported redirect loop.
I see two failures here. The first is obviously the configuration mismatch, which we have corrected by manually adjusting the configuration values. This bug was introduced in the original deployment / configuration of that specific server, and our deployment scripts and procedures have been updated to include the proper steps so that it doesn't happen again.
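For illustration only, here is a minimal sketch of how a mismatch like this could be caught automatically. It compares each server's settings against the value most of the pool agrees on. The server names, setting names, and the `find_config_mismatches` helper are all hypothetical and are not taken from our actual deployment scripts.

```python
from collections import Counter


def find_config_mismatches(configs):
    """Compare every server's settings against the pool majority.

    `configs` maps a server name to a dict of setting -> value.
    Returns {server: {setting: (actual, expected)}} for each value
    that differs from what most servers in the pool have.
    """
    # Collect every setting name seen on any server.
    keys = set()
    for cfg in configs.values():
        keys.update(cfg)

    mismatches = {}
    for key in sorted(keys):
        # The most common value is treated as the expected one (an assumption).
        counts = Counter(cfg.get(key) for cfg in configs.values())
        expected, _ = counts.most_common(1)[0]
        for server, cfg in configs.items():
            actual = cfg.get(key)
            if actual != expected:
                mismatches.setdefault(server, {})[key] = (actual, expected)
    return mismatches


# Hypothetical pool of three API / BugzScout servers.
pool = {
    "api1": {"base_url": "https://example.invalid", "force_ssl": "true"},
    "api2": {"base_url": "https://example.invalid", "force_ssl": "true"},
    "api3": {"base_url": "http://example.invalid", "force_ssl": "true"},
}
report = find_config_mismatches(pool)
print(report)
```

A majority vote only works when most of the pool is configured correctly. Comparing against a known-good reference configuration would be the safer choice where one exists.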
The second failure was one of monitoring. Normally, our monitoring system discovers services throughout our network and sets up monitoring automatically. However, there are a few services that require some manual configuration in order to be monitored correctly. I have updated our documentation to ensure that this doesn't get missed in the future, and we are also working on a visualization system that will help us audit and detect imbalances like this.
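The kind of audit described above can be sketched as a simple set comparison between the services we expect to be running and the checks the monitoring system actually has. This is an assumption-laden illustration, not our monitoring system's real interface. The `audit_monitoring` helper and the host and service names are hypothetical.

```python
def audit_monitoring(expected_services, monitored_services):
    """Report gaps between what should be monitored and what is.

    Both arguments are iterables of (host, service) pairs.
    "unmonitored" lists services with no check configured.
    "stale" lists checks for services that are no longer expected.
    """
    expected = set(expected_services)
    monitored = set(monitored_services)
    return {
        "unmonitored": sorted(expected - monitored),
        "stale": sorted(monitored - expected),
    }


# Hypothetical inventory: api2's BugzScout endpoint needs manual setup
# and was never added to monitoring.
expected = [("api1", "api"), ("api1", "bugzscout"), ("api2", "api"), ("api2", "bugzscout")]
monitored = [("api1", "api"), ("api1", "bugzscout"), ("api2", "api")]
gaps = audit_monitoring(expected, monitored)
print(gaps)
```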
Our colo facility is once again experiencing issues with their cabling, causing the connections on one of our routers to flap. We have taken that link out of the path, allowing the backup router to take over as planned. This will mitigate further service problems while our colo moves our data feeds to a different patch panel.
We are terribly sorry to see these repeat problems, and are working hand in hand with the facility's staff to make sure that brand new drops are run to our cabinets and that they are properly tested before being turned up.
For approximately 15 minutes this afternoon, both FogBugz On Demand and Copilot customers were unable to reach their respective services. This was caused by a faulty uplink on our colo's end, which they are now in the process of replacing.
To minimize downtime in the future, our routers will soon be configured to perform an automatic failover when they detect a flapping link.
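As a rough sketch of the idea, and not the actual router configuration, flap detection can be modeled as counting up/down transitions inside a sliding time window and signalling a failover once the count crosses a threshold. The `FlapDetector` class, the threshold of three transitions, and the 60-second window are all assumptions chosen for illustration.

```python
from collections import deque


class FlapDetector:
    """Flags a link as flapping when it changes state too often."""

    def __init__(self, max_transitions=3, window_seconds=60.0):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._transitions = deque()  # timestamps of recent state changes
        self._last_state = None

    def observe(self, timestamp, link_up):
        """Record one link-state sample; return True if failover is warranted."""
        # A transition is any change from the previously observed state.
        if self._last_state is not None and link_up != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = link_up

        # Forget transitions that have aged out of the window.
        while self._transitions and timestamp - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()

        return len(self._transitions) >= self.max_transitions


detector = FlapDetector(max_transitions=3, window_seconds=60.0)
samples = [(0, True), (10, False), (20, True), (30, False)]
decisions = [detector.observe(t, up) for t, up in samples]
print(decisions)
```

In this model a single clean outage produces only one transition and does not trip the detector. Only repeated state changes within the window do.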