Trello.com was down for about 25 minutes on Tuesday, January 3.
Our monitors alerted us to intermittent connection failures at about 10:55 EST, and we rolled out a band-aid fix at 11:20. The outage itself was caused by a dramatic increase in the number of open TCP connections to Trello, from our usual 20-25k to our firewall's 100k ceiling. We doubled this limit to try to relieve pressure while we diagnosed the problem, but hit the new ceiling within seconds of making the change.
Eventually, we narrowed the issue down to a problem with WebSockets and, at 11:20, forced all clients to fall back to AJAX polling, which ended the outage.
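As a rough illustration of that kind of forced fallback, here is a minimal TypeScript sketch. It assumes the client reads a server-supplied switch before picking a transport. The names `ClientConfig`, `allowWebSockets`, and `chooseTransport` are hypothetical and are not taken from Trello's code.

```typescript
type Transport = "websocket" | "polling";

interface ClientConfig {
  // Hypothetical server-supplied switch: when false, clients must poll.
  allowWebSockets: boolean;
}

// Pick the transport for this session. WebSockets are used only when the
// server allows them AND the browser supports them; otherwise we poll.
function chooseTransport(
  config: ClientConfig,
  browserSupportsWebSockets: boolean
): Transport {
  return config.allowWebSockets && browserSupportsWebSockets
    ? "websocket"
    : "polling";
}
```

With a switch like this, setting `allowWebSockets` to `false` on the server moves every client to polling without shipping new client code.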
The root problem was traced back to a recent change in a client WebSocket library that caused the application to repeatedly open duplicate connections. This has been fixed and a new version deployed. At this point, Trello is running WebSockets for compatible browsers and our state tables look healthy.
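One common way to guard against duplicate connections is to make the connect call idempotent, so that it reuses a socket that is already connecting or open. The sketch below shows that general technique, not the actual fix. `SingleConnection` and the `open` factory are hypothetical names, and `SocketLike` is a minimal stand-in so the example runs outside a browser.

```typescript
// Minimal structural type; the browser's WebSocket class satisfies it.
interface SocketLike {
  readyState: number;
  close(): void;
}

// Same numeric values as WebSocket.CONNECTING and WebSocket.OPEN.
const CONNECTING = 0;
const OPEN = 1;

class SingleConnection<S extends SocketLike> {
  private socket: S | null = null;
  private readonly open: () => S;

  // `open` is a hypothetical factory, e.g. () => new WebSocket(url).
  constructor(open: () => S) {
    this.open = open;
  }

  // Reuse the current socket while it is connecting or open; only create
  // a new one when there is none or the old one is closing or closed.
  connect(): S {
    const current = this.socket;
    if (
      current !== null &&
      (current.readyState === CONNECTING || current.readyState === OPEN)
    ) {
      return current;
    }
    const fresh = this.open();
    this.socket = fresh;
    return fresh;
  }

  // Close and forget the current socket, if any.
  disconnect(): void {
    if (this.socket !== null) {
      this.socket.close();
      this.socket = null;
    }
  }
}
```

Calling `connect()` repeatedly, for example from a retry timer, then holds at most one live connection per client instead of piling up duplicates.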
We apologize for the inconvenience. Thanks for bearing with us as we make Trello better and better.