A runaway indexing process (for Kiln code search) tied up the majority of the Apache worker processes on the storage server, so client requests to that server were not being processed. The queues of unprocessed connections at the storage server, at the proxying server that passes requests to it, and at the load balancer became extremely long. At that point, none of the servers were able to make progress on their queues, and incoming connections continued to add load, rendering Kiln effectively unreachable.
Once we realized the nature of the problem, we disconnected all queued client connections and disabled the indexing process. Kiln On Demand returned to normal responsiveness within a few minutes.
Going forward, we've made several changes to prevent this type of problem:
- fixed the indexing process so that it no longer ties up server resources.
- reduced the connection queue lengths, so that requests will be rejected when Kiln is under heavy load, rather than contributing further load.
- increased monitoring of load on the storage server.
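To illustrate the queue-length change above, here is a minimal Python sketch of a bounded connection queue that rejects new requests once it is full instead of letting the backlog grow. This is only an illustration of the general idea, not our actual Apache or load balancer configuration; the names `BoundedRequestQueue`, `offer`, `take`, and `simulate` are hypothetical.

```python
import queue


class BoundedRequestQueue:
    """Hypothetical request queue that sheds load once it is full."""

    def __init__(self, max_pending):
        # A small maxsize keeps the backlog short; the real limit is an assumption.
        self._pending = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def offer(self, request):
        """Queue the request if there is room; otherwise reject it immediately."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            # Rejecting here keeps an overloaded server from accumulating more work.
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest queued request (raises queue.Empty if none)."""
        return self._pending.get_nowait()


def simulate(max_pending, incoming):
    """Send `incoming` requests at a queue that no worker is draining."""
    q = BoundedRequestQueue(max_pending)
    accepted = sum(1 for i in range(incoming) if q.offer(f"req-{i}"))
    return accepted, q.rejected
```

With a limit of 3 and 10 incoming requests while no worker is draining the queue, `simulate(3, 10)` accepts 3 requests and rejects the other 7 right away, so rejected clients get a fast failure rather than adding to the load.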
We're also working on building new, faster storage servers, though they are not quite ready to go into production yet.
I apologize for any inconvenience this service interruption caused -- I understand that Kiln is vital to many businesses. We appreciate your cooperation and patience. If there's anything we can do to make this right, please don't hesitate to contact Fog Creek Support.