Last night at 7:15pm UTC, Gitter experienced a prolonged degradation of service. During this period the application's response times became incredibly slow and the realtime websocket connections that deliver live message updates failed dramatically.
Below follows an explanation of what happened and what steps we are taking to prevent this from happening again.
Before doing so, I'd like to extend our sincerest apologies to all of our customers who were affected at the time. We care deeply about the service we provide and your support for what we do.
At 7:21pm one of our engineers was alerted by PagerDuty that our websocket server processes had been running at 100% across all of our servers for 5 minutes. Upon investigation we discovered that our web server response times had ground to a halt, with responses taking upwards of 40 seconds, and that websocket connectivity was virtually unresponsive.
During this period, we noticed that all connected users were effectively reconnecting every 45 seconds. When a websocket connects to Gitter, it receives snapshot data of the room it is in. This includes chat messages, people, integration activity and a host of other data that makes Gitter work the way it does. Each connection has some processing overhead. Having every connected client do this every 45 seconds was effectively DDoSing us.
The root cause of the issue appears to be a momentary loss of connectivity in one of our application servers. We aren't quite sure yet what caused this; it may have merely been a hiccup within the AWS infrastructure itself. What happened on that server meant all of its websockets were disconnected. As they all immediately came back simultaneously, they began to put pressure on that server, as well as on the other application servers behind our load balancer.
What followed was a dramatic escalation of connectivity that brought the infrastructure to its knees, largely caused by constant disconnection and reconnection of websockets. We refer to this as a subscription storm, as the sockets all begin to subscribe to resources at once.
A few weekends ago, we performed some scheduled maintenance and saw similar behaviour. Given the lower level of activity on our system at the time, there wasn't a massive impact, but we did see a light subscription storm.
Our past experience with websockets has shown they can be volatile beasts. In particular, one issue we've seen happen frequently is ghost sockets, where your browser firmly believes that the websocket is connected, whereas in reality the connection is no longer active. To combat this type of behaviour, our client websockets periodically ping our servers and will restart themselves if they don't receive responses. Our servers, in turn, keep track of these clients and will also perform cleanup operations when we see sockets that haven't responded within their ping timeframe.
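As a rough illustration of the server-side half of this, here is a minimal sketch of ping tracking and ghost-socket cleanup. It is not our production code: the SocketRegistry class, its method names and the 60 second timeout are all hypothetical, chosen only to show the idea.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SocketRegistry:
    """Tracks the last ping seen from each client (hypothetical names)."""

    ping_timeout: float  # seconds a client may stay silent before cleanup
    last_ping: Dict[str, float] = field(default_factory=dict)

    def record_ping(self, client_id: str, now: Optional[float] = None) -> None:
        # Called whenever a client pings; refreshes its liveness timestamp.
        self.last_ping[client_id] = time.monotonic() if now is None else now

    def cleanup(self, now: Optional[float] = None) -> List[str]:
        # Remove and return the clients that have been silent for longer
        # than the timeout. These are treated as ghost sockets.
        current = time.monotonic() if now is None else now
        stale = [
            client_id
            for client_id, seen in self.last_ping.items()
            if current - seen > self.ping_timeout
        ]
        for client_id in stale:
            del self.last_ping[client_id]
        return stale


# Usage with explicit timestamps (assumed values, for illustration only).
registry = SocketRegistry(ping_timeout=60.0)
registry.record_ping("client-a", now=0.0)
registry.record_ping("client-b", now=50.0)
ghosts = registry.cleanup(now=70.0)  # client-a has been silent for 70 seconds
```

The important property is that cleanup is only correct when the server's timeout is at least as long as the interval at which healthy clients actually ping.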
The shorter the ping period and connectivity retry period, the better for clients, as long periods can create situations where your client is no longer receiving messages. However, short periods can cause additional load on the server infrastructure. Over time we've tuned these parameters into a system that works very well.
After the aforementioned scheduled maintenance, we developed a preventative measure that will adjust the websocket parameters when under load. Ironically, this is exactly what caused the cascading failure last night.
As the increased connections from the single app server started to negatively impact the performance of that machine, the parameters were adjusted upwards to alleviate load. However, a logic flaw in the algorithm meant that the other two application servers were not aware that the socket timeout parameter had been adjusted. As a result, the socket cleanup operations on those two servers began to incorrectly close perfectly valid sockets, causing them to constantly try to reconnect.
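To make the flaw concrete, the sketch below models three servers whose cleanup timeouts have drifted apart. It assumes that healthy clients ended up pinging at the new, longer interval while two servers kept the old timeout. The server names and every number here are hypothetical, and the function is an illustration rather than our actual algorithm.

```python
from typing import Dict


def healthy_sockets_closed(
    client_ping_interval: float, server_timeouts: Dict[str, float]
) -> Dict[str, bool]:
    """For each server, report whether a perfectly healthy client would be
    closed as a ghost: it is closed when it pings less often than the
    server's cleanup timeout allows."""
    return {
        name: client_ping_interval > timeout
        for name, timeout in server_timeouts.items()
    }


# Before the adjustment: every server agrees on the timeout (assumed values).
timeouts = {"app-1": 45.0, "app-2": 45.0, "app-3": 45.0}
before = healthy_sockets_closed(30.0, timeouts)

# Under load, app-1 adjusts the parameters upwards, but only its own copy of
# the timeout changes. Clients now ping every 90 seconds.
timeouts["app-1"] = 135.0
after = healthy_sockets_closed(90.0, timeouts)

print(before)  # no server closes healthy sockets
print(after)   # app-2 and app-3 now close healthy sockets
```

Each socket closed this way reconnects and fetches a fresh snapshot, which adds load and keeps the storm going.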
After realising this mistake, we manually adjusted the parameters on all of the servers and connectivity was restored at 8:09pm. Further fine tuning of these parameters caused some fluctuating disruptions up until 10:30pm.
Firstly, we will be increasing our overall server capacity to better deal with momentary failure. The brief failure that occurred on a single server illustrated that we do not have enough capacity in our redundancy.
Secondly, we will be modifying the storm prevention algorithm to adjust its parameters across the entire infrastructure, rather than on a single server.
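One way to picture the second change is to have every server read its websocket parameters from a single shared place, so that an adjustment made by one server is seen by all of them. The sketch below is only an assumption about the shape of such a fix. SharedParams stands in for whatever store all servers can reach, and the class names, the scaling factor and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SharedParams:
    """Stand-in for a parameter store that every app server can read."""

    ping_interval: float  # seconds between client pings
    ping_timeout: float   # seconds of silence before a socket is cleaned up


class AppServer:
    def __init__(self, name: str, params: SharedParams) -> None:
        self.name = name
        self.params = params  # a reference to the shared store, not a copy

    def report_load(self, overloaded: bool, factor: float = 2.0) -> None:
        # Any overloaded server scales the shared values, so the interval and
        # the timeout always move together and every server sees the change.
        if overloaded:
            self.params.ping_interval *= factor
            self.params.ping_timeout *= factor

    def would_close(self, seconds_since_ping: float) -> bool:
        # Cleanup always consults the current shared timeout.
        return seconds_since_ping > self.params.ping_timeout


params = SharedParams(ping_interval=30.0, ping_timeout=45.0)
servers = [AppServer(name, params) for name in ("app-1", "app-2", "app-3")]

servers[0].report_load(overloaded=True)  # app-1 comes under load

# A client that last pinged 60 seconds ago is now safe on every server.
decisions = [server.would_close(60.0) for server in servers]
```

In this sketch no server can hold a stale timeout, because none of them keeps a private copy of it.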
A shout out to great tools
Were it not for our alerting mechanism, a combination of monit and PagerDuty, we would not have been able to respond quite so quickly.
Our instrumentation and visualisation through statsd and Datadog allowed us to see exactly what was going on in realtime and aided us immensely in tracking down the issues.
In case you aren't aware of it, iTerm's Broadcast Input feature greatly aids multi-server open-heart surgery. Naturally, all changes have now been deployed correctly using Ansible.
Once again, we'd like to apologise for the issues caused and thank you all for your words of support and your patience during the issue. In another twist of irony, we recently posted a job opening for a DevOps Engineer. If this sort of thing excites you, come help us make our infrastructure even better.