First off, let me apologise to all who have been affected by the two service availability issues we've encountered in the last week. We absolutely love providing our service to everyone and are devastated that we let you down.
On Thursday 9th June at 7:30pm BST we experienced a massive degradation of service caused by our database server locking requests. The underlying problem was that the RAID array could not keep up with expanded documents being loaded and modified rapidly.
We quickly restored service to around 75% of our users, but certain rooms remained unavailable for the subsequent 18 hours.
During this outage, we quickly spun up new database servers with more efficient drive utilisation - Mongo 3.0 with WiredTiger and snappy compression. We also had to provision new, faster RAID arrays to host the new data - multiple arrays, in fact, as we had to move hundreds of gigabytes of data around the network onto each of the new machines.
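For those interested in the detail, the WiredTiger and compression change boils down to a couple of mongod settings. Here is a minimal sketch of the relevant configuration - the paths shown are hypothetical, not our actual layout:

    # mongod.conf (MongoDB 3.0) - illustrative sketch, paths are hypothetical
    storage:
      dbPath: /var/lib/mongodb            # data files on the new RAID array
      engine: wiredTiger                  # switch from MMAPv1 to WiredTiger
      wiredTiger:
        collectionConfig:
          blockCompressor: snappy         # compress collection data on disk
    systemLog:
      destination: file
      path: /var/log/mongodb/mongod.log   # the log location we later got wrong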
By 2pm BST on Friday everything was running smoothly again - in fact a lot faster than usual thanks to the new Mongo setup - and after an all-nighter we finally got some rest.
Everything had been running smoothly for nearly a week, and so yesterday (Wednesday 15th June) we began to decommission some of the unused RAID arrays. Unfortunately, as the new servers had been spun up at 2am in panic mode, we had failed to properly configure the log file locations on Mongo...
At 3am BST on Thursday 16th, our logrotate cron jobs kicked in on the two new database servers and tried to move log files to a location that didn't exist. The two Mongo instances on those machines immediately fell over. Ordinarily this wouldn't have been a problem, as we had a third, legacy Mongo server acting as a secondary, as well as an arbiter. What we had failed to spot during our emergency maintenance the previous week was that with 3 live Mongo servers and an arbiter, two machines failing simultaneously would leave only 50% of the voting members - short of the strict majority needed to elect a new primary.
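For the curious, here is a sketch of that arithmetic. The hostnames are hypothetical, but the shape of the replica set matches what's described above: three data-bearing members plus an arbiter gives four voting members, and an election requires a strict majority of votes.

    // Illustrative replica set setup (hostnames are made up)
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "mongo-new-1:27017" },                     // new server
        { _id: 1, host: "mongo-new-2:27017" },                     // new server
        { _id: 2, host: "mongo-legacy:27017" },                    // legacy secondary
        { _id: 3, host: "mongo-arbiter:27017", arbiterOnly: true } // arbiter
      ]
    })
    // 4 voting members means a primary election needs 3 votes (a strict majority).
    // When the two new servers failed together, only 2 of the 4 votes remained,
    // so no primary could be elected until one of them was brought back.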
It took us longer than we would have liked to identify the issue, and service was restored at 7am BST.
We are reviewing what went wrong and capturing the lessons learned to prevent these issues from occurring again.
It's likely that we will schedule some server maintenance over the weekend that may result in a couple of hours of downtime, which we will carry out during a quiet period when usage on the system is limited to a few hundred people. These improvements, which we have been testing throughout the week, should result in massively improved performance. Rather than trickle-feed them gradually, our current thinking is that it's better to take a short, scheduled hit during a period of low usage and migrate the data quickly so that everyone benefits from the changes.
Again, our humblest apologies for these issues; we know that people depend on our service and that we have let you down. We are still a young and growing company, and we are continuously putting systems and processes in place to learn from our mistakes.
Yours,
Mike Bartlett
CEO & Founder
PS. If all of this made total sense to you and you're tut-tut-tutting, we have been trying to hire a new DevOps engineer for a little while. If that's you, or you know the right person, click here.