On Monday, August 24, 2020, from 9:13 AM to 9:30 AM EDT, Ada had a partial outage in which all chat bots on our US cluster experienced delayed messages. These messages were all eventually delivered, but were delayed by several minutes.
Here's a quick rundown of how everything went down:
Our messaging system works asynchronously: messages that are to be delivered to a chatter are put into a queue, and a large number of worker processes take messages from this queue and deliver them to the chatter's browser. This allows us to send and receive millions of messages per day for our customers' chatters. We use RabbitMQ, a well-known, industry-standard message queue, and rely on a managed service provider to run it for us.
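The queue-and-workers pattern described above can be sketched in a few lines. This is a simplified illustration using Python's standard-library queue and threads in place of RabbitMQ and separate worker processes; the function and variable names are ours, not Ada's.

```python
# Simplified sketch of the queue-and-workers delivery pattern:
# producers enqueue messages, a pool of workers drains the queue.
# Python's stdlib queue stands in for RabbitMQ here.
import queue
import threading

def deliver_to_browser(message, delivered):
    # In a real system this would push the message to the chatter's
    # browser; here we just record that it was handled.
    delivered.append(message)

def worker(q, delivered):
    while True:
        message = q.get()
        if message is None:  # sentinel value tells the worker to stop
            q.task_done()
            break
        deliver_to_browser(message, delivered)
        q.task_done()

q = queue.Queue()
delivered = []
workers = [threading.Thread(target=worker, args=(q, delivered))
           for _ in range(4)]
for w in workers:
    w.start()

# Enqueue a burst of messages, then one sentinel per worker.
for i in range(100):
    q.put(f"message-{i}")
for _ in workers:
    q.put(None)

q.join()  # blocks until every enqueued item has been processed
print(len(delivered))  # all 100 messages delivered
```

Because delivery is decoupled from receipt, a burst of incoming messages makes the queue grow rather than dropping messages outright, which is exactly why the delayed messages in this incident were eventually delivered.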
What happened this morning was unique: a large customer of ours was having an outage of their own during peak hours, and as we have experienced in the past, this can lead to a large number of their users visiting our chat widget for support. Previously, our system had not always been able to scale up fast enough to deal with a large number of messages, so we had since implemented autoscaling that scales up before peak hours in anticipation of surges in demand.
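Scaling up ahead of peak hours, rather than reacting after load arrives, can be sketched as a simple schedule-based policy. The hours, worker counts, and lead time below are illustrative assumptions, not Ada's actual configuration:

```python
# Hedged sketch of schedule-based autoscaling: increase worker capacity
# shortly before a known peak window instead of waiting for load to spike.
BASELINE_WORKERS = 10            # assumed off-peak capacity
PEAK_WORKERS = 40                # assumed peak capacity
PEAK_HOURS_UTC = range(12, 22)   # assumed daily peak window
PRE_SCALE_LEAD_HOURS = 1         # start scaling up one hour early

def desired_worker_count(hour_utc: int) -> int:
    """Return how many workers should be running at the given UTC hour."""
    pre_peak_start = (PEAK_HOURS_UTC.start - PRE_SCALE_LEAD_HOURS) % 24
    if hour_utc in PEAK_HOURS_UTC or hour_utc == pre_peak_start:
        return PEAK_WORKERS
    return BASELINE_WORKERS

print(desired_worker_count(11))  # 40: already scaled up before the peak
print(desired_worker_count(3))   # 10: overnight baseline
```

A policy like this handles the predictable daily surge; the incident itself shows its limit, since a customer's own outage can create a surge the schedule did not anticipate.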
We were immediately alerted to the rise in CPU usage, and our SRE (Site Reliability Engineering) team began investigating. Our RabbitMQ provider, unfortunately, offers no autoscaling, so once we identified the bottleneck, we decided to scale up our RabbitMQ cluster manually. This took about 5 minutes, and because we had previously prepared for such an incident, we were able to do it without losing any messages. Once the cluster was upgraded with more CPU and RAM available, messages started being delivered to chatters again. The influx of messages to our queue was still just as high, but the larger servers were able to handle the load. To further mitigate the increased load, we upgraded the cluster to the next size up again, to absorb any additional growth.
While we had prepared for such an event, we learned that there are pieces of our infrastructure and architecture that can be improved. A single queue is a clear CPU bottleneck, so we can take on more load by sharding it into many queues; we have started implementing this change. We also recognize that there will eventually be a limit to the number of messages per second we can process on the largest 3-5 node cluster we can purchase. We are already planning a re-architecture of our messaging system so that we can scale in both throughput and capacity, even if the volume of messages going through our system grows 10-100x.
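The sharding idea above amounts to routing each message to one of N queues by hashing a stable key, so load spreads across shards while each conversation stays on a single shard and keeps in-order delivery. This is a sketch of the routing step only; the key, shard count, and queue-naming scheme are illustrative assumptions:

```python
# Sketch of queue sharding: hash a stable per-conversation key to pick
# one of NUM_SHARDS queues. The same conversation always maps to the
# same shard, preserving per-conversation message ordering.
import hashlib

NUM_SHARDS = 8  # illustrative; a real system would tune this

def shard_for(conversation_id: str) -> int:
    """Deterministically map a conversation to a shard index."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def queue_name(conversation_id: str) -> str:
    """Name of the queue a conversation's messages are published to."""
    return f"messages.shard-{shard_for(conversation_id)}"

# The same conversation always lands on the same shard:
assert shard_for("chatter-123") == shard_for("chatter-123")
print(queue_name("chatter-123"))
```

Using a cryptographic hash rather than Python's built-in `hash()` keeps the mapping stable across processes and restarts, which matters when many publishers must agree on the same shard assignment.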
I would like to thank the members of the SRE team who assisted in the investigation during the incident; these are always high-stress situations where you have to consider many factors to find the cause of an outage. Many thanks also to the engineers at Ada who joined our incident triage process to help figure out what was going on. We have a culture of empathy and ownership at Ada, and it truly shows whenever we have a high-severity issue, no matter what time of day it happens.