NOTE: This applies to our US cluster customers only (<bot>.ada.support). Customers on our other clusters were not affected.
Our message processing pipeline experienced issues when the message broker infrastructure that backs the pipeline exceeded its memory threshold and began rejecting connections. The disruption began at 2:16am EST and ended at 8:50am EST. Severity during that window ranged from a near-complete outage of message processing to degraded performance, with roughly 25% to 75% of users affected for the majority of the time.
The RabbitMQ cluster that we use for message processing began experiencing memory contention at 2:16am EST. Our team did not receive any notifications because alerting on the RabbitMQ cluster had been misconfigured during setup; we became aware of the impact at 7:45am EST and began diagnosis. The memory contention was traced to a high number of connections being opened and closed on the cluster in a short period of time. Each connection open and close was logged, and the resulting volume of log writes to disk caused disk I/O contention. Our message processing stores messages both on disk and in memory, and the I/O contention prevented messages from being persisted to disk, so a large number of messages accumulated in memory. That backlog pushed the cluster past its memory threshold, and while memory was exhausted, message processing was impacted.
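For context on the memory threshold mentioned above: RabbitMQ's memory alarm behavior is governed by the vm_memory_high_watermark settings in rabbitmq.conf. The values below are illustrative defaults, not our production configuration:

```ini
# Sketch only — illustrative values, not our production configuration.
# When total memory use crosses this fraction of available RAM, RabbitMQ
# raises a memory alarm and blocks publishing connections.
vm_memory_high_watermark.relative = 0.6

# Under memory pressure, messages are paged out to disk once usage exceeds
# this fraction of the watermark — the step that disk I/O contention can
# prevent from completing promptly.
vm_memory_high_watermark_paging_ratio = 0.5

# Free disk space threshold; below this, a disk alarm also blocks publishers.
disk_free_limit.absolute = 2GB
```

When paging to disk stalls, messages that would normally be offloaded remain resident in memory, which is consistent with the failure chain described above.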
Our team resolved the issue at 8:50am EST by spinning up a new RabbitMQ cluster with properly configured alerting. We are now working diligently to understand why such a high number of connections were being opened and closed on the RabbitMQ cluster.
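To illustrate why connection churn matters: every broker connection open and close incurs a handshake and is logged, so per-message connections multiply both load and log volume. A minimal, broker-agnostic sketch (the `Broker` class here is a hypothetical stand-in, not a real client library) contrasting per-message connections with a single reused one:

```python
class Broker:
    """Hypothetical stand-in for an AMQP client; counts connection churn."""
    def __init__(self):
        self.opens = 0

    def connect(self):
        self.opens += 1  # each connect would be a handshake + log entries
        return self

    def publish(self, msg):
        pass  # stand-in for sending a message

    def close(self):
        pass  # stand-in for tearing down the connection


def publish_per_message(broker, messages):
    # Anti-pattern: one connection per message -> heavy churn and log volume.
    for m in messages:
        conn = broker.connect()
        conn.publish(m)
        conn.close()


def publish_reused(broker, messages):
    # Preferred: one long-lived connection for the whole batch.
    conn = broker.connect()
    for m in messages:
        conn.publish(m)
    conn.close()


churny, steady = Broker(), Broker()
publish_per_message(churny, range(1000))
publish_reused(steady, range(1000))
print(churny.opens, steady.opens)  # prints "1000 1"
```

The same 1000 messages cost 1000 connection setups in the first pattern and one in the second, which is why long-lived, reused connections are the recommended pattern for AMQP clients.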
We experienced a partial outage from 2:16am to 4:15am EST, during which around 20% of chatters were affected; a near-complete outage from 4:15am to 6:03am EST, during which upwards of 90% of chatters were affected; and a partial outage from 6:03am to 8:50am EST, during which around 60% of chatters were affected.
If you would like a more detailed analysis of this outage and of our next steps, please contact your Customer Success Representative.