Chat outage
Incident Report for Ada
Postmortem

NOTE: This applies to our US cluster customers only ("<bot>.ada.support"). Customers on our other clusters were not affected.

Overview

Our message processing pipeline was disrupted when the message broker infrastructure that backs it exceeded its memory threshold and began rejecting connections. The disruption began at 2:16am and ended at 8:50am EST. During this time the impact ranged from an almost complete outage of message processing to degraded performance of the service, with most of the period spent in a state that affected roughly 25% to 75% of users.

Description

The RabbitMQ cluster that we use for message processing began experiencing memory contention at 2:16am EST. Our team did not receive any notifications during this time because the RabbitMQ cluster's alerting was misconfigured during setup. Our team became aware of the impacted service at 7:45am EST and began to diagnose it. The cause of the memory contention was found to be a high number of connections being opened and closed on the RabbitMQ cluster in a short space of time. This connection churn caused a large volume of logs to be written to disk, which in turn caused disk I/O contention. Our message processing activities, which store messages both on disk and in memory, then had trouble writing to disk, so a large number of messages were held in memory instead. These messages overloaded the cluster and it reached its memory threshold. While memory was overloaded on the cluster, message processing was impacted.
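
As an illustration of the kind of check that would have surfaced the memory pressure earlier, here is a minimal sketch in Python. It is not our actual alerting configuration. It assumes node statistics shaped like those returned by the RabbitMQ management HTTP API (fields "name", "mem_used", "mem_limit" and "mem_alarm"). The 80% warning ratio, the function name and the sample data are hypothetical.

```python
from typing import Dict, List

# Hypothetical warning threshold: alert when a node uses 80% of its memory limit.
WARN_RATIO = 0.8


def memory_alerts(nodes: List[Dict], warn_ratio: float = WARN_RATIO) -> List[str]:
    """Return one alert string per node that is near or past its memory limit."""
    alerts = []
    for node in nodes:
        name = node.get("name", "unknown")
        used = node.get("mem_used", 0)
        limit = node.get("mem_limit", 0)
        if node.get("mem_alarm", False):
            # The broker itself reports that the memory threshold was crossed.
            alerts.append(f"CRITICAL {name}: memory alarm set")
        elif limit > 0 and used / limit >= warn_ratio:
            alerts.append(f"WARNING {name}: memory at {used / limit:.0%} of limit")
    return alerts


# Made-up sample data for illustration only (values in bytes).
sample = [
    {"name": "rabbit@node-1", "mem_used": 900, "mem_limit": 1000, "mem_alarm": False},
    {"name": "rabbit@node-2", "mem_used": 1200, "mem_limit": 1000, "mem_alarm": True},
    {"name": "rabbit@node-3", "mem_used": 100, "mem_limit": 1000, "mem_alarm": False},
]

for line in memory_alerts(sample):
    print(line)
```

In practice the output of a check like this would be routed to a paging system, so that a warning is raised before the broker starts rejecting connections.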

Resolution

Our team resolved the issues at 8:50am EST by spinning up a new RabbitMQ cluster with properly configured alerting. We are now working to understand why such a high number of connections were being opened and closed on the RabbitMQ cluster.

Severity

We experienced a partial outage from 2:16am - 4:15am EST where around 20% of chatters were affected. We experienced an almost complete outage from 4:15am - 6:03am EST where upwards of 90% of chatters were affected. We experienced a partial outage from 6:03am - 8:50am EST where around 60% of chatters were affected.
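
The three phases above can be combined into a single time-weighted figure. The sketch below, in Python, is a rough illustration only. It treats "around 20%", "upwards of 90%" and "around 60%" as exactly 0.20, 0.90 and 0.60, which is an approximation, and the function names are hypothetical.

```python
from datetime import datetime

# (start, end, approximate share of chatters affected), times in EST.
# The shares are rounded readings of the figures in the Severity section.
PHASES = [
    ("02:16", "04:15", 0.20),
    ("04:15", "06:03", 0.90),
    ("06:03", "08:50", 0.60),
]


def minutes_between(start: str, end: str) -> int:
    """Length of a same-day interval in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def time_weighted_impact(phases) -> float:
    """Average share of chatters affected, weighted by phase duration."""
    total = sum(minutes_between(s, e) for s, e, _ in phases)
    weighted = sum(minutes_between(s, e) * share for s, e, share in phases)
    return weighted / total


total_minutes = sum(minutes_between(s, e) for s, e, _ in PHASES)
print(f"Total duration: {total_minutes} minutes")
print(f"Time-weighted impact: {time_weighted_impact(PHASES):.0%}")
```

Under these assumptions the disruption lasted 394 minutes, with a time-weighted average of roughly 56% of chatters affected.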

More Information

If you would like a more detailed analysis of this outage and the steps going forward, please contact your Customer Success Representative.

Posted Aug 03, 2018 - 17:43 EDT

Resolved
We have resolved the issues seen with chat messaging. We are collecting information about the outage and will post a Root Cause Analysis on this statuspage.
Posted Aug 03, 2018 - 09:31 EDT
Monitoring
We are monitoring to ensure this incident is resolved.
Posted Aug 03, 2018 - 08:53 EDT
Update
We have resolved this issue.
Posted Aug 03, 2018 - 08:50 EDT
Investigating
We are currently investigating an outage with our chat service.
Posted Aug 03, 2018 - 08:24 EDT
This incident affected: Chat Client.