Investigating issues with Ada Chat services
Incident Report for Ada
Postmortem

RCA - Lost chat functionality for US and Canadian based Bots

Nov 8, 2022

Synopsis: On November 8 at 9:53 AM EST, a third-party service used by Ada's chat widget had an outage. This caused the chat widget to fail for bots based in the US and Canada.

Outage Start Time/End Time: The outage started at 9:53 AM EST on November 8 and ended at 11:40 AM EST.

November 8, 2022 9:53 AM EST - Ada detects the outage and engages incident response.

November 8, 2022 10:00 AM EST - Ada engineers determine the source of the issue and begin rolling out a change to circumvent it.

November 8, 2022 10:16 AM EST - SEV-0 declared at Ada.

November 8, 2022 10:32 AM EST - Ada engineers attempt to shift traffic to another vendor region to restore service.

November 8, 2022 10:52 AM EST - Ada engineers are unable to complete the shift to another vendor region due to configuration issues. The decision is made to roll back the changes and monitor the third-party service for recovery in the primary region.

November 8, 2022 11:16 AM EST - Ada Engineering monitoring shows traffic back online with the third-party service. Engineers begin to validate and monitor.

November 8, 2022 11:40 AM EST - Ada’s chat service regains function, and Ada engineers validate functionality. The outage is marked as resolved.

Impact: 

Between 9:53 AM EST and 11:16 AM EST, no communication was possible between Ada and its chat users. Chatters experienced various effects, including requests that never received a response, an error message stating that “Messages are taking longer than usual,” and, for chatters who initiated their chat during this timeframe, a message acknowledging the outage.

All bots residing in Ada's US and Canada regions were affected by this outage.

Detection: 

Ada’s internal system monitoring detected the outage shortly after it began.

Resolution: 

The outage was ultimately resolved when the vendor addressed the issue on their end.

Root Cause: 

The service that handles communications between Ada and the chatter’s window had an outage, which prevented Ada from delivering responses to the chatter. Ada currently has no backup mechanism to communicate with the client application.

Future Prevention: 

Ada is focusing on building resilience and backup mechanisms for the service that delivers communications between Ada and the client application. We have taken immediate steps to reduce the time it takes us to recover from an incident of this nature, and over the coming weeks we are executing a plan to reduce that time to 5 minutes or less from discovery of such an outage. Longer term, Ada is evaluating techniques that will provide “live failover” functionality for this capability, so that an outage at this provider will not be experienced by Ada customers.
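
For illustration only, the general idea behind a failover of this kind can be sketched as trying an ordered list of delivery transports and using the first one that succeeds. This is a minimal sketch and not Ada's implementation: the names send_with_failover, DeliveryError, primary and backup are hypothetical, and the two transports are stand-ins that simulate a failing primary provider and a working backup.

```python
from typing import Callable, Sequence


class DeliveryError(Exception):
    """Raised when a transport cannot deliver a message (hypothetical)."""


# A transport takes a message and returns a delivery receipt string.
Transport = Callable[[str], str]


def send_with_failover(message: str, transports: Sequence[Transport]) -> str:
    """Try each transport in order and return the first successful receipt."""
    errors = []
    for transport in transports:
        try:
            return transport(message)
        except DeliveryError as exc:
            # Remember the failure and fall through to the next transport.
            errors.append(exc)
    raise DeliveryError(f"all {len(errors)} transports failed")


def primary(message: str) -> str:
    """Stand-in for the primary provider, simulated here as being down."""
    raise DeliveryError("primary provider unavailable")


def backup(message: str) -> str:
    """Stand-in for a backup provider that is still reachable."""
    return f"backup delivered: {message}"


if __name__ == "__main__":
    print(send_with_failover("hello", [primary, backup]))
```

In this sketch the chatter would still receive a response while the primary provider is down, because the call falls through to the backup transport. A production design would also need health checks, timeouts and message ordering guarantees, none of which are shown here.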

Posted Nov 10, 2022 - 15:27 EST

Resolved
This incident has been resolved.
Posted Nov 08, 2022 - 11:37 EST
Update
We are continuing to monitor for any further issues.
Posted Nov 08, 2022 - 11:16 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 08, 2022 - 11:16 EST
Update
We are continuing to work on a fix for this issue.
Posted Nov 08, 2022 - 10:49 EST
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 08, 2022 - 10:19 EST
Investigating
Some customers are experiencing issues with our chat services. This is affecting some chatters interacting with Ada bots.
We’re aware of the issue and our team is actively working on getting everything back to normal. We're sorry for the inconvenience this causes.
We will continue to share status updates every 30 minutes.
Posted Nov 08, 2022 - 10:08 EST
This incident affected: Chat Client.