Investigating issues with Ada Chat services
Incident Report for Ada
Postmortem

RCA - Chat functionality lost for US- and Canada-based Bots

Nov 6, 2022

Synopsis: A third-party service used in our chat widgets had an outage. This caused the chat widget to fail for bots based in the US and Canada.

Outage Start Time/End Time: The outage started at 7:00 PM Eastern on Nov 6th and ended at 9:00 PM Eastern.

November 6, 2022  7:00 PM ET - Outage of a key vendor’s service begins, rendering Ada’s chat experience unresponsive.

November 6, 2022  7:06 PM ET - Ada detects the outage, engages incident response.

November 6, 2022  7:16 PM ET - Ada engineers determine the source of the issue and begin researching ways to circumvent the affected service.

November 6, 2022  7:31 PM ET - Sev0 declared at Ada.

November 6, 2022  8:00 PM ET - Continued attempts to move traffic around the affected service are slowed by security measures.

November 6, 2022  8:18 PM ET - Vendor publishes the outage on their status page.

November 6, 2022  9:00 PM ET - Vendor’s service begins to function again. Ada’s chat service regains function.

Impact:

Between 7:00 PM ET and 9:00 PM ET, no communications were possible between Ada and its chat users. Chatters experienced various effects, including requests that never received a response, an error message stating that “Messages are taking longer than usual,” and a message acknowledging the outage for chatters who initiated their chat during this timeframe.

All bots (100%) in Ada’s US and Canada regions were affected by this outage.

Detection:

Ada detected the outage at 7:06 PM ET, six minutes after it began.

Resolution:

The outage was ultimately resolved when the vendor fixed the problem on their end.

Root Cause:

The service that handles communications between Ada and the chatter’s window had an outage, which prevented Ada from completing responses to the chatter. Ada currently has no backup mechanism to communicate with the client application.

Future Prevention:

Ada is focusing on building resilience and backup mechanisms for the service that delivers communications between Ada and the client application. We have taken immediate steps to reduce the time it takes us to recover from an incident of this nature, and over the coming weeks we are executing a plan to reduce that time to 5 minutes or less from discovery of such an outage. Longer term, Ada is evaluating techniques that will provide “live failover” functionality for this capability so that an outage at this provider will not be experienced by Ada customers.
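
For illustration only, the general idea behind a failover of this kind can be sketched as below. This is a minimal Python sketch under stated assumptions, not Ada's implementation: the names `send_with_failover`, `primary_vendor`, `backup_channel`, and `TransportError` are hypothetical, and the two transports are stand-ins rather than real vendor APIs. It shows a message being attempted on the primary delivery path and, when that path fails, being handed to a backup path.

```python
from typing import Callable, List, Sequence


class TransportError(Exception):
    """Raised by a transport when it cannot deliver a message."""


def send_with_failover(message: str,
                       transports: Sequence[Callable[[str], str]]) -> str:
    """Try each transport in order; return the first successful result."""
    errors: List[TransportError] = []
    for transport in transports:
        try:
            return transport(message)
        except TransportError as exc:
            # Remember the failure and fall through to the next transport.
            errors.append(exc)
    raise TransportError(f"all {len(transports)} transports failed: {errors}")


def primary_vendor(message: str) -> str:
    # Hypothetical stand-in for the third-party service during an outage.
    raise TransportError("primary vendor unavailable")


def backup_channel(message: str) -> str:
    # Hypothetical stand-in for a backup delivery path.
    return f"delivered via backup: {message}"


result = send_with_failover("hello", [primary_vendor, backup_channel])
print(result)
```

In this sketch the order of the list encodes the preference between paths. A production design would also need health checks and timeouts so that a slow primary path does not delay the switch to the backup.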

Posted Nov 09, 2022 - 10:04 EST

Resolved
This incident has been resolved.
Posted Nov 06, 2022 - 21:38 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 06, 2022 - 21:06 EST
Update
The team is continuing to work on the issue.
Posted Nov 06, 2022 - 20:49 EST
Identified
The issue has been identified and the team is investigating the appropriate solution.
Posted Nov 06, 2022 - 20:12 EST
Investigating
Some customers on our US and Canadian clusters are experiencing issues with our chat services. This is affecting some chatters interacting with Ada bots.
We’re aware of the issue and our team is actively working on getting everything back to normal. We're sorry for the inconvenience this causes.
We will continue to share status updates every 30 minutes.
Posted Nov 06, 2022 - 19:39 EST
This incident affected: Chat Client.