Nov 6, 2022
Synopsis: A third-party service used in our chat widgets had an outage. This caused the chat widget to fail for bots based in the US and Canada.
Outage Start Time/End Time: The outage started at 7:00 PM ET on November 6, 2022 and ended at 9:00 PM ET the same day.
Timeline:
November 6, 2022 7:00 PM ET - Outage of a key vendor’s service begins, rendering Ada’s chat experience unresponsive.
November 6, 2022 7:06 PM ET - Ada detects the outage, engages incident response.
November 6, 2022 7:16 PM ET - Ada engineers determine the source of the issue and begin researching ways to circumvent the affected service.
November 6, 2022 7:31 PM ET - Sev0 declared at Ada.
November 6, 2022 8:00 PM ET - Continued attempts to move traffic around the affected service are slowed by security measures.
November 6, 2022 8:18 PM ET - Vendor publishes the outage on their status page.
November 6, 2022 9:00 PM ET - Vendor’s service begins to function again. Ada’s chat service regains function.
Impact:
Between the hours of 7:00 PM ET and 9:00 PM ET, no communications were possible between Ada and its chat users. Chatters experienced various effects including requests that were never responded to, an error message stating that “Messages are taking longer than usual,” and a message acknowledging the outage for chatters who initiated their chat during this timeframe.
All bots (100%) in Ada’s US and Canada regions were affected by this outage.
Detection:
Ada detected the outage at 7:06 PM ET, six minutes after it began.
Resolution:
The outage was ultimately resolved when the vendor fixed the issue on their end.
Root Cause:
The service that handles communications between Ada and the chatter’s window had an outage, preventing Ada from completing responses to the chatter. Ada currently has no backup mechanism to communicate with the client application.
Future Prevention:
Ada is focusing on building resilience and backup mechanisms for the service that delivers communications between Ada and the client application. We have taken immediate steps to reduce the time it takes us to recover from an incident of this nature, and over the coming weeks we are executing a plan to reduce that time to 5 minutes or less from the moment such an outage is discovered. Longer term, Ada is evaluating techniques that will provide “live failover” functionality for this capability so that an outage at this provider will not be experienced by Ada customers.
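For illustration only, the general idea behind “live failover” can be sketched as trying an ordered list of delivery transports and falling back to the next one when the current one fails. This is a minimal sketch under stated assumptions, not a description of Ada’s actual design: the names `send_with_failover`, `DeliveryError`, `primary_vendor`, and `backup_channel` are hypothetical, and the two transports are stand-ins that simulate a vendor outage and a working backup.

```python
from typing import Callable, List, Sequence

# Hypothetical: a transport is any callable that delivers a message
# to the client application and raises an exception on failure.
Transport = Callable[[str], None]


class DeliveryError(Exception):
    """Raised when every configured transport failed to deliver."""


def send_with_failover(message: str, transports: Sequence[Transport]) -> int:
    """Try each transport in order; return the index of the one that succeeded."""
    errors: List[Exception] = []
    for index, transport in enumerate(transports):
        try:
            transport(message)
            return index
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise DeliveryError(f"all {len(transports)} transports failed: {errors}")


# Stand-in transports for illustration (assumptions, not real services).
delivered: List[str] = []


def primary_vendor(message: str) -> None:
    # Simulates the third-party service being down.
    raise ConnectionError("vendor outage")


def backup_channel(message: str) -> None:
    # Simulates a working backup path by recording the message.
    delivered.append(message)
```

Calling `send_with_failover("hello", [primary_vendor, backup_channel])` in this sketch would skip the failing primary and deliver through the backup. A production design would also need health checks, timeouts, and a way to return traffic to the primary once it recovers.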