Nov 8, 2022
Synopsis: On November 8, 2022, at 9:53 AM EST, a third-party service used by Ada’s chat widgets experienced an outage. This caused the chat widget to fail for bots based in the US and Canada.
Outage Start Time/End Time: The outage started at 9:53 AM EST on November 8, 2022, and ended at 11:40 AM EST the same day.
November 8, 2022 9:53 AM EST - Ada detects the outage and engages incident response.
November 8, 2022 10:00 AM EST - Ada engineers determine the source of the issue and begin to roll out a change to circumvent it.
November 8, 2022 10:16 AM EST - SEV-0 declared at Ada.
November 8, 2022 10:32 AM EST - Ada engineers attempt to shift traffic to another vendor region to restore service.
November 8, 2022 10:52 AM EST - Ada engineers are unable to complete the shift to another vendor region due to configuration issues. The decision is made to roll back the changes and monitor the third-party service for recovery in the primary region.
November 8, 2022 11:16 AM EST - Ada Engineering monitoring shows traffic back online with the third-party service. Engineers begin to validate and monitor.
November 8, 2022 11:40 AM EST - Ada’s chat service regains function and Ada engineers validate functionality. Outage is marked as resolved.
Between 9:53 AM EST and 11:16 AM EST, no communication was possible between Ada and its chat users. Chatters experienced various effects, including requests that never received a response, an error message stating that “Messages are taking longer than usual,” and, for chatters who initiated their chat during this timeframe, a message acknowledging the outage.
100% of the bots residing in the US and Canada Ada regions were affected by this outage.
Ada’s internal system monitoring detected the outage shortly after it began.
The outage was ultimately resolved when the vendor fixed the issue on their end.
The service that handles communications between Ada and the chatter’s window had an outage, which prevented Ada from delivering responses to the chatter. Ada currently has no backup mechanism to communicate with the client application.
Ada is focusing on building resilience and backup mechanisms for the service that delivers communications between Ada and the client application. We have taken immediate steps to reduce the time it takes us to recover from an incident of this nature, and in the coming weeks we are executing a plan to reduce that time to 5 minutes or less from discovery of such an outage. Longer term, Ada is evaluating techniques that will provide “live failover” functionality for this capability so that an outage at this provider will not be experienced by Ada customers.
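For readers unfamiliar with the term, the general idea behind “live failover” can be illustrated with a minimal Python sketch. This is not Ada’s implementation, which this report does not describe. The names `send_with_failover`, `primary`, `backup`, and `DeliveryError` are hypothetical, and the two transports are simple stand-ins for a failing provider and a secondary delivery path.

```python
from typing import Callable, Sequence


class DeliveryError(Exception):
    """Raised when no transport could deliver the message."""


def send_with_failover(message: str, transports: Sequence[Callable[[str], str]]) -> str:
    """Try each transport in order and return the first successful result."""
    errors = []
    for transport in transports:
        try:
            return transport(message)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise DeliveryError(f"all {len(errors)} transports failed")


def primary(message: str) -> str:
    # Hypothetical stand-in for the third-party provider during an outage.
    raise ConnectionError("primary provider unavailable")


def backup(message: str) -> str:
    # Hypothetical stand-in for a secondary delivery path.
    return f"delivered via backup: {message}"


# The primary fails, so the message is delivered through the backup.
result = send_with_failover("hello", [primary, backup])
```

In this sketch the caller never sees the primary failure as long as one transport succeeds, which is the property a failover design aims for. A production version would also need health checks and timeouts so that a slow provider is abandoned quickly.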