[PLATFORM] Service Disruption in All Production Environments

Incident Report for Kustomer

Postmortem

Summary

On October 20, 2025, Kustomer experienced a significant service disruption affecting all customers on our platform. The incident was triggered by a major AWS service disruption in the us-east-1 region that impacted ~140 AWS services, including core infrastructure components our platform depends on.

The incident occurred in two distinct phases:

  • Phase 1 (3:00 AM - 6:30 AM EDT): Initial AWS service disruption causing widespread connectivity issues and elevated error rates across our platform. Services began recovering as AWS addressed the underlying DNS resolution issues.
  • Phase 2 (9:20 AM - 5:00 PM EDT): Secondary wave of issues, including a critical authentication problem that prevented users from logging into the platform from 11:47 AM to 4:30 PM EDT. This window overlapped with the AWS EC2 service disruption on October 20, which ran from 2:48 AM EDT to 4:50 PM EDT. The authentication issue was caused by our disaster recovery automation as we attempted to establish failover capabilities while the AWS EC2 disruption was ongoing.

Full platform recovery was achieved by 5:00 PM EDT on October 20, with final cleanup operations completed by 7:00 PM EDT. A remaining issue with AIC and AIR observability infrastructure was fully resolved by 11:45 AM EDT on October 21.

Root Cause

Primary Cause: AWS Regional Service Disruption

The incident originated from a widespread AWS service disruption in the us-east-1 region. According to AWS, the root cause was a DNS resolution failure for internal service endpoints, specifically affecting DynamoDB regional endpoints. This DNS issue cascaded into AWS's EC2 internal launch system, creating widespread connectivity problems across the region.
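
In practical terms, this meant the regional endpoint hostnames temporarily stopped resolving to IP addresses, so clients could not even open connections to otherwise healthy services. The probe below is a minimal, illustrative sketch of how that class of failure surfaces; the endpoint names are assumptions used for demonstration only and are not the checks Kustomer or AWS actually run.

  # Illustrative DNS resolution probe. Endpoint names are examples only.
  import socket

  ENDPOINTS = [
      "dynamodb.us-east-1.amazonaws.com",
      "ec2.us-east-1.amazonaws.com",
  ]

  def resolves(hostname: str) -> bool:
      """Return True if the hostname currently resolves to at least one address."""
      try:
          return len(socket.getaddrinfo(hostname, 443)) > 0
      except socket.gaierror:
          return False

  if __name__ == "__main__":
      for host in ENDPOINTS:
          print(f"{host}: DNS resolution {'OK' if resolves(host) else 'FAILED'}")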

Secondary Cause: Disaster Recovery Automation Issue

During our disaster recovery response, an infrastructure configuration issue emerged that prevented proper authentication for approximately 4.5 hours. When disaster recovery preparations were initiated at 11:25 AM EDT, our infrastructure automation discovered that primary and secondary region configurations shared the same management context. As the automation began provisioning resources in the secondary region, it simultaneously triggered scaling changes in the primary region's authentication service, reducing it to a minimal operational state.
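
To illustrate the class of safeguard this points to, the hypothetical sketch below shows a disaster recovery scaling step that refuses to act outside the region the recovery run is scoped to, instead of inheriting changes through a shared management context. The names and structure are illustrative and do not reflect our actual automation.

  # Hypothetical sketch: keep disaster recovery scaling actions scoped to the
  # intended region. Names and structure are illustrative only.
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class ScalingAction:
      service: str        # e.g. "authentication"
      region: str         # region the action would modify
      desired_count: int  # target number of instances

  def apply_dr_scaling(actions: list[ScalingAction], dr_region: str) -> None:
      """Apply scaling actions, rejecting any that touch a region other than dr_region."""
      for action in actions:
          if action.region != dr_region:
              raise ValueError(
                  f"Refusing to scale {action.service} in {action.region}: "
                  f"this disaster recovery run is scoped to {dr_region}"
              )
      for action in actions:
          print(f"Scaling {action.service} in {action.region} to {action.desired_count} instances")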

Under normal circumstances, an automated deployment would have immediately corrected this configuration. However, this deployment failed due to the ongoing AWS service disruptions, leaving the authentication service's load balancer in an inconsistent state—pointing to a configuration with insufficient capacity while healthy instances remained unreachable.

This was an edge case that required a specific combination of factors: (1) initiating disaster recovery operations, (2) infrastructure management coupling between regions while the primary region was only partially operational (us-east-1 was significantly degraded but remained operational), and (3) simultaneous AWS service failures in us-east-1 preventing the standard recovery mechanism from completing. Had us-east-1 been completely offline, no additional consideration would have been required for operations in that region.

Timeline

All times in Eastern Daylight Time

Initial Service Disruption Phase

Oct 20, 2025

3:11 AM - Multiple alerts triggered indicating elevated error rates across platform services. Kustomer’s Statuspage was accessible, but on-call responders could not authenticate successfully to provide updates.

3:18 AM - Incident response initiated; engineers begin investigation

4:25 AM - Confirmed correlation with AWS service disruptions in us-east-1 region

5:01 AM - AWS identifies root cause of DNS resolution failures

5:27 AM - 6:03 AM - AWS reports significant recovery; Kustomer platform functionality observed to be restored. Access to Kustomer’s Statuspage was restored at 5:37 AM.

6:30 AM - Initial recovery verified; platform communication channels tested successfully. Kustomer platform verified fully functional.

Secondary Service Disruption Phase

8:55 AM - New wave of elevated error rates detected across multiple services

9:42 AM - AWS announces mitigation in progress, including throttling of new EC2 instance launches. Additional compute capacity is significantly limited, which affects autoscaling operations.

10:11 AM - Severity escalated; separate incident channel created

10:40 AM - Decision made to initiate disaster recovery preparations in us-east-2 region as contingency

11:02 AM - Engineers begin provisioning infrastructure in secondary region

11:46 AM - Due to authentication issues, Kustomer Technical Support began responding to client inquiries directly via email to maintain correspondence

11:47 AM - 3:48 PM - Multiple customer reports of authentication failures; disaster recovery efforts continue during this period. At 3:48 PM, the root cause of the authentication errors is identified: a load balancer configuration prematurely routing traffic to the secondary region

4:03 PM - AWS reports ongoing recovery across most services

4:15 PM - Load balancer configuration corrected

4:30 PM - Customer authentication restored; users able to access platform

5:00 PM - API traffic normalized; error rates return to baseline

5:48 PM - AWS announces full recovery of regional services in us-east-1

6:01 PM - 6:35 PM - Queues redriven to recover delayed operations

7:00 PM - Secondary region infrastructure fully prepared for future failover needs

Extended Recovery

8:40 PM - Observability logs for the AIC and AIR features remain impacted due to OpenSearch Serverless recovery mitigations implemented by AWS

Oct 21, 2025

11:45 AM - Full recovery of OpenSearch Serverless, which restores AIC and AIR observability

Lessons/Improvements

What Went Well

Disaster recovery readiness: Our most recent disaster recovery exercise was conducted in July 2025, and the updated documentation proved valuable during the incident. The team was able to reference established runbooks and procedures, even as we encountered unexpected challenges.

Cross-team coordination: Engineers across multiple teams collaborated effectively to simultaneously address recovery in the primary region while preparing failover capabilities in the secondary region.

Iterative improvement: The team identified and implemented fixes to disaster recovery automation in real-time, improving our processes even during the incident.

Customer communication: Although our status page experienced authentication issues during the incident, limiting our ability to post updates, and Kustomer’s own access to the platform was also impacted by this service disruption, our technical support and customer success teams were able to keep customers updated by reaching out via email.

Areas for Improvement

Earlier detection of infrastructure anomalies: The authentication service load balancer inconsistency was observed earlier in the incident but wasn't immediately prioritized for investigation amid numerous other AWS-related issues. We are implementing additional health checks and automated detection to provide redundant, proactive alerting for this critical functionality, and we are updating runbooks to reinforce the importance of investigating minor anomalies during major incidents.
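
As one example of the kind of automated detection described above, the sketch below (using boto3) creates a CloudWatch alarm that fires when the authentication load balancer's target group reports fewer healthy targets than expected. The alarm name, thresholds, ARNs, and dimension values are placeholders, not our actual configuration.

  # Illustrative only: alarm on the HealthyHostCount metric for the
  # authentication service's target group. All names and ARNs are placeholders.
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

  cloudwatch.put_metric_alarm(
      AlarmName="auth-service-healthy-host-count-low",
      Namespace="AWS/ApplicationELB",
      MetricName="HealthyHostCount",
      Dimensions=[
          {"Name": "TargetGroup", "Value": "targetgroup/auth-service/0123456789abcdef"},
          {"Name": "LoadBalancer", "Value": "app/auth-alb/0123456789abcdef"},
      ],
      Statistic="Minimum",
      Period=60,
      EvaluationPeriods=3,
      Threshold=2,
      ComparisonOperator="LessThanThreshold",
      TreatMissingData="breaching",  # treat missing data as unhealthy
      AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
  )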

Status page reliability: The ability to keep customers updated on the latest status of the Kustomer platform is critical to our operations and to maintaining trust. We are exploring options to improve the reliability of this tool.

Planned Improvements

Immediate Actions (Completed or In Progress):

  • Exploring redundant customer communication channels beyond our primary status page
  • Separated infrastructure management for the secondary region to enable independent deployment and faster disaster recovery
  • Adding safeguards to disaster recovery automation to prevent unintended traffic shifts and race conditions, and to validate target health before routing changes (see the sketch after this list)
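
As a sketch of the target-health safeguard referenced above (illustrative only; the ARNs, thresholds, and function names are placeholders rather than our actual automation):

  # Hypothetical pre-flight check: verify the target group we are about to
  # route traffic to has healthy targets before changing any routing.
  import boto3

  def targets_healthy(target_group_arn: str, minimum_healthy: int = 2,
                      region: str = "us-east-2") -> bool:
      """Return True if the target group reports at least minimum_healthy healthy targets."""
      elbv2 = boto3.client("elbv2", region_name=region)
      response = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
      healthy = [
          d for d in response["TargetHealthDescriptions"]
          if d["TargetHealth"]["State"] == "healthy"
      ]
      return len(healthy) >= minimum_healthy

  def shift_traffic(target_group_arn: str) -> None:
      if not targets_healthy(target_group_arn):
          # Fail closed: leave routing untouched rather than send traffic to an
          # under-provisioned or unreachable configuration.
          raise RuntimeError("Refusing to shift traffic: target group is not healthy")
      # ...proceed with the routing change (listener or weighted-record update)...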

Strategic Initiatives:

  • Conducting more frequent disaster recovery exercises to validate our failover procedures under realistic conditions
  • Developing faster, more reliable automation scripts and orchestration tools for disaster recovery operations to reduce manual intervention and accelerate failover execution
  • Establishing disaster recovery readiness as a mandatory requirement in our engineering standards, ensuring all new features and services are designed with multi-region capabilities and tested regularly for failover scenarios

Commitment to Reliability

This incident reinforced our commitment to building a resilient platform. While we cannot prevent cloud provider service disruptions, we can minimize their impact through better disaster recovery capabilities, more robust automation safeguards, and regular testing of our failover procedures.

We recognize that our customers depend on Kustomer for critical business operations, and we take that responsibility seriously. The improvements outlined above represent concrete steps toward faster recovery times and reduced impact from future infrastructure disruptions.

We appreciate the patience of our customers during this incident and remain committed to continuous improvement of our platform's reliability and resilience.

Posted Oct 29, 2025 - 13:18 EDT

Resolved

After careful monitoring, our team has determined that Kustomer platform services are now fully operational and the queued Business Rules items have been successfully processed. Thank you for your patience and understanding during this event.

If you experience any residual issues or have questions, please reach out to Kustomer Support at support@kustomer.com.
Posted Oct 20, 2025 - 06:29 EDT

Monitoring

Kustomer platform services should now be operational, though some Business Rules items may remain delayed while our teams work to retrieve and process any queued events.

AWS has reported significant signs of recovery across affected services and continues to work toward full resolution.

We continue to closely monitor platform performance as AWS progresses with their own recovery efforts, and we will keep sharing updates here as systems stabilize further.

Please expect additional updates within the next 30 minutes. If you have any questions or continue to notice delays, please reach out to Kustomer Support at support@kustomer.com.
Posted Oct 20, 2025 - 06:18 EDT

Identified

Kustomer has identified that the disruption impacting platform services is a result of the significant, widespread outage at Amazon Web Services (AWS). Further details and real-time updates are also available at https://health.aws.amazon.com/health/status

Our teams are actively working to mitigate the impact and restore services as quickly as possible, while maintaining close coordination with AWS as they work to resolve the underlying issue. We’ll continue to post updates here as systems recover and more information becomes available.

Please expect additional updates within the next 30 minutes. For any queries, please reach out to Kustomer Support at support@kustomer.com.
Posted Oct 20, 2025 - 05:48 EDT
This incident affected: Prod1 (US) (Analytics, API, Bulk Jobs, Channel - Chat, Channel - Email, CSAT, Events / Audit Log, Exports, Knowledge base, Notifications, Registration, Search, Tracking, Web Client, Web/Email/Form Hooks, Workflow), Prod2 (EU) (Analytics, API, Bulk Jobs, Channel - Chat, Channel - Email, CSAT, Events / Audit Log, Exports, Knowledge base, Notifications, Registration, Search, Tracking, Web Client, Web/Email/Form Hooks, Workflow), Kustomer Voice & Text (Kustomer Voice), Prod4 (IN) (Analytics, API, Bulk Jobs, Channel - Chat, Channel - Email, CSAT, Events / Audit Log, Exports, Knowledge base, Notifications, Registration, Search, Tracking, Web Client, Web/Email/Form Hooks, Workflow), and Regional Incident.