[CHAT] [LATENCY] [PROD1]
Incident Report for Kustomer
Postmortem

Post Mortem: Chat Latency Issues on November 14, 2024

Summary

On November 14, 2024, some customers experienced delayed chat messages and intermittent message delivery failures. The immediate issue was resolved the same day. Some preventive steps have already been taken, and additional measures are planned to prevent recurrence.

Root Cause

Tasks in a key chat service in our PROD1 environment consumed excessive memory and became unhealthy, which delayed message processing and caused intermittent failures for some customers. The AWS scaling mechanism triggers on CPU, so it did not detect a problem that stemmed from memory constraints. Redeploying the service resolved the immediate problem. To reduce the risk of recurrence, we migrated the affected service to a more efficient processing architecture, and we are still investigating the underlying cause of the memory spike.
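To illustrate why a CPU-only trigger can miss this kind of failure, here is a minimal sketch of a scaling decision that considers both CPU and memory. It is not our production configuration: the `TaskMetrics` type, the `desired_task_count` function, and the target values are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    """Utilization of one service task, as fractions between 0.0 and 1.0."""
    cpu: float
    memory: float


def desired_task_count(tasks, cpu_target=0.60, memory_target=0.70):
    """Return a task count that keeps average CPU and memory near their targets.

    The targets are illustrative assumptions, not values from the incident.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    current = len(tasks)
    avg_cpu = sum(t.cpu for t in tasks) / current
    avg_memory = sum(t.memory for t in tasks) / current
    # Scale on whichever resource is furthest above its target.
    pressure = max(avg_cpu / cpu_target, avg_memory / memory_target)
    return max(1, math.ceil(current * pressure))


# Four tasks with low CPU but nearly exhausted memory.
fleet = [TaskMetrics(cpu=0.20, memory=0.95) for _ in range(4)]

# An infinite memory target makes the memory term zero, emulating a CPU-only trigger.
cpu_only = desired_task_count(fleet, memory_target=float("inf"))
combined = desired_task_count(fleet)
```

In this hypothetical fleet, the CPU-only decision sees idle processors and would not add capacity, while the combined decision asks for more tasks because memory is well above its target.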

Timeline

Nov 14, 2024

  • 5:51 PM ET: Customers began reporting chat latency and message delivery issues
  • 5:59 PM ET: Incident was declared, and investigation started
  • 6:45 PM ET: Latency resolved by redeploying the affected components

Nov 15, 2024

  • 11:52 AM ET: Migrated to a more efficient processing architecture

Lessons/Improvements

  • [Done] Processing Architecture Migration: Migrated the affected service to Graviton (enhanced processing architecture) to improve system performance
  • Future Mitigation:

    • Identify improvements that would automatically resolve such problems
    • Audit and refine alerting so that we can respond sooner
Posted Nov 27, 2024 - 11:32 EST

Resolved
Kustomer has resolved an event affecting Chat Channel in PROD 1 that caused long delays in chat messaging or prevented messages from being sent at all.

After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support through Chat or Email if you have additional questions or concerns.
Posted Nov 14, 2024 - 18:58 EST
Monitoring
Kustomer has implemented an update to address an event affecting Chat Channel in PROD 1 that caused long delays in Chat messaging.

Our team is currently monitoring this update to ensure the issue is fully resolved. Please expect further updates within the next 30 minutes, and reach out to Kustomer support through Email or Chat if you have additional questions or concerns.
Posted Nov 14, 2024 - 18:45 EST
Investigating
Kustomer is aware of an event affecting Chats in PROD 1 ONLY that may cause latency in sending messages and may cause some messages not to send.

Our team is currently working to identify the cause of this issue in an effort to implement a resolution. Please expect additional updates within the next 30 minutes, and reach out to Kustomer Support via Email for any further questions or updates.
Posted Nov 14, 2024 - 18:20 EST
This incident affected: Prod1 (US) (Channel - Chat).