On November 14th, 2024, some customers experienced increased latency and intermittent failures in chat message delivery. The immediate issue has been resolved; some preventive steps have already been taken and additional measures are planned to prevent recurrence.
Root Cause
Tasks in a key chat service within our PROD1 environment consumed excessive memory and became unhealthy, which delayed message processing and caused intermittent delivery failures for some customers. The AWS scaling mechanism for the service was triggered only by CPU utilization, so it did not detect a problem that stemmed from memory constraints. Redeploying the service resolved the immediate problem. To prevent a recurrence, we have migrated the affected service to a more efficient processing architecture, and we are still investigating the underlying cause of the memory spike.
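To illustrate why a CPU-only trigger can miss this kind of failure, here is a minimal sketch in Python. It is not our production scaling configuration: the `TaskMetrics` type, the threshold values, and both decision functions are hypothetical names chosen for illustration. It only shows that tasks with low CPU and high memory usage pass a CPU-only check but are caught once memory is also considered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskMetrics:
    """Utilization of one service task, in percent (hypothetical type)."""
    cpu_percent: float
    memory_percent: float


# Illustrative thresholds only; real values depend on the service.
CPU_THRESHOLD = 70.0
MEMORY_THRESHOLD = 80.0


def _average(values: List[float]) -> float:
    # Treat an empty fleet as 0% utilization rather than dividing by zero.
    return sum(values) / len(values) if values else 0.0


def should_scale_out_cpu_only(tasks: List[TaskMetrics]) -> bool:
    """Scale-out decision that looks at average CPU utilization only."""
    return _average([t.cpu_percent for t in tasks]) >= CPU_THRESHOLD


def should_scale_out(tasks: List[TaskMetrics]) -> bool:
    """Scale-out decision that reacts to either CPU or memory pressure."""
    avg_cpu = _average([t.cpu_percent for t in tasks])
    avg_mem = _average([t.memory_percent for t in tasks])
    return avg_cpu >= CPU_THRESHOLD or avg_mem >= MEMORY_THRESHOLD


if __name__ == "__main__":
    # Tasks that are nearly idle on CPU but close to their memory limit.
    memory_bound = [TaskMetrics(20.0, 95.0), TaskMetrics(25.0, 92.0)]
    print("CPU-only trigger fires:", should_scale_out_cpu_only(memory_bound))
    print("CPU-or-memory trigger fires:", should_scale_out(memory_bound))
```

Combining the two signals with a logical OR is one simple design choice; separate policies per metric or a memory-based health alarm would serve the same purpose.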
Nov 14, 2024
Nov 15, 2024
Lessons/Improvements
Future Mitigation: