Begin: 2022-12-16 15:00 UTC
End: 2022-12-16 15:35 UTC
First customer report: 2022-12-16 15:08 UTC
Presentation
Login to the Truly web application failed in US1. When users attempted to log in, they were immediately returned to the login screen.
Impact
- No data loss.
- ~13% of login requests failed in US1.
- ~45% of requests to core-api, in general, failed in US1.
Remediation
Increased the core-api service's memory allocation and scaling velocity.
Root Cause
- On 2022-12-15, the Truly core-api service was migrated to new infrastructure and adjusted to scale based on the volume of requests to the service. Traffic then increased dramatically in a short time period, leaving too few instances of core-api to serve the requests.
- As traffic increased, the core-api service began scaling out to meet the demand. However, the increased load was spread across too few instances, which over-utilized memory and caused existing instances to terminate.
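
The gap between the instances the scaler wanted and the instances actually running can be illustrated with a rough sketch. It assumes a simple request-based rule (desired instances = request rate divided by a per-instance target, rounded up). The function names and every number below are hypothetical and are not measurements from this incident or the actual core-api configuration.

```python
import math


def desired_instances(requests_per_second: float, target_per_instance: float) -> int:
    """Instance count a request-based scaler would aim for (never below 1)."""
    return max(1, math.ceil(requests_per_second / target_per_instance))


def load_per_instance(requests_per_second: float, running_instances: int) -> float:
    """Requests per second each running instance has to absorb."""
    return requests_per_second / running_instances


# Hypothetical values chosen only to illustrate the failure mode.
TARGET_PER_INSTANCE = 100.0  # assumed scaling target, req/s per instance
steady_rps = 200.0           # assumed traffic before the spike
spike_rps = 1000.0           # assumed traffic shortly after the spike

# Instances running when the spike arrives (sized for steady traffic).
running = desired_instances(steady_rps, TARGET_PER_INSTANCE)

# Instances the scaler wants once it observes the spike.
wanted = desired_instances(spike_rps, TARGET_PER_INSTANCE)

# Until new instances are ready, the running ones carry the whole spike.
actual_load = load_per_instance(spike_rps, running)

print(f"running={running} wanted={wanted} load_per_instance={actual_load:.0f} req/s")
```

In this sketch each running instance briefly carries several times its target load while new instances start. If memory use grows with load, that window is when instances are most likely to run out of memory and terminate, which shrinks the pool further.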
How we will avoid this in the future
- Configure the service to target a lower volume of requests per instance so that the instances scale earlier as traffic increases.
- Allocate more memory to the services to reduce the likelihood of instance terminations during times of heavy load.
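
The effect of both changes can be sketched under stated assumptions. The example assumes that scale-out triggers once traffic exceeds the instance count multiplied by the per-instance target, and that per-instance memory grows linearly with load. The linear memory model, the function names, and all values are hypothetical illustrations, not the real core-api settings.

```python
def scale_out_threshold(current_instances: int, target_per_instance: float) -> float:
    """Request rate above which the scaler would want one more instance."""
    return current_instances * target_per_instance


def memory_needed_mb(load_rps: float, base_mb: float = 256.0, mb_per_rps: float = 2.0) -> float:
    """Assumed linear memory model: fixed baseline plus a per-request cost."""
    return base_mb + mb_per_rps * load_rps


def survives(load_rps: float, memory_limit_mb: float) -> bool:
    """True if the instance stays within its memory limit at this load."""
    return memory_needed_mb(load_rps) <= memory_limit_mb


# Lower target: with 2 instances, scale-out starts at a lower request rate.
old_threshold = scale_out_threshold(2, target_per_instance=100.0)
new_threshold = scale_out_threshold(2, target_per_instance=60.0)

# More memory: the same temporary overload no longer exceeds the limit.
overload_rps = 300.0  # hypothetical per-instance load during a spike
fits_old_limit = survives(overload_rps, memory_limit_mb=512.0)
fits_new_limit = survives(overload_rps, memory_limit_mb=1024.0)

print(f"scale-out threshold: {old_threshold:.0f} -> {new_threshold:.0f} req/s")
print(f"survives overload: {fits_old_limit} -> {fits_new_limit}")
```

The two measures are complementary: a lower target shortens the time instances spend overloaded, and a larger memory allocation gives them room to stay up during that time.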