North America Truly App Users - Unable to Log In to Truly - Consistent
Incident Report for Truly
Postmortem

Begin: 2022-12-16 15:00 UTC

End: 2022-12-16 15:35 UTC

First customer report: 2022-12-16 15:08 UTC

Presentation

Login to the Truly web application failed in US1. When users attempted login, they were immediately returned to the login screen.

Impact

  • No data loss. 
  • ~13% of login requests failed in US1.
  • ~45% of all requests to core-api failed in US1.

Remediation

Increased the core-api service's memory allocation and scaling velocity.

Root Cause

  1. On 12/15/2022, the Truly core-api service was migrated to new infrastructure and adjusted to scale based on the volume of requests to the service. Traffic then increased dramatically in a short time period, leaving too few instances of core-api to serve the requests.
  2. As traffic increased, the core-api service began scaling out to meet the demand. However, the increased load was distributed across too few instances, which led to memory over-utilization and caused existing instances to terminate.

How we will avoid this in the future

  1. Configure the service to target a lower volume of requests per instance so that the instances scale earlier as traffic increases.
  2. Allocate more memory to the service to reduce the likelihood of instance terminations during times of heavy load.
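
To illustrate the first item, the following is a minimal sketch of request-based scaling, not the actual core-api configuration. The function name, the instance limits, and all traffic figures are hypothetical and chosen only to show how a lower per-instance target makes the service scale out earlier.

```python
import math


def desired_instances(
    requests_per_second: float,
    target_per_instance: float,
    min_instances: int = 1,
    max_instances: int = 100,
) -> int:
    """Return the instance count needed to keep each instance at or below
    the target request rate, clamped to the configured limits.

    All parameters are illustrative assumptions, not values from the incident.
    """
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    needed = math.ceil(requests_per_second / target_per_instance)
    return max(min_instances, min(max_instances, needed))


# Hypothetical traffic level: 900 requests per second.
# With a target of 100 requests per instance, 9 instances are requested.
print(desired_instances(900, 100))
# Lowering the target to 60 requests per instance requests 15 instances
# for the same traffic, so each instance carries less load.
print(desired_instances(900, 60))
```

With the lower target, the same traffic is spread over more instances, which leaves more headroom per instance while new instances are still starting.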
Posted Dec 23, 2022 - 13:27 EST

Resolved
10:00 AM EST - Our team has received reports of users being unable to log in to Truly in North America. We are investigating the issue. Check this page for an update within the next 30 minutes.

10:10 AM EST - Our team has identified an issue with logging in to Truly in North America that appears to impact users in the following ways:

* Unable to log in to Truly

Our engineering team is actively working on resolving the issue. Anticipated resolution is imminent. Check this page for an update within the next hour.

10:30 AM EST - Our team has addressed the issue of not being able to log in to Truly. We are confident in our solution at this point, and will continue to monitor the fix. A postmortem will be provided on this status page within the next 14 days.
Posted Dec 16, 2022 - 10:00 EST