All Users in Australia - Inbound/Outbound Calling - Intermittent - Call Drops
Incident Report for Truly
Postmortem

Context:

System reliability has always been our top priority. As our usage has doubled in the past year, we've been proactively working to ensure our platform does not hit any bottlenecks down the line, as all systems eventually do.  Given the large number of moving parts in a VOIP service, infrastructure automation is key, so we've made several key changes since Q3 to work towards this objective.  These include improving automated health checks, lowering health check thresholds, periodically swapping out servers, auto-scaling our services and improving logging/measurement/anomaly detection.  We also launched our global V2 network to push all of our supported call routes to Tier 1 traffic in all of our supported regions.

We share this to highlight that uptime/quality remains a north star for our operations and infrastructure roadmap.

What's been happening in APAC

The APAC region has suffered not because of a systemic reliability issue but because of some bad luck in the type/timing of our most recent releases.  As a company that operates globally, we have varying levels of feature usage in each region, and we plan releases by running canaries in each region according to expected impact.  Usually, the impact of a release varies between regions, but the latest changes happened to be heavily focused on APAC.

The root causes of the last three issues in order were:

  1. Failure at AWS with autoscaling groups (AWS incident)

    1. This was resolved by following our disaster recovery process and was within the RTO shown in our SOC 2 report.  This was a one-off incident that hasn't occurred before or since.
  2. A bad database migration (Truly core app issue)

    1. This was run 'off hours' across all regions without targeting any one region.  The problem was that, once the issue was reported, the read replicas in Australia took far longer to update than those in other regions (due to distance from the primary DB at AWS Virginia and the DB workload/size).  The RCA found that our staging DB was sampled down from prod too much to hit the same choke point in the database index being rebuilt by the production change.  The remediation involved a process change on our end: running a fresh dump of the full production database into staging when testing database migrations, and temporarily disabling database replication during any database migration that changes indexes.
  3. Failed Puppet Config Load

    1. A race condition during instance rotation triggered an AWS Lambda function twice, changing a SIP load balancer IP address in the middle of a configuration update by Puppet, our infrastructure automation service.  This resulted in the instance getting the wrong IP address, which caused a fraction of calls that were set up through that single instance to drop after about 60 seconds due to a failed SIP timer ping between our media server and VOIP application.  This also explains why the issue was limited to desktop instead of mobile, and why the issue was intermittent (we have three redundant load balancers in the APAC region).
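
For readers curious about the third root cause: this report does not go into how the duplicate trigger is now prevented, so the following is only an illustrative sketch of one common guard, making the handler idempotent so that a second invocation for the same rotation event is ignored.  All names here (RotationGuard, handle_rotation, the event ID) are hypothetical, and the in-memory set stands in for whatever shared store a real deployment would need.

```python
import threading


class RotationGuard:
    """Remembers which rotation events were already handled.

    Hypothetical in-memory stand-in; separate function invocations
    would need a shared store with an atomic "insert if absent".
    """

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, event_id):
        """Return True only for the first caller presenting this event_id."""
        with self._lock:
            if event_id in self._seen:
                return False
            self._seen.add(event_id)
            return True


def handle_rotation(guard, event_id, assign_ip):
    """Run assign_ip() at most once per event_id; duplicates are skipped."""
    if not guard.claim(event_id):
        return "skipped-duplicate"
    assign_ip()
    return "applied"


# Simulate the same rotation event being delivered twice.
guard = RotationGuard()
ip_changes = []
first = handle_rotation(guard, "rotation-123", lambda: ip_changes.append("ip-changed"))
second = handle_rotation(guard, "rotation-123", lambda: ip_changes.append("ip-changed"))
```

In this sketch the second delivery leaves the IP address untouched, so a configuration run that is already in progress would not see the address change underneath it.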

Aside from the process changes within each RCA:

  • we are taking extra precautions to minimize changes in that region, so that we do not disrupt the user experience any further than necessary.
  • we are changing our release process to centralize our statuspage as the single place for communicating all infrastructure-related changes.  This will include a proactive email notification about upcoming system maintenance, as well as updates throughout the maintenance period.

Other Process Changes

Aside from these RCAs, we've made several process changes within our support team:

  • Customers now see a more accurate time to response.  A limitation of our support system previously showed the incorrect time to response for EMEA/APAC, which caused users to escalate to their admins instead of submitting tickets.
  • Users can submit incidents with one click: this automatically sets off PagerDuty with our support team 24/7 and ensures any incident is recognized much more quickly than within our standard support time.
  • Proactive Monitoring: although our standard support hours are during business hours M-F, we've extended this into the weekend to more proactively detect the result of Friday releases before the start of the business week.

We hope this sheds some light on our current roadmap/operations and gives you some additional comfort that we aim to have a quiet close to Q4, and that our uptime will continue to improve moving forward.  If you have any questions, please do not hesitate to reach out to us via support@truly.co.

Posted Dec 01, 2022 - 21:51 EST

Resolved
Nov 9 7:35 PM PT - Our team has received reports of inbound and outbound calls dropping after approximately 60 seconds on the web application, and the desktop application in Australia. We are investigating the issue. Check this page for an update within the next 30 minutes.
Nov 9 7:50 PM PT - Our team has identified an issue with inbound and outbound calls dropping after approximately 60 seconds in Australia that appears to impact users in the following ways: inbound and outbound calls dropping after approximately 60 seconds. Our engineering team is actively working on resolving the issue. Anticipated resolution: unknown. Check this page for an update within the next hour.
Nov 9 8:50 PM PT - Our engineering team is still working on resolving the issue. Anticipated resolution: Under 4 Hours. Check this page for an update within the next hour.
Nov 9 9:40 PM PT - Our team has addressed the issue with inbound and outbound calls dropping after approximately 60 seconds in Australia. We are confident in our solution at this point, and will continue to monitor the fix. A postmortem will be provided on this status page within the next 14 days.
Nov 10 6:30 AM PT - After 9 hours of monitoring the issue with inbound and outbound calls dropping after approximately 60 seconds in Australia, we are confident in the solution that was implemented previously. A postmortem will be provided on this status page within the next 14 days.
Posted Nov 09, 2022 - 21:30 EST