Context:
System reliability has always been our top priority, and as our usage has doubled over the past year, we've been working proactively to ensure our platform doesn't hit the bottlenecks that all systems eventually do. Given the large number of moving parts in a VOIP service, infrastructure automation is key, so we've made several key changes since Q3 toward this objective: improving automated health checks, lowering health-check thresholds, periodically swapping out servers, auto-scaling our services, and improving logging, measurement, and anomaly detection. We also launched our global V2 network to move all of our supported call routes onto Tier 1 traffic in all of our supported regions.
We share this to highlight that uptime/quality remains a north star for our operations and infrastructure roadmap.
What's been happening in APAC
The APAC region has suffered not because of a systemic reliability issue, but because of bad luck in the type and timing of our most recent releases. As a company that operates globally, we see varying levels of feature usage in each region and plan releases by running canaries in each region based on expected impact. Usually the affected regions vary from release to release, but the latest changes happened to be heavily concentrated in APAC.
The root causes of the last three issues in order were:
Failure at AWS with autoscaling groups (AWS incident)
- this was resolved by following our disaster recovery process, within the RTO stated in our SOC 2 report. It was a one-off incident that hasn't occurred before or since.
A bad database migration (Truly core app issue)
- this was run off hours across all regions, without targeting any one region. The problem was that once the issue was reported, the read replicas in Australia took far longer to update than those in other regions (due to their distance from the primary DB in AWS Virginia and the DB workload/size). The RCA found that our staging DB was sampled down from production too aggressively to hit the same choke point in the database index being rebuilt by the production change. The remediation involved two process changes on our end: loading a fresh dump of the full production database into staging whenever we test database migrations, and temporarily disabling database replication for any migration that changes indexes.
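For illustration, the staging-refresh step could be scripted roughly as below. This is a hedged sketch only: the connection strings, file path, and helper function are hypothetical, and it assumes a PostgreSQL setup, which the RCA above does not actually specify.

```python
import subprocess

# Hypothetical connection strings; real values would come from secrets management.
PROD_DSN = "postgresql://replica.prod.example.internal/app"
STAGING_DSN = "postgresql://staging.example.internal/app"


def refresh_staging_cmds(prod_dsn: str, staging_dsn: str) -> list[list[str]]:
    """Build a pg_dump/pg_restore sequence that copies a full production
    snapshot into staging before rehearsing a database migration."""
    return [
        # Full (not sampled) dump in pg_dump's custom archive format.
        ["pg_dump", "--format=custom", "--file=/tmp/prod_full.dump", prod_dsn],
        # Drop-and-recreate objects in staging from that archive.
        ["pg_restore", "--clean", "--if-exists",
         "--dbname=" + staging_dsn, "/tmp/prod_full.dump"],
    ]


if __name__ == "__main__":
    for cmd in refresh_staging_cmds(PROD_DSN, STAGING_DSN):
        subprocess.run(cmd, check=True)
```

Because the refresh uses a full dump rather than a sampled one, staging sees the same index sizes as production, so an index rebuild hits the same choke points before the migration ever ships.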
Failed Puppet config load (Truly infrastructure issue)
- a race condition during instance rotation triggered an AWS Lambda function twice, changing a SIP load balancer IP address in the middle of a configuration update by our infrastructure automation service, Puppet. This left the instance with the wrong IP address, which caused a fraction of calls set up through that single instance to drop after about 60 seconds due to a failed SIP timer ping between our media server and VOIP application. This also explains why the issue was limited to desktop rather than mobile, and why it was intermittent (we have three redundant load balancers in the APAC region).
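A common guard against this kind of double trigger is an idempotency check keyed on the event id. The sketch below is illustrative only, not our actual remediation: the function and field names are hypothetical, and a real Lambda deployment would keep the seen-set in a shared store (e.g. conditional writes to a database) rather than in process memory, since concurrent Lambda instances don't share state.

```python
import functools
import threading

# In-memory stand-in for a shared idempotency store.
_seen_event_ids: set[str] = set()
_lock = threading.Lock()


def run_once(handler):
    """Skip duplicate invocations carrying the same event id, so a
    double-fired trigger cannot apply the same change twice."""
    @functools.wraps(handler)
    def wrapper(event, context=None):
        event_id = event["id"]
        with _lock:
            if event_id in _seen_event_ids:
                # Duplicate delivery: acknowledge without acting.
                return {"skipped": True, "id": event_id}
            _seen_event_ids.add(event_id)
        return handler(event, context)
    return wrapper


@run_once
def rotate_sip_lb(event, context=None):
    # Placeholder for the actual load balancer IP rotation logic.
    return {"skipped": False, "id": event["id"]}
```

With a guard like this, the second firing of the trigger becomes a no-op instead of rewriting the load balancer address mid-update.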
Aside from the process changes within each RCA:
- we are taking extra precautions to minimize changes in that region so we don't stress the user experience any further than necessary.
- we are changing our release process to make our status page the single place for communicating all infrastructure-related changes, including proactive email notification of upcoming system maintenance and updates throughout the maintenance window.
Other Process Changes
Aside from these RCAs, we've made several process changes within our support team:
- Customers now see an accurate time to response. A limitation of our support system displayed incorrect response times to EMEA/APAC customers, which caused users to escalate to their admins instead of submitting tickets.
- Users can submit incidents with one click: this automatically triggers a PagerDuty alert to our support team 24/7 and ensures any incident is recognized much faster than our standard support response time.
- Proactive monitoring: although our standard support hours are business hours, M-F, we've extended coverage into the weekend to detect any fallout from Friday releases before the start of the business week.
We hope this sheds some light on our current roadmap and operations, and gives you additional confidence that we expect a quiet close to Q4 and that our uptime will only continue to improve. If you have any questions, please don't hesitate to reach out to us at support@truly.co.