AI-Powered Analytics Help Global CPaaS Provider Ward Off Customer Issues

By Navid Khajehzadeh on May 24, 2019

callstats.io helped a global cloud communications platform provider proactively detect and isolate a critical service issue. Our AI-driven analytics determined that users in Asia were connecting to servers in North America, potentially impairing call quality and customer satisfaction. callstats.io automatically notified the provider of a dramatic fluctuation in round-trip times before customers reported service quality issues.

A leading Communications Platform as a Service (CPaaS) provider lets developers easily embed WebRTC calling capabilities directly into their applications.  The provider operates data centers on five different landmasses to enable global performance, resiliency and scalability.  The service is designed to connect applications to the nearest CPaaS data center to minimize network latency and optimize call quality.

 

Figure: CPaaS point-to-point application

 

The CPaaS provider uses callstats.io to monitor and analyze WebRTC metrics, accelerate troubleshooting and improve customer satisfaction.  Our AI-powered analytics help the provider intelligently detect and diagnose networking problems and potential call quality issues by correlating statistics and inferring potential root causes. By way of example, we recently helped the provider identify and isolate a core service issue that had the potential to impair voice quality; users in Asia were being routed to the North American data center instead of the Asian data center.

Sudden Changes in Round-Trip Time Indicated Potential Service Issues

callstats.io passively monitors WebRTC sessions, collecting hundreds of data points throughout every call. One of the many useful metrics the product tracks is round-trip time (RTT)—the time it takes for a packet to travel from a sending endpoint to a receiving endpoint and back. In the specific case of the cloud communications platform, callstats.io monitors RTT from a customer application to a server in a CPaaS data center and back, as shown below.

 

Figure: Round-trip time for application-to-server

 

Long RTTs indicate network delays that can create gaps in a conversation, generate echo or cause callers to talk over each other.  They can occur for a variety of reasons including network congestion and performance problems, IP routing issues, or CPaaS architectural constraints or configuration mishaps.

In March, callstats.io reported a distinct change in the provider’s global RTT statistics; daily average system-wide RTTs began fluctuating dramatically on February 14th and nearly doubled over the next four weeks, as shown in the callstats.io dashboard screen captures below.  Clearly some sort of incident occurred in mid-February.

 

Figure: callstats.io dashboard notification

 

Figure: callstats.io dashboard RTT stripchart

 

Figure: callstats.io dashboard notification
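To make the idea of a "distinct change" in daily average RTT concrete, here is a minimal sketch of one way such a shift could be flagged: compare the mean of the most recent days against the mean of a preceding baseline window. This is not a description of how callstats.io detects changes. The function names, the 30-day and 7-day windows, and the 1.5x threshold are all assumptions chosen for illustration, and the sample series is made up.

```python
from statistics import mean


def rtt_shift_ratio(daily_avg_rtt_ms, baseline_days=30, recent_days=7):
    """Ratio of the recent mean RTT to the mean of the preceding baseline window."""
    needed = baseline_days + recent_days
    if len(daily_avg_rtt_ms) < needed:
        raise ValueError(f"need at least {needed} daily averages")
    window = daily_avg_rtt_ms[-needed:]
    baseline = window[:baseline_days]
    recent = window[baseline_days:]
    return mean(recent) / mean(baseline)


def has_rtt_shift(daily_avg_rtt_ms, threshold=1.5, **windows):
    """True when recent RTTs exceed the baseline by the (assumed) threshold factor."""
    return rtt_shift_ratio(daily_avg_rtt_ms, **windows) >= threshold


# Made-up series: 30 quiet days followed by 7 days at double the RTT.
series = [80.0] * 30 + [160.0] * 7
print(rtt_shift_ratio(series))  # 2.0
print(has_rtt_shift(series))    # True
```

A ratio test like this is deliberately crude. It ignores weekly seasonality and noisy days, so a production detector would likely need something more robust.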

 

A detailed breakdown revealed a significant increase in excessively long RTTs (RTTs > 200ms) in the week following the incident, as shown in the table below. (As a general rule of thumb, RTTs of 200ms or longer may indicate potential call quality issues.)

Prior to the February 14th incident, long RTTs represented only 5% of total RTTs.  After the incident they represented 31% of RTTs.

 

 
| Range | 30 Days Prior to Incident (% of RTTs) | 7 Days After Incident (% of RTTs) |
| --- | --- | --- |
| 0 < RTT < 100ms | 83% | 45% |
| 100ms < RTT < 200ms | 12% | 24% |
| 200ms < RTT | 5% | 31% |
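A breakdown like the one in the table above can be produced from raw RTT samples with a small bucketing routine. The sketch below is illustrative only. The function names are hypothetical, and the treatment of the boundary values is an assumption, since the table does not say which range exactly 100ms or exactly 200ms falls into. Here, 200ms counts as long, in line with the "200ms or longer" rule of thumb.

```python
from collections import Counter

# Bucket labels, ordered from shortest to longest RTT.
BUCKETS = ("under 100 ms", "100 to 200 ms", "200 ms or more")


def rtt_bucket(rtt_ms):
    """Map one RTT sample to its range (boundary handling is an assumption)."""
    if rtt_ms < 100:
        return BUCKETS[0]
    if rtt_ms < 200:
        return BUCKETS[1]
    return BUCKETS[2]


def rtt_distribution(rtts_ms):
    """Percentage of samples that fall into each RTT range."""
    if not rtts_ms:
        raise ValueError("no RTT samples")
    counts = Counter(rtt_bucket(r) for r in rtts_ms)
    return {label: 100.0 * counts[label] / len(rtts_ms) for label in BUCKETS}


# Synthetic samples shaped like the "7 days after" column.
samples = [50] * 45 + [150] * 24 + [250] * 31
print(rtt_distribution(samples))
```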

 

AI-Driven Analytics Pinpoint Root Cause of Incident

Our AI-driven analytics helped pinpoint the root cause of the problem by identifying distinct “clusters” of statistics associated with long RTTs. Each cluster includes a collection of network statistics (loss, jitter, geographical distance from app to server) that might point to a particular root cause.

For example, a statistical cluster with high loss, high jitter and short distance might indicate a network congestion issue; a statistical cluster with low loss, low jitter and short distance might point to an IP routing issue (e.g., there is an excessive number of network hops between the application and the server); and a statistical cluster with low loss, low jitter and long distance might indicate an issue related to the cloud communications platform configuration or architecture (e.g., applications are connecting to a distant server).
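The three example clusters above can be written down as simple rules, which helps show how loss, jitter and distance combine to suggest a root cause. This is a hand-written rule sketch, not the clustering callstats.io actually performs. The `RttSample` type, the function names and every threshold value are hypothetical and chosen only to make the example runnable.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RttSample:
    loss_pct: float      # packet loss, percent
    jitter_ms: float     # jitter, milliseconds
    distance_km: float   # geographic distance from app to server


# Hypothetical cut-offs; real values would have to come from the data.
HIGH_LOSS_PCT = 2.0
HIGH_JITTER_MS = 30.0
LONG_DISTANCE_KM = 5000.0


def label_cluster(s):
    """Assign a long-RTT sample to one of the three example clusters."""
    degraded = s.loss_pct >= HIGH_LOSS_PCT and s.jitter_ms >= HIGH_JITTER_MS
    clean = s.loss_pct < HIGH_LOSS_PCT and s.jitter_ms < HIGH_JITTER_MS
    far = s.distance_km >= LONG_DISTANCE_KM
    if degraded and not far:
        return "congested network"
    if clean and not far:
        return "IP routing"
    if clean and far:
        return "call routing"
    return "unclassified"  # mixed signals do not match any example cluster


def cluster_shares(samples):
    """Percentage of long-RTT samples per cluster label."""
    counts = Counter(label_cluster(s) for s in samples)
    return {label: 100.0 * n / len(samples) for label, n in counts.items()}


long_rtts = [RttSample(0.1, 5.0, 11000.0)] * 3 + [RttSample(5.0, 50.0, 300.0)]
print(cluster_shares(long_rtts))
```

Fixed thresholds are the main weakness of this approach. A statistical clustering method can find the group boundaries from the data instead of relying on guessed cut-offs.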

Our detailed AI-driven analysis revealed that after the incident occurred, the vast majority of the long RTTs (92%) were associated with a “call routing” cluster: applications in Asia were connecting to servers in N. America, potentially indicating a communications platform service issue related to call setup. (See the table below for details.)

 

 
| Statistical cluster | Before Incident (% of long RTTs) | After Incident (% of long RTTs) |
| --- | --- | --- |
| Congested-network statistical cluster (high jitter, high loss, short distance) | 1% | 4% |
| IP routing statistical cluster (low jitter, low loss, short distance) | 82% | 3% |
| Call routing statistical cluster (Asian apps hitting N. America servers) | 10% | 92% |

 

The daily traffic summary charts for Asia, shown below, confirm that beginning in the middle of February, most calls originating from Asia were directed to servers outside of Asia, which almost certainly caused the long RTTs.

 

Figure: Daily CPaaS server traffic

 

An examination of global CPaaS server traffic also revealed that calls handled by Asian servers declined sharply in the middle of February while calls handled by N. American servers rose sharply, further confirming that calls originating from Asia were directed to distant servers.

Figure: Analysis of CPaaS server traffic before/after incident
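The traffic pattern described above comes down to one number per day: the share of calls from a region that were served somewhere else. Below is a minimal sketch of that calculation, assuming call records are available as `(day, origin_region, server_region)` tuples. The record layout, the region names and the sample data are all made up for illustration.

```python
from collections import defaultdict


def out_of_region_share(calls, origin):
    """Per-day fraction of calls from `origin` served by a server in another region."""
    totals = defaultdict(int)
    remote = defaultdict(int)
    for day, origin_region, server_region in calls:
        if origin_region != origin:
            continue
        totals[day] += 1
        if server_region != origin_region:
            remote[day] += 1
    return {day: remote[day] / totals[day] for day in sorted(totals)}


# Made-up records: mostly local before the change, mostly remote after it.
calls = (
    [("02-13", "asia", "asia")] * 9
    + [("02-13", "asia", "north_america")] * 1
    + [("02-15", "asia", "asia")] * 2
    + [("02-15", "asia", "north_america")] * 8
    + [("02-15", "europe", "europe")] * 5
)
print(out_of_region_share(calls, "asia"))
```

A sudden jump in this share, with no matching change in where calls originate, points toward server selection rather than the network path.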

 

Based on the insights provided by callstats.io, the provider isolated the problem to a configuration change in their IP Geolocation database, and reconfigured it to ensure Asian users are connected to the Asian data center.
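One way to guard against this kind of misconfiguration is a sanity check that compares the data center a client is mapped to against the geographically nearest one. The sketch below is an assumption-laden illustration, not the provider's actual mechanism. The data center coordinates are hypothetical placeholders, the function names are invented, and the figure of roughly 200 km per millisecond for light in optical fiber gives only a lower bound on RTT, ignoring routing detours and processing delay.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # light covers roughly 200 km per millisecond in fiber

# Hypothetical data center locations as (latitude, longitude).
DATA_CENTERS = {
    "asia": (1.35, 103.82),
    "north_america": (39.04, -77.49),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_data_center(client):
    """Name of the data center with the shortest great-circle distance to the client."""
    return min(DATA_CENTERS, key=lambda name: haversine_km(client, DATA_CENTERS[name]))


def min_rtt_ms(distance_km):
    """Propagation-only lower bound on round-trip time over fiber."""
    return 2 * distance_km / FIBER_KM_PER_MS


tokyo = (35.68, 139.69)  # example client location
for name, location in DATA_CENTERS.items():
    d = haversine_km(tokyo, location)
    print(f"{name}: {d:.0f} km, RTT floor {min_rtt_ms(d):.0f} ms")
print("expected data center:", nearest_data_center(tokyo))
```

Propagation delay alone puts a transpacific round trip well above an in-region one before any queuing or processing is added, which fits with the long RTTs seen when Asian users reached North American servers.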

This CPaaS anecdote provides a great example of how our AI-driven analytics can help you efficiently detect, isolate and resolve real-time communications network problems and call quality issues. Using callstats.io, the communications platform provider was able to identify and address their configuration mishap before customers reported service quality issues.

Tags: Artificial Intelligence