In the last six months, real-time video and audio calling usage has increased many-fold. Many more end users are working from home, and the vagaries of the internet affect their call quality. That means that support, deployment, and product teams globally are handling many more media quality related issues. These issues might be escalations from the end user into chat bots or to agents, or handled within the product itself.
The callstats team's main focus has been on two fundamental aspects:
1. Handling enormous amounts of data
2. Root-cause analytics
Handling big data
When shelter-in-place orders took effect, callstats usage broke through new records: we started handling in a day the traffic that we previously handled in a month. Needless to say, we had several sleepless nights and intermittent outages; to overcome the scaling issues, we threw hardware at the problem and started working on the next version of our infrastructure. More on this will be discussed in forthcoming blog posts; in short, we made two major optimisations: a) one-pass call summarisation and b) storage.
Our current system, which will be EOLed in the coming months, used a two-pass method: events within a short window were correlated in real time, and the rest of the analysis was delegated to the end of the call, where the summarisation would take larger time windows into account. The new system, which will be rolled out in phases, summarises the call as the data is ingested; we have been able to tune the time windows based on the root-cause analysis algorithms.
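To make the one-pass idea concrete, here is a minimal sketch of streaming call summarisation: every stats event is folded into O(1) running aggregates as it is ingested, so the summary is ready at call end with no second pass over the raw events. All class, field, and metric names are illustrative assumptions, not the actual callstats implementation.

```javascript
// One-pass (streaming) call summarisation sketch. Each ingested event
// updates running aggregates; finalize() needs no re-scan of raw data.
class CallSummary {
  constructor(callId) {
    this.callId = callId;
    this.samples = 0;
    this.meanJitterMs = 0; // incremental mean, updated per event
    this.maxJitterMs = 0;
    this.packetsLost = 0;
    this.packetsReceived = 0;
  }

  // Ingest one stats event; constant work per event.
  ingest(event) {
    this.samples += 1;
    this.meanJitterMs += (event.jitterMs - this.meanJitterMs) / this.samples;
    this.maxJitterMs = Math.max(this.maxJitterMs, event.jitterMs);
    this.packetsLost += event.packetsLost;
    this.packetsReceived += event.packetsReceived;
  }

  // At call end the summary is already computed.
  finalize() {
    const total = this.packetsLost + this.packetsReceived;
    return {
      callId: this.callId,
      meanJitterMs: this.meanJitterMs,
      maxJitterMs: this.maxJitterMs,
      lossRate: total ? this.packetsLost / total : 0,
    };
  }
}
```

Because each event is consumed once, memory per call stays constant regardless of call length, which is what makes summarising at ingest time feasible at scale.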
In addition, we rolled out a warm tier into our hot-cold storage: depending on call quality and connectivity related issues, the raw data may be kept available for quick retrieval. Details are discussed in an earlier blog post on hot-warm-cold storage.
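A tiering decision of this kind could be sketched as a simple rule: calls flagged with quality or connectivity issues keep their raw data in the warm tier for quick retrieval, while everything else ages out to cold. The thresholds and field names below are assumptions for illustration, not the production policy.

```javascript
// Illustrative hot-warm-cold tiering rule (thresholds are assumed).
function chooseTier(summary, ageDays) {
  if (ageDays < 1) return "hot"; // recent calls stay in hot storage
  const hadIssues =
    summary.lossRate > 0.02 || summary.connectivityDisruptions > 0;
  if (hadIssues && ageDays < 30) return "warm"; // quick-retrieval window
  return "cold"; // archival storage
}
```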
Root-cause analytics

Given the incredible growth in usage, both our customers and our internal teams need quick and reliable ways to diagnose connectivity and call quality issues. We've added connectivity disruptions, media disruption events, and eMOS to assist with the root-cause analysis. For example, automatically tying the cuts and blanks in an audio stream to an overwhelmed CPU (unable to play back packets fast enough from the jitter buffer), or to packets that arrived too late to be played out, can help communicate the issue more clearly to the end user. Formally, this looks like:
- Define the problem ("call and connectivity issue")
- Collect the data ("callstats")
- Identify possible causal factors
- Identify the root cause(s)
- Recommend or apply a solution
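The "identify possible causal factors" step above can be sketched as correlating a detected disruption with concurrent telemetry, as in the audio example: an overwhelmed CPU versus packets arriving too late. The event shapes and thresholds here are illustrative assumptions, not the callstats algorithm.

```javascript
// Sketch: map an audio disruption to likely root causes by checking
// telemetry captured over the same window (thresholds are assumed).
function diagnoseAudioGap(gap, telemetry) {
  const causes = [];
  if (telemetry.cpuUsage > 0.9) {
    causes.push("overwhelmed CPU: jitter buffer not drained fast enough");
  }
  if (telemetry.lateArrivalRate > 0.05) {
    causes.push("network delay: packets arrived too late to be played out");
  }
  return {
    disruption: gap,
    causes: causes.length ? causes : ["unknown; escalate with raw data"],
  };
}
```

Attaching the winning cause to the disruption event is what lets the product explain the issue to the end user instead of just reporting that audio was cut.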
The possible causal factors, or detections, are available in real time within the application via the callstats API (the callstats.on handler), or highlighted in the automatic diagnosis part of the conference details section. In addition, to narrow down the root causes, we augment the results with data from the Smart Connectivity Tests.
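Wiring these detections into an application follows the callstats.on(event, handler) registration pattern mentioned above. In this sketch, `callstats` is a small stub emitter so the example is self-contained; the actual event names and payload shapes come from the callstats.js API and are assumptions here.

```javascript
// Self-contained sketch of the on(event, handler) pattern: register a
// handler for a detection event and react when one is delivered.
// ("callstats" below is a stub, not the real callstats.js object.)
const handlers = {};
const callstats = {
  on(event, handler) { handlers[event] = handler; },    // register
  emit(event, payload) { handlers[event]?.(payload); }, // simulate delivery
};

const detections = [];
callstats.on("mediaDisruption", (d) => {
  // e.g. show a banner to the end user or attach it to a support ticket
  detections.push(`${d.type} at ${d.timestamp}`);
});

callstats.emit("mediaDisruption", { type: "audioGap", timestamp: 1234 });
```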
For example, several desktop-based products run the Smart Connectivity Tests (SCT) in the background when the end user is not making calls. This helps diagnose issues across calls. At 8x8, for example, we run these tests at regular intervals to check whether the end user has sufficient capacity to carry the audio and video, whether there is any cross traffic, and whether the issue is closer to the customer or on a particular path. The main goal is to identify whether the end user would have an issue if they made a call right now.
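A background test loop of this shape could look like the sketch below: probe at a regular interval while the end user is idle, and turn the raw measurements into a "could they call right now?" verdict. The probe fields, thresholds, and function names are all assumptions for illustration.

```javascript
// Turn raw probe measurements into a connectivity verdict
// (field names and thresholds are assumed, not the SCT internals).
function assessProbe(probe) {
  return {
    sufficientCapacity: probe.availableKbps >= 320, // assumed audio+video floor
    crossTraffic: probe.crossTrafficKbps > 50,      // assumed noise threshold
    issueNearCustomer: probe.firstHopLossRate > probe.pathLossRate / 2,
  };
}

// Run the probe on a timer, skipping intervals where a call is active
// so the test never competes with live media.
function startBackgroundSCT({ inCall, runProbe, onVerdict, intervalMs }) {
  return setInterval(async () => {
    if (inCall()) return;
    onVerdict(assessProbe(await runProbe()));
  }, intervalMs);
}
```

Keeping the verdict logic (`assessProbe`) separate from the scheduling makes the pass/fail thresholds easy to tune without touching the timer code.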
This data is now available to the end user in the 8x8 application, and to the IT administrator in the callstats dashboard. We are looking at how these events and analyses can help in your workflow: set appropriate expectations with the customer, and build adaptive products.
Do not hesitate to send us feedback at email@example.com.