Problems in the WebRTC media pipeline can affect the quality of experience (QoE) for users of a real-time communications service. These problems manifest in many ways, including echo, clicking or garbled audio and pixelated, blurred or unsynchronized video. In this post, I’ll describe the media pipeline, the problems that can occur during a WebRTC call and the ways in which these problems affect QoE. With these insights, engineers can better classify and diagnose QoE problems.
Quality of Experience in WebRTC Communications
As formally defined by Qualinet, quality of experience is “The degree of delight or annoyance of the user for an application or service.” All communications services strive to delight their users. But unexpected problems can turn an experience from delightful to annoying. The WebRTC media pipeline is one of several potential sources of annoyance. A malfunction in this complex system can make audio sound noisy or distorted and video appear pixelated or frozen.
When these annoyances occur, users have to work harder to carry on their communication. If the annoyance is severe enough, communication may become impossible and users may abandon the service. Of course, when this happens, we want to know what went wrong so we can fix the problem or somehow avoid it.
Answering this question isn’t easy because a good experience with a WebRTC-based communications service depends upon many complex interconnected systems performing correctly - network, infrastructure and endpoint. With each of these systems being composed of multiple components, isolating a problem can be difficult, even with the best metrics and diagnostics. However, problems in the WebRTC media pipeline are often manifested in specific types of annoyances. With this knowledge, we can begin to isolate QoE problems to a specific pipeline system and component.
What is the WebRTC Media Pipeline?
Broadly speaking, the WebRTC media pipeline is composed of the audio and video processing components embedded in two or more communicating endpoints and the network that connects them. More specifically, it is a complex cascade of discrete software and hardware components that convert analog signals to digital network packets and vice-versa, as shown in the figure. The endpoint components are controlled by the operating system of the device (PC, mobile device, etc.) and include cameras and microphones (embedded or externally connected), the web browser and network interfaces. The network components include IP switches, routers, firewalls and other devices through which media packets may travel from sender to receiver.
A malfunction by any component can introduce communication annoyances into the pipeline. For simplicity, we present only a peer-to-peer media session. Multiparty sessions and sessions that are processed by infrastructure (e.g. selective forwarding unit or multipoint control unit) add even more complexity to the pipeline and additional types of annoyances.
Before we describe the annoyances associated with each component, let’s first discuss the operation of the pipeline.
The process for voice/video transmission through the sender pipeline includes these functions: The web browser requests access to the media input device (microphone or camera). The input device captures an audio or video signal in the RAW or RGB formats, respectively, and streams it via the operating system to the web browser.
In the case of audio, the signal is passed through the echo canceller and noise reduction components, which are embedded in the web browser. In the case of video, an image enhancement component is used to remove video noise. The signal is then sent to the encoder, where the raw data are compressed and encoded to either audio samples or video frames. WebRTC uses the Opus or ITU-T G.711 codecs for audio streams and ITU-T H.264 or VP8 codecs for video streams. If the device has an embedded hardware encoder, it can be used to accelerate the encoding process. Then, the encoder transfers samples (frames) to the packetizer which fragments the frames, if necessary, and encodes them into data packets by adding application-specific headers.
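The fragmentation step performed by the packetizer can be sketched as follows. This is a minimal illustration, not the real RTP payload format: the 8-byte header layout (frame id, fragment index, fragment count), the 1200-byte payload limit and the function names are all assumptions made for the example. The inverse function mirrors what the receiver's de-packetizer does.

```python
import struct

# Hypothetical 8-byte header: frame id, fragment index, fragment count.
# This is NOT the real RTP header layout; it only illustrates the idea.
HEADER = struct.Struct("!IHH")


def packetize(frame_id: int, frame: bytes, max_payload: int = 1200) -> list[bytes]:
    """Split one encoded frame into packets carrying at most max_payload bytes each."""
    chunks = [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)] or [b""]
    return [HEADER.pack(frame_id, index, len(chunks)) + chunk
            for index, chunk in enumerate(chunks)]


def depacketize(packets: list[bytes]) -> bytes:
    """Strip the headers and reassemble the frame; packets may arrive in any order."""
    parts = {}
    expected = 0
    for packet in packets:
        _, index, count = HEADER.unpack(packet[:HEADER.size])
        parts[index] = packet[HEADER.size:]
        expected = count
    if len(parts) != expected:
        raise ValueError("frame is incomplete")
    return b"".join(parts[i] for i in range(expected))
```

A large video frame is split into several packets, while a small audio sample typically fits in one. If any fragment is missing, the whole frame cannot be reconstructed, which is one reason packet loss is so visible in video.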
The next component in the system is the sender buffer, which is part of the network interface controller. The browser passes packets to it, via the operating system. The sender buffer is responsible for scheduling a packet for transmission.
Packets traverse the network through several routers, switches and firewalls, and potentially over a mix of transmission media (e.g., wireless or optical fibre). As packets flow through the network, they experience transmission and queuing delays, and may get lost or corrupted.
On the receiver device, we have the reverse process. The network interface receives packets and transfers them via the operating system to a jitter buffer (for audio, the buffer is also called NetEQ) implemented in software by the browser. The jitter buffer reorders the packets and smooths out variations in their arrival times (jitter), if necessary. It also drops packets that arrive too late to be played out.
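To make the reordering and late-drop behaviour concrete, here is a toy jitter buffer. It is a sketch under simplifying assumptions, not how NetEQ or any browser implements it: the class and method names are hypothetical, playout is driven by sequence number rather than by time, and sequence-number wraparound is ignored.

```python
import heapq


class JitterBuffer:
    """Toy jitter buffer: reorders packets by sequence number and drops late ones."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, bytes]] = []
        self._next_seq = 0  # sequence number due at the next playout slot
        self.dropped = 0    # packets that arrived after their playout slot

    def push(self, seq: int, payload: bytes) -> None:
        if seq < self._next_seq:
            self.dropped += 1  # too late to be played out
            return
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        """Return the payload for the next playout slot, or None if it is missing."""
        seq = self._next_seq
        self._next_seq += 1
        while self._heap and self._heap[0][0] < seq:
            heapq.heappop(self._heap)  # discard leftover duplicates of played packets
        if self._heap and self._heap[0][0] == seq:
            return heapq.heappop(self._heap)[1]
        return None
```

A `None` result from `pop` corresponds to a gap that the decoder has to conceal. Real jitter buffers also adapt how long they hold packets, trading extra delay for fewer late drops.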
Packets from the jitter buffer are passed to the browser’s de-packetizer, which removes application specific headers and combines packets to reconstruct the samples (frames). Next, the samples (frames) are passed to the decoder, which decompresses them into raw data. Finally, the browser transfers the raw data to the output device (i.e., a speaker or a screen) via the operating system so it can be played out. In the case of missing packets, the decoder may also mask their effects by generating an approximation of missing data using packet loss concealment (PLC) techniques.
Annoyances Commonly Introduced by the Media Pipeline
Clearly, media pipelines are complex systems in which all components must work well together to deliver good QoE. Misbehaviour by a single component can have a profound effect on overall quality. For example, if the voice activity detector (part of the encoder) incorrectly classifies a period when a user is speaking as silence, the sender device may generate clipped audio (i.e., the front end or tail end of a word is cut off).
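The clipping effect can be illustrated with a deliberately simple energy-threshold detector. This is only a sketch: real voice activity detectors are far more elaborate, and the function name, per-frame energy values and thresholds below are all hypothetical.

```python
def detect_voice(frame_energies: list[float], threshold: float) -> list[bool]:
    """Mark each frame as speech (True) or silence (False) by its energy."""
    return [energy >= threshold for energy in frame_energies]


# Hypothetical energies for one spoken word: soft onset, loud middle, soft tail.
word = [0.02, 0.10, 0.60, 0.80, 0.15, 0.03]

well_tuned = detect_voice(word, threshold=0.05)
too_strict = detect_voice(word, threshold=0.20)
```

With the stricter threshold, the soft onset and tail of the word are classified as silence and never transmitted, which the listener hears as front-end and tail-end clipping.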
Furthermore, modern real-time communication devices operate in a shared resource environment (e.g., PCs), in which the memory and CPU used by pipeline components are not entirely dedicated to them. Heavy load from other applications may cause some pipeline components to run more slowly than expected, thus affecting QoE.
Media pipeline misbehaviours are manifested as two types of annoyances detectable by end users: audio and video.
Generally, audio annoyances are divided into two broad categories based on their symptoms:
Noise problems are caused by any unwanted sound on the line in addition to the voice signal. In most situations, noise does not make the communication unintelligible, but it usually causes user discomfort. For example, if humming is present, both parties are able to talk to each other, but they must make an extra effort to do so.
Noise annoyances include lack of background comfort noise (absolute silence), clicking, crackling, crosstalk, hissing, humming, tone popping, screeching and static. These problems are usually caused by a malfunctioning voice activity detector (VAD) or the presence of electrical interference.
Voice distortion problems include echo problems, garbled voice and volume distortions. Most of us have experienced echo, which occurs when a reflected copy of the speaker’s voice is played back after some lag time. This is usually caused by malfunctioning echo cancellers and it can be magnified by high latency between talking parties.
Garbled voice problems include choppy voice, clipped voice, robotic voice, and synthetic voice. Choppy voice occurs when multiple packets are lost or severely delayed and the codec, unable to mitigate the effects, inserts silence. Clipped voice is typically caused by a misbehaving VAD, as described earlier.
Robotic voice is generally caused by a high rate of packet losses or discards due to excessive network jitter. The digital signal processing techniques featured in modern codecs, such as Opus, try to conceal these losses by inserting a predicted sample into the media stream. This produces the synthetic or robotic sounding voice experienced by the user.
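A very basic form of loss concealment, repeating the previous frame at reduced volume, shows why concealed audio sounds unnatural. This is a sketch of that simple repetition technique, not the prediction Opus actually performs; the function name, the `decay` factor and the representation of frames as lists of float samples are assumptions.

```python
from typing import Optional


def conceal_losses(frames: list[Optional[list[float]]],
                   frame_size: int,
                   decay: float = 0.5) -> list[list[float]]:
    """Replace each lost frame (None) with an attenuated copy of the previous output."""
    output = []
    previous = [0.0] * frame_size  # treat the time before the first frame as silence
    for frame in frames:
        if frame is None:
            # Lost frame: reuse the last output frame, faded by the decay factor.
            frame = [sample * decay for sample in previous]
        output.append(frame)
        previous = frame
    return output
```

When many consecutive frames are lost, the listener hears the same waveform repeated and faded rather than natural speech, which is perceived as a robotic or synthetic voice.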
Volume distortion is caused by incorrectly fixed volume levels or fluctuating volume levels. Symptoms include voices that are fluctuating, fuzzy, loud, muffled, soft or tinny. Typical causes are too much gain or attenuation applied to the signal.
In addition to these two broad categories, there is also the specific problem of one-way audio, which occurs when only one party can hear the other. This is caused by one side not transmitting data to the other (obviously!) or by data being blocked in one direction. A misconfigured enterprise firewall is often the cause of media blocking.
Video annoyances are observed by the user when the displayed signal is of poor quality. We divide these annoyances into three broad categories based on their symptoms.
Pixelated video is often caused by high packet loss rates. When this occurs, regions of the video feed where movement is being shown may appear to be blocky or pixelated.
Blurry or frozen video may be due to any of multiple causes, including sender problems, high packet loss rates on the network or excessive CPU load. In high CPU load conditions, the decoder drops received frames in order to ensure uninterrupted video playout. Blurry video can occur in multiple regions of the screen or across the entire picture. When video freezing occurs, the video playout stops, while audio usually continues.
Video-audio synchronization error (a.k.a. lip sync error) is measured by the amount of time the audio departs from perfect synchronization with the video. We don’t have a specific recommended tolerance for synchronization in WebRTC applications, but the ITU-R has published recommendations for broadcasting applications (Recommendation BT.1359), which provide a reasonable guidepost: humans perceive desynchronization if audio lags behind the video by more than 125 ms or is ahead of it by more than 45 ms.
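The thresholds quoted above translate into a simple check. The sketch below assumes we already know the playout times, in milliseconds, of audio and video that were captured at the same instant; the function and constant names are hypothetical.

```python
AUDIO_LAG_LIMIT_MS = 125   # audio played later than the matching video
AUDIO_LEAD_LIMIT_MS = 45   # audio played earlier than the matching video


def lip_sync_perceptible(audio_playout_ms: float, video_playout_ms: float) -> bool:
    """Return True if the audio/video skew exceeds the detectability thresholds."""
    skew = audio_playout_ms - video_playout_ms  # positive: audio lags behind video
    return skew > AUDIO_LAG_LIMIT_MS or skew < -AUDIO_LEAD_LIMIT_MS
```

Note the asymmetry: viewers tolerate late audio far better than early audio, so the same absolute skew can be acceptable in one direction and noticeable in the other.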
In WebRTC, each media frame (sample) is assigned a timestamp and each media stream (voice and video) is independently sampled at different clock rates. It is the responsibility of the receiver device to synchronize these streams for playback. The synchronization is done using RTP Control Protocol (RTCP) Sender Reports (SR), which provide a mapping from a stream timestamp to a global Network Time Protocol (NTP) timestamp. The lack of synchronization between video and audio is usually caused by missing RTCP SR packets, wrong information contained inside them or erroneous processing logic.
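The mapping provided by a Sender Report can be sketched as follows. This is a simplified illustration, assuming the NTP timestamp has already been converted to seconds; the `SenderReport` class and `rtp_to_wallclock` function are hypothetical names, not part of any WebRTC API. The clock rates in the usage note (48 kHz for Opus audio, 90 kHz for video) are the standard RTP clock rates for those media types.

```python
from dataclasses import dataclass


@dataclass
class SenderReport:
    ntp_seconds: float  # sender wall-clock time, NTP timestamp converted to seconds
    rtp_timestamp: int  # RTP timestamp of the stream at that same instant


def rtp_to_wallclock(rtp_ts: int, sr: SenderReport, clock_rate: int) -> float:
    """Map a 32-bit RTP timestamp to sender wall-clock time using the latest SR."""
    # Interpret the difference as a signed 32-bit value to survive wraparound.
    delta = (rtp_ts - sr.rtp_timestamp) & 0xFFFFFFFF
    if delta >= 0x80000000:
        delta -= 0x100000000
    return sr.ntp_seconds + delta / clock_rate
```

For example, with an audio SR of `SenderReport(1000.0, 48000)` and a video SR of `SenderReport(1000.0, 90000)`, audio timestamp 96000 at 48 kHz and video timestamp 180000 at 90 kHz both map to the same capture instant, so the receiver should play them out together. If an SR is missing or carries wrong values, this mapping is wrong and the streams drift apart.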
Cues Point to the Root Cause of QoE Problems
While WebRTC communications services are inherently complex, we can troubleshoot many problems by isolating them to a particular system - network, infrastructure, or endpoint. In particular, problems in the media pipeline manifest themselves with specific types of annoyances that provide a troubleshooting cue. Using these cues, we can further investigate and identify the root cause of a QoE problem. More on this in future articles!