From its inception, WebRTC was designed to be a peer-to-peer communication technology, which means that the majority of technology development is focused on the client device. In spite of this, it is also very important to have a clear understanding of the server-side infrastructure for WebRTC. Every WebRTC application requires some infrastructure, at the very least for the exchange of signaling messages. More advanced WebRTC applications require infrastructure support for media handling as well.
Three main WebRTC architectures exist: peer-to-peer, multipoint conferencing units, and selective forwarding units. Each architecture has its own strengths and weaknesses, and each is well suited to its own set of use cases.
Different WebRTC Architectures
Peer-to-peer communication for WebRTC assumes direct exchange of media content between two browsers. Unfortunately, a purely direct exchange is not always possible, as a browser may be located behind a symmetric Network Address Translator (NAT). Such NATs force WebRTC applications to use TURN servers located in the public Internet to relay media data between browsers.
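In the browser, TURN relaying is configured through the iceServers field of the configuration passed to an RTCPeerConnection. A minimal sketch follows; the hostnames, username, and credential are placeholders, not real servers:

```javascript
// Hypothetical STUN/TURN configuration for an RTCPeerConnection.
// The hostnames and credentials below are placeholders.
const rtcConfig = {
  iceServers: [
    // STUN lets a peer discover its public address; try the direct path first.
    { urls: 'stun:stun.example.com:3478' },
    // TURN relays media when no direct path exists (e.g. symmetric NAT).
    {
      urls: 'turn:turn.example.com:3478',
      username: 'webrtc-user',
      credential: 'secret',
    },
  ],
};

// In a browser this configuration would be passed to the peer connection:
// const pc = new RTCPeerConnection(rtcConfig);
```

The browser's ICE machinery tries the direct and STUN-derived candidates first and falls back to the TURN relay only when those fail, so the relay is used exactly in the symmetric-NAT case described above.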
The main advantage of this architecture is its implementation simplicity and low application operating cost, as very little backend infrastructure is needed. The secondary advantage is that there is guaranteed end-to-end security between participants.
The problems with peer-to-peer communication for WebRTC start with multiparty calls. In a multiparty call scenario, every participant must send his or her media content to all other participants. If we assume that there are N participants in the call, the same media stream must be sent N-1 times over an uplink to the N-1 participants. This requires a significant amount of uplink bandwidth from participants. Furthermore, there is also a significant computational cost for each client device, as it must encode the same stream multiple times. In practice, direct peer-to-peer communication works well if the number of call participants is low.
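The uplink cost described above is simple arithmetic, and a small sketch makes it concrete (the function name is ours, chosen for illustration):

```javascript
// Stream counts per participant in a full-mesh (peer-to-peer) call with
// n participants: each peer sends its stream to, and receives a stream
// from, every other peer.
function meshStreams(n) {
  const perPeer = n - 1;        // uplink streams = downlink streams
  return {
    uplink: perPeer,
    downlink: perPeer,
    totalInCall: n * perPeer,   // streams crossing the network overall
  };
}

// A three-party call, as in Figure 1: each peer sends 2 and receives 2.
console.log(meshStreams(3)); // { uplink: 2, downlink: 2, totalInCall: 6 }
```

Note how the total grows quadratically with n, which is why mesh calls stop being practical beyond a handful of participants.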
Figure 1: Peer-to-peer architecture. Alice is able to communicate with Bob directly without the need for a TURN server. Eve is behind a NAT, which means she needs to use a TURN server to talk with Alice and Bob. Since we have three participants in this call, each participant sends two streams to the other participants and receives one stream for each participant. It is important to understand that there is end-to-end encryption of all media streams, as WebRTC mandates encryption of streams in a connection and the TURN server does not terminate the connection.
Multipoint Conferencing Unit
Multipoint Conferencing Units (MCUs) have been used successfully for many years with legacy conferencing systems. The MCU architecture assumes that each conference participant sends his or her stream to the MCU. The MCU decodes each received stream, rescales it, composes a new stream from all received streams, encodes it, and sends the single composed stream to all other participants.
The MCU approach requires very little intelligence in device endpoints, as the majority of the logic is located in the MCU itself. The unit can generate output streams with different quality for different participants depending on their specific downlink conditions. This makes MCUs a solid solution for low capacity networks. Since the MCU approach has been used widely in the industry for many years, it is a very good solution if interoperability with legacy systems is required.
The main disadvantage of an MCU is its cost. A secondary disadvantage is added delay, as decoding, composing, and re-encoding streams takes time and requires significant computing power in the MCU.
Figure 2: The MCU architecture. Each participant sends and receives only one media stream. The MCU must perform stream mixing (decoding, rescaling, composing, and encoding), so that the media stream sent to the participant contains media streams of all other participants. The MCU must have access to the media stream content, so it must terminate encryption as well.
Selective Forwarding Unit
Selective Forwarding Units (SFUs) are the most popular modern approach. In the SFU architecture, every participant sends his or her media stream to a centralized server (SFU) and receives streams from all other participants via the same central server. The architecture allows the call participant to send multiple media streams to the SFU, where the SFU may decide which of the media streams should be forwarded to the other call participants.
Unlike in the MCU architecture, the SFU does not need to decode and re-encode received streams, but simply acts as a forwarder of streams between call participants. The device endpoints need to be more intelligent and have more computing power than in the MCU architecture.
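The forwarding role can be sketched in a few lines. The names and data shapes below are illustrative only, not the API of any real SFU:

```javascript
// Minimal sketch of an SFU's core loop: a media packet arriving from one
// participant is forwarded, unmodified, to every other participant.
// No decoding or re-encoding takes place.
function forwardPacket(packet, senderId, participantIds) {
  return participantIds
    .filter((id) => id !== senderId)  // never echo back to the sender
    .map((id) => ({ to: id, payload: packet }));
}

const out = forwardPacket('rtp-bytes', 'alice', ['alice', 'bob', 'eve']);
// out contains one copy addressed to 'bob' and one to 'eve'
```

Because the payload is passed through untouched, the server-side cost per stream is mostly network I/O rather than CPU, which is the root of the SFU's scaling advantage over the MCU.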
The main advantage of the SFU architecture is its ability to work with asymmetric bandwidth, where downlink bandwidth exceeds uplink bandwidth. Because of this, it is suitable for asymmetric digital subscriber line (ADSL) networks. A secondary advantage is the scalability of the architecture, as adding more streams is fairly easy and not very challenging for the SFU. Thirdly, because every participant may send multiple versions of the same media stream and the SFU forwards a single one of them, it is easy to provide support for various screen layouts.
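Sending multiple versions of the same stream is what the browser's simulcast support provides: the sendEncodings option of addTransceiver describes the layers a sender offers. Below is a hedged sketch; the rid names, bitrates, and the layer-selection function are our own illustrative choices, not part of any standard SFU:

```javascript
// Simulcast layers a participant might offer: three encodings of the same
// camera track at different resolutions and bitrates. In a browser this
// array would be passed via
// pc.addTransceiver(track, { direction: 'sendonly', sendEncodings }).
const sendEncodings = [
  { rid: 'low',  maxBitrate: 150_000,   scaleResolutionDownBy: 4 },
  { rid: 'mid',  maxBitrate: 500_000,   scaleResolutionDownBy: 2 },
  { rid: 'high', maxBitrate: 1_500_000, scaleResolutionDownBy: 1 },
];

// Sketch of the SFU side: pick the best layer that fits a receiver's
// estimated downlink budget (in bits per second).
function selectEncoding(encodings, budgetBps) {
  const affordable = encodings.filter((e) => e.maxBitrate <= budgetBps);
  if (affordable.length === 0) return encodings[0]; // fall back to lowest
  return affordable.reduce((a, b) => (a.maxBitrate > b.maxBitrate ? a : b));
}

console.log(selectEncoding(sendEncodings, 600_000).rid); // "mid"
```

Per-receiver layer selection of this kind is also what makes varied screen layouts cheap: a thumbnail tile can be fed the low layer while the active speaker gets the high one.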
Figure 3: The SFU architecture. Every participant sends his or her own stream and receives media streams from all other participants. The SFU is only responsible for forwarding media streams between participants. If a participant sends multiple media streams, the SFU selects one of them to be forwarded to other participants.
Privacy Concerns with SFUs and MCUs
The problem with SFU and MCU architectures is that they do not support end-to-end media encryption, as the media server terminates the encryption once it receives the media stream and has direct access to it. This can be a serious blocker for the usage of SFU and MCU architectures for WebRTC applications, as many of them care a lot about the privacy of their call participants. For example, banking systems absolutely require end-to-end media encryption. Fortunately for them, Privacy Enhanced RTP Conferencing (PERC) is currently under development. PERC guarantees end-to-end encryption of media streams sent over a media server.
Discussion on WebRTC architectures
So this conversation raises the ultimate question: Which architecture is the best?
There is no single answer to this question, as every architecture has its own strengths and weaknesses. The peer-to-peer architecture is the cheapest and simplest to implement, but it does not scale well as the number of participants increases. WebRTC applications built solely on the peer-to-peer architecture can provide only direct media communication between two WebRTC endpoints. The peer-to-peer architecture does not work for legacy, non-WebRTC-capable endpoints, either. In that case, the MCU architecture is the way to go, as the MCU acts as the WebRTC gateway to a legacy system. If you want to build a service that involves advanced features like computer vision, speech analytics, or media recording, you will always need a centralized server to provide support.
The question that remains is this: if a peer-to-peer architecture will not work for your use case, is it better to use an SFU or an MCU?
SFUs cannot be used if you need to support legacy endpoints, because legacy endpoints usually work with different media encoding schemes. Legacy endpoints aside, the SFU architecture provides better scaling properties. It requires far less computing power on the server, since the computing work is delegated to the endpoints, although this load may be quite heavy for some mobile clients. It is also closer to the end-to-end principle upon which the Internet is built. On the other hand, the SFU architecture has higher network bandwidth requirements than the MCU architecture, as the number of media streams sent and received is usually higher.
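The bandwidth trade-off in the paragraph above can be made concrete by counting streams per participant under each architecture. A back-of-the-envelope sketch (simulcast layers ignored for simplicity):

```javascript
// Streams sent/received per participant in an n-party call under each
// architecture. MCU: one composed stream each way. SFU: one uplink and
// n-1 downlink streams. Full mesh shown for contrast.
function streamsPerParticipant(n, architecture) {
  switch (architecture) {
    case 'mcu':  return { send: 1, receive: 1 };
    case 'sfu':  return { send: 1, receive: n - 1 };
    case 'mesh': return { send: n - 1, receive: n - 1 };
    default: throw new Error(`unknown architecture: ${architecture}`);
  }
}

console.log(streamsPerParticipant(6, 'sfu')); // { send: 1, receive: 5 }
```

For a six-party call, an SFU receiver handles five downlink streams where an MCU receiver handles one, which is exactly why poor-downlink networks favor the MCU.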
Since no one architecture is superior to the others, it becomes a question of whether they can be merged to form a new, hybrid architecture. In our WebRTC Metrics Report from December 2016, we show that direct peer-to-peer communication without a TURN server can work in 77% of all WebRTC sessions. This makes a good argument for moving some WebRTC applications from a strict MCU or SFU architecture to a hybrid architecture to save costs.
If there are only two participants in the call and they do not require a TURN server to communicate, the peer-to-peer architecture is vastly superior. However, if they fail to establish a media connection, or if a new participant joins the call, the call can be “upgraded” to communicate using the MCU or SFU.
As explained above, there are also situations where usage scenarios are so versatile that the strict MCU or SFU architecture is not enough. For example, the SFU architecture is best if we consider only WebRTC-capable endpoints. However, if we have WebRTC-capable endpoints located in areas with very poor network bandwidth, it may be infeasible to send them multiple media streams and an MCU may be needed.
A potential solution to that particular challenge comes from Scalable Video Coding (SVC), which we will cover in detail later in this blog post series.
As you can see, in a lot of ways WebRTC architecture is turtles all the way down. Hopefully, this introduction gave you a look into the proper architecture for your use case, as well as its strengths and weaknesses.
In the coming weeks, we will be covering topics related to the technology involved in SFUs, including Simulcast, Scalable Video Coding, popular SFU vendors, and more.
Interested in implementing WebRTC monitoring in your application? Try our demo.