Explaining the Real-time Transport Protocol of SRTP for WebRTC

By callstats on April 12, 2018

WebRTC uses the Secure Real-time Transport Protocol (SRTP) to add encryption, message authentication and integrity, and replay attack protection to RTP data. It is a security framework that provides confidentiality by encrypting the RTP payload and that supports origin authentication. The security features of WebRTC are a crucial component of its reliability, and they all build on the Real-time Transport Protocol. But what is the Real-time Transport Protocol? How does it work?

What is Real-time Transport Protocol?

The Real-time Transport Protocol (RTP) is a network protocol designed for multimedia telephony (voice-over-IP, video conferencing, telepresence systems), multimedia streaming (video on demand, live streaming), and multimedia broadcast. It was initially specified by the Internet Engineering Task Force (IETF) in RFC 1889. The IETF’s Audio-Video Transport Working Group originally created RTP to support video conferences with several geographically dispersed participants. Version 2, specified in RFC 3550, is the one in use today, and it has been for the last 15 years!

RTP’s design is based on the fundamental principles of application layer framing and integrated layer processing. It provides source and payload type identification, stream synchronisation, detection of packet loss and reordering, and media stream monitoring.

RTP uses the RTP Control Protocol (RTCP) to report the performance of the media stream.

In this process, the media sender transmits encoded media encapsulated in RTP. It also sends RTCP Sender Reports, which facilitate playback synchronisation of different media streams. The receiver maintains a jitter buffer to reorder media packets and play them out as per timing information encoded in the packet. If a packet is missing, the receiver recovers the packet or conceals the error. Finally, the receiving endpoint reports rough or detailed statistics in the RTCP Receiver Report that enable the media sender to adapt its media encoding rate, change to a better codec, or vary the amount of forward error correction.

RTP Packet Header Format

The RTP packet header format includes five key fields: the synchronization source identifier, contributing source identifiers, timestamp, sequence number, and payload type. Four of them are described below.

1. Synchronization Source: The synchronization source assists in determining the source endpoint. It is useful when an endpoint sends multiple media streams that need to be synchronized.

2. RTP Timestamp: The RTP timestamp assists in playing out the received packets at the appropriate time and recomposing the media frame from RTP packets.

3. RTP Sequence Number: The RTP Sequence number assists in identifying the lost packets and reordering packets in case packets arrive out of order.

4. Payload Type: The payload type describes the encoding of the media data it is carrying. Each codec must specify its corresponding payload format.
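To make the fields above concrete, here is a minimal Python sketch that reads them out of the 12-byte fixed header defined in RFC 3550. The function name parse_rtp_header and the dictionary keys are illustrative choices, not part of any library, and header extensions are not parsed.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header and CSRC list (layout from RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Two flag bytes, 16-bit sequence number, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack(f"!{csrc_count}I", packet[12:end]))
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }
```

A receiver would typically group packets by the ssrc value, order them by sequence_number, and schedule playout from timestamp.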

RTCP Reports

RTCP Reports come in three kinds: Sender Reports, Receiver Reports, and Extended Reports.

RTCP Sender Reports

The sender uses RTCP Sender Reports to help synchronize the media streams. To accomplish this, each report relates the RTP timestamps of an individual media stream to the wall-clock time. It also carries the sender's packet and octet counts, from which the receiver can derive the current packet rate and bit rate.
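As a rough illustration of that mapping, the sketch below converts an RTP timestamp to wall-clock time using the timestamp pair from the most recent Sender Report. The function name, the parameter names, and the use of plain seconds for the wall clock (rather than the NTP format carried on the wire) are assumptions made for readability.

```python
def rtp_to_wallclock(rtp_ts: int, sr_rtp_ts: int, sr_wallclock: float,
                     clock_rate: int) -> float:
    """Map an RTP timestamp to wall-clock seconds via the latest Sender Report.

    sr_rtp_ts and sr_wallclock are the RTP timestamp and wall-clock time
    that the Sender Report declared to be the same instant.
    """
    # Signed difference modulo 2**32, so 32-bit timestamp wraparound is handled.
    diff = (rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr_wallclock + diff / clock_rate

# Audio (48 kHz) and video (90 kHz) packets that map to the same wall-clock
# instant should be played out together.
audio_time = rtp_to_wallclock(528000, 480000, 100.0, 48000)
video_time = rtp_to_wallclock(945000, 900000, 100.5, 90000)
```

Because each stream has its own clock rate and random timestamp offset, this per-stream mapping is what lets a receiver line up audio and video.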

RTCP Receiver Reports

The receiver measures the incoming streams and reports coarse-grained transport statistics in an RTCP Receiver Report. This report includes the current loss fraction, the jitter, and the highest sequence number received, and it carries the timing fields that facilitate round-trip time calculations.
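The statistics named above follow simple formulas in RFC 3550. The sketch below shows one way to compute them; the function names are illustrative, and sequence number bookkeeping (how the expected and received counts are obtained) is assumed to happen elsewhere.

```python
def update_jitter(jitter: float, transit_prev: int, transit_now: int) -> float:
    """Interarrival jitter estimate: J = J + (|D| - J) / 16.

    transit is arrival time minus RTP timestamp, both in RTP clock units.
    """
    d = abs(transit_now - transit_prev)
    return jitter + (d - jitter) / 16.0

def fraction_lost(expected_interval: int, received_interval: int) -> int:
    """Loss fraction since the last report as an 8-bit fixed-point value (0..255)."""
    lost = expected_interval - received_interval
    if expected_interval == 0 or lost <= 0:
        return 0
    return (lost << 8) // expected_interval

def round_trip_time(arrival: int, lsr: int, dlsr: int) -> float:
    """RTT in seconds as seen by the sender: arrival - LSR - DLSR.

    All inputs are in units of 1/65536 second (middle 32 bits of NTP time).
    """
    return ((arrival - lsr - dlsr) & 0xFFFFFFFF) / 65536.0
```

For instance, receiving 75 of 100 expected packets in an interval yields a loss fraction of 64, which is 25% expressed in units of 1/256.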

RTCP Extended Reports

RTCP Extended Reports are used by endpoints to describe complex metrics not exposed by the RTCP Receiver Report. This includes relevant performance monitoring and congestion control metrics, such as jitter buffer metrics, packet delay variation, delay metrics, the number of discarded packets, quality of experience, and others. New metrics may also be defined, so long as they address what is measured, how it is measured, and how it is reported to other endpoints.

What Do RTP Payload Formats Look Like?

Defining a payload format requires identifying the encoding of the media packets. These encodings fall into one of two categories: codec-specific, for example, H.264, H.263, H.261, MPEG-2, JPEG, G.711, G.722, or AMR; or generic, for example, Forward Error Correction (FEC), NACK, or multiplexed streams.

The payload document usually specifies a well-defined packet format for media codecs, and characterizes two kinds of rules for codecs: aggregation rules and fragmentation rules. Aggregation rules are defined for codecs that produce frames that are small compared to the IP Maximum Transmission Unit (MTU), for example, audio codecs. Fragmentation rules are defined for codecs that produce large frames, for example, I-frames produced by video codecs. Large frames are fragmented into smaller packets at the RTP layer rather than left to IP fragmentation, mainly because IP fragmented packets are commonly discarded in the network, especially by NATs or firewalls.
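The two kinds of rules can be sketched generically as follows. This is a simplified illustration, not any specific payload format: real formats add their own per-fragment or per-frame headers, and the function names and the 1200-byte payload budget are assumptions.

```python
def fragment_frame(frame: bytes, max_payload: int) -> list:
    """Split one large frame into chunks that each fit in one RTP payload."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]

def aggregate_frames(frames: list, max_payload: int) -> list:
    """Greedily group small frames so each group fits in one RTP payload."""
    packets, current, size = [], [], 0
    for f in frames:
        if len(f) > max_payload:
            raise ValueError("frame too large to aggregate; fragment it instead")
        if current and size + len(f) > max_payload:
            # The next frame would overflow this packet, so start a new one.
            packets.append(current)
            current, size = [], 0
        current.append(f)
        size += len(f)
    if current:
        packets.append(current)
    return packets

# A 3000-byte video frame becomes three payloads; five 400-byte audio
# frames are packed into two payloads.
video_chunks = fragment_frame(b"x" * 3000, 1200)
audio_groups = aggregate_frames([b"a" * 400] * 5, 1200)
```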

What are RTP Header Extensions For?

RTP header extensions are designed to carry media-independent information. More specifically, they carry data that may be generally applicable to several payload formats and need to be reported more frequently than RTCP reports are emitted.

For example, when sending NACK packets for interactive media, media flows in both directions and RTP packets are generated every few tens of milliseconds. In this instance, an RTP header extension can indicate which sequence numbers were correctly received or lost. Endpoints therefore do not rely completely on RTCP Receiver Reports to send NACKs or ACKs.

Using header extensions can be very useful, as they are backwards compatible. An endpoint that does not understand them can simply ignore them. Additionally, they are generic, as it is not necessary to redefine the same extension for each media codec.

RTP header extensions have several uses, including reporting the network send timestamp and equalizing a client’s audio levels across multiple streams in a media conference.

What is the RTCP Reporting Interval?

With RTP, a closed loop is created from sending RTP media packets and receiving RTCP feedback packets. RTCP traffic is usually limited to a very small fraction of the session bandwidth so that it does not affect media traffic. The RTCP reporting interval is determined by the number of synchronization sources in the session and the session bandwidth.

While the session bandwidth is expected to be divided among participants, it is often actually the sum of the average throughput of the senders expected to be concurrently active. For example, in an audio conference, the session bandwidth would be one sender’s bandwidth. However, for a video conference, the session bandwidth would vary depending on the number of users shown. In order to manage this, the session bandwidth is given by the session management layer, so the same value for the RTCP interval is calculated for every participant.

Of this session bandwidth, 5% should be allocated to RTCP control traffic.

During scenarios with a large number of receivers and a small number of senders, a quarter of the reporting bandwidth should be shared equally by the senders, with the remaining three quarters dedicated to the receivers. This allows new participants to quickly receive the CNAME and synchronization timestamp from the Sender Reports. For new participants, the RTCP interval is halved to quickly declare their presence.

The recommended minimum RTCP interval is 5 seconds. This is suitable for unidirectional links, or for sessions that don’t require monitoring of the reception quality statistics.

The reduced minimum interval, in seconds, is 360 divided by the session bandwidth in kilobits per second. It drops below 5 seconds once the session bandwidth exceeds 72 kb/s. It is suitable for participants in a unicast bidirectional multimedia session, and for sending timely feedback messages to perform congestion control or error repair.
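Putting these rules together, the sketch below computes a deterministic RTCP interval from the 5% control share, the sender/receiver split, and the two minimum values. It is a simplification of the RFC 3550 algorithm: the random scaling that real endpoints apply to avoid synchronized reports is left out, and the function and parameter names are assumptions.

```python
def rtcp_interval(members: int, senders: int, session_bw_kbps: float,
                  avg_rtcp_size_bytes: float, we_sent: bool,
                  initial: bool = False, reduced_minimum: bool = False) -> float:
    """Deterministic RTCP reporting interval in seconds (no randomization)."""
    # 5% of the session bandwidth, converted from kb/s to bytes per second.
    rtcp_bw = 0.05 * session_bw_kbps * 1000 / 8
    t_min = 360.0 / session_bw_kbps if reduced_minimum else 5.0
    if initial:
        # New participants use half the minimum to announce themselves quickly.
        t_min /= 2
    n = members
    if senders <= members * 0.25:
        if we_sent:
            rtcp_bw *= 0.25          # senders share a quarter
            n = senders
        else:
            rtcp_bw *= 0.75          # receivers share the rest
            n = members - senders
    return max(t_min, avg_rtcp_size_bytes * n / rtcp_bw)

# 100 participants, 2 senders, 1 Mb/s session, 100-byte average RTCP packets.
receiver_default = rtcp_interval(100, 2, 1000, 100, we_sent=False)
sender_reduced = rtcp_interval(100, 2, 1000, 100, we_sent=True,
                               reduced_minimum=True)
```

In this example the receiver is held at the 5-second default minimum, while a sender using the reduced minimum may report every 0.36 seconds.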

Extended RTP Profile for RTCP-Based Feedback

If an endpoint detects packet loss or the onset of congestion midway through a reporting interval, RTP’s default timing rules do not allow RTCP reports to be sent early, and the endpoint must wait for the next scheduled RTCP report. This can cause instability and oscillation in the media bit rate. To address this, endpoints implement the Extended RTP Profile for RTCP-Based Feedback. This is an extension to RTP’s default timing rules that enables rapid feedback.

With this profile, the endpoint can adjust the RTCP reporting interval to send the RTCP reports earlier than the next scheduled RTCP report, so long as the reporting interval remains the same on average. It also defines a suite of error-resilience feedback messages, including Negative Acknowledgement (NACK), Picture Loss Indication (PLI), Slice Loss Indication (SLI), and Reference Picture Selection Indication (RPSI).
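As an example of one of these messages, the generic NACK defined in RFC 4585 reports losses as pairs of a 16-bit packet ID (PID) and a 16-bit bitmask (BLP) covering the 16 sequence numbers that follow it. The sketch below builds only those pairs, not the surrounding RTCP packet header. The function name is illustrative, and sequence number wraparound is ignored for simplicity.

```python
import struct

def encode_nack_fci(lost_seqs) -> bytes:
    """Pack lost sequence numbers into (PID, BLP) pairs for a generic NACK."""
    pairs = []
    remaining = sorted(set(s & 0xFFFF for s in lost_seqs))
    while remaining:
        pid = remaining.pop(0)
        blp = 0
        rest = []
        for s in remaining:
            offset = s - pid
            if 1 <= offset <= 16:
                # Bit (offset - 1), counting from the least significant bit,
                # marks packet PID + offset as lost.
                blp |= 1 << (offset - 1)
            else:
                rest.append(s)
        remaining = rest
        pairs.append((pid, blp))
    return b"".join(struct.pack("!HH", pid, blp) for pid, blp in pairs)

# Packets 100, 101, 103 and 200 were lost: two (PID, BLP) pairs are enough.
fci = encode_nack_fci([100, 101, 103, 200])
```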


Interested in learning more about the Real-time Transport Protocol? Check out the thesis by our CEO, Varun Singh, on Protocols and Algorithms for Adaptive Multimedia Systems.

Tags: Real-time Communications, WebRTC, Networking