Simulcast


Simulccast.png

Huddle01 is building the next generation of real-time communication systems through progressive decentralization. We are developing a dRTC engine that will allow technical and social primitives to operate on existing L1 and L2 networks.

A video is just a series of images updated at high speed. Even old movies shot on film reels worked on this principle. Our brains perceive this rapid succession of images as motion. The same principle is still used everywhere today; the individual images are simply called "frames" rather than "pictures."

Fun fact: if you go below 30 frames per second, you can start to perceive the individual frames; above 60 FPS, most people can't tell much difference.

What does it have to do with Simulcast, and what is Simulcast anyway?

Simulcast allows publishers to distribute multiple versions of the same A/V stream in various encodings. Huddle01 receives multiple streams and chooses the best one for each subscriber based on network conditions.

If you understand the frame concept in a video, it is easy to grasp the workings of Simulcast.
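In the browser, a publisher typically opts into simulcast by attaching several encodings to one outgoing track. Here is a minimal sketch using the standard WebRTC `sendEncodings` API; the `rid` names and bitrate caps are illustrative choices, not Huddle01's actual configuration.

```typescript
// Three simulcast encodings of a single camera track. The media server can
// then forward whichever rid best fits each subscriber's connection.
const simulcastEncodings = [
  { rid: "h", maxBitrate: 2_500_000 },                           // full resolution
  { rid: "m", maxBitrate: 1_000_000, scaleResolutionDownBy: 2 }, // half resolution
  { rid: "l", maxBitrate: 300_000, scaleResolutionDownBy: 4 },   // quarter resolution
];

// In a browser this would be attached to a peer connection like so:
// const pc = new RTCPeerConnection();
// pc.addTransceiver(cameraTrack, { direction: "sendonly", sendEncodings: simulcastEncodings });
```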

Scalable Video Codecs (SVCs)

SVCs.png

Scalable video codecs let a single encoded media stream be decoded at different bitrates and resolutions, which is preferable to producing multiple separate versions of the stream. Although VP9 and AV1 are newer codecs with built-in scalability (SVC), they have the following drawbacks:

- AV1 simulcast is only supported in browsers that use the Chromium engine.
- VP9 is compute-intensive and only recently gained Safari support (in v15.0).

Since SVC support is not yet universal, we use more widely supported codecs like VP8 (with RTX retransmissions) for simulcast. We like VP9 for our Huddle01 mobile app because it frees us from browser limitations, a problem we often run into when developing with WebRTC: different browsers support different video encoders and decoders, so fallbacks are needed.
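A codec fallback can be expressed as a simple preference check. In a browser you would feed this the list from `RTCRtpSender.getCapabilities("video")?.codecs`; here the codec list is modeled as plain objects so the logic is runnable anywhere, and the VP9-before-VP8 preference order is our illustrative assumption.

```typescript
// Minimal shape of a codec capability entry.
interface CodecInfo {
  mimeType: string;
}

// Return the first preferred codec that the platform actually supports,
// or null if none of the preferences are available.
function pickVideoCodec(
  available: CodecInfo[],
  preference: string[] = ["video/VP9", "video/VP8"]
): string | null {
  const mimes = new Set(available.map((c) => c.mimeType));
  return preference.find((m) => mimes.has(m)) ?? null;
}
```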

Spatial Layering

Spatial layers are the different streams that the publishing client sends out with different encodings, which in this case means different video qualities. Huddle01 supports creating three spatial layers of the same A/V stream, encoded at different bitrates:

- High resolution (1080p)
- Medium resolution (720p)
- Low resolution (480p)
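The three resolutions above map directly onto WebRTC's `scaleResolutionDownBy` parameter, which expresses each layer as a divisor of the source resolution. A small sketch, assuming a 1080p source; the `rid` names and helper are ours:

```typescript
// scaleResolutionDownBy is the factor the source is divided by,
// so targetHeight = sourceHeight / factor.
function scaleFactor(sourceHeight: number, targetHeight: number): number {
  return sourceHeight / targetHeight;
}

const spatialLayers = [
  { rid: "high", height: 1080, scaleResolutionDownBy: scaleFactor(1080, 1080) }, // 1080p
  { rid: "medium", height: 720, scaleResolutionDownBy: scaleFactor(1080, 720) }, // 720p
  { rid: "low", height: 480, scaleResolutionDownBy: scaleFactor(1080, 480) },    // 480p
];
```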

Spatial Layering.png

In the image above, you can see an example of two spatial layers. A larger frame means that the video quality is better.

The streams go to Huddle01's media server, which chooses which layers to forward to each viewer. From the viewer's perspective, they receive a single layer that the media server switches dynamically based on their network conditions, so they get one smooth stream without noticing any visible switching. That is the magic of simulcast!
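The per-subscriber decision can be sketched as a simple threshold function over the estimated available bandwidth. The bitrate thresholds below are illustrative, not Huddle01's actual numbers.

```typescript
// 0 = low (480p), 1 = medium (720p), 2 = high (1080p)
type SpatialLayer = 0 | 1 | 2;

// Pick the highest spatial layer the subscriber's estimated
// bandwidth (in bits per second) can sustain.
function chooseSpatialLayer(estimatedBps: number): SpatialLayer {
  if (estimatedBps >= 2_500_000) return 2;
  if (estimatedBps >= 1_000_000) return 1;
  return 0;
}
```

In a real SFU this estimate would come from congestion-control feedback (e.g. receiver reports), and the server would re-evaluate it continuously rather than once.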

Temporal Layering

Temporal is just a geeky word for time-related. These layers are separated in time, and they give our media server the ability to change the frame rate on the fly without affecting the resolution.

Let's understand a little about how encoders work. There are two types of frames:

Keyframes: These are full-frame images

Delta frames: These encode only the difference between the new frame and a previous frame (ultimately anchored to a keyframe)

Delta frames are usually much smaller than keyframes, so encoders rely heavily on them; the trade-off is that each delta frame depends on the frames before it.

Encoded streams mostly contain delta frames.
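A toy model makes the keyframe/delta idea concrete. Treating each frame as an array of pixel values, the keyframe stores the full image and every delta frame stores only the per-pixel difference from the frame before it; real codecs are far more sophisticated, but the dependency structure is the same.

```typescript
// Encode a sequence of "frames" as one keyframe plus per-frame diffs.
function encode(frames: number[][]): { key: number[]; deltas: number[][] } {
  const key = frames[0];
  const deltas = frames
    .slice(1)
    .map((frame, i) => frame.map((px, j) => px - frames[i][j])); // diff vs previous frame
  return { key, deltas };
}

// Rebuild the sequence by applying each diff to the previously decoded frame.
// Losing any delta would corrupt every frame after it.
function decode(key: number[], deltas: number[][]): number[][] {
  const frames = [key];
  for (const delta of deltas) {
    const prev = frames[frames.length - 1];
    frames.push(prev.map((px, j) => px + delta[j]));
  }
  return frames;
}
```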

Let's take an example where there is only one temporal layer:

Temporal Layering.png

We have one temporal layer, T1, in which each frame depends on the frame before it. If a frame is lost, the decoder can no longer reconstruct the frames that follow. As a result, this layout isn't scalable.

Temporal Layering2.png

Above, we see a temporal base layer (T0) with two upper layers (T1 and T2). Frames on the base layer refer only to other base-layer frames, so T0 can be decoded on its own.

We can send all three temporal layers to a high-bandwidth client, resulting in a smooth, high-quality experience. But if the client's network isn't reliable, the media server sends only the base temporal layer (T0) until the network recovers. This saves up to 67% of bandwidth while still keeping the user's experience smooth.
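The layering described above follows a repeating pattern over frame indices. A sketch of the common three-layer pattern (period of four frames), with a forwarding check like the one an SFU applies per subscriber; the exact pattern is our assumption about a typical T0/T1/T2 layout, not Huddle01's documented one:

```typescript
// Frame indices:  0  1  2  3  4  5  6  7 ...
// Temporal layer: 0  2  1  2  0  2  1  2 ...
// T0 frames reference only T0, T1 references T0, T2 references T0/T1,
// so any upper layer can be dropped without breaking decoding.
function temporalLayer(frameIndex: number): 0 | 1 | 2 {
  const phase = frameIndex % 4;
  if (phase === 0) return 0;
  if (phase === 2) return 1;
  return 2;
}

// An SFU forwarding only layers <= maxLayer simply drops the rest.
function shouldForward(frameIndex: number, maxLayer: 0 | 1 | 2): boolean {
  return temporalLayer(frameIndex) <= maxLayer;
}
```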

Huddle01 Media server leverages the powers of Space and Time

Our homemade media server, an SFU (Selective Forwarding Unit), utilizes both spatial and temporal layers in our simulcast implementation.

Spatial Layering2.png

For video streams, we use S3T3_KEY, a 3-spatial-layer, 3-temporal-layer encoding that has worked best across network scenarios in our internal testing and experimentation.

For screen sharing, where clear quality is more important than fast frame rate, it's best to use more temporal layers than spatial layers (S2T3 and S2T3h).
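These mode strings follow the W3C scalability-mode notation (SxTy, where the optional `_KEY` suffix means K-SVC: spatial layers depend on each other only at keyframes). A sketch of how such a mode is requested when producing a track; the mediasoup-style `produce` call in the comment is an assumption about the client API, while the encodings objects themselves are plain data.

```typescript
// 3 spatial x 3 temporal layers, K-SVC (inter-spatial-layer
// dependencies only at keyframes), as used for camera video.
const videoEncodings = [{ scalabilityMode: "S3T3_KEY" }];

// Fewer spatial layers but full temporal layering for screen share,
// where clarity matters more than frame rate.
const screenshareEncodings = [{ scalabilityMode: "S2T3" }];

// With a mediasoup-style client this might look like:
// await sendTransport.produce({ track: cameraTrack, encodings: videoEncodings });
```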

That’s it, folks; these are some of the basics of Simulcast and how we have leveraged it! If you have any questions, suggestions, or team-ups in mind, reach out to us on Twitter or join our Discord!