Lynne's compiled musings

https://github.com/cyanreg https://pars.ee/@lynne

πŸ”‘

πŸ”‘

IRC: Lynne

20-10-06

Low bitrate white noise?

"Every collection of random bits is white noise.".

"Every collection of random bits is a valid Opus packet.".

But is every collection of random bits a valid Opus packet which decodes to white noise1? Here are two volume-reduced2 16kbps samples.

Encoded white noise (volume set to 5.37%) Random bytes (volume set to 0.28%)

Clearly not.

  1. Yes, in a VoIP context signalling an Opus packet is silence via the flag will cause the comfort noise generator to produce noise, but it won't be white. ↩
  2. Volume was reduced without transcoding via the opus_metadata FFmpeg BSF. ↩
 ·  opus  ·  CC-BY logo

20-09-30

How not to design... containers

This post is part 1 of the "How not to design..." series:

  1. How not to design... containers

A simple codec container

Do not rely on a magic word. Take a look at the BBC Dirac container syntax:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      'B'      |      'B'      |      'C'      |      'D'      | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Parse code    |  Next parse offset                            | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |  Last parse offset                            | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |  Unit data...                                 | 12-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Each packet starts with a 32-bit sequence BBDC, then a parse code which indicates the type of the unit, then two 32-bit DWORDS which are supposed to be the global offset of the current/last unit within the file, and finally that's followed by the unit data, which may be colorspace info or actual codec packets containing pixels.

On the surface its a simple container and you could write a bad parser in just a few lines and as many minutes. And here's a short list of what's wrong with it:

  • No size field. The only way to know how big a unit is would be to get the difference between the next and current unit's offset.
  • Fast and accurate seeking not possible in files over 4GB. Both offsets may overflow so seeking using fseek and a known frame numer means you have to parse every single unit.
  • Unsuitable for streaming. Both offsets make no sense for a stream as there's no start nor end known on the client-side.
  • A single-bit error in all but 8 bits of the 108-bit header will break a simple demuxer, while an error in the 8-bit leftover (parse code) will break decoding. A badly written demuxer will go into a loop with a near-0 chance of recovery, while a better one will look at each incoming byte for BBCD.
  • For a stream, there's barely 32 bits of usable entropy to find the next unit. While some sanity checks could be done to the offsets and parse code, a demuxer which accepts arbitrary input can't do them.
  • BBCD is not an uncommon sequence to see in compressed data, or in fact any data with a high amount of entropy. Which means you have a non-insignificant chance of parsing a wrong unit when there wasn't meant to be one.

Combined, all those issues make parsing not very robust and not very quick. A reliable parser can only be written for a very limited subset of the container.

A more robust simple codec container

"Let's just put a CRC on the whole data!". Hence, let's take a look at OGG:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      'O'      |      'g'      |      'g'      |      'S'      | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| version       | header_type   | granule_position              | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | bitstream_serial_number       | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | page_sequence_number          | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | CRC_checksum                  | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | page_segments | segment_table | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...                                                           | 28-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Here, the 32-bit standard CRC_checksum covers the entire (possibly segmented) packet. When computing it, its just taken as 0x0 and replaced later.

  • A single bit error anywhere will desync demuxing. Which means you have to look for the OggS sequence at every byte position in a compressed stream.
  • You could ignore the CRC and let the packet through to the decoder, but there's no guarantee the codec won't output garbage, or fail to decode (or produce valid garbage, since any sequence of random bytes is a valid Opus packet).
  • If you don't ignore the CRC, you'll have to look at every byte position for the magic OggS and then at every match, perform a CRC which may be as long as 65k. This isn't fast nor good for battery consumption or memory.
  • The way chaining works in internet radio is to literally append the next file onto the stream, which causes a desync. So you have a mandatory desync every few minutes.
  • Seeking will also result in a desync and require you to search.
  • There's still only 32-bits of entropy in the magic sequence, and worse than Dirac, not many ways to check whether a packet is valid based on the header, since new streams may be created or simply dropped at any point.

There's a ton more I'm not covering here on what Ogg did disasterously wrong. Its a story for another time.

My point is that while magic sequences and CRCs are convenient, they're not robust, especially done on a per-packet level.

 ·  containers  ·  CC-BY logo

20-08-30

Fast parsing of Interleaved Signed exp-Golomb codes

Eventually, all well optimized decoder implementations of codecs will hit a bottleneck, which is the speed at which they're able to parse data from the bitstream. Spending the time to optimize this parsing code is usually not worth it unless users are actually starting to hit this and its preventing them from scaling up. For a complicated filter-heavy codec such as AV1, this becomes a problem quite late, especially as the specifications limit the maximum bitrate to around 20Mbps, and care was taken to allow easy SIMDing of parsing code during the writing of the spec. For simple mezzanine codecs such as ProRes, DNxHD, CFHD, Dirac/VC-2 or even Intra-only VLC H.264, where bitrates of the order of hundreds of Mpbs, optimizing bitstream parsing is usually importance number one.

The context we'll be using while looking at bitstream decoding is that within the VC-2 codec, which is essentially a stripped down version of Dirac, made by the BBC for allegedly 1 patent unencumbered near-lossless video transmission over bandwidth limited connectivity (think 1.5Gbps 420 1080p60 over a 1Gbps connection).

To achieve this, the pixels are first transformed using one of many possible wavelet transforms, then quantized by means of division, and encoded. The wavelet transforms used are simple and easy to SIMD, the quantization is somewhat tricky but still decently fast as its just a multiply and an add per coefficient. Boring, standard, old, uninspired and uninteresting, the basis of anyone's first look-I-wrote-my-own-codec 2. Which leaves only the quantized coefficient encoding.

The coefficients are encoded using something called Interleaved Signed exp-Golomb codes. We'll go over this word by word, in reverse.

exp-Golomb codes, or exponential-Golomb for long, or just Golomb to those lazy and who know their codecs, are a form of binary encoding of arbitrarily sized integers, where the length is encoded as a prefix preceding the data. To illustrate:

0 β‡’ +1 β‡’ 1 β‡’ 1 β‡’ 1
1 β‡’ +1 β‡’ 2 β‡’ 10 β‡’ 010
2 β‡’ +1 β‡’ 3 β‡’ 11 β‡’ 011
3 β‡’ +1 β‡’ 4 β‡’ 100 β‡’ 00100
4 β‡’ +1 β‡’ 5 β‡’ 101 β‡’ 00101
5 β‡’ +1 β‡’ 6 β‡’ 110 β‡’ 00110
6 β‡’ +1 β‡’ 7 β‡’ 111 β‡’ 00111
7 β‡’ +1 = 8 β‡’ 1000 β‡’ 0001000
...

1 is added to the number if encoding a 0 is necessary, since otherwise encoding it would take 0 bits. The prefix is just the amount of bits after the most significant non-zero bit for the integer, minus one, encoded as a sequence of zeroes. This encoding doesn't have any interesting properties about it and is simple to naΓ―vely decode.

Signed exp-Golomb codes are just the same as above, only an additional bit is appended at the end to signal a sign. By universal convention, 1 encodes a negative number, and 0 encodes a positive number. The bit is not signalled if the number is 0.

Interleaved exp-Golomb codes take the same amount of bits to encode as regular Golomb, however on a first glance they are very different:

0 β‡’ +1 β‡’ 1 β‡’ 1 β‡’ 1
1 β‡’ +1 β‡’ 2 β‡’ 10 β‡’ 001
2 β‡’ +1 β‡’ 3 β‡’ 11 β‡’ 011
3 β‡’ +1 β‡’ 4 β‡’ 100 β‡’ 00001
4 β‡’ +1 β‡’ 5 β‡’ 101 β‡’ 00011
5 β‡’ +1 β‡’ 6 β‡’ 110 β‡’ 01001
6 β‡’ +1 β‡’ 7 β‡’ 111 β‡’ 01011
7 β‡’ +1 = 8 β‡’ 1000 β‡’ 0000001
...

As the number of bits hasn't changed, and there are still the same amount of zeroes as in normal exp-Golomb codes, the prefix is still there. Its just interleaved, where every odd bit (except the last one) is a 0, while every even bit encodes the integer. The reason why it looks so different is that with this coding scheme, coding the very first non-zero bit is unnecessary, hence its implicitly set to 1 when decoding.

Interleaved Signed exp-Golomb codes are finally just an interleaved exp-golomb code with an additional bit at the end to signal a sign. That bit is of course not signalled if the number encoded is a 0.

A more convenient way to think about interleaved exp-golomb codes is that every odd bit is actually a flag that tells you whether the number has ended (a 1) or that the next bit is part of the number (a 0). A simple parser for signed codes would then look like this:

int read_sie_golomb()
{
    uint32_t val = 0x1;

    while (!get_bit()) {
        val <<= 1;
        val |= get_bit();
    }

    val -= 1;

    if (val && get_bit())
        val *= -1;

    return val;
}

Looks simple, and the loop has 3 instructions, so it should be fast, right? Bitstream readers are however, not exactly simple, not exactly have easily predicted branches, which makes them not exactly fast.

CPUs like it when the data you're looking at has a size of a power of two, with the smallest unit being a byte. Bytes can encode 256 total possibilities, which isn't exactly large, but if we could process a byte at a time rather than a bit at a time, we could have a potential 8x speedup.

01110010 is a sequence which encodes a -2 and a 1, both signed, and is exactly 8 bits. So if we make a lookup table, we can say that the byte contains a -2 and a 1, directly output the data to some array, and move on cleanly to the next byte.
This is possibility number one, where the byte contains all bits of the numbers present.

01101001 0xxxxxxx is a sequence which encodes a 2, a 0, and a -1. Unlike the previous example, all the numbers are terminated in a single byte, with only the sign bit missing. This is hence possibility number two, where the current byte has leftover data that needs exactly 1 bit from the next byte.

01011101 10xxxxxx is a sequence which encodes a -6 and a 2. Its 10 bits in length, so the last 2 bits of the 2 spill over into the next byte. We can output a -6, save the uncompleted bits of the 2, and move over to the next byte where we can combine the unterminated data from the previous byte with the data from the current byte.
However there's more to this. In the previous example, the -6 ended on an odd bit, making an even bit the start of the 2. As we know, the terminating bit of an interleaved exp-Golomb code will always be after an odd number of bits since the start. So we know that whenever the sequence ends, whether it be the next byte or the current byte, the ending bit of the sequence must be at an odd position. In other words, this is possibility number three, where the current byte is missing some data and needs the next, and possibly more bytes to complete, with the data ending at an odd position.
Of course, there's the possibility that the sequence will end on an even bit, such as with 01011110 110xxxxx (-6, 0, 0, 2), making this the final possibility number four.

So, with this we can exactly determine what the current byte contains, and we can know what we need to expect in the next byte. We know what we need to keep as a state (what we expect from the next byte and any unterminated data from this byte), so we can make a stateful parser:

#define POSSIBILITY_ONE_FULLY_TERMINATED 0
#define POSSIBILITY_TWO_SIGN_BIT 1
#define POSSIBILITY_THREE_ODD_BIT_TERMINATE 2
#define POSSIBILITY_FOUR_ODD_BIT_TERMINATE 3

typedef struct Lookup {
    int ready_nb;
    int *ready_out;

    uint64_t incomplete;
    int incomplete_bits;

    int next_state;
} Lookup;

static const Lookup lookup_table[] = {
    /* 0    - 255 = POSSIBILITY_ONE_FULLY_TERMINATED */
    /* 256  - 511 = POSSIBILITY_TWO_SIGN_BIT */
    /* 512  - 767 = POSSIBILITY_THREE_ODD_BIT_TERMINATE */
    /* 768 - 1023 = POSSIBILITY_FOUR_ODD_BIT_TERMINATE */
};

void read_golomb(int *output, uint8_t *data, int bytes)
{
    int next_state = POSSIBILITY_ONE_FULLY_TERMINATED;
    uint64_t incomplete = 0x0;
    int incomplete_bits = 0;

    for (int i = 0; i < bytes; i++) {
        /* Load state */
        Lookup state = lookup_table[next_state * data[i]];

        /* Directly output any numbers fully terminated in the current byte */
        if (state->ready_nb) {
            memcpy(output, state->ready_out, state->ready_nb * sizeof(int));
            output += state->ready_nb;
        }

        /* Save incomplete state */
        append_bits(&incomplete, &incomplete_bits, state->incomplete, state->incomplete_bits);

        /* Output if the byte has terminated the sequence */
        if (state->terminate) {
            *output++ = read_sie_golomb(incomplete);
            incomplete = incomplete_bits = 0;
        }

        /* Carry over the state for the next byte */
        next_state = state->next_state;
    }
}

And so, with this pseudocode, we can parse Interleaved Signed exp-Golomb codes at a speed at least a few times faster than a naive implementation. Generating the lookup tables is a simple matter of iterating through all numbers from 0 to 255 for every possibility from the four types and trying to decode the golomb codes in them.

There are more optimizations to do, such as instead of storing bits for the incomplete code, storing the a decoded version of them such that decoding is entirely skipped. And, given the state can neatly fit into a 128 bit register, SIMD is also possible, though limited. All of this is outside the scope of this already long article.

exp-Golomb codes, simple to naΓ―vely decode, not that difficult to optimize, have been the go-to for any codec that needs speed and doesn't need full entropy encoding to save a few percent.
Do they still have a use nowadays? Not really. Fast, multisymbol range entropy encoders have been around for more than a decade. They're quick to decode in software, can be SIMD'd if they are adaptive and in general save you enough to make up for the performance loss. And after all, the best way to speed up parsing is to just have less overall data to parse.

Appendix: aren't exp-Golomb codes just Variable Length Codes?

Short answer: yes, but you shouldn't use a Variable Length Code parser to decode them.
Many codecs specify large tables of bit sequences and their lengths where each entry maps to a single number. Unlike an exp-Golomb, there's no correlation necessary between bits and the final parsed number, e.g. 0xff can map to 0 just as how it can map to 255. Unless a codec specifies a very low maximum number that can be encoded in a valid bitstream with exp-Golomb, then using a VLC parser is not feasible as even after quantization, the encoded numbers in binary sequences will likely exceed 32 bits, and having lookup tables larger than 256Kb evaporates any performance gained.

  1. VC-2 uses wavelets for transforms, which are a well known patent minefield ↩
  2. Replacing a DCT with a Wavelet does however provide potential latency improvements for ASICs and FPGAs, though for worse frequency decomposition ʰᡉˑˑᡒ ᴢᴾᴱᴳ²⁰⁰⁰ ↩
 ·  bitstream  ·  CC-BY logo

20-07-26

DASH streaming from the top-down

Whilst many articles and posts exist on how to setup DASH, most assume some sort of underlying infrastructure, many are outdated, don't specify enough or are simply vague. This post aims to explain from the top-down how to do DASH streaming. Without involving nginx-rtmp or any other antiquated methods.

The webpage

There are plenty of examples on how to use dash.js and/or video.js and videojs-contrib-dash, and you can just copy paste something cargo-culted to quickly get up and running.
But do you really need 3 js frameworks? As it turns out, you absolutely do not. Practically all of the examples or tutorials use older ancient versions of video.js. Modern video.js version 7 needs neither dash.js nor videojs-contrib-dash, since it already comes prepackaged with everything you need to play both DASH or HLS.

<html>
<head>
    <title>Live</title>
</head>
<body>
    <link href="<< PATH TO video-js.min.css >>" rel="stylesheet" />
    <script src="<< PATH TO video.min.js >>"></script>
    <div>
        <video-js id="live-video" width="100%" height="auto" controls
                  poster="<< LINK TO PLAYER BACKGROUND >>" class="vjs-default-skin vjs-16-9"
                  rel="preload" preload="auto" crossorigin="anonymous">
            <p class="vjs-no-js">
                To view this video please enable JavaScript, and/or consider upgrading to a
                web browser that
                <a href="https://videojs.com/html5-video-support/" target="_blank">
                    supports AV1 video and Opus audio
                </a>
            </p>
        </video-js>
    </div>
    <script>
        var player = videojs('live-video', {
            "liveui": true,
            "responsive": true,
        });
        player.ready(function() {
            player.src({ /* Silences a warning about how the mime type is unsupported with DASH */
                src: document.getElementById("stream_url").href,
                type: document.getElementById("stream_url").type,
            });
            player.on("error", function() {
                error = player.error();
                if (error.code == 4) {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "The stream is offline right now, come back later and refresh.";
                } else {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "Error: " + error.message;
                }
            });
            player.on("ended", function() {
                document.querySelector(".vjs-modal-dialog-content").textContent =
                    "The stream is over.";
            });
        })
    </script>
    <a id="stream_url"
       href="<< LINK TO PLAYLIST >>"
       type="application/dash+xml">
        Direct link to stream.
    </a>
</body>
</html>

This example, although simplistic, is fully adequate to render a DASH livestream, with the client adaptively selecting between screen sizes and displays.
Let me explain some details:

  • crossorigin="anonymous" sends anonymous CORS requests such that everything still works if your files and playlists are on a different server.
    NOTE: this does not apply to the DASH UTC timing URL. You'll still need this on your server. Its unclear whether this is a video.js bug or not.
  • width="100%" height="auto" keeps the player size constant to the page width.
  • "liveui": true, enables a new video.js interface that allows for seeking into buffers of livestreams. You can rewind a limited amount (determined by the server and somewhat the client) but its a very valuable ability. Its not currently (as of video.js 7.9.2) enabled by default as it breaks some IE versions and an ancient IOS version, but if you're going to be streaming using modern codecs (you are, right?) they'd be broken anyway.
  • "responsive": true, just makes the UI scale along with the player size.
  • if (error.code == 4) { is an intentional hack. video.js returns the standard HTML5 MediaError code, which unfortunately maps MEDIA_ERR_SRC_NOT_SUPPORTED (value 4) to many errors, including source file is missing. Which it would be if the stream isn't running.

Its easy to add statistics by adding this to player.ready(function() { and having a <p id="player_stats"></p> paragraph anywhere on the webpage:

player.on("progress", function() {
    range = player.buffered();
    buffer = range.end(range.length - 1) - range.start(0);
    rate = player.vhs.throughput;
    document.getElementById("player_stats").innerHTML =
        "Buffered:&nbsp;" + Number((buffer).toFixed(1)) + "s&nbsp;&middot;&nbsp;" +
        "Bitrate:&nbsp;" + Number((rate / 100000000.0).toFixed(1)) + "Mbps";
});

The same code used in this example can be found on this page of my website, minus styling.

UTC sync URL

In all cases, we'll need a DASH UTC URL. Add this to your nginx server, under the same server section where your webpage is hosted.

location = /utc_timestamp {
    return 200 "$time_iso8601";
    default_type text/plain;
}

This produces a regular ISO 8601 formatted timestamp. The timezone must be UTC, and for that, your server has to be set to use the UTC timezone.

Serving (using an FFmpeg relay)

If you'd like to simply relay incoming Matroska, SRT or RTMP stream, just make sure you can access the destination folder via nginx and you have correct permissions set. For an example server-side FFmpeg configure line, you can use this:

TS_NAME="$(date +%s)/"
TIME_URL="https://example.com/utc_timestamp"
ffmpeg -i $SOURCE\
    -c copy\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    $DESTINATION_FOLDER

This will create a playlist, per-stream init files and segment files in the destination folder. It will also fully manage all the files it creates in the destination folder, such as modifying them and deleting them.
There are quite a lot of options, so going through them:

  • remove_at_exit 1 just deletes all segments and the playlist on exit.
  • seg_duration 2 determines the segment size in segments. Must start with a keyframe, so must be a multiple of the keyframe period. Directly correlates with latency.
  • target_latency 2 sets the latency for L-DASH only. Players usually don't respect this. Should match segment duration.
  • frag_type duration sets that the segments should be further divided into fragments based on duration. Don't use other options unless you know what you're doing.
  • frag_duration 0.1 sets the duration of each fragment in seconds. One fragment every 0.1 seconds is a good number. Should NOT be an irrational number otherwise you'll run into timestamp rounding issues. Hence you should not use frag_type every_frame since all it does is it sets the duration to that of a single frame.
  • window_size 10 sets how many segments to keep in the playlist before removing them from the playlist.
  • extra_window_size 3 sets how many old segments to keep once off the playlist before deleting them, helps bad connections.
  • streaming 1 self explanatory.
  • ldash 1 low-latency DASH. Will write incomplete files to reduce latency.
  • write_prft 1 writes info using the UTC URL. Auto-enabled for L-DASH, but doesn't hurt to enable it here.
  • use_template 1 instead of writing a playlist containing all current segments and updating it on every segment, just writes the playlist once, specifying a range of segments, how long they are, and how long they're around. Very recommended.
  • use_timeline 0 its a playlist option to really disable old-style non-templated playlists.
  • index_correction 1 Tries to fix segment index if the input stream is incompatible, weird or lagging. If anything, serves as a good indicator of whether your input is such, as it'll warn if it corrects an index.
  • fflags +nobuffer+flush_packets disables as much caching in libavformat as possible to keep the latency down. Can save up to a few seconds of latency.
  • format_options "movflags=+cmaf" is required for conformance.
  • adaptation_sets "id=0,streams=0 id=1,streams=1" sets the adaptation sets, e.g. separate streams with different resolution or bitrate which the client can adapt with.
    • id=0 sets the first adaptation set, which should contain only a single type of streams (e.g. video or audio only).
    • frag_type and frag_duration can be set here to override the fragmentation on a per-adaptation stream basis.
    • streams=0 a comma separated list of FFmpeg stream indices to put into the adaptation set.
  • utc_timing_url must be set to the URL which you setup in the previous section.
  • init_seg_name and media_seg_name just setup a nicer segment directory layout.

On your client, set the keyframe period to the segment duration times your framerate:
2 seconds per segments * 60 frames per second = 120 frames for the keyframe period.

For some security on the source (ingest) connection you can try forwarding via the various SSH options, or use the server as a VPN, or if you can SSHFS into the server, don't mind not using Matroska, SRT or RTMP, and are the only person using the server, you can run the same command line on your client to an SSHFS directory.

Serving (using a DASH relay)

The approach above works okay, but what if you want the ultimate low-latency, actual security and the ability to use codecs newer than 20 years (and don't want to experiment with using Matroska as an ingest format)? You can just generate DASH on the upload-side itself (how unorthodox) and upload it, without having FFmpeg running on your server. However, its more complicated.

First, you'll need dash_server.py. It creates a server which proxies the requests from nginx for both uploading and downloading (so you still get caching). It can also be used standalone without nginx for testing, but we're not focusing on this.

Follow the provided example nginx_config in the project's root directory and add

# define connection to dash_server.py
upstream dash_server_py {
    server [::1]:8000;
}

In the base of your nginx website configuration.

Then, create your uploading server:

# this server handles media ingest
# authentication is handled throught TLS client certificates
server {
    # network config
    listen [::]:8001 ssl default_server;
    server_name <ingest server name>;

    # server's TLS cert+key
    ssl_certificate <path to TLS cert>;
    ssl_certificate_key <path to TLS key>;
    #ssl_dhparam <path to DH params, optional>;

    # source authentication with TLS client certificates
    ssl_client_certificate <path to CA for client certs>;
    ssl_verify_client on;

    # only allow upload and delete requests
    if ($request_method !~ ^(POST|PUT|DELETE)$) {
        return 405; # Method Not Allowed
    }

    root <path to site root>;

    # define parameters for communicating with dash_server.py
    # enable chunked transfers
    proxy_http_version        1.1;
    proxy_buffering           off;
    proxy_request_buffering   off;
    # finish the upload even if the client does not bother waiting for our
    # response
    proxy_ignore_client_abort on;

    location /live/ {
        proxy_pass http://dash_server_py;
    }
}

You'll need 2 certificates on the server - one for HTTPS (which you can just let certbot manage) and one for client authentication that you'll need to create yourself.
While its less practical than a a very long URL, it provides actual security. You can use openssl to generate the client certificate.

Then, in the same server where you host your website and UTC time URL, add this section:

location /live/ {
    try_files $uri @dash_server;
}

location @dash_server {
    proxy_pass http://dash_server_py;
}

Now, you can just run python3 dash_server.py -p 8000 as any user on your server and follow on reading to the client-side DASH setup section to sending data to it.

Serving (using a CDN)

Nothing to do, you're all set. You can follow on to the client-side DASH setup section now.

Client-side setup (for a DASH relay or a CDN)

As for the client side, as an example, you can use this command line:

TS_NAME="$(date +%s)/"
TIME_URL="https://example.com/utc_timestamp"
TLS_KEY="<path_to_client_key>"
TLS_CERT="<path_to_client_cert>"
TLS_CA="<path_to_client_ca>"
ffmpeg -i $SOURCE\
    -c:v libx264 -b:v 3M -g:v 120 -keyint_min:v 120\
    -c:a libopus -b:a 128k -frame_duration 100 -application audio -vbr off\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -timeout 0.2\
    -ignore_io_errors 1\
    -http_persistent 0\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    -http_opts key_file=$TLS_KEY,cert_file=$TLS_CRT,ca_file=$TLS_CA,tls_verify=1\
    $DESTINATION_URL_PORT_DIRECTORY

All the options are described above in the FFmpeg relay section, but there are a few new ones we need:

  • timeout 0.2 sets a timeout for each upload operation to complete before abandoning it. Helps robustness.
  • ignore_io_errors 1 does not error out if an operation times out. Obviously helps robustness.
  • http_persistent 0 disables persistent HTTP connections due to a dash_server.py bug. Post will be updated if it gets fixed. Set it to 1 if you're using a CDN.
  • http_opts sets up the certificates to use for authentication with the server. Most CDNs use URL 'security' so this option should be omitted there.

The ffmpeg CLI is by no means the only tool to directly output DASH. Any program which can use libavformat, such as OBS, various media players with a recording functionality, even CD rippers and so on can.
In fact, I'm working on a fully scriptable compositing and streaming program called txproto which accepts the same options as the ffmpeg CLI.

 ·  dash  ·  CC-BY logo

20-06-28

IIR filter SIMD and instruction dependencies

Last year I wrote some Opus emphasis filter SIMD. Let's take a look at the C version for the inverse filter:

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    for (int i = 0; i < len; i++)
        coeff = y[i] = x[i] + coeff * 0.85f;

    return coeff;
}

As you can see, each new output depends on the previous result. This might look like the worst possible code to SIMD, and indeed its very effective at repelling any casual attempts to do so or to even gauge how much gains you'd get.

But lets proceed anyway.

Since each operation depends on the previous, and there's no way of getting around this fact, your only choice is to duplicate the operations done for each previous output:

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    const float c1 = 0.85, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*coeff +  0*x[2] +  0*x[1] +  0*x[0];
        y[1] = x[1] + c2*coeff +  0*x[2] +  0*x[1] + c1*x[0];
        y[2] = x[2] + c3*coeff +  0*x[2] + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*coeff + c1*x[2] + c2*x[1] + c3*x[0];

        coeff = y[3];
        y += 4;
        x += 4;
    }

    return coeff;
}

Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation on such registers takes the same amount of cycles as if you're working with scalars, the potential 4x performance gains fade away from your expectations as you count the total operations that need to be performed, which, excluding loads and writes, adds up to 4 multiplies and 5 additions. Moreover, each sucessive output requires a shuffle for the input to match up, and CPUs in general only have a single unit capable of doing that, leading to high latencies.

Whilst we could get away with copying just a single element for the last output, extracting and inserting scalar data in vector registers is so painfully slow that on some occasions its better to just save via movsd [outq], m0 or similarly load via movsd/movhps xm0, [inq] certain parts of the register and ignore what happens in the rest. Hence we need to use a full-width shuffle.

But lets proceed anyway.1

         ; 0.85..^1    0.85..^2    0.85..^3    0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f

SECTION .text

INIT_XMM fma3
cglobal opus_deemphasis, 3, 3, 8, out, in, len
    ; coeff is already splatted in m0 on UNIX64

    movaps m4, [tab_st]
    VBROADCASTSS m5, m4
    shufps m6, m4, m4, q1111
    shufps m7, m4, m4, q2222

.loop:
    movaps  m1, [inq]                ; x0, x1, x2, x3

    pslldq  m2, m1, 4                ;  0, x0, x1, x2
    pslldq  m3, m1, 8                ;  0,  0, x0, x1

    fmaddps m2, m2, m5, m1           ; x + c1*x[0-2]
    pslldq  m1, 12                   ;  0,  0,  0, x0

    fmaddps m2, m3, m6, m2           ; x + c1*x[0-2] + c2*x[0-1]
    fmaddps m1, m1, m7, m2           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
    fmaddps m0, m0, m4, m1           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff

    movaps [outq], m0
    shufps m0, m0, q3333             ; new coeff

    add inq,  mmsize
    add outq, mmsize
    sub lend, mmsize >> 2
    jg .loop

    RET

We can increase speed and precision by combining the multiply and sum operations in a single fmaddps macro, which x86inc does magic on to output one of the new 3-operand Intel fused multiply-adds, based on operand order. Old AMD 4-operand style FMAs can also be generated, but considering AMD themselves dropped support for that2, it would only serve to waste binary space.

Since all we're doing is we're shifting data in the 128-bit register, instead of shufps we can use pslldq.
Old CPUs used to have penalties for using instructions of different type to the one the vector has, e.g. using mulps marked the resulting register as float and using pxor on it would incur a slowdown as the CPU had to switch transistors to route the register to a different unit. Newer CPUs don't have that as their units are capable of both float and integer operations, or the penalty is so small its immeasurable.

Let's run FFmpeg's make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench to see how slow we are.

benchmarking with native FFmpeg timers
nop: 32.8
checkasm: using random seed 524800271
FMA3:
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1024 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 7344.0
deemphasis_fma3: 1055.3

The units are decicycles, but they are irrelevant as we're only interested in the ratio between the C and FMA3 versions, and in this case that's a 7x speedup. A lot more than the theoretical 4x speedup we could get with 4x32 vector registers.

To explain how, take a look at the unrolled version again: for (int i = 0; i < len; i += 4). We're running the loop 4 times less than the unrolled version by doing more. But we're also not waiting for each previous operation to finish to produce a single output. Instead we're waiting for only the last one to complete to run another iteration of the loop, 3 times less than before.
This delay simply doesn't exist on more commonly SIMD'd functions like a dot product, and our assumption of a 4x maximum speedup is only valid for that case.

To be completely fair, we ought to be comparing the unrolled version to the handwritten assembly version, which reduced the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly). But in the interest of code readability and simplicity, we're keeping the small rolled version. Besides, we can afford to, as we usually end up writing more assembly for other popular platforms like aarch64 and so its rare that unoptimized functions get used.

Unfortunately the aarch64 version is a lot slower than x86's version, at least on the ODROID C2 I'm able to test on:

benchmarking with Linux Perf Monitoring API
nop: 66.1
checkasm: using random seed 585765880
NEON:
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1022 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 8680.7
deemphasis_neon: 4134.7

The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on ARM requires high privileges and even closed source binary blobs such as on the C2.
The speedup on the C2 was barely 2.10x. In general ARM CPUs have a lot less advanced lookahead and speculative execution than x86 CPUs. Some are even still in-order like the Raspberry PI 3's CPU. And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup was observed.

In conclusion traditional analog filter SIMD can be weird and impenetrable on a first or second glance, but in certain cases just trying and doing the dumb thing can yield much greater gains than expected.

  1. Yes, this is ignoring non-UNIX64 platforms, take a look at the FFmpeg source to find out how they're handled and weep at their ABI ineptitude ↩
  2. fun fact: CPUs before Ryzen don't signal the FMA4 flag but will still execute the instructions fine ↩
 ·  x86 neon  ·  CC-BY logo