Lynne's compiled musings



IRC: Lynne


Fast parsing of Interleaved Signed exp-Golomb codes

Eventually, all well optimized decoder implementations of codecs will hit a bottleneck, which is the speed at which they're able to parse data from the bitstream. Spending the time to optimize this parsing code is usually not worth it unless users are actually starting to hit this and its preventing them from scaling up. For a complicated filter-heavy codec such as AV1, this becomes a problem quite late, especially as the specifications limit the maximum bitrate to around 20Mbps, and care was taken to allow easy SIMDing of parsing code during the writing of the spec. For simple mezzanine codecs such as ProRes, DNxHD, CFHD, Dirac/VC-2 or even Intra-only VLC H.264, where bitrates of the order of hundreds of Mpbs, optimizing bitstream parsing is usually importance number one.

The context we'll be using while looking at bitstream decoding is that within the VC-2 codec, which is essentially a stripped down version of Dirac, made by the BBC for allegedly 1 patent unencumbered near-lossless video transmission over bandwidth limited connectivity (think 1.5Gbps 420 1080p60 over a 1Gbps connection).

To achieve this, the pixels are first transformed using one of many possible wavelet transforms, then quantized by means of division, and encoded. The wavelet transforms used are simple and easy to SIMD, the quantization is somewhat tricky but still decently fast as its just a multiply and an add per coefficient. Boring, standard, old, uninspired and uninteresting, the basis of anyone's first look-I-wrote-my-own-codec 2. Which leaves only the quantized coefficient encoding.

The coefficients are encoded using something called Interleaved Signed exp-Golomb codes. We'll go over this word by word, in reverse.

exp-Golomb codes, or exponential-Golomb for long, or just Golomb to those lazy and who know their codecs, are a form of binary encoding of arbitrarily sized integers, where the length is encoded as a prefix preceding the data. To illustrate:

0 β‡’ +1 β‡’ 1 β‡’ 1 β‡’ 1
1 β‡’ +1 β‡’ 2 β‡’ 10 β‡’ 010
2 β‡’ +1 β‡’ 3 β‡’ 11 β‡’ 011
3 β‡’ +1 β‡’ 4 β‡’ 100 β‡’ 00100
4 β‡’ +1 β‡’ 5 β‡’ 101 β‡’ 00101
5 β‡’ +1 β‡’ 6 β‡’ 110 β‡’ 00110
6 β‡’ +1 β‡’ 7 β‡’ 111 β‡’ 00111
7 β‡’ +1 = 8 β‡’ 1000 β‡’ 0001000

1 is added to the number if encoding a 0 is necessary, since otherwise encoding it would take 0 bits. The prefix is just the amount of bits after the most significant non-zero bit for the integer, minus one, encoded as a sequence of zeroes. This encoding doesn't have any interesting properties about it and is simple to naΓ―vely decode.

Signed exp-Golomb codes are just the same as above, only an additional bit is appended at the end to signal a sign. By universal convention, 1 encodes a negative number, and 0 encodes a positive number. The bit is not signalled if the number is 0.

Interleaved exp-Golomb codes take the same amount of bits to encode as regular Golomb, however on a first glance they are very different:

0 β‡’ +1 β‡’ 1 β‡’ 1 β‡’ 1
1 β‡’ +1 β‡’ 2 β‡’ 10 β‡’ 001
2 β‡’ +1 β‡’ 3 β‡’ 11 β‡’ 011
3 β‡’ +1 β‡’ 4 β‡’ 100 β‡’ 00001
4 β‡’ +1 β‡’ 5 β‡’ 101 β‡’ 00011
5 β‡’ +1 β‡’ 6 β‡’ 110 β‡’ 01001
6 β‡’ +1 β‡’ 7 β‡’ 111 β‡’ 01011
7 β‡’ +1 = 8 β‡’ 1000 β‡’ 0000001

As the number of bits hasn't changed, and there are still the same amount of zeroes as in normal exp-Golomb codes, the prefix is still there. Its just interleaved, where every odd bit (except the last one) is a 0, while every even bit encodes the integer. The reason why it looks so different is that with this coding scheme, coding the very first non-zero bit is unnecessary, hence its implicitly set to 1 when decoding.

Interleaved Signed exp-Golomb codes are finally just an interleaved exp-golomb code with an additional bit at the end to signal a sign. That bit is of course not signalled if the number encoded is a 0.

A more convenient way to think about interleaved exp-golomb codes is that every odd bit is actually a flag that tells you whether the number has ended (a 1) or that the next bit is part of the number (a 0). A simple parser for signed codes would then look like this:

int read_sie_golomb()
    uint32_t val = 0x1;

    while (!get_bit()) {
        val <<= 1;
        val |= get_bit();

    val -= 1;

    if (val && get_bit())
        val *= -1;

    return val;

Looks simple, and the loop has 3 instructions, so it should be fast, right? Bitstream readers are however, not exactly simple, not exactly have easily predicted branches, which makes them not exactly fast.

CPUs like it when the data you're looking at has a size of a power of two, with the smallest unit being a byte. Bytes can encode 256 total possibilities, which isn't exactly large, but if we could process a byte at a time rather than a bit at a time, we could have a potential 8x speedup.

01110010 is a sequence which encodes a -2 and a 1, both signed, and is exactly 8 bits. So if we make a lookup table, we can say that the byte contains a -2 and a 1, directly output the data to some array, and move on cleanly to the next byte.
This is possibility number one, where the byte contains all bits of the numbers present.

01101001 0xxxxxxx is a sequence which encodes a 2, a 0, and a -1. Unlike the previous example, all the numbers are terminated in a single byte, with only the sign bit missing. This is hence possibility number two, where the current byte has leftover data that needs exactly 1 bit from the next byte.

01011101 10xxxxxx is a sequence which encodes a -6 and a 2. Its 10 bits in length, so the last 2 bits of the 2 spill over into the next byte. We can output a -6, save the uncompleted bits of the 2, and move over to the next byte where we can combine the unterminated data from the previous byte with the data from the current byte.
However there's more to this. In the previous example, the -6 ended on an odd bit, making an even bit the start of the 2. As we know, the terminating bit of an interleaved exp-Golomb code will always be after an odd number of bits since the start. So we know that whenever the sequence ends, whether it be the next byte or the current byte, the ending bit of the sequence must be at an odd position. In other words, this is possibility number three, where the current byte is missing some data and needs the next, and possibly more bytes to complete, with the data ending at an odd position.
Of course, there's the possibility that the sequence will end on an even bit, such as with 01011110 110xxxxx (-6, 0, 0, 2), making this the final possibility number four.

So, with this we can exactly determine what the current byte contains, and we can know what we need to expect in the next byte. We know what we need to keep as a state (what we expect from the next byte and any unterminated data from this byte), so we can make a stateful parser:


typedef struct Lookup {
    int ready_nb;
    int *ready_out;

    uint64_t incomplete;
    int incomplete_bits;

    int next_state;
} Lookup;

static const Lookup lookup_table[] = {
    /* 256  - 511 = POSSIBILITY_TWO_SIGN_BIT */

void read_golomb(int *output, uint8_t *data, int bytes)
    uint64_t incomplete = 0x0;
    int incomplete_bits = 0;

    for (int i = 0; i < bytes; i++) {
        /* Load state */
        Lookup state = lookup_table[next_state * data[i]];

        /* Directly output any numbers fully terminated in the current byte */
        if (state->ready_nb) {
            memcpy(output, state->ready_out, state->ready_nb * sizeof(int));
            output += state->ready_nb;

        /* Save incomplete state */
        append_bits(&incomplete, &incomplete_bits, state->incomplete, state->incomplete_bits);

        /* Output if the byte has terminated the sequence */
        if (state->terminate) {
            *output++ = read_sie_golomb(incomplete);
            incomplete = incomplete_bits = 0;

        /* Carry over the state for the next byte */
        next_state = state->next_state;

And so, with this pseudocode, we can parse Interleaved Signed exp-Golomb codes at a speed at least a few times faster than a naive implementation. Generating the lookup tables is a simple matter of iterating through all numbers from 0 to 255 for every possibility from the four types and trying to decode the golomb codes in them.

There are more optimizations to do, such as instead of storing bits for the incomplete code, storing the a decoded version of them such that decoding is entirely skipped. And, given the state can neatly fit into a 128 bit register, SIMD is also possible, though limited. All of this is outside the scope of this already long article.

exp-Golomb codes, simple to naΓ―vely decode, not that difficult to optimize, have been the go-to for any codec that needs speed and doesn't need full entropy encoding to save a few percent.
Do they still have a use nowadays? Not really. Fast, multisymbol range entropy encoders have been around for more than a decade. They're quick to decode in software, can be SIMD'd if they are adaptive and in general save you enough to make up for the performance loss. And after all, the best way to speed up parsing is to just have less overall data to parse.

Appendix: aren't exp-Golomb codes just Variable Length Codes?

Short answer: yes, but you shouldn't use a Variable Length Code parser to decode them.
Many codecs specify large tables of bit sequences and their lengths where each entry maps to a single number. Unlike an exp-Golomb, there's no correlation necessary between bits and the final parsed number, e.g. 0xff can map to 0 just as how it can map to 255. Unless a codec specifies a very low maximum number that can be encoded in a valid bitstream with exp-Golomb, then using a VLC parser is not feasible as even after quantization, the encoded numbers in binary sequences will likely exceed 32 bits, and having lookup tables larger than 256Kb evaporates any performance gained.

  1. VC-2 uses wavelets for transforms, which are a well known patent minefield ↩
  2. Replacing a DCT with a Wavelet does however provide potential latency improvements for ASICs and FPGAs, though for worse frequency decomposition ʰᡉˑˑᡒ ᴢᴾᴱᴳ²⁰⁰⁰ ↩
 ·  bitstream  ·  CC-BY logo


DASH streaming from the top-down

Whilst many articles and posts exist on how to setup DASH, most assume some sort of underlying infrastructure, many are outdated, don't specify enough or are simply vague. This post aims to explain from the top-down how to do DASH streaming. Without involving nginx-rtmp or any other antiquated methods.

The webpage

There are plenty of examples on how to use dash.js and/or video.js and videojs-contrib-dash, and you can just copy paste something cargo-culted to quickly get up and running.
But do you really need 3 js frameworks? As it turns out, you absolutely do not. Practically all of the examples or tutorials use older ancient versions of video.js. Modern video.js version 7 needs neither dash.js nor videojs-contrib-dash, since it already comes prepackaged with everything you need to play both DASH or HLS.

    <link href="<< PATH TO video-js.min.css >>" rel="stylesheet" />
    <script src="<< PATH TO video.min.js >>"></script>
        <video-js id="live-video" width="100%" height="auto" controls
                  poster="<< LINK TO PLAYER BACKGROUND >>" class="vjs-default-skin vjs-16-9"
                  rel="preload" preload="auto" crossorigin="anonymous">
            <p class="vjs-no-js">
                To view this video please enable JavaScript, and/or consider upgrading to a
                web browser that
                <a href="" target="_blank">
                    supports AV1 video and Opus audio
        var player = videojs('live-video', {
            "liveui": true,
            "responsive": true,
        player.ready(function() {
            player.src({ /* Silences a warning about how the mime type is unsupported with DASH */
                src: document.getElementById("stream_url").href,
                type: document.getElementById("stream_url").type,
            player.on("error", function() {
                error = player.error();
                if (error.code == 4) {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "The stream is offline right now, come back later and refresh.";
                } else {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "Error: " + error.message;
            player.on("ended", function() {
                document.querySelector(".vjs-modal-dialog-content").textContent =
                    "The stream is over.";
    <a id="stream_url"
       href="<< LINK TO PLAYLIST >>"
        Direct link to stream.

This example, although simplistic, is fully adequate to render a DASH livestream, with the client adaptively selecting between screen sizes and displays.
Let me explain some details:

  • crossorigin="anonymous" sends anonymous CORS requests such that everything still works if your files and playlists are on a different server.
    NOTE: this does not apply to the DASH UTC timing URL. You'll still need this on your server. Its unclear whether this is a video.js bug or not.
  • width="100%" height="auto" keeps the player size constant to the page width.
  • "liveui": true, enables a new video.js interface that allows for seeking into buffers of livestreams. You can rewind a limited amount (determined by the server and somewhat the client) but its a very valuable ability. Its not currently (as of video.js 7.9.2) enabled by default as it breaks some IE versions and an ancient IOS version, but if you're going to be streaming using modern codecs (you are, right?) they'd be broken anyway.
  • "responsive": true, just makes the UI scale along with the player size.
  • if (error.code == 4) { is an intentional hack. video.js returns the standard HTML5 MediaError code, which unfortunately maps MEDIA_ERR_SRC_NOT_SUPPORTED (value 4) to many errors, including source file is missing. Which it would be if the stream isn't running.

Its easy to add statistics by adding this to player.ready(function() { and having a <p id="player_stats"></p> paragraph anywhere on the webpage:

player.on("progress", function() {
    range = player.buffered();
    buffer = range.end(range.length - 1) - range.start(0);
    rate = player.vhs.throughput;
    document.getElementById("player_stats").innerHTML =
        "Buffered:&nbsp;" + Number((buffer).toFixed(1)) + "s&nbsp;&middot;&nbsp;" +
        "Bitrate:&nbsp;" + Number((rate / 100000000.0).toFixed(1)) + "Mbps";

The same code used in this example can be found on this page of my website, minus styling.

UTC sync URL

In all cases, we'll need a DASH UTC URL. Add this to your nginx server, under the same server section where your webpage is hosted.

location = /utc_timestamp {
    return 200 "$time_iso8601";
    default_type text/plain;

This produces a regular ISO 8601 formatted timestamp. The timezone must be UTC, and for that, your server has to be set to use the UTC timezone.

Serving (using an FFmpeg relay)

If you'd like to simply relay incoming Matroska, SRT or RTMP stream, just make sure you can access the destination folder via nginx and you have correct permissions set. For an example server-side FFmpeg configure line, you can use this:

TS_NAME="$(date +%s)/"
ffmpeg -i $SOURCE\
    -c copy\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\

This will create a playlist, per-stream init files and segment files in the destination folder. It will also fully manage all the files it creates in the destination folder, such as modifying them and deleting them.
There are quite a lot of options, so going through them:

  • remove_at_exit 1 just deletes all segments and the playlist on exit.
  • seg_duration 2 determines the segment size in segments. Must start with a keyframe, so must be a multiple of the keyframe period. Directly correlates with latency.
  • target_latency 2 sets the latency for L-DASH only. Players usually don't respect this. Should match segment duration.
  • frag_type duration sets that the segments should be further divided into fragments based on duration. Don't use other options unless you know what you're doing.
  • frag_duration 0.1 sets the duration of each fragment in seconds. One fragment every 0.1 seconds is a good number. Should NOT be an irrational number otherwise you'll run into timestamp rounding issues. Hence you should not use frag_type every_frame since all it does is it sets the duration to that of a single frame.
  • window_size 10 sets how many segments to keep in the playlist before removing them from the playlist.
  • extra_window_size 3 sets how many old segments to keep once off the playlist before deleting them, helps bad connections.
  • streaming 1 self explanatory.
  • ldash 1 low-latency DASH. Will write incomplete files to reduce latency.
  • write_prft 1 writes info using the UTC URL. Auto-enabled for L-DASH, but doesn't hurt to enable it here.
  • use_template 1 instead of writing a playlist containing all current segments and updating it on every segment, just writes the playlist once, specifying a range of segments, how long they are, and how long they're around. Very recommended.
  • use_timeline 0 its a playlist option to really disable old-style non-templated playlists.
  • index_correction 1 Tries to fix segment index if the input stream is incompatible, weird or lagging. If anything, serves as a good indicator of whether your input is such, as it'll warn if it corrects an index.
  • fflags +nobuffer+flush_packets disables as much caching in libavformat as possible to keep the latency down. Can save up to a few seconds of latency.
  • format_options "movflags=+cmaf" is required for conformance.
  • adaptation_sets "id=0,streams=0 id=1,streams=1" sets the adaptation sets, e.g. separate streams with different resolution or bitrate which the client can adapt with.
    • id=0 sets the first adaptation set, which should contain only a single type of streams (e.g. video or audio only).
    • frag_type and frag_duration can be set here to override the fragmentation on a per-adaptation stream basis.
    • streams=0 a comma separated list of FFmpeg stream indices to put into the adaptation set.
  • utc_timing_url must be set to the URL which you setup in the previous section.
  • init_seg_name and media_seg_name just setup a nicer segment directory layout.

On your client, set the keyframe period to the segment duration times your framerate:
2 seconds per segments * 60 frames per second = 120 frames for the keyframe period.

For some security on the source (ingest) connection you can try forwarding via the various SSH options, or use the server as a VPN, or if you can SSHFS into the server, don't mind not using Matroska, SRT or RTMP, and are the only person using the server, you can run the same command line on your client to an SSHFS directory.

Serving (using a DASH relay)

The approach above works okay, but what if you want the ultimate low-latency, actual security and the ability to use codecs newer than 20 years (and don't want to experiment with using Matroska as an ingest format)? You can just generate DASH on the upload-side itself (how unorthodox) and upload it, without having FFmpeg running on your server. However, its more complicated.

First, you'll need It creates a server which proxies the requests from nginx for both uploading and downloading (so you still get caching). It can also be used standalone without nginx for testing, but we're not focusing on this.

Follow the provided example nginx_config in the project's root directory and add

# define connection to
upstream dash_server_py {
    server [::1]:8000;

In the base of your nginx website configuration.

Then, create your uploading server:

# this server handles media ingest
# authentication is handled throught TLS client certificates
server {
    # network config
    listen [::]:8001 ssl default_server;
    server_name <ingest server name>;

    # server's TLS cert+key
    ssl_certificate <path to TLS cert>;
    ssl_certificate_key <path to TLS key>;
    #ssl_dhparam <path to DH params, optional>;

    # source authentication with TLS client certificates
    ssl_client_certificate <path to CA for client certs>;
    ssl_verify_client on;

    # only allow upload and delete requests
    if ($request_method !~ ^(POST|PUT|DELETE)$) {
        return 405; # Method Not Allowed

    root <path to site root>;

    # define parameters for communicating with
    # enable chunked transfers
    proxy_http_version        1.1;
    proxy_buffering           off;
    proxy_request_buffering   off;
    # finish the upload even if the client does not bother waiting for our
    # response
    proxy_ignore_client_abort on;

    location /live/ {
        proxy_pass http://dash_server_py;

You'll need 2 certificates on the server - one for HTTPS (which you can just let certbot manage) and one for client authentication that you'll need to create yourself.
While its less practical than a a very long URL, it provides actual security. You can use openssl to generate the client certificate.

Then, in the same server where you host your website and UTC time URL, add this section:

location /live/ {
    try_files $uri @dash_server;

location @dash_server {
    proxy_pass http://dash_server_py;

Now, you can just run python3 -p 8000 as any user on your server and follow on reading to the client-side DASH setup section to sending data to it.

Serving (using a CDN)

Nothing to do, you're all set. You can follow on to the client-side DASH setup section now.

Client-side setup (for a DASH relay or a CDN)

As for the client side, as an example, you can use this command line:

TS_NAME="$(date +%s)/"
ffmpeg -i $SOURCE\
    -c:v libx264 -b:v 3M -g:v 120 -keyint_min:v 120\
    -c:a libopus -b:a 128k -frame_duration 100 -application audio -vbr off\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -timeout 0.2\
    -ignore_io_errors 1\
    -http_persistent 0\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    -http_opts key_file=$TLS_KEY,cert_file=$TLS_CRT,ca_file=$TLS_CA,tls_verify=1\

All the options are described above in the FFmpeg relay section, but there are a few new ones we need:

  • timeout 0.2 sets a timeout for each upload operation to complete before abandoning it. Helps robustness.
  • ignore_io_errors 1 does not error out if an operation times out. Obviously helps robustness.
  • http_persistent 0 disables persistent HTTP connections due to a bug. Post will be updated if it gets fixed. Set it to 1 if you're using a CDN.
  • http_opts sets up the certificates to use for authentication with the server. Most CDNs use URL 'security' so this option should be omitted there.

The ffmpeg CLI is by no means the only tool to directly output DASH. Any program which can use libavformat, such as OBS, various media players with a recording functionality, even CD rippers and so on can.
In fact, I'm working on a fully scriptable compositing and streaming program called txproto which accepts the same options as the ffmpeg CLI.

 ·  dash  ·  CC-BY logo


IIR filter SIMD and instruction dependencies

Last year I wrote some Opus emphasis filter SIMD. Let's take a look at the C version for the inverse filter:

static float deemphasis_c(float *y, float *x, float coeff, int len)
    for (int i = 0; i < len; i++)
        coeff = y[i] = x[i] + coeff * 0.85f;

    return coeff;

As you can see, each new output depends on the previous result. This might look like the worst possible code to SIMD, and indeed its very effective at repelling any casual attempts to do so or to even gauge how much gains you'd get.

But lets proceed anyway.

Since each operation depends on the previous, and there's no way of getting around this fact, your only choice is to duplicate the operations done for each previous output:

static float deemphasis_c(float *y, float *x, float coeff, int len)
    const float c1 = 0.85, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*coeff +  0*x[2] +  0*x[1] +  0*x[0];
        y[1] = x[1] + c2*coeff +  0*x[2] +  0*x[1] + c1*x[0];
        y[2] = x[2] + c3*coeff +  0*x[2] + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*coeff + c1*x[2] + c2*x[1] + c3*x[0];

        coeff = y[3];
        y += 4;
        x += 4;

    return coeff;

Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation on such registers takes the same amount of cycles as if you're working with scalars, the potential 4x performance gains fade away from your expectations as you count the total operations that need to be performed, which, excluding loads and writes, adds up to 4 multiplies and 5 additions. Moreover, each sucessive output requires a shuffle for the input to match up, and CPUs in general only have a single unit capable of doing that, leading to high latencies.

Whilst we could get away with copying just a single element for the last output, extracting and inserting scalar data in vector registers is so painfully slow that on some occasions its better to just save via movsd [outq], m0 or similarly load via movsd/movhps xm0, [inq] certain parts of the register and ignore what happens in the rest. Hence we need to use a full-width shuffle.

But lets proceed anyway.1

         ; 0.85..^1    0.85..^2    0.85..^3    0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f


cglobal opus_deemphasis, 3, 3, 8, out, in, len
    ; coeff is already splatted in m0 on UNIX64

    movaps m4, [tab_st]
    shufps m6, m4, m4, q1111
    shufps m7, m4, m4, q2222

    movaps  m1, [inq]                ; x0, x1, x2, x3

    pslldq  m2, m1, 4                ;  0, x0, x1, x2
    pslldq  m3, m1, 8                ;  0,  0, x0, x1

    fmaddps m2, m2, m5, m1           ; x + c1*x[0-2]
    pslldq  m1, 12                   ;  0,  0,  0, x0

    fmaddps m2, m3, m6, m2           ; x + c1*x[0-2] + c2*x[0-1]
    fmaddps m1, m1, m7, m2           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
    fmaddps m0, m0, m4, m1           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff

    movaps [outq], m0
    shufps m0, m0, q3333             ; new coeff

    add inq,  mmsize
    add outq, mmsize
    sub lend, mmsize >> 2
    jg .loop


We can increase speed and precision by combining the multiply and sum operations in a single fmaddps macro, which x86inc does magic on to output one of the new 3-operand Intel fused multiply-adds, based on operand order. Old AMD 4-operand style FMAs can also be generated, but considering AMD themselves dropped support for that2, it would only serve to waste binary space.

Since all we're doing is we're shifting data in the 128-bit register, instead of shufps we can use pslldq.
Old CPUs used to have penalties for using instructions of different type to the one the vector has, e.g. using mulps marked the resulting register as float and using pxor on it would incur a slowdown as the CPU had to switch transistors to route the register to a different unit. Newer CPUs don't have that as their units are capable of both float and integer operations, or the penalty is so small its immeasurable.

Let's run FFmpeg's make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench to see how slow we are.

benchmarking with native FFmpeg timers
nop: 32.8
checkasm: using random seed 524800271
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1024 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 7344.0
deemphasis_fma3: 1055.3

The units are decicycles, but they are irrelevant as we're only interested in the ratio between the C and FMA3 versions, and in this case that's a 7x speedup. A lot more than the theoretical 4x speedup we could get with 4x32 vector registers.

To explain how, take a look at the unrolled version again: for (int i = 0; i < len; i += 4). We're running the loop 4 times less than the unrolled version by doing more. But we're also not waiting for each previous operation to finish to produce a single output. Instead we're waiting for only the last one to complete to run another iteration of the loop, 3 times less than before.
This delay simply doesn't exist on more commonly SIMD'd functions like a dot product, and our assumption of a 4x maximum speedup is only valid for that case.

To be completely fair, we ought to be comparing the unrolled version to the handwritten assembly version, which reduced the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly). But in the interest of code readability and simplicity, we're keeping the small rolled version. Besides, we can afford to, as we usually end up writing more assembly for other popular platforms like aarch64 and so its rare that unoptimized functions get used.

Unfortunately the aarch64 version is a lot slower than x86's version, at least on the ODROID C2 I'm able to test on:

benchmarking with Linux Perf Monitoring API
nop: 66.1
checkasm: using random seed 585765880
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1022 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 8680.7
deemphasis_neon: 4134.7

The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on ARM requires high privileges and even closed source binary blobs such as on the C2.
The speedup on the C2 was barely 2.10x. In general ARM CPUs have a lot less advanced lookahead and speculative execution than x86 CPUs. Some are even still in-order like the Raspberry PI 3's CPU. And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup was observed.

In conclusion traditional analog filter SIMD can be weird and impenetrable on a first or second glance, but in certain cases just trying and doing the dumb thing can yield much greater gains than expected.

  1. Yes, this is ignoring non-UNIX64 platforms, take a look at the FFmpeg source to find out how they're handled and weep at their ABI ineptitude ↩
  2. fun fact: CPUs before Ryzen don't signal the FMA4 flag but will still execute the instructions fine ↩
 ·  x86 neon  ·  CC-BY logo