Lynne's compiled musings

https://github.com/cyanreg https://pars.ee/@lynne


IRC: Lynne

20-07-26

DASH streaming from the top-down

Whilst many articles and posts exist on how to set up DASH, most assume some sort of underlying infrastructure, many are outdated, and the rest are underspecified or simply vague. This post aims to explain, from the top down, how to do DASH streaming, without involving nginx-rtmp or any other antiquated methods.

The webpage

There are plenty of examples on how to use dash.js and/or video.js and videojs-contrib-dash, and you can just copy-paste something cargo-culted to quickly get up and running.
But do you really need 3 JS frameworks? As it turns out, you absolutely do not. Practically all of the examples and tutorials use ancient versions of video.js. Modern video.js version 7 needs neither dash.js nor videojs-contrib-dash, since it already comes prepackaged with everything you need to play both DASH and HLS.

<html>
<head>
    <title>Live</title>
</head>
<body>
    <link href="<< PATH TO video-js.min.css >>" rel="stylesheet" />
    <script src="<< PATH TO video.min.js >>"></script>
    <div>
        <video-js id="live-video" width="100%" height="auto" controls
                  poster="<< LINK TO PLAYER BACKGROUND >>" class="vjs-default-skin vjs-16-9"
                  rel="preload" preload="auto" crossorigin="anonymous">
            <p class="vjs-no-js">
                To view this video please enable JavaScript, and/or consider upgrading to a
                web browser that
                <a href="https://videojs.com/html5-video-support/" target="_blank">
                    supports AV1 video and Opus audio
                </a>
            </p>
        </video-js>
    </div>
    <script>
        var player = videojs('live-video', {
            "liveui": true,
            "responsive": true,
        });
        player.ready(function() {
            player.src({ /* Silences a warning about how the mime type is unsupported with DASH */
                src: document.getElementById("stream_url").href,
                type: document.getElementById("stream_url").type,
            });
            player.on("error", function() {
                var error = player.error();
                if (error.code == 4) {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "The stream is offline right now, come back later and refresh.";
                } else {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "Error: " + error.message;
                }
            });
            player.on("ended", function() {
                document.querySelector(".vjs-modal-dialog-content").textContent =
                    "The stream is over.";
            });
        })
    </script>
    <a id="stream_url"
       href="<< LINK TO PLAYLIST >>"
       type="application/dash+xml">
        Direct link to stream.
    </a>
</body>
</html>

This example, although simplistic, is fully adequate to render a DASH livestream, with the client adaptively selecting between whatever resolutions and bitrates the playlist offers.
Let me explain some details:

  • crossorigin="anonymous" sends anonymous CORS requests such that everything still works if your files and playlists are on a different server.
    NOTE: this does not apply to the DASH UTC timing URL, which you'll still need to host on your own server. It's unclear whether this is a video.js bug or not.
  • width="100%" height="auto" keeps the player size constant to the page width.
  • "liveui": true, enables a new video.js interface that allows seeking into the buffers of livestreams. You can rewind a limited amount (determined by the server and, to a lesser degree, the client), but it's a very valuable ability. It's not currently (as of video.js 7.9.2) enabled by default as it breaks some IE versions and an ancient iOS version, but if you're going to be streaming using modern codecs (you are, right?) those would be broken anyway.
  • "responsive": true, just makes the UI scale along with the player size.
  • if (error.code == 4) { is an intentional hack. video.js returns the standard HTML5 MediaError code, which unfortunately maps MEDIA_ERR_SRC_NOT_SUPPORTED (value 4) to many errors, including a missing source file. Which it would be if the stream isn't running.

It's easy to add statistics by adding this to player.ready(function() { and having a <p id="player_stats"></p> paragraph anywhere on the webpage:

player.on("progress", function() {
    var range = player.buffered();
    var buffer = range.end(range.length - 1) - range.start(0);
    var rate = player.vhs.throughput; /* bits per second */
    document.getElementById("player_stats").innerHTML =
        "Buffered:&nbsp;" + Number((buffer).toFixed(1)) + "s&nbsp;&middot;&nbsp;" +
        "Bitrate:&nbsp;" + Number((rate / 1000000.0).toFixed(1)) + "Mbps";
});

The same code used in this example can be found on this page of my website, minus styling.

UTC sync URL

In all cases, we'll need a DASH UTC URL. Add this to your nginx server, under the same server section where your webpage is hosted.

location = /utc_timestamp {
    return 200 "$time_iso8601";
    default_type text/plain;
}

This produces a regular ISO 8601 formatted timestamp. The timezone must be UTC, and for that, your server has to be set to use the UTC timezone.
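To sanity-check the endpoint, compare what it returns against a locally generated UTC timestamp; the shape should match exactly (example.com below is a placeholder for your own domain):

```shell
# What a correct response should look like, generated locally:
date -u +"%Y-%m-%dT%H:%M:%S+00:00"
# And what the server actually returns:
# curl -s https://example.com/utc_timestamp
# If the numeric offset isn't +00:00, the server isn't on UTC.
```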

Serving (using an FFmpeg relay)

If you'd like to simply relay an incoming Matroska, SRT or RTMP stream, just make sure you can access the destination folder via nginx and that you have the correct permissions set. For an example server-side FFmpeg command line, you can use this:

TS_NAME="$(date +%s)/"
TIME_URL="https://example.com/utc_timestamp"
ffmpeg -i $SOURCE\
    -c copy\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    $DESTINATION_FOLDER

This will create a playlist, per-stream init files and segment files in the destination folder. It will also fully manage all the files it creates in the destination folder, such as modifying them and deleting them.
There are quite a lot of options, so going through them:

  • remove_at_exit 1 just deletes all segments and the playlist on exit.
  • seg_duration 2 sets the segment duration in seconds. Each segment must start with a keyframe, so it must be a multiple of the keyframe period. Directly correlates with latency.
  • target_latency 2 sets the latency for L-DASH only. Players usually don't respect this. Should match segment duration.
  • frag_type duration sets that the segments should be further divided into fragments based on duration. Don't use other options unless you know what you're doing.
  • frag_duration 0.1 sets the duration of each fragment in seconds. One fragment every 0.1 seconds is a good number. It should be a duration that maps cleanly onto the timebase, otherwise you'll run into timestamp rounding issues. Hence you should not use frag_type every_frame, since all that does is set the fragment duration to that of a single frame.
  • window_size 10 sets how many segments to keep in the playlist before removing them from the playlist.
  • extra_window_size 3 sets how many old segments to keep once off the playlist before deleting them, helps bad connections.
  • streaming 1 is self-explanatory.
  • ldash 1 low-latency DASH. Will write incomplete files to reduce latency.
  • write_prft 1 writes producer reference timestamps using the UTC URL. Auto-enabled for L-DASH, but it doesn't hurt to enable it explicitly.
  • use_template 1 instead of writing a playlist containing all current segments and updating it on every segment, just writes the playlist once, specifying a range of segments, how long they are, and how long they're around. Very much recommended.
  • use_timeline 0 it's a templating option that disables old-style non-templated playlists for real.
  • index_correction 1 Tries to fix segment index if the input stream is incompatible, weird or lagging. If anything, serves as a good indicator of whether your input is such, as it'll warn if it corrects an index.
  • fflags +nobuffer+flush_packets disables as much caching in libavformat as possible to keep the latency down. Can save up to a few seconds of latency.
  • format_options "movflags=+cmaf" is required for conformance.
  • adaptation_sets "id=0,streams=0 id=1,streams=1" sets the adaptation sets, e.g. separate streams with different resolution or bitrate which the client can adapt with.
    • id=0 sets the first adaptation set, which should contain only a single type of stream (e.g. video-only or audio-only).
    • frag_type and frag_duration can be set here to override the fragmentation on a per-adaptation-set basis.
    • streams=0 a comma separated list of FFmpeg stream indices to put into the adaptation set.
  • utc_timing_url must be set to the URL which you set up in the previous section.
  • init_seg_name and media_seg_name just set up a nicer segment directory layout.

On your client, set the keyframe period to the segment duration times your framerate:
2 seconds per segment * 60 frames per second = 120 frames for the keyframe period.
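The same arithmetic as a tiny shell sketch (the variable names are just for illustration):

```shell
SEG_DURATION=2   # seconds, must match -seg_duration on the server
FPS=60           # your encoder's output framerate
GOP=$((SEG_DURATION * FPS))
echo "$GOP"      # 120; pass this as the encoder's keyframe interval
```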

For some security on the source (ingest) connection, you can try forwarding via the various SSH options, or use the server as a VPN. Alternatively, if you can SSHFS into the server, don't mind forgoing Matroska, SRT or RTMP, and are the only person using the server, you can run the same command line on your client against an SSHFS-mounted directory.

Serving (using a DASH relay)

The approach above works okay, but what if you want the ultimate low latency, actual security, and the ability to use codecs newer than 20 years old (and don't want to experiment with using Matroska as an ingest format)? You can just generate DASH on the upload side itself (how unorthodox) and upload it, without having FFmpeg run on your server at all. However, it's more complicated.

First, you'll need dash_server.py. It creates a server which proxies requests from nginx, for both uploading and downloading (so you still get caching). It can also be used standalone without nginx for testing, but we're not focusing on that here.

Follow the provided example nginx_config in the project's root directory and add

# define connection to dash_server.py
upstream dash_server_py {
    server [::1]:8000;
}

in the base of your nginx website configuration.

Then, create your uploading server:

# this server handles media ingest
# authentication is handled through TLS client certificates
server {
    # network config
    listen [::]:8001 ssl default_server;
    server_name <ingest server name>;

    # server's TLS cert+key
    ssl_certificate <path to TLS cert>;
    ssl_certificate_key <path to TLS key>;
    #ssl_dhparam <path to DH params, optional>;

    # source authentication with TLS client certificates
    ssl_client_certificate <path to CA for client certs>;
    ssl_verify_client on;

    # only allow upload and delete requests
    if ($request_method !~ ^(POST|PUT|DELETE)$) {
        return 405; # Method Not Allowed
    }

    root <path to site root>;

    # define parameters for communicating with dash_server.py
    # enable chunked transfers
    proxy_http_version        1.1;
    proxy_buffering           off;
    proxy_request_buffering   off;
    # finish the upload even if the client does not bother waiting for our
    # response
    proxy_ignore_client_abort on;

    location /live/ {
        proxy_pass http://dash_server_py;
    }
}

You'll need 2 certificates on the server - one for HTTPS (which you can just let certbot manage) and one for client authentication, which you'll need to create yourself.
While it's less practical than a very long URL, it provides actual security. You can use openssl to generate the client certificate.
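As a rough sketch of the openssl workflow (the filenames, subject names, key sizes and validity periods here are all arbitrary placeholders; adjust to taste):

```shell
# A self-signed CA used only for client certs; point
# ssl_client_certificate at ca.crt
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
    -keyout ca.key -out ca.crt -subj "/CN=ingest-ca"

# A client key and certificate signing request
openssl req -newkey rsa:2048 -nodes \
    -keyout client.key -out client.csr -subj "/CN=streamer"

# Sign the request with the CA; client.key + client.crt are what
# the uploader later presents (e.g. via ffmpeg's -http_opts)
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out client.crt
```

Keep ca.key off the streaming box; anyone holding it can mint valid client certificates.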

Then, in the same server where you host your website and UTC time URL, add this section:

location /live/ {
    try_files $uri @dash_server;
}

location @dash_server {
    proxy_pass http://dash_server_py;
}

Now, you can just run python3 dash_server.py -p 8000 as any user on your server, and follow on to the client-side DASH setup section to start sending data to it.

Serving (using a CDN)

Nothing to do, you're all set. You can follow on to the client-side DASH setup section now.

Client-side setup (for a DASH relay or a CDN)

As for the client side, as an example, you can use this command line:

TS_NAME="$(date +%s)/"
TIME_URL="https://example.com/utc_timestamp"
TLS_KEY="<path_to_client_key>"
TLS_CERT="<path_to_client_cert>"
TLS_CA="<path_to_client_ca>"
ffmpeg -i $SOURCE\
    -c:v libx264 -b:v 3M -g:v 120 -keyint_min:v 120\
    -c:a libopus -b:a 128k -frame_duration 100 -application audio -vbr off\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -timeout 0.2\
    -ignore_io_errors 1\
    -http_persistent 0\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    -http_opts key_file=$TLS_KEY,cert_file=$TLS_CERT,ca_file=$TLS_CA,tls_verify=1\
    $DESTINATION_URL_PORT_DIRECTORY

All the options are described above in the FFmpeg relay section, but there are a few new ones we need:

  • timeout 0.2 sets a timeout for each upload operation to complete before abandoning it. Helps robustness.
  • ignore_io_errors 1 does not error out if an operation times out. Obviously helps robustness.
  • http_persistent 0 disables persistent HTTP connections due to a dash_server.py bug. Post will be updated if it gets fixed. Set it to 1 if you're using a CDN.
  • http_opts sets up the certificates to use for authentication with the server. Most CDNs use URL 'security' so this option should be omitted there.

The ffmpeg CLI is by no means the only tool that can directly output DASH. Any program which uses libavformat can: OBS, various media players with recording functionality, even CD rippers and so on.
In fact, I'm working on a fully scriptable compositing and streaming program called txproto which accepts the same options as the ffmpeg CLI.

 ·  dash  ·  CC-BY

20-06-28

IIR filter SIMD and instruction dependencies

Last year I wrote some Opus emphasis filter SIMD. Let's take a look at the C version for the inverse filter:

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    for (int i = 0; i < len; i++)
        coeff = y[i] = x[i] + coeff * 0.85f;

    return coeff;
}

As you can see, each new output depends on the previous result. This might look like the worst possible code to SIMD, and indeed it's very effective at repelling any casual attempt to do so, or even to gauge how much of a gain you'd get.

But let's proceed anyway.

Since each operation depends on the previous, and there's no way of getting around this fact, your only choice is to duplicate the operations done for each previous output:

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    const float c1 = 0.85, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*coeff +  0*x[2] +  0*x[1] +  0*x[0];
        y[1] = x[1] + c2*coeff +  0*x[2] +  0*x[1] + c1*x[0];
        y[2] = x[2] + c3*coeff +  0*x[2] + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*coeff + c1*x[2] + c2*x[1] + c3*x[0];

        coeff = y[3];
        y += 4;
        x += 4;
    }

    return coeff;
}

Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation on such registers takes the same number of cycles as its scalar counterpart, the potential 4x performance gain fades away from your expectations as you count the total operations that need to be performed, which, excluding loads and stores, add up to 4 multiplies and 5 additions. Moreover, each successive output requires a shuffle for the input to match up, and CPUs in general only have a single unit capable of doing that, leading to high latencies.

Whilst we could get away with copying just a single element for the last output, extracting and inserting scalar data in vector registers is so painfully slow that on some occasions it's better to just store parts of a register via movsd [outq], m0, or similarly load parts via movsd/movhps xm0, [inq], and ignore what happens in the rest of it. Hence we need to use a full-width shuffle.

But let's proceed anyway.1

         ; 0.85..^1    0.85..^2    0.85..^3    0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f

SECTION .text

INIT_XMM fma3
cglobal opus_deemphasis, 3, 3, 8, out, in, len
    ; coeff is already splatted in m0 on UNIX64

    movaps m4, [tab_st]
    VBROADCASTSS m5, m4
    shufps m6, m4, m4, q1111
    shufps m7, m4, m4, q2222

.loop:
    movaps  m1, [inq]                ; x0, x1, x2, x3

    pslldq  m2, m1, 4                ;  0, x0, x1, x2
    pslldq  m3, m1, 8                ;  0,  0, x0, x1

    fmaddps m2, m2, m5, m1           ; x + c1*x[0-2]
    pslldq  m1, 12                   ;  0,  0,  0, x0

    fmaddps m2, m3, m6, m2           ; x + c1*x[0-2] + c2*x[0-1]
    fmaddps m1, m1, m7, m2           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
    fmaddps m0, m0, m4, m1           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff

    movaps [outq], m0
    shufps m0, m0, q3333             ; new coeff

    add inq,  mmsize
    add outq, mmsize
    sub lend, mmsize >> 2
    jg .loop

    RET

We can increase speed and precision by combining the multiply and add operations in a single fmaddps macro, which x86inc does magic on to output one of the new 3-operand Intel fused multiply-add instructions, based on operand order. Old AMD 4-operand style FMAs can also be generated, but considering AMD themselves dropped support for that2, it would only serve to waste binary space.

Since all we're doing is shifting data within the 128-bit register, we can use pslldq instead of shufps.
Old CPUs used to have penalties for using instructions of a different type to the data in the vector, e.g. using mulps marked the resulting register as float, and using pxor on it would incur a slowdown as the CPU had to switch transistors to route the register to a different unit. Newer CPUs don't have this problem, either because their units are capable of both float and integer operations, or because the penalty is so small it's immeasurable.

Let's run FFmpeg's make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench to see how slow we are.

benchmarking with native FFmpeg timers
nop: 32.8
checkasm: using random seed 524800271
FMA3:
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1024 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 7344.0
deemphasis_fma3: 1055.3

The units are decicycles, but they are irrelevant as we're only interested in the ratio between the C and FMA3 versions, and in this case that's a 7x speedup. A lot more than the theoretical 4x speedup we could get with 4x32 vector registers.

To explain how, take a look at the unrolled version again: for (int i = 0; i < len; i += 4). We're running the loop 4 times fewer than the rolled version by doing more per iteration. But we're also no longer waiting for each previous operation to finish just to produce a single output. Instead, we only wait for the last output to complete before running another iteration of the loop, 3 fewer stalls than before.
This delay simply doesn't exist in more commonly SIMD'd functions like a dot product, and our assumption of a 4x maximum speedup is only valid for that case.
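For contrast, here's a dot product sketch (not from any codebase; names are mine): each partial product is independent, so the lanes of a vector register map directly onto the work:

```c
/* Naive dot product: the additions into sum form a chain, but each
   a[i]*b[i] product is independent, so the chain can be split up. */
static float dot_c(const float *a, const float *b, int len)
{
    float sum = 0.0f;
    for (int i = 0; i < len; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Split into 4 independent accumulators, mirroring what a 4-wide
   SIMD version does; the CPU can keep all four chains in flight at
   once. Assumes len is a multiple of 4. */
static float dot_4acc(const float *a, const float *b, int len)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < len; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Here a 4x expectation is realistic, because no iteration ever waits on a previous one; the deemphasis filter beat it only by removing serialization the scalar version suffered from.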

To be completely fair, we ought to compare the unrolled C version to the handwritten assembly, which reduces the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly). But in the interest of code readability and simplicity, we're keeping the small rolled version. Besides, we can afford to, as we usually end up writing assembly for other popular platforms like aarch64 as well, so it's rare that unoptimized functions get used.

Unfortunately the aarch64 version is a lot slower than x86's version, at least on the ODROID C2 I'm able to test on:

benchmarking with Linux Perf Monitoring API
nop: 66.1
checkasm: using random seed 585765880
NEON:
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1022 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 8680.7
deemphasis_neon: 4134.7

The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on ARM requires elevated privileges, and on devices like the C2 even closed-source binary blobs get in the way.
The speedup on the C2 was barely 2.10x. In general, ARM CPUs have far less advanced lookahead and speculative execution than x86 CPUs; some, like the Raspberry Pi 3's CPU, are even still in-order. And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup was observed.

In conclusion, traditional IIR filter SIMD can be weird and impenetrable at first or second glance, but in certain cases just trying and doing the dumb thing can yield much greater gains than expected.

  1. Yes, this is ignoring non-UNIX64 platforms, take a look at the FFmpeg source to find out how they're handled and weep at their ABI ineptitude
  2. fun fact: CPUs before Ryzen don't signal the FMA4 flag but will still execute the instructions fine
 ·  x86 neon  ·  CC-BY