Lynne's compiled musings

23-12-19

Vulkan Video encoding

Today, Vulkan's video encoding extensions have been officially released! FFmpeg support is upcoming, with development happening on my branch.

To compile, run:

./configure --enable-vulkan && make

To test:

./ffmpeg_g -init_hw_device vulkan -i <input> -vf format=nv12,hwupload -c:v h264_vulkan -y <output>

The implementation currently supports I-frames only, and uses its own rate control system rather than the driver's, but support for all features in the specification will be added.

 ·  vulkan  ·  CC-BY logo

23-01-15

VK_MESA_video_decode_av1

With the standardization of the Vulkan decoding extension less than a month ago, two codecs were defined - H264 and H265. While they have cemented their position in multimedia, another, newer codec called AV1 appeared. Indeed, I was involved with its standardization. Not entirely satisfied with the pace of Khronos, nor with VAAPI's lack of synchronization, Dave Airlie and I decided to make our own extension to support AV1 decoding - VK_MESA_video_decode_av1. We were granted an official dedicated stable extension number from Khronos, 510, to avoid incompatibilities.

The extension is done in the same style as the other 2 decoder extensions, but with differences. Unlike MPEG codecs, AV1 lacks the overcomplicated multiple NALU structure. A single sequence header unit is all that's needed to begin decoding frames. Each frame is prefixed by a (quite large) frame header. Thus, the video session parameters were delegated to the only piece of header that may (or may not) be common amongst multiple frames - the sequence header. The frame header is supplied separately, via each frame's VkVideoDecodeInfoKHR.

AV1 has support for film grain insertion, which creates a challenge for hardware decoders, as the film grain is required to be put on the decode output only, and be missing from the reconstructed output used for references. In my previous article about Vulkan decoding, I mentioned the three possible modes for decoding reference buffers:

  • in-place (output frames are also used as references)
  • out-of-place (output frames are separate from references)
  • out-of-place layered (output frames are separate from references, which are in a single multi-layered image)

The first option cannot be combined with film grain, as the output images have film grain. But we still want to use it if the hardware supports it and no film grain is enabled. So, we require that if the user signals that a frame has film grain enabled, the output image view MUST be different from the reference image view. To accomplish that, we use a pool of decoding buffers only for frames with film grain, which requires that at least VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR is set for AV1 decoding. Devices which support in-place decoding also support out-of-place decoding, which makes this method compatible with future hardware. This allows us to handle cases where film grain is switched on for a frame, and then switched off, without wasting memory on separate reference buffers. This also allows for external film grain application during presentation, such as with libplacebo.
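
As a rough illustration of that selection logic (the context struct and the pool helper here are hypothetical stand-ins, not the actual code):

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* Hypothetical decoder context and pool helper - the names are made up. */
typedef struct AV1DecoderContext {
    bool dpb_and_output_coincide; /* in-place decoding is supported */
} AV1DecoderContext;

VkImageView get_pooled_dpb_view(AV1DecoderContext *dec);

/* Pick the image view that will hold the reconstructed (reference) picture. */
static VkImageView pick_dpb_view(AV1DecoderContext *dec, VkImageView out_view,
                                 bool film_grain_enabled)
{
    if (film_grain_enabled || !dec->dpb_and_output_coincide) {
        /* Grain only goes on the decode output, so the reference must come
         * from the separate pool (this is where DPB_AND_OUTPUT_DISTINCT is
         * required). */
        return get_pooled_dpb_view(dec);
    }

    /* No film grain and in-place decoding supported: the output image view
     * doubles as the reference. */
    return out_view;
}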

Another difference in how references are handled between AV1 and MPEG codecs is that a single frame can overwrite multiple reference slots. Rather than naively do copies, we instead let the higher-level decoder (not the hardware accelerator, which is what the extension is) handle this, by reference-counting the frame. This does require that the hardware supports reusing the same frame in multiple slots, which, to our knowledge, all AV1 hardware accelerators do.
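
A minimal sketch of that reference counting (types simplified; refresh_frame_flags is the AV1 frame header field that selects which slots get overwritten):

#define AV1_NUM_REF_FRAMES 8

typedef struct RefFrame {
    int refcount;
    /* ...image view, frame ID, metadata... */
} RefFrame;

static void replace_ref(RefFrame **dst, RefFrame *src)
{
    if (*dst && --(*dst)->refcount == 0) {
        /* last user gone - return the frame's image to the pool */
    }
    *dst = src;
    if (src)
        src->refcount++;
}

/* One decoded frame can occupy several DPB slots without a single copy. */
static void update_dpb(RefFrame *slots[AV1_NUM_REF_FRAMES], RefFrame *cur,
                       unsigned refresh_frame_flags)
{
    for (int i = 0; i < AV1_NUM_REF_FRAMES; i++)
        if (refresh_frame_flags & (1u << i))
            replace_ref(&slots[i], cur);
}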

Finally, the biggest issue was with the hardware itself. AMD's hardware decoder expects a unique 8-bit ID to be assigned for each frame. This was no problem for index-based APIs, such as VAAPI, VDPAU, DXVA2, NVDEC and practically all other decoding APIs.

Vulkan, however, is not index based. Each frame does not have an index - instead, it's much lower level, working with bare device addresses. Users are free to alias the address and use it as another frame, which immediately breaks the uniqueness of indices.

To work around this hardware limitation, we had no choice but to create a made-up frame ID. Writing the code was difficult, as it was a huge hack in what was otherwise a straightforward implementation.
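
Conceptually, the hack boils down to a tiny ID allocator along these lines (purely illustrative):

#include <stdint.h>

typedef struct FrameIDPool {
    uint32_t used[8]; /* one bit per possible 8-bit ID */
} FrameIDPool;

/* Hand out the lowest unused 8-bit ID when a new frame is created... */
static int alloc_frame_id(FrameIDPool *p)
{
    for (int id = 0; id < 256; id++) {
        if (!(p->used[id >> 5] & (1u << (id & 31)))) {
            p->used[id >> 5] |= 1u << (id & 31);
            return id;
        }
    }
    return -1; /* more than 256 frames alive at once */
}

/* ...and hand it back once the frame is no longer referenced anywhere. */
static void free_frame_id(FrameIDPool *p, int id)
{
    p->used[id >> 5] &= ~(1u << (id & 31));
}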

AV1's frame header does feature frame IDs, however, those are completely optional, and most encoders skip them (with good reason, as the frame header is already needlessly large).

While it's possible for the extension to become official, it requires maintenance and sitting through meetings, which neither of us has the time for. Instead, we hope that this extension becomes a starting point for an official version, with all the discussion points highlighted. The official extension probably wouldn't look very different. It's possible to build it in other ways, but doing so would be inefficient, and probably unimplementable - we were able to fit our extension to use the same model as all other AV1 hardware accelerators available in FFmpeg.

The extension's very likely going to get changes as we receive some feedback from hardware vendors (if we do at all, we'd certainly like to know why AMD designed theirs the way they did).

The code can be found here:

Additionally, you can read Dave's blog post here - https://airlied.blogspot.com/2023/01/vulkan-video-decoding-av1-yes-av1.html.

It was nice seeing all the areas I worked on in AV1 actually get used in practice. As for where this goes, well, saying that we might be getting access to AMD's 7000-series soon would be more than enough :)

Update: while looking through Intel's drivers and documentation for AV1 support, Dave Airlie discovered that the hardware only supports decoding of a single tilegroup per command buffer. This is not compatible with our extension, nor with the way Vulkan video decoding currently exists as operating on a frame-only basis. Hopefully some solution exists which does not involve extensive driver bitstream modifications. Also, in the case of AMD, it may be possible to hide the frame index in the driver-side VkImage structure. However, the hardware seems to expect an ID based on the frame structure, which may make this impossible.

 ·  vulkan  ·  CC-BY logo

22-12-15

Vulkan Video decoding

All video acceleration APIs came in three flavours.

  • System-specific
    • DXVA(2), DirectX 12 Video, MediaCodec
    • VAAPI, VDPAU, XvMC, XvBA, YAMI, V4L2, OMX
    • etc..
  • Vendor-specific
    • Quick Sync, MFX
    • Avivo, AMF
    • Crystal HD
    • CUVID, NVDEC, NVENC
    • RKMPP
    • Others I'm sure I've forgotten...
  • System AND vendor specific
    • Videotoolbox

All of those APIs come with quirks. Some insist that you can only use up to 31 frames, which is problematic for complex transcoding chains. Some insist that you preallocate all frames during initialization, which is memory-inefficient. Most require your H264 stream to be in Annex-B for no good reason. Some even give you VP9 packets in some unholy unstandardized Annex-B. Some require that you give them raw NALs, others want out-of-band NALs and slices, and some do both.

If you wanted to do processing on hardware video frames, your options were limited. Most of the APIs let you export frames to OpenGL for presentation. Some of the more benevolent APIs let you import frames back from OpenGL or DRM. A few of them also let you do OpenCL processing.

And of course, all of this happened with little to no synchronization. Artifacts like video tearing, block decoding not quite being finished, and missing references are commonplace even nowadays. Most APIs being stateful made compensating for missing references or damaged files difficult.

Finally, attempting to standardize this jungle is Vulkan video. Rather than a compromise, it is low-level enough to describe most quirks of video acceleration silicon, and with a single codepath, let you decode and encode video with relative statelessness.

Implementation-wise, so far there has only been a single example, the vk_video_samples repository. As far as example code goes, I wouldn't recommend it. Moreover, it uses a closed source parsing library.

I wrote and maintain the Vulkan code in FFmpeg, so it fell on me to integrate video decoding and encoding. At the same time, Dave Airlie started writing a RADV (Mesa's Vulkan code for AMD chips) implementation. With his invaluable help, in a few weeks, minus some months of inactivity, we have a working and debuggable open-source driver implementation, and clean, performant API user code.

Technical aspects

The biggest difference between Vulkan video and other APIs is that you have to manage memory yourself, specifically the reference frame buffer. Vulkan calls it the Decoded Picture Buffer (DPB), which is a rather MPEG-ese term, but fair enough. There are three possible configurations of the DPB:

  • Previous output pictures are usable as references. 1

  • Centralized DPB pool consisting of multiple images. 2

  • Centralized DPB pool consisting of a single image with multiple layers. 3

In the first case, you do not have to allocate any extra memory, but merely keep references of older frames. FFmpeg's hwaccel framework does this already.
Intel's video decoding hardware supports this behavior.

In the second case, for each output image you create, you have to allocate an extra image from a pool with a specific image usage flag. You give the driver the output, the output's separate reference DPB image, and all previous reference DPB images; it then writes to your output frame, while simultaneously also writing to the DPB reference image for the current frame.
Recent AMD (Navi21+) and most Nvidia hardware support this mode.

In the third case, the situation is identical to the second case, only that you have to create a single image upfront with as many layers as there are maximum references. Then, when creating a VkImageView, you specify which layer you need based on the DPB slot. This is a problematic mode, as you have to allocate all references you need upfront, even if they're never used. Which, for 8k HEVC video, is around 3.2 gigabytes of Video RAM.
Older AMD hardware requires this.
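
Picking between the three comes down to checking the capability flags listed in the footnotes; roughly (error handling omitted, the profile and device setup are assumed to already exist):

#include <vulkan/vulkan.h>

static int pick_dpb_mode(VkPhysicalDevice dev, const VkVideoProfileInfoKHR *profile)
{
    VkVideoDecodeCapabilitiesKHR dec_caps = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_CAPABILITIES_KHR,
    };
    VkVideoCapabilitiesKHR caps = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_CAPABILITIES_KHR,
        .pNext = &dec_caps,
    };

    vkGetPhysicalDeviceVideoCapabilitiesKHR(dev, profile, &caps);

    if (dec_caps.flags & VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR)
        return 1; /* previous output pictures double as references */
    if (caps.flags & VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR)
        return 2; /* pool of standalone DPB images */
    return 3;     /* one image with caps.maxDpbSlots layers, allocated upfront */
}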

Another difference with regards to buffer management is that unlike other APIs which all managed their own references, with Vulkan, you have to know which slot in the DPB each reference belongs to. For H264, this is simply the picture index. For HEVC, after considerable trial and error, we found it to be the index of the frame in the DPB array. 'slot' is not a standard term in video decoding, but in lieu of anything better, it's appropriate.

Apart from this, the rest is mostly standard. As with NVDEC, VDPAU and DXVA, slice decoding is, sadly, not supported, which means you have to concatenate the data for each slice into a buffer, with start codes 4, then upload the data to a VkBuffer to decode from. Somewhat of an issue with very high bitrate video, but at least Vulkan lets you have spare and host buffers to work around this.
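
For illustration, the concatenation itself is nothing more than this sort of thing (a sketch; the real code also has to respect the implementation's bitstream buffer alignment requirements):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct SliceBuffer {
    uint8_t *data;
    size_t   size;
    size_t   capacity;
} SliceBuffer;

/* Append one slice, prefixed with an Annex-B style start code, growing the
 * host-side buffer as needed before it gets uploaded to the VkBuffer. */
static int append_slice(SliceBuffer *b, const uint8_t *slice, size_t slice_size)
{
    static const uint8_t start_code[3] = { 0x0, 0x0, 0x1 };
    size_t needed = b->size + sizeof(start_code) + slice_size;

    if (needed > b->capacity) {
        size_t new_cap = b->capacity ? b->capacity * 2 : 4096;
        while (new_cap < needed)
            new_cap *= 2;
        uint8_t *tmp = realloc(b->data, new_cap);
        if (!tmp)
            return -1;
        b->data     = tmp;
        b->capacity = new_cap;
    }

    memcpy(b->data + b->size, start_code, sizeof(start_code));
    memcpy(b->data + b->size + sizeof(start_code), slice, slice_size);
    b->size = needed;
    return 0;
}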

Unlike other decoding APIs, which let you only set a few SPS, PPS (and VPS in HEVC) fields, you have to parse and set practically every single field from those bitstream elements. For HEVC alone, the total maximum possible struct size for all fields is 114 megabytes, which means you really ought to pool the structure memory and expand it when necessary, because although it's unlikely that you will get a stream using all possible values, anyone can craft one and either corrupt your output or crash your decoder.
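
The pooling itself is trivial - grow one allocation only when a stream actually needs it, rather than allocating the worst case upfront (sketch):

#include <stdlib.h>

typedef struct ParamPool {
    void  *buf;
    size_t size;
} ParamPool;

/* Returns a buffer of at least 'needed' bytes, reusing and expanding the
 * pooled allocation instead of allocating the theoretical maximum each time. */
static void *pool_get(ParamPool *p, size_t needed)
{
    if (needed > p->size) {
        void *tmp = realloc(p->buf, needed);
        if (!tmp)
            return NULL;
        p->buf  = tmp;
        p->size = needed;
    }
    return p->buf;
}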

Vulkan video requires that multiplane YUV images are used. Multiplane images are rather limiting, as they're not well-supported, and if you'd like to use them to do processing, you have to use DISJOINT images with an EXTENDED creation flag (to be able to create VkImageViews with STORAGE usage flags), which are even less supported and quirky. Originally, the FFmpeg Vulkan code relied entirely on emulating multiplane images by using separate images per plane. To work Vulkan video into this, I initially wrote some complicated ALIASing code to alias the memory from the separate VkImages to the multiplane VkImage necessary for decoding. This eventually got messy enough to make me give up on the idea, and port the entire code to allow for first-class multiplane support. Some foreknowledge of the drafting process would've helped, but lacking this, as well as any involvement in the standardization, refactoring was necessary.

Code

As of 2022-12-19, the code has not yet been merged into mainline FFmpeg. My branch can be found here. There is still more refactoring necessary to make multiplane images first-class, which would be good enough to merge, but for now, it's able to decode both H264 and HEVC video streams in 8-bit and 10-bit form.

To compile, clone and checkout the vulkan_decode branch:

git clone -b vulkan_decode https://github.com/cyanreg/FFmpeg

To configure, use this line:

./configure --disable-doc --disable-shared --enable-static --disable-ffplay --disable-ffprobe --enable-vulkan

Then type make -j$(nproc) to compile.

To run,

./ffmpeg_g -init_hw_device "vulkan=vk:0,debug=1" -hwaccel vulkan -hwaccel_output_format vulkan -i <INPUT_FILE> -loglevel debug -filter_hw_device vk -vf hwdownload,format=nv12 -c:v rawvideo -an -y OUT.nut

The validation layers are turned on via the debug=1 option.
To decode 10-bit content, you must replace format=nv12 with format=p010,format=yuv420p10.
To use a different Vulkan device, replace vulkan=vk:0 with vulkan=vk:<N>, where <N> is the device index you'd like to use.

This will produce an OUT.nut file containing the uncompressed decoded data. You can play this using ffplay, mpv or VLC. Alternatively, there are many resources on how to use the FFmpeg CLI and output whatever format you'd like.

Non-subsampled 444 decoding is possible, provided drivers enable support for it.

Driver support

Currently, as of 2022-12-19, there are 3 drivers supporting Vulkan Video.

  • RADV
  • ANV
  • Nvidia Vulkan Beta drivers

For RADV, Dave Airlie's radv-vulkan-video-prelim-decode branch is necessary.
RADV has full support for Vulkan decoding - 8-bit H264, 8-bit and 10-bit HEVC. The output is spec-compliant.
For installation instructions, check out Dave's blog.

For ANV, his anv-vulkan-video-prelim-decode branch is needed instead.
ANV has partial support for H264 - certain streams may cause crashes. For installation instructions, check out Dave's blog.

For Nvidia, the Vulkan Beta drivers are necessary. Only Linux has been tested.
The drivers produce compliant 8-bit HEVC decoding output with my code. 10-bit HEVC decoding produces slight artifacts. 8-bit H264 decoding is broken. Nvidia are looking into the issues, progress can be tracked on the issue thread I made.

State/Future

Currently, I'm working with Dave Airlie on video encoding, which involves getting a usable implementation and drivers ready. The plan is to finish video encoding before merging the entire branch into mainline FFmpeg.
The encoding extension in Vulkan is very low level, but extremely flexible, which is unlike all other APIs that force you onto fixed coding paths and often bad rate control systems.
With good user-level code, even suboptimal hardware implementations could be made competitive with fast software implementations. The current code ignores the driver's rate control modes, and will integrate with Daala/rav1e's RC system.

Due to multiplane surfaces being needed for Vulkan encoding and decoding, Niklas Haas is working on integrating support for them in libplacebo, which would enable post-processing of decoded Vulkan frames in FFmpeg, and enable both mpv and VLC to display the decoded data directly.

In the near future, support for more codecs will hopefully materialize.

  1. When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR.
  2. When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR.
  3. When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR and does NOT set VkVideoCapabilitiesKHR.flags = VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR.
  4. { 0x0, 0x0, 0x1 }, sigh, MPEG-TS's curse never ends.
 ·  vulkan  ·  CC-BY logo

21-06-21

Generics in C11

There's a persistent belief that C generics are all useless, shouldn't be called Generics at all and in general should not have been standardized.
I'm not going to debate whether they should have been a part of the language, nor why they're called generics, but they are nonetheless a part of the language because someone thought they could be useful. So here's one useful application: filling in structs and saving you from copying or specifying things the compiler knows about already:

typedef struct SPRational {
    int64_t num;
    int64_t den;
} SPRational;

enum SPDataType {
    SP_DATA_TYPE_UNKNOWN = 0,
    SP_DATA_TYPE_FLOAT,
    SP_DATA_TYPE_DOUBLE,
    SP_DATA_TYPE_INT,
    SP_DATA_TYPE_UINT,
    SP_DATA_TYPE_U16,
    SP_DATA_TYPE_I16,
    SP_DATA_TYPE_I64,
    SP_DATA_TYPE_U64,
    SP_DATA_TYPE_STRING,
    SP_DATA_TYPE_RATIONAL,
};

#define D_TYPE(x) { #x, (&(x)), _Generic((x),                                  \
    float: SP_DATA_TYPE_FLOAT,                                                 \
    double: SP_DATA_TYPE_DOUBLE,                                               \
    int32_t: SP_DATA_TYPE_INT,                                                 \
    uint32_t: SP_DATA_TYPE_UINT,                                               \
    uint16_t: SP_DATA_TYPE_U16,                                                \
    int16_t: SP_DATA_TYPE_I16,                                                 \
    int64_t: SP_DATA_TYPE_I64,                                                 \
    uint64_t: SP_DATA_TYPE_U64,                                                \
    char *: SP_DATA_TYPE_STRING,                                               \
    const char *: SP_DATA_TYPE_STRING,                                         \
    SPRational: SP_DATA_TYPE_RATIONAL,                                         \
    default: SP_DATA_TYPE_UNKNOWN                                              \
) }

typedef struct SPDataSample {
    const char *name;
    void *location;
    enum SPDataType type;
} SPDataSample;

static int some_function(void)
{
    int some_variable;
    double some_other_variable;
    const char *some_string;
    SPRational some_rational;

    SPDataSample data_rep[] = {
        D_TYPE(some_variable),
        D_TYPE(some_other_variable),
        D_TYPE(some_string),
        D_TYPE(some_rational),
    };
}

The macro simply expands to { "some_variable", &some_variable, SP_DATA_TYPE_INT }. Sure, it's almost no work to write it out explicitly like that. But if you don't care about C99/89 compatibility and somehow the idea of using new language features which only a few understand is appealing, go for it.

Of course, this is not the only application of generics; after all, the tgmath.h header makes extensive use of them, and they can be useful to auto-template some DSP code. But for that case I'd much rather explicitly write it all out.
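
For the record, the auto-templating idea looks something like this - a toy, type-dispatched dot product, purely for illustration:

#include <stddef.h>

static float dot_f(const float *a, const float *b, size_t len)
{
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++)
        sum += a[i] * b[i];
    return sum;
}

static double dot_d(const double *a, const double *b, size_t len)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; i++)
        sum += a[i] * b[i];
    return sum;
}

/* dot(x, y, n) picks the right implementation from the pointer type. */
#define dot(a, b, len) _Generic((a),       \
    float *:        dot_f,                 \
    const float *:  dot_f,                 \
    double *:       dot_d,                 \
    const double *: dot_d                  \
)((a), (b), (len))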

 ·  c  ·  CC-BY logo

20-10-06

Low bitrate white noise?

"Every collection of random bits is white noise.".

"Every collection of random bits is a valid Opus packet.".

But is every collection of random bits a valid Opus packet which decodes to white noise1? Here are two volume-reduced2 16kbps samples.

Encoded white noise (volume set to 5.37%)
Random bytes (volume set to 0.28%)

Clearly not.

  1. Yes, in a VoIP context signalling an Opus packet is silence via the flag will cause the comfort noise generator to produce noise, but it won't be white.
  2. Volume was reduced without transcoding via the opus_metadata FFmpeg BSF.
 ·  opus  ·  CC-BY logo

20-09-30

How not to design... containers

This post is part 1 of the "How not to design..." series:

  1. How not to design... containers

A simple codec container

Do not rely on a magic word. Take a look at the BBC Dirac container syntax:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      'B'      |      'B'      |      'C'      |      'D'      | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Parse code    |  Next parse offset                            | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |  Last parse offset                            | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |  Unit data...                                 | 12-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Each packet starts with a 32-bit sequence BBCD, then a parse code which indicates the type of the unit, then two 32-bit DWORDS which are supposed to be the global offset of the current/last unit within the file, and finally that's followed by the unit data, which may be colorspace info or actual codec packets containing pixels.
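
A minimal parse-info reader is only a handful of lines (a sketch, assuming the offsets are big-endian like the rest of the Dirac bitstream):

#include <stdint.h>
#include <string.h>

typedef struct DiracParseInfo {
    uint8_t  parse_code;
    uint32_t next_parse_offset;
    uint32_t last_parse_offset;
} DiracParseInfo;

static int parse_info(const uint8_t *buf, size_t len, DiracParseInfo *pi)
{
    if (len < 13 || memcmp(buf, "BBCD", 4))
        return -1;

    pi->parse_code        = buf[4];
    pi->next_parse_offset = (uint32_t)buf[5] << 24 | buf[6] << 16 | buf[7] << 8 | buf[8];
    pi->last_parse_offset = (uint32_t)buf[9] << 24 | buf[10] << 16 | buf[11] << 8 | buf[12];

    /* Note there is no size field: the unit's size has to be inferred from
     * next_parse_offset, which is the very first complaint below. */
    return 0;
}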

On the surface it's a simple container and you could write a bad parser in just a few lines and as many minutes. And here's a short list of what's wrong with it:

  • No size field. The only way to know how big a unit is would be to get the difference between the next and current unit's offset.
  • Fast and accurate seeking is not possible in files over 4GB. Both offsets may overflow, so seeking using fseek and a known frame number means you have to parse every single unit.
  • Unsuitable for streaming. Both offsets make no sense for a stream as there's no start nor end known on the client-side.
  • A single-bit error in all but 8 bits of the 104-bit header will break a simple demuxer, while an error in the 8-bit leftover (parse code) will break decoding. A badly written demuxer will go into a loop with a near-0 chance of recovery, while a better one will look at each incoming byte for BBCD.
  • For a stream, there's barely 32 bits of usable entropy to find the next unit. While some sanity checks could be done to the offsets and parse code, a demuxer which accepts arbitrary input can't do them.
  • BBCD is not an uncommon sequence to see in compressed data, or in fact any data with a high amount of entropy. Which means you have a non-negligible chance of parsing a wrong unit when there wasn't meant to be one.

Combined, all those issues make parsing not very robust and not very quick. A reliable parser can only be written for a very limited subset of the container.

A more robust simple codec container

"Let's just put a CRC on the whole data!". Hence, let's take a look at OGG:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      'O'      |      'g'      |      'g'      |      'S'      | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| version       | header_type   | granule_position              | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | bitstream_serial_number       | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | page_sequence_number          | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | CRC_checksum                  | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               | page_segments | segment_table | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ...                                                           | 28-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Here, the 32-bit standard CRC_checksum covers the entire (possibly segmented) packet. When computing it, the field is just taken as 0x0 and replaced later.
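
Checking a page therefore means zeroing those 4 bytes and recomputing; roughly (assuming the spec's CRC-32 parameters: polynomial 0x04c11db7, no reflection, zero initial value and zero final XOR):

#include <stddef.h>
#include <stdint.h>

static uint32_t ogg_crc(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000) ? (crc << 1) ^ 0x04c11db7 : crc << 1;
    }
    return crc;
}

/* Verify one complete page (header + segment data). */
static int ogg_page_crc_ok(uint8_t *page, size_t page_len)
{
    uint8_t saved[4] = { page[22], page[23], page[24], page[25] };
    uint32_t stored = (uint32_t)saved[0]       | (uint32_t)saved[1] << 8 |
                      (uint32_t)saved[2] << 16 | (uint32_t)saved[3] << 24;

    page[22] = page[23] = page[24] = page[25] = 0;
    uint32_t computed = ogg_crc(page, page_len);
    page[22] = saved[0]; page[23] = saved[1]; page[24] = saved[2]; page[25] = saved[3];

    return computed == stored;
}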

  • A single bit error anywhere will desync demuxing. Which means you have to look for the OggS sequence at every byte position in a compressed stream.
  • You could ignore the CRC and let the packet through to the decoder, but there's no guarantee the codec won't output garbage, or fail to decode (or produce valid garbage, since any sequence of random bytes is a valid Opus packet).
  • If you don't ignore the CRC, you'll have to look at every byte position for the magic OggS and then, at every match, perform a CRC over a page which may be as long as 65k. This isn't fast, nor good for battery consumption or memory.
  • The way chaining works in internet radio is to literally append the next file onto the stream, which causes a desync. So you have a mandatory desync every few minutes.
  • Seeking will also result in a desync and require you to search.
  • There's still only 32 bits of entropy in the magic sequence, and worse than Dirac, not many ways to check whether a packet is valid based on the header, since new streams may be created or simply dropped at any point.

There's a ton more I'm not covering here on what Ogg did disastrously wrong. It's a story for another time.

My point is that while magic sequences and CRCs are convenient, they're not robust, especially when done at a per-packet level.

 ·  containers  ·  CC-BY logo

20-08-30

Fast parsing of Interleaved Signed exp-Golomb codes

Eventually, all well-optimized decoder implementations of codecs will hit a bottleneck, which is the speed at which they're able to parse data from the bitstream. Spending the time to optimize this parsing code is usually not worth it unless users are actually starting to hit this and it's preventing them from scaling up. For a complicated filter-heavy codec such as AV1, this becomes a problem quite late, especially as the specifications limit the maximum bitrate to around 20Mbps, and care was taken to allow easy SIMDing of parsing code during the writing of the spec. For simple mezzanine codecs such as ProRes, DNxHD, CFHD, Dirac/VC-2 or even Intra-only VLC H.264, where bitrates are on the order of hundreds of Mbps, optimizing bitstream parsing is usually priority number one.

The context we'll be using while looking at bitstream decoding is that of the VC-2 codec, which is essentially a stripped-down version of Dirac, made by the BBC for allegedly 1 patent-unencumbered near-lossless video transmission over bandwidth-limited connectivity (think 1.5Gbps 420 1080p60 over a 1Gbps connection).

To achieve this, the pixels are first transformed using one of many possible wavelet transforms, then quantized by means of division, and encoded. The wavelet transforms used are simple and easy to SIMD, the quantization is somewhat tricky but still decently fast as it's just a multiply and an add per coefficient. Boring, standard, old, uninspired and uninteresting, the basis of anyone's first look-I-wrote-my-own-codec 2. Which leaves only the quantized coefficient encoding.

The coefficients are encoded using something called Interleaved Signed exp-Golomb codes. We'll go over this word by word, in reverse.

exp-Golomb codes, or exponential-Golomb for long, or just Golomb to those who are lazy and know their codecs, are a form of binary encoding of arbitrarily sized integers, where the length is encoded as a prefix preceding the data. To illustrate:

0 => +1 => 1 => 1 => 1
1 => +1 => 2 => 10 => 010
2 => +1 => 3 => 11 => 011
3 => +1 => 4 => 100 => 00100
4 => +1 => 5 => 101 => 00101
5 => +1 => 6 => 110 => 00110
6 => +1 => 7 => 111 => 00111
7 => +1 => 8 => 1000 => 0001000
...

1 is added to the number so that a 0 can be encoded, since otherwise encoding it would take 0 bits. The prefix is just the number of bits after the most significant set bit of the integer (its total bit length minus one), encoded as a sequence of zeroes. This encoding doesn't have any interesting properties about it and is simple to naïvely decode.

Signed exp-Golomb codes are just the same as above, only an additional bit is appended at the end to signal a sign. By universal convention, 1 encodes a negative number, and 0 encodes a positive number. The bit is not signalled if the number is 0.

Interleaved exp-Golomb codes take the same number of bits to encode as regular exp-Golomb codes, however at first glance they look very different:

0 => +1 => 1 => 1 => 1
1 => +1 => 2 => 10 => 001
2 => +1 => 3 => 11 => 011
3 => +1 => 4 => 100 => 00001
4 => +1 => 5 => 101 => 00011
5 => +1 => 6 => 110 => 01001
6 => +1 => 7 => 111 => 01011
7 => +1 => 8 => 1000 => 0000001
...

As the number of bits hasn't changed, and there are still the same number of zeroes as in normal exp-Golomb codes, the prefix is still there. It's just interleaved, where every odd bit (except the last one) is a 0, while every even bit encodes the integer. The reason why it looks so different is that with this coding scheme, coding the very first non-zero bit is unnecessary, hence it's implicitly set to 1 when decoding.

Interleaved Signed exp-Golomb codes are finally just an interleaved exp-golomb code with an additional bit at the end to signal a sign. That bit is of course not signalled if the number encoded is a 0.
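
To make the interleaving concrete, here's a sketch of the matching encoder (write_bit() is assumed to exist, just like get_bit() in the decoder below):

void write_sie_golomb(int value)
{
    unsigned sign = value < 0;
    unsigned val  = (unsigned)(sign ? -value : value) + 1;
    int      bits = 0;

    while (val >> (bits + 1))
        bits++;                   /* index of the most significant set bit */

    for (int i = bits - 1; i >= 0; i--) {
        write_bit(0);             /* "the number continues" flag */
        write_bit((val >> i) & 1);
    }
    write_bit(1);                 /* terminator */

    if (value)
        write_bit(sign);          /* 1 = negative, 0 = positive */
}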

A more convenient way to think about interleaved exp-golomb codes is that every odd bit is actually a flag that tells you whether the number has ended (a 1) or that the next bit is part of the number (a 0). A simple parser for signed codes would then look like this:

int read_sie_golomb()
{
    uint32_t val = 0x1;

    while (!get_bit()) {
        val <<= 1;
        val |= get_bit();
    }

    val -= 1;

    if (val && get_bit())
        val *= -1;

    return val;
}

Looks simple, and the loop has 3 instructions, so it should be fast, right? Bitstream readers, however, are not exactly simple, do not exactly have easily predicted branches, and are thus not exactly fast.

CPUs like it when the data you're looking at has a size of a power of two, with the smallest unit being a byte. Bytes can encode 256 total possibilities, which isn't exactly large, but if we could process a byte at a time rather than a bit at a time, we could have a potential 8x speedup.

01110010 is a sequence which encodes a -2 and a 1, both signed, and is exactly 8 bits. So if we make a lookup table, we can say that the byte contains a -2 and a 1, directly output the data to some array, and move on cleanly to the next byte.
This is possibility number one, where the byte contains all bits of the numbers present.

01101001 1xxxxxxx is a sequence which encodes a 2, a 0, and a -1. Unlike the previous example, all the numbers are terminated in a single byte, with only the sign bit of the -1 missing. This is hence possibility number two, where the current byte has leftover data that needs exactly 1 bit from the next byte.

01011101 10xxxxxx is a sequence which encodes a -6 and a 2. Its 10 bits in length, so the last 2 bits of the 2 spill over into the next byte. We can output a -6, save the uncompleted bits of the 2, and move over to the next byte where we can combine the unterminated data from the previous byte with the data from the current byte.
However there's more to this. In the previous example, the -6 ended on an odd bit, making an even bit the start of the 2. As we know, the terminating bit of an interleaved exp-Golomb code will always be after an odd number of bits since the start. So we know that whenever the sequence ends, whether it be the next byte or the current byte, the ending bit of the sequence must be at an odd position. In other words, this is possibility number three, where the current byte is missing some data and needs the next, and possibly more bytes to complete, with the data ending at an odd position.
Of course, there's the possibility that the sequence will end on an even bit, such as with 01011110 110xxxxx (-6, 0, 2), making this the final possibility number four.

So, with this we can exactly determine what the current byte contains, and we can know what we need to expect in the next byte. We know what we need to keep as a state (what we expect from the next byte and any unterminated data from this byte), so we can make a stateful parser:

#define POSSIBILITY_ONE_FULLY_TERMINATED 0
#define POSSIBILITY_TWO_SIGN_BIT 1
#define POSSIBILITY_THREE_ODD_BIT_TERMINATE 2
#define POSSIBILITY_FOUR_ODD_BIT_TERMINATE 3

typedef struct Lookup {
    int ready_nb;
    int *ready_out;

    uint64_t incomplete;
    int incomplete_bits;

    int terminate;
    int next_state;
} Lookup;

static const Lookup lookup_table[] = {
    /* 0    - 255 = POSSIBILITY_ONE_FULLY_TERMINATED */
    /* 256  - 511 = POSSIBILITY_TWO_SIGN_BIT */
    /* 512  - 767 = POSSIBILITY_THREE_ODD_BIT_TERMINATE */
    /* 768 - 1023 = POSSIBILITY_FOUR_ODD_BIT_TERMINATE */
};

void read_golomb(int *output, uint8_t *data, int bytes)
{
    int next_state = POSSIBILITY_ONE_FULLY_TERMINATED;
    uint64_t incomplete = 0x0;
    int incomplete_bits = 0;

    for (int i = 0; i < bytes; i++) {
        /* Load state */
        const Lookup *state = &lookup_table[next_state * 256 + data[i]];

        /* Directly output any numbers fully terminated in the current byte */
        if (state->ready_nb) {
            memcpy(output, state->ready_out, state->ready_nb * sizeof(int));
            output += state->ready_nb;
        }

        /* Save incomplete state */
        append_bits(&incomplete, &incomplete_bits, state->incomplete, state->incomplete_bits);

        /* Output if the byte has terminated the sequence */
        if (state->terminate) {
            *output++ = read_sie_golomb(incomplete);
            incomplete = incomplete_bits = 0;
        }

        /* Carry over the state for the next byte */
        next_state = state->next_state;
    }
}

And so, with this pseudocode, we can parse Interleaved Signed exp-Golomb codes at a speed at least a few times faster than a naive implementation. Generating the lookup tables is a simple matter of iterating through all numbers from 0 to 255 for every possibility from the four types and trying to decode the golomb codes in them.

There are more optimizations to do, such as instead of storing bits for the incomplete code, storing a decoded version of them such that decoding is entirely skipped. And, given the state can neatly fit into a 128 bit register, SIMD is also possible, though limited. All of this is outside the scope of this already long article.

exp-Golomb codes, simple to naïvely decode, not that difficult to optimize, have been the go-to for any codec that needs speed and doesn't need full entropy encoding to save a few percent.
Do they still have a use nowadays? Not really. Fast, multisymbol range entropy encoders have been around for more than a decade. They're quick to decode in software, can be SIMD'd if they are adaptive and in general save you enough to make up for the performance loss. And after all, the best way to speed up parsing is to just have less overall data to parse.

Appendix: aren't exp-Golomb codes just Variable Length Codes?

Short answer: yes, but you shouldn't use a Variable Length Code parser to decode them.
Many codecs specify large tables of bit sequences and their lengths where each entry maps to a single number. Unlike an exp-Golomb code, there's no correlation necessary between the bits and the final parsed number, e.g. 0xff can map to 0 just as it can map to 255. Unless a codec specifies a very low maximum number that can be encoded in a valid bitstream with exp-Golomb, using a VLC parser is not feasible, as even after quantization, the encoded numbers in binary sequences will likely exceed 32 bits, and having lookup tables larger than 256Kb evaporates any performance gained.

  1. VC-2 uses wavelets for transforms, which are a well known patent minefield
  2. Replacing a DCT with a Wavelet does however provide potential latency improvements for ASICs and FPGAs, though for worse frequency decomposition ʰᵉˡˡᵒ ᴶᴾᴱᴳ²⁰⁰⁰
 ·  bitstream  ·  CC-BY logo

20-07-26

DASH streaming from the top-down

Whilst many articles and posts exist on how to set up DASH, most assume some sort of underlying infrastructure, many are outdated, and others don't specify enough or are simply vague. This post aims to explain from the top-down how to do DASH streaming, without involving nginx-rtmp or any other antiquated methods.

The webpage

There are plenty of examples on how to use dash.js and/or video.js and videojs-contrib-dash, and you can just copy paste something cargo-culted to quickly get up and running.
But do you really need 3 JS frameworks? As it turns out, you absolutely do not. Practically all of the examples and tutorials use ancient versions of video.js. Modern video.js version 7 needs neither dash.js nor videojs-contrib-dash, since it already comes prepackaged with everything you need to play both DASH and HLS.

<html>
<head>
    <title>Live</title>
</head>
<body>
    <link href="<< PATH TO video-js.min.css >>" rel="stylesheet" />
    <script src="<< PATH TO video.min.js >>"></script>
    <div>
        <video-js id="live-video" width="100%" height="auto" controls
                  poster="<< LINK TO PLAYER BACKGROUND >>" class="vjs-default-skin vjs-16-9"
                  rel="preload" preload="auto" crossorigin="anonymous">
            <p class="vjs-no-js">
                To view this video please enable JavaScript, and/or consider upgrading to a
                web browser that
                <a href="https://videojs.com/html5-video-support/" target="_blank">
                    supports AV1 video and Opus audio
                </a>
            </p>
        </video-js>
    </div>
    <script>
        var player = videojs('live-video', {
            "liveui": true,
            "responsive": true,
        });
        player.ready(function() {
            player.src({ /* Silences a warning about how the mime type is unsupported with DASH */
                src: document.getElementById("stream_url").href,
                type: document.getElementById("stream_url").type,
            });
            player.on("error", function() {
                error = player.error();
                if (error.code == 4) {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "The stream is offline right now, come back later and refresh.";
                } else {
                    document.querySelector(".vjs-modal-dialog-content").textContent =
                        "Error: " + error.message;
                }
            });
            player.on("ended", function() {
                document.querySelector(".vjs-modal-dialog-content").textContent =
                    "The stream is over.";
            });
        })
    </script>
    <a id="stream_url"
       href="<< LINK TO PLAYLIST >>"
       type="application/dash+xml">
        Direct link to stream.
    </a>
</body>
</html>

This example, although simplistic, is fully adequate to render a DASH livestream, with the client adapting to different screen sizes and displays.
Let me explain some details:

  • crossorigin="anonymous" sends anonymous CORS requests such that everything still works if your files and playlists are on a different server.
    NOTE: this does not apply to the DASH UTC timing URL. You'll still need this on your server. It's unclear whether this is a video.js bug or not.
  • width="100%" height="auto" keeps the player size constant to the page width.
  • "liveui": true, enables a new video.js interface that allows for seeking into buffers of livestreams. You can rewind a limited amount (determined by the server and somewhat the client) but its a very valuable ability. Its not currently (as of video.js 7.9.2) enabled by default as it breaks some IE versions and an ancient IOS version, but if you're going to be streaming using modern codecs (you are, right?) they'd be broken anyway.
  • "responsive": true, just makes the UI scale along with the player size.
  • if (error.code == 4) { is an intentional hack. video.js returns the standard HTML5 MediaError code, which unfortunately maps MEDIA_ERR_SRC_NOT_SUPPORTED (value 4) to many errors, including a missing source file - which it would be if the stream isn't running.

It's easy to add statistics by adding this to player.ready(function() { and having a <p id="player_stats"></p> paragraph anywhere on the webpage:

player.on("progress", function() {
    range = player.buffered();
    buffer = range.end(range.length - 1) - range.start(0);
    rate = player.vhs.throughput;
    document.getElementById("player_stats").innerHTML =
        "Buffered:&nbsp;" + Number((buffer).toFixed(1)) + "s&nbsp;&middot;&nbsp;" +
        "Bitrate:&nbsp;" + Number((rate / 100000000.0).toFixed(1)) + "Mbps";
});

The same code used in this example can be found on this page of my website, minus styling.

UTC sync URL

In all cases, we'll need a DASH UTC URL. Add this to your nginx server, under the same server section where your webpage is hosted.

location = /utc_timestamp {
    return 200 "$time_iso8601";
    default_type text/plain;
}

This produces a regular ISO 8601 formatted timestamp. The timezone must be UTC, and for that, your server has to be set to use the UTC timezone.

Serving (using an FFmpeg relay)

If you'd like to simply relay an incoming Matroska, SRT or RTMP stream, just make sure you can access the destination folder via nginx and that you have the correct permissions set. For an example server-side FFmpeg command line, you can use this:

TS_NAME="$(date +%s)/"
TIME_URL="https://example.com/utc_timestamp"
ffmpeg -i $SOURCE\
    -c copy\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    $DESTINATION_FOLDER

This will create a playlist, per-stream init files and segment files in the destination folder. It will also fully manage all the files it creates in the destination folder, such as modifying them and deleting them.
There are quite a lot of options, so going through them:

  • remove_at_exit 1 just deletes all segments and the playlist on exit.
  • seg_duration 2 sets the segment duration in seconds. Must start with a keyframe, so must be a multiple of the keyframe period. Directly correlates with latency.
  • target_latency 2 sets the latency for L-DASH only. Players usually don't respect this. Should match segment duration.
  • frag_type duration specifies that the segments should be further divided into fragments based on duration. Don't use other options unless you know what you're doing.
  • frag_duration 0.1 sets the duration of each fragment in seconds. One fragment every 0.1 seconds is a good number. Should NOT be an irrational number otherwise you'll run into timestamp rounding issues. Hence you should not use frag_type every_frame since all it does is it sets the duration to that of a single frame.
  • window_size 10 sets how many segments to keep in the playlist before removing them from the playlist.
  • extra_window_size 3 sets how many old segments to keep once off the playlist before deleting them, helps bad connections.
  • streaming 1 self explanatory.
  • ldash 1 low-latency DASH. Will write incomplete files to reduce latency.
  • write_prft 1 writes info using the UTC URL. Auto-enabled for L-DASH, but doesn't hurt to enable it here.
  • use_template 1 instead of writing a playlist containing all current segments and updating it on every segment, just writes the playlist once, specifying a range of segments, how long they are, and how long they're around. Very recommended.
  • use_timeline 0 is a playlist option that really disables old-style non-templated playlists.
  • index_correction 1 Tries to fix segment index if the input stream is incompatible, weird or lagging. If anything, serves as a good indicator of whether your input is such, as it'll warn if it corrects an index.
  • fflags +nobuffer+flush_packets disables as much caching in libavformat as possible to keep the latency down. Can save up to a few seconds of latency.
  • format_options "movflags=+cmaf" is required for conformance.
  • adaptation_sets "id=0,streams=0 id=1,streams=1" sets the adaptation sets, e.g. separate streams with different resolution or bitrate which the client can adapt with.
    • id=0 sets the first adaptation set, which should contain only a single type of streams (e.g. video or audio only).
    • frag_type and frag_duration can be set here to override the fragmentation on a per-adaptation stream basis.
    • streams=0 a comma separated list of FFmpeg stream indices to put into the adaptation set.
  • utc_timing_url must be set to the URL which you setup in the previous section.
  • init_seg_name and media_seg_name just setup a nicer segment directory layout.

On your client, set the keyframe period to the segment duration times your framerate:
2 seconds per segment * 60 frames per second = 120 frames for the keyframe period.

For some security on the source (ingest) connection you can try forwarding via the various SSH options, or use the server as a VPN, or if you can SSHFS into the server, don't mind not using Matroska, SRT or RTMP, and are the only person using the server, you can run the same command line on your client to an SSHFS directory.

Serving (using a DASH relay)

The approach above works okay, but what if you want the ultimate low latency, actual security and the ability to use codecs newer than 20 years old (and don't want to experiment with using Matroska as an ingest format)? You can just generate DASH on the upload side itself (how unorthodox) and upload it, without having FFmpeg running on your server. However, it's more complicated.

First, you'll need dash_server.py. It creates a server which proxies the requests from nginx for both uploading and downloading (so you still get caching). It can also be used standalone without nginx for testing, but we're not focusing on this.

Follow the provided example nginx_config in the project's root directory and add

# define connection to dash_server.py
upstream dash_server_py {
    server [::1]:8000;
}

to the base of your nginx website configuration.

Then, create your uploading server:

# this server handles media ingest
# authentication is handled throught TLS client certificates
server {
    # network config
    listen [::]:8001 ssl default_server;
    server_name <ingest server name>;

    # server's TLS cert+key
    ssl_certificate <path to TLS cert>;
    ssl_certificate_key <path to TLS key>;
    #ssl_dhparam <path to DH params, optional>;

    # source authentication with TLS client certificates
    ssl_client_certificate <path to CA for client certs>;
    ssl_verify_client on;

    # only allow upload and delete requests
    if ($request_method !~ ^(POST|PUT|DELETE)$) {
        return 405; # Method Not Allowed
    }

    root <path to site root>;

    # define parameters for communicating with dash_server.py
    # enable chunked transfers
    proxy_http_version        1.1;
    proxy_buffering           off;
    proxy_request_buffering   off;
    # finish the upload even if the client does not bother waiting for our
    # response
    proxy_ignore_client_abort on;

    location /live/ {
        proxy_pass http://dash_server_py;
    }
}

You'll need 2 certificates on the server - one for HTTPS (which you can just let certbot manage) and one for client authentication that you'll need to create yourself.
While it's less practical than a very long URL, it provides actual security. You can use openssl to generate the client certificate.

Then, in the same server where you host your website and UTC time URL, add this section:

location /live/ {
    try_files $uri @dash_server;
}

location @dash_server {
    proxy_pass http://dash_server_py;
}

Now, you can just run python3 dash_server.py -p 8000 as any user on your server and follow on reading to the client-side DASH setup section to sending data to it.

Serving (using a CDN)

Nothing to do, you're all set. You can follow on to the client-side DASH setup section now.

Client-side setup (for a DASH relay or a CDN)

As for the client side, as an example, you can use this command line:

TS_NAME="$(date +%s)/"
TIME_URL="https://example.com/utc_timestamp"
TLS_KEY="<path_to_client_key>"
TLS_CERT="<path_to_client_cert>"
TLS_CA="<path_to_client_ca>"
ffmpeg -i $SOURCE\
    -c:v libx264 -b:v 3M -g:v 120 -keyint_min:v 120\
    -c:a libopus -b:a 128k -frame_duration 100 -application audio -vbr off\
    -f dash -dash_segment_type mp4\
    -remove_at_exit 1\
    -seg_duration 2\
    -target_latency 2\
    -frag_type duration\
    -frag_duration 0.1\
    -window_size 10\
    -extra_window_size 3\
    -streaming 1\
    -ldash 1\
    -write_prft 1\
    -use_template 1\
    -use_timeline 0\
    -index_correction 1\
    -timeout 0.2\
    -ignore_io_errors 1\
    -http_persistent 0\
    -fflags +nobuffer+flush_packets\
    -format_options "movflags=+cmaf"\
    -adaptation_sets "id=0,streams=0 id=1,streams=1"\
    -utc_timing_url "$TIME_URL"\
    -init_seg_name $TS_NAME'init_stream$RepresentationID$.$ext$'\
    -media_seg_name $TS_NAME'seg_stream$RepresentationID$-$Number$.$ext$'\
    -http_opts key_file=$TLS_KEY,cert_file=$TLS_CERT,ca_file=$TLS_CA,tls_verify=1\
    $DESTINATION_URL_PORT_DIRECTORY

All the options are described above in the FFmpeg relay section, but there are a few new ones we need:

  • timeout 0.2 sets a timeout for each upload operation to complete before abandoning it. Helps robustness.
  • ignore_io_errors 1 does not error out if an operation times out. Obviously helps robustness.
  • http_persistent 0 disables persistent HTTP connections due to a dash_server.py bug. Post will be updated if it gets fixed. Set it to 1 if you're using a CDN.
  • http_opts sets up the certificates to use for authentication with the server. Most CDNs use URL 'security' so this option should be omitted there.

The ffmpeg CLI is by no means the only tool to directly output DASH. Any program which can use libavformat - OBS, various media players with recording functionality, even CD rippers and so on - can.
In fact, I'm working on a fully scriptable compositing and streaming program called txproto which accepts the same options as the ffmpeg CLI.

 ·  dash  ·  CC-BY logo

20-06-28

IIR filter SIMD and instruction dependencies

Last year I wrote some Opus emphasis filter SIMD. Let's take a look at the C version for the inverse filter:

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    for (int i = 0; i < len; i++)
        coeff = y[i] = x[i] + coeff * 0.85f;

    return coeff;
}

As you can see, each new output depends on the previous result. This might look like the worst possible code to SIMD, and indeed it's very effective at repelling any casual attempts to do so or to even gauge how much gain you'd get.

But let's proceed anyway.

Since each operation depends on the previous, and there's no way of getting around this fact, your only choice is to duplicate the operations done for each previous output:

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    const float c1 = 0.85, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*coeff +  0*x[2] +  0*x[1] +  0*x[0];
        y[1] = x[1] + c2*coeff +  0*x[2] +  0*x[1] + c1*x[0];
        y[2] = x[2] + c3*coeff +  0*x[2] + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*coeff + c1*x[2] + c2*x[1] + c3*x[0];

        coeff = y[3];
        y += 4;
        x += 4;
    }

    return coeff;
}

Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation on such registers takes the same amount of cycles as if you're working with scalars, the potential 4x performance gains fade away from your expectations as you count the total operations that need to be performed, which, excluding loads and writes, adds up to 4 multiplies and 5 additions. Moreover, each successive output requires a shuffle for the input to match up, and CPUs in general only have a single unit capable of doing that, leading to high latencies.

Whilst we could get away with copying just a single element for the last output, extracting and inserting scalar data in vector registers is so painfully slow that on some occasions it's better to just save via movsd [outq], m0 or similarly load via movsd/movhps xm0, [inq] certain parts of the register and ignore what happens in the rest. Hence we need to use a full-width shuffle.

But let's proceed anyway.1

         ; 0.85..^1    0.85..^2    0.85..^3    0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f

SECTION .text

INIT_XMM fma3
cglobal opus_deemphasis, 3, 3, 8, out, in, len
    ; coeff is already splatted in m0 on UNIX64

    movaps m4, [tab_st]
    VBROADCASTSS m5, m4
    shufps m6, m4, m4, q1111
    shufps m7, m4, m4, q2222

.loop:
    movaps  m1, [inq]                ; x0, x1, x2, x3

    pslldq  m2, m1, 4                ;  0, x0, x1, x2
    pslldq  m3, m1, 8                ;  0,  0, x0, x1

    fmaddps m2, m2, m5, m1           ; x + c1*x[0-2]
    pslldq  m1, 12                   ;  0,  0,  0, x0

    fmaddps m2, m3, m6, m2           ; x + c1*x[0-2] + c2*x[0-1]
    fmaddps m1, m1, m7, m2           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
    fmaddps m0, m0, m4, m1           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff

    movaps [outq], m0
    shufps m0, m0, q3333             ; new coeff

    add inq,  mmsize
    add outq, mmsize
    sub lend, mmsize >> 2
    jg .loop

    RET

We can increase speed and precision by combining the multiply and sum operations in a single fmaddps macro, which x86inc does magic on to output one of the new 3-operand Intel fused multiply-adds, based on operand order. Old AMD 4-operand style FMAs can also be generated, but considering AMD themselves dropped support for that2, it would only serve to waste binary space.

Since all we're doing is shifting data in the 128-bit register, instead of shufps we can use pslldq.
Old CPUs used to have penalties for using instructions of a different type than the data in the vector register, e.g. using mulps marked the resulting register as float, and using pxor on it would incur a slowdown as the CPU had to switch transistors to route the register to a different unit. Newer CPUs don't have that, as their units are capable of both float and integer operations, or the penalty is so small it's immeasurable.

Let's run FFmpeg's make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench to see how slow we are.

benchmarking with native FFmpeg timers
nop: 32.8
checkasm: using random seed 524800271
FMA3:
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1024 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 7344.0
deemphasis_fma3: 1055.3

The units are decicycles, but they are irrelevant as we're only interested in the ratio between the C and FMA3 versions, and in this case that's a 7x speedup. A lot more than the theoretical 4x speedup we could get with 4x32 vector registers.

To explain how, take a look at the unrolled version again: for (int i = 0; i < len; i += 4). We're running the loop 4 times fewer than the original version by doing more per iteration. But we're also not waiting for each previous operation to finish to produce a single output. Instead, we only wait for the last one to complete before running another iteration of the loop - one wait per four outputs rather than one per output.
This delay simply doesn't exist on more commonly SIMD'd functions like a dot product, and our assumption of a 4x maximum speedup is only valid for that case.

To be completely fair, we ought to be comparing the unrolled version to the handwritten assembly version, which reduced the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly). But in the interest of code readability and simplicity, we're keeping the small rolled version. Besides, we can afford to, as we usually end up writing more assembly for other popular platforms like aarch64, and so it's rare that unoptimized functions get used.

Unfortunately the aarch64 version is a lot slower than x86's version, at least on the ODROID C2 I'm able to test on:

benchmarking with Linux Perf Monitoring API
nop: 66.1
checkasm: using random seed 585765880
NEON:
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1022 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 8680.7
deemphasis_neon: 4134.7

The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on ARM requires high privileges, and on devices such as the C2, even closed-source binary blobs.
The speedup on the C2 was barely 2.10x. In general ARM CPUs have a lot less advanced lookahead and speculative execution than x86 CPUs. Some are even still in-order, like the Raspberry Pi 3's CPU. And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup was observed.

In conclusion, traditional analog filter SIMD can be weird and impenetrable on a first or second glance, but in certain cases just trying and doing the dumb thing can yield much greater gains than expected.

  1. Yes, this is ignoring non-UNIX64 platforms, take a look at the FFmpeg source to find out how they're handled and weep at their ABI ineptitude
  2. fun fact: CPUs before Ryzen don't signal the FMA4 flag but will still execute the instructions fine
 ·  x86 neon  ·  CC-BY logo