Lynne's compiled musings

24-07-25

Vulkan Video encoding

The Vulkan encoding patchset for FFmpeg finally has enough features and functionality to be sent for merging into the codebase. A lot of changes took place along the way: the upload/download path was completely rewritten, boosting performance two-fold; the queue family API was extended to support future extensions; preliminary optical flow and shader object support was added; and many fixes and optimizations were made.

Those looking to build and test ahead of the patchset being merged can pull from my repository and run:

./configure --enable-vulkan && make
./ffmpeg_g -init_hw_device vulkan -i <input> -vf format=nv12,hwupload -c:v h264_vulkan -y <output>

As the encoding API is abstracted, existing users of FFmpeg, such as mpv and OBS, can rapidly adopt the Vulkan encoding API, letting us and the entire ecosystem benefit from testing.

 ·  vulkan  ·  CC-BY

23-01-15

VK_MESA_video_decode_av1

With the standardization of the Vulkan decoding extension less than a month ago, two codecs were defined - H264 and H265. While they have cemented their position in multimedia, another, newer codec called AV1 has appeared; indeed, I was involved in its standardization. Not entirely satisfied with the pace of Khronos, nor with VAAPI's lack of synchronization, Dave Airlie and I decided to make our own extension to support AV1 decoding - VK_MESA_video_decode_av1. We were granted an official dedicated stable extension number from Khronos, 510, to avoid incompatibilities.

The extension is done in the same style as the other two decoder extensions, but with differences. Unlike the MPEG codecs, AV1 lacks the overcomplicated multiple-NALU structure: a single sequence header unit is all that's needed to begin decoding frames, and each frame is prefixed by a (quite large) frame header. Thus, the video session parameters were delegated to the only piece of header that may (or may not) be common amongst multiple frames - the sequence header. The frame header is supplied separately, via each frame's VkVideoDecodeInfoKHR.
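
To make the split concrete, here's a rough sketch of how this looks API-side. The *MESA struct and field names are illustrative of the extension's shape rather than copied verbatim from its header; the KHR structures and entry points are the standard ones.

/* One-time: the sequence header is the only "parameter set" AV1 has,
 * so it alone backs the video session parameters object. */
VkVideoDecodeAV1SessionParametersCreateInfoMESA av1_params = {
    .sType              = VK_STRUCTURE_TYPE_VIDEO_DECODE_AV1_SESSION_PARAMETERS_CREATE_INFO_MESA,
    .pStdSequenceHeader = &seq_header, /* parsed AV1 sequence header */
};
VkVideoSessionParametersCreateInfoKHR params_info = {
    .sType        = VK_STRUCTURE_TYPE_VIDEO_SESSION_PARAMETERS_CREATE_INFO_KHR,
    .pNext        = &av1_params,
    .videoSession = session,
};
vkCreateVideoSessionParametersKHR(device, &params_info, NULL, &params);

/* Per-frame: the frame header rides along with each decode operation. */
VkVideoDecodeAV1PictureInfoMESA pic_info = {
    .sType           = VK_STRUCTURE_TYPE_VIDEO_DECODE_AV1_PICTURE_INFO_MESA,
    .pStdFrameHeader = &frame_header, /* parsed frame header */
};
VkVideoDecodeInfoKHR decode_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR,
    .pNext = &pic_info,
    /* srcBuffer, dstPictureResource, reference slots, ... */
};
vkCmdDecodeVideoKHR(cmd_buf, &decode_info);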

AV1 has support for film grain insertion, which creates a challenge for hardware decoders, as the film grain is required to be present in the decode output only, and be missing from the reconstructed output used for references. In my previous article about Vulkan decoding, I mentioned the three possible modes for decoding reference buffers:

  • in-place (output frames are also used as references)
  • out-of-place (output frames are separate from references)
  • out-of-place layered (output frames are separate from references, which are in a single multi-layered image)

The first option cannot be combined with film grain, as the output images have film grain applied. But we still want to use it when the hardware supports it and film grain is not enabled. So, we require that if the user signals that a frame has film grain enabled, the output image view MUST be different from the reference image view. To accomplish that, we use a pool of decoding buffers only for frames with film grain, which requires that at least VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR is set for AV1 decoding. Devices which support in-place decoding also support out-of-place decoding, which makes this method compatible with future hardware. This allows us to handle cases where film grain is switched on for a frame and then switched off, without wasting memory on separate reference buffers. It also allows for external film grain application during presentation, such as with libplacebo.
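
In code, the rule boils down to something like the following sketch, where the pools and the get_view()/frame_has_film_grain() helpers are hypothetical:

VkImageView out_view, dpb_view;

if (frame_has_film_grain(frame_header)) {
    /* Grainy frame: the decode output (with grain) and the clean
     * reconstruction used for references must be distinct images. */
    out_view = get_view(grain_pool);
    dpb_view = get_view(dpb_pool);
} else {
    /* No grain: output and reference may be one and the same image
     * on hardware that supports in-place (coincide) decoding. */
    out_view = dpb_view = get_view(dpb_pool);
}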

Another difference in the way references are handled between AV1 and the MPEG codecs is that a single frame can overwrite multiple reference slots. Rather than naively doing copies, we instead let the higher-level decoder (not the hardware accelerator, which is what the extension is) handle this, by reference-counting the frame. This does require that the hardware supports reusing the same frame in multiple slots, which, to our knowledge, all AV1 hardware accelerators do.
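
A sketch of what the higher-level decoder does; the Frame type and helpers here are made up for illustration (FFmpeg's actual code achieves the same via its buffer reference-counting):

/* AV1 keeps 8 reference slots; refresh_frame_flags marks the ones the
 * newly decoded frame overwrites. */
#define AV1_NUM_REF_FRAMES 8

typedef struct Frame {
    int refcount;
    /* image views, frame metadata, ... */
} Frame;

static Frame *slots[AV1_NUM_REF_FRAMES];

static void assign_slot(int slot, Frame *f)
{
    if (slots[slot] && !--slots[slot]->refcount)
        free_frame(slots[slot]);    /* hypothetical destructor */
    f->refcount++;                  /* no copy - same frame, extra slot */
    slots[slot] = f;
}

static void update_references(Frame *new_frame, uint8_t refresh_frame_flags)
{
    for (int i = 0; i < AV1_NUM_REF_FRAMES; i++)
        if (refresh_frame_flags & (1 << i))
            assign_slot(i, new_frame);
}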

Finally, the biggest issue was with the hardware itself. AMD's hardware decoder expects a unique 8-bit ID to be assigned to each frame. This was no problem for index-based APIs, such as VAAPI, VDPAU, DXVA2, NVDEC and practically all other decoding APIs.

Vulkan, however, is not index based. Each frame does not have an index - instead, it's much lower level, working with bare device addresses. Users are free to alias the address and use it as another frame, which immediately breaks the uniqueness of indices.

To work around this hardware limitation, we had no choice but to create a made-up frame ID. Writing the code was difficult, as it was a huge hack in what was otherwise a straightforward implementation.
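
For illustration, the hack amounts to a tiny allocator along these lines (names are made up):

#include <stdint.h>

/* 32 IDs is comfortably more than any DPB needs; the hardware field is
 * 8 bits wide, so anything up to 256 would do. */
static uint32_t ids_in_use;

static int alloc_frame_id(void)
{
    for (int i = 0; i < 32; i++) {
        if (!(ids_in_use & (1u << i))) {
            ids_in_use |= 1u << i;
            return i;
        }
    }
    return -1; /* more frames in flight than IDs - caller must throttle */
}

static void free_frame_id(int id)
{
    ids_in_use &= ~(1u << id);
}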

AV1's frame header does feature frame IDs; however, those are completely optional, and most encoders skip them (with good reason - the frame header is already needlessly large).

While it's possible for the extension to become official, that requires maintenance and sitting through meetings, which neither of us has the time for. Instead, we hope that this extension becomes a starting point for an official version, with all the discussion points highlighted. The official extension probably wouldn't look very different. It's possible to build it in other ways, but doing so would be inefficient, and probably unimplementable - we were able to fit our extension to the same model as all the other AV1 hardware accelerators available in FFmpeg.

The extension's very likely going to get changes as we receive some feedback from hardware vendors (if we do at all, we'd certainly like to know why AMD designed theirs the way they did).

The code can be found here:

Additionally, you can read Dave's blog post here - https://airlied.blogspot.com/2023/01/vulkan-video-decoding-av1-yes-av1.html.

It was nice seeing all the areas I worked on in AV1 actually get used in practice. As for where this goes, well, saying that we might be getting access to AMD's 7000-series soon would be more than enough :)

Update: while looking through Intel's drivers and documentation for AV1 support, Dave Airlie discovered that the hardware only supports decoding a single tile group per command buffer. This is not compatible with our extension, nor with the way Vulkan video decoding currently operates, on a frame-only basis. Hopefully some solution exists which does not involve extensive driver-side bitstream modifications. Also, in the case of AMD, it may be possible to hide the frame index in the driver-side VkImage structure. However, the hardware seems to expect an ID based on the frame structure, which may make this impossible.

 ·  vulkan  ·  CC-BY

22-12-15

Vulkan Video decoding

All video acceleration APIs come in three flavours:

  • System-specific
    • DXVA(2), DirectX 12 Video, MediaCodec
    • VAAPI, VDPAU, XvMC, XvBA, YAMI, V4L2, OMX
    • etc..
  • Vendor-specific
    • Quick Sync, MFX
    • Avivo, AMF
    • Crystal HD
    • CUVID, NVDEC, NVENC
    • RKMPP
    • Others I'm sure I've forgotten...
  • System AND vendor specific
    • VideoToolbox

All of those APIs come with quirks. Some insist that you can only use up to 31 frames, which is problematic for complex transcoding chains. Some insist that you preallocate all frames during initialization, which is memory-inefficient. Most require your H264 stream to be in Annex-B for no good reason. Some even give you VP9 packets in some unholy unstandardized Annex-B. Some require that you give them raw NALs, others want out-of-band NALs and slices, and some do both.

If you wanted to do processing on hardware video frames, your options were limited. Most of the APIs let you export frames to OpenGL for presentation. Some of the more benevolent APIs let you import frames back from OpenGL or DRM. A few of them also let you do OpenCL processing.

And of course, all of this happened with little to no synchronization. Artifacts like video tearing, blocks whose decoding never quite finished, and missing references are commonplace even nowadays. Most APIs being stateful made compensating for missing references or damaged files difficult.

Finally, attempting to standardize this jungle is Vulkan video. Rather than being a compromise, it is low-level enough to describe most quirks of video acceleration silicon, and, with a single codepath, lets you decode and encode video with relative statelessness.

Implementation-wise, so far there has only been a single example: the vk_video_samples repository. As far as example code goes, I wouldn't recommend it. Moreover, it uses a closed-source parsing library.

I wrote and maintain the Vulkan code in FFmpeg, so it fell on me to integrate video decoding and encoding. At the same time, Dave Airlie started writing a RADV (Mesa's Vulkan code for AMD chips) implementation. With his invaluable help, in a few weeks (minus some months of inactivity), we have a working and debuggable open-source driver implementation, and clean and performant API user code.

Technical aspects

The biggest difference between Vulkan video and other APIs is that you have to manage memory yourself, specifically the reference frame buffer. Vulkan calls it the Decoded Picture Buffer (DPB), which is a rather MPEG-ese term, but fair enough. There are three possible configurations of the DPB:

  • Previous output pictures are usable as references. 1

  • Centralized DPB pool consisting of multiple images. 2

  • Centralized DPB pool consisting of a single image with multiple layers. 3

In the first case, you do not have to allocate any extra memory, but merely keep references to older frames. FFmpeg's hwaccel framework does this already.
Intel's video decoding hardware supports this behavior.

In the second case, for each output image you create, you have to allocate an extra image from a pool with a specific image usage flag. You give the driver the output image, the output's separate reference DPB image, and all previous reference DPB images; the driver then writes to your output frame while simultaneously writing to the DPB reference image for the current frame.
Recent AMD (Navi21+) and most Nvidia hardware support this mode.

In the third case, the situation is identical to the second case, only you have to create a single image upfront, with as many layers as the maximum number of references. Then, when creating a VkImageView, you specify which layer you need based on the DPB slot. This is a problematic mode, as you have to allocate all the references you may need upfront, even if they're never used - which, for 8K HEVC video, is around 3.2 gigabytes of video RAM.
Older AMD hardware requires this.
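
In code, the three cases fall out of the capability flags from footnotes 1-3, queried per video profile. A minimal sketch, assuming 'profile' is a filled-in VkVideoProfileInfoKHR:

VkVideoDecodeCapabilitiesKHR dec_caps = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_CAPABILITIES_KHR,
};
VkVideoCapabilitiesKHR caps = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_CAPABILITIES_KHR,
    .pNext = &dec_caps,
};
vkGetPhysicalDeviceVideoCapabilitiesKHR(phys_dev, &profile, &caps);

if (dec_caps.flags & VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR) {
    /* Case 1: output images double as references, no extra allocation. */
} else if (caps.flags & VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR) {
    /* Case 2: pool of individual DPB images. */
} else {
    /* Case 3: one DPB image with caps.maxDpbSlots layers, one view per layer. */
}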

Another difference with regards to buffer management is that, unlike other APIs, which all managed their own references, with Vulkan you have to know which slot in the DPB each reference belongs to. For H264, this is simply the picture index. For HEVC, after considerable trial and error, we found it to be the index of the frame in the DPB array. 'Slot' is not a standard term in video decoding, but in lieu of anything better, it's appropriate.
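
Each reference is then described to the driver together with its slot index. A rough sketch, where width/height, dpb_view and slot_index are assumed to come from the decoder's state:

VkVideoPictureResourceInfoKHR ref_res = {
    .sType            = VK_STRUCTURE_TYPE_VIDEO_PICTURE_RESOURCE_INFO_KHR,
    .codedExtent      = { width, height },
    .baseArrayLayer   = 0,           /* nonzero only in the layered DPB case */
    .imageViewBinding = dpb_view,
};
VkVideoReferenceSlotInfoKHR ref_slot = {
    .sType            = VK_STRUCTURE_TYPE_VIDEO_REFERENCE_SLOT_INFO_KHR,
    .slotIndex        = slot_index,  /* pic index (H264) / DPB array index (HEVC) */
    .pPictureResource = &ref_res,
};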

Apart from this, the rest is mostly standard. As with NVDEC, VDPAU and DXVA, slice decoding is, sadly, not supported, which means you have to concatenate the data for each slice into a buffer, with start codes 4, then upload the data to a VkBuffer to decode from. This is somewhat of an issue with very high bitrate video, but at least Vulkan lets you have sparse and host-mapped buffers to work around this.
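
The concatenation itself is trivial, just annoying. A sketch, with a made-up Slice type, assuming 'buf' maps a VkBuffer large enough for all slices plus start codes:

#include <stdint.h>
#include <string.h>

typedef struct Slice {
    const uint8_t *data;
    size_t         size;
} Slice;

static size_t pack_slices(uint8_t *buf, const Slice *slices, int nb_slices)
{
    static const uint8_t start_code[3] = { 0x00, 0x00, 0x01 };
    size_t off = 0;

    for (int i = 0; i < nb_slices; i++) {
        memcpy(buf + off, start_code, sizeof(start_code));
        off += sizeof(start_code);
        memcpy(buf + off, slices[i].data, slices[i].size);
        off += slices[i].size;
    }
    return off; /* becomes srcBufferRange in VkVideoDecodeInfoKHR */
}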

Unlike other decoding APIs, which let you set only a few SPS and PPS (and, in HEVC, VPS) fields, you have to parse and set practically every single field from those bitstream elements. For HEVC alone, the total maximum possible struct size for all fields is 114 megabytes, which means you really ought to pool the structure memory and expand it only when necessary: although it's unlikely that you will get a stream using all possible values, anyone can craft one and either corrupt your output or crash your decoder.
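
One way to do the pooling, sketched here with FFmpeg's own av_fast_realloc(), which only ever grows the buffer:

#include <libavutil/mem.h>

typedef struct ParamPool {
    void        *buf;
    unsigned int size; /* current allocated size */
} ParamPool;

/* Returns a buffer of at least 'needed' bytes, reusing or growing the
 * pool's allocation; the worst case stays off the common path. */
static void *get_param_buf(ParamPool *pool, size_t needed)
{
    void *buf = av_fast_realloc(pool->buf, &pool->size, needed);
    if (!buf)
        return NULL;
    pool->buf = buf;
    return buf;
}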

Vulkan video requires that multiplane YUV images are used. Multiplane images are rather limiting, as they're not well-supported, and if you'd like to use them for processing, you have to use DISJOINT images with an EXTENDED creation flag (to be able to create VkImageViews with STORAGE usage flags), which are even less supported and quirkier. Originally, the FFmpeg Vulkan code relied entirely on emulating multiplane images by using separate images per plane. To work Vulkan video into this, I initially wrote some complicated ALIASing code to alias the memory of the separate VkImages to the multiplane VkImage necessary for decoding. This eventually got messy enough to make me give up on the idea and port the entire codebase to first-class multiplane support. Some foreknowledge of the drafting process would have helped but, lacking that, as well as any involvement in the standardization, refactoring was the only option.
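
For reference, here's roughly what that first-class multiplane creation looks like for NV12. Note that, besides DISJOINT and EXTENDED usage, single-plane views of a different format also need the MUTABLE_FORMAT creation flag; 'device', 'width' and 'height' are assumed:

VkImage image;
VkImageView luma_view;

VkImageCreateInfo img_info = {
    .sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .flags       = VK_IMAGE_CREATE_DISJOINT_BIT |        /* bind plane memory separately */
                   VK_IMAGE_CREATE_EXTENDED_USAGE_BIT |  /* per-view usage flags         */
                   VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT,   /* single-plane format views    */
    .imageType   = VK_IMAGE_TYPE_2D,
    .format      = VK_FORMAT_G8_B8R8_2PLANE_420_UNORM,   /* NV12 */
    .extent      = { width, height, 1 },
    .mipLevels   = 1,
    .arrayLayers = 1,
    .samples     = VK_SAMPLE_COUNT_1_BIT,
    .tiling      = VK_IMAGE_TILING_OPTIMAL,
    .usage       = VK_IMAGE_USAGE_VIDEO_DECODE_DST_BIT_KHR |
                   VK_IMAGE_USAGE_SAMPLED_BIT |
                   VK_IMAGE_USAGE_STORAGE_BIT, /* valid per-plane via EXTENDED_USAGE */
};
vkCreateImage(device, &img_info, NULL, &image);

/* A storage-capable view of plane 0 (luma), seen as plain R8. */
VkImageViewCreateInfo view_info = {
    .sType    = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
    .image    = image,
    .viewType = VK_IMAGE_VIEW_TYPE_2D,
    .format   = VK_FORMAT_R8_UNORM,
    .subresourceRange = {
        .aspectMask = VK_IMAGE_ASPECT_PLANE_0_BIT,
        .levelCount = 1,
        .layerCount = 1,
    },
};
vkCreateImageView(device, &view_info, NULL, &luma_view);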

Code

As of 2022-12-19, the code has not yet been merged into mainline FFmpeg. My branch can be found here. There is still more refactoring necessary to make multiplane images first-class, which would be good enough to merge, but for now, it's able to decode both H264 and HEVC video streams in 8-bit and 10-bit form.

To compile, clone and checkout the vulkan_decode branch:

git clone -b vulkan_decode https://github.com/cyanreg/FFmpeg

To configure, use this line:

./configure --disable-doc --disable-shared --enable-static --disable-ffplay --disable-ffprobe --enable-vulkan

Then run make -j$(nproc) to compile.

To run:

./ffmpeg_g -init_hw_device "vulkan=vk:0,debug=1" -hwaccel vulkan -hwaccel_output_format vulkan -i <INPUT_FILE> -loglevel debug -filter_hw_device vk -vf hwdownload,format=nv12 -c:v rawvideo -an -y OUT.nut

The validation layers are turned on via the debug=1 option.
To decode 10-bit content, you must replace format=nv12 with format=p010,format=yuv420p10.
To use a different Vulkan device, replace vulkan=vk:0 with vulkan=vk:<N>, where <N> is the device index you'd like to use.

This will produce an OUT.nut file containing the uncompressed decoded data. You can play it using ffplay, mpv or VLC. Alternatively, there are many resources on how to use the FFmpeg CLI to output whatever format you'd like.

Non-subsampled 444 decoding is possible, provided drivers enable support for it.

Driver support

Currently, as of 2022-12-19, there are three drivers supporting Vulkan Video.

  • RADV
  • ANV
  • Nvidia Vulkan Beta drivers

For RADV, Dave Airlie's radv-vulkan-video-prelim-decode branch is necessary.
RADV has full support for Vulkan decoding - 8-bit H264, 8-bit and 10-bit HEVC. The output is spec-compliant.
For installation instructions, check out Dave's blog.

For ANV, his anv-vulkan-video-prelim-decode branch is needed instead.
ANV has partial support for H264 - certain streams may cause crashes. For installation instructions, check out Dave's blog.

For Nvidia, the Vulkan Beta drivers are necessary. Only Linux has been tested.
The drivers produce compliant 8-bit HEVC decoding output with my code. 10-bit HEVC decoding produces slight artifacts, and 8-bit H264 decoding is broken. Nvidia are looking into the issues; progress can be tracked on the issue thread I made.

State/Future

Currently, I'm working with Dave Airlie on video encoding, which involves getting a usable implementation and drivers ready. The plan is to finish video encoding before merging the entire branch into mainline FFmpeg.
The encoding extension in Vulkan is very low-level, but extremely flexible, unlike all other APIs, which force you onto fixed coding paths and often bad rate-control systems.
With good user-level code, even suboptimal hardware implementations could be made competitive with fast software implementations. The current code ignores the driver's rate control modes, and will instead integrate with Daala/rav1e's RC system.

Due to multiplane surfaces being needed for Vulkan encoding and decoding, Niklas Haas is working on integrating support for them in libplacebo, which would enable post-processing of decoded Vulkan frames in FFmpeg, and enable both mpv and VLC to display the decoded data directly.

In the near future, support for more codecs will hopefully materialize.

  1. When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR.
  2. When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR.
  3. When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR and does NOT set VkVideoCapabilitiesKHR.flags = VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR.
  4. { 0x0, 0x0, 0x1 }, sigh, MPEG-TS's curse never ends.
 ·  vulkan  ·  CC-BY