Vulkan Video decoding

All video acceleration APIs came in three flavours.

System-specific
- DXVA(2), DirectX 12 Video, MediaCodec
- VAAPI, VDPAU, XvMC, XvBA, YAMI, V4L2, OMX
- etc.
Vendor-specific
- Quick Sync, MFX
- Avivo, AMF
- Crystal HD
- CUVID, NVDEC, NVENC
- RKMPP
- Others I'm sure I've forgotten...
System AND vendor specific
- Videotoolbox

All of those APIs come with quirks. Some insist that you can only use up to 31 frames, which is problematic for complex transcoding chains. Some insist that you preallocate all frames during initialization, which is memory-inefficient. Most require your H264 stream to be in Annex-B for no good reason. Some even give you VP9 packets in some unholy unstandardized Annex-B. Some require that you give them raw NALs, others want out-of-band NALs and slices, and some do both.

If you wanted to do processing on hardware video frames, your options were limited. Most of the APIs let you export frames to OpenGL for presentation. Some of the more benevolent APIs let you import frames back from OpenGL or DRM. A few of them also let you do OpenCL processing.

And of course, all of this happened with little to no synchronization. Artifacts like video tearing, block decoding not quite being finished, and missing references are commonplace even nowadays. Most APIs being stateful made compensating for missing references or damaged files difficult.

Finally, attempting to standardize this jungle is Vulkan video. Rather than a compromise, it is low-level enough to describe most quirks of video acceleration silicon, and with a single codepath, lets you decode and encode video with relative statelessness.

Implementation-wise, so far, there has only been a single example, the vk_video_samples repository. As far as example code goes, I wouldn't recommend it. Moreover, it uses a closed source parsing library.

I wrote and maintain the Vulkan code in FFmpeg, so it fell on me to integrate video decoding and encoding. At the same time, Dave Airlie started writing a RADV (Mesa's Vulkan code for AMD chips) implementation. With his invaluable help, in a few weeks, minus some months of inactivity, we have a working and debuggable open-source driver implementation, and clean and performant API user code.

Technical aspects

The biggest difference between Vulkan video and other APIs is that you have to manage memory yourself, specifically the reference frame buffer. Vulkan calls it the Decoded Picture Buffer (DPB), which is a rather MPEG-ese term, but fair enough. There are three possible configurations of the DPB:

Previous output pictures are usable as references. ¹
Centralized DPB pool consisting of multiple images. ²
Centralized DPB pool consisting of a single image with multiple layers. ³

In the first case, you do not have to allocate any extra memory, but merely keep references of older frames. FFmpeg's hwaccel framework does this already.
Intel's video decoding hardware supports this behavior.

In the second case, for each output image you create, you have to allocate an extra image from a pool with a specific image usage flag. You give the driver the output, the output's separate reference DPB image, and all previous reference DPB images, and the driver then writes to your output frame, while simultaneously also writing to the DPB reference image for the current frame.
Recent AMD (Navi21+) and most Nvidia hardware support this mode.

In the third case, the situation is identical to the second case, only that you have to create a single image upfront with as many layers as there are maximum references. Then, when creating a VkImageView, you specify which layer you need based on the DPB slot. This is a problematic mode, as you have to allocate all references you need upfront, even if they're never used. Which, for 8k HEVC video, is around 3.2 gigabytes of Video RAM.
Older AMD hardware requires this.
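
To make the three cases concrete, here's a minimal sketch of how an application might pick between them from the capability flags named in the footnotes. The macro names, the enum and pick_dpb_mode() are made up for illustration, and the numeric bit values are placeholders so that it compiles without the Vulkan headers; real code should include <vulkan/vulkan.h> and test the VK_* enumerants directly. Preferring the first case when a driver advertises more than one is my assumption here, on the grounds that it needs no extra memory.

```c
#include <stdint.h>

/* Placeholder stand-ins for the Vulkan flag bits, NOT taken from the headers:
 *   VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR
 *   VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR
 *   VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR */
#define DEC_CAP_DPB_AND_OUTPUT_COINCIDE   0x1u
#define DEC_CAP_DPB_AND_OUTPUT_DISTINCT   0x2u
#define VID_CAP_SEPARATE_REFERENCE_IMAGES 0x2u

enum dpb_mode {
    DPB_OUTPUT_IS_REFERENCE,  /* case 1: previous outputs double as references */
    DPB_POOL_OF_IMAGES,       /* case 2: one extra DPB image per output        */
    DPB_SINGLE_LAYERED_IMAGE, /* case 3: one image, one layer per DPB slot     */
    DPB_UNSUPPORTED,
};

/* decode_caps stands for VkVideoDecodeCapabilitiesKHR.flags,
 * video_caps for VkVideoCapabilitiesKHR.flags. */
static enum dpb_mode pick_dpb_mode(uint32_t decode_caps, uint32_t video_caps)
{
    if (decode_caps & DEC_CAP_DPB_AND_OUTPUT_COINCIDE)
        return DPB_OUTPUT_IS_REFERENCE;

    if (decode_caps & DEC_CAP_DPB_AND_OUTPUT_DISTINCT)
        return (video_caps & VID_CAP_SEPARATE_REFERENCE_IMAGES)
                   ? DPB_POOL_OF_IMAGES
                   : DPB_SINGLE_LAYERED_IMAGE;

    return DPB_UNSUPPORTED;
}
```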

Another difference with regards to buffer management is that unlike other APIs which all managed their own references, with Vulkan, you have to know which slot in the DPB each reference belongs to. For H264, this is simply the picture index. For HEVC, after considerable trial and error, we found it to be the index of the frame in the DPB array. 'slot' is not a standard term in video decoding, but in lieu of anything better, it's appropriate.
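
For anyone who hasn't had to track slots before, the bookkeeping amounts to a small table mapping frames to slot numbers. The sketch below is a generic first-free-slot allocator, not what the FFmpeg code does (there the slot comes straight from the codec's own indices, as described above); struct dpb_slots, the function names and the limit of 17 slots are all assumptions made for the example.

```c
#include <stddef.h>

/* Assumption for this sketch: 16 references plus the picture being decoded. */
#define MAX_DPB_SLOTS 17

struct dpb_slots {
    const void *owner[MAX_DPB_SLOTS]; /* frame held in each slot, NULL if free */
};

/* Returns the slot already held by `frame`, otherwise claims the first free
 * slot for it. Returns -1 if `frame` is NULL or every slot is taken. */
static int dpb_slot_acquire(struct dpb_slots *s, const void *frame)
{
    int free_slot = -1;

    if (!frame)
        return -1;

    for (int i = 0; i < MAX_DPB_SLOTS; i++) {
        if (s->owner[i] == frame)
            return i;
        if (!s->owner[i] && free_slot < 0)
            free_slot = i;
    }

    if (free_slot >= 0)
        s->owner[free_slot] = frame;
    return free_slot;
}

/* Frees the slot held by `frame`, if any, once it is no longer a reference. */
static void dpb_slot_release(struct dpb_slots *s, const void *frame)
{
    for (int i = 0; i < MAX_DPB_SLOTS; i++)
        if (frame && s->owner[i] == frame)
            s->owner[i] = NULL;
}
```

Whatever scheme is used, the slot number handed to the driver for a reference has to be the same one that picture was decoded into.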

Apart from this, the rest is mostly standard. As with NVDEC, VDPAU and DXVA, slice decoding is, sadly, not supported, which means you have to concatenate the data for each slice in a buffer, with start codes ⁴, then upload the data to a VkBuffer to decode from. This is somewhat of an issue with very high bitrate video, but at least Vulkan lets you have sparse and host buffers to work around this.
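
The concatenation step is simple enough to sketch. The code below works on a plain byte pointer, which stands in for wherever the VkBuffer's memory ends up being written from; struct slice, packed_size() and pack_slices() are hypothetical names, and recording the offset of each slice is an extra convenience of this sketch rather than something described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct slice {
    const uint8_t *data; /* slice payload, without a start code */
    size_t         size;
};

static const uint8_t start_code[3] = { 0x0, 0x0, 0x1 };

/* Total bytes needed to hold every slice, each prefixed by a start code. */
static size_t packed_size(const struct slice *slices, size_t nb_slices)
{
    size_t total = 0;
    for (size_t i = 0; i < nb_slices; i++)
        total += sizeof(start_code) + slices[i].size;
    return total;
}

/* Writes start code + payload for each slice into dst, storing in offsets[i]
 * where slice i's start code begins. Returns the number of bytes written,
 * or 0 if dst_size is too small (nothing is written in that case). */
static size_t pack_slices(uint8_t *dst, size_t dst_size,
                          const struct slice *slices, size_t nb_slices,
                          uint32_t *offsets)
{
    size_t pos = 0;

    if (packed_size(slices, nb_slices) > dst_size)
        return 0;

    for (size_t i = 0; i < nb_slices; i++) {
        offsets[i] = (uint32_t)pos;
        memcpy(dst + pos, start_code, sizeof(start_code));
        pos += sizeof(start_code);
        memcpy(dst + pos, slices[i].data, slices[i].size);
        pos += slices[i].size;
    }
    return pos;
}
```

Computing the size first means the destination can be sized once per picture instead of being regrown slice by slice.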

Unlike other decoding APIs, which let you only set a few SPS, PPS (and VPS in HEVC) fields, you have to parse and set practically every single field from those bitstream elements. For HEVC alone, the total maximum possible struct size for all fields is 114 megabytes, which means you really ought to pool the structure memory and expand it when necessary, because although it's unlikely that you will get a stream using all possible values, anyone can craft one and either corrupt your output or crash your decoder.
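
A minimal sketch of the "pool and expand when necessary" idea, assuming a single scratch allocation that is reused from one parameter set to the next and only reallocated when a stream asks for more than has been seen so far. struct param_pool and both functions are hypothetical names, not anything from FFmpeg.

```c
#include <stddef.h>
#include <stdlib.h>

struct param_pool {
    void  *mem;      /* reused between parameter sets */
    size_t capacity; /* bytes currently allocated     */
};

/* Returns a buffer of at least `needed` bytes. The allocation only ever
 * grows, so a typical stream pays for what it uses, not the worst case.
 * Returns NULL if growing fails; the old buffer stays valid in that case. */
static void *param_pool_get(struct param_pool *p, size_t needed)
{
    if (needed > p->capacity) {
        void *grown = realloc(p->mem, needed);
        if (!grown)
            return NULL;
        p->mem      = grown;
        p->capacity = needed;
    }
    return p->mem;
}

static void param_pool_free(struct param_pool *p)
{
    free(p->mem);
    p->mem      = NULL;
    p->capacity = 0;
}
```

The size passed in should come from the counts actually parsed out of the bitstream, after checking them against the limits the codec allows, so that a crafted stream gets rejected instead of overrunning a fixed-size array.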

Vulkan video requires that multiplane YUV images are used. Multiplane images are rather limiting, as they're not well-supported, and if you'd like to use them to do processing, you have to use DISJOINT images with an EXTENDED creation flag (to be able to create VkImageViews with STORAGE usage flags), which are even less supported and quirky. Originally, the FFmpeg Vulkan code relied entirely on emulating multiplane images by using separate images per-plane. To work Vulkan video into this, I initially wrote some complicated ALIASing code to alias the memory from the separate VkImages to the multiplane VkImage necessary for decoding. This eventually got messy enough to make me give up on the idea, and port the entire code to allow for first-class multiplane support. What would've helped would've been some foreknowledge of the drafting process, but lacking this, as well as any involvement in the standardization, refactoring is necessary.

Code

As of 2022-12-19, the code has not yet been merged into mainline FFmpeg. My branch can be found here. There is still more refactoring necessary to make multiplane images first-class, which would be good enough to merge, but for now, it's able to decode both H264 and HEVC video streams in 8-bit and 10-bit form.

To compile, clone and checkout the vulkan_decode branch:

git clone -b vulkan_decode https://github.com/cyanreg/FFmpeg

To configure, use this line:

./configure --disable-doc --disable-shared --enable-static --disable-ffplay --disable-ffprobe --enable-vulkan

Then type make -j$(nproc) to compile.

To run,

./ffmpeg_g -init_hw_device "vulkan=vk:0,debug=1" -hwaccel vulkan -hwaccel_output_format vulkan -i <INPUT_FILE> -loglevel debug -filter_hw_device vk -vf hwdownload,format=nv12 -c:v rawvideo -an -y OUT.nut

The validation layers are turned on via the debug=1 option.
To decode 10-bit content, you must replace format=nv12 with format=p010,format=yuv420p10.
To use a different Vulkan device, replace vulkan=vk:0 with vulkan=vk:<N>, where <N> is the device index you'd like to use.

This will produce an OUT.nut file containing the uncompressed decoded data. You can play this using ffplay, mpv or VLC. Alternatively, there are many resources on how to use the FFmpeg CLI and output whatever format you'd like.

Non-subsampled 444 decoding is possible, provided drivers enable support for it.

Driver support

Currently, as of 2022-12-19, there are 3 drivers supporting Vulkan Video.

RADV
ANV
Nvidia Vulkan Beta drivers

For RADV, Dave Airlie's radv-vulkan-video-prelim-decode branch is necessary.
RADV has full support for Vulkan decoding - 8-bit H264, 8-bit and 10-bit HEVC. The output is spec-compliant.
For installation instructions, check out Dave's blog.

For ANV, his anv-vulkan-video-prelim-decode branch is needed instead.
ANV has partial support for H264 - certain streams may cause crashes. For installation instructions, check out Dave's blog.

For Nvidia, the Vulkan Beta drivers are necessary. Only Linux has been tested.
The drivers produce compliant 8-bit HEVC decoding output with my code. 10-bit HEVC decoding produces slight artifacts. 8-bit H264 decoding is broken. Nvidia are looking into the issues; progress can be tracked on the issue thread I made.

State/Future

Currently, I'm working with Dave Airlie on video encoding, which involves getting a usable implementation and drivers ready. The plan is to finish video encoding before merging the entire branch into mainline FFmpeg.
The encoding extension in Vulkan is very low level, but extremely flexible, which is unlike all other APIs that force you onto fixed coding paths and often bad rate control systems.
With good user-level code, even suboptimal hardware implementations could be made competitive with fast software implementations. The current code ignores the driver's rate control modes, and will integrate with Daala/rav1e's RC system.

Due to multiplane surfaces being needed for Vulkan encoding and decoding, Niklas Haas is working on integrating support for them in libplacebo, which would enable post-processing of decoded Vulkan frames in FFmpeg, and enable both mpv and VLC to display the decoded data directly.

In the near future, support for more codecs will hopefully materialize.

¹ When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR.
² When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR.
³ When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR and does NOT set VkVideoCapabilitiesKHR.flags = VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR.
⁴ { 0x0, 0x0, 0x1 }, sigh, MPEG-TS's curse never ends.


Lynne's compiled musings

22-12-15