With the standardization of the Vulkan decoding extension less than a month ago, two codecs were
defined - H264 and H265. While they have cemented their position in multimedia, another, newer
codec, AV1, has appeared. Indeed, I was involved with its standardization. Not entirely satisfied
with the pace of Khronos, nor with VAAPI's lack of synchronization, Dave Airlie and I decided to
make our own extension to support AV1 decoding - VK_MESA_video_decode_av1.
We were granted an official dedicated stable extension number from Khronos, 510, to avoid incompatibilities.
The extension follows the same style as the other two decoder extensions, but with a few differences.
Unlike the MPEG codecs, AV1 lacks their overcomplicated multiple-NALU structure. A single sequence
header unit is all that's needed to begin decoding frames, and each frame is prefixed by a (quite large)
frame header. Thus, the video session parameters were delegated to the only piece of header that
may (or may not) be common amongst multiple frames - the sequence header. The frame header is supplied
separately, via each frame's VkVideoDecodeInfoKHR.
AV1 has support for film grain insertion, which creates a challenge for hardware decoders,
as the film grain is required to be put on the decode output only, and be missing from the
reconstructed output used for references.
In my previous article about Vulkan decoding, I mentioned the three possible modes for
decoding reference buffers:
- in-place (output frames are also used as references)
- out-of-place (output frames are separate from references)
- out-of-place layered (output frames are separate from references, which are in a single multi-layered image)
The first option cannot be combined with film grain, as the output images have film grain applied. But we
still want to use it if the hardware supports it and no film grain is enabled. So, we require
that if the user signals that a frame has film grain enabled, the output image view MUST be different
from the reference image view. To accomplish that, we use a pool of decoding buffers only for frames with
film grain, which requires that at least VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR
is set for AV1 decoding.
Devices which support in-place decoding also support out-of-place decoding, which makes this method compatible
with future hardware.
This allows us to handle cases where film grain is switched on for a frame, and then switched off,
without wasting memory on separate reference buffers. This also allows for external film grain application
during presentation, such as with libplacebo.
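As a rough illustration of the decision above (the function and parameter names here are made up for this sketch, not part of the extension), the choice between reusing the reconstructed DPB image as the output and pulling a separate image from the film-grain pool boils down to:

```c
#include <stdbool.h>

/* Hypothetical helper: decide whether a frame must be decoded into a
 * separate output image (from a film-grain-only pool), rather than
 * reusing its reconstructed/DPB image as the output. */
static bool needs_distinct_output(bool film_grain_enabled,
                                  bool hw_supports_in_place)
{
    /* With film grain, the output (grain applied) must differ from the
     * reconstructed reference (no grain), so in-place is never valid. */
    if (film_grain_enabled)
        return true;
    /* Otherwise, decode in-place if the hardware allows it. */
    return !hw_supports_in_place;
}
```

This is why only the DPB_AND_OUTPUT_DISTINCT capability is strictly required: the distinct path must always be available as a fallback, while in-place remains an opportunistic optimization.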
Another difference in how references are handled between AV1 and the MPEG codecs is that a single
frame can overwrite multiple reference slots. Rather than naively doing copies, we instead let the higher-level
decoder (not the hardware accelerator, which is what the extension is) handle this, by reference-counting the
frame. This does require that the hardware supports reusing the same frame in multiple slots, which, to our
knowledge, all AV1 hardware accelerators do.
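A minimal sketch of what this reference counting might look like in a higher-level decoder (types and names are invented for illustration; AV1's refresh_frame_flags is a real bitstream field, with one bit per each of the eight reference slots):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted decoded frame, as a higher-level decoder
 * (not the hardware accelerator) might track it. */
typedef struct Frame {
    int refcount;
} Frame;

#define NUM_REF_SLOTS 8 /* AV1 defines 8 reference slots */

static Frame *slots[NUM_REF_SLOTS];

static void frame_ref(Frame *f)   { f->refcount++; }
static void frame_unref(Frame *f) { assert(f->refcount > 0); f->refcount--; }
/* (Actually freeing the frame when the count hits zero is omitted.) */

/* A single frame may overwrite several slots at once, as signalled by
 * refresh_frame_flags; instead of copying, just bump the refcount and
 * point each refreshed slot at the same frame. */
static void update_slots(Frame *f, unsigned refresh_frame_flags)
{
    for (int i = 0; i < NUM_REF_SLOTS; i++) {
        if (refresh_frame_flags & (1u << i)) {
            if (slots[i])
                frame_unref(slots[i]);
            frame_ref(f);
            slots[i] = f;
        }
    }
}
```

The hardware-facing code then simply passes whichever frame each slot points at, which is where the requirement to reuse the same frame in multiple slots comes from.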
Finally, the biggest issue was with the hardware itself. AMD's hardware decoder expects a unique 8-bit
ID to be assigned for each frame. This was no problem for index-based APIs, such as VAAPI, VDPAU, DXVA2,
NVDEC and practically all other decoding APIs.
Vulkan, however, is not index based. Each frame does not have an index - instead, it's much lower level,
working with bare device addresses. Users are free to alias the address and use it as another frame,
which immediately breaks the uniqueness of indices.
To work around this hardware limitation, we had no choice but to create a made-up frame ID. Writing
the code was difficult, as it was a huge hack in what was otherwise a straightforward implementation.
AV1's frame header does feature frame IDs; however, those are completely optional,
and most encoders skip them (with good reason - the frame header is already needlessly large).
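The made-up ID itself is conceptually simple; it's tying its lifetime to Vulkan images, which users may freely alias, that makes it a hack. As a sketch (all names here are hypothetical, not from the extension or driver), an allocator for unique 8-bit IDs could be as little as:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical allocator for made-up 8-bit frame IDs of the kind AMD's
 * decoder expects: 256 possible IDs, tracked in a use map. */
static uint8_t id_used[256];

/* Returns a free ID in [0, 255], or -1 if all are in use. */
static int frame_id_alloc(void)
{
    for (int i = 0; i < 256; i++) {
        if (!id_used[i]) {
            id_used[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Must be called once the frame is guaranteed unused as a reference. */
static void frame_id_free(int id)
{
    assert(id >= 0 && id < 256 && id_used[id]);
    id_used[id] = 0;
}
```

The hard part is deciding when frame_id_free may be called, since Vulkan gives the driver no notion of a frame's lifetime - only bare memory.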
While it's possible for the extension to become official, that requires maintenance and sitting through
meetings, which neither of us has the time for. Instead, we hope that this extension becomes a starting
point for an official version, with all the discussion points highlighted.
The official extension probably wouldn't look very different. It's possible to build it in other ways,
but doing so would be inefficient, and probably unimplementable - we were able to fit our extension
into the same model as all other AV1 hardware accelerators available in FFmpeg.
The extension is very likely to see changes as we receive feedback from hardware vendors (if
we get any at all - we'd certainly like to know why AMD designed theirs the way they did).
The code can be found here:
Additionally, you can read Dave's blog post here - https://airlied.blogspot.com/2023/01/vulkan-video-decoding-av1-yes-av1.html.
It was nice seeing all the areas I worked on in AV1 actually get used in practice. As for where this goes,
well, saying that we might be getting access to AMD's 7000-series soon would be more than enough :)
Update: while looking through Intel's drivers and documentation for AV1 support, Dave Airlie discovered that the hardware only supports decoding of a single tile group per command buffer. This is not compatible with our extension, nor with the way Vulkan video decoding currently operates, on a frame-only basis. Hopefully some solution exists which does not involve extensive driver-side bitstream modifications.
Also, in the case of AMD, it may be possible to hide the frame index in the driver-side VkImage structure. However, the hardware seems to expect an ID based on the frame structure, which may make this impossible.
All video acceleration APIs came in three flavours:
- System-specific
- DXVA(2), DirectX 12 Video, MediaCodec
- VAAPI, VDPAU, XvMC, XvBA, YAMI, V4L2, OMX
- etc..
- Vendor-specific
- Quick Sync, MFX
- Avivo, AMF
- Crystal HD
- CUVID, NVDEC, NVENC
- RKMPP
- Others I'm sure I've forgotten...
- System AND vendor specific
All of those APIs come with quirks. Some insist that you can
only use up to 31 frames, which is problematic for complex
transcoding chains. Some insist that you preallocate all frames
during initialization, which is memory-inefficient. Most
require your H264 stream to be in Annex-B for no good reason.
Some even give you VP9 packets in some unholy unstandardized
Annex-B. Some require that you give them raw NALs, others
want out-of-band NALs and slices, and some do both.
If you wanted to do processing on hardware video frames,
your options were limited. Most of the APIs let you export
frames to OpenGL for presentation. Some of the more benevolent
APIs let you import frames back from OpenGL or DRM. A few of them
also let you do OpenCL processing.
And of course, all of this happened with little to no
synchronization. Artifacts like video tearing, blocks not
quite finished decoding, and missing references are commonplace
even nowadays. Most APIs being stateful made compensating
for missing references or damaged files difficult.
Finally, attempting to standardize this jungle is Vulkan video.
Rather than a compromise, it is low-level enough to describe most
quirks of video acceleration silicon, and with a single codepath,
let you decode and encode video with relative statelessness.
Implementation-wise, so far, there has only been a single example, the
vk_video_samples
repository. As far as example code goes, I wouldn't recommend it. Moreover,
it uses a closed-source parsing library.
I wrote and maintain the Vulkan code in FFmpeg, so it fell on me
to integrate video decoding and encoding. At the same time, Dave Airlie
started writing a RADV (Mesa's Vulkan code for AMD chips) implementation.
With his invaluable help, in a few weeks (minus some months of inactivity),
we had a working and debuggable open-source driver implementation,
and clean and performant API user code.
The biggest difference between Vulkan video and other APIs is
that you have to manage memory yourself, specifically the reference
frame buffer. Vulkan calls it the Decoded Picture Buffer (DPB), which
is a rather MPEG-ese term, but fair enough.
There are three possible configurations of the DPB:
- Previous output pictures are usable as references.
- Centralized DPB pool consisting of multiple images.
- Centralized DPB pool consisting of a single image with multiple layers.
In the first case, you do not have to allocate any extra memory, but
merely keep references to older frames. FFmpeg's hwaccel framework does this already.
Intel's video decoding hardware supports this behavior.
In the second case, for each output image you create, you have to allocate an extra
image from a pool with a specific image usage flag. You give the driver the output image,
the output's separate reference DPB image, and all previous reference DPB images,
and it then writes to your output frame while simultaneously writing
to the DPB reference image for the current frame.
Recent AMD (Navi21+) and most Nvidia hardware support this mode.
In the third case, the situation is identical to the second, except that you
have to create a single image upfront with as many layers as the maximum number of references.
Then, when creating a VkImageView, you specify which layer you need based on the DPB slot.
This is a problematic mode, as you have to allocate all references you may need upfront, even
if they're never used - which, for 8K HEVC video, is around 3.2 gigabytes of video RAM.
Older AMD hardware requires this.
Another difference with regard to buffer management is that, unlike other APIs, which all managed
their own references, with Vulkan you have to know which slot in the DPB each reference belongs to.
For H264, this is simply the picture index. For HEVC, after considerable trial and error, we found
it to be the index of the frame in the DPB array. 'Slot' is not a standard term in video decoding,
but in lieu of anything better, it's appropriate.
Apart from this, the rest is mostly standard. As with NVDEC, VDPAU, and DXVA, slice decoding is,
sadly, not supported, which means you have to concatenate the data for each slice into a
buffer, with start codes, then upload the data to a VkBuffer to decode from. This is somewhat
of an issue with very high bitrate video, but at least Vulkan lets you use sparse and host
buffers to work around it.
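The concatenation step itself is straightforward. A minimal sketch (the function and buffer handling are invented for illustration; only the 3-byte Annex-B start code is standard):

```c
#include <stdint.h>
#include <string.h>

/* Appends one slice to a frame's bitstream buffer, prefixed with a
 * 3-byte Annex-B start code (00 00 01), and returns the new length.
 * The buffer is assumed large enough here; real code would grow it
 * (or map a VkBuffer of sufficient size) before appending. */
static size_t append_slice(uint8_t *buf, size_t len,
                           const uint8_t *slice, size_t slice_len)
{
    static const uint8_t start_code[3] = { 0x00, 0x00, 0x01 };
    memcpy(buf + len, start_code, sizeof(start_code));
    len += sizeof(start_code);
    memcpy(buf + len, slice, slice_len);
    return len + slice_len;
}
```

Once all slices of a frame are appended, the whole buffer is uploaded and decoded in one operation.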
Unlike other decoding APIs, which let you set only a few SPS and PPS (and, in HEVC, VPS) fields,
you have to parse and set practically every single field from those bitstream elements.
For HEVC alone, the total maximum possible struct size for all fields is 114 megabytes, which
means you really ought to pool the structure memory and expand it only when necessary:
although it's unlikely that you will get a stream using all possible values,
anyone can craft one and either corrupt your output or crash your decoder.
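A sketch of such a grow-on-demand pool (the types and names are made up for this example, not FFmpeg API):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical grow-on-demand pool for parameter structs: keep one
 * allocation alive across frames and expand it only when a stream
 * actually needs more, instead of allocating the 114 MB worst case
 * upfront. */
typedef struct Pool {
    void  *data;
    size_t size;
} Pool;

/* Ensure the pool holds at least 'needed' bytes; returns 0 on success,
 * -1 on allocation failure (the old buffer stays valid). */
static int pool_ensure(Pool *p, size_t needed)
{
    if (needed <= p->size)
        return 0;
    void *tmp = realloc(p->data, needed);
    if (!tmp)
        return -1;
    /* Zero the newly grown region so absent fields stay at defaults. */
    memset((char *)tmp + p->size, 0, needed - p->size);
    p->data = tmp;
    p->size = needed;
    return 0;
}
```

With this, a hostile stream that maxes out every repeated field costs memory only when it actually appears, and never crashes the decoder by overflowing a fixed-size buffer.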
Vulkan video requires that multiplane YUV images are used. Multiplane images are rather limiting,
as they're not well-supported, and if you'd like to use them for processing, you have to use DISJOINT
images with an EXTENDED creation flag (to be able to create VkImageViews with STORAGE usage flags),
which are even less supported and quirkier. Originally, the FFmpeg Vulkan code relied entirely on emulating
multiplane images by using separate images per plane. To work Vulkan video into this, I initially
wrote some complicated ALIASing code to alias the memory from the separate VkImages onto the multiplane
VkImage necessary for decoding. This eventually got messy enough to make me give up on the idea,
and port the entire code to allow for first-class multiplane support. Some foreknowledge of the
drafting process would have helped, but lacking that, as well as any involvement in the standardization,
refactoring was necessary.
As of 2022-12-19, the code has not yet been merged into mainline FFmpeg.
My branch can be found here.
There is still more refactoring necessary to make multiplane images first-class,
which would be good enough to merge, but for now, it's able to decode both H264
and HEVC video streams in 8-bit and 10-bit form.
To compile, clone and check out the vulkan_decode branch:
git clone -b vulkan_decode https://github.com/cyanreg/FFmpeg
To configure, use this line:
./configure --disable-doc --disable-shared --enable-static --disable-ffplay --disable-ffprobe --enable-vulkan
Then type make -j
to compile.
To run,
./ffmpeg_g -init_hw_device "vulkan=vk:0,debug=1" -hwaccel vulkan -hwaccel_output_format vulkan -i <INPUT_FILE> -loglevel debug -filter_hw_device vk -vf hwdownload,format=nv12 -c:v rawvideo -an -y OUT.nut
The validation layers are turned on via the debug=1 option.
To decode 10-bit content, you must replace format=nv12 with format=p010,format=yuv420p10.
To use a different Vulkan device, replace vulkan=vk:0 with vulkan=vk:<N>, where <N> is the index
of the device you'd like to use.
This will produce an OUT.nut file containing the uncompressed decoded data. You can play this
using ffplay, mpv or VLC.
Alternatively, there are many resources on how to use the FFmpeg CLI and output whatever format you'd like.
Non-subsampled 444 decoding is possible, provided drivers enable support for it.
Currently, as of 2022-12-19, there are three drivers supporting Vulkan Video.
- RADV
- ANV
- Nvidia Vulkan Beta drivers
For RADV, Dave Airlie's radv-vulkan-video-prelim-decode branch is necessary.
RADV has full support for Vulkan decoding - 8-bit H264, 8-bit and 10-bit HEVC. The output is spec-compliant.
For installation instructions, check out Dave's blog.
For ANV, his anv-vulkan-video-prelim-decode branch is needed instead.
ANV has partial support for H264 - certain streams may cause crashes.
For installation instructions, check out Dave's blog.
For Nvidia, the Vulkan Beta drivers are necessary. Only Linux has been tested.
The drivers produce compliant 8-bit HEVC decoding output with my code. 10-bit HEVC decoding produces slight artifacts.
8-bit H264 decoding is broken. Nvidia are looking into the issues; progress can be tracked on the issue thread I made.
Currently, I'm working with Dave Airlie on video encoding, which involves
getting a usable implementation and drivers ready. The plan is to finish
video encoding before merging the entire branch into mainline FFmpeg.
The encoding extension in Vulkan is very low-level but extremely flexible,
unlike all other APIs, which force you onto fixed coding paths and often
bad rate-control systems.
With good user-level code, even suboptimal hardware implementations could be
made competitive with fast software implementations. The current code ignores
the driver's rate control modes, and will integrate with Daala/rav1e's RC
system.
Due to multiplane surfaces being needed for Vulkan encoding and decoding,
Niklas Haas is working on integrating support for them in
libplacebo, which would enable
post-processing of decoded Vulkan frames in FFmpeg, and enable both
mpv and VLC
to display the decoded data directly.
In the near future, support for more codecs will hopefully materialize.