The Vulkan encoding patchset for FFmpeg finally has enough features and functionality to be sent for merging into the codebase.
A lot has changed along the way: the upload/download path was completely rewritten, which doubled its performance,
the queue family API was extended to support future extensions, preliminary optical flow and shader object support was added,
and many fixes and optimizations landed.
Those looking to build and test ahead of the patchset being merged can pull from my repository
and run:
As the encoding API is abstracted, existing users of FFmpeg, such as mpv and OBS, can rapidly adopt the Vulkan encoding API,
letting us and the entire ecosystem benefit from testing.
Friday night. Amidst the sound of rain, the wailing of police sirens, and the distant clacking of a cargo train passing by, a cloaked figure sat hunched in a darkened office. Heavy cigarette smoke filled the room, which was lit only by a small desk lamp.
Like gunshots at a shootout, mechanical clangs created blast waves through the thick atmosphere. It was a typewriter, and as if chiselling marble, the man was pounding out word after word. The shining silver lettering on top of it said 'Clark Nova', and the whole typewriter itself exuded a mythic resonance, establishing its presence in the room.
The typewriter's constant punches sped up, a premonition of a distant artillery strike. The man's expression was strained, and the cigarette he kept in his mouth was already trying, in vain, to burn through its filter. Although the needle-like teeth of the cold were biting their way into the man's skin, drops of sweat were dripping past his forehead.
The marked sheet of paper was already extending way past the head of the machine, when an exclamation mark's engraving on the bottom right signaled the sheet's completion. Without even treating this as a nuisance, the man pulled the sheet out of the machine, put it in a neat pile, which was surrounded by dozens upon dozens of crushed sheets, took a new sheet, and threaded it through the machine. The new sheet's vellum-colored allure invited the words to come out of the man's mind, and the rhythmic mechanical march began anew.
The images of a marathon runner seeing the finish line, a lion's paw on the verge of touching its victim, and a fly's vain attempt to escape its silken webbed prison entered the man's mind. This was it. The last sheet that would bring the thunderous rumbling of the past ones to an explosion, and would tie the plot's conclusion to a satisfying end.
The mechanical clattering suddenly stopped. The sheet's capacity was barely halfway used.
What followed were a series of lonesome strikes, like deafening howitzer bangs bringing silence to small arms fire. Each one was set off by the typewriter head's movement, engraving a deep indentation into the paper.
Slowly, one per line, the following words appeared on the paper:
...
1983, July, Antarctic Plateau, Dome C
Soviet Antarctic base "НАДЕЖДА". Winter. Complete isolation.
Between the tens of steaming hot radioisotope thermal generators, a large hut stands tall, lit by powerful projectors to melt snow. From it, deafening machine noises pierce the air.
Under the guise of a dark matter detection experiment, around the clock digging is done. The 3 kilometer deep cover of snow and ice is no match for red hot decaying nuclear matter. After 2 years of digging, the sonde is finally nearing the continental mass.
Vast expanses of the so-called black gold, oil, the true target of the expedition, have been detected just below the crust. And Soviet leaders want to claim it, no matter the cost.
On July 22nd, success. The probe stops, with a violent grinding noise resonating all the way to the surface - it hit solid ground.
The engineers change the speed to "gradual" (постепенно), retract the nuclear tip, and begin regular drilling.
Liquefied soil is bursting through the pipe, and being shot a kilometer away from the base. During the summer, American satellites would no doubt see this, so time is of the essence.
July 27th. Suspiciously close to the start of drilling, gas begins to bubble up. Ovations in the control room.
July 30th. Despite progress, no oil. More gas, yet no soil. A void, filled with nothing but very hot, high-pressure natural gas.
Scientists decide to acoustically scan the void. Turning off the hot water means the drill may freeze solid in the ground within moments, but nevertheless, the risk must be taken.
Machines are turned off. Deadly silence, after months of non-stop sound. Only the rumble of the gas escaping from the hole is heard.
An acoustic impulse radar descends through the piping. Six hours of sweat and temperature readouts later, the probe hits the drilling head, turning it into an antenna.
The emitter activates, and a metallic ping travels through the chasm and the wall.
But alas, the noise is too much to make a guess.
Another one. Repeated pings, but too much noise to make an estimation.
The engineer in charge notices the noise does not subside. Could it be the sound of gas whirling around the drilling head?
They record the noise on tape.
Wanting immediate help on how to proceed, they activate the 500 kilowatt AM radio transmitter at the specified frequency, and with no preamble, stream the sound out directly, hoping no one detects it.
The Americans, however, run around the clock radio frequency observation under a similar guise of a science experiment, and put it to tape.
After analysis, the tape is labelled as "Unknown vocalizations, July 30th, source: unknown". A thin sheet of paper lies next to the tape. Titled, "Contents", in bold. Below it, the following text:
...
Friday. 3 AM. The epilogue to a cloudy, tiring summer week.
After staying up way past what's acceptable, you're trying to sleep. Just as you drift off, a gentle, low-frequency rumbling knocks you back awake.
"Must be a washing machine...", you think.
3 minutes later, you notice the rumble has changed tone, and gone up in pitch slightly. "Ah, the clothes are drying, getting lighter, so maybe that's why.".
2 more minutes, and the rumbling keeps getting higher in pitch and a little louder.
You stand up and move around the room to figure out if there's an odd resonance, to no avail. You open a window, thinking there's some truck parked outside, but the rumble seemingly permeates through walls and windows. You see several more people sticking their heads out, trying to figure out where the noise is coming from.
You go back to bed, thinking the rumble will soon pass, but it doesn't; it just keeps getting louder and higher in pitch.
Reminded of similar incidents around the world you've heard, you reluctantly open your web browser and search for "the rumble", only to be blasted with all sorts of theories: "The souls of the damned are yelling out!", "It's a secret military project!", "A nuclear reactor has gone nuts!".
You decide to be scientific, and you turn on your sound spectral analyzer. The rumble's base frequency has gone up to 245Hz. At various multiples, there are overtones, which give an eerie, choir-type feel to it.
...
35 minutes after starting, the rumble now sounds like a million people are screaming out from the distance.
Almost believing these really are the screams of the damned, you snap yourself back to reality and check the internet again. Some users on an IRC channel have started to collaborate. Though their discussion of the nature of the screaming has yielded no explanation, they have decided to try to triangulate the origin, despite the fact that it should be impossible for a sound to spread so far through the atmosphere.
Using the start time and energy, they've found the epicentre - Seoul. But along with that, they've also detected hotspots around several of the largest cities in the world. Oddly, the screaming there began after that in Seoul, with an offset that exactly matches the distance travelled at the speed of light.
The screaming finally starts to reduce in volume. But the mystery remains.
You decide to look for any recent news mentioning Seoul, or in Korean in general. But every single news outlet is unusable, brimming with articles about how 2 of the top K-Pop boy stars have started dating each other. Annoyed, you try looking harder, but no one has reported anything but this for the past 2 hours.
"There must be some other explanation...", you think, but soon, a small thought enters your mind, "...or a correlation.". "Millions of screaming girls can't cause such a phenomenon, right? Right?"....
With the standardization of the Vulkan decoding extension less than a month ago, two codecs were
defined - H264 and H265. While they have cemented their position in multimedia, another, newer
codec called AV1 has appeared. Indeed, I was involved with its standardization. Not entirely satisfied
with the pace of Khronos, nor with VAAPI's lack of synchronization, Dave Airlie and I decided to
make our own extension to support AV1 decoding - VK_MESA_video_decode_av1.
We were granted an official dedicated stable extension number from Khronos, 510, to avoid incompatibilities.
The extension is done in the same style as the other 2 decoder extensions, but with differences.
Unlike MPEG codecs, AV1 lacks the overcomplicated multiple-NALU structure. A single sequence
header unit is all that's needed to begin decoding frames. Each frame is prefixed by a (quite large)
frame header. Thus, the video session parameters were delegated to the only piece of header that
may (or may not) be common amongst multiple frames - the sequence header. The frame header is supplied
separately, via each frame's VkVideoDecodeInfoKHR.
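Roughly, that maps onto the API like this (a sketch only; the MESA structure and sType names below are assumptions, while VkVideoDecodeInfoKHR and vkCmdDecodeVideoKHR are standard Vulkan):

/* Sketch - names of the MESA AV1 structs are assumptions here. */
VkVideoDecodeAV1PictureInfoMESA av1_pic = {
    .sType        = VK_STRUCTURE_TYPE_VIDEO_DECODE_AV1_PICTURE_INFO_MESA,
    .frame_header = &frame_header,   /* the parsed uncompressed frame header */
};

VkVideoDecodeInfoKHR decode_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR,
    .pNext = &av1_pic,               /* per-frame codec data rides along here */
    /* srcBuffer, dstPictureResource, pReferenceSlots, etc. set as usual */
};

vkCmdDecodeVideoKHR(cmd_buf, &decode_info);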
AV1 has support for film grain insertion, which creates a challenge for hardware decoders,
as the film grain is required to be put on the decode output only, and be missing from the
reconstructed output used for references.
In my previous article about Vulkan decoding, I mentioned the three possible modes for
decoding reference buffers:
in-place (output frames are also used as references)
out-of-place (output frames are separate from references)
out-of-place layered (output frames are separate from references, which are in a single multi-layered image)
The first option cannot be combined with film grain, as the output images have film grain applied. But we
still want to use it if the hardware supports it and no film grain is enabled. So, we require
that if the user signals that a frame has film grain enabled, the output image view MUST be different
from the reference output image view. To accomplish that, we use a pool of decoding buffers only for frames with
film grain, which requires that at least VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR is set for
AV1 decoding.
Devices which support in-place decoding also support out-of-place decoding, which makes this method compatible
with future hardware.
This allows us to handle cases where film grain is switched on for a frame, and then switched off,
without wasting memory on separate reference buffers. This also allows for external film grain application
during presentation, such as with libplacebo.
Another difference in how references are handled between AV1 and MPEG codecs is that a single
frame can overwrite multiple reference slots. Rather than naively doing copies, we instead let the higher-level
decoder (not the hardware accelerator, which is what the extension is) handle this, by reference-counting the
frame. This does require that the hardware supports reusing the same frame in multiple slots, which, to our
knowledge, all AV1 hardware accelerators do.
Finally, the biggest issue was with the hardware itself. AMD's hardware decoder expects a unique 8-bit
ID to be assigned to each frame. This was no problem for index-based APIs, such as VAAPI, VDPAU, DXVA2,
NVDEC and practically all other decoding APIs.
Vulkan, however, is not index based. Each frame does not have an index - instead, it's much lower level,
working with bare device addresses. Users are free to alias the address and use it as another frame,
which immediately breaks the uniqueness of indices.
To work around this hardware limitation, we had no choice but to create a made-up frame ID. Writing
the code was difficult, as it was a huge hack in what was otherwise a straightforward implementation.
AV1's frame header does feature frame IDs, however, those are completely optional,
and most encoders skip them (with good reason, as the frame header is already needlessly large).
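The bookkeeping this boils down to looks roughly like the following sketch (the names and structure here are made up purely for illustration): keep a small table mapping active frames to 8-bit IDs, hand out the first free ID, and recycle it once the frame is no longer referenced.

/* Illustration only: map each active frame to a unique 8-bit ID,
 * recycling IDs once the frame they were bound to is released. */
typedef struct FrameIDMap {
    const void *owner[256];   /* opaque handle of the frame holding each ID */
} FrameIDMap;

static int assign_frame_id(FrameIDMap *map, const void *frame)
{
    int free_id = -1;
    for (int i = 0; i < 256; i++) {
        if (map->owner[i] == frame)
            return i;                  /* frame already has an ID */
        if (!map->owner[i] && free_id < 0)
            free_id = i;               /* remember the first free slot */
    }
    if (free_id >= 0)
        map->owner[free_id] = frame;
    return free_id;                    /* -1: more than 256 frames in flight */
}

static void release_frame_id(FrameIDMap *map, const void *frame)
{
    for (int i = 0; i < 256; i++)
        if (map->owner[i] == frame)
            map->owner[i] = NULL;
}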
While it's possible for the extension to become official, it requires maintaining and sitting through
meetings, which neither of us has the time for. Instead, we hope that this extension becomes a starting
point for an official version, with all the discussion points highlighted.
The official extension probably wouldn't look very different. It's possible to build it in other ways,
but doing so would be inefficient, and probably unimplementable - we were able to fit our extension
to use the same model as all other AV1 hardware accelerators available in FFmpeg.
The extension's very likely going to get changes as we receive some feedback from hardware vendors (if
we do at all, we'd certainly like to know why AMD designed theirs the way they did).
It was nice seeing all the areas I worked on in AV1 actually get used in practice. As for where this goes,
well, saying that we might be getting access to AMD's 7000-series soon would be more than enough :)
Update: while looking through Intel's drivers and documentation for AV1 support, Dave Airlie discovered that the hardware only supports decoding of a single tilegroup per command buffer. This is not compatible with our extension, nor with the way Vulkan video decoding currently exists as operating on a frame-only basis. Hopefully some solution exists which does not involve extensive driver bitstream modifications.
Also, in the case of AMD, it may be possible to hide the frame index in the driver-side VkImage structure. However, the hardware seems to expect an ID based on the frame structure, which may make this impossible.
All video acceleration APIs came in three flavours.
System-specific
DXVA(2), DirectX 12 Video, MediaCodec
VAAPI, VDPAU, XvMC, XvBA, YAMI, V4L2, OMX
etc..
Vendor-specific
Quick Sync, MFX
Avivo, AMF
Crystal HD
CUVID, NVDEC, NVENC
RKMPP
Others I'm sure I've forgotten...
System AND vendor specific
Videotoolbox
All of those APIs come with quirks. Some insist that you can
only use up to 31 frames, which is problematic for complex
transcoding chains. Some insist that you preallocate all frames
during initialization, which is memory-inefficient. Most
require your H264 stream to be in Annex-B for no good reason.
Some even give you VP9 packets in some unholy unstandardized
Annex-B. Some require that you give them raw NALs, others
want out-of-band NALs and slices, and some do both.
If you wanted to do processing on hardware video frames,
your options were limited. Most of the APIs let you export
frames to OpenGL for presentation. Some of the more benevolent
APIs let you import frames back from OpenGL or DRM. A few of them
also let you do OpenCL processing.
And of course, all of this happened with little to no
synchronization. Artifacts like video tearing, blocks whose decoding
isn't quite finished, and missing references are commonplace
even nowadays. Most APIs being stateful made compensating
for missing references or damaged files difficult.
Finally, attempting to standardize this jungle is Vulkan video.
Rather than a compromise, it is low-level enough to describe most
quirks of video acceleration silicon and, with a single codepath,
lets you decode and encode video with relative statelessness.
Implementation-wise, so far there has only been a single example, the
vk_video_samples
repository. As far as example code goes, I wouldn't recommend it. Moreover,
it uses a closed-source parsing library.
I wrote and maintain the Vulkan code in FFmpeg, so it fell on me
to integrate video decoding and encoding. At the same time, Dave Airlie
started writing a RADV (Mesa's Vulkan driver for AMD chips) implementation.
With his invaluable help, in a few weeks, minus some months of inactivity,
we have a working and debuggable open-source driver implementation,
and clean, performant API user code.
The biggest difference between Vulkan video and other APIs is
that you have to manage memory yourself, specifically the reference
frame buffer. Vulkan calls it the Decoded Picture Buffer (DPB), which
is a rather MPEG-ese term, but fair enough.
There are three possible configurations of the DPB:
Previous output pictures are usable as references.1
Centralized DPB pool consisting of multiple images.2
Centralized DPB pool consisting of a single image with multiple layers.3
In the first case, you do not have to allocate any extra memory, but
merely keep references of older frames. FFmpeg's hwaccel framework does this already.
Intel's video decoding hardware supports this behavior.
In the second case, for each output image you create, you have to allocate an extra
image from a pool with a specific image usage flag. You give the driver the output image,
the output's separate reference DPB image, and all previous reference DPB images,
and the driver then writes to your output frame, while simultaneously also writing
to the DPB reference image for the current frame.
Recent AMD (Navi21+) and most Nvidia hardware support this mode.
In the third case, the situation is identical to the second case, only that you
have to create a single image upfront with as many layers as there are maximum references.
Then, when creating a VkImageView, you specify which layer you need based on the DPB slot.
This is a problematic mode, as you have to allocate all references you need upfront, even
if they're never used. Which, for 8k HEVC video, is around 3.2 gigabytes of Video RAM.
Older AMD hardware requires this.
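To make the three cases concrete, this is roughly how user code can tell which of them a driver wants (a sketch; creating the VkVideoProfileInfoKHR and the error handling are omitted):

VkVideoDecodeCapabilitiesKHR dec_caps = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_CAPABILITIES_KHR,
};
VkVideoCapabilitiesKHR caps = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_CAPABILITIES_KHR,
    .pNext = &dec_caps,
};

/* phys_dev and profile (a VkVideoProfileInfoKHR) are set up elsewhere */
vkGetPhysicalDeviceVideoCapabilitiesKHR(phys_dev, &profile, &caps);

if (dec_caps.flags & VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR) {
    /* case 1: output images double as references */
} else if (caps.flags & VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR) {
    /* case 2: a pool of separately allocated DPB images */
} else {
    /* case 3: one DPB image with caps.maxDpbSlots layers, allocated upfront */
}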
Another difference with regards to buffer management is that, unlike other APIs, which all managed
their own references, with Vulkan you have to know which slot in the DPB each reference belongs to.
For H264, this is simply the picture index. For HEVC, after considerable trial and error, we found
it to be the index of the frame in the DPB array. 'Slot' is not a standard term in video decoding,
but in lieu of anything better, it's appropriate.
Apart from this, the rest is mostly standard. As with NVDEC, VDPAU and DXVA, slice decoding is,
sadly, not supported, which means you have to concatenate the data for each slice into a
buffer, with start codes,
then upload the data to a VkBuffer to decode from. Somewhat of an issue with very high bitrate
video, but at least Vulkan lets you have spare and host buffers to work around this.
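In practice that just means growing a host buffer as slices arrive and prepending a start code to each before the whole thing is uploaded; a sketch (the 3-byte 00 00 01 start code is an assumption - use whatever the implementation actually requires):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static const uint8_t start_code[3] = { 0x00, 0x00, 0x01 };

/* Append one slice, prefixed by a start code, to the bitstream buffer
 * that will later be copied into the VkBuffer used for decoding. */
int append_slice(uint8_t **buf, size_t *size, size_t *capacity,
                 const uint8_t *slice, size_t slice_size)
{
    size_t needed = *size + sizeof(start_code) + slice_size;
    if (needed > *capacity) {
        uint8_t *tmp = realloc(*buf, needed * 2);
        if (!tmp)
            return -1;
        *buf      = tmp;
        *capacity = needed * 2;
    }
    memcpy(*buf + *size, start_code, sizeof(start_code));
    memcpy(*buf + *size + sizeof(start_code), slice, slice_size);
    *size = needed;
    return 0;
}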
Unlike other decoding APIs, which let you only set a few SPS, PPS (and VPS in HEVC) fields,
you have to parse and set practically every single field from those bitstream elements.
For HEVC alone, the total maximum possible struct size for all fields is 114 megabytes, which
means you really ought to pool the structure memory and expand it when necessary, because
although it's unlikely that you will get a stream using all possible values,
anyone can craft one and either corrupt your output or crash your decoder.
Vulkan video requires that multiplane YUV images are used. Multiplane images are rather limiting,
as they're not well-supported, and if you'd like to use them to do processing, you have to use DISJOINT
images with an EXTENDED creation flag (to be able to create VkImageViews with STORAGE usage flags),
which are even less supported and quirky. Originally, the FFmpeg Vulkan code relied entirely on emulating
multiplane images by using separate images per-plane. To work Vulkan video into this, I initially
wrote some complicated ALIASing code to alias the memory from the separate VkImages to the multiplane
VkImage necessary for decoding. This eventually got messy enough to make me give up on the idea
and port the entire codebase to first-class multiplane support. Some foreknowledge of the drafting
process would have helped, but lacking that, as well as any involvement in the standardization,
refactoring was necessary.
As of 2022-12-19, the code has not yet been merged into mainline FFmpeg.
My branch can be found here.
There is still more refactoring necessary to make multiplane images first-class,
which would be good enough to merge, but for now, it's able to decode both H264
and HEVC video streams in 8-bit and 10-bit form.
To compile, clone and checkout the vulkan_decode branch:
The validation layers are turned on via the debug=1 option.
To decode 10-bit content, you must replace format=nv12 with format=p010,format=yuv420p10.
To use a different Vulkan device, replace vulkan=vk:0 with vulkan=vk:<N>, where <N> is the device index you'd like to use.
This will produce an OUT.nut file containing the uncompressed decoded data. You can play this
using ffplay, mpv or VLC.
Alternatively, there are many resources on how to use the FFmpeg CLI and output whatever format you'd like.
Non-subsampled 444 decoding is possible, provided drivers enable support for it.
Currently, as of 2022-12-19, there are 3 drivers supporting Vulkan Video.
RADV
ANV
Nvidia Vulkan Beta drivers
For RADV, Dave Airlie's radv-vulkan-video-prelim-decode branch is necessary.
RADV has full support for Vulkan decoding - 8-bit H264, 8-bit and 10-bit HEVC. The output is spec-compliant.
For installation instructions, check out Dave's blog.
For ANV, his anv-vulkan-video-prelim-decode branch is needed instead.
ANV has partial support for H264 - certain streams may cause crashes.
For installation instructions, check out Dave's blog.
For Nvidia, the Vulkan Beta drivers are necessary. Only Linux has been tested.
The drivers produce compliant 8-bit HEVC decoding output with my code. 10-bit HEVC decoding produces slight artifacts.
8-bit H264 decoding is broken. Nvidia are looking into the issues, progress can be tracked on the issue thread I made.
Currently, I'm working with Dave Airlie on video encoding, which involves
getting a usable implementation and drivers ready. The plan is to finish
video encoding before merging the entire branch into mainline FFmpeg.
The encoding extension in Vulkan is very low level, but extremely flexible,
which is unlike all other APIs that force you onto fixed coding paths and often
bad rate control systems.
With good user-level code, even suboptimal hardware implementations could be
made competitive with fast software implementations. The current code ignores
the driver's rate control modes, and will integrate with Daala/rav1e's RC
system.
Due to multiplane surfaces being needed for Vulkan encoding and decoding,
Niklas Haas is working on integrating support for them in
libplacebo, which would enable
post-processing of decoded Vulkan frames in FFmpeg, and enable both
mpv and VLC
to display the decoded data directly.
In the near future, support for more codecs will hopefully materialize.
When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR. ↩
When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR. ↩
When the driver sets VkVideoDecodeCapabilitiesKHR.flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_DISTINCT_BIT_KHR and does NOT set VkVideoCapabilitiesKHR.flags = VK_VIDEO_CAPABILITY_SEPARATE_REFERENCE_IMAGES_BIT_KHR. ↩
There's a persistent belief that C generics are all useless, shouldn't be called Generics
at all and in general should not have been standardized.
I'm not going to debate whether they should have been a part of the language, nor why they're
called generics, but they are nonetheless a part of the language because someone thought they
could be useful. So here's one useful application: filling in structs and saving you from
copying or specifying things the compiler knows about already:
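A sketch of the idea (the macro, struct and type names below are stand-ins rather than the actual code, and only a handful of types are handled):

enum SPDataType {
    SP_DATA_TYPE_INT,
    SP_DATA_TYPE_FLOAT,
    SP_DATA_TYPE_DOUBLE,
    SP_DATA_TYPE_STRING,
};

typedef struct SPOption {
    const char *name;
    void *ptr;
    enum SPDataType type;
} SPOption;

/* Expands to { "var", &var, <type tag picked by the compiler> } */
#define SP_OPTION(var)                         \
    {                                          \
        #var, &(var),                          \
        _Generic((var),                        \
                 int:    SP_DATA_TYPE_INT,     \
                 float:  SP_DATA_TYPE_FLOAT,   \
                 double: SP_DATA_TYPE_DOUBLE,  \
                 char *: SP_DATA_TYPE_STRING)  \
    }

static int some_variable = 5;
static SPOption opts[] = { SP_OPTION(some_variable) };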
The macro simply expands to { "some_variable", &some_variable, SP_DATA_TYPE_INT }.
Sure, it's almost no work to write it out explicitly like that. But if you don't care
about C99/89 compatibility and somehow the idea of using new language features which
only a few understand appeals to you, go for it.
Of course, this is not the only application of generics; after all, the tgmath.h
header makes extensive use of them, and they can be useful to auto-template some
DSP code. But for that case I'd much rather explicitly write it all out.
"Every collection of random bits is white noise.".
"Every collection of random bits is a valid Opus packet.".
But is every collection of random bits a valid Opus packet which decodes to white
noise1?
Here are two volume-reduced2
16kbps samples.
Encoded white noise (volume set to 5.37%)
Random bytes (volume set to 0.28%)
Clearly not.
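If you'd like to reproduce the second sample, the experiment amounts to nothing more than feeding random bytes to libopus and keeping whatever it decodes (a sketch; the 40-byte packet size, packet count and decoder parameters are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <opus.h>

int main(void)
{
    int err;
    OpusDecoder *dec = opus_decoder_create(48000, 2, &err);
    if (err != OPUS_OK)
        return 1;

    unsigned char packet[40];      /* roughly 16kbps at 20ms per packet */
    opus_int16 pcm[2 * 5760];      /* up to 120ms of stereo at 48kHz */

    for (int i = 0; i < 500; i++) {
        for (size_t j = 0; j < sizeof(packet); j++)
            packet[j] = rand() & 0xff;
        int samples = opus_decode(dec, packet, sizeof(packet), pcm, 5760, 0);
        if (samples > 0)           /* raw interleaved s16 PCM on stdout */
            fwrite(pcm, sizeof(opus_int16), 2 * (size_t)samples, stdout);
    }

    opus_decoder_destroy(dec);
    return 0;
}

The output can be piped into something like ffplay -f s16le -ar 48000 -ac 2 - to listen to it.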
Yes, in a VoIP context, signalling that an Opus packet is silence via the flag will cause the comfort noise generator
to produce noise, but it won't be white. ↩
Volume was reduced without transcoding via the opus_metadata FFmpeg BSF. ↩
Each packet starts with the 32-bit sequence BBCD, then a parse code
which indicates the type of the unit, then two 32-bit DWORDs which are supposed to
be the global offsets of the current/last unit within the file, and finally that's
followed by the unit data, which may be colorspace info or actual codec packets
containing pixels.
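Laid out in C, a minimal reader for that header looks something like this (a sketch; it resyncs by scanning for the prefix, since there's nothing better to go on):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t rb32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
}

void parse_units(const uint8_t *data, size_t size)
{
    size_t pos = 0;
    while (pos + 13 <= size) {
        if (!memcmp(data + pos, "BBCD", 4)) {
            uint8_t  parse_code  = data[pos + 4];
            uint32_t offset_cur  = rb32(data + pos + 5);
            uint32_t offset_prev = rb32(data + pos + 9);
            printf("unit: parse code 0x%02x, offsets %u/%u\n",
                   (unsigned)parse_code, offset_cur, offset_prev);
            pos += 13;   /* the unit data follows; its size is never stored */
        } else {
            pos++;       /* lost sync: scan byte-by-byte for the prefix */
        }
    }
}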
On the surface it's a simple container, and you could write a bad parser in just a few
lines and as many minutes. And here's a short list of what's wrong with it:
No size field. The only way to know how big a unit is would be to get the
difference between the next and current unit's offset.
Fast and accurate seeking is not possible in files over 4GB. Both offsets may overflow,
so seeking using fseek and a known frame number means you have to parse every single
unit.
Unsuitable for streaming. Both offsets make no sense for a stream as there's no
start nor end known on the client-side.
A single-bit error in all but 8 bits of the 108-bit header will break a simple
demuxer, while an error in the 8-bit leftover (parse code) will break decoding.
A badly written demuxer will go into a loop with a near-0 chance of recovery,
while a better one will look at each incoming byte for BBCD.
For a stream, there's barely 32 bits of usable entropy to find the next unit.
While some sanity checks could be done to the offsets and parse code, a demuxer
which accepts arbitrary input can't do them.
BBCD is not an uncommon sequence to see in compressed data, or in fact any data
with a high amount of entropy. Which means you have a non-negligible chance
of parsing a wrong unit where there wasn't meant to be one.
Combined, all those issues make parsing not very robust and not very quick. A reliable
parser can only be written for a very limited subset of the container.
A more robust simple codec container
"Let's just put a CRC on the whole data!". Hence, let's take a look at OGG:
Here, the 32-bit standard CRC_checksum covers the entire page, including the (possibly segmented) packet data it carries.
When computing it, it's just taken as 0x0 and replaced later.
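For reference, the page header the checksum lives in is laid out like this (written as a struct purely for readability; the actual on-disk layout is packed, little-endian, per the Ogg framing spec):

struct ogg_page_header {
    char     capture_pattern[4];  /* "OggS"                                  */
    uint8_t  version;             /* always 0                                */
    uint8_t  header_type;         /* continuation / BOS / EOS flags          */
    uint64_t granule_position;    /* codec-specific timestamp                */
    uint32_t bitstream_serial;    /* identifies the logical stream           */
    uint32_t page_sequence;       /* lets you detect dropped pages           */
    uint32_t crc_checksum;        /* CRC over the whole page, computed with
                                     this field set to 0                     */
    uint8_t  page_segments;       /* number of lacing values that follow     */
    /* followed by page_segments lacing values, then the segment data        */
};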
A single bit error anywhere will desync demuxing. Which means you have to look for
the OggS sequence at every byte position in a compressed stream.
You could ignore the CRC and let the packet through to the decoder, but there's no
guarantee the codec won't output garbage, or fail to decode (or produce valid garbage,
since any sequence of random bytes is a valid Opus packet).
If you don't ignore the CRC, you'll have to look at every byte position for the magic
OggS and then, at every match, compute a CRC over a page which may be as long as 65k. This is
neither fast nor good for battery consumption or memory.
The way chaining works in internet radio is to literally append the next file onto
the stream, which causes a desync. So you have a mandatory desync every few minutes.
Seeking will also result in a desync and require you to search.
There are still only 32 bits of entropy in the magic sequence, and, worse than Dirac,
there aren't many ways to check whether a packet is valid based on the header, since new
streams may be created or simply dropped at any point.
There's a ton more I'm not covering here on what Ogg did disastrously wrong. It's a story
for another time.
My point is that while magic sequences and CRCs are convenient, they're not robust,
especially done on a per-packet level.
Eventually, all well-optimized decoder implementations of codecs will hit a bottleneck:
the speed at which they're able to parse data from the bitstream.
Spending the time to optimize this parsing code is usually not worth it unless users are
actually starting to hit this and it's preventing them from scaling up.
For a complicated filter-heavy codec such as AV1, this becomes a problem quite late,
especially as the specifications limit the maximum bitrate to around 20Mbps,
and care was taken to allow easy SIMDing of parsing code during the writing of the spec.
For simple mezzanine codecs such as ProRes, DNxHD, CFHD, Dirac/VC-2 or even intra-only VLC H.264,
where bitrates are on the order of hundreds of Mbps, optimizing bitstream parsing is usually the
number one priority.
The context we'll be using while looking at bitstream decoding is that within the VC-2 codec,
which is essentially a stripped-down version of Dirac, made by the BBC for allegedly1
patent-unencumbered near-lossless video transmission over bandwidth-limited connectivity
(think 1.5Gbps 4:2:0 1080p60 over a 1Gbps connection).
To achieve this, the pixels are first transformed using one of many possible wavelet
transforms, then quantized by means of division, and encoded. The wavelet transforms
used are simple and easy to SIMD; the quantization is somewhat tricky but still
decently fast, as it's just a multiply and an add per coefficient. Boring, standard, old,
uninspired and uninteresting - the basis of anyone's first look-I-wrote-my-own-codec2.
Which leaves only the quantized coefficient encoding.
The coefficients are encoded using something called
Interleaved Signed exp-Golomb codes.
We'll go over this word by word, in reverse.
exp-Golomb codes, or exponential-Golomb codes in full, or just Golomb to those
who are lazy and know their codecs, are a form of binary encoding of arbitrarily sized integers,
where the length is encoded as a prefix preceding the data. To illustrate:
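For example, the first few values map like so (a reconstruction of the usual table, value on the left, bits on the right):

0 => 1
1 => 010
2 => 011
3 => 00100
4 => 00101
5 => 00110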
1 is added to the number so that a 0 can be encoded, since otherwise encoding it
would take 0 bits. The prefix is a run of zeroes, one for each bit after the most
significant set bit of the incremented integer.
This encoding doesn't have any interesting properties about it and is simple
to naïvely decode.
Signed exp-Golomb codes are just the same as above, only an additional bit
is appended at the end to signal a sign. By universal convention, 1 encodes a negative
number, and 0 encodes a positive number. The bit is not signalled if the number is 0.
Interleaved exp-Golomb codes take the same amount of bits to encode as
regular exp-Golomb codes, however at first glance they look very different:
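The same values as above, this time in interleaved form (again a reconstruction):

0 => 1
1 => 001
2 => 011
3 => 00001
4 => 00011
5 => 01001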
As the number of bits hasn't changed, and there are still the same amount of zeroes
as in normal exp-Golomb codes, the prefix is still there. It's just interleaved, where
every odd bit (except the last one) is a 0, while every even bit encodes the integer.
The reason why it looks so different is that with this coding scheme, coding the
very first non-zero bit is unnecessary, hence it's implicitly set to 1 when decoding.
Interleaved Signed exp-Golomb codes are finally just an interleaved exp-Golomb
code with an additional bit at the end to signal a sign. That bit is of course not signalled
if the number encoded is a 0.
A more convenient way to think about interleaved exp-golomb codes is that
every odd bit is actually a flag that tells you whether the number has ended (a 1)
or that the next bit is part of the number (a 0). A simple parser for signed codes
would then look like this:
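A sketch of such a parser, with get_bit() and GetBitContext standing in for whatever bitstream reader is actually used:

int read_sie_golomb(GetBitContext *gb)
{
    unsigned val = 1;

    /* 0 = "another data bit follows", 1 = "the number has ended" */
    while (!get_bit(gb))
        val = (val << 1) | get_bit(gb);

    int out = val - 1;

    /* sign bit, only coded for non-zero values; 1 means negative */
    if (out && get_bit(gb))
        out = -out;

    return out;
}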
Looks simple, and the loop has 3 instructions, so it should be fast, right?
Bitstream readers, however, are not exactly simple and do not exactly have easily predicted
branches, which makes them not exactly fast.
CPUs like it when the data you're looking at has a size of a power of two,
with the smallest unit being a byte. Bytes can encode 256 total possibilities,
which isn't exactly large, but if we could process a byte at a time rather than
a bit at a time, we could have a potential 8x speedup.
01110010 is a sequence which encodes a -2 and a 1, both signed, and is exactly 8 bits.
So if we make a lookup table, we can say that the byte contains a -2 and a 1, directly
output the data to some array, and move on cleanly to the next byte.
This is possibility number one, where the byte contains all the bits of
the numbers present.
01101001 0xxxxxxx is a sequence which encodes a 2, a 0, and a 1. Unlike the previous
example, all the numbers are terminated in a single byte, with only the last sign bit
missing. This is hence possibility number two, where the current byte
has leftover data that needs exactly 1 bit from the next byte.
01011101 10xxxxxx is a sequence which encodes a -6 and a 2. It's 10 bits in length,
so the last 2 bits of the 2 spill over into the next byte. We can output a -6,
save the uncompleted bits of the 2, and move over to the next byte, where we can
combine the unterminated data from the previous byte with the data from the
current byte.
However, there's more to this. In the previous example, the -6 ended on an odd bit,
making an even bit the start of the 2. As we know, the terminating bit of an
interleaved exp-Golomb code will always come after an odd number of bits since the start.
So we know that whenever the sequence ends, whether it be in the next byte or the current
byte, the ending bit of the sequence must be at an odd position. In other words,
this is possibility number three, where the current byte is missing some data
and needs the next, and possibly more, bytes to complete, with the data ending at an odd
position.
Of course, there's the possibility that the sequence will end on an even bit, such
as with 01011110 110xxxxx (-6, 0, 0, 2), making this the final possibility number four.
So, with this we can exactly determine what the current byte contains, and we can know
what we need to expect in the next byte. We know what we need to keep as a state
(what we expect from the next byte and any unterminated data from this byte),
so we can make a stateful parser:
#define POSSIBILITY_ONE_FULLY_TERMINATED    0
#define POSSIBILITY_TWO_SIGN_BIT            1
#define POSSIBILITY_THREE_ODD_BIT_TERMINATE 2
#define POSSIBILITY_FOUR_ODD_BIT_TERMINATE  3

typedef struct Lookup {
    int ready_nb;          /* numbers fully terminated within this byte */
    int *ready_out;
    uint64_t incomplete;   /* bits of a number spilling into the next byte */
    int incomplete_bits;
    int terminate;         /* whether this byte terminates the spilled-over number */
    int next_state;
} Lookup;

static const Lookup lookup_table[] = {
    /*    0 -  255 = POSSIBILITY_ONE_FULLY_TERMINATED    */
    /*  256 -  511 = POSSIBILITY_TWO_SIGN_BIT            */
    /*  512 -  767 = POSSIBILITY_THREE_ODD_BIT_TERMINATE */
    /*  768 - 1023 = POSSIBILITY_FOUR_ODD_BIT_TERMINATE  */
};

void read_golomb(int *output, uint8_t *data, int bytes)
{
    int next_state = POSSIBILITY_ONE_FULLY_TERMINATED;
    uint64_t incomplete = 0x0;
    int incomplete_bits = 0;

    for (int i = 0; i < bytes; i++) {
        /* Load state: the table is indexed by expected state and byte value */
        const Lookup *state = &lookup_table[next_state * 256 + data[i]];

        /* Directly output any numbers fully terminated in the current byte */
        if (state->ready_nb) {
            memcpy(output, state->ready_out, state->ready_nb * sizeof(int));
            output += state->ready_nb;
        }

        /* Save incomplete state */
        append_bits(&incomplete, &incomplete_bits,
                    state->incomplete, state->incomplete_bits);

        /* Output if the byte has terminated the sequence */
        if (state->terminate) {
            *output++ = read_sie_golomb(incomplete);
            incomplete = incomplete_bits = 0;
        }

        /* Carry over the state for the next byte */
        next_state = state->next_state;
    }
}
And so, with this pseudocode, we can parse Interleaved Signed exp-Golomb codes
at a speed at least a few times faster than a naive implementation. Generating
the lookup tables is a simple matter of iterating through all numbers from 0 to 255
for every possibility from the four types and trying to decode the golomb codes in them.
There are more optimizations to do, such as, instead of storing the bits of the incomplete
code, storing a partially decoded version of them so that decoding is entirely skipped.
And, given the state can neatly fit into a 128 bit register, SIMD is also
possible, though limited. All of this is outside the scope of this already long
article.
exp-Golomb codes, simple to naïvely decode, not that difficult to optimize,
have been the go-to for any codec that needs speed and doesn't need full entropy
encoding to save a few percent.
Do they still have a use nowadays? Not really. Fast, multisymbol range entropy
encoders have been around for more than a decade. They're quick to decode in software,
can be SIMD'd if they are adaptive and in general save you enough to make up for
the performance loss. And after all, the best way to speed up parsing is to just
have less overall data to parse.
Appendix: aren't exp-Golomb codes just Variable Length Codes?
Short answer: yes, but you shouldn't use a Variable Length Code parser to decode them.
Many codecs specify large tables of bit sequences and their lengths where each entry
maps to a single number. Unlike an exp-Golomb, there's no correlation necessary between
bits and the final parsed number, e.g. 0xff can map to 0 just as it can map to 255.
Unless a codec specifies a very low maximum number that can be encoded in a valid
bitstream with exp-Golomb, using a VLC parser is not feasible, as even after
quantization, the encoded numbers as binary sequences will likely exceed 32 bits,
and having lookup tables larger than 256Kb evaporates any performance gained.
VC-2 uses wavelets for transforms, which are a well known patent minefield ↩
Replacing a DCT with a Wavelet does however provide potential latency improvements
for ASICs and FPGAs, though for worse frequency decomposition ʰᵉˡˡᵒ ᴶᴾᴱᴳ²⁰⁰⁰ ↩
Whilst many articles and posts exist on how to set up DASH, most assume some sort of
underlying infrastructure, many are outdated, and others don't specify enough or are simply vague.
This post aims to explain, from the top down, how to do DASH streaming, without involving
nginx-rtmp or any other antiquated methods.
There are plenty of examples on how to use dash.js and/or video.js
and videojs-contrib-dash, and you can just copy-paste something cargo-culted to quickly get up and running.
But do you really need 3 JS frameworks? As it turns out, you absolutely do not.
Practically all of the examples and tutorials use ancient versions of video.js.
Modern video.js version 7 needs neither dash.js nor videojs-contrib-dash, since it already
comes prepackaged with everything you need to play both DASH and HLS.
<html><head><title>Live</title></head><body><linkhref="<< PATH TO video-js.min.css >>"rel="stylesheet"/><scriptsrc="<< PATH TO video.min.js >>"></script><div><video-jsid="live-video"width="100%"height="auto"controlsposter="<< LINK TO PLAYER BACKGROUND >>"class="vjs-default-skin vjs-16-9"rel="preload"preload="auto"crossorigin="anonymous"><pclass="vjs-no-js">
To view this video please enable JavaScript, and/or consider upgrading to a
web browser that
<ahref="https://videojs.com/html5-video-support/"target="_blank">
supports AV1 video and Opus audio
</a></p></video-js></div><script>varplayer=videojs('live-video',{"liveui":true,"responsive":true,});player.ready(function(){player.src({/* Silences a warning about how the mime type is unsupported with DASH */src:document.getElementById("stream_url").href,type:document.getElementById("stream_url").type,});player.on("error",function(){error=player.error();if(error.code==4){document.querySelector(".vjs-modal-dialog-content").textContent="The stream is offline right now, come back later and refresh.";}else{document.querySelector(".vjs-modal-dialog-content").textContent="Error: "+error.message;}});player.on("ended",function(){document.querySelector(".vjs-modal-dialog-content").textContent="The stream is over.";});})</script><aid="stream_url"href="<< LINK TO PLAYLIST >>"type="application/dash+xml">
Direct link to stream.
</a></body></html>
This example, although simplistic, is fully adequate to render a DASH livestream, with
the client adaptively selecting between screen sizes and displays.
Let me explain some details:
crossorigin="anonymous" sends anonymous CORS requests
so that everything still works if your files and playlists are on a different server. NOTE: this does not apply to the DASH UTC timing URL; you'll still need that
on your server. It's unclear whether this is a video.js bug or not.
width="100%" height="auto" keeps the player size constant to the page width.
"liveui": true, enables a new video.js interface that allows for seeking into buffers
of livestreams. You can rewind a limited amount (determined by the server and somewhat
the client) but its a very valuable ability. Its not currently (as of video.js 7.9.2)
enabled by default as it breaks some IE versions and an ancient IOS version, but
if you're going to be streaming using modern codecs (you are, right?) they'd be
broken anyway.
"responsive": true, just makes the UI scale along with the player size.
if (error.code == 4) { is an intentional hack. video.js returns the standard
HTML5 MediaError code,
which unfortunately maps MEDIA_ERR_SRC_NOT_SUPPORTED (value 4) to many errors,
including a missing source file. Which it would be if the stream isn't running.
It's easy to add statistics by adding this to player.ready(function() { and having
a <p id="player_stats"></p> paragraph anywhere on the webpage:
If you'd like to simply relay an incoming Matroska, SRT or RTMP stream, just make
sure you can access the destination folder via nginx and that you have the correct
permissions set. For an example server-side FFmpeg command line, you can use this:
This will create a playlist, per-stream init files and segment files in the destination
folder. It will also fully manage all the files it creates in the destination folder,
such as modifying them and deleting them.
There are quite a lot of options, so going through them:
remove_at_exit 1 just deletes all segments and the playlist on exit.
seg_duration 2 determines the segment duration in seconds. Segments must start with a keyframe,
so the duration must be a multiple of the keyframe period. Directly correlates with latency.
target_latency 2 sets the latency for L-DASH only. Players usually don't respect this. Should match segment duration.
frag_type duration sets that the segments should be further divided into fragments based on duration.
Don't use other options unless you know what you're doing.
frag_duration 0.1 sets the duration of each fragment in seconds. One fragment every 0.1 seconds is a good number.
Should NOT be an irrational number, otherwise you'll run into timestamp rounding issues. Hence you
should not use frag_type every_frame, since all it does is set the fragment duration to that of a single frame.
window_size 10 sets how many segments to keep in the playlist before removing them from the playlist.
extra_window_size 3 sets how many old segments to keep once off the playlist before deleting them, helps bad connections.
streaming 1 self explanatory.
ldash 1 low-latency DASH. Will write incomplete files to reduce latency.
write_prft 1 writes info using the UTC URL. Auto-enabled for L-DASH, but doesn't hurt to enable it here.
use_template 1 instead of writing a playlist containing all current segments and updating it on every
segment, just writes the playlist once, specifying a range of segments, how long they are, and how long
they're around. Very recommended.
use_timeline 0 is a playlist option that fully disables old-style, non-templated playlists.
index_correction 1 tries to fix the segment index if the input stream is incompatible, weird or lagging.
If anything, it serves as a good indicator of whether your input is such, as it'll warn if it corrects
an index.
fflags +nobuffer+flush_packets disables as much caching in libavformat as possible to keep
the latency down. Can save up to a few seconds of latency.
format_options "movflags=+cmaf" is required for conformance.
adaptation_sets "id=0,streams=0 id=1,streams=1" sets the adaptation sets, e.g. separate streams with different
resolution or bitrate which the client can adapt with.
id=0 sets the first adaptation set, which should contain only a single type of streams (e.g. video or audio only).
frag_type and frag_duration can be set here to override the fragmentation on a per-adaptation stream basis.
streams=0 a comma separated list of FFmpeg stream indices to put into the adaptation set.
utc_timing_url must be set to the URL which you setup in the previous section.
init_seg_name and media_seg_name just setup a nicer segment directory layout.
On your client, set the keyframe period to the segment duration times your framerate: 2 seconds per segments * 60 frames per second = 120 frames for the keyframe period.
For some security on the source (ingest) connection, you can try forwarding via the various SSH options,
or use the server as a VPN. Or, if you can SSHFS into the server, don't mind forgoing Matroska, SRT or RTMP,
and are the only person using the server, you can run the same command line on your client against an SSHFS directory.
The approach above works okay, but what if you want the ultimate low latency, actual security, and the ability
to use codecs newer than 20 years old (and don't want to experiment with using Matroska as an ingest format)?
You can just generate DASH on the upload side itself (how unorthodox) and upload it, without
having FFmpeg running on your server. However, it's more complicated.
First, you'll need dash_server.py. It creates a server which
proxies the requests from nginx for both uploading and downloading (so you still get caching).
It can also be used standalone without nginx for testing, but we're not focusing on this.
Follow the provided example nginx_config in the project's root directory and add
# define connection to dash_server.py
upstream dash_server_py {
    server [::1]:8000;
}
to the base of your nginx website configuration.
Then, create your uploading server:
# this server handles media ingest
# authentication is handled through TLS client certificates
server {
    # network config
    listen [::]:8001 ssl default_server;
    server_name <ingest server name>;

    # server's TLS cert+key
    ssl_certificate <path to TLS cert>;
    ssl_certificate_key <path to TLS key>;
    #ssl_dhparam <path to DH params, optional>;

    # source authentication with TLS client certificates
    ssl_client_certificate <path to CA for client certs>;
    ssl_verify_client on;

    # only allow upload and delete requests
    if ($request_method !~ ^(POST|PUT|DELETE)$) {
        return 405; # Method Not Allowed
    }

    root <path to site root>;

    # define parameters for communicating with dash_server.py
    # enable chunked transfers
    proxy_http_version 1.1;
    proxy_buffering off;
    proxy_request_buffering off;

    # finish the upload even if the client does not bother waiting for our
    # response
    proxy_ignore_client_abort on;

    location /live/ {
        proxy_pass http://dash_server_py;
    }
}
You'll need 2 certificates on the server - one for HTTPS (which you can just let certbot manage)
and one for client authentication, which you'll need to create yourself.
While it's less practical than a very long URL, it provides actual security.
You can use openssl to generate the client certificate.
Then, in the same server where you host your website and UTC time URL, add this
section:
Now, you can just run python3 dash_server.py -p 8000 as any user on your server and
follow the client-side DASH setup section below
to start sending data to it.
All the options are described above in the FFmpeg relay section,
but there are a few new ones we need:
timeout 0.2 sets a timeout for each upload operation to complete before abandoning it. Helps robustness.
ignore_io_errors 1 does not error out if an operation times out. Obviously helps robustness.
http_persistent 0 disables persistent HTTP connections due to a dash_server.py bug. Post will be updated
if it gets fixed. Set it to 1 if you're using a CDN.
http_opts sets up the certificates to use for authentication with the server. Most CDNs use URL 'security'
so this option should be omitted there.
The ffmpeg CLI is by no means the only tool that can directly output DASH. Any program which can use libavformat,
such as OBS, various media players with recording functionality, and even CD rippers, can do so too.
In fact, I'm working on a fully scriptable compositing and streaming program called txproto
which accepts the same options as the ffmpeg CLI.
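(For context: the function in question is Opus's deemphasis filter. In C it looks roughly like this - a sketch along the lines of FFmpeg's opusdsp code, assuming CELT_EMPH_COEFF is 0.85f.)

static float deemphasis_c(float *y, float *x, float coeff, int len)
{
    /* Each output feeds back into the next one through coeff. */
    for (int i = 0; i < len; i++)
        coeff = y[i] = x[i] + coeff * CELT_EMPH_COEFF;

    return coeff;
}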
As you can see, each new output depends on the previous result. This might look like the
worst possible code to SIMD, and indeed it's very effective at repelling any casual attempts to
do so, or to even gauge how much you'd gain.
But let's proceed anyway.
Since each operation depends on the previous, and there's no way of getting around this fact,
your only choice is to duplicate the operations done for each previous output:
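Unrolled by four, that duplication looks roughly like this (again a sketch):

static float deemphasis_unrolled_c(float *y, float *x, float coeff, int len)
{
    const float c1 = CELT_EMPH_COEFF;
    const float c2 = c1 * c1, c3 = c2 * c1, c4 = c3 * c1;

    /* Every output is expressed only in terms of the four new inputs and
     * the previous iteration's final output. */
    for (int i = 0; i < len; i += 4) {
        y[i + 0] = x[i + 0] + c1 * coeff;
        y[i + 1] = x[i + 1] + c1 * x[i + 0] + c2 * coeff;
        y[i + 2] = x[i + 2] + c1 * x[i + 1] + c2 * x[i + 0] + c3 * coeff;
        y[i + 3] = x[i + 3] + c1 * x[i + 2] + c2 * x[i + 1] + c3 * x[i + 0] + c4 * coeff;
        coeff = y[i + 3];
    }

    return coeff;
}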
Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation
on such registers takes the same amount of cycles as if you were working with scalars, the potential
4x performance gain fades away from your expectations as you count the total operations that need
to be performed, which, excluding loads and writes, adds up to 4 multiplies and 5 additions.
Moreover, each successive output requires a shuffle for the input to match up, and CPUs in general
only have a single unit capable of doing that, leading to high latencies.
Whilst we could get away with copying just a single element for the last output, extracting and
inserting scalar data in vector registers is so painfully slow that on some occasions
it's better to just store certain parts of the register via movsd [outq], m0, or similarly load
them via movsd/movhps xm0, [inq], and ignore what happens in the rest. Hence we need to use a
full-width shuffle.
; 0.85..^1 0.85..^2 0.85..^3 0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f

SECTION .text

INIT_XMM fma3
cglobal opus_deemphasis, 3, 3, 8, out, in, len
    ; coeff is already splatted in m0 on UNIX64
    movaps m4, [tab_st]
    VBROADCASTSS m5, m4
    shufps m6, m4, m4, q1111
    shufps m7, m4, m4, q2222

.loop:
    movaps m1, [inq]        ; x0, x1, x2, x3
    pslldq m2, m1, 4        ; 0,  x0, x1, x2
    pslldq m3, m1, 8        ; 0,  0,  x0, x1

    fmaddps m2, m2, m5, m1  ; x + c1*x[0-2]
    pslldq m1, 12           ; 0,  0,  0,  x0

    fmaddps m2, m3, m6, m2  ; x + c1*x[0-2] + c2*x[0-1]
    fmaddps m1, m1, m7, m2  ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
    fmaddps m0, m0, m4, m1  ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff

    movaps [outq], m0
    shufps m0, m0, q3333    ; new coeff

    add inq,  mmsize
    add outq, mmsize
    sub lend, mmsize >> 2
    jg .loop

RET
We can increase speed and precision by combining the multiply and sum operations in a single fmaddps
macro, which x86inc
does magic on to output one of the new 3-operand Intel fused multiply-adds, based on operand order. Old AMD 4-operand style FMAs
can also be generated, but considering AMD themselves dropped support for that2, it would only serve to waste binary space.
Since all we're doing is shifting data within the 128-bit register, instead of shufps we can use pslldq.
Old CPUs used to have penalties for using instructions of a different type to the one the vector contains, e.g.
using mulps marked the resulting register as float, and using pxor on it would incur a slowdown, as the CPU
had to switch transistors to route the register to a different unit. Newer CPUs don't have that, as their units
are capable of both float and integer operations, or the penalty is so small it's immeasurable.
Let's run FFmpeg's make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench to see how slow we are.
The units are decicycles, but they are irrelevant, as we're only interested in the ratio between the
C and FMA3 versions, and in this case that's a 7x speedup - a lot more than the theoretical 4x speedup
we could get with 4x32 vector registers.
To explain how, take a look at the unrolled version again: for (int i = 0; i < len; i += 4).
We're running the loop 4 times less often than the rolled version by doing more per iteration.
But we're also not waiting for each previous operation to finish to produce a single output.
Instead, we're waiting for only the last one to complete before running another iteration of the loop,
3 times less than before.
This delay simply doesn't exist on more commonly SIMD'd functions like a dot product, and our assumption
of a 4x maximum speedup is only valid for that case.
To be completely fair, we ought to be comparing the unrolled version to the handwritten assembly version,
which reduced the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly).
But in the interest of code readability and simplicity, we're keeping the small rolled version.
Besides, we can afford to, as we usually end up writing more assembly for other popular platforms like
aarch64,
and so it's rare that unoptimized functions get used.
Unfortunately the aarch64 version is a lot slower than the x86 version, at least on the ODROID C2 I'm able to test on:
The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on
ARM requires elevated privileges and, on some devices such as the C2, even closed-source binary blobs.
The speedup on the C2 was barely 2.10x. In general, ARM CPUs have a lot less advanced lookahead
and speculative execution than x86 CPUs. Some are even still in-order, like the Raspberry Pi 3's CPU.
And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup
was observed.
In conclusion, traditional analog filter SIMD can be weird and impenetrable at first or second glance,
but in certain cases just trying the dumb thing can yield much greater gains than expected.
Yes, this is ignoring non-UNIX64 platforms, take a look at the FFmpeg
source
to find out how they're handled and weep at their ABI ineptitude ↩
fun fact: first-generation Ryzen CPUs don't signal
the FMA4 flag but will still execute the instructions fine ↩