Lynne's compiled musings

20-08-30

Fast parsing of Interleaved Signed exp-Golomb codes

Eventually, all well optimized decoder implementations of codecs will hit a bottleneck, which is the speed at which they're able to parse data from the bitstream. Spending the time to optimize this parsing code is usually not worth it unless users are actually starting to hit this and its preventing them from scaling up. For a complicated filter-heavy codec such as AV1, this becomes a problem quite late, especially as the specifications limit the maximum bitrate to around 20Mbps, and care was taken to allow easy SIMDing of parsing code during the writing of the spec. For simple mezzanine codecs such as ProRes, DNxHD, CFHD, Dirac/VC-2 or even Intra-only VLC H.264, where bitrates of the order of hundreds of Mpbs, optimizing bitstream parsing is usually importance number one.

The context we'll be using while looking at bitstream decoding is that within the VC-2 codec, which is essentially a stripped down version of Dirac, made by the BBC for allegedly 1 patent unencumbered near-lossless video transmission over bandwidth limited connectivity (think 1.5Gbps 420 1080p60 over a 1Gbps connection).

To achieve this, the pixels are first transformed using one of many possible wavelet transforms, then quantized by means of division, and encoded. The wavelet transforms used are simple and easy to SIMD, the quantization is somewhat tricky but still decently fast as its just a multiply and an add per coefficient. Boring, standard, old, uninspired and uninteresting, the basis of anyone's first look-I-wrote-my-own-codec 2. Which leaves only the quantized coefficient encoding.

The coefficients are encoded using something called Interleaved Signed exp-Golomb codes. We'll go over this word by word, in reverse.

exp-Golomb codes, or exponential-Golomb for long, or just Golomb to those lazy and who know their codecs, are a form of binary encoding of arbitrarily sized integers, where the length is encoded as a prefix preceding the data. To illustrate:

0 => +1 => 1 => 1 => 1
1 => +1 => 2 => 10 => 010
2 => +1 => 3 => 11 => 011
3 => +1 => 4 => 100 => 00100
4 => +1 => 5 => 101 => 00101
5 => +1 => 6 => 110 => 00110
6 => +1 => 7 => 111 => 00111
7 => +1 = 8 => 1000 => 0001000
...

1 is added to the number if encoding a 0 is necessary, since otherwise encoding it would take 0 bits. The prefix is just the amount of bits after the most significant non-zero bit for the integer, minus one, encoded as a sequence of zeroes. This encoding doesn't have any interesting properties about it and is simple to naïvely decode.

Signed exp-Golomb codes are just the same as above, only an additional bit is appended at the end to signal a sign. By universal convention, 1 encodes a negative number, and 0 encodes a positive number. The bit is not signalled if the number is 0.

Interleaved exp-Golomb codes take the same amount of bits to encode as regular Golomb, however on a first glance they are very different:

0 => +1 => 1 => 1 => 1
1 => +1 => 2 => 10 => 001
2 => +1 => 3 => 11 => 011
3 => +1 => 4 => 100 => 00001
4 => +1 => 5 => 101 => 00011
5 => +1 => 6 => 110 => 01001
6 => +1 => 7 => 111 => 01011
7 => +1 = 8 => 1000 => 0000001
...

As the number of bits hasn't changed, and there are still the same amount of zeroes as in normal exp-Golomb codes, the prefix is still there. Its just interleaved, where every odd bit (except the last one) is a 0, while every even bit encodes the integer. The reason why it looks so different is that with this coding scheme, coding the very first non-zero bit is unnecessary, hence its implicitly set to 1 when decoding.

Interleaved Signed exp-Golomb codes are finally just an interleaved exp-golomb code with an additional bit at the end to signal a sign. That bit is of course not signalled if the number encoded is a 0.

A more convenient way to think about interleaved exp-golomb codes is that every odd bit is actually a flag that tells you whether the number has ended (a 1) or that the next bit is part of the number (a 0). A simple parser for signed codes would then look like this:

int read_sie_golomb()
{
    uint32_t val = 0x1;

    while (!get_bit()) {
        val <<= 1;
        val |= get_bit();
    }

    val -= 1;

    if (val && get_bit())
        val *= -1;

    return val;
}

Looks simple, and the loop has 3 instructions, so it should be fast, right? Bitstream readers are however, not exactly simple, not exactly have easily predicted branches, which makes them not exactly fast.

CPUs like it when the data you're looking at has a size of a power of two, with the smallest unit being a byte. Bytes can encode 256 total possibilities, which isn't exactly large, but if we could process a byte at a time rather than a bit at a time, we could have a potential 8x speedup.

01110010 is a sequence which encodes a -2 and a 1, both signed, and is exactly 8 bits. So if we make a lookup table, we can say that the byte contains a -2 and a 1, directly output the data to some array, and move on cleanly to the next byte.
This is possibility number one, where the byte contains all bits of the numbers present.

01101001 0xxxxxxx is a sequence which encodes a 2, a 0, and a -1. Unlike the previous example, all the numbers are terminated in a single byte, with only the sign bit missing. This is hence possibility number two, where the current byte has leftover data that needs exactly 1 bit from the next byte.

01011101 10xxxxxx is a sequence which encodes a -6 and a 2. Its 10 bits in length, so the last 2 bits of the 2 spill over into the next byte. We can output a -6, save the uncompleted bits of the 2, and move over to the next byte where we can combine the unterminated data from the previous byte with the data from the current byte.
However there's more to this. In the previous example, the -6 ended on an odd bit, making an even bit the start of the 2. As we know, the terminating bit of an interleaved exp-Golomb code will always be after an odd number of bits since the start. So we know that whenever the sequence ends, whether it be the next byte or the current byte, the ending bit of the sequence must be at an odd position. In other words, this is possibility number three, where the current byte is missing some data and needs the next, and possibly more bytes to complete, with the data ending at an odd position.
Of course, there's the possibility that the sequence will end on an even bit, such as with 01011110 110xxxxx (-6, 0, 0, 2), making this the final possibility number four.

So, with this we can exactly determine what the current byte contains, and we can know what we need to expect in the next byte. We know what we need to keep as a state (what we expect from the next byte and any unterminated data from this byte), so we can make a stateful parser:

#define POSSIBILITY_ONE_FULLY_TERMINATED 0
#define POSSIBILITY_TWO_SIGN_BIT 1
#define POSSIBILITY_THREE_ODD_BIT_TERMINATE 2
#define POSSIBILITY_FOUR_ODD_BIT_TERMINATE 3

typedef struct Lookup {
    int ready_nb;
    int *ready_out;

    uint64_t incomplete;
    int incomplete_bits;

    int next_state;
} Lookup;

static const Lookup lookup_table[] = {
    /* 0    - 255 = POSSIBILITY_ONE_FULLY_TERMINATED */
    /* 256  - 511 = POSSIBILITY_TWO_SIGN_BIT */
    /* 512  - 767 = POSSIBILITY_THREE_ODD_BIT_TERMINATE */
    /* 768 - 1023 = POSSIBILITY_FOUR_ODD_BIT_TERMINATE */
};

void read_golomb(int *output, uint8_t *data, int bytes)
{
    int next_state = POSSIBILITY_ONE_FULLY_TERMINATED;
    uint64_t incomplete = 0x0;
    int incomplete_bits = 0;

    for (int i = 0; i < bytes; i++) {
        /* Load state */
        Lookup state = lookup_table[next_state * data[i]];

        /* Directly output any numbers fully terminated in the current byte */
        if (state->ready_nb) {
            memcpy(output, state->ready_out, state->ready_nb * sizeof(int));
            output += state->ready_nb;
        }

        /* Save incomplete state */
        append_bits(&incomplete, &incomplete_bits, state->incomplete, state->incomplete_bits);

        /* Output if the byte has terminated the sequence */
        if (state->terminate) {
            *output++ = read_sie_golomb(incomplete);
            incomplete = incomplete_bits = 0;
        }

        /* Carry over the state for the next byte */
        next_state = state->next_state;
    }
}

And so, with this pseudocode, we can parse Interleaved Signed exp-Golomb codes at a speed at least a few times faster than a naive implementation. Generating the lookup tables is a simple matter of iterating through all numbers from 0 to 255 for every possibility from the four types and trying to decode the golomb codes in them.

There are more optimizations to do, such as instead of storing bits for the incomplete code, storing the a decoded version of them such that decoding is entirely skipped. And, given the state can neatly fit into a 128 bit register, SIMD is also possible, though limited. All of this is outside the scope of this already long article.

exp-Golomb codes, simple to naïvely decode, not that difficult to optimize, have been the go-to for any codec that needs speed and doesn't need full entropy encoding to save a few percent.
Do they still have a use nowadays? Not really. Fast, multisymbol range entropy encoders have been around for more than a decade. They're quick to decode in software, can be SIMD'd if they are adaptive and in general save you enough to make up for the performance loss. And after all, the best way to speed up parsing is to just have less overall data to parse.

Appendix: aren't exp-Golomb codes just Variable Length Codes?

Short answer: yes, but you shouldn't use a Variable Length Code parser to decode them.
Many codecs specify large tables of bit sequences and their lengths where each entry maps to a single number. Unlike an exp-Golomb, there's no correlation necessary between bits and the final parsed number, e.g. 0xff can map to 0 just as how it can map to 255. Unless a codec specifies a very low maximum number that can be encoded in a valid bitstream with exp-Golomb, then using a VLC parser is not feasible as even after quantization, the encoded numbers in binary sequences will likely exceed 32 bits, and having lookup tables larger than 256Kb evaporates any performance gained.

  1. VC-2 uses wavelets for transforms, which are a well known patent minefield
  2. Replacing a DCT with a Wavelet does however provide potential latency improvements for ASICs and FPGAs, though for worse frequency decomposition ʰᵉˡˡᵒ ᴶᴾᴱᴳ²⁰⁰⁰
 ·  bitstream  ·  CC-BY logo