Lynne's compiled musings


IIR filter SIMD and instruction dependencies

Last year I wrote some Opus emphasis filter SIMD. Let's take a look at the C version for the inverse filter:

static float deemphasis_c(float *y, float *x, float coeff, int len)
    for (int i = 0; i < len; i++)
        coeff = y[i] = x[i] + coeff * 0.85f;

    return coeff;

As you can see, each new output depends on the previous result. This might look like the worst possible code to SIMD, and indeed its very effective at repelling any casual attempts to do so or to even gauge how much gains you'd get.

But lets proceed anyway.

Since each operation depends on the previous, and there's no way of getting around this fact, your only choice is to duplicate the operations done for each previous output:

static float deemphasis_c(float *y, float *x, float coeff, int len)
    const float c1 = 0.85, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*coeff +  0*x[2] +  0*x[1] +  0*x[0];
        y[1] = x[1] + c2*coeff +  0*x[2] +  0*x[1] + c1*x[0];
        y[2] = x[2] + c3*coeff +  0*x[2] + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*coeff + c1*x[2] + c2*x[1] + c3*x[0];

        coeff = y[3];
        y += 4;
        x += 4;

    return coeff;

Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation on such registers takes the same amount of cycles as if you're working with scalars, the potential 4x performance gains fade away from your expectations as you count the total operations that need to be performed, which, excluding loads and writes, adds up to 4 multiplies and 5 additions. Moreover, each sucessive output requires a shuffle for the input to match up, and CPUs in general only have a single unit capable of doing that, leading to high latencies.

Whilst we could get away with copying just a single element for the last output, extracting and inserting scalar data in vector registers is so painfully slow that on some occasions its better to just save via movsd [outq], m0 or similarly load via movsd/movhps xm0, [inq] certain parts of the register and ignore what happens in the rest. Hence we need to use a full-width shuffle.

But lets proceed anyway.1

         ; 0.85..^1    0.85..^2    0.85..^3    0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f


cglobal opus_deemphasis, 3, 3, 8, out, in, len
    ; coeff is already splatted in m0 on UNIX64

    movaps m4, [tab_st]
    shufps m6, m4, m4, q1111
    shufps m7, m4, m4, q2222

    movaps  m1, [inq]                ; x0, x1, x2, x3

    pslldq  m2, m1, 4                ;  0, x0, x1, x2
    pslldq  m3, m1, 8                ;  0,  0, x0, x1

    fmaddps m2, m2, m5, m1           ; x + c1*x[0-2]
    pslldq  m1, 12                   ;  0,  0,  0, x0

    fmaddps m2, m3, m6, m2           ; x + c1*x[0-2] + c2*x[0-1]
    fmaddps m1, m1, m7, m2           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
    fmaddps m0, m0, m4, m1           ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff

    movaps [outq], m0
    shufps m0, m0, q3333             ; new coeff

    add inq,  mmsize
    add outq, mmsize
    sub lend, mmsize >> 2
    jg .loop


We can increase speed and precision by combining the multiply and sum operations in a single fmaddps macro, which x86inc does magic on to output one of the new 3-operand Intel fused multiply-adds, based on operand order. Old AMD 4-operand style FMAs can also be generated, but considering AMD themselves dropped support for that2, it would only serve to waste binary space.

Since all we're doing is we're shifting data in the 128-bit register, instead of shufps we can use pslldq.
Old CPUs used to have penalties for using instructions of different type to the one the vector has, e.g. using mulps marked the resulting register as float and using pxor on it would incur a slowdown as the CPU had to switch transistors to route the register to a different unit. Newer CPUs don't have that as their units are capable of both float and integer operations, or the penalty is so small its immeasurable.

Let's run FFmpeg's make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench to see how slow we are.

benchmarking with native FFmpeg timers
nop: 32.8
checkasm: using random seed 524800271
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1024 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 7344.0
deemphasis_fma3: 1055.3

The units are decicycles, but they are irrelevant as we're only interested in the ratio between the C and FMA3 versions, and in this case that's a 7x speedup. A lot more than the theoretical 4x speedup we could get with 4x32 vector registers.

To explain how, take a look at the unrolled version again: for (int i = 0; i < len; i += 4). We're running the loop 4 times less than the unrolled version by doing more. But we're also not waiting for each previous operation to finish to produce a single output. Instead we're waiting for only the last one to complete to run another iteration of the loop, 3 times less than before.
This delay simply doesn't exist on more commonly SIMD'd functions like a dot product, and our assumption of a 4x maximum speedup is only valid for that case.

To be completely fair, we ought to be comparing the unrolled version to the handwritten assembly version, which reduced the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly). But in the interest of code readability and simplicity, we're keeping the small rolled version. Besides, we can afford to, as we usually end up writing more assembly for other popular platforms like aarch64 and so its rare that unoptimized functions get used.

Unfortunately the aarch64 version is a lot slower than x86's version, at least on the ODROID C2 I'm able to test on:

benchmarking with Linux Perf Monitoring API
nop: 66.1
checkasm: using random seed 585765880
 - opusdsp.postfilter_15   [OK]
 - opusdsp.postfilter_512  [OK]
 - opusdsp.postfilter_1022 [OK]
 - opusdsp.deemphasis      [OK]
checkasm: all 4 tests passed
deemphasis_c: 8680.7
deemphasis_neon: 4134.7

The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on ARM requires high privileges and even closed source binary blobs such as on the C2.
The speedup on the C2 was barely 2.10x. In general ARM CPUs have a lot less advanced lookahead and speculative execution than x86 CPUs. Some are even still in-order like the Raspberry PI 3's CPU. And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup was observed.

In conclusion traditional analog filter SIMD can be weird and impenetrable on a first or second glance, but in certain cases just trying and doing the dumb thing can yield much greater gains than expected.

  1. Yes, this is ignoring non-UNIX64 platforms, take a look at the FFmpeg source to find out how they're handled and weep at their ABI ineptitude
  2. fun fact: CPUs before Ryzen don't signal the FMA4 flag but will still execute the instructions fine
 ·  x86 neon  ·  CC-BY logo