## IIR filter SIMD and instruction dependencies

Last year I wrote some Opus emphasis filter SIMD. Let's take a look at the C version for the inverse filter:

```
static float deemphasis_c(float *y, float *x, float coeff, int len)
{
for (int i = 0; i < len; i++)
coeff = y[i] = x[i] + coeff * 0.85f;
return coeff;
}
```

As you can see, each new output depends on the previous result. This might look like the worst possible code to SIMD, and indeed its very effective at repelling any casual attempts to do so or to even gauge how much gains you'd get.

But lets proceed anyway.

Since each operation depends on the previous, and there's no way of getting around this fact, your only choice is to duplicate the operations done for each previous output:

```
static float deemphasis_c(float *y, float *x, float coeff, int len)
{
const float c1 = 0.85, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
for (int i = 0; i < len; i += 4) {
y[0] = x[0] + c1*coeff + 0*x[2] + 0*x[1] + 0*x[0];
y[1] = x[1] + c2*coeff + 0*x[2] + 0*x[1] + c1*x[0];
y[2] = x[2] + c3*coeff + 0*x[2] + c1*x[1] + c2*x[0];
y[3] = x[3] + c4*coeff + c1*x[2] + c2*x[1] + c3*x[0];
coeff = y[3];
y += 4;
x += 4;
}
return coeff;
}
```

Even though you can have 128-bit registers capable of storing 4 32-bit floats, and each operation on such registers takes the same amount of cycles as if you're working with scalars, the potential 4x performance gains fade away from your expectations as you count the total operations that need to be performed, which, excluding loads and writes, adds up to 4 multiplies and 5 additions. Moreover, each sucessive output requires a shuffle for the input to match up, and CPUs in general only have a single unit capable of doing that, leading to high latencies.

Whilst we could get away with copying just a single element for the last output, extracting and
inserting scalar data in vector registers is so painfully slow that on some occasions
its better to just save via `movsd [outq], m0`

or similarly load via `movsd/movhps xm0, [inq]`

certain parts of the register and ignore what happens in the rest. Hence we need to use a full-width
shuffle.

But lets proceed anyway.^{1}

```
; 0.85..^1 0.85..^2 0.85..^3 0.85..^4
tab_st: dd 0x3f599a00, 0x3f38f671, 0x3f1d382a, 0x3f05a32f
SECTION .text
INIT_XMM fma3
cglobal opus_deemphasis, 3, 3, 8, out, in, len
; coeff is already splatted in m0 on UNIX64
movaps m4, [tab_st]
VBROADCASTSS m5, m4
shufps m6, m4, m4, q1111
shufps m7, m4, m4, q2222
.loop:
movaps m1, [inq] ; x0, x1, x2, x3
pslldq m2, m1, 4 ; 0, x0, x1, x2
pslldq m3, m1, 8 ; 0, 0, x0, x1
fmaddps m2, m2, m5, m1 ; x + c1*x[0-2]
pslldq m1, 12 ; 0, 0, 0, x0
fmaddps m2, m3, m6, m2 ; x + c1*x[0-2] + c2*x[0-1]
fmaddps m1, m1, m7, m2 ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0]
fmaddps m0, m0, m4, m1 ; x + c1*x[0-2] + c2*x[0-1] + c3*x[0] + c1,c2,c3,c4*coeff
movaps [outq], m0
shufps m0, m0, q3333 ; new coeff
add inq, mmsize
add outq, mmsize
sub lend, mmsize >> 2
jg .loop
RET
```

We can increase speed and precision by combining the multiply and sum operations in a single `fmaddps`

macro, which x86inc
does magic on to output one of the new 3-operand Intel fused multiply-adds, based on operand order. Old AMD 4-operand style FMAs
can also be generated, but considering AMD themselves dropped support for that^{2}, it would only serve to waste binary space.

Since all we're doing is we're shifting data in the 128-bit register, instead of `shufps`

we can use `pslldq`

.

Old CPUs used to have penalties for using instructions of different type to the one the vector has, e.g.
using `mulps`

marked the resulting register as float and using `pxor`

on it would incur a slowdown as the CPU
had to switch transistors to route the register to a different unit. Newer CPUs don't have that as their units
are capable of both float and integer operations, or the penalty is so small its immeasurable.

Let's run FFmpeg's `make checkasm && ./tests/checkasm/checkasm --test=opusdsp --bench`

to see how slow we are.

```
benchmarking with native FFmpeg timers
nop: 32.8
checkasm: using random seed 524800271
FMA3:
- opusdsp.postfilter_15 [OK]
- opusdsp.postfilter_512 [OK]
- opusdsp.postfilter_1024 [OK]
- opusdsp.deemphasis [OK]
checkasm: all 4 tests passed
deemphasis_c: 7344.0
deemphasis_fma3: 1055.3
...
```

The units are decicycles, but they are irrelevant as we're only interested in the ratio between the C and FMA3 versions, and in this case that's a 7x speedup. A lot more than the theoretical 4x speedup we could get with 4x32 vector registers.

To explain how, take a look at the unrolled version again: `for (int i = 0; i < len; i += 4)`

.
We're running the loop 4 times less than the unrolled version by doing more. But we're also not waiting
for each previous operation to finish to produce a single output. Instead we're waiting for only the last
one to complete to run another iteration of the loop, 3 times less than before.

This delay simply doesn't exist on more commonly SIMD'd functions like a dot product, and our assumption
of a 4x maximum speedup is only valid for that case.

To be completely fair, we ought to be comparing the unrolled version to the handwritten assembly version,
which reduced the speedup to 5x (5360.9 decicycles vs 1016.7 decicycles for the handwritten assembly).
But in the interest of code readability and simplicity, we're keeping the small *rolled* version.
Besides, we can afford to, as we usually end up writing more assembly for other popular platforms like
aarch64
and so its rare that unoptimized functions get used.

Unfortunately the aarch64 version is a lot slower than x86's version, at least on the ODROID C2 I'm able to test on:

```
benchmarking with Linux Perf Monitoring API
nop: 66.1
checkasm: using random seed 585765880
NEON:
- opusdsp.postfilter_15 [OK]
- opusdsp.postfilter_512 [OK]
- opusdsp.postfilter_1022 [OK]
- opusdsp.deemphasis [OK]
checkasm: all 4 tests passed
deemphasis_c: 8680.7
deemphasis_neon: 4134.7
...
```

The units are arbitrary numbers the Linux perf API outputs, since access to the cycle counter on
ARM requires high privileges and even closed source binary blobs such as on the C2.

The speedup on the C2 was barely 2.10x. In general ARM CPUs have a lot less advanced lookahead
and speculative execution than x86 CPUs. Some are even still in-order like the Raspberry PI 3's CPU.
And to even get that much, the assembly loop had to be unrolled twice, otherwise only a ~1.2x speedup
was observed.

In conclusion traditional analog filter SIMD can be weird and impenetrable on a first or second glance,
but in certain cases just trying and doing the *dumb* thing can yield much greater gains than expected.