Arm: Conditionally negate state[{3,7}] to enable using SHA3 BCAX#31
georges-arm wants to merge 1 commit into aegis-aead:main
Nice! Is it something we can apply to other variants as well?
The `aegis128l_common.h` code contains repeated lines of paired XOR and
AND operations, for example:
    msg0 = AES_BLOCK_XOR(msg0, AES_BLOCK_AND(state[2], state[3]));
This is suboptimal on Arm because there is no single instruction that
performs both the XOR and the AND.
The FEAT_SHA3 extension includes the BCAX (bit-clear and XOR)
instruction, which computes `XOR(a, AND(b, NOT(c)))`. However, this
does not quite match the pattern above because BCAX negates `c`.
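A scalar model may make the mismatch concrete. The sketch below uses 64-bit words in place of 128-bit Neon vectors. `bcax64`, `xor_and64` and `check_bcax_identity` are hypothetical helpers written for this illustration, mirroring the semantics stated above rather than any real intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for BCAX: a ^ (b & ~c).
 * Hypothetical helper; 64-bit words replace 128-bit vectors. */
static uint64_t bcax64(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* The paired XOR/AND pattern: a ^ (b & c). */
static uint64_t xor_and64(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & c);
}

/* BCAX reproduces the XOR/AND pattern only when its last operand
 * has been negated beforehand. */
static void check_bcax_identity(uint64_t a, uint64_t b, uint64_t c)
{
    assert(xor_and64(a, b, c) == bcax64(a, b, ~c));
}
```

In other words, if the third operand is already held in negated form, the two operations collapse into one BCAX-shaped expression.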
To enable the BCAX instruction to be used, introduce a new
`AES_INVERT_STATE37` toggle to optionally store `state[3]` and
`state[7]` as bitwise-negated in `aegis128l_common.h`. With LLVM 22 this
is sufficient to have the compiler automatically make use of the BCAX
instructions so there is no need to use them explicitly.
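As a rough illustration of such a toggle, the sketch below stores a scalar word either as-is or negated, depending on a compile-time switch, and adapts the XOR/AND expression to match. `INVERT_STATE3`, `STORE_STATE3`, `XOR_AND_STATE3` and `mask_word` are hypothetical names for this sketch. Only `AES_INVERT_STATE37` appears in the PR, and the actual header may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compile-time switch, standing in for the PR's toggle. */
#ifndef INVERT_STATE3
#define INVERT_STATE3 1
#endif

#if INVERT_STATE3
/* Word is kept negated; the use site has the BCAX shape a ^ (b & ~c). */
#define STORE_STATE3(x)          (~(x))
#define XOR_AND_STATE3(a, b, s3) ((a) ^ ((b) & ~(s3)))
#else
/* Word is kept as-is; the use site is the plain XOR/AND pair. */
#define STORE_STATE3(x)          (x)
#define XOR_AND_STATE3(a, b, s3) ((a) ^ ((b) & (s3)))
#endif

/* Computes msg ^ (state2 & state3) from the stored form of state3. */
static uint64_t mask_word(uint64_t msg, uint64_t state2, uint64_t state3)
{
    uint64_t s3_stored = STORE_STATE3(state3);
    return XOR_AND_STATE3(msg, state2, s3_stored);
}

/* The result must not depend on which storage form is selected. */
static void check_mask_word(uint64_t msg, uint64_t state2, uint64_t state3)
{
    assert(mask_word(msg, state2, state3) == (msg ^ (state2 & state3)));
}
```

The point of the design is that the negation is paid once when the value is stored, while every use site gets the single-instruction form.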
Since `state[3]` and `state[7]` are now bitwise-negated, also update
`aegis128l_neon_sha3.c` to add a new `AES_ENC1` macro that undoes the
bitwise negation as part of the AESE instruction. The compiler will
ordinarily try to materialise the all-ones constant here in a
sub-optimal way, necessitating the use of inline assembly.
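The role of the all-ones constant can be sketched with a scalar model. AESE XORs its two operands before applying ShiftRows and SubBytes, so passing all-ones as the second operand cancels a stored negation inside the instruction itself. `aese_xor_step` below is a hypothetical stand-in for that first XOR step only. It does not implement the rest of the AES round, and it says nothing about how the PR's inline assembly is written.

```c
#include <assert.h>
#include <stdint.h>

/* Models only the initial XOR performed by AESE (data ^ key).
 * Hypothetical helper; ShiftRows and SubBytes are omitted. */
static uint64_t aese_xor_step(uint64_t data, uint64_t key)
{
    return data ^ key;
}

/* A negated word XORed with all-ones yields the original word, which
 * is what the un-negated word XORed with zero would have produced. */
static void check_negation_cancels(uint64_t x)
{
    uint64_t stored = ~x; /* negated storage form */
    assert(aese_xor_step(stored, UINT64_MAX) == aese_xor_step(x, 0));
    assert(aese_xor_step(stored, UINT64_MAX) == x);
}
```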
Benchmarking this on a range of Neoverse platforms with LLVM 22, we see
a 5-15% speedup over the existing Neon SHA3 implementation.
cf0950a to 6c6a2ce
Good point, I think yes! I did a quick test and it seems to show a speedup in most cases. For the larger cases, LLVM sometimes struggles to generate code for the state arrays without spilling them all to the stack, which ruins performance. I will need to investigate further to see if I can avoid that. Assuming I can get that to work, I'll aim to put up something similar to this for the other cases some time in the next few weeks.
There is already an existing implementation of AEGIS-128L using the Neon SHA3 extension, but all other implementations are currently absent.
Add implementations for all of the non-128L code paths, mirroring the same approach throughout. This includes the BCAX trick (see PR aegis-aead#31) of bitwise-negating `state[3]` (and `state[7]` where relevant).
Benchmarking this on a range of Neoverse platforms with LLVM 22, we see 1-33% speedups depending on the micro-architecture and platform being used.
Change-Id: I5e308faaad35e8971ee2ace59fe8e7ac92fa6262
We didn't have existing Neon SHA3 variants of the non-128L cases, so I've added those with this trick included as part of #33.
With LLVM 21 on Apple M4, there is no difference with AEGIS-128L, and AEGIS-128L MAC gets a little bit slower (-6%). I'll run new benchmarks with LLVM 22, but this is probably a case where assembly would be required for consistent performance.
One thing I have noticed is that the Zig benchmarks appear to get a different set of LLVM features enabled by default compared to if you were compiling with LLVM as a standalone C/C++ compiler. This difference can lead to I had left the code as e.g.
Thanks for benchmarking! My benchmarks were done on Arm Neoverse server platforms, so I can believe that the performance is not portable to a different platform with different micro-architectural characteristics. One option here would be to just disable the SHA3 code path by guarding it with I had a quick look at the generated AEGIS-128L SHA3 code path assembly with LLVM 21 (with Let me know what you prefer and I will update this and #33. Thanks!