Arm: Conditionally negate state[{3,7}] to enable using SHA3 BCAX by georges-arm · Pull Request #31 · aegis-aead/libaegis

georges-arm · 2026-03-12T16:46:47Z

The aegis128l_common.h code contains repeated lines of paired XOR and AND operations, for example:

msg0 = AES_BLOCK_XOR(msg0, AES_BLOCK_AND(state[2], state[3]));

This is suboptimal on Arm because there is no single instruction that performs both the XOR and the AND.

The FEAT_SHA3 extension includes the BCAX (bit-clear and XOR) instruction, which is the equivalent of XOR(a, AND(b, NOT(c))). However, this does not quite match because c must be negated.

To enable the BCAX instruction to be used, introduce a new AES_INVERT_STATE37 toggle to optionally store state[3] and state[7] as bitwise-negated in aegis128l_common.h. With LLVM 22 this is sufficient to have the compiler automatically make use of the BCAX instructions so there is no need to use them explicitly.
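The identity behind the toggle can be checked with a scalar model. This is a minimal sketch rather than code from the PR: `xor_and`, `bcax` and `check_negated_operand` are hypothetical names, and 64-bit integers stand in for the 128-bit vector registers.

```c
#include <assert.h>
#include <stdint.h>

/* What the common code computes today: a ^ (b & c). */
static uint64_t xor_and(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & c);
}

/* Scalar model of BCAX: a ^ (b & ~c). */
static uint64_t bcax(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* Checks that keeping c in negated form lets BCAX reproduce a ^ (b & c). */
static void check_negated_operand(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t c_negated = ~c; /* models what would be stored in state[3] or state[7] */

    assert(bcax(a, b, c_negated) == xor_and(a, b, c));
}
```

Because the negation is applied once when the state word is written, each later XOR/AND pair needs no separate NOT.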

Since state[3] and state[7] are now bitwise-negated, also update aegis128l_neon_sha3.c to add a new AES_ENC1 macro that undoes the bitwise negation as part of the AESE instruction. The compiler will ordinarily try to materialise the all-ones constant here in a sub-optimal way, necessitating the use of inline assembly.
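AESE XORs its two operands before applying SubBytes and ShiftRows, which is why the negation can be undone inside the instruction. The sketch below models only that initial XOR on a 16-byte block. It assumes the non-inverted path passes an all-zero second operand to AESE; the function names are hypothetical and this is not the `AES_ENC1` macro from the PR.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 16

/* Models only the first step of AESE: out = state ^ operand.
 * The real instruction then applies SubBytes and ShiftRows. */
static void aese_pre_xor(uint8_t out[BLOCK_BYTES],
                         const uint8_t state[BLOCK_BYTES],
                         const uint8_t operand[BLOCK_BYTES])
{
    for (int i = 0; i < BLOCK_BYTES; i++) {
        out[i] = state[i] ^ operand[i];
    }
}

/* Checks that a negated block combined with an all-ones operand yields the
 * same pre-SubBytes value as the plain block with an all-zero operand. */
static void check_negation_is_cancelled(const uint8_t plain[BLOCK_BYTES])
{
    uint8_t negated[BLOCK_BYTES], zeros[BLOCK_BYTES], ones[BLOCK_BYTES];
    uint8_t from_plain[BLOCK_BYTES], from_negated[BLOCK_BYTES];

    memset(zeros, 0x00, sizeof zeros);
    memset(ones, 0xFF, sizeof ones);
    for (int i = 0; i < BLOCK_BYTES; i++) {
        negated[i] = (uint8_t) ~plain[i];
    }
    aese_pre_xor(from_plain, plain, zeros);
    aese_pre_xor(from_negated, negated, ones);
    assert(memcmp(from_plain, from_negated, BLOCK_BYTES) == 0);
}
```

Since the rest of the round sees identical input in both cases, only the constant operand has to change, not the instruction count.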

Benchmarking this on a range of Arm Neoverse platforms with LLVM 22, we see a 5-15% speedup over the existing Neon SHA3 implementation.

jedisct1 · 2026-03-12T17:35:25Z

Nice!

Is it something we can apply to other variants as well?

georges-arm · 2026-03-13T16:48:20Z

> Is it something we can apply to other variants as well?

Good point, I think yes! I did a quick test and it seems to show a speedup in most cases. For the larger cases, LLVM sometimes struggles to generate code for the state arrays without spilling them all to the stack, which ruins performance. I will need to investigate further to see if I can avoid that.

Assuming I can get that to work, I'll aim to put up something similar to this for the other cases some time in the next few weeks.

There is already an existing implementation of AEGIS-128L using the Neon SHA3 extension, but all other implementations are currently absent. Add implementations for all of the non-128L code paths, mirroring the same approach throughout. This includes the BCAX trick (see PR aegis-aead#31) of bitwise-negating `state[3]` (and `state[7]` where relevant). Benchmarking this on a range of Neoverse platforms with LLVM 22, we see 1-33% speedups depending on the micro-architecture and platform being used. Change-Id: I5e308faaad35e8971ee2ace59fe8e7ac92fa6262

georges-arm · 2026-03-23T16:27:58Z

> Nice!
>
> Is it something we can apply to other variants as well?

We didn't have existing Neon SHA3 variants of the non-128L cases so I've added those with this trick included as part of #33.

jedisct1 · 2026-03-31T07:48:44Z

With LLVM 21 on Apple M4, there is no difference with AEGIS-128L, and AEGIS-128L MAC gets a little bit slower (-6%).

I'll run new benchmarks with LLVM 22, but this is probably a case where assembly would be required for consistent performance.

georges-arm · 2026-04-02T15:11:31Z

> I'll run new benchmarks with LLVM 22, but this is probably a case where assembly would be required for consistent performance.

One thing I have noticed is that the Zig benchmarks appear to get a different set of LLVM features enabled by default than you get when compiling with LLVM as a standalone C/C++ compiler. This difference can lead to AESE and AESMC being placed far apart from each other, which hurts performance. As a result, the Zig benchmarks report much worse performance than an equivalent C/C++ benchmark by default, at least on Linux platforms. I raised that as an issue here: https://codeberg.org/ziglang/zig/issues/31443. A workaround in the meantime seems to be to explicitly specify -Dcpu=native or -Dcpu=generic, which seems to recover the performance on Linux.

I had left the code as e.g. tmp0 = AES_BLOCK_XOR(tmp0, AES_BLOCK_AND(state[2], AES_BLOCK_NOT(state[3]))); to try to minimise changes to the common code, but I appreciate that this relies on the compiler performing the transformation. I could add an explicit AES_BLOCK_BCAX macro (alternative naming welcome) and use the vbcaxq_u8 intrinsic directly instead, if you feel that would be more portable?
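For illustration, an explicit macro along those lines might look like the sketch below. It is an assumption about the shape of such a change, not code from this PR: `aes_block_t` here, the `block_*` helpers and the portable fallback are hypothetical and exist only so the sketch compiles on any target. The intrinsic `vbcaxq_u8(a, b, c)` computes a ^ (b & ~c).

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_SHA3)
#include <arm_neon.h>

typedef uint8x16_t aes_block_t;

/* Hypothetical macro: maps straight onto the BCAX intrinsic, a ^ (b & ~c). */
#define AES_BLOCK_BCAX(A, B, C) vbcaxq_u8((A), (B), (C))

static inline aes_block_t block_from_u64(uint64_t lo, uint64_t hi)
{
    return vreinterpretq_u8_u64(vcombine_u64(vcreate_u64(lo), vcreate_u64(hi)));
}
static inline uint64_t block_lo(aes_block_t b)
{
    return vgetq_lane_u64(vreinterpretq_u64_u8(b), 0);
}
static inline uint64_t block_hi(aes_block_t b)
{
    return vgetq_lane_u64(vreinterpretq_u64_u8(b), 1);
}
#else
/* Portable stand-in so the sketch builds without FEAT_SHA3. */
typedef struct {
    uint64_t lo, hi;
} aes_block_t;

static inline aes_block_t aes_block_bcax(aes_block_t a, aes_block_t b, aes_block_t c)
{
    aes_block_t r = { a.lo ^ (b.lo & ~c.lo), a.hi ^ (b.hi & ~c.hi) };
    return r;
}
#define AES_BLOCK_BCAX(A, B, C) aes_block_bcax((A), (B), (C))

static inline aes_block_t block_from_u64(uint64_t lo, uint64_t hi)
{
    aes_block_t r = { lo, hi };
    return r;
}
static inline uint64_t block_lo(aes_block_t b) { return b.lo; }
static inline uint64_t block_hi(aes_block_t b) { return b.hi; }
#endif

/* Checks the macro against a ^ (b & c) when c is supplied pre-negated. */
static void check_block_bcax(uint64_t a, uint64_t b, uint64_t c)
{
    /* Low lane stores ~c (true value c); high lane stores c (true value ~c). */
    aes_block_t r = AES_BLOCK_BCAX(block_from_u64(a, ~a),
                                   block_from_u64(b, ~b),
                                   block_from_u64(~c, c));

    assert(block_lo(r) == (a ^ (b & c)));
    assert(block_hi(r) == (~a ^ (~b & ~c)));
}
```

The trade-off is the one described above: an explicit macro no longer depends on the compiler recognising the XOR/AND/NOT pattern, at the cost of touching more of the common code.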

> With LLVM 21 on Apple M4, there is no difference with AEGIS-128L, and AEGIS-128L MAC gets a little bit slower (-6%).

Thanks for benchmarking! My benchmarks were done on Arm Neoverse server platforms, so I can believe that the performance does not carry over to a platform with different micro-architectural characteristics. One option here would be to just disable the SHA3 code path by guarding it with !__APPLE__ or similar?
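A compile-time guard of that kind could be sketched as follows. The macro `EXAMPLE_USE_NEON_SHA3` and the function name are hypothetical, and the sketch does not reflect how libaegis actually selects its implementations.

```c
#include <assert.h>

/* Hypothetical switch: enable the SHA3 path only on AArch64 targets that
 * advertise FEAT_SHA3 at compile time and are not Apple platforms. */
#if defined(__aarch64__) && defined(__ARM_FEATURE_SHA3) && !defined(__APPLE__)
#define EXAMPLE_USE_NEON_SHA3 1
#else
#define EXAMPLE_USE_NEON_SHA3 0
#endif

/* Returns 1 when the SHA3 code path would be compiled in, 0 otherwise. */
static int example_sha3_path_enabled(void)
{
    int enabled = EXAMPLE_USE_NEON_SHA3;

#if defined(__APPLE__)
    assert(enabled == 0); /* the guard must exclude Apple targets */
#endif
    return enabled;
}
```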

I had a quick look at the generated AEGIS-128L SHA3 code path assembly with LLVM 21 (with -Dcpu=generic, see above) and it looks reasonable, so I don't think that moving to assembly would fix the Apple Silicon performance inversion that you observed here and in #33.

Let me know what you prefer and I will update this and #33.

Thanks!

georges-arm requested a review from jedisct1 March 12, 2026 16:47

georges-arm force-pushed the georges-arm/aarch64-sha3-use-bcax branch from cf0950a to 6c6a2ce March 13, 2026 16:44

georges-arm mentioned this pull request Mar 23, 2026

Arm: Introduce Neon SHA3 versions for AEGIS non-128L functions #33

Open

georges-arm wants to merge 1 commit into aegis-aead:main from georges-arm:georges-arm/aarch64-sha3-use-bcax
