
Arm: Conditionally negate state[{3,7}] to enable using SHA3 BCAX #31

Open
georges-arm wants to merge 1 commit into aegis-aead:main from georges-arm:georges-arm/aarch64-sha3-use-bcax

Conversation

@georges-arm
Collaborator

The aegis128l_common.h code contains repeated lines of paired XOR and AND operations, for example:

msg0 = AES_BLOCK_XOR(msg0, AES_BLOCK_AND(state[2], state[3]));

This is suboptimal on Arm because there is no single instruction that performs both the XOR and the AND.

The FEAT_SHA3 extension includes the BCAX (bit-clear and XOR) instruction, which is equivalent to XOR(a, AND(b, NOT(c))). However, this does not quite match the pattern above because c has to be negated.

To enable the BCAX instruction to be used, introduce a new AES_INVERT_STATE37 toggle to optionally store state[3] and state[7] as bitwise-negated in aegis128l_common.h. With LLVM 22 this is sufficient to have the compiler automatically make use of the BCAX instructions so there is no need to use them explicitly.
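For illustration, here is a minimal scalar sketch of why storing the operand negated makes the two forms agree. It models a single 64-bit lane rather than a full 128-bit block, and the helper names (`bcax64`, `xor_and64`, `bcax_negated_operand_check`) are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 64-bit lane of BCAX: a XOR (b AND NOT c). */
static inline uint64_t bcax64(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* The XOR/AND pair from the common code, on one lane: a XOR (b AND c). */
static inline uint64_t xor_and64(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & c);
}

/* Checks that BCAX fed a negated c reproduces the XOR/AND pair; returns 1. */
static int bcax_negated_operand_check(void)
{
    /* Arbitrary sample values, not taken from the AEGIS specification. */
    const uint64_t a = 0x0123456789abcdefULL;
    const uint64_t b = 0xfedcba9876543210ULL;
    const uint64_t c = 0x0f0f0f0ff0f0f0f0ULL;
    const uint64_t c_stored = ~c; /* what an inverted state word would hold */

    assert(bcax64(a, b, c_stored) == xor_and64(a, b, c));
    return 1;
}
```

Because NOT(NOT(c)) is c, the negation that BCAX applies cancels the one applied when the state word was stored, so the result is unchanged.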

Since state[3] and state[7] are now bitwise-negated, also update aegis128l_neon_sha3.c to add a new AES_ENC1 macro that undoes the bitwise negation as part of the AESE instruction. The compiler will ordinarily try to materialise the all-ones constant here in a sub-optimal way, necessitating the use of inline assembly.
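As a rough sketch of the idea: AESE XORs its two operands before applying ShiftRows and SubBytes, so passing an all-ones key cancels a bitwise negation of the data for free. The code below models only that initial XOR in portable C. It assumes the non-inverted path passes an all-zero key to AESE, which is an assumption and not something stated above, and all names (`block16`, `aese_xor_step`, and so on) are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit block as plain bytes (illustrative type, not the library's). */
typedef struct { uint8_t b[16]; } block16;

/* Models only the XOR that AESE performs before ShiftRows/SubBytes. */
static inline block16 aese_xor_step(block16 data, block16 key)
{
    for (int i = 0; i < 16; i++) {
        data.b[i] = (uint8_t) (data.b[i] ^ key.b[i]);
    }
    return data;
}

/* Bitwise negation of a block, i.e. how an inverted state word is stored. */
static inline block16 block_not(block16 x)
{
    for (int i = 0; i < 16; i++) {
        x.b[i] = (uint8_t) ~x.b[i];
    }
    return x;
}

/* Block with every byte set to v. */
static inline block16 block_fill(uint8_t v)
{
    block16 x;
    memset(x.b, v, sizeof x.b);
    return x;
}

/* Checks that (NOT s) XOR all-ones equals s XOR all-zeros; returns 1. */
static int aese_negation_check(void)
{
    block16 s;
    for (int i = 0; i < 16; i++) {
        s.b[i] = (uint8_t) (17 * i + 3); /* arbitrary sample state */
    }
    const block16 plain    = aese_xor_step(s, block_fill(0x00));
    const block16 inverted = aese_xor_step(block_not(s), block_fill(0xFF));

    assert(memcmp(plain.b, inverted.b, sizeof plain.b) == 0);
    return 1;
}
```

Since both paths hand the same value to ShiftRows and SubBytes, the rest of the round is unaffected by the inverted storage.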

Benchmarking this on a range of Arm Neoverse platforms with LLVM 22, we see a 5-15% speedup over the existing Neon SHA3 implementation.

@georges-arm georges-arm requested a review from jedisct1 March 12, 2026 16:47
@jedisct1
Collaborator

Nice!

Is it something we can apply to other variants as well?

@georges-arm georges-arm force-pushed the georges-arm/aarch64-sha3-use-bcax branch from cf0950a to 6c6a2ce Compare March 13, 2026 16:44
@georges-arm
Collaborator Author

Is it something we can apply to other variants as well?

Good point, I think yes! I did a quick test and it seems to show a speedup in most cases. For the larger cases, LLVM sometimes struggles to generate code for the state arrays without spilling them all to the stack, which ruins performance. I will need to investigate further to see if I can avoid that.

Assuming I can get that to work, I'll aim to put up something similar to this for the other cases some time in the next few weeks.

georges-arm added a commit to georges-arm/libaegis that referenced this pull request Mar 23, 2026
There is already an existing implementation of AEGIS-128L using the Neon
SHA3 extension, but all other implementations are currently absent.

Add implementations for all of the non-128L code paths, mirroring the
same approach throughout. This includes the BCAX trick (see PR aegis-aead#31) of
bitwise-negating `state[3]` (and `state[7]` where relevant).

Benchmarking this on a range of Neoverse platforms with LLVM 22, we see
1-33% speedups depending on the micro-architecture and platform being
used.

Change-Id: I5e308faaad35e8971ee2ace59fe8e7ac92fa6262
georges-arm added a commit to georges-arm/libaegis that referenced this pull request Mar 23, 2026
There is already an existing implementation of AEGIS-128L using the Neon
SHA3 extension, but all other implementations are currently absent.

Add implementations for all of the non-128L code paths, mirroring the
same approach throughout. This includes the BCAX trick (see PR aegis-aead#31) of
bitwise-negating `state[3]` (and `state[7]` where relevant).

Benchmarking this on a range of Neoverse platforms with LLVM 22, we see
1-33% speedups depending on the micro-architecture and code path being
used.
@georges-arm
Collaborator Author

Nice!

Is it something we can apply to other variants as well?

We didn't have existing Neon SHA3 variants of the non-128L cases, so I've added those, with this trick included, as part of #33.

@jedisct1
Collaborator

With LLVM 21 on Apple M4, there is no difference with AEGIS-128L, and AEGIS-128L MAC gets a little bit slower (-6%).

I'll run new benchmarks with LLVM 22, but this is probably a case where assembly would be required for consistent performance.

@georges-arm
Collaborator Author

I'll run new benchmarks with LLVM 22, but this is probably a case where assembly would be required for consistent performance.

One thing I have noticed is that the Zig benchmarks appear to get a different set of LLVM features enabled by default compared to compiling with LLVM as a standalone C/C++ compiler. This difference can lead to AESE and AESMC being placed far apart from each other, which hurts performance, so by default the Zig benchmarks report much worse performance than an equivalent C/C++ benchmark, at least on Linux platforms. I raised that as an issue here: https://codeberg.org/ziglang/zig/issues/31443. A workaround in the meantime is to explicitly specify -Dcpu=native or -Dcpu=generic, which seems to recover the performance on Linux.

I had left the code as e.g. tmp0 = AES_BLOCK_XOR(tmp0, AES_BLOCK_AND(state[2], AES_BLOCK_NOT(state[3]))); to try and minimise changes to the common code but I appreciate this is reliant on the compiler performing the transformation. I could add an explicit AES_BLOCK_BCAX macro (alternative naming welcome) and use the vbcaxq_u8 intrinsic directly instead if you feel that would be more portable?
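Roughly, the explicit variant could look like the sketch below. `vbcaxq_u8(a, b, c)` is the ACLE intrinsic for BCAX and computes a ^ (b & ~c). The fallback type, the portable branch, and the check function are illustrative assumptions for this comment, not code from the patch, and the load/store helpers are simplified stand-ins:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_SHA3)
#include <arm_neon.h>
typedef uint8x16_t aes_block_t;
#define AES_BLOCK_LOAD(P)       vld1q_u8(P)
#define AES_BLOCK_STORE(P, X)   vst1q_u8((P), (X))
/* One BCAX instruction: A ^ (B & ~C). */
#define AES_BLOCK_BCAX(A, B, C) vbcaxq_u8((A), (B), (C))
#else
/* Portable stand-in so the sketch builds without FEAT_SHA3. */
typedef struct { uint8_t b[16]; } aes_block_t;

static inline aes_block_t AES_BLOCK_LOAD(const uint8_t *p)
{
    aes_block_t x;
    memcpy(x.b, p, sizeof x.b);
    return x;
}

static inline void AES_BLOCK_STORE(uint8_t *p, aes_block_t x)
{
    memcpy(p, x.b, sizeof x.b);
}

static inline aes_block_t AES_BLOCK_BCAX(aes_block_t a, aes_block_t b, aes_block_t c)
{
    for (int i = 0; i < 16; i++) {
        a.b[i] = (uint8_t) (a.b[i] ^ (b.b[i] & ~c.b[i]));
    }
    return a;
}
#endif

/* Compares AES_BLOCK_BCAX against a byte-wise reference; returns 1. */
static int aes_block_bcax_check(void)
{
    uint8_t a[16], b[16], c[16], got[16], want[16];

    for (int i = 0; i < 16; i++) {
        /* Arbitrary sample bytes. */
        a[i]    = (uint8_t) (i * 17);
        b[i]    = (uint8_t) (255 - i * 3);
        c[i]    = (uint8_t) (i * 29 + 5);
        want[i] = (uint8_t) (a[i] ^ (b[i] & ~c[i]));
    }
    AES_BLOCK_STORE(got, AES_BLOCK_BCAX(AES_BLOCK_LOAD(a), AES_BLOCK_LOAD(b),
                                        AES_BLOCK_LOAD(c)));
    assert(memcmp(got, want, sizeof got) == 0);
    return 1;
}
```

With something like this, the common code would call AES_BLOCK_BCAX(tmp0, state[2], state[3]) directly on the inverted state instead of relying on the compiler to fuse the XOR, AND and NOT.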

With LLVM 21 on Apple M4, there is no difference with AEGIS-128L, and AEGIS-128L MAC gets a little bit slower (-6%).

Thanks for benchmarking! My benchmarks were done on Arm Neoverse server platforms, so I can believe that the performance is not portable to a different platform with different micro-architectural characteristics. One option here would be to just disable the SHA3 code path on Apple platforms by guarding it with !__APPLE__ or similar?

I had a quick look at the generated AEGIS-128L SHA3 code path assembly with LLVM 21 (with -Dcpu=generic, see above) and it looks reasonable, so I don't think that moving to assembly would fix the Apple Silicon performance inversion that you observed here and in #33.

Let me know what you prefer and I will update this and #33.

Thanks!
