Arm: Introduce Neon SHA3 versions for AEGIS non-128L functions by georges-arm · Pull Request #33 · aegis-aead/libaegis

georges-arm · 2026-03-23T16:24:08Z

There is already an existing implementation of AEGIS-128L using the Neon SHA3 extension, but all other implementations are currently absent.

Add implementations for all of the non-128L code paths, mirroring the same approach throughout. This includes the BCAX trick (see PR #31) of bitwise-negating state[3] (and state[7] where relevant).
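For readers unfamiliar with the trick, here is a minimal scalar sketch of the identity it relies on, assuming BCAX computes `a ^ (b & ~c)` (the operation exposed by the `vbcaxq_u8` intrinsic). The helper names `bcax`, `mix_plain`, `mix_negated` and `bcax_trick_agrees` are hypothetical and are not taken from libaegis; the sketch uses 64-bit words in place of 128-bit Neon vectors and only covers the AND/XOR step, not how the negated word is maintained across state updates.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of BCAX (bit clear and XOR): a ^ (b & ~c).
 * Assumption: this mirrors what vbcaxq_u8(a, b, c) does per bit. */
static inline uint64_t bcax(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* Reference form with state[3] stored as-is: x ^ (s2 & s3).
 * This needs a separate AND followed by an XOR. */
static inline uint64_t mix_plain(uint64_t x, uint64_t s2, uint64_t s3)
{
    return x ^ (s2 & s3);
}

/* Same value when state[3] is stored negated (n3 == ~s3):
 * s2 & s3 == s2 & ~n3, so the AND and XOR fold into one BCAX. */
static inline uint64_t mix_negated(uint64_t x, uint64_t s2, uint64_t n3)
{
    return bcax(x, s2, n3);
}

/* Returns 1 when both forms agree for the given inputs. */
static inline int bcax_trick_agrees(uint64_t x, uint64_t s2, uint64_t s3)
{
    return mix_plain(x, s2, s3) == mix_negated(x, s2, ~s3);
}
```

Because the identity is bitwise, it should carry over unchanged to 128-bit vectors; the same reasoning would apply to state[7] in the variants that AND against it.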

Benchmarking this on a range of Arm Neoverse platforms with LLVM 22, we see 1-33% speedups depending on the micro-architecture and code path being used.

georges-arm · 2026-03-23T16:30:19Z

It's worth noting that there's a bit of duplication between the Neon AES and Neon SHA3 code paths (e.g. most of the helper function definitions). Additionally, the fresh_ones helper function is currently duplicated in each of the Neon SHA3 implementations.

Let me know if you have a preference regarding refactoring the code to reduce this duplication. I didn't see a precedent for impl-specific common headers, so I wasn't sure what the best approach would be here.

jedisct1 · 2026-03-31T07:44:55Z

Hi!

I ran a benchmark on Apple M4, and noticed severe regressions with these changes:

AEGIS-128X2: -57% (175 → 76 Gb/s)
AEGIS-256X2: -50% (128 → 64 Gb/s)
AEGIS-256X4: -44% (116 → 65 Gb/s)
AEGIS-256: -10% (93 → 84 Gb/s)

The regressions are consistent across multiple runs, so this isn't noise.

AEAD Encrypt/Decrypt

| Variant | main (Mb/s) | PR #33 (Mb/s) | Change |
| --- | --- | --- | --- |
| AEGIS-128L | ~170,000 | ~171,000 | ~0% |
| AEGIS-128X2 | ~175,000 | ~76,000 | -57% |
| AEGIS-128X4 | ~102,000 | ~107,000 | +5% |
| AEGIS-256 | ~93,000 | ~84,000 | -10% |
| AEGIS-256X2 | ~128,000 | ~64,000 | -50% |
| AEGIS-256X4 | ~116,000 | ~65,000 | -44% |

The MAC paths are mostly neutral: AEGIS-128X2 MAC improves by about 5%, while AEGIS-128X4 MAC and AEGIS-256X4 MAC regress by about 9% and 4% respectively.

| Variant | main (Mb/s) | PR #33 (Mb/s) | Change |
| --- | --- | --- | --- |
| AEGIS-128L MAC | ~215,000 | ~215,000 | ~0% |
| AEGIS-128X2 MAC | ~238,000 | ~249,000 | +5% |
| AEGIS-128X4 MAC | ~185,000 | ~168,000 | -9% |
| AEGIS-256 MAC | ~115,000 | ~115,000 | ~0% |
| AEGIS-256X2 MAC | ~179,000 | ~179,000 | ~0% |
| AEGIS-256X4 MAC | ~141,000 | ~135,000 | -4% |
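
The Change columns above are consistent with a plain relative difference against main, rounded to the nearest percent. A small sketch of that arithmetic follows, assuming this is how the figures were derived; the function names are hypothetical and not part of any benchmark tool mentioned in this thread.

```c
#include <assert.h>

/* Relative throughput change of the PR versus main, in percent.
 * Negative values are regressions, positive values are speedups. */
static inline double relative_change_percent(double main_mbps, double pr_mbps)
{
    return (pr_mbps - main_mbps) / main_mbps * 100.0;
}

/* Same value rounded to the nearest whole percent (half away from zero). */
static inline long rounded_change_percent(double main_mbps, double pr_mbps)
{
    double v = relative_change_percent(main_mbps, pr_mbps);
    return (long)(v < 0.0 ? v - 0.5 : v + 0.5);
}
```

For example, `rounded_change_percent(175000, 76000)` reproduces the -57% reported for AEGIS-128X2, and `rounded_change_percent(185000, 168000)` the -9% reported for AEGIS-128X4 MAC.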

This is using LLVM 21. Maybe things are different with LLVM 22, but we can't really expect people to use LLVM 22 yet. LLVM 21 is still what Xcode ships with.

But these optimizations are nice and very promising.

I'm wondering whether it would make sense to have optimized assembly implementations (similar to aegis-jasmin) that can be packaged separately first, so that we can really get the best possible scheduling and not have to depend on LLVM.

georges-arm requested a review from jedisct1 March 23, 2026 16:24

georges-arm mentioned this pull request Mar 23, 2026

Arm: Conditionally negate state[{3,7}] to enable using SHA3 BCAX #31

Open

georges-arm wants to merge 1 commit into aegis-aead:main from georges-arm:georges-arm/aarch64-sha3-everything-else
