Arm: Introduce Neon SHA3 versions for AEGIS non-128L functions #33

Open
georges-arm wants to merge 1 commit into aegis-aead:main from georges-arm:georges-arm/aarch64-sha3-everything-else

@georges-arm
Collaborator

There is already an existing implementation of AEGIS-128L using the Neon SHA3 extension, but all other implementations are currently absent.

Add implementations for all of the non-128L code paths, mirroring the same approach throughout. This includes the BCAX trick (see PR #31) of bitwise-negating `state[3]` (and `state[7]` where relevant).
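For readers unfamiliar with the trick, here is a minimal scalar sketch of why keeping a negated state word pays off. It assumes the AEGIS-128L keystream word `z0 = S1 ^ S6 ^ (S2 & S3)` from the AEGIS specification and models the BCAX instruction (`a ^ (b & ~c)`, exposed in Neon as `vbcaxq_u8`) on plain 64-bit integers. The helper names `bcax`, `z0_reference`, and `z0_negated` are hypothetical and do not come from the PR; the real code operates on 128-bit Neon vectors.

```c
#include <stdint.h>

/* Scalar model of the Arm BCAX operation: a ^ (b & ~c). */
static inline uint64_t bcax(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* Reference form: one AND plus two XORs, with s3 stored as-is. */
static inline uint64_t z0_reference(uint64_t s1, uint64_t s2,
                                    uint64_t s3, uint64_t s6)
{
    return s1 ^ s6 ^ (s2 & s3);
}

/* Same value when the state holds ns3 = ~s3 instead of s3:
 * BCAX undoes the negation for free, so s6 ^ (s2 & s3) becomes
 * a single operation, followed by one XOR with s1. */
static inline uint64_t z0_negated(uint64_t s1, uint64_t s2,
                                  uint64_t ns3, uint64_t s6)
{
    return s1 ^ bcax(s6, s2, ns3);
}
```

Calling `z0_negated(s1, s2, ~s3, s6)` yields the same word as `z0_reference(s1, s2, s3, s6)` for any inputs. How the negated word is kept consistent across state updates is not modelled here.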

Benchmarking this on a range of Arm Neoverse platforms with LLVM 22, we see 1-33% speedups depending on the micro-architecture and code path being used.

@georges-arm
Collaborator Author

It's worth noting that there's a bit of duplication between the Neon AES and Neon SHA3 code paths (e.g. most of the helper function definitions). Additionally, the `fresh_ones` helper function is currently duplicated in each of the Neon SHA3 implementations.

Let me know if you have a preference regarding refactoring the code to reduce this duplication. I didn't see a precedent for impl-specific common headers, so I wasn't sure what the best approach would be here.

@jedisct1
Collaborator

Hi!

I ran a benchmark on Apple M4, and noticed severe regressions with these changes:

  • AEGIS-128X2: -57% (175 → 76 Gb/s)
  • AEGIS-256X2: -50% (128 → 64 Gb/s)
  • AEGIS-256X4: -44% (116 → 65 Gb/s)
  • AEGIS-256: -10% (93 → 84 Gb/s)

The regressions are consistent across multiple runs, so this isn't noise.

AEAD Encrypt/Decrypt

| Variant | main (Mb/s) | PR #33 (Mb/s) | Change |
|---|---|---|---|
| AEGIS-128L | ~170,000 | ~171,000 | ~0% |
| AEGIS-128X2 | ~175,000 | ~76,000 | -57% |
| AEGIS-128X4 | ~102,000 | ~107,000 | +5% |
| AEGIS-256 | ~93,000 | ~84,000 | -10% |
| AEGIS-256X2 | ~128,000 | ~64,000 | -50% |
| AEGIS-256X4 | ~116,000 | ~65,000 | -44% |
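For anyone reproducing these numbers, the Change column appears to be the relative throughput difference against `main`, rounded to the nearest percent. A minimal sketch of that arithmetic follows; the function name `pct_change` is hypothetical and is not part of the benchmark harness.

```c
/* Relative change of the PR throughput versus the baseline, in percent.
 * Negative values are regressions, positive values are speedups. */
static inline double pct_change(double baseline_mbps, double pr_mbps)
{
    return (pr_mbps - baseline_mbps) / baseline_mbps * 100.0;
}
```

With the rounded AEGIS-128X2 figures above, `pct_change(175000, 76000)` is about -56.6, which matches the reported -57%.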

The MAC paths are mostly neutral, with a small +5% improvement for AEGIS-128X2 MAC and regressions of 9% and 4% for AEGIS-128X4 MAC and AEGIS-256X4 MAC respectively.

| Variant | main (Mb/s) | PR #33 (Mb/s) | Change |
|---|---|---|---|
| AEGIS-128L MAC | ~215,000 | ~215,000 | ~0% |
| AEGIS-128X2 MAC | ~238,000 | ~249,000 | +5% |
| AEGIS-128X4 MAC | ~185,000 | ~168,000 | -9% |
| AEGIS-256 MAC | ~115,000 | ~115,000 | ~0% |
| AEGIS-256X2 MAC | ~179,000 | ~179,000 | ~0% |
| AEGIS-256X4 MAC | ~141,000 | ~135,000 | -4% |

This is using LLVM 21. Maybe things are different with LLVM 22, but we can't really expect people to use LLVM 22 yet. LLVM 21 is still what Xcode ships with.

But these optimizations are nice and very promising.

I wonder whether it would make sense to have optimized assembly implementations (similar to aegis-jasmin) that can be packaged separately first, so that we get the best possible scheduling without depending on LLVM.
