Arm: Introduce Neon SHA3 versions for AEGIS non-128L functions#33
georges-arm wants to merge 1 commit into aegis-aead:main
There is already an existing implementation of AEGIS-128L using the Neon SHA3 extension, but all other implementations are currently absent. Add implementations for all of the non-128L code paths, mirroring the same approach throughout. This includes the BCAX trick (see PR aegis-aead#31) of bitwise-negating `state[3]` (and `state[7]` where relevant). Benchmarking this on a range of Neoverse platforms with LLVM 22, we see 1-33% speedups depending on the micro-architecture and code path being used.
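For readers unfamiliar with the trick, here is a minimal scalar sketch of the identity it relies on. This is not the code from this PR. It assumes the keystream computation contains terms of the form `acc ^ (s2 & s3)`, and it models BCAX as `a ^ (b & ~c)` on 64-bit words rather than 128-bit Neon vectors. The function names `bcax_model`, `and_xor_plain` and `and_xor_negated` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Scalar model of BCAX (bit clear and XOR): a ^ (b & ~c).
 * Assumption: this mirrors what the Neon SHA3 BCAX instruction
 * computes per bit; the real code operates on 128-bit vectors. */
static inline uint64_t bcax_model(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}

/* Plain form: one AND followed by one XOR. */
static inline uint64_t and_xor_plain(uint64_t acc, uint64_t s2, uint64_t s3)
{
    return acc ^ (s2 & s3);
}

/* Same result when the caller keeps s3 stored in negated form:
 * acc ^ (s2 & ~(~s3)) == acc ^ (s2 & s3), so a single BCAX
 * replaces the AND + XOR pair. */
static inline uint64_t and_xor_negated(uint64_t acc, uint64_t s2, uint64_t not_s3)
{
    return bcax_model(acc, s2, not_s3);
}
```

The cost of this approach is that every other use of the negated state word has to account for the inversion, which is why the negation has to be applied consistently across the whole code path.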
It's worth noting that there's a bit of duplication between the Neon AES and Neon SHA3 code paths (e.g. most of the helper function definitions). Let me know if you have a preference regarding refactoring the code to reduce this duplication. I didn't see a precedent for impl-specific common headers, so I wasn't sure what the best approach would be here.
Hi! I ran a benchmark on Apple M4, and noticed severe regressions with these changes in the AEAD encrypt/decrypt paths. The regressions are consistent across multiple runs, so this isn't noise.
The MAC paths are mostly neutral, with a small +5% improvement for AEGIS-128X2 MAC.
This is using LLVM 21. Maybe things are different with LLVM 22, but we can't really expect people to use LLVM 22 yet; LLVM 21 is still what Xcode ships with.

But these optimizations are nice and very promising. I wonder whether it would make sense to have optimized assembly implementations (similar to aegis-jasmin) that can be packaged separately first, so that we can really get the best possible scheduling and not have to depend on LLVM.