Enable optimized Arm assembly on Neoverse N2 by peteman-oai · Pull Request #3295 · aws/aws-lc

peteman-oai · 2026-06-09T15:00:06Z

Description of changes:

AWS-LC currently treats Neoverse N2 as a narrow-multiplier CPU. On N2, the existing native Montgomery implementation and s2n-bignum _alt curve implementations are faster.

This change:

detects Arm Neoverse N2 (0x41/0xd49) and Microsoft Cobalt 100 (0x6d/0xd49);
adds runtime and static N2 capability support; and
classifies N2 as wide-multiplier capable, selecting the native generic Montgomery path and existing _alt implementations for P-256, P-384, P-521, X25519, and Ed25519.

The CPU IDs follow the Linux definitions. Linux also documents Cobalt 100 as N2-based.
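As a rough illustration of the detection described above, the sketch below decodes the implementer and part-number fields of a MIDR_EL1 value and matches the two ID pairs listed in this description. The helper names are hypothetical and are not taken from the AWS-LC sources; only the field layout (implementer in bits [31:24], part number in bits [15:4]) and the IDs themselves are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// MIDR_EL1 field layout: implementer = bits [31:24], part number = bits [15:4].
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfffu)

// Hypothetical helper: true for the two ID pairs quoted in the description.
static bool midr_is_n2_class(uint32_t midr) {
  const uint32_t impl = MIDR_IMPLEMENTER(midr);
  const uint32_t part = MIDR_PARTNUM(midr);
  const bool arm_n2 = (impl == 0x41u && part == 0xd49u);      // Arm Neoverse N2
  const bool cobalt_100 = (impl == 0x6du && part == 0xd49u);  // Microsoft Cobalt 100
  return arm_n2 || cobalt_100;
}

// Self-check using the MIDR value quoted in the benchmark notes.
static void midr_self_check(void) {
  assert(midr_is_n2_class(0x410fd490u));
}
```

Variant and revision bits are ignored here, so any revision of the same part would match.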

Call-outs:

This is dispatch-only. It does not add or change any cryptographic arithmetic. Generic Montgomery switches between existing implementations. The curve _alt pairs compute the same results with instruction scheduling intended for CPUs with higher multiply throughput.
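To make the "dispatch-only" point concrete, here is a minimal sketch of the pattern, assuming a single boolean capability predicate. The two modular-multiply functions are stand-ins invented for illustration; they are not the Montgomery or s2n-bignum routines. What matters is that both variants return the same value and only the selection changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*mulmod_fn)(uint64_t a, uint64_t b, uint64_t m);

// Stand-in "base" variant: shift-and-add, avoiding a wide multiply.
static uint64_t mulmod_base(uint64_t a, uint64_t b, uint64_t m) {
  assert(m != 0 && m <= UINT32_MAX);  // keeps all intermediates below 2^64
  uint64_t acc = 0;
  a %= m;
  b %= m;
  while (b != 0) {
    if (b & 1) acc = (acc + a) % m;
    a = (a * 2) % m;
    b >>= 1;
  }
  return acc;
}

// Stand-in "_alt" variant: a single multiply, same mathematical result.
static uint64_t mulmod_alt(uint64_t a, uint64_t b, uint64_t m) {
  assert(m != 0 && m <= UINT32_MAX);
  return ((a % m) * (b % m)) % m;
}

// Dispatch: only the choice of implementation depends on the CPU class.
static mulmod_fn select_mulmod(bool wide_multiplier) {
  return wide_multiplier ? mulmod_alt : mulmod_base;
}
```

Because the arithmetic is unchanged, a test can compare the outputs of both selections on the same inputs and expect equality.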

Testing:

Added tests for both N2 MIDRs, static capability configuration, the wide-multiplier classification, and generic Montgomery dispatch. Updated the Arm capability-mask test configurations.

Test results:

Debug: 2,721 passed, 1 environment-dependent skip
Release: 2,689 passed, 1 environment-dependent skip
OPENSSL_NO_ASM: 2,671 passed, 1 environment-dependent skip
FIPS: 3,776 passed, 2 expected skips
Neoverse N2: 2,687 passed, 2 expected skips
AArch64 cross-builds and static-N2 dispatch tests passed under QEMU

Benchmarks were pinned to one core on Neoverse N2 r0p0 (MIDR_EL1=0x410fd490). RSA values are medians of 15 samples. Curve values are medians of three paired samples comparing the otherwise identical narrow and wide N2 dispatch.

| Operation | Previous dispatch | N2 wide dispatch | Change |
| --- | --- | --- | --- |
| RSA 2048 sign | 884.1 ops/s | 1,534.6 ops/s | +73.6% |
| RSA 3072 sign | 252.5 ops/s | 483.2 ops/s | +91.4% |
| RSA 4096 sign | 132.8 ops/s | 216.4 ops/s | +63.0% |
| ECDH P-256 | 14,423 ops/s | 18,392 ops/s | +27.5% |
| ECDH P-384 | 4,101 ops/s | 4,852 ops/s | +18.3% |
| ECDH P-521 | 2,099 ops/s | 3,029 ops/s | +44.3% |
| ECDH X25519 | 23,909 ops/s | 29,547 ops/s | +23.6% |
| Ed25519 sign | 97,998 ops/s | 129,481 ops/s | +32.1% |
| Ed25519 verify | 18,461 ops/s | 25,858 ops/s | +40.1% |

P-256 signing, verification, key generation, point addition, and point doubling changed by less than 0.1%.
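For readers who want to recompute the Change column from raw samples, the following sketch shows one way to take a median and a percentage change. It is an assumption about the post-processing, not the script used for this PR; the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// qsort comparator for doubles, ascending.
static int cmp_double(const void *a, const void *b) {
  const double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

// Sorts the samples in place and returns their median.
static double median(double *samples, size_t n) {
  assert(n > 0);
  qsort(samples, n, sizeof(samples[0]), cmp_double);
  return (n % 2) ? samples[n / 2]
                 : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Relative change of `after` versus `before`, in percent.
static double percent_change(double before, double after) {
  assert(before > 0.0);
  return 100.0 * (after - before) / before;
}
```

Applied to the RSA 2048 row, percent_change(884.1, 1534.6) comes out at roughly 73.6, which agrees with the table.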

Reproduction:

taskset -c <cpu> ./tool/bssl speed -filter RSA -timeout 5 -json
taskset -c <cpu> env OPENSSL_armcap=~0x40000 \
  ./tool/bssl speed -filter RSA -timeout 5 -json
taskset -c <cpu> ./tool/bssl speed \
  -filter P-256,P-384,P-521,25519 -timeout 1 -json
taskset -c <cpu> env OPENSSL_armcap=~0x40000 \
  ./tool/bssl speed -filter P-256,P-384,P-521,25519 -timeout 1 -json
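The commands above rely on OPENSSL_armcap=~0x40000 to switch off one capability bit for the comparison run. This thread does not spell out the prefix semantics, so the sketch below is only an assumption: '~' clears the given bits from the probed mask, '|' sets them, and no prefix replaces the mask. The function name is hypothetical and this is not the library's parser.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

// Assumed semantics (not confirmed in this thread):
//   "~<value>"  clears the given bits from the probed mask
//   "|<value>"  sets the given bits in the probed mask
//   "<value>"   replaces the probed mask outright
static uint32_t apply_armcap_override(uint32_t probed, const char *env) {
  assert(env != NULL);
  const char prefix = env[0];
  if (prefix == '~' || prefix == '|') {
    env++;  // skip the prefix before parsing the number
  }
  // Base 0 lets strtoul accept both decimal and "0x..." hex input.
  const uint32_t value = (uint32_t)strtoul(env, NULL, 0);
  if (prefix == '~') return probed & ~value;
  if (prefix == '|') return probed | value;
  return value;
}
```

Under these assumptions, "~0x40000" leaves every other probed bit intact and clears only bit 18, which is what makes the narrow and wide runs otherwise identical.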

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

nebeid · 2026-06-11T15:57:52Z

Thanks for putting this together, Peter @peteman-oai.

Location of dispatch tests

For consistency, I suggest co-locating the N2 dispatch coverage with the other algorithm-dispatch tests in impl_dispatch_test.cc rather than keeping it in bn_test.cc.

The natural home is the AArch64-only block (right after SHA512 / SHA3_512):

aws-lc/crypto/impl_dispatch_test.cc

Lines 261 to 306 in 7f7d548

    
#ifdef OPENSSL_AARCH64
TEST_F(ImplDispatchTest, SHA512) {
  AssertFunctionsHit(
      {
          {kFlag_sha512_hw, sha_512_ext_},
      },
      [] {
        const uint8_t in[32] = {0};
        uint8_t out[SHA512_DIGEST_LENGTH];
        SHA512(in, 32, out);
      });
}
TEST_F(ImplDispatchTest, SHA3_512) {
  // Assembly dispatch logic for Keccak-x1 on AArch64:
  // - For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum
  //   (`sha3_keccak_f1600()`)
  //   leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
  // - Otherwise, if the Neon SHA3 extension is supported, we use the Neon
  //   Keccak assembly from s2n-bignum (`sha3_keccak_f1600_alt()`),
  //   leveraging that extension.
  // - Otherwise, fall back to scalar Keccak implementation from OpenSSL,
  //   (`Keccak1600_hw()`), not using lazy rotations.
  AssertFunctionsHit(
      {
          {kFlag_sha3_keccak_f1600,
           have_s2n_bignum_asm_ &&
           (neoverse_n1_ || neoverse_v1_ || neoverse_v2_) },
          {kFlag_sha3_keccak_f1600_alt,
           have_s2n_bignum_asm_ &&
           !(neoverse_n1_ || neoverse_v1_ || neoverse_v2_) &&
           (assembler_has_neon_sha3_extension_ && sha3_ext_) },
          {kFlag_KeccakF1600_hw,
           !have_s2n_bignum_asm_ ||
           (
             !(neoverse_n1_ || neoverse_v1_ || neoverse_v2_) &&
             !(assembler_has_neon_sha3_extension_ && sha3_ext_)
           ) },
      },
      [] {
        const uint8_t in[32] = {0};
        uint8_t out[SHA3_512_DIGEST_LENGTH];
        SHA3_512(in, 32, out);
      });
}
#endif // OPENSSL_AARCH64

That file already tracks neoverse_n1_ / neoverse_v1_ / neoverse_v2_ via the SetUp() capability flags and asserts the actually-executed implementation through the BORINGSSL_function_hit[] mechanism — so an N2 case fits right in.
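The function-hit idea can be sketched in a few lines of C. This is a simplified stand-in, not the BORINGSSL_function_hit[] implementation or the AssertFunctionsHit() helper: each implementation sets a flag when it runs, and the check clears the flags, runs the operation, and compares against the expected set. All names and flag indices here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { kHitBase = 0, kHitAlt = 1, kNumHits = 2 };  // hypothetical flag indices

static uint8_t function_hit[kNumHits];  // stand-in for the real hit array

// Each variant records that it was executed.
static void impl_base(void) { function_hit[kHitBase] = 1; }
static void impl_alt(void) { function_hit[kHitAlt] = 1; }

// Dispatcher under test: picks a variant from a capability predicate.
static void run_op(bool wide_multiplier) {
  if (wide_multiplier) {
    impl_alt();
  } else {
    impl_base();
  }
}

// Clears the flags, runs the operation once, and reports whether exactly the
// expected flags were set.
static bool hits_match(bool wide_multiplier, const uint8_t expected[kNumHits]) {
  assert(expected != NULL);
  memset(function_hit, 0, sizeof(function_hit));
  run_op(wide_multiplier);
  return memcmp(function_hit, expected, sizeof(function_hit)) == 0;
}
```

An N2 case in this style would set the capability input from an N2 (or Cobalt 100) MIDR and expect the flags of the wide-multiplier variants, asserting the executed path rather than the predicate alone.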

Two things this would let us cover that the current bn_test.cc test doesn't:

The curve _alt path. Right now MontgomeryN2Dispatch only exercises the Montgomery/RSA consumer of CRYPTO_is_ARMv8_wide_multiplier_capable(). The curve side (use_s2n_bignum_alt() → the P-256/384/521 / X25519 / Ed25519 _selectors → the _alt routines) is the other half of this PR's win and isn't directly asserted anywhere.
End-to-end Cobalt 100. NeoverseN2MIDR proves the MIDR decodes correctly and MontgomeryN2Dispatch sets the armcap bit directly, but no single test runs a Cobalt-100 MIDR all the way through to the wide-multiplier path.

nebeid · 2026-06-11T16:01:49Z

One more follow-up on dispatch coverage. This PR adds N2 to CRYPTO_is_ARMv8_wide_multiplier_capable() (BN + curve _alt), but N2 is left out of the SHA3/SHAKE and AES-GCM dispatch sites — while the amended arm_arch.h comment now reads "...detected to allow selecting optimized implementations for BN, SHA3/SHAKE, and AES-GCM", which implies N2 feeds all three. Could we either wire N2 into those paths or narrow the comment to BN?

Heads-up on the current N2 fallthrough: the code already documents that the Neoverse family's SHA3 instructions are implemented on only ~1/4 of the Neon units and are slower than scalar — which is exactly why V1/V2 are routed onto the scalar lazy-rotation path:

aws-lc/crypto/fipsmodule/sha/keccak1600.c

Lines 357 to 359 in 7f7d548

    
// Neoverse V1 and V2 do support SHA3 instructions, but they are only
// implemented on 1/4 of Neon units, and are thus slower than a scalar
// implementation.

As written, N2 doesn't match the N1/V1/V2 branches, so it falls through to the SHA3-extension paths (sha3_keccak_f1600_alt for x1, sha3_keccak2_f1600 for x4) — i.e. the path the comment above calls the slow one (but fast on M cores). Since N2 shares that family characteristic, it's likely on the wrong path today.

Suggested change — x1 Keccak (KeccakF1600), add N2 to the scalar lazy-rotation group:

aws-lc/crypto/fipsmodule/sha/keccak1600.c

Lines 362 to 366 in 7f7d548

    
if (CRYPTO_is_Neoverse_N1() || CRYPTO_is_Neoverse_V1() || CRYPTO_is_Neoverse_V2()) {
    keccak_log_dispatch(10); // kFlag_sha3_keccak_f1600
    sha3_keccak_f1600((uint64_t *)A, iotas);
    return;
}

if (CRYPTO_is_Neoverse_N1() || CRYPTO_is_Neoverse_V1() ||
    CRYPTO_is_Neoverse_V2() || CRYPTO_is_Neoverse_N2()) {
    keccak_log_dispatch(10); // kFlag_sha3_keccak_f1600
    sha3_keccak_f1600((uint64_t *)A, iotas);
    return;
}

Suggested change — x4 Keccak (Keccak1600_x4). Here I'd lean toward grouping N2 with N1's scalar batched hybrid (sha3_keccak4_f1600_alt) rather than V1/V2's SIMD _alt2, since N2 is the narrower core of the family (the reason it's classified narrow-multiplier in the first place):

aws-lc/crypto/fipsmodule/sha/keccak1600.c

Lines 436 to 447 in 7f7d548

    
    if (CRYPTO_is_Neoverse_N1()) {
        keccak_log_dispatch(13); // kFlag_sha3_keccak4_f1600_alt
        sha3_keccak4_f1600_alt((uint64_t *)A, iotas);
        return;
    }
#if defined(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
    if (CRYPTO_is_Neoverse_V1() || CRYPTO_is_Neoverse_V2()) {
        keccak_log_dispatch(14); // kFlag_sha3_keccak4_f1600_alt2
        sha3_keccak4_f1600_alt2((uint64_t *)A, iotas);
        return;
    }

if (CRYPTO_is_Neoverse_N1() || CRYPTO_is_Neoverse_N2()) {
    keccak_log_dispatch(13); // kFlag_sha3_keccak4_f1600_alt
    sha3_keccak4_f1600_alt((uint64_t *)A, iotas);
    return;
}

Both of these are hypotheses, not assertions — could you benchmark SHA3/SHAKE on N2 (x1 and x4) across the candidate paths and let the numbers decide? The x1 case has a strong prior from the comment above; the x4 N1-vs-V1/V2 split is the genuinely uncertain one.

Separately, AES-GCM-8x (CRYPTO_is_ARMv8_GCM_8x_capable):

aws-lc/crypto/fipsmodule/cpucap/internal.h

Lines 242 to 247 in 7f7d548

    
OPENSSL_INLINE int CRYPTO_is_ARMv8_GCM_8x_capable(void) {
  return (CRYPTO_is_ARMv8_SHA3_capable() &&
          ((OPENSSL_armcap_P & ARMV8_NEOVERSE_V1) != 0 ||
           (OPENSSL_armcap_P & ARMV8_NEOVERSE_V2) != 0 ||
           (OPENSSL_armcap_P & ARMV8_APPLE_M) != 0));
}

This is a different perf axis — the 8x kernel is bound by PMULL/AES throughput, not the integer multiplier this PR classifies — so the curve/RSA results don't predict it. N2 does advertise SHA3 (so it'd pass the first half of the gate). If it's easy, a quick bssl speed -filter AES-128-GCM comparison on N2 with and without the N2 bit in GCM_8x would tell us whether it belongs there too.
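One way to set up that comparison is to make the N2 term of the gate switchable, so the same binary logic can be evaluated with and without it. The sketch below is only a model of the predicate: the CAP_* bit values are placeholders chosen for the example (the real ARMV8_* constants are not reproduced here), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Placeholder bit values for illustration only; they do not match the real
// ARMV8_* definitions.
#define CAP_SHA3 (1u << 0)
#define CAP_NEOVERSE_V1 (1u << 1)
#define CAP_NEOVERSE_V2 (1u << 2)
#define CAP_APPLE_M (1u << 3)
#define CAP_NEOVERSE_N2 (1u << 4)

// Models the GCM 8x gate: SHA3 support plus membership in a core allow-list.
// `include_n2` toggles the experimental variant that also admits N2.
static bool gcm_8x_capable(uint32_t armcap, bool include_n2) {
  uint32_t cores = CAP_NEOVERSE_V1 | CAP_NEOVERSE_V2 | CAP_APPLE_M;
  if (include_n2) {
    cores |= CAP_NEOVERSE_N2;
  }
  return (armcap & CAP_SHA3) != 0 && (armcap & cores) != 0;
}

// Self-check: an N2 mask with SHA3 passes only the experimental gate.
static void gcm_8x_self_check(void) {
  const uint32_t n2_with_sha3 = CAP_SHA3 | CAP_NEOVERSE_N2;
  assert(!gcm_8x_capable(n2_with_sha3, false));
  assert(gcm_8x_capable(n2_with_sha3, true));
}
```

Whether N2 belongs in the allow-list remains a benchmarking question; the sketch only shows the shape of the two configurations being compared.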

codecov-commenter · 2026-06-11T16:37:22Z

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.16%. Comparing base (7f7d548) to head (53726bd).
⚠️ Report is 17 commits behind head on main.

Files with missing lines	Patch %	Lines
crypto/fipsmodule/bn/montgomery.c	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3295      +/-   ##
==========================================
- Coverage   78.17%   78.16%   -0.01%     
==========================================
  Files         689      689              
  Lines      123732   123735       +3     
  Branches    17199    17199              
==========================================
- Hits        96723    96718       -5     
- Misses      26089    26100      +11     
+ Partials      920      917       -3


peteman-oai · 2026-06-24T23:03:10Z

Thanks for the feedback, Nevine. I'm planning to circle back here but got busy with some other tasks. I will make some updates soon.

Enable optimized Arm assembly on Neoverse N2

53726bd

peteman-oai had a problem deploying to manual-approval June 9, 2026 15:00 — with GitHub Actions Error

peteman-oai requested a deployment to manual-approval June 9, 2026 15:01 — with GitHub Actions Waiting

peteman-oai marked this pull request as ready for review June 9, 2026 15:01

peteman-oai requested a review from a team as a code owner June 9, 2026 15:01

justsmth requested a review from nebeid June 9, 2026 15:37

dougch mentioned this pull request Jun 11, 2026

throw-away PR to see what arm platform our GHA are on #3297

Closed

peteman-oai wants to merge 1 commit into aws:main from peteman-oai:peteman/n2-arm-support

3 participants