
Enable optimized Arm assembly on Neoverse N2 #3295

Open
peteman-oai wants to merge 1 commit into aws:main from peteman-oai:peteman/n2-arm-support

peteman-oai commented Jun 9, 2026

Description of changes:

AWS-LC currently treats Neoverse N2 as a narrow-multiplier CPU. On N2, the existing native Montgomery implementation and the s2n-bignum _alt curve implementations are faster than the narrow-multiplier paths selected today.

This change:

  • detects Arm Neoverse N2 (0x41/0xd49) and Microsoft Cobalt 100 (0x6d/0xd49);
  • adds runtime and static N2 capability support; and
  • classifies N2 as wide-multiplier capable, selecting the native generic Montgomery path and existing _alt implementations for P-256, P-384, P-521, X25519, and Ed25519.

The CPU IDs follow the Linux definitions. Linux also documents Cobalt 100 as N2-based.
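
As a rough illustration of the detection described above, the following sketch decodes the implementer and part-number fields of a MIDR_EL1 value and matches the two ID pairs listed in this PR. The macro and function names are hypothetical and are not the identifiers used in AWS-LC; only the numeric IDs come from the description.

```c
#include <stdbool.h>
#include <stdint.h>

// MIDR_EL1 layout: implementer [31:24], variant [23:20],
// architecture [19:16], part number [15:4], revision [3:0].
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfffu)

// Numeric values are from the PR description; the names are hypothetical.
#define IMPLEMENTER_ARM 0x41u
#define IMPLEMENTER_MICROSOFT 0x6du
#define PARTNUM_NEOVERSE_N2 0xd49u

// Returns true for Arm Neoverse N2 and for Microsoft Cobalt 100 (N2-based).
// Variant and revision bits are ignored on purpose.
static bool midr_is_neoverse_n2_class(uint32_t midr) {
  uint32_t impl = MIDR_IMPLEMENTER(midr);
  uint32_t part = MIDR_PARTNUM(midr);
  return part == PARTNUM_NEOVERSE_N2 &&
         (impl == IMPLEMENTER_ARM || impl == IMPLEMENTER_MICROSOFT);
}
```

For example, the benchmark machine's `MIDR_EL1=0x410fd490` decodes to implementer 0x41 and part 0xd49, so it matches; the same value with implementer 0x6d would match as Cobalt 100.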

Call-outs:

This is dispatch-only. It does not add or change any cryptographic arithmetic. Generic Montgomery switches between existing implementations. The curve _alt pairs compute the same results with instruction scheduling intended for CPUs with higher multiply throughput.
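
To make the "dispatch-only" point concrete, here is a toy sketch, not AWS-LC code: two interchangeable modular-multiply routines that return identical results, with a capability flag choosing between them. All names (`g_wide_multiplier`, `mulmod_narrow`, `mulmod_wide`, `mulmod`) are hypothetical; the flag stands in for whatever `CRYPTO_is_ARMv8_wide_multiplier_capable()` reports.

```c
#include <stdint.h>

// Hypothetical capability flag; stands in for the real capability query.
static int g_wide_multiplier = 0;

// "Narrow" variant: shift-and-add, avoiding a full-width multiply.
// m must be nonzero.
static uint32_t mulmod_narrow(uint32_t a, uint32_t b, uint32_t m) {
  uint64_t acc = 0;
  uint64_t x = a % m;
  while (b != 0) {
    if (b & 1) {
      acc = (acc + x) % m;
    }
    x = (x << 1) % m;
    b >>= 1;
  }
  return (uint32_t)acc;
}

// "Wide" variant: one 32x32->64 multiply followed by a reduction.
static uint32_t mulmod_wide(uint32_t a, uint32_t b, uint32_t m) {
  return (uint32_t)(((uint64_t)a * b) % m);
}

// The dispatcher only picks a routine; the arithmetic result is the same.
static uint32_t mulmod(uint32_t a, uint32_t b, uint32_t m) {
  return g_wide_multiplier ? mulmod_wide(a, b, m) : mulmod_narrow(a, b, m);
}
```

Flipping `g_wide_multiplier` changes which routine runs but not the value returned, which is the property this PR relies on.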

Testing:

Added tests for both N2 MIDRs, static capability configuration, the wide-multiplier classification, and generic Montgomery dispatch. Updated the Arm capability-mask test configurations.

Test results:

  • Debug: 2,721 passed, 1 environment-dependent skip
  • Release: 2,689 passed, 1 environment-dependent skip
  • OPENSSL_NO_ASM: 2,671 passed, 1 environment-dependent skip
  • FIPS: 3,776 passed, 2 expected skips
  • Neoverse N2: 2,687 passed, 2 expected skips
  • AArch64 cross-builds and static-N2 dispatch tests passed under QEMU

Benchmarks were pinned to one core on Neoverse N2 r0p0 (MIDR_EL1=0x410fd490). RSA values are medians of 15 samples. Curve values are medians of three paired samples comparing the otherwise identical narrow and wide N2 dispatch.

| Operation | Previous dispatch | N2 wide dispatch | Change |
| --- | --- | --- | --- |
| RSA 2048 sign | 884.1 ops/s | 1,534.6 ops/s | +73.6% |
| RSA 3072 sign | 252.5 ops/s | 483.2 ops/s | +91.4% |
| RSA 4096 sign | 132.8 ops/s | 216.4 ops/s | +63.0% |
| ECDH P-256 | 14,423 ops/s | 18,392 ops/s | +27.5% |
| ECDH P-384 | 4,101 ops/s | 4,852 ops/s | +18.3% |
| ECDH P-521 | 2,099 ops/s | 3,029 ops/s | +44.3% |
| ECDH X25519 | 23,909 ops/s | 29,547 ops/s | +23.6% |
| Ed25519 sign | 97,998 ops/s | 129,481 ops/s | +32.1% |
| Ed25519 verify | 18,461 ops/s | 25,858 ops/s | +40.1% |

P-256 signing, verification, key generation, point addition, and point doubling changed by less than 0.1%.
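
For readers who want to recompute the "Change" column from their own runs, a minimal sketch of the two calculations involved (median of samples, then relative change) might look like this. The helper names are hypothetical and this is not the script used to produce the numbers above.

```c
#include <stddef.h>
#include <stdlib.h>

// qsort comparator for doubles (ascending).
static int cmp_double(const void *a, const void *b) {
  double x = *(const double *)a;
  double y = *(const double *)b;
  return (x > y) - (x < y);
}

// Sorts samples in place and returns the median. n must be greater than 0.
static double median(double *samples, size_t n) {
  qsort(samples, n, sizeof(samples[0]), cmp_double);
  return (n % 2) ? samples[n / 2]
                 : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

// Relative change of `after` versus `before`, in percent.
static double percent_change(double before, double after) {
  return (after - before) / before * 100.0;
}
```

Applied to the RSA 2048 row, `percent_change(884.1, 1534.6)` gives roughly 73.6, matching the table.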

Reproduction:

taskset -c <cpu> ./tool/bssl speed -filter RSA -timeout 5 -json
taskset -c <cpu> env OPENSSL_armcap=~0x40000 \
  ./tool/bssl speed -filter RSA -timeout 5 -json
taskset -c <cpu> ./tool/bssl speed \
  -filter P-256,P-384,P-521,25519 -timeout 1 -json
taskset -c <cpu> env OPENSSL_armcap=~0x40000 \
  ./tool/bssl speed -filter P-256,P-384,P-521,25519 -timeout 1 -json
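
The `OPENSSL_armcap=~0x40000` override in the commands above masks a capability bit off before dispatch. The sketch below models that parsing under two assumptions that are not stated in this PR: that a leading `~` clears bits, a leading `|` sets bits, and a bare value replaces the mask (my understanding of the BoringSSL-style convention), and that 0x40000 is the N2 bit this change introduces (inferred from the narrow-versus-wide comparison). The function name is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

// Applies an OPENSSL_armcap-style override string to a capability mask.
// Assumed convention (not confirmed by this PR):
//   "~0x..." clears the given bits,
//   "|0x..." sets the given bits,
//   "0x..."  replaces the whole mask.
static uint32_t apply_armcap_override(uint32_t caps, const char *env) {
  int invert = env[0] == '~';
  int or_in = env[0] == '|';
  const char *num = env + (invert || or_in);
  // Base 0 lets strtoul accept both hex ("0x...") and decimal input.
  uint32_t v = (uint32_t)strtoul(num, NULL, 0);
  if (invert) {
    return caps & ~v;
  }
  if (or_in) {
    return caps | v;
  }
  return v;
}
```

Under those assumptions, running with `~0x40000` leaves every other detected capability intact and only removes the one bit, which is why the paired runs differ only in dispatch.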

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

peteman-oai marked this pull request as ready for review June 9, 2026 15:01
peteman-oai requested a review from a team as a code owner June 9, 2026 15:01
justsmth requested a review from nebeid June 9, 2026 15:37

nebeid commented Jun 11, 2026

Contributor

Thanks for putting this together, Peter @peteman-oai.

Location of dispatch tests

For consistency, I suggest co-locating the N2 dispatch coverage with the other algorithm-dispatch tests in impl_dispatch_test.cc rather than in bn_test.cc.

The natural home is the AArch64-only block (right after SHA512 / SHA3_512):

#ifdef OPENSSL_AARCH64
TEST_F(ImplDispatchTest, SHA512) {
  AssertFunctionsHit(
      {
          {kFlag_sha512_hw, sha_512_ext_},
      },
      [] {
        const uint8_t in[32] = {0};
        uint8_t out[SHA512_DIGEST_LENGTH];
        SHA512(in, 32, out);
      });
}
TEST_F(ImplDispatchTest, SHA3_512) {
  // Assembly dispatch logic for Keccak-x1 on AArch64:
  // - For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum
  //   (`sha3_keccak_f1600()`)
  //   leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
  // - Otherwise, if the Neon SHA3 extension is supported, we use the Neon
  //   Keccak assembly from s2n-bignum (`sha3_keccak_f1600_alt()`),
  //   leveraging that extension.
  // - Otherwise, fall back to scalar Keccak implementation from OpenSSL,
  //   (`Keccak1600_hw()`), not using lazy rotations.
  AssertFunctionsHit(
      {
          {kFlag_sha3_keccak_f1600,
           have_s2n_bignum_asm_ &&
               (neoverse_n1_ || neoverse_v1_ || neoverse_v2_) },
          {kFlag_sha3_keccak_f1600_alt,
           have_s2n_bignum_asm_ &&
               !(neoverse_n1_ || neoverse_v1_ || neoverse_v2_) &&
               (assembler_has_neon_sha3_extension_ && sha3_ext_) },
          {kFlag_KeccakF1600_hw,
           !have_s2n_bignum_asm_ ||
               (
                   !(neoverse_n1_ || neoverse_v1_ || neoverse_v2_) &&
                   !(assembler_has_neon_sha3_extension_ && sha3_ext_)
               ) },
      },
      [] {
        const uint8_t in[32] = {0};
        uint8_t out[SHA3_512_DIGEST_LENGTH];
        SHA3_512(in, 32, out);
      });
}
#endif // OPENSSL_AARCH64

That file already tracks neoverse_n1_ / neoverse_v1_ / neoverse_v2_ via the SetUp() capability flags and asserts the actually-executed implementation through the BORINGSSL_function_hit[] mechanism — so an N2 case fits right in.
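
For anyone unfamiliar with the hit-flag approach, the idea can be sketched in plain C as follows. This is a simplified model with hypothetical names (`function_hit`, `impl_narrow`, `impl_wide`, `dispatch`, `hits_match`), not the actual `BORINGSSL_function_hit[]` plumbing or the `AssertFunctionsHit` helper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical flag indices, one slot per implementation under test.
enum { kFlagNarrow = 0, kFlagWide = 1, kNumFlags = 2 };

// Stand-in for BORINGSSL_function_hit[]: each implementation marks its slot.
static uint8_t function_hit[kNumFlags];

static void impl_narrow(void) { function_hit[kFlagNarrow] = 1; }
static void impl_wide(void) { function_hit[kFlagWide] = 1; }

// Toy dispatcher keyed on a single capability.
static void dispatch(bool wide_capable) {
  if (wide_capable) {
    impl_wide();
  } else {
    impl_narrow();
  }
}

// Clears the flags, runs the dispatcher, and reports whether exactly the
// expected slots were hit.
static bool hits_match(bool wide_capable, const uint8_t expected[kNumFlags]) {
  memset(function_hit, 0, sizeof(function_hit));
  dispatch(wide_capable);
  return memcmp(function_hit, expected, sizeof(function_hit)) == 0;
}
```

Because the check compares the whole flag array, it catches both a missing hit on the expected routine and an unexpected hit on another one.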

Two things this would let us cover that the current bn_test.cc test doesn't:

  1. The curve _alt path. Right now MontgomeryN2Dispatch only exercises the Montgomery/RSA consumer of CRYPTO_is_ARMv8_wide_multiplier_capable(). The curve side —
    use_s2n_bignum_alt() → the P-256/384/521 / X25519 / Ed25519 _selectors → the _alt routines — is the other half of this PR's win and isn't directly asserted anywhere.
  2. End-to-end Cobalt 100. NeoverseN2MIDR proves the MIDR decodes correctly and MontgomeryN2Dispatch sets the armcap bit directly, but no single test runs a Cobalt-100 MIDR all
    the way through to the wide-multiplier path.

nebeid commented Jun 11, 2026

Contributor

One more follow-up on dispatch coverage. This PR adds N2 to CRYPTO_is_ARMv8_wide_multiplier_capable() (BN + curve _alt), but N2 is left out of the SHA3/SHAKE and AES-GCM dispatch sites — while the amended arm_arch.h comment now reads "...detected to allow selecting optimized implementations for BN, SHA3/SHAKE, and AES-GCM", which implies N2 feeds all three. Could we either wire N2 into those paths or narrow the comment to BN?

Heads-up on the current N2 fallthrough: the code already documents that the Neoverse family's SHA3 instructions are implemented on only ~1/4 of the Neon units and are slower than scalar — which is exactly why V1/V2 are routed onto the scalar lazy-rotation path:

// Neoverse V1 and V2 do support SHA3 instructions, but they are only
// implemented on 1/4 of Neon units, and are thus slower than a scalar
// implementation.

As written, N2 doesn't match the N1/V1/V2 branches, so it falls through to the SHA3-extension paths (sha3_keccak_f1600_alt for x1, sha3_keccak2_f1600 for x4), i.e. the path the comment above calls the slow one (though it is fast on Apple M cores). Since N2 shares that family characteristic, it's likely on the wrong path today.

Suggested change — x1 Keccak (KeccakF1600), add N2 to the scalar lazy-rotation group:

if (CRYPTO_is_Neoverse_N1() || CRYPTO_is_Neoverse_V1() || CRYPTO_is_Neoverse_V2()) {
    keccak_log_dispatch(10); // kFlag_sha3_keccak_f1600
    sha3_keccak_f1600((uint64_t *)A, iotas);
    return;
}

if (CRYPTO_is_Neoverse_N1() || CRYPTO_is_Neoverse_V1() ||
    CRYPTO_is_Neoverse_V2() || CRYPTO_is_Neoverse_N2()) {
    keccak_log_dispatch(10); // kFlag_sha3_keccak_f1600
    sha3_keccak_f1600((uint64_t *)A, iotas);
    return;
}
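
To see how the x1 selection changes for N2, the three-way choice described in the SHA3_512 test comment can be modeled as a pure function. This is only a sketch: the type and field names are hypothetical, and `scalar_group` stands for "N1, V1, or V2" today and would also include N2 under the suggested change.

```c
#include <stdbool.h>

typedef enum {
  PATH_SCALAR_LAZY,  // sha3_keccak_f1600 (s2n-bignum, lazy rotations)
  PATH_NEON_SHA3,    // sha3_keccak_f1600_alt (Neon SHA3 extension)
  PATH_OPENSSL_HW    // KeccakF1600_hw (OpenSSL scalar fallback)
} keccak_x1_path;

typedef struct {
  bool have_s2n_bignum_asm;
  bool scalar_group;  // N1/V1/V2 today; plus N2 under the suggested change
  bool neon_sha3;     // assembler support and SHA3 extension both present
} keccak_caps;

// Mirrors the three predicates asserted in the dispatch test: exactly one
// path is selected for every combination of inputs.
static keccak_x1_path select_keccak_x1(keccak_caps c) {
  if (!c.have_s2n_bignum_asm) {
    return PATH_OPENSSL_HW;
  }
  if (c.scalar_group) {
    return PATH_SCALAR_LAZY;
  }
  if (c.neon_sha3) {
    return PATH_NEON_SHA3;
  }
  return PATH_OPENSSL_HW;
}
```

With `scalar_group` false and `neon_sha3` true (N2 as dispatched today), the model picks the Neon SHA3 path; setting `scalar_group` to true moves it to the scalar lazy-rotation path. Which of the two is actually faster on N2 is still an open question for the benchmarks.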

Suggested change — x4 Keccak (Keccak1600_x4). Here I'd lean toward grouping N2 with N1's scalar batched hybrid (sha3_keccak4_f1600_alt) rather than V1/V2's SIMD _alt2, since N2 is the narrower core of the family (the reason it's classified narrow-multiplier in the first place):

if (CRYPTO_is_Neoverse_N1()) {
    keccak_log_dispatch(13); // kFlag_sha3_keccak4_f1600_alt
    sha3_keccak4_f1600_alt((uint64_t *)A, iotas);
    return;
}
#if defined(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
if (CRYPTO_is_Neoverse_V1() || CRYPTO_is_Neoverse_V2()) {
    keccak_log_dispatch(14); // kFlag_sha3_keccak4_f1600_alt2
    sha3_keccak4_f1600_alt2((uint64_t *)A, iotas);
    return;
}

if (CRYPTO_is_Neoverse_N1() || CRYPTO_is_Neoverse_N2()) {
    keccak_log_dispatch(13); // kFlag_sha3_keccak4_f1600_alt
    sha3_keccak4_f1600_alt((uint64_t *)A, iotas);
    return;
}

Both of these are hypotheses, not assertions — could you benchmark SHA3/SHAKE on N2 (x1 and x4) across the candidate paths and let the numbers decide? The x1 case has a strong prior from the comment above; the x4 N1-vs-V1/V2 split is the genuinely uncertain one.

Separately, AES-GCM-8x (CRYPTO_is_ARMv8_GCM_8x_capable):

OPENSSL_INLINE int CRYPTO_is_ARMv8_GCM_8x_capable(void) {
  return (CRYPTO_is_ARMv8_SHA3_capable() &&
          ((OPENSSL_armcap_P & ARMV8_NEOVERSE_V1) != 0 ||
           (OPENSSL_armcap_P & ARMV8_NEOVERSE_V2) != 0 ||
           (OPENSSL_armcap_P & ARMV8_APPLE_M) != 0));
}

This is a different perf axis — the 8x kernel is bound by PMULL/AES throughput, not the integer multiplier this PR classifies — so the curve/RSA results don't predict it. N2 does advertise SHA3 (so it'd pass the first half of the gate). If it's easy, a quick bssl speed -filter AES-128-GCM comparison on N2 with and without the N2 bit in GCM_8x would tell us whether it belongs there too.
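
One way to set up that with/without comparison is to parameterize the gate, as in the sketch below. The bit positions are placeholders chosen for illustration and are not AWS-LC's `ARMV8_*` values; `gcm_8x_capable` and `extra_cores` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

// Placeholder bit positions for illustration only (not AWS-LC's values).
#define CAP_SHA3 (1u << 0)
#define CAP_NEOVERSE_V1 (1u << 1)
#define CAP_NEOVERSE_V2 (1u << 2)
#define CAP_APPLE_M (1u << 3)
#define CAP_NEOVERSE_N2 (1u << 4)

// Same shape as the existing gate: SHA3 support plus membership in a core
// list. extra_cores lets an experiment admit additional cores (e.g. N2)
// without changing the default list.
static bool gcm_8x_capable(uint32_t caps, uint32_t extra_cores) {
  uint32_t cores =
      CAP_NEOVERSE_V1 | CAP_NEOVERSE_V2 | CAP_APPLE_M | extra_cores;
  return (caps & CAP_SHA3) != 0 && (caps & cores) != 0;
}
```

An N2 with SHA3 fails the gate when `extra_cores` is 0 and passes when `extra_cores` is `CAP_NEOVERSE_N2`, so the two AES-128-GCM runs would differ only in whether the 8x kernel is selected.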

codecov-commenter commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.16%. Comparing base (7f7d548) to head (53726bd).
⚠️ Report is 17 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crypto/fipsmodule/bn/montgomery.c | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3295      +/-   ##
==========================================
- Coverage   78.17%   78.16%   -0.01%     
==========================================
  Files         689      689              
  Lines      123732   123735       +3     
  Branches    17199    17199              
==========================================
- Hits        96723    96718       -5     
- Misses      26089    26100      +11     
+ Partials      920      917       -3     

@peteman-oai (Author)

Thanks for the feedback, Nevine. I'm planning to circle back here but got busy with some other tasks. I will make some updates soon.
