Add AVX-512 support by Shnatsel · Pull Request #231 · linebender/fearless_simd

Shnatsel · 2026-05-24T21:09:01Z

Yes, really. It's all here. In one humongous PR. Sorry 😅

This is probably best reviewed commit-by-commit. The first commit is still big because the history was getting really messy with changes and rollbacks, and squashing it made it less of a mess.

This also touches other backends in three ways:

1. set_mask() is now a backend method so that it can be specialized per level.
2. Changes to the mask conversion routines, made to support different internal representations, bled into other levels. They occasionally add an intermediate array, but it gets optimized out in practice.
3. transmute_copy() is wrapped into checked_transmute_copy() and the raw version is disallowed, after I almost had a horrible accident with it. ~~This could be its own PR but I wanted the insurance right away.~~ This was split out and shipped in v0.5.0.
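
To illustrate the third point: one hazard of raw `transmute_copy` is that it does not check that the source and destination types have the same size. Below is a minimal sketch of what a size-checked wrapper could look like. The helper type `SameSize` and the exact signature are assumptions made for illustration; the wrapper actually shipped in the crate may differ.

```rust
use core::marker::PhantomData;
use core::mem::{size_of, transmute_copy};

/// Hypothetical helper: its associated constant fails to compile
/// when the two types differ in size.
struct SameSize<Src, Dst>(PhantomData<(Src, Dst)>);

impl<Src, Dst> SameSize<Src, Dst> {
    const CHECK: () = assert!(
        size_of::<Src>() == size_of::<Dst>(),
        "checked_transmute_copy: source and destination sizes differ"
    );
}

/// Hypothetical stand-in for the crate's wrapper; the real signature may differ.
///
/// # Safety
/// The bytes of `src` must form a valid value of type `Dst`.
#[inline(always)]
pub unsafe fn checked_transmute_copy<Src, Dst>(src: &Src) -> Dst {
    // Referencing the constant forces the size check at compile time
    // for every concrete `Src`/`Dst` pair that is actually instantiated.
    let () = SameSize::<Src, Dst>::CHECK;
    // SAFETY: the sizes match (checked above); validity of the bit pattern
    // is the caller's obligation.
    unsafe { transmute_copy(src) }
}
```

With this sketch, converting `[u8; 4]` to `u32` compiles, while converting `[u8; 4]` to `u64` is rejected at build time instead of reading past the end of the source.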

Everything changed here should be covered by tests. I've expanded test coverage where it was lacking.

Closes #179

…edicated AVX-512 implementations for complex int/float vector operations that benefit the most.

LLM summary of the changes:

Implemented:
- Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`.
- Generated new `fearless_simd/src/generated/avx512.rs`.
- Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support.
- Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly.
- Bumped MSRV/docs/CI/check-target metadata to Rust 1.89.

Generator/backend behavior:
- 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`.
- AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper.
- Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too.
- Added AVX-512 compare/select paths using mask-returning compares and mask blends.
- Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`.
- Added AVX-512 vector slides for vectors only; masks intentionally have no slide support.
- Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths.

Tests/coverage:
- Extended `#[simd_test]` to include AVX-512.
- Added AVX-512 detection/dispatch coverage.
- Updated mask bitwise tests for canonical boolean mask lanes.
- Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
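
The difference between a compact `__mmask16`-style mask and a "wide" mask that spends a full lane per boolean can be sketched in portable scalar Rust. This is only an illustration of the two representations: the function names mirror `to_bitmask` / `from_bitmask` from the summary, but the signatures, the 16-lane width, and the convention that a true lane is all ones are assumptions here, not the crate's actual code.

```rust
/// Lane count of the illustrative mask (a 512-bit vector of 32-bit lanes).
pub const LANES: usize = 16;

/// Packs a wide mask (one i32 per lane; -1 = true, 0 = false)
/// into a compact mask with one bit per lane.
pub fn to_bitmask(wide: &[i32; LANES]) -> u16 {
    let mut bits = 0u16;
    for (lane, &value) in wide.iter().enumerate() {
        // Assumption: the sign bit of a lane carries its truth value.
        if value < 0 {
            bits |= 1 << lane;
        }
    }
    bits
}

/// Expands a compact mask back into the wide representation.
pub fn from_bitmask(bits: u16) -> [i32; LANES] {
    core::array::from_fn(|lane| if (bits >> lane) & 1 == 1 { -1 } else { 0 })
}
```

In this sketch the compact form occupies 2 bytes while the wide form occupies 64, which is the kind of size property the regression test on mask compactness is meant to pin down.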

…rage. Only for the 8-bit left shift does LLVM autovectorize the scalar fallback into GFNI instructions on 256-bit halves, which emits more instructions but schedules better and ends up slightly faster according to llvm-mca on Sapphire Rapids. The difference isn't huge, though, and I don't want to rely on autovectorization because of its fragility.

LaurenzV · 2026-05-25T06:18:15Z

I think it would indeed be great to have a separate PR for 3.

Shnatsel · 2026-05-25T08:47:09Z

It will cause a lot of conflicts if I try to split it, but I have it isolated to its own commit at least: f08f7e6

Shnatsel · 2026-06-17T23:17:17Z

I've researched whether the instruction set we chose is forward-compatible with Intel's upcoming AVX10. It is: according to the Intel AVX10 architecture specification revision 7.0, all AVX10 CPUs include the AVX-512 features from Ice Lake (our target) as well as Sapphire Rapids (higher than our target but doesn't add anything particularly useful).

Shnatsel · 2026-06-20T23:59:56Z

I've run the Vello benchmarks on Zen 4, which doesn't even have native 512-bit vectors, and AVX-512 cuts the time of the end-to-end rendering benchmarks by about 15%!

Full benchmark run

$ cargo bench --bench main render_strips/ -- --load-baseline=avx512-8442ef44 --baseline=main-8442ef44
    Finished `bench` profile [optimized] target(s) in 0.09s
     Running benches/main.rs (/home/shnatsel/Code/vello/target/release/deps/main-efa34b2edd34e856)
render_strips/Ghostscript_Tiger_simd
                        time:   [125.64 µs 125.77 µs 125.94 µs]
                        change: [-16.343% -16.180% -16.011%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 50 measurements (18.00%)
  3 (6.00%) high mild
  6 (12.00%) high severe
render_strips/coat_of_arms_simd
                        time:   [1.3414 ms 1.3442 ms 1.3470 ms]
                        change: [-14.185% -14.081% -13.961%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 50 measurements (10.00%)
  5 (10.00%) high mild
render_strips/Saimaa_Canal_(map)_simd
                        time:   [3.6695 ms 3.6715 ms 3.6744 ms]
                        change: [-11.730% -11.510% -11.303%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 50 measurements (20.00%)
  2 (4.00%) high mild
  8 (16.00%) high severe
render_strips/heraldry_simd
                        time:   [883.81 µs 883.88 µs 883.97 µs]
                        change: [-15.411% -15.239% -15.078%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
  1 (2.00%) high mild
  6 (12.00%) high severe
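
As a quick check of the "about 15%" figure, the central estimates from the four `change:` lines above can be averaged. This is a small sketch, not part of the benchmark suite; the constant and function names are made up for illustration.

```rust
/// Central estimates of the improvement (in percent) reported above for
/// Ghostscript_Tiger, coat_of_arms, Saimaa_Canal_(map) and heraldry.
pub const IMPROVEMENTS_PERCENT: [f64; 4] = [16.180, 14.081, 11.510, 15.239];

/// Arithmetic mean of a non-empty slice of percentages.
pub fn mean_percent(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}
```

The individual improvements range from 11.5% to 16.2%, and their plain average is a little over 14%.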

LaurenzV · 2026-06-27T20:18:34Z

(Sorry, meant to push to my private branch 😅 )

Shnatsel · 2026-06-27T20:22:38Z

No worries, I have a local backup. I'm glad you're looking into this!

LaurenzV

So, disclaimer, I have not actually tried to deeply understand how most of the more complex operations are implemented, I'm relying on my trust and the extensive tests here. 🙏 I don't have the time to try to validate all of this manually.

I did try to read through all the code though, so just some comments here and there. For me, it's just important that we land some form of #228 before making a new release.

Apart from that, some other things:

- I think the mk_86 code in fearless_simd_gen is getting pretty convoluted with all of the special casing for AVX512... I'm wondering if there is room for improvement in the future, but no idea.
- I'm also wondering whether in the future there is some way of having all of the required features be auto-generated so they only need to be defined once in fearless_simd_gen, but that's also something for the future, not now.

LaurenzV · 2026-06-27T09:19:52Z

+RUSTFLAGS=-Ctarget-cpu=icelake-server cargo check -p fearless_simd --target x86_64-unknown-linux-gnu
+RUSTFLAGS=-Ctarget-cpu=icelake-server cargo check -p fearless_simd --target x86_64-unknown-linux-gnu --features force_support_fallback


Why not just set the AVX512 feature flags here, like below? Also, do we need to update the commands below to activate all feature flags that were added to SSE4.2/AVX2 a while ago?

I guess it makes sense to keep it shorter, but the invocations below probably need to be updated (in a follow-up), no? Since they are missing the other target features we require. Or am I missing something?

oh yeah I missed that script, it does need updating for v2/v3 targets, good call

LaurenzV · 2026-06-27T09:21:16Z

 clippy.fn_to_numeric_cast_any = "warn"
 clippy.infinite_loop = "warn"
-clippy.large_stack_arrays = "warn"
+clippy.large_stack_arrays = "allow"             # appears to be buggy as of 1.93, fixed in 1.95. TODO: re-enable


Why would changing the MSRV from 1.88 to 1.89 impact this then?

I believe Clippy is run on MSRV instead of latest stable on CI.

We could probably switch it to latest stable now that it respects the crate MSRV, but that would randomly cause CI to fail on main because of Clippy adding new lints in later releases, and dealing with that is rather miserable in my experience.

LaurenzV · 2026-06-27T09:29:02Z


+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+fn x86_detects_icelake_avx512() -> bool {
+    std::arch::is_x86_feature_detected!("adx")


What about f16c, which we use for AVX2?

It is implied by avx512f, you can verify it by running:

rustc --print=cfg --target x86_64-unknown-linux-gnu -C target-feature=+avx512f

This will print target_feature="f16c" among other things.
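
The same relationship can be observed at runtime. The sketch below uses only `std::arch::is_x86_feature_detected!`; the function names are hypothetical, and the first function checks only a small illustrative subset of features, not the full Ice Lake list that the real `x86_detects_icelake_avx512()` covers.

```rust
/// Runtime check for a small, illustrative subset of AVX-512 features.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn detects_avx512_subset() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
        && std::arch::is_x86_feature_detected!("avx512bw")
        && std::arch::is_x86_feature_detected!("avx512vl")
}

/// True when the running CPU is consistent with "avx512f implies f16c":
/// either avx512f is absent, or f16c is present as well.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn avx512f_implies_f16c() -> bool {
    !std::arch::is_x86_feature_detected!("avx512f")
        || std::arch::is_x86_feature_detected!("f16c")
}

/// Fallbacks so the sketch also builds on non-x86 targets.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn detects_avx512_subset() -> bool {
    false
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn avx512f_implies_f16c() -> bool {
    true
}
```

Because `f16c` is implied, a detection function that already checks `avx512f` should not need a separate `f16c` check.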

LaurenzV · 2026-06-27T09:35:04Z

 mod soundness;

+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+fn x86_detects_icelake_avx512() -> bool {


It would also be good to add a comment explaining how this list was derived. I presume by using rustc --print=cfg --target x86_64-unknown-linux-gnu -C target-cpu=icelake-server?

LaurenzV · 2026-06-27T09:40:58Z

+
+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+#[test]
+fn avx512_masks_are_compact() {


Up to you, but this seems a bit superfluous to test.

LaurenzV · 2026-06-27T21:08:29Z

    }

    pub(crate) fn handle_zip(&self, op: Op, vec_ty: &VecType, select_low: bool) -> TokenStream {
+        if *self == Self::Avx512 && vec_ty.scalar != ScalarType::Mask && vec_ty.n_bits() >= 256 {


Can this method even be called with masks? If not, it seems like the second condition can just be omitted.

The same goes for interleave/deinterleave etc., and for some other places as well.

LaurenzV · 2026-06-27T21:20:30Z

+            && target_scalar == ScalarType::Float
+            && vec_ty.scalar_bits == 32
+        {
+            // We cannot emit the intrinsics for the conversion instructions


Let's prefix it with a TODO then, so we don't forget about this.

LaurenzV · 2026-06-27T21:25:15Z

+            @cfg any(target_arch = "x86", target_arch = "x86_64");
+            @token_ty $crate::Avx512;
+            @kernel_attrs #[target_feature(
+                enable = "adx,aes,avx512bitalg,avx512bw,avx512cd,avx512dq,avx512f,avx512ifma,avx512vbmi,avx512vbmi2,avx512vl,avx512vnni,avx512vpopcntdq,bmi1,bmi2,cmpxchg16b,fma,gfni,lzcnt,movbe,pclmulqdq,popcnt,rdrand,rdseed,sha,vaes,vpclmulqdq,xsave,xsavec,xsaveopt,xsaves"


I really wish there were a more readable and easier-verifiable way for this 😓

I never managed to come up with one. We can't use variables here, and a declarative macro doesn't help either because we write these out in different contexts in slightly different ways, so even the strings we insert aren't the same.
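
This does not remove the duplication, but a cheap sanity check can at least make each copy of the literal easier to verify. The sketch below is an assumption-laden illustration: `AVX512_FEATURES` copies the string from the snippet above, and `is_sorted_and_unique` is a hypothetical helper that a test could run against every place the list is written out.

```rust
/// The feature list as written in the `target_feature` attribute above.
pub const AVX512_FEATURES: &str = "adx,aes,avx512bitalg,avx512bw,avx512cd,avx512dq,avx512f,avx512ifma,avx512vbmi,avx512vbmi2,avx512vl,avx512vnni,avx512vpopcntdq,bmi1,bmi2,cmpxchg16b,fma,gfni,lzcnt,movbe,pclmulqdq,popcnt,rdrand,rdseed,sha,vaes,vpclmulqdq,xsave,xsavec,xsaveopt,xsaves";

/// Hypothetical helper: true when the comma-separated names are non-empty,
/// strictly ascending, and therefore free of duplicates.
pub fn is_sorted_and_unique(features: &str) -> bool {
    let names: Vec<&str> = features.split(',').collect();
    names.iter().all(|name| !name.is_empty())
        && names.windows(2).all(|pair| pair[0] < pair[1])
}
```

Keeping every copy sorted means two copies can be compared by eye or with a plain string diff, even though they still have to be maintained by hand.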

LaurenzV · 2026-06-27T21:34:10Z

    // lower to LLVM intrinsics, they will likely not be optimized until much later in the pipeline (if at all),
    // resulting in substantially worse codegen. See https://github.com/linebender/fearless_simd/pull/185.
+    //
+    // Safety: The native vector type backing any implementation will be:


Isn't this the wrong place for a safety comment? There isn't actually any unsafe here. Shouldn't this be in the transmute module? (I think we already have a similar comment there.)

LaurenzV · 2026-06-27T21:42:32Z

-            val: crate::transmute::checked_transmute_copy(&arch),
-            simd,
-        }
+        let lanes: [i8; 32usize] = crate::transmute::checked_transmute_copy(&arch);


In the future, could this be avoided by specializing the SimdFrom impls for specific backends instead of making them generic over Simd?

I guess? I didn't see the point in complicating the generator further for the sake of this, since it's just two transmutes in a row anyway and optimizes into a by-value transmute.
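
A portable sketch of why the intermediate array is harmless: going through a lane array with two copies yields the same bytes as a single by-value transmute. Here `[u64; 4]` merely stands in for a 256-bit native vector type and the function names are made up; whether the compiler folds the two steps into one is the claim above and is not demonstrated by this code.

```rust
use core::mem::{transmute, transmute_copy};

/// Two-step conversion through an intermediate lane array.
pub fn via_lanes(arch: [u64; 4]) -> [u16; 16] {
    // SAFETY: all three types are 32 bytes of plain integers,
    // so every bit pattern is valid for each of them.
    let lanes: [i8; 32] = unsafe { transmute_copy(&arch) };
    unsafe { transmute_copy(&lanes) }
}

/// Single by-value conversion.
pub fn direct(arch: [u64; 4]) -> [u16; 16] {
    // SAFETY: same size, and every bit pattern is a valid `[u16; 16]`.
    unsafe { transmute(arch) }
}
```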

Shnatsel · 2026-06-27T23:42:51Z

> I think the mk_86 code in fearless_simd_gen is getting pretty convoluted with all of the special casing for AVX512... I'm wondering if there is room for improvement in the future, but no idea.

I agree. However, much of it is genuinely shared with the other levels, e.g. all the basic math operations emitting mostly the same intrinsics with 256 swapped for 512, so I'm not quite sure what the right cut points would be.

Shnatsel added 20 commits May 24, 2026 18:24

Add checked_transmute_copy and ban transmute_copy to statically preve…

f08f7e6

…nt the spooky bug I almost introduced

Expand native type conversion test coverage

aef1cac

Rename test: mask_methods.rs -> mask_roundtrip.rs

c12a7cc

Check in the new generated AVX-512 file

9d9adf8

Fix build after file rename

81441cf

Use AVX-512 instructions for f32 -> u32 conversions. Expand test cove…

0d6af5d

…rage for these ops.

Optimize load_array/as_array on AVX-512 masks; the initial impl was s…

025c172

…calar, now we use the dedicated intrinsics.

Split set_mask into a backend method so it could be specialized per b…

7927383

…ackend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.

Optimize load_interleaved/store_interleaved for AVX-512. Add one more…

57de129

… test to exercise it. i8/u8 test is still bad because of rust-lang/rust#156891

Optimize floor/ceil/round_ties_even/trunc/approximate_recip for 512-b…

f2ba8c9

…it vectors on AVX-512; expand test coverage

Use AVX-512 rcp14 for smaller vector sizes too; improves precision at…

9cddbb2

… no cost to throughput

Optimize slide_within_blocks for AVX-512; verified with exhaustive sl…

9d02c3a

…ide test

Remove stale tests for mask slide APIs; they were under #[cfg(false)]…

85b44c9

… so they didn't show up earlier when I removed those methods.

consistent clippy error messages

1c558ca

satisfy Clippy

6c8f7d7

get rid of useless extra braces

e475ae1

KISS the native type mask roundtrip tests

6f1081f

cargo fmt

1e2a096

Shnatsel mentioned this pull request May 24, 2026

Initial AVX-512 support #201

Closed

Shnatsel added 6 commits May 24, 2026 22:15

Satisfy clippy some more. Hoisted by my own restriction lint.

7fc16d4

Satisfy the toml formatting check

359650d

Stick an #[expect] onto checked_transmute_copy on wasm32, otherwise w…

37df3e3

…e get dead code warnings

Suppress an apparently buggy Clippy lint; surfaced only in `cargo cli…

8825bfb

…ppy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there

Satisfy the toml formatter again

cf3ff7d

Add miri out-outs for extra slow tests

cb5780f

Shnatsel mentioned this pull request May 24, 2026

Set up Miri tests in CI #173

Open

Shnatsel added 12 commits June 17, 2026 12:33

Record no branch-specific changes for PR linebender#240

3c4bcbc

Merge main PR linebender#241: update SDE CI download

2d0595d

Record no branch-specific changes for PR linebender#241

0847ebf

Merge main PR linebender#242: add missing authors

672772f

Record no branch-specific changes for PR linebender#242

1887405

Merge main PR linebender#236: remove unsafe generated intrinsic calls

6e5672a

Includes the regenerated AVX-512 output from the same generator update.

Merge main PR linebender#243: document transmute wrappers

dc4c8fe

Merge main PR linebender#244: remove unsafe generated helpers

ce28db9

Includes regenerated AVX-512 slide helpers for the same safety cleanup.

Merge main PR linebender#246: revert authors additions

ca1759b

Merge main PR linebender#247: explain transmute wrapper motivation

484d1bf

Merge main PR linebender#249: add author entry

7efdb1a

cargo fmt

014e4b7

Shnatsel added 4 commits June 18, 2026 00:24

Merge main PR linebender#245: safer interleaved load/store

9d4f115

Includes regenerated AVX-512 interleaved load/store output.

cargo fmt

d49f6a2

Merge branch 'linebender:main' into avx512-yes-really

00eb81d

Wrap AVX-512-specific codepaths in kernel! instead of unsafe where po…

046ee30

…ssible

Shnatsel added 6 commits June 21, 2026 01:46

Optimize u32->f32 conversion for 128-bit and 256-bit vectors on AVX-512

70e489b

Optimize precise i32 to f32 conversions on AVX-512 for vector sizes l…

a5f1b3a

…ess than 512

Optimize 128-bit unzip and deinterleave on AVX-512

490f83b

Merge branch 'main' into avx512-yes-really

a234432

Document AVX-512 support in the README

6d5f4ed

Regenerate README

cf18ec3

Shnatsel mentioned this pull request Jun 23, 2026

Add 64-bit integer vectors and operations on them #253

Open

LaurenzV force-pushed the avx512-yes-really branch from 4806bf7 to cf18ec3 Compare June 27, 2026 20:18

LaurenzV reviewed Jun 27, 2026
