Skip to content

Add AVX-512 support#231

Open
Shnatsel wants to merge 67 commits into
linebender:mainfrom
Shnatsel:avx512-yes-really
Open

Add AVX-512 support#231
Shnatsel wants to merge 67 commits into
linebender:mainfrom
Shnatsel:avx512-yes-really

Conversation

@Shnatsel

@Shnatsel Shnatsel commented May 24, 2026

Copy link
Copy Markdown
Contributor

Yes, really. It's all here. In one humongous PR. Sorry 😅

This is probably best reviewed commit-by-commit. The first commit is still big because the history was getting really messy with changes and rollbacks, and squashing it made it less of a mess.

This also touches other backends in three ways:

  1. set_mask() is now a backend method so it could be specialized per-level
  2. Changes to mask conversion routines to support different internal representations bled into other levels. It occasionally adds an intermediate array but it gets optimized out in practice.
  3. transmute_copy() is wrapped into checked_transmute_copy() and the raw version disallowed after I almost had a horrible accident with it. This could be its own PR but I wanted the insurance right away. This was split and shipped in v0.5.0

Everything changed here should be covered by tests. I've expanded test coverage where it was lacking.

Closes #179

Shnatsel added 20 commits May 24, 2026 18:24
…edicated AVX-512 implementations for complex int/float vector operations that benefit the most.

LLM summary of the changes:

Implemented:
- Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`.
- Generated new `fearless_simd/src/generated/avx512.rs`.
- Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support.
- Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly.
- Bumped MSRV/docs/CI/check-target metadata to Rust 1.89.

Generator/backend behavior:
- 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`.
- AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper.
- Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too.
- Added AVX-512 compare/select paths using mask-returning compares and mask blends.
- Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`.
- Added AVX-512 vector slides for vectors only; masks intentionally have no slide support.
- Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths.

Tests/coverage:
- Extended `#[simd_test]` to include AVX-512.
- Added AVX-512 detection/dispatch coverage.
- Updated mask bitwise tests for canonical boolean mask lanes.
- Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
…ackend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.
…rage. Only for 8-bit left shift LLVM autovectorizes the scalar fallback into GFNI instructions on 256-bit halves which emits more instructions but schedules better and ends up being slightly faster according to llvm-mca on sapphire rapids; but the difference isn't huge and I don't want to rely on autovectorization because of its fragility.
… so they didn't show up earlier when I removed those methods.
@Shnatsel Shnatsel mentioned this pull request May 24, 2026
@LaurenzV

Copy link
Copy Markdown
Collaborator

I think it would indeed be great to have a custom PR for 3.

@Shnatsel

Copy link
Copy Markdown
Contributor Author

It will cause a lot of conflicts if I try to split it, but I have it isolated to its own commit at least: f08f7e6

@Shnatsel

Copy link
Copy Markdown
Contributor Author

I've researched whether the instruction set we chose is forward-compatible with Intel's upcoming AVX10. It is: according to the Intel AVX10 architecture specification revision 7.0, all AVX10 CPUs include the AVX-512 features from Ice Lake (our target) as well as Sapphire Rapids (higher than our target but doesn't add anything particularly useful).

@Shnatsel

Copy link
Copy Markdown
Contributor Author

I've run Vello benchmarks on Zen4, which doesn't even have native 512-bit vectors, and it slashes the end-to-end rendering benchmarks by about 15%!

Full benchmark run

$ cargo bench --bench main render_strips/ -- --load-baseline=avx512-8442ef44 --baseline=main-8442ef44
    Finished `bench` profile [optimized] target(s) in 0.09s
     Running benches/main.rs (/home/shnatsel/Code/vello/target/release/deps/main-efa34b2edd34e856)
render_strips/Ghostscript_Tiger_simd
                        time:   [125.64 µs 125.77 µs 125.94 µs]
                        change: [-16.343% -16.180% -16.011%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 50 measurements (18.00%)
  3 (6.00%) high mild
  6 (12.00%) high severe
render_strips/coat_of_arms_simd
                        time:   [1.3414 ms 1.3442 ms 1.3470 ms]
                        change: [-14.185% -14.081% -13.961%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 50 measurements (10.00%)
  5 (10.00%) high mild
render_strips/Saimaa_Canal_(map)_simd
                        time:   [3.6695 ms 3.6715 ms 3.6744 ms]
                        change: [-11.730% -11.510% -11.303%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 50 measurements (20.00%)
  2 (4.00%) high mild
  8 (16.00%) high severe
render_strips/heraldry_simd
                        time:   [883.81 µs 883.88 µs 883.97 µs]
                        change: [-15.411% -15.239% -15.078%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 50 measurements (14.00%)
  1 (2.00%) high mild
  6 (12.00%) high severe

@LaurenzV

Copy link
Copy Markdown
Collaborator

(Sorry, meant to push to my private branch 😅 )

@Shnatsel

Copy link
Copy Markdown
Contributor Author

No worries, I have a local backup. I'm glad you're looking into this!

@LaurenzV LaurenzV left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, disclaimer, I have not actually tried to deeply understand how most of the more complex operations are implemented, I'm relying on my trust and the extensive tests here. 🙏 I don't have the time to try to validate all of this manually.

I did try to read through all code though, so just some comments here and there. For me, it's just important that we land some form of #228 before making a new release.

Apart from that, some other things:

  • I think the mk_86 code in fearless_simd_gen is getting pretty convoluted with all of the special casing for AVX512... I'm wondering if there is room for improvement in the future, but no idea.
  • I'm also wondering whether in the future there is some way of having all of the required ffeatures be auto-generated so they only need to be defined once in fearless_simd_gen, but also something for the future not now.

Comment thread check_targets.sh
Comment on lines +18 to +19
RUSTFLAGS=-Ctarget-cpu=icelake-server cargo check -p fearless_simd --target x86_64-unknown-linux-gnu
RUSTFLAGS=-Ctarget-cpu=icelake-server cargo check -p fearless_simd --target x86_64-unknown-linux-gnu --features force_support_fallback

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just set the AVX512 feature flags here, like below? Also, do we need to update the commands below to activate all feature flags that were added to SSE4.2/AVX2 a while ago?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it makes sense to keep it shorter, but the invocations below probably need to be updated (in a follow-up), no? Since they are missing the other target features we require. Or am I missing something?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah I missed that script, it does need updating for v2/v3 targets, good call

Comment thread Cargo.toml
clippy.fn_to_numeric_cast_any = "warn"
clippy.infinite_loop = "warn"
clippy.large_stack_arrays = "warn"
clippy.large_stack_arrays = "allow" # appears to be buggy as of 1.93, fixed in 1.95. TODO: re-enable

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would changing the MSRV from 1.88 to 1.89 impact this then?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe Clippy is run on MSRV instead of latest stable on CI.

We could probably switch it to latest stable now that it respects the crate MSRV, but that would randomly cause CI to fail on main because of Clippy adding new lints in later releases, and dealing with that is rather miserable in my experience.


#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn x86_detects_icelake_avx512() -> bool {
std::arch::is_x86_feature_detected!("adx")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about f16c, which we use for AVX2?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is implied by avx512f, you can verify it by running:

rustc --print=cfg --target x86_64-unknown-linux-gnu -C target-feature=+avx512f

This will print target_feature="f16c" among other things.

mod soundness;

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn x86_detects_icelake_avx512() -> bool {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to add a comment how this list was derived. I presume using rustc --print=cfg --target x86_64-unknown-linux-gnu -C target-cpu=icelake-server?


#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[test]
fn avx512_masks_are_compact() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, but this seems a bit superfluous to test.

}

pub(crate) fn handle_zip(&self, op: Op, vec_ty: &VecType, select_low: bool) -> TokenStream {
if *self == Self::Avx512 && vec_ty.scalar != ScalarType::Mask && vec_ty.n_bits() >= 256 {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this method even be called with masks? If not it seems like the sceond condition can just be omitted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same also for interleave/deinterleave etc. Some other positions as well.

&& target_scalar == ScalarType::Float
&& vec_ty.scalar_bits == 32
{
// We cannot emit the intrinsics for the conversion instructions

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's orefix with a TODO then so we don't forget aout this.

@cfg any(target_arch = "x86", target_arch = "x86_64");
@token_ty $crate::Avx512;
@kernel_attrs #[target_feature(
enable = "adx,aes,avx512bitalg,avx512bw,avx512cd,avx512dq,avx512f,avx512ifma,avx512vbmi,avx512vbmi2,avx512vl,avx512vnni,avx512vpopcntdq,bmi1,bmi2,cmpxchg16b,fma,gfni,lzcnt,movbe,pclmulqdq,popcnt,rdrand,rdseed,sha,vaes,vpclmulqdq,xsave,xsavec,xsaveopt,xsaves"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish there were a more readable and easier-verifiable way for this 😓

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never managed to come up with one. We can't use variables here, and a declarative macro doesn't help either because we write these out in different contexts in slightly different ways, so even the strings we insert aren't the same.

// lower to LLVM intrinsics, they will likely not be optimized until much later in the pipeline (if at all),
// resulting in substantially worse codegen. See https://github.com/linebender/fearless_simd/pull/185.
//
// Safety: The native vector type backing any implementation will be:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the wrong place for a safety comment? There isn't actually any unsafe here, shouldn't this be in the transmute module (and I think we already have a similar comment there).

val: crate::transmute::checked_transmute_copy(&arch),
simd,
}
let lanes: [i8; 32usize] = crate::transmute::checked_transmute_copy(&arch);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, could this be avoided by specializing the SimdFrom impls for specific backends instad of making them generic over Simd?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess? I didn't see the point on complicating the generator further for the sake of this, since it's just two transmutes in a row anyway and optimizes into a by-value transmute.

@Shnatsel

Copy link
Copy Markdown
Contributor Author

I think the mk_86 code in fearless_simd_gen is getting pretty convoluted with all of the special casing for AVX512... I'm wondering if there is room for improvement in the future, but no idea.

I agree. However, much of it is genuinely shared with the other levels, e.g. all the basic math operations emitting mostly the same intrinsics with 256 swapped for 512, so I'm not quite sure what the right cut points would be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mask types for AVX-512

2 participants