Add AVX-512 support#231
Conversation
…edicated AVX-512 implementations for complex int/float vector operations that benefit the most. LLM summary of the changes: Implemented: - Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`. - Generated new `fearless_simd/src/generated/avx512.rs`. - Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support. - Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly. - Bumped MSRV/docs/CI/check-target metadata to Rust 1.89. Generator/backend behavior: - 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`. - AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper. - Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too. - Added AVX-512 compare/select paths using mask-returning compares and mask blends. - Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`. - Added AVX-512 vector slides for vectors only; masks intentionally have no slide support. - Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths. Tests/coverage: - Extended `#[simd_test]` to include AVX-512. - Added AVX-512 detection/dispatch coverage. - Updated mask bitwise tests for canonical boolean mask lanes. - Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
…nt the spooky bug I almost introduced
…rage for these ops.
…calar, now we use the dedicated intrinsics.
…ackend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.
… test to exercise it. i8/u8 test is still bad because of rust-lang/rust#156891
…rage. Only for 8-bit left shift LLVM autovectorizes the scalar fallback into GFNI instructions on 256-bit halves which emits more instructions but schedules better and ends up being slightly faster according to llvm-mca on sapphire rapids; but the difference isn't huge and I don't want to rely on autovectorization because of its fragility.
…it vectors on AVX-512; expand test coverage
… no cost to throughput
… so they didn't show up earlier when I removed those methods.
…e get dead code warnings
…ppy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there
|
I think it would indeed be great to have a custom PR for 3. |
|
It will cause a lot of conflicts if I try to split it, but I have it isolated to its own commit at least: f08f7e6 |
Includes the regenerated AVX-512 output from the same generator update.
Includes regenerated AVX-512 slide helpers for the same safety cleanup.
|
I've researched whether the instruction set we chose is forward-compatible with Intel's upcoming AVX10. It is: according to the Intel AVX10 architecture specification revision 7.0, all AVX10 CPUs include the AVX-512 features from Ice Lake (our target) as well as Sapphire Rapids (higher than our target but doesn't add anything particularly useful). |
Includes regenerated AVX-512 interleaved load/store output.
|
I've run Vello benchmarks on Zen4, which doesn't even have native 512-bit vectors, and it slashes the end-to-end rendering benchmarks by about 15%! Full benchmark run
|
4806bf7 to
cf18ec3
Compare
|
(Sorry, meant to push to my private branch 😅 ) |
|
No worries, I have a local backup. I'm glad you're looking into this! |
LaurenzV
left a comment
There was a problem hiding this comment.
So, disclaimer, I have not actually tried to deeply understand how most of the more complex operations are implemented, I'm relying on my trust and the extensive tests here. 🙏 I don't have the time to try to validate all of this manually.
I did try to read through all code though, so just some comments here and there. For me, it's just important that we land some form of #228 before making a new release.
Apart from that, some other things:
- I think the mk_86 code in
fearless_simd_genis getting pretty convoluted with all of the special casing for AVX512... I'm wondering if there is room for improvement in the future, but no idea. - I'm also wondering whether in the future there is some way of having all of the required ffeatures be auto-generated so they only need to be defined once in fearless_simd_gen, but also something for the future not now.
| RUSTFLAGS=-Ctarget-cpu=icelake-server cargo check -p fearless_simd --target x86_64-unknown-linux-gnu | ||
| RUSTFLAGS=-Ctarget-cpu=icelake-server cargo check -p fearless_simd --target x86_64-unknown-linux-gnu --features force_support_fallback |
There was a problem hiding this comment.
Why not just set the AVX512 feature flags here, like below? Also, do we need to update the commands below to activate all feature flags that were added to SSE4.2/AVX2 a while ago?
There was a problem hiding this comment.
I guess it makes sense to keep it shorter, but the invocations below probably need to be updated (in a follow-up), no? Since they are missing the other target features we require. Or am I missing something?
There was a problem hiding this comment.
oh yeah I missed that script, it does need updating for v2/v3 targets, good call
| clippy.fn_to_numeric_cast_any = "warn" | ||
| clippy.infinite_loop = "warn" | ||
| clippy.large_stack_arrays = "warn" | ||
| clippy.large_stack_arrays = "allow" # appears to be buggy as of 1.93, fixed in 1.95. TODO: re-enable |
There was a problem hiding this comment.
Why would changing the MSRV from 1.88 to 1.89 impact this then?
There was a problem hiding this comment.
I believe Clippy is run on MSRV instead of latest stable on CI.
We could probably switch it to latest stable now that it respects the crate MSRV, but that would randomly cause CI to fail on main because of Clippy adding new lints in later releases, and dealing with that is rather miserable in my experience.
|
|
||
| #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] | ||
| fn x86_detects_icelake_avx512() -> bool { | ||
| std::arch::is_x86_feature_detected!("adx") |
There was a problem hiding this comment.
What about f16c, which we use for AVX2?
There was a problem hiding this comment.
It is implied by avx512f, you can verify it by running:
rustc --print=cfg --target x86_64-unknown-linux-gnu -C target-feature=+avx512f
This will print target_feature="f16c" among other things.
| mod soundness; | ||
|
|
||
| #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] | ||
| fn x86_detects_icelake_avx512() -> bool { |
There was a problem hiding this comment.
It would also be good to add a comment how this list was derived. I presume using rustc --print=cfg --target x86_64-unknown-linux-gnu -C target-cpu=icelake-server?
|
|
||
| #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] | ||
| #[test] | ||
| fn avx512_masks_are_compact() { |
There was a problem hiding this comment.
Up to you, but this seems a bit superfluous to test.
| } | ||
|
|
||
| pub(crate) fn handle_zip(&self, op: Op, vec_ty: &VecType, select_low: bool) -> TokenStream { | ||
| if *self == Self::Avx512 && vec_ty.scalar != ScalarType::Mask && vec_ty.n_bits() >= 256 { |
There was a problem hiding this comment.
Can this method even be called with masks? If not it seems like the sceond condition can just be omitted.
There was a problem hiding this comment.
Same also for interleave/deinterleave etc. Some other positions as well.
| && target_scalar == ScalarType::Float | ||
| && vec_ty.scalar_bits == 32 | ||
| { | ||
| // We cannot emit the intrinsics for the conversion instructions |
There was a problem hiding this comment.
Let's orefix with a TODO then so we don't forget aout this.
| @cfg any(target_arch = "x86", target_arch = "x86_64"); | ||
| @token_ty $crate::Avx512; | ||
| @kernel_attrs #[target_feature( | ||
| enable = "adx,aes,avx512bitalg,avx512bw,avx512cd,avx512dq,avx512f,avx512ifma,avx512vbmi,avx512vbmi2,avx512vl,avx512vnni,avx512vpopcntdq,bmi1,bmi2,cmpxchg16b,fma,gfni,lzcnt,movbe,pclmulqdq,popcnt,rdrand,rdseed,sha,vaes,vpclmulqdq,xsave,xsavec,xsaveopt,xsaves" |
There was a problem hiding this comment.
I really wish there were a more readable and easier-verifiable way for this 😓
There was a problem hiding this comment.
I never managed to come up with one. We can't use variables here, and a declarative macro doesn't help either because we write these out in different contexts in slightly different ways, so even the strings we insert aren't the same.
| // lower to LLVM intrinsics, they will likely not be optimized until much later in the pipeline (if at all), | ||
| // resulting in substantially worse codegen. See https://github.com/linebender/fearless_simd/pull/185. | ||
| // | ||
| // Safety: The native vector type backing any implementation will be: |
There was a problem hiding this comment.
Isn't this the wrong place for a safety comment? There isn't actually any unsafe here, shouldn't this be in the transmute module (and I think we already have a similar comment there).
| val: crate::transmute::checked_transmute_copy(&arch), | ||
| simd, | ||
| } | ||
| let lanes: [i8; 32usize] = crate::transmute::checked_transmute_copy(&arch); |
There was a problem hiding this comment.
In the future, could this be avoided by specializing the SimdFrom impls for specific backends instad of making them generic over Simd?
There was a problem hiding this comment.
I guess? I didn't see the point on complicating the generator further for the sake of this, since it's just two transmutes in a row anyway and optimizes into a by-value transmute.
I agree. However, much of it is genuinely shared with the other levels, e.g. all the basic math operations emitting mostly the same intrinsics with 256 swapped for 512, so I'm not quite sure what the right cut points would be. |
Yes, really. It's all here. In one humongous PR. Sorry 😅
This is probably best reviewed commit-by-commit. The first commit is still big because the history was getting really messy with changes and rollbacks, and squashing it made it less of a mess.
This also touches other backends in three ways:
set_mask()is now a backend method so it could be specialized per-leveltransmute_copy()is wrapped intochecked_transmute_copy()and the raw version disallowed after I almost had a horrible accident with it.This could be its own PR but I wanted the insurance right away.This was split and shipped in v0.5.0Everything changed here should be covered by tests. I've expanded test coverage where it was lacking.
Closes #179