Add 64-bit integer vectors and operations on them by Shnatsel · Pull Request #253 · linebender/fearless_simd

Shnatsel · 2026-06-23T13:03:12Z

Stacked on top of #231 because many 64-bit ops (e.g. min/max) were only added in AVX-512
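To illustrate why 64-bit min/max need extra work below AVX-512: AVX2 has a signed 64-bit greater-than compare (`_mm256_cmpgt_epi64`) but no 64-bit max instruction, so one common emulation is a compare followed by a blend. The sketch below assumes that approach and is not code from this PR; `max_i64x4` and `max_i64x4_avx2` are hypothetical names.

```rust
/// Lane-wise maximum of four signed 64-bit integers.
/// Hypothetical helper for illustration; not part of fearless_simd.
pub fn max_i64x4(a: [i64; 4], b: [i64; 4]) -> [i64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { max_i64x4_avx2(a, b) };
        }
    }
    // Portable scalar fallback for other targets or CPUs without AVX2.
    core::array::from_fn(|i| a[i].max(b[i]))
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn max_i64x4_avx2(a: [i64; 4], b: [i64; 4]) -> [i64; 4] {
    use std::arch::x86_64::*;
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr().cast());
        let vb = _mm256_loadu_si256(b.as_ptr().cast());
        // All-ones in every 64-bit lane where a > b (signed), zero elsewhere.
        let a_gt_b = _mm256_cmpgt_epi64(va, vb);
        // Take bytes from `va` where the mask is set, otherwise from `vb`.
        let max = _mm256_blendv_epi8(vb, va, a_gt_b);
        let mut out = [0i64; 4];
        _mm256_storeu_si256(out.as_mut_ptr().cast(), max);
        out
    }
}
```

With AVX-512 the same operation is a single instruction (`_mm512_max_epi64`, or the 128/256-bit forms with AVX-512VL).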

Supersedes #97

…edicated AVX-512 implementations for complex int/float vector operations that benefit the most.

LLM summary of the changes:

Implemented:
- Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`.
- Generated new `fearless_simd/src/generated/avx512.rs`.
- Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support.
- Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly.
- Bumped MSRV/docs/CI/check-target metadata to Rust 1.89.

Generator/backend behavior:
- 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`.
- AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper.
- Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too.
- Added AVX-512 compare/select paths using mask-returning compares and mask blends.
- Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`.
- Added AVX-512 vector slides for vectors only; masks intentionally have no slide support.
- Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths.

Tests/coverage:
- Extended `#[simd_test]` to include AVX-512.
- Added AVX-512 detection/dispatch coverage.
- Updated mask bitwise tests for canonical boolean mask lanes.
- Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
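The difference between compact AVX-512 masks and the lane-sized masks used by other backends can be modelled in scalar code. The sketch below is an assumption for illustration only: it borrows the names `to_bitmask` and `from_bitmask` from the summary above, but the signatures and the fixed 16-lane `i32` layout are hypothetical and not the crate's API.

```rust
/// Scalar model of a 16-lane mask on a non-AVX-512 backend:
/// each lane is a full i32, with true = -1 (all bits set) and false = 0.
/// Packs the lanes into one bit per lane, like a `__mmask16` would hold.
pub fn to_bitmask(lanes: [i32; 16]) -> u16 {
    let mut bits = 0u16;
    for (i, lane) in lanes.iter().enumerate() {
        // A lane counts as "true" when its sign bit is set.
        if *lane < 0 {
            bits |= 1 << i;
        }
    }
    bits
}

/// Expands a compact bitmask back into canonical all-ones / all-zeros lanes.
pub fn from_bitmask(bits: u16) -> [i32; 16] {
    core::array::from_fn(|i| if ((bits >> i) & 1) == 1 { -1 } else { 0 })
}
```

In this model the compact form takes 2 bytes while the lane-sized form takes 64, and converting through the bitmask works regardless of which representation a backend stores.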

…rage. Only for the 8-bit left shift does LLVM autovectorize the scalar fallback into GFNI instructions on 256-bit halves, which emits more instructions but schedules better and ends up slightly faster according to llvm-mca on Sapphire Rapids. The difference isn't huge, though, and I don't want to rely on autovectorization because of its fragility.

Shnatsel · 2026-06-23T14:12:09Z

The documentation for load/store_interleaved_128 was misleading. Both formulations are valid for 32-bit elements but the 8- and 16-bit elements already behaved differently, following the NEON vld4/vst4 semantics rather than our documented semantics. This misled me into generalizing the op to 64-bit numbers incorrectly.

I've changed the implementation back to vld4/vst4 semantics in subsequent commits and updated documentation.
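A scalar model can make the vld4/vst4 layout concrete. The sketch below assumes the usual NEON meaning of a 4-way deinterleaving load: element `i` in memory lands in output vector `i % 4`, lane `i / 4`. The function names are hypothetical and this is not the crate's implementation.

```rust
/// Scalar model of a vld4-style load: deinterleaves 4 * LANES elements
/// into four vectors of LANES elements each.
pub fn load_interleaved_model<T: Copy + Default, const LANES: usize>(
    src: &[T],
) -> [[T; LANES]; 4] {
    assert_eq!(src.len(), 4 * LANES);
    let mut out = [[T::default(); LANES]; 4];
    for (i, &x) in src.iter().enumerate() {
        // Memory element i goes to vector (i % 4), lane (i / 4).
        out[i % 4][i / 4] = x;
    }
    out
}

/// Scalar model of a vst4-style store: the exact inverse of the load above.
pub fn store_interleaved_model<T: Copy, const LANES: usize>(
    vecs: &[[T; LANES]; 4],
) -> Vec<T> {
    (0..4 * LANES).map(|i| vecs[i % 4][i / 4]).collect()
}
```

For 64-bit elements in 128-bit vectors (`LANES = 2`), memory `[0, 1, 2, 3, 4, 5, 6, 7]` deinterleaves into `[0, 4]`, `[1, 5]`, `[2, 6]`, `[3, 7]` under this model.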

Shnatsel added 30 commits May 24, 2026 18:24

Add checked_transmute_copy and ban transmute_copy to statically prevent the spooky bug I almost introduced

f08f7e6

Expand native type conversion test coverage

aef1cac

Rename test: mask_methods.rs -> mask_roundtrip.rs

c12a7cc

Check in the new generated AVX-512 file

9d9adf8

Fix build after file rename

81441cf

Use AVX-512 instructions for f32 -> u32 conversions. Expand test coverage for these ops.

0d6af5d

Optimize load_array/as_array on AVX-512 masks; the initial impl was scalar, now we use the dedicated intrinsics.

025c172

Split set_mask into a backend method so it could be specialized per backend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.

7927383

Optimize load_interleaved/store_interleaved for AVX-512. Add one more test to exercise it. i8/u8 test is still bad because of rust-lang/rust#156891

57de129

Optimize floor/ceil/round_ties_even/trunc/approximate_recip for 512-bit vectors on AVX-512; expand test coverage

f2ba8c9

Use AVX-512 rcp14 for smaller vector sizes too; improves precision at no cost to throughput

9cddbb2

Optimize slide_within_blocks for AVX-512; verified with exhaustive slide test

9d02c3a

Remove stale tests for mask slide APIs; they were under #[cfg(false)] so they didn't show up earlier when I removed those methods.

85b44c9

consistent clippy error messages

1c558ca

satisfy Clippy

6c8f7d7

get rid of useless extra braces

e475ae1

KISS the native type mask roundtrip tests

6f1081f

cargo fmt

1e2a096

Satisfy clippy some more. Hoisted by my own restriction lint.

7fc16d4

Satisfy the toml formatting check

359650d

Stick an #[expect] onto checked_transmute_copy on wasm32, otherwise we get dead code warnings

37df3e3

Suppress an apparently buggy Clippy lint; surfaced only in `cargo clippy --tests` without a reported location, I've failed to isolate it to a specific crate and suppress it there

8825bfb

Satisfy the toml formatter again

cf3ff7d

Add miri opt-outs for extra slow tests

cb5780f

Also enforce that both types are Copy in checked_transmute_copy. We can't enforce Pod without an external dependency.

f55271b

Merge branch 'main' into avx512-yes-really

cd8192c

# Conflicts: # fearless_simd/src/generated/avx2.rs # fearless_simd/src/generated/neon.rs # fearless_simd/src/generated/sse4_2.rs # fearless_simd/src/generated/wasm.rs # fearless_simd_gen/src/generic.rs # fearless_simd_gen/src/level.rs

Fix disallowed methods setup that got mangled in the merge

15f5ab8

Drop a custom transmute_copy wrapper from tests now that it has the same name but different semantics from the production code to avoid confusion

6233743
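The commits above describe `checked_transmute_copy` only by its guarantees: a static check, plus `Copy` bounds on both types. A minimal sketch of how such a wrapper could look follows, assuming the static check is a compile-time size-equality assertion; the body and the `AssertSameSize` helper are hypothetical, not the crate's actual code. Unlike `transmute`, std's `transmute_copy` does not require equal sizes at compile time, which is the gap a wrapper like this closes.

```rust
use core::marker::PhantomData;
use core::mem::{size_of, transmute_copy};

/// Hypothetical helper: evaluating `OK` fails compilation when sizes differ.
#[allow(dead_code)]
struct AssertSameSize<Src, Dst>(PhantomData<(Src, Dst)>);

impl<Src, Dst> AssertSameSize<Src, Dst> {
    const OK: () = assert!(
        size_of::<Src>() == size_of::<Dst>(),
        "checked_transmute_copy requires equally sized types"
    );
}

/// Sketch of a size-checked `transmute_copy`.
///
/// # Safety
/// The bytes of `src` must be a valid value of `Dst`. The `Copy` bounds
/// rule out types with destructors but do not prove validity; that would
/// need something like a `Pod` trait.
pub unsafe fn checked_transmute_copy<Src: Copy, Dst: Copy>(src: &Src) -> Dst {
    // Referencing the constant forces the size check for this Src/Dst pair.
    let () = AssertSameSize::<Src, Dst>::OK;
    // SAFETY: sizes are equal (checked above); validity is the caller's duty.
    unsafe { transmute_copy::<Src, Dst>(src) }
}
```

With this shape, a call such as `checked_transmute_copy::<u32, u64>(&x)` is rejected at compile time instead of being caught later.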

Shnatsel added 29 commits June 17, 2026 12:33

Record no branch-specific changes for PR linebender#240

3c4bcbc

Merge main PR linebender#241: update SDE CI download

2d0595d

Record no branch-specific changes for PR linebender#241

0847ebf

Merge main PR linebender#242: add missing authors

672772f

Record no branch-specific changes for PR linebender#242

1887405

Merge main PR linebender#236: remove unsafe generated intrinsic calls

6e5672a

Includes the regenerated AVX-512 output from the same generator update.

Merge main PR linebender#243: document transmute wrappers

dc4c8fe

Merge main PR linebender#244: remove unsafe generated helpers

ce28db9

Includes regenerated AVX-512 slide helpers for the same safety cleanup.

Merge main PR linebender#246: revert authors additions

ca1759b

Merge main PR linebender#247: explain transmute wrapper motivation

484d1bf

Merge main PR linebender#249: add author entry

7efdb1a

cargo fmt

014e4b7

Merge main PR linebender#245: safer interleaved load/store

9d4f115

Includes regenerated AVX-512 interleaved load/store output.

cargo fmt

d49f6a2

Merge branch 'linebender:main' into avx512-yes-really

00eb81d

Wrap AVX-512-specific codepaths in kernel! instead of unsafe where possible

046ee30

Optimize u32->f32 conversion for 128-bit and 256-bit vectors on AVX-512

70e489b

Optimize precise i32 to f32 conversions on AVX-512 for vector sizes less than 512

a5f1b3a

Optimize 128-bit unzip and deinterleave on AVX-512

490f83b

Merge branch 'main' into avx512-yes-really

a234432

Document AVX-512 support in the README

6d5f4ed

Regenerate README

cf18ec3

Add 64-bit integer vectors and operations on them

8dc0938

Placate Clippy

63fd005

Placate Clippy some more

47312fa

Placate Clippy in tests

93c1cc3

Align u64 load/store interleaved with vld4/vst4 semantics

172f2b7

Emit optimized implementations for load/store_interleaved on sse4.2 and avx2

7602a42

Realign WASM load/store_interleaved impls with vld4/vst4 semantics

d5fae13

Shnatsel wants to merge 74 commits into linebender:main from Shnatsel:64-bit-ints