Add 64-bit integer vectors and operations on them #253

Open
Shnatsel wants to merge 74 commits into linebender:main from Shnatsel:64-bit-ints

@Shnatsel

Contributor

Stacked on top of #231 because many 64-bit ops (e.g. min/max) were only added in AVX-512.

Supersedes #97

Shnatsel added 30 commits May 24, 2026 18:24
…edicated AVX-512 implementations for complex int/float vector operations that benefit the most.

LLM summary of the changes:

Implemented:
- Added `X86::Avx512` in the generator with Ice Lake feature set, `native_width = 512`, `max_block_size = 512`.
- Generated new `fearless_simd/src/generated/avx512.rs`.
- Wired public API: `Avx512`, `x86::Avx512`, `Level::Avx512`, `Level::as_avx512`, dispatch, and `kernel!` support.
- Updated runtime/static detection so Ice Lake AVX-512 is selected before AVX2, while `as_avx2()` and `as_sse4_2()` downgrade correctly.
- Bumped MSRV/docs/CI/check-target metadata to Rust 1.89.
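
To illustrate the detection order and the downgrade behavior described in the list above, here is a minimal sketch. The `Level` enum, the `CpuFeatures` struct, and the method bodies are hypothetical stand-ins written for this example, not the crate's actual definitions; the real `as_avx2()` / `as_sse4_2()` may well return different types.

```rust
/// Hypothetical stand-in for the crate's `Level`; variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Fallback,
    Sse4_2,
    Avx2,
    Avx512,
}

/// Hypothetical summary of what the CPU supports (not a real crate type).
#[derive(Debug, Clone, Copy)]
struct CpuFeatures {
    sse4_2: bool,
    avx2: bool,
    avx512: bool,
}

impl Level {
    /// Pick the widest supported level: AVX-512 is checked before AVX2.
    fn detect(cpu: CpuFeatures) -> Level {
        if cpu.avx512 {
            Level::Avx512
        } else if cpu.avx2 {
            Level::Avx2
        } else if cpu.sse4_2 {
            Level::Sse4_2
        } else {
            Level::Fallback
        }
    }

    /// Downgrade to AVX2 if this level is AVX2 or wider.
    fn as_avx2(self) -> Option<Level> {
        match self {
            Level::Avx512 | Level::Avx2 => Some(Level::Avx2),
            _ => None,
        }
    }

    /// Downgrade to SSE4.2 if this level is SSE4.2 or wider.
    fn as_sse4_2(self) -> Option<Level> {
        match self {
            Level::Avx512 | Level::Avx2 | Level::Sse4_2 => Some(Level::Sse4_2),
            _ => None,
        }
    }
}
```

In this model, a CPU reporting all three feature flags resolves to `Level::Avx512`, yet code written against AVX2 or SSE4.2 can still obtain a usable level through the downgrade methods.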

Generator/backend behavior:
- 512-bit vectors use native `__m512`, `__m512d`, and `__m512i`.
- AVX-512 masks now use raw compact `__mmask8/16/32/64` storage, with no aligned wrapper.
- Generic `SimdFrom<__mmask*, S>` / `From<mask*, __mmask*>` now route through `from_bitmask` / `to_bitmask`, so they are correct for non-AVX-512 `S` too.
- Added AVX-512 compare/select paths using mask-returning compares and mask blends.
- Added direct conversion paths, including `f32 <-> i32/u32` and `u8 <-> u16`.
- Added AVX-512 vector slides for vectors only; masks intentionally have no slide support.
- Added dedicated AVX-512 zip/unzip/interleave/deinterleave using `permutex2var`, especially for 256/512-bit widths.
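
The compact-mask and compare/select items above can be modeled in scalar code. The sketch below is not the generated backend: the function names are hypothetical, and it assumes lane `i` corresponds to bit `i` of the bitmask (the convention AVX-512 mask registers use). It shows a bitmask round trip plus a 64-bit minimum built from a mask-returning compare followed by a mask blend.

```rust
/// Pack per-lane booleans into a compact bitmask (lane `i` -> bit `i`).
fn to_bitmask(lanes: [bool; 8]) -> u8 {
    let mut bits = 0u8;
    for (i, &lane) in lanes.iter().enumerate() {
        if lane {
            bits |= 1 << i;
        }
    }
    bits
}

/// Expand a compact bitmask back into per-lane booleans.
fn from_bitmask(bits: u8) -> [bool; 8] {
    core::array::from_fn(|i| (bits >> i) & 1 == 1)
}

/// Mask-returning compare: bit `i` is set where `a[i] < b[i]`.
fn cmp_lt_mask(a: [i64; 8], b: [i64; 8]) -> u8 {
    let mut bits = 0u8;
    for i in 0..8 {
        if a[i] < b[i] {
            bits |= 1 << i;
        }
    }
    bits
}

/// Mask blend: take `if_true[i]` where bit `i` is set, else `if_false[i]`.
fn select(mask: u8, if_true: [i64; 8], if_false: [i64; 8]) -> [i64; 8] {
    core::array::from_fn(|i| if (mask >> i) & 1 == 1 { if_true[i] } else { if_false[i] })
}

/// Lane-wise minimum expressed as compare + blend.
fn min_i64x8(a: [i64; 8], b: [i64; 8]) -> [i64; 8] {
    select(cmp_lt_mask(a, b), a, b)
}
```

The point of the compact representation is that the whole mask for eight 64-bit lanes fits in a single byte, instead of occupying a full vector of all-ones/all-zeros lanes.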

Tests/coverage:
- Extended `#[simd_test]` to include AVX-512.
- Added AVX-512 detection/dispatch coverage.
- Updated mask bitwise tests for canonical boolean mask lanes.
- Added a regression test that AVX-512 mask public types are compact and match `__mmask*` sizes.
…ackend, and specialize it for AVX-512. Add test coverage that sets every single bit and verifies it was set correctly.
…rage. Only for the 8-bit left shift does LLVM autovectorize the scalar fallback into GFNI instructions on 256-bit halves, which emits more instructions but schedules better and ends up slightly faster according to llvm-mca on Sapphire Rapids. The difference isn't huge, though, and I don't want to rely on autovectorization because of its fragility.
… so they didn't show up earlier when I removed those methods.
…ppy --tests` without a reported location; I've failed to isolate it to a specific crate and suppress it there
…an't enforce Pod without an external dependency.
# Conflicts:
#	fearless_simd/src/generated/avx2.rs
#	fearless_simd/src/generated/neon.rs
#	fearless_simd/src/generated/sse4_2.rs
#	fearless_simd/src/generated/wasm.rs
#	fearless_simd_gen/src/generic.rs
#	fearless_simd_gen/src/level.rs
…ame name but different semantics from the production code to avoid confusion
Shnatsel added 29 commits June 17, 2026 12:33
Includes the regenerated AVX-512 output from the same generator update.
Includes regenerated AVX-512 slide helpers for the same safety cleanup.
Includes regenerated AVX-512 interleaved load/store output.
@Shnatsel

Contributor Author

The documentation for load/store_interleaved_128 was misleading. Both formulations are valid for 32-bit elements, but the 8- and 16-bit element versions already behaved differently: they followed the NEON vld4/vst4 semantics rather than our documented semantics. This misled me into generalizing the op to 64-bit numbers incorrectly.

I've changed the implementation back to vld4/vst4 semantics in subsequent commits and updated the documentation.
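
For reference, here is a scalar sketch of what vld4/vst4-style semantics mean for any element width. The function names and the const-generic shape are hypothetical and chosen for this illustration; they are not the crate's API. The load reads `4 * N` consecutive elements and sends element `4 * i + j` to lane `i` of output vector `j`; the store is the inverse.

```rust
/// Scalar model of a NEON `vld4`-style deinterleaving load.
/// Element `4 * i + j` of `src` ends up in lane `i` of output vector `j`.
fn load_interleaved<T: Copy + Default, const N: usize>(src: &[T]) -> [[T; N]; 4] {
    assert_eq!(src.len(), 4 * N);
    let mut out = [[T::default(); N]; 4];
    for i in 0..N {
        for j in 0..4 {
            out[j][i] = src[4 * i + j];
        }
    }
    out
}

/// Scalar model of a NEON `vst4`-style interleaving store (inverse of the load).
fn store_interleaved<T: Copy, const N: usize>(vecs: [[T; N]; 4]) -> Vec<T> {
    let mut dst = Vec::with_capacity(4 * N);
    for i in 0..N {
        // Emit lane `i` of each of the four vectors, in order.
        for v in &vecs {
            dst.push(v[i]);
        }
    }
    dst
}
```

With 64-bit elements and 128-bit vectors (`N = 2`), the input `[0, 1, 2, 3, 4, 5, 6, 7]` deinterleaves into `[0, 4]`, `[1, 5]`, `[2, 6]`, `[3, 7]` under this model.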
