perf: AVX2 8888 load/store + f32x16 save specializations #175

Open
bhark wants to merge 3 commits into linebender:main from bhark:perf/avx2-8888

Conversation

@bhark

bhark commented May 23, 2026

This PR is part of a series linked to #174.

What

AVX2 specializations for these primitives:

  • lowp::load_8888 / lowp::store_8888 (these are the bulk of source_over_rgba)
  • lowp::load_8
  • highp::load_8888 / highp::store_8888
  • f32x16::save_to_u16x16

Gated on feature = "simd" and target_feature = "avx2", so the SSE2, SSE4.1, AVX, NEON, and WASM SIMD paths are unchanged.
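As a rough illustration of this kind of compile-time gating (a minimal sketch, not the PR's code): the names `AVX2_COMPILED_IN` and `backend` below are hypothetical, and the `feature = "simd"` half of the gate is left out so the snippet compiles outside the crate.

```rust
/// True only when the compiler was told that AVX2 is available,
/// e.g. via `-Ctarget-feature=+avx2`. Hypothetical helper for illustration.
pub const AVX2_COMPILED_IN: bool = cfg!(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
));

/// Specialized arm: compiled only when AVX2 is statically enabled.
#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
))]
pub fn backend() -> &'static str {
    "avx2"
}

/// Fallback arm: every other target and feature set keeps the existing path.
#[cfg(not(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
)))]
pub fn backend() -> &'static str {
    "portable"
}
```

Because the selection happens at compile time, exactly one `backend` exists in any given build, and targets without AVX2 never see the specialized arm.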

Why

On AVX2, each of these primitives collapses a long scalar shuffle sequence into a handful of intrinsics. As a result, lowp::source_over_rgba drops from ~350 instructions to about 50.
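For context, here is a scalar sketch of what an 8888 load/store does conceptually: it converts 16 interleaved RGBA8 pixels into four per-channel planes and back. The signatures and the `Planes` alias are assumptions made for illustration; the crate's real functions operate on its own pixel and vector types.

```rust
/// Four channel planes (r, g, b, a) for 16 pixels, one u16 lane per pixel.
/// Hypothetical type, used only in this sketch.
pub type Planes = [[u16; 16]; 4];

/// Scalar reference load: split 16 interleaved RGBA8 pixels into planes.
pub fn load_8888_scalar(pixels: &[[u8; 4]; 16]) -> Planes {
    let mut planes = [[0u16; 16]; 4];
    for (i, px) in pixels.iter().enumerate() {
        for c in 0..4 {
            // Widen each 8-bit channel into its own 16-bit lane.
            planes[c][i] = u16::from(px[c]);
        }
    }
    planes
}

/// Scalar reference store: interleave the planes back into RGBA8 pixels,
/// truncating each lane with `as u8`.
pub fn store_8888_scalar(planes: &Planes) -> [[u8; 4]; 16] {
    let mut pixels = [[0u8; 4]; 16];
    for i in 0..16 {
        for c in 0..4 {
            pixels[i][c] = planes[c][i] as u8;
        }
    }
    pixels
}
```

Every lane here is moved individually, which is the kind of per-element shuffling that a vectorized version can replace with a few wide loads, shuffles, and stores.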

Results

| Benchmark | Speedup |
| --- | --- |
| Geomean (55 benches) | 1.95x |
| blend::destination_atop | 3.83x |
| gradients::two_stops_linear_pad | 3.56x |
| blend::source_over | 2.53x |
| fill::rect | 2.04x |
| gradients::three_stops_linear_even | 1.92x |
| hairline::aa | 1.34x |
| blend::clear, fill::opaque (don't hit specialized paths) | 1.00x |

I found no regressions anywhere.

Notes

Some nuances to this:

  • The AVX2 lowp::store_8888 ORs the channels together instead of truncating each one with as u8. This shouldn't be an issue unless something produces channel values > 255 in lowp, in which case the AVX2 path would surface a visual artifact where the scalar path truncates silently.
  • The style follows the existing src/wide/i32x8_t.rs / u32x8_t.rs patterns.
  • Default cargo builds naturally won't see this. You'll have to build with -Ctarget-cpu=haswell, -Ctarget-feature=+avx2, or -Ctarget-cpu=native.
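The first note can be made concrete with a scalar model. This is a sketch under the assumption that the OR-based store behaves like shifting each 16-bit channel into place and ORing the results; the function names are hypothetical and the real AVX2 code operates on whole vectors.

```rust
/// Pack one pixel the way the scalar path does: truncate each channel
/// to 8 bits with `as u8`, then combine.
pub fn pack_truncating(r: u16, g: u16, b: u16, a: u16) -> u32 {
    u32::from(r as u8)
        | (u32::from(g as u8) << 8)
        | (u32::from(b as u8) << 16)
        | (u32::from(a as u8) << 24)
}

/// Pack one pixel by shifting and ORing the untruncated channels
/// (assumed model of the OR-based store).
pub fn pack_or(r: u16, g: u16, b: u16, a: u16) -> u32 {
    u32::from(r) | (u32::from(g) << 8) | (u32::from(b) << 16) | (u32::from(a) << 24)
}
```

For in-range channels the two agree: both pack `(10, 20, 30, 255)` to `0xFF1E140A`. With an out-of-range red of 256, `pack_truncating(256, 0, 0, 255)` gives `0xFF000000`, while `pack_or` gives `0xFF000100`, so the overflow bit leaks into the green channel instead of being dropped.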

@RazrFalcon

Collaborator

Looks good, but I would avoid direct SIMD intrinsics calls and cfg_if in highp/lowp code. All of them should be nicely wrapped in the wide module.

@bhark

bhark commented May 26, 2026

Author

@RazrFalcon And there's that too. Is this more aligned with what you had in mind?

Comment thread src/wide/f32x16_t.rs Outdated
dst.0[15] = n1[7] as u16;
cfg_if::cfg_if! {
if #[cfg(all(feature = "simd", target_feature = "avx2"))] {
#[cfg(target_arch = "x86")]

Collaborator

Let's use a global use instead, like we do in u32x4_t.rs.

Author

Moved to module scope now. Also applied to u16x16_t.rs to keep things consistent.

@RazrFalcon

Collaborator

Yes, looks much better now. The idea is to hide all low-level SIMD stuff in the wide module.
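A minimal sketch of that layering, for readers following the thread. The `U16x16` type, its `wrapping_add` method, and the `brighten` caller are hypothetical stand-ins, not the crate's actual API. The intrinsics are imported once at module scope and used only inside the wrapper, so calling code contains no cfg or intrinsic calls.

```rust
// Module-scope imports, gated once, instead of per-function cfg blocks.
#[cfg(all(target_arch = "x86", target_feature = "avx2"))]
use core::arch::x86::*;
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
use core::arch::x86_64::*;

/// Hypothetical 16-lane u16 vector wrapper (illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct U16x16(pub [u16; 16]);

impl U16x16 {
    /// Lane-wise wrapping add, AVX2 arm.
    #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "avx2"))]
    pub fn wrapping_add(self, rhs: Self) -> Self {
        let mut out = [0u16; 16];
        // SAFETY: AVX2 is statically enabled by the cfg gate, and the
        // unaligned loads/stores each cover exactly the 32 bytes of a [u16; 16].
        unsafe {
            let a = _mm256_loadu_si256(self.0.as_ptr() as *const __m256i);
            let b = _mm256_loadu_si256(rhs.0.as_ptr() as *const __m256i);
            _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi16(a, b));
        }
        Self(out)
    }

    /// Lane-wise wrapping add, portable arm.
    #[cfg(not(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "avx2")))]
    pub fn wrapping_add(self, rhs: Self) -> Self {
        let mut out = [0u16; 16];
        for i in 0..16 {
            out[i] = self.0[i].wrapping_add(rhs.0[i]);
        }
        Self(out)
    }
}

/// Pipeline-style caller: sees only the wrapper, never the intrinsics.
pub fn brighten(px: U16x16, amount: u16) -> U16x16 {
    px.wrapping_add(U16x16([amount; 16]))
}
```

Both arms produce the same result, so `brighten` behaves identically regardless of which one the build selects.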

@RazrFalcon

Collaborator

Good. Hopefully someone from the linebender team will merge it soon. @nicoburns?
I think we can merge #172, #175, #176 and #177.
