perf: AVX2 8888 load/store + f32x16 save specializations by bhark · Pull Request #175 · linebender/tiny-skia

bhark · 2026-05-23T10:31:38Z

This PR is part of a series linked to #174.

What

AVX2 specializations for these primitives:

lowp::load_8888 / lowp::store_8888 (these are the bulk of source_over_rgba)
lowp::load_8
highp::load_8888 / highp::store_8888
f32x16::save_to_u16x16

Gated on feature = "simd", target_feature = "avx2", so SSE2/SSE4.1/AVX, NEON, WASM SIMD, etc. are unchanged.
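For readers unfamiliar with this kind of gate, here is a minimal sketch of how a compile-time check on those two conditions behaves. It is an illustration only, not code from the PR, and `active_backend` is a hypothetical helper name.

```rust
/// Hypothetical helper (not part of tiny-skia): reports which code path
/// a gate like the one above selects.
pub fn active_backend() -> &'static str {
    // `cfg!` is resolved at compile time, so the choice is fixed per build:
    // both the crate feature and the AVX2 target feature must be enabled.
    if cfg!(all(feature = "simd", target_feature = "avx2")) {
        "avx2"
    } else {
        "fallback"
    }
}
```

Built without the crate feature or without AVX2 enabled for the target, this returns "fallback", which mirrors why the other SIMD backends are untouched.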

Why

On AVX2, these primitives can collapse a long scalar shuffle into a handful of intrinsics. As a result, lowp::source_over_rgba drops from ~350 instructions to about 50.
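To make the "scalar shuffle versus a handful of intrinsics" point concrete, here is a rough sketch of an 8888 load: 16 interleaved RGBA8 pixels are split into four planar u16 channels. This is not the PR's code. The function names, the array-based signature, and the use of run-time detection (chosen so the sketch runs on any machine, whereas the PR gates at compile time) are all assumptions made for illustration.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Scalar reference: out[0] = R, out[1] = G, out[2] = B, out[3] = A.
pub fn load_8888_scalar(px: &[[u8; 4]; 16]) -> [[u16; 16]; 4] {
    let mut out = [[0u16; 16]; 4];
    for (i, p) in px.iter().enumerate() {
        for c in 0..4 {
            out[c][i] = u16::from(p[c]);
        }
    }
    out
}

/// AVX2 sketch of the same operation (hypothetical, not the PR's code).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn load_8888_avx2(px: &[[u8; 4]; 16]) -> [[u16; 16]; 4] {
    unsafe {
        let bytes = px.as_ptr() as *const u8;
        // Two unaligned 256-bit loads cover all 16 pixels (64 bytes).
        let lo = _mm256_loadu_si256(bytes as *const __m256i); // pixels 0..8
        let hi = _mm256_loadu_si256(bytes.add(32) as *const __m256i); // pixels 8..16
        let mask = _mm256_set1_epi32(0xFF);
        let mut out = [[0u16; 16]; 4];
        for (c, plane) in out.iter_mut().enumerate() {
            // Each pixel is a little-endian u32; channel c sits at bits 8c..8c+8.
            let shift = _mm_cvtsi32_si128(8 * c as i32);
            let l = _mm256_and_si256(_mm256_srl_epi32(lo, shift), mask);
            let h = _mm256_and_si256(_mm256_srl_epi32(hi, shift), mask);
            // packus works per 128-bit lane, so restore pixel order afterwards.
            let packed = _mm256_packus_epi32(l, h);
            let ordered = _mm256_permute4x64_epi64::<0b11_01_10_00>(packed);
            _mm256_storeu_si256(plane.as_mut_ptr() as *mut __m256i, ordered);
        }
        out
    }
}

/// Picks the AVX2 path when the CPU supports it, else the scalar one.
pub fn load_8888(px: &[[u8; 4]; 16]) -> [[u16; 16]; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at run time.
            return unsafe { load_8888_avx2(px) };
        }
    }
    load_8888_scalar(px)
}
```

The scalar version touches every byte individually, while the vector version needs only a shift, a mask, a pack and a permute per channel.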

Results

Benchmark	Speedup
Geomean (55 benches)	1.95x
`blend::destination_atop`	3.83x
`gradients::two_stops_linear_pad`	3.56x
`blend::source_over`	2.53x
`fill::rect`	2.04x
`gradients::three_stops_linear_even`	1.92x
`hairline::aa`	1.34x
`blend::clear`, `fill::opaque` (don't hit specialized paths)	1.00x

I found no regressions anywhere.

Notes

Some nuances to this:

The AVX2 lowp::store_8888 ORs the channels together instead of truncating each one with `as u8`. This shouldn't be an issue unless something produces channel values > 255 in lowp, in which case the AVX2 path would surface a visual artifact instead of truncating silently.
Style-wise, this follows the existing src/wide/i32x8_t.rs / u32x8_t.rs patterns.
Default cargo builds naturally won't see this. You'll have to build with -Ctarget-cpu=haswell, -Ctarget-feature=+avx2, or -Ctarget-cpu=native.
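The first note can be illustrated with plain scalar arithmetic. The sketch below is not the AVX2 code from the PR. It only models the two packing strategies for a single pixel, with hypothetical function names, to show where an out-of-range channel ends up.

```rust
/// Packs four lowp channels into one RGBA8 pixel, truncating each with `as u8`.
pub fn pack_truncating(r: u16, g: u16, b: u16, a: u16) -> u32 {
    u32::from(r as u8)
        | (u32::from(g as u8) << 8)
        | (u32::from(b as u8) << 16)
        | (u32::from(a as u8) << 24)
}

/// Packs by shifting and ORing the raw u16 values, with no truncation.
/// Bits above 255 in one channel leak into the neighbouring channel.
pub fn pack_oring(r: u16, g: u16, b: u16, a: u16) -> u32 {
    u32::from(r) | (u32::from(g) << 8) | (u32::from(b) << 16) | (u32::from(a) << 24)
}
```

For in-range input such as (10, 20, 30, 255) both functions return 0xFF1E140A. For r = 256 the truncating version yields 0xFF000000, while the ORing version yields 0xFF000100, so the overflow becomes a visibly wrong green channel.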

RazrFalcon · 2026-05-25T07:56:49Z

Looks good, but I would avoid direct SIMD intrinsics calls and cfg_if in highp/lowp code. All of them should be nicely wrapped in the wide module.

bhark · 2026-05-26T19:42:15Z

@RazrFalcon And there's that too, is this more aligned with what you had in mind?

RazrFalcon · 2026-06-05T08:16:01Z

-        dst.0[15] = n1[7] as u16;
+        cfg_if::cfg_if! {
+            if #[cfg(all(feature = "simd", target_feature = "avx2"))] {
+                #[cfg(target_arch = "x86")]


Let's use a global `use` instead, like we do in u32x4_t.rs.

bhark · 2026-06-06

Moved to module scope now. Also applied to u16x16_t.rs to keep things consistent.
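As a rough picture of what "module scope" means here, the sketch below imports the intrinsics once at the top of the file instead of inside each function body. It is an assumption-laden illustration rather than the contents of u32x4_t.rs or u16x16_t.rs: `add_u32x8` is a hypothetical helper, and the `feature = "simd"` part of the real gate is left out so the sketch compiles outside of a Cargo project.

```rust
// Module scope: choose the intrinsics module once for the whole file.
#[cfg(all(target_feature = "avx2", target_arch = "x86"))]
use core::arch::x86::*;
#[cfg(all(target_feature = "avx2", target_arch = "x86_64"))]
use core::arch::x86_64::*;

/// Hypothetical helper: lane-wise wrapping add of eight u32 values (AVX2 build).
#[cfg(target_feature = "avx2")]
pub fn add_u32x8(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    // SAFETY: this function only exists when AVX2 is enabled at compile time,
    // and the unaligned loads and store each cover exactly 32 bytes.
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi32(va, vb));
    }
    out
}

/// Portable fallback with the same signature.
#[cfg(not(target_feature = "avx2"))]
pub fn add_u32x8(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    core::array::from_fn(|i| a[i].wrapping_add(b[i]))
}
```

With the imports hoisted, the function bodies no longer need per-architecture `cfg` attributes of their own.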

RazrFalcon · 2026-06-05T08:18:53Z

Yes, looks much better now. The idea is to hide all low-level SIMD stuff in the wide module.

RazrFalcon · 2026-06-07T14:47:18Z

Good. Hopefully someone from the linebender team will merge it soon. @nicoburns ?
I think we can merge #172, #175, #176 and #177.

perf: AVX2 8888 load/store + f32x16 save specializations

2d62200

refactor: move AVX2 8888 load/store into wide module

a093b7f

RazrFalcon reviewed Jun 5, 2026

refactor: move avx2 intrinsic imports to module scope

1863285

bhark wants to merge 3 commits into linebender:main from bhark:perf/avx2-8888
