Ensure other types of SIMD related loads, stores, and indirections are marked as used with SIMD intrinsics#129563
Ensure other types of SIMD related loads, stores, and indirections are marked as used with SIMD intrinsics#129563tannergooding wants to merge 2 commits into
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
|
I went for the broader change that was closer to how stores were already being handled to start. Depending on diffs (codegen and throughput), it can be pulled back a bit to not be so "aggressive" in the marking. In general, we shouldn't have any such locals being marked unless users are inserting some kind of bit or reinterpret cast to/from the built-in SIMD types. So this shouldn't block promotion in normal scenarios, only the ones that are explicitly involving SIMD already. This should also make it easier to increase promotion to more fields, which is currently stopped if you have 4. So the above example with a |
There was a problem hiding this comment.
Pull request overview
This PR adjusts how the JIT marks locals as “used in a SIMD intrinsic” so that additional IR shapes (not just direct local reads/stores) are recognized, helping avoid struct promotion in cases where SIMD-style moves/bitcasts are intended.
Changes:
- Removes the old
setLclRelatedToSIMDIntrinsichelper (previously insimd.cpp) and centralizes marking logic inCompiler::SetOpLclRelatedToSIMDIntrinsic. - Adds
gtInitializeLclVarNodeand hooks it into local-node creation, so SIMD/mask local accesses are consistently marked. - Extends store/indir initialization to mark SIMD/mask-related stores and indirections (including
IND(LCL_ADDR ...)shapes).
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/jit/simd.cpp | Removes the now-obsolete SIMD-local marking helper. |
| src/coreclr/jit/gentree.cpp | Adds local-node initialization and broadens SIMD/mask marking across locals, stores, and indirections. |
| src/coreclr/jit/compiler.h | Declares gtInitializeLclVarNode and removes the old helper declaration. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 2
|
@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors |
|
Meant to mark this as draft while I'm validating diffs are acceptable, etc. |
|
Trying a smaller change that doesn't pessimize cases like the below: [MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe void StoreArmVector128x4ToDestination(ushort* dest, ushort* destStart, int destLength,
Vector128<byte> res1, Vector128<byte> res2, Vector128<byte> res3, Vector128<byte> res4)
{
(Vector128<ushort> utf16LowVector1, Vector128<ushort> utf16HighVector1) = Vector128.Widen(res1);
(Vector128<ushort> utf16LowVector2, Vector128<ushort> utf16HighVector2) = Vector128.Widen(res2);
(Vector128<ushort> utf16LowVector3, Vector128<ushort> utf16HighVector3) = Vector128.Widen(res3);
(Vector128<ushort> utf16LowVector4, Vector128<ushort> utf16HighVector4) = Vector128.Widen(res4);
AdvSimd.Arm64.StoreVectorAndZip(dest, (utf16LowVector1, utf16LowVector2, utf16LowVector3, utf16LowVector4));
AdvSimd.Arm64.StoreVectorAndZip(dest + 32, (utf16HighVector1, utf16HighVector2, utf16HighVector3, utf16HighVector4));
}The general pessimization looks to have been coming from arbitrary |
|
@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors |
0e8303a to
1af70f3
Compare
|
Did some more local testing and diffing, the more exhaustive change is the better one with the issue having been that a Rather, we just want to handle all operands to hardware intrinsics as we had already been doing and otherwise mark locals based on whether the access itself is |
|
@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors |
|
@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors |
|
Diffs were better as expected, doing one more test where |
|
@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors |
02810c7 to
7255d1c
Compare
|
I'm happy with the diffs now. For mihu-bot, the one just prior to the latest is the accurate one: MihuBot/runtime-utils#1996. -- I tried and reverted a commit which was handling For existing packages, it gets - mov rax, qword ptr [rdi+0x08]
- mov qword ptr [rbp-0x10], rax
- mov eax, dword ptr [rdi+0x10]
- mov dword ptr [rbp-0x08], eax
- mov rax, qword ptr [rsi+0x08]
- mov qword ptr [rbp-0x20], rax
- mov eax, dword ptr [rsi+0x10]
- mov dword ptr [rbp-0x18], eax
- vmovsd xmm0, qword ptr [rbp-0x10]
- vinsertps xmm0, xmm0, dword ptr [rbp-0x08], 40
+ vmovsd xmm0, qword ptr [rdi+0x08]
+ vinsertps xmm0, xmm0, dword ptr [rdi+0x10], 40or - vmovss xmm0, dword ptr [rdi]
- vmovss xmm1, dword ptr [rdi+0x04]
- vmovss xmm2, dword ptr [rdi+0x08]
- vmovss xmm3, dword ptr [rdi+0x0C]
- vmovss xmm4, dword ptr [rdi+0x10]
- vmovss xmm5, dword ptr [rdi+0x14]
+ vmovsd xmm0, qword ptr [rdi]
+ vmovsd xmm1, qword ptr [rdi+0x08]
+ vmovsd xmm2, qword ptr [rdi+0x10]We do see a handful of small regressions, 4 of them in total accounting for |
|
CC. @dotnet/jit-contrib, @EgorBo, @dhartglassMSFT for review. Also CC. @jakobbotsch since this explicitly impacts promotion. As per the above comments this is generally improving codegen for cases where a user-defined struct is regularly being bitcast or reinterpret cast to or from the built-in SIMD types ( |
|
What do the diffs look like if we completely disable the old promotion for SIMDs? runtime/src/coreclr/jit/lclvars.cpp Lines 1732 to 1739 in b1ea6ff |
This handles various IR shapes such as:
and
Ensuring that promotion fails as:
instead of succeeding as:
As per the general comment related to the marking, accesses to locals involving a SIMD value can be done using a single instruction
mov. Additionally, such accesses are often done in perf critical code where it is explicitly being used with SIMD intrinsics.Such patterns often appear when users are doing
bitcastingbetween their own user-defined struct and the built-inSIMDtypes (i.e.Vector64/128/256/512) and are prevalent in code such as the following:By handling this, we get the following ideal codegen:
rather than this much more pessimized codegen: