Ensure other types of SIMD related loads, stores, and indirections are marked as used with SIMD intrinsics by tannergooding · Pull Request #129563 · dotnet/runtime

tannergooding · 2026-06-18T06:50:59Z

This handles various IR shapes such as:

               [000018] DA--G------                         *  STORE_LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V06 tmp2         
               [000017] U---G------                         \--*  IND       simd16
               [000016] -----------                            \--*  LCL_ADDR  byref  V02 arg1         [+0]

and

               [000021] DA--G------                         *  STORE_LCL_VAR simd16 V07 tmp3
               [000020] ----G------                         \--*  HWINTRINSIC simd16 16 double Add
               [000015] U---G------                            +--*  IND       simd16
               [000014] -----------                            |  \--*  LCL_ADDR  byref  V01 arg0         [+0]
               [000019] -----------                            \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V06 tmp2

Ensuring that promotion fails as:

  struct promotion of V01 is disabled because lvIsUsedInSIMDIntrinsic()

instead of succeeding as:

Promoting struct local V01 (Program+Vector2Double):
lvaGrabTemp returning 10 (V10 tmp6) (a long lifetime temp) called for field V01.X (fldOffset=0x0).

lvaGrabTemp returning 11 (V11 tmp7) (a long lifetime temp) called for field V01.Y (fldOffset=0x8).

As per the general comment related to the marking, accesses to locals involving a SIMD value can be done using a single mov instruction. Additionally, such accesses are often done in perf-critical code where the local is explicitly being used with SIMD intrinsics.

Such patterns often appear when users are doing bitcasting between their own user-defined struct and the built-in SIMD types (i.e. Vector64/128/256/512) and are prevalent in code such as the following:

[StructLayout(LayoutKind.Sequential)]
public struct Vector2Double(double x, double y)
{
    public double X = x;
    public double Y = y;

    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector2Double operator +(Vector2Double left, Vector2Double right)
    {
        var simdLeft = Unsafe.BitCast<Vector2Double, Vector128<double>>(left);
        var simdRight = Unsafe.BitCast<Vector2Double, Vector128<double>>(right);
        return Unsafe.BitCast<Vector128<double>, Vector2Double>(simdLeft + simdRight);
    }
}

public static class Vector2DoubleAdder
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector2Double Add(Vector2Double a, Vector2Double b, Vector2Double c)
    {
        return a + b + c;
    }
}

By handling this, we get the following ideal codegen:

; Method Program+Vector2DoubleAdder:Add(Program+Vector2Double,Program+Vector2Double,Program+Vector2Double):Program+Vector2Double (FullOpts)
G_M5453_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M5453_IG02:  ;; offset=0x0000
       vmovups  xmm0, xmmword ptr [rdx]
       vaddpd   xmm0, xmm0, xmmword ptr [r8]
       vaddpd   xmm0, xmm0, xmmword ptr [r9]
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=21 bbWeight=1 PerfScore 15.25

G_M5453_IG03:  ;; offset=0x0015
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 22

rather than this much more pessimized codegen:

; Method Program+Vector2DoubleAdder:Add(Program+Vector2Double,Program+Vector2Double,Program+Vector2Double):Program+Vector2Double (FullOpts)
G_M5453_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M5453_IG02:  ;; offset=0x0004
       vmovups  xmm0, xmmword ptr [rdx]
       vaddpd   xmm0, xmm0, xmmword ptr [r8]
       vmovaps  xmmword ptr [rsp], xmm0
       vmovups  xmm0, xmmword ptr [rsp]
       vmovups  xmmword ptr [rsp+0x18], xmm0
       vmovups  xmm0, xmmword ptr [rsp+0x18]
       vaddpd   xmm0, xmm0, xmmword ptr [r9]
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=43 bbWeight=1 PerfScore 19.25

G_M5453_IG03:  ;; offset=0x002F
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 52

dotnet-policy-service · 2026-06-18T06:52:16Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding · 2026-06-18T06:53:06Z

To start, I went for the broader change, which is closer to how stores were already being handled. Depending on the diffs (codegen and throughput), it can be pulled back a bit so it isn't so "aggressive" in the marking.

In general, we shouldn't have any such locals being marked unless users are inserting some kind of bit or reinterpret cast to/from the built-in SIMD types. So this shouldn't block promotion in normal scenarios, only the ones that are explicitly involving SIMD already.

This should also make it easier to increase promotion to more fields, which is currently stopped if you have 4. So the above example with a Vector4Double struct already had good codegen due to promotion being disabled in a different way.

Copilot

Pull request overview

This PR adjusts how the JIT marks locals as “used in a SIMD intrinsic” so that additional IR shapes (not just direct local reads/stores) are recognized, helping avoid struct promotion in cases where SIMD-style moves/bitcasts are intended.

Changes:

Removes the old setLclRelatedToSIMDIntrinsic helper (previously in simd.cpp) and centralizes marking logic in Compiler::SetOpLclRelatedToSIMDIntrinsic.
Adds gtInitializeLclVarNode and hooks it into local-node creation, so SIMD/mask local accesses are consistently marked.
Extends store/indir initialization to mark SIMD/mask-related stores and indirections (including IND(LCL_ADDR ...) shapes).

File	Description
src/coreclr/jit/simd.cpp	Removes the now-obsolete SIMD-local marking helper.
src/coreclr/jit/gentree.cpp	Adds local-node initialization and broadens SIMD/mask marking across locals, stores, and indirections.
src/coreclr/jit/compiler.h	Declares `gtInitializeLclVarNode` and removes the old helper declaration.

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 2

MichalPetryka · 2026-06-18T06:59:31Z

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

tannergooding · 2026-06-18T08:33:34Z

Meant to mark this as draft while I'm validating diffs are acceptable, etc.

Copilot

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 3

tannergooding · 2026-06-18T08:56:28Z

Trying a smaller change that doesn't pessimize cases like the below:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe void StoreArmVector128x4ToDestination(ushort* dest, ushort* destStart, int destLength,
            Vector128<byte> res1, Vector128<byte> res2, Vector128<byte> res3, Vector128<byte> res4)
{
    (Vector128<ushort> utf16LowVector1, Vector128<ushort> utf16HighVector1) = Vector128.Widen(res1);
    (Vector128<ushort> utf16LowVector2, Vector128<ushort> utf16HighVector2) = Vector128.Widen(res2);
    (Vector128<ushort> utf16LowVector3, Vector128<ushort> utf16HighVector3) = Vector128.Widen(res3);
    (Vector128<ushort> utf16LowVector4, Vector128<ushort> utf16HighVector4) = Vector128.Widen(res4);
    AdvSimd.Arm64.StoreVectorAndZip(dest, (utf16LowVector1, utf16LowVector2, utf16LowVector3, utf16LowVector4));
    AdvSimd.Arm64.StoreVectorAndZip(dest + 32, (utf16HighVector1, utf16HighVector2, utf16HighVector3, utf16HighVector4));
}

The general pessimization looks to have been coming from arbitrary GT_STOREIND being marked as related, so I want to see if that applies to all indirections and whether the handling of LCL_ADDR should just be removed. If it should, then I'll get a different diff that only handles OperIsScalarLocal and does so universally, like the first iteration attempted.

tannergooding · 2026-06-18T08:57:22Z

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

tannergooding · 2026-06-18T14:32:09Z

Did some more local testing and diffing. The more exhaustive change is the better one; the issue was that a TYP_SIMD based STOREIND to a non TYP_SIMD local is a negative, since it blocks the non-SIMD type from being promoted.

Rather, we just want to handle all operands to hardware intrinsics as we had already been doing and otherwise mark locals based on whether the access itself is TYP_SIMD or is an address to a TYP_SIMD local.
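The rule described in this comment can be sketched with a toy model. This is only an illustration under stated assumptions: `Type`, `Oper`, `Local`, `Node`, and `markLocals` are hypothetical stand-ins invented for the example, not the JIT's real types or the actual code in gentree.cpp, and only direct local operands of an intrinsic are treated as intrinsic operands.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for JIT concepts. None of these are the real JIT types.
enum class Type { Byref, Struct, Simd16 };
enum class Oper { LclVar, StoreLclVar, LclAddr, Ind, StoreInd, HWIntrinsic };

struct Local
{
    Type type;
    bool usedInSimd = false; // models the lvIsUsedInSIMDIntrinsic() state
};

struct Node
{
    Oper              oper;
    Type              type;   // type of the access itself
    int               lclNum; // local referenced by a local node, otherwise -1
    std::vector<Node> operands;
};

// Walks a tree and marks locals following the rule as described in the comment.
void markLocals(const Node& node, std::vector<Local>& locals, bool isHWIntrinsicOperand = false)
{
    const bool isLocalAccess = (node.oper == Oper::LclVar) || (node.oper == Oper::StoreLclVar);

    if (isLocalAccess || (node.oper == Oper::LclAddr))
    {
        assert(node.lclNum >= 0 && node.lclNum < static_cast<int>(locals.size()));
        Local& lcl = locals[node.lclNum];

        if (isHWIntrinsicOperand)
            lcl.usedInSimd = true; // operand of a hardware intrinsic
        else if (isLocalAccess && (node.type == Type::Simd16))
            lcl.usedInSimd = true; // the access itself is SIMD typed
        else if ((node.oper == Oper::LclAddr) && (lcl.type == Type::Simd16))
            lcl.usedInSimd = true; // address of a SIMD typed local
    }

    for (const Node& operand : node.operands)
        markLocals(operand, locals, node.oper == Oper::HWIntrinsic);
}

// SIMD typed STOREIND through the address of non-SIMD local V0, storing SIMD local V1.
std::vector<Local> storeIndExample()
{
    std::vector<Local> locals = {Local{Type::Struct}, Local{Type::Simd16}};
    Node store{Oper::StoreInd, Type::Simd16, -1,
               {Node{Oper::LclAddr, Type::Byref, 0, {}}, Node{Oper::LclVar, Type::Simd16, 1, {}}}};
    markLocals(store, locals);
    return locals;
}

// Hardware intrinsic with struct typed local V0 as a direct operand.
std::vector<Local> hwIntrinsicExample()
{
    std::vector<Local> locals = {Local{Type::Struct}};
    Node intrinsic{Oper::HWIntrinsic, Type::Simd16, -1, {Node{Oper::LclVar, Type::Struct, 0, {}}}};
    markLocals(intrinsic, locals);
    return locals;
}
```

In `storeIndExample`, the non-SIMD local V0 stays unmarked in this model even though the store through its address is SIMD typed, which is the case the comment calls out as a regression when marked.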

tannergooding · 2026-06-18T14:34:54Z

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

tannergooding · 2026-06-18T15:10:22Z

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

tannergooding · 2026-06-18T15:11:30Z

Diffs were better, as expected. Doing one more test where TYP_SIMD based GT_IND (loads) are also universally marked, to see if it can get closer to the original numbers without the GT_STOREIND based regressions.

Copilot

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 1

tannergooding · 2026-06-18T15:49:48Z

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

Copilot

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 4

tannergooding · 2026-06-18T16:49:01Z

I'm happy with the diffs now. For MihuBot, the run just prior to the latest is the accurate one: MihuBot/runtime-utils#1996. I tried and then reverted a commit that handled LCL_FLD and STORE_LCL_FLD; we have a couple of cases for it, but it doesn't actually get any wins.

For existing packages, it gets Total bytes of delta: -1731 (-0.00 % of base), specifically in ImageSharp and BepuPhysics, where it fixes the individual field shuffling for the bitcast-like structs we encounter:

-       mov      rax, qword ptr [rdi+0x08]
-       mov      qword ptr [rbp-0x10], rax
-       mov      eax, dword ptr [rdi+0x10]
-       mov      dword ptr [rbp-0x08], eax
-       mov      rax, qword ptr [rsi+0x08]
-       mov      qword ptr [rbp-0x20], rax
-       mov      eax, dword ptr [rsi+0x10]
-       mov      dword ptr [rbp-0x18], eax
-       vmovsd   xmm0, qword ptr [rbp-0x10]
-       vinsertps xmm0, xmm0, dword ptr [rbp-0x08], 40
+       vmovsd   xmm0, qword ptr [rdi+0x08]
+       vinsertps xmm0, xmm0, dword ptr [rdi+0x10], 40

or

-       vmovss   xmm0, dword ptr [rdi]
-       vmovss   xmm1, dword ptr [rdi+0x04]
-       vmovss   xmm2, dword ptr [rdi+0x08]
-       vmovss   xmm3, dword ptr [rdi+0x0C]
-       vmovss   xmm4, dword ptr [rdi+0x10]
-       vmovss   xmm5, dword ptr [rdi+0x14]
+       vmovsd   xmm0, qword ptr [rdi]
+       vmovsd   xmm1, qword ptr [rdi+0x08]
+       vmovsd   xmm2, qword ptr [rdi+0x10]

We do see a handful of small regressions: 4 in total, accounting for +23 bytes. They generally occur because we end up generating an insert or extract whose codegen is slightly longer but also typically more efficient for these scenarios, so they are acceptable and we can look at improving them separately if desired.

tannergooding · 2026-06-18T16:58:49Z

CC. @dotnet/jit-contrib, @EgorBo, @dhartglassMSFT for review. Also CC. @jakobbotsch since this explicitly impacts promotion.

As per the above comments, this generally improves codegen for cases where a user-defined struct is regularly being bitcast or reinterpret cast to or from the built-in SIMD types (Vector64/128/256/512). Such user-defined structs regularly come up in perf-oriented multimedia libraries (ImageSharp, BepuPhysics, Silk.NET, etc.) and even in our own user-defined SIMD helpers (Vector2/3/4, Quaternion, and Plane), which we are special casing in the JIT today (and could potentially stop special casing moving forward, since they are already implemented as bitcasting to the built-in SIMD types).

jakobbotsch · 2026-06-18T17:25:54Z

What do the diffs look like if we completely disable the old promotion for SIMDs?
Basically just change

runtime/src/coreclr/jit/lclvars.cpp

Lines 1732 to 1739 in b1ea6ff

    // If this lclVar is used in a SIMD intrinsic, then we don't want to struct promote it.
    // Note, however, that SIMD lclVars that are NOT used in a SIMD intrinsic may be
    // profitably promoted.
    if (varDsc->lvIsUsedInSIMDIntrinsic())
    {
        JITDUMP("  struct promotion of V%02u is disabled because lvIsUsedInSIMDIntrinsic()\n", lclNum);
        return false;
    }

to reject all SIMDs.
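The difference between the quoted check and the suggested variant can be contrasted with a small toy model. This is a sketch under assumptions, not JIT code: `ToyLocal` and both predicates are hypothetical names, and the second predicate reflects only one reading of "reject all SIMDs", namely that any SIMD typed local is excluded from the old promotion regardless of the flag. How the variant would treat flagged non-SIMD structs is not stated in the comment, so the sketch leaves that case out of the comparison.

```cpp
#include <cassert>

// Hypothetical, simplified view of a local; not the JIT's LclVarDsc.
struct ToyLocal
{
    bool isSimdTyped;         // the local's own type is a SIMD type
    bool usedInSimdIntrinsic; // models lvIsUsedInSIMDIntrinsic()
};

// Models the quoted check: only flagged locals are excluded from promotion.
bool canPromoteWithFlagCheck(const ToyLocal& lcl)
{
    return !lcl.usedInSimdIntrinsic;
}

// Models one reading of the suggestion: every SIMD typed local is excluded.
bool canPromoteRejectingAllSimd(const ToyLocal& lcl)
{
    return !lcl.isSimdTyped;
}

// Exercises the two gates on SIMD typed locals.
void checkGates()
{
    const ToyLocal unflaggedSimd{true, false};
    const ToyLocal flaggedSimd{true, true};

    // The gates disagree only for SIMD locals that were never flagged.
    assert(canPromoteWithFlagCheck(unflaggedSimd));
    assert(!canPromoteRejectingAllSimd(unflaggedSimd));

    // Flagged SIMD locals are rejected by both gates.
    assert(!canPromoteWithFlagCheck(flaggedSimd));
    assert(!canPromoteRejectingAllSimd(flaggedSimd));
}
```

Under this reading, the diffs being asked about would come from SIMD typed locals that are never flagged, since those are the only SIMD locals the two gates treat differently.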

Copilot AI review requested due to automatic review settings June 18, 2026 06:51

github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 18, 2026

Copilot started reviewing on behalf of tannergooding June 18, 2026 06:51

dotnet-policy-service Bot assigned tannergooding Jun 18, 2026

Copilot AI reviewed Jun 18, 2026

MihuBot mentioned this pull request Jun 18, 2026

[JitDiff X64] [tannergooding] Ensure other types of SIMD related loads, stor ... MihuBot/runtime-utils#1993

Open

tannergooding marked this pull request as draft June 18, 2026 08:33

Copilot AI review requested due to automatic review settings June 18, 2026 08:37

Copilot started reviewing on behalf of tannergooding June 18, 2026 08:38

Copilot AI reviewed Jun 18, 2026

MihuBot mentioned this pull request Jun 18, 2026

[JitDiff X64] [tannergooding] Ensure other types of SIMD related loads, stor ... MihuBot/runtime-utils#1994

Open

Ensure other types of SIMD related locals are marked as used with SIMD intrinsics

1af70f3

tannergooding force-pushed the usedInSimd branch from 0e8303a to 1af70f3 Compare June 18, 2026 14:24

MihuBot mentioned this pull request Jun 18, 2026

[JitDiff X64] [tannergooding] Ensure other types of SIMD related loads, stor ... MihuBot/runtime-utils#1995

Open

Ensure that TYP_SIMD based GT_IND is marked

7255d1c

Copilot AI review requested due to automatic review settings June 18, 2026 15:10

Copilot started reviewing on behalf of tannergooding June 18, 2026 15:10

MihuBot mentioned this pull request Jun 18, 2026

[JitDiff X64] [tannergooding] Ensure other types of SIMD related loads, stor ... MihuBot/runtime-utils#1996

Open

Copilot AI reviewed Jun 18, 2026

MihuBot mentioned this pull request Jun 18, 2026

[JitDiff X64] [tannergooding] Ensure other types of SIMD related loads, stor ... MihuBot/runtime-utils#1997

Open

tannergooding force-pushed the usedInSimd branch from 02810c7 to 7255d1c Compare June 18, 2026 16:40

Copilot AI review requested due to automatic review settings June 18, 2026 16:40

tannergooding marked this pull request as ready for review June 18, 2026 16:40

Copilot started reviewing on behalf of tannergooding June 18, 2026 16:40

Copilot AI reviewed Jun 18, 2026

4 participants
