
Ensure other types of SIMD related loads, stores, and indirections are marked as used with SIMD intrinsics #129563

Open
tannergooding wants to merge 2 commits into dotnet:main from tannergooding:usedInSimd

@tannergooding (Member) commented Jun 18, 2026

This handles various IR shapes such as:

               [000018] DA--G------                         *  STORE_LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V06 tmp2         
               [000017] U---G------                         \--*  IND       simd16
               [000016] -----------                            \--*  LCL_ADDR  byref  V02 arg1         [+0]

and

               [000021] DA--G------                         *  STORE_LCL_VAR simd16 V07 tmp3
               [000020] ----G------                         \--*  HWINTRINSIC simd16 16 double Add
               [000015] U---G------                            +--*  IND       simd16
               [000014] -----------                            |  \--*  LCL_ADDR  byref  V01 arg0         [+0]
               [000019] -----------                            \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V06 tmp2 

Ensuring that promotion fails as:

  struct promotion of V01 is disabled because lvIsUsedInSIMDIntrinsic()

instead of succeeding as:

Promoting struct local V01 (Program+Vector2Double):
lvaGrabTemp returning 10 (V10 tmp6) (a long lifetime temp) called for field V01.X (fldOffset=0x0).

lvaGrabTemp returning 11 (V11 tmp7) (a long lifetime temp) called for field V01.Y (fldOffset=0x8).

As per the general comment related to the marking, accesses to locals involving a SIMD value can be done using a single mov instruction. Additionally, such accesses are often done in perf-critical code where the locals are explicitly being used with SIMD intrinsics.

Such patterns often appear when users are doing bitcasting between their own user-defined struct and the built-in SIMD types (i.e. Vector64/128/256/512) and are prevalent in code such as the following:

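using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

[StructLayout(LayoutKind.Sequential)]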
public struct Vector2Double(double x, double y)
{
    public double X = x;
    public double Y = y;

    //[MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector2Double operator +(Vector2Double left, Vector2Double right)
    {
        var simdLeft = Unsafe.BitCast<Vector2Double, Vector128<double>>(left);
        var simdRight = Unsafe.BitCast<Vector2Double, Vector128<double>>(right);
        return Unsafe.BitCast<Vector128<double>, Vector2Double>(simdLeft + simdRight);
    }
}

public static class Vector2DoubleAdder
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector2Double Add(Vector2Double a, Vector2Double b, Vector2Double c)
    {
        return a + b + c;
    }
}

By handling this, we get the following ideal codegen:

; Method Program+Vector2DoubleAdder:Add(Program+Vector2Double,Program+Vector2Double,Program+Vector2Double):Program+Vector2Double (FullOpts)
G_M5453_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M5453_IG02:  ;; offset=0x0000
       vmovups  xmm0, xmmword ptr [rdx]
       vaddpd   xmm0, xmm0, xmmword ptr [r8]
       vaddpd   xmm0, xmm0, xmmword ptr [r9]
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=21 bbWeight=1 PerfScore 15.25

G_M5453_IG03:  ;; offset=0x0015
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code: 22

rather than this much more pessimized codegen:

; Method Program+Vector2DoubleAdder:Add(Program+Vector2Double,Program+Vector2Double,Program+Vector2Double):Program+Vector2Double (FullOpts)
G_M5453_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M5453_IG02:  ;; offset=0x0004
       vmovups  xmm0, xmmword ptr [rdx]
       vaddpd   xmm0, xmm0, xmmword ptr [r8]
       vmovaps  xmmword ptr [rsp], xmm0
       vmovups  xmm0, xmmword ptr [rsp]
       vmovups  xmmword ptr [rsp+0x18], xmm0
       vmovups  xmm0, xmmword ptr [rsp+0x18]
       vaddpd   xmm0, xmm0, xmmword ptr [r9]
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=43 bbWeight=1 PerfScore 19.25

G_M5453_IG03:  ;; offset=0x002F
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 52
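
The per-group figures in the two listings can be tallied to compare them directly. The C++ sketch below only re-adds the `size` and `PerfScore` values printed above; `Group`, `Totals`, `sumGroups`, `markedListing`, and `promotedListing` are illustrative names, not part of the JIT or its tooling.

```cpp
#include <vector>

// Illustrative names only; the numbers are copied from the two listings above.
struct Group
{
    int    sizeBytes; // "size=" of one instruction group
    double perfScore; // "PerfScore" of the same group
};

struct Totals
{
    int    sizeBytes;
    double perfScore;
};

// Adds up the per-group figures (IG01, IG02, IG03).
Totals sumGroups(const std::vector<Group>& groups)
{
    Totals total{0, 0.0};
    for (const Group& g : groups)
    {
        total.sizeBytes += g.sizeBytes;
        total.perfScore += g.perfScore;
    }
    return total;
}

// First listing: promotion of the struct locals is blocked.
std::vector<Group> markedListing()
{
    return {{0, 0.00}, {21, 15.25}, {1, 1.00}};
}

// Second listing: the struct locals were promoted.
std::vector<Group> promotedListing()
{
    return {{4, 0.25}, {43, 19.25}, {5, 1.25}};
}
```

Summing gives 22 bytes and a PerfScore of 16.25 for the first listing, against 52 bytes and 20.75 for the second, which agrees with the `Total bytes of code` lines.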

Copilot AI review requested due to automatic review settings June 18, 2026 06:51
github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jun 18, 2026
@dotnet-policy-service (Contributor) commented:

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding (Member, Author) commented Jun 18, 2026

To start, I went for the broader change, which is closer to how stores were already being handled. Depending on the diffs (codegen and throughput), it can be pulled back a bit so the marking is not so "aggressive".

In general, we shouldn't have any such locals being marked unless users are inserting some kind of bit or reinterpret cast to/from the built-in SIMD types. So this shouldn't block promotion in normal scenarios, only the ones that are explicitly involving SIMD already.

This should also make it easier to extend promotion to structs with more fields, which currently stops at 4. That is why the above example, written with a Vector4Double struct instead, already had good codegen: promotion was being disabled in a different way.

Copilot AI (Contributor) left a comment

Pull request overview

This PR adjusts how the JIT marks locals as “used in a SIMD intrinsic” so that additional IR shapes (not just direct local reads/stores) are recognized, helping avoid struct promotion in cases where SIMD-style moves/bitcasts are intended.

Changes:

  • Removes the old setLclRelatedToSIMDIntrinsic helper (previously in simd.cpp) and centralizes marking logic in Compiler::SetOpLclRelatedToSIMDIntrinsic.
  • Adds gtInitializeLclVarNode and hooks it into local-node creation, so SIMD/mask local accesses are consistently marked.
  • Extends store/indir initialization to mark SIMD/mask-related stores and indirections (including IND(LCL_ADDR ...) shapes).
Show a summary per file

| File | Description |
| --- | --- |
| src/coreclr/jit/simd.cpp | Removes the now-obsolete SIMD-local marking helper. |
| src/coreclr/jit/gentree.cpp | Adds local-node initialization and broadens SIMD/mask marking across locals, stores, and indirections. |
| src/coreclr/jit/compiler.h | Declares gtInitializeLclVarNode and removes the old helper declaration. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 2

@MichalPetryka (Contributor) commented:

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

@tannergooding (Member, Author) commented:

Meant to mark this as draft while I'm validating diffs are acceptable, etc.

Copilot AI review requested due to automatic review settings June 18, 2026 08:37

Copilot AI (Contributor) left a comment

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 3

@tannergooding (Member, Author) commented:

Trying a smaller change that doesn't pessimize cases like the below:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe void StoreArmVector128x4ToDestination(ushort* dest, ushort* destStart, int destLength,
            Vector128<byte> res1, Vector128<byte> res2, Vector128<byte> res3, Vector128<byte> res4)
{
    (Vector128<ushort> utf16LowVector1, Vector128<ushort> utf16HighVector1) = Vector128.Widen(res1);
    (Vector128<ushort> utf16LowVector2, Vector128<ushort> utf16HighVector2) = Vector128.Widen(res2);
    (Vector128<ushort> utf16LowVector3, Vector128<ushort> utf16HighVector3) = Vector128.Widen(res3);
    (Vector128<ushort> utf16LowVector4, Vector128<ushort> utf16HighVector4) = Vector128.Widen(res4);
    AdvSimd.Arm64.StoreVectorAndZip(dest, (utf16LowVector1, utf16LowVector2, utf16LowVector3, utf16LowVector4));
    AdvSimd.Arm64.StoreVectorAndZip(dest + 32, (utf16HighVector1, utf16HighVector2, utf16HighVector3, utf16HighVector4));
}

The general pessimization looks to have been coming from arbitrary GT_STOREIND nodes being marked as related, so I want to see whether that applies to all indirections and whether the handling of LCL_ADDR should just be removed. If it should, then I'll get a different diff that only handles OperIsScalarLocal, and does so universally, like the first iteration attempted.

@tannergooding (Member, Author) commented:

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

@tannergooding (Member, Author) commented:

Did some more local testing and diffing. The more exhaustive change is the better one; the issue was that a TYP_SIMD based STOREIND to a non-TYP_SIMD local is a negative, since it blocks the non-SIMD type from being promoted.

Rather, we just want to handle all operands to hardware intrinsics as we had already been doing and otherwise mark locals based on whether the access itself is TYP_SIMD or is an address to a TYP_SIMD local.
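
One reading of the rule described above, sketched as a small standalone C++ model. This is not the JIT's code: `Type`, `Oper`, `Local`, `Node`, `localOf`, `mark`, and `markLocals` are hypothetical names made up for illustration, only the node kinds mentioned in this thread are modeled, and the treatment of SIMD typed loads follows the later experiment in this thread rather than any confirmed final state of the PR.

```cpp
#include <vector>

// Hypothetical, simplified stand-ins for JIT concepts; these are not the real
// GenTree or LclVarDsc types.
enum class Type { Double, Struct, Simd16, Byref };
enum class Oper { LclVar, StoreLclVar, LclAddr, Ind, StoreInd, HWIntrinsic };

struct Local
{
    Type type;
    bool usedInSimdIntrinsic;
};

struct Node
{
    Oper                     oper;
    Type                     type;
    int                      lclNum;   // local number for LclVar/StoreLclVar/LclAddr, else -1
    std::vector<const Node*> operands; // for Ind/StoreInd, operands[0] is the address
};

static bool isSimd(Type t)
{
    return t == Type::Simd16;
}

// Returns the local a node refers to, looking through IND(LCL_ADDR); -1 if none.
static int localOf(const Node& n)
{
    if (n.lclNum >= 0)
        return n.lclNum;
    if (n.oper == Oper::Ind && !n.operands.empty())
        return localOf(*n.operands[0]);
    return -1;
}

static void mark(std::vector<Local>& locals, int lclNum)
{
    if (lclNum >= 0)
        locals[lclNum].usedInSimdIntrinsic = true;
}

// Walks a tree and applies the marking rule as read from the discussion above.
void markLocals(const Node& n, std::vector<Local>& locals)
{
    switch (n.oper)
    {
        case Oper::HWIntrinsic:
            // Every operand of a hardware intrinsic is marked, as before.
            for (const Node* op : n.operands)
                mark(locals, localOf(*op));
            break;
        case Oper::LclVar:
        case Oper::StoreLclVar:
        case Oper::Ind:
            // The access itself is SIMD typed.
            if (isSimd(n.type))
                mark(locals, localOf(n));
            break;
        case Oper::LclAddr:
            // The node is an address of a SIMD typed local.
            if (isSimd(locals[n.lclNum].type))
                mark(locals, n.lclNum);
            break;
        case Oper::StoreInd:
            // Deliberately not marked on its own: a SIMD store to a non-SIMD
            // local was the source of the regressions.
            break;
    }
    for (const Node* op : n.operands)
        markLocals(*op, locals);
}
```

Under this model, the first IR shape from the PR description, a simd16 STORE_LCL_VAR of an IND over LCL_ADDR, marks both the destination temp and the struct local behind the address. A simd16 STOREIND whose address points at a non-SIMD local leaves that local unmarked.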

@tannergooding (Member, Author) commented:

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

Copilot AI review requested due to automatic review settings June 18, 2026 15:10
@tannergooding (Member, Author) commented:

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

@tannergooding (Member, Author) commented:

Diffs were better, as expected. Doing one more test where TYP_SIMD based GT_IND nodes (loads) are also universally marked, to see if it can get closer to the original numbers without the GT_STOREIND based regressions.

Copilot AI (Contributor) left a comment

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 1

@tannergooding (Member, Author) commented:

@MihuBot -nuget -jitutils-repo EgorBo/jitutils -jitutils-branch pmi-deterministic-cctors

Copilot AI (Contributor) left a comment

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 4

@tannergooding (Member, Author) commented:

I'm happy with the diffs now. For MihuBot, the run just prior to the latest is the accurate one: MihuBot/runtime-utils#1996. I tried and then reverted a commit that handled LCL_FLD and STORE_LCL_FLD; we have a couple of cases for it, but it doesn't actually get any wins.

For existing packages, it gets Total bytes of delta: -1731 (-0.00 % of base), mainly in ImageSharp and BepuPhysics, where it fixes the individual field shuffling for the bitcast-like structs we encounter:

-       mov      rax, qword ptr [rdi+0x08]
-       mov      qword ptr [rbp-0x10], rax
-       mov      eax, dword ptr [rdi+0x10]
-       mov      dword ptr [rbp-0x08], eax
-       mov      rax, qword ptr [rsi+0x08]
-       mov      qword ptr [rbp-0x20], rax
-       mov      eax, dword ptr [rsi+0x10]
-       mov      dword ptr [rbp-0x18], eax
-       vmovsd   xmm0, qword ptr [rbp-0x10]
-       vinsertps xmm0, xmm0, dword ptr [rbp-0x08], 40
+       vmovsd   xmm0, qword ptr [rdi+0x08]
+       vinsertps xmm0, xmm0, dword ptr [rdi+0x10], 40

or

-       vmovss   xmm0, dword ptr [rdi]
-       vmovss   xmm1, dword ptr [rdi+0x04]
-       vmovss   xmm2, dword ptr [rdi+0x08]
-       vmovss   xmm3, dword ptr [rdi+0x0C]
-       vmovss   xmm4, dword ptr [rdi+0x10]
-       vmovss   xmm5, dword ptr [rdi+0x14]
+       vmovsd   xmm0, qword ptr [rdi]
+       vmovsd   xmm1, qword ptr [rdi+0x08]
+       vmovsd   xmm2, qword ptr [rdi+0x10]

We do see a handful of small regressions: 4 in total, accounting for +23 bytes. They generally look to be cases where we end up generating an insert or extract; the codegen for those is slightly longer but typically more efficient for these scenarios, so they are acceptable, and we can look at improving them separately if desired.

@tannergooding (Member, Author) commented:

CC. @dotnet/jit-contrib, @EgorBo, @dhartglassMSFT for review. Also CC. @jakobbotsch since this explicitly impacts promotion.

As per the above comments, this generally improves codegen for cases where a user-defined struct is regularly bitcast or reinterpret cast to or from the built-in SIMD types (Vector64/128/256/512). Such user-defined structs regularly come up in perf-oriented multimedia libraries (ImageSharp, BepuPhysics, Silk.NET, etc.) and even in our own user-defined SIMD helpers (Vector2/3/4, Quaternion, and Plane), which we special-case in the JIT today (and could potentially stop special-casing moving forward, since they are already implemented by bitcasting to the built-in SIMD types).

@jakobbotsch (Member) commented:

What do the diffs look like if we completely disable the old promotion for SIMDs?
Basically just change

// If this lclVar is used in a SIMD intrinsic, then we don't want to struct promote it.
// Note, however, that SIMD lclVars that are NOT used in a SIMD intrinsic may be
// profitably promoted.
if (varDsc->lvIsUsedInSIMDIntrinsic())
{
    JITDUMP(" struct promotion of V%02u is disabled because lvIsUsedInSIMDIntrinsic()\n", lclNum);
    return false;
}
to reject all SIMDs.
