Skip to content

[0035] Update wave-scope matrix groupshared operations#879

Open
llvm-beanz wants to merge 4 commits into
microsoft:mainfrom
llvm-beanz:866
Open

[0035] Update wave-scope matrix groupshared operations#879
llvm-beanz wants to merge 4 commits into
microsoft:mainfrom
llvm-beanz:866

Conversation

@llvm-beanz

Copy link
Copy Markdown
Collaborator

This updates the load, store, and interlocked accumlation operations on groupshared memory to align data conversion handling and support accumulating into packed data types.

Fixes #866

This updates the load, store, and interlocked accumlation operations on
groupshared memory to align data conversion handling and support
accumulating into packed data types.

Fixes microsoft#866
@llvm-beanz

Copy link
Copy Markdown
Collaborator Author

Tagging a few people from hardware vendors for feedback:
@Anteru, @pow2clk, @anupamachandra & @mbedy-amd.

Comment thread proposals/0035-linalg-matrix.md Outdated
Comment thread proposals/0035-linalg-matrix.md Outdated
Comment on lines +905 to +906
be in any arithmetic or packed data type. The type of the array must match the
type of the matrix being loaded into.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I'm reading this is if I want to load a packed uint8_t matrix the array must also be packed uint8_t. Is that correct?

Our concern with WebGPU is someone declaring a buffer (opaque type) in workgroup and using with both packed 8bit and 16-bit granularity. We aren't sure it will be guaranteed that equivalent matrices would be supported to use the Cast() method.

Also in the dxil section it says non-natively supported types must use i32 for the array. Would that be possible here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I'm reading this is if I want to load a packed uint8_t matrix the array must also be packed uint8_t. Is that correct?

DXIL doesn't have 8-bit integer types, so uint8_t isn't a native type, so it would need to be uint8_t4_packed, which becomes an i32.

Our concern with WebGPU is someone declaring a buffer (opaque type) in workgroup and using with both packed 8bit and 16-bit granularity. We aren't sure it will be guaranteed that equivalent matrices would be supported to use the Cast() method.

In general DXIL doesn't allow casting groupshared pointers, so we don't really have a reasonable way to reinterpret the data from a single allocation to mean different things. The one oddity of this is non-native types that we treat as i32 (uint8_t4_packed in the API), where we carry both the data type and the interpretation so the accumulation operations know how to operate on the data.

Also in the dxil section it says non-natively supported types must use i32 for the array. Would that be possible here?

That is what the uint8_t4_packed type is, but it only works for a subset of types.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Alan says, it looks like we would still not be able to support LinAlg in WebGPU due to this requirement. In addition to the opaque buffer type that we will soon have in WGSL, we also decompose groupshared buffers into flat arrays of (e.g.) u32 or u16 in order to work around bugs in D3D drivers. This means that we won't always have to ability to force the type to be a uint8_t4_packed, or to use f32 for an f32 matrix etc.

In general DXIL doesn't allow casting groupshared pointers

Maybe naive question, but does DXIL really need to support casting? Could the DXIL representation for these load/store operations not just allow the type mismatch, and specify that the operations are performed without conversion? This is essentially the SPIR-V approach - you can use any scalar/vector type in the src/dst pointer, and it's only used for addressing calculations.

We also considered trying to polyfill the matrix load/store operations ourselves by manually loading the data and then constructing the matrix objects using Set(), but a) I don't know how efficient this will be and b) the 8-bit matrix types don't support Get()/Set() as far as I can tell.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treating the data in a groupshared array as two (or more) types of different sizes is currently illegal in DXIL. You can kinda work around that by just always treating the array as 32-bit integers and doing load->mask->convert chains, but the fundamental limitation exists.

There isn't a good way to work around that limitation through the linalg spec as specified.

I think we should probably have a conversation with the hardware vendors about whether or not those restrictions are still required for DXIL, or if we should remove them. Removing them has some general benefits, but would make some transformations that DXC performs on groupshared allocations potentially illegal (which is also maybe fine).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it helps SPV_KHR_cooperative_matrix does not require the matrix component type to match the memory type. So any vendor supporting that extension should already be able to support this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it needs updating to match the new DXIL sections?

Co-authored-by: Ashley Coleman <ascoleman@microsoft.com>

@pow2clk pow2clk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments relate to changes made to the header that weren't mirrored to the other 1 or 2 places where the code is duplicated.

We discussed transforming this spec from containing so many of the requirements in header code to simplifying the code to leave that to the actual header while better explaining restrictions in the text. Is that no longer the plan? This seems to be moving in the opposite direction.

Splat(T Val);
[[nodiscard]] static
typename hlsl::enable_if<hlsl::is_arithmetic<T>::value, Matrix>::type
Splat(T Val);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While a welcome change, this isn't mentioned in the description

Comment thread proposals/0035-linalg-matrix.md
Comment thread proposals/0035-linalg-matrix.md Outdated
MatrixAccum32Ty::Load(GSMat, 0, 8, MatrixLayout::RowMajor);

// The next line is an error because the groupshared array isn't the right
// type for the matrix component. MatrixAccumTy M2 =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format did you dirty here and moved part of the "next line" onto the description of it. Is this a known bug or an example of how the header is expected to error on this usage?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an expected error based on the current state of this spec, although there is a request (in other comments) to change that requirement.

Comment thread proposals/0035-linalg-matrix.md Outdated
Comment thread proposals/0035-linalg-matrix.md
Comment thread proposals/0035-linalg-matrix.md
Comment thread proposals/0035-linalg-matrix.md
Comment thread proposals/0035-linalg-matrix.md
Comment thread proposals/0035-linalg-matrix.md
Comment thread proposals/0035-linalg-matrix.md Outdated
MatrixAccum32Ty::Load(GSMat, 0, 8, MatrixLayout::RowMajor);

// The next line is an error because the groupshared array isn't the right
// type for the matrix component. MatrixAccumTy M2 =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format again

@llvm-beanz

Copy link
Copy Markdown
Collaborator Author

A bunch of the code snippets across the file were out-of-sync with the appendix. I've re-sync'd all of that now.

@llvm-beanz

Copy link
Copy Markdown
Collaborator Author

We discussed transforming this spec from containing so many of the requirements in header code to simplifying the code to leave that to the actual header while better explaining restrictions in the text. Is that no longer the plan? This seems to be moving in the opposite direction.

Where we are without this change is a bit of a worst-of-all-worlds. The header in the annex was so far out of date from the implementation that it was no longer useful for crafting interface changes, and the code snippets across the file were in even worse state.

I did not fully sync this to the implementation, but I did pull in some of the differences in the implementation required to help me craft API changes that were needed for this PR.

This relaxes the groupshared data constraints so that i32 memory may
store any type opaquely.

@mjbedy mjbedy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good general direction. I'm going to follow up to confirm a couple things, but left some comments in the meantime.

Comment on lines +905 to +906
be in any arithmetic or packed data type. The type of the array must match the
type of the matrix being loaded into.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it needs updating to match the new DXIL sections?

[Conversions](#data-conversion-rules) section below.
Populates a matrix with data from a `groupshared` array. If the groupshared
memory is any type other than `i32` the matrix's component type must match the
scalar ype of the groupshared array. If the groupshared memory is `i32` the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
scalar ype of the groupshared array. If the groupshared memory is `i32` the
scalar type of the groupshared array. If the groupshared memory is `i32` the

and groupshared memory are defined in the [Conversions](#data-conversion-rules) section
below.
Store a matrix to groupshared memory. If the groupshared memory is any type
other than `i32` the matrix's component type must match the scalar ype of the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
other than `i32` the matrix's component type must match the scalar ype of the
other than `i32` the matrix's component type must match the scalar type of the

Comment on lines 1621 to 1628
declare void @dx.op.linAlgMatrixStoreToMemory.[MatTy].[Ty](
immarg i32, ; opcode
%dx.types.LinAlgMatrix<mangling>, ; matrix
[Ty] addrspace(3)*, ; groupshared Ty[M * N]
i32, ; Offset
i32, ; Stride
i32 ; matrix layout
)

@mjbedy mjbedy Jun 19, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the Offset and Stride intended to be in terms of the element type of the groupshared array? (e.g., if you are storing f16 elements to i32 array, the Offset and Strides will be in terms of 4 byte elements.)

immarg i32, ; opcode
%dx.types.LinAlgMatrix<mangling>, ; matrix
[Ty] addrspace(3)*, ; groupshared Ty[M * N]
immarg i32, ; target data type

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't add a comment directly to the right spot because there is no diff there but can you update the sections that start with

For the Load operations on groupshared arrays: and For the Store operations on groupshared arrays: to comment on what Stride and Offset are expected to be when target data type and gs array data type disagree?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[0035] Loading Matrix from groupshared memory without conversion

7 participants