[0035] Update wave-scope matrix groupshared operations#879
Conversation
This updates the load, store, and interlocked accumlation operations on groupshared memory to align data conversion handling and support accumulating into packed data types. Fixes microsoft#866
|
Tagging a few people from hardware vendors for feedback: |
| be in any arithmetic or packed data type. The type of the array must match the | ||
| type of the matrix being loaded into. |
There was a problem hiding this comment.
The way I'm reading this is if I want to load a packed uint8_t matrix the array must also be packed uint8_t. Is that correct?
Our concern with WebGPU is someone declaring a buffer (opaque type) in workgroup and using with both packed 8bit and 16-bit granularity. We aren't sure it will be guaranteed that equivalent matrices would be supported to use the Cast() method.
Also in the dxil section it says non-natively supported types must use i32 for the array. Would that be possible here?
There was a problem hiding this comment.
The way I'm reading this is if I want to load a packed uint8_t matrix the array must also be packed uint8_t. Is that correct?
DXIL doesn't have 8-bit integer types, so uint8_t isn't a native type, so it would need to be uint8_t4_packed, which becomes an i32.
Our concern with WebGPU is someone declaring a buffer (opaque type) in workgroup and using with both packed 8bit and 16-bit granularity. We aren't sure it will be guaranteed that equivalent matrices would be supported to use the
Cast()method.
In general DXIL doesn't allow casting groupshared pointers, so we don't really have a reasonable way to reinterpret the data from a single allocation to mean different things. The one oddity of this is non-native types that we treat as i32 (uint8_t4_packed in the API), where we carry both the data type and the interpretation so the accumulation operations know how to operate on the data.
Also in the dxil section it says non-natively supported types must use i32 for the array. Would that be possible here?
That is what the uint8_t4_packed type is, but it only works for a subset of types.
There was a problem hiding this comment.
As Alan says, it looks like we would still not be able to support LinAlg in WebGPU due to this requirement. In addition to the opaque buffer type that we will soon have in WGSL, we also decompose groupshared buffers into flat arrays of (e.g.) u32 or u16 in order to work around bugs in D3D drivers. This means that we won't always have to ability to force the type to be a uint8_t4_packed, or to use f32 for an f32 matrix etc.
In general DXIL doesn't allow casting groupshared pointers
Maybe naive question, but does DXIL really need to support casting? Could the DXIL representation for these load/store operations not just allow the type mismatch, and specify that the operations are performed without conversion? This is essentially the SPIR-V approach - you can use any scalar/vector type in the src/dst pointer, and it's only used for addressing calculations.
We also considered trying to polyfill the matrix load/store operations ourselves by manually loading the data and then constructing the matrix objects using Set(), but a) I don't know how efficient this will be and b) the 8-bit matrix types don't support Get()/Set() as far as I can tell.
There was a problem hiding this comment.
Treating the data in a groupshared array as two (or more) types of different sizes is currently illegal in DXIL. You can kinda work around that by just always treating the array as 32-bit integers and doing load->mask->convert chains, but the fundamental limitation exists.
There isn't a good way to work around that limitation through the linalg spec as specified.
I think we should probably have a conversation with the hardware vendors about whether or not those restrictions are still required for DXIL, or if we should remove them. Removing them has some general benefits, but would make some transformations that DXC performs on groupshared allocations potentially illegal (which is also maybe fine).
There was a problem hiding this comment.
If it helps SPV_KHR_cooperative_matrix does not require the matrix component type to match the memory type. So any vendor supporting that extension should already be able to support this.
There was a problem hiding this comment.
This looks like it needs updating to match the new DXIL sections?
Co-authored-by: Ashley Coleman <ascoleman@microsoft.com>
pow2clk
left a comment
There was a problem hiding this comment.
Most of my comments relate to changes made to the header that weren't mirrored to the other 1 or 2 places where the code is duplicated.
We discussed transforming this spec from containing so many of the requirements in header code to simplifying the code to leave that to the actual header while better explaining restrictions in the text. Is that no longer the plan? This seems to be moving in the opposite direction.
| Splat(T Val); | ||
| [[nodiscard]] static | ||
| typename hlsl::enable_if<hlsl::is_arithmetic<T>::value, Matrix>::type | ||
| Splat(T Val); |
There was a problem hiding this comment.
While a welcome change, this isn't mentioned in the description
| MatrixAccum32Ty::Load(GSMat, 0, 8, MatrixLayout::RowMajor); | ||
|
|
||
| // The next line is an error because the groupshared array isn't the right | ||
| // type for the matrix component. MatrixAccumTy M2 = |
There was a problem hiding this comment.
clang-format did you dirty here and moved part of the "next line" onto the description of it. Is this a known bug or an example of how the header is expected to error on this usage?
There was a problem hiding this comment.
It is an expected error based on the current state of this spec, although there is a request (in other comments) to change that requirement.
| MatrixAccum32Ty::Load(GSMat, 0, 8, MatrixLayout::RowMajor); | ||
|
|
||
| // The next line is an error because the groupshared array isn't the right | ||
| // type for the matrix component. MatrixAccumTy M2 = |
|
A bunch of the code snippets across the file were out-of-sync with the appendix. I've re-sync'd all of that now. |
Where we are without this change is a bit of a worst-of-all-worlds. The header in the annex was so far out of date from the implementation that it was no longer useful for crafting interface changes, and the code snippets across the file were in even worse state. I did not fully sync this to the implementation, but I did pull in some of the differences in the implementation required to help me craft API changes that were needed for this PR. |
This relaxes the groupshared data constraints so that i32 memory may store any type opaquely.
mjbedy
left a comment
There was a problem hiding this comment.
This seems like a good general direction. I'm going to follow up to confirm a couple things, but left some comments in the meantime.
| be in any arithmetic or packed data type. The type of the array must match the | ||
| type of the matrix being loaded into. |
There was a problem hiding this comment.
This looks like it needs updating to match the new DXIL sections?
| [Conversions](#data-conversion-rules) section below. | ||
| Populates a matrix with data from a `groupshared` array. If the groupshared | ||
| memory is any type other than `i32` the matrix's component type must match the | ||
| scalar ype of the groupshared array. If the groupshared memory is `i32` the |
There was a problem hiding this comment.
| scalar ype of the groupshared array. If the groupshared memory is `i32` the | |
| scalar type of the groupshared array. If the groupshared memory is `i32` the |
| and groupshared memory are defined in the [Conversions](#data-conversion-rules) section | ||
| below. | ||
| Store a matrix to groupshared memory. If the groupshared memory is any type | ||
| other than `i32` the matrix's component type must match the scalar ype of the |
There was a problem hiding this comment.
| other than `i32` the matrix's component type must match the scalar ype of the | |
| other than `i32` the matrix's component type must match the scalar type of the |
| declare void @dx.op.linAlgMatrixStoreToMemory.[MatTy].[Ty]( | ||
| immarg i32, ; opcode | ||
| %dx.types.LinAlgMatrix<mangling>, ; matrix | ||
| [Ty] addrspace(3)*, ; groupshared Ty[M * N] | ||
| i32, ; Offset | ||
| i32, ; Stride | ||
| i32 ; matrix layout | ||
| ) |
There was a problem hiding this comment.
Are the Offset and Stride intended to be in terms of the element type of the groupshared array? (e.g., if you are storing f16 elements to i32 array, the Offset and Strides will be in terms of 4 byte elements.)
| immarg i32, ; opcode | ||
| %dx.types.LinAlgMatrix<mangling>, ; matrix | ||
| [Ty] addrspace(3)*, ; groupshared Ty[M * N] | ||
| immarg i32, ; target data type |
There was a problem hiding this comment.
I can't add a comment directly to the right spot because there is no diff there but can you update the sections that start with
For the Load operations on groupshared arrays: and For the Store operations on groupshared arrays: to comment on what Stride and Offset are expected to be when target data type and gs array data type disagree?
This updates the load, store, and interlocked accumlation operations on groupshared memory to align data conversion handling and support accumulating into packed data types.
Fixes #866