[TLE-Raw] Add DSLRegion metadata for deferred vendor lowering and fix redundant-copy removal #700
Open
i3wanna2 wants to merge 10 commits into
Add output_operand_indices/hint attrs, eager and deferred create paths, source_store, and split ConvertArgToMemDesc from RemoveRedundantCopy so loop accumulator redundancy is eliminated in the dedicated pass.
The refactored core.py routes region creation through JIT function helpers; add the same method to MLIRJITFunction used by mlir tutorials.
Default deferred=False preserves eager behavior; callers opt in with deferred=True on the CUDA dialect decorator.
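The opt-in behavior described above can be sketched roughly as follows. This is a minimal illustration only: `cuda_dialect`, `RegionStub`, and `fake_compile` are hypothetical names invented for the example and are not the PR's actual API. It assumes a region records its source and is either compiled at creation (eager) or left as a stub to be filled later (deferred).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RegionStub:
    """Hypothetical stand-in for a dsl_region; not the PR's real class."""
    source: str
    compiled: Optional[str] = None  # set at creation (eager) or later (deferred)

    def materialize(self, compile_fn: Callable[[str], str]) -> None:
        # Compile at most once; a no-op if the region is already filled.
        if self.compiled is None:
            self.compiled = compile_fn(self.source)


def fake_compile(source: str) -> str:
    # Placeholder for vendor lowering (assumption): just tags the source name.
    return f"compiled<{source}>"


def cuda_dialect(deferred: bool = False):
    """Hypothetical decorator factory; deferred=False keeps the eager path."""
    def wrap(fn: Callable) -> RegionStub:
        region = RegionStub(source=fn.__name__)
        if not deferred:
            region.materialize(fake_compile)  # eager: compile immediately
        return region
    return wrap


@cuda_dialect()
def eager_region():
    pass


@cuda_dialect(deferred=True)
def deferred_region():
    pass
```

In this sketch, `eager_region.compiled` is populated as soon as the decorator runs, while `deferred_region.compiled` stays `None` until a later step calls `materialize`.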
Temporary commit for backup only; not intended for review or merge. Work in progress on NVIDIA deferred tle_raw materialize scaffolding.
- Move deferred_raw_materialize to make_llir (before dsl_region_inline)
- Keep convert_arg_to_memdesc/remove_redundant_copy in make_ttgir
- Add NVIDIA deferred_raw.py + MaterializeDeferredRaw pass skeleton
- Add deferred CUDA unit test fixture
Extract shared materialize logic into TritonTLERawUtils and wire the NVIDIA make_llir pass to compile pending sources and fill stub dsl_regions before inline. Add deferred runtime hooks for CUDA/MLIR and deferred tutorials; remove the in-repo unit test fixtures, which moved to an external debug workspace.
Summary
This PR extends tle.dsl_region with metadata (region_dialect, arg_dialect, output_operand_indices, optional hint) to support vendor-specific DSL regions
and a deferred-lowering skeleton, while keeping the current eager CUDA path working.
Changes
• Refactor Python/C++ creation path (call / call_smem, CUDA runtime, source_store, deferred create API)
• Keep ConvertArgToMemDesc as a pure conversion pass
• Fix RemoveRedundantCopy to use output_operand_indices for shared-memory redundant-copy elimination
• Auto-derive alias indices from LLVM return analysis when output_indices is not provided
• Add minimal TOPS Python adapter for compatibility with the shared creation API
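The fallback between explicit and derived indices listed above might look roughly like the sketch below. It is an assumption-laden illustration, not the pass's real code: `resolve_output_indices` and the `returned_arg_aliases` input are hypothetical, and the way the actual LLVM return analysis produces alias information is not described in this PR text.

```python
from typing import List, Optional, Sequence


def resolve_output_indices(
    explicit: Optional[Sequence[int]],
    returned_arg_aliases: Sequence[Optional[int]],
) -> List[int]:
    """Pick the operand indices treated as outputs of a region.

    explicit: indices supplied by the caller, or None if not provided.
    returned_arg_aliases: for each returned value, the operand index it
        aliases, or None when it aliases no operand (hypothetical result
        of a return analysis).
    """
    if explicit is not None:
        # Caller-supplied indices win, even when the list is empty.
        return sorted(set(explicit))
    # Otherwise derive indices from whichever operands the returns alias.
    return sorted({i for i in returned_arg_aliases if i is not None})
```

Treating an explicitly empty list as "no outputs" rather than as "not provided" is a design choice of this sketch; the PR does not say which way the real API behaves.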
Notes
• TOPS remains eager-only (no deferred logic)
• Deferred materialization is out of scope for this PR