
[TLE-Raw] Add DSLRegion metadata for deferred vendor lowering and fix redundant-copy removal #700

Open
i3wanna2 wants to merge 10 commits into triton_v3.6.x from lxy/tle-raw-restore



@i3wanna2

Collaborator

Summary

This PR extends tle.dsl_region with metadata (region_dialect, arg_dialect, output_operand_indices, and an optional hint) to support vendor-specific DSL regions and a skeleton for deferred lowering, while keeping the existing eager CUDA path working.
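The metadata attached to a region can be pictured as a small record. The sketch below is a hypothetical Python model, not the actual tle.dsl_region API: the class name DSLRegionMeta, its validate method, and the dialect strings "cuda" and "ttg" are assumptions chosen for illustration. It only shows the kind of consistency check that output_operand_indices invites (indices must be in range and unique), and it treats the indices as optional because the PR derives them automatically when they are not given.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DSLRegionMeta:
    """Hypothetical model of the metadata carried by a DSL region (illustrative only)."""
    region_dialect: str  # dialect the region body is written in (assumed string tag)
    arg_dialect: str  # dialect of the region's operands (assumed string tag)
    output_operand_indices: Optional[Tuple[int, ...]] = None  # operands the region writes
    hint: Optional[str] = None  # optional free-form lowering hint

    def validate(self, num_operands: int) -> None:
        """Reject out-of-range or duplicate output operand indices."""
        if self.output_operand_indices is None:
            return  # indices may be derived later, e.g. from return analysis
        seen = set()
        for idx in self.output_operand_indices:
            if not 0 <= idx < num_operands:
                raise ValueError(
                    f"output operand index {idx} out of range [0, {num_operands})")
            if idx in seen:
                raise ValueError(f"duplicate output operand index {idx}")
            seen.add(idx)
```

For example, DSLRegionMeta("cuda", "ttg", output_operand_indices=(2,)).validate(3) passes, while an index of 3 for a three-operand region raises ValueError.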

Changes

• Refactor the Python/C++ region-creation path (call / call_smem, CUDA runtime, source_store, deferred create API)
• Keep ConvertArgToMemDesc as a pure conversion pass; redundant-copy elimination lives in RemoveRedundantCopy
• Fix RemoveRedundantCopy to use output_operand_indices for shared-memory redundant-copy elimination
• Auto-derive alias indices from LLVM return analysis when output_indices is not provided
• Add a minimal TOPS Python adapter for compatibility with the shared creation API
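The two RemoveRedundantCopy bullets can be illustrated with a simplified model. This is a hypothetical sketch, not the pass's actual logic: it assumes each shared-memory operand gets a copy-back after the region, and that a copy-back is only needed for operands the region writes. The helper names derive_output_indices and removable_copies, the "%arg<i>" naming, and the alias_source map (value to the value it was derived from, standing in for LLVM return analysis) are all assumptions.

```python
from typing import Dict, List, Optional, Sequence, Set


def derive_output_indices(
    num_args: int,
    returned_values: Sequence[str],
    alias_source: Dict[str, str],
) -> List[int]:
    """Map each returned value back to the function argument it aliases (hypothetical).

    alias_source maps a value name to the value it was derived from
    (e.g. through a pointer cast); arguments are named "%arg<i>".
    """
    indices: Set[int] = set()
    for value in returned_values:
        seen: Set[str] = set()
        # Follow the alias chain to its root, stopping on cycles.
        while value in alias_source and value not in seen:
            seen.add(value)
            value = alias_source[value]
        suffix = value[len("%arg"):] if value.startswith("%arg") else ""
        if suffix.isdigit() and int(suffix) < num_args:
            indices.add(int(suffix))
    return sorted(indices)


def removable_copies(num_operands: int,
                     output_indices: Optional[Sequence[int]]) -> List[int]:
    """Operands whose post-region copy-back can be dropped (region never writes them)."""
    if output_indices is None:
        return []  # outputs unknown: conservatively keep every copy
    written = set(output_indices)
    return [i for i in range(num_operands) if i not in written]
```

With alias_source = {"%p": "%arg1", "%q": "%p"} and "%q" returned, the derived indices are [1], so for a three-operand region only the copy-backs for operands 0 and 2 would be dropped. Falling back to "keep everything" when indices are unknown mirrors why explicit or derived output_operand_indices matter.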

Notes

• TOPS remains eager-only (no deferred logic)
• Deferred materialization is out of scope for this PR

i3wanna2 added 2 commits June 15, 2026 08:25
Add output_operand_indices/hint attrs, eager and deferred create paths,
and source_store; split ConvertArgToMemDesc from RemoveRedundantCopy so
that redundant loop-accumulator copies are eliminated in the dedicated pass.
flagtree-bot and others added 5 commits June 15, 2026 08:28
The refactored core.py routes region creation through JIT function
helpers, so add the same method to MLIRJITFunction, which the MLIR tutorials use.
Default deferred=False preserves eager behavior; callers opt in with deferred=True on the CUDA dialect decorator.
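The opt-in behavior described in this commit can be sketched as a decorator flag that the creation path inspects. Everything here is hypothetical: cuda_dialect, create_region, the dsl_deferred attribute, and the pending_sources list (a stand-in for source_store) are illustrative names, not the project's API. The point is only that deferred=False keeps the eager path as the default.

```python
from typing import Callable, List

pending_sources: List[str] = []  # stand-in for a source store of deferred regions


def cuda_dialect(deferred: bool = False) -> Callable[[Callable], Callable]:
    """Hypothetical decorator: tag a DSL function with its lowering mode."""
    def wrap(fn: Callable) -> Callable:
        fn.dsl_deferred = deferred  # eager unless the caller opts in
        return fn
    return wrap


def create_region(fn: Callable) -> str:
    """Compile eager functions now; record deferred ones and return a stub."""
    if getattr(fn, "dsl_deferred", False):
        pending_sources.append(fn.__name__)
        return f"stub:{fn.__name__}"  # placeholder to be filled in a later stage
    return f"compiled:{fn.__name__}"
```

A function decorated with @cuda_dialect() goes straight to "compiled:...", while @cuda_dialect(deferred=True) yields a stub and leaves its source queued for later materialization.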
Temporary commit for backup only; not intended for review or merge.
Work in progress on NVIDIA deferred tle_raw materialization scaffolding.

- Move deferred_raw_materialize to make_llir (before dsl_region_inline)
- Keep convert_arg_to_memdesc/remove_redundant_copy in make_ttgir
- Add NVIDIA deferred_raw.py + MaterializeDeferredRaw pass skeleton
- Add deferred CUDA unit test fixture
flagtree-bot and others added 3 commits June 17, 2026 10:15
Extract shared materialization logic into TritonTLERawUtils and wire the NVIDIA
make_llir pass to compile pending sources and fill stub dsl_regions before
inlining. Add deferred runtime hooks for CUDA/MLIR and deferred tutorials;
remove in-repo unit test fixtures that moved to an external debug workspace.
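The ordering constraint in these commits (deferred_raw_materialize runs in make_llir before dsl_region_inline, which needs real region bodies rather than stubs) can be modeled in a few lines. This is a hypothetical simulation: PIPELINE, check_order, materialize, inline, the "<stub>" marker, and the relative order of the two make_ttgir passes are assumptions for illustration, not the backend's actual pipeline code.

```python
from typing import Callable, Dict, List

# Hypothetical stage layout; the order inside make_ttgir is assumed.
PIPELINE: Dict[str, List[str]] = {
    "make_ttgir": ["convert_arg_to_memdesc", "remove_redundant_copy"],
    "make_llir": ["deferred_raw_materialize", "dsl_region_inline"],
}


def check_order(pipeline: Dict[str, List[str]], stage: str,
                before: str, after: str) -> bool:
    """True if `before` runs earlier than `after` within the given stage."""
    passes = pipeline[stage]
    return passes.index(before) < passes.index(after)


def materialize(regions: Dict[str, str],
                compile_source: Callable[[str], str]) -> None:
    """Replace every stub region body with compiled code, in place."""
    for name, body in regions.items():
        if body == "<stub>":
            regions[name] = compile_source(name)


def inline(regions: Dict[str, str]) -> List[str]:
    """Inlining needs real bodies; a leftover stub is an error."""
    for name, body in regions.items():
        if body == "<stub>":
            raise RuntimeError(f"region {name} was never materialized")
    return list(regions.values())
```

Running inline on {"r0": "<stub>"} without materializing first raises, which is the failure the make_llir ordering avoids; after materialize fills the stub, inlining proceeds normally.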