Upgrade to Triton 3.5 and refactor tt to linalg #180
Merged
…eline
This is a foundational dialect-level refactoring that renames, removes, and
introduces MLIR dialects in preparation for the new compilation pipeline:
Dialect Renames:
- NPU Dialect → TritonDicp Dialect
The NPU dialect is renamed to TritonDicp to better reflect its role as the
DICP-specific Triton intermediate representation. Expanded with new op
definitions (TritonDicpOps.td), attribute definitions
(TritonDicpAttrDefs.td), and corresponding C++ implementation.
Dialect Removals:
- LinalgExt Dialect (entirely removed)
Removes the LinalgExt dialect including its IR definitions (Ops, Traits),
transforms (LinalgGenericToSCF, LinalgIfToSelect,
RemoveSingleIterationLoop, ScalarTo1DTensor, TensorTransform), and
VectorizeParallelLoopPass. This functionality is superseded by the new
TritonToLinalg and TritonToStructured conversion passes.
- TritonExt Dialect (entirely removed)
Removes the TritonExt dialect including its transforms
(BoolTritonPtrPromotionPass, CanonicalizeCmpiPass,
CanonicalizeTritonIRAscend, CanonicalizerPattern). These canonicalization
and optimization patterns are now handled by the new pass pipeline.
Dialect Additions:
- CommonIR Dialect (new)
Provides shared infrastructure passes: VectorizeParallelLoop and
AnnotateKernelAttrs. These were moved out of the removed LinalgExt
dialect and are now available as common utilities across the pipeline.
- TritonStructured Dialect (new)
Introduces a structured operations dialect that serves as the lowering
target for Triton→Structured conversion, providing a clean IR interface
between the Triton frontend dialect and backend-specific lowerings.
Breaking Changes:
- Any code referencing the NPU, LinalgExt, or TritonExt dialects must be
updated to use TritonDicp, CommonIR, or the new conversion passes.
- Dialect registration order in pipeline setup must be updated.
…l IR analysis
This is the core compiler pipeline rewrite that replaces the old monolithic
Conversion/ passes with a modular, multi-stage lowering pipeline backed by
comprehensive graph-level analysis infrastructure.
Removed Passes (under compiler/lib/Conversion/):
- LinalgToLinked: Triton→Linalg→Linked lowering (superseded)
- LinalgToNPU: Linalg→NPU direct lowering (superseded)
- TritonToLinalgNPU: Triton→Linalg→NPU conversion with MemRefCopyGather
and TritonArithToLinalg patterns (superseded)
- LinkedToHIVM: Linked→HIVM backend lowering (superseded)
New Conversion Passes:
TritonToLinalg (compiler/lib/TritonToLinalg/):
Comprehensive Triton IR → Linalg lowering with:
- TritonOpConverter (723-line header): converts all Triton dialect ops
(load, store, atomic ops, reduce, scan, dot, make_range, etc.)
- BlockPtrAnalysis (2407 lines): block pointer analysis for
efficient memory access pattern detection
- MaskAnalysis: predicated execution mask propagation
- UseAnalysis (559 lines): def-use chain analysis for value lifetime
- LoadStoreConverter (1242 lines): memory operation legalization
- ImplicitPermute (608 lines): automatic transpose insertion
- HoistBroadcast (206 lines): broadcast hoisting optimization
- DescriptorConverter (180 lines): memory descriptor generation
- FunctionConverter: kernel function signature conversion
- ArgMinMaxConverter: argmin/argmax lowering patterns
- MarkTensorKindPass: tensor kind annotation for backend selection
- AscendNPUIRLegalizePass: Ascend-specific IR legalization
TritonToStructured (compiler/lib/TritonToStructured/):
Alternative Triton→Structured lowering path with:
- CannonicalizerConverter (2594 lines): canonicalization-driven
conversion from Triton to structured ops
- PtrAnalysis (1557 lines): pointer/address analysis
- MaskAnalysis (992 lines): structured mask propagation
- MemOpConverter (554 lines): memory operation lowering
TritonToGraph (compiler/lib/TritonToGraph/):
Graph-level IR analysis infrastructure:
- ControlFlowGraph (1120 lines): CFG construction and analysis
with support for structured control flow regions
- ControlFlowGraphBuilder (883 lines): CFG builder from MLIR regions
- DataflowGraph (285 lines): data-flow graph construction
- MemorySSA (MemorySSA.h, 274 lines): memory SSA form for alias analysis
- MemorySsaBuilder (605 lines): memory SSA construction passes
- AliasAnalysis (228 lines): pointer aliasing analysis
- GraphAnalysis (626 lines): graph-level optimization analysis
- InterProceduralCFG (331 lines): cross-function CFG analysis
- tensor.h (146 lines): tensor abstraction for graph IR
Backend Lowering Passes:
- TritonToHFusion (160 lines): horizontal fusion pass
- TritonToHIVM (89 lines): HIVM backend code generation
- TritonToLLVM (269 lines): LLVM IR backend lowering
- TritonToAnnotation (56 lines): kernel metadata annotation
TritonToUnstructure (compiler/lib/TritonToUnstructure/):
Moved from Conversion/ with enhancements:
- BubbleUpOperation (164→enhanced): operation bubbling optimization
- OffsetAnalysis (413→enhanced): memory offset computation
- ReplaceArguments (301 lines, new): argument replacement utility
- UnstructureConversionPass (411→enhanced): unstructured conversion
New Infrastructure Passes:
DiscreteMaskAccessConversion (compiler/lib/DiscreteMaskAccessConversion/):
Moved from Conversion/ with substantial enhancements (366 lines)
DynamicCVPipeline (compiler/lib/DynamicCVPipeline/):
Dynamic compute/vector pipeline planning and execution:
- PlanComputeBlock: classifies ops and plans compute block boundaries
with OpClassifier (369 lines) and ComputeBlockIdManager (122 lines)
- AddControlFlowCondition: inserts dynamic CV control flow guards
- AddDynamicCVPipeline: orchestrates dynamic CV pipeline insertion
- AllocMultiCache: multi-level cache allocation for dynamic CV
- SplitDataflow: dataflow graph splitting pass
- SeparateMemoryFromCompute: memory/compute operation separation
- MemoryEffectsTracker (404 lines): memory side-effect analysis
TritonAffinityOpt (compiler/lib/TritonAffinityOpt/):
DAG-based affinity optimization for Ascend NPU:
- DAG (518 lines): directed acyclic graph construction and analysis
- DAGSSBuffer (5581 lines): smart-sync buffer assignment algorithm
- DAGScope (1084 lines): scope-based optimization regions
- DAGSync (1617 lines): synchronization insertion and optimization
AutoBlockify (compiler/lib/AutoBlockify/):
Automatic block formation for unstructured IR:
- AutoBlockify (345 lines): main blockification driver
- RewriteOperation (492 lines): IR rewriting for block formation
- Utils (191 lines): blockification utility functions
AscendLegalize (compiler/lib/AscendLegalize/):
Ascend NPU-specific IR legalization (74 lines)
Utility Changes:
- Utils.cpp expanded (1222→enhanced): general compiler utilities
- InterleaveOptimization (705 lines, new): interleaved execution
optimization for NPU memory/compute overlap
Build System:
- compiler/CMakeLists.txt restructured with proper add_subdirectory
- compiler/lib/CMakeLists.txt updated with all new pass directories
- dicp_triton_opt tool updated with new pass pipeline registration
Breaking Changes:
- Old Conversion/ pass names and pipeline ordering are removed
- Pass registration must use the new modular pass paths
- dicp_triton_opt pipeline arguments have changed
…on pipeline
Complete rewrite of the backend orchestration layer to support the new
multi-stage compilation pipeline and Triton 3.5 IR. The monolithic NPU
backend is split into modular components with clear separation of concerns.
New Components:
backend/npu_driver.py (968 lines):
New NPU driver implementation providing the core device interface for
Ascend NPU. Handles device initialization, memory management, kernel
launching, and synchronization with support for dynamic CV (compute/vector)
scenarios. Replaces the inline driver logic previously embedded
in npu.py.
backend/npu_compiler_flags.py (120 lines):
Centralized compiler flag management for the NPU backend. Defines all
compilation options including optimization levels, debug flags, memory
allocation strategies, and Ascend-specific tuning parameters. Provides
a single source of truth for compiler configuration across the pipeline.
Refactored Components:
backend/npu.py (2038→restructured):
Refactored from a monolithic backend module into a lean orchestration
layer. The NPU backend now delegates compilation to the new modular
pipeline (TritonToLinalg → TritonToStructured → ... → TritonToLLVM),
device management to npu_driver.py, and flag configuration to
npu_compiler_flags.py.
backend/utils.py (643→enhanced):
Expanded utility module with new helper functions for the restructured
compilation pipeline including IR dumping, pass timing, metadata
extraction, and device capability detection.
backend/compiler.py (153→refined):
Updated compiler options (DICPOptions) to align with the new pipeline
stages and flag configuration.
backend/commonir/backend.py (108→rewritten):
Rewritten common IR backend with a new commonir_to_linkedir pipeline
that integrates with the restructured pass infrastructure. Adds support
for IR replacement via environment variables for debugging, and
platform-specific compilation paths (910_95, nd2nz_on_vector,
select_analysis).
backend/commonir/compiler.py, adapter.py:
Updated to use TRITON_DEBUG environment variable and work with the
new compilation flow.
backend/driver.py:
Updated driver interface to support the new backend architecture.
backend/cpu_backend.py:
Refactored CPU backend for consistency with the new driver interface.
File Reorganization:
backend/cpu_verify/ (moved from backend/include/ExecutionEngine/):
CPU verification runtime files (CRunnerUtils.cpp, CRunnerUtils.h,
Msan.h) moved to a dedicated cpu_verify directory to clearly separate
the CPU verification backend from general execution engine headers.
Removed:
- backend/include/ExecutionEngine/version.txt
- dicp_triton.cc (superseded by enhanced triton_dicp_triton.cc)
C++ Bindings:
triton_dicp_triton.cc (1337→enhanced):
Significantly expanded pybind11 bindings to expose the new compilation
pipeline stages, pass managers, and dialect registrations to Python.
This is the critical bridge between the Python backend layer and the
C++ compiler infrastructure.
Binary Assets:
- backend/lib/libdevice.10.bc: Ascend device library bitcode for
standard math and utility functions (84KB)
Breaking Changes:
- Direct imports from backend/npu.py internals must be updated
- Driver initialization APIs have changed
- Compiler flag names and environment variables are renamed
(DLC_DUMP_IR → TRITON_DEBUG, versioned CI variables v34→v35)
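For callers affected by the DLC_DUMP_IR → TRITON_DEBUG rename, one possible migration aid is a lookup that prefers the new variable and falls back to the old one with a warning. This is a minimal sketch, not backend code: the helper name `debug_enabled`, the set of accepted truthy values, and the deprecation warning are all assumptions made for illustration.

```python
import os
import warnings

# Assumed set of values treated as "enabled"; the backend may parse differently.
_TRUTHY = {"1", "true", "on", "yes"}


def debug_enabled(env=None):
    """Hypothetical helper: True if IR debugging is requested via either variable."""
    env = os.environ if env is None else env
    # The new name always wins when it is present.
    if "TRITON_DEBUG" in env:
        return env["TRITON_DEBUG"].strip().lower() in _TRUTHY
    # Fall back to the pre-rename name, but tell the user to migrate.
    if "DLC_DUMP_IR" in env:
        warnings.warn(
            "DLC_DUMP_IR is deprecated; set TRITON_DEBUG instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return env["DLC_DUMP_IR"].strip().lower() in _TRUTHY
    return False
```

Passing a plain dict as `env` keeps the helper easy to test without touching the process environment.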
…modules
Complete restructure of the language/deeplink frontend package, introducing
a modular cann backend subpackage that separates extension, buffer, and
device library concerns. This aligns the frontend architecture with the
restructured compiler pipeline.
New cann Subpackage (language/deeplink/cann/):
cann/extension/ — Ascend NPU kernel extension framework:
- __init__.py (197 lines): Public API surface for custom operators,
synchronization primitives, and memory hierarchy annotations
- core.py (285 lines): Core extension infrastructure including
compute-copy overlap scheduling and multi-level memory ops
- custom_op.py (427 lines): Custom operator registration system with
support for multiple modes (MODE.CPU_VERIFY, MODE.NPU_EXEC)
- semantic.py (303 lines): Semantic analysis and lowering rules for
extension operations
- builder.py (73 lines): IR builder helpers for extension ops
- code_generator.py (162 lines): Code generation for extension ops
- dispatch.py (13 lines): Operation dispatch routing
- aux_ops.py (25 lines): Auxiliary operation definitions
- math_ops.py (24 lines): Math operation overrides
- mem_ops.py (182 lines): Memory operation primitives (insert_slice,
extract_slice, sync_block_*)
- vec_ops.py (64 lines): Vectorized operation primitives
- scope.py (43 lines): Scope-based resource management
cann/buffer/ — Buffer management subsystem:
- __init__.py (43 lines): Buffer API surface
- core.py (397 lines): Buffer allocation, deallocation, and lifetime
management with support for multi-level memory hierarchy
- builder.py (82 lines): Buffer IR construction helpers
- semantic.py (158 lines): Buffer operation semantic rules
cann/libdevice.py (1056 lines):
Comprehensive device library providing standard math functions
(exp, log, sin, cos, sqrt, rsqrt, erf, floor, ceil, fma, abs, etc.)
with bf16 cast support for Ascend NPU. Includes tanh with bf16
support that overrides triton.language.math.tanh.
Refactored Modules:
language/deeplink/__init__.py:
Restructured imports to delegate to cann subpackage. Standard math
functions are now glued to triton.language.math where possible,
with libdevice overrides for Ascend-specific implementations.
Re-exports extension, buffer, and custom_op APIs from cann.
language/deeplink/core.py:
Streamlined core module that delegates to cann subpackage internals.
Removed:
- language/deeplink/custom_op.py (379 lines, moved to cann/extension/)
- language/deeplink/libdevice.py (294 lines, moved to cann/)
- language/deeplink/semantic.py (107 lines, moved to cann/extension/)
Added:
- language/deeplink/extension.py (7 lines): Re-export shim
Breaking Changes:
- Direct imports from language.deeplink.custom_op must use
language.deeplink.cann.extension.custom_op or the re-exported API
- Direct imports from language.deeplink.libdevice must use
language.deeplink.cann.libdevice
- Custom operator registration APIs now reside under cann.extension
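Downstream code that has to run against both the old and the new layout could resolve the moved modules with a small fallback import. The sketch below is an assumption about how a caller might do this, not part of the PR; `import_first` is a hypothetical helper, and the module paths in the trailing comment are written as they appear in this commit message, so the real package root may carry an additional prefix.

```python
import importlib


def import_first(*candidates):
    """Hypothetical helper: return the first module in `candidates` that imports."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why each candidate failed so the final error is useful.
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidates could be imported:\n" + "\n".join(errors)
    )


# Intended use (paths as given in the commit message, new location first):
# custom_op = import_first(
#     "language.deeplink.cann.extension.custom_op",
#     "language.deeplink.custom_op",
# )
```

Listing the new location first means the fallback is only exercised on installations that predate the restructure.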
…ture
Comprehensive infrastructure upgrade to support Triton 3.5 and the
restructured compilation pipeline. Includes build system overhaul,
dependency updates, CI/CD modernization, and patch set refresh.
Triton 3.5 Upgrade:
third_party/triton:
Updated submodule pointer to Triton 3.5 compatible revision with new
IR definitions, updated dialect interfaces, and revised pass
infrastructure.
third_party/ascendnpu-ir:
Updated submodule pointer for Ascend NPU IR compatibility with the
new pipeline.
Patch Set Refresh (patch/triton/):
- Removed: include_triton_Dialect_Triton_IR_TritonOps_td.patch (Triton
upstream has incorporated these changes)
- Removed: python_triton_utils_py.patch (no longer needed)
- Removed: patch/ascendnpu-ir.patch (upstream updated)
- Removed: patch/ttshared/triton_shared.patch (deprecated)
- Added: CMakeLists_txt.patch — Triton CMake build integration
- Added: lib_Dialect_Triton_IR_Ops_cpp.patch — Triton op C++ fixes
- Added: python_src_ir_h.patch — Python IR header bindings
- Added: python_triton__utils_py.patch — utils.py compatibility
- Updated: python_src_ir_cc.patch — revised IR C++ bindings
- Updated: python_triton_compiler_code_generator_py.patch — codegen
- Updated: python_triton_compiler_compiler_py.patch — compiler flow
- Updated: python_triton_language_semantic_py.patch — semantic rules
- Updated: setup_py.patch — package configuration
- Updated: unittest_googletest_cmake.patch — test infrastructure
Build System:
CMakeLists.txt:
Restructured root CMakeLists to properly integrate the new compiler
pass directories and dialect libraries. Updated include paths and
library dependencies for the modular pipeline.
compile_shared.sh:
Updated shared library compilation script with new build flags and
paths for Triton 3.5 and the restructured compiler.
format.sh:
Updated code formatting script with expanded file patterns covering
new source directories.
conda.sh (new, 79 lines):
Conda environment setup script for reproducible development
environment creation with all required dependencies.
CI/CD Modernization (.github/workflows/):
main.yml, release.yml:
- Upgraded CI infrastructure from v34 to v35 (JSON_PATH35,
GOOGLETEST_DIR35, LLVM_TGZ_PATH35)
- Updated LLVM toolchain to llvm-7d5de303-ubuntu-arm64.tar.gz
- Removed redundant build steps to align with the new CMake structure
- Cleaned up environment variable propagation
Dependencies:
requirements.txt (128→restructured):
Updated Python package dependencies for Triton 3.5 compatibility.
Pinned versions for critical packages to ensure reproducible builds.
docker/Dockerfile (276→refined):
Updated Docker image definition with new CANN toolkit paths,
updated build dependencies, and revised environment setup for the
restructured compilation pipeline.
Configuration:
.gitignore:
Added patterns for new build artifacts and temporary directories.
.gitmodules:
Updated submodule URLs/branches if needed for new third-party refs.
README.md:
Minor updates reflecting the new project structure.
Test Updates:
test/ascend/:
Updated test files for new APIs and pipeline:
- cpu_verify/: test_bare_matmul, test_fa, test_vec_add
- failed_tests/: test_flip, test_gather
- passed_tests/: test_atan, test_common, test_isnan, test_log1p,
test_multi_return, test_pow, test_relu, test_zeros, test_zeroslike
- test_custom_op.py: updated for new extension API
- test_mlir.sh: updated MLIR test script
- mlir/: linalg_broadcast.mlir, linalg_multi_assign.mlir
test/commonir/run_tests.sh:
Updated common IR test runner.
… semantics
Inline the CompilerFlag declarative builder into direct imperative flag
construction within each linalg-to-bin entry point, removing the
npu_compiler_flags.py abstraction layer.
Move AscendLegalize pass/pattern class bodies from the header into the .cpp
file and add a MaxNumFToMaximumF rewrite pattern that replaces
arith::MaxNumFOp (NaN-quiet) with arith::MaximumFOp (NaN-propagating) on
Ascend NPU targets. This fixes online-softmax correctness where NaN must
propagate through the max reduction region.
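The difference between the two max semantics can be modelled in plain Python. This is only an illustrative sketch of the NaN behavior, not the MLIR rewrite pattern itself: the functions `maxnumf` and `maximumf` are named after the ops for readability, `online_softmax_denominator` is a hypothetical helper, and signed-zero ordering is ignored.

```python
import math


def maxnumf(a, b):
    """Model of arith.maxnumf: if one operand is NaN, return the other operand."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def maximumf(a, b):
    """Model of arith.maximumf: NaN in either operand yields NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)


def online_softmax_denominator(xs, max_fn):
    """Single-pass softmax statistics: running max `m` and rescaled sum `d`."""
    m, d = -math.inf, 0.0
    for x in xs:
        m_new = max_fn(m, x)
        # Rescale the accumulated sum to the new max, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d
```

With the input `[1.0, nan, 2.0]`, the `maxnumf` variant ends with a finite running max while the sum is already NaN, so the two statistics disagree about whether the row is poisoned. The `maximumf` variant makes both NaN, which is the consistent outcome the commit describes.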
…DLCompiler
Migrate triton-ascend auto-tuning (AST auto-tiling, compile-option search,
parallel compilation, NPU benchmark) into DLCompiler's ascend backend.
- ascend_autotune_runtime/: port autoparser, tile_generator, autotuner,
utils from triton-ascend/third_party/ascend/backend/runtime with import
paths fixed for triton.backends.dicp_triton
- ascend_autotune_hooks.py: lightweight module-level proxy that switches
triton.autotune/max_autotune based on _USE_ASCEND flag; installs on
import; no Strategy/Registry ceremony
- testing.py: do_bench_npu NPU profiler with kernel_details.csv parsing
- driver.py: wire hook_autotune_for_ascend() into ascend backend init
- patch/triton/python_triton___init___py.patch: add max_autotune fallback
- docs/ascend_autotune_design.md: architecture and data-flow documentation
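The "module-level proxy" idea from ascend_autotune_hooks.py can be sketched as follows. This is an assumption about the general shape only, not the file's actual contents: apart from the `_USE_ASCEND` flag name, every identifier here (`install_autotune_proxy`, `set_use_ascend`, `fake_triton`) is hypothetical, and a `SimpleNamespace` stands in for the `triton` module so the sketch runs without Triton installed.

```python
import types

_USE_ASCEND = True  # flag name from the commit message; the default is assumed


def set_use_ascend(enabled):
    """Hypothetical toggle for the module-level flag."""
    global _USE_ASCEND
    _USE_ASCEND = bool(enabled)


def install_autotune_proxy(module, ascend_autotune):
    """Replace `module.autotune` with a proxy that picks an implementation per call."""
    original = module.autotune

    def autotune_proxy(*args, **kwargs):
        # Read the flag at call time so later toggles take effect immediately.
        impl = ascend_autotune if _USE_ASCEND else original
        return impl(*args, **kwargs)

    module.autotune = autotune_proxy
    return original  # returned so a caller can restore the upstream function


# Stand-in for the real module; both implementations just tag their output.
fake_triton = types.SimpleNamespace(autotune=lambda **kw: ("upstream", kw))
install_autotune_proxy(fake_triton, lambda **kw: ("ascend", kw))
```

Because the proxy is a plain function assigned onto the module, no strategy class or registry is needed; switching backends is a single flag flip.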
This commit introduces a first-class compile-options path for Ascend
autotune runtime and wires it into the Triton backend flow, with
corresponding tests and small behavior updates.
Changes include:
- Add new runtime modules for compile option parsing, benchmark
orchestration, and autotune execution support.
- Extend backend/autotuner integration so options propagate through
parser/compiler/npu layers.
- Add C++ legalize updates needed for the new runtime behavior.
- Add and update autotune tests (doc e2e, compile_options, compatibility)
and remove obsolete deeplink runtime path.
- Keep existing CLI behavior stable while adding richer metadata for
benchmark and config checks.
Testing:
- Added test coverage in test/ascend/autotune/*
- Updated 04-libentry and do_bench compatibility tests to validate option
handling.
All staged changes are bug-fix oriented and focus on restoring stable
autotune behavior across Ascend backend components.
Changes:
- Fix CommonIR backend/compiler glue to preserve compile options and
argument metadata through autotune stages.
- Fix deeplink runtime package path migration to avoid missing module entry
points and cache handling regressions.
- Fix AscendLegalize pass source handling and keep patch metadata in sync
with compiler invocation changes.
- Add/update backend/autotune design doc alignment with runtime behavior
changes.
- Keep API surface unchanged while addressing latent runtime path and
compiler invocation bugs.
Validation:
- Existing regression candidates should be covered by the previously failing
staged tests in the autotune flow.
fix(ascend): update test_mod, test_pow, test_scalar_calc
No description provided.