Upgrade to Triton 3.5 and refactor tt-to-linalg #180

Merged: Truth-Ke merged 17 commits into main from kzx/triton3.5_0525 on Jun 12, 2026
@Truth-Ke

Collaborator

No description provided.

Truth-Ke and others added 16 commits May 25, 2026 17:56
…eline

This is a foundational dialect-level refactoring that renames, removes, and
introduces MLIR dialects in preparation for the new compilation pipeline:

Dialect Renames:
  - NPU Dialect → TritonDicp Dialect
    The NPU dialect is renamed to TritonDicp to better reflect its role as the
    DICP-specific Triton intermediate representation. Expanded with new op
    definitions (TritonDicpOps.td), attribute definitions
    (TritonDicpAttrDefs.td), and corresponding C++ implementation.

Dialect Removals:
  - LinalgExt Dialect (entirely removed)
    Removes the LinalgExt dialect including its IR definitions (Ops, Traits),
    transforms (LinalgGenericToSCF, LinalgIfToSelect,
    RemoveSingleIterationLoop, ScalarTo1DTensor, TensorTransform), and
    VectorizeParallelLoopPass. This functionality is superseded by the new
    TritonToLinalg and TritonToStructured conversion passes.

  - TritonExt Dialect (entirely removed)
    Removes the TritonExt dialect including its transforms
    (BoolTritonPtrPromotionPass, CanonicalizeCmpiPass,
    CanonicalizeTritonIRAscend, CanonicalizerPattern). These canonicalization
    and optimization patterns are now handled by the new pass pipeline.

Dialect Additions:
  - CommonIR Dialect (new)
    Provides shared infrastructure passes: VectorizeParallelLoop and
    AnnotateKernelAttrs. These were moved out of the removed LinalgExt
    dialect and are now available as common utilities across the pipeline.

  - TritonStructured Dialect (new)
    Introduces a structured operations dialect that serves as the lowering
    target for Triton→Structured conversion, providing a clean IR interface
    between the Triton frontend dialect and backend-specific lowerings.

Breaking Changes:
  - Any code referencing the NPU, LinalgExt, or TritonExt dialects must be
    updated to use TritonDicp, CommonIR, or the new conversion passes.
  - Dialect registration order in pipeline setup must be updated.
…l IR analysis

This is the core compiler pipeline rewrite that replaces the old monolithic
Conversion/ passes with a modular, multi-stage lowering pipeline backed by
comprehensive graph-level analysis infrastructure.

Removed Passes (under compiler/lib/Conversion/):
  - LinalgToLinked: Triton→Linalg→Linked lowering (superseded)
  - LinalgToNPU: Linalg→NPU direct lowering (superseded)
  - TritonToLinalgNPU: Triton→Linalg→NPU conversion with MemRefCopyGather
    and TritonArithToLinalg patterns (superseded)
  - LinkedToHIVM: Linked→HIVM backend lowering (superseded)

New Conversion Passes:

  TritonToLinalg (compiler/lib/TritonToLinalg/):
    Comprehensive Triton IR → Linalg lowering with:
    - TritonOpConverter (723-line header): converts all Triton dialect ops
      (load, store, atomic ops, reduce, scan, dot, make_range, etc.)
    - BlockPtrAnalysis (2407 lines): block pointer analysis for
      efficient memory access pattern detection
    - MaskAnalysis: predicated execution mask propagation
    - UseAnalysis (559 lines): def-use chain analysis for value lifetime
    - LoadStoreConverter (1242 lines): memory operation legalization
    - ImplicitPermute (608 lines): automatic transpose insertion
    - HoistBroadcast (206 lines): broadcast hoisting optimization
    - DescriptorConverter (180 lines): memory descriptor generation
    - FunctionConverter: kernel function signature conversion
    - ArgMinMaxConverter: argmin/argmax lowering patterns
    - MarkTensorKindPass: tensor kind annotation for backend selection
    - AscendNPUIRLegalizePass: Ascend-specific IR legalization

  TritonToStructured (compiler/lib/TritonToStructured/):
    Alternative Triton→Structured lowering path with:
    - CannonicalizerConverter (2594 lines): canonicalization-driven
      conversion from Triton to structured ops
    - PtrAnalysis (1557 lines): pointer/address analysis
    - MaskAnalysis (992 lines): structured mask propagation
    - MemOpConverter (554 lines): memory operation lowering

  TritonToGraph (compiler/lib/TritonToGraph/):
    Graph-level IR analysis infrastructure:
    - ControlFlowGraph (1120 lines): CFG construction and analysis
      with support for structured control flow regions
    - ControlFlowGraphBuilder (883 lines): CFG builder from MLIR regions
    - DataflowGraph (285 lines): data-flow graph construction
    - MemorySSA (MemorySSA.h, 274 lines): memory SSA form for alias analysis
    - MemorySsaBuilder (605 lines): memory SSA construction passes
    - AliasAnalysis (228 lines): pointer aliasing analysis
    - GraphAnalysis (626 lines): graph-level optimization analysis
    - InterProceduralCFG (331 lines): cross-function CFG analysis
    - tensor.h (146 lines): tensor abstraction for graph IR

  Backend Lowering Passes:
    - TritonToHFusion (160 lines): horizontal fusion pass
    - TritonToHIVM (89 lines): HIVM backend code generation
    - TritonToLLVM (269 lines): LLVM IR backend lowering
    - TritonToAnnotation (56 lines): kernel metadata annotation

  TritonToUnstructure (compiler/lib/TritonToUnstructure/):
    Moved from Conversion/ with enhancements:
    - BubbleUpOperation (164→enhanced): operation bubbling optimization
    - OffsetAnalysis (413→enhanced): memory offset computation
    - ReplaceArguments (301 lines, new): argument replacement utility
    - UnstructureConversionPass (411→enhanced): unstructured conversion

New Infrastructure Passes:

  DiscreteMaskAccessConversion (compiler/lib/DiscreteMaskAccessConversion/):
    Moved from Conversion/ with substantial enhancements (366 lines)

  DynamicCVPipeline (compiler/lib/DynamicCVPipeline/):
    Dynamic compute/vector pipeline planning and execution:
    - PlanComputeBlock: classifies ops and plans compute block boundaries
      with OpClassifier (369 lines) and ComputeBlockIdManager (122 lines)
    - AddControlFlowCondition: inserts dynamic CV control flow guards
    - AddDynamicCVPipeline: orchestrates dynamic CV pipeline insertion
    - AllocMultiCache: multi-level cache allocation for dynamic CV
    - SplitDataflow: dataflow graph splitting pass
    - SeparateMemoryFromCompute: memory/compute operation separation
    - MemoryEffectsTracker (404 lines): memory side-effect analysis

  TritonAffinityOpt (compiler/lib/TritonAffinityOpt/):
    DAG-based affinity optimization for Ascend NPU:
    - DAG (518 lines): directed acyclic graph construction and analysis
    - DAGSSBuffer (5581 lines): smart-sync buffer assignment algorithm
    - DAGScope (1084 lines): scope-based optimization regions
    - DAGSync (1617 lines): synchronization insertion and optimization

  AutoBlockify (compiler/lib/AutoBlockify/):
    Automatic block formation for unstructured IR:
    - AutoBlockify (345 lines): main blockification driver
    - RewriteOperation (492 lines): IR rewriting for block formation
    - Utils (191 lines): blockification utility functions

  AscendLegalize (compiler/lib/AscendLegalize/):
    Ascend NPU-specific IR legalization (74 lines)

Utility Changes:
  - Utils.cpp expanded (1222→enhanced): general compiler utilities
  - InterleaveOptimization (705 lines, new): interleaved execution
    optimization for NPU memory/compute overlap

Build System:
  - compiler/CMakeLists.txt restructured with proper add_subdirectory
  - compiler/lib/CMakeLists.txt updated with all new pass directories
  - dicp_triton_opt tool updated with new pass pipeline registration

Breaking Changes:
  - Old Conversion/ pass names and pipeline ordering are removed
  - Pass registration must use the new modular pass paths
  - dicp_triton_opt pipeline arguments have changed
…on pipeline

Complete rewrite of the backend orchestration layer to support the new
multi-stage compilation pipeline and Triton 3.5 IR. The monolithic NPU
backend is split into modular components with clear separation of concerns.

New Components:

  backend/npu_driver.py (968 lines):
    New NPU driver implementation providing the core device interface for
    Ascend NPU. Handles device initialization, memory management, kernel
    launching, and synchronization with support for dynamic CV
    (compute/vector) scenarios. Replaces the inline driver logic previously embedded
    in npu.py.

  backend/npu_compiler_flags.py (120 lines):
    Centralized compiler flag management for the NPU backend. Defines all
    compilation options including optimization levels, debug flags, memory
    allocation strategies, and Ascend-specific tuning parameters. Provides
    a single source of truth for compiler configuration across the pipeline.

Refactored Components:

  backend/npu.py (2038→restructured):
    Refactored from a monolithic backend module into a lean orchestration
    layer. The NPU backend now delegates compilation to the new modular
    pipeline (TritonToLinalg → TritonToStructured → ... → TritonToLLVM),
    device management to npu_driver.py, and flag configuration to
    npu_compiler_flags.py.

  backend/utils.py (643→enhanced):
    Expanded utility module with new helper functions for the restructured
    compilation pipeline including IR dumping, pass timing, metadata
    extraction, and device capability detection.

  backend/compiler.py (153→refined):
    Updated compiler options (DICPOptions) to align with the new pipeline
    stages and flag configuration.

  backend/commonir/backend.py (108→rewritten):
    Rewritten common IR backend with a new commonir_to_linkedir pipeline
    that integrates with the restructured pass infrastructure. Adds support
    for IR replacement via environment variables for debugging, and
    platform-specific compilation paths (910_95, nd2nz_on_vector,
    select_analysis).

  backend/commonir/compiler.py, adapter.py:
    Updated to use the TRITON_DEBUG environment variable and to work with
    the new compilation flow.

  backend/driver.py:
    Updated driver interface to support the new backend architecture.

  backend/cpu_backend.py:
    Refactored CPU backend for consistency with the new driver interface.

File Reorganization:

  backend/cpu_verify/ (moved from backend/include/ExecutionEngine/):
    CPU verification runtime files (CRunnerUtils.cpp, CRunnerUtils.h,
    Msan.h) moved to a dedicated cpu_verify directory to clearly separate
    the CPU verification backend from general execution engine headers.

Removed:
  - backend/include/ExecutionEngine/version.txt
  - dicp_triton.cc (superseded by enhanced triton_dicp_triton.cc)

C++ Bindings:

  triton_dicp_triton.cc (1337→enhanced):
    Significantly expanded pybind11 bindings to expose the new compilation
    pipeline stages, pass managers, and dialect registrations to Python.
    This is the critical bridge between the Python backend layer and the
    C++ compiler infrastructure.

Binary Assets:
  - backend/lib/libdevice.10.bc: Ascend device library bitcode for
    standard math and utility functions (84KB)

Breaking Changes:
  - Direct imports from backend/npu.py internals must be updated
  - Driver initialization APIs have changed
  - Compiler flag names and environment variables are renamed
    (DLC_DUMP_IR → TRITON_DEBUG, versioned CI variables v34→v35)
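
For scripts that still export the old variable, a transitional check could look like the sketch below. The commit message only states the rename, so the helper name `ir_debug_enabled`, the on/off interpretation, and the set of truthy spellings are all assumptions made for illustration, not the backend's actual behavior.

```python
import os
import warnings

# Assumption: these truthy spellings are illustrative, not taken from the backend.
_TRUTHY = {"1", "true", "on", "yes"}


def ir_debug_enabled(env=None):
    """Return True if IR debugging is requested (hypothetical helper).

    Prefers TRITON_DEBUG. Falls back to the legacy DLC_DUMP_IR name and
    emits a DeprecationWarning when only the legacy name is set.
    """
    env = os.environ if env is None else env
    if "TRITON_DEBUG" in env:
        return env["TRITON_DEBUG"].strip().lower() in _TRUTHY
    if "DLC_DUMP_IR" in env:
        warnings.warn(
            "DLC_DUMP_IR is renamed to TRITON_DEBUG",
            DeprecationWarning,
            stacklevel=2,
        )
        return env["DLC_DUMP_IR"].strip().lower() in _TRUTHY
    return False
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the process environment.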
…modules

Complete restructure of the language/deeplink frontend package, introducing
a modular cann backend subpackage that separates extension, buffer, and
device library concerns. This aligns the frontend architecture with the
restructured compiler pipeline.

New cann Subpackage (language/deeplink/cann/):

  cann/extension/ — Ascend NPU kernel extension framework:
    - __init__.py (197 lines): Public API surface for custom operators,
      synchronization primitives, and memory hierarchy annotations
    - core.py (285 lines): Core extension infrastructure including
      compute-copy overlap scheduling and multi-level memory ops
    - custom_op.py (427 lines): Custom operator registration system with
      support for multiple modes (MODE.CPU_VERIFY, MODE.NPU_EXEC)
    - semantic.py (303 lines): Semantic analysis and lowering rules for
      extension operations
    - builder.py (73 lines): IR builder helpers for extension ops
    - code_generator.py (162 lines): Code generation for extension ops
    - dispatch.py (13 lines): Operation dispatch routing
    - aux_ops.py (25 lines): Auxiliary operation definitions
    - math_ops.py (24 lines): Math operation overrides
    - mem_ops.py (182 lines): Memory operation primitives (insert_slice,
      extract_slice, sync_block_*)
    - vec_ops.py (64 lines): Vectorized operation primitives
    - scope.py (43 lines): Scope-based resource management

  cann/buffer/ — Buffer management subsystem:
    - __init__.py (43 lines): Buffer API surface
    - core.py (397 lines): Buffer allocation, deallocation, and lifetime
      management with support for multi-level memory hierarchy
    - builder.py (82 lines): Buffer IR construction helpers
    - semantic.py (158 lines): Buffer operation semantic rules

  cann/libdevice.py (1056 lines):
    Comprehensive device library providing standard math functions
    (exp, log, sin, cos, sqrt, rsqrt, erf, floor, ceil, fma, abs, etc.)
    with bf16 cast support for Ascend NPU. Includes tanh with bf16
    support that overrides triton.language.math.tanh.

Refactored Modules:

  language/deeplink/__init__.py:
    Restructured imports to delegate to cann subpackage. Standard math
    functions are now glued to triton.language.math where possible,
    with libdevice overrides for Ascend-specific implementations.
    Re-exports extension, buffer, and custom_op APIs from cann.

  language/deeplink/core.py:
    Streamlined core module that delegates to cann subpackage internals.

Removed:
  - language/deeplink/custom_op.py (379 lines, moved to cann/extension/)
  - language/deeplink/libdevice.py (294 lines, moved to cann/)
  - language/deeplink/semantic.py (107 lines, moved to cann/extension/)

Added:
  - language/deeplink/extension.py (7 lines): Re-export shim

Breaking Changes:
  - Direct imports from language.deeplink.custom_op must use
    language.deeplink.cann.extension.custom_op or the re-exported API
  - Direct imports from language.deeplink.libdevice must use
    language.deeplink.cann.libdevice
  - Custom operator registration APIs now reside under cann.extension
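
Code that has to run against both the old and the new layout could use an import fallback along these lines. This is a minimal sketch: `import_first` is a hypothetical helper, and the module paths in the usage comment are written relative to the package root exactly as they appear in the list above.

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in module_names that is importable."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


# Usage (assumed paths; adjust the prefix to the installed package root):
# libdevice = import_first(
#     "language.deeplink.cann.libdevice",  # location after this PR
#     "language.deeplink.libdevice",       # location before this PR
# )
```

Listing the new location first means the fallback is only taken on checkouts that predate the restructure.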
…ture

Comprehensive infrastructure upgrade to support Triton 3.5 and the
restructured compilation pipeline. Includes build system overhaul,
dependency updates, CI/CD modernization, and patch set refresh.

Triton 3.5 Upgrade:

  third_party/triton:
    Updated submodule pointer to Triton 3.5 compatible revision with new
    IR definitions, updated dialect interfaces, and revised pass
    infrastructure.

  third_party/ascendnpu-ir:
    Updated submodule pointer for Ascend NPU IR compatibility with the
    new pipeline.

Patch Set Refresh (patch/triton/):
  - Removed: include_triton_Dialect_Triton_IR_TritonOps_td.patch (Triton
    upstream has incorporated these changes)
  - Removed: python_triton_utils_py.patch (no longer needed)
  - Removed: patch/ascendnpu-ir.patch (upstream updated)
  - Removed: patch/ttshared/triton_shared.patch (deprecated)
  - Added: CMakeLists_txt.patch — Triton CMake build integration
  - Added: lib_Dialect_Triton_IR_Ops_cpp.patch — Triton op C++ fixes
  - Added: python_src_ir_h.patch — Python IR header bindings
  - Added: python_triton__utils_py.patch — utils.py compatibility
  - Updated: python_src_ir_cc.patch — revised IR C++ bindings
  - Updated: python_triton_compiler_code_generator_py.patch — codegen
  - Updated: python_triton_compiler_compiler_py.patch — compiler flow
  - Updated: python_triton_language_semantic_py.patch — semantic rules
  - Updated: setup_py.patch — package configuration
  - Updated: unittest_googletest_cmake.patch — test infrastructure

Build System:

  CMakeLists.txt:
    Restructured root CMakeLists to properly integrate the new compiler
    pass directories and dialect libraries. Updated include paths and
    library dependencies for the modular pipeline.

  compile_shared.sh:
    Updated shared library compilation script with new build flags and
    paths for Triton 3.5 and the restructured compiler.

  format.sh:
    Updated code formatting script with expanded file patterns covering
    new source directories.

  conda.sh (new, 79 lines):
    Conda environment setup script for reproducible development
    environment creation with all required dependencies.

CI/CD Modernization (.github/workflows/):

  main.yml, release.yml:
    - Upgraded CI infrastructure from v34 to v35 (JSON_PATH35,
      GOOGLETEST_DIR35, LLVM_TGZ_PATH35)
    - Updated LLVM toolchain to llvm-7d5de303-ubuntu-arm64.tar.gz
    - Removed redundant build steps aligned with new CMake structure
    - Cleaned up environment variable propagation

Dependencies:

  requirements.txt (128→restructured):
    Updated Python package dependencies for Triton 3.5 compatibility.
    Pinned versions for critical packages to ensure reproducible builds.

  docker/Dockerfile (276→refined):
    Updated Docker image definition with new CANN toolkit paths,
    updated build dependencies, and revised environment setup for the
    restructured compilation pipeline.

Configuration:

  .gitignore:
    Added patterns for new build artifacts and temporary directories.

  .gitmodules:
    Updated submodule URLs/branches as needed for the new third-party refs.

  README.md:
    Minor updates reflecting the new project structure.

Test Updates:

  test/ascend/:
    Updated test files for new APIs and pipeline:
    - cpu_verify/: test_bare_matmul, test_fa, test_vec_add
    - failed_tests/: test_flip, test_gather
    - passed_tests/: test_atan, test_common, test_isnan, test_log1p,
      test_multi_return, test_pow, test_relu, test_zeros, test_zeroslike
    - test_custom_op.py: updated for new extension API
    - test_mlir.sh: updated MLIR test script
    - mlir/: linalg_broadcast.mlir, linalg_multi_assign.mlir

  test/commonir/run_tests.sh:
    Updated common IR test runner.
… semantics

Inline the CompilerFlag declarative builder into direct imperative flag
construction within each linalg-to-bin entry point, removing the
npu_compiler_flags.py abstraction layer.

Move AscendLegalize pass/pattern class bodies from the header into the
.cpp file and add a MaxNumFToMaximumF rewrite pattern that replaces
arith::MaxNumFOp (NaN-quiet) with arith::MaximumFOp (NaN-propagating)
on Ascend NPU targets. This fixes online-softmax correctness where NaN
must propagate through the max reduction region.
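
The difference between the two ops can be modeled in plain Python. The sketch below is not the C++ rewrite pattern; it only mirrors the documented semantics of arith.maxnumf and arith.maximumf on scalars, and `row_max` is a hypothetical stand-in for the max reduction.

```python
import math


def maxnumf(a, b):
    """NaN-quiet max: a NaN operand is ignored when the other is a number."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def maximumf(a, b):
    """NaN-propagating max: any NaN operand makes the result NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)


def row_max(values, combine):
    """Fold a row with the given binary max, starting from -inf."""
    acc = -math.inf
    for v in values:
        acc = combine(acc, v)
    return acc


row = [1.0, math.nan, 3.0]
quiet = row_max(row, maxnumf)         # NaN is silently dropped
propagating = row_max(row, maximumf)  # NaN reaches the result
```

With the NaN-quiet variant the running maximum hides the bad input, which is why the softmax result could look valid even though one element was NaN.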
…DLCompiler

Migrate triton-ascend auto-tuning (AST auto-tiling, compile-option search,
parallel compilation, NPU benchmark) into DLCompiler's ascend backend.

- ascend_autotune_runtime/: port autoparser, tile_generator, autotuner,
  utils from triton-ascend/third_party/ascend/backend/runtime with
  import paths fixed for triton.backends.dicp_triton
- ascend_autotune_hooks.py: lightweight module-level proxy that switches
  triton.autotune/max_autotune based on _USE_ASCEND flag; installs on
  import; no Strategy/Registry ceremony
- testing.py: do_bench_npu NPU profiler with kernel_details.csv parsing
- driver.py: wire hook_autotune_for_ascend() into ascend backend init
- patch/triton/python_triton___init___py.patch: add max_autotune fallback
- docs/ascend_autotune_design.md: architecture and data-flow documentation
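
The flag-based proxy described for ascend_autotune_hooks.py can be pictured roughly as follows. This is a sketch under assumptions: `AutotuneProxy`, `default_autotune`, and `ascend_autotune` are hypothetical stand-ins, and the real module may install itself differently.

```python
class AutotuneProxy:
    """Callable that forwards to one of two autotune implementations.

    Hypothetical stand-in for the module-level proxy: the target is chosen
    at call time from a flag, so flipping the flag after import takes
    effect without re-installing anything.
    """

    def __init__(self, default_impl, ascend_impl, use_ascend=False):
        self._default_impl = default_impl
        self._ascend_impl = ascend_impl
        self.use_ascend = use_ascend  # plays the role of _USE_ASCEND

    def __call__(self, *args, **kwargs):
        impl = self._ascend_impl if self.use_ascend else self._default_impl
        return impl(*args, **kwargs)


# Stand-in decorator factories. In the real backend these would be the
# upstream triton.autotune and the ported Ascend autotuner (assumption).
def default_autotune(configs, key):
    def decorator(fn):
        fn.tuned_by = "default"
        return fn
    return decorator


def ascend_autotune(configs, key):
    def decorator(fn):
        fn.tuned_by = "ascend"
        return fn
    return decorator


autotune = AutotuneProxy(default_autotune, ascend_autotune)


def tune_kernel():
    """Decorate a dummy kernel and report which implementation handled it."""
    @autotune(configs=[], key=["n"])
    def kernel():
        return None
    return kernel.tuned_by
```

Because the choice is made per call, kernels decorated before the flag flips keep the implementation that was active at decoration time.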

This commit introduces a first-class compile-options path for the Ascend autotune runtime and wires it into the Triton backend flow, with corresponding tests and small behavior updates.

Changes include:

- Add new runtime modules for compile option parsing, benchmark orchestration, and autotune execution support.

- Extend backend/autotuner integration so options propagate through parser/compiler/npu layers.

- Add C++ legalize updates needed for the new runtime behavior.

- Add and update autotune tests (doc e2e, compile_options, compatibility) and remove obsolete deeplink runtime path.

- Keep existing CLI behavior stable while adding richer metadata for benchmark and config checks.

Testing:

- Added test coverage in test/ascend/autotune/*

- Updated 04-libentry and do_bench compatibility tests to validate option handling.

All staged changes are bug-fix oriented and focus on restoring stable autotune behavior across Ascend backend components.

Changes:

- Fix CommonIR backend/compiler glue to preserve compile options and argument metadata through autotune stages.

- Fix deeplink runtime package path migration to avoid missing module entry points and cache handling regressions.

- Fix AscendLegalize pass source handling and keep patch metadata in sync with compiler invocation changes.

- Align the backend/autotune design doc with the runtime behavior changes.

- Keep API surface unchanged while addressing latent runtime path and compiler invocation bugs.

Validation:

- Existing regression candidates should be covered by the previously failing staged tests in the autotune flow.

fix(ascend): update test_mod, test_pow, test_scalar_calc
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Truth-Ke
❌ liguoliang


liguoliang does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Truth-Ke force-pushed the kzx/triton3.5_0525 branch 3 times, most recently from e9bd863 to 7eb8ba3, June 11, 2026 03:25
Truth-Ke closed this Jun 11, 2026
Truth-Ke reopened this Jun 11, 2026
Truth-Ke force-pushed the kzx/triton3.5_0525 branch from 7eb8ba3 to 022cf69, June 11, 2026 07:42
Truth-Ke closed this Jun 11, 2026
Truth-Ke reopened this Jun 11, 2026
Truth-Ke closed this Jun 11, 2026
Truth-Ke reopened this Jun 11, 2026
Truth-Ke closed this Jun 11, 2026
Truth-Ke reopened this Jun 11, 2026
Truth-Ke force-pushed the kzx/triton3.5_0525 branch from 022cf69 to 50ff6fb, June 11, 2026 10:00
Truth-Ke force-pushed the kzx/triton3.5_0525 branch 5 times, most recently from 80e79b5 to e707315, June 12, 2026 02:01
Truth-Ke force-pushed the kzx/triton3.5_0525 branch from e707315 to 00ce801, June 12, 2026 03:04
Truth-Ke merged commit 278116b into main Jun 12, 2026
4 of 5 checks passed
Truth-Ke deleted the kzx/triton3.5_0525 branch June 12, 2026 06:41