Upgrade to Triton 3.5 and refactor the tt-to-linalg lowering by Truth-Ke · Pull Request #180 · DeepLink-org/DLCompiler

Truth-Ke · 2026-06-10T10:19:58Z

No description provided.

…eline

This is a foundational dialect-level refactoring that renames, removes, and introduces MLIR dialects in preparation for the new compilation pipeline:

Dialect Renames:
- NPU Dialect → TritonDicp Dialect
  The NPU dialect is renamed to TritonDicp to better reflect its role as the DICP-specific Triton intermediate representation. Expanded with new op definitions (TritonDicpOps.td), attribute definitions (TritonDicpAttrDefs.td), and corresponding C++ implementation.

Dialect Removals:
- LinalgExt Dialect (entirely removed)
  Removes the LinalgExt dialect including its IR definitions (Ops, Traits), transforms (LinalgGenericToSCF, LinalgIfToSelect, RemoveSingleIterationLoop, ScalarTo1DTensor, TensorTransform), and VectorizeParallelLoopPass. This functionality is superseded by the new TritonToLinalg and TritonToStructured conversion passes.
- TritonExt Dialect (entirely removed)
  Removes the TritonExt dialect including its transforms (BoolTritonPtrPromotionPass, CanonicalizeCmpiPass, CanonicalizeTritonIRAscend, CanonicalizerPattern). These canonicalization and optimization patterns are now handled by the new pass pipeline.

Dialect Additions:
- CommonIR Dialect (new)
  Provides shared infrastructure passes: VectorizeParallelLoop and AnnotateKernelAttrs. These were moved out of the removed LinalgExt dialect and are now available as common utilities across the pipeline.
- TritonStructured Dialect (new)
  Introduces a structured operations dialect that serves as the lowering target for Triton→Structured conversion, providing a clean IR interface between the Triton frontend dialect and backend-specific lowerings.

Breaking Changes:
- Any code referencing the NPU, LinalgExt, or TritonExt dialects must be updated to use TritonDicp, CommonIR, or the new conversion passes.
- Dialect registration order in pipeline setup must be updated.

…l IR analysis

This is the core compiler pipeline rewrite that replaces the old monolithic Conversion/ passes with a modular, multi-stage lowering pipeline backed by comprehensive graph-level analysis infrastructure.

Removed Passes (under compiler/lib/Conversion/):
- LinalgToLinked: Triton→Linalg→Linked lowering (superseded)
- LinalgToNPU: Linalg→NPU direct lowering (superseded)
- TritonToLinalgNPU: Triton→Linalg→NPU conversion with MemRefCopyGather and TritonArithToLinalg patterns (superseded)
- LinkedToHIVM: Linked→HIVM backend lowering (superseded)

New Conversion Passes:

TritonToLinalg (compiler/lib/TritonToLinalg/):
Comprehensive Triton IR → Linalg lowering with:
- TritonOpConverter (723-line header): converts all Triton dialect ops (load, store, atomic ops, reduce, scan, dot, make_range, etc.)
- BlockPtrAnalysis (2407 lines): block pointer analysis for efficient memory access pattern detection
- MaskAnalysis: predicated execution mask propagation
- UseAnalysis (559 lines): def-use chain analysis for value lifetime
- LoadStoreConverter (1242 lines): memory operation legalization
- ImplicitPermute (608 lines): automatic transpose insertion
- HoistBroadcast (206 lines): broadcast hoisting optimization
- DescriptorConverter (180 lines): memory descriptor generation
- FunctionConverter: kernel function signature conversion
- ArgMinMaxConverter: argmin/argmax lowering patterns
- MarkTensorKindPass: tensor kind annotation for backend selection
- AscendNPUIRLegalizePass: Ascend-specific IR legalization

TritonToStructured (compiler/lib/TritonToStructured/):
Alternative Triton→Structured lowering path with:
- CannonicalizerConverter (2594 lines): canonicalization-driven conversion from Triton to structured ops
- PtrAnalysis (1557 lines): pointer/address analysis
- MaskAnalysis (992 lines): structured mask propagation
- MemOpConverter (554 lines): memory operation lowering

TritonToGraph (compiler/lib/TritonToGraph/):
Graph-level IR analysis infrastructure:
- ControlFlowGraph (1120 lines): CFG construction and analysis with support for structured control flow regions
- ControlFlowGraphBuilder (883 lines): CFG builder from MLIR regions
- DataflowGraph (285 lines): data-flow graph construction
- MemorySSA (MemorySSA.h, 274 lines): memory SSA form for alias analysis
- MemorySsaBuilder (605 lines): memory SSA construction passes
- AliasAnalysis (228 lines): pointer aliasing analysis
- GraphAnalysis (626 lines): graph-level optimization analysis
- InterProceduralCFG (331 lines): cross-function CFG analysis
- tensor.h (146 lines): tensor abstraction for graph IR

Backend Lowering Passes:
- TritonToHFusion (160 lines): horizontal fusion pass
- TritonToHIVM (89 lines): HIVM backend code generation
- TritonToLLVM (269 lines): LLVM IR backend lowering
- TritonToAnnotation (56 lines): kernel metadata annotation

TritonToUnstructure (compiler/lib/TritonToUnstructure/):
Moved from Conversion/ with enhancements:
- BubbleUpOperation (164→enhanced): operation bubbling optimization
- OffsetAnalysis (413→enhanced): memory offset computation
- ReplaceArguments (301 lines, new): argument replacement utility
- UnstructureConversionPass (411→enhanced): unstructured conversion

New Infrastructure Passes:

DiscreteMaskAccessConversion (compiler/lib/DiscreteMaskAccessConversion/):
Moved from Conversion/ with substantial enhancements (366 lines)

DynamicCVPipeline (compiler/lib/DynamicCVPipeline/):
Dynamic compute/vector pipeline planning and execution:
- PlanComputeBlock: classifies ops and plans compute block boundaries with OpClassifier (369 lines) and ComputeBlockIdManager (122 lines)
- AddControlFlowCondition: inserts dynamic CV control flow guards
- AddDynamicCVPipeline: orchestrates dynamic CV pipeline insertion
- AllocMultiCache: multi-level cache allocation for dynamic CV
- SplitDataflow: dataflow graph splitting pass
- SeparateMemoryFromCompute: memory/compute operation separation
- MemoryEffectsTracker (404 lines): memory side-effect analysis

TritonAffinityOpt (compiler/lib/TritonAffinityOpt/):
DAG-based affinity optimization for Ascend NPU:
- DAG (518 lines): directed acyclic graph construction and analysis
- DAGSSBuffer (5581 lines): smart-sync buffer assignment algorithm
- DAGScope (1084 lines): scope-based optimization regions
- DAGSync (1617 lines): synchronization insertion and optimization

AutoBlockify (compiler/lib/AutoBlockify/):
Automatic block formation for unstructured IR:
- AutoBlockify (345 lines): main blockification driver
- RewriteOperation (492 lines): IR rewriting for block formation
- Utils (191 lines): blockification utility functions

AscendLegalize (compiler/lib/AscendLegalize/):
Ascend NPU-specific IR legalization (74 lines)

Utility Changes:
- Utils.cpp expanded (1222→enhanced): general compiler utilities
- InterleaveOptimization (705 lines, new): interleaved execution optimization for NPU memory/compute overlap

Build System:
- compiler/CMakeLists.txt restructured with proper add_subdirectory
- compiler/lib/CMakeLists.txt updated with all new pass directories
- dicp_triton_opt tool updated with new pass pipeline registration

Breaking Changes:
- Old Conversion/ pass names and pipeline ordering are removed
- Pass registration must use the new modular pass paths
- dicp_triton_opt pipeline arguments have changed

…on pipeline

Complete rewrite of the backend orchestration layer to support the new multi-stage compilation pipeline and Triton 3.5 IR. The monolithic NPU backend is split into modular components with clear separation of concerns.

New Components:

backend/npu_driver.py (968 lines):
New NPU driver implementation providing the core device interface for Ascend NPU. Handles device initialization, memory management, kernel launching, and synchronization with support for dynamic CV (compute vector) scenarios. Replaces the inline driver logic previously embedded in npu.py.

backend/npu_compiler_flags.py (120 lines):
Centralized compiler flag management for the NPU backend. Defines all compilation options including optimization levels, debug flags, memory allocation strategies, and Ascend-specific tuning parameters. Provides a single source of truth for compiler configuration across the pipeline.

Refactored Components:

backend/npu.py (2038→restructured):
Refactored from a monolithic backend module into a lean orchestration layer. The NPU backend now delegates compilation to the new modular pipeline (TritonToLinalg → TritonToStructured → ... → TritonToLLVM), device management to npu_driver.py, and flag configuration to npu_compiler_flags.py.

backend/utils.py (643→enhanced):
Expanded utility module with new helper functions for the restructured compilation pipeline including IR dumping, pass timing, metadata extraction, and device capability detection.

backend/compiler.py (153→refined):
Updated compiler options (DICPOptions) to align with the new pipeline stages and flag configuration.

backend/commonir/backend.py (108→rewritten):
Rewritten common IR backend with a new commonir_to_linkedir pipeline that integrates with the restructured pass infrastructure. Adds support for IR replacement via environment variables for debugging, and platform-specific compilation paths (910_95, nd2nz_on_vector, select_analysis).

backend/commonir/compiler.py, adapter.py:
Updated to use TRITON_DEBUG environment variable and work with the new compilation flow.

backend/driver.py:
Updated driver interface to support the new backend architecture.

backend/cpu_backend.py:
Refactored CPU backend for consistency with the new driver interface.

File Reorganization:

backend/cpu_verify/ (moved from backend/include/ExecutionEngine/):
CPU verification runtime files (CRunnerUtils.cpp, CRunnerUtils.h, Msan.h) moved to a dedicated cpu_verify directory to clearly separate the CPU verification backend from general execution engine headers.

Removed:
- backend/include/ExecutionEngine/version.txt
- dicp_triton.cc (superseded by enhanced triton_dicp_triton.cc)

C++ Bindings:

triton_dicp_triton.cc (1337→enhanced):
Significantly expanded pybind11 bindings to expose the new compilation pipeline stages, pass managers, and dialect registrations to Python. This is the critical bridge between the Python backend layer and the C++ compiler infrastructure.

Binary Assets:
- backend/lib/libdevice.10.bc: Ascend device library bitcode for standard math and utility functions (84KB)

Breaking Changes:
- Direct imports from backend/npu.py internals must be updated
- Driver initialization APIs have changed
- Compiler flag names and environment variables are renamed (DLC_DUMP_IR → TRITON_DEBUG, versioned CI variables v34→v35)
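The environment-variable rename (DLC_DUMP_IR → TRITON_DEBUG) can be bridged in downstream scripts with a small compatibility helper. The following is a minimal sketch only: the function name `debug_enabled`, the fallback to the legacy variable, and the treatment of `""` and `"0"` as "off" are all assumptions for illustration, not behavior stated in this PR.

```python
import os
import warnings


def debug_enabled(env=None):
    """Return True when debug output is requested (hypothetical helper).

    Prefers TRITON_DEBUG (the new name from this PR) and falls back to the
    legacy DLC_DUMP_IR with a deprecation warning. Treating "" and "0" as
    "off" is an assumption, not documented behavior.
    """
    env = os.environ if env is None else env
    if "TRITON_DEBUG" in env:
        return env["TRITON_DEBUG"] not in ("", "0")
    if "DLC_DUMP_IR" in env:
        warnings.warn(
            "DLC_DUMP_IR is deprecated; set TRITON_DEBUG instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return env["DLC_DUMP_IR"] not in ("", "0")
    return False
```

Passing a plain dict as `env` keeps the helper easy to exercise without touching the real process environment.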

…modules

Complete restructure of the language/deeplink frontend package, introducing a modular cann backend subpackage that separates extension, buffer, and device library concerns. This aligns the frontend architecture with the restructured compiler pipeline.

New cann Subpackage (language/deeplink/cann/):

cann/extension/: Ascend NPU kernel extension framework:
- __init__.py (197 lines): Public API surface for custom operators, synchronization primitives, and memory hierarchy annotations
- core.py (285 lines): Core extension infrastructure including compute-copy overlap scheduling and multi-level memory ops
- custom_op.py (427 lines): Custom operator registration system with support for multiple modes (MODE.CPU_VERIFY, MODE.NPU_EXEC)
- semantic.py (303 lines): Semantic analysis and lowering rules for extension operations
- builder.py (73 lines): IR builder helpers for extension ops
- code_generator.py (162 lines): Code generation for extension ops
- dispatch.py (13 lines): Operation dispatch routing
- aux_ops.py (25 lines): Auxiliary operation definitions
- math_ops.py (24 lines): Math operation overrides
- mem_ops.py (182 lines): Memory operation primitives (insert_slice, extract_slice, sync_block_*)
- vec_ops.py (64 lines): Vectorized operation primitives
- scope.py (43 lines): Scope-based resource management

cann/buffer/: Buffer management subsystem:
- __init__.py (43 lines): Buffer API surface
- core.py (397 lines): Buffer allocation, deallocation, and lifetime management with support for multi-level memory hierarchy
- builder.py (82 lines): Buffer IR construction helpers
- semantic.py (158 lines): Buffer operation semantic rules

cann/libdevice.py (1056 lines):
Comprehensive device library providing standard math functions (exp, log, sin, cos, sqrt, rsqrt, erf, floor, ceil, fma, abs, etc.) with bf16 cast support for Ascend NPU. Includes tanh with bf16 support that overrides triton.language.math.tanh.

Refactored Modules:

language/deeplink/__init__.py:
Restructured imports to delegate to cann subpackage. Standard math functions are now glued to triton.language.math where possible, with libdevice overrides for Ascend-specific implementations. Re-exports extension, buffer, and custom_op APIs from cann.

language/deeplink/core.py:
Streamlined core module that delegates to cann subpackage internals.

Removed:
- language/deeplink/custom_op.py (379 lines, moved to cann/extension/)
- language/deeplink/libdevice.py (294 lines, moved to cann/)
- language/deeplink/semantic.py (107 lines, moved to cann/extension/)

Added:
- language/deeplink/extension.py (7 lines): Re-export shim

Breaking Changes:
- Direct imports from language.deeplink.custom_op must use language.deeplink.cann.extension.custom_op or the re-exported API
- Direct imports from language.deeplink.libdevice must use language.deeplink.cann.libdevice
- Custom operator registration APIs now reside under cann.extension
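Code that has to run against both the old and the new module layout can try the new import path first and fall back to the old one. The helper below is a minimal sketch under stated assumptions: `import_first` is a hypothetical name, and the dotted paths in the usage comment are copied from the breaking-change notes above, where the top-level package root is not spelled out.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that imports successfully.

    Hypothetical migration helper: list the new path first and the legacy
    path second, so either layout works during the transition.
    """
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


# Usage sketch (paths as written in the PR text; the package root is an assumption):
# libdevice = import_first(
#     "language.deeplink.cann.libdevice",  # new location
#     "language.deeplink.libdevice",       # removed in this PR
# )
```

Once every consumer is on the new layout, the fallback entry can be dropped and the helper replaced with a plain import.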

…ture

Comprehensive infrastructure upgrade to support Triton 3.5 and the restructured compilation pipeline. Includes build system overhaul, dependency updates, CI/CD modernization, and patch set refresh.

Triton 3.5 Upgrade:

third_party/triton:
Updated submodule pointer to Triton 3.5 compatible revision with new IR definitions, updated dialect interfaces, and revised pass infrastructure.

third_party/ascendnpu-ir:
Updated submodule pointer for Ascend NPU IR compatibility with the new pipeline.

Patch Set Refresh (patch/triton/):
- Removed: include_triton_Dialect_Triton_IR_TritonOps_td.patch (Triton upstream has incorporated these changes)
- Removed: python_triton_utils_py.patch (no longer needed)
- Removed: patch/ascendnpu-ir.patch (upstream updated)
- Removed: patch/ttshared/triton_shared.patch (deprecated)
- Added: CMakeLists_txt.patch: Triton CMake build integration
- Added: lib_Dialect_Triton_IR_Ops_cpp.patch: Triton op C++ fixes
- Added: python_src_ir_h.patch: Python IR header bindings
- Added: python_triton__utils_py.patch: utils.py compatibility
- Updated: python_src_ir_cc.patch: revised IR C++ bindings
- Updated: python_triton_compiler_code_generator_py.patch: codegen
- Updated: python_triton_compiler_compiler_py.patch: compiler flow
- Updated: python_triton_language_semantic_py.patch: semantic rules
- Updated: setup_py.patch: package configuration
- Updated: unittest_googletest_cmake.patch: test infrastructure

Build System:

CMakeLists.txt:
Restructured root CMakeLists to properly integrate the new compiler pass directories and dialect libraries. Updated include paths and library dependencies for the modular pipeline.

compile_shared.sh:
Updated shared library compilation script with new build flags and paths for Triton 3.5 and the restructured compiler.

format.sh:
Updated code formatting script with expanded file patterns covering new source directories.

conda.sh (new, 79 lines):
Conda environment setup script for reproducible development environment creation with all required dependencies.

CI/CD Modernization (.github/workflows/):

main.yml, release.yml:
- Upgraded CI infrastructure from v34 to v35 (JSON_PATH35, GOOGLETEST_DIR35, LLVM_TGZ_PATH35)
- Updated LLVM toolchain to llvm-7d5de303-ubuntu-arm64.tar.gz
- Removed redundant build steps aligned with new CMake structure
- Cleaned up environment variable propagation

Dependencies:

requirements.txt (128→restructured):
Updated Python package dependencies for Triton 3.5 compatibility. Pinned versions for critical packages to ensure reproducible builds.

docker/Dockerfile (276→refined):
Updated Docker image definition with new CANN toolkit paths, updated build dependencies, and revised environment setup for the restructured compilation pipeline.

Configuration:

.gitignore: Added patterns for new build artifacts and temporary directories.
.gitmodules: Updated submodule URLs/branches if needed for new third-party refs.
README.md: Minor updates reflecting the new project structure.

Test Updates:

test/ascend/: Updated test files for new APIs and pipeline:
- cpu_verify/: test_bare_matmul, test_fa, test_vec_add
- failed_tests/: test_flip, test_gather
- passed_tests/: test_atan, test_common, test_isnan, test_log1p, test_multi_return, test_pow, test_relu, test_zeros, test_zeroslike
- test_custom_op.py: updated for new extension API
- test_mlir.sh: updated MLIR test script
- mlir/: linalg_broadcast.mlir, linalg_multi_assign.mlir

test/commonir/run_tests.sh: Updated common IR test runner.

… semantics Inline the CompilerFlag declarative builder into direct imperative flag construction within each linalg-to-bin entry point, removing the npu_compiler_flags.py abstraction layer. Move AscendLegalize pass/pattern class bodies from the header into the .cpp file and add a MaxNumFToMaximumF rewrite pattern that replaces arith::MaxNumFOp (NaN-quiet) with arith::MaximumFOp (NaN-propagating) on Ascend NPU targets. This fixes online-softmax correctness where NaN must propagate through the max reduction region.
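The difference between the two max flavors can be illustrated in plain Python. This is a sketch of the semantics only, not the MLIR rewrite pattern itself: `maxnumf` and `maximumf` below are hand-written stand-ins that mimic the documented NaN behavior of `arith.maxnumf` and `arith.maximumf`, and `online_softmax_stats` is a hypothetical single-pass running-max/running-sum loop of the kind online softmax uses.

```python
import math


def maxnumf(a, b):
    """NaN-quiet max: if exactly one operand is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def maximumf(a, b):
    """NaN-propagating max: any NaN operand yields NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)


def online_softmax_stats(xs, max_fn):
    """Single-pass running max and rescaled exponential sum."""
    running_max = -math.inf
    running_sum = 0.0
    for x in xs:
        new_max = max_fn(running_max, x)
        # Rescale the old sum to the new max, then add the new term.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(x - new_max))
        running_max = new_max
    return running_max, running_sum


data = [1.0, math.nan, 2.0]
quiet_max, _ = online_softmax_stats(data, maxnumf)          # max stays finite
propagated_max, _ = online_softmax_stats(data, maximumf)    # max becomes NaN
```

With the NaN-quiet variant the running max silently skips the NaN input and ends up finite, so the max statistic no longer reflects the poisoned input; the NaN-propagating variant keeps the max consistent with the rest of the reduction.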

…DLCompiler

Migrate triton-ascend auto-tuning (AST auto-tiling, compile-option search, parallel compilation, NPU benchmark) into DLCompiler's ascend backend.

- ascend_autotune_runtime/: port autoparser, tile_generator, autotuner, utils from triton-ascend/third_party/ascend/backend/runtime with import paths fixed for triton.backends.dicp_triton
- ascend_autotune_hooks.py: lightweight module-level proxy that switches triton.autotune/max_autotune based on _USE_ASCEND flag; installs on import; no Strategy/Registry ceremony
- testing.py: do_bench_npu NPU profiler with kernel_details.csv parsing
- driver.py: wire hook_autotune_for_ascend() into ascend backend init
- patch/triton/python_triton___init___py.patch: add max_autotune fallback
- docs/ascend_autotune_design.md: architecture and data-flow documentation
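The "module-level proxy" idea described for ascend_autotune_hooks.py can be sketched without importing Triton. Everything below is illustrative and assumed: `install`, `set_use_ascend`, `_ascend_autotune`, `_stock_autotune`, and the `fake_triton` namespace are hypothetical stand-ins, and the real hook module may be organized quite differently. The sketch only shows how a single proxy can choose an implementation at call time from a flag.

```python
from types import SimpleNamespace

# Flag storage; the PR text calls the flag _USE_ASCEND. Keeping it in a dict
# is a choice made for this sketch.
_state = {"use_ascend": False}
_original = {}


def _ascend_autotune(configs, key):
    """Stand-in for an Ascend-specific autotune decorator factory."""
    def decorator(fn):
        fn.tuned_by = "ascend"
        return fn
    return decorator


def install(target):
    """Replace target.autotune with a proxy exactly once (idempotent)."""
    if "autotune" in _original:
        return
    _original["autotune"] = target.autotune

    def proxy(*args, **kwargs):
        # Resolve the implementation on every call so flag flips take effect.
        impl = _ascend_autotune if _state["use_ascend"] else _original["autotune"]
        return impl(*args, **kwargs)

    target.autotune = proxy


def set_use_ascend(enabled):
    _state["use_ascend"] = bool(enabled)


def _stock_autotune(configs, key):
    """Stand-in for the stock autotune decorator factory."""
    def decorator(fn):
        fn.tuned_by = "stock"
        return fn
    return decorator


# A fake module object standing in for the real `triton` package.
fake_triton = SimpleNamespace(autotune=_stock_autotune)
install(fake_triton)
```

Because the proxy looks up the flag on each call, kernels decorated before and after `set_use_ascend(True)` pick up different implementations without re-installing anything.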

This commit introduces a first-class compile-options path for Ascend autotune runtime and wires it into the Triton backend flow, with corresponding tests and small behavior updates.

Changes include:
- Add new runtime modules for compile option parsing, benchmark orchestration, and autotune execution support.
- Extend backend/autotuner integration so options propagate through parser/compiler/npu layers.
- Add C++ legalize updates needed for the new runtime behavior.
- Add and update autotune tests (doc e2e, compile_options, compatibility) and remove obsolete deeplink runtime path.
- Keep existing CLI behavior stable while adding richer metadata for benchmark and config checks.

Testing:
- Added test coverage in test/ascend/autotune/*
- Updated 04-libentry and do_bench compatibility tests to validate option handling.

All staged changes are bug-fix oriented and focus on restoring stable autotune behavior across Ascend backend components.

Changes:
- Fix CommonIR backend/compiler glue to preserve compile options and argument metadata through autotune stages.
- Fix deeplink runtime package path migration to avoid missing module entry points and cache handling regressions.
- Fix AscendLegalize pass source handling and keep patch metadata in sync with compiler invocation changes.
- Add/update backend/autotune design doc alignment with runtime behavior changes.
- Keep API surface unchanged while addressing latent runtime path and compiler invocation bugs.

Validation:
- Existing regression candidates should be covered by the previously failing staged tests in the autotune flow.

fix(ascend):update test_mod test_pow test_scalar_calc

CLAassistant · 2026-06-10T10:20:06Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Truth-Ke
❌ liguoliang

liguoliang seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Truth-Ke and others added 16 commits May 25, 2026 17:56

chore: remove triton_shared in git

f388176

feat(compiler): remove DynamicCVPipeline and compute-block planning

d15d19c

fix: autotune import

bafc576

fix: repeat dsl & update workflow

05df92e

fix(ascend):update test_mod test_pow test_scalar_calc

c1986cb

fix: format code

3fa423d

Merge pull request #179 from liguoliangzi/lgl/fix-ascend-mod-pow

18a2b9d

fix(ascend):update test_mod test_pow test_scalar_calc

Truth-Ke force-pushed the kzx/triton3.5_0525 branch 3 times, most recently from e9bd863 to 7eb8ba3 Compare June 11, 2026 03:25

Truth-Ke closed this Jun 11, 2026

Truth-Ke reopened this Jun 11, 2026

Truth-Ke force-pushed the kzx/triton3.5_0525 branch from 7eb8ba3 to 022cf69 Compare June 11, 2026 07:42

Truth-Ke closed this Jun 11, 2026

Truth-Ke reopened this Jun 11, 2026

Truth-Ke closed this Jun 11, 2026

Truth-Ke reopened this Jun 11, 2026

Truth-Ke closed this Jun 11, 2026

Truth-Ke reopened this Jun 11, 2026

Truth-Ke force-pushed the kzx/triton3.5_0525 branch from 022cf69 to 50ff6fb Compare June 11, 2026 10:00

Truth-Ke force-pushed the kzx/triton3.5_0525 branch 5 times, most recently from 80e79b5 to e707315 Compare June 12, 2026 02:01

fix: repeat dsl & update workflow

00ce801

Truth-Ke force-pushed the kzx/triton3.5_0525 branch from e707315 to 00ce801 Compare June 12, 2026 03:04

Truth-Ke merged commit 278116b into main Jun 12, 2026
4 of 5 checks passed

Truth-Ke deleted the kzx/triton3.5_0525 branch June 12, 2026 06:41

Truth-Ke merged 17 commits into main from kzx/triton3.5_0525

2 participants
