refactor(kda): reorganize KDA backends into arch-first layout and add lazy imports #100

Open
cherhh wants to merge 8 commits into inclusionAI:main from cherhh:refactor

cherhh (Collaborator) commented Jun 26, 2026

📌 Description

Reorganization

Restructure the flat cula/ops/*.py layout into an arch-first hierarchy:

| Before | After |
| --- | --- |
| cula/ops/kda_decode.py | cula/ops/kda/decode/cute.py |
| cula/ops/fwd_o_sm100.py | cula/ops/kda/sm100/fwd_o.py |
| cula/ops/sm100/chunk_delta_h.py | cula/ops/kda/sm100/delta_h.py |
| cula/ops/sm100/bwd_wy_dqkg.py | cula/ops/kda/sm100/bwd_wy_dqkg.py |
| cula/ops/sm100/cp/* | cula/ops/kda/sm100/cp/* |
| cula/ops/intrinsics_sm100.py | cula/ops/sm100/ptx.py |
| cula/ops/ptx_umma_ext.py | (consolidated into cula/ops/sm100/ptx.py) |
| cula/ops/kda_fully_fused_sm100_wip.py | cula/ops/kda/experimental/sm100_fused/ |
| cula/ops/la_decode.py | cula/ops/lightning/decode.py |
| cula/ops/prefill_sm100.py | cula/ops/lightning/prefill_sm100.py |

New cula/ops/ layout: see updated REPO_LAYOUT.md.

Lazy imports (PEP 562)

Add __getattr__ in cula/__init__.py, cula/kda/__init__.py, and cula/ops/__init__.py so that importing the top-level package does not eagerly import CuTeDSL/CUDA kernel modules.
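
For readers unfamiliar with PEP 562, the mechanism can be sketched in a few lines. This is an illustration only, not the code added in this PR: the `_LAZY` table, `make_lazy_module`, and the use of the stdlib `colorsys` module as a stand-in for a heavy kernel module are all assumptions made for the example.

```python
import importlib
import types

# Hypothetical table: public attribute -> (module to import, name inside it).
# "colorsys" is a stdlib stand-in for a heavy CuTeDSL/CUDA kernel module.
_LAZY = {"rgb_to_hsv": ("colorsys", "rgb_to_hsv")}


def make_lazy_module(name, lazy):
    """Build a module object whose listed attributes are imported on first use."""
    mod = types.ModuleType(name)

    def __getattr__(attr):
        # PEP 562: called only when normal attribute lookup on the module fails.
        if attr not in lazy:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        target_module, target_name = lazy[attr]
        value = getattr(importlib.import_module(target_module), target_name)
        setattr(mod, attr, value)  # cache: later lookups bypass __getattr__
        return value

    mod.__getattr__ = __getattr__
    return mod


pkg = make_lazy_module("toy_pkg", _LAZY)
```

In a real package the same effect comes from defining `def __getattr__(name): ...` at the top level of `__init__.py`; the module object is built by hand here only so the sketch runs as a single file.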

Intracard-fwd_h cleanup

Separate policy from execution in the SM100 intra-card CP path:

  • Policy decisions are now handled entirely by sm100_intracard_cp_decision in policy.py
  • intracard_fwd_h is a pure executor: split and run, or raise NotSplittableError
  • The caller owns the fallback logic based on the policy result (forced CP → re-raise, auto → fall through to serial)
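
The split of responsibilities described above can be sketched roughly as follows. Only the name `NotSplittableError` and its `ValueError` base class come from this PR; `CPDecision`, `decide`, `run_split`, `run_serial`, `forward`, and the numeric thresholds are hypothetical stand-ins for `sm100_intracard_cp_decision`, `intracard_fwd_h`, and the caller, whose real signatures are not shown on this page.

```python
from enum import Enum


class NotSplittableError(ValueError):
    """Raised by the executor when a shape cannot be meaningfully split."""


class CPDecision(Enum):  # hypothetical result type for the policy
    SERIAL = "serial"
    AUTO = "auto"
    FORCED = "forced"


def decide(total_len, force_cp):
    """Policy only: choose a mode, never run anything (threshold is arbitrary)."""
    if force_cp:
        return CPDecision.FORCED
    return CPDecision.AUTO if total_len >= 1024 else CPDecision.SERIAL


def run_split(total_len, chunk_size):
    """Executor only: split and run, or raise. No fallback logic lives here."""
    if total_len // chunk_size < 2:
        raise NotSplittableError("fewer than two chunks; nothing to split")
    return "cp"


def run_serial(total_len, chunk_size):
    return "serial"


def forward(total_len, chunk_size, force_cp=False):
    """Caller: owns the fallback, based on the policy result."""
    decision = decide(total_len, force_cp)
    if decision is not CPDecision.SERIAL:
        try:
            return run_split(total_len, chunk_size)
        except NotSplittableError:
            if decision is CPDecision.FORCED:
                raise  # forced CP: surface the error to the user
            # auto: fall through to the serial path
    return run_serial(total_len, chunk_size)
```

With this shape the executor needs no fallback flags, and code that previously caught `ValueError` keeps working because `NotSplittableError` subclasses it.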

What is NOT changed

  • SM90 prefill kernel path: still uses the existing C++ kernel under csrc/kda/sm90/
  • SM100 kernel behavior: unchanged (pure move/rename)

🔍 Related Issues

N/A

🧪 Tests

  • Pre-commit (clang-format + ruff + ruff-format) — all passing

Reviewer Notes

cheheng.ch added 7 commits June 26, 2026 14:28
…a/ layout

- Move SM100 (Blackwell) modular-chunk backends, decode, and the unwired
  fully-fused WIP from flat cula/ops/*.py into cula/ops/kda/{sm100,decode,experimental}/.
- Move the non-KDA lightning/linear prototypes under cula/ops/.
- Add a central CP dispatch policy at cula/ops/kda/policy.py.
- Make cula / cula.ops / cula.kda imports lazy (PEP 562) so `import cula` no
  longer eagerly pulls the CuTeDSL/CUDA-heavy modules.
- Repoint all in-repo imports, benchmarks, tests, and docs.

Pure reorganization, no kernel behavior change. The SM90 (Hopper) prefill stays
the existing C++ kernel under csrc/kda/sm90.
- Drop kda_prefill_blackwell from the cula.kda public exports; the fully-fused
  Blackwell prefill (cula/ops/kda/experimental/sm100_fused/) is unwired WIP.
- get_kda_fused_fwd now raises NotImplementedError on SM100/SM103 instead of
  returning that experimental kernel.
- Production Blackwell prefill stays the modular chunk_kda path.
- intracard_fwd_h now raises NotSplittableError when the shape cannot be
  meaningfully split, instead of silently falling back.
- Drop the allow_fallback / skip_precheck flags, the two duplicated _no_cp
  fallback blocks, and the redundant pre-split heuristic recheck that the
  dispatch policy already performed.
- chunk_gated_delta_rule_fwd_h now owns the fallback: re-raise for forced CP,
  fall through to the serial body for auto.
- NotSplittableError subclasses ValueError for backward compatibility.

Behavior-preserving: force -> raise and auto -> serial fallback are unchanged.

cherhh requested review from icavan and zheyang0825 on June 26, 2026 06:30

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request reorganizes the repository by migrating the KDA backend kernels to an arch-first structure (grouping the SM100 Blackwell kernels, decode, and experimental paths) and refactoring the shared PTX/MLIR helpers. It also introduces a centralized context-parallel dispatch policy for SM100 in policy.py. One performance issue was identified: in chunk_gated_delta_rule_fwd_h, cu_seqlens_cpu can be materialized twice when the caller does not provide it, causing redundant device-to-host synchronizations. Materializing it once at the beginning of the function would resolve this.
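
The suggested fix can be illustrated with a toy model. Nothing below is taken from delta_h.py: `FakeDeviceTensor`, `num_sequences`, `seq_lengths`, `plan_twice`, and `plan_once` are hypothetical names, and the counter merely stands in for a device-to-host synchronization.

```python
class FakeDeviceTensor:
    """Toy stand-in for a GPU tensor; counts device-to-host copies."""

    def __init__(self, values):
        self._values = list(values)
        self.host_copies = 0

    def cpu(self):
        self.host_copies += 1  # each call models one sync point
        return list(self._values)


def num_sequences(cu_seqlens, cu_seqlens_cpu=None):
    host = cu_seqlens_cpu if cu_seqlens_cpu is not None else cu_seqlens.cpu()
    return len(host) - 1


def seq_lengths(cu_seqlens, cu_seqlens_cpu=None):
    host = cu_seqlens_cpu if cu_seqlens_cpu is not None else cu_seqlens.cpu()
    return [end - start for start, end in zip(host, host[1:])]


def plan_twice(cu_seqlens, cu_seqlens_cpu=None):
    # Each helper materializes the host copy on its own when none is given.
    return (num_sequences(cu_seqlens, cu_seqlens_cpu),
            seq_lengths(cu_seqlens, cu_seqlens_cpu))


def plan_once(cu_seqlens, cu_seqlens_cpu=None):
    # Materialize once up front, then pass the host copy to every helper.
    if cu_seqlens_cpu is None:
        cu_seqlens_cpu = cu_seqlens.cpu()
    return (num_sequences(cu_seqlens, cu_seqlens_cpu),
            seq_lengths(cu_seqlens, cu_seqlens_cpu))
```

Both variants return the same result; the only difference in this model is the number of host copies made when `cu_seqlens_cpu` is omitted.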

Comment thread cula/ops/kda/sm100/delta_h.py
- cula/__init__.py: drop trailing blank line
- cula/ops/kda/__init__.py: dedent module docstring, add final newline
- cula/ops/kda/sm100/delta_h.py: drop extra blank lines