refactor(kda): reorganize KDA backends into arch-first layout and add lazy imports #100
cherhh wants to merge 8 commits into
…a/ layout
- Move SM100 (Blackwell) modular-chunk backends, decode, and the unwired
fully-fused WIP from flat cula/ops/*.py into cula/ops/kda/{sm100,decode,experimental}/.
- Move the non-KDA lightning/linear prototypes under cula/ops/.
- Add a central CP dispatch policy at cula/ops/kda/policy.py.
- Make cula / cula.ops / cula.kda imports lazy (PEP 562) so `import cula` no
longer eagerly pulls the CuTeDSL/CUDA-heavy modules.
- Repoint all in-repo imports, benchmarks, tests, and docs.
Pure reorganization, no kernel behavior change. The SM90 (Hopper) prefill stays
the existing C++ kernel under csrc/kda/sm90.
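The lazy-import bullet above relies on the module-level `__getattr__` hook from PEP 562. Below is a minimal sketch of that mechanism, not the code in this PR: `make_lazy_package`, `fake_pkg`, and the name-to-module table are hypothetical, and stdlib modules stand in for the CuTeDSL/CUDA-heavy ones. In a real package the two hooks would simply be defined at the top level of `__init__.py`.

```python
import importlib
import types


def make_lazy_package(name, lazy_attrs):
    """Build a module object whose listed attributes are imported on first access.

    lazy_attrs maps a public name to a (module_name, attribute_name) pair.
    """
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # PEP 562: called only when normal attribute lookup on the module fails.
        if attr not in lazy_attrs:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module_name, target = lazy_attrs[attr]
        value = getattr(importlib.import_module(module_name), target)
        setattr(pkg, attr, value)  # cache so later lookups bypass __getattr__
        return value

    def __dir__():
        # Advertise lazy names to dir() and tab completion before they load.
        return sorted(set(vars(pkg)) | set(lazy_attrs))

    # The hooks must live in the module's own namespace to take effect.
    pkg.__getattr__ = __getattr__
    pkg.__dir__ = __dir__
    return pkg


# Hypothetical stand-in for a package such as cula.
fake_pkg = make_lazy_package(
    "fake_pkg",
    {"dumps": ("json", "dumps"), "sqrt": ("math", "sqrt")},
)
```

Accessing `fake_pkg.dumps` triggers the import of `json` and caches the result on the package; names outside the table still raise `AttributeError`, so `hasattr` checks keep working.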
- Drop kda_prefill_blackwell from the cula.kda public exports; the fully-fused Blackwell prefill (cula/ops/kda/experimental/sm100_fused/) is unwired WIP.
- get_kda_fused_fwd now raises NotImplementedError on SM100/SM103 instead of returning that experimental kernel.
- Production Blackwell prefill stays the modular chunk_kda path.
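As a rough illustration of the "raise instead of returning an experimental kernel" behavior, the sketch below dispatches on a CUDA compute capability tuple. `select_fused_fwd` and the returned label are hypothetical; the real signature of get_kda_fused_fwd is not shown in this PR text, and the SM90 branch is an assumption based on the note that the Hopper prefill stays the existing C++ kernel.

```python
def select_fused_fwd(capability):
    """Return a label for the fused forward backend, or raise if none is wired.

    capability is a (major, minor) compute capability tuple, e.g. (9, 0) for SM90.
    """
    if capability == (9, 0):
        # Assumption: Hopper keeps the C++ kernel; the label is a placeholder.
        return "sm90_cpp_prefill"
    if capability in {(10, 0), (10, 3)}:
        # SM100/SM103: fail loudly rather than hand back an unwired WIP kernel.
        raise NotImplementedError(
            "fully-fused prefill is experimental on SM100/SM103; "
            "use the modular chunk path instead"
        )
    raise NotImplementedError(f"no fused forward for compute capability {capability}")
```

Raising here keeps callers from silently picking up a kernel that is not production-ready; they have to choose the modular path explicitly.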
- intracard_fwd_h now raises NotSplittableError when the shape cannot be meaningfully split, instead of silently falling back.
- Drop the allow_fallback / skip_precheck flags, the two duplicated _no_cp fallback blocks, and the redundant pre-split heuristic recheck that the dispatch policy already performed.
- chunk_gated_delta_rule_fwd_h now owns the fallback: re-raise for forced CP, fall through to the serial body for auto.
- NotSplittableError subclasses ValueError for backward compatibility.
Behavior-preserving: force -> raise and auto -> serial fallback are unchanged.
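The executor/caller split described in this commit can be sketched on a toy problem. Everything except the NotSplittableError name is hypothetical: `split_sum` plays the role of a pure executor, `total` plays the caller that owns the fallback, and the `mode` switch is an assumed stand-in for the forced vs. auto CP setting.

```python
class NotSplittableError(ValueError):
    """Raised by the executor when the input cannot be meaningfully split.

    Subclassing ValueError keeps existing `except ValueError` handlers working.
    """


def split_sum(values, num_splits):
    # Executor only: split and run, or raise. No fallback logic lives here.
    if num_splits < 2 or len(values) < num_splits:
        raise NotSplittableError(
            f"cannot split {len(values)} items into {num_splits} parts"
        )
    step = -(-len(values) // num_splits)  # ceiling division
    return sum(sum(values[i:i + step]) for i in range(0, len(values), step))


def total(values, num_splits, mode="auto"):
    # The caller owns the fallback. mode is a hypothetical "auto" | "force" | "off".
    if mode != "off":
        try:
            return split_sum(values, num_splits)
        except NotSplittableError:
            if mode == "force":
                raise  # forced split: surface the error to the user
            # auto: fall through to the serial body below
    return sum(values)  # serial body
```

With this shape the executor needs no allow_fallback-style flags: whether a failed split is an error or a cue to run serially is decided in exactly one place.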
…d stale REPO_LAYOUT sections
…st_intracard_cp.py
Code Review
This pull request reorganizes the repository layout by migrating KDA backend kernels to an arch-first structure (e.g., grouping SM100 Blackwell kernels, decode, and experimental paths) and refactoring shared PTX/MLIR helpers. It also introduces a centralized context-parallel dispatch policy in policy.py for SM100. Regarding the changes, a performance improvement opportunity was identified in chunk_gated_delta_rule_fwd_h where cu_seqlens_cpu can be materialized twice when it is not provided, leading to redundant device-to-host synchronizations. Materializing it once at the beginning of the function would resolve this.
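The review's suggestion, materializing the host copy once, can be sketched without a GPU. `FakeDeviceTensor`, `decide_cp`, `fwd_h`, and the threshold are all hypothetical stand-ins; the class only counts simulated device-to-host copies so the effect is visible.

```python
class FakeDeviceTensor:
    """Stand-in for a GPU tensor that counts simulated device-to-host copies."""

    sync_count = 0

    def __init__(self, data):
        self.data = list(data)

    def cpu(self):
        # On real hardware this copy would force a device-to-host synchronization.
        FakeDeviceTensor.sync_count += 1
        return list(self.data)


def decide_cp(cu_seqlens_cpu, min_len=4):
    # Hypothetical policy: split only if some sequence is at least min_len long.
    lengths = [b - a for a, b in zip(cu_seqlens_cpu, cu_seqlens_cpu[1:])]
    return max(lengths) >= min_len


def fwd_h(cu_seqlens, cu_seqlens_cpu=None):
    # Materialize the host copy once, up front, and reuse it for every consumer.
    if cu_seqlens_cpu is None:
        cu_seqlens_cpu = cu_seqlens.cpu()
    use_cp = decide_cp(cu_seqlens_cpu)
    lengths = [b - a for a, b in zip(cu_seqlens_cpu, cu_seqlens_cpu[1:])]
    return use_cp, lengths
```

Both the policy check and the length computation read the same host list, so a call without a precomputed `cu_seqlens_cpu` pays for one copy instead of two.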
- cula/__init__.py: drop trailing blank line
- cula/ops/kda/__init__.py: dedent module docstring, add final newline
- cula/ops/kda/sm100/delta_h.py: drop extra blank lines
📌 Description
Reorganization
Restructure the flat cula/ops/*.py layout into an arch-first hierarchy:
- cula/ops/kda_decode.py -> cula/ops/kda/decode/cute.py
- cula/ops/fwd_o_sm100.py -> cula/ops/kda/sm100/fwd_o.py
- cula/ops/sm100/chunk_delta_h.py -> cula/ops/kda/sm100/delta_h.py
- cula/ops/sm100/bwd_wy_dqkg.py -> cula/ops/kda/sm100/bwd_wy_dqkg.py
- cula/ops/sm100/cp/* -> cula/ops/kda/sm100/cp/*
- cula/ops/intrinsics_sm100.py -> cula/ops/sm100/ptx.py
- cula/ops/ptx_umma_ext.py -> cula/ops/sm100/ptx.py
- cula/ops/kda_fully_fused_sm100_wip.py -> cula/ops/kda/experimental/sm100_fused/
- cula/ops/la_decode.py -> cula/ops/lightning/decode.py
- cula/ops/prefill_sm100.py -> cula/ops/lightning/prefill_sm100.py
New cula/ops/ layout: see updated REPO_LAYOUT.md.
Lazy imports (PEP 562)
Add __getattr__ in cula/__init__.py, cula/kda/__init__.py, and cula/ops/__init__.py so that importing the top-level package does not eagerly import CuTeDSL/CUDA kernel modules.
Intracard-fwd_h cleanup
Separate policy from execution in the SM100 intra-card CP path:
- sm100_intracard_cp_decision in policy.py
- intracard_fwd_h is a pure executor: split and run, or raise NotSplittableError
What is NOT changed
csrc/kda/sm90/
🔍 Related Issues
N/A
🧪 Tests
Reviewer Notes