refactor(kda): reorganize KDA backends into arch-first layout and add lazy imports by cherhh · Pull Request #100 · inclusionAI/cuLA

cherhh · 2026-06-26T06:30:18Z

📌 Description

Reorganization

Restructure the flat cula/ops/*.py layout into an arch-first hierarchy:

| Before | After |
| --- | --- |
| `cula/ops/kda_decode.py` | `cula/ops/kda/decode/cute.py` |
| `cula/ops/fwd_o_sm100.py` | `cula/ops/kda/sm100/fwd_o.py` |
| `cula/ops/sm100/chunk_delta_h.py` | `cula/ops/kda/sm100/delta_h.py` |
| `cula/ops/sm100/bwd_wy_dqkg.py` | `cula/ops/kda/sm100/bwd_wy_dqkg.py` |
| `cula/ops/sm100/cp/*` | `cula/ops/kda/sm100/cp/*` |
| `cula/ops/intrinsics_sm100.py` | `cula/ops/sm100/ptx.py` |
| `cula/ops/ptx_umma_ext.py` | (consolidated into `cula/ops/sm100/ptx.py`) |
| `cula/ops/kda_fully_fused_sm100_wip.py` | `cula/ops/kda/experimental/sm100_fused/` |
| `cula/ops/la_decode.py` | `cula/ops/lightning/decode.py` |
| `cula/ops/prefill_sm100.py` | `cula/ops/lightning/prefill_sm100.py` |

New cula/ops/ layout: see updated REPO_LAYOUT.md.

Lazy imports (PEP 562)

Add __getattr__ in cula/__init__.py, cula/kda/__init__.py, and cula/ops/__init__.py so that importing the top-level package does not eagerly import CuTeDSL/CUDA kernel modules.
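As a rough illustration of the PEP 562 mechanism mentioned above, the sketch below builds an in-memory module whose attributes are imported on first access. It is not the code from this PR: `make_lazy_package`, `_LAZY_ATTRS`, and `demo_pkg` are hypothetical names, and the stdlib module `colorsys` stands in for a heavy kernel module.

```python
import importlib
import types

# Hypothetical mapping: public attribute name -> module that defines it.
# colorsys is only a stand-in for a heavy CuTeDSL/CUDA kernel module.
_LAZY_ATTRS = {"rgb_to_hsv": "colorsys"}


def make_lazy_package(name, lazy_attrs):
    """Build an in-memory module whose attributes resolve on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails (PEP 562).
        if attr not in lazy_attrs:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(lazy_attrs[attr]), attr)
        pkg.__dict__[attr] = value  # cache so later lookups skip __getattr__
        return value

    pkg.__getattr__ = __getattr__
    return pkg


demo = make_lazy_package("demo_pkg", _LAZY_ATTRS)
```

In a real `__init__.py` the same effect comes from defining `def __getattr__(name): ...` at module level; the factory here only exists so the sketch runs without creating a package on disk.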

Intracard-fwd_h cleanup

Separate policy from execution in the SM100 intra-card CP path:

- Policy decisions are now handled entirely by sm100_intracard_cp_decision in policy.py
- intracard_fwd_h is a pure executor: split and run, or raise NotSplittableError
- The caller owns the fallback logic based on the policy result (forced CP → re-raise, auto → fall through to serial)
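The split between policy, executor, and caller described above might look roughly like the following. This is a minimal sketch under assumptions, not the PR's implementation: only the `NotSplittableError` name and its `ValueError` base come from the PR text, while `CPMode`, `split_and_run`, `serial_run`, `fwd_h`, and the `min_chunk` threshold are hypothetical.

```python
from enum import Enum


class NotSplittableError(ValueError):
    """Raised by the executor when the shape cannot be meaningfully split."""


class CPMode(Enum):  # hypothetical stand-in for the policy result
    SERIAL = "serial"
    AUTO = "auto"
    FORCED = "forced"


def split_and_run(seq_lens, min_chunk):
    """Pure executor: split and run, or raise. No fallback logic here."""
    # Hypothetical splittability rule, used only for illustration.
    if max(seq_lens) < 2 * min_chunk:
        raise NotSplittableError(f"no sequence is >= {2 * min_chunk} tokens")
    return ("cp", list(seq_lens))


def serial_run(seq_lens):
    """Serial body used when CP is off or cannot be applied."""
    return ("serial", list(seq_lens))


def fwd_h(seq_lens, mode, min_chunk=64):
    """Caller: owns the fallback decision based on the policy result."""
    if mode is not CPMode.SERIAL:
        try:
            return split_and_run(seq_lens, min_chunk)
        except NotSplittableError:
            if mode is CPMode.FORCED:
                raise  # forced CP: surface the error to the user
            # auto: fall through to the serial body below
    return serial_run(seq_lens)
```

Because the error subclasses `ValueError`, existing callers that catch `ValueError` keep working when CP is forced on an unsplittable shape.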

What is NOT changed

- SM90 prefill kernel path: still uses the existing C++ kernel under csrc/kda/sm90/
- SM100 kernel behavior: unchanged (pure move/rename)

🔍 Related Issues

N/A

🧪 Tests

Pre-commit (clang-format + ruff + ruff-format) — all passing

Reviewer Notes

…a/ layout

- Move SM100 (Blackwell) modular-chunk backends, decode, and the unwired fully-fused WIP from flat cula/ops/*.py into cula/ops/kda/{sm100,decode,experimental}/.
- Move the non-KDA lightning/linear prototypes under cula/ops/.
- Add a central CP dispatch policy at cula/ops/kda/policy.py.
- Make cula / cula.ops / cula.kda imports lazy (PEP 562) so `import cula` no longer eagerly pulls the CuTeDSL/CUDA-heavy modules.
- Repoint all in-repo imports, benchmarks, tests, and docs.

Pure reorganization, no kernel behavior change. The SM90 (Hopper) prefill stays the existing C++ kernel under csrc/kda/sm90.

- Drop kda_prefill_blackwell from the cula.kda public exports; the fully-fused Blackwell prefill (cula/ops/kda/experimental/sm100_fused/) is unwired WIP.
- get_kda_fused_fwd now raises NotImplementedError on SM100/SM103 instead of returning that experimental kernel.
- Production Blackwell prefill stays the modular chunk_kda path.
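The dispatch change above can be pictured with a small stand-in. This is a hedged sketch only: `get_kda_fused_fwd_sketch`, its `capability` parameter, and the returned label are hypothetical, and the real `get_kda_fused_fwd` signature is not shown in this PR page. The only facts taken from the PR are that SM100/SM103 now raise `NotImplementedError` and that SM90 keeps the C++ prefill kernel.

```python
def get_kda_fused_fwd_sketch(capability):
    """Pick a fused prefill backend from a (major, minor) compute capability.

    Hypothetical stand-in: returns a label instead of a real kernel callable.
    """
    major, minor = capability
    if (major, minor) in {(10, 0), (10, 3)}:
        # SM100 / SM103: the fully-fused kernel is unwired WIP, so refuse
        # rather than hand back an experimental kernel.
        raise NotImplementedError(
            "fully-fused KDA prefill is experimental on SM100/SM103; "
            "use the modular chunk_kda path"
        )
    if major == 9:
        return "sm90_cpp_prefill"  # label for the C++ kernel under csrc/kda/sm90
    raise NotImplementedError(f"unsupported compute capability {major}.{minor}")
```

Raising early makes the unsupported combination visible at dispatch time instead of at kernel launch.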

- intracard_fwd_h now raises NotSplittableError when the shape cannot be meaningfully split, instead of silently falling back.
- Drop the allow_fallback / skip_precheck flags, the two duplicated _no_cp fallback blocks, and the redundant pre-split heuristic recheck that the dispatch policy already performed.
- chunk_gated_delta_rule_fwd_h now owns the fallback: re-raise for forced CP, fall through to the serial body for auto.
- NotSplittableError subclasses ValueError for backward compatibility.

Behavior-preserving: force -> raise and auto -> serial fallback are unchanged.


gemini-code-assist

Code Review

This pull request reorganizes the repository layout by migrating KDA backend kernels to an arch-first structure (e.g., grouping SM100 Blackwell kernels, decode, and experimental paths) and refactoring shared PTX/MLIR helpers. It also introduces a centralized context-parallel dispatch policy in policy.py for SM100. Regarding the changes, a performance improvement opportunity was identified in chunk_gated_delta_rule_fwd_h where cu_seqlens_cpu can be materialized twice when it is not provided, leading to redundant device-to-host synchronizations. Materializing it once at the beginning of the function would resolve this.
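The reviewer's suggestion (materialize `cu_seqlens_cpu` once at the top of the function) can be illustrated without a GPU. This is a sketch under assumptions: `FakeDeviceTensor` is a hypothetical stand-in that counts device-to-host copies, and `fwd_h_sketch` with its two consumers is not the body of `chunk_gated_delta_rule_fwd_h`.

```python
class FakeDeviceTensor:
    """Hypothetical stand-in for a device tensor; counts host copies."""

    def __init__(self, data):
        self._data = list(data)
        self.sync_count = 0

    def cpu(self):
        # In a real framework this would force a device-to-host sync.
        self.sync_count += 1
        return list(self._data)


def fwd_h_sketch(cu_seqlens, cu_seqlens_cpu=None):
    """Materialize the host copy once, then share it with every consumer."""
    if cu_seqlens_cpu is None:
        cu_seqlens_cpu = cu_seqlens.cpu()  # the only sync in this function

    # Consumer 1: a policy-style check on the longest sequence.
    seq_lens = [b - a for a, b in zip(cu_seqlens_cpu, cu_seqlens_cpu[1:])]
    longest = max(seq_lens)

    # Consumer 2: per-sequence bookkeeping reuses the same host copy.
    num_seqs = len(cu_seqlens_cpu) - 1
    return longest, num_seqs
```

Passing a precomputed `cu_seqlens_cpu` skips the sync entirely, which is why the argument is worth threading through callers that already hold a host copy.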



cheheng.ch added 7 commits June 26, 2026 14:28

refactor(kda): remove dead LinearAttentionChunkwiseDecay re-export an…

d118159

…d stale REPO_LAYOUT sections

doc: clean up REPO_LAYOUT.md — remove stale sections and fix descript…

4973f2d

…ions

refactor(kda): annotate use_cp compat shim and clean up kda package d…

c3c49bf

…ocstring

refactor(kda): trim verbose comments in policy.py, delta_h.py, and te…

743de85

…st_intracard_cp.py

cherhh requested review from icavan and zheyang0825 June 26, 2026 06:30

gemini-code-assist Bot reviewed Jun 26, 2026


Comment thread cula/ops/kda/sm100/delta_h.py

style: apply ruff-format

ed4624b

- cula/__init__.py: drop trailing blank line
- cula/ops/kda/__init__.py: dedent module docstring, add final newline
- cula/ops/kda/sm100/delta_h.py: drop extra blank lines

cherhh wants to merge 8 commits into inclusionAI:main from cherhh:refactor
