[feat] bump to 1.3.0: torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2 (+ cu130 TRT) #551

Merged
tiankongdeguiji merged 25 commits into master from bump_torch_2.12.1 on Jul 2, 2026

@tiankongdeguiji tiankongdeguiji commented Jun 24, 2026

TorchEasyRec 1.3.0 — torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2

Upgrades the dependency stack (including the numpy-2 migration) and cuts the 1.3.0 release, adding a CUDA-13 + TensorRT image.

Dependency stack

| Component | Old | New |
| --- | --- | --- |
| tzrec | 1.2.21 | 1.3.0 |
| torch | 2.11.0 | 2.12.1 |
| triton | 3.6.0 | 3.7.1 (ptxas-swapped, see below) |
| fbgemm_gpu | 1.6.0 | 1.7.0 |
| torchrec | 1.6.0 | 1.7.0 |
| torch_tensorrt | 2.11.0 | 2.12.1 (cu130 only) |
| numpy | <2 | unconstrained (numpy 2) |
| dynamicemb | …e0c1fbb | …5dc46a2.cu130 |
| fbgemm_gpu_hstu | …+cu12.9 | …9fd44403 |
| CUDA images | cu126 / cu129 / cpu | + cu130 (CUDA 13 + TRT) |
| Docker image tag | 1.2 | 1.3 |

numpy 2 migration

Drops the numpy<2 cap. graphlearn 1.3.7 → 1.3.8 (numpy-2 sampler), pandas unconstrained, faiss 1.14.3 (cu12/cu13). Source fix: np.string_ → np.bytes_ (np.string_ was removed in numpy 2). All device images resolve and import cleanly on numpy 2 with no scipy / scikit-learn source builds.
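To illustrate the source fix, here is a minimal sketch of the numpy-2-safe spelling. It is not the code in tzrec/datasets/sampler.py: the helper name `decode_string_attrs` and the sample values are hypothetical. The sketch relies only on `np.bytes_` and `np.char.decode`, which exist in both numpy 1.x and 2.x.

```python
import numpy as np


def decode_string_attrs(values):
    """Decode byte-string ('S' dtype) attributes to str; pass others through.

    Hypothetical helper: np.bytes_ replaces the np.string_ alias that
    numpy 2.0 removed, and names the same 'S' dtype.
    """
    arr = np.asarray(values)
    if np.issubdtype(arr.dtype, np.bytes_):
        return np.char.decode(arr, "utf-8")
    return arr


# Hypothetical sample data standing in for string node attributes.
raw = np.array([b"item_1", b"item_22"], dtype=np.bytes_)
decoded = decode_string_attrs(raw)
```

Because `np.bytes_` is the same 'S' dtype the old alias referred to, code written this way should behave identically on either numpy major version.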

triton 3.7.1 — bundled ptxas WGMMA fix (H20 workaround removed)

Stock triton 3.7.1 bundles ptxas 12.8.93, which mis-compiles the HSTU WGMMA kernel on sm_90 (Hopper/H20) into a shared-memory OOB — an illegal memory access in _hstu_attn_bwd during autotuning. This was introduced when release/3.7 downgraded ptxas 12.9.86 → 12.8.93 (commit 6c96454f2f); it is not the triton#9514 sync bug (verified: #9514 alone does not fix it). The 1.3.0 image ships a repackaged triton 3.7.1 wheel with ptxas swapped back to 12.9.86 — only the ptxas binary is replaced, preserving the manylinux wheel torch's inductor is coupled to — so it is default-on, needs no DISABLE_MMA_V3, and keeps Hopper v3 perf.

The DISABLE_MMA_V3 / _force_mma_v2 test workaround is removed: the HSTU tests now run the shipped v3/WGMMA path (validated on a real H20 — the four previously-gated tests pass with 0 failures / 0 errors), so unittest_h20_ci now guards the ptxas fix against regression. FAQ Q15 rewritten accordingly. The DISABLE_MMA_V3 runtime knob is retained as an escape hatch.
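A regression guard for the ptxas swap could also check the bundled binary directly. The sketch below is an assumption-laden illustration, not part of this PR: the function names are hypothetical, the path to the bundled ptxas is left as a parameter because it depends on the wheel layout, and the parser assumes `ptxas --version` prints a `V<major>.<minor>.<patch>` token.

```python
import re
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# The ptxas version this PR reports as mis-compiling the HSTU WGMMA kernel.
KNOWN_BAD_PTXAS: Version = (12, 8, 93)


def parse_ptxas_version(text: str) -> Optional[Version]:
    """Extract (major, minor, patch) from `ptxas --version` style output."""
    match = re.search(r"\bV(\d+)\.(\d+)\.(\d+)\b", text)
    if match is None:
        return None
    major, minor, patch = (int(group) for group in match.groups())
    return (major, minor, patch)


def check_ptxas(ptxas_path: str) -> Version:
    """Run the given ptxas binary and reject the known-bad version."""
    result = subprocess.run(
        [ptxas_path, "--version"], capture_output=True, text=True, check=True
    )
    version = parse_ptxas_version(result.stdout)
    if version is None:
        raise RuntimeError(f"could not parse a version from {ptxas_path}")
    if version == KNOWN_BAD_PTXAS:
        raise RuntimeError(f"{ptxas_path} is the known-bad ptxas {version}")
    return version
```

A check like this would fail fast if a later triton reinstall brought back the stock ptxas, instead of surfacing as an illegal memory access during autotuning.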

CUDA 13 + TensorRT image

New cu130 image (also the default tzrec-devel:1.3 / :latest): the CUDA 13.0 toolchain on the cu129 software stack, plus torch_tensorrt 2.12.1 / TensorRT 10.16 for TRT export and inference. cu126/cu129 stay TRT-free (torch_tensorrt 2.12 has no CUDA-12 build) and degrade gracefully (has_tensorrt=False, as the cpu image already does); AOTI export/predict is unaffected.
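The graceful-degradation pattern described above can be sketched as follows. This is a hedged illustration only: `has_tensorrt` is the flag named in the PR, but how tzrec actually computes it is not shown here, and `module_available` / `pick_export_backend` are hypothetical names.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# False on images that ship without torch_tensorrt (e.g. cpu, cu126, cu129).
has_tensorrt = module_available("torch_tensorrt")


def pick_export_backend(prefer_trt: bool) -> str:
    """Fall back to AOTI whenever TensorRT is requested but unavailable."""
    if prefer_trt and has_tensorrt:
        return "tensorrt"
    return "aoti"
```

With this shape, the same export code path runs on every image and only the TRT branch is skipped where the package is absent.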

Required source changes

  • tzrec/optim/optimizer.py — re-sync the apply_split_helper monkeypatch to fbgemm 1.7.0. It gained make_persistent / preallocated_host_buffer params and its internal TBE caller now passes them, so the old override raised TypeError on every fused-embedding table construction. The tzrec momentum1 (TF-Adagrad) init customization is preserved.
  • tzrec/acc/aot_utils.py — keep the pytorch#178147 inductor int-array-dedup backport (still absent in torch 2.12.1; lands in 2.13.0+).
  • tzrec/utils/dynamicemb_util.py — restore the None→[] coercion torchrec 1.7.0 dropped for EmbeddingCollectionContext list-fields (dynamicemb still forwards None, which otherwise breaks the sharded EC forward).
  • torchrec 1.6→1.7 and torch 2.11→2.12.1 private APIs otherwise need no source change (the only torchrec removal, TrainPipelineContext.output_dist_embeddings_requests, is unused by tzrec).
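The failure mode behind the apply_split_helper re-sync is generic to monkeypatches: when upstream grows new parameters and starts passing them, an override with the old signature raises TypeError. Below is a minimal sketch of a guard that detects such drift at patch time. All function names are hypothetical stand-ins; only the two parameter names come from this PR, and the real fbgemm signature has more parameters.

```python
import inspect
from typing import Callable, List


def upstream_helper(name, make_persistent=False, preallocated_host_buffer=None):
    """Hypothetical stand-in for the newer upstream function."""
    return name, make_persistent, preallocated_host_buffer


def stale_override(name):
    """Hypothetical override written against the older signature."""
    return name


def missing_params(override: Callable, upstream: Callable) -> List[str]:
    """Names that upstream accepts but the override does not."""
    upstream_params = inspect.signature(upstream).parameters
    override_params = inspect.signature(override).parameters
    # An override taking **kwargs can absorb any new keyword argument.
    if any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in override_params.values()
    ):
        return []
    return [n for n in upstream_params if n not in override_params]


def install_override(override: Callable, upstream: Callable) -> Callable:
    """Refuse to install an override that lags behind upstream."""
    missing = missing_params(override, upstream)
    if missing:
        raise RuntimeError(f"override is missing upstream params: {missing}")
    return override
```

Running such a check when the patch is applied turns a TypeError deep inside table construction into an immediate, descriptive error on the next dependency bump.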

Build / infra

  • Dockerfile: PIP_MIRROR default → mirrors.aliyun.com; cu130 nvidia-* / cuda-toolkit metadata handling; dropped the executorch metadata strip (unnecessary once numpy is unconstrained).
  • RDMA userspace-driver installer moved to the hpn-driver mirror (the previous bucket now returns 403).
  • Docs: SID model docs, dynamicemb / hstu versions, FAQ Q15 (ptxas story), install commands, image names.
  • Pre-commit hooks bumped to latest stable (mdformat kept at 0.7.22).

Images

All device images (cpu / cu126 / cu129 / cu130) are built, CI-validated on the numpy-2 stack, and promoted to tzrec-devel:1.3* (:1.3 / :latest = cu130), digest-verified. The workflows validate against the promoted release images.

🤖 Generated with Claude Code

tiankongdeguiji and others added 24 commits June 24, 2026 12:36
Upgrade the dependency stack to PyTorch 2.12.1, torchrec 1.7.0,
fbgemm_gpu 1.7.0, triton 3.7.1, and refreshed dynamicemb / hstu ops;
bump the project version to 1.3.0 and the Docker image tag to 1.3.

Required source change: re-sync the apply_split_helper monkeypatch in
tzrec/optim/optimizer.py to fbgemm 1.7.0's signature/body. fbgemm 1.7.0
added make_persistent and preallocated_host_buffer params and its
internal TBE caller now passes them, so the previous override would
raise TypeError on every fused-embedding table construction. The tzrec
momentum1 (TF-Adagrad) state-init customization is preserved.

Keep the pytorch#178147 inductor codegen backport in
tzrec/acc/aot_utils.py: the fix is still absent from torch 2.12.1
(it lands in 2.13.0+); only the stale comment is refreshed.

TensorRT: drop torch-tensorrt from the cu129/cu126 images. torch_tensorrt
2.12 has no CUDA-12 build (it moved to CUDA 13), so it is incompatible
with the cu129 stack. The framework degrades gracefully (has_tensorrt is
False, as in the cpu image) and AOTI export/predict is unaffected. cu130
plus TRT support is planned to follow.

triton 3.7.1: keep the DISABLE_MMA_V3 Hopper/H20 workaround; the RS-WGMMA
synchronization fix (triton#9514) is not in the 3.7.x release line.

Dockerfile: set PIP_MIRROR default to mirrors.aliyun.com and drop the
mirrors.cloud.aliyuncs.com rewrite. Update docs and pre-commit hooks
(all to latest stable except mdformat, which is unchanged).

Workflow image refs temporarily point at tzrec-test:1.3 for CI
validation; reverted to tzrec-devel after image promotion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…a hash mismatch

torch 2.12.1 pulls the nvidia cuda-toolkit meta-package, which makes pip
hash-check the nvidia sub-wheels. The aliyun PyPI mirror serves a repackaged
nvidia_cuda_nvrtc_cu12 wheel whose content hash differs from what the
pytorch-wheels find-links page advertises, so `pip install torch -f <find-links>`
fails with "PACKAGES DO NOT MATCH THE HASHES". Install torch from find-links with
--no-deps, then resolve its deps via a plain `pip install torch==2.12.1` (no
find-links, so no conflicting hashes) for the cu126/cu129 images.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ata contract

torchrec 1.7.0 (#3852) reworked the planner estimator contract to pass a
picklable SharderData snapshot instead of live ModuleSharder objects:
EmbeddingEnumerator.populate_estimates now calls
estimate(sharding_options, sharder_data_map=...), ShardEstimator.estimate
takes sharder_data_map: SharderDataMap, calculate_shard_storages takes
sharder_data: SharderData, and ShardPerfContext.build_shard_perf_contexts
takes sharder_data. tzrec's copied/overridden estimators were left on the
1.6 sharder_map/sharder contract, so planning crashed with
"estimate() got an unexpected keyword argument 'sharder_data_map'" and
"build_shard_perf_contexts() missing 1 required positional argument: 'sharder'".

Port tzrec's EmbeddingStorageEstimator.estimate, the calculate_shard_storages
wrapper (plan_util.py) and the dynamicemb perf-context / storage helpers
(dynamicemb_util.py) to the SharderData API, reading fused_params off
SharderData and forwarding sharder_data, preserving the dynamicemb x_eff
cache_params injection and storage estimation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…e tag

Add a cu130 Docker image variant on the CUDA 13.0 toolchain that restores
TensorRT support (torch_tensorrt 2.12.0+cu130 + tensorrt_cu13 10.16.1.11),
which is unavailable on the cu129 stack because torch_tensorrt 2.12 dropped
CUDA 12. The cu130 image carries the full torch 2.12.1 / fbgemm 1.7.0 /
torchrec 1.7.0 stack; dynamicemb / hstu remain cu129-only.

Upgrade faiss to 1.14.3 for all GPU images: faiss+cu12 for cu126/cu129,
faiss+cu13 for cu130 (replacing faiss_gpu_cu12 1.11.0).

Bump the tzrec-test image tag suffix to -u1 (all of cpu/cu126/cu129/cu130)
since the faiss change rebuilds every image. promote_docker.sh strips the
suffix so tzrec-devel keeps the clean 1.3 tags. CI workflow image refs point
at tzrec-test:1.3-u1 / :1.3-cpu-u1 for validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…elds to []

torchrec 1.7.0 (commit 32f40e01) removed EmbeddingCollectionContext.__post_init__,
which used to coerce None list-fields to []. The dynamicemb package's
DynamicEmbeddingCollectionContext.__init__ still forwards input_features=None (and
sharding_contexts / reverse_indices / seq_vbe_ctx) to torchrec's super().__init__;
under 1.7.0 they stay None and crash EmbeddingCollectionContext.record_stream and
compute_and_output_dist with "TypeError: 'NoneType' object is not iterable" on
every sharded EmbeddingCollection forward.

Because tzrec shards all EmbeddingCollections with DynamicEmbeddingCollectionSharder
when dynamicemb is installed (plan_util.get_default_sharders), this broke every
HSTU/sequence train_eval on H20 (6 integration tests), even for plain num_buckets
features. Monkeypatch the dynamicemb context __init__ to restore the None -> []
coercion. Confirmed on H20: the failing context is DynamicEmbeddingCollectionContext
with input_features=None.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…esolution

The cu130 image failed to build at `pip install -r requirements-cu130.txt`: pip
backtracked scikit-learn/scipy into source builds that fail to compile with Cython 3
("Exception values are incompatible", "Declare ... as noexcept"). Root cause is a
numpy floor conflict unique to the CUDA 13 stack:

- torch 2.12.1+cu130 pulls `cuda-bindings`/`cuda-pathfinder` (new in the cu13
  packaging) which my dist-info strip regex (cuda-toolkit|nvidia-) missed, leaving
  torch's cu13 deps in the resolution graph.
- torch-tensorrt 2.12.0 requires `executorch>=1.2.0`, and executorch requires
  `numpy>=2.0.0`, which is unsatisfiable against tzrec's `numpy<2` -> pip backtracks
  scipy/scikit-learn to ancient sdists.

Fix, matching the cu126/cu129 "use system cuda" approach:
- uninstall the cuda-toolkit-provided cu13 nvidia wheels + cuda-toolkit/cuda-bindings/
  cuda-pathfinder (system cuda-toolkit-13-0 provides the libs; torch imports
  cuda.bindings via try/except with a None fallback, so removal is safe),
- extend the torch dist-info strip to (cuda-toolkit|cuda-bindings|cuda-pathfinder|nvidia-),
- strip `executorch` from torch_tensorrt's dist-info (tzrec only uses the dynamo TRT
  path, not the executorch backend).

Verified in-image: torch 2.12.1+cu130, numpy 1.26.4, scipy 1.17.1, scikit-learn 1.9.0
all import; the cu130 build + `-r` install now succeed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Revert the temporary tzrec-test:1.3-u1 / tzrec-test:1.3-cpu-u1 validation image
refs to the promoted tzrec-devel:1.3 / tzrec-devel:1.3-cpu. Net change vs master is
the 1.2 -> 1.3 image tag bump; ppu workflow stays on tzrec-devel:1.1-ppu.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Refresh the hstu op wheel (same commit 9fd44403, newer 20260626 build) in
requirements/extra.txt (cp310/311/312) and the dlrm_hstu doc install command.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
# Conflicts:
#	.github/workflows/unittest_h20_ci.yml
#	tzrec/version.py
Repoint the default GPU tags to the CUDA 13.0 image: registry tags
tzrec-test:1.3-u1, tzrec-devel:1.3 and tzrec-devel:latest now resolve to the cu130
build (was cu129), and build_docker.sh / promote_docker.sh derive the default tag
from -cu130 going forward. The cu129 build stays available as the -cu129 variant.

The GPU CI lanes are pinned to tzrec-devel:1.3-cu129 because the cu130 image cannot
run the current CI harness as-is: CUDA 13.0 needs forward-compat on the runners'
535/550 drivers (LD_LIBRARY_PATH=/usr/local/cuda-13.0/compat), and ci_test.sh
installs the cu129 faiss/dynamicemb/hstu wheels which are ABI-incompatible with
CUDA 13. CI therefore continues to validate on the proven cu129 image; cpu and ppu
lanes are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ver runners)

Now that cu130 is the default 1.3 image, point all GPU CI lanes back to
tzrec-devel:1.3 (cu130) instead of the cu129 variant, and make the harness cu130-ready:

- unittest_ci / unittest_nightly / benchmark run on tzrec-runner / tzrec-bench-runner
  (R535 driver, below CUDA 13's R580); add LD_LIBRARY_PATH=/usr/local/cuda-13/compat so
  CUDA 13 runs via forward-compat (verified: torch.cuda + fused TBE work on the A10).
- unittest_h20 runs on tzrec-h20-runner (R580, native CUDA 13) -> no compat path needed.
- CUDA_HOME /usr/local/cuda-12 -> /usr/local/cuda-13 (cu130 toolkit path) for AOTI.
- requirements-gpu.txt -> requirements/cu130.txt (faiss-cu13 instead of cu12).
- requirements/extra.txt -> cu130 dynamicemb/hstu wheels.

cpu and ppu lanes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
- torch_tensorrt 2.12.0 -> 2.12.1; install without --no-deps so it pulls
  tensorrt-cu13 itself (drop the manual packaging/typing-extensions/dllist/psutil/
  tensorrt_cu13 line). Keep stripping only `executorch` from torch_tensorrt's
  dist-info -- it still Requires-Dist executorch>=1.2.0 (numpy>=2), and leaving it
  in makes `pip install -r requirements-cu130.txt` (numpy<2) backtrack scipy/
  scikit-learn into failing source builds. executorch (~16MB) is installed but
  unused by tzrec.
- Stop uninstalling/stripping cuda-toolkit / cuda-bindings / cuda-pathfinder for
  cu130: the executorch strip alone keeps the numpy<2 resolution clean, so those
  torch cu13 deps can stay (torch imports cuda.bindings via try/except).
- Merge the byte-identical "cu126") and "cu129") branches into "cu126"|"cu129")
  using ${DEVICE}.

Verified: cu130 image builds with no scipy/scikit-learn source build; numpy 1.26.4,
scipy 1.17.1, scikit-learn 1.7.1, faiss 1.14.3 and torch_tensorrt 2.12.1+cu130 all
import (the latter on GPU via cuda-13 forward-compat).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…mp dynamicemb

triton (Q15): triton 3.7.1 bundles ptxas 12.8.93, which mis-compiles the sm_90 HSTU
WGMMA kernel into a shared-memory OOB (regressed onto release/3.7 by triton commit
6c96454f2f, which downgraded ptxas 12.9.86 -> 12.8.93; bisected and confirmed on H20).
Force-reinstall the official triton 3.7.1 manylinux wheel repackaged to bundle ptxas
12.9.86 -- only the ptxas binary swapped, RECORD regenerated -- for cu126/cu129/cu130.
Verified: clean install, inductor set_driver_to_gpu OK, compute-sanitizer ERROR
SUMMARY 0. No DISABLE_MMA_V3 fallback, no triton rebuild (avoids the torch<->triton
runtime mismatch that breaks AOT export).

cu126|cu129: add nvidia-cufile-cu12 to the uninstall list -- torch 2.12.1 installs it,
but the list inherited from the 1.2.x Dockerfile omitted it while cu130 already strips
nvidia-cufile.

cu130: re-add the nvidia-* uninstall + torch-metadata strip (cuda-toolkit/cuda-bindings/
cuda-pathfinder kept installed).

dynamicemb: 0.1.0+20260624.2550a9c -> 0.1.0+20260630.5dc46a2 (extra.txt cp310/311/312
and the feature doc).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…post1)

graphlearn 1.3.7 and faiss-gpu-cu12 1.11.0 were the deps pinning numpy<2.

- graphlearn 1.3.7 -> 1.3.8 (declares no numpy bound).
- faiss-gpu-cu12 1.11.0 -> 1.14.1.post1: 1.11.0 requires numpy<2; faiss became
  numpy-2 ready at 1.13.2 (numpy>=2,<3). Source moves from the OSS-mirrored
  wheel to PyPI (matching how faiss-cpu is already sourced); its
  nvidia-cuda-runtime/cublas constraints are unchanged (>=12.1), so the CUDA
  stack is unaffected.

Require numpy>2 first to exercise the numpy 2.x path through CI before relaxing
the pin entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NPSTfsBJfgwD1Y6afhyVzg
…faiss via [gpu]

- tzrec/datasets/sampler.py: numpy 2.0 removed np.string_; use np.bytes_ (same
  'S' dtype, np.char.decode unchanged) when parsing graphlearn string node
  attrs. Fixes AttributeError crashing the negative/hard-negative/TDM sampler
  DataLoader workers under numpy 2.

- .github/workflows/buildtest_ci.yml: install the built wheel's [gpu] extra so
  faiss-gpu-cu12 upgrades to the numpy-2 build. Plain `pip install <wheel>` only
  pulls install_requires (numpy>2, no faiss), bumping numpy to 2.x while leaving
  the image's numpy-1.x faiss 1.11.0 -> `import tzrec` crashed with
  "numpy.core.multiarray failed to import".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NPSTfsBJfgwD1Y6afhyVzg
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
- graphlearn 1.3.7 -> 1.3.8 (numpy-2-compatible sampler; new OSS path)
- numpy: drop the <2 cap -> unconstrained (graphlearn 1.3.8 supports numpy 2); pandas unconstrained
- keep the 1.3.0 stack's faiss 1.14.3 (cu12/cu13) wheels, version 1.3.0, and the original buildtest_ci.yml
- brings SID model docs (rqvae/rqkmeans) + sampler updates

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
numpy is now unconstrained (graphlearn 1.3.8 supports numpy 2), so executorch's
numpy>=2 requirement resolves cleanly and no longer forces scipy/scikit-learn into
failing source builds -- the strip is unnecessary.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
The pythonrun.oss-cn-zhangjiakou bucket now returns 403 for the Mellanox RDMA
installer, breaking every GPU image build at the last step. Point Step 26 at the
hpn-driver mirror (ubuntu22.04-rdma-core-23.10.tar): plain .tar (tar xf), deb-based
install.sh (apt-get install -f) so add apt-get update + clean around it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Temporary: point the workflow images at the tzrec-test -u3 build (numpy 2 / graphlearn
1.3.8 / ptxas-swapped triton / RDMA mirror) to validate before promoting to tzrec-devel.
ppu lane untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ROGRESS.md

- faq.md Q15: rewrite to the real cause -- triton 3.7.1's bundled ptxas 12.8.93
  mis-compiles the HSTU WGMMA kernel on sm_90 (ptxas 12.9.86 fixes it) -- and the
  shipped remedy (image bundles the ptxas-swapped 3.7.1 wheel; no DISABLE_MMA_V3,
  no Hopper v3 perf loss). The old MMA-v3 "still affected" text was wrong.
- local_tutorial.md: drop the stale "dynamicemb/hstu cu129-only, not in cu130" note.
- aot_utils.py / optimizer.py / dynamicemb_util.py: condense the backport/patch/coercion comments.
- stop tracking PROGRESS.md (worktree scratch, accidentally staged during the merge).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
The sm_90 WGMMA shmem-OOB that _force_mma_v2 dodged is fixed by the ptxas-12.9.86
swap in the 1.3.0 triton wheel. Verified on H20: all four affected tests
(test_attn_triton_long_seqs, test_cache, test_attn_cutlass, test_sla_attn_cutlass)
pass on the real v3/WGMMA path with 0 failures / 0 errors. Removing the workaround so
unittest_h20_ci exercises the shipped v3 path and guards the fix against regression
(e.g. a future triton bump re-downgrading ptxas, or the ptxas-swapped wheel missing).

- hstu_attention_test.py: delete _force_mma_v2 + _DISABLE_V3_CACHE_SUFFIX + now-unused
  imports; unwrap the 4 with-blocks (dedent bodies).
- rank_integration_test.py: drop DISABLE_MMA_V3=1 from the hstu train/eval/export env.
- The DISABLE_MMA_V3 runtime knob itself is kept as an escape hatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ripts

The 1.3.0 test images that were built, CI-validated, and promoted to tzrec-devel
are the -u3 iteration; point build_docker.sh / promote_docker.sh at -u3 so a
re-run targets the same tags.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Reverts the temporary tzrec-test:1.3-u3 / :1.3-cpu-u3 validation refs now that the
-u3 images are promoted to tzrec-devel:1.3 / :1.3-cpu (identical digests). Final CI
runs against the promoted release images. ppu lane untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
@tiankongdeguiji tiankongdeguiji changed the title from "[feat] bump to 1.3.0 with torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0" to "[feat] bump to 1.3.0: torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2 (+ cu130 TRT)" Jul 1, 2026
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label Jul 2, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jul 2, 2026
@WhiteSwan1 WhiteSwan1 self-requested a review July 2, 2026 01:45
WhiteSwan1 previously approved these changes Jul 2, 2026
Comment thread tzrec/optim/optimizer.py
Comment on lines +147 to +168
if preallocated_host_buffer is not None:
    assert preallocated_host_buffer.numel() == split.host_size, (
        f"preallocated_host_buffer size mismatch for '{prefix}_host': "
        f"expected {split.host_size}, got {preallocated_host_buffer.numel()}"
    )
    assert preallocated_host_buffer.is_contiguous(), (
        f"preallocated_host_buffer for '{prefix}_host' must be contiguous"
    )
    assert preallocated_host_buffer.dim() == 1, (
        f"preallocated_host_buffer for '{prefix}_host' must be 1D, got "
        f"{preallocated_host_buffer.dim()}D with shape "
        f"{preallocated_host_buffer.shape}"
    )
    assert preallocated_host_buffer.dtype == dtype, (
        f"preallocated_host_buffer dtype mismatch for '{prefix}_host': "
        f"expected {dtype}, got {preallocated_host_buffer.dtype}"
    )
    assert preallocated_host_buffer.device == current_device, (
        f"preallocated_host_buffer device mismatch for '{prefix}_host': "
        f"expected {current_device}, got {preallocated_host_buffer.device}"
    )
    host_buffer = preallocated_host_buffer

When fbgemm passes a preallocated_host_buffer, it is adopted as-is — the use_init_value / torch.full(init_value) fill (the whole purpose of this monkeypatch, for TF-Adagrad initial_accumulator_value) is silently skipped for host-placed shards. If this path is ever hit for momentum1, the accumulator starts at the pool's contents (typically zeros) instead of init_value, so Adagrad behavior would differ between HBM-placed and host-placed tables.

Consider adding after line 168:

if use_init_value:
    host_buffer.fill_(init_value)

or assert not (use_init_value and preallocated_host_buffer is not None) if the combination is believed unreachable. Note this branch (and the make_persistent one) is currently not exercised by any test — a small unit test calling apply_split_helper directly with a preallocated 1-D CPU buffer would cover both.
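The control flow the reviewer proposes can be sketched with a toy stand-in. This is not fbgemm's or tzrec's code: it uses `array("d")` from the standard library instead of torch tensors, and `resolve_host_buffer` and its parameters are hypothetical names chosen to mirror the variables in the hunk above.

```python
from array import array
from typing import Optional


def resolve_host_buffer(
    host_size: int,
    use_init_value: bool,
    init_value: float,
    preallocated: Optional[array] = None,
) -> array:
    """Toy model of the host-buffer branch, with the init fill applied."""
    if preallocated is not None:
        if len(preallocated) != host_size:
            raise ValueError(
                f"expected {host_size} elements, got {len(preallocated)}"
            )
        host_buffer = preallocated
        if use_init_value:
            # Proposed fix: overwrite whatever the pool contained, so
            # host-placed and freshly allocated buffers start out the same.
            for i in range(host_size):
                host_buffer[i] = init_value
        return host_buffer
    fill = init_value if use_init_value else 0.0
    return array("d", [fill] * host_size)


# A preallocated pool that starts out as zeros, as in the review comment.
pool = array("d", [0.0] * 4)
filled = resolve_host_buffer(4, True, 0.1, preallocated=pool)
```

The unit test the reviewer suggests would follow the same shape against the real helper: pass a preallocated 1-D CPU buffer and assert the result holds `init_value`.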

Comment on lines +153 to +177
from dynamicemb.shard.embedding import DynamicEmbeddingCollectionContext

# torchrec 1.7.0 no longer coerces EmbeddingCollectionContext's None list-fields
# to []; dynamicemb still passes None (breaks sharded EC forward), so re-coerce.
_orig_demb_ctx_init = DynamicEmbeddingCollectionContext.__init__

def _demb_ctx_init_coerce_none(
    self,
    sharding_contexts=None,  # pyre-ignore [2]
    input_features=None,  # pyre-ignore [2]
    reverse_indices=None,  # pyre-ignore [2]
    seq_vbe_ctx=None,  # pyre-ignore [2]
    frequency_counters=None,  # pyre-ignore [2]
):  # pyre-ignore [3]
    _orig_demb_ctx_init(
        self,
        sharding_contexts if sharding_contexts is not None else [],
        input_features if input_features is not None else [],
        reverse_indices if reverse_indices is not None else [],
        seq_vbe_ctx if seq_vbe_ctx is not None else [],
        frequency_counters,
    )

# pyre-ignore [9]
DynamicEmbeddingCollectionContext.__init__ = _demb_ctx_init_coerce_none

Two robustness notes on this new patch:

  1. It's the only monkeypatch in this file living inside the blanket try/except Exception: pass (all others are under if has_dynamicemb:). If dynamicemb.shard.embedding moves or the patch throws against a future wheel (the dynamicemb pin was bumped in this same PR), the exception is swallowed, has_dynamicemb stays False, every dynamicemb CI test silently skips, and users get the misleading "dynamicemb is not installed" RuntimeError. Consider logging the swallowed exception in the except branch so breakage fails loudly.

  2. The wrapper hardcodes the 5 current parameters and forwards them positionally; if a future dynamicemb adds a context field, construction will TypeError at runtime. Forwarding trailing *args, **kwargs would make it drift-tolerant.
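Both notes can be addressed together. The following is one possible shape, sketched against a stand-in class rather than the real dynamicemb context: `FakeContext` and `patch_init_coerce_none` are hypothetical, and only the four list-field names are taken from the patch above. The wrapper binds arguments against the original signature, so a newly added upstream field is forwarded untouched, and the except branch logs instead of passing silently.

```python
import functools
import inspect
import logging

logger = logging.getLogger(__name__)

LIST_FIELDS = ("sharding_contexts", "input_features", "reverse_indices", "seq_vbe_ctx")


class FakeContext:
    """Hypothetical stand-in for DynamicEmbeddingCollectionContext."""

    def __init__(self, sharding_contexts=None, input_features=None,
                 reverse_indices=None, seq_vbe_ctx=None, frequency_counters=None):
        self.sharding_contexts = sharding_contexts
        self.input_features = input_features
        self.reverse_indices = reverse_indices
        self.seq_vbe_ctx = seq_vbe_ctx
        self.frequency_counters = frequency_counters


def patch_init_coerce_none(cls, list_fields=LIST_FIELDS):
    """Wrap cls.__init__ so that None list-fields become []."""
    orig_init = cls.__init__
    sig = inspect.signature(orig_init)

    @functools.wraps(orig_init)
    def _init(self, *args, **kwargs):
        # Bind against the original signature: unknown future parameters
        # pass through unchanged instead of breaking the wrapper.
        bound = sig.bind(self, *args, **kwargs)
        bound.apply_defaults()
        for name in list_fields:
            if name in bound.arguments and bound.arguments[name] is None:
                bound.arguments[name] = []
        orig_init(*bound.args, **bound.kwargs)

    cls.__init__ = _init


try:
    patch_init_coerce_none(FakeContext)
    patch_applied = True
except Exception:
    # Log rather than swallow, so a broken patch is visible in CI output.
    logger.exception("failed to patch FakeContext.__init__")
    patch_applied = False
```

The trade-off is a small per-construction cost for `sig.bind`; if that matters on a hot path, a plain `*args, **kwargs` wrapper that only rewrites known keyword names is a cheaper alternative.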

Comment thread docker/Dockerfile
pip cache purge ;; \
esac && \
case ${DEVICE} in \
"cu126"|"cu129"|"cu130") pip install --force-reinstall --no-deps https://tzrec.oss-accelerate.aliyuncs.com/third_party/triton/triton-3.7.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl && \

The repackaged wheel (ptxas 12.9.86 swap) reuses the stock 3.7.1 version string, so pip freeze/SBOMs can't distinguish it from PyPI triton, and any later reinstall or dependency re-resolution silently reverts the ptxas fix with no version signal. The previous scheme used a distinguishing local version (3.6.0+565c08520).

Consider republishing as e.g. 3.7.1+ptxas12986 — PEP 440 local versions still satisfy the triton==3.7.1 pins in requirements/cu1*.txt — and optionally adding --hash=sha256:... here. Relatedly, those requirements files resolve stock PyPI 3.7.1, so pip-based (non-docker) installs on Hopper still hit the FAQ Q15 crash; pointing them at the repackaged wheel (as already done for faiss/dynamicemb) or adding a comment referencing FAQ Q15 would be more consistent.
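If the wheel were republished with a local version label as suggested, telling the two builds apart becomes a string check. The sketch below assumes that hypothetical label (`ptxas12986`, taken from the example in this comment); the function names are illustrative, and the shipped wheel currently carries no such label.

```python
from importlib import metadata
from typing import Optional, Tuple


def split_local_version(version: str) -> Tuple[str, Optional[str]]:
    """Split a PEP 440 version into (public part, local label or None)."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def is_repackaged(dist_name: str, expected_local: str) -> bool:
    """True if the installed distribution carries the expected local label."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return split_local_version(version)[1] == expected_local
```

A startup check such as `is_repackaged("triton", "ptxas12986")` could then warn when a reinstall has silently replaced the patched wheel with the stock one.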

Comment thread docker/Dockerfile
Comment on lines +115 to 119
wget https://hpn-driver.oss-cn-hangzhou.aliyuncs.com/nic-drivers/shuyao/drivers/ubuntu22.04-rdma-core-23.10.tar && \
tar xf ubuntu22.04-rdma-core-23.10.tar && \
cd ubuntu22.04-rdma-core-23.10 && \
apt-get update && \
echo Y | /bin/bash install.sh && \

This tarball comes from a bucket outside the project's own tzrec OSS bucket (the shuyao/ path segment looks like a personal directory) and its install.sh runs as root with no integrity check. Anyone who can overwrite that object gets code execution in every future image build, and the result ships to a public registry. Consider mirroring the tarball into tzrec.oss-accelerate.aliyuncs.com/third_party/ and/or verifying a pinned sha256sum before executing.
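A pinned-digest check before running the installer could look like the sketch below. It is an illustration of the suggestion, not code from this PR: the function names are hypothetical, and the expected digest would have to be recorded from a trusted copy of the tarball.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Union[str, Path], expected_hex: str) -> None:
    """Raise ValueError unless the file matches the pinned digest."""
    actual = sha256_of_file(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"sha256 mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

In a Dockerfile the same effect is usually achieved with `sha256sum -c` between the download and the `install.sh` step, so that a replaced object fails the build instead of executing.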

Comment thread docs/source/feature/dynamicemb.md Outdated
@@ -8,7 +8,7 @@ DynamicEmbedding is a way to map features to IDs with zero hash collisions; compared with setting `ha

```bash
# DEVICE options: cu126/cu129 (supports Python 3.10/3.11/3.12)

Stale device list: requirements/extra.txt now pins the 20260630.5dc46a2 build as cu130-only wheels, and the default tzrec-devel:1.3 image is the cu130 variant — but this comment still offers only cu126/cu129. Add cu130 (and confirm cu126/cu129 wheels of the new build actually exist on OSS before keeping them listed).

Comment thread docs/source/models/dlrm_hstu.md Outdated

- TRITON: Triton-based implementation, typically 2-3x faster than PYTORCH and using 2-3x less GPU memory
- CUTLASS: fused CUDA operator implementation based on CUTLASS; requires the fbgemm_gpu_hstu package (DEVICE options: cu126/cu129, with DEVICE_DOTTED being cu12.6/cu12.9 respectively: `pip install fbgemm_gpu_hstu==0.1.0+${DEVICE_DOTTED} -f https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/${DEVICE}/repo.html`); requires `attention_dim` to equal `hidden_dim`; supports Ampere/Ada/Hopper GPUs
- CUTLASS: fused CUDA operator implementation based on CUTLASS; requires the fbgemm_gpu_hstu package (DEVICE options: cu126/cu129: `pip install fbgemm_gpu_hstu==0.1.0+20260626.9fd44403.${DEVICE} -f https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/${DEVICE}/repo.html`); requires `attention_dim` to equal `hidden_dim`; supports Ampere/Ada/Hopper GPUs

Same stale device list as dynamicemb.md: the version string was updated to 20260626.9fd44403, but the device list ("DEVICE可选cu126/cu129", i.e. "DEVICE options: cu126/cu129") was kept while requirements/extra.txt (and CI) now use the cu130 wheel of this build. cu130 is missing from the list; verify cu126/cu129 wheels of this build exist.

Comment on lines 575 to 576
# Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the
# pure-CUTLASS NFUNC path (no Triton kernel runs).

Stale comment: the _force_mma_v2 workaround this refers to was removed in this PR (the sibling comment in test_attn_cutlass was rewritten; this one was missed).

Suggested change
# Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the
# pure-CUTLASS NFUNC path (no Triton kernel runs).

Comment thread scripts/build_docker.sh
-DOCKER_TAG=1.2
-DOCKER_TAG_SUFFIX=
+DOCKER_TAG=1.3
+DOCKER_TAG_SUFFIX=-u3

DOCKER_TAG_SUFFIX=-u3 (previously empty) looks like leftover state from image-build iteration — the next person running this script will unknowingly produce 1.3-<device>-u3 tags. Consider resetting to empty after promotion, or documenting the suffix convention.

Comment thread requirements/runtime.txt
graphlearn @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/graphlearn/graphlearn-1.3.8-cp310-cp310-linux_x86_64.whl ; python_version=="3.10"
grpcio-tools<1.63.0
-numpy<2
+numpy

Fully unconstrained numpy will silently pick up a future numpy 3.x. Given the codebase just completed the 1.x→2.x migration, an upper bound documents what's validated:

Suggested change
-numpy
+numpy<3
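
To make the effect of the suggested bound concrete, here is a minimal sketch of the check that a `numpy<3` specifier expresses, written with the standard library only. The helper name `satisfies_numpy_upper_bound` is hypothetical and not part of tzrec, numpy, or pip; real resolvers use full PEP 440 comparison, which this sketch does not attempt.

```python
import re


def satisfies_numpy_upper_bound(version: str, max_major_exclusive: int = 3) -> bool:
    """Return True if the major component of `version` is below the bound.

    Hypothetical helper for illustration: it only inspects the leading
    integer of the version string, mirroring what `numpy<3` admits for
    ordinary release versions.
    """
    match = re.match(r"(\d+)(?:\.|$)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)) < max_major_exclusive
```

With the bound in place, a 2.x release is accepted and a future 3.x release is rejected instead of being picked up silently.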

github-actions Bot commented Jul 2, 2026

Code review summary (multi-area review)

Overall this is a well-executed release bump. The monkeypatch re-syncs (apply_split_helper → fbgemm 1.7.0, SharderData migration in plan_util.py/dynamicemb_util.py) are internally consistent with no stragglers. The np.string_ → np.bytes_ numpy-2 migration is complete (no other removed aliases remain in the tree), the FAQ Q15 rewrite matches the shipped triton wheel and the workaround removal, and docs/version references were updated thoroughly. The torch #178147 backport stays correctly gated for 2.12.1, and trt_utils.py already degrades gracefully for the now-TRT-free cu126/cu129 stacks.

Noteworthy items are posted as inline comments (init-value skip on the preallocated_host_buffer path, silent-disable hazard around the new dynamicemb patch, triton wheel version shadowing, unverified RDMA tarball, stale cu126/cu129 device lists in docs, leftover -u3 tag suffix, unconstrained numpy, stale test comment). A few findings land on files this PR doesn't touch, so they go here:

  • setup.py:81 — the tzrec[gpu] extra still resolves requirements/cu129.txt, while this PR flips requirements-gpu.txt to requirements/cu130.txt. Before the PR both were cu129; now pip install tzrec[gpu] yields a different stack (cu12 faiss, no torch-tensorrt) than the repo's own GPU requirements. If keeping the PyPI extra CUDA-12-friendly is intentional, a comment or a separate cu130 extra would make it explicit.
  • tzrec/ops/utils.py:81 - clear_triton_caches is now dead code: its only caller (_force_mma_v2) was removed here, and its docstring still motivates itself via the removed DISABLE_MMA_V3 workflow. Remove it, or keep it deliberately as a general utility with an updated docstring.
  • Test-coverage suggestions (advisory): this PR is itself the failure mode the fbgemm monkeypatch risks — upstream grew two kwargs and the old override raised TypeError. A small signature-parity test (inspect.signature of upstream apply_split_helper vs the tzrec override) would turn the next such break into a unit-test failure instead of a runtime crash. Similarly, a one-line assertion that DynamicEmbeddingCollectionContext() yields [] (not None) for its list fields would catch both coercion regressions and signature drift.
  • Dockerfile minor: the plain pip install torch==2.12.1 after the --no-deps install re-resolves and downloads the multi-GB nvidia-* wheels only to uninstall them — running the Requires-Dist sed strip before the second install would avoid that. Also, mirrors.aliyun.com serves HTTPS; the generated pip.conf could use https:// and drop trusted-host now that the default mirror is public.
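
The signature-parity test suggested above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's test: `signature_drift` is a hypothetical helper, and the two stand-in functions only imitate the "upstream grew two kwargs" situation; their parameter names are invented and do not reflect the real fbgemm `apply_split_helper` or the tzrec override.

```python
import inspect


def signature_drift(upstream, override):
    """Compare parameter names of two callables.

    Returns (missing_in_override, extra_in_override) as sorted lists, so an
    empty pair means the override accepts the same parameter names.
    """
    upstream_params = set(inspect.signature(upstream).parameters)
    override_params = set(inspect.signature(override).parameters)
    return (
        sorted(upstream_params - override_params),
        sorted(override_params - upstream_params),
    )


# Hypothetical stand-in for the upstream function after it gained two kwargs.
def upstream_helper(state_fn, split, prefix, new_kwarg_a=None, new_kwarg_b=None):
    pass


# Hypothetical stand-in for an override that was written against the old API.
def stale_override(state_fn, split, prefix):
    pass


missing, extra = signature_drift(upstream_helper, stale_override)
```

In a real test the two stand-ins would be replaced by the imported upstream function and the tzrec override, with an assertion that both lists are empty; the comparison here covers parameter names only, not defaults or kinds.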

🤖 Generated with Claude Code

…cs to cu130

- optim/optimizer.py: apply_split_helper adopted fbgemm's preallocated host buffer
  as-is and skipped the init_value fill, so a host-placed momentum1 shard started at
  the pool contents (zeros) instead of FBGEMM_MOMENTUM1_STATE_INIT_VALUE. Fill it when
  use_init_value, matching the dev and non-preallocated host paths.
- setup.py: tzrec[gpu] extra -> requirements/cu130.txt (the 1.3 default image is cu130).
- ops/utils.py: clear_triton_caches has no caller after the _force_mma_v2 removal; keep
  it as a general Triton-cache utility and drop the DISABLE_MMA_V3-specific docstring.
- dynamicemb.md / dlrm_hstu.md: DEVICE list cu126/cu129 -> cu126/cu129/cu130 (the
  5dc46a2 / 9fd44403 wheels are confirmed on OSS for all three).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
@tiankongdeguiji tiankongdeguiji merged commit ee3ef09 into master Jul 2, 2026
9 checks passed
@tiankongdeguiji tiankongdeguiji deleted the bump_torch_2.12.1 branch July 2, 2026 09:54