[feat] bump to 1.3.0: torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2 (+ cu130 TRT) by tiankongdeguiji · Pull Request #551 · alibaba/TorchEasyRec

tiankongdeguiji · 2026-06-24T10:18:43Z

TorchEasyRec 1.3.0 — torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2

Upgrades the dependency stack (including the numpy-2 migration) and cuts the 1.3.0 release, adding a CUDA-13 + TensorRT image.

Dependency stack

Component	Old	New
tzrec	1.2.21	1.3.0
torch	2.11.0	2.12.1
triton	3.6.0	3.7.1 (ptxas-swapped, see below)
fbgemm_gpu	1.6.0	1.7.0
torchrec	1.6.0	1.7.0
torch_tensorrt	2.11.0	2.12.1 (cu130 only)
numpy	`<2`	unconstrained (numpy 2)
dynamicemb	…e0c1fbb	…5dc46a2.cu130
fbgemm_gpu_hstu	…+cu12.9	…9fd44403
CUDA images	cu126 / cu129 / cpu	+ cu130 (CUDA 13 + TRT)
Docker image tag	1.2	1.3

numpy 2 migration

Drops the numpy<2 cap. graphlearn 1.3.7 → 1.3.8 (numpy-2 sampler), pandas unconstrained, faiss 1.14.3 (cu12/cu13). Source fix: np.string_, which numpy 2 removed, is replaced with np.bytes_. All device images resolve and import cleanly on numpy 2 with no scipy / scikit-learn source builds.
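A minimal sketch of the alias swap, not the actual tzrec/datasets/sampler.py code: the attribute values below are made up for illustration. It relies only on np.bytes_ producing the same 'S' dtype and on np.char.decode, both of which should behave the same on numpy 1.x and 2.x.

```python
import numpy as np

# np.string_ was removed in numpy 2.0; np.bytes_ names the same scalar type
# and is also available on numpy 1.x, so it works on both sides of the bump.
raw_attrs = np.array([b"item_1", b"item_2"], dtype=np.bytes_)  # hypothetical node attrs

# Byte strings keep the 'S' dtype kind; np.char.decode turns them into str.
decoded_attrs = np.char.decode(raw_attrs, "utf-8")
print(raw_attrs.dtype.kind, decoded_attrs.tolist())
```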

triton 3.7.1 — bundled ptxas WGMMA fix (H20 workaround removed)

Stock triton 3.7.1 bundles ptxas 12.8.93, which mis-compiles the HSTU WGMMA kernel on sm_90 (Hopper/H20) into a shared-memory OOB — an illegal memory access in _hstu_attn_bwd during autotuning. This was introduced when release/3.7 downgraded ptxas 12.9.86 → 12.8.93 (commit 6c96454f2f); it is not the triton#9514 sync bug (verified: #9514 alone does not fix it). The 1.3.0 image ships a repackaged triton 3.7.1 wheel with ptxas swapped back to 12.9.86 — only the ptxas binary is replaced, preserving the manylinux wheel torch's inductor is coupled to — so it is default-on, needs no DISABLE_MMA_V3, and keeps Hopper v3 perf.

The DISABLE_MMA_V3 / _force_mma_v2 test workaround is removed: the HSTU tests now run the shipped v3/WGMMA path (validated on a real H20 — the four previously-gated tests pass with 0 failures / 0 errors), so unittest_h20_ci now guards the ptxas fix against regression. FAQ Q15 rewritten accordingly. The DISABLE_MMA_V3 runtime knob is retained as an escape hatch.

CUDA 13 + TensorRT image

New cu130 image (also the default tzrec-devel:1.3 / :latest): the CUDA 13.0 toolchain on the cu129 software stack, plus torch_tensorrt 2.12.1 / TensorRT 10.16 for TRT export and inference. cu126/cu129 stay TRT-free (torch_tensorrt 2.12 has no CUDA-12 build) and degrade gracefully (has_tensorrt=False, as the cpu image already does); AOTI export/predict is unaffected.

Required source changes

tzrec/optim/optimizer.py — re-sync the apply_split_helper monkeypatch to fbgemm 1.7.0. It gained make_persistent / preallocated_host_buffer params and its internal TBE caller now passes them, so the old override raised TypeError on every fused-embedding table construction. The tzrec momentum1 (TF-Adagrad) init customization is preserved.
tzrec/acc/aot_utils.py — keep the pytorch#178147 inductor int-array-dedup backport (still absent in torch 2.12.1; lands in 2.13.0+).
tzrec/utils/dynamicemb_util.py — restore the None→[] coercion torchrec 1.7.0 dropped for EmbeddingCollectionContext list-fields (dynamicemb still forwards None, which otherwise breaks the sharded EC forward).
torchrec 1.6→1.7 and torch 2.11→2.12.1 private APIs otherwise need no source change (the only torchrec removal, TrainPipelineContext.output_dist_embeddings_requests, is unused by tzrec).

Build / infra

Dockerfile: PIP_MIRROR default → mirrors.aliyun.com; cu130 nvidia-* / cuda-toolkit metadata handling; dropped the executorch metadata strip (unnecessary once numpy is unconstrained).
RDMA userspace-driver installer moved to the hpn-driver mirror (the previous bucket now returns 403).
Docs: SID model docs, dynamicemb / hstu versions, FAQ Q15 (ptxas story), install commands, image names.
Pre-commit hooks bumped to latest stable (mdformat kept at 0.7.22).

Images

All device images (cpu / cu126 / cu129 / cu130) are built, CI-validated on the numpy-2 stack, and promoted to tzrec-devel:1.3* (:1.3 / :latest = cu130), digest-verified. The workflows validate against the promoted release images.

🤖 Generated with Claude Code

Upgrade the dependency stack to PyTorch 2.12.1, torchrec 1.7.0, fbgemm_gpu 1.7.0, triton 3.7.1, and refreshed dynamicemb / hstu ops; bump the project version to 1.3.0 and the Docker image tag to 1.3.
Required source change: re-sync the apply_split_helper monkeypatch in tzrec/optim/optimizer.py to fbgemm 1.7.0's signature/body. fbgemm 1.7.0 added make_persistent and preallocated_host_buffer params and its internal TBE caller now passes them, so the previous override would raise TypeError on every fused-embedding table construction. The tzrec momentum1 (TF-Adagrad) state-init customization is preserved.
Keep the pytorch#178147 inductor codegen backport in tzrec/acc/aot_utils.py: the fix is still absent from torch 2.12.1 (it lands in 2.13.0+); only the stale comment is refreshed.
TensorRT: drop torch-tensorrt from the cu129/cu126 images. torch_tensorrt 2.12 has no CUDA-12 build (it moved to CUDA 13), so it is incompatible with the cu129 stack. The framework degrades gracefully (has_tensorrt is False, as in the cpu image) and AOTI export/predict is unaffected. cu130 plus TRT support is planned to follow.
triton 3.7.1: keep the DISABLE_MMA_V3 Hopper/H20 workaround; the RS-WGMMA synchronization fix (triton#9514) is not in the 3.7.x release line.
Dockerfile: set PIP_MIRROR default to mirrors.aliyun.com and drop the mirrors.cloud.aliyuncs.com rewrite.
Update docs and pre-commit hooks (all to latest stable except mdformat, which is unchanged).
Workflow image refs temporarily point at tzrec-test:1.3 for CI validation; reverted to tzrec-devel after image promotion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…a hash mismatch torch 2.12.1 pulls the nvidia cuda-toolkit meta-package, which makes pip hash-check the nvidia sub-wheels. The aliyun PyPI mirror serves a repackaged nvidia_cuda_nvrtc_cu12 wheel whose content hash differs from what the pytorch-wheels find-links page advertises, so `pip install torch -f <find-links>` fails with "PACKAGES DO NOT MATCH THE HASHES". Install torch from find-links with --no-deps, then resolve its deps via a plain `pip install torch==2.12.1` (no find-links, so no conflicting hashes) for the cu126/cu129 images. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…ata contract torchrec 1.7.0 (#3852) reworked the planner estimator contract to pass a picklable SharderData snapshot instead of live ModuleSharder objects: EmbeddingEnumerator.populate_estimates now calls estimate(sharding_options, sharder_data_map=...), ShardEstimator.estimate takes sharder_data_map: SharderDataMap, calculate_shard_storages takes sharder_data: SharderData, and ShardPerfContext.build_shard_perf_contexts takes sharder_data. tzrec's copied/overridden estimators were left on the 1.6 sharder_map/sharder contract, so planning crashed with "estimate() got an unexpected keyword argument 'sharder_data_map'" and "build_shard_perf_contexts() missing 1 required positional argument: 'sharder'". Port tzrec's EmbeddingStorageEstimator.estimate, the calculate_shard_storages wrapper (plan_util.py) and the dynamicemb perf-context / storage helpers (dynamicemb_util.py) to the SharderData API, reading fused_params off SharderData and forwarding sharder_data, preserving the dynamicemb x_eff cache_params injection and storage estimation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…e tag Add a cu130 Docker image variant on the CUDA 13.0 toolchain that restores TensorRT support (torch_tensorrt 2.12.0+cu130 + tensorrt_cu13 10.16.1.11), which is unavailable on the cu129 stack because torch_tensorrt 2.12 dropped CUDA 12. The cu130 image carries the full torch 2.12.1 / fbgemm 1.7.0 / torchrec 1.7.0 stack; dynamicemb / hstu remain cu129-only. Upgrade faiss to 1.14.3 for all GPU images: faiss+cu12 for cu126/cu129, faiss+cu13 for cu130 (replacing faiss_gpu_cu12 1.11.0). Bump the tzrec-test image tag suffix to -u1 (all of cpu/cu126/cu129/cu130) since the faiss change rebuilds every image. promote_docker.sh strips the suffix so tzrec-devel keeps the clean 1.3 tags. CI workflow image refs point at tzrec-test:1.3-u1 / :1.3-cpu-u1 for validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…elds to [] torchrec 1.7.0 (commit 32f40e01) removed EmbeddingCollectionContext.__post_init__, which used to coerce None list-fields to []. The dynamicemb package's DynamicEmbeddingCollectionContext.__init__ still forwards input_features=None (and sharding_contexts / reverse_indices / seq_vbe_ctx) to torchrec's super().__init__; under 1.7.0 they stay None and crash EmbeddingCollectionContext.record_stream and compute_and_output_dist with "TypeError: 'NoneType' object is not iterable" on every sharded EmbeddingCollection forward. Because tzrec shards all EmbeddingCollections with DynamicEmbeddingCollectionSharder when dynamicemb is installed (plan_util.get_default_sharders), this broke every HSTU/sequence train_eval on H20 (6 integration tests), even for plain num_buckets features. Monkeypatch the dynamicemb context __init__ to restore the None -> [] coercion. Confirmed on H20: the failing context is DynamicEmbeddingCollectionContext with input_features=None. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…esolution
The cu130 image failed to build at `pip install -r requirements-cu130.txt`: pip backtracked scikit-learn/scipy into source builds that fail to compile with Cython 3 ("Exception values are incompatible", "Declare ... as noexcept").
Root cause is a numpy floor conflict unique to the CUDA 13 stack:
- torch 2.12.1+cu130 pulls `cuda-bindings`/`cuda-pathfinder` (new in the cu13 packaging) which my dist-info strip regex (cuda-toolkit|nvidia-) missed, leaving torch's cu13 deps in the resolution graph.
- torch-tensorrt 2.12.0 requires `executorch>=1.2.0`, and executorch requires `numpy>=2.0.0`, which is unsatisfiable against tzrec's `numpy<2` -> pip backtracks scipy/scikit-learn to ancient sdists.
Fix, matching the cu126/cu129 "use system cuda" approach:
- uninstall the cuda-toolkit-provided cu13 nvidia wheels + cuda-toolkit/cuda-bindings/cuda-pathfinder (system cuda-toolkit-13-0 provides the libs; torch imports cuda.bindings via try/except with a None fallback, so removal is safe),
- extend the torch dist-info strip to (cuda-toolkit|cuda-bindings|cuda-pathfinder|nvidia-),
- strip `executorch` from torch_tensorrt's dist-info (tzrec only uses the dynamo TRT path, not the executorch backend).
Verified in-image: torch 2.12.1+cu130, numpy 1.26.4, scipy 1.17.1, scikit-learn 1.9.0 all import; the cu130 build + `-r` install now succeed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

Revert the temporary tzrec-test:1.3-u1 / tzrec-test:1.3-cpu-u1 validation image refs to the promoted tzrec-devel:1.3 / tzrec-devel:1.3-cpu. Net change vs master is the 1.2 -> 1.3 image tag bump; ppu workflow stays on tzrec-devel:1.1-ppu. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

Refresh the hstu op wheel (same commit 9fd44403, newer 20260626 build) in requirements/extra.txt (cp310/311/312) and the dlrm_hstu doc install command. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

# Conflicts:
#	.github/workflows/unittest_h20_ci.yml
#	tzrec/version.py

Repoint the default GPU tags to the CUDA 13.0 image: registry tags tzrec-test:1.3-u1, tzrec-devel:1.3 and tzrec-devel:latest now resolve to the cu130 build (was cu129), and build_docker.sh / promote_docker.sh derive the default tag from -cu130 going forward. The cu129 build stays available as the -cu129 variant. The GPU CI lanes are pinned to tzrec-devel:1.3-cu129 because the cu130 image cannot run the current CI harness as-is: CUDA 13.0 needs forward-compat on the runners' 535/550 drivers (LD_LIBRARY_PATH=/usr/local/cuda-13.0/compat), and ci_test.sh installs the cu129 faiss/dynamicemb/hstu wheels which are ABI-incompatible with CUDA 13. CI therefore continues to validate on the proven cu129 image; cpu and ppu lanes are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…ver runners)
Now that cu130 is the default 1.3 image, point all GPU CI lanes back to tzrec-devel:1.3 (cu130) instead of the cu129 variant, and make the harness cu130-ready:
- unittest_ci / unittest_nightly / benchmark run on tzrec-runner / tzrec-bench-runner (R535 driver, below CUDA 13's R580); add LD_LIBRARY_PATH=/usr/local/cuda-13/compat so CUDA 13 runs via forward-compat (verified: torch.cuda + fused TBE work on the A10).
- unittest_h20 runs on tzrec-h20-runner (R580, native CUDA 13) -> no compat path needed.
- CUDA_HOME /usr/local/cuda-12 -> /usr/local/cuda-13 (cu130 toolkit path) for AOTI.
- requirements-gpu.txt -> requirements/cu130.txt (faiss-cu13 instead of cu12).
- requirements/extra.txt -> cu130 dynamicemb/hstu wheels.
cpu and ppu lanes unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
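The driver-dependent compat path described in this commit could be expressed as a small helper. This is only a sketch under stated assumptions: the function name, the constants, and the idea of passing the driver major version in as an integer are hypothetical, and the real change is made in the CI workflow environment rather than in Python.

```python
import os

# Assumption taken from the commit message: CUDA 13 runs natively on R580+,
# and older drivers (e.g. R535) need the forward-compat libraries.
CUDA13_MIN_DRIVER_MAJOR = 580
COMPAT_DIR = "/usr/local/cuda-13/compat"


def with_forward_compat(env, driver_major, compat_dir=COMPAT_DIR):
    """Return a copy of env with compat_dir prepended to LD_LIBRARY_PATH
    when the driver is older than the CUDA 13 minimum."""
    new_env = dict(env)
    if driver_major >= CUDA13_MIN_DRIVER_MAJOR:
        return new_env  # native CUDA 13 driver, no compat path needed
    parts = [p for p in new_env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    if compat_dir not in parts:
        parts.insert(0, compat_dir)  # compat libs must win over system ones
    new_env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return new_env
```

With these assumptions, an R535 runner gets the compat directory prepended, while an R580 runner keeps its environment unchanged.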

- torch_tensorrt 2.12.0 -> 2.12.1; install without --no-deps so it pulls tensorrt-cu13 itself (drop the manual packaging/typing-extensions/dllist/psutil/tensorrt_cu13 line). Keep stripping only `executorch` from torch_tensorrt's dist-info -- it still Requires-Dist executorch>=1.2.0 (numpy>=2), and leaving it in makes `pip install -r requirements-cu130.txt` (numpy<2) backtrack scipy/scikit-learn into failing source builds. executorch (~16MB) is installed but unused by tzrec.
- Stop uninstalling/stripping cuda-toolkit / cuda-bindings / cuda-pathfinder for cu130: the executorch strip alone keeps the numpy<2 resolution clean, so those torch cu13 deps can stay (torch imports cuda.bindings via try/except).
- Merge the byte-identical "cu126") and "cu129") branches into "cu126"|"cu129") using ${DEVICE}.
Verified: cu130 image builds with no scipy/scikit-learn source build; numpy 1.26.4, scipy 1.17.1, scikit-learn 1.7.1, faiss 1.14.3 and torch_tensorrt 2.12.1+cu130 all import (the latter on GPU via cuda-13 forward-compat).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…mp dynamicemb
triton (Q15): triton 3.7.1 bundles ptxas 12.8.93, which mis-compiles the sm_90 HSTU WGMMA kernel into a shared-memory OOB (regressed onto release/3.7 by triton commit 6c96454f2f, which downgraded ptxas 12.9.86 -> 12.8.93; bisected and confirmed on H20). Force-reinstall the official triton 3.7.1 manylinux wheel repackaged to bundle ptxas 12.9.86 -- only the ptxas binary swapped, RECORD regenerated -- for cu126/cu129/cu130. Verified: clean install, inductor set_driver_to_gpu OK, compute-sanitizer ERROR SUMMARY 0. No DISABLE_MMA_V3 fallback, no triton rebuild (avoids the torch<->triton runtime mismatch that breaks AOT export).
cu126|cu129: add nvidia-cufile-cu12 to the uninstall list -- torch 2.12.1 installs it, but the list inherited from the 1.2.x Dockerfile omitted it while cu130 already strips nvidia-cufile.
cu130: re-add the nvidia-* uninstall + torch-metadata strip (cuda-toolkit/cuda-bindings/cuda-pathfinder kept installed).
dynamicemb: 0.1.0+20260624.2550a9c -> 0.1.0+20260630.5dc46a2 (extra.txt cp310/311/312 and the feature doc).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…post1)
graphlearn 1.3.7 and faiss-gpu-cu12 1.11.0 were the deps pinning numpy<2.
- graphlearn 1.3.7 -> 1.3.8 (declares no numpy bound).
- faiss-gpu-cu12 1.11.0 -> 1.14.1.post1: 1.11.0 requires numpy<2; faiss became numpy-2 ready at 1.13.2 (numpy>=2,<3). Source moves from the OSS-mirrored wheel to PyPI (matching how faiss-cpu is already sourced); its nvidia-cuda-runtime/cublas constraints are unchanged (>=12.1), so the CUDA stack is unaffected.
Require numpy>2 first to exercise the numpy 2.x path through CI before relaxing the pin entirely.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NPSTfsBJfgwD1Y6afhyVzg

…faiss via [gpu]
- tzrec/datasets/sampler.py: numpy 2.0 removed np.string_; use np.bytes_ (same 'S' dtype, np.char.decode unchanged) when parsing graphlearn string node attrs. Fixes AttributeError crashing the negative/hard-negative/TDM sampler DataLoader workers under numpy 2.
- .github/workflows/buildtest_ci.yml: install the built wheel's [gpu] extra so faiss-gpu-cu12 upgrades to the numpy-2 build. Plain `pip install <wheel>` only pulls install_requires (numpy>2, no faiss), bumping numpy to 2.x while leaving the image's numpy-1.x faiss 1.11.0 -> `import tzrec` crashed with "numpy.core.multiarray failed to import".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NPSTfsBJfgwD1Y6afhyVzg

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

- graphlearn 1.3.7 -> 1.3.8 (numpy-2-compatible sampler; new OSS path)
- numpy: drop the <2 cap -> unconstrained (graphlearn 1.3.8 supports numpy 2); pandas unconstrained
- keep the 1.3.0 stack's faiss 1.14.3 (cu12/cu13) wheels, version 1.3.0, and the original buildtest_ci.yml
- brings SID model docs (rqvae/rqkmeans) + sampler updates
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

numpy is now unconstrained (graphlearn 1.3.8 supports numpy 2), so executorch's numpy>=2 requirement resolves cleanly and no longer forces scipy/scikit-learn into failing source builds -- the strip is unnecessary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

The pythonrun.oss-cn-zhangjiakou bucket now returns 403 for the Mellanox RDMA installer, breaking every GPU image build at the last step. Point Step 26 at the hpn-driver mirror (ubuntu22.04-rdma-core-23.10.tar): plain .tar (tar xf), deb-based install.sh (apt-get install -f) so add apt-get update + clean around it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

Temporary: point the workflow images at the tzrec-test -u3 build (numpy 2 / graphlearn 1.3.8 / ptxas-swapped triton / RDMA mirror) to validate before promoting to tzrec-devel. ppu lane untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…ROGRESS.md
- faq.md Q15: rewrite to the real cause -- triton 3.7.1's bundled ptxas 12.8.93 mis-compiles the HSTU WGMMA kernel on sm_90 (ptxas 12.9.86 fixes it) -- and the shipped remedy (image bundles the ptxas-swapped 3.7.1 wheel; no DISABLE_MMA_V3, no Hopper v3 perf loss). The old MMA-v3 "still affected" text was wrong.
- local_tutorial.md: drop the stale "dynamicemb/hstu cu129-only, not in cu130" note.
- aot_utils.py / optimizer.py / dynamicemb_util.py: condense the backport/patch/coercion comments.
- stop tracking PROGRESS.md (worktree scratch, accidentally staged during the merge).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

The sm_90 WGMMA shmem-OOB that _force_mma_v2 dodged is fixed by the ptxas-12.9.86 swap in the 1.3.0 triton wheel. Verified on H20: all four affected tests (test_attn_triton_long_seqs, test_cache, test_attn_cutlass, test_sla_attn_cutlass) pass on the real v3/WGMMA path with 0 failures / 0 errors.
Removing the workaround so unittest_h20_ci exercises the shipped v3 path and guards the fix against regression (e.g. a future triton bump re-downgrading ptxas, or the ptxas-swapped wheel missing).
- hstu_attention_test.py: delete _force_mma_v2 + _DISABLE_V3_CACHE_SUFFIX + now-unused imports; unwrap the 4 with-blocks (dedent bodies).
- rank_integration_test.py: drop DISABLE_MMA_V3=1 from the hstu train/eval/export env.
- The DISABLE_MMA_V3 runtime knob itself is kept as an escape hatch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

…ripts The 1.3.0 test images that were built, CI-validated, and promoted to tzrec-devel are the -u3 iteration; point build_docker.sh / promote_docker.sh at -u3 so a re-run targets the same tags. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

Reverts the temporary tzrec-test:1.3-u3 / :1.3-cpu-u3 validation refs now that the -u3 images are promoted to tzrec-devel:1.3 / :1.3-cpu (identical digests). Final CI runs against the promoted release images. ppu lane untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

github-actions · 2026-07-02T01:55:28Z

+        if preallocated_host_buffer is not None:
+            assert preallocated_host_buffer.numel() == split.host_size, (
+                f"preallocated_host_buffer size mismatch for '{prefix}_host': "
+                f"expected {split.host_size}, got {preallocated_host_buffer.numel()}"
+            )
+            assert preallocated_host_buffer.is_contiguous(), (
+                f"preallocated_host_buffer for '{prefix}_host' must be contiguous"
+            )
+            assert preallocated_host_buffer.dim() == 1, (
+                f"preallocated_host_buffer for '{prefix}_host' must be 1D, got "
+                f"{preallocated_host_buffer.dim()}D with shape "
+                f"{preallocated_host_buffer.shape}"
+            )
+            assert preallocated_host_buffer.dtype == dtype, (
+                f"preallocated_host_buffer dtype mismatch for '{prefix}_host': "
+                f"expected {dtype}, got {preallocated_host_buffer.dtype}"
+            )
+            assert preallocated_host_buffer.device == current_device, (
+                f"preallocated_host_buffer device mismatch for '{prefix}_host': "
+                f"expected {current_device}, got {preallocated_host_buffer.device}"
+            )
+            host_buffer = preallocated_host_buffer


When fbgemm passes a preallocated_host_buffer, it is adopted as-is — the use_init_value / torch.full(init_value) fill (the whole purpose of this monkeypatch, for TF-Adagrad initial_accumulator_value) is silently skipped for host-placed shards. If this path is ever hit for momentum1, the accumulator starts at the pool's contents (typically zeros) instead of init_value, so Adagrad behavior would differ between HBM-placed and host-placed tables.

Consider adding after line 168:

if use_init_value: host_buffer.fill_(init_value)

or assert not (use_init_value and preallocated_host_buffer is not None) if the combination is believed unreachable. Note this branch (and the make_persistent one) is currently not exercised by any test — a small unit test calling apply_split_helper directly with a preallocated 1-D CPU buffer would cover both.

github-actions · 2026-07-02T01:55:34Z

+    from dynamicemb.shard.embedding import DynamicEmbeddingCollectionContext
+
+    # torchrec 1.7.0 no longer coerces EmbeddingCollectionContext's None list-fields
+    # to []; dynamicemb still passes None (breaks sharded EC forward), so re-coerce.
+    _orig_demb_ctx_init = DynamicEmbeddingCollectionContext.__init__
+
+    def _demb_ctx_init_coerce_none(
+        self,
+        sharding_contexts=None,  # pyre-ignore [2]
+        input_features=None,  # pyre-ignore [2]
+        reverse_indices=None,  # pyre-ignore [2]
+        seq_vbe_ctx=None,  # pyre-ignore [2]
+        frequency_counters=None,  # pyre-ignore [2]
+    ):  # pyre-ignore [3]
+        _orig_demb_ctx_init(
+            self,
+            sharding_contexts if sharding_contexts is not None else [],
+            input_features if input_features is not None else [],
+            reverse_indices if reverse_indices is not None else [],
+            seq_vbe_ctx if seq_vbe_ctx is not None else [],
+            frequency_counters,
+        )
+
+    # pyre-ignore [9]
+    DynamicEmbeddingCollectionContext.__init__ = _demb_ctx_init_coerce_none


Two robustness notes on this new patch:

It's the only monkeypatch in this file living inside the blanket try/except Exception: pass (all others are under if has_dynamicemb:). If dynamicemb.shard.embedding moves or the patch throws against a future wheel (the dynamicemb pin was bumped in this same PR), the exception is swallowed, has_dynamicemb stays False, every dynamicemb CI test silently skips, and users get the misleading "dynamicemb is not installed" RuntimeError. Consider logging the swallowed exception in the except branch so breakage fails loudly.

The wrapper hardcodes the 5 current parameters and forwards them positionally; if a future dynamicemb adds a context field, construction will TypeError at runtime. Forwarding trailing *args, **kwargs would make it drift-tolerant.
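Both notes can be illustrated with one sketch. Everything here is a stand-in: FakeContext plays the role of DynamicEmbeddingCollectionContext, new_field is an imagined future parameter, and coerce_none_list_fields is a hypothetical helper, not the code in tzrec/utils/dynamicemb_util.py. The wrapper binds whatever arguments it receives against the original signature, so an added parameter passes through untouched, and the except branch logs instead of silently passing.

```python
import functools
import inspect
import logging

logger = logging.getLogger(__name__)

# List-valued fields whose None should become [] (names from the patch above).
_LIST_FIELDS = ("sharding_contexts", "input_features", "reverse_indices", "seq_vbe_ctx")


def coerce_none_list_fields(cls, list_fields=_LIST_FIELDS):
    """Wrap cls.__init__ so None list-fields become [] and all other
    positional/keyword arguments are forwarded unchanged."""
    orig_init = cls.__init__
    sig = inspect.signature(orig_init)

    @functools.wraps(orig_init)
    def patched_init(self, *args, **kwargs):
        bound = sig.bind_partial(self, *args, **kwargs)
        bound.apply_defaults()  # fill in defaults so omitted fields are visible
        for name in list_fields:
            if name in bound.arguments and bound.arguments[name] is None:
                bound.arguments[name] = []
        orig_init(*bound.args, **bound.kwargs)

    cls.__init__ = patched_init


class FakeContext:
    """Hypothetical stand-in with one parameter the wrapper does not know."""

    def __init__(self, sharding_contexts=None, input_features=None,
                 reverse_indices=None, seq_vbe_ctx=None,
                 frequency_counters=None, new_field=None):
        self.sharding_contexts = sharding_contexts
        self.input_features = input_features
        self.reverse_indices = reverse_indices
        self.seq_vbe_ctx = seq_vbe_ctx
        self.frequency_counters = frequency_counters
        self.new_field = new_field


try:
    coerce_none_list_fields(FakeContext)
    patched = True
except Exception:
    # Log with traceback rather than `pass`, so a broken patch is visible.
    logger.exception("failed to patch context __init__")
    patched = False
```

The trade-off is a small per-construction cost for the signature binding, in exchange for not having to mirror the upstream parameter list by hand.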

github-actions · 2026-07-02T01:55:37Z

+                 pip cache purge ;; \
+    esac && \
+    case ${DEVICE} in \
+        "cu126"|"cu129"|"cu130") pip install --force-reinstall --no-deps https://tzrec.oss-accelerate.aliyuncs.com/third_party/triton/triton-3.7.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl && \


The repackaged wheel (ptxas 12.9.86 swap) reuses the stock 3.7.1 version string, so pip freeze/SBOMs can't distinguish it from PyPI triton, and any later reinstall or dependency re-resolution silently reverts the ptxas fix with no version signal. The previous scheme used a distinguishing local version (3.6.0+565c08520).

Consider republishing as e.g. 3.7.1+ptxas12986 — PEP 440 local versions still satisfy the triton==3.7.1 pins in requirements/cu1*.txt — and optionally adding --hash=sha256:... here. Relatedly, those requirements files resolve stock PyPI 3.7.1, so pip-based (non-docker) installs on Hopper still hit the FAQ Q15 crash; pointing them at the repackaged wheel (as already done for faiss/dynamicemb) or adding a comment referencing FAQ Q15 would be more consistent.

github-actions · 2026-07-02T01:55:40Z

+                 wget https://hpn-driver.oss-cn-hangzhou.aliyuncs.com/nic-drivers/shuyao/drivers/ubuntu22.04-rdma-core-23.10.tar && \
+                 tar xf ubuntu22.04-rdma-core-23.10.tar && \
+                 cd ubuntu22.04-rdma-core-23.10 && \
+                 apt-get update && \
                 echo Y | /bin/bash install.sh && \


This tarball comes from a bucket outside the project's own tzrec OSS bucket (the shuyao/ path segment looks like a personal directory) and its install.sh runs as root with no integrity check. Anyone who can overwrite that object gets code execution in every future image build, and the result ships to a public registry. Consider mirroring the tarball into tzrec.oss-accelerate.aliyuncs.com/third_party/ and/or verifying a pinned sha256sum before executing.
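A pinned-checksum gate of the kind suggested here might look like the following sketch. The helper names are hypothetical and the expected digest would have to be recorded from a trusted copy of the tarball; in the Dockerfile itself the same check would more likely be a `sha256sum -c` step before running install.sh.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large tarballs are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Raise ValueError unless the file's sha256 matches the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(
            f"sha256 mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

Calling verify_sha256 on the downloaded tarball before extracting it would make a replaced object fail the build instead of executing.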

github-actions · 2026-07-02T01:55:42Z

@@ -8,7 +8,7 @@ DynamicEmbedding is a way to turn features into IDs with zero hash collisions; compared with setting `ha

 ```bash
 # DEVICE options: cu126/cu129 (supports Python 3.10/3.11/3.12)


Stale device list: requirements/extra.txt now pins the 20260630.5dc46a2 build as cu130-only wheels, and the default tzrec-devel:1.3 image is the cu130 variant — but this comment still offers only cu126/cu129. Add cu130 (and confirm cu126/cu129 wheels of the new build actually exist on OSS before keeping them listed).

github-actions · 2026-07-02T01:55:46Z


  - TRITON: Triton-based implementation, usually 2-3x faster than PYTORCH and using 2-3x less GPU memory
-  - CUTLASS: fused CUDA operator implementation based on CUTLASS; requires installing the fbgemm_gpu_hstu package (DEVICE options: cu126/cu129, with DEVICE_DOTTED being cu12.6/cu12.9 respectively: `pip install fbgemm_gpu_hstu==0.1.0+${DEVICE_DOTTED} -f https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/${DEVICE}/repo.html`); requires `attention_dim` to equal `hidden_dim`; supports Ampere/Ada/Hopper GPUs
+  - CUTLASS: fused CUDA operator implementation based on CUTLASS; requires installing the fbgemm_gpu_hstu package (DEVICE options: cu126/cu129: `pip install fbgemm_gpu_hstu==0.1.0+20260626.9fd44403.${DEVICE} -f https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/${DEVICE}/repo.html`); requires `attention_dim` to equal `hidden_dim`; supports Ampere/Ada/Hopper GPUs


Same stale device list as dynamicemb.md: the version string was updated to 20260626.9fd44403, but the DEVICE list (cu126/cu129) was kept while requirements/extra.txt (and CI) now use the cu130 wheel of this build. cu130 is missing from the list; verify cu126/cu129 wheels of this build exist.

github-actions · 2026-07-02T01:55:48Z

        # Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the
        # pure-CUTLASS NFUNC path (no Triton kernel runs).


Stale comment: the _force_mma_v2 workaround this refers to was removed in this PR (the sibling comment in test_attn_cutlass was rewritten; this one was missed).

Suggested change

-        # Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the
-        # pure-CUTLASS NFUNC path (no Triton kernel runs).

github-actions · 2026-07-02T01:55:50Z

-DOCKER_TAG=1.2
-DOCKER_TAG_SUFFIX=
+DOCKER_TAG=1.3
+DOCKER_TAG_SUFFIX=-u3


DOCKER_TAG_SUFFIX=-u3 (previously empty) looks like leftover state from image-build iteration — the next person running this script will unknowingly produce 1.3-<device>-u3 tags. Consider resetting to empty after promotion, or documenting the suffix convention.

github-actions · 2026-07-02T01:55:52Z

+graphlearn @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/graphlearn/graphlearn-1.3.8-cp310-cp310-linux_x86_64.whl ; python_version=="3.10"
 grpcio-tools<1.63.0
-numpy<2
+numpy


Fully unconstrained numpy will silently pick up a future numpy 3.x. Given the codebase just completed the 1.x→2.x migration, an upper bound documents what's validated:

Suggested change

-numpy
+numpy<3

github-actions · 2026-07-02T01:56:17Z

Code review summary (multi-area review)

Overall this is a well-executed release bump. The monkeypatch re-syncs (apply_split_helper → fbgemm 1.7.0, SharderData migration in plan_util.py/dynamicemb_util.py) are internally consistent with no stragglers, the np.string_→np.bytes_ numpy-2 migration is complete (no other removed aliases remain in the tree), the FAQ Q15 rewrite matches the shipped triton wheel and the workaround removal, and docs/version references were updated thoroughly. The torch #178147 backport stays correctly gated for 2.12.1, and trt_utils.py already degrades gracefully for the now-TRT-free cu126/cu129 stacks.

Noteworthy items are posted as inline comments (init-value skip on the preallocated_host_buffer path, silent-disable hazard around the new dynamicemb patch, triton wheel version shadowing, unverified RDMA tarball, stale cu126/cu129 device lists in docs, leftover -u3 tag suffix, unconstrained numpy, stale test comment). A few findings land on files this PR doesn't touch, so they go here:

setup.py:81 — the tzrec[gpu] extra still resolves requirements/cu129.txt, while this PR flips requirements-gpu.txt to requirements/cu130.txt. Before the PR both were cu129; now pip install tzrec[gpu] yields a different stack (cu12 faiss, no torch-tensorrt) than the repo's own GPU requirements. If keeping the PyPI extra CUDA-12-friendly is intentional, a comment or a separate cu130 extra would make it explicit.
tzrec/ops/utils.py:81 — clear_triton_caches is now dead code: its only caller (_force_mma_v2) was removed here, and its docstring still motivates itself via the removed DISABLE_MMA_V3 workflow. Remove it, or keep it deliberately as a general utility with an updated docstring.
Test-coverage suggestions (advisory): this PR is itself the failure mode the fbgemm monkeypatch risks — upstream grew two kwargs and the old override raised TypeError. A small signature-parity test (inspect.signature of upstream apply_split_helper vs the tzrec override) would turn the next such break into a unit-test failure instead of a runtime crash. Similarly, a one-line assertion that DynamicEmbeddingCollectionContext() yields [] (not None) for its list fields would catch both coercion regressions and signature drift.
Dockerfile minor: the plain pip install torch==2.12.1 after the --no-deps install re-resolves and downloads the multi-GB nvidia-* wheels only to uninstall them — running the Requires-Dist sed strip before the second install would avoid that. Also, mirrors.aliyun.com serves HTTPS; the generated pip.conf could use https:// and drop trusted-host now that the default mirror is public.
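The signature-parity test suggested above could look roughly like the sketch below. It uses stand-in functions rather than the real fbgemm `apply_split_helper` or the tzrec override, whose exact signatures are not shown in this thread; `signature_mismatch` and the parameter names are assumptions for illustration.

```python
import inspect


def signature_mismatch(upstream, override):
    """Return (missing, extra) parameter names of override relative to upstream."""
    up = inspect.signature(upstream).parameters
    ov = inspect.signature(override).parameters
    # An override that takes **kwargs accepts any new upstream keyword.
    accepts_any_kwarg = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in ov.values()
    )
    missing = [] if accepts_any_kwarg else [n for n in up if n not in ov]
    extra = [
        n for n, p in ov.items()
        if n not in up and p.kind is not inspect.Parameter.VAR_KEYWORD
    ]
    return missing, extra


# Stand-ins: upstream grew two keyword arguments, the stale override did not.
def upstream_helper(state, device, *, new_kwarg_a=None, new_kwarg_b=None):
    return None


def stale_override(state, device):
    return None


def synced_override(state, device, *, new_kwarg_a=None, new_kwarg_b=None):
    return None


stale_result = signature_mismatch(upstream_helper, stale_override)
synced_result = signature_mismatch(upstream_helper, synced_override)
```

In a unit test, asserting that both lists are empty for the real upstream function and the real override would surface the next upstream signature change at test time rather than as a runtime `TypeError`.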

🤖 Generated with Claude Code

…cs to cu130

- optim/optimizer.py: apply_split_helper adopted fbgemm's preallocated host buffer as-is and skipped the init_value fill, so a host-placed momentum1 shard started at the pool contents (zeros) instead of FBGEMM_MOMENTUM1_STATE_INIT_VALUE. Fill it when use_init_value, matching the dev and non-preallocated host paths.
- setup.py: tzrec[gpu] extra -> requirements/cu130.txt (the 1.3 default image is cu130).
- ops/utils.py: clear_triton_caches has no caller after the _force_mma_v2 removal; keep it as a general Triton-cache utility and drop the DISABLE_MMA_V3-specific docstring.
- dynamicemb.md / dlrm_hstu.md: DEVICE list cu126/cu129 -> cu126/cu129/cu130 (the 5dc46a2 / 9fd44403 wheels are confirmed on OSS for all three).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

tiankongdeguiji and others added 24 commits June 24, 2026 12:36

Merge remote-tracking branch 'origin/master' into bump_torch_2.12.1

c1dd1df

# Conflicts:
#	.github/workflows/unittest_h20_ci.yml
#	tzrec/version.py

[chore] simplify dynamicemb None-coercion monkeypatch comment

e8a2af3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U

tiankongdeguiji changed the title ~~[feat] bump to 1.3.0 with torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0~~ [feat] bump to 1.3.0: torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2 (+ cu130 TRT) Jul 1, 2026

tiankongdeguiji added the claude-review Let Claude Review label Jul 2, 2026

tiankongdeguiji mentioned this pull request Jul 2, 2026

[feat] bump graphlearn to 1.3.8, drop numpy<2 pin #557

Closed

github-actions Bot removed the claude-review Let Claude Review label Jul 2, 2026

WhiteSwan1 self-requested a review July 2, 2026 01:45

WhiteSwan1 previously approved these changes Jul 2, 2026

github-actions Bot reviewed Jul 2, 2026

tiankongdeguiji dismissed WhiteSwan1’s stale review via 7c14dd4 July 2, 2026 02:33

WhiteSwan1 approved these changes Jul 2, 2026

tiankongdeguiji merged commit ee3ef09 into master Jul 2, 2026
9 checks passed

tiankongdeguiji deleted the bump_torch_2.12.1 branch July 2, 2026 09:54

tiankongdeguiji merged 25 commits into master from bump_torch_2.12.1

		@@ -8,7 +8,7 @@ DynamicEmbedding is a way to map feature values to IDs with zero hash collisions; compared with setting `ha

		```bash
		# DEVICE options: cu126/cu129 (supports Python 3.10/3.11/3.12)
		```

		# Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the
		# pure-CUTLASS NFUNC path (no Triton kernel runs).
