[feat] bump to 1.3.0: torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2 (+ cu130 TRT)#551
Upgrade the dependency stack to PyTorch 2.12.1, torchrec 1.7.0, fbgemm_gpu 1.7.0, triton 3.7.1, and refreshed dynamicemb / hstu ops; bump the project version to 1.3.0 and the Docker image tag to 1.3.
Required source change: re-sync the apply_split_helper monkeypatch in tzrec/optim/optimizer.py to fbgemm 1.7.0's signature/body. fbgemm 1.7.0 added make_persistent and preallocated_host_buffer params and its internal TBE caller now passes them, so the previous override would raise TypeError on every fused-embedding table construction. The tzrec momentum1 (TF-Adagrad) state-init customization is preserved.
Keep the pytorch#178147 inductor codegen backport in tzrec/acc/aot_utils.py: the fix is still absent from torch 2.12.1 (it lands in 2.13.0+); only the stale comment is refreshed.
TensorRT: drop torch-tensorrt from the cu129/cu126 images. torch_tensorrt 2.12 has no CUDA-12 build (it moved to CUDA 13), so it is incompatible with the cu129 stack. The framework degrades gracefully (has_tensorrt is False, as in the cpu image) and AOTI export/predict is unaffected. cu130 plus TRT support is planned to follow.
triton 3.7.1: keep the DISABLE_MMA_V3 Hopper/H20 workaround; the RS-WGMMA synchronization fix (triton#9514) is not in the 3.7.x release line.
Dockerfile: set PIP_MIRROR default to mirrors.aliyun.com and drop the mirrors.cloud.aliyuncs.com rewrite.
Update docs and pre-commit hooks (all to latest stable except mdformat, which is unchanged).
Workflow image refs temporarily point at tzrec-test:1.3 for CI validation; reverted to tzrec-devel after image promotion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
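The signature drift described above can be reproduced in a few lines. This is only a sketch with stand-in functions: `stale_override`, `resynced_override` and `internal_caller` are hypothetical names, not fbgemm or tzrec code; only the two keyword names come from the commit message.

```python
import inspect


def stale_override(prefix, dtype):
    """Override written against the older, two-parameter signature (hypothetical)."""
    return f"patched {prefix}:{dtype}"


def resynced_override(prefix, dtype, make_persistent=False, preallocated_host_buffer=None):
    """Override re-synced to accept the keywords the newer caller passes."""
    return f"patched {prefix}:{dtype}"


def internal_caller(helper):
    """Stand-in for a library-internal caller that now passes two extra keywords."""
    return helper(
        "momentum1",
        "float32",
        make_persistent=False,
        preallocated_host_buffer=None,
    )


def accepts(func, *names):
    """Return True if every name is a parameter of func."""
    params = inspect.signature(func).parameters
    return all(name in params for name in names)
```

Calling `internal_caller(stale_override)` raises `TypeError` for the unexpected keyword, while `internal_caller(resynced_override)` succeeds. A check like `accepts(...)` at import time is one possible way to notice such drift before the first table is constructed.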
…a hash mismatch torch 2.12.1 pulls the nvidia cuda-toolkit meta-package, which makes pip hash-check the nvidia sub-wheels. The aliyun PyPI mirror serves a repackaged nvidia_cuda_nvrtc_cu12 wheel whose content hash differs from what the pytorch-wheels find-links page advertises, so `pip install torch -f <find-links>` fails with "PACKAGES DO NOT MATCH THE HASHES". Install torch from find-links with --no-deps, then resolve its deps via a plain `pip install torch==2.12.1` (no find-links, so no conflicting hashes) for the cu126/cu129 images. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ata contract torchrec 1.7.0 (#3852) reworked the planner estimator contract to pass a picklable SharderData snapshot instead of live ModuleSharder objects: EmbeddingEnumerator.populate_estimates now calls estimate(sharding_options, sharder_data_map=...), ShardEstimator.estimate takes sharder_data_map: SharderDataMap, calculate_shard_storages takes sharder_data: SharderData, and ShardPerfContext.build_shard_perf_contexts takes sharder_data. tzrec's copied/overridden estimators were left on the 1.6 sharder_map/sharder contract, so planning crashed with "estimate() got an unexpected keyword argument 'sharder_data_map'" and "build_shard_perf_contexts() missing 1 required positional argument: 'sharder'". Port tzrec's EmbeddingStorageEstimator.estimate, the calculate_shard_storages wrapper (plan_util.py) and the dynamicemb perf-context / storage helpers (dynamicemb_util.py) to the SharderData API, reading fused_params off SharderData and forwarding sharder_data, preserving the dynamicemb x_eff cache_params injection and storage estimation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…e tag Add a cu130 Docker image variant on the CUDA 13.0 toolchain that restores TensorRT support (torch_tensorrt 2.12.0+cu130 + tensorrt_cu13 10.16.1.11), which is unavailable on the cu129 stack because torch_tensorrt 2.12 dropped CUDA 12. The cu130 image carries the full torch 2.12.1 / fbgemm 1.7.0 / torchrec 1.7.0 stack; dynamicemb / hstu remain cu129-only. Upgrade faiss to 1.14.3 for all GPU images: faiss+cu12 for cu126/cu129, faiss+cu13 for cu130 (replacing faiss_gpu_cu12 1.11.0). Bump the tzrec-test image tag suffix to -u1 (all of cpu/cu126/cu129/cu130) since the faiss change rebuilds every image. promote_docker.sh strips the suffix so tzrec-devel keeps the clean 1.3 tags. CI workflow image refs point at tzrec-test:1.3-u1 / :1.3-cpu-u1 for validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…elds to [] torchrec 1.7.0 (commit 32f40e01) removed EmbeddingCollectionContext.__post_init__, which used to coerce None list-fields to []. The dynamicemb package's DynamicEmbeddingCollectionContext.__init__ still forwards input_features=None (and sharding_contexts / reverse_indices / seq_vbe_ctx) to torchrec's super().__init__; under 1.7.0 they stay None and crash EmbeddingCollectionContext.record_stream and compute_and_output_dist with "TypeError: 'NoneType' object is not iterable" on every sharded EmbeddingCollection forward. Because tzrec shards all EmbeddingCollections with DynamicEmbeddingCollectionSharder when dynamicemb is installed (plan_util.get_default_sharders), this broke every HSTU/sequence train_eval on H20 (6 integration tests), even for plain num_buckets features. Monkeypatch the dynamicemb context __init__ to restore the None -> [] coercion. Confirmed on H20: the failing context is DynamicEmbeddingCollectionContext with input_features=None. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…esolution
The cu130 image failed to build at `pip install -r requirements-cu130.txt`: pip
backtracked scikit-learn/scipy into source builds that fail to compile with Cython 3
("Exception values are incompatible", "Declare ... as noexcept"). Root cause is a
numpy floor conflict unique to the CUDA 13 stack:
- torch 2.12.1+cu130 pulls `cuda-bindings`/`cuda-pathfinder` (new in the cu13
packaging) which my dist-info strip regex (cuda-toolkit|nvidia-) missed, leaving
torch's cu13 deps in the resolution graph.
- torch-tensorrt 2.12.0 requires `executorch>=1.2.0`, and executorch requires
`numpy>=2.0.0`, which is unsatisfiable against tzrec's `numpy<2` -> pip backtracks
scipy/scikit-learn to ancient sdists.
Fix, matching the cu126/cu129 "use system cuda" approach:
- uninstall the cuda-toolkit-provided cu13 nvidia wheels + cuda-toolkit/cuda-bindings/
cuda-pathfinder (system cuda-toolkit-13-0 provides the libs; torch imports
cuda.bindings via try/except with a None fallback, so removal is safe),
- extend the torch dist-info strip to (cuda-toolkit|cuda-bindings|cuda-pathfinder|nvidia-),
- strip `executorch` from torch_tensorrt's dist-info (tzrec only uses the dynamo TRT
path, not the executorch backend).
Verified in-image: torch 2.12.1+cu130, numpy 1.26.4, scipy 1.17.1, scikit-learn 1.9.0
all import; the cu130 build + `-r` install now succeed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Revert the temporary tzrec-test:1.3-u1 / tzrec-test:1.3-cpu-u1 validation image refs to the promoted tzrec-devel:1.3 / tzrec-devel:1.3-cpu. Net change vs master is the 1.2 -> 1.3 image tag bump; ppu workflow stays on tzrec-devel:1.1-ppu. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Refresh the hstu op wheel (same commit 9fd44403, newer 20260626 build) in requirements/extra.txt (cp310/311/312) and the dlrm_hstu doc install command. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
# Conflicts: # .github/workflows/unittest_h20_ci.yml # tzrec/version.py
Repoint the default GPU tags to the CUDA 13.0 image: registry tags tzrec-test:1.3-u1, tzrec-devel:1.3 and tzrec-devel:latest now resolve to the cu130 build (was cu129), and build_docker.sh / promote_docker.sh derive the default tag from -cu130 going forward. The cu129 build stays available as the -cu129 variant. The GPU CI lanes are pinned to tzrec-devel:1.3-cu129 because the cu130 image cannot run the current CI harness as-is: CUDA 13.0 needs forward-compat on the runners' 535/550 drivers (LD_LIBRARY_PATH=/usr/local/cuda-13.0/compat), and ci_test.sh installs the cu129 faiss/dynamicemb/hstu wheels which are ABI-incompatible with CUDA 13. CI therefore continues to validate on the proven cu129 image; cpu and ppu lanes are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ver runners)
Now that cu130 is the default 1.3 image, point all GPU CI lanes back to tzrec-devel:1.3 (cu130) instead of the cu129 variant, and make the harness cu130-ready:
- unittest_ci / unittest_nightly / benchmark run on tzrec-runner / tzrec-bench-runner (R535 driver, below CUDA 13's R580); add LD_LIBRARY_PATH=/usr/local/cuda-13/compat so CUDA 13 runs via forward-compat (verified: torch.cuda + fused TBE work on the A10).
- unittest_h20 runs on tzrec-h20-runner (R580, native CUDA 13) -> no compat path needed.
- CUDA_HOME /usr/local/cuda-12 -> /usr/local/cuda-13 (cu130 toolkit path) for AOTI.
- requirements-gpu.txt -> requirements/cu130.txt (faiss-cu13 instead of cu12).
- requirements/extra.txt -> cu130 dynamicemb/hstu wheels.
cpu and ppu lanes unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
- torch_tensorrt 2.12.0 -> 2.12.1; install without --no-deps so it pulls
tensorrt-cu13 itself (drop the manual packaging/typing-extensions/dllist/psutil/
tensorrt_cu13 line). Keep stripping only `executorch` from torch_tensorrt's
dist-info -- it still Requires-Dist executorch>=1.2.0 (numpy>=2), and leaving it
in makes `pip install -r requirements-cu130.txt` (numpy<2) backtrack scipy/
scikit-learn into failing source builds. executorch (~16MB) is installed but
unused by tzrec.
- Stop uninstalling/stripping cuda-toolkit / cuda-bindings / cuda-pathfinder for
cu130: the executorch strip alone keeps the numpy<2 resolution clean, so those
torch cu13 deps can stay (torch imports cuda.bindings via try/except).
- Merge the byte-identical "cu126") and "cu129") branches into "cu126"|"cu129")
using ${DEVICE}.
Verified: cu130 image builds with no scipy/scikit-learn source build; numpy 1.26.4,
scipy 1.17.1, scikit-learn 1.7.1, faiss 1.14.3 and torch_tensorrt 2.12.1+cu130 all
import (the latter on GPU via cuda-13 forward-compat).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…mp dynamicemb
triton (Q15): triton 3.7.1 bundles ptxas 12.8.93, which mis-compiles the sm_90 HSTU WGMMA kernel into a shared-memory OOB (regressed onto release/3.7 by triton commit 6c96454f2f, which downgraded ptxas 12.9.86 -> 12.8.93; bisected and confirmed on H20). Force-reinstall the official triton 3.7.1 manylinux wheel repackaged to bundle ptxas 12.9.86 -- only the ptxas binary swapped, RECORD regenerated -- for cu126/cu129/cu130. Verified: clean install, inductor set_driver_to_gpu OK, compute-sanitizer ERROR SUMMARY 0. No DISABLE_MMA_V3 fallback, no triton rebuild (avoids the torch<->triton runtime mismatch that breaks AOT export).
cu126|cu129: add nvidia-cufile-cu12 to the uninstall list -- torch 2.12.1 installs it, but the list inherited from the 1.2.x Dockerfile omitted it while cu130 already strips nvidia-cufile.
cu130: re-add the nvidia-* uninstall + torch-metadata strip (cuda-toolkit/cuda-bindings/cuda-pathfinder kept installed).
dynamicemb: 0.1.0+20260624.2550a9c -> 0.1.0+20260630.5dc46a2 (extra.txt cp310/311/312 and the feature doc).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…post1)
graphlearn 1.3.7 and faiss-gpu-cu12 1.11.0 were the deps pinning numpy<2.
- graphlearn 1.3.7 -> 1.3.8 (declares no numpy bound).
- faiss-gpu-cu12 1.11.0 -> 1.14.1.post1: 1.11.0 requires numpy<2; faiss became numpy-2 ready at 1.13.2 (numpy>=2,<3). Source moves from the OSS-mirrored wheel to PyPI (matching how faiss-cpu is already sourced); its nvidia-cuda-runtime/cublas constraints are unchanged (>=12.1), so the CUDA stack is unaffected.
Require numpy>2 first to exercise the numpy 2.x path through CI before relaxing the pin entirely.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NPSTfsBJfgwD1Y6afhyVzg
…faiss via [gpu]
- tzrec/datasets/sampler.py: numpy 2.0 removed np.string_; use np.bytes_ (same 'S' dtype, np.char.decode unchanged) when parsing graphlearn string node attrs. Fixes AttributeError crashing the negative/hard-negative/TDM sampler DataLoader workers under numpy 2.
- .github/workflows/buildtest_ci.yml: install the built wheel's [gpu] extra so faiss-gpu-cu12 upgrades to the numpy-2 build. Plain `pip install <wheel>` only pulls install_requires (numpy>2, no faiss), bumping numpy to 2.x while leaving the image's numpy-1.x faiss 1.11.0 -> `import tzrec` crashed with "numpy.core.multiarray failed to import".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NPSTfsBJfgwD1Y6afhyVzg
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
- graphlearn 1.3.7 -> 1.3.8 (numpy-2-compatible sampler; new OSS path)
- numpy: drop the <2 cap -> unconstrained (graphlearn 1.3.8 supports numpy 2); pandas unconstrained
- keep the 1.3.0 stack's faiss 1.14.3 (cu12/cu13) wheels, version 1.3.0, and the original buildtest_ci.yml
- brings SID model docs (rqvae/rqkmeans) + sampler updates
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
numpy is now unconstrained (graphlearn 1.3.8 supports numpy 2), so executorch's numpy>=2 requirement resolves cleanly and no longer forces scipy/scikit-learn into failing source builds -- the strip is unnecessary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
The pythonrun.oss-cn-zhangjiakou bucket now returns 403 for the Mellanox RDMA installer, breaking every GPU image build at the last step. Point Step 26 at the hpn-driver mirror (ubuntu22.04-rdma-core-23.10.tar): plain .tar (tar xf), deb-based install.sh (apt-get install -f) so add apt-get update + clean around it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Temporary: point the workflow images at the tzrec-test -u3 build (numpy 2 / graphlearn 1.3.8 / ptxas-swapped triton / RDMA mirror) to validate before promoting to tzrec-devel. ppu lane untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ROGRESS.md
- faq.md Q15: rewrite to the real cause -- triton 3.7.1's bundled ptxas 12.8.93 mis-compiles the HSTU WGMMA kernel on sm_90 (ptxas 12.9.86 fixes it) -- and the shipped remedy (image bundles the ptxas-swapped 3.7.1 wheel; no DISABLE_MMA_V3, no Hopper v3 perf loss). The old MMA-v3 "still affected" text was wrong.
- local_tutorial.md: drop the stale "dynamicemb/hstu cu129-only, not in cu130" note.
- aot_utils.py / optimizer.py / dynamicemb_util.py: condense the backport/patch/coercion comments.
- stop tracking PROGRESS.md (worktree scratch, accidentally staged during the merge).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
The sm_90 WGMMA shmem-OOB that _force_mma_v2 dodged is fixed by the ptxas-12.9.86 swap in the 1.3.0 triton wheel. Verified on H20: all four affected tests (test_attn_triton_long_seqs, test_cache, test_attn_cutlass, test_sla_attn_cutlass) pass on the real v3/WGMMA path with 0 failures / 0 errors. Removing the workaround so unittest_h20_ci exercises the shipped v3 path and guards the fix against regression (e.g. a future triton bump re-downgrading ptxas, or the ptxas-swapped wheel missing).
- hstu_attention_test.py: delete _force_mma_v2 + _DISABLE_V3_CACHE_SUFFIX + now-unused imports; unwrap the 4 with-blocks (dedent bodies).
- rank_integration_test.py: drop DISABLE_MMA_V3=1 from the hstu train/eval/export env.
- The DISABLE_MMA_V3 runtime knob itself is kept as an escape hatch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
…ripts The 1.3.0 test images that were built, CI-validated, and promoted to tzrec-devel are the -u3 iteration; point build_docker.sh / promote_docker.sh at -u3 so a re-run targets the same tags. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
Reverts the temporary tzrec-test:1.3-u3 / :1.3-cpu-u3 validation refs now that the -u3 images are promoted to tzrec-devel:1.3 / :1.3-cpu (identical digests). Final CI runs against the promoted release images. ppu lane untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
| if preallocated_host_buffer is not None: | ||
| assert preallocated_host_buffer.numel() == split.host_size, ( | ||
| f"preallocated_host_buffer size mismatch for '{prefix}_host': " | ||
| f"expected {split.host_size}, got {preallocated_host_buffer.numel()}" | ||
| ) | ||
| assert preallocated_host_buffer.is_contiguous(), ( | ||
| f"preallocated_host_buffer for '{prefix}_host' must be contiguous" | ||
| ) | ||
| assert preallocated_host_buffer.dim() == 1, ( | ||
| f"preallocated_host_buffer for '{prefix}_host' must be 1D, got " | ||
| f"{preallocated_host_buffer.dim()}D with shape " | ||
| f"{preallocated_host_buffer.shape}" | ||
| ) | ||
| assert preallocated_host_buffer.dtype == dtype, ( | ||
| f"preallocated_host_buffer dtype mismatch for '{prefix}_host': " | ||
| f"expected {dtype}, got {preallocated_host_buffer.dtype}" | ||
| ) | ||
| assert preallocated_host_buffer.device == current_device, ( | ||
| f"preallocated_host_buffer device mismatch for '{prefix}_host': " | ||
| f"expected {current_device}, got {preallocated_host_buffer.device}" | ||
| ) | ||
| host_buffer = preallocated_host_buffer |
When fbgemm passes a preallocated_host_buffer, it is adopted as-is — the use_init_value / torch.full(init_value) fill (the whole purpose of this monkeypatch, for TF-Adagrad initial_accumulator_value) is silently skipped for host-placed shards. If this path is ever hit for momentum1, the accumulator starts at the pool's contents (typically zeros) instead of init_value, so Adagrad behavior would differ between HBM-placed and host-placed tables.
Consider adding after line 168:
    if use_init_value:
        host_buffer.fill_(init_value)
or `assert not (use_init_value and preallocated_host_buffer is not None)` if the combination is believed unreachable. Note this branch (and the make_persistent one) is currently not exercised by any test; a small unit test calling apply_split_helper directly with a preallocated 1-D CPU buffer would cover both.
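The concern can be illustrated without torch. In the sketch below, `array('f')` stands in for a 1-D float tensor and `adopt_host_buffer` is a hypothetical helper, not the fbgemm or tzrec function; it only shows the suggested behavior of filling an adopted buffer when `use_init_value` is set.

```python
from array import array


def adopt_host_buffer(host_size, init_value, use_init_value, preallocated=None):
    """Return a host buffer of host_size floats (hypothetical stand-in helper).

    If a preallocated buffer is supplied it is adopted, and when use_init_value
    is set its contents are overwritten in place so that it matches what the
    freshly allocated path would have produced.
    """
    if preallocated is not None:
        if len(preallocated) != host_size:
            raise ValueError(
                f"size mismatch: expected {host_size}, got {len(preallocated)}"
            )
        if use_init_value:
            # Without this fill the buffer keeps the pool's contents (often zeros).
            for i in range(host_size):
                preallocated[i] = init_value
        return preallocated
    fill = init_value if use_init_value else 0.0
    return array("f", [fill]) * host_size
```

With the fill in place, `adopt_host_buffer(4, 0.5, True, preallocated=pool)` returns the same `pool` object with every element set to 0.5, matching the non-preallocated path.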
| from dynamicemb.shard.embedding import DynamicEmbeddingCollectionContext | ||
|
|
||
| # torchrec 1.7.0 no longer coerces EmbeddingCollectionContext's None list-fields | ||
| # to []; dynamicemb still passes None (breaks sharded EC forward), so re-coerce. | ||
| _orig_demb_ctx_init = DynamicEmbeddingCollectionContext.__init__ | ||
|
|
||
| def _demb_ctx_init_coerce_none( | ||
| self, | ||
| sharding_contexts=None, # pyre-ignore [2] | ||
| input_features=None, # pyre-ignore [2] | ||
| reverse_indices=None, # pyre-ignore [2] | ||
| seq_vbe_ctx=None, # pyre-ignore [2] | ||
| frequency_counters=None, # pyre-ignore [2] | ||
| ): # pyre-ignore [3] | ||
| _orig_demb_ctx_init( | ||
| self, | ||
| sharding_contexts if sharding_contexts is not None else [], | ||
| input_features if input_features is not None else [], | ||
| reverse_indices if reverse_indices is not None else [], | ||
| seq_vbe_ctx if seq_vbe_ctx is not None else [], | ||
| frequency_counters, | ||
| ) | ||
|
|
||
| # pyre-ignore [9] | ||
| DynamicEmbeddingCollectionContext.__init__ = _demb_ctx_init_coerce_none |
Two robustness notes on this new patch:
- It's the only monkeypatch in this file living inside the blanket `try/except Exception: pass` (all others are under `if has_dynamicemb:`). If `dynamicemb.shard.embedding` moves or the patch throws against a future wheel (the dynamicemb pin was bumped in this same PR), the exception is swallowed, `has_dynamicemb` stays `False`, every dynamicemb CI test silently skips, and users get the misleading `"dynamicemb is not installed"` RuntimeError. Consider logging the swallowed exception in the `except` branch so breakage fails loudly.
- The wrapper hardcodes the 5 current parameters and forwards them positionally; if a future dynamicemb adds a context field, construction will `TypeError` at runtime. Forwarding trailing `*args, **kwargs` would make it drift-tolerant.
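Both notes can be sketched together. This is an illustration only: `FakeContext`, `install_none_coercion` and `try_enable` are hypothetical names, and the sketch assumes the context stores each constructor argument on an attribute of the same name, which is not confirmed for the real dynamicemb class.

```python
import logging

logger = logging.getLogger(__name__)

# Fields that should never be left as None (names taken from the patch above).
_LIST_FIELDS = ("sharding_contexts", "input_features", "reverse_indices", "seq_vbe_ctx")


class FakeContext:
    """Hypothetical stand-in for the patched context class."""

    def __init__(self, sharding_contexts=None, input_features=None,
                 reverse_indices=None, seq_vbe_ctx=None, frequency_counters=None):
        self.sharding_contexts = sharding_contexts
        self.input_features = input_features
        self.reverse_indices = reverse_indices
        self.seq_vbe_ctx = seq_vbe_ctx
        self.frequency_counters = frequency_counters


def install_none_coercion(cls):
    """Wrap cls.__init__ so None list-fields become [] after construction."""
    orig_init = cls.__init__

    def init_coerce_none(self, *args, **kwargs):
        # Forward everything untouched, so new upstream parameters still work.
        orig_init(self, *args, **kwargs)
        for name in _LIST_FIELDS:
            if getattr(self, name, None) is None:
                setattr(self, name, [])

    cls.__init__ = init_coerce_none
    return orig_init


def try_enable(patch_fn):
    """Run an optional patch; log the failure instead of swallowing it silently."""
    try:
        patch_fn()
        return True
    except Exception:
        logger.exception("optional patch failed; feature will be reported as unavailable")
        return False
```

Coercing after the original `__init__` runs, rather than rewriting positional arguments, is a design choice made here so that the wrapper does not need to know the parameter list.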
| pip cache purge ;; \ | ||
| esac && \ | ||
| case ${DEVICE} in \ | ||
| "cu126"|"cu129"|"cu130") pip install --force-reinstall --no-deps https://tzrec.oss-accelerate.aliyuncs.com/third_party/triton/triton-3.7.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl && \ |
The repackaged wheel (ptxas 12.9.86 swap) reuses the stock 3.7.1 version string, so pip freeze/SBOMs can't distinguish it from PyPI triton, and any later reinstall or dependency re-resolution silently reverts the ptxas fix with no version signal. The previous scheme used a distinguishing local version (3.6.0+565c08520).
Consider republishing as e.g. 3.7.1+ptxas12986 — PEP 440 local versions still satisfy the triton==3.7.1 pins in requirements/cu1*.txt — and optionally adding --hash=sha256:... here. Relatedly, those requirements files resolve stock PyPI 3.7.1, so pip-based (non-docker) installs on Hopper still hit the FAQ Q15 crash; pointing them at the repackaged wheel (as already done for faiss/dynamicemb) or adding a comment referencing FAQ Q15 would be more consistent.
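Why a local version label still satisfies the existing pins can be shown with a simplified model of the PEP 440 rule for `==`: when the pin carries no local label, the candidate's local label is ignored. The helpers below are hypothetical and skip PEP 440 normalization (pre/post releases, epochs); real resolution is done by pip, not by this code.

```python
def public_version(version):
    """Drop a PEP 440 local version label (the part after '+'), if any."""
    return version.split("+", 1)[0]


def satisfies_exact_pin(candidate, pin):
    """Simplified '==' match: a pin without a local label ignores the candidate's label."""
    if "+" in pin:
        # A pin that names a local label must match the candidate exactly.
        return candidate == pin
    return public_version(candidate) == pin


def is_repackaged(version):
    """A local label makes a rebuilt wheel distinguishable in `pip freeze` output."""
    return "+" in version
```

Under this model `satisfies_exact_pin("3.7.1+ptxas12986", "3.7.1")` holds, while `is_repackaged` separates the rebuilt wheel from the stock one.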
| wget https://hpn-driver.oss-cn-hangzhou.aliyuncs.com/nic-drivers/shuyao/drivers/ubuntu22.04-rdma-core-23.10.tar && \ | ||
| tar xf ubuntu22.04-rdma-core-23.10.tar && \ | ||
| cd ubuntu22.04-rdma-core-23.10 && \ | ||
| apt-get update && \ | ||
| echo Y | /bin/bash install.sh && \ |
This tarball comes from a bucket outside the project's own tzrec OSS bucket (the shuyao/ path segment looks like a personal directory) and its install.sh runs as root with no integrity check. Anyone who can overwrite that object gets code execution in every future image build, and the result ships to a public registry. Consider mirroring the tarball into tzrec.oss-accelerate.aliyuncs.com/third_party/ and/or verifying a pinned sha256sum before executing.
| @@ -8,7 +8,7 @@ DynamicEmbedding is a way of mapping features to IDs with zero hash collisions; compared with setting `ha | |||
|
|
|||
| ```bash | |||
| # DEVICE options: cu126/cu129 (supports Python 3.10/3.11/3.12) | |||
Stale device list: requirements/extra.txt now pins the 20260630.5dc46a2 build as cu130-only wheels, and the default tzrec-devel:1.3 image is the cu130 variant — but this comment still offers only cu126/cu129. Add cu130 (and confirm cu126/cu129 wheels of the new build actually exist on OSS before keeping them listed).
|
|
||
| - TRITON: Triton-based implementation, typically 2-3x faster than PYTORCH and using 2-3x less GPU memory | ||
| - CUTLASS: fused CUDA operator implementation based on CUTLASS; requires the fbgemm_gpu_hstu package (DEVICE may be cu126/cu129, with DEVICE_DOTTED being cu12.6/cu12.9 respectively: `pip install fbgemm_gpu_hstu==0.1.0+${DEVICE_DOTTED} -f https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/${DEVICE}/repo.html`); requires `attention_dim` to equal `hidden_dim`; supports Ampere/Ada/Hopper GPUs | ||
| - CUTLASS: fused CUDA operator implementation based on CUTLASS; requires the fbgemm_gpu_hstu package (DEVICE may be cu126/cu129: `pip install fbgemm_gpu_hstu==0.1.0+20260626.9fd44403.${DEVICE} -f https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/${DEVICE}/repo.html`); requires `attention_dim` to equal `hidden_dim`; supports Ampere/Ada/Hopper GPUs |
Same stale device list as dynamicemb.md: the version string was updated to 20260626.9fd44403, but DEVICE可选cu126/cu129 was kept while requirements/extra.txt (and CI) now use the cu130 wheel of this build. cu130 is missing from the list; verify cu126/cu129 wheels of this build exist.
| # Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the | ||
| # pure-CUTLASS NFUNC path (no Triton kernel runs). |
Stale comment: the _force_mma_v2 workaround this refers to was removed in this PR (the sibling comment in test_attn_cutlass was rewritten; this one was missed).
| # Same WGMMA-bwd workaround as test_attn_cutlass; harmless on the | |
| # pure-CUTLASS NFUNC path (no Triton kernel runs). |
| DOCKER_TAG=1.2 | ||
| DOCKER_TAG_SUFFIX= | ||
| DOCKER_TAG=1.3 | ||
| DOCKER_TAG_SUFFIX=-u3 |
DOCKER_TAG_SUFFIX=-u3 (previously empty) looks like leftover state from image-build iteration — the next person running this script will unknowingly produce 1.3-<device>-u3 tags. Consider resetting to empty after promotion, or documenting the suffix convention.
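The suffix convention in question can be modeled as follows. This is a hypothetical Python rendering of the tag arithmetic, not the contents of build_docker.sh or promote_docker.sh; `build_tag` and `promoted_tag` are illustrative names.

```python
def build_tag(base: str, device: str, suffix: str = "") -> str:
    """Compose base[-device][suffix]; an empty device means the default image."""
    parts = [base] + ([device] if device else [])
    return "-".join(parts) + suffix


def promoted_tag(test_tag: str, suffix: str) -> str:
    """Strip a build-iteration suffix (e.g. "-u3") from a test image tag."""
    if suffix and test_tag.endswith(suffix):
        return test_tag[: -len(suffix)]
    return test_tag
```

With a non-empty suffix left in the script, `build_tag("1.3", "cu129", "-u3")` yields `1.3-cu129-u3`, which is the surprise the comment above warns about; `promoted_tag` recovers the clean tag.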
| graphlearn @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/graphlearn/graphlearn-1.3.8-cp310-cp310-linux_x86_64.whl ; python_version=="3.10" | ||
| grpcio-tools<1.63.0 | ||
| numpy<2 | ||
| numpy |
Fully unconstrained numpy will silently pick up a future numpy 3.x. Given the codebase just completed the 1.x→2.x migration, an upper bound documents what's validated:
| numpy | |
| numpy<3 |
Code review summary (multi-area review)
Overall this is a well-executed release bump. The monkeypatch re-syncs ( Noteworthy items are posted as inline comments (init-value skip on the
🤖 Generated with Claude Code |
…cs to cu130
- optim/optimizer.py: apply_split_helper adopted fbgemm's preallocated host buffer as-is and skipped the init_value fill, so a host-placed momentum1 shard started at the pool contents (zeros) instead of FBGEMM_MOMENTUM1_STATE_INIT_VALUE. Fill it when use_init_value, matching the dev and non-preallocated host paths.
- setup.py: tzrec[gpu] extra -> requirements/cu130.txt (the 1.3 default image is cu130).
- ops/utils.py: clear_triton_caches has no caller after the _force_mma_v2 removal; keep it as a general Triton-cache utility and drop the DISABLE_MMA_V3-specific docstring.
- dynamicemb.md / dlrm_hstu.md: DEVICE list cu126/cu129 -> cu126/cu129/cu130 (the 5dc46a2 / 9fd44403 wheels are confirmed on OSS for all three).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015TLiqLg6erJt9urTn5Eb2U
TorchEasyRec 1.3.0 — torch 2.12.1 / torchrec 1.7.0 / fbgemm 1.7.0 / numpy 2
Upgrades the dependency stack (including the numpy-2 migration) and cuts the 1.3.0 release, adding a CUDA-13 + TensorRT image.
Dependency stack
numpy 2 migration
Drops the `numpy<2` cap. graphlearn 1.3.7 → 1.3.8 (numpy-2 sampler), pandas unconstrained, faiss 1.14.3 (cu12/cu13). Source fix: `np.string_` → `np.bytes_` (removed in numpy 2). All device images resolve and import cleanly on numpy 2 with no scipy / scikit-learn source builds.
triton 3.7.1: bundled ptxas WGMMA fix (H20 workaround removed)
Stock triton 3.7.1 bundles ptxas 12.8.93, which mis-compiles the HSTU WGMMA kernel on sm_90 (Hopper/H20) into a shared-memory OOB, an illegal memory access in `_hstu_attn_bwd` during autotuning. This was introduced when `release/3.7` downgraded ptxas 12.9.86 → 12.8.93 (commit `6c96454f2f`); it is not the triton#9514 sync bug (verified: #9514 alone does not fix it). The 1.3.0 image ships a repackaged triton 3.7.1 wheel with ptxas swapped back to 12.9.86 (only the `ptxas` binary is replaced, preserving the manylinux wheel torch's inductor is coupled to), so it is default-on, needs no `DISABLE_MMA_V3`, and keeps Hopper v3 perf.
The `DISABLE_MMA_V3` / `_force_mma_v2` test workaround is removed: the HSTU tests now run the shipped v3/WGMMA path (validated on a real H20: the four previously-gated tests pass with 0 failures / 0 errors), so `unittest_h20_ci` now guards the ptxas fix against regression. FAQ Q15 rewritten accordingly. The `DISABLE_MMA_V3` runtime knob is retained as an escape hatch.
CUDA 13 + TensorRT image
New cu130 image (also the default `tzrec-devel:1.3` / `:latest`): the CUDA 13.0 toolchain on the cu129 software stack, plus torch_tensorrt 2.12.1 / TensorRT 10.16 for TRT export and inference. cu126/cu129 stay TRT-free (torch_tensorrt 2.12 has no CUDA-12 build) and degrade gracefully (`has_tensorrt=False`, as the cpu image already does); AOTI export/predict is unaffected.
Required source changes
- `tzrec/optim/optimizer.py`: re-sync the `apply_split_helper` monkeypatch to fbgemm 1.7.0. It gained `make_persistent` / `preallocated_host_buffer` params and its internal TBE caller now passes them, so the old override raised `TypeError` on every fused-embedding table construction. The tzrec `momentum1` (TF-Adagrad) init customization is preserved.
- `tzrec/acc/aot_utils.py`: keep the pytorch#178147 inductor int-array-dedup backport (still absent in torch 2.12.1; lands in 2.13.0+).
- `tzrec/utils/dynamicemb_util.py`: restore the None→[] coercion torchrec 1.7.0 dropped for `EmbeddingCollectionContext` list-fields (dynamicemb still forwards `None`, which otherwise breaks the sharded EC forward).
`TrainPipelineContext.output_dist_embeddings_requests`, is unused by tzrec).
Build / infra
`PIP_MIRROR` default → `mirrors.aliyun.com`; cu130 `nvidia-*` / `cuda-toolkit` metadata handling; dropped the executorch metadata strip (unnecessary once numpy is unconstrained).
Images
All device images (cpu / cu126 / cu129 / cu130) are built, CI-validated on the numpy-2 stack, and promoted to `tzrec-devel:1.3*` (`:1.3` / `:latest` = cu130), digest-verified. The workflows validate against the promoted release images.
🤖 Generated with Claude Code