Skip to content

[feat] distributed embedding export #525

Open
eric-gecheng wants to merge 39 commits into
alibaba:masterfrom
eric-gecheng:feature/export_split
Open

[feat] distributed embedding export #525
eric-gecheng wants to merge 39 commits into
alibaba:masterfrom
eric-gecheng:feature/export_split

Conversation

@eric-gecheng

Copy link
Copy Markdown
Collaborator

No description provided.

@eric-gecheng eric-gecheng added the claude-review Let Claude Review label Jun 9, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jun 9, 2026
Comment thread tzrec/modules/embedding.py
Comment thread tzrec/utils/export_util.py
Comment thread tzrec/utils/export_util.py Outdated
Comment thread tzrec/utils/export_util.py Outdated
Comment thread tzrec/utils/export_util.py
Comment thread tzrec/modules/embedding.py Outdated
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review summary — distributed embedding export

Solid, well-documented feature. The dynamicemb shard read path has genuinely good input validation (shape/count cross-checks, clear errors), and the per-rank temp symlink-view + try/finally cleanup in restore_model is the right pattern (no mutation of a possibly read-only checkpoint). Inline comments cover the specific findings; a few cross-cutting notes below.

Correctness (see inline)

  • embedding.py: shared_name built from a stale emb_config (diverges from EmbeddingGroupImpl), plus a raw-name vs shared_name keying mismatch in the new _forward_impl filter that can drop shared sequence sparse features. Please confirm shared-feature handling and add a regression test.
  • _forward_impl now does O(N²) list-membership work on every eager train/eval/predict step — use sets.
  • export_distributed_embedding: unguarded os.makedirs race across ranks; data_config mutated without deepcopy.

Test coverage — ~1030 new lines, 1 new test. Several pure/near-pure helpers are CPU-testable and currently uncovered: _merge_sharded_embedding_json (incl. the new ValueError consistency branches), ensure_input_tile_for_distributed_embedding (3-way env branching), the _INPUT_TILE_USER_REPLACEMENTS remap in create_local_plan, and _make_dynamicemb_input_tile_user_view (symlink aliasing — the contract the realpath-dedup depends on). The one added test's pooling="SUM" stub passes only by coincidence of str(...).split(".")[-1]; use a real config/enum.

Consistency / cleanup

  • env_util.use_distributed_embedding reads a lowercase env var use_distributed_embedding; every sibling uses UPPER_SNAKE (USE_RTP, INPUT_TILE, …). Recommend USE_DISTRIBUTED_EMBEDDING.
  • Leftover commented-out / dead code in the new functions (e.g. # local_rank, # np.savez(...), several # shard_offsets/# pooling lines, unused nested _seq_len_name/_seq_feat_name, additional_fg always empty); import tempfile mid-function — move to module top.
  • The dedup-by-realpath block in _get_sparse_embedding_tensor/_get_rtp_embedding_tensor globs the checkpoint dynamicemb/ dir, but the ebc_user symlinks it references are only created in the separate temp view used by DynamicEmbLoad — so it appears to never trigger there. Worth confirming the intent (or the comment is misleading).

Docs — the new use_distributed_embedding mode / env var isn't documented in docs/source/usage/export.md (which lists INPUT_TILE, ENABLE_AOT, etc.); fx_mark_keyed_tensor's new is_dense arg is undocumented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant