Skip to content

[bugfix]dynemb export supports input_tile=3#495

Open
eric-gecheng wants to merge 1 commit into
alibaba:masterfrom
eric-gecheng:bugfix/dynemb_input_tile_3
Open

[bugfix]dynemb export supports input_tile=3#495
eric-gecheng wants to merge 1 commit into
alibaba:masterfrom
eric-gecheng:bugfix/dynemb_input_tile_3

Conversation

@eric-gecheng
Copy link
Copy Markdown
Collaborator

No description provided.

@eric-gecheng eric-gecheng changed the title dynemb export supports input_tile=3 [bugfix]dynemb export supports input_tile=3 May 5, 2026
@eric-gecheng eric-gecheng added the claude-review Let Claude Review label May 5, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label May 5, 2026
# so DynamicEmbLoad raises "can't find path to load" for
# the user-side ones. Symlink the missing twin paths.
input_tile = os.environ.get("INPUT_TILE", "")
if input_tile.startswith("3"):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider reusing is_input_tile_emb() from tzrec/acc/utils.py (already used in tzrec/utils/export_util.py) instead of re-deriving the INPUT_TILE=3 check inline. Keeps the env-var convention (input_tile[0] == "3") in one place and avoids drift if it changes.

Comment on lines +469 to +484
for i, seg in enumerate(segs):
if seg not in _INPUT_TILE_USER_SEGMENTS:
continue
user_segs = list(segs)
user_segs[i] = f"{seg}_user"
user_entry = ".".join(user_segs)
user_path = os.path.join(dynamicemb_path, user_entry)
if os.path.lexists(user_path):
continue
try:
os.symlink(entry, user_path)
logger.info(
f"created INPUT_TILE=3 dynamicemb symlink {user_entry} -> {entry}"
)
except OSError as e:
logger.warning(f"failed to create dynamicemb symlink {user_entry}: {e}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor defensive nit: if a single dotted entry ever contained more than one segment in _INPUT_TILE_USER_SEGMENTS, this loop would create N separate symlinks (each renaming a different segment). In practice the construction in EmbeddingGroupImpl / SequenceEmbeddingGroupImpl produces only one such segment per path, so this is theoretical — but adding a break after a successful (or skipped-because-exists) os.symlink would make the intent explicit and match the construction-side rename.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Code review summary

Targeted bugfix for the INPUT_TILE=3 export path: the checkpoint only stores non-_user dynamicemb dirs, but the exported model expects *_user twins, so loading raised "can't find path to load". Symlinking the user-side paths on rank 0 (with a barrier) is a clean, minimal fix.

Noteworthy:

  • Reuse is_input_tile_emb() instead of inlining the env check (see inline comment).
  • Inner-loop break would make the single-segment intent explicit (see inline comment).
  • Test coverage gap: _link_dynamicemb_input_tile_user_paths is purely filesystem ops and trivially unit-testable, but no unit test exists. The existing restore_model tests in tzrec/utils/checkpoint_util_test.py don't exercise the has_dynamicemb branch at all, and the INPUT_TILE=3 integration tests in tzrec/tests/rank_integration_test.py never combine with dynamicemb. A small unit test covering: matching segment → symlink created, non-matching entries skipped, idempotency under existing symlink, and OSError swallow path would lock in the contract.

Verified accurate:

  • Code references in the docstring/comments match EmbeddingGroupImpl.ebc_user/mc_ebc_user and SequenceEmbeddingGroupImpl.ec_dict_user/mc_ec_dict_user.
  • No user-facing docs claim "dynemb does not support INPUT_TILE=3", so no doc updates needed.
  • Rank-0 + dist.barrier() pattern is consistent with existing checkpoint-restore code; performance impact is negligible (cold-start, O(few dozen entries)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant