[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508

Open
tiankongdeguiji wants to merge 15 commits into alibaba:master from tiankongdeguiji:dynamicemb-hybrid-caching-planner

Conversation

@tiankongdeguiji (Collaborator) commented May 12, 2026

Summary

Make tzrec's embedding sharding planner take HBM and host-memory limits from the topology and decide per dynamicemb table whether to use HYBRID (HBM + DRAM hash-partitioned) or CACHING (HBM cache + full DRAM backing) mode. Per the consensus on NVIDIA/recsys-examples#306 (closed), both placement modes are kept and the planner — not the user — picks the best one under the joint budget.

  • Enumerator emits 20 ShardingOptions per dynamicemb table: { HYBRID, CACHING } × { cache_load_factor 0.1, ..., 1.0 }. The base ShardingOption is deep-copied once per variant; dynamicemb_options.caching is then flipped in-place on the already-fresh per-variant copy.
  • Storage estimator is mode-aware: HYBRID DDR = (1 − cache_load_factor) · T, CACHING DDR = T (full backing store). HBM accounting and hash-table metadata stay HBM-side only, matching the existing tzrec convention.
  • Perf model is fitted from an on-device dynamicemb sweep (experiments/sweep_20260513-161030/full_a10gpu1.json: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes — HYBRID@1.0 = 0.80 ms (HBM-only fast path), CACHING@<1.0 = 2.63 ms (~3.3x slower), HYBRID@<1.0 = 5.44 ms (~6.8x slower). Inverting the linear bw model with torchrec defaults gives x_eff bases 0.28 (CACHING) and 0.11 (HYBRID), with x_eff = 1.0 at cache_load_factor = 1.0 reflecting the runtime kernel switch. A small +0.01·x tiebreaker preserves a strict DP ranking within each block.
  • DynamicProgrammingProposer is now 2D: discretizes per-rank HBM (100 bins/dev) and per-machine DDR (50 bins/dev) and picks per-table options that minimize total perf under both budgets. HBM prune is per-device, DDR prune is per-machine (DDR is host-shared across co-located ranks). Per-axis bin count is capped at 1024 to avoid blow-up on multi-host worlds. Legacy mem_bins_per_device kwarg is preserved as an HBM alias.
  • No new proto fields — budgets come straight from Topology.hbm_cap / Topology.ddr_cap minus the existing reservation.

Test plan

  • tzrec/utils/dynamicemb_util_test.py — storage formula + empirical perf-model + direct wrapper tests (43 cases total)
  • tzrec/utils/plan_util_test.py — variant emission + 2D DP unit tests including multi-table mixed-modes, legacy mem_bins_per_device alias, empty search space, and the headline DP tests
  • CUDA-gated end-to-end enumerate + DP test (folded into plan_util_test.py)
  • pre-commit run --files <changed> clean
  • Smoke test on a real dynamicemb config end-to-end via torchrun --nproc-per-node=2 -m tzrec.train_eval ...

🤖 Generated with Claude Code

tiankongdeguiji and others added 10 commits May 12, 2026 20:28
…ING)

Extend `_calculate_dynamicemb_table_storage_specific_size` and its callers
with a `caching: bool` parameter. HYBRID keeps the original split formula
(DDR = (1 - cache_ratio) * T); CACHING accounts the full backing store on
host (DDR = T) regardless of cache_ratio. HBM accounting and metadata
(key + score + digest + bucket header) stay identical between modes.
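
A minimal sketch of that split, under simplified assumptions (illustrative helper name and value-bytes-only accounting; the real tzrec helper also counts the key/score/digest/bucket metadata on the HBM side):

    def _ddr_bytes_for_table(total_value_bytes: int, cache_ratio: float, caching: bool) -> int:
        # CACHING: host always holds the full backing store, whatever the HBM cache size.
        if caching:
            return total_value_bytes
        # HYBRID: only the cold (1 - cache_ratio) slice of the hash-partitioned table lives in DRAM.
        return int((1.0 - cache_ratio) * total_value_bytes)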

The flag is sourced from `dynamicemb_options.caching` on the ShardingOption
in `dynamicemb_calculate_shard_storages`. Subsequent commits wire this into
the enumerator (per-table sweep over both modes) and the proposer (2D DP).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `_dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats)`:
HYBRID passes the ratio through (hash-partitioned HBM hit probability);
CACHING boosts to `min(1.0, x * multiplier)`, default multiplier 10.0 via
the `TZREC_CACHING_HIT_RATE_MULTIPLIER` env var. `cache_params.stats`, when
provided, takes precedence — `1 - stats.expected_miss_rate(x)` — and is
clamped to never fall below the HYBRID ratio.
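
A rough sketch of the heuristic as described here, before the fitted-constant rewrite later in this PR (names as in the commit; stats handling simplified):

    import os

    def _dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats=None):
        if not caching:
            # HYBRID: the hash-partitioned HBM slice is the hit probability.
            return cache_load_factor
        if stats is not None:
            # Measured stats take precedence, clamped to never fall below the HYBRID ratio.
            return max(cache_load_factor, 1.0 - stats.expected_miss_rate(cache_load_factor))
        multiplier = float(os.environ.get("TZREC_CACHING_HIT_RATE_MULTIPLIER", "10.0"))
        return min(1.0, cache_load_factor * multiplier)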

Inject the boost at planning time by monkey-patching
`ShardPerfContext.build_shard_perf_contexts` to temporarily replace the
sharding option's `cache_params.load_factor` with the effective ratio.
The original cache_params is restored on return so the storage estimator
(which reads cache_load_factor for HBM/DDR sizing) still sees the
unboosted value.

Net effect: at the same `cache_load_factor`, CACHING is strictly cheaper
in perf and strictly more expensive in DDR. The DP proposer (next commit)
trades one for the other against the topology budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the per-dynamicemb-table variant emission into
`_emit_dynamicemb_variants(base, dynamicemb_options) -> List[ShardingOption]`
and sweep both modes per cache_load_factor. When `cache_params` is unset,
the helper emits 20 variants (10 factors × 2 modes); when fixed by the
caller, it emits 2 (both modes at the fixed factor). Each variant owns a
deep-copied `dynamicemb_options` so per-variant `caching` mutations stay
isolated. `cache_params.stats` is preserved across all clones for the
perf-side miss-rate override.

The downstream 2D DP proposer (next commit) selects per table the (mode,
ratio) pair that fits both HBM and host budgets while minimizing perf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite `DynamicProgrammingProposer.feedback` to discretize both per-rank
HBM and per-rank DDR into bins (default 100 × 50 per device) and pick per
table the (mode, cache_load_factor) pair that minimizes total perf while
fitting both topology budgets.

State:
  dp[table][hbm_bin][ddr_bin]      = (perf, hbm_actual, ddr_actual)
  backtrack[table][hbm_bin][ddr_bin] = (opt_id, prev_hbm_bin, prev_ddr_bin)

Options whose largest shard exceeds either per-device cap are pruned
upfront. Backtracking enumerates feasible (hbm_total, ddr_total) cells in
decreasing total memory, yielding Pareto-optimal proposals. The legacy
`mem_bins_per_device` kwarg is preserved as an alias for `hbm_bins_per_device`
so existing callers still work, and HBM-degenerate (CPU-only) or
DDR-degenerate topologies collapse the unused axis to a single bin.

Adds `DynamicProgrammingProposer2DTest`:
  - generous DDR → CACHING wins (cheaper perf, host can fit T)
  - tight DDR → CACHING rejected, falls back to HYBRID
  - high cache_load_factor → modes converge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercise the full planning pipeline on a tiny TestSparseNN backed by a
DynamicEmbParameterConstraints constraint. Asserts:
  - EmbeddingEnumerator yields 20 sharding options (2 modes × 10 factors)
  - All cache_load_factors and both caching modes are represented
  - Storage and perf estimators populate non-zero values for each option
  - DynamicProgrammingProposer.propose() returns feasible plans whose
    dynamicemb options carry a valid caching flag

Gated on has_dynamicemb + torch.cuda.is_available() since the dynamicemb
sharder requires CUDA at planning time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper

(1) Restore the byte-budget framing in `_calculate_dynamicemb_table_storage_specific_size`
docstring (`total_value_memory`, `num_buckets`, `hbm_budget`, `ddr_budget`)
that was lost in the PR alibaba#508 rewrite, extending `ddr_budget` to cover both
HYBRID and CACHING modes in one place.

(2) Simplify `_dynamicemb_aware_build_shard_perf_contexts`: single return,
no try/finally, no early-return branch. The unconditional
`sharding_option.cache_params = original_cache_params` line at the end is a
no-op when caching is False (we never mutated the field) and restores the
original cache_params when caching is True. If the wrapped call raises,
planning aborts and the option is discarded -- no need for try/finally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDR is host-shared across ranks co-located on one machine, so the
per-option fit check in `DynamicProgrammingProposer.feedback` should
compare `max_shard_ddr` against the largest machine's total DDR pool, not
against any single rank's slice. HBM stays per-device (each GPU has its
own HBM; no cross-rank sharing).

Compute `max_machine_ddr` by summing per-device DDR within each contiguous
`local_world_size`-sized window. Update the test fake-topology fixture to
carry `local_world_size` (defaults to num_devices).
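
A small sketch of that windowed sum (illustrative standalone function; the real code reads these values off the torchrec Topology object):

    def _max_machine_ddr(ddr_per_device: list, local_world_size: int) -> int:
        # DDR is pooled per host: sum each contiguous local_world_size-sized
        # window of ranks and take the largest machine's pool.
        return max(
            sum(ddr_per_device[i : i + local_world_size])
            for i in range(0, len(ddr_per_device), local_world_size)
        )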

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… copyright

(5) Move `PlanUtilDynamicEmbE2ETest` from the standalone
`tzrec/utils/plan_util_dynamicemb_e2e_test.py` into `plan_util_test.py`
where the rest of the plan_util tests already live. Delete the standalone
file.

(6) Rename `DynamicProgrammingProposer2DTest` -> `DynamicProgrammingProposerTest`.
The proposer class itself stays `DynamicProgrammingProposer`.

(7) Bump copyright `2024 -> 2026` in the new file
`tzrec/utils/dynamicemb_util_test.py`. The header in
plan_util_test.py stays at 2024 since it was a pre-existing file we only
extended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urve

Replace the heuristic ``_dynamicemb_effective_cache_ratio`` curve with
constants fitted from an on-device perf sweep
(``experiments/sweep_20260513-161030/full_a10gpu1.json``: 4M-row table,
dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency
clustered into three regimes:

  HYBRID @ x=1.0:    0.80 ms   (HBM-only fast path)
  CACHING @ x<1.0:   2.63 ms   (~3.3x slower)
  HYBRID  @ x<1.0:   5.44 ms   (~6.8x slower)

Inverting the linear bw model bw = x_eff*HBM + (1-x_eff)*HBM_TO_DDR
(torchrec defaults 897 / 32 GB/s) gives x_eff bases 0.28 (CACHING) and
0.11 (HYBRID). At x = 1.0 the runtime drops the host tier in both modes,
so x_eff = 1.0 there. A +0.01*x tiebreaker term keeps the DP's ranking
strict within each block.
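
A short sketch of that inversion, reproducing the fitted bases from the measured slowdowns (constants as quoted above; function name is illustrative only):

    HBM_BW = 897.0        # GB/s, torchrec default device bandwidth
    HBM_TO_DDR_BW = 32.0  # GB/s, torchrec default host-link bandwidth

    def invert_x_eff(slowdown_vs_hbm_only: float) -> float:
        # bw = x_eff * HBM_BW + (1 - x_eff) * HBM_TO_DDR_BW, solved for x_eff,
        # with the measured bandwidth taken as HBM_BW / slowdown.
        bw = HBM_BW / slowdown_vs_hbm_only
        return (bw - HBM_TO_DDR_BW) / (HBM_BW - HBM_TO_DDR_BW)

    invert_x_eff(2.63 / 0.80)  # ~0.28 -> CACHING base
    invert_x_eff(5.44 / 0.80)  # ~0.11 -> HYBRID base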

Also extend the ``_dynamicemb_aware_build_shard_perf_contexts`` wrapper
to apply the empirical x_eff for HYBRID (previously HYBRID used raw
``cache_load_factor`` as x_eff -- inconsistent with the empirical data
that shows HYBRID @ 0.9 is much slower than HYBRID @ 1.0 due to the
storage backend switch).

Drop the now-unused ``TZREC_CACHING_HIT_RATE_MULTIPLIER`` env knob and
its related tests; rewrite ``EffectiveCacheRatioTest`` to assert the
empirical strict-block ranking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The same provenance is already covered in
``_dynamicemb_effective_cache_ratio``'s docstring below; the block
comment above the constants was redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tiankongdeguiji added the claude-review (Let Claude Review) label May 13, 2026
@github-actions bot removed the claude-review (Let Claude Review) label May 13, 2026
Comment thread tzrec/utils/dynamicemb_util.py Outdated
Comment on lines +515 to +527
result = _orig_build_shard_perf_contexts(
    cls,
    config,
    shard_sizes,
    sharding_option,
    topology,
    constraints,
    sharder,
    *args,
    **kwargs,
)
sharding_option.cache_params = original_cache_params
return result

Restore should be in finally. If _orig_build_shard_perf_contexts raises, sharding_option.cache_params is permanently left as the boosted x_eff clone. Downstream consumers — the storage estimator and _to_sharding_plan at line 435 (cache_params=sharding_option.cache_params) — then see the wrong load factor on every other option that touches the same ShardingOption. Suggested:

Suggested change

Before:
    result = _orig_build_shard_perf_contexts(
        cls,
        config,
        shard_sizes,
        sharding_option,
        topology,
        constraints,
        sharder,
        *args,
        **kwargs,
    )
    sharding_option.cache_params = original_cache_params
    return result

After:
    try:
        result = _orig_build_shard_perf_contexts(
            cls,
            config,
            shard_sizes,
            sharding_option,
            topology,
            constraints,
            sharder,
            *args,
            **kwargs,
        )
    finally:
        sharding_option.cache_params = original_cache_params
    return result

Comment thread tzrec/utils/plan_util.py
Comment on lines +846 to +853
for caching_mode in (False, True):
    for load_factor in load_factors:
        opt = copy.deepcopy(base_option)
        opt.cache_params = CacheParams(load_factor=load_factor, stats=stats)
        # pyre-ignore [16]
        opt.dynamicemb_options = copy.deepcopy(dynamicemb_options)
        opt.dynamicemb_options.caching = caching_mode
        variants.append(opt)

Redundant double deep-copy of dynamicemb_options. The caller (line 1035) attaches dynamicemb_options to base_option before calling this function, so copy.deepcopy(base_option) on line 848 already deep-copies dynamicemb_options into opt. Line 851 then deep-copies the original again and replaces it. With 20 variants per table this doubles a non-trivial copy. Either drop line 851 and just set opt.dynamicemb_options.caching = caching_mode, or stop attaching dynamicemb_options on the caller side before this call.

Comment thread tzrec/utils/plan_util.py
Given :math:`M` tables each with up to :math:`N` ShardingOptions, pick one
option per table to minimize total perf while respecting both per-rank HBM
and per-rank DDR budgets from the topology.


Docstring says "per-rank DDR" but the implementation enforces per-machine DDR (lines 349–361 sum DDR across local_world_size co-located ranks and the per-option prune at line 391 compares against max_machine_ddr). The inline comment at 350–352 has the right framing — the class docstring should match: "per-rank HBM and per-machine DDR budgets."

Comment on lines +86 to +89
if x >= 1.0:
    return 1.0
base = _DYNAMICEMB_CACHING_X_EFF_BASE if caching else _DYNAMICEMB_HYBRID_X_EFF_BASE
return base + _DYNAMICEMB_X_EFF_TIEBREAK * x

Sharp discontinuity at x=1.0. HYBRID@x=0.99 → 0.1199; HYBRID@x=1.00 → 1.0 — an ~8× jump for a 1% ratio change. The DP will reliably prefer x=1.0 over x=0.99 by a huge perf margin, then prefer CACHING@x=0.5 (0.285) over HYBRID@x=0.9 (0.119) even on workloads where HYBRID@0.9 is plainly faster in reality. If the empirical sweep really shows a step at x=1.0 because the runtime drops the host tier, please call that out as "x=1.0 = HBM-only kernel, not the same algorithm as x=0.99" — and consider whether the enumerator should emit a discrete mode={HBM_ONLY, HYBRID, CACHING} axis rather than packing the discontinuity into the same x knob.

Comment thread tzrec/utils/plan_util.py
Comment on lines +434 to +463
for table_i in range(1, table_count):
    prev_dp = dp[table_i - 1]
    cur_dp = dp[table_i]
    cur_back = backtrack[table_i]
    hbm_t = hbm_by_fqn[table_i]
    ddr_t = ddr_by_fqn[table_i]
    perf_t = perf_by_fqn[table_i]
    for h in range(hbm_bins):
        prev_dp_h = prev_dp[h]
        for d in range(ddr_bins):
            prev_state = prev_dp_h[d]
            prev_perf = prev_state[0]
            if prev_perf == _INF:
                continue
            prev_h_val = prev_state[1]
            prev_d_val = prev_state[2]
            for opt_j in range(option_count):
                delta_perf = perf_t[opt_j]
                if delta_perf == _INF:
                    continue
                new_h = prev_h_val + hbm_t[opt_j]
                new_d = prev_d_val + ddr_t[opt_j]
                if new_h >= hbm_bins or new_d >= ddr_bins:
                    continue
                new_h_i = int(new_h)
                new_d_i = int(new_d)
                new_perf = prev_perf + delta_perf
                if cur_dp[new_h_i][new_d_i][0] > new_perf:
                    cur_dp[new_h_i][new_d_i] = (new_perf, new_h, new_d)
                    cur_back[new_h_i][new_d_i] = (opt_j, h, d)

2D DP cost scales with world size — consider clamping total bins. hbm_bins = 100 · num_devices and ddr_bins = 50 · num_devices. On a 16-GPU world the inner triple-loop is 1600 × 800 × option_count pure-Python iterations per table; with 100 dynamicemb tables (20 options each) that's a few-billion-op DP, and dp + backtrack allocate ~256M tuples. Even on smaller worlds the bin counts multiply faster than necessary — a per-device discretization isn't really needed since the budget is a single scalar per axis. Consider clamping totals (e.g. hbm_bins = min(hbm_bins_per_device · num_devices, 2_000)) or vectorizing the transition with NumPy. The defaults are fine for 2–4 GPUs but worth a guardrail before the next user hits this.

Comment on lines +241 to +309
class DynamicProgrammingProposerTest(unittest.TestCase):
    """2D DP across HBM × DDR picks per-table mode under joint budgets."""

    def _run(self, search_space, topology):
        proposer = DynamicProgrammingProposer(
            hbm_bins_per_device=20, ddr_bins_per_device=20
        )
        proposer.load(search_space)
        # First propose returns the lowest-mem-per-table seed.
        proposer.propose()
        proposer.feedback(partitionable=True, storage_constraint=topology)
        proposals = []
        proposal = proposer.propose()
        while proposal:
            proposals.append(proposal)
            proposer.feedback(partitionable=True, storage_constraint=topology)
            proposal = proposer.propose()
        return proposals

    def test_caching_preferred_when_ddr_is_generous(self):
        # Three options for one table:
        #   HYBRID  @ x=1.0: hbm = T,   ddr = 0,   perf = high (HBM-only)
        #   HYBRID  @ x=0.1: hbm = .1T, ddr = .9T, perf = high (slow)
        #   CACHING @ x=0.1: hbm = .1T, ddr = T,   perf = low (fast — modeled hits)
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
        ]
        topology = _make_topology(
            num_devices=2, hbm_per_device=2000, ddr_per_device=2000
        )
        proposals = self._run(opts, topology)
        # Best plan must be the CACHING option (perf=10).
        best = min(proposals, key=lambda p: sum(o.total_perf for o in p))
        self.assertEqual(best[0].total_perf, 10.0)

    def test_caching_rejected_when_ddr_is_tight(self):
        # Host budget is only 950 — CACHING (ddr=1000) cannot fit; HYBRID can.
        opts = [
            _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
        ]
        topology = _make_topology(
            num_devices=1, hbm_per_device=2000, ddr_per_device=950
        )
        proposals = self._run(opts, topology)
        # Every proposed plan must pick the HYBRID option (perf=80).
        for p in proposals:
            self.assertEqual(p[0].total_perf, 80.0)

    def test_high_factor_collapses_modes(self):
        # At x=1.0 HYBRID == CACHING in HBM and CACHING.ddr = T = HYBRID.hbm.
        # If we offer just the high-factor options, DP picks one of them.
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),  # HYBRID x=1.0
            _FakeShardingOption(
                "table_a", hbm=1000, ddr=1000, perf=50.0
            ),  # CACHING x=1.0
        ]
        topology = _make_topology(
            num_devices=1, hbm_per_device=1100, ddr_per_device=2000
        )
        proposals = self._run(opts, topology)
        # Either option is fine — they're tied. Just verify the proposer
        # returned something feasible.
        self.assertGreater(len(proposals), 0)
        for p in proposals:
            self.assertEqual(p[0].total_perf, 50.0)

Missing the headline multi-table 2D-DP scenario. All three DynamicProgrammingProposerTest cases use a single table, so the cross-table DP transition at plan_util.py:434–463 is exercised only at table_i == 0. The PR's value prop is "pick CACHING for one table and HYBRID for another under joint HBM+DDR budgets" — a 2-table fixture where global DDR can only hold one CACHING but plenty of HBM is available would directly cover that. Same gap for the mem_bins_per_device legacy alias (untested) and the all-options-pruned / empty-search-space degenerate paths.
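
One possible shape for that fixture, sketched in the style of the tests above with purely illustrative numbers (a follow-up commit in this thread adds a real version of this test, which may differ in detail):

    def test_mixed_modes_under_joint_budget(self):
        # Per table: HYBRID @ x=1.0 (hbm=T, ddr=0) vs CACHING @ x=0.1 (hbm=.1T, ddr=T).
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
            _FakeShardingOption("table_b", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_b", hbm=100, ddr=1000, perf=10.0),
        ]
        # Both-HYBRID busts HBM (2000 > 1500); both-CACHING busts DDR (2000 > 1500).
        topology = _make_topology(num_devices=1, hbm_per_device=1500, ddr_per_device=1500)
        proposals = self._run(opts, topology)
        best = min(proposals, key=lambda p: sum(o.total_perf for o in p))
        # The only feasible mix pairs one CACHING table with one HYBRID table.
        self.assertEqual(sorted(o.total_perf for o in best), [10.0, 50.0])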

only_values=only_values,
bucket_capacity=self.BUCKET_CAPACITY,
caching=caching,
)

Direct unit tests for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts are missing. Both have non-trivial branches — admission_counter HBM accounting (dynamicemb_util.py:663–672), the compute_device == "cpu" and not is_inference DDR branch (line 697), and the cache_params swap+restore (lines 502–514) — that are reachable only via the E2E test, which is gated on has_dynamicemb and CUDA. On a CPU dev box none of this is covered; a faked _orig_build_shard_perf_contexts would let you verify the boost is applied and (more importantly) that the original cache_params is restored even when the wrapped call raises.
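
A sketch of what that restore-on-exception check could look like (patch target, fixture name, and argument shapes are guesses from the snippet quoted above, not the real test):

    from unittest import mock

    def test_cache_params_restored_on_exception(self):
        option = _fake_dynamicemb_sharding_option(cache_load_factor=0.2)  # hypothetical fixture
        original = option.cache_params

        def raising_orig(*args, **kwargs):  # stand-in for the wrapped torchrec call
            raise RuntimeError("boom")

        with mock.patch.object(dynamicemb_util, "_orig_build_shard_perf_contexts", raising_orig):
            with self.assertRaises(RuntimeError):
                dynamicemb_util._dynamicemb_aware_build_shard_perf_contexts(
                    None, None, None, option, None, None, None
                )
        # The boosted x_eff clone must not leak back onto the option.
        self.assertIs(option.cache_params, original)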

@github-actions

Review summary

Solid PR overall — the 2D DP rewrite is clean, the mode-aware storage formula is well-tested, and the empirical perf-model docstring is refreshingly honest about its provenance. A few things worth addressing before merge (inline above):

Should fix

  • dynamicemb_util.py:515-527 — missing try/finally around the cache_params swap; an exception in the wrapped estimator permanently corrupts the option's cache load factor.
  • dynamicemb_util.py:56-89 — the x_eff step from 0.12 → 1.0 at x=1.00 means HYBRID@0.99 and CACHING@0.5 both look much worse than HYBRID@1.0 by construction. If x=1.0 is genuinely a different code path (HBM-only kernel), modeling it as a separate mode axis is cleaner than a discontinuity in the same knob.
  • plan_util.py:846-853 — redundant double-deepcopy of dynamicemb_options (20×/table).

Worth a follow-up

  • plan_util.py:434-463 — 2D DP state blows up on multi-node worlds (bins scale with world_size); consider an absolute cap on total bins, or vectorize with NumPy.
  • Test gap on the multi-table DP case where one table wants CACHING and another HYBRID under a shared DDR budget — that's the headline scenario and currently isn't covered by a unit test. Same for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts (only reachable via the CUDA-gated E2E test) and the mem_bins_per_device legacy alias.

Docs

  • DynamicProgrammingProposer docstring says "per-rank DDR" but the implementation budgets DDR per-machine — worth aligning the wording with the inline comment on lines 350-352.
  • dynamicemb_calculate_shard_storages Args block (not in diff, around line 620): the caching_ratio description still reads "ratio of HBM to DDR memory for UVM caching" — under the new CACHING semantics that's stale; host always holds the full backing store and caching_ratio only sizes the HBM hot-row cache.
  • PR description mentions TZREC_CACHING_HIT_RATE_MULTIPLIER and x_eff = min(1.0, x · multiplier); the code uses fitted constants instead (no env var). Worth updating the description so it doesn't mislead future readers.

No security/multi-process concerns: the monkey-patches run under the import lock and each rank is its own process.

tiankongdeguiji and others added 5 commits May 13, 2026 17:02
`copy.deepcopy(base_option)` on the first line of each variant body
already copies `dynamicemb_options` because the caller in `enumerate()`
attaches it onto `base_option` before invoking `_emit_dynamicemb_variants`.
The second `copy.deepcopy(dynamicemb_options)` was wasted work — at 20
variants per dynamicemb table it doubled a non-trivial copy. Drop it and
mutate the already-fresh per-variant `dynamicemb_options.caching` directly.

PR alibaba#508 review R3 (github-actions, 2026-05-13).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… blowup

The 2D DP discretizes each memory axis into `bins_per_device *
num_devices` bins. On a 16-GPU world that becomes 1600 hbm bins × 800
ddr bins = 1.28M cells per table; with 100 dynamicemb tables and 20
options each, the inner Python loop is a few-billion-op DP and dp +
backtrack tuples consume hundreds of MB. The HBM and DDR budgets are
scalar, so multiplying bins by num_devices stops adding meaningful
precision past a point. Clamp the per-axis bin count at `_DP_AXIS_BIN_CAP`
(1024) to keep the multi-host case tractable.
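
A sketch of the clamp using the constant named above (the exact call-site shape in plan_util.py may differ):

    _DP_AXIS_BIN_CAP = 1024  # hard ceiling per memory axis, independent of world size

    hbm_bins = min(hbm_bins_per_device * num_devices, _DP_AXIS_BIN_CAP)
    ddr_bins = min(ddr_bins_per_device * num_devices, _DP_AXIS_BIN_CAP)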

NumPy vectorization of the inner DP loop is a bigger refactor; leaving a
TODO referencing PR alibaba#508 review R5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

* DynamicProgrammingProposer class docstring: "per-rank DDR" -> "per-rank
  HBM and per-machine DDR" so it matches the machine-DDR prune introduced
  in commit 38bd7ad (PR alibaba#508 review R4).

* _dynamicemb_effective_cache_ratio: call out that the discontinuity at
  x=1.0 is a deliberate kernel switch (HBM-only DynamicEmbStorage vs
  dual-tier HybridStorage/DynamicEmbCache), not a smoothing artifact. Note
  a future refactor could lift this to a discrete `mode` axis on the
  enumerator (PR alibaba#508 review R2).

* dynamicemb_calculate_shard_storages caching_ratio Args: rewrite the
  stale "ratio of HBM to DDR memory for UVM caching" framing -- under the
  dynamicemb modes, host always holds the full backing in CACHING, and
  caching_ratio sizes the HBM hot-row cache (top-level docs note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch space

Add three DynamicProgrammingProposerTest methods that the original
PR alibaba#508 review (R6) flagged as missing:

  test_two_tables_pick_mixed_modes_under_joint_budget
    Two tables under a topology where both-HYBRID is HBM-infeasible and
    both-CACHING is DDR-infeasible. The DP must pick one of each. This
    exercises the cross-table transition at plan_util.py table_i == 1
    which the existing single-table cases never reach.

  test_legacy_mem_bins_per_device_alias
    Verifies the back-compat kwarg `mem_bins_per_device` still wins over
    the new `hbm_bins_per_device` when both are passed.

  test_empty_search_space_returns_empty_proposal
    Edge case where `load([])` short-circuits via `table_count == 0`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper tests

Combine PR alibaba#508 review R1 + R7. The new
``BuildShardPerfContextsWrapperTest.test_cache_params_restored_on_exception``
exists to pin the bug R1 describes: when ``_orig_build_shard_perf_contexts``
raises, the boosted ``x_eff`` clone of ``sharding_option.cache_params``
must not leak to the downstream storage estimator's view of the same
option. The corresponding ``try/finally`` around the wrapped call closes
the leak. Reverts the previous "single return, no try/finally"
simplification.

Also adds:

* ``BuildShardPerfContextsWrapperTest`` — direct unit tests for the
  ShardPerfContext.build_shard_perf_contexts monkey-patch covering:
  boost applied for caching, boost applied for hybrid, no boost when the
  option has no dynamicemb_options, cache_params restored on success,
  and the restore-on-exception case driving R1.

* ``DynamicEmbCalcShardStoragesTest.test_admission_counter_increases_hbm``
  — direct cover for the admission_counter HBM accounting branch in
  ``dynamicemb_calculate_shard_storages`` that the CUDA-gated E2E test
  doesn't exercise on a CPU dev box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>