[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508
tiankongdeguiji wants to merge 15 commits
Conversation
…ING) Extend `_calculate_dynamicemb_table_storage_specific_size` and its callers with a `caching: bool` parameter. HYBRID keeps the original split formula (DDR = (1 - cache_ratio) * T); CACHING accounts the full backing store on host (DDR = T) regardless of cache_ratio. HBM accounting and metadata (key + score + digest + bucket header) stay identical between modes. The flag is sourced from `dynamicemb_options.caching` on the ShardingOption in `dynamicemb_calculate_shard_storages`. Subsequent commits wire this into the enumerator (per-table sweep over both modes) and the proposer (2D DP). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
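The mode split described above can be sketched as a tiny helper (the name and byte-level granularity are illustrative, not the actual `_calculate_dynamicemb_table_storage_specific_size` signature):

```python
def dynamicemb_ddr_bytes(total_value_bytes: int, cache_ratio: float,
                         caching: bool) -> int:
    """Host-side (DDR) bytes for one dynamicemb table.

    HYBRID keeps the split formula DDR = (1 - cache_ratio) * T;
    CACHING accounts the full backing store on host regardless of cache_ratio.
    """
    if caching:
        return total_value_bytes
    return int((1.0 - cache_ratio) * total_value_bytes)
```

At `cache_ratio = 0.3` and `T = 1000` bytes, HYBRID keeps 700 bytes on host while CACHING keeps all 1000; the HBM side is sized identically in both modes.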
Add `_dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats)`: HYBRID passes the ratio through (hash-partitioned HBM hit probability); CACHING boosts to `min(1.0, x * multiplier)`, default multiplier 10.0 via the `TZREC_CACHING_HIT_RATE_MULTIPLIER` env var. `cache_params.stats`, when provided, takes precedence — `1 - stats.expected_miss_rate(x)` — and is clamped to never fall below the HYBRID ratio. Inject the boost at planning time by monkey-patching `ShardPerfContext.build_shard_perf_contexts` to temporarily replace the sharding option's `cache_params.load_factor` with the effective ratio. The original cache_params is restored on return so the storage estimator (which reads cache_load_factor for HBM/DDR sizing) still sees the unboosted value. Net effect: at the same `cache_load_factor`, CACHING is strictly cheaper in perf and strictly more expensive in DDR. The DP proposer (next commit) trades one for the other against the topology budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
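A minimal sketch of the boost this commit describes (illustrative, not the tzrec code; a later commit in this PR replaces the multiplier heuristic with empirically fitted constants):

```python
import os

def effective_cache_ratio(x, caching, expected_miss_rate=None):
    """HYBRID passes x through; CACHING boosts it.

    A stats-provided miss-rate curve takes precedence and is clamped so the
    result never falls below the HYBRID ratio; otherwise the
    TZREC_CACHING_HIT_RATE_MULTIPLIER knob (default 10.0) applies, capped at 1.0.
    """
    if not caching:
        return x
    if expected_miss_rate is not None:
        return max(x, 1.0 - expected_miss_rate(x))
    mult = float(os.environ.get("TZREC_CACHING_HIT_RATE_MULTIPLIER", "10.0"))
    return min(1.0, x * mult)
```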
Extract the per-dynamicemb-table variant emission into `_emit_dynamicemb_variants(base, dynamicemb_options) -> List[ShardingOption]` and sweep both modes per cache_load_factor. When `cache_params` is unset, the helper emits 20 variants (10 factors × 2 modes); when fixed by the caller, it emits 2 (both modes at the fixed factor). Each variant owns a deep-copied `dynamicemb_options` so per-variant `caching` mutations stay isolated. `cache_params.stats` is preserved across all clones for the perf-side miss-rate override. The downstream 2D DP proposer (next commit) selects per table the (mode, ratio) pair that fits both HBM and host budgets while minimizing perf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite `DynamicProgrammingProposer.feedback` to discretize both per-rank HBM and per-rank DDR into bins (default 100 × 50 per device) and pick per table the (mode, cache_load_factor) pair that minimizes total perf while fitting both topology budgets. State: dp[table][hbm_bin][ddr_bin] = (perf, hbm_actual, ddr_actual) backtrack[table][hbm_bin][ddr_bin] = (opt_id, prev_hbm_bin, prev_ddr_bin) Options whose largest shard exceeds either per-device cap are pruned upfront. Backtracking enumerates feasible (hbm_total, ddr_total) cells in decreasing total memory, yielding Pareto-optimal proposals. The legacy `mem_bins_per_device` kwarg is preserved as an alias for `hbm_bins_per_device` so existing callers still work, and HBM-degenerate (CPU-only) or DDR-degenerate topologies collapse the unused axis to a single bin. Adds `DynamicProgrammingProposer2DTest`: - generous DDR → CACHING wins (cheaper perf, host can fit T) - tight DDR → CACHING rejected, falls back to HYBRID - high cache_load_factor → modes converge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercise the full planning pipeline on a tiny TestSparseNN backed by a
DynamicEmbParameterConstraints constraint. Asserts:
- EmbeddingEnumerator yields 20 sharding options (2 modes × 10 factors)
- All cache_load_factors and both caching modes are represented
- Storage and perf estimators populate non-zero values for each option
- DynamicProgrammingProposer.propose() returns feasible plans whose
dynamicemb options carry a valid caching flag
Gated on has_dynamicemb + torch.cuda.is_available() since the dynamicemb
sharder requires CUDA at planning time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper (1) Restore the byte-budget framing in `_calculate_dynamicemb_table_storage_specific_size` docstring (`total_value_memory`, `num_buckets`, `hbm_budget`, `ddr_budget`) that was lost in the PR alibaba#508 rewrite, extending `ddr_budget` to cover both HYBRID and CACHING modes in one place. (2) Simplify `_dynamicemb_aware_build_shard_perf_contexts`: single return, no try/finally, no early-return branch. The unconditional `sharding_option.cache_params = original_cache_params` line at the end is a no-op when caching is False (we never mutated the field) and restores the original cache_params when caching is True. If the wrapped call raises, planning aborts and the option is discarded -- no need for try/finally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDR is host-shared across ranks co-located on one machine, so the per-option fit check in `DynamicProgrammingProposer.feedback` should compare `max_shard_ddr` against the largest machine's total DDR pool, not against any single rank's slice. HBM stays per-device (each GPU has its own HBM; no cross-rank sharing). Compute `max_machine_ddr` by summing per-device DDR within each contiguous `local_world_size`-sized window. Update the test fake-topology fixture to carry `local_world_size` (defaults to num_devices). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
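The window sum described above can be sketched as (hypothetical helper name; the real logic lives inside `DynamicProgrammingProposer.feedback`):

```python
def max_machine_ddr(ddr_per_device, local_world_size):
    """Max host-DDR pool over machines, where each machine is one
    contiguous window of local_world_size ranks."""
    pools = [
        sum(ddr_per_device[i:i + local_world_size])
        for i in range(0, len(ddr_per_device), local_world_size)
    ]
    return max(pools)
```

For example, four ranks with `local_world_size=2` form two machines; a shard's DDR is checked against the larger of the two pools, not any single rank's slice.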
… copyright (5) Move `PlanUtilDynamicEmbE2ETest` from the standalone `tzrec/utils/plan_util_dynamicemb_e2e_test.py` into `plan_util_test.py` where the rest of the plan_util tests already live. Delete the standalone file. (6) Rename `DynamicProgrammingProposer2DTest` -> `DynamicProgrammingProposerTest`. The proposer class itself stays `DynamicProgrammingProposer`. (7) Bump copyright `2024 -> 2026` in the new file `tzrec/utils/dynamicemb_util_test.py`. The header in plan_util_test.py stays at 2024 since it was a pre-existing file we only extended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urve Replace the heuristic ``_dynamicemb_effective_cache_ratio`` curve with constants fitted from an on-device perf sweep (``experiments/sweep_20260513-161030/full_a10gpu1.json``: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes: HYBRID @ x=1.0: 0.80 ms (HBM-only fast path) CACHING @ x<1.0: 2.63 ms (~3.3x slower) HYBRID @ x<1.0: 5.44 ms (~6.8x slower) Inverting the linear bw model bw = x_eff*HBM + (1-x_eff)*HBM_TO_DDR (torchrec defaults 897 / 32 GB/s) gives x_eff bases 0.28 (CACHING) and 0.11 (HYBRID). At x = 1.0 the runtime drops the host tier in both modes, so x_eff = 1.0 there. A +0.01*x tiebreaker term keeps the DP's ranking strict within each block. Also extend the ``_dynamicemb_aware_build_shard_perf_contexts`` wrapper to apply the empirical x_eff for HYBRID (previously HYBRID used raw ``cache_load_factor`` as x_eff -- inconsistent with the empirical data that shows HYBRID @ 0.9 is much slower than HYBRID @ 1.0 due to the storage backend switch). Drop the now-unused ``TZREC_CACHING_HIT_RATE_MULTIPLIER`` env knob and its related tests; rewrite ``EffectiveCacheRatioTest`` to assert the empirical strict-block ranking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
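The bandwidth-model inversion quoted above can be checked numerically (a sketch: the latencies and 897/32 GB/s defaults are taken from the commit message, the helper name is made up):

```python
HBM_BW, DDR_BW = 897.0, 32.0  # torchrec default GB/s figures
BASELINE_MS = 0.80            # HYBRID @ x=1.0, HBM-only fast path

def x_eff_from_latency(latency_ms):
    """Invert bw = x_eff*HBM + (1-x_eff)*DDR, assuming latency ~ 1/bw."""
    bw = HBM_BW * BASELINE_MS / latency_ms
    return (bw - DDR_BW) / (HBM_BW - DDR_BW)

caching_base = x_eff_from_latency(2.63)  # close to the 0.28 base above
hybrid_base = x_eff_from_latency(5.44)   # close to the 0.11 base above
```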
The same provenance is already covered in ``_dynamicemb_effective_cache_ratio``'s docstring below; the block comment above the constants was redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
result = _orig_build_shard_perf_contexts(
    cls,
    config,
    shard_sizes,
    sharding_option,
    topology,
    constraints,
    sharder,
    *args,
    **kwargs,
)
sharding_option.cache_params = original_cache_params
return result
```
Restore should be in finally. If _orig_build_shard_perf_contexts raises, sharding_option.cache_params is permanently left as the boosted x_eff clone. Downstream consumers — the storage estimator and _to_sharding_plan at line 435 (cache_params=sharding_option.cache_params) — then see the wrong load factor on every other option that touches the same ShardingOption. Suggested:
```python
try:
    result = _orig_build_shard_perf_contexts(
        cls,
        config,
        shard_sizes,
        sharding_option,
        topology,
        constraints,
        sharder,
        *args,
        **kwargs,
    )
finally:
    sharding_option.cache_params = original_cache_params
return result
```
```python
for caching_mode in (False, True):
    for load_factor in load_factors:
        opt = copy.deepcopy(base_option)
        opt.cache_params = CacheParams(load_factor=load_factor, stats=stats)
        # pyre-ignore [16]
        opt.dynamicemb_options = copy.deepcopy(dynamicemb_options)
        opt.dynamicemb_options.caching = caching_mode
        variants.append(opt)
```
Redundant double deep-copy of dynamicemb_options. The caller (line 1035) attaches dynamicemb_options to base_option before calling this function, so copy.deepcopy(base_option) on line 848 already deep-copies dynamicemb_options into opt. Line 851 then deep-copies the original again and replaces it. With 20 variants per table this doubles a non-trivial copy. Either drop line 851 and just set opt.dynamicemb_options.caching = caching_mode, or stop attaching dynamicemb_options on the caller side before this call.
```
Given :math:`M` tables each with up to :math:`N` ShardingOptions, pick one
option per table to minimize total perf while respecting both per-rank HBM
and per-rank DDR budgets from the topology.
```
Docstring says "per-rank DDR" but the implementation enforces per-machine DDR (lines 349–361 sum DDR across local_world_size co-located ranks and the per-option prune at line 391 compares against max_machine_ddr). The inline comment at 350–352 has the right framing — the class docstring should match: "per-rank HBM and per-machine DDR budgets."
```python
if x >= 1.0:
    return 1.0
base = _DYNAMICEMB_CACHING_X_EFF_BASE if caching else _DYNAMICEMB_HYBRID_X_EFF_BASE
return base + _DYNAMICEMB_X_EFF_TIEBREAK * x
```
Sharp discontinuity at x=1.0. HYBRID@x=0.99 → 0.1199; HYBRID@x=1.00 → 1.0 — an ~8× jump for a 1% ratio change. The DP will reliably prefer x=1.0 over x=0.99 by a huge perf margin, then prefer CACHING@x=0.5 (0.285) over HYBRID@x=0.9 (0.119) even on workloads where HYBRID@0.9 is plainly faster in reality. If the empirical sweep really shows a step at x=1.0 because the runtime drops the host tier, please call that out as "x=1.0 = HBM-only kernel, not the same algorithm as x=0.99" — and consider whether the enumerator should emit a discrete mode={HBM_ONLY, HYBRID, CACHING} axis rather than packing the discontinuity into the same x knob.
```python
for table_i in range(1, table_count):
    prev_dp = dp[table_i - 1]
    cur_dp = dp[table_i]
    cur_back = backtrack[table_i]
    hbm_t = hbm_by_fqn[table_i]
    ddr_t = ddr_by_fqn[table_i]
    perf_t = perf_by_fqn[table_i]
    for h in range(hbm_bins):
        prev_dp_h = prev_dp[h]
        for d in range(ddr_bins):
            prev_state = prev_dp_h[d]
            prev_perf = prev_state[0]
            if prev_perf == _INF:
                continue
            prev_h_val = prev_state[1]
            prev_d_val = prev_state[2]
            for opt_j in range(option_count):
                delta_perf = perf_t[opt_j]
                if delta_perf == _INF:
                    continue
                new_h = prev_h_val + hbm_t[opt_j]
                new_d = prev_d_val + ddr_t[opt_j]
                if new_h >= hbm_bins or new_d >= ddr_bins:
                    continue
                new_h_i = int(new_h)
                new_d_i = int(new_d)
                new_perf = prev_perf + delta_perf
                if cur_dp[new_h_i][new_d_i][0] > new_perf:
                    cur_dp[new_h_i][new_d_i] = (new_perf, new_h, new_d)
                    cur_back[new_h_i][new_d_i] = (opt_j, h, d)
```
2D DP cost scales with world size — consider clamping total bins. hbm_bins = 100 · num_devices and ddr_bins = 50 · num_devices. On a 16-GPU world the inner triple-loop is 1600 × 800 × option_count pure-Python iterations per table; with 100 dynamicemb tables (20 options each) that's a few-billion-op DP, and dp + backtrack allocate ~256M tuples. Even on smaller worlds the bin counts multiply faster than necessary — a per-device discretization isn't really needed since the budget is a single scalar per axis. Consider clamping totals (e.g. hbm_bins = min(hbm_bins_per_device · num_devices, 2_000)) or vectorizing the transition with NumPy. The defaults are fine for 2–4 GPUs but worth a guardrail before the next user hits this.
```python
class DynamicProgrammingProposerTest(unittest.TestCase):
    """2D DP across HBM × DDR picks per-table mode under joint budgets."""

    def _run(self, search_space, topology):
        proposer = DynamicProgrammingProposer(
            hbm_bins_per_device=20, ddr_bins_per_device=20
        )
        proposer.load(search_space)
        # First propose returns the lowest-mem-per-table seed.
        proposer.propose()
        proposer.feedback(partitionable=True, storage_constraint=topology)
        proposals = []
        proposal = proposer.propose()
        while proposal:
            proposals.append(proposal)
            proposer.feedback(partitionable=True, storage_constraint=topology)
            proposal = proposer.propose()
        return proposals

    def test_caching_preferred_when_ddr_is_generous(self):
        # Three options for one table:
        #   HYBRID  @ x=1.0: hbm = T,   ddr = 0,   perf = high (HBM-only)
        #   HYBRID  @ x=0.1: hbm = .1T, ddr = .9T, perf = high (slow)
        #   CACHING @ x=0.1: hbm = .1T, ddr = T,   perf = low (fast — modeled hits)
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
        ]
        topology = _make_topology(
            num_devices=2, hbm_per_device=2000, ddr_per_device=2000
        )
        proposals = self._run(opts, topology)
        # Best plan must be the CACHING option (perf=10).
        best = min(proposals, key=lambda p: sum(o.total_perf for o in p))
        self.assertEqual(best[0].total_perf, 10.0)

    def test_caching_rejected_when_ddr_is_tight(self):
        # Host budget is only 950 — CACHING (ddr=1000) cannot fit; HYBRID can.
        opts = [
            _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
        ]
        topology = _make_topology(
            num_devices=1, hbm_per_device=2000, ddr_per_device=950
        )
        proposals = self._run(opts, topology)
        # Every proposed plan must pick the HYBRID option (perf=80).
        for p in proposals:
            self.assertEqual(p[0].total_perf, 80.0)

    def test_high_factor_collapses_modes(self):
        # At x=1.0 HYBRID == CACHING in HBM and CACHING.ddr = T = HYBRID.hbm.
        # If we offer just the high-factor options, DP picks one of them.
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),  # HYBRID x=1.0
            _FakeShardingOption(
                "table_a", hbm=1000, ddr=1000, perf=50.0
            ),  # CACHING x=1.0
        ]
        topology = _make_topology(
            num_devices=1, hbm_per_device=1100, ddr_per_device=2000
        )
        proposals = self._run(opts, topology)
        # Either option is fine — they're tied. Just verify the proposer
        # returned something feasible.
        self.assertGreater(len(proposals), 0)
        for p in proposals:
            self.assertEqual(p[0].total_perf, 50.0)
```
Missing the headline multi-table 2D-DP scenario. All three DynamicProgrammingProposerTest cases use a single table, so the cross-table DP transition at plan_util.py:434–463 is exercised only at table_i == 0. The PR's value prop is "pick CACHING for one table and HYBRID for another under joint HBM+DDR budgets" — a 2-table fixture where global DDR can only hold one CACHING but plenty of HBM is available would directly cover that. Same gap for the mem_bins_per_device legacy alias (untested) and the all-options-pruned / empty-search-space degenerate paths.
```python
    only_values=only_values,
    bucket_capacity=self.BUCKET_CAPACITY,
    caching=caching,
)
```
Direct unit test for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts is missing. Both have non-trivial branches — admission_counter HBM accounting (dynamicemb_util.py:663–672), the compute_device == "cpu" and not is_inference DDR branch (line 697), and the cache_params swap+restore (line 502–514) — that are reachable only via the E2E test, which is gated on has_dynamicemb and cuda. On a CPU dev box none of this is covered; a fake-ed _orig_build_shard_perf_contexts would let you verify the boost is applied and (more importantly) that the original cache_params is restored even when the wrapped call raises.
Review summary

Solid PR overall — the 2D DP rewrite is clean, the mode-aware storage formula is well-tested, and the empirical perf-model docstring is refreshingly honest about its provenance. A few things worth addressing before merge (inline above):

- Should fix
- Worth a follow-up
- Docs

No security/multi-process concerns: the monkey-patches run under the import lock and each rank is its own process.
`copy.deepcopy(base_option)` on the first line of each variant body already copies `dynamicemb_options` because the caller in `enumerate()` attaches it onto `base_option` before invoking `_emit_dynamicemb_variants`. The second `copy.deepcopy(dynamicemb_options)` was wasted work — at 20 variants per dynamicemb table it doubled a non-trivial copy. Drop it and mutate the already-fresh per-variant `dynamicemb_options.caching` directly. PR alibaba#508 review R3 (github-actions, 2026-05-13). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
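A minimal sketch of the fixed emission loop, with plain dicts standing in for the real ShardingOption / dynamicemb-options objects:

```python
import copy

def emit_variants(base_option, load_factors):
    """One deepcopy per variant; the caching flag is flipped on the
    already-fresh copy instead of deep-copying dynamicemb_options again."""
    variants = []
    for caching_mode in (False, True):
        for lf in load_factors:
            opt = copy.deepcopy(base_option)  # also clones dynamicemb_options
            opt["cache_load_factor"] = lf
            opt["dynamicemb_options"]["caching"] = caching_mode
            variants.append(opt)
    return variants
```

Each variant still owns an isolated `dynamicemb_options` clone, because the per-variant deepcopy of the base already produced one.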
… blowup The 2D DP discretizes each memory axis into `bins_per_device * num_devices` bins. On a 16-GPU world that becomes 1600 hbm bins × 800 ddr bins = 1.28M cells per table; with 100 dynamicemb tables and 20 options each, the inner Python loop is a few-billion-op DP and dp + backtrack tuples consume hundreds of MB. The HBM and DDR budgets are scalar, so multiplying bins by num_devices stops adding meaningful precision past a point. Clamp the per-axis bin count at `_DP_AXIS_BIN_CAP` (1024) to keep the multi-host case tractable. NumPy vectorization of the inner DP loop is a bigger refactor; leaving a TODO referencing PR alibaba#508 review R5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
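The guardrail is simple to state (a sketch; the 1024 cap matches the commit message, the helper name is illustrative):

```python
_DP_AXIS_BIN_CAP = 1024  # per-axis total-bin cap from the commit message

def axis_bins(bins_per_device: int, num_devices: int) -> int:
    """Total bins for one memory axis, clamped so multi-host worlds
    don't blow up the pure-Python DP."""
    return min(bins_per_device * num_devices, _DP_AXIS_BIN_CAP)
```

With the defaults, a 16-GPU world's HBM axis drops from 1600 to 1024 bins while a 4-GPU world keeps its full 400.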
…ation * DynamicProgrammingProposer class docstring: "per-rank DDR" -> "per-rank HBM and per-machine DDR" so it matches the machine-DDR prune introduced in commit 38bd7ad (PR alibaba#508 review R4). * _dynamicemb_effective_cache_ratio: call out that the discontinuity at x=1.0 is a deliberate kernel switch (HBM-only DynamicEmbStorage vs dual-tier HybridStorage/DynamicEmbCache), not a smoothing artifact. Note a future refactor could lift this to a discrete `mode` axis on the enumerator (PR alibaba#508 review R2). * dynamicemb_calculate_shard_storages caching_ratio Args: rewrite the stale "ratio of HBM to DDR memory for UVM caching" framing -- under the dynamicemb modes, host always holds the full backing in CACHING, and caching_ratio sizes the HBM hot-row cache (top-level docs note). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch space Add three DynamicProgrammingProposerTest methods that the original PR alibaba#508 review (R6) flagged as missing: test_two_tables_pick_mixed_modes_under_joint_budget Two tables under a topology where both-HYBRID is HBM-infeasible and both-CACHING is DDR-infeasible. The DP must pick one of each. This exercises the cross-table transition at plan_util.py table_i == 1 which the existing single-table cases never reach. test_legacy_mem_bins_per_device_alias Verifies the back-compat kwarg `mem_bins_per_device` still wins over the new `hbm_bins_per_device` when both are passed. test_empty_search_space_returns_empty_proposal Edge case where `load([])` short-circuits via `table_count == 0`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper tests Combine PR alibaba#508 review R1 + R7. The new ``BuildShardPerfContextsWrapperTest.test_cache_params_restored_on_exception`` exists to pin the bug R1 describes: when ``_orig_build_shard_perf_contexts`` raises, the boosted ``x_eff`` clone of ``sharding_option.cache_params`` must not leak to the downstream storage estimator's view of the same option. The corresponding ``try/finally`` around the wrapped call closes the leak. Reverts the previous "single return, no try/finally" simplification. Also adds: * ``BuildShardPerfContextsWrapperTest`` — direct unit tests for the ShardPerfContext.build_shard_perf_contexts monkey-patch covering: boost applied for caching, boost applied for hybrid, no boost when the option has no dynamicemb_options, cache_params restored on success, and the restore-on-exception case driving R1. * ``DynamicEmbCalcShardStoragesTest.test_admission_counter_increases_hbm`` — direct cover for the admission_counter HBM accounting branch in ``dynamicemb_calculate_shard_storages`` that the CUDA-gated E2E test doesn't exercise on a CPU dev box. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
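The restore-on-exception property that test pins can be sketched independently of tzrec (everything here is a stand-in; the real wrapper swaps `cache_params` around `_orig_build_shard_perf_contexts`):

```python
class FakeOption:
    def __init__(self):
        self.cache_params = "original"

def wrapped(option, inner):
    """Swap in the boosted cache_params; always restore on the way out."""
    original = option.cache_params
    option.cache_params = "boosted"
    try:
        return inner(option)
    finally:
        option.cache_params = original

def boom(option):
    raise RuntimeError("planning failed")

opt = FakeOption()
try:
    wrapped(opt, boom)
except RuntimeError:
    pass
assert opt.cache_params == "original"  # no leak even though the call raised
```

The success path behaves the same way: the inner call observes the boosted value, and the caller observes the original afterwards.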
Summary
Make tzrec's embedding sharding planner take HBM and host-memory limits from the topology and decide per dynamicemb table whether to use HYBRID (HBM + DRAM hash-partitioned) or CACHING (HBM cache + full DRAM backing) mode. Per the consensus on NVIDIA/recsys-examples#306 (closed), both placement modes are kept and the planner — not the user — picks the best one under the joint budget.
- Enumerator: per dynamicemb table, sweep `{ HYBRID, CACHING } × { cache_load_factor 0.1, ..., 1.0 }`. The base ShardingOption is deep-copied once per variant; `dynamicemb_options.caching` is then flipped in-place on the already-fresh per-variant copy.
- Storage: HYBRID DDR = `(1 − cache_load_factor) · T`, CACHING DDR = `T` (full backing store). HBM accounting and hash-table metadata stay HBM-side only, matching the existing tzrec convention.
- Perf: constants fitted from an on-device sweep (`experiments/sweep_20260513-161030/full_a10gpu1.json`: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes — HYBRID@1.0 = 0.80 ms (HBM-only fast path), CACHING@<1.0 = 2.63 ms (~3.3x slower), HYBRID@<1.0 = 5.44 ms (~6.8x slower). Inverting the linear bw model with torchrec defaults gives `x_eff` bases 0.28 (CACHING) and 0.11 (HYBRID), with `x_eff = 1.0` at `cache_load_factor = 1.0` reflecting the runtime kernel switch. A small `+0.01·x` tiebreaker preserves a strict DP ranking within each block.
- The legacy `mem_bins_per_device` kwarg is preserved as an HBM alias.
- Budgets come from `Topology.hbm_cap` / `Topology.ddr_cap` minus the existing reservation.

Test plan
- `tzrec/utils/dynamicemb_util_test.py` — storage formula + empirical perf-model + direct wrapper tests (43 cases total)
- `tzrec/utils/plan_util_test.py` — variant emission + 2D DP unit tests including multi-table mixed-modes, legacy `mem_bins_per_device` alias, empty search space, and the headline DP tests
- E2E planning-pipeline test (gated on has_dynamicemb + CUDA, in `plan_util_test.py`)
- `pre-commit run --files <changed>` clean
- `torchrun --nproc-per-node=2 -m tzrec.train_eval ...`

🤖 Generated with Claude Code