[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508

Open
tiankongdeguiji wants to merge 15 commits into alibaba:master from tiankongdeguiji:dynamicemb-hybrid-caching-planner

Conversation

@tiankongdeguiji (Collaborator) commented May 12, 2026

Summary

Make tzrec's embedding sharding planner take HBM and host-memory limits from the topology and decide per dynamicemb table whether to use HYBRID (HBM + DRAM hash-partitioned) or CACHING (HBM cache + full DRAM backing) mode. Per the consensus on NVIDIA/recsys-examples#306 (closed), both placement modes are kept and the planner — not the user — picks the best one under the joint budget.

  • Enumerator emits 20 ShardingOptions per dynamicemb table: { HYBRID, CACHING } × { cache_load_factor 0.1, ..., 1.0 }. The base ShardingOption is deep-copied once per variant; dynamicemb_options.caching is then flipped in-place on the already-fresh per-variant copy.
  • Storage estimator is mode-aware: HYBRID DDR = (1 − cache_load_factor) · T, CACHING DDR = T (full backing store). HBM accounting and hash-table metadata stay HBM-side only, matching the existing tzrec convention.
  • Perf model is fitted from an on-device dynamicemb sweep (experiments/sweep_20260513-161030/full_a10gpu1.json: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes — HYBRID@1.0 = 0.80 ms (HBM-only fast path), CACHING@<1.0 = 2.63 ms (~3.3x slower), HYBRID@<1.0 = 5.44 ms (~6.8x slower). Inverting the linear bw model with torchrec defaults gives x_eff bases 0.28 (CACHING) and 0.11 (HYBRID), with x_eff = 1.0 at cache_load_factor = 1.0 reflecting the runtime kernel switch. A small +0.01·x tiebreaker preserves a strict DP ranking within each block.
  • DynamicProgrammingProposer is now 2D: discretizes per-rank HBM (100 bins/dev) and per-machine DDR (50 bins/dev) and picks per-table options that minimize total perf under both budgets. HBM prune is per-device, DDR prune is per-machine (DDR is host-shared across co-located ranks). Per-axis bin count is capped at 1024 to avoid blow-up on multi-host worlds. Legacy mem_bins_per_device kwarg is preserved as an HBM alias.
  • No new proto fields — budgets come straight from Topology.hbm_cap / Topology.ddr_cap minus the existing reservation.

Test plan

  • tzrec/utils/dynamicemb_util_test.py — storage formula + empirical perf-model + direct wrapper tests (43 cases total)
  • tzrec/utils/plan_util_test.py — variant emission + 2D DP unit tests including multi-table mixed-modes, legacy mem_bins_per_device alias, empty search space, and the headline DP tests
  • CUDA-gated end-to-end enumerate + DP test (folded into plan_util_test.py)
  • pre-commit run --files <changed> clean
  • Smoke test on a real dynamicemb config end-to-end via torchrun --nproc-per-node=2 -m tzrec.train_eval ...

🤖 Generated with Claude Code

tiankongdeguiji and others added 10 commits May 12, 2026 20:28
…ING)

Extend `_calculate_dynamicemb_table_storage_specific_size` and its callers
with a `caching: bool` parameter. HYBRID keeps the original split formula
(DDR = (1 - cache_ratio) * T); CACHING accounts the full backing store on
host (DDR = T) regardless of cache_ratio. HBM accounting and metadata
(key + score + digest + bucket header) stay identical between modes.
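
A minimal sketch of that split, under simplified assumptions (illustrative helper name and value-bytes-only accounting; the real tzrec helper also counts the key/score/digest/bucket metadata on the HBM side):

    def _ddr_bytes_for_table(total_value_bytes: int, cache_ratio: float, caching: bool) -> int:
        # CACHING: host always holds the full backing store, whatever the HBM cache size.
        if caching:
            return total_value_bytes
        # HYBRID: only the cold (1 - cache_ratio) slice of the hash-partitioned table lives in DRAM.
        return int((1.0 - cache_ratio) * total_value_bytes)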

The flag is sourced from `dynamicemb_options.caching` on the ShardingOption
in `dynamicemb_calculate_shard_storages`. Subsequent commits wire this into
the enumerator (per-table sweep over both modes) and the proposer (2D DP).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `_dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats)`:
HYBRID passes the ratio through (hash-partitioned HBM hit probability);
CACHING boosts to `min(1.0, x * multiplier)`, default multiplier 10.0 via
the `TZREC_CACHING_HIT_RATE_MULTIPLIER` env var. `cache_params.stats`, when
provided, takes precedence — `1 - stats.expected_miss_rate(x)` — and is
clamped to never fall below the HYBRID ratio.
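
A rough sketch of the heuristic as described here, before the fitted-constant rewrite later in this PR (names as in the commit; stats handling simplified):

    import os

    def _dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats=None):
        if not caching:
            # HYBRID: the hash-partitioned HBM slice is the hit probability.
            return cache_load_factor
        if stats is not None:
            # Measured stats take precedence, clamped to never fall below the HYBRID ratio.
            return max(cache_load_factor, 1.0 - stats.expected_miss_rate(cache_load_factor))
        multiplier = float(os.environ.get("TZREC_CACHING_HIT_RATE_MULTIPLIER", "10.0"))
        return min(1.0, cache_load_factor * multiplier)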

Inject the boost at planning time by monkey-patching
`ShardPerfContext.build_shard_perf_contexts` to temporarily replace the
sharding option's `cache_params.load_factor` with the effective ratio.
The original cache_params is restored on return so the storage estimator
(which reads cache_load_factor for HBM/DDR sizing) still sees the
unboosted value.

Net effect: at the same `cache_load_factor`, CACHING is strictly cheaper
in perf and strictly more expensive in DDR. The DP proposer (next commit)
trades one for the other against the topology budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the per-dynamicemb-table variant emission into
`_emit_dynamicemb_variants(base, dynamicemb_options) -> List[ShardingOption]`
and sweep both modes per cache_load_factor. When `cache_params` is unset,
the helper emits 20 variants (10 factors × 2 modes); when fixed by the
caller, it emits 2 (both modes at the fixed factor). Each variant owns a
deep-copied `dynamicemb_options` so per-variant `caching` mutations stay
isolated. `cache_params.stats` is preserved across all clones for the
perf-side miss-rate override.

The downstream 2D DP proposer (next commit) selects per table the (mode,
ratio) pair that fits both HBM and host budgets while minimizing perf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite `DynamicProgrammingProposer.feedback` to discretize both per-rank
HBM and per-rank DDR into bins (default 100 × 50 per device) and pick per
table the (mode, cache_load_factor) pair that minimizes total perf while
fitting both topology budgets.

State:
  dp[table][hbm_bin][ddr_bin]      = (perf, hbm_actual, ddr_actual)
  backtrack[table][hbm_bin][ddr_bin] = (opt_id, prev_hbm_bin, prev_ddr_bin)

Options whose largest shard exceeds either per-device cap are pruned
upfront. Backtracking enumerates feasible (hbm_total, ddr_total) cells in
decreasing total memory, yielding Pareto-optimal proposals. The legacy
`mem_bins_per_device` kwarg is preserved as an alias for `hbm_bins_per_device`
so existing callers still work, and HBM-degenerate (CPU-only) or
DDR-degenerate topologies collapse the unused axis to a single bin.

Adds `DynamicProgrammingProposer2DTest`:
  - generous DDR → CACHING wins (cheaper perf, host can fit T)
  - tight DDR → CACHING rejected, falls back to HYBRID
  - high cache_load_factor → modes converge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercise the full planning pipeline on a tiny TestSparseNN backed by a
DynamicEmbParameterConstraints constraint. Asserts:
  - EmbeddingEnumerator yields 20 sharding options (2 modes × 10 factors)
  - All cache_load_factors and both caching modes are represented
  - Storage and perf estimators populate non-zero values for each option
  - DynamicProgrammingProposer.propose() returns feasible plans whose
    dynamicemb options carry a valid caching flag

Gated on has_dynamicemb + torch.cuda.is_available() since the dynamicemb
sharder requires CUDA at planning time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper

(1) Restore the byte-budget framing in `_calculate_dynamicemb_table_storage_specific_size`
docstring (`total_value_memory`, `num_buckets`, `hbm_budget`, `ddr_budget`)
that was lost in the PR alibaba#508 rewrite, extending `ddr_budget` to cover both
HYBRID and CACHING modes in one place.

(2) Simplify `_dynamicemb_aware_build_shard_perf_contexts`: single return,
no try/finally, no early-return branch. The unconditional
`sharding_option.cache_params = original_cache_params` line at the end is a
no-op when caching is False (we never mutated the field) and restores the
original cache_params when caching is True. If the wrapped call raises,
planning aborts and the option is discarded -- no need for try/finally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDR is host-shared across ranks co-located on one machine, so the
per-option fit check in `DynamicProgrammingProposer.feedback` should
compare `max_shard_ddr` against the largest machine's total DDR pool, not
against any single rank's slice. HBM stays per-device (each GPU has its
own HBM; no cross-rank sharing).

Compute `max_machine_ddr` by summing per-device DDR within each contiguous
`local_world_size`-sized window. Update the test fake-topology fixture to
carry `local_world_size` (defaults to num_devices).
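
A small sketch of that windowed sum (illustrative standalone function; the real code reads these values off the torchrec Topology object):

    def _max_machine_ddr(ddr_per_device: list, local_world_size: int) -> int:
        # DDR is pooled per host: sum each contiguous local_world_size-sized
        # window of ranks and take the largest machine's pool.
        return max(
            sum(ddr_per_device[i : i + local_world_size])
            for i in range(0, len(ddr_per_device), local_world_size)
        )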

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… copyright

(5) Move `PlanUtilDynamicEmbE2ETest` from the standalone
`tzrec/utils/plan_util_dynamicemb_e2e_test.py` into `plan_util_test.py`
where the rest of the plan_util tests already live. Delete the standalone
file.

(6) Rename `DynamicProgrammingProposer2DTest` -> `DynamicProgrammingProposerTest`.
The proposer class itself stays `DynamicProgrammingProposer`.

(7) Bump copyright `2024 -> 2026` in the new file
`tzrec/utils/dynamicemb_util_test.py`. The header in
plan_util_test.py stays at 2024 since it was a pre-existing file we only
extended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urve

Replace the heuristic ``_dynamicemb_effective_cache_ratio`` curve with
constants fitted from an on-device perf sweep
(``experiments/sweep_20260513-161030/full_a10gpu1.json``: 4M-row table,
dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency
clustered into three regimes:

  HYBRID @ x=1.0:    0.80 ms   (HBM-only fast path)
  CACHING @ x<1.0:   2.63 ms   (~3.3x slower)
  HYBRID  @ x<1.0:   5.44 ms   (~6.8x slower)

Inverting the linear bw model bw = x_eff*HBM + (1-x_eff)*HBM_TO_DDR
(torchrec defaults 897 / 32 GB/s) gives x_eff bases 0.28 (CACHING) and
0.11 (HYBRID). At x = 1.0 the runtime drops the host tier in both modes,
so x_eff = 1.0 there. A +0.01*x tiebreaker term keeps the DP's ranking
strict within each block.
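
A short sketch of that inversion, reproducing the fitted bases from the measured slowdowns (constants as quoted above; function name is illustrative only):

    HBM_BW = 897.0        # GB/s, torchrec default device bandwidth
    HBM_TO_DDR_BW = 32.0  # GB/s, torchrec default host-link bandwidth

    def invert_x_eff(slowdown_vs_hbm_only: float) -> float:
        # bw = x_eff * HBM_BW + (1 - x_eff) * HBM_TO_DDR_BW, solved for x_eff,
        # with the measured bandwidth taken as HBM_BW / slowdown.
        bw = HBM_BW / slowdown_vs_hbm_only
        return (bw - HBM_TO_DDR_BW) / (HBM_BW - HBM_TO_DDR_BW)

    invert_x_eff(2.63 / 0.80)  # ~0.28 -> CACHING base
    invert_x_eff(5.44 / 0.80)  # ~0.11 -> HYBRID base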

Also extend the ``_dynamicemb_aware_build_shard_perf_contexts`` wrapper
to apply the empirical x_eff for HYBRID (previously HYBRID used raw
``cache_load_factor`` as x_eff -- inconsistent with the empirical data
that shows HYBRID @ 0.9 is much slower than HYBRID @ 1.0 due to the
storage backend switch).

Drop the now-unused ``TZREC_CACHING_HIT_RATE_MULTIPLIER`` env knob and
its related tests; rewrite ``EffectiveCacheRatioTest`` to assert the
empirical strict-block ranking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The same provenance is already covered in
``_dynamicemb_effective_cache_ratio``'s docstring below; the block
comment above the constants was redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tiankongdeguiji added the claude-review (Let Claude Review) label May 13, 2026
@github-actions bot removed the claude-review (Let Claude Review) label May 13, 2026
Comment thread tzrec/utils/dynamicemb_util.py Outdated
Comment on lines +515 to +527
result = _orig_build_shard_perf_contexts(
    cls,
    config,
    shard_sizes,
    sharding_option,
    topology,
    constraints,
    sharder,
    *args,
    **kwargs,
)
sharding_option.cache_params = original_cache_params
return result

Restore should be in finally. If _orig_build_shard_perf_contexts raises, sharding_option.cache_params is permanently left as the boosted x_eff clone. Downstream consumers — the storage estimator and _to_sharding_plan at line 435 (cache_params=sharding_option.cache_params) — then see the wrong load factor on every other option that touches the same ShardingOption. Suggested:

Suggested change

Before:
    result = _orig_build_shard_perf_contexts(
        cls,
        config,
        shard_sizes,
        sharding_option,
        topology,
        constraints,
        sharder,
        *args,
        **kwargs,
    )
    sharding_option.cache_params = original_cache_params
    return result

After:
    try:
        result = _orig_build_shard_perf_contexts(
            cls,
            config,
            shard_sizes,
            sharding_option,
            topology,
            constraints,
            sharder,
            *args,
            **kwargs,
        )
    finally:
        sharding_option.cache_params = original_cache_params
    return result

Comment thread tzrec/utils/plan_util.py
Comment on lines +846 to +853
for caching_mode in (False, True):
    for load_factor in load_factors:
        opt = copy.deepcopy(base_option)
        opt.cache_params = CacheParams(load_factor=load_factor, stats=stats)
        # pyre-ignore [16]
        opt.dynamicemb_options = copy.deepcopy(dynamicemb_options)
        opt.dynamicemb_options.caching = caching_mode
        variants.append(opt)

Redundant double deep-copy of dynamicemb_options. The caller (line 1035) attaches dynamicemb_options to base_option before calling this function, so copy.deepcopy(base_option) on line 848 already deep-copies dynamicemb_options into opt. Line 851 then deep-copies the original again and replaces it. With 20 variants per table this doubles a non-trivial copy. Either drop line 851 and just set opt.dynamicemb_options.caching = caching_mode, or stop attaching dynamicemb_options on the caller side before this call.

Comment thread tzrec/utils/plan_util.py
Given :math:`M` tables each with up to :math:`N` ShardingOptions, pick one
option per table to minimize total perf while respecting both per-rank HBM
and per-rank DDR budgets from the topology.


Docstring says "per-rank DDR" but the implementation enforces per-machine DDR (lines 349–361 sum DDR across local_world_size co-located ranks and the per-option prune at line 391 compares against max_machine_ddr). The inline comment at 350–352 has the right framing — the class docstring should match: "per-rank HBM and per-machine DDR budgets."

Comment on lines +86 to +89
if x >= 1.0:
    return 1.0
base = _DYNAMICEMB_CACHING_X_EFF_BASE if caching else _DYNAMICEMB_HYBRID_X_EFF_BASE
return base + _DYNAMICEMB_X_EFF_TIEBREAK * x

Sharp discontinuity at x=1.0. HYBRID@x=0.99 → 0.1199; HYBRID@x=1.00 → 1.0 — an ~8× jump for a 1% ratio change. The DP will reliably prefer x=1.0 over x=0.99 by a huge perf margin, then prefer CACHING@x=0.5 (0.285) over HYBRID@x=0.9 (0.119) even on workloads where HYBRID@0.9 is plainly faster in reality. If the empirical sweep really shows a step at x=1.0 because the runtime drops the host tier, please call that out as "x=1.0 = HBM-only kernel, not the same algorithm as x=0.99" — and consider whether the enumerator should emit a discrete mode={HBM_ONLY, HYBRID, CACHING} axis rather than packing the discontinuity into the same x knob.

Comment thread tzrec/utils/plan_util.py
Comment on lines +434 to +463
for table_i in range(1, table_count):
    prev_dp = dp[table_i - 1]
    cur_dp = dp[table_i]
    cur_back = backtrack[table_i]
    hbm_t = hbm_by_fqn[table_i]
    ddr_t = ddr_by_fqn[table_i]
    perf_t = perf_by_fqn[table_i]
    for h in range(hbm_bins):
        prev_dp_h = prev_dp[h]
        for d in range(ddr_bins):
            prev_state = prev_dp_h[d]
            prev_perf = prev_state[0]
            if prev_perf == _INF:
                continue
            prev_h_val = prev_state[1]
            prev_d_val = prev_state[2]
            for opt_j in range(option_count):
                delta_perf = perf_t[opt_j]
                if delta_perf == _INF:
                    continue
                new_h = prev_h_val + hbm_t[opt_j]
                new_d = prev_d_val + ddr_t[opt_j]
                if new_h >= hbm_bins or new_d >= ddr_bins:
                    continue
                new_h_i = int(new_h)
                new_d_i = int(new_d)
                new_perf = prev_perf + delta_perf
                if cur_dp[new_h_i][new_d_i][0] > new_perf:
                    cur_dp[new_h_i][new_d_i] = (new_perf, new_h, new_d)
                    cur_back[new_h_i][new_d_i] = (opt_j, h, d)

2D DP cost scales with world size — consider clamping total bins. hbm_bins = 100 · num_devices and ddr_bins = 50 · num_devices. On a 16-GPU world the inner triple-loop is 1600 × 800 × option_count pure-Python iterations per table; with 100 dynamicemb tables (20 options each) that's a few-billion-op DP, and dp + backtrack allocate ~256M tuples. Even on smaller worlds the bin counts multiply faster than necessary — a per-device discretization isn't really needed since the budget is a single scalar per axis. Consider clamping totals (e.g. hbm_bins = min(hbm_bins_per_device · num_devices, 2_000)) or vectorizing the transition with NumPy. The defaults are fine for 2–4 GPUs but worth a guardrail before the next user hits this.

Comment on lines +241 to +309
class DynamicProgrammingProposerTest(unittest.TestCase):
    """2D DP across HBM × DDR picks per-table mode under joint budgets."""

    def _run(self, search_space, topology):
        proposer = DynamicProgrammingProposer(
            hbm_bins_per_device=20, ddr_bins_per_device=20
        )
        proposer.load(search_space)
        # First propose returns the lowest-mem-per-table seed.
        proposer.propose()
        proposer.feedback(partitionable=True, storage_constraint=topology)
        proposals = []
        proposal = proposer.propose()
        while proposal:
            proposals.append(proposal)
            proposer.feedback(partitionable=True, storage_constraint=topology)
            proposal = proposer.propose()
        return proposals

    def test_caching_preferred_when_ddr_is_generous(self):
        # Three options for one table:
        #   HYBRID  @ x=1.0: hbm = T,   ddr = 0,   perf = high (HBM-only)
        #   HYBRID  @ x=0.1: hbm = .1T, ddr = .9T, perf = high (slow)
        #   CACHING @ x=0.1: hbm = .1T, ddr = T,   perf = low (fast — modeled hits)
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
        ]
        topology = _make_topology(
            num_devices=2, hbm_per_device=2000, ddr_per_device=2000
        )
        proposals = self._run(opts, topology)
        # Best plan must be the CACHING option (perf=10).
        best = min(proposals, key=lambda p: sum(o.total_perf for o in p))
        self.assertEqual(best[0].total_perf, 10.0)

    def test_caching_rejected_when_ddr_is_tight(self):
        # Host budget is only 950 — CACHING (ddr=1000) cannot fit; HYBRID can.
        opts = [
            _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
        ]
        topology = _make_topology(
            num_devices=1, hbm_per_device=2000, ddr_per_device=950
        )
        proposals = self._run(opts, topology)
        # Every proposed plan must pick the HYBRID option (perf=80).
        for p in proposals:
            self.assertEqual(p[0].total_perf, 80.0)

    def test_high_factor_collapses_modes(self):
        # At x=1.0 HYBRID == CACHING in HBM and CACHING.ddr = T = HYBRID.hbm.
        # If we offer just the high-factor options, DP picks one of them.
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),  # HYBRID x=1.0
            _FakeShardingOption(
                "table_a", hbm=1000, ddr=1000, perf=50.0
            ),  # CACHING x=1.0
        ]
        topology = _make_topology(
            num_devices=1, hbm_per_device=1100, ddr_per_device=2000
        )
        proposals = self._run(opts, topology)
        # Either option is fine — they're tied. Just verify the proposer
        # returned something feasible.
        self.assertGreater(len(proposals), 0)
        for p in proposals:
            self.assertEqual(p[0].total_perf, 50.0)

Missing the headline multi-table 2D-DP scenario. All three DynamicProgrammingProposerTest cases use a single table, so the cross-table DP transition at plan_util.py:434–463 is exercised only at table_i == 0. The PR's value prop is "pick CACHING for one table and HYBRID for another under joint HBM+DDR budgets" — a 2-table fixture where global DDR can only hold one CACHING but plenty of HBM is available would directly cover that. Same gap for the mem_bins_per_device legacy alias (untested) and the all-options-pruned / empty-search-space degenerate paths.
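
One possible shape for that fixture, sketched in the style of the tests above with purely illustrative numbers (a follow-up commit in this thread adds a real version of this test, which may differ in detail):

    def test_mixed_modes_under_joint_budget(self):
        # Per table: HYBRID @ x=1.0 (hbm=T, ddr=0) vs CACHING @ x=0.1 (hbm=.1T, ddr=T).
        opts = [
            _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
            _FakeShardingOption("table_b", hbm=1000, ddr=0, perf=50.0),
            _FakeShardingOption("table_b", hbm=100, ddr=1000, perf=10.0),
        ]
        # Both-HYBRID busts HBM (2000 > 1500); both-CACHING busts DDR (2000 > 1500).
        topology = _make_topology(num_devices=1, hbm_per_device=1500, ddr_per_device=1500)
        proposals = self._run(opts, topology)
        best = min(proposals, key=lambda p: sum(o.total_perf for o in p))
        # The only feasible mix pairs one CACHING table with one HYBRID table.
        self.assertEqual(sorted(o.total_perf for o in best), [10.0, 50.0])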

only_values=only_values,
bucket_capacity=self.BUCKET_CAPACITY,
caching=caching,
)

Direct unit tests for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts are missing. Both have non-trivial branches — admission_counter HBM accounting (dynamicemb_util.py:663–672), the compute_device == "cpu" and not is_inference DDR branch (line 697), and the cache_params swap+restore (lines 502–514) — that are reachable only via the E2E test, which is gated on has_dynamicemb and CUDA. On a CPU dev box none of this is covered; a faked _orig_build_shard_perf_contexts would let you verify the boost is applied and (more importantly) that the original cache_params is restored even when the wrapped call raises.
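
A sketch of what that restore-on-exception check could look like (patch target, fixture name, and argument shapes are guesses from the snippet quoted above, not the real test):

    from unittest import mock

    def test_cache_params_restored_on_exception(self):
        option = _fake_dynamicemb_sharding_option(cache_load_factor=0.2)  # hypothetical fixture
        original = option.cache_params

        def raising_orig(*args, **kwargs):  # stand-in for the wrapped torchrec call
            raise RuntimeError("boom")

        with mock.patch.object(dynamicemb_util, "_orig_build_shard_perf_contexts", raising_orig):
            with self.assertRaises(RuntimeError):
                dynamicemb_util._dynamicemb_aware_build_shard_perf_contexts(
                    None, None, None, option, None, None, None
                )
        # The boosted x_eff clone must not leak back onto the option.
        self.assertIs(option.cache_params, original)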

@github-actions

Review summary

Solid PR overall — the 2D DP rewrite is clean, the mode-aware storage formula is well-tested, and the empirical perf-model docstring is refreshingly honest about its provenance. A few things worth addressing before merge (inline above):

Should fix

  • dynamicemb_util.py:515-527 — missing try/finally around the cache_params swap; an exception in the wrapped estimator permanently corrupts the option's cache load factor.
  • dynamicemb_util.py:56-89 — the x_eff step from 0.12 → 1.0 at x=1.00 means HYBRID@0.99 and CACHING@0.5 both look much worse than HYBRID@1.0 by construction. If x=1.0 is genuinely a different code path (HBM-only kernel), modeling it as a separate mode axis is cleaner than a discontinuity in the same knob.
  • plan_util.py:846-853 — redundant double-deepcopy of dynamicemb_options (20×/table).

Worth a follow-up

  • plan_util.py:434-463 — 2D DP state blows up on multi-node worlds (bins scale with world_size); consider an absolute cap on total bins, or vectorize with NumPy.
  • Test gap on the multi-table DP case where one table wants CACHING and another HYBRID under a shared DDR budget — that's the headline scenario and currently isn't covered by a unit test. Same for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts (only reachable via the CUDA-gated E2E test) and the mem_bins_per_device legacy alias.

Docs

  • DynamicProgrammingProposer docstring says "per-rank DDR" but the implementation budgets DDR per-machine — worth aligning the wording with the inline comment on lines 350-352.
  • dynamicemb_calculate_shard_storages Args block (not in diff, around line 620): the caching_ratio description still reads "ratio of HBM to DDR memory for UVM caching" — under the new CACHING semantics that's stale; host always holds the full backing store and caching_ratio only sizes the HBM hot-row cache.
  • PR description mentions TZREC_CACHING_HIT_RATE_MULTIPLIER and x_eff = min(1.0, x · multiplier); the code uses fitted constants instead (no env var). Worth updating the description so it doesn't mislead future readers.

No security/multi-process concerns: the monkey-patches run under the import lock and each rank is its own process.

tiankongdeguiji and others added 5 commits May 13, 2026 17:02
`copy.deepcopy(base_option)` on the first line of each variant body
already copies `dynamicemb_options` because the caller in `enumerate()`
attaches it onto `base_option` before invoking `_emit_dynamicemb_variants`.
The second `copy.deepcopy(dynamicemb_options)` was wasted work — at 20
variants per dynamicemb table it doubled a non-trivial copy. Drop it and
mutate the already-fresh per-variant `dynamicemb_options.caching` directly.

PR alibaba#508 review R3 (github-actions, 2026-05-13).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… blowup

The 2D DP discretizes each memory axis into `bins_per_device *
num_devices` bins. On a 16-GPU world that becomes 1600 hbm bins × 800
ddr bins = 1.28M cells per table; with 100 dynamicemb tables and 20
options each, the inner Python loop is a few-billion-op DP and dp +
backtrack tuples consume hundreds of MB. The HBM and DDR budgets are
scalar, so multiplying bins by num_devices stops adding meaningful
precision past a point. Clamp the per-axis bin count at `_DP_AXIS_BIN_CAP`
(1024) to keep the multi-host case tractable.
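
A sketch of the clamp using the constant named above (the exact call-site shape in plan_util.py may differ):

    _DP_AXIS_BIN_CAP = 1024  # hard ceiling per memory axis, independent of world size

    hbm_bins = min(hbm_bins_per_device * num_devices, _DP_AXIS_BIN_CAP)
    ddr_bins = min(ddr_bins_per_device * num_devices, _DP_AXIS_BIN_CAP)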

NumPy vectorization of the inner DP loop is a bigger refactor; leaving a
TODO referencing PR alibaba#508 review R5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

* DynamicProgrammingProposer class docstring: "per-rank DDR" -> "per-rank
  HBM and per-machine DDR" so it matches the machine-DDR prune introduced
  in commit 38bd7ad (PR alibaba#508 review R4).

* _dynamicemb_effective_cache_ratio: call out that the discontinuity at
  x=1.0 is a deliberate kernel switch (HBM-only DynamicEmbStorage vs
  dual-tier HybridStorage/DynamicEmbCache), not a smoothing artifact. Note
  a future refactor could lift this to a discrete `mode` axis on the
  enumerator (PR alibaba#508 review R2).

* dynamicemb_calculate_shard_storages caching_ratio Args: rewrite the
  stale "ratio of HBM to DDR memory for UVM caching" framing -- under the
  dynamicemb modes, host always holds the full backing in CACHING, and
  caching_ratio sizes the HBM hot-row cache (top-level docs note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch space

Add three DynamicProgrammingProposerTest methods that the original
PR alibaba#508 review (R6) flagged as missing:

  test_two_tables_pick_mixed_modes_under_joint_budget
    Two tables under a topology where both-HYBRID is HBM-infeasible and
    both-CACHING is DDR-infeasible. The DP must pick one of each. This
    exercises the cross-table transition at plan_util.py table_i == 1
    which the existing single-table cases never reach.

  test_legacy_mem_bins_per_device_alias
    Verifies the back-compat kwarg `mem_bins_per_device` still wins over
    the new `hbm_bins_per_device` when both are passed.

  test_empty_search_space_returns_empty_proposal
    Edge case where `load([])` short-circuits via `table_count == 0`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper tests

Combine PR alibaba#508 review R1 + R7. The new
``BuildShardPerfContextsWrapperTest.test_cache_params_restored_on_exception``
exists to pin the bug R1 describes: when ``_orig_build_shard_perf_contexts``
raises, the boosted ``x_eff`` clone of ``sharding_option.cache_params``
must not leak to the downstream storage estimator's view of the same
option. The corresponding ``try/finally`` around the wrapped call closes
the leak. Reverts the previous "single return, no try/finally"
simplification.

Also adds:

* ``BuildShardPerfContextsWrapperTest`` — direct unit tests for the
  ShardPerfContext.build_shard_perf_contexts monkey-patch covering:
  boost applied for caching, boost applied for hybrid, no boost when the
  option has no dynamicemb_options, cache_params restored on success,
  and the restore-on-exception case driving R1.

* ``DynamicEmbCalcShardStoragesTest.test_admission_counter_increases_hbm``
  — direct cover for the admission_counter HBM accounting branch in
  ``dynamicemb_calculate_shard_storages`` that the CUDA-gated E2E test
  doesn't exercise on a CPU dev box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>