Skip to content

fix: harden async scheduling admission follow-ups#680

Merged
eric-tramel merged 4 commits into
scheduling-yolofrom
codex/661-review-fixes
May 19, 2026
Merged

fix: harden async scheduling admission follow-ups#680
eric-tramel merged 4 commits into
scheduling-yolofrom
codex/661-review-fixes

Conversation

@nabinchha
Copy link
Copy Markdown
Contributor

📋 Summary

Follow-up fixes for the PR #661 review. This moves retry accounting to the request-admission boundary, fixes hard request-denial waits, fills request-admission capacity-plan fields, bounds release bookkeeping, stabilizes the idle regression harness, restores historical Fern versions, and hardens the MCP e2e ordering assertion. The request-pressure generation-domain normalization is intentionally left to PR #679, which already owns that change.

🔗 Related Issue

N/A. Follow-up to PR #661 and complements PR #679.

🔄 Changes

  • Disable hidden transport retries when request admission wraps a model client, then retry transient provider failures in ModelRequestExecutor so every outbound attempt gets its own acquire/release lease cycle.
  • Raise typed RequestAdmissionError for terminal hard-policy denials in blocking sync/async acquire paths, and keep task/request duplicate-release tracking bounded.
  • Populate AsyncCapacityPlan with request resources, provider/model static caps, request limits, controller config snapshots, runtime request pressure, and observed maxima.
  • Wire the scheduler to the model registry request-pressure provider for capacity diagnostics.
  • Make idle regression --skip-run read-only for missing artifacts, add extra measured stress iterations, and tune the quick wide-stress threshold to match observed quick-mode noise.
  • Restore historical fern/versions/v0.5.8 and fern/versions/v0.5.9 snapshots back to the base branch content.
  • Accept ordered multiple tool calls within the same assistant message in the MCP e2e test.

🧪 Testing

  • uv run --group dev ruff check ...
  • PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev pytest packages/data-designer-config/tests/config/test_scheduling.py packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/models/clients/test_factory.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py packages/data-designer-engine/tests/engine/test_capacity.py packages/data-designer-engine/tests/engine/test_observability.py packages/data-designer-engine/tests/engine/test_async_scheduling_benchmark.py -q — 222 passed
  • uv run --directory tests_e2e pytest -q — 8 passed, 8 warnings
  • PYTHONPATH=packages/data-designer-config/src:packages/data-designer-engine/src:packages/data-designer/src uv run --group dev python scripts/benchmarks/run_async_scheduling_idle_regression.py --quick --artifact-dir .tmp/idle-regression-fixes --report-path .tmp/idle-regression-fixes/report.html --summary-path .tmp/idle-regression-fixes/summary.json --checks-path .tmp/idle-regression-fixes/checks.json — PASS, 0 errors, 0 warnings

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A — restored historical version snapshots instead of changing current architecture docs)

@nabinchha nabinchha requested a review from a team as a code owner May 19, 2026 19:10
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

MkDocs preview: https://2d46e2e7.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-680.docs.buildwithfern.com/nemo/datadesigner​

Notebook tutorials are rendered without execution outputs in previews.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR hardens the async scheduling and request-admission stack with several targeted follow-up fixes: retry accounting is moved to the ModelRequestExecutor boundary so each outbound attempt gets its own acquire/release lease cycle, transport-layer retries are suppressed on underlying adapters when wrapped by ModelRequestExecutor, and hard terminal-denial reasons now raise RequestAdmissionError immediately from blocking wait loops.

  • Retry logic (model_request_executor.py, factory.py): _execute_sync/async wrap renamed *_attempt helpers in a bounded retry loop; _should_retry correctly excludes admission errors, rate-limit, and TIMEOUT without a status code.
  • Bounded release history (controller.py, task_admission.py): A parallel deque(maxlen=8192) evicts the oldest entry from the _released set on overflow, capping memory growth.
  • Capacity plan population (async_scheduler.py, dataset_builder.py): AsyncCapacityPlan is now filled with live request-pressure snapshots, provider/model static caps, domain limits, and observed maxima.

Confidence Score: 5/5

All lease release paths remain intact across the new retry loops and the terminal-denial fast-exit is correctly scoped to non-retryable reasons.

All exception paths in _execute_sync/async_attempt release the lease before propagating; _should_retry correctly excludes admission-error-wrapped ProviderError instances; bounded-history eviction keeps the _released set and deque in sync. Tests cover the new retry and denial behaviors end-to-end.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py Adds per-attempt acquire/release retry loop with _should_retry correctly guarding against admission errors, rate limits, and TIMEOUT-without-status-code; all exception paths release the lease before propagating.
packages/data-designer-engine/src/data_designer/engine/models/request_admission/controller.py Introduces terminal-denial early exit in both sync and async blocking-wait loops, and bounds the _released set to 8192 entries via a parallel deque; lock is correctly released before propagating RequestAdmissionError.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_admission.py Mirrors the controller's bounded-history pattern for task lease release tracking; _remember_released correctly evicts the front of the deque from the set before appending.
packages/data-designer-engine/src/data_designer/engine/models/clients/factory.py Disables transport-layer retries on underlying adapters when ModelRequestExecutor wraps them, forwarding the full retry_config so every retry gets its own admission lease cycle.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py Populates AsyncCapacityPlan with request resources, provider/model static caps, domain limits, runtime pressure snapshots, and observed maxima from the wired-in RequestPressureSnapshotProvider.
tests_e2e/tests/test_mcp_demo.py Changes ordering assertion to use (msg_index, tool_call_index) tuple so multiple tool calls within the same assistant message are ordered correctly via lexicographic tuple comparison.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ModelRequestExecutor
    participant RequestAdmissionController
    participant InnerClient

    loop For each attempt (1..max_retries+1)
        Caller->>ModelRequestExecutor: completion(request)
        ModelRequestExecutor->>RequestAdmissionController: acquire_sync(item)
        alt Hard/shutdown denial (terminal)
            RequestAdmissionController-->>ModelRequestExecutor: raise RequestAdmissionError
            ModelRequestExecutor-->>Caller: raise ProviderError(TIMEOUT) [no retry]
        else Lease granted
            RequestAdmissionController-->>ModelRequestExecutor: RequestAdmissionLease
            ModelRequestExecutor->>InnerClient: call()
            alt Success
                InnerClient-->>ModelRequestExecutor: result
                ModelRequestExecutor->>RequestAdmissionController: release(lease, success)
                ModelRequestExecutor-->>Caller: result
            else Retryable ProviderError (e.g. 503)
                InnerClient-->>ModelRequestExecutor: ProviderError
                ModelRequestExecutor->>RequestAdmissionController: release(lease, provider_failure)
                Note over ModelRequestExecutor: _should_retry True, sleep, retry
            else Non-retryable
                InnerClient-->>ModelRequestExecutor: ProviderError
                ModelRequestExecutor->>RequestAdmissionController: release(lease, outcome)
                ModelRequestExecutor-->>Caller: raise ProviderError [no retry]
            end
        end
    end
Loading

Reviews (5): Last reviewed commit: "refactor: bound released lease history w..." | Re-trigger Greptile

@eric-tramel eric-tramel self-assigned this May 19, 2026
Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha force-pushed the codex/661-review-fixes branch from aaea837 to 6aba2f4 Compare May 19, 2026 19:28
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
@eric-tramel eric-tramel merged commit 2c4379a into scheduling-yolo May 19, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants