Skip to content

fix(dynamo): multimodal chat processor, hybrid KV cache compat, subprocess env passthrough#2053

Open
praateekmahajan wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
praateekmahajan:praateek/dynamo-server-fixes
Open

fix(dynamo): multimodal chat processor, hybrid KV cache compat, subprocess env passthrough#2053
praateekmahajan wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
praateekmahajan:praateek/dynamo-server-fixes

Conversation

@praateekmahajan

@praateekmahajan praateekmahajan commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Multimodal frontend processor: Dynamo's native-Rust chat processor serializes OpenAI multimodal `content` arrays instead of flattening them, corrupting prompts for models with pass-through chat templates (e.g. Nemotron-Parse's `{{ message['content'] }}`). `DynamoBackend._frontend_router_kwargs` is now a pure passthrough — callers must set `dyn_chat_processor="vllm"` explicitly in `router_kwargs` when needed (see Nemotron-Parse example below).

  • Hybrid KV cache manager compat (prefill/decode + disagg paths): vLLM rejects `--kv-events-config` when `--no-disable-hybrid-kv-cache-manager` is also passed, even if `enable_kv_cache_events=False`. Both `_launch_vllm_worker` and `_launch_disagg_role` now skip `--kv-events-config` when the caller has explicitly opted in to the hybrid KV cache manager via `disable_hybrid_kv_cache_manager=False` in `engine_kwargs`.

  • Subprocess env passthrough: `backend_cfg.subprocess_env` is now merged into the base env dict passed to frontend and worker processes, so user-supplied vars like `DYN_TCP_REQUEST_TIMEOUT` actually reach the subprocesses.

Nemotron-Parse setup

All three fixes above are required to serve Nemotron-Parse via `InferenceServer`:

from nemo_curator.core.serve import InferenceServer
from nemo_curator.core.serve.dynamo import DynamoServerConfig, DynamoRouterConfig
from nemo_curator.core.serve.dynamo.vllm import DynamoVLLMModelConfig

server = InferenceServer(
    models=[DynamoVLLMModelConfig(
        model_identifier="/path/to/NVIDIA-Nemotron-Parse-v1.2",
        engine_kwargs={
            "trust_remote_code": True,
            "dtype": "bfloat16",
            "limit_mm_per_prompt": {"image": 1},
            "enable_prefix_caching": False,   # encoder-decoder models crash with APC
            "disable_hybrid_kv_cache_manager": False,  # fix: skip --kv-events-config
        },
        dynamo_kwargs={"enable_multimodal": True},
    )],
    backend=DynamoServerConfig(
        request_plane="tcp",
        router=DynamoRouterConfig(router_kwargs={
            "dyn_chat_processor": "vllm",     # fix: native-Rust corrupts multimodal arrays
            "trust_remote_code": True,
            "chat_template_content_format": "string",
        }),
        subprocess_env={"DYN_TCP_REQUEST_TIMEOUT": "180"},  # fix: env now reaches subprocess
    ),
)

Test plan

  • tests/core/serve/dynamo/test_backend.py — passthrough (no auto-inject), explicit processor forwarded, text-only unchanged, TCP env passthrough
  • tests/core/serve/dynamo/test_vllm.py — hybrid KV cache manager skip path for both worker and disagg roles
  • tests/core/serve/dynamo/test_infra.py

🤖 Generated with Claude Code

@praateekmahajan praateekmahajan requested a review from a team as a code owner June 5, 2026 22:14
@praateekmahajan praateekmahajan requested review from oyilmaz-nvidia and removed request for a team June 5, 2026 22:14
@copy-pr-bot

copy-pr-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR applies three targeted fixes to the Dynamo inference backend: a multimodal chat-processor passthrough (removing implicit dyn_chat_processor injection so callers must opt in), a guard in the aggregated (_launch_vllm_worker) path to skip --kv-events-config when hybrid KV cache manager is explicitly enabled, and a subprocess_env merge into base_env so user-supplied env vars reach frontend and worker subprocesses.

  • Multimodal chat processor: _frontend_router_kwargs is now a pure copy of router_kwargs; callers must set dyn_chat_processor=\"vllm\" explicitly for models whose pass-through chat templates are corrupted by the native-Rust processor.
  • Hybrid KV cache manager compat: explicit_hybrid_kv_cache_manager_enabled is introduced and used in _launch_vllm_worker to conditionally omit --kv-events-config; the disagg path (_launch_disagg_role) retains its unconditional --kv-events-config emission (no guard added there).
  • Subprocess env passthrough: backend_cfg.subprocess_env is spread first into base_env (before ETCD_ENDPOINTS/NATS_SERVER), so system-computed values always win over user-supplied ones.

Confidence Score: 4/5

Safe to merge with awareness that the disagg path retains unconditional --kv-events-config emission, which conflicts with any model that sets disable_hybrid_kv_cache_manager=False in disaggregated mode.

The aggregated worker path is correctly guarded and well-tested. The disagg path, however, still unconditionally passes --kv-events-config regardless of whether the caller has opted into the hybrid KV cache manager — the same conflict vLLM rejects in the aggregated path remains live in _launch_disagg_role. The PR description claims both paths were fixed, but only one was.

nemo_curator/core/serve/dynamo/vllm.py — specifically _launch_disagg_role, which lacks the hybrid-KV guard that _launch_vllm_worker received in this PR.

Important Files Changed

Filename Overview
nemo_curator/core/serve/dynamo/backend.py Adds _frontend_router_kwargs (pure passthrough) and merges backend_cfg.subprocess_env into base_env; both changes are correct and straightforward.
nemo_curator/core/serve/dynamo/vllm.py Aggregated path (_launch_vllm_worker) gains the hybrid-KV guard correctly; disagg path (_launch_disagg_role) still passes --kv-events-config unconditionally, conflicting with the PR description's claim that both paths were fixed.
nemo_curator/core/serve/dynamo/infra.py engine_kwargs_to_cli_flags now emits --no-<flag> for every bool False value, not just BooleanOptionalAction ones; this can break vLLM/Dynamo flags that use store_true and have no --no-* form.
nemo_curator/core/serve/dynamo/config.py Adds subprocess_env: dict[str, str] field with a default empty dict; clean, minimal, correct change.
tests/core/serve/dynamo/test_backend.py New multimodal and chat-processor tests are correct; the env-passthrough test bypasses the actual subprocess_env merge point in _deploy_and_healthcheck, leaving the core fix without direct coverage.
tests/core/serve/dynamo/test_vllm.py New test_explicit_hma_omits_disabled_kv_events_config correctly exercises the aggregated hybrid-KV guard; disagg counterpart is absent but mirrors the missing production fix.
tests/core/serve/dynamo/test_infra.py Parametrized test updated to reflect the new --no-enforce-eager / --no-disable-hybrid-kv-cache-manager behavior; accurately captures the changed semantics.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DynamoServerConfig\nsubprocess_env + request_plane + router] --> B[_deploy_and_healthcheck]

    B --> C["base_env = {**subprocess_env, ETCD_ENDPOINTS, NATS_SERVER}"]

    C --> D[launch_replicas\naggregated path]
    C --> E[launch_disagg_replicas\ndisagg path]
    C --> F[_launch_frontend]

    D --> G[_launch_vllm_worker per node rank]
    G --> H{explicit_hybrid_kv\n_cache_manager_enabled?}
    H -- No --> I[build_worker_kv_events_config\nemit --kv-events-config]
    H -- Yes & kv_events disabled --> J[skip --kv-events-config]

    E --> K[_launch_disagg_role]
    K --> L[build_worker_kv_events_config\nalways emit --kv-events-config\nno hybrid-KV guard]

    F --> M[engine_kwargs_to_cli_flags\n_frontend_router_kwargs passthrough]
    M --> N[python -m dynamo.frontend\nsubprocess_env = base_env]
    I --> O[python -m dynamo.vllm\nsubprocess_env = base_env + PYTHONHASHSEED]
    L --> P[python -m dynamo.vllm disagg\nsubprocess_env = base_env + nixl port]
Loading

Reviews (4): Last reviewed commit: "refactor(dynamo): remove auto-inject of ..." | Re-trigger Greptile

Comment on lines 118 to +122
if isinstance(value, bool):
if value:
flags.append(flag)
else:
flags.append("--no-" + key.replace("_", "-"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 --no-<flag> emitted for all False booleans, not only BooleanOptionalAction ones

The docstring now states this is for vLLM BooleanOptionalAction arguments, but the code applies the rule to every bool value that is False in any kwargs dict — including dynamo_kwargs, router_kwargs, and future users of this helper. vLLM (and Dynamo) have many flags that use store_true and have no --no-<flag> form; passing an unrecognised --no-* flag will cause the subprocess to exit with an argument-parsing error. There is no guard, validator, or documented list of which flags are safe to set to False.

@praateekmahajan

Copy link
Copy Markdown
Contributor Author

@claude review

@praateekmahajan praateekmahajan force-pushed the praateek/dynamo-server-fixes branch from f300f60 to d40d27f Compare June 5, 2026 22:38
…ubprocess env passthrough

- Default Dynamo frontend to vLLM chat processor for multimodal models: the
  native-Rust processor serializes OpenAI multimodal content arrays instead of
  flattening them, corrupting prompts with pass-through chat templates
- Skip --kv-events-config when user explicitly enables hybrid KV cache manager
  (disable_hybrid_kv_cache_manager=False); vLLM rejects both flags together even
  when kv_cache_events=False
- Merge subprocess_env from backend config into the base env dict so user-supplied
  env vars (e.g. DYN_TCP_REQUEST_TIMEOUT) reach frontend and worker processes
- Add tests covering all three fixes

Signed-off-by: Praateek <praateekm@gmail.com>
…odal models

dyn_chat_processor=vllm is only required for Nemotron-Parse (pass-through
chat template). Auto-detecting multimodal and injecting it as a default
would cause unnecessary latency for other multimodal models. Callers must
now set it explicitly in router_kwargs when needed.

Signed-off-by: Praateek <praateekm@gmail.com>

@oyilmaz-nvidia oyilmaz-nvidia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a documentation or example that users can follow? If yes, those need updates. If no, we need that documentation and/or examples.

kv_events_config = None
# vLLM treats any non-None kv_events_config as incompatible with explicitly
# enabled hybrid KV cache manager, even when enable_kv_cache_events=False.
if kv_events_enabled or not explicit_hybrid_kv_cache_manager_enabled(model_config.engine_kwargs):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this same check in the _launch_disagg_role() as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants