fix(dynamo): multimodal chat processor, hybrid KV cache compat, subprocess env passthrough#2053
Conversation
Greptile SummaryThis PR applies three targeted fixes to the Dynamo inference backend: a multimodal chat-processor passthrough (removing implicit
Confidence Score: 4/5Safe to merge with awareness that the disagg path retains unconditional --kv-events-config emission, which conflicts with any model that sets disable_hybrid_kv_cache_manager=False in disaggregated mode. The aggregated worker path is correctly guarded and well-tested. The disagg path, however, still unconditionally passes --kv-events-config regardless of whether the caller has opted into the hybrid KV cache manager — the same conflict vLLM rejects in the aggregated path remains live in _launch_disagg_role. The PR description claims both paths were fixed, but only one was. nemo_curator/core/serve/dynamo/vllm.py — specifically _launch_disagg_role, which lacks the hybrid-KV guard that _launch_vllm_worker received in this PR. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DynamoServerConfig\nsubprocess_env + request_plane + router] --> B[_deploy_and_healthcheck]
B --> C["base_env = {**subprocess_env, ETCD_ENDPOINTS, NATS_SERVER}"]
C --> D[launch_replicas\naggregated path]
C --> E[launch_disagg_replicas\ndisagg path]
C --> F[_launch_frontend]
D --> G[_launch_vllm_worker per node rank]
G --> H{explicit_hybrid_kv\n_cache_manager_enabled?}
H -- No --> I[build_worker_kv_events_config\nemit --kv-events-config]
H -- Yes & kv_events disabled --> J[skip --kv-events-config]
E --> K[_launch_disagg_role]
K --> L[build_worker_kv_events_config\nalways emit --kv-events-config\nno hybrid-KV guard]
F --> M[engine_kwargs_to_cli_flags\n_frontend_router_kwargs passthrough]
M --> N[python -m dynamo.frontend\nsubprocess_env = base_env]
I --> O[python -m dynamo.vllm\nsubprocess_env = base_env + PYTHONHASHSEED]
L --> P[python -m dynamo.vllm disagg\nsubprocess_env = base_env + nixl port]
Reviews (4): Last reviewed commit: "refactor(dynamo): remove auto-inject of ..." | Re-trigger Greptile |
| if isinstance(value, bool): | ||
| if value: | ||
| flags.append(flag) | ||
| else: | ||
| flags.append("--no-" + key.replace("_", "-")) |
There was a problem hiding this comment.
--no-<flag> emitted for all False booleans, not only BooleanOptionalAction ones
The docstring now states this is for vLLM BooleanOptionalAction arguments, but the code applies the rule to every bool value that is False in any kwargs dict — including dynamo_kwargs, router_kwargs, and future users of this helper. vLLM (and Dynamo) have many flags that use store_true and have no --no-<flag> form; passing an unrecognised --no-* flag will cause the subprocess to exit with an argument-parsing error. There is no guard, validator, or documented list of which flags are safe to set to False.
|
@claude review |
f300f60 to
d40d27f
Compare
…ubprocess env passthrough - Default Dynamo frontend to vLLM chat processor for multimodal models: the native-Rust processor serializes OpenAI multimodal content arrays instead of flattening them, corrupting prompts with pass-through chat templates - Skip --kv-events-config when user explicitly enables hybrid KV cache manager (disable_hybrid_kv_cache_manager=False); vLLM rejects both flags together even when kv_cache_events=False - Merge subprocess_env from backend config into the base env dict so user-supplied env vars (e.g. DYN_TCP_REQUEST_TIMEOUT) reach frontend and worker processes - Add tests covering all three fixes Signed-off-by: Praateek <praateekm@gmail.com>
…odal models dyn_chat_processor=vllm is only required for Nemotron-Parse (pass-through chat template). Auto-detecting multimodal and injecting it as a default would cause unnecessary latency for other multimodal models. Callers must now set it explicitly in router_kwargs when needed. Signed-off-by: Praateek <praateekm@gmail.com>
b181bc5 to
3849c1b
Compare
oyilmaz-nvidia
left a comment
There was a problem hiding this comment.
Do we have a documentation or example that users can follow? If yes, those need updates. If no, we need that documentation and/or examples.
| kv_events_config = None | ||
| # vLLM treats any non-None kv_events_config as incompatible with explicitly | ||
| # enabled hybrid KV cache manager, even when enable_kv_cache_events=False. | ||
| if kv_events_enabled or not explicit_hybrid_kv_cache_manager_enabled(model_config.engine_kwargs): |
There was a problem hiding this comment.
Do we need this same check in the _launch_disagg_role() as well?
Summary
Multimodal frontend processor: Dynamo's native-Rust chat processor serializes OpenAI multimodal `content` arrays instead of flattening them, corrupting prompts for models with pass-through chat templates (e.g. Nemotron-Parse's `{{ message['content'] }}`). `DynamoBackend._frontend_router_kwargs` is now a pure passthrough — callers must set `dyn_chat_processor="vllm"` explicitly in `router_kwargs` when needed (see Nemotron-Parse example below).
Hybrid KV cache manager compat (prefill/decode + disagg paths): vLLM rejects `--kv-events-config` when `--no-disable-hybrid-kv-cache-manager` is also passed, even if `enable_kv_cache_events=False`. Both `_launch_vllm_worker` and `_launch_disagg_role` now skip `--kv-events-config` when the caller has explicitly opted in to the hybrid KV cache manager via `disable_hybrid_kv_cache_manager=False` in `engine_kwargs`.
Subprocess env passthrough: `backend_cfg.subprocess_env` is now merged into the base env dict passed to frontend and worker processes, so user-supplied vars like `DYN_TCP_REQUEST_TIMEOUT` actually reach the subprocesses.
Nemotron-Parse setup
All three fixes above are required to serve Nemotron-Parse via `InferenceServer`:
Test plan
tests/core/serve/dynamo/test_backend.py— passthrough (no auto-inject), explicit processor forwarded, text-only unchanged, TCP env passthroughtests/core/serve/dynamo/test_vllm.py— hybrid KV cache manager skip path for both worker and disagg rolestests/core/serve/dynamo/test_infra.py🤖 Generated with Claude Code