From 040d5d61702714b22019ba8bae80907cd5fc3fa9 Mon Sep 17 00:00:00 2001
From: Maciej Bala <mbala@nvidia.com>
Date: Tue, 16 Jun 2026 14:51:27 +0200
Subject: [PATCH 1/2] Initial version of video transfer for vLLM-Omni

Signed-off-by: Maciej Bala <mbala@nvidia.com>
---
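Reviewer note (ignored by `git am`): the transfer-control `curl` example this patch adds to `README.md` passes its options as one JSON string in the `extra_params` form field. The sketch below shows one way to assemble that string in Python. `build_transfer_extra_params` is a hypothetical helper, not part of vLLM-Omni, and it assumes every hint (`edge`, `blur`, `depth`, `seg`, `wsm`) accepts the same `{"control_path": ...}` shape that the patch shows only for `depth`.

```python
import json

# Hint names listed in the README table row this patch adds.
TRANSFER_HINTS = {"edge", "blur", "depth", "seg", "wsm"}


def build_transfer_extra_params(hint, control_path, control_guidance=1.5, frames=121):
    """Hypothetical helper: return the JSON string for the `extra_params` field.

    Assumption: each hint takes {"control_path": ...}; the patch only
    demonstrates this shape for "depth".
    """
    if hint not in TRANSFER_HINTS:
        raise ValueError(f"unknown transfer hint: {hint!r}")
    params = {
        "use_resolution_template": False,
        "use_duration_template": False,
        "guardrails": True,
        hint: {"control_path": control_path},
        "control_guidance": control_guidance,
        "num_video_frames_per_chunk": frames,
        "max_frames": frames,
    }
    # Compact separators keep the string identical to the curl literal.
    return json.dumps(params, separators=(",", ":"))


extra_params = build_transfer_extra_params(
    "depth",
    "cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4",
)
print(extra_params)
```

Send the resulting string as a plain text field (`--form-string` with `curl`, or `data=` with `requests`) rather than with `-F`, for the reason the README gives about `;` handling.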
 README.md                                     |  53 +-
 cookbooks/cosmos3/README.md                   |  22 +-
 .../cosmos3/generator/audiovisual/README.md   |  49 +-
 .../audiovisual/run_with_vllm_omni.ipynb      | 195 ++++++-
 .../cosmos3/generator/transfer/README.md      |  95 +++-
 ...video_transfer_with_cosmos_framework.ipynb |   2 +-
 .../run_video_transfer_with_vllm_omni.ipynb   | 492 ++++++++++++++++++
 7 files changed, 862 insertions(+), 46 deletions(-)
 create mode 100644 cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb

diff --git a/README.md b/README.md
index d54eaa07..a5f808aa 100644
--- a/README.md
+++ b/README.md
@@ -291,9 +291,9 @@ See the [Cosmos 3 Diffusers documentation](https://huggingface.co/docs/diffusers
 
 Use vLLM-Omni for Generator production inference behind an OpenAI-compatible API. This integration loads the full Cosmos 3 checkpoint, including the Qwen3-VL-based reasoner path and the diffusion generation path. For understanding-only tasks that return text, use [Reasoner with vLLM](#reasoner-with-vllm) instead, which loads only the reasoner.
 
-> **Compatibility status:** Cosmos 3 Generator support has landed in [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) `main`: text-to-image, text-to-video, and image-to-video ([#3454](https://github.com/vllm-project/vllm-omni/pull/3454)) and video-with-sound ([#4073](https://github.com/vllm-project/vllm-omni/pull/4073)) are merged; action (policy / forward-dynamics) is in review ([#4102](https://github.com/vllm-project/vllm-omni/pull/4102)) and video-to-video is planned. The `vllm/vllm-omni:cosmos3` Docker image remains the easiest all-in-one build. For current setup and per-modality usage, see the maintained recipes: [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md).
+> **Compatibility status:** Cosmos 3 Generator support is available in [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) `main` for text-to-image, text-to-video, image-to-video, video-to-video, transfer-control video-to-video, video-with-sound, and action generation. The `vllm/vllm-omni:cosmos3` Docker image remains the easiest all-in-one build. For current setup and per-modality usage, see the maintained recipes: [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md).
 
-Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read.
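+Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read. The command below mounts the current directory at `/workspace` and sets it as the container working directory with `-w /workspace`, so repo-local paths such as `cookbooks/...` resolve inside the container when you launch it from the repository root.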
 
 ```shell
 docker run --runtime nvidia --gpus all \
@@ -301,6 +301,7 @@ docker run --runtime nvidia --gpus all \
   -v "$(pwd):/workspace" \
   -p 8000:8000 \
   --ipc=host \
+  -w /workspace \
   vllm/vllm-omni:cosmos3 \
   vllm serve nvidia/Cosmos3-Nano \
   --omni \
@@ -326,7 +327,7 @@ Additional parallelism options:
 
 When combining parallelism options, ensure the server has enough GPUs for the product of the enabled degrees (`tensor_parallel_size` × `cfg_parallel_size` × `ulysses_degree`).
 
-To install vLLM-Omni from `main` instead of using the Docker image (text-to-image, text-to-video, image-to-video, and video-with-sound are merged there; see the [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md) recipes for per-modality usage), create a venv and install, choosing the CUDA build that matches your driver:
+To install vLLM-Omni from `main` instead of using the Docker image, create a venv and install, choosing the CUDA build that matches your driver. This path uses the same request formats as the Docker image; see the [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md) recipes for per-modality usage:
 
 ```shell
 uv venv --python 3.13 --seed --managed-python
@@ -349,7 +350,8 @@ Vision endpoints:
 | Text to video | `POST /v1/videos/sync` | Blocks and returns the MP4 bytes directly |
 | Image to video | `POST /v1/videos/sync` | Upload the conditioning image with `input_reference` |
 | Video to video | `POST /v1/videos/sync` | Upload a source video and choose which frames stay as clean conditioning |
-| Video with sound | `POST /v1/videos/sync` | Add `generate_sound=true` to produce a soundtrack alongside the video |
+| Transfer video to video | `POST /v1/videos/sync` | Pass a transfer hint such as `edge`, `blur`, `depth`, `seg`, or `wsm` in `extra_params` |
+| Video with sound | `POST /v1/videos/sync` | Add `generate_sound=true` to supported text-to-video or image-to-video requests |
 
 Action modes use Cosmos 3 as a world model: they condition on an embodiment (`domain_name`) and exchange video and action sequences. Policy and inverse dynamics return a predicted action chunk, so send those through the asynchronous `POST /v1/videos` job and read the action data from the completed result; forward dynamics returns only video and can use synchronous `POST /v1/videos/sync`.
 
@@ -378,6 +380,43 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   -o cosmos3_t2v_output.mp4
 ```
 
+Example video-to-video request:
+
+```shell
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "prompt=Continue the same driving scene with smooth natural motion." \
+  --form-string "negative_prompt=blurry, distorted, low quality, jittery, deformed" \
+  --form-string "size=832x480" \
+  --form-string "num_frames=61" \
+  --form-string "fps=10" \
+  --form-string "num_inference_steps=35" \
+  --form-string "guidance_scale=6.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=2222" \
+  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"condition_frame_indexes_vision":[0,1],"condition_video_keep":"first"}' \
+  -F "input_reference=@cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4;type=video/mp4" \
+  -o cosmos3_v2v_output.mp4
+```
+
+Example transfer-control request:
+
+```shell
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "prompt=Generate a realistic scene following the provided depth control video." \
+  --form-string "negative_prompt=blurry, distorted, low quality" \
+  --form-string "size=1280x720" \
+  --form-string "num_frames=121" \
+  --form-string "fps=30" \
+  --form-string "num_inference_steps=50" \
+  --form-string "guidance_scale=3.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=2026" \
+  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \
+  -o cosmos3_transfer_depth.mp4
+```
+
 Use `--form-string` for text fields (`prompt`, `negative_prompt`, `extra_params`) rather than `-F`: with `-F`, `curl` treats `;` as a content-type separator and silently truncates any value that contains one.
 
 Common request fields (the image endpoint follows the [Image Generation API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/image_generation_api/), and the video endpoints follow the [Videos API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/videos_api/#request-parameters)):
@@ -394,7 +433,8 @@ Common request fields (the image endpoint follows the [Image Generation API](htt
 | `seed` | Reproducibility seed |
 | `max_sequence_length` | Maximum number of prompt tokens kept for conditioning (Cosmos 3 default `512`); longer prompts are truncated with a warning, shorter ones padded |
 | `input_reference` | Uploaded image or video for image-to-video, video-to-video, and action requests |
-| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle |
+| `video_reference` | JSON-safe video reference for video-to-video requests, such as `{"video_url":"https://..."}`; do not combine with `input_reference` or `image_reference` |
+| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle |
 | `extra_args` | JSON object for Cosmos 3-specific image-endpoint options such as `use_resolution_template` |
 
 Disabling guardrails: Cosmos 3 ships safety guardrails that screen prompts and blur faces in generated output. Disable them per request by adding `guardrails: false` to `extra_params`:
@@ -633,12 +673,13 @@ We are building examples that show Cosmos 3 capabilities end to end, including w
 | --- | --- | --- | --- | --- |
 | Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) |
 | Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) |
-| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
+| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, and video-to-video without sound, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
 | Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) |
 | Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) |
 | Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) |
 | Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) |
 | Transfer with Cosmos Framework | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) |
+| Transfer with vLLM-Omni | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb) |
 | Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) |
 | Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). | [Notebook](cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) |
 | Reasoner with NIM | Reasoner | The same image and video reasoning examples as the vLLM notebook, run against the prebuilt, OpenAI-compatible [Cosmos 3 Reasoner NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner) container; local media is sent as base64 data URIs. | [Notebook](cookbooks/cosmos3/reasoner/run_with_nim.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_nim.ipynb) |
diff --git a/cookbooks/cosmos3/README.md b/cookbooks/cosmos3/README.md
index ecf78a06..71a1c469 100644
--- a/cookbooks/cosmos3/README.md
+++ b/cookbooks/cosmos3/README.md
@@ -10,7 +10,7 @@ backend you want to run and follow that one section.
 | [Diffusers](#diffusers) | Direct generation with `Cosmos3OmniPipeline` | Generator (Audiovisual) |
 | [Transformers](#transformers-coming-soon) | Hugging Face Transformers inference | Reasoner |
 | [vLLM](#vllm) | OpenAI-compatible reasoning server (image/video understanding) | Reasoner |
-| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action) | Generator (Audiovisual, Action) |
+| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action/transfer) | Generator (Audiovisual, Action, Transfer) |
 | [NIM](#nim) | Prebuilt OpenAI-compatible reasoning server (image/video understanding); no venv | Reasoner |
 
 ## Prerequisites
@@ -250,7 +250,7 @@ graphs compile.
 
 ## vLLM-Omni
 
-OpenAI-compatible **generation** server (image/video/audio/action) for the
+OpenAI-compatible **generation** server (image/video/audio/action/transfer) for the
 Generator cookbooks.
 
 Cosmos3 checkpoints can exceed the default server init timeout — always pass
@@ -259,7 +259,7 @@ Cosmos3 checkpoints can exceed the default server init timeout — always pass
 ### Option 1: Docker (recommended)
 
 The prebuilt image `vllm/vllm-omni:cosmos3` supports every Generator modality
-(including action). Pull once:
+(including video-to-video, transfer controls, and action). Pull once:
 
 ```bash
 docker pull vllm/vllm-omni:cosmos3
@@ -275,7 +275,8 @@ export COSMOS3_HOST_PORT="${COSMOS3_HOST_PORT:-8000}"
 
 The container listens on port 8000; `-p "${COSMOS3_HOST_PORT}:8000"` publishes it
 on the host. Generator notebooks often use `COSMOS3_HOST_PORT=8001` so port 8000
-stays free for a Reasoner server.
+stays free for a Reasoner server. The Docker commands run from `/workspace`, so
+repo-local paths such as `cookbooks/...` resolve inside the container.
 
 **Cosmos3-Nano** (single GPU):
 
@@ -285,6 +286,7 @@ docker run --runtime nvidia --gpus '"device=0"' \
   -v "${HF_HOME}:/root/.cache/huggingface" \
   -v "${COSMOS3_WORKDIR}:/workspace" \
   -p "${COSMOS3_HOST_PORT}:8000" --ipc=host \
+  -w /workspace \
   vllm/vllm-omni:cosmos3 \
   vllm serve nvidia/Cosmos3-Nano \
     --omni \
@@ -301,6 +303,7 @@ docker run --runtime nvidia --gpus all \
   -v "${HF_HOME}:/root/.cache/huggingface" \
   -v "${COSMOS3_WORKDIR}:/workspace" \
   -p "${COSMOS3_HOST_PORT}:8000" --ipc=host \
+  -w /workspace \
   vllm/vllm-omni:cosmos3 \
   vllm serve nvidia/Cosmos3-Super \
     --omni \
@@ -318,11 +321,10 @@ filesystem should be readable.
 
 vLLM-Omni prints `Application startup complete.` when the API is ready.
 
-### Option 2: Native venv (limited modalities)
+### Option 2: Native venv
 
-To install from the upstreaming PR branch instead of Docker (text-to-image,
-text-to-video, and image-to-video only — not action or sound yet), create a venv
-and pick the CUDA build that matches your driver (see
+To install from `main` instead of Docker, create a venv and pick the CUDA build
+that matches your driver (see
 [CUDA driver and the `cuXXX` backend](#cuda-driver-and-the-cuxxx-backend)):
 
 ```bash
@@ -331,11 +333,11 @@ source .venv/bin/activate
 
 # CUDA 13 driver:
 uv pip install --torch-backend=cu130 \
-  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
+  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@main"
 
 # CUDA 12.x driver:
 # uv pip install --torch-backend=cu128 \
-#   "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
+#   "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@main"
 ```
 
 Run the same `vllm serve` arguments as in the Docker commands above, directly on
diff --git a/cookbooks/cosmos3/generator/audiovisual/README.md b/cookbooks/cosmos3/generator/audiovisual/README.md
index d80adad4..d95bf835 100644
--- a/cookbooks/cosmos3/generator/audiovisual/README.md
+++ b/cookbooks/cosmos3/generator/audiovisual/README.md
@@ -1,8 +1,8 @@
 # Cosmos3 Generator Audiovisual Examples
 
-Generate images and video (with optional audio) from text or image prompts with
-`Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. Sample
-prompts live under [`assets/`](./assets).
+Generate images and video (with optional audio) from text, image, or video
+prompts with `Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends.
+Sample prompts live under [`assets/`](./assets).
 
 Environment setup for every backend is centralized in the shared
 [Cosmos3 cookbooks environment setup](../../README.md) guide; each backend below
@@ -176,11 +176,48 @@ Path("/tmp/cosmos3_t2v.mp4").write_bytes(response.content)
 
 For image-to-video, post to the same endpoint with an image under
 `files={"input_reference": ...}`. For audio, add `"generate_sound": "true"`.
+For video-to-video, upload a source video under `input_reference` and choose the
+clean conditioning frames through `extra_params`:
+
+```python
+from pathlib import Path
+
+source_video = Path("../action/assets/videos/av_0.mp4").resolve()
+with source_video.open("rb") as video_file:
+    response = requests.post(
+        "http://localhost:8000/v1/videos/sync",
+        data={
+            "prompt": "Continue the same driving scene with smooth natural motion.",
+            "negative_prompt": "blurry, distorted, low quality, jittery, deformed",
+            "size": "832x480",
+            "num_frames": "61",
+            "fps": "10",
+            "num_inference_steps": "35",
+            "guidance_scale": "6.0",
+            "flow_shift": "10.0",
+            "seed": "2222",
+            "extra_params": json.dumps(
+                {
+                    "use_resolution_template": False,
+                    "use_duration_template": False,
+                    "guardrails": True,
+                    "condition_frame_indexes_vision": [0, 1],
+                    "condition_video_keep": "first",
+                }
+            ),
+        },
+        files={"input_reference": (source_video.name, video_file, "video/mp4")},
+        headers={"Accept": "video/mp4"},
+    )
+response.raise_for_status()
+Path("/tmp/cosmos3_v2v.mp4").write_bytes(response.content)
+```
 
 ### Notebook walkthrough
 
 [`run_with_vllm_omni.ipynb`](./run_with_vllm_omni.ipynb) is the full tutorial for
 the vLLM-Omni backend: it walks through text-to-image, text-to-video, and
-image-to-video requests with audio on or off. Server launch options (Nano and
-Super, tensor parallelism, layerwise offload, and CFG-parallel variants) live in
-the [shared environment setup guide](../../README.md#vllm-omni).
+image-to-video requests with audio on or off, plus standard video-to-video
+requests. Server launch options (Nano and Super, tensor parallelism, layerwise
+offload, and CFG-parallel variants) live in the
+[shared environment setup guide](../../README.md#vllm-omni).
diff --git a/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb
index a74f91a2..fae30615 100644
--- a/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb
+++ b/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb
@@ -16,7 +16,7 @@
     "\n",
     "This notebook calls already-running vLLM Omni Cosmos3 servers with direct `curl` requests from Python.\n",
     "\n",
-    "The examples are split into Cosmos3-Nano and Cosmos3-Super sections. Each section is self-contained, so you can run just one. Each section targets the matching model endpoint.\n"
+    "The examples are split into Cosmos3-Nano and Cosmos3-Super sections. Each section is self-contained, so you can run just one. Each section targets the matching model endpoint and includes standard video-to-video examples where an input video is uploaded as the reference.\n"
    ],
    "id": "d88fe9a8"
   },
@@ -54,6 +54,7 @@
     "  -v \"$(pwd):/workspace\" \\\n",
     "  -p 8000:8000 \\\n",
     "  --ipc=host \\\n",
+    "  -w /workspace \\\n",
     "  vllm/vllm-omni:cosmos3 \\\n",
     "  vllm serve nvidia/Cosmos3-Nano \\\n",
     "  --omni \\\n",
@@ -75,6 +76,7 @@
     "  -v \"$(pwd):/workspace\" \\\n",
     "  -p 8000:8000 \\\n",
     "  --ipc=host \\\n",
+    "  -w /workspace \\\n",
     "  vllm/vllm-omni:cosmos3 \\\n",
     "  vllm serve nvidia/Cosmos3-Super \\\n",
     "  --omni \\\n",
@@ -100,8 +102,7 @@
     "  --init-timeout 1800\n",
     "```\n",
     "\n",
-    "For Cosmos3, set CFG strength with the request-level `guidance_scale` field. Do not use `true_cfg_scale` for CFG Parallel with these Cosmos3 examples.\n",
-    ""
+    "For Cosmos3, set CFG strength with the request-level `guidance_scale` field. Do not use `true_cfg_scale` for CFG Parallel with these Cosmos3 examples.\n"
    ],
    "id": "26776c50"
   },
@@ -298,6 +299,22 @@
     "        \"image\": \"assets/images/image2video/coastal_road_audio.jpg\",\n",
     "        \"enable_sound\": True,\n",
     "    },\n",
+    "    \"v2v_nano_noaudio\": {\n",
+    "        \"model\": \"Cosmos3-Nano\",\n",
+    "        \"mode\": \"video2video\",\n",
+    "        \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n",
+    "        \"negative_prompt_mode\": \"image2video\",\n",
+    "        \"video\": \"../action/assets/videos/av_0.mp4\",\n",
+    "        \"enable_sound\": False,\n",
+    "        \"sampling\": {\n",
+    "            \"size\": \"832x480\",\n",
+    "            \"fps\": 10,\n",
+    "            \"num_frames\": 61,\n",
+    "            \"seed\": 2222,\n",
+    "            \"condition_frame_indexes_vision\": [0, 1],\n",
+    "            \"condition_video_keep\": \"first\",\n",
+    "        },\n",
+    "    },\n",
     "    \"t2v_super_noaudio\": {\n",
     "        \"model\": \"Cosmos3-Super\",\n",
     "        \"mode\": \"text2video\",\n",
@@ -311,6 +328,22 @@
     "        \"image\": \"assets/images/image2video/car_driving.jpg\",\n",
     "        \"enable_sound\": False,\n",
     "    },\n",
+    "    \"v2v_super_noaudio\": {\n",
+    "        \"model\": \"Cosmos3-Super\",\n",
+    "        \"mode\": \"video2video\",\n",
+    "        \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n",
+    "        \"negative_prompt_mode\": \"image2video\",\n",
+    "        \"video\": \"../action/assets/videos/av_0.mp4\",\n",
+    "        \"enable_sound\": False,\n",
+    "        \"sampling\": {\n",
+    "            \"size\": \"832x480\",\n",
+    "            \"fps\": 10,\n",
+    "            \"num_frames\": 61,\n",
+    "            \"seed\": 3333,\n",
+    "            \"condition_frame_indexes_vision\": [0, 1],\n",
+    "            \"condition_video_keep\": \"first\",\n",
+    "        },\n",
+    "    },\n",
     "}\n",
     "\n",
     "\n",
@@ -326,6 +359,9 @@
     "\n",
     "\n",
     "def payload_dimensions(payload: dict) -> tuple[int, int]:\n",
+    "    if payload.get(\"size\"):\n",
+    "        width, height = (int(part) for part in payload[\"size\"].split(\"x\", 1))\n",
+    "        return height, width\n",
     "    if payload.get(\"resolution\") == \"720\" and payload.get(\"aspect_ratio\") == \"16,9\":\n",
     "        return 720, 1280\n",
     "    if payload.get(\"resolution\") == \"256\" and payload.get(\"aspect_ratio\") == \"16,9\":\n",
@@ -350,7 +386,8 @@
     "    prompt_path = asset_path(spec[\"prompt\"])\n",
     "    negative_prompt = \"\"\n",
     "    if spec[\"mode\"] != \"text2image\":\n",
-    "        negative_prompt_path = asset_path(f\"assets/negative_prompts/{spec['mode']}/neg_prompt.json\")\n",
+    "        negative_prompt_mode = spec.get(\"negative_prompt_mode\", spec[\"mode\"])\n",
+    "        negative_prompt_path = asset_path(f\"assets/negative_prompts/{negative_prompt_mode}/neg_prompt.json\")\n",
     "        negative_prompt = compact_json_file(negative_prompt_path)\n",
     "    payload_path = payload_dir / f\"{use_case}.json\"\n",
     "    payload = {\n",
@@ -361,9 +398,13 @@
     "        \"enable_sound\": spec[\"enable_sound\"],\n",
     "        **FIXED_SAMPLING,\n",
     "    }\n",
+    "    payload.update(spec.get(\"sampling\", {}))\n",
     "    if spec[\"mode\"] == \"image2video\":\n",
     "        image_path = asset_path(spec[\"image\"])\n",
     "        payload[\"vision_path\"] = os.path.relpath(image_path, payload_path.parent)\n",
+    "    elif spec[\"mode\"] == \"video2video\":\n",
+    "        video_path = asset_path(spec[\"video\"])\n",
+    "        payload[\"vision_path\"] = os.path.relpath(video_path, payload_path.parent)\n",
     "\n",
     "    payload_path.write_text(json.dumps(payload, indent=2) + \"\\n\")\n",
     "\n",
@@ -375,10 +416,29 @@
     "    print(f\"output:  {output_dir}\")\n",
     "    print(f\"prompt:  {prompt_path.relative_to(COSMOS_ROOT)}\")\n",
     "    if \"vision_path\" in payload:\n",
-    "        image_display_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n",
-    "        print(f\"image:   {image_display_path.relative_to(COSMOS_ROOT)}\")\n",
-    "        display(Image(filename=str(image_display_path), width=420))\n",
-    "    print(json.dumps({k: payload[k] for k in [\"model_mode\", \"name\", \"enable_sound\", \"num_steps\", \"guidance\", \"shift\", \"fps\", \"num_frames\", \"resolution\", \"aspect_ratio\", \"seed\"]}, indent=2))\n",
+    "        reference_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n",
+    "        if payload[\"model_mode\"] == \"image2video\":\n",
+    "            print(f\"image:   {reference_path.relative_to(COSMOS_ROOT)}\")\n",
+    "            display(Image(filename=str(reference_path), width=420))\n",
+    "        else:\n",
+    "            print(f\"video:   {reference_path.relative_to(COSMOS_ROOT)}\")\n",
+    "    keys = [\n",
+    "        \"model_mode\",\n",
+    "        \"name\",\n",
+    "        \"enable_sound\",\n",
+    "        \"num_steps\",\n",
+    "        \"guidance\",\n",
+    "        \"shift\",\n",
+    "        \"fps\",\n",
+    "        \"num_frames\",\n",
+    "        \"resolution\",\n",
+    "        \"aspect_ratio\",\n",
+    "        \"size\",\n",
+    "        \"condition_frame_indexes_vision\",\n",
+    "        \"condition_video_keep\",\n",
+    "        \"seed\",\n",
+    "    ]\n",
+    "    print(json.dumps({k: payload[k] for k in keys if k in payload}, indent=2))\n",
     "    return payload_path, output_dir, spec[\"model\"]\n",
     "\n",
     "\n",
@@ -414,6 +474,9 @@
     "        \"use_duration_template\": False,\n",
     "        \"guardrails\": True,\n",
     "    }\n",
+    "    for key in (\"condition_frame_indexes_vision\", \"condition_video_keep\"):\n",
+    "        if key in payload:\n",
+    "            extra_params[key] = payload[key]\n",
     "    form = {\n",
     "        \"prompt\": payload[\"prompt\"],\n",
     "        \"negative_prompt\": payload[\"negative_prompt\"],\n",
@@ -475,9 +538,10 @@
     "    for key, value in build_vllm_form(payload).items():\n",
     "        cmd += [\"--form-string\", f\"{key}={value}\"]\n",
     "\n",
-    "    if payload[\"model_mode\"] == \"image2video\":\n",
-    "        image_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n",
-    "        cmd += [\"-F\", f\"input_reference=@{image_path}\"]\n",
+    "    if payload[\"model_mode\"] in {\"image2video\", \"video2video\"}:\n",
+    "        reference_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n",
+    "        media_type = \"video/mp4\" if payload[\"model_mode\"] == \"video2video\" else \"image/jpeg\"\n",
+    "        cmd += [\"-F\", f\"input_reference=@{reference_path};type={media_type}\"]\n",
     "\n",
     "    cmd += [\"-o\", str(tmp_path)]\n",
     "    result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n",
@@ -536,8 +600,9 @@
     "    print(\"endpoint:\", endpoint)\n",
     "    print(\"payload:\", payload_path)\n",
     "    print(\"output:\", output_path)\n",
-    "    if payload[\"model_mode\"] == \"image2video\":\n",
-    "        print(\"input image:\", resolve_payload_path(payload_path, payload[\"vision_path\"]))\n",
+    "    if payload.get(\"vision_path\"):\n",
+    "        label = \"input video\" if payload[\"model_mode\"] == \"video2video\" else \"input image\"\n",
+    "        print(f\"{label}:\", resolve_payload_path(payload_path, payload[\"vision_path\"]))\n",
     "    t0 = time.time()\n",
     "    if payload[\"model_mode\"] == \"text2image\":\n",
     "        post_image(payload=payload, output_path=output_path, model=model)\n",
@@ -889,6 +954,58 @@
    "outputs": [],
    "id": "58a28e11"
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Nano: Video to Video Without Audio\n",
+    "\n",
+    "Nano video-to-video generation using the checked-in autonomous-driving clip as the reference video. The request conditions on the first two latent video indexes and keeps the first source frames as clean conditioning.\n",
+    "\n",
+    "### Create Payload\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "v2v_nano_noaudio_payload, v2v_nano_noaudio_output, v2v_nano_noaudio_model = create_payload(\"v2v_nano_noaudio\", backend=\"vllm\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "run_vllm_payload(v2v_nano_noaudio_payload, v2v_nano_noaudio_output, model=\"Cosmos3-Nano\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### View Results\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_run(v2v_nano_noaudio_output)\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -1062,6 +1179,58 @@
    ],
    "execution_count": null,
    "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Super: Video to Video Without Audio\n",
+    "\n",
+    "Super video-to-video generation using the same reference-video request shape as Nano, against the `Cosmos3-Super` endpoint.\n",
+    "\n",
+    "### Create Payload\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "v2v_super_noaudio_payload, v2v_super_noaudio_output, v2v_super_noaudio_model = create_payload(\"v2v_super_noaudio\", backend=\"vllm\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "run_vllm_payload(v2v_super_noaudio_payload, v2v_super_noaudio_output, model=\"Cosmos3-Super\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### View Results\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_run(v2v_super_noaudio_output)\n"
+   ]
   }
  ],
  "metadata": {
diff --git a/cookbooks/cosmos3/generator/transfer/README.md b/cookbooks/cosmos3/generator/transfer/README.md
index 0c477056..13991561 100644
--- a/cookbooks/cosmos3/generator/transfer/README.md
+++ b/cookbooks/cosmos3/generator/transfer/README.md
@@ -1,7 +1,8 @@
 # Cosmos3 Generator Transfer Examples
 
-Cosmos3-Nano video **transfer** examples on the native PyTorch (Cosmos Framework) path.
-Sample assets under [`assets/`](./assets) cover spatial control signals paired with
+Cosmos3-Nano video **transfer** examples for the native PyTorch (Cosmos
+Framework) path and the OpenAI-compatible vLLM-Omni server path. Sample assets
+under [`assets/`](./assets) cover spatial control signals paired with
 `prompt.json` files:
 
 - **Edge (Canny)** — edge map control plus caption.
@@ -10,18 +11,20 @@ Sample assets under [`assets/`](./assets) cover spatial control signals paired w
 - **Segmentation** — segmentation map control plus caption.
 - **World scenario (WSM)** — world-scenario map control plus caption.
 
-vLLM-Omni does not expose transfer controls today.
-
 Environment setup is centralized in the shared
 [Cosmos3 cookbooks environment setup](../../README.md) guide.
 
 ## Transfer Definition
 
-Video transfer generates a target clip from a `prompt.json` caption and a precomputed
-control video on the hint block (`control_path`). Inference uses `model_mode` `video2video`;
-there is no `vision_path` or source RGB video at run time. Output frame count and geometry
-come from the control video; see the spec field reference for how `fps` and
-`aspect_ratio` are resolved. All examples share
+Video transfer generates a target clip from a `prompt.json` caption and a
+precomputed control video on a hint block (`control_path`). The Framework path
+uses `model_mode` `video2video` in a local JSON spec. The vLLM-Omni path uses
+`POST /v1/videos/sync` and passes the same hint key (`edge`, `blur`, `depth`,
+`seg`, or `wsm`) inside `extra_params`.
+
+There is no source RGB video at run time for the checked-in transfer examples.
+Output frame count and geometry come from the control video; see the spec field
+reference for how `fps` and `aspect_ratio` are resolved. All examples share
 `assets/negative_prompt.json` for the negative caption.
 
 | Control | Asset folder | Inference input | Generation duration |
@@ -32,7 +35,8 @@ come from the control video; see the spec field reference for how `fps` and
 | Segmentation | `assets/seg/` | `control_seg.mp4` + `prompt.json` | 121 frames @ 30 FPS |
 | World scenario (WSM) | `assets/wsm/` | `control_wsm.mp4` + `prompt.json` | 101 frames @ 10 FPS |
 
-Transfer inference is selected automatically when any hint key is present in the spec.
+Transfer inference is selected automatically when any hint key is present in the
+Framework spec or in vLLM-Omni `extra_params`.
 
 ## Run with Cosmos Framework
 
@@ -97,6 +101,74 @@ checked-in assets under [`assets/`](./assets) via paths relative to [`specs/`](.
 Outputs are written under the directory passed to `-o`, with one subdirectory per sample name,
 for example `output/transfer_edge/vision.mp4`. Batch size must be 1 for transfer.
 
+## Run with vLLM-Omni
+
+### Quickstart
+
+Set up the environment and start the server by following the
+[vLLM-Omni setup](../../README.md#vllm-omni) guide (Docker recommended). Run
+the Docker command from the `cosmos` repo root so the repo is mounted at
+`/workspace` and the server runs from that directory inside the container:
+
+```bash
+export COSMOS3_WORKDIR="$(pwd)"
+export COSMOS3_HOST_PORT=8000
+```
+
+The transfer examples send repo-local `control_path` strings to the server. For
+Docker, those paths must be visible from the server working directory. With the
+shared Docker setup, the checked-in depth control video is:
+
+```text
+cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4
+```
+
+If your server does not run from the repo root, start it from the repo root or
+adjust `control_path` to a path the server process can read.
+
+Send a depth-transfer request:
+
+```python
+import json
+from pathlib import Path
+
+import requests
+
+transfer_root = Path("cookbooks/cosmos3/generator/transfer")
+prompt = json.dumps(json.loads((transfer_root / "assets/depth/prompt.json").read_text()))
+negative = json.dumps(json.loads((transfer_root / "assets/negative_prompt.json").read_text()))
+control_path = transfer_root / "assets/depth/control_depth.mp4"
+
+response = requests.post(
+    "http://localhost:8000/v1/videos/sync",
+    data={
+        "prompt": prompt,
+        "negative_prompt": negative,
+        "size": "1280x720",
+        "num_frames": "121",
+        "fps": "30",
+        "num_inference_steps": "50",
+        "guidance_scale": "3.0",
+        "flow_shift": "10.0",
+        "seed": "2026",
+        "extra_params": json.dumps(
+            {
+                "use_resolution_template": False,
+                "use_duration_template": False,
+                "guardrails": True,
+                "depth": {"control_path": control_path.as_posix()},
+                "control_guidance": 1.5,
+                "num_video_frames_per_chunk": 121,
+                "max_frames": 121,
+            }
+        ),
+    },
+    headers={"Accept": "video/mp4"},
+)
+response.raise_for_status()
+Path("/tmp/cosmos3_transfer_depth.mp4").write_bytes(response.content)
+```
+
 ### Spec field reference
 
 A representative spec (`specs/edge.json`):
@@ -140,6 +212,9 @@ Key fields:
 - [`run_video_transfer_with_cosmos_framework.ipynb`](./run_video_transfer_with_cosmos_framework.ipynb) —
   full tutorial on a **GPU host**: environment setup, `nvidia-smi` check, then five inference blocks
   (edge, blur, depth, seg, wsm) with previews. See [Cosmos3 environment setup](../../README.md).
+- [`run_video_transfer_with_vllm_omni.ipynb`](./run_video_transfer_with_vllm_omni.ipynb) —
+  full tutorial against an already-running vLLM-Omni server: endpoint checks, repo-local
+  control paths, five transfer requests, and compact previews.
 - [`specs/`](./specs) — checked-in Framework input JSON per control (paths relative to `specs/`).
 
 ### Troubleshooting
diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb
index faf6ad3a..3e5cfb97 100644
--- a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb
+++ b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb
@@ -30,7 +30,7 @@
     "- **seg** \u2014 segmentation map (`control_seg.mp4`)\n",
     "- **wsm** \u2014 world-scenario map (`control_wsm.mp4`)\n",
     "\n",
-    "vLLM-Omni does not expose transfer controls today; use this Cosmos Framework path only.\n",
+    "For the OpenAI-compatible vLLM-Omni transfer path, see [`run_video_transfer_with_vllm_omni.ipynb`](./run_video_transfer_with_vllm_omni.ipynb). This notebook focuses on the Cosmos Framework path.\n",
     "\n",
     "Sections **8\u201312** each run one control (inference + preview). Run only the blocks you need.\n",
     "\n",
diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb
new file mode 100644
index 00000000..0c7f1f46
--- /dev/null
+++ b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb
@@ -0,0 +1,492 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
+    "SPDX-License-Identifier: OpenMDW-1.1 -->\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Cosmos3 Nano Transfer with vLLM-Omni\n",
+    "\n",
+    "This notebook calls an already-running vLLM-Omni Cosmos3 server through the OpenAI-compatible video API. It reuses the checked-in transfer control assets and specs from this cookbook, then sends edge, blur, depth, segmentation, and world-scenario-map transfer requests to `POST /v1/videos/sync`.\n",
+    "\n",
+    "The notebook does not modify the vLLM-Omni source tree.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Prerequisites\n",
+    "\n",
+    "Start a vLLM-Omni Cosmos3 server before running the request cells. The examples assume the `cosmos` repo is mounted at `/workspace` and the server runs from that directory inside the container, so repo-local `cookbooks/...` paths resolve correctly.\n",
+    "\n",
+    "```bash\n",
+    "cd /path/to/cosmos\n",
+    "export HF_HOME=\"${HF_HOME:-$HOME/.cache/huggingface}\"\n",
+    "export COSMOS3_WORKDIR=\"$PWD\"\n",
+    "export COSMOS3_HOST_PORT=8000\n",
+    "\n",
+    "docker run --runtime nvidia --gpus all \\\n",
+    "  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n",
+    "  -v \"${HF_HOME}:/root/.cache/huggingface\" \\\n",
+    "  -v \"${COSMOS3_WORKDIR}:/workspace\" \\\n",
+    "  -p \"${COSMOS3_HOST_PORT}:8000\" --ipc=host \\\n",
+    "  -w /workspace \\\n",
+    "  vllm/vllm-omni:cosmos3 \\\n",
+    "  vllm serve nvidia/Cosmos3-Nano \\\n",
+    "    --omni \\\n",
+    "    --model-class-name Cosmos3OmniDiffusersPipeline \\\n",
+    "    --allowed-local-media-path / \\\n",
+    "    --port 8000 \\\n",
+    "    --init-timeout 1800\n",
+    "```\n",
+    "\n",
+    "Generator guardrails are on by default and require access to the gated `nvidia/Cosmos-1.0-Guardrail` repository. To disable guardrails for these sample requests, set `COSMOS3_VLLM_GUARDRAILS=false`.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Configure Paths and Endpoints\n",
+    "\n",
+    "Run this cell from anywhere inside the `cosmos` checkout. It resolves local assets, output paths, the vLLM-Omni endpoint, and repo-local `control_path` values.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "import base64\n",
+    "import html\n",
+    "import json\n",
+    "import os\n",
+    "import shutil\n",
+    "import subprocess\n",
+    "import time\n",
+    "from IPython.display import HTML, display\n",
+    "\n",
+    "\n",
+    "def find_repo_root(start: Path) -> Path:\n",
+    "    for path in [start, *start.parents]:\n",
+    "        if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n",
+    "            return path\n",
+    "    return start\n",
+    "\n",
+    "\n",
+    "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n",
+    "TRANSFER_ROOT = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"generator\" / \"transfer\"\n",
+    "SPECS_DIR = TRANSFER_ROOT / \"specs\"\n",
+    "ASSETS_DIR = TRANSFER_ROOT / \"assets\"\n",
+    "OUTPUT_ROOT = Path(\n",
+    "    os.environ.get(\"COSMOS3_TRANSFER_VLLM_OUTPUT_ROOT\", TRANSFER_ROOT / \"outputs\" / \"vllm_omni\")\n",
+    ").resolve()\n",
+    "VLLM_BASE_URL = os.environ.get(\"COSMOS3_VLLM_BASE_URL\", \"http://localhost:8000\").rstrip(\"/\")\n",
+    "GUARDRAILS = os.environ.get(\"COSMOS3_VLLM_GUARDRAILS\", \"true\").strip().lower() not in {\"0\", \"false\", \"no\", \"off\"}\n",
+    "\n",
+    "OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n",
+    "print(\"TRANSFER_ROOT:\", TRANSFER_ROOT)\n",
+    "print(\"OUTPUT_ROOT:\", OUTPUT_ROOT)\n",
+    "print(\"VLLM_BASE_URL:\", VLLM_BASE_URL)\n",
+    "print(\"GUARDRAILS:\", GUARDRAILS)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Verify Endpoint Configuration\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from urllib.parse import urlparse\n",
+    "\n",
+    "try:\n",
+    "    import requests\n",
+    "except ImportError as exc:\n",
+    "    raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n",
+    "\n",
+    "\n",
+    "def api_root_url(base_url: str) -> str:\n",
+    "    normalized = base_url.rstrip(\"/\")\n",
+    "    if not normalized.endswith(\"/v1\"):\n",
+    "        normalized = f\"{normalized}/v1\"\n",
+    "    return normalized\n",
+    "\n",
+    "\n",
+    "API_ROOT = api_root_url(VLLM_BASE_URL)\n",
+    "VIDEOS_SYNC_URL = f\"{API_ROOT}/videos/sync\"\n",
+    "MODELS_URL = f\"{API_ROOT}/models\"\n",
+    "parsed = urlparse(API_ROOT)\n",
+    "print(\"api root:\", API_ROOT)\n",
+    "print(\"videos sync:\", VIDEOS_SYNC_URL)\n",
+    "print(\"models:\", MODELS_URL)\n",
+    "print(\"scheme:\", parsed.scheme)\n",
+    "print(\"host:\", parsed.netloc)\n",
+    "\n",
+    "response = requests.get(MODELS_URL, timeout=30)\n",
+    "response.raise_for_status()\n",
+    "print(response.json())\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Define Transfer Request Helpers\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "TRANSFER_CONTROLS = (\"edge\", \"blur\", \"depth\", \"seg\", \"wsm\")\n",
+    "\n",
+    "\n",
+    "def compact_json_file(path: Path) -> str:\n",
+    "    return json.dumps(json.loads(path.read_text()), ensure_ascii=True, separators=(\",\", \":\"))\n",
+    "\n",
+    "\n",
+    "def resolve_spec_path(spec_path: Path, value: str) -> Path:\n",
+    "    path = Path(value)\n",
+    "    if path.is_absolute():\n",
+    "        return path\n",
+    "    return (spec_path.parent / path).resolve()\n",
+    "\n",
+    "\n",
+    "def repo_relative_path(local_path: Path) -> str:\n",
+    "    return local_path.resolve().relative_to(COSMOS_ROOT).as_posix()\n",
+    "\n",
+    "\n",
+    "def spec_size(spec: dict) -> str:\n",
+    "    height = int(spec[\"resolution\"])\n",
+    "    width_ratio, height_ratio = (int(part) for part in spec[\"aspect_ratio\"].split(\",\", 1))\n",
+    "    width = round(height * width_ratio / height_ratio)\n",
+    "    return f\"{width}x{height}\"\n",
+    "\n",
+    "\n",
+    "def load_transfer_spec(control: str) -> tuple[Path, dict]:\n",
+    "    if control not in TRANSFER_CONTROLS:\n",
+    "        raise ValueError(f\"control must be one of {TRANSFER_CONTROLS}, got {control!r}\")\n",
+    "    spec_path = SPECS_DIR / f\"{control}.json\"\n",
+    "    if not spec_path.is_file():\n",
+    "        raise FileNotFoundError(spec_path)\n",
+    "    return spec_path, json.loads(spec_path.read_text())\n",
+    "\n",
+    "\n",
+    "def build_transfer_request(control: str) -> tuple[dict[str, str], Path, Path]:\n",
+    "    spec_path, spec = load_transfer_spec(control)\n",
+    "    hint = dict(spec[control])\n",
+    "    local_control_path = resolve_spec_path(spec_path, hint[\"control_path\"])\n",
+    "    hint[\"control_path\"] = repo_relative_path(local_control_path)\n",
+    "\n",
+    "    extra_params = {\n",
+    "        \"use_resolution_template\": False,\n",
+    "        \"use_duration_template\": False,\n",
+    "        \"guardrails\": GUARDRAILS,\n",
+    "        control: hint,\n",
+    "        \"control_guidance\": spec[\"control_guidance\"],\n",
+    "        \"num_video_frames_per_chunk\": spec[\"num_video_frames_per_chunk\"],\n",
+    "        \"num_conditional_frames\": spec.get(\"num_conditional_frames\", 1),\n",
+    "        \"num_first_chunk_conditional_frames\": spec.get(\"num_first_chunk_conditional_frames\", 0),\n",
+    "        \"share_vision_temporal_positions\": spec.get(\"share_vision_temporal_positions\", True),\n",
+    "        \"max_frames\": spec[\"num_frames\"],\n",
+    "    }\n",
+    "    form = {\n",
+    "        \"prompt\": compact_json_file(resolve_spec_path(spec_path, spec[\"prompt_path\"])),\n",
+    "        \"negative_prompt\": compact_json_file(resolve_spec_path(spec_path, spec[\"negative_prompt_file\"])),\n",
+    "        \"size\": spec_size(spec),\n",
+    "        \"num_frames\": str(spec[\"num_frames\"]),\n",
+    "        \"fps\": str(spec[\"fps\"]),\n",
+    "        \"num_inference_steps\": \"50\",\n",
+    "        \"guidance_scale\": str(spec[\"guidance\"]),\n",
+    "        \"flow_shift\": \"10.0\",\n",
+    "        \"seed\": \"2026\",\n",
+    "        \"extra_params\": json.dumps(extra_params, separators=(\",\", \":\")),\n",
+    "    }\n",
+    "    output_path = OUTPUT_ROOT / control / f\"{spec['name']}.mp4\"\n",
+    "    return form, local_control_path, output_path\n",
+    "\n",
+    "\n",
+    "def run_transfer(control: str) -> Path:\n",
+    "    form, local_control_path, output_path = build_transfer_request(control)\n",
+    "    output_path.parent.mkdir(parents=True, exist_ok=True)\n",
+    "    error_path = output_path.with_suffix(\".error.txt\")\n",
+    "    tmp_path = output_path.with_suffix(\".tmp\")\n",
+    "    print(\"control:\", control)\n",
+    "    print(\"local control:\", local_control_path)\n",
+    "    print(\"size:\", form[\"size\"], \"frames:\", form[\"num_frames\"], \"fps:\", form[\"fps\"])\n",
+    "    print(\"output:\", output_path)\n",
+    "\n",
+    "    headers = {\"Accept\": \"video/mp4\"}\n",
+    "    api_key = os.environ.get(\"COSMOS3_VLLM_API_KEY\")\n",
+    "    if api_key:\n",
+    "        headers[\"Authorization\"] = f\"Bearer {api_key}\"\n",
+    "\n",
+    "    t0 = time.time()\n",
+    "    response = requests.post(VIDEOS_SYNC_URL, data=form, headers=headers, timeout=3600)\n",
+    "    if not response.ok:\n",
+    "        error_path.write_text(response.text)\n",
+    "        print(\"request failed:\", response.status_code)\n",
+    "        print(response.text[:2000])\n",
+    "        response.raise_for_status()\n",
+    "    tmp_path.write_bytes(response.content)\n",
+    "    tmp_path.replace(output_path)\n",
+    "    print(f\"wrote {output_path} in {time.time() - t0:.1f}s\")\n",
+    "    return output_path\n",
+    "\n",
+    "\n",
+    "def _ffmpeg_exe() -> str:\n",
+    "    try:\n",
+    "        import imageio_ffmpeg\n",
+    "\n",
+    "        return imageio_ffmpeg.get_ffmpeg_exe()\n",
+    "    except ImportError:\n",
+    "        pass\n",
+    "    exe = shutil.which(\"ffmpeg\")\n",
+    "    if exe:\n",
+    "        return exe\n",
+    "    raise RuntimeError(\"Install imageio-ffmpeg or put ffmpeg on PATH to create compact previews.\")\n",
+    "\n",
+    "\n",
+    "def make_preview(src: Path, *, crf: int = 28) -> Path:\n",
+    "    preview = src.with_name(f\"{src.stem}_preview.mp4\")\n",
+    "    if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n",
+    "        subprocess.run(\n",
+    "            [\n",
+    "                _ffmpeg_exe(),\n",
+    "                \"-y\",\n",
+    "                \"-loglevel\",\n",
+    "                \"error\",\n",
+    "                \"-i\",\n",
+    "                str(src),\n",
+    "                \"-c:v\",\n",
+    "                \"libx264\",\n",
+    "                \"-crf\",\n",
+    "                str(crf),\n",
+    "                \"-preset\",\n",
+    "                \"veryfast\",\n",
+    "                \"-an\",\n",
+    "                \"-pix_fmt\",\n",
+    "                \"yuv420p\",\n",
+    "                str(preview),\n",
+    "            ],\n",
+    "            check=True,\n",
+    "        )\n",
+    "    return preview\n",
+    "\n",
+    "\n",
+    "def display_video(path: Path, *, width: int = 720) -> None:\n",
+    "    data = base64.b64encode(path.read_bytes()).decode(\"ascii\")\n",
+    "    label = html.escape(str(path))\n",
+    "    markup = f'''\n",
+    "<video controls playsinline preload=\"metadata\" width=\"{width}\" style=\"max-width: 100%; background: #000;\">\n",
+    "  <source src=\"data:video/mp4;base64,{data}\" type=\"video/mp4\">\n",
+    "</video>\n",
+    "<div style=\"font-family: monospace; font-size: 12px; margin-top: 4px;\">{label}</div>\n",
+    "'''\n",
+    "    display(HTML(markup))\n",
+    "\n",
+    "\n",
+    "def view_transfer(control: str, output_path: Path | None = None) -> None:\n",
+    "    form, local_control_path, expected_output = build_transfer_request(control)\n",
+    "    output_path = Path(output_path or expected_output)\n",
+    "    if not output_path.is_file():\n",
+    "        raise FileNotFoundError(f\"missing output: {output_path} (run {control} transfer first)\")\n",
+    "    for label, src in [(\"control\", local_control_path), (\"generated\", output_path)]:\n",
+    "        preview = make_preview(src)\n",
+    "        print(f\"{control} {label}: {src.name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n",
+    "        display_video(preview)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Preview Available Inputs\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for control in TRANSFER_CONTROLS:\n",
+    "    form, local_control_path, _ = build_transfer_request(control)\n",
+    "    prompt = json.loads(form[\"prompt\"])\n",
+    "    caption = prompt.get(\"temporal_caption\") or prompt.get(\"comprehensive_t2i_caption\") or prompt.get(\"extra\", {}).get(\"prompt\", \"\")\n",
+    "    print(f\"{control}: {local_control_path.relative_to(COSMOS_ROOT)}\")\n",
+    "    print(f\"  size={form['size']} frames={form['num_frames']} fps={form['fps']}\")\n",
+    "    print(f\"  prompt={caption[:180]}{'...' if len(caption) > 180 else ''}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Edge (Canny) Transfer\n",
+    "\n",
+    "Run the `edge` transfer request through vLLM-Omni, then display the input control video and generated output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "edge_output = run_transfer(\"edge\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_transfer(\"edge\", edge_output)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Blur Transfer\n",
+    "\n",
+    "Run the `blur` transfer request through vLLM-Omni, then display the input control video and generated output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "blur_output = run_transfer(\"blur\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_transfer(\"blur\", blur_output)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Depth Transfer\n",
+    "\n",
+    "Run the `depth` transfer request through vLLM-Omni, then display the input control video and generated output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "depth_output = run_transfer(\"depth\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_transfer(\"depth\", depth_output)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Segmentation Transfer\n",
+    "\n",
+    "Run the `seg` transfer request through vLLM-Omni, then display the input control video and generated output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "seg_output = run_transfer(\"seg\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_transfer(\"seg\", seg_output)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. World Scenario Map Transfer\n",
+    "\n",
+    "Run the `wsm` transfer request through vLLM-Omni, then display the input control video and generated output.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "wsm_output = run_transfer(\"wsm\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "view_transfer(\"wsm\", wsm_output)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From e1198ce3338f1e2a725751da93169fce3918cc8f Mon Sep 17 00:00:00 2001
From: Maciej Bala <mbala@nvidia.com>
Date: Tue, 16 Jun 2026 15:56:20 +0200
Subject: [PATCH 2/2] bugfix

Signed-off-by: Maciej Bala <mbala@nvidia.com>
---
 README.md                                                    | 4 ++--
 cookbooks/cosmos3/generator/transfer/README.md               | 5 +++++
 .../transfer/run_video_transfer_with_vllm_omni.ipynb         | 1 +
 3 files changed, 8 insertions(+), 2 deletions(-)
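
Note on this fix: the `size` form field and `extra_params.resolution` carry
overlapping information, so both can be derived from one spec. The sketch below
is illustrative only and is not part of the patch. It assumes `aspect_ratio` is
a comma-separated "width,height" string (the format the notebook helper
`spec_size` splits on); `size_from_spec`, `transfer_fields`, and the example
spec values are hypothetical names and data chosen to match the README depth
request.

```python
import json


def size_from_spec(resolution: str, aspect_ratio: str) -> str:
    """Return the video API `size` string ("WIDTHxHEIGHT") for a spec.

    Assumption: `aspect_ratio` looks like "16,9" (width,height).
    """
    height = int(resolution)
    width_ratio, height_ratio = (int(part) for part in aspect_ratio.split(",", 1))
    return f"{round(height * width_ratio / height_ratio)}x{height}"


def transfer_fields(spec: dict, control: str) -> dict[str, str]:
    """Build the two form fields that both need the spec resolution."""
    extra_params = {
        control: spec[control],
        # Bucket selection reads this key, per the README text in this patch.
        "resolution": spec["resolution"],
    }
    return {
        "size": size_from_spec(spec["resolution"], spec["aspect_ratio"]),
        "extra_params": json.dumps(extra_params),
    }


# Hypothetical spec values mirroring the README depth example.
example_spec = {
    "resolution": "720",
    "aspect_ratio": "16,9",
    "depth": {
        "control_path": "cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"
    },
}
fields = transfer_fields(example_spec, "depth")
print(fields["size"])
print(fields["extra_params"])
```

Deriving both values from the same spec keeps `size` and
`extra_params.resolution` from drifting apart when a spec changes.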

diff --git a/README.md b/README.md
index a5f808aa..dc956422 100644
--- a/README.md
+++ b/README.md
@@ -413,7 +413,7 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   --form-string "guidance_scale=3.0" \
   --form-string "flow_shift=10.0" \
   --form-string "seed=2026" \
-  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \
+  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"resolution":"720","control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \
   -o cosmos3_transfer_depth.mp4
 ```
 
@@ -434,7 +434,7 @@ Common request fields (the image endpoint follows the [Image Generation API](htt
 | `max_sequence_length` | Maximum number of prompt tokens kept for conditioning (Cosmos 3 default `512`); longer prompts are truncated with a warning, shorter ones padded |
 | `input_reference` | Uploaded image or video for image-to-video, video-to-video, and action requests |
 | `video_reference` | JSON-safe video reference for video-to-video requests, such as `{"video_url":"https://..."}`; do not combine with `input_reference` or `image_reference` |
-| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle |
+| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`) and transfer bucket `resolution`, prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle |
 | `extra_args` | JSON object for Cosmos 3-specific image-endpoint options such as `use_resolution_template` |
 
 Disabling guardrails: Cosmos 3 ships safety guardrails that screen prompts and blur faces in generated output. Disable them per request by adding `guardrails: false` to `extra_params`:
diff --git a/cookbooks/cosmos3/generator/transfer/README.md b/cookbooks/cosmos3/generator/transfer/README.md
index 13991561..8d372941 100644
--- a/cookbooks/cosmos3/generator/transfer/README.md
+++ b/cookbooks/cosmos3/generator/transfer/README.md
@@ -126,6 +126,10 @@ cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4
 If your server does not run from the repo root, start it from the repo root or
 adjust `control_path` to a path the server process can read.
 
+Transfer requests should also pass the spec's `resolution` inside `extra_params`.
+The video API `size` field sets the output size string, but Cosmos 3 transfer
+bucket selection reads `extra_params.resolution`.
+
 Send a depth-transfer request:
 
 ```python
@@ -157,6 +161,7 @@ response = requests.post(
                 "use_duration_template": False,
                 "guardrails": True,
                 "depth": {"control_path": control_path.as_posix()},
+                "resolution": "720",
                 "control_guidance": 1.5,
                 "num_video_frames_per_chunk": 121,
                 "max_frames": 121,
diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb
index 0c7f1f46..ce009292 100644
--- a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb
+++ b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb
@@ -204,6 +204,7 @@
     "        \"use_duration_template\": False,\n",
     "        \"guardrails\": GUARDRAILS,\n",
     "        control: hint,\n",
+    "        \"resolution\": spec[\"resolution\"],\n",
     "        \"control_guidance\": spec[\"control_guidance\"],\n",
     "        \"num_video_frames_per_chunk\": spec[\"num_video_frames_per_chunk\"],\n",
     "        \"num_conditional_frames\": spec.get(\"num_conditional_frames\", 1),\n",