NVIDIA · MaciejBalaNV · Jun 16, 2026 · Jun 16, 2026
diff --git a/README.md b/README.md
@@ -291,16 +291,17 @@ See the [Cosmos 3 Diffusers documentation](https://huggingface.co/docs/diffusers
 
 Use vLLM-Omni for Generator production inference behind an OpenAI-compatible API. This integration loads the full Cosmos 3 checkpoint, including the Qwen3-VL-based reasoner path and the diffusion generation path. For understanding-only tasks that return text, use [Reasoner with vLLM](#reasoner-with-vllm) instead, which loads only the reasoner.
 
-> **Compatibility status:** Cosmos 3 Generator support has landed in [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) `main`: text-to-image, text-to-video, and image-to-video ([#3454](https://github.com/vllm-project/vllm-omni/pull/3454)) and video-with-sound ([#4073](https://github.com/vllm-project/vllm-omni/pull/4073)) are merged; action (policy / forward-dynamics) is in review ([#4102](https://github.com/vllm-project/vllm-omni/pull/4102)) and video-to-video is planned. The `vllm/vllm-omni:cosmos3` Docker image remains the easiest all-in-one build. For current setup and per-modality usage, see the maintained recipes: [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md).
+> **Compatibility status:** Cosmos 3 Generator support is available in [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) `main` for text-to-image, text-to-video, image-to-video, video-to-video, transfer-control video-to-video, video-with-sound, and action generation. The `vllm/vllm-omni:cosmos3` Docker image remains the easiest all-in-one build. For current setup and per-modality usage, see the maintained recipes: [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md).
 
-Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read.
+Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read. The command below mounts the current directory at `/workspace` and sets it as the container's working directory (`-w /workspace`), so repo-local paths such as `cookbooks/...` resolve inside the container when you launch from the repository root.
 
 ```shell
 docker run --runtime nvidia --gpus all \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   -v "$(pwd):/workspace" \
   -p 8000:8000 \
   --ipc=host \
+  -w /workspace \
   vllm/vllm-omni:cosmos3 \
   vllm serve nvidia/Cosmos3-Nano \
   --omni \
@@ -326,7 +327,7 @@ Additional parallelism options:
 
 When combining parallelism options, ensure the server has enough GPUs for the product of the enabled degrees (`tensor_parallel_size` × `cfg_parallel_size` × `ulysses_degree`).
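
As a quick check of the GPU-count rule above, the product can be computed in the shell before launching the server. This is a minimal sketch: the three degree values are hypothetical placeholders, and the variable names simply mirror the option names in the sentence above rather than any real environment variables. The `nvidia-smi --list-gpus` call is used only when the tool is installed.

```shell
# Hypothetical degrees; substitute the values you plan to pass to `vllm serve`.
tensor_parallel_size=2
cfg_parallel_size=2
ulysses_degree=2

# GPUs needed = product of the enabled parallelism degrees.
required_gpus=$((tensor_parallel_size * cfg_parallel_size * ulysses_degree))
echo "required GPUs: ${required_gpus}"

# Count visible GPUs when nvidia-smi is present; otherwise report 0.
if command -v nvidia-smi >/dev/null 2>&1; then
  available_gpus=$(nvidia-smi --list-gpus 2>/dev/null | grep -c '^GPU ')
else
  available_gpus=0
fi

if [ "${available_gpus}" -lt "${required_gpus}" ]; then
  echo "not enough GPUs: ${available_gpus} visible, ${required_gpus} needed" >&2
fi
```

With the placeholder values the product is 2 × 2 × 2, so the sketch reports 8 required GPUs.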
 
-To install vLLM-Omni from `main` instead of using the Docker image (text-to-image, text-to-video, image-to-video, and video-with-sound are merged there; see the [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md) recipes for per-modality usage), create a venv and install, choosing the CUDA build that matches your driver:
+To install vLLM-Omni from `main` instead of using the Docker image, create a venv and install, choosing the CUDA build that matches your driver. This path uses the same request formats as the Docker image; see the [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md) recipes for per-modality usage:
 
 ```shell
 uv venv --python 3.13 --seed --managed-python
@@ -349,7 +350,8 @@ Vision endpoints:
 | Text to video | `POST /v1/videos/sync` | Blocks and returns the MP4 bytes directly |
 | Image to video | `POST /v1/videos/sync` | Upload the conditioning image with `input_reference` |
 | Video to video | `POST /v1/videos/sync` | Upload a source video and choose which frames stay as clean conditioning |
-| Video with sound | `POST /v1/videos/sync` | Add `generate_sound=true` to produce a soundtrack alongside the video |
+| Transfer video to video | `POST /v1/videos/sync` | Pass a transfer hint (`edge`, `blur`, `depth`, `seg`, or `wsm`) as a key in `extra_params`, for example `depth` with a `control_path` pointing at the control video |
+| Video with sound | `POST /v1/videos/sync` | Add `generate_sound=true` to supported text-to-video or image-to-video requests |
 
 Action modes use Cosmos 3 as a world model: they condition on an embodiment (`domain_name`) and exchange video and action sequences. Policy and inverse dynamics return a predicted action chunk, so send those through the asynchronous `POST /v1/videos` job and read the action data from the completed result; forward dynamics returns only video and can use synchronous `POST /v1/videos/sync`.
 
@@ -378,6 +380,43 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \
   -o cosmos3_t2v_output.mp4
 ```
 
+Example video-to-video request:
+
+```shell
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "prompt=Continue the same driving scene with smooth natural motion." \
+  --form-string "negative_prompt=blurry, distorted, low quality, jittery, deformed" \
+  --form-string "size=832x480" \
+  --form-string "num_frames=61" \
+  --form-string "fps=10" \
+  --form-string "num_inference_steps=35" \
+  --form-string "guidance_scale=6.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=2222" \
+  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"condition_frame_indexes_vision":[0,1],"condition_video_keep":"first"}' \
+  -F "input_reference=@cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4;type=video/mp4" \
+  -o cosmos3_v2v_output.mp4
+```
+
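+Example transfer-control request. The `control_path` is read by the server, so it must resolve inside the container (here, relative to `/workspace`):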
+
+```shell
+curl -sS -X POST http://localhost:8000/v1/videos/sync \
+  -H "Accept: video/mp4" \
+  --form-string "prompt=Generate a realistic scene following the provided depth control video." \
+  --form-string "negative_prompt=blurry, distorted, low quality" \
+  --form-string "size=1280x720" \
+  --form-string "num_frames=121" \
+  --form-string "fps=30" \
+  --form-string "num_inference_steps=50" \
+  --form-string "guidance_scale=3.0" \
+  --form-string "flow_shift=10.0" \
+  --form-string "seed=2026" \
+  --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"resolution":"720","control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \
+  -o cosmos3_transfer_depth.mp4
+```
+
 Use `--form-string` for text fields (`prompt`, `negative_prompt`, `extra_params`) rather than `-F`: with `-F`, `curl` treats `;` as a content-type separator and silently truncates any value that contains one.
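
To make the `--form-string` advice concrete, the sketch below assembles a text-to-video request whose prompt contains a semicolon. The prompt text, the output filename, and the `SEND` switch are illustrative assumptions, and the remaining field values are borrowed from the video-to-video example rather than recommended settings. By default it only prints the assembled arguments, so it can be inspected without a running server.

```shell
# Illustrative prompt; the ';' is the character that -F would mishandle.
prompt='A rainy street at night; neon signs reflect in the puddles.'

# Collect the curl arguments in the positional parameters so they can be
# printed or executed without re-quoting.
set -- -sS -X POST "http://localhost:8000/v1/videos/sync" \
  -H "Accept: video/mp4" \
  --form-string "prompt=${prompt}" \
  --form-string "size=832x480" \
  --form-string "num_frames=61" \
  --form-string "fps=10" \
  --form-string "seed=2222" \
  -o cosmos3_semicolon_prompt.mp4

# SEND is a hypothetical switch for this sketch: SEND=1 calls the server,
# anything else prints the arguments one per line for review.
if [ "${SEND:-0}" = "1" ]; then
  curl "$@"
else
  printf '%s\n' curl "$@"
fi
```

Because `--form-string` takes the value literally, the whole prompt, including the text after the semicolon, stays in a single argument.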
 
 Common request fields (the image endpoint follows the [Image Generation API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/image_generation_api/), and the video endpoints follow the [Videos API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/videos_api/#request-parameters)):
@@ -394,7 +433,8 @@ Common request fields (the image endpoint follows the [Image Generation API](htt
 | `seed` | Reproducibility seed |
 | `max_sequence_length` | Maximum number of prompt tokens kept for conditioning (Cosmos 3 default `512`); longer prompts are truncated with a warning, shorter ones padded |
 | `input_reference` | Uploaded image or video for image-to-video, video-to-video, and action requests |
-| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle |
+| `video_reference` | JSON-safe video reference for video-to-video requests, such as `{"video_url":"https://..."}`; do not combine with `input_reference` or `image_reference` |
+| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`) and transfer bucket `resolution`, prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle |
 | `extra_args` | JSON object for Cosmos 3-specific image-endpoint options such as `use_resolution_template` |
 
 Disabling guardrails: Cosmos 3 ships safety guardrails that screen prompts and blur faces in generated output. Disable them per request by adding `guardrails: false` to `extra_params`:
@@ -633,12 +673,13 @@ We are building examples that show Cosmos 3 capabilities end to end, including w
 | --- | --- | --- | --- | --- |
 | Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) |
 | Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) |
-| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
+| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, text-to-video, image-to-video, and video-to-video, with supported sound modes, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) |
 | Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) |
 | Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) |
 | Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) |
 | Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) |
 | Transfer with Cosmos Framework | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) |
+| Transfer with vLLM-Omni | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb) |
 | Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) |
 | Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). | [Notebook](cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) |
 | Reasoner with NIM | Reasoner | The same image and video reasoning examples as the vLLM notebook, run against the prebuilt, OpenAI-compatible [Cosmos 3 Reasoner NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner) container; local media is sent as base64 data URIs. | [Notebook](cookbooks/cosmos3/reasoner/run_with_nim.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_nim.ipynb) |

diff --git a/cookbooks/cosmos3/README.md b/cookbooks/cosmos3/README.md
@@ -10,7 +10,7 @@ backend you want to run and follow that one section.
 | [Diffusers](#diffusers) | Direct generation with `Cosmos3OmniPipeline` | Generator (Audiovisual) |
 | [Transformers](#transformers-coming-soon) | Hugging Face Transformers inference | Reasoner |
 | [vLLM](#vllm) | OpenAI-compatible reasoning server (image/video understanding) | Reasoner |
-| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action) | Generator (Audiovisual, Action) |
+| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action/transfer) | Generator (Audiovisual, Action, Transfer) |
 | [NIM](#nim) | Prebuilt OpenAI-compatible reasoning server (image/video understanding); no venv | Reasoner |
 
 ## Prerequisites
@@ -250,7 +250,7 @@ graphs compile.
 
 ## vLLM-Omni
 
-OpenAI-compatible **generation** server (image/video/audio/action) for the
+OpenAI-compatible **generation** server (image/video/audio/action/transfer) for the
 Generator cookbooks.
 
 Cosmos3 checkpoints can exceed the default server init timeout — always pass
@@ -259,7 +259,7 @@ Cosmos3 checkpoints can exceed the default server init timeout — always pass
 ### Option 1: Docker (recommended)
 
 The prebuilt image `vllm/vllm-omni:cosmos3` supports every Generator modality
-(including action). Pull once:
+(including video-to-video, transfer controls, and action). Pull once:
 
 ```bash
 docker pull vllm/vllm-omni:cosmos3
@@ -275,7 +275,8 @@ export COSMOS3_HOST_PORT="${COSMOS3_HOST_PORT:-8000}"
 
 The container listens on port 8000; `-p "${COSMOS3_HOST_PORT}:8000"` publishes it
 on the host. Generator notebooks often use `COSMOS3_HOST_PORT=8001` so port 8000
-stays free for a Reasoner server.
+stays free for a Reasoner server. The Docker commands start the container in `/workspace`
+(`-w /workspace`), where `COSMOS3_WORKDIR` is mounted, so repo-local paths such as `cookbooks/...` resolve.
 
 **Cosmos3-Nano** (single GPU):
 
@@ -285,6 +286,7 @@ docker run --runtime nvidia --gpus '"device=0"' \
   -v "${HF_HOME}:/root/.cache/huggingface" \
   -v "${COSMOS3_WORKDIR}:/workspace" \
   -p "${COSMOS3_HOST_PORT}:8000" --ipc=host \
+  -w /workspace \
   vllm/vllm-omni:cosmos3 \
   vllm serve nvidia/Cosmos3-Nano \
     --omni \
@@ -301,6 +303,7 @@ docker run --runtime nvidia --gpus all \
   -v "${HF_HOME}:/root/.cache/huggingface" \
   -v "${COSMOS3_WORKDIR}:/workspace" \
   -p "${COSMOS3_HOST_PORT}:8000" --ipc=host \
+  -w /workspace \
   vllm/vllm-omni:cosmos3 \
   vllm serve nvidia/Cosmos3-Super \
     --omni \
@@ -318,11 +321,10 @@ filesystem should be readable.
 
 vLLM-Omni prints `Application startup complete.` when the API is ready.
 
-### Option 2: Native venv (limited modalities)
+### Option 2: Native venv
 
-To install from the upstreaming PR branch instead of Docker (text-to-image,
-text-to-video, and image-to-video only — not action or sound yet), create a venv
-and pick the CUDA build that matches your driver (see
+To install from `main` instead of Docker, create a venv and pick the CUDA build
+that matches your driver (see
 [CUDA driver and the `cuXXX` backend](#cuda-driver-and-the-cuxxx-backend)):
 
 ```bash
@@ -331,11 +333,11 @@ source .venv/bin/activate
 
 # CUDA 13 driver:
 uv pip install --torch-backend=cu130 \
-  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
+  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@main"
 
 # CUDA 12.x driver:
 # uv pip install --torch-backend=cu128 \
-#   "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head"
+#   "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@main"
 ```
 
 Run the same `vllm serve` arguments as in the Docker commands above, directly on