From 040d5d61702714b22019ba8bae80907cd5fc3fa9 Mon Sep 17 00:00:00 2001 From: Maciej Bala Date: Tue, 16 Jun 2026 14:51:27 +0200 Subject: [PATCH 1/2] Initial version of transfer for vllm-omni Signed-off-by: Maciej Bala --- README.md | 53 +- cookbooks/cosmos3/README.md | 22 +- .../cosmos3/generator/audiovisual/README.md | 49 +- .../audiovisual/run_with_vllm_omni.ipynb | 195 ++++++- .../cosmos3/generator/transfer/README.md | 95 +++- ...video_transfer_with_cosmos_framework.ipynb | 2 +- .../run_video_transfer_with_vllm_omni.ipynb | 492 ++++++++++++++++++ 7 files changed, 862 insertions(+), 46 deletions(-) create mode 100644 cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb diff --git a/README.md b/README.md index d54eaa07..a5f808aa 100644 --- a/README.md +++ b/README.md @@ -291,9 +291,9 @@ See the [Cosmos 3 Diffusers documentation](https://huggingface.co/docs/diffusers Use vLLM-Omni for Generator production inference behind an OpenAI-compatible API. This integration loads the full Cosmos 3 checkpoint, including the Qwen3-VL-based reasoner path and the diffusion generation path. For understanding-only tasks that return text, use [Reasoner with vLLM](#reasoner-with-vllm) instead, which loads only the reasoner. -> **Compatibility status:** Cosmos 3 Generator support has landed in [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) `main`: text-to-image, text-to-video, and image-to-video ([#3454](https://github.com/vllm-project/vllm-omni/pull/3454)) and video-with-sound ([#4073](https://github.com/vllm-project/vllm-omni/pull/4073)) are merged; action (policy / forward-dynamics) is in review ([#4102](https://github.com/vllm-project/vllm-omni/pull/4102)) and video-to-video is planned. The `vllm/vllm-omni:cosmos3` Docker image remains the easiest all-in-one build. 
For current setup and per-modality usage, see the maintained recipes: [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md). +> **Compatibility status:** Cosmos 3 Generator support is available in [vllm-project/vllm-omni](https://github.com/vllm-project/vllm-omni) `main` for text-to-image, text-to-video, image-to-video, video-to-video, transfer-control video-to-video, video-with-sound, and action generation. The `vllm/vllm-omni:cosmos3` Docker image remains the easiest all-in-one build. For current setup and per-modality usage, see the maintained recipes: [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md). -Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read. +Start the server from the Docker image (all modalities). Mount any directory that contains local media or action files you want the server to read. The command below runs from `/workspace`, so repo-local paths such as `cookbooks/...` resolve inside the container. ```shell docker run --runtime nvidia --gpus all \ @@ -301,6 +301,7 @@ docker run --runtime nvidia --gpus all \ -v "$(pwd):/workspace" \ -p 8000:8000 \ --ipc=host \ + -w /workspace \ vllm/vllm-omni:cosmos3 \ vllm serve nvidia/Cosmos3-Nano \ --omni \ @@ -326,7 +327,7 @@ Additional parallelism options: When combining parallelism options, ensure the server has enough GPUs for the product of the enabled degrees (`tensor_parallel_size` × `cfg_parallel_size` × `ulysses_degree`). 
-To install vLLM-Omni from `main` instead of using the Docker image (text-to-image, text-to-video, image-to-video, and video-with-sound are merged there; see the [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md) recipes for per-modality usage), create a venv and install, choosing the CUDA build that matches your driver: +To install vLLM-Omni from `main` instead of using the Docker image, create a venv and install, choosing the CUDA build that matches your driver. This path uses the same request formats as the Docker image; see the [Cosmos3-Nano](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Nano.md) and [Cosmos3-Super](https://github.com/vllm-project/vllm-omni/blob/main/recipes/cosmos3/Cosmos3-Super.md) recipes for per-modality usage: ```shell uv venv --python 3.13 --seed --managed-python @@ -349,7 +350,8 @@ Vision endpoints: | Text to video | `POST /v1/videos/sync` | Blocks and returns the MP4 bytes directly | | Image to video | `POST /v1/videos/sync` | Upload the conditioning image with `input_reference` | | Video to video | `POST /v1/videos/sync` | Upload a source video and choose which frames stay as clean conditioning | -| Video with sound | `POST /v1/videos/sync` | Add `generate_sound=true` to produce a soundtrack alongside the video | +| Transfer video to video | `POST /v1/videos/sync` | Pass a transfer hint such as `edge`, `blur`, `depth`, `seg`, or `wsm` in `extra_params` | +| Video with sound | `POST /v1/videos/sync` | Add `generate_sound=true` to supported text-to-video or image-to-video requests | Action modes use Cosmos 3 as a world model: they condition on an embodiment (`domain_name`) and exchange video and action sequences. 
Policy and inverse dynamics return a predicted action chunk, so send those through the asynchronous `POST /v1/videos` job and read the action data from the completed result; forward dynamics returns only video and can use synchronous `POST /v1/videos/sync`. @@ -378,6 +380,43 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \ -o cosmos3_t2v_output.mp4 ``` +Example video-to-video request: + +```shell +curl -sS -X POST http://localhost:8000/v1/videos/sync \ + -H "Accept: video/mp4" \ + --form-string "prompt=Continue the same driving scene with smooth natural motion." \ + --form-string "negative_prompt=blurry, distorted, low quality, jittery, deformed" \ + --form-string "size=832x480" \ + --form-string "num_frames=61" \ + --form-string "fps=10" \ + --form-string "num_inference_steps=35" \ + --form-string "guidance_scale=6.0" \ + --form-string "flow_shift=10.0" \ + --form-string "seed=2222" \ + --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"condition_frame_indexes_vision":[0,1],"condition_video_keep":"first"}' \ + -F "input_reference=@cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4;type=video/mp4" \ + -o cosmos3_v2v_output.mp4 +``` + +Example transfer-control request: + +```shell +curl -sS -X POST http://localhost:8000/v1/videos/sync \ + -H "Accept: video/mp4" \ + --form-string "prompt=Generate a realistic scene following the provided depth control video." 
\ + --form-string "negative_prompt=blurry, distorted, low quality" \ + --form-string "size=1280x720" \ + --form-string "num_frames=121" \ + --form-string "fps=30" \ + --form-string "num_inference_steps=50" \ + --form-string "guidance_scale=3.0" \ + --form-string "flow_shift=10.0" \ + --form-string "seed=2026" \ + --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \ + -o cosmos3_transfer_depth.mp4 +``` + Use `--form-string` for text fields (`prompt`, `negative_prompt`, `extra_params`) rather than `-F`: with `-F`, `curl` treats `;` as a content-type separator and silently truncates any value that contains one. Common request fields (the image endpoint follows the [Image Generation API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/image_generation_api/), and the video endpoints follow the [Videos API](https://docs.vllm.ai/projects/vllm-omni/en/latest/serving/videos_api/#request-parameters)): @@ -394,7 +433,8 @@ Common request fields (the image endpoint follows the [Image Generation API](htt | `seed` | Reproducibility seed | | `max_sequence_length` | Maximum number of prompt tokens kept for conditioning (Cosmos 3 default `512`); longer prompts are truncated with a warning, shorter ones padded | | `input_reference` | Uploaded image or video for image-to-video, video-to-video, and action requests | -| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle | +| `video_reference` | JSON-safe video reference for video-to-video 
requests, such as `{"video_url":"https://..."}`; do not combine with `input_reference` or `image_reference` | +| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle | | `extra_args` | JSON object for Cosmos 3-specific image-endpoint options such as `use_resolution_template` | Disabling guardrails: Cosmos 3 ships safety guardrails that screen prompts and blur faces in generated output. Disable them per request by adding `guardrails: false` to `extra_params`: @@ -633,12 +673,13 @@ We are building examples that show Cosmos 3 capabilities end to end, including w | --- | --- | --- | --- | --- | | Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | | Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. 
| [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | -| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | +| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, text-to-video, image-to-video, and video-to-video, with supported sound modes, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | | Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. 
| [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | | Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | | Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | | Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | | Transfer with Cosmos Framework | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, through the `cosmos_framework.scripts.inference` entrypoint. 
| [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) | +| Transfer with vLLM-Omni | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb) | | Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | | Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). 
| [Notebook](cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | | Reasoner with NIM | Reasoner | The same image and video reasoning examples as the vLLM notebook, run against the prebuilt, OpenAI-compatible [Cosmos 3 Reasoner NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner) container; local media is sent as base64 data URIs. | [Notebook](cookbooks/cosmos3/reasoner/run_with_nim.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_nim.ipynb) | diff --git a/cookbooks/cosmos3/README.md b/cookbooks/cosmos3/README.md index ecf78a06..71a1c469 100644 --- a/cookbooks/cosmos3/README.md +++ b/cookbooks/cosmos3/README.md @@ -10,7 +10,7 @@ backend you want to run and follow that one section. | [Diffusers](#diffusers) | Direct generation with `Cosmos3OmniPipeline` | Generator (Audiovisual) | | [Transformers](#transformers-coming-soon) | Hugging Face Transformers inference | Reasoner | | [vLLM](#vllm) | OpenAI-compatible reasoning server (image/video understanding) | Reasoner | -| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action) | Generator (Audiovisual, Action) | +| [vLLM-Omni](#vllm-omni) | OpenAI-compatible generation server (image/video/audio/action/transfer) | Generator (Audiovisual, Action, **Transfer**) | | [NIM](#nim) | Prebuilt OpenAI-compatible reasoning server (image/video understanding); no venv | Reasoner | ## Prerequisites @@ -250,7 +250,7 @@ graphs compile. 
## vLLM-Omni -OpenAI-compatible **generation** server (image/video/audio/action) for the +OpenAI-compatible **generation** server (image/video/audio/action/transfer) for the Generator cookbooks. Cosmos3 checkpoints can exceed the default server init timeout — always pass @@ -259,7 +259,7 @@ Cosmos3 checkpoints can exceed the default server init timeout — always pass ### Option 1: Docker (recommended) The prebuilt image `vllm/vllm-omni:cosmos3` supports every Generator modality -(including action). Pull once: +(including video-to-video, transfer controls, and action). Pull once: ```bash docker pull vllm/vllm-omni:cosmos3 @@ -275,7 +275,8 @@ export COSMOS3_HOST_PORT="${COSMOS3_HOST_PORT:-8000}" The container listens on port 8000; `-p "${COSMOS3_HOST_PORT}:8000"` publishes it on the host. Generator notebooks often use `COSMOS3_HOST_PORT=8001` so port 8000 -stays free for a Reasoner server. +stays free for a Reasoner server. The Docker commands run from `/workspace`, so +repo-local paths such as `cookbooks/...` resolve inside the container. **Cosmos3-Nano** (single GPU): @@ -285,6 +286,7 @@ docker run --runtime nvidia --gpus '"device=0"' \ -v "${HF_HOME}:/root/.cache/huggingface" \ -v "${COSMOS3_WORKDIR}:/workspace" \ -p "${COSMOS3_HOST_PORT}:8000" --ipc=host \ + -w /workspace \ vllm/vllm-omni:cosmos3 \ vllm serve nvidia/Cosmos3-Nano \ --omni \ @@ -301,6 +303,7 @@ docker run --runtime nvidia --gpus all \ -v "${HF_HOME}:/root/.cache/huggingface" \ -v "${COSMOS3_WORKDIR}:/workspace" \ -p "${COSMOS3_HOST_PORT}:8000" --ipc=host \ + -w /workspace \ vllm/vllm-omni:cosmos3 \ vllm serve nvidia/Cosmos3-Super \ --omni \ @@ -318,11 +321,10 @@ filesystem should be readable. vLLM-Omni prints `Application startup complete.` when the API is ready. 
-### Option 2: Native venv (limited modalities) +### Option 2: Native venv -To install from the upstreaming PR branch instead of Docker (text-to-image, -text-to-video, and image-to-video only — not action or sound yet), create a venv -and pick the CUDA build that matches your driver (see +To install from `main` instead of Docker, create a venv and pick the CUDA build +that matches your driver (see [CUDA driver and the `cuXXX` backend](#cuda-driver-and-the-cuxxx-backend)): ```bash @@ -331,11 +333,11 @@ source .venv/bin/activate # CUDA 13 driver: uv pip install --torch-backend=cu130 \ - "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head" + "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@main" # CUDA 12.x driver: # uv pip install --torch-backend=cu128 \ -# "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@refs/pull/3454/head" +# "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@main" ``` Run the same `vllm serve` arguments as in the Docker commands above, directly on diff --git a/cookbooks/cosmos3/generator/audiovisual/README.md b/cookbooks/cosmos3/generator/audiovisual/README.md index d80adad4..d95bf835 100644 --- a/cookbooks/cosmos3/generator/audiovisual/README.md +++ b/cookbooks/cosmos3/generator/audiovisual/README.md @@ -1,8 +1,8 @@ # Cosmos3 Generator Audiovisual Examples -Generate images and video (with optional audio) from text or image prompts with -`Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. Sample -prompts live under [`assets/`](./assets). +Generate images and video (with optional audio) from text, image, or video +prompts with `Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. +Sample prompts live under [`assets/`](./assets). 
Environment setup for every backend is centralized in the shared [Cosmos3 cookbooks environment setup](../../README.md) guide; each backend below @@ -176,11 +176,48 @@ Path("/tmp/cosmos3_t2v.mp4").write_bytes(response.content) For image-to-video, post to the same endpoint with an image under `files={"input_reference": ...}`. For audio, add `"generate_sound": "true"`. +For video-to-video, upload a source video under `input_reference` and choose the +clean conditioning frames through `extra_params`: + +```python +from pathlib import Path + +source_video = Path("../action/assets/videos/av_0.mp4").resolve() +with source_video.open("rb") as video_file: + response = requests.post( + "http://localhost:8000/v1/videos/sync", + data={ + "prompt": "Continue the same driving scene with smooth natural motion.", + "negative_prompt": "blurry, distorted, low quality, jittery, deformed", + "size": "832x480", + "num_frames": "61", + "fps": "10", + "num_inference_steps": "35", + "guidance_scale": "6.0", + "flow_shift": "10.0", + "seed": "2222", + "extra_params": json.dumps( + { + "use_resolution_template": False, + "use_duration_template": False, + "guardrails": True, + "condition_frame_indexes_vision": [0, 1], + "condition_video_keep": "first", + } + ), + }, + files={"input_reference": (source_video.name, video_file, "video/mp4")}, + headers={"Accept": "video/mp4"}, + ) +response.raise_for_status() +Path("/tmp/cosmos3_v2v.mp4").write_bytes(response.content) +``` ### Notebook walkthrough [`run_with_vllm_omni.ipynb`](./run_with_vllm_omni.ipynb) is the full tutorial for the vLLM-Omni backend: it walks through text-to-image, text-to-video, and -image-to-video requests with audio on or off. Server launch options (Nano and -Super, tensor parallelism, layerwise offload, and CFG-parallel variants) live in -the [shared environment setup guide](../../README.md#vllm-omni). +image-to-video requests with audio on or off, plus standard video-to-video +requests. 
Server launch options (Nano and Super, tensor parallelism, layerwise +offload, and CFG-parallel variants) live in the +[shared environment setup guide](../../README.md#vllm-omni). diff --git a/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb index a74f91a2..fae30615 100644 --- a/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb +++ b/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb @@ -16,7 +16,7 @@ "\n", "This notebook calls already-running vLLM Omni Cosmos3 servers with direct `curl` requests from Python.\n", "\n", - "The examples are split into Cosmos3-Nano and Cosmos3-Super sections. Each section is self-contained, so you can run just one. Each section targets the matching model endpoint.\n" + "The examples are split into Cosmos3-Nano and Cosmos3-Super sections. Each section is self-contained, so you can run just one. Each section targets the matching model endpoint and includes standard video-to-video examples where an input video is uploaded as the reference.\n" ], "id": "d88fe9a8" }, @@ -54,6 +54,7 @@ " -v \"$(pwd):/workspace\" \\\n", " -p 8000:8000 \\\n", " --ipc=host \\\n", + " -w /workspace \\\n", " vllm/vllm-omni:cosmos3 \\\n", " vllm serve nvidia/Cosmos3-Nano \\\n", " --omni \\\n", @@ -75,6 +76,7 @@ " -v \"$(pwd):/workspace\" \\\n", " -p 8000:8000 \\\n", " --ipc=host \\\n", + " -w /workspace \\\n", " vllm/vllm-omni:cosmos3 \\\n", " vllm serve nvidia/Cosmos3-Super \\\n", " --omni \\\n", @@ -100,8 +102,7 @@ " --init-timeout 1800\n", "```\n", "\n", - "For Cosmos3, set CFG strength with the request-level `guidance_scale` field. Do not use `true_cfg_scale` for CFG Parallel with these Cosmos3 examples.\n", - "" + "For Cosmos3, set CFG strength with the request-level `guidance_scale` field. 
Do not use `true_cfg_scale` for CFG Parallel with these Cosmos3 examples.\n" ], "id": "26776c50" }, @@ -298,6 +299,22 @@ " \"image\": \"assets/images/image2video/coastal_road_audio.jpg\",\n", " \"enable_sound\": True,\n", " },\n", + " \"v2v_nano_noaudio\": {\n", + " \"model\": \"Cosmos3-Nano\",\n", + " \"mode\": \"video2video\",\n", + " \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n", + " \"negative_prompt_mode\": \"image2video\",\n", + " \"video\": \"../action/assets/videos/av_0.mp4\",\n", + " \"enable_sound\": False,\n", + " \"sampling\": {\n", + " \"size\": \"832x480\",\n", + " \"fps\": 10,\n", + " \"num_frames\": 61,\n", + " \"seed\": 2222,\n", + " \"condition_frame_indexes_vision\": [0, 1],\n", + " \"condition_video_keep\": \"first\",\n", + " },\n", + " },\n", " \"t2v_super_noaudio\": {\n", " \"model\": \"Cosmos3-Super\",\n", " \"mode\": \"text2video\",\n", @@ -311,6 +328,22 @@ " \"image\": \"assets/images/image2video/car_driving.jpg\",\n", " \"enable_sound\": False,\n", " },\n", + " \"v2v_super_noaudio\": {\n", + " \"model\": \"Cosmos3-Super\",\n", + " \"mode\": \"video2video\",\n", + " \"prompt\": \"assets/prompts/image2video/car_driving.json\",\n", + " \"negative_prompt_mode\": \"image2video\",\n", + " \"video\": \"../action/assets/videos/av_0.mp4\",\n", + " \"enable_sound\": False,\n", + " \"sampling\": {\n", + " \"size\": \"832x480\",\n", + " \"fps\": 10,\n", + " \"num_frames\": 61,\n", + " \"seed\": 3333,\n", + " \"condition_frame_indexes_vision\": [0, 1],\n", + " \"condition_video_keep\": \"first\",\n", + " },\n", + " },\n", "}\n", "\n", "\n", @@ -326,6 +359,9 @@ "\n", "\n", "def payload_dimensions(payload: dict) -> tuple[int, int]:\n", + " if payload.get(\"size\"):\n", + " width, height = (int(part) for part in payload[\"size\"].split(\"x\", 1))\n", + " return height, width\n", " if payload.get(\"resolution\") == \"720\" and payload.get(\"aspect_ratio\") == \"16,9\":\n", " return 720, 1280\n", " if payload.get(\"resolution\") == \"256\" 
and payload.get(\"aspect_ratio\") == \"16,9\":\n", @@ -350,7 +386,8 @@ " prompt_path = asset_path(spec[\"prompt\"])\n", " negative_prompt = \"\"\n", " if spec[\"mode\"] != \"text2image\":\n", - " negative_prompt_path = asset_path(f\"assets/negative_prompts/{spec['mode']}/neg_prompt.json\")\n", + " negative_prompt_mode = spec.get(\"negative_prompt_mode\", spec[\"mode\"])\n", + " negative_prompt_path = asset_path(f\"assets/negative_prompts/{negative_prompt_mode}/neg_prompt.json\")\n", " negative_prompt = compact_json_file(negative_prompt_path)\n", " payload_path = payload_dir / f\"{use_case}.json\"\n", " payload = {\n", @@ -361,9 +398,13 @@ " \"enable_sound\": spec[\"enable_sound\"],\n", " **FIXED_SAMPLING,\n", " }\n", + " payload.update(spec.get(\"sampling\", {}))\n", " if spec[\"mode\"] == \"image2video\":\n", " image_path = asset_path(spec[\"image\"])\n", " payload[\"vision_path\"] = os.path.relpath(image_path, payload_path.parent)\n", + " elif spec[\"mode\"] == \"video2video\":\n", + " video_path = asset_path(spec[\"video\"])\n", + " payload[\"vision_path\"] = os.path.relpath(video_path, payload_path.parent)\n", "\n", " payload_path.write_text(json.dumps(payload, indent=2) + \"\\n\")\n", "\n", @@ -375,10 +416,29 @@ " print(f\"output: {output_dir}\")\n", " print(f\"prompt: {prompt_path.relative_to(COSMOS_ROOT)}\")\n", " if \"vision_path\" in payload:\n", - " image_display_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", - " print(f\"image: {image_display_path.relative_to(COSMOS_ROOT)}\")\n", - " display(Image(filename=str(image_display_path), width=420))\n", - " print(json.dumps({k: payload[k] for k in [\"model_mode\", \"name\", \"enable_sound\", \"num_steps\", \"guidance\", \"shift\", \"fps\", \"num_frames\", \"resolution\", \"aspect_ratio\", \"seed\"]}, indent=2))\n", + " reference_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", + " if payload[\"model_mode\"] == \"image2video\":\n", + " print(f\"image: 
{reference_path.relative_to(COSMOS_ROOT)}\")\n", + " display(Image(filename=str(reference_path), width=420))\n", + " else:\n", + " print(f\"video: {reference_path.relative_to(COSMOS_ROOT)}\")\n", + " keys = [\n", + " \"model_mode\",\n", + " \"name\",\n", + " \"enable_sound\",\n", + " \"num_steps\",\n", + " \"guidance\",\n", + " \"shift\",\n", + " \"fps\",\n", + " \"num_frames\",\n", + " \"resolution\",\n", + " \"aspect_ratio\",\n", + " \"size\",\n", + " \"condition_frame_indexes_vision\",\n", + " \"condition_video_keep\",\n", + " \"seed\",\n", + " ]\n", + " print(json.dumps({k: payload[k] for k in keys if k in payload}, indent=2))\n", " return payload_path, output_dir, spec[\"model\"]\n", "\n", "\n", @@ -414,6 +474,9 @@ " \"use_duration_template\": False,\n", " \"guardrails\": True,\n", " }\n", + " for key in (\"condition_frame_indexes_vision\", \"condition_video_keep\"):\n", + " if key in payload:\n", + " extra_params[key] = payload[key]\n", " form = {\n", " \"prompt\": payload[\"prompt\"],\n", " \"negative_prompt\": payload[\"negative_prompt\"],\n", @@ -475,9 +538,10 @@ " for key, value in build_vllm_form(payload).items():\n", " cmd += [\"--form-string\", f\"{key}={value}\"]\n", "\n", - " if payload[\"model_mode\"] == \"image2video\":\n", - " image_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", - " cmd += [\"-F\", f\"input_reference=@{image_path}\"]\n", + " if payload[\"model_mode\"] in {\"image2video\", \"video2video\"}:\n", + " reference_path = resolve_payload_path(payload_path, payload[\"vision_path\"])\n", + " media_type = \"video/mp4\" if payload[\"model_mode\"] == \"video2video\" else \"image/jpeg\"\n", + " cmd += [\"-F\", f\"input_reference=@{reference_path};type={media_type}\"]\n", "\n", " cmd += [\"-o\", str(tmp_path)]\n", " result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n", @@ -536,8 +600,9 @@ " print(\"endpoint:\", endpoint)\n", " print(\"payload:\", payload_path)\n", " 
print(\"output:\", output_path)\n", - " if payload[\"model_mode\"] == \"image2video\":\n", - " print(\"input image:\", resolve_payload_path(payload_path, payload[\"vision_path\"]))\n", + " if payload.get(\"vision_path\"):\n", + " label = \"input video\" if payload[\"model_mode\"] == \"video2video\" else \"input image\"\n", + " print(f\"{label}:\", resolve_payload_path(payload_path, payload[\"vision_path\"]))\n", " t0 = time.time()\n", " if payload[\"model_mode\"] == \"text2image\":\n", " post_image(payload=payload, output_path=output_path, model=model)\n", @@ -889,6 +954,58 @@ "outputs": [], "id": "58a28e11" }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Nano: Video to Video Without Audio\n", + "\n", + "Nano video-to-video generation using the checked-in autonomous-driving clip as the reference video. The request conditions on the first two latent video indexes and keeps the first source frames as clean conditioning.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "v2v_nano_noaudio_payload, v2v_nano_noaudio_output, v2v_nano_noaudio_model = create_payload(\"v2v_nano_noaudio\", backend=\"vllm\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_vllm_payload(v2v_nano_noaudio_payload, v2v_nano_noaudio_output, model=\"Cosmos3-Nano\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_run(v2v_nano_noaudio_output)\n" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1062,6 +1179,58 @@ ], "execution_count": null, "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Super: Video to Video 
Without Audio\n", + "\n", + "Super video-to-video generation using the same reference-video request shape as Nano, against the `Cosmos3-Super` endpoint.\n", + "\n", + "### Create Payload\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "v2v_super_noaudio_payload, v2v_super_noaudio_output, v2v_super_noaudio_model = create_payload(\"v2v_super_noaudio\", backend=\"vllm\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "run_vllm_payload(v2v_super_noaudio_payload, v2v_super_noaudio_output, model=\"Cosmos3-Super\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### View Results\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_run(v2v_super_noaudio_output)\n" + ] } ], "metadata": { diff --git a/cookbooks/cosmos3/generator/transfer/README.md b/cookbooks/cosmos3/generator/transfer/README.md index 0c477056..13991561 100644 --- a/cookbooks/cosmos3/generator/transfer/README.md +++ b/cookbooks/cosmos3/generator/transfer/README.md @@ -1,7 +1,8 @@ # Cosmos3 Generator Transfer Examples -Cosmos3-Nano video **transfer** examples on the native PyTorch (Cosmos Framework) path. -Sample assets under [`assets/`](./assets) cover spatial control signals paired with +Cosmos3-Nano video **transfer** examples for the native PyTorch (Cosmos +Framework) path and the OpenAI-compatible vLLM-Omni server path. Sample assets +under [`assets/`](./assets) cover spatial control signals paired with `prompt.json` files: - **Edge (Canny)** — edge map control plus caption. @@ -10,18 +11,20 @@ Sample assets under [`assets/`](./assets) cover spatial control signals paired w - **Segmentation** — segmentation map control plus caption. 
- **World scenario (WSM)** — world-scenario map control plus caption. -vLLM-Omni does not expose transfer controls today. - Environment setup is centralized in the shared [Cosmos3 cookbooks environment setup](../../README.md) guide. ## Transfer Definition -Video transfer generates a target clip from a `prompt.json` caption and a precomputed -control video on the hint block (`control_path`). Inference uses `model_mode` `video2video`; -there is no `vision_path` or source RGB video at run time. Output frame count and geometry -come from the control video; see the spec field reference for how `fps` and -`aspect_ratio` are resolved. All examples share +Video transfer generates a target clip from a `prompt.json` caption and a +precomputed control video on a hint block (`control_path`). The Framework path +uses `model_mode` `video2video` in a local JSON spec. The vLLM-Omni path uses +`POST /v1/videos/sync` and passes the same hint key (`edge`, `blur`, `depth`, +`seg`, or `wsm`) inside `extra_params`. + +There is no source RGB video at run time for the checked-in transfer examples. +Output frame count and geometry come from the control video; see the spec field +reference for how `fps` and `aspect_ratio` are resolved. All examples share `assets/negative_prompt.json` for the negative caption. | Control | Asset folder | Inference input | Generation duration | @@ -32,7 +35,8 @@ come from the control video; see the spec field reference for how `fps` and | Segmentation | `assets/seg/` | `control_seg.mp4` + `prompt.json` | 121 frames @ 30 FPS | | World scenario (WSM) | `assets/wsm/` | `control_wsm.mp4` + `prompt.json` | 101 frames @ 10 FPS | -Transfer inference is selected automatically when any hint key is present in the spec. +Transfer inference is selected automatically when any hint key is present in the +Framework spec or in vLLM-Omni `extra_params`. 
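As a client-side illustration of the rule above, the following sketch lists which hint keys a request's `extra_params` carries. The helper name `hint_keys_present` is hypothetical, and the server's actual selection logic is not shown in this README; only the five hint key names are taken from it.

```python
# Hint keys named in this README; the helper itself is illustrative only.
HINT_KEYS = ("edge", "blur", "depth", "seg", "wsm")


def hint_keys_present(extra_params: dict) -> list[str]:
    """Return the transfer hint keys found in extra_params, in HINT_KEYS order."""
    return [key for key in HINT_KEYS if key in extra_params]


params = {
    "guardrails": True,
    "depth": {"control_path": "cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},
}
print(hint_keys_present(params))
```

An empty result means the request carries no hint block, so by the rule above it would not be treated as a transfer request.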
## Run with Cosmos Framework @@ -97,6 +101,74 @@ checked-in assets under [`assets/`](./assets) via paths relative to [`specs/`](. Outputs are written under the directory passed to `-o`, with one subdirectory per sample name, for example `output/transfer_edge/vision.mp4`. Batch size must be 1 for transfer. +## Run with vLLM-Omni + +### Quickstart + +Set up the environment and start the server: +[vLLM-Omni setup](../../README.md#vllm-omni) (Docker recommended). Run the +Docker command from the `cosmos` repo root so the repo is mounted at +`/workspace` and the server runs from that directory inside the container: + +```bash +export COSMOS3_WORKDIR="$(pwd)" +export COSMOS3_HOST_PORT=8000 +``` + +The transfer examples send repo-local `control_path` strings to the server. For +Docker, those paths must be visible from the server working directory. With the +shared Docker setup, the checked-in depth control video is: + +```text +cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4 +``` + +If your server does not run from the repo root, start it from the repo root or +adjust `control_path` to a path the server process can read. 
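One way to catch an unreadable `control_path` before sending a request is to resolve it against the directory the server runs from. The sketch below is a client-side check only, under the assumption that the client sees the same files as the server (for example, the repo mounted at `/workspace`). The function `server_relative_control_path` and its `server_root` argument are hypothetical names, not part of vLLM-Omni.

```python
from pathlib import Path


def server_relative_control_path(control_file: Path, server_root: Path) -> str:
    """Return control_file as a POSIX path relative to server_root.

    Raises FileNotFoundError if the file does not exist, and ValueError if it
    lies outside server_root, because a relative path could not reach it.
    """
    control_file = control_file.resolve()
    server_root = server_root.resolve()
    if not control_file.is_file():
        raise FileNotFoundError(control_file)
    # relative_to raises ValueError when control_file is not under server_root.
    return control_file.relative_to(server_root).as_posix()
```

Called from the repo root with `server_root=Path.cwd()`, the result for the checked-in depth asset should match the repo-local path shown above. A `ValueError` suggests the file sits outside the directory the server was started from.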
+ +Send a depth-transfer request: + +```python +import json +from pathlib import Path + +import requests + +transfer_root = Path("cookbooks/cosmos3/generator/transfer") +prompt = json.dumps(json.loads((transfer_root / "assets/depth/prompt.json").read_text())) +negative = json.dumps(json.loads((transfer_root / "assets/negative_prompt.json").read_text())) +control_path = transfer_root / "assets/depth/control_depth.mp4" + +response = requests.post( + "http://localhost:8000/v1/videos/sync", + data={ + "prompt": prompt, + "negative_prompt": negative, + "size": "1280x720", + "num_frames": "121", + "fps": "30", + "num_inference_steps": "50", + "guidance_scale": "3.0", + "flow_shift": "10.0", + "seed": "2026", + "extra_params": json.dumps( + { + "use_resolution_template": False, + "use_duration_template": False, + "guardrails": True, + "depth": {"control_path": control_path.as_posix()}, + "control_guidance": 1.5, + "num_video_frames_per_chunk": 121, + "max_frames": 121, + } + ), + }, + headers={"Accept": "video/mp4"}, +) +response.raise_for_status() +Path("/tmp/cosmos3_transfer_depth.mp4").write_bytes(response.content) +``` + ### Spec field reference A representative spec (`specs/edge.json`): @@ -140,6 +212,9 @@ Key fields: - [`run_video_transfer_with_cosmos_framework.ipynb`](./run_video_transfer_with_cosmos_framework.ipynb) — full tutorial on a **GPU host**: environment setup, `nvidia-smi` check, then five inference blocks (edge, blur, depth, seg, wsm) with previews. See [Cosmos3 environment setup](../../README.md). +- [`run_video_transfer_with_vllm_omni.ipynb`](./run_video_transfer_with_vllm_omni.ipynb) — + full tutorial against an already-running vLLM-Omni server: endpoint checks, repo-local + control paths, five transfer requests, and compact previews. - [`specs/`](./specs) — checked-in Framework input JSON per control (paths relative to `specs/`). 
### Troubleshooting diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb index faf6ad3a..3e5cfb97 100644 --- a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb +++ b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb @@ -30,7 +30,7 @@ "- **seg** \u2014 segmentation map (`control_seg.mp4`)\n", "- **wsm** \u2014 world-scenario map (`control_wsm.mp4`)\n", "\n", - "vLLM-Omni does not expose transfer controls today; use this Cosmos Framework path only.\n", + "For the OpenAI-compatible vLLM-Omni transfer path, see [`run_video_transfer_with_vllm_omni.ipynb`](./run_video_transfer_with_vllm_omni.ipynb). This notebook focuses on the Cosmos Framework path.\n", "\n", "Sections **8\u201312** each run one control (inference + preview). Run only the blocks you need.\n", "\n", diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb new file mode 100644 index 00000000..0c7f1f46 --- /dev/null +++ b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb @@ -0,0 +1,492 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Cosmos3 Nano Transfer with vLLM-Omni\n", + "\n", + "This notebook calls an already-running vLLM-Omni Cosmos3 server through the OpenAI-compatible video API. It reuses the checked-in transfer control assets and specs from this cookbook, then sends edge, blur, depth, segmentation, and world-scenario-map transfer requests to `POST /v1/videos/sync`.\n", + "\n", + "The notebook does not modify the vLLM-Omni source tree.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Prerequisites\n", + "\n", + "Start a vLLM-Omni Cosmos3 server before running the request cells. The examples assume the `cosmos` repo is mounted at `/workspace` and the server runs from that directory inside the container, so repo-local `cookbooks/...` paths resolve correctly.\n", + "\n", + "```bash\n", + "cd /path/to/cosmos\n", + "export HF_HOME=\"${HF_HOME:-$HOME/.cache/huggingface}\"\n", + "export COSMOS3_WORKDIR=\"$PWD\"\n", + "export COSMOS3_HOST_PORT=8000\n", + "\n", + "docker run --runtime nvidia --gpus all \\\n", + " -e CUDA_DEVICE_ORDER=PCI_BUS_ID \\\n", + " -v \"${HF_HOME}:/root/.cache/huggingface\" \\\n", + " -v \"${COSMOS3_WORKDIR}:/workspace\" \\\n", + " -p \"${COSMOS3_HOST_PORT}:8000\" --ipc=host \\\n", + " -w /workspace \\\n", + " vllm/vllm-omni:cosmos3 \\\n", + " vllm serve nvidia/Cosmos3-Nano \\\n", + " --omni \\\n", + " --model-class-name Cosmos3OmniDiffusersPipeline \\\n", + " --allowed-local-media-path / \\\n", + " --port 8000 \\\n", + " --init-timeout 1800\n", + "```\n", + "\n", + "Generator guardrails are on by default and require access to the gated `nvidia/Cosmos-1.0-Guardrail` repository. To disable guardrails for these sample requests, set `COSMOS3_VLLM_GUARDRAILS=false`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Configure Paths and Endpoints\n", + "\n", + "Run this cell from anywhere inside the `cosmos` checkout. 
It resolves local assets, output paths, the vLLM-Omni endpoint, and repo-local `control_path` values.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "import base64\n", + "import html\n", + "import json\n", + "import os\n", + "import shutil\n", + "import subprocess\n", + "import time\n", + "from IPython.display import HTML, display\n", + "\n", + "\n", + "def find_repo_root(start: Path) -> Path:\n", + " for path in [start, *start.parents]:\n", + " if (path / \"README.md\").exists() and (path / \"cookbooks\").exists():\n", + " return path\n", + " return start\n", + "\n", + "\n", + "COSMOS_ROOT = find_repo_root(Path.cwd().resolve())\n", + "TRANSFER_ROOT = COSMOS_ROOT / \"cookbooks\" / \"cosmos3\" / \"generator\" / \"transfer\"\n", + "SPECS_DIR = TRANSFER_ROOT / \"specs\"\n", + "ASSETS_DIR = TRANSFER_ROOT / \"assets\"\n", + "OUTPUT_ROOT = Path(\n", + " os.environ.get(\"COSMOS3_TRANSFER_VLLM_OUTPUT_ROOT\", TRANSFER_ROOT / \"outputs\" / \"vllm_omni\")\n", + ").resolve()\n", + "VLLM_BASE_URL = os.environ.get(\"COSMOS3_VLLM_BASE_URL\", \"http://localhost:8000\").rstrip(\"/\")\n", + "GUARDRAILS = os.environ.get(\"COSMOS3_VLLM_GUARDRAILS\", \"true\").strip().lower() not in {\"0\", \"false\", \"no\", \"off\"}\n", + "\n", + "OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)\n", + "\n", + "print(\"COSMOS_ROOT:\", COSMOS_ROOT)\n", + "print(\"TRANSFER_ROOT:\", TRANSFER_ROOT)\n", + "print(\"OUTPUT_ROOT:\", OUTPUT_ROOT)\n", + "print(\"VLLM_BASE_URL:\", VLLM_BASE_URL)\n", + "print(\"GUARDRAILS:\", GUARDRAILS)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Verify Endpoint Configuration\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from urllib.parse import urlparse\n", + "\n", + "try:\n", + " import requests\n", + "except ImportError as exc:\n", + " raise RuntimeError(\"Install requests in this notebook kernel: pip install requests\") from exc\n", + "\n", + "\n", + "def api_root_url(base_url: str) -> str:\n", + " normalized = base_url.rstrip(\"/\")\n", + " if not normalized.endswith(\"/v1\"):\n", + " normalized = f\"{normalized}/v1\"\n", + " return normalized\n", + "\n", + "\n", + "API_ROOT = api_root_url(VLLM_BASE_URL)\n", + "VIDEOS_SYNC_URL = f\"{API_ROOT}/videos/sync\"\n", + "MODELS_URL = f\"{API_ROOT}/models\"\n", + "parsed = urlparse(API_ROOT)\n", + "print(\"api root:\", API_ROOT)\n", + "print(\"videos sync:\", VIDEOS_SYNC_URL)\n", + "print(\"models:\", MODELS_URL)\n", + "print(\"scheme:\", parsed.scheme)\n", + "print(\"host:\", parsed.netloc)\n", + "\n", + "response = requests.get(MODELS_URL, timeout=30)\n", + "response.raise_for_status()\n", + "print(response.json())\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Define Transfer Request Helpers\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "TRANSFER_CONTROLS = (\"edge\", \"blur\", \"depth\", \"seg\", \"wsm\")\n", + "\n", + "\n", + "def compact_json_file(path: Path) -> str:\n", + " return json.dumps(json.loads(path.read_text()), ensure_ascii=True, separators=(\",\", \":\"))\n", + "\n", + "\n", + "def resolve_spec_path(spec_path: Path, value: str) -> Path:\n", + " path = Path(value)\n", + " if path.is_absolute():\n", + " return path\n", + " return (spec_path.parent / path).resolve()\n", + "\n", + "\n", + "def repo_relative_path(local_path: Path) -> str:\n", + " return local_path.resolve().relative_to(COSMOS_ROOT).as_posix()\n", + "\n", + "\n", + "def spec_size(spec: dict) -> str:\n", + " height = int(spec[\"resolution\"])\n", + " width_ratio, height_ratio = (int(part) for part in spec[\"aspect_ratio\"].split(\",\", 1))\n", + " width = round(height * width_ratio / height_ratio)\n", + " return f\"{width}x{height}\"\n", + "\n", + "\n", + "def load_transfer_spec(control: str) -> tuple[Path, dict]:\n", + " if control not in TRANSFER_CONTROLS:\n", + " raise ValueError(f\"control must be one of {TRANSFER_CONTROLS}, got {control!r}\")\n", + " spec_path = SPECS_DIR / f\"{control}.json\"\n", + " if not spec_path.is_file():\n", + " raise FileNotFoundError(spec_path)\n", + " return spec_path, json.loads(spec_path.read_text())\n", + "\n", + "\n", + "def build_transfer_request(control: str) -> tuple[dict[str, str], Path, Path]:\n", + " spec_path, spec = load_transfer_spec(control)\n", + " hint = dict(spec[control])\n", + " local_control_path = resolve_spec_path(spec_path, hint[\"control_path\"])\n", + " hint[\"control_path\"] = repo_relative_path(local_control_path)\n", + "\n", + " extra_params = {\n", + " \"use_resolution_template\": False,\n", + " \"use_duration_template\": False,\n", + " \"guardrails\": GUARDRAILS,\n", + " control: hint,\n", + " \"control_guidance\": 
spec[\"control_guidance\"],\n", + " \"num_video_frames_per_chunk\": spec[\"num_video_frames_per_chunk\"],\n", + " \"num_conditional_frames\": spec.get(\"num_conditional_frames\", 1),\n", + " \"num_first_chunk_conditional_frames\": spec.get(\"num_first_chunk_conditional_frames\", 0),\n", + " \"share_vision_temporal_positions\": spec.get(\"share_vision_temporal_positions\", True),\n", + " \"max_frames\": spec[\"num_frames\"],\n", + " }\n", + " form = {\n", + " \"prompt\": compact_json_file(resolve_spec_path(spec_path, spec[\"prompt_path\"])),\n", + " \"negative_prompt\": compact_json_file(resolve_spec_path(spec_path, spec[\"negative_prompt_file\"])),\n", + " \"size\": spec_size(spec),\n", + " \"num_frames\": str(spec[\"num_frames\"]),\n", + " \"fps\": str(spec[\"fps\"]),\n", + " \"num_inference_steps\": \"50\",\n", + " \"guidance_scale\": str(spec[\"guidance\"]),\n", + " \"flow_shift\": \"10.0\",\n", + " \"seed\": \"2026\",\n", + " \"extra_params\": json.dumps(extra_params, separators=(\",\", \":\")),\n", + " }\n", + " output_path = OUTPUT_ROOT / control / f\"{spec['name']}.mp4\"\n", + " return form, local_control_path, output_path\n", + "\n", + "\n", + "def run_transfer(control: str) -> Path:\n", + " form, local_control_path, output_path = build_transfer_request(control)\n", + " output_path.parent.mkdir(parents=True, exist_ok=True)\n", + " error_path = output_path.with_suffix(\".error.txt\")\n", + " tmp_path = output_path.with_suffix(\".tmp\")\n", + " print(\"control:\", control)\n", + " print(\"local control:\", local_control_path)\n", + " print(\"size:\", form[\"size\"], \"frames:\", form[\"num_frames\"], \"fps:\", form[\"fps\"])\n", + " print(\"output:\", output_path)\n", + "\n", + " headers = {\"Accept\": \"video/mp4\"}\n", + " api_key = os.environ.get(\"COSMOS3_VLLM_API_KEY\")\n", + " if api_key:\n", + " headers[\"Authorization\"] = f\"Bearer {api_key}\"\n", + "\n", + " t0 = time.time()\n", + " response = requests.post(VIDEOS_SYNC_URL, data=form, 
headers=headers, timeout=3600)\n", + " if not response.ok:\n", + " error_path.write_text(response.text)\n", + " print(\"request failed:\", response.status_code)\n", + " print(response.text[:2000])\n", + " response.raise_for_status()\n", + " tmp_path.write_bytes(response.content)\n", + " tmp_path.replace(output_path)\n", + " print(f\"wrote {output_path} in {time.time() - t0:.1f}s\")\n", + " return output_path\n", + "\n", + "\n", + "def _ffmpeg_exe() -> str:\n", + " try:\n", + " import imageio_ffmpeg\n", + "\n", + " return imageio_ffmpeg.get_ffmpeg_exe()\n", + " except ImportError:\n", + " pass\n", + " exe = shutil.which(\"ffmpeg\")\n", + " if exe:\n", + " return exe\n", + " raise RuntimeError(\"Install imageio-ffmpeg or put ffmpeg on PATH to create compact previews.\")\n", + "\n", + "\n", + "def make_preview(src: Path, *, crf: int = 28) -> Path:\n", + " preview = src.with_name(f\"{src.stem}_preview.mp4\")\n", + " if not preview.exists() or preview.stat().st_mtime < src.stat().st_mtime:\n", + " subprocess.run(\n", + " [\n", + " _ffmpeg_exe(),\n", + " \"-y\",\n", + " \"-loglevel\",\n", + " \"error\",\n", + " \"-i\",\n", + " str(src),\n", + " \"-c:v\",\n", + " \"libx264\",\n", + " \"-crf\",\n", + " str(crf),\n", + " \"-preset\",\n", + " \"veryfast\",\n", + " \"-an\",\n", + " \"-pix_fmt\",\n", + " \"yuv420p\",\n", + " str(preview),\n", + " ],\n", + " check=True,\n", + " )\n", + " return preview\n", + "\n", + "\n", + "def display_video(path: Path, *, width: int = 720) -> None:\n", + " data = base64.b64encode(path.read_bytes()).decode(\"ascii\")\n", + " label = html.escape(str(path))\n", + " markup = f'''\n", + "\n", + "
<video controls width=\"{width}\" src=\"data:video/mp4;base64,{data}\"></video>\n", + "<div>{label}</div>
\n", + "'''\n", + " display(HTML(markup))\n", + "\n", + "\n", + "def view_transfer(control: str, output_path: Path | None = None) -> None:\n", + " form, local_control_path, expected_output = build_transfer_request(control)\n", + " output_path = Path(output_path or expected_output)\n", + " if not output_path.is_file():\n", + " raise FileNotFoundError(f\"missing output: {output_path} (run {control} transfer first)\")\n", + " for label, src in [(\"control\", local_control_path), (\"generated\", output_path)]:\n", + " preview = make_preview(src)\n", + " print(f\"{control} {label}: {src.name} ({src.stat().st_size // 1024} KB -> {preview.stat().st_size // 1024} KB preview)\")\n", + " display_video(preview)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Preview Available Inputs\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for control in TRANSFER_CONTROLS:\n", + " form, local_control_path, _ = build_transfer_request(control)\n", + " prompt = json.loads(form[\"prompt\"])\n", + " caption = prompt.get(\"temporal_caption\") or prompt.get(\"comprehensive_t2i_caption\") or prompt.get(\"extra\", {}).get(\"prompt\", \"\")\n", + " print(f\"{control}: {local_control_path.relative_to(COSMOS_ROOT)}\")\n", + " print(f\" size={form['size']} frames={form['num_frames']} fps={form['fps']}\")\n", + " print(f\" prompt={caption[:180]}{'...' if len(caption) > 180 else ''}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Edge (Canny) Transfer\n", + "\n", + "Run the `edge` transfer request through vLLM-Omni, then display the input control video and generated output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "edge_output = run_transfer(\"edge\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_transfer(\"edge\", edge_output)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Blur Transfer\n", + "\n", + "Run the `blur` transfer request through vLLM-Omni, then display the input control video and generated output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "blur_output = run_transfer(\"blur\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_transfer(\"blur\", blur_output)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Depth Transfer\n", + "\n", + "Run the `depth` transfer request through vLLM-Omni, then display the input control video and generated output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "depth_output = run_transfer(\"depth\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_transfer(\"depth\", depth_output)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. 
Segmentation Transfer\n", + "\n", + "Run the `seg` transfer request through vLLM-Omni, then display the input control video and generated output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seg_output = run_transfer(\"seg\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_transfer(\"seg\", seg_output)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10. World Scenario Map Transfer\n", + "\n", + "Run the `wsm` transfer request through vLLM-Omni, then display the input control video and generated output.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "wsm_output = run_transfer(\"wsm\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "view_transfer(\"wsm\", wsm_output)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From e1198ce3338f1e2a725751da93169fce3918cc8f Mon Sep 17 00:00:00 2001 From: Maciej Bala Date: Tue, 16 Jun 2026 15:56:20 +0200 Subject: [PATCH 2/2] bugfix Signed-off-by: Maciej Bala --- README.md | 4 ++-- cookbooks/cosmos3/generator/transfer/README.md | 5 +++++ .../transfer/run_video_transfer_with_vllm_omni.ipynb | 1 + 3 files changed, 8 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index a5f808aa..dc956422 100644 --- a/README.md +++ b/README.md @@ -413,7 +413,7 @@ curl -sS -X POST http://localhost:8000/v1/videos/sync \ --form-string "guidance_scale=3.0" \ --form-string "flow_shift=10.0" \ --form-string "seed=2026" \ - --form-string 
'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \ + --form-string 'extra_params={"use_resolution_template":false,"use_duration_template":false,"guardrails":true,"depth":{"control_path":"cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4"},"resolution":"720","control_guidance":1.5,"num_video_frames_per_chunk":121,"max_frames":121}' \ -o cosmos3_transfer_depth.mp4 ``` @@ -434,7 +434,7 @@ Common request fields (the image endpoint follows the [Image Generation API](htt | `max_sequence_length` | Maximum number of prompt tokens kept for conditioning (Cosmos 3 default `512`); longer prompts are truncated with a warning, shorter ones padded | | `input_reference` | Uploaded image or video for image-to-video, video-to-video, and action requests | | `video_reference` | JSON-safe video reference for video-to-video requests, such as `{"video_url":"https://..."}`; do not combine with `input_reference` or `image_reference` | -| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`), prompt-template toggles (`use_resolution_template`, `use_duration_template`), and the per-request `guardrails` toggle | +| `extra_params` | JSON-encoded Cosmos 3-specific options: action settings (`action_mode`, `domain_name`, `raw_action_dim`, `action_chunk_size`, `action_path`), video-to-video conditioning (`condition_frame_indexes_vision`, `condition_video_keep`), transfer hints (`edge`, `blur`, `depth`, `seg`, `wsm`) and transfer bucket `resolution`, prompt-template toggles (`use_resolution_template`, 
`use_duration_template`), and the per-request `guardrails` toggle | | `extra_args` | JSON object for Cosmos 3-specific image-endpoint options such as `use_resolution_template` | Disabling guardrails: Cosmos 3 ships safety guardrails that screen prompts and blur faces in generated output. Disable them per request by adding `guardrails: false` to `extra_params`: diff --git a/cookbooks/cosmos3/generator/transfer/README.md b/cookbooks/cosmos3/generator/transfer/README.md index 13991561..8d372941 100644 --- a/cookbooks/cosmos3/generator/transfer/README.md +++ b/cookbooks/cosmos3/generator/transfer/README.md @@ -126,6 +126,10 @@ cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4 If your server does not run from the repo root, start it from the repo root or adjust `control_path` to a path the server process can read. +Transfer requests should also pass the spec `resolution` inside `extra_params`. +The video API `size` field controls the output size string, but Cosmos3 transfer +bucket selection reads `extra_params.resolution`. 
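To keep `size` and `extra_params.resolution` consistent, both can be derived from the same spec values. This is a minimal sketch, assuming `resolution` is the output height in pixels and `aspect_ratio` is a `width,height` string such as `16,9`, which is how the helper in the vLLM-Omni transfer notebook treats them. The name `size_and_resolution` is hypothetical.

```python
def size_and_resolution(spec: dict) -> tuple[str, str]:
    """Return (size, resolution) strings for a transfer request.

    size is meant for the form field; resolution is meant for extra_params.
    """
    height = int(spec["resolution"])
    width_ratio, height_ratio = (int(part) for part in spec["aspect_ratio"].split(",", 1))
    width = round(height * width_ratio / height_ratio)
    return f"{width}x{height}", str(height)


size, resolution = size_and_resolution({"resolution": "720", "aspect_ratio": "16,9"})
print(size, resolution)  # 1280x720 720
```

Deriving both values from one spec avoids a request whose `size` says 720p while the transfer bucket is chosen from a different `resolution`.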
+ Send a depth-transfer request: ```python @@ -157,6 +161,7 @@ response = requests.post( "use_duration_template": False, "guardrails": True, "depth": {"control_path": control_path.as_posix()}, + "resolution": "720", "control_guidance": 1.5, "num_video_frames_per_chunk": 121, "max_frames": 121, diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb index 0c7f1f46..ce009292 100644 --- a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb +++ b/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_vllm_omni.ipynb @@ -204,6 +204,7 @@ " \"use_duration_template\": False,\n", " \"guardrails\": GUARDRAILS,\n", " control: hint,\n", + " \"resolution\": spec[\"resolution\"],\n", " \"control_guidance\": spec[\"control_guidance\"],\n", " \"num_video_frames_per_chunk\": spec[\"num_video_frames_per_chunk\"],\n", " \"num_conditional_frames\": spec.get(\"num_conditional_frames\", 1),\n",