diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index fd6ad5e3..d6dd005e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,81 +1,282 @@ # Contributing to NVIDIA Cosmos -Thank you for your interest in contributing to NVIDIA Cosmos. This document provides guidelines and instructions for contributing. +Thank you for your interest in contributing to NVIDIA Cosmos. This guide covers how to propose changes, add new cookbooks, and maintain the quality bar we hold for community-facing content. ## Code of Conduct This project adheres to the [NVIDIA Open Source Code of Conduct](https://github.com/NVIDIA/cosmos/blob/main/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior by filing an issue or contacting [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com). +--- + ## How to Contribute ### Reporting Issues -If you encounter a bug or have a feature request, please open an issue on the [GitHub Issues](https://github.com/NVIDIA/cosmos/issues) page. When filing an issue, include: +Open an issue on [GitHub Issues](https://github.com/NVIDIA/cosmos/issues) with: -- A clear and descriptive title -- Steps to reproduce the problem (if applicable) -- Expected behavior vs. actual behavior -- Your environment details (OS, CUDA version, GPU model, Python version) +- A clear, descriptive title +- Steps to reproduce (if applicable) +- Expected vs. actual behavior +- Environment details: OS, CUDA version, GPU model, Python version, `uv` version - Relevant logs or error messages -### Submitting Changes +### Contribution Workflow -1. **Fork the repository** and create a new branch from `main`: +1. **Fork** the repository and create a branch from `main`: - ```shell - git checkout -b your-branch-name + ```bash + git checkout -b cookbook/descriptive-name # or docs/, fix/, benchmark/ ``` -2. **Make your changes.** Ensure your changes follow the project conventions and do not introduce regressions. +2. **Make your changes** following the guidelines below. -3. **Test your changes.** Verify that existing cookbooks and examples still work correctly with your modifications. +3. **Test your changes.** Run your notebook end-to-end on the target GPU. Verify existing cookbooks are unaffected. -4. **Commit your changes** with a clear, descriptive commit message: +4. **Commit** with a clear message: - ```shell - git commit -m "Brief description of the change" + ```bash + git commit -m "Add worker-safety Reasoner cookbook with vLLM backend" ``` -5. **Push to your fork** and open a Pull Request against the `main` branch of the upstream repository. +5. **Push and open a Pull Request** against `main`. ### Pull Request Guidelines - Provide a clear description of what your PR does and why -- Reference any related issues (e.g., `Fixes #123`) -- Keep PRs focused: one logical change per PR -- Ensure your branch is up to date with `main` before submitting -- Be responsive to review feedback +- Reference related issues (e.g., `Fixes #123`) +- One logical change per PR +- Ensure your branch is up to date with `main` +- Respond to review feedback promptly + +--- + +## Cookbook Structure + +The `cookbooks/` directory is organized by **model generation → tower → capability**. Each cookbook is a self-contained directory with a README, one or more runnable notebooks, and supporting assets. + +``` +cookbooks/ +└── cosmos3/ + ├── README.md # Shared setup (all backends) + ├── cosmos3-model-architecture.png + │ + ├── reasoner/ # Reasoner Tower + │ ├── README.md # Reasoner overview + backend table + │ ├── basic_examples/ # Shipped starter cookbooks + │ │ ├── reasoner_prompt_guide.md + │ │ ├── run_with_vllm.ipynb + │ │ ├── run_with_nim.ipynb + │ │ ├── run_with_cosmos_framework.ipynb + │ │ └── assets/ + │ └── / # ← Community contributions go here + │ ├── README.md + │ ├── run__with_.ipynb + │ └── assets/ + │ + ├── generator/ + │ ├── audiovisual/ # Generator: T2I, T2V, I2V, audio + │ │ ├── README.md + │ │ ├── basic_examples/ # Shipped starter cookbooks + │ │ │ ├── run_with_diffusers.ipynb + │ │ │ ├── run_with_vllm_omni.ipynb + │ │ │ ├── run_with_cosmos_framework.ipynb + │ │ │ └── assets/ + │ │ └── / # ← Community contributions go here + │ │ + │ ├── action/ # Generator: policy, FDM, IDM + │ │ ├── README.md + │ │ ├── basic_examples/ # Shipped starter cookbooks + │ │ │ ├── run_fd_with_cosmos_framework.ipynb + │ │ │ ├── run_fd_with_vllm.ipynb + │ │ │ ├── run_id_with_cosmos_framework.ipynb + │ │ │ ├── run_id_with_vllm.ipynb + │ │ │ ├── run_policy_with_cosmos_framework.md + │ │ │ └── assets/ + │ │ └── / # ← Community contributions go here + │ │ + │ └── transfer/ # Generator: video-to-video transfer + │ ├── README.md + │ ├── basic_examples/ # Shipped starter cookbooks + │ │ ├── run_video_transfer_with_cosmos_framework.ipynb + │ │ ├── preview_helpers.py + │ │ ├── specs/ + │ │ └── assets/ + │ └── / # ← Community contributions go here + │ + └── end2end/ # Multi-tower or application workflows + ├── README.md + └── / # ← Community contributions go here + ├── README.md + ├── run__with_.md + └── assets/ +``` + +### Where Does My Cookbook Go? + +| Your cookbook does... | Place it under | +|----------------------|---------------| +| Image/video understanding, VLM, reasoning, grounding | `cookbooks/cosmos3/reasoner/` | +| Text-to-image, text-to-video, image-to-video, audio | `cookbooks/cosmos3/generator/audiovisual/` | +| Robotics policy, forward/inverse dynamics | `cookbooks/cosmos3/generator/action/` | +| Video-to-video style transfer, edge-guided generation | `cookbooks/cosmos3/generator/transfer/` | +| Multi-tower application workflows or external robot/simulation stacks | `cookbooks/cosmos3/end2end/` | + +If your cookbook spans multiple towers (e.g., Reasoner analysis → Generator synthesis), create a new directory under `cookbooks/cosmos3/` with a clear name (e.g., `cookbooks/cosmos3/end2end/`). + +--- + +## Cookbook Quality Requirements + +Every cookbook merged into this repo must meet these requirements. Reviewers will check each item. + +### 1. Open-Access Data Only + +- All datasets must be **publicly downloadable** without NVIDIA-internal credentials +- Acceptable sources: HuggingFace Hub (public or gated with free access), public URLs, synthetic data generated in the notebook +- If working with partners, request a **small public subset** for the cookbook example +- Include the dataset license in your README + +**Not acceptable:** Internal S3 buckets, VPN-only URLs, private NFS mounts, datasets requiring paid partner agreements + +### 2. Results / Expected Output + +Every cookbook must include a **Results** section showing what a successful run looks like: + +- **Inference cookbooks:** Sample generated images/videos, text outputs, or action trajectories saved to `assets/` +- **Post-training cookbooks:** Training loss curves, before/after comparison, evaluation metrics +- **Timing benchmarks:** Wall-clock time on the target GPU (e.g., "Cosmos3-Nano T2V: 45s on 1× A100") + +This lets developers validate their own runs against a known-good baseline. + +### 3. Canonical Setup (No Hidden Dependencies) + +- **Do not duplicate setup instructions.** Link to the shared [`cookbooks/cosmos3/README.md`](cookbooks/cosmos3/README.md) for backend installation (Cosmos Framework, Diffusers, vLLM, NIM) +- Your README should only document **cookbook-specific** dependencies beyond the shared setup +- All dependencies must be installable via `uv pip install` or `apt-get` — no manual builds +- Pin specific versions of critical packages when they affect reproducibility + +### 4. One-Click Runnable + +- Each notebook should run **top-to-bottom without manual intervention** +- Use environment variables for configurable paths (`HF_TOKEN`, `COSMOS3_MEDIA_ROOT`, etc.) +- Default to the smallest model size (Cosmos3-Nano) so the widest set of GPUs can run it +- If a cookbook requires a running server (vLLM, NIM), provide the exact launch command in the README and automate the health check in the notebook + +### 5. Naming Convention + +Follow the existing pattern: + +``` +run__with_.ipynb +``` + +Examples: +- `run_with_vllm.ipynb` — generic Reasoner inference via vLLM +- `run_fd_with_cosmos_framework.ipynb` — forward dynamics via Cosmos Framework +- `run_video_transfer_with_cosmos_framework.ipynb` — video transfer via Cosmos Framework + +For markdown-only guides (no notebook): `run__with_.md` + +--- + +## Cookbook README Template + +Each cookbook directory needs a `README.md`. Use this structure: + +```markdown +# [Cookbook Title] + +One-paragraph description of what this cookbook demonstrates and why it matters. + +## What You'll Build + +- Bullet list of concrete outputs (e.g., "Generate a 480p video from a text prompt") + +## Prerequisites + +- Link to [shared setup](../README.md#backend-name) for backend installation +- Any additional cookbook-specific requirements + +## Backends + +| Backend | Notebook | GPU Requirement | +|---------|----------|----------------| +| vLLM | [`run_with_vllm.ipynb`](run_with_vllm.ipynb) | 1× A100 (80 GB) | +| NIM | [`run_with_nim.ipynb`](run_with_nim.ipynb) | 1× A100 (80 GB) | + +## Quick Start + +Minimal steps to go from clone to first result: + + 1. Set up the backend (link) + 2. Run the notebook + 3. Check your outputs in `assets/` + +## Results / Expected Output + +Sample outputs, metrics, and timing benchmarks from a successful run. + +## Dataset + +| Name | Source | License | Size | +|------|--------|---------|------| +| Dataset Name | [HuggingFace link](...) | Apache 2.0 | ~2 GB | +``` + +--- + +## Contribution Areas + +We welcome contributions in these areas: + +| Area | Examples | +|------|---------| +| **New cookbooks** | Domain-specific applications (robotics, AV, healthcare, manufacturing) | +| **New backends** | Additional serving/inference backends for existing cookbooks | +| **Documentation** | README improvements, prompt guides, architecture explanations | +| **Bug fixes** | Notebook fixes, broken links, version compatibility issues | +| **Benchmarks** | Inference timing across GPU configurations (A100, H100, L40S, RTX 4090) | +| **Post-training recipes** | SFT, LoRA, domain adaptation examples with open datasets | + +### What We Won't Merge + +- Cookbooks that depend on internal/proprietary datasets +- Notebooks that require manual mid-run intervention +- Changes that break existing cookbook functionality +- Generated binary files (model weights, large media) — use HuggingFace/external links instead + +--- ## Development Setup ### Prerequisites - Python 3.10 or later -- CUDA 12.8 or 13.x (see [Troubleshooting](README.md#troubleshooting) for version matching) +- CUDA 12.8 or 13.x (see [Troubleshooting](README.md#troubleshooting)) - An NVIDIA GPU with sufficient VRAM for your target workflow -- `uv` >= 0.11.3 (install from [astral.sh/uv](https://astral.sh/uv)) +- `uv` >= 0.11.3 ([astral.sh/uv](https://astral.sh/uv)) +- `git-lfs` installed (`apt-get install git-lfs`) ### Getting Started -1. Clone the repository: +```bash +git clone https://github.com/NVIDIA/cosmos.git +cd cosmos +``` - ```shell - git clone https://github.com/NVIDIA/cosmos.git - cd cosmos - ``` - -2. Set up your environment following the instructions in the [README](README.md). +Follow [cookbooks/cosmos3/README.md](cookbooks/cosmos3/README.md) to set up the backend(s) your cookbook uses. -3. Explore the [cookbooks](cookbooks/) for end-to-end examples of Generator and Reasoner workflows. +### Testing Your Cookbook -## Contribution Areas +Before submitting: -We welcome contributions in the following areas: +1. **Clean run:** Restart your kernel and run all cells top-to-bottom +2. **Minimal GPU:** Test on the smallest supported GPU configuration +3. **No secrets:** Verify no API keys, tokens, or internal paths are committed +4. **Output cells:** Clear large output cells but keep the Results section outputs +5. **File sizes:** Ensure no single file exceeds 10 MB (use git-lfs for larger assets or link externally) -- **Cookbooks and examples:** New notebooks demonstrating Cosmos 3 capabilities -- **Documentation:** Improvements to README, cookbook READMEs, or inline documentation -- **Bug fixes:** Fixes for issues in existing code or documentation -- **Benchmarks:** Additional inference benchmark results across different hardware configurations +--- ## License @@ -83,4 +284,4 @@ By contributing to this project, you agree that your contributions will be licen ## Questions? -If you have questions about contributing, feel free to open an issue or reach out at [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com). +If you have questions about contributing, open an issue or reach out at [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com). diff --git a/README.md b/README.md index d54eaa07..6c29d092 100644 --- a/README.md +++ b/README.md @@ -522,7 +522,7 @@ docker run -it --rm --name=$CONTAINER_NAME \ The OpenAI-compatible API is then available at `http://127.0.0.1:8000/v1`. Query it with `curl`: ```shell -IMAGE_DATA_URI="data:image/jpeg;base64,$(base64 -w 0 cookbooks/cosmos3/reasoner/assets/robot_153.jpg)" +IMAGE_DATA_URI="data:image/jpeg;base64,$(base64 -w 0 cookbooks/cosmos3/reasoner/basic_examples/assets/robot_153.jpg)" curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \ -H 'Accept: application/json' \ @@ -627,21 +627,22 @@ The Cosmos Framework requires `uv >= 0.11.3` (enforced via its `pyproject.toml`) ### Examples -We are building examples that show Cosmos 3 capabilities end to end, including world generation, world understanding, captioning, temporal localization, grounding, and physical reasoning. Each example is a self-contained script or notebook you can run from this repository. +We are building examples that show Cosmos 3 capabilities end to end, including world generation, world understanding, captioning, temporal localization, grounding, and physical reasoning. Each example is a self-contained script, notebook, or guide you can run from this repository. | Example | Surface | Workflows demonstrated | Open | nbviewer | | --- | --- | --- | --- | --- | -| Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb) | -| Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb) | -| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb) | -| Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb) | -| Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb) | -| Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb) | -| Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb) | -| Transfer with Cosmos Framework | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb) | -| Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb) | -| Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). | [Notebook](cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb) | -| Reasoner with NIM | Reasoner | The same image and video reasoning examples as the vLLM notebook, run against the prebuilt, OpenAI-compatible [Cosmos 3 Reasoner NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner) container; local media is sent as base64 data URIs. | [Notebook](cookbooks/cosmos3/reasoner/run_with_nim.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/run_with_nim.ipynb) | +| Generator (audiovisual) with Diffusers | Generator | Text-to-image, plus text-to-video and image-to-video each with or without synchronized sound, via `Cosmos3OmniPipeline`. | [Notebook](cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_diffusers.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_diffusers.ipynb) | +| Generator (audiovisual) with Cosmos Framework | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_cosmos_framework.ipynb) | +| Generator (audiovisual) with vLLM-Omni | Generator | Text-to-image, plus text-to-video and image-to-video each with sound on or off, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_vllm_omni.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_vllm_omni.ipynb) | +| Forward dynamics with Cosmos Framework | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_cosmos_framework.ipynb) | +| Forward dynamics with vLLM-Omni | Generator | Forward dynamics: action-conditioned future-observation prediction for AV, DROID, and UMI, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_vllm.ipynb) | +| Inverse dynamics with Cosmos Framework | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/action/basic_examples/run_id_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/basic_examples/run_id_with_cosmos_framework.ipynb) | +| Inverse dynamics with vLLM-Omni | Generator | Inverse dynamics: ego-motion trajectory prediction from input AV video, against an OpenAI-compatible vLLM-Omni server. | [Notebook](cookbooks/cosmos3/generator/action/basic_examples/run_id_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/action/basic_examples/run_id_with_vllm.ipynb) | +| Transfer with Cosmos Framework | Generator | Video transfer: edge, blur, depth, segmentation, and world-scenario controls with captions, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/generator/transfer/basic_examples/run_video_transfer_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/generator/transfer/basic_examples/run_video_transfer_with_cosmos_framework.ipynb) | +| Explainable mixed-SKU palletizer | End-to-end | Cosmos3-Nano and Cosmos3-Super Diffusers validation for palletizing-scene artifacts, plus a Doosan Robotics Isaac Sim/cuRobo full-stack smoke path. | [Guide](cookbooks/cosmos3/end2end/explainable-palletizer/README.md) | N/A | +| Reasoner with Cosmos Framework | Reasoner | Text and image reasoning: detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts, through the `cosmos_framework.scripts.inference` entrypoint. | [Notebook](cookbooks/cosmos3/reasoner/basic_examples/run_with_cosmos_framework.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/basic_examples/run_with_cosmos_framework.ipynb) | +| Reasoner with vLLM | Reasoner | Image and video reasoning: captioning, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, action CoT, driving scenes, physical-plausibility, and situation understanding, against an OpenAI-compatible vLLM server (Cosmos3-Super on 4 GPUs by default; switch to Nano per the cookbook README). | [Notebook](cookbooks/cosmos3/reasoner/basic_examples/run_with_vllm.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/basic_examples/run_with_vllm.ipynb) | +| Reasoner with NIM | Reasoner | The same image and video reasoning examples as the vLLM notebook, run against the prebuilt, OpenAI-compatible [Cosmos 3 Reasoner NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/cosmos3-reasoner) container; local media is sent as base64 data URIs. | [Notebook](cookbooks/cosmos3/reasoner/basic_examples/run_with_nim.ipynb) | [![Render with nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/nvidia/cosmos/blob/main/cookbooks/cosmos3/reasoner/basic_examples/run_with_nim.ipynb) | ### Inference Benchmarks diff --git a/cookbooks/cosmos3/README.md b/cookbooks/cosmos3/README.md index ecf78a06..48f6474c 100644 --- a/cookbooks/cosmos3/README.md +++ b/cookbooks/cosmos3/README.md @@ -203,7 +203,7 @@ export VLLM_USE_DEEP_GEMM=0 All Reasoner cookbooks talk to an OpenAI-compatible chat-completions API. After [installing vLLM](#vllm), run the commands below from `cookbooks/cosmos3/reasoner` (same working directory as -[`run_with_vllm.ipynb`](reasoner/run_with_vllm.ipynb)). That sets +[`run_with_vllm.ipynb`](reasoner/basic_examples/run_with_vllm.ipynb)). That sets `$(dirname "$(pwd)")` to `/cookbooks/cosmos3`, which matches the notebook's `COSMOS3_MEDIA_ROOT`. @@ -221,7 +221,7 @@ vllm serve nvidia/Cosmos3-Nano \ --port 8000 ``` -**Cosmos3-Super** (four GPUs; default in [`run_with_vllm.ipynb`](reasoner/run_with_vllm.ipynb), port 8001): +**Cosmos3-Super** (four GPUs; default in [`run_with_vllm.ipynb`](reasoner/basic_examples/run_with_vllm.ipynb), port 8001): ```bash export COSMOS3_MEDIA_ROOT="$(dirname "$(pwd)")" diff --git a/cookbooks/cosmos3/end2end/README.md b/cookbooks/cosmos3/end2end/README.md new file mode 100644 index 00000000..210a318a --- /dev/null +++ b/cookbooks/cosmos3/end2end/README.md @@ -0,0 +1,12 @@ +# Cosmos3 End-to-End Cookbooks + +End-to-end cookbooks combine multiple Cosmos3 capabilities or connect Cosmos3 to +external robotics, simulation, or application stacks. + +Environment setup for reusable Cosmos3 backends is centralized in the shared +[Cosmos3 cookbooks environment setup](../README.md) guide. Each cookbook below +documents any application-specific services, ports, and validation steps. + +| Cookbook | Backends | Entry point | +| --- | --- | --- | +| Explainable mixed-SKU palletizer | Cosmos3 Diffusers, Doosan Robotics simulation stack | [`explainable-palletizer/`](./explainable-palletizer/) | diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/README.md b/cookbooks/cosmos3/end2end/explainable-palletizer/README.md new file mode 100644 index 00000000..90e15f63 --- /dev/null +++ b/cookbooks/cosmos3/end2end/explainable-palletizer/README.md @@ -0,0 +1,168 @@ +# See How It Thinks: Mixed Palletizing with Explainable Visual Reasoning + +This cookbook shows how to validate an explainable mixed-SKU palletizing workflow +with Cosmos3-Nano or Cosmos3-Super. The live Cosmos3 path uses a Diffusers-backed +generation endpoint to create auditable palletizing-scene outputs, while the +Doosan Robotics reference stack provides the full Isaac Sim, cuRobo, FastAPI, +and React control-loop smoke test. + +| Model | Workload | Use case | +| --- | --- | --- | +| [Cosmos3-Nano](https://huggingface.co/nvidia/Cosmos3-Nano), [Cosmos3-Super](https://huggingface.co/nvidia/Cosmos3-Super), [Isaac Sim](https://developer.nvidia.com/isaac-sim), [cuRobo](https://curobo.org/), [Diffusers](https://github.com/huggingface/diffusers) | End-to-end | Explainable mixed-SKU palletizing: visual scene generation, handling-policy validation, and simulated robot execution | + +**Source project:** [doosan-robotics/explainable-palletizer](https://github.com/doosan-robotics/explainable-palletizer) + +## What You Will Build + +- Run a Cosmos3-Nano or Cosmos3-Super Diffusers smoke request for a palletizing + cell and save the generated media artifact. +- Validate the Doosan reference stack in no-token mode with real Isaac Sim and a + tiny inference model. +- Use the same prompt family to document why the system places, delays, or routes + boxes for human inspection. +- Review the source project's damaged-carton, heavy-box, and mixed-SKU + walkthroughs in [workflow_e2e.md](workflow_e2e.md). + +## Prerequisites + +- Follow the shared [Diffusers setup](../../README.md#diffusers) if you are + running Cosmos3 locally. +- For the Doosan reference stack, install Docker with Compose V2 and the NVIDIA + Container Toolkit on a Linux GPU host. +- Install [`uv`](https://docs.astral.sh/uv/) for the Doosan full-stack smoke if + the source checkout needs to generate `uv.lock`. +- For full Cosmos3 model runs, authenticate to Hugging Face and accept the + relevant Cosmos3 model licenses. +- Use a host with enough VRAM for the selected model size. + +| Profile | Model | Suggested hardware | Notes | +| --- | --- | --- | --- | +| Nano | `nvidia/Cosmos3-Nano` | Single NVIDIA GB10, RTX 4090-class, or larger GPU | Default for smoke tests and widest accessibility | +| Super | `nvidia/Cosmos3-Super` | Multi-GPU Hopper/Blackwell-class system | Higher quality; launch the serving backend with the Super checkpoint before running the same client request | + +## Backends + +| Backend | Entry point | GPU requirement | +| --- | --- | --- | +| Diffusers-compatible HTTP endpoint | [`run_palletizer_with_diffusers.md`](run_palletizer_with_diffusers.md) | Nano: single GPU; Super: multi-GPU | +| Doosan reference stack | `make docker-test` in the source repo | NVIDIA GPU with Isaac Sim support | + +## Quick Start + +Use the Diffusers entry point first. It is the fastest way to confirm that the +Cosmos3-Nano or Cosmos3-Super service is reachable and can produce a palletizing +scene artifact: + +```bash +cd cookbooks/cosmos3/end2end/explainable-palletizer +export COSMOS3_DIFFUSERS_BASE_URL=http://127.0.0.1:8000 +curl -fsS "${COSMOS3_DIFFUSERS_BASE_URL}/health" +curl -fsS "${COSMOS3_DIFFUSERS_BASE_URL}/v1/models" +``` + +Then run the Python smoke-test block in +[`run_palletizer_with_diffusers.md`](run_palletizer_with_diffusers.md) to write a +local media artifact and verify the response can be decoded. Use +[workflow_e2e.md](workflow_e2e.md) to compare the generated artifact against the +operator-review criteria from the reference palletizer scenarios. + +Then run the full-stack reference smoke on a GPU host: + +```bash +git clone https://github.com/doosan-robotics/explainable-palletizer.git +cd explainable-palletizer +cp docker/.env.example docker/.env +test -f uv.lock || uv lock +make docker-test +``` + +`make docker-test` starts real Isaac Sim plus a tiny inference model, so it does +not need a Hugging Face token. The current upstream Dockerfile expects +`uv.lock`; generate it once with `uv lock` if the source checkout does not +already include it. Change `SIM_PORT`, `INFERENCE_PORT`, `APP_PORT`, and +`FRONTEND_PORT` in `docker/.env` if another local service already uses the +defaults. See +[`run_palletizer_with_diffusers.md`](run_palletizer_with_diffusers.md#full-stack-troubleshooting) +for NVIDIA Docker runtime setup, single-GPU sequential startup, and the cuRobo +`warp-lang` compatibility note. + +## Architecture + +

+ Explainable palletizer workflow +

+ +The Doosan reference stack launches four services: + +| Service | Default port | Role | +| --- | --- | --- | +| `sim-server` | 8100 | Runs Isaac Sim headlessly, creates conveyor-box images, and executes cuRobo-planned pick/place trajectories | +| `inference-server` | 8200 | Serves the model endpoint used by the application server | +| `app-server` | 8000 | Builds prompts, parses structured actions, maintains pallet state, and streams events | +| `frontend` | 3000 | Shows camera frames, reasoning, parsed actions, and execution status | + +For the Cosmos3 Diffusers path, the client only needs an HTTP endpoint exposing: + +- `GET /health` +- `GET /v1/models` +- `POST /v1/infer` + +The request is model-size agnostic. To move from Nano to Super, start the serving +backend with `nvidia/Cosmos3-Super` and confirm `/v1/models` reports the Super +checkpoint before reusing the same prompt and payload shape. + +## Results / Expected Output + +A successful Diffusers smoke test writes one generated image or video artifact +and prints metadata similar to: + +```text +model: nvidia/Cosmos3-Nano +backend: diffusers +decoded media bytes: non-zero +seed: fixed integer +``` + +A successful full-stack Doosan smoke test prints healthy endpoints for: + +- `sim-server` +- `inference-server` +- `app-server` +- `frontend` + +and exposes the UI at the configured frontend port. + +The companion walkthrough includes expected actions and screenshots for: + +- damaged cartons routed to `CALL_A_HUMAN`, +- heavy boxes placed on low pallet layers with firm grip, +- mixed-SKU stacks that keep rigid goods below fragile items. + +## Dataset + +| Name | Source | License | Size | +| --- | --- | --- | --- | +| Synthetic palletizing scenes and box assets | [doosan-robotics/explainable-palletizer](https://github.com/doosan-robotics/explainable-palletizer) | See upstream repository | Small source assets plus Docker/model caches | +| Cosmos3 models | [NVIDIA Cosmos3 collection](https://huggingface.co/collections/nvidia/cosmos3) | NVIDIA Open Model License | Varies by model | + +## Safety and Limitations + +- This is a simulated proof of concept, not a production robot safety system. +- Cosmos3 generation can help validate scene prompts and expected outputs, but + real palletizing deployments still need independent safety controls, guarded + robot execution, and site-specific validation. +- The upstream Doosan project currently keeps its full closed-loop stack in the + public reference repository; this cookbook does not vendor that source code. +- If the full-stack smoke test fails on a driver, CUDA, or container-runtime + mismatch, fix the host runtime before treating the robot-loop path as passed. +- If the Diffusers endpoint was just restarted, the first Nano/Super request can + spend several minutes loading weights before returning a generated artifact. + +## Resources + +- [doosan-robotics/explainable-palletizer](https://github.com/doosan-robotics/explainable-palletizer) +- [Cosmos3-Nano](https://huggingface.co/nvidia/Cosmos3-Nano) +- [Cosmos3-Super](https://huggingface.co/nvidia/Cosmos3-Super) +- [Cosmos3 Diffusers setup](../../README.md#diffusers) +- [Isaac Sim](https://developer.nvidia.com/isaac-sim) +- [cuRobo](https://curobo.org/) diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/assets/main_workflow.svg b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/main_workflow.svg new file mode 100644 index 00000000..14afea03 --- /dev/null +++ b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/main_workflow.svg @@ -0,0 +1,121 @@ + + + Explainable palletizer: Cosmos3 validation loop + The app server or smoke client connects to Sim Server for robot-loop evidence, to a Cosmos3 Diffusers endpoint for generated palletizing artifacts, and to Frontend for operator review. Frontend also connects directly to Sim Server for the camera stream. + + + + + + + + + + + + + + + + + + + + + zenith-net + + + RUNTIME ARCHITECTURE + Explainable Palletizer Validation Loop + + + + + + + + + ↔ SWAPPABLE + + :8100 + SIMULATION / REAL + Sim Server + or Real Robot + + Isaac Sim + cuRobo MotionGen + Doosan P3020 · conveyor + Box image capture + + + + + + :8000 + ORCHESTRATION + App Server + + Control loop + Prompt builder + Action parser + Pallet state + constraints + WebSocket event stream + + + + + + :8200 + GENERATION + Cosmos3 Endpoint + + Diffusers backend + Nano default · Super quality + Prompt + seed + model ID + /health · /v1/models · /v1/infer + Generated media artifact + + + + + + box images + state + + + + pick/place cmd + + + + + + scene prompt + + + + artifact + metadata + + + + + :3000 + FRONTEND + Frontend + + Camera feed · Reasoning trace + Parsed action · Execution status + + + + WebSocket events + + + + camera stream + + diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario1.webp b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario1.webp new file mode 100644 index 00000000..1d409ea1 Binary files /dev/null and b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario1.webp differ diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario2.webp b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario2.webp new file mode 100644 index 00000000..55bec61c Binary files /dev/null and b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario2.webp differ diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario3.webp b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario3.webp new file mode 100644 index 00000000..7587d4fd Binary files /dev/null and b/cookbooks/cosmos3/end2end/explainable-palletizer/assets/scenario3.webp differ diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/run_palletizer_with_diffusers.md b/cookbooks/cosmos3/end2end/explainable-palletizer/run_palletizer_with_diffusers.md new file mode 100644 index 00000000..24d924b7 --- /dev/null +++ b/cookbooks/cosmos3/end2end/explainable-palletizer/run_palletizer_with_diffusers.md @@ -0,0 +1,185 @@ +# Run Explainable Palletizer with Cosmos3 Diffusers + +This markdown entry point validates the Cosmos3-Nano or Cosmos3-Super generation +path used by the explainable palletizer cookbook. It expects a small +Diffusers-compatible proxy with: + +- `GET /health` +- `GET /v1/models` +- `POST /v1/infer` + +The same request shape works for Nano and Super. The model is selected by the +server; verify it with `/v1/models` before running generation. + +## 1. Choose an endpoint + +Use the deployed Nano endpoint: + +```bash +export COSMOS3_DIFFUSERS_BASE_URL="${COSMOS3_DIFFUSERS_BASE_URL:-http://127.0.0.1:8000}" +``` + +For a local Super deployment, point the variable at the Super service instead: + +```bash +export COSMOS3_DIFFUSERS_BASE_URL="http://127.0.0.1:8000" +``` + +## 2. Smoke test health and model identity + +```bash +curl -fsS "${COSMOS3_DIFFUSERS_BASE_URL}/health" +curl -fsS "${COSMOS3_DIFFUSERS_BASE_URL}/v1/models" +``` + +Expected Nano model response: + +```json +{"object":"list","data":[{"id":"nvidia/Cosmos3-Nano","object":"model"}]} +``` + +For Super, the returned model ID should identify `nvidia/Cosmos3-Super`. + +## 3. Generate a palletizing-scene artifact + +```bash smoke-test +python3 - <<'PY' +import base64 +import json +import os +import time +import urllib.error +import urllib.request +from pathlib import Path + +base_url = os.environ.get("COSMOS3_DIFFUSERS_BASE_URL", "http://127.0.0.1:8000").rstrip("/") +out_dir = Path(os.environ.get("COSMOS3_OUTPUT_DIR", "/tmp/cosmos3-palletizer-smoke")) +out_dir.mkdir(parents=True, exist_ok=True) + +payload = { + "prompt": ( + "A clean warehouse palletizing cell with a robot arm sorting mixed " + "cardboard boxes, visible handling labels, a conveyor, and an " + "operator review panel, technical demo style." + ), + "negative_prompt": "blurry, unsafe robot motion, broken gripper, low quality", + "resolution": os.environ.get("COSMOS3_RESOLUTION", "256"), + "num_output_frames": int(os.environ.get("COSMOS3_NUM_FRAMES", "1")), + "fps": float(os.environ.get("COSMOS3_FPS", "1")), + "steps": int(os.environ.get("COSMOS3_STEPS", "1")), + "guidance_scale": float(os.environ.get("COSMOS3_GUIDANCE", "1.1")), + "seed": int(os.environ.get("COSMOS3_SEED", "20260616")), +} + +start = time.time() +req = urllib.request.Request( + f"{base_url}/v1/infer", + data=json.dumps(payload).encode("utf-8"), + headers={"Content-Type": "application/json"}, + method="POST", +) + +try: + with urllib.request.urlopen(req, timeout=1800) as response: + data = json.loads(response.read()) +except urllib.error.HTTPError as exc: + body = exc.read().decode("utf-8", errors="replace") + raise SystemExit(f"generation failed with HTTP {exc.code}: {body[:2000]}") + +if data.get("error"): + raise SystemExit(f"generation failed: {data['error']}") + +media = base64.b64decode(data["b64_video"], validate=True) +suffix = ".jpg" if media.startswith(b"\xff\xd8\xff") else ".mp4" +artifact = out_dir / f"palletizer-cosmos3-{payload['seed']}{suffix}" +artifact.write_bytes(media) + +print(json.dumps({ + "backend": data.get("backend"), + "seed": payload["seed"], + "bytes": len(media), + "artifact": str(artifact), + "elapsed_sec": round(time.time() - start, 2), +}, indent=2)) + +if len(media) < 1024: + raise SystemExit("decoded media artifact is unexpectedly small") +PY +``` + +If the Diffusers service was just started, the first generation request may +spend several minutes loading model weights before it returns. Keep the request +timeout high enough for that warm load; subsequent smoke requests should be much +faster on the same process. + +## 4. Run the Doosan full-stack smoke + +This validates the reference robot loop without gated model access. Use alternate +ports if another service already occupies the defaults: + +```bash +git clone https://github.com/doosan-robotics/explainable-palletizer.git +cd explainable-palletizer +cp docker/.env.example docker/.env +test -f uv.lock || uv lock + +python3 - <<'PY' +from pathlib import Path +p = Path("docker/.env") +text = p.read_text() +for old, new in { + "SIM_PORT=8100": "SIM_PORT=8310", + "INFERENCE_PORT=8200": "INFERENCE_PORT=8320", + "APP_PORT=8000": "APP_PORT=8330", + "FRONTEND_PORT=3000": "FRONTEND_PORT=3340", +}.items(): + text = text.replace(old, new) +p.write_text(text) +PY + +make docker-test +``` + +The current public Doosan Dockerfile expects `uv.lock`; the `uv lock` guard +generates it for fresh clones where upstream has not checked in the lockfile. +When the launcher reports all services healthy, open the configured frontend +port and verify that the simulated camera, reasoning panel, action panel, and +execution status are visible. + +### Full-stack troubleshooting + +If Docker reports `unknown or invalid runtime name: nvidia`, configure the NVIDIA +runtime before rerunning the stack: + +```bash +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +``` + +On a single-GPU host, Isaac Sim warmup can race with vLLM memory profiling when +all services start concurrently. Start the services sequentially if `make +docker-test` exits during inference-server profiling: + +```bash +cd docker +set -a +. ./.env +set +a + +docker compose -f docker-compose.yml -f docker-compose.test.yml up -d sim-server +until curl -fsS "http://127.0.0.1:${SIM_PORT}/sim/health"; do sleep 5; done +sleep 20 + +docker compose -f docker-compose.yml -f docker-compose.test.yml up -d inference-server +until curl -fsS "http://127.0.0.1:${INFERENCE_PORT}/health"; do sleep 5; done + +docker compose -f docker-compose.yml -f docker-compose.test.yml up -d app-server frontend +until curl -fsS "http://127.0.0.1:${APP_PORT}/api/health"; do sleep 5; done +curl -fsS "http://127.0.0.1:${FRONTEND_PORT}/" >/dev/null +``` + +If `sim-server` exits with `AttributeError: module 'warp.types' has no attribute +'array'`, the image resolved a newer `warp-lang` than the current cuRobo/Isaac +Sim path expects. Pin `warp-lang==1.12.0` in the sim image, rebuild, and rerun +the smoke; do not count the robot-loop path as passed until all four services +are healthy. diff --git a/cookbooks/cosmos3/end2end/explainable-palletizer/workflow_e2e.md b/cookbooks/cosmos3/end2end/explainable-palletizer/workflow_e2e.md new file mode 100644 index 00000000..a50174cf --- /dev/null +++ b/cookbooks/cosmos3/end2end/explainable-palletizer/workflow_e2e.md @@ -0,0 +1,254 @@ +# Decision Walkthrough and Validation Criteria + +This companion guide adapts the explainable-palletizer walkthrough from the +source project for the Cosmos3 cookbook layout. Use it after the quick Diffusers +smoke test in [run_palletizer_with_diffusers.md](run_palletizer_with_diffusers.md) +to understand what a useful palletizing-scene artifact or full-stack robot-loop +run should make visible. + +The original Doosan Robotics proof of concept was built for a Cosmos Cookoff +reasoning demo. This cookbook uses Cosmos3-Nano as the default generation +profile and Cosmos3-Super as the higher-quality generation profile. The +Cosmos3 Diffusers path validates palletizing scene prompts, model identity, +fixed seeds, and generated media artifacts. The Doosan reference stack remains +the source for the Isaac Sim, cuRobo, app-server, and operator UI control-loop +behavior. + +## Model Profiles + +| Profile | Model | Purpose | +| --- | --- | --- | +| Nano | `nvidia/Cosmos3-Nano` | Default smoke-test profile for low-cost palletizing-scene artifact generation | +| Super | `nvidia/Cosmos3-Super` | Higher-quality profile for final prompt review and richer generated artifacts | + +Both profiles use the same HTTP client shape in this cookbook: + +- `GET /health` +- `GET /v1/models` +- `POST /v1/infer` + +The model is selected by the running server. Always check `/v1/models` before +the generation request and record the returned model ID with the output +artifact. + +## Reference Control Loop + +The Doosan stack is a four-service reference system: + +| Service | Default port | Role | +| --- | --- | --- | +| `sim-server` | 8100 | Runs Isaac Sim headlessly, creates conveyor-box images, and executes cuRobo-planned pick/place trajectories | +| `inference-server` | 8200 | Serves the reference model endpoint used by the app server or the no-token tiny model in `make docker-test` | +| `app-server` | 8000 | Builds prompts, parses structured actions, maintains pallet state, and streams events | +| `frontend` | 3000 | Shows camera frames, rationales, parsed actions, and execution status | + +The control loop is intentionally auditable: + +1. `sim-server` keeps a conveyor buffer populated with visible boxes. +2. `app-server` requests box images, dimensions, pallet state, and valid + placement cells. +3. `app-server` sends the prompt to the inference endpoint. +4. The response is parsed into a bounded action contract. +5. `app-server` validates the action against pallet constraints. +6. `sim-server` plans and executes the simulated robot motion. + +The Cosmos3 Diffusers smoke test does not replace the robot action parser. It +creates visual evidence for palletizing prompts, scene constraints, and +operator-review criteria before or alongside the full-stack robot smoke path. + +## Action Contract + +The reference app accepts three action types: + +| Action | Required fields | Meaning | +| --- | --- | --- | +| `PICK_AND_PLACE` | `box`, `target_pallet`, `position`, `speed_pct`, `grip_strength`, `reason` | Pick one visible box and place it at a valid pallet position | +| `CALL_A_HUMAN` | `boxes`, `reason` | Remove damaged, contaminated, unsealed, or otherwise unsafe boxes for inspection | +| `WAIT` | `reason` | Wait only when too few boxes are visible and no safe placement or human call is appropriate | + +`PICK_AND_PLACE` is constrained by the prompt and by parser-side validation: + +| Field | Type | Allowed values | +| --- | --- | --- | +| `box` | string | One of the visible box IDs, such as `box_0001` | +| `target_pallet` | integer | `1` or `2` | +| `position` | `[x, y, z]` | One of the precomputed valid positions for the selected box and pallet | +| `speed_pct` | integer | `40`, `80`, or `100` | +| `grip_strength` | string | `standard`, `gentle`, or `firm` | +| `reason` | string | Brief operator-visible rationale | + +The important safety property is that placement positions are not invented by +the model. The app computes valid positions from pallet occupancy and stability +rules, then the response must select one of those legal positions. + +## Scenario 1: Damaged Carton + +**Expected action:** `CALL_A_HUMAN` + +Three boxes arrive together. Two are visibly unsafe: one has an open top flap +with detached tape, and another is crushed and deformed. The intact box should +not cause the app to ignore the unsafe buffer state. + +| Field | Value | +| --- | --- | +| Visible boxes | `box_0000`, `box_0001`, `box_0002` | +| Visible condition | `box_0000`: open flap, detached tape; `box_0001`: intact; `box_0002`: crushed, deformed | +| Pallet state | Partial fill | +| Valid placement cells | Available, but unsafe boxes should block the pick | + +Operator-visible rationale: + +```text +box_0000 has open flaps and detached tape. box_0002 is crushed and deformed. +Escalate both boxes for inspection before continuing placement from a clean +buffer. +``` + +Parsed action: + +```json +{ + "action": "CALL_A_HUMAN", + "boxes": ["box_0000", "box_0002"], + "reason": "box_0000 has open flaps and detached tape, box_0002 is crushed and deformed" +} +``` + +Simulated outcome: no pick attempt. `app-server` emits a `CALL_A_HUMAN` event, +the UI flags the damaged boxes, and the conveyor advances after operator +inspection. + +

+ Damaged carton scenario triggers CALL_A_HUMAN in the UI +

+ +## Scenario 2: Heavy Appliance Box + +**Expected action:** `PICK_AND_PLACE` at a low `z` position with a firm grip. + +Three intact heavy or sturdy boxes arrive together: a metal tool set, a +36-can case of canned beans, and a 25 kg set of rubber-coated weight plates. +Both pallets are empty, so the first heavy box can seed the base layer. + +| Field | Value | +| --- | --- | +| Visible boxes | `box_0001` (tool set), `box_0003` (canned beans), `box_0004` (weight plates) | +| Dimensions | `box_0001`: 2 x 2 x 2; `box_0003`: 2 x 2 x 1; `box_0004`: 2 x 2 x 2 | +| Pallet state | Pallet 1: 0% filled, pallet 2: 0% filled | +| Valid placement cells | Base-layer `[0, 0, 0]` on either pallet | + +Operator-visible rationale: + +```text +All boxes pass damage inspection. Heavy, sturdy boxes belong on the base layer +for stack stability. With both pallets empty, choose pallet 1 and place the +first visible heavy box at [0, 0, 0] using firm grip. +``` + +Parsed action: + +```json +{ + "action": "PICK_AND_PLACE", + "box": "box_0001", + "target_pallet": 1, + "position": [0, 0, 0], + "speed_pct": 80, + "grip_strength": "firm", + "reason": "Pure Harvest steel tool set is heavy and sturdy; placed at low z on Pallet 1 to form a stable base." +} +``` + +Simulated outcome: cuRobo plans the pick, and the Doosan P3020 places +`box_0001` at `[0, 0, 0]` on Pallet 1. + +

+ Heavy appliance scenario places a sturdy box on the base layer +

+ +## Scenario 3: Mixed-SKU Stacking + +**Expected action:** place heavy or rigid items below fragile items. + +Three intact boxes arrive: a 10-pack of SPAM cans, a 4-pack of glass kimchi +fermentation jars, and a multi-pack of honey butter chips. Pallet 1 is already +partially built. The next action should preserve stack quality by placing +heavier rigid goods at the lowest available stable slot and reserving higher +layers for fragile goods. + +| Field | Value | +| --- | --- | +| Visible boxes | `box_0008` (SPAM cans), `box_0010` (glass jars), `box_0011` (chip multipack) | +| Dimensions | `box_0008`: 2 x 2 x 1; `box_0010`: 2 x 1 x 1; `box_0011`: 2 x 1 x 1 | +| Pallet state | Pallet 1: 44% filled, pallet 2: 19% filled | +| Valid placement cells | Mid-layer slots on pallet 1, plus top-layer slots for delicate items | + +Operator-visible rationale: + +```text +All boxes pass damage inspection. Canned goods are heavy and rigid, so place +them at the lowest currently valid slot. Defer glass jars and chips to higher +slots with gentler handling. +``` + +Parsed action: + +```json +{ + "action": "PICK_AND_PLACE", + "box": "box_0008", + "target_pallet": 1, + "position": [0, 0, 2], + "speed_pct": 40, + "grip_strength": "firm", + "reason": "SPAM cans heavy and rigid; Pallet 1 closer to completion; delicate boxes deferred to high z." +} +``` + +Follow-up delicate-item action: + +```json +{ + "action": "PICK_AND_PLACE", + "box": "box_0029", + "target_pallet": 1, + "position": [0, 0, 3], + "speed_pct": 40, + "grip_strength": "gentle", + "reason": "box_0029 needs gentle handling; placing it at the top ensures it is not crushed by heavier boxes below." +} +``` + +Simulated outcome: the Doosan P3020 places `box_0008` in a mid-layer stable +slot, and a later iteration places fragile bottles at the top layer. + +

+ Mixed-SKU scenario places rigid items below delicate items +

+ +## What to Check in Cosmos3 Output + +For Nano and Super Diffusers runs, inspect the generated artifact against the +same operator-review criteria: + +- Mixed boxes, pallet grid, and robot cell are visible. +- Damage or fragile/handling cues are legible enough to support review. +- The scene leaves safe clearance around the robot arm. +- The prompt, seed, model ID, backend, and decoded artifact size are recorded. +- The artifact is treated as validation evidence, not as a production safety + decision. + +Super should be used when the Nano artifact is too small, noisy, or ambiguous +for the review objective. The full-stack robot-loop outcome still depends on +the reference app's parser, pallet constraints, and simulated execution checks. + +## Limitations + +- This is a simulated proof of concept, not a production robot safety system. +- Cosmos3-generated artifacts help validate scene prompts and review criteria; + they do not certify a real robot workcell. +- Real deployments need independent safety controls, guarded robot execution, + site-specific validation, and human-approved exception handling. +- The public Doosan stack can change independently of this cookbook, so treat + `make docker-test` as a live integration smoke rather than a guaranteed unit + test. diff --git a/cookbooks/cosmos3/generator/action/README.md b/cookbooks/cosmos3/generator/action/README.md index 6158764f..e289f7e5 100644 --- a/cookbooks/cosmos3/generator/action/README.md +++ b/cookbooks/cosmos3/generator/action/README.md @@ -1,8 +1,24 @@ -# Cosmos3 Generator Action Examples +# Cosmos3 Generator Action Cookbooks -Cosmos3-Nano action-generation examples across two inference backends — native -PyTorch (Cosmos Framework) and vLLM-Omni. Both backends use the sample assets -under [`assets/`](./assets) and cover two tasks: +Cosmos3-Nano action-generation cookbooks across two inference backends — native +PyTorch (Cosmos Framework) and vLLM-Omni. + +## Basic Examples + +The [`basic_examples/`](./basic_examples/) directory contains the shipped starter +cookbooks and sample assets. Community-contributed cookbooks are added as sibling +directories alongside `basic_examples/` — see the +[Contributing Guide](../../../../CONTRIBUTING.md) for the recipe structure. + +| Cookbook | Backend | Notebook | +|---------|---------|----------| +| Forward dynamics (AV, DROID, UMI) | Cosmos Framework | [`basic_examples/run_fd_with_cosmos_framework.ipynb`](./basic_examples/run_fd_with_cosmos_framework.ipynb) | +| Inverse dynamics (AV) | Cosmos Framework | [`basic_examples/run_id_with_cosmos_framework.ipynb`](./basic_examples/run_id_with_cosmos_framework.ipynb) | +| Policy (DROID) | Cosmos Framework | [`basic_examples/run_policy_with_cosmos_framework.md`](./basic_examples/run_policy_with_cosmos_framework.md) | +| Forward dynamics (AV, DROID, UMI) | vLLM-Omni | [`basic_examples/run_fd_with_vllm.ipynb`](./basic_examples/run_fd_with_vllm.ipynb) | +| Inverse dynamics (AV) | vLLM-Omni | [`basic_examples/run_id_with_vllm.ipynb`](./basic_examples/run_id_with_vllm.ipynb) | + +Both backends use the sample assets under [`basic_examples/assets/`](./basic_examples/assets/) and cover two tasks: - **Forward dynamics (`fd`)** — predict future observations from a start image plus an action trajectory (AV, DROID, and UMI robotics examples) using the Cosmos3-Nano. @@ -68,7 +84,7 @@ torchrun --nproc-per-node=1 \ The input spec pairs a start image with an action trajectory. The notebooks assemble ready-to-run specs for AV, DROID, and UMI examples from the checked-in -assets under [`assets/`](./assets). Outputs are written under the framework +assets under [`basic_examples/assets/`](./basic_examples/assets/). Outputs are written under the framework checkout. ### Cosmos Framework Walkthrough @@ -76,11 +92,11 @@ checkout. The Cosmos Framework build their input spec, run inference, and visualize the generated videos: -- [`run_fd_with_cosmos_framework.ipynb`](./run_fd_with_cosmos_framework.ipynb) — +- [`run_fd_with_cosmos_framework.ipynb`](./basic_examples/run_fd_with_cosmos_framework.ipynb) — forward dynamics for AV, DROID, and UMI robotics examples using Cosmos3-Nano. -- [`run_id_with_cosmos_framework.ipynb`](./run_id_with_cosmos_framework.ipynb) — +- [`run_id_with_cosmos_framework.ipynb`](./basic_examples/run_id_with_cosmos_framework.ipynb) — inverse dynamics, predicting ego-motion trajectories from input AV videos using Cosmos3-Nano. -- [`run_policy_with_cosmos_framework.md`](./run_policy_with_cosmos_framework.md) - policy, predicting future observations and action trajectories for DROID robot using Cosmos3-Nano-Policy-DROID. +- [`run_policy_with_cosmos_framework.md`](./basic_examples/run_policy_with_cosmos_framework.md) - policy, predicting future observations and action trajectories for DROID robot using Cosmos3-Nano-Policy-DROID. ## Run with vLLM-Omni @@ -100,8 +116,8 @@ curl http://localhost:8001/v1/models Forward-dynamics requests are multipart `POST`s to `/v1/videos` — a start image under `files={"input_reference": ...}` plus an `extra_params` payload carrying the action trajectory. The vLLM notebooks use these diffusion defaults for action -generation (see [`run_fd_with_vllm.ipynb`](./run_fd_with_vllm.ipynb) and -[`run_id_with_vllm.ipynb`](./run_id_with_vllm.ipynb)): +generation (see [`run_fd_with_vllm.ipynb`](./basic_examples/run_fd_with_vllm.ipynb) and +[`run_id_with_vllm.ipynb`](./basic_examples/run_id_with_vllm.ipynb)): | Field | Value | | --- | --- | @@ -117,9 +133,9 @@ including autoregressive chunked generation for the robotics examples. The vLLM-Omni notebooks send requests through the OpenAI-compatible video API and write outputs under `outputs/cosmos3_action_vllm/`: -- [`run_fd_with_vllm.ipynb`](./run_fd_with_vllm.ipynb) — forward dynamics for AV, +- [`run_fd_with_vllm.ipynb`](./basic_examples/run_fd_with_vllm.ipynb) — forward dynamics for AV, DROID, and UMI robotics examples. -- [`run_id_with_vllm.ipynb`](./run_id_with_vllm.ipynb) — inverse dynamics, +- [`run_id_with_vllm.ipynb`](./basic_examples/run_id_with_vllm.ipynb) — inverse dynamics, predicting ego-motion trajectories from input AV videos. diff --git a/cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json b/cookbooks/cosmos3/generator/action/basic_examples/assets/actions/av_traj_forward.json similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/actions/av_traj_forward.json rename to cookbooks/cosmos3/generator/action/basic_examples/assets/actions/av_traj_forward.json diff --git a/cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json b/cookbooks/cosmos3/generator/action/basic_examples/assets/actions/av_traj_left.json similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/actions/av_traj_left.json rename to cookbooks/cosmos3/generator/action/basic_examples/assets/actions/av_traj_left.json diff --git a/cookbooks/cosmos3/generator/action/assets/actions/av_traj_right.json b/cookbooks/cosmos3/generator/action/basic_examples/assets/actions/av_traj_right.json similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/actions/av_traj_right.json rename to cookbooks/cosmos3/generator/action/basic_examples/assets/actions/av_traj_right.json diff --git a/cookbooks/cosmos3/generator/action/assets/actions/umi.json b/cookbooks/cosmos3/generator/action/basic_examples/assets/actions/umi.json similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/actions/umi.json rename to cookbooks/cosmos3/generator/action/basic_examples/assets/actions/umi.json diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/data/chunk-000/file-000.parquet b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/data/chunk-000/file-000.parquet similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/data/chunk-000/file-000.parquet rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/data/chunk-000/file-000.parquet diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/episodes/chunk-000/file-000.parquet b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/meta/episodes/chunk-000/file-000.parquet similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/episodes/chunk-000/file-000.parquet rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/meta/episodes/chunk-000/file-000.parquet diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/info.json b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/meta/info.json similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/info.json rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/meta/info.json diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/tasks.parquet b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/meta/tasks.parquet similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/meta/tasks.parquet rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/meta/tasks.parquet diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/videos/observation.image.exterior_image_1_left/chunk-000/file-000.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/videos/observation.image.exterior_image_1_left/chunk-000/file-000.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/videos/observation.image.exterior_image_1_left/chunk-000/file-000.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/videos/observation.image.exterior_image_1_left/chunk-000/file-000.mp4 diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/videos/observation.image.exterior_image_2_left/chunk-000/file-000.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/videos/observation.image.exterior_image_2_left/chunk-000/file-000.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/videos/observation.image.exterior_image_2_left/chunk-000/file-000.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/videos/observation.image.exterior_image_2_left/chunk-000/file-000.mp4 diff --git a/cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/videos/observation.image.wrist_image_left/chunk-000/file-000.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/videos/observation.image.wrist_image_left/chunk-000/file-000.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/droid_lerobot_example/videos/observation.image.wrist_image_left/chunk-000/file-000.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/droid_lerobot_example/videos/observation.image.wrist_image_left/chunk-000/file-000.mp4 diff --git a/cookbooks/cosmos3/generator/action/assets/images/av_0.jpg b/cookbooks/cosmos3/generator/action/basic_examples/assets/images/av_0.jpg similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/images/av_0.jpg rename to cookbooks/cosmos3/generator/action/basic_examples/assets/images/av_0.jpg diff --git a/cookbooks/cosmos3/generator/action/assets/images/av_1.jpg b/cookbooks/cosmos3/generator/action/basic_examples/assets/images/av_1.jpg similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/images/av_1.jpg rename to cookbooks/cosmos3/generator/action/basic_examples/assets/images/av_1.jpg diff --git a/cookbooks/cosmos3/generator/action/assets/images/umi.png b/cookbooks/cosmos3/generator/action/basic_examples/assets/images/umi.png similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/images/umi.png rename to cookbooks/cosmos3/generator/action/basic_examples/assets/images/umi.png diff --git a/cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/videos/av_0.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/videos/av_0.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/videos/av_0.mp4 diff --git a/cookbooks/cosmos3/generator/action/assets/videos/av_1.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/videos/av_1.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/videos/av_1.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/videos/av_1.mp4 diff --git a/cookbooks/cosmos3/generator/action/assets/videos/robolab_example_rollout.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/videos/robolab_example_rollout.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/videos/robolab_example_rollout.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/videos/robolab_example_rollout.mp4 diff --git a/cookbooks/cosmos3/generator/action/assets/videos/umi.mp4 b/cookbooks/cosmos3/generator/action/basic_examples/assets/videos/umi.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/action/assets/videos/umi.mp4 rename to cookbooks/cosmos3/generator/action/basic_examples/assets/videos/umi.mp4 diff --git a/cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb b/cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_cosmos_framework.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/action/run_fd_with_cosmos_framework.ipynb rename to cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_cosmos_framework.ipynb diff --git a/cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb b/cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_vllm.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/action/run_fd_with_vllm.ipynb rename to cookbooks/cosmos3/generator/action/basic_examples/run_fd_with_vllm.ipynb diff --git a/cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb b/cookbooks/cosmos3/generator/action/basic_examples/run_id_with_cosmos_framework.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/action/run_id_with_cosmos_framework.ipynb rename to cookbooks/cosmos3/generator/action/basic_examples/run_id_with_cosmos_framework.ipynb diff --git a/cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb b/cookbooks/cosmos3/generator/action/basic_examples/run_id_with_vllm.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/action/run_id_with_vllm.ipynb rename to cookbooks/cosmos3/generator/action/basic_examples/run_id_with_vllm.ipynb diff --git a/cookbooks/cosmos3/generator/action/run_policy_with_cosmos_framework.md b/cookbooks/cosmos3/generator/action/basic_examples/run_policy_with_cosmos_framework.md similarity index 100% rename from cookbooks/cosmos3/generator/action/run_policy_with_cosmos_framework.md rename to cookbooks/cosmos3/generator/action/basic_examples/run_policy_with_cosmos_framework.md diff --git a/cookbooks/cosmos3/generator/audiovisual/README.md b/cookbooks/cosmos3/generator/audiovisual/README.md index d80adad4..fbe388f4 100644 --- a/cookbooks/cosmos3/generator/audiovisual/README.md +++ b/cookbooks/cosmos3/generator/audiovisual/README.md @@ -1,14 +1,26 @@ -# Cosmos3 Generator Audiovisual Examples +# Cosmos3 Generator Audiovisual Cookbooks Generate images and video (with optional audio) from text or image prompts with -`Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. Sample -prompts live under [`assets/`](./assets). +`Cosmos3-Nano` and `Cosmos3-Super`, across three inference backends. Environment setup for every backend is centralized in the shared [Cosmos3 cookbooks environment setup](../../README.md) guide; each backend below links to the section you need. The quickstarts are minimal text-to-video examples to get one generation running per backend — run them from this folder. +## Basic Examples + +The [`basic_examples/`](./basic_examples/) directory contains the shipped starter +cookbooks and sample prompts. Community-contributed cookbooks are added as sibling +directories alongside `basic_examples/` — see the +[Contributing Guide](../../../../CONTRIBUTING.md) for the recipe structure. + +| Cookbook | Backend | Notebook | +|---------|---------|----------| +| T2I / T2V / I2V + audio | Cosmos Framework | [`basic_examples/run_with_cosmos_framework.ipynb`](./basic_examples/run_with_cosmos_framework.ipynb) | +| T2I / T2V / I2V + audio | Diffusers | [`basic_examples/run_with_diffusers.ipynb`](./basic_examples/run_with_diffusers.ipynb) | +| T2I / T2V / I2V + audio | vLLM-Omni | [`basic_examples/run_with_vllm_omni.ipynb`](./basic_examples/run_with_vllm_omni.ipynb) | + Generator requires the Guardrail. Request access to the gated [nvidia/Cosmos-1.0-Guardrail](https://huggingface.co/nvidia/Cosmos-1.0-Guardrail) HF repository before running these examples. To disable the guardrail, set @@ -31,12 +43,12 @@ import json from pathlib import Path prompt = json.dumps( - json.load(open("assets/prompts/text2video/robot_kitchen.json")), + json.load(open("basic_examples/assets/prompts/text2video/robot_kitchen.json")), ensure_ascii=True, separators=(",", ":"), ) negative = json.dumps( - json.load(open("assets/negative_prompts/text2video/neg_prompt.json")), + json.load(open("basic_examples/assets/negative_prompts/text2video/neg_prompt.json")), ensure_ascii=True, separators=(",", ":"), ) @@ -72,7 +84,7 @@ more GPUs via `--nproc-per-node`. ### Notebook walkthrough -[`run_with_cosmos_framework.ipynb`](./run_with_cosmos_framework.ipynb) is the full +[`run_with_cosmos_framework.ipynb`](./basic_examples/run_with_cosmos_framework.ipynb) is the full tutorial for the native PyTorch backend: it covers every use case — text-to-image, text-to-video, image-to-video, with audio on or off — and includes the detailed, environment-aware setup and visualization for each generation. @@ -91,8 +103,8 @@ from diffusers import Cosmos3OmniPipeline from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler from diffusers.utils import export_to_video -prompt = json.load(open("assets/prompts/text2video/robot_kitchen.json")) -negative = json.load(open("assets/negative_prompts/text2video/neg_prompt.json")) +prompt = json.load(open("basic_examples/assets/prompts/text2video/robot_kitchen.json")) +negative = json.load(open("basic_examples/assets/negative_prompts/text2video/neg_prompt.json")) pipe = Cosmos3OmniPipeline.from_pretrained( "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda" @@ -122,7 +134,7 @@ To run **Cosmos3-Super** instead, load the larger checkpoint: ### Notebook walkthrough -[`run_with_diffusers.ipynb`](./run_with_diffusers.ipynb) is the full tutorial for +[`run_with_diffusers.ipynb`](./basic_examples/run_with_diffusers.ipynb) is the full tutorial for the Diffusers backend: it provisions a dedicated venv, then walks through text-to-image, text-to-video, and image-to-video generation (with and without audio) using `Cosmos3OmniPipeline`, including how to preview the generated media. @@ -145,8 +157,8 @@ from pathlib import Path import requests -prompt = json.load(open("assets/prompts/text2video/robot_kitchen.json")) -negative = json.load(open("assets/negative_prompts/text2video/neg_prompt.json")) +prompt = json.load(open("basic_examples/assets/prompts/text2video/robot_kitchen.json")) +negative = json.load(open("basic_examples/assets/negative_prompts/text2video/neg_prompt.json")) response = requests.post( "http://localhost:8000/v1/videos/sync", @@ -179,7 +191,7 @@ For image-to-video, post to the same endpoint with an image under ### Notebook walkthrough -[`run_with_vllm_omni.ipynb`](./run_with_vllm_omni.ipynb) is the full tutorial for +[`run_with_vllm_omni.ipynb`](./basic_examples/run_with_vllm_omni.ipynb) is the full tutorial for the vLLM-Omni backend: it walks through text-to-image, text-to-video, and image-to-video requests with audio on or off. Server launch options (Nano and Super, tensor parallelism, layerwise offload, and CFG-parallel variants) live in diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/car_driving.jpg b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/images/image2video/car_driving.jpg similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/car_driving.jpg rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/images/image2video/car_driving.jpg diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/coastal_road_audio.jpg b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/images/image2video/coastal_road_audio.jpg similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/coastal_road_audio.jpg rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/images/image2video/coastal_road_audio.jpg diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/humanoid_robot.jpg b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/images/image2video/humanoid_robot.jpg similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/images/image2video/humanoid_robot.jpg rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/images/image2video/humanoid_robot.jpg diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/image2video/neg_prompt.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/negative_prompts/image2video/neg_prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/image2video/neg_prompt.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/negative_prompts/image2video/neg_prompt.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/negative_prompts/text2video/neg_prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/negative_prompts/text2video/neg_prompt.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/negative_prompts/text2video/neg_prompt.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/car_driving.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/image2video/car_driving.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/car_driving.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/image2video/car_driving.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/coastal_road_audio.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/image2video/coastal_road_audio.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/coastal_road_audio.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/image2video/coastal_road_audio.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/humanoid_robot.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/image2video/humanoid_robot.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/image2video/humanoid_robot.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/image2video/humanoid_robot.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2image/robot_draping.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2image/robot_draping.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2image/robot_draping.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2image/robot_draping.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/car_colliding.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2video/car_colliding.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/car_colliding.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2video/car_colliding.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2video/robot_kitchen.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_kitchen.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2video/robot_kitchen.json diff --git a/cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_pouring_water_audio.json b/cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2video/robot_pouring_water_audio.json similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/assets/prompts/text2video/robot_pouring_water_audio.json rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/assets/prompts/text2video/robot_pouring_water_audio.json diff --git a/cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb b/cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_cosmos_framework.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/run_with_cosmos_framework.ipynb rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_cosmos_framework.ipynb diff --git a/cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb b/cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_diffusers.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/run_with_diffusers.ipynb rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_diffusers.ipynb diff --git a/cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb b/cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_vllm_omni.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/audiovisual/run_with_vllm_omni.ipynb rename to cookbooks/cosmos3/generator/audiovisual/basic_examples/run_with_vllm_omni.ipynb diff --git a/cookbooks/cosmos3/generator/transfer/README.md b/cookbooks/cosmos3/generator/transfer/README.md index 0c477056..8c849185 100644 --- a/cookbooks/cosmos3/generator/transfer/README.md +++ b/cookbooks/cosmos3/generator/transfer/README.md @@ -1,7 +1,19 @@ -# Cosmos3 Generator Transfer Examples +# Cosmos3 Generator Transfer Cookbooks -Cosmos3-Nano video **transfer** examples on the native PyTorch (Cosmos Framework) path. -Sample assets under [`assets/`](./assets) cover spatial control signals paired with +Cosmos3-Nano video **transfer** cookbooks on the native PyTorch (Cosmos Framework) path. + +## Basic Examples + +The [`basic_examples/`](./basic_examples/) directory contains the shipped starter +cookbook and sample assets. Community-contributed cookbooks are added as sibling +directories alongside `basic_examples/` — see the +[Contributing Guide](../../../../CONTRIBUTING.md) for the recipe structure. + +| Cookbook | Backend | Notebook | +|---------|---------|----------| +| Video transfer (edge, blur, depth, seg, wsm) | Cosmos Framework | [`basic_examples/run_video_transfer_with_cosmos_framework.ipynb`](./basic_examples/run_video_transfer_with_cosmos_framework.ipynb) | + +Sample assets under [`basic_examples/assets/`](./basic_examples/assets/) cover spatial control signals paired with `prompt.json` files: - **Edge (Canny)** — edge map control plus caption. @@ -26,11 +38,11 @@ come from the control video; see the spec field reference for how `fps` and | Control | Asset folder | Inference input | Generation duration | | --- | --- | --- | --- | -| Edge (Canny) | `assets/edge/` | `control_edge.mp4` + `prompt.json` | 121 frames @ 30 FPS | -| Blur | `assets/blur/` | `control_blur.mp4` + `prompt.json` | 121 frames @ 30 FPS | -| Depth | `assets/depth/` | `control_depth.mp4` + `prompt.json` | 121 frames @ 30 FPS | -| Segmentation | `assets/seg/` | `control_seg.mp4` + `prompt.json` | 121 frames @ 30 FPS | -| World scenario (WSM) | `assets/wsm/` | `control_wsm.mp4` + `prompt.json` | 101 frames @ 10 FPS | +| Edge (Canny) | `basic_examples/assets/edge/` | `control_edge.mp4` + `prompt.json` | 121 frames @ 30 FPS | +| Blur | `basic_examples/assets/blur/` | `control_blur.mp4` + `prompt.json` | 121 frames @ 30 FPS | +| Depth | `basic_examples/assets/depth/` | `control_depth.mp4` + `prompt.json` | 121 frames @ 30 FPS | +| Segmentation | `basic_examples/assets/seg/` | `control_seg.mp4` + `prompt.json` | 121 frames @ 30 FPS | +| World scenario (WSM) | `basic_examples/assets/wsm/` | `control_wsm.mp4` + `prompt.json` | 101 frames @ 10 FPS | Transfer inference is selected automatically when any hint key is present in the spec. @@ -40,7 +52,7 @@ Transfer inference is selected automatically when any hint key is present in the Set up the environment: [Cosmos Framework setup](../../README.md#cosmos-framework). Activate the framework venv, then run inference (checked-in `specs/*.json` use paths -relative to `specs/`). Transfer on Nano looks like: +relative to `basic_examples/specs/`). Transfer on Nano looks like: ```bash cd cookbooks/cosmos3/generator/transfer @@ -49,7 +61,7 @@ cd cookbooks/cosmos3/generator/transfer torchrun --nproc-per-node=1 \ -m cosmos_framework.scripts.inference \ --parallelism-preset=latency \ - -i specs/edge.json \ + -i basic_examples/specs/edge.json \ -o ./output/ \ --checkpoint-path Cosmos3-Nano \ --seed 2026 @@ -58,7 +70,7 @@ torchrun --nproc-per-node=1 \ torchrun --nproc-per-node=1 \ -m cosmos_framework.scripts.inference \ --parallelism-preset=latency \ - -i specs/blur.json \ + -i basic_examples/specs/blur.json \ -o ./output/ \ --checkpoint-path Cosmos3-Nano \ --seed 2026 @@ -67,7 +79,7 @@ torchrun --nproc-per-node=1 \ torchrun --nproc-per-node=1 \ -m cosmos_framework.scripts.inference \ --parallelism-preset=latency \ - -i specs/depth.json \ + -i basic_examples/specs/depth.json \ -o ./output/ \ --checkpoint-path Cosmos3-Nano \ --seed 2026 @@ -76,7 +88,7 @@ torchrun --nproc-per-node=1 \ torchrun --nproc-per-node=1 \ -m cosmos_framework.scripts.inference \ --parallelism-preset=latency \ - -i specs/seg.json \ + -i basic_examples/specs/seg.json \ -o ./output/ \ --checkpoint-path Cosmos3-Nano \ --seed 2026 @@ -85,14 +97,14 @@ torchrun --nproc-per-node=1 \ torchrun --nproc-per-node=1 \ -m cosmos_framework.scripts.inference \ --parallelism-preset=latency \ - -i specs/wsm.json \ + -i basic_examples/specs/wsm.json \ -o ./output/ \ --checkpoint-path Cosmos3-Nano \ --seed 2026 ``` The input spec sets `prompt_path` and a hint block with `control_path` pointing at the -checked-in assets under [`assets/`](./assets) via paths relative to [`specs/`](./specs). +checked-in assets under [`basic_examples/assets/`](./basic_examples/assets/) via paths relative to [`basic_examples/specs/`](./basic_examples/specs/). Outputs are written under the directory passed to `-o`, with one subdirectory per sample name, for example `output/transfer_edge/vision.mp4`. Batch size must be 1 for transfer. @@ -137,10 +149,10 @@ Key fields: ### Cookbook entrypoints -- [`run_video_transfer_with_cosmos_framework.ipynb`](./run_video_transfer_with_cosmos_framework.ipynb) — +- [`run_video_transfer_with_cosmos_framework.ipynb`](./basic_examples/run_video_transfer_with_cosmos_framework.ipynb) — full tutorial on a **GPU host**: environment setup, `nvidia-smi` check, then five inference blocks (edge, blur, depth, seg, wsm) with previews. See [Cosmos3 environment setup](../../README.md). -- [`specs/`](./specs) — checked-in Framework input JSON per control (paths relative to `specs/`). +- [`basic_examples/specs/`](./basic_examples/specs/) — checked-in Framework input JSON per control (paths relative to `basic_examples/specs/`). ### Troubleshooting diff --git a/cookbooks/cosmos3/generator/transfer/assets/blur/control_blur.mp4 b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/blur/control_blur.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/blur/control_blur.mp4 rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/blur/control_blur.mp4 diff --git a/cookbooks/cosmos3/generator/transfer/assets/blur/prompt.json b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/blur/prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/blur/prompt.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/blur/prompt.json diff --git a/cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4 b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/depth/control_depth.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/depth/control_depth.mp4 rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/depth/control_depth.mp4 diff --git a/cookbooks/cosmos3/generator/transfer/assets/depth/prompt.json b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/depth/prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/depth/prompt.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/depth/prompt.json diff --git a/cookbooks/cosmos3/generator/transfer/assets/edge/control_edge.mp4 b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/edge/control_edge.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/edge/control_edge.mp4 rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/edge/control_edge.mp4 diff --git a/cookbooks/cosmos3/generator/transfer/assets/edge/prompt.json b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/edge/prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/edge/prompt.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/edge/prompt.json diff --git a/cookbooks/cosmos3/generator/transfer/assets/negative_prompt.json b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/negative_prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/negative_prompt.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/negative_prompt.json diff --git a/cookbooks/cosmos3/generator/transfer/assets/seg/control_seg.mp4 b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/seg/control_seg.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/seg/control_seg.mp4 rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/seg/control_seg.mp4 diff --git a/cookbooks/cosmos3/generator/transfer/assets/seg/prompt.json b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/seg/prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/seg/prompt.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/seg/prompt.json diff --git a/cookbooks/cosmos3/generator/transfer/assets/wsm/control_wsm.mp4 b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/wsm/control_wsm.mp4 similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/wsm/control_wsm.mp4 rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/wsm/control_wsm.mp4 diff --git a/cookbooks/cosmos3/generator/transfer/assets/wsm/prompt.json b/cookbooks/cosmos3/generator/transfer/basic_examples/assets/wsm/prompt.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/assets/wsm/prompt.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/assets/wsm/prompt.json diff --git a/cookbooks/cosmos3/generator/transfer/preview_helpers.py b/cookbooks/cosmos3/generator/transfer/basic_examples/preview_helpers.py similarity index 100% rename from cookbooks/cosmos3/generator/transfer/preview_helpers.py rename to cookbooks/cosmos3/generator/transfer/basic_examples/preview_helpers.py diff --git a/cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb b/cookbooks/cosmos3/generator/transfer/basic_examples/run_video_transfer_with_cosmos_framework.ipynb similarity index 100% rename from cookbooks/cosmos3/generator/transfer/run_video_transfer_with_cosmos_framework.ipynb rename to cookbooks/cosmos3/generator/transfer/basic_examples/run_video_transfer_with_cosmos_framework.ipynb diff --git a/cookbooks/cosmos3/generator/transfer/specs/blur.json b/cookbooks/cosmos3/generator/transfer/basic_examples/specs/blur.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/specs/blur.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/specs/blur.json diff --git a/cookbooks/cosmos3/generator/transfer/specs/depth.json b/cookbooks/cosmos3/generator/transfer/basic_examples/specs/depth.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/specs/depth.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/specs/depth.json diff --git a/cookbooks/cosmos3/generator/transfer/specs/edge.json b/cookbooks/cosmos3/generator/transfer/basic_examples/specs/edge.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/specs/edge.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/specs/edge.json diff --git a/cookbooks/cosmos3/generator/transfer/specs/seg.json b/cookbooks/cosmos3/generator/transfer/basic_examples/specs/seg.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/specs/seg.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/specs/seg.json diff --git a/cookbooks/cosmos3/generator/transfer/specs/wsm.json b/cookbooks/cosmos3/generator/transfer/basic_examples/specs/wsm.json similarity index 100% rename from cookbooks/cosmos3/generator/transfer/specs/wsm.json rename to cookbooks/cosmos3/generator/transfer/basic_examples/specs/wsm.json diff --git a/cookbooks/cosmos3/reasoner/README.md b/cookbooks/cosmos3/reasoner/README.md index b3d3542d..3718e45a 100644 --- a/cookbooks/cosmos3/reasoner/README.md +++ b/cookbooks/cosmos3/reasoner/README.md @@ -1,15 +1,28 @@ -# Cosmos3 Reasoner Examples +# Cosmos3 Reasoner Cookbooks Run the Cosmos3 Reasoner (vision-language reasoning over images and video) across -multiple inference backends. Sample inputs live under [`assets/`](./assets). +multiple inference backends. Environment setup for every backend is centralized in the shared [Cosmos3 cookbooks environment setup](../README.md) guide; each backend below links to the section you need. +## Basic Examples + +The [`basic_examples/`](./basic_examples/) directory contains the shipped starter +cookbooks and sample inputs. Community-contributed cookbooks are added as sibling +directories alongside `basic_examples/` — see the +[Contributing Guide](../../../CONTRIBUTING.md) for the recipe structure. + +| Cookbook | Backend | Notebook | +|---------|---------|----------| +| Reasoner inference | Cosmos Framework | [`basic_examples/run_with_cosmos_framework.ipynb`](./basic_examples/run_with_cosmos_framework.ipynb) | +| Reasoner inference | vLLM | [`basic_examples/run_with_vllm.ipynb`](./basic_examples/run_with_vllm.ipynb) | +| Reasoner inference | NIM | [`basic_examples/run_with_nim.ipynb`](./basic_examples/run_with_nim.ipynb) | + ## Reasoner Prompt Guide -See the [Reasoner Prompt Guide](./reasoner_prompt_guide.md). +See the [Reasoner Prompt Guide](./basic_examples/reasoner_prompt_guide.md). ## Run with Cosmos Framework @@ -29,7 +42,7 @@ cat > outputs/cookbooks/cosmos3/reasoner/inputs/robot_image.json <<'JSON' "model_mode": "reasoner", "name": "robot_image", "prompt": "Describe what is happening in this image in one sentence.", - "vision_path": "../../cookbooks/cosmos3/reasoner/assets/robot_153.jpg", + "vision_path": "../../cookbooks/cosmos3/reasoner/basic_examples/assets/robot_153.jpg", "enable_sound": false } JSON @@ -54,7 +67,7 @@ The generated text is written to ### Notebook walkthrough -[`run_with_cosmos_framework.ipynb`](./run_with_cosmos_framework.ipynb) is the full +[`run_with_cosmos_framework.ipynb`](./basic_examples/run_with_cosmos_framework.ipynb) is the full tutorial. It writes text and image smoke tests, then walks through image capability sections — detailed captioning, robot task planning, 2D grounding, describe-anything, and action-trajectory prompts — rendering the prompt, media @@ -73,7 +86,7 @@ Set up the environment and start the server: [Start the server](../README.md#start-the-server) (launch commands). The quickstart below uses **Cosmos3-Nano** on port 8000. The -[`run_with_vllm.ipynb`](./run_with_vllm.ipynb) notebook defaults to +[`run_with_vllm.ipynb`](./basic_examples/run_with_vllm.ipynb) notebook defaults to **Cosmos3-Super** on port **8001** — use that launch command from the env setup guide and point the client at `http://localhost:8001/v1`. @@ -83,7 +96,7 @@ Once the server is ready, query it with the OpenAI client: from pathlib import Path import openai -image_path = Path("assets/robot_153.jpg").resolve() +image_path = Path("basic_examples/assets/robot_153.jpg").resolve() image_url = image_path.as_uri() client = openai.OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1") @@ -108,7 +121,7 @@ print(response.choices[0].message.content) ### Notebook walkthrough -[`run_with_vllm.ipynb`](./run_with_vllm.ipynb) uses the **Cosmos3-Super** launch +[`run_with_vllm.ipynb`](./basic_examples/run_with_vllm.ipynb) uses the **Cosmos3-Super** launch from the [environment setup guide](../README.md#start-the-server) and walks through many more image and video examples: detailed captioning, VQA, temporal localization, embodied reasoning, common-sense reasoning, 2D @@ -138,7 +151,7 @@ import mimetypes from pathlib import Path import openai -image_path = Path("assets/robot_153.jpg").resolve() +image_path = Path("basic_examples/assets/robot_153.jpg").resolve() mime = mimetypes.guess_type(image_path.name)[0] or "application/octet-stream" image_url = f"data:{mime};base64,{base64.b64encode(image_path.read_bytes()).decode('ascii')}" @@ -170,7 +183,7 @@ for the full request reference. ### Notebook walkthrough -[`run_with_nim.ipynb`](./run_with_nim.ipynb) is the NIM counterpart to the vLLM +[`run_with_nim.ipynb`](./basic_examples/run_with_nim.ipynb) is the NIM counterpart to the vLLM notebook: it launches the NIM container, waits for readiness, and then runs the same image and video examples — detailed captioning, VQA, temporal localization, embodied reasoning, common-sense reasoning, 2D grounding, describe-anything, diff --git a/cookbooks/cosmos3/reasoner/assets/action_cot_driving_scene.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/action_cot_driving_scene.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/action_cot_driving_scene.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/action_cot_driving_scene.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/action_cot_trajectory.png b/cookbooks/cosmos3/reasoner/basic_examples/assets/action_cot_trajectory.png similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/action_cot_trajectory.png rename to cookbooks/cosmos3/reasoner/basic_examples/assets/action_cot_trajectory.png diff --git a/cookbooks/cosmos3/reasoner/assets/assisted_task_next_action.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/assisted_task_next_action.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/assisted_task_next_action.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/assisted_task_next_action.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/common_sense_reasoning.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/common_sense_reasoning.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/common_sense_reasoning.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/common_sense_reasoning.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/describe_anything.png b/cookbooks/cosmos3/reasoner/basic_examples/assets/describe_anything.png similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/describe_anything.png rename to cookbooks/cosmos3/reasoner/basic_examples/assets/describe_anything.png diff --git a/cookbooks/cosmos3/reasoner/assets/drive_scene_next_action.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/drive_scene_next_action.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/drive_scene_next_action.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/drive_scene_next_action.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/grounding_2d.png b/cookbooks/cosmos3/reasoner/basic_examples/assets/grounding_2d.png similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/grounding_2d.png rename to cookbooks/cosmos3/reasoner/basic_examples/assets/grounding_2d.png diff --git a/cookbooks/cosmos3/reasoner/assets/physical_plausibility.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/physical_plausibility.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/physical_plausibility.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/physical_plausibility.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/robot_153.jpg b/cookbooks/cosmos3/reasoner/basic_examples/assets/robot_153.jpg similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/robot_153.jpg rename to cookbooks/cosmos3/reasoner/basic_examples/assets/robot_153.jpg diff --git a/cookbooks/cosmos3/reasoner/assets/robot_planning.png b/cookbooks/cosmos3/reasoner/basic_examples/assets/robot_planning.png similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/robot_planning.png rename to cookbooks/cosmos3/reasoner/basic_examples/assets/robot_planning.png diff --git a/cookbooks/cosmos3/reasoner/assets/robotics_next_action.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/robotics_next_action.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/robotics_next_action.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/robotics_next_action.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/situation_understanding.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/situation_understanding.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/situation_understanding.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/situation_understanding.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/temporal_localization_1.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/temporal_localization_1.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/temporal_localization_1.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/temporal_localization_1.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/temporal_localization_2.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/temporal_localization_2.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/temporal_localization_2.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/temporal_localization_2.mp4 diff --git a/cookbooks/cosmos3/reasoner/assets/video_caption.mp4 b/cookbooks/cosmos3/reasoner/basic_examples/assets/video_caption.mp4 similarity index 100% rename from cookbooks/cosmos3/reasoner/assets/video_caption.mp4 rename to cookbooks/cosmos3/reasoner/basic_examples/assets/video_caption.mp4 diff --git a/cookbooks/cosmos3/reasoner/reasoner_prompt_guide.md b/cookbooks/cosmos3/reasoner/basic_examples/reasoner_prompt_guide.md similarity index 100% rename from cookbooks/cosmos3/reasoner/reasoner_prompt_guide.md rename to cookbooks/cosmos3/reasoner/basic_examples/reasoner_prompt_guide.md diff --git a/cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb b/cookbooks/cosmos3/reasoner/basic_examples/run_with_cosmos_framework.ipynb similarity index 100% rename from cookbooks/cosmos3/reasoner/run_with_cosmos_framework.ipynb rename to cookbooks/cosmos3/reasoner/basic_examples/run_with_cosmos_framework.ipynb diff --git a/cookbooks/cosmos3/reasoner/run_with_nim.ipynb b/cookbooks/cosmos3/reasoner/basic_examples/run_with_nim.ipynb similarity index 100% rename from cookbooks/cosmos3/reasoner/run_with_nim.ipynb rename to cookbooks/cosmos3/reasoner/basic_examples/run_with_nim.ipynb diff --git a/cookbooks/cosmos3/reasoner/run_with_vllm.ipynb b/cookbooks/cosmos3/reasoner/basic_examples/run_with_vllm.ipynb similarity index 100% rename from cookbooks/cosmos3/reasoner/run_with_vllm.ipynb rename to cookbooks/cosmos3/reasoner/basic_examples/run_with_vllm.ipynb