Dwon #55

176 changes: 176 additions & 0 deletions CUDA_SETUP.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
# CUDA 12.8 and PyTorch Setup Guide

This guide covers setting up LingBot-World with CUDA 12.8 and PyTorch.

## Prerequisites

- NVIDIA GPU with compute capability 7.0+ (e.g., Tesla V100, RTX 20 series, or newer)
- Docker with NVIDIA Container Toolkit installed
- At least 16GB of system RAM
- Sufficient disk space for models and data
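
The compute-capability requirement can be checked before installing anything else. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and the installed driver supports the `compute_cap` query field; the helper names are illustrative and not part of LingBot-World.

```python
import subprocess

MIN_COMPUTE_CAPABILITY = (7, 0)  # threshold taken from the prerequisites list


def parse_compute_caps(smi_output: str) -> list[tuple[int, int]]:
    """Parse one 'major.minor' value per line, as printed by nvidia-smi."""
    caps = []
    for line in smi_output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def unsupported_gpus(caps, minimum=MIN_COMPUTE_CAPABILITY):
    """Return the indices of GPUs below the minimum compute capability."""
    return [i for i, cap in enumerate(caps) if cap < minimum]


def query_compute_caps() -> list[tuple[int, int]]:
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_compute_caps(out)
```

Calling `unsupported_gpus(query_compute_caps())` on the target machine returns an empty list when every GPU meets the requirement.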

## Option 1: Docker Setup (Recommended)

### 1. Install Docker and NVIDIA Container Toolkit

**Ubuntu/Debian:**
```bash
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

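# Install NVIDIA Container Toolkit
# (apt-key and the old nvidia-docker repository are deprecated; use a signed-by keyring instead)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker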
```

**Windows:**
- Install Docker Desktop for Windows
- Ensure WSL2 backend is enabled
- Install NVIDIA drivers for WSL2

### 2. Build and Run with Docker Compose

```bash
# Build the image (Compose v2 syntax; with the legacy v1 binary, use `docker-compose` instead)
docker compose build

# Run the container
docker compose up -d

# Enter the container
docker compose exec lingbot-world bash

# Or run directly
docker compose run --rm lingbot-world python your_script.py
```

### 3. Verify CUDA Installation

Inside the container:
```python
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Number of GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
```

## Option 2: Local Installation

### 1. Install CUDA 12.8

**Ubuntu/Debian:**
```bash
# Download and install CUDA 12.8
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-8

# Add to PATH
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

**Windows:**
1. Download CUDA 12.8 from [NVIDIA Developer](https://developer.nvidia.com/cuda-downloads)
2. Run the installer and follow the prompts
3. Verify installation: `nvcc --version`
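
The `nvcc --version` check can also be scripted so that a setup script fails early when the wrong toolkit is on the PATH. This is a rough sketch, assuming the usual `release X.Y` line in the `nvcc` banner; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text: str):
    """Extract (major, minor) from the 'release X.Y' line of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Return the toolkit release found on PATH, or None if nvcc is missing."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True
    ).stdout
    return parse_nvcc_release(out)
```

A setup script could compare `installed_cuda_release()` against `(12, 8)` and print a hint about `PATH` when the result is `None` or a different release.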

### 2. Create Python Environment

```bash
# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip setuptools wheel
```

### 3. Install PyTorch with CUDA 12.4 Support

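**Note:** This guide installs PyTorch's CUDA 12.4 (`cu124`) wheels. They bundle their own CUDA runtime libraries, so they work alongside a system CUDA 12.8 toolkit as long as the NVIDIA driver is recent enough. PyTorch 2.7 and later also publish CUDA 12.8 (`cu128`) wheels, which Blackwell GPUs (e.g., RTX 50 series) require. The command below uses `cu124`: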

```bash
# Install PyTorch with CUDA 12.4 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
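
The wheel index URL follows a simple naming pattern, which is handy when the CUDA version is a parameter of an install script. The sketch below infers that pattern from the URL in the command above; it does not verify that an index actually exists for a given version, and the helper names are made up for illustration.

```python
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def wheel_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as '12.4' to its wheel index, e.g. .../cu124."""
    major, minor = cuda_version.split(".")[:2]
    return f"{PYTORCH_WHEEL_BASE}/cu{major}{minor}"


def pip_install_command(cuda_version: str,
                        packages=("torch", "torchvision", "torchaudio")):
    """Build the pip argument list for the given CUDA wheel index."""
    return ["pip", "install", *packages,
            "--index-url", wheel_index_url(cuda_version)]


if __name__ == "__main__":
    # Prints the same command as the one shown above.
    print(" ".join(pip_install_command("12.4")))
```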

### 4. Install LingBot-World Dependencies

```bash
# Install remaining dependencies
pip install -r requirements.txt

# Install in editable mode
pip install -e .
```

### 5. Verify Installation

```bash
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```

## Troubleshooting

### CUDA Version Mismatch

If you encounter CUDA version mismatch errors:
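- PyTorch's CUDA 12.4 wheels ship their own CUDA runtime and run on any driver that supports CUDA 12.4 or newer
- The CUDA 12.8 toolkit and container image need a newer driver: ensure your NVIDIA driver supports CUDA 12.8 (driver version ≥ 570.x on Linux)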
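
A driver check like the one described above comes down to comparing dotted version strings. Here is a small sketch, assuming `nvidia-smi` is available and supports the `driver_version` query field; the minimum version is passed in by the caller rather than hard-coded, and the function names are illustrative.

```python
import subprocess


def parse_driver_version(text: str) -> tuple[int, ...]:
    """Turn a string such as '570.124.06' into (570, 124, 6)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed: str, minimum: str) -> bool:
    """Compare two dotted driver versions component by component."""
    return parse_driver_version(installed) >= parse_driver_version(minimum)


def installed_driver_version() -> str:
    """Read the driver version of the first GPU from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

For example, `driver_is_new_enough(installed_driver_version(), minimum)` with `minimum` set to the driver branch listed in the bullet above tells you whether a driver upgrade is needed.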

### Out of Memory Errors

If you encounter OOM errors:
- Reduce batch size in your training scripts
- Enable gradient checkpointing
- Use mixed precision training (fp16 or bf16)

### Flash Attention Installation Issues

If `flash_attn` fails to install:
```bash
# Install with CUDA support
pip install flash-attn --no-build-isolation
```

Or build from source:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
```

## Performance Optimization

### Enable TF32 for Ampere+ GPUs

```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```
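
TF32 only has an effect on Ampere (compute capability 8.0) and newer GPUs, so the flags can be set conditionally. The following sketch assumes PyTorch is installed in the environment; `supports_tf32` and `enable_tf32_if_supported` are hypothetical helper names, not LingBot-World functions.

```python
def supports_tf32(compute_capability: tuple[int, int]) -> bool:
    """TF32 tensor cores first appeared with Ampere (compute capability 8.0)."""
    return compute_capability >= (8, 0)


def enable_tf32_if_supported(device_index: int = 0) -> bool:
    """Enable TF32 when the selected GPU can use it; return whether it was enabled."""
    import torch  # imported lazily so supports_tf32 works without PyTorch

    if not torch.cuda.is_available():
        return False
    if not supports_tf32(torch.cuda.get_device_capability(device_index)):
        return False
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```

Calling `enable_tf32_if_supported()` once at startup leaves the defaults untouched on older GPUs and on CPU-only machines.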

### Use Compilation (PyTorch 2.x)

```python
model = torch.compile(model)
```

## Additional Resources

- [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads)
- [PyTorch Installation Guide](https://pytorch.org/get-started/locally/)
- [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit)
- [Flash Attention](https://github.com/Dao-AILab/flash-attention)
51 changes: 51 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# LingBot-World Dockerfile with CUDA 12.8 support
FROM nvidia/cuda:12.8.0-cudnn9-devel-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    python3-dev \
    git \
    wget \
    curl \
    build-essential \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip
RUN python3 -m pip install --upgrade pip setuptools wheel

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip install --no-cache-dir -r requirements.txt

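# Authenticate with Hugging Face and pre-download the model config.
# HUGGINGFACE_TOKEN must be supplied at build time (e.g. --build-arg HUGGINGFACE_TOKEN=<token>);
# without this ARG the variable is empty during the build. Note that build args appear in the image history.
ARG HUGGINGFACE_TOKEN
RUN huggingface-cli login --token ${HUGGINGFACE_TOKEN} && \
    python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='google/embeddinggemma-300m', filename='config.json', repo_type='model', cache_dir='/app/checkpoints')"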

# Copy project files
COPY . .

# Install the package in editable mode
RUN pip install -e .

# Set the default command
CMD ["/bin/bash"]
126 changes: 126 additions & 0 deletions README copy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
<div align="center">
<img src="assets/teaser.png">

<h1>LingBot-World: Advancing Open-source World Models</h1>

Robbyant Team

</div>


<div align="center">

[![Page](https://img.shields.io/badge/%F0%9F%8C%90%20Project%20Page-Demo-00bfff)](https://technology.robbyant.com/lingbot-world)
[![Tech Report](https://img.shields.io/badge/%F0%9F%93%84%20Tech%20Report-Document-teal)](LingBot_World_paper.pdf)
[![Paper](https://img.shields.io/static/v1?label=Paper&message=PDF&color=red&logo=arxiv)](https://github.com/robbyant/lingbot-world)
[![Model](https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Model&message=HuggingFace&color=yellow)](https://huggingface.co/robbyant/lingbot-world-base-cam)
[![Model](https://img.shields.io/static/v1?label=%F0%9F%A4%96%20Model&message=ModelScope&color=purple)](https://www.modelscope.cn/models/Robbyant/lingbot-world-base-cam)
[![License](https://img.shields.io/badge/License-Apache--2.0-green)](LICENSE.txt)

</div>

-----

We are excited to introduce **LingBot-World**, an open-sourced world simulator stemming from video generation. Positioned
as a top-tier world model, LingBot-World offers the following features.
- **High-Fidelity & Diverse Environments**: It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond.
- **Long-Term Memory & Consistency**: It enables a minute-level horizon while preserving contextual consistency over time, which is also known as long-term memory.
- **Real-Time Interactivity & Open Access**: It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

## 🎬 Video Demo
<div align="center">
<video src="https://github.com/user-attachments/assets/ea4a7a8d-5d9e-4ccf-96e7-02f93797116e" width="100%" poster=""> </video>
</div>

## 🔥 News
- Jan 29, 2026: 🎉 We release the technical report, code, and models for LingBot-World.

<!-- ## 🔖 Introduction of LingBot-World
We present **LingBot-World**, an **open-sourced** world simulator stemming from video generation. Positioned
as a top-tier world model, LingBot-World offers the following features.
- It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond.
- It enables a minute-level horizon while preserving contextual consistency over time, which is also known as **long-term memory**.
- It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning. -->

## ⚙️ Quick Start
This codebase is built upon [Wan2.2](https://github.com/Wan-Video/Wan2.2). Please refer to their documentation for installation instructions.
### Installation
Clone the repo:
```sh
git clone https://github.com/robbyant/lingbot-world.git
cd lingbot-world
```
Install dependencies:
```sh
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```
Install [`flash_attn`](https://github.com/Dao-AILab/flash-attention):
```sh
pip install flash-attn --no-build-isolation
```
### Model Download

| Model | Control Signals | Resolution | Download Links |
| :--- | :--- | :--- | :--- |
| **LingBot-World-Base (Cam)** | Camera Poses | 480P & 720P | 🤗 [HuggingFace](https://huggingface.co/robbyant/lingbot-world-base-cam) 🤖 [ModelScope](https://www.modelscope.cn/models/Robbyant/lingbot-world-base-cam) |
| **LingBot-World-Base (Act)** | Actions | - | *To be released* |
| **LingBot-World-Fast** | - | - | *To be released* |


Download models using modelscope-cli:
```sh
pip install modelscope
modelscope download robbyant/lingbot-world-base-cam --local_dir ./lingbot-world-base-cam
```
### Inference
Our model supports video generation at both 480P and 720P resolutions. You can find data samples for inference in the `examples/` directory, which includes the corresponding input images, prompts, and control signals. To enable long video generation, we utilize multi-GPU inference powered by FSDP and DeepSpeed Ulysses.
- 480P:

`frame_num` must be of the form 4n + 1, where n is an integer (e.g., 1, 2, 3). For example, valid values include 5, 9, 13, 161, and 321.

Single-process runs (the first one also sets an output location with `--save_file`):
``` sh
python generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --frame_num 33 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls." --save_file C:\workspace\world\lingbot-world\out

python generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --frame_num 21 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
```
Launch with `torchrun`:
``` sh
# Windows PowerShell only: disable libuv for torch.distributed
$env:USE_LIBUV=0
torchrun --nproc_per_node=1 generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
```
The same launch via `python -m torch.distributed.run`:
``` sh
python -m torch.distributed.run --nproc_per_node=1 generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
```

- 720P:
``` sh
torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 720*1280 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --action_path examples/00 --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
```
Alternatively, you can run inference without control actions:
``` sh
torchrun --nproc_per_node=8 generate.py --task i2v-A14B --size 480*832 --ckpt_dir lingbot-world-base-cam --image examples/00/image.jpg --dit_fsdp --t5_fsdp --ulysses_size 8 --frame_num 161 --prompt "The video presents a soaring journey through a fantasy jungle. The wind whips past the rider's blue hands gripping the reins, causing the leather straps to vibrate. The ancient gothic castle approaches steadily, its stone details becoming clearer against the backdrop of floating islands and distant waterfalls."
```
Tips:
If you have sufficient CUDA memory, you may increase the `frame_num` parameter to a value such as 961 to generate a one-minute video at 16 FPS.
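
The relationship between video length, frame rate, and the 4n + 1 rule for `frame_num` can be worked out with a couple of helpers. This is an illustrative sketch only; the function names are hypothetical and `generate.py` does not necessarily validate `frame_num` this way.

```python
import math


def is_valid_frame_num(frame_num: int) -> bool:
    """True when frame_num has the form 4n + 1."""
    return frame_num >= 1 and (frame_num - 1) % 4 == 0


def frame_num_for_duration(seconds: float, fps: int = 16) -> int:
    """Smallest 4n + 1 value that covers at least seconds * fps frames."""
    target = math.ceil(seconds * fps)
    n = math.ceil(max(target - 1, 0) / 4)
    return 4 * n + 1


if __name__ == "__main__":
    # One minute at 16 FPS needs 960 frames, so the next valid value is used.
    print(frame_num_for_duration(60))
```

The one-minute case reproduces the value from the tip above: 60 s × 16 FPS = 960 frames, and the next value of the form 4n + 1 is 961.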

## 📚 Related Projects
- [HoloCine](https://holo-cine.github.io/)
- [Ditto](https://editto.net/)
- [WorldCanvas](https://worldcanvas.github.io/)
- [RewardForcing](https://reward-forcing.github.io/)
- [CoDeF](https://qiuyu96.github.io/CoDeF/)

## 📜 License
This project is licensed under the Apache 2.0 License. Please refer to the [LICENSE file](LICENSE.txt) for the full text, including details on rights and restrictions.

## ✨ Acknowledgement
We would like to express our gratitude to the Wan Team for open-sourcing their code and models. Their contributions have been instrumental to the development of this project.

## 📖 Citation
If you find this work useful for your research, please cite our paper:

```
@article{lingbot-world,
title={Advancing Open-source World Models},
author={Robbyant Team},
journal={arXiv preprint arXiv:xx.xx},
year={2026}
}
```