While recent video world models excel at simulating static environments, they share a critical blind spot: the physical world is dynamic. When moving subjects exit the camera's field of view and later re-emerge, current models often lose track of them, rendering returning subjects as frozen statues or distorted phantoms, or letting them vanish entirely.
To bridge this gap, we introduce Hybrid Memory, a novel paradigm that requires models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects. A true world model must not only remember a subject's appearance but also mentally predict its unseen trajectory, ensuring visual and motion continuity even during out-of-view intervals.
## Abstract
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. We introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses contexts into memory tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
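The relevance-driven retrieval described above can be illustrated with a deliberately simplified sketch: score compressed memory tokens against a query, keep only the top-k most relevant, and attend over those. This is a toy dot-product version for intuition only; HyDRA's actual retrieval is learned and spatiotemporal, and the function below is not part of the released code.

```python
import math

def retrieve(query, memory_tokens, k=2):
    """Toy relevance-driven retrieval: dot-product scoring of memory
    tokens, top-k selection, then a softmax-weighted average of the
    selected tokens (a sketch, not HyDRA's actual mechanism)."""
    # Relevance score of each memory token w.r.t. the query.
    scores = [sum(q * m for q, m in zip(query, tok)) for tok in memory_tokens]
    # Indices of the k most relevant tokens.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (selective attention).
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted combination of the retrieved memory tokens.
    dim = len(query)
    return [sum(w * memory_tokens[i][d] for w, i in zip(weights, top))
            for d in range(dim)]
```

Restricting attention to the top-k tokens is what lets the model focus on motion cues relevant to a hidden subject instead of averaging over the entire context.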
More results can be found on our project homepage.
## TODO
- Release the paper
- Release HM-World dataset
- Release HyDRA checkpoints and inference code
- Release HyDRA training code
```shell
git clone https://github.com/H-EmbodVis/HyDRA.git
cd HyDRA
```

> If your cloned folder name is not `HyDRA`, please `cd` into the actual folder.
```shell
conda create -n hydra python=3.10 -y
conda activate hydra
pip install -r requirements.txt
```

- Model link: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- Recommended location: `./ckpts`
Example directory structure (recommended):

```
HyDRA/
└── ckpts/
    ├── Wan2.1_VAE.pth
    ├── diffusion_pytorch_model.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    └── ... (other files)
```
- Checkpoint link: https://huggingface.co/H-EmbodVis/HyDRA
- Recommended location: `./ckpts` (e.g., `./ckpts/hydra.pth`)
Run inference on the example data:

```shell
python infer_hydra.py
```

We provide an example training script that can be used to train on a custom dataset.
For each training sample, please preprocess and save it as a single `.pth` file:
- Encode the custom video into video latents with the VAE.
- Encode the caption into a text embedding with the text encoder.
- Convert the camera poses into a relative coordinate system.

Then, load these per-sample `.pth` files in the training script and train the DiT module.
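As a concrete illustration of the last preprocessing step, the sketch below converts absolute camera poses into a coordinate system relative to the first frame. It assumes poses are given as 4x4 camera-to-world rigid transforms stored as nested lists; the repo's actual pose format and conventions may differ.

```python
def rigid_inverse(T):
    """Invert a 4x4 rigid transform [R|t]; the inverse is [R^T | -R^T t]."""
    R = [row[:3] for row in T[:3]]
    t = [T[0][3], T[1][3], T[2][3]]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    mt = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    inv = [Rt[i] + [mt[i]] for i in range(3)]
    inv.append([0.0, 0.0, 0.0, 1.0])
    return inv

def matmul4(A, B):
    """Multiply two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def to_relative(poses):
    """Re-express every pose relative to the first frame,
    so the first pose becomes the identity."""
    ref_inv = rigid_inverse(poses[0])
    return [matmul4(ref_inv, T) for T in poses]
```

Anchoring all poses to the first frame makes training samples invariant to where the clip happened to be placed in world coordinates.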
```shell
python train_hydra.py \
    --dit_path ./ckpts/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors \
    --use_gradient_checkpointing \
    --hydra
```

Thanks to the following related works and open-source repositories:
If you find our work useful, please consider citing:

```bibtex
@article{chen2026out,
  title   = {Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models},
  author  = {Chen, Kaijin and Liang, Dingkang and Zhou, Xin and Ding, Yikang and Liu, Xiaoqiang and Wan, Pengfei and Bai, Xiang},
  journal = {arXiv preprint arXiv:2603.25716},
  year    = {2026}
}
```






