While recent video world models excel at simulating static environments, they share a critical blind spot: the physical world is dynamic. When moving subjects exit the camera's field of view and later re-emerge, current models often lose track of them, rendering returning subjects as frozen statues or distorted phantoms, or letting them vanish entirely.
To bridge this gap, we introduce Hybrid Memory, a novel paradigm that requires models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects. A true world model must not only remember a subject's appearance but also mentally predict its unseen trajectory, ensuring visual and motion continuity even during out-of-view intervals.
## Abstract
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. We introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses contexts into memory tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
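The relevance-driven retrieval described above can be illustrated with a deliberately simplified sketch: score compressed memory tokens against a query, keep only the top-k most relevant, and attend over those. This is a toy dot-product version for intuition only; HyDRA's actual retrieval is learned and spatiotemporal, and the function below is not part of the released code.

```python
import math

def retrieve(query, memory_tokens, k=2):
    """Toy relevance-driven retrieval: dot-product scoring of memory
    tokens, top-k selection, then a softmax-weighted average of the
    selected tokens (a sketch, not HyDRA's actual mechanism)."""
    # Relevance score of each memory token w.r.t. the query.
    scores = [sum(q * m for q, m in zip(query, tok)) for tok in memory_tokens]
    # Indices of the k most relevant tokens.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (selective attention).
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted combination of the retrieved memory tokens.
    dim = len(query)
    return [sum(w * memory_tokens[i][d] for w, i in zip(weights, top))
            for d in range(dim)]
```

Restricting attention to the top-k tokens is what lets the model focus on motion cues relevant to a hidden subject instead of averaging over the entire context.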
More results can be found on our project homepage.
## TODO
- Release the paper
- Release HM-World dataset
- Release HyDRA checkpoints and inference code
- Release HyDRA training code
```shell
git clone https://github.com/H-EmbodVis/HyDRA.git
cd HyDRA
```

> If your cloned folder name is not `HyDRA`, please `cd` into the actual folder.
```shell
conda create -n hydra python=3.10 -y
conda activate hydra
pip install -r requirements.txt
```

- Model link: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- Recommended location: `./ckpts`
Example directory structure (recommended):

```
HyDRA/
└── ckpts/
    ├── Wan2.1_VAE.pth
    ├── diffusion_pytorch_model.safetensors
    ├── models_t5_umt5-xxl-enc-bf16.pth
    └── ... (other files)
```
- Checkpoint link: https://huggingface.co/H-EmbodVis/HyDRA
- Recommended location: `./ckpts` (e.g., `./ckpts/hydra.pth`)
Run inference on the example data:

```shell
python infer_hydra.py
```

We provide an example training script that can be used to train on a custom dataset.
For each training sample, please preprocess and save it as a single `.pth` file:
- Encode the custom video into video latents with the VAE.
- Encode the caption into a text embedding with the text encoder.
- Convert the camera poses into a relative coordinate system.

Then, load these per-sample `.pth` files in the training script and train the DiT module.
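As a concrete illustration of the last preprocessing step, the sketch below converts absolute camera poses into a coordinate system relative to the first frame. It assumes poses are given as 4x4 camera-to-world rigid transforms stored as nested lists; the repo's actual pose format and conventions may differ.

```python
def rigid_inverse(T):
    """Invert a 4x4 rigid transform [R|t]; the inverse is [R^T | -R^T t]."""
    R = [row[:3] for row in T[:3]]
    t = [T[0][3], T[1][3], T[2][3]]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    mt = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    inv = [Rt[i] + [mt[i]] for i in range(3)]
    inv.append([0.0, 0.0, 0.0, 1.0])
    return inv

def matmul4(A, B):
    """Multiply two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def to_relative(poses):
    """Re-express every pose relative to the first frame,
    so the first pose becomes the identity."""
    ref_inv = rigid_inverse(poses[0])
    return [matmul4(ref_inv, T) for T in poses]
```

Anchoring all poses to the first frame makes training samples invariant to where the clip happened to be placed in world coordinates.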
```shell
python train_hydra.py \
    --dit_path ./ckpts/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors \
    --use_gradient_checkpointing \
    --hydra
```

Thanks to the following related works and open-source repositories:
If you find our work useful, please consider citing:

```bibtex
@article{chen2026out,
  title   = {Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models},
  author  = {Chen, Kaijin and Liang, Dingkang and Zhou, Xin and Ding, Yikang and Liu, Xiaoqiang and Wan, Pengfei and Bai, Xiang},
  journal = {arXiv preprint arXiv:2603.25716},
  year    = {2026}
}
```






