Media Lab, MIT; Harvard Medical School
Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang†, Mengyu Wang†
(†: Jointly Supervised)
@inproceedings{zhoupage,
  title={PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation},
  author={Zhou, Kaichen and Wang, Yuhan and Chen, Grace and Beaudouin, Gaspard and Zhan, Fangneng and Liang, Paul Pu and Wang, Mengyu},
  booktitle={The Fourteenth International Conference on Learning Representations}
}

PAGE-4D (ICLR 2026) extends the Visual Geometry Grounded Transformer (VGGT, CVPR 2025) to dynamic scenes. It is a feed-forward neural network that directly infers key 4D scene attributes, including camera poses, depth maps, and dense point maps, while explicitly modeling dynamic elements such as moving humans and deformable objects, all without requiring post-processing or optimization.
First, clone this repository to your local machine, and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).
git clone https://github.com/kaichen-z/PAGE4D.git
cd PAGE4D
pip install -r requirements.txt

Now, try the model with just a few lines of code:
import torch
from page.models.vggt import VGGT
from page.utils.load_fn import load_and_preprocess_images
from page.utils.pose_enc import pose_encoding_to_extri_intri
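device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 needs compute capability >= 8.0 (Ampere or newer); only query the GPU when one is present
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
# Path to the downloaded checkpoint file (e.g. checkpoint_nomask.pt)
Directory = "path/to/checkpoint_nomask.pt"
model = VGGT()
checkpoint = torch.load(Directory, map_location=device)
model.load_state_dict(checkpoint['model'], strict=False)
# Move the model to the same device as the images and switch to inference mode
model = model.to(device).eval()
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
images = load_and_preprocess_images(image_names).to(device)
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        predictions = model(images)

Training uses launch_gra.py with gradient checkpointing for memory efficiency.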
cd training_bash
bash final_train.sh

The script runs training with automatic retries on failure and logs to logs/training_final.log.

Alternatively, launch training directly on a single GPU:

cd training
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=29508 launch_gra.py --config training_final

For multi-GPU training, set CUDA_VISIBLE_DEVICES and --nproc_per_node accordingly.
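As a rough illustration of how the two settings move together, the sketch below builds the environment variable and the torchrun arguments for a given list of GPU ids. The helper build_launch is hypothetical and not part of the repository; the flags, port, script name, and config name are taken from the single-GPU command above.

```python
import os
import shlex


def build_launch(gpu_ids, config="training_final", master_port=29508):
    """Hypothetical helper: return (env, argv) for a torchrun launch on the given GPUs."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    argv = [
        "torchrun",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per visible GPU
        f"--master_port={master_port}",
        "launch_gra.py",
        "--config",
        config,
    ]
    return env, argv


env, argv = build_launch([0, 1, 2, 3])
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))
```

The pair can then be passed to subprocess.run(argv, env=env, cwd="training") if launching from Python is preferred over the shell.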
Edit training/config/training_final.yaml to customize:
- Datasets: Training and validation datasets under data.train.dataset.dataset_configs and data.val.dataset.dataset_configs. Update dataset_location paths for your environment.
- Resume: Set checkpoint.resume_checkpoint_path to resume from a checkpoint.
- Experiment: exp_name controls checkpoint and log directory names.
- Debug limits: limit_train_batches and limit_val_batches cap batches per epoch; set to null for full training.
Update TRAINING_CMD and LOG_DIR in final_train.sh if your project path differs from the default.
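To make the dotted key names above concrete, here is a minimal sketch of how they map onto a nested configuration. It uses plain Python dictionaries rather than the repository's Hydra setup; set_by_path is a hypothetical helper, all values are placeholders, and placing exp_name and the limit_* keys at the top level is an assumption.

```python
def set_by_path(cfg: dict, dotted_key: str, value) -> None:
    """Set cfg['a']['b']['c'] = value for dotted_key 'a.b.c', creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Placeholder values only; the real settings live in training/config/training_final.yaml
cfg = {}
set_by_path(cfg, "exp_name", "my_experiment")
set_by_path(cfg, "checkpoint.resume_checkpoint_path", "path/to/checkpoint.pt")
set_by_path(cfg, "limit_train_batches", None)  # null in YAML: no cap on batches
set_by_path(cfg, "limit_val_batches", None)
print(cfg)
```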
We provide a detailed visualization strategy used in Figure 2 of our paper. The script eval/visualization.py extracts and visualizes the model's internal feature maps to illustrate how PAGE-4D disentangles frame-local and global cross-view information.
cd eval
python visualization.py

Configure at the bottom of visualization.py:
- directory: Output folder for saved visualizations.
- image_names: List of image paths (multi-view inputs).
- initial_num, gap: Frame indices for video sequences (e.g., rgb_{initial_num:05d}.jpg, rgb_{initial_num+gap:05d}.jpg).

For each input image and each transformer layer, the script saves:
- {name}_frame_feature_{layer}.png: Frame-local feature heatmap.
- {name}_global_feature_{layer}.png: Global cross-view feature heatmap.
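The naming patterns above can be expanded as in the following sketch. The helper functions and the count parameter are hypothetical, and zero-based layer indices are an assumption; only the file name patterns come from the description above.

```python
def frame_names(initial_num: int, gap: int, count: int = 2):
    """Input frame names following the rgb_{index:05d}.jpg pattern, spaced by `gap`."""
    return [f"rgb_{initial_num + i * gap:05d}.jpg" for i in range(count)]


def output_names(name: str, num_layers: int):
    """Heatmap file names saved for one input image: one frame/global pair per layer."""
    names = []
    for layer in range(num_layers):  # assumes layers are numbered from 0
        names.append(f"{name}_frame_feature_{layer}.png")
        names.append(f"{name}_global_feature_{layer}.png")
    return names


print(frame_names(initial_num=10, gap=5))  # ['rgb_00010.jpg', 'rgb_00015.jpg']
print(output_names("rgb_00010", num_layers=2))
```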
Dataset-specific preparation scripts live in training/data/datasets/prepare/. They sample frames and produce standardized directory layouts for the dataloaders.
TUM RGB-D (tum_pre.py):
- Input: Raw TUM format with rgb.txt, groundtruth.txt, depth.txt.
- Process: Associates RGB, depth, and pose by timestamp; samples 90 frames at stride 3.
- Output per sequence: rgb_90/, depth_90/, groundtruth_90.txt.

# Run from project root; update dataset_location in script if needed
python -m data.datasets.prepare.tum_pre

Bonn RGB-D (bonn_pre.py):
- Input: rgbd_bonn_dataset/*/rgb/*.png, depth/*.png, groundtruth.txt.
- Process: Samples frames 30–140 (110 frames) for sequences balloon2, crowd2, crowd3, person_tracking2, synchronous.
- Output per sequence: rgb_110/, depth_110/, groundtruth_110.txt.

python -m data.datasets.prepare.bonn_pre

Update the dirs path at the top of each script to your dataset location.
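For readers unfamiliar with the TUM layout, the "associate by timestamp, then sample" step described for tum_pre.py can be sketched as follows. This is not the repository's code: the function names, the 0.02 s matching tolerance, and the reading of "90 frames at stride 3" as "every third frame, first 90 kept" are all assumptions.

```python
def read_tum_list(lines):
    """Parse TUM-style lines 'timestamp rest...' into (timestamp, rest), skipping comments."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        stamp, rest = line.split(maxsplit=1)
        entries.append((float(stamp), rest))
    return entries


def associate(reference, other, max_diff=0.02):
    """Pair each reference entry with the closest-in-time entry of `other` within max_diff seconds."""
    matches = []
    for stamp, value in reference:
        nearest = min(other, key=lambda entry: abs(entry[0] - stamp))
        if abs(nearest[0] - stamp) <= max_diff:
            matches.append((stamp, value, nearest[1]))
    return matches


def sample_frames(frames, num_frames=90, stride=3):
    """Keep every `stride`-th frame and truncate to `num_frames` (assumed interpretation)."""
    return frames[::stride][:num_frames]


rgb = read_tum_list(["# color images", "1.00 rgb/1.00.png", "1.03 rgb/1.03.png", "1.07 rgb/1.07.png"])
depth = read_tum_list(["1.01 depth/1.01.png", "1.04 depth/1.04.png", "1.20 depth/1.20.png"])
pairs = associate(rgb, depth)
print(pairs)  # the 1.07 frame has no depth within 0.02 s and is dropped
```

The same association would be repeated against groundtruth.txt to attach a pose to each RGB-depth pair.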
dataset_validation.py checks that the dataloader works with your config and optionally saves visualizations (point clouds, depth maps, tracks).
cd training
python -m data.dataset_validation --config debug

Enable your dataset in training/data/datasets/config/debug.yaml (or the config you pass) by uncommenting the corresponding dataset entry. The script loads the dataset via Hydra, iterates the loader, and can save:
- .ply point clouds (world and camera coordinates),
- Side-by-side track visualizations,
- Depth maps,
- Track-overlay videos.
Set save_address in the script to the desired output directory.
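If you want to inspect points outside the validation script, a point cloud can be written in the standard ASCII PLY format with a few lines of NumPy. This is a generic sketch, not the writer used by dataset_validation.py; write_ply and the sample points are hypothetical.

```python
import os
import tempfile

import numpy as np


def write_ply(path, points, colors=None):
    """Write (N, 3) points, with optional (N, 3) uint8 colors, as an ASCII PLY file."""
    points = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    header = ["ply", "format ascii 1.0", f"element vertex {len(points)}",
              "property float x", "property float y", "property float z"]
    if colors is not None:
        colors = np.asarray(colors, dtype=np.uint8).reshape(-1, 3)
        header += ["property uchar red", "property uchar green", "property uchar blue"]
    header.append("end_header")
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        for i, (x, y, z) in enumerate(points.tolist()):
            line = f"{x} {y} {z}"
            if colors is not None:
                r, g, b = colors[i].tolist()
                line += f" {r} {g} {b}"
            f.write(line + "\n")


# Two example points with colors, written to a temporary directory
ply_path = os.path.join(tempfile.mkdtemp(), "points.ply")
write_ply(ply_path, [[0.0, 0.0, 1.0], [0.5, -0.5, 2.0]], colors=[[255, 0, 0], [0, 255, 0]])
with open(ply_path) as f:
    ply_lines = f.read().splitlines()
print(ply_lines[:3])
```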
We provide evaluation pipelines for monocular depth, video depth, and relative pose (camera trajectory) on dynamic scenarios. Each pipeline can be run via its run_page.sh script. Edit model_weights, datasets, and paths in the script (and in eval/eval/*/metadata.py) for your environment before running.
Monocular depth. Evaluates single-image depth estimation. Uncomment the launch_page.py block in run_page.sh to run inference first (this saves depth maps as .npy files); otherwise the script runs eval_metrics.py on existing predictions (Abs Rel, Sq Rel, RMSE, δ thresholds).

# Edit model_weights, datasets in run_page.sh first
bash eval/eval/monodepth/run_page.sh

Datasets: sintel, bonn, dyncheck (edit the datasets array in the script). See metadata.py for more options.
Video depth. Evaluates depth on video sequences with sliding-window inference. Uses multiple GPUs via accelerate.

# Edit model_weights, datasets in run_page.sh first
bash eval/eval/video_depth/run_page.sh

Datasets: sintel, bonn, dyncheck. Metrics: Abs Rel, Sq Rel, RMSE, Log RMSE, δ < 1.25, etc. To compute metrics after inference, uncomment and run the eval_depth.py block in the script.
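For reference, the depth metrics named above have standard definitions, which the sketch below computes with NumPy. It is not the repository's eval_metrics.py or eval_depth.py: the function name and the min_depth validity threshold are assumptions, and any scale or shift alignment the evaluation scripts may apply before computing metrics is not shown.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3):
    """Standard depth error metrics over pixels where both depths are valid."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > min_depth) & (pred > min_depth)
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "log_rmse": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta_1.25": float(np.mean(ratio < 1.25)),  # fraction of pixels within a 1.25x ratio
    }


metrics = depth_metrics(pred=[1.1, 2.0, 3.0], gt=[1.0, 2.0, 4.0])
print(metrics)
```

Lower is better for the four error metrics, while higher is better for the δ accuracy.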
Relative pose. Evaluates camera trajectory (pose) estimation using evo. Outputs ATE and RPE (translation, rotation).

# Edit model_weights, datasets in run_page.sh first
bash eval/eval/relpose/run_page.sh

Datasets: sintel, tum. Outputs: pred_traj.txt, pred_focal.txt, pred_intrinsics.txt, trajectory plots, *_eval_metric.txt with ATE/RPE.
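To give a sense of what the ATE number measures, the sketch below aligns predicted camera positions to ground truth with a least-squares similarity transform (Umeyama alignment) and reports the RMSE of the remaining position error. It is a standalone NumPy illustration, not the evo-based code in the repository; the function names are hypothetical, and whether the pipeline aligns with scale is an assumption.

```python
import numpy as np


def umeyama_alignment(src, dst):
    """Least-squares similarity transform (scale, R, t) mapping src (N, 3) onto dst (N, 3)."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # keep R a proper rotation (no reflection)
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum()
    t = mu_dst - scale * R @ mu_src
    return scale, R, t


def ate_rmse(pred_xyz, gt_xyz):
    """RMSE of position error after similarity alignment of pred onto gt."""
    scale, R, t = umeyama_alignment(pred_xyz, gt_xyz)
    aligned = scale * pred_xyz @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))


# Synthetic check: a prediction that differs from ground truth only by a similarity transform
rng = np.random.default_rng(0)
gt_xyz = rng.normal(size=(20, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
pred_xyz = (gt_xyz - np.array([1.0, -2.0, 0.5])) @ R_true / 2.0
print(ate_rmse(pred_xyz, gt_xyz))  # close to zero: the transform is fully recovered
```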
You can also choose which attributes (branches) to predict, as shown below. This gives the same results as the example above. The example uses a batch size of 1 (a single scene), but it works the same way for multiple scenes.

from page.utils.pose_enc import pose_encoding_to_extri_intri
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        aggregated_tokens_list, ps_idx = model.aggregator(images)
    # Predict Cameras
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])
    # Predict Depth Maps
    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)
    # Predict Point Maps
    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)

Spatial mask during training. Fine-tuning uses a learnable spatial mask in the aggregator (SpatialMaskHead_IMP in model/page/layers/block.py). Its strength is scheduled with mask_alpha(step, mask_hold_start, mask_hold_end): the mask is fully on for early optimizer steps, then its influence is reduced smoothly (cosine decay) until it is off. Set mask_hold_start / mask_hold_end in your training config (e.g. training_final.yaml under model).
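One way to read this schedule is sketched below. It is an illustrative re-implementation under assumptions, not the repository's mask_alpha: it assumes the mask is held at full strength until mask_hold_start, decays along a half cosine until mask_hold_end, and is off afterwards. The name mask_alpha_sketch is hypothetical.

```python
import math


def mask_alpha_sketch(step: int, mask_hold_start: int, mask_hold_end: int) -> float:
    """Assumed schedule: 1.0 before hold_start, cosine decay to 0.0 at hold_end, 0.0 after."""
    if step >= mask_hold_end:
        return 0.0  # mask fully off (also covers hold_start == hold_end == 0)
    if step < mask_hold_start:
        return 1.0  # mask fully on during the early steps
    progress = (step - mask_hold_start) / (mask_hold_end - mask_hold_start)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 3000, 5000):
    print(step, mask_alpha_sketch(step, mask_hold_start=1000, mask_hold_end=5000))
```

Under this reading, setting both values to 0 yields a strength of 0 at every step, which matches the non-mask setting, while positive values keep the mask active at step 0.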
At inference:

# Non-mask version:
mask_hold_start = 0
mask_hold_end = 0

# Mask-enabled version:
mask_hold_start > 0
mask_hold_end > 0

Two variants with similar performance:
1. Training-only masking (= VGGT Structure).
2. Inference-time masking (VGGT Structure with Mask).
Download Weights (non-mask version, which uses the mask only during the early stage of training; suggested). Pretrained weights are released as checkpoint_nomask.pt on Hugging Face (dataset page). Download the file and point Directory in the Quick Start code (or model_weights in the eval scripts) to its path:

huggingface-cli download zhouk777/PAGE4D checkpoint_nomask.pt --repo-type dataset --local-dir .

Download Weights (mask version, which always keeps the mask during training). Pretrained weights are released as checkpoint_mask.pt on Hugging Face (dataset page). Download the file and point Directory in the Quick Start code (or model_weights in the eval scripts) to its path:

huggingface-cli download zhouk777/PAGE4D checkpoint_mask.pt --repo-type dataset --local-dir .

Our interactive code follows a similar design to VGGT (CVPR 2025). Please refer to their original repository for more details.
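As a Python alternative to the huggingface-cli download commands above, the checkpoints can likely be fetched with hf_hub_download from huggingface_hub, which is among the listed dependencies. The two wrapper functions are hypothetical; the repository id, file names, and dataset repo type come from the commands above.

```python
def checkpoint_filename(use_mask: bool) -> str:
    """File name of the released checkpoint for the mask or non-mask variant."""
    return "checkpoint_mask.pt" if use_mask else "checkpoint_nomask.pt"


def download_checkpoint(use_mask: bool = False, local_dir: str = ".") -> str:
    """Download the chosen checkpoint and return its local path (requires network access)."""
    from huggingface_hub import hf_hub_download  # imported lazily so the helper is cheap to define

    return hf_hub_download(
        repo_id="zhouk777/PAGE4D",
        filename=checkpoint_filename(use_mask),
        repo_type="dataset",
        local_dir=local_dir,
    )
```

The returned path can be assigned to Directory in the inference example, for instance Directory = download_checkpoint(use_mask=False).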
Thanks to these great repositories: VGGT, CUT3R and many other inspiring works in the community.