From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
ICML 2026
EgoTSR is a curriculum-based framework for ego-centric task-oriented spatiotemporal reasoning. It aims to help vision-language models move from explicit spatial understanding to internalized task-state judgment and finally to long-horizon planning, reducing chronological bias and spatiotemporal hallucinations in embodied tasks.
- A three-stage curriculum learning framework for egocentric task reasoning: CoT supervision, weakly supervised task-state tagging, and long-horizon planning.
- EgoTSR-Data, a 46M-sample dataset organized into CoT, Tag, and LongTag stages.
- A Reasoning-Enhanced Task Decomposition mechanism that maps high-level task descriptions into causal atomic subtask sequences.
- A dual-level evaluation framework covering both short atomic spatial perception and long-horizon logical planning.
- Strong performance on long-horizon logical reasoning and fine-grained perceptual reasoning across human demonstrations, simulations, and real-robot settings.
EgoTSR/
├── 1get_clips.py # Extract frames/clips from raw observations
├── 2group.py # Group clips by task name
├── 3refine.py # Refine task-specific clip endings
├── datamaker.py # Optional CoT data generation helper
├── qwenvl/full/ # Qwen2.5-VL training and inference scripts
└── README.md
The original data processing pipeline follows three steps:
python 1get_clips.py
python 2group.py
python 3refine.py
The expected processed short-task structure is:
EgoTSR-Short/
├── clips/
│ └── <scene_id>/<episode_id>/<task_name>/<frame_id>.png
├── observation/
│ └── <scene_id>/<episode_id>/
├── task/
│ └── *.jsonl
└── qwenvl/
    └── full/
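As a minimal sketch of how the `clips/` layout above could be consumed, the helper below groups frame files by scene, episode, and task. The function name `index_clips` is hypothetical and not part of the repository, and sorting frames by filename is an assumption, since the `<frame_id>` format is not specified here.

```python
from collections import defaultdict
from pathlib import Path


def index_clips(clips_root):
    """Group frame paths by (scene_id, episode_id, task_name).

    Assumes the layout clips/<scene_id>/<episode_id>/<task_name>/<frame_id>.png.
    Hypothetical helper for illustration; not part of the EgoTSR scripts.
    """
    root = Path(clips_root)
    index = defaultdict(list)
    # Exactly three directory levels sit between the root and each frame.
    for frame in root.glob("*/*/*/*.png"):
        scene_id, episode_id, task_name = frame.relative_to(root).parts[:3]
        index[(scene_id, episode_id, task_name)].append(frame)
    # Assumption: lexical filename order matches temporal order
    # (true for zero-padded frame ids, not for unpadded integers).
    for frames in index.values():
        frames.sort()
    return dict(index)
```

Calling `index_clips("EgoTSR-Short/clips")` would return a dictionary whose keys are `(scene_id, episode_id, task_name)` tuples, which can be handy for checking that every task has frames before training.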
Processed data links:
- Short-CoT and Short-Tag: Hugging Face
- Long-Tag: Hugging Face
Please update the hard-coded paths in the preprocessing scripts according to your local data layout before running them.
The Qwen2.5-VL training entrypoints are under qwenvl/full/.
CoT-stage training:
bash qwenvl/full/pdsh_cot_only.sh
Tag-stage training:
bash qwenvl/full/pdsh_tag_only.sh
Long-horizon Tag training:
bash qwenvl/full/pdsh_tag_long.sh
General-stage training:
bash qwenvl/full/pdsh_tag_general.sh
These scripts contain cluster-specific paths, node settings, model paths, and data paths. Please revise them for your own environment before use.
The evaluation input CSV should contain fields such as video_path1, video_path2, and task_name. The output CSV stores the predicted target field, such as img1 or img2.
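Before launching inference, it may help to check that the input CSV has the expected columns. The sketch below assumes that `video_path1`, `video_path2`, and `task_name` are all required; the text above only lists them as examples, so adjust `REQUIRED_FIELDS` to match the actual scripts. The function name `load_eval_rows` is hypothetical.

```python
import csv

# Assumption: these three columns are mandatory. The README names them
# only as examples ("fields such as"), so treat this list as illustrative.
REQUIRED_FIELDS = ("video_path1", "video_path2", "task_name")


def load_eval_rows(csv_path):
    """Read the evaluation CSV and fail early if a required column is missing."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_FIELDS if name not in header]
        if missing:
            raise ValueError(f"{csv_path}: missing columns {missing}")
        return list(reader)
```

Failing on the header rather than on the first bad row keeps the error close to its cause, which matters when a multi-GPU inference job takes a while to start.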
CoT-model inference:
python qwenvl/full/inference4eval_8gpu_cot.py
Tag-model inference:
python qwenvl/full/inference4eval_8gpu_tag.py
Long-horizon model inference:
python qwenvl/full/inference4eval_8gpu_long.py
Please update checkpoint paths, base directories, input CSV paths, and output CSV paths in the scripts before running inference.
If you find this project useful for your research, please cite:
@article{yang2026egotsr,
title={From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning},
author={Yang, Xiaoda and Liu, Yuxiang and Gao, Shenzhou and Wang, Can and Xue, Jingyang and Yang, Lixin and Mu, Yao and Jin, Tao and Zhang, Zhimeng and Yan, Shuicheng and Zhao, Zhou},
journal={arXiv preprint arXiv:2604.10517},
year={2026}
}
This project builds on Qwen2.5-VL and related open-source vision-language tooling. We thank the community for its contributions to multimodal reasoning, embodied AI, and long-horizon planning.