From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

ICML 2026

EgoTSR is a curriculum-based framework for ego-centric task-oriented spatiotemporal reasoning. It aims to help vision-language models progress from explicit spatial understanding, to internalized task-state judgment, and finally to long-horizon planning, while reducing chronological bias and spatiotemporal hallucinations in embodied tasks.

Highlights

A three-stage curriculum learning framework for ego-centric task reasoning: CoT supervision, weakly supervised task-state tagging, and long-horizon planning.
EgoTSR-Data, a 46M-sample dataset organized into CoT, Tag, and LongTag stages.
A Reasoning-Enhanced Task Decomposition mechanism that maps high-level task descriptions into causal atomic subtask sequences.
A dual-level evaluation framework covering both short atomic spatial perception and long-horizon logical planning.
Strong performance on long-horizon logical reasoning and fine-grained perceptual reasoning across human demonstrations, simulations, and real-robot settings.

Repository Structure

EgoTSR/
├── 1get_clips.py              # Extract frames/clips from raw observations
├── 2group.py                  # Group clips by task name
├── 3refine.py                 # Refine task-specific clip endings
├── datamaker.py               # Optional CoT data generation helper
├── qwenvl/full/               # Qwen2.5-VL training and inference scripts
└── README.md

Data Preparation

The original data processing pipeline follows three steps:

python 1get_clips.py
python 2group.py
python 3refine.py
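
The three steps above can also be chained from one Python driver that stops at the first failing script. This is a minimal sketch, not part of the repository: the function name `run_pipeline` and the injectable `runner` argument are assumptions made for illustration, and the scripts are assumed to take no command-line arguments, as in the commands above.

```python
import subprocess
import sys

# Script names and order are taken from the commands listed in this README.
STEPS = ["1get_clips.py", "2group.py", "3refine.py"]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each preprocessing script in order, stopping at the first failure.

    `runner` is a hypothetical hook so the function can be exercised without
    the real scripts. Returns the list of scripts that finished successfully.
    """
    completed = []
    for script in steps:
        # Use the current interpreter so the active environment is respected.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root runs all three scripts. Stopping early avoids grouping or refining clips from an incomplete extraction.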

The expected processed short-task structure is:

EgoTSR-Short/
├── clips/
│   └── <scene_id>/<episode_id>/<task_name>/<frame_id>.png
├── observation/
│   └── <scene_id>/<episode_id>/
├── task/
│   └── *.jsonl
└── qwenvl/
    └── full/
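
To sanity-check a processed dataset against this layout, the frames under `clips/` can be indexed by scene, episode, and task. The sketch below is illustrative only: `index_clips` is a hypothetical helper, and it assumes every frame sits exactly at `clips/<scene_id>/<episode_id>/<task_name>/<frame_id>.png` as shown in the tree above.

```python
from collections import defaultdict
from pathlib import Path


def index_clips(root):
    """Map (scene_id, episode_id, task_name) to a sorted list of frame paths.

    Assumes the layout clips/<scene_id>/<episode_id>/<task_name>/<frame_id>.png.
    """
    clips_dir = Path(root) / "clips"
    index = defaultdict(list)
    for frame in clips_dir.glob("*/*/*/*.png"):
        # The first three path components below clips/ identify the clip.
        scene_id, episode_id, task_name = frame.relative_to(clips_dir).parts[:3]
        index[(scene_id, episode_id, task_name)].append(frame)
    # Paths are sorted lexicographically; this matches temporal order only
    # if frame ids are zero-padded, which is an assumption here.
    return {key: sorted(frames) for key, frames in index.items()}
```

For example, `len(index_clips("EgoTSR-Short"))` gives the number of task clips found, which can be compared against the entries in `task/*.jsonl`.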

Processed data links:

Short-CoT and Short-Tag: Hugging Face
Long-Tag: Hugging Face

Please update the hard-coded paths in the preprocessing scripts according to your local data layout before running them.

Training

The Qwen2.5-VL training entrypoints are under qwenvl/full/.

CoT-stage training:

bash qwenvl/full/pdsh_cot_only.sh

Tag-stage training:

bash qwenvl/full/pdsh_tag_only.sh

Long-horizon Tag training:

bash qwenvl/full/pdsh_tag_long.sh

General-stage training:

bash qwenvl/full/pdsh_tag_general.sh

These scripts contain cluster-specific paths, node settings, model paths, and data paths. Please revise them for your own environment before use.

Inference

The evaluation input CSV should contain fields such as video_path1, video_path2, and task_name. The output CSV stores the predicted target field, such as img1 or img2.
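
A small helper can build and validate an input CSV before launching inference. This is a hedged sketch: the three column names come from the description above, but the README says "fields such as", so the inference scripts may expect additional columns. The helper names `write_eval_csv` and `check_eval_csv` are hypothetical.

```python
import csv

# Column names mentioned in this README; the scripts may require more.
INPUT_FIELDS = ["video_path1", "video_path2", "task_name"]


def write_eval_csv(path, rows):
    """Write rows (dicts keyed by INPUT_FIELDS) to a CSV file with a header."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=INPUT_FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def check_eval_csv(path):
    """Return all rows if every expected column is present, else raise ValueError."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in INPUT_FIELDS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return list(reader)
```

Running `check_eval_csv` on the file first catches a misnamed column before a multi-GPU job is started.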

CoT-model inference:

python qwenvl/full/inference4eval_8gpu_cot.py

Tag-model inference:

python qwenvl/full/inference4eval_8gpu_tag.py

Long-horizon model inference:

python qwenvl/full/inference4eval_8gpu_long.py

Please update checkpoint paths, base directories, input CSV paths, and output CSV paths in the scripts before running inference.

Citation

If you find this project useful for your research, please cite:

@article{yang2026egotsr,
  title={From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning},
  author={Yang, Xiaoda and Liu, Yuxiang and Gao, Shenzhou and Wang, Can and Xue, Jingyang and Yang, Lixin and Mu, Yao and Jin, Tao and Zhang, Zhimeng and Yan, Shuicheng and Zhao, Zhou},
  journal={arXiv preprint arXiv:2604.10517},
  year={2026}
}

Acknowledgement

This project builds on Qwen2.5-VL and related open-source vision-language tooling. We thank the community for its contributions to multimodal reasoning, embodied AI, and long-horizon planning.
