[arXiv] [Project Page]
STaR System Overview.
Task-Conditioned Retrieval via Information Bottleneck and Context-Aware Cross-Modal Reasoning.
STaR is an agentic framework for scalable task-conditioned retrieval and contextual reasoning over long-horizon multimodal robot memory. It enables robots to answer open-ended spatial, temporal, and descriptive queries and to produce precise, actionable outputs for navigation.
Key contributions:
- Long-horizon multimodal memory (OmniMem) that integrates 3D primitives, temporally aligned video captions, and visual keyframes to support joint spatial, temporal, and semantic reasoning in open-world environments.
- Scalable Task-Conditioned Retrieval (STaR) based on the Information Bottleneck principle, which distills compact, non-redundant, and task-relevant memory subsets from long-term experience without requiring predefined task lists.
- Agentic RAG workflow that couples MLLM-based planning with structured memory retrieval and contextual reasoning, enabling accurate question answering and reliable downstream execution for navigation.
- Extensive evaluation and real-robot deployment, demonstrating state-of-the-art performance on NaVQA and a challenging warehouse benchmark (WH-VQA in Isaac Sim), as well as robust long-horizon reasoning on a real Husky mobile robot.
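For intuition on the Information Bottleneck retrieval in the second bullet: selecting a memory subset that is task-relevant but non-redundant can be approximated by a greedy relevance-minus-redundancy selection over embeddings (an MMR-style surrogate). This is a minimal illustrative sketch, not STaR's actual implementation; `beta`, `ib_style_retrieve`, and the cosine scoring are assumptions for the example.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ib_style_retrieve(query_emb, memory_embs, k=3, beta=0.5):
    """Greedily pick k memory items: reward similarity to the task query,
    penalize redundancy with items already selected. The beta trade-off
    loosely mirrors the IB objective (keep task information, compress
    away the rest); here it is only an illustrative surrogate."""
    selected = []
    candidates = list(range(len(memory_embs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_emb, memory_embs[i])
            redundancy = max(
                (cosine(memory_embs[i], memory_embs[j]) for j in selected),
                default=0.0,
            )
            return relevance - beta * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `beta=0` this degenerates to plain top-k relevance; raising `beta` trades relevance for diversity, so near-duplicate memory entries are skipped.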
Code coming soon.

