I am evaluating Cosmos3 inverse dynamics (video → action) and I would like to build LeRobot-format episodes for downstream training/finetuning (e.g., observation.images.*, observation.state, action, timestamps, etc.).
Cosmos3 inference currently returns an action tensor (e.g. [T, raw_action_dim]) but it is unclear how I should obtain the corresponding state observations (observation.state) required by LeRobot/robot-learning datasets.
Context / Goal
- I run Cosmos3 inverse dynamics on an input video and obtain action.data with shape [T, raw_action_dim].
- I want to append new episodes to an existing dataset in LeRobot/GR00T format (parquet + meta + mp4), which expects at minimum:
  - observation.images.<cam> (video frames)
  - observation.state (per-timestep state vector)
  - action (per-timestep action vector)
  - timestamps / frame_index / episode_index, etc.
My blocker: given only (video, predicted action), how should I obtain observation.state to create a coherent episode?
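To make the gap concrete, here is a minimal sketch of the per-frame rows I can fill today. It is plain Python, not the LeRobot API: the helper name build_episode_rows, the fps value, and the exact column names (taken from the field list above, with a singular "timestamp") are my own assumptions for illustration. Everything except observation.state can be derived from the video and the predicted actions.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_episode_rows(
    actions: Sequence[Sequence[float]],
    fps: float,
    episode_index: int,
    states: Optional[Sequence[Sequence[float]]] = None,
) -> List[Dict[str, Any]]:
    """Build one row per frame (hypothetical helper, not part of LeRobot).

    `actions` stands in for the predicted [T, raw_action_dim] tensor.
    `states` is the piece I do not know how to obtain; it stays None here.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if states is not None and len(states) != len(actions):
        raise ValueError("states and actions must have the same length T")

    rows: List[Dict[str, Any]] = []
    for t, action in enumerate(actions):
        rows.append(
            {
                "episode_index": episode_index,
                "frame_index": t,
                "timestamp": t / fps,  # assumes a constant frame rate
                "action": list(action),
                # Unknown when only (video, predicted action) is available.
                "observation.state": list(states[t]) if states is not None else None,
            }
        )
    return rows


# Toy example with T=3 and raw_action_dim=2 (made-up values).
rows = build_episode_rows(
    [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]], fps=30.0, episode_index=0
)
```

With only the video and the predicted actions, every row ends up with observation.state set to None, which is exactly the column I need guidance on.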