Summary
I would like to suggest adding a stable, user-facing Python API for policy evaluation in LW-BenchHub.
From a user perspective, LW-BenchHub already provides a valuable set of robotics tasks, robots, scenes, teleoperation tools, replay utilities, and RL/IL evaluation scripts. However, if I want to evaluate my own trained policy or VLA model, the current experience still requires me to understand and modify several internal scripts, YAML configs, observation mappings, action mappings, video writers, and evaluation loops.
For users who simply want to bring their own policy and run benchmark evaluation, it would be very helpful if LW-BenchHub exposed a higher-level public API similar in spirit to Gymnasium’s environment interface or PyTorch Lightning’s trainer abstraction.
The goal is not to hide all robotics complexity, but to provide a clear and stable contract between:
- user-provided policy;
- benchmark suite / task selection;
- robot and action interface;
- evaluator lifecycle;
- metrics and artifacts.
User story
As a user who has trained a robot policy or VLA model, I want to import LW-BenchHub as a Python library, plug in my policy, select a benchmark/task/robot, and run evaluation without modifying repository internals.
Ideally, the basic usage could look like:
from lw_benchhub.eval import evaluate
from lw_benchhub.policy import RobotPolicy, Action

class MyPolicy(RobotPolicy):
    def reset(self, episode):
        self.model.reset()

    def act(self, obs, ctx):
        action = self.model.predict(
            image=obs.camera["front"],
            joints=obs.proprio["joint_pos"],
            instruction=obs.language,
        )
        return Action.joint_delta(action)

report = evaluate(
    policy=MyPolicy(),
    suite="lightwheel-robocasa",
    robot="pandaomron",
    num_episodes=100,
    save_video=True,
    output_dir="./eval_runs/my_policy",
)

print(report.success_rate)
print(report.to_markdown())
Current pain points
From the perspective of a policy-evaluation user, the current workflow is powerful but difficult to use as a library.
Some common friction points are:
- Evaluation is script-first rather than API-first
Users need to understand scripts such as training, play/eval, replay, teleoperation, and task config loading. This makes it harder to integrate LW-BenchHub into an external research workflow, CI pipeline, or internal evaluation platform.
- Policy integration requires understanding internal environment details
A user-provided policy often needs to know raw observation keys, action dimensions, camera names, proprioception fields, and robot-specific joint conventions. These are difficult to discover programmatically.
- Observation and action schemas are not exposed as a stable public contract
For robotics evaluation, action representation is critical. A policy may output joint deltas, joint positions, end-effector deltas, gripper commands, or normalized actions. The expected action dimension, joint order, control frequency, and normalization should be inspectable before running evaluation.
- Evaluation failures are hard to classify
During evaluation, it is important to distinguish between:
- policy failure;
- timeout;
- invalid action schema;
- environment setup failure;
- asset / placement failure;
- dependency or runtime failure;
- simulator crash.
Without structured failure classification, a benchmark result can be difficult to interpret.
- Results should be machine-readable
For downstream usage, it would be useful if evaluation returns a structured report object rather than relying mainly on stdout logs. This would make it easier to build dashboards, compare policies, archive results, and debug failures.
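One way to make the failure categories above machine-readable is to attach an enum label to every episode outcome. The sketch below is only illustrative: `FailureKind`, `EpisodeOutcome`, `failure_breakdown`, and `policy_success_rate` are hypothetical names, not existing LW-BenchHub APIs, and the split between policy and infrastructure failures is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class FailureKind(Enum):
    """Hypothetical failure categories, mirroring the list above."""
    POLICY_FAILURE = "policy_failure"
    TIMEOUT = "timeout"
    INVALID_ACTION_SCHEMA = "invalid_action_schema"
    ENVIRONMENT_SETUP = "environment_setup"
    ASSET_PLACEMENT = "asset_placement"
    RUNTIME = "runtime"
    SIMULATOR_CRASH = "simulator_crash"


# Assumption: these categories say nothing about policy quality.
INFRASTRUCTURE_FAILURES = frozenset({
    FailureKind.ENVIRONMENT_SETUP,
    FailureKind.ASSET_PLACEMENT,
    FailureKind.RUNTIME,
    FailureKind.SIMULATOR_CRASH,
})


@dataclass(frozen=True)
class EpisodeOutcome:
    task: str
    success: bool
    failure: Optional[FailureKind] = None  # None when the episode succeeded


def failure_breakdown(outcomes: Iterable[EpisodeOutcome]) -> dict:
    """Count failed episodes per category, keyed by the category's string value."""
    return dict(Counter(o.failure.value for o in outcomes if o.failure is not None))


def policy_success_rate(outcomes: Iterable[EpisodeOutcome]) -> float:
    """Success rate over episodes that were not lost to infrastructure failures."""
    valid = [o for o in outcomes if o.failure not in INFRASTRUCTURE_FAILURES]
    if not valid:
        return 0.0
    return sum(o.success for o in valid) / len(valid)
```

With labels like these, a simulator crash can be excluded from the policy's score instead of silently counting as a failed episode.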
Proposed API direction
I suggest introducing a small public API layer focused on policy evaluation.
Possible modules:
lw_benchhub.eval
- evaluate()
- Evaluator
- EvalConfig
- EvalReport
lw_benchhub.policy
- RobotPolicy
- Observation
- Action
lw_benchhub.spec
- BenchmarkSuite
- RobotSpec
- ObservationSpec
- ActionSpec
A minimal version could start with only:
evaluate()
Evaluator
EvalConfig
EvalReport
RobotPolicy
Observation
Action
The existing scripts could then become thin wrappers around this API.
Suggested interface
RobotPolicy
class RobotPolicy:
    def reset(self, episode):
        pass

    def act(self, obs, ctx):
        raise NotImplementedError
The policy should not need to manually call env.step(), handle video writing, compute success metrics, or manage episode termination. The evaluator should own the evaluation loop.
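To illustrate the division of responsibility, here is a minimal sketch of an evaluator-owned episode loop. It assumes a Gymnasium-style `reset()`/`step()` convention and a `success` flag in `info`; LW-BenchHub's actual environment interface may differ. `run_episode`, `CountingEnv`, and `AddOne` are hypothetical names used only for this example.

```python
class RobotPolicy:
    def reset(self, episode):
        pass

    def act(self, obs, ctx):
        raise NotImplementedError


def run_episode(env, policy, episode, max_steps):
    """Run one episode; the policy only ever sees reset() and act()."""
    policy.reset(episode)
    obs, info = env.reset(seed=episode)
    for step in range(max_steps):
        ctx = {"episode": episode, "step": step}
        action = policy.act(obs, ctx)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            return {
                "success": bool(info.get("success", False)),
                "steps": step + 1,
                "timeout": bool(truncated),
            }
    # Step budget exhausted without termination: report a timeout.
    return {"success": False, "steps": max_steps, "timeout": True}


class CountingEnv:
    """Toy stand-in environment: succeeds once the actions sum to at least 3."""

    def reset(self, seed=None):
        self.total = 0
        return {"total": self.total}, {}

    def step(self, action):
        self.total += action
        done = self.total >= 3
        return {"total": self.total}, 0.0, done, False, {"success": done}


class AddOne(RobotPolicy):
    def act(self, obs, ctx):
        return 1


result = run_episode(CountingEnv(), AddOne(), episode=0, max_steps=10)
print(result)
```

The policy never touches `env.step()`; stepping, termination, and the timeout decision all live in `run_episode`.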
Observation
A normalized observation object could expose common fields:
obs.camera["front"]
obs.camera["wrist"]
obs.proprio["joint_pos"]
obs.proprio["ee_pose"]
obs.gripper
obs.language
obs.raw
The raw observation can still be preserved for advanced users, but the common path should be stable and documented.
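A normalized observation could be built by a small adapter that maps raw environment keys onto the stable fields above. This is a sketch under assumptions: the `Observation` dataclass layout, `normalize_observation`, and every raw key name in `EXAMPLE_KEY_MAP` are made up for illustration and do not reflect actual LW-BenchHub observation keys.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Observation:
    camera: Dict[str, Any] = field(default_factory=dict)
    proprio: Dict[str, Any] = field(default_factory=dict)
    gripper: Optional[Any] = None
    language: Optional[str] = None
    raw: Dict[str, Any] = field(default_factory=dict)


# Hypothetical mapping from normalized names to raw environment keys.
# Real key names depend on the robot and task configuration.
EXAMPLE_KEY_MAP = {
    "camera": {"front": "front_cam_rgb", "wrist": "wrist_cam_rgb"},
    "proprio": {"joint_pos": "robot_joint_pos", "ee_pose": "robot_ee_pose"},
    "gripper": "gripper_qpos",
    "language": "task_description",
}


def normalize_observation(raw, key_map):
    """Build an Observation from a raw dict, skipping keys the env does not provide."""

    def pick(mapping):
        return {name: raw[key] for name, key in mapping.items() if key in raw}

    return Observation(
        camera=pick(key_map.get("camera", {})),
        proprio=pick(key_map.get("proprio", {})),
        gripper=raw.get(key_map.get("gripper")),
        language=raw.get(key_map.get("language")),
        raw=raw,  # untouched raw dict stays available for advanced users
    )
```

Keeping the key map as data rather than code would also let a robot or task config declare it, so the mapping can be inspected before an evaluation starts.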
Action
A typed action object could make action semantics explicit:
Action.joint_delta(x)
Action.joint_position(x)
Action.ee_delta_pose(x)
Action.normalized(x)
This would make errors easier to detect and explain.
For example:
Action dimension mismatch.
Policy returned:
  mode: joint_delta
  dim: 14
Environment expects:
  robot: pandaomron
  mode: joint_delta
  dim: 8
Suggested fix:
  inspect the robot action spec or provide an ActionAdapter.
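A typed action plus a spec check is enough to produce a message like the one above. The following is a minimal sketch, not a proposed final design: `ActionSpec`, `ActionSchemaError`, and `validate_action` are hypothetical names, and the dimension of 8 for `pandaomron` is taken from the example message rather than from a verified robot spec.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ActionSpec:
    robot: str
    mode: str
    dim: int


@dataclass(frozen=True)
class Action:
    mode: str
    values: Tuple[float, ...]

    @classmethod
    def _make(cls, mode: str, x: Sequence[float]) -> "Action":
        return cls(mode, tuple(float(v) for v in x))

    @classmethod
    def joint_delta(cls, x):
        return cls._make("joint_delta", x)

    @classmethod
    def joint_position(cls, x):
        return cls._make("joint_position", x)

    @classmethod
    def ee_delta_pose(cls, x):
        return cls._make("ee_delta_pose", x)

    @classmethod
    def normalized(cls, x):
        return cls._make("normalized", x)


class ActionSchemaError(ValueError):
    """Raised when a policy's action does not match the environment's spec."""


def validate_action(action: Action, spec: ActionSpec) -> None:
    """Raise ActionSchemaError if the action's mode or dimension is wrong."""
    if action.mode == spec.mode and len(action.values) == spec.dim:
        return
    problem = "mode" if action.mode != spec.mode else "dimension"
    raise ActionSchemaError(
        f"Action {problem} mismatch.\n"
        f"Policy returned:\n"
        f"  mode: {action.mode}\n"
        f"  dim: {len(action.values)}\n"
        f"Environment expects:\n"
        f"  robot: {spec.robot}\n"
        f"  mode: {spec.mode}\n"
        f"  dim: {spec.dim}\n"
        f"Suggested fix:\n"
        f"  inspect the robot action spec or provide an ActionAdapter."
    )
```

Running this check on the first action of an episode would let the evaluator report an invalid action schema separately from a genuine policy failure.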
EvalConfig
config = EvalConfig(
    suite="lightwheel-robocasa",
    robot="pandaomron",
    tasks=["OpenMicrowave", "PutObjectInCabinet"],
    num_episodes=100,
    num_envs=8,
    max_steps=500,
    seed=42,
    device="cuda:0",
    headless=True,
    save_video=True,
    output_dir="./eval_runs/my_policy",
)
EvalReport
report.success_rate
report.task_metrics
report.failure_breakdown
report.failed_episodes
report.video_paths
report.trajectory_paths
report.to_json()
report.to_markdown()
The report should ideally include reproducibility metadata:
- LW-BenchHub version / git commit
- Isaac Sim version
- task suite
- robot
- seed
- number of episodes
- observation spec
- action spec
- checkpoint metadata if provided
- output artifacts
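As a rough sketch of how such a report object might aggregate per-episode results and carry metadata, consider the dataclass below. Only a subset of the fields above is shown, and `EpisodeResult`, the `metadata` dict, and the exact JSON and Markdown layouts are assumptions made for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class EpisodeResult:
    task: str
    success: bool
    failure: Optional[str] = None  # failure category, None on success


@dataclass
class EvalReport:
    episodes: List[EpisodeResult]
    metadata: Dict[str, object] = field(default_factory=dict)

    @property
    def success_rate(self) -> float:
        if not self.episodes:
            return 0.0
        return sum(e.success for e in self.episodes) / len(self.episodes)

    @property
    def task_metrics(self) -> Dict[str, float]:
        """Per-task success rate."""
        totals, wins = Counter(), Counter()
        for e in self.episodes:
            totals[e.task] += 1
            wins[e.task] += int(e.success)
        return {task: wins[task] / totals[task] for task in totals}

    @property
    def failure_breakdown(self) -> Dict[str, int]:
        return dict(Counter(e.failure for e in self.episodes if e.failure))

    def to_json(self) -> str:
        payload = {
            "success_rate": self.success_rate,
            "task_metrics": self.task_metrics,
            "failure_breakdown": self.failure_breakdown,
            "metadata": self.metadata,
            "episodes": [asdict(e) for e in self.episodes],
        }
        return json.dumps(payload, indent=2, sort_keys=True)

    def to_markdown(self) -> str:
        lines = ["| Task | Success rate |", "| --- | --- |"]
        for task, rate in sorted(self.task_metrics.items()):
            lines.append(f"| {task} | {rate:.1%} |")
        lines.append(f"| **Overall** | {self.success_rate:.1%} |")
        return "\n".join(lines)
```

Because every aggregate is derived from the episode list, the JSON output stays self-consistent and can be re-aggregated later by a dashboard or comparison tool.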
CLI compatibility
The CLI/script interface could continue to exist, but ideally it would call the same public API.
For example:
lw-benchhub eval \
    --suite lightwheel-robocasa \
    --robot pandaomron \
    --policy ./my_policy.py:MyPolicy \
    --episodes 100 \
    --save-video
This would make the CLI, Python API, and future platform integrations share the same evaluation path.
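The `--policy ./my_policy.py:MyPolicy` form could be resolved with the standard library's `importlib`, so that the CLI ends up holding the same class object a Python user would pass to `evaluate()`. The helper name `load_policy_class` is hypothetical, and the `path:ClassName` syntax is taken from the example command above; this is one possible sketch, not a statement of how the CLI must work.

```python
import importlib.util
from pathlib import Path


def load_policy_class(spec: str):
    """Load a class from a 'path/to/file.py:ClassName' string."""
    # rpartition splits on the last colon, so Windows drive letters survive.
    path_str, sep, class_name = spec.rpartition(":")
    if not sep or not path_str or not class_name:
        raise ValueError(f"Expected 'path.py:ClassName', got {spec!r}")

    path = Path(path_str).resolve()
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"Cannot load module from {path}")

    # Execute the file as a module without requiring it to be on sys.path.
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)

    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(f"{path} does not define {class_name!r}") from None
```

The CLI would then instantiate the returned class and hand it to the same evaluator used from Python, keeping a single evaluation path.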
Why this would help
This would make LW-BenchHub much easier to use for several common user groups:
- users evaluating their own VLA or imitation-learning policy;
- researchers comparing models across benchmark tasks;
- robotics engineers validating robot-specific action spaces;
- platform teams running automated benchmark jobs;
- users who want structured evaluation reports rather than manual log parsing.
It would also help avoid requiring every downstream user to become familiar with the internal script structure before they can run a meaningful evaluation.
Non-goals
This proposal does not suggest removing existing scripts or YAML configs.
It also does not suggest hiding all robotics-specific complexity. Robot-specific observation/action mapping, scene configuration, and task validation will still be necessary. The suggestion is to make these concepts explicit, inspectable, and stable at the public API boundary.
Acceptance criteria for a first version
A first useful version could be considered successful if:
- a user can define a minimal RobotPolicy with act(obs, ctx) -> Action;
- a user can run evaluation from Python without modifying repository internals;
- observation and action specs can be inspected before evaluation;
- the evaluator returns a structured EvalReport;
- environment/setup failures are separated from policy failures;
- existing scripts can be migrated to call the same API internally.
Closing note
LW-BenchHub already provides a strong foundation for robotics benchmark evaluation. A stable Python evaluation API would make it much easier for external users to bring their own policies, run reproducible evaluations, and build downstream workflows on top of the project.