Provide a stable Python Eval API for bring-your-own-policy evaluation #45

@IMSUVEN

Summary

I would like to suggest adding a stable, user-facing Python API for policy evaluation in LW-BenchHub.

From a user perspective, LW-BenchHub already provides a valuable set of robotics tasks, robots, scenes, teleoperation tools, replay utilities, and RL/IL evaluation scripts. However, if I want to evaluate my own trained policy or VLA model, the current experience still requires me to understand and modify several internal scripts, YAML configs, observation mappings, action mappings, video writers, and evaluation loops.

For users who simply want to bring their own policy and run benchmark evaluation, it would be very helpful if LW-BenchHub exposed a higher-level public API similar in spirit to Gymnasium’s environment interface or PyTorch Lightning’s trainer abstraction.

The goal is not to hide all robotics complexity, but to provide a clear and stable contract between:

  • user-provided policy;
  • benchmark suite / task selection;
  • robot and action interface;
  • evaluator lifecycle;
  • metrics and artifacts.

User story

As a user who has trained a robot policy or VLA model, I want to import LW-BenchHub as a Python library, plug in my policy, select a benchmark/task/robot, and run evaluation without modifying repository internals.

Ideally, the basic usage could look like:

from lw_benchhub.eval import evaluate
from lw_benchhub.policy import RobotPolicy, Action


class MyPolicy(RobotPolicy):
    def __init__(self, model):
        # Any user-trained model that exposes reset() and predict().
        self.model = model

    def reset(self, episode):
        # Clear any per-episode model state.
        self.model.reset()

    def act(self, obs, ctx):
        action = self.model.predict(
            image=obs.camera["front"],
            joints=obs.proprio["joint_pos"],
            instruction=obs.language,
        )
        return Action.joint_delta(action)


report = evaluate(
    policy=MyPolicy(model=my_model),  # my_model: placeholder for the user's loaded model
    suite="lightwheel-robocasa",
    robot="pandaomron",
    num_episodes=100,
    save_video=True,
    output_dir="./eval_runs/my_policy",
)

print(report.success_rate)
print(report.to_markdown())

Current pain points

From the perspective of a policy-evaluation user, the current workflow is powerful but difficult to use as a library.

Some common friction points are:

  1. Evaluation is script-first rather than API-first

    Users need to understand the separate scripts for training, play/eval, replay, and teleoperation, as well as how task configs are loaded. This makes it harder to integrate LW-BenchHub into an external research workflow, CI pipeline, or internal evaluation platform.

  2. Policy integration requires understanding internal environment details

    A user-provided policy often needs to know raw observation keys, action dimensions, camera names, proprioception fields, and robot-specific joint conventions. These are difficult to discover programmatically.

  3. Observation and action schemas are not exposed as a stable public contract

    For robotics evaluation, action representation is critical. A policy may output joint deltas, joint positions, end-effector deltas, gripper commands, or normalized actions. The expected action dimension, joint order, control frequency, and normalization should be inspectable before running evaluation.

  4. Evaluation failures are hard to classify

    During evaluation, it is important to distinguish between:

    • policy failure;
    • timeout;
    • invalid action schema;
    • environment setup failure;
    • asset / placement failure;
    • dependency or runtime failure;
    • simulator crash.

    Without structured failure classification, a benchmark result can be difficult to interpret.

  5. Results should be machine-readable

    For downstream use, it would be useful if evaluation returned a structured report object rather than relying mainly on stdout logs. This would make it easier to build dashboards, compare policies, archive results, and debug failures.
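To make pain point 4 more concrete, here is a rough sketch of what structured failure classification could look like. Everything in it is hypothetical: `FailureKind`, `INFRA_FAILURES`, and the two helper functions are names I made up for illustration, and the choice of which categories count as infrastructure failures is only an assumption.

```python
from collections import Counter
from enum import Enum


class FailureKind(Enum):
    """Hypothetical categories mirroring the list in pain point 4."""
    POLICY_FAILURE = "policy_failure"
    TIMEOUT = "timeout"
    INVALID_ACTION_SCHEMA = "invalid_action_schema"
    ENV_SETUP_FAILURE = "env_setup_failure"
    ASSET_PLACEMENT_FAILURE = "asset_placement_failure"
    RUNTIME_FAILURE = "runtime_failure"
    SIMULATOR_CRASH = "simulator_crash"


# Assumption: these categories say nothing about policy quality,
# so they are excluded when scoring the policy itself.
INFRA_FAILURES = frozenset({
    FailureKind.ENV_SETUP_FAILURE,
    FailureKind.ASSET_PLACEMENT_FAILURE,
    FailureKind.RUNTIME_FAILURE,
    FailureKind.SIMULATOR_CRASH,
})


def failure_breakdown(outcomes):
    """Count failures per category; a None outcome means the episode succeeded."""
    return dict(Counter(o.value for o in outcomes if o is not None))


def policy_success_rate(outcomes):
    """Success rate over episodes that were not lost to infrastructure failures."""
    scored = [o for o in outcomes if o not in INFRA_FAILURES]
    if not scored:
        return None  # nothing to score: every episode failed for infra reasons
    return sum(o is None for o in scored) / len(scored)


outcomes = [
    None,
    None,
    FailureKind.POLICY_FAILURE,
    FailureKind.SIMULATOR_CRASH,
    FailureKind.TIMEOUT,
]
print(failure_breakdown(outcomes))
print(policy_success_rate(outcomes))
```

With a split like this, a simulator crash would lower the number of scored episodes rather than silently counting against the policy.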

Proposed API direction

I suggest introducing a small public API layer focused on policy evaluation.

Possible modules:

lw_benchhub.eval
  - evaluate()
  - Evaluator
  - EvalConfig
  - EvalReport

lw_benchhub.policy
  - RobotPolicy
  - Observation
  - Action

lw_benchhub.spec
  - BenchmarkSuite
  - RobotSpec
  - ObservationSpec
  - ActionSpec

A minimal version could start with only:

evaluate()
Evaluator
EvalConfig
EvalReport
RobotPolicy
Observation
Action

The existing scripts could then become thin wrappers around this API.

Suggested interface

RobotPolicy

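class RobotPolicy:
    """Base class for user-provided policies."""

    def reset(self, episode):
        """Optional hook for clearing per-episode state."""
        pass

    def act(self, obs, ctx):
        """Return an Action for the given observation; subclasses must override."""
        raise NotImplementedError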

The policy should not need to manually call env.step(), handle video writing, compute success metrics, or manage episode termination. The evaluator should own the evaluation loop.
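As an illustration of the evaluator owning the loop, here is a minimal sketch. It is not how LW-BenchHub works today: `CountdownEnv` is a toy stand-in for a simulator, `run_episode` is a hypothetical helper, and passing `ctx` as a plain dict is just an assumption for the example.

```python
class RobotPolicy:
    """Same shape as the proposed base class."""

    def reset(self, episode):
        pass

    def act(self, obs, ctx):
        raise NotImplementedError


class CountdownEnv:
    """Toy environment: the episode succeeds once the counter reaches zero."""

    def __init__(self, start=3):
        self.start = start
        self.value = start

    def reset(self):
        self.value = self.start
        return self.value

    def step(self, action):
        self.value -= action
        return self.value, self.value == 0


def run_episode(env, policy, episode, max_steps):
    """Evaluator-owned loop: the policy only ever sees reset() and act()."""
    policy.reset(episode)
    obs = env.reset()
    for step in range(max_steps):
        ctx = {"episode": episode, "step": step}  # hypothetical context payload
        action = policy.act(obs, ctx)
        obs, success = env.step(action)
        if success:
            return {"episode": episode, "success": True, "steps": step + 1}
    # Step budget exhausted without success: report it as a timeout.
    return {"episode": episode, "success": False, "steps": max_steps}


class DecrementPolicy(RobotPolicy):
    """Trivial policy that always subtracts one from the counter."""

    def act(self, obs, ctx):
        return 1


results = [
    run_episode(CountdownEnv(3), DecrementPolicy(), episode, max_steps=5)
    for episode in range(2)
]
print(results)
```

The policy class never touches `env.step()`, termination, or bookkeeping, which is the separation of concerns this proposal is asking for.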

Observation

A normalized observation object could expose common fields:

obs.camera["front"]
obs.camera["wrist"]
obs.proprio["joint_pos"]
obs.proprio["ee_pose"]
obs.gripper
obs.language
obs.raw

The raw observation can still be preserved for advanced users, but the common path should be stable and documented.
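One possible shape for this, sketched under assumptions: the `Observation` dataclass fields follow the list above, while `normalize_observation`, the `key_map` layout, and the raw key names (`front_cam_rgb`, `robot0_joint_pos`, `robot0_gripper_qpos`) are invented for the example and do not reflect actual LW-BenchHub observation keys.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Observation:
    """Normalized view of a raw environment observation."""
    camera: Dict[str, Any] = field(default_factory=dict)
    proprio: Dict[str, Any] = field(default_factory=dict)
    gripper: Any = None
    language: Optional[str] = None
    raw: Dict[str, Any] = field(default_factory=dict)


def normalize_observation(raw, key_map, instruction=None):
    """Build an Observation from a raw dict using a per-robot key mapping."""

    def lookup(raw_key):
        if raw_key not in raw:
            raise KeyError(
                f"raw observation has no key {raw_key!r}; available: {sorted(raw)}"
            )
        return raw[raw_key]

    return Observation(
        camera={name: lookup(k) for name, k in key_map.get("camera", {}).items()},
        proprio={name: lookup(k) for name, k in key_map.get("proprio", {}).items()},
        gripper=lookup(key_map["gripper"]) if "gripper" in key_map else None,
        language=instruction,
        raw=raw,  # kept untouched for advanced users
    )


# Hypothetical raw keys, for illustration only.
raw = {
    "front_cam_rgb": "IMAGE_PLACEHOLDER",
    "robot0_joint_pos": [0.1, 0.2, 0.3],
    "robot0_gripper_qpos": [0.04],
}
key_map = {
    "camera": {"front": "front_cam_rgb"},
    "proprio": {"joint_pos": "robot0_joint_pos"},
    "gripper": "robot0_gripper_qpos",
}
obs = normalize_observation(raw, key_map, instruction="open the microwave")
print(obs.proprio["joint_pos"], obs.language)
```

Keeping the mapping as data rather than code would also make it easy to print or export per robot, so users can see which raw keys feed each normalized field.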

Action

A typed action object could make action semantics explicit:

Action.joint_delta(x)
Action.joint_position(x)
Action.ee_delta_pose(x)
Action.normalized(x)

This would make errors easier to detect and explain.

For example:

Action dimension mismatch.

Policy returned:
  mode: joint_delta
  dim: 14

Environment expects:
  robot: pandaomron
  mode: joint_delta
  dim: 8

Suggested fix:
  inspect the robot action spec or provide an ActionAdapter.
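A sketch of how a typed `Action` plus an inspectable spec could produce that message. The `ActionSpec` fields, `ActionSchemaError`, and `validate_action` are hypothetical names, and the dimension of 8 for pandaomron is simply copied from the example message above rather than taken from the actual robot definition.

```python
from dataclasses import dataclass
from typing import Tuple


class ActionSchemaError(ValueError):
    """Raised when a policy's action does not match the environment's spec."""


@dataclass(frozen=True)
class Action:
    mode: str
    values: Tuple[float, ...]

    @classmethod
    def _make(cls, mode, x):
        return cls(mode, tuple(float(v) for v in x))

    @classmethod
    def joint_delta(cls, x):
        return cls._make("joint_delta", x)

    @classmethod
    def joint_position(cls, x):
        return cls._make("joint_position", x)

    @classmethod
    def ee_delta_pose(cls, x):
        return cls._make("ee_delta_pose", x)

    @classmethod
    def normalized(cls, x):
        return cls._make("normalized", x)


@dataclass(frozen=True)
class ActionSpec:
    robot: str
    mode: str
    dim: int


def validate_action(action, spec):
    """Raise ActionSchemaError with a readable report if action and spec disagree."""
    if action.mode == spec.mode and len(action.values) == spec.dim:
        return
    title = (
        "Action mode mismatch."
        if action.mode != spec.mode
        else "Action dimension mismatch."
    )
    raise ActionSchemaError(
        f"{title}\n\n"
        f"Policy returned:\n  mode: {action.mode}\n  dim: {len(action.values)}\n\n"
        f"Environment expects:\n  robot: {spec.robot}\n"
        f"  mode: {spec.mode}\n  dim: {spec.dim}\n\n"
        "Suggested fix:\n"
        "  inspect the robot action spec or provide an ActionAdapter."
    )


spec = ActionSpec(robot="pandaomron", mode="joint_delta", dim=8)
validate_action(Action.joint_delta([0.0] * 8), spec)  # matches, no error
```

Running this check on the first action of the first episode would let the evaluator fail fast, before any simulation time is spent.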

EvalConfig

config = EvalConfig(
    suite="lightwheel-robocasa",
    robot="pandaomron",
    tasks=["OpenMicrowave", "PutObjectInCabinet"],
    num_episodes=100,
    num_envs=8,
    max_steps=500,
    seed=42,
    device="cuda:0",
    headless=True,
    save_video=True,
    output_dir="./eval_runs/my_policy",
)
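For illustration, `EvalConfig` could be a small frozen dataclass that validates itself on construction. The field names come from the example above; the default values and the validation rules are assumptions on my part.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EvalConfig:
    suite: str
    robot: str
    tasks: Optional[Tuple[str, ...]] = None  # None: run every task in the suite
    num_episodes: int = 100
    num_envs: int = 1
    max_steps: int = 500
    seed: int = 0
    device: str = "cuda:0"
    headless: bool = True
    save_video: bool = False
    output_dir: str = "./eval_runs"

    def __post_init__(self):
        # Reject obviously invalid values before any simulator is started.
        for name in ("num_episodes", "num_envs", "max_steps"):
            value = getattr(self, name)
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")
        if self.tasks is not None and len(self.tasks) == 0:
            raise ValueError("tasks must be None or a non-empty sequence")

    def to_dict(self):
        """Plain-dict form, suitable for storing alongside the results."""
        return asdict(self)


config = EvalConfig(
    suite="lightwheel-robocasa",
    robot="pandaomron",
    tasks=("OpenMicrowave", "PutObjectInCabinet"),
    num_envs=8,
    seed=42,
)
print(config.to_dict())
```

Making the config immutable and serializable means the exact settings of a run can be written into the report without extra work.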

EvalReport

report.success_rate
report.task_metrics
report.failure_breakdown
report.failed_episodes
report.video_paths
report.trajectory_paths
report.to_json()
report.to_markdown()

The report should ideally include reproducibility metadata:

- LW-BenchHub version / git commit
- Isaac Sim version
- task suite
- robot
- seed
- number of episodes
- observation spec
- action spec
- checkpoint metadata if provided
- output artifacts
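A minimal sketch of a report object covering a subset of these fields. `EpisodeResult` is a hypothetical helper type, the metadata values are placeholders, and the JSON/Markdown layouts are only one possible choice.

```python
import json
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EpisodeResult:
    task: str
    success: bool
    failure: Optional[str] = None  # failure category, if the episode failed


@dataclass
class EvalReport:
    episodes: List[EpisodeResult]
    metadata: Dict[str, object] = field(default_factory=dict)

    @property
    def success_rate(self):
        if not self.episodes:
            return 0.0
        return sum(e.success for e in self.episodes) / len(self.episodes)

    @property
    def task_metrics(self):
        totals, wins = Counter(), Counter()
        for e in self.episodes:
            totals[e.task] += 1
            wins[e.task] += int(e.success)
        return {
            t: {"episodes": totals[t], "success_rate": wins[t] / totals[t]}
            for t in totals
        }

    @property
    def failure_breakdown(self):
        failed = (e for e in self.episodes if not e.success)
        return dict(Counter(e.failure or "unclassified" for e in failed))

    def to_json(self):
        payload = {
            "success_rate": self.success_rate,
            "task_metrics": self.task_metrics,
            "failure_breakdown": self.failure_breakdown,
            "metadata": self.metadata,
        }
        return json.dumps(payload, indent=2, sort_keys=True)

    def to_markdown(self):
        rows = ["| Task | Episodes | Success rate |", "| --- | --- | --- |"]
        for task, m in sorted(self.task_metrics.items()):
            rows.append(f"| {task} | {m['episodes']} | {m['success_rate']:.1%} |")
        return "\n".join(rows)


report = EvalReport(
    episodes=[
        EpisodeResult("OpenMicrowave", True),
        EpisodeResult("OpenMicrowave", False, "timeout"),
        EpisodeResult("PutObjectInCabinet", True),
    ],
    metadata={"suite": "lightwheel-robocasa", "robot": "pandaomron", "seed": 42},
)
print(report.to_markdown())
```

Since every derived metric is computed from the per-episode records, the same records could later back `failed_episodes`, `video_paths`, and `trajectory_paths` without changing the report's public surface.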

CLI compatibility

The CLI/script interface could continue to exist, but ideally it would call the same public API.

For example:

lw-benchhub eval \
  --suite lightwheel-robocasa \
  --robot pandaomron \
  --policy ./my_policy.py:MyPolicy \
  --episodes 100 \
  --save-video

This would make the CLI, Python API, and future platform integrations share the same evaluation path.
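Resolving a `--policy ./my_policy.py:MyPolicy` argument could be done with the standard library alone. The sketch below assumes a `path:ClassName` format as in the example command; `load_policy_class` is a hypothetical helper, not an existing LW-BenchHub function.

```python
import importlib.util
from pathlib import Path


def load_policy_class(spec):
    """Load a class from a 'path/to/file.py:ClassName' string."""
    # rpartition splits on the last colon, so drive letters in paths survive.
    path_str, sep, class_name = spec.rpartition(":")
    if not sep or not path_str or not class_name:
        raise ValueError(f"expected 'path.py:ClassName', got {spec!r}")

    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(f"policy file not found: {path}")

    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ValueError(f"cannot import {path} as a Python module")

    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the user's file

    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ValueError(f"{path} does not define {class_name!r}") from None
```

The CLI entry point would then only need to parse its flags, call a loader like this, and hand the instantiated policy to the same `evaluate()` function that Python users call directly.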

Why this would help

This would make LW-BenchHub much easier to use for several common user groups:

  • users evaluating their own VLA or imitation-learning policy;
  • researchers comparing models across benchmark tasks;
  • robotics engineers validating robot-specific action spaces;
  • platform teams running automated benchmark jobs;
  • users who want structured evaluation reports rather than manual log parsing.

It would also spare downstream users from having to learn the internal script structure before they can run a meaningful evaluation.

Non-goals

This proposal does not suggest removing existing scripts or YAML configs.

It also does not suggest hiding all robotics-specific complexity. Robot-specific observation/action mapping, scene configuration, and task validation will still be necessary. The suggestion is to make these concepts explicit, inspectable, and stable at the public API boundary.

Acceptance criteria for a first version

A first useful version could be considered successful if:

  • a user can define a minimal RobotPolicy with act(obs, ctx) -> Action;
  • a user can run evaluation from Python without modifying repository internals;
  • observation and action specs can be inspected before evaluation;
  • the evaluator returns a structured EvalReport;
  • environment/setup failures are separated from policy failures;
  • existing scripts can be migrated to call the same API internally.

Closing note

LW-BenchHub already provides a strong foundation for robotics benchmark evaluation. A stable Python evaluation API would make it much easier for external users to bring their own policies, run reproducible evaluations, and build downstream workflows on top of the project.
