Provide a stable Python Eval API for bring-your-own-policy evaluation

### Summary

I would like to suggest adding a stable, user-facing Python API for policy evaluation in LW-BenchHub.

From a user perspective, LW-BenchHub already provides a valuable set of robotics tasks, robots, scenes, teleoperation tools, replay utilities, and RL/IL evaluation scripts. However, if I want to evaluate my own trained policy or VLA model, the current experience still requires me to understand and modify several internal scripts, YAML configs, observation mappings, action mappings, video writers, and evaluation loops.

For users who simply want to bring their own policy and run benchmark evaluation, it would be very helpful if LW-BenchHub exposed a higher-level public API similar in spirit to Gymnasium’s environment interface or PyTorch Lightning’s trainer abstraction.

The goal is not to hide all robotics complexity, but to provide a clear and stable contract between:

- user-provided policy;
- benchmark suite / task selection;
- robot and action interface;
- evaluator lifecycle;
- metrics and artifacts.

### User story

As a user who has trained a robot policy or VLA model, I want to import LW-BenchHub as a Python library, plug in my policy, select a benchmark/task/robot, and run evaluation without modifying repository internals.

Ideally, the basic usage could look like:

```python
from lw_benchhub.eval import evaluate
from lw_benchhub.policy import RobotPolicy, Action


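class MyPolicy(RobotPolicy):
    def __init__(self, model):
        # `model` is the user's own trained model, assumed to expose reset() and predict().
        self.model = model

    def reset(self, episode):
        self.model.reset()

    def act(self, obs, ctx):
        action = self.model.predict(
            image=obs.camera["front"],
            joints=obs.proprio["joint_pos"],
            instruction=obs.language,
        )
        return Action.joint_delta(action)


report = evaluate(
    policy=MyPolicy(model),  # `model` is loaded by the user beforehand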
    suite="lightwheel-robocasa",
    robot="pandaomron",
    num_episodes=100,
    save_video=True,
    output_dir="./eval_runs/my_policy",
)

print(report.success_rate)
print(report.to_markdown())
```

### Current pain points

From the perspective of a policy-evaluation user, the current workflow is powerful but difficult to use as a library.

Some common friction points are:

1. **Evaluation is script-first rather than API-first**

   Users need to understand the separate scripts for training, play/eval, replay, teleoperation, and task config loading. This makes it harder to integrate LW-BenchHub into an external research workflow, CI pipeline, or internal evaluation platform.

2. **Policy integration requires understanding internal environment details**

   A user-provided policy often needs to know raw observation keys, action dimensions, camera names, proprioception fields, and robot-specific joint conventions. These are difficult to discover programmatically.

3. **Observation and action schemas are not exposed as a stable public contract**

   For robotics evaluation, action representation is critical. A policy may output joint deltas, joint positions, end-effector deltas, gripper commands, or normalized actions. The expected action dimension, joint order, control frequency, and normalization should be inspectable before running evaluation.

4. **Evaluation failures are hard to classify**

   During evaluation, it is important to distinguish between:

   * policy failure;
   * timeout;
   * invalid action schema;
   * environment setup failure;
   * asset / placement failure;
   * dependency or runtime failure;
   * simulator crash.

   Without structured failure classification, a benchmark result can be difficult to interpret.

5. **Results should be machine-readable**

   For downstream usage, it would be useful if evaluation returned a structured report object rather than relying mainly on stdout logs. This would make it easier to build dashboards, compare policies, archive results, and debug failures.
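
To make items 4 and 5 more concrete, here is a minimal sketch of what a structured failure taxonomy could look like. `FailureKind`, `POLICY_ATTRIBUTABLE`, `failure_breakdown`, and `policy_success_rate` are hypothetical names, not existing LW-BenchHub APIs, and treating timeouts as policy-attributable is an assumption made only for illustration.

```python
from collections import Counter
from enum import Enum


class FailureKind(Enum):
    """Hypothetical failure categories, mirroring the list in item 4."""
    POLICY_FAILURE = "policy_failure"
    TIMEOUT = "timeout"
    INVALID_ACTION_SCHEMA = "invalid_action_schema"
    ENV_SETUP_FAILURE = "env_setup_failure"
    ASSET_PLACEMENT_FAILURE = "asset_placement_failure"
    RUNTIME_FAILURE = "runtime_failure"
    SIMULATOR_CRASH = "simulator_crash"


# Assumption: only these categories say something about the policy itself.
POLICY_ATTRIBUTABLE = {
    FailureKind.POLICY_FAILURE,
    FailureKind.TIMEOUT,
    FailureKind.INVALID_ACTION_SCHEMA,
}


def failure_breakdown(outcomes):
    """Count failures per category; `None` entries are successful episodes."""
    return dict(Counter(o.value for o in outcomes if o is not None))


def policy_success_rate(outcomes):
    """Success rate over episodes that were not lost to infrastructure failures."""
    valid = [o for o in outcomes if o is None or o in POLICY_ATTRIBUTABLE]
    if not valid:
        return float("nan")
    return sum(o is None for o in valid) / len(valid)


outcomes = [None, None, FailureKind.TIMEOUT, FailureKind.SIMULATOR_CRASH]
print(failure_breakdown(outcomes))
print(policy_success_rate(outcomes))
```

Keeping infrastructure failures out of the denominator is one possible design choice; a report could just as well expose both the raw and the filtered rate.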

### Proposed API direction

I suggest introducing a small public API layer focused on policy evaluation.

Possible modules:

```text
lw_benchhub.eval
  - evaluate()
  - Evaluator
  - EvalConfig
  - EvalReport

lw_benchhub.policy
  - RobotPolicy
  - Observation
  - Action

lw_benchhub.spec
  - BenchmarkSuite
  - RobotSpec
  - ObservationSpec
  - ActionSpec
```

A minimal version could start with only:

```text
evaluate()
Evaluator
EvalConfig
EvalReport
RobotPolicy
Observation
Action
```

The existing scripts could then become thin wrappers around this API.

### Suggested interface

#### `RobotPolicy`

```python
class RobotPolicy:
    def reset(self, episode):
        pass

    def act(self, obs, ctx):
        raise NotImplementedError
```

The policy should not need to manually call `env.step()`, handle video writing, compute success metrics, or manage episode termination. The evaluator should own the evaluation loop.
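
As a rough illustration of that ownership split, the sketch below runs one episode with the loop living entirely in the evaluator. `ToyEnv`, `IncrementPolicy`, and `run_episode` are hypothetical stand-ins, the environment is a trivial counter rather than a simulator, and passing `ctx` as a plain dict is an assumption since the proposal does not fix its type.

```python
class ToyEnv:
    """Stand-in environment (hypothetical); succeeds once the counter reaches 3."""

    def reset(self):
        self.count = 0
        return {"count": self.count}

    def step(self, action):
        self.count += action
        done = self.count >= 3
        return {"count": self.count}, done


class IncrementPolicy:
    """Implements only reset() and act(); never touches the environment."""

    def reset(self, episode):
        self.episode = episode

    def act(self, obs, ctx):
        return 1


def run_episode(policy, env, episode, max_steps):
    """Evaluator-owned loop: stepping, termination, and the step budget live here."""
    obs = env.reset()
    policy.reset(episode)
    for step in range(max_steps):
        action = policy.act(obs, {"episode": episode, "step": step})
        obs, done = env.step(action)
        if done:
            return {"episode": episode, "success": True, "steps": step + 1}
    # Step budget exhausted without success.
    return {"episode": episode, "success": False, "steps": max_steps}


print(run_episode(IncrementPolicy(), ToyEnv(), episode=0, max_steps=10))
```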

#### `Observation`

A normalized observation object could expose common fields:

```python
obs.camera["front"]
obs.camera["wrist"]
obs.proprio["joint_pos"]
obs.proprio["ee_pose"]
obs.gripper
obs.language
obs.raw
```

The raw observation can still be preserved for advanced users, but the common path should be stable and documented.
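
One way such a normalized object could be built is sketched below. The `Observation` field names follow the list above; the raw key names (`front_cam_rgb`, `robot_joint_pos`, `gripper_state`, and so on) and the `normalize` helper are invented for illustration and do not reflect the actual environment keys.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Observation:
    camera: Dict[str, Any] = field(default_factory=dict)
    proprio: Dict[str, Any] = field(default_factory=dict)
    gripper: Optional[Any] = None
    language: Optional[str] = None
    raw: Dict[str, Any] = field(default_factory=dict)


# Hypothetical raw-key layout; the real environment keys may differ.
CAMERA_KEYS = {"front": "front_cam_rgb", "wrist": "wrist_cam_rgb"}
PROPRIO_KEYS = {"joint_pos": "robot_joint_pos", "ee_pose": "robot_ee_pose"}


def normalize(raw: Dict[str, Any]) -> Observation:
    """Map raw environment keys onto the stable fields, keeping `raw` intact."""
    return Observation(
        camera={name: raw[key] for name, key in CAMERA_KEYS.items() if key in raw},
        proprio={name: raw[key] for name, key in PROPRIO_KEYS.items() if key in raw},
        gripper=raw.get("gripper_state"),
        language=raw.get("instruction"),
        raw=raw,
    )


obs = normalize(
    {
        "front_cam_rgb": [[0, 0, 0]],
        "robot_joint_pos": [0.1, 0.2],
        "instruction": "open the microwave",
    }
)
print(sorted(obs.camera), obs.proprio["joint_pos"], obs.language)
```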

#### `Action`

A typed action object could make action semantics explicit:

```python
Action.joint_delta(x)
Action.joint_position(x)
Action.ee_delta_pose(x)
Action.normalized(x)
```

This would make errors easier to detect and explain.

For example:

```text
Action dimension mismatch.

Policy returned:
  mode: joint_delta
  dim: 14

Environment expects:
  robot: pandaomron
  mode: joint_delta
  dim: 8

Suggested fix:
  inspect the robot action spec or provide an ActionAdapter.
```
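
A message like the one above could be produced by checking a typed action against an inspectable spec before it reaches the simulator. The following is a minimal sketch under that assumption: `ActionSpec`, `ActionSchemaError`, and `validate` are hypothetical names, and the dimensions simply reuse the numbers from the example message rather than verified robot values.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Action:
    mode: str
    values: Tuple[float, ...]

    @classmethod
    def _make(cls, mode: str, x: Sequence[float]) -> "Action":
        return cls(mode, tuple(float(v) for v in x))

    @classmethod
    def joint_delta(cls, x):
        return cls._make("joint_delta", x)

    @classmethod
    def joint_position(cls, x):
        return cls._make("joint_position", x)

    @classmethod
    def ee_delta_pose(cls, x):
        return cls._make("ee_delta_pose", x)

    @classmethod
    def normalized(cls, x):
        return cls._make("normalized", x)


@dataclass(frozen=True)
class ActionSpec:
    """Hypothetical description of what the environment expects."""
    robot: str
    mode: str
    dim: int


class ActionSchemaError(ValueError):
    """Raised when an action does not match the expected spec."""


def validate(action: Action, spec: ActionSpec) -> None:
    if action.mode != spec.mode:
        raise ActionSchemaError(
            f"Action mode mismatch: policy returned {action.mode!r}, "
            f"{spec.robot} expects {spec.mode!r}."
        )
    if len(action.values) != spec.dim:
        raise ActionSchemaError(
            "Action dimension mismatch.\n"
            f"Policy returned: mode={action.mode}, dim={len(action.values)}\n"
            f"Environment expects: robot={spec.robot}, mode={spec.mode}, dim={spec.dim}"
        )


spec = ActionSpec(robot="pandaomron", mode="joint_delta", dim=8)
validate(Action.joint_delta([0.0] * 8), spec)  # matches the spec, no error
try:
    validate(Action.joint_delta([0.0] * 14), spec)
except ActionSchemaError as err:
    print(err)
```

Raising a dedicated exception type would also let the evaluator record the episode as an invalid action schema rather than a generic policy failure.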

#### `EvalConfig`

```python
config = EvalConfig(
    suite="lightwheel-robocasa",
    robot="pandaomron",
    tasks=["OpenMicrowave", "PutObjectInCabinet"],
    num_episodes=100,
    num_envs=8,
    max_steps=500,
    seed=42,
    device="cuda:0",
    headless=True,
    save_video=True,
    output_dir="./eval_runs/my_policy",
)
```
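
For reference, a config like this could be backed by a plain frozen dataclass that validates its fields at construction time. This is only a sketch: the default values and the "`tasks=None` means all tasks in the suite" convention are assumptions, not part of the proposal.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvalConfig:
    suite: str
    robot: str
    tasks: Optional[List[str]] = None  # assumption: None means "all tasks in the suite"
    num_episodes: int = 100
    num_envs: int = 1
    max_steps: int = 500
    seed: int = 0
    device: str = "cuda:0"
    headless: bool = True
    save_video: bool = False
    output_dir: str = "./eval_runs"

    def __post_init__(self):
        # Fail early on values that would only surface later as simulator errors.
        for name in ("num_episodes", "num_envs", "max_steps"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1, got {getattr(self, name)}")


config = EvalConfig(
    suite="lightwheel-robocasa",
    robot="pandaomron",
    tasks=["OpenMicrowave"],
    seed=42,
)
print(asdict(config))  # serializable, so it can be stored alongside the report
```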

#### `EvalReport`

```python
report.success_rate
report.task_metrics
report.failure_breakdown
report.failed_episodes
report.video_paths
report.trajectory_paths
report.to_json()
report.to_markdown()
```

The report should ideally include reproducibility metadata:

```text
- LW-BenchHub version / git commit
- Isaac Sim version
- task suite
- robot
- seed
- number of episodes
- observation spec
- action spec
- checkpoint metadata if provided
- output artifacts
```
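
A minimal sketch of a report object with a few of these members follows. `EpisodeResult` and the exact JSON/Markdown layout are hypothetical; only `success_rate`, `failure_breakdown`, `to_json()`, and `to_markdown()` come from the list above, and the metadata values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class EpisodeResult:
    task: str
    episode: int
    success: bool
    failure: str = ""  # empty when the episode succeeded


@dataclass
class EvalReport:
    episodes: List[EpisodeResult]
    metadata: Dict[str, str] = field(default_factory=dict)

    @property
    def success_rate(self) -> float:
        if not self.episodes:
            return float("nan")
        return sum(e.success for e in self.episodes) / len(self.episodes)

    @property
    def failure_breakdown(self) -> Dict[str, int]:
        counts: Dict[str, int] = {}
        for e in self.episodes:
            if not e.success:
                counts[e.failure] = counts.get(e.failure, 0) + 1
        return counts

    def to_json(self) -> str:
        return json.dumps(
            {
                "success_rate": self.success_rate,
                "failure_breakdown": self.failure_breakdown,
                "metadata": self.metadata,
                "episodes": [asdict(e) for e in self.episodes],
            },
            indent=2,
        )

    def to_markdown(self) -> str:
        lines = ["| task | episode | success | failure |", "| --- | --- | --- | --- |"]
        for e in self.episodes:
            lines.append(f"| {e.task} | {e.episode} | {e.success} | {e.failure} |")
        return "\n".join(lines)


report = EvalReport(
    episodes=[
        EpisodeResult("OpenMicrowave", 0, True),
        EpisodeResult("OpenMicrowave", 1, False, "timeout"),
    ],
    metadata={"suite": "lightwheel-robocasa", "robot": "pandaomron", "seed": "42"},
)
print(report.success_rate)
print(report.to_markdown())
```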

### CLI compatibility

The CLI/script interface could continue to exist, but ideally it would call the same public API.

For example:

```bash
lw-benchhub eval \
  --suite lightwheel-robocasa \
  --robot pandaomron \
  --policy ./my_policy.py:MyPolicy \
  --episodes 100 \
  --save-video
```

This would make the CLI, Python API, and future platform integrations share the same evaluation path.
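
The `--policy ./my_policy.py:MyPolicy` form could be resolved with the standard library alone. The sketch below shows one possible approach using `importlib.util`; `load_policy_class` is a hypothetical helper, and the demo writes a throwaway policy file to a temporary directory just to exercise it.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_policy_class(spec: str):
    """Load a class from a '<file.py>:<ClassName>' string."""
    # Split on the last colon so Windows drive letters in the path survive.
    path_str, sep, class_name = spec.rpartition(":")
    if not sep or not path_str or not class_name:
        raise ValueError(f"expected '<file.py>:<ClassName>', got {spec!r}")
    path = Path(path_str).resolve()
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(f"{path} does not define {class_name!r}") from None


with tempfile.TemporaryDirectory() as tmp:
    policy_file = Path(tmp) / "my_policy.py"
    policy_file.write_text(
        "class MyPolicy:\n"
        "    def act(self, obs, ctx):\n"
        "        return 0\n"
    )
    policy_cls = load_policy_class(f"{policy_file}:MyPolicy")
    print(policy_cls.__name__, policy_cls().act(None, None))
```

With a loader like this, the CLI would only need to parse arguments, build the config, and hand the instantiated policy to the same evaluation entry point used from Python.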

### Why this would help

This would make LW-BenchHub much easier to use for several common user groups:

* users evaluating their own VLA or imitation-learning policy;
* researchers comparing models across benchmark tasks;
* robotics engineers validating robot-specific action spaces;
* platform teams running automated benchmark jobs;
* users who want structured evaluation reports rather than manual log parsing.

It would also help avoid requiring every downstream user to become familiar with the internal script structure before they can run a meaningful evaluation.

### Non-goals

This proposal does not suggest removing existing scripts or YAML configs.

It also does not suggest hiding all robotics-specific complexity. Robot-specific observation/action mapping, scene configuration, and task validation will still be necessary. The suggestion is to make these concepts explicit, inspectable, and stable at the public API boundary.

### Acceptance criteria for a first version

A first useful version could be considered successful if:

* a user can define a minimal `RobotPolicy` with `act(obs, ctx) -> Action`;
* a user can run evaluation from Python without modifying repository internals;
* observation and action specs can be inspected before evaluation;
* the evaluator returns a structured `EvalReport`;
* environment/setup failures are separated from policy failures;
* existing scripts can be migrated to call the same API internally.

### Closing note

LW-BenchHub already provides a strong foundation for robotics benchmark evaluation. A stable Python evaluation API would make it much easier for external users to bring their own policies, run reproducible evaluations, and build downstream workflows on top of the project.
