Community env submission: ComtradeBench — reliable LLM tool-use benchmark

Hi OpenEnv maintainers 👋

Sharing a community environment I built for the OpenEnv / AgentBeats hackathon, in case it fits the [`agentic-rl-hackathon-sf-2026`](https://huggingface.co/collections/openenv/agentic-rl-hackathon-sf-2026) collection.

## ComtradeBench

An OpenEnv-native benchmark for **reliable LLM tool-use under adversarial API conditions**. The environment simulates a paginated UN Comtrade–style API in which the agent must complete a multi-step retrieval job while handling realistic operational failures, so it measures operational reliability and not just answer quality.

### What it tests
- Pagination drift (page order randomized between calls)
- Within-page (8%) and cross-page (3%) duplicate records
- HTTP 429 rate-limit and HTTP 500 transient faults
- Totals-row contamination (`is_total=true` rows mixed into data)
- Mixed faults (retry + dedup simultaneously)
- **T9 — Adaptive adversary**: fault intensity escalates *mid-episode* as the agent makes progress (one of the first OpenEnv tasks to model within-episode escalation as far as I know)
- **T10 — Constrained budget**: halved request budget, no room for wasted fetches
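
To make the fault classes above concrete, here is a minimal sketch of the handling an agent (or rule-based baseline) would need: retry on transient faults, drop totals rows, and deduplicate across pages. This is not the ComtradeBench client code. `TransientError`, `fetch_with_retry`, `clean_records`, the `id` record key, and the toy in-memory API are all hypothetical names chosen for illustration; only the `is_total` flag comes from the description above.

```python
import time


class TransientError(Exception):
    """Stands in for an HTTP 429 or HTTP 500 response (hypothetical)."""


def fetch_with_retry(fetch_page, page, max_retries=3, base_delay=0.0):
    """Call fetch_page(page), retrying transient faults with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(page)
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the fault to the caller
            time.sleep(base_delay * 2 ** attempt)


def clean_records(records):
    """Drop totals rows, then deduplicate by record id, keeping first-seen order."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("is_total"):
            continue  # totals-row contamination
        if rec["id"] in seen:
            continue  # within-page or cross-page duplicate
        seen.add(rec["id"])
        cleaned.append(rec)
    return cleaned


def collect_all(fetch_page, page_numbers, base_delay=0.0):
    """Fetch every page (in whatever order it is given) and clean the union."""
    rows = []
    for page in page_numbers:
        rows.extend(fetch_with_retry(fetch_page, page, base_delay=base_delay))
    return clean_records(rows)


def make_flaky_api(pages):
    """Toy API (assumption): the first request for each page fails once."""
    already_failed = set()

    def fetch_page(page):
        if page not in already_failed:
            already_failed.add(page)
            raise TransientError(f"simulated 429 on page {page}")
        return pages[page]

    return fetch_page


PAGES = {
    1: [{"id": "a", "value": 10}, {"id": "a", "value": 10}, {"id": "b", "value": 5}],
    2: [{"id": "b", "value": 5}, {"id": "c", "value": 7},
        {"id": "TOTAL", "value": 22, "is_total": True}],
}

# Pages are requested out of order to mimic pagination drift.
result = collect_all(make_flaky_api(PAGES), [2, 1])
print([r["id"] for r in result])  # ['b', 'c', 'a']
```

Deduplicating on a stable record key instead of on page position is what keeps the result correct when page order changes between calls.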

### Scoring
Six weighted dimensions, not pass/fail:

| Dimension | Weight |
|---|---|
| Correctness | 30% |
| Completeness | 15% |
| Robustness | 15% |
| Efficiency | 15% |
| Data Quality | 15% |
| Observability | 10% |
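
As an illustration of how the weights combine, the sketch below assumes a plain linear weighted sum of per-dimension sub-scores in the range 0 to 1. The aggregation rule, the `total_score` function, and the lowercase dimension keys are assumptions made for this example; the actual scorer may differ.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(sub_scores):
    """Combine sub-scores in [0, 1] into a 0-100 total (assumed linear weighting)."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= sub_scores[name] <= 1.0:
            raise ValueError(f"{name} must be within [0, 1]")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical episode: perfect correctness, half credit everywhere else.
example = {name: 0.5 for name in WEIGHTS}
example["correctness"] = 1.0
print(total_score(example))  # 65.0
```

Under this assumption, an agent that returns correct data but wastes requests or logs nothing still loses points, which is the practical difference from a pass/fail check.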

### Results
- **Rule-based baseline**: 96.8 / 100 avg across all 10 tasks (indicates the environment is solvable and the scoring is calibrated)
- **Kimi (Moonshot V1-8K) LLM agent**: 94.4 on the shared T1–T8 subset
- **GRPO training pipeline** included: the same environment code runs in-process during training and as a Docker Space during evaluation, so the two cannot diverge. Training is reward-only, with no human labels. Mean reward beat the rule-based baseline in 6 of 8 iterations.

All benchmark data is procedurally generated from a seeded PRNG — no external API dependency, fully reproducible from task ID + seed.
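
The reproducibility property can be sketched as follows. This is not the environment's generator: `make_rng`, `generate_records`, and the record fields are hypothetical, and hashing the task ID together with the seed is just one way to derive a stable PRNG state.

```python
import hashlib
import random


def make_rng(task_id, seed):
    """Derive a deterministic PRNG from task ID + seed (illustrative scheme)."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would break cross-run reproducibility.
    digest = hashlib.sha256(f"{task_id}:{seed}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_records(task_id, seed, n=5):
    """Generate n synthetic trade records (field names are assumptions)."""
    rng = make_rng(task_id, seed)
    return [
        {"id": i, "trade_value": rng.randint(1_000, 1_000_000)}
        for i in range(n)
    ]


# The same (task ID, seed) pair always yields the same dataset.
first = generate_records("T1", 42)
second = generate_records("T1", 42)
print(first == second)  # True
```

Because the dataset is a pure function of task ID and seed, a reported score can be re-checked later without any call to an external service.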

### Links
- 🧪 **Environment Space**: https://huggingface.co/spaces/yonghongzhang/comtrade-env
- 📝 **Blog Space** (full technical write-up): https://huggingface.co/spaces/yonghongzhang/comtrade-bench-blog
- 💻 **GitHub**: https://github.com/yonghongzhang-io/comtrade-openenv

### Request
I'd love for ComtradeBench to be considered for inclusion in the `agentic-rl-hackathon-sf-2026` collection if it meets the criteria. I'm happy to iterate on anything that falls short of the spec; please let me know what needs adjusting.

Thanks for building OpenEnv 🙏

