Community env submission: ComtradeBench, a reliable LLM tool-use benchmark #527

@yonghongzhang-io

Hi OpenEnv maintainers 👋

Sharing a community environment I built for the OpenEnv / AgentBeats hackathon, in case it fits the agentic-rl-hackathon-sf-2026 collection.

ComtradeBench

An OpenEnv-native benchmark for reliable LLM tool-use under adversarial API conditions. The environment simulates a paginated UN Comtrade–style API in which the agent must complete a multi-step retrieval job under realistic operational failures, so it measures operational robustness rather than answer quality alone.

What it tests

  • Pagination drift (page order randomized between calls)
  • Within-page (8%) and cross-page (3%) duplicate records
  • HTTP 429 rate-limit and HTTP 500 transient faults
  • Totals-row contamination (is_total=true rows mixed into data)
  • Mixed faults (retry + dedup simultaneously)
  • T9 (Adaptive adversary): fault intensity escalates mid-episode as the agent makes progress (as far as I know, one of the first OpenEnv tasks to model within-episode escalation)
  • T10 (Constrained budget): request budget is halved, leaving no room for wasted fetches
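To make the failure modes above concrete, here is a minimal sketch of the kind of client loop an agent needs to survive them. It is illustrative only and is not the environment's actual interface: `fetch_page`, the `(status, body)` response shape, and the `id`, `rows`, and `has_more` fields are all assumptions made for this example; only `is_total` comes from the task list.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical response shape: (http_status, {"rows": [...], "has_more": bool}).
Page = Tuple[int, dict]


def collect_records(fetch_page: Callable[[int], Page],
                    max_requests: int = 50,
                    max_retries: int = 3,
                    base_delay: float = 0.0) -> List[dict]:
    """Fetch pages until exhausted, retrying transient faults and cleaning rows."""
    seen: Dict[str, dict] = {}  # record id -> row; dedups within and across pages
    requests_used = 0
    page = 1
    while requests_used < max_requests:
        attempt = 0
        while True:
            status, body = fetch_page(page)
            requests_used += 1
            if status == 200:
                break
            retryable = status in (429, 500)
            if retryable and attempt < max_retries and requests_used < max_requests:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
                attempt += 1
                continue
            raise RuntimeError(f"page {page} failed with HTTP {status}")
        for row in body["rows"]:
            if row.get("is_total"):
                continue  # drop totals rows mixed into the data
            seen.setdefault(row["id"], row)  # keep the first copy of each record
        if not body.get("has_more"):
            break
        page += 1
    # Sort by id so the output does not depend on the order pages arrived in.
    return sorted(seen.values(), key=lambda r: r["id"])
```

Keying the result on a record id rather than on page position is what makes the loop insensitive to page-order drift and to both kinds of duplicates. Under a halved budget (T10), `max_requests` becomes the binding constraint, and if it runs out before the last page the sketch returns whatever it has collected so far.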

Scoring

Six weighted dimensions, not pass/fail:

| Dimension     | Weight |
|---------------|--------|
| Correctness   | 30%    |
| Completeness  | 15%    |
| Robustness    | 15%    |
| Efficiency    | 15%    |
| Data Quality  | 15%    |
| Observability | 10%    |
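As a rough illustration of how six weighted dimensions could combine into a single 0–100 score, here is a sketch. The weights are the ones listed above; the linear combination, the assumption that each dimension is scored in [0, 1], and the names `WEIGHTS` and `weighted_score` are mine and may not match the actual scorer.

```python
# Weights from the scoring table; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.15,
    "robustness": 0.15,
    "efficiency": 0.15,
    "data_quality": 0.15,
    "observability": 0.10,
}


def weighted_score(dim_scores: dict) -> float:
    """Combine per-dimension scores (assumed to lie in [0, 1]) into a 0-100 total."""
    missing = set(WEIGHTS) - set(dim_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * dim_scores[name] for name in WEIGHTS)
```

Under this assumed formula, an agent that returns fully correct data but scores zero on every other dimension would still get only 30 points, which is the practical difference from a pass/fail check.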

Results

  • Rule-based baseline: 96.8 / 100 avg across all 10 tasks (confirms the env is well-calibrated and solvable)
  • Kimi (Moonshot V1-8K) LLM agent: 94.4 on the shared T1–T8 subset
  • GRPO training pipeline included: the same env code runs in-process during training and as a Docker Space during eval, so there is no train/eval divergence. Training is reward-only, with no human labels. Mean reward beat the rule-based baseline in 6 of 8 iterations.
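For readers unfamiliar with reward-only GRPO, the core idea is that each rollout's advantage is computed relative to the other rollouts sampled for the same task, so no learned value model or human label is required. The sketch below shows only that normalization step, in its standard form; it is not taken from the ComtradeBench pipeline, and the function name and `eps` parameter are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize episode rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Here the `rewards` would be the environment's episode scores for several attempts at the same task and seed: attempts that score above the group mean get a positive advantage, and those below get a negative one.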

All benchmark data is procedurally generated from a seeded PRNG, so there is no external API dependency and every episode is fully reproducible from task ID + seed.
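A minimal sketch of what "reproducible from task ID + seed" can look like in practice is below. It assumes the two values are hashed into a single PRNG seed; the actual generator may derive its seed differently, and `make_rng`, `generate_rows`, and the row fields are hypothetical names used only for illustration.

```python
import hashlib
import random
from typing import List


def make_rng(task_id: str, seed: int) -> random.Random:
    """Derive an independent, deterministic PRNG from (task_id, seed)."""
    digest = hashlib.sha256(f"{task_id}:{seed}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_rows(task_id: str, seed: int, n: int = 5) -> List[dict]:
    """Generate n synthetic trade rows; the same inputs always give the same rows."""
    rng = make_rng(task_id, seed)
    return [
        {"id": f"{task_id}-{i}", "value": rng.randint(1, 1_000_000)}
        for i in range(n)
    ]
```

Using a private `random.Random` instance instead of the module-level functions keeps generation isolated from any other code that touches the global PRNG state, which matters when the same env runs in-process during training.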

Links

Request

Would love to be considered for inclusion in the agentic-rl-hackathon-sf-2026 collection if it fits the criteria. Happy to iterate on anything that doesn't meet the spec; please let me know what needs adjusting.

Thanks for building OpenEnv 🙏
