Hi OpenEnv maintainers,
Sharing a community environment I built for the OpenEnv / AgentBeats hackathon, in case it fits the agentic-rl-hackathon-sf-2026 collection.
ComtradeBench
An OpenEnv-native benchmark for reliable LLM tool-use under adversarial API conditions. The environment simulates a paginated UN Comtrade-style API where the agent has to complete a multi-step retrieval job under realistic operational failures, not just answer-quality conditions.
What it tests
- Pagination drift (page order randomized between calls)
- Within-page (8%) and cross-page (3%) duplicate records
- HTTP 429 rate-limit and HTTP 500 transient faults
- Totals-row contamination (`is_total=true` rows mixed into data)
- Mixed faults (retry + dedup simultaneously)
- T9 (adaptive adversary): fault intensity escalates mid-episode as the agent makes progress (as far as I know, one of the first OpenEnv tasks to model within-episode escalation)
- T10 (constrained budget): halved request budget, no room for wasted fetches
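To make the fault list concrete, here is a minimal sketch of the kind of defensive retrieval loop these tasks reward: retry on transient errors, de-duplicate by record key, and drop totals rows. It is illustrative only and not the environment's actual code. `make_flaky_api`, `fetch_page`, `TransientError`, and the `id` / `value` fields are hypothetical names; only `is_total` comes from the task description above.

```python
class TransientError(Exception):
    """Stand-in for an HTTP 429 or 500 response (hypothetical)."""


def make_flaky_api(records, page_size=3):
    """Build a toy paginated API whose every odd-numbered call fails."""
    pages = [records[i:i + page_size] for i in range(0, len(records), page_size)]
    calls = {"n": 0}

    def fetch_page(page):
        calls["n"] += 1
        if calls["n"] % 2 == 1:  # deterministic fault, for the example only
            raise TransientError("simulated 429/500")
        return pages[page]

    return fetch_page, len(pages)


def collect(fetch_page, n_pages, max_retries=5):
    """Fetch every page with retries, then de-duplicate and drop totals rows."""
    seen = {}
    for page in range(n_pages):
        for _attempt in range(max_retries):
            try:
                rows = fetch_page(page)
                break
            except TransientError:
                continue  # a real client would back off before retrying
        else:
            raise RuntimeError(f"page {page} failed after {max_retries} attempts")
        for row in rows:
            if row.get("is_total"):
                continue  # totals rows are not data
            seen.setdefault(row["id"], row)  # first copy wins, duplicates ignored
    return list(seen.values())


RECORDS = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 20},
    {"id": 2, "value": 20},                       # within-page duplicate
    {"id": 3, "value": 30},
    {"id": 1, "value": 10},                       # cross-page duplicate
    {"id": None, "value": 60, "is_total": True},  # totals-row contamination
]

fetch_page, n_pages = make_flaky_api(RECORDS)
clean = collect(fetch_page, n_pages)
print(sorted(r["id"] for r in clean))
```

Keying the de-duplication on a record identifier rather than on page position also helps when page order drifts between calls, since the same record is recognized wherever it reappears.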
Scoring
Six weighted dimensions, not pass/fail:
| Dimension | Weight |
| --- | --- |
| Correctness | 30% |
| Completeness | 15% |
| Robustness | 15% |
| Efficiency | 15% |
| Data Quality | 15% |
| Observability | 10% |
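As a rough illustration of how the six weights could combine into a single 0-100 score, here is a minimal sketch. It assumes each dimension is scored in [0, 1] and that the total is a plain weighted sum; the weights are the ones listed above, but the function name and the aggregation rule are assumptions, not the benchmark's actual scorer.

```python
# Weights from the scoring table; everything else here is an assumption.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.15,
    "robustness": 0.15,
    "efficiency": 0.15,
    "data_quality": 0.15,
    "observability": 0.10,
}


def weighted_score(dimension_scores):
    """Combine per-dimension scores in [0, 1] into a 0-100 total (assumed linear)."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Example: perfect on everything except observability, which scores zero.
scores = {name: 1.0 for name in WEIGHTS}
scores["observability"] = 0.0
print(round(weighted_score(scores), 1))
```

Under this assumed rule, an agent that retrieves the right data but logs nothing would lose at most the 10 points carried by Observability.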
Results
- Rule-based baseline: 96.8 / 100 avg across all 10 tasks (confirms the env is well-calibrated and solvable)
- Kimi (Moonshot V1-8K) LLM agent: 94.4 on the shared T1-T8 subset
- GRPO training pipeline included: the same env code runs in-process during training and as a Docker Space during eval, with zero divergence. Reward-only, no human labels. Mean reward beat the rule-based baseline in 6 of 8 iterations.
All benchmark data is procedurally generated from a seeded PRNG, so there is no external API dependency and every run is fully reproducible from task ID + seed.
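A minimal sketch of what "reproducible from task ID + seed" can look like in practice: derive one PRNG from the pair and draw every record from it. The hashing scheme, function names, and record fields are assumptions for illustration, not the generator used in the benchmark.

```python
import hashlib
import random


def task_rng(task_id, seed):
    """Derive a dedicated PRNG from (task_id, seed) via a stable hash (assumed scheme)."""
    digest = hashlib.sha256(f"{task_id}:{seed}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def generate_records(task_id, seed, n=5):
    """Generate n toy records; the same (task_id, seed) always yields the same data."""
    rng = task_rng(task_id, seed)
    return [{"id": i, "value": rng.randint(1, 1000)} for i in range(n)]


first = generate_records("T1", 42)
second = generate_records("T1", 42)
print(first == second)  # identical inputs give identical data
```

Hashing the pair with `hashlib` rather than Python's built-in `hash()` matters here, because string hashes are randomized per process and would break reproducibility across runs.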
Links
Request
Would love to be considered for inclusion in the agentic-rl-hackathon-sf-2026 collection if it fits the criteria. Happy to iterate on anything that doesn't meet the spec. Please let me know what needs adjusting.
Thanks for building OpenEnv!