This document describes how evaluation flows through the codebase and the design choices behind the current structure.
- Problem contributors (see `CONTRIBUTING.md`)
- Model submitters (see `SUBMIT.md`)
- General researchers using Frontier-CS to evaluate solutions
- Clear separation between single-problem and batch evaluation.
- Shared validation and config parsing across research backends.
- Predictable cleanup to avoid orphaned cloud resources.
- Explicit naming to avoid backend ambiguity.
- CLI:
  - `frontier eval` → `SingleEvaluator`
  - `frontier batch` → `BatchEvaluator`
  - `frontier list` / `frontier show` → problem discovery
- CI:
  - Validate Problems → `scripts/validate_problems.py` → `SingleEvaluator`
  - Weekly Batch Evaluation → `scripts/run_eval.sh` → `BatchEvaluator`
Unified API for single-problem evaluation:
- Selects a runner based on track and backend.
- Registers SIGINT/atexit hooks to tear down SkyPilot clusters on exit.
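The track/backend selection can be pictured as a lookup table. This is a minimal sketch under assumptions: the table contents come from the runner hierarchy described in this document, but `RUNNERS` and `select_runner` are illustrative names, not the real `SingleEvaluator` API.

```python
# Hypothetical dispatch table: (track, backend) -> runner class name.
# The real selection logic inside SingleEvaluator may differ.
RUNNERS = {
    ("research", "docker"): "ResearchDockerRunner",
    ("research", "skypilot"): "ResearchSkyPilotRunner",
    ("algorithmic", "local"): "AlgorithmicLocalRunner",
    ("algorithmic", "skypilot"): "AlgorithmicSkyPilotRunner",
}

def select_runner(track: str, backend: str) -> str:
    """Map (track, backend) to a runner class name; fail loudly on unknown pairs."""
    try:
        return RUNNERS[(track, backend)]
    except KeyError:
        raise ValueError(f"no runner for track={track!r}, backend={backend!r}") from None
```

Failing loudly on an unknown pair keeps misconfigured invocations from silently falling back to the wrong backend.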
Orchestrates batch evaluation with parallel workers and SkyPilot cluster pools:
- Work queues with resumable state and result aggregation.
- Resource-grouped cluster pools — pairs are grouped by `ResourceSignature` (cloud, accelerators, instance type) so that CPU-only and GPU problems run on separate pools.
- Hash-based resume — each result stores solution/problem hashes so stale results are automatically re-evaluated when source changes.
Runners execute evaluation. The class hierarchy:
```
Runner (ABC)
├── ResearchRunner              # shared: problem validation, config loading, uv install
│   ├── ResearchDockerRunner    # local container
│   └── ResearchSkyPilotRunner  # cloud via SkyPilot
├── AlgorithmicLocalRunner      # go-judge via HTTP
└── AlgorithmicSkyPilotRunner   # go-judge on SkyPilot
```
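The hierarchy above can be sketched as a Python ABC. The class names match the tree; the method names (`run`, `prepare`) and bodies are illustrative placeholders, not the real implementation.

```python
from abc import ABC, abstractmethod

class Runner(ABC):
    """Base of the runner hierarchy: every runner evaluates one solution."""
    @abstractmethod
    def run(self, problem: str, solution: str) -> float: ...

class ResearchRunner(Runner):
    # Shared pre-evaluation steps: validation, config loading, uv install.
    def prepare(self, problem: str) -> None:
        print(f"validating {problem}, loading config, building uv install")

class ResearchDockerRunner(ResearchRunner):
    def run(self, problem, solution):
        self.prepare(problem)
        return 0.0  # would launch a local container here

class ResearchSkyPilotRunner(ResearchRunner):
    def run(self, problem, solution):
        self.prepare(problem)
        return 0.0  # would provision a cloud VM via SkyPilot here
```

Putting the shared steps on `ResearchRunner` rather than in each backend is what keeps Docker and SkyPilot evaluation behaviorally identical up to the point of execution.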
The `gen/` module generates solutions by calling LLMs. It is independent of
the evaluation pipeline — `frontier eval` and `frontier batch` do not
depend on it. It provides an LLM interface, an API key pool for concurrent
requests, and solution file formatting.
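An API key pool for concurrent requests can be as simple as a thread-safe round-robin cycle. This is a hedged sketch: `KeyPool` and its methods are assumed names, and the real `gen/` module may rotate keys differently (e.g., by rate-limit state).

```python
import itertools
import threading

class KeyPool:
    """Hand out API keys round-robin; safe to call from many worker threads."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def acquire(self) -> str:
        # itertools.cycle is not thread-safe on its own, hence the lock.
        with self._lock:
            return next(self._cycle)
```

Each concurrent worker calls `acquire()` per request, spreading load evenly across the available keys.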
- Single vs Batch: `SingleEvaluator` stays focused on one-off runs (simple API + cleanup hooks), while `BatchEvaluator` owns scheduling, resumable state, and cluster pools. This keeps single-run paths lightweight and batch runs scalable.
- Shared research helpers: input validation and config parsing live in the `ResearchRunner` base class so Docker and SkyPilot backends stay in sync.
- Cleanup strategy: research evaluations tear down clusters by default unless `keep_cluster` is set. `SingleEvaluator` cleans up via an active-cluster registry; `BatchEvaluator` manages its own pool lifecycle.
- Naming: runner class names encode track + backend (e.g., `ResearchDockerRunner`) to remove ambiguity in logs and docs.
- Score semantics: a score of 0 can mean the evaluator ran successfully; failures are reported via status/metadata rather than by score alone.
- Reference solutions: problems ship with `reference.cpp`/`reference.py` so CI can verify end-to-end evaluation without model submissions.
- Results separation: evaluation outputs go to a dedicated results repository to keep the main repo lean and auditable.
- Internal vs public: internal test cases and tooling live in a private repo; public artifacts are kept minimal but compatible.
- Weekly vs local: weekly CI uses `scripts/run_eval.sh` with batch scheduling; local runs use the same script or `frontier eval` for quick iteration.
- Resource-grouped cluster pools: `BatchEvaluator` groups pairs by `ResourceSignature` (cloud × accelerators × instance type) and creates a separate pool per group, avoiding the waste of running CPU-only problems on GPU clusters.
- Hash-based resume: resuming a batch compares solution/problem hashes against stored results. Changed inputs are re-evaluated even when a prior result exists, preventing silently stale scores.
- Generation vs evaluation: solution generation (`gen/`) is fully decoupled from evaluation. Generated files are plain source files with no special metadata; the evaluator has no dependency on the generation pipeline.
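The score-semantics decision above implies a result type where success is read from status, not from the score. The following is a sketch under that assumption; `EvalResult` and its field names are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    score: float
    status: str                       # e.g. "ok", "timeout", "runner_error" (assumed values)
    metadata: dict = field(default_factory=dict)

    @property
    def succeeded(self) -> bool:
        # A zero score with status "ok" is still a successful evaluation:
        # the solution ran and simply earned no points.
        return self.status == "ok"
```

Separating the two channels means dashboards and resume logic never have to guess whether `score == 0` was a crash or a legitimately scored run.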
Both research runners share the same pre-evaluation steps (via `ResearchRunner`):
- Validate the solution file and `.FAILED` marker.
- Verify the problem path exists.
- Load `config.yaml` and runtime settings.
- Build the uv install command if `uv_project` is specified.
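The steps above can be sketched as one helper. Assumptions to note: the `.FAILED` marker is taken to sit next to the solution file, `uv_project` is detected by plain text matching instead of real YAML parsing (to keep the sketch stdlib-only), and `pre_evaluate`, its error types, and `["uv", "sync"]` are all illustrative stand-ins.

```python
from pathlib import Path

def pre_evaluate(problem_dir: str, solution_file: str) -> dict:
    """Shared pre-evaluation checks, per the steps listed above (sketch)."""
    solution = Path(solution_file)
    if not solution.is_file():
        raise FileNotFoundError(f"solution not found: {solution}")
    if solution.with_suffix(".FAILED").exists():  # assumed marker location
        raise ValueError(f"solution previously marked failed: {solution}")
    problem = Path(problem_dir)
    if not problem.is_dir():
        raise FileNotFoundError(f"problem path not found: {problem}")
    config_text = (problem / "config.yaml").read_text()
    # Real code would parse YAML; substring check is a placeholder.
    install_cmd = ["uv", "sync"] if "uv_project:" in config_text else None
    return {"config": config_text, "install_cmd": install_cmd}
```

Because both backends go through the same checks before diverging, a misconfigured problem fails identically whether it is headed for Docker or SkyPilot.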
Execution diverges at the backend:
- Docker — launches a local container.
- SkyPilot — provisions a cloud VM and runs remotely.
- Cleanup: research evaluations tear down clusters by default unless `keep_cluster=True`. `SingleEvaluator` uses an active-cluster registry to clean up on SIGINT/atexit; `BatchEvaluator` manages its own cluster pool lifecycle.
- CI: problem validation runs single evals; the weekly batch job runs full evaluations on SkyPilot (typically GCP).