Agentic evaluation toolkit for Convai Character AI infrastructure.
convai-evals is a dataset-agnostic runner for testing Convai Character AI behavior through the same Web SDK path real applications use. You give it a dataset of inputs (text, voice, dynamic context) and a character, and it reports how the character behaved (responded vs. stayed silent, correctly or not) and how fast (latency). It works with any character and any dataset.
New here? Start with the Getting started guide below. It walks you through running your first evaluation step by step, including exactly where to enter your Character ID, API key, and datasets. No prior experience required.
- A Mac, Linux, or Windows computer.
- Node.js version 20 or newer. Download it from nodejs.org (pick the "LTS" version) and install it. To check that it worked, open a terminal and run node -v; you should see something like v20.x or higher.
- A Convai Character ID and a Convai API key (from your Convai dashboard). You'll paste these in later; they are not stored in this tool.
- Your dataset files (.csv). For example, if a dataset folder was shared with you, download the .csv files to your computer (your Downloads folder is fine).
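The Node.js version requirement above can also be checked from a script. The following is a minimal sketch, not part of convai-evals; the helper name isSupportedNode is hypothetical, and it only compares the major version against 20.

```javascript
// Hypothetical helper (not part of convai-evals): checks a Node.js version
// string such as "v20.11.1" or "20.11.1" against a minimum major version.
function isSupportedNode(version, minMajor = 20) {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) return false; // unrecognized version string
  return Number(match[1]) >= minMajor;
}

// process.version is the running interpreter's version string.
if (!isSupportedNode(process.version)) {
  console.error(`Node.js 20 or newer is required; found ${process.version}`);
}
```

Running node -v by hand remains the simplest check; a helper like this is only useful if you want a script to stop early on an old Node.js install.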
Open a terminal, go to the folder where you want the tool, and run these commands one at a time:
# get the code
git clone https://github.com/Conv-AI/convai-evals.git
cd convai-evals
# install and build (takes a few minutes the first time)
npm install
npx playwright install chromium
npm run build

If every command finishes without a red error, you're ready.
This is the easiest way and needs no command-line configuration. You'll keep two commands running, each in its own terminal window/tab.
Terminal 1 (start the engine):
npm run dev:server

Terminal 2 (start the web app):
npm run dev:web

Then open the link it prints (it will be http://localhost:5180) in your browser. You'll see two cards:
a) Dataset card — load your data:
- Click "Upload CSV" and pick a dataset file from wherever you saved it (e.g. your Downloads folder). That's where any downloaded dataset goes — you select it here; there is no special folder to copy it into.
- Or click "Load synthetic sample" to try the tool with built-in fake data first.
b) Run config card — this is where your connection details go:
- Prod endpoint URL — paste your Convai prod endpoint URL here (provided by Convai).
- Character ID — paste your Convai character ID here.
- Convai API key — paste your Convai API key here (it shows as dots; it is only sent to the engine on your own machine).
- TTS for Voice In rows — leave Provider = Local (free, no setup). Set Voice ID to a voice your computer has:
  - Mac: Samantha
  - Linux: en-us
- Leave the other settings at their defaults to start.
Finally click the big Run button. Progress shows on screen; when it finishes you'll see pass-rates and latency, and you can export the report as CSV or JSON.
Your endpoint URL, Character ID, and API key live only in this form (and the headless command below) — they are never written into the tool's code or saved to the repo.
If you have several datasets and want to run them in one go (or run several at the same time), use the headless runner instead of the web app. Keep npm run dev:server running in one terminal, then in another terminal:
CHARACTER_ID=your-character-id \
API_KEY=your-api-key \
ENDPOINT_URL=your-convai-prod-url \
TTS_VOICE_ID=Samantha \
REPORT_DIR=./reports \
node scripts/run-batch.mjs /path/to/your/datasets/

- Where your endpoint URL, Character ID, and API key go: the ENDPOINT_URL=, CHARACTER_ID=, and API_KEY= values in the command above. Replace the placeholder text with your real values.
- Where your datasets go: put the downloaded .csv files in any folder, then pass that folder's path as the last argument (or list individual files). Example: node scripts/run-batch.mjs ~/Downloads/my-datasets/
- On Mac use TTS_VOICE_ID=Samantha; on Linux use TTS_VOICE_ID=en-us.
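A common slip with the headless command is leaving one of the placeholder values in place. The sketch below shows one way to catch that before launching a batch. It is an illustration only: the helper missingVars is hypothetical, and treating all three variables as required is an assumption based on the command shown above.

```javascript
// Variable names come from the headless command above; treating all three
// as required is an assumption of this sketch.
const REQUIRED_VARS = ["ENDPOINT_URL", "CHARACTER_ID", "API_KEY"];

// Returns the names of variables that are unset, blank, or still hold a
// "your-..." placeholder copied from the example command.
function missingVars(env) {
  return REQUIRED_VARS.filter((name) => {
    const value = (env[name] ?? "").trim();
    return value === "" || value.startsWith("your-");
  });
}

const unsetVars = missingVars(process.env);
if (unsetVars.length > 0) {
  console.error(`Set these before running the batch: ${unsetVars.join(", ")}`);
}
```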
Full options (concurrency, staggering, server-side latency, etc.) are in docs/headless-runner.md.
- Web app: shown on screen; use the Export CSV / Export JSON buttons to save them.
- Headless runner: in the REPORT_DIR folder (default ./reports), with one *.report.json per dataset plus a batch-summary.json containing the pass-rates and a failure breakdown.
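If you post-process headless results in a script, the file naming described above is enough to tell the per-dataset reports from the batch summary. A minimal sketch follows; the function name classifyReportFiles is hypothetical, and it makes no assumptions about the contents of the JSON files.

```javascript
// Splits file names from a report folder into per-dataset reports and the
// batch summary, using only the naming pattern described above.
function classifyReportFiles(fileNames) {
  const reports = fileNames
    .filter((name) => name.endsWith(".report.json"))
    .sort();
  const hasSummary = fileNames.includes("batch-summary.json");
  return { reports, hasSummary };
}

// Example with a hand-written listing; real names depend on your datasets.
const listing = ["b.report.json", "batch-summary.json", "a.report.json"];
console.log(classifyReportFiles(listing));
```

The input could come from a directory listing of REPORT_DIR, for example via Node's fs.readdirSync.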
- "command not found: node/npm": Node.js isn't installed. Install it from nodejs.org and reopen the terminal.
- Sessions fail with "botReady timeout": usually the endpoint URL, Character ID, or API key is wrong, or they don't belong together. Double-check all three.
- Voice rows produce no audio / fail: the Voice ID must be valid for your operating system (Mac: Samantha; Linux: en-us). Or switch the provider to Google with a key.
- "port already in use": another copy is already running. Close it, or start the engine on a different port with PORT=4100 npm run dev:server.
- No BigQuery needed: behavior pass-rates and client-side latency are produced without any database access. (Server-side per-stage latency is optional; see the headless-runner doc.)
Note: the convai-evals command-line tool (below) only prepares and inspects datasets; it does not run live evaluations and does not take a Character ID or API key. Live runs happen in the web app or the headless runner described above.
The canonical input is versioned JSON:
{
"schema_version": "convai-evals/v0",
"scenario_id": "example",
"sessions": [
{
"session_id": "session-001",
"events": [
{
"event_id": "event-001",
"at_s": 0,
"input": { "kind": "text", "text": "What should I do next?" },
"expect": { "behavior": "respond", "llm_call": true, "verbal_response": true }
}
]
}
]
}

Supported v0 input kinds are text, voice, and dynamic_context. Domain-specific fields belong in opaque metadata; keep organization-specific columns out of the schema. See scenario.schema.json and docs/scenario-format.md. CSV datasets follow docs/csv-adapter.md.
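To make the shape of the scenario format concrete, here is a minimal sketch of a pre-flight check that walks sessions and events and flags input kinds outside the three listed for v0. The function checkScenario is hypothetical and covers only the fields shown in the example; it is not a substitute for scenario.schema.json or for convai-evals validate.

```javascript
// Input kinds listed for v0 in this README.
const V0_INPUT_KINDS = new Set(["text", "voice", "dynamic_context"]);

// Returns a list of human-readable problems; an empty list means these
// basic checks passed. Hypothetical helper, not the tool's validator.
function checkScenario(scenario) {
  const problems = [];
  if (scenario.schema_version !== "convai-evals/v0") {
    problems.push(`unexpected schema_version: ${scenario.schema_version}`);
  }
  for (const session of scenario.sessions ?? []) {
    for (const event of session.events ?? []) {
      const kind = event.input?.kind;
      if (!V0_INPUT_KINDS.has(kind)) {
        problems.push(
          `${session.session_id}/${event.event_id}: unsupported input kind "${kind}"`
        );
      }
    }
  }
  return problems;
}
```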
Behavior expectations support legacy values respond, abstain, and no_call, plus precise values respond_with_audio, respond_silent, and interrupted_by_priority_event. Legacy abstain passes when the server reaches silence through either a silent LLM result or a clean no-call path.
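The legacy abstain rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the tool's grading code: the observed-outcome shape { llm_call, verbal_response } is hypothetical (it mirrors the expect fields in the example scenario), and "clean" is simplified to mean that no verbal response was produced.

```javascript
// Behavior values named in this README.
const LEGACY_BEHAVIORS = ["respond", "abstain", "no_call"];
const PRECISE_BEHAVIORS = [
  "respond_with_audio",
  "respond_silent",
  "interrupted_by_priority_event",
];

function isKnownBehavior(value) {
  return LEGACY_BEHAVIORS.includes(value) || PRECISE_BEHAVIORS.includes(value);
}

// Legacy "abstain" passes when silence is reached by either path.
// `observed` is a hypothetical shape: { llm_call: boolean, verbal_response: boolean }.
function legacyAbstainPasses(observed) {
  const silentLlmResult = observed.llm_call && !observed.verbal_response;
  const cleanNoCall = !observed.llm_call && !observed.verbal_response;
  return silentLlmResult || cleanNoCall;
}
```

Under these assumptions both paths reduce to "no verbal response"; they are kept separate here so the two cases named in the text stay visible.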
The CLI validates, converts, and inspects datasets and reports. It does not run live evals (use the web app or scripts/run-batch.mjs for that).
convai-evals validate examples/scenarios/*.json
convai-evals convert input.csv --from legacy-rtvi-csv --out scenario.json
convai-evals run scenario.json --out runtime-rows.json
convai-evals report report.json
convai-evals telemetry-ids report.json --out telemetry-ids.json
convai-evals explain scenario.json
convai-evals generate-template --kind voice-text-mix --out scenario.json

Diagnostics are disabled by default:
DIAG_PROVIDER=none

Optionally, analytics API lookups are supported:
DIAG_PROVIDER=analytics-api
CONVAI_API_KEY=...
CONVAI_ANALYTICS_BASE_URL=https://analytics-api.convai.com/v1/analytics

Reports preserve backend IDs when available so a follow-up agent can call Convai analytics APIs after a run.
Each row also carries a correlation block with a deterministic client_event_id, dispatch timestamps, and the attribution method used for response and transcript capture. Text and dynamic-context rows pass this metadata through the SDK data message so backend telemetry can join back to eval rows when supported.
Runs target the Convai prod endpoint. The URL is not bundled with the tool — you
provide it at runtime: paste it into the Prod endpoint URL field in the web app, or
set ENDPOINT_URL for the headless runner.
This repo stores no credentials and no datasets. You supply your endpoint URL, character ID, API key, and dataset files at runtime (in the web form or via environment variables); none of them are written to disk in the repo. Cloud diagnostics are off by default. The bundled examples are synthetic and exist only to exercise the pipeline.
Apache-2.0.