Convai Evals

Agentic evaluation toolkit for Convai Character AI infrastructure.

convai-evals is a dataset-agnostic runner for testing Convai Character AI behavior through the same Web SDK path real applications use. You give it a dataset of inputs (text, voice, dynamic context) and a character, and it reports how the character behaved (responded vs. stayed silent, correctly or not) and how fast (latency). It works with any character and any dataset.

New here? Start with the Getting started guide below. It walks you through running your first evaluation step by step, including exactly where to enter your Character ID, API key, and datasets. No prior experience required.

Getting started

1. What you'll need

A Mac, Linux, or Windows computer.
Node.js version 20 or newer. Download it from nodejs.org (pick the "LTS" version) and install it. To check it worked, open a terminal and run node -v — you should see something like v20.x or higher.
A Convai Character ID and a Convai API key (from your Convai dashboard). You'll paste these in later — they are not stored in this tool.
Your dataset files (.csv). For example, if a dataset folder was shared with you, download the .csv files to your computer (your Downloads folder is fine).
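
If you want to script the Node.js version check instead of reading the output by eye, a minimal sketch could look like the following. The function names are hypothetical and not part of convai-evals; the only fact taken from this guide is the "version 20 or newer" requirement.

```typescript
// Minimum major version, taken from the "Node.js version 20 or newer" requirement above.
const MIN_NODE_MAJOR = 20;

// Extract the major version from the output of `node -v`, e.g. "v20.11.1" gives 20.
// Returns null when the text does not look like a Node.js version string.
function parseNodeMajor(versionOutput: string): number | null {
  const match = /^v?(\d+)\./.exec(versionOutput.trim());
  return match ? Number(match[1]) : null;
}

// True when the reported version meets the minimum requirement.
function isSupportedNode(versionOutput: string): boolean {
  const major = parseNodeMajor(versionOutput);
  return major !== null && major >= MIN_NODE_MAJOR;
}
```

Inside a Node.js script, the string to pass in would typically be process.version, which has the same format as the output of node -v.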

2. One-time setup

Open a terminal, go to the folder where you want the tool, and run these commands one at a time:

# get the code
git clone https://github.com/Conv-AI/convai-evals.git
cd convai-evals

# install and build (takes a few minutes the first time)
npm install
npx playwright install chromium
npm run build

If every command finishes without a red error, you're ready.

3. Run an evaluation — the web app (recommended)

This is the easiest way and needs no command-line configuration. You'll keep two commands running, each in its own terminal window/tab.

Terminal 1 — start the engine:

npm run dev:server

Terminal 2 — start the web app:

npm run dev:web

Then open the link it prints (it will be http://localhost:5180) in your browser. You'll see two cards:

a) Dataset card — load your data:

Click "Upload CSV" and pick a dataset file from wherever you saved it (e.g. your Downloads folder). Downloaded datasets are selected here; there is no special folder to copy them into.
Or click "Load synthetic sample" to try the tool with built-in fake data first.

b) Run config card — this is where your connection details go:

Prod endpoint URL — paste your Convai prod endpoint URL here (provided by Convai).
Character ID — paste your Convai character ID here.
Convai API key — paste your Convai API key here (it shows as dots; it is only sent to the engine on your own machine).
TTS for Voice In rows — leave Provider = Local (free, no setup). Set Voice ID to a voice your computer has:
- Mac: Samantha
- Linux: en-us
Leave the other settings at their defaults to start.

Finally click the big Run button. Progress shows on screen; when it finishes you'll see pass-rates and latency, and you can export the report as CSV or JSON.

Your endpoint URL, Character ID, and API key live only in this form (and the headless command below) — they are never written into the tool's code or saved to the repo.

4. Run many datasets at once (optional, for batches)

If you have several datasets and want to run them in one go (or run several at the same time), use the headless runner instead of the web app. Keep npm run dev:server running in one terminal, then in another terminal:

CHARACTER_ID=your-character-id \
API_KEY=your-api-key \
ENDPOINT_URL=your-convai-prod-url \
TTS_VOICE_ID=Samantha \
REPORT_DIR=./reports \
node scripts/run-batch.mjs /path/to/your/datasets/

Where your endpoint URL, Character ID, and API key go: the ENDPOINT_URL=, CHARACTER_ID=, and API_KEY= values in the command above. Replace the placeholder text with your real values.
Where your datasets go: put the downloaded .csv files in any folder, then pass that folder's path as the last argument (or list individual files). Example: node scripts/run-batch.mjs ~/Downloads/my-datasets/.
On Mac use TTS_VOICE_ID=Samantha; on Linux use TTS_VOICE_ID=en-us.
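
The environment variables above can also be assembled in code, for example when a wrapper script launches the headless runner. The sketch below is illustrative only: BatchSettings, defaultVoiceId, and buildBatchEnv are hypothetical names, and the platform strings "darwin" and "linux" are the values Node.js reports in process.platform for Mac and Linux. Only the variable names, the two voice IDs, and the ./reports default come from this guide.

```typescript
// Hypothetical settings object; field names are illustrative.
interface BatchSettings {
  characterId: string;
  apiKey: string;
  endpointUrl: string;
  voiceId?: string;   // overrides the per-platform default when set
  reportDir?: string; // falls back to "./reports"
}

// Voice IDs listed in this guide: Samantha on Mac, en-us on Linux.
// No voice is documented here for other platforms, so those return undefined.
function defaultVoiceId(platform: string): string | undefined {
  if (platform === "darwin") return "Samantha";
  if (platform === "linux") return "en-us";
  return undefined;
}

// Build the environment variables used in the run-batch command above.
function buildBatchEnv(settings: BatchSettings, platform: string): Record<string, string> {
  const voiceId = settings.voiceId ?? defaultVoiceId(platform);
  if (voiceId === undefined) {
    throw new Error("No default voice known for platform " + platform + "; set voiceId explicitly");
  }
  return {
    CHARACTER_ID: settings.characterId,
    API_KEY: settings.apiKey,
    ENDPOINT_URL: settings.endpointUrl,
    TTS_VOICE_ID: voiceId,
    REPORT_DIR: settings.reportDir ?? "./reports",
  };
}
```

Passing the platform in as an argument keeps the function easy to test; a real wrapper would likely pass process.platform and merge the result into the environment of the child process that runs scripts/run-batch.mjs.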

Full options (concurrency, staggering, server-side latency, etc.) are in docs/headless-runner.md.

5. Where your results go

Web app: shown on screen; use the Export CSV / Export JSON buttons to save them.
Headless runner: in the REPORT_DIR folder (default ./reports) — one *.report.json per dataset plus a batch-summary.json with the pass-rates and a failure breakdown.
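
As a rough illustration of the headless output layout, the sketch below sorts a list of file names from REPORT_DIR into the batch summary and the per-dataset reports. The function and type names are hypothetical; only the *.report.json and batch-summary.json naming comes from the description above, and the contents of those files are not modeled here.

```typescript
// Result of scanning a report directory listing.
interface ReportListing {
  summary: string | null; // "batch-summary.json" when present, otherwise null
  reports: string[];      // every "*.report.json" file, sorted by name
}

// Split a directory listing into the batch summary and per-dataset reports.
// Files that match neither pattern are ignored.
function classifyReportFiles(fileNames: string[]): ReportListing {
  const reports = fileNames
    .filter((name) => /\.report\.json$/.test(name))
    .sort();
  const summary = fileNames.indexOf("batch-summary.json") !== -1 ? "batch-summary.json" : null;
  return { summary, reports };
}
```

In a Node.js script, the input list would typically come from reading the REPORT_DIR folder with the standard fs module.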

6. Troubleshooting

command not found: node / npm: Node.js isn't installed. Install it from nodejs.org and reopen the terminal.
Sessions fail with "botReady timeout": usually the endpoint URL, Character ID, or API key is wrong, or they don't belong together. Double-check all three.
Voice rows produce no audio / fail: the Voice ID must be valid for your operating system (Mac: Samantha; Linux: en-us). Alternatively, switch the provider to Google and supply a key.
"port already in use": another copy is already running. Close it, or start the engine on a different port with PORT=4100 npm run dev:server.
No BigQuery needed: behavior pass-rates and client-side latency are produced without any database access. (Server-side per-stage latency is optional; see the headless-runner doc.)

Note: the convai-evals command-line tool (below) only validates, converts, and inspects datasets and reports. It does not run live evaluations and does not take a Character ID or API key. Live runs happen in the web app or the headless runner described above.

Reference

Scenario format

The canonical input is versioned JSON:

{
  "schema_version": "convai-evals/v0",
  "scenario_id": "example",
  "sessions": [
    {
      "session_id": "session-001",
      "events": [
        {
          "event_id": "event-001",
          "at_s": 0,
          "input": { "kind": "text", "text": "What should I do next?" },
          "expect": { "behavior": "respond", "llm_call": true, "verbal_response": true }
        }
      ]
    }
  ]
}

Supported v0 input kinds are text, voice, and dynamic_context. Domain-specific fields belong in opaque metadata; keep organization-specific columns out of the schema. See scenario.schema.json and docs/scenario-format.md. CSV datasets follow docs/csv-adapter.md.
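
For a quick sanity check before handing a file to the real validator, the shape shown above can be tested with a few lines of code. This is a minimal sketch inferred only from the example in this section, not the official schema: which fields are actually required is an assumption, checkScenario is a hypothetical name, and scenario.schema.json remains the authority.

```typescript
// Input kinds listed as supported in v0.
const SUPPORTED_KINDS = ["text", "voice", "dynamic_context"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of human-readable problems; an empty list means the
// object matches the shape of the example scenario above.
function checkScenario(value: unknown): string[] {
  if (!isRecord(value)) return ["scenario must be a JSON object"];
  const problems: string[] = [];
  if (value.schema_version !== "convai-evals/v0") {
    problems.push('schema_version must be "convai-evals/v0"');
  }
  if (typeof value.scenario_id !== "string") {
    problems.push("scenario_id must be a string");
  }
  const sessions = value.sessions;
  if (!Array.isArray(sessions)) {
    problems.push("sessions must be an array");
    return problems;
  }
  sessions.forEach((session: unknown, i: number) => {
    const events = isRecord(session) ? session.events : undefined;
    if (!Array.isArray(events)) {
      problems.push(`sessions[${i}].events must be an array`);
      return;
    }
    events.forEach((event: unknown, j: number) => {
      const input = isRecord(event) ? event.input : undefined;
      const kind = isRecord(input) ? input.kind : undefined;
      if (typeof kind !== "string" || SUPPORTED_KINDS.indexOf(kind) === -1) {
        problems.push(`sessions[${i}].events[${j}].input.kind is not a supported v0 kind`);
      }
    });
  });
  return problems;
}
```

The function takes an already-parsed object, so it would typically be called on the result of JSON.parse applied to the scenario file's text.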

Behavior expectations support legacy values respond, abstain, and no_call, plus precise values respond_with_audio, respond_silent, and interrupted_by_priority_event. Legacy abstain passes when the server reaches silence through either a silent LLM result or a clean no-call path.
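
The abstain rule can be pictured as a small matching function. The sketch below is an interpretation, not the tool's actual grading logic. It assumes that observed outcomes are labeled with the precise values plus no_call, that a "silent LLM result" corresponds to respond_silent, and that legacy respond is satisfied by respond_with_audio. The last two mappings are assumptions that this README does not state.

```typescript
// Expected values named in this section (legacy and precise).
type ExpectedBehavior =
  | "respond"
  | "abstain"
  | "no_call"
  | "respond_with_audio"
  | "respond_silent"
  | "interrupted_by_priority_event";

// ASSUMPTION: observed outcomes use the precise labels plus "no_call".
type ObservedOutcome =
  | "respond_with_audio"
  | "respond_silent"
  | "no_call"
  | "interrupted_by_priority_event";

function behaviorPasses(expected: ExpectedBehavior, observed: ObservedOutcome): boolean {
  switch (expected) {
    case "abstain":
      // Stated above: silence via a silent LLM result or a clean no-call path both pass.
      return observed === "respond_silent" || observed === "no_call";
    case "respond":
      // ASSUMPTION: legacy "respond" means an audible response.
      return observed === "respond_with_audio";
    default:
      // Remaining values are compared exactly.
      return expected === observed;
  }
}
```

Under these assumptions, a row expecting abstain passes whether the character stayed silent after an LLM call or never called the LLM at all, while a row expecting no_call passes only in the second case.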

Command-line tool (`convai-evals`)

The CLI validates, converts, and inspects datasets and reports. It does not run live evals (use the web app or scripts/run-batch.mjs for that).

convai-evals validate examples/scenarios/*.json
convai-evals convert input.csv --from legacy-rtvi-csv --out scenario.json
convai-evals run scenario.json --out runtime-rows.json
convai-evals report report.json
convai-evals telemetry-ids report.json --out telemetry-ids.json
convai-evals explain scenario.json
convai-evals generate-template --kind voice-text-mix --out scenario.json

Diagnostics

Diagnostics are disabled by default:

DIAG_PROVIDER=none

Optionally, analytics API lookups are supported:

DIAG_PROVIDER=analytics-api
CONVAI_API_KEY=...
CONVAI_ANALYTICS_BASE_URL=https://analytics-api.convai.com/v1/analytics

Reports preserve backend IDs when available so a follow-up agent can call Convai analytics APIs after a run.

Each row also carries a correlation block with a deterministic client_event_id, dispatch timestamps, and the attribution method used for response and transcript capture. Text and dynamic-context rows pass this metadata through the SDK data message so backend telemetry can join back to eval rows when supported.

Endpoint

Runs target the Convai prod endpoint. The URL is not bundled with the tool — you provide it at runtime: paste it into the Prod endpoint URL field in the web app, or set ENDPOINT_URL for the headless runner.

Data & credentials

This repo stores no credentials and no datasets. You supply your endpoint URL, character ID, API key, and dataset files at runtime (in the web form or via environment variables); none of them are written to disk in the repo. Cloud diagnostics are off by default. The bundled examples are synthetic and exist only to exercise the pipeline.

License

Apache-2.0.
