11 changes: 11 additions & 0 deletions .changeset/add-telemetry-toggle.md
@@ -0,0 +1,11 @@
---
"@loveholidays/eval-kit": minor
---

Add optional OpenTelemetry tracing support for evaluations, batch processing, and async metrics.

- Emit spans for `Evaluator.evaluate`, `BatchEvaluator.evaluate`, row processing, retries, and BERT/perplexity metrics
- `@opentelemetry/api` is an optional peer dependency — zero overhead when not installed
- Telemetry is disabled by default; call `enableTelemetry(true)` to opt in
- `isTelemetryEnabled()` getter for reading the current state
- Exported `withSpan` helper for custom evaluator instrumentation
27 changes: 27 additions & 0 deletions README.md
@@ -7,6 +7,7 @@ A TypeScript SDK for evaluating content quality using traditional metrics and AI
- **Traditional Metrics**: BLEU, TER, BERTScore, Coherence, Perplexity
- **AI-Powered Evaluation**: LLM-based evaluator with prompt templating (via Vercel AI SDK)
- **Batch Processing**: Concurrent execution, progress tracking, retry logic, CSV/JSON export
- **OpenTelemetry Tracing**: Optional distributed tracing with zero overhead when disabled

## Installation

@@ -87,6 +88,31 @@ await batchEvaluator.export({
});
```

### OpenTelemetry Tracing (Optional)

eval-kit has built-in OpenTelemetry support. Install the optional peer dependency to enable distributed tracing with your existing observability stack (Jaeger, Grafana Tempo, Datadog, etc.):

```bash
npm install @opentelemetry/api @opentelemetry/sdk-trace-node
```

Configure your OTel SDK as usual, then opt in with `enableTelemetry(true)`; eval-kit will emit spans from that point on:

```typescript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { SimpleSpanProcessor, ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';
import { enableTelemetry } from '@loveholidays/eval-kit';

// Set up OTel before using eval-kit
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();

// Opt in to eval-kit tracing (disabled by default)
enableTelemetry(true);

// Now all eval-kit operations produce traces
const result = await batchEvaluator.evaluate({ filePath: './data.csv' });
```

When `@opentelemetry/api` is not installed, all tracing is a no-op with zero overhead. See the [Telemetry Guide](./docs/TELEMETRY.md) for span details and custom evaluator instrumentation.

## Documentation

| Guide | Description |
@@ -95,6 +121,7 @@ await batchEvaluator.export({
| [Evaluator](./docs/EVALUATOR.md) | AI-powered evaluation and scoring |
| [Batch Evaluation](./docs/BATCH_EVALUATION_GUIDE.md) | Concurrent processing, progress tracking |
| [Export](./docs/EXPORT_GUIDE.md) | CSV and JSON export options |
| [Telemetry](./docs/TELEMETRY.md) | OpenTelemetry tracing and observability |

## Supported LLM Providers

217 changes: 217 additions & 0 deletions docs/TELEMETRY.md
@@ -0,0 +1,217 @@
# OpenTelemetry Telemetry Guide

eval-kit provides built-in [OpenTelemetry](https://opentelemetry.io/) tracing. When enabled, it emits spans for evaluations, batch processing, retries, and async metrics — giving you visibility into per-row latency, token usage, retry behavior, and where time is spent.

## Setup

### 1. Install dependencies

`@opentelemetry/api` is an optional peer dependency. Install it along with an SDK and exporter:

```bash
npm install @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base
```

### 2. Configure the OTel SDK

Set up a tracer provider **before** calling any eval-kit functions:

```typescript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { SimpleSpanProcessor, ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();
```

For production, replace `ConsoleSpanExporter` with your backend exporter (Jaeger, OTLP, Zipkin, etc.).

### 3. Enable eval-kit tracing

eval-kit tracing is disabled by default (see below). Once you call `enableTelemetry(true)`, eval-kit detects the registered provider and emits spans automatically; no other code changes are needed.

## Enabling and Disabling Telemetry

eval-kit's telemetry is **disabled by default**. To opt in:

```typescript
import { enableTelemetry } from '@loveholidays/eval-kit';

// Enable eval-kit tracing
enableTelemetry(true);

// Disable again if needed (OTel remains active for the rest of your app)
enableTelemetry(false);
```

When disabled, all tracing functions return no-ops with zero overhead — the same behaviour as when `@opentelemetry/api` is not installed.
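
Conceptually, the toggle is just a module-level flag that the tracing helpers consult before doing any OTel work. A minimal sketch of the pattern (hypothetical, not eval-kit's actual source):

```typescript
// Hypothetical sketch of the enable/disable toggle pattern.
let telemetryEnabled = false;

function enableTelemetry(enabled: boolean): void {
  telemetryEnabled = enabled;
}

function isTelemetryEnabled(): boolean {
  return telemetryEnabled;
}

// Instrumented code checks the flag before creating any span objects.
async function withSpanSketch<T>(_name: string, fn: () => Promise<T>): Promise<T> {
  if (!telemetryEnabled) {
    return fn(); // fast path: nothing is allocated, nothing is recorded
  }
  // ...start a real span here, run fn, record errors, end the span...
  return fn();
}
```

The real `withSpan` also records attributes and errors; the point of the sketch is only the cheap early return when tracing is off.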

## Zero overhead when not installed

When `@opentelemetry/api` is not installed, eval-kit uses no-op stubs internally. Every tracing call reduces to an empty function call, so the performance impact is negligible.
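
The fallback pattern can be sketched as follows; the loader function and stub shapes here are illustrative, not eval-kit's real internals:

```typescript
// Hypothetical sketch of the optional-peer-dependency fallback.
interface SpanStub {
  setAttribute(key: string, value: unknown): void;
  end(): void;
}

interface TracerStub {
  startSpan(name: string): SpanStub;
}

const noopSpan: SpanStub = {
  setAttribute() {},
  end() {},
};

// In eval-kit this would wrap require('@opentelemetry/api');
// the loader throws when the package is not installed.
function loadTracer(
  tryLoadApi: () => { trace: { getTracer(name: string): TracerStub } },
): TracerStub {
  try {
    return tryLoadApi().trace.getTracer('eval-kit');
  } catch {
    return { startSpan: () => noopSpan };
  }
}

// Simulate the package being absent:
const tracer = loadTracer(() => {
  throw new Error("Cannot find module '@opentelemetry/api'");
});

const span = tracer.startSpan('eval-kit.batch.evaluate');
span.setAttribute('eval_kit.batch.total_rows', 10);
span.end(); // every call is a no-op
```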

## Span Hierarchy

### Batch evaluation

```
eval-kit.batch.evaluate
├── eval-kit.batch.parse_input (file-based input only)
├── eval-kit.batch.process_row (per row)
│ ├── [retry event] (on retry attempts)
│ └── eval-kit.batch.run_evaluators
│ └── eval-kit.evaluator.evaluate (per evaluator)
└── eval-kit.batch.export (when export() is called)
```

### Single evaluation

```
eval-kit.evaluator.evaluate
```

### Async metrics (standalone)

```
eval-kit.metric.bert_score
eval-kit.metric.perplexity
```

## Span Attributes

### `eval-kit.evaluator.evaluate`

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.evaluator.name` | string | Evaluator name |
| `eval_kit.model.id` | string | Model identifier |
| `eval_kit.score_config.type` | string | `"numeric"` or `"categorical"` |
| `eval_kit.input.candidate_text_length` | number | Length of input text |
| `eval_kit.result.score` | number/string | Evaluation score |
| `eval_kit.result.execution_time_ms` | number | Wall clock time |
| `eval_kit.result.token_usage.input` | number | Input tokens consumed |
| `eval_kit.result.token_usage.output` | number | Output tokens generated |
| `eval_kit.result.token_usage.total` | number | Total tokens |
| `eval_kit.result.error` | string | Error message (on failure) |

### `eval-kit.batch.evaluate`

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.batch.id` | string | Unique batch ID |
| `eval_kit.batch.concurrency` | number | Max concurrent rows |
| `eval_kit.batch.execution_mode` | string | `"parallel"` or `"sequential"` |
| `eval_kit.batch.total_rows` | number | Total rows in input |
| `eval_kit.batch.successful_rows` | number | Rows completed successfully |
| `eval_kit.batch.failed_rows` | number | Rows that failed |

### `eval-kit.batch.process_row`

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.row.id` | string | Row identifier |
| `eval_kit.row.index` | number | Row index in input |
| `eval_kit.row.duration_ms` | number | Total time including retries |
| `eval_kit.row.retry_count` | number | Number of retry attempts |
| `eval_kit.result.error` | string | Error message (on failure) |

**Retry events** are recorded on this span with name `retry`:

| Event Attribute | Type | Description |
|----------------|------|-------------|
| `eval_kit.retry.attempt` | number | Retry attempt number (1-based) |
| `eval_kit.retry.delay_ms` | number | Delay before retry |
| `eval_kit.retry.error` | string | Error that triggered the retry |
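
The loop that produces these events can be sketched like this (hypothetical; the exponential backoff and 1-based attempt numbering are assumptions mirroring the attributes above):

```typescript
// Hypothetical retry loop that records one event per failed attempt.
type RetryEvent = { attempt: number; delayMs: number; error: string };

async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  baseDelayMs: number,
  onRetry: (event: RetryEvent) => void,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retries exhausted
      const delayMs = baseDelayMs * 2 ** attempt; // exponential backoff (assumption)
      onRetry({ attempt: attempt + 1, delayMs, error: String(err) });
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

In eval-kit, `onRetry` would correspond to recording a `retry` event on the active `eval-kit.batch.process_row` span.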

### `eval-kit.batch.run_evaluators`

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.evaluator_count` | number | Number of evaluators |
| `eval_kit.execution_mode` | string | `"parallel"` or `"sequential"` |

### `eval-kit.batch.parse_input`

Only created for file-based input (not in-memory data).

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.parse.input_format` | string | `"file"` |
| `eval_kit.parse.row_count` | number | Number of parsed rows |

### `eval-kit.batch.export`

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.export.format` | string | `"csv"` or `"json"` |
| `eval_kit.export.row_count` | number | Number of exported rows |

### `eval-kit.metric.bert_score` / `eval-kit.metric.perplexity`

| Attribute | Type | Description |
|-----------|------|-------------|
| `eval_kit.metric.name` | string | Metric name |
| `eval_kit.metric.model` | string | Model used |
| `eval_kit.result.score` | number | Computed score |

A `model_loaded` event is recorded when the model is loaded for the first time (cache miss).

## Custom Evaluator Instrumentation

If you implement `IEvaluator` and want your spans to appear in the trace hierarchy, use the exported `withSpan` helper:

```typescript
import { withSpan, type IEvaluator, type EvaluatorResult } from '@loveholidays/eval-kit';

const myEvaluator: IEvaluator = {
  name: 'custom-eval',
  async evaluate(input) {
    return withSpan(
      'my-app.custom-eval',
      { attributes: { 'my_app.evaluator.name': 'custom-eval' } },
      async (span) => {
        // Your evaluation logic here
        const score = await computeScore(input.candidateText);

        span.setAttribute('my_app.result.score', score);
        return {
          evaluatorName: 'custom-eval',
          score,
          feedback: 'Custom evaluation',
          processingStats: { executionTime: 0 },
        };
      },
    );
  },
};
```

Your `my-app.custom-eval` span will automatically appear as a child of `eval-kit.batch.run_evaluators` when used in a batch evaluation, because `withSpan` uses the active async context.

## Example: Full Setup with Jaeger

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
import { BatchEvaluator, enableTelemetry, Evaluator } from '@loveholidays/eval-kit';

// Configure OTel
const provider = new NodeTracerProvider({
  resource: new Resource({ [ATTR_SERVICE_NAME]: 'my-eval-pipeline' }),
});
provider.addSpanProcessor(
  new BatchSpanProcessor(new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }))
);
provider.register();

// Opt in to eval-kit tracing (disabled by default)
enableTelemetry(true);

// Run evaluation; spans are exported to Jaeger automatically
const batch = new BatchEvaluator({ evaluators: [evaluator], concurrency: 10 });
const result = await batch.evaluate({ filePath: './data.csv' });

// Flush spans before exit
await provider.shutdown();
```
11 changes: 11 additions & 0 deletions package.json
@@ -63,9 +63,20 @@
"lodash": ">=4.17.23"
}
},
  "peerDependencies": {
    "@opentelemetry/api": "^1.0.0"
  },
  "peerDependenciesMeta": {
    "@opentelemetry/api": {
      "optional": true
    }
  },
  "devDependencies": {
    "@biomejs/biome": "^2.3.7",
    "@changesets/cli": "^2.29.8",
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/sdk-trace-base": "^1.30.0",
    "@opentelemetry/sdk-trace-node": "^1.30.0",
    "@types/jest": "^30.0.0",
    "@types/node": "^24.10.1",
    "jest": "^30.2.0",