feat: Implement core evaluator files and Vocab implementation #12
Conversation
Pull request overview
This PR implements the core infrastructure for a TypeScript SDK for Learning Commons educational text complexity evaluators, specifically implementing the VocabularyEvaluator as the first concrete evaluator. The PR is part of a stacked PR series (#10-#14) building out the SDK functionality.
Changes:
- Introduces base evaluator infrastructure with telemetry, logging, validation, and error handling
- Implements VocabularyEvaluator using a 2-stage LLM process (background knowledge generation + complexity evaluation)
- Adds comprehensive test utilities supporting retry logic and parallel test execution for handling LLM non-determinism
- Provides LLM provider abstractions supporting OpenAI, Google Gemini, and Anthropic through Vercel AI SDK
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| src/evaluators/base.ts | Abstract base class providing common validation, telemetry, and logging for all evaluators |
| src/evaluators/vocabulary.ts | VocabularyEvaluator implementation using Gemini + GPT-4o in a 2-stage evaluation process |
| src/providers/*.ts | LLM provider abstractions with Vercel AI SDK integration |
| src/telemetry/*.ts | Anonymous telemetry client with configurable privacy controls |
| src/errors.ts | Comprehensive error hierarchy for validation, API, authentication, rate limiting, and network errors |
| src/logger.ts | Structured logging interface with configurable verbosity levels |
| src/features/readability.ts | Flesch-Kincaid grade level calculation using compromise NLP library |
| src/prompts/vocabulary/*.ts | Prompt template loading and grade-specific prompt selection |
| src/schemas/*.ts | Zod schemas for vocabulary complexity and evaluation results |
| src/utils/prompts.ts | Utility for loading prompt files from text files |
| tests/utils/*.ts | Reusable test utilities with retry logic for handling LLM non-determinism |
| tests/unit/**/*.ts | Unit tests for telemetry, readability, vocabulary evaluator, and validation |
| tests/integration/*.ts | Integration tests with real API calls for vocabulary evaluator |
| tests/README.md | Comprehensive test documentation covering patterns, configuration, and best practices |
| docs/telemetry.md | Telemetry documentation explaining data collection, privacy, and configuration |
| README.md | Main SDK documentation with installation, usage, error handling, and logging examples |
```typescript
evaluator = new VocabularyEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY!,
  openaiApiKey: process.env.OPENAI_API_KEY!,
  retry: false, // We handle retries in the test logic
```

The `retry` configuration option is used in the tests (line 100), but it is not defined in the `VocabularyEvaluatorConfig` or `BaseEvaluatorConfig` interfaces; the only retry-related option available is `maxRetries`. This will cause a type error under TypeScript's strict mode. Either rename this to `maxRetries: 0` or add a `retry` option to the configuration interface.

Suggested change:
```diff
- retry: false, // We handle retries in the test logic
+ maxRetries: 0, // We handle retries in the test logic
```
```typescript
import { loadPrompt } from '../../utils/prompts';
```

Same as the previous comment - missing `.js` extension in the import statement. Should be `import { loadPrompt } from '../../utils/prompts.js'`.

Suggested change:
```diff
- import { loadPrompt } from '../../utils/prompts';
+ import { loadPrompt } from '../../utils/prompts.js';
```
```typescript
import { loadPrompt } from '../../utils/prompts';
```

Same issue as previous files - missing `.js` extension in the import statement. Should be `import { loadPrompt } from '../../utils/prompts.js'`.

Suggested change:
```diff
- import { loadPrompt } from '../../utils/prompts';
+ import { loadPrompt } from '../../utils/prompts.js';
```
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.
alexgb
left a comment
This is a big one. I scanned everything, but didn't go super deep. Not sure how helpful these comments are, so open to feedback.
> We use telemetry data to improve evaluator quality, identify edge cases, and optimize performance. This helps us build better tools for our developer partners.
>
> Telemetry is **anonymous by default**. If you'd like to partner with us to improve your specific use case, you can optionally provide an API key (see Configuration section below). This allows us to connect with you and collaborate more deeply.
Do our terms of service cover this data? If so, should we include a link?
`sdks/typescript/docs/telemetry.md` (outdated):

> **By default, telemetry is enabled** and sends:
> - Performance metrics (latency, token usage)
> - Metadata (evaluator type, grade, SDK version)
> - **Input text** (the text you're evaluating)
I see you have an option to disable this field. I think it would be good to mention that here given some users might just turn off telemetry once they get this far.
+1. I'd disable if I thought it was all or nothing.
Should we have text / input collection off by default? Is that a Trust question?
`sdks/typescript/docs/telemetry.md` (outdated):

> - Metadata (evaluator type, grade, SDK version)
> - **Input text** (the text you're evaluating)
>
> We **never** collect your API keys (only a hashed identifier).
Which API key does this refer to? I'm assuming the API key LC provides the build partner? If so, the statement feels odd given build partners need to send us API keys when integrating with our platform.
I think we should steer clear of their LLM keys and just generate and save a UUID.
An additional consideration is that if we use their LLM keys for the hash, they cannot reset their anonymous identifier without changing their LLM key.
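A minimal sketch of the UUID approach, assuming the SDK can write a small cache file (the function name and cache path are hypothetical, not the SDK's current API):

```typescript
import { randomUUID } from 'node:crypto';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs';
import { dirname } from 'node:path';

// Hypothetical sketch: generate a random UUID once and persist it, so the
// anonymous identifier is stable across runs but can be reset by deleting
// the file. No LLM key is involved, so rotating keys has no effect.
export function getAnonymousClientId(cachePath: string): string {
  if (existsSync(cachePath)) {
    return readFileSync(cachePath, 'utf8').trim();
  }
  const id = randomUUID();
  mkdirSync(dirname(cachePath), { recursive: true });
  writeFileSync(cachePath, id, 'utf8');
  return id;
}
```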
```typescript
// Initialize telemetry if enabled
if (this.config.telemetry.enabled) {
  // Use all provider keys for client ID generation
```
This feels odd to me. I see the problem you're trying to solve, but it doesn't seem very reliable given these keys will rotate and be different across environments. Even though you're hashing the keys, it would still make me nervous if I saw it.
Could we just ask for an org identifier and environment name in the telemetry config? That seems like a better experience for the user, and we potentially get a more reliable way to associate telemetry events with a build partner. We could make these required configs; users would have the ability to enter a random string if they don't want their true identity associated with telemetry events.
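A sketch of what that config shape could look like (the names are assumptions, not the SDK's current API):

```typescript
// Hypothetical telemetry config: the build partner supplies a stable
// organization identifier and environment name, instead of the SDK deriving
// an ID from rotating provider keys. A random string is allowed if the
// partner prefers to stay anonymous.
export interface TelemetryConfig {
  enabled: boolean;
  /** Stable identifier chosen by the build partner (may be a random string). */
  orgId: string;
  /** Deployment environment, e.g. "staging" or "production". */
  environment: string;
}

export function telemetryClientId(config: TelemetryConfig): string {
  // Combine the two into a single stable identifier for telemetry events.
  return `${config.orgId}:${config.environment}`;
}
```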
```typescript
// Validate required API keys
if (!config.googleApiKey) {
  throw new ValidationError('Google API key is required. Pass googleApiKey in config.');
```
Above we throw ValidationError in non-fatal scenarios that a host application may want to catch, but this feels like a fatal error. Do we want to distinguish between runtime validation errors and configuration errors by error type?
I'd think of this as a ConfigurationError. ValidationError gives me the impression that the issue is related to inputs.
Will we give users or ourselves a way to distinguish between retryable vs non-retryable errors?
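One possible shape for that split, with a `retryable` flag to answer the second question (the class names are assumptions, not the SDK's current error hierarchy):

```typescript
// Sketch: base error carries a retryable flag so callers can branch on it.
export class EvaluatorError extends Error {
  constructor(message: string, public readonly retryable: boolean) {
    super(message);
    this.name = new.target.name; // subclass name, e.g. "ConfigurationError"
  }
}

// Fatal misconfiguration: fix the config and restart, never retry.
export class ConfigurationError extends EvaluatorError {
  constructor(message: string) {
    super(message, false);
  }
}

// Transient failure (e.g. 429): safe to retry after a backoff.
export class RateLimitError extends EvaluatorError {
  constructor(message: string) {
    super(message, true);
  }
}
```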
```typescript
score: complexityResponse.data.complexity_score,
reasoning: complexityResponse.data.reasoning,
metadata: {
  promptVersion: '1.0',
```
For later, this seems like something that should be tracked as metadata alongside the prompt. Or maybe we should `import { version } from './package.json'`.
````
 *
 * @example
 * ```typescript
 * const result = await evaluateVocabulary(
````
Is there a world where we expose an API that looks like the following to avoid passing static configuration to these function calls?
```typescript
const evaluator = createEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY,
  openaiApiKey: process.env.OPENAI_API_KEY,
});

const result = await evaluator.evaluateVocabulary(text, "3");
```
```typescript
return BACKGROUND_KNOWLEDGE_TEMPLATE
  .replaceAll('{grade}', grade)
  .replaceAll('{text}', text);
}
```
Noting that you're doing something here that was similar to what we did in the KG evals project. We used a templating library that supported arbitrary blocks to create our own control flow and put frontmatter at the top of the template to support metadata tracking (like your prompt version above). Example: https://github.com/chanzuckerberg/edu-kg-evals/blob/main/test_cases/example-auto-score.md?plain=1
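For illustration, a minimal hand-rolled version of the frontmatter idea (the format and field names are assumptions; a real implementation would likely use a templating/frontmatter library as in the linked project):

```typescript
// Hypothetical template: metadata (e.g. prompt version) lives in a
// frontmatter block at the top of the file, above the template body.
const TEMPLATE = `---
version: 1.0
---
Explain {term} to a grade {grade} student.`;

export function parseTemplate(raw: string): { meta: Record<string, string>; body: string } {
  const match = /^---\n([\s\S]*?)\n---\n([\s\S]*)$/.exec(raw);
  if (!match) return { meta: {}, body: raw }; // no frontmatter present
  const meta: Record<string, string> = {};
  for (const line of match[1].split('\n')) {
    const [key, ...rest] = line.split(':');
    meta[key.trim()] = rest.join(':').trim();
  }
  return { meta, body: match[2] };
}
```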
```typescript
const modelId = requestModel || this.config.model || this.getDefaultModel();
const apiKey = this.config.apiKey;

switch (this.config.type) {
```
You have a separate switch-like data structure in DEFAULT_MODELS. Maybe these should be combined in one place?
Separately, it strikes me as odd that a user would specify a model by provider name but not care which actual model family they get back: "google" defaults to a high-tier pro model while "anthropic" defaults to a mid-tier model (Sonnet). But I think I might be missing something about how this config is used.
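A sketch of the combined registry (the registry shape is an assumption and the default model IDs are illustrative):

```typescript
// Sketch: one registry replaces the parallel switch + DEFAULT_MODELS
// structures, so provider defaults live in a single place.
type ProviderName = 'google' | 'openai' | 'anthropic';

const PROVIDERS: Record<ProviderName, { defaultModel: string }> = {
  google: { defaultModel: 'gemini-2.5-pro' },   // model IDs illustrative
  openai: { defaultModel: 'gpt-4o' },
  anthropic: { defaultModel: 'claude-sonnet-4' },
};

export function resolveModel(provider: ProviderName, requested?: string): string {
  // An explicitly requested model always wins over the provider default.
  return requested ?? PROVIDERS[provider].defaultModel;
}
```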
```typescript
/**
 * Number of retries for this stage
 *
 * IMPORTANT: Currently set to -1 (unknown) because Vercel AI SDK doesn't expose
```
How important is this datapoint? If too hard to capture, maybe we just remove it from the telemetry?
```typescript
/** Debug messages - very verbose, for development */
DEBUG = 0,
/** Informational messages - normal operations */
INFO = 1,
```
@adnanrhussain - Ensure this is INFO by default and can be silenced by config.
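A minimal sketch of that behavior (the `silent` flag and `SILENT` level are assumptions about the eventual config):

```typescript
// Sketch: INFO is the default level; a config flag silences all output.
enum LogLevel { DEBUG = 0, INFO = 1, WARN = 2, ERROR = 3, SILENT = 4 }

export function effectiveLevel(config: { logLevel?: LogLevel; silent?: boolean } = {}): LogLevel {
  if (config.silent) return LogLevel.SILENT;
  // ?? (not ||) so an explicit DEBUG (0) is respected.
  return config.logLevel ?? LogLevel.INFO;
}
```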
czi-fsisenda
left a comment
Looks good! A number of comments for consideration. None blocking though.
You mentioned that some of the topics we discussed would be addressed in other PRs.
> **By default, telemetry is enabled** and sends:
> - Performance metrics (latency, token usage)
> - Metadata (evaluator type, grade, SDK version)
Should we send input size or can we infer that from token usage? So even if we don't get the actual text, we can get the text length.
Actually, I see text length is included.
```json
  },
  "input_text": "The mitochondria is the powerhouse of the cell...",
  "metadata": {
    "stage_details": [
```
nit: Phase may be a better name for this. When I hear stage, I think deployment stage.
| Field | Description |
|-------|-------------|
| `timestamp` | ISO 8601 timestamp when evaluation started |
| `sdk_version` | Version of the SDK (e.g., "0.1.0") |
| `evaluator_type` | Which evaluator ran (e.g., "vocabulary", "sentence-structure") |
If we have multiple versions of the same evaluator in an SDK version would the eval version be captured in the eval name or should we have an eval version field too?
`sdks/typescript/docs/telemetry.md` (outdated):

```typescript
const evaluator = new VocabularyEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY!,
  openaiApiKey: process.env.OPENAI_API_KEY!,
  apiKey: process.env.LEARNING_COMMONS_API_KEY!, // Contact us for a key
```
How about learningCommonsApiKey? Clearer name.
```typescript
/**
 * Maximum number of retries for failed API calls (default: 2)
 * Set to 0 to disable retries.
 *
 * Note: With maxRetries=2, a failed call will be attempted up to 3 times total
 * (1 initial attempt + 2 retries)
 */
maxRetries?: number;
```
Current evaluations can take a while. We should consider letting users handle retry logic or default to 0 retries. Excluding retry logic simplifies our work.
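If the SDK defaults to zero retries, callers who want retries can wrap the call themselves; a hypothetical user-side helper:

```typescript
// Sketch: generic retry wrapper the user controls, so the SDK itself can
// make a single attempt. Attempt count and delay are caller-chosen.
export async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, except after the final failure.
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw lastError;
}
```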
```typescript
const complexityProviderName = (grade === '3' || grade === '4')
  ? 'google:gemini-2.5-pro'
  : 'openai:gpt-4.1-2025-04-14';
```
We should just think of vocabulary_3-4 as a different eval from vocabulary_k-2_5-12 to avoid this type of complexity. Then we'd have one switch at the input to select the evaluator, but implementations will be simpler.
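The split could reduce to a single selector at the entry point; a sketch (the evaluator names come from this thread, the function itself is hypothetical):

```typescript
// Sketch: one switch at the input chooses the evaluator, so each
// implementation stays free of grade-based branching.
type Grade =
  | 'K' | '1' | '2' | '3' | '4' | '5' | '6'
  | '7' | '8' | '9' | '10' | '11' | '12';

export function selectVocabularyEvaluator(
  grade: Grade
): 'vocabulary_3-4' | 'vocabulary_k-2_5-12' {
  return grade === '3' || grade === '4' ? 'vocabulary_3-4' : 'vocabulary_k-2_5-12';
}
```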
For integration testing an NPM library, a truer test would run against the NPM package rather than the implementation. Not sure how complicated that would be to set up.
And also for integration tests, we normally shouldn't need very elaborate tests. The primary purpose should be checking that we can reach the integrating service and then unit tests should be exhaustive. With LLMs though, it feels like we want to check that the LLM is still returning the expected result. But then that isn't really an integration test. It's more a synthetic test that happens to run as part of CI/CD.
This is good. Just wondering if we should always have elaborate integration tests as we add evals or just a single evaluation will be good enough.
```typescript
evaluatorVersion?: string;
promptVersion: string;
```
Wondering if these should be separate. We should update the eval version any time the prompt changes.
Summary
This PR introduces the key components for the SDK and uses them for a VocabularyEvaluator.
Documentation
See `sdks/typescript/README.md`
Testing
SDK Feature PR Index