
feat: Implement core evaluator files and Vocab implementation#12

Merged
adnanrhussain merged 10 commits into ahussain/sdk_typescript from ahussain/sdk_3_core_and_vocab
Mar 3, 2026

Conversation

Contributor

@adnanrhussain adnanrhussain commented Feb 6, 2026

Summary

This PR introduces the key components for the SDK and uses them to implement a VocabularyEvaluator.

Documentation

See sdks/typescript/README.md

Testing

  • CI workflow should build successfully

SDK Feature PR Index


Copilot AI left a comment


Pull request overview

This PR implements the core infrastructure for a TypeScript SDK for Learning Commons educational text complexity evaluators, specifically implementing the VocabularyEvaluator as the first concrete evaluator. The PR is part of a stacked PR series (#10-#14) building out the SDK functionality.

Changes:

  • Introduces base evaluator infrastructure with telemetry, logging, validation, and error handling
  • Implements VocabularyEvaluator using a 2-stage LLM process (background knowledge generation + complexity evaluation)
  • Adds comprehensive test utilities supporting retry logic and parallel test execution for handling LLM non-determinism
  • Provides LLM provider abstractions supporting OpenAI, Google Gemini, and Anthropic through Vercel AI SDK

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 17 comments.

| File | Description |
|------|-------------|
| src/evaluators/base.ts | Abstract base class providing common validation, telemetry, and logging for all evaluators |
| src/evaluators/vocabulary.ts | VocabularyEvaluator implementation using Gemini + GPT-4o in a 2-stage evaluation process |
| src/providers/*.ts | LLM provider abstractions with Vercel AI SDK integration |
| src/telemetry/*.ts | Anonymous telemetry client with configurable privacy controls |
| src/errors.ts | Comprehensive error hierarchy for validation, API, authentication, rate limiting, and network errors |
| src/logger.ts | Structured logging interface with configurable verbosity levels |
| src/features/readability.ts | Flesch-Kincaid grade level calculation using compromise NLP library |
| src/prompts/vocabulary/*.ts | Prompt template loading and grade-specific prompt selection |
| src/schemas/*.ts | Zod schemas for vocabulary complexity and evaluation results |
| src/utils/prompts.ts | Utility for loading prompt files from text files |
| tests/utils/*.ts | Reusable test utilities with retry logic for handling LLM non-determinism |
| tests/unit/**/*.ts | Unit tests for telemetry, readability, vocabulary evaluator, and validation |
| tests/integration/*.ts | Integration tests with real API calls for vocabulary evaluator |
| tests/README.md | Comprehensive test documentation covering patterns, configuration, and best practices |
| docs/telemetry.md | Telemetry documentation explaining data collection, privacy, and configuration |
| README.md | Main SDK documentation with installation, usage, error handling, and logging examples |


```typescript
evaluator = new VocabularyEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY!,
  openaiApiKey: process.env.OPENAI_API_KEY!,
  retry: false, // We handle retries in the test logic
```

Copilot AI Feb 19, 2026


The retry configuration option is being used in tests (line 100), but this option is not defined in the VocabularyEvaluatorConfig or BaseEvaluatorConfig interfaces. The only retry-related option available is maxRetries. This will cause a type error when using TypeScript's strict mode. Either rename this to maxRetries: 0 or add a retry option to the configuration interface.

Suggested change:

```diff
-retry: false, // We handle retries in the test logic
+maxRetries: 0, // We handle retries in the test logic
```
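A minimal sketch of the `maxRetries` semantics documented in this PR (1 initial attempt + N retries); `withRetries` is an illustrative helper, not part of the SDK:

```typescript
// Hypothetical sketch: with maxRetries = N, a failing call is attempted
// up to N + 1 times total; maxRetries = 0 means a single attempt.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // remember the failure and retry if attempts remain
    }
  }
  throw lastError; // all attempts exhausted
}
```

Under these semantics, passing `maxRetries: 0` in tests cleanly disables SDK-level retries while keeping test-level retry logic in control.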

```diff
@@ -0,0 +1,28 @@
import { loadPrompt } from '../../utils/prompts';
```

Copilot AI Feb 19, 2026


Same as the previous comment - missing .js extension in the import statement. Should be import { loadPrompt } from '../../utils/prompts.js'.

Suggested change:

```diff
-import { loadPrompt } from '../../utils/prompts';
+import { loadPrompt } from '../../utils/prompts.js';
```

```diff
@@ -0,0 +1,39 @@
import { loadPrompt } from '../../utils/prompts';
```

Copilot AI Feb 19, 2026


Same issue as previous files - missing .js extension in the import statement. Should be import { loadPrompt } from '../../utils/prompts.js'.

Suggested change:

```diff
-import { loadPrompt } from '../../utils/prompts';
+import { loadPrompt } from '../../utils/prompts.js';
```


Copilot AI left a comment


Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.




@alexgb alexgb left a comment


This is a big one. I scanned everything, but didn't go super deep. Not sure how helpful these comments are, so open to feedback.


We use telemetry data to improve evaluator quality, identify edge cases, and optimize performance. This helps us build better tools for our developer partners.

Telemetry is **anonymous by default**. If you'd like to partner with us to improve your specific use case, you can optionally provide an API key (see Configuration section below). This allows us to connect with you and collaborate more deeply.


Do our terms of service cover this data? If so, should we include a link?

**By default, telemetry is enabled** and sends:
- Performance metrics (latency, token usage)
- Metadata (evaluator type, grade, SDK version)
- **Input text** (the text you're evaluating)


I see you have an option to disable this field. I think it would be good to mention that here given some users might just turn off telemetry once they get this far.


+1. I'd disable if I thought it was all or nothing.
Should we have text / input collection off by default? Is that a Trust question?

- Metadata (evaluator type, grade, SDK version)
- **Input text** (the text you're evaluating)

We **never** collect your API keys (only a hashed identifier).


Which API key does this refer to? I'm assuming the API key LC provides the build partner? If so, the statement feels odd given build partners need to send us API keys when integrating with our platform.


I think we should steer clear of their LLM keys and just generate and save a UUID.
An additional consideration is that if we use their LLM keys for the hash, they cannot reset their anonymous identifier without changing their LLM key.


```typescript
// Initialize telemetry if enabled
if (this.config.telemetry.enabled) {
  // Use all provider keys for client ID generation
```


This feels odd to me. I see the problem you're trying to solve, but it doesn't seem very reliable given these keys will rotate and be different across environments. Even though you're hashing the keys, it would still make me nervous if I saw it.

Could we just ask for a org identifier and environment name configs in telemetry? Seems better experience for user and we potentially get more reliable way to associate telemetry events to a build partner. We could make these required configs. They would have ability to enter a random string if they don't want their true identity associated with telemetry events.
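A hypothetical shape for this suggestion (names `TelemetryConfig`, `orgId`, `environment`, and `telemetryClientId` are illustrative, not the SDK's actual API): identify telemetry by explicit, user-supplied fields rather than a hash of rotating LLM API keys.

```typescript
// Hypothetical telemetry config: explicit identity fields instead of key hashes.
interface TelemetryConfig {
  enabled: boolean;
  /** Stable organization identifier; any string, including a random one. */
  orgId: string;
  /** Deployment environment, e.g. "production" or "staging". */
  environment: string;
}

// Deterministic client ID that survives key rotation across environments.
function telemetryClientId(config: TelemetryConfig): string {
  return `${config.orgId}:${config.environment}`;
}
```

This gives the user control over their identifier (they can reset it at will) and gives the telemetry backend a stable join key per partner and environment.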


```typescript
// Validate required API keys
if (!config.googleApiKey) {
  throw new ValidationError('Google API key is required. Pass googleApiKey in config.');
```


Above we throw ValidationError in non-fatal scenarios that a host application may want to catch, but this feels like a fatal error. Do we want to distinguish between runtime validation errors and configuration errors by error type?


I'd think of this as a ConfigurationError. ValidationError gives me the impression that the issue is related to inputs.
Will we give users or ourselves a way to distinguish between retryable vs non-retryable errors?
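A sketch of the distinction the reviewers propose (class names `EvaluatorError`, `ConfigurationError`, `InputValidationError`, `RateLimitError`, and the `retryable` flag are hypothetical, not the SDK's current hierarchy):

```typescript
// Hypothetical error hierarchy: configuration vs input errors, plus a
// retryable flag so callers can tell transient failures apart.
class EvaluatorError extends Error {
  constructor(message: string, readonly retryable: boolean = false) {
    super(message);
    this.name = new.target.name; // subclasses report their own class name
  }
}

class ConfigurationError extends EvaluatorError {} // fatal: bad setup, never retryable
class InputValidationError extends EvaluatorError {} // bad input text or options

class RateLimitError extends EvaluatorError {
  constructor(message: string) {
    super(message, true); // transient: safe to retry after backoff
  }
}
```

A host application could then fail fast on `ConfigurationError` at startup while catching and retrying only errors where `err.retryable` is true.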

```typescript
score: complexityResponse.data.complexity_score,
reasoning: complexityResponse.data.reasoning,
metadata: {
  promptVersion: '1.0',
```


For later, this seems like something that should be tracked as metadata alongside the prompt. Or, maybe we should import { version } from './package.json'.

*
* @example
* ```typescript
* const result = await evaluateVocabulary(


Is there a world where we expose an API that looks like the following to avoid passing static configuration to these function calls?

```typescript
const evaluator = createEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY,
  openaiApiKey: process.env.OPENAI_API_KEY,
});

const result = await evaluator.evaluateVocabulary(text, "3");
```

```typescript
return BACKGROUND_KNOWLEDGE_TEMPLATE
  .replaceAll('{grade}', grade)
  .replaceAll('{text}', text);
}
```


Noting that you're doing something here that was similar to what we did in the KG evals project. We used a templating library that supported arbitrary blocks to create our own control flow and put frontmatter at the top of the template to support metadata tracking (like your prompt version above). Example: https://github.com/chanzuckerberg/edu-kg-evals/blob/main/test_cases/example-auto-score.md?plain=1

```typescript
const modelId = requestModel || this.config.model || this.getDefaultModel();
const apiKey = this.config.apiKey;

switch (this.config.type) {
```


You have a separate switch like data structure in DEFAULT_MODELS. Maybe these should be combined in one place?

Separately, it strikes me as odd that a user would specify a model by the provider name, but not care which actual model family they get back. "google" defaults to a high-tier pro model and "anthropic" defaults to a mid-tier model (sonnet). But I think I might be missing something about how this config is used.
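A hypothetical consolidation of the switch and DEFAULT_MODELS into a single lookup table (the model IDs below are illustrative placeholders, not the SDK's actual defaults):

```typescript
// One table owns both the provider list and each provider's default model.
type ProviderType = 'openai' | 'google' | 'anthropic';

const PROVIDER_DEFAULTS: Record<ProviderType, string> = {
  openai: 'gpt-4o',            // illustrative default
  google: 'gemini-2.5-pro',    // illustrative default
  anthropic: 'claude-sonnet-4' // illustrative default
};

function resolveModel(type: ProviderType, requested?: string): string {
  return requested ?? PROVIDER_DEFAULTS[type];
}
```

Keeping defaults in one place also makes the tier mismatch the reviewer notes (pro-tier for one provider, mid-tier for another) easy to spot and audit.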

```typescript
/**
 * Number of retries for this stage
 *
 * IMPORTANT: Currently set to -1 (unknown) because Vercel AI SDK doesn't expose
```


How important is this datapoint? If too hard to capture, maybe we just remove it from the telemetry?

```typescript
/** Debug messages - very verbose, for development */
DEBUG = 0,
/** Informational messages - normal operations */
INFO = 1,
```
Contributor Author


@adnanrhussain - Ensure this is INFO by default and can be silenced by config.
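A sketch of that behavior (the `SILENT` level and `shouldLog` helper are hypothetical additions, not the SDK's current logger): default the threshold to INFO and let config raise it until nothing is emitted.

```typescript
// Hypothetical log-level filter: messages at or above the configured
// threshold are emitted; SILENT sits above ERROR so nothing passes it.
enum LogLevel {
  DEBUG = 0,
  INFO = 1,
  WARN = 2,
  ERROR = 3,
  SILENT = 4,
}

function shouldLog(messageLevel: LogLevel, configured: LogLevel = LogLevel.INFO): boolean {
  return messageLevel >= configured;
}
```

With the default of `LogLevel.INFO`, DEBUG chatter is suppressed out of the box, and `configured: LogLevel.SILENT` disables all output.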


@czi-fsisenda czi-fsisenda left a comment


Looks good! A number of comments for consideration. None blocking though.
You mentioned that some of the topics we discussed would be addressed in other PRs.


**By default, telemetry is enabled** and sends:
- Performance metrics (latency, token usage)
- Metadata (evaluator type, grade, SDK version)


Should we send input size, or can we infer that from token usage? That way, even if we don't get the actual text, we still get the text length.


Actually, I see text length is included.

```
},
"input_text": "The mitochondria is the powerhouse of the cell...",
"metadata": {
  "stage_details": [
```


nit: Phase may be a better name for this. When I hear stage, I think deployment stage.

| Field | Description |
|-------|-------------|
| `timestamp` | ISO 8601 timestamp when evaluation started |
| `sdk_version` | Version of the SDK (e.g., "0.1.0") |
| `evaluator_type` | Which evaluator ran (e.g., "vocabulary", "sentence-structure") |


If we have multiple versions of the same evaluator in an SDK version would the eval version be captured in the eval name or should we have an eval version field too?

```typescript
const evaluator = new VocabularyEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY!,
  openaiApiKey: process.env.OPENAI_API_KEY!,
  apiKey: process.env.LEARNING_COMMONS_API_KEY!, // Contact us for a key
```


How about learningCommonsApiKey? Clearer name.

Comment on lines +45 to +52
```typescript
/**
 * Maximum number of retries for failed API calls (default: 2)
 * Set to 0 to disable retries.
 *
 * Note: With maxRetries=2, a failed call will be attempted up to 3 times total
 * (1 initial attempt + 2 retries)
 */
maxRetries?: number;
```


Current evaluations can take a while. We should consider letting users handle retry logic or default to 0 retries. Excluding retry logic simplifies our work.



Comment on lines +134 to +136
```typescript
const complexityProviderName = (grade === '3' || grade === '4')
  ? 'google:gemini-2.5-pro'
  : 'openai:gpt-4.1-2025-04-14';
```


We should just think of vocabulary_3-4 as a different eval from vocabulary_k-2_5-12 to avoid this type of complexity. Then we'd have one switch at the input to select the evaluator, but implementations will be simpler.
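A sketch of that idea (the evaluator names and `selectVocabularyEvaluator` are hypothetical): a single switch at the entry point picks the evaluator, and each implementation stays provider-homogeneous.

```typescript
// Hypothetical entry-point dispatch: grades 3-4 route to their own evaluator,
// everything else to the K-2 / 5-12 evaluator.
function selectVocabularyEvaluator(grade: string): string {
  return grade === '3' || grade === '4' ? 'vocabulary_3_4' : 'vocabulary_k_2_5_12';
}
```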


For integration testing an NPM library, a truer test would run against the published NPM package rather than the implementation. Not sure how complicated that would be to set up.
Also, for integration tests we normally shouldn't need very elaborate tests. The primary purpose should be checking that we can reach the integrated service; unit tests should then be exhaustive. With LLMs, though, it feels like we want to check that the LLM is still returning the expected result. But then that isn't really an integration test; it's more a synthetic test that happens to run as part of CI/CD.
This is good. Just wondering whether we should always have elaborate integration tests as we add evals, or whether a single evaluation will be good enough.


These are placeholders?

Comment on lines +30 to +31
```typescript
evaluatorVersion?: string;
promptVersion: string;
```


Wondering if these should be separate. We should update the eval version any time the prompt changes.

adnanrhussain force-pushed the ahussain/sdk_2_infra branch 2 times, most recently from 15a4c50 to ab29225 on March 2, 2026 at 19:13
Base automatically changed from ahussain/sdk_2_infra to ahussain/sdk_typescript on March 2, 2026 at 23:20
adnanrhussain force-pushed the ahussain/sdk_3_core_and_vocab branch from 837bbf7 to e88a60e on March 2, 2026 at 23:26
adnanrhussain merged commit 35f473c into ahussain/sdk_typescript on Mar 3, 2026
9 checks passed
adnanrhussain deleted the ahussain/sdk_3_core_and_vocab branch on March 3, 2026 at 06:12