feat: Implement core evaluator files and Vocab implementation #12
Conversation
Pull request overview
This PR implements the core infrastructure for a TypeScript SDK for Learning Commons educational text complexity evaluators, specifically implementing the VocabularyEvaluator as the first concrete evaluator. The PR is part of a stacked PR series (#10-#14) building out the SDK functionality.
Changes:
- Introduces base evaluator infrastructure with telemetry, logging, validation, and error handling
- Implements VocabularyEvaluator using a 2-stage LLM process (background knowledge generation + complexity evaluation)
- Adds comprehensive test utilities supporting retry logic and parallel test execution for handling LLM non-determinism
- Provides LLM provider abstractions supporting OpenAI, Google Gemini, and Anthropic through Vercel AI SDK
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| src/evaluators/base.ts | Abstract base class providing common validation, telemetry, and logging for all evaluators |
| src/evaluators/vocabulary.ts | VocabularyEvaluator implementation using Gemini + GPT-4o in a 2-stage evaluation process |
| src/providers/*.ts | LLM provider abstractions with Vercel AI SDK integration |
| src/telemetry/*.ts | Anonymous telemetry client with configurable privacy controls |
| src/errors.ts | Comprehensive error hierarchy for validation, API, authentication, rate limiting, and network errors |
| src/logger.ts | Structured logging interface with configurable verbosity levels |
| src/features/readability.ts | Flesch-Kincaid grade level calculation using compromise NLP library |
| src/prompts/vocabulary/*.ts | Prompt template loading and grade-specific prompt selection |
| src/schemas/*.ts | Zod schemas for vocabulary complexity and evaluation results |
| src/utils/prompts.ts | Utility for loading prompt files from text files |
| tests/utils/*.ts | Reusable test utilities with retry logic for handling LLM non-determinism |
| tests/unit/**/*.ts | Unit tests for telemetry, readability, vocabulary evaluator, and validation |
| tests/integration/*.ts | Integration tests with real API calls for vocabulary evaluator |
| tests/README.md | Comprehensive test documentation covering patterns, configuration, and best practices |
| docs/telemetry.md | Telemetry documentation explaining data collection, privacy, and configuration |
| README.md | Main SDK documentation with installation, usage, error handling, and logging examples |
```typescript
evaluator = new VocabularyEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY!,
  openaiApiKey: process.env.OPENAI_API_KEY!,
  retry: false, // We handle retries in the test logic
```

The `retry` configuration option is used in the tests (line 100), but it is not defined in the `VocabularyEvaluatorConfig` or `BaseEvaluatorConfig` interfaces; the only retry-related option available is `maxRetries`. This will cause a type error under TypeScript's strict mode. Either rename this to `maxRetries: 0` or add a `retry` option to the configuration interface.

Suggested change:
```diff
- retry: false, // We handle retries in the test logic
+ maxRetries: 0, // We handle retries in the test logic
```
```typescript
import { loadPrompt } from '../../utils/prompts';
```

Same as the previous comment - missing `.js` extension in the import statement. Should be `import { loadPrompt } from '../../utils/prompts.js'`.

Suggested change:
```diff
- import { loadPrompt } from '../../utils/prompts';
+ import { loadPrompt } from '../../utils/prompts.js';
```
```typescript
import { loadPrompt } from '../../utils/prompts';
```

Same issue as previous files - missing `.js` extension in the import statement. Should be `import { loadPrompt } from '../../utils/prompts.js'`.

Suggested change:
```diff
- import { loadPrompt } from '../../utils/prompts';
+ import { loadPrompt } from '../../utils/prompts.js';
```
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.
alexgb
left a comment
This is a big one. I scanned everything, but didn't go super deep. Not sure how helpful these comments are, so open to feedback.
> We use telemetry data to improve evaluator quality, identify edge cases, and optimize performance. This helps us build better tools for our developer partners.
>
> Telemetry is **anonymous by default**. If you'd like to partner with us to improve your specific use case, you can optionally provide an API key (see Configuration section below). This allows us to connect with you and collaborate more deeply.
Do our terms of service cover this data? If so, should we include a link?
`sdks/typescript/docs/telemetry.md` (outdated):

> **By default, telemetry is enabled** and sends:
> - Performance metrics (latency, token usage)
> - Metadata (evaluator type, grade, SDK version)
> - **Input text** (the text you're evaluating)
I see you have an option to disable this field. I think it would be good to mention that here given some users might just turn off telemetry once they get this far.
+1. I'd disable if I thought it was all or nothing.
Should we have text / input collection off by default? Is that a Trust question?
`sdks/typescript/docs/telemetry.md` (outdated):

> - Metadata (evaluator type, grade, SDK version)
> - **Input text** (the text you're evaluating)
>
> We **never** collect your API keys (only a hashed identifier).
Which API key does this refer to? I'm assuming the API key LC provides the build partner? If so, the statement feels odd given build partners need to send us API keys when integrating with our platform.
I think we should steer clear of their LLM keys and just generate and save a UUID.
An additional consideration is that if we use their LLM keys for the hash, they cannot reset their anonymous identifier without changing their LLM key.
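A minimal sketch of the UUID approach, assuming the SDK can write a small cache file (the function name and cache path are hypothetical, not the SDK's current API):

```typescript
import { randomUUID } from 'node:crypto';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs';
import { dirname } from 'node:path';

// Hypothetical sketch: generate a random UUID once and persist it, so the
// anonymous identifier is stable across runs but can be reset by deleting
// the file. No LLM key is involved, so rotating keys has no effect.
export function getAnonymousClientId(cachePath: string): string {
  if (existsSync(cachePath)) {
    return readFileSync(cachePath, 'utf8').trim();
  }
  const id = randomUUID();
  mkdirSync(dirname(cachePath), { recursive: true });
  writeFileSync(cachePath, id, 'utf8');
  return id;
}
```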
```typescript
// Initialize telemetry if enabled
if (this.config.telemetry.enabled) {
  // Use all provider keys for client ID generation
```
This feels odd to me. I see the problem you're trying to solve, but it doesn't seem very reliable given these keys will rotate and be different across environments. Even though you're hashing the keys, it would still make me nervous if I saw it.
Could we just ask for an org identifier and environment name in the telemetry config? That seems like a better experience for the user, and we potentially get a more reliable way to associate telemetry events with a build partner. We could make these required configs; users would have the ability to enter a random string if they don't want their true identity associated with telemetry events.
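A sketch of what that config shape could look like (the names are assumptions, not the SDK's current API):

```typescript
// Hypothetical telemetry config: the build partner supplies a stable
// organization identifier and environment name, instead of the SDK deriving
// an ID from rotating provider keys. A random string is allowed if the
// partner prefers to stay anonymous.
export interface TelemetryConfig {
  enabled: boolean;
  /** Stable identifier chosen by the build partner (may be a random string). */
  orgId: string;
  /** Deployment environment, e.g. "staging" or "production". */
  environment: string;
}

export function telemetryClientId(config: TelemetryConfig): string {
  // Combine the two into a single stable identifier for telemetry events.
  return `${config.orgId}:${config.environment}`;
}
```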
```typescript
// Validate required API keys
if (!config.googleApiKey) {
  throw new ValidationError('Google API key is required. Pass googleApiKey in config.');
```
Above we throw ValidationError in non-fatal scenarios that a host application may want to catch, but this feels like a fatal error. Do we want to distinguish between runtime validation errors and configuration errors by error type?
I'd think of this as a ConfigurationError. ValidationError gives me the impression that the issue is related to inputs.
Will we give users or ourselves a way to distinguish between retryable vs non-retryable errors?
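One possible shape for that split, with a `retryable` flag to answer the second question (the class names are assumptions, not the SDK's current error hierarchy):

```typescript
// Sketch: base error carries a retryable flag so callers can branch on it.
export class EvaluatorError extends Error {
  constructor(message: string, public readonly retryable: boolean) {
    super(message);
    this.name = new.target.name; // subclass name, e.g. "ConfigurationError"
  }
}

// Fatal misconfiguration: fix the config and restart, never retry.
export class ConfigurationError extends EvaluatorError {
  constructor(message: string) {
    super(message, false);
  }
}

// Transient failure (e.g. 429): safe to retry after a backoff.
export class RateLimitError extends EvaluatorError {
  constructor(message: string) {
    super(message, true);
  }
}
```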
```typescript
score: complexityResponse.data.complexity_score,
reasoning: complexityResponse.data.reasoning,
metadata: {
  promptVersion: '1.0',
```
For later, this seems like something that should be tracked as metadata alongside the prompt. Or maybe we should `import { version } from './package.json'`.
````
 *
 * @example
 * ```typescript
 * const result = await evaluateVocabulary(
````
Is there a world where we expose an API that looks like the following to avoid passing static configuration to these function calls?
```typescript
const evaluator = createEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY,
  openaiApiKey: process.env.OPENAI_API_KEY,
});

const result = await evaluator.evaluateVocabulary(text, "3");
```
```typescript
return BACKGROUND_KNOWLEDGE_TEMPLATE
  .replaceAll('{grade}', grade)
  .replaceAll('{text}', text);
}
```
Noting that you're doing something here that was similar to what we did in the KG evals project. We used a templating library that supported arbitrary blocks to create our own control flow and put frontmatter at the top of the template to support metadata tracking (like your prompt version above). Example: https://github.com/chanzuckerberg/edu-kg-evals/blob/main/test_cases/example-auto-score.md?plain=1
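For illustration, a minimal hand-rolled version of the frontmatter idea (the format and field names are assumptions; a real implementation would likely use a templating/frontmatter library as in the linked project):

```typescript
// Hypothetical template: metadata (e.g. prompt version) lives in a
// frontmatter block at the top of the file, above the template body.
const TEMPLATE = `---
version: 1.0
---
Explain {term} to a grade {grade} student.`;

export function parseTemplate(raw: string): { meta: Record<string, string>; body: string } {
  const match = /^---\n([\s\S]*?)\n---\n([\s\S]*)$/.exec(raw);
  if (!match) return { meta: {}, body: raw }; // no frontmatter present
  const meta: Record<string, string> = {};
  for (const line of match[1].split('\n')) {
    const [key, ...rest] = line.split(':');
    meta[key.trim()] = rest.join(':').trim();
  }
  return { meta, body: match[2] };
}
```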
```typescript
const modelId = requestModel || this.config.model || this.getDefaultModel();
const apiKey = this.config.apiKey;

switch (this.config.type) {
```
You have a separate switch-like data structure in DEFAULT_MODELS. Maybe these should be combined in one place?
Separately, it strikes me as odd that a user would specify a model by provider name but not care which actual model family they get back: "google" defaults to a high-tier pro model while "anthropic" defaults to a mid-tier model (Sonnet). But I think I might be missing something about how this config is used.
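A sketch of the combined registry (the registry shape is an assumption and the default model IDs are illustrative):

```typescript
// Sketch: one registry replaces the parallel switch + DEFAULT_MODELS
// structures, so provider defaults live in a single place.
type ProviderName = 'google' | 'openai' | 'anthropic';

const PROVIDERS: Record<ProviderName, { defaultModel: string }> = {
  google: { defaultModel: 'gemini-2.5-pro' },   // model IDs illustrative
  openai: { defaultModel: 'gpt-4o' },
  anthropic: { defaultModel: 'claude-sonnet-4' },
};

export function resolveModel(provider: ProviderName, requested?: string): string {
  // An explicitly requested model always wins over the provider default.
  return requested ?? PROVIDERS[provider].defaultModel;
}
```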
```typescript
/**
 * Number of retries for this stage
 *
 * IMPORTANT: Currently set to -1 (unknown) because Vercel AI SDK doesn't expose
```
How important is this datapoint? If too hard to capture, maybe we just remove it from the telemetry?
```typescript
/** Debug messages - very verbose, for development */
DEBUG = 0,
/** Informational messages - normal operations */
INFO = 1,
```
@adnanrhussain - Ensure this is INFO by default and can be silenced by config.
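A minimal sketch of that behavior (the `silent` flag and `SILENT` level are assumptions about the eventual config):

```typescript
// Sketch: INFO is the default level; a config flag silences all output.
enum LogLevel { DEBUG = 0, INFO = 1, WARN = 2, ERROR = 3, SILENT = 4 }

export function effectiveLevel(config: { logLevel?: LogLevel; silent?: boolean } = {}): LogLevel {
  if (config.silent) return LogLevel.SILENT;
  // ?? (not ||) so an explicit DEBUG (0) is respected.
  return config.logLevel ?? LogLevel.INFO;
}
```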
czi-fsisenda
left a comment
Looks good! A number of comments for consideration. None blocking though.
You mentioned that some of the topics we discussed would be addressed in other PRs.
> **By default, telemetry is enabled** and sends:
> - Performance metrics (latency, token usage)
> - Metadata (evaluator type, grade, SDK version)
Should we send input size or can we infer that from token usage? So even if we don't get the actual text, we can get the text length.
Actually, I see text length is included.
```json
  },
  "input_text": "The mitochondria is the powerhouse of the cell...",
  "metadata": {
    "stage_details": [
```
nit: Phase may be a better name for this. When I hear stage, I think deployment stage.
| Field | Description |
|-------|-------------|
| `timestamp` | ISO 8601 timestamp when evaluation started |
| `sdk_version` | Version of the SDK (e.g., "0.1.0") |
| `evaluator_type` | Which evaluator ran (e.g., "vocabulary", "sentence-structure") |
If we have multiple versions of the same evaluator in an SDK version would the eval version be captured in the eval name or should we have an eval version field too?
`sdks/typescript/docs/telemetry.md` (outdated):

```typescript
const evaluator = new VocabularyEvaluator({
  googleApiKey: process.env.GOOGLE_API_KEY!,
  openaiApiKey: process.env.OPENAI_API_KEY!,
  apiKey: process.env.LEARNING_COMMONS_API_KEY!, // Contact us for a key
```
How about learningCommonsApiKey? Clearer name.
```typescript
/**
 * Maximum number of retries for failed API calls (default: 2)
 * Set to 0 to disable retries.
 *
 * Note: With maxRetries=2, a failed call will be attempted up to 3 times total
 * (1 initial attempt + 2 retries)
 */
maxRetries?: number;
```
Current evaluations can take a while. We should consider letting users handle retry logic or default to 0 retries. Excluding retry logic simplifies our work.
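If the SDK defaults to zero retries, callers who want retries can wrap the call themselves; a hypothetical user-side helper:

```typescript
// Sketch: generic retry wrapper the user controls, so the SDK itself can
// make a single attempt. Attempt count and delay are caller-chosen.
export async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, except after the final failure.
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
    }
  }
  throw lastError;
}
```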
```typescript
const complexityProviderName = (grade === '3' || grade === '4')
  ? 'google:gemini-2.5-pro'
  : 'openai:gpt-4.1-2025-04-14';
```
We should just think of vocabulary_3-4 as a different eval from vocabulary_k-2_5-12 to avoid this type of complexity. Then we'd have one switch at the input to select the evaluator, but implementations will be simpler.
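The split could reduce to a single selector at the entry point; a sketch (the evaluator names come from this thread, the function itself is hypothetical):

```typescript
// Sketch: one switch at the input chooses the evaluator, so each
// implementation stays free of grade-based branching.
type Grade =
  | 'K' | '1' | '2' | '3' | '4' | '5' | '6'
  | '7' | '8' | '9' | '10' | '11' | '12';

export function selectVocabularyEvaluator(
  grade: Grade
): 'vocabulary_3-4' | 'vocabulary_k-2_5-12' {
  return grade === '3' || grade === '4' ? 'vocabulary_3-4' : 'vocabulary_k-2_5-12';
}
```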
For integration testing an NPM library, a truer test would run against the NPM package rather than the implementation. Not sure how complicated that would be to set up.
And also for integration tests, we normally shouldn't need very elaborate tests. The primary purpose should be checking that we can reach the integrating service and then unit tests should be exhaustive. With LLMs though, it feels like we want to check that the LLM is still returning the expected result. But then that isn't really an integration test. It's more a synthetic test that happens to run as part of CI/CD.
This is good. Just wondering if we should always have elaborate integration tests as we add evals or just a single evaluation will be good enough.
```typescript
evaluatorVersion?: string;
promptVersion: string;
```
Wondering if these should be separate. We should update the eval version any time the prompt changes.
Summary
This PR introduces the key components for the SDK and uses them for a VocabularyEvaluator.
Documentation
See `sdks/typescript/README.md`
Testing
SDK Feature PR Index