
feat: Add TextComplexityEvaluator and metadata-driven evaluator system#15

Merged
adnanrhussain merged 3 commits into ahussain/sdk_typescript from ahussain/sdk_6_composite_evals
Mar 5, 2026

Conversation

@adnanrhussain
Contributor

Summary

Implements a composite text-complexity evaluator that combines vocabulary and sentence-structure analysis, and adds a metadata system to all evaluators for self-documentation and API key validation.

Documentation

See sdks/typescript/README.md

Testing

  • CI workflow should build successfully

SDK Feature PR Index


Copilot AI left a comment


Pull request overview

This PR implements a composite TextComplexityEvaluator that combines vocabulary and sentence structure analysis, and introduces a metadata-driven system for evaluator self-documentation and API key validation. The changes represent a significant architectural improvement by centralizing common evaluator patterns in the base class and eliminating duplicated configuration interfaces.

Changes:

  • Added TextComplexityEvaluator that runs vocabulary and sentence structure evaluations in parallel with concurrency control (using p-limit)
  • Introduced EvaluatorMetadata interface requiring all evaluators to define static metadata (id, name, description, supportedGrades, API key requirements)
  • Centralized API key validation in BaseEvaluator constructor using metadata, removing duplicate validation from individual evaluators
  • Consolidated all evaluator config interfaces into BaseEvaluatorConfig, removing VocabularyEvaluatorConfig, SentenceStructureEvaluatorConfig, and GradeLevelAppropriatenessEvaluatorConfig
  • Updated supported grades for Sentence Structure evaluator from K-12 to 3-12 (consistent with Vocabulary and Text Complexity)
  • Updated all tests to reflect new error messages and grade ranges
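
The composite behavior described above — parallel sub-evaluations, graceful degradation on partial failure, and an overall score equal to the highest sub-score — can be sketched in a few lines of self-contained TypeScript. Names, signatures, and the complexity scale below are illustrative, not the SDK's actual API:

```typescript
// Hypothetical complexity scale, ordered least to most complex.
// The real SDK's labels and ordering may differ.
const SCALE = ['slightly complex', 'moderately complex', 'very complex'] as const;
type Complexity = (typeof SCALE)[number];

type SubResult = { score: Complexity } | { error: Error };

// Run both sub-evaluations in parallel; a rejection becomes an { error }
// entry instead of failing the whole composite evaluation.
async function evaluateComposite(
  vocab: () => Promise<Complexity>,
  sentence: () => Promise<Complexity>
): Promise<{ overall: Complexity | null; vocabulary: SubResult; sentenceStructure: SubResult }> {
  const [v, s] = await Promise.allSettled([vocab(), sentence()]);
  const toResult = (r: PromiseSettledResult<Complexity>): SubResult =>
    r.status === 'fulfilled' ? { score: r.value } : { error: r.reason as Error };
  const vocabulary = toResult(v);
  const sentenceStructure = toResult(s);

  // Overall = highest complexity among the sub-scores that succeeded;
  // null (an assumption in this sketch) if both sub-evaluations failed.
  const ranks = [vocabulary, sentenceStructure]
    .filter((r): r is { score: Complexity } => 'score' in r)
    .map((r) => SCALE.indexOf(r.score));
  const overall = ranks.length > 0 ? SCALE[Math.max(...ranks)] : null;
  return { overall, vocabulary, sentenceStructure };
}
```

A failed sub-evaluation surfaces as an `{ error }` entry in the result rather than rejecting the whole call, mirroring the `_internal` shape this PR documents in the README.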

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Summary per file:
  • sdks/typescript/src/evaluators/base.ts — Added EvaluatorMetadata interface and centralized API key validation based on metadata; converted getEvaluatorType from abstract to concrete method using metadata
  • sdks/typescript/src/evaluators/text-complexity.ts — New composite evaluator running vocabulary and sentence structure evaluations in parallel with p-limit concurrency control; implements graceful degradation on partial failures
  • sdks/typescript/src/evaluators/vocabulary.ts — Removed VocabularyEvaluatorConfig interface, added static metadata, removed manual API key validation now handled by base class
  • sdks/typescript/src/evaluators/sentence-structure.ts — Removed SentenceStructureEvaluatorConfig interface, added static metadata, updated supported grades to 3-12, removed manual API key validation
  • sdks/typescript/src/evaluators/grade-level-appropriateness.ts — Removed GradeLevelAppropriatenessEvaluatorConfig interface, added static metadata with empty supportedGrades, removed manual API key validation
  • sdks/typescript/src/evaluators/index.ts — Updated exports to remove individual config interfaces and export EvaluatorMetadata, TextComplexityEvaluator, and related types
  • sdks/typescript/src/index.ts — Added exports for TextComplexityEvaluator, evaluateTextComplexity, TextComplexityScore, TextComplexityInternal, and EvaluatorMetadata
  • sdks/typescript/tests/unit/evaluators/text-complexity.test.ts — Comprehensive test suite for TextComplexityEvaluator covering metadata, constructor validation, evaluation flow, parallel execution, partial failures, and complexity determination
  • sdks/typescript/tests/unit/evaluators/vocabulary.test.ts — Updated error messages to include evaluator name in validation errors
  • sdks/typescript/tests/unit/evaluators/sentence-structure.test.ts — Updated error messages and changed test grade from 'K' to '3' to match new supported grade range
  • sdks/typescript/tests/unit/evaluators/grade-level-appropriateness.test.ts — Updated error message to include evaluator name
  • sdks/typescript/tests/integration/sentence-structure.integration.test.ts — Commented out grade 2 test case (no longer supported) and updated documentation to reflect 3-6 grade range
  • sdks/typescript/package.json — Added p-limit 5.0.0 as production dependency for concurrency control
  • sdks/typescript/package-lock.json — Updated p-limit to 5.0.0 with yocto-queue 1.2.2; preserved p-limit 3.1.0 as nested dependency for p-locate
  • sdks/typescript/README.md — Added TextComplexityEvaluator documentation, updated all evaluator constructors to show BaseEvaluatorConfig with optional API keys, updated Sentence Structure supported grades to 3-12, added note explaining API key requirements per evaluator
Files not reviewed (1)
  • sdks/typescript/package-lock.json: Language not supported


Base automatically changed from ahussain/sdk_5_gla to ahussain/sdk_typescript March 3, 2026 07:20
@adnanrhussain force-pushed the ahussain/sdk_6_composite_evals branch from 173a014 to 13325f1 on March 3, 2026 at 17:37
On the EvaluatorMetadata interface in base.ts, @adnanrhussain (Contributor, Author) commented:

For the reviewer: making configuration and validation shared code across evals.

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 93.58974% with 5 lines in your changes missing coverage. Please review.

Files with missing lines:
  • sdks/typescript/src/evaluators/text-complexity.ts — patch 92.85%, 4 lines missing ⚠️
  • sdks/typescript/src/evaluators/base.ts — patch 92.30%, 1 line missing ⚠️


@adnanrhussain requested a review from Copilot March 3, 2026 17:46
@adnanrhussain marked this pull request as ready for review March 3, 2026 17:50

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.

Files not reviewed (1)
  • sdks/typescript/package-lock.json: Language not supported


Comment on lines +247 to +266
```typescript
private buildCombinedReasoning(
  vocabResult: EvaluationResult<string> | { error: Error },
  sentenceResult: EvaluationResult<string> | { error: Error }
): string {
  const parts: string[] = [];

  if ('error' in vocabResult) {
    parts.push(`**Vocabulary Complexity:** Evaluation failed - ${vocabResult.error.message}`);
  } else {
    parts.push(`**Vocabulary Complexity (${vocabResult.score}):**\n${vocabResult.reasoning}`);
  }

  if ('error' in sentenceResult) {
    parts.push(`**Sentence Structure Complexity:** Evaluation failed - ${sentenceResult.error.message}`);
  } else {
    parts.push(`**Sentence Structure Complexity (${sentenceResult.score}):**\n${sentenceResult.reasoning}`);
  }

  return parts.join('\n\n');
}
```

Copilot AI Mar 3, 2026


buildCombinedReasoning() formats the output using Markdown (e.g., **Vocabulary Complexity**). Other evaluators appear to return plain-text reasoning, so this changes the expected format for API consumers. Either switch to plain text for consistency or document that TextComplexityEvaluator.reasoning is Markdown.

Comment on lines +323 to +332
```typescript
// Check that limit is defined
expect((evaluator as any).limit).toBeDefined();

const text = 'The cat sat on the mat.';
const grade = '5';

await evaluator.evaluate(text, grade);

// The limit should have been used (both calls go through it)
expect((evaluator as any).limit).toBeDefined();
```

Copilot AI Mar 3, 2026


This "concurrency control" test only checks that limit is defined, which doesn’t verify that calls actually go through the limiter or that the concurrency cap is respected. Consider asserting that limit is invoked (e.g., spy on or wrap it) or adding delayed sub-evaluator calls and verifying at most N run concurrently.

Suggested change:

```diff
-// Check that limit is defined
-expect((evaluator as any).limit).toBeDefined();
-
-const text = 'The cat sat on the mat.';
-const grade = '5';
-
-await evaluator.evaluate(text, grade);
-
-// The limit should have been used (both calls go through it)
-expect((evaluator as any).limit).toBeDefined();
+// Grab the original limit function and create a spy that wraps it
+const originalLimit = (evaluator as any).limit;
+expect(originalLimit).toBeDefined();
+const limitSpy = vi.fn(originalLimit);
+(evaluator as any).limit = (...args: any[]) => limitSpy(...args);
+
+const text = 'The cat sat on the mat.';
+const grade = '5';
+
+await evaluator.evaluate(text, grade);
+
+// The limit should have been used (both evaluator calls go through it)
+expect(limitSpy).toHaveBeenCalled();
+expect(limitSpy.mock.calls.length).toBeGreaterThanOrEqual(2);
```

Comment on lines +32 to +38
```typescript
// {
//   id: 'SS2',
//   grade: '2',
//   text: "The Roman Empire was a powerful empire that lasted for hundreds of years. It started as a small village in Italy and grew into a huge empire that controlled much of Europe, Asia, and Africa. The Roman Empire had many strong leaders like Julius Caesar and Augustus. These leaders helped the empire grow and become very powerful.\n \n\n The Roman Empire had a period of peace and prosperity called the Pax Romana. This time was good for the empire, but it didn't last forever. The empire started to have problems. The army became weaker, and the economy had problems. The empire was also attacked by groups of people called barbarians.\n \n\n The Roman Empire was divided into two parts: the Western Roman Empire and the Eastern Roman Empire. The Western Roman Empire eventually fell apart in 476 AD. The Eastern Roman Empire, also known as the Byzantine Empire, lasted for many more years. The Roman Empire left behind many things that we still use today, like the Roman alphabet and the calendar.",
//   expected: 'moderately complex',
//   acceptable: ['slightly complex', 'very complex'],
// },
```

Copilot AI Mar 3, 2026


Large commented-out integration test case for grade 2 is left in the test vector list. If grade 2 is no longer supported, it’s cleaner to remove it (or keep it as a skipped test with an explicit reason) to avoid accumulating dead code in the test suite.

Comment on lines +181 to +193
```typescript
private validateApiKeys(config: BaseEvaluatorConfig): void {
  if (this.metadata.requiresGoogleKey && !config.googleApiKey) {
    throw new ConfigurationError(
      `Google API key is required for ${this.metadata.name} evaluator. Pass googleApiKey in config.`
    );
  }

  if (this.metadata.requiresOpenAIKey && !config.openaiApiKey) {
    throw new ConfigurationError(
      `OpenAI API key is required for ${this.metadata.name} evaluator. Pass openaiApiKey in config.`
    );
  }
}
```

Copilot AI Mar 3, 2026


BaseEvaluator assumes every subclass defines static metadata. If a new evaluator forgets to set it, this.metadata.requiresGoogleKey will throw a TypeError instead of a clear configuration error. Consider adding an explicit guard (e.g., verify this.metadata is defined and has required fields) and throw a helpful ConfigurationError when metadata is missing/malformed.
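
A guard along these lines might look like the sketch below; the class and interface here are minimal stand-ins for the SDK's actual types, with assumed names and fields:

```typescript
// Minimal stand-ins for the SDK types quoted in this PR (names assumed).
class ConfigurationError extends Error {}

interface EvaluatorMetadata {
  readonly id: string;
  readonly name: string;
  readonly requiresGoogleKey: boolean;
  readonly requiresOpenAIKey: boolean;
}

abstract class BaseEvaluator {
  protected readonly metadata: EvaluatorMetadata;

  constructor() {
    // Static metadata lives on the subclass constructor; guard before use
    // so a missing definition fails with a clear ConfigurationError rather
    // than a TypeError deep inside validateApiKeys().
    const meta = (this.constructor as { metadata?: EvaluatorMetadata }).metadata;
    if (!meta || typeof meta.id !== 'string' || typeof meta.name !== 'string') {
      throw new ConfigurationError(
        `${this.constructor.name} must define static metadata (id, name, key requirements).`
      );
    }
    this.metadata = meta;
  }
}

class GoodEvaluator extends BaseEvaluator {
  static readonly metadata: EvaluatorMetadata = {
    id: 'good', name: 'Good', requiresGoogleKey: false, requiresOpenAIKey: false,
  };
}

class ForgetfulEvaluator extends BaseEvaluator {} // no static metadata defined
```

Constructing `ForgetfulEvaluator` now fails fast with a descriptive error instead of crashing later during key validation.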

Comment on lines +12 to +14
```typescript
export interface TextComplexityInternal {
  vocabulary: EvaluationResult<string> | { error: Error };
  sentenceStructure: EvaluationResult<string> | { error: Error };
```

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextComplexityInternal stores raw Error instances. These don’t serialize well (e.g., JSON.stringify(new Error('x')) becomes {}), which makes _internal hard to log or persist. Consider storing a plain object (name/message/stack) or a string instead of an Error instance.

Suggested change:

```diff
+export interface SerializedError {
+  name: string;
+  message: string;
+  stack?: string;
+}
+
 export interface TextComplexityInternal {
-  vocabulary: EvaluationResult<string> | { error: Error };
-  sentenceStructure: EvaluationResult<string> | { error: Error };
+  vocabulary: EvaluationResult<string> | { error: SerializedError | string };
+  sentenceStructure: EvaluationResult<string> | { error: SerializedError | string };
```

Comment on lines +192 to +207
```typescript
it('should run both evaluators in parallel', async () => {
  const text = 'The cat sat on the mat.';
  const grade = '5';

  const startTime = Date.now();
  const result = await evaluator.evaluate(text, grade);
  const duration = Date.now() - startTime;

  // With mocked providers that take ~100ms each, parallel execution should be faster than sequential
  // Sequential would be ~200ms, parallel should be ~100ms
  // Allow some overhead but should be significantly less than 200ms
  expect(duration).toBeLessThan(200);

  expect('error' in result._internal!.vocabulary).toBe(false);
  expect('error' in result._internal!.sentenceStructure).toBe(false);
});
```

Copilot AI Mar 3, 2026


The "run both evaluators in parallel" test isn’t actually asserting parallelism: both child evaluate() calls are mocked to resolve immediately (no real ~100ms delay), so the duration assertion will pass even if execution becomes sequential. Consider adding an explicit artificial delay to both mocks (or use fake timers) and assert the wall time is closer to one delay than the sum.
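
The point can be illustrated without a test framework: give both fakes an explicit delay, and the parallel-vs-sequential difference becomes observable in wall time. The delay values below are illustrative:

```typescript
// Promise-based sleep helper.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Two fake sub-evaluators with an explicit ~100 ms delay each, standing in
// for the mocked vocabulary and sentence-structure calls in the test.
const fakeVocab = async () => { await sleep(100); return 'slightly complex'; };
const fakeSentence = async () => { await sleep(100); return 'very complex'; };

// Parallel wall time ≈ max(delays); sequential would be ≈ sum(delays).
const start = Date.now();
await Promise.all([fakeVocab(), fakeSentence()]);
const parallelMs = Date.now() - start;

const seqStart = Date.now();
await fakeVocab();
await fakeSentence();
const sequentialMs = Date.now() - seqStart;

console.log(`parallel: ${parallelMs} ms, sequential: ${sequentialMs} ms`);
```

With real delays in the mocks, a `duration < 200` assertion actually distinguishes parallel from sequential execution; without them it passes either way.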

```typescript
  type TelemetryOptions,
  type EvaluatorMetadata,
} from './evaluators/index.js';
```


Copilot AI Mar 3, 2026


Public API appears to drop the evaluator-specific config types (e.g. VocabularyEvaluatorConfig, SentenceStructureEvaluatorConfig, etc.) from the root export surface. If this package is already being consumed, that’s a breaking change; consider re-exporting them as deprecated type aliases to BaseEvaluatorConfig to preserve compatibility.

Suggested change:

```typescript
/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type VocabularyEvaluatorConfig = BaseEvaluatorConfig;

/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type SentenceStructureEvaluatorConfig = BaseEvaluatorConfig;

/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type GradeLevelAppropriatenessEvaluatorConfig = BaseEvaluatorConfig;

/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type TextComplexityEvaluatorConfig = BaseEvaluatorConfig;
```

Comment on lines 78 to 137
```diff
@@ -116,11 +107,6 @@ export class SentenceStructureEvaluator extends BaseEvaluator {
     });
   }

-  // Implement abstract methods from BaseEvaluator
-  protected getEvaluatorType(): string {
-    return 'sentence-structure';
-  }
-
   /**
    * Evaluate sentence structure complexity for a given text and grade level
    *
@@ -147,7 +133,7 @@ export class SentenceStructureEvaluator extends BaseEvaluator {
     try {
       // Validate inputs — inside try so validation errors are telemetered.
       this.validateText(text);
-      this.validateGrade(grade, VALID_GRADES);
+      this.validateGrade(grade, new Set(SentenceStructureEvaluator.metadata.supportedGrades));
       this.logger.debug('Stage 1: Analyzing sentence structure', {
```

Copilot AI Mar 3, 2026


This change effectively drops SentenceStructureEvaluator support for grades K–2 by validating against supportedGrades (3–12 only). That’s a behavior/API change and the JSDoc above still says the grade param is "(K-12)". Please update the documentation/comments (and any public docs) to reflect the new supported range, or restore K–2 support if the restriction wasn’t intended.


@czi-fsisenda left a comment


Looks good! Got some questions. Not sure whether we need to address these here or in a later PR. Let me know and I can approve for us to address later.

Comment on lines +85 to +100
```typescript
 * Each evaluator must provide this metadata as static properties
 */
export interface EvaluatorMetadata {
  /** Unique identifier for the evaluator (e.g., 'vocabulary', 'sentence-structure') */
  readonly id: string;
  /** Human-readable name (e.g., 'Vocabulary', 'Sentence Structure') */
  readonly name: string;
  /** Brief description of what the evaluator does */
  readonly description: string;
  /** Supported grade levels (e.g., ['3', '4', '5', ...]) */
  readonly supportedGrades: readonly string[];
  /** Whether this evaluator requires a Google API key */
  readonly requiresGoogleKey: boolean;
  /** Whether this evaluator requires an OpenAI API key */
  readonly requiresOpenAIKey: boolean;
}
```


Nice!

Comment on lines +94 to +95
```typescript
/** Supported grade levels (e.g., ['3', '4', '5', ...]) */
readonly supportedGrades: readonly string[];
```


With the shape new evals are taking, it looks like grade isn't a core attribute.


From the previous playground, we ended up setting up an input fields framework. Each eval would define what type of inputs it used, e.g. inputs: [{name: 'grade', type:'select', required: true, options: ['1st grade', '2nd grade']}]...
I thought we wouldn't need something this elaborate for a while, but all the new evaluators have different inputs. None use grade.

@adnanrhussain (Contributor, Author) replied:

I'll address this in a different PR

Comment on lines +13 to +14
```typescript
vocabulary: EvaluationResult<string> | { error: Error };
sentenceStructure: EvaluationResult<string> | { error: Error };
```


What does the user do with a partially errored evaluation?

@adnanrhussain (Contributor, Author) replied:

Good Q. It internally has a 3 attempt retry. I suppose if the evaluation is still failing, then they would just re-run the whole evaluator. But later we can make this smarter, and have only the failed ones retry again. Something for a later PR
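
The "retry only the failed ones" idea deferred here to a later PR could be sketched like this — a hypothetical helper, not the SDK's current retry logic:

```typescript
type SubResult<T> = { value: T } | { error: Error };

// Re-run only the sub-evaluations that previously failed, up to maxAttempts
// attempts each, keeping successful results from earlier rounds.
async function retryFailed<T>(
  tasks: Record<string, () => Promise<T>>,
  maxAttempts = 3
): Promise<Record<string, SubResult<T>>> {
  const results: Record<string, SubResult<T>> = {};
  let pending = Object.keys(tasks);

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const settled = await Promise.allSettled(pending.map((k) => tasks[k]()));
    const stillFailing: string[] = [];
    settled.forEach((r, i) => {
      const key = pending[i];
      if (r.status === 'fulfilled') {
        results[key] = { value: r.value };
      } else {
        results[key] = { error: r.reason as Error };
        stillFailing.push(key);
      }
    });
    pending = stillFailing; // only failed keys are retried next round
  }
  return results;
}
```

Sub-evaluations that already succeeded are never re-run, so a flaky provider only costs retries for its own key.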

Comment on lines +71 to +72
```typescript
// Create concurrency limiter (max 3 concurrent operations)
this.limit = pLimit(3);
```


I'm not sure I understand the reasoning for the limiter. The limiter is local to this evaluator, right?
So we get a queue for sub evaluations for this eval only? But I could run as many concurrent solo evaluations as I'd like? Shouldn't we have a global limiter if we're concerned about concurrent evaluations?
Since this combo eval is only doing 2 evals, it seems like either a limiter is a lot of complexity for this case or we are neglecting limiting concurrent solo evals.

@adnanrhussain (Contributor, Author) replied:

There will be around 6 evaluators in the Text Complexity evaluator. My plan is to provide this wrapper for the end-developer client to use to queue up multiple evaluators in the background, but that is an enhancement for later.
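
A shared, module-scoped limiter — the "global limiter" raised in this thread — might look like the sketch below. The hand-rolled limiter stands in for p-limit so the example is self-contained; names are illustrative:

```typescript
// Minimal p-limit-style limiter: at most `max` tasks run concurrently.
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const acquire = () =>
    new Promise<void>((resolve) => {
      if (active < max) { active++; resolve(); }
      else queue.push(() => { active++; resolve(); });
    });
  const release = () => { active--; queue.shift()?.(); };
  return async <T>(fn: () => Promise<T>): Promise<T> => {
    await acquire();
    try { return await fn(); } finally { release(); }
  };
}

// Shared at module scope: every evaluator instance draws from one pool,
// instead of each composite evaluator owning its own private queue.
const globalLimit = createLimiter(3);

// Demonstration: eight 10 ms tasks, tracking peak concurrency.
let running = 0;
let peak = 0;
const task = () =>
  globalLimit(async () => {
    running++;
    peak = Math.max(peak, running);
    await new Promise((resolve) => setTimeout(resolve, 10));
    running--;
  });

await Promise.all(Array.from({ length: 8 }, task));
console.log(`peak concurrency: ${peak}`); // capped at the limiter's max of 3
```

Because `globalLimit` is created once at module scope, solo evaluations and composite sub-evaluations would all count against the same cap.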

Comment on lines +149 to +164
**Returns:**
```typescript
{
score: {
overall: string; // Overall complexity (highest of the two)
vocabulary: string; // Vocabulary complexity score
sentenceStructure: string; // Sentence structure complexity score
};
reasoning: string; // Combined reasoning from both evaluators
metadata: EvaluationMetadata;
_internal: {
vocabulary: EvaluationResult | { error: Error };
sentenceStructure: EvaluationResult | { error: Error };
};
}
```


Different, but usable, shape than we currently have.
We need to get alignment from data, eng, design on how we use the eval outputs.

@adnanrhussain merged commit 9592349 into ahussain/sdk_typescript Mar 5, 2026
10 checks passed
@adnanrhussain deleted the ahussain/sdk_6_composite_evals branch March 5, 2026 23:07