
feat: Add TextComplexityEvaluator and metadata-driven evaluator system#15

Merged
adnanrhussain merged 3 commits into ahussain/sdk_typescript from ahussain/sdk_6_composite_evals
Mar 5, 2026

Conversation

@adnanrhussain
Contributor

Summary

Implements a composite text-complexity evaluator that combines vocabulary and sentence-structure analysis, and adds a metadata system to all evaluators for self-documentation and API key validation.

Documentation

See sdks/typescript/README.md

Testing

  • CI workflow should build successfully

SDK Feature PR Index


Copilot AI left a comment


Pull request overview

This PR implements a composite TextComplexityEvaluator that combines vocabulary and sentence structure analysis, and introduces a metadata-driven system for evaluator self-documentation and API key validation. The changes represent a significant architectural improvement by centralizing common evaluator patterns in the base class and eliminating duplicated configuration interfaces.

Changes:

  • Added TextComplexityEvaluator that runs vocabulary and sentence structure evaluations in parallel with concurrency control (using p-limit)
  • Introduced EvaluatorMetadata interface requiring all evaluators to define static metadata (id, name, description, supportedGrades, API key requirements)
  • Centralized API key validation in BaseEvaluator constructor using metadata, removing duplicate validation from individual evaluators
  • Consolidated all evaluator config interfaces into BaseEvaluatorConfig, removing VocabularyEvaluatorConfig, SentenceStructureEvaluatorConfig, and GradeLevelAppropriatenessEvaluatorConfig
  • Updated supported grades for Sentence Structure evaluator from K-12 to 3-12 (consistent with Vocabulary and Text Complexity)
  • Updated all tests to reflect new error messages and grade ranges
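
The composite behavior described above — parallel sub-evaluations, graceful degradation on partial failure, and an overall score equal to the highest sub-score — can be sketched in a few lines of self-contained TypeScript. Names, signatures, and the complexity scale below are illustrative, not the SDK's actual API:

```typescript
// Hypothetical complexity scale, ordered least to most complex.
// The real SDK's labels and ordering may differ.
const SCALE = ['slightly complex', 'moderately complex', 'very complex'] as const;
type Complexity = (typeof SCALE)[number];

type SubResult = { score: Complexity } | { error: Error };

// Run both sub-evaluations in parallel; a rejection becomes an { error }
// entry instead of failing the whole composite evaluation.
async function evaluateComposite(
  vocab: () => Promise<Complexity>,
  sentence: () => Promise<Complexity>
): Promise<{ overall: Complexity | null; vocabulary: SubResult; sentenceStructure: SubResult }> {
  const [v, s] = await Promise.allSettled([vocab(), sentence()]);
  const toResult = (r: PromiseSettledResult<Complexity>): SubResult =>
    r.status === 'fulfilled' ? { score: r.value } : { error: r.reason as Error };
  const vocabulary = toResult(v);
  const sentenceStructure = toResult(s);

  // Overall = highest complexity among the sub-scores that succeeded;
  // null (an assumption in this sketch) if both sub-evaluations failed.
  const ranks = [vocabulary, sentenceStructure]
    .filter((r): r is { score: Complexity } => 'score' in r)
    .map((r) => SCALE.indexOf(r.score));
  const overall = ranks.length > 0 ? SCALE[Math.max(...ranks)] : null;
  return { overall, vocabulary, sentenceStructure };
}
```

A failed sub-evaluation surfaces as an `{ error }` entry in the result rather than rejecting the whole call, mirroring the `_internal` shape this PR documents in the README.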

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Summary per file:
  • sdks/typescript/src/evaluators/base.ts — Added EvaluatorMetadata interface and centralized API key validation based on metadata; converted getEvaluatorType from abstract to concrete method using metadata
  • sdks/typescript/src/evaluators/text-complexity.ts — New composite evaluator running vocabulary and sentence structure evaluations in parallel with p-limit concurrency control; implements graceful degradation on partial failures
  • sdks/typescript/src/evaluators/vocabulary.ts — Removed VocabularyEvaluatorConfig interface, added static metadata, removed manual API key validation now handled by base class
  • sdks/typescript/src/evaluators/sentence-structure.ts — Removed SentenceStructureEvaluatorConfig interface, added static metadata, updated supported grades to 3-12, removed manual API key validation
  • sdks/typescript/src/evaluators/grade-level-appropriateness.ts — Removed GradeLevelAppropriatenessEvaluatorConfig interface, added static metadata with empty supportedGrades, removed manual API key validation
  • sdks/typescript/src/evaluators/index.ts — Updated exports to remove individual config interfaces and export EvaluatorMetadata, TextComplexityEvaluator, and related types
  • sdks/typescript/src/index.ts — Added exports for TextComplexityEvaluator, evaluateTextComplexity, TextComplexityScore, TextComplexityInternal, and EvaluatorMetadata
  • sdks/typescript/tests/unit/evaluators/text-complexity.test.ts — Comprehensive test suite for TextComplexityEvaluator covering metadata, constructor validation, evaluation flow, parallel execution, partial failures, and complexity determination
  • sdks/typescript/tests/unit/evaluators/vocabulary.test.ts — Updated error messages to include evaluator name in validation errors
  • sdks/typescript/tests/unit/evaluators/sentence-structure.test.ts — Updated error messages and changed test grade from 'K' to '3' to match new supported grade range
  • sdks/typescript/tests/unit/evaluators/grade-level-appropriateness.test.ts — Updated error message to include evaluator name
  • sdks/typescript/tests/integration/sentence-structure.integration.test.ts — Commented out grade 2 test case (no longer supported) and updated documentation to reflect 3-6 grade range
  • sdks/typescript/package.json — Added p-limit 5.0.0 as production dependency for concurrency control
  • sdks/typescript/package-lock.json — Updated p-limit to 5.0.0 with yocto-queue 1.2.2; preserved p-limit 3.1.0 as nested dependency for p-locate
  • sdks/typescript/README.md — Added TextComplexityEvaluator documentation, updated all evaluator constructors to show BaseEvaluatorConfig with optional API keys, updated Sentence Structure supported grades to 3-12, added note explaining API key requirements per evaluator
Files not reviewed (1)
  • sdks/typescript/package-lock.json: Language not supported


Base automatically changed from ahussain/sdk_5_gla to ahussain/sdk_typescript March 3, 2026 07:20
@adnanrhussain force-pushed the ahussain/sdk_6_composite_evals branch from 173a014 to 13325f1 on March 3, 2026 at 17:37
On the EvaluatorMetadata interface in base.ts, @adnanrhussain (Contributor, Author) commented:

For the reviewer: making configuration and validation shared code across evals.

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 93.58974% with 5 lines in your changes missing coverage. Please review.

Files with missing lines:
  • sdks/typescript/src/evaluators/text-complexity.ts — patch 92.85%, 4 lines missing ⚠️
  • sdks/typescript/src/evaluators/base.ts — patch 92.30%, 1 line missing ⚠️


@adnanrhussain requested a review from Copilot March 3, 2026 17:46
@adnanrhussain marked this pull request as ready for review March 3, 2026 17:50

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.

Files not reviewed (1)
  • sdks/typescript/package-lock.json: Language not supported


Comment on lines +247 to +266
```typescript
private buildCombinedReasoning(
  vocabResult: EvaluationResult<string> | { error: Error },
  sentenceResult: EvaluationResult<string> | { error: Error }
): string {
  const parts: string[] = [];

  if ('error' in vocabResult) {
    parts.push(`**Vocabulary Complexity:** Evaluation failed - ${vocabResult.error.message}`);
  } else {
    parts.push(`**Vocabulary Complexity (${vocabResult.score}):**\n${vocabResult.reasoning}`);
  }

  if ('error' in sentenceResult) {
    parts.push(`**Sentence Structure Complexity:** Evaluation failed - ${sentenceResult.error.message}`);
  } else {
    parts.push(`**Sentence Structure Complexity (${sentenceResult.score}):**\n${sentenceResult.reasoning}`);
  }

  return parts.join('\n\n');
}
```

Copilot AI Mar 3, 2026


buildCombinedReasoning() formats the output using Markdown (e.g., **Vocabulary Complexity**). Other evaluators appear to return plain-text reasoning, so this changes the expected format for API consumers. Either switch to plain text for consistency or document that TextComplexityEvaluator.reasoning is Markdown.

Comment on lines +323 to +332
```typescript
// Check that limit is defined
expect((evaluator as any).limit).toBeDefined();

const text = 'The cat sat on the mat.';
const grade = '5';

await evaluator.evaluate(text, grade);

// The limit should have been used (both calls go through it)
expect((evaluator as any).limit).toBeDefined();
```

Copilot AI Mar 3, 2026


This "concurrency control" test only checks that limit is defined, which doesn’t verify that calls actually go through the limiter or that the concurrency cap is respected. Consider asserting that limit is invoked (e.g., spy on or wrap it) or adding delayed sub-evaluator calls and verifying at most N run concurrently.

Suggested change:

```diff
-// Check that limit is defined
-expect((evaluator as any).limit).toBeDefined();
-
-const text = 'The cat sat on the mat.';
-const grade = '5';
-
-await evaluator.evaluate(text, grade);
-
-// The limit should have been used (both calls go through it)
-expect((evaluator as any).limit).toBeDefined();
+// Grab the original limit function and create a spy that wraps it
+const originalLimit = (evaluator as any).limit;
+expect(originalLimit).toBeDefined();
+const limitSpy = vi.fn(originalLimit);
+(evaluator as any).limit = (...args: any[]) => limitSpy(...args);
+
+const text = 'The cat sat on the mat.';
+const grade = '5';
+
+await evaluator.evaluate(text, grade);
+
+// The limit should have been used (both evaluator calls go through it)
+expect(limitSpy).toHaveBeenCalled();
+expect(limitSpy.mock.calls.length).toBeGreaterThanOrEqual(2);
```

Comment on lines +32 to +38
```typescript
// {
//   id: 'SS2',
//   grade: '2',
//   text: "The Roman Empire was a powerful empire that lasted for hundreds of years. It started as a small village in Italy and grew into a huge empire that controlled much of Europe, Asia, and Africa. The Roman Empire had many strong leaders like Julius Caesar and Augustus. These leaders helped the empire grow and become very powerful.\n \n\n The Roman Empire had a period of peace and prosperity called the Pax Romana. This time was good for the empire, but it didn't last forever. The empire started to have problems. The army became weaker, and the economy had problems. The empire was also attacked by groups of people called barbarians.\n \n\n The Roman Empire was divided into two parts: the Western Roman Empire and the Eastern Roman Empire. The Western Roman Empire eventually fell apart in 476 AD. The Eastern Roman Empire, also known as the Byzantine Empire, lasted for many more years. The Roman Empire left behind many things that we still use today, like the Roman alphabet and the calendar.",
//   expected: 'moderately complex',
//   acceptable: ['slightly complex', 'very complex'],
// },
```

Copilot AI Mar 3, 2026


Large commented-out integration test case for grade 2 is left in the test vector list. If grade 2 is no longer supported, it’s cleaner to remove it (or keep it as a skipped test with an explicit reason) to avoid accumulating dead code in the test suite.

Comment on lines +181 to +193
```typescript
private validateApiKeys(config: BaseEvaluatorConfig): void {
  if (this.metadata.requiresGoogleKey && !config.googleApiKey) {
    throw new ConfigurationError(
      `Google API key is required for ${this.metadata.name} evaluator. Pass googleApiKey in config.`
    );
  }

  if (this.metadata.requiresOpenAIKey && !config.openaiApiKey) {
    throw new ConfigurationError(
      `OpenAI API key is required for ${this.metadata.name} evaluator. Pass openaiApiKey in config.`
    );
  }
}
```

Copilot AI Mar 3, 2026


BaseEvaluator assumes every subclass defines static metadata. If a new evaluator forgets to set it, this.metadata.requiresGoogleKey will throw a TypeError instead of a clear configuration error. Consider adding an explicit guard (e.g., verify this.metadata is defined and has required fields) and throw a helpful ConfigurationError when metadata is missing/malformed.
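
A guard along these lines might look like the sketch below; the class and interface here are minimal stand-ins for the SDK's actual types, with assumed names and fields:

```typescript
// Minimal stand-ins for the SDK types quoted in this PR (names assumed).
class ConfigurationError extends Error {}

interface EvaluatorMetadata {
  readonly id: string;
  readonly name: string;
  readonly requiresGoogleKey: boolean;
  readonly requiresOpenAIKey: boolean;
}

abstract class BaseEvaluator {
  protected readonly metadata: EvaluatorMetadata;

  constructor() {
    // Static metadata lives on the subclass constructor; guard before use
    // so a missing definition fails with a clear ConfigurationError rather
    // than a TypeError deep inside validateApiKeys().
    const meta = (this.constructor as { metadata?: EvaluatorMetadata }).metadata;
    if (!meta || typeof meta.id !== 'string' || typeof meta.name !== 'string') {
      throw new ConfigurationError(
        `${this.constructor.name} must define static metadata (id, name, key requirements).`
      );
    }
    this.metadata = meta;
  }
}

class GoodEvaluator extends BaseEvaluator {
  static readonly metadata: EvaluatorMetadata = {
    id: 'good', name: 'Good', requiresGoogleKey: false, requiresOpenAIKey: false,
  };
}

class ForgetfulEvaluator extends BaseEvaluator {} // no static metadata defined
```

Constructing `ForgetfulEvaluator` now fails fast with a descriptive error instead of crashing later during key validation.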

Comment on lines +12 to +14
```typescript
export interface TextComplexityInternal {
  vocabulary: EvaluationResult<string> | { error: Error };
  sentenceStructure: EvaluationResult<string> | { error: Error };
```

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TextComplexityInternal stores raw Error instances. These don’t serialize well (e.g., JSON.stringify(new Error('x')) becomes {}), which makes _internal hard to log or persist. Consider storing a plain object (name/message/stack) or a string instead of an Error instance.

Suggested change:

```diff
+export interface SerializedError {
+  name: string;
+  message: string;
+  stack?: string;
+}
+
 export interface TextComplexityInternal {
-  vocabulary: EvaluationResult<string> | { error: Error };
-  sentenceStructure: EvaluationResult<string> | { error: Error };
+  vocabulary: EvaluationResult<string> | { error: SerializedError | string };
+  sentenceStructure: EvaluationResult<string> | { error: SerializedError | string };
```

Comment on lines +192 to +207
```typescript
it('should run both evaluators in parallel', async () => {
  const text = 'The cat sat on the mat.';
  const grade = '5';

  const startTime = Date.now();
  const result = await evaluator.evaluate(text, grade);
  const duration = Date.now() - startTime;

  // With mocked providers that take ~100ms each, parallel execution should be faster than sequential
  // Sequential would be ~200ms, parallel should be ~100ms
  // Allow some overhead but should be significantly less than 200ms
  expect(duration).toBeLessThan(200);

  expect('error' in result._internal!.vocabulary).toBe(false);
  expect('error' in result._internal!.sentenceStructure).toBe(false);
});
```

Copilot AI Mar 3, 2026


The "run both evaluators in parallel" test isn’t actually asserting parallelism: both child evaluate() calls are mocked to resolve immediately (no real ~100ms delay), so the duration assertion will pass even if execution becomes sequential. Consider adding an explicit artificial delay to both mocks (or use fake timers) and assert the wall time is closer to one delay than the sum.
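
The point can be illustrated without a test framework: give both fakes an explicit delay, and the parallel-vs-sequential difference becomes observable in wall time. The delay values below are illustrative:

```typescript
// Promise-based sleep helper.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Two fake sub-evaluators with an explicit ~100 ms delay each, standing in
// for the mocked vocabulary and sentence-structure calls in the test.
const fakeVocab = async () => { await sleep(100); return 'slightly complex'; };
const fakeSentence = async () => { await sleep(100); return 'very complex'; };

// Parallel wall time ≈ max(delays); sequential would be ≈ sum(delays).
const start = Date.now();
await Promise.all([fakeVocab(), fakeSentence()]);
const parallelMs = Date.now() - start;

const seqStart = Date.now();
await fakeVocab();
await fakeSentence();
const sequentialMs = Date.now() - seqStart;

console.log(`parallel: ${parallelMs} ms, sequential: ${sequentialMs} ms`);
```

With real delays in the mocks, a `duration < 200` assertion actually distinguishes parallel from sequential execution; without them it passes either way.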

```typescript
  type TelemetryOptions,
  type EvaluatorMetadata,
} from './evaluators/index.js';
```


Copilot AI Mar 3, 2026


Public API appears to drop the evaluator-specific config types (e.g. VocabularyEvaluatorConfig, SentenceStructureEvaluatorConfig, etc.) from the root export surface. If this package is already being consumed, that’s a breaking change; consider re-exporting them as deprecated type aliases to BaseEvaluatorConfig to preserve compatibility.

Suggested change:

```typescript
/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type VocabularyEvaluatorConfig = BaseEvaluatorConfig;

/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type SentenceStructureEvaluatorConfig = BaseEvaluatorConfig;

/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type GradeLevelAppropriatenessEvaluatorConfig = BaseEvaluatorConfig;

/**
 * @deprecated Use {@link BaseEvaluatorConfig} instead. This alias is provided
 * to preserve backward compatibility for existing consumers.
 */
export type TextComplexityEvaluatorConfig = BaseEvaluatorConfig;
```

Comment on lines 78 to 137
```diff
@@ -116,11 +107,6 @@ export class SentenceStructureEvaluator extends BaseEvaluator {
     });
   }

-  // Implement abstract methods from BaseEvaluator
-  protected getEvaluatorType(): string {
-    return 'sentence-structure';
-  }
-
   /**
    * Evaluate sentence structure complexity for a given text and grade level
    *
@@ -147,7 +133,7 @@ export class SentenceStructureEvaluator extends BaseEvaluator {
     try {
       // Validate inputs — inside try so validation errors are telemetered.
       this.validateText(text);
-      this.validateGrade(grade, VALID_GRADES);
+      this.validateGrade(grade, new Set(SentenceStructureEvaluator.metadata.supportedGrades));
       this.logger.debug('Stage 1: Analyzing sentence structure', {
```

Copilot AI Mar 3, 2026


This change effectively drops SentenceStructureEvaluator support for grades K–2 by validating against supportedGrades (3–12 only). That’s a behavior/API change and the JSDoc above still says the grade param is "(K-12)". Please update the documentation/comments (and any public docs) to reflect the new supported range, or restore K–2 support if the restriction wasn’t intended.


@czi-fsisenda left a comment


Looks good! Got some questions. Not sure whether we need to address these here or in a later PR. Let me know and I can approve for us to address later.

Comment on lines +85 to +100
```typescript
 * Each evaluator must provide this metadata as static properties
 */
export interface EvaluatorMetadata {
  /** Unique identifier for the evaluator (e.g., 'vocabulary', 'sentence-structure') */
  readonly id: string;
  /** Human-readable name (e.g., 'Vocabulary', 'Sentence Structure') */
  readonly name: string;
  /** Brief description of what the evaluator does */
  readonly description: string;
  /** Supported grade levels (e.g., ['3', '4', '5', ...]) */
  readonly supportedGrades: readonly string[];
  /** Whether this evaluator requires a Google API key */
  readonly requiresGoogleKey: boolean;
  /** Whether this evaluator requires an OpenAI API key */
  readonly requiresOpenAIKey: boolean;
}
```


Nice!

Comment on lines +94 to +95
```typescript
/** Supported grade levels (e.g., ['3', '4', '5', ...]) */
readonly supportedGrades: readonly string[];
```


With the shape new evals are taking, it looks like grade isn't a core attribute.


From the previous playground, we ended up setting up an input fields framework. Each eval would define what type of inputs it used, e.g. inputs: [{name: 'grade', type:'select', required: true, options: ['1st grade', '2nd grade']}]...
I thought we wouldn't need something this elaborate for a while, but all the new evaluators have different inputs. None use grade.

@adnanrhussain (Contributor, Author) replied:

I'll address this in a different PR

Comment on lines +13 to +14
```typescript
vocabulary: EvaluationResult<string> | { error: Error };
sentenceStructure: EvaluationResult<string> | { error: Error };
```


What does the user do with a partially errored evaluation?

@adnanrhussain (Contributor, Author) replied:

Good Q. It internally has a 3 attempt retry. I suppose if the evaluation is still failing, then they would just re-run the whole evaluator. But later we can make this smarter, and have only the failed ones retry again. Something for a later PR
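
The "retry only the failed ones" idea deferred here to a later PR could be sketched like this — a hypothetical helper, not the SDK's current retry logic:

```typescript
type SubResult<T> = { value: T } | { error: Error };

// Re-run only the sub-evaluations that previously failed, up to maxAttempts
// attempts each, keeping successful results from earlier rounds.
async function retryFailed<T>(
  tasks: Record<string, () => Promise<T>>,
  maxAttempts = 3
): Promise<Record<string, SubResult<T>>> {
  const results: Record<string, SubResult<T>> = {};
  let pending = Object.keys(tasks);

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const settled = await Promise.allSettled(pending.map((k) => tasks[k]()));
    const stillFailing: string[] = [];
    settled.forEach((r, i) => {
      const key = pending[i];
      if (r.status === 'fulfilled') {
        results[key] = { value: r.value };
      } else {
        results[key] = { error: r.reason as Error };
        stillFailing.push(key);
      }
    });
    pending = stillFailing; // only failed keys are retried next round
  }
  return results;
}
```

Sub-evaluations that already succeeded are never re-run, so a flaky provider only costs retries for its own key.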

Comment on lines +71 to +72
```typescript
// Create concurrency limiter (max 3 concurrent operations)
this.limit = pLimit(3);
```


I'm not sure I understand the reasoning for the limiter. The limiter is local to this evaluator, right?
So we get a queue for sub evaluations for this eval only? But I could run as many concurrent solo evaluations as I'd like? Shouldn't we have a global limiter if we're concerned about concurrent evaluations?
Since this combo eval is only doing 2 evals, it seems like either a limiter is a lot of complexity for this case or we are neglecting limiting concurrent solo evals.

@adnanrhussain (Contributor, Author) replied:

There will be around 6 evaluators in the Text Complexity evaluator. My plan is to provide this wrapper for the end-developer client to use to queue up multiple evaluators in the background, but that is an enhancement for later.
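
A shared, module-scoped limiter — the "global limiter" raised in this thread — might look like the sketch below. The hand-rolled limiter stands in for p-limit so the example is self-contained; names are illustrative:

```typescript
// Minimal p-limit-style limiter: at most `max` tasks run concurrently.
function createLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const acquire = () =>
    new Promise<void>((resolve) => {
      if (active < max) { active++; resolve(); }
      else queue.push(() => { active++; resolve(); });
    });
  const release = () => { active--; queue.shift()?.(); };
  return async <T>(fn: () => Promise<T>): Promise<T> => {
    await acquire();
    try { return await fn(); } finally { release(); }
  };
}

// Shared at module scope: every evaluator instance draws from one pool,
// instead of each composite evaluator owning its own private queue.
const globalLimit = createLimiter(3);

// Demonstration: eight 10 ms tasks, tracking peak concurrency.
let running = 0;
let peak = 0;
const task = () =>
  globalLimit(async () => {
    running++;
    peak = Math.max(peak, running);
    await new Promise((resolve) => setTimeout(resolve, 10));
    running--;
  });

await Promise.all(Array.from({ length: 8 }, task));
console.log(`peak concurrency: ${peak}`); // capped at the limiter's max of 3
```

Because `globalLimit` is created once at module scope, solo evaluations and composite sub-evaluations would all count against the same cap.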

Comment on lines +149 to +164
**Returns:**
```typescript
{
score: {
overall: string; // Overall complexity (highest of the two)
vocabulary: string; // Vocabulary complexity score
sentenceStructure: string; // Sentence structure complexity score
};
reasoning: string; // Combined reasoning from both evaluators
metadata: EvaluationMetadata;
_internal: {
vocabulary: EvaluationResult | { error: Error };
sentenceStructure: EvaluationResult | { error: Error };
};
}
```


Different, but usable, shape than we currently have.
We need to get alignment from data, eng, design on how we use the eval outputs.

@adnanrhussain merged commit 9592349 into ahussain/sdk_typescript Mar 5, 2026
10 checks passed
@adnanrhussain deleted the ahussain/sdk_6_composite_evals branch March 5, 2026 23:07