Add Advanced Evaluation Metrics and Model Investigation Tools by kavyavenk · Pull Request #28 · GAInTheHouse/Adaptive-Self-Learning-Agentic-AI-System

kavyavenk · 2026-03-06T04:44:10Z

Overview

Added comprehensive evaluation metrics and investigation tools to enhance our speech-to-text evaluation framework.

Key Features

1. Upgraded Unified Evaluator (src/evaluation/metrics.py)

Added three new metrics to the existing WER/CER evaluator:

- Diarization Error Rate (DER): Measures speaker diarization accuracy
  - Formula: DER = (Missed Speech + False Alarm + Speaker Confusion) / Total Duration
  - Supports tolerance collar (default 0.25s) for boundary imprecision
  - Requires speaker segment information (start, end, speaker ID)
- Verb Error Rate: Measures accuracy of verb transcription specifically
  - Uses NLTK POS tagging to extract verbs
  - Critical for grammatical accuracy assessment
  - Formula: (Missing Verbs + Extra Verbs) / Total Reference Verbs
- Domain Error Rate: Measures accuracy within specific domains
  - Supports: medical, legal, technical, business domains
  - Calculates domain-specific WER
  - Customizable domain keywords
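
As a worked instance of the DER formula above, the sketch below plugs in made-up durations. The function name `diarization_error_rate` and the numbers are illustrative assumptions, not taken from the PR's `calculate_der()`, and the tolerance collar is not modeled here.

```python
def diarization_error_rate(
    missed_speech: float,
    false_alarm: float,
    speaker_confusion: float,
    total_duration: float,
) -> float:
    """DER = (missed + false alarm + confusion) / total duration, all in seconds."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return (missed_speech + false_alarm + speaker_confusion) / total_duration


# Hypothetical 60 s recording: 2 s missed, 1 s false alarm, 3 s confused speakers.
example_der = diarization_error_rate(2.0, 1.0, 3.0, 60.0)
print(f"DER = {example_der:.2%}")  # 6 s of error over 60 s
```

Because the three error terms are summed without a cap, DER can exceed 1.0 when the hypothesis contains a lot of false-alarm speech.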

2. AReal (RealtimeSTT) Investigation Script (experiments/investigate_areal.py)

Evaluates RealtimeSTT/Faster_Whisper model performance
Calculates WER, CER, Verb Error Rate, Domain Error Rate
Measures latency metrics
Processes audio files from data directory
Saves results to JSON for analysis

Usage:

python experiments/investigate_areal.py

3. Miles/Moonshine Investigation Script (experiments/investigate_miles.py)

Evaluates Moonshine model (optimized for live transcription)
Supports M.I.L.E.S voice assistant approach
Same comprehensive metrics as AReal investigation
Fallback to Whisper if models unavailable

Usage:

python experiments/investigate_miles.py

4. Oracle Teacher Script (experiments/oracle_teacher.py)

Generates synthetic "gold" transcripts using LLM APIs for cases without ground truth:

- Dual API Support:
  - OpenAI API (GPT-4o, GPT-4) - default
  - Ollama API (Llama 3, Llama 2) - local inference
- Two-Stage Process:
  1. Gets baseline transcript from Whisper
  2. Refines transcript using LLM (fixes errors, adds punctuation, corrects grammar)
- Batch Processing: Processes multiple files with rate limiting
- Comprehensive Output: Saves baseline + gold transcripts with metadata

Usage:

# OpenAI
export OPENAI_API_KEY='your-key'
python experiments/oracle_teacher.py --api-type openai --model gpt-4o --limit 10

# Ollama (Llama 3)
python experiments/oracle_teacher.py --api-type llama --model llama3 --limit 10

New Dependencies

Added to requirements.txt:

nltk>=3.8.0 - For POS tagging (verb extraction)
pyannote.metrics>=4.0.0 - For advanced DER calculation (optional)
openai>=1.0.0 - For Oracle Teacher OpenAI API
faster-whisper>=0.10.0 - For AReal investigation

Documentation

Created docs/EVALUATION_METRICS_UPGRADE.md with:

Detailed metric explanations
Usage examples for all new features
API documentation
Future enhancement roadmap

Technical Details

Evaluation Metrics Implementation

from src.evaluation.metrics import STTEvaluator

evaluator = STTEvaluator()
results = evaluator.evaluate_batch(
    references=["ref1", "ref2"],
    hypotheses=["hyp1", "hyp2"],
    include_verb_rate=True,
    include_domain_rate=True,
    include_der=False  # Requires segments
)

Oracle Teacher Output Format

{
  "audio_file": "data/test.wav",
  "baseline_transcript": "the patient was diagnose with pneumonia",
  "gold_transcript": "The patient was diagnosed with pneumonia.",
  "refinement_time": 2.5,
  "model": "gpt-4o",
  "api_type": "openai"
}
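
A record in this format can be loaded into a typed structure for downstream analysis. The following is a minimal sketch: the `OracleRecord` dataclass and `load_oracle_record` helper are hypothetical names, and the sketch assumes every record carries exactly the six fields shown above.

```python
import json
from dataclasses import dataclass


@dataclass
class OracleRecord:
    """Hypothetical container mirroring the fields in the sample output."""
    audio_file: str
    baseline_transcript: str
    gold_transcript: str
    refinement_time: float
    model: str
    api_type: str


def load_oracle_record(text: str) -> OracleRecord:
    """Parse one JSON record; raises TypeError if fields are missing or unexpected."""
    return OracleRecord(**json.loads(text))


sample = """{
  "audio_file": "data/test.wav",
  "baseline_transcript": "the patient was diagnose with pneumonia",
  "gold_transcript": "The patient was diagnosed with pneumonia.",
  "refinement_time": 2.5,
  "model": "gpt-4o",
  "api_type": "openai"
}"""
record = load_oracle_record(sample)
print(record.gold_transcript)
```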

Testing

All scripts include error handling and logging
Scripts automatically find audio files in data directory
Results saved to experiments/evaluation_outputs/
Compatible with existing evaluation framework

Files Changed

src/evaluation/metrics.py - Upgraded with new metrics
requirements.txt - Added new dependencies
experiments/investigate_areal.py - New investigation script
experiments/investigate_miles.py - New investigation script
experiments/oracle_teacher.py - New Oracle Teacher script
docs/EVALUATION_METRICS_UPGRADE.md - New documentation

Next Steps

Run investigation scripts on available data
Generate gold transcripts using Oracle Teacher
Integrate new metrics into existing evaluation pipeline
Add semantic similarity metrics (future enhancement)

- ✅ Created Dockerfile.training (GPU) and Dockerfile.training.cpu (CPU) - ✅ Verified all dependencies: LoRA/PEFT, Wav2Vec2, PyTorch, audio processing - ✅ Added Docker Compose configuration for training - ✅ Created verification and build scripts - ✅ Comprehensive documentation (quickstart, troubleshooting, completion report) - ✅ CPU container built and verified successfully (3.05GB) - ✅ GPU container ready for GCP deployment All Week 1 tasks for Kavya completed. Training environment is reproducible and ready for GCP deployment. Co-authored-by: Cursor <cursoragent@cursor.com>

…Docker docs - Move run_comprehensive_evaluations.py to experiments/ - Move verify_evaluation_numbers.py to experiments/ - Move EVALUATION_VERIFICATION_SUMMARY.md to docs/ and expand explanation - Delete redundant WEEK1_DOCKER_BUILD_SUMMARY.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md) - Delete redundant WEEK1_TRAINING_DOCKER_QUICKSTART.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md) - Add Quick Start section to docs/WEEK1_TRAINING_DOCKER.md - Expand verification summary purpose and context explanation Co-authored-by: Cursor <cursoragent@cursor.com>

- Upgrade Unified Evaluator with DER, Verb Error Rate, and Domain Error Rate - Add AReal (RealtimeSTT) investigation script - Add Miles/Moonshine investigation script - Add Oracle Teacher script for synthetic gold transcript generation - Update requirements.txt with new dependencies - Add comprehensive documentation for new metrics Made-with: Cursor


Copilot

Pull request overview

This PR expands the speech-to-text evaluation framework by adding new evaluation metrics (DER, Verb Error Rate, Domain Error Rate) and several experiment scripts to investigate models and generate/verify transcripts.

Changes:

Extend STTEvaluator with DER, verb-focused scoring, and domain-focused scoring.
Add new experiment scripts for model investigation (AReal/Miles), report number verification, comprehensive evaluations, and an LLM-based “oracle teacher”.
Add documentation and new Python dependencies to support the new tooling.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `src/evaluation/metrics.py` | Adds DER/verb/domain metrics and expands batch evaluation output. |
| `requirements.txt` | Adds dependencies for NLTK, DER tooling, OpenAI, and faster-whisper. |
| `experiments/verify_evaluation_numbers.py` | New script to compare “reported” vs. measured evaluation outputs. |
| `experiments/run_comprehensive_evaluations.py` | New script to run baseline/full-system evaluations + stats + ablation and emit a report. |
| `experiments/oracle_teacher.py` | New script to generate “gold” transcripts using OpenAI/Ollama refinement. |
| `experiments/investigate_miles.py` | New investigation script for Moonshine / Miles proxy with latency + metrics. |
| `experiments/investigate_areal.py` | New investigation script for RealtimeSTT/Faster-Whisper with latency + metrics. |
| `docs/EVALUATION_VERIFICATION_SUMMARY.md` | Documents what reported numbers were/weren’t verifiable and outcomes. |
| `docs/EVALUATION_METRICS_UPGRADE.md` | Documents the new metrics and the new scripts’ usage. |

Copilot · 2026-03-09T20:59:46Z

+        ref_verbs = set(self.extract_verbs(reference))
+        hyp_verbs = set(self.extract_verbs(hypothesis))
+
+        if not ref_verbs:
+            return 0.0 if not hyp_verbs else 1.0
+
+        # Calculate errors
+        substitutions = len(ref_verbs - hyp_verbs)  # Missing verbs
+        insertions = len(hyp_verbs - ref_verbs)  # Extra verbs
+
+        total_errors = substitutions + insertions
+        ver = total_errors / len(ref_verbs) if ref_verbs else 0.0
+


calculate_verb_error_rate() converts verb lists to sets, which drops duplicates and changes the metric from counting missing/extra verb occurrences to only checking presence/absence. This doesn’t match the stated formula; consider comparing multisets/counts (e.g., via collections.Counter) so repeated verbs are handled correctly.
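
One way the multiset comparison could look is sketched below. `verb_error_rate` is a hypothetical standalone helper that takes already-extracted verb lists (the PR's `extract_verbs()` is not reproduced), and it keeps the empty-reference convention from the diff.

```python
from collections import Counter
from typing import Iterable


def verb_error_rate(ref_verbs: Iterable[str], hyp_verbs: Iterable[str]) -> float:
    """(missing + extra verb occurrences) / total reference verb occurrences."""
    ref_counts = Counter(ref_verbs)
    hyp_counts = Counter(hyp_verbs)
    total_ref = sum(ref_counts.values())
    if total_ref == 0:
        # Same convention as the diff: no reference verbs, any hypothesis verb is an error.
        return 0.0 if not hyp_counts else 1.0
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    missing = sum((ref_counts - hyp_counts).values())
    extra = sum((hyp_counts - ref_counts).values())
    return (missing + extra) / total_ref


# "go" appears twice in the reference but once in the hypothesis.
print(verb_error_rate(["go", "go", "run"], ["go", "run"]))
```

With sets, the example above would score 0.0 because both verbs are present; with counts, the dropped second "go" registers as one missing occurrence out of three.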

Copilot · 2026-03-09T20:59:46Z

+# Add src to path
+sys.path.insert(0, str(Path(__file__).parent))
+
+from src.baseline_model import BaselineSTTModel
+from src.integration import UnifiedSTTSystem, StatisticalAnalyzer, AblationStudy
+from src.evaluation.metrics import STTEvaluator


Same sys.path issue as verify_evaluation_numbers.py: adding Path(__file__).parent won’t make src importable when running this script directly. Use the repo root (Path(__file__).resolve().parent.parent) or rely on package installation.

Copilot · 2026-03-09T20:59:47Z

+    logger.info("\n✓ Verified from actual evaluation files:")
+    logger.info(f"  - Baseline WER: {actual_results.get('baseline_wer', 'N/A')*100:.2f}%")
+    logger.info(f"  - Baseline CER: {actual_results.get('baseline_cer', 'N/A')*100:.2f}%")
+    logger.info(f"  - Mean Latency: {actual_results.get('mean_latency', 'N/A'):.2f}s")
+    logger.info(f"  - Throughput: {actual_results.get('throughput', 'N/A'):.2f} samples/s")


Several log lines multiply actual_results.get(..., 'N/A') * 100 and then format as floats. If the metric isn’t present (e.g., evaluation files missing), the default 'N/A' will cause type errors or nonsensical output. Consider defaulting to None and branching, or using numeric defaults and separately reporting missing values.
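
A small formatting helper is one way to do the branching suggested here. This is a sketch only; `format_metric` and its `scale`/`suffix` parameters are hypothetical and not part of the PR.

```python
from typing import Mapping, Optional


def format_metric(
    results: Mapping[str, Optional[float]],
    key: str,
    scale: float = 1.0,
    suffix: str = "",
) -> str:
    """Format a numeric metric to two decimals, or return 'N/A' when it is absent."""
    value = results.get(key)
    if value is None:
        return "N/A"
    return f"{value * scale:.2f}{suffix}"


actual_results = {"baseline_wer": 0.1234}
print("Baseline WER:", format_metric(actual_results, "baseline_wer", 100, "%"))
print("Mean Latency:", format_metric(actual_results, "mean_latency", suffix="s"))
```

Each log line can then pass the already-formatted string to `logger.info`, so a missing metric never reaches a float format specifier.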

Copilot · 2026-03-09T20:59:47Z

+"""
+Comprehensive Evaluation Script
+Runs actual evaluations to get measured numbers for the report.
+"""


This script appears to overlap with experiments/run_evaluation.py, which states it “Replaces run_comprehensive_evaluations, ... verify_evaluation_numbers.” Keeping multiple entrypoints with overlapping responsibilities increases maintenance burden; consider removing this script, or clearly documenting why both are needed and how they differ.

Copilot · 2026-03-09T20:59:47Z

+        while ref_idx < len(ref_timeline) and hyp_idx < len(hyp_timeline):
+            ref_seg = ref_timeline[ref_idx]
+            hyp_seg = hyp_timeline[hyp_idx]
+
+            # Check overlap
+            overlap_start = max(ref_seg['start'], hyp_seg['start'])
+            overlap_end = min(ref_seg['end'], hyp_seg['end'])
+            overlap_duration = max(0, overlap_end - overlap_start)
+
+            if overlap_duration > tolerance:
+                # Check speaker match
+                if ref_seg.get('speaker') != hyp_seg.get('speaker'):
+                    speaker_confusion += overlap_duration
+            else:


calculate_der() can get stuck in an infinite loop: when overlap_duration > tolerance you never advance ref_idx or hyp_idx, so the while condition never changes for overlapping segments. You’ll need to advance at least one index (e.g., whichever segment ends first) after accounting for overlap/speaker confusion.
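
The loop structure with a guaranteed advance might look like the sketch below. It is not a full DER implementation: it only accumulates speaker-confusion time, and it assumes each timeline is sorted by start time with no overlapping segments within a timeline. The function name `speaker_confusion_time` is hypothetical; the segment keys follow the diff.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]  # expects 'start', 'end', and optionally 'speaker'


def speaker_confusion_time(
    ref_timeline: List[Segment],
    hyp_timeline: List[Segment],
    tolerance: float = 0.25,
) -> float:
    """Sum overlap durations where reference and hypothesis speakers differ."""
    ref_idx = hyp_idx = 0
    confusion = 0.0
    while ref_idx < len(ref_timeline) and hyp_idx < len(hyp_timeline):
        ref_seg = ref_timeline[ref_idx]
        hyp_seg = hyp_timeline[hyp_idx]
        overlap_start = max(ref_seg["start"], hyp_seg["start"])
        overlap_end = min(ref_seg["end"], hyp_seg["end"])
        overlap = max(0.0, overlap_end - overlap_start)
        if overlap > tolerance and ref_seg.get("speaker") != hyp_seg.get("speaker"):
            confusion += overlap
        # Always advance whichever segment ends first, so the loop terminates.
        if ref_seg["end"] <= hyp_seg["end"]:
            ref_idx += 1
        else:
            hyp_idx += 1
    return confusion


ref = [{"start": 0.0, "end": 5.0, "speaker": "A"}, {"start": 5.0, "end": 10.0, "speaker": "B"}]
hyp = [{"start": 0.0, "end": 4.0, "speaker": "A"}, {"start": 4.0, "end": 10.0, "speaker": "B"}]
print(speaker_confusion_time(ref, hyp))
```

Every iteration increments exactly one index, so the loop runs at most `len(ref_timeline) + len(hyp_timeline)` times regardless of how the segments overlap.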

Copilot · 2026-03-09T20:59:47Z

+        domain_errors = defaultdict(lambda: {'total': 0, 'errors': 0})
+
+        for ref, hyp in zip(references, hypotheses):
+            domain = self.detect_domain(ref)
+            if domain:
+                domain_errors[domain]['total'] += 1
+                # Calculate WER for this domain
+                ref_wer = self.calculate_wer(ref, hyp)
+                if ref_wer > 0:
+                    domain_errors[domain]['errors'] += 1
+


calculate_domain_error_rate() is currently computing “% of samples with WER > 0” per domain, not a domain-specific WER (as described in the PR/docs). If the intent is domain-specific WER, accumulate WER values per domain and return an average (or return both average WER and error-free rate).
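
A sketch of the averaging variant follows. It is self-contained, so it carries its own word-level edit-distance WER rather than calling the PR's `calculate_wer()`; `word_error_rate`, `domain_mean_wer`, and the injected `detect_domain` callable are hypothetical names. It reports the mean of per-sample WER, which is an assumption; a corpus-level WER (total edits over total reference words) would be an equally defensible choice.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def domain_mean_wer(
    references: Sequence[str],
    hypotheses: Sequence[str],
    detect_domain: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Average per-sample WER within each detected domain."""
    per_domain: Dict[str, List[float]] = defaultdict(list)
    for ref, hyp in zip(references, hypotheses):
        domain = detect_domain(ref)
        if domain:
            per_domain[domain].append(word_error_rate(ref, hyp))
    return {domain: sum(values) / len(values) for domain, values in per_domain.items()}


refs = ["the patient has pneumonia", "the patient is stable"]
hyps = ["the patient has pneumonia", "the patient was stable"]
print(domain_mean_wer(refs, hyps, lambda text: "medical" if "patient" in text else None))
```

On this toy input the current "WER > 0" counting would report 0.5 for the medical domain (one of two samples has an error), whereas the mean WER is lower because only one word in eight is wrong.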

Copilot · 2026-03-09T20:59:48Z

+# Optional NLTK import for verb extraction
+try:
+    import nltk
+    # Download required NLTK data
+    try:
+        nltk.data.find('tokenizers/punkt')
+    except LookupError:
+        nltk.download('punkt', quiet=True)
+    try:
+        nltk.data.find('taggers/averaged_perceptron_tagger')
+    except LookupError:
+        nltk.download('averaged_perceptron_tagger', quiet=True)
+    NLTK_AVAILABLE = True


This module downloads NLTK datasets at import time (nltk.download(...)). That introduces network/IO side effects just from importing src.evaluation.metrics and can hang/fail in offline or restricted environments. Prefer lazy downloads (only when VER is requested), or require users to pre-install the corpora and raise a clear error with instructions.
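
A lazy variant could defer both the `nltk` import and the data check until verb extraction is first requested. The sketch below uses `nltk.data.find` and `nltk.download` exactly as the diff does; the helper names `ensure_resource` and `ensure_nltk_data` are hypothetical, and the find/download callables are injected so the logic can be exercised without network access.

```python
from functools import lru_cache
from typing import Callable


def ensure_resource(
    find: Callable[[str], object],
    download: Callable[..., object],
    resource_path: str,
    package: str,
) -> bool:
    """Download `package` only if `resource_path` is missing. Returns True if downloaded."""
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package, quiet=True)
        return True


@lru_cache(maxsize=1)
def ensure_nltk_data() -> None:
    """Call this from the verb-extraction path, not at module import time."""
    import nltk  # imported lazily so importing the metrics module has no side effects

    ensure_resource(nltk.data.find, nltk.download, "tokenizers/punkt", "punkt")
    ensure_resource(
        nltk.data.find,
        nltk.download,
        "taggers/averaged_perceptron_tagger",
        "averaged_perceptron_tagger",
    )
```

The `lru_cache` wrapper means the check runs at most once per process after a successful call; an `ImportError` from a missing `nltk` still surfaces at call time and can be routed to the existing fallback extractor.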

Copilot · 2026-03-09T20:59:48Z

+    def get_baseline_transcript(self, audio_file: str) -> str:
+        """Get baseline transcript from Whisper."""
+        try:
+            transcript = self.baseline_model.transcribe(audio_file)
+            return transcript
+        except Exception as e:
+            logger.error(f"Error getting baseline transcript: {e}")
+            return ""


get_baseline_transcript() returns the full dict from BaselineSTTModel.transcribe(), but callers treat it as a string (slicing, f-string prompt interpolation). This will raise at runtime and/or send a dict into the LLM prompt. Return result.get('transcript', '') (and optionally keep the full metadata separately).
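
A normalizing helper is one way to make the return type unambiguous. In this sketch, `extract_transcript` is a hypothetical name, and the `'transcript'` key is taken from the comment above rather than verified against `BaselineSTTModel`.

```python
from typing import Any, Mapping, Union


def extract_transcript(result: Union[str, Mapping[str, Any]]) -> str:
    """Return the transcript text whether the model returned a str or a result dict."""
    if isinstance(result, str):
        return result
    return str(result.get("transcript", ""))


print(extract_transcript({"transcript": "the patient was diagnose with pneumonia", "latency": 0.4}))
```

`get_baseline_transcript()` could pass the output of `transcribe()` through such a helper before returning, so callers that slice or interpolate the value always receive a string.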

Copilot · 2026-03-09T20:59:48Z

+# Add src to path
+sys.path.insert(0, str(Path(__file__).parent))
+
+from src.baseline_model import BaselineSTTModel
+from src.integration import UnifiedSTTSystem, StatisticalAnalyzer, AblationStudy
+from src.evaluation.metrics import STTEvaluator


The sys.path modification adds the experiments/ directory (Path(__file__).parent), but imports are from src..., which typically requires adding the repo root (Path(__file__).resolve().parent.parent) or installing the package. As written, this script will fail to import src when run directly.
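
For scripts that live one level below the repository root, the path fix could be sketched as follows. `add_repo_root` is a hypothetical helper, and the layout (`<repo>/experiments/<script>.py` next to `<repo>/src/`) is assumed from the file list in this PR.

```python
import sys
from pathlib import Path


def add_repo_root(script_file: str) -> Path:
    """Put the parent of the script's directory (the repo root) on sys.path."""
    repo_root = Path(script_file).resolve().parent.parent
    if str(repo_root) not in sys.path:
        sys.path.insert(0, str(repo_root))
    return repo_root


# In a script under experiments/ this would be called as add_repo_root(__file__)
# before any `from src...` import. The path below is only an illustration.
root = add_repo_root("/tmp/example_repo/experiments/run_comprehensive_evaluations.py")
print(root)
```

Installing the project as a package (for example in editable mode) would avoid the `sys.path` manipulation altogether.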

Copilot · 2026-03-09T20:59:48Z

+            baseline_wer = self.results['baseline']['wer']['mean']
+            full_wer = self.results['full_system']['wer']['mean']
+            if baseline_wer and full_wer:
+                improvement = ((baseline_wer - full_wer) / baseline_wer) * 100


In generate_report(), if baseline_wer and full_wer: uses truthiness, so valid values like 0.0 are treated as missing and the improvement section is skipped. Use explicit is not None checks (and guard against division by zero when baseline_wer == 0).
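
The explicit checks could be factored out along these lines. This is a sketch under stated assumptions: `relative_wer_improvement` is a hypothetical helper, and returning `None` for a zero baseline is one possible policy for the undefined case.

```python
from typing import Optional


def relative_wer_improvement(
    baseline_wer: Optional[float],
    full_wer: Optional[float],
) -> Optional[float]:
    """Percentage improvement of full_wer over baseline_wer, or None if undefined."""
    if baseline_wer is None or full_wer is None:
        return None  # a metric is genuinely missing
    if baseline_wer == 0:
        return None  # relative improvement is undefined for a perfect baseline
    return (baseline_wer - full_wer) / baseline_wer * 100


print(relative_wer_improvement(0.2, 0.0))  # a full-system WER of 0.0 is a valid value
```

With the truthiness check in the diff, the `(0.2, 0.0)` case would skip the improvement section even though it represents the best possible outcome.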

shivangi221b and others added 16 commits February 11, 2026 17:54

clean up repeated .md files

88aca83

delete unused files

6dac922

Restore eval scripts and submission dir

0efa213

Recover eval results

70d39d2

fix fix fix

02e869f

Merge main into shivangi_cleanup_2: keep branch deletions (no QUICK_R…

407730c

…EFERENCE at root, no extra docs, no agent_evaluator) Co-authored-by: Cursor <cursoragent@cursor.com>

Add project deliverables

a07bdb0

week 1 - completed dockerized training env

2f2e838

Merge remote-tracking branch 'origin/main' into kavya

02b9dc0

Merge branch 'main' into kavya

af71e31

Fix linting errors: Add missing Any and Union imports from typing

435e1ba

Made-with: Cursor

Make NLTK import optional to fix CI test failures

44fcf0a

- Make nltk import optional with fallback - Add NLTK_AVAILABLE flag - Use fallback verb extraction when NLTK not available - Prevents test failures when nltk not installed Made-with: Cursor

Fix NLTK import: Make it truly optional to prevent CI failures

4143274

- Wrap nltk import in try-except block - Set NLTK_AVAILABLE flag before logger initialization - Add fallback verb extraction method - Prevents ModuleNotFoundError in CI when nltk not installed Made-with: Cursor

kavyavenk requested review from GAInTheHouse and shivangi221b March 6, 2026 05:59

GAInTheHouse requested a review from Copilot March 9, 2026 20:54

Copilot started reviewing on behalf of GAInTheHouse March 9, 2026 20:55

Copilot AI reviewed Mar 9, 2026

Add Advanced Evaluation Metrics and Model Investigation Tools #28
kavyavenk wants to merge 16 commits into main from kavya
