Add Advanced Evaluation Metrics and Model Investigation Tools #28

Open
kavyavenk wants to merge 16 commits into main from kavya
kavyavenk commented Mar 6, 2026

Collaborator

Overview

Adds three evaluation metrics (DER, Verb Error Rate, Domain Error Rate) and three experiment scripts (two model investigation scripts and an Oracle Teacher) to our speech-to-text evaluation framework.

Key Features

1. Upgraded Unified Evaluator (src/evaluation/metrics.py)

Added three new metrics to the existing WER/CER evaluator:

  • Diarization Error Rate (DER): Measures speaker diarization accuracy

    • Formula: DER = (Missed Speech + False Alarm + Speaker Confusion) / Total Duration
    • Supports tolerance collar (default 0.25s) for boundary imprecision
    • Requires speaker segment information (start, end, speaker ID)
  • Verb Error Rate: Measures accuracy of verb transcription specifically

    • Uses NLTK POS tagging to extract verbs
    • Critical for grammatical accuracy assessment
    • Formula: (Missing Verbs + Extra Verbs) / Total Reference Verbs
  • Domain Error Rate: Measures accuracy within specific domains

    • Supports: medical, legal, technical, business domains
    • Calculates domain-specific WER
    • Customizable domain keywords
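
As a rough illustration of the DER formula above, the arithmetic can be sketched as follows. The function name `diarization_error_rate` and the sample durations are hypothetical and are not taken from this PR's `STTEvaluator`; the sketch assumes the three error durations and the total duration have already been measured in seconds.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total: float) -> float:
    """DER = (missed + false alarm + confusion) / total, all in seconds."""
    if total <= 0:
        raise ValueError("total duration must be positive")
    return (missed + false_alarm + confusion) / total


# Hypothetical example: 2.0 s missed, 1.0 s false alarm, 3.0 s confused
# over 60 s of audio, i.e. 6.0 / 60.0.
der = diarization_error_rate(2.0, 1.0, 3.0, 60.0)
print(f"DER = {der:.3f}")
```

Because false alarms add to the numerator but not the denominator, a DER above 1.0 is possible.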

2. AReal (RealtimeSTT) Investigation Script (experiments/investigate_areal.py)

  • Evaluates RealtimeSTT/Faster-Whisper model performance
  • Calculates WER, CER, Verb Error Rate, Domain Error Rate
  • Measures latency metrics
  • Processes audio files from data directory
  • Saves results to JSON for analysis

Usage:

python experiments/investigate_areal.py

3. Miles/Moonshine Investigation Script (experiments/investigate_miles.py)

  • Evaluates Moonshine model (optimized for live transcription)
  • Supports M.I.L.E.S voice assistant approach
  • Same comprehensive metrics as AReal investigation
  • Falls back to Whisper if the Moonshine models are unavailable

Usage:

python experiments/investigate_miles.py

4. Oracle Teacher Script (experiments/oracle_teacher.py)

Generates synthetic "gold" transcripts using LLM APIs for cases without ground truth:

  • Dual API Support:
    • OpenAI API (GPT-4o, GPT-4) - default
    • Ollama API (Llama 3, Llama 2) - local inference
  • Two-Stage Process:
    1. Gets baseline transcript from Whisper
    2. Refines transcript using LLM (fixes errors, adds punctuation, corrects grammar)
  • Batch Processing: Processes multiple files with rate limiting
  • Comprehensive Output: Saves baseline + gold transcripts with metadata

Usage:

# OpenAI
export OPENAI_API_KEY='your-key'
python experiments/oracle_teacher.py --api-type openai --model gpt-4o --limit 10

# Ollama (Llama 3)
python experiments/oracle_teacher.py --api-type llama --model llama3 --limit 10

New Dependencies

Added to requirements.txt:

  • nltk>=3.8.0 - For POS tagging (verb extraction)
  • pyannote.metrics>=4.0.0 - For advanced DER calculation (optional)
  • openai>=1.0.0 - For Oracle Teacher OpenAI API
  • faster-whisper>=0.10.0 - For AReal investigation

Documentation

Created docs/EVALUATION_METRICS_UPGRADE.md with:

  • Detailed metric explanations
  • Usage examples for all new features
  • API documentation
  • Future enhancement roadmap

Technical Details

Evaluation Metrics Implementation

from src.evaluation.metrics import STTEvaluator

evaluator = STTEvaluator()
results = evaluator.evaluate_batch(
    references=["ref1", "ref2"],
    hypotheses=["hyp1", "hyp2"],
    include_verb_rate=True,
    include_domain_rate=True,
    include_der=False  # Requires segments
)

Oracle Teacher Output Format

{
  "audio_file": "data/test.wav",
  "baseline_transcript": "the patient was diagnose with pneumonia",
  "gold_transcript": "The patient was diagnosed with pneumonia.",
  "refinement_time": 2.5,
  "model": "gpt-4o",
  "api_type": "openai"
}
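
To make the two-stage process and the record format above concrete, here is a minimal sketch that builds one such record. It is not the code in `experiments/oracle_teacher.py`: the function `make_oracle_record` and its `transcribe`/`refine` parameters are hypothetical, and both stages are injected as plain callables so the example runs without Whisper or an API key.

```python
import time
from typing import Callable, Dict


def make_oracle_record(
    audio_file: str,
    transcribe: Callable[[str], str],
    refine: Callable[[str], str],
    model: str,
    api_type: str,
) -> Dict[str, object]:
    """Run baseline transcription, then refinement, and build one record."""
    baseline = transcribe(audio_file)      # stage 1: baseline transcript
    start = time.perf_counter()
    gold = refine(baseline)                # stage 2: LLM refinement
    elapsed = time.perf_counter() - start
    return {
        "audio_file": audio_file,
        "baseline_transcript": baseline,
        "gold_transcript": gold,
        "refinement_time": round(elapsed, 3),
        "model": model,
        "api_type": api_type,
    }


# Stand-in callables (assumptions) replace the real Whisper and LLM calls.
record = make_oracle_record(
    "data/test.wav",
    transcribe=lambda path: "the patient was diagnose with pneumonia",
    refine=lambda text: "The patient was diagnosed with pneumonia.",
    model="gpt-4o",
    api_type="openai",
)
print(record["gold_transcript"])
```

Injecting the two stages as callables also makes it straightforward to swap the OpenAI backend for Ollama without touching the record-building logic.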

Testing

  • All scripts include error handling and logging
  • Scripts automatically find audio files in data directory
  • Results saved to experiments/evaluation_outputs/
  • Compatible with existing evaluation framework

Files Changed

  • src/evaluation/metrics.py - Upgraded with new metrics
  • requirements.txt - Added new dependencies
  • experiments/investigate_areal.py - New investigation script
  • experiments/investigate_miles.py - New investigation script
  • experiments/oracle_teacher.py - New Oracle Teacher script
  • docs/EVALUATION_METRICS_UPGRADE.md - New documentation

Next Steps

  • Run investigation scripts on available data
  • Generate gold transcripts using Oracle Teacher
  • Integrate new metrics into existing evaluation pipeline
  • Add semantic similarity metrics (future enhancement)

shivangi221b and others added 16 commits February 11, 2026 17:54
…EFERENCE at root, no extra docs, no agent_evaluator)

Co-authored-by: Cursor <cursoragent@cursor.com>
- ✅ Created Dockerfile.training (GPU) and Dockerfile.training.cpu (CPU)
- ✅ Verified all dependencies: LoRA/PEFT, Wav2Vec2, PyTorch, audio processing
- ✅ Added Docker Compose configuration for training
- ✅ Created verification and build scripts
- ✅ Comprehensive documentation (quickstart, troubleshooting, completion report)
- ✅ CPU container built and verified successfully (3.05GB)
- ✅ GPU container ready for GCP deployment

All Week 1 tasks for Kavya completed. Training environment is reproducible and ready for GCP deployment.

Co-authored-by: Cursor <cursoragent@cursor.com>
…Docker docs

- Move run_comprehensive_evaluations.py to experiments/
- Move verify_evaluation_numbers.py to experiments/
- Move EVALUATION_VERIFICATION_SUMMARY.md to docs/ and expand explanation
- Delete redundant WEEK1_DOCKER_BUILD_SUMMARY.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Delete redundant WEEK1_TRAINING_DOCKER_QUICKSTART.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Add Quick Start section to docs/WEEK1_TRAINING_DOCKER.md
- Expand verification summary purpose and context explanation

Co-authored-by: Cursor <cursoragent@cursor.com>
- Upgrade Unified Evaluator with DER, Verb Error Rate, and Domain Error Rate
- Add AReal (RealtimeSTT) investigation script
- Add Miles/Moonshine investigation script
- Add Oracle Teacher script for synthetic gold transcript generation
- Update requirements.txt with new dependencies
- Add comprehensive documentation for new metrics

Made-with: Cursor
- Make nltk import optional with fallback
- Add NLTK_AVAILABLE flag
- Use fallback verb extraction when NLTK not available
- Prevents test failures when nltk not installed

Made-with: Cursor
- Wrap nltk import in try-except block
- Set NLTK_AVAILABLE flag before logger initialization
- Add fallback verb extraction method
- Prevents ModuleNotFoundError in CI when nltk not installed

Made-with: Cursor

Copilot AI left a comment

Pull request overview

This PR expands the speech-to-text evaluation framework by adding new evaluation metrics (DER, Verb Error Rate, Domain Error Rate) and several experiment scripts to investigate models and generate/verify transcripts.

Changes:

  • Extend STTEvaluator with DER, verb-focused scoring, and domain-focused scoring.
  • Add new experiment scripts for model investigation (AReal/Miles), report number verification, comprehensive evaluations, and an LLM-based “oracle teacher”.
  • Add documentation and new Python dependencies to support the new tooling.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.

Show a summary per file
File | Description
src/evaluation/metrics.py | Adds DER/verb/domain metrics and expands batch evaluation output.
requirements.txt | Adds dependencies for NLTK, DER tooling, OpenAI, and faster-whisper.
experiments/verify_evaluation_numbers.py | New script to compare “reported” vs. measured evaluation outputs.
experiments/run_comprehensive_evaluations.py | New script to run baseline/full-system evaluations + stats + ablation and emit a report.
experiments/oracle_teacher.py | New script to generate “gold” transcripts using OpenAI/Ollama refinement.
experiments/investigate_miles.py | New investigation script for Moonshine / Miles proxy with latency + metrics.
experiments/investigate_areal.py | New investigation script for RealtimeSTT/Faster-Whisper with latency + metrics.
docs/EVALUATION_VERIFICATION_SUMMARY.md | Documents what reported numbers were/weren’t verifiable and outcomes.
docs/EVALUATION_METRICS_UPGRADE.md | Documents the new metrics and the new scripts’ usage.

Comment thread src/evaluation/metrics.py
Comment on lines +262 to +274
ref_verbs = set(self.extract_verbs(reference))
hyp_verbs = set(self.extract_verbs(hypothesis))

if not ref_verbs:
    return 0.0 if not hyp_verbs else 1.0

# Calculate errors
substitutions = len(ref_verbs - hyp_verbs) # Missing verbs
insertions = len(hyp_verbs - ref_verbs) # Extra verbs

total_errors = substitutions + insertions
ver = total_errors / len(ref_verbs) if ref_verbs else 0.0

Copilot AI Mar 9, 2026

calculate_verb_error_rate() converts verb lists to sets, which drops duplicates and changes the metric from counting missing/extra verb occurrences to only checking presence/absence. This doesn’t match the stated formula; consider comparing multisets/counts (e.g., via collections.Counter) so repeated verbs are handled correctly.
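
One way to follow this suggestion is sketched below. It assumes verbs have already been extracted into lists (the PR's `extract_verbs` is not reproduced here), and the function name `verb_error_rate` is hypothetical. Subtracting one `collections.Counter` from another keeps only positive counts, which gives the number of missing or extra occurrences directly.

```python
from collections import Counter
from typing import List


def verb_error_rate(ref_verbs: List[str], hyp_verbs: List[str]) -> float:
    """(missing + extra verb occurrences) / total reference verb occurrences."""
    ref_counts = Counter(ref_verbs)
    hyp_counts = Counter(hyp_verbs)
    if not ref_counts:
        # Mirrors the edge case in the excerpt above.
        return 0.0 if not hyp_counts else 1.0
    missing = sum((ref_counts - hyp_counts).values())
    extra = sum((hyp_counts - ref_counts).values())
    return (missing + extra) / sum(ref_counts.values())


# "run" occurs twice in the reference but only once in the hypothesis.
ver = verb_error_rate(["run", "run", "jump"], ["run", "jump"])
print(ver)
```

With sets, this example would score 0.0 because both sides contain the same distinct verbs; with counts, the dropped second "run" is recorded as one missing occurrence out of three.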

Comment on lines +16 to +21
# Add src to path
sys.path.insert(0, str(Path(__file__).parent))

from src.baseline_model import BaselineSTTModel
from src.integration import UnifiedSTTSystem, StatisticalAnalyzer, AblationStudy
from src.evaluation.metrics import STTEvaluator

Copilot AI Mar 9, 2026

Same sys.path issue as verify_evaluation_numbers.py: adding Path(__file__).parent won’t make src importable when running this script directly. Use the repo root (Path(__file__).resolve().parent.parent) or rely on package installation.
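
A minimal sketch of the repo-root approach, assuming the scripts live one directory below the repository root (as `experiments/...` does in this PR). The helper name `add_repo_root_to_path` and the example path are hypothetical; a real script would pass `__file__`.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file: str) -> Path:
    """Insert the parent of the script's directory at the front of sys.path."""
    repo_root = Path(script_file).resolve().parent.parent
    root_str = str(repo_root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return repo_root


# Hypothetical location; in a real script you would pass __file__.
root = add_repo_root_to_path("/tmp/repo/experiments/run_comprehensive_evaluations.py")
print(root)
```

Installing the project as a package (for example in editable mode) avoids the `sys.path` manipulation entirely and is usually the more robust option.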

Comment on lines +148 to +152
logger.info("\n✓ Verified from actual evaluation files:")
logger.info(f" - Baseline WER: {actual_results.get('baseline_wer', 'N/A')*100:.2f}%")
logger.info(f" - Baseline CER: {actual_results.get('baseline_cer', 'N/A')*100:.2f}%")
logger.info(f" - Mean Latency: {actual_results.get('mean_latency', 'N/A'):.2f}s")
logger.info(f" - Throughput: {actual_results.get('throughput', 'N/A'):.2f} samples/s")

Copilot AI Mar 9, 2026

Several log lines multiply actual_results.get(..., 'N/A') * 100 and then format as floats. If the metric isn’t present (e.g., evaluation files missing), the default 'N/A' will cause type errors or nonsensical output. Consider defaulting to None and branching, or using numeric defaults and separately reporting missing values.
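
The "default to None and branch" option could look like the sketch below. The helper `format_metric` is hypothetical, and the sample `actual_results` dict is made up for illustration.

```python
from typing import Optional


def format_metric(value: Optional[float], scale: float = 1.0,
                  suffix: str = "") -> str:
    """Format a numeric metric, or return 'N/A' when it is missing."""
    if value is None:
        return "N/A"
    return f"{value * scale:.2f}{suffix}"


# 'mean_latency' is deliberately absent to exercise the missing-value path.
actual_results = {"baseline_wer": 0.1234}
print(f"Baseline WER: {format_metric(actual_results.get('baseline_wer'), 100, '%')}")
print(f"Mean Latency: {format_metric(actual_results.get('mean_latency'), suffix='s')}")
```

Keeping the missing-value check in one helper means each log line stays a single expression and a legitimate value of 0.0 is still printed as a number.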

Comment on lines +2 to +5
"""
Comprehensive Evaluation Script
Runs actual evaluations to get measured numbers for the report.
"""

Copilot AI Mar 9, 2026

This script appears to overlap with experiments/run_evaluation.py, which states it “Replaces run_comprehensive_evaluations, ... verify_evaluation_numbers.” Keeping multiple entrypoints with overlapping responsibilities increases maintenance burden; consider removing this script, or clearly documenting why both are needed and how they differ.

Comment thread src/evaluation/metrics.py
Comment on lines +193 to +206
while ref_idx < len(ref_timeline) and hyp_idx < len(hyp_timeline):
    ref_seg = ref_timeline[ref_idx]
    hyp_seg = hyp_timeline[hyp_idx]

    # Check overlap
    overlap_start = max(ref_seg['start'], hyp_seg['start'])
    overlap_end = min(ref_seg['end'], hyp_seg['end'])
    overlap_duration = max(0, overlap_end - overlap_start)

    if overlap_duration > tolerance:
        # Check speaker match
        if ref_seg.get('speaker') != hyp_seg.get('speaker'):
            speaker_confusion += overlap_duration
    else:

Copilot AI Mar 9, 2026

calculate_der() can get stuck in an infinite loop: when overlap_duration > tolerance you never advance ref_idx or hyp_idx, so the while condition never changes for overlapping segments. You’ll need to advance at least one index (e.g., whichever segment ends first) after accounting for overlap/speaker confusion.
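
A sketch of the "advance whichever segment ends first" idea follows. It is not the PR's `calculate_der()`: it only accumulates speaker confusion, the function name `speaker_confusion_time` is hypothetical, and it assumes each timeline is sorted by start time with no overlaps inside a single timeline. As in the excerpt, `tolerance` is used as a minimum-overlap threshold.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]  # expected keys: 'start', 'end', 'speaker'


def speaker_confusion_time(ref: List[Segment], hyp: List[Segment],
                           tolerance: float = 0.25) -> float:
    """Sum overlap time where reference and hypothesis speakers differ."""
    confusion = 0.0
    ref_idx = hyp_idx = 0
    while ref_idx < len(ref) and hyp_idx < len(hyp):
        ref_seg, hyp_seg = ref[ref_idx], hyp[hyp_idx]
        overlap = max(0.0, min(ref_seg["end"], hyp_seg["end"])
                      - max(ref_seg["start"], hyp_seg["start"]))
        if overlap > tolerance and ref_seg["speaker"] != hyp_seg["speaker"]:
            confusion += overlap
        # Advance on every iteration, so the loop always terminates.
        if ref_seg["end"] <= hyp_seg["end"]:
            ref_idx += 1
        else:
            hyp_idx += 1
    return confusion


ref = [{"start": 0.0, "end": 5.0, "speaker": "A"},
       {"start": 5.0, "end": 10.0, "speaker": "B"}]
hyp = [{"start": 0.0, "end": 4.0, "speaker": "A"},
       {"start": 4.0, "end": 10.0, "speaker": "B"}]
print(speaker_confusion_time(ref, hyp))
```

In this example the hypothesis switches to speaker B one second early, so the only mismatched overlap is the interval from 4.0 s to 5.0 s.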

Comment thread src/evaluation/metrics.py
Comment on lines +306 to +316
domain_errors = defaultdict(lambda: {'total': 0, 'errors': 0})

for ref, hyp in zip(references, hypotheses):
    domain = self.detect_domain(ref)
    if domain:
        domain_errors[domain]['total'] += 1
        # Calculate WER for this domain
        ref_wer = self.calculate_wer(ref, hyp)
        if ref_wer > 0:
            domain_errors[domain]['errors'] += 1

Copilot AI Mar 9, 2026

calculate_domain_error_rate() is currently computing “% of samples with WER > 0” per domain, not a domain-specific WER (as described in the PR/docs). If the intent is domain-specific WER, accumulate WER values per domain and return an average (or return both average WER and error-free rate).
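
If domain-specific WER is the intent, accumulating per-sample WER could be sketched as below. The names `word_error_rate` and `domain_wer`, the keyword-based detector, and the sample sentences are all hypothetical; the PR's own `calculate_wer` and `detect_domain` are replaced by self-contained stand-ins so the example runs on its own.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def domain_wer(references: List[str], hypotheses: List[str],
               detect_domain: Callable[[str], Optional[str]]) -> Dict[str, float]:
    """Average per-sample WER within each detected domain."""
    totals = defaultdict(lambda: {"wer_sum": 0.0, "count": 0})
    for ref, hyp in zip(references, hypotheses):
        domain = detect_domain(ref)
        if domain:
            totals[domain]["wer_sum"] += word_error_rate(ref, hyp)
            totals[domain]["count"] += 1
    return {d: t["wer_sum"] / t["count"] for d, t in totals.items()}


# Toy detector: any reference mentioning "patient" counts as medical.
detect = lambda text: "medical" if "patient" in text else None
refs = ["the patient has pneumonia", "the patient is stable", "hello world"]
hyps = ["the patient has ammonia", "the patient is stable", "hello word"]
result = domain_wer(refs, hyps, detect)
print(result)
```

Here one of the two medical samples has a single substitution in four words, so the averaged WER is 0.125, whereas the "share of samples with WER > 0" computed by the excerpt would report 0.5 for the same data.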

Comment thread src/evaluation/metrics.py
Comment on lines +14 to +26
# Optional NLTK import for verb extraction
try:
    import nltk
    # Download required NLTK data
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt', quiet=True)
    try:
        nltk.data.find('taggers/averaged_perceptron_tagger')
    except LookupError:
        nltk.download('averaged_perceptron_tagger', quiet=True)
    NLTK_AVAILABLE = True

Copilot AI Mar 9, 2026

This module downloads NLTK datasets at import time (nltk.download(...)). That introduces network/IO side effects just from importing src.evaluation.metrics and can hang/fail in offline or restricted environments. Prefer lazy downloads (only when VER is requested), or require users to pre-install the corpora and raise a clear error with instructions.

Comment on lines +59 to +66
def get_baseline_transcript(self, audio_file: str) -> str:
    """Get baseline transcript from Whisper."""
    try:
        transcript = self.baseline_model.transcribe(audio_file)
        return transcript
    except Exception as e:
        logger.error(f"Error getting baseline transcript: {e}")
        return ""

Copilot AI Mar 9, 2026

get_baseline_transcript() returns the full dict from BaselineSTTModel.transcribe(), but callers treat it as a string (slicing, f-string prompt interpolation). This will raise at runtime and/or send a dict into the LLM prompt. Return result.get('transcript', '') (and optionally keep the full metadata separately).
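
The suggested fix could be sketched as follows. `StubBaselineModel` is a stand-in for `BaselineSTTModel`, and the shape of the dict it returns (a `'transcript'` key plus metadata) is an assumption based on this comment, not verified against the real class.

```python
from typing import Any, Dict


class StubBaselineModel:
    """Stand-in for BaselineSTTModel; the returned dict shape is assumed."""

    def transcribe(self, audio_file: str) -> Dict[str, Any]:
        return {"transcript": "hello world", "latency": 0.42}


def get_baseline_transcript(model: Any, audio_file: str) -> str:
    """Return only the transcript text, or '' if transcription fails."""
    try:
        result = model.transcribe(audio_file)
    except Exception as exc:  # broad catch mirrors the excerpt above
        print(f"Error getting baseline transcript: {exc}")
        return ""
    if isinstance(result, dict):
        return str(result.get("transcript", ""))
    return str(result)


text = get_baseline_transcript(StubBaselineModel(), "data/test.wav")
print(text[:5])  # slicing now operates on a string
```

The `isinstance` branch keeps the helper usable if the model is later changed to return a plain string.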

Comment on lines +13 to +18
# Add src to path
sys.path.insert(0, str(Path(__file__).parent))

from src.baseline_model import BaselineSTTModel
from src.integration import UnifiedSTTSystem, StatisticalAnalyzer, AblationStudy
from src.evaluation.metrics import STTEvaluator

Copilot AI Mar 9, 2026

The sys.path modification adds the experiments/ directory (Path(__file__).parent), but imports are from src..., which typically requires adding the repo root (Path(__file__).resolve().parent.parent) or installing the package. As written, this script will fail to import src when run directly.

Comment on lines +343 to +346
baseline_wer = self.results['baseline']['wer']['mean']
full_wer = self.results['full_system']['wer']['mean']
if baseline_wer and full_wer:
    improvement = ((baseline_wer - full_wer) / baseline_wer) * 100

Copilot AI Mar 9, 2026

In generate_report(), if baseline_wer and full_wer: uses truthiness, so valid values like 0.0 are treated as missing and the improvement section is skipped. Use explicit is not None checks (and guard against division by zero when baseline_wer == 0).
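
A sketch of the explicit checks, using a hypothetical helper `relative_wer_improvement` rather than the PR's `generate_report()`. Returning `None` for a zero baseline is one possible policy; the report could equally print a dedicated message for that case.

```python
from typing import Optional


def relative_wer_improvement(baseline_wer: Optional[float],
                             full_wer: Optional[float]) -> Optional[float]:
    """Percent WER reduction vs. baseline, or None if it cannot be computed."""
    if baseline_wer is None or full_wer is None:
        return None  # a metric is genuinely missing
    if baseline_wer == 0:
        return None  # avoid division by zero
    return (baseline_wer - full_wer) / baseline_wer * 100


# A perfect full-system WER of 0.0 is a valid value and must not be skipped.
print(relative_wer_improvement(0.20, 0.0))
```

With the truthiness check from the excerpt, this perfect-score case would silently drop the improvement section from the report.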


4 participants