…EFERENCE at root, no extra docs, no agent_evaluator) Co-authored-by: Cursor <cursoragent@cursor.com>
- ✅ Created Dockerfile.training (GPU) and Dockerfile.training.cpu (CPU)
- ✅ Verified all dependencies: LoRA/PEFT, Wav2Vec2, PyTorch, audio processing
- ✅ Added Docker Compose configuration for training
- ✅ Created verification and build scripts
- ✅ Comprehensive documentation (quickstart, troubleshooting, completion report)
- ✅ CPU container built and verified successfully (3.05GB)
- ✅ GPU container ready for GCP deployment

All Week 1 tasks for Kavya completed. Training environment is reproducible and ready for GCP deployment.

Co-authored-by: Cursor <cursoragent@cursor.com>
…Docker docs
- Move run_comprehensive_evaluations.py to experiments/
- Move verify_evaluation_numbers.py to experiments/
- Move EVALUATION_VERIFICATION_SUMMARY.md to docs/ and expand explanation
- Delete redundant WEEK1_DOCKER_BUILD_SUMMARY.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Delete redundant WEEK1_TRAINING_DOCKER_QUICKSTART.md (consolidated into docs/WEEK1_TRAINING_DOCKER.md)
- Add Quick Start section to docs/WEEK1_TRAINING_DOCKER.md
- Expand verification summary purpose and context explanation

Co-authored-by: Cursor <cursoragent@cursor.com>
- Upgrade Unified Evaluator with DER, Verb Error Rate, and Domain Error Rate
- Add AReal (RealtimeSTT) investigation script
- Add Miles/Moonshine investigation script
- Add Oracle Teacher script for synthetic gold transcript generation
- Update requirements.txt with new dependencies
- Add comprehensive documentation for new metrics

Made-with: Cursor
Made-with: Cursor
- Make nltk import optional with fallback
- Add NLTK_AVAILABLE flag
- Use fallback verb extraction when NLTK not available
- Prevents test failures when nltk not installed

Made-with: Cursor
- Wrap nltk import in try-except block
- Set NLTK_AVAILABLE flag before logger initialization
- Add fallback verb extraction method
- Prevents ModuleNotFoundError in CI when nltk not installed

Made-with: Cursor
Pull request overview
This PR expands the speech-to-text evaluation framework by adding new evaluation metrics (DER, Verb Error Rate, Domain Error Rate) and several experiment scripts to investigate models and generate/verify transcripts.
Changes:
- Extend `STTEvaluator` with DER, verb-focused scoring, and domain-focused scoring.
- Add new experiment scripts for model investigation (AReal/Miles), report number verification, comprehensive evaluations, and an LLM-based “oracle teacher”.
- Add documentation and new Python dependencies to support the new tooling.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `src/evaluation/metrics.py` | Adds DER/verb/domain metrics and expands batch evaluation output. |
| `requirements.txt` | Adds dependencies for NLTK, DER tooling, OpenAI, and faster-whisper. |
| `experiments/verify_evaluation_numbers.py` | New script to compare “reported” vs. measured evaluation outputs. |
| `experiments/run_comprehensive_evaluations.py` | New script to run baseline/full-system evaluations + stats + ablation and emit a report. |
| `experiments/oracle_teacher.py` | New script to generate “gold” transcripts using OpenAI/Ollama refinement. |
| `experiments/investigate_miles.py` | New investigation script for Moonshine / Miles proxy with latency + metrics. |
| `experiments/investigate_areal.py` | New investigation script for RealtimeSTT/Faster-Whisper with latency + metrics. |
| `docs/EVALUATION_VERIFICATION_SUMMARY.md` | Documents what reported numbers were/weren’t verifiable and outcomes. |
| `docs/EVALUATION_METRICS_UPGRADE.md` | Documents the new metrics and the new scripts’ usage. |

    ref_verbs = set(self.extract_verbs(reference))
    hyp_verbs = set(self.extract_verbs(hypothesis))

    if not ref_verbs:
        return 0.0 if not hyp_verbs else 1.0

    # Calculate errors
    substitutions = len(ref_verbs - hyp_verbs)  # Missing verbs
    insertions = len(hyp_verbs - ref_verbs)  # Extra verbs

    total_errors = substitutions + insertions
    ver = total_errors / len(ref_verbs) if ref_verbs else 0.0

`calculate_verb_error_rate()` converts verb lists to sets, which drops duplicates and changes the metric from counting missing/extra verb occurrences to only checking presence/absence. This doesn’t match the stated formula; consider comparing multisets/counts (e.g., via `collections.Counter`) so repeated verbs are handled correctly.
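
One possible shape for the count-based comparison suggested above, as a minimal sketch. The standalone function `verb_error_rate` and its list-of-verbs inputs are assumptions for illustration; in the PR the verbs would come from `extract_verbs()`, and the empty-reference behaviour simply mirrors the quoted snippet.

```python
from collections import Counter
from typing import List


def verb_error_rate(ref_verbs: List[str], hyp_verbs: List[str]) -> float:
    """Verb error rate over verb occurrences rather than unique verbs."""
    ref_counts = Counter(ref_verbs)
    hyp_counts = Counter(hyp_verbs)

    total_ref = sum(ref_counts.values())
    if total_ref == 0:
        # Same convention as the quoted code: 0.0 unless extra verbs appear.
        return 0.0 if not hyp_counts else 1.0

    # Counter subtraction keeps only positive counts, so each difference
    # holds the occurrences the other side failed to match.
    missing = sum((ref_counts - hyp_counts).values())
    extra = sum((hyp_counts - ref_counts).values())
    return (missing + extra) / total_ref
```

With `["run", "run", "jump"]` as reference and `["run", "jump"]` as hypothesis, a set-based comparison reports no error, while this version counts the one missing `run`.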

    # Add src to path
    sys.path.insert(0, str(Path(__file__).parent))

    from src.baseline_model import BaselineSTTModel
    from src.integration import UnifiedSTTSystem, StatisticalAnalyzer, AblationStudy
    from src.evaluation.metrics import STTEvaluator

Same `sys.path` issue as `verify_evaluation_numbers.py`: adding `Path(__file__).parent` won’t make `src` importable when running this script directly. Use the repo root (`Path(__file__).resolve().parent.parent`) or rely on package installation.

    logger.info("\n✓ Verified from actual evaluation files:")
    logger.info(f" - Baseline WER: {actual_results.get('baseline_wer', 'N/A')*100:.2f}%")
    logger.info(f" - Baseline CER: {actual_results.get('baseline_cer', 'N/A')*100:.2f}%")
    logger.info(f" - Mean Latency: {actual_results.get('mean_latency', 'N/A'):.2f}s")
    logger.info(f" - Throughput: {actual_results.get('throughput', 'N/A'):.2f} samples/s")

Several log lines multiply `actual_results.get(..., 'N/A') * 100` and then format as floats. If the metric isn’t present (e.g., evaluation files missing), the default `'N/A'` will cause type errors or nonsensical output. Consider defaulting to `None` and branching, or using numeric defaults and separately reporting missing values.
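
A small helper along these lines could implement the "default to `None` and branch" option. This is a sketch only: `format_metric` is a hypothetical name, and the values in `actual_results` are made up for illustration.

```python
from typing import Optional


def format_metric(value: Optional[float], scale: float = 1.0, suffix: str = "") -> str:
    """Format a metric to two decimals, or return 'N/A' when it is missing."""
    if value is None:
        return "N/A"
    return f"{value * scale:.2f}{suffix}"


# Made-up results with one metric present and one missing.
actual_results = {"baseline_wer": 0.1234}
wer_line = f" - Baseline WER: {format_metric(actual_results.get('baseline_wer'), 100, '%')}"
latency_line = f" - Mean Latency: {format_metric(actual_results.get('mean_latency'), suffix='s')}"
```

Because the branch happens before any arithmetic or float formatting, a missing key yields a readable `N/A` instead of a `TypeError` or a repeated string.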

    """
    Comprehensive Evaluation Script
    Runs actual evaluations to get measured numbers for the report.
    """

This script appears to overlap with `experiments/run_evaluation.py`, which states it “Replaces run_comprehensive_evaluations, ... verify_evaluation_numbers.” Keeping multiple entrypoints with overlapping responsibilities increases maintenance burden; consider removing this script, or clearly documenting why both are needed and how they differ.

    while ref_idx < len(ref_timeline) and hyp_idx < len(hyp_timeline):
        ref_seg = ref_timeline[ref_idx]
        hyp_seg = hyp_timeline[hyp_idx]

        # Check overlap
        overlap_start = max(ref_seg['start'], hyp_seg['start'])
        overlap_end = min(ref_seg['end'], hyp_seg['end'])
        overlap_duration = max(0, overlap_end - overlap_start)

        if overlap_duration > tolerance:
            # Check speaker match
            if ref_seg.get('speaker') != hyp_seg.get('speaker'):
                speaker_confusion += overlap_duration
            else:

`calculate_der()` can get stuck in an infinite loop: when `overlap_duration > tolerance` you never advance `ref_idx` or `hyp_idx`, so the while condition never changes for overlapping segments. You’ll need to advance at least one index (e.g., whichever segment ends first) after accounting for overlap/speaker confusion.
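
The two-pointer walk could be made to terminate roughly as follows. This is a sketch under several assumptions: both timelines are sorted and non-overlapping within themselves, the truncated `else:` branch is taken to accumulate correctly attributed time, and the function name `overlap_stats` is hypothetical. Missed speech and false alarm are left out to keep the focus on the loop.

```python
from typing import Any, Dict, List, Tuple

Segment = Dict[str, Any]


def overlap_stats(
    ref_timeline: List[Segment],
    hyp_timeline: List[Segment],
    tolerance: float = 0.0,
) -> Tuple[float, float]:
    """Return (speaker_confusion, correct) durations from a two-pointer sweep."""
    ref_idx = hyp_idx = 0
    speaker_confusion = 0.0
    correct = 0.0

    while ref_idx < len(ref_timeline) and hyp_idx < len(hyp_timeline):
        ref_seg = ref_timeline[ref_idx]
        hyp_seg = hyp_timeline[hyp_idx]

        overlap_start = max(ref_seg["start"], hyp_seg["start"])
        overlap_end = min(ref_seg["end"], hyp_seg["end"])
        overlap_duration = max(0.0, overlap_end - overlap_start)

        if overlap_duration > tolerance:
            if ref_seg.get("speaker") != hyp_seg.get("speaker"):
                speaker_confusion += overlap_duration
            else:
                correct += overlap_duration

        # Advance whichever segment ends first on every iteration,
        # so the loop always makes progress.
        if ref_seg["end"] <= hyp_seg["end"]:
            ref_idx += 1
        else:
            hyp_idx += 1

    return speaker_confusion, correct
```

Moving the index update outside the overlap branch means every iteration consumes one segment, which bounds the loop at `len(ref_timeline) + len(hyp_timeline)` steps.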

    domain_errors = defaultdict(lambda: {'total': 0, 'errors': 0})

    for ref, hyp in zip(references, hypotheses):
        domain = self.detect_domain(ref)
        if domain:
            domain_errors[domain]['total'] += 1
            # Calculate WER for this domain
            ref_wer = self.calculate_wer(ref, hyp)
            if ref_wer > 0:
                domain_errors[domain]['errors'] += 1

`calculate_domain_error_rate()` is currently computing “% of samples with WER > 0” per domain, not a domain-specific WER (as described in the PR/docs). If the intent is domain-specific WER, accumulate WER values per domain and return an average (or return both average WER and error-free rate).
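
If domain-specific WER is the intent, the accumulation might look like the sketch below. Everything here is illustrative: `word_error_rate` is a plain word-level Levenshtein stand-in for the evaluator's own `calculate_wer()`, and `detect_domain` is passed in as a callable because its real implementation is not shown in the diff.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def domain_error_rates(
    references: List[str],
    hypotheses: List[str],
    detect_domain: Callable[[str], Optional[str]],
) -> Dict[str, Dict[str, float]]:
    """Report both mean WER and error-free rate for each detected domain."""
    wers: Dict[str, List[float]] = defaultdict(list)
    for ref, hyp in zip(references, hypotheses):
        domain = detect_domain(ref)
        if domain:
            wers[domain].append(word_error_rate(ref, hyp))
    return {
        domain: {
            "mean_wer": sum(values) / len(values),
            "error_free_rate": sum(v == 0 for v in values) / len(values),
        }
        for domain, values in wers.items()
    }
```

Returning both numbers keeps the existing "fraction of samples with any error" view available while adding the averaged WER that the docs describe.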

    # Optional NLTK import for verb extraction
    try:
        import nltk
        # Download required NLTK data
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt', quiet=True)
        try:
            nltk.data.find('taggers/averaged_perceptron_tagger')
        except LookupError:
            nltk.download('averaged_perceptron_tagger', quiet=True)
        NLTK_AVAILABLE = True

This module downloads NLTK datasets at import time (`nltk.download(...)`). That introduces network/IO side effects just from importing `src.evaluation.metrics` and can hang/fail in offline or restricted environments. Prefer lazy downloads (only when VER is requested), or require users to pre-install the corpora and raise a clear error with instructions.
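
A lazy variant could defer both the `nltk` import and any download until verb extraction is first requested. The sketch below assumes a hypothetical helper named `ensure_nltk_resources`; the resource names are the ones from the quoted snippet.

```python
import importlib.util
from functools import lru_cache

# Checking for the package spec does not import nltk or touch the network.
NLTK_AVAILABLE = importlib.util.find_spec("nltk") is not None


@lru_cache(maxsize=1)
def ensure_nltk_resources() -> bool:
    """Look up (and fetch if needed) tagger data the first time VER is requested."""
    if not NLTK_AVAILABLE:
        return False
    import nltk

    resources = (
        ("tokenizers/punkt", "punkt"),
        ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger"),
    )
    for path, package in resources:
        try:
            nltk.data.find(path)
        except LookupError:
            # Only reached when the data is missing locally.
            if not nltk.download(package, quiet=True):
                return False
    return True
```

The verb extraction method would call `ensure_nltk_resources()` and use the fallback extractor when it returns `False`, so importing the metrics module stays free of side effects.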

    def get_baseline_transcript(self, audio_file: str) -> str:
        """Get baseline transcript from Whisper."""
        try:
            transcript = self.baseline_model.transcribe(audio_file)
            return transcript
        except Exception as e:
            logger.error(f"Error getting baseline transcript: {e}")
            return ""

`get_baseline_transcript()` returns the full dict from `BaselineSTTModel.transcribe()`, but callers treat it as a string (slicing, f-string prompt interpolation). This will raise at runtime and/or send a dict into the LLM prompt. Return `result.get('transcript', '')` (and optionally keep the full metadata separately).
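
One way to normalise the return value is a small extraction step, sketched here. The helper name `extract_transcript` is hypothetical, and the `'transcript'` key is taken from the review comment rather than from the model's documented output.

```python
from typing import Any, Dict, Union


def extract_transcript(result: Union[str, Dict[str, Any], None]) -> str:
    """Return the transcript text whether the model gave a dict or a plain string."""
    if isinstance(result, dict):
        text = result.get("transcript")
        return text if isinstance(text, str) else ""
    return result or ""
```

Inside `get_baseline_transcript()`, returning `extract_transcript(self.baseline_model.transcribe(audio_file))` would keep the declared `-> str` contract true for callers that slice or interpolate the value.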

    # Add src to path
    sys.path.insert(0, str(Path(__file__).parent))

    from src.baseline_model import BaselineSTTModel
    from src.integration import UnifiedSTTSystem, StatisticalAnalyzer, AblationStudy
    from src.evaluation.metrics import STTEvaluator

The `sys.path` modification adds the `experiments/` directory (`Path(__file__).parent`), but imports are from `src...`, which typically requires adding the repo root (`Path(__file__).resolve().parent.parent`) or installing the package. As written, this script will fail to import `src` when run directly.
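
The repo-root variant can be sketched as below, assuming the scripts live exactly one directory beneath the repository root (as `experiments/` does in this PR). The helper names are hypothetical.

```python
import sys
from pathlib import Path


def repo_root_for(script_path: str) -> Path:
    """Return the directory two levels above a script, i.e. the parent of experiments/."""
    return Path(script_path).resolve().parent.parent


def add_repo_root_to_path(script_path: str) -> Path:
    """Put the repo root at the front of sys.path once, and return it."""
    root = repo_root_for(script_path)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

A script would call `add_repo_root_to_path(__file__)` before its `from src...` imports. Installing the project as a package avoids the path manipulation altogether.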

    baseline_wer = self.results['baseline']['wer']['mean']
    full_wer = self.results['full_system']['wer']['mean']
    if baseline_wer and full_wer:
        improvement = ((baseline_wer - full_wer) / baseline_wer) * 100

In `generate_report()`, `if baseline_wer and full_wer:` uses truthiness, so valid values like `0.0` are treated as missing and the improvement section is skipped. Use explicit `is not None` checks (and guard against division by zero when `baseline_wer == 0`).
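
The explicit checks might be factored out like this. It is a sketch: `relative_wer_improvement` is a hypothetical helper, and returning `None` for an undefined improvement is one possible convention.

```python
from typing import Optional


def relative_wer_improvement(
    baseline_wer: Optional[float], full_wer: Optional[float]
) -> Optional[float]:
    """Relative WER improvement in percent, or None when it cannot be computed."""
    if baseline_wer is None or full_wer is None:
        return None
    if baseline_wer == 0:
        # A perfect baseline leaves nothing to improve on; avoid dividing by zero.
        return None
    return (baseline_wer - full_wer) / baseline_wer * 100
```

A full-system WER of `0.0` now produces a 100% improvement instead of being skipped as "missing".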
Overview
Added comprehensive evaluation metrics and investigation tools to enhance our speech-to-text evaluation framework.
Key Features
1. Upgraded Unified Evaluator ()
Added three new metrics to the existing WER/CER evaluator:
- Diarization Error Rate (DER): Measures speaker diarization accuracy
  `DER = (Missed Speech + False Alarm + Speaker Confusion) / Total Duration`
- Verb Error Rate: Measures accuracy of verb transcription specifically
  `(Missing Verbs + Extra Verbs) / Total Reference Verbs`
- Domain Error Rate: Measures accuracy within specific domains
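
As a worked instance of the DER formula, here is a minimal sketch. The function name and the assumption that all four quantities are durations in seconds are illustrative, not taken from the PR.

```python
def diarization_error_rate(
    missed_speech: float,
    false_alarm: float,
    speaker_confusion: float,
    total_duration: float,
) -> float:
    """DER = (missed speech + false alarm + speaker confusion) / total duration."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return (missed_speech + false_alarm + speaker_confusion) / total_duration
```

For example, 1.0 s of missed speech, 0.5 s of false alarm, and 0.5 s of speaker confusion over 10 s of reference speech give a DER of 0.2.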
2. AReal (RealtimeSTT) Investigation Script ()
Usage:
3. Miles/Moonshine Investigation Script ()
Usage:
4. Oracle Teacher Script ()
Generates synthetic "gold" transcripts using LLM APIs for cases without ground truth:
Usage:
New Dependencies
Added to `requirements.txt`:
- `nltk>=3.8.0` - For POS tagging (verb extraction)
- `pyannote.metrics>=4.0.0` - For advanced DER calculation (optional)
- `openai>=1.0.0` - For Oracle Teacher OpenAI API
- `faster-whisper>=0.10.0` - For AReal investigation

Documentation
Created `docs/EVALUATION_METRICS_UPGRADE.md` with:
Technical Details
Evaluation Metrics Implementation
Oracle Teacher Output Format

    {
      "audio_file": "data/test.wav",
      "baseline_transcript": "the patient was diagnose with pneumonia",
      "gold_transcript": "The patient was diagnosed with pneumonia.",
      "refinement_time": 2.5,
      "model": "gpt-4o",
      "api_type": "openai"
    }

Testing
`experiments/evaluation_outputs/`
Files Changed
- `src/evaluation/metrics.py` - Upgraded with new metrics
- `requirements.txt` - Added new dependencies
- `experiments/investigate_areal.py` - New investigation script
- `experiments/investigate_miles.py` - New investigation script
- `experiments/oracle_teacher.py` - New Oracle Teacher script
- `docs/EVALUATION_METRICS_UPGRADE.md` - New documentation

Next Steps