Granary v2 postprocessing #1

Draft
nune-tadevosyan wants to merge 8 commits into shubhamNvidia:pr/advance-pipeline from nune-tadevosyan:granary-v2-postprocessing

nune-tadevosyan commented Apr 15, 2026

Description

Adds a production-ready postprocessing pipeline for Granary v2 ASR manifests under tutorials/audio/granary_v2_postprocessing/.

The pipeline reads JSONL manifests produced by ASR inference, applies text cleaning and quality filtering, and writes output manifests that preserve the full input directory structure. All entries are kept in the output; low-quality entries are flagged with skip_me=1 for downstream use rather than dropped.

Pipeline stages
ALMManifestReader - Reads JSONL — one AudioTask per line
InitializeFieldsStage - Copies pred_text → cleaned_text; sets skip_me = 0
RegexSubstitutionStage - Normalizes cleaned_text (quotes, dashes, brackets, whitespace)
WhisperHallucinationStage - Sets skip_me = 1 for repeated n-grams, long words, known hallucination phrases, or an abnormal characters-per-duration rate
FastTextLIDStage - Sets skip_me = 1 for non-English or low-confidence language ID
FinalizeFieldsStage - Renames text → v1_text, promotes cleaned_text → text, drops pnc/itn/timestamp
ALMManifestWriterStage - Writes all entries (clean and flagged) to mirrored output paths
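
The field handling in the stage list above can be illustrated with plain dictionaries. This is a minimal sketch, not the PR's implementation: the function names and the four regex rules are assumptions made for illustration, and only the field names (pred_text, cleaned_text, skip_me, text, v1_text, pnc, itn, timestamp) come from the description.

```python
import re

# Hypothetical normalization rules; the PR's actual regex list is not shown here.
_RULES = [
    (re.compile("[\u201c\u201d]"), '"'),  # curly double quotes -> straight quote
    (re.compile("[\u2018\u2019]"), "'"),  # curly single quotes -> apostrophe
    (re.compile("[\u2013\u2014]"), "-"),  # en and em dashes -> hyphen
    (re.compile(r"\s+"), " "),            # collapse runs of whitespace
]


def initialize_fields(entry: dict) -> dict:
    """Copy pred_text into cleaned_text and start with skip_me = 0."""
    out = dict(entry)
    out["cleaned_text"] = out.get("pred_text", "")
    out["skip_me"] = 0
    return out


def normalize(entry: dict) -> dict:
    """Apply each substitution rule, in order, to cleaned_text."""
    out = dict(entry)
    text = out["cleaned_text"]
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    out["cleaned_text"] = text.strip()
    return out


def finalize_fields(entry: dict) -> dict:
    """Rename text -> v1_text, promote cleaned_text -> text, drop unused fields."""
    out = dict(entry)
    if "text" in out:  # assumption: entries without a prior text are tolerated
        out["v1_text"] = out.pop("text")
    out["text"] = out.pop("cleaned_text")
    for key in ("pnc", "itn", "timestamp"):
        out.pop(key, None)
    return out


entry = {
    "audio_filepath": "a.wav",
    "text": "old",
    "pred_text": "\u201cHi\u201d  there \u2014 ok",
    "pnc": "yes",
    "itn": "no",
    "timestamp": [0, 1],
}
result = finalize_fields(normalize(initialize_fields(entry)))
```

Each function returns a new dictionary, so a flagged entry still passes through every later step and reaches the writer unchanged apart from its skip_me value.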

Key features:

  • Directory-based input — recursively finds all *.jsonl under --input_dir, no YAML config needed
  • Checkpointing — manifests with an existing non-empty output are skipped on rerun; atomic writes (.tmp + rename) prevent partial files from being mistaken for complete output
  • Slurm-ready — submit.sh groups manifests into configurable chunks and submits one job per chunk, all running in parallel
  bash tutorials/audio/granary_v2_postprocessing/submit.sh \
      /path/to/output_root \
      /path/to/input_dir

Before submitting, make sure to set how many manifests each job should process.
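
The discovery, checkpointing, and chunking behaviour listed under the key features could look roughly like the sketch below. All function names here are hypothetical and the real logic lives in the PR's scripts; the sketch only assumes what the description states: recursive *.jsonl discovery, mirrored output paths, skipping manifests whose output already exists and is non-empty, and writing via a .tmp file followed by a rename.

```python
import json
import os
from pathlib import Path


def output_path_for(manifest: Path, input_dir: Path, output_root: Path) -> Path:
    """Mirror the manifest's path relative to input_dir under output_root."""
    return output_root / manifest.relative_to(input_dir)


def is_done(out_path: Path) -> bool:
    """A manifest counts as processed if its output exists and is non-empty."""
    return out_path.is_file() and out_path.stat().st_size > 0


def write_atomic(out_path: Path, entries: list) -> None:
    """Write to <name>.tmp first, then rename, so partial files never look complete."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    os.replace(tmp_path, out_path)  # rename within the same directory


def pending_manifests(input_dir: Path, output_root: Path) -> list:
    """Recursively find *.jsonl files that still need processing."""
    return [
        m
        for m in sorted(input_dir.rglob("*.jsonl"))
        if not is_done(output_path_for(m, input_dir, output_root))
    ]


def chunked(items: list, size: int) -> list:
    """Split items into consecutive groups of at most `size` (one group per job)."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

If a job is interrupted mid-write, only the .tmp file is left behind, so the manifest is still reported by pending_manifests on the next run.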

nune-tadevosyan and others added 6 commits April 14, 2026 01:38
Signed-off-by: ntadevosyan <ntadevosyan@nvidia.com>
Signed-off-by: ntadevosyan <ntadevosyan@nvidia.com>
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>
2. InitializeFieldsStage — copy pred_text → cleaned_text; skip_me = 0
3. RegexSubstitutionStage — apply regex normalization rules to cleaned_text
4. WhisperHallucinationStage — flag Whisper hallucination patterns (sets skip_me=1)
5. FastTextLIDStage — flag non-English or low-confidence transcriptions (sets skip_me=1)


What does "low-confidence transcriptions" mean here?

Author


FastText returns two predictions: a language label and the probability of that label.

  1. If the language is anything other than English, we skip the entry.
  2. If the language is English but its probability is below 0.7, we also skip it.
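
As a minimal sketch, the two rules above reduce to a single predicate. The function name and parameters are hypothetical, and the "__label__" prefix handling assumes the label format that fastText language-ID models typically emit; the model call itself is left out so the example stays self-contained.

```python
def lid_skip_flag(label: str, prob: float,
                  target_lang: str = "en", min_prob: float = 0.7) -> int:
    """Return 1 (skip_me) if the prediction is not target_lang or is below min_prob."""
    prefix = "__label__"
    # Strip the fastText-style prefix if present (assumed label format).
    lang = label[len(prefix):] if label.startswith(prefix) else label
    if lang != target_lang:
        return 1  # rule 1: any language other than English
    if prob < min_prob:
        return 1  # rule 2: English, but with probability below the threshold
    return 0
```

With these defaults, an English prediction at exactly 0.7 is kept, since the rule only flags probabilities strictly smaller than the threshold.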


Were you able to listen to some samples where it predicted English but with a lower probability? I'm curious about such data in our Granary set.

Nune Tadevosyan added 2 commits April 16, 2026 00:32
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos01.eos.clusters.nvidia.com>
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos01.eos.clusters.nvidia.com>

2 participants