Granary v2 postprocessing by nune-tadevosyan · Pull Request #1 · shubhamNvidia/Curator

nune-tadevosyan · 2026-04-15T11:45:47Z

Description

Adds a production-ready postprocessing pipeline for Granary v2 ASR manifests under tutorials/audio/granary_v2_postprocessing/.

The pipeline reads JSONL manifests produced by ASR inference, applies text cleaning and quality filtering, and writes output manifests preserving the full input directory structure. All entries are kept in the output; low-quality entries are flagged with skip_me=1 for downstream use rather than dropped.

Pipeline stages
ALMManifestReader - Reads JSONL, one AudioTask per line
InitializeFieldsStage - Copies pred_text → cleaned_text; sets skip_me = 0
RegexSubstitutionStage - Normalizes cleaned_text (quotes, dashes, brackets, whitespace)
WhisperHallucinationStage - Sets skip_me = 1 for repeated n-grams, overly long words, known hallucination phrases, or an abnormal character rate (chars/duration)
FastTextLIDStage - Sets skip_me = 1 for non-English or low-confidence language ID
FinalizeFieldsStage - Renames text → v1_text, promotes cleaned_text → text, drops pnc/itn/timestamp
ALMManifestWriterStage - Writes all entries (clean and flagged) to mirrored output paths
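
The field flow through the stages above can be sketched on a single manifest entry. This is a minimal illustration, not the code in this PR: the regex rules, the thresholds (MAX_WORD_LEN, MAX_CHARS_PER_SEC, MAX_NGRAM_REPEATS), the function names, and the duration field are all assumptions, and the known-phrase and language-ID checks are left out.

```python
import re

# Hypothetical normalization rules; the PR's actual regex list is not shown here.
REGEX_RULES = [
    (re.compile("[\u201c\u201d]"), '"'),   # curly double quotes -> straight
    (re.compile("[\u2018\u2019]"), "'"),   # curly single quotes -> straight
    (re.compile("[\u2013\u2014]"), "-"),   # en/em dashes -> hyphen
    (re.compile(r"\[[^\]]*\]"), " "),      # drop bracketed tags
    (re.compile(r"\s+"), " "),             # collapse whitespace runs
]

# Assumed thresholds, chosen only for illustration.
MAX_WORD_LEN = 40
MAX_CHARS_PER_SEC = 40.0
MAX_NGRAM_REPEATS = 3


def initialize_fields(entry):
    """Copy pred_text into cleaned_text and start with skip_me = 0."""
    entry = dict(entry)
    entry["cleaned_text"] = entry.get("pred_text", "")
    entry["skip_me"] = 0
    return entry


def normalize(entry):
    """Apply each regex rule in order to cleaned_text."""
    text = entry["cleaned_text"]
    for pattern, replacement in REGEX_RULES:
        text = pattern.sub(replacement, text)
    entry["cleaned_text"] = text.strip()
    return entry


def looks_hallucinated(text, duration, n=3):
    """Flag long words, a high character rate, or a heavily repeated n-gram."""
    words = text.split()
    if any(len(word) > MAX_WORD_LEN for word in words):
        return True
    if duration > 0 and len(text) / duration > MAX_CHARS_PER_SEC:
        return True
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > MAX_NGRAM_REPEATS:
            return True
    return False


def finalize_fields(entry):
    """Rename text -> v1_text, promote cleaned_text -> text, drop unused keys."""
    entry = dict(entry)
    if "text" in entry:
        entry["v1_text"] = entry.pop("text")
    entry["text"] = entry.pop("cleaned_text")
    for key in ("pnc", "itn", "timestamp"):
        entry.pop(key, None)
    return entry


def postprocess(entry):
    """Run the sketched stages in order; entries are flagged, never dropped."""
    entry = normalize(initialize_fields(entry))
    if looks_hallucinated(entry["cleaned_text"], entry.get("duration", 0.0)):
        entry["skip_me"] = 1
    return finalize_fields(entry)
```

Every entry comes back from postprocess, whether or not it was flagged, which mirrors the "flag rather than drop" behavior described above.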

Key features:

Directory-based input — recursively finds all *.jsonl under --input_dir, no YAML config needed
Checkpointing — manifests with an existing non-empty output are skipped on rerun; atomic writes (.tmp + rename) prevent partial files from being mistaken for complete output
Slurm-ready — submit.sh groups manifests into configurable chunks and submits one job per chunk, all running in parallel
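
The discovery and checkpointing behavior listed above could look roughly like the following. It is a sketch under stated assumptions: the helper names (find_manifests, output_path_for, is_done, write_atomically, run) are hypothetical and the PR's actual implementation may differ.

```python
import json
import os
from pathlib import Path


def find_manifests(input_dir):
    """Recursively collect every *.jsonl file under input_dir."""
    return sorted(Path(input_dir).rglob("*.jsonl"))


def output_path_for(manifest, input_dir, output_dir):
    """Mirror the manifest's path relative to input_dir under output_dir."""
    return Path(output_dir) / Path(manifest).relative_to(input_dir)


def is_done(out_path):
    """A manifest counts as done when a non-empty output already exists."""
    return out_path.is_file() and out_path.stat().st_size > 0


def write_atomically(out_path, entries):
    """Write to a .tmp sibling, then rename it over the final path."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    # os.replace overwrites the destination in one step, so an interrupted
    # job leaves only a .tmp file, never a partial final manifest.
    os.replace(tmp_path, out_path)


def run(input_dir, output_dir, process_entry=lambda entry: entry):
    """Process every manifest that has no finished output; return new outputs."""
    written = []
    for manifest in find_manifests(input_dir):
        out_path = output_path_for(manifest, input_dir, output_dir)
        if is_done(out_path):
            continue  # checkpoint hit: skip on rerun
        with open(manifest, encoding="utf-8") as f:
            entries = [process_entry(json.loads(line)) for line in f if line.strip()]
        write_atomically(out_path, entries)
        written.append(out_path)
    return written
```

Because the temporary file ends in .tmp, it never matches the *.jsonl pattern and is never mistaken for a finished manifest.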

  bash tutorials/audio/granary_v2_postprocessing/submit.sh \
      /path/to/output_root \
      /path/to/input_dir

Before submitting, set how many manifests each job should process (the chunk size).
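
The grouping that submit.sh performs can be pictured as a fixed-size split of the manifest list, with one job per group. The sketch below is in Python for illustration only; chunk_manifests and chunk_size are hypothetical names, and the script's real variable for this setting is not shown in this description.

```python
def chunk_manifests(manifests, chunk_size):
    """Split the manifest list into consecutive groups of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [manifests[i:i + chunk_size] for i in range(0, len(manifests), chunk_size)]
```

With five manifests and a chunk size of two, this yields three groups, so three jobs would be submitted and the last one would process a single manifest.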

Signed-off-by: ntadevosyan <ntadevosyan@nvidia.com>

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>

nithinraok · 2026-04-15T13:40:49Z

+  2. InitializeFieldsStage     — copy pred_text → cleaned_text; skip_me = 0
+  3. RegexSubstitutionStage    — apply regex normalization rules to cleaned_text
+  4. WhisperHallucinationStage — flag Whisper hallucination patterns (sets skip_me=1)
+  5. FastTextLIDStage          — flag non-English or low-confidence transcriptions (sets skip_me=1)


What does "low-confidence transcriptions" mean here?

FastText gives two predictions: the language and the probability of that language.

If the language is anything other than English, we skip it.

If the language is English but its probability is below 0.7, we also skip it.
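
The rule described in this reply can be written as a small decision function. This is a sketch, not the stage's code: lid_skip_flag is a hypothetical name, and it assumes the usual fastText label format with a __label__ prefix (for example __label__en). The 0.7 threshold comes from the reply above.

```python
LABEL_PREFIX = "__label__"  # default prefix on fastText labels


def lid_skip_flag(label, probability, target_lang="en", threshold=0.7):
    """Return 1 (flag) for non-English or low-confidence English, else 0."""
    lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    if lang != target_lang:
        return 1  # predicted language is not English
    if probability < threshold:
        return 1  # English, but below the confidence threshold
    return 0
```

An entry predicted as English with probability exactly 0.7 is kept under this reading, since only values smaller than 0.7 are flagged.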

Were you able to listen to some samples where it predicted English but with lower probability? I'm curious about such data in our Granary set.

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos01.eos.clusters.nvidia.com>

nune-tadevosyan and others added 6 commits April 14, 2026 01:38

Curator run gravary_v2

b168f82

Signed-off-by: ntadevosyan <ntadevosyan@nvidia.com>

update

b212afe

Signed-off-by: ntadevosyan <ntadevosyan@nvidia.com>

Pipeline

22bb986

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>

scripts

9281125

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>

submission update

71a0998

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>

update

6f300ae

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>

nithinraok reviewed Apr 15, 2026

Nune Tadevosyan added 2 commits April 16, 2026 00:32

benchmarks submit

b4e6181

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos01.eos.clusters.nvidia.com>

Update

449e3d4

Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos01.eos.clusters.nvidia.com>

Granary v2 postprocessing #1
nune-tadevosyan wants to merge 8 commits into shubhamNvidia:pr/advance-pipeline from nune-tadevosyan:granary-v2-postprocessing


2 participants
