Granary v2 postprocessing #1
Draft
nune-tadevosyan wants to merge 8 commits into
Signed-off-by: ntadevosyan <ntadevosyan@nvidia.com>
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos02.eos.clusters.nvidia.com>
nithinraok
reviewed
Apr 15, 2026
2. InitializeFieldsStage - copy pred_text → cleaned_text; skip_me = 0
3. RegexSubstitutionStage - apply regex normalization rules to cleaned_text
4. WhisperHallucinationStage - flag Whisper hallucination patterns (sets skip_me=1)
5. FastTextLIDStage - flag non-English or low-confidence transcriptions (sets skip_me=1)
What does "low-confidence transcriptions" mean here?
Author
FastText gives two predictions: the language and the probability of that language.
- If the language is anything other than English, we skip it.
- If the language is English but the probability is below 0.7, we also skip it.
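The two rules above can be sketched as a small decision function. This is a minimal illustration, not the code in this PR: the function name `lid_skip_flag` and the label string `"en"` are assumptions, and the language and probability are passed in directly rather than obtained from a FastText model. Only the 0.7 threshold comes from the comment above.

```python
ENGLISH_LABEL = "en"     # assumed label string; the real label format may differ
MIN_ENGLISH_PROB = 0.7   # threshold quoted in the comment above


def lid_skip_flag(language: str, probability: float) -> int:
    """Return the skip_me value (1 = flagged, 0 = kept clean) for one LID prediction."""
    # Rule 1: anything other than English is flagged.
    if language != ENGLISH_LABEL:
        return 1
    # Rule 2: English with probability below the threshold is also flagged.
    if probability < MIN_ENGLISH_PROB:
        return 1
    return 0
```

Under this sketch, a prediction of exactly 0.7 is kept, since the comment only says probabilities smaller than 0.7 are skipped.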
Were you able to listen to some samples where it predicted English but with a lower probability? I'm curious about such data from our Granary set.
added 2 commits
April 16, 2026 00:32
Signed-off-by: Nune Tadevosyan <ntadevosyan@login-eos01.eos.clusters.nvidia.com>
Description
Adds a production-ready postprocessing pipeline for Granary v2 ASR manifests under tutorials/audio/granary_v2_postprocessing/.
The pipeline reads JSONL manifests produced by ASR inference, applies text cleaning and quality filtering, and writes output manifests preserving the full input directory structure. All entries are kept in the output; low-quality entries are flagged with skip_me=1 for downstream use rather than dropped.
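As a rough illustration of the "flag, don't drop" behavior and the mirrored output layout, here is a minimal sketch using only the standard library. The helper names (`mirrored_output_path`, `process_manifest`) and the `is_low_quality` callback are hypothetical and are not the stage classes added in this PR.

```python
import json
from pathlib import Path
from typing import Callable


def mirrored_output_path(manifest: Path, input_root: Path, output_root: Path) -> Path:
    """Map an input manifest to the same relative location under output_root."""
    return output_root / manifest.relative_to(input_root)


def process_manifest(
    manifest: Path,
    input_root: Path,
    output_root: Path,
    is_low_quality: Callable[[dict], bool],  # hypothetical quality check
) -> Path:
    """Copy every entry to the mirrored path, flagging low-quality ones."""
    out_path = mirrored_output_path(manifest, input_root, output_root)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest.open(encoding="utf-8") as src, out_path.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            entry = json.loads(line)
            # Entries are never dropped; low-quality ones only get skip_me=1.
            entry["skip_me"] = 1 if is_low_quality(entry) else 0
            dst.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return out_path
```

Keeping flagged entries in the output leaves the filtering decision to the downstream consumer, which can select on skip_me without re-running the pipeline.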
Pipeline stages
ALMManifestReader - Reads JSONL — one AudioTask per line
InitializeFieldsStage - Copies pred_text → cleaned_text; sets skip_me = 0
RegexSubstitutionStage - Normalizes cleaned_text (quotes, dashes, brackets, whitespace)
WhisperHallucinationStage - Sets skip_me = 1 for repeated n-grams, long words, known hallucination phrases, or abnormal chars/duration rate
FastTextLIDStage - Sets skip_me = 1 for non-English or low-confidence language ID
FinalizeFieldsStage - Renames text → v1_text, promotes cleaned_text → text, drops pnc/itn/timestamp
ALMManifestWriterStage - Writes all entries (clean and flagged) to mirrored output paths
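The field handling at the start and end of the stage list can be sketched on plain dictionaries. This is an assumption-laden illustration: the function names are hypothetical, the real stages operate on AudioTask objects rather than dicts, and whether pred_text is kept after finalization is not stated in this description (the sketch keeps it).

```python
from typing import Any

# Fields dropped during finalization, as listed in the description.
DROPPED_FIELDS = ("pnc", "itn", "timestamp")


def initialize_fields(entry: dict[str, Any]) -> dict[str, Any]:
    """Copy pred_text to cleaned_text and start with skip_me = 0."""
    out = dict(entry)  # work on a copy so the input entry is not mutated
    out["cleaned_text"] = out.get("pred_text", "")
    out["skip_me"] = 0
    return out


def finalize_fields(entry: dict[str, Any]) -> dict[str, Any]:
    """Rename text to v1_text, promote cleaned_text to text, drop unused fields."""
    out = dict(entry)
    if "text" in out:
        out["v1_text"] = out.pop("text")
    out["text"] = out.pop("cleaned_text", "")
    for key in DROPPED_FIELDS:
        out.pop(key, None)
    return out
```

In the actual pipeline the cleaning and flagging stages would run between these two steps, rewriting cleaned_text and possibly setting skip_me to 1.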
Key features:
It is important to set how many manifests should be processed in one job submission.