From 58c3d63bf6e56524bf968c1926d0bb0eab63b18f Mon Sep 17 00:00:00 2001
From: RoboZephyr <276202023+RoboZephyr@users.noreply.github.com>
Date: Fri, 29 May 2026 15:48:40 +0800
Subject: [PATCH] =?UTF-8?q?feat(seedance):=20production=20QA=20section=20?=
 =?UTF-8?q?=E2=80=94=20audio-vs-script=20drift=20detection=20(post-gen=20v?=
 =?UTF-8?q?erification)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Backfills a real dogfood insight from local module use into the OSS
library.

When `generate_audio: true` is enabled (the default on Seedance 2.0) and
your content has dialogue, Seedance produces real spoken audio but the
actual words spoken may diverge from the dialogue you wrote in the
prompt. The model is creative, not literal — prompt dialogue in double
quotes is a strong hint but for longer / narrative content the model can
omit lines, rephrase, substitute homophones, or fill silences with
ambient sound.

This matters for subtitled-content publish workflows; doesn't matter for
vibe / no-dialogue clips.

Body adds (between "Trusted-result chaining" and "Polling vs callback")
- 6-step production pattern: generate → ffmpeg extract → TOS upload →
  Seed-ASR transcribe → diff vs target → tag mismatches for editorial
- 5-row status taxonomy mirrored from `library/volcengine-speech`:
  asr_ok_text_match / asr_ok_text_mismatch / video_incomplete /
  asr_unreliable / asr_no_speech (consistent label convention across the
  trove content pipeline)
- 3 production rules: don't silently overwrite target subtitles with
  ASR; subtitles are a post-production layer; `status: "succeeded"` is
  not content QA — audio-vs-script is a separate gate
- Cross-references to the volcengine-tos (audio hosting) and
  volcengine-speech (transcription) modules — closes the production loop
  with explicit links

Privacy
- Maintainer's local module had this insight in a "Production Note
  (2026-05-22)" section with private references (downstream project
  name, episode ID, file name, internal field/tool names). All stripped
  — the OSS version frames the same engineering insight in general terms
  with public placeholders
- Pre-commit hook PRIVATE_RE scan: clean on staged diff

Validation
- bun bin/cli.ts validate library/seedance: 0 errors / 0 warnings

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 library/seedance/module.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/library/seedance/module.md b/library/seedance/module.md
index cc7ed65..e812220 100644
--- a/library/seedance/module.md
+++ b/library/seedance/module.md
@@ -405,6 +405,42 @@ A face that Seedance 2.0 / Fast generated, or that Seedream 5.0 lite generated a
 
 ---
 
+## Production QA: audio vs script drift (post-generation verification)
+
+When `generate_audio: true` (the default) is enabled and your content has narration / dialogue, Seedance **will produce real spoken audio** — but the actual words spoken may **diverge from the dialogue you wrote in the prompt**. The model is creative, not literal: prompt dialogue inside double quotes is a strong hint, but for longer or more abstract content the model can omit lines, rephrase, fill silences with ambient sound, or substitute homophones.
+
+**This matters when** you're producing subtitled content for publish (drama / lessons / podcasts / dubbed shorts). It does **not** matter for vibe / no-dialogue clips where the audio just needs to match the mood.
+
+**Production pattern — when subtitle accuracy is part of the deliverable**:
+
+1. Generate the video with audio-on (this section), poll to completion
+2. Download the produced mp4, extract audio:
+   ```bash
+   ffmpeg -i out.mp4 -vn -acodec libmp3lame -b:a 128k audio.mp3
+   ```
+3. Upload `audio.mp3` to a public-read TOS bucket (see [library/volcengine-tos](../volcengine-tos/module.md))
+4. Submit to Seed-ASR for transcription (see [library/volcengine-speech](../volcengine-speech/module.md))
+5. Diff ASR transcript against your target subtitle text (Levenshtein / token-overlap / per-utterance match — pick a tolerance that fits your content)
+6. **Tag the result** and surface mismatches for editorial review before publish
+
+**Recommended status taxonomy** (mirrors the volcengine-speech module's labels — keep them consistent across your content pipeline):
+
+| label | meaning | next action |
+|---|---|---|
+| `asr_ok_text_match` | ASR succeeded; spoken words match target subtitle within tolerance | ship as-is |
+| `asr_ok_text_mismatch` | ASR succeeded; **Seedance spoke different words** than the target script | editorial review: keep ASR text, keep target text, or re-generate |
+| `video_incomplete` | source mp4 is partial / old / re-rendered; ASR returns far fewer utterances than the script expects | re-generate; the visible content doesn't represent the full scene |
+| `asr_unreliable` | audio clearly exists in playback but ASR output is garbled or empty | check audio sample rate / encoding; consider re-extracting at 16 kHz mono |
+| `asr_no_speech` | input audio is silent / music-only / non-speech | this script doesn't need ASR QA; mark `generate_audio: false` next time |
+
+**Critical production rules**:
+
+- **Don't silently overwrite target subtitles with ASR text.** ASR is diagnostic, not authoritative. Even when the model spoke "different but valid" words, your stored subtitle text is what should drive captions on publish — unless an editor explicitly accepts the ASR variant.
+- **Subtitles are a post-production layer**: target text from your script drives rendering; ASR utterance timestamps drive timing; mismatches surface for review.
+- **A "succeeded" Seedance task can still produce wrong-script audio** — don't treat `status: "succeeded"` as content QA. Audio-vs-script verification is a separate gate.
+
+---
+
 ## Polling vs callback
 
 For interactive UX (a user is waiting), polling every 30s is the documented pattern. For batch / pipeline / serverless use: