feat(audio): silence-aware chunk boundary splitting #51
Conversation
Cut audio chunks at natural pauses instead of fixed intervals to avoid truncating words at chunk seams. Each target boundary at N * chunk_duration is snapped to the midpoint of the closest silence detected by ffmpeg silencedetect within a ±15s window. Falls back to fixed-length cut when no silence is found, so the pipeline never blocks. Closes #49 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per /simplify review:

- Replace per-target `find_silence_near` with one full-file `find_all_silences` pass. Boundary computation becomes pure Python over a list of silences, eliminating subprocess plumbing duplication and N sequential ffmpeg invocations.
- Add `-c copy` and `-ss` before `-i` to chunk cut ffmpeg args. Fast seek + no re-encode is ~5-10x faster per cut on long files.
- Drop `find_silence_near` and its unused parameters.
- Tighten test assertions and remove narration comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Implements silence-aware audio chunk boundary selection so chunk seams land on natural pauses (with fixed-interval fallback), reducing mid-word truncation during downstream transcription.
Changes:
- Added ffmpeg `silencedetect` parsing + silence midpoint anchoring logic for chunk boundaries.
- Updated `split_audio()` to cut using computed boundaries (with a minimum-chunk-length safeguard).
- Added unit tests for silence parsing, boundary computation, and `split_audio()` behavior with mocked ffmpeg.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/services/audio.py` | Adds silence detection/parsing and uses silence-anchored boundaries when splitting audio into chunks. |
| `tests/test_audio_split.py` | Introduces unit tests for silence parsing, boundary computation, and `split_audio()` integration. |
```diff
 duration = await get_audio_duration(audio_path)
 if duration <= chunk_duration_sec:
     return [audio_path]

+silences = await find_all_silences(audio_path)
+boundaries = _compute_chunk_boundaries(duration, float(chunk_duration_sec), silences)

 chunks = []
-start = 0.0
-idx = 0
-while start < duration:
+for idx in range(len(boundaries) - 1):
+    chunk_start = boundaries[idx]
+    chunk_len = boundaries[idx + 1] - chunk_start
```
`split_audio()` can end up producing exactly one chunk file (e.g., when duration is only slightly above `chunk_duration_sec` and `_compute_chunk_boundaries()` returns `[0.0, total_duration]` to avoid a <1s tail). That breaks the implicit contract used by `src/services/transcribe.py` (`if len(chunks) == 1: ... transcribe(audio_path)`), resulting in an unnecessary ffmpeg run and a leftover `_chunk000` file that is never used. Consider short-circuiting after computing boundaries (e.g., if `len(boundaries) == 2`, return `[audio_path]`) and add a unit test for duration in `(chunk_duration_sec, chunk_duration_sec + _MIN_CHUNK_SEC)`.
```python
proc = await asyncio.create_subprocess_exec(
    "ffmpeg",
    "-i",
    str(audio_path),
    "-af",
    f"silencedetect=noise={_SILENCE_NOISE_DB}dB:d={_SILENCE_MIN_DUR}",
    "-f",
    "null",
    "-",
    stdout=asyncio.subprocess.DEVNULL,
    stderr=asyncio.subprocess.PIPE,
)
```
`find_all_silences()` runs ffmpeg without `-nostats` / `-hide_banner`, so stderr will include frequent progress updates for the entire file; because `communicate()` buffers all stderr in memory, long inputs can create unnecessarily large buffers. Consider adding `-nostats`/`-hide_banner` (while keeping a loglevel that still emits silencedetect lines), or streaming stderr line-by-line and extracting only the `silence_start`/`silence_end` lines to avoid unbounded growth.
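One way to apply the first suggestion is to keep the argument list as-is and add the two flags before `-i` (silencedetect appears to log at the default `info` level, so no `-loglevel` change should be needed). A sketch; `noise_db` / `min_dur` are placeholder defaults standing in for the module's `_SILENCE_NOISE_DB` / `_SILENCE_MIN_DUR`:

```python
# Hedged sketch: the same silencedetect invocation with progress output
# suppressed. The default values here are placeholders, not the module's
# actual constants.
def silencedetect_args(audio_path: str, noise_db: int = -30, min_dur: float = 0.5) -> list[str]:
    return [
        "ffmpeg",
        "-hide_banner",  # drop the build/configuration banner
        "-nostats",      # drop per-frame progress lines from stderr
        "-i", audio_path,
        "-af", f"silencedetect=noise={noise_db}dB:d={min_dur}",
        "-f", "null", "-",
    ]
```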
```python
async def find_all_silences(audio_path: Path) -> list[tuple[float, float]]:
    """Run ffmpeg silencedetect once over the entire file.

    Returns list of (start, end) silence ranges, or empty list if detection
    fails — callers fall back to fixed-length cuts.
    """
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg",
        "-i",
        str(audio_path),
        "-af",
        f"silencedetect=noise={_SILENCE_NOISE_DB}dB:d={_SILENCE_MIN_DUR}",
        "-f",
        "null",
        "-",
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        logger.debug("silencedetect failed (returncode=%d) — falling back", proc.returncode)
        return []
    return _parse_silence_ranges(stderr.decode(errors="replace"))
```
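`_parse_silence_ranges` itself is not shown in this diff; a plausible sketch based on ffmpeg's silencedetect log format (lines like `silence_start: 12.345` and `silence_end: 15.6 | silence_duration: 3.255`):

```python
import re

# Hypothetical sketch of a parser like _parse_silence_ranges; the real
# implementation is not visible in this diff.
_START_RE = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
_END_RE = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")

def parse_silence_ranges(stderr_text: str) -> list[tuple[float, float]]:
    ranges: list[tuple[float, float]] = []
    start: float | None = None
    for line in stderr_text.splitlines():
        if (m := _START_RE.search(line)):
            start = float(m.group(1))
        elif (m := _END_RE.search(line)) and start is not None:
            ranges.append((start, float(m.group(1))))
            start = None  # an unmatched trailing silence_start is dropped
    return ranges
```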
```python
def _nearest_silence_midpoint(
    silences: list[tuple[float, float]],
    target: float,
    window: float,
) -> float | None:
    """Return midpoint of the silence closest to `target`, or None if none within ±window."""
    candidates = [(s + e) / 2 for s, e in silences if abs((s + e) / 2 - target) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(m - target))
```
```python
def _compute_chunk_boundaries(
    total_duration: float,
    chunk_duration_sec: float,
    silences: list[tuple[float, float]],
) -> list[float]:
    """Compute chunk boundary timestamps using silence anchoring.

    Each target boundary at N * chunk_duration_sec is snapped to the nearest
    silence midpoint within ±_SILENCE_SEARCH_WINDOW_SEC, falling back to the
    fixed-length boundary when no silence is in range. Anchors that would
    produce a chunk shorter than _MIN_CHUNK_SEC on either side are rejected.
    """
    boundaries: list[float] = [0.0]
    target = chunk_duration_sec
    while target < total_duration:
        anchor = _nearest_silence_midpoint(silences, target, _SILENCE_SEARCH_WINDOW_SEC)
        if anchor is None or anchor <= boundaries[-1] + _MIN_CHUNK_SEC:
            anchor = target
        if anchor >= total_duration - _MIN_CHUNK_SEC:
            break
        boundaries.append(anchor)
        target = anchor + chunk_duration_sec
    boundaries.append(total_duration)
    return boundaries
```
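The boundary logic can be walked through standalone. The two helpers are restated here without the module constants so the example runs on its own; the window (15.0s) matches the PR description, while the minimum chunk length (1.0s) is an assumption (the constant's real value is not visible in this diff):

```python
# Executable walk-through of the boundary logic above. WINDOW matches the
# ±15s in the PR description; MIN_CHUNK = 1.0 is an assumed value for
# _MIN_CHUNK_SEC, which this diff does not show.
WINDOW, MIN_CHUNK = 15.0, 1.0

def nearest_silence_midpoint(silences, target, window=WINDOW):
    mids = [(s + e) / 2 for s, e in silences if abs((s + e) / 2 - target) <= window]
    return min(mids, key=lambda m: abs(m - target)) if mids else None

def compute_chunk_boundaries(total, chunk, silences):
    boundaries = [0.0]
    target = chunk
    while target < total:
        anchor = nearest_silence_midpoint(silences, target)
        if anchor is None or anchor <= boundaries[-1] + MIN_CHUNK:
            anchor = target  # fixed-length fallback
        if anchor >= total - MIN_CHUNK:
            break  # don't leave a sub-minimum tail chunk
        boundaries.append(anchor)
        target = anchor + chunk
    boundaries.append(total)
    return boundaries

# 25-minute file, 10-minute target chunks, a silence near each target:
silences = [(595.0, 597.0), (1190.0, 1194.0)]  # midpoints 596.0, 1192.0
print(compute_chunk_boundaries(1500.0, 600.0, silences))
# → [0.0, 596.0, 1192.0, 1500.0]: target 600 snaps to 596, the next target
#   (596 + 600 = 1196) snaps to 1192, and the file end closes the list.
```

With no silences detected the same call degrades to fixed intervals, `[0.0, 600.0, 1200.0, 1500.0]`, which is the documented fallback.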
PR description mentions a `find_silence_near()` helper that searches a ±15s window per boundary, but the implementation introduces `find_all_silences()` (single full-file pass) and `_compute_chunk_boundaries()` always uses the global `_SILENCE_SEARCH_WINDOW_SEC`. Either update the PR description to match the chosen approach, or adjust the implementation to match the described per-boundary window helper/signature.
- Short-circuit `split_audio` when boundary computation collapses to `[0.0, total_duration]`. Previously this produced a single redundant chunk file via ffmpeg even though the caller's `len(chunks) == 1` optimization would ignore it.
- Add `-nostats -hide_banner` to `find_all_silences` so the silencedetect ffmpeg run doesn't buffer megabytes of progress text in stderr for long inputs.
- Add a regression test for the borderline (`chunk_duration_sec < duration < chunk_duration_sec + _MIN_CHUNK_SEC`) case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks @copilot — all 3 points addressed in `7113ccc`:
Summary

- Run ffmpeg `silencedetect` once over the full file via `find_all_silences()`, then compute boundaries in pure Python — no per-target subprocess calls
- Each target boundary at `N * chunk_duration_sec` is snapped to the nearest silence midpoint within ±15s, falling back to the fixed-length boundary when no silence is in range
- Anchors that would produce a chunk shorter than `_MIN_CHUNK_SEC` on either side are rejected; short-circuit to the original path when boundary computation collapses to `[0.0, total_duration]`
- `-ss` before `-i` and `-c copy` for chunk cuts (~5-10x faster, no re-encoding since chunks come from already-decoded WAV/MP3)
- Unit tests for silence parsing, boundary computation, and `split_audio` integration with mocked ffmpeg
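The fast-seek chunk cut mentioned above (`-ss` before `-i`, plus `-c copy`) can be sketched as an argument builder. The function and output naming are illustrative, not from the diff; the flag ordering is the point — an input-side `-ss` seeks before decoding, and `-c copy` stream-copies instead of re-encoding:

```python
# Illustrative sketch of the chunk-cut argument order described above.
# chunk_cut_args and its parameters are hypothetical names for this example.
def chunk_cut_args(audio_path: str, start: float, length: float, out_path: str) -> list[str]:
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",  # input-side seek: placed before -i for fast seeking
        "-i", audio_path,
        "-t", f"{length:.3f}",  # chunk length
        "-c", "copy",           # stream copy, no re-encode
        out_path,
    ]
```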
Why

Current `split_audio` (`src/services/audio.py`) cuts audio at fixed `whisper_chunk_duration_sec` intervals with no silence detection. Words spanning a boundary can be truncated in both adjacent chunks. This becomes more visible once chunks stream to the editor (#50), so silence anchoring lands first.

Test plan

- `pytest tests/test_audio_split.py` — 11 passing (parser, boundary computation, integration, tail-edge case)
- `pytest --ignore=tests/e2e` — no regressions
- `ruff check` / `ruff format --check` clean

Commits
- `b756124` initial implementation
- `0026fce` /simplify refactor: single-pass silence detection + faster chunk cuts
- `7113ccc` Copilot review: short-circuit collapsed boundaries, suppress ffmpeg progress noise, add tail-edge regression test

Closes #49
🤖 Generated with Claude Code