feat(audio): silence-aware chunk boundary splitting#51

Merged
JFK merged 3 commits into main from feat/silence-aware-chunking on Apr 7, 2026

Conversation

JFK (Owner) commented Apr 7, 2026

Summary

  • Snap chunk boundaries to natural silences instead of cutting at fixed intervals, preventing word truncation at chunk seams
  • Run ffmpeg silencedetect once over the full file via find_all_silences(), then compute boundaries in pure Python — no per-target subprocess calls
  • Each target boundary at N * chunk_duration_sec is snapped to the nearest silence midpoint within ±15s, falling back to the fixed-length boundary when no silence is in range
  • Reject anchors that would create chunks shorter than _MIN_CHUNK_SEC on either side; short-circuit to the original path when boundary computation collapses to [0.0, total_duration]
  • Use -ss before -i and -c copy for chunk cuts (~5-10x faster, no re-encoding since chunks come from already-decoded WAV/MP3)
  • 11 unit tests covering the parser, boundary computation, and split_audio integration with mocked ffmpeg
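
The silencedetect parser mentioned above (`_parse_silence_ranges`) is referenced but not shown in the diff excerpts below. As a rough sketch of what such a parser does, assuming ffmpeg silencedetect's standard stderr line format (the function name and regexes here are illustrative, not the PR's actual code):

```python
import re

# ffmpeg's silencedetect filter logs lines on stderr like:
#   [silencedetect @ 0x...] silence_start: 12.34
#   [silencedetect @ 0x...] silence_end: 14.56 | silence_duration: 2.22
_START_RE = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
_END_RE = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")

def parse_silence_ranges(stderr_text: str) -> list[tuple[float, float]]:
    """Pair silence_start/silence_end timestamps into (start, end) ranges."""
    ranges: list[tuple[float, float]] = []
    start: float | None = None
    for line in stderr_text.splitlines():
        if (m := _START_RE.search(line)) is not None:
            start = float(m.group(1))
        elif (m := _END_RE.search(line)) is not None and start is not None:
            ranges.append((start, float(m.group(1))))
            start = None
    return ranges
```

Unmatched `silence_start` lines (a silence running to end-of-file) are simply dropped, which is safe here since the final boundary is always `total_duration`.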

Why

Current split_audio (src/services/audio.py) cuts audio at fixed whisper_chunk_duration_sec intervals with no silence detection. Words spanning a boundary can be truncated in both adjacent chunks. This becomes more visible once chunks stream to the editor (#50), so silence anchoring lands first.

Test plan

  • pytest tests/test_audio_split.py — 11 passing (parser, boundary computation, integration, tail-edge case)
  • pytest --ignore=tests/e2e — no regressions
  • ruff check / ruff format --check clean
  • Copilot review feedback addressed (3 comments)
  • Manual verification with a 30-min Japanese audio sample

Commits

  • b756124 initial implementation
  • 0026fce /simplify refactor: single-pass silence detection + faster chunk cuts
  • 7113ccc Copilot review: short-circuit collapsed boundaries, suppress ffmpeg progress noise, add tail-edge regression test

Closes #49

🤖 Generated with Claude Code

JFK and others added 2 commits April 7, 2026 21:07
Cut audio chunks at natural pauses instead of fixed intervals to avoid
truncating words at chunk seams. Each target boundary at N * chunk_duration
is snapped to the midpoint of the closest silence detected by ffmpeg
silencedetect within a ±15s window. Falls back to fixed-length cut when
no silence is found, so the pipeline never blocks.

Closes #49

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per /simplify review:
- Replace per-target find_silence_near with one full-file find_all_silences
  pass. Boundary computation becomes pure Python over a list of silences,
  eliminating subprocess plumbing duplication and N sequential ffmpeg
  invocations.
- Add -c copy and -ss before -i to chunk cut ffmpeg args. Fast seek + no
  re-encode is ~5-10x faster per cut on long files.
- Drop find_silence_near and its unused parameters.
- Tighten test assertions and remove narration comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Implements silence-aware audio chunk boundary selection so chunk seams land on natural pauses (with fixed-interval fallback), reducing mid-word truncation during downstream transcription.

Changes:

  • Added ffmpeg silencedetect parsing + silence midpoint anchoring logic for chunk boundaries.
  • Updated split_audio() to cut using computed boundaries (with a minimum-chunk-length safeguard).
  • Added unit tests for silence parsing, boundary computation, and split_audio() behavior with mocked ffmpeg.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/services/audio.py Adds silence detection/parsing and uses silence-anchored boundaries when splitting audio into chunks.
tests/test_audio_split.py Introduces unit tests for silence parsing, boundary computation, and split_audio() integration.


Comment on lines 183 to +193
    duration = await get_audio_duration(audio_path)
    if duration <= chunk_duration_sec:
        return [audio_path]

    silences = await find_all_silences(audio_path)
    boundaries = _compute_chunk_boundaries(duration, float(chunk_duration_sec), silences)

    chunks = []
-   start = 0.0
-   idx = 0
-   while start < duration:
+   for idx in range(len(boundaries) - 1):
+       chunk_start = boundaries[idx]
+       chunk_len = boundaries[idx + 1] - chunk_start

Copilot AI Apr 7, 2026


split_audio() can end up producing exactly one chunk file (e.g., when duration is only slightly above chunk_duration_sec and _compute_chunk_boundaries() returns [0.0, total_duration] to avoid a <1s tail). That breaks the implicit contract used by src/services/transcribe.py (if len(chunks) == 1: ... transcribe(audio_path)), resulting in an unnecessary ffmpeg run and a leftover _chunk000 file that is never used. Consider short-circuiting after computing boundaries (e.g., if len(boundaries) == 2, return [audio_path]) and add a unit test for duration in (chunk_duration_sec, chunk_duration_sec + _MIN_CHUNK_SEC).

Comment on lines +121 to +132
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg",
        "-i",
        str(audio_path),
        "-af",
        f"silencedetect=noise={_SILENCE_NOISE_DB}dB:d={_SILENCE_MIN_DUR}",
        "-f",
        "null",
        "-",
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )

Copilot AI Apr 7, 2026


find_all_silences() runs ffmpeg without -nostats / -hide_banner, so stderr will include frequent progress updates for the entire file; because communicate() buffers all stderr in memory, long inputs can create unnecessarily large buffers. Consider adding -nostats/-hide_banner (while keeping a loglevel that still emits silencedetect lines), or streaming stderr line-by-line and extracting only the silence_start/silence_end lines to avoid unbounded growth.

Comment on lines +115 to +176
async def find_all_silences(audio_path: Path) -> list[tuple[float, float]]:
    """Run ffmpeg silencedetect once over the entire file.

    Returns list of (start, end) silence ranges, or empty list if detection
    fails — callers fall back to fixed-length cuts.
    """
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg",
        "-i",
        str(audio_path),
        "-af",
        f"silencedetect=noise={_SILENCE_NOISE_DB}dB:d={_SILENCE_MIN_DUR}",
        "-f",
        "null",
        "-",
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        logger.debug("silencedetect failed (returncode=%d) — falling back", proc.returncode)
        return []
    return _parse_silence_ranges(stderr.decode(errors="replace"))


def _nearest_silence_midpoint(
    silences: list[tuple[float, float]],
    target: float,
    window: float,
) -> float | None:
    """Return midpoint of the silence closest to `target`, or None if none within ±window."""
    candidates = [(s + e) / 2 for s, e in silences if abs((s + e) / 2 - target) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(m - target))


def _compute_chunk_boundaries(
    total_duration: float,
    chunk_duration_sec: float,
    silences: list[tuple[float, float]],
) -> list[float]:
    """Compute chunk boundary timestamps using silence anchoring.

    Each target boundary at N * chunk_duration_sec is snapped to the nearest
    silence midpoint within ±_SILENCE_SEARCH_WINDOW_SEC, falling back to the
    fixed-length boundary when no silence is in range. Anchors that would
    produce a chunk shorter than _MIN_CHUNK_SEC on either side are rejected.
    """
    boundaries: list[float] = [0.0]
    target = chunk_duration_sec
    while target < total_duration:
        anchor = _nearest_silence_midpoint(silences, target, _SILENCE_SEARCH_WINDOW_SEC)
        if anchor is None or anchor <= boundaries[-1] + _MIN_CHUNK_SEC:
            anchor = target
        if anchor >= total_duration - _MIN_CHUNK_SEC:
            break
        boundaries.append(anchor)
        target = anchor + chunk_duration_sec
    boundaries.append(total_duration)
    return boundaries
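
For intuition, the boundary logic quoted above can be exercised standalone. This sketch inlines the two pure helpers with assumed constant values (the real `_SILENCE_SEARCH_WINDOW_SEC` and `_MIN_CHUNK_SEC` live in src/services/audio.py and may differ):

```python
# Assumed values for illustration only; see src/services/audio.py for the real ones.
_SILENCE_SEARCH_WINDOW_SEC = 15.0
_MIN_CHUNK_SEC = 1.0

def nearest_silence_midpoint(silences, target, window):
    # Midpoint of the silence closest to `target`, or None if none within ±window.
    candidates = [(s + e) / 2 for s, e in silences if abs((s + e) / 2 - target) <= window]
    return min(candidates, key=lambda m: abs(m - target)) if candidates else None

def compute_chunk_boundaries(total_duration, chunk_duration_sec, silences):
    boundaries = [0.0]
    target = chunk_duration_sec
    while target < total_duration:
        anchor = nearest_silence_midpoint(silences, target, _SILENCE_SEARCH_WINDOW_SEC)
        if anchor is None or anchor <= boundaries[-1] + _MIN_CHUNK_SEC:
            anchor = target  # no usable silence: fall back to fixed-length cut
        if anchor >= total_duration - _MIN_CHUNK_SEC:
            break  # avoid a sub-minimum tail chunk
        boundaries.append(anchor)
        target = anchor + chunk_duration_sec
    boundaries.append(total_duration)
    return boundaries

# 600s targets snap to nearby silence midpoints (596.0 and 1191.5 here);
# with no silences at all, every boundary falls back to the fixed interval.
print(compute_chunk_boundaries(1250.0, 600.0, [(595.0, 597.0), (1190.0, 1193.0)]))
# → [0.0, 596.0, 1191.5, 1250.0]
```

Note how the next target is measured from the chosen anchor rather than from `N * chunk_duration_sec`, so drift from snapping does not accumulate into over-long chunks.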


Copilot AI Apr 7, 2026


PR description mentions a find_silence_near() helper that searches a ±15s window per boundary, but the implementation introduces find_all_silences() (single full-file pass) and _compute_chunk_boundaries() always uses the global _SILENCE_SEARCH_WINDOW_SEC. Either update the PR description to match the chosen approach, or adjust the implementation to match the described per-boundary window helper/signature.

- Short-circuit split_audio when boundary computation collapses to
  [0.0, total_duration]. Previously this produced a single redundant
  chunk file via ffmpeg even though the caller's len(chunks)==1
  optimization would ignore it.
- Add -nostats -hide_banner to find_all_silences so the silencedetect
  ffmpeg run doesn't buffer megabytes of progress text in stderr for
  long inputs.
- Add a regression test for the borderline (chunk_duration_sec <
  duration < chunk_duration_sec + _MIN_CHUNK_SEC) case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

JFK commented Apr 7, 2026

Thanks @copilot — all 3 points addressed in 7113ccc:

  1. Single-chunk leakage (audio.py:193): Added a len(boundaries) == 2 short-circuit that returns the original path before any ffmpeg invocation. New regression test test_split_audio_short_circuits_when_only_tail_remains covers the borderline case (chunk_duration_sec < duration < chunk_duration_sec + _MIN_CHUNK_SEC).

  2. Unbounded stderr buffer (audio.py:132): Added -nostats -hide_banner to the find_all_silences ffmpeg invocation. silencedetect lines still come through; per-frame progress text no longer accumulates.

  3. PR description drift (audio.py:176): Updated the PR description above to match the actual implementation (find_all_silences single-pass, not the original find_silence_near per-target design — that was the first commit, refactored away in 0026fce per /simplify review).
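
The two ffmpeg-facing fixes boil down to argument ordering and flag choice, which can be sketched as plain argument builders (the helper names, default thresholds, and exact flag set here are illustrative, not the actual functions in audio.py):

```python
from pathlib import Path

def silencedetect_args(audio_path: Path, noise_db: int = -30, min_dur: float = 0.5) -> list[str]:
    # -nostats / -hide_banner suppress per-frame progress output so stderr
    # carries only the silencedetect lines, keeping communicate()'s buffer small.
    return [
        "ffmpeg", "-nostats", "-hide_banner",
        "-i", str(audio_path),
        "-af", f"silencedetect=noise={noise_db}dB:d={min_dur}",
        "-f", "null", "-",
    ]

def chunk_cut_args(audio_path: Path, start: float, length: float, out: Path) -> list[str]:
    # -ss placed before -i enables fast input-side seeking; -c copy skips
    # re-encoding, which is safe because chunks are cut from already-decoded audio.
    return [
        "ffmpeg", "-nostats", "-hide_banner",
        "-ss", f"{start:.3f}",
        "-i", str(audio_path),
        "-t", f"{length:.3f}",
        "-c", "copy",
        str(out),
    ]

args = chunk_cut_args(Path("in.wav"), 596.0, 595.5, Path("in_chunk001.wav"))
assert args.index("-ss") < args.index("-i")  # fast seek applies to the input
```

Placing `-ss` after `-i` would instead decode and discard everything up to the start point, which is what makes the before-`-i` form substantially faster on long files.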

Copilot finished work on behalf of JFK April 7, 2026 13:07
@JFK JFK merged commit 58d4ae0 into main Apr 7, 2026
3 checks passed
@JFK JFK deleted the feat/silence-aware-chunking branch April 7, 2026 13:08