Audio File Deduplication

Identifies and moves duplicate audio files across a reference folder (your protected originals) and a source folder (the collection to clean).

How It Works — 3-Stage Tiered Pipeline

Deduplication runs in three stages to minimise disk I/O. Each stage eliminates non-matches using the least work possible, so expensive full hashing is only performed on genuine candidates.

Stage	Method	Disk cost
1	Size filter — files with unique sizes cannot be identical	Zero (metadata only)
2	Partial hash — hash only the first 4 KB of each size-matched file	Minimal I/O
3	Full SHA-256 hash — fully hash all Stage-2 survivors	Only for true candidates

Key design decision: both folders are pooled together in Stage 1. Processing each folder independently (as naive implementations do) would miss cross-folder duplicates whenever a reference file is the sole occupant of its size bucket — it would never be hashed and source copies would go undetected.

A source file is flagged as a duplicate when:

Its full hash matches any reference file → moved to output folder
Its full hash matches another source file with no reference match → first occurrence kept, rest moved

Requirements

Python 3.8+
tqdm

pip install -r requirements.txt

Usage

# Basic
python audio_deduplication.py \
  --reference /path/to/originals \
  --source    /path/to/scan \
  --output    /path/to/duplicates

# Dry run — log what would happen, move nothing
python audio_deduplication.py \
  --reference /path/to/originals \
  --source    /path/to/scan \
  --output    /path/to/duplicates \
  --dry-run

# All options
python audio_deduplication.py --help

CLI Options

Option	Default	Description
`--reference`	(required)	Protected reference folder (originals)
`--source`	(required)	Source folder to deduplicate
`--output`	(required)	Folder to move duplicates into
`--log-file`	`audio_deduplication.log`	Log file path
`--log-level`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR`
`--threads`	`8`	Parallel worker count
`--hash-algorithm`	`sha256`	Hash algorithm (`sha256`, `md5`, …)
`--partial-hash-size`	`4096`	Bytes to read in Stage 2 partial hash
`--max-hash-bytes`	(full file)	Cap Stage 3 hash at N bytes; `0` = full file
`--dry-run`	`False`	Simulate; do not move any files
`--retries`	`3`	Move retry attempts on failure
`--retry-delay`	`1.0`	Initial retry delay in seconds (exponential backoff)

Audio Extensions

.mp3 .flac .wav .aac .ogg .aif .aiff

Edit AUDIO_EXTENSIONS at the top of the script to customise.

Running Tests

pip install pytest
pytest test_deduplication.py -v

Test coverage includes:

compute_hash — correct digest, partial-read boundary conditions, file smaller than cap, missing file
find_audio_files — extension matching, recursion, case-insensitivity, directory exclusion
find_duplicates — regression guard for the critical cross-folder bug, internal duplicates, multiple copies, no false positives, empty inputs, same-size-different-content
move_file — dry run, basic move, single and multiple collision resolution, nested directory creation, content preservation

Caution

This script moves files from the source folder. Always verify your intended actions with --dry-run before operating on real data, and keep backups.

Disclaimer

This software is provided "as is", without warranty of any kind, express or implied. The author assumes no liability for any data loss resulting from use of this script. Always run with --dry-run first and maintain backups of your files before performing any deduplication operations. See LICENSE for full terms.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
audio_deduplication.py		audio_deduplication.py
requirements.txt		requirements.txt
test_deduplication.py		test_deduplication.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Audio File Deduplication

How It Works — 3-Stage Tiered Pipeline

Requirements

Usage

CLI Options

Audio Extensions

Running Tests

Caution

Disclaimer

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Audio File Deduplication

How It Works — 3-Stage Tiered Pipeline

Requirements

Usage

CLI Options

Audio Extensions

Running Tests

Caution

Disclaimer

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages