aimsise/seedbraid

Seedbraid

Seedbraid is a reference-based reconstruction tool for large, similar binary artifacts.

It combines deterministic content-defined chunking (CDC), a compact binary SBD1 seed format, reusable genome storage, and optional IPFS transport so you can ship reconstruction intent instead of repeatedly shipping full blobs.

Why Seedbraid

Seedbraid is designed for workflows where ordinary file distribution becomes wasteful:

  • large binary artifacts change often, but stay mostly similar
  • fixed-size chunking loses reuse under shifted offsets
  • you want compact transport plus bit-perfect restore guarantees
  • you want one CLI surface for encode, verify, decode, publish, and fetch

In short: Seedbraid helps you move less data, reuse more content, and still verify exact reconstruction.
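
The shifted-offset problem can be illustrated with a small sketch. The chunker below is a toy written for this README, not Seedbraid's cdc_buzhash or cdc_rabin implementation; the function names, window size, and mask are assumptions chosen only to make the effect visible.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut at fixed offsets; boundaries depend only on position."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
               min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Toy content-defined chunker (hypothetical, not buzhash or Rabin).

    A boundary is placed wherever a hash of the last `window` bytes
    matches a bit mask, so boundaries follow content rather than offsets.
    """
    assert min_size >= window, "keeps the hash window inside the data"
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        tail = data[i + 1 - window:i + 1]
        digest = hashlib.blake2b(tail, digest_size=4).digest()
        fingerprint = int.from_bytes(digest, "big")
        if fingerprint & mask == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reuse_ratio(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of the new version's bytes found in already-known chunks."""
    known = {hashlib.sha256(c).digest() for c in old}
    reused = sum(len(c) for c in new if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new)


original = random.Random(0).randbytes(20_000)
shifted = b"\x00" + original  # a single inserted byte shifts every offset

fixed_reuse = reuse_ratio(fixed_chunks(original), fixed_chunks(shifted))
cdc_reuse = reuse_ratio(cdc_chunks(original), cdc_chunks(shifted))
print(f"fixed reuse: {fixed_reuse:.2%}  cdc reuse: {cdc_reuse:.2%}")
```

With fixed offsets, every chunk after the insertion changes, so nothing is reused. The content-defined chunker should realign at the next content boundary and reuse most of the remaining chunks.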

When Seedbraid Is a Good Fit

Seedbraid works especially well for:

  • large binary versioning: datasets, ML models, media assets, VM images
  • distribution of many similar files across releases
  • shift-heavy changes such as insertions that break fixed chunk reuse
  • IPFS-based distribution and retrieval with integrity validation
  • environments where transfer size, dedup reuse, and reproducibility matter

Core Capabilities

  • Lossless encode/decode with SHA-256 verification
  • Deterministic chunking with three chunkers: fixed, cdc_buzhash, and cdc_rabin
  • Genome storage backed by SQLite for deduplicated chunk reuse
  • SBD1 binary seed container with manifest, recipe, optional RAW, and integrity data
  • IPFS publish/fetch transport
  • Optional remote pin integration
  • Strict verification mode for production-grade restore checks
  • Optional signing and encryption support

Installation

pip

pip install seedbraid

pipx

pipx install seedbraid
seedbraid --help

uvx

uvx seedbraid --help
uvx seedbraid doctor

Optional extras

# pip
pip install "seedbraid[zstd]"
pip install "seedbraid[crypto]"    # encryption / signing support

# pipx
pipx install "seedbraid[zstd]"
pipx install "seedbraid[crypto]"

# uvx
uvx --from "seedbraid[zstd]" seedbraid doctor
uvx --from "seedbraid[crypto]" seedbraid doctor

Quick Start

1. Encode a file into a seed

seedbraid encode input.bin --genome ./genome --out seed.sbd --portable

2. Verify the seed

seedbraid verify seed.sbd --genome ./genome --strict

3. Decode the file back

seedbraid decode seed.sbd --genome ./genome --out recovered.bin

4. Compare the result

cmp -s input.bin recovered.bin && echo "bit-perfect roundtrip: OK"

Note: If you run Seedbraid through uvx instead of installing it, prefix each command with uvx (e.g. uvx seedbraid encode ...). For development builds, use uv run --no-editable seedbraid instead.

Typical Workflow

A common Seedbraid workflow looks like this:

  1. Prime or learn reusable chunks into a genome
  2. Encode a target artifact into a compact SBD1 seed
  3. Verify integrity before distribution
  4. Publish the seed if needed, including via IPFS
  5. Fetch and decode later using the genome
  6. Run strict verification when exact restore is required

Stability

Seedbraid v2.0.0 is production-ready.

Before deploying to your environment, validate behavior in your own runtime, storage, and network configuration.

Treat successful verify --strict and bit-perfect restore checks as release gates.

Production Validation Checklist

Before using Seedbraid in CI/CD or production pipelines, run a strict smoke workflow like this:

uv sync --no-editable --extra dev

workdir="$(mktemp -d)"
python3 - <<'PY' "$workdir/input.bin"
from pathlib import Path
import sys

out = Path(sys.argv[1])
payload = (b"seedbraid-beta-smoke" * 20000) + bytes(range(256)) * 200
out.write_bytes(payload)
print(f"wrote {out} bytes={len(payload)}")
PY

uv run --no-sync --no-editable seedbraid encode "$workdir/input.bin" \
  --genome "$workdir/genome" \
  --out "$workdir/seed.sbd" \
  --chunker cdc_buzhash \
  --avg 65536 --min 16384 --max 262144 \
  --learn --portable --compression zlib

uv run --no-sync --no-editable seedbraid verify "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --strict

uv run --no-sync --no-editable seedbraid decode "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --out "$workdir/decoded.bin"

cmp -s "$workdir/input.bin" "$workdir/decoded.bin" \
  && echo "bit-perfect roundtrip: OK"

CLI Reference

All examples below use bare seedbraid. If you run Seedbraid through uvx, prefix each command with uvx. For development builds, use uv run --no-editable seedbraid.

Core Commands

Encode

seedbraid encode input.bin --genome ./genome --out seed.sbd

seedbraid encode input.bin --genome ./genome --out seed.sbd \
  --chunker cdc_buzhash --avg 65536 --min 16384 --max 262144 \
  --learn --no-portable --compression zlib

seedbraid encode input.bin --genome ./genome --out seed.private.sbd \
  --manifest-private

export SB_ENCRYPTION_KEY='your-secret-passphrase'
seedbraid encode input.bin --genome ./genome --out seed.encrypted.sbd \
  --encrypt --manifest-private

Decode

seedbraid decode seed.sbd --genome ./genome --out recovered.bin

seedbraid decode seed.encrypted.sbd --genome ./genome --out recovered.bin \
  --encryption-key "$SB_ENCRYPTION_KEY"

Verify

seedbraid verify seed.sbd --genome ./genome
seedbraid verify seed.sbd --genome ./genome --strict
seedbraid verify seed.sbd --genome ./genome --require-signature --signature-key "$SB_SIGNING_KEY"
seedbraid verify seed.encrypted.sbd --genome ./genome --strict \
  --encryption-key "$SB_ENCRYPTION_KEY"

verify supports two modes:

  • Quick mode: checks seed integrity and required chunk availability
  • Strict mode: reconstructs all content and enforces source size and SHA-256 match
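
As a rough illustration of what the strict check enforces, the sketch below streams reconstructed chunks through SHA-256 and compares both size and digest against expected values. The function name and arguments are hypothetical; this is not Seedbraid's internal API.

```python
import hashlib
from typing import Iterable


def strict_check(chunks: Iterable[bytes], expected_size: int,
                 expected_sha256: str) -> None:
    """Raise ValueError unless the reconstructed stream matches exactly."""
    hasher = hashlib.sha256()
    size = 0
    for chunk in chunks:  # stream chunk by chunk; no need to hold the file in memory
        hasher.update(chunk)
        size += len(chunk)
    if size != expected_size:
        raise ValueError(f"size mismatch: got {size}, expected {expected_size}")
    if hasher.hexdigest() != expected_sha256:
        raise ValueError("SHA-256 mismatch: reconstruction is not bit-perfect")


source = b"seedbraid" * 1000
parts = [source[i:i + 512] for i in range(0, len(source), 512)]
strict_check(parts, len(source), hashlib.sha256(source).hexdigest())
print("strict check: OK")
```

A quick-mode check, by contrast, would only confirm that every referenced chunk is available, without rebuilding and hashing the whole stream.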

Prime

seedbraid prime "./dataset/**/*" --genome ./genome --chunker cdc_buzhash

Doctor

seedbraid doctor --genome ./genome

doctor checks:

  • Python runtime compatibility (>=3.12)
  • kubo API reachability (SB_KUBO_API)
  • IPFS_PATH state
  • genome path writability
  • compression support (zlib, optional zstd)

Advanced Commands

Genome Snapshot / Restore

seedbraid genome snapshot --genome ./genome --out genome.sgs
seedbraid genome restore genome.sgs --genome ./genome-dr --replace

Publish Chunks to IPFS

seedbraid publish-chunks seed.sbd --genome ./genome
seedbraid publish-chunks seed.sbd --genome ./genome \
  --manifest-out chunks.json --workers 32
seedbraid publish-chunks seed.sbd --genome ./genome \
  --pin --remote-pin \
  --remote-endpoint https://pin.example/api/v1 \
  --remote-token "$SB_PINNING_TOKEN"

publish-chunks publishes all CDC chunks referenced by a seed to IPFS as raw blocks, generates a chunk manifest sidecar (.sbd.chunks.json), and optionally pins the chunk DAG locally or via a remote pinning provider.

Fetch and Decode from IPFS

seedbraid fetch-decode seed.sbd --out recovered.bin
seedbraid fetch-decode seed.sbd --out recovered.bin \
  --workers 64 --batch-size 200 --retries 5
seedbraid fetch-decode seed.sbd --out recovered.bin \
  --gateway https://ipfs.io/ipfs

fetch-decode reads a seed and its chunk manifest, fetches all chunks from IPFS in parallel batches, and reconstructs the original file. Requires the chunk manifest sidecar (.sbd.chunks.json) alongside the seed.
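
The parallel-batch pattern described above can be sketched as follows. This is a minimal illustration under assumptions: fetch_one stands in for a real IPFS block fetch, and the helper names are hypothetical rather than part of Seedbraid.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def fetch_all(cids: list[str], fetch_one: Callable[[str], bytes],
              workers: int = 8, batch_size: int = 3) -> bytes:
    """Fetch chunks batch by batch, in parallel within each batch."""
    parts: list[bytes] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(cids, batch_size):
            # pool.map returns results in input order, so the file is
            # reassembled in recipe order even if fetches finish out of order.
            parts.extend(pool.map(fetch_one, batch))
    return b"".join(parts)


# Stand-in for the network: a dict keyed by made-up chunk identifiers.
fake_network = {f"cid-{n}": bytes([n]) * 4 for n in range(10)}
order = [f"cid-{n}" for n in range(10)]
recovered = fetch_all(order, fake_network.__getitem__)
print(len(recovered))
```

Batching bounds how many requests are in flight at once, which is the role the --workers and --batch-size options appear to play in the command above.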

Decode with IPFS Genome

seedbraid decode seed.sbd --genome ipfs:// --out recovered.bin
seedbraid decode seed.sbd --genome ipfs:///path/to/cache --out recovered.bin
seedbraid decode seed.sbd --genome ipfs:// --out recovered.bin \
  --gateway https://ipfs.io/ipfs

Using --genome ipfs:// activates hybrid storage: chunks are fetched from IPFS with local SQLite caching. ipfs:// uses a temporary cache; ipfs:///path/to/cache persists fetched chunks for future reuse.
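
A read-through cache of this kind might look like the sketch below. The table layout, class name, and the fetch_remote callable are assumptions for illustration; Seedbraid's actual genome schema is not shown in this README.

```python
import hashlib
import sqlite3
from typing import Callable


class CachedChunkStore:
    """Look up chunks in a local SQLite cache, falling back to a remote fetch."""

    def __init__(self, fetch_remote: Callable[[str], bytes],
                 cache_path: str = ":memory:") -> None:
        # ":memory:" mimics a temporary cache; a file path would persist it.
        self._db = sqlite3.connect(cache_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS chunks "
            "(digest TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )
        self._fetch_remote = fetch_remote

    def get(self, digest: str) -> bytes:
        row = self._db.execute(
            "SELECT data FROM chunks WHERE digest = ?", (digest,)
        ).fetchone()
        if row is not None:
            return bytes(row[0])  # cache hit: no network access
        data = self._fetch_remote(digest)
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"integrity check failed for chunk {digest}")
        self._db.execute(
            "INSERT OR REPLACE INTO chunks (digest, data) VALUES (?, ?)",
            (digest, data),
        )
        self._db.commit()
        return data


payload = b"example chunk"
digest = hashlib.sha256(payload).hexdigest()
remote_calls: list[str] = []


def fake_remote(d: str) -> bytes:
    remote_calls.append(d)
    return payload


store = CachedChunkStore(fake_remote)
first = store.get(digest)
second = store.get(digest)
print(len(remote_calls))
```

The second lookup is served from SQLite, so the remote is contacted only once per chunk.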

Publish to IPFS

seedbraid publish seed.sbd --no-pin
seedbraid publish seed.sbd --pin
seedbraid publish seed.sbd --remote-pin \
  --remote-endpoint https://pin.example/api/v1 --remote-token "$SB_PINNING_TOKEN"

publish emits a warning when the seed is unencrypted. For sensitive data, prefer:

seedbraid encode --encrypt --manifest-private ...

When --remote-pin is enabled, Seedbraid also registers the CID with a configured Pinning Services API-compatible provider.
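
For orientation, a pin registration under the IPFS Pinning Services API is a POST to the provider's /pins path with a bearer token. The sketch below only builds such a request without sending it; the helper name is hypothetical and it is not taken from Seedbraid's code.

```python
import json
import urllib.request
from typing import Optional


def build_pin_request(endpoint: str, token: str, cid: str,
                      name: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a Pinning Services API 'add pin' request."""
    body = {"cid": cid}
    if name:
        body["name"] = name  # optional human-readable label
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/pins",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace with a real endpoint, token, and CID.
request = build_pin_request(
    "https://pin.example/api/v1", "your-api-token", "<cid>", name="seed.sbd"
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) requires a reachable provider and a valid token, so it is left out of the sketch.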

Fetch from IPFS

seedbraid fetch <cid> --out fetched.sbd
seedbraid fetch <cid> --out fetched.sbd --retries 5 --backoff-ms 300
seedbraid fetch <cid> --out fetched.sbd --gateway https://ipfs.io/ipfs

fetch retrieves the seed through the kubo HTTP API, retrying with exponential backoff, and can fall back to an HTTP gateway (--gateway).
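
The retry behavior can be pictured with the sketch below. It assumes the delay doubles after each failed attempt, starting from the --backoff-ms value; the exact schedule Seedbraid uses is not documented here, and the function names are hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_with_fallback(primary: Callable[[], bytes],
                        fallback: Optional[Callable[[], bytes]] = None,
                        retries: int = 5, backoff_ms: int = 300,
                        sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Try `primary` once plus `retries` retries, then `fallback` if given."""
    last_error: Optional[OSError] = None
    for attempt in range(retries + 1):
        try:
            return primary()
        except OSError as exc:
            last_error = exc
            if attempt < retries:
                # Assumed schedule: backoff_ms, 2x, 4x, ... (in seconds).
                sleep(backoff_ms * 2 ** attempt / 1000)
    if fallback is not None:
        return fallback()  # e.g. an HTTP gateway instead of the local API
    assert last_error is not None
    raise last_error


def always_down() -> bytes:
    raise ConnectionError("kubo API unreachable")


delays: list[float] = []
result = fetch_with_fallback(always_down, lambda: b"from-gateway",
                             retries=3, backoff_ms=300, sleep=delays.append)
print(result, delays)
```

Passing sleep as a parameter keeps the sketch testable without real waiting.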

Pin Health

seedbraid pin-health <cid>

Remote Pin Add

export SB_PINNING_ENDPOINT='https://pin.example/api/v1'
export SB_PINNING_TOKEN='your-api-token'
seedbraid pin remote-add <cid>

Sign Seed

export SB_SIGNING_KEY='your-shared-secret'
seedbraid sign seed.sbd --out seed.signed.sbd --key-env SB_SIGNING_KEY --key-id team-a
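
Because the signing key is a shared secret, the signature is presumably a keyed MAC rather than a public-key signature. The sketch below uses HMAC-SHA256 purely as an assumption to show the idea; the actual algorithm and on-disk layout are defined in docs/FORMAT.md, and the function names and dictionary fields here are hypothetical.

```python
import hashlib
import hmac


def sign_bytes(seed_bytes: bytes, key: str, key_id: str) -> dict[str, str]:
    """Compute a keyed tag over the seed bytes (assumed scheme: HMAC-SHA256)."""
    tag = hmac.new(key.encode("utf-8"), seed_bytes, hashlib.sha256).hexdigest()
    return {"key_id": key_id, "alg": "hmac-sha256", "tag": tag}


def verify_bytes(seed_bytes: bytes, key: str, signature: dict[str, str]) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(key.encode("utf-8"), seed_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature["tag"])


seed_bytes = b"example seed bytes"
signature = sign_bytes(seed_bytes, "your-shared-secret", "team-a")
print(verify_bytes(seed_bytes, "your-shared-secret", signature))
```

With a shared-secret scheme, anyone who can verify a seed can also sign one, so the key should be distributed only to trusted parties.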

Export / Import Genes

seedbraid export-genes seed.sbd --genome ./genome --out genes.pack
seedbraid import-genes genes.pack --genome ./another-genome

Generate an Encryption Key

Generate a high-entropy key for SB_ENCRYPTION_KEY:

seedbraid gen-encryption-key

Print shell export format:

seedbraid gen-encryption-key --shell

Set the current shell variable directly:

eval "$(seedbraid gen-encryption-key --shell)"

IPFS Setup

Start the kubo daemon:

ipfs daemon

By default, seedbraid connects to the kubo HTTP API at http://127.0.0.1:5001/api/v0. Override with the SB_KUBO_API environment variable:

export SB_KUBO_API=http://127.0.0.1:5001/api/v0

Run seedbraid doctor to verify connectivity.
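
The endpoint resolution described above amounts to an environment lookup with a default. A minimal sketch, with hypothetical helper names:

```python
import os
from typing import Mapping, Optional

DEFAULT_KUBO_API = "http://127.0.0.1:5001/api/v0"


def kubo_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SB_KUBO_API if set, otherwise the default local endpoint."""
    env = os.environ if env is None else env
    return env.get("SB_KUBO_API", DEFAULT_KUBO_API).rstrip("/")


def version_url(env: Optional[Mapping[str, str]] = None) -> str:
    """URL of the kubo RPC 'version' call, a cheap reachability probe."""
    return kubo_api_base(env) + "/version"


print(version_url({}))
print(version_url({"SB_KUBO_API": "http://10.0.0.5:5001/api/v0/"}))
```

Note that the kubo RPC API expects POST requests, so a manual probe of the version URL should use POST rather than GET.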

Remote Pinning Setup

To use a remote pinning service, set the endpoint and token as environment variables.

Using a shell profile (~/.bashrc, ~/.zshrc):

export SB_PINNING_ENDPOINT='https://api.pinata.cloud/psa'
export SB_PINNING_TOKEN='your-api-token'

Using direnv (.envrc in your project directory):

# .envrc
export SB_PINNING_ENDPOINT='https://api.pinata.cloud/psa'
export SB_PINNING_TOKEN='your-api-token'

With these variables set, --remote-pin works without passing --remote-endpoint and --remote-token each time.

Verifying a Remote Pin

After publishing with --remote-pin, confirm the pin is active:

# 1. Check local pin and block availability
seedbraid pin-health <cid>

# 2. Verify the pinned content is fetchable from the network
seedbraid fetch <cid> --out /tmp/verify.sbd
seedbraid verify /tmp/verify.sbd --genome ./genome --strict

If pin-health reports the CID is pinned and fetch + verify --strict succeed, the remote pin is working correctly.

Common Failures

  • kubo daemon not reachable
    • Install Kubo, start the daemon with ipfs daemon, and verify with seedbraid doctor
  • Missing required chunk on decode or verify
    • Provide the correct --genome, or re-encode with --portable
  • zstd compression error
    • Install optional dependency zstandard, or use --compression zlib

Data Recovery Guide

Reconstructing a file requires two things: a seed (the recipe describing chunk order) and the chunks themselves (the actual data). If either is missing, recovery is impossible.
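
The recipe-plus-ingredients model can be sketched in a few lines. The data structures below are simplified assumptions (a list of digests and a dict of chunks), not the SBD1 format itself.

```python
import hashlib


def reconstruct(recipe: list[str], chunks: dict[str, bytes]) -> bytes:
    """Rebuild a file from an ordered list of chunk digests.

    Every digest in the recipe must be present; there is no partial result.
    """
    missing = sorted({d for d in recipe if d not in chunks})
    if missing:
        raise KeyError(f"missing {len(missing)} required chunk(s)")
    return b"".join(chunks[d] for d in recipe)


def digest_of(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


a, b = b"header--", b"payload-"
chunk_store = {digest_of(a): a, digest_of(b): b}
recipe = [digest_of(a), digest_of(b), digest_of(b)]  # a repeated chunk is stored once

print(reconstruct(recipe, chunk_store))
```

Removing any single entry from chunk_store makes reconstruct fail, which mirrors the failure scenarios listed below.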

When Recovery Succeeds

| Scenario | Why It Works |
| --- | --- |
| Seed on hand + local genome available | Recipe and ingredients are both local |
| Seed on hand + own IPFS node running with chunks pinned | Recipe is local; ingredients are in your node's storage |
| Seed on hand + chunks held by a pinning service (Pinata, etc.) | Recipe is local; ingredients are in a paid storage provider |
| Seed on hand + teammate's IPFS node holds the chunks | Recipe is local; ingredients are on a peer's node |
| Seed created with --portable (chunks embedded in seed) | Recipe and ingredients are bundled together in one file |
| Seed on hand + genome snapshot (.sgs backup) exists | Recipe is local; ingredients are in a backup archive |

When Recovery Fails

| Scenario | Why It Fails |
| --- | --- |
| Seed file lost | Without the recipe, there is no way to know which chunks to fetch or how to reassemble them |
| Seed exists, but genome deleted and chunks never published to IPFS | Recipe exists, but all ingredients have been discarded |
| Seed exists, but IPFS node stopped and no other node holds the chunks | Recipe exists, but the only store that had the ingredients is offline |
| Seed exists, but IPFS pin removed and garbage collection ran | Recipe exists, but automatic cleanup deleted the ingredients |
| Seed exists, but pinning service subscription expired | Recipe exists, but the storage provider disposed of the ingredients |
| Seed exists, but even one chunk is missing from all sources | Partial recovery is not supported; every chunk is required |
| Seed is encrypted and the encryption key is lost | The recipe is unreadable without the key |

Protecting Against Data Loss

| Action | Risk Mitigated |
| --- | --- |
| Back up seed files | Prevents seed loss |
| Use --pin when publishing chunks | Prevents IPFS garbage collection |
| Use a pinning service (--remote-pin) | Survives local node shutdown |
| Encode with --portable | Self-contained seed; no external chunk source needed (seed size increases) |
| Keep encryption keys in a secret manager | Prevents key loss for encrypted seeds |
| Take genome snapshots (genome snapshot) | Preserves local chunk data independently of IPFS |

Safest option: --portable embeds all chunks in the seed, making it fully self-contained. The trade-off is that the seed grows to roughly the size of the original file, reducing the benefit of IPFS distribution.

Troubleshooting Matrix

| Symptom | Error Code | Next Action |
| --- | --- | --- |
| Encryption requested but key missing | SB_E_ENCRYPTION_KEY_MISSING | Pass --encryption-key or set SB_ENCRYPTION_KEY. |
| Signing requested but key missing | SB_E_SIGNING_KEY_MISSING | Export signing key env var and retry seedbraid sign. |
| Kubo daemon unreachable | SB_E_IPFS_NOT_FOUND | Install Kubo, run ipfs daemon, set SB_KUBO_API if non-default endpoint. |
| IPFS fetch/publish failure | SB_E_IPFS_FETCH / SB_E_IPFS_PUBLISH | Check daemon/network, retry, use gateway fallback if needed. |
| Remote pin configuration missing | SB_E_REMOTE_PIN_CONFIG | Set endpoint/token env vars or pass options. |
| Remote pin auth failed | SB_E_REMOTE_PIN_AUTH | Verify provider token permissions and retry. |
| Remote pin request invalid | SB_E_REMOTE_PIN_REQUEST | Check CID/provider options and retry. |
| Remote pin timeout/failure | SB_E_REMOTE_PIN_TIMEOUT / SB_E_REMOTE_PIN | Increase retries/timeout or check provider health. |
| Seed parse/integrity failure | SB_E_SEED_FORMAT | Re-fetch/rebuild seed and verify source integrity. |
| IPFS chunk publish failed | SB_E_IPFS_CHUNK_PUT | Check IPFS daemon, retry, verify chunk availability. |
| IPFS chunk fetch failed | SB_E_IPFS_CHUNK_GET | Check daemon/network, retry, use --gateway fallback. |
| Chunk manifest invalid | SB_E_CHUNK_MANIFEST_FORMAT | Regenerate manifest with publish-chunks. |
| IPFS MFS operation failed | SB_E_IPFS_MFS | Verify daemon is running with seedbraid doctor. |

Development & Contributing

The sections below are for contributors and developers working on Seedbraid itself.

Development Setup

uv sync --no-editable --extra dev

Optional zstd support:

uv sync --no-editable --extra dev --extra zstd

Refresh the lockfile after dependency changes:

uv lock

Local Checks

UV_CACHE_DIR=.uv-cache uv run --no-editable ruff check .
PYTHONPATH=src uv run --no-editable python -m pytest
PYTHONPATH=src uv run --no-editable python -m pytest tests/test_compat_fixtures.py

IPFS tests auto-skip when the kubo daemon is not reachable.

Compatibility fixtures are stored in tests/fixtures/compat/v1/ and validated by tests/test_compat_fixtures.py.

To regenerate them intentionally:

uv run --no-editable python scripts/gen_compat_fixtures.py

CI

GitHub Actions workflows:

  • .github/workflows/ci.yml
    • ruff check .
    • python -m pytest
    • compatibility fixtures validation
    • benchmark gate
  • .github/workflows/publish-seed.yml
    • manual only, dry_run=true by default
    • generates a seed from source_path
    • runs seedbraid verify --strict
    • publishes to IPFS only when dry_run=false
    • installs Kubo when needed
    • verifies Kubo release signature status and checksum
    • supports pin, portable, manifest_private, and optional encrypt

Local parity commands:

uv sync --no-editable --extra dev
uv run --no-sync --no-editable ruff check .
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest tests/test_compat_fixtures.py
uv run --no-sync --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json

Benchmarking

1-byte insertion dedup benchmark

uv run --no-editable python scripts/bench_shifted_dedup.py
uv run --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json

Expected behavior:

  • cdc_buzhash should show better reuse than fixed when a single-byte insertion shifts offsets
  • bench_gate.py exits non-zero when configured thresholds are violated
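
The gate logic can be approximated as below. This is a sketch, not the contents of scripts/bench_gate.py: the result fields are hypothetical, and it assumes the reuse improvement is measured in basis points (1 bps = 0.01 percentage points) of cdc_buzhash reuse over fixed reuse.

```python
from dataclasses import dataclass


@dataclass
class BenchResult:
    fixed_reuse: float           # fraction of bytes reused with fixed chunking
    cdc_reuse: float             # fraction of bytes reused with CDC
    seed_size_ratio: float       # seed size relative to a reference size
    cdc_throughput_mib_s: float  # chunking throughput in MiB/s


def gate(result: BenchResult, min_reuse_improvement_bps: float = 1,
         max_seed_size_ratio: float = 1.20,
         min_cdc_throughput_mib_s: float = 0.10) -> list[str]:
    """Return a list of violated thresholds; empty means the gate passes."""
    violations = []
    improvement_bps = (result.cdc_reuse - result.fixed_reuse) * 10_000
    if improvement_bps < min_reuse_improvement_bps:
        violations.append(f"reuse improvement {improvement_bps:.0f} bps too low")
    if result.seed_size_ratio > max_seed_size_ratio:
        violations.append(f"seed size ratio {result.seed_size_ratio:.2f} too high")
    if result.cdc_throughput_mib_s < min_cdc_throughput_mib_s:
        violations.append(f"throughput {result.cdc_throughput_mib_s:.2f} MiB/s too low")
    return violations


good = BenchResult(fixed_reuse=0.0, cdc_reuse=0.97,
                   seed_size_ratio=1.05, cdc_throughput_mib_s=12.0)
bad = BenchResult(fixed_reuse=0.5, cdc_reuse=0.5,
                  seed_size_ratio=1.50, cdc_throughput_mib_s=12.0)

exit_code = 1 if gate(bad) else 0  # a CI step would pass this to sys.exit
print(gate(good), gate(bad), exit_code)
```

Returning the list of violations, rather than a bare boolean, makes it straightforward to report every failed threshold in one CI run.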

Integrations

DVC Integration

  • Minimal DVC bridge lives in examples/dvc/
  • Pipeline stages are encode -> verify --strict -> fetch
  • The integration recipe and artifact layout are documented in examples/dvc/README.md

OCI Integration

  • ORAS bridge scripts and usage docs live in examples/oci/
  • Default OCI metadata convention:
    • artifact type: application/vnd.seedbraid.seed.v1
    • layer media type: application/vnd.seedbraid.seed.layer.v1+sbd
    • annotations: source SHA-256, chunker, manifest-private flag, seed title
  • Push/pull scripts:
    • examples/oci/scripts/push_seed.sh <seed.sbd> <registry/repository:tag>
    • examples/oci/scripts/pull_seed.sh <registry/repository:tag> <out.sbd>
  • After pull, run strict verification:
    • seedbraid verify <out.sbd> --genome <genome-path> --strict

ML Tooling Hooks

  • Scripts for MLflow metadata logging and Hugging Face upload live in examples/ml/
  • MLflow hook logs seed metadata fields
  • Hugging Face hook uploads seed.sbd and a metadata sidecar
  • Restore workflow is documented in examples/ml/README.md

Roadmap

Current adoption priorities include:

  • a faster onboarding path
  • stronger benchmark evidence versus alternatives
  • security and operator tooling such as signing, encryption, doctor, snapshot, and restore
  • stable format governance and backward-compatibility policy for long-lived seed archives

Project Documents

  • Format spec: docs/FORMAT.md
  • Design rationale: docs/DESIGN.md
  • Threat model: docs/THREAT_MODEL.md
  • Error codes: docs/ERROR_CODES.md
  • Performance gates: docs/PERFORMANCE.md
  • DVC example: examples/dvc/README.md
  • OCI example: examples/oci/README.md
  • ML tooling example: examples/ml/README.md

Support Seedbraid

Seedbraid is maintained as an open-source project.

If Seedbraid helps your workflow, please consider supporting the project through the repository Sponsor button. Support goes directly toward maintenance, documentation, and compatibility/performance validation.

Open Source Governance

  • License: MIT (LICENSE)
  • Security policy: SECURITY.md
  • Contributing guide: CONTRIBUTING.md
  • Code of Conduct: CODE_OF_CONDUCT.md

About

Open-source toolkit for lossless, reference-based reconstruction of large binary artifacts using deterministic CDC and compact SBD1 seeds.
