Skip to content

franchoy/coldkeep

coldkeep

CI Go Version License Status Release

Status: Experimental research project.
Not production-ready. Do not use for real or sensitive data.
On-disk format and APIs may change before v1.0.

coldkeep is a content-addressed storage engine for cold data.

It splits files into content-defined chunks, deduplicates them using SHA-256, and stores them as encoded blocks in append-only container files with database-backed metadata.

coldkeep guarantees deterministic, byte-identical restore of stored data, validated by end-to-end hashing and resilient across garbage collection and system restart/recovery.

coldkeep is designed as a correctness-first storage engine, prioritizing determinism and recoverability over performance and feature completeness.


What it does (today)

  • Store files or folders by splitting them into content-defined chunks.
  • Deduplicate chunks using SHA-256 (content-addressed storage).
  • Pack chunks into container files up to a configurable maximum size.
  • Restore files by reconstructing them from stored chunks.
  • Guarantee byte-identical restore outputs (verified by SHA-256).
  • Use an append-only container model for deterministic and safe writes.
  • Remove logical files (decrementing chunk reference counts).
  • Run garbage collection to remove unreferenced chunks safely.
  • Recover from interrupted operations on startup.
  • Provide storage statistics and container health information.
  • Perform multi-level integrity verification (metadata, container structure, and full data integrity).
  • Store data in a format designed for deterministic recovery and long-term integrity.

✨ What’s new in v0.7

  • Block abstraction layer (logical vs physical separation)
  • Pluggable codecs (plain + aes-gcm)
  • AES-GCM encryption
  • CLI codec selection
  • init command for key setup
  • CI coverage for both modes

🔐 Encryption model

coldkeep supports pluggable block encoding via codecs:

Codec Description
plain No encoding (raw data)
aes-gcm AES-256-GCM authenticated encryption

Key properties:

  • Encryption is applied at the block level
  • Hashing is always performed on plaintext
  • Each block uses a random nonce
  • Encryption keys are externalized via environment variables

The system fails fast if encryption is requested and no key is provided.

Encryption is applied after chunking and before storage, ensuring deduplication operates on plaintext while data at rest remains protected.


🔑 Initialization (Encryption Setup)

Before storing encrypted data, generate a key:

coldkeep init

This will:

  • generate a secure 256-bit encryption key
  • print it to the console
  • create a .env file if it does not already exist

Example:

COLDKEEP_KEY=...
COLDKEEP_CODEC=aes-gcm

Load it into your shell:

export $(cat .env | xargs)

Docker:

The /app mount ensures the .env file created by init persists on the host.

docker compose run --rm -v "$PWD:/app" app init

⚠️ Data encrypted with a key cannot be recovered without it.
There is currently no key rotation or recovery mechanism.


⚙️ Codec selection

coldkeep store --codec plain file.txt
coldkeep store --codec aes-gcm file.txt

Codec notes

  • aes-gcm requires COLDKEEP_KEY
  • The CLI flag overrides environment configuration
  • The default codec is aes-gcm
  • Using plain stores data unencrypted and prints a warning

🚀 Quickstart

A small samples/ folder is included for testing.

Quick start (local, no Docker)

# 1. Generate encryption key and write .env
coldkeep init

# 2. Load the key into your shell
export $(cat .env | xargs)

# 3. Store a file
coldkeep store file.txt

Security note: If you lose the key, data cannot be recovered.
Never commit .env to version control.

Quick start (Docker)

The /app mount ensures the .env file created by init persists on the host.

# 1. Start services
docker compose up -d --build

# 2. Generate encryption key
docker compose run --rm -v "$PWD:/app" app init

# 3. Store a file (pass the key from the generated .env)
docker compose run --rm \
  --env-file .env \
  -v "$PWD/samples:/samples" \
  app store /samples/hello.txt

Security note: If you lose the key, data cannot be recovered.


Verification

coldkeep provides a multi-level integrity verification system to ensure consistency and detect corruption across metadata and stored data.

Levels

  • Standard

    • Validates metadata integrity
    • Checks reference counts, chunk ordering, and orphan records
  • Full

    • Includes all standard checks
    • Verifies container files exist and match recorded sizes
    • Validates container hashes and chunk-to-container consistency
  • Deep

    • Includes all full checks
    • Reads container data and recomputes chunk hashes
    • Detects physical data corruption at the byte level

Usage

Verify the entire system:

coldkeep verify system --level standard
coldkeep verify system --level full
coldkeep verify system --level deep

Verify a specific file:

coldkeep verify file <file_id> --level standard
coldkeep verify file <file_id> --level full
coldkeep verify file <file_id> --level deep

Notes

Deep verification performs full reads of container files and may be slow.
Recommended for periodic integrity audits rather than frequent execution.


Deterministic restore guarantees (v0.5)

coldkeep guarantees deterministic restore at the logical file level (introduced in v0.5).

Current integration coverage validates:

  • deterministic stored logical files under deduplication
  • byte-identical restore outputs verified by SHA-256
  • consistent restore behavior after GC and restart/recovery
  • deterministic results across representative and edge-case datasets

This guarantee applies to logical files and dataset-level workflows.

Coldkeep does not yet define a first-class contract for restoring full directory trees with exact original layout. Current tests restore logical files individually rather than reconstructing folder structure.


Design sketch

Core tables:

  • logical_file
    User-visible file entry (name, size, file_hash).

  • chunk
    Logical identity of a content-addressed chunk (chunk_hash, size, ref_count).

  • blocks
    Physical placement and codec metadata for each chunk (codec, block_offset, stored_size, container_id).

  • file_chunk
    Ordered mapping between logical files and chunks.

  • container
    Physical container file storing raw block payloads.

Containers are stored on disk under:

storage/containers/

Containers follow an append-only write model.

Storage pipeline:

logical_file -> file_chunk -> chunk -> blocks -> container

Lifecycle states:

  • logical_file: PROCESSING → COMPLETED → ABORTED
  • chunk: PROCESSING → COMPLETED → ABORTED

These states allow coldkeep to detect interrupted operations and recover safely on startup.


Project structure

coldkeep/
│
├─ cmd/
│   └─ coldkeep/          # CLI entrypoint
│
├─ internal/
│   ├─ blocks/            # block encoding (plain, aes-gcm)
│   ├─ chunk/             # chunking logic
│   ├─ container/         # container format + management
│   ├─ db/                # database helpers
│   ├─ listing/           # file listing operations
│   ├─ maintenance/       # gc, stats, verify_command
│   ├─ recovery/          # system recovery logic
│   ├─ storage/           # store / restore / remove pipeline
│   ├─ utils_env/         # env helpers
│   ├─ utils_print/       # print helpers
│   └─ verify/            # verification logic
│
├─ tests/                 # integration tests
├─ scripts/               # smoke / development scripts
├─ db/                    # database schema
│
├─ docker-compose.yml
├─ go.mod
└─ README.md

Development

🐳 Local development (with Docker)

Start services:

docker compose up -d --build

Store a sample file:

docker compose run --rm -v "$PWD/samples:/samples" app store /samples/hello.txt

Store the sample folder:

docker compose run --rm -v "$PWD/samples:/samples" app store-folder /samples

List stored files:

docker compose run --rm app list

Restore a file:

docker compose run --rm app restore 1 _out.bin

Show stats:

docker compose run --rm app stats

Run garbage collection:

docker compose run --rm app gc

💻 Local development (without Docker)

Start Postgres (example):

docker compose up -d postgres

Build the CLI first using the build command below.

Store the sample folder:

./coldkeep store-folder samples

List stored files:

./coldkeep list

Restore a file:

./coldkeep restore 1 restored.bin

Show stats:

./coldkeep stats

Run GC:

./coldkeep gc

Build

go build -o coldkeep ./cmd/coldkeep

Tests

Run all tests:

go test ./...

Integration tests live under:

tests/

and require a running PostgreSQL instance.


Smoke test

scripts/smoke.sh runs a full end-to-end workflow using the samples/ directory.

store -> stats -> list -> restore -> dedup check.

Local

docker compose up -d postgres
go build -o coldkeep ./cmd/coldkeep
bash scripts/smoke.sh

Docker

docker compose up -d postgres

docker compose run --rm \
  -e COLDKEEP_SAMPLES_DIR=/samples \
  -e COLDKEEP_STORAGE_DIR=/tmp/coldkeep-storage \
  -v "$PWD/samples:/samples" \
  --entrypoint bash \
  app scripts/smoke.sh

Configuration

Database configuration is read from environment variables
(see docker-compose.yml for defaults).

Storage is written to:

./storage/

During development you can safely delete this directory.

Additional environment variables used in development:

  • COLDKEEP_STORAGE_DIR
  • COLDKEEP_SAMPLES_DIR

Known limitations

Crash recovery

coldkeep includes a crash recovery model based on lifecycle states.

Operations use explicit states (PROCESSING, COMPLETED, ABORTED) to detect and handle interrupted operations safely.

On startup the system:

  • marks stale PROCESSING rows as ABORTED
  • prevents incomplete chunks from being reused
  • allows safe retries of interrupted operations

This model ensures that partially written data does not corrupt the logical state of the system.

In v0.6, the append-only container model further simplifies recovery by eliminating in-place mutations of container data.

However, the system is still experimental and full transactional guarantees across filesystem and database layers are not yet complete.

Use only with disposable test data.

Container compression

Whole-container compression has been removed in v0.6.

Future versions may introduce block-level compression.

Concurrency & integrity

Concurrency support has been significantly improved in v0.6, including locking and retry mechanisms.

However, it is still evolving and not yet optimized for extreme parallel workloads.


Security

See SECURITY.md.

This is an experimental research project with evolving on-disk formats.

Do not use for real or sensitive data.


Roadmap

Coldkeep follows a risk-reduction approach, where each release removes a class of failure until the system becomes fully trustworthy.

  • v0.2 — Crash Consistency Foundation
    Eliminate DB ↔ filesystem divergence and ensure safe recovery after crashes.

  • v0.3 — Safe Garbage Collection
    Guarantee that GC cannot remove referenced data.

  • v0.4 — Integrity & Verification Layer
    Enable full-system verification and corruption detection.

  • v0.5 — Deterministic Restore Guarantees
    Ensure byte-identical, reproducible restore across GC and restart.

  • v0.6 — Storage Model Evolution
    Introduce an append-only container model, remove legacy compression, and improve concurrency coordination as a foundation for future block abstraction and encryption.

  • v0.7 — Block Abstraction & Encryption Foundations
    Introduce block-level structure to enable partial reads, compression, and encryption in a controlled and extensible way.

  • v0.8 — Simulation & CLI Stabilization
    Add simulation capabilities for storage planning and define a stable command-line interface.

  • v0.9 — Internal Hardening
    Improve reliability, simplify internals, and finalize implementation details.

  • v1.0 — Storage Engine Stable
    Coldkeep becomes a trustworthy storage engine for real cold backups.


Contributing

Contributions and discussion are welcome.

See CONTRIBUTING.md.


License

Apache-2.0. See LICENSE

About

coldkeep is an experimental local-first content-addressed file storage engine with verifiable integrity written in Go. Files are split into content-addressed chunks, packed into container files on disk, and tracked through PostgreSQL metadata to attempt deduplication

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages