Skip to content

mizcausevic-dev/data-contract-registry

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

data-contract-registry

CI Python License: MIT

Schema registry for data contracts. Semver versioning, compatibility checks (backward / forward / full), declared owners, freshness SLAs. The "you can't promote a new dataset version without an approved contract" pattern, lifted from API governance and aimed at data pipelines.

The headline endpoint is POST /contracts — register a new version, get back a deterministic compatibility report or a 422 with every breaking change called out by field name and kind.


Why

The thing that gets data teams paged at 2am isn't a missing test. It's a producer who quietly removed ltv because "we never use it anymore" while three downstream dashboards still join on it. Schema registries (Confluent, Buf, etc.) solved this for streaming and gRPC; data pipelines need the same hardness in a shape that fits the things data teams actually argue about:

  • owners — who do I page when this dataset goes stale
  • freshness SLA — when does "stale" become "broken"
  • primary key — changing it is a MAJOR, not a MINOR
  • enum drift — adding a value is fine; removing one is a backward-compatibility break
  • deprecation policy — flag a version with the URI of the migration plan; don't delete it

This package is the smallest thing that does all of those.


Install

pip install data-contract-registry
# with the FastAPI surface:
pip install "data-contract-registry[api]"

Python 3.11+. Runtime deps: pydantic + PyYAML.


Library quickstart

from data_contract_registry import (
    ContractRegistry,
    DataContract,
    DataField,
    Owner,
)

registry = ContractRegistry()

v1 = DataContract(
    dataset_id="users.daily_active",
    version="1.0.0",
    primary_key=["user_id", "active_date"],
    owners=[Owner(team="growth-platform", contact="#growth-platform")],
    fields=[
        DataField(name="user_id",     type="string"),
        DataField(name="active_date", type="timestamp"),
        DataField(name="plan",        type="string", enum=["free", "pro", "enterprise"]),
        DataField(name="ltv",         type="number", required=False),
    ],
    status="active",
)
registry.register(v1)

# Compatible promotion (added an optional field).
v1_1 = v1.model_copy(update={
    "version": "1.1.0",
    "fields": [*v1.fields, DataField(name="signup_source", type="string", required=False)],
})
report = registry.register(v1_1)
print(report.compatible)   # True

# Incompatible promotion — removing a field breaks backward compatibility.
v2 = v1.model_copy(update={"version": "2.0.0", "fields": [f for f in v1.fields if f.name != "ltv"]})
report = registry.register(v2)
print(report.compatible)               # False
print(report.errors[0].kind)           # "field_removed"
print(report.errors[0].message)        # "field 'ltv' was removed; old data will fail validation"

Compatibility modes

Mode Meaning
backward New schema can read data produced by the previous schema. Default. Consumers upgrade first.
forward Previous schema can read data produced by the new schema. Producers upgrade first.
full Both.
none Anything goes. First-time onboarding only.

The checks the engine knows how to flag (each carries a structured kind so you can build CI gates around specific failures):

Kind Severity Mode
field_removed error backward
field_type_changed error backward
field_required_added error backward (optional→required) or forward (new required field)
field_enum_shrunk error backward
primary_key_changed error always
version_not_increasing error always
owner_missing error always

FastAPI surface

pip install "data-contract-registry[api]"
uvicorn data_contract_registry.app:app --port 8090
Method Path What it does
GET / Service info.
GET /healthz Liveness probe.
GET /datasets List registered dataset IDs.
POST /contracts Register / promote a contract. 422 with a structured issue list when incompatible.
POST /contracts/check Dry-run compatibility check — does not register.
GET /contracts/{ds}/latest Latest active contract for a dataset.
GET /contracts/{ds}/versions Full version history.
GET /contracts/{ds}/versions/{v} One specific version.
POST /contracts/{ds}/versions/{v}/deprecate Mark deprecated with a migration URI.
POST /contracts/{ds}/versions/{v}/archive Archive a version (history preserved).
POST /contracts/owners/from-decision-card Cross-ecosystem hook — pull owners out of a Decision Card.

Bundles are held in-memory by default. For restart-safe storage, swap _BundleStore's implementation; the protocol is small.


The cross-ecosystem hook

The third hook in the portfolio (after procurement-decision-apipolicy-as-code-engine and the Suite → Decision Intelligence bridge). When a buyer approves a vendor whose data product the team will consume, the Decision Card's buyer.name + decision_maker are the right answer to "who owns the contract on our side":

curl -X POST http://localhost:8090/contracts/owners/from-decision-card \
  -H 'Content-Type: application/json' \
  -d @decision-card.json
# -> [
#   {"team": "Springfield USD",                            "contact": "#data-platform"},
#   {"team": "Director of Data (Alex Chen)",               "contact": null}
# ]

Drop that list straight into DataContract.owners and the registration carries paging info the team didn't have to re-type.


YAML authoring

# contracts/users-daily-active.yaml
dataset_id: users.daily_active
version: "1.0.0"
owners:
  - team: growth-platform
    contact: "#growth-platform"
freshness_sla:
  max_lag_seconds: 86400
fields:
  - {name: user_id,      type: string}
  - {name: active_date,  type: timestamp}
  - {name: plan,         type: string, enum: [free, pro, enterprise]}

Hand-author in YAML, validate in CI, register from Python:

import yaml
from pathlib import Path
from data_contract_registry import ContractRegistry, DataContract

raw = yaml.safe_load(Path("contracts/users-daily-active.yaml").read_text())
ContractRegistry().register(DataContract.model_validate(raw))

Tests

pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
mypy src
pytest -v

CI matrix runs Python 3.11 / 3.12 / 3.13.


Related in this ecosystem


License

MIT. See LICENSE.

About

Schema registry for data contracts. Semver versioning, compatibility checks (backward/forward/full), ownership, freshness SLAs. Bridges to procurement-decision-api.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages