Observation
Surfaced while implementing the :DocumentMeta serialization for #505 (PR #507). In the dev graph there are 153 distinct (garage_key, s.document) combinations across only 120 files — i.e. some files (one garage_key) carry Source chunks tagged to multiple ontologies (s.document).
Why it matters
document_id == content_hash == sha256(document bytes) is content-derived and ontology-independent, but garage_key = sources/<ontology>/<hash[:32]>.<ext> embeds the ontology. So the same content ingested into two ontologies produces:
- the same document_id (one :DocumentMeta after MERGE-on-id; last writer wins on ontology/garage_key), but
- two different garage_keys.
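A minimal sketch of the two derivations, to make the mismatch concrete. It follows the formulas quoted above; the function names, the hex encoding of the hash, and the ontology names are assumptions for illustration, not the project's actual helpers.

```python
import hashlib


def document_id(content: bytes) -> str:
    # content_hash: sha256 over the raw document bytes, so it does not
    # depend on the ontology. Hex encoding is an assumption of this sketch.
    return hashlib.sha256(content).hexdigest()


def garage_key(ontology: str, content: bytes, ext: str) -> str:
    # sources/<ontology>/<hash[:32]>.<ext>: the ontology is part of the key.
    return f"sources/{ontology}/{document_id(content)[:32]}.{ext}"


# Hypothetical example: the same bytes ingested into two ontologies.
content = b"same bytes, ingested twice"
id_a = document_id(content)
id_b = document_id(content)
key_a = garage_key("ontology_a", content, "md")
key_b = garage_key("ontology_b", content, "md")

print(id_a == id_b)    # one document_id
print(key_a == key_b)  # but two distinct garage_keys
```

The two keys share the hash component and differ only in the ontology path segment.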
This means one DocumentMeta can't faithfully represent a file that legitimately lives in multiple ontologies, and two known limitations fall out of it (both documented in PR #507, neither affecting the dominant full-DB restore):
- Scoped (per-ontology) export filters on d.ontology, so a multi-ontology document is only carried by the export of that one ontology (api/lib/serialization/exporter.py:export_documents).
- Restore-time durability dedups per garage_key but MERGEs to one node per document_id, so distinct garage_keys collapse, last-writer-wins (api/app/lib/age_client/rehydration.py:rehydrate_document_layer).
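Both limitations can be reproduced with plain dictionaries. The sketch below is a toy model, not the code in exporter.py or rehydration.py: the record shape, the function names, and the sample values are all hypothetical, and the dict overwrite merely stands in for a MERGE on document_id followed by a property SET.

```python
from typing import Dict, List

Record = Dict[str, str]


def dedup_by_garage_key(records: List[Record]) -> List[Record]:
    # Keep the first record seen for each garage_key.
    seen: Dict[str, Record] = {}
    for rec in records:
        seen.setdefault(rec["garage_key"], rec)
    return list(seen.values())


def merge_on_document_id(records: List[Record]) -> Dict[str, Record]:
    # One node per document_id; a later record overwrites ontology and
    # garage_key of an earlier one (last writer wins).
    nodes: Dict[str, Record] = {}
    for rec in records:
        nodes[rec["document_id"]] = rec
    return nodes


def scoped_export(nodes: Dict[str, Record], ontology: str) -> List[Record]:
    # Per-ontology export: filter on the node's single ontology property.
    return [n for n in nodes.values() if n["ontology"] == ontology]


# Made-up sample: one file present in two ontologies.
records = [
    {"document_id": "abc123", "ontology": "ontology_a",
     "garage_key": "sources/ontology_a/abc123.md"},
    {"document_id": "abc123", "ontology": "ontology_b",
     "garage_key": "sources/ontology_b/abc123.md"},
]

deduped = dedup_by_garage_key(records)    # both survive: distinct garage_keys
nodes = merge_on_document_id(deduped)     # collapses to a single node
export_a = scoped_export(nodes, "ontology_a")
export_b = scoped_export(nodes, "ontology_b")

print(len(deduped), len(nodes), len(export_a), len(export_b))
```

In this model the surviving node carries ontology_b, so the scoped export of ontology_a no longer includes the document at all.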
Questions to resolve
- Is a single file legitimately belonging to multiple ontologies an intended capability, or is the cross-ontology tagging a bug in ingestion/annealing?
- If intended: should :DocumentMeta be keyed on (content_hash, ontology) (matching the re-ingest dedup key) rather than document_id alone, so each (file, ontology) gets its own node? That would also fix the catalog's per-ontology document tier and both limitations above.
- If a bug: where does the multi-tagging originate (annealing re-scoping? re-ingest into a new ontology)?
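For the "if intended" option, here is a rough sketch of what keying on (content_hash, ontology) would change, again as a toy dictionary model. The names and sample values are hypothetical, and nothing here reflects a decided schema; it only illustrates the proposal in the question above.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
CompositeKey = Tuple[str, str]


def merge_on_composite_key(records: List[Record]) -> Dict[CompositeKey, Record]:
    # One node per (content_hash, ontology) pair. Re-ingesting the same
    # pair overwrites in place; a different ontology gets its own node.
    nodes: Dict[CompositeKey, Record] = {}
    for rec in records:
        nodes[(rec["content_hash"], rec["ontology"])] = rec
    return nodes


def scoped_export(nodes: Dict[CompositeKey, Record], ontology: str) -> List[Record]:
    # Per-ontology export now finds the document in every ontology it lives in.
    return [n for n in nodes.values() if n["ontology"] == ontology]


# Made-up sample: one file in two ontologies, plus a repeat ingest.
records = [
    {"content_hash": "abc123", "ontology": "ontology_a",
     "garage_key": "sources/ontology_a/abc123.md"},
    {"content_hash": "abc123", "ontology": "ontology_b",
     "garage_key": "sources/ontology_b/abc123.md"},
    {"content_hash": "abc123", "ontology": "ontology_a",
     "garage_key": "sources/ontology_a/abc123.md"},
]

nodes = merge_on_composite_key(records)
export_a = scoped_export(nodes, "ontology_a")
export_b = scoped_export(nodes, "ontology_b")

print(len(nodes), len(export_a), len(export_b))
```

Under this keying each node holds exactly one garage_key, so neither the scoped export nor the restore-time MERGE would need to pick a winner.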
Context
:SCOPED_BY (already tolerates multi-ontology), so display isn't broken today.