
[ontology] Add built-in catalog ontology views#36159

Open
mtabebe wants to merge 11 commits into MaterializeInc:main from
mtabebe:ma/ontology/sql-built-in-v2

Conversation

@mtabebe
Contributor

@mtabebe mtabebe commented Apr 20, 2026

Add four built-in views in mz_internal that describe the structure and relationships of the Materialize system catalog. These views are designed to help LLMs, diagnostic tools, and developers discover the right tables, join paths, and ID types when writing catalog queries.

Views:

  • mz_ontology_entity_types: what catalog objects exist and where
  • mz_ontology_semantic_types: typed ID domains (CatalogItemId, GlobalId, etc.)
  • mz_ontology_properties: column-level metadata with semantic types
  • mz_ontology_link_types: named relationships between entity types

The views are generated at startup from annotations on existing builtin definitions. Each builtin can carry an Ontology struct declaring its entity name, description, and FK relationships, plus per-column semantic type annotations via RelationDesc::with_semantic_type().

@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch 4 times, most recently from 1ecb51e to 2f07bba Compare April 20, 2026 18:20
@mtabebe
Contributor Author

mtabebe commented Apr 20, 2026

I know there are some test failures here, but I think it is worth getting some early feedback on the structure of this change.

@ggevay there are 4 commits here that might be easier to look at individually: (1) defines the interface for the type, (2) adds all the annotations, (3) adds a doc, (4) adds tests

@mtabebe mtabebe marked this pull request as ready for review April 20, 2026 18:22
@mtabebe mtabebe requested review from a team as code owners April 20, 2026 18:22
@mtabebe mtabebe requested a review from ggevay April 20, 2026 18:22
@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch 2 times, most recently from ebd3c9a to 8a82efd Compare April 22, 2026 15:33
@mtabebe mtabebe requested a review from a team as a code owner April 22, 2026 15:33
@mtabebe mtabebe requested a review from aljoscha April 22, 2026 17:08
@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch from 8a82efd to d7191b0 Compare April 22, 2026 17:17
Contributor

@ggevay ggevay left a comment

I wrote some comments, will continue tomorrow.

Comment thread src/catalog/src/builtin/ontology.rs
Comment thread src/catalog/src/builtin/ontology.rs Outdated
Comment thread src/repr/src/relation.rs Outdated
Comment thread src/catalog/src/builtin/ontology.rs Outdated
Comment thread src/catalog/src/builtin/ontology.rs Outdated
Comment thread src/catalog/src/builtin.rs Outdated
@@ -3510,7 +3975,27 @@ FROM
WHERE data->>'kind' = 'Role'",
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: None,
ontology: Some(Ontology {
entity_name: "role_member",
Contributor

Maybe role_membership?

Comment thread src/catalog/src/builtin.rs Outdated
ontology: Some(Ontology {
entity_name: "operator",
description: "A built-in SQL operator",
links: &[],
Contributor

Could we add return_type_id as a link, like on mz_functions?

Comment thread src/catalog/src/builtin.rs Outdated
description: "An array type with its element type",
links: &[
OntologyLink {
name: "is_subtype_of",
Contributor

This is not is_subtype_of. I'm not sure what would be a good term here.

Contributor

And the same issue for list and map.

Contributor Author

I renamed to detail_of

Comment thread src/catalog/src/builtin.rs Outdated
name: "granted_by",
target: "role",
properties_json: r#"{"kind": "foreign_key", "source_column": "grantor", "target_column": "id", "cardinality": "many_to_one"}"#,
},
Contributor

@ggevay ggevay Apr 28, 2026

member_of_role and has_member here might have the same issue as object dependencies, mentioned above: it sounds as if a role membership would have a has_member, when it's actually a role that has a member.

(granted_by might be ok, though.)

Comment thread src/catalog/src/builtin.rs Outdated
entity_name: "role_parameter",
description: "A session parameter default set for a role",
links: &[OntologyLink {
name: "parameter_of",
Contributor

Maybe something like default_parameter_setting_of?

@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch from d7191b0 to ff78461 Compare April 28, 2026 18:39
@antiguru antiguru self-requested a review April 29, 2026 15:22
mtabebe and others added 5 commits April 29, 2026 15:40
Add the structural foundation for built-in ontology views that describe
the Materialize system catalog. This includes:

- `Ontology` and `OntologyLink` structs on builtin definitions
- `semantic_types` field on `RelationDesc` with `with_semantic_type()` builder
- View generation code in `ontology.rs` that produces 4 views:
  `mz_ontology_entity_types`, `mz_ontology_semantic_types`,
  `mz_ontology_properties`, `mz_ontology_link_types`
- OID constants for the new views
- `ontology: None` on all existing builtins (no annotations yet)

The views are generated at startup by enumerating builtins that have
`ontology: Some(...)` annotations. This commit only adds the
infrastructure; annotations are added in the next commit.
Add ontology annotations to builtin catalog objects and semantic
type annotations. This populates the ontology views introduced
in the previous commit with:

- Entity types: databases, schemas, roles, clusters, replicas, tables,
  sources, views, MVs, indexes, sinks, connections, secrets, types,
  functions, and ~100 internal/introspection objects
- Semantic types: CatalogItemId, GlobalId, ClusterId, ReplicaId, etc.
- Link types: relationships (owned_by, in_schema, runs_on_cluster,
  depends_on, details_of, etc.)
- Column-level semantic type annotations via with_semantic_type()
Add documentation for the ontology module covering the four built-in views,
their schema, link type properties and LLM usage guide
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sistent

- Replace SemanticType &'static str with a typed enum; update all call
sites in builtin.rs and healthcheck.rs
- Rename OntologyLink names to noun phrases throughout: dependent_object,
referenced_object, transitively_dependent_object/referenced_object,
detail_of, group_role, member_role, default_parameter_setting_of
- Add missing FK links on source-table detail tables and mz_operators
(returns_type), plus a test for ontology_consistency
- Improve SEMANTIC_TYPE_DEFS descriptions with examples and clearer
wording (MzTimestamp, ObjectType, ConnectionType, SourceType)
@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch 3 times, most recently from 5beda2b to 3a5fc04 Compare April 29, 2026 20:20
…ties

Nineteen annotated builtin relations had empty links columns that
point to other annotated entities
@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch from 3a5fc04 to ab6fccc Compare April 29, 2026 21:24
Member

@antiguru antiguru left a comment

Leaving some comments below. I think this is valuable to have, so mostly comments around the implementation.

When/how do we include more parts in the ontology? I think specifically the compute introspection could benefit from this.

use mz_repr::{RelationDesc, SemanticType, SqlScalarType};
use mz_sql::catalog::NameReference;

use super::{Builtin, BuiltinView, Ontology, PUBLIC_SELECT};
Member

Use crate imports, not super.

Comment on lines +67 to +73
struct Info<'a> {
table_name: &'static str,
schema_name: &'static str,
entity_name: String,
desc: &'a RelationDesc,
ontology: &'a Ontology,
}
Member

Please document struct and fields.

Comment on lines +63 to +65
fn leak(v: BuiltinView) -> &'static BuiltinView {
Box::leak(Box::new(v))
}
Member

This'll show up in tooling to detect leaks. Can we use a non-leaking approach here to avoid alarms in the future?

Contributor Author

OK, I saw other uses of this in builtin, so I thought this was the pattern. I'll fix it.

Member

Yeah, there are. Personal opinion is to not add more :)
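
One possible non-leaking shape, sketched under assumptions: the generated views are stored in a lazily initialized static, and callers borrow `'static` references from it instead of calling `Box::leak`. The `BuiltinView` struct below is a stand-in with made-up fields, and `generate_views`, `ONTOLOGY_VIEWS`, and `ontology_views` are hypothetical names. Whether this shape satisfies the leak-detection tooling mentioned above is not confirmed by the thread.

```rust
use std::sync::LazyLock;

/// Stand-in for the real `BuiltinView`; the fields here are assumptions.
#[derive(Debug)]
struct BuiltinView {
    name: String,
    sql: String,
}

/// Hypothetical generator standing in for the ontology view builders.
fn generate_views() -> Vec<BuiltinView> {
    ["mz_ontology_entity_types", "mz_ontology_link_types"]
        .iter()
        .map(|name| BuiltinView {
            name: name.to_string(),
            sql: format!("SELECT * FROM (VALUES (1)) AS {name}(x)"),
        })
        .collect()
}

/// The views live in a lazily initialized static, so references borrowed
/// from it are `'static` without any call to `Box::leak`.
static ONTOLOGY_VIEWS: LazyLock<Vec<BuiltinView>> = LazyLock::new(generate_views);

/// Returns the generated views; the first call runs `generate_views` once.
fn ontology_views() -> &'static [BuiltinView] {
    ONTOLOGY_VIEWS.as_slice()
}
```

Every call to `ontology_views()` returns a slice over the same allocation, which is owned by the static rather than leaked.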

Comment thread src/catalog/src/builtin/ontology.rs Outdated
let keys = desc.typ().keys.first()?;
let cols: Vec<_> = keys
.iter()
.map(|&i| format!("\"{}\"", desc.get_name(i)))
Member

What about columns that contain double quotes?
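
To make the concern concrete, here is a minimal sketch. It assumes the column names end up as JSON strings inside a single-quoted SQL literal, as the `format!("\"{}\"", ...)` excerpt suggests. Under that assumption a name needs JSON escaping first and SQL quote doubling second. `json_string` and `sql_literal` are hypothetical helpers, not functions from the PR.

```rust
/// Hypothetical helper: renders `s` as a JSON string literal, escaping
/// double quotes, backslashes, and control characters.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Hypothetical helper: wraps `s` in a SQL string literal and doubles any
/// single quotes. Assumes backslashes are not special inside the literal.
fn sql_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}
```

With these, a column named `a"b` would be emitted as `"a\"b"` inside the JSON array instead of terminating the string early.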

Comment thread src/repr/src/relation.rs Outdated
Comment on lines +967 to +971
/// Optional semantic type annotations for columns.
/// Keyed by column index. Only populated for builtin catalog objects.
/// Excluded from Eq/Hash/serialization — it's ontology metadata, not schema.
#[serde(skip)]
semantic_types: BTreeMap<usize, SemanticType>,
Member

Please do not add quirks around ignoring parts of the type for Eq, Hash, etc. From experience, this is a maintenance burden for our future selves.

Contributor Author

Ack I was probably too clever for my own good

Comment thread src/repr/src/relation.rs Outdated
/// Annotates the most recently added column with a semantic type.
///
/// Possible values are enumerated in [`SemanticType`].
pub fn with_semantic_type(mut self, semantic_type: SemanticType) -> RelationDescBuilder {
Member

I don't think this function is great as it is context-sensitive. Could we add this to a separate with_column_semantic_type?

Contributor

I agree. A specific danger is that someone editing unrelated stuff near a call site might not realize that this has to be directly after a column that it refers to, and e.g. accidentally add a new column between a with_semantic_type and its with_column.
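
A toy illustration of the difference, assuming a builder where the column and its semantic type are supplied in one call. `DescBuilder` is a stand-in, not the real `RelationDescBuilder`, and the signature of `with_column_semantic_type` is a guess at what such an API could look like.

```rust
use std::collections::BTreeMap;

/// Small stand-in for the PR's `SemanticType` enum; the two variants are
/// names mentioned in the PR description.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SemanticType {
    CatalogItemId,
    GlobalId,
}

/// Toy builder, not the real `RelationDescBuilder`.
#[derive(Default, Debug)]
struct DescBuilder {
    columns: Vec<String>,
    semantic_types: BTreeMap<usize, SemanticType>,
}

impl DescBuilder {
    /// Adds a column without a semantic type.
    fn with_column(mut self, name: &str) -> Self {
        self.columns.push(name.to_string());
        self
    }

    /// Adds a column and its semantic type in one call, so the annotation
    /// cannot drift away from the column it describes.
    fn with_column_semantic_type(mut self, name: &str, ty: SemanticType) -> Self {
        self.semantic_types.insert(self.columns.len(), ty);
        self.columns.push(name.to_string());
        self
    }

    /// Looks up the semantic type recorded for the named column, if any.
    fn semantic_type_of(&self, name: &str) -> Option<SemanticType> {
        let idx = self.columns.iter().position(|c| c == name)?;
        self.semantic_types.get(&idx).copied()
    }
}
```

Because the annotation travels with the column, inserting an unrelated `with_column` call between two existing calls cannot reattach a semantic type to the wrong column.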

Comment on lines +2138 to +2146
ontology: Some(Ontology {
entity_name: "kafka_sink",
description: "Kafka-specific sink configuration (topic)",
links: &[OntologyLink {
name: "details_of",
target: "sink",
properties_json: r#"{"kind": "foreign_key", "source_column": "id", "target_column": "id", "cardinality": "one_to_one"}"#,
}],
}),
Member

How do we ensure that the ontology doesn't get out-of-sync with the rest of the system?

Contributor Author

I agree this is a challenge. I did add an SLT test and based on this feedback will add more unit tests.

The main invariants that I see are:

  • Column semantic types: enforced at compile time via the with_column_semantic_type API, which requires the columns to exist (but this doesn't enforce anything about the types)
  • Link target entity names: test_ontology_consistency asserts every OntologyLink::target references a known annotated entity
  • source_column values in properties_json: currently only checked implicitly (the FK coverage test reads them). I'm adding an explicit check that each source_column value names an actual column in the entity's RelationDesc, so renames are caught.

I would totally be open to more ideas on how to make this more robust.
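
The second and third invariants can be sketched as a single check. This is only an illustration with simplified stand-in types: `Entity`, `Link`, and `check_consistency` are hypothetical, and the actual `test_ontology_consistency` may be structured differently.

```rust
use std::collections::BTreeSet;

/// Simplified stand-in for `OntologyLink`; field names are assumptions.
struct Link {
    target: &'static str,
    source_column: &'static str,
}

/// Simplified stand-in for an annotated builtin and its column names.
struct Entity {
    name: &'static str,
    columns: &'static [&'static str],
    links: &'static [Link],
}

/// Returns one message per violated invariant: every link target must be a
/// known entity, and every `source_column` must be a column of the entity
/// that declares the link.
fn check_consistency(entities: &[Entity]) -> Vec<String> {
    let known: BTreeSet<&str> = entities.iter().map(|e| e.name).collect();
    let mut errors = Vec::new();
    for e in entities {
        for l in e.links {
            if !known.contains(l.target) {
                errors.push(format!("{}: unknown target {}", e.name, l.target));
            }
            if !e.columns.contains(&l.source_column) {
                errors.push(format!("{}: unknown source_column {}", e.name, l.source_column));
            }
        }
    }
    errors
}
```

A test built on this would assert that the returned list is empty for the full set of annotated builtins, so a renamed column or entity fails loudly.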

Member

This is not the right place for this file. doc/developer/generated shouldn't be touched by users. I think this is either developer documentation (not in generated), or a design doc.

@mtabebe
Contributor Author

mtabebe commented Apr 30, 2026

> Leaving some comments below. I think this is valuable to have, so mostly comments around the implementation.
>
> When/how do we include more parts in the ontology? I think specifically the compute introspection could benefit from this.

Yeah, I intentionally left compute introspection out of this for now, but my feeling is that we can add the same type of annotations to BuiltinLog and update the generate_views function to handle them. This work has sort of spiraled bigger than I was anticipating, so I think extending it to compute introspection should be the next thing (and properly planned).

Contributor

@ggevay ggevay left a comment

Wrote some more comments. Will continue after lunch.

Comment thread src/catalog/src/builtin.rs Outdated
description: "Kafka source table-level details",
links: &[OntologyLink {
name: "describes_source_table",
target: "object",
Contributor

Could this be target: "table",?

Comment thread src/catalog/src/builtin/ontology.rs Outdated
/// JSON object, e.g. `{"primary_key": ["id", "schema_id"]}`. Returns `None`
/// if the relation has no keys defined.
fn pk_json(desc: &RelationDesc) -> Option<String> {
let keys = desc.typ().keys.first()?;
Contributor

Several builtins declare more than one key. If we take only the first key here, then which one this surfaces would depend on the declared key order. Is this intended?

Maybe we could make it do something like

{"primary_key": [...], "alternate_keys": [[...], ...]}
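
A rough sketch of that output shape, assuming the declared keys are available as lists of column names in declaration order. `keys_json` is a hypothetical variant of `pk_json`. For brevity it does not escape column names, so it assumes they contain no double quotes or backslashes.

```rust
/// Hypothetical variant of `pk_json`: `keys` holds every declared key as a
/// list of column names. The first key is reported as `primary_key` and the
/// rest as `alternate_keys`. Returns `None` when no key is declared.
fn keys_json(keys: &[Vec<&str>]) -> Option<String> {
    /// Renders one key as a JSON array of strings (no escaping; see above).
    fn render(key: &[&str]) -> String {
        let cols: Vec<String> = key.iter().map(|c| format!("\"{c}\"")).collect();
        format!("[{}]", cols.join(", "))
    }

    let (primary, alternates) = keys.split_first()?;
    let alternates: Vec<String> = alternates.iter().map(|k| render(k)).collect();
    Some(format!(
        "{{\"primary_key\": {}, \"alternate_keys\": [{}]}}",
        render(primary),
        alternates.join(", ")
    ))
}
```

Which key is labelled `primary_key` still depends on declaration order here, but no declared key is dropped from the output.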

Contributor

Also, a nit: If it's just one key, could you name the let keys just let key? (I know it's a Vec, so the s at the end feels natural, but it's still just one key at this point, so key is more accurate. We had tons of these key variable naming issues also in the optimizer, and at some point we made a conscious effort to fix all of them.)

Contributor Author

Good point about the alternate keys, I'll implement that

Comment thread src/catalog/src/builtin.rs Outdated
@@ -2323,18 +2469,37 @@ pub static MZ_COMPUTE_DEPENDENCIES: LazyLock<BuiltinSource> = LazyLock::new(|| B
]),
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "compute_dependency",
description: "Dependency edges within compute dataflows",
Contributor

Is it intended that edges is plural here? I think many (maybe most) other descriptions have analogous things in the singular.

Comment thread src/catalog/src/builtin.rs Outdated
Contributor

Typo: .id missing at the end.

Comment thread src/catalog/src/builtin.rs Outdated
OntologyLink {
name: "dependent_compute_object",
target: "object",
properties_json: r#"{"source_id_type": "GlobalId", "requires_mapping": "mz_internal.mz_object_global_ids", "kind": "foreign_key", "source_column": "object_id", "target_column": "id", "cardinality": "many_to_one"}"#,
Contributor

I'm wondering if it would maybe be better to lift this into a typed struct, something like

LinkProperties {
  kind: LinkKind,
  source_column: &'static str,
  target_column: &'static str,
  cardinality: Cardinality,
  source_id_type: Option<SemanticType>,
  requires_mapping: Option<&'static str>,
  nullable: bool
}

I think it would make them more readable (there could be a helper function for common cases), and would also increase the chances of people getting them right when writing new ones. Although, it might also be over-engineering, I'm not sure.
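
A minimal sketch of how such a struct and a helper for the common case might look, written only to make the suggestion concrete. The enum variants, the `fk` constructor, and `to_json` are assumptions, and the rendered key order simply mirrors the hand-written JSON quoted in this thread.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LinkKind {
    ForeignKey,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Cardinality {
    ManyToOne,
    OneToOne,
}

/// Stand-in for the PR's `SemanticType`; only two variants are shown.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SemanticType {
    CatalogItemId,
    GlobalId,
}

struct LinkProperties {
    kind: LinkKind,
    source_column: &'static str,
    target_column: &'static str,
    cardinality: Cardinality,
    source_id_type: Option<SemanticType>,
    requires_mapping: Option<&'static str>,
    nullable: bool,
}

impl LinkProperties {
    /// Helper for the common case: a plain foreign key with no ID mapping.
    const fn fk(source_column: &'static str, target_column: &'static str, cardinality: Cardinality) -> Self {
        LinkProperties {
            kind: LinkKind::ForeignKey,
            source_column,
            target_column,
            cardinality,
            source_id_type: None,
            requires_mapping: None,
            nullable: false,
        }
    }

    /// Renders the JSON for the `properties` column; optional fields are
    /// emitted only when set.
    fn to_json(&self) -> String {
        let kind = match self.kind {
            LinkKind::ForeignKey => "foreign_key",
        };
        let cardinality = match self.cardinality {
            Cardinality::ManyToOne => "many_to_one",
            Cardinality::OneToOne => "one_to_one",
        };
        let mut fields = vec![
            format!("\"kind\": \"{kind}\""),
            format!("\"source_column\": \"{}\"", self.source_column),
            format!("\"target_column\": \"{}\"", self.target_column),
            format!("\"cardinality\": \"{cardinality}\""),
        ];
        if let Some(ty) = self.source_id_type {
            fields.push(format!("\"source_id_type\": \"{ty:?}\""));
        }
        if let Some(mapping) = self.requires_mapping {
            fields.push(format!("\"requires_mapping\": \"{mapping}\""));
        }
        if self.nullable {
            fields.push("\"nullable\": true".to_string());
        }
        format!("{{{}}}", fields.join(", "))
    }
}
```

With this shape, a misspelled cardinality or a missing `target_column` becomes a compile error rather than a malformed string.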

Contributor Author

That is a good idea

Contributor

@ggevay ggevay left a comment

Posting some more comments, but also hitting approve, because these are just minor things. Looks great overall!

@@ -2690,30 +2987,48 @@ pub static MZ_SSH_TUNNEL_CONNECTIONS: LazyLock<BuiltinTable> = LazyLock::new(||
]),
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "ssh_tunnel",
description: "SSH tunnel connection with public keys",
Contributor

(comment from Claude)

entity_name: "ssh_tunnel",
description: "SSH tunnel connection with public keys",

Compare its peers:

  • mz_kafka_connections → entity_name: "kafka_connection"
  • mz_aws_privatelink_connections → entity_name: "aws_privatelink" (also drops the _connection!)
  • mz_aws_connections → entity_name: "aws_connection"

So three different conventions for the four connection-detail tables: <x>_connection, <x> (dropping the suffix), and inconsistent ones in between. Pick one — almost certainly <x>_connection to match the <x>_source and <x>_source_table families. Specifically:

  • aws_privatelink should be aws_privatelink_connection
  • ssh_tunnel should be ssh_tunnel_connection

Comment thread src/catalog/src/builtin.rs Outdated
/// Target entity name (e.g., "role", "schema").
pub target: &'static str,
/// JSON for the `properties` JSONB column (kind, source_column, target_column, etc.).
pub properties_json: &'static str,
Contributor

I think I raised this somewhere else too, but it would be nice to have more structure around properties_json, or at least more documentation. E.g., requires_mapping doesn't seem like a trivial thing, but its meaning doesn't seem to be documented anywhere.

Comment thread src/catalog/src/builtin.rs Outdated
let key = "\"source_column\": \"";
let start = json.find(key)? + key.len();
let end = json[start..].find('"')? + start;
Some(&json[start..end])
Contributor

Could we just run it through a json parser instead of string matching? As it is, I worry that someone in the future will

  • write the json with slightly different whitespace, and then the test gives a false positive
  • write invalid json, but the test doesn't catch it if it's in some other part of the json.
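
Both failure modes can be reproduced with the excerpt's own logic. The sketch below wraps the four quoted lines in a function; the name `source_column` and its signature are assumptions.

```rust
/// The string match from the excerpt above, wrapped in a function.
/// It looks for the exact byte sequence `"source_column": "` and returns
/// the text up to the next double quote.
fn source_column(json: &str) -> Option<&str> {
    let key = "\"source_column\": \"";
    let start = json.find(key)? + key.len();
    let end = json[start..].find('"')? + start;
    Some(&json[start..end])
}
```

Called on `{"source_column": "grantor"}` this returns `Some("grantor")`, but the equally valid `{"source_column":"grantor"}` (no space after the colon) returns `None`, and the invalid `{"source_column": "grantor", "kind": }` still returns `Some("grantor")`.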

Contributor Author

Eliminated as part of the strong typing

Some(format!("{{\"primary_key\": [{}]}}", cols.join(", ")))
}

// ── View builders ────────────────────────────────────────────
Contributor

(comment from Claude)

The 4 view builders are 95% the same code

entity_types_view, properties_view, semantic_types_view, link_types_view all do:

  1. infos.iter().map(|i| format!("(...)", esc(...), ...)).collect()
  2. format!("SELECT * FROM (VALUES {}) AS t(col1, col2, ...)", vals.join(","))
  3. wrap in view(name, oid, &cols, &keys, sql)

This is begging for a single helper:

fn values_view(name, oid, cols, keys, rows: impl Iterator<Item = Vec<SqlLiteral>>) -> BuiltinView

…with a SqlLiteral enum for typed values (Str, Bool, Json, Null). That removes every esc(...) call site and centralizes the "this is going inside '...'" decision in one place — which would also have prevented the pk_json "-escaping bug in one stroke.
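
A sketch of what that could look like, reduced to the SQL text generation. It assumes the four variants named above; `values_sql` is a hypothetical stand-in for the part of `values_view` that builds the query string, and the `::jsonb` cast on JSON values is an assumption about how the views type that column.

```rust
/// Typed SQL literal, using the variants suggested above.
enum SqlLiteral {
    Str(String),
    Bool(bool),
    Json(String),
    Null,
}

impl SqlLiteral {
    /// Renders the literal as SQL text. Single-quote escaping happens only
    /// here, so call sites never decide how to quote a value.
    fn render(&self) -> String {
        match self {
            SqlLiteral::Str(s) => format!("'{}'", s.replace('\'', "''")),
            SqlLiteral::Bool(b) => b.to_string(),
            SqlLiteral::Json(j) => format!("'{}'::jsonb", j.replace('\'', "''")),
            SqlLiteral::Null => "NULL".to_string(),
        }
    }
}

/// Hypothetical helper: builds the `SELECT * FROM (VALUES ...)` text that
/// each of the four view builders currently assembles by hand.
fn values_sql(cols: &[&str], rows: impl Iterator<Item = Vec<SqlLiteral>>) -> String {
    let vals: Vec<String> = rows
        .map(|row| {
            let cells: Vec<String> = row.iter().map(SqlLiteral::render).collect();
            format!("({})", cells.join(", "))
        })
        .collect();
    format!(
        "SELECT * FROM (VALUES {}) AS t({})",
        vals.join(", "),
        cols.join(", ")
    )
}
```

Each view builder would then only map its inputs to rows of `SqlLiteral` values and pass the column list.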

#[derive(Clone, Hash, Debug, PartialEq, Eq)]
pub struct OntologyLink {
/// Relationship name (e.g., "owned_by", "in_schema").
pub name: &'static str,
Contributor

OntologyLink somehow seems to be a non-trivial concept, judging by how many times I had to correct my AI agent on this while reviewing the PR, and also how the PR's original version had some of these wrong (e.g., the dependency ones). I'm wondering if we could add more doc commenting here to make it clearer / more explicit what these mean.

Contributor

It's surprisingly tricky. After some back-and-forth with my AI agent, we arrived at two possibilities:

A. Allow active verbs. This one encompasses all the existing examples in the PR's current state, except for session_on_cluster.

/// A foreign-key relationship from this catalog object to another ontology
/// entity.
///
/// **Contract.** For each row, the value in `properties_json.source_column`
/// references a row of `target`'s primary table via
/// `properties_json.target_column`. `name` is a label for this relationship
/// and must be unique within an `Ontology`.
///
/// **Direction.** A link always points *from* this row's `source_column`
/// *to* the `target` entity's `target_column`. `name` is just a label for
/// that one outgoing edge — it never reverses direction, regardless of how
/// it reads in English. When in doubt, the columns define the direction;
/// the name is descriptive only.
///
/// **Naming.** Several name shapes are in use, each with its own natural
/// reading:
///
/// - **Noun role** (preferred when natural):
///   `dependent_object` on `object_dependency`, `element_type` on
///   `array_type`, `default_parameter_setting_of` on `role_parameter`.
///   Read as: *"the `<target>` that is the `<name>` of this row"* —
///   e.g. "the object that is the *dependent_object* of this dependency edge."
///
/// - **Passive verb / prepositional**:
///   `owned_by` on `database`, `in_schema` on `object`, `details_of` on
///   `kafka_source`, `granted_by` on `role_membership`.
///   Read as: *"the `<target>` this row is `<name>`"* —
///   e.g. "the role this database is *owned_by*", "the schema this object
///   is *in*".
///
/// - **Active verb** (use sparingly, see caveat below):
///   `depends_on`, `has_element_type`, `references_source`, `runs_on_cluster`.
///   Read as: *"this row `<name>` the `<target>`"* — with the **row** as
///   the verb's subject — e.g. "this array_type *has_element_type* a type",
///   "this index *runs_on_cluster* a cluster."
///
/// **Caveat about active verbs.** Active verbs admit more than one English
/// reading (the row, the source-column's referent, or the target can each
/// be read as the subject), and historically every direction bug in this
/// module's review history has been on an active-verb name. The contract
/// above pins direction regardless, but if a natural noun phrase or passive
/// verb exists, prefer it. In particular, **do not** use an active verb on
/// an *edge entity* (a row that itself represents a relationship — e.g.
/// `mz_object_dependencies`, `mz_role_members`); the row is not an actor,
/// so a verb-with-row-as-subject is a category error. Use noun-role
/// endpoint names there (`dependent_object` / `referenced_object`,
/// `member_role` / `group_role`).

B. Disallow active verbs. This would require slight changes in many of the current OntologyLinks.

/// A foreign-key relationship from this catalog object to another ontology
/// entity.
///
/// **Contract.** For each row, the value in `properties_json.source_column`
/// references a row of `target`'s primary table via
/// `properties_json.target_column`. `name` is a label for this relationship
/// and must be unique within an `Ontology`.
///
/// **Direction.** A link always points *from* this row's `source_column`
/// *to* `target`'s `target_column`. The columns define direction; `name`
/// is descriptive only and never reverses it.
///
/// **Naming convention.** `name` denotes the role the `<target>` plays
/// relative to this row. Pick `name` so the link reads as a noun phrase
/// under:
///
/// > *"the `<target>` that is the `<name>` of this row."*
///
/// Three name shapes fit this frame and are the only ones permitted:
///
/// - **Noun role** (preferred): `dependent_object` on `object_dependency`
///   → "the object that is the *dependent_object* of this dependency edge."
/// - **Passive verb**: `owned_by` on `database`
///   → "the role this database is *owned by*."
/// - **Prepositional**: `in_schema` on `object`
///   → "the schema this object is *in*."
///
/// **Active verbs are disallowed** (`depends_on`, `has_member`, `references`,
/// `uses_X`, `has_X`, `returns_X`, `describes_X`, `runs_on_X`, …). An active
/// verb has a subject, and the subject can be read as the row, the
/// source-column's referent, or the target — an interpretive axis on top of
/// direction that has produced every direction bug in this module's review
/// history. Rewrite as a noun role: `has_element_type` → `element_type`,
/// `returns_type` → `return_type`, `references_source` → `referenced_source`,
/// `uses_connection` → `connection` (or `used_connection`),
/// `runs_on_cluster` → `host_cluster`, `depends_on` →
/// `dependent_object` / `referenced_object`, `describes_source_table` →
/// `details_of`.
///
/// **Edge entities.** For catalog objects whose rows represent a relationship
/// between two other things (e.g. `mz_object_dependencies`,
/// `mz_role_members`, `mz_compute_dependencies`), the rule is strictest: the
/// row is not an actor, so a verb-with-row-as-subject (the row "depends on"
/// something, "has" a member) is a category error regardless of direction.
/// Use noun-role endpoint names — `dependent_object` / `referenced_object`,
/// `member_role` / `group_role` — naming each link for the role its
/// endpoint plays in the edge.

I'm not sure which one is better.

Contributor Author

I will update the docs, but I don't really want to be restrictive on the verbs that we use.

Contributor Author

My intuition is that some of this will be cleaned up by having strongly typed link properties, too

Comment thread src/repr/src/relation.rs Outdated
/// Keyed by column index. Only populated for builtin catalog objects.
/// Excluded from Eq/Hash/serialization — it's ontology metadata, not schema.
#[serde(skip)]
semantic_types: BTreeMap<usize, SemanticType>,
Contributor

Should this also use ColumnIndex, like the existing metadata?

Contributor

@ggevay ggevay left a comment

Some more comments

Comment thread src/catalog/src/builtin.rs Outdated
description: "Recent query activity with execution stats",
links: &[
OntologyLink {
name: "session_on_cluster",
Contributor

ran_on_cluster?

Contributor

Also, could we add an OntologyLink to "session"?

Comment thread src/catalog/src/builtin.rs Outdated
@@ -4285,6 +5068,15 @@ pub static MZ_SESSION_HISTORY: LazyLock<BuiltinSource> = LazyLock::new(|| Builti
]),
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "session_history",
Contributor

"session_history" feels a bit awkward here. How about changing it to simply "session", and changing the current "session" to "active_session"? And then anywhere where we have a link to either of these, we should actually have links to both of these (one nullable).

Contributor

Ah sorry, I think I misunderstood what "nullable": true means. For a moment I thought it meant that if you do an outer join, then some rows will come back null. But actually, it probably means that the column is nullable on our side, right?

Contributor

@ggevay ggevay Apr 30, 2026

So, "foreign key" traditionally means that it's ok to do an inner join, you won't lose stuff. But sometimes it can be also interesting to point out a link where you might lose stuff with an inner join, so you need an outer join. Do we have/want a way to express those links as well?

Edit: Or am I misunderstanding this, and "nullable": true can mean that a non-null in our column might not find a match?
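
To make the two readings concrete, here is a minimal Rust sketch of the cases an inner join can silently drop. The `LinkMatch` enum and both helpers are hypothetical names for illustration, not code from this PR: a source row is lost either because our column is NULL, or because a non-null ID has no row on the target side.

```rust
use std::collections::HashSet;

/// Hypothetical classification of what happens to one source row when
/// following a link; none of these names come from the PR.
#[derive(Debug, PartialEq)]
enum LinkMatch {
    /// The source column is non-null and the target row exists.
    Matched,
    /// The source column itself is NULL ("nullable on our side").
    NullSource,
    /// The source column is non-null, but no target row has that ID
    /// (for example, because the target object was dropped).
    Dangling,
}

fn classify(source_col: Option<&str>, target_ids: &HashSet<&str>) -> LinkMatch {
    match source_col {
        None => LinkMatch::NullSource,
        Some(id) if target_ids.contains(id) => LinkMatch::Matched,
        Some(_) => LinkMatch::Dangling,
    }
}

/// Number of source rows an inner join would keep: only `Matched` rows survive.
fn inner_join_kept(source: &[Option<&str>], target_ids: &HashSet<&str>) -> usize {
    source
        .iter()
        .filter(|col| classify(**col, target_ids) == LinkMatch::Matched)
        .count()
}
```

A single "nullable" flag can only describe the `NullSource` case; the `Dangling` case would need its own annotation.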

Contributor

Btw. this also ties back to the problem mentioned elsewhere that properties_json is under-documented.

Contributor

If it's just that the column is nullable, is there an automated test that checks this? Or even better, why not derive it automatically? We could do that if properties_json would be a structured thing with smart constructors.
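
A rough sketch of what "structured with smart constructors" could look like. All names here (`ColumnDesc`, `ForeignKey::new`) are hypothetical and much simpler than the real RelationDesc: the constructor looks up the source column and copies its nullability, so the flag cannot drift from the schema.

```rust
/// Hypothetical, simplified stand-in for a column in a relation description.
struct ColumnDesc {
    name: &'static str,
    nullable: bool,
}

/// Hypothetical structured replacement for the foreign-key JSON blob.
#[derive(Debug, PartialEq)]
struct ForeignKey {
    source_column: &'static str,
    target_column: &'static str,
    /// Derived from the source column, never written by hand.
    nullable: bool,
}

impl ForeignKey {
    /// Smart constructor: fails if `source_column` is not in `columns`,
    /// otherwise derives `nullable` from the column's own definition.
    fn new(
        columns: &[ColumnDesc],
        source_column: &'static str,
        target_column: &'static str,
    ) -> Result<ForeignKey, String> {
        let col = columns
            .iter()
            .find(|c| c.name == source_column)
            .ok_or_else(|| format!("unknown source column: {}", source_column))?;
        Ok(ForeignKey {
            source_column,
            target_column,
            nullable: col.nullable,
        })
    }
}
```

As a side effect, a typo in the column name becomes an error at construction time instead of a silently wrong annotation.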

Contributor

Ah, on sink_status_history, it seems even non-null values on our side might not find a match on the other side!

Contributor Author

> But actually, it probably means that the column is nullable on our side, right?

Yes, that is what it means. I agree it would be better to derive it automatically; that feels like something I could do in the future.

I also agree with what you are saying about adding more annotations around what the joins can see. I am kind of hesitant to add more though... so maybe we can defer?

Contributor

@ggevay ggevay left a comment

Sorry, I can't stop, lol

Comment thread src/catalog/src/builtin.rs Outdated
@@ -4673,6 +5492,11 @@ pub static MZ_STATEMENT_LIFECYCLE_HISTORY: LazyLock<BuiltinSource> = LazyLock::n
MONITOR_REDACTED_SELECT,
MONITOR_SELECT,
],
ontology: Some(Ontology {
entity_name: "statement_lifecycle",
Contributor

statement_lifecycle_event

Comment thread src/catalog/src/builtin.rs Outdated
ontology: Some(Ontology {
entity_name: "statement_lifecycle",
description: "Statement lifecycle events (parse, bind, execute)",
links: &[],
Contributor

Missing link to mz_recent_activity_log.execution_id.

Comment thread src/catalog/src/builtin.rs Outdated
@@ -4891,6 +5727,22 @@ pub static MZ_SINK_STATUS_HISTORY: LazyLock<BuiltinSource> = LazyLock::new(|| Bu
]),
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "sink_status_history",
Contributor

"history" sounds like multiple events, but entity_name is supposed to describe one row of this relation, right? (Maybe this could be explicitly added to its doc comment.) So, maybe rename to sink_status_event?

And the corresponding OntologyLink status_history_of_sink has the same issue.

Contributor

And the same for all the history relations:

  • sink_status_history
  • source_status_history
  • replica_status_history
  • wallclock_lag_history

Contributor Author

I have updated these... and added a note to the doc to clarify that entity_name names a single row.

Contributor

@ggevay ggevay left a comment

meta-comment: I found the following issues by asking Claude to correlate the Ontology::description field values with what we have in our existing docs e.g. in mz_internal.md, mz_catalog.md. We should look into unifying these, e.g. sourcing the descriptions in our docs from Ontology::description. This could be also a follow-up PR. (But the below issues need to be fixed here.)

Comment thread src/catalog/src/builtin.rs Outdated
links: &[OntologyLink {
name: "status_of_replica",
target: "replica",
properties_json: r#"{"source_id_type": "CatalogItemId", "kind": "foreign_key", "source_column": "replica_id", "target_column": "id", "cardinality": "one_to_one"}"#,
Contributor

It's not one-to-one, because a replica can have multiple processes, and we have a row here for each process. The other fields are also wrong.
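
As an illustration of why per-process rows rule out one_to_one, here is a small sketch that infers the cardinality of a link from the values in the referencing column. The helper is hypothetical, and "many_to_one" is an assumed label (only "one_to_one" appears in the JSON above): as soon as two rows reference the same replica, the link cannot be one-to-one.

```rust
use std::collections::HashSet;

/// Infers a cardinality label for a link from the referencing column's
/// values. Hypothetical helper; "many_to_one" is an assumed label.
fn infer_cardinality(source_ids: &[&str]) -> &'static str {
    let mut seen = HashSet::new();
    for id in source_ids {
        // `insert` returns false when the value was already present,
        // i.e. two source rows point at the same target row.
        if !seen.insert(*id) {
            return "many_to_one";
        }
    }
    "one_to_one"
}
```

A check along these lines could run in a test against real rows to catch a wrong cardinality annotation, although it can only disprove one_to_one, never prove it.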

Comment thread src/catalog/src/builtin.rs Outdated
@@ -6195,6 +7403,15 @@ pub static MZ_OBJECT_LIFETIMES: LazyLock<BuiltinView> = LazyLock::new(|| Builtin
FROM mz_catalog.mz_audit_events a
WHERE a.event_type = 'create' OR a.event_type = 'drop'",
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "object_lifetime",
description: "Computed lifetime span (created_at to dropped_at) for objects",
Contributor

There are no created_at and dropped_at columns. Maybe object_lifetime is switched/confused with object_history?

Comment thread src/catalog/src/builtin.rs Outdated
@@ -5280,6 +6198,15 @@ pub static MZ_FRONTIERS: LazyLock<BuiltinSource> = LazyLock::new(|| BuiltinSourc
]),
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "frontier",
description: "Current read/write frontiers per object (source)",
Contributor

Why mention "source"? The old docs say:

> the frontiers of each source, sink, table, materialized view, index, and subscription

Comment thread src/catalog/src/builtin.rs Outdated
.with_column("redacted_sql", SqlScalarType::String.nullable(false))
.with_key(vec![0, 1, 2])
.finish(),
column_comments: BTreeMap::new(),
sql: "SELECT DISTINCT sql_hash, sql, redacted_sql FROM mz_internal.mz_sql_text WHERE prepared_day + INTERVAL '4 days' >= mz_now()",
access: vec![MONITOR_SELECT],
ontology: Some(Ontology {
entity_name: "recent_sql_text",
description: "Recent SQL text (indexed, last 3 days)",
Contributor

Claude:

"last 3 days" understates the actual retention

Verified in builtin.rs 4990–5013:

  • Ontology: "Recent SQL text (indexed, last 3 days)."
  • SQL: WHERE prepared_day + INTERVAL '4 days' >= mz_now()
  • Inline comment immediately above: "This should always be 1 day more than the interval in MZ_RECENT_THINNED_ACTIVITY_LOG, because prepared_day is rounded down to the nearest day. Thus something that actually happened three days ago could have a prepared_day anywhere from 3 to 4 days back."

So the actual retention is 3–4 days, with 4 days being the filter constant. Suggested: "(indexed, last ~3–4 days)", or just "recent". Mild but real — the description gives a tighter bound than the implementation guarantees.
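
The 3 to 4 day range follows directly from the rounding. Here is a small worked sketch in whole hours; the helper names are hypothetical, while the 4-day constant and the rounding down to the day are taken from the SQL and the inline comment quoted above.

```rust
const HOURS_PER_DAY: u64 = 24;
/// Mirrors `INTERVAL '4 days'` in the filter.
const FILTER_DAYS: u64 = 4;

/// `prepared_day` is the preparation time rounded down to the start of its day.
/// All timestamps are hours since an arbitrary epoch.
fn prepared_day(prepared_at_hours: u64) -> u64 {
    (prepared_at_hours / HOURS_PER_DAY) * HOURS_PER_DAY
}

/// Mirrors `prepared_day + INTERVAL '4 days' >= mz_now()`.
fn is_retained(prepared_at_hours: u64, now_hours: u64) -> bool {
    prepared_day(prepared_at_hours) + FILTER_DAYS * HOURS_PER_DAY >= now_hours
}
```

In this model a statement prepared at hour 0 of a day is kept until it is exactly 4 days (96 hours) old, while one prepared at hour 23 of the same day is dropped at the same instant, when it is only 74 hours old. Anything younger than 3 days is always kept, and nothing older than 4 days is.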

Comment thread src/catalog/src/builtin.rs Outdated
@@ -2323,18 +2469,37 @@ pub static MZ_COMPUTE_DEPENDENCIES: LazyLock<BuiltinSource> = LazyLock::new(|| B
]),
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "compute_dependency",
description: "Dependency edges within compute dataflows",
Contributor

Claude:

"within compute dataflows" mischaracterizes the rows:

one row = compute_object → input source edge, and mz_internal.md says the relation "describes the dependency structure between each compute object (index, materialized view, or subscription) and the sources of its data." "Within compute dataflows" reads as if the rows describe operator-to-operator edges inside one dataflow; they don't. Suggested: "A dependency edge from a compute object (index, materialized view, or subscription) to one of the sources of its data." (Combines naturally with the plural-vs-singular fix.)

Comment thread src/catalog/src/builtin.rs Outdated
@@ -3733,6 +4395,22 @@ WHERE
mz_internal.parse_catalog_create_sql(data->'value'->'definition'->'V1'->>'create_sql')->>'type' = 'secret'",
is_retained_metrics_object: false,
access: vec![PUBLIC_SELECT],
ontology: Some(Ontology {
entity_name: "secret",
description: "An encrypted secret value used by connections",
Contributor

The secret entity description here says "An encrypted secret value used by connections", but secrets aren't only used by connections — webhook sources also reference secrets directly via CHECK ... WITH (SECRET ...) (see WebhookValidationSecret in src/sql/src/plan.rs). Suggest broadening to something like:

"A user-defined secret containing sensitive configuration (e.g., credentials)."

or, if you want to enumerate consumers:

"A user-defined secret containing sensitive configuration (e.g., credentials), referenced by connections and webhook sources."

(This is also wrong in the old docs.)

mtabebe added 3 commits April 30, 2026 11:34
… enum

Replace properties_json (raw JSON blob) with a typed LinkProperties
enum that has 5 variants (ForeignKey, Union, MapsTo, DependsOn, Measures),
each with documented fields and serde::Serialize
so the JSONB output is identical to the old hand-written strings.
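
For readers skimming the thread, a minimal sketch of the shape this commit describes. Only the variant names come from the commit message. The ForeignKey fields are assumptions based on the JSON quoted earlier in the review, the other variants are left without fields because the thread does not show them, and the `kind` strings other than "foreign_key" are guesses. The commit uses serde::Serialize; the rendering below is hand-rolled only to keep the example dependency-free.

```rust
/// Sketch of a typed replacement for `properties_json`.
#[derive(Debug, Clone, PartialEq)]
enum LinkProperties {
    ForeignKey {
        source_id_type: &'static str,
        source_column: &'static str,
        target_column: &'static str,
        cardinality: &'static str,
    },
    // Fields of the remaining variants are not shown in this thread,
    // so they are left empty here.
    Union,
    MapsTo,
    DependsOn,
    Measures,
}

impl LinkProperties {
    /// The value of the `kind` discriminator in the rendered JSON.
    /// Only "foreign_key" is confirmed by the thread; the rest are assumed.
    fn kind(&self) -> &'static str {
        match self {
            LinkProperties::ForeignKey { .. } => "foreign_key",
            LinkProperties::Union => "union",
            LinkProperties::MapsTo => "maps_to",
            LinkProperties::DependsOn => "depends_on",
            LinkProperties::Measures => "measures",
        }
    }

    /// Hand-rolled JSON rendering (std only). Assumes field values contain
    /// no characters that need JSON escaping.
    fn to_json(&self) -> String {
        match self {
            LinkProperties::ForeignKey {
                source_id_type,
                source_column,
                target_column,
                cardinality,
            } => format!(
                r#"{{"source_id_type": "{}", "kind": "{}", "source_column": "{}", "target_column": "{}", "cardinality": "{}"}}"#,
                source_id_type,
                self.kind(),
                source_column,
                target_column,
                cardinality
            ),
            other => format!(r#"{{"kind": "{}"}}"#, other.kind()),
        }
    }
}
```

The point of the typed form is that a missing or misspelled field becomes a compile error rather than a malformed string discovered at query time.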
Introduce Lit enum (Str/Json/Null), values_sql(), and values_view()
helpers in ontology.rs so all SQL literal escaping is centralized in
Lit::render(). The three static view builders (entity_types, semantic_types,
link_types) and the two inline VALUES lists in properties_view now use
Lit instead of direct esc() calls at each site.
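
A minimal sketch of the centralized-escaping idea, assuming simplified signatures. The variant names and the `render` / `values_sql` names come from the commit message; the exact rendering rules (doubling single quotes, a `::jsonb` cast) and the parameter types are assumptions, not the PR's code.

```rust
/// Sketch of a SQL literal type with the three variants named in the
/// commit message.
enum Lit<'a> {
    Str(&'a str),
    Json(&'a str),
    Null,
}

impl Lit<'_> {
    /// Renders the literal as SQL text. This is the only place where
    /// single quotes get escaped (by doubling them).
    fn render(&self) -> String {
        match self {
            Lit::Str(s) => format!("'{}'", s.replace('\'', "''")),
            Lit::Json(j) => format!("'{}'::jsonb", j.replace('\'', "''")),
            Lit::Null => "NULL".to_string(),
        }
    }
}

/// Builds a `VALUES` list from rows of literals (hypothetical signature).
fn values_sql(rows: &[Vec<Lit>]) -> String {
    let rendered: Vec<String> = rows
        .iter()
        .map(|row| {
            let cells: Vec<String> = row.iter().map(|lit| lit.render()).collect();
            format!("({})", cells.join(", "))
        })
        .collect();
    format!("VALUES {}", rendered.join(", "))
}
```

With this shape, a view builder never concatenates raw strings into SQL itself, so an unescaped quote in a description can only come from a bug in `render`.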
@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch 5 times, most recently from 4567b72 to 2700db4 Compare April 30, 2026 21:31
@mtabebe mtabebe force-pushed the ma/ontology/sql-built-in-v2 branch from 2700db4 to 057397b Compare April 30, 2026 23:38