datajoint · dimitri-yatsenko · Dec 25, 2025
diff --git a/docs/src/design/semantic-matching-spec.md b/docs/src/design/semantic-matching-spec.md
@@ -164,6 +164,227 @@ A.join(B, semantic_check=False)  # Explicit bypass
 
 The error message directs users to the explicit `.join()` method.
 
+## Primary Key Rules in Relational Operators
+
+In DataJoint, each query operator produces a valid **entity set** with a well-defined **entity type** and **primary key**. This section specifies how the primary key is determined for each relational operator.
+
+### General Principle
+
+The primary key of a query result identifies unique entities in that result. For most operators, the primary key is preserved from the left operand. For joins, the primary key depends on the functional dependencies between the operands.
+
+### Notation
+
+In the examples below, `*` marks primary key attributes:
+- `A(x*, y*, z)` means A has primary key `{x, y}` and secondary attribute `z`
+- `A → B` means "A determines B" (defined below)
+
+### Rules by Operator
+
+| Operator | Primary Key Rule |
+|----------|------------------|
+| `A & B` (restriction) | PK(A) — preserved from left operand |
+| `A - B` (anti-restriction) | PK(A) — preserved from left operand |
+| `A.proj(...)` (projection) | PK(A) — preserved from left operand |
+| `A.aggr(B, ...)` (aggregation) | PK(A) — preserved from left operand |
+| `A * B` (join) | Depends on functional dependencies (see below) |
+
+### Join Primary Key Rule
+
+The join operator requires special handling because it combines two entity sets. The primary key of `A * B` depends on the **functional dependency relationship** between the operands.
+
+#### Definitions
+
+**A determines B** (written `A → B`): Every attribute in PK(B) is either already in PK(A) or is a secondary attribute in A.
+
+```
+A → B  iff  ∀b ∈ PK(B): b ∈ PK(A) OR b ∈ secondary(A)
+```
+
+Intuitively, `A → B` means that knowing A's primary key is sufficient to determine B's primary key through functional dependencies.
+
+**B determines A** (written `B → A`): Every attribute in PK(A) is either already in PK(B) or is a secondary attribute in B.
+
+```
+B → A  iff  ∀a ∈ PK(A): a ∈ PK(B) OR a ∈ secondary(B)
+```
+
+#### Join Primary Key Algorithm
+
+For `A * B`:
+
+| Condition | PK(A * B) | Attribute Order |
+|-----------|-----------|-----------------|
+| A → B | PK(A) | A's attributes first |
+| B → A (and not A → B) | PK(B) | B's attributes first |
+| Neither | PK(A) ∪ PK(B) | PK(A) first, then PK(B) − PK(A) |
+
+When both `A → B` and `B → A` hold, the left operand takes precedence (use PK(A)).
+
+#### Examples
+
+**Example 1: B → A**
+```
+A: x*, y*
+B: x*, z*, y    (y is secondary in B, so {x, z} → y)
+```
+- A → B? PK(B) = {x, z}. Is z in PK(A) or secondary in A? No (z not in A). **No.**
+- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? Yes (secondary). **Yes.**
+- Result: **PK(A * B) = {x, z}** with B's attributes first.
+
+**Example 2: Both directions (bijection-like)**
+```
+A: x*, y*, z    (z is secondary in A)
+B: y*, z*, x    (x is secondary in B)
+```
+- A → B? PK(B) = {y, z}. Is z in PK(A) or secondary in A? Yes (secondary). **Yes.**
+- B → A? PK(A) = {x, y}. Is x in PK(B) or secondary in B? Yes (secondary). **Yes.**
+- Both hold, prefer left operand: **PK(A * B) = {x, y}** with A's attributes first.
+
+**Example 3: Neither direction**
+```
+A: x*, y*
+B: z*, x    (x is secondary in B)
+```
+- A → B? PK(B) = {z}. Is z in PK(A) or secondary in A? No. **No.**
+- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? No (y not in B). **No.**
+- Result: **PK(A * B) = {x, y, z}** (union) with A's attributes first.
+
+**Example 4: B → A (subordinate relationship)**
+```
+Session: session_id*
+Trial: session_id*, trial_num*    (references Session)
+```
+- A → B? PK(Trial) = {session_id, trial_num}. Is trial_num in PK(Session) or secondary? No. **No.**
+- B → A? PK(Session) = {session_id}. Is session_id in PK(Trial)? Yes. **Yes.**
+- Result: **PK(Session * Trial) = {session_id, trial_num}** with Trial's attributes first.
+
+### Design Tradeoff: Predictability vs. Minimality
+
+The join primary key rule prioritizes **predictability** over **minimality**. In some cases, the resulting primary key may not be minimal (i.e., it may contain functionally redundant attributes).
+
+**Example of non-minimal result:**
+```
+A: x*, y*
+B: z*, x    (x is secondary in B, so z → x)
+```
+
+The mathematically minimal primary key for `A * B` would be `{y, z}` because:
+- `z → x` (from B's structure)
+- `{y, z} → {x, y, z}` (z gives us x, and we have y)
+
+However, `{y, z}` is problematic:
+- It is **not the primary key of either operand** (A has `{x, y}`, B has `{z}`)
+- It is **not the union** of the primary keys
+- It represents a **novel entity type** that doesn't correspond to A, B, or their natural pairing
+
+This creates confusion: what kind of entity does `{y, z}` identify?
+
+**The simplified rule produces `{x, y, z}`** (the union), which:
+- Is immediately recognizable as "one A entity paired with one B entity"
+- Contains A's full primary key and B's full primary key
+- May have redundancy (`x` is determined by `z`) but is semantically clear
+
+**Rationale:** Users can always project away redundant attributes if they need the minimal key. But starting with a predictable, interpretable primary key reduces confusion and errors.
+
+### Attribute Ordering
+
+The primary key attributes always appear **first** in the result's attribute list, followed by secondary attributes. When `B → A` (and not `A → B`), the join is conceptually reordered as `B * A` to maintain this invariant:
+
+- If PK = PK(A): A's attributes appear first
+- If PK = PK(B): B's attributes appear first
+- If PK = PK(A) ∪ PK(B): PK(A) attributes first, then PK(B) − PK(A), then secondaries
+
+### Non-Commutativity
+
+With these rules, join is **not commutative** in terms of:
+1. **Primary key selection**: `A * B` may have a different PK than `B * A` when one direction determines but not the other
+2. **Attribute ordering**: The left operand's attributes appear first (unless B → A and not A → B)
+
+The **result set** (the actual rows returned) remains the same regardless of order, but the **schema** (primary key and attribute order) may differ.
+
+### Left Join Constraint
+
+For left joins (`A.join(B, left=True)`), the functional dependency **A → B is required**.
+
+**Why this constraint exists:**
+
+In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity:
+
+| Scenario | PK by inner join rule | Left join problem |
+|----------|----------------------|-------------------|
+| A → B | PK(A) | ✅ Safe — A's attrs always present |
+| B → A | PK(B) | ❌ B's PK attrs could be NULL |
+| Neither | PK(A) ∪ PK(B) | ❌ B's PK attrs could be NULL |
+
+**Example of invalid left join:**
+```
+A: x*, y*           PK(A) = {x, y}
+B: x*, z*, y        PK(B) = {x, z}, y is secondary
+
+Inner join: PK = {x, z} (B → A rule)
+Left join attempt: FAILS because z could be NULL for unmatched A rows
+```
+
+**Valid left join example:**
+```
+Session: session_id*, date
+Trial: session_id*, trial_num*, stimulus    (references Session)
+
+Trial.join(Session, left=True)  # OK: Trial → Session (PK(Session) ⊆ PK(Trial))
+# PK = {session_id, trial_num}, all trials retained even when no Session row matches
+```
+
+**Error message:**
+```
+DataJointError: Left join requires the left operand to determine the right operand (A → B).
+The following attributes from the right operand's primary key are not determined by
+the left operand: ['z']. Use an inner join or restructure the query.
+```
+
+### Bypassing the Left Join Constraint
+
+For special cases where the user takes responsibility for handling the potentially invalid primary key, the constraint can be bypassed using `allow_invalid_primary_key=True`:
+
+```python
+# Normally blocked - A does not determine B
+A.join(B, left=True)  # Error: A → B not satisfied
+
+# Bypass the constraint - user takes responsibility
+A.join(B, left=True, allow_invalid_primary_key=True)  # Allowed, PK = PK(A) ∪ PK(B)
+```
+
+When bypassed, the resulting primary key is the union of both operands' primary keys (PK(A) ∪ PK(B)). The user must ensure that subsequent operations (such as `GROUP BY` or projection) establish a valid primary key.
+
+This mechanism is used internally by aggregation (`aggr`) with `keep_all_rows=True`, which resets the primary key via the `GROUP BY` clause.
+
+### Aggregation Exception
+
+`A.aggr(B, keep_all_rows=True)` uses a left join internally but has the **opposite requirement**: **B → A** (the group expression B must have all of A's primary key attributes).
+
+This apparent contradiction is resolved by the `GROUP BY` clause:
+
+1. Aggregation requires B → A so that B can be grouped by A's primary key
+2. The intermediate left join `A LEFT JOIN B` would have an invalid PK under the normal left join rules
+3. Aggregation internally allows the invalid PK, producing PK(A) ∪ PK(B)
+4. The `GROUP BY PK(A)` clause then **resets** the primary key to PK(A)
+5. The final result has PK(A), which consists entirely of non-NULL values from A
+
+Note: The semantic check (homologous namesake validation) is still performed for aggregation's internal join. Only the primary key validity constraint is bypassed.
+
+**Example:**
+```
+Session: session_id*, date
+Trial: session_id*, trial_num*, response_time    (references Session)
+
+# Aggregation with keep_all_rows=True
+Session.aggr(Trial, keep_all_rows=True, avg_rt='avg(response_time)')
+
+# Internally: Session LEFT JOIN Trial (with invalid PK allowed)
+# Intermediate PK would be {session_id} ∪ {session_id, trial_num} = {session_id, trial_num}
+# But GROUP BY session_id resets PK to {session_id}
+# Result: All sessions, with avg_rt=NULL for sessions without trials
+```
+
 ## Universal Set `dj.U`
 
 `dj.U()` or `dj.U('attr1', 'attr2', ...)` represents the universal set of all possible values and lineages.
@@ -537,6 +758,14 @@ Use .proj() to rename one of the attributes or .join(semantic_check=False) in a
    - `A.aggr(B)` raises error when PK attributes have different lineage
    - `dj.U('a', 'b').aggr(B)` works when B has `a` and `b` attributes
 
+6. **Join primary key determination**:
+   - `A * B` where `A → B`: result has PK(A)
+   - `A * B` where `B → A` (not `A → B`): result has PK(B), B's attributes first
+   - `A * B` where both `A → B` and `B → A`: result has PK(A) (left preference)
+   - `A * B` where neither direction: result has PK(A) ∪ PK(B)
+   - Verify attribute ordering matches primary key source
+   - Verify non-commutativity: `A * B` vs `B * A` may differ in PK and order
+
 ### Integration Tests
 
 1. **Schema migration**: Existing schema gets `~lineage` table populated correctly

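The join primary key rules specified in the diff above can be sketched in plain Python. This is only an illustration under stated assumptions, not DataJoint's implementation: the `Heading` dataclass and the `determines` and `join_primary_key` helpers are hypothetical names, and a heading is modeled as nothing more than two lists of attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    """Hypothetical stand-in for a table heading: attribute names only."""

    primary_key: list
    secondary: list = field(default_factory=list)

    @property
    def names(self):
        # Primary key attributes first, then secondary attributes.
        return self.primary_key + self.secondary


def determines(a, b):
    """A → B: every attribute in PK(B) is a primary or secondary attribute of A."""
    return all(name in a.names for name in b.primary_key)


def join_primary_key(a, b):
    """Return (primary key, attribute order) for the inner join A * B."""
    if determines(a, b):
        # A → B, including the case where both directions hold (left preference).
        pk, first, second = list(a.primary_key), a, b
    elif determines(b, a):
        # B → A only: conceptually reorder the join as B * A.
        pk, first, second = list(b.primary_key), b, a
    else:
        # Neither direction: PK(A), then PK(B) − PK(A).
        pk = a.primary_key + [n for n in b.primary_key if n not in a.primary_key]
        first, second = a, b
    order = list(pk)
    for name in first.names + second.names:
        if name not in order:
            order.append(name)
    return pk, order
```

For Example 1 of the spec, `join_primary_key(Heading(['x', 'y']), Heading(['x', 'z'], ['y']))` yields the key `['x', 'z']` with B's attributes first. Swapping the operands gives the same key here, since B determines A in either position.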
diff --git a/src/datajoint/expression.py b/src/datajoint/expression.py
@@ -282,7 +282,7 @@ def __matmul__(self, other):
             "The @ operator has been removed in DataJoint 2.0. " "Use .join(other, semantic_check=False) for permissive joins."
         )
 
-    def join(self, other, semantic_check=True, left=False):
+    def join(self, other, semantic_check=True, left=False, allow_invalid_primary_key=False):
         """
         Create the joined QueryExpression.
 
@@ -293,10 +293,14 @@ def join(self, other, semantic_check=True, left=False):
         :param semantic_check: If True (default), raise error on non-homologous namesakes.
             If False, bypass semantic check (use for legacy compatibility).
         :param left: If True, perform a left join retaining all rows from self.
+        :param allow_invalid_primary_key: If True, bypass the left join A → B constraint.
+            The resulting PK will be PK(A) ∪ PK(B), which may contain NULLs for unmatched rows.
+            Use when you will reset the PK afterward (e.g., via GROUP BY in aggregation).
 
         Examples:
             a * b  is short for a.join(b)
             a.join(b, semantic_check=False)  for permissive joins
+            a.join(b, left=True, allow_invalid_primary_key=True)  for left join with invalid PK
         """
         # U joins are deprecated - raise error directing to use & instead
         if isinstance(other, U):
@@ -336,10 +340,12 @@ def join(self, other, semantic_check=True, left=False):
         result._connection = self.connection
         result._support = self.support + other.support
         result._left = self._left + [left] + other._left
-        result._heading = self.heading.join(other.heading)
+        result._heading = self.heading.join(other.heading, left=left, allow_invalid_primary_key=allow_invalid_primary_key)
         result._restriction = AndList(self.restriction)
         result._restriction.append(other.restriction)
-        result._original_heading = self.original_heading.join(other.original_heading)
+        result._original_heading = self.original_heading.join(
+            other.original_heading, left=left, allow_invalid_primary_key=allow_invalid_primary_key
+        )
         assert len(result.support) == len(result._left) + 1
         return result
 
@@ -683,7 +689,8 @@ def create(cls, arg, group, keep_all_rows=False):
 
         if keep_all_rows and len(group.support) > 1 or group.heading.new_attributes:
             group = group.make_subquery()  # subquery if left joining a join
-        join = arg.join(group, left=keep_all_rows)  # reuse the join logic
+        # Allow invalid PK for left join (aggregation resets PK via GROUP BY afterward)
+        join = arg.join(group, left=keep_all_rows, allow_invalid_primary_key=True)
         result = cls()
         result._connection = join.connection
         result._heading = join.heading.set_primary_key(arg.primary_key)  # use left operand's primary key
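The changes above pass `left` and `allow_invalid_primary_key` through to `heading.join`, whose body is not part of this diff. Going by the spec, the check it needs to perform might look like the sketch below. The helper name `left_join_primary_key` is hypothetical, attributes are modeled as plain lists of names, and `ValueError` stands in for `DataJointError`.

```python
def left_join_primary_key(a_pk, a_secondary, b_pk, allow_invalid_primary_key=False):
    """Primary key of A.join(B, left=True) under the spec's left join constraint."""
    a_names = set(a_pk) | set(a_secondary)
    # Attributes of PK(B) that A does not determine (neither primary nor secondary in A).
    undetermined = [name for name in b_pk if name not in a_names]
    if not undetermined:
        # A → B holds: the result keeps PK(A), which is never NULL.
        return list(a_pk)
    if allow_invalid_primary_key:
        # Caller takes responsibility: PK(A) ∪ PK(B), may contain NULLs.
        return list(a_pk) + [name for name in b_pk if name not in a_pk]
    raise ValueError(
        "Left join requires the left operand to determine the right operand (A → B). "
        f"Undetermined right primary key attributes: {undetermined}"
    )
```

With the spec's invalid example (`A: x*, y*` and `B: x*, z*, y`), the call `left_join_primary_key(['x', 'y'], [], ['x', 'z'])` raises because `z` is undetermined. Passing `allow_invalid_primary_key=True` returns `['x', 'y', 'z']` instead, which is the intermediate key that aggregation later resets to PK(A).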