Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 41 additions & 7 deletions .ai/base.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@ Understand the intended design first:
- `docs/component.md` — component lifecycle, status model, reconciliation phases
- `docs/primitives.md` — primitive categories, field application, mutation system, editors, selectors
- `docs/primitives/*.md` — primitive implementations
- `docs/custom-resource.md` — implementing custom resource wrappers using `pkg/generic`
- `docs/guidelines.md` — best practices for structuring operators (desired state, one component per condition, etc.)
- `docs/compatibility.md` — supported version combinations and compatibility policy

### Source to read

Expand All @@ -31,12 +34,22 @@ Verify the real API before using or documenting it. Key packages:
- `pkg/component/` — builder, reconciliation, condition types, participation modes
- `pkg/component/concepts/` — lifecycle interfaces and their exact status type constants
- `pkg/primitives/` — kubernetes primitive resource wrappers with builders and mutators
- `pkg/generic/` — generic building blocks for custom resource wrappers (reconciliation, mutation sequencing,
suspension, data extraction)
- `pkg/mutation/editors/` — available methods per editor type
- `pkg/mutation/selectors/` — available container selectors
- `pkg/feature/feature.go` — `NewVersionGate`, `Mutation[T]`
- `pkg/recording/` — resource event recording
- `pkg/testing/` — testing utilities (`golden/` for snapshot tests, `integration/` for integration helpers)

When changing a public API, also check `examples/` for real usage patterns and to identify what else needs updating.

### Keeping these instructions current

If a change introduces a new `docs/` file, a new `pkg/` package, or removes/renames an existing one, update the
reference lists above and the documentation table below in the same response. These instructions are only useful if they
point to things that actually exist.

---

## Architecture
Expand Down Expand Up @@ -77,13 +90,15 @@ semantics. GoDoc is part of the public API surface.

Update documentation in the **same response** as the code change — never leave them out of sync.

| Code area changed | Documentation to update |
| ------------------------------------------------- | ----------------------- |
| Component builder, reconciliation, status model | `docs/component.md` |
| Primitives, field application, editors, selectors | `docs/primitives.md` |
| Primitive implementations | `docs/primitives/*.md` |
| Any `pkg/` export visible in the quick start | `README.md` |
| Examples | `examples/*/README.md` |
| Code area changed | Documentation to update |
| ------------------------------------------------- | ------------------------- |
| Component builder, reconciliation, status model | `docs/component.md` |
| Primitives, field application, editors, selectors | `docs/primitives.md` |
| Primitive implementations | `docs/primitives/*.md` |
| Generic building blocks, custom resource wrappers | `docs/custom-resource.md` |
| Operator structuring patterns, best practices | `docs/guidelines.md` |
| Any `pkg/` export visible in the quick start | `README.md` |
| Examples | `examples/*/README.md` |

When updating documentation in markdown files, make sure to run `make fmt-md` for consistent formatting.

Expand Down Expand Up @@ -147,6 +162,25 @@ first determine whether the code or the test is wrong:
If the intention remains genuinely ambiguous after analysis, **stop and ask** before touching either the code or the
test. Do not guess.

### E2E Tests

E2E tests live in `e2e/` and run against a real kind cluster. They validate the same intent as unit tests, just at a
higher integration level. Treat them with the same rigour: if a code change breaks an E2E test, determine whether the
code or the test is wrong before adjusting either.

**When to write E2E tests.** Create E2E tests for larger changes or architectural shifts that deserve validation against
a real cluster. Small, isolated fixes covered by unit tests do not need E2E coverage.

**Running E2E tests after code changes.** After `make all` passes, check whether E2E tests are affected by the change.
If they are, run the minimal possible E2E suite for the change rather than the full suite:

```bash
make e2e-primitives PRIMITIVE=daemonset
```

Do not run `make e2e` locally unless a full suite run is genuinely required; it takes close to 15 minutes. If a full run
is needed, call it out so the decision is explicit.

---

## Documentation Standards
Expand Down
48 changes: 41 additions & 7 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,9 @@ Understand the intended design first:
- `docs/component.md` — component lifecycle, status model, reconciliation phases
- `docs/primitives.md` — primitive categories, field application, mutation system, editors, selectors
- `docs/primitives/*.md` — primitive implementations
- `docs/custom-resource.md` — implementing custom resource wrappers using `pkg/generic`
- `docs/guidelines.md` — best practices for structuring operators (desired state, one component per condition, etc.)
- `docs/compatibility.md` — supported version combinations and compatibility policy

### Source to read

Expand All @@ -31,12 +34,22 @@ Verify the real API before using or documenting it. Key packages:
- `pkg/component/` — builder, reconciliation, condition types, participation modes
- `pkg/component/concepts/` — lifecycle interfaces and their exact status type constants
- `pkg/primitives/` — kubernetes primitive resource wrappers with builders and mutators
- `pkg/generic/` — generic building blocks for custom resource wrappers (reconciliation, mutation sequencing,
suspension, data extraction)
- `pkg/mutation/editors/` — available methods per editor type
- `pkg/mutation/selectors/` — available container selectors
- `pkg/feature/feature.go` — `NewVersionGate`, `Mutation[T]`
- `pkg/recording/` — resource event recording
- `pkg/testing/` — testing utilities (`golden/` for snapshot tests, `integration/` for integration helpers)

When changing a public API, also check `examples/` for real usage patterns and to identify what else needs updating.

### Keeping these instructions current

If a change introduces a new `docs/` file, a new `pkg/` package, or removes/renames an existing one, update the
reference lists above and the documentation table below in the same response. These instructions are only useful if they
point to things that actually exist.

---

## Architecture
Expand Down Expand Up @@ -77,13 +90,15 @@ semantics. GoDoc is part of the public API surface.

Update documentation in the **same response** as the code change — never leave them out of sync.

| Code area changed | Documentation to update |
| ------------------------------------------------- | ----------------------- |
| Component builder, reconciliation, status model | `docs/component.md` |
| Primitives, field application, editors, selectors | `docs/primitives.md` |
| Primitive implementations | `docs/primitives/*.md` |
| Any `pkg/` export visible in the quick start | `README.md` |
| Examples | `examples/*/README.md` |
| Code area changed | Documentation to update |
| ------------------------------------------------- | ------------------------- |
| Component builder, reconciliation, status model | `docs/component.md` |
| Primitives, field application, editors, selectors | `docs/primitives.md` |
| Primitive implementations | `docs/primitives/*.md` |
| Generic building blocks, custom resource wrappers | `docs/custom-resource.md` |
| Operator structuring patterns, best practices | `docs/guidelines.md` |
| Any `pkg/` export visible in the quick start | `README.md` |
| Examples | `examples/*/README.md` |

When updating documentation in markdown files, make sure to run `make fmt-md` for consistent formatting.

Expand Down Expand Up @@ -147,6 +162,25 @@ first determine whether the code or the test is wrong:
If the intention remains genuinely ambiguous after analysis, **stop and ask** before touching either the code or the
test. Do not guess.

### E2E Tests

E2E tests live in `e2e/` and run against a real kind cluster. They validate the same intent as unit tests, just at a
higher integration level. Treat them with the same rigour: if a code change breaks an E2E test, determine whether the
code or the test is wrong before adjusting either.

**When to write E2E tests.** Create E2E tests for larger changes or architectural shifts that deserve validation against
a real cluster. Small, isolated fixes covered by unit tests do not need E2E coverage.

**Running E2E tests after code changes.** After `make all` passes, check whether E2E tests are affected by the change.
If they are, run the minimal possible E2E suite for the change rather than the full suite:

```bash
make e2e-primitives PRIMITIVE=daemonset
```

Do not run `make e2e` locally unless a full suite run is genuinely required; it takes close to 15 minutes. If a full run
is needed, call it out so the decision is explicit.

---

## Documentation Standards
Expand Down
5 changes: 5 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -389,6 +389,11 @@ All new code should include tests. The project uses [testify](https://github.com
go test ./...
```

## Further Reading

- [The Missing Layers in Your Kubernetes Operator](https://medium.com/@sourcehawk/the-missing-layers-in-your-kubernetes-operator-306ee8633350) -
a walkthrough of common structural problems in Kubernetes operators and how the framework addresses them.

## License

Apache License 2.0. See [LICENSE](LICENSE) for details.
1 change: 1 addition & 0 deletions docs/component.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ Each resource is registered with a `ResourceOptions` struct that controls how th
| `ResourceOptions{ReadOnly: true}` | **Read-only**: fetched but never modified; health still contributes |
| `ResourceOptions{Delete: true}` | **Delete-only**: removed from the cluster if present; does not contribute to health |
| `ResourceOptions{ParticipationMode: ParticipationModeAuxiliary}` | The resource's health does not contribute to the component condition. The component can become Ready regardless of this resource's state. **Exception:** a blocked [guard](#guards) always contributes to the condition regardless of participation mode, because it halts the entire reconciliation pipeline |
| `ResourceOptions{SuppressGraceInconsistencyWarning: true}` | Suppresses the warning log emitted when the resource's grace handler returns Healthy while its convergence handler returns non-healthy. Use this when the inconsistency is intentional (e.g., a custom grace handler that deliberately reports Healthy for a resource that has not fully converged) |

### Building Resource Options with Feature Gating

Expand Down
52 changes: 51 additions & 1 deletion docs/custom-resource.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ generics with type-specific logic.
- [Mutator Design Guidelines](#mutator-design-guidelines)
- [3. Implement Status Handlers](#3-implement-status-handlers)
- [Workload Handlers](#workload-handlers)
- [Convergence and Grace Status Consistency](#convergence-and-grace-status-consistency)
- [Status Constants Reference](#status-constants-reference)
- [4. Implement the Builder](#4-implement-the-builder)
- [Builder Pattern Guidelines](#builder-pattern-guidelines)
Expand Down Expand Up @@ -269,8 +270,24 @@ func DefaultConvergingStatusHandler(
}, nil
}

// DefaultGraceStatusHandler reports degraded or down state during convergence.
// DefaultGraceStatusHandler reports health during convergence.
func DefaultGraceStatusHandler(gs *examplev1.GameServer) (concepts.GraceStatusWithReason, error) {
desired := int32(1)
if gs.Spec.Replicas != nil {
desired = *gs.Spec.Replicas
}

// Use == rather than >= so that grace and convergence agree on replica state.
// Both handlers evaluate the same object in the same reconcile loop, so grace
// must not return Healthy for a state that convergence considers non-healthy
// (e.g. ReadyReplicas > desired during scale-down).
if gs.Status.ReadyReplicas == desired {
return concepts.GraceStatusWithReason{
Status: concepts.GraceStatusHealthy,
Reason: "All replicas are ready",
}, nil
}

if gs.Status.ReadyReplicas > 0 {
return concepts.GraceStatusWithReason{
Status: concepts.GraceStatusDegraded,
Expand Down Expand Up @@ -313,6 +330,39 @@ func DefaultDeleteOnSuspendHandler(_ *examplev1.GameServer) bool {
}
```

#### Convergence and Grace Status Consistency

The convergence handler and the grace handler evaluate the same object in the same reconcile loop with no refetch
between them. The grace handler must not return Healthy for any object state where the convergence handler returns
non-healthy. If this happens, one of the two handlers is misconfigured, and the component will log a warning.

When convergence returns Healthy, the component is satisfied and grace is never called. For all other states, grace must
not contradict convergence by returning Healthy. The following table shows a consistent pair of handlers for a
Deployment with 3 desired replicas:

| Desired | Ready | Convergence | Grace |
| ------- | ----- | ----------- | ------------ |
| 3 | 0 | Creating | Down |
| 3 | 1 | Scaling | Degraded |
| 3 | 3 | Healthy | (not called) |
| 3 | 5 | Scaling | Degraded |

A misconfigured grace handler that reports Healthy when the resource has not converged breaks this invariant:

| Desired | Ready | Convergence | Grace |
| ------- | ----- | ----------- | ------------ |
| 3 | 0 | Creating | Down |
| 3 | 1 | Scaling | Degraded |
| 3 | 3 | Healthy | (not called) |
| 3 | 5 | Scaling | **Healthy** |

In the last row, convergence considers the resource non-healthy (still scaling down), but grace tells the component
everything is fine.

If this inconsistency is intentional (e.g., a custom grace handler that deliberately reports Healthy for a resource that
has not fully converged), set `SuppressGraceInconsistencyWarning: true` on the resource's `ResourceOptions` to suppress
the warning log.

#### Status Constants Reference

| Category | Status Type | Constant Name | String Value |
Expand Down
21 changes: 21 additions & 0 deletions docs/guidelines.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ are effective and pitfalls that are easy to walk into.
- [Keep Controllers Thin](#keep-controllers-thin)
- [Resource Registration Order Is Execution Order](#resource-registration-order-is-execution-order)
- [Use Data Extraction and Guards for Resource Dependencies](#use-data-extraction-and-guards-for-resource-dependencies)
- [Prefer stable values for guard conditions](#prefer-stable-values-for-guard-conditions)
- [Use Prerequisites for Cross-Component Dependencies](#use-prerequisites-for-cross-component-dependencies)
- [Use Component Feature Gates for Optional Components](#use-component-feature-gates-for-optional-components)
- [Mutations Describe Intent, Not Observation](#mutations-describe-intent-not-observation)
Expand Down Expand Up @@ -349,6 +350,21 @@ The guard prevents the dependent resource from being applied until its precondit
as a `Blocked` condition reason so users can see why a resource has not been created yet. The shared variable
(`roleARN`) is scoped to the reconciliation call, which prevents state leakage between reconciles.

### Prefer stable values for guard conditions

A guard re-evaluates on every reconcile. If the extracted value it depends on is unstable (it can disappear, change, or
transiently become empty), the guard will re-block after the dependent resource has already been created. In most cases
this is not intentional. The resource is already running, but the guard now reports `Blocked` and skips reconciliation
for everything after it.

Good candidates for guard conditions are values that appear once and remain stable: a status field written by a
controller (an ARN, a provisioned IP, a generated credential reference). Poor candidates are values that fluctuate
during normal operation, such as replica counts, transient annotations, or fields that get cleared during rolling
updates.

If you genuinely need to react to a value disappearing after initial creation, that is a valid use case, but it should
be a deliberate design choice rather than an accidental side effect of choosing an unstable extraction target.

## Use Prerequisites for Cross-Component Dependencies

When one component cannot start until another is ready, use a prerequisite on the dependent component rather than
Expand Down Expand Up @@ -541,3 +557,8 @@ consuming that output, not for the internal implementation.

The audience cares about the feature, not the Kubernetes resource type backing it. A condition named
`DeploymentReconciled` tells a user nothing about what capability is affected.

## Further Reading

For a deeper look at the structural problems these guidelines address, see
[The Missing Layers in Your Kubernetes Operator](https://medium.com/@sourcehawk/the-missing-layers-in-your-kubernetes-operator-306ee8633350).
7 changes: 6 additions & 1 deletion e2e/component/prerequisite_gate_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -333,11 +333,16 @@ var _ = Describe("Component Feature Gate and Prerequisite", func() {
})

framework.NewClusterTestApp(ctx, k8sClient, name)

By("waiting for PrerequisiteNotMet condition before suspending")
Eventually(framework.GetClusterCondition(ctx, k8sClient, name, "E2EReady"), framework.DefaultTimeout, framework.DefaultPolling).
Should(framework.HaveConditionStatus(metav1.ConditionFalse, "PrerequisiteNotMet"))

framework.UpdateClusterTestApp(ctx, k8sClient, name, func(a *framework.ClusterTestApp) {
a.Spec.Suspended = true
})

By("waiting for PrerequisiteNotMet condition (not Suspended)")
By("verifying condition remains PrerequisiteNotMet (not Suspended)")
Eventually(framework.GetClusterCondition(ctx, k8sClient, name, "E2EReady"), framework.DefaultTimeout, framework.DefaultPolling).
Should(framework.HaveConditionStatus(metav1.ConditionFalse, "PrerequisiteNotMet"))

Expand Down
Loading
Loading