fix(conformance): bring test suite to green baseline #32

Open
indyjonesnl wants to merge 7 commits into calfonso:main from indyjonesnl:upstream/tests-green-baseline

@indyjonesnl

Goal

Make `cargo test --workspace` pass on `main` so that subsequent PRs can show regressions clearly. Without this, 9–11 baseline failures mask real regressions.

Changes

Controller fixes (2 commits):

  • fix(conformance): misc sweep — GC drops the 2-scan grace period for orphan deletion (delete_orphan already re-verifies owners per K8s attemptToDeleteItem); ResourceQuota watches pods/services/configmaps/secrets/PVCs; DaemonSet RollingUpdate resolves maxUnavailable as IntOrString with round-up percent scaling; StatefulSet status.replicas and readyReplicas exclude terminating pods.
  • fix(controller-manager): track terminating pods on EndpointSlices — terminating pods stay in the slice with terminating=true, ready=false so kube-proxy can drain them; honors publishNotReadyAddresses.

Test corrections (5 commits) — assertions that drifted from upstream behavior:

  • test(daemonset): align pod-name assertion with K8s generateName — DS pods are <ds>-<random>, not <ds>-<node>-<random>.
  • test(gc): ignore obsolete namespace-cascade test — GC no longer cascades namespaces (NamespaceController owns it).
  • test(node): bypass 60s startup grace via seed_first_seen_for_test — tests that exercise Ready-flip behavior need to skip the K8s-standard 60s startup grace.
  • test(controllerrevision): align hash-suffix assertion with K8s SafeEncodeString — controller emits SafeEncodeString alphabet, not hex.
  • test(statefulset): simulate kubelet cleanup between reconciles — graceful-delete tests need a kubelet simulator helper since reconcile only sets deletionTimestamp.

Verification

Local on this branch:

```
cargo test --workspace --locked
# Pass: 3780  Fail: 0
```

Why these two themes are bundled

The 5 test corrections and the 2 controller fixes are coupled — separately, each side leaves part of the suite red:

  • Controller fixes alone leave ~6 tests red (the test assertions are stale).
  • Test corrections alone leave 9 tests red (the controllers behave per the new tests' expectations only after the fixes).

Bundling them lets reviewers see a single PR that goes from red to green.

indyjonesnl and others added 7 commits May 14, 2026 17:04
… preempt, endpoints

Knock down four residual v1.35 conformance gaps in the unit-14 slice:

- GarbageCollector: orphans are now reaped in a single scan. The owner is
  already re-verified inside `delete_orphan` (mirroring K8s
  attemptToDeleteItem → getObject), so the prior 2-scan grace was pure
  added latency that conformance probes for orphan-pod cleanup observed
  as a failure.
- ResourceQuota: the controller now watches pods, services, configmaps,
  secrets and PVCs in addition to ResourceQuota itself. Lifecycle events
  on tracked resources immediately re-enqueue every quota in the affected
  namespace, so status.used reflects pod create/delete without waiting
  for the 30s resync.
- DaemonSet RollingUpdate: maxUnavailable is now resolved as a true
  IntOrString. Percentages are scaled against the desired pod count and
  rounded UP per `intstr.GetScaledValueFromIntOrPercent(roundUp=true)`.
  Previously "25%" was parsed as 25 absolute, allowing the whole fleet
  to be deleted in one reconcile on small clusters.
- StatefulSet eviction/scale-down: status.replicas and status.readyReplicas
  now exclude terminating pods. Scale-down and PDB-driven eviction tests
  expect the counts to decrement the moment deletionTimestamp is set
  (graceful termination), not only when the pod is fully removed.
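
For illustration, the round-up percent scaling described in the DaemonSet bullet can be sketched as follows. This is a minimal sketch, not the code in this PR: the `IntOrString` enum and `scaled_value_round_up` are hypothetical names, and the desired pod count is assumed to be non-negative.

```rust
/// Hypothetical stand-in for the Kubernetes IntOrString type.
#[derive(Debug, Clone, PartialEq)]
pub enum IntOrString {
    Int(i32),
    Str(String),
}

/// Resolve `value` against `total`. Integers are returned as-is;
/// percentages are scaled against `total` and rounded up.
pub fn scaled_value_round_up(value: &IntOrString, total: u32) -> Result<i32, String> {
    match value {
        IntOrString::Int(n) => Ok(*n),
        IntOrString::Str(s) => {
            // Only the "<number>%" form is accepted for string values.
            let digits = s
                .strip_suffix('%')
                .ok_or_else(|| format!("invalid value {s:?}: expected a percentage"))?;
            let percent: u32 = digits
                .parse()
                .map_err(|_| format!("invalid percentage {s:?}"))?;
            // Ceiling division: ceil(percent * total / 100).
            let scaled = (i64::from(percent) * i64::from(total) + 99) / 100;
            i32::try_from(scaled).map_err(|_| format!("{s:?} of {total} overflows i32"))
        }
    }
}
```

Under this sketch, "25%" of a 2-pod DaemonSet resolves to 1 rather than 25, so a single reconcile can no longer take down the whole fleet.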

Items inspected and confirmed already correct (no change shipped):
- list chunking + compaction 410 response (already returns fresh
  continueToken in Status.metadata).
- HostPort conflict detection (scheduler `check_host_port_conflicts`
  correctly handles wildcard hostIPs and protocol overlap).
- Preemption running path (scheduler `check_preemption` already considers
  Running pods as victims and uses K8s "remove all, then reprieve").
- Endpoints latency (controller is watch-driven on pods+services; reconcile
  does not write the service it watches, so the workqueue cooldown does
  not affect endpoint-update latency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DaemonSet pods are named '<ds>-<random>' via generateName (not
'<ds>-<node>-<random>'). A unique suffix per pod is required by the
'should retry creating failed daemon pods' conformance test, which
asserts the failed pod's name returns NotFound after replacement.
Test expectation predates the controller change; updating it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n EndpointSlices

K8s endpointslice controller keeps terminating pods (deletionTimestamp set)
in the slice with terminating=true, ready=false so kube-proxy can drain
them gracefully. Dropping them on the first reconcile broke the "serve"
conformance tests that race-check endpoint serving during rolling updates.

Also honor publishNotReadyAddresses on the Service spec: when set, all
endpoints get ready=true and serving=true regardless of pod Ready state.
This matches K8s endpointslice utils.podEndpointConditions semantics and
is required by headless services fronting peer-discovery protocols.
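
The condition rules above can be sketched roughly as follows. The struct and function names are hypothetical, and the `serving` value for a terminating pod without publishNotReadyAddresses (taken here to follow the pod's Ready state) is an assumption based on upstream behavior rather than something this commit message states.

```rust
/// Hypothetical mirror of the three EndpointSlice endpoint conditions.
#[derive(Debug, PartialEq, Eq)]
pub struct EndpointConditions {
    pub ready: bool,
    pub serving: bool,
    pub terminating: bool,
}

/// Compute conditions for one pod backing a Service.
pub fn endpoint_conditions(
    pod_ready: bool,
    has_deletion_timestamp: bool,
    publish_not_ready_addresses: bool,
) -> EndpointConditions {
    if publish_not_ready_addresses {
        // Per the commit message: ready and serving are forced to true
        // regardless of the pod's Ready state.
        return EndpointConditions {
            ready: true,
            serving: true,
            terminating: has_deletion_timestamp,
        };
    }
    EndpointConditions {
        // A terminating pod is never reported as ready.
        ready: pod_ready && !has_deletion_timestamp,
        // Assumption: serving tracks pod readiness even while terminating.
        serving: pod_ready,
        terminating: has_deletion_timestamp,
    }
}
```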

Additional fixes:
- Mirror EndpointSlices under the bare Endpoints name (matching K8s
  convention) instead of always appending "-mirrored", falling back to
  the suffix only when a selector-based slice already owns the bare name.
- cleanup_orphans() now actually deletes orphaned slices when the
  owning Service / Endpoints disappears (was a no-op stub).
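
The mirrored-slice naming rule in the first bullet might look like the sketch below. `mirrored_slice_name` and the `selector_owned` set are hypothetical; the real controller presumably consults storage rather than an in-memory set.

```rust
use std::collections::HashSet;

/// Pick the name for a mirrored EndpointSlice: prefer the bare Endpoints
/// name, and fall back to the "-mirrored" suffix only when a
/// selector-based slice already owns that name.
pub fn mirrored_slice_name(endpoints_name: &str, selector_owned: &HashSet<String>) -> String {
    if selector_owned.contains(endpoints_name) {
        format!("{endpoints_name}-mirrored")
    } else {
        endpoints_name.to_string()
    }
}
```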

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GC was deliberately changed to drop namespace cascade
(NamespaceController owns it now; force-deleting through GC raced with
finalizers). The test still exercises the removed code path; mark
ignored with a pointer to the namespace_controller_test where the
behavior is now verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NodeController applies a K8s-standard 60s startup grace before flipping
Ready conditions on a freshly-seen node. Two integration tests
(test_node_without_ready_condition, test_node_not_ready_with_old_heartbeat)
expected the condition to flip on the first reconcile — within the
grace window — and were therefore always red.

Add a #[doc(hidden)] seed_first_seen_for_test() that backdates a
node's first_seen entry so reconcile treats the node as past the
grace. Call it from both tests.
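
A minimal sketch of how such a test hook could work is shown below. Only the name seed_first_seen_for_test comes from this commit; the `NodeTracker` type, the method signatures, and the use of `SystemTime` are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Startup grace before Ready conditions may be flipped.
pub const STARTUP_GRACE: Duration = Duration::from_secs(60);

/// Hypothetical tracker for when each node was first observed.
#[derive(Default)]
pub struct NodeTracker {
    first_seen: HashMap<String, SystemTime>,
}

impl NodeTracker {
    /// Record the first sighting of `node` if it is new, then report
    /// whether the startup grace has elapsed as of `now`.
    pub fn past_startup_grace(&mut self, node: &str, now: SystemTime) -> bool {
        let first_seen = *self.first_seen.entry(node.to_string()).or_insert(now);
        now.duration_since(first_seen)
            .map(|age| age >= STARTUP_GRACE)
            .unwrap_or(false)
    }

    /// Test-only hook: overwrite the first-seen time so a test can
    /// place the node outside the grace window.
    #[doc(hidden)]
    pub fn seed_first_seen_for_test(&mut self, node: &str, first_seen: SystemTime) {
        self.first_seen.insert(node.to_string(), first_seen);
    }
}
```

A test would seed a timestamp slightly older than the grace (for example `now - Duration::from_secs(61)`) before the first reconcile, so the Ready flip is exercised immediately.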

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…codeString

The controller uses K8s SafeEncodeString (consonant+digit alphabet, one
character per digit of the FNV hash, so the length varies), not 10-char
hex. The test assertion predated this; update it to reflect the actual
format and check the SafeEncodeString alphabet.
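
For reference, a Rust sketch of the encoding follows. The alphabet string and the byte-modulo mapping reflect my understanding of upstream's `rand.SafeEncodeString`, so treat both as assumptions to verify; `is_safe_encoded` is a hypothetical helper showing the kind of alphabet check a test could make.

```rust
/// Assumed upstream alphabet: consonants plus digits, with vowels and
/// the digits 0, 1 and 3 left out.
const ALPHANUMS: &[u8] = b"bcdfghjklmnpqrstvwxz2456789";

/// Map each input byte onto the alphabet by taking it modulo the
/// alphabet length. The output has the same length as the input.
pub fn safe_encode_string(input: &str) -> String {
    input
        .bytes()
        .map(|b| ALPHANUMS[usize::from(b) % ALPHANUMS.len()] as char)
        .collect()
}

/// Check that a hash suffix is non-empty and uses only the alphabet.
pub fn is_safe_encoded(suffix: &str) -> bool {
    !suffix.is_empty() && suffix.bytes().all(|b| ALPHANUMS.contains(&b))
}
```

Because the input is the decimal rendering of the hash, a hex-oriented assertion fails on both length and character set: characters such as `a`, `e`, `0` and `1` never appear in the output.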

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
StatefulSet reconcile only sets deletionTimestamp; the actual pod
removal from storage is the kubelet's job. Two integration tests
(scales_down_reverse_order, rolling_update_changes_image) assert on
storage pod count and were therefore always red. Add a private
simulate_kubelet_cleanup helper that removes any pod with
deletionTimestamp, and call it between reconcile cycles. Mirrors the
helper already present in the src/controllers/statefulset.rs unit
tests.
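
A sketch of such a helper over an in-memory pod list is shown below. The `Pod` struct and the `Vec`-backed storage are simplifications assumed for illustration; the actual helper operates on the test's storage backend.

```rust
/// Hypothetical, trimmed-down pod record.
#[derive(Debug, Clone)]
pub struct Pod {
    pub name: String,
    /// Set by the controller when graceful deletion starts.
    pub deletion_timestamp: Option<String>,
}

/// Remove every pod that has a deletionTimestamp, as the kubelet would
/// once termination completes. Returns how many pods were removed.
pub fn simulate_kubelet_cleanup(pods: &mut Vec<Pod>) -> usize {
    let before = pods.len();
    pods.retain(|pod| pod.deletion_timestamp.is_none());
    before - pods.len()
}
```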

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>