diff --git a/test/antithesis/scratchbook/bug-ledger.md b/test/antithesis/scratchbook/bug-ledger.md new file mode 100644 index 00000000000..75cd0b8ce39 --- /dev/null +++ b/test/antithesis/scratchbook/bug-ledger.md @@ -0,0 +1,74 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space — headline guarantees that frame these defects as bugs. + - path: https://github.com/DataDog/saluki/pull/1768 + why: PR review #4393897611 / Codex P2 flagged the now-stale aggregate panic repro reconciled here. +--- + +# Bug Ledger + +Accounting of every defect discovered during `antithesis-research`, with the vehicle used to +demonstrate it. Goal: burn each discovered bug into a local repro case or an Antithesis triage shot. +**No fixes were applied.** Following TDD, each local repro asserts the *desired* invariant, so it +**currently FAILS** against the buggy code — the failing test *is* the demonstration. It turns green +once the defect is fixed (a ready-made regression guard). Each test's comment describes why/how the +bug happens. + +## Burned into local repro cases (failing unit tests) + +| # | Property | Test | Defect | +|---|----------|------|--------| +| 1 | `ddsketch-no-nan-poison` | `lib/ddsketch/src/agent/sketch.rs::tests::bug_nan_sample_poisons_sum_and_avg` | One NaN sample permanently poisons sketch `sum`/`avg` (sticky), `count`/`min`/`max` stay valid → silent corruption; no finiteness guard at the sketch boundary. | +| 2 | `replay-corruption-not-silent-eof` | `lib/saluki-components/.../sources/dogstatsd/replay/reader.rs::tests::bug_corrupt_length_prefix_silently_drops_following_records` | A corrupt/oversized length prefix is read as clean EOF → all following well-formed records silently dropped (false replay fidelity / data loss). 
| +| 3 | `aggregate-clock-skew-stable` (forward-jump facet) | `lib/saluki-components/.../transforms/aggregate/mod.rs::tests::bug_forward_clock_jump_floods_zero_value_points` | No `current_time >= last_flush` guard; a forward wall-clock jump builds `zero_value_buckets` over the whole interval — O(jump) work/alloc — and floods one idle counter with points proportional to the jump. | +| 4 | `rss-bounded-under-cardinality` / `interner-full-bounded` (root cause) | `lib/saluki-context/src/resolver.rs::tests::bug_default_heap_fallback_makes_context_resolution_unbounded` | `allow_heap_allocations` defaults true → a full fixed-size interner silently spills to the heap and `resolve` never refuses → unbounded memory under a high-cardinality flood; the only bounded mode (heap disallowed) silently drops. | +| 5 | `config-stall-no-deadlock` | `lib/saluki-config/src/lib.rs::tests::bug_config_ready_hangs_forever_without_snapshot` | `GenericConfiguration::ready()` awaits the first dynamic snapshot with no internal timeout; with the sender held open and silent, it never resolves → ADP startup hangs forever. | + +Run all five (expect five FAILURES — the failing tests are the demonstrations): +`cargo nextest run --no-fail-fast -E 'test(/bug_nan_sample_poisons_sum_and_avg|bug_corrupt_length_prefix_silently_drops_following_records|bug_forward_clock_jump_floods_zero_value_points|bug_default_heap_fallback_makes_context_resolution_unbounded|bug_config_ready_hangs_forever_without_snapshot/)'` + +## Resolved upstream on main (repro now stale) + +- **`aggregate-no-panic-any-window` — sub-second window `% 0` panic (was bug #1).** Fixed on main: + the config key is renamed `aggregate_window_duration_seconds` and typed `NonZeroU64`, and + `align_to_bucket_start` divides by `bucket_width_secs.get()` end-to-end + (`transforms/aggregate/mod.rs:95-98,822-823`), so a zero/sub-second window now fails config + parsing instead of reaching the divisor (PR #1772). 
The repro + `tests::bug_sub_second_aggregate_window_panics_on_insert` lives in a **sibling stack commit** + (`chore(agent-data-plane): failing repros for six discovered bugs`), not this docs commit. **Action + required there (out of scope for this commit):** delete that test or convert it to a passing + regression guard, since the desired invariant now holds. The catalog entry is reframed as a + low-cost `Unreachable` regression tripwire. + +## Burned into an Antithesis triage shot (submitted run) + +- **`rss-bounded-under-cardinality` (behavioral)** and **`forwarder-eventual-delivery` (baseline liveness)** — run id (redacted; tracked internally) (test-name `saluki-adp-bug-hunt`, 30 min, submitted 2026-05-29). The `parallel_driver_send_dogstatsd` high-cardinality regime drives memory growth; `finally_verify_delivery` checks delivery. Triage with the `antithesis-triage` skill once it completes. + +## Antithesis-shot-only — blocked on harness infrastructure (not locally reproducible) + +These discovered defects cannot be reproduced as local unit tests and require infrastructure not yet +built. Each is a follow-up Antithesis shot, not a current repro. + +- **`memory-limiter-survives-rss-read-failure`** — `check_memory_usage` does `querier.resident_set_size().expect(...)`; an RSS-read failure panics the checker thread, silently disabling memory protection. **Not locally reproducible:** the `Querier` is constructed internally with no injection seam. **Shot blocker:** a custom `/proc` fault (enabled for the tenant) + a memory-limiter-enabled ADP config variant + a SUT-side `assert_unreachable!` at the `.expect` site. +- **`config-runtime-update-not-revalidated`** — the incompatibility gate runs only at startup; a runtime config-stream update can introduce a high-severity-incompatible key with no re-gate. **Shot blocker:** config-stream add-on **and** human confirmation of intent (intentional trust of the authoritative Agent vs. 
oversight) — flagged `(needs human input)` in the catalog. +- **`shutdown-drains-no-loss` / `graceful-shutdown-within-30s` (forceful-stop data loss)** — behavioral; needs the running harness shut down under a slow/blocked intake to exceed the 30s grace window. **Shot blocker:** an intake failure-mode toggle + a shutdown driver. + +## Recorded as covered / not a distinct defect + +- **`aggregate-clock-skew-stable` (backward-jump gap)** — dual of bug #3; same root cause (no monotonicity guard on `current_time` vs `last_flush`). Covered by #3's triage. +- **aggregate context-limit counts contexts, not bytes** — a single context with many timestamped values has unbounded value memory; a facet of `rss-bounded-under-cardinality`, partially covered by #4 and the catalog's open question. +- **`ddsketch-relative-error-bound` merge non-associativity** — f64 ordering sensitivity; a library-level numeric property, not a clear ADP runtime defect (the catalog demoted it to a harness/proptest invariant). +- **`source-dispatch-no-misroute`** — misroute is structurally improbable with the current `extract`-then-`send_all` ordering; the live facet (silent uncounted loss on a downstream dispatch error) would need a failing-dispatcher harness. Candidate future local repro; not a confirmed defect today. + +## Status + +Cleanly-reproducible discovered bugs that remain open: **5 local repros + 1 submitted run.** One +original repro (the aggregate sub-second `% 0` panic) was **fixed upstream on main** (PR #1772) and is +recorded above as resolved-stale — its repro in the sibling stack commit needs removal/conversion. The +remaining items are explicitly blocked on harness infrastructure (custom `/proc` fault, config-stream +add-on, intake failure toggle) or human input, and are recorded above as follow-up Antithesis shots. +No further bug is reproducible without building that infrastructure.
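As an illustration of defect #1 in the ledger table (one NaN sample poisons `sum`/`avg` while `count`/`min`/`max` stay valid), here is a minimal sketch. It assumes a hypothetical `ToySummary` type and is not the real `lib/ddsketch` sketch; the `insert_guarded` method shows only one possible shape of a finiteness guard, not a fix chosen by the authors.

```rust
/// Hypothetical running summary used only for illustration.
/// This is NOT the real `lib/ddsketch` sketch type.
#[derive(Debug)]
pub struct ToySummary {
    pub count: u64,
    pub sum: f64,
    pub min: f64,
    pub max: f64,
}

impl ToySummary {
    pub fn new() -> Self {
        Self { count: 0, sum: 0.0, min: f64::INFINITY, max: f64::NEG_INFINITY }
    }

    /// Unguarded insert: a NaN sample is folded straight into `sum`,
    /// and every later addition keeps `sum` at NaN (the "sticky" poison).
    pub fn insert_unguarded(&mut self, v: f64) {
        self.count += 1;
        self.sum += v;
        // `f64::min` / `f64::max` return the non-NaN operand, so `min` and
        // `max` survive a NaN sample while `sum` does not.
        self.min = self.min.min(v);
        self.max = self.max.max(v);
    }

    /// One possible guard (an assumption): refuse non-finite samples at the
    /// boundary and report whether the sample was accepted.
    pub fn insert_guarded(&mut self, v: f64) -> bool {
        if !v.is_finite() {
            return false;
        }
        self.insert_unguarded(v);
        true
    }

    pub fn avg(&self) -> f64 {
        self.sum / self.count as f64
    }
}
```

Feeding `1.0`, `NaN`, `3.0` through `insert_unguarded` leaves `count`, `min`, and `max` looking healthy while `sum` and `avg` are NaN, which is why the corruption is silent.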
diff --git a/test/antithesis/scratchbook/deployment-topology.md b/test/antithesis/scratchbook/deployment-topology.md new file mode 100644 index 00000000000..fcef2890c0d --- /dev/null +++ b/test/antithesis/scratchbook/deployment-topology.md @@ -0,0 +1,217 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence — correctness-test topology (diff-testing Agent vs ADP via fakeintake) informs reuse. + - path: https://datadoghq.atlassian.net/wiki/spaces/AMCC/pages/6640602441/Tag+Filter+RC+Relay+Stress+Test+agent+ADP + why: Documents the Core Agent → AgentSecure config-stream → ADP relay topology (informs the config-stream add-on). +--- + +# Deployment Topology: Agent Data Plane (ADP) + +## Guiding principle + +The existing `bin/correctness` harness (`millstone` + `datadog-intake` + `panoramic` + `airlock`) is +a near-perfect base — but it is built for **deterministic diff-testing with a healthy intake**. +Antithesis's value is the opposite: **fault the links the harness keeps healthy**. The single most +important design move is to put the **ADP → intake HTTP forwarding link across a container boundary** +so Antithesis can partition, delay, drop, and black-hole it. Everything else stays minimal. + +Antithesis injects faults **per container**: anything in one container shares its fate, and links +*within* a container cannot be faulted. So component placement is dictated by which links we need +faultable. + +> **Post-evaluation routing (catalog now 35 properties).** The primary topology + listener variant +> cover ~24 of 35. 
The remaining ~11 route to the two add-ons: **Add-on 1 (config stream)** covers +> the config/reload cluster — `config-stall-no-deadlock`, `config-incompatible-refuses-start`, +> `config-runtime-update-not-revalidated`, **`filter-config-reload-correct`**, and the reload facets +> of `tag-filterlist-applied-consistently`; **Add-on 2 (diff-test)** covers the differential +> correctness properties — `aggregate-matches-agent`, **`mapper-output-matches-agent`**, +> **`prefix-filter-ordering-matches-agent`**, and the differential facet of +> `tag-filterlist-applied-consistently`. The events/service-checks trio +> (`events-sc-no-silent-loss`, `malformed-event-sc-no-crash`, `events-sc-pipeline-reachable`) runs in +> the **primary** topology (the workload must emit events + service-checks, not only metrics). +> `mapper-interner-bounded` runs in the primary topology (high-cardinality mappable names + small +> mapper interner). + +## Primary topology (covers ~24 of 35 properties) + +Standalone ADP, fed deterministic load, forwarding to a mock intake — all three on separate +containers so every link is faultable. 
+ +```text ++------------------------+ DogStatsD +------------------------+ HTTP (Datadog +------------------------+ +| workload-client | (UDP/TCP, faultable) | adp | intake API, | mock-intake | +| - millstone load gen | ------------------------> | agent-data-plane | faultable, retryable) | datadog-intake | +| - Antithesis SDK | | (standalone mode) | ----------------------> | (mock fakeintake) | +| - test template | <------------------------ | UDP/TCP/UDS listeners | <---------------------- | records payloads, | ++------------------------+ backpressure / health +------------------------+ acks / 5xx / hang | queryable for asserts | + +------------------------+ +``` + +| Container | Role | Image | Runs | Connections | Replicas | +|---|---|---|---|---|---| +| `adp` | Service (SUT) | reuse `docker/Dockerfile.agent-data-plane` (standalone build) | `agent-data-plane run` in **standalone mode** (`DD_DATA_PLANE_STANDALONE_MODE=true`, `DD_DATA_PLANE_DOGSTATSD_ENABLED=true`), no Core Agent dependency | receives DogStatsD from `workload-client`; forwards to `mock-intake` over HTTP | 1 | +| `mock-intake` | Dependency | reuse `docker/Dockerfile.correctness-tools` (the `datadog-intake` binary) | mock Datadog intake; record + count forwarded payloads; expose a query API the workload reads for assertions | receives ADP forwarder traffic; queried by `workload-client` | 1 | +| `workload-client` | Client (test driver) | new thin Dockerfile layering the `millstone` binary + test template + Antithesis Rust SDK | emits `setup_complete`, then test commands drive `millstone` load and run assertions against `mock-intake` | sends DogStatsD to `adp`; queries `mock-intake` | 1 | + +Notes: +- **Use UDP or TCP, not UDS, between `workload-client` and `adp`.** UDS requires a shared volume + (same fate / no faulting), and it couples origin-detection credentials. 
UDP/TCP keeps the intake + *and* the DSD-intake links independently faultable and lets `malformed-dsd-no-crash` exercise the + network listeners. (UDS-specific listener behavior can be a secondary case with a shared-volume + sidecar — see "Listener-coverage variant".) +- **Point ADP's forwarder at `mock-intake`** via `DD_URL` / forwarder endpoint config; set a real + (fake) API key. This is the link that unlocks the entire egress data-loss cluster. +- `millstone` already supports deterministic seeds and fixed payload counts (`millstone.yaml`), + so the workload is reproducible; Antithesis adds the fault dimension on top. + +### What the primary topology covers + +- **Memory & resource bounds (Cat A):** high-cardinality / many-timestamp `millstone` corpus + + `memory_mode`/`memory_limit` set on `adp`; node-throttling on `adp` to stress the limiter timing; + observe RSS vs grant. `rss-bounded-under-cardinality`, `aggregate-context-limit-enforced`, + `interner-full-bounded`, `memory-limiter-survives-rss-read-failure` (needs `/proc` fault — see + faults), `retry-queue-bounded-under-outage`. +- **Data integrity & no silent loss (Cat B):** partition / delay / black-hole the `adp↔mock-intake` + link, then heal it. `no-silent-interconnect-drop`, `forwarder-eventual-delivery`, + `retry-queue-bounded-under-outage`, `shutdown-drains-no-loss`, `source-dispatch-no-misroute`. +- **Aggregation crash/clock (Cat C subset):** config exploration + (`aggregate_window_duration_seconds`, now `NonZeroU64` — the `% 0` panic is closed upstream, so + `aggregate-no-panic-any-window` is just a regression tripwire) and clock jitter. + `aggregate-no-panic-any-window`, `aggregate-clock-skew-stable`. Sketch internals + (`ddsketch-*`, `ddsketch-no-nan-poison`) ride the same workload with SUT-side assertions; the NaN + bypass needs a `checks_ipc` Histogram producer (see "checks_ipc note"). +- **Lifecycle (Cat D subset):** SIGINT / node-termination on `adp`. 
`graceful-shutdown-within-30s`, + `data-component-failure-triggers-process-shutdown`, `topology-ready-before-intake`. +- **Untrusted input (Cat E) + concurrency (Cat F):** adversarial DogStatsD packets from the + workload (`malformed-dsd-no-crash`); interner races and non-finite handling ride normal load with + SUT-side assertions (`interner-reclamation-no-corruption`, `non-finite-values-handled-consistently`). +- **Events & service-checks (Cat B/E additions):** the workload must emit well-formed *and* + malformed events + service-checks so `events-sc-no-silent-loss`, `malformed-event-sc-no-crash`, and + the anti-vacuity anchor `events-sc-pipeline-reachable` are exercised — a metrics-only `millstone` + corpus leaves these vacuous. +- **Transformer correctness (Cat G, primary-runnable subset):** `mapper-interner-bounded` rides a + high-cardinality flood of distinct *mappable* names against a small `dogstatsd_mapper_string_interner_size`. + The differential Cat G properties (`mapper-output-matches-agent`, `prefix-filter-ordering-matches-agent`) + need Add-on 2; the reload ones need Add-on 1. + +## Add-on 1 — Core Agent config stream (the `config-*` cluster) + +Standalone mode bypasses the remote-agent config stream, so the config-stream properties need a +fourth container: a **Core Agent (or a minimal gRPC config-stream stub)** that ADP registers against +and receives snapshot/partial config from. 
+ +```text ++------------------------+ gRPC config stream (faultable) +------------------------+ +| core-agent-stub | <--------------------------------> | adp (remote-agent mode)| +| - IPC/gRPC config srvr | register / snapshot / partial | DD_DATA_PLANE_STANDALONE| +| - Status/Flare/Telem | | _MODE=false | ++------------------------+ +------------------------+ +``` + +| Container | Role | Image | Runs | Why | +|---|---|---|---|---| +| `core-agent-stub` | Dependency | reuse `docker/Dockerfile.datadog-agent`, **or** a new minimal stub speaking the remote-agent IPC/gRPC config protocol | serves registration + config snapshot/partial over gRPC | exercises the no-timeout startup wait and runtime config-apply path | + +Covers: `config-stall-no-deadlock` (delay/drop the config stream → quiescent indefinite hang at +`ready().await`, *the* falsification target), `config-incompatible-refuses-start` (send a +high-severity-incompatible non-default key at startup → expect exit 1), +`config-runtime-update-not-revalidated` (send the incompatible key *after* startup → observe silent +apply), and the **Category G runtime-reload cluster** — **`filter-config-reload-correct`** (push +filter config over the stream while metrics flow; explore `broadcast::Lagged` staleness, partial +apply, and key-deletion-clears-all-filtering) plus the reload facet of +`tag-filterlist-applied-consistently` (stale cache after a Lagged-dropped reload). This is the +design partner's documented "Tag Filter RC Relay Stress Test" focus, so the stub must be able to +send adversarial/partial/laggy filter updates. A **stub is preferred over the full Datadog Agent** for state-space minimality and because we +need to send adversarial/malformed config the real Agent would never emit; flag as a build task.
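The `config-stall-no-deadlock` scenario above (sender held open and silent, so the first-snapshot wait never resolves) can be modeled with a synchronous stand-in. This is a sketch under stated assumptions: the real wait is the async `ready().await`, whereas this uses `std::sync::mpsc`, and `FirstSnapshot` and `wait_for_first_snapshot` are hypothetical names. It only shows how a bounded wait separates "stalled" from "stream closed".

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of waiting for the first config snapshot.
#[derive(Debug, PartialEq)]
pub enum FirstSnapshot {
    Received(String),
    /// Sender is still alive but silent: the stall case.
    TimedOut,
    /// Sender was dropped: the stream ended without a snapshot.
    StreamClosed,
}

/// Bounded wait for the first snapshot. An unbounded `rx.recv()` would block
/// forever in the stall case; `recv_timeout` turns that into an observable
/// `TimedOut` result instead.
pub fn wait_for_first_snapshot(rx: &mpsc::Receiver<String>, limit: Duration) -> FirstSnapshot {
    match rx.recv_timeout(limit) {
        Ok(snapshot) => FirstSnapshot::Received(snapshot),
        Err(RecvTimeoutError::Timeout) => FirstSnapshot::TimedOut,
        Err(RecvTimeoutError::Disconnected) => FirstSnapshot::StreamClosed,
    }
}
```

A stub that keeps its sender open without sending reproduces the stall (`TimedOut`), while dropping the sender yields `StreamClosed`; the two cases are worth distinguishing because only the first corresponds to the quiescent hang.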
+ +## Add-on 2 — Diff-test for `aggregate-matches-agent` (heaviest; optional, separate run) + +The differential property needs the Datadog Agent baseline and ADP comparison fed identical load, +each forwarding to its own intake, then compared. This is the existing `panoramic` correctness setup; +under Antithesis the comparison runs as a `finally_`/`eventually_` check during a quiet period. + +```text + +------------------+ +------------------+ + millstone -->| datadog-agent | ---> | intake-baseline | + (same seed, | (baseline) | +------------------+ + fan-out) -->| adp (comparison) | ---> | intake-comparison| --> finally_: stele diff within ratio + +------------------+ +------------------+ +``` + +This doubles the container count and the state space, so run it as its **own test template/topology**, +not bundled with the fault-focused primary run. Keep faults light here (fault-induced flush timing +differences create false diffs unless the comparison runs in an `ANTITHESIS_STOP_FAULTS` quiet window +long enough to cover `FLUSH_WAIT`≈32s on both sides). Reuse `stele`/`panoramic` analysis logic. + +Beyond `aggregate-matches-agent`, this add-on is also where the **Category G differential** +properties run, with identical config/profiles on both baseline and comparison: +**`mapper-output-matches-agent`** (identical `dogstatsd_mapper_profiles`, corpus of mappable names), +**`prefix-filter-ordering-matches-agent`** (corpus where keep/drop depends on stage order), and the +differential facet of **`tag-filterlist-applied-consistently`** (post-filter name/tags within ratio). +The corpus must actually exercise mapped/filtered metrics or these (and the `aggregate-matches-agent` +roll-up's implicit coverage of them) pass vacuously. 
+ +## Listener-coverage variant (secondary) + +To cover UDS-datagram / UDS-stream listeners and the DogStatsD **replay** properties +(`replay-no-panic-on-malformed-capture`, `replay-corruption-not-silent-eof`): add a `replay-client` +that shares a volume with `adp` for the UDS socket and runs the `agent-data-plane dogstatsd replay` +CLI against **adversarially generated capture files** (the workload synthesizes corrupt/truncated/ +zstd-bomb captures using the SDK's RNG). Replay runs as a separate CLI process, so its panic/OOM is +isolated from the data plane — install the SUT-side panic/reachability assertions in that CLI process. +No cross-container faults are needed for replay; it is pure untrusted-input exploration. + +## Fault requirements (confirm enabled for the tenant) + +Antithesis disables some faults by default. **The user confirmed (2026-05-28) that node +termination, clock jitter, and custom `/proc` faults are all enabled for this tenant**, so the +properties that depend on them are realizable rather than vacuous. The custom `/proc` fault still +needs a script. 
+ +| Fault | Needed by | Status | +|---|---|---| +| **Node termination** (kill/restart) | `disk-persisted-retry-survives-restart`, `data-component-failure-triggers-process-shutdown`, crash-recovery facets of `forwarder-eventual-delivery` and `aggregate-matches-agent` | **Confirmed enabled** | +| **Clock jitter** | `aggregate-clock-skew-stable` (and the clock facet of `aggregate-matches-agent`) | **Confirmed enabled** | +| Network partition / bad-node / congestion | entire egress data-loss cluster (Cat B), `no-silent-interconnect-drop` backpressure | Usually on | +| Node throttling / CPU modulation | `rss-bounded-under-cardinality` (limiter timing), `memory-limiter-survives-rss-read-failure`, interner-race timing | Usually on | +| **Custom fault** (`/proc` RSS-read failure) | `memory-limiter-survives-rss-read-failure` — needs a custom fault/script to make RSS unreadable mid-run | **Confirmed enabled** (script still TBD) | + +- **Liveness properties** (`forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`, + `shutdown-drains-no-loss`, `config-stall-no-deadlock`) need a quiet window to verify recovery: use + `eventually_`/`finally_` commands or `ANTITHESIS_STOP_FAULTS` after healing the intake/partition. +- The intake-down scenario is also approximable **without** network faults by toggling + `mock-intake` into reject/5xx/hang modes via a custom fault or an admin endpoint — useful where + network-fault availability is limited. + +## SDK selection + +- **Workload client:** Rust — Antithesis Rust SDK (assertions + RNG for adversarial input + generation: malformed packets, corrupt captures, config keys). +- **SUT (`adp`):** Rust — add the Antithesis Rust SDK for the **net-new SUT-side assertions** that + catalog Categories C/E/F require (NaN-at-sketch-boundary, bin-count, interner-corruption, + source-misroute, limiter-RSS-failure, replay-panic, and the aggregate divide-by-zero regression + tripwire — now a closed vector, `NonZeroU64`). 
These internal + states are invisible to a workload-only checker (see `property-catalog.md` "Catalog-wide notes"). + Build a dedicated ADP image with the SDK + Antithesis coverage instrumentation enabled. + +## Assumptions & open questions + +- **Standalone mode is acceptable for the primary topology.** `AGENTS.md` calls standalone "not for + production," but it is the same mode the correctness suite uses and it removes the Core Agent from + ~22 properties' state space. Config-stream properties use Add-on 1 with remote-agent mode. Confirm + no standalone-only code path masks a production behavior we care about. +- **`datadog-intake` needs a controllable failure mode** (reject / 5xx / slow / hang, ideally + toggleable at runtime) to drive the egress cluster without relying solely on network faults. + Confirm whether the existing binary supports this or needs a small extension. +- **A minimal Core Agent config-stub** must be built (or the full `datadog-agent` image adapted) to + send adversarial config the real Agent wouldn't — needed for Add-on 1. +- Whether the workload can drive DogStatsD over **UDP/TCP at the volume `millstone` targets** without + loss confounding the assertions (UDP is lossy by nature; for no-loss assertions prefer TCP/UDS, and + scope UDP cases to no-crash rather than no-loss). +- The `checks_ipc` Histogram NaN bypass (`ddsketch-no-nan-poison`) needs a **checks-IPC producer** in + the topology (a check emitting a NaN histogram), which the DogStatsD-only primary topology lacks — + add a minimal checks-IPC feeder or a unit-level SUT assertion for that one property. 
diff --git a/test/antithesis/scratchbook/evaluation/antithesis-fit.md b/test/antithesis/scratchbook/evaluation/antithesis-fit.md new file mode 100644 index 00000000000..ee216f5fa54 --- /dev/null +++ b/test/antithesis/scratchbook/evaluation/antithesis-fit.md @@ -0,0 +1,361 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space. +--- + +# Antithesis Fit Evaluation — ADP Property Catalog + +Lens: for each of the 27 properties, does *verifying* it require exploring a state space that +deterministic tests cannot reach (timing / concurrency / partial-failure / combinatorial), and does +the chosen assertion TYPE match that mode? Bias: surface problems, not admire the catalog. + +Scope key: `catalog-wide` = affects the whole catalog or a family; `property-specific` = one slug. + +--- + +## Findings + +### F1. Two "guaranteed-crash" properties are deterministic config/clock checks, not search problems +- **Property/slugs:** `aggregate-no-panic-any-window`, `aggregate-clock-skew-stable` (and the + catalog-wide note that bills them as "cheap, high-value first targets"). +- **Concern:** Both crashes are *deterministic given an input you already know the shape of*. They are + not states Antithesis must *discover* through interleaving; they are reproducible with a one-line + unit test. Antithesis's distinctive value (exploring an intractable timing/concurrency space) is + near-zero here. The real value Antithesis adds is narrower than the catalog implies: (a) *config-space + exploration* reaching the `{"secs":0,"nanos":N}` Duration shape the registry advertises, and (b) for + the clock property, a *clock-fault* the deterministic harness cannot inject. Those are genuine but + modest; the crash itself is trivially provable without a fuzzer. 
+- **`aggregate-no-panic-any-window` specifically:** _Update 2026-05-30 — this defect is now **fixed + upstream** (window typed `NonZeroU64`, PR #1772); the analysis below is the historical record of + the original bug. The property survives only as a regression tripwire._ Confirmed deterministic at + the time — `mod.rs:818` + `timestamp % bucket_width_secs` panics on `bucket_width_secs == 0`, and `mod.rs:630` + `step_by(bucket_width_secs as usize)` panics on 0; `bucket_width_secs = bucket_width.as_secs()` + (`mod.rs:553`) truncates any sub-1s window to 0. There is no validation (grep confirms only + definition/default/use sites). This is a single bad config value → guaranteed panic on the first + metric. A `#[test]` constructing `AggregationState::new(Duration::from_millis(500))` and inserting + one metric proves it in milliseconds with zero search budget. **It does not need Antithesis to find + it; it needs a config validator and a unit test.** Verified further: the aggregate transform reads + `window_duration` only at construction (`mod.rs:92-93`, `:213`) — no `ConfigChangeEvent`/subscribe + wiring — so this is a *startup-config-only* vector. The catalog's open question "can the gRPC stream + push this at runtime?" resolves to **no** for this transform (it is not hot-reloaded), which removes + the one angle that would have made it a live runtime-exploration target. +- **Evidence:** `mod.rs:818,630,553,92-93,213`; registry `aggregate.rs:9`; grep showing no + `ConfigChangeEvent` subscription in the aggregate transform. +- **Suggested action:** Keep both as **config-exploration / fault-injection targets** but explicitly + *demote the framing from "headline crash-loop finding" to "cheap config/clock tripwire."* The + primary fix and primary regression guard is a load-time config validator (`window >= 1s`) plus a + unit test — that belongs in the SUT, not in search budget. 
In Antithesis, an + `assert_unreachable` at the `% 0` site is fine to keep (it costs nothing once instrumented and + catches a future runtime-push regression), but do not bill these as where Antithesis "earns its + keep." The clock property retains more Antithesis value than the window property (clock-fault + injection + the *forward-jump flood bound*, which IS a search-worthy quantitative invariant — + see F2). + +### F2. `aggregate-clock-skew-stable` is partly Antithesis-worthy, partly a deterministic check — the assertion bundles both +- **Property/slug:** `aggregate-clock-skew-stable` (property-specific). +- **Concern:** The property folds two very different things under one slug. The *forward-jump + zero-value flood* (`Always(zero_value_buckets.len() <= ceil(flush_interval/window)+slack)`) is a + genuine quantitative invariant that benefits from clock-fault injection during a *live flush race* — + good Antithesis fit. But the *backward-jump silent gap* and the *pre-epoch `unwrap_or_default()==0`* + cases are deterministic: step the clock, observe an empty range — reproducible without search. The + `Always(current_time >= self.last_flush)` monotonicity assertion is essentially a unit-testable + invariant about a single function. Bundling them means the high-value flood-bound assertion shares a + slug with deterministic checks, muddying budget attribution. +- **Evidence:** `mod.rs:627-635` (zero-value loop), `time.rs:21-26` (`unwrap_or_default`), property + file lines 37-55. +- **Suggested action:** Keep the property, but in the workload prioritize the **forward-jump flood + bound** as the search-worthy assertion (it interacts with flush timing and memory). The + monotonicity/backward-gap facets are worth an `assert_always`/`Sometimes` but recognize they would + be caught by a deterministic clock-step test too; they ride along cheaply once clock-fault is wired. + +### F3. 
`ddsketch-relative-error-bound` is pure library/proptest territory — it is not an Antithesis property +- **Property/slug:** `ddsketch-relative-error-bound` (property-specific). +- **Concern:** The property file's own Investigation Log resolves that **ADP never calls + `DDSketch::quantile` on the live customer path** — histogram percentiles use raw-sample + `HistogramSummary::quantile` (exact, not a DDSketch bound) and distribution sketches ship raw bins to + Datadog, quantiled server-side. The only runtime `quantile` caller is the prometheus *internal- + telemetry* destination. So `Always(|q_est - v| <= eps_rel*|v|)` cannot be anchored to any production + runtime call. The accuracy half is a *mathematical invariant of a pure function* over arbitrary + inputs — the textbook definition of a proptest, not a state-space-exploration target. There is no + timing, concurrency, partial-failure, or live-state dimension to it. +- The merge-associativity half ("merge order under interleaving") *sounds* Antithesis-flavored, but + the merges happen inside the single-task aggregate transform; the f64 reordering it worries about is + deterministic for a given merge order and is again a proptest over `merge(A,merge(B,C))` vs + `merge(merge(A,B),C)`. Antithesis would have to construct the same input sets a proptest constructs, + with worse shrinking. +- **Evidence:** property file Investigation Log lines 89-128 (no live quantile call); `lib.rs:56` + (agent sketch re-export); `prometheus/mod.rs:346` (the lone non-production caller). +- **Suggested action:** **Remove from the Antithesis catalog; reassign to proptest/unit territory** (a + Hegel/proptest test over the agent `DDSketch` accuracy + merge associativity). If anything from this + property survives into Antithesis, it is already covered by `ddsketch-bin-count-bounded` (bin + serialization fidelity) and `ddsketch-no-nan-poison` (sum/avg sanity). 
Demoting to Medium (as the + catalog did) understates the problem — its assertion type does not match the testing mode at all. + Keeping it as a live `Always` is misleading because there is no live call to assert against. + +### F4. `ddsketch-bin-count-bounded` substantially duplicates existing proptests; Antithesis value is a thin regression tripwire +- **Property/slug:** `ddsketch-bin-count-bounded` (property-specific). +- **Concern:** Verified that strong proptest coverage already exists: + `prop_bin_count_never_exceeds_limit` at `sketch.rs:925`, plus `prop_output_bins_are_sorted_and_distinct`, + `prop_output_keys_are_highest_from_input`, and unit tests including + `trim_left_respects_bin_limit_with_large_weights`. The invariant is *structurally enforced* — + `trim_left` runs after every mutating method. Antithesis cannot explore an input space the proptest + doesn't already cover for the *math*; it generates `i16` keys × `u32::MAX` weights already. The + genuinely non-duplicative value is narrow and real: a **live regression tripwire** that fires if a + *future mutator* (a new insert helper, a new merge path, the future "merge agent-shipped sketches" + use case) forgets to call `trim_left` — something a proptest on the *existing* methods cannot catch + because it tests the methods that exist today. +- **Evidence:** `sketch.rs:919-1023` (proptests), `sketch.rs:255,319,358,579` (trim_left at every + mutation site), property file lines 38-40, 60-63. +- **Suggested action:** **Keep, but reframe and down-weight.** It is NOT a state-space-exploration win; + it is a cheap *always-on guard against forgetting `trim_left`*. The assertion (`Always(bins.len() <= + 4096)` at the end of every mutator) is correct and basically free once the SDK is in. But the catalog + lists it **High** priority — that is too high for a property that mostly re-states an existing + proptest. 
Recommend **Medium**: the marginal value over proptests is the live-mutator-regression + tripwire only, and the `Reachable("trim_left collapsed bins")` anchor is essential or the `Always` is + vacuous on real production sketches (which rarely exceed 4096 bins under normal cardinality). + +### F5. `config-incompatible-refuses-start` is a deterministic gate check that the existing integration suite already exercises +- **Property/slug:** `config-incompatible-refuses-start` (property-specific). +- **Concern:** The gate is a single, *deterministic, ordered control-flow check*: `check_and_warn_config` + at `run.rs:157` returns `Err` → `exit(1)` *before* `create_topology`/`build`/`spawn`. There is no + timing, concurrency, or partial-failure dimension — given the config, the outcome is fixed. The + property file itself notes the existing **integration suite already has "config-check exit codes" + cases** (sut-analysis §6). The proposed `assert_unreachable("pipeline spawned with high-severity key")` + after `spawn()` is *statically unreachable* (the `?` already returned) — it is a belt-and-suspenders + guard against a future reordering regression, not a search target. Antithesis adds essentially nothing + over a parameterized integration test that feeds N high-severity keys and asserts exit 1. +- **Evidence:** `run.rs:157,331-381`, `main.rs:136-146`; integration "config-check exit codes" cases + (sut-analysis §6); property file lines 20-49. +- **Suggested action:** Keep as a **cheap config-exploration target** (the SDK markers cost nothing once + added, and config-space exploration can mutate which key/value is injected), but **demote from High**. + The deterministic gate is already covered by integration tests; Antithesis's only marginal add is + fuzzing *which* key triggers it. Most of the catalog's listed value here is duplicative. + +### F6. 
`config-stall-no-deadlock`: the high-value falsification target was retracted; remaining content is a quiescent-hang check with a weak in-process assertion +- **Property/slug:** `config-stall-no-deadlock` (property-specific). +- **Concern:** The property's own Investigation Log **retracts the busy-loop hazard** ("downgrade it + from highest-value falsification target to a non-issue") because tonic terminates the stream after one + error → 5s backoff. What remains is: drop the snapshot → ADP hangs *quiescently forever* at + `ready().await` (no timeout, `lib.rs:694-704`). That is a real and interesting liveness finding, and + detecting "indefinite quiescent hang vs progress" is reasonable Antithesis fit (timing of the config + stream is the explored dimension). BUT the assertion is weak: there is no clean in-process assertion + for "stalled forever" — the file admits the busy-loop guard is "best caught workload-side (monitor + CPU/log rate)" and that `Always(no panic)` is "implicit." So the property reduces to two reachability + markers (`Sometimes(config received)`, `Reachable(wait entered)`) plus an out-of-band CPU monitor. + The `Sometimes(config received)` is trivially satisfiable in the happy path and proves little; the + load-bearing "hang is quiescent" check is not an SDK assertion at all. +- **Evidence:** property file lines 103-118 and Investigation Log 120-187; `lib.rs:694-704` (no timeout). +- **Suggested action:** Keep — the no-timeout hang is a genuine operational finding worth demonstrating + — but **be honest that the verifiable artifact is a workload-side CPU/log-rate liveness check, not an + SDK invariant.** Consider whether the real recommendation is a SUT change (add a diagnostic timeout) + rather than a test. As written, Antithesis "proves" the hang is quiescent, which is a weak property + (it cannot prove "never makes progress" — only observe it didn't within the run). + +### F7. 
`data-component-failure-triggers-process-shutdown` — the `Always` is a temporal property the SDK cannot express in-process; the `Reachable` is the only clean assertion +- **Property/slug:** `data-component-failure-triggers-process-shutdown` (property-specific). +- **Concern:** The defensible invariant ("component death is *always* followed by process exit") is a + **temporal** property across the process lifetime. The property file itself concedes "To express the + Always invariant in-process is awkward (it is enforced by control flow)" and falls back to "a + workload-side temporal assertion" via Antithesis query-logs. So the in-process artifact is just + `assert_reachable` on the shutdown arm (`run.rs:280-283`) — which proves the path *exists*, not that + it *always* fires. Inducing the death is genuinely Antithesis-flavored (panic injection, sub-second + window, clean early finish), so the property has real fit; the concern is that the catalog's `Always` + framing oversells what the SDK can check. The actual guarantee is structural (one `JoinSet`, one + `wait_for_unexpected_finish` arm) and would be better unit-tested at the topology level for the + control-flow part. +- **Evidence:** property file lines 96-108; `running.rs:40-51`, `run.rs:280-283`. +- **Suggested action:** Keep (fault-induced component death is good Antithesis fit), but split the + claim: the `Reachable(death→shutdown path)` is the legitimate SDK assertion; the `Always(death⇒exit)` + is a **query-logs temporal check**, not an `assert_always`. Make that explicit so the synthesizer + doesn't double-count it as an in-process invariant. + +### F8. 
Source-side panic/divide-by-zero properties depend on whether a data-component panic actually crashes the *process* — verify the fail-stop chain or the no-crash assertions are unfalsifiable +- **Property/slugs:** `aggregate-no-panic-any-window`, `malformed-dsd-no-crash`, + `data-component-failure-triggers-process-shutdown`, `source-dispatch-no-misroute` (catalog-wide + interaction). +- **Concern:** Several "no crash" / "crash is caught" properties hinge on a panic in a *data component + task* propagating to a process exit Antithesis can observe. The fail-stop model says a panicking + component → `JoinError` → `wait_for_unexpected_finish` → whole-process shutdown → s6 restart. But + Antithesis observes a *container that never exits* (s6 restarts ADP in-place, per sut-analysis §6 and + deployment-topology). If the container masks the exit, an `Always(process up)` workload assertion + (`malformed-dsd-no-crash`) is **trivially satisfied even when ADP is crash-looping** — the container + stays up. This is a catalog-wide soundness risk for every "no crash" property whose assertion is + workload-side "process up." The catalog half-acknowledges this (it routes crash detection through + SUT-side `assert_unreachable` at panic sites) but the deployment topology's "process up" framing for + `malformed-dsd-no-crash` (`Always(process up)`) is exactly the unfalsifiable shape under s6. +- **Evidence:** sut-analysis §6 ("container s6 supervisor restarts ADP on exit, so the container never + actually exits"); `malformed-dsd-no-crash` invariant `Always(process up)`. +- **Suggested action:** Catalog-wide: make every "no crash" assertion SUT-side (`assert_unreachable` at + the panic site) **and/or** assert against a restart-count / uptime telemetry, NOT "container process + up." Flag for the workload author that s6 auto-restart can make `Always(process up)` vacuously pass. + This is the single most important cross-cutting fit hazard. + +### F9. 
`replay-corruption-not-silent-eof` is a data-fidelity property with no fault/timing dimension — input-mutation only, marginal over a fuzz/unit test +- **Property/slug:** `replay-corruption-not-silent-eof` (property-specific). +- **Concern:** The "bad thing" is *silent truncation reported as success* — a deterministic function of + the input bytes (`reader.rs:84-104`). Given a corrupt length prefix, `read_next` returns `Ok(None)` + every time; there is no timing, concurrency, or partial-failure dimension. The property file even notes the current + tests *assert* this behavior intentionally and that distinguishing truncation from clean EOF "may + need a format change." So Antithesis here is just an input fuzzer over a pure parser, and the + assertion (`AlwaysOrUnreachable(faithful completion)`) presupposes a SUT change that doesn't exist. + This is unit/proptest territory (corrupt-prefix → expected outcome), not state-space exploration. +- **Evidence:** property file lines 21-43, 89-92; `reader.rs:84-104`. +- **Suggested action:** Keep at **Medium or lower**, but recognize it as a **fuzz/proptest target on the + reader**, riding the same adversarial-capture corpus as `replay-no-panic-on-malformed-capture`. Its + Antithesis-specific value is near-zero beyond bundling with the panic property; the real deliverable + is a maintainer decision (is silent truncation acceptable?) plus a format/telemetry change. + +### F10. Several "Sometimes" anchors risk vacuity or astronomically-unlikely satisfaction; audit reachability budget +- **Property/slugs:** catalog-wide, sharpest on `interner-reclamation-no-corruption`, + `ddsketch-bin-count-bounded`, `forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`. +- **Concern:** Many properties pair a safety `Always`/`Unreachable` with a `Sometimes` that must be hit + for the safety claim to be non-vacuous. 
Two distinct hazards: + (a) **Hard-to-reach `Sometimes` → vacuous `Always`.** `ddsketch-bin-count-bounded`'s + `Always(bins<=4096)` is vacuous unless `Reachable("trim_left collapsed bins")` actually fires — but + real production sketches under normal cardinality rarely exceed 4096 distinct keys, so the collapse + path may never trigger without a deliberately pathological corpus. If the workload doesn't force it, + the property passes while proving nothing. + (b) **`Sometimes` requiring a precise race.** `interner-reclamation-no-corruption` needs + `Sometimes(drop re-check found resurrected entry)` — the `is_active()` re-check at `map.rs:459` + returning true. That is a narrow decrement→lock window. It is *exactly* what Antithesis is good at + (so fit is good), but the workload must run a near-full interner with heap-fallback **off** and high + churn or the contended path is never pressured (heap-fallback default true defuses it — property file + config-deps line 67-69). If the workload uses defaults, the `Sometimes` never fires and the safety + `Always` is vacuous. +- **Evidence:** `ddsketch-bin-count-bounded` lines 75-76; `interner-reclamation-no-corruption` lines + 72-80, config-deps 67-71. +- **Suggested action:** Catalog-wide: for every safety property gated by a `Sometimes`, the workload + MUST include the configuration/corpus that forces the anchor (small interner, heap-fallback off, + high-cardinality corpus that exceeds 4096 keys). Recommend the synthesizer add a "vacuity guard" + column: each `Always`/`Unreachable` is only meaningful if its paired `Sometimes` is *demonstrated* + reached in the run report. Properties whose `Sometimes` is unreached should be reported as + inconclusive, not passing. + +### F11. `memory-limiter-survives-rss-read-failure` is good fit but its priority is underestimated and gated behind a custom fault +- **Property/slug:** `memory-limiter-survives-rss-read-failure` (property-specific) — *underestimated*. 
+- **Concern:** This is a textbook Antithesis property — a *partial-failure* (mid-run `/proc` read + failure) producing a *silent* fail-open (frozen backoff at 0) that no deterministic test reaches; the + damaging case is a *race* (reads fail *before* RSS crosses threshold). The bare `std::thread` death + doesn't trigger process shutdown, so it is invisible. Yet the catalog rates it **Medium** ("High if + RSS reads can realistically fail post-startup"). Given that the whole memory family is "fails by + design under defaults," the *one* runtime protection silently vanishing is arguably the highest-stakes + partial-failure in Category A. The Medium rating undersells it. The countervailing fact: it requires a + **custom `/proc` fault** (deployment-topology flags it as "Custom; must script") and the limiter must + be *explicitly enabled* (default Disabled), so reachability is conditional. +- **Evidence:** `limiter.rs:99-122` (`.expect()` in the steady-state loop), property file lines 36-50; + deployment-topology fault table (custom `/proc` fault). +- **Suggested action:** Raise to **High conditional on the custom `/proc` fault being scriptable on the + tenant** and on the limiter being enabled in that workload variant. If the custom fault cannot be + built, the property is *unreachable* and should be reported as such rather than silently passing + (ties to F8/F10 vacuity concerns). This is a case where Antithesis value is *underestimated* by the + catalog's Medium tag. + +### F12. `source-dispatch-no-misroute` — the misroute is "structurally improbable," making the `Unreachable` likely vacuous; the real live hazard (silent uncounted loss) is relegated to a sub-clause +- **Property/slug:** `source-dispatch-no-misroute` (property-specific). 
+- **Concern:** The property file's own analysis (lines 33-45) concludes that with the current + `extract`-then-`send_all` ordering, **misroute is structurally impossible** — `extract` removes + matching events by predicate and recomputes `seen_event_types` before any send. So the headline + `Unreachable(misroute)` is *expected to be vacuously unreachable today*; it only guards a future + refactor. The genuinely live, Antithesis-reachable hazard is the **sub-clause**: a `send_all` failure + drops the extracted events and there is likely **no counter** for it ("possibly fully silent — a + finding," lines 102-104). That silent-loss-under-downstream-error path IS a partial-failure worth + exploring, but it is buried as clause (B) under a headline that will read as a vacuous pass. +- **Evidence:** property file lines 33-45, 84-87, 102-104; `mod.rs:1667-1716`. +- **Suggested action:** Re-center the property on the **silent-uncounted-loss-on-dispatch-failure** + facet (which overlaps `no-silent-interconnect-drop`) and treat the misroute `Unreachable` as a + cheap future-regression guard explicitly expected to be unreached. As written, the High-value reading + (misroute) is the vacuous one and the live reading (silent loss) is the footnote. + +--- + +## Passes (properties whose Antithesis fit and assertion type are sound) + +- **`rss-bounded-under-cardinality`** — Burst-vs-250ms-sample race + cooperative-only backoff is a + timing space no deterministic test reaches; `Always(rss <= grant*tol)` with SUT-side + `Sometimes(backoff_applied && rss_still_climbing)` is the right shape. Strong fit. (Caveat: needs the + limiter enabled and a real RSS reading not masked by the container — relates to F8.) 
+- **`retry-queue-bounded-under-outage`** — Sustained-outage saturation + shared circuit-breaker + + per-endpoint queues + disk eviction is genuine partial-failure/timing territory; `Always(bytes<=cap)` + + `Sometimes(items_dropped>0)` correctly pairs a true invariant with a non-vacuity anchor. Strong fit. +- **`no-silent-interconnect-drop`** — Backpressure-vs-drop under a slow downstream is exactly an + interleaving/partial-failure exploration; `Always(discarded delta==0)` on a wired edge + `Sometimes + (backpressure engaged)` is well-matched. Strong fit. +- **`forwarder-eventual-delivery`** — Liveness after a 5xx/timeout/reset storm then recovery; correctly + typed as `Sometimes(all-delivered-after-recovery)` (liveness → progress, not instantaneous Always). + Needs a quiet/heal window — fit is good. Strong fit. +- **`disk-persisted-retry-survives-restart`** — SIGKILL mid-outage + restart + reconcile + poison-file + injection is the canonical Antithesis crash-durability scenario; the at-most-once slack caveat and the + `Reachable(persistence-active)` non-vacuity guard are correctly identified. Strong fit. +- **`shutdown-drains-no-loss`** / **`graceful-shutdown-within-30s`** — Shutdown under load with a slow + intake pushing past the 30s boundary is a timing race; correctly typed as `Sometimes(clean-drain)` + + `AlwaysOrUnreachable(timeout⇒forceful)`. Good fit. (Minor: the two slugs overlap; the catalog already + carves who-owns-what cleanly.) +- **`malformed-dsd-no-crash`** — Adversarial packet input across 4 listener types exploring codec/ + framing error paths is good fuzz+fault fit; the SUT-side `Unreachable` at codec panic sites is the + right artifact. Pass *with the F8 caveat* (don't anchor on container "process up"). +- **`replay-no-panic-on-malformed-capture`** — Untrusted-file fuzzing of an unfuzzed, zero-coverage + path with confirmed OOM/zstd-bomb vectors; isolated in the replay CLI process so a panic IS + observable (exit). 
Good fit; the SUT-side `Unreachable` at panic sites + `Reachable(parse executed)` + is correct. +- **`interner-reclamation-no-corruption`** — Real-scheduler exploration beyond the bounded loom model is + precisely Antithesis's edge over loom; the overlap/sentinel `Unreachable` is the right artifact. Pass, + *conditional on the F10 workload config* (small interner, heap-fallback off) or the `Sometimes` is + vacuous. +- **`aggregate-context-limit-enforced`** — Cardinality-flood × flush-timing × zero-value keep-alive + interleaving to hit/clear the breach flag is timing-sensitive; `Always(len<=limit)` + `Sometimes + (breached)` + `Sometimes(events_dropped)` is well-formed. Pass. +- **`interner-full-bounded`** — Fill-the-buffer timing + concurrent intern-vs-drop on the reclamation + path; the heap-on/heap-off branch split with matched `Sometimes` anchors is correct. Pass. +- **`aggregate-matches-agent`** — Differential property under faults (delayed flush, restart mid-window, + backpressure) the deterministic `panoramic` harness cannot inject; correctly anchored on the existing + diff harness as a `finally_`/quiet-window check, not an in-process assertion. Good fit (heavy; its own + topology). Pass. +- **`ddsketch-no-nan-poison`** — Confirmed LIVE bypass via checks_ipc Histogram → encoder `insert_n`; + the SUT-side `Always(sum/avg finite)` at the sketch boundary is correct. *Note:* the poisoning itself + is deterministic given a NaN input (so the "find it" value is modest), but routing a NaN through a + realistic checks_ipc producer and proving it reaches the encoder boundary across the live topology is + more than a unit test. Pass, leaning toward "needs the checks_ipc feeder or it's a unit assertion" + (deployment-topology already flags this). 
+- **`non-finite-values-handled-consistently`** — Adversarial all-NaN/Inf packets across metric types; + the honest framing (ghost-metric expected Unreachable on DSD path, NaN-poison live via non-DSD) is + correct and the assertion types match. Pass. +- **`topology-ready-before-intake`** — Stalling a downstream's readiness / failing a supervisor child is + a timing/partial-init exploration; the honest narrowing to "readiness-milestone ordering" (not + "no byte read pre-ready") keeps the `Always` falsifiable. Pass. +- **`config-runtime-update-not-revalidated`** — Inject a high-severity key over the runtime stream after + a clean start; this is a control-plane→data-plane path the diff-test never touches, and the + Reachable/Unreachable framing matches the open design question. Reasonable fit (Medium is right). + +--- + +## Uncertainties + +- **U1 (F8 severity):** I have not confirmed *how* the Antithesis harness observes ADP process exit + under the container s6 supervisor — whether snouty/Antithesis sees the inner process restart or only + the container. If Antithesis can see inner-process restarts (restart count), the "no crash" workload + assertions are salvageable as-is; if it only sees the container, every `Always(process up)` is + vacuous. This determines whether F8 is a catalog-wide blocker or a workload-wording nit. sut-analysis + §6 strongly implies the container masks exits ("never actually exits"), so I lean toward blocker, but + did not verify the harness's process-observation granularity. +- **U2 (F3 scope):** I treated `ddsketch-relative-error-bound` as fully non-Antithesis based on the + property's own resolved finding that quantile is never called live. If a future ADP change starts + querying quantiles in-process (e.g. a new local-rollup feature), the property would re-acquire live + fit. Flagging that the "remove from catalog" recommendation is contingent on the current no-live-call + fact remaining true. 
+- **U3 (F1 fix direction):** Whether `aggregate-no-panic-any-window` should be fixed by clamp vs reject + vs sub-second support changes whether the surviving Antithesis assertion is `Always(window>=1)` or + `Unreachable(% 0)`. Unresolved in the catalog ("needs human input"); my "demote to config-validator + + unit test" recommendation holds regardless of fix direction, but the exact SDK assertion depends on it. +- **U4 (custom-fault availability):** F11's priority bump for `memory-limiter-survives-rss-read-failure` + is conditional on a scriptable `/proc` RSS-read fault. I could not confirm the tenant supports custom + faults of this kind; deployment-topology marks it "Custom; must script." If unavailable, the property + is unreachable, not Medium. +- **U5 (Sometimes-reachability of bin collapse):** F10(a) assumes real production sketches rarely exceed + 4096 distinct keys under normal cardinality. I did not measure a realistic millstone corpus's per- + sketch key count; if the high-cardinality corpus routinely blows past 4096, the `Reachable(collapse)` + anchor fires naturally and the vacuity concern for `ddsketch-bin-count-bounded` is reduced. diff --git a/test/antithesis/scratchbook/evaluation/coverage-balance.md b/test/antithesis/scratchbook/evaluation/coverage-balance.md new file mode 100644 index 00000000000..6dbc4e54281 --- /dev/null +++ b/test/antithesis/scratchbook/evaluation/coverage-balance.md @@ -0,0 +1,329 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space. +--- + +# Coverage Balance Evaluation — ADP Property Catalog (27 properties) + +Lens: evaluate the catalog as a **portfolio**. 
Walk `sut-analysis.md` section by section; for +each risk area check whether a property covers it, whether low-risk areas are over-invested, and +whether the assertion-type mix (safety / liveness / reachability) is appropriate. Cite slugs and +sut-analysis sections. Evidence is grounded in the SUT tree at the pinned commit. + +## Method / what I checked against source + +- Read all four scratchbook artifacts and all 27 property slug files (present in + `test/antithesis/scratchbook/properties/`). +- Re-derived the live topology from `bin/agent-data-plane/src/cli/run.rs` (`create_topology`, + `add_*_pipeline_to_blueprint`, lines 414-758) and `bin/agent-data-plane/src/config.rs:100-156` + (`*_pipeline_required`). +- Spot-confirmed: events/service_checks are always-wired when DogStatsD is on + (`config.rs:135-145`); checks_ipc Histogram has no finiteness guard + (`sources/checks_ipc/mod.rs:195`); host_tags lives in `bin/agent-data-plane/src/components/host_tags/`; + OTLP/traces/logs/APM pipelines are all wired in `run.rs`. + +## Type distribution (portfolio shape) + +Counting primary type tags across the 27 properties (compound tags counted by their lead type): + +- **Safety-dominant:** ~19 properties are Safety or Safety+X. +- **Liveness-dominant:** 6 (`forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`, + `shutdown-drains-no-loss`, `topology-ready-before-intake`, `config-stall-no-deadlock`, + `graceful-shutdown-within-30s`). +- **Reachability** appears only as a *secondary* clause (paired with Safety/Liveness); there is no + standalone reachability property. + +Assessment: the safety/liveness split is reasonable and matches the headline guarantee (no-crash = +safety, no-loss = both). The portfolio is appropriately weighted toward the two headline families. +The **reachability** dimension is structurally thin — it exists only as `Sometimes`/`Reachable` +anti-vacuity riders. 
That is acceptable for most properties, but see Findings 1 and 5: several +*event/service-check/enrichment* paths have **no reachability anchor at all**, meaning a workload +could pass the whole catalog without ever exercising them. + +--- + +## FINDINGS + +### Finding 1 — Events & service-checks intake→enrich→encode→forward path is essentially uncovered, despite being an always-on production path + +- **Property/slugs:** catalog-wide (gap). Tangentially touched by `source-dispatch-no-misroute` + (Medium) and `shutdown-drains-no-loss`, but neither asserts events/SC delivery correctness. +- **Concern:** The catalog is metrics-heavy. When DogStatsD is enabled (the production config), + `events_pipeline_required()` and `service_checks_pipeline_required()` are **both true** + (`config.rs:135-145`), so `dsd_in.events → events_enrich → dd_events_encode → dd_out` and + `dsd_in.service_checks → service_checks_enrich → dd_service_checks_encode → dd_out` are live, + always-wired edges (`run.rs:681-684`). These are separate codecs + (`lib/saluki-io/src/deser/codec/dogstatsd/event.rs` 394 LOC, `service_check.rs` 312 LOC, each with + only 7-8 tests) and separate encoders. No property asserts: + (a) events/SC parse robustness (malformed event/SC packets — `malformed-dsd-no-crash` is scoped to + the codec generally but its angle and open questions are metric-framing-centric), + (b) events/SC eventual delivery or no-silent-loss (the Cat B liveness/safety properties name only + metrics/forwarder transactions), + (c) the events/SC encoders' own panic/backpressure behavior. +- **Scope:** Two always-on production sub-pipelines. +- **Evidence brief:** `run.rs:681-684`, `config.rs:135-145`; codec test counts above; sut-analysis §2 + (DSD pipeline diagram explicitly shows the events/service_checks branches) — the analysis names + them but no property family adopts them. 
+- **Suggested action:** A targeted discovery pass on the event/service_check codecs + encoders: + malformed event/SC framing (`_e{...}` and `_sc|...` shapes), no-silent-loss on the events/SC edges + under backpressure, and an eventual-delivery facet. At minimum add an event/SC `Sometimes(parsed)` / + `Sometimes(delivered)` reachability anchor so a metrics-only workload doesn't pass vacuously. + +### Finding 2 — Entire trace/APM and logs pipelines have zero properties (component blind spot vs. topology) + +- **Property/slugs:** catalog-wide (gap). +- **Concern:** `run.rs` wires a full **traces/APM pipeline** (`traces_enrich` with `ottl_filter`, + `ottl_transform`, `apm_onboarding`, `trace_obfuscation`, `trace_sampler`; `dd_apm_stats`, + `dd_stats_encode`, `dd_traces_encode`; `run.rs:551-591`) and a **logs pipeline** + (`dd_logs_encode`; `run.rs:506-521`). The catalog has **no property** touching traces, APM stats, + OTTL transforms, trace obfuscation/sampling, or logs encoding. The deployment-topology doc also + never mentions them. This is the largest component blind spot relative to the actual topology. +- **Scope:** Multiple transforms + encoders + a forwarder fan-in to `dd_out`. +- **Evidence brief:** `run.rs:441-457, 506-521, 551-591`; sut-analysis §2 mentions only the DSD/OTLP + shape and does not enumerate traces/APM/logs — so this is a gap that exists in *both* the analysis + and the catalog. +- **Suggested action:** Decide explicitly whether traces/APM/logs are **in scope** for ADP's first + customers (they may be gated off by default — `traces_pipeline_required` only fires for OTLP-native; + logs only for checks/OTLP-native). If out of scope, document the exclusion in the catalog's + catalog-wide notes so it's a *deliberate* boundary, not a silent omission. 
If in scope, at least + one no-crash/no-silent-loss property is warranted for the APM-stats and obfuscation transforms + (string-heavy, SQL-parsing obfuscation in `trace_obfuscation/sql.rs` is a classic untrusted-input + hazard). + +### Finding 3 — OTLP pipeline (native + proxy/relay) is named but uncovered; only referenced as a "closed path" in NaN reasoning + +- **Property/slugs:** `ddsketch-no-nan-poison`, `non-finite-values-handled-consistently` (mention OTLP + only to argue it is *closed*); no property *asserts* OTLP behavior. +- **Concern:** `add_otlp_pipeline_to_blueprint` (`run.rs:700-758`) has two distinct modes — **native** + (`otlp_in` source → metrics_enrich/dd_logs_encode/traces_enrich) and **proxy/relay** + (`otlp_relay_in` relay + `local_agent_otlp_out` forwarder to the Core Agent, with a separate + `otlp_traces_decode` decoder path). OTLP is an *untrusted-input gRPC/protobuf parse surface* + analogous to the replay reader, and it deliberately bypasses aggregation (`run.rs:751-753`). The + catalog treats OTLP only as a finiteness-closed branch; there is no malformed-OTLP-no-crash, no + OTLP-relay-forwarder-delivery, and no property on the relay→forwarder edge. +- **Scope:** A second untrusted-input source + a second forwarder (to Core Agent, not Datadog intake). +- **Evidence brief:** `run.rs:700-758`; catalog mentions "OTLP is closed" in + `ddsketch-no-nan-poison` resolved-question and `non-finite-values-handled-consistently` §Resolved. +- **Suggested action:** Targeted pass: is OTLP enabled for the design partner? If yes, an + OTLP-equivalent of `malformed-dsd-no-crash` (malformed protobuf / oversized OTLP request) and a + relay-forwarder delivery property are warranted. The `local_agent_otlp_out` forwarder is a + *second* egress path the entire Cat B data-loss family ignores. 
+ +### Finding 4 — DSD transform chain (mapper / prefix-filter / tag-filterlist / post-agg-filter) has no correctness property despite a documented ordering-bug history + +- **Property/slugs:** catalog-wide (gap). `aggregate-matches-agent` (Safety, High) would catch a + gross divergence end-to-end but is anchored on the `panoramic` diff harness on happy-path load and + is explicitly a *separate, optional* run (deployment-topology Add-on 2); it is not a targeted + transform-ordering check. +- **Concern:** sut-analysis §8 calls out **"moved DSD prefix/filter in front of enrich (pipeline + ordering bug)"** as a notable correctness fix in churn history. The live order is + `dsd_enrich(mapper) → dsd_prefix_filter → dsd_tag_filterlist → dsd_agg → dsd_post_agg_filter` + (`run.rs:674-679`). Tag filtering happens both pre- and post-aggregation. None of the 27 properties + asserts transform-ordering invariants or mapper/filter correctness (e.g. a metric that should be + prefix-dropped is never aggregated; a mapped name is mapped before filtering). This is a + bug-history item (the lens explicitly flags it) that did not map to a property. +- **Scope:** Four chained transforms on the hottest metrics path, with regression history. +- **Evidence brief:** `run.rs:638-679`; sut-analysis §8 "moved DSD prefix/filter in front of enrich". +- **Suggested action:** Either (a) explicitly fold transform-ordering correctness into + `aggregate-matches-agent`'s scope (the diff harness *would* catch a reordering regression if the + workload includes prefix-filtered / mapped / tag-filtered metrics — confirm the `millstone` corpus + does), or (b) add a focused ordering property. Currently it relies on an optional, happy-path, + separate-run diff test — disproportionately weak for a known regression hotspot. + +### Finding 5 — Origin detection / workload-tagger enrichment correctness is uncovered + +- **Property/slugs:** catalog-wide (gap). 
`source-dispatch-no-misroute` touches the source but not + enrichment/tagging. +- **Concern:** sut-analysis §2 and §9 highlight origin detection via **UDS peer credentials** + (credential errors counted but *do not drop the packet*) and the workload-tagger/workloadmeta + enrichment. `host_enrichment` and `host_tags` transforms run on every metric/event/SC + (`run.rs:482-489, 648-655`); `origin.rs` resolves origin tags. No property asserts enrichment + *correctness* (right tags attached, no cross-contamination of origin between concurrent packets on + a shared socket, behavior when peer-cred lookup fails). Given multi-tenant origin detection is a + correctness-critical and concurrency-sensitive area (per-packet credential lookup under a shared + UDS listener), the zero-property coverage is a disproportionate gap. +- **Scope:** Origin resolver + two enrichment transforms, all on the hot path. +- **Evidence brief:** `sources/dogstatsd/origin.rs`, `transforms/host_enrichment/mod.rs`, + `bin/.../components/host_tags/`; sut-analysis §2 "Origin detection uses UDS peer credentials; + credential errors are counted but do not drop the packet", §9. +- **Suggested action:** Discovery pass on origin/tagger enrichment: (a) does a peer-cred failure + ever attach *another* connection's origin tags (cross-tenant tag leak — a silent data-corruption + hazard)? (b) is host_tags' gRPC dependency (it is built `from_configuration().await` and only in + non-standalone mode, `run.rs:486-489`) able to hang/deadlock enrichment if the tagger stream + stalls — analogous to `config-stall-no-deadlock`? Note: the **primary topology uses UDP/TCP, not + UDS** (deployment-topology), so origin detection is *structurally unexercised* by the primary run — + a second blind spot the topology choice creates. + +### Finding 6 — dsd_stats statistics tap and dsd_debug_log destination have no property + +- **Property/slugs:** catalog-wide (gap). 
`no-silent-interconnect-drop`'s open question asks whether + `dsd_stats_out`/`dsd_debug_log_out` can have zero senders — i.e. the catalog *noticed* these + destinations but did not adopt them. +- **Concern:** `dsd_stats_out` is wired off `dsd_in.metrics` unconditionally (`run.rs:686`); + `dsd_debug_log_out` conditionally (`run.rs:688-693`). These are extra fan-out consumers on the + busiest output (`dsd_in.metrics` fans to dsd_enrich, dsd_stats_out, and optionally + dsd_debug_log_out). Per sut-analysis §4, fan-out delivers *sequentially* and a slow consumer stalls + the others — so a slow/blocked `dsd_stats_out` or `dsd_debug_log_out` could backpressure the entire + metrics intake. No property tests this fan-out-stall hazard on these taps. +- **Scope:** Two destinations on the primary metrics output's fan-out. +- **Evidence brief:** `run.rs:672, 686, 688-693`; sut-analysis §4 "an output with N senders … one + slow consumer stalls delivery to the others"; `no-silent-interconnect-drop` open question. +- **Suggested action:** Either fold the dsd_stats/debug-log fan-out into + `no-silent-interconnect-drop`'s scope (assert a blocked tap backpressures rather than drops, and + resolve its own open question about zero-sender cases), or add a fan-out-stall reachability anchor. + Low-cost since it rides the primary topology. + +### Finding 7 — SO_REUSEPORT UDP autoscaling has no property + +- **Property/slugs:** catalog-wide (gap). +- **Concern:** sut-analysis §2 calls out `SO_REUSEPORT` UDP autoscaling on Linux + (`sources/dogstatsd/mod.rs:667-686`, also in `lib/saluki-io/src/net/listener.rs` and + `net/unix/linux.rs`). Multiple worker sockets bound to the same port is a concurrency/sharding + surface: packet distribution across workers, per-worker buffer-clear-and-continue interacting with + shared codec state, and worker count scaling under load. No property addresses multi-listener + behavior; `malformed-dsd-no-crash` implicitly assumes a single listener. 
+- **Scope:** UDP listener sharding (Linux production default for high-throughput DSD). +- **Evidence brief:** `sources/dogstatsd/mod.rs` REUSEPORT refs; sut-analysis §2. +- **Suggested action:** Confirm whether REUSEPORT autoscaling is on by default and at what worker + count; if multi-worker, a no-crash / no-loss property under concurrent multi-socket load is + warranted (also strengthens `malformed-dsd-no-crash` and `interner-reclamation-no-corruption`, + which assume the real scheduler but a single read loop). + +### Finding 8 — Internal supervisor / control-plane restartability is not a property (only the fail-stop data side is) + +- **Property/slugs:** `data-component-failure-triggers-process-shutdown` covers the *data* side; + no property covers the *supervised internal* side. +- **Concern:** sut-analysis §2 stresses the **crucial split**: the internal supervisor (control + plane, internal telemetry, env/workload) *is* restartable (OneForOne/OneForAll bounded by + intensity/period), but the data topology is fail-stop. The catalog has a strong property for the + fail-stop side but **nothing** asserting the supervised side actually restarts correctly within + its intensity/period bound, or that exceeding intensity escalates (does a crash-looping internal + child eventually take down the process, or spin forever?). `graceful-shutdown-within-30s`'s open + question even notes "the internal supervisor shutdown has **no timeout**" — an unguarded path with + no property. +- **Scope:** The entire restartable half of the supervision model. +- **Evidence brief:** `bin/agent-data-plane/src/internal/mod.rs`, `internal/control_plane.rs`, + `runtime/supervisor.rs`; sut-analysis §2 "Erlang/OTP-style Supervisor … OneForOne/OneForAll … + bounded by intensity/period"; `graceful-shutdown-within-30s` open question (no internal-supervisor + timeout). 
+- **Suggested action:** Add a property: induce an internal-supervisor child failure (telemetry / + workload / control-plane) and assert (a) it restarts within the intensity/period bound + (`Sometimes(child restarted)`), (b) exceeding intensity escalates to a bounded outcome (not an + infinite restart spin), and (c) the no-timeout internal shutdown does not let total shutdown exceed + the operational expectation. This is the complement to the fail-stop property and is currently the + most under-covered architectural half. + +### Finding 9 — API-key rotation mid-run surviving in the retry queue is not a property + +- **Property/slugs:** `forwarder-eventual-delivery`, `retry-queue-bounded-under-outage`, + `disk-persisted-retry-survives-restart` all touch the retry queue but none exercise API-key + rotation. +- **Concern:** sut-analysis §2 and §8 both flag that retry-queue IDs were *stabilized to survive + API-key rotation* (now load-bearing). This is a churn-history correctness fix with no property. + A key rotation mid-outage could (regression) re-key queued transactions such that a persisted entry + no longer matches its endpoint, dropping or duplicating data on recovery — exactly the durability + surface `disk-persisted-retry-survives-restart` cares about, but rotation is never injected. +- **Scope:** Retry-queue identity stability across credential change. +- **Evidence brief:** sut-analysis §2 "Retry-queue IDs are built to survive API-key rotation", §8 + "stabilize additional-endpoint retry-queue IDs (now load-bearing for API-key rotation)"; + `common/datadog/io.rs`. +- **Suggested action:** Add an API-key-rotation fault dimension to the existing forwarder/retry + properties (rotate the API key via config-stream update during an intake outage, then heal, and + assert no-loss/no-dup recovery). Cheap to fold into `disk-persisted-retry-survives-restart` or + `forwarder-eventual-delivery` as an additional fault rather than a new property. 
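
The no-loss/no-dup recovery check suggested for Finding 9 can be sketched workload-side. This is a minimal sketch under stated assumptions, not the harness's actual code: the transaction identifiers, the `reconcile` helper, and the `Reconciliation` type are hypothetical, and it assumes the workload can list what it sent and the mock intake can list what it received after the heal window.

```rust
use std::collections::HashMap;

/// Result of comparing what the workload sent with what the mock intake saw.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
struct Reconciliation {
    lost: Vec<String>,
    duplicated: Vec<String>,
}

/// Classifies each sent transaction id as delivered once, lost, or duplicated.
/// Assumes the ids in `sent` are unique.
fn reconcile(sent: &[&str], received: &[&str]) -> Reconciliation {
    // Count how many times each id arrived at the intake.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    for id in received {
        *seen.entry(*id).or_insert(0) += 1;
    }

    let mut lost = Vec::new();
    let mut duplicated = Vec::new();
    for id in sent {
        match seen.get(id).copied().unwrap_or(0) {
            0 => lost.push(id.to_string()),
            1 => {} // delivered exactly once
            _ => duplicated.push(id.to_string()),
        }
    }
    Reconciliation { lost, duplicated }
}

fn main() {
    // Hypothetical run: the key was rotated mid-outage, then the intake healed.
    let sent = ["txn-1", "txn-2", "txn-3"];
    let received = ["txn-1", "txn-3", "txn-3"];
    let r = reconcile(&sent, &received);
    println!("lost={:?} duplicated={:?}", r.lost, r.duplicated);
}
```

A rotation run would pass only if both lists are empty once the quiet window ends. Any tolerated at-most-once window would have to be subtracted from `lost` before asserting.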
+ +### Finding 10 — Two bug-history items mapped; two did not + +- **Property/slugs:** `aggregate-matches-agent`, catalog-wide. +- **Concern:** Lens asks which sut-analysis §8 fixes map to properties. Of the four named + correctness fixes — *histogram compensated summation*, *unitless histogram counts*, *timestamped + count sampling*, *prefix/filter ordering* — only the last two are even *implicitly* reachable via + `aggregate-matches-agent`'s diff harness, and prefix/filter ordering is weakly covered (Finding 4). + **Histogram compensated summation** and **unitless histogram counts** have no dedicated property; + they would only surface in the diff test if the `millstone` corpus happens to include the + triggering histogram shapes and the 1e-8 ratio catches the drift. Compensated-summation regressions + are precisely the kind of small-magnitude error a 1e-8 ratio might mask under reordered merges + (related to `ddsketch-relative-error-bound`, demoted to Medium/library-only). +- **Scope:** Two histogram-accuracy regression classes. +- **Evidence brief:** sut-analysis §8 "compensated summation for histograms; unitless histogram + counts". +- **Suggested action:** Confirm the diff-test corpus exercises histogram metrics with values chosen + to expose summation error (large + small magnitude mixed), and that the ratio is tight enough. + Otherwise these regressions are unguarded. Could be a sub-clause of `aggregate-matches-agent` or a + histogram-specific accuracy property. + +### Finding 11 — Possible over-investment: DDSketch library internals carry 3 properties, 2 demoted to library-only + +- **Property/slugs:** `ddsketch-bin-count-bounded` (High), `ddsketch-relative-error-bound` + (Medium, demoted), `ddsketch-no-nan-poison` (High), and `non-finite-values-handled-consistently` + (Medium) which overlaps the NaN facet. +- **Concern:** Four properties cluster on DDSketch/non-finite correctness.
The catalog itself demotes + `ddsketch-relative-error-bound` to "a library property, not a live ADP runtime invariant" + (its own §Resolved: ADP does not call `quantile` on the live path) and notes + `non-finite-values-handled-consistently` overlaps `ddsketch-no-nan-poison`. So ~2 of the 4 are + partially redundant / not live-path. This is mild over-investment relative to, say, the + zero-coverage events/SC/traces/OTLP areas (Findings 1-3). +- **Scope:** Sketch correctness cluster. +- **Evidence brief:** `ddsketch-relative-error-bound` §Resolved (quantile not on live path); + `non-finite-values-handled-consistently` Priority Medium with overlap note; + `ddsketch-no-nan-poison` is the one genuinely live, High-value member. +- **Suggested action:** Keep `ddsketch-no-nan-poison` (confirmed-live, High) and + `ddsketch-bin-count-bounded` (live regression tripwire). Consider merging + `ddsketch-relative-error-bound` and the NaN facet of `non-finite-values-handled-consistently` into + harness-side library tests rather than Antithesis runtime properties, freeing portfolio attention + for the uncovered pipelines. Not a correctness error in the catalog — a *weighting* observation. + +--- + +## PASSES (areas where coverage balance is appropriate) + +- **Memory & resource bounds (Cat A):** Well-proportioned to its risk. Five properties cover the + RSS bound, the context cap, interner spill, RSS-read-failure, and the retry-queue byte cap — each + mapping to a distinct sut-analysis §7 attack surface (§7.1-5, 7). The "fails-by-design under + defaults" framing is correct and the highest-risk area gets the most attention. +- **Egress data-loss (Cat B forwarder cluster):** `forwarder-eventual-delivery`, + `retry-queue-bounded-under-outage`, `disk-persisted-retry-survives-restart` together cover the + circuit-breaker, byte cap, and crash-durability surfaces (sut-analysis §2 egress, §6 gaps 1-3). + Strong, non-redundant, correctly liveness-typed. (Gap: API-key rotation, Finding 9.) 
+- **Guaranteed-crash config/clock hazards:** `aggregate-no-panic-any-window` and + `aggregate-clock-skew-stable` captured the two zero-fault-injection deterministic crashes + (sut-analysis §7.8, §7.9). _Update 2026-05-30 — §7.8 (sub-second window divide-by-zero) is now + **fixed upstream** (window typed `NonZeroU64`, PR #1772) and demoted to a regression tripwire; the + clock-skew forward-jump crash remains live, high value, cheap._ +- **Untrusted DSD + replay (Cat E):** `malformed-dsd-no-crash`, `replay-no-panic-on-malformed-capture`, + `replay-corruption-not-silent-eof` cover the codec and the newest/largest replay feature + (sut-analysis §6 gap 6, §8). Replay is correctly weighted as the top regression-prone area. +- **Lifecycle/config (Cat D):** `config-stall-no-deadlock`, `config-incompatible-refuses-start`, + `config-runtime-update-not-revalidated`, plus the two shutdown properties and the fail-stop + property cover the §7.13 no-timeout wait, the startup gate, and the §2 fail-stop model coherently. +- **Type mix:** Safety-heavy is correct for a "no crash / no corruption" SUT; liveness is present + exactly where progress (delivery, drain, startup) is the contract. + +--- + +## UNCERTAINTIES (need human/team input or a targeted pass to resolve) + +- **Are traces/APM/logs/OTLP in scope for the first-customer delivery?** This single answer + decides whether Findings 2 and 3 are real gaps or deliberate exclusions. ADP targets Agent 7.80.0 + with `data_plane.enabled` routing DogStatsD; the catalog and topology both implicitly assume + DogStatsD-only. If that assumption is correct, document the exclusion; if not, these are the + largest gaps in the portfolio. (Needs team input.) 
+- **Does the `millstone` correctness corpus exercise events, service_checks, mapped/prefix-filtered + metrics, and adversarial histogram values?** Determines whether `aggregate-matches-agent` + implicitly covers Findings 1 (events/SC delivery), 4 (transform ordering), and 10 (histogram + summation), or whether those are truly unguarded. (Needs a corpus read — `bin/correctness/`, + `millstone.yaml`.) +- **Is SO_REUSEPORT UDP autoscaling on by default, and at what worker count?** Determines whether + Finding 7 is a live multi-listener concurrency surface or a single-worker no-op. (Needs a config + default read.) +- **Is origin detection reachable at all in the planned topology?** The primary topology uses UDP/TCP + (no UDS), so peer-cred origin detection (Finding 5) is structurally unexercised. Confirm whether + the listener-coverage UDS variant is actually planned to run, else origin enrichment correctness is + untested by construction. (Needs topology decision.) +- **Can a high-severity-incompatible key actually arrive over the config stream?** Open in + `config-runtime-update-not-revalidated`; also gates whether the API-key-rotation-via-config-stream + fault (Finding 9) is reachable. (Needs Core Agent protocol knowledge / team input.) diff --git a/test/antithesis/scratchbook/evaluation/implementability.md b/test/antithesis/scratchbook/evaluation/implementability.md new file mode 100644 index 00000000000..7f97651ccd5 --- /dev/null +++ b/test/antithesis/scratchbook/evaluation/implementability.md @@ -0,0 +1,516 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space. 
+--- + +# Implementability Evaluation — ADP Antithesis Property Catalog + +Lens: **can each property actually be CHECKED as planned?** Three sub-questions per property: +(1) observability — workload-visible vs internal-state, and is the SUT-side instrumentation point +real and reachable; (2) topology — does the planned deployment support the failure scenario; +(3) preconditions — can the workload reliably construct the needed state within an Antithesis +timeline. Bias: surface what blocks a green check. + +Verified against code this session (load-bearing for the findings below): +- `run.rs:91,94,486` — standalone mode is a config flag; `check_and_warn_config` (gate) runs on the + *resolved* config regardless of source (`run.rs:157`). +- `run.rs:446` — `checks_ipc` pipeline is gated on `dp_config.checks().enabled()`, **independent of + standalone mode**. `checks_ipc/mod.rs:5-13,39-48,57-77` — the source is a **gRPC server** + (`ChecksServer`, TCP :5105) consuming `SendCheckPayloadRequest`; it needs an **external gRPC client + speaking `datadog_protos::checks`** to emit anything. +- `forwarders/datadog/mod.rs:60-66` + `endpoints.rs:157-180` — `with_endpoint_override(dd_url, api_key)` + exists; the forwarder CAN be pointed at a mock intake (primary egress topology is implementable). +- `limiter.rs:54` — the limiter checker is a bare `std::thread::Builder::new()` thread (confirms the + silent-death framing of `memory-limiter-survives-rss-read-failure`). +- `transforms/aggregate/mod.rs:92-93,194` — `window_duration` is read once via serde `as_typed()` at + build; **no `ConfigChangeEvent` subscriber** in the aggregate transform (grep: none). So + `aggregate_window_duration` is **startup-only**, not runtime-reloadable. + +Categories below: **F#** = a finding (something that blocks/weakens the planned check). The summary +at the end groups Findings / Passes / Uncertainties. 
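
The `limiter.rs:54` note above (a bare `std::thread` whose `.expect()` death is silent) can be illustrated in isolation. This is a minimal sketch, not the limiter's code: `read_rss` and `spawn_checker` are hypothetical stand-ins, and it assumes the default unwind panic strategy (under `panic = "abort"` the whole process would stop instead).

```rust
use std::thread;

/// Hypothetical stand-in for an RSS query that can fail mid-run.
fn read_rss(fail: bool) -> Option<u64> {
    if fail {
        None
    } else {
        Some(64 * 1024 * 1024)
    }
}

/// Spawns a checker on a bare thread, mirroring the `.expect()` pattern.
fn spawn_checker(fail: bool) -> thread::JoinHandle<u64> {
    thread::Builder::new()
        .name("rss-checker".to_string())
        .spawn(move || read_rss(fail).expect("failed to read RSS"))
        .expect("failed to spawn checker thread")
}

fn main() {
    let handle = spawn_checker(true);
    // The panic is confined to the checker thread; main keeps running.
    // Nothing would surface the failure if nobody joined the handle.
    let outcome = handle.join();
    println!("checker died: {}", outcome.is_err());
}
```

The only trace of the failure is the panic message on stderr and the `Err` from `join()`. A thread that is never joined leaves no signal at all, which is why the check needs an in-process assertion or panic hook.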
+ +--- + +## Cross-cutting prerequisite (affects ~all SUT-side properties) + +**F0 — The Antithesis Rust SDK is a hard prerequisite for ~17 of 27 properties, and it is not yet a +dependency.** `existing-assertions.md` confirms zero SDK usage and no `antithesis-sdk` in any +`Cargo.toml`/`Cargo.lock`. Every property whose check is an in-process `assert_*` (all crash/panic, +NaN-at-sketch, bin-count, interner-corruption, source-misroute, limiter-RSS-failure, context-limit, +retry-byte-cap, clock-skew, both replay properties, the lifecycle ordering/shutdown reachability +markers) is **blocked until a dedicated ADP image is built with the SDK linked in** and the +assertions are physically added at the cited code sites. The catalog acknowledges this in prose but +does not treat it as a gating work item with a per-site edit list. **Scope:** topology + build. +**Suggested action:** make "fork ADP, add `antithesis-sdk` dep, land the named assertions, build a +second instrumented image" an explicit milestone before any SUT-side property can pass; the +workload-only properties (below) can run first against a stock image. _Update 2026-05-30 — the SDK +dep and instrumented build are now in place (ADP `antithesis` cargo feature + bootstrap probe, see +`existing-assertions.md`); the remaining milestone is landing the per-site property assertions._ + +The properties that are **workload-only checkable** (no SUT fork strictly required, read telemetry / +mock-intake / process exit): `aggregate-context-limit-enforced` (counter-anchored), parts of +`no-silent-interconnect-drop` (counter), `forwarder-eventual-delivery` (intake reconciliation), +`disk-persisted-retry-survives-restart` (intake reconciliation), `retry-queue-bounded-under-outage` +(only the *reachability* Sometimes; the byte-cap Always is internal), `config-incompatible-refuses-start` +(exit code + log), `config-stall-no-deadlock` (CPU/log-rate + progress log), `aggregate-matches-agent` +(harness diff). 
Everything else needs the fork. + +--- + +## Category A — Memory & Resource Bounds + +### rss-bounded-under-cardinality — PASS (with a topology caveat) +- **Observability:** OK. RSS read workload-side (or SUT-side from the same `Querier`). Expected-to-fail + is the finding, not a blocker. +- **F1 — the RSS bound is unobservable unless a limit is actually configured, and the cgroup auto-path + is a silent trap.** The whole property presupposes `effective_limit_bytes()` exists. Under defaults + the limiter is a noop (`accounting.rs:37-40,174-178`) and there is no grant to compare against — so + the assertion has no threshold. The workload MUST set `memory_mode`+`memory_limit`. Worse + (`rss-bounded-under-cardinality.md:118-139`): a **non-empty `DOCKER_DD_AGENT` env var** silently + switches on cgroup auto-detect and changes the baseline. **Action:** pin `memory_mode`/`memory_limit` + explicitly in the `adp` container env and assert (or log-scrape) that a non-noop limiter is active, + else the run is vacuous; audit the base image for `DOCKER_DD_AGENT`. +- **Precondition / timing:** the "many distinct timestamped values" inflation path needs sustained + high-cardinality load for long enough that interner heap-spill + `SmallVec` growth diverge from the + grant. Feasible with `millstone`, but the **container cgroup may OOM-kill ADP before the assertion + reads over-grant RSS** (the property's own open question). On a kill, the observable is a process + crash, not an RSS reading — still a guarantee violation, but it lands on a *different* property + (`data-component-failure...`/no-crash) and can confuse triage. **Action:** give the container + headroom above ADP's configured grant so the assertion fires before the kernel kill. + +### aggregate-context-limit-enforced — PASS +- The `Always(len <= context_limit)` is a single-task, lock-free, local invariant — the cleanest + SUT-side `Always` in the catalog. 
`Sometimes(breached)` is anchorable to the existing + `events_dropped`/breach counter (workload-readable). **Precondition:** lower `aggregate_context_limit` + (e.g. 1000) so the boundary is reachable in-run — straightforward with a cardinality flood. No + topology or timing blocker. + +### interner-full-bounded — PASS (mode B), PASS-with-precondition (mode A) +- **Observability:** distinguishing interned/inlined/heap-fallback/dropped needs SUT-side state, as the + file says; `intern_fallback_total` exists as a `Sometimes` anchor (mode B, default) and is + workload-readable. Mode B is easy. +- **F2 — mode A (bounded) preconditions are fiddly and fragile.** To fill a fixed interner you + must (a) set a *small* `dogstatsd_string_interner_size_bytes`, (b) set + `dogstatsd_allow_context_heap_allocs=false` (opt-in, test-only in shipping code per + `interner-full-bounded.md:117-124`), and (c) use **names/tags > 31 bytes** so `MetaString` inlining + doesn't bypass the interner entirely (`:89-91`). Miss any one and the property is vacuous (never + fills, or spills to heap). In addition, fragmentation under churn can make `try_intern` return `None` below + nominal capacity (open question), which muddies the "bounded == drops at budget" reading. + **Action:** bake (a)/(b)/(c) into the mode-A workload corpus and add a `Sometimes(try_intern==None)` + guard so a non-filling run is flagged rather than passing green. + +### memory-limiter-survives-rss-read-failure — CONDITIONAL (needs a custom fault + SUT fork) +- **Topology:** requires a **custom `/proc` RSS-read fault** (deployment-topology.md:144 flags it as + "Custom; must script"). Antithesis cannot inject this out of the box; someone must write a fault + hook that makes `process_memory::Querier::resident_set_size()` return `None` mid-run on the target.
+- **F3 — the failure may be unreachable on the Antithesis Linux target without that custom fault, and + the property's whole value hinges on it.** The file's own pivotal open question + (`:93-98`): can `resident_set_size()` actually fail post-startup on this platform, or only via + injected corruption? If `/proc/self/statm` essentially never fails on the target, the realistic + panic is reachable *only* through the scripted fault — making this a fault-injection curiosity, not + a production risk. Also requires the limiter to be ON (`memory_mode!=disabled`+limit), i.e. the same + config prerequisite as F1. +- **Observability:** the `.expect()` thread death is **silent** (bare `std::thread`, no shutdown, no + metric) — confirmed `limiter.rs:54`. So this is **not workload-observable** at all without the SUT + fork: you must replace the `.expect()` site with an `assert_unreachable` (or panic-hook). **Action:** + confirm tenant supports the custom `/proc` fault AND that the SUT fork lands the assertion; if the + fault is unavailable, demote/park this property — it cannot be checked. + +### retry-queue-bounded-under-outage — PASS (split observability) +- **Topology:** needs `adp↔mock-intake` across a container boundary so the outage is faultable — + primary topology provides it. Mock intake needs a controllable reject/5xx/hang mode (topology open + question; `datadog-intake` may need a small extension). **Action:** confirm/extend the mock intake's + failure-mode toggle (also needed by `forwarder-eventual-delivery`, `shutdown-drains-no-loss`). +- **Observability:** the byte-cap `Always` is internal to `RetryQueue::push` → SUT fork. The + `Sometimes(items_dropped>0)` is telemetry → workload-readable. Disk-cap branch requires disk + persistence ON and the silent-fallback (`io.rs:405`) to NOT have fired — the file already flags this + (must `assert_unreachable`/log-scrape the fallback or the disk-cap test is vacuous). All tractable. 
+- **Precondition:** saturate the queue within a timeline — feasible with sustained load + a held-down + intake; size load vs. the 15 MiB default cap. Multi-endpoint fan-out multiplies the bound (decide + per-endpoint vs aggregate assertion — affects what threshold you check). + +--- + +## Category B — Data Integrity & No Silent Loss + +### no-silent-interconnect-drop — PASS (scope carefully) +- `events_discarded_total` is emitted and workload-readable; `Always(delta==0)` on wired edges is + checkable without a fork (counter scrape), and the discard branch only fires for zero-sender + outputs. **Caveat (file's open question):** confirm no production DSD output is ever + conditionally-unwired (`dsd_debug_log_out`/`dsd_stats_out`); if one is, scope the `Always` to the + always-wired edges or it false-positives. **Precondition:** must actually reach the full-channel + state — 128-buffer edges (`built.rs:653`) mean you need a genuinely slow downstream + sustained load; + the `Sometimes(backpressure engaged)` anchor needs a stall signal (rising send latency). Tractable + via throttling the intake. +- **Transport note:** this is an internal-edge property; UDP lossiness upstream does NOT confound it + (the assertion is on the encoder→forwarder internal edges, not the wire). Fine on UDP or TCP. + +### forwarder-eventual-delivery — PASS (TCP/UDS for the input side) +- **Observability:** primary check is workload-side reconciliation of accepted-retryable vs + mock-intake-received → no fork needed for the core; the `Reachable(Error::Open re-enqueue)` anchor + needs the fork. +- **F4 — UDP input confounds the "all accepted ... delivered" reconciliation; this property MUST use + TCP or UDS for the DSD ingress.** The liveness claim equates *accepted* input to *delivered* output. 
+ If `millstone→adp` is UDP (the topology's default suggestion), packets can be dropped on the wire + *before* acceptance, so "accepted" is unmeasurable from the sender side and the reconciliation set is + ill-defined. deployment-topology.md:175 hints at this ("for no-loss assertions prefer TCP/UDS"), but + the property files don't state it as a hard requirement. **Action:** pin DSD ingress to **TCP** for + this property (UDS needs a shared volume → not faultable, and the egress link is what we fault here, + so TCP ingress is fine). Same applies to `disk-persisted-retry-survives-restart` and + `shutdown-drains-no-loss`. +- **Precondition:** outage must be **shorter than retry-queue overflow** (else drop-oldest legitimately + sheds data and the reconciliation must exclude it). Needs a quiet/heal window for the eventual check + (`eventually_`/`ANTITHESIS_STOP_FAULTS`). Both standard. + +### disk-persisted-retry-survives-restart — CONDITIONAL (node-termination fault) — PASS otherwise +- **Topology:** needs the **node-termination/kill fault** (topology table: "Commonly disabled — must + enable") + s6 restart. If the tenant has kill disabled, this property can't run. +- **F5 — the silent in-memory fallback makes this vacuous unless explicitly guarded, and the + persistence-active signal is log-only (no metric).** `io.rs:405-408` falls back to in-memory with + only an `error!` log. The property file already prescribes treating the fallback as + `assert_unreachable` (fork) or log-scraping. Without that, a misconfigured disk path silently turns + this into an in-memory test that "passes" while proving nothing. **Action:** enforce the + persistence-active check as setup-gating. +- **Observability/precondition:** reconciliation at the mock intake with transaction-identity dedup + (workload-side, OK); tolerate the ~1-txn at-most-once window (delete-before-return). 
Corrupt-entry + poison drop is checkable either by injecting a hand-crafted file or naturally via SIGKILL-mid-write + (non-atomic write, confirmed). Use TCP/UDS ingress (see F4). + +### source-dispatch-no-misroute — PASS (fork required; misroute structurally improbable) +- **Observability:** the routing decision is internal — NOT telemetry-visible — so the + `Unreachable(misroute)` must be an in-process `assert_unreachable` checking the metrics dispatch + buffer is metrics-only (`mod.rs:~1707`). Fork required. +- **F6 — the failure may be structurally unreachable, risking a vacuous/never-firing assertion.** The + file's own analysis (`:33-52`) shows `extract`-then-`send_all` removes matched events from the buffer + *before* the send can fail, so a send error causes *loss*, not misroute — the assertion likely never + fires. That's fine as a **regression tripwire**, but it means this property cannot demonstrate value + in a run (no `Sometimes` can prove the bad state is reachable, because it isn't). **Action:** keep it + as a guard, but set expectations that it is a latent-regression assertion, not a falsifiable-now + property; pair it with the *loss-counting* sub-claim (B) which IS observable. Also: the two + `.expect("...output should always exist")` are real crash sites if a deployment omits an output — + worth a separate guard but only reachable by mis-wiring (not normal load). + +### shutdown-drains-no-loss — PASS (conditional, intricate preconditions) +- **Topology:** SIGINT/termination on `adp`; slow/blocked intake to push past 30s (needs the mock-intake + hang mode, F4-adjacent). OK in primary topology. +- **F7 — the "accepted-before-signal that reached a flushed window" set is hard to construct precisely, + making the no-loss reconciliation fragile.** Two designed-loss boundaries (open aggregation window + dropped unless `flush_open_windows`; 30s forceful stop) mean the reconciliation set must *exclude* + open-window and post-timeout data. 
Determining exactly which input metrics "reached a flushed window" + at the instant of the signal requires knowing the aggregate flush cadence vs. the signal time — a + timing-coupled boundary the workload can only approximate. **Action:** set `flush_open_windows=true` + to remove one boundary, drive only *closed-window* (pre-signal, aged > window) data into the no-loss + set, and assert the clean case as `Sometimes` (not `Always`). Use TCP/UDS ingress (F4). Realistic + drain time near 30s under max load is itself a finding (size load conservatively). + +--- + +## Category C — Aggregation & Sketch Correctness + +### aggregate-matches-agent — CONDITIONAL (separate heavy topology) — implementable but expensive +- **Topology:** Add-on 2 (datadog-agent baseline + adp + two intakes + panoramic/stele). Doubles + containers and state space; must run as its own template. +- **F8 — fault injection and the differential check are in fundamental tension; the diff is only valid + in a fault-free quiet window, which limits what this property actually tests.** Injected scheduler + pauses/clock steps create *timing-artifact* diffs (delayed flush → different bucket) that are false + positives, not correctness bugs. The topology doc concedes the comparison must run inside an + `ANTITHESIS_STOP_FAULTS` window long enough to cover `FLUSH_WAIT≈32s` on both sides. So the property + largely re-runs the existing deterministic diff test under Antithesis with faults *paused* — the + net-new coverage (equivalence *under* faults) is the hardest part and is exactly where false diffs + bite. **Open implementability questions unresolved:** can `panoramic` survive an ADP restart mid-run + (it may assume a single long-lived process)? Is the Agent baseline's bucket width pinned identical to + ADP's window (else the stele `interval_a==interval_b` check, metrics.rs:171, false-positives)? 
+ **Action:** treat as a low-fault, quiet-window equivalence run; verify `panoramic` restart-tolerance + before committing; this is the highest-effort, lowest-certainty property to operationalize. + +### aggregate-no-panic-any-window — PASS (cheapest high-value target) +- Deterministic crash from config alone; no fault injection needed, just config-space exploration of + the `{"secs":0,"nanos":N}` shape. The `Unreachable` at `align_to_bucket_start`/`step_by` needs the + fork to be a *clean* signal, but even **without** the fork the panic → process exit → s6 crash-loop + is workload-observable (no listener / repeated restart). **Resolved here:** the runtime-config-push + open question is **NO** — `window_duration` is read once at build (`mod.rs:92-93,194`), no + `ConfigChangeEvent` subscriber, so this is a **startup-only** crash vector. Update the property to + drop the "gRPC live-push" angle. + +### aggregate-clock-skew-stable — CONDITIONAL (clock-jitter fault) — PASS otherwise +- **Topology:** needs **clock-jitter fault** ("Commonly disabled — must enable"). If unavailable, the + property can't run. +- **Observability:** `zero_value_buckets.len()` and the `last_flush`/`current_time` pair are internal → + fork required for the crisp `Always(bounded)`/`Always(monotonic)`. Downstream flood/gap is only + *indirectly* visible workload-side (a spike in zero-value points at the mock intake) — a weaker proxy. +- **Precondition:** step the container clock forward (flood) / backward (gap) *during* a flush — the + `Sometimes(clock jumped during flush)` anchor confirms coincidence. Forward-jump flood is easy to + observe (memory + point count). All tractable given the fault. **Action:** confirm clock fault + enabled; land the SUT-side bucket-count assertion. + +### ddsketch-bin-count-bounded — PASS (fork; needs to drive >4096 bins) +- `bin_count()` is internal → fork. 
**Precondition:** must push a sketch past 4096 bins so `trim_left` + fires and the `Reachable("trim_left collapsed")` is non-vacuous — needs thousands of distinct + histogram/distribution sample values per flush. Feasible via `millstone` distribution corpus. + Otherwise the `Always` is vacuously true. No topology blocker (rides normal DSD load). + +### ddsketch-relative-error-bound — PASS as a HARNESS/library test only (NOT a live runtime check) +- **F9 — this is not checkable as a live ADP runtime invariant; it can only be a SUT-side unit/harness + assertion.** The file resolves (`:104-128`) that ADP **never calls `DDSketch::quantile` on the + customer path** — it ships raw bins; quantiles are computed server-side. So there is no production + call site to anchor `Always(quantile within eps)`. The only realizable form is an in-tree test-harness + assertion over the agent sketch in isolation (essentially the existing proptests with SDK + annotations). **Action:** reframe explicitly as a library-invariant harness check (the catalog + already demotes it to Medium and says this); do not plan a topology/workload path for it. Merge-order + facet (f64 `avg`/`sum` non-associativity) is likewise only meaningfully testable harness-side. + +### ddsketch-no-nan-poison — CONDITIONAL — the planned producer needs a NOT-YET-BUILT gRPC feeder +- **F10 — the only live NaN path requires a checks_ipc gRPC producer that the primary (DSD-only) + topology lacks and that nobody has built; without it the property is unreachable.** Confirmed this + session: `checks_ipc` is a **gRPC server** (`ChecksServer` on TCP :5105, `checks_ipc/mod.rs:5-13, + 39-77`) consuming `SendCheckPayloadRequest`; the NaN bypass (`checks_ipc/mod.rs:195` → encoder + `insert_n`) only fires if some client sends a Histogram with a NaN value. 
The DSD path is + finiteness-gated (FloatIter), OTLP is gated, the aggregate path is DSD-only — so **no DSD workload can + reach the poisoning site.** deployment-topology.md:177-179 hand-waves "add a minimal checks-IPC + feeder or a unit-level SUT assertion." That feeder is a real build task: a gRPC client speaking + `datadog_protos::checks` that emits a NaN histogram. The source is enabled by `dp_config.checks(). + enabled()` and works in standalone mode (no Core Agent needed for checks_ipc itself), but the + *producer* is missing. **Action:** EITHER build the checks_ipc NaN feeder (client + enable the + pipeline) for an end-to-end check, OR fall back to a SUT-side unit assertion at the sketch boundary + (`adjust_basic_stats`) exercised by an in-process test — the catalog should pick one explicitly, not + leave it as "or." The sketch-boundary `Always(v.is_finite())` assertion itself is sound and is the + right fix location. + +--- + +## Category D — Lifecycle & Configuration + +### topology-ready-before-intake — PASS (reframed to milestone ordering) +- The file already honestly narrows this to **readiness-milestone ordering** (sup-ready before + build/spawn; eventually all_ready), NOT "no bytes read before ready" — because the source binds and + reads gated only by backpressure. The defensible assertion is log/flag ordering (`sup_ready_ms` + before `topology_ready_ms`), which is workload-observable from logs even without the fork. + **Uncertainty (open question):** does `dsd_in` bind listeners during `initialize()` before + `mark_ready`? If so a `port_listening` probe pre-ready would show binding-before-ready — strengthens + or weakens the claim. Needs a read of the DSD listener-bind vs mark_ready ordering to finalize + assertion strength. Not a blocker; just bounds the claim. 
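The milestone-ordering check above can be sketched workload-side. This is a minimal illustration, assuming the milestones appear in the logs as `key=<integer>` fields (for example `sup_ready_ms=120`). That field format, and the names `milestone_ms`, `first_milestone`, `check_ready_ordering`, and `ReadyOrdering`, are hypothetical and not taken from the ADP source.

```rust
/// Outcome of the readiness-ordering check over one captured log.
#[derive(Debug, PartialEq)]
enum ReadyOrdering {
    Ok,
    Violated,
    /// One or both milestones never appeared, so nothing can be concluded.
    Inconclusive,
}

/// Extracts the integer after `key=` from one log line.
/// ASSUMPTION: milestones are logged as `key=<u64>`; adjust to the real format.
fn milestone_ms(line: &str, key: &str) -> Option<u64> {
    let needle = format!("{key}=");
    let start = line.find(&needle)? + needle.len();
    let digits: String = line[start..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

/// Returns the first occurrence of a milestone anywhere in the log.
fn first_milestone(log: &str, key: &str) -> Option<u64> {
    log.lines().find_map(|line| milestone_ms(line, key))
}

/// Asserts supervisor readiness is recorded no later than topology readiness.
fn check_ready_ordering(log: &str) -> ReadyOrdering {
    match (
        first_milestone(log, "sup_ready_ms"),
        first_milestone(log, "topology_ready_ms"),
    ) {
        (Some(sup), Some(topo)) if sup <= topo => ReadyOrdering::Ok,
        (Some(_), Some(_)) => ReadyOrdering::Violated,
        _ => ReadyOrdering::Inconclusive,
    }
}
```

Returning `Inconclusive` rather than `Ok` when a milestone is missing keeps a run that never reached readiness from being read as a pass.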
+ +### config-stall-no-deadlock — PASS — but the busy-loop falsification target is dead +- **Topology:** needs Add-on 1 (core-agent-stub or minimal gRPC config-stream stub) — a **build task** + (topology open question). Standalone mode bypasses the config stream entirely, so this property is + N/A without the stub. +- **F11 — the headline "busy-loop" falsification target is unreachable through the tonic stack; the + real (and only) check is a quiescent-hang detector.** The file's investigation (`:120-187`) resolves + that a steadily-erroring stream terminates after one error and backs off 5s — no spin. So the + realizable check is: drop the snapshot → ADP blocks **quiescently** (low CPU) at `ready().await` + forever, no panic. That is checkable workload-side (CPU/log-rate monitor + "Waiting for initial + configuration" present, "Initial configuration received" absent). **Action:** drop the busy-loop + scenario from the workload; keep the quiescent-hang + flap-reconnect(5s) scenarios. The stub must be + able to register ADP then withhold the snapshot — confirm the stub supports that. + +### config-incompatible-refuses-start — PASS (workload-observable) +- Exit code 1 + absence of `topology_ready_ms` + no data at intake — fully workload-observable; the + `Unreachable`-after-spawn and `Reachable`-at-refusal markers are nice-to-have fork additions but not + required for the core check. **Precondition:** the workload must supply a **current `Severity::High` + non-default key** from `config_registry/datadog/unsupported.rs`; that list drifts, so pin it to the + commit or source it dynamically. Needs Add-on 1 stub to deliver config (or bootstrap YAML/env, which + also works — `check_and_warn_config` runs on the resolved config regardless of source, confirmed + `run.rs:157`). So this is actually runnable **without** the stub via env/YAML config — easier than + the topology doc implies. 
+ +### config-runtime-update-not-revalidated — CONDITIONAL — reachability depends on an unanswered product question +- **Topology:** Add-on 1 stub, in remote-agent mode (must push a runtime `Partial`/`Snapshot` carrying + a high-severity key). +- **F12 — the property's reachability hinges on a `(needs human input)` product question the catalog + hasn't resolved: can a `Severity::High` key actually traverse the config stream, or does the Core + Agent pre-filter it?** If the real Agent never emits such a key, only an adversarial stub can, which + makes this a "the gate doesn't exist at runtime" documentation finding rather than a falsifiable + property. The check itself (Reachable on the unguarded apply path, or Unreachable on running-with-key) + is implementable via the stub, but its *value* is gated on the product answer. **Action:** get the + team's answer before investing; if startup-only gating is intentional, demote to a documented gap + + a single `Reachable` marker. + +### graceful-shutdown-within-30s — PASS (conditional on fault + scope to topology) +- SIGINT clean case (bounded load) + wedged-intake forceful case. Reachability markers on both branches + of `shutdown_with_timeout` (fork) or log-observable (`"All components stopped."` / + `"Forcefully stopping topology"`). **Caveat (file open question):** the **internal-supervisor + shutdown has no timeout** (`run.rs:294`), so the *process* can exceed 30s even when the *topology* + met it — a workload asserting "process exits within ~35s" can false-fail for an out-of-scope reason. + **Action:** scope the assertion to topology-shutdown completion (the log/marker inside + `shutdown_with_timeout`), not process exit. Forceful path needs the mock-intake hang mode (F4-adjacent).
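The scoping advice for `graceful-shutdown-within-30s` can be sketched as a log classifier that keys on the two topology-level messages quoted above instead of on process exit. This is a rough sketch under assumptions: the workload is assumed to supply each log message paired with the time elapsed since the signal, and the names `TopologyShutdown`, `classify_shutdown`, and `clean_within_budget` are hypothetical.

```rust
use std::time::Duration;

/// Which branch of topology shutdown was observed, and when (relative to the signal).
#[derive(Debug, PartialEq)]
enum TopologyShutdown {
    Clean(Duration),
    Forced(Duration),
    /// Neither marker was seen; the run says nothing about the property.
    NotObserved,
}

/// Scans `(elapsed_since_signal, message)` pairs for the first shutdown marker.
/// ASSUMPTION: the caller has already computed elapsed times from log timestamps.
fn classify_shutdown(events: &[(Duration, &str)]) -> TopologyShutdown {
    for (elapsed, msg) in events {
        if msg.contains("All components stopped.") {
            return TopologyShutdown::Clean(*elapsed);
        }
        if msg.contains("Forcefully stopping topology") {
            return TopologyShutdown::Forced(*elapsed);
        }
    }
    TopologyShutdown::NotObserved
}

/// True only when the clean branch completed inside the budget.
fn clean_within_budget(outcome: &TopologyShutdown, budget: Duration) -> bool {
    matches!(outcome, TopologyShutdown::Clean(d) if *d <= budget)
}
```

Because the check stops at the topology marker, a slow internal-supervisor shutdown after that point does not fail it.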
+ +### data-component-failure-triggers-process-shutdown — PASS (temporal/log check) +- Best realized as an Antithesis **query-logs temporal assertion**: whenever "Topology component + unexpectedly finished" appears, a process exit follows — workload/triage-side, no fork strictly + needed (a `Reachable` marker on the select arm helps). **Precondition:** induce a component finish — + cheapest via the sub-second-window panic (`aggregate-no-panic-any-window`) which is a guaranteed + deterministic finish. So this property piggybacks on C's crash target. Node-termination/kill is + NOT required (the component finishes on its own). Solid. + +--- + +## Category E — Untrusted Input Parsing + +### malformed-dsd-no-crash — PASS (UDP fine here; this is the no-crash property) +- **Transport:** explicitly the property where **UDP is appropriate** — it tests the connectionless + clear-and-continue path; UDP/UDS-datagram listener survival is *part of the property*. The file + correctly scopes "socket never dies" to connectionless and excludes TCP `break`. No loss assertion + here, so UDP lossiness doesn't confound. +- **Observability:** `Always(process up)` is workload-side liveness; `framing_errors`/`*_decode_failed` + are existing counters for the `Sometimes` anchors. The `Unreachable` at the `unreachable!` / + `from_utf8_unchecked` codec sites needs the fork (a parser-regression panic is otherwise just a crash). + **Precondition:** SDK-RNG-generated adversarial packets across all 4 listeners — straightforward. + Covering UDS-datagram/stream needs the shared-volume sidecar (listener-coverage variant), an extra + topology piece but documented.
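The adversarial-packet precondition above could start from something like the byte mutator below. It is only a sketch: a small xorshift generator stands in for the SDK-provided randomness (the real workload would draw from the Antithesis SDK so runs stay replayable), and `XorShift64` and `mutate` are hypothetical names.

```rust
/// Stand-in PRNG. ASSUMPTION: the real workload would use SDK-provided randomness.
/// The seed must be non-zero, or the generator only ever yields zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Overwrites up to `flips` randomly chosen bytes of a well-formed seed packet.
/// The packet length is preserved; an empty seed is returned unchanged.
fn mutate(seed_packet: &[u8], rng: &mut XorShift64, flips: usize) -> Vec<u8> {
    let mut out = seed_packet.to_vec();
    if out.is_empty() {
        return out;
    }
    for _ in 0..flips {
        let idx = (rng.next_u64() % out.len() as u64) as usize;
        out[idx] = (rng.next_u64() & 0xff) as u8;
    }
    out
}
```

Starting from a valid line such as `page.views:1|c` and corrupting a few bytes tends to reach deeper codec branches than fully random bytes, which are usually rejected at the first delimiter.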
+ +### replay-no-panic-on-malformed-capture — PASS (instrument the REPLAY CLI process, not ADP) +- **F13 — the panic assertion must live in the separate replay CLI process, which means a SECOND + instrumented binary, and the realistic panic surface is in zstd/prost deps, not ADP code.** Confirmed + (`reader.rs` + `dogstatsd.rs:394`): replay parses the file **in the `agent-data-plane dogstatsd + replay` CLI process** and forwards payloads over UDS to the running ADP. So (a) the SUT-side panic + hook / `assert_unreachable` belongs in the replay CLI, requiring SDK linkage in that code path too; + (b) the reader's own two `expect`s are bounds-guarded (not panic sites) — the real risk is a panic + *inside* `zstd::stream::decode_all` / `prost::decode` on adversarial input, which is harder to assert + on (it'd be a dep panic → SIGABRT). **Also two confirmed resource-exhaustion vectors** (unbounded + `fs::read`, uncapped `zstd::decode_all`) are OOM, not panic — they're a *different* observable + (process killed) and overlap the memory family. **Action:** instrument the replay CLI; treat OOM + vectors as a separate resource property or a size-cap fix; use the listener-coverage variant + (shared-volume `replay-client`) — no cross-container faults needed (pure input exploration). **The + capture files must be SDK-RNG-generated adversarial bytes** — a workload generator build task. + +### replay-corruption-not-silent-eof — CONDITIONAL — likely needs a format change OR stays heuristic +- **F14 — distinguishing truncation from clean EOF may be impossible without a format change, so a + strict `Always` is not implementable today; only a heuristic/`Sometimes` is.** The file resolves + (`:89-92`): there is **no record-count or total-length field**, and the code intentionally returns + `Ok(None)` for truncation (tests assert it). To assert "completion was faithful" you'd need to know + the true record count, which the format doesn't carry. 
So the realizable check is a SUT-side + assertion at `reader.rs:95` distinguishing `size==0 && at_trailer_boundary` (clean) from + overrun/mid-stream (corrupt) — which requires the reader to *track* the trailer boundary it doesn't + today. Without that instrumentation change, the property degrades to `Sometimes(an overrunning prefix + was seen)` — proves the corrupt branch is reachable but NOT that completion is faithful. Plus a + `(needs human input)` question on whether maintainers even consider silent truncation a bug. + **Action:** scope to the `Sometimes(corruption-reached-the-(b)-branch)` form + the CLI-process + instrumentation, and flag the strict-fidelity `Always` as fix-dependent (format change needed). + +--- + +## Category F — Concurrency & Boundary Conditions + +### interner-reclamation-no-corruption — CONDITIONAL — relies on Antithesis scheduling to hit the race +- **F15 — the corruption branch is loom-proven-safe under the modeled interleavings; whether Antithesis + can construct an interleaving loom doesn't cover is unknown, so the `Sometimes(contended path hit)` + anchors may never fire, risking a vacuous green.** The whole value proposition is "explore beyond + loom's bounded model under the real scheduler," but the workload cannot *force* the + decrement→lock→re-check race; it can only create pressure (small interner, high-cardinality churn, + short-lived contexts) and hope Antithesis schedules the contended interleaving. The `Sometimes(drop + re-check found resurrected entry)` / `Sometimes(reclaimed-slot reused)` anchors are the guard against + vacuity — but if they never fire, the property neither passes meaningfully nor fails. **Observability:** + the corruption check (overlap or sentinel-run in a resolved `&str`) is internal → fork; note the + **two different sentinels** (`0x21` in map.rs, `0xAA` in fixed_size.rs) — a hard-coded check would + miss one impl; use direct overlap detection. 
**Precondition:** small interner + heap-fallback OFF so + reclamation is actually pressured (else strings spill to heap and reclamation is never exercised). + **Action:** land the overlap-based (not sentinel-hardcoded) SUT assertion + the `Sometimes` contention + anchors; accept that reachability of the race is Antithesis-scheduler-dependent and report the anchor + status in triage so a never-contended run isn't mistaken for a pass. + +### non-finite-values-handled-consistently — PASS (DSD facet) / shares F10 (NaN-poison facet) +- The DSD ghost-metric facet is checkable on the primary topology: all-non-finite packets → + `num_points==0` gate (`mod.rs:1478`) → `Ok(None)`; the `AlwaysOrUnreachable(no zero-point metric + reaches aggregation)` and `Sometimes(non-finite dropped)` are anchorable (the latter needs a + `non_finite_dropped` counter that doesn't exist — currently only a `debug!` log, so add a counter or + the fork). The **NaN-at-sketch facet inherits F10** (needs the checks_ipc gRPC feeder to be live; on + the DSD path it's correctly Unreachable). **Action:** add the `non_finite_dropped` reachability anchor + (counter or SUT marker); keep the sketch-boundary `Always(is_finite)` as the producer-independent + assertion but understand its live exercise depends on F10's feeder. 
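The producer-independent `Always(is_finite)` assertion and the proposed `non_finite_dropped` anchor can be illustrated on a toy accumulator. This is not the real sketch type or `adjust_basic_stats`; `BasicStats` and its methods are hypothetical, and the example only shows why a single NaN is sticky for `sum`/`avg` and what a boundary guard with a counter would look like.

```rust
/// Toy stand-in for a sketch's running statistics (hypothetical, not ADP's type).
#[derive(Debug, Default)]
struct BasicStats {
    count: u64,
    sum: f64,
    /// Reachability anchor: how many non-finite samples were refused.
    non_finite_dropped: u64,
}

impl BasicStats {
    /// Unguarded insert: one NaN makes `sum` NaN for every later sample too.
    fn insert_unchecked(&mut self, v: f64) {
        self.count += 1;
        self.sum += v;
    }

    /// Guarded insert: refuses NaN and +/-infinity and counts the refusal.
    /// Returns whether the sample was accepted.
    fn insert_checked(&mut self, v: f64) -> bool {
        if !v.is_finite() {
            self.non_finite_dropped += 1;
            return false;
        }
        self.count += 1;
        self.sum += v;
        true
    }

    fn avg(&self) -> f64 {
        if self.count == 0 {
            0.0
        } else {
            self.sum / self.count as f64
        }
    }
}
```

With the guard in place, the counter doubles as the `Sometimes(non-finite dropped)` anchor: a run in which it stays at zero never exercised the path.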
+ +--- + +## Summary + +### Findings (blockers / weakeners — Property | Concern | Scope | Evidence | Action) + +- **F0 | SDK not present; ~17 properties need an instrumented ADP fork before they can pass | build/topology | + existing-assertions.md (zero SDK); deployment-topology.md:153-161 | Make "fork + add SDK + land named + assertions + second image" an explicit gating milestone; run workload-only properties first.** +- **F1 | rss-bounded-under-cardinality — no grant to assert against under defaults; `DOCKER_DD_AGENT` silently + flips the baseline | config | rss-bounded-under-cardinality.md:118-139; accounting.rs:37-40,107-121 | Pin + memory_mode+memory_limit, assert non-noop limiter, audit image for DOCKER_DD_AGENT; give container RSS + headroom so assertion fires before OOM-kill.** +- **F2 | interner-full-bounded mode A — three coupled preconditions (small interner, heap-off, >31B strings) or + vacuous | config/precondition | interner-full-bounded.md:89-91,117-124 | Bake all three into the corpus + + Sometimes(try_intern==None) guard.** +- **F3 | memory-limiter-survives-rss-read-failure — needs a custom /proc fault that may be the ONLY way to reach + the failure; silent thread death is unobservable without the fork | topology+fault+SUT | deployment-topology.md:144; + limiter.rs:54,100-102; property open Q :93-98 | Confirm tenant supports the custom fault AND the fork; else + park — uncheckable.** +- **F4 | forwarder-eventual-delivery / disk-persisted-retry / shutdown-drains — UDP ingress confounds no-loss + reconciliation; MUST use TCP (UDS needs shared volume) | topology | deployment-topology.md:175; forwarder- + eventual-delivery.md:69-74 | Pin DSD ingress to TCP for these three properties.** +- **F5 | disk-persisted-retry — silent in-memory fallback (log-only, no metric) makes it vacuous unless guarded | + observability | disk-persisted-retry-survives-restart.md:121-130; io.rs:405-408 | Gate the run on a + persistence-active assert_unreachable/log-scrape.** 
+- **F6 | source-dispatch-no-misroute — misroute is structurally unreachable (extract-then-send); assertion can't + fire in a run | falsifiability | source-dispatch-no-misroute.md:33-52 | Keep as regression tripwire; pair with + the observable loss-counting sub-claim.** +- **F7 | shutdown-drains-no-loss — the "accepted-before-signal, flushed-window" set is timing-coupled and hard to + construct precisely | precondition | shutdown-drains-no-loss.md:55-60 | Set flush_open_windows=true, restrict to + closed-window data, assert as Sometimes.** +- **F8 | aggregate-matches-agent — faults create false diffs; net-new coverage only valid in a fault-paused + window; panoramic may not survive restart | topology/method | aggregate-matches-agent.md:89-96; deployment- + topology.md:117-120 | Run as a low-fault quiet-window equivalence; verify panoramic restart-tolerance first.** +- **F9 | ddsketch-relative-error-bound — no live runtime call site (ADP ships raw bins, no quantile); only a + library/harness test | observability | ddsketch-relative-error-bound.md:104-128 | Reframe as in-tree harness + assertion; no topology/workload path.** +- **F10 | ddsketch-no-nan-poison (and NaN facet of non-finite) — only-live NaN path needs a checks_ipc gRPC NaN + feeder not in the primary topology and not yet built | topology/build | checks_ipc/mod.rs:5-13,39-77,195; + run.rs:446; deployment-topology.md:177-179 | Build the gRPC checks feeder OR fall back to a SUT-unit sketch- + boundary assertion — pick one explicitly.** +- **F11 | config-stall-no-deadlock — busy-loop falsification target is unreachable through tonic; needs a config- + stream stub | method/topology | config-stall-no-deadlock.md:120-187 | Drop busy-loop scenario; keep quiescent- + hang; confirm stub can register-then-withhold-snapshot.** +- **F12 | config-runtime-update-not-revalidated — reachability gated on an unresolved product question (can a + High-severity key traverse the stream?) 
| product input | config-runtime-update-not-revalidated.md:42-47 | Get + team answer before investing; else demote to documented gap + Reachable marker.** +- **F13 | replay-no-panic — assertion must live in the SEPARATE replay CLI process (second instrumented binary); + real panic surface is in zstd/prost deps; +2 OOM vectors | topology/build | replay-no-panic-on-malformed- + capture.md:83-130; reader.rs:40-44; dogstatsd.rs:394 | Instrument the replay CLI; SDK-RNG capture generator; + treat OOM vectors separately.** +- **F14 | replay-corruption-not-silent-eof — strict fidelity Always needs a format change (no record count exists); + only a heuristic Sometimes is implementable today | format limitation | replay-corruption-not-silent-eof.md:89-92 | + Scope to Sometimes(corrupt branch reached) + CLI instrumentation; flag strict Always as fix-dependent.** +- **F15 | interner-reclamation-no-corruption — race is loom-safe; hitting an un-modeled interleaving depends on the + Antithesis scheduler; Sometimes anchors may never fire (vacuous) | scheduler-dependence | interner-reclamation- + no-corruption.md:55-128 | Use overlap-based (not sentinel-hardcoded) assertion; heap-off + small interner; report + anchor status so a never-contended run isn't read as a pass.** + +### Passes (implementable as planned, modulo the SDK fork in F0 and any noted scoping) + +- aggregate-context-limit-enforced (cleanest SUT Always; counter-anchored Sometimes). +- aggregate-no-panic-any-window (config-only crash; resolved: startup-only, drop the runtime-push angle). +- ddsketch-bin-count-bounded (fork + drive >4096 bins). +- no-silent-interconnect-drop (counter-readable; scope to always-wired edges). +- retry-queue-bounded-under-outage (split observability; mock-intake failure-mode toggle needed). +- config-incompatible-refuses-start (workload-observable via exit code; runnable even without the stub via env/YAML). +- topology-ready-before-intake (reframed to milestone ordering; log-observable). 
+- graceful-shutdown-within-30s (scope to topology shutdown, not process exit). +- data-component-failure-triggers-process-shutdown (log/temporal check; piggybacks on the C crash target). +- malformed-dsd-no-crash (UDP appropriate here; counters + liveness; fork for codec Unreachable). +- aggregate-clock-skew-stable (needs clock fault enabled; otherwise solid). +- non-finite-values-handled-consistently (DSD ghost-metric facet; needs a non_finite_dropped anchor). + +### Uncertainties (need an answer to finalize the check) + +- Does `dsd_in` bind listeners during `initialize()` before `mark_ready`? Decides whether + topology-ready-before-intake can strengthen to "no socket bound pre-ready" (file open Q). +- Is any production DSD output (`dsd_debug_log_out`/`dsd_stats_out`) ever conditionally unwired? Decides + the scope of no-silent-interconnect-drop's `Always(delta==0)` (file open Q). +- Can `process_memory::Querier::resident_set_size()` fail post-startup on the Antithesis Linux target + without the custom fault? Pivotal for F3's priority (file open Q). +- Can a `Severity::High` config key actually traverse the Core Agent → ADP config stream? Pivotal for + F12's value (file open Q, needs human input). +- Does the mock `datadog-intake` binary support a runtime-toggleable reject/5xx/slow/hang mode, or does it + need extension? Needed by retry-queue, forwarder-eventual-delivery, shutdown forceful-path (topology open Q). +- Which faults are enabled on the target tenant (node-termination, clock-jitter both "commonly disabled", + custom /proc fault)? Several CONDITIONAL properties can't run if these are off (topology fault table). +- Can `panoramic` survive an ADP process restart mid-run? Decides whether aggregate-matches-agent's + restart-equivalence facet is testable at all (file open Q). 
diff --git a/test/antithesis/scratchbook/evaluation/synthesis.md b/test/antithesis/scratchbook/evaluation/synthesis.md new file mode 100644 index 00000000000..2a47198f0f8 --- /dev/null +++ b/test/antithesis/scratchbook/evaluation/synthesis.md @@ -0,0 +1,141 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space — design partner focus on the tag-filter RC relay shaped the bias findings. + - path: https://datadoghq.atlassian.net/wiki/spaces/AMCC/pages/6640602441/Tag+Filter+RC+Relay+Stress+Test+agent+ADP + why: Confirms runtime filter config-reload is the design-partner's documented test focus. + - path: https://github.com/DataDog/saluki/pull/1768 + why: PR review #4393897611 (Copilot) — the G2 filter-deletion wording and three priority alignments reconciled here. +--- + +# Property Evaluation Synthesis + +Four evaluation lenses (Antithesis-fit, coverage-balance, implementability, wildcard) stress-tested +the 27-property catalog as a portfolio. Findings are categorized below as **Refinement** (applied +directly), **Gap** (filled via targeted discovery), or **Bias** (escalated to the user). Lens +evidence: `evaluation/{antithesis-fit,coverage-balance,implementability,wildcard}.md`. + +Outcome: 8 properties added (catalog 27 → 35), 10 refinements applied (R1-R10), 1 scope bias escalated. + +## Gaps (filled) + +### G1 — Events & service-checks paths uncovered (always-on) → 3 properties +Coverage-balance F1, wildcard. When DogStatsD is enabled, `dsd_in.events`/`dsd_in.service_checks` +are always-wired production paths (`run.rs:681-684`) with their own ~394/~312-LOC codecs, yet the +27-property catalog was metrics-only.
**Filled** with `events-sc-no-silent-loss`, +`malformed-event-sc-no-crash`, `events-sc-pipeline-reachable` (the last is an anti-vacuity anchor so +a metrics-dominated workload can't pass the first two trivially). + +### G2 — ADP-as-transformer correctness + runtime filter config-reload → 5 properties +Coverage-balance F4/F5/F10, wildcard W1/W2/W6, the design-partner focus. The catalog covered ADP as +a *transport* but not the mapping/filtering/enrichment *correctness* layer, and treated runtime +config only as a crash gate — never as a data-correctness event, despite the watcher hazards +(`broadcast::Lagged` silent drop → stale filtering, `watcher.rs:36-74`; partial-deserialize +half-apply; key-deletion leaves filtering silently **stale** because the additive `diff_recursive` +emits no change event, `diff.rs:12-48`, while only an explicit empty/null value **clears** it, +`tag_filterlist/mod.rs:274-276`) and the design partner's documented "Tag Filter RC Relay Stress +Test." (Deletion-is-stale vs. explicit-empty-clears is the distinction detailed in +`filter-config-reload-correct.md` Hazard 3.) **Filled** with `mapper-output-matches-agent`, `mapper-interner-bounded` +(a *second* bounded interner with its own silent drop), `filter-config-reload-correct` (the watcher +hazards on live data), `tag-filterlist-applied-consistently`, `prefix-filter-ordering-matches-agent` +(bug-history-sensitive stage ordering). These need the config-stream add-on topology (not standalone) +and the diff-test add-on; noted in each evidence file. + +### Gaps NOT filled (folded or escalated) +- **API-key rotation mid-run** (coverage F9): folded as a fault dimension into + `disk-persisted-retry-survives-restart` / `forwarder-eventual-delivery` rather than a new property. +- **Internal-supervisor restartability** (coverage F8): noted as a minor gap; low priority relative + to the data path. Left for a future pass. 
+- **Traces/APM, logs, OTLP pipelines** (coverage F2/F3, wildcard): escalated as the scope **Bias** + below rather than filled — they are the "broader topology, lower priority" the user deferred. + +## Biases (escalated to user) + +### B1 — Catalog (and SUT analysis) is framed around metrics-DogStatsD transport +Wildcard W4/W6, coverage F2/F3, multiple uncertainties. Even after gap-filling, the catalog is +DogStatsD-metrics-centric. The **traces/APM, logs, and OTLP pipelines** (`run.rs:506-591,700-758`) — +including a SQL-parsing trace-obfuscation untrusted-input surface and a *second* OTLP forwarder — have +**zero properties** and are absent from the SUT analysis. Whether they are in scope for first-customer +(Agent 7.80.0) delivery is a product judgment the evaluation can't make. The primary topology +also uses **standalone mode**, which structurally hides the entire runtime-config surface (the +watcher never fires) — so the G2 config-reload properties pass vacuously unless the config-stream +topology is promoted to primary. **Escalated** — see the questions posed to the user. This does not +block the catalog; it scopes which add-ons and pipelines get instrumented first. + +## Refinements (applied) + +- **R1 (catalog-wide, important):** the container's s6 supervisor auto-restarts ADP on exit, so + "process up" workload assertions are vacuously green even during a crash-loop. **Every no-crash + property must assert SUT-side `Unreachable` at panic sites (or on restart-count), never container + liveness.** Applied as a catalog-wide note and reflected in `malformed-dsd-no-crash`, + `malformed-event-sc-no-crash`, `data-component-failure-triggers-process-shutdown`. 
+- **R2 (catalog-wide; updated 2026-05-30):** the Antithesis Rust SDK is now wired into ADP behind the + `antithesis` cargo feature (`bin/agent-data-plane/Cargo.toml`) with an `antithesis_init()` + + bootstrap `assert_reachable!` probe, and the harness binaries carry workload-side anchors — see + `existing-assertions.md`. The **"fork ADP + add SDK + build an instrumented image"** prerequisite is + therefore largely satisfied (the wiring is proven end-to-end); what remains is implementing the ~17 + in-process SUT-side **property** assertions on top of that scaffold. The ~10 workload-only + properties can still run first. +- **R3 (catalog-wide):** no-loss properties (`no-silent-interconnect-drop`, `forwarder-eventual-delivery`, + `disk-persisted-retry-survives-restart`, `shutdown-drains-no-loss`, `events-sc-no-silent-loss`) + **must use TCP (or UDS), not UDP**, or UDP's inherent loss confounds "accepted == delivered." Noted + in the topology and those properties. +- **R4 (catalog-wide vacuity):** safety properties gated by hard-to-reach `Sometimes` anchors + (bin-collapse, interner-resurrection race, events/SC reachability) must have the workload force the + anchor config/corpus; the run synthesizer must report **unreached `Sometimes` as inconclusive, not + passing**. Added to catalog-wide notes. +- **R5 `ddsketch-relative-error-bound`:** demoted — it is a **library/proptest invariant, not a live + ADP runtime assertion** (ADP ships raw bins, never calls `DDSketch::quantile` on the customer path). + Re-scoped to harness-side; priority Medium→Low. (Applied during open-question sync; reaffirmed.) +- **R6 `ddsketch-bin-count-bounded`:** demoted High→Medium — substantially duplicates existing + proptests; genuine Antithesis value is only a live regression tripwire for a future mutator that + forgets `trim_left`. The `Reachable(collapse)` anchor is essential or the `Always` is vacuous. 
+- **R7 `config-incompatible-refuses-start`:** demoted High→Medium — a deterministic ordered gate + already covered by the integration suite's config-check-exit-code cases; kept as cheap config + exploration with the `Reachable(refused)` framing (the `Unreachable` is statically unreachable). +- **R8 `source-dispatch-no-misroute`:** re-centered on the **live silent-loss facet** (dispatch + failure loses events with no/under-counting) rather than the structurally-vacuous misroute + `Unreachable`; priority kept Medium, framing corrected. +- **R9 `memory-limiter-survives-rss-read-failure`:** priority noted as **High *conditional* on a + scriptable `/proc` custom fault + limiter enabled**; otherwise it is unreachable. Framing clarified. +- **R10 (de-dup labelling):** marked shared-scenario pairs in `property-relationships.md` + (`shutdown-drains-no-loss` ⇄ `graceful-shutdown-within-30s`; `non-finite-values-handled-consistently` + ⇄ `ddsketch-no-nan-poison`) so the portfolio count isn't read as 35 independent test efforts. + +## Passes (lenses confirmed sound) + +- Category A memory bounds — well-proportioned to the highest-risk surface; each property maps to a + distinct mechanism. +- Category B forwarder cluster (eventual-delivery / byte-cap / crash-durability) — strong, non-redundant. +- The zero-fault clock crash finding `aggregate-clock-skew-stable` (forward-jump facet) — correctly + prioritized, cheap, high-value (F1 note: clock vector, not a runtime-discoverable state). Its sibling + `aggregate-no-panic-any-window` had its `% 0` panic vector **closed upstream** (window is now + `NonZeroU64`, PR #1772, fc4bb297); it is demoted from a live crash bug to a cheap `Unreachable` + regression tripwire — see the catalog status note and the bug ledger. +- `ddsketch-no-nan-poison` checks_ipc bypass — genuine live latent bug, correctly High. 
+- Type mix (~safety-heavy with 6+ liveness) appropriate for a no-crash/no-corruption SUT; reachability + used correctly as anti-vacuity riders. +- `host_enrichment` static (correctly no property); mapper uses the backtracking-free `regex` crate + (regex-DoS is a non-issue — recorded closed); OTTL panics are traces-only and test-gated. + +## Open uncertainties carried forward (need team input) + +- Are traces/APM, logs, OTLP in scope for first-customer delivery? (B1) +- Does the `millstone` corpus exercise events/SC, mapped/filtered metrics, and adversarial histogram + values — i.e. does `aggregate-matches-agent` implicitly cover some G1/G2 ground? +- Can a `Severity::High` config key, or a filter-config update, actually traverse the RC stream at + runtime (vs. Core-Agent pre-filtering)? Gates the reachability of the config-reload properties. +- ~~Which faults are enabled for the tenant~~ **Resolved (user, 2026-05-28):** node termination, + clock jitter, and custom `/proc` faults are all enabled — the crash-recovery, clock-skew, and + limiter-RSS-failure properties are realizable. +- Does `datadog-intake` support a runtime failure-mode toggle (reject/5xx/slow/hang)? + +## Scope decision (user, 2026-05-28) + +The traces/APM, logs, and OTLP pipelines are **deferred** (documented exclusion in the catalog), not +filled. DogStatsD metrics + events/service-checks + the runtime config/transform surface is the +first-customer scope. Bias B1 is thereby resolved-as-accepted: the catalog is intentionally scoped, +not accidentally narrow. 
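Refinement R4 above asks the run synthesizer to treat an unreached `Sometimes` anchor as inconclusive rather than passing. A minimal sketch of that rule follows, assuming each property run is summarized by a count of `Always` violations and a count of anchor hits. `PropertyRun`, `Verdict`, and `verdict` are hypothetical names, not part of any existing harness.

```rust
/// Three-way verdict, so a vacuous green is never reported as a pass.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
    Inconclusive,
}

/// Per-property summary of one run (hypothetical shape).
struct PropertyRun {
    /// Number of times the safety (`Always`) assertion was violated.
    always_violations: u64,
    /// Number of times the anti-vacuity (`Sometimes`) anchor fired.
    sometimes_hits: u64,
}

/// A violation always fails; no violation only passes if the anchor was reached.
fn verdict(run: &PropertyRun) -> Verdict {
    if run.always_violations > 0 {
        Verdict::Fail
    } else if run.sometimes_hits == 0 {
        Verdict::Inconclusive
    } else {
        Verdict::Pass
    }
}
```

Checking violations first means a failure is still reported even when the anchor count is zero, which can happen if the violating path and the anchored path differ.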
diff --git a/test/antithesis/scratchbook/evaluation/wildcard.md b/test/antithesis/scratchbook/evaluation/wildcard.md new file mode 100644 index 00000000000..33d28cf872f --- /dev/null +++ b/test/antithesis/scratchbook/evaluation/wildcard.md @@ -0,0 +1,271 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space. + - path: https://datadoghq.atlassian.net/wiki/spaces/AMCC/pages/6640602441/Tag+Filter+RC+Relay+Stress+Test+agent+ADP + why: Team-authored stress-test spec for the runtime tag-filter RC relay — the exact surface the catalog under-covers. +--- + +# WILDCARD evaluation — ADP property catalog + +Bias: find what Lens 1 (Antithesis-Fit), Lens 2 (Coverage-Balance), Lens 3 (Implementability) +miss. The three lenses accept the SUT analysis's framing, in which ADP is essentially a +*transport* — its job is to not crash and not lose bytes. That framing is the catalog's blind +spot. ADP is also a *transformer*: it maps, filters (twice), tag-filters, enriches, and +aggregates customer data, and most of that transformation is driven by **runtime config that +mutates while data flows**. The catalog has 27 properties; ~24 are about crash/memory/loss and +3 (`aggregate-matches-agent`, `aggregate-clock-skew-stable`, the two ddsketch math props) are +about correctness of values. **Zero** properties cover the map/filter/tag-filter/enrichment +layer's correctness, and zero cover runtime config-reload as a *data-correctness* event rather +than a crash/incompatibility event. That is the headline wildcard finding. 
+ +--- + +## FINDING W1 — The runtime config-reload-while-data-flows surface is almost entirely uncovered (catalog-wide / framing miss) + +**Concern.** Five production components subscribe to live config updates over the Core Agent RC +stream and **rebuild correctness-affecting state in place while metrics are flowing through them**: + +- `dogstatsd_prefix_filter` — watches 4 keys (`metric_filterlist`, `metric_filterlist_match_prefix`, + `statsd_metric_blocklist`, `statsd_metric_blocklist_match_prefix`) and rebuilds the allow/block + filter live (`bin/agent-data-plane/src/components/dogstatsd_prefix_filter/mod.rs:285-330`). +- `dogstatsd_post_aggregate_filter` — same 4 keys + (`.../dogstatsd_post_aggregate_filter/mod.rs:268-308`). +- `tag_filterlist` — watches `metric_tag_filterlist`, and on change does + `self.filters = compile_filters(...); self.context_cache = build_context_cache();` + (`.../tag_filterlist/mod.rs:222, 274-278`). +- `dsd_debug_log` (stats enable) and `internal/logging` (`log_level`) — lower stakes. + +The catalog's only two runtime-config properties — `config-incompatible-refuses-start` and +`config-runtime-update-not-revalidated` — treat config purely as a **crash / incompatibility-gate +safety** concern (does an incompatible *key* get applied?). Neither asks the question that the +team's own Confluence page ("Tag Filter RC Relay Stress Test — agent + ADP") is built around: +**after a filter/tag config update lands, is the data that gets through actually filtered +correctly, and is that true under load + interleaving + fault?** This is the single biggest +framing gap. It is squarely in Antithesis's sweet spot (timing of the config-stream event vs the +data-flow event is exactly what a deterministic diff-test cannot explore) and it is a +*data-correctness* failure, not a crash — invisible to every existing process-level assertion. + +**Concrete hazards inside this surface that no property names:** + +1. 
**`broadcast::Lagged` drops config updates silently → stale filtering, no recovery until the + next update.** The watcher reads config over a `tokio::broadcast` channel; on `Lagged` it + logs a warn and `continue`s without re-reading the missed value + (`lib/saluki-config/src/dynamic/watcher.rs:60-67`). A transform whose `select!` loop is busy + draining a full event channel under load (backpressure) can lag the broadcast and **miss a + filter update entirely**, then run with stale filters indefinitely. This is a + backpressure × config-reload interaction (a directive-#2 combination) that produces wrong + filtering output, not a crash. No lens models it. +2. **Partial-deserialize skip → half-applied multi-key config.** A malformed/wrong-typed value + for one watched key is skipped with a warn while the other keys apply + (`watcher.rs:43-56`). The prefix/post-agg filters watch 4 interdependent keys; a new + `metric_filterlist` can apply while its companion `metric_filterlist_match_prefix` is rejected, + leaving the filter in an inconsistent (new-list / old-match-mode) state — silently wrong + filtering semantics. +3. **Key-deletion silently clears all filtering.** `tag_filterlist` does + `new_entries.as_deref().unwrap_or(&[])` (`mod.rs:275`): an RC update that *removes* the key + yields `None` → empty filter set → all tag filtering silently turned off. Correctness loss with + no signal. +4. **Cache coherency across swap.** `tag_filterlist` rebuilds `context_cache` on swap (good), but + the swap and the in-flight event batch are processed in the same `select!` loop — a property + should pin down that no metric in the post-swap batch is filtered against a stale cache entry. + +**Scope.** Production DSD hot path (prefix filter, tag filterlist, post-agg filter all sit +between `dsd_in.metrics` and `dd_metrics_encode`, `run.rs:674-679`). Requires Add-on 1 (config +stream) — these are *not* exercised at all in standalone-mode primary topology (see W4). 
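Hazard 3 (key deletion) is easy to reproduce in isolation. The sketch below is an illustration only, not the `tag_filterlist` code: `effective_filters` is a hypothetical helper that mirrors just the `as_deref().unwrap_or(&[])` expression quoted above, and filter entries are modeled as plain strings (an assumption). It suggests that a deleted key (`None`) and an explicitly empty list collapse to the same empty slice, so the reload path cannot tell a removal from an intentional "filter nothing" update.

```rust
/// Hypothetical stand-in for the reload path: it mirrors only the
/// `new_entries.as_deref().unwrap_or(&[])` expression and nothing else.
fn effective_filters(new_entries: &Option<Vec<String>>) -> &[String] {
    new_entries.as_deref().unwrap_or(&[])
}

fn main() {
    let updated = Some(vec!["env".to_string(), "pod_name".to_string()]);
    let cleared: Option<Vec<String>> = Some(Vec::new()); // key present, empty list
    let deleted: Option<Vec<String>> = None; // key removed by the update

    assert_eq!(effective_filters(&updated).len(), 2);
    // Both cases yield an empty filter set; the distinction is lost here.
    assert!(effective_filters(&cleared).is_empty());
    assert!(effective_filters(&deleted).is_empty());
    println!("deleted key and empty list are indistinguishable after the fallback");
}
```

A property in this area would likely need to observe the `None` case before the fallback is applied, since nothing downstream of it retains the difference.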
+ +**Evidence brief.** `dogstatsd_prefix_filter/mod.rs:285-330`, +`dogstatsd_post_aggregate_filter/mod.rs:268-308`, `tag_filterlist/mod.rs:222,274-278`, +`saluki-config/src/dynamic/watcher.rs:43-67`. Confluence "Tag Filter RC Relay Stress Test" +(referenced in `deployment-topology.md:8`) — the team has already scoped a stress test here that +the catalog does not mirror. + +**Suggested action.** Add a property family `config-filter-reload-correct` (suggest 1-2 +properties): (a) `Always`(after a filter-config update is observed, every subsequently-emitted +metric reflects the *latest* applied filter — no metric retains a tag the new denylist removes / +no blocked metric passes); (b) `Sometimes`(config update applied under concurrent load) + +`Reachable`(broadcast Lagged path) to prove the stale-config branch is reachable. Differential +formulation (Agent + ADP both receive the same RC update mid-stream, diff output after a quiet +window) reuses the `aggregate-matches-agent` harness and is the most feasible check. At minimum, +elevate the watcher's `Lagged`/partial-deserialize/key-deletion behaviors to first-class hazards +in `config-runtime-update-not-revalidated`, which today scopes them out. + +--- + +## FINDING W2 — The DogStatsD mapper is a self-contained correctness + resource surface with no property (slug: none; nearest `aggregate-matches-agent`) + +**Concern.** `dogstatsd_mapper` (`lib/saluki-components/src/transforms/dogstatsd_mapper/mod.rs`) +sits first on the DSD metrics path (`dsd_enrich` chains it, `run.rs:639-641,674`). 
It: +- compiles user-supplied regexes and does capture-group expansion into new metric **names** and + new **tag values** (`try_map`, `mod.rs:297-314`) — a rich correctness surface (wrong capture + expansion = silently renamed metric / wrong tag, customer-visible data corruption); +- has its **own result cache** keyed by metric name (`mod.rs:269-285`) and its **own string + interner** separate from the aggregate interner (`context_string_interner_size`, default 64KiB); +- silently drops the metric when its interner is full: `resolve_with_origin_tags(...)?` on both the + cache-hit path (`mod.rs:277-282`) and slow path (`mod.rs:318`) returns `None`, and `try_map`'s + `None` means the original (pre-map) context is used — but a `None` from the cache-hit branch + returns `None` from `try_map` with no telemetry. This is a *second*, mapper-local interner + exhaustion path that the catalog's `interner-full-bounded` (scoped to the aggregate/context + interner) does not cover. + +This is exactly the "correctness of the data that gets through" facet directive #1 flagged. Regex +capture expansion and a name-keyed cache are classic silent-wrong-data bug shapes; ADP claims +Agent-equivalence for the mapper too, and nothing tests it under fault or even names it. + +**Scope.** Production DSD hot path, first transform. Memory facet overlaps Category A but with a +distinct interner instance. + +**Evidence brief.** `dogstatsd_mapper/mod.rs:31-32` (wildcard match regex), `186-211` +(`build_regex`, regex compiled from `*`→`([^.]*)`), `258-345` (`try_map`, cache + capture +expansion + `?`-drop), separate interner at `103-165`. + +**Suggested action.** Add `mapper-output-matches-agent-and-bounded` (or fold into a broadened +transform-correctness differential): assert mapped names/tags match the Agent's mapper for the +same input + profiles; `Always`(mapper interner full ⇒ counted drop, not silent / not heap-spill +depending on config); `Sometimes(mapper_interner_full)`. 
Also worth a cheap regex-DoS angle: +user-supplied `regex`-type mappings are compiled with the `regex` crate (no catastrophic +backtracking, good) — confirm and note as a *pass* so it isn't re-flagged. + +--- + +## FINDING W3 — Replay-then-aggregate timestamp divergence is a correctness hazard mentioned in the SUT analysis but has no property + +**Concern.** SUT analysis §7.9 explicitly states: "a replayed capture buckets differently than +when captured (the aggregator ignores per-record timestamps for non-timestamped metrics)." The +new replay feature (`e88d04b10a`, the most regression-prone area per §8) re-injects captured +DSD records through the live socket; non-timestamped metrics are then bucketed by **current wall +clock at replay time**, not capture time. The catalog's replay properties +(`replay-no-panic-on-malformed-capture`, `replay-corruption-not-silent-eof`) are all about +*crash / parse fidelity*; none covers *aggregation fidelity of replayed data*. Given replay is the +newest, largest, riskiest feature shipping for the first customers, "does replayed data aggregate to the same +result as the original" is a correctness question the catalog skips. + +**Scope.** Replay CLI → DSD socket → aggregate. Listener-coverage variant topology. + +**Evidence brief.** SUT analysis §7.9; replay re-injection via DSD UDS +(`sources/dogstatsd/replay/`, replay CLI `dogstatsd.rs:394`); aggregate wall-clock bucketing +(`aggregate-clock-skew-stable` evidence). + +**Suggested action.** Either (a) add a note/property that replay fidelity for non-timestamped +metrics is *by design* lossy w.r.t. bucketing (document, assert nothing) — needs human input on +intent; or (b) if capture preserves timestamps, assert replayed aggregation matches original +within ratio. Flag for the team; do not over-engineer until intent is confirmed.
+ +--- + +## FINDING W4 — Standalone-mode primary topology structurally hides the entire control-plane → data-plane config surface (catalog-wide / topology) + +**Concern.** `deployment-topology.md` runs ~22/27 properties in **standalone mode**, which +"bypasses the remote-agent config stream" (topology doc, Add-on 1). The doc's own open question — +"Confirm no standalone-only code path masks a production behavior we care about" — is answered by +W1/W2: standalone mode masks the *entire runtime-config-driven filtering/mapping correctness +surface*, because in standalone there is no RC stream to deliver filter/tag updates, so the +`watch_for_updates` branches in all five components **never fire**. Any property that doesn't +force Add-on 1 will *vacuously pass* on this surface. The catalog buries the config stream in an +"Add-on" for 3 properties; in production, RC-driven filtering is a primary, always-on behavior +(the design partner's whole tag-filter relay use case). The topology under-weights it. + +**Scope.** Whole config-stream cluster + W1 + W2. + +**Evidence brief.** `deployment-topology.md:43,78-101,166-168`; `watcher.rs:29-32` +(`if self.rx.is_none() { pending_forever }` — in standalone the watcher future never resolves). + +**Suggested action.** Promote the config-stream add-on to a co-equal primary topology (or make +the stub mandatory), and route W1's new properties through it. Note explicitly in the catalog +that filter/tag/mapper-reload properties are vacuous in standalone mode. + +--- + +## FINDING W5 — Duplicate / over-overlapping properties (catalog hygiene) + +Confirmed overlaps the other lenses should reconcile: + +1. **`shutdown-drains-no-loss` vs `graceful-shutdown-within-30s`** — the catalog already + acknowledges the split ("one owns *what data survives*, the other owns *clean completion in + time*"), and the split is defensible, but both assert against the same 30s-timeout boundary and + the same forceful-stop path with the same fault setup. 
Risk: they will be implemented as one + instrumented run with two assertions, so counting them as two "properties" inflates the + portfolio. Keep both assertions, but treat as one test scenario. +2. **`data-component-failure-triggers-process-shutdown`** overlaps the forceful-stop clause of + `graceful-shutdown-within-30s` and the crash trigger of `aggregate-no-panic-any-window` / + `aggregate-clock-skew-stable` (those panics are *how* you induce the component failure). These + are three properties sharing one mechanism (component dies → process exits → s6 restarts). + Fine to keep, but the "induce a panic" half is the same event. +3. **`non-finite-values-handled-consistently` vs `ddsketch-no-nan-poison`** — `non-finite` is + largely a superset: its invariant *is* `Always(value.is_finite() at DDSketch insert boundary)`, + which is `ddsketch-no-nan-poison`'s core, plus the ghost-metric clause. The catalog demotes + `non-finite` to Medium and keeps `ddsketch-no-nan-poison` High with the live `checks_ipc` + bypass as justification. Reasonable, but the assertion site is *identical* + (`adjust_basic_stats`/`insert*`); these should be one SUT-side assertion with two reachability + anchors, not two separately-instrumented properties. +4. **`ddsketch-relative-error-bound` vs `ddsketch-bin-count-bounded`** — `relative-error` is + already demoted to a library/harness invariant (ADP doesn't call `quantile` live) and + `bin-count-bounded` owns the live facet. Borderline whether `relative-error` belongs in an + *Antithesis* catalog at all (it's a pure proptest target with existing proptests) — Lens 1 + territory, flagging for cross-check. + +**Suggested action.** Mark 1-3 as "shared-scenario" pairs so the portfolio count isn't read as +independent coverage; let Lens 1 rule on whether `ddsketch-relative-error-bound` is Antithesis- +appropriate at all. 
+ +--- + +## FINDING W6 — Mis-prioritization given the 7.80.0 ship context + +**Concern.** The ship context is: first customers, *design partner*, whose documented +interest (Confluence) is the **tag-filter RC relay**. The catalog's two highest-effort, highest- +visibility High items are `aggregate-matches-agent` (heaviest topology, own run) and the +memory-bounds family (much of which is "expected to FAIL by design under default config" — i.e. +known limitations, not regressions). Meanwhile the design partner's actual feature — runtime tag +filtering correctness (W1) — has no property. Relative to ship context, W1 should arguably be a +High before some of the "fails-by-design" memory items, which document known gaps the team +already understands rather than surfacing surprises. + +**Suggested action.** Re-rank: W1 (filter-reload correctness) to High; keep the two guaranteed- +crash config/clock findings High (cheap, real, ship-blocking). Consider that several Category-A +"fails by design" properties are really *documentation of a known limitation* and could be Medium +unless the team intends to flip defaults before 7.80.0. + +--- + +## PASSES (things the catalog/lenses got right; do not re-flag) + +- The two guaranteed-crash findings (`aggregate-no-panic-any-window` sub-second window divide-by- + zero; `aggregate-clock-skew-stable` forward-jump flood) were correctly High, cheap, and real — + verified the code shapes matched the catalog claims. _Update 2026-05-30 — the sub-second + divide-by-zero is now **fixed upstream** (window typed `NonZeroU64`, PR #1772) and demoted to a + regression tripwire; the forward-jump flood remains live._ +- `ddsketch-no-nan-poison`'s live `checks_ipc` bypass is a genuine, correctly-prioritized latent + bug. +- The default-config-is-hostile framing for Category A is accurate and well-evidenced. 
+- `host_enrichment` is static (hostname queried once at build, `host_enrichment/mod.rs:67-75`) — + **not** a runtime-mutable correctness surface; correctly *absent* from the catalog. Don't add it. +- OTTL filter/transform processors are wired into the **traces** path only + (`run.rs:561-567`), not the DSD metrics hot path, and their panic sites are `#[cfg(test)]` — + correctly out of scope for the DSD-focused topology. (Note: if a traces topology is ever added, + OTTL untrusted-config parsing becomes a parse-safety surface like the replay reader.) +- The mapper compiles user regex via the `regex` crate (no catastrophic backtracking by + construction) — the regex-DoS angle is a non-issue; record as closed. + +## UNCERTAINTIES (report-what's-odd; could not fully resolve) + +- **Is RC-stream filter update even reachable from the Core Agent for these keys?** Same open + question the catalog raises for `config-runtime-update-not-revalidated` ("can a High-severity key + be delivered, or does the Agent pre-filter?"). If the Agent *does* push `metric_tag_filterlist` / + `metric_filterlist` updates at runtime (the relay use case implies yes), W1 is High-value + and live; if these are startup-only in practice, W1 collapses to a code-review note. Needs human/ + team input — pivotal for W1's priority. +- **Does `tag_filterlist` only filtering Counter + sketch metrics (`mod.rs:235-237`) match the + Agent?** Gauges/rates appear to pass through tag-filtering untouched. Could be intentional + (sketches/counters are the cardinality drivers) or a correctness gap vs the Agent. Odd enough to + flag; no property would catch it today. +- **broadcast channel depth for config events** — could not determine the `broadcast::channel` + capacity that feeds `FieldUpdateWatcher`; how easily `Lagged` triggers under load (W1 hazard 1) + depends on it. If depth is large, the stale-config window is narrow; if 1-16, it's very + reachable under backpressure. 
Worth a one-line code check before sizing the W1 workload. +- **Whether the three "shared-scenario" property pairs (W5) are double-counted in the + coverage-balance portfolio math** — Lens 2's distribution counts may overstate independent + coverage by ~3-4 properties. diff --git a/test/antithesis/scratchbook/existing-assertions.md b/test/antithesis/scratchbook/existing-assertions.md new file mode 100644 index 00000000000..8a80c280b45 --- /dev/null +++ b/test/antithesis/scratchbook/existing-assertions.md @@ -0,0 +1,77 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: Datadog ADP Confluence space (design notes, weekly summaries, gap analyses) consulted for grounding. + - path: https://datadoghq.atlassian.net/browse/DADP + why: ADP Jira project for incidents and tracked gaps. + - path: https://github.com/DataDog/saluki/pull/1768 + why: PR review #4393897611 — re-running research caught that the harness now adds SDK assertions this file previously denied. +--- + +# Existing Antithesis SDK Assertions + +## Summary + +**A small bootstrap-and-workload assertion set exists**, added by the Antithesis harness commit +(`chore(agent-data-plane): Antithesis test harness and workload`, the parent of this scratchbook +commit). It comprises **6 SDK call sites** across three binaries: one lifecycle init and one +bootstrap reachability probe in ADP (both gated behind the `antithesis` cargo feature, no-op in +production), plus two workload-side `assert_reachable!`/`assert_sometimes!` pairs in the harness test +commands. These are **integration probes and anti-vacuity anchors**, not the property-catalog +invariants — none of the 35 cataloged property assertions is implemented yet. + +> [!NOTE] +> A prior version of this file stated no SDK assertions existed. That was true before the harness +> commit landed; it is now stale. 
Re-research on 2026-05-30 corrected it. + +## Assertions present + +| File:line | Type | Message | Gating | Purpose | +|-----------|------|---------|--------|---------| +| `bin/agent-data-plane/src/main.rs:51` | `antithesis_init()` | (lifecycle init) | `#[cfg(feature = "antithesis")]` | Registers the assertion catalog before any are evaluated; no-op outside Antithesis, absent in prod builds. | +| `bin/agent-data-plane/src/main.rs:100` | `assert_reachable!` | "agent-data-plane completed bootstrap" | `#[cfg(feature = "antithesis")]` | Bootstrap-integration probe — proves the SDK is linked, cataloging works, the instrumentation path is wired. | +| `test/antithesis/harness/src/bin/finally_verify_delivery.rs:54` | `assert_reachable!` | "intake metrics dump query succeeded" | harness binary | Confirms the delivery-verification query path ran. | +| `test/antithesis/harness/src/bin/finally_verify_delivery.rs:59` | `assert_sometimes!` | "metrics delivered end-to-end to the intake" (`delivered > 0`) | harness binary | Workload-side liveness anchor — partially seeds `forwarder-eventual-delivery`. | +| `test/antithesis/harness/src/bin/parallel_driver_send_dogstatsd.rs:77` | `assert_reachable!` | "workload sent a dogstatsd batch" | harness binary | Confirms the DSD driver actually emitted load. | +| `test/antithesis/harness/src/bin/parallel_driver_send_dogstatsd.rs:87` | `assert_sometimes!` | "workload drove a high-cardinality dogstatsd flood" (`regime == High`) | harness binary | Anti-vacuity anchor that timelines reach the high-cardinality regime — seeds `rss-bounded-under-cardinality`. | + +Dependency wiring: ADP gains the SDK only under the `antithesis` feature +(`bin/agent-data-plane/Cargo.toml:14` → `dep:antithesis_sdk`, `antithesis_sdk/full`, +`dep:antithesis-instrumentation`); the harness crate depends on `antithesis_sdk` unconditionally +(`test/antithesis/harness/Cargo.toml`). 
`antithesis-instrumentation` is an external build-time +instrumentation crate, not a source of in-tree assertions. + +## How this was determined + +Searched the repository with ripgrep over `*.rs` and `*.toml`: + +- `rg -li "antithesis" -g '*.rs' -g '*.toml'` — matches in ADP `main.rs`, the two harness binaries, + and the `Cargo.toml` files above. +- `rg "assert_always|assert_sometimes|assert_reachable|assert_unreachable|antithesis_sdk" -g '*.rs'` + — the 6 call sites tabled above; **no `assert_always!` and no `assert_unreachable!` anywhere yet.** + +## Implication for property work + +The catalog's invariants are still **net-new instrumentation**. The two `assert_sometimes!` anchors +above are workload-side only and serve anti-vacuity, not the safety/liveness invariants themselves: + +- `forwarder-eventual-delivery` has a workload-side `Sometimes(delivered > 0)` but no SUT-side + no-loss `Always`/accounting assertion — that remains to be added. +- `rss-bounded-under-cardinality` has its high-cardinality `Sometimes` anchor but no SUT-side RSS or + interner-bound `Always` — also net-new. +- The ~17 properties requiring in-process SUT-side assertions (per evaluation R2) still need ADP to + be forked and instrumented behind the `antithesis` feature. The feature scaffold now exists, which + lowers that bar — the `antithesis_init()` + bootstrap probe prove the wiring works end-to-end. + +Other existing (non-Antithesis) signals remain available to anchor assertions or workload-side +checkers: + +- **Internal telemetry counters** via the `metrics` crate (`events_discarded_total`, + `events_sent_total`, aggregate `context_limit` breach counters, forwarder queue-drop counters). +- **Rust unit tests** with std `assert!`/`assert_eq!`, dense across `saluki-components`, + `saluki-core`, `saluki-io`, and `ddsketch` — not Antithesis assertions; run only under `cargo test`. 
+- A `loom` cfg in `lib/stringtheory/src/interning/` — the authors already treat the interner's + reclamation path as concurrency-sensitive. diff --git a/test/antithesis/scratchbook/properties/aggregate-clock-skew-stable.md b/test/antithesis/scratchbook/properties/aggregate-clock-skew-stable.md new file mode 100644 index 00000000000..24e65de132e --- /dev/null +++ b/test/antithesis/scratchbook/properties/aggregate-clock-skew-stable.md @@ -0,0 +1,107 @@ +--- +slug: aggregate-clock-skew-stable +title: Aggregation stays sane across wall-clock skew +type: Safety +priority: High +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +status: assertion MISSING; CONFIRMED two-clock hazard, no monotonicity guard +--- + +# aggregate-clock-skew-stable — Aggregation stays sane across wall-clock skew + +## Property (one sentence) +A wall-clock jump (backward or forward) during aggregation never produces a flood of +zero-value counter points nor a silent gap in counter continuity, and bucketing stays +bounded and well-formed. + +## Origin +- SUT analysis §7 #9 (Wildcard): "Two-clock hazard: bucketing uses wall clock + (`get_unix_timestamp`), flush cadence uses monotonic `tokio::interval`. A backward + wall-clock jump makes the zero-value range empty (silent counter gap); a forward jump + floods zero-value points and allocates a large `SmallVec`. No monotonicity guard." + +## Files / functions / lines (CONFIRMED) +- `lib/saluki-common/src/time.rs` + - `get_unix_timestamp` (21–26): `SystemTime::now().duration_since(UNIX_EPOCH).unwrap_or_default().as_secs()` + — **wall clock**, non-monotonic; on a backward step it returns a smaller value (and on + pre-epoch it `unwrap_or_default()`s to 0). +- `lib/saluki-components/src/transforms/aggregate/mod.rs` + - Flush cadence: `interval_at(Instant::now() + flush_interval, flush_interval)` (290–293) — + **monotonic** tokio timer. 
So *when* a flush fires is monotonic, but *what timestamp* it + stamps is wall-clock. + - `insert` reads `current_time = get_unix_timestamp()` (347) per input batch; buckets via + `align_to_bucket_start(current_time, bucket_width_secs)` (579). + - `flush(get_unix_timestamp(), ...)` (319) — flush timestamp is wall clock. + - Zero-value bucket generation (627–635): + ``` + let start = align_to_bucket_start(self.last_flush, bucket_width_secs); + for bucket_start in (start..current_time).step_by(bucket_width_secs as usize) { ... } + ``` + `self.last_flush` is the wall-clock time of the previous flush (set at 718). + - **Backward jump:** `current_time < start` → range `start..current_time` is **empty** → + no zero-value buckets emitted for the gap → silent break in counter continuity (and the + `should_expire_if_empty` math `am.last_seen + counter_expire_secs < current_time` (651) + can flip, prematurely expiring or never expiring counters). + - **Forward jump:** `current_time >> start` → the loop emits one zero-value bucket per + `bucket_width_secs` across the entire jump, each pushed into + `SmallVec<[(u64, MetricValues); 4]>` (626) → large heap allocation + a flood of + zero-value points merged into every counter (661–671) and flushed downstream. + - `split_timestamp = align_to_bucket_start(current_time, w).saturating_sub(1)` (620) — a + backward jump moves the split earlier, so values already flushed in earlier (now "future") + buckets are retained, possibly re-evaluated against a smaller `current_time`. + - No comparison/guard between `current_time` and `last_flush` for monotonicity anywhere in + `flush` or `insert`. + +## Failure scenario (Antithesis angle — clock fault injection) +1. **Forward jump (e.g. 
NTP step +1h, width 10s):** next flush generates ~360 zero-value + buckets, allocating a large `SmallVec` and emitting hundreds of zero-value rate points per + live counter downstream — a metric flood and memory spike (tension with bounded-memory and + "match the Agent" — the Agent does not behave this way). +2. **Backward jump:** the zero-value range goes empty; counters that should have emitted + continuity zeros emit nothing for the skipped interval → downstream sees a gap. On a large + backward jump, `am.last_seen + counter_expire_secs < current_time` can become false for + counters that should expire (they linger, consuming context budget) or true for ones that + shouldn't. +3. **Pre-epoch / clock reset to 0:** `unwrap_or_default()` yields `current_time = 0`, making + `align_to_bucket_start(0, w) = 0` and most ranges empty — effectively freezes bucketing. +4. **Replay divergence (noted in §7 #9):** because non-timestamped metrics are bucketed by the + aggregator's *current* wall clock (not per-record capture time), a replayed capture buckets + differently than at capture — relevant to the replay feature. + +## Observations +- Antithesis can drive this directly via **clock fault injection** (step the container clock + backward/forward) while a steady counter stream flows. +- Natural bounded invariant: the number of zero-value buckets generated in a single flush must + be bounded by a sane multiple of `flush_interval / window_duration` (a couple), NOT by an + arbitrary wall-clock delta. `zero_value_buckets.len()` is the in-process anchor. +- Counter-continuity invariant: across a flush, a live counter's emitted timestamps should be + contiguous in bucket-width steps with no missing closed bucket and no duplicate. +- SUT-side instrumentation wins: `zero_value_buckets.len()` and the `last_flush`/`current_time` + pair live inside `flush`; workload-side can only observe the downstream flood/gap indirectly. 
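The forward/backward asymmetry and the proposed bound can be made concrete with a small sketch. It is not the aggregator code: `align_to_bucket_start` is assumed to round down to a multiple of the bucket width (its body is not quoted in this note), `zero_value_bucket_count` and `max_expected_buckets` are hypothetical names, and the slack of 1 is an arbitrary choice.

```rust
/// Assumed behaviour: round a Unix timestamp down to its bucket start.
/// Requires `width_secs > 0`.
fn align_to_bucket_start(ts: u64, width_secs: u64) -> u64 {
    ts - (ts % width_secs)
}

/// Hypothetical helper: counts the iterations of the
/// `(start..current_time).step_by(width)` loop quoted above.
fn zero_value_bucket_count(last_flush: u64, current_time: u64, width_secs: u64) -> usize {
    let start = align_to_bucket_start(last_flush, width_secs);
    (start..current_time).step_by(width_secs as usize).count()
}

/// Hypothetical bound: ceil(flush_interval / window) plus a slack of 1.
fn max_expected_buckets(flush_interval_secs: u64, width_secs: u64) -> usize {
    ((flush_interval_secs + width_secs - 1) / width_secs + 1) as usize
}

fn main() {
    let (width, flush_interval) = (10u64, 15u64);
    let last_flush = 1_000_000u64; // already bucket-aligned for width 10
    let bound = max_expected_buckets(flush_interval, width);

    let normal = zero_value_bucket_count(last_flush, last_flush + flush_interval, width);
    let forward = zero_value_bucket_count(last_flush, last_flush + 3_600, width);
    let backward = zero_value_bucket_count(last_flush, last_flush - 3_600, width);

    assert!(normal <= bound); // steady state stays within the bound
    assert!(forward > bound); // +1h step: the bound flags the flood
    assert_eq!(backward, 0); // -1h step: empty range, silent gap
    println!("normal={normal} forward={forward} backward={backward} bound={bound}");
}
```

Under these assumptions a bound on the bucket count catches only the forward jump; the backward jump yields zero buckets, which passes any upper bound. That is why a separate monotonicity or continuity check is still needed for that branch.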
+ +## Config dependencies +- `aggregate_window_duration` (bucket width) and `aggregate_flush_interval` (cadence) set the + expected per-flush zero-value bucket count. +- `counter_expiry_seconds` interacts with the skewed `last_seen` expiry comparison (651, 662). + +## Suggested assertion +- `assert_always(zero_value_buckets.len() <= max_expected, "zero-value bucket count bounded")` + inside `flush`, where `max_expected` ≈ `ceil(flush_interval / window_duration) + small_slack`. + Flags the forward-jump flood. +- `assert_always(current_time >= self.last_flush, "aggregate flush time is monotonic")` OR, if a + monotonicity guard is added, `AlwaysOrUnreachable` that the backward branch is handled (clamp + `current_time` to `>= last_flush`). +- `Sometimes(clock_jumped_during_flush)` to confirm the skew fault actually coincided with a flush. + +## Open questions +- Intended fix: switch bucketing to a monotonic source, or guard + `current_time = max(current_time, last_flush)` and cap the zero-value loop iterations? This + decides whether the assertion is `Unreachable(flood)` vs `Always(bounded)`. +- Is there an upstream protection against `get_unix_timestamp()` returning 0 (pre-epoch)? + None found — worth confirming the container clock can't be stepped below epoch in the harness. +- What is the Agent's behavior under the same clock step? Needed to know whether "bounded and + no flood" also means "still matches Agent" (ties into `aggregate-matches-agent`). +- Does the coarse-time path (`get_coarse_unix_timestamp`, 41–59) feed any aggregate code? (It + does not appear to; aggregate uses the accurate `get_unix_timestamp`.) Confirm no second path. 
diff --git a/test/antithesis/scratchbook/properties/aggregate-context-limit-enforced.md b/test/antithesis/scratchbook/properties/aggregate-context-limit-enforced.md new file mode 100644 index 00000000000..d6bfbad1a91 --- /dev/null +++ b/test/antithesis/scratchbook/properties/aggregate-context-limit-enforced.md @@ -0,0 +1,121 @@ +# aggregate-context-limit-enforced + +**Family:** Resource Boundaries — bounded state / queues +**Status:** Verified against code at commit 042f41db3b. Property is expected to **HOLD** (this is +a real, enforced invariant) — it is the load-bearing memory-determinism lever for the aggregator. + +## What led to the property + +`sut-analysis.md` §3 and §7 identify the aggregation state map as "the central memory-determinism +lever." Unlike the interner (which spills to heap by default) and the advisory memory limiter +(off by default), the aggregate context limit is a **hard, always-on cap enforced at insert +time**, independent of any `memory_mode`. It is the one runtime memory bound that is not +advisory. Worth asserting precisely because it is the strongest claim ADP actually makes about +bounded aggregation state. + +## The invariant and where it lives + +Aggregation state is a single `HashMap` (`hashbrown`) owned solely by +the transform task — no locks, all mutation `&mut self` (`transforms/aggregate/mod.rs`, +`AggregationState`). The cap: + +- Default `aggregate_context_limit = 1_000_000` (`mod.rs:47-49` `default_context_limit`, field at + `mod.rs:114-115`, stored in `AggregationState.context_limit` at `mod.rs:531`). 
+- Enforced in `AggregationState::insert` (`mod.rs:566-571`): + ```rust + if !self.contexts.contains_key(metric.context()) && self.contexts.len() >= self.context_limit { + self.context_limit_breached = true; + return false; // new context over the cap is DROPPED + } + ``` + Critically the guard is gated on `!contains_key`: an **existing** context always proceeds to + merge (lines 573+), so the cap only ever rejects *new* contexts, never breaks aggregation of + already-tracked ones. +- The caller (`mod.rs:375-384`) treats `insert == false` as a drop: increments + `events_dropped` telemetry and logs **one** warning per breach episode (gated on + `was_breached` so it doesn't spam). +- Recovery: `context_limit_breached` is cleared once `contexts.len() < context_limit` again + (`mod.rs:714-715`), e.g. after a flush evicts contexts. + +So the precise invariant: **live context count never exceeds `aggregate_context_limit`; over-cap +*new* contexts are dropped-and-counted; *existing* contexts always merge.** + +## Failure scenario (Antithesis) + +Flood DSD with far more than `aggregate_context_limit` distinct contexts (set the limit low, +e.g. 1000, to make the boundary reachable within a run). Assert the map size never exceeds the +cap and that drops are counted. Antithesis adds value over the deterministic correctness harness +by interleaving the flood with **flush timing** and **counter zero-value keep-alive**: zero-value +counters kept alive after flush still count against the limit (`sut-analysis.md` §3), so the +boundary can be hit by keep-alives, not just fresh contexts — a timing-sensitive interaction the +fixed-clock harness won't explore. Also tests the recovery edge: does `len()` correctly dip below +the cap after a flush and re-admit new contexts (clearing `context_limit_breached`)? 
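The cap, drop, and recovery behavior described above can be modeled in isolation. This is a simplified sketch, not the real `AggregationState`: `CappedContexts`, its `String` keys, its `f64` values, and the `evict` method are stand-ins chosen for illustration, and the inline `assert!` marks where an SDK `Always` could sit.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the aggregation state map.
struct CappedContexts {
    contexts: HashMap<String, f64>,
    context_limit: usize,
    context_limit_breached: bool,
}

impl CappedContexts {
    fn new(context_limit: usize) -> Self {
        Self { contexts: HashMap::new(), context_limit, context_limit_breached: false }
    }

    /// Returns false when a NEW context would exceed the cap.
    /// Existing contexts always merge.
    fn insert(&mut self, context: &str, value: f64) -> bool {
        if !self.contexts.contains_key(context) && self.contexts.len() >= self.context_limit {
            self.context_limit_breached = true;
            return false;
        }
        *self.contexts.entry(context.to_string()).or_insert(0.0) += value;
        // Where an `Always(len <= limit)` assertion could be anchored.
        assert!(self.contexts.len() <= self.context_limit);
        true
    }

    /// Stand-in for flush-time eviction; clears the breach flag once the
    /// map is back under the cap.
    fn evict(&mut self, context: &str) {
        self.contexts.remove(context);
        if self.contexts.len() < self.context_limit {
            self.context_limit_breached = false;
        }
    }
}
```

A workload that sets the limit low, inserts past it, merges into an existing context, and then evicts one entry exercises all three halves of the invariant in a few calls.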
+ +## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist) + +- `Always(state.contexts.len() <= context_limit)` anchored in `insert`/after-insert in + `transforms/aggregate/mod.rs`. Safety: must hold on every check. Honest — there is no code path + that grows the map past the cap, so this is a true `Always`, not an aspirational one. +- `AlwaysOrUnreachable(contains_key(ctx) || len < limit ⇒ insert succeeds)` — i.e. an existing + context is *never* dropped by the cap. Captures the "existing always merges" half. Use + AlwaysOrUnreachable because the merge-of-existing path may not be exercised in a given run. +- `Sometimes(context_limit_breached == true)` — proves the workload actually reaches the boundary + (otherwise the `Always` above is vacuously true). Liveness/reachability of the interesting state. +- `Sometimes(events_dropped incremented due to context limit)` — proves the drop is counted, not + silent-and-uncounted. + +This is a strong candidate for a true SUT-side `Always` because the bound is a local, lock-free, +single-owner invariant — exactly the kind Antithesis `Always` is designed for. + +## Configuration dependencies + +- `aggregate_context_limit` (default 1,000,000). For a finite-duration run, must be lowered so + the boundary is reachable. +- `counter_expiry_seconds` (default 300): kept-alive zero-value counters occupy context slots + until expiry, affecting how easily the cap is reached and when `len()` dips below it. +- `aggregate_window_duration` / primary flush interval (default 15s): flush cadence drives when + contexts are evicted and the breach flag clears. + +## Open questions + +- The cap counts *contexts*, not *bytes*. A single context with many distinct timestamped values + is one map entry but unbounded value memory (cross-ref `rss-bounded-under-cardinality`). So this + property bounds entry count, NOT aggregator memory. Prose must not overclaim "bounded memory." 
+ +## Investigation Log + +#### Zero-value keep-alive counters: storage location and flush-time `contexts.len()` behavior +- **Examined**: `lib/saluki-components/src/transforms/aggregate/mod.rs`: + `AggregatedMetric` struct (522-525), `AggregationState` (529-558), `insert` (566-610), + `flush` (612-719), and the dedicated test `context_limit_with_zero_value_counters` + (1104-1157). Also the module doc on zero-value counters (71-75) and `is_empty` (562-564). +- **Found (a) — storage**: There is **NO separate structure** for zero-value/keep-alive + counters. An idle counter remains as a normal entry in the single + `contexts: HashMap` map (529). On flush, closed-bucket values are + split off and emitted (682-695) leaving `am.values` empty; the entry is only removed if + `am.values.is_empty() && should_expire_if_empty` (697). For counters, + `should_expire_if_empty` is **false** until `last_seen + counter_expire_secs < current_time` + (649-654), so a kept-alive counter is an empty-valued entry that **stays in `contexts`** and + on each subsequent flush gets a fresh `0.0` bucket merged in (661-672) and re-emitted. +- **Found (a) — cap check counts them**: `insert` rejects a new context when + `!contexts.contains_key(..) && contexts.len() >= context_limit` (568). Since idle counters + are live entries in `contexts`, they **count toward `context_limit`** at the cap check. The + test at 1104-1157 asserts exactly this: with `context_limit = 2`, two counters that have gone + to zero-value mode still block insertion of a third (`assert!(!state.insert(... metric3 ...))`, + 1138), and the third only succeeds (1152) after the two expire and are dropped (1148). +- **Found (b) — when `len()` drops**: `contexts.len()` drops **during flush**, in the removal + pass at 703-707 (`for context in self.contexts_remove_buf.drain(..) { self.contexts.remove(...) }`), + only for entries that were marked at 697-700 (empty values AND eligible to expire). 
The breach + flag is cleared right after if `contexts.len() < context_limit` (714-716). So flush DOES remove + expired/all-closed non-counter contexts and expired counters; it does NOT remove kept-alive + (not-yet-expired, empty) counters. The recovery edge (re-admitting new contexts) is reachable + only after kept-alive counters actually expire (`counter_expire_secs`, default 300s) or a + non-counter context flushes empty. +- **Not found**: No code path tracks zero-value counters outside `contexts`; no separate counter + toward the limit. +- **Conclusion**: RESOLVED. The true live bound is exactly **`context_limit`** (map entries), + NOT `context_limit + zero_value_count` — kept-alive zero-value counters are ordinary map + entries already included in the `len()` cap check. The `Always(contexts.len() <= context_limit)` + assertion target is correct as-is. Caveat (already noted): this bounds entry *count*, not bytes; + with `counter_expiry_seconds` default 300s a flood of sparse counters keeps `len()` pinned near + the cap for ~5 min, which delays but does not breach the bound. 
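The flush-time removal rule traced in this log can be condensed into a small model. It is a sketch under simplifying assumptions: every bucket is treated as closed and emitted on each flush, and `TrackedEntry` and `flush_and_count` are hypothetical names rather than types from `mod.rs`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tracked context.
struct TrackedEntry {
    is_counter: bool,
    values: Vec<f64>,
    last_seen: u64,
}

/// Models one flush and returns the resulting map size.
fn flush_and_count(
    contexts: &mut HashMap<String, TrackedEntry>,
    current_time: u64,
    counter_expire_secs: u64,
) -> usize {
    contexts.retain(|_, entry| {
        // Simplification: all values are emitted, leaving the entry empty.
        entry.values.clear();
        // Counters stay alive until `last_seen + expiry < now`; other
        // metric types may expire as soon as they are empty.
        let should_expire_if_empty =
            !entry.is_counter || entry.last_seen + counter_expire_secs < current_time;
        !(entry.values.is_empty() && should_expire_if_empty)
    });
    contexts.len()
}
```

In this model an idle counter last seen at `t = 100` with a 300s expiry still occupies a slot at `t = 400` and is only removed by a flush at `t = 401` or later, which is why the recovery edge depends on the expiry setting.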
diff --git a/test/antithesis/scratchbook/properties/aggregate-matches-agent.md b/test/antithesis/scratchbook/properties/aggregate-matches-agent.md new file mode 100644 index 00000000000..69ad7766c75 --- /dev/null +++ b/test/antithesis/scratchbook/properties/aggregate-matches-agent.md @@ -0,0 +1,96 @@ +--- +slug: aggregate-matches-agent +title: Aggregated output matches the Datadog Agent +type: Safety +priority: High +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +status: assertion MISSING (no Antithesis SDK in tree) +--- + +# aggregate-matches-agent — Aggregated output matches the Datadog Agent + +## Property (one sentence) +For the same input metric stream, ADP's aggregated output (counter→rate conversion, +half-open `[start, start+width)` buckets, histogram/distribution statistics) equals the +Datadog Agent's output — and that equivalence is preserved under fault injection +(delayed/skipped flush, restart, backpressure, clock perturbation). + +## Origin +- SUT analysis §5 safety #3 ("Aggregation output matches the Datadog Agent … explicitly + 'to match the Datadog Agent'"). +- Existing correctness suite is a **diff test** (`bin/correctness/`) that already checks + happy-path equivalence deterministically. The Antithesis angle is preserving equivalence + under faults the diff harness cannot inject (§6 gaps 1, 5). + +## Files / functions / lines +- `lib/saluki-components/src/transforms/aggregate/mod.rs` + - `counter_values_to_rate` (810–815): `MetricValues::Counter(points) => MetricValues::rate(points, interval)`. + - Passthrough conversion (451–459): counters→rate using **bucket width** as the rate + interval, with the in-code comment "we have to match the behavior of the Datadog Agent. ¯\_(ツ)_/¯". 
+ - `transform_and_push_metric` (728–808): histogram → per-statistic metrics (count as rate + with bucket width, others as gauge); copy-to-distribution builds a `DDSketch` via + `insert_n` per sample (740–750); rate statistics use `MetricValues::rate(.., bucket_width)`. + - `is_bucket_closed` (821–843) + doc: half-open `[start, start+width)`, closed iff + `(bucket_start + width - 1) < current_time`. + - `align_to_bucket_start` (817–819): `timestamp - (timestamp % bucket_width_secs)`. + - `flush` split at `split_timestamp = align_to_bucket_start(current_time, w).saturating_sub(1)` (620). +- Diff harness: + - `bin/correctness/stele/src/metrics.rs` `PartialEq for MetricValue` (153–186): + Count/Rate/Gauge compared with `approx_eq_ratio(RATIO_ERROR=1e-8)`; **Rate also requires + `interval_a == interval_b`** (171); Sketch compared on min/max/avg/sum (ratio) + exact + `count()` + exact `bin_count()` (175–182). + - `bin/correctness/panoramic` (drives identical workload into Agent + ADP), `millstone` + (load gen), `datadog-intake` (mock intake). Fixed `FLUSH_WAIT = 32s` (per SUT analysis §6). + +## Failure scenario (Antithesis angle) +Diff equivalence is established only under a healthy, deterministic run. Faults that can +break it without the existing suite ever noticing: +1. **Delayed / skipped flush:** if the monotonic `primary_flush` interval is delayed (CPU + starvation, scheduler pause), a bucket that should have closed flushes one interval late. + Combined with the wall-clock bucketing (see `aggregate-clock-skew-stable`), the emitted + timestamps/rate intervals can diverge from the Agent. +2. **Restart mid-window:** `flush_open_windows=false` default drops open buckets on shutdown + (SUT analysis §3). The Agent baseline and ADP may shed different partial windows on a kill, + producing a one-window data delta. +3. 
**Backpressure:** a slow downstream blocks the aggregate's dispatcher (`dispatcher.flush().await`, + 330); if input continues to arrive during the stall, late-arriving updates may land in a + different wall-clock bucket than the Agent assigns them. +4. **Counter→rate interval:** the `interval` carried on a rate is the **bucket width**, not the + flush interval. If window vs flush interval are misconfigured relative to the Agent, the + `interval_a == interval_b` check (stele 171) fails even when values match. + +## Observations +- This is fundamentally a **differential** property: it requires running both ADP and a + Datadog Agent baseline against an identical stream and diffing normalized output. It is not + expressible as a single in-process SDK assertion the way the others in this catalog are. +- Best realized in Antithesis by extending the existing `panoramic` diff harness to run + *inside* the Antithesis environment and assert equivalence (`assert_always` the diff is + empty / within ratio) **while Antithesis injects faults** (network, process kill+restart, + clock). The diff result is the natural assertion anchor. +- OTLP metrics deliberately **skip aggregation** (SUT analysis §2, `run.rs:751-753`) to avoid + counter→rate; equivalence claims apply to the DSD path. + +## Config dependencies +- `aggregate_window_duration` (default 10s) — drives bucket width AND the rate interval. +- `aggregate_flush_interval` (default 15s) — drives flush cadence (monotonic). +- `aggregate_flush_open_windows` (default false) — governs restart-window deltas. +- `counter_expiry_seconds` (default 300) — zero-value counter continuity. +- `histogram_aggregates` / `histogram_copy_to_distribution[_prefix]` — histogram output shape. +- Agent baseline must be configured with matching window/flush/expiry for a fair diff. 
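The half-open bucket rule and the rate comparison referenced in this file can be sketched together. The two bucket formulas are the ones quoted above. `within_ratio` is an assumed relative-error check and is not claimed to match stele's `approx_eq_ratio`, and all function names and signatures here are illustrative.

```rust
/// Ratio tolerance quoted for the stele comparison.
const RATIO_ERROR: f64 = 1e-8;

/// Bucket start for a timestamp: `timestamp - (timestamp % width)`.
fn bucket_start_for(timestamp: u64, bucket_width_secs: u64) -> u64 {
    timestamp - (timestamp % bucket_width_secs)
}

/// Half-open `[start, start + width)`: closed once its last second is
/// strictly in the past.
fn bucket_is_closed(current_time: u64, bucket_start: u64, bucket_width_secs: u64) -> bool {
    (bucket_start + bucket_width_secs - 1) < current_time
}

/// Assumed relative-error comparison (hypothetical definition).
fn within_ratio(a: f64, b: f64, ratio: f64) -> bool {
    if a == b {
        return true;
    }
    let scale = a.abs().max(b.abs());
    (a - b).abs() / scale <= ratio
}

/// Rates as (value, interval_secs): values may differ within the ratio,
/// but the intervals must be identical.
fn rates_match(a: (f64, u64), b: (f64, u64)) -> bool {
    a.1 == b.1 && within_ratio(a.0, b.0, RATIO_ERROR)
}
```

Under this model, two rates with equal values but intervals of 10 and 15 do not match, which is the false-positive source that a mismatched Agent baseline configuration would trigger.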
+ +## Suggested assertion +- Workload-side (harness): `assert_always(diff_within_ratio, "ADP aggregation matches Agent")` + evaluated on the normalized stele diff after each flush window, **with faults active**. +- `Sometimes(fault_was_injected_during_window)` to confirm the equivalence check actually ran + under a perturbed condition (not just clean windows). + +## Open questions +- Does the existing `panoramic` harness tolerate a process restart of ADP mid-run, or does it + assume a single long-lived process? (Determines how restart-equivalence is asserted.) +- What FLUSH_WAIT is needed once faults delay flushes? The fixed 32s may be too short under + injected scheduler pauses, causing false diffs that are timing artifacts not correctness bugs. +- Is the Agent baseline's bucket width guaranteed identical to ADP's `aggregate_window_duration` + in the harness config? If not, the `interval` equality check is a false-positive source. +- Are zero-value counters (continuity) emitted identically by both sides across a skipped flush? diff --git a/test/antithesis/scratchbook/properties/aggregate-no-panic-any-window.md b/test/antithesis/scratchbook/properties/aggregate-no-panic-any-window.md new file mode 100644 index 00000000000..ab6f81d8c04 --- /dev/null +++ b/test/antithesis/scratchbook/properties/aggregate-no-panic-any-window.md @@ -0,0 +1,98 @@ +--- +slug: aggregate-no-panic-any-window +title: No aggregate_window_duration value causes a panic +type: Safety / Reachability +priority: Low +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +status: FIXED UPSTREAM (window now NonZeroU64, PR #1772); retained as regression tripwire +--- + +# aggregate-no-panic-any-window — No `aggregate_window_duration` value causes a panic + +> **Update (2026-05-30): FIXED UPSTREAM on main.** The original `% 0` / `step_by(0)` panic vector +> documented below is now structurally impossible. 
The config key was renamed +> `aggregate_window_duration_seconds` and is typed `NonZeroU64` (`transforms/aggregate/mod.rs:95-98`); +> `bucket_width_secs` is `NonZeroU64` end-to-end and `align_to_bucket_start` divides by +> `bucket_width_secs.get()` (`:822-823`), which can never be zero. A zero/sub-second value now fails +> config deserialization instead of reaching the divisor (PR #1772). The forensic detail below is +> retained as the historical evidence trail for the original defect; the property survives only as a +> low-cost `Unreachable` regression tripwire. The repro test +> `tests::bug_sub_second_aggregate_window_panics_on_insert` (in a sibling stack commit) is now stale +> and should be dropped or converted to a passing guard — see the bug ledger. + +## Property (one sentence) +No configured `aggregate_window_duration_seconds` value causes the aggregate transform to panic; +the bucket-width divisor is `NonZeroU64` and can never be zero, so zero/sub-second values are +rejected at config load rather than reaching the modulo path. + +## Origin +- SUT analysis §7 #8 (Wildcard): "Sub-second `aggregate_window_duration` → guaranteed panic: + `bucket_width_secs = window.as_secs()` with no validation; a value < 1s yields `% 0` + divide-by-zero and `step_by(0)` panics." + +## Files / functions / lines (CONFIRMED) +- `lib/saluki-components/src/transforms/aggregate/mod.rs` + - `AggregationState::new` (542–560): `bucket_width_secs: bucket_width.as_secs()` (553). + For any `window_duration < 1s`, `as_secs()` truncates to **0**. No validation. + - `align_to_bucket_start` (817–819): `timestamp - (timestamp % bucket_width_secs)` → + **`% 0` panics** (`attempt to calculate the remainder with a divisor of zero`). + Called from `insert` (579) on every metric and from `flush` (620, 628). + - `flush` zero-value loop (630): `(start..current_time).step_by(bucket_width_secs as usize)` + → **`step_by(0)` panics** (`step_by called with step == 0`). 
Reached on the 2nd+ flush + (`self.last_flush != 0`, 627). + - First reachable panic is in `insert` via `align_to_bucket_start` (579) — i.e. on the very + first metric, before any flush, if `bucket_width_secs == 0`. +- Config plumbing (CONFIRMED no validation): + - `AggregateConfiguration.window_duration` (92–93) deserialized via serde with + `default = default_window_duration` (10s); no `#[validate]` / range check. + - `from_configuration` (187–189): `config.as_typed()` — pure deserialize, no bounds check. + - `config_registry/datadog/aggregate.rs` (7–13, 50–58): `aggregate_window_duration` declared + as `ValueType::String`, `SupportLevel::Full`, `default: None`, `test_json {"secs":42}`. + No minimum/positive-value constraint anywhere in the registry. + - Repo-wide grep for `aggregate_window_duration` / `window_duration` shows zero validation + sites (only definition, default, two uses, and a telemetry nanos field at 1534). + +## Failure scenario +Operator sets `aggregate_window_duration: 500ms` (or any value `< 1s`, e.g. `{"secs":0,"nanos":...}`). +- `bucket_width_secs = 0`. +- First DSD metric reaching `dsd_agg` calls `insert` → `align_to_bucket_start(ts, 0)` → `% 0` + panic → aggregate task panics → data topology component finishes unexpectedly → + `wait_for_unexpected_finish` → **whole-process shutdown** (SUT analysis §2 supervision; data + components are fail-stop, not restarted). s6 restarts ADP, which re-reads the same bad config + and panics again → crash loop. This directly violates the "won't crash" guarantee. +- Note: a `1500ms` window does NOT panic (`as_secs() == 1`) but silently truncates the window + to 1s — a separate correctness surprise worth a `Sometimes` observation. + +## Observations +- The panic is deterministic given the config; Antithesis value is exercising the config space + (including the `{"secs":0,"nanos":N}` Duration shape the registry advertises) and catching the + crash, OR validating the planned fix. 
+- `PassthroughBatcher` also receives `window_duration` as `bucket_width` (220–224) but only uses + it as a `Duration` for the rate interval (`counter_values_to_rate`), not as a divisor — so the + passthrough path does not panic on sub-second windows; only the aggregation state path does. + +## Config dependencies +- `aggregate_window_duration` (the sole trigger). Truncation via `as_secs()` means any value in + `(0, 1s)` → 0 → panic; `[1s, ...)` → floor to whole seconds (lossy but safe). + +## Suggested assertion +- If the fix is **validation at config load**: `assert_always(window_secs >= 1, ...)` at the + point `AggregationState` is constructed (or `AlwaysOrUnreachable` that the build path rejects + sub-second windows), plus `Unreachable` on the `% 0` / `step_by(0)` code path. +- If no fix: `assert_unreachable("aggregate align_to_bucket_start reached with bucket_width_secs == 0")` + placed in `align_to_bucket_start` / before the `step_by` loop — Antithesis flags it the moment a + sub-second window is fed. +- `Reachable("aggregate constructed with sub-second window")` to confirm the workload actually + explores that config region. + +## Open questions +- Should the fix clamp (`max(1)`), reject at load (fatal config error, consistent with §5 #5 + "config incompatibility is fatal at startup"), or support genuine sub-second bucketing + (would require changing `bucket_width_secs` from `u64` seconds to a finer unit)? This changes + whether the assertion is `Always(validated)` vs `Unreachable(panic path)`. +- Does the gRPC dynamic-config stream allow pushing `aggregate_window_duration` at runtime? If + so, a mid-run config update to a sub-second window is a live crash vector, not just a startup one. +- SUT-side instrumentation needed: the divisor lives deep in `align_to_bucket_start`; an + `Unreachable` assertion there is the cleanest signal (workload-side cannot observe the divisor). 
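The difference between the historical truncation and a `NonZeroU64` divisor can be shown with the standard library alone. This is a sketch, not the upstream change: `truncated_secs`, `parse_window_secs`, and `align_nonzero` are hypothetical names, and only the `as_secs()` truncation and the `.get()` divisor are taken from the notes above.

```rust
use std::num::NonZeroU64;
use std::time::Duration;

/// Historical behavior: whole seconds only, so sub-second windows become 0.
fn truncated_secs(window: Duration) -> u64 {
    window.as_secs()
}

/// Validation at load: zero is rejected before it can become a divisor.
fn parse_window_secs(raw: u64) -> Result<NonZeroU64, String> {
    NonZeroU64::new(raw).ok_or_else(|| "window duration must be at least 1 second".to_string())
}

/// With a `NonZeroU64` width the modulo cannot divide by zero.
fn align_nonzero(timestamp: u64, bucket_width_secs: NonZeroU64) -> u64 {
    timestamp - (timestamp % bucket_width_secs.get())
}
```

A 500ms window truncates to 0 and a 1500ms window truncates to 1, so the first would be rejected by `parse_window_secs` while the second is accepted with a silently shortened width.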
diff --git a/test/antithesis/scratchbook/properties/config-incompatible-refuses-start.md b/test/antithesis/scratchbook/properties/config-incompatible-refuses-start.md new file mode 100644 index 00000000000..a1d85421d28 --- /dev/null +++ b/test/antithesis/scratchbook/properties/config-incompatible-refuses-start.md @@ -0,0 +1,152 @@ +--- +slug: config-incompatible-refuses-start +title: High-severity incompatible config refuses to start the pipeline +family: Lifecycle Transitions & Configuration +type: Safety (Reachability / Unreachable) +priority: Medium +status: assertion-missing +sut_commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +--- + +# config-incompatible-refuses-start + +## Origin + +SUT analysis §5 Safety #5 ("Config incompatibility is fatal at startup (high-severity incompatible +non-default key → refuse to run)") and §6 (integration suite has "config-check exit codes" cases). + +## Files / functions / lines + +- `bin/agent-data-plane/src/cli/run.rs:157`: + ```rust + check_and_warn_config(&config).error_context("Incompatible configuration detected.")?; + ``` + Crucial ordering: this is at run.rs:157, **before** `create_topology` (run.rs:168), + `blueprint.build()` (run.rs:238), and `built_topology.spawn(...)` (run.rs:239). The `?` propagates + the error out of `handle_run_command` before any data component is ever built or spawned. +- `bin/agent-data-plane/src/cli/run.rs:331-381` — `check_and_warn_config`: + - Iterates `config.flattened_keys()` (run.rs:336-339). + - `config_classifier.classify(&key, &val)` (run.rs:341) → `None` skips invalid/N-A keys. + - `classification.is_default` (run.rs:346) → keys with default values are **skipped** (the Agent + populates defaults, so only **non-default** values count). + - Match on `support_level` (run.rs:351-370): + - `Full` / `Incompatible(Low)` / `Partial` / `Incompatible(Medium)` → log + **Proceed**. + - `Incompatible(Severity::High)` (run.rs:362-366) → `error!` log + + `high_severity_incompatibilities += 1`. 
+ - `Ignored` / `Unrecognized` → silently ignored. + - run.rs:373-378: if `high_severity_incompatibilities > 0`, returns + `Err(generic_error!("{n} incompatible configuration detected. ADP cannot start. …"))`. + **All keys are checked before returning** (so the count and all error logs are complete). +- `bin/agent-data-plane/src/main.rs:136-146` — the `Err` from `handle_run_command` maps to + `Some(1)`; `main.rs:101-104` calls `std::process::exit(1)`. So a high-severity incompatibility → + **process exit code 1, pipeline never spawned.** +- `lib/saluki-components/src/config_registry/classifier.rs:42-51` — `classify`: looks up the schema + entry, returns `Classification { support_level, is_default }`. +- `lib/saluki-components/src/config_registry/classifier.rs:53-…` — `is_default_value`: compares the + value against the schema's documented default (incl. null/empty-string handling, alias handling). +- `lib/saluki-components/src/config_registry/mod.rs:144-175` — `Severity { Low, Medium, High }` and + `SupportLevel::{Full, Partial, Incompatible(Severity), Ignored, Unrecognized}`. Incompatible keys + live in `config_registry/datadog/unsupported.rs` with their severity. + +## Key observation + +The refuse-to-start gate is correctly placed **before** topology build/spawn, so the safety +property "pipeline never runs with a high-severity-incompatible non-default key" is structurally +enforced by control flow, not by a runtime check inside the running pipeline. The two strongest +Antithesis assertions: + +- **Unreachable:** "data pipeline spawned while a high-severity incompatible non-default config key + is present." Place an `assert_unreachable!` reading a flag (set during `check_and_warn_config` + when a high-severity incompatibility was seen) at/after `spawn()` (run.rs:239). Because the `?` at + run.rs:157 returns first, this point is never reached with such a key set — Antithesis confirms it. 
+- **Reachable:** "ADP refused to start due to incompatible config" — mark the + `high_severity_incompatibilities > 0` return path (run.rs:373) as reachable so the workload knows + the refusal path is actually exercised under the incompatible-config workload. + +## Failure scenario (Antithesis angle) + +- Workload injects a config containing a known high-severity-incompatible key with a **non-default** + value (sourced from `config_registry/datadog/unsupported.rs` `Severity::High` entries). Expect: + process exits with code 1, no `topology_ready_ms` log, no listener bound, no data forwarded. +- Negative control: same key but at its **default** value → `is_default` true → skipped → ADP + starts normally. Assert the refusal path is NOT taken (the gate must not be over-eager). +- Multiple high-severity keys → still exits 1, error reports the count (run.rs:374-377). Verify all + are logged before exit (debuggability invariant). +- Medium/Low/Partial incompatible keys → ADP proceeds (warn/debug only). Assert pipeline DOES start + — confirms severity gating is graded, not all-or-nothing. + +## Config dependencies + +- The exact set of `Severity::High` keys lives in `config_registry/datadog/unsupported.rs` + (generated/maintained list). The workload needs at least one current High-severity key+non-default + value to exercise the refusal path; this list can drift, so the workload should source the key + dynamically or be pinned to the commit. +- Config arrives either from bootstrap YAML/env (run.rs:149 branch) or from the Core Agent dynamic + config (run.rs:107-145 branch). `check_and_warn_config` runs on the **final** resolved `config` + (run.rs:157) regardless of source, so an incompatible key delivered *over the config stream* is + also gated. (Note: on the dynamic path the gate runs once at startup after `ready()`; a key that + becomes incompatible via a later partial update is NOT re-checked — see Open Questions.) 
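The graded gating that the scenarios above rely on reduces to a short loop. The enum and field names follow the ones cited from `config_registry`, but this is a standalone model: `check_classified` is a hypothetical function that takes already-classified keys instead of walking `config.flattened_keys()`, and logging is omitted.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Severity { Low, Medium, High }

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum SupportLevel { Full, Partial, Incompatible(Severity), Ignored, Unrecognized }

struct Classification {
    support_level: SupportLevel,
    is_default: bool,
}

/// Counts non-default high-severity keys and refuses only if any exist.
/// Every key is inspected before returning, so the count is complete.
fn check_classified(keys: &[Classification]) -> Result<(), String> {
    let mut high_severity_incompatibilities = 0usize;
    for key in keys {
        if key.is_default {
            continue; // default values never count
        }
        if key.support_level == SupportLevel::Incompatible(Severity::High) {
            high_severity_incompatibilities += 1;
        }
    }
    if high_severity_incompatibilities > 0 {
        return Err(format!(
            "{high_severity_incompatibilities} incompatible configuration detected. ADP cannot start."
        ));
    }
    Ok(())
}
```

The negative control and the graded-severity case correspond to the `is_default` skip and the fall-through for every level other than `Incompatible(Severity::High)`.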
+ +## Assertion (MISSING — net-new instrumentation) + +No Antithesis SDK assertions exist. Proposed SUT-side: +- In `check_and_warn_config`, when `high_severity_incompatibilities > 0`, before returning Err: + `assert_reachable!("ADP refused to start: high-severity incompatible config")`. +- Set a process-local flag `saw_high_severity_incompat = true` in that branch; add + `assert_unreachable!("pipeline spawned with high-severity incompatible config", + saw_high_severity_incompat)` immediately after `built_topology.spawn(...)` at run.rs:239 (it + should be statically unreachable because the `?` already returned, but the assertion makes the + guarantee explicit and catches any future reordering regression). +- Alternatively / additionally, workload-side: `process_exits_with(1)` + assert no + `topology_ready_ms` log + intake never receives data (mirrors existing integration "config-check + exit codes" cases but under fault injection). + +## SUT-side instrumentation needs + +- Antithesis SDK dependency (none today). +- Reachable marker on the refusal branch; Unreachable marker after spawn keyed on a + high-severity-seen flag. +- Workload must supply a current `Severity::High` key with a non-default value. + +### Investigation Log + +#### Are runtime partial config updates re-validated by the incompatibility gate? (2026-05-28) + +**Examined:** +- `bin/agent-data-plane/src/cli/run.rs:157` (`check_and_warn_config` call site), `:331-381` + (`check_and_warn_config` body), `:14` (the only import of `ConfigClassifier`/`Severity`/ + `SupportLevel`). +- `lib/saluki-config/src/lib.rs:541-651` (`run_dynamic_config_updater` — the task that applies all + runtime `Snapshot`/`Partial` updates over the gRPC config stream). +- Grep across `lib/saluki-config/` and `bin/agent-data-plane/src/` for `ConfigClassifier`, + `check_and_warn_config`, `classify(`. 
+ +**Found — gate is startup-only, NOT re-run at runtime:** +- `check_and_warn_config` constructs a fresh `ConfigClassifier::new()` (run.rs:333) and is invoked + exactly once, at run.rs:157, before `create_topology`/`build`/`spawn`. The `?` returns the + process before the pipeline is built (matches the existing "Key observation" section). +- The runtime updater `run_dynamic_config_updater` rebuilds the figment on every update + (lib.rs:564-578 for the initial snapshot, lib.rs:621-649 for subsequent updates) and dispatches + `ConfigChangeEvent`s via `dynamic::diff_config` (lib.rs:633-639), but it contains **no reference + to `ConfigClassifier` or `check_and_warn_config`** and performs no support-level/severity check. + A `Partial` update is applied via `upsert(&mut dynamic_state, &key, value)` (lib.rs:612) and the + new figment is swapped in (lib.rs:645) — unconditionally. +- The grep confirms `ConfigClassifier` and `check_and_warn_config` appear ONLY in run.rs (the import + at :14 and the call/definition at :157/:331). The saluki-config crate that owns the dynamic updater + has zero awareness of the classifier. + +**Not found:** No runtime re-validation hook, no severity check on `ConfigChangeEvent`, no path that +re-enters `check_and_warn_config` after startup, and no mechanism that would refuse/halt on a +runtime-delivered high-severity key. The classifier crate (`saluki-components::config_registry`) is +not a dependency of the dynamic-update path. + +**Conclusion (RESOLVED, scope confirmed):** The incompatibility gate runs **only once at startup**. +A config key that flips to a high-severity-incompatible value via a later `Partial` (or `Snapshot`) +update over the gRPC stream is applied to the live figment and broadcast as a change event — ADP +**keeps running**; it does NOT refuse-to-start or shut down. 
The `config-incompatible-refuses-start` +property is therefore correctly scoped to **startup configuration only** (the `?` at run.rs:157 +guards topology spawn against the *startup-resolved* config, including the first snapshot + env +overlays, per the existing notes). Runtime re-validation is a genuine GAP and warrants a separate +property/finding ("runtime config update can introduce a high-severity-incompatible key with no +re-gate") — not folded into this safety property. Recommend filing that gap; this file's Open +Questions are now resolved. diff --git a/test/antithesis/scratchbook/properties/config-runtime-update-not-revalidated.md b/test/antithesis/scratchbook/properties/config-runtime-update-not-revalidated.md new file mode 100644 index 00000000000..e2e7f72ab12 --- /dev/null +++ b/test/antithesis/scratchbook/properties/config-runtime-update-not-revalidated.md @@ -0,0 +1,60 @@ +# config-runtime-update-not-revalidated + +## Origin + +Surfaced during the open-question investigation of `config-incompatible-refuses-start`. ADP's +incompatibility gate (`check_and_warn_config` + `ConfigClassifier`) protects startup, but the +**dynamic config stream** from the Core Agent can deliver partial/snapshot updates at runtime that +are never re-classified. A config key that becomes high-severity-incompatible *after* startup is +applied silently. + +## Code paths + +- `bin/agent-data-plane/src/cli/run.rs:157` — `check_and_warn_config(&config)` runs exactly once, + before `create_topology` / `build` / `spawn`. Its `Err` aborts startup (`exit(1)`). +- `ConfigClassifier` / `check_and_warn_config` are referenced **only** in `run.rs` — there is no + re-validation hook on the dynamic-config update path. +- The dynamic config updater (`saluki-config` `lib.rs` ~541-651) applies runtime `Partial`/`Snapshot` + updates with no classifier check. 
+- The config stream itself: `bin/agent-data-plane/src/internal/remote_agent.rs` (config event loop) + pushes `ConfigUpdate::Snapshot/Partial` into the dynamic configuration. + +## Failure scenario + +The Core Agent pushes a config update (e.g. enabling a feature ADP classifies as +`Incompatible(Severity::High)`) over the AgentSecure config stream while ADP is running. ADP applies +it and **keeps running** in a configuration it would have refused to start with — risking wrong +aggregates or silent data corruption, exactly the outcome the startup gate exists to prevent. + +## Property + +- **Type:** Safety (Reachability / scope gap) +- **Invariant:** Either `Unreachable("pipeline running with high-severity incompatible non-default + key after a runtime config update")`, or — if the intended design is "startup-only gating" — a + `Reachable` marker proving the unguarded runtime-apply path is taken, documenting the gap. +- **Antithesis angle:** Start ADP with a clean config (passes the gate), then inject a config-stream + update carrying a high-severity-incompatible non-default key; observe whether ADP detects/refuses + or silently applies it. This exercises the control-plane → data-plane config path the diff-test + never touches. +- **Priority:** Medium (depends on whether high-severity keys are reachable via the stream in + practice — a product question). + +## Open Questions + +- Is this an intentional design choice (startup-only gating, runtime updates trusted because they + come from the authoritative Core Agent) or an oversight? `(needs human input)` +- Can a `Severity::High` key actually be delivered over the config stream, or does the Core Agent + pre-filter what it sends to remote agents? Determines real-world reachability. + +### Investigation Log + +#### Are runtime partial config updates re-validated by the incompatibility gate? 
+ +- Examined: `bin/agent-data-plane/src/cli/run.rs:157,331-381` (gate + caller), grep for + `check_and_warn_config` / `ConfigClassifier` across the tree, `saluki-config` dynamic updater + (`lib.rs` ~541-651), `internal/remote_agent.rs` config event loop. +- Found: the gate runs exactly once at startup; the classifier is referenced only in `run.rs`; the + runtime update path applies `Partial`/`Snapshot` updates with no classifier call. +- Conclusion: confirmed — a runtime config update can introduce a high-severity-incompatible key + with no re-gate, and ADP keeps running. Filed as this standalone property. The remaining questions + (intentional vs. oversight; stream reachability of high-severity keys) need the team's input. diff --git a/test/antithesis/scratchbook/properties/config-stall-no-deadlock.md b/test/antithesis/scratchbook/properties/config-stall-no-deadlock.md new file mode 100644 index 00000000000..5d9963ec6a9 --- /dev/null +++ b/test/antithesis/scratchbook/properties/config-stall-no-deadlock.md @@ -0,0 +1,187 @@ +--- +slug: config-stall-no-deadlock +title: Config-stream stall does not deadlock or busy-loop startup +family: Lifecycle Transitions & Configuration +type: Liveness +priority: High +status: assertion-missing +sut_commit: 042f41db3bd97118c38981765fd49696fce9d318 +--- + +# config-stall-no-deadlock + +## Origin + +SUT analysis §2 (control plane: "Startup **blocks** on `dynamic_config.ready().await` for the +first config (run.rs:119-121, *no timeout shown*). Stream end → reconnect after fixed 5s") and §7 +#13 ("Core Agent reachability assumed at startup: ADP blocks indefinitely on +`dynamic_config.ready()` with no visible timeout — if the Agent never sends config, ADP never +starts the pipeline"). 
+ +## Files / functions / lines + +- `bin/agent-data-plane/src/cli/run.rs:119-121`: + ```rust + info!("Waiting for initial configuration from Datadog Agent..."); + dynamic_config.ready().await; + info!("Initial configuration received."); + ``` + This is on the `use_new_config_stream_endpoint` path (run.rs:107). It is reached only after + `RemoteAgentBootstrap::from_configuration` (run.rs:96-104) — which itself **blocks on the initial + registration** (`remote_agent.rs:97-105`, `init_reg_rx.await`). +- `lib/saluki-config/src/lib.rs:687-704` — `GenericConfiguration::ready()`: awaits a `oneshot` + receiver (`ready_signal`). **No timeout.** If the oneshot never fires it awaits forever. If the + sender is dropped, `ready_rx.await` returns `Err` and `ready()` logs an error and **returns** + (so a dropped channel unblocks startup; an idle-but-open channel does not). +- `lib/saluki-config/src/lib.rs:541-584` — `run_dynamic_config_updater`: the oneshot + `ready_sender.send(())` fires (lib.rs:581) **only after the first `ConfigUpdate::Snapshot`** is + received and the figment is rebuilt. If `receiver.recv()` returns `None` before any snapshot + (channel closed), the task returns *without* sending ready (lib.rs:546-552) → `ready()` sees the + sender dropped → returns with an error log (does NOT hang). If the channel stays open but no + snapshot ever arrives, the task awaits at lib.rs:546 forever and `ready()` hangs forever. +- `bin/agent-data-plane/src/internal/remote_agent.rs:251-304` — `run_config_stream_event_loop`: + - `:260` waits for a session ID (`session_id.wait_for_update().await`). + - `:262-263` opens `stream_config_events` and drains it. + - On stream error (`:295-298`): logs `error!` and **continues the inner while-loop** — this can + spin if the stream yields a steady error item without ending (see Open Questions). 
+ - On stream end (outer loop falls through to `:301-302`): `debug!("Config stream ended, retrying + in 5 seconds…"); tokio::time::sleep(Duration::from_secs(5)).await;` — fixed 5s reconnect + backoff, then loops back to `:255`. +- `remote_agent.rs:139-148` — `create_config_stream` builds the `mpsc::channel(100)` whose receiver + feeds `run_dynamic_config_updater`. The config-stream event loop is the **sender** side; it only + drops the sender by `return`ing (e.g. `:289-292` when the dynamic config channel is closed). + +## Investigation: IS there a timeout? + +**No.** Confirmed by reading `GenericConfiguration::ready()` (lib.rs:694-704): a bare +`ready_rx.await` with no `tokio::time::timeout` wrapper, and `run.rs:120` calls it bare. There is +also no timeout on `init_reg_rx.await` in `RemoteAgentBootstrap::from_configuration` +(remote_agent.rs:97). The registration *retries* in the background (registration loop +`remote_agent.rs:185-249`, `DEFAULT_REFRESH_INTERVAL=30s`, `REFRESH_FAILED_RETRY_INTERVAL=5s`), and +the first registration result is forwarded to `init_reg_rx` (success or error) — so the **bootstrap +registration** does resolve (Ok or Err) on the first attempt. But the subsequent **config-stream +`ready()`** has no timeout: if the Core Agent registers ADP but never streams a snapshot, ADP hangs +at run.rs:120 indefinitely, logging only the single "Waiting for initial configuration…" line. + +## Honest framing of the property + +This is a **liveness** property with two acceptable outcomes (not a crash, not a busy-loop): +1. **Progress:** ADP eventually receives the first snapshot and logs "Initial configuration + received." → proceeds to build topology. +2. **Bounded waiting:** ADP remains observably blocked at "Waiting for initial configuration from + Datadog Agent…" — a *quiescent* await (parked on a oneshot / parked in `sleep`), **not** burning + CPU and **not** panicking. 
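
The difference this file draws between a dropped sender (which unblocks `ready()`) and an idle-but-open sender (which does not) can be illustrated with a std-only analogue. This is a sketch under stated assumptions, not ADP code: `std::sync::mpsc` stands in for the tokio `oneshot`, and `ReadyOutcome`, `wait_ready`, and the `limit` parameter are hypothetical names introduced here. The real `ready()` has no such limit; the sketch adds one only so that the silent case terminates.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome type for this sketch; not part of saluki.
#[derive(Debug, PartialEq)]
enum ReadyOutcome {
    /// The ready signal fired (in ADP: the first snapshot was applied).
    Ready,
    /// The sender was dropped before signalling (in ADP: `ready()` logs an error and returns).
    SenderDropped,
    /// The sender is open but silent (in ADP: `ready()` would keep awaiting).
    StillWaiting,
}

/// Waits for the ready signal, giving up after `limit` so the sketch always returns.
/// ASSUMPTION: `limit` exists only in this sketch; the real `ready()` has no timeout.
fn wait_ready(rx: &mpsc::Receiver<()>, limit: Duration) -> ReadyOutcome {
    match rx.recv_timeout(limit) {
        Ok(()) => ReadyOutcome::Ready,
        Err(RecvTimeoutError::Disconnected) => ReadyOutcome::SenderDropped,
        Err(RecvTimeoutError::Timeout) => ReadyOutcome::StillWaiting,
    }
}
```

Only the `StillWaiting` case corresponds to the stall this property is about: the blocked receiver is parked rather than polling, which is the quiescent behaviour the workload should expect to observe.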
+ +The property to assert is therefore: **the config stall never produces a crash, panic, or +busy-loop; ADP is either making progress or quiescently waiting.** It is NOT correct to assert +"ADP always eventually starts" — with no timeout and a permanently-silent Agent, it legitimately +never starts. Document that the *absence of a timeout* is the design as-is (matches s6-supervised +container model where the operator/Agent presence is assumed). + +## Failure scenarios (Antithesis angle) + +- **Drop the config snapshot:** Core Agent registers ADP but the config stream never sends a + `Snapshot` (or sends only `Partial`). Expect: quiescent block at run.rs:120; CPU near zero; no + panic. Falsify on busy-loop (high CPU while "waiting"). +- **Flap the stream:** stream repeatedly opens then ends (EOF). Expect: reconnect every 5s + (remote_agent.rs:302), no tighter spin. Sometimes(reconnect-after-5s path taken). +- **Steady stream error (no EOF):** stream yields `Err` items continuously + (remote_agent.rs:295-298 `continue`s the inner loop without backoff). **Potential busy-loop / + log-flood.** This is the highest-value falsification target — assert CPU/iteration rate bounded. +- **Close the dynamic-config channel mid-startup:** sender drop → `ready()` returns with error log, + startup proceeds (or downstream fails). Verify no hang and no panic. +- Network partition between ADP and Core Agent during/after registration. + +## Config dependencies + +- `use_new_config_stream_endpoint` (run.rs:93) — gates whether `ready()` is awaited at all. If + false (legacy `remote_agent_enabled` only), the `(bootstrap_config, bootstrap_dp_config)` branch + (run.rs:149) is taken and there is **no `ready()` wait** → property N/A. +- `standalone_mode` (run.rs:91): standalone skips remote-agent bootstrap entirely → no config stall. +- `secure_api_listen_address` (remote_agent.rs:75) — needed for registration. 
+ +## Assertion (MISSING — net-new instrumentation) + +No Antithesis SDK assertions exist. Proposed: +- Wrap the conceptual "waiting for config" region with a `Sometimes("config wait was entered")` + reachability marker just before run.rs:120, and a `Reachable("initial configuration received")` + just after run.rs:121 — so the workload can distinguish "stalled" vs "progressed". +- The busy-loop hazard (remote_agent.rs:295-298) is best caught **workload-side**: monitor CPU / + log-line rate while the config stream errors; assert bounded. No clean in-process assertion. +- `Always("no panic in config path")` is implicit (panic = crash = Antithesis catches it); not a + bespoke assertion. + +## SUT-side instrumentation needs + +- Antithesis SDK dependency (none today). +- Reachability markers around run.rs:119-121 to separate "entered wait" from "config received". +- Workload-side CPU/log-rate monitor for the busy-loop hazard. + +### Investigation Log + +#### Steady stream error: busy-loop or backoff? + `init_reg_rx` boundedness (2026-05-28) + +**Examined:** +- `bin/agent-data-plane/src/internal/remote_agent.rs:251-304` (`run_config_stream_event_loop`), + `:185-249` (`run_remote_agent_registration_loop`), `:162-183` (`RemoteAgentState::new`). +- `lib/datadog-agent-commons/src/ipc/client/mod.rs:202-224` (`stream_config_events`) and + `lib/datadog-agent-commons/src/ipc/client/streaming.rs:53-93` (`StreamingResponse::poll_next`, + the stream type ADP iterates) plus its regression tests `:105-133`. +- tonic 0.14.6 `Streaming::poll_next` at + `~/.cargo/registry/.../tonic-0.14.6/src/codec/decode.rs:399-421`. +- `lib/datadog-agent-commons/src/ipc/session.rs:67-103` (`SessionIdHandle`). + +**Found — busy-loop question RESOLVED, NOT a bug:** +- The stream ADP drains is `StreamingResponse`, which wraps either an `Initial` + RPC-establishment future or a tonic `Streaming`, plus a `Terminated` state (streaming.rs:11-23). 
+- An **initial** RPC error (connection refused, RPC rejected, session invalid) → `Outcome::Terminate` + → the stream fuses to `Terminated` and yields `Some(Err(status))` exactly **once**, then `None` + forever (streaming.rs:70-72, 86-89). This is the dominant error mode for a steadily-failing stream + (the RPC never establishes), and it terminates immediately → outer loop hits the **5s sleep** + (remote_agent.rs:301-302). Confirmed by the test + `streaming_response_terminates_after_initial_error` (streaming.rs:105-122). +- A **mid-stream** error from an already-established `Streaming`: `StreamingResponse` yields the + `Some(Err(_))` (does NOT itself terminate, streaming.rs:74-77), but the underlying tonic + `Streaming::poll_next` (decode.rs:399-421) yields the error **once** then transitions to + `State::Error(None)` so the very next poll returns `Poll::Ready(None)` (decode.rs:403-408, + `status.take()` empties the Option). The explicit comment at decode.rs:403-405 confirms: "yield + that error once and then on subsequent calls return Poll::Ready(None)". +- Net: in BOTH error modes the inner `while let Some(result) = stream.next().await` loop + (remote_agent.rs:263) sees at most one `Err`, ADP logs one `error!` (remote_agent.rs:295-298) and + `continue`s, then the next `.next()` yields `None` → inner loop exits → the **5s sleep** + (remote_agent.rs:302) runs before reconnect. There is NO unbounded spin and NO reconnect tighter + than 5s. The `continue` at :297 can iterate at most once per stream instance. +- A residual log/CPU concern only remains if the Core Agent could establish the stream and then emit + a *steady cadence* of error frames over a long-lived HTTP/2 body without ever closing it — but + tonic ends the body on the first decode/transport error, so this is not reachable with the standard + client. 
The hazard described in the Failure Scenarios ("steady stream error, no EOF") is therefore + NOT realizable through this stack; downgrade it from "highest-value falsification target" to a + non-issue. Flap-the-stream (repeated EOF every 5s) remains the realistic shape. + +**Found — `init_reg_rx.await` (remote_agent.rs:97) is bounded:** +- `RemoteAgentState::new` always initializes `session_id: SessionIdHandle::empty()` + (remote_agent.rs:176) and `initial_registration_tx: Some(init_reg_tx)` (:178). The handle is freshly + created per bootstrap (`RemoteAgentState::new` returns it, :85), so it cannot already hold a + session ID. +- The registration loop's first `loop_timer.tick()` (remote_agent.rs:192) fires immediately (tokio + interval first tick is immediate). With an empty `session_id`, `state.session_id.get()` returns + `None` (session.rs:84-90) → the loop takes the `None` register branch (remote_agent.rs:206-246), NOT + the refresh branch. Both Ok and Err arms send on `initial_registration_tx.take()` + (remote_agent.rs:233-235 and :241-243). So the first result (success or failure) is always + delivered → `init_reg_rx.await` resolves on the first attempt. +- The "session_id already Some on first tick → refresh branch, never sends" path is **impossible**: + the only writer of a non-None session ID is the register branch itself (`:230`), which has not yet + run on the first tick. Confirmed no path leaves `initial_registration_tx` unsent. + +**Not found:** No metric or counter for stream-reconnect cadence; the only signal is the +`debug!`/`error!` logs at remote_agent.rs:296 and :301. No per-iteration throttle beyond the +terminate-then-5s-sleep structure (none needed given the termination semantics above). + +**Conclusion:** The busy-loop concern is resolved — a steadily-erroring config stream cannot +spin: the stream terminates after one error and the loop backs off 5s. 
The `init_reg_rx.await` +bootstrap wait is bounded (always resolves Ok/Err on the first registration attempt). The remaining +true liveness gap is unchanged and is the *snapshot stall* (`ready()` at run.rs:120 has no timeout): +a Core Agent that registers ADP but never streams a `Snapshot` leaves ADP quiescently blocked +forever. Property framing should drop the "steady stream error → busy-loop" falsification target as +unreachable through tonic, and keep the "drop the snapshot → quiescent (low-CPU) indefinite block, +no panic" assertion as the load-bearing one. diff --git a/test/antithesis/scratchbook/properties/data-component-failure-triggers-process-shutdown.md b/test/antithesis/scratchbook/properties/data-component-failure-triggers-process-shutdown.md new file mode 100644 index 00000000000..6be7188d7b6 --- /dev/null +++ b/test/antithesis/scratchbook/properties/data-component-failure-triggers-process-shutdown.md @@ -0,0 +1,134 @@ +--- +slug: data-component-failure-triggers-process-shutdown +title: Any data component finishing triggers whole-process shutdown (fail-stop) +family: Lifecycle Transitions & Configuration +type: Safety (Always) + Reachability +priority: High +status: assertion-missing +sut_commit: 042f41db3bd97118c38981765fd49696fce9d318 +--- + +# data-component-failure-triggers-process-shutdown + +## Origin + +SUT analysis §2 Supervision ("**the primary data topology is NOT** [supervised] — `RunningTopology` +spawns each data component into a `JoinSet` with **no restart**. Any data component finishing → +`wait_for_unexpected_finish` → **whole-process shutdown**") and §7 (fail-stop recovery model: s6 +restarts ADP on exit). The invariant: ADP must never run a **silently half-running** pipeline. 
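
The fail-stop model quoted above (no per-component restart, first completion ends the run) can be sketched with plain threads. This is an illustration only: ADP's components are async tasks in a `JoinSet`, whereas this sketch uses `std::thread` and a channel, and `wait_for_first_finish` is a hypothetical name that merely echoes `wait_for_unexpected_finish`.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns one thread per component and returns the name of the first one to finish.
/// ASSUMPTION: components are plain closures here, not saluki component types.
fn wait_for_first_finish(
    components: Vec<(&'static str, Box<dyn FnOnce() + Send>)>,
) -> &'static str {
    let (done_tx, done_rx) = mpsc::channel();
    for (name, run) in components {
        let done_tx = done_tx.clone();
        thread::spawn(move || {
            run();
            // Report completion; ignore the error if nobody is listening any more.
            let _ = done_tx.send(name);
        });
    }
    // Drop the original sender so an empty component list is detected instead of hanging.
    drop(done_tx);
    // No restart: the first completion is surfaced to the caller, who then shuts everything down.
    done_rx.recv().expect("no components to wait for")
}
```

The caller is expected to treat the returned name as a reason to stop the whole pipeline rather than to respawn that one component. With an empty list the `expect` panics, since there is nothing to wait for.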
+ +## Files / functions / lines + +- `lib/saluki-core/src/topology/built.rs:158-410` — `BuiltTopology::spawn`: each component + (sources, transforms, encoders, forwarders, destinations, relays, decoders) is spawned via + `spawn_component` (built.rs:666-687) into a single `JoinSet`. There is + **no per-component restart wrapper** — unlike `runtime/supervisor.rs`, the topology does not + re-spawn a finished component. +- `lib/saluki-core/src/topology/running.rs:40-51` — `wait_for_unexpected_finish`: + ```rust + let task_result = self.component_tasks.join_next_with_id().await + .expect("no components to wait for"); + handle_task_result(&mut self.component_task_map, task_result, /*unexpected=*/true); + ``` + Returns as soon as **any one** component task finishes (Ok, Err, or panic). It does NOT loop / + restart — the single completion is surfaced to the caller. +- `bin/agent-data-plane/src/cli/run.rs:280-283` — in the main `select!`: + ```rust + _ = running_topology.wait_for_unexpected_finish() => { + error!("Topology component unexpectedly finished. Shutting down..."); + topology_failed = true; + }, + ``` + Any component finishing wins this select arm → falls through to + `running_topology.shutdown_with_timeout(30s)` (run.rs:290) → shuts down the **whole** topology → + process proceeds to exit. With `topology_failed = true`, the final result is `Ok` (clean shutdown) + but logged as "Topology shutdown complete despite error(s)." (run.rs:302-303). Process then exits; + the container's s6 supervisor restarts ADP (full-process restart = recovery model). +- `lib/saluki-core/src/topology/running.rs:130-162` — `handle_task_result` with `unexpected=true`: + a clean `Ok(())` finish is logged as `warn!("Component unexpectedly finished.")` (running.rs:140); + an `Err` as `error!("Component stopped with error.")`; a `JoinError` (panic/cancel) as + `error!("Component task failed unexpectedly.")`.
+- `lib/saluki-core/src/runtime/supervisor.rs` — the supervisor with `OneForOne`/`OneForAll` restart + (supervisor.rs:477-481) applies to the **internal supervisor only** (control plane / internal + telemetry / env), assembled at run.rs:185-202. It does **not** wrap data components. This is the + "crucial split" from SUT §2. + +## Key observation / honest framing + +The fail-stop guarantee is real and structural: there is exactly one `JoinSet` for data components +and exactly one `wait_for_unexpected_finish` arm that converts any single completion into +whole-topology shutdown. The defensible invariant: + +- **Always:** whenever a data component task terminates before an operator-initiated shutdown + (SIGINT), the process initiates topology-wide shutdown — it never continues running the remaining + components as a partial pipeline. Equivalently: there is no state where component count has + decreased due to an unexpected finish *and* the topology keeps processing. +- **Reachable:** the `wait_for_unexpected_finish` → shutdown path is actually hit when a data + component is induced to finish (proves fail-stop fires, not just that it exists). + +Caveat to state honestly: between the moment a component finishes and the moment shutdown completes, +the pipeline is transiently "half-running" (other components still alive, draining). That window is +*bounded by the 30s shutdown* (see `graceful-shutdown-within-30s`) and is by design. The invariant +is about **not silently staying** half-running, not about instantaneous teardown. + +## Failure scenarios (Antithesis angle) + +- **Induce a component panic** (e.g. trigger one of the hot-path `.expect`/`unreachable!` sites in + SUT §7 #14; note the sub-second `aggregate_window_duration` panic of §7 #8 is now closed upstream) + → component task ends with + `JoinError` → `wait_for_unexpected_finish` fires → process shuts down. Assert the shutdown path is + reached and the process exits (s6 then restarts). 
Falsify if the pipeline keeps running with a + dead component. +- **Component returns Ok unexpectedly** (clean finish mid-run, e.g. a source whose loop exits on a + closed socket) → same fail-stop path (running.rs:140 warn). Confirms even a "successful" early + finish triggers shutdown. +- **Forwarder task exits on permanent error** → fail-stop. +- Distinguish from SIGINT: under SIGINT the `ctrl_c` arm (run.rs:284) wins, `topology_failed` stays + false. The property is specifically about the **non-SIGINT** completion arm. + +## Config dependencies + +- Number/identity of data components depends on enabled pipelines (run.rs:414-457). The invariant + holds regardless, but the workload should know which component it is killing. +- Internal-supervisor components are explicitly **excluded** — a control-plane component failing at + runtime is handled by the supervisor (run.rs:263-271 logs and continues), NOT by this fail-stop. + Do not assert fail-stop for internal-supervisor component failures. + +## Assertion (MISSING — net-new instrumentation) + +No Antithesis SDK assertions exist. Proposed SUT-side: +- In the `wait_for_unexpected_finish` select arm (run.rs:280-283), before/at the `error!` log: + `assert_reachable!("data component unexpectedly finished → process shutting down")`. +- To express the Always invariant in-process is awkward (it is enforced by control flow). Best + approach: a workload-side temporal assertion — *whenever* the + `"Topology component unexpectedly finished. Shutting down..."` log appears (or any component-death + telemetry), the process must subsequently exit (and not continue serving). Antithesis + query-logs/temporal checks (event A always precedes process-exit B) fit this well. +- Optionally instrument `handle_task_result` (running.rs) to emit a distinct telemetry/log on + unexpected component finish so the workload can detect the half-running window and assert it is + always followed by shutdown. 
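
The workload-side temporal check proposed above ("death implies exit") could look roughly like the following scan over an ordered event trace. This is a minimal sketch: `Event` and `first_unanswered_death` are hypothetical names, the mapping from log lines to events is assumed to happen elsewhere, and nothing here is an Antithesis SDK API.

```rust
/// Hypothetical event model for a workload-side log scan.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    /// The "Topology component unexpectedly finished. Shutting down..." line was seen.
    UnexpectedFinish,
    /// The ADP process was observed to exit.
    ProcessExit,
    /// Any other observation, for example a payload served by the pipeline.
    Other,
}

/// Returns the index of the first `UnexpectedFinish` that no later `ProcessExit` answers,
/// or `None` if every unexpected finish is eventually followed by a process exit.
fn first_unanswered_death(events: &[Event]) -> Option<usize> {
    let mut pending: Option<usize> = None;
    for (i, event) in events.iter().enumerate() {
        match event {
            // Remember only the earliest death of the current process lifetime.
            Event::UnexpectedFinish if pending.is_none() => pending = Some(i),
            // An exit answers whatever death was pending.
            Event::ProcessExit => pending = None,
            _ => {}
        }
    }
    pending
}
```

A finite trace cannot distinguish "has not exited yet" from "never exits", so a check like this is only meaningful if the trace extends past the 30s shutdown timeout after the last `UnexpectedFinish`.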
+ +## Open questions + +- **Is `wait_for_unexpected_finish` always being polled?** It is one arm of the run.rs:255 `select!`. + Once any other arm completes (SIGINT, internal supervisor finish) the `select!` exits and + `wait_for_unexpected_finish` is no longer polled — but at that point shutdown is already happening, + so the invariant still holds. Confirm there is no window after topology spawn (run.rs:239) but + before the `select!` (run.rs:255) where a component could finish unobserved. The intervening code + (run.rs:241-253) spawns a detached readiness task and sets two bools — quick and non-awaiting on + the topology — so the gap is negligible, but worth noting. WHY IT MATTERS: a component dying in + that gap would still be caught by `join_next_with_id` once the select polls (JoinSet buffers + completions), so likely safe; confirm. +- **`expect("no components to wait for")` (running.rs:45):** if the topology has zero components, + `join_next_with_id()` returns `None` and this panics. Could an empty topology be built? run.rs:401 + errors out if `!data_pipelines_enabled()`, and `create_topology` adds at least a forwarder + + components for any enabled pipeline, so a spawned topology should be non-empty. WHY IT MATTERS: a + panic here would itself trigger process exit (still fail-stop-ish) but via an ugly path. WHAT + CHANGES: low priority; document as a defensive-panic site. + +## SUT-side instrumentation needs + +- Antithesis SDK dependency (none today). +- Reachable marker on the run.rs:280 unexpected-finish arm. +- A distinct log/telemetry event on unexpected component finish to anchor a workload-side temporal + "death-implies-exit" assertion. 
diff --git a/test/antithesis/scratchbook/properties/ddsketch-bin-count-bounded.md b/test/antithesis/scratchbook/properties/ddsketch-bin-count-bounded.md new file mode 100644 index 00000000000..3344deb2b4f --- /dev/null +++ b/test/antithesis/scratchbook/properties/ddsketch-bin-count-bounded.md @@ -0,0 +1,108 @@ +--- +slug: ddsketch-bin-count-bounded +title: DDSketch bin count never exceeds bin_limit +type: Safety +priority: Medium +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +status: assertion MISSING (as Antithesis); strong unit + proptest coverage exists +--- + +# ddsketch-bin-count-bounded — DDSketch bin count never exceeds bin_limit + +## Property (one sentence) +After any sequence of inserts, multi-weight inserts, interpolations, and merges, an agent +`DDSketch`'s bin count never exceeds `bin_limit` (4096). + +## Origin +- SUT analysis §5 safety #4: "bin count must never exceed 4096". +- Doc/comment in `agent/sketch.rs` collapse logic and `trim_left` ("leaving exactly bin_limit bins"). + +## Files / functions / lines (CONFIRMED) +- `lib/ddsketch/build.rs` (2–4): `AGENT_DEFAULT_BIN_LIMIT = 4096`, `AGENT_DEFAULT_EPS = 1/128`, + `AGENT_DEFAULT_MIN_VALUE = 1e-9`; emitted as `DDSKETCH_CONF_BIN_LIMIT`. +- `lib/ddsketch/src/agent/config.rs` (8, 14–15): `Config.bin_limit` set from `DDSKETCH_CONF_BIN_LIMIT`. +- `lib/ddsketch/src/agent/sketch.rs` + - `trim_left` (689–714): enforces the cap. `if bin_limit == 0 || bins.len() <= bin_limit { return; }` + else drains the lowest `bins.len() - bin_limit` bins, folding their mass into the first kept + bin → "leaving exactly bin_limit bins" (712–713). + - Every mutation path calls `trim_left(.., SKETCH_CONFIG.bin_limit)`: + `insert` (358), `insert_keys` (319), `insert_key_counts` (255), `merge` (579). 
+ - `generate_bins` (716–735): a single `(k, n)` with `n >= u32::MAX` could emit multiple bins + for one key (overflow split, 731–733); but `trim_left` runs immediately after every caller, + so the post-operation bin count is still capped. (Historic regression fixed: with old u16 + bins a large weight exploded bin count — test `trim_left_respects_bin_limit_with_large_weights` + 824–846.) + - `bin_count()` (90–92): `self.bins.len()` — the value to assert on. +- Existing tests already assert this: unit tests (760–891) and **proptests** + `prop_bin_count_never_exceeds_limit` (924–936), `prop_output_bins_are_sorted_and_distinct`, + `prop_output_keys_are_highest_from_input`, etc. (919–1023). +- ADP entry into the agent sketch: `aggregate/mod.rs:7` `use ddsketch::DDSketch` (re-export of + `agent::DDSketch`, `lib.rs:56`); built in `transform_and_push_metric` (743–745) via + `DDSketch::default()` + `insert_n` per histogram sample, and distributions flow as `SketchPoints`. +- Separate impl: the **canonical** `DDSketch` (`canonical/sketch.rs`) uses + `CollapsingLowestDenseStore` with `max_num_bins` (default 2048) and an `assert!(max_num_bins >= 1)` + (collapsing_lowest.rs:37), collapsing on growth (67–87). Not on the aggregate hot path, but the + same invariant applies if/where it is used. + +## Failure scenario (Antithesis angle) +The unit/proptests run only under `cargo test` on isolated inputs. Antithesis adds value by +checking the invariant **live, on real production sketches**, after arbitrary interleavings: +- Histogram→distribution conversion inserting thousands of distinct sample values per flush + (743–745) feeding `insert_n` with large weights, then `merge`d across windows. +- Merge of many incoming agent sketches (future "take sketches shipped by the agent, merge them" + use case, sketch.rs:33–35) where each `merge` (542–582) extends then `trim_left`s. +- A code path that mutates bins **without** calling `trim_left` (e.g. 
a new insert helper, or + `insert_raw_bin` 491–495 which intentionally does NOT trim) escaping the cap — exactly the + kind of regression a live `Always` assertion would catch that the targeted tests would not. + +## Observations +- This invariant is structurally enforced today; the Antithesis assertion is a **regression + tripwire** placed at the sketch boundary, valuable because the cap is re-established by a + separate `trim_left` call at each mutation site (easy to miss when adding a new mutator). +- SUT-side instrumentation strongly preferred: `bin_count()` is internal and per-sketch; a + workload-side checker cannot see individual sketches mid-pipeline. + +## Config dependencies +- `bin_limit` is compile-time (build.rs), not runtime-configurable for the agent sketch. + +## Suggested assertion +- `assert_always(self.bins.len() <= SKETCH_CONFIG.bin_limit as usize, + "DDSketch bin count within bin_limit")` placed at the **end of every mutating method** + (`insert`, `insert_n`/`insert_keys`/`insert_key_counts`, `merge`, `insert_interpolate_buckets`) + — i.e. one shared check after `trim_left`. +- `Reachable("DDSketch trim_left collapsed bins")` to confirm the workload actually drives a + sketch past the limit (otherwise the `Always` is vacuously true and proves nothing). + +## Open questions +- `insert_raw_bin` is `#[cfg(test)]`/`pub(crate)` test-only (490) and bypasses `trim_left` — confirm + it can never be reached in a release build (otherwise it is a hole in the invariant). +- `generate_bins` overflow split for `n >= u32::MAX`: under truly extreme single-key weights, does + the transient (pre-`trim_left`) bin vector allocation matter for memory? Probably not, but worth + a `Sometimes` if huge weights are in scope. 
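
The `trim_left` behaviour described under "Files / functions / lines" (drain the lowest bins, fold their mass into the first kept bin) and the suggested post-mutation check can be sketched as follows. This is a simplified model, not the code in `agent/sketch.rs`: the `(i32, u64)` bin representation is an assumption made here, and the count-overflow handling the real sketch needs is deliberately left out.

```rust
/// Simplified bin for this sketch: (key, count).
/// ASSUMPTION: the real agent sketch has its own `Bin` type with narrower counts.
type Bin = (i32, u64);

/// Collapses the lowest-keyed bins so that at most `bin_limit` remain, preserving total count.
/// `bins` must be sorted by ascending key. A `bin_limit` of 0 means "no limit".
fn trim_left(bins: &mut Vec<Bin>, bin_limit: usize) {
    if bin_limit == 0 || bins.len() <= bin_limit {
        return;
    }
    let excess = bins.len() - bin_limit;
    // Remove the lowest `excess` bins and sum the counts they carried.
    let folded: u64 = bins.drain(..excess).map(|(_, n)| n).sum();
    // Fold the removed mass into the first kept bin, which is now at index 0.
    bins[0].1 += folded;
    // The invariant the suggested `Always` assertion would check after every mutation.
    debug_assert!(bins.len() <= bin_limit);
}
```

Because the cap is restored by a separate call rather than by the container itself, any new mutator that forgets to call it would leave the vector over the limit, which is the regression the live assertion is meant to catch.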
+ +## Investigation Log + +#### Which DDSketch (agent 4096 vs canonical 2048) is on ADP's live aggregation path +- **Examined**: `lib/ddsketch/src/lib.rs:46-56` (module layout + crate-root re-export); + `lib/ddsketch/build.rs:2-4,45` (`AGENT_DEFAULT_BIN_LIMIT = 4096`, `AGENT_DEFAULT_EPS = 1/128`); + `lib/ddsketch/src/agent/config.rs` (`Config`, generated `DDSKETCH_CONF_BIN_LIMIT`), + `agent/sketch.rs:255,319,358,829` (`trim_left(..., SKETCH_CONFIG.bin_limit)`); + `transforms/aggregate/mod.rs:7` (`use ddsketch::DDSketch`), `:740-750` (distribution build via + `DDSketch::default()` + `insert_n`); `bin/agent-data-plane/src/cli/run.rs:491-498` + (metrics pipeline => `dd_metrics_encode`), `encoders/datadog/metrics/mod.rs:5,467,841,1006`; + grep of `ddsketch::canonical` / `canonical::DDSketch` usage across `lib` and `bin`. +- **Found**: The crate root re-exports the **agent** implementation: + `pub use agent::{Bin, Bucket, DDSketch};` (`lib.rs:56`) — so `ddsketch::DDSketch` *is* the agent + sketch (bin_limit **4096**, eps 1/128). The aggregate transform imports `ddsketch::DDSketch` + (`mod.rs:7`) and uses it in the histogram->distribution conversion path (`mod.rs:743-750`). The + DD metrics encoder also uses `ddsketch::DDSketch` (`metrics/mod.rs:5`) for Histogram/Distribution + values. The **canonical** sketch (`ddsketch::canonical`, `max_num_bins`-based, + `CollapsingLowestDenseStore`, relative_accuracy 0.01) has **no non-test usage** in `lib` or + `bin` — it is library-only / not wired into any ADP topology component. +- **Not found**: Any live ADP component constructing `ddsketch::canonical::DDSketch`. +- **Conclusion**: RESOLVED. Only the **agent sketch (bin_limit 4096)** is on the live ADP path + (aggregate distribution build + DD metrics sketch encoding). The canonical sketch (2048) is not + reachable in production. The bin-count assertion should target the agent sketch's + `SKETCH_CONFIG.bin_limit == 4096` exclusively; canonical can be dropped from scope. 
diff --git a/test/antithesis/scratchbook/properties/ddsketch-no-nan-poison.md b/test/antithesis/scratchbook/properties/ddsketch-no-nan-poison.md new file mode 100644 index 00000000000..5c78274472e --- /dev/null +++ b/test/antithesis/scratchbook/properties/ddsketch-no-nan-poison.md @@ -0,0 +1,171 @@ +--- +slug: ddsketch-no-nan-poison +title: A NaN sample never silently poisons a sketch's sum/avg +type: Safety / Reachability +priority: High +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +status: assertion MISSING; CONFIRMED sketch boundary does NOT guard finiteness +--- + +# ddsketch-no-nan-poison — A NaN sample never silently poisons a sketch's sum/avg + +## Property (one sentence) +A single NaN (or other non-finite) sample must never silently poison a sketch's `sum`/`avg`; +for any finite input stream the sketch's `sum`/`avg` stay finite, and the sketch boundary +rejects or sanitizes non-finite values rather than absorbing them. + +## Origin +- SUT analysis §7 #10 (Wildcard): "NaN poisons a DDSketch (`agent/sketch.rs:188-206`): + `sum`/`avg` go NaN permanently; finiteness is guarded per-source (DSD codec), not at the + sketch boundary — fragile if a new producer is added." + +## Files / functions / lines (CONFIRMED) +- `lib/ddsketch/src/agent/sketch.rs` + - `adjust_basic_stats(v, n)` (188–206): **NO finiteness check.** + `self.sum += v * n as f64;` (198) → if `v` is NaN, `sum` becomes NaN permanently (NaN is + sticky under `+`). `self.avg += (v - self.avg)/count` (201) / the `n>1` branch (205) → + `avg` also goes NaN. `min`/`max` comparisons (189, 193) are all false for NaN, so a NaN + leaves min/max unchanged but silently corrupts sum/avg. 
+ - Entry points that call `adjust_basic_stats` with caller-supplied `v` and **no NaN reject**: + `insert(v)` (327–330), `insert_n(v,n)` (374–384, calls `adjust_basic_stats` at 380 for n>1 + and `insert` for n==1), `insert_many` (362–371), `insert_interpolate_bucket` (426, 440), + `insert_raw_bin` (493). + - `Config::key(NaN)` (config.rs 70–87): every comparison with NaN is false, so it falls to + `log_gamma(NaN).round_ties_even() as i32` = 0, `key = norm_bias`, clamped to `[1, MAX_KEY]` + → NaN gets a **valid bin** (count incremented) while sum/avg are poisoned: the sketch looks + populated but its sum/avg are NaN. `count` still increments (197), so `is_empty()` is false + and `sum()`/`avg()` return `Some(NaN)`. + - `quantile` (535): `.or(Some(f64::NAN))` can itself yield NaN as a fallback — distinct from + poisoning, but means a NaN out of `quantile` is not by itself proof of poisoning. +- ADP call site (the boundary in question): `transform_and_push_metric` (744–745): + `sketch.insert_n(sample.value.into_inner(), sample.weight.0 as u64)` — calls `insert_n` + **directly**, with no finiteness guard at this boundary. +- Per-source guard that exists today (the *only* current protection): + - DSD codec drops non-finite float values (SUT analysis §8 "drop non-finite floats in codec"; + §7 #7 `non_finite_metric_values_are_silently_dropped`). So in the current DSD pipeline NaN + is filtered upstream — but this is **not** enforced at the sketch boundary, so any new + producer (OTLP path, replay, future sources) that reaches `insert_n` bypasses the guard. +- `stele` diff comparison (`metrics.rs` 175–182) compares sketch `sum`/`avg` with + `approx_eq_ratio` — a NaN sum makes any comparison false, so poisoning would surface as a + diff-test mismatch *if* the harness happened to feed a NaN; the deterministic happy-path + workload does not. + +## Failure scenario (Antithesis angle) +1. 
A producer reaches the sketch boundary with a NaN/±Inf sample value (a new source, a replay + record with a corrupt value, or a regression that removes the codec guard). `insert_n` + absorbs it; `sum`/`avg` go NaN for the lifetime of that sketch and propagate through + `merge` (551–552) into every sketch it touches and downstream into the emitted distribution + → permanently wrong customer data, silently. +2. `weight` non-finite is not possible (`u64`), but `value.into_inner()` is an `f64` with no + guarantee of finiteness at this call site. + +## Observations +- The cheapest robust assertion is **at the sketch boundary** (inside `adjust_basic_stats` or at + the top of `insert`/`insert_n`/`insert_many`), because that is the single chokepoint and is + exactly where the missing guard lives. SUT-side instrumentation strongly wins; workload-side + can only observe a NaN sum after it has already propagated downstream. +- Two framings: + - **Outcome invariant:** `assert_always(self.sum.is_finite() && self.avg.is_finite(), + "sketch sum/avg finite")` after each mutation — for a workload that only injects finite + values, this catches any internal NaN production; with NaN injection it documents the + poisoning. + - **Boundary invariant:** if a finiteness guard is added, `assert_unreachable("non-finite + value reached DDSketch::adjust_basic_stats")` to prove NaN never gets absorbed. + +## Config dependencies +- None directly. Reachability depends on which producers feed the aggregate sketch path + (DSD codec currently filters; OTLP/replay/future sources may not). + +## Suggested assertion +- Primary: `assert_always(v.is_finite(), "value reaching DDSketch is finite")` at the top of + `adjust_basic_stats` (covers all insert/merge-feeding paths) — OR, if the boundary is changed + to reject, `assert_unreachable` on the absorbed-NaN path. 
+- Secondary outcome check: `assert_always(self.sum.is_finite(), "DDSketch.sum finite")` after + mutations, as a backstop for internally produced non-finite (e.g. overflow → Inf). +- `Reachable("non-finite sample offered at sketch boundary")` only if the workload deliberately + injects NaN past the codec — otherwise keep it `Unreachable`-style to assert the guard holds. + +## Open questions +- Should the fix **reject/skip** the NaN at the sketch boundary (matching the codec's drop + policy) or **clamp**? Rejecting keeps count/sum consistent; the Agent's policy here should be + confirmed against the diff baseline (ties into `aggregate-matches-agent`). +- `quantile`'s `.or(Some(f64::NAN))` fallback (535): is a NaN quantile result distinguishable + from a poisoned-sketch NaN downstream? The assertion should target sum/avg, not quantile, to + avoid conflating the two. +- Is `+Inf`/`-Inf` (e.g. from an overflowing `sum`) in scope? `v.is_finite()` covers both NaN + and Inf; confirm the desired policy treats them identically. + +### Investigation Log + +#### Does any non-DSD producer reach the DDSketch insert boundary without the codec FloatIter finiteness filter? +- **Examined:** the finiteness filter (`lib/saluki-io/src/deser/codec/dogstatsd/metric.rs:254, + 273-303` — `FloatIter` skips non-finite with `value.is_finite()` at :299 and a debug log at :301); + every agent-DDSketch insert caller in the tree (`grep` for `insert`/`insert_n`/`insert_many`/ + `insert_interpolate_buckets`/`add_n`); and the ADP topology wiring in + `bin/agent-data-plane/src/cli/run.rs:462-499, 593-686, 745-755`. 
+ Specifically traced: (a) OTLP — `lib/saluki-components/src/sources/otlp/metrics/translator.rs`; + (b) self-telemetry — `lib/saluki-core/src/observability/metrics/mod.rs:299-310`, + `processor.rs`; (c) the aggregate histogram→distribution path — + `lib/saluki-components/src/transforms/aggregate/mod.rs:737-762`; (d) checks_ipc — + `lib/saluki-components/src/sources/checks_ipc/mod.rs:185-204`; (e) the datadog metrics encoder — + `lib/saluki-components/src/encoders/datadog/metrics/mod.rs:1043-1061`. +- **Found:** + - **Confirmed: the sketch boundary itself has no finiteness guard.** `agent/sketch.rs` `insert` + (:327), `insert_n` (:374), `insert_many` (:362), `insert_interpolate_bucket` (:387) all call + `adjust_basic_stats` (:188) which does `self.sum += v * n` unconditionally — NaN poisons + sum/avg permanently. `FloatIter` (codec) is the *only* finiteness filter in the metric path. + - **Aggregate transform `insert_n` path (the flagged mod.rs:745) is DSD-ONLY → CLOSED.** The + aggregate transform (`dsd_agg`) is wired **exclusively** into the DSD pipeline: + `dsd_in → dsd_enrich → dsd_prefix_filter → dsd_tag_filterlist → dsd_agg → dsd_post_agg_filter → + metrics_enrich` (run.rs:664-679). DSD metrics pass through `FloatIter` at decode time, so the + `Histogram` samples reaching `aggregate/mod.rs:745` (`sketch.insert_n(sample.value...)`) are + already finite. **checks_ipc and OTLP metrics join the topology at `metrics_enrich` + (run.rs:469/499 and run.rs:753), which is DOWNSTREAM of `dsd_agg`** — they never enter the + aggregate transform. So no non-DSD producer reaches `insert_n` *in the aggregate transform*. + - **OTLP number path → CLOSED.** `get_number_data_point_value` (translator.rs:1366) feeds + `is_skippable` (`value.is_nan() || value.is_infinite()`, :1374-1377) in both + `map_number_metrics` (:726) and `map_number_monotonic_metrics` (:754); non-finite values are + skipped with a warn/debug log. Gauges/counters never carry NaN downstream. 
+ - **OTLP histogram/sketch path → effectively CLOSED (no NaN poisoning).** OTLP histograms become + sketches via two routes, neither of which feeds a raw NaN into `adjust_basic_stats`: + (i) exponential histograms build a `Dogsketch` proto and use `DDSketch::try_from` + (`build_agent_sketch_from_key_counts`, :314-351) — never calls `insert`/`adjust_basic_stats`; + (ii) explicit-bounds histograms call `qa.insert_interpolate_buckets(buckets)` (:889) where bucket + `upper_limit` comes from the payload's `explicit_bounds`, which is **not** finiteness-checked. + However, `insert_interpolate_bucket` (sketch.rs:387) only ever passes + `SKETCH_CONFIG.bin_lower_bound(key)` (a finite reconstructed value) into `adjust_basic_stats` + (:426, :440), never the raw bound. A NaN bound makes `distance`/`fkn` NaN → `fkn as u64 == 0` → + no per-key insert; the remainder branch uses a finite `bin_lower_bound`. So a NaN explicit bound + can distort *bucketing* but does not poison sum/avg with NaN. (`insert_interpolate_buckets` at + :465-481 handles ±Inf explicitly but not NaN — a latent robustness gap, not a poisoning path.) + - **LIVE non-DSD NaN→sketch path FOUND: checks_ipc Histogram → datadog metrics encoder.** + - `checks_ipc/mod.rs:195`: `MetricType::Histogram => Metric::histogram(context, (timestamp, + value))` where `value` is the raw f64 from an external Python check over IPC. **No `is_finite` + / `is_nan` check anywhere in this decode** (mod.rs:185-204). A check emitting NaN produces a + `Histogram` metric carrying NaN. + - That metric flows `checks_ipc_in.metrics → metrics_enrich → dd_metrics_encode` (run.rs:469, + 499) — i.e. it does NOT pass through the DSD codec FloatIter and does NOT enter the aggregate + transform. + - The encoder converts `MetricValues::Histogram` to a sketch by calling **`ddsketch.insert_n( + sample.value.into_inner(), sample.weight...)`** at `encoders/datadog/metrics/mod.rs:1054` + (inside `encode_sketch_metric`, the `Histogram` arm at :1049-1058). 
This is a direct + `insert_n` of the raw sample value with no finiteness guard → `adjust_basic_stats` → + `sum += NaN`. The emitted Datadog sketch payload then carries a NaN sum/avg silently. + - `distribution_sampled_fallible` (value/mod.rs:312, `insert_n`) is called ONLY from the DSD codec + (metric.rs:267, fed by `FloatIter`) → DSD-only, safe. +- **Not found:** No finiteness filter on the checks_ipc value path; none at the sketch boundary; + none in the encoder's Histogram→sketch conversion. No code path where the *aggregate transform* + receives non-DSD input. +- **Conclusion:** RESOLVED, and the hazard is **LIVE** (not closed on all paths). The specifically + flagged aggregate `insert_n` path (aggregate/mod.rs:745) is closed because that transform is + DSD-only. OTLP number and histogram paths do not poison sum/avg. **But there is a live non-DSD + NaN-poisoning path: a Python check emitting a Histogram metric with a NaN value via checks_ipc + (checks_ipc/mod.rs:195, no finiteness check) reaches `DDSketch::insert_n` in the Datadog metrics + encoder (encoders/datadog/metrics/mod.rs:1054), bypassing both the DSD FloatIter and the aggregate + transform.** Therefore ddsketch-no-nan-poison and the ghost-metric hazard remain LIVE. The robust + fix is a guard at the sketch boundary (`adjust_basic_stats`/`insert*`), since the per-producer + filter (FloatIter) demonstrably does not cover the checks_ipc→encoder path. The Antithesis angle: + drive a check (or checks_ipc IPC input) that emits a NaN histogram value and assert sketch sum/avg + stay finite at the encoder boundary. 
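The sticky-NaN behavior and the boundary guard discussed in this file can be sketched in isolation. This is a minimal illustration, not the saluki `DDSketch`: `BasicStats`, `absorb`, and `absorb_finite` are hypothetical names, and the update rule is a simplified assumption modeled on the description of `adjust_basic_stats` above.

```rust
/// Toy stand-in for a sketch's running statistics. This is NOT the saluki
/// `DDSketch`; the fields and update rule are simplified assumptions.
#[derive(Debug, Default, Clone, Copy)]
pub struct BasicStats {
    pub count: u64,
    pub sum: f64,
    pub avg: f64,
}

impl BasicStats {
    /// Unguarded update, shaped like the behavior the notes describe:
    /// a NaN `v` turns `sum` and `avg` into NaN while `count` keeps growing.
    pub fn absorb(&mut self, v: f64, n: u64) {
        if n == 0 {
            return;
        }
        self.count += n;
        self.sum += v * n as f64;
        // Incremental weighted mean over the new total count.
        self.avg += (v - self.avg) * n as f64 / self.count as f64;
    }

    /// Guarded update (hypothetical): skips non-finite samples and reports
    /// whether the sample passed the finiteness check.
    pub fn absorb_finite(&mut self, v: f64, n: u64) -> bool {
        if !v.is_finite() {
            return false;
        }
        self.absorb(v, n);
        true
    }
}
```

With the unguarded path, one NaN between two finite samples leaves `count == 3` but `sum` and `avg` NaN. The guarded path drops the NaN and keeps all three statistics consistent, which matches the reject/skip option raised in the open questions.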
diff --git a/test/antithesis/scratchbook/properties/ddsketch-relative-error-bound.md b/test/antithesis/scratchbook/properties/ddsketch-relative-error-bound.md new file mode 100644 index 00000000000..aca5871ea1d --- /dev/null +++ b/test/antithesis/scratchbook/properties/ddsketch-relative-error-bound.md @@ -0,0 +1,128 @@ +--- +slug: ddsketch-relative-error-bound +title: DDSketch quantiles within relative-error bound; merges associative/commutative +type: Safety +priority: Low +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +status: assertion MISSING +--- + +# ddsketch-relative-error-bound — Quantile accuracy + merge associativity/commutativity + +## Property (one sentence) +For values within the non-collapsed representable range, quantile queries are within the +configured relative error (eps ≈ 0.78%, gamma = 1 + 2·eps), and merging sketches is +associative and commutative (order of merges does not change the result). + +## Origin +- SUT analysis §5 safety #4: "DDSketch relative-error guarantee: eps=1/128 (~0.78%) … merge + associative/commutative." + +## Files / functions / lines (CONFIRMED) +- `lib/ddsketch/build.rs` (3, 14–38): `eps = 1/128`; `eps *= 2; gamma_v = 1 + eps; + gamma_ln = ln_1p(eps)`; `norm_min`, `norm_bias` derived; `assert!(norm_min <= min_value)`. +- `lib/ddsketch/src/agent/config.rs` + - `key(v)` (70–87): `log_gamma(v).round_ties_even() + norm_bias`, **clamped to `[1, MAX_KEY]`** + (86). Values with `|v| < norm_min` map to key 0 (75–77); values above `gamma^MAX_KEY` + saturate at `MAX_KEY` (i.e. INF bucket). **Accuracy is NOT guaranteed at these extremes** — + this is the caveat the property must scope around. + - `bin_lower_bound(k)` (47–62): inverse mapping; `gamma_v.powf(k - norm_bias)`. 
+- `lib/ddsketch/src/agent/sketch.rs` + - `quantile(q)` (498–536): rank via `rank(count,q) = round_ties_even(q*(count-1))` (668–671); + interpolates `v_low*weight + v_high*(1-weight)` with `v_high = v_low * gamma_v` (522–523); + result `clamp(self.min, self.max)` (535); empty → `None`; `q<=0 → min`, `q>=1 → max`. + - `merge(other)` (542–582): merges basic stats (count/min/max/sum/avg) then bins, then + `trim_left`. Bin merge is order-independent on keys; `avg` uses an incremental formula + (552) that is **not** exactly order-independent in floating point. +- Canonical impl (`canonical/mapping/logarithmic.rs`): `relative_accuracy() = (gamma-1)/(gamma+1)` + (114–115); `index = ceil(log(value)/log(gamma))` (9, 97). `with_relative_accuracy` rejects + accuracy outside `(0,1)` (40–47). Separate from the agent sketch but same family. + +## Failure scenario (Antithesis angle) +The diff-test (`stele` 175–182) compares sketches on min/max/avg/sum (ratio 1e-8) + exact +count + exact bin_count, on a deterministic clean run. Antithesis adds: +1. **Merge order under interleaving:** windows flushed/merged in different orders (delayed + flush, backpressure reordering) could expose non-associativity. The bin merge is exact on + counts, but `avg` (incremental, 552) and `sum` accumulate floating-point error that depends + on merge order → a `quantile`/avg drift the diff test (single fixed order) never sees. +2. **Quantile error at the boundary of the representable range:** values near `norm_min` (1e-9) + or above the max key collapse to key 0 / INF; quantiles there can exceed the 0.78% bound + (the documented caveat, sketch.rs:48-51 and canonical "extremes"). The property must assert + the bound only for in-range values and `Sometimes`-observe out-of-range handling. +3. **Collapsed sketch:** once `trim_left`/`CollapsingLowest` collapses low bins, the relative + error guarantee for low quantiles is intentionally void (`is_collapsed`, + collapsing_lowest.rs:50-51). 
Assertion must exclude collapsed-low-quantile queries. + +## Observations +- Two checkable sub-properties: + - (a) **Accuracy:** for an inserted value `v` within range, `quantile(q)` for the rank of `v` + is within `gamma`-relative error of `v`: `|q_est - v| <= eps_rel * |v|` where + `eps_rel = (gamma_v - 1)/(gamma_v + 1) ≈ 1/128`. + - (b) **Merge invariance:** `merge(A, merge(B,C)) ≈ merge(merge(A,B), C)` and + `merge(A,B) ≈ merge(B,A)` within ratio, on bins exactly and on sum/avg within a small + floating tolerance. +- Best validated SUT-side with a known input set so the expected rank value is computable; + workload-side cannot reconstruct per-sample ground truth from emitted aggregates. + +## Config dependencies +- eps/gamma/bin_limit are compile-time (build.rs). `aggregate_window_duration` controls how + many samples land in one sketch before flush/merge (affects collapse likelihood). + +## Suggested assertion +- `assert_always(quantile_within_relative_error, "DDSketch quantile within eps for in-range value")` + evaluated in a SUT-side test harness over in-range inputs (exclude key-0 / INF / collapsed-low). +- `assert_always(merge_result_equal_within_ratio, "DDSketch merge is order-independent")` over a + set of sketches merged in two different orders. +- `Sometimes(value_out_of_representable_range)` to confirm the extreme-value carve-out is exercised + (and to document that accuracy is not claimed there). + +## Open questions +- What floating tolerance for `avg`/`sum` under reordered merges is acceptable vs the 1e-8 the + diff test uses? The incremental `avg` (552) and `sum +=` (551) are order-sensitive in f64; + need a principled epsilon to avoid false positives. +(All prior open questions resolved — see Investigation Log below.) + +## Investigation Log + +#### Which sketch on the live path, and does ADP call `DDSketch::quantile` at runtime? 
+- **Examined**: `lib/ddsketch/src/lib.rs:56` (crate-root = agent sketch); + `lib/ddsketch/src/agent/sketch.rs:498` (`pub fn quantile`); `lib/ddsketch/build.rs:2-4` + (eps = 1/128, build.rs doubles it: `eps *= 2.0` then `gamma_v = 1+eps`, lines 19-20); + `lib/ddsketch/src/canonical/mapping/fixed.rs:38` (`RELATIVE_ACCURACY = 0.01`); + `transforms/aggregate/mod.rs:735-799` (histogram statistics emit) and `config.rs:58-69` + (`value_from_histogram`); `lib/saluki-core/src/data_model/event/metric/value/histogram.rs:166-197` + (`HistogramSummary::quantile`); grep of `.quantile(` across `lib`+`bin` excluding ddsketch internals; + `bin/agent-data-plane/src/cli/run.rs:491-498` (live metrics pipeline -> `dd_metrics_encode`); + `encoders/datadog/metrics/mod.rs:1006-1150` (`encode_sketch_metric` / sketch serialization); + `destinations/prometheus/mod.rs:343-348` (the one `sketch.quantile(q)` runtime call site). +- **Found (a) — sketch type**: Live aggregation uses the **agent** sketch (`ddsketch::DDSketch`, + eps 1/128). The canonical sketch (`fixed.rs` RELATIVE_ACCURACY 0.01, 2048 bins) is not wired into + any ADP component (confirmed in companion `ddsketch-bin-count-bounded` log). So the accuracy + target is the agent sketch's fixed relative accuracy, not the canonical 0.01. +- **Found (b) — quantile NOT queried on the live path**: ADP does **not** call + `DDSketch::quantile` at runtime on the production metrics path. Two distinct percentile paths + exist, neither using `DDSketch::quantile` live: + (1) Histogram-mode percentiles in aggregate go through `HistogramStatistic::value_from_histogram` + (`config.rs:67`) -> `summary.quantile(q)`, which is `HistogramSummary::quantile` + (`histogram.rs:172-197`) operating on **raw sorted samples** — a separate structure, NOT a + DDSketch. 
(2) Distribution-mode builds a `DDSketch` (`mod.rs:743-750`) and the DD metrics encoder + serializes it via `encode_sketch_metric`, writing `sketch.bins()` keys/counts plus + count/min/max/avg/sum (`metrics/mod.rs:1135-1150`) — it ships the raw bins over the wire and never + queries a quantile. Quantile estimation happens **server-side at Datadog** after ingestion. + The only runtime `DDSketch::quantile` caller in the codebase is the **prometheus** destination + (`prometheus/mod.rs:346`), which is internal-telemetry/scrape only and not in the primary + metrics topology. +- **Not found**: Any live ADP metrics-path call to `DDSketch::quantile`; any production use of the + canonical sketch. +- **Conclusion**: RESOLVED (with a framing consequence). The accuracy assertion's "live" target is + the **agent sketch (eps 1/128)**, and ADP **does not query DDSketch quantiles at runtime on the + customer metrics path** — it ships raw bins to Datadog. Therefore an `Always(quantile within eps)` + assertion cannot be anchored to a production runtime call; it must be an **SUT-side test-harness** + assertion over the agent sketch in isolation (the existing unit/proptest level), OR retargeted to + the property that actually matters in production: *bins/summary are serialized faithfully and bin + count stays capped* (covered by `ddsketch-bin-count-bounded`). The histogram-percentile path that + IS computed in-process uses raw-sample `HistogramSummary::quantile`, which is exact (no DDSketch + relative-error bound applies). Property framing should be narrowed: the DDSketch relative-error + guarantee is a library invariant, not a live ADP runtime invariant. 
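As a worked check of the numbers used in this file, the sketch below derives `gamma` and the relative accuracy from `eps = 1/128` (doubled, as the notes say `build.rs` does), and shows why a merge-order comparison on `sum` needs a tolerance. All function names are hypothetical, and `within_ratio` is an assumed tolerance rule, not the `stele` `approx_eq_ratio` implementation.

```rust
/// gamma derived from the base eps in the notes (1/128), doubled first.
pub fn gamma() -> f64 {
    let eps = 2.0 * (1.0 / 128.0);
    1.0 + eps
}

/// Relative accuracy of a logarithmic mapping: (gamma - 1) / (gamma + 1).
pub fn relative_accuracy(gamma: f64) -> f64 {
    (gamma - 1.0) / (gamma + 1.0)
}

/// Sub-property (a): |estimate - truth| <= rel * |truth|.
pub fn quantile_within_bound(estimate: f64, truth: f64, rel: f64) -> bool {
    (estimate - truth).abs() <= rel * truth.abs()
}

/// Hypothetical tolerance rule for sub-property (b): the difference is
/// compared against the larger magnitude scaled by `ratio`.
pub fn within_ratio(a: f64, b: f64, ratio: f64) -> bool {
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= ratio * scale
}

/// Left-to-right f64 accumulation, standing in for repeated `sum +=` merges.
pub fn sum_in_order(parts: &[f64]) -> f64 {
    parts.iter().fold(0.0, |acc, x| acc + x)
}
```

Under these assumptions `relative_accuracy(gamma())` comes out at 1/129, slightly tighter than 1/128. Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` gives two different `f64` values that still agree within a 1e-8 ratio, which is the kind of drift a reordered-merge assertion has to tolerate.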
diff --git a/test/antithesis/scratchbook/properties/disk-persisted-retry-survives-restart.md b/test/antithesis/scratchbook/properties/disk-persisted-retry-survives-restart.md new file mode 100644 index 00000000000..0798d2a02ef --- /dev/null +++ b/test/antithesis/scratchbook/properties/disk-persisted-retry-survives-restart.md @@ -0,0 +1,175 @@ +--- +slug: disk-persisted-retry-survives-restart +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Liveness (with safety sub-clauses: no-duplication, poison-drop) +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Disk-persisted retry transactions survive process kill+restart and are eventually delivered exactly once + +## Origin +SUT analysis §2 (disk persistence), §3 (two disk-backed subsystems), §6 gap #3 +("disk-persisted retry queue recovery never tested across a real kill+restart"), §9 +open question (persisted.rs disk-full/partial-write/corrupt across crash). No Antithesis +assertion exists. + +## What the code does + +### Persistence enable + silent fallback +`lib/saluki-components/src/common/datadog/io.rs:391-409`: a `RetryQueue` is created; if +`config.retry().storage_max_size_bytes() > 0`, `with_disk_persistence(...)` is awaited. On +**init failure it logs and silently falls back to an in-memory-only `RetryQueue`** (~405-408): +"Failed to initialize disk persistence ... Transactions will not be persisted." This is a +durability *downgrade* with no hard failure — the operator believes persistence is on but it isn't. 
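The downgrade described here is easier to reason about with a toy model. The sketch below is an assumption-laden illustration, not the `io.rs` code: `QueueMode`, `ToyRetryQueue`, `init_disk`, `build_queue`, and `persistence_downgraded` are all hypothetical, and the boolean `storage_ok` stands in for whatever makes real initialization fail.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum QueueMode {
    DiskBacked,
    InMemoryOnly,
}

/// Toy queue that only records how it was built.
#[derive(Debug)]
pub struct ToyRetryQueue {
    pub mode: QueueMode,
    pub max_bytes: u64,
}

/// Hypothetical disk init: fails when the storage location is unusable.
pub fn init_disk(storage_ok: bool) -> Result<(), String> {
    if storage_ok {
        Ok(())
    } else {
        Err("storage path not writable".to_string())
    }
}

/// Mirrors the shape of the fallback: log the failure, keep running in memory.
pub fn build_queue(storage_max_size_bytes: u64, queue_max_size_bytes: u64, storage_ok: bool) -> ToyRetryQueue {
    let mode = if storage_max_size_bytes == 0 {
        // Persistence was never requested, so in-memory is the intended mode.
        QueueMode::InMemoryOnly
    } else {
        match init_disk(storage_ok) {
            Ok(()) => QueueMode::DiskBacked,
            Err(e) => {
                eprintln!("disk persistence init failed: {e}; falling back to memory");
                QueueMode::InMemoryOnly
            }
        }
    };
    // The in-memory byte cap is the same in both modes.
    ToyRetryQueue { mode, max_bytes: queue_max_size_bytes }
}

/// The condition a persistence-enabled workload would want to be unreachable.
pub fn persistence_downgraded(storage_max_size_bytes: u64, q: &ToyRetryQueue) -> bool {
    storage_max_size_bytes > 0 && q.mode == QueueMode::InMemoryOnly
}
```

In this model the only way to tell a configured-but-degraded queue from a healthy one is to inspect `mode`. A run that never checks it can pass while exercising in-memory mode only.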
+ +### Flush to disk on shutdown +`io.rs:488-503`: on endpoint-task shutdown, `pending_txns.flush().await` is called; `flush` +(`io.rs:781-800`) pushes high-priority into low-priority then `self.low_priority.flush()`, which +persists outstanding transactions to disk **only if disk persistence is enabled** (`io.rs:769-776` +doc: "If disk persistence isn't enabled, all pending transactions will be dropped"). A flush error +logs "Events may be permanently lost" (~500-501). + +### Exactly-once on consume (delete-before-return) +`lib/saluki-io/src/net/util/retry/queue/persisted.rs` `try_deserialize_entry` (~373-397): after +deserializing, it **deletes the file from disk before returning** (~391-394) "so that we don't risk +sending duplicates." So a successful pop removes the on-disk copy first — a crash *after* delete but +*before* send loses that one txn (at-most-once for the in-flight item), while a crash *before* delete +keeps it (at-least-once). The delete-before-return biases toward no-duplication. + +### Poison/corrupt entry handling (drop, don't loop forever) +- `pop` (~206-243): on `try_deserialize_entry` `Err(e)` (corrupt/unreadable), it logs + "Permanently dropping persisted entry that could not be consumed", decrements `total_on_disk_bytes`, + increments `entries_dropped`, and `continue`s — does NOT retry the poison entry forever, does NOT + abort recovery (~227-241). +- `remove_until_available_space` eviction path (~304-323): same poison handling during eviction. +- `try_deserialize_entry` deserialize failure (~373-389): attempts to `remove_file` the corrupt + entry so it doesn't accumulate, tolerates removal failure. +- `refresh_entry_state` (~245-273): unrecognized files are warned and skipped, not fatal. + +## Failure scenario (Antithesis) +1. Enable disk persistence (`forwarder_retry_queue_storage_max_size > 0`). +2. Drive a known set of transactions; induce an intake outage so they land in the retry queue. +3. 
SIGKILL the ADP process mid-flow (the s6 container supervisor restarts it). +4. Restore healthy intake. +5. Expectation: every persisted retryable transaction is eventually delivered, **exactly once** + (no loss, no duplication). Separately: inject a corrupted on-disk entry and assert recovery + continues and the corrupt entry is dropped (not retried forever, not crashing recovery). + +## Key observations +- "Exactly once" is approximate at the crash boundary: delete-before-return means at most one + in-flight txn can be lost on a crash in the delete→send gap, and at-least-once if crash precedes + delete. The clean claim is **no systemic loss and no duplication of the persisted backlog**; the + single in-flight item is a known narrow window. +- SIGKILL (not graceful) skips the shutdown flush (`io.rs:488-503`), so only transactions *already + written to disk* survive; high-priority in-memory txns not yet persisted are lost. The graceful + path (SIGTERM/30s) flushes them to disk. The property must distinguish kill vs graceful. +- Retry-queue IDs are stable across API-key rotation (`io.rs:514-533`) so a restart with a rotated + key still finds and retries the same persisted backlog — relevant if the workload rotates keys. + +## Config deps +- `forwarder_retry_queue_storage_max_size` (`storage_max_size_bytes`) > 0 to enable persistence. +- `storage_path`, `storage_max_disk_ratio` — disk-full eviction behavior + (`remove_until_available_space`, `on_disk_bytes_limit`). + +## Suggested assertion (MISSING — net-new) +- **Sometimes(persisted-backlog-fully-recovered)**: at least once, after a kill+restart with + persistence enabled and intake restored, the set of transactions delivered post-restart covers the + persisted backlog with no duplicates (reconcile workload input vs mock-intake received, dedup by + transaction identity). Liveness + no-dup. 
+- **AlwaysOrUnreachable(poison-dropped)**: whenever a corrupt on-disk entry is encountered, it is + dropped (entries_dropped increments) and recovery proceeds — never an infinite retry of the same + entry and never a recovery abort. Anchor at `persisted.rs:227-241` / `304-323`. +- **Reachable(disk-persistence-actually-active)**: confirm persistence init succeeded (the + silent-fallback at `io.rs:405-408` did NOT fire) — otherwise the whole property is vacuously testing + in-memory mode. Treat the fallback as an Unreachable in the persistence-enabled workload, OR detect + it and fail the run setup. + +## SUT-side instrumentation needs +- SDK `assert_unreachable` (or workload detection) at the silent-fallback branch `io.rs:405` when + persistence is configured — to catch the durability downgrade that would otherwise make the test + vacuous. +- SDK `assert_reachable` at the poison-drop `continue` (`persisted.rs:238`) gated on the + corrupt-entry test variant. +- Primary check is workload-side reconciliation against the mock intake with transaction-identity + dedup; needs a deterministic countable input set and a mock intake that records received IDs. + +## Open questions +- **Ordering after restart**: `refresh_entry_state` sorts by timestamp (~268) but filename timestamp + has second granularity + nonce; confirm restart preserves enough ordering that the + bias-to-freshest/oldest-drop semantics aren't inverted across a restart (affects which txns survive + overflow, not raw loss). +- **The narrow at-most-once window** (delete-before-return then crash before send): is the single + in-flight txn loss acceptable per the headline guarantee, or should the assertion tolerate it? Sets + whether the reconcile allows a 1-txn slack. 
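The workload-side reconciliation and the 1-txn slack question can be made concrete with a small sketch. It assumes transactions carry a stable `u64` identity and that the mock intake records every received ID; `Reconcile`, `reconcile`, and `backlog_recovered` are hypothetical names, not part of saluki or the Antithesis SDK.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Result of comparing the persisted backlog with what the mock intake saw.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Reconcile {
    pub missing: Vec<u64>,
    pub duplicated: Vec<u64>,
    pub unexpected: Vec<u64>,
}

/// Compare expected transaction IDs against received IDs (with repeats).
pub fn reconcile(persisted: &[u64], received: &[u64]) -> Reconcile {
    let expected: BTreeSet<u64> = persisted.iter().copied().collect();
    let mut seen: BTreeMap<u64, usize> = BTreeMap::new();
    for id in received {
        *seen.entry(*id).or_insert(0) += 1;
    }
    let missing = expected
        .iter()
        .copied()
        .filter(|id| !seen.contains_key(id))
        .collect();
    let duplicated = seen
        .iter()
        .filter(|(id, n)| expected.contains(*id) && **n > 1)
        .map(|(id, _)| *id)
        .collect();
    let unexpected = seen
        .keys()
        .copied()
        .filter(|id| !expected.contains(id))
        .collect();
    Reconcile { missing, duplicated, unexpected }
}

/// No duplicates at all, and at most `loss_slack` missing transactions.
pub fn backlog_recovered(r: &Reconcile, loss_slack: usize) -> bool {
    r.duplicated.is_empty() && r.missing.len() <= loss_slack
}
```

Setting `loss_slack` to 1 encodes the "tolerate the single in-flight item" answer to the open question, and 0 encodes the strict reading. Duplicates fail the check either way.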
+ +### Investigation Log + +#### Durability-downgrade visibility + torn-write classification + recovery wedging (2026-05-28) + +**Examined:** +- `lib/saluki-components/src/common/datadog/io.rs:391-410` (RetryQueue create + `with_disk_persistence` + + silent fallback) and grep of io.rs for `persistence`/`fallback`/metric near the branch. +- `lib/saluki-io/src/net/util/retry/queue/persisted.rs`: `try_from_path` (:30-37), + `decode_timestamped_filename` (:410-427), `generate_timestamped_filename` (:400-408), `push` + (:164-199), `pop` (:206-243), `refresh_entry_state` (:245-273), `try_deserialize_entry` (:354-398), + `remove_until_available_space` poison handling (:304-330), and tests + `pop_skips_corrupt_entry`/`pop_returns_none_when_all_entries_corrupt` (:701-795). + +**Found — (a) durability downgrade is surfaced ONLY as an `error!` log, no metric:** +- On disk-persistence init failure, io.rs:405-408 runs `.unwrap_or_else(|e| { error!(endpoint_url, + error = %e, "Failed to initialize disk persistence for retry queue. Transactions will not be + persisted."); RetryQueue::new(queue_id, config.retry().queue_max_size_bytes()) })`. The only + observable signal is that one `error!` log line; there is **no metric/gauge/counter** emitted to + distinguish "persistence active" from "fell back to in-memory". Grep of io.rs for + persistence/fallback finds only the doc comments (:393, :773-775) and this log (:406). So a + workload cannot detect the downgrade via telemetry — it must scrape the log or, better, treat the + fallback branch as an `assert_unreachable` when persistence is configured (as the file already + recommends). Confirmed the downgrade is effectively silent at the metrics layer. +- **(a, cont.) the in-memory byte cap STILL holds after fallback:** the fallback constructs + `RetryQueue::new(queue_id, config.retry().queue_max_size_bytes())` (io.rs:407) — identical + in-memory cap to the non-persisted path (io.rs:391). 
So in degraded mode the queue is a plain + capped in-memory queue with drop-oldest; the byte-cap invariant is preserved (it just becomes the + drop-not-spill branch). No unbounded growth from the fallback. + +**Found — torn/partial write is classified as CORRUPT (drop+warn), and does NOT wedge recovery:** +- `push` writes via `tokio::fs::write(&entry_path, &serialized)` (persisted.rs:184) directly to the + final `retry--.json` path. NOTE: there is NO temp-file + atomic rename despite the + stale comment at :165 ("Serialize the entry to a temporary file"). So a SIGKILL mid-write leaves a + file with a **valid filename** but **truncated/partial JSON content**. +- On restart, `refresh_entry_state` (:245-273) scans the dir and calls `PersistedEntry::try_from_path` + (:253), which validates ONLY the filename via `decode_timestamped_filename` (:31, :410-427) — it + does NOT read or validate content. A torn write has a well-formed name, so it is accepted into + `entries` (not skipped as "unrecognized"). Unrecognized files (bad name) are warned and skipped + (:255-262) and the scan `continue`s past them. +- When `pop` reaches the torn entry, `try_deserialize_entry` reads the bytes and `serde_json::from_slice` + fails (:373-375) → best-effort `remove_file` of the corrupt file (:378-384) → returns `Err` + (:386-387). `pop` matches that `Err` (:227-240): logs `warn!` "Permanently dropping persisted entry + that could not be consumed", decrements `total_on_disk_bytes`, `entries_dropped += 1`, and + `continue`s the loop to the next entry (:230-239). So a torn write = **corrupt → dropped (with + warn), counted in `entries_dropped`** — NOT treated as unrecognized-skip. +- A single bad file does **NOT** wedge recovery: `pop`'s `loop` advances to the next entry on every + corrupt/poison hit (no infinite retry of the same entry — explicit comment at :228-229), and the + eviction path `remove_until_available_space` has the same poison handling (:313-322). 
The "all + entries corrupt" case returns `Ok(None)` cleanly (test at :774-795). Recovery proceeds past any + number of bad files. +- Edge case: the `Ok(None)` branch in `pop` (:221-225, file vanished mid-recovery / `NotFound` at + :358-364) triggers a `refresh_entry_state` + retry — also non-wedging, since the missing file is + simply dropped from the rescanned set. + +**Not found:** No metric for the persistence-fallback downgrade. No atomic write/rename or fsync in +`push` (torn writes are possible and are handled at read time, not prevented at write time). No code +path where a corrupt/torn file aborts recovery or is retried indefinitely. + +**Conclusion (RESOLVED):** (a) The durability downgrade on disk-init failure is surfaced ONLY via a +single `error!` log — no metric — so the workload must treat the fallback branch (io.rs:405) as an +`assert_unreachable` (or log-scrape) to avoid a vacuous in-memory-mode test; the in-memory byte cap +still holds after fallback. (b) A torn/partial write across a kill is classified as a **corrupt +entry**: it is dropped with a `warn!` and `entries_dropped` increments, losing just that one +transaction; it does **not** wedge recovery — both the `pop` scan and the eviction scan continue past +any number of corrupt/unrecognized files. This validates the proposed `AlwaysOrUnreachable(poison- +dropped)` assertion and confirms the "torn write loses one txn, not the backlog" framing. (Caveat for +the workload: because writes are non-atomic, the corrupt-entry test variant can be produced naturally +by SIGKILL mid-write, not only by injecting a hand-crafted corrupt file.) 
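The non-wedging property of the `pop` loop can be illustrated with a toy queue. This is a sketch under loud assumptions: entries are raw byte vectors in memory rather than files, a valid entry is just a decimal `u64` (the real entries are JSON), and `ToyPersistedQueue` is a hypothetical type unrelated to `persisted.rs`.

```rust
use std::collections::VecDeque;

/// Toy model of a persisted queue: each entry is the raw bytes of one file.
pub struct ToyPersistedQueue {
    entries: VecDeque<Vec<u8>>,
    pub entries_dropped: u64,
}

impl ToyPersistedQueue {
    pub fn new(entries: Vec<Vec<u8>>) -> Self {
        Self { entries: entries.into(), entries_dropped: 0 }
    }

    /// Return the next decodable entry. A corrupt entry is consumed, counted,
    /// and skipped, so the same bad entry is never retried and never blocks
    /// the entries behind it.
    pub fn pop(&mut self) -> Option<u64> {
        while let Some(raw) = self.entries.pop_front() {
            let decoded = std::str::from_utf8(&raw)
                .ok()
                .and_then(|s| s.parse::<u64>().ok());
            match decoded {
                Some(id) => return Some(id),
                None => self.entries_dropped += 1,
            }
        }
        // Every remaining entry was corrupt, or the queue was empty.
        None
    }
}
```

A truncated entry in the middle of the backlog costs exactly one transaction in this model, and a backlog made entirely of corrupt entries drains to `None` with `entries_dropped` equal to the number of bad entries.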
diff --git a/test/antithesis/scratchbook/properties/events-sc-no-silent-loss.md b/test/antithesis/scratchbook/properties/events-sc-no-silent-loss.md new file mode 100644 index 00000000000..473023643a5 --- /dev/null +++ b/test/antithesis/scratchbook/properties/events-sc-no-silent-loss.md @@ -0,0 +1,113 @@ +--- +slug: events-sc-no-silent-loss +title: Events and service-checks are delivered without silent loss under backpressure/outage +type: Liveness (with a Safety no-silent-drop clause) +priority: High +status: net-new (no SDK assertion exists) +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +--- + +# events-sc-no-silent-loss + +## Origin +Coverage gap: the catalog's data-loss family (`no-silent-interconnect-drop`, +`forwarder-eventual-delivery`, `shutdown-drains-no-loss`) is reasoned and instrumented entirely on +the **metrics** path (encoder = `dd_metrics_encode`, the aggregation pipeline). The events and +service-check sub-pipelines are always-on production paths with a **different shape** — no +aggregation buffer, a straight `dsd_in.{events,service_checks} → *_enrich → dd_{events,service_checks}_encode +→ dd_out` chain (`run.rs:681-684`) — and their own encoders with their own silent-drop branch. None +of the existing properties assert events/SC reach the forwarder; this fills that gap. (It EXTENDS +`no-silent-interconnect-drop` / `forwarder-eventual-delivery` rather than duplicating them: same +faults, different always-on edges and a different silent-loss site.) + +## Code paths (file:line) +- Wiring: `bin/agent-data-plane/src/cli/run.rs:681-684` — events and service_checks edges; both + terminate at `dd_out` (the shared Datadog forwarder). +- Source dispatch fan-out: `lib/saluki-components/src/sources/dogstatsd/mod.rs:1667-1716` + (`dispatch_events`): `extract(is_eventd)` → `buffered_named("events").send_all(...)` then + `extract(is_service_check)` → `buffered_named("service_checks").send_all(...)`. 
`send_all` awaits + on a full bounded mpsc (backpressure), but a dispatch **error** is only `error!`-logged and + swallowed (`mod.rs:1688`, `mod.rs:1702`) — no drop counter on this path. +- Events encoder silent-drop branch: `lib/saluki-components/src/encoders/datadog/events/mod.rs:179-194` + — on a **recoverable** encode error the event is dropped and only `events_dropped_encoder()` + incremented (`telemetry.rs:50,77`); TODO admits the dropped count is hardcoded `1`, not the real + number (`events/mod.rs:186`). Flush build error is `error!`-logged and the request discarded + (`events/mod.rs:208`). +- Service-checks encoder twin: `lib/saluki-components/src/encoders/datadog/service_checks/mod.rs:177-211` + — identical recoverable-drop + flush-discard structure. +- Wrong-type silent swallow: encoder `process_event` calls `event.try_into_eventd()` / + `try_into_service_check()` and returns `ProcessResult::Continue` (consuming + dropping) when the + type does not match (`events/mod.rs:173-177`, `service_checks/mod.rs:171-175`; + `data_model/event/mod.rs:167-182`). A mis-routed or mistyped event is lost here with NO counter — + ties to `source-dispatch-no-misroute`. +- Zero-payload-size config trap: `events/mod.rs:64-67` documents that `serializer_max_payload_size: 0` + makes **every** non-empty compressed payload exceed the limit and be dropped during flush (a silent + total-loss config). Same clamp logic applies to service checks. +- Egress (shared with metrics): the `dd_out` forwarder retry/circuit-breaker/queue-drop behavior is + already characterized in `forwarder-eventual-delivery` / `retry-queue-bounded-under-outage`. + +## Failure scenario +Under a slow/throttled or transiently-down intake, the encoder→forwarder edge fills and backpressure +should propagate up the events/SC edges to the source read loop (queue-and-await, never drop). 
Two +silent-loss risks specific to these paths: (1) the encoder's recoverable-error branch drops +events/SC with an undercounted (`+1`) telemetry signal; (2) a wrong-type event reaching the encoder +is swallowed with no counter at all. After a transient intake outage clears, every accepted event/SC +that did not legitimately overflow the (shared) retry queue should still be delivered. A regression +that turns a backpressure-await into a drop, or mis-scopes the recoverable-error branch, silently +loses customer events/service-checks — the "won't lose customer data" half of the headline, on a +path no existing property watches. + +## Observations +- Events/SC have **no aggregation stage**, so unlike metrics there is no flush-window semantics — + every accepted event/SC should map ~1:1 to a delivered intake item (modulo batching of up to + `MAX_EVENTS_PER_PAYLOAD = 100`, `events/mod.rs:35`). This makes a count-in == count-out reconcile + cleaner than for metrics. +- The `events_received` / `service_checks_received` source counters share the metric name + `component_events_received_total` distinguished only by a `message_type` tag + (`sources/dogstatsd/metrics.rs:111-119`) — the workload checker must filter by tag, not name. +- `events_sent` (`telemetry.rs:41,83`) on the encoders is the delivery-side anchor. + +## Suggested assertions (MISSING / net-new) +- Safety: `Always(no silent drop on a wired events/SC edge under load)` — modeled like + `no-silent-interconnect-drop` but asserted on the events + service_checks edges; backpressure + (await), never discard, on a connected output. +- Liveness: `Sometimes(all-accepted-events-delivered-after-recovery)` and + `Sometimes(all-accepted-service-checks-delivered-after-recovery)` — post-recovery delivered count + (`events_sent`, filtered) ≥ accepted count (`events_received` by `message_type`), minus legitimate + retry-queue overflow. Liveness ⇒ progress, not an instantaneous invariant. 
+- Reachability anchors (REQUIRED to prevent vacuity, esp. for a metrics-heavy workload): + `Sometimes(events_received{message_type=events} > 0)` and + `Sometimes(service_checks_received > 0)`. +- Optional Safety guard: `Always(events_dropped_encoder delta == 0)` while intake is healthy and + config is non-pathological — catches the recoverable-error drop firing when it shouldn't. + +## Config dependencies +- DSD enabled; events/service_checks on by default (`mod.rs:205-221`). +- Keep `serializer_max_payload_size` / `serializer_max_uncompressed_payload_size` at non-pathological + values for the "no-loss" branch; a separate negative case can set `serializer_max_payload_size: 0` + to confirm the documented total-drop trap (`events/mod.rs:64-67`). +- Shared forwarder/retry-queue config (disk persistence, queue byte caps) governs the eventual- + delivery branch exactly as for `forwarder-eventual-delivery`. + +## SUT-side instrumentation needs +- Workload-side: drive an event/SC stream, throttle/down the mock intake, then restore; reconcile + accepted (`component_events_received_total{message_type in (events, service_checks)}`) vs delivered + (`component_events_sent_total` on the events/SC encoders) at the mock intake, allowing for retry- + queue overflow and ~`MAX_*_PER_PAYLOAD` batching slack. +- The dispatch-error path (`mod.rs:1688,1702`) and the wrong-type swallow have NO counter — a strict + no-silent-loss assertion needs net-new SUT-side instrumentation (a drop counter or an + `assert_unreachable`) there, else loss on those branches is invisible to a workload checker. + +## Open Questions +- Is the encoder "recoverable error" branch (`events/mod.rs:183`) ever hit on healthy intake with + well-formed events, or only on genuinely oversized single events? Determines whether the optional + `Always(events_dropped_encoder == 0)` guard is sound or flaky. 
+- Does the events/SC retry traffic share `dd_out`'s per-endpoint queue with metrics, so a metric + flood can evict queued events (cross-stream eviction)? Affects how the overflow allowance is scoped. +- Are events/SC requests `Clone` (so retryable failures take the `Error::Open` re-enqueue path), as + was confirmed for the metrics forwarder requests in `forwarder-eventual-delivery`? Needs checking + for the `/api/v1/events_batch` and service-check request builders. +- Does `dispatch_events` count anything when `send_all` errors, or is dispatch-time loss fully silent + (a finding, shared with `source-dispatch-no-misroute`)? diff --git a/test/antithesis/scratchbook/properties/events-sc-pipeline-reachable.md b/test/antithesis/scratchbook/properties/events-sc-pipeline-reachable.md new file mode 100644 index 00000000000..675576e1eda --- /dev/null +++ b/test/antithesis/scratchbook/properties/events-sc-pipeline-reachable.md @@ -0,0 +1,78 @@ +--- +slug: events-sc-pipeline-reachable +title: Events and service-check sub-pipelines are actually exercised (anti-vacuity anchor) +type: Reachability +priority: Medium +status: net-new (no SDK assertion exists) +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +--- + +# events-sc-pipeline-reachable + +## Origin +Coverage gap + anti-vacuity guard. The entire existing 27-property catalog is metrics-only. A +realistic ADP workload is dominated by metric samples; events and service-checks are comparatively +rare. Without an explicit reachability anchor, the two new event/SC safety/liveness properties +(`malformed-event-sc-no-crash`, `events-sc-no-silent-loss`) can pass **vacuously** — the assertions +never fail simply because no event ever traversed the parse → enrich → encode → deliver chain. 
This +property exists to make the event/SC paths' execution a first-class, observable test obligation, per +the catalog-wide note that `Sometimes(...)` anchors are mandatory to prove a path is reached +(`property-catalog.md` "Catalog-wide notes"). + +## Code paths (file:line) +- Parse: `lib/saluki-io/src/deser/codec/dogstatsd/event.rs:31` / + `.../service_check.rs:28` (codecs decode a frame into `ParsedPacket::Event` / `::ServiceCheck`). +- Source accept + counter: `lib/saluki-components/src/sources/dogstatsd/mod.rs:1502-1517` increments + `events_received()` on a successfully handled event; `mod.rs:1519-1537` increments + `service_checks_received()`. Counters: `sources/dogstatsd/metrics.rs:34-39,114-119` (both emit + `component_events_received_total` with a distinguishing `message_type` tag). +- Dispatch onto the named outputs: `mod.rs:1679-1704` (`buffered_named("events")` / + `buffered_named("service_checks")`). +- Delivery: events encoder `encoders/datadog/events/mod.rs:197-213` and service-checks encoder + `encoders/datadog/service_checks/mod.rs:195-211` dispatch an `HttpPayload`; success increments + `events_sent` (`common/datadog/telemetry.rs:83`) → reaches `dd_out` → mock intake + `/api/v1/events_batch` (events) and the service-checks intake endpoint. + +## Failure scenario +Not a SUT bug per se — a **test-quality** failure: if this anchor never fires, the event/SC +properties provide no real assurance. It also surfaces a genuine SUT regression class: a wiring or +filter change (e.g. `EnablePayloadsConfiguration` defaulting events/SC off, a future filter dropping +all events, a broken named output) that silently removes the event/SC path entirely would make this +`Sometimes` go unsatisfied — a real, observable defect on an "always-on production path." 
+ +## Observations +- Defaults make the path live: `EnablePayloadsConfiguration { events: true, service_checks: true }` + (`sources/dogstatsd/mod.rs:205-221`); the edges are unconditionally wired in `run.rs:681-684` + (not behind a feature flag like `dsd_debug_log_out`). +- Two milestones are worth separate anchors so a parse-but-don't-deliver regression is visible: + (a) **parsed/accepted** at the source, (b) **delivered** at the encoder/intake. + +## Suggested assertions (MISSING / net-new) +- `Sometimes(event_parsed_and_accepted)` — at least once `events_received` + (`component_events_received_total{message_type=events}`) advances. +- `Sometimes(service_check_parsed_and_accepted)` — `service_checks_received` advances. +- `Sometimes(event_delivered)` / `Sometimes(service_check_delivered)` — the events/SC encoder's + `events_sent` advances and a payload reaches the mock intake's events / service-check endpoint. +- (Strengthen to `Reachable` if the workload guarantees ≥1 well-formed event + SC per run.) + +## Config dependencies +- DSD enabled; events/service_checks left at their `true` defaults (`mod.rs:205-221`). +- Workload MUST emit at least one well-formed event (`_e{...}`) and one well-formed service check + (`_sc|...`) so the anchors can fire — this is a workload-construction requirement, not a SUT config. + +## SUT-side instrumentation needs +- Source-side anchors read the existing `component_events_received_total` counter (filter by + `message_type` tag) — no new instrumentation strictly required for the "parsed/accepted" milestone. +- Delivery-side anchors read `component_events_sent_total` on the events/SC encoders and/or observe + the mock intake receiving an events/service-check payload — the cleanest signal is a mock-intake + observation, which the deployment topology's controllable mock intake already supports. 
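A minimal sketch of the tag-filtered anchor check from the notes above, assuming the workload has already scraped counter samples into a simple in-memory form. `CounterSample`, `sample`, `counter_total`, and `anchor_fired` are hypothetical workload-side helpers, not SUT or SDK APIs, and the `metrics` tag value is an assumption for illustration. Only the metric name `component_events_received_total` and the `message_type` tag come from the notes.

```rust
use std::collections::BTreeMap;

/// One scraped counter sample (hypothetical workload-side representation).
#[derive(Debug, Clone)]
struct CounterSample {
    name: String,
    tags: BTreeMap<String, String>,
    value: u64,
}

/// Example-only constructor that sets just the `message_type` tag.
fn sample(name: &str, message_type: &str, value: u64) -> CounterSample {
    let mut tags = BTreeMap::new();
    tags.insert("message_type".to_string(), message_type.to_string());
    CounterSample { name: name.to_string(), tags, value }
}

/// Sums a counter by metric name AND `message_type` tag. Matching on the name
/// alone would mix every stream that shares the metric name.
fn counter_total(scrape: &[CounterSample], name: &str, message_type: &str) -> u64 {
    scrape
        .iter()
        .filter(|s| s.name == name)
        .filter(|s| s.tags.get("message_type").map(String::as_str) == Some(message_type))
        .map(|s| s.value)
        .sum()
}

/// A `Sometimes`-style anchor fires when the tagged counter advanced between two scrapes.
fn anchor_fired(before: &[CounterSample], after: &[CounterSample], message_type: &str) -> bool {
    const RECEIVED: &str = "component_events_received_total";
    counter_total(after, RECEIVED, message_type) > counter_total(before, RECEIVED, message_type)
}

fn main() {
    let name = "component_events_received_total";
    // A metrics-heavy run: only the `events` stream advanced, no service check was seen.
    let before = vec![sample(name, "events", 0), sample(name, "metrics", 10)];
    let after = vec![sample(name, "events", 3), sample(name, "metrics", 500)];
    println!("events anchor fired: {}", anchor_fired(&before, &after, "events"));
    println!("service_checks anchor fired: {}", anchor_fired(&before, &after, "service_checks"));
}
```

In this run the `service_checks` anchor stays unsatisfied even though the shared metric name advanced by hundreds, which is the vacuity case the anchors are meant to expose.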
+ +## Open Questions +- Should the delivery anchor key on the encoder `events_sent` counter or on the mock intake actually + receiving the `/api/v1/events_batch` (and service-check) POST? Intake observation is stronger + (proves end-to-end) but depends on the mock intake distinguishing those endpoints. +- Is one anchor per stream sufficient, or do we want per-(event vs service-check) AND + per-(parsed vs delivered) granularity (4 anchors) to localize a parse-but-not-deliver regression? + Leaning toward 4 for diagnostic value at negligible cost. diff --git a/test/antithesis/scratchbook/properties/filter-config-reload-correct.md b/test/antithesis/scratchbook/properties/filter-config-reload-correct.md new file mode 100644 index 00000000000..9a7d2568788 --- /dev/null +++ b/test/antithesis/scratchbook/properties/filter-config-reload-correct.md @@ -0,0 +1,147 @@ +# filter-config-reload-correct + +## Origin + +The design partner's documented focus (Confluence "Tag Filter RC Relay Stress Test: agent + ADP", +AMCC space): the Core Agent pushes metric-filter config (`metric_tag_filterlist`, `metric_filterlist`, +`statsd_metric_blocklist`, …) over the Remote Config → AgentSecure → ADP config stream **at runtime, +while data is flowing**. Five components rebuild correctness-affecting filtering state live from that +stream. The existing `config-runtime-update-not-revalidated` treats config purely as a crash/ +incompatibility gate; it never treats a config update as a **data-correctness** event. This property +fills that gap: a botched live reload produces *stale* or *fully-cleared* filtering applied to live +customer metrics — wrong tags retained/dropped, or all filtering silently disabled. + +## Code paths + +### The watcher (shared mechanism) + +- `lib/saluki-config/src/dynamic/watcher.rs:36-74` — `FieldUpdateWatcher::changed`: + - **Hazard 1 — silent lag drop:** `Err(broadcast::error::RecvError::Lagged(_))` (`:61-67`) only + `warn!`s and `continue`s. 
The broadcast channel has capacity **100** (`lib.rs:363`, + `broadcast::channel(100)`). Under a burst of config changes (or a slow consumer task), a receiver + that falls >100 behind **silently loses** the intervening `ConfigChangeEvent`s. If the *latest* + state was in the dropped span and no further change to that key arrives, the component keeps + **stale filters forever** with no recovery — there is no re-read of current config on lag. + - **Hazard 2 — partial-deserialize skip:** `:42-57` — `serde_json::from_value::<T>(...).ok()`; if + the new value fails to deserialize, it `warn!`s and **skips the update** (loops). A + multi-key/multi-entry filter config where one entry is malformed can leave the component on the + **previous** config (half-applied across a multi-watcher component — see below). +- Each watcher is an independent `broadcast::Receiver` (`lib.rs:797-798,821-824`); a component with N + watchers has N receivers that can lag/skip **independently** → a partially-updated filter set. + +### Hazard 3 — key deletion never fires (the silent clear-all, subtler than expected) + +- `lib/saluki-config/src/dynamic/diff.rs:12-48` — `diff_recursive` iterates **only `new_dict` keys**. + A key present in the *old* config but **absent in the new snapshot produces NO `ConfigChangeEvent`.** + So *deleting* `metric_tag_filterlist` from the streamed config does **not** notify the watcher; the + component keeps the **old** filters (stale), it does not clear them. +- The clear-all DOES happen when the key is delivered as an explicit empty/null value: + - `tag_filterlist/mod.rs:274-276` — on a `changed()` event, + `compile_filters(new_entries.as_deref().unwrap_or(&[]))`. If `new_value` deserializes to `None` + (e.g. explicit `null`), or to `Some([])` (e.g. a shape that fails per-element but still yields an + empty list), this rebuilds with + **`&[]` → ALL tag filtering removed**, and **rebuilds the context cache** (`build_context_cache()`). 
+ - `dogstatsd_prefix_filter/mod.rs:311-334` and `dogstatsd_post_aggregate_filter/mod.rs:290-313` use + `if let Some(new) = maybe_new { … }` — they *ignore* a `None`, so an explicit-empty arriving as + `None` is a **no-op** there, but an explicit empty `[]` (deserializes to `Some(vec![])`) clears + the list. +- Net effect: **deletion = stale (Hazard 3a)**, **explicit-empty = cleared (Hazard 3b)**, and the two + filter families (`tag_filterlist` vs `prefix_filter`/`post_agg_filter`) react **differently** to a + `None`. This inconsistency across the five components is a correctness hazard in itself. + +### The five live-reloading components + +1. `bin/agent-data-plane/src/components/tag_filterlist/mod.rs:222,274-277` — 1 watcher + (`metric_tag_filterlist`); rebuilds `self.filters` + `self.context_cache` live. +2. `bin/agent-data-plane/src/components/dogstatsd_prefix_filter/mod.rs:285-289,311-334` — **4 watchers** + (`metric_filterlist`, `metric_filterlist_match_prefix`, `statsd_metric_blocklist`, + `statsd_metric_blocklist_match_prefix`); each rebuilds the effective blocklist matcher. +3. `bin/agent-data-plane/src/components/dogstatsd_post_aggregate_filter/mod.rs:268-273,290-313` — **4 + watchers**; rebuilds the histogram-suffix matcher. +4. & 5. Any other `watch_for_updates` consumers on correctness-affecting keys (grep + `watch_for_updates` — the prefix/post-agg filters share the same four keys, so a single key change + fans out to multiple components that must stay mutually consistent). + +## Failure scenario + +While ADP forwards live metrics, the Core Agent pushes a rapid sequence of filter-config updates +(the RC relay stress test). One of: + +- **Lag:** the burst exceeds the 100-slot broadcast buffer; a filter component's receiver lags, the + `Lagged` arm drops the events, and the component keeps applying **stale** filters (e.g. 
still + excluding a tag the operator just re-included, or still forwarding a metric just added to the + blocklist) with no self-correction. +- **Partial:** one malformed entry in a multi-entry `metric_tag_filterlist` update fails + deserialization → the whole update is skipped → stale filtering; or, in `prefix_filter`, one of the + four keys updates while another's event is dropped → a **half-applied** filter config (new + blocklist, old match-prefix flag) that filters inconsistently. +- **Clear-all:** an explicit empty `[]` (or a `None`-deserializing value) for `metric_tag_filterlist` + rebuilds with `&[]` → **all tag filtering silently disabled** on live data (tags the customer + intended to drop now flow to intake); deleting the key entirely instead leaves filtering **stale** + (the opposite surprise). + +All are silent (warn-only at most) and customer-visible (wrong tags / wrong metrics forwarded). + +## Property + +- **Type:** Safety (data-correctness under live config reload). +- **Invariant:** + - `Always(after a filter-config update is acknowledged-applied, the next metric for an affected + name is filtered per the NEW config)` — i.e. no stale filtering after a settled update. Assert + SUT-side at the filter apply site by comparing the metric's post-filter tags/keep-decision to the + currently-loaded `CompiledFilters`/matcher. + - `Unreachable("filter update Lagged-dropped with no subsequent reconciliation")` on the + `RecvError::Lagged` arm (`watcher.rs:61`) — or, if lag is accepted as best-effort, `Reachable` + there + a liveness check that the component eventually converges to the latest config (it does + not, today — there is no re-read). + - `Sometimes(filter config reloaded while metrics in flight)` to prove the reload-under-load state + is reached (non-vacuity). 
+ - `AlwaysOrUnreachable(tag filtering not silently fully-cleared by a config event that the operator + did not intend as a clear)` — distinguishes deletion (should-stay-or-explicitly-clear) from + explicit-empty. +- **Antithesis angle:** the core interaction is **burst + scheduling**: push many filter updates + faster than the filter task drains its broadcast receiver (node throttling / CPU modulation on + `adp` widens the lag window), interleaved with sustained metric load so the stale/half-applied/ + cleared window overlaps live data. Also explore (a) deletion vs explicit-empty vs malformed-entry + shapes, and (b) updating one of the four prefix-filter keys while starving another receiver. +- **Priority:** High (this is the design partner's explicit stress scenario; correctness under live + RC reload is the headline of the AMCC Confluence page). + +## Config dependencies + +- Dynamic config must be **enabled** (remote-agent mode, Add-on 1 topology with a Core Agent / config + stub). In standalone mode `watch_for_updates` returns a watcher whose `rx` is `None` → + `changed()` pends forever (`watcher.rs:30-33`) and none of this fires. **This property cannot run in + standalone mode.** +- Keys to drive: `metric_tag_filterlist`, `metric_filterlist`, `metric_filterlist_match_prefix`, + `statsd_metric_blocklist`, `statsd_metric_blocklist_match_prefix` (constants in + `dogstatsd_filterlist.rs`). +- The Core Agent/stub must be able to send **malformed**, **explicit-empty**, **key-deleting** + (snapshot omitting the key), and **bursty** sequences the real Agent might not — favor a stub. + +## Open Questions + +- Is the broadcast `Lagged` drop considered acceptable (best-effort) by the team, or a bug? There is + **no re-read of current config** on lag, so a dropped *final* update is permanent staleness. + `(needs human input)` +- Deletion-doesn't-diff (Hazard 3a): is the intended Agent semantics that removing a filter key from + RC should *clear* the filter? 
If so, ADP's additive `diff_recursive` (`diff.rs:12-48`) silently + diverges (keeps stale filters). Confirm against Agent RC semantics. +- The `tag_filterlist` vs `prefix_filter`/`post_agg_filter` asymmetry on a `None` update + (clear-vs-ignore): intended or accidental? It means the same RC action has different effects on + different filters. +- Cross-component consistency: a single `metric_filterlist` change fans out to both `prefix_filter` + and `post_agg_filter` via separate receivers — can they diverge transiently (one applied, one + lagged) and does that produce an observable wrong-filter window? +- `tag_filterlist` only filters **Counter + sketch** metrics (`tag_filterlist/mod.rs:235-237`); does + the Agent's equivalent filter the same metric-type subset? (See `tag-filterlist-applied-consistently`.) + +### Investigation Log + +- Examined: `watcher.rs` (full), `diff.rs` (full), `lib.rs:363,541-651,797-824` (broadcast cap 100, + dynamic updater, subscribe/watch), and the `select!` reload arms in all three filter components. +- Found: (1) `Lagged` is warn-and-continue with no re-read → permanent staleness on a dropped final + update; (2) partial-deserialize is warn-and-skip → stale/half-applied; (3) `diff_recursive` is + additive so key **deletion** emits no event (stale), while explicit-empty/`None` clears for + `tag_filterlist` but is ignored by the prefix/post-agg filters. Three distinct hazards, all + silent, all on live customer data. +- Confirmed: in standalone mode the watcher never fires (`rx: None`), so this needs Add-on 1. 
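A toy model of Hazard 3, assuming a flat string-to-string config instead of the real nested snapshot. `diff_additive` only mirrors the "iterate new keys only" shape described for `diff_recursive`; it is not the SUT's code. `diff_with_removals`, `Change`, and `cfg` are hypothetical names, and the deletion-aware variant is shown for contrast, not as the team's intended fix.

```rust
use std::collections::BTreeMap;

type Config = BTreeMap<String, String>;

/// Change event in this toy model (not the SUT's `ConfigChangeEvent`).
#[derive(Debug, PartialEq)]
enum Change {
    Set { key: String, value: String },
    Removed { key: String },
}

/// Example-only helper to build a config from string pairs.
fn cfg(pairs: &[(&str, &str)]) -> Config {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// Additive diff: walks only the NEW snapshot's keys, so a key that
/// disappeared from the snapshot never produces an event.
fn diff_additive(old: &Config, new: &Config) -> Vec<Change> {
    new.iter()
        .filter(|(k, v)| old.get(*k) != Some(*v))
        .map(|(k, v)| Change::Set { key: k.clone(), value: v.clone() })
        .collect()
}

/// Deletion-aware diff: additionally reports keys present in OLD but absent in NEW.
fn diff_with_removals(old: &Config, new: &Config) -> Vec<Change> {
    let mut changes = diff_additive(old, new);
    changes.extend(
        old.keys()
            .filter(|k| !new.contains_key(*k))
            .map(|k| Change::Removed { key: k.clone() }),
    );
    changes
}

fn main() {
    let old = cfg(&[("metric_tag_filterlist", "[\"env\"]")]);
    let deleted = cfg(&[]); // snapshot omits the key entirely
    let emptied = cfg(&[("metric_tag_filterlist", "[]")]); // explicit empty value

    // Hazard 3a: deletion yields no event, so a watcher would keep stale filters.
    println!("additive, key deleted:   {:?}", diff_additive(&old, &deleted));
    // Hazard 3b: explicit-empty yields an event, so a watcher would rebuild with no filters.
    println!("additive, key emptied:   {:?}", diff_additive(&old, &emptied));
    println!("with removals, deleted:  {:?}", diff_with_removals(&old, &deleted));
}
```

The two additive outputs show why a stub must be able to send both shapes: the same operator intent ("remove this filter") lands as stale in one encoding and as cleared in the other.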
diff --git a/test/antithesis/scratchbook/properties/forwarder-eventual-delivery.md b/test/antithesis/scratchbook/properties/forwarder-eventual-delivery.md new file mode 100644 index 00000000000..98864f58b74 --- /dev/null +++ b/test/antithesis/scratchbook/properties/forwarder-eventual-delivery.md @@ -0,0 +1,165 @@ +--- +slug: forwarder-eventual-delivery +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Liveness +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: After a transient intake outage clears, accepted-and-retryable transactions are eventually delivered + +## Origin +SUT analysis §5 liveness #5 ("After a transient intake outage clears, queued data is +eventually delivered") and §2 egress description. Headline guarantee's *no silent loss* +half, in the egress/forwarder path. No Antithesis assertion exists. + +## What the code does + +### Retry model = circuit breaker + re-enqueue (not inline retry) +`lib/saluki-components/src/common/datadog/io.rs`: +- In-flight completion handler (~451-482). On a circuit-breaker-open result + `Err(RetryCircuitBreakerError::Open(req))` (~468-474): the request is reassembled into a + `Transaction` and **re-enqueued to the low-priority queue** via `pending_txns.push_low_priority`. + Only if *that re-enqueue itself errors* does it log "Events may be permanently lost." On + success it tracks queue drops (overflow eviction) telemetry. +- `process_http_response` (~541-563): success → `track_successful_transaction`. Non-success → + `track_permanently_failed_transaction` (these are statuses the classifier did NOT mark + retryable — see below — so they are permanent drops by design). +- `Err(RetryCircuitBreakerError::Service(e))` (~460-463): an error the **retry policy declined to + retry** → `track_permanently_failed_transaction` (permanent drop). 
Per the breaker logic below, + this branch is reached only for *non-retryable* outcomes; retryable transport errors do NOT land + here. + +### Circuit breaker mechanics — what becomes Open vs Service +`lib/saluki-io/src/net/util/middleware/retry_circuit_breaker.rs` `ResponseFuture::poll` (~95-128): +after the inner request completes, it calls `state.policy.retry(&mut req, &mut result)`. +- `Some(backoff)` (policy says retry) → arms the shared backoff and returns `Err(Error::Open(req))` + carrying the original request (~101-112). This is what the io.rs handler re-enqueues. +- `None` (policy declines) or request-not-cloneable → `Err(Error::Service)` (~113-121), the permanent + branch in io.rs. +The policy wraps `StandardHttpClassifier` whose `should_retry(Err(_)) == true` +(`classifier/http.rs:78-83`) and `StandardHttpRetryLifecycle` which explicitly categorizes +DNS/connection/TLS transport errors (`lifecycle/http.rs:76-100`). **Therefore connection resets, +timeouts surfacing as transport errors, and 5xx are routed to `Error::Open` → re-enqueued, NOT +dropped via `Service`.** The earlier worry that connection resets are permanently dropped is +resolved: they are retryable and re-enqueued, provided the request was cloneable. + +### Retry classification — which failures are retryable +`lib/saluki-io/src/net/util/retry/classifier/http.rs`: +- `default_should_retry` (~12-26): **400 / 401 / 403 / 413 → NOT retried** (treated as permanent + client misconfig/bug). All other 4xx and all 5xx → retried. `should_retry(Err(_))` (~81) → + transport errors retried. +- So: 5xx storms, timeouts (408/504), 429, and transport errors are retryable → must be + eventually delivered after the outage clears. 400/401/403/413 are a permanent drop by design + (out of scope for the liveness claim). 
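The classification above can be sketched as a small decision table. This is a toy model under stated assumptions, not the SUT's `default_should_retry`: `Outcome`, `Route`, `is_retryable`, and `route` are hypothetical names; treating 2xx as "delivered" and any other non-4xx/5xx status as a permanent failure is an assumption the notes do not confirm.

```rust
/// Outcome of a single forwarder attempt in this toy model.
#[derive(Debug, Clone, Copy)]
enum Outcome {
    /// An HTTP response with this status code.
    Status(u16),
    /// A transport-level failure (DNS, connection reset, TLS, ...).
    TransportError,
}

/// Where the transaction ends up after the attempt.
#[derive(Debug, PartialEq)]
enum Route {
    Delivered,
    /// Models the `Error::Open` path: re-enqueued to the low-priority queue.
    ReEnqueued,
    /// Models the permanent-drop path.
    PermanentlyFailed,
}

/// Retryable per the notes: transport errors, all 5xx, and all 4xx
/// except 400 / 401 / 403 / 413.
fn is_retryable(outcome: Outcome) -> bool {
    match outcome {
        Outcome::TransportError => true,
        Outcome::Status(400 | 401 | 403 | 413) => false,
        Outcome::Status(code) => (400..=599).contains(&code),
    }
}

/// Routes an outcome. ASSUMPTION: 2xx counts as success; everything else that
/// is not retryable is treated as a permanent failure.
fn route(outcome: Outcome) -> Route {
    if is_retryable(outcome) {
        return Route::ReEnqueued;
    }
    match outcome {
        Outcome::Status(code) if (200..300).contains(&code) => Route::Delivered,
        _ => Route::PermanentlyFailed,
    }
}

fn main() {
    let attempts = [
        Outcome::Status(200),
        Outcome::Status(408),
        Outcome::Status(413),
        Outcome::Status(429),
        Outcome::Status(503),
        Outcome::TransportError,
    ];
    for attempt in attempts {
        println!("{:?} -> {:?}", attempt, route(attempt));
    }
}
```

A fault injector can use the same table to decide which injected failures belong in the liveness reconciliation (the `ReEnqueued` rows) and which are out of scope by design.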
+ +## Failure scenario (Antithesis) +Accept a known set of transactions, then inject a transient intake outage: 5xx storm + +timeouts + connection resets for a bounded window, then restore healthy 2xx. Liveness +expectation: every transaction that was (a) accepted and (b) retryable is eventually delivered +(observed at the mock intake), with no permanent loss — assuming the retry queue did not overflow +(see Open Questions / overflow tension). + +## Key observations +- This is a **liveness** property: the bad outcome is "never delivered." It needs an eventual + window after fault clearance; assert progress, not an instantaneous invariant. +- The re-enqueue is to the **low-priority** queue, which is also the overflow target; under a + long outage the queue can overflow and drop *oldest* (SUT §2 two-tier `PendingTransactions`, + bias to freshest). So the clean liveness claim holds only for outages short enough that the + retry queue does not overflow. Beyond that, eventual delivery is intentionally sacrificed for + bounded memory (the §5-liveness-#4 tension). +- `track_permanently_failed_transaction` and `track_queue_drops` telemetry are the observable + loss signals; `track_successful_transaction` is the delivery signal. + +## Config deps +- `forwarder_retry_queue_max_size_bytes` (`queue_max_size_bytes`) — overflow threshold; sets how + long an outage can last before eventual delivery is no longer guaranteed. +- Circuit breaker backoff schedule (exponential + jitter) — sets recovery latency, hence the size + of the "eventually" window the assertion must allow. + +## Suggested assertion (MISSING — net-new) +- **Sometimes(all-accepted-retryable-delivered-after-recovery)**: at least once, after a transient + outage clears and within a bounded window, the count of delivered transactions equals the count + of accepted-and-retryable transactions submitted before/during the outage (queue did not overflow). + This proves recovery actually happens. 
Best evaluated workload-side by reconciling the controlled + input set against the mock-intake received set. +- Supporting **Reachable**: the `Error::Open` re-enqueue path (`io.rs:468-474`) is hit at least once + (proves the circuit breaker engaged and re-enqueued, not silently dropped). + +## SUT-side instrumentation needs +- An SDK `assert_reachable` at the re-enqueue site (`io.rs:470`) to confirm the breaker re-enqueues. +- Primary check is workload-side reconciliation against the mock intake (needs a deterministic, + countable input set and a mock intake that records received transaction IDs). + +## Open questions +- **Retry-queue overflow bound under the test's outage length** — must size the outage shorter than + overflow, or the assertion must explicitly exclude overflowed (oldest-dropped) transactions. The + overflow drop (`track_queue_drops`) is the legitimate bounded-memory escape valve, so eventual + delivery is only guaranteed for outages that don't overflow `queue_max_size_bytes`. + +### Investigation Log + +#### (a) Are forwarder requests always cloneable? (b) Is breaker backoff per-endpoint? (2026-05-28) + +**Examined:** +- `lib/saluki-io/src/net/util/middleware/retry_circuit_breaker.rs` in full — `ResponseFuture::poll` + (:81-130), `RetryCircuitBreaker::call`/`poll_ready` (:218-258), `State`/`new` (:132-205), + `Layer for RetryCircuitBreakerLayer` (:164-173). +- `lib/saluki-io/src/net/util/retry/policy/rolling_exponential.rs:95-141` (`Policy` impl, incl. + `clone_request`). +- `lib/saluki-components/src/common/datadog/io.rs:351-410` (`run_endpoint_io_loop` service build incl. + the breaker layer at :385-388) and `:236-278` (per-endpoint task spawn loop). +- `lib/saluki-components/src/common/datadog/transaction.rs:55-243` (`TransactionBody` and + `Transaction`, incl. `#[derive(Clone)]` at :58 and :203). +- `lib/saluki-components/src/forwarders/datadog/mod.rs:83-132` (concrete forwarder instantiation). 
+- `lib/saluki-common/src/buf/chunked.rs:102-103` (`FrozenChunkedBytesBuffer`). + +**Found — (a) requests ARE always cloneable on the production path (re-enqueue, not permanent drop):** +- The breaker layer sits at io.rs:385-388, applied to `Request>` (the body→ + `ClientBody` conversion is the *inner* `map_request` at :388, AFTER the breaker, by explicit design + per the comment at io.rs:374-376 — so the request the breaker clones/holds is + `Request>`). +- The non-cloneable → `Error::Service` permanent-drop path (retry_circuit_breaker.rs:118-121) is + reached only when `state.policy.clone_request(&req)` returns `None` (:248 → `req: None` → `take()` + is `None` at :100,:118). The production policy is `RollingExponentialBackoffRetryPolicy`, whose + `clone_request` is `Some(req.clone())` unconditionally (rolling_exponential.rs:138-140) and which + bounds `Req: Clone` (:99). It NEVER returns `None`. (The only `None`-returning impls are + `NoopRetryPolicy` at policy/mod.rs:19 and the test-only `NonCloneableTestRetryPolicy` — neither is + on the forwarder path.) +- The concrete `B` in production is `FrozenChunkedBytesBuffer` + (`TransactionForwarder`, forwarders/datadog/mod.rs:132), which is + `#[derive(Clone)]` (chunked.rs:102-103). `TransactionBody` is `#[derive(Clone)]` (transaction.rs:58) + and `Request: Clone` when `T: Clone`. So `clone_request` always succeeds. +- Therefore every retryable outcome routes to `Error::Open(req)` (retry_circuit_breaker.rs:101-112) → + re-enqueued to the low-priority queue at io.rs:468-474. The "non-cloneable → silent permanent loss" + worry is **NOT realizable** on the production forwarder. RESOLVED. + +**Found — (b) circuit-breaker backoff is PER-ENDPOINT (no cross-endpoint serialization):** +- The backoff lives in `State { policy, backoff }` behind `Arc>>`, created fresh in + `RetryCircuitBreaker::new` (retry_circuit_breaker.rs:200-205). That constructor runs once per + `Layer::layer` call (:170-172). 
+- `run_endpoint_io_loop` builds its own `service` with its own `.layer(RetryCircuitBreakerLayer::new(...))` + (io.rs:385-388). Crucially, each endpoint gets its **own** `run_endpoint_io_loop` task: the spawn + loop at io.rs:253-278 iterates `resolved_endpoints` and calls `spawn_traced_named(..., + run_endpoint_io_loop(...))` once per endpoint, each with its own `endpoint_rx` channel, + `pending_txns`/`RetryQueue`, and breaker `State`. +- The shared upstream `service` is `.clone()`d into each task (io.rs:273), but the breaker `State` + `Arc>` is constructed *inside* each task's `ServiceBuilder` (io.rs:385), so each endpoint + has an independent `backoff`. The policy is `.clone()`d per layer (:171) but state is not shared. + One endpoint's open breaker (its `state.backoff = Some(...)`, retry_circuit_breaker.rs:106-108, + gating `poll_ready` at :222-229) cannot stall another endpoint's recovery. RESOLVED. + +**Not found:** No global/static breaker state, no shared backoff future across endpoints, and no +production code path that supplies a non-cloneable request or a `None`-returning `clone_request` to +the forwarder breaker. + +**Conclusion:** Both sub-questions resolved favorably. (a) Production transactions are always +cloneable (`FrozenChunkedBytesBuffer` → `TransactionBody` → `Request`, all `Clone`; policy always +clones), so retryable failures take the `Error::Open` re-enqueue path — no silent permanent drop via +`Error::Service` for retryable errors. (b) Each endpoint task owns an independent circuit breaker and +backoff, so multi-endpoint fan-out recovers per-endpoint; one slow endpoint does not serialize +others. The "eventually" window in the liveness assertion is correctly per-endpoint. The remaining +real caveat is unchanged: eventual delivery holds only for outages short enough that the low-priority +retry queue does not overflow `queue_max_size_bytes` (drop-oldest). 
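A minimal sketch of the workload-side reconciliation, assuming the workload tags each transaction with a numeric ID that the mock intake records, and that the queue-drop telemetry can be read as a count. `Reconciliation`, `reconcile`, and `all_delivered_after_recovery` are hypothetical names. Treating each queue drop as excusing exactly one missing transaction is also an assumption; the real telemetry may count differently.

```rust
use std::collections::BTreeSet;

/// Result of comparing the controlled input set against the mock-intake received set.
#[derive(Debug, PartialEq)]
struct Reconciliation {
    /// Accepted-and-retryable IDs that never reached the mock intake.
    missing: BTreeSet<u64>,
    /// Missing transactions NOT explained by reported queue drops: candidate silent loss.
    unexplained: u64,
}

/// Reconciles accepted vs delivered IDs, allowing for drop-oldest overflow.
/// `queue_drops` is the drop count reported by telemetry (assumed one per transaction).
fn reconcile(accepted: &BTreeSet<u64>, delivered: &BTreeSet<u64>, queue_drops: u64) -> Reconciliation {
    let missing: BTreeSet<u64> = accepted.difference(delivered).copied().collect();
    let unexplained = (missing.len() as u64).saturating_sub(queue_drops);
    Reconciliation { missing, unexplained }
}

/// The `Sometimes(all-accepted-retryable-delivered-after-recovery)` condition:
/// everything arrived and the retry queue never overflowed.
fn all_delivered_after_recovery(r: &Reconciliation, queue_drops: u64) -> bool {
    r.missing.is_empty() && queue_drops == 0
}

fn main() {
    let accepted: BTreeSet<u64> = (1..=5).collect();

    // Short outage: everything is eventually delivered, no overflow.
    let delivered_all: BTreeSet<u64> = (1..=5).collect();
    let clean = reconcile(&accepted, &delivered_all, 0);
    println!("short outage: {:?}, anchor = {}", clean, all_delivered_after_recovery(&clean, 0));

    // Long outage: the two oldest were evicted and telemetry reported two drops.
    let delivered_some: BTreeSet<u64> = (3..=5).collect();
    let overflowed = reconcile(&accepted, &delivered_some, 2);
    println!("long outage:  {:?}, anchor = {}", overflowed, all_delivered_after_recovery(&overflowed, 2));

    // Same gap with NO reported drops: the loss is unexplained.
    let silent = reconcile(&accepted, &delivered_some, 0);
    println!("silent loss:  {:?}", silent);
}
```

Keeping `unexplained` separate from `missing` lets the checker excuse the bounded-memory escape valve without also excusing a loss that no telemetry accounts for.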
diff --git a/test/antithesis/scratchbook/properties/graceful-shutdown-within-30s.md b/test/antithesis/scratchbook/properties/graceful-shutdown-within-30s.md new file mode 100644 index 00000000000..10a67fa283c --- /dev/null +++ b/test/antithesis/scratchbook/properties/graceful-shutdown-within-30s.md @@ -0,0 +1,131 @@ +--- +slug: graceful-shutdown-within-30s +title: Graceful shutdown completes within the 30s grace window without forceful kill +family: Lifecycle Transitions & Configuration +type: Liveness (bounded-time) + Reachability +priority: High +status: assertion-missing +sut_commit: 042f41db3bd97118c38981765fd49696fce9d318 +--- + +# graceful-shutdown-within-30s + +## Origin + +SUT analysis §5 Safety #6 ("Graceful shutdown completes within 30s without forceful kill (in-flight +data drained)"). This slug owns the **TIMING / clean-completion** angle. The data-loss agent owns +`shutdown-drains-no-loss` (WHAT data survives — e.g. open-window buckets dropped unless +`flush_open_windows`, retry-queue disk flush). Keep this property about *completing cleanly in +time*, not about which data is preserved. + +## Files / functions / lines + +- `bin/agent-data-plane/src/cli/run.rs:255-287` — the main `select!` that ends the run loop on one + of three triggers: + - internal supervisor finishes (run.rs:256-279), + - `running_topology.wait_for_unexpected_finish()` (run.rs:280-283) → `topology_failed = true`, + - `tokio::signal::ctrl_c()` (run.rs:284-286) → SIGINT, logs "Received SIGINT, shutting down…". +- `bin/agent-data-plane/src/cli/run.rs:289-290`: + ```rust + let topology_result = running_topology.shutdown_with_timeout(Duration::from_secs(30)).await; + ``` + The **30s grace window** is hard-coded here. +- `bin/agent-data-plane/src/cli/run.rs:292-294`: after the topology shutdown completes (clean or + forced), the internal supervisor is told to shut down (`internal_supervisor_shutdown_tx.send(())`) + and awaited. 
+- `bin/agent-data-plane/src/cli/run.rs:300-315`: maps `topology_result` to the final process result. + `Ok(())` → clean (or "clean despite errors" if `topology_failed`); `Err(e)` → propagated → exit 1. +- `lib/saluki-core/src/topology/running.rs:71-124` — `shutdown_with_timeout`: + - `:72-78` sets `shutdown_deadline = now + timeout`, arms a `sleep(timeout)`, and a 5s + `progress_interval` for "still waiting on component(s)" logs. + - `:82` `self.shutdown_coordinator.shutdown()` triggers source shutdown, cascading downstream. + - `:86-117` loop: as each task finishes, `handle_task_result` records clean/unclean; when + `join_next_with_id()` returns `None` (all tasks done) → `info!("All components stopped.")` → + break with `stopped_cleanly` reflecting whether every component returned Ok. + - `:111-115` **forceful stop path**: if `shutdown_timeout` fires first → + `warn!("Forcefully stopping topology after shutdown grace period.")`, `stopped_cleanly = false`, + break (remaining component tasks are dropped/aborted by the `JoinSet` being dropped). + - `:119-123` returns `Ok(())` iff `stopped_cleanly`, else + `Err(generic_error!("Topology failed to shutdown cleanly."))`. +- `lib/saluki-core/src/topology/running.rs:130-162` — `handle_task_result`: a component returning + `Ok(())` during shutdown is "stopped" (clean); `Err`/`JoinError` (panic/cancel) → unclean. + +## Key observation / honest framing + +- "Within 30s" is enforced by the `sleep(30s)` race in `shutdown_with_timeout`. The clean path + (`stopped_cleanly == true`, run.rs topology_result `Ok`) means **all** component tasks finished + and returned Ok before the deadline. The forceful path is reached only if at least one component + fails to stop within 30s. +- This is a **bounded-time liveness** property. Under *bounded* in-flight load (the slug's + condition), the expectation is that shutdown completes cleanly within 30s — the forceful-stop + warning should be rare/never. 
Under *unbounded* or adversarial load (e.g. forwarder blocked on a + dead intake with a huge retry queue), the forceful path is legitimately reachable, so do not + assert it as Always-clean unconditionally; scope the clean-completion assertion to the + bounded-load workload. +- Note the **internal supervisor** shutdown (run.rs:294, `_ = internal_supervisor_handle.await`) has + **no timeout** — it awaits indefinitely. The 30s bound applies only to the data topology. So + "graceful shutdown within 30s" is precisely a *topology* property; the overall process exit could + still hang on the internal supervisor (Open Question). + +## Failure scenarios (Antithesis angle) + +- **SIGINT under bounded load:** send SIGINT (run.rs:284) while a modest, finite amount of data is + in flight. Expect: topology stops cleanly, `info!("All components stopped.")` logged, `Ok` result, + exit 0; forceful-stop warning NOT emitted. Sometimes(clean shutdown completed within 30s). +- **SIGINT with a wedged downstream:** forwarder `dd_out` blocked on a dead/slow mock intake so a + component cannot drain within 30s. Expect: forceful stop path (running.rs:111-115) → + `Err("Topology failed to shutdown cleanly.")` → exit 1. Reachable("forced topology stop after + grace period"). This proves the timeout actually bounds shutdown time (no indefinite hang of the + topology). +- **Unexpected component finish** (run.rs:280) then shutdown — same 30s path. +- **Timing/interleaving:** Antithesis schedules so a component finishes right at the 30s boundary — + exercises the race between `join_next` and `shutdown_timeout`. + +## Config dependencies + +- 30s is hard-coded (run.rs:290) — not configurable. The internal supervisor children use their own + `ShutdownStrategy::Graceful(Duration::from_secs(5))` (supervisor.rs:125-126), distinct from the + topology's 30s. 
+- `flush_open_windows` / aggregate flush behavior (SUT §3) affects how long the aggregate component + takes to finish on shutdown — interacts with timing but is owned (for data preservation) by the + data-loss agent. +- Forwarder retry-queue disk flush on shutdown (SUT §2) can extend shutdown time / block draining. +- Memory mode / backpressure (SUT §4) affects how much is in flight at shutdown. + +## Assertion (MISSING — net-new instrumentation) + +No Antithesis SDK assertions exist. Proposed SUT-side: +- In `shutdown_with_timeout`, on the clean break (running.rs:90-93, "All components stopped"): + `assert_reachable!`/`Sometimes("topology shutdown completed cleanly")` and optionally record + elapsed since shutdown start to assert `<= 30s` (it is structurally bounded, but the assertion + documents intent). +- On the forceful-stop branch (running.rs:111-115): `assert_reachable!("topology forcefully stopped + after grace period")` so the workload can confirm the timeout path is *reachable* under adversarial + load, and — under the **bounded-load** workload only — a workload-side + `AlwaysOrUnreachable`/Unreachable expectation that this branch is not taken. +- Workload-side: on SIGINT under bounded load, assert process exits 0 within ~35s (30s grace + slack) + and that the "Forcefully stopping topology" warning is absent. + +## Open questions + +- **Does the overall process honor 30s, or can it still hang?** run.rs:294 + `internal_supervisor_handle.await` has no timeout. If an internal-supervisor child hangs on + shutdown, the *process* exit can exceed 30s even though the *topology* shut down in time. WHY IT + MATTERS: a workload asserting "process exits within ~35s" might fail for a reason outside this + property's scope. WHAT CHANGES: either scope the assertion to topology-shutdown completion + (log/assertion inside `shutdown_with_timeout`) rather than process exit, or file a separate + property for internal-supervisor shutdown bounding. 
+- **What happens to tasks dropped on forceful stop?** On running.rs:115 break, the `JoinSet` is + dropped, aborting remaining tasks. Confirm aborted component tasks cannot leave shared state + (interner buffer, retry queue) corrupted. WHY IT MATTERS: clean-in-time vs. data-integrity overlap + with the data-loss agent's property; keep the boundary clear. +- **Is `shutdown_coordinator.shutdown()` idempotent / does cascade reliably reach every component?** + A source that never observes the shutdown signal would never finish and force the timeout. Needs a + read of `ComponentShutdownCoordinator` to confirm all edges are signaled. + +## SUT-side instrumentation needs + +- Antithesis SDK dependency (none today). +- Reachable markers on both the clean-break and forceful-stop branches of `shutdown_with_timeout` + (running.rs:90-93 and 111-115). +- Optional elapsed-time capture at shutdown start vs. completion to assert the 30s bound explicitly. diff --git a/test/antithesis/scratchbook/properties/interner-full-bounded.md b/test/antithesis/scratchbook/properties/interner-full-bounded.md new file mode 100644 index 00000000000..777aaeccd34 --- /dev/null +++ b/test/antithesis/scratchbook/properties/interner-full-bounded.md @@ -0,0 +1,129 @@ +# interner-full-bounded + +**Family:** Resource Boundaries — memory / exhaustion +**Status:** Verified against code at commit 042f41db3b. Two-mode property: +- `allow_heap_allocs = false`: bounded + deterministic drop — **expected to HOLD**. +- `allow_heap_allocs = true` (the DEFAULT): memory no longer bounded — **expected to FAIL** the + bounded-memory reading. + +## What led to the property + +Interner determinism is the foundation of the context memory bound (`sut-analysis.md` §3). The +fixed-size interner has a hard byte capacity; the question is what happens when it fills. 
The +behavior pivots entirely on one config flag, and the **default flips it to the unbounded +branch** — making this the single highest-leverage config knob in the bounded-memory story. + +## Behavior in code + +Resolution path `ContextResolver::intern` (`lib/saluki-context/src/resolver.rs:339-353`): +```rust +s.try_cheap_clone() // inlineable/cheap strings escape entirely + .or_else(|| self.interner.try_intern(s.as_ref())..) // fixed buffer; None when full + .or_else(|| self.allow_heap_allocations.then(|| { // HEAP FALLBACK + self.telemetry.intern_fallback_total().increment(1); + MetaString::from(s.as_ref()) // unbounded heap alloc + })) +``` + +- **Interner-full is deterministic.** `FixedSizeInterner` shard `try_intern` + (`lib/stringtheory/src/interning/fixed_size.rs:462-494`) returns `None` when neither a reclaimed + entry nor remaining buffer capacity can fit the string (`required_cap <= self.available()` else + `None`, lines 489-493). Also `None` if the string exceeds the packed length/capacity max + (lines 465-467). No allocation, no panic — just `None`. +- **Heap-disallowed => metric dropped.** When `allow_heap_allocations == false`, the final + `or_else` yields `None`, so `intern` returns `None`, `create_context` returns `None` + (`resolver.rs:373` `let context_name = self.intern(name)?;`), `resolve` returns `None`, and + `handle_metric_packet` returns `None` (`sources/dogstatsd/mod.rs:1565-1580`): "We failed to + resolve the context, likely due to not having enough interner capacity." The metric is dropped + deterministically; no memory grows. This is exactly what the unit test + `no_metrics_when_interner_full_allocations_disallowed` + (`sources/dogstatsd/mod.rs:1808-1834`) asserts (using a noop/zero-size interner + a name longer + than the 31-byte inline limit so it can be neither inlined nor interned). 
+- **Heap-allowed (DEFAULT) => unbounded.** `allow_heap_allocations` builder default is `true` + (`resolver.rs:258` `unwrap_or(true)`, doc `:179-190` calls it "effectively unlimited"), and DSD + config `dogstatsd_allow_context_heap_allocs` defaults `true` + (`sources/dogstatsd/mod.rs:149-151, 402-406`; wired at `sources/dogstatsd/resolver.rs:38,56,64` + for primary, no_agg, and tags resolvers, all sharing one interner at `resolver.rs:40`). On a + full interner every over-cap context spills to heap, bumping `intern_fallback_total`, and RSS + grows without bound. + +## Failure scenario (Antithesis) + +Set a small `dogstatsd_string_interner_size_bytes` and flood high-cardinality contexts so the +interner fills. +- Mode A (`dogstatsd_allow_context_heap_allocs: false`): assert that once the interner is full, + metrics with un-internable names/tags are dropped and **no heap fallback occurs** — memory + bounded. Antithesis timing exploration probes the interner reclamation/tombstoning path + (loom-tested per existing-assertions.md, raw-pointer `'static &str` keys) under concurrent + intern-vs-drop, where the worst documented case is a duplicate entry (more pressure), never + corruption. +- Mode B (default `true`): assert `intern_fallback_total` climbs and RSS escapes the interner + budget — the bounded-memory guarantee is void. This is the more important finding because it is + the **default** posture. + +## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist) + +- Heap-disallowed branch: `AlwaysOrUnreachable(interner_full ⇒ metric dropped, no heap alloc)`. + AlwaysOrUnreachable because "interner full" is a rare/optional path that may not occur in every + run; when it does occur the drop-not-allocate behavior must hold. +- `Sometimes(try_intern returned None)` / `Sometimes(interner reported full)` — proves the + workload actually exhausts the interner; otherwise the above is vacuous. 
+- Heap-allowed branch: `Sometimes(intern_fallback_total > 0)` — proves the unbounded spill path is + reachable under default config (the finding). Pair with the RSS check from + `rss-bounded-under-cardinality`. +- A counter on `intern_fallback_total` already exists (`resolver.rs:349`) — a natural anchor for + the `Sometimes`, but it is telemetry, not an assertion, so the SDK assertion is still net-new. + +**SUT-side instrumentation strongly preferred:** distinguishing "interned" vs "inlined" vs +"heap-fallback" vs "dropped" requires reading internal resolver state. A workload-only checker +sees dropped metrics (missing at intake) but cannot tell a heap-fallback (memory bug) from a +clean intern (correct), nor an interner-full drop from a parse drop. + +## Configuration dependencies + +- `dogstatsd_allow_context_heap_allocs` (default **true** — the unbounded branch). +- `dogstatsd_string_interner_size_bytes` / `dogstatsd_string_interner_size` (interner capacity; + effective default 2 MiB, `sources/dogstatsd/mod.rs:1888-1890`). Both resolvers share one + interner of this size (`sources/dogstatsd/resolver.rs:40`), so two resolvers draw from the same + buffer. +- Inlining: strings <= 31 bytes are inlined by `MetaString` (`try_cheap_clone`) and bypass the + interner entirely — the workload must use long names/tags to actually pressure the interner. + +## Open questions + +- The fixed-size interner reclaims/tombstones entries when interned strings are dropped. Under + steady high-cardinality churn, does reclamation keep pace, or does fragmentation make + `try_intern` return `None` even below nominal byte capacity? Matters: if fragmentation causes + premature "full," the heap-disallowed mode drops metrics earlier than the configured budget + implies (more data loss), and the heap-allowed mode spills sooner. +- The tags resolver also has `with_heap_allocations` (`sources/dogstatsd/resolver.rs:45`) and + shares the interner. 
Is the bound the sum across name+tag interning of both resolvers? Affects + the byte budget the assertion measures against. + +## Investigation Log + +#### Default of `dogstatsd_allow_context_heap_allocs` and whether bounded mode is ever shipped +- **Examined**: `lib/saluki-components/src/sources/dogstatsd/mod.rs:149-151` + (`default_allow_context_heap_allocations`), `:403-406` (serde field + rename), `:438`; + `sources/dogstatsd/resolver.rs:38,45,56,64` (resolver wiring); `lib/saluki-context/src/resolver.rs` + `with_heap_allocations` (187-188) and the default fallback `.unwrap_or(true)` (258, 663); + `config_registry/datadog/dogstatsd.rs:8,392`; grep of all `with_heap_allocations(false)` in non-test code; + searched shipped configs (`dist/`, `config/`, all `*.yaml`/`*.toml`) for `heap_alloc`. +- **Found (a) — default**: `const fn default_allow_context_heap_allocations() -> bool { true }` + (`mod.rs:149-151`), applied via `#[serde(rename = "dogstatsd_allow_context_heap_allocs", + default = "default_allow_context_heap_allocations")]` (`mod.rs:403-406`). The resolver builder + default also independently falls back to `true` (`resolver.rs:258` `unwrap_or(true)`). So default + is **true = unbounded heap-allocation (spill) mode**. +- **Found (b) — bounded mode is test-only**: The only call sites of `with_heap_allocations(false)` + are inside `#[cfg(test)] mod tests` in `sources/dogstatsd/mod.rs` (lines 1820 and 1841, module + begins `#[cfg(test)]` at 1736 — tests `no_metrics_when_interner_full_allocations_disallowed` + and `metric_with_additional_tags`). Production wiring (`resolver.rs:38,45,56,64`) passes + `config.allow_context_heap_allocations` straight through. No shipped YAML/TOML config sets it to + false. There is no default/code path that forces bounded mode; it is **opt-in via config only**. +- **Not found**: Any release/default config or non-test code path that sets + `dogstatsd_allow_context_heap_allocs = false`. +- **Conclusion**: RESOLVED. 
Default is **true** (unbounded spill). Bounded mode (heap-disallowed, + interner-is-a-hard-cap) is **opt-in / test-only** — never shipped by default. The + realistic default-config property is "interner spills to heap (memory unbounded by the interner)"; + the hard-bounded property only holds when an operator explicitly sets the flag false. Property + framing should remain explicitly two-mode and label the bounded branch as opt-in. diff --git a/test/antithesis/scratchbook/properties/interner-reclamation-no-corruption.md b/test/antithesis/scratchbook/properties/interner-reclamation-no-corruption.md new file mode 100644 index 00000000000..9bdb9ef4b32 --- /dev/null +++ b/test/antithesis/scratchbook/properties/interner-reclamation-no-corruption.md @@ -0,0 +1,128 @@ +--- +slug: interner-reclamation-no-corruption +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Concurrent intern + drop-last-ref never yields overlapping/corrupt entries + +## Origin +SUT analysis §3 ("manual reclamation/tombstoning"), §4 ("Interner reclamation is loom-tested; +worst case is a duplicate entry, never corruption … the most concurrency-unsafe component in the +bounded-memory story"). existing-assertions.md notes a `loom` cfg already marks this path as +concurrency-sensitive. No Antithesis assertion exists. + +## What the code does +`lib/stringtheory/src/interning/fixed_size.rs` + `map.rs` — a fixed byte buffer with manual +refcount-based reclamation. The safety argument rests on a mutex + a refcount **re-check**: + +- `EntryHeader.refs: AtomicUsize` (fixed_size.rs:92-95). `increment_active_refs` uses `AcqRel` + (212-214); `decrement_active_refs` (219-221) returns true iff `fetch_sub(1, AcqRel) == 1` + (i.e. count hit zero). `is_active` loads `Acquire` (207-209). 
+- Drop (map.rs:73-84): when `decrement_active_refs()` says "now zero," takes + `interner.lock()` and calls `mark_for_reclamation(self.header)`. +- **The re-check under the lock** (map.rs:447-470 `InternerState::mark_for_reclamation`): re-reads + `header.is_active()` (459). If a concurrent `try_intern` resurrected the entry (incremented refs) + between the drop's decrement and acquiring the lock, `is_active()` is now true and reclamation is + **skipped**. Only if still inactive does it `entries.remove(entry_str)` (466) then + `storage.mark_for_reclamation` (468). Comment (450-454) states only the mutex-mediated `InternerState` + increments refs and only the handle drop decrements, so a zero count under the lock means nobody else + holds or is acquiring a reference. +- **The buffer is overwritten on reclaim**: `write_to_reclaimed_entry`/the fill at map.rs:382-393 fills + the reclaimed string capacity with `0x21` ("a known repeating value … signal that offsets/reclaimed + entries are incorrect and overlapping"). So a stale `'static &str` pointing into a reclaimed slot would + read `0x21` bytes — the corruption sentinel. +- `try_intern` (map.rs:472-517): under the lock, first checks `entries.get(s)` and on hit + `increment_active_refs` (483-484) — this is the resurrection that the drop re-check defends against. + On miss, reuses a reclaimed entry (495-497) or unoccupied space (498-499), inserts a `'static`-lifetime + key (513-514, with a SAFETY note that the lifetime never outlives the entry). +- **Loom test** `concurrent_drop_and_intern` (fixed_size.rs:1072-1142): models T1 holding an entry, T2 + interning the same string, then `drop(t1)` — asserts ≤1 reclaimed entry and that the reclaimed entry + does **not overlap** the surviving interned string (`do_reclaimed_entries_overlap`, 1078-1086). The + documented acceptable outcome is a benign duplicate (1090-1094, 1117-1121). 
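The decrement-then-re-check protocol described above can be sketched in a few lines. This is a simplified model, not the `stringtheory` code: it uses a `HashMap` and `Arc` instead of a raw byte buffer, and `Entry`, `Interner`, and `resurrection_then_final_drop` are hypothetical names. The interleaving is replayed sequentially (decrement, then a competing intern, then the locked re-check) so the outcome is deterministic.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical entry header: just the active reference count.
struct Entry {
    refs: AtomicUsize,
}

impl Entry {
    fn is_active(&self) -> bool {
        self.refs.load(Ordering::Acquire) > 0
    }

    /// Returns true iff this decrement took the count to zero.
    fn decrement(&self) -> bool {
        self.refs.fetch_sub(1, Ordering::AcqRel) == 1
    }
}

/// Hypothetical interner: the map stands in for the fixed buffer.
#[derive(Default)]
struct Interner {
    entries: Mutex<HashMap<String, Arc<Entry>>>,
}

impl Interner {
    /// Under the lock: a hit increments the refcount (possible resurrection),
    /// a miss creates a new entry with one reference.
    fn try_intern(&self, s: &str) -> Arc<Entry> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(existing) = entries.get(s) {
            existing.refs.fetch_add(1, Ordering::AcqRel);
            return Arc::clone(existing);
        }
        let entry = Arc::new(Entry { refs: AtomicUsize::new(1) });
        entries.insert(s.to_owned(), Arc::clone(&entry));
        entry
    }

    /// Under the lock: re-check activity and reclaim only if still inactive.
    /// Returns true iff the entry was actually reclaimed.
    fn mark_for_reclamation(&self, s: &str, entry: &Entry) -> bool {
        let mut entries = self.entries.lock().unwrap();
        if entry.is_active() {
            return false; // resurrected between the decrement and the lock
        }
        entries.remove(s);
        true
    }
}

/// Replays the race window sequentially and returns
/// (reclaimed after resurrection, reclaimed after the final drop).
fn resurrection_then_final_drop() -> (bool, bool) {
    let interner = Interner::default();
    let first = interner.try_intern("env:prod");
    let hit_zero = first.decrement(); // dropper sees zero, lock not yet taken
    let second = interner.try_intern("env:prod"); // competing intern resurrects
    let reclaimed_early = hit_zero && interner.mark_for_reclamation("env:prod", &first);
    let reclaimed_late = second.decrement() && interner.mark_for_reclamation("env:prod", &second);
    (reclaimed_early, reclaimed_late)
}
```

Without the `is_active()` re-check, the first `mark_for_reclamation` call would remove an entry that `second` still refers to, which is the overlap hazard the loom test guards against.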
+ +## Failure scenario (Antithesis) +High-cardinality DSD load with many short-lived contexts so the same tag/name strings are repeatedly +interned and dropped across the source's context resolvers, driving concurrent `try_intern` (resolve) +against drop-last-ref reclamation on a near-full interner (forcing reclaimed-slot reuse + buffer +overwrite). The hazard: a `try_intern` that returns a handle into a slot a concurrent drop then reclaims +and overwrites with `0x21`, producing an interned `&str` whose bytes are corrupt or overlap another live +entry. + +## Key observations +- The loom test **bounds** the interleavings (loom explores a model with a small, fixed thread/op set); + Antithesis explores the real scheduler under real load with many shards and entries — the SUT analysis + explicitly flags this as where Antithesis adds value beyond loom. +- The invariant to check is exactly the loom assertion generalized: **no reclaimed entry overlaps a live + entry**, and **no live `InternedString` reads the `0x21` corruption sentinel**. The `0x21` fill is a + ready-made detector: a resolved string containing the fill pattern where it shouldn't is corruption. +- Sharding (`[Arc>; SHARD_FACTOR]`, SUT §3) means cross-shard interactions add interleavings loom + doesn't model per-run. + +## Config deps +- Interner capacity / `allow_heap_allocations` (SUT §3): with heap fallback on (default true), a full + interner spills to heap and the reclaimed-slot-reuse path is less pressured — to exercise reclamation, + the test wants a **small** interner and/or heap-fallback off so the buffer actually fills and reclaims. +- `SHARD_FACTOR` and per-shard capacity govern how often reclaimed slots are reused. + +## Suggested assertion (MISSING — net-new) +- **Always**(no corrupt/overlapping entry): generalize the loom check to a runtime invariant. 
Realize as + an SDK `assert_always` (or `assert_unreachable` on the corruption-detected branch) inside + `mark_for_reclamation`/`try_intern` that verifies a newly returned entry does not overlap any reclaimed + entry and that resolved bytes are not the `0x21` sentinel. This needs SUT-side instrumentation — the + race is invisible to a workload-only checker. +- **Sometimes(reclaimed-slot reused)** and **Sometimes(drop re-check found resurrected entry)**: prove the + contended reclamation path was actually hit (the `is_active()` re-check at map.rs:459 returning true), + not just steady-state interning. + +## SUT-side instrumentation needs +- A debug-build check that scans for overlap between `reclaimed` entries and the live `entries` map, or a + per-resolve check that the returned `&str` contains no unexpected `0x21` run, gated behind a test cfg. + A workload cannot see interner internals; only SUT-side assertions can catch the corruption branch. + +## Open questions +- **Memory ordering sufficiency:** the re-check relies on `AcqRel`/`Acquire` pairing between + `increment_active_refs` and the drop's `decrement` + `is_active` under the lock. Confirm the lock + acquire provides the needed synchronization with a `try_intern` increment that happened *without* the + drop's lock (the increment at map.rs:484 is under the same `InternerState` lock — verify both paths take + the same mutex so the re-check is sound). If both are under the lock, the race window is only between the + atomic decrement (no lock) and acquiring the lock — which is exactly what the re-check covers. +- **Cross-shard handles:** can an `InternedString` from shard A ever be dropped against shard B's lock? + If sharding is by string hash and stable, no — but confirm, since a wrong-shard reclaim would be + corruption the per-shard check wouldn't catch. + +### Investigation Log + +#### Is the reclamation buffer-fill sentinel present in RELEASE builds, or debug-only? 
+- **Examined:** `lib/stringtheory/src/interning/map.rs:368-394` (`clear_reclaimed_entry`), + `lib/stringtheory/src/interning/fixed_size.rs:435-460` (the analogous reclaim path), and grepped the + whole `interning/` dir for `0x21` / `fill` / `debug_assert` / `cfg!` / `debug_assertions`. +- **Found:** + - **Exact fill site (map.rs):** `map.rs:392` — `str_buf.fill(0x21);` inside the `unsafe` block at + `map.rs:388-393`, within `fn clear_reclaimed_entry` (`map.rs:368`). It fills the entire string + capacity of the tombstoned entry (`str_ptr = entry_ptr.add(1).cast::()`, length `str_cap`). + - **No cfg gate of any kind.** There is no `#[cfg(debug_assertions)]`, no `if cfg!(debug_assertions)`, + no `#[cfg(test)]`, no `#[cfg(loom)]` around the fill or around `clear_reclaimed_entry`. The fill is + unconditional and therefore **present in release builds**. The only cfg-gated constructs in these + files are `debug_assert!` macros (fixed_size.rs:278/286/325/390, map.rs:408) which are unrelated to + the fill. + - **Important discrepancy — two different sentinels in two different interner implementations:** + - `map.rs:392` (the `InternerState`/`Map`-backed interner) fills with **`0x21`** (ASCII `!`). + - `fixed_size.rs:458` (the `FixedSizeInterner` reclaim path, `fn` at fixed_size.rs ~430) fills with + **`0xAA`** — *not* `0x21`. Same surrounding comment ("Write a magic value … signal that + offsets/reclaimed entries are incorrect and overlapping"), same unconditional `unsafe { … fill() }`, + but a different byte value. Both are unconditional / release-present. +- **Not found:** No conditional compilation, feature flag, or runtime toggle disabling either fill. No + third fill site. +- **Conclusion:** RESOLVED. The buffer-fill-on-reclamation is unconditional and present in release builds, + so an Antithesis assertion *can* rely on a fill sentinel to detect a stale read into a reclaimed slot + rather than being forced to compute overlap directly. 
**However, the sentinel value is implementation- + dependent:** `0x21` for the `map.rs` interner, `0xAA` for the `fixed_size.rs` interner. An assertion + that hard-codes `0x21` would miss corruption in the `FixedSizeInterner` path. The robust check is either + (a) match against the correct sentinel per implementation, or (b) check overlap directly (the + implementation-independent invariant the loom test already uses). Detecting a *run* of either sentinel + in a resolved `&str` is a valid corruption signal in the corresponding interner. diff --git a/test/antithesis/scratchbook/properties/malformed-dsd-no-crash.md b/test/antithesis/scratchbook/properties/malformed-dsd-no-crash.md new file mode 100644 index 00000000000..557b3780c75 --- /dev/null +++ b/test/antithesis/scratchbook/properties/malformed-dsd-no-crash.md @@ -0,0 +1,98 @@ +--- +slug: malformed-dsd-no-crash +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Malformed DogStatsD packets never crash the process or kill the socket + +## Origin +SUT analysis §2 ("a malformed packet never kills the socket", `mod.rs:1283-1318`), §6 gap +#6, §8 ("undecided malformed-input error policy in the codecs"). Decomposes the headline +"ADP will not crash under load." No Antithesis assertion exists (existing-assertions.md). + +## What the code does +`lib/saluki-components/src/sources/dogstatsd/mod.rs::drive_stream` (1146-1337): +- The `'read` loop reads into an I/O buffer, then a `'frame` loop calls + `framer.next_frame(io_buffer, reached_eof)` (1237) and `handle_frame(...)` (1252). +- **Per-frame parse error** (`handle_frame` returns `Err`, 1266-1270): logged at `warn!` + and the loop continues — the bad frame is skipped, not fatal. 
+- **Framing error** (1272-1293): increments `framing_errors`; for **connectionless** streams + (UDP, UDS datagram) `io_buffer.clear()` + `continue 'read` (1283-1288) — clear-and-continue, + the socket survives. For **connection-oriented** streams it `break 'read` (1289-1291), closing + only that one connection (not the process). +- **I/O error** (1306-1318): connectionless → `continue 'read`; connection-oriented → `break 'read`. +- `handle_frame` (1458-1542) routes to `codec.decode_packet`; on decode error it bumps a + type-specific `*_decode_failed` counter and returns `Err` (1462-1473) — caught by the caller above. + +Codecs (`lib/saluki-io/src/deser/codec/dogstatsd/`): +- `metric.rs::parse_dogstatsd_metric` (67-194): `nom` parsers returning `IResult`; unknown + `|`-chunks are silently skipped (136-141, with a TODO "throw an error, warn, or be silently + permissive?"). `permissive_metric_name` (197-206) uses `from_utf8_unchecked` but only after + `take_while1(valid_char)` constrains bytes to printable ASCII (SAFETY comment). `raw_metric_values` + validates UTF-8 with `simdutf8` before any `from_utf8_unchecked` (232-234). +- `event.rs:146-148` and `service_check.rs:94-96`: identical "skip unknown chunk" TODO — **undecided + error policy**, currently silently permissive. +- `metric.rs:243` `unreachable!("should be constrained by alt parser")` — reachable only if the + `alt((tag("g"),tag("c"),…))` matched something not in the match arm; provably constrained, so not + a real panic site, but it *is* an `unreachable!` on the hot parse path worth covering. +- Existing proptest `property_test_malicious_input_non_exhaustive` (metric.rs:761-772): 1000 random + byte vectors, asserts no panic. This is a `cargo test` proptest, **not** an Antithesis assertion, + and is non-exhaustive by its own comment. 
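The error handling described in this section can be condensed into a small model, assuming a toy decoder and a pre-recorded list of read outcomes. `Transport`, `ReadOutcome`, `Stats`, `decode`, and `drive_stream_model` are hypothetical names for illustration; the real `drive_stream` works on live sockets and a framer. The sketch shows only the control flow: a bad frame is skipped, a framing error is survivable on connectionless transports, and it closes just the one stream on connection-oriented transports.

```rust
/// Hypothetical transport classification.
#[derive(Clone, Copy, PartialEq)]
enum Transport {
    Connectionless,
    ConnectionOriented,
}

/// Hypothetical result of one framing attempt.
enum ReadOutcome {
    Frame(&'static str),
    FramingError,
}

#[derive(Default, Debug, PartialEq)]
struct Stats {
    decoded: u32,
    decode_failed: u32,
    framing_errors: u32,
    stream_closed: bool,
}

/// Toy decoder (assumption): accepts `name:value|type` and returns `Err`,
/// never panics, on anything else.
fn decode(frame: &str) -> Result<(), &'static str> {
    let (name, rest) = frame.split_once(':').ok_or("missing ':'")?;
    let (value, kind) = rest.split_once('|').ok_or("missing '|'")?;
    if name.is_empty() {
        return Err("empty name");
    }
    value.parse::<f64>().map_err(|_| "bad value")?;
    match kind {
        "c" | "g" | "h" | "ms" | "s" | "d" => Ok(()),
        _ => Err("unknown type"),
    }
}

/// Models the read loop's policy for parse and framing errors.
fn drive_stream_model(transport: Transport, reads: &[ReadOutcome]) -> Stats {
    let mut stats = Stats::default();
    for read in reads {
        match read {
            ReadOutcome::Frame(frame) => match decode(frame) {
                Ok(()) => stats.decoded += 1,
                // Per-frame parse error: count it and keep going.
                Err(_) => stats.decode_failed += 1,
            },
            ReadOutcome::FramingError => {
                stats.framing_errors += 1;
                if transport == Transport::ConnectionOriented {
                    // Only this one connection ends; the process is unaffected.
                    stats.stream_closed = true;
                    break;
                }
                // Connectionless: clear-and-continue, the socket survives.
            }
        }
    }
    stats
}
```

Feeding the same sequence (valid frame, malformed frame, framing error, valid frame) to both transports gives two decoded frames on the connectionless path and one on the connection-oriented path, which matches the scoping of the listener-survival claim to connectionless sockets.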
+ +## Failure scenario (Antithesis) +Drive each listener (UDP 8125, TCP, UDS datagram, UDS stream) with adversarial packets: +oversized frames (exceed buffer → framing error), invalid UTF-8 in value/name positions, +truncated extensions (`|@`, `|#`, `c:`, `e:`, `card:` with missing bodies), enormous tag lists, +embedded NULs, partial multi-value (`x:1:2:`), zero-length frames. Expectation: process stays +up; connectionless sockets keep serving subsequent valid packets; a TCP connection may close but +the listener accepts new connections; no panic. + +## Key observations +- The clear-and-continue (1283-1288) and skip-bad-frame (1266-1270) paths are the explicit + socket-survival mechanism — the property is precise: **connectionless sockets never die on a bad + packet; the process never panics on codec/framing errors.** +- The codecs return `Err` rather than panicking for malformed input by construction (nom + guarded + `unsafe`), but the `unreachable!`/`from_utf8_unchecked` sites mean a *parser regression* could turn + malformed input into a panic — exactly what Antithesis should guard. +- TCP `break 'read` closes one connection; that is acceptable (connection-oriented semantics) and must + be excluded from a "socket never dies" claim — scope the listener-survival half to connectionless. + +## Config deps +- `permissive` mode (metric.rs:73) widens accepted metric names — broadens the malformed surface; + test both permissive and strict. +- `client_origin_detection`, `timestamps` gates (metric.rs:117/122/127/132) change which extension + chunks are parsed; toggling them changes the reachable parse branches. +- Which listeners are enabled is config-driven; the property should hold for every enabled listener. 
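The adversarial inputs listed under the failure scenario could be produced by a small workload-side generator along these lines. This is a sketch only: `adversarial_payloads` and its `max_frame_len` parameter are hypothetical, the byte strings are illustrative instances of each category rather than a vetted corpus, and a real workload would likely randomize them.

```rust
/// Builds one illustrative payload per adversarial category.
/// `max_frame_len` is an assumed framer buffer size used to build the oversized case.
fn adversarial_payloads(max_frame_len: usize) -> Vec<Vec<u8>> {
    // Enormous tag list: thousands of short tags on one metric.
    let mut many_tags = b"m:1|c|#".to_vec();
    for i in 0..10_000 {
        many_tags.extend_from_slice(format!("k{i}:v,").as_bytes());
    }

    vec![
        vec![b'a'; max_frame_len + 1], // oversized frame (exceeds the buffer)
        b"m\xff\xfe:1|c".to_vec(),     // invalid UTF-8 in the name position
        b"m:\xc3\x28|g".to_vec(),      // invalid UTF-8 in the value position
        b"m:1|c|@".to_vec(),           // truncated sample-rate extension
        b"m:1|c|#".to_vec(),           // truncated tag extension
        b"m:1|c|c:".to_vec(),          // truncated container-ID extension
        b"m:1|c|e:".to_vec(),          // truncated external-data extension
        b"m:1|c|card:".to_vec(),       // truncated cardinality extension
        b"m\0x:1|c".to_vec(),          // embedded NUL
        b"x:1:2:|h".to_vec(),          // partial multi-value
        many_tags,                     // enormous tag list
        Vec::new(),                    // zero-length frame
    ]
}
```

Each payload would be sent to every enabled listener, followed by a known-good packet, so the workload can check that the connectionless sockets still accept traffic afterwards.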
+ +## Suggested assertion (MISSING — net-new) +- **Always**(process up): the ADP process stays alive across the entire malformed-input workload — + realized as a workload-side liveness/health check plus a panic hook converting any codec/framing + panic into a recorded failure. +- **Unreachable** at codec panic: `assert_unreachable` covering the `unreachable!` (metric.rs:243) and + the `from_utf8_unchecked` SAFETY invariants — any panic there is a must-never. +- **Always**(connectionless socket survives): after a malformed packet on UDP/UDS-datagram, the same + socket successfully receives a subsequent valid packet (`packet_receive_success` increments again). +- **Sometimes(framing_errors > 0)** and **Sometimes(*_decode_failed > 0)**: prove the adversarial + input actually reached the error paths, not that the workload was too benign. + +## SUT-side instrumentation needs +- A panic during parse is invisible to a workload-only checker until the process dies; pair a panic + hook / `assert_unreachable` at the codec sites with workload-side liveness. `framing_errors` and + `*_decode_failed` counters already exist for the `Sometimes` reachability anchors. + +## Open questions +- **What should the codec error policy be for unknown/trailing chunks** (the four TODOs)? Currently + silently permissive. If the policy changes to "error," more inputs become `Err` (still no crash) but + more packets are dropped — affects the data-loss family, not this no-crash property. Worth resolving + before finalizing the assertion so the test's expected-drop accounting is stable. +- **Does a malformed packet ever cause a *partial* dispatch** that mis-routes remaining events (SUT §7 + #6, mod.rs dispatch-error swallow)? That is a separate routing-correctness property; confirm it is + not conflated with no-crash. +- **Is there a max frame/datagram size that, when exceeded, the framer handles gracefully on every + transport** (vs. only connectionless)? 
+ Confirm TCP oversized-frame handling does not wedge the + connection (it `break`s, but verify no resource leak per closed connection). diff --git a/test/antithesis/scratchbook/properties/malformed-event-sc-no-crash.md b/test/antithesis/scratchbook/properties/malformed-event-sc-no-crash.md new file mode 100644 index 00000000000..749ccd933da --- /dev/null +++ b/test/antithesis/scratchbook/properties/malformed-event-sc-no-crash.md @@ -0,0 +1,99 @@ +--- +slug: malformed-event-sc-no-crash +title: Malformed DSD event / service-check payloads never crash process or socket +type: Safety +priority: High +status: net-new (no SDK assertion exists) +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +--- + +# malformed-event-sc-no-crash + +## Origin +Coverage gap: the existing catalog's untrusted-input property `malformed-dsd-no-crash` +(`property-catalog.md` Category E) is scoped to the **metric** codec + framing. It does NOT +exercise the two separate, always-on event / service-check codecs that parse untrusted bytes +on every DSD listener whenever DogStatsD is enabled (`run.rs:681-684`). These codecs are +~394 LOC (`event.rs`) and ~312 LOC (`service_check.rs`) of net-new nom parsing with their own +length-prefix, UTF-8, timestamp, and extension-chunk logic that the metric path never touches. + +## Code paths (file:line) +- Event codec entry: `lib/saluki-io/src/deser/codec/dogstatsd/event.rs:31` `parse_dogstatsd_event`. + - Length-prefixed body: `event.rs:36-49` parses `_e{<title_len>,<text_len>}:` then + `take(title_len)` / `take(text_len)` — **attacker-controlled lengths**. nom `take` on a short + buffer returns `Err`, not a panic (good), but this is the untrusted-length surface to fuzz. + - UTF-8 validation + `.replace("\\n","\n")` allocation: `event.rs:51-59` (per-packet heap alloc + keyed on attacker length — memory-amplification angle under flood). 
+ - Extension chunk loop: `event.rs:79-156` — `all_consuming` sub-parsers for `d:`(timestamp), + `h:`,`k:`,`p:`,`s:`,`t:`,`c:`,`e:`,`card:`,`#tags`. `unix_timestamp` parser at + `event.rs:88`; cardinality at `event.rs:133-139`. Unknown chunks silently skipped + (`event.rs:145-149`, TODO: "throw an error, warn, or be silently permissive?"). +- Service-check codec entry: `lib/saluki-io/src/deser/codec/dogstatsd/service_check.rs:28` + `parse_dogstatsd_service_check`. + - `_sc|<name>|<status>` via `parse_u8` + `CheckStatus::try_from` (`service_check.rs:31-38`). + - Extension loop `service_check.rs:48-104`: `d:`,`h:`,`c:`,`e:`,`#tags`,`m:`(utf8 message),`card:`. + Same silent-skip TODO (`service_check.rs:93-97`). +- Decode dispatch + error counting: `lib/saluki-components/src/sources/dogstatsd/mod.rs:1462-1474` + (`handle_frame` → `codec.decode_packet`); decode failure increments + `event_decode_failed()` / `service_check_decode_failed()` (`mod.rs:1468-1469`, + counters defined `sources/dogstatsd/metrics.rs:58-63`) and returns `Err(ParseError)`. +- Socket-survival mechanism (shared with metrics): connectionless framing/parse errors are + logged + the buffer cleared + loop continues (`sut-analysis.md` §2, `mod.rs:1283-1318`); a bad + event/SC frame must not kill the socket or process. + +## Failure scenario +An adversarial event/SC payload triggers a panic or unbounded resource use in the dedicated codec +(e.g. a parser path the non-exhaustive unit tests miss: pathological length prefixes, invalid UTF-8 +in title/text/message, malformed `d:` timestamp, `card:` parsing, multibyte boundary in +`.replace`). Because these codecs are entirely separate from the metric codec, the existing +`malformed-dsd-no-crash` coverage gives no assurance here. A panic on any DSD listener thread is a +process crash (data-plane components are fail-stop, `sut-analysis.md` §2); a crash-loop under a +malformed-event flood violates the headline "won't crash under load" guarantee. 
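The untrusted-length surface can be illustrated with a standard-library-only sketch of the event header, where the declared title and text lengths are checked against the buffer before any slice is taken. This is not the Saluki parser (which uses nom): `parse_event_header`, `HeaderError`, and the helper functions are hypothetical, and the sketch assumes the DogStatsD layout of a length-prefixed title and text separated by `|`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    BadPrefix,
    BadLength,
    ZeroLength,
    Truncated,
    BadSeparator,
    InvalidUtf8,
}

/// Parses an ASCII decimal length; rejects empty or non-digit input.
fn parse_len(digits: &[u8]) -> Result<usize, HeaderError> {
    if digits.is_empty() || !digits.iter().all(|b| b.is_ascii_digit()) {
        return Err(HeaderError::BadLength);
    }
    std::str::from_utf8(digits)
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .ok_or(HeaderError::BadLength)
}

/// Splits off `n` bytes, returning an error (not panicking) on a short buffer.
/// Nothing is allocated based on the declared length.
fn take(input: &[u8], n: usize) -> Result<(&[u8], &[u8]), HeaderError> {
    if n > input.len() {
        return Err(HeaderError::Truncated);
    }
    Ok(input.split_at(n))
}

/// Returns (title, text, remaining bytes) for a frame shaped like
/// `_e{<title_len>,<text_len>}:<title>|<text>...`.
pub fn parse_event_header(input: &[u8]) -> Result<(&str, &str, &[u8]), HeaderError> {
    let rest = input.strip_prefix(b"_e{").ok_or(HeaderError::BadPrefix)?;
    let comma = rest.iter().position(|&b| b == b',').ok_or(HeaderError::BadLength)?;
    let title_len = parse_len(&rest[..comma])?;
    let rest = &rest[comma + 1..];
    let close = rest.iter().position(|&b| b == b'}').ok_or(HeaderError::BadLength)?;
    let text_len = parse_len(&rest[..close])?;
    let rest = &rest[close + 1..];

    // The notes state that zero title or text lengths are rejected.
    if title_len == 0 || text_len == 0 {
        return Err(HeaderError::ZeroLength);
    }

    let rest = rest.strip_prefix(b":").ok_or(HeaderError::BadSeparator)?;
    let (title, rest) = take(rest, title_len)?;
    let rest = rest.strip_prefix(b"|").ok_or(HeaderError::BadSeparator)?;
    let (text, rest) = take(rest, text_len)?;

    let title = std::str::from_utf8(title).map_err(|_| HeaderError::InvalidUtf8)?;
    let text = std::str::from_utf8(text).map_err(|_| HeaderError::InvalidUtf8)?;
    Ok((title, text, rest))
}
```

The property to preserve is the one shown by `take`: a huge declared length produces an error from a bounds check, with no allocation sized by the attacker's number.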
+ +## Observations +- Both codecs return `nom::Err` on bad input rather than panicking in the paths read; no `unwrap`/ + `expect`/`unsafe` was seen in `event.rs` or `service_check.rs` themselves. The risk is (a) shared + helper parsers (`helpers::*` — `unix_timestamp`, `tags`, `cardinality`, `ascii_alphanum_and_seps`, + `local_data`, `external_data`) and (b) the `.replace` allocations under flood. Helpers were not + fully read — see Open Questions. +- `title_len == 0 || text_len == 0` is rejected (`event.rs:44-46`), but huge declared lengths just + fail the `take` — confirm no pre-allocation on the declared length. +- Error policy for unknown trailing chunks is undecided (silently permissive) in BOTH codecs — same + open policy question as the metric path; affects expected-drop accounting, not no-crash. + +## Suggested assertions (MISSING / net-new — no Antithesis SDK in tree per `existing-assertions.md`) +- `Always(process_up)` and `Always(connectionless socket survives a bad event/SC packet)` — extends + the metric-only `malformed-dsd-no-crash` to the event/SC frames; can reuse the same process-up / + socket-survival workload checker but MUST drive event/SC frames specifically. +- SUT-side `Unreachable` at any panic site reachable from `parse_dogstatsd_event` / + `parse_dogstatsd_service_check` and their helpers (none confirmed yet — guards regressions and the + shared-helper risk). +- `Sometimes(event_decode_failed > 0)` and `Sometimes(service_check_decode_failed > 0)` — reachability + anchors proving the malformed-event/SC parse-error paths are actually exercised (avoids vacuity; + these counters already exist at `metrics.rs:58-63`). + +## Config dependencies +- DogStatsD enabled (`data_plane.enabled: true`); events/service_checks sub-pipelines are on by + default (`EnablePayloadsConfiguration` defaults `events: true`, `service_checks: true`, + `sources/dogstatsd/mod.rs:205-221`). 
+- `client_origin_detection` gates the `c:`/`e:`/`card:` extension parsers (`event.rs:122-139`, + `service_check.rs:68-92`); toggling it changes which parser branches untrusted bytes reach. Drive + BOTH settings to cover the gated parsers. + +## SUT-side instrumentation needs +- A process-up / socket-alive workload checker (shared with `malformed-dsd-no-crash`) plus + event/SC-shaped malformed frames in the workload generator. +- The two `Sometimes` anchors read existing decode-failure counters; the `Unreachable` panic guard, + if added, needs an SDK assertion compiled into the codec/helpers (net-new dependency). + +## Open Questions +- Do the shared `helpers::*` parsers (`unix_timestamp`, `tags`, `cardinality`, `local_data`, + `external_data`, `ascii_alphanum_and_seps`, `utf8`) contain any panic/`unwrap`/pre-allocation on + attacker-controlled length? Not yet read — pivotal for whether the `Unreachable` guard is needed. +- Does `take(title_len)`/`take(text_len)` (or the message `utf8` parser) ever pre-allocate on the + declared length before validating the buffer is long enough (a memory-amplification vector under + flood)? +- Is a malformed event/SC frame ever mis-classified by `parse_message_type` (`mod.rs:1466`) such that + the wrong decode-failure counter increments — cosmetic, but affects the `Sometimes` anchor wiring. diff --git a/test/antithesis/scratchbook/properties/mapper-interner-bounded.md b/test/antithesis/scratchbook/properties/mapper-interner-bounded.md new file mode 100644 index 00000000000..2a1bb6ed4ac --- /dev/null +++ b/test/antithesis/scratchbook/properties/mapper-interner-bounded.md @@ -0,0 +1,104 @@ +# mapper-interner-bounded + +## Origin + +Coverage-gap analysis. The catalog's `interner-full-bounded` covers the **DSD source's** context +interner. 
The `dogstatsd_mapper` carries a **second, independent** string interner (a whole separate +`ContextResolver` built inside the mapper, default 64 KiB) that interns the *mapped* names and the +*expanded* tags. When that interner is full, the mapper's `try_map` returns `None` and the metric is +**silently left un-remapped** — it flows downstream under its *original* (unmapped) name/tags. This is +a distinct, uncovered silent-failure surface: a second bounded resource with its own +saturation-drop behavior, on a transform that claims Agent equivalence. + +## Code paths + +- `lib/saluki-components/src/transforms/dogstatsd_mapper/mod.rs` + - Interner built at `mod.rs:158-167`: + `ContextResolverBuilder::from_name("…/dsd_mapper/primary")…with_interner_capacity_bytes(64 KiB + default)…build()`. **It does NOT call `with_heap_allocations(false)`**, so heap fallback defaults + to `true` (`resolver.rs:258`). Default size = `default_context_string_interner_size` = `ByteSize::kib(64)` + (`mod.rs:34-36`), key `dogstatsd_mapper_string_interner_size` (`mod.rs:51-55`). + - Slow-path resolve: `self.context_resolver.resolve_with_origin_tags(new_name.as_str(), merged_tags, + origin_tags.clone())?` (`mod.rs:317-321`). The trailing `?` means **when resolution returns + `None`, `try_map` returns `None`** → the caller does not replace the context. + - Cache-hit path resolves too: `resolve_with_origin_tags(result.name.clone(), merged_tags, …)` + (`mod.rs:277-282`) — returns the `None` directly. So even a cached positive result can fail to + materialize a context if the interner is full at apply time. + - Caller: `DogStatsDMapper::transform_buffer` (`mod.rs:388-398`) — `if let Some(new_context) = + try_map(...) { *metric.context_mut() = new_context; }`. **No `else`** → on `None` the metric keeps + its original context silently. No drop, no `events_discarded`, no dedicated counter. 
+- Resolution `None` semantics: `resolve_inner` → `create_context` returns `None` when name/tag + interning fails and heap is disallowed (`resolver.rs:436-483`, name interned at the + `try_intern…or_else(allow_heap_allocations.then(...))` site `resolver.rs:346-349`). + +## Failure scenario + +Two distinct modes: + +1. **Default config (heap fallback ON):** under a high-cardinality flood of mappable names, the mapper + interner never returns `None` but spills mapped names/tags to the heap — the mapper's declared + 64 KiB bound is voided and memory grows unbounded (parallels `interner-full-bounded`'s + default-config failure, but for a *second* interner the firm bound accounts for at + `mod.rs:374-375`). +2. **Heap fallback OFF (if the operator disables it):** the mapper interner fills; `try_map` returns + `None`; the metric is **forwarded under its original, unmapped name/tags**. Downstream filters + (`dsd_prefix_filter`, `dsd_tag_filterlist`, `dsd_post_agg_filter`) then make decisions on the wrong + name, and the customer sees the pre-mapping identity — a silent correctness divergence from the + Agent, not just a dropped metric. Behavior is non-deterministic across the cardinality of *mapped* + strings, independent of the source interner. + +## Property + +- **Type:** Safety. Heap-OFF: the silent-non-remap is a correctness hazard to surface. Heap-ON + (default): bounded-memory claim fails by design (mirrors `interner-full-bounded`). +- **Invariant:** + - Heap-OFF: `AlwaysOrUnreachable(mapper interner full ⇒ metric forwarded UNDER ORIGINAL context, + accounted)` — i.e. the silent-non-remap must be observable/counted, never a silent partial-map. + `Sometimes(mapper resolve == None)` proves exhaustion is reached. + - Heap-ON (default): `Sometimes(mapper intern heap fallback > 0)` proves the unbounded spill path is + reachable for the *mapper's own* interner. 
+ - SUT-side instrumentation required to distinguish mapper-interned / heap-fallback / resolve-None / + forwarded-original — none of these has a metric today (the firm-bound accounting at + `mod.rs:367-382` is a static declaration, not a runtime counter). +- **Antithesis angle:** small `dogstatsd_mapper_string_interner_size` + a flood of *distinct mappable* + names (each expands to a unique mapped name + tags) fills the mapper interner specifically; combine + with the source-interner flood (`interner-full-bounded`) to show the two interners saturate + independently. Timing/scheduling exploration stresses the resolver under churn (idle-context + expiration is 30s, `mod.rs:166`). +- **Priority:** High. + +## Config dependencies + +- `dogstatsd_mapper_string_interner_size` (default 64 KiB) — shrink to force exhaustion cheaply. +- `dogstatsd_allow_context_heap_allocs` — note this is the **DSD source** key; the mapper interner + does **not** read it (it never sets `with_heap_allocations`), so the mapper always defaults to + heap-ON unless the resolver default changes. Confirm there is no separate mapper heap flag (there is + not in current source). This asymmetry is itself a finding. +- `dogstatsd_mapper_profiles` must be set (a profile must match) for the mapper interner to be + exercised at all. +- `dogstatsd_mapper_cache_size` (default 1000): a cached positive result still re-resolves + (`mod.rs:277-282`), so the interner can fail even on a cache hit. + +## Open Questions + +- Is the mapper's lack of a `with_heap_allocations(false)` option intentional, or an oversight that + makes the mapper interner's declared 64 KiB firm bound (`mod.rs:374-375`) unenforceable under + default behavior? `(needs human input)` +- The cache stores the mapped `name`/`extra_tags` but resolution can still fail at apply time + (`mod.rs:277-282`): does that mean a metric can be remapped on one call and silently NOT remapped on + the next, for the *same* name, purely due to interner pressure? 
That would be a non-deterministic, + load-dependent identity flip — needs confirmation under churn. +- Does the Datadog Agent mapper have an analogous bounded interner with the same drop-to-original + behavior, or does it always allocate? Determines whether heap-OFF behavior is an ADP-specific + divergence (ties to `mapper-output-matches-agent`). + +### Investigation Log + +- Examined: `dogstatsd_mapper/mod.rs` build (`158-183`), `try_map` (`259-342`), `transform_buffer` + (`388-398`), `specify_bounds` (`367-382`); `resolver.rs:251-299,334-349,436-483` for the resolver's + default `allow_heap_allocations=true` and the `Option` `None` path. +- Found: the mapper instantiates a fully separate `ContextResolver` with its own 64 KiB interner and + the default heap-ON behavior; on resolve-`None` the metric is silently forwarded under its original + context with no counter. This is a genuinely second interner-full surface, distinct from + `interner-full-bounded` (DSD source) — different resource, different downstream consequence + (silent non-remap vs. dropped context). diff --git a/test/antithesis/scratchbook/properties/mapper-output-matches-agent.md b/test/antithesis/scratchbook/properties/mapper-output-matches-agent.md new file mode 100644 index 00000000000..121d13440f5 --- /dev/null +++ b/test/antithesis/scratchbook/properties/mapper-output-matches-agent.md @@ -0,0 +1,98 @@ +# mapper-output-matches-agent + +## Origin + +Coverage-gap analysis: the existing 27-property catalog frames ADP as a *transport* and a single +aggregation *transformer* (`aggregate-matches-agent`), but the DogStatsD transform chain has four +additional correctness-affecting transforms that all claim Datadog-Agent equivalence and none has a +property. The `dogstatsd_mapper` is the most complex: it rewrites the metric **name** and injects new +**tags** by expanding regex capture groups, with its own result cache and its own string interner. 
+A divergence from the Agent's mapper is silent, customer-visible data corruption (wrong metric name, +wrong/missing tags) that the happy-path `panoramic` diff suite does not target as a mapper-specific +case. + +## Code paths + +- `lib/saluki-components/src/transforms/dogstatsd_mapper/mod.rs` + - `MetricMapper::try_map` (`mod.rs:259-342`) — slow path iterates profiles, runs each `Regex`, + on a match clears `new_name` and calls `captures.expand(&mapping.name, &mut new_name)` + (`mod.rs:298-299`), then for each configured tag does + `captures.expand(tag_value_expr, &mut expanded_tag_value)` (`mod.rs:302-309`). + - Profile selection: `metric_name.starts_with(&profile.prefix) || profile.prefix == "*"` + (`mod.rs:292`); **first matching profile + first matching mapping wins** (tests + `test_wildcard_prefix_order` `mod.rs:781`, `test_multiple_profiles_order` `mod.rs:821`). + - Wildcard→regex compilation: `build_regex` (`mod.rs:186-215`) escapes `.` → `\.`, turns `*` → + `([^.]*)`, anchors `^…$`; rejects chars outside `[a-zA-Z0-9\-_*.]` and consecutive `**`. + - Existing tags are preserved and merged with expanded tags (`merge_shared`, `mod.rs:314-315`; + test `test_retain_existing_tags` `mod.rs:888`). +- Pipeline placement (`bin/agent-data-plane/src/cli/run.rs:640-641,674-675`): the mapper is the + *first* transform in the `dsd_enrich` chained transform, ahead of `dsd_prefix_filter`, + `dsd_tag_filterlist`, `dsd_agg`, `dsd_post_agg_filter`. So a mapper rename changes which + prefix/blocklist/filterlist rules subsequently apply — a mapper bug cascades into every downstream + filter decision. +- Agent reference: the Datadog Agent mapper + (`pkg/dogstatsd/mapper/mapper.go`) is the equivalence target; the wildcard `([^.]*)` translation + and `$1`/`${1}` expansion syntax mirror Agent behavior (tests `test_use_regex_expansion_alternative_syntax`, + `test_expand_name`). 
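The wildcard-to-regex translation described for `build_regex` can be sketched as a pure string transformation, which is also a convenient shape for generating a mapper-exercising corpus. This is an illustration of the rules as stated in these notes, not the SUT's function: `wildcard_to_regex` and its error strings are hypothetical, and it produces only the pattern text without compiling a regex.

```rust
/// Translates a wildcard match pattern into an anchored regex pattern string.
/// Rules taken from the notes: `.` becomes `\.`, `*` becomes `([^.]*)`, the
/// result is anchored with `^` and `$`, `**` is rejected, and only characters
/// in `[a-zA-Z0-9\-_*.]` are allowed.
pub fn wildcard_to_regex(pattern: &str) -> Result<String, String> {
    if pattern.contains("**") {
        return Err("consecutive `*` is not allowed".to_string());
    }

    let mut out = String::from("^");
    for ch in pattern.chars() {
        match ch {
            '.' => out.push_str("\\."),
            '*' => out.push_str("([^.]*)"),
            c if c.is_ascii_alphanumeric() || c == '-' || c == '_' => out.push(c),
            other => {
                return Err(format!(
                    "character {:?} is not allowed in a wildcard match",
                    other
                ))
            }
        }
    }
    out.push('$');
    Ok(out)
}
```

For example, `test.job.duration.*.*` yields two capture groups, which a mapping's name and tag templates can then reference as `$1` and `$2`.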
+ +## Failure scenario + +A mapping profile is configured (statically or pushed at runtime via the config stream). For an input +metric name, ADP's mapper produces a different `(name, tags)` than the Datadog Agent mapper would — +e.g. different capture-group expansion for overlapping wildcards, different first-match selection +across profiles, different handling of a name that matches the wildcard char class but not the +Agent's, or a tag injected/dropped where the Agent would do the opposite. The metric is then +aggregated and forwarded under the wrong identity. This is silent: no error, no drop counter; the +customer sees a metric that does not match the Agent's output for the same workload + mapper config. + +## Property + +- **Type:** Safety (differential). +- **Invariant:** Harness/diff-side `Always(mapped (name, tags) within ratio of Agent mapper output)` + per flush window, anchored on the existing `panoramic`/`stele` diff harness but with a + **mapper-exercising corpus** (millstone names crafted to hit the configured profiles) and an + identical `dogstatsd_mapper_profiles` config on both the Agent baseline and ADP. Pair with + `Sometimes(mapper remapped a metric)` so the diff is not vacuous (the corpus actually triggers + remapping). A SUT-side `Sometimes(cache hit returned same result as a fresh miss)` localizes the + result-cache correctness facet. +- **Antithesis angle:** Differential equivalence under (a) overlapping/ambiguous profiles that probe + first-match ordering, (b) names at the wildcard char-class boundary, (c) the same config delivered + at runtime over the config stream vs. statically, and (d) fault-induced flush-timing skew (the + `panoramic` harness alone runs one deterministic order; faults explore reordering). Run as the + Add-on 2 diff topology with a `dogstatsd_mapper_profiles` config on both sides. +- **Priority:** High. 
+ +## Config dependencies + +- `dogstatsd_mapper_profiles` (JSON array of `{name, prefix, mappings:[{match, match_type, name, tags}]}`) + must be set identically on the Agent baseline and ADP. +- `dogstatsd_mapper_cache_size` (default 1000) — exercise both cache-on and cache-off (`0`) to cover + the cache path vs. the slow path returning the same result. +- `dogstatsd_mapper_string_interner_size` (default 64 KiB) — interacts with + `mapper-interner-bounded`; keep generous here so interner exhaustion does not confound the diff. + +## Open Questions + +- Does the Datadog Agent mapper apply **all** matching mappings within a profile, or only the first? + ADP returns on the first matching mapping (`mod.rs:332`). If the Agent differs, this is itself a + bug, not just a test-setup detail. `(needs Agent-source confirmation)` +- Does the Agent restrict wildcard match characters to the same `[a-zA-Z0-9\-_*.]` class + (`ALLOWED_WILDCARD_MATCH_PATTERN`, `mod.rs:31-32`)? A name the Agent maps but ADP's class rejects + at *config-load* (build error) vs. *match-miss* changes the observable divergence. +- Is `FLUSH_WAIT`≈32s on both sides enough once faults delay flushes (timing-artifact false diffs)? + Same caveat as `aggregate-matches-agent`. +- Can the config stream actually push `dogstatsd_mapper_profiles` at runtime, and does the mapper + rebuild on that key? (The mapper has **no `watch_for_updates`** — see Open Questions of + `filter-config-reload-correct`; the mapper appears static-only, unlike the filters.) Determines + whether the runtime-config facet of this property is reachable. + +### Investigation Log + +- Examined: full `dogstatsd_mapper/mod.rs` incl. all unit tests; `run.rs:638-679` pipeline wiring; + `resolver.rs:436-483` for the `Option` resolution semantics. 
+- Found: mapper is a `SynchronousTransform` (`mod.rs:388-398`) with first-match-wins selection and + capture-group expansion; equivalence to the Agent is claimed via mirrored expansion syntax and the + wildcard translation, but there is **no differential test** against the Agent — only self-consistent + Rust unit tests. +- Note: the mapper has no config-stream watcher, so unlike the filters it is configured once at build; + the runtime-config facet is likely Unreachable for the mapper specifically (flag for the team). diff --git a/test/antithesis/scratchbook/properties/memory-limiter-survives-rss-read-failure.md b/test/antithesis/scratchbook/properties/memory-limiter-survives-rss-read-failure.md new file mode 100644 index 00000000000..a0998f6ffba --- /dev/null +++ b/test/antithesis/scratchbook/properties/memory-limiter-survives-rss-read-failure.md @@ -0,0 +1,105 @@ +# memory-limiter-survives-rss-read-failure + +**Family:** Resource Boundaries — memory / fault tolerance +**Status:** Verified against code at commit 042f41db3b. Property is expected to **FAIL by design**: +an RSS read failure mid-run panics the limiter thread and silently freezes backpressure. Needs +SUT-side instrumentation to express well. + +## What led to the property + +`sut-analysis.md` §4 and §7 flag the limiter checker thread's `.expect()` on the RSS read as a +fail-open hazard: if `/proc` reads start failing mid-run (transient `/proc` unavailability, fault +injection, namespace/cgroup churn), the dedicated checker thread panics and dies. Memory +protection — already the only runtime memory mechanism — silently vanishes, and the +last-written backoff value is **frozen** in the `AtomicU64` forever. 
+ +## Behavior in code + +`MemoryLimiter::new` smoke-tests RSS availability once at construction +(`lib/resource-accounting/src/limiter.rs:43-44`: `Querier::default().resident_set_size()?` — if +unavailable at startup, returns `None` and the caller falls back to a noop limiter via +`accounting.rs:175-177`). But the **steady-state checker loop** does not tolerate later failures +(`limiter.rs:99-122`): +```rust +loop { + let actual_rss = querier + .resident_set_size() + .expect("memory statistics should be available"); // <-- panics the thread mid-run + let maybe_backoff_duration = calculate_backoff(...); + match maybe_backoff_duration { + Some(d) => active_backoff_nanos.store(d.as_nanos() as u64, Relaxed), + None => active_backoff_nanos.store(0, Relaxed), + } + std::thread::sleep(Duration::from_millis(250)); +} +``` +Consequences when `resident_set_size()` returns `None` mid-run: +1. `.expect()` panics. The thread is a bare `std::thread` (`limiter.rs:54-65`), not a supervised + task and not in the data-topology JoinSet — its death does **not** trigger the process-wide + shutdown that data-component exits cause. The process keeps running. +2. The loop stops updating `active_backoff_nanos`. Whatever value was last stored is **frozen**: + - If it was 0 (RSS was below threshold when reads failed), backpressure is permanently off — + fail-open, no protection. Cooperating tasks (`wait_for_capacity`, `limiter.rs:83-88`) never + wait again even as RSS climbs. + - If it was a nonzero backoff, that exact backoff is applied forever regardless of actual RSS + — including after RSS would have dropped, needlessly throttling the source indefinitely. +3. No telemetry or log surfaces the thread death; `memory_limiter.current_backoff_secs` + (`limiter.rs:111,116`) simply stops updating. Observability goes stale silently. 
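For contrast with the consequences above, a non-panicking checker step might look like the following. This is a sketch of one possible remediation, not the current SUT behavior: `CheckerState`, `checker_step`, `placeholder_backoff`, `FALLBACK_BACKOFF`, and the `rss_read_failures` counter are all hypothetical, and the real `calculate_backoff` logic is replaced by a trivial threshold.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical conservative backoff kept in force while RSS is unreadable.
pub const FALLBACK_BACKOFF: Duration = Duration::from_millis(1);

pub struct CheckerState {
    pub active_backoff_nanos: AtomicU64,
    /// Hypothetical telemetry counter; no such metric exists in the SUT today.
    pub rss_read_failures: AtomicU64,
}

impl CheckerState {
    pub fn new() -> Self {
        CheckerState {
            active_backoff_nanos: AtomicU64::new(0),
            rss_read_failures: AtomicU64::new(0),
        }
    }
}

/// Placeholder for the real backoff calculation: throttle only above the limit.
fn placeholder_backoff(actual_rss: u64, limit_bytes: u64) -> Duration {
    if actual_rss > limit_bytes {
        FALLBACK_BACKOFF
    } else {
        Duration::ZERO
    }
}

/// One iteration of the checker loop. A failed read is counted and protection
/// is kept at least at the fallback level instead of panicking the thread.
pub fn checker_step(state: &CheckerState, rss: Option<u64>, limit_bytes: u64) {
    match rss {
        Some(actual) => {
            let backoff = placeholder_backoff(actual, limit_bytes);
            state
                .active_backoff_nanos
                .store(backoff.as_nanos() as u64, Ordering::Relaxed);
        }
        None => {
            state.rss_read_failures.fetch_add(1, Ordering::Relaxed);
            // Never lower protection on a failed read: keep the larger of the
            // last known backoff and the fallback.
            state
                .active_backoff_nanos
                .fetch_max(FALLBACK_BACKOFF.as_nanos() as u64, Ordering::Relaxed);
        }
    }
}
```

Whether "retain a safe backoff" or "fail loudly and restart" is the right policy is an open design question; the sketch only shows that the failure can be made observable without freezing the stored value at 0.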
+ +So the property — "memory protection remains active (or the failure is surfaced) when RSS reads +fail" — is violated: protection silently freezes and the failure is not surfaced. + +## Failure scenario (Antithesis) + +Run with the limiter enabled (`memory_mode: permissive|strict` + `memory_limit` set, +`enable_global_limiter: true`). Use Antithesis fault injection to make RSS reads fail mid-run +(e.g. interfere with `/proc/self/statm` or the platform stat source the `process_memory::Querier` +uses) while a load generator pushes RSS toward the limit. Observe that the checker thread dies, +backoff freezes, and no error is surfaced. The race that makes the freeze damaging — reads fail +*before* RSS crosses the threshold, leaving backoff at 0 — is exactly the kind of timing-ordering +Antithesis explores and the deterministic harness cannot. + +## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist) + +This property needs SUT-side instrumentation; a workload-only checker can only see "RSS grew +unbounded after a fault," which is indistinguishable from the off-by-default limiter case. + +- `Unreachable` on the panic path: wrap the RSS read so the `.expect()` site is replaced with a + branch that, if it would have panicked, fires `assert_unreachable("limiter RSS read failed — + protection lost")`. The panic-the-checker-thread state is a critical-failure state that should + never be observed. (Today it IS observed — that is the finding; the assertion makes it a + reportable property rather than a silent thread death.) +- Alternatively, if the fix is to surface-and-continue: `Sometimes(rss_read_failed_and_surfaced)` + on a new error-reporting path (logged + telemetry incremented + protection conservatively kept + active, e.g. retain a safe backoff). `Sometimes` because the failure is a rare optional path we + want to prove is *handled* at least once, not an always-true invariant. 
+- Anchor a `Sometimes(active_backoff_nanos updated within last N ms)` liveness check to detect a + frozen/dead checker — proves the limiter is still doing work, not stuck. + +Honest framing: today there is no surfaced-error path, so the realistic immediate assertion is +the `Unreachable` on the panic site (which will fire), documenting the fail-open. The +`Sometimes(surfaced)` form presupposes an SUT-side fix and should be tagged as fix-dependent. + +## Configuration dependencies + +- Requires the limiter to actually be running: `memory_mode != disabled`, `memory_limit` set, + `enable_global_limiter: true`. Under the default `disabled` mode the limiter is a noop with no + checker thread, so this failure mode does not even exist (a separate, larger problem — no + protection at all; see `rss-bounded-under-cardinality`). +- Platform: `process_memory::Querier` backing source (e.g. `/proc` on Linux) determines what + "RSS read failure" means and how to inject it. + +## Open questions + +- Can `process_memory::Querier::resident_set_size()` actually return `None`/error *after* + succeeding once at startup, on the Antithesis Linux target? If the underlying read effectively + cannot fail post-startup on this platform, the panic is only reachable via injected `/proc` + corruption — which determines whether this is a realistic production risk or a fault-injection- + only curiosity. This is the pivotal question for priority. +- Is the frozen backoff value more likely 0 (fail-open, no protection) or nonzero (fail-stuck, + over-throttle) in practice? Both are bugs but with opposite symptoms; determines which + assertion/observable to lead with. +- Should the correct behavior be "keep last-known protection" or "fail loudly and shut down"? + Given data components are fail-stop and the container s6 supervisor restarts ADP on exit, a + loud crash might be *safer* than silent freeze. 
The intended remediation changes whether the + property is framed as Unreachable(panic) or Reachable(clean restart). diff --git a/test/antithesis/scratchbook/properties/no-silent-interconnect-drop.md b/test/antithesis/scratchbook/properties/no-silent-interconnect-drop.md new file mode 100644 index 00000000000..0e04f13fc36 --- /dev/null +++ b/test/antithesis/scratchbook/properties/no-silent-interconnect-drop.md @@ -0,0 +1,114 @@ +--- +slug: no-silent-interconnect-drop +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: No silent inter-component drop on a correctly-wired edge + +## Origin +SUT analysis §4 ("Backpressure is real and is the load-safety mechanism") and §5 +safety guarantee #1. Decomposed from the headline guarantee "ADP will not crash +under load, losing customer data" — the *no silent loss* half. No Antithesis +assertion exists (existing-assertions.md). + +## What the code does + +### Backpressure (the await-on-full path) +`lib/saluki-core/src/topology/interconnect/dispatcher.rs`: +- `DispatchTarget::send` (~86-123): when senders are present, it sends to all but + the last sender via `sender.send(item.clone()).await` (~99-104) and the last via + `last_sender.send(item).await` (~107-111). `mpsc::Sender::send().await` **blocks + when the channel is full** — this is the backpressure. It returns `Err` only if + the receiver has been dropped (channel closed), mapped to a `GenericError`. +- All edges are bounded `tokio::mpsc` (SUT analysis §2/§4). A slow downstream stalls + the upstream send, which (in the DSD source) stalls the read loop: + `sources/dogstatsd/mod.rs:1186` `memory_limiter.wait_for_capacity().await` plus the + socket read in the same `'read` loop, so backpressure propagates to the socket. 
+ +### Silent discard (the disconnected-output path) +`dispatcher.rs:86-92`: when `self.senders.is_empty()`, `send` increments +`events_discarded_total` by `item.item_count()` and returns `Ok(())` — the events +are **dropped silently** (no error, no backpressure). This only happens for an +output with **zero** connected senders (a disconnected/un-wired output). + +### Existing unit test (not an Antithesis assertion) +`sources/dogstatsd/mod.rs:2040-2063` `packet_forwarder_waits_when_queue_is_full`: +fills `FORWARDER_QUEUE_CAPACITY` then asserts a further `forward()` does NOT complete +within 100ms ("forwarding should wait for queue capacity instead of dropping"). +Confirms the intended backpressure (await, not drop) behavior at the statsd-forward +boundary specifically. + +## Failure scenario (Antithesis) +Sustained DSD load + a deliberately slow downstream consumer (e.g. throttle the +forwarder/intake so the encoder→forwarder edge and then all upstream edges fill). +Expectation: events are queued/awaited (backpressure to the socket), NOT discarded. +On a correctly-wired edge `events_discarded_total` must stay at 0; the only visible +effect is rising latency / falling socket read rate. + +## Key observations +- The discard path and the backpressure path are mutually exclusive and chosen purely + by `senders.is_empty()`. So the safety statement is precise: **a wired edge never + discards; only a zero-sender edge discards.** +- Partial-delivery hazard (SUT §4): send to N senders is sequential and not atomic — + if a *later* sender errors (receiver dropped) after earlier clones already sent, the + earlier sends are not rolled back. This is a *connection-closed* (shutdown/teardown) + case, not a full-channel case, so it does not contradict the no-discard-under-load + property but should be excluded from the assertion window (see Open Questions). 
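The two branches and the partial-delivery hazard can be modeled in a few lines. This is a simplified sketch, not the real `DispatchTarget`: the `Target` type, its `Vec<u64>` payload, the `String` error, and the `wired` helper are all hypothetical, and blocking std channels stand in for `tokio::mpsc`.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical stand-in for a dispatch target: bounded senders plus a discard counter.
struct Target {
    senders: Vec<SyncSender<Vec<u64>>>,
    events_discarded_total: u64,
}

impl Target {
    /// Zero senders: count the items and return Ok (the silent-discard branch).
    /// Otherwise: send a clone to every sender but the last, then move the item into the last.
    fn send(&mut self, item: Vec<u64>) -> Result<(), String> {
        match self.senders.split_last() {
            None => {
                self.events_discarded_total += item.len() as u64;
                Ok(())
            }
            Some((last, rest)) => {
                for sender in rest {
                    // Blocks while the channel is full; errors only if the receiver is gone.
                    sender
                        .send(item.clone())
                        .map_err(|_| "receiver dropped".to_string())?;
                }
                last.send(item).map_err(|_| "receiver dropped".to_string())
            }
        }
    }
}

/// Builds a target with `n` connected outputs of the given capacity (helper for the sketch).
fn wired(n: usize, capacity: usize) -> (Target, Vec<Receiver<Vec<u64>>>) {
    let (senders, receivers): (Vec<_>, Vec<_>) =
        (0..n).map(|_| sync_channel(capacity)).unzip();
    (Target { senders, events_discarded_total: 0 }, receivers)
}
```

In this model a target built with `wired(0, _)` increments the counter and still returns `Ok(())`, while a wired target never touches the counter. If the second of two receivers is dropped before a send, the call returns `Err` even though the first receiver already holds its clone, which is the not-rolled-back case noted in the key observations.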
+ +## Config deps +- Channel/`interconnect_capacity` default still unread (SUT §9 open question) — sets how + much buffering exists before backpressure engages; affects timing, not correctness. +- The discard path is reachable only by a topology wiring with an unconnected output. + In the production DSD blueprint all three DSD outputs are wired, so on the production + path the discard branch should be Unreachable under load. + +## Suggested assertion (MISSING — net-new) +- **Always** on the wired edge: at every check, `events_discarded_total` for a connected + output does not increase under sustained load (i.e. delta == 0). Anchor on the + `events_discarded_total` counter scoped per output. Safety, every-check. +- **Sometimes(backpressure engaged)**: at least once, the source read loop is observed + blocked on a full downstream channel (meaningful progress into the throttled state) — + proves the fault actually exercised backpressure rather than the load being too light. + Could be read from rising send_latency_seconds or a workload-side stall signal. + +## SUT-side instrumentation needs +- `events_discarded_total` is already emitted; a workload-side checker can read it via + the telemetry endpoint. For a crisp `Always`, an SDK `assert_always` at the discard + site (`dispatcher.rs:90`) gated to "output has a name on the production DSD path" would + fire only on the must-never branch — but note the discard branch is *legitimately* + reachable for genuinely disconnected outputs, so a blanket `assert_unreachable` there + is wrong. Prefer reading the counter from the workload for wired edges. + +## Open questions +- **Does any production DSD output ever legitimately have zero senders?** If a named output + (e.g. 
`dsd_debug_log_out` when debug logging disabled, or `dsd_stats_out`) is conditionally + unwired, the discard path is reachable by config and the `Always(delta==0)` must be scoped + to the always-wired outputs (`metrics`, `events`, `service_checks`, `dd_out` chain). If all + outputs are always wired, the assertion can cover every edge. +- **Should partial-delivery on receiver-drop be excluded?** Yes during teardown; confirm the + assertion window ends at shutdown signal so the not-atomic multi-sender path (a closed + channel, not a full one) does not produce false positives. + +## Investigation Log + +#### Default `interconnect_capacity` (bounded mpsc size on topology edges) +- **Examined**: `lib/saluki-core/src/topology/mod.rs:37`; `blueprint.rs:56,76,87-88,94`; + `built.rs:73,431-433,651-661` (channel construction); searched `bin/agent-data-plane` and + `lib/saluki-app` for `with_interconnect_capacity` / overrides. +- **Found**: `const DEFAULT_INTERCONNECT_CAPACITY: NonZeroUsize = NonZeroUsize::new(128).unwrap();` + (`mod.rs:37`). `TopologyBlueprint::new` seeds `interconnect_capacity` to this default + (`blueprint.rs:76`). Each non-source event/payload edge builds a `mpsc::channel(interconnect_capacity.get())` + (`built.rs:653,661`), so every interconnect channel is bounded at **128** entries by default. + `with_interconnect_capacity()` exists (`blueprint.rs:87`) but no call site sets it outside the + unit test at `built.rs:717` (capacity 10). No ADP/app config overrides it. +- **Not found**: No runtime/config knob exposing `interconnect_capacity`; it is a hardcoded + compile-time default with only a programmatic setter that production code never calls. +- **Conclusion**: RESOLVED. Default interconnect capacity = **128** events/payloads per edge. + Note this is a count of *event buffers* (`EventsBuffer`), not individual metrics, so the burst + absorbed before backpressure is 128 buffers per downstream component. 
Property framing unchanged; + this only sizes workload load to reach the full-channel state. diff --git a/test/antithesis/scratchbook/properties/non-finite-values-handled-consistently.md b/test/antithesis/scratchbook/properties/non-finite-values-handled-consistently.md new file mode 100644 index 00000000000..bf202ae031d --- /dev/null +++ b/test/antithesis/scratchbook/properties/non-finite-values-handled-consistently.md @@ -0,0 +1,138 @@ +--- +slug: non-finite-values-handled-consistently +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: Medium +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Non-finite metric values are handled consistently, never crash, no ghost metric + +## Origin +SUT analysis §7 #7 ("non-finite metric values silently dropped"), #10 ("NaN poisons a +DDSketch … finiteness guarded per-source, fragile if a new producer is added"), #11 +("All-non-finite packet → ghost metric with a valid context but zero data points"). No +Antithesis assertion exists (existing-assertions.md). + +## What the code does +`lib/saluki-io/src/deser/codec/dogstatsd/metric.rs`: +- `FloatIter::next` (286-307): parses `:`-delimited values; `Ok(value) if value.is_finite()` + → yields the value; `Ok(_)` (non-finite NaN/±Inf) → `debug!("Dropping non-finite …")` and + **loops to the next value** (does not yield, does not error); a true parse failure → `Err`. +- `metric_values_from_raw` (250-271): `num_points` is incremented via + `FloatIter::new(input).inspect(|_| num_points += 1)` (253-254) — `inspect` fires only for + *yielded* items, so **non-finite values do not increment `num_points`**. An all-non-finite + packet → `num_points == 0` and the metric-values constructor (e.g. `gauge_fallible(floats)`, + 258) receives an empty iterator, producing an empty `MetricValues` ("ghost" shape: valid + name/type, zero points). 
+- Verified by `metric.rs::non_finite_metric_values_are_dropped` (743-759): asserts + `packet.num_points == 0` for `NaN|g`, `inf|g`, `-inf|g`. + +`lib/saluki-components/src/sources/dogstatsd/mod.rs::handle_frame` (1458-1542): +- **1478-1480**: `if metric_packet.num_points == 0 { return Ok(None); }` — the zero-point + packet is dropped **before** context resolution (`handle_metric_packet`, 1490). So at the + source level, an all-non-finite packet consumes **no** interner/context-cache resources and + produces no downstream event. Confirmed by `non_finite_metric_values_are_silently_dropped` + (mod.rs:2470-2485): "handle_frame then returns Ok(None) for zero-point packets." +- A *partial* packet (`x:NaN:5|g`) → `num_points == 1`, the finite value flows normally and the + non-finite one is dropped — consistent with the Agent. + +## Failure scenario (Antithesis) +Send packets that are entirely non-finite (`m:NaN|g`, `m:inf:nan:-inf|h`, multi-value all-NaN +distributions) and mixed finite/non-finite, across all metric types (c/g/ms/h/d/s). Expectations: +(1) no panic; (2) all-non-finite packets produce **no** downstream metric and consume no +interner/context-cache slot (no ghost metric reaches aggregation); (3) finite values in mixed +packets are preserved exactly; (4) no NaN ever reaches a DDSketch (the §7 #10 poisoning hazard). + +## Key observations +- **The ghost-metric risk is gated at the source today** by the `num_points == 0` short-circuit + (mod.rs:1478). Frame honestly: the property asserts the gate *holds* — an all-non-finite packet + must not create a context/sketch — rather than claiming a ghost metric exists on the DSD path. +- The §7 #10 NaN-poisons-DDSketch hazard is real but currently *prevented* by the per-source + finiteness filter in `FloatIter`. The fragility is structural: finiteness is enforced in the + **codec**, not at the **sketch boundary** (`agent/sketch.rs`). A new producer (e.g. 
OTLP, or a + replay path that bypasses the codec) could feed NaN to a sketch. The property should assert the + invariant *at the sketch boundary* so it's robust to new producers. +- Set metrics (`s`) take a different path (metric.rs:259-265, `num_points = 1` unconditionally, value + is the raw string) — non-finite-ness doesn't apply; exclude sets from the value-finiteness check. + +## Config deps +- `permissive` mode and value parsing don't change finiteness handling. +- The sketch-boundary check matters only for histogram/distribution/timer types (which build sketches); + gauge/counter store raw f64 values. + +## Suggested assertion (MISSING — net-new) +- **Always**(no NaN in a sketch): `assert_always(value.is_finite())` at the DDSketch insert boundary + (`agent/sketch.rs` insert, ~188-206) — generalizes the per-source guard to the sketch itself, robust + to new producers. Catches §7 #10 directly. +- **AlwaysOrUnreachable**(no zero-point metric reaches aggregation): an all-non-finite packet must not + produce a downstream `Event::Metric` with empty values — anchor at handle_frame (mod.rs:1478) / + aggregation insert. If the gate ever lets a zero-point metric through, that's the ghost metric. +- **Sometimes(non_finite dropped)**: at least once, `FloatIter` drops a non-finite value (proves the + adversarial all-NaN input actually exercised the drop path). Meaningful state, not `Sometimes(true)`. +- **Sometimes(ghost-metric path reachable)** — *only if* a producer that bypasses the `num_points==0` + gate is found (e.g. replay/OTLP). On the pure DSD path this is expected **Unreachable**; do not assert + `Sometimes` for it on DSD-only without first confirming a bypass exists (see Open Questions). + +## SUT-side instrumentation needs +- The sketch-boundary `Always(is_finite)` and the zero-point-gate check need SDK assertions inside the + SUT; a workload-only checker sees aggregated output and cannot attribute a NaN sum to a sketch insert. 
+- A `non_finite_dropped` counter (or assertion at metric.rs:301) gives the `Sometimes` reachability anchor. + No such counter exists today — currently only a `debug!` log. + +## Open questions +- **Does `gauge_fallible([])` / the empty-iterator constructors ever return an `Err`** (the `_fallible` + suffix) rather than an empty value? If they error on empty input, the path is even safer + (handle_frame returns `Err`, counted) — confirm the empty-iter behavior. Changes whether num_points==0 + is the only gate. +- **Is `avg`/`sum` on an empty/NaN sketch ever surfaced as a metric** (the §7 #10 "permanently NaN")? Even + with the source guard, confirm a sketch can't reach a NaN aggregate via merge of pre-timestamped/ + passthrough points. Affects whether the sketch-boundary `Always` is sufficient or a merge-time check is + also needed. + +### Investigation Log + +#### Is there any path that builds a metric/sketch from values WITHOUT going through FloatIter? +- **Examined:** all metric/sketch producers and their topology wiring — DSD codec FloatIter + (`lib/saluki-io/src/deser/codec/dogstatsd/metric.rs:254,299`); OTLP source + (`lib/saluki-components/src/sources/otlp/metrics/translator.rs`, incl. `get_number_data_point_value` + :1366, `is_skippable` :1374, `map_number_metrics` :726, the histogram→`Dogsketch::try_from` path + :314-351 and the explicit-bounds `insert_interpolate_buckets` path :889); self-telemetry + (`lib/saluki-core/src/observability/metrics/mod.rs:299-310`); checks_ipc + (`lib/saluki-components/src/sources/checks_ipc/mod.rs:185-204`); the aggregate + histogram→distribution `insert_n` (`lib/saluki-components/src/transforms/aggregate/mod.rs:745`); + the Datadog metrics encoder Histogram→sketch conversion + (`lib/saluki-components/src/encoders/datadog/metrics/mod.rs:1049-1058`); and the ADP topology + (`bin/agent-data-plane/src/cli/run.rs:462-499, 664-686, 745-755`). Full detail captured in + ddsketch-no-nan-poison.md Investigation Log. 
+- **Found:** YES — multiple producers build metrics/sketches without FloatIter, but they fall into + three categories: + - **OTLP — guarded by its OWN finiteness filter.** Number/gauge/counter values pass `is_skippable` + (NaN/Inf skipped, translator.rs:726/754); histogram sketches are built via `Dogsketch::try_from` + (no raw insert) or `insert_interpolate_buckets` (which reconstructs finite `bin_lower_bound` + values before `adjust_basic_stats`). OTLP does not poison sum/avg with NaN. So OTLP does NOT make + the ghost/poison path live. + - **Aggregate transform `insert_n` (mod.rs:745) — DSD-ONLY.** `dsd_agg` is wired only in the DSD + pipeline (run.rs:664-679); checks_ipc and OTLP join at `metrics_enrich`, downstream of `dsd_agg`. + So the aggregate sketch path receives only FloatIter-filtered (finite) values. Not a bypass. + - **checks_ipc Histogram → Datadog metrics encoder — A REAL BYPASS (live).** checks_ipc + (mod.rs:195) builds `Metric::histogram(context, (timestamp, value))` from an external Python + check's raw f64 with **no finiteness check**, and routes `checks_ipc_in.metrics → metrics_enrich + → dd_metrics_encode` (run.rs:469/499) — skipping both FloatIter and the aggregate transform. The + encoder converts the Histogram to a sketch via `ddsketch.insert_n(sample.value...)` + (encoders/datadog/metrics/mod.rs:1054), so a NaN check value poisons the emitted sketch's + sum/avg. (Note: this poisons sum/avg but does not create the *zero-point ghost* shape — the + ghost-metric/`num_points==0` concern is specific to the DSD `FloatIter`+`num_points` interaction + and remains gated on the DSD path at mod.rs:1478.) +- **Not found:** No finiteness guard on the checks_ipc value path; no guard at the sketch boundary; + no third metric ingress that bypasses both FloatIter and a per-source filter. +- **Conclusion:** RESOLVED. The NaN-poison path (#10) is **LIVE** via checks_ipc → Datadog metrics + encoder, independent of the DSD codec. 
The ghost-metric (#11, zero-point) shape is NOT reproduced + by this bypass (it is DSD-`FloatIter`-specific and gated at handle_frame mod.rs:1478). Because the + finiteness invariant is enforced per-producer (DSD FloatIter, OTLP is_skippable) and NOT at the + sketch boundary, the suggested sketch-boundary `assert_always(value.is_finite())` is the + robust, producer-independent assertion and is justified by a concrete live bypass. The + `Sometimes(ghost-metric)` assertion should remain Unreachable-style on the DSD path; the live NaN + exposure is a *poisoned sum/avg* at the encoder, not a zero-point ghost. diff --git a/test/antithesis/scratchbook/properties/prefix-filter-ordering-matches-agent.md b/test/antithesis/scratchbook/properties/prefix-filter-ordering-matches-agent.md new file mode 100644 index 00000000000..f4ef0027c66 --- /dev/null +++ b/test/antithesis/scratchbook/properties/prefix-filter-ordering-matches-agent.md @@ -0,0 +1,120 @@ +# prefix-filter-ordering-matches-agent + +## Origin + +Bug-history-sensitive coverage gap. A past correctness fix **"moved DSD prefix/filter in front of +enrich"** (SUT analysis §8, churn hotspots), and the DSD transform chain now applies four +name-rewriting/filtering stages in a *specific order* that must match the Datadog Agent's +listener-side vs. time-sampler split. The order determines which name each downstream stage sees, so +an ordering regression silently changes filtering outcomes. The `dogstatsd_prefix_filter` +(listener-side: prefixing + blocklist/filterlist) and `dogstatsd_post_aggregate_filter` +(time-sampler-side: histogram-aggregate-series filtering) deliberately split responsibility for the +**same four config keys** — a split with subtle correctness rules and zero end-to-end property. 
+ +## Code paths + +- Pipeline order (`bin/agent-data-plane/src/cli/run.rs:674-679`): + `dsd_in.metrics → dsd_enrich` (chained: `dogstatsd_mapper`, `run.rs:640-641`) + `→ dsd_prefix_filter → dsd_tag_filterlist → dsd_agg → dsd_post_agg_filter → metrics_enrich`. + - Mapper rewrites the name **first**, so prefix/blocklist see the *mapped* name. + - `dsd_prefix_filter` then prefixes (`statsd_metric_namespace`) and drops blocklisted names + **before** aggregation. + - `dsd_post_agg_filter` runs **after** `dsd_agg`, filtering only the generated histogram-aggregate + *series* names the aggregator produced (e.g. `foo.max`, `foo.95percentile`). +- `dogstatsd_prefix_filter/mod.rs` + - `process_metric` (`mod.rs:234-267`): if a prefix is configured, prefixes the name **unless** the + name already starts with a `metric_prefix_blocklist` entry (`has_excluded_prefix`, `mod.rs:269-275`); + then checks the effective blocklist matcher and drops on match + (`events.remove_if(...)`, `mod.rs:298-303`). + - Default `metric_prefix_blocklist` is a fixed list of integration prefixes (`datadog.agent`, `jvm`, + `kafka`, …) (`mod.rs:67-91`). +- `dogstatsd_post_aggregate_filter/mod.rs` + - `HistogramSuffixes::contains_filter_entry` (`mod.rs:178-186`): a filterlist entry "owns" + post-aggregate filtering **only** if it is shaped `<name>.<suffix>` (suffixes derived + from `histogram_aggregates` + `histogram_percentiles`, `mod.rs:150-172`). Other entries stay the + listener filter's responsibility — the explicit split. + - `should_filter_metric` (`mod.rs:238-240`): filters **only scalar series** + (`Counter|Rate|Gauge|Set`) — **sketches are kept** (test `sketch_metrics_are_not_filtered`, + `mod.rs:528`), matching the Agent time-sampler. 
+- Both filters share the same four config keys (`METRIC_FILTERLIST_CONFIG_KEY`, + `METRIC_FILTERLIST_MATCH_PREFIX_CONFIG_KEY`, `STATSD_METRIC_BLOCKLIST_CONFIG_KEY`, + `STATSD_METRIC_BLOCKLIST_MATCH_PREFIX_CONFIG_KEY`, `dogstatsd_filterlist.rs`) and both reload live + (see `filter-config-reload-correct`). + +## Failure scenario + +- **Ordering regression:** if a refactor moves `dsd_prefix_filter` back behind `dsd_enrich`/mapper or + past `dsd_agg`, the prefix/blocklist would see a different name (pre-map, or post-aggregate-expanded) + than the Agent's listener filter does → metrics blocklisted-or-not differently, or double-prefixed. + The diff suite's happy path may not catch a name that only diverges through this specific stage + order. +- **Split divergence:** an entry like `foo.max` (looks like a histogram aggregate) must be filtered + **post-aggregate** (it targets a generated series), while `foo` must be filtered **listener-side**. + If `contains_filter_entry`'s suffix detection (`mod.rs:178-186`) disagrees with the Agent's, an + entry is filtered at the wrong stage (or both, or neither) → a metric the operator blocklisted is + still forwarded, or a raw metric is dropped that should survive to aggregation. +- **Prefix double-apply / blocklist bypass:** `has_excluded_prefix` logic interacting with a mapped + name could prefix a name that already carries an integration prefix, or fail to block a name that + only matches after prefixing — silently wrong egress identity. + +## Property + +- **Type:** Safety (ordering + differential). +- **Invariant:** + - `Always(end-to-end keep/drop + final name within ratio of the Datadog Agent)` for the same + `statsd_metric_namespace`, `metric_filterlist`, `statsd_metric_blocklist`, and match-prefix flags + — the strongest check, anchored on Add-on 2's differential harness with a corpus that exercises + prefixing, blocklisting, and histogram-aggregate-series names. 
+ - `AlwaysOrUnreachable(a non-histogram-shaped filterlist entry is NOT applied at the post-aggregate + stage)` and conversely `AlwaysOrUnreachable(a histogram-aggregate-series name is NOT dropped at + the listener stage by that entry)` — SUT-side, pins the prefix/post-agg responsibility split. + - `AlwaysOrUnreachable(post_agg_filter never drops a sketch metric)` (`mod.rs:238-240`). + - `Sometimes(prefix added)`, `Sometimes(listener blocklist dropped a metric)`, + `Sometimes(post-aggregate filter dropped a generated series)` for non-vacuity. + - Optionally a topology-shape assertion (`Always(dsd_prefix_filter is wired between dsd_enrich and + dsd_tag_filterlist; dsd_post_agg_filter after dsd_agg)`) read from the built blueprint, to catch + an ordering regression structurally. +- **Antithesis angle:** corpus crafted so the *same* metric name's keep/drop decision depends on the + stage order (a name that is blocklisted only pre-prefix, or an entry that is ambiguous between + listener and post-aggregate ownership), plus fault-induced flush-timing skew on the differential + run. Compose with `mapper-output-matches-agent` (mapper feeds the prefix filter) and + `filter-config-reload-correct` (these filters reload live). +- **Priority:** Medium (High if run as the primary regression tripwire for the prefix/filter-ordering + bug class). + +## Config dependencies + +- `statsd_metric_namespace`, `statsd_metric_namespace_blocklist`, `metric_filterlist`, + `metric_filterlist_match_prefix`, `statsd_metric_blocklist`, + `statsd_metric_blocklist_match_prefix`, `histogram_aggregates`, `histogram_percentiles` — set + identically on both sides for the differential facet. +- The differential facet needs Add-on 2 (Agent baseline + ADP, identical workload). The split/sketch/ + ordering invariants run SUT-side on the primary topology. + +## Open Questions + +- Does the Agent split listener-side vs. 
time-sampler filtering on exactly the + `<name>.<suffix>` shape that `contains_filter_entry` (`mod.rs:178-186`) uses? An + off-by-one in suffix detection silently routes an entry to the wrong stage. + `(needs Agent-source confirmation)` +- Is the `dsd_prefix_filter`-before-`metrics_enrich`/after-mapper ordering load-bearing for Agent + equivalence (the historical fix moved it), and is there a regression guard today other than this + proposed property? The fix suggests ordering is fragile. +- `has_excluded_prefix` only consults `metric_prefix_blocklist` when a prefix is configured + (`mod.rs:269-275`); does the Agent skip prefixing for the same default integration-prefix set + (`mod.rs:67-91`), and does mapping a name change whether it carries such a prefix? +- Both filters read the same four keys via separate watchers — a reload that updates one but lags the + other (compose with `filter-config-reload-correct`) could transiently filter at one stage but not + the other for the same logical rule. Confirm reachability. + +### Investigation Log + +- Examined: `run.rs:638-679` (chain wiring + order), full `dogstatsd_prefix_filter/mod.rs` + (process_metric, has_excluded_prefix, default blocklist, reload arms), full + `dogstatsd_post_aggregate_filter/mod.rs` (HistogramSuffixes, scalar-series gate, sketch exclusion, + reload arms). +- Found: a deliberate listener-vs-time-sampler split over four shared keys, an ordering the codebase + history shows is correctness-fragile, and only self-consistent unit tests — no end-to-end ordering + or differential property. Distinct from `mapper-output-matches-agent` (name rewrite) and + `tag-filterlist-applied-consistently` (per-metric tag stripping); this owns the **prefix/blocklist + + post-aggregate split and the stage ordering**. 
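The listener-side rules and the ownership split can be sketched as three small functions. Everything here is illustrative: the function names are hypothetical, the entry shape (a name, a dot, then a histogram suffix) is inferred from the `foo.max` and `foo.95percentile` examples, the blocklist is matched exactly rather than by prefix, and the namespace is applied verbatim. The real matchers may differ on each of these points.

```rust
/// True if a filterlist entry targets a generated histogram-aggregate series,
/// i.e. it ends with "." plus one of the known suffixes. Requiring a non-empty
/// name before the dot is a choice of this sketch, not a confirmed rule.
fn owned_by_post_aggregate(entry: &str, suffixes: &[&str]) -> bool {
    suffixes.iter().any(|suffix| {
        entry
            .strip_suffix(*suffix)
            .and_then(|head| head.strip_suffix('.'))
            .map_or(false, |name| !name.is_empty())
    })
}

/// Prefixes a name unless no prefix is configured or the name already
/// starts with an excluded (integration) prefix.
fn apply_prefix(name: &str, prefix: &str, excluded: &[&str]) -> String {
    if prefix.is_empty() || excluded.iter().any(|p| name.starts_with(*p)) {
        name.to_string()
    } else {
        format!("{}{}", prefix, name)
    }
}

/// Listener-side stage in the described order: prefix first, then blocklist.
/// Returns the effective name if the metric is kept, or None if it is dropped.
fn listener_stage(
    name: &str,
    prefix: &str,
    excluded: &[&str],
    blocklist: &[&str],
) -> Option<String> {
    let effective = apply_prefix(name, prefix, excluded);
    if blocklist.iter().any(|blocked| *blocked == effective) {
        None
    } else {
        Some(effective)
    }
}
```

Because the blocklist sees the prefixed name, an entry such as `ns.foo` drops the raw metric `foo` only when prefixing runs first. Swapping the two steps would keep it, which is the kind of order-dependent outcome a differential corpus should contain.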
diff --git a/test/antithesis/scratchbook/properties/replay-corruption-not-silent-eof.md b/test/antithesis/scratchbook/properties/replay-corruption-not-silent-eof.md new file mode 100644 index 00000000000..c2223268551 --- /dev/null +++ b/test/antithesis/scratchbook/properties/replay-corruption-not-silent-eof.md @@ -0,0 +1,120 @@ +--- +slug: replay-corruption-not-silent-eof +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: Medium +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Capture corruption is distinguishable from a clean EOF (no silent truncation) + +## Origin +SUT analysis §7 #12 ("Replay reader treats corruption as clean EOF … silently truncates +the remaining record stream — false replay-fidelity confidence"). No Antithesis assertion +exists (existing-assertions.md). **Framed honestly: the current code intentionally returns +`Ok(None)` on these inputs; this property captures the data-fidelity risk, it does not +claim the code is wrong today.** + +## What the code does +`lib/saluki-components/src/sources/dogstatsd/replay/reader.rs::read_next` (84-104): +```rust +if self.offset + LENGTH_PREFIX_SIZE > self.contents.len() { return Ok(None); } // (a) +let size = u32::from_le_bytes(...) as usize; +self.offset += LENGTH_PREFIX_SIZE; +// "The writer emits a zero-length prefix to mark the start of the tagger state trailer; +// treat that (and any size that would overrun the buffer) as the end of the record stream." +if size == 0 || self.offset + size > self.contents.len() { return Ok(None); } // (b) +``` +Three distinct conditions all collapse to the **same** `Ok(None)` "clean EOF" signal: +1. Legitimate end: offset reached the start of the zero-length trailer separator (`size == 0`). +2. Truncation: a record length prefix is present but the body is cut short + (`offset + size > len`). +3. 
Corruption: a corrupt length prefix that happens to read as `0` or as an oversized value. + +The driver (`bin/agent-data-plane/src/cli/dogstatsd.rs:367-373`) treats `Ok(None)` as +"replay iteration completed" and stops sending packets — so cases 2 and 3 **silently drop +every remaining record** with no error and no telemetry. + +Tests currently *assert* this behavior: `truncated_record_returns_none` (245-257) writes a +file with the last 8 bytes dropped and asserts `read_next()` yields `Ok(None)` ("clean EOF on +truncation"); `read_next_stops_at_state_separator` (233-242) asserts the trailer boundary → +`Ok(None)`. So the silent-truncation behavior is encoded as intended. + +Contrast: `read_state` (126-131) **does** return an `Err` when the trailer size overruns the +buffer — so the codebase already distinguishes "oversized length → error" in the trailer path +but not in the record path. This asymmetry is the crux of the property. + +## Failure scenario (Antithesis) +Replay a capture that is valid for the first N records, then has a corrupt 4-byte length +prefix (e.g. a flipped byte making `size` huge, or a zeroed prefix mid-stream). The reader +returns `Ok(None)` at that point; the replay tool reports success having sent only N of M +records. A diff against the capture's true record count would reveal the loss, but the tool +itself signals success — the fidelity loss is invisible. + +## Key observations +- This is a *data-fidelity* property, not a crash property. The "bad thing" is **silent** + truncation reported as success, not a panic. +- A faithful fix would track whether the offset reached exactly the trailer separator vs. ran + off a malformed prefix, and surface the latter as `Err`. The trailer path (read_state) already + does this for oversize. 
The property can be stated without demanding a fix: *if records were + truncated by corruption, the replay must not report a clean completion.* +- Because the legitimate-EOF case (size==0 separator) and the truncation case are byte-shape + identical from `read_next`'s local view, distinguishing them requires either a record count in + the header/trailer or an explicit corruption sentinel — neither exists today (open question). + +## Config deps +- Same gating as `replay-no-panic-on-malformed-capture`: `dogstatsd replay` subcommand, UDS + listener, Linux-only. +- File version ≥ MIN_STATE_VERSION (2) means a trailer is expected (file_header.rs:11); the + separator-vs-truncation ambiguity is most acute for versioned files that *should* have a trailer. + +## Suggested assertion (MISSING — net-new) +- **AlwaysOrUnreachable**(replay completion is faithful): when `read_next` returns the + terminating `Ok(None)`, the consumed offset equals the start of the tagger-state trailer + (clean end) — i.e. the loop did not stop because of an unconsumed/over-running length prefix. + Realize as an SDK assertion at the (b) branch (reader.rs:95) distinguishing + `size == 0 && at_trailer_boundary` (clean) from the overrun/`size==0`-mid-stream case (corrupt). +- **Sometimes(corruption-detected)**: at least once, the reader reaches the (b) branch with a + length prefix that overruns the buffer (proves the corrupt input actually exercised the path, + not just clean EOF). Meaningful state, not `Sometimes(true)`. + +## SUT-side instrumentation needs +- A workload-only check cannot tell truncation from clean EOF (both look like "replay finished"). + Needs an SDK assertion at reader.rs:95 (or a new telemetry counter `replay_records_truncated`) + to expose the corrupt branch. Could also compare a record count emitted at capture time against + records replayed. 
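One possible shape of a reader that exposes the corrupt branch is sketched below. It is not the current `read_next`, which returns `Ok(None)` in all of these cases. The `ReplayError` type and `read_next_strict` name are hypothetical, and the record body is returned as raw bytes instead of being decoded. A zero prefix in mid-stream still looks like a clean end here, since telling the two apart needs a record count or sentinel that the format does not have.

```rust
const LENGTH_PREFIX_SIZE: usize = 4;

/// Hypothetical error type separating the two malformed-input cases.
#[derive(Debug, PartialEq)]
enum ReplayError {
    /// Fewer than four bytes remain, so no length prefix can be read.
    TruncatedPrefix { remaining: usize },
    /// The prefix claims more bytes than the buffer still holds.
    BodyOverrun { size: usize, remaining: usize },
}

/// Returns the next record body, `Ok(None)` at a clean end (exact end of the
/// buffer or a zero-length separator), or an error for truncation and overrun.
fn read_next_strict<'a>(
    contents: &'a [u8],
    offset: &mut usize,
) -> Result<Option<&'a [u8]>, ReplayError> {
    let remaining = contents.len().saturating_sub(*offset);
    if remaining == 0 {
        return Ok(None);
    }
    if remaining < LENGTH_PREFIX_SIZE {
        return Err(ReplayError::TruncatedPrefix { remaining });
    }
    let mut prefix = [0u8; LENGTH_PREFIX_SIZE];
    prefix.copy_from_slice(&contents[*offset..*offset + LENGTH_PREFIX_SIZE]);
    let size = u32::from_le_bytes(prefix) as usize;
    *offset += LENGTH_PREFIX_SIZE;
    if size == 0 {
        // Zero-length prefix: treated as the trailer separator.
        return Ok(None);
    }
    let remaining = contents.len() - *offset;
    if size > remaining {
        return Err(ReplayError::BodyOverrun { size, remaining });
    }
    let body = &contents[*offset..*offset + size];
    *offset += size;
    Ok(Some(body))
}
```

The `BodyOverrun` arm is where a `Sometimes(corruption-detected)` anchor or a truncation counter could sit, since it is reached only when a length prefix overruns the buffer.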
+ +## Open questions +- **Is there a record count or total-length field anywhere** (header/trailer) that would let the + reader detect "stopped early"? If not, distinguishing truncation from clean EOF requires a format + change. Determines whether this can be a strict `Always` or only a best-effort heuristic. +- **Do the maintainers consider silent truncation acceptable for replay** (best-effort tool) vs. a + fidelity bug? The intentional `Ok(None)` and the asserting tests suggest "accepted"; this property + documents the risk and lets Antithesis quantify how often corruption silently truncates. Changes + priority, not the property statement. +- **How does a corrupt prefix that decodes to a small but wrong `size` behave?** It would decode the + wrong bytes as a record body (case where `offset+size <= len` but bytes are garbage) → either a + prost decode `Err` (good, surfaced) or, worst case, a successfully-decoded *wrong* record. Worth + enumerating: this is a third outcome (silent corruption rather than silent truncation). + +### Investigation Log + +#### Do the maintainers consider silent truncation acceptable, or is it a fidelity bug? `(needs human input)` + +- **Examined**: `lib/saluki-components/src/sources/dogstatsd/replay/reader.rs` `read_next` + (length-prefix parse ~84-104, the `size == 0 || offset+size > len → Ok(None)` collapse); the + reader's own unit tests (~244-257) which feed truncated/oversized prefixes and **assert** the + result is `Ok(None)`; the capture-file format (`UnixDogstatsdMsg` records + a `TaggerState` + trailer, `writer.rs`) for any record-count or total-length field. +- **Found**: the silent-truncation behavior is *intentional in code* — `Ok(None)` is the deliberate + return for both legitimate EOF and a corrupt/over-running prefix, and the tests pin that behavior + as desired. There is **no record-count or total-length field** in the format, so the reader has no + in-band way to distinguish "stopped early" from "clean end." 
+- **Not found**: any comment, doc, ADR, or commit message stating whether silent truncation is an + accepted best-effort property of the replay tool or a known fidelity gap. Code intent ("we return + Ok(None)") is clear; *product* intent ("is that OK?") is not recoverable from the repo. +- **Conclusion**: tagged `(needs human input)`. The behavior is unambiguous; only the maintainers can + say whether it is acceptable. The answer changes this property's **priority** (and whether a + format change adding a record count is warranted), not the property statement. diff --git a/test/antithesis/scratchbook/properties/replay-no-panic-on-malformed-capture.md b/test/antithesis/scratchbook/properties/replay-no-panic-on-malformed-capture.md new file mode 100644 index 00000000000..e9154049524 --- /dev/null +++ b/test/antithesis/scratchbook/properties/replay-no-panic-on-malformed-capture.md @@ -0,0 +1,130 @@ +--- +slug: replay-no-panic-on-malformed-capture +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Parsing an arbitrary/corrupt/truncated DogStatsD capture never panics + +## Origin +SUT analysis §6 gap #6 ("DogStatsD replay has zero suite coverage despite being the +newest, largest, untrusted-input feature"), §7 #12, §8 ("Most regression-prone area: +DogStatsD replay", `e88d04b10a`, +1765 LOC, validated only with `cargo check`). No +Antithesis assertion exists (existing-assertions.md). + +## What the code does +`lib/saluki-components/src/sources/dogstatsd/replay/reader.rs`: +- `from_path` (39-64): `fs::read` → zstd sniff (`has_zstd_magic`, 141-143, checks 4 + magic bytes) → `zstd::stream::decode_all` (44) → `valid_header` → `file_version`. + All fallible steps map to `GenericError` via `?`. zstd errors are caught (45). 
+- `read_next` (84-104): bounds-checks `offset + LENGTH_PREFIX_SIZE > len` → `Ok(None)` + (85-87); reads 4-byte LE length prefix; **`size_bytes.try_into().expect("length + prefix is 4 bytes")`** (90) — this `expect` is provably safe because the slice is + exactly `[offset .. offset+4]` after the bounds check, so it can never fire on any + input; bounds-checks `offset + size > len` → `Ok(None)` (95); `UnixDogstatsdMsg::decode` + returns mapped error (99-100). +- `read_state` (110-138): symmetric, with the same provably-safe `expect` at 121 and a + real error return at 126-131 if the trailer size overruns. + +`bin/agent-data-plane/src/cli/dogstatsd.rs` (the **driver**, runs inside an ADP process): +- 269-270: `from_path(&cmd.replay_file_path)?` then `read_state()?` (config-check-style path). +- 355: `from_path(replay_file_path)?`; 367: `let msg = match reader.read_next()? { … }` + inside `replay_one_iteration` — all errors propagate via `?` up to a `tokio::select!` + (341-347) that returns the error. No `unwrap`/`expect` on the reader results here. + +## Failure scenario (Antithesis) +Feed the `agent-data-plane dogstatsd replay` subcommand a capture file that is: +arbitrary bytes; a valid header followed by a truncated/garbage protobuf; a zstd stream +that decompresses to a header + corrupt body; a length prefix that decodes but whose body +is invalid protobuf; a zstd bomb / partial zstd frame. Expectation: the process exits with +a clean `Err` (non-zero exit, logged), never a panic/abort/SIGABRT. + +## Key observations +- The two non-test `expect` sites (reader.rs:90, 121) are guarded by an exact-length + bounds check, so they are **not** reachable panic sites — the "~25 unwrap/expect" figure + in the brief is the whole-file count and is dominated by test code (26 of 28 are in + `#[cfg(test)]`). The real untrusted-input panic surface in the reader is small. 
+- The realistic panic risk is in the *dependencies*: `zstd::stream::decode_all` on a + malicious stream (memory blowup / library panic) and `prost`'s `Message::decode` on + adversarial protobuf (recursion/length). Both are wrapped in `Result`, but a panic + inside them would still abort — Antithesis is the right tool to find such a panic. +- A panic is invisible to a workload-only checker if the replay runs as a subprocess that + is expected to exit non-zero anyway; distinguishing "clean error exit" from "panic/abort" + needs SUT-side signal. + +## Config deps +- Replay path is gated on the `dogstatsd replay` CLI subcommand + a UDS listener; Linux-only + (`#[cfg(target_os = "linux")]`, dogstatsd.rs:351). The capture file path is operator-supplied. +- zstd decompression is auto-selected by magic bytes — no config flag needed to reach it. + +## Suggested assertion (MISSING — net-new) +- **Unreachable** at any panic/abort originating from the reader or its decode calls. Best + realized as an SDK `assert_unreachable` in a panic hook installed for the replay path, or + by treating any SIGABRT/panic-unwind during replay as a property violation. The workload + cannot cleanly observe a panic from outside, so this needs SUT-side instrumentation. +- Pair with **Reachable**(replay parse executed) so the test confirms the path was actually + exercised, not skipped because the subcommand never ran. + +## SUT-side instrumentation needs +- A panic hook (or `assert_unreachable`) on the replay code path is required; a workload-only + check sees only an exit code and cannot distinguish panic from a deliberate `Err`. +- Optionally anchor `assert_always(result.is_ok() || clean_err)` at dogstatsd.rs:367 so an + unexpected panic in `read_next` is converted into a recorded assertion failure. 
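As an illustration of the "clean error vs. panic" signal described above, the sketch below wraps a parse step in `std::panic::catch_unwind` and reports three distinct outcomes. `ParseOutcome` and `run_guarded` are hypothetical names, the closure stands in for the `from_path`/`read_next` calls, and this is one possible realization rather than the instrumentation the project uses. It only observes unwinding panics; a build with `panic = "abort"` or an abort inside a C dependency would bypass it.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical three-way outcome for one guarded parse attempt.
#[derive(Debug, PartialEq, Eq)]
enum ParseOutcome {
    Parsed,
    CleanError(String),
    Panicked,
}

/// Runs `parse` and converts an unwinding panic into `ParseOutcome::Panicked`
/// so it can be recorded as a property violation instead of looking like
/// an ordinary non-zero exit.
fn run_guarded<F>(parse: F) -> ParseOutcome
where
    F: FnOnce() -> Result<(), String>,
{
    match panic::catch_unwind(AssertUnwindSafe(parse)) {
        Ok(Ok(())) => ParseOutcome::Parsed,
        Ok(Err(e)) => ParseOutcome::CleanError(e),
        Err(_) => ParseOutcome::Panicked,
    }
}
```

The `Panicked` arm is where an `assert_unreachable`-style SDK call would sit, while `CleanError` is the expected result for malformed captures.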
+ +## Open questions +- (none remaining — see Investigation Log) + +### Investigation Log + +#### How is replay invoked; whole-file read OOM vector; zstd decompression-bomb vector +- **Examined:** `bin/agent-data-plane/src/cli/dogstatsd.rs:38-114` (subcommand defs), + `:169-213` (`handle_dogstatsd_command` dispatch), `:261-310` (`handle_dogstatsd_replay`), + `:322-399` (`run_dogstatsd_replay` / `replay_one_iteration`); and + `lib/saluki-components/src/sources/dogstatsd/replay/reader.rs:34-143`. +- **Found:** + - **(a) Separate process, sends over UDS to the running data-plane.** Replay is the + `dogstatsd replay` argh subcommand (`ReplayCommand`, dogstatsd.rs:103-114), dispatched at + `dogstatsd.rs:192-211` to `handle_dogstatsd_replay`. The CLI process itself opens the capture + file and reads records, then **sends each record as a UDS datagram to the already-running ADP** + via `uds_sendmsg_with_creds(socket, &msg.payload, &credentials)` (`dogstatsd.rs:394`); the + socket target is ADP's configured `dogstatsd_socket` (`dogstatsd_socket_path`, :313). So parsing + of the capture file happens **in the replay CLI process, not in the data-plane process.** A + panic during parsing aborts the replay tool (exit/SIGABRT), not the data-plane. + - Consequence for instrumentation: a panic-catch / `assert_unreachable` for malformed-capture + parsing belongs in the **replay CLI process** (the `from_path`/`read_next`/`read_state` call + sites in `dogstatsd.rs` and `reader.rs`). The data-plane process only ever sees the resulting + *bytes* of `msg.payload` arriving over the DSD UDS socket — i.e. ordinary DSD packets, which + are covered by the malformed-dsd-no-crash property, not by the capture parser. + - Note `from_path` is invoked **twice per replay**: once eagerly in `handle_dogstatsd_replay` + (dogstatsd.rs:269 for `read_state`) and again per loop iteration in `replay_one_iteration` + (dogstatsd.rs:355). 
Both propagate parse errors via `?`; no `unwrap`/`expect` on reader output. + - **(b) Whole-file `fs::read` with NO size guard — OOM vector confirmed.** `reader.rs:40-41`: + `let raw = fs::read(path).map_err(...)?;` reads the *entire* file into a `Vec` before any + parsing or size check. There is no stat/metadata length check, no max-size constant, no + streaming. A multi-GB capture path is loaded fully into memory in the replay process. This is an + OOM vector independent of parsing correctness. (Lives in the replay CLI process per (a).) + - **(c) `zstd::stream::decode_all` on untrusted input with NO decompressed-size cap — + decompression-bomb vector confirmed.** `reader.rs:43-48`: if `has_zstd_magic(&raw)` (4-byte magic + sniff, `reader.rs:141-143`), it calls `zstd::stream::decode_all(raw.as_slice())` (`reader.rs:44`) + and stores the full decompressed output in `contents`. There is no `Decoder` with a window/size + limit, no streaming bound, no cap on decompressed length — `decode_all` allocates as large as the + stream dictates. A small crafted `.dog.zstd` can expand to an arbitrarily large `Vec`. Errors + from `decode_all` are caught (`.map_err(...)?`, reader.rs:45), so a *malformed* stream returns a + clean `Err`; the hazard is specifically unbounded *memory*, not a panic, on a *valid but huge* + decompression. +- **Not found:** No file-size limit, no `fs::metadata` length pre-check, no zstd window/size cap, + no streaming reader. No panic-prone `unwrap`/`expect` on untrusted bytes in the reader (the two + non-test `expect`s at reader.rs:90/121 are guarded by exact-length bounds checks, as previously + noted). +- **Conclusion:** RESOLVED. (a) Replay is a separate CLI process that parses the capture and forwards + payloads to the running ADP over the DSD UDS socket — panic/assert instrumentation for capture + parsing belongs in the replay process. 
(b) and (c) are both LIVE resource-exhaustion vectors: + unbounded `fs::read` (reader.rs:40) and uncapped `zstd::stream::decode_all` (reader.rs:44). These + are OOM/decompression-bomb hazards (memory-bound family), distinct from the no-panic property; they + warrant either a size cap in the reader or an explicit resource-exhaustion property. The no-panic + property itself stands and its panic surface is the zstd/prost decode calls, not the reader's own + bounds-checked logic. diff --git a/test/antithesis/scratchbook/properties/retry-queue-bounded-under-outage.md b/test/antithesis/scratchbook/properties/retry-queue-bounded-under-outage.md new file mode 100644 index 00000000000..1152b04834e --- /dev/null +++ b/test/antithesis/scratchbook/properties/retry-queue-bounded-under-outage.md @@ -0,0 +1,150 @@ +# retry-queue-bounded-under-outage + +**Family:** Resource Boundaries — queues / backpressure / exhaustion +**Status:** Verified against code at commit 042f41db3b. The byte-cap + drop-oldest invariant is +**expected to HOLD** (it is genuinely enforced). The tension is that staying bounded *implies +silent data loss* on prolonged outage — both halves are properties worth asserting. + +## What led to the property + +The headline guarantee is "won't crash under load, losing customer data," but `sut-analysis.md` +§2/§5 (Liveness 4) flags a real tension: the forwarder retry queue caps memory, which means a +*prolonged* intake outage forces drops. This property pins down the safety half precisely: under +a sustained outage the in-memory + disk retry queue stays within configured byte caps and +overflow drops the **oldest** entries (bias to freshest data), always **counted**, never growing +unbounded. The existing suite never tests intake-down at the system level (§6 gap 2). 
+ +## Behavior in code + +Egress is `TransactionForwarder` (`lib/saluki-components/src/common/datadog/io.rs`); each resolved +endpoint owns a `PendingTransactions` (two-tier: high-priority in-memory `VecDeque` for fresh +data + low-priority `RetryQueue` for retries/overflow). Under outage the retry circuit breaker +re-enqueues to the low-priority queue, so the `RetryQueue` is where unbounded growth would happen. + +**In-memory cap with drop-oldest** — `RetryQueue::push` +(`lib/saluki-io/src/net/util/retry/queue/mod.rs:179-220`): +```rust +if current_entry_size > self.max_in_memory_bytes { // single entry too big => Err + return Err(generic_error!("Entry too large to fit into retry queue. (...)")); +} +while !self.pending.is_empty() + && self.total_in_memory_bytes + current_entry_size > self.max_in_memory_bytes { + let oldest_entry = self.pending.pop_front()...; // OLDEST first + if let Some(persisted) = &mut self.persisted_pending { + persisted.push(oldest_entry).await?; // spill to disk if enabled + } else { + push_result.track_dropped_item(&oldest_entry); // else DROP + COUNT + } + self.total_in_memory_bytes -= oldest_entry_size; +} +self.pending.push_back(entry); +self.total_in_memory_bytes += current_entry_size; +``` +So `total_in_memory_bytes` is held at/under `max_in_memory_bytes` by evicting oldest-first; with +no disk persistence the evicted entries are dropped and tracked in `PushResult`. + +**Disk cap with drop-oldest** — `PersistedQueue` enforces a disk-byte limit, evicting oldest on +overflow (`retry/queue/persisted.rs:285-330`): `while !entries.is_empty() && total_on_disk_bytes ++ required_bytes > limit { ... track_dropped_item(...); entries_dropped += 1; }`. The limit is +`min(max_on_disk_bytes, total_space * storage_max_disk_ratio)` (`persisted.rs:343-349`, +ratio default 0.8). Corrupt/unconsumable persisted entries are also permanently dropped and +counted (`persisted.rs:234-238, 317-321`). 
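The disk-side rule can be modeled in a few lines. This is a simplified sketch under stated assumptions: entries are reduced to their byte sizes, and `effective_disk_limit` and `evict_oldest_until_fits` are hypothetical names that only mirror the `min(...)` limit and the oldest-first `while` loop quoted above, not the `PersistedQueue` implementation.

```rust
use std::collections::VecDeque;

/// Effective disk cap: the smaller of the configured byte cap and the
/// ratio-scaled volume size.
fn effective_disk_limit(max_on_disk_bytes: u64, total_space_bytes: u64, max_disk_ratio: f64) -> u64 {
    let ratio_cap = (total_space_bytes as f64 * max_disk_ratio) as u64;
    max_on_disk_bytes.min(ratio_cap)
}

/// Evicts oldest entries (front of the deque) until `required_bytes` fits
/// under `limit`, or the queue is empty. Returns how many entries were dropped.
fn evict_oldest_until_fits(
    entries: &mut VecDeque<u64>,
    total_on_disk_bytes: &mut u64,
    required_bytes: u64,
    limit: u64,
) -> usize {
    let mut dropped = 0;
    while !entries.is_empty() && *total_on_disk_bytes + required_bytes > limit {
        let oldest = entries.pop_front().expect("queue is non-empty");
        *total_on_disk_bytes -= oldest;
        dropped += 1;
    }
    dropped
}
```

In this model the byte total after eviction plus the new entry stays at or under the limit whenever the loop exits with entries remaining, which is the shape of the `Always` invariant proposed for the disk path.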
+ +**Drops are surfaced as telemetry** — `PushResult` (`mod.rs:49-78`) carries +`items_dropped`/`events_dropped`/`data_points_dropped`; every push site routes it through +`track_queue_drops` (`io.rs:535-539`, called at `io.rs:420, 471, 498`), and persisted-entry drops +flow via `take_persisted_entries_dropped` => `low_prio_queue_entries_dropped` (`io.rs:739-744`). +The `#[must_use]` on `PushResult` (`mod.rs:49`) makes dropped-item info hard to ignore. So drops +are counted, not fully silent — but they are still **data loss** with only a counter to show it. + +## Failure scenario (Antithesis) + +Drive sustained load into DSD while Antithesis holds the mock intake down (connection refused / +black-hole / 5xx storm / slow) long enough for the retry queue to saturate. Assert: +1. **Safety:** `total_in_memory_bytes <= max_in_memory_bytes` and on-disk bytes `<= disk limit` + at all times — the queue never grows unbounded (`Always`). +2. **Liveness/loss:** `Sometimes(items_dropped > 0)` once saturated — proves the bound is real and + the drop path is exercised (the data-loss reality of the guarantee). +3. **Recovery (cross-ref, separate property):** after the outage clears, queued data drains + high-priority-first. Antithesis interleaving stresses the shared circuit-breaker backoff + + per-endpoint queues that the deterministic harness never exercises. + +## Suggested assertions (NET-NEW — see existing-assertions.md: NO SDK assertions exist) + +- `Always(total_in_memory_bytes <= max_in_memory_bytes)` in `RetryQueue::push`/after-push; and + the analogous disk-bytes `Always` in `PersistedQueue::push`. Safety: must hold every check; the + eviction `while` loops make this a true invariant, so a real `Always`. +- `Sometimes(push_result.items_dropped > 0)` — proves saturation/eviction is reached (the + bound is actually load-bearing, not vacuous). Liveness/progress. 
+- `Sometimes(persisted_entries_dropped > 0)` when disk persistence is enabled — proves the disk + cap also evicts. Optional path => Sometimes, not Always. +- Consider `Reachable` on the "entry too large to ever fit" `Err` branch (`mod.rs:184-189`) if + the workload can produce an oversized payload — a distinct failure mode (the whole entry is + rejected, not evicted). + +SUT-side beats workload-only: a workload checker at the mock intake sees *which* metrics never +arrive but cannot distinguish "dropped by retry-queue overflow" from "dropped as 400/401/403/413 +permanent failure" (`classifier/http.rs`) from "still queued." The byte-bound invariant is only +observable from inside `RetryQueue`. + +## Configuration dependencies + +- `forwarder_retry_queue_payloads_max_size` / `forwarder_retry_queue_max_size` => in-memory byte + cap; default **15 MiB** when both unset (`retry.rs:97-104, 160-166`, + `FORWARDER_RETRY_QUEUE_PAYLOADS_MAX_SIZE_BYTES`). +- `forwarder_storage_max_size_in_bytes` (disk cap) default **0 => disk persistence DISABLED** + (`retry.rs:39-41, 110-113, 169-171`; gated at `io.rs:394`). So by default overflow goes + straight to **drop**, not disk. Disk path only active when operator sets a nonzero value (and + `forwarder_storage_path`). +- `forwarder_storage_max_disk_ratio` (default 0.8) caps disk usage relative to total volume space. +- Per-endpoint: each resolved endpoint has its own queue, so the *aggregate* memory bound is + `num_endpoints * max_in_memory_bytes` (+ disk). Multi-endpoint fan-out multiplies the cap. + +## Open questions + +- Is the aggregate bound across all endpoints (`num_endpoints * 15 MiB` in-memory) the right thing + to assert, or is the per-endpoint cap sufficient? With many additional endpoints the total can + be large; matters for whether "bounded" really protects process RSS. 
+- Does `IncomingBytesPerSec` / queue-duration accounting (`io.rs:582-639`) feed any *additional* + drop policy (time-based eviction) beyond the byte cap, the way the Datadog Agent's + queue_duration_capacity does? If so the bound is byte-AND-time and the assertion must cover both. + +### Investigation Log + +#### Disk-init fallback byte-cap + corrupt-file wedging under fault injection (2026-05-28) + +**Examined:** `lib/saluki-components/src/common/datadog/io.rs:391-410` (queue create + silent +fallback); `lib/saluki-io/src/net/util/retry/queue/persisted.rs` `pop` (:206-243), +`refresh_entry_state` (:245-273), `try_deserialize_entry` (:354-398), `push` (:164-199), +`remove_until_available_space` (:304-330). (Full trace recorded in +`disk-persisted-retry-survives-restart.md` Investigation Log, 2026-05-28.) + +**Found:** +- **Byte cap holds in the degraded (fallback) mode.** Disk-init failure falls back to + `RetryQueue::new(queue_id, config.retry().queue_max_size_bytes())` (io.rs:407) — the same + in-memory cap as the non-persisted path (io.rs:391). Degraded mode is just the drop-oldest / + drop-not-spill branch of `RetryQueue::push`, so `total_in_memory_bytes <= max_in_memory_bytes` + still holds. The disk-path `Always` is vacuous in fallback mode (no disk queue exists), but the + in-memory `Always` is preserved. +- **Durability downgrade is surfaced only as an `error!` log (io.rs:406), no metric** — so a + bounded-queue workload that intends to exercise the disk cap must detect the fallback (log-scrape + or `assert_unreachable` at io.rs:405) or it will silently be testing the in-memory cap instead. +- **A corrupt / torn-written persisted file does NOT wedge the queue and does NOT break the disk + byte cap.** `pop` drops corrupt entries (warn + `entries_dropped++` + decrement + `total_on_disk_bytes`) and `continue`s past them (persisted.rs:227-240); the eviction path does + the same (:313-322); unrecognized-named files are skipped during the scan (:255-262). 
Dropping a + corrupt entry *decrements* `total_on_disk_bytes`, so it can only help the cap, never violate it. + Writes are non-atomic (`tokio::fs::write` direct to final path, :184, no temp+rename), so a + SIGKILL mid-write yields a valid-name/truncated-content file → classified corrupt → dropped on + read. Recovery proceeds past any number of such files. The "~47 unwrap/expect" concern: the + recovery/eviction hot paths use `Result`-propagating `?`/match on the deserialize and IO error + paths, not `unwrap`, so a corrupt file surfaces as a handled `Err`, not a panic. + +**Not found:** No path where a corrupt/torn file inflates `total_on_disk_bytes` past the limit or +halts eviction/recovery; no metric for the fallback downgrade. + +**Conclusion (RESOLVED):** The byte-cap `Always` invariant holds under disk-init fallback (in-memory +cap unchanged) and under corrupt/torn-file fault injection (corrupt entries are dropped and decrement +the byte total, never wedge the queue). The disk-path `Always` is testable on clean disks and is not +violated by corrupt files; in fallback mode only the in-memory `Always` applies. Workloads targeting +the *disk* cap must guard against the silent fallback (log-only, no metric). diff --git a/test/antithesis/scratchbook/properties/rss-bounded-under-cardinality.md b/test/antithesis/scratchbook/properties/rss-bounded-under-cardinality.md new file mode 100644 index 00000000000..2aff8853287 --- /dev/null +++ b/test/antithesis/scratchbook/properties/rss-bounded-under-cardinality.md @@ -0,0 +1,159 @@ +# rss-bounded-under-cardinality + +**Family:** Resource Boundaries — memory +**Status:** Verified against code at commit 042f41db3b. Property is expected to **FAIL by design** under default configuration. 
+ +## What led to the property + +The ADP headline guarantee (Confluence, "What Comes After DogStatsD") is **"ADP will not crash +under load, losing customer data,"** and the product is marketed on **deterministic resource +usage** (`docs/agent-data-plane/index.md:1-6`). The most direct reading of that guarantee is: +under a high-cardinality tag flood (or a single metric with many distinct timestamped values), +the process RSS stays within the operator's configured memory grant. This property tests that +runtime claim directly — not the static startup bound. + +## Why it is expected to fail (the design gaps) + +The runtime memory story is built from several independently-weak mechanisms; under default +config most are off or advisory: + +1. **Memory limiting is DISABLED by default.** `MemoryMode::default() == Disabled` + (`lib/saluki-app/src/accounting.rs:37-40`). In `initialize_memory_bounds` + (`accounting.rs:149-181`), `Disabled` => `limiter_grant = None` => `MemoryLimiter::noop()` + (`accounting.rs:174-178`). A noop limiter's `wait_for_capacity` returns immediately + (`lib/resource-accounting/src/limiter.rs:73-77, 83-88`). So out of the box there is **no + runtime backpressure at all**. A limit only auto-appears via cgroups when `DOCKER_DD_AGENT` + is set (`accounting.rs:108-121`). + +2. **Even when enabled, backpressure is advisory and tiny.** `MemoryLimiter::new` + (`limiter.rs:42-68`) starts backoff at **95% of limit** (`backoff_threshold = 0.95`, + line 47) and caps sleep at **25ms** (`backoff_max = Duration::from_millis(25)`, line 49). + The checker thread samples RSS only every **250ms** (`limiter.rs:120`). Under a burst, RSS + can blow well past the limit between samples while the worst penalty any cooperating task + pays is a 25ms sleep. + +3. **Only the source cooperates; the allocating hot path does not.** Backpressure is + cooperative — it only throttles tasks that call `wait_for_capacity`. 
The DSD source calls it + once per read loop iteration (`lib/saluki-components/src/sources/dogstatsd/mod.rs:1186`). But + the **aggregate transform never references the memory limiter at all** (grep of + `transforms/aggregate/mod.rs` for `wait_for_capacity`/`memory_limiter` => no matches). The + aggregation `HashMap` and the string interner both grow under + pressure regardless of backoff; throttling the socket read does not stop the map from + growing once packets are in flight. + +4. **The interner spills to the heap by default => "effectively unlimited."** Context + resolution interns into a fixed-size buffer, but on a full interner it falls back to a heap + allocation when `allow_heap_allocations` is true (`lib/saluki-context/src/resolver.rs:339-353`). + The builder defaults this to `true` (`resolver.rs:258` `unwrap_or(true)`, doc `:186-190`), + and the DSD config default `dogstatsd_allow_context_heap_allocs` is also `true` + (`sources/dogstatsd/mod.rs:149-151, 402-406`), wired through + `sources/dogstatsd/resolver.rs:38,56,64`. The doc string is explicit: heap fallback means + the interner memory is **"effectively unlimited"** (`resolver.rs:182-184`). + +5. **The declared firm bound is known-incomplete.** `AggregateConfiguration::specify_bounds` + carries a TODO (`transforms/aggregate/mod.rs:247-272`) admitting that a single metric with + **many distinct timestamped values** is not accounted for — only `aggregate_context_limit` + entries of `sizeof(Context)+sizeof(AggregatedMetric)` are summed. A many-distinct-timestamp + flood inflates each `MetricValues` `SmallVec` beyond the modeled size. So even the static + bound diverges from real RSS under exactly the workload this property injects. + +Net: the static `BoundsVerifier` is a **startup assertion, not a runtime invariant**, and the +runtime mechanism is off-by-default / advisory / non-cooperative on the path that actually +allocates. RSS staying within grant is not guaranteed under high-cardinality load. 
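To make the sampling gap in point 2 concrete, here is a back-of-the-envelope model. It assumes a constant allocation rate, which is a simplification and not a measurement. The 250ms interval and the 95% threshold are taken from the text above; the function names are hypothetical.

```rust
use std::time::Duration;

/// RSS sampling interval cited for the limiter's checker thread.
const SAMPLE_INTERVAL: Duration = Duration::from_millis(250);

/// Headroom between the 95% backoff threshold and the limit (5% of the limit),
/// computed with integer arithmetic.
fn headroom_bytes(limit_bytes: u64) -> u64 {
    limit_bytes / 20
}

/// Bytes that can be allocated between two consecutive RSS samples at a
/// constant rate (assumption: no throttling takes effect in between).
fn unobserved_growth_bytes(alloc_rate_bytes_per_sec: u64) -> u64 {
    (alloc_rate_bytes_per_sec as u128 * SAMPLE_INTERVAL.as_millis() / 1000) as u64
}

/// True if a burst starting just under the backoff threshold can cross the
/// limit before the next sample even observes it.
fn can_cross_limit_unseen(limit_bytes: u64, alloc_rate_bytes_per_sec: u64) -> bool {
    unobserved_growth_bytes(alloc_rate_bytes_per_sec) > headroom_bytes(limit_bytes)
}
```

Under this model, a 1 GB limit leaves 50 MB of headroom, so any sustained burst above 200 MB/s can pass the limit within one sampling window.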
+ +## Failure scenario (Antithesis) + +Deploy ADP with `memory_mode: strict` (or `permissive`) and an explicit `memory_limit`, plus a +load generator (millstone-style) that emits a high-cardinality tag flood and/or a single metric +name with many distinct timestamped values into DogStatsD. Sample real process RSS. Expect RSS +to exceed the grant (interner heap spill + unaccounted timestamped values + aggregate map +growth that the 25ms/250ms advisory backoff cannot arrest). Antithesis timing/scheduling +exploration makes the 250ms-sample / burst race observable in a way the deterministic +correctness harness (fixed clock, healthy intake) cannot. + +## Suggested assertion (NET-NEW — see existing-assertions.md: NO SDK assertions exist anywhere) + +- A workload-side or SUT-side check reading `Querier::resident_set_size()` (the same source the + limiter uses, `limiter.rs:44,100-102`): `Always(actual_rss <= grant.effective_limit_bytes() * + tolerance)`. Honest framing: this assertion is **expected to fail** under default config and + under high-cardinality load even with the limiter on — that failure is the finding. +- **SUT-side instrumentation beats a workload-only checker here.** A `Sometimes(backoff_applied + && rss_still_climbing)` anchored in the limiter, plus an assertion that the aggregate insert + path observes capacity, would localize *why* RSS escapes (advisory-only, wrong path + cooperating) rather than just that it does. A pure workload RSS probe sees the symptom, not + the mechanism. + +## Configuration dependencies + +- `memory_mode` (default `disabled` => no limiter), `memory_limit` (default unset), + `enable_global_limiter` (default true, but moot when mode is disabled), + `memory_slop_factor` (default 0.25). +- `dogstatsd_allow_context_heap_allocs` (default true => unbounded interner spill). +- `dogstatsd_string_interner_size_bytes` / `dogstatsd_string_interner_size` (interner capacity; + default 2 MiB). 
+- `aggregate_context_limit` (default 1,000,000) bounds map entries but not per-entry value count. + +## Open questions + +- What RSS tolerance band over `effective_limit_bytes` is "acceptable"? The 95%-threshold + + 25ms-backoff + 250ms-sample design implies meaningful overshoot is expected even in the happy + case; the assertion threshold determines whether this reports as a real violation or noise. + Matters because too-tight a bound makes the property flap; too-loose hides the design gap. +- With `allow_context_heap_allocs=false`, does RSS actually become bounded, or do the aggregate + map and per-context value `SmallVec`s still escape the grant under the many-timestamp flood? + This decides whether the property is "fails always" or "fails only with heap spill enabled." +- Is there any RSS ceiling enforced by the container/cgroup that would OOM-kill ADP before the + assertion fires? If so the observable outcome is a process crash (a different, arguably worse, + violation of the same guarantee) rather than an RSS-over-grant reading. + +### Investigation Log + +#### What enables a memory limit by default; does cgroup auto-detection require `DOCKER_DD_AGENT`? +- **Examined**: `lib/saluki-app/src/accounting.rs`: `MemoryBoundsConfiguration` (55-94), + `try_from_config` (101-130, the cgroup auto-detect block at 107-121), `get_initial_grant` + (133-138), `initialize_memory_bounds` (149-181); `MemoryMode` default + (`memory_mode` serde `default` => `MemoryMode::Disabled`, fields/doc near 90-94); + call site `bin/agent-data-plane/src/cli/run.rs:205-206,239`; the `DOCKER_DD_AGENT` reference in + `components/apm_onboarding/install_info.rs:90`. +- **Found — cgroup auto-detect is gated on `DOCKER_DD_AGENT`**: In `try_from_config`, cgroup + detection runs **only if** `config.memory_limit.is_none()` AND + `env::var("DOCKER_DD_AGENT")` is `Ok` AND its value is non-empty (107-110). 
Only then does + `CgroupMemoryParser.parse()` run and, on success, set `config.memory_limit` (111-118). So with + no explicit `memory_limit` config and no non-empty `DOCKER_DD_AGENT`, `memory_limit` stays + `None`. +- **Found — two independent gates make the limiter a no-op by default**: + (1) `memory_mode` defaults to `MemoryMode::Disabled`; in `initialize_memory_bounds`, Disabled + logs "Memory limiting disabled." and yields `None` grant (158-161). (2) Even in + Permissive/Strict, a `None` `memory_limit` logs "No memory limit set ... Skipping memory bounds + verification." and yields `None` (167-170). A `None` grant => `MemoryLimiter::noop()` (174-178). +- **Not found**: Any default config shipping `memory_mode != disabled` or a default + `memory_limit`; any auto-detect path that doesn't require `DOCKER_DD_AGENT`. +- **Conclusion**: RESOLVED. By default there is **no enforced memory limit**: `memory_mode` is + `Disabled` and `memory_limit` is unset. A limit becomes active only if the operator sets + `memory_limit` (or `memory_mode` Permissive/Strict + a limit) explicitly, OR a non-empty + `DOCKER_DD_AGENT` env var is present AND cgroup parsing succeeds (which only populates + `memory_limit`, still requiring a non-Disabled `memory_mode` to actually limit). For the + Antithesis run this means: unless the harness sets `DOCKER_DD_AGENT` (non-empty) with a cgroup + memory limit, OR sets `memory_mode`+`memory_limit`, the limiter is a noop and RSS is unbounded by + ADP itself — confirming this property FAILs by design under default config. The harness should + verify whether its container sets `DOCKER_DD_AGENT`, since that silently changes the baseline. + +#### With `allow_context_heap_allocs=false`, does RSS actually become bounded, or do per-value `SmallVec`s still escape? `(partial)` + +- **Examined**: `transforms/aggregate/mod.rs` `AggregationState` insert/cap (`:566-571`) and + `specify_bounds` (`:247-273`, incl. 
the self-documented gap comment `:249-256`); the + `aggregate-context-limit-enforced` finding that the live context count is bounded to exactly + `context_limit`; `MetricValues` storage for multi-value / many-timestamp metrics. +- **Found**: the *context count* is firmly bounded to `context_limit` (new contexts over the cap are + dropped) — confirmed. But the declared firm bound is `context_limit * (sizeof(Context) + + sizeof(AggregatedMetric))`, which `specify_bounds` itself admits does **not** account for a single + metric carrying many distinct timestamped values (per-value `SmallVec`/sketch-sample growth). So + bounding the context count does **not** bound per-value memory. +- **Not found**: a measured figure for how much per-value memory a single many-timestamp context can + consume under `allow_context_heap_allocs=false` — i.e. whether interner-bounded mode actually caps + total RSS or merely caps the number of contexts while per-context value vectors grow on the heap. + This needs a runtime measurement (a workload feeding one context thousands of distinct timestamps + and reading RSS), not a static read. +- **Conclusion**: `(partial)` — context count is bounded exactly; per-value memory is unaccounted and + remains an empirical question for the workload to settle. Does not change the property statement + (RSS-within-grant), only sharpens *why* it may still fail even with heap allocs disabled. 
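The two gates from the first investigation entry (limit resolution, then mode) can be summarized as a small decision function. This is a paraphrase of the findings for harness authors, not the code in `accounting.rs`; `resolve_limit` and `limiter_is_active` are hypothetical names, and the cgroup value is passed in as a plain `Option` instead of being parsed.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MemoryMode {
    Disabled,
    Permissive,
    Strict,
}

/// Gate 1: which limit (if any) ends up configured.
/// An explicit limit wins; otherwise the cgroup value is consulted only when
/// `DOCKER_DD_AGENT` is set to a non-empty value.
fn resolve_limit(
    explicit_limit: Option<u64>,
    docker_dd_agent: Option<&str>,
    cgroup_limit: Option<u64>,
) -> Option<u64> {
    if explicit_limit.is_some() {
        return explicit_limit;
    }
    match docker_dd_agent {
        Some(value) if !value.is_empty() => cgroup_limit,
        _ => None,
    }
}

/// Gate 2: the limiter is a real (non-noop) limiter only when the mode is
/// not `Disabled` AND a limit was resolved.
fn limiter_is_active(mode: MemoryMode, limit: Option<u64>) -> bool {
    mode != MemoryMode::Disabled && limit.is_some()
}
```

A harness can evaluate the same two conditions against its container environment to confirm which baseline (noop limiter or real limiter) a given run is actually testing.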
diff --git a/test/antithesis/scratchbook/properties/shutdown-drains-no-loss.md b/test/antithesis/scratchbook/properties/shutdown-drains-no-loss.md new file mode 100644 index 00000000000..a72e605e938 --- /dev/null +++ b/test/antithesis/scratchbook/properties/shutdown-drains-no-loss.md @@ -0,0 +1,108 @@ +--- +slug: shutdown-drains-no-loss +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Liveness (with a safety boundary at the 30s timeout) +priority: High +assertion_status: MISSING (net-new instrumentation) +--- + +# Property: Events accepted before the shutdown signal are drained to the forwarder before clean shutdown + +## Origin +SUT analysis §5 safety #6 ("Graceful shutdown completes within 30s without forceful kill"), +docs/reference/architecture/index.md "Shutdown" section, and the §3 note that open aggregate +windows are dropped on shutdown by default. No Antithesis assertion exists. + +## What the code does + +### Drain-by-channel-closure design +`docs/reference/architecture/index.md:196-215`: shutdown signals sources first; sources stop +intake and finish in-flight work, then signal done. Transforms/destinations are NOT signaled — +they drain because their input channels close once all upstream senders shut down, then they +"naturally complete." The doc claims: "we ensure that all remaining events are processed before +the topology is completely shutdown." This is the drain guarantee under test. + +### The 30s grace window + forceful stop +`bin/agent-data-plane/src/cli/run.rs:289-290`: `running_topology.shutdown_with_timeout(Duration::from_secs(30))`. +`lib/saluki-core/src/topology/running.rs` `shutdown_with_timeout` (~71-124): +- `shutdown_coordinator.shutdown()` (~82) triggers source shutdown which cascades downstream. +- Loops on `component_tasks.join_next_with_id()` until all components stop → `stopped_cleanly`, + logs "All components stopped." (~89-97). 
+- On `shutdown_timeout` elapse (~110-115): `warn!("Forcefully stopping topology after shutdown grace + period.")`, sets `stopped_cleanly = false`, breaks the loop. Components still running are + abandoned (the `JoinSet` is dropped) — **in-flight data in not-yet-drained components is lost**. +- Returns `Ok(())` only if `stopped_cleanly`, else `Err("Topology failed to shutdown cleanly.")` (~119-123). + +### Aggregate open-window drop by default +`lib/saluki-components/src/transforms/aggregate/mod.rs:115-133`: `flush_open_windows` +(alias `dogstatsd_flush_incomplete_buckets`) **defaults to `false`**. On stop, the current open +bucket is NOT flushed by default (to avoid double counting across restart). So data in the *current +open aggregation window* at shutdown is intentionally dropped even on a clean, within-grace shutdown +— this is a designed loss boundary distinct from the timeout boundary. + +## Failure scenario (Antithesis) +Drive sustained load, then issue the shutdown signal (SIGINT/SIGTERM). Expectation: +- Every event accepted *before* the signal that has already passed aggregation into a *closed* + window/passthrough is forwarded to the mock intake before the process exits cleanly, provided the + topology drains within 30s. +- If the grace window is exceeded (e.g. intake is slow/blocked so the forwarder can't drain), the + forceful-stop path is taken and in-flight data is lost — this is the *acceptable* boundary, and the + property should assert the *clean* case and merely characterize (not forbid) the timeout case. + +## Key observations +- Two designed loss boundaries make the property conditional, not absolute: + 1. **Open aggregation window** at shutdown (dropped unless `flush_open_windows=true`). + 2. **30s timeout exceeded** → forceful stop drops in-flight data. 
+ The clean drain claim is: *for events that have exited aggregation into a flushed window (or are + passthrough) and given drain completes within 30s, none are lost.* +- A blocked/slow forwarder (intake down) is exactly what pushes shutdown past 30s — so the + forwarder-eventual-delivery and disk-persistence properties interact: with disk persistence on, the + shutdown flush persists the retry backlog (no loss); without it, a forceful stop loses it. +- Backpressure during shutdown: the source stops reading, but already-accepted events in channels must + flow through. If any downstream is wedged (e.g. the source-dispatch wedge from + source-dispatch-no-misroute), drain stalls and the timeout fires. + +## Config deps +- 30s timeout is hard-coded (`run.rs:290`), not configurable. +- `aggregate_flush_open_windows` (default false) — toggles whether open-window data is part of the + drained set. The assertion's "accepted before signal" set must exclude open-window data unless this + is true. +- Disk persistence (`forwarder_retry_queue_storage_max_size`) — determines whether a forceful-stop / + blocked-forwarder shutdown loses the retry backlog or persists it. + +## Suggested assertion (MISSING — net-new) +- **Sometimes(clean-drain-no-loss):** at least once, after a shutdown signal under load with a healthy + intake, the topology stops cleanly within 30s AND every accepted-before-signal event that reached a + flushed window is observed at the mock intake (reconcile input-before-signal vs received). This is + the meaningful progress state proving drain works. +- **AlwaysOrUnreachable(timeout-implies-forceful):** whenever the 30s timeout fires, the run reports + forceful stop (`stopped_cleanly=false` / "Forcefully stopping topology" / `shutdown` returns Err) — + i.e. the timeout path is the only way in-flight loss occurs, and it is loudly signaled, never silent. + Anchor at `running.rs:110-115`. 
+- Do NOT assert an absolute Always-no-loss: the open-window default-drop and the 30s forceful path are + designed losses that would make a blanket Always false. + +## SUT-side instrumentation needs +- SDK `assert_reachable` at the clean-stop branch (`running.rs:91` "All components stopped.") to + confirm clean shutdown is exercised. +- SDK `assert_reachable` (characterization, not failure) at the forceful-stop branch + (`running.rs:112`) so triage can see when the timeout boundary was hit. +- Primary check is workload-side reconciliation of the pre-signal input set against the mock intake, + excluding open-window data unless `flush_open_windows=true`. + +## Open questions +- **Does the source actually wait for already-read-but-not-yet-dispatched events on shutdown?** The + doc says sources "wait for existing work to complete" — confirm the DSD `'read` loop finishes + dispatching the current buffer before signaling done, else events accepted just before the signal + but still in the source are lost even on a clean shutdown. +- **Final aggregate flush on stream close:** SUT §5 liveness #1 says aggregate does a final flush on + stream close. Confirm that final flush emits *closed* windows (not open) and that those flushed + metrics make it through the encoder→forwarder before the 30s deadline under load. +- **What is the realistic drain time under max load with a healthy intake?** If normal drain + approaches 30s, the clean-case assertion is fragile and the timeout boundary becomes the common case + — important for sizing the workload and for whether 30s is adequate (a potential finding). +- **PassthroughBatcher / passthrough_idle_flush_timeout (1s)**: pre-timestamped metrics buffered there + at shutdown — are they flushed on stop or dropped? Affects which "accepted before signal" events are + in the drained set. 
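
The workload-side reconciliation named under "SUT-side instrumentation needs" could look roughly like the sketch below. It assumes the workload can tag each event with a unique numeric ID and can tell which IDs were still in an open aggregation window at signal time; both are assumptions about the harness, and the function name is hypothetical.

```rust
use std::collections::BTreeSet;

/// Returns the IDs that should have reached the mock intake after a clean,
/// within-grace shutdown but did not. Open-window IDs are only expected
/// when `flush_open_windows` is true.
fn missing_after_clean_drain(
    sent_before_signal: &BTreeSet<u64>,
    open_window_at_signal: &BTreeSet<u64>,
    flush_open_windows: bool,
    received_at_intake: &BTreeSet<u64>,
) -> BTreeSet<u64> {
    sent_before_signal
        .iter()
        .copied()
        // Drop open-window events from the expected set unless they are flushed on stop.
        .filter(|id| flush_open_windows || !open_window_at_signal.contains(id))
        // Whatever is expected but not observed at the intake counts as lost.
        .filter(|id| !received_at_intake.contains(id))
        .collect()
}
```

An empty result on a run that stopped cleanly is the state the `Sometimes(clean-drain-no-loss)` assertion is meant to witness; a non-empty result on a forceful stop is characterization, not a failure.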
diff --git a/test/antithesis/scratchbook/properties/source-dispatch-no-misroute.md b/test/antithesis/scratchbook/properties/source-dispatch-no-misroute.md new file mode 100644 index 00000000000..733fa2c2fb8 --- /dev/null +++ b/test/antithesis/scratchbook/properties/source-dispatch-no-misroute.md @@ -0,0 +1,107 @@ +--- +slug: source-dispatch-no-misroute +sut_path: /home/ssm-user/src/saluki +commit: 042f41db3bd97118c38981765fd49696fce9d318 +updated: 2026-05-28 +type: Safety +priority: Medium +assertion_status: MISSING (net-new; likely needs SUT-side Unreachable instrumentation) +--- + +# Property: A mid-buffer dispatch failure never mis-routes remaining events across DSD outputs + +## Origin +SUT analysis §7 finding #6 ("Source dispatch errors are logged and swallowed ... a mid-buffer +dispatch failure can mis-route remaining events (eventd/service-check events leaking into the +metrics path)"). The TODO at `sources/dogstatsd/mod.rs:1670-1676` explicitly flags this. No +Antithesis assertion exists. + +## What the code does +`lib/saluki-components/src/sources/dogstatsd/mod.rs` `dispatch_events` (~1667-1716): +1. TODO (~1670-1676): "if we fail to dispatch the events, we may not have iterated over all of + them, so there might still be eventd events when we get to the service checks point, and eventd + events and/or service check events when we get to the metrics point." +2. Eventd path (~1679-1690): if `has_event_type(EventType::EventD)`, `extract(Event::is_eventd)` + removes all eventd events from the buffer into an iterator, then `buffered_named("events")` + `.expect("events output should always exist")` `.send_all(...)`. On error: logs, returns nothing + (function returns `()`), **does not re-insert**. +3. Service-check path (~1692-1704): same shape with `extract(Event::is_service_check)` → + `buffered_named("service_checks").expect(...)`. +4. 
Metrics path (~1706-1715): "if there are events left, they'll be metrics" → `dispatch_named("metrics", + event_buffer)` with the *remaining* buffer. + +### Why the actual misroute is subtler than the TODO implies +`lib/saluki-core/src/topology/interconnect/event_buffer.rs`: +- `extract` (~61-88) iterates ALL events, collects matching indices, removes them from the buffer, + and **recomputes `seen_event_types` from what remains**. It removes matching events regardless of + whether the later send succeeds. So after `extract(is_eventd)` returns, the buffer no longer + contains eventd events even before `send_all` runs. +- `send_all` (`dispatcher.rs:197-206`) consumes the *already-extracted iterator*. If it errors + mid-iteration, the events it failed to push are dropped (lost), but they were already removed from + `event_buffer`, so they cannot leak into the service_checks or metrics path. +- Net: with the current `extract`-then-`send_all` ordering, a dispatch *send* failure causes **loss** + of the events for that output, NOT misrouting into another output's path. The "leaking into metrics" + hazard would require `extract` to leave matching events in the buffer on a send failure — which it + does not, because extraction and sending are separate steps. + +So the property splits into two distinct claims: +- **(A) No misroute (Safety):** events of type eventd/service-check never arrive at the `metrics` + output, and vice versa. With current code this should hold structurally (extract is by predicate, + recomputed types), but the TODO documents authorial uncertainty and `.expect()` on outputs is a + crash if an output is ever missing. +- **(B) No silent loss on dispatch failure (related, overlaps no-silent-interconnect-drop):** a + `send_all`/`dispatch_named` error here is logged and the extracted events are dropped, with the + TODO noting the component will "continue to fail to dispatch ... until the process is restarted." 
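
The loss-versus-misroute distinction can be illustrated with a toy model of the extract-then-send ordering. This is a simplified sketch under stated assumptions, not the saluki implementation: `Kind`, `extract`, `send_all`, and `dispatch` here are hypothetical stand-ins that only mirror the shape described above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    EventD,
    ServiceCheck,
    Metric,
}

/// Removes every item matching `pred` from `buffer` and returns them,
/// regardless of what later happens to the returned items.
fn extract(buffer: &mut Vec<Kind>, pred: impl Fn(&Kind) -> bool) -> Vec<Kind> {
    let mut taken = Vec::new();
    for item in std::mem::take(buffer) {
        if pred(&item) {
            taken.push(item);
        } else {
            buffer.push(item);
        }
    }
    taken
}

/// Toy send: when `fails` is true the items are dropped and their count is returned as the error.
fn send_all(items: Vec<Kind>, fails: bool) -> Result<usize, usize> {
    if fails {
        Err(items.len())
    } else {
        Ok(items.len())
    }
}

/// Returns (events lost to failed sends, buffer handed to the metrics output).
/// A failed send is only counted; there is no early return and no re-insert.
fn dispatch(mut buffer: Vec<Kind>, events_fail: bool, checks_fail: bool) -> (usize, Vec<Kind>) {
    let mut lost = 0;
    if let Err(n) = send_all(extract(&mut buffer, |k| *k == Kind::EventD), events_fail) {
        lost += n;
    }
    if let Err(n) = send_all(extract(&mut buffer, |k| *k == Kind::ServiceCheck), checks_fail) {
        lost += n;
    }
    (lost, buffer)
}
```

In this model a failing events output increases the loss count, yet the buffer that reaches the metrics step contains only metrics, which matches claim (A) holding structurally while claim (B) fails.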
+ +## Failure scenario (Antithesis) +Drive a mixed buffer (eventd + service_check + metric events) while forcing a downstream output to +error mid-dispatch (e.g. close/saturate the events or service_checks downstream so `send_all` +errors). Observe at the mock intake / per-output telemetry that: +- no eventd or service-check payload appears on the metrics encode/forward path (misroute = false); +- the events that failed to dispatch are accounted as failures, not silently mixed elsewhere. + +## Key observations +- `.expect("events output should always exist")` (~1684) and `.expect("service checks output should + always exist")` (~1698) are crash points if those named outputs are ever unwired — a separate + liveness/crash hazard on this path. +- A send error in the eventd step does not end dispatch: on error the function only logs inside + the `if let Err` and falls through to the next `if` block; it does not early-return. + So after an eventd send failure it still attempts service_checks then metrics with the remaining + (eventd-free) buffer. This is the partial-iteration concern, but because eventd events were already + extracted, the metrics dispatch gets only non-eventd, non-service-check events. + +## Config deps +- DSD source must have all three named outputs (`metrics`, `events`, `service_checks`) wired — they + are in the production blueprint (SUT §2). If a deployment omits one, the `.expect` panics. + +## Suggested assertion (MISSING — net-new, SUT-side likely required) +- **Unreachable(misroute):** an eventd or service-check `Event` reaching the metrics output path, or a + metric reaching events/service_checks. Best as an SDK `assert_unreachable` at the point where the + metrics dispatch buffer is assembled (`mod.rs:1706-1715`) checking that no remaining event + `is_eventd() || is_service_check()`. This directly encodes the misroute-must-never state and would + fire if a future refactor breaks `extract`/type-recompute. 
SUT-side instrumentation needed because + the routing decision is internal and not observable from telemetry alone. +- **AlwaysOrUnreachable(dispatch-failure-counted):** when `send_all`/`dispatch_named` returns Err on + this path, a failure/discard counter increments (no silent swallow). Overlaps the + no-silent-interconnect-drop property; here scoped to the source dispatch. + +## SUT-side instrumentation needs +- An `assert_unreachable` checking `event_buffer` contents are metrics-only at the metrics dispatch + step (`mod.rs:~1707`). The misroute path is not externally observable, so this must be in-process. +- Optional `assert_unreachable` guarding the two `.expect(...)` output lookups (~1684/1698) to convert + the latent panic into a tracked property if an output is missing. + +## Open questions +- **Can `extract` ever leave a matching event in the buffer?** Reading `event_buffer.rs:61-88`, + removal is by collected indices via `swap_remove_back`, with a `pos < self.events.len()` guard + (~79). `swap_remove_back` reorders, and indices were collected before removal then applied in + reverse — confirm no index aliasing leaves a matching event behind (would be the actual misroute + bug the TODO fears). This is the crux: if extraction is correct, misroute is structurally + impossible; if not, the assertion catches it. +- **Is the dropped-on-send-failure data counted anywhere?** The `error!` log (~1688/1702/1713) is the + only signal; there may be no counter for "events lost because the source could not dispatch them," + unlike the interconnect `events_discarded_total`. If uncounted, the loss is fully silent — worth a + finding and a counter. +- **Does a persistent downstream failure here wedge the source** ("continue to fail ... until the + process is restarted")? If so, this also feeds the fail-stop/shutdown story (slug + shutdown-drains-no-loss / supervision §2). 
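
The index-aliasing question in the first open question can be explored in isolation with the standard library's `VecDeque::swap_remove_back`. The sketch below is a toy over `u32` values, not the SUT's `extract`; the helper name and the `reverse` switch are hypothetical and exist only to contrast the two application orders.

```rust
use std::collections::VecDeque;

/// Collects the indices of matching elements first, then removes them with
/// `swap_remove_back`. With `reverse == true` indices are applied from highest
/// to lowest; with `reverse == false` they are applied in ascending order.
fn extract_by_indices(
    buf: &mut VecDeque<u32>,
    pred: impl Fn(&u32) -> bool,
    reverse: bool,
) -> Vec<u32> {
    let mut indices: Vec<usize> = buf
        .iter()
        .enumerate()
        .filter(|(_, v)| pred(*v))
        .map(|(i, _)| i)
        .collect();
    if reverse {
        indices.reverse();
    }
    let mut taken = Vec::with_capacity(indices.len());
    for pos in indices {
        // `swap_remove_back` returns None for an out-of-range index,
        // which plays the role of the `pos < len` guard described above.
        if let Some(v) = buf.swap_remove_back(pos) {
            taken.push(v);
        }
    }
    taken
}
```

Applied in descending order, every element swapped in from the back is one that did not match, so no match is left behind. Applied in ascending order, a matching element can be swapped into an already-visited slot while its recorded index goes stale. If the SUT applies indices in reverse as the reading suggests, it falls in the first case; that remains to be confirmed against `event_buffer.rs`.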
diff --git a/test/antithesis/scratchbook/properties/tag-filterlist-applied-consistently.md b/test/antithesis/scratchbook/properties/tag-filterlist-applied-consistently.md new file mode 100644 index 00000000000..4ddeacfdc62 --- /dev/null +++ b/test/antithesis/scratchbook/properties/tag-filterlist-applied-consistently.md @@ -0,0 +1,105 @@ +# tag-filterlist-applied-consistently + +## Origin + +Coverage-gap analysis of the DogStatsD transform chain. `tag_filterlist` removes/retains tags +per-metric and claims Datadog-Agent equivalence, but has two correctness subtleties no property +covers: (1) it filters **only Counter and sketch** metrics — gauges/rates/sets are deliberately +**not** filtered (`mod.rs:235-237`); and (2) it serves results from a **context cache** keyed by the +original context, which must always agree with a fresh filter computation and must be invalidated on +config reload. A wrong metric-type predicate, or a stale cache entry, silently retains tags the +operator intended to drop (cardinality/PII leak) or drops tags that should remain. + +## Code paths + +- `bin/agent-data-plane/src/components/tag_filterlist/mod.rs` + - **Type gate:** `if metric.values().is_sketch() || matches!(metric.values(), + MetricValues::Counter(_))` (`mod.rs:235-237`). Only sketches + counters are considered; every + other metric type bypasses filtering entirely. + - **Filter logic:** `filter_metric_tags` (`mod.rs:299-318`) — looks up `filters.get(name)`; on a + rule, `retain_tags` + `retain_origin_tags` with `should_keep_tag` (`mod.rs:289-291`, + `is_exclude != names.contains(tag.name())`). Filters **both instrumented and origin tags**. + - **Context cache:** `context_cache: Cache>`, capacity 100_000 + (now operator-tunable via the `aggregator_tag_filter_cache_capacity` config key, PR #1771; eviction + just forces recompute, not incorrectness), TTI 30s (`mod.rs:40-42,204-214`). 
Hit path replaces + context with the cached filtered context + (`mod.rs:240-247`); miss path computes, then caches `None` (NoChange) or `Some((filtered, n))` + (`mod.rs:248-263`). + - **Compile/merge rules:** `compile_filters` (`mod.rs:111-140`) — same name+action unions tag sets; + conflicting actions → **exclude wins**; empty `metric_name` entries are dropped. + - **Reload:** on `watch_for_updates("metric_tag_filterlist")` (`mod.rs:222,274-277`) it rebuilds + `self.filters` **and** `self.context_cache = build_context_cache()` (cache invalidated on reload — + good, but see `filter-config-reload-correct` for the lag/partial/clear hazards that gate whether + the *new* filters are even applied). + - Agent reference comments cite `pkg/aggregator/time_sampler.go` equivalence for the sibling + post-aggregate filter; the tag filter targets per-metric tag stripping. + +## Failure scenario + +- **Cache divergence:** the cached filtered context for name X disagrees with a fresh + `filter_metric_tags(X)` — e.g. a metric whose tagset differs but collides on the cache key, or a + cache entry that survives a config reload window. The cached (stale/wrong) tagset is applied, + silently producing a different tag set than the current rules dictate. +- **Type-gate divergence:** a metric type the Agent *would* filter (or would *not*) is treated + differently by ADP's Counter+sketch-only gate, so a tag the operator listed is retained on (say) a + gauge that ADP never filters — silent cardinality/PII leak; or dropped on a type the Agent leaves + alone. +- **Merge divergence:** include/exclude conflict for the same name resolves differently than the + Agent ("exclude wins" + first-exclude-wins, `mod.rs:127-135`), changing which tags survive. + +All silent — only `tag_filterlist_*` telemetry counters move; no error, no drop. + +## Property + +- **Type:** Safety. 
+- **Invariant:** + - `Always(cache-hit filtered tags == fresh filter_metric_tags result)` — SUT-side: on a cache hit, + recompute and assert the cached context's tags equal the freshly-filtered tags. Catches stale/ + colliding cache entries. + - `Always(only Counter and sketch metrics ever have tags removed by this transform)` / + `AlwaysOrUnreachable(a gauge/rate/set leaves tag_filterlist with tags identical to input)` — + pins the deliberate type gate so a future refactor that widens/narrows it is caught. + - Differential (optional, ride Add-on 2): `Always(post-filter (name, tags) within ratio of Agent + time-sampler tag filtering)` for the same `metric_tag_filterlist` config and a corpus spanning all + metric types — the strongest equivalence check. + - `Sometimes(a tag was removed)` + `Sometimes(cache hit served a filtered result)` for non-vacuity. +- **Antithesis angle:** sustained mixed-type metric load (counters, sketches, gauges, rates, sets) + with overlapping names that stress the context cache (eviction at 100_000 / TTI 30s), interleaved + with config reloads (compose with `filter-config-reload-correct`) and node-throttling to widen the + reload-vs-apply window. Timing exploration surfaces cache entries that outlive the reload that + should have invalidated them. +- **Priority:** Medium (High if the Counter+sketch type gate is found to diverge from the Agent). + +## Config dependencies + +- `metric_tag_filterlist` (array of `{metric_name, action: include|exclude, tags:[...]}`); set + identically on the Agent baseline for the differential facet. +- Dynamic config enabled (remote-agent mode) only for the reload-interaction facet; the cache-vs-fresh + and type-gate invariants can run with a static config in the primary topology. +- Corpus must include **all** metric value types to exercise the type gate. + +## Open Questions + +- Does the Datadog Agent restrict tag filtering to the same Counter+sketch subset, or does it filter + other metric types too? 
Pivotal for the type-gate invariant and the differential facet. + `(needs Agent-source confirmation)` +- The context cache is keyed by the **full original `Context`** (`mod.rs:204`); can two metrics with + the same name+tags but different *origin tags* collide and get the wrong filtered result, given + filtering also touches origin tags (`mod.rs:308`)? Needs a cache-key audit. +- Cache TTI is 30s; under a config reload the cache is rebuilt (`mod.rs:276`), but only **when the + reload event actually fires** — if the reload is Lagged-dropped (`filter-config-reload-correct` + Hazard 1) the cache is *not* rebuilt and stale filtered contexts persist up to TTI. Confirm this + compound failure. +- Does "exclude wins on conflict, first-exclude-wins" (`mod.rs:127-135`) match the Agent's merge + semantics for duplicate metric-name entries? + +### Investigation Log + +- Examined: full `tag_filterlist/mod.rs` incl. the type gate (`235-237`), cache hit/miss/insert + (`240-263`), `filter_metric_tags`/`should_keep_tag` (`289-318`), `compile_filters` merge rules + (`111-140`), and the reload arm rebuilding both `filters` and `context_cache` (`274-277`). +- Found: a Counter+sketch-only type gate, a 100k-entry / 30s-TTI context cache on the hot path, and + exclude-wins merge — all claiming Agent equivalence with only self-consistent unit tests. The + cache-vs-fresh and type-gate facets are SUT-side invariants; the equivalence facet rides the + differential harness. Distinct from `filter-config-reload-correct` (which owns the *reload + mechanism* hazards) — this property owns the *filter application* correctness. 
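
The keep/drop rule quoted under "Filter logic" (`is_exclude != names.contains(tag.name())`) is compact enough to misread, so a small standalone sketch may help. It assumes tags are plain `name:value` strings and that the name is the part before the first colon; the helper functions are hypothetical and do not use the SUT's `Tag` or `Context` types.

```rust
use std::collections::HashSet;

/// Tag name is assumed to be the part before the first ':' (e.g. "env" in "env:prod").
fn tag_name(tag: &str) -> &str {
    tag.split(':').next().unwrap_or(tag)
}

/// Exclude rule: keep a tag only if its name is NOT listed.
/// Include rule: keep a tag only if its name IS listed.
fn should_keep_tag(is_exclude: bool, names: &HashSet<&str>, tag: &str) -> bool {
    is_exclude != names.contains(tag_name(tag))
}

/// Applies the rule to a tag list and returns the retained tags in order.
fn filter_tags<'a>(is_exclude: bool, names: &HashSet<&str>, tags: &[&'a str]) -> Vec<&'a str> {
    tags.iter()
        .copied()
        .filter(|tag| should_keep_tag(is_exclude, names, tag))
        .collect()
}
```

A cache-versus-fresh check in the spirit of the first invariant would compare the cached tag list against the output of a function like `filter_tags` on every cache hit.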
diff --git a/test/antithesis/scratchbook/properties/topology-ready-before-intake.md b/test/antithesis/scratchbook/properties/topology-ready-before-intake.md new file mode 100644 index 00000000000..bc423f43551 --- /dev/null +++ b/test/antithesis/scratchbook/properties/topology-ready-before-intake.md @@ -0,0 +1,122 @@ +--- +slug: topology-ready-before-intake +title: Topology becomes ready before data intake begins +family: Lifecycle Transitions & Configuration +type: Liveness + Safety (ordering) +priority: High +status: assertion-missing +sut_commit: 042f41db3bd97118c38981765fd49696fce9d318 +--- + +# topology-ready-before-intake + +## Origin + +SUT analysis §2 ("Build → spawn lifecycle … The topology starts accepting data only +after `health_registry.all_ready()`") and §5 Liveness #3 ("Topology starts accepting +data only after all components report ready"). Owned by the Lifecycle agent. + +## Files / functions / lines + +- `bin/agent-data-plane/src/cli/run.rs:218-251` — startup ordering: + - `run.rs:219-225`: internal supervisor is spawned (`run_with_shutdown`). + - `run.rs:227-235`: `select!` waits on `health_registry.all_ready()` for the *internal + supervisor* to become healthy before proceeding. If the supervisor task completes first + (`early_result`), returns an error (`generic_error!("Internal supervisor completed + unexpectedly…")`). + - `run.rs:238`: `let built_topology = blueprint.build().await?;` — topology is only **built** + after the internal supervisor is ready. + - `run.rs:239`: `let mut running_topology = built_topology.spawn(&health_registry, memory_limiter).await?;` + — components (incl. the `dsd_in` source that opens listeners) are spawned here. + - `run.rs:242-251`: a *second* `all_ready()` wait, spawned as a detached task, logs + `topology_ready_ms` and emits startup metrics once the full topology reports ready. 
+- `lib/saluki-core/src/health/mod.rs:354-375` — `HealthRegistry::all_ready()` / `check_all_ready()`: + resolves only when **every** registered component's `is_ready()` is true + (`shared.ready` AND `health != Dead`). Empty registry resolves immediately. +- `lib/saluki-core/src/health/mod.rs:49-66` — `Health::mark_ready()` flips the per-component + `ready` atomic and `notify_waiters()` so `all_ready()` re-checks. +- `lib/saluki-core/src/topology/built.rs:158-410` — `BuiltTopology::spawn` wires interconnects + and spawns each component into a `JoinSet` (`spawn_component`, ~666-687). A source only begins + reading sockets after its own task runs and marks itself ready. + +## Key observation / honest framing + +The ordering guarantee is **subtle and only partial** as literally stated by the SUT analysis. +What the code actually does: + +1. The DSD source's listeners are bound and accept loop starts when the **source component task** + runs (after `spawn()` at run.rs:239), independent of whether *other* components (e.g. + `dsd_agg`, `dd_out`) have marked ready. +2. The data path is gated not by an explicit "do not read until all_ready" check in the source, + but by **backpressure**: bounded `mpsc` edges + `memory_limiter.wait_for_capacity().await` + (SUT §4). A source that reads before downstream is ready will block on `Dispatcher::send`. +3. The *observable* "topology ready" signal (`topology_ready_ms` log, startup metrics at + run.rs:244-251) fires only after `all_ready()`. + +So the precise, defensible property is **ordering of readiness milestones**, not "zero bytes +read before all_ready". A truthful assertion: + +- **Always / ordering:** the `topology_ready_ms` milestone (full `all_ready`) is reached, and the + internal-supervisor `all_ready` (run.rs:229) always precedes `blueprint.build()` (run.rs:238). + i.e. the topology is never **built/spawned** before the internal supervisor reports ready. 
+- **Sometimes(all_ready reached):** at least once the full topology reaches `all_ready` (the + detached task at run.rs:242 logs `topology_ready_ms`) — proves the readiness path is live. + +Overclaiming "no data processed pre-ready" would be **wrong**: the source can read and then block +on backpressure; data may sit in-flight in channels before downstream `all_ready`. Frame the +assertion around the readiness-milestone ordering + reaching ready, not byte-level intake gating. + +## Failure scenario (Antithesis angle) + +- Delay/stall a downstream component's readiness (e.g. forwarder `dd_out` blocked on a slow/dead + mock intake during init, or aggregate slow to initialize). Verify the process either reaches + `all_ready` eventually (liveness) or stays observably "Waiting for topology to become healthy" + without crashing. +- Fault: internal supervisor child fails to initialize → run.rs:232 `early_result` branch returns + error before topology is built. Assert the topology was **never spawned** in that case + (Unreachable: "topology spawned after internal-supervisor init failure"). +- Timing: with Antithesis scheduling, confirm `sup_ready_ms` (run.rs:230) is logged before any + `topology_ready_ms` (run.rs:245). + +## Config dependencies + +- `data_plane.enabled` / `dp_config.enabled()` (run.rs:152) — if not enabled, ADP exits before + building topology (no readiness milestones at all). Assertions must not fire in that case. +- Pipelines enabled (`dp_config.*_pipeline_required`, run.rs:414-457) determine which components + register with the health registry, i.e. what `all_ready` is gating on. +- Memory mode (`MemoryMode::default()==Disabled`, SUT §7) affects `memory_limiter`; doesn't change + ordering but affects backpressure behavior. + +## Assertion (MISSING — net-new instrumentation) + +No Antithesis SDK assertions exist (existing-assertions.md). 
Proposed SUT-side instrumentation in +`run.rs`: +- After run.rs:230 (internal supervisor healthy) and before run.rs:238: record a monotonic + milestone flag `internal_supervisor_ready`. +- Inside the detached task at run.rs:243 after `all_ready()`: `assert_reachable!` / + `Sometimes("topology reached all_ready")` and `assert_always!`/`Always("internal supervisor + ready before topology build", internal_supervisor_ready == true)`. +- For the negative path: at run.rs:232 (`early_result`) before returning the error, the code never + reaches `spawn()`; an `assert_unreachable!` placed *after* `spawn()` keyed on + "internal_supervisor_failed_to_initialize" would be hard to express in-process — better tested + via workload-side log assertion (`port_listening` should be false / no `topology_ready_ms` log). + +## Open questions + +- Does the `dsd_in` source mark itself ready *before* or *after* binding its listeners? If listeners + bind during `initialize()` (before `mark_ready`), a client could connect pre-ready. Need to read + `lib/saluki-components/src/sources/dogstatsd/mod.rs` listener-bind vs mark_ready ordering to know + whether "accepting connections before ready" is observable. WHY IT MATTERS: determines if the + truthful property is "ordering of milestones" only, or can be strengthened to "no socket bound + before ready". WHAT CHANGES: the assertion strength and whether a workload `port_listening` probe + pre-ready is a valid falsification. +- Is there any scenario where a component marks ready, processes data, then a *later*-registered + component drops the aggregate `all_ready` back to false? `register_component` can happen while + `all_ready` waits (mod.rs:347-353 docstring). WHY IT MATTERS: readiness is not latched; the + milestone could flap. Confirm all data components register before the run.rs:243 wait. + +## SUT-side instrumentation needs + +- Antithesis SDK dependency must be added (none today). 
+- A monotonic `internal_supervisor_ready` flag readable at the topology-ready milestone, or rely on + log-ordering assertions (`sup_ready_ms` before `topology_ready_ms`). diff --git a/test/antithesis/scratchbook/property-catalog.md b/test/antithesis/scratchbook/property-catalog.md new file mode 100644 index 00000000000..bfcf0f7cbae --- /dev/null +++ b/test/antithesis/scratchbook/property-catalog.md @@ -0,0 +1,723 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space — headline guarantees and gap analyses that seed properties. + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/pages/6497671050/What+Comes+After+DogStatsD + why: "ADP will not crash under load, losing customer data" — root guarantee for the memory + data-loss families. + - path: https://datadoghq.atlassian.net/browse/DADP + why: ADP Jira project for tracked gaps/regressions. + - path: https://github.com/DataDog/saluki/pull/1768 + why: PR review #4393897611 (Copilot) — priority alignments and the aggregate-panic-fixed-upstream update reconciled here. +--- + +# Property Catalog: Agent Data Plane (ADP) + +35 properties across 7 categories. The system makes one headline guarantee — **"ADP will not +crash under load, losing customer data"** — which decomposes into the *Memory & Resource Bounds* +and *Data Integrity & No Silent Loss* families. The remaining categories cover aggregation +correctness, lifecycle/config, untrusted-input parsing, concurrency, and **transform & enrichment +correctness** (Category G, added after evaluation — ADP as a *transformer*, not just a transport). 
+ +> **Evaluation note (2026-05-28):** a 4-lens portfolio evaluation added 8 properties (G1 events/ +> service-checks; G2 transform-chain + runtime filter config-reload), applied 9 refinements, and +> escalated one scope bias (traces/APM/logs/OTLP coverage). See `evaluation/synthesis.md`. + +**Only bootstrap/workload SDK probes exist so far** (`existing-assertions.md`: 6 call sites — an +ADP `antithesis_init()` + bootstrap `assert_reachable!` behind the `antithesis` feature, and two +workload-side `assert_reachable!`/`assert_sometimes!` pairs in the harness). Every `Invariant` +below is still **net-new** SUT-side instrumentation. Several properties are **expected to fail by +design** under default config (memory limiter disabled, interner heap-fallback enabled, disk +persistence off) — these are flagged; they are the highest-value findings, not catalog errors. + +Provenance tags `[Fn]` after each slug name the discovery focus that surfaced it: +`[RB]` resource boundaries, `[DL]` data-loss/recovery, `[AG]` aggregation/sketch, +`[LC]` lifecycle/config, `[RC]` replay/codec/concurrency, `[WC]` wildcard (from SUT analysis). + +--- + +## Category A — Memory & Resource Bounds + +The "deterministic resource usage" / "won't OOM" half of the headline guarantee. Critical finding: +the bound is a **startup assertion about declared sizes**, not a runtime invariant; the runtime +limiter is advisory (≤25ms backoff, 250ms sampling, cooperative), disabled by default, and the +interner spills to the heap by default. This category probes whether RSS is *actually* bounded. 
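
To give a feel for why a 250ms sampling interval matters, the sketch below estimates how much memory can accumulate between two RSS samples. It is back-of-the-envelope arithmetic under a strong assumption (a constant net allocation rate and no effective backoff within the interval); the function names and the rate are hypothetical, and only the 250ms figure comes from the text above.

```rust
/// Sampling interval quoted in the text, in milliseconds.
const SAMPLE_INTERVAL_MS: u64 = 250;

/// Bytes that can accumulate between two consecutive RSS samples at a constant
/// net allocation rate, assuming no backoff takes effect in that interval.
fn growth_between_samples(net_alloc_bytes_per_sec: u64) -> u64 {
    net_alloc_bytes_per_sec * SAMPLE_INTERVAL_MS / 1000
}

/// RSS at the next sample if the process sat exactly at `limit_bytes`
/// when the previous sample was taken.
fn rss_at_next_sample(limit_bytes: u64, net_alloc_bytes_per_sec: u64) -> u64 {
    limit_bytes + growth_between_samples(net_alloc_bytes_per_sec)
}
```

Under these assumptions, a burst allocating 400 MB/s net adds about 100 MB between samples, which is one reason a tolerance band above the configured limit may be needed even on a healthy run.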
+ +### rss-bounded-under-cardinality — RSS bounded under high cardinality +> **Status (2026-05-29): WORKLOAD WIRED + ROOT CAUSE REPRO'D** — `parallel_driver_send_dogstatsd` +> (high-cardinality regime) floods distinct contexts in the Antithesis harness to drive this +> behavioral bug under a run; and the root cause is reproduced as a unit test in +> `lib/saluki-context/src/resolver.rs` +> `tests::bug_default_heap_fallback_makes_context_resolution_unbounded` (default heap fallback ⇒ +> resolution never refuses ⇒ unbounded memory). Not fixed. +| | | +|---|---| +| **Type** | Safety (expected to FAIL by design under default config) | +| **Property** | Under a high-cardinality tag / many-distinct-timestamp flood, process RSS stays within the configured memory grant. | +| **Invariant** | `Always(rss <= grant.effective_limit_bytes() * tolerance)`, read from the same `process_memory::Querier` the limiter uses. `Always` fits — RSS-within-grant must hold on every check. SUT-side `Sometimes(backoff_applied && rss_still_climbing)` localizes the mechanism better than a workload RSS probe. | +| **Antithesis Angle** | 250ms RSS sampling + 25ms max advisory backoff (from 95%) means bursts blow past the limit between samples; only the source cooperates, the aggregate hot path never calls `wait_for_capacity`, and the interner heap-fallback default makes growth effectively unlimited. Scheduling/timing exploration surfaces the burst-vs-sample race the deterministic harness can't. | +| **Why It Matters** | Directly tests "won't crash under load" / deterministic resource usage; failure means OOM under cardinality floods. | +| **Priority** | High | + +**Open Questions:** +- Acceptable RSS tolerance band over `effective_limit_bytes` (95%+25ms+250ms implies real overshoot even when healthy). +- With `allow_context_heap_allocs=false`, does RSS actually become bounded, or do aggregate-map / per-context `SmallVec`s still escape under the many-timestamp flood? 
`(partial: context count is bounded to context_limit exactly — confirmed; per-value memory still unaccounted)` +- Would the container OOM-kill ADP before the assertion fires (crash vs. over-grant reading)? +- _Resolved:_ nothing enables a memory limit by default — `memory_mode` defaults to `Disabled` (limiter is a no-op); cgroup auto-detect requires a non-empty `DOCKER_DD_AGENT` env var AND `memory_limit` unset AND a successful cgroup parse (`accounting.rs:107-121`). Confirms the fails-by-design framing. + +### aggregate-context-limit-enforced — Aggregate context limit enforced +| | | +|---|---| +| **Type** | Safety (expected to HOLD) | +| **Property** | The aggregation map never exceeds `aggregate_context_limit` (default 1,000,000) live contexts; over-cap new contexts are dropped-and-counted; existing contexts always merge. | +| **Invariant** | `Always(contexts.len() <= context_limit)` — true `Always` (no path grows past the cap). Plus `AlwaysOrUnreachable(existing context never dropped by cap)`, and `Sometimes(context_limit_breached)` / `Sometimes(events_dropped_on_cap)` to prove the boundary is reached (avoid vacuity). | +| **Antithesis Angle** | Interleave a cardinality flood with flush timing and counter zero-value keep-alives (kept-alive counters still occupy slots) to exercise hitting/clearing the breach flag and re-admitting contexts — timing-sensitive. | +| **Why It Matters** | This hard, always-on, lock-free cap is the central memory-determinism lever for the aggregator (the one non-advisory runtime bound). | +| **Priority** | High | + +**Open Questions:** +- _Resolved:_ the true live bound is exactly `context_limit` (not `limit + zero_value_count`). 
Zero-value keep-alive counters stay as ordinary entries in the single `contexts` map and DO count toward the cap (`mod.rs:568`; test `context_limit_with_zero_value_counters` at `mod.rs:1104-1157`); `len()` drops in the flush removal pass (`mod.rs:703-707`) when entries are empty AND past `counter_expire_secs`. The `Always(len <= context_limit)` target is correct. +- Caveat (not a question): the cap counts contexts, not bytes — one context with many timestamped values is one entry but unbounded value memory; prose must not overclaim "bounded memory" (see `rss-bounded-under-cardinality`). + +### interner-full-bounded — Interner-full bounded vs. heap spill +> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test (shared with `rss-bounded-under-cardinality`) — +> `lib/saluki-context/src/resolver.rs` `tests::bug_default_heap_fallback_makes_context_resolution_unbounded` +> shows the default heap-allowed mode never refuses (unbounded) while heap-disallowed refuses (bounded +> but lossy). Not fixed. +| | | +|---|---| +| **Type** | Safety (heap-disallowed: HOLDS; heap-allowed default: FAILS the bounded reading) | +| **Property** | When the fixed-size interner is full and heap allocations are disallowed, context resolution fails deterministically (metric dropped) instead of allocating; with heap allowed (the default), memory is no longer bounded. | +| **Invariant** | Heap-off: `AlwaysOrUnreachable(interner_full ⇒ metric dropped, no heap alloc)` (rare/optional path). `Sometimes(try_intern == None)` proves exhaustion is reached. Heap-on: `Sometimes(intern_heap_fallback > 0)` proves the unbounded spill path is reachable under default config. SUT-side needed to distinguish interned / inlined / heap-fallback / dropped. | +| **Antithesis Angle** | Small interner + high-cardinality flood fills the buffer; timing exploration stresses the loom-tested reclamation/tombstone path under concurrent intern-vs-drop. 
| +| **Why It Matters** | Interner determinism is the foundation of the context memory bound; the default flag flips ADP into the unbounded branch, voiding the bounded-memory guarantee silently. | +| **Priority** | High | + +**Open Questions:** +- Does fragmentation make `try_intern` return `None` below nominal byte capacity under churn (earlier spill than the budget implies)? +- Is the real bound the sum across name + tag interning (they share one interner)? +- _Resolved:_ `dogstatsd_allow_context_heap_allocs` defaults to **true** (`sources/dogstatsd/mod.rs:149-151`; resolver fallback also true, `resolver.rs:258`); bounded mode (`with_heap_allocations(false)`) appears **only in `#[cfg(test)]`**. So bounded mode is opt-in/test-only and "bounded memory" is aspirational under default config. + +### memory-limiter-survives-rss-read-failure — Memory limiter survives RSS read failure +| | | +|---|---| +| **Type** | Safety / fault-tolerance (expected to FAIL by design) | +| **Property** | If RSS becomes unreadable mid-run, memory protection remains active (or the failure is surfaced) rather than silently freezing. | +| **Invariant** | `Unreachable("limiter RSS read failed — protection lost")` at the `.expect()` site (`limiter.rs:100-102`) — the panic-the-checker-thread state is a critical failure that must never be observed (today it can be). Fix-dependent alternative: `Sometimes(rss_read_failed_and_surfaced)` + liveness check that `active_backoff` is still being updated. Needs SUT-side instrumentation. | +| **Antithesis Angle** | Inject `/proc`/RSS read failure mid-run; the damaging race is reads failing *before* RSS crosses threshold, freezing backoff at 0 (fail-open). The bare `std::thread` death doesn't trigger process shutdown — silent. | +| **Why It Matters** | The limiter is already the only runtime memory mechanism; silently disabling it removes the last guard against OOM under load. 
| +| **Priority** | **High** (R9; upgraded — the user confirmed custom `/proc` faults are enabled for the tenant, so the failure state is reachable). Still requires the limiter to be explicitly enabled and a SUT-side assertion, since the bare-thread death is otherwise unobservable. | + +**Open Questions:** +- Can `Querier::resident_set_size()` actually return `None`/error *after* succeeding at startup on the Antithesis Linux target, or only via injected `/proc` corruption? Pivotal for priority. +- Is the frozen backoff more likely 0 (fail-open) or nonzero (fail-stuck, over-throttle)? Opposite symptoms. +- Should correct behavior be "keep last-known protection" or "fail loudly and restart" (data components are fail-stop; s6 restarts ADP)? Changes Unreachable(panic) vs. Reachable(clean restart) framing. + +### retry-queue-bounded-under-outage — Retry queue bounded under outage +| | | +|---|---| +| **Type** | Safety (byte cap) + Liveness (bound implies counted data loss) | +| **Property** | During a prolonged intake outage the forwarder retry queue (in-memory + disk) stays within its configured byte caps; overflow drops oldest (counted), never grows unbounded. | +| **Invariant** | `Always(total_in_memory_bytes <= max_in_memory_bytes)` and the analogous disk-bytes `Always` (eviction loops make these true invariants). `Sometimes(items_dropped > 0)` / `Sometimes(persisted_entries_dropped > 0)` prove saturation is reached and the bound is load-bearing. Optional `Reachable` on the "entry too large to ever fit" branch. | +| **Antithesis Angle** | Hold mock intake down (refused / black-hole / 5xx / slow) under sustained load until the queue saturates; interleaving stresses the shared circuit-breaker backoff + per-endpoint queues + disk eviction. | +| **Why It Matters** | Tests "won't crash, won't lose data" at its sharpest tension — memory bounded by design means prolonged outage forces counted-but-real data loss. 
| +| **Priority** | High | + +**Open Questions:** +- Assert per-endpoint cap or aggregate `num_endpoints * cap` (+ disk)? Fan-out multiplies the bound; matters for RSS protection. +- Is there a time-based eviction policy (queue-duration) beyond the byte cap, like the Agent's `queue_duration_capacity`? +- _Resolved:_ corrupt/torn files are dropped and skipped without wedging the queue, and the byte cap holds across drops (see `disk-persisted-retry-survives-restart`); the disk-init-failure fallback keeps the in-memory byte cap and is surfaced only via an `error!` log. + +--- + +## Category B — Data Integrity & No Silent Loss + +The "won't lose customer data" half of the headline. Covers the internal backpressure path, egress +delivery, crash durability, event routing, and shutdown drain. + +### no-silent-interconnect-drop — No silent inter-component drop on a wired edge +| | | +|---|---| +| **Type** | Safety | +| **Property** | Under sustained load with a slow downstream, a correctly-wired interconnect edge applies backpressure (await) and never silently discards events; backpressure propagates back to the socket. | +| **Invariant** | `Always(events_discarded_total delta == 0)` for a connected output under load. The discard branch fires only when `senders.is_empty()`, so a wired edge can never discard. Pair with `Sometimes(backpressure engaged)` so the test proves it reached the full-channel state. Not a blanket `Unreachable` on the discard site — disconnected outputs discard legitimately. | +| **Antithesis Angle** | Throttle the forwarder/intake so encoder→forwarder fills, cascading backpressure up every bounded mpsc edge to the DSD read loop; verify queue-and-await instead of drop. | +| **Why It Matters** | Directly the "no silent loss" half of the headline guarantee on the hottest internal path. | +| **Priority** | High | + +**Open Questions:** +- Do any production DSD outputs ever have zero senders (e.g. conditional `dsd_debug_log_out`/`dsd_stats_out`)? 
If so, scope `Always` to always-wired outputs. +- _Resolved:_ `interconnect_capacity` default is **128** event buffers (`DEFAULT_INTERCONNECT_CAPACITY`, `topology/mod.rs:37`; every non-source edge is `mpsc::channel(128)`). No ADP override. This sizes the burst before backpressure; tunes the test, not the property. +- Exclude the non-atomic multi-sender partial-delivery case (a closed channel at teardown, not a full one) from the assertion window. + +### forwarder-eventual-delivery — Eventual delivery after transient intake outage +> **Status (2026-05-29): PARTIALLY WIRED** — `finally_verify_delivery` asserts the fault-free +> eventual-delivery liveness baseline (`Sometimes(delivered>0)`); the post-outage-recovery facet +> (inject a 5xx/timeout storm, then heal) is a later iteration. +| | | +|---|---| +| **Type** | Liveness | +| **Property** | After a transient intake outage (5xx/timeouts/connection resets) clears, every accepted retryable transaction is eventually delivered, provided the retry queue did not overflow. | +| **Invariant** | `Sometimes(all-accepted-retryable-delivered-after-recovery)`: at least once, post-recovery delivered count equals accepted-retryable count submitted before/during the outage (no overflow). Plus `Reachable` on the `Error::Open` re-enqueue site to prove the breaker engaged. Liveness → assert progress, not an instantaneous invariant. | +| **Antithesis Angle** | Inject a bounded 5xx/timeout/connection-reset storm, then restore 2xx; circuit-breaker backoff + re-enqueue to low-priority queue must recover the backlog. | +| **Why It Matters** | Egress data-loss surface; retry path only unit-tested against in-process mocks with a virtual clock. | +| **Priority** | High | + +**Open Questions:** +- Size the outage shorter than `queue_max_size_bytes` overflow, or exclude oldest-dropped txns; overflow is the intended bounded-memory escape valve (a test-setup constraint, not a property uncertainty). 
+- _Resolved:_ production forwarder requests are always `Clone` (`FrozenChunkedBytesBuffer`; `clone_request` returns `Some` unconditionally) → retryable failures always take the `Error::Open` re-enqueue path, never the non-cloneable `Error::Service` drop. +- _Resolved:_ the circuit-breaker backoff is **per-endpoint** (each `run_endpoint_io_loop` builds its own `Arc>`), so one slow endpoint cannot serialize others' recovery. + +### disk-persisted-retry-survives-restart — Disk-persisted retry survives kill+restart +| | | +|---|---| +| **Type** | Liveness (with no-duplication + poison-drop safety sub-clauses) | +| **Property** | With disk persistence enabled, retry-queue transactions survive a process kill+restart and are eventually delivered with no systemic loss or duplication; corrupt entries are dropped, not retried forever, without aborting recovery. | +| **Invariant** | `Sometimes(persisted-backlog-fully-recovered)`: post-restart delivered set covers the persisted backlog, deduped. `AlwaysOrUnreachable(poison-dropped)`: any corrupt on-disk entry is dropped and recovery proceeds — never infinite retry, never abort. `Reachable(persistence-active)`: the silent in-memory fallback did NOT fire, else the test is vacuous. | +| **Antithesis Angle** | SIGKILL mid-outage (s6 restarts ADP), restore intake, reconcile delivery; separately inject a corrupted on-disk entry to exercise poison handling and torn-write recovery. | +| **Why It Matters** | Durability across crash is never tested system-level; delete-before-return and silent fallback are subtle correctness levers. | +| **Priority** | High | + +**Open Questions:** +- At-most-once window: delete-before-return then crash-before-send loses one in-flight txn — the reconcile must tolerate small (~per-endpoint-concurrency) slack, not assert exact equality. 
+- _Resolved:_ a torn/partial write across a crash yields a valid filename + truncated content → `serde_json` fails → entry **dropped** (warn + `entries_dropped++`) on read; the scan `continue`s past any number of bad files, so **one bad file never wedges recovery** and the byte cap is never violated (dropping decrements the total). Note: `push` writes non-atomically straight to the final path (`persisted.rs:184`, despite a stale "temporary file" comment). +- _Resolved:_ the disk-init-failure → in-memory fallback is surfaced **only as an `error!` log (io.rs:406), no metric** — the workload must log-scrape or `assert_unreachable` at the fallback to keep the durability test non-vacuous; the in-memory byte cap still holds after fallback. + +### source-dispatch-no-misroute — DSD source dispatch never mis-routes events +| | | +|---|---| +| **Type** | Safety | +| **Property** | A mid-buffer dispatch failure in the DogStatsD source never mis-routes eventd/service-check events into the metrics output path (or vice versa). | +| **Invariant** | **(R8) Primary facet — silent loss:** `Sometimes(dispatch_failed)` + an SUT-side check that a dispatch failure is **counted** (today the `error!` logs at `mod.rs:1688/1702/1713` may be the only signal — unlike interconnect `events_discarded_total`). **Secondary facet — misroute:** `Unreachable(misroute)` (an eventd/service-check Event reaching the metrics dispatch, or vice versa) — structurally improbable with the current `extract`-then-`send_all` ordering, so it functions as a future-refactor guard, not the live hazard. | +| **Antithesis Angle** | Mixed eventd+service-check+metric buffer while forcing a downstream output to error mid-`send_all`; verify failures are accounted (not silently dropped) and no cross-output leakage. 
| +| **Why It Matters** | The live hazard is **silent, uncounted loss** on dispatch failure (a finding); wrong-stream delivery would additionally corrupt data semantics, and `.expect()` on outputs is a latent crash. | +| **Priority** | Medium | + +**Open Questions:** +- Can `extract` (swap_remove_back + collected indices) ever leave a matching event behind? This is the crux — if extraction is correct, misroute is impossible. +- Is dispatch-failure loss counted anywhere? The `error!` logs may be the only signal — possibly fully silent (a finding). +- Does a persistent downstream failure wedge the source ("until the process is restarted"), feeding the fail-stop story? + +### shutdown-drains-no-loss — Shutdown drains accepted events to the forwarder +| | | +|---|---| +| **Type** | Liveness (with a safety boundary at the 30s timeout) | +| **Property** | Every event accepted before the shutdown signal that has reached a flushed window is forwarded before clean shutdown, provided the topology drains within the 30s grace window. | +| **Invariant** | `Sometimes(clean-drain-no-loss)`: at least once, shutdown completes cleanly within 30s and all pre-signal flushed-window events reach the mock intake. `AlwaysOrUnreachable(timeout-implies-forceful)`: whenever the 30s timeout fires, the run loudly reports forceful stop — in-flight loss is never silent. NOT a blanket Always-no-loss: open-window default-drop and forceful-stop are designed losses. | +| **Antithesis Angle** | Shutdown under load; combine with a slow/blocked intake to push past 30s and exercise the forceful-stop boundary; with disk persistence on, verify backlog persists instead of being lost. | +| **Why It Matters** | Graceful-shutdown drain is a stated guarantee; interacts with forwarder/disk-persistence at the timeout boundary. 
| +| **Priority** | High | + +**Open Questions:** +- Does the source finish dispatching its current read buffer before signaling done, or are just-accepted events still in the source lost on clean shutdown? +- Does the final aggregate flush on stream close emit only closed windows, and do they clear the encoder→forwarder before the deadline under load? +- Realistic drain time at max load with healthy intake — if it approaches 30s, the clean case is fragile and 30s adequacy is itself a finding. +- Are PassthroughBatcher-buffered (pre-timestamped) metrics flushed on stop or dropped? + +### events-sc-no-silent-loss — Events and service-checks delivered without silent loss +| | | +|---|---| +| **Type** | Liveness (with a Safety no-silent-drop clause) | +| **Property** | Under intake backpressure/outage, the events and service-checks sub-paths apply backpressure (never silently drop on a wired edge) and, after a transient outage clears, every accepted event/SC that did not legitimately overflow is eventually delivered. | +| **Invariant** | Safety `Always(no silent drop on a wired events/SC edge under load)` + Liveness `Sometimes(all-accepted-events-delivered-after-recovery)` / `Sometimes(all-accepted-SC-delivered-after-recovery)` reconciling `component_events_received_total{message_type}` vs `events_sent`. Two silent-loss sites need SUT-side coverage: the encoder recoverable-error drop (`events/mod.rs:179-194`, undercounted) and the wrong-type swallow (`try_into_eventd`→`Continue`, uncounted). No aggregation stage → ~1:1 accept→deliver (modulo `MAX_EVENTS_PER_PAYLOAD=100`), a cleaner reconcile than metrics. | +| **Antithesis Angle** | Throttle/down the mock intake so the always-on `dsd_in.{events,service_checks} → *_enrich → dd_*_encode → dd_out` edges fill; verify queue-and-await, then restore and reconcile. Extends `no-silent-interconnect-drop` / `forwarder-eventual-delivery` to two always-on edges the catalog ignored. 
| +| **Why It Matters** | The "won't lose customer data" guarantee on two always-on production paths no other property watches. | +| **Priority** | High | + +**Open Questions:** +- Is the encoder recoverable-error branch ever hit on healthy intake with well-formed events, or only on oversized single events? (Decides the optional `Always` guard.) +- Do events/SC retries share `dd_out`'s per-endpoint queue with metrics (cross-stream eviction by a metric flood)? +- Are events/SC requests `Clone` (retryable failures take the re-enqueue path), as confirmed for metrics? +- Does `dispatch_events` count anything when `send_all` errors, or is dispatch-time loss fully silent? (ties to `source-dispatch-no-misroute`) + +### events-sc-pipeline-reachable — Events and service-check sub-pipelines are actually exercised +| | | +|---|---| +| **Type** | Reachability | +| **Property** | At least once per run, a well-formed event and a well-formed service-check are parsed/accepted at the source and delivered through the encoder to the intake — so the event/SC safety/liveness properties cannot pass vacuously. | +| **Invariant** | `Sometimes(event_parsed_and_accepted)` / `Sometimes(service_check_parsed_and_accepted)` (source `component_events_received_total{message_type=events|service_checks}`) + `Sometimes(event_delivered)` / `Sometimes(service_check_delivered)`. Strengthen to `Reachable` if the workload guarantees ≥1 well-formed event + SC. This is the **R4 anti-vacuity anchor** for `events-sc-no-silent-loss` and `malformed-event-sc-no-crash`. | +| **Antithesis Angle** | Anti-vacuity anchor for a metrics-dominated workload; also catches a wiring/`EnablePayloads`-default regression that silently removes an always-on path. | +| **Why It Matters** | The catalog is otherwise metrics-only; events/SC are rare in real traffic, so the new event/SC properties need an explicit reachability obligation to mean anything. 
| +| **Priority** | Medium | + +**Open Questions:** +- Delivery anchor on encoder `events_sent` vs the mock intake receiving the `/api/v1/events_batch` / service-check POST? Intake observation is stronger but needs the mock to distinguish endpoints. +- One anchor per stream, or four (event/SC × parsed/delivered) to localize a parse-but-not-deliver regression? + +--- + +## Category C — Aggregation & Sketch Correctness + +ADP must match the Datadog Agent bit-for-(approximately-)bit. The diff-test suite checks happy-path +equivalence; these properties target correctness under faults, edge cases, and timing — plus a +guaranteed-crash clock hazard (`aggregate-clock-skew-stable` forward-jump) the suite cannot reach. +(The former sub-second-window divide-by-zero crash is now fixed upstream, PR #1772, and survives only +as a regression tripwire.) + +### aggregate-matches-agent — Aggregated output matches the Datadog Agent +| | | +|---|---| +| **Type** | Safety | +| **Property** | For the same input stream, ADP's aggregated output (counter→rate via bucket width, half-open `[start,start+width)` buckets, histogram/distribution stats) equals the Datadog Agent's, and equivalence is preserved under faults. | +| **Invariant** | Workload/harness-side `Always(diff within ratio)` on the normalized `stele` diff per flush window, with faults active; `Sometimes(fault injected during window)`. Differential property — anchored on the `panoramic`/`stele` diff harness, not a single in-process assertion. | +| **Antithesis Angle** | Diff-test equivalence under delayed/skipped flush, process kill+restart mid-window, and downstream backpressure — faults the deterministic `panoramic` harness cannot inject. | +| **Why It Matters** | Wrong aggregates are silent, customer-visible data corruption; the headline correctness guarantee. 
| +| **Priority** | High | + +**Open Questions:** +- Can `panoramic` survive an ADP restart mid-run, and is `FLUSH_WAIT=32s` enough once faults delay flushes (timing-artifact false diffs)? +- Is the Agent baseline's bucket width guaranteed identical to ADP's `aggregate_window_duration_seconds`? +- Are zero-value counters emitted identically by both sides across a skipped flush? + +### aggregate-no-panic-any-window — No window duration causes a panic +> **Status (2026-05-30): FIXED UPSTREAM on main** — the `% 0` panic vector is now structurally +> impossible. The config key is renamed `aggregate_window_duration_seconds` and deserializes as +> `NonZeroU64` (`transforms/aggregate/mod.rs:95-98`); `align_to_bucket_start` takes a `NonZeroU64` +> and divides by `bucket_width_secs.get()` (`:822-823`), so zero/sub-second values fail config +> parsing instead of reaching the divisor (PR #1772). The earlier repro +> `tests::bug_sub_second_aggregate_window_panics_on_insert` is therefore **stale** — see the bug +> ledger (the repro lives in a sibling commit and should be dropped or converted to a passing guard). +> Property retained as a low-cost regression tripwire. +| | | +|---|---| +| **Type** | Safety / Reachability | +| **Property** | No `aggregate_window_duration_seconds` value causes a panic; the divisor is never zero. | +| **Invariant** | `Unreachable("align_to_bucket_start reached with bucket_width_secs == 0")` as a regression tripwire — should never fire now that the type is `NonZeroU64`. Fires only if a future refactor reintroduces a zero-able window or a finer-grained (sub-second) divisor. | +| **Antithesis Angle** | Cheap: explore the (now `NonZeroU64`) config space and confirm no divisor-zero path is reachable. Primary value is guarding against a regression, not finding the original (closed) bug. | +| **Why It Matters** | The original guaranteed crash + restart loop is closed; the tripwire keeps it closed. 
| +| **Priority** | Low (R-2026-05-30: demoted from High — panic vector closed upstream; retained as a regression guard). | + +**Open Questions:** +- _Resolved (2026-05-30):_ the fix landed as a type change (`NonZeroU64`), not runtime clamping, so zero/sub-second windows are rejected at config load. +- Can the gRPC dynamic-config stream push `aggregate_window_duration_seconds` at runtime? Even so, a non-`NonZeroU64` value cannot deserialize, so there is no live divisor-zero vector. + +### aggregate-clock-skew-stable — Aggregation stays sane across wall-clock skew +> **Status (2026-05-29): BUG DEMONSTRATED** (forward-jump facet) as a unit test — +> `lib/saluki-components/src/transforms/aggregate/mod.rs` +> `tests::bug_forward_clock_jump_floods_zero_value_points`. A forward wall-clock jump makes `flush` +> build `zero_value_buckets` over the whole jumped interval (O(jump) work/alloc) and flood one idle +> counter with points proportional to the jump. Backward-jump gap facet not yet repro'd. Not fixed. +| | | +|---|---| +| **Type** | Safety | +| **Property** | A backward/forward wall-clock jump never floods zero-value points nor silently breaks counter continuity; bucketing stays bounded and well-formed. | +| **Invariant** | `Always(zero_value_buckets.len() <= ceil(flush_interval/window)+slack)` and `Always(current_time >= last_flush)` inside `flush`; `Sometimes(clock jumped during flush)`. CONFIRMED two-clock hazard: wall-clock bucketing vs monotonic flush cadence, no monotonicity guard; forward jump floods the zero-value loop, backward jump empties it. | +| **Antithesis Angle** | Clock fault injection (NTP step backward/forward) during a steady counter stream. | +| **Why It Matters** | Metric flood + memory spike (forward) or silent continuity gap and mis-expiry (backward); diverges from Agent. | +| **Priority** | High | + +**Open Questions:** +- Fix: monotonic source vs clamp `current_time = max(., last_flush)` + cap loop? 
Decides `Unreachable(flood)` vs `Always(bounded)`. +- Any guard against `get_unix_timestamp()` returning 0 (pre-epoch via `unwrap_or_default`)? None found. +- Does the Agent behave identically under the same step (ties to `aggregate-matches-agent`)? + +### ddsketch-bin-count-bounded — Bin count never exceeds bin_limit +| | | +|---|---| +| **Type** | Safety | +| **Property** | After any inserts / multi-weight inserts / interpolations / merges, an agent `DDSketch`'s bin count never exceeds `bin_limit` (4096). | +| **Invariant** | `Always(self.bins.len() <= bin_limit)` after every mutating method (post-`trim_left`); `Reachable("trim_left collapsed bins")`. CONFIRMED enforced today by `trim_left` at every mutation site; already has unit + proptests. SUT-side. | +| **Antithesis Angle** | Live regression tripwire on real sketches after arbitrary interleavings (histogram→distribution, cross-window merges) — catches a new mutator that forgets `trim_left`. The `Reachable(trim_left collapsed bins)` anchor is **essential** or the `Always` is vacuous (real corpora rarely exceed 4096 keys). | +| **Why It Matters** | Bin explosion → memory blowup and historically an encoder panic. | +| **Priority** | Medium (R6: demoted from High — substantially duplicates existing proptests; the unique value is a live tripwire for a future `trim_left`-forgetting mutator). | + +**Open Questions:** +- Is test-only `insert_raw_bin` (bypasses `trim_left`) truly unreachable in release? +- _Resolved:_ only the **agent sketch (4096)** is on the live path; `ddsketch::DDSketch` re-exports the agent impl (`lib.rs:56`). The canonical sketch (2048 bins) has no non-test usage. Assert against the agent `bin_limit` (4096). 
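
The pairing of an `Always` bound with a `Reachable` anchor in `ddsketch-bin-count-bounded` can be sketched on a toy structure. This is a minimal illustration under stated assumptions, not the agent `DDSketch`: `ToySketch`, its fields, and the exact collapse policy are hypothetical, and plain `assert!` plus a counter stand in for the Antithesis SDK assertions.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a sketch: bin key -> count.
/// This is NOT the agent DDSketch; it only models "trim after every mutation".
struct ToySketch {
    bins: BTreeMap<i32, u64>,
    bin_limit: usize,
    /// How many times trimming actually collapsed a bin. A non-zero value
    /// plays the role of the `Reachable("trim_left collapsed bins")` anchor.
    trims: u64,
}

impl ToySketch {
    fn new(bin_limit: usize) -> Self {
        assert!(bin_limit > 0, "bin_limit must be at least 1");
        Self { bins: BTreeMap::new(), bin_limit, trims: 0 }
    }

    /// Mutating entry point: insert, trim, then check the bound.
    fn insert(&mut self, key: i32, n: u64) {
        *self.bins.entry(key).or_insert(0) += n;
        self.trim_left();
        // Stand-in for `Always(bins.len() <= bin_limit)` after every mutation.
        assert!(self.bins.len() <= self.bin_limit);
    }

    /// Assumed collapse policy: fold the lowest-keyed bin into the next
    /// lowest one until the limit holds. Total count is preserved.
    fn trim_left(&mut self) {
        while self.bins.len() > self.bin_limit {
            let (_, count) = self.bins.pop_first().expect("len > limit >= 1");
            let mut lowest = self
                .bins
                .first_entry()
                .expect("at least one bin remains after popping");
            *lowest.get_mut() += count;
            self.trims += 1;
        }
    }

    fn total(&self) -> u64 {
        self.bins.values().sum()
    }
}
```

Inserting ten distinct keys into `ToySketch::new(4)` keeps `bins.len()` at or below 4, preserves the total count, and leaves `trims` non-zero. If the workload never pushes past the limit, `trims` stays at zero and the bound check passes without ever exercising the trimming path, which is the vacuity the `Reachable` anchor is meant to rule out.
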
+ +### ddsketch-relative-error-bound — Quantile accuracy + merge associativity +| | | +|---|---| +| **Type** | Safety | +| **Property** | For in-range values, the sketch's quantile estimates are within configured relative error (eps≈0.78%) and merges are associative/commutative — a **library invariant** of the agent DDSketch, since ADP does not query quantiles on the live path (see below). | +| **Invariant** | Harness/SUT-side `Always(|q_est - v| <= eps_rel*|v|)` for in-range inputs and `Always(merge order-independent within tolerance)`, exercised against the agent sketch directly (not a production runtime call). The lower-value live facet — faithful bin serialization + bin count ≤ 4096 — is already covered by `ddsketch-bin-count-bounded`. eps=1/128, gamma=1+2·eps confirmed; `avg`/`sum` in `merge` are f64 order-sensitive. | +| **Antithesis Angle** | Merge order under interleaving (delayed flush / backpressure reordering) and accuracy at the representable-range boundary — invisible to the single-order diff test. | +| **Why It Matters** | Quantile/avg drift is silent wrong customer data. | +| **Priority** | Low (R5: demoted Medium→Low — the accuracy guarantee is a library/proptest invariant, not a live ADP runtime invariant; ADP never calls `DDSketch::quantile` on the customer path). | + +**Open Questions:** +- Acceptable f64 tolerance for `avg`/`sum` under reordered merges vs diff-test's 1e-8? +- _Resolved:_ ADP does **not** call `DDSketch::quantile` on the live customer path — histogram percentiles use raw-sample `HistogramSummary::quantile`; distribution sketches are serialized as raw bins and quantiled server-side. Only the agent sketch is live. So this property is a library/harness-side invariant, not a production runtime assertion. + +### ddsketch-no-nan-poison — NaN never silently poisons sum/avg +> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test — +> `lib/ddsketch/src/agent/sketch.rs` `tests::bug_nan_sample_poisons_sum_and_avg`. 
A single NaN sample +> permanently poisons `sum`/`avg` (sticky) while `count`/`min`/`max` stay valid (silent corruption); +> no finiteness guard at the sketch boundary. Not fixed (demonstration only). +| | | +|---|---| +| **Type** | Safety / Reachability | +| **Property** | A single NaN sample must never silently poison a sketch's `sum`/`avg`; for finite input, sum/avg stay finite, and the boundary rejects/sanitizes non-finite values. | +| **Invariant** | `Always(v.is_finite())` at `adjust_basic_stats` entry (or `Unreachable` on absorbed-NaN path) + backstop `Always(self.sum.is_finite())`. CONFIRMED no finiteness guard in `adjust_basic_stats`; `sum += v*n` makes NaN sticky; `key(NaN)` still yields a valid bin; `insert_n` called directly from aggregate. Guarded only per-source. SUT-side. | +| **Antithesis Angle** | **A confirmed LIVE bypass:** a `checks_ipc` Histogram metric carrying a NaN value (`checks_ipc/mod.rs:195`, no finiteness check) routes `checks_ipc_in.metrics → metrics_enrich → dd_metrics_encode`, bypassing both the DSD `FloatIter` and the aggregate transform, and reaches `ddsketch.insert_n(...)` in the Datadog metrics encoder (`encoders/datadog/metrics/mod.rs:1054`) → `adjust_basic_stats` → `sum += NaN`. Poison then propagates through `merge`. | +| **Why It Matters** | Permanent silent corruption of sum/avg across sketches and downstream — and it is reachable today, not hypothetical. | +| **Priority** | High | + +**Open Questions:** +- Fix policy: reject/skip (match codec drop) vs clamp; confirm against Agent baseline. +- _Resolved:_ the hazard is **LIVE** via `checks_ipc` Histogram metrics (above). OTLP is closed (number values filtered by `is_skippable`; histogram sketches reconstruct finite bounds) and the DSD `aggregate insert_n` path is closed (DSD-only, finiteness-filtered upstream). The robust fix/assertion belongs at the **sketch boundary** (`adjust_basic_stats`/`insert*`), justified by the concrete checks_ipc bypass. 
Target sum/avg (not `quantile`, which has its own NaN fallback); ±Inf is in scope (`is_finite` covers both). + +--- + +## Category D — Lifecycle & Configuration + +Startup ordering, the no-timeout config-stream wait, fail-fast on incompatible config, bounded +graceful shutdown, and the fail-stop model (data components are not restarted). + +### topology-ready-before-intake — Topology becomes ready before data intake begins +| | | +|---|---| +| **Type** | Liveness + Safety (ordering) | +| **Property** | The internal supervisor reaches `all_ready` before the data topology is built/spawned, and the full topology eventually reaches `all_ready`. | +| **Invariant** | `Always(internal-supervisor-ready precedes blueprint.build())` + `Sometimes(full topology reaches all_ready)`. Ordering of readiness milestones is the defensible guarantee. | +| **Antithesis Angle** | Stall a downstream component's readiness (forwarder blocked on dead intake at init) or fail an internal-supervisor child to init; verify build/spawn never precedes supervisor-ready. | +| **Why It Matters** | If the topology spawned before dependencies were ready, sources could read into an unwired/uninitialized pipeline. | +| **Priority** | High | + +**Open Questions:** +- Does `dsd_in` bind/accept on its listeners *before* `mark_ready`? Determines whether the property strengthens from "milestone ordering" to "no socket bound pre-ready." +- Readiness is not latched; a late-registered component could drop `all_ready` back to false. Confirm all data components register before the wait. + +### config-stall-no-deadlock — Config-stream stall does not deadlock or busy-loop startup +> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test — +> `lib/saluki-config/src/lib.rs` `tests::bug_config_ready_hangs_forever_without_snapshot`. With +> dynamic config enabled and the sender held open but silent, `GenericConfiguration::ready()` never +> resolves (no internal timeout) → ADP startup would hang forever. Not fixed. 
+| | | +|---|---| +| **Type** | Liveness | +| **Property** | When the Core Agent config stream is delayed/dropped/erroring, ADP either progresses to "Initial configuration received" or remains quiescently blocked at the wait — never crashing or busy-looping. | +| **Invariant** | `Sometimes("config received")` + `Reachable("config wait entered")`; no panic, no busy-loop (workload-side CPU/log-rate check on the quiescent hang). CONFIRMED no timeout on `ready()` nor bootstrap registration await. | +| **Antithesis Angle** | Drop the snapshot (registered but never streamed) → **quiescent indefinite hang** at `ready().await` (no timeout) — the primary falsification target; flap stream → 5s reconnect. | +| **Why It Matters** | ADP assumes Core Agent reachability; a true indefinite hang with no timeout is operationally surprising (ADP never starts the pipeline, no diagnostic deadline). | +| **Priority** | High | + +**Open Questions:** +- _Resolved (busy-loop hazard downgraded):_ a steadily-erroring stream does NOT busy-loop. The tonic `Streaming` yields at most one `Err` per stream (initial error fuses to `Terminated`; mid-stream error yields once then `None`), so the loop logs once, `continue`s once, then exits to the **5s sleep** (`remote_agent.rs:302`). No unbounded spin. +- _Resolved:_ `init_reg_rx.await` is always eventually resolved — `session_id` always starts empty so the first tick takes the register branch, which always sends Ok/Err on `initial_registration_tx`. +- Confirmed real gap: `GenericConfiguration::ready()` (`lib.rs:694-704`) and the bootstrap registration await have **no timeout** — an open-but-silent stream hangs startup forever by design. + +### config-incompatible-refuses-start — High-severity incompatible config refuses to start +| | | +|---|---| +| **Type** | Safety (Reachability / Unreachable) | +| **Property** | ADP never spawns the data pipeline when a high-severity-incompatible non-default config key is present; it exits 1 instead. 
| +| **Invariant** | `Unreachable("pipeline spawned with high-severity incompatible non-default key")` + `Reachable("ADP refused to start")`. The gate `check_and_warn_config` runs **before** create_topology/build/spawn; its `Err` → `exit(1)`. | +| **Antithesis Angle** | Inject a current `Severity::High` non-default key; expect exit 1, no listener, no data. Negative controls: same key at default → starts; Medium/Low/Partial → starts. | +| **Why It Matters** | Running with an incompatible setting risks wrong aggregates / silent data corruption — fail-fast is the safety stance. | +| **Priority** | Medium (R7: demoted from High — a deterministic ordered gate already covered by the integration suite's config-check-exit-code cases; the `Unreachable` is statically unreachable, so the real artifact is the `Reachable(refused)` exploration). | + +**Open Questions:** +- Confirm env-var overrides are visible to the classifier at the gate. +- _Resolved:_ partial config updates over the stream are NOT re-validated (the gate runs once at startup, `run.rs:157`; `ConfigClassifier` is referenced only in `run.rs`). This property is correctly scoped to **startup**; the runtime gap is now tracked as its own property `config-runtime-update-not-revalidated`. + +### config-runtime-update-not-revalidated — Runtime config updates bypass the incompatibility gate +| | | +|---|---| +| **Type** | Safety (Reachability / scope gap) | +| **Property** | A high-severity-incompatible non-default config key delivered over the runtime config stream is never applied silently — or, if startup-only gating is intentional, the unguarded runtime-apply path is at least documented and observable. | +| **Invariant** | `Unreachable("pipeline running with high-severity incompatible non-default key after a runtime config update")`, or `Reachable` on the unguarded runtime-apply path if the design is startup-only. 
The startup gate (`check_and_warn_config`, `run.rs:157`) is the only classifier callsite; runtime `Partial`/`Snapshot` updates are applied without re-validation. | +| **Antithesis Angle** | Start ADP clean (passes the gate), then inject a config-stream update carrying a high-severity-incompatible non-default key; observe whether ADP refuses/flags or silently applies it. Exercises the control-plane → data-plane config path the diff-test never touches. | +| **Why It Matters** | Running with an incompatible setting risks wrong aggregates / silent data corruption — the exact outcome the startup gate prevents. The protection has a runtime hole. | +| **Priority** | Medium | + +**Open Questions:** +- Is startup-only gating intentional (runtime updates trusted as authoritative from the Core Agent) or an oversight? `(needs human input)` +- Can a `Severity::High` key actually be delivered over the config stream, or does the Core Agent pre-filter what it sends to remote agents? Determines real-world reachability. + +### graceful-shutdown-within-30s — Graceful shutdown completes within 30s without forceful kill +| | | +|---|---| +| **Type** | Liveness (bounded-time) + Reachability | +| **Property** | On SIGINT (or unexpected component finish) under bounded in-flight load, the data topology stops cleanly within the 30s grace window without the forceful-stop path. | +| **Invariant** | `Sometimes(clean topology shutdown completed)` + `Reachable(forceful-stop path under adversarial load)`. Distinct from `shutdown-drains-no-loss` (which owns *what data survives*); this owns *clean completion in time*. | +| **Antithesis Angle** | SIGINT under bounded load → clean exit; SIGINT with forwarder wedged on dead intake → forced-stop after 30s → exit 1; schedule a component to finish at the 30s boundary. | +| **Why It Matters** | Forceful kill drops in-flight data and risks state corruption; bounded clean shutdown is the operational contract. 
| +| **Priority** | High | + +**Open Questions:** +- The **internal supervisor** shutdown has **no timeout** — the process could exceed 30s even when the topology met it. Scope assertions to topology-shutdown completion, or file a separate property. +- On forceful stop the `JoinSet` is dropped, aborting tasks; confirm no shared-state corruption (overlaps data-loss family). +- Confirm the `shutdown_coordinator` cascade reliably reaches every component. + +### data-component-failure-triggers-process-shutdown — Data component finish triggers whole-process shutdown +| | | +|---|---| +| **Type** | Safety (Always) + Reachability | +| **Property** | Because data components have no restarting supervisor, any data component finishing unexpectedly deterministically triggers whole-topology shutdown — never a silent half-running pipeline. | +| **Invariant** | `Reachable(unexpected-finish → shutdown path)` + temporal `Always(component-death always followed by process exit)`. Data topology uses a plain `JoinSet` with no restart; supervisor restart is internal-only. | +| **Antithesis Angle** | Induce a component panic (hot-path `.expect`/`unreachable!`; note the former sub-second `aggregate_window_duration` panic vector is closed upstream) or a clean early finish; verify fail-stop fires and the process exits (s6 restarts it). | +| **Why It Matters** | A half-running pipeline silently drops/mis-routes data while appearing alive; fail-stop + full-process restart is the recovery model. | +| **Priority** | High | + +**Open Questions:** +- Brief gap between spawn and the `select!`: a component dying there is still buffered by the `JoinSet` and caught once the select polls — confirm safe. +- `expect("no components to wait for")` panics on an empty topology; confirm a spawned topology is always non-empty (guarded by `data_pipelines_enabled`). Low priority. 
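
The fail-stop model described in `data-component-failure-triggers-process-shutdown` can be illustrated with a minimal, standard-library-only sketch. It is not ADP's implementation: the entry above says the data topology uses a `JoinSet` and a `select!`, whereas this sketch substitutes OS threads and an `mpsc` channel so it runs without any runtime. The names `Exit`, `run_component`, and `run_topology` are hypothetical and exist only for illustration. The idea shown is that the first component to finish, for any reason, raises a shared shutdown flag, so no component keeps running in a half-alive pipeline.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// How a (hypothetical) component ended.
#[derive(Debug, PartialEq)]
enum Exit {
    /// Returned on its own, before any shutdown was requested.
    Unexpected(&'static str),
    /// Stopped because the shared shutdown flag was raised.
    Cooperative(&'static str),
}

/// Runs one component until it either finishes by itself after
/// `ticks_before_exit` ticks (`None` means "never") or sees the shutdown flag.
fn run_component(
    name: &'static str,
    ticks_before_exit: Option<u32>,
    shutdown: Arc<AtomicBool>,
) -> Exit {
    let mut ticks = 0;
    loop {
        if shutdown.load(Ordering::SeqCst) {
            return Exit::Cooperative(name);
        }
        if let Some(limit) = ticks_before_exit {
            if ticks >= limit {
                return Exit::Unexpected(name);
            }
        }
        ticks += 1;
        thread::sleep(Duration::from_millis(1));
    }
}

/// Spawns every component with no restart logic. The first exit of any kind
/// triggers shutdown of the rest. Returns exits in the order they were observed.
fn run_topology(components: Vec<(&'static str, Option<u32>)>) -> Vec<Exit> {
    let shutdown = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for (name, limit) in components {
        let tx = tx.clone();
        let shutdown = Arc::clone(&shutdown);
        handles.push(thread::spawn(move || {
            let exit = run_component(name, limit, shutdown);
            // The receiver outlives every worker, so a send error is ignored.
            let _ = tx.send(exit);
        }));
    }
    // Drop the original sender so the channel closes once all workers finish.
    drop(tx);

    let mut exits = Vec::new();
    // Fail-stop: whichever component reports first takes the topology down.
    // An empty topology closes the channel immediately and yields no exits.
    if let Ok(first) = rx.recv() {
        shutdown.store(true, Ordering::SeqCst);
        exits.push(first);
    }
    // Drain the remaining components as they observe the flag.
    for exit in rx {
        exits.push(exit);
    }
    for handle in handles {
        let _ = handle.join();
    }
    exits
}
```

Calling `run_topology(vec![("dsd_in", None), ("aggregate", Some(3)), ("forwarder", None)])` reports `aggregate` as the unexpected finisher first, followed by cooperative exits for the other two. In an instrumented build, the branch that stores the shutdown flag is roughly where a `Reachable(unexpected-finish → shutdown path)` anchor would sit; the exact placement in ADP is an assumption here, not something this sketch establishes.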
+ +--- + +## Category E — Untrusted Input Parsing + +The DogStatsD codec and the new (zero-suite-coverage) capture/replay reader parse untrusted bytes. +Antithesis treats malformed input as a first-class fault dimension. + +### malformed-dsd-no-crash — Malformed DSD packets never crash process/socket +| | | +|---|---| +| **Type** | Safety | +| **Property** | Malformed/adversarial DogStatsD packets on any listener never crash the process or kill a connectionless socket; bad packets are skipped/cleared-and-continued. | +| **Invariant** | `Always(process up)` + `Always(connectionless socket survives a bad packet)` + `Unreachable` at codec panic sites (`unreachable!`/`from_utf8_unchecked`); `Sometimes(framing_errors>0)`, `Sometimes(*_decode_failed>0)`. | +| **Antithesis Angle** | Untrusted packet input across UDP/TCP/UDS-dgram/UDS-stream; oversized / invalid-UTF8 / truncated-extension / NUL / huge-tag packets exploring codec + framing error paths. | +| **Why It Matters** | Clear-and-continue is the socket-survival mechanism; codec error policy is undecided (4 TODOs). Only a non-exhaustive proptest covers this today. | +| **Priority** | High | + +**Open Questions:** +- Codec error policy for unknown/trailing chunks is undecided (silently permissive); resolving it changes expected-drop accounting, not no-crash. +- Does a malformed packet ever cause partial dispatch / mis-routing (ties to `source-dispatch-no-misroute`)? +- TCP oversized frame `break`s the connection — verify no per-connection resource leak. + +### replay-no-panic-on-malformed-capture — Replay never panics on malformed capture +| | | +|---|---| +| **Type** | Safety | +| **Property** | Parsing an arbitrary/corrupt/truncated/zstd DogStatsD capture file never panics or aborts the replay process. | +| **Invariant** | `Unreachable` at any panic/abort in the reader or its zstd/prost decode calls; pair with `Reachable(replay-parse-executed)`. 
A panic on untrusted input is a crash, invisible to a workload that already expects a non-zero exit — needs SUT-side instrumentation. | +| **Antithesis Angle** | Untrusted file input + adversarial bytes / zstd-bomb / protobuf-recursion; pure input exploration of an unfuzzed, zero-suite-coverage path. | +| **Why It Matters** | Newest/largest ADP feature (+1765 LOC, validated only with `cargo check`); reader parses untrusted files inside an ADP process. | +| **Priority** | High | + +**Open Questions:** +- _Resolved:_ replay runs as a **separate `agent-data-plane dogstatsd replay` CLI process** (`dogstatsd.rs:394`) that forwards records to the running ADP over the DSD UDS socket — so the panic-catch / SUT assertion belongs in the replay CLI process, not the data-plane process. +- _Resolved (now first-class hazards, not just questions):_ `reader.rs:40-41` `fs::read` loads the whole file with **no size guard** (OOM vector), and `reader.rs:44` `zstd::stream::decode_all` runs on untrusted input with **no decompressed-size cap** (decompression-bomb vector → unbounded memory on a valid-but-huge stream). Both overlap the memory family and are strong workload inputs. + +### replay-corruption-not-silent-eof — Corruption distinguishable from clean EOF +> **Status (2026-05-29): BUG DEMONSTRATED** as a unit test — +> `lib/saluki-components/src/sources/dogstatsd/replay/reader.rs` +> `tests::bug_corrupt_length_prefix_silently_drops_following_records`. A corrupt/oversized length +> prefix is treated as clean EOF, silently dropping all following well-formed records. Not fixed +> (demonstration only). +| | | +|---|---| +| **Type** | Safety (data fidelity) | +| **Property** | A corrupt/oversized record length prefix is detectable as truncation, not silently reported as a clean replay completion. 
| +| **Invariant** | `AlwaysOrUnreachable(faithful completion)`: when `read_next` terminates with `Ok(None)`, the offset reached the real trailer separator, not an overrunning/mid-stream prefix; `Sometimes(corruption-detected)`. Honestly framed: code intentionally returns `Ok(None)` today (tests assert it). | +| **Antithesis Angle** | Untrusted file input; flip/zero a length prefix mid-stream and observe the tool report success having sent only N of M records. | +| **Why It Matters** | The reader collapses legitimate-EOF, truncation, and corruption into one `Ok(None)`; the driver stops sending silently → false replay-fidelity confidence. | +| **Priority** | Medium | + +**Open Questions:** +- No record-count/total-length field exists — distinguishing truncation from clean EOF may need a format change; determines strict `Always` vs. heuristic. +- Maintainers may consider silent truncation acceptable (best-effort tool); the asserting tests suggest "accepted." `(needs human input)` +- A wrong-but-small `size` could decode garbage as a valid record — a third outcome (silent wrong record, not just truncation). + +### malformed-event-sc-no-crash — Malformed event / service-check payloads never crash +| | | +|---|---| +| **Type** | Safety | +| **Property** | Adversarial/malformed DogStatsD event and service-check payloads on any listener never crash the process or kill a connectionless socket; bad frames are counted and skipped. | +| **Invariant** | `Always(process up)` + `Always(connectionless socket survives a bad event/SC packet)` + SUT-side `Unreachable` at any panic site in `parse_dogstatsd_event` / `parse_dogstatsd_service_check` and shared `helpers::*`; `Sometimes(event_decode_failed>0)`, `Sometimes(service_check_decode_failed>0)` as anchors. Extends `malformed-dsd-no-crash` to the two separate ~394/~312-LOC codecs (per R1, the no-crash check is SUT-side `Unreachable`, not container liveness). 
| +| **Antithesis Angle** | Untrusted event/SC frames across UDP/TCP/UDS: pathological `_e{title_len,text_len}` prefixes with `take()` on attacker lengths (`event.rs:36-49`), invalid UTF-8, the per-packet `.replace("\\n","\n")` heap alloc (amplification under flood), malformed `d:`/`card:` parsers, origin-detection-gated branches. | +| **Why It Matters** | These codecs are entirely separate from the metric codec, so existing coverage gives no assurance; a codec panic on any listener thread crashes a fail-stop data component → crash-loop under flood. | +| **Priority** | High | + +**Open Questions:** +- Do the shared `helpers::*` parsers (`unix_timestamp`, `tags`, `cardinality`, `local_data`, `external_data`, `utf8`) contain any panic/unwrap/length-based pre-allocation? (Pivotal for the `Unreachable` guard.) +- Does `take(title_len)`/`take(text_len)` or the message UTF-8 parser pre-allocate on the declared length before bounds-checking the buffer? +- Can a malformed frame be mis-classified by `parse_message_type` (`mod.rs:1466`), incrementing the wrong decode-failure counter? + +--- + +## Category F — Concurrency & Boundary Conditions + +Interleaving-sensitive paths the deterministic diff-test cannot reach, plus the non-finite-value +boundary that cross-cuts intake and aggregation. + +### interner-reclamation-no-corruption — No corrupt/overlapping interner entries under races +| | | +|---|---| +| **Type** | Safety | +| **Property** | Under concurrent intern + drop-last-ref, the interner never returns overlapping or corrupt (`0x21`-filled) entries; worst case is a benign duplicate. | +| **Invariant** | `Always(no reclaimed entry overlaps a live entry / no live &str reads the 0x21 sentinel)`, i.e. `Unreachable` on the corruption-detected branch; `Sometimes(reclaimed-slot reused)`, `Sometimes(drop re-check found resurrected entry)`. SUT-side. 
| +| **Antithesis Angle** | Concurrency/interleaving under the real scheduler + real load with many shards/entries on a near-full interner — explores beyond the bounded loom model. | +| **Why It Matters** | Most concurrency-unsafe component in the bounded-memory story; raw pointers as `'static &str` into a buffer overwritten on reclaim; loom only bounds interleavings. | +| **Priority** | High | + +**Open Questions:** +- Confirm both the `try_intern` increment and the drop re-check take the same `InternerState` mutex (only race window is decrement→lock, which the re-check covers). +- Cross-shard handles: can a shard-A `InternedString` ever be dropped against shard-B's lock? +- _Resolved:_ the reclamation buffer-fill IS present in release (no cfg gate). **Two implementations use different sentinels:** `map.rs:392` fills `0x21`, `fixed_size.rs:458` fills `0xAA`. An assertion must use the correct sentinel per implementation, or check overlap directly (implementation-independent — preferred). + +### non-finite-values-handled-consistently — Non-finite values consistent, no ghost metric +| | | +|---|---| +| **Type** | Safety | +| **Property** | Non-finite metric values never crash; an all-non-finite packet produces no downstream metric and consumes no interner/cache resources; no NaN reaches a DDSketch. | +| **Invariant** | `Always(value.is_finite() at DDSketch insert boundary)` + `AlwaysOrUnreachable(no zero-point metric reaches aggregation)` + `Sometimes(non-finite dropped)`. Add `Sometimes(ghost-metric path)` ONLY if a codec-bypassing producer is confirmed. | +| **Antithesis Angle** | Untrusted input: all-NaN/Inf and mixed packets across all metric types, checking the source `num_points==0` gate and the sketch boundary. | +| **Why It Matters** | NaN-poisons-sketch and ghost-metric hazards. 
Honestly framed: the **ghost-metric** shape (zero-point metric) is DSD-`FloatIter`-specific and gated (`num_points==0 → Ok(None)` at `handle_frame` `mod.rs:1478`), so on the DSD path it is expected Unreachable. The **NaN-poison** facet, however, is LIVE via a non-DSD producer (see `ddsketch-no-nan-poison`). | +| **Priority** | Medium | + +**Open Questions:** +- Do the empty-iter `*_fallible` constructors return Err on empty input, or an empty value? +- _Resolved:_ the NaN-reaches-sketch hazard is LIVE via `checks_ipc` Histogram metrics (encoder `insert_n`), while OTLP, replay (re-injects through DSD codec), and the DSD aggregate path are all finiteness-gated. The ghost-metric (zero-point) shape stays DSD-specific and gated → expected Unreachable on the DSD path. + +--- + +## Category G — Transform & Enrichment Correctness + +Added after evaluation. The other categories treat ADP as a *transport* (don't crash, don't lose +bytes); this one treats it as a *transformer* — mapping, filtering, and tag-filtering customer data, +much of it driven by **runtime config that mutates while data flows**. This is the design +partner's documented focus (the "Tag Filter RC Relay Stress Test"). These properties need the +**config-stream add-on topology** (not standalone) and/or the **diff-test add-on** (Agent baseline). + +### mapper-output-matches-agent — DogStatsD mapper output matches the Datadog Agent +| | | +|---|---| +| **Type** | Safety | +| **Property** | For the same input names and `dogstatsd_mapper_profiles`, ADP's mapper produces the same rewritten metric name and injected tags as the Datadog Agent mapper. | +| **Invariant** | Differential `Always(mapped (name,tags) within ratio of Agent mapper)` per flush window on the `panoramic`/`stele` harness with a mapper-exercising corpus + identical profiles; `Sometimes(a metric was remapped)` for non-vacuity; SUT-side `Sometimes(cache hit == fresh miss)`.
Mirrored expansion/wildcard logic (`dogstatsd_mapper/mod.rs:259-342`) is only self-tested today. | +| **Antithesis Angle** | Overlapping/ambiguous profiles probing first-match ordering, names at the wildcard char-class boundary, and fault-induced flush-timing skew; runs on the diff-test add-on. | +| **Why It Matters** | Mapper rename feeds every downstream filter (`run.rs:674-675`); a divergence is silent, customer-visible name/tag corruption. | +| **Priority** | High | + +**Open Questions:** +- Does the Agent apply all matching mappings per profile, or only the first (ADP returns on first match, `mod.rs:332`)? `(needs human input)` +- Does the Agent restrict wildcard chars to the same `[a-zA-Z0-9\-_*.]` class? +- The mapper has no `watch_for_updates` — it appears static-only, so its runtime-config facet is likely unreachable (unlike the filters). + +### mapper-interner-bounded — The mapper's second interner is bounded / fails visibly +| | | +|---|---| +| **Type** | Safety (heap-off: silent non-remap hazard; heap-on default: bounded claim fails by design) | +| **Property** | The mapper's own 64 KiB string interner stays bounded; when full, the metric is not silently forwarded under its original (unmapped) identity without accounting. | +| **Invariant** | Heap-off: `AlwaysOrUnreachable(mapper interner full ⇒ metric forwarded under ORIGINAL context, accounted)` + `Sometimes(mapper resolve == None)`. Heap-on (default): `Sometimes(mapper intern heap fallback > 0)`. SUT-side needed (no counter). The `?` at `mod.rs:317-321`/`:277-282` makes resolve-`None` → keep original context with no drop/else. | +| **Antithesis Angle** | Small `dogstatsd_mapper_string_interner_size` + a flood of distinct mappable names fills the mapper interner independently of the source interner; resolver-churn scheduling stress (30s idle expiry). 
| +| **Why It Matters** | A SECOND bounded interner (distinct from `interner-full-bounded`) whose default heap-on voids its declared firm bound, and whose heap-off path silently emits the wrong (pre-map) identity — a load-dependent identity flip. | +| **Priority** | High | + +**Open Questions:** +- The mapper never calls `with_heap_allocations(false)` (defaults true) — intentional, or an oversight making its firm bound unenforceable? `(needs human input)` +- Can the same name be remapped on one call but silently not on the next purely due to interner pressure? + +### filter-config-reload-correct — Live filter config reload applies correctly, never stale or silently cleared +| | | +|---|---| +| **Type** | Safety (data-correctness under live config reload) | +| **Property** | When the Core Agent pushes filter config over the RC stream while metrics flow, ADP applies the new filters to live data — never keeping stale filters nor silently clearing all filtering. | +| **Invariant** | `Always(after a settled filter update, the next metric is filtered per the NEW config)` (SUT-side at apply); `Unreachable("filter update Lagged-dropped with no reconciliation")` on `watcher.rs:61` (no re-read exists); `AlwaysOrUnreachable(tag filtering not silently fully-cleared by an unintended event)`; `Sometimes(reload while metrics in flight)`. Three confirmed hazards: (1) `broadcast::Lagged` warn+continue on a cap-100 broadcast → permanent staleness; (2) partial-deserialize skip; (3) `diff_recursive` is additive so key **deletion fires no event** (stale), while explicit-empty clears `tag_filterlist` but is ignored by the prefix/post-agg filters. | +| **Antithesis Angle** | Burst config updates faster than the filter task drains its receiver (node-throttle `adp` to widen the lag window) interleaved with sustained metric load; explore deletion vs explicit-empty vs malformed-entry. 
**The design partner's explicit stress-test focus.** | +| **Why It Matters** | Stale/half-applied/cleared filtering on live customer data is silent and customer-visible — the single most product-relevant transform property. | +| **Priority** | High | + +**Open Questions:** +- Is the `Lagged` drop accepted as best-effort (a dropped final update is permanent staleness, no re-read)? `(needs human input)` +- Should removing a filter key from RC clear the filter? ADP's additive diff keeps it stale — confirm vs Agent RC semantics. `(needs human input)` +- The `tag_filterlist` (clears on `None`) vs prefix/post-agg (ignores `None`) asymmetry — intended? +- Requires the config-stream add-on; does **not** fire in standalone mode. + +### tag-filterlist-applied-consistently — tag_filterlist applies cached and type-gated filtering consistently +| | | +|---|---| +| **Type** | Safety | +| **Property** | tag_filterlist's context-cache results always agree with a fresh computation, only Counter+sketch metrics are filtered, and output matches the Agent's time-sampler tag filtering. | +| **Invariant** | `Always(cache-hit filtered tags == fresh filter_metric_tags)` (SUT-side, catches stale/colliding cache entries, `mod.rs:240-263`); `AlwaysOrUnreachable(only Counter+sketch metrics have tags removed)` (`mod.rs:235-237`); optional differential `Always(post-filter (name,tags) within ratio of Agent)`; `Sometimes(tag removed)` + `Sometimes(cache hit served filtered result)`. | +| **Antithesis Angle** | Mixed-type load with overlapping names stressing the 100k/30s-TTI context cache, interleaved with reloads (compose with `filter-config-reload-correct`) + node-throttling to widen the reload-vs-apply window so a cache entry outlives the reload that should invalidate it. | +| **Why It Matters** | Counter+sketch-only gate and a hot-path cache, both claiming Agent equivalence with only self-tests; a stale entry or wrong type-gate silently leaks/drops tags (cardinality/PII). 
| +| **Priority** | Medium (High if the type gate diverges from the Agent) | + +**Open Questions:** +- Does the Agent filter the same Counter+sketch subset? `(needs human input)` +- Cache keyed by full `Context` incl. origin tags — can origin-tag-only differences collide? +- If a reload is Lagged-dropped, the cache is not rebuilt and stale filtered contexts persist to TTI — confirm this compound failure. + +### prefix-filter-ordering-matches-agent — Prefix/blocklist and post-aggregate filtering match the Agent's stage split +| | | +|---|---| +| **Type** | Safety (ordering + differential) | +| **Property** | ADP's listener-side prefix/blocklist filter and post-aggregate histogram-series filter run in the correct pipeline order and split responsibility over the shared keys exactly as the Datadog Agent does. | +| **Invariant** | Differential `Always(end-to-end keep/drop + final name within ratio of Agent)`; `AlwaysOrUnreachable(non-histogram-shaped entry not applied post-aggregate)` and converse; `AlwaysOrUnreachable(post_agg never drops a sketch)`; optional blueprint-shape `Always(dsd_prefix_filter between dsd_enrich and dsd_tag_filterlist; dsd_post_agg_filter after dsd_agg)` (`run.rs:674-679`); `Sometimes(prefix added/listener drop/post-agg drop)`. | +| **Antithesis Angle** | Corpus where a name's keep/drop depends on stage order + fault-induced flush-timing skew on the diff run; composes with `mapper-output-matches-agent` and `filter-config-reload-correct`. | +| **Why It Matters** | A past fix "moved DSD prefix/filter in front of enrich" (bug-history-sensitive); the listener-vs-time-sampler split over four shared keys is subtle, fragile, and has no end-to-end regression guard. | +| **Priority** | Medium (High as the ordering-regression tripwire) | + +**Open Questions:** +- Does the Agent split on exactly the `.` shape `contains_filter_entry` uses? 
`(needs human input)` +- Is the prefix-filter-after-mapper ordering load-bearing for equivalence, with any guard besides this property? +- A reload updating one filter but lagging the other could filter at one stage but not the other for the same rule — confirm reachability. + +## Catalog-wide notes + +- **Default config is hostile to the bounded-memory family:** memory limiter disabled + (`MemoryMode::Disabled`), interner heap-spill enabled (`dogstatsd_allow_context_heap_allocs=true`), + disk retry persistence off. Workloads must opt into protective settings to test the "holds" + branches and leave defaults to capture the "fails by design" branches. This is the single most + important workload-configuration decision (see `deployment-topology.md`). +- **One guaranteed-crash finding** needs no fault injection, only clock exploration: + `aggregate-clock-skew-stable` (forward jump → flood) — cheap, high-value first target. Its former + sibling `aggregate-no-panic-any-window` (sub-second window → divide-by-zero) was **fixed upstream** + (window is now `NonZeroU64`, PR #1772) and is retained only as a regression tripwire. +- **Confirmed-live latent bug:** `ddsketch-no-nan-poison` is reachable today via a `checks_ipc` + Histogram metric carrying NaN, which bypasses the per-source finiteness filters and poisons the + encoder sketch's sum/avg permanently. The fix belongs at the sketch boundary. +- **Resource exhaustion via untrusted input:** `replay-no-panic-on-malformed-capture` carries two + confirmed vectors — unguarded whole-file `fs::read` and uncapped `zstd::decode_all` — both reachable + in the separate replay CLI process. +- **Differential property:** `aggregate-matches-agent` is anchored on the existing + `panoramic`/`stele` diff harness, not an in-process SDK assertion. 
+- **SUT-side instrumentation is required** (not optional) for: all crash/panic properties, NaN-at- + sketch-boundary, bin-count, interner-corruption, source-misroute, and limiter-RSS-failure — these + internal states are invisible to a workload-only checker. Existing telemetry counters + (`events_discarded_total`, `framing_errors`, `*_decode_failed`, queue-drop counters) serve only as + `Sometimes`-reachability anchors, not as the safety assertions themselves. +- **(R1) The container's s6 supervisor auto-restarts ADP on exit**, so "process is up" workload + assertions are vacuously green even during a crash-loop. Every no-crash property + (`malformed-dsd-no-crash`, `malformed-event-sc-no-crash`, `replay-no-panic-on-malformed-capture`, + the aggregate-crash pair) must assert SUT-side `Unreachable` at panic sites — or assert on + restart-count — **never** container liveness. +- **(R2, updated 2026-05-30) The Antithesis Rust SDK is now wired into ADP** behind the `antithesis` + cargo feature (`antithesis_init()` + a bootstrap `assert_reachable!`), and the harness binaries + carry workload-side anchors — so the "fork ADP + add the SDK + build an instrumented image" + prerequisite is largely satisfied (the wiring is proven). ~17 properties still need their net-new + in-process SUT-side **invariant** assertions landed on top of that scaffold; the ~10 workload-only + properties (forwarder delivery, retry-queue bounds, shutdown, config-gate, RSS) can run first. +- **(R3) No-loss properties must use TCP or UDS ingress, not UDP** — UDP's inherent packet loss + confounds any "accepted == delivered" reconciliation (`no-silent-interconnect-drop`, + `forwarder-eventual-delivery`, `disk-persisted-retry-survives-restart`, `shutdown-drains-no-loss`, + `events-sc-no-silent-loss`). Reserve UDP for the no-crash properties. 
+- **(R4) Anti-vacuity:** safety properties gated by hard-to-reach `Sometimes` anchors (bin collapse, + interner resurrection race, events/SC reachability) require the workload to force the anchor + config/corpus, and the run synthesizer must report an **unreached `Sometimes` as inconclusive, not + passing**. +- **(G2 topology dependency)** the runtime filter config-reload properties + (`filter-config-reload-correct`, and the reload facets of the tag-filterlist/prefix-filter + properties and `config-runtime-update-not-revalidated`) require the **config-stream add-on + topology** (Core Agent or stub) — they pass vacuously in standalone mode because the config watcher + never fires. + +## Scope (confirmed with user, 2026-05-28) + +- **In scope:** the DogStatsD pipeline end-to-end — metrics, events, service-checks — plus the + `saluki-core` runtime invariants (memory bounds, backpressure, lifecycle, pooling, interning) and + the runtime filter config-reload surface. +- **Deferred (documented exclusion, not an oversight):** the **traces/APM, logs, and OTLP** pipelines + (`run.rs:506-591,700-758`). These are wired in ADP but are not the first-customer (the design partner / Agent + 7.80.0) surface. They carry their own untrusted-input risk (notably the SQL-parsing + `trace_obfuscation/sql.rs` and a second OTLP protobuf source + forwarder) and are the natural next + expansion if/when they enter scope. +- **Fault availability (confirmed enabled for the tenant):** **node termination**, **clock jitter**, + and **custom `/proc` faults** are all enabled — so the crash-recovery + (`disk-persisted-retry-survives-restart`, `data-component-failure-triggers-process-shutdown`), + clock-skew (`aggregate-clock-skew-stable`), and limiter-RSS-failure + (`memory-limiter-survives-rss-read-failure`) properties are realizable rather than vacuous. 
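
A minimal sketch of the verdict rule that R4 and the fault-availability note above both rely on: a safety property gated by a `Sometimes` anchor only counts as passing if the anchor was actually reached. The names `Verdict`, `classify`, `anchor_hits`, and `violations` are hypothetical. They are not part of the Antithesis SDK or the run synthesizer and only illustrate the intended three-way outcome.

```rust
/// Hypothetical three-way outcome for a safety property that is gated by a
/// hard-to-reach `Sometimes` anchor (illustration only, not an SDK type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    /// The anchor was reached at least once and no violation was observed.
    Pass,
    /// At least one violation of the safety assertion was observed.
    Fail,
    /// The anchor was never reached, so the absence of violations proves nothing.
    Inconclusive,
}

/// Classify one property from two counters collected over a run.
/// `anchor_hits` counts how often the gating anchor fired;
/// `violations` counts how often the safety assertion was violated.
fn classify(anchor_hits: u64, violations: u64) -> Verdict {
    if violations > 0 {
        // A violation is a failure regardless of how it was reached.
        Verdict::Fail
    } else if anchor_hits == 0 {
        // Vacuous green: report it as inconclusive, never as passing.
        Verdict::Inconclusive
    } else {
        Verdict::Pass
    }
}
```

The ordering of the branches is the design choice worth noting: a violation always wins, and a clean run is only promoted to `Pass` once the anchor count is non-zero.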
+ +## Open Questions (catalog-wide / analysis-level) + +- "METRIC_CONTROL relay" naming from Confluence has no source identifier — config flows through the + generic snapshot/partial stream. Confirm with the team. +- _Resolved:_ `interconnect_capacity` default = **128** event buffers (`topology/mod.rs:37`). +- _Resolved:_ no protective memory setting is on by default (`memory_mode` = `Disabled`, + `allow_context_heap_allocs` = true) → "bounded memory" is opt-in/aspirational under default config. + This pins Category A's "fails by design under defaults" framing. diff --git a/test/antithesis/scratchbook/property-relationships.md b/test/antithesis/scratchbook/property-relationships.md new file mode 100644 index 00000000000..866de006179 --- /dev/null +++ b/test/antithesis/scratchbook/property-relationships.md @@ -0,0 +1,187 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space — headline guarantees and gap analyses that seed properties. + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/pages/6497671050/What+Comes+After+DogStatsD + why: Root guarantee for the memory + data-loss clusters. + - path: https://datadoghq.atlassian.net/browse/DADP + why: ADP Jira project for tracked gaps/regressions. + - path: https://github.com/DataDog/saluki/pull/1768 + why: PR review #4393897611 (Copilot) flagged the stale property count reconciled here. +--- + +# Property Relationships + +Lightweight clustering of the 35 catalog properties by shared code paths, failure mechanisms, and +suspected dominance. Slugs match `property-catalog.md`. 
+ +## Cluster 1 — Bounded memory (the determinism story) + +Properties: `rss-bounded-under-cardinality`, `aggregate-context-limit-enforced`, +`interner-full-bounded`, `memory-limiter-survives-rss-read-failure`, +`retry-queue-bounded-under-outage`, and (memory facet) `aggregate-clock-skew-stable`. + +- **Shared mechanism:** every one bears on whether actual RSS stays within the grant. They share the + `resource-accounting` limiter, the aggregate `HashMap`/`context_limit`, and the `stringtheory` + interner. +- **Dominance:** `rss-bounded-under-cardinality` is the **roll-up** — it observes the aggregate + outcome (RSS ≤ grant). The other four explain *why* it does or doesn't hold: + `aggregate-context-limit-enforced` and `interner-full-bounded` are the two designed bounds; + `interner-full-bounded` (heap-on default) and `memory-limiter-survives-rss-read-failure` are the + two leaks that make the roll-up fail. If `rss-bounded` passes, the sub-properties likely hold; if + it fails, the sub-properties localize the cause. Test the roll-up *and* the components. +- **Config coupling:** all are sensitive to the same three default-off protective settings + (memory_mode, allow_context_heap_allocs, disk persistence). A single workload-config matrix drives + the whole cluster. + +## Cluster 2 — Egress data loss & durability + +Properties: `forwarder-eventual-delivery`, `retry-queue-bounded-under-outage`, +`disk-persisted-retry-survives-restart`, plus the egress facet of `shutdown-drains-no-loss`. + +- **Shared code:** all live in `common/datadog/io.rs` + `retry/queue/persisted.rs` (PendingTransactions, + circuit breaker, disk queue). +- **Tension (not dominance):** `forwarder-eventual-delivery` (no loss) and + `retry-queue-bounded-under-outage` (bounded memory ⇒ eventual drop) are in direct tension — the + queue cap is the escape valve that *causes* the loss the delivery property forbids. 
They must be + tested with coordinated outage durations: short outage → delivery holds; prolonged outage → + bounded-drop holds. `retry-queue-bounded` is the safety backstop; `forwarder-eventual-delivery` is + the liveness goal within the backstop's budget. +- **`disk-persisted-retry-survives-restart`** extends `forwarder-eventual-delivery` across a crash; + it shares the kill+restart fault with `aggregate-matches-agent` (restart facet). + +## Cluster 3 — No silent internal loss / routing + +Properties: `no-silent-interconnect-drop`, `source-dispatch-no-misroute`, +`data-component-failure-triggers-process-shutdown`. + +- **Shared mechanism:** the `topology/interconnect` dispatcher and the DSD source's multi-output + dispatch. All three are about events going to the *right place or nowhere silently*. +- **Connection:** `no-silent-interconnect-drop` (backpressure, no discard on wired edges) and + `source-dispatch-no-misroute` (no cross-output leakage) are complementary halves of "events are + routed correctly under load/failure." `data-component-failure-triggers-process-shutdown` is the + backstop: when routing/processing *does* fail, the whole process must stop rather than run half-broken. + +## Cluster 4 — Aggregation correctness + +Properties: `aggregate-matches-agent`, `aggregate-no-panic-any-window`, `aggregate-clock-skew-stable`, +`ddsketch-bin-count-bounded`, `ddsketch-relative-error-bound`, `ddsketch-no-nan-poison`. + +- **Shared code:** `transforms/aggregate/mod.rs` + `lib/ddsketch`. +- **Dominance:** `aggregate-matches-agent` is the **differential roll-up** — any sketch-accuracy, + bin-count, NaN, clock-skew, or bucketing deviation that changes output will show up as a diff + against the Agent (if the Agent doesn't share the same deviation). The sub-properties catch + deviations that are silent in a single happy-path comparison (merge-order accuracy, NaN poison, + internal bin explosion). 
+- **Crash subset:** the forward-jump facet of `aggregate-clock-skew-stable` is the live crash/DoS + property that feeds `data-component-failure-triggers-process-shutdown` (a panicking aggregate is + exactly the "component finishes unexpectedly" trigger). `aggregate-no-panic-any-window` shares the + cluster but its original `% 0` panic vector was **closed upstream** (window is now `NonZeroU64`, + PR #1772) — it remains as a low-cost `Unreachable` regression tripwire, not an active crash bug. +- **NaN crosscut:** `ddsketch-no-nan-poison` shares its boundary with + `non-finite-values-handled-consistently` (Cluster 6) — same NaN, two angles (sketch internals vs. + source gating). + +## Cluster 5 — Lifecycle & config gating + +Properties: `topology-ready-before-intake`, `config-stall-no-deadlock`, +`config-incompatible-refuses-start`, `config-runtime-update-not-revalidated`, +`graceful-shutdown-within-30s`, `data-component-failure-triggers-process-shutdown`. + +- **Shared code:** `bin/agent-data-plane/src/cli/run.rs`, `internal/remote_agent.rs`, + `saluki-core/health`, `topology/running.rs`, `runtime/supervisor.rs`. +- **Lifecycle ordering:** `topology-ready-before-intake` (startup) and `graceful-shutdown-within-30s` + (shutdown) bracket the run; `config-incompatible-refuses-start` and `config-stall-no-deadlock` + gate whether startup proceeds at all. +- **Config-gate pair:** `config-incompatible-refuses-start` (startup gate) and + `config-runtime-update-not-revalidated` (the runtime hole in that same gate) are two halves of one + guarantee — incompatible config is refused. The first holds; the second is the gap. They share + `check_and_warn_config` / `ConfigClassifier` and the `run.rs` callsite. +- **Shutdown pair:** `graceful-shutdown-within-30s` (timing/clean completion) and + `shutdown-drains-no-loss` (Cluster 2; what data survives) are deliberately split views of the same + shutdown event — test together, assert on different things. Neither dominates. 
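
`config-stall-no-deadlock` asks that waiting for the first config snapshot stays bounded even when the sender is held open and silent. The sketch below illustrates the three outcomes such a wait can have, using `std::sync::mpsc` as a synchronous stand-in for the async config stream. `ReadyOutcome` and `wait_for_first_snapshot` are hypothetical names, not ADP APIs, and the real `ready()` is async, so this shows only the shape of the desired invariant.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical result of waiting for the first config snapshot.
#[derive(Debug, PartialEq, Eq)]
enum ReadyOutcome {
    /// A snapshot arrived within the bound.
    Snapshot(String),
    /// The sender is still open but stayed silent for the whole bound.
    TimedOut,
    /// The sender went away before any snapshot arrived.
    StreamClosed,
}

/// Wait for the first snapshot, but never longer than `bound`.
/// An unbounded `recv()` here would block forever on an open, silent sender,
/// which is the stall the property is meant to rule out.
fn wait_for_first_snapshot(rx: &Receiver<String>, bound: Duration) -> ReadyOutcome {
    match rx.recv_timeout(bound) {
        Ok(snapshot) => ReadyOutcome::Snapshot(snapshot),
        Err(RecvTimeoutError::Timeout) => ReadyOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => ReadyOutcome::StreamClosed,
    }
}
```

Distinguishing `TimedOut` from `StreamClosed` matters for the test workload: only the first corresponds to the "sender held open and silent" scenario, while the second is the reconnect path.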
+ +## Cluster 6 — Untrusted input parsing + +Properties: `malformed-dsd-no-crash`, `replay-no-panic-on-malformed-capture`, +`replay-corruption-not-silent-eof`, `non-finite-values-handled-consistently`. + +- **Shared mechanism:** parsing attacker-influenced bytes (`saluki-io` DSD codec; replay reader). +- **No-crash subset:** `malformed-dsd-no-crash` and `replay-no-panic-on-malformed-capture` are the + same property class (untrusted input never crashes) on two different parsers; both feed the + fail-stop backstop in Cluster 5. +- **Fidelity vs. crash:** `replay-corruption-not-silent-eof` is about *silent wrong data* rather than + crash — a distinct failure mode on the same reader as `replay-no-panic-on-malformed-capture`. +- **`non-finite-values-handled-consistently`** bridges to Cluster 4 via the NaN/sketch boundary. + +## Cluster 7 — Concurrency interleavings + +Properties: `interner-reclamation-no-corruption`, plus the timing facets of +`no-silent-interconnect-drop` (multi-sender partial delivery), `forwarder-eventual-delivery` +(shared circuit-breaker state), and `aggregate-context-limit-enforced` (breach flag vs. flush race). + +- **Shared theme:** these are the properties where Antithesis's interleaving search is the *primary* + value (vs. fault injection). `interner-reclamation-no-corruption` is the purest — a loom-tested + unsafe path where Antithesis explores beyond the bounded model. The others are concurrency facets + of properties whose main home is another cluster. + +## Cluster 8 — Transform & enrichment correctness (added after evaluation) + +Properties: `mapper-output-matches-agent`, `mapper-interner-bounded`, `filter-config-reload-correct`, +`tag-filterlist-applied-consistently`, `prefix-filter-ordering-matches-agent`. 
+ +- **Shared code:** the DSD transform chain — `dogstatsd_mapper` → `dogstatsd_prefix_filter` → + `dsd_tag_filterlist` → `dsd_agg` → `dsd_post_agg_filter` (`run.rs:674-679`) — plus the runtime + config watcher (`saluki-config/src/dynamic/watcher.rs`). +- **Differential subset:** `mapper-output-matches-agent`, `prefix-filter-ordering-matches-agent`, and + the optional facet of `tag-filterlist-applied-consistently` all ride the **diff-test add-on** and + are facets of `aggregate-matches-agent` (Cluster 4) at earlier pipeline stages — that roll-up + catches them only if its corpus exercises mapped/filtered metrics (an open question). +- **Config-reload subset:** `filter-config-reload-correct` is the hub — + `tag-filterlist-applied-consistently` (stale cache after a Lagged-dropped reload) and `config-runtime-update-not-revalidated` + (Cluster 5) compose with it. All require the **config-stream add-on topology**; all pass vacuously + in standalone mode. +- **Interner crosscut:** `mapper-interner-bounded` is a second, independent instance of + `interner-full-bounded` (Cluster 1) — a distinct 64 KiB interner with its own silent-drop path. +- **Bug-history crosscut:** `prefix-filter-ordering-matches-agent` targets the "moved prefix/filter in + front of enrich" fix; it is the ordering-regression tripwire for the whole chain. + +## Cluster 9 — Events & service-checks paths (added after evaluation) + +Properties: `events-sc-no-silent-loss`, `malformed-event-sc-no-crash`, `events-sc-pipeline-reachable`. + +- **Shared code:** the always-on `dsd_in.{events,service_checks}` sub-pipelines and their separate + codecs/encoders. These are the events/service-checks analogues of Cluster 3's `no-silent-interconnect-drop`, + Cluster 2's `forwarder-eventual-delivery`, and Cluster 6's `malformed-dsd-no-crash` — same + mechanisms, different always-on edges. 
+- **Anti-vacuity dependency:** `events-sc-pipeline-reachable` is the R4 anchor that keeps the other + two from passing trivially under a metrics-dominated workload — a hard dependency, not just a relation. + +## Shared-scenario pairs (R10 — count is not 35 independent test efforts) + +These pairs share a fault scenario / assertion site and should be implemented together; treat them as +one test effort each for portfolio-sizing: + +- `shutdown-drains-no-loss` ⇄ `graceful-shutdown-within-30s` — same shutdown event, different assertions + (what data survives vs. clean completion in time). +- `non-finite-values-handled-consistently` ⇄ `ddsketch-no-nan-poison` — same NaN, two angles (source + gating vs. sketch-boundary poison). +- `rss-bounded-under-cardinality` ⇄ its four Cluster-1 sub-properties — roll-up vs. localizers. +- `aggregate-matches-agent` ⇄ its Cluster-4 sub-properties and the Cluster-8 differential subset — + roll-up vs. localizers/earlier-stage facets. +- `config-incompatible-refuses-start` ⇄ `config-runtime-update-not-revalidated` — startup gate vs. its + runtime hole. + +## Cross-cutting observations + +- **One config matrix drives many clusters.** The memory_mode / heap-allocs / disk-persistence + settings gate Clusters 1 and 2; the protective-on vs. default-off split is the master test variable. +- **Two roll-up properties** (`rss-bounded-under-cardinality`, `aggregate-matches-agent`) each + dominate a cluster — they're cheap to assert and catch broad regressions, but the sub-properties + localize causes and catch silent-but-output-neutral deviations. Keep both levels. +- **Fail-stop is the shared backstop.** `data-component-failure-triggers-process-shutdown` is + downstream of every crash property (Clusters 4, 6) — when any invariant trips a panic, this is what + must happen next. It belongs to Cluster 5 but connects to all crash properties. 
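
The config matrix named in the first observation can be enumerated mechanically. The sketch below uses hypothetical names (`WorkloadConfig`, `limiter_on`, `heap_spill`, `disk_persistence`) rather than real ADP config keys. It lists all eight combinations of the three settings and marks the one that matches the defaults described in these notes (limiter disabled, heap spill enabled, disk persistence off).

```rust
/// Hypothetical model of the three protective settings that gate Clusters 1 and 2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WorkloadConfig {
    /// Memory limiter enabled (the notes describe the default as disabled).
    limiter_on: bool,
    /// Interner may spill to the heap when full (described as enabled by default).
    heap_spill: bool,
    /// Forwarder retry queue persists to disk (described as off by default).
    disk_persistence: bool,
}

impl WorkloadConfig {
    /// The default combination as characterized in these notes.
    const DEFAULT: WorkloadConfig = WorkloadConfig {
        limiter_on: false,
        heap_spill: true,
        disk_persistence: false,
    };

    /// True for the "fails by design under defaults" row of the matrix.
    fn is_default(&self) -> bool {
        *self == Self::DEFAULT
    }

    /// True for the row where every protective setting is switched on.
    fn is_fully_protective(&self) -> bool {
        self.limiter_on && !self.heap_spill && self.disk_persistence
    }
}

/// Enumerate every combination of the three boolean settings (2^3 = 8 rows).
fn config_matrix() -> Vec<WorkloadConfig> {
    let mut rows = Vec::with_capacity(8);
    for limiter_on in [false, true] {
        for heap_spill in [false, true] {
            for disk_persistence in [false, true] {
                rows.push(WorkloadConfig { limiter_on, heap_spill, disk_persistence });
            }
        }
    }
    rows
}
```

In practice a workload would probably run only the default row and the fully protective row first, since those two correspond to the "fails by design" and "holds" branches; the remaining six rows help localize which single setting is responsible for a difference.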
diff --git a/test/antithesis/scratchbook/sut-analysis.md b/test/antithesis/scratchbook/sut-analysis.md new file mode 100644 index 00000000000..d3d6e761983 --- /dev/null +++ b/test/antithesis/scratchbook/sut-analysis.md @@ -0,0 +1,307 @@ +--- +sut_path: /home/ssm-user/src/saluki +commit: fc4bb29728814ddf9321572b954ec28f58faeb53 +updated: 2026-05-30 +external_references: + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/ + why: ADP Confluence space — headline guarantees, Phase 1 bug bash, gap analyses, weekly summaries. + - path: https://datadoghq.atlassian.net/wiki/spaces/DADP/pages/6497671050/What+Comes+After+DogStatsD + why: Source of the "ADP will not crash under load, losing customer data" guarantee. + - path: https://datadoghq.atlassian.net/browse/DADP + why: ADP Jira project for tracked gaps/incidents (e.g. DADP-108 macOS gaps, DADP-45 telemetry compat). +--- + +# SUT Analysis: Agent Data Plane (ADP) + +> Synthesis of a 5-agent discovery ensemble (12 attention focuses) over `bin/agent-data-plane` +> and the `saluki-*` libraries, grounded against the Datadog ADP Confluence/Jira spaces. +> Each major finding notes the focus(es) that surfaced it. + +## 1. What ADP is (product context) + +**Agent Data Plane (ADP)** is a Rust reimplementation of the Datadog Agent's telemetry data +paths — primarily the **DogStatsD pipeline** — built on the **Saluki** toolkit. It runs +*alongside* the Datadog Core Agent (not standalone in production; `standalone` mode is a +vestige, `AGENTS.md:31`). It is being delivered to first customers for the design +partner, targeting Agent **7.80.0**; `data_plane.enabled: true` routes DogStatsD through ADP. + +Stated priorities (`AGENTS.md:13`): **correctness first, then performance.** Marketed benefit: +**deterministic resource usage** (`docs/agent-data-plane/index.md:1-6`). 
+ +**Headline guarantee (Confluence, "What Comes After DogStatsD"): "ADP will not crash under +load, losing customer data."** This single sentence decomposes into the two property families +that dominate this analysis: *no crash / bounded memory* and *no silent data loss*. + +User-visible failure modes: lost metrics (silent drops), wrong aggregates (counter→rate, sketch +error, bucket misalignment), process crash (panics on hot paths), memory blowup (OOM). + +## 2. Architecture & data flow (Focus 1, 9) + +### Component topology model + +ADP is a **streaming topology** of typed components wired in a blueprint, built on `saluki-core`: + +- **Component kinds:** `source` → `transform`/`relay`/`decoder` → `encoder` → `forwarder`/`destination`. +- **Two data-model channel types** on edges (`topology/interconnect`, `built.rs:507-647`): + `EventsBuffer` (source/decoder/transform → transform/encoder/destination) and + `PayloadsBuffer` (encoder/relay → forwarder/decoder). +- **Build → spawn lifecycle** (`run.rs:238-239`): `TopologyBlueprint` is assembled, then + `build()` then `spawn()`. The topology starts accepting data only after `health_registry.all_ready()`. + +### The DogStatsD pipeline (production-critical path, `run.rs:593-698`) + +Source `dsd_in` has three named outputs — `metrics`, `events`, `service_checks`: + +``` +dsd_in.metrics → dsd_enrich (chained: dogstatsd_mapper) → dsd_prefix_filter + → dsd_tag_filterlist → dsd_agg (aggregate) → dsd_post_agg_filter + → metrics_enrich (host_enrichment + host_tags) → dd_metrics_encode → dd_out +dsd_in.metrics → dsd_stats_out (statistics tap) [+ dsd_debug_log_out if enabled] +dsd_in.events → events_enrich → dd_events_encode → dd_out +dsd_in.service_checks → service_checks_enrich → dd_service_checks_encode → dd_out +``` + +`dd_out` is the **Datadog forwarder** to the Datadog intake platform. Tag filtering happens +both before aggregation (`dsd_tag_filterlist`) and after (`dsd_post_agg_filter`). 
OTLP metrics +deliberately **skip aggregation** to avoid turning counters into rates (`run.rs:751-753`). + +### Listeners & intake (Focus 1, 9) + +DogStatsD source (`sources/dogstatsd/mod.rs`) listens on UDP (8125), TCP, **UDS datagram**, and +**UDS stream**. `SO_REUSEPORT` UDP autoscaling on Linux (`mod.rs:667-686`). DNS resolved via +`tokio::net::lookup_host` (skipped when `non_local_traffic=true`). For connectionless sockets, +framing/I/O errors **clear the buffer and continue** — a malformed packet never kills the socket +(`mod.rs:1283-1318`); per-frame parse errors are logged and skipped. Origin detection uses UDS +peer credentials; credential errors are counted but **do not drop the packet**. + +### Egress: the Datadog forwarder (Focus 8, 9 — high subtlety) + +`TransactionForwarder` (`common/datadog/io.rs`): one main I/O loop fans transactions to **one task +per resolved endpoint** (own bounded `mpsc(8)` channel + own retry queue). Per-endpoint Tower +stack: set URI/API-key → version headers → `concurrency_limit` → **`RetryCircuitBreakerLayer`** → +HTTP client. + +- **Retry model is a circuit breaker, not inline retry.** When the policy says "retry," the + breaker returns `Error::Open`, arms a shared **exponential-with-jitter** backoff, and the request + is re-enqueued to a **low-priority** queue (`io.rs:468-474`). All calls are rejected until backoff + elapses. +- **Retry classification** (`classifier/http.rs`): transport errors + 5xx + most 4xx retry; + **400/401/403/413 are NOT retried** — treated as permanent failure and **dropped** (data loss). +- **Two-tier `PendingTransactions`:** high-priority in-memory `VecDeque` for fresh data, low-priority + `RetryQueue` for retries + overflow; **oldest dropped on overflow** (bias to freshest data), + counted as `track_queue_drops` telemetry — silent loss. 
+- **Disk persistence** (`retry/queue/persisted.rs`): if `forwarder_retry_queue_storage_max_size > 0`, + the retry queue persists to disk; **init failure silently falls back to in-memory only** (durability + downgrade). On shutdown, pending txns flush to disk; **without disk persistence they are dropped**. + Retry-queue IDs are built to survive API-key rotation. + +### Control plane / config (Focus 1, 9) + +ADP registers as a Core Agent "remote agent" (`internal/remote_agent.rs`) and receives an +**authoritative config over a gRPC config stream** (snapshot + partial updates). Startup **blocks** +on `dynamic_config.ready().await` for the first config (`run.rs:119-121`, *no timeout shown*). Stream +end → reconnect after fixed **5s**. ADP exposes Status/Flare/Telemetry gRPC services back to the +Agent. **Note:** the grounding mentions a "METRIC_CONTROL relay," but no `METRIC_CONTROL` identifier +exists in the tree — config control flows through the generic snapshot/partial mechanism (open +question, see §9). + +### Supervision (Focus 1, 3) + +Erlang/OTP-style `Supervisor` (`runtime/supervisor.rs`) with `OneForOne`/`OneForAll` restart +strategies bounded by intensity/period. **Crucial split:** the **internal supervisor** (control +plane, internal telemetry, env/workload) *is* supervised/restartable, but **the primary data +topology is NOT** — `RunningTopology` spawns each data component into a `JoinSet` with **no restart**. +Any data component finishing → `wait_for_unexpected_finish` → **whole-process shutdown** (`run.rs:280-283`, +`running.rs:40-51`). Data-plane components are **fail-stop**; recovery is full-process restart (the s6 +supervisor in the container restarts ADP on exit). Init failures never restart; only runtime failures do. + +## 3. 
State management & persistence (Focus 2) + +- **Aggregation state** (`transforms/aggregate/mod.rs`): a single `HashMap` + owned exclusively by the transform's task — **no locks, single-task ownership**, all mutation `&mut self`. + Hard `context_limit` (default **1,000,000**) enforced at insert: a *new* context over the cap is + **dropped** (existing contexts always merge). This is the central memory-determinism lever. +- **Zero-value counter keep-alive:** flushed counters emit zeros until `counter_expiry_seconds` + (default 300s); kept-alive contexts still count against the limit. +- **Shutdown drop-by-design:** open (current-window) buckets are flushed on shutdown **only if + `flush_open_windows`** (default **false**) — by default in-flight open-bucket data is dropped to + avoid double counting on restart. `PassthroughBatcher` (pre-timestamped metrics) buffers up to + `passthrough_idle_flush_timeout` (default 1s); drops events if its buffer stays full. +- **Restart loses all state:** supervisor restart re-runs `initialize()` from the spec template; a + restarted component starts empty. (Mostly moot for data components, which aren't restarted.) +- **Two disk-backed subsystems only:** the forwarder retry queue (above) and **DogStatsD capture/replay** + (`sources/dogstatsd/replay/`) — capture files written by a dedicated OS thread (protobuf + `UnixDogstatsdMsg` records + `TaggerState` trailer), bounded `sync_channel`, lock-free `ArcSwapOption` + for replay tagger state. Everything else is in-memory. +- **Object pools** (`pooling/`): `FixedSizeObjectPool` (Mutex+Semaphore, async-blocks when empty), + `ElasticObjectPool` (min/max + background EWMA shrinker task — *if the shrinker future isn't driven, + the pool never shrinks*), `OnDemandObjectPool` (allocates every time). +- **String interner** (`stringtheory/interning/fixed_size.rs`): fixed byte buffer, sharded + `[Arc>; SHARD_FACTOR]`, manual reclamation/tombstoning; `try_intern` returns `None` when full. 
+ Interner determinism backs the context memory bound — *but* the resolver's default + `allow_heap_allocations=true` lets full-interner strings **spill to the heap (effectively unbounded)**. + +## 4. Concurrency model (Focus 3) + +- **Backpressure is real and is the load-safety mechanism:** all inter-component edges are bounded + `tokio::mpsc`; `Dispatcher::send` **awaits** on a full channel (`interconnect/dispatcher.rs:86-122`), + so a slow downstream blocks upstream all the way back to the socket read loop. The DSD source calls + `memory_limiter.wait_for_capacity().await` once per read (`mod.rs:1186`). +- **Fan-out hazards:** an output with N senders clones to N-1 and moves into the last, awaiting each + **sequentially** — one slow consumer stalls delivery to the others. A disconnected output (zero + senders) **silently discards** events (`events_discarded_total`). Send is **not atomic** across + senders — partial delivery possible if a later sender errors. +- **Memory limiter** (`resource-accounting/limiter.rs`): a dedicated **std::thread** polls RSS every + **250ms** and stores a backoff in an `AtomicU64`. Backpressure is **advisory/cooperative** (only + tasks that call `wait_for_capacity` are throttled) and **capped at 25ms** of sleep, starting at 95% + of limit. The checker `.expect()`s on the RSS read — **panics the limiter thread if `/proc` reads + fail mid-run**, silently disabling all memory backpressure. +- **Interner reclamation** is loom-tested; the documented hazard (intern vs drop-last-ref race) is + resolved by a mutex + refcount re-check; worst case is a duplicate entry, never corruption. This is + the most concurrency-unsafe component in the bounded-memory story (raw pointers as `'static &str` + keys into a buffer that gets overwritten on reclaim). +- **Health registry** (`health/mod.rs`): single `Arc>`; a single liveness `Runner` task; + per-component probe over `mpsc(1)`, 5s probe timeout. 
A deadlocked/blocked component fails to answer + → marked not-live, but **is not killed or restarted** — "blocked but alive" is an observable degraded + state with no auto-recovery. +- **Config reload:** no in-place hot-reload of aggregate state was found; `live_config` is read once at + endpoint-task init. Config change appears to be reload-by-restart (open question, §9). + +## 5. Safety & liveness guarantees (Focus 4, 5) — candidate properties + +Claimed/observed guarantees, each a property seed (full treatment in `property-catalog.md`): + +**Safety (a bad thing never happens):** +1. Backpressure, never silent inter-component drop (the await-on-full design; tested in DSD UDP path). +2. Bounded memory — static startup bounds verification (`BoundsVerifier::verify`) rejects over-budget + topologies in strict mode. *(But not enforced at runtime — see §7.)* +3. Aggregation output matches the Datadog Agent (counter→rate using bucket width as interval, half-open + `[start, start+width)` buckets — explicitly "to match the Datadog Agent"). +4. DDSketch relative-error guarantee: eps=1/128 (~0.78%), bin_limit=4096, bin count **must never exceed** + 4096; merge associative/commutative. +5. Config incompatibility is fatal at startup (high-severity incompatible non-default key → refuse to run). +6. Graceful shutdown completes within 30s without forceful kill (in-flight data drained). + +**Liveness (a good thing eventually happens):** +1. Aggregate always flushes on its interval (default 15s) regardless of input; final flush on stream close. +2. Every passthrough/pre-timestamped metric is eventually forwarded. +3. Topology starts accepting data only after all components report ready. +4. Intake outage doesn't grow memory unbounded (retry queue caps) — but cap implies eventual drop on + prolonged outage (tension to investigate). +5. After a transient intake outage clears, queued data is eventually delivered. + +## 6. 
Existing test strategy & coverage gaps (Focus 7) — where Antithesis adds value + +- **Unit tests:** dense (saluki-components ~618, saluki-core ~146, saluki-io ~102, ddsketch ~82). + Forwarder I/O and circuit breaker have good targeted tests but all use the `tokio::time` **virtual + clock** — no real interleaving/scheduling exploration. +- **Correctness suite** (`make test-correctness`, `bin/correctness/`): **diff-testing**. `panoramic` + drives an **identical deterministic workload** (`millstone` load generator) into both the baseline + (Datadog Agent) and ADP, captures both via `datadog-intake` (mock intake), normalizes to a shared + `stele` representation, and diffs. Comparison is approximate (`RATIO_ERROR = 1e-8`); internal + telemetry filtered out; fixed `FLUSH_WAIT = 32s` after load. 21 cases, all DSD/OTLP happy-path. +- **Integration suite** (`make test-integration`): real ADP in Docker, process-level assertions only + (`log_contains`, `port_listening`, `http_check`, `process_exits_with`, etc.). 27 cases (startup, + port binding, config-check exit codes, memory-mode exit behavior). Note: container s6 supervisor + **restarts ADP on exit**, so the container never actually exits. + +**Gaps (Antithesis's value):** +1. **No fault injection of any kind** — grep for fault/chaos/partition/kill/crash/inject/toxiproxy/netem + found nothing. The intake is always healthy and reachable. +2. **Intake down / slow / 5xx-storm never tested system-level** — retry queue, circuit breaker, disk + persistence, backpressure only unit-tested against in-process mocks. +3. **No process-crash/restart mid-flow** — disk-persisted retry queue recovery never tested across a + real kill+restart. +4. **No network partition / connection reset / TLS handshake failure** under steady state. +5. **No timing/interleaving exploration** — diff testing is deterministic by design; concurrency bugs + in multi-endpoint fan-out, shared circuit-breaker state, JoinSet handling are invisible to it. +6. 
**DogStatsD replay has zero suite coverage** despite being the newest, largest, untrusted-input feature. +7. **Memory-pressure behavior under real load is untested** beyond boolean exit-code cases. +8. **Config-stream drop/flap** is not tested (only one happy-path case, `adp-config-stream`). + +## 7. Failure & degradation modes + unproven assumptions (Focus 8, 11) — attack surfaces + +The highest-value Antithesis targets. Several are in direct tension with the headline guarantee. + +1. **Memory limiting is DISABLED by default** (`MemoryMode::default() == Disabled`, + `saluki-app/accounting.rs:37-40`) — no bounds validation *and* no runtime limiter unless the operator + sets `memory_mode` + `memory_limit`. cgroup auto-detect only triggers when `DOCKER_DD_AGENT` is set. +2. **The bounded-memory guarantee is a startup assertion, not a runtime invariant** (Wildcard). Static + `BoundsVerifier` sums *declared* firm limits; nothing measures actual allocation. The only runtime + mechanism (`MemoryLimiter`) is advisory, cooperative, ≤25ms backoff, 250ms sampling, and the aggregate + insert hot loop does **not** call `wait_for_capacity` — so the aggregation map + interner grow under + pressure regardless; only the source is throttled. +3. **Firm bound is known-incomplete** (`aggregate/mod.rs:249-256`): a single metric with many distinct + timestamped values isn't accounted for. Combined with default heap-fallback in the interner, declared + bound and real RSS diverge arbitrarily under high-cardinality / many-timestamp load. +4. **Limiter thread panics if RSS becomes unreadable mid-run** (`.expect`, `limiter.rs:101-102`), silently + removing all memory protection. +5. **≤25ms backoff + 250ms sampling may not prevent OOM** under bursty load — directly tests "won't crash + under load." +6. 
**Source dispatch errors are logged and swallowed** (`dogstatsd/mod.rs:1667-1716`): a mid-buffer
+   dispatch failure can mis-route remaining events (eventd/service-check events leaking into the metrics
+   path), and the TODO admits it will "continue to fail to dispatch … until the process is restarted."
+7. **Silent drops, no warning:** aggregate context-limit (one warn/episode), non-finite metric values
+   (`non_finite_metric_values_are_silently_dropped`), interner-full contexts (config-dependent).
+8. **Sub-second aggregate window → panic — FIXED UPSTREAM (PR #1772):** historically
+   `bucket_width_secs = window.as_secs()` with no validation, so a value < 1s yielded `% 0`
+   divide-by-zero and `step_by(0)` panics. The key is now `aggregate_window_duration_seconds`, typed
+   `NonZeroU64`, and `bucket_width_secs` is `NonZeroU64` end-to-end (`aggregate/mod.rs:95-98,822-823`),
+   so zero/sub-second values fail config parsing rather than reaching the divisor. Retained here as a
+   closed wildcard; see `aggregate-no-panic-any-window` (now a regression tripwire).
+9. **Two-clock hazard** (Wildcard): bucketing uses **wall clock** (`get_unix_timestamp`), flush cadence
+   uses **monotonic** `tokio::interval`. A backward wall-clock jump makes the zero-value range empty
+   (silent counter gap); a forward jump floods zero-value points and allocates a large `SmallVec`. No
+   monotonicity guard. Also means a replayed capture buckets differently than when captured (the
+   aggregator ignores per-record timestamps for non-timestamped metrics).
+10. **NaN poisons a DDSketch** (`agent/sketch.rs:188-206`): `sum`/`avg` go NaN permanently; finiteness is
+    guarded per-source (DSD codec), not at the sketch boundary — fragile if a new producer is added.
+11. **All-non-finite packet → ghost metric** with a valid context but zero data points (interner/cache
+    pressure, flows downstream) rather than a dropped packet.
+12. 
**Replay reader treats corruption as clean EOF** (`replay/reader.rs:84-104`): a corrupt/oversized
+    length prefix silently truncates the remaining record stream — false replay-fidelity confidence.
+    ~25 unwrap/expect sites parsing untrusted capture files.
+13. **Core Agent reachability assumed at startup:** ADP blocks indefinitely on `dynamic_config.ready()`
+    with no visible timeout — if the Agent never sends config, ADP never starts the pipeline.
+14. **Hot-path panics:** numerous `.expect("… should always exist")` on default outputs, events/
+    service-check outputs, framing, sketch gamma/offset; `unreachable!("semaphore should never be
+    closed")` in pools; metrics-recorder `panic!`. Each is a crash if its invariant is violated.
+15. **UDP statsd-forward target:** on connect failure, packet forwarding is permanently disabled (no
+    retry); send errors debug-logged and dropped.
+
+## 8. Bug history & churn (Focus 6)
+
+Churn hotspots (last ~300 commits): `cli/run.rs` (wiring), `sources/dogstatsd/mod.rs` (the ~2500-line
+DSD source — biggest, most-changed file), `internal/control_plane.rs`, config-registry alignment files,
+`common/datadog/io.rs` (forwarder), metrics encoder. Notable correctness fixes (good property seeds):
+drop non-finite floats in codec; compensated summation for histograms; unitless histogram counts;
+match agent timestamped-count sampling; **moved DSD prefix/filter in front of enrich** (pipeline
+ordering bug); stabilize additional-endpoint retry-queue IDs (now load-bearing for API-key rotation).
+**Most regression-prone area: DogStatsD replay** (`e88d04b10a`, +1765 LOC, draft, validated only with
+`cargo check`, zero suite coverage, parses untrusted files). 123 TODO/FIXME/HACK markers, clustered in
+saluki-components and saluki-io; safety-relevant ones around dispatch partial-failure, disk-limit
+"bailing out," and undecided malformed-input error policy in the codecs.
+
+## 9. 
Assumptions & open questions
+
+- **METRIC_CONTROL relay naming:** grounding references a Remote Config METRIC_CONTROL relay; no such
+  identifier exists in the source. Config control appears to use the generic snapshot/partial config
+  stream. Confirm naming/mechanism against Confluence ("Tag Filter RC Relay Stress Test").
+- **`interconnect_capacity` default** not yet read — needed to model backpressure precisely.
+- **Config hot-reload semantics:** confirmed no in-place aggregate-state reload; is *any* runtime config
+  change applied without restart? Affects whether config-reload-mid-flight is a real attack surface.
+- **Saturated forwarder retry queue under prolonged outage:** drop vs block tension — confirm exact
+  bound and whether memory stays bounded while data is eventually shed.
+- **`persisted.rs` disk-full / partial-write / corrupt-file across crash** not deeply read (~47
+  unwrap/expect sites) — relevant to the durability/data-loss property family.
+- Discovery was read-only; no builds/tests executed. Test *counts* are annotation greps.
+
+## 10. Implications for property selection & topology
+
+Antithesis is strongest exactly where the existing suite is blind: **degraded/down/slow intake under
+sustained load**, **process kill+restart with disk-persisted retry-queue durability**, **memory overload
+to test the soft-backpressure-only design**, **malformed/corrupt replay capture parsing**, **clock skew
+into aggregation**, and **timing/interleaving** in the forwarder and interner. The natural deployment
+mirrors the correctness harness — ADP + a controllable mock intake + a deterministic load generator —
+but adds Antithesis fault injection (network, process, clock) that the harness lacks. See
+`property-catalog.md` and `deployment-topology.md`.
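
The replay-parsing target named above (a corrupt or oversized length prefix read as clean EOF) can be made concrete with a small sketch of the desired invariant. This is not the saluki reader: the 4-byte little-endian framing, the `MAX_RECORD_LEN` cap, and the names `ReadError`, `read_up_to`, and `read_record` are all assumptions made for illustration. It only shows one way to keep "stream ended on a record boundary" distinct from "stream is damaged", which is the property a corrupt-capture test would assert.

```rust
use std::io::{self, Read};

/// Hypothetical cap on one record's payload size; not taken from the saluki source.
const MAX_RECORD_LEN: u32 = 1 << 20;

/// Illustrative error classes for reading one length-prefixed record.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    /// The stream ended partway through a length prefix or a payload.
    Truncated,
    /// The length prefix exceeds the cap, which suggests corruption.
    Oversized(u32),
    /// Any other I/O failure.
    Io(io::ErrorKind),
}

/// Reads until `buf` is full or the stream ends; returns how many bytes were read.
fn read_up_to<R: Read>(reader: &mut R, buf: &mut [u8]) -> Result<usize, ReadError> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of stream
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(ReadError::Io(e.kind())),
        }
    }
    Ok(filled)
}

/// Reads one record framed as an assumed 4-byte little-endian length plus payload.
/// Returns `Ok(None)` only when the stream ends exactly on a record boundary.
pub fn read_record<R: Read>(reader: &mut R) -> Result<Option<Vec<u8>>, ReadError> {
    let mut prefix = [0u8; 4];
    match read_up_to(reader, &mut prefix)? {
        0 => return Ok(None), // clean EOF: no bytes of a new record were seen
        4 => {}
        _ => return Err(ReadError::Truncated),
    }
    let len = u32::from_le_bytes(prefix);
    if len > MAX_RECORD_LEN {
        // Reject before allocating, so a corrupt prefix cannot force a huge buffer.
        return Err(ReadError::Oversized(len));
    }
    let mut payload = vec![0u8; len as usize];
    if read_up_to(reader, &mut payload)? != payload.len() {
        return Err(ReadError::Truncated);
    }
    Ok(Some(payload))
}
```

Reserving `Ok(None)` for the zero-bytes-at-boundary case means a caller that loops until `None` cannot mistake a damaged capture for a complete one; every other early end surfaces as an error the caller has to handle.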